Infrastructure

Inference at the Edge: Running LLMs on Consumer GPUs

Balys Kriksciunas 7 min read
#ai#infrastructure#edge-ai#on-device#consumer-gpu#ollama#mlx#llama-cpp

Inference at the Edge: Running LLMs on Consumer GPUs

For the first two years of the LLM boom, “AI” meant “call an API.” In 2024 small models started running usefully on laptops. In 2025 that became a product category — privacy-first assistants, offline tools, developer-side IDE features, voice agents with local inference. In 2026 it’s approaching mainstream.

This post covers the infrastructure side of edge LLM inference: what runs where, the runtimes that matter, deployment tradeoffs, and the privacy/product patterns emerging.


What “Edge” Actually Means Here

Three distinct tiers:

  1. High-end consumer desktop — RTX 4090, RTX 5090, Apple M3/M4 Ultra, AMD Ryzen AI Max. Can run 30B–70B quantized models at usable speed.
  2. Mid-range consumer / pro laptops — M3/M4 Pro/Max, RTX 4070 Ti laptops, high-end handhelds. Runs 7B–13B comfortably.
  3. Phones and embedded — Apple A18 Pro, Snapdragon 8 Gen 3+, Tensor G5, custom silicon. Runs 1B–4B specially-designed models.

Each tier has its own runtime ecosystem. The product patterns differ wildly.


The Runtime Ecosystem

For Apple Silicon: MLX

Apple’s MLX framework is the default on M-series chips. Released late 2023, matured fast.

Strongly recommended for any Apple-first product.

For Windows + NVIDIA: llama.cpp + ExecuTorch + CUDA

llama.cpp (Georgi Gerganov’s project) is the community standard for local inference. Runs on CUDA, Vulkan, Metal, CPU. GGUF quantization format ubiquitous.

For production Windows apps with NVIDIA, CUDA direct via PyTorch or TensorRT. ExecuTorch (PyTorch’s edge runtime) is the newer managed path.

For Cross-Platform: Ollama

Ollama wraps llama.cpp with model management and an API. Best DX for “download and run” scenarios. Macs, Windows, Linux. API is OpenAI-compatible.

Default recommendation for any user who wants to run models locally without building infrastructure themselves.

For Web: WebLLM, Transformers.js

Running models in the browser via WebGPU. Transformers.js (HuggingFace) handles small models; WebLLM (CMU) scales to ~13B on capable machines.

Use cases: browser-side PII redaction, offline tools, privacy-first SaaS features.

For Mobile: Core ML, TensorFlow Lite, ExecuTorch, MediaPipe

Mobile inference is dominated by 1B–4B specially-designed models: Apple’s foundation models, Google’s Gemini Nano, Llama 3.2 1B/3B, Phi-5, Qwen3-1.5B, DeepSeek Edge, TinyLlama.


What’s Actually Usable On Each Tier

Tier 1 (high-end desktop/workstation)

Models we routinely run with decent performance (5–20 tokens/sec):

Quality is genuinely near-frontier for most tasks. Privacy use cases (legal, medical, code generation on private repos) are real.

Tier 2 (mid-range consumer)

Fast enough for interactive use, useful quality. This is the sweet spot for desktop/laptop apps.

Tier 3 (phones, tablets, small IoT)

Quality is task-dependent. Great for summarization, classification, structured generation. Less reliable for open-ended chat.


Deployment Architectures

Pattern 1: Fully local

Everything runs on device. No network calls during inference. Privacy-first.

Example: a local note-taking app that runs a 7B model in the background to generate tags and summaries.

Tradeoffs:

Pattern 2: Hybrid (local for privacy, cloud for capability)

Local model handles sensitive operations; cloud handles complex ones. Routing happens on-device.

Example: an enterprise assistant runs PII redaction locally, sends redacted queries to a cloud model for reasoning.

Tradeoffs:

Pattern 3: Local-first with cloud fallback

Default to local; fall back to cloud when local fails (doesn’t know the answer, hits latency budget).

Example: IDE code completion — local model on keystroke, cloud model when user requests “explain” or “refactor.”

Tradeoffs:

Pattern 4: Local for latency, cloud for quality

Not for privacy — purely for user experience. Local streams first tokens fast; cloud model provides the final answer.

Example: voice agent that starts responding immediately from local model while cloud model generates a better answer.

Tradeoffs:


The Quality Gap

Edge models are not API models. For reasoning-heavy tasks, a 7B local model is no match for GPT-4o or Claude Sonnet.

What edge models are competitive at:

What they struggle with:

Product design matters. Good edge AI products scope what’s asked of the local model to its capability band.


The Privacy Product Wedge

The single best reason to go edge in 2026 is privacy. Several product categories emerged specifically around local-first AI:

For B2B in regulated industries, “runs on your laptop, never leaves” is a serious product positioning. Worth real investment.


The Operational Side

Edge inference doesn’t have server-side ops — but it has its own ops problems:

1. Model distribution. 4GB to 40GB of weights have to reach user devices. CDN strategy matters. Delta updates help.

2. Version fragmentation. User on version 3.2 has model v1; user on 3.5 has v2. Both need to work.

3. Hardware capability detection. Choose the right model for the device. Fallback to smaller model if user’s GPU is insufficient.

4. User education. First run needs to download the model. Users need expectations set.

5. Telemetry. You can’t see failures server-side. Opt-in telemetry, error reporting, usage metrics all become harder.

6. Quality measurement. Your eval pipeline needs to run on-device, not just server-side.

Tools that help:


Emerging Patterns

What’s new in 2026:

1. Per-user fine-tunes

A user’s local model is tuned on their data. Private, personalized. Emerging in note-taking and coding tools.

2. Continual learning

Local models adapt based on user interactions. Privacy-preserving by design (data never leaves device).

3. Multimodal edge

Local models handle vision + text in the same pass. Especially strong on Apple Silicon (unified memory).

4. Agent-on-device

Small agents running local tools on the device itself. File search, calendar, notifications — without cloud round-trips.

5. Sub-second voice interaction

Local small models enable genuinely real-time voice agents. End-to-end latency under 500ms achievable.


The Short Version

If your product has a privacy dimension and you’re not considering edge inference, you’re missing a capability the market increasingly expects.


Further Reading

Designing an edge-AI product? Talk to us — we’ve shipped local-first AI across desktop and mobile.

← Back to Blog