"Can I run Llama 70B on my 24GB card?" is the most common question in every local-LLM forum, and almost every answer is wrong — either too optimistic ("sure, just use Q4!") or too cautious ("you need an A100"). The truth is a short piece of arithmetic that nobody bothers to show you. This guide shows the formula, then hands you the actual numbers for the models people actually run.
No rounding in our favor, no "minimum requirements" that assume a 512-token context. If a model needs 22GB and you have a 24GB card, we'll tell you it's tight — because at 32k context it won't fit, and that's exactly when you find out the hard way.
The formula everyone skips
VRAM usage is not one number — it's three, added together. Get any of them wrong and your model either refuses to load or silently spills into system RAM and crawls. Here's the whole thing:
    total_vram = weights + kv_cache + overhead

    weights  = params × bytes_per_param                              # quant decides bytes
    kv_cache ≈ 2 × n_layers × n_kv_heads × head_dim × ctx × bytes
    overhead ≈ 0.5–1.0 GB                                            # CUDA context, buffers
Weights are the model itself: the parameter count times how many bytes each parameter takes. That second part is the quantization. A "7B" model has 7 billion parameters; at full FP16 precision that's 14GB, but nobody serious runs FP16 for inference. The community standard is Q4_K_M (4-bit weights with a few sensitive layers kept at higher precision), which lands around 0.60GB per billion parameters. Memorize that one number and you can size any model in your head.
    # Bytes per parameter by quant
    FP16 / BF16 → 2.00   (full precision, almost never needed)
    Q8_0        → 1.06   (near-lossless, 2× the size of Q4)
    Q6_K        → 0.82
    Q5_K_M      → 0.71
    Q4_K_M      → 0.60   ← the default sweet spot
    Q3_K_M      → 0.48   (noticeable quality loss starts here)

    # Quick estimate: weights_GB ≈ params_B × 0.60 for Q4_K_M
    # A 32B model → 32 × 0.60 ≈ 19 GB of weights
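If you'd rather not do the multiplication in your head, the weights term can be sketched in a few lines of Python. The bytes-per-parameter values are copied from the table above; the name `weights_gb` is only an illustration, and real GGUF files can differ by a few percent depending on the architecture.

```python
# Approximate bytes per parameter for common GGUF quants (values from the table above).
BYTES_PER_PARAM = {
    "FP16": 2.00,
    "Q8_0": 1.06,
    "Q6_K": 0.82,
    "Q5_K_M": 0.71,
    "Q4_K_M": 0.60,
    "Q3_K_M": 0.48,
}


def weights_gb(params_billion: float, quant: str = "Q4_K_M") -> float:
    """Estimate weight memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[quant]


# Print the estimate for a 32B model at every quant level in the table.
for quant in BYTES_PER_PARAM:
    print(f"32B at {quant}: ~{weights_gb(32, quant):.1f} GB")
```

Treat the result as a first-pass estimate of the weights alone; KV cache and overhead still come on top.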
KV cache is the part everyone forgets. As the model generates, it stores a key and value vector for every token in the context window, on every layer. This grows linearly with context length — double your context, double your KV cache. For a 7B model it's tiny at 4k and substantial at 128k. For a 70B model with lots of layers it can be enormous. This is why a model that loads fine for chat falls over the moment a coding agent fills the context.
Overhead is the CUDA/Metal runtime context, compute buffers, and fragmentation. Call it 0.75GB and you won't be far off. It doesn't scale with model size, so it matters most on small cards.
Modern models cut KV cache with Grouped-Query Attention (GQA): instead of one KV head per attention head, several attention heads share one KV head. Llama 3.1, Qwen2.5, and Mistral all use it, which is why their long-context footprints are far smaller than those of older multi-head-attention models like the original Llama 2 7B and 13B.
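To see how much GQA saves, here is a minimal sketch of the KV cache formula from earlier. The layer count, head counts, and head dimension below are assumed values for a hypothetical 7B-class model, not figures read from any particular model card, and the cache is assumed to be stored at FP16 (2 bytes per element).

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9


# Assumed shape for illustration: 28 layers, 28 attention heads, head_dim 128.
CTX = 8192
mha = kv_cache_gb(n_layers=28, n_kv_heads=28, head_dim=128, ctx=CTX)  # one KV head per attention head
gqa = kv_cache_gb(n_layers=28, n_kv_heads=4, head_dim=128, ctx=CTX)   # 7 attention heads share each KV head

print(f"MHA: {mha:.2f} GB, GQA: {gqa:.2f} GB, ratio: {mha / gqa:.0f}x")
```

Under these assumptions the only thing that changes between the two calls is `n_kv_heads`, so the saving is exactly the sharing factor.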
What the popular models actually cost
Here are real footprints at Q4_K_M with an 8k context window: a realistic chat-and-light-agent setup, counting weights plus KV cache plus overhead. "Comfortably fits" means you have a little headroom for a larger context later, not that it barely squeaks in.
| Model | Q4_K_M weights | Total @ 8k ctx | Comfortably fits on |
|---|---|---|---|
| Mistral 7B Instruct | ~4.4 GB | ~5.5 GB | 8GB VRAM / 16GB Mac |
| Llama 3.1 8B | ~4.9 GB | ~6.0 GB | 8GB VRAM / 16GB Mac |
| Qwen2.5 7B | ~4.7 GB | ~5.8 GB | 8GB VRAM / 16GB Mac |
| Qwen2.5-Coder 7B | ~4.7 GB | ~5.8 GB | 8GB VRAM / 16GB Mac |
| Qwen2.5 14B | ~9.0 GB | ~10.3 GB | 12GB VRAM / 32GB Mac |
| Qwen2.5-Coder 14B | ~9.0 GB | ~10.3 GB | 12GB VRAM / 32GB Mac |
| DeepSeek-R1-Distill 14B | ~9.0 GB | ~10.3 GB | 12GB VRAM / 32GB Mac |
| Mixtral 8x7B (MoE) | ~26 GB | ~28 GB | 2× 24GB / 48GB+ Mac |
| Qwen2.5 32B | ~19.5 GB | ~21.5 GB | 24GB VRAM / 32GB+ Mac |
| Qwen2.5-Coder 32B | ~19.5 GB | ~21.5 GB | 24GB VRAM / 32GB+ Mac |
| DeepSeek-R1-Distill 32B | ~19.5 GB | ~21.5 GB | 24GB VRAM / 32GB+ Mac |
| Llama 3.1 70B | ~42 GB | ~45 GB | 2× 24GB / 64GB+ Mac |
| Qwen2.5 72B | ~44 GB | ~47 GB | 2× 24GB / 64GB+ Mac |
Two takeaways. A 32B model at Q4_K_M is the realistic ceiling for a single 24GB consumer card — and it's tight, which the next table explains. And a 70B model genuinely needs two 24GB cards or a large unified-memory Mac; anyone telling you it fits in 24GB is quietly running Q2 or offloading half of it to CPU at single-digit tokens per second.
Mixtral 8x7B is a mixture-of-experts model: it has ~47B total parameters but only activates ~13B per token. You pay the full ~26GB to hold all the experts in memory, but it generates at roughly 13B speed. Great if you have the VRAM; not a way to cheat the weights cost.
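The "comfortably fits" column can be approximated as a simple filter. The sketch below copies a few totals from the table above; the 1.5GB headroom threshold is an assumption chosen for illustration, not a rule stated anywhere in this guide.

```python
# Total footprint at 8k context, in GB, copied from a few rows of the table above.
TOTAL_AT_8K_GB = {
    "Llama 3.1 8B": 6.0,
    "Qwen2.5 14B": 10.3,
    "Qwen2.5 32B": 21.5,
    "Llama 3.1 70B": 45.0,
}


def comfortable_models(vram_gb: float, headroom_gb: float = 1.5) -> list[str]:
    """Return the models whose 8k footprint plus headroom fits in vram_gb."""
    return [name for name, total in TOTAL_AT_8K_GB.items()
            if total + headroom_gb <= vram_gb]


for card in (8, 12, 24, 48):
    print(f"{card} GB: {comfortable_models(card)}")
```

Raising `headroom_gb` is the quickest way to model a larger context window without redoing the KV cache math.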
The context-window multiplier
The 8k numbers above hide the trap. KV cache scales with context, so the same model has very different footprints depending on how much context you give it. Here's a single 7B model (Qwen2.5-style GQA, ~4.7GB of weights) across context lengths:
| Context window | KV cache | Total ≈ | What it means |
|---|---|---|---|
| 4k | ~0.2 GB | ~5.7 GB | Fine for chat, too small for agents |
| 8k | ~0.4 GB | ~5.9 GB | Comfortable general use |
| 32k | ~1.6 GB | ~7.1 GB | Coding agents — the real sweet spot |
| 128k | ~6.4 GB | ~11.9 GB | Whole-repo context, needs 16GB+ |
Notice the 7B model more than doubles its footprint going from 4k to 128k, entirely from KV cache. Now scale that up: a 32B model at 32k context adds roughly 4GB of KV cache on top of its ~19.5GB of weights, which is what pushes it past the edge of a 24GB card. The model loads at 4k and dies at 32k, and the error message won't mention context at all.
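The same arithmetic can be turned around to ask how much context a card can hold. This is a rough sketch using the numbers quoted in this section: about 0.05GB of KV cache per 1k tokens for the 7B model (0.2GB at 4k) and about 0.125GB per 1k for the 32B model (4GB at 32k). The function names and the fixed 0.75GB overhead are assumptions for illustration.

```python
def total_gb(weights_gb: float, kv_gb_per_1k: float, ctx_k: float,
             overhead_gb: float = 0.75) -> float:
    """Total footprint: weights + KV cache (linear in context) + overhead."""
    return weights_gb + kv_gb_per_1k * ctx_k + overhead_gb


def max_ctx_k(vram_gb: float, weights_gb: float, kv_gb_per_1k: float,
              overhead_gb: float = 0.75) -> int:
    """Largest context, in thousands of tokens, that fits in vram_gb (0 if none)."""
    free = vram_gb - weights_gb - overhead_gb
    return max(0, int(free / kv_gb_per_1k))


# 7B model from the table: 4.7 GB weights, ~0.05 GB of KV cache per 1k tokens.
for ctx_k in (4, 8, 32, 128):
    print(f"7B @ {ctx_k}k: ~{total_gb(4.7, 0.05, ctx_k):.1f} GB")

# 32B model on a 24 GB card: 19.5 GB weights, ~0.125 GB of KV cache per 1k tokens.
print(f"32B @ 32k: ~{total_gb(19.5, 0.125, 32):.2f} GB")
print(f"32B max context on 24 GB: ~{max_ctx_k(24, 19.5, 0.125)}k tokens")
```

With these inputs the 32B model runs out of room just short of 32k, which is the failure described above.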
Llama.cpp can quantize the KV cache too (-ctk q8_0 -ctv q8_0), roughly halving its size for a small quality hit. It's the cheapest way to buy back context room on a card that's almost big enough.
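Here is a small sketch of what that buys you, assuming the q8_0 format stores each block of 32 values as 32 one-byte integers plus one two-byte scale (34 bytes per block, about 1.06 bytes per value). The 4GB, 19.5GB, and 0.75GB inputs are the 32B figures used earlier in this guide; the function name is hypothetical.

```python
F16_BYTES = 2.0        # bytes per cached value at FP16
Q8_0_BYTES = 34 / 32   # assumed q8_0 layout: 32 int8 values + one fp16 scale per block


def quantized_kv_gb(kv_f16_gb: float, bytes_per_elem: float = Q8_0_BYTES) -> float:
    """Scale an FP16 KV cache size to a cache stored at bytes_per_elem per value."""
    return kv_f16_gb * bytes_per_elem / F16_BYTES


kv_f16 = 4.0                       # 32B model at 32k context, FP16 cache
kv_q8 = quantized_kv_gb(kv_f16)    # same cache stored as q8_0
total_q8 = 19.5 + kv_q8 + 0.75     # weights + KV cache + overhead

print(f"KV cache: {kv_f16:.2f} GB at FP16 -> {kv_q8:.3f} GB at q8_0")
print(f"32B @ 32k with q8_0 cache: ~{total_q8:.2f} GB")
```

Under those assumptions the total drops back under 24GB, though with little room to spare.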
The GPU shopping guide
Working backwards, here's the best model you can comfortably run at a useful 32k context on each common hardware tier. "Comfortably" is the operative word: these leave room for the KV cache to breathe, unlike spec-sheet claims.
| Hardware | Example | Best model @ 32k ctx | Notes |
|---|---|---|---|
| 8 GB VRAM | RTX 3050, 4060 | 7B–8B Q4_K_M | Drop to 16k for more headroom |
| 12 GB VRAM | RTX 3060, 4070 | 14B Q4_K_M | The value sweet spot |
| 16 GB VRAM | RTX 4060 Ti 16GB, 4080 | 14B Q5_K_M or Q6_K | Comfortable 14B at full ctx; 32B Q4_K_M weights alone (~19.5 GB) exceed 16 GB |
| 24 GB VRAM | RTX 3090, 4090 | 32B Q4_K_M | Tight at 32k — quantize KV |
| 2× 24 GB | Dual 3090/4090 | 70B Q4_K_M | Or 32B at full 128k ctx |
| Mac 16 GB | M-series unified | 7B–8B Q4_K_M | ~12GB usable for the model |
| Mac 32 GB | M-series unified | 14B Q4_K_M | ~24GB usable |
| Mac 64 GB | M-series unified | 32B Q4_K_M | ~48GB usable |
| Mac 128 GB | M-series Max/Ultra | 70B Q4_K_M | ~96GB usable, slower than dual 4090 |
For most people the honest recommendation is a 12GB card for a 14B model or a 24GB card for a 32B model. Those two combinations cover the overwhelming majority of real local workloads — coding, RAG, summarization, structured extraction — without paying for a second GPU or a maxed-out Mac.
Apple Silicon plays by different rules
On a Mac there's no separate VRAM — the GPU shares the same unified memory pool as the OS. That sounds like a free lunch (a 128GB Mac can hold a 70B model an RTX 4090 can't), and partly it is. But two caveats matter.
First, not all of it is usable. macOS caps how much the GPU may allocate — plan on roughly 75% of total memory for the model and KV cache, with the rest reserved for the OS and your apps. A 16GB Mac gives you about 12GB to work with, a 64GB Mac about 48GB. You can raise the limit with sudo sysctl iogpu.wired_limit_mb, but going too high starves the system and triggers swapping.
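The 75% rule is easy to turn into a quick check. This sketch treats the 0.75 fraction as the planning figure from the paragraph above rather than a value queried from macOS, and the function names are hypothetical.

```python
def usable_unified_gb(total_gb: float, gpu_fraction: float = 0.75) -> float:
    """Memory the GPU can plausibly use on a unified-memory Mac (planning estimate)."""
    return total_gb * gpu_fraction


def fits_on_mac(total_gb: float, required_gb: float, gpu_fraction: float = 0.75) -> bool:
    """True if a model needing required_gb fits in the usable share of total_gb."""
    return required_gb <= usable_unified_gb(total_gb, gpu_fraction)


for mac_gb in (16, 32, 64, 128):
    print(f"{mac_gb} GB Mac: ~{usable_unified_gb(mac_gb):.0f} GB usable, "
          f"70B @ 8k (~45 GB) fits: {fits_on_mac(mac_gb, 45)}")
```

If you raise the wired limit, pass a larger `gpu_fraction` to see what the change would unlock.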
Second, memory capacity is not memory bandwidth. A 128GB Mac can hold a 70B model, but generation speed is bound by bandwidth, and even an M-series Ultra trails a pair of 4090s on big models. Apple Silicon is unbeatable for fitting large models cheaply and silently; it's not the fastest tokens-per-dollar at the high end.
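A common back-of-the-envelope way to reason about the bandwidth limit is to assume every generated token has to read all of the weights from memory once. The sketch below uses that simplification, which ignores KV cache reads, compute time, and multi-GPU overhead, so real speeds will be lower. The bandwidth figures are illustrative placeholders, not measurements of any specific chip.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound for a dense model: one full read of the weights per token."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_70B_Q4_GB = 42.0  # Llama 3.1 70B at Q4_K_M, from the table earlier

# Hypothetical bandwidth values; check your own hardware's spec sheet.
for label, bandwidth in (("400 GB/s", 400.0), ("800 GB/s", 800.0), ("1000 GB/s", 1000.0)):
    ceiling = max_tokens_per_second(bandwidth, WEIGHTS_70B_Q4_GB)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s on a 70B Q4_K_M model")
```

The point of the estimate is the ratio: doubling capacity does nothing for speed, while doubling bandwidth roughly doubles the ceiling.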
    # See what fits before you download anything:
    # the Models page filters HF GGUF quants to your node's memory.

    # Or check live VRAM on a running NVIDIA node:
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv

    # On Apple Silicon, the unified pool is shared with macOS:
    # usable ≈ total × 0.75 (the rest is reserved for the OS)
The mistakes that waste VRAM
Running Q8 (or FP16) when Q4_K_M is fine
Forgetting the context window
Trusting "minimum requirements"
Splitting context across parallel slots
With --parallel N for concurrent requests, the context window is divided between slots: each request gets ctx ÷ N tokens. People set a 32k window, run 4 slots, and wonder why each request only gets 8k.
Skip the arithmetic
You now have the formula and the numbers, but you don't have to do the math by hand. We built two tools for exactly this: the VRAM calculator lets you punch in any model, quant, and context length and get the total instantly, and can I run AI? goes the other way — tell it your GPU and it lists every model that comfortably fits.
Inside Wide Area Intelligence the same logic runs automatically. When you deploy a model from the Models page, the dashboard filters the Hugging Face GGUF catalog to the quants that actually fit your node's memory at your chosen context, with no guessing and no failed downloads. The node detail page shows the live memory math as you change the context window, and estimates tokens per second for every model size your hardware can run.
Put the right model on your own GPU
Sizing is the hard part; running it is easy. Add a node with the one-line installer (Cloudflare Tunnel, no port forwarding), deploy a model the dashboard says fits, raise the context window to 32k on the node's detail page, create a wai_sk_… key, and point any OpenAI-compatible tool at https://wideareaai.com/api/v1. The gateway load-balances across your nodes and fails over to the cloud if one drops — so you get the economics of your own hardware without the fragility of a single box.