How much VRAM do you actually need? The honest LLM sizing guide

Stop guessing whether a 70B model fits your GPU. The real formula is weights + KV cache + overhead — here are the numbers for every popular model, every context length, and every GPU tier, with no marketing rounding.

"Can I run Llama 70B on my 24GB card?" is the most common question in every local-LLM forum, and almost every answer is wrong — either too optimistic ("sure, just use Q4!") or too cautious ("you need an A100"). The truth is a short piece of arithmetic that nobody bothers to show you. This guide shows the formula, then hands you the actual numbers for the models people actually run.

The 20-second version.

No rounding in our favor, no "minimum requirements" that assume a 512-token context. If a model needs 22GB and you have a 24GB card, we'll tell you it's tight — because at 32k context it won't fit, and that's exactly when you find out the hard way.

The formula everyone skips

VRAM usage is not one number — it's three, added together. Get any of them wrong and your model either refuses to load or silently spills into system RAM and crawls. Here's the whole thing:

the formula

total_vram  =  weights  +  kv_cache  +  overhead

weights    = params × bytes_per_param        # quant decides bytes
kv_cache   ≈ 2 × n_layers × n_kv_heads × head_dim × ctx × bytes
overhead   ≈ 0.5–1.0 GB                       # CUDA context, buffers

Weights are the model itself: the parameter count times how many bytes each parameter takes. That second part is the quantization. A "7B" model has 7 billion parameters; at full FP16 precision that's 14GB, but nobody serious runs FP16 for inference. The community standard is Q4_K_M (4-bit weights with a few sensitive layers kept higher), which lands around 0.60GB per billion parameters. Memorize that one number and you can size any model in your head.

bytes per parameter

# Bytes per parameter by quant
FP16 / BF16   → 2.00   (full precision, almost never needed)
Q8_0          → 1.06   (near-lossless, about 1.8× the size of Q4_K_M)
Q6_K          → 0.82
Q5_K_M        → 0.71
Q4_K_M        → 0.60   ← the default sweet spot
Q3_K_M        → 0.48   (noticeable quality loss starts here)

# Quick estimate: weights_GB ≈ params_B × 0.60  for Q4_K_M
# A 32B model → 32 × 0.60 ≈ 19 GB of weights
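The lookup above is easy to turn into a helper. This is a minimal sketch, assuming the approximate bytes-per-parameter figures from the table; the function name `weights_gb` is made up for illustration, and real GGUF files vary a little from model to model.

```python
# Approximate bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP16": 2.00,
    "Q8_0": 1.06,
    "Q6_K": 0.82,
    "Q5_K_M": 0.71,
    "Q4_K_M": 0.60,
    "Q3_K_M": 0.48,
}


def weights_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Estimate weight size in GB: billions of params x bytes per param."""
    return params_billions * BYTES_PER_PARAM[quant]


# Print the estimate for a 32B model at every quant in the table.
for quant in BYTES_PER_PARAM:
    print(f"32B @ {quant:7s} ~ {weights_gb(32, quant):5.1f} GB")
```

Treat the result as a first estimate of the weights only; KV cache and overhead still have to be added on top.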

KV cache is the part everyone forgets. As the model generates, it stores a key and value vector for every token in the context window, on every layer. This grows linearly with context length — double your context, double your KV cache. For a 7B model it's tiny at 4k and substantial at 128k. For a 70B model with lots of layers it can be enormous. This is why a model that loads fine for chat falls over the moment a coding agent fills the context.

Overhead is the CUDA/Metal runtime context, compute buffers, and fragmentation. Call it 0.75GB and you won't be far off. It doesn't scale with model size, so it matters most on small cards.

Modern models cut KV cache with Grouped-Query Attention (GQA): instead of one KV head per attention head, several attention heads share one KV head. Llama 3.1, Qwen2.5, and Mistral all use it, which is why their long-context footprints are far smaller than older models like the original Llama 2.
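To see how much GQA saves, here is a rough sketch of the KV cache term and the full three-part sum. The model shape (28 layers, 28 attention heads, 4 KV heads, head_dim 128) is an assumed 7B-class configuration, not a figure from this guide, and the cache is assumed to be FP16 at 2 bytes per element.

```python
GB = 1e9  # decimal gigabytes, matching the rough figures in this guide


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: one key and one value vector per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / GB


def total_vram_gb(weights_gb: float, kv_gb: float,
                  overhead_gb: float = 0.75) -> float:
    """total_vram = weights + kv_cache + overhead."""
    return weights_gb + kv_gb + overhead_gb


# Assumed 7B-class shape: with GQA, 28 attention heads share 4 KV heads.
gqa = kv_cache_gb(n_layers=28, n_kv_heads=4, head_dim=128, ctx=8192)
# Same shape without GQA: one KV head per attention head.
mha = kv_cache_gb(n_layers=28, n_kv_heads=28, head_dim=128, ctx=8192)

print(f"KV cache @ 8k with GQA: {gqa:.2f} GB")
print(f"KV cache @ 8k without:  {mha:.2f} GB")
print(f"Total with 4.7 GB of weights: {total_vram_gb(4.7, gqa):.1f} GB")
```

Under these assumptions the GQA cache comes to about 0.47GB at 8k, while the same shape with one KV head per attention head would need seven times as much. Swap in the numbers from your model's config file to get its real figure.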

What the popular models actually cost

Here are real footprints at Q4_K_M with an 8k context window: a realistic chat-and-light-agent setup, counting weights plus KV cache plus overhead. "Comfortably fits" means you have a little headroom for a larger context later, not that it barely squeaks in.

Model	Q4_K_M weights	Total @ 8k ctx	Comfortably fits on
Mistral 7B Instruct	~4.4 GB	~5.5 GB	8GB VRAM / 16GB Mac
Llama 3.1 8B	~4.9 GB	~6.0 GB	8GB VRAM / 16GB Mac
Qwen2.5 7B	~4.7 GB	~5.8 GB	8GB VRAM / 16GB Mac
Qwen2.5-Coder 7B	~4.7 GB	~5.8 GB	8GB VRAM / 16GB Mac
Qwen2.5 14B	~9.0 GB	~10.3 GB	12GB VRAM / 32GB Mac
Qwen2.5-Coder 14B	~9.0 GB	~10.3 GB	12GB VRAM / 32GB Mac
DeepSeek-R1-Distill 14B	~9.0 GB	~10.3 GB	12GB VRAM / 32GB Mac
Mixtral 8x7B (MoE)	~26 GB	~28 GB	2× 24GB / 48GB+ Mac
Qwen2.5 32B	~19.5 GB	~21.5 GB	24GB VRAM / 32GB+ Mac
Qwen2.5-Coder 32B	~19.5 GB	~21.5 GB	24GB VRAM / 32GB+ Mac
DeepSeek-R1-Distill 32B	~19.5 GB	~21.5 GB	24GB VRAM / 32GB+ Mac
Llama 3.1 70B	~42 GB	~45 GB	2× 24GB / 64GB+ Mac
Qwen2.5 72B	~44 GB	~47 GB	2× 24GB / 64GB+ Mac

Two takeaways. A 32B model at Q4_K_M is the realistic ceiling for a single 24GB consumer card — and it's tight, which the next table explains. And a 70B model genuinely needs two 24GB cards or a large unified-memory Mac; anyone telling you it fits in 24GB is quietly running Q2 or offloading half of it to CPU at single-digit tokens per second.
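If you'd rather ask "what fits my card?" than read the table row by row, the check is a one-line comparison. The sketch below uses a subset of the 8k totals from the table above; the function name `models_that_fit` and the default 1GB of headroom are illustrative assumptions, not a standard.

```python
# Total footprint at Q4_K_M with 8k context, in GB, copied from the table.
TOTAL_GB_AT_8K = {
    "Mistral 7B Instruct": 5.5,
    "Llama 3.1 8B": 6.0,
    "Qwen2.5 7B": 5.8,
    "Qwen2.5 14B": 10.3,
    "Mixtral 8x7B": 28.0,
    "Qwen2.5 32B": 21.5,
    "Llama 3.1 70B": 45.0,
    "Qwen2.5 72B": 47.0,
}


def models_that_fit(vram_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Return models whose 8k footprint plus headroom fits, largest first."""
    fitting = [
        (total, name)
        for name, total in TOTAL_GB_AT_8K.items()
        if total + headroom_gb <= vram_gb
    ]
    return [name for total, name in sorted(fitting, reverse=True)]


print(models_that_fit(24))  # single 24GB card
print(models_that_fit(48))  # two 24GB cards
```

On a 24GB budget the largest entry that passes is the 32B model, and the 70B only appears once the budget reaches 48GB, which matches the two takeaways above.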

Mixtral 8x7B is a mixture-of-experts model: it has ~47B total parameters but only activates ~13B per token. You pay the full ~26GB to hold all the experts in memory, but it generates at roughly 13B speed. Great if you have the VRAM; not a way to cheat the weights cost.

The context-window multiplier

The 8k numbers above hide the trap. KV cache scales with context, so the same model has very different footprints depending on how much context you give it. Here's a single 7B model (Qwen2.5-style GQA, ~4.7GB of weights) across context lengths:

Context window	KV cache	Total ≈	What it means
4k	~0.2 GB	~5.7 GB	Fine for chat, too small for agents
8k	~0.4 GB	~5.9 GB	Comfortable general use
32k	~1.6 GB	~7.1 GB	Coding agents — the real sweet spot
128k	~6.4 GB	~11.9 GB	Whole-repo context, needs 16GB+

Notice that the 7B model more than doubles its footprint going from 4k to 128k, entirely from KV cache. Now scale that up: a 32B model at 32k context adds roughly 4GB of KV cache on top of its ~19.5GB of weights, which is what pushes it past the edge of a 24GB card. The model loads at 4k and dies at 32k, and the error message won't mention context at all.

llama.cpp can quantize the KV cache too (-ctk q8_0 -ctv q8_0), roughly halving its size for a small quality hit. It's the cheapest way to buy back context room on a card that's almost big enough.
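You can also run the formula backwards and ask how much context a card leaves room for. This sketch plugs in the guide's own 32B figures (about 19.5GB of weights, about 4GB of KV cache at 32k, 0.75GB of overhead) and treats a quantized cache as exactly half the size, which is a simplification of "roughly halving". The function name is made up for illustration.

```python
def max_context_tokens(vram_gb: float, weights_gb: float,
                       kv_gb_per_token: float,
                       overhead_gb: float = 0.75) -> int:
    """Largest context whose KV cache fits in the VRAM left after weights."""
    leftover_gb = vram_gb - weights_gb - overhead_gb
    if leftover_gb <= 0:
        return 0  # the weights and overhead alone do not fit
    return int(leftover_gb / kv_gb_per_token)


# Guide's figure for a 32B model: roughly 4 GB of KV cache at 32k tokens.
KV_GB_PER_TOKEN = 4.0 / 32768

default_cache = max_context_tokens(24, 19.5, KV_GB_PER_TOKEN)
halved_cache = max_context_tokens(24, 19.5, KV_GB_PER_TOKEN * 0.5)

print(f"24GB card, default KV cache: ~{default_cache} tokens")
print(f"24GB card, halved KV cache:  ~{halved_cache} tokens")
```

With these inputs the default cache tops out at 30,720 tokens, just short of 32k, while the halved cache leaves room for about 61,000. That is the whole "loads at 4k, dies at 32k" story in two numbers.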

The GPU shopping guide

Working backwards, here's the best model you can comfortably run at a useful 32k context on each common hardware tier. "Comfortably" is the operative word: these leave room for the KV cache to breathe, unlike spec-sheet claims.

Hardware	Example	Best model @ 32k ctx	Notes
8 GB VRAM	RTX 3050, 4060	7B–8B Q4_K_M	Drop to 16k for more headroom
12 GB VRAM	RTX 3060, 4070	14B Q4_K_M	The value sweet spot
16 GB VRAM	RTX 4060 Ti 16GB, 4080	14B Q5/Q6 or 32B at 8k	Comfortable 14B at full ctx
24 GB VRAM	RTX 3090, 4090	32B Q4_K_M	Tight at 32k — quantize KV
2× 24 GB	Dual 3090/4090	70B Q4_K_M	Or 32B at full 128k ctx
Mac 16 GB	M-series unified	7B–8B Q4_K_M	~12GB usable for the model
Mac 32 GB	M-series unified	14B Q4_K_M	~24GB usable
Mac 64 GB	M-series unified	32B Q4_K_M	~48GB usable
Mac 128 GB	M-series Max/Ultra	70B Q4_K_M	~96GB usable, slower than dual 4090

For most people the honest recommendation is a 12GB card for a 14B model or a 24GB card for a 32B model. Those two combinations cover the overwhelming majority of real local workloads — coding, RAG, summarization, structured extraction — without paying for a second GPU or a maxed-out Mac.

Apple Silicon plays by different rules

On a Mac there's no separate VRAM — the GPU shares the same unified memory pool as the OS. That sounds like a free lunch (a 128GB Mac can hold a 70B model an RTX 4090 can't), and partly it is. But two caveats matter.

First, not all of it is usable. macOS caps how much the GPU may allocate — plan on roughly 75% of total memory for the model and KV cache, with the rest reserved for the OS and your apps. A 16GB Mac gives you about 12GB to work with, a 64GB Mac about 48GB. You can raise the limit with sudo sysctl iogpu.wired_limit_mb, but going too high starves the system and triggers swapping.

Second, memory capacity is not memory bandwidth. A 128GB Mac can hold a 70B model, but generation speed is bound by bandwidth, and even an M-series Ultra trails a pair of 4090s on big models. Apple Silicon is unbeatable for fitting large models cheaply and silently; it's not the fastest tokens-per-dollar at the high end.
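The first caveat is simple enough to script. Here is a small sketch of the 75% rule of thumb; the helper names are hypothetical, and the fraction is the planning figure from this section rather than a value read from macOS.

```python
def usable_unified_gb(total_gb: float, gpu_fraction: float = 0.75) -> float:
    """Memory available to weights + KV cache on a unified-memory Mac."""
    return total_gb * gpu_fraction


def fits_on_mac(total_gb: float, footprint_gb: float,
                gpu_fraction: float = 0.75) -> bool:
    """True if the model's full footprint fits in the usable share."""
    return footprint_gb <= usable_unified_gb(total_gb, gpu_fraction)


for ram in (16, 32, 64, 128):
    print(f"{ram:>3} GB Mac -> ~{usable_unified_gb(ram):.0f} GB usable")

# Footprints at Q4_K_M with 8k context, taken from the model table.
print("70B (~45 GB) on a 64 GB Mac:", fits_on_mac(64, 45.0))
print("32B (~21.5 GB) on a 16 GB Mac:", fits_on_mac(16, 21.5))
```

A pass here only says the model fits; it says nothing about speed, which is the second caveat.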

check what you've actually got

# See what fits before you download anything:
#   the Models page filters HF GGUF quants to your node's memory.
# Or check live VRAM on a running NVIDIA node:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# On Apple Silicon, the unified pool is shared with macOS:
#   usable ≈ total × 0.75   (the rest is reserved for the OS)
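If you want the free VRAM as a number rather than a table to squint at, the CSV from that command is easy to parse. This is a rough sketch: it assumes the output has one header row followed by one "used MiB, total MiB" row per GPU, the sample values are made up, and `query_free_mib` only works on a machine where nvidia-smi is installed.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"]


def parse_free_mib(csv_text: str) -> list[int]:
    """Return free MiB per GPU from the CSV that the command above prints."""
    free = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the header row
        # Each field looks like "1536 MiB"; keep only the number.
        used, total = (int(field.split()[0]) for field in line.split(","))
        free.append(total - used)
    return free


def query_free_mib() -> list[int]:
    """Run nvidia-smi and parse its output; returns [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_free_mib(result.stdout)


# Sample text in the assumed output shape; the values are illustrative.
SAMPLE = "memory.used [MiB], memory.total [MiB]\n1536 MiB, 24576 MiB\n"
print(parse_free_mib(SAMPLE))
```

Divide by 1024 to get GiB, and compare that against the total from the formula before you download anything.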

The mistakes that waste VRAM

Running Q8 (or FP16) when Q4_K_M is fine

This is the big one. Q8_0 is nearly twice the size of Q4_K_M for a quality difference most users can't detect in normal use. A 32B model in Q8 needs ~35GB; in Q4_K_M it needs ~19.5GB and fits a single 24GB card. Start at Q4_K_M and only move up if you can measure a problem on your actual task.

Forgetting the context window

The single most common reason a model "mysteriously" won't load. You sized it for the weights and ignored the KV cache, then set a 32k or 128k context and ran out of room. Always size for the context you intend to use, not the default 4k.

Trusting "minimum requirements"

Published minimums almost always assume a trivial context and zero headroom for overhead and fragmentation. They tell you whether the weights fit, not whether the thing will actually run your workload. Treat them as a floor, not a target.

Splitting context across parallel slots

If your server runs with --parallel N for concurrent requests, the context window is divided between slots — each request gets ctx ÷ N tokens. People set a 32k window, run 4 slots, and wonder why each request only gets 8k.
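The slot arithmetic works in both directions, and the second one is the one that bites. A minimal sketch, assuming the even split described above; the function names are illustrative.

```python
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens each request gets when the window is split across slots."""
    return total_ctx // n_parallel


def total_context_needed(per_slot_ctx: int, n_parallel: int) -> int:
    """Window you must configure so every slot gets per_slot_ctx tokens."""
    return per_slot_ctx * n_parallel


print(per_slot_context(32768, 4))      # what each of 4 slots gets from 32k
print(total_context_needed(32768, 4))  # window needed for 4 slots of 32k
```

Four slots at a full 32k each means configuring a 128k window, so size the KV cache for that total rather than for a single request.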

Skip the arithmetic

You now have the formula and the numbers, but you don't have to do the math by hand. We built two tools for exactly this: the VRAM calculator lets you punch in any model, quant, and context length and get the total instantly, and can I run AI? goes the other way — tell it your GPU and it lists every model that comfortably fits.

Inside Wide Area Intelligence the same logic runs automatically. When you deploy a model from the Models page, the dashboard filters the Hugging Face GGUF catalog to the quants that actually fit your node's memory at your chosen context, so there's no guessing and no failed downloads. The node detail page shows the live memory math as you change the context window, and estimates tokens per second for every model size your hardware can run.

Put the right model on your own GPU

Sizing is the hard part; running it is easy. Add a node with the one-line installer (Cloudflare Tunnel, no port forwarding), deploy a model the dashboard says fits, raise the context window to 32k on the node's detail page, create a wai_sk_… key, and point any OpenAI-compatible tool at https://wideareaai.com/api/v1. The gateway load-balances across your nodes and fails over to the cloud if one drops — so you get the economics of your own hardware without the fragility of a single box.
