free tool · no signup · runs in your browser

LLM VRAM Calculator

How much VRAM does Llama need? Pick a model, a quantization, and a context length — this calculator adds up the weights, the KV cache, and runtime overhead, then lists the GPUs that can hold it all.

model

quantization

4-bit. The most popular balance of size and quality.

context length (model max 128K)

parallel requests (multiplies KV cache)

KV cache precision

Q8_0 KV cache roughly halves context memory with negligible quality impact (llama.cpp: --cache-type-k q8_0 --cache-type-v q8_0).

estimated total vram

6.3 GB

Llama 3.1 8B · Q4_K_M · 8K ctx · FP16 KV

model weights: 4.6 GB

KV cache: 1.0 GB

runtime overhead: 0.8 GB

total: 6.3 GB

gpus that can run this

✓ RTX 3070 (8GB) · NVIDIA · 8 GB VRAM
✓ RTX 2070 Super (8GB) · NVIDIA · 8 GB VRAM
✓ GTX 1070 Ti (8GB) · NVIDIA · 8 GB VRAM
✓ Intel Arc A750 (8GB) · Intel · 8 GB VRAM
✓ RTX 3080 (10GB) · NVIDIA · 10 GB VRAM
✓ Intel Arc B570 (10GB) · Intel · 10 GB VRAM
✓ RTX 2080 Ti (11GB) · NVIDIA · 11 GB VRAM
✓ GTX 1080 Ti (11GB) · NVIDIA · 11 GB VRAM
✓ RTX 5070 (12GB) · NVIDIA · 12 GB VRAM
✓ RTX 4070 (12GB) · NVIDIA · 12 GB VRAM

Unified-memory machines (Apple Silicon, Strix Halo) and CPU rows show usable memory at ~75% of total, since the OS keeps the rest.

How the VRAM estimate is built

Running an LLM locally takes three chunks of memory, and this calculator sums all three:

1. Model weights. This is the dominant cost. It equals parameters × bytes per parameter. A 7B model in FP16 (2 bytes per parameter) is roughly 7e9 × 2 bytes = 14 GB. Quantizing to 4-bit (Q4_K_M, about 0.61 bytes per parameter once you account for the higher-precision embedding and output layers) brings that same model down to a little over 4 GB. That is why quantization is the single biggest lever on whether a model fits.

2. KV cache. Every token the model has seen is cached as key and value vectors so that attention over earlier tokens does not have to be recomputed. The cache grows linearly with context length: 2 × layers × kv_heads × head_dim × context × 2 bytes, where the leading 2 counts keys and values and the trailing 2 bytes is the size of one FP16 element. We use each model's real architecture (grouped-query attention, sliding-window layers, and DeepSeek-style MLA compression are all accounted for) so the number matches what llama.cpp actually allocates. Serving several requests at once means several caches, so the parallel-requests input multiplies this term.

3. Runtime overhead. The CUDA context, the compute graph, and activation scratch buffers cost roughly 0.75 GB on top. We add that as a flat term so the total can be compared directly against a card's VRAM.
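
The three terms above can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's own code: the function names are hypothetical, the architecture values (32 layers, 8 KV heads, head dimension 128, about 8.03 billion parameters) are assumed for Llama 3.1 8B, and only plain grouped-query attention is covered, not sliding-window or MLA models. It also assumes the breakdown is reported in binary gigabytes (GiB), since that is what reproduces the 4.6 / 1.0 / 0.8 / 6.3 figures shown in the example breakdown.

```python
GIB = 1024 ** 3  # assumption: the breakdown uses binary gigabytes
OVERHEAD_BYTES = 0.75 * GIB  # flat runtime overhead from the text


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Term 1: parameters x bytes per parameter."""
    return n_params * bytes_per_param


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, context: int,
                   bytes_per_elem: int = 2, parallel: int = 1) -> int:
    """Term 2: 2 (K and V) x layers x kv_heads x head_dim x context x element size."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem * parallel


def total_bytes(n_params: float, bytes_per_param: float, layers: int,
                kv_heads: int, head_dim: int, context: int,
                parallel: int = 1) -> float:
    """Sum of weights, KV cache, and the flat overhead term."""
    return (weight_bytes(n_params, bytes_per_param)
            + kv_cache_bytes(layers, kv_heads, head_dim, context, parallel=parallel)
            + OVERHEAD_BYTES)


# Assumed Llama 3.1 8B values; Q4_K_M at ~0.61 bytes per parameter, 8K context.
llama_8b = dict(n_params=8.03e9, bytes_per_param=0.61,
                layers=32, kv_heads=8, head_dim=128, context=8192)

weights = weight_bytes(llama_8b["n_params"], llama_8b["bytes_per_param"])
kv = kv_cache_bytes(32, 8, 128, 8192)
total = total_bytes(**llama_8b)

print(f"weights  {weights / GIB:.1f} GiB")
print(f"KV cache {kv / GIB:.1f} GiB")
print(f"total    {total / GIB:.1f} GiB")
```

Under these assumptions the KV term comes out at exactly 1 GiB and the total rounds to 6.3, in line with the breakdown for Llama 3.1 8B at Q4_K_M with an 8K context.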

Why context length matters so much

The weights are fixed once you pick a quantization, but the KV cache is not. Going from a 4K to a 128K context window multiplies the cache by 32. For a coding agent that needs a 32K window, the KV cache can rival the weights themselves. If a model fits at 8K but you want long context, watch the KV line in the breakdown, because that is usually what pushes you over the edge. Quantizing the KV cache to 8-bit, which many runtimes support, roughly halves it.
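
To make the scaling concrete, here is a small sketch that evaluates the KV formula at several context lengths. The default architecture values (32 layers, 8 KV heads, head dimension 128) are assumptions for a Llama 3.1 8B style model, the helper name is hypothetical, and the 8-bit column is idealized at exactly 1 byte per element, ignoring the small per-block scale overhead that real 8-bit formats carry.

```python
def kv_cache_gib(context: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GiB for one request (K and V, hence the leading 2)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024 ** 3


for context in (4096, 8192, 32768, 131072):
    fp16 = kv_cache_gib(context)
    int8 = kv_cache_gib(context, bytes_per_elem=1.0)  # idealized 8-bit cache
    print(f"{context:>7} tokens: FP16 {fp16:6.2f} GiB, 8-bit {int8:6.2f} GiB")
```

With these assumed values the FP16 cache is 0.5 GiB at 4K, 4 GiB at 32K, and 16 GiB at 128K, while the Q4_K_M weights stay at about 4.6 GiB throughout.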

Why Q4_K_M is usually the answer

Q4_K_M is the default people reach for because it cuts the weights to about a third of FP16 while keeping quality nearly intact for chat and coding. Going further — Q3 or 2-bit IQ quants — shows visible degradation and is best reserved for cases where a large model simply will not fit any other way. If a model fits comfortably in Q5 or Q6, take the extra quality; if it is close, Q4_K_M is the sweet spot.

Rule of thumb: weights ≈ 0.6 GB per billion parameters at Q4_K_M. An 8B model is ~5 GB of weights; add a couple of GB of KV cache and overhead and it lives happily on a 12 GB card.
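
The arithmetic behind that rule of thumb can be checked with a short sketch. The two bytes-per-parameter figures come from the text; the function and table names are hypothetical, and sizes are in decimal GB (billions of parameters × bytes per parameter). Other quantizations can be added to the table once you know their effective bytes per parameter.

```python
# Effective bytes per parameter, as quoted in the text.
BYTES_PER_PARAM = {"FP16": 2.0, "Q4_K_M": 0.61}


def weights_gb(params_billion: float, quant: str) -> float:
    """Weight size in decimal GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[quant]


for size in (8, 70):
    fp16 = weights_gb(size, "FP16")
    q4 = weights_gb(size, "Q4_K_M")
    print(f"{size}B: FP16 {fp16:.1f} GB, Q4_K_M {q4:.1f} GB "
          f"({q4 / fp16:.0%} of FP16)")
```

At 0.61 bytes per parameter, an 8B model comes to just under 5 GB and a 70B model to roughly 43 GB, which is why the latter is pointed at a 48 GB card or two 24 GB cards.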

A note on Apple Silicon and unified memory

Apple Silicon Macs (and AMD Strix Halo) share one pool of memory between the CPU and GPU. macOS will not hand the whole thing to a model: the OS, the window server, and your other apps need their share. We treat roughly 75% of total memory as usable for inference, so a 64 GB Mac shows about 48 GB available. That is why the GPU list flags unified and CPU rows with a "usable" figure rather than the headline number.
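
A fit check with the 75% rule might look like the sketch below. The `Device` class and `devices_that_fit` helper are hypothetical names, not part of the calculator; the 8 GB Mac row is an illustrative assumption, and discrete GPUs are assumed to expose their full VRAM, as the GPU list above does.

```python
from dataclasses import dataclass

UNIFIED_USABLE_FRACTION = 0.75  # share of unified memory treated as usable


@dataclass
class Device:
    name: str
    memory_gb: float
    unified: bool = False  # True for unified-memory and CPU rows

    @property
    def usable_gb(self) -> float:
        """Headline memory for discrete GPUs, 75% of it for unified memory."""
        if self.unified:
            return self.memory_gb * UNIFIED_USABLE_FRACTION
        return self.memory_gb


def devices_that_fit(required_gb: float, devices: list[Device]) -> list[Device]:
    """Keep only the devices whose usable memory covers the estimate."""
    return [d for d in devices if d.usable_gb >= required_gb]


devices = [
    Device("RTX 3070", 8),
    Device("64 GB Mac", 64, unified=True),
    Device("8 GB Mac", 8, unified=True),  # illustrative row, not from the list
]

for device in devices_that_fit(6.3, devices):
    print(f"{device.name}: {device.usable_gb:g} GB usable")
```

Here the 8 GB discrete card passes the 6.3 GB estimate, while an 8 GB unified-memory machine does not, because only 6 GB of it counts as usable.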

When the model fits — what next?

Once you know a model fits your GPU, the next question is throughput. A GPU sitting under your desk is wasted between prompts. Wide Area Intelligence turns it into an OpenAI-compatible endpoint: install one line, route requests to it over a Cloudflare Tunnel (no port forwarding), and fail over to the cloud when it is busy. The node detail page lets you set the same context window you tested here, and the Models page deploys any Hugging Face GGUF in one click.

Frequently asked questions

How much VRAM do I need to run an LLM?

Roughly: weights ≈ 0.6 GB per billion parameters at Q4_K_M, plus a KV cache that grows with context length, plus ~0.75 GB runtime overhead. An 8B model at Q4 is about 5 GB of weights and lives happily on a 12 GB card; a 70B model at Q4 needs ~40 GB and wants a 48 GB card or two 24 GB cards. Pick your exact model and context length above for the precise number.

How do I calculate KV cache memory?

KV cache = 2 × layers × kv_heads × head_dim × context_length × 2 bytes, multiplied by the number of concurrent requests. The leading 2 counts keys and values; the trailing 2 bytes is one FP16 element. It grows linearly with context: going from a 4K to a 128K window multiplies the cache by 32. Quantizing the KV cache to 8-bit roughly halves it. This calculator uses each model's real architecture (grouped-query attention, sliding windows, MLA) so the figure matches what llama.cpp actually allocates.

Does quantization reduce VRAM usage?

Yes. Quantization is the single biggest lever on fit. FP16 is 2 bytes per parameter; Q4_K_M is about 0.61 bytes, cutting weights to roughly a third with near-identical quality for chat and coding. Q5/Q6 trade a little memory for more quality; Q3 and 2-bit IQ quants show visible degradation and are best reserved for models that won't fit any other way.

Why does the calculator show less usable memory on a Mac?

Apple Silicon and AMD Strix Halo share one memory pool between CPU and GPU, and macOS won't hand all of it to a model. We treat ~75% of total unified memory as usable for inference, so a 64 GB Mac shows about 48 GB available.

Related reading: How much VRAM do you need? Ready to use that hardware? Turn your GPU into an OpenAI-compatible endpoint, free for 2 nodes.
