
free tool · no signup · runs in your browser

LLM VRAM Calculator

How much VRAM does Llama need? Pick a model, a quantization, and a context length — this calculator adds up the weights, the KV cache, and runtime overhead, then lists the GPUs that can hold it all.

4-bit. The most popular balance of size and quality.

Q8_0 KV cache roughly halves context memory with little quality impact in most cases (llama.cpp: --cache-type-k q8_0 --cache-type-v q8_0).

estimated total vram

6.3 GB

Llama 3.1 8B · Q4_K_M · 8K ctx · FP16 KV

model weights: 4.6 GB
KV cache: 1.0 GB
runtime overhead: 0.8 GB
total: 6.3 GB

gpus that can run this

  • RTX 3070 (8GB) · NVIDIA · 8 GB VRAM
  • RTX 2070 Super (8GB) · NVIDIA · 8 GB VRAM
  • GTX 1070 Ti (8GB) · NVIDIA · 8 GB VRAM
  • Intel Arc A750 (8GB) · Intel · 8 GB VRAM
  • RTX 3080 (10GB) · NVIDIA · 10 GB VRAM
  • Intel Arc B570 (10GB) · Intel · 10 GB VRAM
  • RTX 2080 Ti (11GB) · NVIDIA · 11 GB VRAM
  • GTX 1080 Ti (11GB) · NVIDIA · 11 GB VRAM
  • RTX 5070 (12GB) · NVIDIA · 12 GB VRAM
  • RTX 4070 (12GB) · NVIDIA · 12 GB VRAM

Unified-memory machines (Apple Silicon, Strix Halo) and CPU rows show usable memory at ~75% of total, since the OS keeps the rest.

How the VRAM estimate is built

Running an LLM locally takes three chunks of memory, and this calculator sums all three:

1. Model weights. This is usually the dominant cost. It equals parameters × bytes-per-parameter. A 7B model in full FP16 precision is roughly 7e9 × 2 bytes = 14 GB. Quantizing to 4-bit (Q4_K_M, about 0.61 bytes per parameter once you account for the higher-precision embedding and output layers) brings that same model down to a little over 4 GB. That is why quantization is the single biggest lever on whether a model fits.

2. KV cache. Every token the model has seen is cached as key and value vectors so it does not have to recompute attention. The cache grows linearly with context length: 2 × layers × kv_heads × head_dim × context × 2 bytes. The leading 2 counts the key and the value tensor; the trailing 2 bytes is the size of one FP16 element. We use each model's real architecture (grouped-query attention, sliding-window layers, and DeepSeek-style MLA compression are all accounted for), so the number should closely match what llama.cpp allocates. Serving several requests at once means several caches, so the parallel-requests input multiplies this term.

3. Runtime overhead. CUDA context, the compute graph, and activation scratch buffers cost roughly 0.75 GB on top. We add that as a flat term so the total reflects what the GPU has to hold at run time, not just the size of the model file.
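The three terms above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function names are made up for this example, the Llama 3.1 8B figures (about 8.03e9 parameters, 32 layers, 8 KV heads, head dimension 128) are entered by hand from the published model config, and dividing by 2^30 is an assumption about the unit behind the page's "GB" figures.

```python
GIB = 1024 ** 3  # assumption: the page's "GB" figures are binary gigabytes

# Term 3: flat runtime overhead, roughly 0.75 GB per the text.
OVERHEAD_BYTES = 0.75 * GIB


def weights_bytes(params: float, bytes_per_param: float) -> float:
    """Term 1: parameters x bytes per parameter."""
    return params * bytes_per_param


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, context: int,
                   bytes_per_elem: float = 2.0, parallel: int = 1) -> float:
    """Term 2: K and V (the leading 2) for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem * parallel


def estimate_gib(params: float, bytes_per_param: float, layers: int,
                 kv_heads: int, head_dim: int, context: int,
                 parallel: int = 1) -> dict:
    """Sum the three terms and report each one in GiB."""
    w = weights_bytes(params, bytes_per_param)
    kv = kv_cache_bytes(layers, kv_heads, head_dim, context, parallel=parallel)
    return {
        "weights": w / GIB,
        "kv_cache": kv / GIB,
        "overhead": OVERHEAD_BYTES / GIB,
        "total": (w + kv + OVERHEAD_BYTES) / GIB,
    }


# Llama 3.1 8B at Q4_K_M (about 0.61 bytes per parameter), 8K context, FP16 KV cache.
llama_31_8b = estimate_gib(params=8.03e9, bytes_per_param=0.61,
                           layers=32, kv_heads=8, head_dim=128, context=8192)
for name, gib in llama_31_8b.items():
    print(f"{name:>9}: {gib:.2f} GiB")
```

Under these assumptions the total comes out near 6.3, in line with the example breakdown at the top of the page. Dense models with plain grouped-query attention are the only case this sketch handles; sliding-window layers and MLA compression would need their own KV formulas.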

Why context length matters so much

The weights are fixed once you pick a quantization, but the KV cache is not. Going from a 4K to a 128K context window multiplies the cache by 32. For a coding agent that needs a 32K window, the KV cache can rival the weights themselves. If a model fits at 8K but you want long context, watch the KV line in the breakdown: that is usually what pushes you over the edge. Many runtimes support quantizing the KV cache to 8-bit, which roughly halves it.
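To make the scaling concrete, here is a small sketch that tabulates the KV cache at several context lengths. The defaults are the Llama 3.1 8B architecture figures (32 layers, 8 KV heads, head dimension 128), entered by hand as an assumption for this example. The Q8_0 size of 34 bytes per block of 32 elements (32 int8 values plus one FP16 scale) is why the saving is "roughly" half rather than exactly half.

```python
GIB = 1024 ** 3

FP16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one FP16 scale per block


def kv_cache_gib(context: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: float = FP16_BYTES) -> float:
    """KV cache size in GiB for a single request."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / GIB


for ctx in (4096, 8192, 32768, 131072):
    fp16 = kv_cache_gib(ctx)
    q8 = kv_cache_gib(ctx, bytes_per_elem=Q8_0_BYTES)
    print(f"{ctx:>7} tokens: FP16 {fp16:6.2f} GiB   Q8_0 {q8:6.2f} GiB")
```

For this architecture the FP16 cache goes from 0.5 GiB at 4K to 16 GiB at 128K, which is the 32-fold growth described above.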

Why Q4_K_M is usually the answer

Q4_K_M is the default people reach for because it cuts the weights to about a third of FP16 while keeping quality nearly intact for chat and coding. Going further — Q3 or 2-bit IQ quants — shows visible degradation and is best reserved for cases where a large model simply will not fit any other way. If a model fits comfortably in Q5 or Q6, take the extra quality; if it is close, Q4_K_M is the sweet spot.

Rule of thumb: weights in GB ≈ parameters in billions × 0.6 at Q4_K_M. An 8B model is ~5 GB of weights; add a couple GB of KV cache and overhead and it lives happily on a 12 GB card.
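The "take the extra quality if it fits" advice can be written as a simple search from the highest-quality quantization downward. This is only a sketch: `pick_quant` is a hypothetical helper, the Q4_K_M figure of 0.61 bytes per parameter comes from the text, and the Q5_K_M and Q6_K figures are approximate values assumed for illustration.

```python
GIB = 1024 ** 3

# (name, approximate bytes per parameter), ordered from highest quality down.
# Q4_K_M is from the text; the other two are rough assumptions.
QUANTS = [("Q6_K", 0.82), ("Q5_K_M", 0.71), ("Q4_K_M", 0.61)]


def pick_quant(params: float, vram_gib: float, kv_gib: float,
               overhead_gib: float = 0.75):
    """Return (name, total_gib) for the best quant that fits, or None if none do."""
    for name, bytes_per_param in QUANTS:
        total = params * bytes_per_param / GIB + kv_gib + overhead_gib
        if total <= vram_gib:
            return name, total
    return None


# An 8B-class model with a 1 GiB KV cache on cards of different sizes.
for vram in (12, 6.5, 6):
    print(vram, pick_quant(8.03e9, vram_gib=vram, kv_gib=1.0))
```

A real decision would also leave some headroom below the card's capacity, since the display and other processes use VRAM too.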

A note on Apple Silicon and unified memory

Apple Silicon Macs (and AMD Strix Halo) share one pool of memory between the CPU and GPU. The operating system will not hand the whole pool to a model: the system itself, the window server, and your other apps need their share. We treat roughly 75% of total memory as usable for inference, so a 64 GB Mac shows about 48 GB available. That is why the GPU list flags unified and CPU rows with a "usable" figure rather than the headline number.
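As a quick illustration of the 75% rule, the check below compares an estimate against usable rather than headline memory. The helper names are hypothetical, and the fraction is this page's working assumption, not a limit enforced by any operating system.

```python
USABLE_FRACTION = 0.75  # the page's assumption for unified-memory and CPU rows


def usable_memory_gb(total_gb: float, unified: bool) -> float:
    """Discrete GPUs expose all their VRAM; unified pools expose about 75%."""
    return total_gb * USABLE_FRACTION if unified else total_gb


def fits(required_gb: float, total_gb: float, unified: bool) -> bool:
    """True when the estimated requirement is within usable memory."""
    return required_gb <= usable_memory_gb(total_gb, unified)


print(usable_memory_gb(64, unified=True))   # 64 GB Mac
print(fits(50, 64, unified=True))           # a 50 GB estimate on that Mac
print(fits(6.3, 8, unified=False))          # the 6.3 GB example on an 8 GB card
```

With these numbers a 50 GB estimate does not fit a 64 GB Mac even though it is below the headline figure.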

When the model fits — what next?

Once you know a model fits your GPU, the next question is throughput. A GPU sitting under your desk is wasted between prompts. Wide Area Intelligence turns it into an OpenAI-compatible endpoint: install one line, route requests to it over a Cloudflare Tunnel (no port forwarding), and fail over to the cloud when it is busy. The node detail page lets you set the same context window you tested here, and the Models page deploys any Hugging Face GGUF in one click.

/// wide area ai

These numbers are theory. Your GPU is real — put it on the network.

Wide Area Intelligence turns any machine with a GPU into an OpenAI-compatible endpoint — routed, cached, and failed over automatically. Free for 2 nodes.

Start routing — free →