free tool · no signup · runs in your browser

Context Window Memory Calculator

Pick a model, a quantization, and a context length — see exactly how much VRAM the weights and the KV cache will eat. The KV cache grows linearly with every token you keep in context, and at long contexts it can dwarf the model itself. This calculator shows the full growth curve and tells you which GPUs hold the result.

/// configuration

ModelQuantization

Context window — model max 128K

KV cache precision

FP16 (2 bytes/element) is the default. Switch to Q8_0 to halve the KV cache via --cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp.

Parallel slots (--parallel)

Each concurrent slot allocates its own KV cache — memory scales with slot count.

/// vram required

Gemma 4 E2B · Q4_K_M · 8K · KV FP16

3.7GB

fits on RTX 3070 (8GB)

smallest single card that holds it

Weights

2.9 GB

78% of total

KV cache

63 MB

2% of total

Overhead

0.8 GB

CUDA/Metal context

Note: 29 of 35 layers use sliding-window attention (window 512), capping their cache below the full context.

/// memory at every context size

Gemma 4 E2B · Q4_K_M · KV FP16 — weights stay fixed, the KV cache grows.

Context	KV cache	Total VRAM
2K	27 MB	3.7 GB
4K	39 MB	3.7 GB
8K	63 MB	3.7 GB	← selected
16K	0.1 GB	3.8 GB
32K	0.2 GB	3.9 GB
64K	0.4 GB	4.0 GB
128K	0.8 GB	4.4 GB

/// what fits

Discrete single-GPU capacity vs the 3.7 GB total above. CPU, iGPU, and unified-memory system rows are not treated as single cards.

Holds it (38)

RTX 5090 (32GB) · 32GBRTX 5080 (16GB) · 16GBRTX 5070 Ti (16GB) · 16GBRTX 5070 (12GB) · 12GBRTX 5060 Ti (16GB) · 16GBRTX 4090 (24GB) · 24GBRTX 4080 (16GB) · 16GBRTX 4070 Ti Super (16GB) · 16GBRTX 4070 (12GB) · 12GBRTX 4060 Ti (16GB) · 16GBRTX 3090 (24GB) · 24GBRTX 3080 (10GB) · 10GBRTX 3070 (8GB) · 8GBRTX 3060 (12GB) · 12GBRTX 2080 Ti (11GB) · 11GBRTX 2070 Super (8GB) · 8GBGTX 1080 Ti (11GB) · 11GBGTX 1070 Ti (8GB) · 8GBRTX 6000 Ada (48GB) · 48GBNVIDIA L4 (24GB) · 24GBNVIDIA L40S (48GB) · 48GBNVIDIA A100 (40GB) · 40GBNVIDIA A100 (80GB) · 80GBNVIDIA H100 SXM (80GB) · 80GBNVIDIA H200 (141GB) · 141GBNVIDIA B200 (192GB) · 192GBAMD RX 9070 XT (16GB) · 16GBAMD RX 7900 XTX (24GB) · 24GBAMD RX 7900 XT (20GB) · 20GBAMD RX 7800 XT (16GB) · 16GBAMD RX 7600 XT (16GB) · 16GBAMD RX 6900 XT (16GB) · 16GBAMD RX 6800 XT (16GB) · 16GBAMD MI300X (192GB) · 192GBIntel Arc B580 (12GB) · 12GBIntel Arc B570 (10GB) · 10GBIntel Arc A770 (16GB) · 16GBIntel Arc A750 (8GB) · 8GB

What is the KV cache?

When a transformer generates text, every token it has already seen produces a set of attention keys and values at every layer. Rather than recompute those for the whole sequence on each new token, the model caches them — this is the KV cache. It is the working memory that lets the model "remember" your prompt and its own output as it streams a response.

The weights of the model are fixed: a 7B model at Q4_K_M is roughly the same size whether your context is 2K or 200K. The KV cache is the part that scales with the conversation. That is why a model that loads fine at 4K can run out of memory the moment a coding agent asks for a 32K context.

Why does it grow linearly with context?

For standard grouped-query attention (GQA), the cache stores one key vector and one value vector per token, per layer, per KV head. Hold the architecture fixed and the only variable is the number of tokens — so memory is directly proportional to context length. The rough formula is:

kv_bytes = 2 (K and V) × layers × kv_heads × head_dim × context × bytes_per_element

Double the context, double the cache. Go from 8K to 128K and the cache is 16× bigger. This tool computes it per-model using each model's real layer count, KV-head count, and head dimension, and accounts for sliding-window layers that cap their cache at the window size.

GQA vs MLA: some models compress the cache

Not every architecture pays the full GQA price. DeepSeek-style models use multi-head latent attention (MLA), which stores a single compressed latent vector per token per layer instead of a full key/value pair for every head. The result is a dramatically smaller KV cache at the same context length — sometimes an order of magnitude less. The calculator uses each model's declared attention type, so an MLA model will show a much flatter growth curve than a comparable GQA model.

The --parallel gotcha

llama.cpp's --parallel flag splits the context window across concurrent request slots. If you set a 32K context with --parallel 4, each slot only gets 8K — but the server still allocates KV cache for all four slots at once. The memory cost multiplies by the number of slots. Set the parallel-slots field above to see this: serving four users at 8K each costs the same KV memory as one user at 32K.

How does this relate to Wide Area Intelligence?

Every node you add to Wide Area Intelligence has its own context window setting(default 4096). Coding agents and long-document workflows typically need 32K, which is exactly where the KV cache starts to matter. Use this calculator to confirm a node's GPU has the headroom before you raise the context window, then route requests to it. The gateway speaks the OpenAI API at https://wideareaai.com/api/v1, and the X-WAI-Node header pins a request to the node you sized here.

Add a node and set its context window →

Related reading: How much VRAM do you need?. Ready to use that hardware? Turn your GPU into an OpenAI-compatible endpoint — free for 2 nodes.

/// wide area ai

These numbers are theory. Your GPU is real — put it on the network.

Wide Area Intelligence turns any machine with a GPU into an OpenAI-compatible endpoint — routed, cached, and failed over automatically. Free for 2 nodes.

Start routing — free →