free tool · no signup · runs in your browser
Context Window Memory Calculator
Pick a model, a quantization, and a context length — see exactly how much VRAM the weights and the KV cache will eat. The KV cache grows linearly with every token you keep in context, and at long contexts it can dwarf the model itself. This calculator shows the full growth curve and tells you which GPUs hold the result.
/// configuration
FP16 (2 bytes/element) is the default. Switch to Q8_0 to halve the KV cache via --cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp.
Each concurrent slot allocates its own KV cache — memory scales with slot count.
/// vram required
Gemma 4 E2B · Q4_K_M · 8K · KV FP16
3.7GB
fits on RTX 3070 (8GB)
smallest single card that holds it
Weights
2.9 GB
78% of total
KV cache
63 MB
2% of total
Overhead
0.8 GB
CUDA/Metal context
Note: 29 of 35 layers use sliding-window attention (window 512), capping their cache below the full context.
/// memory at every context size
Gemma 4 E2B · Q4_K_M · KV FP16 — weights stay fixed, the KV cache grows.
| Context | KV cache | Total VRAM | |
|---|---|---|---|
| 2K | 27 MB | 3.7 GB | |
| 4K | 39 MB | 3.7 GB | |
| 8K | 63 MB | 3.7 GB | ← selected |
| 16K | 0.1 GB | 3.8 GB | |
| 32K | 0.2 GB | 3.9 GB | |
| 64K | 0.4 GB | 4.0 GB | |
| 128K | 0.8 GB | 4.4 GB |
/// what fits
Single-GPU capacity vs the 3.7 GB total above.
Holds it (59)
What is the KV cache?
When a transformer generates text, every token it has already seen produces a set of attention keys and values at every layer. Rather than recompute those for the whole sequence on each new token, the model caches them — this is the KV cache. It is the working memory that lets the model "remember" your prompt and its own output as it streams a response.
The weights of the model are fixed: a 7B model at Q4_K_M is roughly the same size whether your context is 2K or 200K. The KV cache is the part that scales with the conversation. That is why a model that loads fine at 4K can run out of memory the moment a coding agent asks for a 32K context.
Why does it grow linearly with context?
For standard grouped-query attention (GQA), the cache stores one key vector and one value vector per token, per layer, per KV head. Hold the architecture fixed and the only variable is the number of tokens — so memory is directly proportional to context length. The rough formula is:
kv_bytes = 2 (K and V) × layers × kv_heads × head_dim × context × bytes_per_element
Double the context, double the cache. Go from 8K to 128K and the cache is 16× bigger. This tool computes it per-model using each model's real layer count, KV-head count, and head dimension, and accounts for sliding-window layers that cap their cache at the window size.
GQA vs MLA: some models compress the cache
Not every architecture pays the full GQA price. DeepSeek-style models use multi-head latent attention (MLA), which stores a single compressed latent vector per token per layer instead of a full key/value pair for every head. The result is a dramatically smaller KV cache at the same context length — sometimes an order of magnitude less. The calculator uses each model's declared attention type, so an MLA model will show a much flatter growth curve than a comparable GQA model.
The --parallel gotcha
llama.cpp's --parallel flag splits the context window across concurrent request slots. If you set a 32K context with --parallel 4, each slot only gets 8K — but the server still allocates KV cache for all four slots at once. The memory cost multiplies by the number of slots. Set the parallel-slots field above to see this: serving four users at 8K each costs the same KV memory as one user at 32K.
How does this relate to Wide Area Intelligence?
Every node you add to Wide Area Intelligence has its own context window setting(default 4096). Coding agents and long-document workflows typically need 32K, which is exactly where the KV cache starts to matter. Use this calculator to confirm a node's GPU has the headroom before you raise the context window, then route requests to it. The gateway speaks the OpenAI API at https://wideareaai.com/api/v1, and the X-WAI-Node header pins a request to the node you sized here.
/// wide area ai
These numbers are theory. Your GPU is real — put it on the network.
Wide Area Intelligence turns any machine with a GPU into an OpenAI-compatible endpoint — routed, cached, and failed over automatically. Free for 2 nodes.
Start routing — free →