free tool · no signup · runs in your browser

GGUF Quantization Picker

Stop guessing which .gguf file to download. Tell us your GPU, search any GGUF model on Hugging Face, and we read the real file sizes from the repo to tell you which quant fits in your VRAM — and which one gives you the best quality without spilling to the CPU.

step 1 — your hardware

gpu / accelerator

vram

24 GB (fixed)

system ram

usable budget

24.0 GB on-GPU + 22.4 GB RAM spill

step 2 — find a model on hugging face

Which GGUF should I download?

A single model on Hugging Face is usually published as a dozen different .gguf files. They are all the same model — the only difference is the quantization, i.e. how many bits are used to store each weight. Fewer bits means a smaller file that needs less VRAM, at the cost of some quality. The right file is simply the highest-quality quant that still fits in your GPU's memory with room left over for the KV cache. This tool does that arithmetic for you against the actual file sizes in the repo.

Decoding the GGUF filename

The quant is encoded in the filename, e.g. qwen2.5-coder-7b-instruct-q4_k_m.gguf. Read it like this: the number after Q is the nominal bit-depth (lower = smaller and lower quality), K means the modern "K-quant" method, and the trailing letter is the size tier within that bit-depth: S (small), M (medium), L (large). So Q5_K_M is a 5-bit K-quant, medium variant. IQ labels (e.g. IQ3_M) are "importance" quants that squeeze more quality out of very low bit-depths.
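The naming rules above can be sketched in a few lines of Python. This is only an illustration of the convention as described here, not the parser this tool uses; parse_quant and QuantLabel are hypothetical names, and suffixes other than S, M, and L are ignored.

```python
import re
from typing import NamedTuple, Optional


class QuantLabel(NamedTuple):
    family: str          # "Q" for classic and K-quants, "IQ" for importance quants
    bits: int            # nominal bit-depth taken from the label
    k_quant: bool        # True when the label contains "_K"
    tier: Optional[str]  # "S", "M", "L", or None when no size tier is present


# Matches a quant label at the end of a .gguf filename, e.g. q4_k_m, Q6_K,
# Q8_0, IQ3_M. An optional shard suffix such as -00001-of-00003 is tolerated.
_QUANT_RE = re.compile(
    r"(?P<family>IQ|Q)(?P<bits>\d)(?P<k>_K)?(?:_(?P<suffix>[A-Z0-9]+))?"
    r"(?:-\d{5}-of-\d{5})?\.gguf$",
    re.IGNORECASE,
)


def parse_quant(filename: str) -> Optional[QuantLabel]:
    """Return the quant label encoded in a GGUF filename, or None if absent."""
    m = _QUANT_RE.search(filename)
    if m is None:
        return None
    suffix = (m.group("suffix") or "").upper()
    return QuantLabel(
        family=m.group("family").upper(),
        bits=int(m.group("bits")),
        k_quant=m.group("k") is not None,
        # Assumption: only S/M/L count as size tiers in this sketch.
        tier=suffix if suffix in {"S", "M", "L"} else None,
    )


print(parse_quant("qwen2.5-coder-7b-instruct-q4_k_m.gguf"))
```

For the example filename, the result has family "Q", 4 bits, the K-quant flag set, and tier "M".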

Quant	Bits	Quality	Use when
Q8_0	8	Practically lossless	VRAM to spare
Q6_K	6	Near-lossless	You want max quality that fits
Q5_K_M	5	Very good	Comfortable headroom
Q4_K_M	4	Good — the sweet spot	Default choice
Q3_K_M	3	Visible loss	VRAM is tight
IQ2_M	2	Significant loss	Last resort for huge models

Why Q4_K_M is the sweet spot

Quality does not fall off linearly as you drop bits. The gap between 8-bit and 4-bit is small and hard to notice in everyday use, while the file is roughly half the size. Below 4-bit the quality starts dropping fast. That is why Q4_K_M is the most-downloaded quant for almost every model — it is the knee of the curve. If Q5_K_M or Q6_K also fits in your VRAM, take it; the extra quality is free. If only Q3_K_M fits, the model is probably too large for your card and you should pick a smaller model instead.
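The "roughly half the size" claim follows from simple arithmetic, sketched below. The function name nominal_size_gb is hypothetical, and the estimate assumes every weight uses exactly the nominal bit-depth; real files are somewhat larger because quantized formats also store scaling data and keep some tensors at higher precision, which is why this tool reads actual file sizes instead.

```python
def nominal_size_gb(params_billion: float, bits: int) -> float:
    """Rough size in GB if every weight used exactly `bits` bits.

    Assumption: ignores per-block scales and mixed-precision tensors,
    so real .gguf files come out somewhat larger than this.
    """
    return params_billion * bits / 8


# A 7B-parameter model at the nominal bit-depths from the table above.
for label, bits in [("Q8_0", 8), ("Q6_K", 6), ("Q5_K_M", 5), ("Q4_K_M", 4)]:
    print(f"{label}: ~{nominal_size_gb(7, bits):.2f} GB")
```

Under this idealized estimate, the 4-bit file is exactly half the 8-bit one (3.5 GB versus 7 GB for a 7B model).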

How the "fits" verdict is calculated

Loading a model needs more than the file itself. We compute need = (file size + 0.75 GiB overhead + 1.5 GiB KV reserve) × 1.08. The 0.75 GiB covers the runtime's CUDA/Metal context and compute buffers, the 1.5 GiB is a starter KV cache reserve, and the × 1.08 factor adds 8% safety headroom because the KV cache grows with your context window. We compare that against your usable VRAM (for Apple Silicon and other unified-memory systems we count about 75% of total memory, since the OS keeps the rest). A green "fits with headroom" means the model should run entirely on the GPU at ordinary contexts, but very long contexts can still need more memory. An amber "cpu offload" means it only fits by spilling some layers into system RAM; it works, but generation is much slower. A red "too big" means it will not load at all.
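The formula above can be written out as a minimal sketch. The constants come from the text; the function names are hypothetical, all sizes are treated as GiB, and the boundary between amber and red (need compared against VRAM plus the RAM spill budget) is an assumption, since the exact rule is not spelled out here.

```python
OVERHEAD_GIB = 0.75     # runtime context and compute buffers
KV_RESERVE_GIB = 1.5    # starter KV cache reserve
SAFETY_FACTOR = 1.08    # 8% headroom on top of everything
UNIFIED_FRACTION = 0.75 # share of unified memory counted as usable


def memory_need_gib(file_size_gib: float) -> float:
    """need = (file size + overhead + KV reserve) x 1.08, in GiB."""
    return (file_size_gib + OVERHEAD_GIB + KV_RESERVE_GIB) * SAFETY_FACTOR


def usable_vram_gib(total_gib: float, unified: bool = False) -> float:
    """Usable GPU memory; unified-memory systems count about 75% of the total."""
    return total_gib * UNIFIED_FRACTION if unified else total_gib


def verdict(file_size_gib: float, vram_gib: float, ram_spill_gib: float = 0.0) -> str:
    """Classify a file as green, amber, or red for the given memory budget."""
    need = memory_need_gib(file_size_gib)
    if need <= vram_gib:
        return "fits with headroom"
    # Assumption: amber when the remainder fits in the RAM spill budget.
    if need <= vram_gib + ram_spill_gib:
        return "cpu offload"
    return "too big"


for size in (4.4, 30.0, 50.0):
    print(size, "GiB ->", verdict(size, vram_gib=24.0, ram_spill_gib=22.4))
```

With the 24 GB card and 22.4 GB spill budget shown in step 1, a 4.4 GiB file needs about 7.2 GiB and lands comfortably in green, while a 30 GiB file needs about 34.8 GiB and only fits with offload.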

What about sharded models?

Large models are split into multiple files named like ...-00001-of-00003.gguf. These are not separate options — you need every shard of a set, and they load as one model. This tool groups shards back together and sums their sizes so the size and verdict reflect the whole model. Download all shards into the same folder; the runtime stitches them automatically.
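Grouping shards can be sketched as follows. This assumes the five-digit -00001-of-00003 suffix shown above and a plain mapping of filename to size in bytes; group_shards is a hypothetical name and not necessarily how this tool does it.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches shard names such as model-Q8_0-00001-of-00003.gguf.
_SHARD_RE = re.compile(r"^(?P<base>.+)-\d{5}-of-\d{5}\.gguf$")


def group_shards(files: Dict[str, int]) -> Dict[str, int]:
    """Merge shards of the same set and sum their sizes in bytes.

    Unsharded files pass through under their own name; sharded sets are
    keyed by the filename with the shard suffix removed.
    """
    totals: Dict[str, int] = defaultdict(int)
    for name, size in files.items():
        m = _SHARD_RE.match(name)
        key = f"{m.group('base')}.gguf" if m else name
        totals[key] += size
    return dict(totals)


repo_files = {
    "model-Q8_0-00001-of-00002.gguf": 10,
    "model-Q8_0-00002-of-00002.gguf": 5,
    "model-Q4_K_M.gguf": 7,
}
print(group_shards(repo_files))
```

The two Q8_0 shards collapse into a single entry whose size is their sum, so the verdict is computed against the whole model rather than one piece of it.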

Skip the download dance entirely

Picking the file is the easy part — then you still have to download tens of gigabytes, install GPU drivers, and keep a server running. Wide Area Intelligence deploys any Hugging Face GGUF to a node on your own GPU with one click, gives you an OpenAI-compatible endpoint at https://wideareaai.com/api/v1, and fails over to the cloud when your node is busy. Free for 2 nodes.

Related reading: GGUF quantization explained.
