
free tool · no signup · runs in your browser

GGUF Quantization Picker

Stop guessing which .gguf file to download. Tell us your GPU, search any GGUF model on Hugging Face, and we read the real file sizes from the repo to tell you which quant fits in your VRAM — and which one gives you the best quality without spilling to the CPU.

Which GGUF should I download?

A single model on Hugging Face is usually published as a dozen different .gguf files. They are all the same model — the only difference is the quantization, i.e. how many bits are used to store each weight. Fewer bits means a smaller file that needs less VRAM, at the cost of some quality. The right file is simply the highest-quality quant that still fits in your GPU's memory with room left over for the KV cache. This tool does that arithmetic for you against the actual file sizes in the repo.

Decoding the GGUF filename

The quant is encoded in the filename, e.g. qwen2.5-coder-7b-instruct-q4_k_m.gguf. Read it like this: the number after Q is the bit-depth (lower = smaller and lower quality), K means the modern "K-quant" method, and the trailing letter is the size tier within that bit-depth: S (small), M (medium), L (large). So Q5_K_M is a 5-bit K-quant, medium variant. IQ labels (e.g. IQ3_M) are "importance" quants that squeeze more quality out of very low bit-depths.

Quant     Bits   Quality                 Use when
Q8_0      8      Practically lossless    VRAM to spare
Q6_K      6      Near-lossless           You want max quality that fits
Q5_K_M    5      Very good               Comfortable headroom
Q4_K_M    4      Good (the sweet spot)   Default choice
Q3_K_M    3      Visible loss            VRAM is tight
IQ2_M     2      Significant loss        Last resort for huge models
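
If you want to read the quant label out of a filename programmatically, a minimal sketch might look like the following. The regular expression and the parse_quant helper are our own illustration, not this tool's actual parser, and they only cover the common label shapes (Q4_K_M, Q6_K, Q8_0, IQ3_M and similar).

```python
import re

# Hypothetical pattern: optional "I", then "Q", one digit, an underscore,
# and a known variant suffix. Case-insensitive because repos use both cases.
QUANT_RE = re.compile(
    r"(?<![a-z0-9])(i?q)(\d)_(k_[sml]|k|[01]|xxs|xs|nl|s|m)(?![a-z0-9])",
    re.IGNORECASE,
)

TIERS = {"S": "small", "M": "medium", "L": "large"}


def parse_quant(filename):
    """Return a dict describing the quant label in a filename, or None."""
    m = QUANT_RE.search(filename)
    if m is None:
        return None
    family = m.group(1).upper()    # "Q" or "IQ"
    variant = m.group(3).upper()   # e.g. "K_M", "K", "0", "M"
    return {
        "label": m.group(0).upper(),
        "bits": int(m.group(2)),
        "k_quant": variant.startswith("K"),
        "importance": family == "IQ",
        # Only a trailing S, M or L is treated as a size tier here.
        "tier": TIERS.get(variant.split("_")[-1]),
    }


info = parse_quant("qwen2.5-coder-7b-instruct-q4_k_m.gguf")
print(info["label"], info["bits"], info["tier"])
```

Filenames that do not follow these conventions return None, so treat a missing result as "unknown quant" rather than an error.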

Why Q4_K_M is the sweet spot

Quality does not fall off linearly as you drop bits. The gap between 8-bit and 4-bit is small and hard to notice in everyday use, while the file is roughly half the size. Below 4-bit the quality starts dropping fast. That is why Q4_K_M is the most-downloaded quant for almost every model — it is the knee of the curve. If Q5_K_M or Q6_K also fits in your VRAM, take it; the extra quality is free. If only Q3_K_M fits, the model is probably too large for your card and you should pick a smaller model instead.
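
The selection rule above can be sketched in a few lines. This is an illustration under assumptions: the quality order is taken from the table on this page, the file sizes are made-up example values rather than real repo data, and room_for_file_gib stands for whatever memory is left for the weights themselves.

```python
# Quality order from best to worst, following the table on this page.
QUALITY_ORDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "IQ2_M"]

# Quants below the 4-bit knee: a hint that a smaller model may serve better.
BELOW_KNEE = {"Q3_K_M", "IQ2_M"}


def pick_quant(file_sizes_gib, room_for_file_gib):
    """Return the highest-quality quant whose file fits, or None."""
    for label in QUALITY_ORDER:
        size = file_sizes_gib.get(label)
        if size is not None and size <= room_for_file_gib:
            return label
    return None


def consider_smaller_model(label):
    """True when nothing fits or only a sub-4-bit quant fits."""
    return label is None or label in BELOW_KNEE


# Hypothetical sizes for one model, in GiB.
sizes = {"Q8_0": 8.1, "Q6_K": 6.3, "Q5_K_M": 5.4, "Q4_K_M": 4.7, "Q3_K_M": 3.8}
choice = pick_quant(sizes, room_for_file_gib=6.0)
print(choice, consider_smaller_model(choice))
```

Walking the list from the top means a larger quant is always preferred when it fits, which matches the "take it; the extra quality is free" advice.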

How the "fits" verdict is calculated

Loading a model needs more than the file itself. We compute need = file size + 0.75 GiB overhead + 1.5 GiB KV reserve. The 0.75 GiB covers the runtime's CUDA/Metal context and compute buffers; the 1.5 GiB reserves space for the KV cache that grows with your context window. We compare that against your usable VRAM (for Apple Silicon and other unified-memory systems we count about 75% of total memory, since the OS keeps the rest). A green "fits fully" means it runs entirely on the GPU. An amber "cpu offload" means it only fits by spilling some layers into system RAM: it works, but generation is much slower. A red "too big" means it will not load at all.
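
As a rough sketch of that arithmetic: the constants below come from the formula above, while the function names and the exact amber rule (need fits in VRAM plus the RAM spill budget) are our assumptions for illustration. All values are treated as GiB.

```python
OVERHEAD_GIB = 0.75             # runtime context and compute buffers
KV_RESERVE_GIB = 1.5            # reserved for the KV cache
UNIFIED_MEMORY_FRACTION = 0.75  # share of unified memory counted as usable


def usable_vram_gib(total_gib, unified=False):
    """Usable GPU memory; unified-memory systems only count about 75%."""
    return total_gib * UNIFIED_MEMORY_FRACTION if unified else total_gib


def verdict(file_size_gib, vram_gib, ram_spill_gib=0.0):
    """Classify a file as 'fits fully', 'cpu offload' or 'too big'."""
    need = file_size_gib + OVERHEAD_GIB + KV_RESERVE_GIB
    if need <= vram_gib:
        return "fits fully"
    # Assumption: offload works when VRAM plus spill RAM covers the need.
    if need <= vram_gib + ram_spill_gib:
        return "cpu offload"
    return "too big"


print(verdict(4.7, vram_gib=24.0, ram_spill_gib=22.4))
print(verdict(30.0, vram_gib=24.0, ram_spill_gib=22.4))
print(verdict(50.0, vram_gib=24.0, ram_spill_gib=22.4))
```

With a 24 GiB card and 22.4 GiB of spill budget, a 4.7 GiB file needs 6.95 GiB and fits fully, a 30 GiB file needs 32.25 GiB and only fits with offload, and a 50 GiB file needs 52.25 GiB and does not fit at all.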

What about sharded models?

Large models are split into multiple files named like ...-00001-of-00003.gguf. These are not separate options — you need every shard of a set, and they load as one model. This tool groups shards back together and sums their sizes so the size and verdict reflect the whole model. Download all shards into the same folder; the runtime stitches them automatically.
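
Grouping shards can be sketched as follows. The group_shards helper and the sample filenames are hypothetical; the pattern assumes the five-digit ...-00001-of-00003.gguf naming shown above, and sizes are plain byte counts.

```python
import re
from collections import defaultdict

# Matches names such as "model-Q4_K_M-00001-of-00003.gguf".
SHARD_RE = re.compile(r"^(?P<base>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def group_shards(files):
    """Map each logical model file to its summed size and shard counts.

    files: dict of filename -> size in bytes.
    """
    groups = defaultdict(lambda: {"size": 0, "found": 0, "expected": 1})
    for name, size in files.items():
        m = SHARD_RE.match(name)
        if m:
            # All shards of one set share the same base name.
            key = m.group("base") + ".gguf"
            groups[key]["expected"] = int(m.group("total"))
        else:
            key = name
        groups[key]["size"] += size
        groups[key]["found"] += 1
    return dict(groups)


files = {
    "big-Q4_K_M-00001-of-00003.gguf": 10,
    "big-Q4_K_M-00002-of-00003.gguf": 10,
    "big-Q4_K_M-00003-of-00003.gguf": 5,
    "small-Q4_K_M.gguf": 7,
}
for name, info in group_shards(files).items():
    complete = info["found"] == info["expected"]
    print(name, info["size"], "complete" if complete else "missing shards")
```

Comparing found against expected is a cheap way to notice an incomplete set before you start a multi-gigabyte download.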

Skip the download dance entirely

Picking the file is the easy part — then you still have to download tens of gigabytes, install GPU drivers, and keep a server running. Wide Area Intelligence deploys any Hugging Face GGUF to a node on your own GPU with one click, gives you an OpenAI-compatible endpoint at https://wideareaai.com/api/v1, and fails over to the cloud when your node is busy. Free for 2 nodes.
