You found the model you want. You open its GGUF repository on Hugging Face, click the Files tab, and there are twenty downloads with names like Q4_K_M, IQ3_XXS, and Q5_K_S. They range from 3GB to 15GB. There is no README that says "pick this one," and the file sizes alone don't tell you what you're trading away.
This post decodes the entire naming system, shows how much quality each quantization actually loses (with numbers, not vibes), explains the counterintuitive part — smaller quants run faster — and gives you a default you can stop thinking about. The short version: for almost everyone, on almost everything, the answer is Q4_K_M. The rest of this explains when it isn't.
What quantization actually is
A language model is a giant pile of numbers: the weights. When a model is trained, each weight is usually stored as a 16-bit floating-point number (f16 or bf16). At 2 bytes per weight, 7 billion parameters come to 14GB of pure weights, and models labeled "7B" often carry a few hundred million more, so the file lands around 15GB. That is why the full-precision files are the biggest in the list.
Quantization compresses those weights into fewer bits each — 8, 6, 5, 4, even 2. Instead of storing every weight as a precise 16-bit value, you store it as a low-bit approximation plus a shared scaling factor for a small block of weights. You lose precision. The model's answers drift slightly from what full precision would have produced.
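To make "a low-bit approximation plus a shared scaling factor" concrete, here is a minimal sketch of symmetric 4-bit block quantization. It is an illustration only, not the exact scheme any GGUF quant type uses: the block size of 32, the integer range of -7 to 7, and the function names are all assumptions chosen for clarity.

```python
import random

BLOCK_SIZE = 32  # weights sharing one scale (assumed for illustration)
LEVELS = 7       # symmetric 4-bit range: integer codes in [-7, 7]


def quantize_block(block):
    """Map a block of floats to small integer codes plus one shared scale."""
    max_abs = max(abs(w) for w in block)
    # The scale maps the largest weight in the block onto the largest code.
    scale = max_abs / LEVELS if max_abs > 0 else 1.0
    codes = [max(-LEVELS, min(LEVELS, round(w / scale))) for w in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the integer codes and the scale."""
    return [scale * c for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]  # made-up weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale: {scale:.6f}")
print(f"worst rounding error in this block: {max_error:.6f}")
print(f"error bound (half a step): {scale / 2:.6f}")
```

Every restored weight is within half a quantization step of the original, and that per-weight rounding error is the "drift" described above. If the scale is stored as a 16-bit number, each block of 32 weights costs 32 × 4 + 16 bits, or 4.5 bits per weight on average, which is why real bits-per-weight figures sit a little above the headline number.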
Here is the part that makes quantization practical: the quality loss is real but small, and it stays small until roughly 4 bits. Going from 16-bit to 8-bit is essentially free — the outputs are statistically indistinguishable. From 8 to 4 bits you give up a little, usually not enough to notice in normal use. Below 4 bits the curve bends sharply: 3-bit is noticeably worse, and 2-bit models start producing visibly degraded text. The whole game is sitting right at that knee in the curve, and that knee is at 4 bits.
The naming system, decoded
Every GGUF filename encodes its quantization in a short code. Once you can read it, the twenty-file wall of options collapses into a single axis you can reason about:
Q4_K_M
 │ │ │
 │ │ └─ M = medium variant (S=small, L=large)
 │ └─── K = "k-quant", smarter bit allocation per layer
 └───── 4 = ~4 bits per weight (down from 16)

IQ3_XXS
 ││ │
 ││ └── XXS = extra-extra-small variant
 │└──── 3 = ~3 bits per weight
 └───── IQ = importance-aware quant (better than plain K at low bits)
Three pieces:
The number is the approximate bits per weight. Lower = smaller file, faster, lower quality. This is the dominant factor — a Q4 of anything is in the same rough ballpark as any other Q4.
The _K means it's a k-quant. Older "legacy" quants (like the bare Q4_0) used the same bit width for every weight in the model. K-quants are smarter: they spend more bits on the layers that matter most for output quality (attention, the parts that feed directly into predictions) and fewer on the rest. Same average bit budget, better results. Always prefer a _K quant over a legacy one at the same bit level — the only legacy quant still worth downloading is Q8_0, because at 8 bits the smarter allocation has nothing left to optimize.
The _S / _M / _L is the variant — small, medium, large — within the same bit level. They nudge the bit allocation up or down slightly, trading a few hundred megabytes for a sliver of quality. Q4_K_M is the medium variant of the 4-bit k-quant. When in doubt, take the _M: it's the variant the community treats as the reference.
The IQ series (IQ2_XS, IQ3_XXS, and friends) is the newer importance-aware family. It uses a small calibration dataset to figure out which weights matter and a more sophisticated encoding to squeeze quality out of very low bit counts. IQ quants shine below 4 bits: an IQ3_M can beat a Q3_K_M of the same size. The catch: they're slightly slower to run on some hardware (more compute per weight to decode), and at 4 bits and above the plain k-quants are already good enough that the difference stops mattering.
# A typical GGUF repo. Twenty files. Which one?
Qwen2.5-7B-Instruct-Q2_K.gguf       3.02 GB
Qwen2.5-7B-Instruct-IQ3_XXS.gguf    3.03 GB
Qwen2.5-7B-Instruct-Q3_K_S.gguf     3.49 GB
Qwen2.5-7B-Instruct-Q3_K_M.gguf     3.81 GB
Qwen2.5-7B-Instruct-Q3_K_L.gguf     4.09 GB
Qwen2.5-7B-Instruct-Q4_K_S.gguf     4.46 GB
Qwen2.5-7B-Instruct-Q4_K_M.gguf     4.68 GB   <- this one
Qwen2.5-7B-Instruct-Q5_K_S.gguf     5.32 GB
Qwen2.5-7B-Instruct-Q5_K_M.gguf     5.44 GB
Qwen2.5-7B-Instruct-Q6_K.gguf       6.25 GB
Qwen2.5-7B-Instruct-Q8_0.gguf       8.10 GB
Qwen2.5-7B-Instruct-f16.gguf       15.24 GB
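The three pieces of the code can be pulled out of a filename mechanically. The sketch below is a hypothetical helper, not part of any GGUF tooling: the regular expression only covers the code shapes discussed in this post (Q4_K_M, IQ3_XXS, Q6_K, Q8_0 and similar) and assumes the code sits right before the `.gguf` extension.

```python
import re

# Matches a quant code at the end of a filename, e.g. "-Q4_K_M.gguf".
# Assumed shapes only: Q or IQ, one digit, optional _K/_0/_1, optional variant.
QUANT_RE = re.compile(
    r"[-.](?P<family>IQ|Q)(?P<bits>\d)"
    r"(?:_(?P<kind>K|0|1))?"
    r"(?:_(?P<variant>XXS|XS|S|M|L))?"
    r"\.gguf$"
)


def parse_quant(filename):
    """Return the quant family, nominal bits, and variant, or None."""
    m = QUANT_RE.search(filename)
    if m is None:
        return None  # e.g. the f16 file, which carries no quant code
    if m["family"] == "IQ":
        family = "importance-aware"
    elif m["kind"] == "K":
        family = "k-quant"
    else:
        family = "legacy"
    return {"family": family, "bits": int(m["bits"]), "variant": m["variant"]}


for name in (
    "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "Qwen2.5-7B-Instruct-IQ3_XXS.gguf",
    "Qwen2.5-7B-Instruct-Q8_0.gguf",
    "Qwen2.5-7B-Instruct-f16.gguf",
):
    print(name, "->", parse_quant(name))
```

The `bits` value is the nominal number from the name; the effective bits per weight are slightly higher once scales and mixed allocations are counted.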
What each quant actually costs (7B model)
Quality is usually measured with perplexity: roughly, how "surprised" the model is by real text. Lower is better, and the meaningful figure is how much higher a quant's perplexity is than the full-precision original. A delta under ~1% is imperceptible; a few percent is the point where careful side-by-side testing can spot it; 10%+ is degradation you'll feel in normal use. Sizes below are for a 7B model, which is the most common case:
| quant | bits/wt | 7B size | quality vs f16 | use it for |
|---|---|---|---|---|
| Q8_0 | 8.5 | ~8.1 GB | indistinguishable (<0.1%) | archival / paranoia |
| Q6_K | 6.6 | ~6.3 GB | transparent (~0.1%) | when you have spare VRAM |
| Q5_K_M | 5.7 | ~5.4 GB | excellent (~0.3%) | quality-sensitive work |
| Q4_K_M | 4.8 | ~4.7 GB | very good (~0.8%) | the default — start here |
| Q4_K_S | 4.5 | ~4.5 GB | good (~1.2%) | shave VRAM off Q4_K_M |
| Q3_K_M | 3.9 | ~3.8 GB | noticeable (~3–4%) | tight memory, big model |
| IQ3_XXS | 3.1 | ~3.0 GB | degraded but coherent (~6%) | last resort, IQ > K here |
| Q2_K | 2.6 | ~3.0 GB | clearly degraded (~10%+) | emergencies only |
Read the table top to bottom and notice where the "quality vs f16" column stops being a rounding error: between Q4_K_M and Q3_K_M. That jump — from under 1% to several percent — is the single most important fact about GGUF quantization. Q4_K_M is the last quant that's boring, and boring is exactly what you want.
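The percentages in the quality column are relative perplexity increases over f16. Here is a small sketch of that arithmetic, with the rough thresholds from above turned into labels. The perplexity values are made up for illustration, the function names are hypothetical, and the hard cut-offs simplify what is really a gradual scale.

```python
def perplexity_delta_pct(ppl_quant, ppl_f16):
    """Relative perplexity increase of a quant over the f16 baseline, in percent."""
    return (ppl_quant - ppl_f16) / ppl_f16 * 100.0


def classify(delta_pct):
    """Bucket a delta using the rough thresholds described in this post."""
    if delta_pct < 1.0:
        return "not noticeable"
    if delta_pct < 10.0:
        return "visible in side-by-side testing"
    return "noticeable in normal use"


# Hypothetical measurements: f16 baseline and one quant of the same model.
baseline = 7.000
quantized = 7.056
delta = perplexity_delta_pct(quantized, baseline)
print(f"{delta:.2f}% -> {classify(delta)}")
```

Only the delta is meaningful. Absolute perplexity depends on the model and the test text, so numbers from different models or datasets can't be compared directly.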
When to go higher than Q4_K_M
Q4_K_M is the default, not a law. Two situations genuinely justify spending the extra gigabytes on Q5_K_M or Q6_K:
Precision-sensitive tasks degrade faster. Chat and summarization are forgiving: a slightly different word choice is still a fine answer. Code and math are not. A model that's 99% as good at prose can be meaningfully worse at producing a syntactically valid function or carrying a long arithmetic chain, because there's often exactly one right token and quantization noise is more likely to knock it off. If you're running a coding or math model and you have the VRAM, Q5_K_M is a defensible upgrade. And if you're choosing between a bigger model at Q4 and a smaller one at Q8, the bigger model at Q4 almost always wins.
Small models suffer more than big ones. Quantization error is roughly a fixed percentage of each weight, but big models have more redundancy to absorb it. A 70B model at Q4_K_M is barely scratched; a 1.5B or 3B model at the same quant loses proportionally more, because every weight is carrying more of the load. For models under ~7B, bumping to Q5_K_M or Q6_K buys back noticeably more quality than it does on a 32B. The same Q4_K_M that's overkill-safe on a 32B model is the floor on a 1.5B.
Going the other direction: dropping below Q4_K_M is only worth it when a model you want genuinely will not fit otherwise. A 32B model at Q3_K_M on a 24GB card usually still beats a 14B at Q5_K_M, because more parameters, even crushed, tend to win. But Q2 is a different country. Treat it as "I need this enormous model on this small card for one experiment," not a daily driver.
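The "big model, lower quant" trade is easy to check with arithmetic: weight size is roughly parameters times bits per weight, divided by 8. This is a back-of-the-envelope sketch; it takes "32B" and "14B" as exact parameter counts (real models differ a little) and uses the bits/wt figures from the table above.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


big_low = weight_size_gb(32, 3.9)     # 32B model at Q3_K_M
small_high = weight_size_gb(14, 5.7)  # 14B model at Q5_K_M

print(f"32B at Q3_K_M: ~{big_low:.1f} GB of weights")
print(f"14B at Q5_K_M: ~{small_high:.1f} GB of weights")
```

Both sets of weights come in under 24GB on paper, but the 32B leaves much less headroom for the context window, so its usable context will be shorter.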
The counterintuitive part: smaller is faster
New users assume the bigger, higher-quality quant is the slow one. It's the opposite. Token generation on a GPU is memory-bandwidth bound: to produce each token, the hardware reads essentially every weight in the model out of memory. Fewer bits per weight means fewer bytes to move per token, which means more tokens per second. A Q4 model isn't just smaller than a Q8: it's roughly one and a half times as fast, on the same hardware, for the same model.
Rough single-stream generation speed for a 7B model on a 24GB GPU (RTX 4090-class), fully offloaded to VRAM:
| quant | 7B size | tok/s (approx) | vs Q8_0 |
|---|---|---|---|
| Q8_0 | ~8.1 GB | ~95 tok/s | baseline |
| Q6_K | ~6.3 GB | ~115 tok/s | +21% |
| Q5_K_M | ~5.4 GB | ~130 tok/s | +37% |
| Q4_K_M | ~4.7 GB | ~150 tok/s | +58% |
| Q3_K_M | ~3.8 GB | ~175 tok/s | +84% |
Exact numbers vary with hardware, driver, and context length, but the shape is universal: the lighter the quant, the faster it generates. This is the second reason Q4_K_M is the sweet spot. You're not just choosing it because it fits; you're choosing it because it's fast and the quality cost is under 1%. Q8_0 buys you quality you can't perceive at a speed penalty you definitely can.
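The bandwidth argument gives a simple ceiling: if every token requires reading all the weights once, tokens per second can't exceed memory bandwidth divided by model size. The sketch below assumes a round 1,000 GB/s for a 4090-class card and uses the sizes from the table. It ignores compute, KV cache reads, and kernel overhead, so real speeds land below it.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed when each token reads every weight once."""
    return bandwidth_gb_s / model_size_gb


BANDWIDTH_GB_S = 1000.0  # assumed round figure for a 4090-class GPU

sizes_gb = {"Q8_0": 8.1, "Q6_K": 6.3, "Q5_K_M": 5.4, "Q4_K_M": 4.7, "Q3_K_M": 3.8}
for quant, size in sizes_gb.items():
    ceiling = max_tokens_per_second(BANDWIDTH_GB_S, size)
    print(f"{quant}: at most ~{ceiling:.0f} tok/s")

# Bandwidth cancels out of the ratio, so the relative speedup depends only on size.
speedup = max_tokens_per_second(BANDWIDTH_GB_S, 4.7) / max_tokens_per_second(BANDWIDTH_GB_S, 8.1)
print(f"Q4_K_M vs Q8_0 ceiling ratio: {speedup:.2f}x")
```

The measured numbers in the table sit below these ceilings, and the measured Q4_K_M gain over Q8_0 is smaller than the size ratio predicts, which is what you'd expect once fixed per-token overheads are included.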
The exception that proves the rule: if a quant is too big to fit in VRAM and spills into system RAM, speed falls off a cliff — CPU memory bandwidth is an order of magnitude slower than GPU. A Q4_K_M that fits entirely in VRAM will crush a Q6_K that doesn't. Fitting in VRAM matters far more than the quant level.
How Wide Area Intelligence picks the quant for you
That last point, fitting in VRAM, is exactly the calculation people get wrong, because the weights are only part of it. You also need room for the KV cache (which grows with your context window) and a little overhead. Eyeballing "the 4.7GB file fits my 8GB card" ignores the 1–2GB the context window will eat.
Wide Area Intelligence does this math for you. When you open the Models page and search a model, the dashboard already knows each of your nodes' memory, so for every quant it shows whether it fits, fits with room for a big context window, or won't fit at all. It also defaults the deploy button to the best quant your hardware can actually run well, which is usually Q4_K_M. You click [ deploy ] and skip the entire twenty-file guessing game.
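As a rough sketch of that calculation: the KV cache stores one key and one value vector per layer for every token in the context. The model shape below (28 layers, 4 KV heads, head dimension 128) is an assumed example of a 7B-class model with grouped-query attention, the 1GB overhead is a guess, and the function names are hypothetical. This is not how any particular runtime or dashboard computes it.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GB for a given context length (f16 by default)."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def fits(weights_gb, kv_gb, vram_gb, overhead_gb=1.0):
    """True if weights, KV cache, and an assumed overhead fit in VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


# Assumed 7B-class shape with a 32k-token context window.
kv = kv_cache_gb(n_layers=28, n_kv_heads=4, head_dim=128, context_tokens=32768)
print(f"KV cache: ~{kv:.2f} GB")
print("Q4_K_M (4.7 GB) on an 8 GB card:", fits(4.7, kv, 8.0))
print("Q5_K_M (5.4 GB) on an 8 GB card:", fits(5.4, kv, 8.0))
```

With these assumptions the cache lands inside the 1–2GB range mentioned above. Models without grouped-query attention have many more KV heads, so their cache can be several times larger at the same context length.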
Want to confirm what a node ended up serving? The model name carries the quant right in it, and you can read it back from the gateway:
# Ask the gateway what your nodes can actually serve
curl https://wideareaai.com/api/v1/models \
  -H "Authorization: Bearer wai_sk_..."
If you're torn between two quants, deploy both to different nodes and use the Playground → Compare tab to send the same prompt to each side by side. It's the fastest way to settle whether Q5_K_M is worth the extra VRAM over Q4_K_M for your specific workload. Usually it isn't, but for a coding model it sometimes is, and now you can see it instead of guessing.
The whole thing in one line
Download Q4_K_M. Go up to Q5_K_M or Q6_K if it's a coding/math model or a small model and you have spare VRAM. Go down to Q3_K_M (or an IQ3) only to fit a bigger model you really want. Skip Q8_0 unless you're archiving: you can't hear the difference and you can feel the slowdown. And let the gateway pick when you'd rather not think about it at all.