~/blog/llm-deep-dive-quantization-algorithms

LLM Deep Dive · part 1

[LLM Deep Dive] What Quantization Algorithms Actually Do: From Q4_K_M to TurboQuant

cat --toc

TL;DR

Q4_K_M doesn't just "throw away 75% of each weight." It's three algorithms stacked: K-quant binds 256 weights into a super-block with FP16 d / dmin and per-32-weight 6-bit sub-scales, giving each block tight asymmetric calibration Q4_0 couldn't reach; PolarQuant randomly rotates vectors so the distribution flattens; QJL uses a 1-bit sign sketch on the residual for unbiased correction. Each layer fixes the previous layer's failure mode — that's why 4 bits became the sweet spot.

Why this article exists

In LLM 101 Part 4 I used the MP3 metaphor to explain why quantization works at all — neural networks are inherently noisy, weights cluster around zero, and rounding errors mostly cancel out. That article is enough to answer "which version should I download."

But a reader emailed: "Is 4-bit really just rounding every weight to 16 buckets? Why doesn't that collapse the model?"

The intuitive answer is "neural networks have redundancy." That's true but lazy. The real answer is that the last decade of quantization research has done a lot of work to make 4 bits the usable sweet spot. A naive cut from FP16 to 4 bits (the original Q4_0) loses noticeable quality. The reason Q4_K_M is "almost lossless" is the chain of clever engineering decisions behind it.

This is the opening article of a new series — LLM Deep Dive. It's for readers who finished Part 4, want to understand the mechanism, but don't want to read the paper. I'll walk through three layers of modern quantization, from 2023's K-quant to 2025's TurboQuant. Real algorithm names appear ("super-block," "Johnson-Lindenstrauss"), but no equations — I'll use geometric intuition instead.

If you want the measured numbers afterward, the TurboQuant benchmark on GX10 is the sister article.


The fundamental enemy: outliers

Every quantization algorithm is designed around one enemy — outliers.

Quantization means: you have a stream of continuous floating-point numbers, and you want to approximate them using a small, fixed set of discrete levels. 4 bits = 16 levels. If your numbers sit roughly between -1 and +1, slicing that range into 16 buckets and snapping each value to the nearest bucket gives small errors.

Real weights don't behave like that. Most weights do live between -1 and +1, but every few hundred entries you find a 5 or a -8. These outliers can't be discarded — they often encode important features. But if you stretch the scale to cover -8 to +8, those 16 levels now span the full range with a step size of 1. All the detail that used to live between -1 and +1 collapses into three buckets, losing 90% of the resolution.

This is why naive quantization fails. Every generation of algorithm tackles outliers differently.
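The failure is easy to reproduce in a few lines of NumPy. This is a minimal sketch of one-scale-for-everything quantization — illustrative only, not any real library's code:

```python
import numpy as np

def quantize_naive(x, bits=4):
    """One scale for the whole array: snap each value to the nearest of
    2**bits evenly spaced signed levels covering [-max|x|, +max|x|]."""
    half = 2 ** (bits - 1)                  # 8 levels on each side for 4 bits
    scale = np.abs(x).max() / (half - 1)    # stretch to cover the extreme value
    q = np.clip(np.round(x / scale), -half, half - 1)
    return q * scale

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.3, 256)           # well-behaved: roughly [-1, 1]
err_clean = np.abs(quantize_naive(weights) - weights).mean()

weights[100] = -8.0                         # plant a single outlier
err_outlier = np.abs(quantize_naive(weights) - weights).mean()

print(err_clean, err_outlier)               # one outlier inflates everyone's error
```

The mean error jumps by roughly the same factor the scale was stretched — the outlier itself is represented perfectly while the other 255 weights pay for it.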


Layer one: why legacy quantization (Q4_0) isn't enough

The earliest llama.cpp scheme is Q4_0. The algorithm fits in a single sentence:

For each group of 32 weights, find the maximum absolute value max_abs, set scale = max_abs / 7, then round each weight to round(weight / scale) and store as a 4-bit signed integer (range -8 to +7).

That's the literal "split the range into 16 buckets" implementation. Every 32 weights share one scale.

Failure mode: a single outlier in those 32 weights hijacks the scale for the whole group. The other 31 small weights get crushed into a few central buckets and become indistinguishable from each other. For neural networks, the relative magnitudes of those 31 weights carry information — flattening them to "all the same" amputates that layer's expressive power.

Q4_0 works passably on small models (under 3B). Above 7B, the quality gap to FP16 is visible. Fixing this required changing the grouping strategy.


Layer two: K-quant — super-blocks plus per-block scales

In June 2023, ikawrakow opened llama.cpp PR #1684, introducing Q2_K, Q3_K, Q4_K, Q5_K, and Q6_K. That's the K in Q4_K_M.

The core idea: two-tier grouping.

Take Q4_K:

  • Super-block = 256 weights sharing a pair of high-precision values, d and dmin, both stored as ggml_half (FP16 half precision)
  • Block = the super-block is split into 8 sub-blocks of 32 weights each; every sub-block carries a 6-bit quantized sub-scale and sub-min (packed inside block_q4_K.scales per the ggml-common.h struct)
  • Each weight is stored in 4 bits; dequantization reconstructs it as (super-block d × sub-scale) × code − (dmin × sub-min)

Convoluted to describe, but note what the upgrade over Q4_0 is not: it is not "shrinking the contamination radius" — both schemes use 32-weight blocks. The real gains are two. First, asymmetric quantization: the dmin offset lets K-quant handle distributions that aren't centered on zero, which Q4_0's symmetric scheme simply can't. Second, the super-block hierarchy: 8 blocks share one pair of FP16 d / dmin while each block carries its own 6-bit sub-scale and sub-min, so each block calibrates itself far more precisely than Q4_0 could, while the metadata overhead grows only modestly. An outlier in one block pushes up its own sub-scale; its 7 neighbors share the same super-block d but calibrate independently and stay unaffected.
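The two-tier dequantization can be sketched schematically — this mirrors the field widths described above (FP16 d/dmin, 6-bit sub-scales) but skips the real bit-packed layout of ggml-common.h:

```python
import numpy as np

def q4_k_dequantize(d, dmin, sub_scales, sub_mins, codes):
    """Schematic two-tier dequantization of one 256-weight super-block.

    d, dmin              -- the super-block's FP16 pair (plain floats here)
    sub_scales, sub_mins -- eight 6-bit integers (0..63), one per 32-weight block
    codes                -- 256 unsigned 4-bit codes (0..15)

    Each weight = (d * sub_scale) * code - (dmin * sub_min): a per-block
    scale AND a per-block offset, both derived from shared FP16 values."""
    codes = codes.reshape(8, 32).astype(np.float32)
    block_scale = d * sub_scales.astype(np.float32)[:, None]
    block_min = dmin * sub_mins.astype(np.float32)[:, None]
    return (block_scale * codes - block_min).reshape(256)

# Toy super-block: block 4 holds an outlier, so only ITS sub-scale is large;
# the other seven blocks keep fine-grained scales and are unaffected.
d, dmin = 0.01, 0.004
sub_scales = np.array([9, 11, 10, 10, 63, 10, 9, 11])
sub_mins = np.full(8, 16)
codes = np.random.default_rng(2).integers(0, 16, size=256)
weights = q4_k_dequantize(d, dmin, sub_scales, sub_mins, codes)
print(weights.min(), weights.max())
```

The quarantine property is visible in the toy numbers: block 4's sub-scale of 63 stretches only its own 32 reconstructed weights.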

Super-block layouts for each K-quant level (clearly listed in ikawrakow's PR description):

Level   Super-block layout               bits/weight
Q2_K    16 blocks × 16 weights = 256     2.5625
Q3_K    16 blocks × 16 weights = 256     3.4375
Q4_K     8 blocks × 32 weights = 256     4.5
Q5_K     8 blocks × 32 weights = 256     5.5
Q6_K    16 blocks × 16 weights = 256     6.5625

Super-block size is always 256 weights, but block granularity varies by format: the low-bit formats Q2_K and Q3_K slice into 16-weight blocks, because with so few levels an outlier is more catastrophic and needs a smaller quarantine zone.

The fractional bits (the .5625 and .4375) are metadata amortized across the weights — sub-scales and sub-mins also take space. That's why Q4_K is 4.5 bits/weight, not exactly 4.
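The 4.5 figure is easy to verify by totting up one super-block's storage, using the field widths described above:

```python
# One Q4_K super-block covers 256 weights:
weight_bits = 256 * 4          # a 4-bit code per weight
d_dmin_bits = 2 * 16           # one FP16 d + one FP16 dmin
scale_bits  = 8 * (6 + 6)      # per-block 6-bit sub-scale + 6-bit sub-min

total = weight_bits + d_dmin_bits + scale_bits   # 1152 bits = 144 bytes
print(total / 256)                               # -> 4.5 bits per weight
```

Only 128 of the 1152 bits are metadata — about 11% overhead buys each block its own calibration.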

K-quant's success directly produced the "just use Q4_K_M" consensus. From that moment, 4 bits stopped being a tradeoff and became the default.


Layer three: TurboQuant — rotate, flatten, then record direction

K-quant pushed weight quantization to a sweet spot using asymmetric quantization plus two-tier metadata. But it still assumes that each 32-weight block fits a single scale/min well and that one FP16 d per super-block covers all its blocks. For weights those assumptions hold; for the KV cache, where distributions are more skewed and outliers concentrate more sharply, they don't.

The KV cache holds the key/value vectors the model accumulates during generation so it can compute attention over long conversations. Its distribution is even more skewed than the weights': a few channels carry extreme outliers while most channels are tightly clustered. Traditional quantization performs poorly here, which is the gap Google Research's TurboQuant targets (published April 2025, arXiv 2504.19874).

TurboQuant is a two-stage algorithm. I'll use the naming from the existing benchmark article — PolarQuant + QJL.

Stage one: PolarQuant — flatten the distribution

You can't quantize KV vectors directly because the distribution is so skewed. PolarQuant's move: rotate first.

Picture a 128-dimensional vector where a few dimensions (say 17 and 53) hold extreme values. Multiply by a random rotation matrix R to get Rv. Rotation preserves vector length but redistributes the energy that was concentrated in a few dimensions across all dimensions evenly.

After rotation, every component looks similar. More precisely: for a fixed-norm vector under a uniformly random rotation, the square of each coordinate follows a concentrated Beta distribution — and in high dimensions the coordinate itself approaches a Gaussian (Maxwell-Poincaré). Either framing leads to the same point: there are no outlier dimensions anymore; every dimension's values live in the same range.

The elegance: you don't need to know the data's distribution in advance (unlike GPTQ-style methods that require calibration data). Random rotation works for any distribution, because its job is precisely to destroy the original distribution. Lloyd-Max can then find globally optimal quantization points on the resulting, now-predictable Beta distribution.

Rotation = spreading outliers across all dimensions. With no outliers, quantization is easy.
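The flattening effect takes five lines of NumPy to see — a demo, not PolarQuant's actual construction (which uses structured rotations for speed):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 128

v = rng.normal(0, 0.1, dim)        # mostly small components...
v[17], v[53] = 6.0, -8.0           # ...plus two extreme outlier dimensions

# Random rotation: orthonormalize a Gaussian matrix via QR.  (A sign
# correction is needed for exactly Haar-uniform rotations; skipped here.)
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
rv = Q @ v

print(np.abs(v).max())             # ~8: one dimension dominates
print(np.abs(rv).max())            # much smaller: energy spread evenly
print(np.linalg.norm(v) - np.linalg.norm(rv))   # ~0: length preserved
```

After the rotation every coordinate lives at roughly the same magnitude — norm / √dim — which is exactly the "no outlier dimensions" condition a shared scale needs.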

Stage two: QJL — 1-bit sign sketch for unbiased correction

After PolarQuant there's still residual error. For weight quantization that residual is usually negligible; for attention score computation it isn't — MSE-only quantization introduces systematic bias, not random error. The benchmark measured this at 3-bit: the raw numbers are both tiny (+0.000191 for MSE-only, -0.000317 with QJL correction), but only the QJL estimator is unbiased by construction — its residual error is zero-mean noise rather than a consistent lean.

QJL = Quantized Johnson-Lindenstrauss. The Johnson-Lindenstrauss lemma is a classic result: random projection into a lower-dimensional space approximately preserves inner products between vectors. QJL applies this to the residual — project the residual through another random rotation, but keep only the sign bit (1 bit) of each component.

Why is 1 bit enough? Because this stage isn't recording how big the residual is, only which direction it leans. Combined with the magnitude already captured by PolarQuant, that direction is enough to correct the systematic bias and turn attention score computation into an unbiased estimator.

The paper phrases it as "MSE quantizer followed by 1-bit Quantized JL transform on the residual" — the two stages together yield "unbiased and high compression simultaneously." In practice it achieves quality-neutral KV cache compression at 3.5 bits/channel.
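This isn't TurboQuant's exact estimator, but the underlying "signs preserve direction" fact is easy to demonstrate with a SimHash-style sketch: project through a random Gaussian matrix, keep one sign bit per row, and recover the cosine from the sign-agreement rate via the classic identity P[agree] = 1 − θ/π:

```python
import numpy as np

rng = np.random.default_rng(4)
dim, m = 128, 4096                     # original dimension, number of sign bits

u = rng.normal(size=dim)
v = 0.8 * u + 0.6 * rng.normal(size=dim)     # v leans toward u

R = rng.normal(size=(m, dim))          # random JL-style projection
bits_u = np.sign(R @ u)                # keep ONLY the sign of each component
bits_v = np.sign(R @ v)

agree = (bits_u == bits_v).mean()      # fraction of matching sign bits
cos_est = np.cos(np.pi * (1 - agree))  # invert P[agree] = 1 - angle/pi

cos_true = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_true, cos_est)               # close: 1 bit per row recovers direction
```

One sign bit per projected component, yet the angle — and with known norms, the inner product — comes back with small error. That's the budget QJL spends on the residual.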


Mathematical intuition: why this works

No equations, but I want you to leave with two pictures:

Picture one: rotation flattens. Imagine a sheet of paper covered with stars, but 90% of them are crammed into the upper-right corner and only a sparse few in the lower-left. Now rotate the sheet randomly — 30 degrees, then 47, then 13. The relative positions of the stars haven't changed, but their distribution along "horizontal vs vertical" axes is much more even. Random rotation does exactly that — it spins the coordinate axes into an orientation that distributes the information uniformly.

Picture two: direction vs length. The inner product of two vectors equals the product of their lengths times cos(angle between them). What we usually care about isn't how long they are, but how aligned they are. If the magnitude is already preserved by the main quantization, all that's left to correct is the angle — and angle correction only needs the sign of the residual. That's why a 1-bit sign sketch suffices.

The two pictures together: quantization isn't "discarding information," it's reorganizing information so the limited bits land where they matter most.


What three layers of algorithm have solved

Looking back at the progression:

  • Q4_0 (legacy, 2022) — solved: squeezing 16 bits down to 4 at all. Failure mode: symmetric-only (no min), one scale per isolated 32-weight block — a single outlier crushes its 31 neighbors.
  • K-quant / Q4_K (2023-06) — solved: asymmetric quantization plus per-block 6-bit sub-scales under shared FP16 super-block scales. Failure mode: still assumes each block fits one scale/min pair; breaks down on the KV cache's sharper outliers.
  • TurboQuant (2025-04) — solved: rotation + 1-bit sign sketch handle any distribution and guarantee unbiasedness. Failure mode: no fused CUDA kernel yet, so no speedup advantage.

K-quant was a "topology win" — changing the grouping structure solved 80% of the problem. TurboQuant is a "geometry win" — changing the coordinate system solved the remaining 20%. Neither generation reached for "use more bits." The bottleneck was never bit count — it was algorithmic cleverness.

This is also why LLM quantization research is still active: each new algorithm exposes failure modes the previous one didn't notice. Q4_0 → K-quant's roughly 7x quality improvement didn't come from adding bits; it came from changing strategy.


What you can do with this

Three concrete things:

1. Reading Q4_K_M you know what's hiding in the default: super-blocks of 8 blocks × 32 weights, plus the M variant (medium — the mix that keeps a few sensitive tensors at a higher-bit K-quant to trade a little size for quality).

2. Reading --kv-cache-dtype turboquant3 you know how 3-bit KV cache stays lossless: rotation flattens, sign sketch corrects — it's not "just slash bits."

3. Reading new papers you'll ask the right question: the next generation of quantization algorithms won't be "fewer bits" — it'll be "smarter coordinate systems" or "smaller contamination radius." See "rotation," "sketch," "super-block," "block-wise" in a paper and you can immediately classify what problem it's solving.



This is the first article in the LLM Deep Dive series — for readers who've finished LLM 101 and want to understand the mechanism. Next planned topic: the attention mechanism itself — what Q/K/V actually do, and why softmax is the heart of attention rather than a decoration.

FAQ

Why doesn't 4-bit quantization break the model?
Because modern quantization isn't 'one scale per tensor.' K-quant groups every 256 weights into a super-block with a shared scale; TurboQuant goes further — it randomly rotates vectors so outliers spread across all dimensions, then uses a 1-bit sign sketch on the residual for unbiased correction. Each layer fixes the previous layer's failure mode.
How does K-quant differ from the older Q4_0?
Q4_0 already uses per-32-weight blocks with one FP16 scale each, but it's symmetric (no min offset) and each block stands alone. A single outlier in a block collapses its 31 neighbors into two or three distinct values. K-quant (llama.cpp PR #1684, June 2023) keeps the 32-weight block size but adds (a) asymmetric quantization with a `dmin` companion to the `d` scale, and (b) a super-block layer — every 256 weights share a pair of FP16 `d` and `dmin`, while each 32-weight block carries its own 6-bit quantized sub-scale and sub-min. Each block now calibrates itself far more precisely at small metadata cost.
Why does TurboQuant rotate before quantizing?
Because raw KV cache vectors are unevenly distributed — a few dimensions hold extreme outliers, the rest cluster tightly. MSE quantization on this distribution always leaves systematic bias. Random rotation (PolarQuant stage) flattens the distribution toward Beta, giving every dimension a similar profile so Lloyd-Max can find globally optimal quantization points without calibration data.
Why is 1 bit enough for QJL?
QJL doesn't quantize the vector itself — it quantizes the direction of the residual. The Johnson-Lindenstrauss lemma guarantees that random projections approximately preserve inner products. After MSE quantization captures the magnitude, recording just the sign of each residual component is enough to restore unbiased estimation of the attention score.