LLM Deep Dive · part 2
[Benchmark] TurboQuant on GX10: Is 3-bit KV Cache Compression Actually Lossless?
❯ cat --toc
- Plain-Language Version: Compressing AI Memory Without Losing Quality
- What TurboQuant Does: 30-Second Version
- Setup
- Test 1: Synthetic Benchmark
- Lloyd-Max Codebook
- MSE-Only vs QJL: Bias Confirmed
- Needle-in-Haystack (Synthetic)
- GB10 Speed Numbers
- Test 2: Real Model Validation (Qwen2.5-3B-Instruct)
- Compression and Accuracy
- The Qwen3.5-35B Situation
- Hybrid Attention: Only 25% of Layers Have a KV Cache
- Memory Math
- Is It Usable Now?
- What Took the Most Time
- TL;DR
TL;DR
Google's TurboQuant 3-bit KV cache compression validates on GB10: 5.0x compression, cosine sim 0.9943, Needle-in-Haystack EXACT. But no speed gain yet (needs fused kernels), and Qwen3.5-35B's hybrid attention means only 25% of layers benefit — saving just 600 MB vs the paper's claimed 6x for dense models.
TurboQuant in 30 seconds: PolarQuant + QJL two-stage compression
Plain-Language Version: Compressing AI Memory Without Losing Quality
When an AI model has a conversation with you, it needs to remember the entire conversation history. This memory is called the "KV cache" (Key-Value cache), and it grows with every message. A single conversation with 8,000 tokens already uses 289 MB of memory. Scale that to long documents or complex tasks, and memory becomes the bottleneck — not the AI's brain, but its notepad.
TurboQuant, published by Google Research at ICLR 2026, claims to compress this memory by 6x using just 3 bits per value instead of the usual 16, with no loss in answer quality. Think of it like JPEG compression for photos: the file gets much smaller, but you can't tell the difference by looking at it.
This article tests those claims on real hardware — an NVIDIA DGX Spark desktop AI computer. The compression works as advertised for standard AI models. But for newer hybrid models like Qwen3.5-35B (which already use memory-efficient architectures), the savings are much smaller than the headline number suggests. The math works; the benefit depends on what you're running.
New to quantization basics? If phrases like "3 bits vs 16 bits" or "Q4_K_M" don't mean much yet, start with LLM 101 Part 4: What is quantization? — it explains weight quantization with MP3 analogies. This article is the KV-cache sequel: same technique family, different target.
The memory bottleneck in LLM inference isn't the model weights — it's the KV cache. An 8K-context Qwen2.5-3B already needs 289MB for KV alone, and it scales linearly. At long contexts or with large models, this becomes genuinely painful.
On March 24, Google Research published TurboQuant (ICLR 2026), claiming 3-bit KV cache compression with 6x memory reduction and zero accuracy loss. Six independent implementations appeared within five days. This is a hands-on validation run on a GX10 (GB10/SM121).
What TurboQuant Does: 30-Second Version
Standard quantization maps floats to fixed bit widths. TurboQuant does two things instead:
Stage 1: PolarQuant Apply a random rotation to each KV vector, convert to polar coordinates, then quantize per-dimension. The rotation spreads energy uniformly, making each dimension statistically predictable — so you can precompute mathematically optimal quantization bins (Lloyd-Max algorithm) once, offline. No calibration data, no fine-tuning required.
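A quick way to see the energy-spreading effect (a toy demo written for this article, not code from the paper or the repo): rotate a maximally "spiky" vector by a random orthogonal matrix and watch its coordinates flatten out.

```python
import torch

torch.manual_seed(0)
d = 128

# a "spiky" unit vector: all of its energy sits in one coordinate
x = torch.zeros(d)
x[0] = 1.0

# random rotation: orthogonal matrix from the QR decomposition of a Gaussian
Q, _ = torch.linalg.qr(torch.randn(d, d))
y = Q @ x

print(f"max |coord| before rotation: {x.abs().max():.3f}")   # 1.000
print(f"max |coord| after rotation:  {y.abs().max():.3f}")   # roughly 0.2-0.3
print(f"per-coordinate variance * d: {d * y.var():.3f}")     # close to 1: energy spread evenly
```

Once every coordinate behaves like a zero-mean Gaussian with predictable variance, one precomputed codebook can serve every dimension of every vector, which is why no calibration pass is needed.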
Stage 2: QJL (Quantized Johnson-Lindenstrauss) Stage 1 leaves residual error. QJL projects that residual through a random Gaussian matrix using the Johnson-Lindenstrauss transform, keeping only the sign bit (+1 or −1). This single-bit sketch makes attention score computation provably unbiased — something MSE-only quantization cannot guarantee.
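To make "provably unbiased" concrete, here is a minimal sketch of the sign-sketch estimator behind QJL (my own reimplementation of the idea; function names and the sketch size m are illustrative, and in TurboQuant proper this step is applied to the Stage-1 residual rather than the raw key as shown here): project with a shared Gaussian matrix, keep only the sign bits plus the key's norm, and rescale by sqrt(pi/2).

```python
import torch

def qjl_encode(k: torch.Tensor, S: torch.Tensor):
    """Compress a vector to m sign bits plus one scalar (its L2 norm)."""
    return torch.sign(S @ k), k.norm()

def qjl_score(q: torch.Tensor, sign_bits: torch.Tensor, k_norm: torch.Tensor, S: torch.Tensor):
    """Estimate <q, k> from the 1-bit sketch; unbiased after the sqrt(pi/2) rescale."""
    m = S.shape[0]
    return (torch.pi / 2) ** 0.5 * k_norm / m * ((S @ q) @ sign_bits)

torch.manual_seed(0)
d, m = 128, 4096                      # larger m = lower variance, more bits stored
q, k = torch.randn(d), torch.randn(d)
S = torch.randn(m, d)                 # shared random Gaussian projection

sign_bits, k_norm = qjl_encode(k, S)
print(f"true <q,k>:   {float(q @ k):+.2f}")
print(f"QJL estimate: {float(qjl_score(q, sign_bits, k_norm, S)):+.2f}")
```

For a single pair the estimate is noisy, but the error has zero mean, so attention logits are not systematically shrunk or inflated the way they can be with MSE-only quantization.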
Paper claims: ~3.5 bits/channel for quality-neutral compression, KV compression exceeding 5x, QJL-corrected attention logit speedups when run with optimized kernels, zero accuracy loss across LongBench/RULER/Needle-in-Haystack. (The "8x on H100" figure that often gets quoted alongside this paper isn't in the arxiv text I pulled — treat secondary-source performance numbers with caution.)
Setup
- Machine: GX10, NVIDIA GB10 (SM121), 128GB unified memory, 273 GB/s bandwidth
- Python env: ~/.python-vllm-custom venv (torch 2.10+cu130, transformers 5.3.0)
- Implementation: tonbistudio/turboquant-pytorch (from-scratch community PyTorch implementation)
- Validation model: Qwen2.5-3B-Instruct (local, fp16)
Installation:
source ~/.python-vllm-custom/bin/activate
pip install accelerate scipy -q
# Note: directory name must be a valid Python module name (no hyphens)
git clone https://github.com/tonbistudio/turboquant-pytorch ~/experiments/turboquant
cd ~/experiments && python -m turboquant.test_turboquant
Test 1: Synthetic Benchmark
No model download needed — validates the math directly.
Lloyd-Max Codebook
Quantization distortion at different dimensionalities and bit widths:
d=128, bits=3: distortion/coord=0.000270
d=256, bits=3: distortion/coord=0.000135
Higher dimensions give lower per-coordinate distortion. That's a general property of high-dimensional vector quantization, and it's part of why TurboQuant's random-rotation trick holds up at realistic head dimensions.
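For reference, a Lloyd-Max codebook for a Gaussian source is easy to reproduce. A minimal implementation (my sketch, not the repo's code; scipy is already in the env from the install step) converges at 3 bits to a distortion of roughly 0.0345 per unit-variance coordinate, and since a unit-norm 128-dim vector has per-coordinate variance of about 1/128, that scales down to roughly 0.00027, which appears to be where the d=128 figure above comes from.

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(bits: int, iters: int = 200):
    """Lloyd-Max scalar quantizer for a standard normal source."""
    levels = np.linspace(-2.0, 2.0, 2 ** bits)        # initial codebook guess
    for _ in range(iters):
        # decision boundaries: midpoints between adjacent codebook levels
        t = np.concatenate(([-np.inf], (levels[:-1] + levels[1:]) / 2, [np.inf]))
        # updated levels: conditional mean of N(0,1) inside each cell
        mass = norm.cdf(t[1:]) - norm.cdf(t[:-1])
        levels = (norm.pdf(t[:-1]) - norm.pdf(t[1:])) / mass
    return levels, t

levels, t = lloyd_max_gaussian(bits=3)

# empirical distortion E[(x - q(x))^2] for a unit-variance coordinate
x = np.random.default_rng(0).standard_normal(1_000_000)
xq = levels[np.searchsorted(t, x) - 1]
mse = np.mean((x - xq) ** 2)
print(f"3-bit distortion per unit-variance coord: {mse:.4f}")   # about 0.0345
print(f"scaled to a unit-norm d=128 vector:       {mse / 128:.6f}")
```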
MSE-Only vs QJL: Bias Confirmed
bits=2: MSE-only bias=+0.001052 (biased)
bits=3: MSE-only bias=+0.000191 (biased)
After QJL correction:
bits=3: bias=-0.000317 (≈ 0)
bits=4: bias=+0.000559 (≈ 0)
This is one of the paper's core claims: MSE-only quantization introduces a systematic bias in attention score computation (both MSE-only values land on the same side of zero), while the QJL-corrected residual scatters around zero (note the sign flip between 3-bit and 4-bit) instead of drifting in one direction. Verified on GB10, matches the paper.
Needle-in-Haystack (Synthetic)
| bits | seq=512 | seq=2048 | seq=8192 |
|---|---|---|---|
| 2-bit | EXACT | EXACT | EXACT |
| 3-bit | EXACT | EXACT | EXACT |
| 4-bit | EXACT | EXACT | EXACT |
Perfect retrieval across all bit widths and sequence lengths.
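For intuition about what EXACT means here: the synthetic test plants one key the query strongly prefers among thousands of random keys, quantizes all of them, and checks whether argmax over the approximate attention scores still lands on the planted index. A crude stand-in version (plain round-to-nearest quantization instead of TurboQuant, purely to show the shape of the test; names and constants are illustrative):

```python
import torch

torch.manual_seed(0)
d, seq, needle_idx = 128, 8192, 1234

keys = torch.randn(seq, d)
keys = keys / keys.norm(dim=1, keepdim=True)        # unit-norm keys
query = keys[needle_idx] + 0.1 * torch.randn(d)     # query points at the needle

def crude_quant(x: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Symmetric round-to-nearest quantization (a stand-in, not TurboQuant)."""
    levels = 2 ** (bits - 1) - 1
    scale = x.abs().max() / levels
    return torch.round(x / scale) * scale

scores = crude_quant(keys) @ query                   # approximate attention logits
print("retrieved:", scores.argmax().item(), "expected:", needle_idx)
```

If the retrieved index matches the planted one for every query, the run is scored EXACT.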
GB10 Speed Numbers
GPU: NVIDIA GB10
Config: d=128, bits=3, seq_len=8192, n_queries=64
Quantize 8192 keys: 31.04 ms
Inner product (64q × 8192k): 11.16 ms
Full-precision matmul: 0.02 ms
Memory: 2048 KB (fp16) → 384 KB (TQ-3bit) = 5.3x
The 0.02 ms vs 11 ms gap matters: this PyTorch implementation has no fused CUDA kernel. The paper reports attention-time speedups when QJL runs with optimized kernels (the specific "8x on H100" claim often attributed to this paper online wasn't in the arxiv text I checked, so treat it as an unverified secondary-source number). On GB10 (SM121, non-mainstream hardware), using standard PyTorch ops, there is currently no speed advantage. Memory savings are real; speed gains require upstream kernel support.
Test 2: Real Model Validation (Qwen2.5-3B-Instruct)
Using actual model KV cache vectors, three context lengths (2K/4K/8K), needle-in-haystack scenario.
Compression and Accuracy
| bits | Ratio | Cosine sim (2K) | Cosine sim (8K) | Top-1 match (8K) |
|---|---|---|---|---|
| 4-bit | 3.8x | 0.9987 | 0.9982 | 84.7% |
| 3-bit | 5.0x | 0.9959 | 0.9943 | 80.6% |
| 2-bit | 7.3x | 0.9891 | 0.9846 | 69.4% |
Original KV cache: 289MB at 8K context (fp16). After 3-bit TurboQuant: 57.6MB.
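As a sanity check on those numbers (assuming Qwen2.5-3B's usual config of 36 layers, 2 KV heads, and head_dim 128; worth confirming against the model's config.json), both figures fall out of simple arithmetic:

```python
# back-of-envelope KV cache size for Qwen2.5-3B at 8K context
layers, kv_heads, head_dim, tokens = 36, 2, 128, 8192            # assumed config values

fp16_bytes = 2 * kv_heads * head_dim * 2 * layers * tokens       # 2x for K and V, 2 bytes each
print(f"fp16 KV cache:    {fp16_bytes / 2**20:.1f} MiB")         # about 288 MiB
print(f"at measured 5.0x: {fp16_bytes / 5.0 / 2**20:.1f} MiB")   # 57.6 MiB
```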
A few notes:
- 3-bit is the sweet spot. Cosine similarity 0.9943, Top-1 match 80.6%. For most long-context tasks, attention doesn't need perfect per-head top-1 retrieval; getting the overall distribution right is sufficient.
- Quality degrades slightly with longer context. 3-bit cosine sim drops from 0.9959 at 2K to 0.9943 at 8K. The effect is small but present: quantization error accumulates across longer sequences.
- Don't use 2-bit in production. 69.4% Top-1 match will hurt on tasks requiring precise token retrieval from long context.
The Qwen3.5-35B Situation
The obvious next step was testing on the production model. That's where things got interesting.
Hybrid Attention: Only 25% of Layers Have a KV Cache
From ~/models/qwen35-35b-hf/ config:
layer_types: [
"linear_attention", "linear_attention", "linear_attention", "full_attention",
"linear_attention", "linear_attention", "linear_attention", "full_attention",
... # full_attention every 4th layer
]
# full_attention layers: [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
num_key_value_heads: 2
head_dim: 256
Of 40 layers, only 10 are standard full attention. The other 30 are GatedDeltaNet (linear attention with fixed-size state, no KV cache). TurboQuant only compresses full attention KV — that's 10/40 = 25% of layers. And those 10 layers have num_key_value_heads=2, meaning the KV cache per layer is already very small.
Memory Math
fp16, 200K context:
2 KV heads × 256 dim × 2 bytes × 10 layers × 200K tokens ≈ 2GB
TurboQuant 3-bit → ~400MB
The GX10 vLLM setup already uses --kv-cache-dtype fp8 (8-bit), bringing KV to ~1GB. TurboQuant 3-bit would bring it to ~400MB — saving 600MB.
In other words: for Qwen3.5-35B-A3B, TurboQuant's memory benefit is much smaller than the paper suggests for dense models. This isn't a flaw in TurboQuant — it's a property of the hybrid attention architecture. Compare a pure dense model like Llama-3.1-70B with all 80 layers doing full attention: there, you'd see close to the paper's claimed 6x.
Is It Usable Now?
Not in production. vLLM PR #38280 was under review through late March 2026 but closed unmerged on 2026-04-06. llama.cpp has multiple community forks but nothing merged to main. Watch upstream vLLM for a follow-up PR.
Yes for learning and experimentation. pip install turboquant or cloning tonbistudio/turboquant-pytorch is enough to run the Python-level validation and get your head around the algorithm in under 10 minutes.
The path forward:
vLLM PR #38280 merges
↓
--kv-cache-dtype turboquant3
↓
Retest Qwen3.5-35B (expected: fp8 → 3-bit is roughly 8/3 ≈ 2.7x more tokens in the same KV budget, so context capacity 1.8M → ~4.8M tokens)
What Took the Most Time
Figuring out that qwen35-35b-hf/ is a multimodal model (Qwen3_5MoeForConditionalGeneration), not a text-only model. transformers can't load an fp8 MoE + vision encoder into 89GB of system RAM. That cost an hour.
Transferable lesson: For any hybrid attention model (Mamba hybrid, GLA, DeltaNet, RetNet variants), calculate what fraction of layers are full attention before estimating TurboQuant's benefit. Architecture changes the algorithm's applicable scope.
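A quick sketch of that check (the layer_types field matches the config dump above; for multimodal wrappers like the one mentioned here, the attention layout may sit under a nested text_config, and models without a layer_types field are typically all full attention):

```python
import os
from transformers import AutoConfig

# local path from this article; point it at any HF model directory
cfg = AutoConfig.from_pretrained(os.path.expanduser("~/models/qwen35-35b-hf"))
text_cfg = getattr(cfg, "text_config", None) or cfg   # unwrap multimodal configs

layer_types = getattr(text_cfg, "layer_types", None)
if layer_types is None:
    print(f"no layer_types field: treating all {text_cfg.num_hidden_layers} layers as full attention")
else:
    full = sum(t == "full_attention" for t in layer_types)
    print(f"{full}/{len(layer_types)} layers keep a KV cache ({100 * full / len(layer_types):.0f}%)")
```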
TL;DR
- TurboQuant 3-bit validates correctly on GB10: 5x compression, cosine sim 0.9943, Needle-in-Haystack EXACT
- QJL bias correction is real and effective — MSE-only has systematic bias, QJL eliminates it
- No speed benefit on GB10 currently: needs fused kernels, current PyTorch implementation is ~500x slower than matmul
- Qwen3.5-35B-A3B only has 25% of layers with KV cache (hybrid attention) — TurboQuant benefit is much lower than for dense models
- Wait for vLLM PR #38280 to merge before enabling in production
FAQ
- Does TurboQuant 3-bit KV cache compression actually work without accuracy loss?
- On GB10 with Qwen2.5-3B-Instruct, TurboQuant 3-bit achieves 5.0x compression with cosine similarity 0.9943 and 80.6% Top-1 match at 8K context. Needle-in-Haystack retrieval is EXACT at all bit widths (2/3/4-bit). The QJL bias correction eliminates systematic bias that MSE-only quantization introduces. Quality degrades slightly with longer context (0.9959 at 2K vs 0.9943 at 8K).
- How much memory does TurboQuant save on Qwen3.5-35B hybrid attention models?
- Much less than for dense models. Qwen3.5-35B-A3B has only 10/40 layers with full attention (25%), each with just 2 KV heads. At 200K context, KV cache is ~2 GB in fp16, ~1 GB with fp8, and ~400 MB with TurboQuant 3-bit — saving only 600 MB. Dense models like Llama-3.1-70B with all 80 layers in full attention would see close to the paper's claimed 6x savings.
- Is TurboQuant faster than standard KV cache on GB10 (DGX Spark)?
- No, not currently. The community PyTorch implementation has no fused CUDA kernel — inner product takes 11ms vs 0.02ms for full-precision matmul on GB10. The paper reports attention speedups on GPU with optimized kernels (the specific '8x on H100' figure quoted online wasn't found in the arxiv text I checked — treat as secondary-source). Memory savings on GB10 are real; speed gains require upstream kernel support.
- When will TurboQuant be usable in production with vLLM?
- vLLM PR #38280 was under review through March 2026 but **closed unmerged on 2026-04-06**. llama.cpp has community forks but nothing merged to main. Watch upstream vLLM for a follow-up PR — production usability is not there yet.