~/blog/gtx-970-gemma4-e2b-quantization-benchmark

LLM Deep Dive · part 4

[Just for Fun] Gemma 4 E2B on a GTX 970: the biggest quant runs fastest (47.6 tok/s)

cat --toc

TL;DR

I put Google's Gemma 4 E2B on a 2014 GTX 970 — 3.5GB of usable VRAM, no tensor cores — and ran four quants through the same six tasks. The result inverts what you know about modern GPUs: the biggest file was the fastest. Google's 3.2GB QAT Q4_0 hit 47.6 tok/s; bartowski's 2.9GB Q2_K — the smallest file — managed 32.8 tok/s. On a card with no tensor cores, inference is dequant-compute-bound, not memory-bandwidth-bound, so the simplest quant format wins even though it moves more bytes. All four quants passed English and Chinese Q&A, single and parallel tool calls, JSON output, and code generation. One run per task — a smoke test, not a paper.

The short version

Nobody runs a GTX 970 for AI in 2026, and that's exactly what makes it fun. A graphics card from 2014 — the kind that sells for the price of lunch now — can still run a current Google AI model. Not just run it: answer in English and Chinese, call tools, write working code. The fun part is the twist. The "smarter" compression formats that everyone recommends turned out to be slower on this old card than the dumbest one. Old hardware plays by different rules, and this post is about why.


The setup: a 2014 card, a 2026 model

The GTX 970 shipped in September 2014. Maxwell architecture, compute capability 5.2, no tensor cores (those didn't arrive until Volta in 2017), and the infamous "3.5GB + 0.5GB" memory layout that got NVIDIA roasted back in 2015 — nominally 4GB, but the last half-gig sits on a slow partition, so in practice you plan around ~3.5GB.

The model is Google's Gemma 4 E2B. The "E" stands for effective: 2.3B effective parameters, but 5.1B total once you count the embeddings, thanks to Per-Layer Embeddings (PLE) — large lookup tables that live in the weights but barely cost any compute. It also carries a 262,144-token vocabulary. That vocab is why a "2B" model still weighs 3GB+ even after quantization, and it bites again at load time (more on that later). The model claims 128K context, a thinking mode, native tool use, and vision — none of which a 970 cares about, but the on-device design is exactly why it fits.

I built llama.cpp's CUDA backend for Maxwell and ran llama-server with everything on the GPU:

cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=52   # 5.2 = GTX 970
./build/bin/llama-server \
  -m ~/models/gemma-4-E2B-QAT-Q4_0.gguf \
  -ngl 99 -c 2048 --port 8080 \
  --jinja --chat-template-kwargs '{"enable_thinking":false}'

Four quants, all small enough to load fully into VRAM:

ModelTypeFile sizeSource
gemma-4-E2B-it-Q2_Kimatrix PTQ Q2_K2.9 GBbartowski
gemma-4-E2B-it-IQ3_Mimatrix PTQ IQ3_M3.0 GBbartowski
gemma-4-E2B-it-Q3_K_Mimatrix PTQ Q3_K_M3.1 GBbartowski
gemma-4-E2B-QAT-Q4_0QAT Q4_03.2 GBGoogle official

One framing note before the numbers: bartowski's imatrix quants are not "brute force." Brute force is RTN (round-to-nearest, no calibration). The real comparison is QAT (training-aware) vs imatrix PTQ (calibration-aware) vs RTN (no calibration at all). We're comparing the first two.

The twist: the biggest file is the fastest

Same six tasks, temperature 0.1, single run each:

TaskQ2_K (2.9G)Q3_K_M (3.1G)IQ3_M (3.0G)QAT Q4_0 (3.2G)
English Q&A32.529.130.044.7
Chinese Q&A31.929.230.347.0
Tool use (single)33.830.531.950.3
Tool use (parallel)32.829.531.247.6
JSON output32.829.430.549.1
Code generation32.829.930.346.9
Average32.829.630.747.6
QAT Q4_0  ████████████████████████████████████████████████  47.6 tok/s  (+45%)
Q2_K      █████████████████████████████████                 32.8 tok/s  (baseline)
IQ3_M     ███████████████████████████████                   30.7 tok/s  (-6%)
Q3_K_M    ██████████████████████████████                    29.6 tok/s  (-10%)

Now compare that ranking with the file sizes. The 3.2GB QAT Q4_0 — the largest file — is 45% faster than the 2.9GB Q2_K, the smallest. On a modern card you'd never see this: a GB10 or a 5090 is memory-bandwidth-bound, so fewer bytes per token usually means more tokens per second, and the smallest quant tends to win. Here the order is upside down. Something other than bandwidth is setting the pace.

Why a Maxwell card without tensor cores is dequant-bound

Every quant format has to unpack its weights back to something the math units can multiply. How much unpacking that takes depends heavily on the format:

  • Q4_0 — flat 4-bit integers in fixed blocks, one scale per block. The dequant is almost nothing: read a nibble, multiply by the scale.
  • K-quants (Q2_K, Q3_K_M) — each block is split into sub-blocks with their own quantized scales (Q2_K also carries a min per sub-block; Q3_K doesn't). More unpacking per weight.
  • I-quants (IQ3_M) — weights decode through fixed codebook/grid lookups at runtime, with an importance matrix that steered the precision allocation back at quantization time. The most unpacking of the three.

On a modern tensor-core GPU, none of that matters because the bottleneck is reading bytes from VRAM, not unpacking them. On a GTX 970 there are no tensor cores and integer throughput is weak, so the dequant arithmetic is the per-token cost. The cheapest dequant wins, and that's Q4_0 — even though its file is the biggest. The fancy formats spend their byte savings, and then some, on unpacking math the 970 is bad at.

So the "best quant" question doesn't have a universal answer. It depends on the hardware. (For how the quant algorithms themselves work, see Part 1: What Quantization Algorithms Actually Do.)

QAT didn't make it fast — Q4_0 did

It's tempting to write "Google's QAT model won." It didn't, not on speed. QAT — quantization-aware training — adjusts the weight values during training so the model tolerates low precision better. It changes what the numbers are, not how the kernel reads them. Inference speed comes from the format, and the format here is Q4_0.

A plain post-training Q4_0 of the same model would run at essentially the same speed. QAT's payoff is quality, not throughput — and quality is a separate axis, which is where it earns its keep. It's an easy mistake to credit the speed to QAT; the speed comes from the format.

Quality: all four pass, and only one task separates them

Speed is half the story. On quality, all four quants cleared every task:

  • English and Chinese Q&A — all correct, all fluent. Q2_K's Traditional Chinese was clean, no simplified-character leak.
  • Tool use, single and parallel — all four emitted the right call, and on "compare Taipei and Tokyo" all four correctly fired two tool_calls:
[
  {"name": "get_weather", "arguments": {"city": "Taipei"}},
  {"name": "get_weather", "arguments": {"city": "Tokyo"}}
]
  • JSON output — all valid, and all correctly typed Year as a number, not a string.

Here's the catch, and it's the same trap I hit reconstructing a tool-call pipeline last week: just passing a task is a low bar. Every quant "passed" tool use because the prompt was a trivial single-city lookup — that measures "does it emit valid JSON," not "is it right under pressure." The only task that actually separated the four was code generation. Asked for a palindrome checker, three of them wrote the textbook one-liner:

return s == s[::-1]

Only IQ3_M normalized the input first:

cleaned_s = ''.join(char.lower() for char in s if char.isalnum())
return cleaned_s == cleaned_s[::-1]

That's the version that handles "A man, a plan, a canal: Panama" — the edge case the others silently failed. That's the imatrix doing its job: it preserves precision on the weights that matter most to reasoning, so the model still catches the harder case. Worth noting, but don't read too much into it — this was one prompt, one run. It's a signal, not a verdict.

Takeaways

Where the time went

The 262K vocabulary is the culprit. It makes the file bigger, and it makes startup much slower. llama-cli reloads that tokenizer on every single query — about three minutes each time. The fix is to run llama-server once and hit it over the OpenAI-compatible API. The big vocab that makes a 2B model weigh 3GB is the same thing that makes a careless setup feel broken.

Reusable diagnostics

"Bigger file is faster" is a diagnostic, not a curiosity. It means your bottleneck is dequant compute, not memory bandwidth, which means you're on old or weak-integer hardware. The rule of thumb that falls out: a tensor-core card wants the smallest quant that fits (bandwidth-bound); a no-tensor-core card wants the simplest dequant, which is Q4_0 (compute-bound). Same model, opposite choice.

The general principle

There is no universal quant ranking. Whether Q2_K or Q4_0 wins depends on the GPU — the answer reverses between a GTX 970 and a GB10. Benchmark on the hardware you actually have; the theoretical numbers will mislead you with a perfectly clean table.

What to actually run

  • No-tensor-core card (GTX 9xx/10xx class): QAT Q4_0, or any Q4_0, for speed.
  • Need the quality edge on hard, logic-heavy tasks: IQ3_M — the imatrix earns it.
  • Always llama-server, never llama-cli, with that 262K vocab.
  • Build llama.cpp with -DCMAKE_CUDA_ARCHITECTURES=52 for a 970 (commit 42a0afd59, June 2026).

The same week I ran this, I had a 122B model crawling on a 128GB GB10. One end of the rack: a frontier-scale model squeezed into a small box. The other: a 5.1B model on a 12-year-old card with 3.5GB. Same game on both ends — the hardware decides, and the only way to know is to measure on it.

The GPU in that box is worth about a sandwich on the used market, and it runs a 2026 model, uses tools, and writes code. That's the whole appeal: you don't need a 5090 to get in the door — sometimes a junk card will do.


Appendix: WSL2 notes

The 970 sits in a Windows 11 box (Ryzen 5 3500X, 16GB) that I access through WSL2. A few sharp edges:

  • nvidia-smi isn't on PATH inside WSL2 — it's at /usr/lib/wsl/lib/nvidia-smi.
  • GPU utilization reads 0% in WSL2 even mid-inference. Don't trust it; trust tok/s.
  • The 3500X has no integrated graphics, so unplugging the HDMI or closing Remote Desktop makes Windows think the box is headless and can drop the GPU. Leave a display attached.
  • WSL2 build tip: -j4, not -j32 — the default RAM cap will OOM the compile otherwise. Set memory=8GB in .wslconfig.

FAQ

Can a GTX 970 run Gemma 4 E2B?
Yes. All four quants I tested — Q2_K, Q3_K_M, IQ3_M, and QAT Q4_0 — fit in the GTX 970's ~3.5GB usable VRAM and run fully GPU-offloaded through llama.cpp, at roughly 29–48 tok/s.
Why is the larger QAT Q4_0 file faster than the smaller Q2_K on a GTX 970?
The GTX 970 has no tensor cores and weak integer throughput, so dequantization compute dominates the per-token cost, not memory bandwidth. Q4_0's flat 4-bit blocks dequant far cheaper than K-quant sub-blocks or I-quant codebook lookups, so the bigger Q4_0 file still wins.
Does QAT (quantization-aware training) make inference faster?
No. QAT only changes the weight values to recover quality; it does not change the inference kernel. The speed win here belongs to the Q4_0 format, not QAT. A plain post-training Q4_0 would run at essentially the same speed.