Gemma 4 on a GTX 970 — Series

❯ ls ~/blog/series/gemma-4-on-a-gtx-970

4 posts

part  date        read  title
1     2026-06-09  10m   [Just for Fun] Gemma 4 E2B on a GTX 970: the biggest quant runs fastest (47.6 tok/s)
Four Gemma 4 E2B quants on a 2014 GTX 970. The bigger 3.2GB QAT Q4_0 beats the 2.9GB Q2_K — 47.6 vs 32.8 tok/s — because a tensor-core-less Maxwell card is dequant-bound, not bandwidth-bound.
2     2026-06-09  10m   [Just for Fun] A GTX 970 as an offline voice assistant: Gemma 4 E2B + Piper TTS (2.8s end-to-end)
A 2014 GTX 970 running Gemma 4 E2B (vision + audio) plus Piper TTS — a full offline voice assistant that sees, listens, talks back, and writes code. ~2.8s end-to-end, ~$15 of hardware.
3     2026-06-14  10m   [Just for Fun] On a GTX 970, Flash Attention nearly doubles long-context decode (24.3 → 42.5 tok/s)
On a tensor-core-less Maxwell GTX 970 running Gemma 4 E2B, Flash Attention nearly doubles long-context decode (24.3 → 42.5 tok/s) and saves ~430MB VRAM — while q8 KV cache barely saves memory and slows decode. The usual KV-cache advice flips.
4     2026-06-14  14m   [Just for Fun] A blog RAG support bot on a GTX 970: no torch, no vector DB, no LangChain
A retrieval-augmented support bot for my blog, running on a 2014 GTX 970 with a ~600MB embedding model. It uses llama.cpp embeddings on CPU, numpy brute-force cosine over 3,475 chunks, an embedding-score guardrail, and Cloudflare Tunnel.
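
For a rough idea of what "numpy brute-force cosine" plus an "embedding-score guardrail" can look like, here is a minimal sketch. It is not the code from the post: the function names `normalize` and `retrieve`, the `min_score` threshold of 0.5, the 384-dimension vectors, and the random stand-in data are all assumptions made for illustration. In a real setup the vectors would come from the embedding model rather than a random generator.

```python
import numpy as np


def normalize(m):
    """Scale vectors to unit length along the last axis (zero-safe)."""
    norms = np.linalg.norm(m, axis=-1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)


def retrieve(query_vec, chunk_matrix, top_k=3, min_score=0.5):
    """Brute-force cosine search with a score guardrail.

    Returns up to top_k (chunk_index, score) pairs, best first.
    Chunks scoring below min_score are dropped, so an empty list
    means "nothing relevant enough, do not answer from the blog".
    min_score=0.5 is a hypothetical value, not one from the post.
    """
    q = normalize(np.asarray(query_vec, dtype=np.float32))
    c = normalize(np.asarray(chunk_matrix, dtype=np.float32))
    scores = c @ q  # cosine similarity of the query against every chunk
    k = min(top_k, scores.shape[0])
    if k == 0:
        return []
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top if scores[i] >= min_score]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for 3,475 chunk embeddings (dimension 384 is an assumption).
    chunks = rng.standard_normal((3475, 384)).astype(np.float32)

    # A query close to chunk 42 should pass the guardrail.
    on_topic = chunks[42] + 0.1 * rng.standard_normal(384).astype(np.float32)
    print(retrieve(on_topic, chunks))

    # An unrelated query should score low everywhere and be refused.
    off_topic = rng.standard_normal(384).astype(np.float32)
    print(retrieve(off_topic, chunks))
```

At a few thousand chunks the whole search is one matrix-vector product, which is why a vector database is optional at this scale. The threshold would need tuning against real questions, since useful cosine scores vary a lot between embedding models.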