#kv-cache — Blog — ai-muninn

❯ grep -r "#kv-cache" ~/blog

11 matches

date  read  title
2026-07-07  12m
[Local LLM] Why a 284B fits a 128GB GB10 at long context: DeepSeek-V4-Flash attacks the KV cache, not the parameter count
#dgx-spark #gb10 #deepseek-v4-flash #kv-cache
2026-07-05  12m
[Local LLM] My 284B agent quietly stopped reusing its KV cache — the ds4 evict storm that re-paid prefill every turn
#dgx-spark #gb10 #deepseek-v4-flash #kv-cache
2026-07-03  14m
[Just for Fun — Advanced] I Doubled My Agent's Decode Speed and It Got Slower: TTFT Is the Number You Actually Feel
#local-llm #ai-agent #ttft #llama.cpp
2026-06-26  11m
[Just for Fun — Advanced] llama.cpp won't persist KV cache to disk — so I put a 60-line proxy in front of it (7× faster restore)
#local-llm #llama.cpp #kv-cache #ttft
2026-06-25  14m
[Just for Fun — Advanced] Quantizing the Draft Cache Backfired — A Counterintuitive Look at Qwen MTP (f16 ran 34% faster than q4)
#mtp #speculative-decoding #local-llm #qwen3
2026-06-24  16m
[Just for Fun — Advanced] I Maxed Context to 256K, It Loaded Fine — Then Crashed in Real Use: A VRAM Detective Story on a 22GB Frankencard
#local-llm #llama.cpp #qwen3 #vram
2026-06-14  10m
[Just for Fun] On a GTX 970, Flash Attention nearly doubles long-context decode (24.3 → 42.5 tok/s)
#gemma-4 #gtx-970 #flash-attention #kv-cache
2026-06-12  14m
[Local LLM] Running a 15 tok/s 284B as your daily agent brain — the settings that make it bearable
#dgx-spark #gb10 #deepseek-v4-flash #kv-cache
2026-04-08  8m
[Benchmark] Rescuing Gemma 4 31B on a 32GB MacBook Pro: From 1.5 to 12.8 tok/s
#gemma-4 #31b #m1-max #ollama
2026-03-30  8m
[Benchmark] TurboQuant on GX10: Is 3-bit KV Cache Compression Actually Lossless?
#turboquant #kv-cache #quantization #vllm
2026-03-21  6m
[vLLM] FP8 KV Cache on GB10: Why Outputs Collapse into Repetition Loops
#vllm #fp8 #kv-cache #gb10