DGX Spark · part 14
[Benchmark] Gemma 4 Complete Guide on DGX Spark — Which Model Should You Pick?
❯ cat --toc
- Plain-Language Version: Four AI Models, One Table
- The Complete Comparison
- DGX Spark (GB10) — 128 GB, 273 GB/s
- RTX 5090 — 32 GB GDDR7, 1792 GB/s
- MacBook Pro M1 Max — 32 GB, 400 GB/s
- How to Choose
- You have a DGX Spark
- You have an RTX 5090
- You have a MacBook Pro 32GB
- Why Not Dense?
- Quantization Format Guide
- What Was Gained
- Deep Dive Links
- FAQ
TL;DR
Complete benchmark of all four Gemma 4 variants across three machines. 26B MoE NVFP4 on DGX Spark: 52 tok/s is the best bang for buck. 31B Dense is only 7 tok/s — skip it. MacBook Pro 32GB maxes out at 26B MoE (47 tok/s). RTX 5090 is the only hardware where 31B Dense is the right pick (62 tok/s — fast enough to keep full intelligence).
Plain-Language Version: Four AI Models, One Table
Gemma 4 is Google's 2026 open-source AI model family. It comes in four sizes: E2B for phones, E4B for desktops, 26B-A4B for servers, and 31B for maximum capability. The docs don't tell you how fast each one actually runs on your hardware.
This article puts all four variants on three machines into one table so you can decide without reading six separate articles.
The Complete Comparison
DGX Spark (GB10) — 128 GB, 273 GB/s
| Model | Type | Active Params | Quant | Runtime | tok/s | Model Size | Rating |
|---|---|---|---|---|---|---|---|
| 26B-A4B | MoE | 4B | NVFP4 | vLLM | 52 | 16.5 GB | ⭐⭐⭐⭐⭐ |
| E4B | PLE | 4B | NVFP4 | vLLM | 50 | 9.8 GB | ⭐⭐⭐⭐ |
| E4B | PLE | 4B | FP8 | vLLM | 36 | ~14 GB | ⭐⭐⭐ |
| E2B | PLE | 2B | Q4_K_M | Ollama | 53 | 7.2 GB | ⭐⭐⭐ |
| E4B | PLE | 4B | BF16 | vLLM | 19 | ~18 GB | ⭐⭐ |
| 31B Dense | Dense | 31B | NVFP4 | vLLM | 7 | 31 GB | ❌ |
RTX 5090 — 32 GB GDDR7, 1792 GB/s
Rated by capability first — speed is a threshold, not a ranking. All four models clear the usability floor (~20 tok/s) on this hardware.
| Model | Quant | Runtime | tok/s | Active Params | Rating |
|---|---|---|---|---|---|
| 31B Dense | Q4_K_M | Ollama | 62 | 31B | ⭐⭐⭐⭐⭐ |
| 26B MoE | Q4_K_M | Ollama | 186 | 4B | ⭐⭐⭐⭐⭐ |
| E4B | Q4_K_M | Ollama | 202 | 4B | ⭐⭐⭐⭐ |
| E2B | Q4_K_M | Ollama | 310 | 2B | ⭐⭐⭐ |
MacBook Pro M1 Max — 32 GB, 400 GB/s
| Model | Quant | Runtime | tok/s | Notes |
|---|---|---|---|---|
| E2B | Q4_K_M | Ollama | 81 | Fastest |
| 26B MoE | Q4_K_M | Ollama | 47 | Largest usable |
| 31B Dense | Q4_K_M | oMLX | 12.8 | Requires oMLX |
| 31B Dense | Q4_K_M | Ollama (ctx=2048) | 9 | Context cap required |
| 31B Dense | Q4_K_M | Ollama (default) | 1.5 | ❌ Swap death |
How to Choose
You have a DGX Spark
→ 26B-A4B NVFP4 + vLLM. 52 tok/s, model uses only 16 GB, leaving 82 GB for KV cache. Best capability + speed combination.
Full deployment guide: 26B NVFP4 Complete Guide
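If it helps, here is a minimal launch sketch for this pick. The checkpoint tag is a hypothetical placeholder (substitute whatever the NVFP4 build is actually published as); `--moe-backend marlin` comes from the quantization format notes later in this article.

```shell
# google/gemma-4-26b-a4b-nvfp4 is a placeholder tag, not a confirmed repo name.
# --moe-backend marlin is the flag this series found necessary for NVFP4 MoE.
vllm serve google/gemma-4-26b-a4b-nvfp4 \
  --moe-backend marlin \
  --max-model-len 32768 \
  --port 8000
```

With the model at 16.5 GB, there is plenty of headroom to raise `--max-model-len` for long-context work; the 82 GB of free memory goes to KV cache.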
You have an RTX 5090
→ Default pick: 31B Dense (62 tok/s). On 5090's 1792 GB/s bus, the Dense model that's unusable elsewhere becomes the smartest option at a perfectly comfortable speed. No reason to sacrifice intelligence when the hardware can handle it.
→ Need more speed: 26B MoE (186 tok/s) — still very capable, 3x faster. Good for high-throughput agent workloads where latency matters more than reasoning depth.
→ Need raw throughput: E2B (310 tok/s) — least capable but fastest. Only pick this for edge-like workloads or batch processing where speed dominates.
You have a MacBook Pro 32GB
→ 26B MoE (47 tok/s) is the largest usable variant. For 31B Dense, do not use Ollama defaults: it drops to 1.5 tok/s.
If you want to try 31B anyway: Rescuing 31B on 32GB MBP
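The ctx=2048 row in the table above can be reproduced with an Ollama Modelfile. `gemma4:31b` is a placeholder tag for whatever the 31B Q4_K_M build is published as.

```shell
# Cap the context window so the KV cache stays inside 32 GB unified memory.
# gemma4:31b is a hypothetical tag — substitute the real Q4_K_M build name.
cat > Modelfile <<'EOF'
FROM gemma4:31b
PARAMETER num_ctx 2048
EOF
ollama create gemma4-31b-ctx2048 -f Modelfile
ollama run gemma4-31b-ctx2048
```

This trades context length for survival: 2048 tokens is enough for short Q&A, not for long documents or agent loops.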
Why Not Dense?
31B Dense on GB10: 7 tok/s. 26B MoE: 52 tok/s. Same hardware, 7.5x difference.
Dense models read all 31B parameters per token (62 GB at BF16). MoE activates only 4B per token (8 GB). On a 273 GB/s memory bus, that difference is the ceiling.
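The ceiling is simple division: decode speed cannot exceed memory bandwidth over bytes read per token. A quick sanity check of the numbers above (figures from this article; real decode speed lands somewhat below the theoretical ceiling due to overhead):

```python
def ceiling_tok_s(bandwidth_gb_s: float, bytes_per_token_gb: float) -> float:
    """Upper bound on decode speed: every token must stream the active weights."""
    return bandwidth_gb_s / bytes_per_token_gb

# 31B Dense, BF16 (2 bytes/param): 62 GB per token on GB10's 273 GB/s bus
print(ceiling_tok_s(273, 62))   # ~4.4 tok/s, unusable even in theory
# 31B Dense, NVFP4 (~31 GB checkpoint per the table above): still bandwidth-capped
print(ceiling_tok_s(273, 31))   # ~8.8 tok/s; the measured 7 tok/s sits just under this
# 26B MoE, ~2 GB of active expert weights per token: ceiling far above measured 52
print(ceiling_tok_s(273, 2))    # ~136 tok/s; compute, not bandwidth, is the limit
```

The same arithmetic explains the RTX 5090 result: at 1792 GB/s the dense ceiling rises roughly 6.5x, which is why 62 tok/s becomes achievable.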
If you have an RTX 5090 (1792 GB/s), 31B Dense at 62 tok/s is the top recommendation — smart enough to justify the speed, fast enough to be comfortable. On anything else, choose MoE.
Detailed math: 31B Dense Bandwidth Wall
Quantization Format Guide
| Format | Best For | Notes |
|---|---|---|
| NVFP4 | DGX Spark + vLLM | Fastest. E4B: 2.6x speedup. Needs --moe-backend marlin |
| FP8 | DGX Spark + vLLM | Middle ground. Slower than NVFP4 |
| Q4_K_M | Any + Ollama | Universal. Works everywhere |
| BF16 | Large VRAM only | Lossless but slowest |
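The model sizes in these tables follow from bits per parameter. A rough estimator (overhead such as embeddings, mixed-precision K-quant layers, and runtime buffers pushes real files higher):

```python
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB: parameters (billions) * bits / 8."""
    return params_b * bits_per_param / 8

print(weight_gb(31, 16))   # BF16: 62 GB, the per-token read cost of 31B Dense
print(weight_gb(31, 4.5))  # Q4_K_M averages roughly 4.5 bits/param: ~17 GB
print(weight_gb(4, 4))     # 4B active params at NVFP4: ~2 GB streamed per token
```

On bandwidth-limited hardware, shrinking bits per parameter buys speed almost linearly, which is why NVFP4 beats FP8 beats BF16 in the E4B column above.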
Full E4B NVFP4 quantization walkthrough: World-First E4B NVFP4
What Was Gained
What cost the most time: the lack of a unified comparison table, which forced re-reading individual articles for every decision. This article exists to fix that.
Transferable diagnostics: The model selection decision tree is always: Does it fit in memory → Is bandwidth sufficient → Is there a MoE variant at this size → Is there an NVFP4 checkpoint.
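That decision tree, written out as a straight-line function. The thresholds are this article's working numbers, not universal constants:

```python
def pick_model(mem_gb: float, bandwidth_gb_s: float,
               has_moe: bool, has_nvfp4: bool) -> str:
    """Model selection order used in this series (thresholds approximate)."""
    if mem_gb < 16:
        return "E2B Q4_K_M"              # fit in memory comes before everything
    if bandwidth_gb_s >= 1000 and mem_gb >= 24:
        return "31B Dense Q4_K_M"        # bandwidth sufficient: keep full intelligence
    if has_moe:                          # is there a MoE variant at this size?
        return "26B-A4B NVFP4" if has_nvfp4 else "26B-A4B Q4_K_M"
    return "E4B NVFP4" if has_nvfp4 else "E4B Q4_K_M"

print(pick_model(128, 273, True, True))   # DGX Spark  -> 26B-A4B NVFP4
print(pick_model(32, 1792, True, False))  # RTX 5090   -> 31B Dense Q4_K_M
print(pick_model(32, 400, True, False))   # M1 Max     -> 26B-A4B Q4_K_M
```

All three recommendations in the "How to Choose" section fall out of these four checks in order.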
The pattern that applies everywhere: On bandwidth-limited hardware, always choose a sparse-activation architecture (MoE, or PLE-style dense models like E2B/E4B) over a fully dense model. Total parameter count does not matter — the parameters read per token determine speed.
Deep Dive Links
Every data point has a full article behind it:
| Topic | Article |
|---|---|
| 26B NVFP4 deployment + 52 tok/s | Full guide |
| Why 31B Dense is so slow | Bandwidth wall |
| 31B on 32GB MacBook Pro | From 1.5 to 12.8 tok/s |
| 4 machines full comparison | Memory decides everything |
| E2B vs E4B on 3 machines | Bandwidth = speed |
| E4B NVFP4 quantization | From 19 to 50 tok/s |
| DGX Spark power diagnostic | 30W / 100W / overheating guide |
FAQ
- How many Gemma 4 variants are there and what is the difference?
- Four: E2B (2.3B effective, phone/edge), E4B (4.5B effective, desktop), 26B-A4B (3.8B active, server), 31B (30.7B params, largest but slowest). E2B and E4B are dense models with PLE (Per-Layer Embedding) that reduces effective compute. Only 26B-A4B is genuine MoE (128 experts, top-8 routing). 31B is plain dense — reads all parameters every token.
- Which Gemma 4 variant runs fastest on DGX Spark?
- 26B-A4B MoE NVFP4, 52 tok/s (vLLM). E4B NVFP4 is close at 49.9 tok/s. 31B Dense is only 7 tok/s — not recommended. E2B on Ollama gets 53 tok/s but has weaker capabilities.
- Can a MacBook Pro 32GB run Gemma 4?
- Up to 26B MoE (47 tok/s). 31B Dense will drop to 1.5 tok/s because KV cache exceeds memory and triggers swap. Reducing context window to 2048 gets 9 tok/s, oMLX gets 12.8 tok/s. Recommend E4B or 26B MoE.
- Which quantization format is best?
- On DGX Spark with vLLM: NVFP4 is fastest (E4B goes from 19 to 50 tok/s, 2.6x improvement). On Mac/consumer GPUs with Ollama: Q4_K_M is the only option. FP8 is a middle ground (E4B 36 tok/s) but not worth it over NVFP4.
- Is the DGX Spark worth $4,699 in 2026?
- For Gemma 4 specifically: 26B MoE NVFP4 at 52 tok/s using only 16 GB of the 128 GB available — yes, if you plan to run multiple models or 100B+ models. For a single model under 32 GB, an RTX 5090 runs the same 26B MoE at 186 tok/s (3.6x faster). The DGX Spark wins on capacity and the ability to run models that don't fit anywhere else.
- DGX Spark vs RTX 5090 for Gemma 4 — which is better?
- RTX 5090 is faster on every Gemma 4 variant (E2B 310 vs 53, 26B MoE 186 vs 52 tok/s). But RTX 5090 can't run models over 32 GB. On RTX 5090, 31B Dense at 62 tok/s is the smartest comfortable choice. On DGX Spark, 26B MoE NVFP4 at 52 tok/s is the best balance because 31B Dense hits only 7 tok/s (bandwidth-limited).