~/blog/dgx-spark-gemma4-complete-guide

DGX Spark · part 14

[Benchmark] Gemma 4 Complete Guide on DGX Spark — Which Model Should You Pick?

cat --toc

TL;DR

Complete benchmark of all four Gemma 4 variants across three machines. 26B MoE NVFP4 on DGX Spark: 52 tok/s is the best bang for buck. 31B Dense is only 7 tok/s — skip it. MacBook Pro 32GB maxes out at 26B MoE (47 tok/s). RTX 5090 is the only hardware where 31B Dense is the right pick (62 tok/s — fast enough to keep full intelligence).

Plain-Language Version: Four AI Models, One Table

Gemma 4 is Google's 2026 open-source AI model family. It comes in four sizes: E2B runs on phones, E4B on desktops, 26B on servers, 31B for maximum capability. The docs don't tell you how fast each one actually runs on your hardware.

This article puts all four variants on three machines into one table so you can decide without reading six separate articles.


The Complete Comparison

DGX Spark (GB10) — 128 GB, 273 GB/s

| Model | Type | Active Params | Quant | Runtime | tok/s | Model Size | Rating |
|---|---|---|---|---|---|---|---|
| 26B-A4B | MoE | 4B | NVFP4 | vLLM | 52 | 16.5 GB | ⭐⭐⭐⭐⭐ |
| E4B | PLE | 4B | NVFP4 | vLLM | 50 | 9.8 GB | ⭐⭐⭐⭐ |
| E4B | PLE | 4B | FP8 | vLLM | 36 | ~14 GB | ⭐⭐⭐ |
| E2B | PLE | 2B | Q4_K_M | Ollama | 53 | 7.2 GB | ⭐⭐⭐ |
| E4B | PLE | 4B | BF16 | vLLM | 19 | ~18 GB | ⭐⭐ |
| 31B Dense | Dense | 31B | NVFP4 | vLLM | 7 | 31 GB | ❌ skip |

RTX 5090 — 32 GB GDDR7, 1792 GB/s

Rated by capability first — speed is a threshold, not a ranking. All four models clear the usability floor (~20 tok/s) on this hardware.

| Model | Quant | Runtime | tok/s | Active Params | Rating |
|---|---|---|---|---|---|
| 31B Dense | Q4_K_M | Ollama | 62 | 31B | ⭐⭐⭐⭐⭐ |
| 26B MoE | Q4_K_M | Ollama | 186 | 4B | ⭐⭐⭐⭐⭐ |
| E4B | Q4_K_M | Ollama | 202 | 4B | ⭐⭐⭐⭐ |
| E2B | Q4_K_M | Ollama | 310 | 2B | ⭐⭐⭐ |

MacBook Pro M1 Max — 32 GB, 400 GB/s

| Model | Quant | Runtime | tok/s | Notes |
|---|---|---|---|---|
| E2B | Q4_K_M | Ollama | 81 | Fastest |
| 26B MoE | Q4_K_M | Ollama | 47 | Largest usable |
| 31B Dense | Q4_K_M | oMLX | 12.8 | Requires oMLX |
| 31B Dense | Q4_K_M | Ollama (ctx=2048) | 9 | Context cap required |
| 31B Dense | Q4_K_M | Ollama (default) | 1.5 | ❌ Swap death |

How to Choose

You have a DGX Spark

→ Default pick: 26B-A4B NVFP4 + vLLM. 52 tok/s, and the weights occupy only 16.5 GB, leaving 82 GB for KV cache. Best capability + speed combination.

Full deployment guide: 26B NVFP4 Complete Guide
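Launching this looks roughly like the fragment below. The checkpoint path is a placeholder (use whichever 26B-A4B NVFP4 checkpoint you actually downloaded); `--moe-backend marlin` comes from the quantization table in this article, while the remaining flags are common vLLM options shown with illustrative values, not benchmarked settings:

```shell
# Placeholder checkpoint path — substitute your actual NVFP4 checkpoint.
vllm serve ./gemma4-26b-a4b-nvfp4 \
  --moe-backend marlin \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```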

You have an RTX 5090

→ Default pick: 31B Dense (62 tok/s). On 5090's 1792 GB/s bus, the Dense model that's unusable elsewhere becomes the smartest option at a perfectly comfortable speed. No reason to sacrifice intelligence when the hardware can handle it.

→ Need more speed: 26B MoE (186 tok/s) — still very capable, 3x faster. Good for high-throughput agent workloads where latency matters more than reasoning depth.

→ Need raw throughput: E2B (310 tok/s) — least capable but fastest. Only pick this for edge-like workloads or batch processing where speed dominates.

You have a MacBook Pro 32GB

26B MoE (47 tok/s) is the largest usable variant. For 31B Dense, do not use Ollama defaults: it collapses to 1.5 tok/s.

If you want to try 31B anyway: Rescuing 31B on 32GB MBP
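The ctx=2048 cap from the table can be baked in once via an Ollama Modelfile instead of being set per session. A sketch, assuming the 31B import exists locally under a tag like `gemma4:31b` (a placeholder name, substitute your own):

```shell
# Build a context-capped variant once, then run it like any other model.
# "gemma4:31b" is a placeholder tag — use your local model's actual name.
cat > Modelfile <<'EOF'
FROM gemma4:31b
PARAMETER num_ctx 2048
EOF
ollama create gemma4-31b-ctx2048 -f Modelfile
ollama run gemma4-31b-ctx2048
```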


Why Not Dense?

31B Dense on GB10: 7 tok/s. 26B MoE: 52 tok/s. Same hardware, 7.5x difference.

Dense models read all 31B parameters per token (62 GB at BF16). MoE activates only 4B per token (8 GB). On a 273 GB/s memory bus, that difference is the ceiling.
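That ceiling is simple division: memory bandwidth over bytes streamed per token. A minimal sketch of the arithmetic (it ignores KV-cache reads and any compute overlap, so measured numbers land below these bounds):

```python
# Decode-speed ceiling: every generated token must stream the active
# weights through the memory bus at least once.
def ceiling_tok_s(active_params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float) -> float:
    """Upper bound on tok/s = bandwidth / bytes read per token."""
    bytes_per_token_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / bytes_per_token_gb

# DGX Spark GB10: 273 GB/s bus
dense_31b_bf16 = ceiling_tok_s(31, 2.0, 273)  # 31B params * 2 B = 62 GB/token
moe_4b_bf16    = ceiling_tok_s(4, 2.0, 273)   # 4B active * 2 B = 8 GB/token
print(f"31B Dense BF16 ceiling: {dense_31b_bf16:.1f} tok/s")  # ~4.4
print(f"26B MoE BF16 ceiling:   {moe_4b_bf16:.1f} tok/s")     # ~34.1
```

NVFP4 roughly quarters the bytes per parameter, which is how the MoE variant reaches the measured 52 tok/s while the dense model stays stuck in single digits.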

If you have an RTX 5090 (1792 GB/s), 31B Dense at 62 tok/s is the top recommendation — smart enough to justify the speed, fast enough to be comfortable. On anything else, choose MoE.

Detailed math: 31B Dense Bandwidth Wall


Quantization Format Guide

| Format | Best For | Notes |
|---|---|---|
| NVFP4 | DGX Spark + vLLM | Fastest. E4B: 2.6x speedup. Needs `--moe-backend marlin` |
| FP8 | DGX Spark + vLLM | Middle ground. Slower than NVFP4 |
| Q4_K_M | Any + Ollama | Universal. Works everywhere |
| BF16 | Large VRAM only | Lossless but slowest |

Full E4B NVFP4 quantization walkthrough: World-First E4B NVFP4


What Was Gained

What cost the most time: Not having a unified comparison table, forcing re-reading of individual articles for every decision. This article exists to fix that.

Transferable diagnostics: The model selection decision tree is always: Does it fit in memory? → Is bandwidth sufficient? → Is there a MoE variant at this size? → Is there an NVFP4 checkpoint?

The pattern that applies everywhere: On bandwidth-limited hardware, always choose a sparse-activation architecture (MoE, or PLE-style dense models like E2B/E4B) over a fully dense model. Total parameter count sets capability and memory footprint; the parameters read per token determine speed.
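The checklist can be written down as a toy function. The thresholds are rough numbers pulled from this article's benchmarks (a ~20 tok/s usability floor, ~17 GB streamed per token for 31B at 4-bit, a guessed ~12 GB OS/app tax on unified memory), not general constants:

```python
def pick_gemma4(mem_gb: float, bandwidth_gb_s: float, unified: bool = False) -> str:
    """Toy version of the selection checklist: fit -> bandwidth -> MoE."""
    DENSE_READ_GB = 17   # ~31B params at 4-bit, read once per token (rough)
    FLOOR_TOK_S = 20     # usability floor used throughout this article
    budget = mem_gb - 12 if unified else mem_gb  # OS/apps eat unified memory
    if budget < 8:
        return "E2B"        # only the phone-class model fits
    if budget >= 24 and bandwidth_gb_s / DENSE_READ_GB >= FLOOR_TOK_S:
        return "31B Dense"  # fast enough to keep full intelligence
    if budget >= 20:
        return "26B MoE"    # sparse activation beats the bandwidth wall
    return "E4B"

print(pick_gemma4(128, 273, unified=True))  # DGX Spark      -> 26B MoE
print(pick_gemma4(32, 1792))                # RTX 5090       -> 31B Dense
print(pick_gemma4(32, 400, unified=True))   # MacBook Pro M1 -> 26B MoE
```

The three calls reproduce the article's three recommendations, which is all a sketch like this is meant to show.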


Every data point has a full article behind it:

| Topic | Article |
|---|---|
| 26B NVFP4 deployment + 52 tok/s | Full guide |
| Why 31B Dense is so slow | Bandwidth wall |
| 31B on 32GB MacBook Pro | From 1.5 to 12.8 tok/s |
| 4 machines full comparison | Memory decides everything |
| E2B vs E4B on 3 machines | Bandwidth = speed |
| E4B NVFP4 quantization | From 19 to 50 tok/s |
| DGX Spark power diagnostic | 30W / 100W / overheating guide |

FAQ

How many Gemma 4 variants are there and what is the difference?
Four: E2B (2.3B effective, phone/edge), E4B (4.5B effective, desktop), 26B-A4B (3.8B active, server), 31B (30.7B params, largest but slowest). E2B and E4B are dense models with PLE (Per-Layer Embedding) that reduces effective compute. Only 26B-A4B is genuine MoE (128 experts, top-8 routing). 31B is plain dense — reads all parameters every token.
Which Gemma 4 variant runs fastest on DGX Spark?
26B-A4B MoE NVFP4, 52 tok/s (vLLM). E4B NVFP4 is close at 49.9 tok/s. 31B Dense is only 7 tok/s — not recommended. E2B on Ollama gets 53 tok/s but has weaker capabilities.
Can a MacBook Pro 32GB run Gemma 4?
Up to 26B MoE (47 tok/s). 31B Dense will drop to 1.5 tok/s because KV cache exceeds memory and triggers swap. Reducing context window to 2048 gets 9 tok/s, oMLX gets 12.8 tok/s. Recommend E4B or 26B MoE.
Which quantization format is best?
On DGX Spark with vLLM: NVFP4 is fastest (E4B goes from 19 to 50 tok/s, 2.6x improvement). On Mac/consumer GPUs with Ollama: Q4_K_M is the only option. FP8 is a middle ground (E4B 36 tok/s) but not worth it over NVFP4.
Is the DGX Spark worth $4,699 in 2026?
For Gemma 4 specifically: 26B MoE NVFP4 at 52 tok/s using only 16 GB of the 128 GB available — yes, if you plan to run multiple models or 100B+ models. For a single model under 32 GB, an RTX 5090 runs the same 26B MoE at 186 tok/s (3.6x faster). The DGX Spark wins on capacity and the ability to run models that don't fit anywhere else.
DGX Spark vs RTX 5090 for Gemma 4 — which is better?
RTX 5090 is faster on every Gemma 4 variant (E2B 310 vs 53, 26B MoE 186 vs 52 tok/s). But RTX 5090 can't run models over 32 GB. On RTX 5090, 31B Dense at 62 tok/s is the smartest comfortable choice. On DGX Spark, 26B MoE NVFP4 at 52 tok/s is the best balance because 31B Dense hits only 7 tok/s (bandwidth-limited).