
DGX Spark · part 3

[vLLM] Nemotron-3-Super-120B on a Single GB10: Full Day Debug Log

2026-03-13 · 8 min read · #dgx-spark #gb10 #sm121 #nemotron

Preface

Same factory, different assembly line. GB10 and B200 are both Blackwell. They share a manufacturing process, a marketing family, and almost no kernel-level compatibility. Getting one model working on SM121 does not mean the next one works the same way — or at all.

This is the record of getting Nemotron-3-Super-120B-NVFP4 running on an ASUS GX10 (NVIDIA GB10, SM121, 128GB unified memory). One full day of debugging. Five pitfalls. A working docker command at the end.

The SM121 fundamentals are covered in Part 1. That post explains why SM121 and SM100 diverge at the kernel level and what the symptoms look like. This post picks up from there and covers what Nemotron specifically adds on top of the SM121 baseline issues.


The Model

Nemotron-3-Super-120B-NVFP4 is NVIDIA's reasoning model — 120B parameters, quantized in NVFP4 format, distributed as 17 shards. It's designed for extended context tasks with an emphasis on structured reasoning.

At rest: ~108GB. This monopolizes the full 128GB system. No other model runs alongside it. The first step before starting is always unloading whatever Ollama has in memory:

curl -X POST http://localhost:11434/api/generate \
  -d '{"model": "MODEL_NAME", "keep_alive": 0}'

Verify with curl -s http://localhost:11434/api/ps before proceeding. Ollama's default KEEP_ALIVE is 2 hours. If you ran a model recently, it's still there.
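The two curl calls can be folded into one sketch that evicts everything `/api/ps` reports as loaded. This assumes `jq` is available; the `unload_payload` helper is illustrative, not part of Ollama's API:

```shell
#!/bin/sh
# Build the unload payload for one model (keep_alive: 0 evicts it immediately).
unload_payload() {
  printf '{"model": "%s", "keep_alive": 0}' "$1"
}

# List everything Ollama has resident, then unload each model in turn.
loaded=$(curl -s http://localhost:11434/api/ps | jq -r '.models[].name' 2>/dev/null) || true
for m in $loaded; do
  echo "unloading $m"
  curl -s -X POST http://localhost:11434/api/generate -d "$(unload_payload "$m")" > /dev/null || true
done
```

Re-run the `/api/ps` check afterwards; it should report an empty model list before you start the 108GB load.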

Load time: ~8 minutes (17 shards). This is not a model you restart casually.


The Core Problem: SM121 ≠ SM100

GB10's compute capability is SM121. The B200's is SM100. Both are Blackwell. They are not ISA-compatible.

Most optimized CUDA kernels target SM100 and are written with the assumption that all Blackwell chips use the same instruction set. They don't. Running SM100-targeted kernels on SM121 triggers cudaErrorIllegalInstruction. The CPU equivalent would be running AVX-512 instructions on a processor that implements AVX2.
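Before blaming a kernel, it's worth confirming which chip you're actually on. A quick sketch, assuming an `nvidia-smi` recent enough to expose the `compute_cap` query field:

```shell
#!/bin/sh
# Map an nvidia-smi compute_cap string like "12.1" to an SM tag like "sm121".
cap_to_sm() {
  echo "sm$(echo "$1" | tr -d '. ')"
}

# On the host: print the GPU's real compute capability.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
  echo "compute capability: $cap ($(cap_to_sm "$cap"))"
fi
```

On a GB10 this should print `12.1 (sm121)`; a B200 reports `10.0 (sm100)`. Same family name, different ISA target.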

For Nemotron — an NVFP4 MoE model — the failure modes are specific: the MoE routing kernels (which invoke FLASHINFER_CUTLASS) and torch.compile in the nightly image both fail on SM121, in ways that aren't immediately obvious.

The SM121 NVFP4 root cause is covered in depth in Part 1. This post focuses on what Nemotron adds on top.


Pitfall 1: The Nightly Image Breaks on SM121

For Qwen3.5-35B, the recommendation is cu130-nightly — the stable image doesn't support it on GB10. For Nemotron, the recommendation is the opposite.

The cu130-nightly image has a torch.compile bug specific to SM121. It surfaces during warmup, not startup, which means the container appears to start correctly and then crashes on the first real request. The error:

RuntimeError: CUDA error: an illegal instruction was encountered

The fix: use the stable image.

vllm/vllm-openai:v0.17.1-cu130

This is the opposite of what the Qwen3.5 article recommends. The difference is model-specific, not hardware-specific. Nemotron's kernel path in nightly hits the SM121 torch.compile bug; Qwen3.5's doesn't but requires nightly for other reasons. Document your image version alongside your serve command and don't treat it as interchangeable across models.


Pitfall 2: MoE Kernel — The Env Var That Does Nothing

Nemotron is a MoE model. vLLM's default MoE routing on SM121 uses FLASHINFER_CUTLASS, which doesn't support SM121. The fix is to route MoE layers through Marlin instead.

The env var approach:

# This does nothing. The variable does not exist in vLLM's source.
export VLLM_NVFP4_MOE_BACKEND=marlin

VLLM_NVFP4_MOE_BACKEND is not defined anywhere in vLLM 0.17.1. Setting it produces no error, no warning, and no effect. vLLM falls back to auto-selection, auto-selection picks FLASHINFER_CUTLASS, and the model crashes on first use.

The correct fix is a CLI flag:

--moe-backend marlin

This must be passed as a command-line argument to the vLLM serve command. There is no env var equivalent. If you're adapting a serve script from another source and it has VLLM_NVFP4_MOE_BACKEND=marlin, remove it and add --moe-backend marlin instead.
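A small pre-flight guard catches the dead variable when adapting someone else's serve script. A sketch only — the function name and messages are made up for illustration:

```shell
#!/bin/sh
# Pre-flight guard: VLLM_NVFP4_MOE_BACKEND is not read by vLLM 0.17.1, so a
# copied script that sets it is silently misconfigured. Warn loudly instead.
check_moe_env() {
  if [ -n "${VLLM_NVFP4_MOE_BACKEND:-}" ]; then
    echo "WARNING: VLLM_NVFP4_MOE_BACKEND is set but ignored; pass --moe-backend marlin on the CLI instead"
  else
    echo "ok: no dead MoE env var set (backend must come from --moe-backend)"
  fi
}

# Usage: run check_moe_env right before the docker/vllm serve invocation.
```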

(Note: for MXFP4 models like gpt-oss, the analogous fix uses VLLM_MXFP4_BACKEND=marlin. That's a different variable for a different quantization format. See the gpt-oss article for that distinction. Nemotron uses NVFP4, not MXFP4.)


Pitfall 3: FP4 Backend Env Vars

Two additional env vars are required for Nemotron's NVFP4 GEMM operations on SM121:

VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=marlin

VLLM_USE_FLASHINFER_MOE_FP4=0 disables the FlashInfer FP4 MoE path — which, like FLASHINFER_CUTLASS, doesn't work on SM121.

VLLM_NVFP4_GEMM_BACKEND=marlin forces Marlin for NVFP4 matrix multiplications. This is NVFP4-specific (the NVFP4 prefix in the variable name is meaningful). It's different from VLLM_MXFP4_BACKEND, which targets MXFP4 quantization. The two formats use different kernel paths and different env vars.

Additionally:

VLLM_MARLIN_USE_ATOMIC_ADD=1

This fixes a Marlin atomic race condition on SM121. Without it, Marlin occasionally produces incorrect output on GB10 under concurrent load. The flag enables a slower but correct atomic add path.

Confirm the backend is active by checking the startup log for:

[NVFP4] Using backend: marlin

If you see Auto-selected: CUTLASS_FP4 instead, one of these env vars isn't being picked up.
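That log check is easy to script against a saved startup log. A sketch — the status strings and the log path in the usage comment are illustrative:

```shell
#!/bin/sh
# Scan vLLM startup output for the NVFP4 backend line. Prints "marlin-ok" if
# Marlin was selected, "cutlass-bad" if auto-selection picked the broken
# SM100-targeted path, "unknown" if neither line is present.
nvfp4_backend_status() {  # usage: nvfp4_backend_status LOG_FILE
  if grep -q '\[NVFP4\] Using backend: marlin' "$1"; then
    echo "marlin-ok"
  elif grep -q 'Auto-selected: CUTLASS_FP4' "$1"; then
    echo "cutlass-bad"
  else
    echo "unknown"
  fi
}

# Typical usage against the running container:
#   docker logs nemotron 2>&1 > /tmp/nemotron-startup.log
#   nvfp4_backend_status /tmp/nemotron-startup.log
```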


Pitfall 4: Parser Name Is Exact

Nemotron uses a thinking output format. The reasoning parser flag is:

--reasoning-parser nemotron_v3

Not super_v3. Not nemotron. Not nemotron_super. Exactly nemotron_v3.

If the parser name is wrong, vLLM does not error at startup. The model loads and runs. Thinking tokens get routed incorrectly — either swallowing the actual response or producing malformed output. The failure is silent until you test with a request that exercises the reasoning path.

Verify by checking that a test request returns content in the correct field.
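One way to make that check concrete — a crude string-matching sketch. It assumes the reasoning parser emits a reasoning_content field alongside content in the chat-completions response (verify the field name against your vLLM version):

```shell
#!/bin/sh
# Classify a saved chat-completions JSON response: did the answer land in
# `content`, or did the parser swallow it into `reasoning_content` only?
classify_response() {  # usage: classify_response RESPONSE_JSON_FILE
  if grep -q '"content": *"[^"]' "$1"; then
    echo "content-ok"
  elif grep -q '"reasoning_content": *"[^"]' "$1"; then
    echo "reasoning-only"
  else
    echo "empty"
  fi
}

# Usage (port 8002 per the serve command in this post):
#   curl -s http://localhost:8002/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"nemotron","messages":[{"role":"user","content":"Say hi"}]}' \
#     > /tmp/resp.json
#   classify_response /tmp/resp.json
```

Anything other than `content-ok` on a simple request means the parser name is wrong and thinking tokens are being misrouted.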


Pitfall 5: Context Window vs System Prompt Size

The default max_model_len for Nemotron in most serve scripts is 32768 tokens. If you use a long system prompt — 24K tokens is not unusual for agent deployments — you have 8K tokens left for the actual conversation. That's not enough for most real tasks.

Set --max-model-len 200000. At this context length, vLLM allocates ~35GB for KV cache (fp8, 0.85 GPU memory utilization), with a theoretical concurrency of 7.48x.

The SSM-related parameter coupling from the Qwen3.5 article applies here too: --max-num-batched-tokens must be >= block_size. At 200K context, block_size = ceil(200000 / N) ≈ 2096. Set --max-num-batched-tokens 4096 to satisfy the constraint with headroom.
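The budget arithmetic is trivial but worth making explicit (24K is taken as 24576 tokens here):

```shell
#!/bin/sh
# Tokens left for the actual conversation after the system prompt is spent.
remaining_context() {  # usage: remaining_context MAX_MODEL_LEN SYSTEM_PROMPT_TOKENS
  echo $(( $1 - $2 ))
}

echo "32K window, 24K system prompt:  $(remaining_context 32768 24576) tokens left"
echo "200K window, 24K system prompt: $(remaining_context 200000 24576) tokens left"
```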


Performance

| Metric | Value |
|--------|-------|
| Load time | ~8 min (17 shards) |
| Memory footprint | ~108GB |
| Decode speed (thinking disabled) | 13-16 tok/s |
| Max context | 200K tokens |
| KV cache (fp8, 0.85 util) | ~35GB / ~450K tokens |

For reference, Qwen3.5-35B on the same hardware runs at ~47 tok/s. Nemotron is 3x slower. It's also roughly 3x larger in active parameters. The arithmetic is consistent — GB10 is bandwidth-bound, and more parameters means more memory loads per token.
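A back-of-envelope sketch of that arithmetic: bandwidth-bound decode tops out near memory bandwidth divided by bytes read per token. The 273 GB/s figure is NVIDIA's published DGX Spark memory bandwidth; the GB-per-token figures are illustrative assumptions, not measurements of either model:

```shell
#!/bin/sh
# Decode-speed ceiling for a bandwidth-bound GPU:
# tok/s ≈ memory bandwidth (GB/s) / bytes read per decoded token (GB).
est_tps() {  # usage: est_tps BANDWIDTH_GB_S GB_READ_PER_TOKEN
  echo $(( $1 / $2 ))
}

echo "~6 GB/token:  $(est_tps 273 6) tok/s ceiling"
echo "~18 GB/token: $(est_tps 273 18) tok/s ceiling"
```

Tripling the bytes moved per token divides the ceiling by three, which is the shape of the Qwen3.5-vs-Nemotron gap observed above.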

The practical implication: Nemotron is not an agent's primary inference backend. At 13-16 tok/s, call latency is too high for interactive agent workloads. It's more appropriate for batch jobs, deep analysis tasks, or situations where output quality justifies the throughput cost.


What Was Gained

Nemotron runs on GB10. The 200K context window is functional with a long system prompt. The SM121-specific fix set (Marlin backend, correct env vars, stable image) works for this model.

The transferable lesson from this debugging session: on SM121, always verify the kernel path before concluding hardware incompatibility. The failure symptom — cudaErrorIllegalInstruction or silent wrong output — looks like a hardware problem. It isn't. It's a kernel targeting problem. The fix is usually one flag away.

The specific diagnostic: check the startup log for [NVFP4] Using backend: marlin. If you see Auto-selected, something in your configuration isn't being read. The env var for NVFP4 (VLLM_NVFP4_GEMM_BACKEND) is different from the env var for MXFP4 (VLLM_MXFP4_BACKEND). They are not interchangeable. The MoE backend must be set via CLI flag (--moe-backend marlin), not env var.


The Working Command

# Stop qwen35 first — 128GB only fits one large model
docker stop qwen35

docker run -d --name nemotron --restart unless-stopped \
  --gpus all --ipc host --shm-size 16g -p 8002:8000 \
  -v /home/coolthor/models/nemotron-nvfp4:/models/nemotron \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  vllm/vllm-openai:v0.17.1-cu130 \
  --model /models/nemotron \
  --served-model-name nemotron \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --moe-backend marlin \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 200000 \
  --max-num-batched-tokens 4096 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching

Notable differences from the Qwen3.5 command:

  • Stable image (v0.17.1-cu130) not nightly
  • --moe-backend marlin as a CLI flag (no env var substitute)
  • Three NVFP4-specific env vars
  • Port 8002 (8000 is qwen35's port when running)
  • --shm-size 16g is sufficient; Nemotron doesn't need 64g

Do not add --enforce-eager. It was tested and causes a startup crash at the MoE autotuner stage.
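Given the ~8 minute shard load, a readiness poll beats staring at docker logs. A sketch — the URL and timeout are assumptions to adjust for your setup:

```shell
#!/bin/sh
# Poll an endpoint until it answers, or give up after a timeout.
wait_ready() {  # usage: wait_ready URL TIMEOUT_SECONDS
  t=0
  while [ "$t" -lt "$2" ]; do
    if curl -sf "$1" > /dev/null 2>&1; then
      echo "ready after ${t}s"
      return 0
    fi
    sleep 5
    t=$(( t + 5 ))
  done
  echo "timed out after $2 s"
  return 1
}

# Usage (port 8002 per the command above; budget well past the 8 min load):
#   wait_ready http://localhost:8002/v1/models 900
```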


Also in this series: Why Your DGX Spark Only Says "!!!!!": Debugging NVFP4 on SM121 · gpt-oss-120B at 59 tok/s: 6 Pitfalls and a Working Serve Script · Migrating Qwen3.5 from Ollama to vLLM on DGX Spark