DGX Spark · part 6
[DGX Spark] Overheating, 100W Power Cap, 30W Safety Mode — Complete Diagnostic Guide
❯ cat --toc
- Plain-Language Version: When Your AI Machine Is Secretly Running on Fumes
- The Diagnostic (Run This First)
- Root Cause: PD Controller Firmware Negotiation Failure
- What Does Not Fix It
- The Other Power Bug (Different Problem, Actually Fixable)
- Firmware Version
- 2026 Community Controversy: Carmack + the 100W Problem
- What Was Gained
- Checklist
TL;DR
Some GX10 units ship with a PD controller defect that permanently caps total system power at ~30W regardless of workload. Firmware reflash does not fix it. NVIDIA recommends RMA. Run one command to find out if you are affected.
Plain-Language Version: When Your AI Machine Is Secretly Running on Fumes
The NVIDIA DGX Spark (and its ASUS GX10 variant) is a desktop AI workstation that needs a lot of power — up to 180 watts delivered through a USB-C cable. Think of it like a high-performance car: it needs premium fuel at full pressure to perform.
Some units shipped with a defective power negotiation chip. The machine turns on, looks normal, and even runs AI tasks — just painfully slowly. It is like driving a sports car with the fuel line pinched: the engine runs, but it never gets past first gear. The system caps itself at about 30 watts instead of the full 180.
The tricky part is that everything looks like a software problem. You spend hours tweaking AI model settings, swapping configurations, and reading logs — when the real issue is that the hardware is starving for electricity. One terminal command (nvidia-smi) reveals whether you are affected in 30 seconds.
If you are, the fix is not software — it is a warranty replacement (RMA). This article shows exactly how to diagnose it and what to do next.
The Diagnostic (Run This First)
Fire any GPU-bound workload — an inference request, a quick matrix multiply, anything — then sample:
nvidia-smi --query-gpu=power.draw,utilization.gpu,clocks.sm --format=csv,noheader
Two outcomes:
Healthy:
35.65 W, 96 %, 2522 MHz
30W safety mode:
4.80 W, 2 %, 2411 MHz
In safety mode, GPU utilization stays near zero regardless of how much work you throw at it. The machine is not thermal-throttling — it is power-starved at the PMIC level.
On my unit, running Qwen3.5-35B-A3B-FP8 via vLLM during a 300-token generation:
power.draw = 35.65 W | utilization.gpu = 96% | clocks.sm = 2522 MHz
throughput = ~50 tok/s
No throttling. Normal behavior.
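If you want to script the check instead of eyeballing it, a minimal sketch along these lines parses one line of that CSV output. The names `GpuSample`, `parse_sample`, `looks_starved`, and the 10 % cut-off are my own assumptions for illustration, not vendor figures; the two sample lines are the ones shown above.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    power_w: float
    util_pct: float
    sm_clock_mhz: float


def parse_sample(line: str) -> GpuSample:
    """Parse one CSV line such as '35.65 W, 96 %, 2522 MHz'."""
    power, util, clock = (field.strip() for field in line.split(","))
    return GpuSample(
        power_w=float(power.split()[0]),
        util_pct=float(util.split()[0]),
        sm_clock_mhz=float(clock.split()[0]),
    )


# Assumed cut-off, not a vendor figure: the starved sample above showed 2 %.
LOW_UTIL_PCT = 10.0


def looks_starved(sample: GpuSample) -> bool:
    """True when utilization is near zero; only meaningful under load."""
    return sample.util_pct < LOW_UTIL_PCT


for line in ("35.65 W, 96 %, 2522 MHz", "4.80 W, 2 %, 2411 MHz"):
    verdict = "suspect" if looks_starved(parse_sample(line)) else "healthy"
    print(f"{line} -> {verdict}")
```

The parser assumes every field is numeric. If your nvidia-smi build reports a field as unavailable, `float()` will raise and you will need to handle that case.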
Root Cause: PD Controller Firmware Negotiation Failure
The GX10 takes power via USB-C PD 3.1 (180W EPR spec). The PD controller negotiates with the 240W adapter to unlock high-power mode — somewhere around 20V/5A or 28V.
On affected units, this negotiation fails silently. The adapter is plugged in, the machine boots normally, but the PD controller never escalates beyond the default safe power level. The PMIC then restricts the entire main power rail to approximately 30W.
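For orientation, here is a rough sketch of what each fixed USB PD voltage level can deliver at the 5 A current ceiling. The 28 V, 36 V, and 48 V levels are the Extended Power Range steps added in PD 3.1. Which contract the GX10 actually requests is not something I have confirmed, so treat `min_voltage_for` as arithmetic only, not a description of the firmware.

```python
# Fixed voltage levels that allow 5 A. 20 V is the top of the standard range;
# 28/36/48 V are the Extended Power Range (EPR) levels from USB PD 3.1.
MAX_CURRENT_A = 5
FIXED_VOLTAGES_V = [20, 28, 36, 48]


def max_power_w(voltage_v: int, current_a: int = MAX_CURRENT_A) -> int:
    """Power available at a given fixed voltage: P = V * I."""
    return voltage_v * current_a


def min_voltage_for(target_w: int) -> int:
    """Lowest fixed voltage level that can supply target_w at 5 A."""
    for voltage in FIXED_VOLTAGES_V:
        if max_power_w(voltage) >= target_w:
            return voltage
    raise ValueError(f"{target_w} W exceeds what 5 A can deliver at any level")


for voltage in FIXED_VOLTAGES_V:
    print(f"{voltage} V x {MAX_CURRENT_A} A = {max_power_w(voltage)} W")
```

The takeaway is that a failed escalation leaves the system far below any of these levels, which is consistent with the ~30W cap described above.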
The dmesg signature:
Detected insufficient power on the PCIe slot (27W)
That message appears across multiple Mellanox network controllers — confirming this is system-wide power starvation, not a single component issue.
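To count how often that signature shows up, a small sketch like this scans kernel log text for the message. The function name `find_power_warnings` is my own, and the sample log lines (timestamps, the `example-driver` prefix, the PCI addresses) are hypothetical placeholders; only the message text itself comes from the signature above.

```python
import re

# Message text taken from the dmesg signature above; the wattage is captured.
SIGNATURE = re.compile(
    r"Detected insufficient power on the PCIe slot \((\d+)W\)"
)


def find_power_warnings(dmesg_text: str) -> list[int]:
    """Return the wattage reported by each matching kernel log line."""
    return [int(match.group(1)) for match in SIGNATURE.finditer(dmesg_text)]


# Hypothetical log excerpt for illustration only.
sample_log = """\
[   12.345678] example-driver 0000:01:00.0: Detected insufficient power on the PCIe slot (27W)
[   12.345901] example-driver 0000:01:00.1: Detected insufficient power on the PCIe slot (27W)
[   13.000000] some unrelated kernel message
"""

hits = find_power_warnings(sample_log)
print(f"{len(hits)} warning(s), reported wattage: {hits}")
```

On a real machine you would pass in the text of `sudo dmesg` instead of the sample string. More than one hit across different devices points at the system-wide starvation described above.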
What Does Not Fix It
ASUS's recommended procedure is a "double flash" of the PD controller firmware:
Flash → Reboot → Flash → Reboot
Users in the NVIDIA Developer Forums thread followed this exactly. The throttling persisted. NVIDIA support's conclusion: hardware defect, RMA required.
This is not a fixable firmware issue. The PD controller itself has failed in a way that software cannot reach.
The Other Power Bug (Different Problem, Actually Fixable)
There is a separate issue worth knowing: GPU stuck at 5W with 0% utilization, caused by running an outdated driver stack (550.54.15 + CUDA 12.4).
This one is software:
sudo apt dist-upgrade
sudo fwupdmgr refresh && sudo fwupdmgr update
Upgrading to Driver 580.x + CUDA 13.0 resolved it. Confirmed fixed as of January 2026.
If your GPU utilization is low but not stuck at near-zero, check driver versions before assuming hardware failure.
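A quick way to automate that version check is sketched below. It assumes you obtain the version string yourself, for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The helper names are mine, and the 580 threshold simply mirrors the version that resolved it for me.

```python
# From this article: the bug appeared on 550.54.15 and was gone on 580.x.
FIXED_DRIVER_MAJOR = 580


def driver_major(version: str) -> int:
    """Extract the major component from a string like '550.54.15'."""
    return int(version.strip().split(".")[0])


def needs_driver_upgrade(version: str) -> bool:
    """True when the driver predates the 580.x series."""
    return driver_major(version) < FIXED_DRIVER_MAJOR


# '580.1' is a hypothetical version string used only to show the other branch.
for version in ("550.54.15", "580.1"):
    action = "upgrade first" if needs_driver_upgrade(version) else "driver is recent enough"
    print(f"{version}: {action}")
```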
Firmware Version
The latest firmware as of this writing is BIOS v0103 (2026/03/18):
- SOC: 0x305
- EC: 2.78.18.3
- PD: 0x507
My unit is currently on PD 0x500. Updating is worth doing, but if you are not seeing 30W symptoms, it is not urgent. If you are in safety mode, updating PD firmware is unlikely to help — but do it before initiating RMA to rule it out.
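Comparing PD versions is just a hex comparison. A tiny sketch, assuming the version is reported as a hex string like the ones above (how you read it off your own unit is left out, and `pd_is_current` is a name I made up):

```python
LATEST_PD = "0x507"  # latest PD firmware version listed in this article


def pd_is_current(installed: str, latest: str = LATEST_PD) -> bool:
    """Compare two hex version strings such as '0x500' and '0x507'."""
    return int(installed, 16) >= int(latest, 16)


print(pd_is_current("0x500"))  # my unit: behind the latest release
print(pd_is_current("0x507"))
```

This treats the version as a single monotonically increasing number, which is an assumption on my part rather than documented behavior.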
2026 Community Controversy: Carmack + the 100W Problem
In April 2026, John Carmack publicly criticized the DGX Spark for falling far short of the advertised 1 PFLOPS sparse FP4 performance. The NVIDIA developer forums were subsequently flooded with reports of three categories of failure:
| Symptom | Cause | Severity | Fixable? |
|---|---|---|---|
| Power capped at ~30W, 0% GPU utilization | PD controller hardware defect | Critical | No — RMA required |
| Power capped at ~100W (less than half of 240W rating) | Insufficient cooling → thermal throttling | High | Partially — improve airflow |
| GPU shows 5W / 0%, but dmesg is clean | Driver 550.x bug | Low | Yes — upgrade driver |
The 100W issue is distinct from the 30W issue. 30W is a PD controller manufacturing defect (permanent). 100W is thermal throttling (environmental). If your machine's power draw plateaus around 100W under load and GPU temperature is near the limit:
nvidia-smi --query-gpu=power.draw,temperature.gpu,clocks.throttle_reasons.sw_thermal_slowdown --format=csv,noheader
A value of Active in the sw_thermal_slowdown column confirms throttling. Mitigations:
- Ensure intake vents are unobstructed
- Keep ambient temperature below 25 °C
- Consider an external fan aimed at the chassis underside
- Vertical orientation improves convection vs. horizontal
Note: NVIDIA's CES 2026 software update claims 2.5x performance improvements. If you are on an older DGX OS version, update before judging.
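The thermal query above can be parsed the same way as the power check. This is a sketch under assumptions: the sample lines are hypothetical values I made up to show both branches, and the 100 W plateau with a 15 W tolerance is a rough window of my own choosing, not a documented limit.

```python
def parse_thermal_line(line: str) -> dict:
    """Parse a line such as '98.50 W, 85, Active' from the thermal query."""
    power, temp, slowdown = (field.strip() for field in line.split(","))
    return {
        "power_w": float(power.split()[0]),
        "temp_c": float(temp),
        "thermal_slowdown": slowdown == "Active",
    }


def is_thermal_plateau(sample: dict, plateau_w: float = 100.0,
                       tolerance_w: float = 15.0) -> bool:
    """True when throttling is active and power sits near the assumed plateau."""
    near_plateau = abs(sample["power_w"] - plateau_w) <= tolerance_w
    return sample["thermal_slowdown"] and near_plateau


# Hypothetical readings for illustration only.
for line in ("98.50 W, 85, Active", "35.65 W, 60, Not Active"):
    sample = parse_thermal_line(line)
    print(line, "->", "throttling" if is_thermal_plateau(sample) else "ok")
```

Take several readings under sustained load before drawing conclusions, because a single sample can catch a momentary spike.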
What Was Gained
What cost the most time: The 30W issue manifests silently. The machine boots, runs commands, serves requests — just slowly. Without the direct power.draw + utilization.gpu check, it would look like a vLLM config problem or an underpowered model.
Transferable diagnostic: For any unexpectedly slow inference on new hardware, sample power.draw and utilization.gpu simultaneously under load. If utilization is high and power is low relative to rated TDP, the power delivery chain deserves scrutiny before any software tuning.
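Sampling both values together over a short window can be sketched like this. The `read_line` parameter is an injection point I added so the loop can be exercised without a GPU; by default it shells out to nvidia-smi with `nounits` so the fields are bare numbers. The canned readings in the demo are illustrative, not measurements.

```python
import statistics
import subprocess
import time
from typing import Callable

QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def run_nvidia_smi() -> str:
    """Return the first GPU's CSV line; requires nvidia-smi on PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()[0]


def sample_under_load(n: int, interval_s: float,
                      read_line: Callable[[], str] = run_nvidia_smi) -> dict:
    """Take n paired readings and return mean power and mean utilization."""
    powers, utils = [], []
    for _ in range(n):
        power, util = (float(field) for field in read_line().split(","))
        powers.append(power)
        utils.append(util)
        time.sleep(interval_s)
    return {
        "mean_power_w": statistics.mean(powers),
        "mean_util_pct": statistics.mean(utils),
    }


# Demo with canned readings so the sketch runs without a GPU.
canned = iter(["35.65, 96", "30.00, 90"])
print(sample_under_load(2, 0.0, read_line=lambda: next(canned)))
```

On real hardware, start the workload first and then call `sample_under_load(10, 1.0)` so every reading is taken under load.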
The pattern: Hardware defects that look like configuration problems waste the most time because the debugging surface is infinite on the software side.
Checklist
- Run nvidia-smi --query-gpu=power.draw,utilization.gpu,clocks.sm --format=csv,noheader under load
- If utilization is low: check driver version, upgrade to 580.x + CUDA 13.0
- If utilization is high but throughput is inexplicably low: check for 30W dmesg signature
- If 30W confirmed: double-flash PD firmware per ASUS docs, then RMA if no change
- Update to BIOS v0103 (PD 0x507) regardless — cooler temps reported by other users post-update
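The three failure categories from the symptom table can be folded into one triage function. This is a sketch of my own decision order, not an official procedure: the function name, the 10 % utilization cut-off, and the order of checks are all assumptions layered on the table and checklist in this article.

```python
LOW_UTIL_PCT = 10.0       # assumed cut-off for "near-zero" utilization
FIXED_DRIVER_MAJOR = 580  # driver series that resolved the 5 W / 0 % bug


def triage(util_pct: float, dmesg_has_signature: bool,
           driver_major: int, thermal_slowdown: bool = False) -> str:
    """Map the readings from the diagnostics to a suggested next step."""
    if dmesg_has_signature:
        # 'insufficient power on the PCIe slot' present: power delivery issue.
        return "PD defect suspected: double-flash PD firmware, then RMA"
    if thermal_slowdown:
        return "thermal throttling: improve airflow"
    if util_pct < LOW_UTIL_PCT and driver_major < FIXED_DRIVER_MAJOR:
        return "driver bug suspected: upgrade driver"
    if util_pct < LOW_UTIL_PCT:
        return "low utilization, cause unclear: confirm the workload is running"
    return "no known fault pattern"


print(triage(util_pct=2, dmesg_has_signature=True, driver_major=580))
print(triage(util_pct=0, dmesg_has_signature=False, driver_major=550))
print(triage(util_pct=96, dmesg_has_signature=False, driver_major=580))
```

The dmesg check comes first because it is the only symptom here that software cannot fix, so it should short-circuit everything else.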
FAQ
- How do I check if my DGX Spark / GX10 is stuck in 30W safety mode?
- Run any GPU workload, then execute: nvidia-smi --query-gpu=power.draw,utilization.gpu,clocks.sm --format=csv,noheader. Healthy units show ~35W and 96% utilization under load. Units in 30W safety mode show ~5W and near-zero utilization regardless of workload.
- What causes the 30W power safety mode on GX10 / DGX Spark?
- A PD (Power Delivery) controller defect causes USB-C PD 3.1 negotiation with the 240W adapter to fail silently. The machine boots normally but never escalates beyond the default safe power level (~30W). The PMIC then restricts the entire power rail. dmesg shows 'Detected insufficient power on the PCIe slot (27W)'.
- Can I fix the 30W safety mode with a firmware update?
- No. ASUS recommends a 'double flash' of PD controller firmware, but affected users report the throttling persists. NVIDIA support's conclusion is that this is a hardware defect requiring RMA. Update to BIOS v0103 (PD 0x507) to rule it out, but expect to RMA if the issue remains.
- My GX10 GPU shows 5W and 0% utilization — is it the 30W safety mode?
- Not necessarily. There is a separate software bug where outdated drivers (550.54.15 + CUDA 12.4) cause the GPU to appear stuck at 5W/0%. Upgrading to Driver 580.x + CUDA 13.0 fixes this. Check driver versions first. If the issue persists after upgrading, then it may be the hardware PD controller defect.
- Is the DGX Spark overheating issue real? What about Carmack's 100W complaint?
- Carmack and multiple developers reported DGX Spark capping at 100W under sustained load (less than half the 240W rating), with overheating and unexpected shutdowns. There are actually three distinct problems: (1) 30W PD controller defect (hardware, needs RMA), (2) 100W thermal throttling (normal protection but limits performance), (3) 5W driver bug (software, fixable). The nvidia-smi diagnostic in this article identifies which one you have in 30 seconds.
- Is the DGX Spark worth buying with all these issues?
- The power and thermal problems affect some batches, not all. Units that work properly run Gemma 4 26B MoE at 52 tok/s with 128GB unified memory fitting most models. Confirm the return policy before purchasing, and run the nvidia-smi diagnostic as soon as you receive it.