DGX Spark · part 1
[Benchmark] 8 Models on DGX Spark: Finding the Best Stack for AI Agents
Preface
Before any quest, you choose your weapons. Not all weapons suit all fights — and you don't find that out until after the first dungeon, when the sword you thought was general-purpose turns out to melt on contact with acid.
This is that process, applied to local LLMs.
After getting SM121 working with vLLM, the immediate practical question became: which models are actually worth running on this machine, and for which tasks? I had 128GB of unified memory and an ASUS Ascent GX10 sitting in my office. The benchmarks in this article actually predate the vLLM migration: while still running Ollama, I tested eight models across seven task categories to find the right stack for AI coding agents.
Here's what I learned.
The Hardware
ASUS Ascent GX10. NVIDIA GB10 Grace Blackwell Superchip. 128GB unified memory, 273 GB/s bandwidth, SM121 compute capability.
The hardware backstory — why SM121 is different from the GB200 and what that means for kernel selection — is covered in Part 1. This article treats the hardware as a given and focuses on what runs well on it.
The short version: 128GB means you can run models that would require multi-GPU setups elsewhere. This benchmark takes advantage of that.
Test Methodology
Seven task categories, each designed to stress a different capability:
- Logic reasoning — multi-step deduction, constraint satisfaction
- Code generation — write a working function from a specification
- Translation — English ↔ Chinese, evaluating naturalness and accuracy
- Math — arithmetic through algebra, with explicit working shown
- Summarization — condense a long document while preserving key points
- Debugging — identify and explain a bug in provided code
- Tool calling — generate valid JSON matching a specified function signature
Tool calling deserves specific explanation. In AI agent frameworks, models communicate with tools by generating structured JSON that matches a function signature — something like:
```json
{
  "function": "search_web",
  "arguments": {
    "query": "NVIDIA GB10 specs",
    "max_results": 5
  }
}
```
If the model generates invalid JSON, the agent framework either crashes or silently fails. This is a binary pass/fail category, not a quality gradient. It's one of the most important capabilities for production agent use.
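To make the pass/fail nature of this category concrete, here is a minimal validation gate of the kind an agent framework might apply before dispatching a tool call. The tool registry and its parameter names are hypothetical, for illustration only:

```python
import json

# Hypothetical tool registry: names and parameter sets are illustrative,
# not taken from any specific agent framework.
TOOL_SIGNATURES = {
    "search_web": {"required": {"query"}, "optional": {"max_results"}},
}

def validate_tool_call(raw: str) -> bool:
    """Binary gate: does the model output parse as a valid tool call?"""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # invalid JSON: hard fail, no partial credit
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return False
    sig = TOOL_SIGNATURES.get(call.get("function"))
    if sig is None:
        return False  # unknown function name
    args = set(call["arguments"])
    # All required parameters present, no unknown parameters.
    return sig["required"] <= args <= (sig["required"] | sig["optional"])
```

One malformed output anywhere in this chain fails the whole call, which is why the category is scored pass/fail rather than graded.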
All tests used Ollama via the /api/generate and /api/chat endpoints. More on why both endpoints matter in Finding 3.
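Ollama's non-streaming responses include enough metadata to compute throughput directly: with `"stream": false`, both endpoints return `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch of the tok/s computation, with made-up metadata in the example:

```python
def tokens_per_second(resp: dict) -> float:
    """Throughput from Ollama's non-streaming response metadata.
    eval_count is the number of generated tokens; eval_duration is
    the generation time in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative metadata: 470 tokens generated in 10 seconds -> 47.0 tok/s
print(tokens_per_second({"eval_count": 470, "eval_duration": 10_000_000_000}))
```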
Results
Speed
| Model | Format | Speed (tok/s) |
|---|---|---|
| qwen3-vl:30b | native (raw mode) | 78.5 |
| qwen3-vl:30b | thinking mode | 76–79 |
| qwen3-coder-next | q4_K_M | 47.0 |
| gpt-oss:120b | MXFP4 | 42.4 |
| qwen3-coder-next | q8_0 | 36.1 |
| glm-4.7-flash | bf16 | 27.5 |
VRAM
| Model | VRAM Usage |
|---|---|
| qwen3-vl:30b | ~19 GB |
| qwen3-coder-next q4_K_M | ~20 GB (estimated) |
| qwen3-coder-next q8_0 | ~33 GB (estimated) |
| gpt-oss:120b | ~60 GB (estimated) |
| glm-4.7-flash bf16 | ~30 GB (estimated) |
VRAM figures for q4_K_M and q8_0 are estimated from model parameter counts and quantization formats. The qwen3-vl:30b 19 GB figure was directly observed.
Notable: qwen3-coder-next q4_K_M and q8_0 share blob storage in Ollama — two quantizations of the same model don't double your disk usage.
Finding 1: Quantization Matters Less Than You Think
The headline comparison: qwen3-coder-next q4_K_M vs q8_0 across all seven task categories.
The speed difference is real: 47.0 tok/s vs 36.1 tok/s, a 23% drop for q8_0. The VRAM difference matters too: roughly 13GB less for q4 (~20GB vs ~33GB). Across seven tasks, the quality difference was close to invisible.
The specific differences I found:
- Translation: q8 occasionally produced slightly more natural phrasing — the kind of difference a human would catch but a BLEU score wouldn't
- Debugging: q8 explanations were marginally more detailed, adding an extra sentence or two of context around the root cause
Everything else — logic, code generation, math, summarization, tool calling — was functionally identical between the two quantizations.
The practical implication: q4_K_M is the right choice for almost everything. You get roughly 30% more throughput, you free roughly 13GB for other models (or larger context), and you lose almost nothing in quality for the tasks that matter for agent workloads.
The principle this suggests: on a model with 32B+ parameters, 4-bit quantization retains enough precision that the quality floor is set by the model's training and architecture, not by the quantization level. This isn't surprising if you've read the quantization literature — it just wasn't obvious to me until I tested it directly.
Finding 2: Bigger Isn't Always Better
gpt-oss:120b is a 120B model. It failed tool calling.
Not "worse at tool calling." Failed. Generated invalid JSON in the tool calling tests — a dealbreaker for any agent framework that depends on structured output. For tasks where the output format is load-bearing, a model that can't reliably produce valid JSON is a non-starter regardless of its benchmark scores on other tasks.
The verbosity problem was separate but consistent. On translation and summarization tasks, gpt-oss:120b was 5–6× more verbose than qwen3-coder-next with no corresponding quality improvement. The extra output wasn't adding precision or nuance — it was padding. For agent workflows that process model outputs programmatically, verbosity without content is a tax on latency.
At 42.4 tok/s, gpt-oss:120b is already slower than qwen3-coder-next q4. At 5× the verbosity with worse structured output, the practical case for using it as an agent backbone is difficult to make.
The lesson: parameter count is a proxy for capability, not a guarantee. Tool calling specifically depends on fine-tuning and RLHF focus, not on raw model size. A well-tuned 32B model can reliably beat a poorly tuned 120B model on structured output tasks.
Finding 3: Thinking Models Need Different API Calls
This one is a real trap.
Both glm-4.7-flash and qwen3-vl return empty responses when called via the standard /api/generate endpoint. The models appear to run — tokens are being generated, the request takes time — but the returned response field is empty.
What's happening: these models route their output into an internal "thinking" channel. The actual response lives in a field that /api/generate doesn't expose. You have to use /api/chat to access it:
```bash
# This returns an empty response for thinking models:
curl http://<your-gx10-ip>:11434/api/generate \
  -d '{"model": "glm-4.7-flash", "prompt": "Translate: Hello world", "stream": false}'

# This works:
curl http://<your-gx10-ip>:11434/api/chat \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Translate: Hello world"}],
    "stream": false
  }'
```
The second call returns a message.content field with the actual output. The thinking process is in a separate field — you can inspect it if you want, ignore it if you don't.
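In code, the same fix looks like this. A standard-library-only sketch: the `message.thinking` field name matches what recent Ollama versions return for thinking models and may be absent on older releases, so the parser treats it as optional:

```python
import json
import urllib.request

def split_message(resp: dict) -> tuple[str, str]:
    """Return (content, thinking) from an /api/chat response body.
    Thinking models put the answer in message.content and route
    their reasoning into message.thinking."""
    msg = resp.get("message", {})
    return msg.get("content", ""), msg.get("thinking", "")

def chat(host: str, model: str, prompt: str) -> tuple[str, str]:
    """Non-streaming /api/chat call against an Ollama server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"http://{host}:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return split_message(json.loads(r.read()))
```

Calling `chat(host, "glm-4.7-flash", "Translate: Hello world")` returns the actual answer first and the reasoning second; code that only needs the answer can discard the second element.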
The qwen3-vl case had an additional failure mode worth documenting separately.
The 7,094 Character Thinking Incident
qwen3-vl:30b in thinking mode can burn its entire token budget on internal reasoning and produce zero characters of actual output.
In one test (a single translation sentence), the model spent 7,094 characters of thinking output working through the translation and then returned an empty response. Not truncated. Not an error. Just nothing. The thinking was thorough and the output was absent.
This isn't a bug exactly — it's a consequence of how thinking models allocate their budget. When the thinking process fills the context, there's nothing left for output. For short tasks where the model over-thinks, you get maximum reasoning and minimum results.
The practical workaround: for simple tasks, use qwen3-vl in raw mode (no thinking). Reserve thinking mode for tasks where the reasoning depth actually adds value — complex multi-step analysis, not translation.
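If you drive these models through /api/chat, recent Ollama releases expose a per-request `think` flag that switches the reasoning channel off; treat the flag as version-dependent, since older releases ignore or reject it. A sketch of building the two request bodies:

```python
import json

def chat_body(model: str, prompt: str, think: bool) -> str:
    """Build a non-streaming /api/chat request body. The "think" field
    is honored by recent Ollama versions for thinking-capable models;
    with think=False the model answers directly instead of spending
    its token budget on internal reasoning."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,
    })

# Thinking off for simple tasks, on only when the depth pays for itself:
simple = chat_body("qwen3-vl:30b", "Translate: Hello world", think=False)
hard = chat_body("qwen3-vl:30b", "Walk through this failure trace step by step.", think=True)
```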
The Stack I Use
Three roles, three models:
Primary: qwen3-coder-next q4_K_M — 47.0 tok/s, reliable tool calling, solid across all seven task categories. For an AI coding agent that needs to call tools, generate code, and reason about logic at volume, this is the workhorse. The q4 quantization gives the speed and memory headroom to run it alongside other models.
Vision: qwen3-vl:30b — only 19GB, which means you can run it alongside the primary model with room to spare. Handles image input. Use raw mode (not thinking mode) for straightforward tasks — translation, description, extraction. Reserve thinking mode for cases where you actually need it and the task is complex enough to justify the token budget risk.
Deep reasoning: glm-4.7-flash — when you need thorough analysis rather than fast responses. Slower at 27.5 tok/s, but the thinking channel adds depth for tasks where surface-level answers aren't enough. Use it as a deliberate choice for complex reasoning tasks, not as a general-purpose model.
Running qwen3-coder-next q4 (~20GB) and qwen3-vl (~19GB) together uses approximately 39GB of the available 128GB for weights, before KV cache and runtime overhead. The machine is well under half-loaded. That headroom is useful: it means you can load additional models for specific tasks, keep contexts warm, or switch between configurations without waiting for models to reload.
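The memory arithmetic is simple enough to keep as a helper. The figures below are the weights-only estimates from the VRAM table; actual usage will be higher once KV cache and runtime overhead are counted:

```python
# Estimated weights-only footprints from the VRAM table above (GB).
MODEL_GB = {
    "qwen3-coder-next:q4_K_M": 20,
    "qwen3-vl:30b": 19,
    "glm-4.7-flash:bf16": 30,
    "gpt-oss:120b": 60,
}

def headroom_gb(loaded: list[str], total_gb: float = 128.0) -> float:
    """Remaining unified memory after loading the given models,
    ignoring KV cache and runtime overhead."""
    return total_gb - sum(MODEL_GB[m] for m in loaded)

print(headroom_gb(["qwen3-coder-next:q4_K_M", "qwen3-vl:30b"]))  # 89.0
```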
What Was Gained
A few transferable principles from this benchmark:
VRAM math matters more than model count. 128GB sounds like a lot until you start running 60GB models. The insight from this benchmark: two smaller specialized models often outperform one large general model on actual agent tasks, use less memory, and give you more operational flexibility. The roughly 89GB of headroom left with two models loaded isn't waste; it's deliberate reserve.
Verbosity is not quality. gpt-oss:120b's 5–6× verbosity on summarization and translation produced no corresponding quality improvement. For programmatic processing — which is most of what an AI agent does with model output — shorter, accurate responses are better than longer, padded ones. Benchmark for output quality per token, not just output quality.
The API endpoint is part of the model's interface contract. Thinking models that silently return empty responses via /api/generate aren't buggy — but they're also not going to tell you that you're using the wrong endpoint. If a model works in the Ollama UI but returns empty responses in your code, check whether you're using /api/generate or /api/chat. The difference is not documented prominently.
Tool calling is a quality gate, not a quality gradient. For agent frameworks, a model that sometimes generates invalid JSON is operationally equivalent to a model that always fails. This should be tested first, not benchmarked last. If structured output fails, nothing else about the model's quality matters for that use case.
What Changed After This
This benchmark informed the move from Ollama to vLLM. The specific motivations: prefix caching for repeated system prompts (significant for agent workflows with long tool schemas), lower TTFT at the start of each call, and finer control over KV cache management.
The vLLM migration on GB10/SM121 has its own set of complications — covered in the other articles in this series.
Next: Why Your DGX Spark Only Says "!!!!!": Debugging NVFP4 on SM121