DGX Spark · part 15
[AI Agent] Gemma 4 Went from 40 Errors to a 9-Step Bug Fix — by Switching One Thing
❯ cat --toc
- Plain-Language Version: Can a desktop AI fix real bugs?
- The Setup
- Test 1: OpenHands — Function Calling Hell
- Gemma 4 26B on OpenHands: 40+ errors
- But the API works fine?
- Qwen 3.5 35B on OpenHands: Success (with caveats)
- Test 2: SWE-agent — Different action format, different story
- Gemma 4 26B on SWE-agent: 9 steps, patch submitted
- Qwen 3.5 35B on SWE-agent: Fixed but couldn't finish
- So what actually happened
- vLLM Configuration That Matters
- 1. The chat template (most commonly missed)
- 2. Qwen 3.5's thinking mode must be disabled
- 3. Gemma 4 on Ollama: broken
- What Was Gained
- What cost the most time
- Transferable diagnostics
- The thing I'll remember
- How to reproduce
TL;DR
Feasibility test, not a full benchmark. Gemma 4 26B went from 40+ tool calling errors on OpenHands to fixing a test bug in 9 steps on SWE-agent — same model, same GX10 hardware. The pipeline works; a full SWE-Bench Lite run (300 tasks) is next.
Plain-Language Version: Can a desktop AI fix real bugs?
SWE-Bench is the standard test for whether an AI can fix real software bugs — it takes actual GitHub issues from popular projects and checks if the AI can produce a working patch. The leaderboard is dominated by Claude and GPT through cloud APIs.
I wanted to know if you could skip the API bill entirely — run SWE-Bench on your own hardware with open-source models. So I tested Gemma 4 26B and Qwen 3.5 35B on a DGX Spark, first with OpenHands, then with SWE-agent. I spent four hours convinced Gemma 4 was broken before discovering the problem was OpenHands, not the model.
The Setup
- Hardware: NVIDIA DGX Spark (GB10, 128 GB unified memory, SM121)
- Models: Gemma 4 26B-A4B NVFP4, Qwen 3.5 35B-A3B FP8 — both served via vLLM
- Frameworks: OpenHands v0.59.0, SWE-agent v1.1.0
- Agent host: Mac mini M4 (16 GB, OrbStack Docker) — runs the agent + sandbox, connects to GX10 vLLM via Tailscale
Test 1: OpenHands — Function Calling Hell
OpenHands uses OpenAI-style function calling — the model receives tool schemas as JSON and must respond with structured tool_calls objects. It exposes 11 tools: execute_bash, str_replace_editor, think, finish, browser, execute_ipython_cell, task_tracker, fetch, create_pr, create_mr, create_bitbucket_pr.
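Concretely, an OpenAI-style exchange is a two-layer contract: the framework sends each tool as a JSON schema, and the model must answer with a structured tool_calls entry whose arguments field is itself a JSON-encoded string. A minimal sketch using a trimmed-down execute_bash schema (illustrative shape, not the exact OpenHands schema):

```python
import json

# Trimmed-down tool schema in OpenAI function-calling format.
# (Illustrative: the real OpenHands execute_bash schema carries more fields.)
execute_bash = {
    "type": "function",
    "function": {
        "name": "execute_bash",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string"},
                "security_risk": {"type": "string"},
            },
            "required": ["command", "security_risk"],
        },
    },
}

# What a valid model response must contain: a tool_calls entry whose
# "arguments" field is a JSON-encoded string, not free text.
tool_call = {
    "type": "function",
    "function": {
        "name": "execute_bash",
        "arguments": json.dumps({"command": "ls /repo", "security_risk": "LOW"}),
    },
}

args = json.loads(tool_call["function"]["arguments"])
required = execute_bash["function"]["parameters"]["required"]
print(all(k in args for k in required))  # True
```

That inner arguments string is exactly the layer where a model can name the right function yet drop its parameters.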
Task: "Write a Python function that checks if a number is prime. Include a test."
Gemma 4 26B on OpenHands: 40+ errors
Every parser configuration failed:
| --tool-call-parser | Errors before first success | Result |
|---|---|---|
| pythonic | 5 | Only produced task_tracker:plan |
| hermes | 4 | Same |
| (none) | 7 | Same |
| gemma4 + official chat template | 40+ | Never completed task |
The errors were always the same pattern:
Missing required parameters for function 'execute_bash': {'command', 'security_risk'}
Missing required parameters for function 'str_replace_editor': {'path', 'command', 'security_risk'}
Gemma 4 called the right functions but dropped the parameters. Every time.
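The check that produces this error is simple set arithmetic over the schema's required list. A minimal reconstruction (my sketch, not OpenHands' actual validation code):

```python
def missing_required(tool_schema: dict, arguments: dict) -> set:
    """Return the required parameters absent from a tool call's arguments."""
    required = set(tool_schema["function"]["parameters"]["required"])
    return required - set(arguments)

# Minimal stand-in for the execute_bash schema OpenHands exposes.
execute_bash = {
    "function": {"parameters": {"required": ["command", "security_risk"]}}
}

# Gemma 4 named the right function but sent empty arguments:
print(sorted(missing_required(execute_bash, {})))
# ['command', 'security_risk']

# A complete call passes the check:
print(sorted(missing_required(execute_bash,
                              {"command": "ls", "security_risk": "LOW"})))
# []
```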
But the API works fine?
This is what kept me debugging for hours. The exact same 11 tools + the full OpenHands system prompt (8,892 characters) via curl? Perfect:
curl http://<gx10-ip>:8000/v1/chat/completions -d '{
"model": "gemma-4-26b",
"messages": [{"role": "system", "content": "<8892-char OpenHands prompt>"},
{"role": "user", "content": "Create hello.py"}],
"tools": [<all 11 tools>]
}'
Response: correct str_replace_editor call with all parameters. So the API worked. The framework somehow broke it.
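One structural difference between that curl test and the agent loop: curl exercises a single turn, while the framework resends a growing history in which the model's own tool calls and their results are replayed every step. A sketch of that multi-turn shape (contents abbreviated; the real turns carry the full system prompt):

```python
# Multi-turn history as an OpenAI-compatible framework resends it each step.
# (Shape only; real OpenHands turns carry the full 8,892-char system prompt.)
messages = [
    {"role": "system", "content": "You are a software agent..."},
    {"role": "user", "content": "Create hello.py"},
    {   # the model's previous structured tool call, replayed verbatim
        "role": "assistant",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "str_replace_editor",
                "arguments": '{"command": "create", "path": "/repo/hello.py"}',
            },
        }],
    },
    {   # the tool's output, fed back with the matching call id
        "role": "tool",
        "tool_call_id": "call_0",
        "content": "File created successfully at /repo/hello.py",
    },
]

roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user', 'assistant', 'tool']
```

A one-shot curl never exercises this replay, which is consistent with the failure only appearing inside the conversation loop.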
Qwen 3.5 35B on OpenHands: Success (with caveats)
Qwen 3.5 completed the same task with only 1 error — missing security_risk parameter once, then recovered:
✅ Created prime_check.py with is_prime() function
✅ Ran python3 prime_check.py — all tests passed
✅ Task completed
Critical setup detail: Qwen 3.5's --reasoning-parser qwen3 must be removed. With it enabled, all output gets consumed as "thinking tokens" and the response comes back empty. Use --tool-call-parser qwen3_xml without the reasoning parser.
Test 2: SWE-agent — Different action format, different story
SWE-agent doesn't use function calling at all. The model just writes plain text, and the framework parses it:
💭 THOUGHT
I need to look at the file with the syntax error.
🎬 ACTION
str_replace_editor view /repo/tests/missing_colon.py
No OpenAI function calling is involved; the framework extracts the command directly from the text.
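A minimal sketch of this kind of text-action parsing (simplified; SWE-agent's real parser is more elaborate, and the emoji markers above are display decoration):

```python
import re

def parse_action(text: str):
    """Split a model reply into (thought, action) blocks.
    Simplified sketch of text-action parsing, not SWE-agent's actual code."""
    match = re.search(r"THOUGHT\s*(.*?)\s*ACTION\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("no THOUGHT/ACTION blocks found")
    return match.group(1).strip(), match.group(2).strip()

reply = """THOUGHT
I need to look at the file with the syntax error.
ACTION
str_replace_editor view /repo/tests/missing_colon.py"""

thought, action = parse_action(reply)
print(action)  # str_replace_editor view /repo/tests/missing_colon.py
```

The model only has to produce readable text; there is no JSON layer to get wrong.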
Task: Fix a bug from SWE-agent/test-repo#1 — a missing colon in a Python function definition causing SyntaxError. This is SWE-agent's own test repo, not a real SWE-Bench task — deliberately simple to validate the pipeline.
Gemma 4 26B on SWE-agent: 9 steps, patch submitted
Step 1: find . -maxdepth 2 -not -path '*/.*' → Found repo structure
Step 2: cat tests/missing_colon.py → Read the broken file
Step 3: python3 tests/missing_colon.py → Reproduced the error
Step 4: cat tests/missing_colon.py → Confirmed the issue
Step 5: str_replace_editor (wrong path) → Path error, self-corrected
Step 6: str_replace_editor str_replace → Fixed the bug ✅
Step 7: python3 tests/missing_colon.py → Verified the fix ✅
Steps 8–9: submit → Patch submitted ✅
The patch:
-def division(a: float, b: float) -> float
+def division(a: float, b: float) -> float:
     return a/b
That's the same model that couldn't write "hello world" on OpenHands without 40 errors.
Qwen 3.5 35B on SWE-agent: Fixed but couldn't finish
Qwen 3.5 also found and fixed the bug — it correctly applied the same str_replace edit. But it ran for 96 steps doing excessive git archaeology and never called submit. The fix was correct; the model just couldn't figure out how to end the session.
So what actually happened
| Model | OpenHands (function calling) | SWE-agent (text actions) |
|---|---|---|
| Gemma 4 26B | ❌ 40+ errors, never completed | ✅ 9 steps, patch produced |
| Qwen 3.5 35B | ✅ 1 error, completed | ⚠️ Fixed the bug but ran 96 steps, never called submit |
The difference is the action format. OpenHands requires structured JSON function calls with exact parameter names. Gemma 4 keeps dropping parameters when there are 11 tools competing for its attention. SWE-agent just needs plain text — and Gemma 4 writes plain text fine.
Gemma 4 can code. It just can't fill out JSON forms reliably.
vLLM Configuration That Matters
These cost me most of the afternoon:
1. The chat template (most commonly missed)
# Download from vLLM GitHub
curl -sL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_gemma4.jinja \
-o tool_chat_template_gemma4.jinja
# Mount into container
-v /path/to/tool_chat_template_gemma4.jinja:/app/tool_chat_template_gemma4.jinja
--chat-template /app/tool_chat_template_gemma4.jinja
Without this, Gemma 4's tool calls lose their parameters even when the model generates them correctly. The official vLLM recipe documents this but it's easy to miss.
2. Qwen 3.5's thinking mode must be disabled
# WRONG — all output gets consumed as thinking tokens
--reasoning-parser qwen3 --tool-call-parser qwen3_xml
# RIGHT — tool calling works
--tool-call-parser qwen3_xml
# (no --reasoning-parser)
3. Gemma 4 on Ollama: broken
Gemma 4 E2B, E4B, and 26B all return empty tool calls on Ollama (known bug). Use vLLM with --tool-call-parser gemma4.
What Was Gained
What cost the most time
Debugging Gemma 4's tool calling on OpenHands. I tested four different --tool-call-parser settings, added the official chat template, verified the parser code included the PR #38847 fix, and ran curl tests at every level. The curl tests all passed perfectly — 11 tools, 8,892-character system prompt, correct parameters every time. The problem only appeared inside OpenHands' multi-turn conversation loop.
I should have tried SWE-agent first.
Transferable diagnostics
When a model "can't do tool calling," test it at three levels before concluding it's broken:
- Single API call (curl with 2 tools) — tests basic format
- Full schema (curl with all tools + system prompt) — tests scale
- Framework integration (actual agent loop) — tests multi-turn interaction
If levels 1 and 2 pass but level 3 fails, the problem is the framework, not the model. Switching frameworks is faster than debugging function-calling compatibility.
The thing I'll remember
Next time a local model looks broken on an agent framework, I'm switching frameworks before I start debugging the model. Would've saved me four hours this time.
How to reproduce
- vLLM for Gemma 4: Use the `vllm/vllm-openai:gemma4-cu130` image (not `latest` — it lacks Gemma 4 transformers support)
- Download the chat template: `curl -sL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_gemma4.jinja`
- SWE-agent install: `uv python install 3.12 && git clone SWE-agent && uv pip install -e .`
- Run: `sweagent run --agent.model.name openai/gemma-4-26b --agent.model.api_base http://<your-gx10-ip>:8000/v1 --agent.model.api_key dummy --agent.model.per_instance_cost_limit 0`
- For Qwen 3.5: Remove `--reasoning-parser qwen3`, use `--tool-call-parser qwen3_xml`
This is Part 15 of the "DGX Spark" series.
FAQ
- Can you run SWE-Bench with local open-source models instead of Claude or GPT?
- The pipeline works. We got Gemma 4 26B to fix a test bug in 9 steps via SWE-agent on a DGX Spark — no API costs. But this was a feasibility test on a simple bug (missing colon), not a full SWE-Bench Lite run. We haven't measured resolve rates on the real 300-task benchmark yet.
- Which is better for SWE-Bench — Gemma 4 or Qwen 3.5?
- It depends on the framework. With SWE-agent (text-based actions), Gemma 4 26B solved the task in 9 steps and submitted a patch. Qwen 3.5 35B fixed the bug too but took 96 steps and never submitted. With OpenHands (function calling), Qwen 3.5 worked (1 error) while Gemma 4 produced 40+ errors. Pick the model that matches your framework's action format.
- Why does Gemma 4 fail on OpenHands but work on SWE-agent?
- OpenHands uses OpenAI-style function calling (structured JSON tool calls). Gemma 4's function calling is unreliable with 11 concurrent tools — it drops required parameters. SWE-agent uses text-based actions (plain text commands parsed by the framework), which Gemma 4 handles perfectly. The model's coding ability is fine; it's the function calling format that breaks.
- What vLLM settings are needed for Gemma 4 tool calling?
- Three critical settings: --enable-auto-tool-choice, --tool-call-parser gemma4, and --chat-template tool_chat_template_gemma4.jinja (download from vLLM GitHub). The chat template is the most commonly missed one — without it, Gemma 4's tool calls lose their parameters even though the model generates them correctly.