DGX Spark · part 15
[AI Agent] Gemma 4 Went from 40 Errors to a 9-Step Bug Fix — by Switching One Thing
❯ cat --toc
- Plain-Language Version: Can a desktop AI fix real bugs?
- The Setup
- Test 1: OpenHands — Function Calling Hell
- Gemma 4 26B on OpenHands: 40+ errors
- But the API works fine?
- Qwen 3.5 35B on OpenHands: Success (with caveats)
- Test 2: SWE-agent — Different action format, different story
- Gemma 4 26B on SWE-agent: 9 steps, patch submitted
- Qwen 3.5 35B on SWE-agent: Fixed but couldn't finish
- So what actually happened
- vLLM Configuration That Matters
- 1. The chat template (most commonly missed)
- 2. Qwen 3.5's thinking mode must be disabled
- 3. Gemma 4 on Ollama: broken
- What Was Gained
- What cost the most time
- Transferable diagnostics
- The thing I'll remember
- How to reproduce
TL;DR
Feasibility test, not a full benchmark. Gemma 4 26B went from 40+ tool calling errors on OpenHands to fixing a test bug in 9 steps on SWE-agent — same model, same GX10 hardware. The pipeline works; a full SWE-Bench Lite run (300 tasks) is next.
Plain-Language Version: Can a desktop AI fix real bugs?
SWE-Bench is the standard test for whether an AI can fix real software bugs — it takes actual GitHub issues from popular projects and checks if the AI can produce a working patch. The leaderboard is dominated by Claude and GPT through cloud APIs.
I wanted to know if you could skip the API bill entirely — run SWE-Bench on your own hardware with open-source models. So I tested Gemma 4 26B and Qwen 3.5 35B on a DGX Spark, first with OpenHands, then with SWE-agent. I spent four hours convinced Gemma 4 was broken before discovering the problem was OpenHands, not the model.
The Setup
- Hardware: NVIDIA DGX Spark (GB10, 128 GB unified memory, SM121)
- Models: Gemma 4 26B-A4B NVFP4, Qwen 3.5 35B-A3B FP8 — both served via vLLM
- Frameworks: OpenHands v0.59.0, SWE-agent v1.1.0
- Agent host: Mac mini M4 (16 GB, OrbStack Docker) — runs the agent + sandbox, connects to GX10 vLLM via Tailscale
Test 1: OpenHands — Function Calling Hell
OpenHands uses OpenAI-style function calling — the model receives tool schemas as JSON and must respond with structured tool_calls objects. It exposes 11 tools: execute_bash, str_replace_editor, think, finish, browser, execute_ipython_cell, task_tracker, fetch, create_pr, create_mr, create_bitbucket_pr.
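Concretely, an OpenAI-style exchange is a two-layer contract: the framework sends each tool as a JSON schema, and the model must answer with a structured tool_calls entry whose arguments field is itself a JSON-encoded string. A minimal sketch using a trimmed-down execute_bash schema (illustrative shape, not the exact OpenHands schema):

```python
import json

# Trimmed-down tool schema in OpenAI function-calling format.
# (Illustrative: the real OpenHands execute_bash schema carries more fields.)
execute_bash = {
    "type": "function",
    "function": {
        "name": "execute_bash",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string"},
                "security_risk": {"type": "string"},
            },
            "required": ["command", "security_risk"],
        },
    },
}

# What a valid model response must contain: a tool_calls entry whose
# "arguments" field is a JSON-encoded string, not free text.
tool_call = {
    "type": "function",
    "function": {
        "name": "execute_bash",
        "arguments": json.dumps({"command": "ls /repo", "security_risk": "LOW"}),
    },
}

args = json.loads(tool_call["function"]["arguments"])
required = execute_bash["function"]["parameters"]["required"]
print(all(k in args for k in required))  # True
```

That inner arguments string is exactly the layer where a model can name the right function yet drop its parameters.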
Task: "Write a Python function that checks if a number is prime. Include a test."
Gemma 4 26B on OpenHands: 40+ errors
Every parser configuration failed:
| --tool-call-parser | Errors before first success | Result |
|---|---|---|
| pythonic | 5 | Only produced task_tracker:plan |
| hermes | 4 | Same |
| (none) | 7 | Same |
| gemma4 + official chat template | 40+ | Never completed task |
The errors were always the same pattern:
Missing required parameters for function 'execute_bash': {'command', 'security_risk'}
Missing required parameters for function 'str_replace_editor': {'path', 'command', 'security_risk'}
Gemma 4 called the right functions but dropped the parameters. Every time.
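The check that produces this error is simple set arithmetic over the schema's required list. A minimal reconstruction (my sketch, not OpenHands' actual validation code):

```python
def missing_required(tool_schema: dict, arguments: dict) -> set:
    """Return the required parameters absent from a tool call's arguments."""
    required = set(tool_schema["function"]["parameters"]["required"])
    return required - set(arguments)

# Minimal stand-in for the execute_bash schema OpenHands exposes.
execute_bash = {
    "function": {"parameters": {"required": ["command", "security_risk"]}}
}

# Gemma 4 named the right function but sent empty arguments:
print(sorted(missing_required(execute_bash, {})))
# ['command', 'security_risk']

# A complete call passes the check:
print(sorted(missing_required(execute_bash,
                              {"command": "ls", "security_risk": "LOW"})))
# []
```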
But the API works fine?
This is what kept me debugging for hours. The exact same 11 tools + the full OpenHands system prompt (8,892 characters) via curl? Perfect:
curl http://<gx10-ip>:8000/v1/chat/completions -d '{
"model": "gemma-4-26b",
"messages": [{"role": "system", "content": "<8892-char OpenHands prompt>"},
{"role": "user", "content": "Create hello.py"}],
"tools": [<all 11 tools>]
}'
Response: correct str_replace_editor call with all parameters. So the API worked. The framework somehow broke it.
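One structural difference between that curl test and the agent loop: curl exercises a single turn, while the framework resends a growing history in which the model's own tool calls and their results are replayed every step. A sketch of that multi-turn shape (contents abbreviated; the real turns carry the full system prompt):

```python
# Multi-turn history as an OpenAI-compatible framework resends it each step.
# (Shape only; real OpenHands turns carry the full 8,892-char system prompt.)
messages = [
    {"role": "system", "content": "You are a software agent..."},
    {"role": "user", "content": "Create hello.py"},
    {   # the model's previous structured tool call, replayed verbatim
        "role": "assistant",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "str_replace_editor",
                "arguments": '{"command": "create", "path": "/repo/hello.py"}',
            },
        }],
    },
    {   # the tool's output, fed back with the matching call id
        "role": "tool",
        "tool_call_id": "call_0",
        "content": "File created successfully at /repo/hello.py",
    },
]

roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user', 'assistant', 'tool']
```

A one-shot curl never exercises this replay, which is consistent with the failure only appearing inside the conversation loop.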
Qwen 3.5 35B on OpenHands: Success (with caveats)
Qwen 3.5 completed the same task with only 1 error — missing security_risk parameter once, then recovered:
✅ Created prime_check.py with is_prime() function
✅ Ran python3 prime_check.py — all tests passed
✅ Task completed
Critical setup detail: Qwen 3.5's --reasoning-parser qwen3 must be removed. With it enabled, all output gets consumed as "thinking tokens" and the response comes back empty. Use --tool-call-parser qwen3_xml without the reasoning parser.
Test 2: SWE-agent — Different action format, different story
SWE-agent doesn't use function calling at all. The model just writes plain text, and the framework parses it:
💭 THOUGHT
I need to look at the file with the syntax error.
🎬 ACTION
str_replace_editor view /repo/tests/missing_colon.py
No OpenAI function calling is involved; the framework extracts the command directly from the text.
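A minimal sketch of this kind of text-action parsing (simplified; SWE-agent's real parser is more elaborate, and the emoji markers above are display decoration):

```python
import re

def parse_action(text: str):
    """Split a model reply into (thought, action) blocks.
    Simplified sketch of text-action parsing, not SWE-agent's actual code."""
    match = re.search(r"THOUGHT\s*(.*?)\s*ACTION\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("no THOUGHT/ACTION blocks found")
    return match.group(1).strip(), match.group(2).strip()

reply = """THOUGHT
I need to look at the file with the syntax error.
ACTION
str_replace_editor view /repo/tests/missing_colon.py"""

thought, action = parse_action(reply)
print(action)  # str_replace_editor view /repo/tests/missing_colon.py
```

The model only has to produce readable text; there is no JSON layer to get wrong.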
Task: Fix a bug from SWE-agent/test-repo#1 — a missing colon in a Python function definition causing SyntaxError. This is SWE-agent's own test repo, not a real SWE-Bench task — deliberately simple to validate the pipeline.
Gemma 4 26B on SWE-agent: 9 steps, patch submitted
Step 1: find . -maxdepth 2 -not -path '*/.*' → Found repo structure
Step 2: cat tests/missing_colon.py → Read the broken file
Step 3: python3 tests/missing_colon.py → Reproduced the error
Step 4: cat tests/missing_colon.py → Confirmed the issue
Step 5: str_replace_editor (wrong path) → Path error, self-corrected
Step 6: str_replace_editor str_replace → Fixed the bug ✅
Step 7: python3 tests/missing_colon.py → Verified the fix ✅
Steps 8–9: submit → Patch submitted ✅
The patch:
-def division(a: float, b: float) -> float
+def division(a: float, b: float) -> float:
     return a/b
That's the same model that couldn't write "hello world" on OpenHands without 40 errors.
Qwen 3.5 35B on SWE-agent: Fixed but couldn't finish
Qwen 3.5 also found and fixed the bug — it correctly applied the same str_replace edit. But it ran for 96 steps doing excessive git archaeology and never called submit. The fix was correct; the model just couldn't figure out how to end the session.
So what actually happened
| Model | OpenHands (function calling) | SWE-agent (text actions) |
|---|---|---|
| Gemma 4 26B | ❌ 40+ errors, never completed | ✅ 9 steps, patch produced |
| Qwen 3.5 35B | ✅ 1 error, completed | ⚠️ Fixed the bug but ran 96 steps, never called submit |
The difference is the action format. OpenHands requires structured JSON function calls with exact parameter names. Gemma 4 keeps dropping parameters when there are 11 tools competing for its attention. SWE-agent just needs plain text — and Gemma 4 writes plain text fine.
Gemma 4 can code. It just can't fill out JSON forms reliably.
vLLM Configuration That Matters
These cost me most of the afternoon:
1. The chat template (most commonly missed)
# Download from vLLM GitHub
curl -sL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_gemma4.jinja \
-o tool_chat_template_gemma4.jinja
# Mount into container
-v /path/to/tool_chat_template_gemma4.jinja:/app/tool_chat_template_gemma4.jinja
--chat-template /app/tool_chat_template_gemma4.jinja
Without this, Gemma 4's tool calls lose their parameters even when the model generates them correctly. The official vLLM recipe documents this but it's easy to miss.
2. Qwen 3.5's thinking mode must be disabled
# WRONG — all output gets consumed as thinking tokens
--reasoning-parser qwen3 --tool-call-parser qwen3_xml
# RIGHT — tool calling works
--tool-call-parser qwen3_xml
# (no --reasoning-parser)
3. Gemma 4 on Ollama: broken
Gemma 4 E2B, E4B, and 26B all return empty tool calls on Ollama (known bug). Use vLLM with --tool-call-parser gemma4.
What Was Gained
What cost the most time
Debugging Gemma 4's tool calling on OpenHands. I tested four different --tool-call-parser settings, added the official chat template, verified the parser code included the PR #38847 fix, and ran curl tests at every level. The curl tests all passed perfectly — 11 tools, 8,892-character system prompt, correct parameters every time. The problem only appeared inside OpenHands' multi-turn conversation loop.
I should have tried SWE-agent first.
Transferable diagnostics
When a model "can't do tool calling," test it at three levels before concluding it's broken:
- Single API call (curl with 2 tools) — tests basic format
- Full schema (curl with all tools + system prompt) — tests scale
- Framework integration (actual agent loop) — tests multi-turn interaction
If levels 1 and 2 pass but level 3 fails, the problem is the framework, not the model. Switching frameworks is faster than debugging function-calling compatibility.
The thing I'll remember
Next time a local model looks broken on an agent framework, I'm switching frameworks before I start debugging the model. Would've saved me four hours this time.
How to reproduce
- vLLM for Gemma 4: Use the `vllm/vllm-openai:gemma4-cu130` image (not `latest` — it lacks Gemma 4 transformers support)
- Download the chat template: `curl -sL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_gemma4.jinja`
- SWE-agent install: `uv python install 3.12 && git clone SWE-agent && uv pip install -e .`
- Run: `sweagent run --agent.model.name openai/gemma-4-26b --agent.model.api_base http://<your-gx10-ip>:8000/v1 --agent.model.api_key dummy --agent.model.per_instance_cost_limit 0`
- For Qwen 3.5: Remove `--reasoning-parser qwen3`, use `--tool-call-parser qwen3_xml`
This is Part 15 of the "DGX Spark" series.
FAQ
- Can you run SWE-Bench with local open-source models instead of Claude or GPT?
- The pipeline works. We got Gemma 4 26B to fix a test bug in 9 steps via SWE-agent on a DGX Spark — no API costs. But this was a feasibility test on a simple bug (missing colon), not a full SWE-Bench Lite run. We haven't measured resolve rates on the real 300-task benchmark yet.
- Which is better for SWE-Bench — Gemma 4 or Qwen 3.5?
- It depends on the framework. With SWE-agent (text-based actions), Gemma 4 26B solved the task in 9 steps and submitted a patch. Qwen 3.5 35B fixed the bug too but took 96 steps and never submitted. With OpenHands (function calling), Qwen 3.5 worked (1 error) while Gemma 4 produced 40+ errors. Pick the model that matches your framework's action format.
- Why does Gemma 4 fail on OpenHands but work on SWE-agent?
- OpenHands uses OpenAI-style function calling (structured JSON tool calls). Gemma 4's function calling is unreliable with 11 concurrent tools — it drops required parameters. SWE-agent uses text-based actions (plain text commands parsed by the framework), which Gemma 4 handles perfectly. The model's coding ability is fine; it's the function calling format that breaks.
- What vLLM settings are needed for Gemma 4 tool calling?
- Three critical settings: --enable-auto-tool-choice, --tool-call-parser gemma4, and --chat-template tool_chat_template_gemma4.jinja (download from vLLM GitHub). The chat template is the most commonly missed one — without it, Gemma 4's tool calls lose their parameters even though the model generates them correctly.