~/blog/swe-bench-lite-gemma4-26b-38-percent

DGX Spark · part 17

[Benchmark] SWE-bench Lite 38.67% with a 26B Local Model — 0.33% from Claude 3.5 Sonnet Scaffolds

cat --toc

TL;DR

Gemma 4 26B-A4B FP8 scored 38.67% (116/300) on SWE-bench Lite — ranking #16 globally, within 0.33% of Moatless Tools + Claude 3.5 Sonnet (39.00%). Zero API cost, running entirely on a DGX Spark GB10. The differentiator was scaffold design (backticks protocol + edit-tool v2 + budget prompt), not model size.

Plain-Language Version: Local AI fixing real bugs

SWE-bench is a benchmark where an AI agent has to fix real bugs from open-source GitHub projects — Django, scikit-learn, matplotlib, sympy, and others. It's not a toy test. The agent gets a bug report, the full repository code, and has to figure out what file to edit, what to change, and submit a working patch. Most humans would take hours per bug.

We ran all 300 tasks from SWE-bench Lite using Gemma 4 26B — a 26-billion-parameter open-source model from Google — running locally on a single NVIDIA DGX Spark desktop workstation. No cloud APIs, no per-query charges. The model fixed 116 out of 300 bugs (38.67%), landing within 1 solved bug of systems using Claude 3.5 Sonnet — Anthropic's flagship model at the time, accessed via paid API.

The punchline: how we talked to the model mattered more than which model we used.


Preface

Part 15 asked "can we run SWE-bench locally at all?" and got a single clean patch. Part 16 documented two days of debugging false positives and scaffold failures. This article is the payoff — the full 300-task run and what the numbers actually say.


The number

Resolved: 116 / 300 = 38.67%
Run ID:   gemma26b-full-300
Date:     2026-04-17

Where that lands on the leaderboard

SWE-bench Lite leaderboard (snapshot 2026-04-16, swebench.com):

| # | System | Model | % Resolved |
|---|--------|-------|------------|
| 1 | ExpeRepair v1.0 | Claude 4 Sonnet | 60.33 |
| 5 | EntroPO + R2E | Qwen3-Coder-30B | 49.67 |
| 7 | SWE-agent | Claude 3.7 Sonnet | 48.00 |
| 13 | OpenHands CodeAct v2.1 | Claude 3.5 Sonnet | 41.67 |
| 15 | Moatless Tools | Claude 3.5 Sonnet | 39.00 |
| 16 | mini-swe-agent + edit-tool v2 | Gemma 4 26B-A4B FP8 | 38.67 |
| 17 | Patched.Codes Patchwork | — | 37.00 |
| — | SWE-Fixer | Qwen 2.5 72B | 24.67 |

Two observations. First: Moatless Tools uses Claude 3.5 Sonnet — a model roughly 10x larger and accessed via paid API — and beats us by exactly one solved instance (0.33%). Second: SWE-Fixer uses Qwen 2.5 72B — nearly 3x our parameter count — and lands at 24.67%. Scaffold matters.

Why Moatless + Claude 3.5 Sonnet as the primary comparison? It's the closest apples-to-apples on the Lite board: single model, general-purpose scaffold, no multi-agent pipeline, no retrieval augmentation. Qwen3.5-35B-A3B (recently added to SWE-rebench) and Qwen 3.6 are on the radar — running the same scaffold on a different model is the obvious next experiment and will appear in a follow-up post. The real thesis isn't "Gemma 4 beats X" — it's "this scaffold transfers across models."


What actually ran

Model

RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic — Google's Gemma 4, Mixture-of-Experts architecture. 25.2B total parameters, approximately 3.8B active per token. FP8 dynamic quantization (per-channel, no calibration data needed). ~27GB on disk.

Serving

vllm serve /models/gemma4 \
  --served-model-name gemma-4-26b-fp8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 96000 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4

Running inside vllm-node-tf5 (a custom container built for SM121 / GB10 compatibility) on an NVIDIA GB10 with 128GB unified memory. At runtime the model takes ~27GB for weights and ~100GB for KV cache — SWE-bench trajectories are long (100 steps of reading files and editing code), so the KV cache pool needs to be generous.

Do not add --reasoning-parser gemma4 or a custom --chat-template. Both were tested and both degraded patch quality — the model starts doing whole-file rewrites and duplicating existing methods.

Agent

mini-swe-agent v2.2.8, stock default.py (155 lines, unmodified). One patch to interactive.py: the LimitsExceeded handler calls raise instead of input() when confirm_exit=False, so batch runs don't hang on an interactive prompt.

Batch command

mini-extra swebench --subset lite --split test --workers 2 \
  -c swebench_backticks.yaml \
  -c gemma4-bt-edittool.yaml \
  -c environment.pull_timeout=1800 \
  -c environment.container_timeout=4h \
  -o ~/swe-bench-runs/gemma26b-path-b-full-300

Total wall time: ~19 hours. 2 workers in parallel. Zero API cost.


The scaffold — three things that actually mattered

1. Backticks protocol instead of tool_calls

Gemma 4 26B cannot reliably produce OpenAI-style JSON tool calls when there are 10+ tools registered. The tool_calls field comes back empty intermittently, the harness retries, and the model burns its step budget on format errors.

The fix: swebench_backticks.yaml switches the action format to markdown code blocks:

```mswea_bash_command
grep -rn "def _print_sinc" /testbed/sympy/
```

The framework parses these with a regex instead of expecting JSON. Same model, same intelligence — the only change is how we read its output.

2. edit-tool v2 (heredoc API)

Without guardrails, Gemma 4 will cat > /testbed/sympy/printing/ccode.py << 'EOF' and rewrite the entire file from memory. It forgets methods, invents imports, and breaks unrelated code. Every competitive SWE-bench system addresses this — Anthropic's Claude system uses str_replace_based_edit_tool, and we built a compatible version that runs inside the Docker container:

edit-tool str_replace --file /testbed/sympy/printing/ccode.py << 'PATCH'
---OLD---
    def _print_NegativeInfinity(self, expr):
        return '-HUGE_VAL'
---NEW---
    def _print_NegativeInfinity(self, expr):
        return '-HUGE_VAL'

    def _print_sinc(self, expr):
        from sympy import sin, Piecewise, Ne
        x = expr.args[0]
        return self._print(Piecewise((sin(x)/x, Ne(x, 0)), (1, True)))
PATCH

Key constraint: str_replace requires the old text to appear exactly once in the file. Zero matches or multiple matches produce an explicit error that guides the model to view the file and retry with a more specific snippet. This one constraint eliminates the entire class of "rewrote the file and deleted half the codebase" failures.
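The core of that constraint fits in a few lines. A sketch (function name and error wording are illustrative, not the actual edit-tool source):

```python
# Illustrative core of the exactly-once replacement rule: zero or multiple
# matches raise an error that tells the model how to recover, instead of
# silently editing the wrong place.
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError("old text not found; view the file and retry")
    if count > 1:
        raise ValueError(f"old text appears {count} times; include more "
                         "surrounding lines to make it unique")
    Path(path).write_text(text.replace(old, new, 1))
```

The error messages matter as much as the check: they turn a failed edit into a concrete next step the model can act on.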

3. Budget prompt

BUDGET
You have ~100 steps. Aim to:
- steps 1-15: explore + reproduce the issue
- steps 15-40: locate target file + edit with edit-tool
- steps 40-60: verify the fix and submit

Without this, Gemma 4 explores indefinitely. It finds the bug, fixes it, then wonders if maybe fcode.py and jscode.py need the same fix, then wonders about octave.py, and burns 100 steps without ever submitting. The budget prompt creates time pressure. Combined with a step_limit: 100 (down from the default 250), the model learns to be decisive.
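Wired into config, the two pieces travel together: the budget text rides in the system template, the hard cap in the step limit. A fragment along these lines (key names follow mini-swe-agent's YAML conventions as we understand them and may not match the schema exactly):

```yaml
# Illustrative config fragment, not the verbatim run config.
agent:
  step_limit: 100          # down from the default 250
  system_template: |
    ...edit-tool v2 instructions...
    BUDGET
    You have ~100 steps. Aim to:
    - steps 1-15: explore + reproduce the issue
    - steps 15-40: locate target file + edit with edit-tool
    - steps 40-60: verify the fix and submit
```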


Per-repo breakdown

| Repo | Resolved | Total | Rate |
|------|---------:|------:|-----:|
| mwaskom/seaborn | 3 | 4 | 75% |
| psf/requests | 4 | 6 | 67% |
| pylint-dev/pylint | 3 | 6 | 50% |
| django/django | 52 | 113 | 46% |
| scikit-learn | 10 | 23 | 43% |
| pydata/xarray | 2 | 5 | 40% |
| matplotlib | 8 | 23 | 35% |
| astropy | 2 | 6 | 33% |
| sphinx-doc | 5 | 16 | 31% |
| sympy | 23 | 77 | 30% |
| pytest-dev | 4 | 17 | 24% |
| pallets/flask | 0 | 3 | 0% |

Django dominates the corpus (113 of 300 tasks) and the model handles it well — 46% is above the overall average. seaborn at 75% stands out, though the sample is small (4 tasks). Flask at 0% (3 tasks) is a complete miss — the model didn't understand Flask's extension registration pattern in any of the three bugs.

The sympy column tells the real story of diminishing returns: the first 171 instances (which included easier sympy tasks) resolved at 45%, but the remaining 129 (harder, later-sequenced sympy) dropped to 30%.


What didn't work

| Attempt | Why it failed |
|---------|---------------|
| tool_calls protocol with Gemma 4 | Empty tool_calls field intermittently → format retry loop → step budget wasted |
| --reasoning-parser gemma4 | Supposed to clean up internal thought tokens; in practice made the model do whole-file rewrites |
| --chat-template tool_chat_template_gemma4.jinja | Changed the prompt structure the model sees; patch quality dropped |
| NVFP4 quantization | 4-bit weight quantization broke the tool-calling tokens entirely |
| step_limit: 250 (default) | Model over-engineers, never submits |
| step_limit: 30 | Too tight — edit phase gets cut short |
| cat > file rewrites (no edit-tool) | Model deletes existing methods when rewriting from memory |
| system_message_suffix YAML key | Silently dropped by pydantic — the prompt never reached the model |
| OpenHands framework | Function calling with 11 concurrent tools → 40+ errors per instance |

Each of these looked reasonable on paper. The system_message_suffix one cost a full day — the YAML key was valid syntax, the harness didn't warn, but the agent's pydantic model silently dropped the field. The prompt we thought was running never was.


Infrastructure: ARM64 QEMU and the image problem

The GB10 is an ARM64 machine. SWE-bench Docker images are x86_64. Docker on ARM runs x86 containers via QEMU user-mode emulation — it works, but everything is 2-3x slower than native x86.

The real bottleneck was image pulls. Each SWE-bench repo version has its own Docker image (matplotlib alone has 16 variants). These are 1-5GB each. Pulling them through QEMU's emulated decompression was taking 30+ minutes per image.

The fix: pre-pull all 172 unique images in a background job before starting the batch run. One shuffled xargs -P 2 docker pull session pulled everything in about an hour. After that, every docker run started from local cache — the per-instance overhead dropped from "minutes waiting for pull" to "seconds for container startup."
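The pre-pull step itself is small. A sketch, assuming an image list file with one tag per line (the function name and file name are ours, not part of the harness):

```shell
# Illustrative pre-pull helper. Shuffling the list spreads large and small
# images across the parallel workers; -P 2 matches the batch run's 2 workers.
pre_pull() {
  list="$1"                 # text file: one image tag per line
  cmd="${2:-docker pull}"   # pass "echo" for a dry run
  shuf "$list" | xargs -P 2 -n 1 $cmd
}
```

Real usage would be `pre_pull swebench_images.txt` in a background shell before launching the batch; the dry-run argument lets you inspect the pull order first.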


The three-day timeline

April 14: Ran the first batch with what seemed like a working scaffold. Wrote documentation claiming "35-step submit success" and "phased prompting is the key insight." Neither was true — system_message_suffix was silently dropped, and the model was actually aborting on every instance. A full day of work based on a false premise.

April 15: Discovered the pydantic silent-drop bug. Tore apart the scaffold config. Rebuilt from scratch with system_template override. Got Gemma 4 to genuinely submit a correct patch at step 38 — and verified the exit status this time. Started the 300-task batch that night.

April 16–17: Monitored the batch. Pre-pulled Docker images. Fixed ARM64 QEMU bottlenecks. The batch completed after ~19 hours. sb-cli evaluation: 38.67%.


What was gained

What cost the most time

Not model tuning. Not scaffold design. Trusting a document that hadn't been verified. The April 14 "success" documentation was a hypothesis written as fact. It shaped the next day's debugging — every failure was interpreted through the lens of "the phased prompt worked yesterday, so something else must have broken." The real answer was: the phased prompt never ran.

This is why we added confidence: verified | hypothesis | superseded tags to our knowledge graph tool (Musubi) during this project. Notes now carry their verification status, and search results show whether the source material has been checked or is just a best guess.

Transferable diagnostics

  1. Dump the runtime state, don't trust the config file. After any YAML/config change, verify by inspecting the actual system prompt in the first trajectory: python3 -c "import json; t=json.load(open('traj.json')); print(len(t['messages'][0]['content']))". If the length doesn't change, your config isn't loading.

  2. Match the action format to the model's strengths. Tool calling (JSON) is great for large frontier models. For smaller/MoE models, text-based action formats (backticks, XML tags) are more robust. Check the model's tool calling fidelity on a 10-tool setup before committing to a protocol.

  3. Pre-pull container images for cross-architecture batch runs. Any time you're running x86 containers on ARM64 (or vice versa), image decompression under QEMU is the hidden bottleneck. Pre-pulling turns a 30-minute-per-instance tax into a one-time upfront cost.
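The first diagnostic can go one step further: hashing the prompt as well as measuring it catches edits that happen to leave the length unchanged. A sketch, assuming the trajectory layout from the one-liner above (a JSON file whose messages[0] is the system prompt):

```python
# Verify a config change actually reached the model by fingerprinting the
# system prompt in a saved trajectory. Length catches most changes; the
# short hash catches same-length edits.
import hashlib
import json

def system_prompt_fingerprint(traj_path: str) -> tuple[int, str]:
    with open(traj_path) as f:
        traj = json.load(f)
    content = traj["messages"][0]["content"]
    return len(content), hashlib.sha256(content.encode()).hexdigest()[:12]
```

Run it on a trajectory before and after a config change; an identical fingerprint means the change never loaded.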

The pattern that applies everywhere

The gap between Gemma 4 26B (38.67%) and SWE-Fixer with Qwen 72B (24.67%) is 14 percentage points — in favor of the smaller model. The gap between our scaffold and Moatless Tools + Claude 3.5 Sonnet is 0.33%. Scaffold design is a larger lever than model selection, at least in the range where open-weight models have reached.


Conclusion

The full configuration — YAML files, edit-tool source, vLLM serving command, and batch invocation — is documented in the artifact record. Anyone with a GB10 (or any machine that can serve Gemma 4 26B via vLLM) can reproduce this run.

Checklist for reproducing 38.67%:

  1. Serve RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic via vLLM with --tool-call-parser gemma4 (no reasoning parser, no chat template)
  2. Use swebench_backticks.yaml base config (text-based actions, not tool_calls)
  3. Override system_template with edit-tool v2 instructions + budget prompt
  4. Set step_limit: 100, temperature: 0.3
  5. Add --exit-immediately to the batch command
  6. Pre-pull all 172 SWE-bench Lite Docker images before starting
  7. Run with --workers 2, container_timeout=4h, pull_timeout=1800

What's next: This is the first data point in a scaffold engineering series, not the last. The scaffold — backticks + edit-tool v2 + budget prompt — is model-agnostic by design. The natural follow-ups:

  • Gemma 4 E4B (4B) is running on the same scaffold right now. Where's the floor for a 4B MoE? Results will be appended here.
  • Qwen 3.6 and Qwen3.5-35B-A3B are the obvious next models to drop into the same scaffold. If the resolve rate transfers, the thesis holds: scaffold is the lever, model is the variable.
  • Fault localization — the scaffold currently lets the model explore blind. Adding automated traceback parsing before the model starts could cut 10-15 wasted exploration steps.

The question isn't "how high can Gemma 4 score" — it's "how far can a well-designed scaffold carry any open-weight model."


Also in this series: Part 15 — Feasibility test | Part 16 — Scaffold engineering

FAQ

Can a 26B open-source model actually compete with Claude 3.5 Sonnet on SWE-bench?
Yes, with the right scaffold. Gemma 4 26B-A4B FP8 scored 38.67% on SWE-bench Lite — within 0.33% of Moatless Tools + Claude 3.5 Sonnet (39.00%). The model matters less than the action format, edit tool design, and budget prompting. The same model on a naive scaffold scores near zero.
What scaffold changes made the biggest difference for SWE-bench with local models?
Three things: (1) switching from OpenAI tool_calls to backticks regex parsing — Gemma 4 can't reliably produce JSON tool calls but handles markdown code blocks perfectly; (2) a custom edit-tool that enforces exactly-one-match string replacement instead of cat-whole-file rewrites; (3) a budget prompt that tells the model to submit by step 60 of 100.
How long does it take to run SWE-bench Lite 300 tasks on a DGX Spark?
About 19 hours with 2 parallel workers. The ARM64 GB10 runs x86_64 SWE-bench Docker containers via QEMU emulation, which adds overhead. Pre-pulling all 172 Docker images beforehand cut the per-instance time roughly in half.
What's the cost of running SWE-bench Lite with a local model vs cloud APIs?
Local: $0.00 in API fees — just electricity for ~19 hours of GPU time on a DGX Spark. For comparison, leaderboard runs using Claude 3.5 Sonnet average about $0.66 per instance, so 300 instances would cost roughly $200.