DGX Spark · part 18
[Benchmark] Same Scaffold, Three Models: 16% → 38% → 48% on SWE-bench Lite
❯ cat --toc
- Plain-Language Version: One tool, three brains
- Preface
- Three models, one scaffold, zero changes
- Where Qwen 3.6 lands on the leaderboard
- What the three data points tell us
- The scaffold sets the floor
- Model quality is the multiplier
- The scaffold amortizes
- Qwen 3.6 specifics
- Architecture differences
- vLLM serving
- The scaffold — unchanged from Part 17
- What was gained
- What cost the most time
- Transferable diagnostics
- The pattern that applies everywhere
- Conclusion
TL;DR
One scaffold, three models, zero code changes: Gemma 4 E4B → 16.67%, Gemma 4 26B → 38.67%, Qwen 3.6 35B → 48.33% on SWE-bench Lite. Qwen 3.6 beats SWE-agent + Claude 3.7 Sonnet (48.00%) running locally on a DGX Spark at zero API cost. The scaffold is the fixed cost; the model is the variable.
Plain-Language Version: One tool, three brains
In Part 17, we built a simple scaffold — three constraints that tell an AI model how to edit code without breaking things. That scaffold scored 38.67% on SWE-bench Lite with Gemma 4 26B. We ended with a question: does this scaffold transfer to other models?
We ran two more models through the exact same scaffold. No code changes, no prompt tweaks, no config adjustments — just swapped the model behind the API endpoint.
The answer: the scaffold transfers. A 4B model scored 16.67%. A 35B model scored 48.33% — beating a system that uses Claude 3.7 Sonnet, one of the strongest proprietary models available. The engineering work you put into the scaffold pays off every time a better open-source model comes along.
Preface
Part 17 established the scaffold (backticks + edit-tool v2 + budget prompt) and scored 38.67% with Gemma 4 26B. The article ended with: "The question isn't 'how high can Gemma 4 score' — it's 'how far can a well-designed scaffold carry any open-weight model.'"
This article answers that question with data.
Three models, one scaffold, zero changes
Scaffold: backticks protocol + edit-tool v2 + budget prompt
Config: swebench_backticks.yaml + gemma4-bt-edittool.yaml
Changes: model_name swapped via CLI flag. Nothing else touched.
| Model | Active params | Total params | Resolved | % |
|---|---|---|---|---|
| Gemma 4 E4B BF16 | ~1B | 4B | 50/300 | 16.67% |
| Gemma 4 26B-A4B FP8 | 3.8B | 25.2B | 116/300 | 38.67% |
| Qwen 3.6-35B-A3B FP8 | 3B | 35B | 145/300 | 48.33% |
The only difference between runs was the model.model_name config value. Same YAML files. Same edit-tool. Same budget prompt. Same Docker images. Same machine.
Where Qwen 3.6 lands on the leaderboard
SWE-bench Lite leaderboard (snapshot 2026-04-16, swebench.com):
| # | System | Model | % Resolved |
|---|---|---|---|
| 1 | ExpeRepair v1.0 | Claude 4 Sonnet | 60.33 |
| 5 | EntroPO + R2E | Qwen3-Coder-30B | 49.67 |
| → | mini-swe-agent + edit-tool v2 | Qwen 3.6-35B-A3B FP8 | 48.33 |
| 7 | SWE-agent | Claude 3.7 Sonnet | 48.00 |
| 13 | OpenHands CodeAct v2.1 | Claude 3.5 Sonnet | 41.67 |
| 15 | Moatless Tools | Claude 3.5 Sonnet | 39.00 |
| 16 | mini-swe-agent + edit-tool v2 | Gemma 4 26B-A4B FP8 | 38.67 |
Qwen 3.6 with our scaffold beats SWE-agent + Claude 3.7 Sonnet by 0.33 percentage points. An open-weight model with 3B active parameters, running locally on a single desktop workstation, at zero API cost.
What the three data points tell us
The scaffold sets the floor
Gemma 4 E4B (1B active) scored 16.67%. That's the floor — the scaffold can't compensate for a model that simply doesn't have enough capacity to understand complex codebases. But 50 real bug fixes from a 4B model is still remarkable. Without the scaffold (raw tool_calls on a naive setup), this model scores near zero.
Model quality is the multiplier
From 16.67% to 38.67% to 48.33% — each model upgrade delivered a meaningful jump. But notice the diminishing returns: the first ~4x increase in active parameters (1B → 3.8B) bought 22 percentage points. The next jump (3.8B → 3B active, but 35B total with more experts) bought 10 points. More model helps, but it's not linear.
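The deltas can be tallied in a few lines; scores and active-parameter counts are taken from the results table above:

```python
# Resolved-rate deltas per model upgrade (numbers from the table above).
runs = [
    ("Gemma 4 E4B",  1.0, 16.67),   # ~1B active params
    ("Gemma 4 26B",  3.8, 38.67),   # 3.8B active
    ("Qwen 3.6 35B", 3.0, 48.33),   # 3B active, 35B total
]

for (_, p0, s0), (name, p1, s1) in zip(runs, runs[1:]):
    # Each upgrade's parameter ratio vs. the score it bought.
    print(f"{name}: {p1 / p0:.1f}x active params -> +{s1 - s0:.2f}pp")
```

The second upgrade actually *reduced* active parameters (3.8B → 3B) while adding total capacity, which is why "more model" is better read as total capacity plus training quality, not active-parameter count alone.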
The scaffold amortizes
The engineering work that went into Parts 15-17 — debugging pydantic silent drops, discovering backticks protocol, building edit-tool v2, tuning the budget prompt — was done once. Running Qwen 3.6 cost zero additional engineering time. Just:
```
-c model.model_name=hosted_vllm/qwen36-35b-a3b-fp8
```
Every future model that supports text-based actions gets the scaffold for free.
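As a sketch of what "the scaffold for free" means in practice: the per-run command differs only in the model_name override. The entry-point name and flag layout below are illustrative assumptions; only the -c override and the YAML file names come from this series.

```python
# Hypothetical command builder: everything is fixed except model_name.
# (Entry-point name and flag layout are assumptions for illustration.)
def build_cmd(model_name: str) -> list[str]:
    return [
        "mini-swe-agent",                        # agent entry point (assumed)
        "--config", "swebench_backticks.yaml",   # scaffold config, unchanged
        "--config", "gemma4-bt-edittool.yaml",   # edit-tool config, unchanged
        "-c", f"model.model_name={model_name}",  # the only per-run difference
    ]

for m in ["hosted_vllm/gemma4-e4b",
          "hosted_vllm/gemma4-26b-a4b-fp8",
          "hosted_vllm/qwen36-35b-a3b-fp8"]:
    print(" ".join(build_cmd(m)))
```

Three runs, one variable. Everything before the final `-c` flag is byte-identical across all three benchmarks.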
Qwen 3.6 specifics
Architecture differences
Qwen 3.6-35B-A3B is not a standard Transformer. It uses DeltaNet hybrid attention — a mix of linear attention (Gated DeltaNet) and standard attention in a 3:1 ratio. This has implications:
- Context length: 262K native (vs Gemma 4's 96K) — more room for long SWE-bench trajectories
- Inference speed: slower under vLLM. The DeltaNet kernels aren't as optimized as standard attention. Our 300-task run took ~36 hours vs ~19 hours for Gemma 4 26B.
- Expert count: 256 experts with 8 routed + 1 shared per token (vs Gemma 4's architecture). More experts, fewer active.
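The 3:1 mix can be pictured as a repeating layer schedule: three linear-attention (Gated DeltaNet) layers, then one standard-attention layer. A toy sketch (the 48-layer count is an assumption for illustration, not Qwen 3.6's actual depth):

```python
# Toy layer schedule for a 3:1 hybrid: every 4th layer is full attention,
# the rest are linear DeltaNet. Layer count is illustrative only.
def layer_types(n_layers: int, ratio: int = 3) -> list[str]:
    return ["full_attn" if (i + 1) % (ratio + 1) == 0 else "deltanet"
            for i in range(n_layers)]

schedule = layer_types(48)
print(schedule[:8])
print(schedule.count("deltanet"), schedule.count("full_attn"))
```

Because only a quarter of the layers carry a full KV cache, memory per token of context drops sharply, which is how the model affords a 262K native context.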
vLLM serving
```shell
vllm serve /models/qwen36 \
  --served-model-name qwen36-35b-a3b-fp8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 96000 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
Note: --tool-call-parser hermes instead of gemma4. But since we use the backticks scaffold (LitellmTextbasedModel), the tool-call parser is irrelevant — actions are parsed by regex, not by the vLLM tool-call protocol.
The scaffold — unchanged from Part 17
For reference, here's what transferred across all three models:
1. Backticks protocol — model outputs markdown code blocks instead of JSON tool calls. Parsed by regex.
2. edit-tool v2 — 90-line Python script injected into each Docker container at startup. Enforces exactly-one-match string replacement. Prevents whole-file rewrites.
3. Budget prompt — five lines of text telling the model to submit by step 60 of 100.
That's it. Three constraints. 90 lines of Python. Five lines of prompt. The rest is the model's own capability.
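The core invariant of edit-tool v2 can be sketched in a few lines (an illustration of the constraint, not the actual 90-line script):

```python
# Sketch of edit-tool v2's core rule: the search string must match
# exactly once, or the edit is rejected. Ambiguous edits and
# whole-file rewrites both fail this check.
def apply_edit(text: str, search: str, replace: str) -> str:
    n = text.count(search)
    if n == 0:
        raise ValueError("search string not found; edit rejected")
    if n > 1:
        raise ValueError(f"search string matched {n} times; make it unique")
    return text.replace(search, replace)

src = "def add(a, b):\n    return a - b\n"
print(apply_edit(src, "return a - b", "return a + b"))
```

The exactly-one-match rule is what forces the model to quote enough surrounding context to pin down a unique location, instead of pasting back an entire (often subtly corrupted) file.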
What was gained
Two more data points for the scaffold-transfer question, a 48.33% result that places the scaffold above SWE-agent + Claude 3.7 Sonnet on the leaderboard, and evidence that the Parts 15-17 engineering carries over to an entirely different model family, all for the cost of a single config change.
What cost the most time
Waiting. The three runs together took ~67 hours of wall time (19h + 12h + 36h). The actual engineering for Part 18 was changing one config value and pressing enter. The scaffold work from Parts 15-17 was the real investment — and it's now fully amortized.
Transferable diagnostics
- DeltaNet models are slower under vLLM. If you're benchmarking a hybrid-attention model (DeltaNet, Mamba, SSM), budget 2x the wall time compared to a pure Transformer of similar size. The kernels aren't there yet.
- Tool-call parser doesn't matter for text-based scaffolds. We used --tool-call-parser hermes for Qwen 3.6 and --tool-call-parser gemma4 for Gemma 4. Both are irrelevant — the backticks scaffold bypasses the tool-call protocol entirely. Don't waste time finding the "right" parser.
- Watch the diminishing returns. 1B → 3.8B active gave +22pp. 3.8B → 3B (but 35B total) gave +10pp. The next model might give +5pp. At some point, scaffold improvements (fault localization, test-driven loop) will outperform model upgrades.
The pattern that applies everywhere
Invest in the scaffold, not the model. The model changes every quarter — Qwen 3.7 is probably months away. The scaffold constraints (don't rewrite files, submit within budget, use text-based actions) address fundamental behavioral patterns that every model shares. Build for the behavior, not the brand.
Conclusion
Three models. One scaffold. Zero changes between runs:
| Model | Active | % Resolved | Cost |
|---|---|---|---|
| Gemma 4 E4B | ~1B | 16.67% | $0 |
| Gemma 4 26B | 3.8B | 38.67% | $0 |
| Qwen 3.6 35B | 3B | 48.33% | $0 |
The 48.33% beats SWE-agent + Claude 3.7 Sonnet (48.00%). The scaffold — backticks + edit-tool v2 + budget prompt — is the fixed cost. Every new model is free upside.
What's next: Fault localization (auto-parsing tracebacks before the model starts) is the next scaffold improvement. It should cut 10-15 wasted exploration steps regardless of which model is running. That's a scaffold upgrade, not a model upgrade — and it'll transfer to every model we run after it.
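As a sketch of that upgrade (an assumed shape, not the implementation): parse the issue's Python traceback up front and hand the model a ranked list of candidate files before step one.

```python
import re

# Pull (file, line) candidates out of a Python traceback so the agent
# can start at the likely fault site instead of exploring blind.
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+)')

def localize(traceback_text: str) -> list[tuple[str, int]]:
    # Python tracebacks list the innermost frame last; reverse so the
    # most likely fault site comes first.
    return [(f, int(n)) for f, n in FRAME_RE.findall(traceback_text)][::-1]

tb = '''Traceback (most recent call last):
  File "src/app/main.py", line 12, in <module>
    run()
  File "src/app/core.py", line 87, in run
    raise ValueError("bad config")
ValueError: bad config'''
print(localize(tb))   # [('src/app/core.py', 87), ('src/app/main.py', 12)]
```

Prepending such a candidate list to the task prompt is model-agnostic by construction, which is exactly what makes it a scaffold upgrade rather than a model upgrade.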
Also in this series: Part 15 — Feasibility test | Part 16 — Scaffold engineering | Part 17 — Gemma 4 26B: 38.67%
FAQ
- Can the same SWE-bench scaffold work across different open-source models?
- Yes. We ran backticks + edit-tool v2 + budget prompt on three models with zero code changes: Gemma 4 E4B (16.67%), Gemma 4 26B (38.67%), and Qwen 3.6 35B (48.33%). The scaffold transfers — you invest engineering time once and reuse it every time a better model drops.
- How does Qwen 3.6 35B compare to Claude 3.7 Sonnet on SWE-bench Lite?
- Qwen 3.6-35B-A3B FP8 scored 48.33% on our scaffold, slightly above SWE-agent + Claude 3.7 Sonnet (48.00%). Qwen 3.6 is open-weight, runs locally at zero API cost, with 3B active parameters per token (MoE architecture).
- What's the minimum model size that can still solve real GitHub bugs?
- Gemma 4 E4B with ~1B active parameters solved 50 out of 300 SWE-bench Lite tasks (16.67%) on the same scaffold. That's roughly on par with MCTS-Refine-7B (16.33%), showing that even very small models can contribute — but the ceiling is low.
- How long does it take to run SWE-bench Lite with Qwen 3.6 on a DGX Spark?
- About 36 hours with 2 parallel workers. Qwen 3.6 uses DeltaNet hybrid attention which is slower than standard Transformer attention under vLLM. Gemma 4 26B (pure Transformer MoE) finished the same 300 tasks in ~19 hours.