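Can you run SWE-Bench with local open-source models, without Claude or GPT?

The pipeline works end to end. We used SWE-agent + Gemma 4 26B on a DGX Spark to fix a test bug in 9 steps, with no API cost. But this was a feasibility test (a simple missing-colon bug); we have not yet run the full 300-task SWE-Bench Lite. Full resolve-rate numbers are still being planned.

For SWE-Bench, should you use Gemma 4 or Qwen 3.5?

It depends on the framework. On SWE-agent (text-command format), Gemma 4 26B solved the task in 9 steps and submitted a patch. Qwen 3.5 35B also fixed the bug but ran for 96 steps and never submitted. On OpenHands (function-calling format), Qwen 3.5 is usable (1 error) while Gemma 4 produced 40+ errors. Pick the model that matches your framework's action format.

Why does Gemma 4 fail on OpenHands but succeed on SWE-agent?

OpenHands uses OpenAI-style function calling (structured JSON tool calls). With 11 tools exposed at once, Gemma 4's function calling is unstable: it drops required parameters. SWE-agent's default 0.7 config uses text-format actions (plain-text commands parsed by the framework), which Gemma 4 handles well. (SWE-agent has also shipped a function-calling config since 0.7, but this post uses the text default.) The model's coding ability is fine; what breaks is the function-calling format.

What settings does vLLM need for Gemma 4 tool calling?

Three key settings: --enable-auto-tool-choice, --tool-call-parser gemma4, and --chat-template tool_chat_template_gemma4.jinja (downloaded from the vLLM GitHub repo). The chat template is the easiest to miss: without it, Gemma 4's tool calls lose their parameters even when the model generated them correctly.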

DGX Spark · part 15

[AI Agent] Gemma 4: From 40 Failures to a 9-Step Bug Fix, With Only One Thing Changed

2026-04-13 · 7 min read · #swe-bench #gemma-4 #qwen-3.5 #openhands

TL;DR

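A feasibility test, not a full benchmark. Gemma 4 26B produced 40+ tool-calling errors on OpenHands; after switching to SWE-agent it fixed a test bug in 9 steps. Same model, same GX10. The pipeline works end to end; the full SWE-Bench Lite (300 tasks) is the next step.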

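In plain terms: can a desktop AI computer fix real bugs?

SWE-Bench is the standard test of whether an AI can fix real software bugs: it takes real GitHub issues from well-known open-source projects and checks whether the AI can produce a working patch. The leaderboard is almost entirely Claude and GPT running through cloud APIs.

I wanted to know whether I could skip the API bill entirely and run open-source models on my own hardware. So I tested Gemma 4 26B and Qwen 3.5 35B on a DGX Spark, first with OpenHands, then with SWE-agent. I spent four hours believing Gemma 4 was broken before finding that the problem was in OpenHands, not the model.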

Environment

Hardware: NVIDIA DGX Spark (GB10, 128 GB unified memory, SM121)
Models: Gemma 4 26B-A4B NVFP4 and Qwen 3.5 35B-A3B FP8, both served through vLLM
Frameworks: OpenHands v0.59.0, SWE-agent v1.1.0
Agent host: Mac mini M4 (16 GB, OrbStack Docker), running the agent and sandbox and reaching vLLM on the GX10 over Tailscale

Test 1: OpenHands, Function-Calling Hell

OpenHands uses OpenAI-style function calling: the model receives tool schemas as JSON and must return a structured tool_calls object. It exposes 11 tools: execute_bash, str_replace_editor, think, finish, browser, execute_ipython_cell, task_tracker, fetch, create_pr, create_mr, create_bitbucket_pr.

Task: "Write a Python function that checks whether a number is prime, with tests."

Gemma 4 26B on OpenHands: 40+ errors

Every parser setting failed:

`--tool-call-parser`	Errors before success	Result
`pythonic`	5	Only produced `task_tracker:plan`
`hermes`	4	Same as above
none	7	Same as above
`gemma4` + official chat template	40+	Never completed the task

All the errors follow the same pattern:

Missing required parameters for function 'execute_bash': {'command', 'security_risk'}
Missing required parameters for function 'str_replace_editor': {'path', 'command', 'security_risk'}

Gemma 4 called the right function but dropped every parameter. Every single time.
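To make the error pattern above concrete, here is a minimal sketch of the kind of check that produces a "missing required parameters" result. The schema is a hypothetical, trimmed-down stand-in for one tool (the real OpenHands schema has more fields), the `"LOW"` value is purely illustrative, and `missing_required` is a name invented for this example, not an OpenHands function.

```python
import json

# Hypothetical, trimmed-down schema for a single tool (assumption, not the real one).
EXECUTE_BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "execute_bash",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string"},
                "security_risk": {"type": "string"},
            },
            "required": ["command", "security_risk"],
        },
    },
}


def missing_required(tool: dict, arguments_json: str) -> set[str]:
    """Return the required parameter names absent from a tool call's arguments."""
    required = set(tool["function"]["parameters"].get("required", []))
    supplied = json.loads(arguments_json or "{}")
    return required - set(supplied)


# Right function name, empty arguments: the shape seen in the failing runs.
print(sorted(missing_required(EXECUTE_BASH_TOOL, "{}")))
# A well-formed call has nothing missing.
print(sorted(missing_required(
    EXECUTE_BASH_TOOL, '{"command": "ls", "security_risk": "LOW"}')))
```

Running a check like this on logged tool calls separates "the model picked the wrong tool" from "the model picked the right tool and sent no arguments", which is the case described here.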

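But the API itself works?

This is what kept me debugging for hours. The same 11 tools plus the full OpenHands system prompt (8,892 characters), sent with curl? Perfect:

curl http://<gx10-ip>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "gemma-4-26b",
  "messages": [{"role": "system", "content": "<the 8,892-character OpenHands prompt>"},
               {"role": "user", "content": "Create hello.py"}],
  "tools": [<all 11 tools>]
}'

Response: a correct str_replace_editor call with every parameter present. At the API level nothing is wrong. For reasons I have not pinned down, the framework's multi-turn conversation breaks it.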
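The same single-turn check can be scripted so it is easy to rerun with different tool sets. This is a sketch using only the standard library; the function names are invented for this example, and the response layout assumed in `tool_calls_of` is the usual OpenAI-style `choices[0].message.tool_calls`.

```python
import json
import urllib.request


def build_chat_request(model: str, system_prompt: str,
                       user_message: str, tools: list) -> dict:
    """Assemble an OpenAI-style chat completion payload that carries tool schemas."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "tools": tools,
    }


def post_chat_request(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def tool_calls_of(response: dict) -> list:
    """Extract tool calls from the first choice; empty list if there are none."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []
```

A call would look like `tool_calls_of(post_chat_request("http://<gx10-ip>:8000/v1", payload))`, with the system prompt and tool list loaded from wherever you saved them.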

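Qwen 3.5 35B on OpenHands: success (with a caveat)

Qwen 3.5 finished the same task with a single error: it omitted the security_risk parameter once, then corrected itself:

✅ Created prime_check.py containing an is_prime() function
✅ Ran python3 prime_check.py; all tests passed
✅ Task complete

Key setting: Qwen 3.5's --reasoning-parser qwen3 must be removed. With it on, all output is consumed as thinking tokens and the returned content is empty. Use --tool-call-parser qwen3_xml and do not add a reasoning parser.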
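A quick way to spot this symptom in a response is to check for a message that has reasoning text but no content and no tool calls. The sketch below is an assumption-heavy illustration: the field names `reasoning_content` and `reasoning` are guesses at where a server might put the thinking text and may differ across versions, and the function name is invented here.

```python
def looks_swallowed_by_reasoning(message: dict) -> bool:
    """True when a reply has reasoning text but no content and no tool calls."""
    has_content = bool((message.get("content") or "").strip())
    has_tool_calls = bool(message.get("tool_calls"))
    # Field names are assumptions; check what your server actually returns.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    return not has_content and not has_tool_calls and bool(reasoning.strip())


# Shape of the failure described above: everything landed in the reasoning field.
print(looks_swallowed_by_reasoning(
    {"content": "", "reasoning_content": "I should create prime_check.py ..."}))
# A normal tool-calling reply is not flagged.
print(looks_swallowed_by_reasoning(
    {"content": None, "tool_calls": [{"function": {"name": "execute_bash"}}]}))
```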

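Test 2: SWE-agent, a Different Action Format, a Different World

SWE-agent (with the default 0.7 config, which is what this post uses) does not use function calling. The model just writes plain text and the framework parses it:

💭 THOUGHT
I need to look at the file with the syntax error.

🎬 ACTION
str_replace_editor view /repo/tests/missing_colon.py

The framework is responsible for parsing the text. OpenAI function calling is not involved at all.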
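As a rough illustration of what "the framework parses the text" can mean, here is a minimal sketch that splits a reply into a thought and an action. It assumes a reply format of free text followed by one fenced command block; this is a simplification written for this post, not SWE-agent's actual parser, and `parse_thought_action` is an invented name.

```python
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

# The body of a fenced block is the action; an optional language tag is skipped.
ACTION_PATTERN = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)\n?" + re.escape(FENCE), re.DOTALL
)


def parse_thought_action(reply: str) -> tuple[str, str]:
    """Split a plain-text model reply into (thought, action)."""
    matches = list(ACTION_PATTERN.finditer(reply))
    if not matches:
        raise ValueError("no fenced action block found in model reply")
    last = matches[-1]
    thought = reply[: last.start()].strip()  # everything before the last block
    action = last.group(1).strip()
    return thought, action


reply = (
    "I need to look at the file with the syntax error.\n"
    f"{FENCE}\n"
    "str_replace_editor view /repo/tests/missing_colon.py\n"
    f"{FENCE}\n"
)
print(parse_thought_action(reply))
```

The point of the design is that the model only has to produce a command line as text. There is no JSON object whose keys and types must match a schema, so there are no required parameters for it to drop.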

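Task: fix the bug in SWE-agent/test-repo#1, where a Python function definition is missing a colon and raises a SyntaxError. This is SWE-agent's own test repo, not a real SWE-Bench task; it is deliberately simple and serves to verify that the pipeline works end to end.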

Gemma 4 26B on SWE-agent: done in 9 steps, patch submitted

Step 1: find . -maxdepth 2 -not -path '*/.*'     → inspect the repo layout
Step 2: cat tests/missing_colon.py                → read the buggy file
Step 3: python3 tests/missing_colon.py            → reproduce the error
Step 4: cat tests/missing_colon.py                → confirm where the problem is
Step 5: str_replace_editor (wrong path)           → corrected the path on its own
Step 6: str_replace_editor str_replace            → bug fixed ✅
Step 7: python3 tests/missing_colon.py            → fix verified ✅
Step 8-9: submit                                  → patch submitted ✅

Patch:

-def division(a: float, b: float) -> float
+def division(a: float, b: float) -> float:
     return a/b

This is the same model that could not even write hello world on OpenHands and racked up 40+ errors.
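The one-character patch above can be checked without the repo. This sketch compiles both versions of the function from strings; `compiles` and `load_division` are helper names made up for this example.

```python
BUGGY = "def division(a: float, b: float) -> float\n    return a/b\n"
FIXED = "def division(a: float, b: float) -> float:\n    return a/b\n"


def compiles(source: str) -> bool:
    """Return True if the source parses as valid Python."""
    try:
        compile(source, "<missing_colon>", "exec")
    except SyntaxError:
        return False
    return True


def load_division(source: str):
    """Execute the source in a fresh namespace and return its division function."""
    namespace: dict = {}
    exec(compile(source, "<missing_colon>", "exec"), namespace)
    return namespace["division"]


print(compiles(BUGGY), compiles(FIXED))  # False True
print(load_division(FIXED)(10, 4))       # 2.5
```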

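Qwen 3.5 35B on SWE-agent: fixed the bug but could not finish

Qwen 3.5 also found and fixed the bug, making the same str_replace edit correctly. But it ran for 96 steps, spent much of that time analyzing git history, and never called submit. The fix was correct; the model simply did not know how to end the task.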

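So what actually happened

	OpenHands (function calling)	SWE-agent (text commands)
Gemma 4 26B	❌ 40+ errors, never completed	✅ Done in 9 steps, correct patch
Qwen 3.5 35B	✅ 1 error, task completed	⚠️ Bug fixed, but 96 steps and no submit

The difference is the action format. OpenHands demands structured JSON function calls with the right parameter names and types. With 11 tools crowded in at once, Gemma 4 starts dropping parameters. SWE-agent needs only plain text, and Gemma 4 has no trouble writing plain text.

Gemma 4 can write code. It just cannot reliably fill in a JSON form.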

The vLLM settings that matter

These settings ate most of my afternoon:

1. Chat template (the easiest to miss)

# Download from the vLLM GitHub repo
curl -sL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_gemma4.jinja \
  -o tool_chat_template_gemma4.jinja

# Mount it into the container and point vLLM at it
-v /path/to/tool_chat_template_gemma4.jinja:/app/tool_chat_template_gemma4.jinja
--chat-template /app/tool_chat_template_gemma4.jinja

Without it, Gemma 4's tool calls lose their parameters even when the model generated them correctly. The official vLLM recipe mentions this, but it is easy to overlook.
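Since a forgotten flag fails silently, a small preflight check on the server command line can help. The sketch below only covers the three flags this post names for Gemma 4; the helper names are invented for this example, and the check looks at flag presence only, not at whether the template file exists.

```python
REQUIRED_FLAGS = ("--enable-auto-tool-choice", "--tool-call-parser", "--chat-template")


def gemma4_tool_calling_args(chat_template_path: str) -> list[str]:
    """Return the three server flags this post calls out for Gemma 4 tool calling."""
    return [
        "--enable-auto-tool-choice",
        "--tool-call-parser", "gemma4",
        "--chat-template", chat_template_path,
    ]


def missing_flags(argv: list[str]) -> list[str]:
    """List which required flags are absent, accepting '--flag value' and '--flag=value'."""
    def present(flag: str) -> bool:
        return any(arg == flag or arg.startswith(flag + "=") for arg in argv)

    return [flag for flag in REQUIRED_FLAGS if not present(flag)]


args = gemma4_tool_calling_args("/app/tool_chat_template_gemma4.jinja")
print(missing_flags(args))                              # nothing missing
print(missing_flags(["--tool-call-parser", "gemma4"]))  # two flags missing
```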

2. Qwen 3.5's thinking mode must be turned off

# Wrong: all output is consumed as thinking tokens
--reasoning-parser qwen3 --tool-call-parser qwen3_xml

# Right: tool calling works
--tool-call-parser qwen3_xml
# (do not add --reasoning-parser)

3. Gemma 4 is broken on Ollama

Tool calls from Gemma 4 E2B, E4B, and 26B all come back empty on Ollama (a known bug). Use vLLM with --tool-call-parser gemma4.

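Takeaways

Where most of the time went

Debugging Gemma 4's tool calling on OpenHands. I tried four --tool-call-parser settings, added the official chat template, confirmed that the parser code included the fix from PR #38847, and ran curl tests at every level. The curl tests all passed: 11 tools, an 8,892-character system prompt, complete parameters. The problem appeared only in OpenHands' multi-turn conversations.

I wish I had started with SWE-agent.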

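A transferable diagnostic method

When a model "can't do tool calling", test at three levels before drawing a conclusion:

1. Single API call (curl + 2 tools): tests the basic format
2. Full schema (curl + all tools + system prompt): tests scale
3. Framework integration (the actual agent loop): tests multi-turn interaction

If 1 and 2 pass but 3 fails, the problem is in the framework, not the model. Switching frameworks is faster than debugging function-calling compatibility.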
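The three-level rule can be written down as a small decision function. This is only a sketch of the reasoning above; the verdict wording and the names `Verdict` and `diagnose` are my own for this example.

```python
from enum import Enum


class Verdict(Enum):
    BASIC_FORMAT = "level 1 fails: check the model, parser, and chat template"
    SCHEMA_SCALE = "level 2 fails: breaks only with the full tool set and system prompt"
    FRAMEWORK = "level 3 fails: suspect the framework's multi-turn integration"
    OK = "all three levels pass"


def diagnose(single_call_ok: bool, full_schema_ok: bool, agent_loop_ok: bool) -> Verdict:
    """Map the three test results to the first level that fails."""
    if not single_call_ok:
        return Verdict.BASIC_FORMAT
    if not full_schema_ok:
        return Verdict.SCHEMA_SCALE
    if not agent_loop_ok:
        return Verdict.FRAMEWORK
    return Verdict.OK


# The situation in this post: both curl tests passed, the OpenHands loop failed.
print(diagnose(True, True, False).value)
```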

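What I'll remember

Next time a local model looks broken on some agent framework, I will switch frameworks before debugging the model. That would have saved four hours this time.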

How to reproduce

Run Gemma 4 on vLLM: use the vllm/vllm-openai:gemma4-cu130 image (not latest, which lacks the transformers support Gemma 4 needs)
Download the chat template: curl -sL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_gemma4.jinja -o tool_chat_template_gemma4.jinja
Install SWE-agent: uv python install 3.12 && git clone SWE-agent && uv pip install -e .
Run: sweagent run --agent.model.name openai/gemma-4-26b --agent.model.api_base http://<your-gx10-ip>:8000/v1 --agent.model.api_key dummy --agent.model.per_instance_cost_limit 0
For Qwen 3.5: drop --reasoning-parser qwen3 and use --tool-call-parser qwen3_xml

This is part 15 of the "DGX Spark" series.
