My local LLM agent keeps flailing on a task — should I fine-tune it to be a better agent?

Measure first. I pulled my agent's tool-call logs before fine-tuning and found 0% of its calls were malformed — the model was a perfectly good agent. The flailing came from one broken tool that errored 15 times, which made the model improvise a whole pipeline by hand and lose track of its state. Fixing the tool took an afternoon; fine-tuning would have been days spent on the wrong layer.

What is an ACI (Agent-Computer Interface) and why does it matter more than model size?

An ACI is the set of tools and their interface that an agent acts through — the SWE-agent paper (Yang et al., 2024) introduced the framing and showed that, holding the model fixed, interface design materially changes how well an agent does. A capable model given one obvious, reliable tool stops improvising; the same model given a flaky tool will route around it and flail. For my image-gen case, replacing a broken ComfyUI plugin with a single clean script cut the model from ~20 hand-rolled calls to one.

My local agent flailed at a task when another sibling didn't — is the model too weak?

Don't jump to the model. My sibling yui handled media generation fine while my Qwen3.6-35B-A3B sibling hikari melted down — which tempted me to call the 35B the weaker agent. But yui runs a different brain (ChatGPT Codex), so that was never a fair comparison. hikari's own tool-call log showed 0% malformed calls — it's a fine agent. It flailed because it does the most media work and kept hitting one broken ComfyUI tool. Same 35B plus a clean tool = no more flailing, so the harness was the cause, not the weights.

Is a broken tool worse than no tool for an LLM agent?

Yes. A tool that errors invites the model to improvise around it — in my case, hand-writing a ComfyUI submit/poll/download loop in raw Python until it lost state. Removing the broken tool entirely, so the only path is a clean high-level skill, is cleaner than leaving a half-working one the model can rediscover and fall back into.

DGX Spark · part 35

[AI Agent] My Local Agent Flailed at Image Gen — It Was the Harness, Not the Weights

2026-06-02 · updated 2026-07-05 · 12 min read · #ai-agent #aci #harness #comfyui

❯ cat --toc

In plain English: the agent wasn't dumb, its tools were broken
Introduction
The symptom: my agent improvised a whole ComfyUI pipeline by hand
My wrong instinct, and the one cheap check that killed it
Measure first: 0% of the tool calls were malformed
The real cause: a broken tool invited the model to improvise
The fix: one gen.py call, not twenty — a clean ACI skill
Removing the broken tool entirely, not just adding a good one
Takeaways
TL;DR

TL;DR

My local agent sibling hikari (Qwen3.6-35B-A3B abliterated on a DGX Spark) kept going haywire when asked to generate images or video — hand-rolling an entire ComfyUI pipeline in raw Python, losing track of state, re-sending the same clip. My other sibling yui handled media fine, so my gut said "the 35B is a weak agent, fine-tune it." But yui runs a different brain (ChatGPT Codex) — not a fair comparison — so before spending days on that, I read hikari's actual tool-call logs in state.db: 0% of the calls were malformed, wrong-tool was 0.5% (11 of 2327). The model was a great agent. The real culprit was one ComfyUI plugin tool that errored 15 times, so the model abandoned it and improvised — that's the flailing. The fix was a clean ACI skill (gx10-media: one gen.py call → one media file), plus deleting the broken tool from config so the model can't fall back into it. Harness > weights.

In plain English: the agent wasn't dumb, its tools were broken

I run two AI assistants of my own on a desktop AI box at home. They share the same brain — the same local model. One of them, when I asked it to "make me an image" or "make me a short video," would visibly lose the plot: it would try, fail, try a different way, download the same file twice, then doubt itself and start over. The other one did it fine. Same brain, different behavior.

My first instinct was the obvious one: this model is too small to be a good assistant, so I should train it to be better. Training a model takes days. So before doing that, I did the boring thing — I opened the log of every action the assistant had ever taken and counted how often it actually made a mistake.

It almost never made a mistake. Every command it issued was well-formed. The "going haywire" wasn't the assistant being stupid — it was one of its tools being broken. The tool that was supposed to make images would crash, so the assistant, trying to be helpful, would reach past it and assemble the whole thing by hand in code — and that improvised mess is where it got lost. The assistant that did more image work simply hit the broken tool more often, so it looked dumber.

The fix wasn't a smarter brain. It was giving the assistant one clean button to press, and removing the broken one entirely so it couldn't reach for it again. One command in, one finished image or video out.

Introduction

There's an old sysadmin reflex: when a user "can't do something," watch what they actually clicked before you conclude they're incompetent. Nine times out of ten the button was mislabeled.

This is Part 33 and Part 34's quieter sequel. Those were about the weights — getting an NVFP4 LLM and an NVFP4 video model both fast and resident on one DGX Spark. This one is about the harness: the same box serves my agent siblings (hikari, kiriha, yui), and one of them — hikari — kept melting down on exactly the creative tasks Parts 33–34 made possible.

The interesting part isn't the bug. It's that I almost fixed the wrong layer — and the thing that stopped me was forcing myself to read the data before trusting my gut.

The symptom: my agent improvised a whole ComfyUI pipeline by hand

hikari runs Qwen3.6-35B-A3B abliterated (an uncensored fine-tune) in its W4A4 NVFP4 build, my daily driver. When I asked it over Telegram for an image or a short video, it would often spiral: fire a tool, get nothing useful back, switch to writing Python by hand, download a file, download it again, announce "here's a new video" about the same clip, then second-guess itself. yui, a sibling running a different brain (ChatGPT Codex), did the same job cleanly. The naive read writes itself: hikari's model is just the weaker agent. Hold that thought.

The lazy fix follows from that read: the 35B just isn't a strong enough agent for multi-step media work, so fine-tune it. Collect agent traces, SFT the model to use tools better, maybe a small LoRA. That is a few days of work and a real risk: this is an abliterated model, and retraining can partially undo the abliteration I went to some trouble to get.

That's exactly the kind of expensive, plausible plan that deserves five minutes of data first.

My wrong instinct, and the one cheap check that killed it

Each sibling keeps its full history in a SQLite file on the host that runs them. The schema is boring on purpose:

hikari → state.db   (35 MB)
kiriha → state.db   (63 MB)

messages(session_id, role, content, tool_call_id, tool_calls JSON)

Every assistant turn with a tool_calls array has a matching role=tool row keyed by tool_call_id. So I could pair every call to its result and bucket the failures honestly into three categories:

malformed JSON args — the model's fault (it can't format a call)
exec error — the environment's fault (the tool itself blew up)
wrong tool — the model's fault (it picked the wrong tool for the job)
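The pairing step can be sketched in a few lines of Python. This is an illustration, not the query from the post: it assumes the tool_calls column holds an OpenAI-style JSON array (each entry with an id and a function object carrying name and a JSON-string arguments field), which the schema above does not confirm. The helper names pair_calls and args_are_malformed are hypothetical.

```python
import json
import sqlite3


def pair_calls(conn):
    """Yield (tool_name, raw_args, result_text) for every tool call.

    result_text is None when no role='tool' row matches the call id.
    Assumes an OpenAI-style tool_calls array (an assumption, not confirmed).
    """
    results = {
        call_id: content
        for call_id, content in conn.execute(
            "SELECT tool_call_id, content FROM messages WHERE role = 'tool'"
        )
    }
    rows = conn.execute(
        "SELECT tool_calls FROM messages "
        "WHERE role = 'assistant' AND tool_calls IS NOT NULL"
    )
    for (raw,) in rows:
        for call in json.loads(raw):
            fn = call.get("function", {})
            yield fn.get("name"), fn.get("arguments"), results.get(call.get("id"))


def args_are_malformed(raw_args):
    """True when the arguments string is not a JSON object (the model's fault)."""
    try:
        return not isinstance(json.loads(raw_args), dict)
    except (TypeError, ValueError):
        return True
```

Pointing it at a real log would be something like list(pair_calls(sqlite3.connect("state.db"))). Only the malformed-args bucket can be decided from the call alone; exec errors need the result text, and wrong-tool cases still need a human read.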

Two traps bit me here, both worth stealing:

My first classifier counted successes as failures. Tool results with "error": null or exit_code: 0 are successes; a naive "does the result mention error" check flags them all. My first pass showed a terrifying failure rate that was mostly well-behaved terminal output. Read the actual success markers, not the word "error."
An aggregate failure rate averages away the thing you're chasing. "Flailing" is bursty and task-specific. Across 2327 calls the model looks fine; the meltdown is one session where it hit a bad tool ten times in a row. You have to pull one flailing session's full tool-call sequence — filter on the media keywords, sort by recency — and read it like a transcript.
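The first trap is easy to reproduce. The sketch below contrasts a naive substring check with one that reads explicit success markers. It assumes tool results are usually JSON objects with error and exit_code fields, as in the examples above; the Traceback fallback for non-JSON output is my own simplification, and both function names are hypothetical.

```python
import json


def naive_is_failure(result_text):
    """The buggy first pass: flags anything that mentions the word 'error'."""
    return "error" in (result_text or "").lower()


def is_failure(result_text):
    """Classify one tool result by its explicit markers, not by keywords."""
    try:
        payload = json.loads(result_text)
    except (TypeError, ValueError):
        payload = None
    if isinstance(payload, dict):
        # A non-empty error field is a failure; "error": null is not.
        if payload.get("error") not in (None, "", False):
            return True
        # A present exit code decides the outcome; 0 means success.
        exit_code = payload.get("exit_code")
        if exit_code is not None:
            return exit_code != 0
        return False
    # Non-JSON output: only treat an uncaught Python traceback as a failure
    # (an assumption; adjust to whatever your tools actually print).
    return isinstance(result_text, str) and result_text.lstrip().startswith("Traceback")
```

A result like {"error": null, "exit_code": 0} is a failure to the naive check and a success to the second one, which is the whole gap between a scary rate and a real one.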

Measure first: 0% of the tool calls were malformed

With the classifier fixed, the numbers were unambiguous:

Category	Rate	Whose fault
Malformed JSON tool args	0% (both siblings)	—
Wrong tool picked	0.5% (11 / 2327)	model
Exec errors (the tool itself blew up)	everything else — the scary-looking "~23% failures"	environment

Qwen3.6-35B-A3B abliterated formats its tool calls perfectly. Zero malformed calls across both siblings. The 0.5% wrong-tool cases were trivial (reading an image file with a text reader instead of vision; a memory action with a null verb). The model was, by the only measure that mattered, a good agent. Fine-tuning it would have bought at most that 0.5% — while risking the abliteration and costing days.

So where did the dramatic "23% failure" feeling come from? Two non-model sources: dead environment dependencies (Playwright's browser binary was never installed, so every browser_* call failed; the execute_code sandbox was missing PIL), and — the real story — the model hand-rolling ComfyUI.
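Dead dependencies like the missing PIL are cheap to catch before the agent ever runs. A minimal preflight sketch, assuming you know which top-level modules the sandbox is supposed to have; missing_modules is a hypothetical helper, and it only checks importability.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running missing_modules(["PIL", "playwright"]) inside the sandbox would have flagged the PIL gap. It would not catch the Playwright case above, where the Python package can be present while the browser binary is not; that needs its own check.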

The real cause: a broken tool invited the model to improvise

Here's the failure loop, read straight off one session:

The siblings had a mcp_comfyui_run_workflow tool from a ComfyUI plugin. It was flaky — in the session I pulled, it failed 15 times. After enough failures the model did exactly what a capable agent does: it gave up on the broken tool and solved the problem another way. It opened execute_code and hand-wrote the entire ComfyUI pipeline in raw Python — submit the graph to /prompt, poll /history, pull the result from /view — over a dozen improvised calls.

And that is where a good agent looks insane. A long, hand-assembled flow has no state management: the model loses track of which prompt it submitted, re-downloads a file it already has, sends the same clip twice, then — reasonably — doubts whether it actually produced anything new. The "flailing" is the rational behavior of a competent model routing around a broken tool with a tool too low-level for the job.

And the yui comparison that fooled me resolves cleanly: it was never about the brain. hikari does far more media generation than my other siblings, so it hit the broken tool far more often and looked like the weak one. The next step proved it was the harness, not the weights: give the same 35B a clean tool and the flailing stops — those ~20 improvised calls become one.

The fix: one gen.py call, not twenty — a clean ACI skill

The lesson here is straight out of the SWE-agent work: what matters is the Agent-Computer Interface (ACI), the shape of the tools an agent acts through. The paper that introduced the framing (Yang et al., 2024) showed that, with the model held fixed, ACI design materially changes how well the agent does. Give a capable model one obvious, reliable path and it stops improvising. mini-swe-agent, from the same group, pushes that idea to its limit with a radically minimal tool surface (famously, bash and little else), on the principle that a small surface of clean tools beats a sprawling, leaky one.

So I built one: a skill called gx10-media. Hermes skills use the same SKILL.md convention as Claude Code — a SKILL.md with frontmatter plus supporting scripts — so it dropped straight into each sibling's skills directory. It covers all three media types behind a single script:

# one call = one finished media file, path printed on the last line
python scripts/gen.py --type image --prompt "a calico cat in a quiet Japanese alley"
python scripts/gen.py --type video --prompt "..." --seconds 5
python scripts/gen.py --type i2v   --prompt "..." --image first_frame.png --seconds 5
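The three invocations above imply a very small command-line surface. Here is one way it could be parsed with argparse. The flag names come from the examples; the default of 5 seconds and the rule that i2v requires --image are my assumptions, not something the post states.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the gen.py examples."""
    parser = argparse.ArgumentParser(prog="gen.py", description="One call, one media file.")
    parser.add_argument("--type", required=True, choices=["image", "video", "i2v"])
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--image", help="first frame, used with --type i2v")
    parser.add_argument("--seconds", type=int, default=5)  # default is an assumption
    return parser


def parse_args(argv):
    """Parse argv and reject an i2v request that has no first frame."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.type == "i2v" and not args.image:
        parser.error("--type i2v requires --image")
    return args
```

Rejecting bad input at the parser keeps the error message short and specific, which matters when the reader of that message is a model deciding what to try next.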

gen.py handles in one pass everything the model used to improvise: it submits the right workflow to ComfyUI's /prompt, polls /history until done, downloads from /view, and prints exactly one line — the output path. State is the script's problem, not the model's. It has a health check against /system_stats and a 420 s internal timeout so a cold model load doesn't read as a hang. Three workflow JSONs sit behind it — image.json (Z-Image Turbo NVFP4), video.json and i2v.json (Sulphur 2 NVFP4 with audio).
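The submit, poll, and fetch sequence described above can be sketched with the standard library alone. This is not the actual gen.py. It assumes ComfyUI's default local address, that POST /prompt returns a prompt_id, and that a finished job shows up under that id in /history/{prompt_id} with an outputs mapping whose entries carry filename, subfolder, and type. The names generate, first_output_file, and run_once are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8188"  # assumption: ComfyUI's default local address


def http_json(url, payload=None, timeout=30):
    """GET when payload is None, otherwise POST JSON; return the decoded body."""
    data = None if payload is None else json.dumps(payload).encode()
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def first_output_file(outputs):
    """Return the first file entry found in a history 'outputs' mapping."""
    for node_output in outputs.values():
        for entries in node_output.values():
            if isinstance(entries, list):
                for entry in entries:
                    if isinstance(entry, dict) and "filename" in entry:
                        return entry
    return None


def generate(workflow, fetch=http_json, deadline_s=420, poll_s=2.0, sleep=time.sleep):
    """Submit once, poll history until the job appears, return its /view URL."""
    fetch(f"{BASE_URL}/system_stats")  # health check: raises if the server is down
    prompt_id = fetch(f"{BASE_URL}/prompt", {"prompt": workflow})["prompt_id"]
    waited = 0.0  # counts sleep time only, so the deadline is approximate
    while waited <= deadline_s:
        entry = fetch(f"{BASE_URL}/history/{prompt_id}").get(prompt_id)
        if entry:
            found = first_output_file(entry.get("outputs", {}))
            if found is None:
                raise RuntimeError("job finished without an output file")
            query = urllib.parse.urlencode({
                "filename": found["filename"],
                "subfolder": found.get("subfolder", ""),
                "type": found.get("type", "output"),
            })
            return f"{BASE_URL}/view?{query}"
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"no result after {deadline_s} s")


def run_once(workflow, dest):
    """Download the result and print exactly one line: the local path."""
    urllib.request.urlretrieve(generate(workflow), dest)
    print(dest)
```

The prompt id lives in a local variable for the whole run, so there is nothing for the model to lose track of. Passing fetch and sleep as parameters is only there to make the loop testable without a running server.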

The SKILL.md is blunt about the contract, because the whole point is to remove the temptation to improvise:

Use gen.py. Do not hand-write ComfyUI API calls. Do not use any comfyui plugin. Submit once and read the path it prints.

The behavioral change was the entire ~20-call hand-rolled pipeline collapsing to a single call. Measured from yui to the DGX Spark: an image takes 94 s (cold, including model load) and a warm video 26 s — each returning one clean path. (Skills are tracked by manifest mtime, so a new one is auto-discovered — no restart, and the model can list it itself.)

Removing the broken tool entirely, not just adding a good one

Adding the clean skill isn't enough. A capable model explores — leave the broken mcp_comfyui_run_workflow in place and it will eventually rediscover it and fall back into the hand-rolled mess. A broken tool is worse than no tool, because it actively invites the workaround.

The catch: that ComfyUI plugin wasn't only the bad video tool — it also backed image generation (image_gen: provider: comfyui, the Z-Image path). I couldn't just delete it or I'd take out working image gen too. The order mattered: build Z-Image into the skill first, then remove the plugin. Once gx10-media covered images as well, I cleaned the config:

# each sibling's config.yaml
plugins:
  enabled: []          # was [comfyui]
# image_gen: block removed entirely

(One sharp edge: the two siblings indent their YAML differently — hikari uses 4-space list dashes, kiriha 2-space — so a literal find-and-replace has to handle each separately. I backed up to config.yaml.bak-pre-mcp-clean-20260602 first.) A launchctl kickstart -k restart on each, a grep comfyui returning nothing, and both siblings came up healthy with exactly one way to make media.
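Because the two configs indent differently, a check that ignores indentation is handy for confirming the cleanup. A rough Python equivalent of that grep, assuming a plain-text scan is good enough; the helper name is hypothetical, and splitting on "#" is a simplification that would misread a "#" inside a quoted YAML string.

```python
from pathlib import Path


def leftover_references(config_path, needle="comfyui"):
    """Return (line_number, text) for non-comment lines that still mention needle."""
    hits = []
    lines = Path(config_path).read_text().splitlines()
    for number, line in enumerate(lines, start=1):
        code = line.split("#", 1)[0]  # drop trailing comments before matching
        if needle in code.lower():
            hits.append((number, line.strip()))
    return hits
```

An empty list from each sibling's config.yaml means no live setting still points at the plugin, whatever the indentation.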

Takeaways

Where the time went: not the fix — the temptation to skip the measurement. The fine-tune plan felt right, and fine-tune plans are days long. The single most valuable hour was the one spent writing a throwaway SQLite query instead of starting that plan. The classifier false-positive (counting exit_code: 0 as a failure) nearly sent me chasing a phantom 23% model-error rate, too.

Reusable diagnostics: for any agent that "feels dumb," pull its raw tool-call log and split failures into model (malformed args, wrong tool) versus environment (the tool itself errored). Those point at completely different fixes. And never trust an aggregate rate for bursty behavior — read one bad session end to end. The fix layer is almost never where your gut points.
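One way to find the session worth reading is to rank sessions by their longest run of consecutive failures instead of by average rate. A small sketch, assuming you already have chronological (session_id, failed) pairs from your own classifier; the function name is hypothetical.

```python
def longest_failure_streaks(events):
    """Map each session to its longest run of consecutive failed tool calls.

    events: iterable of (session_id, failed) pairs in chronological order.
    """
    longest = {}
    current = {}
    for session_id, failed in events:
        # Extend the running streak on failure, reset it on success.
        current[session_id] = current.get(session_id, 0) + 1 if failed else 0
        longest[session_id] = max(longest.get(session_id, 0), current[session_id])
    return longest
```

Sorting the result by value puts the bursty session at the top even when its failures are a small share of all calls.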

The general principle: a broken tool turns a good agent into a bad one. Fix the harness before you blame the weights — and when you fix it, remove the broken path, don't just add a clean one beside it.

TL;DR

Agent "flailing" on a task → read the tool-call log before touching the model.
Split failures: model's fault (malformed args / wrong tool) vs environment's fault (tool errored, deps missing). Different layers, different fixes.
Fix your classifier first — exit_code: 0 and "error": null are successes.
Aggregate rates hide bursty flailing — pull one bad session's full sequence and read it.
The fix for "model routes around a broken tool by improvising" is a clean ACI: one high-level call, state handled inside, one line of output.
Remove the broken tool, don't just add the good one beside it — a capable model will rediscover the bad path.

Also in this series: Part 33 — NVFP4 W4A4 beats FP8 on a DGX Spark MoE · Part 34 — NVFP4 shrinks a video model 33%

