OpenClaw · part 5
[AI Agent] The Codex-Executor Pattern: Keeping Agent Sessions Small
Preface
A project manager who also does all the drafting, formatting, and filing ends up doing nothing well. The session fills with half-finished artifacts, and by the time the final output is needed, the context is a mess of intermediate state. The same thing happens to an AI agent when you make it orchestrate every step of a long task directly.
This picks up where Zero API Cost: Running OpenClaw on DGX Spark + Mac Mini left off. That post covered the deployment architecture. This one covers a pattern that emerged during actual use — after the agent started getting long tasks that degraded partway through.
The Problem
The initial design was straightforward: the OpenClaw agent handles everything. A task like daily-reflexion — the agent's daily self-analysis routine — would proceed step by step:
- Read /memory/today.md
- Read /memory/prediction-ledger.md
- Call an external model for a bull-case critique
- Write that critique to a temp file
- Write its own bear-case critique
- Synthesize both
- Update market-beliefs.md
- Send a Telegram summary
Eight tool calls. Each one adds to the agent's context window. The agent has to hold all the intermediate state — file contents, model responses, in-progress synthesis — in its active session while it works through the sequence.
This is fine for short tasks. It is not fine for tasks with six, eight, or ten steps where each step produces output that feeds the next one.
What happened in practice: sessions would degrade. The agent would complete most of a task and then produce malformed output on the final step, or write partial content to a file, or send a truncated Telegram message. Not every time, but often enough that daily-reflexion couldn't be trusted to run unattended.
The pattern was consistent: tasks that required reading multiple files, calling external tools, and writing outputs — in sequence — were fragile at scale.
The Insight
The agent is good at deciding what to do. It is bad at maintaining coherent state across many tool calls in a long session.
Codex (invoked via codex exec) is structured differently. It runs as a separate subprocess with its own fresh context window. You give it a complete task description, it executes the entire sequence from start to finish, and returns a single result. The intermediate state lives inside Codex's subprocess — it never touches the agent's session.
The agent's job becomes: describe the task clearly, spawn the subprocess, read the result, act on it.
This is the Codex-executor pattern.
The Pattern
Instead of the agent doing:
read file A
→ read file B
→ call external model with A + B
→ read model response
→ synthesize A, B, response
→ write output to disk
→ notify via Telegram
The agent does:
codex exec "Read A and B, call the model, synthesize, write output to /tmp/result.md"
→ read /tmp/result.md
→ notify via Telegram
The agent makes two tool calls instead of seven. All the complexity lives inside the Codex subprocess, which has its own full context window and exits cleanly when done.
The handoff point is the filesystem. The subprocess writes its output to a predictable path. The agent reads that path. Simple, explicit, testable.
Concrete Example: daily-reflexion
Here is what daily-reflexion looked like before and after the change.
Before — agent orchestrates directly:
// Agent does all 8 steps in its own session
const today = await readFile('/memory/today.md')
const ledger = await readFile('/memory/prediction-ledger.md')
const bullCritique = await callGemini(
  `Critique these predictions from a bull perspective:\n${today}\n${ledger}`
)
await writeFile('/tmp/gemini-critique.md', bullCritique)
const bearCritique = await callCodex(
  `Write a bear-case critique of these predictions:\n${today}\n${ledger}`
)
await writeFile('/tmp/codex-critique.md', bearCritique)
const synthesis = await synthesize(bullCritique, bearCritique)
await writeFile('/memory/market-beliefs.md', synthesis.beliefs)
await writeFile('/memory/prediction-ledger.md', synthesis.ledgerUpdate)
await sendTelegram(synthesis.summary)
// Result: 8+ tool calls in the agent's session
// Context grows with each step
// Failure at step 7 means partial writes to disk
After — Codex-executor:
// Agent spawns one subprocess with a complete task description
await execCodex(`
You are running the daily-reflexion routine.
1. Read /memory/today.md and /memory/prediction-ledger.md
2. Call gemini --yolo with the contents and ask for a bull-case critique of the predictions.
   Write the critique to /tmp/gemini-critique.md
3. Write your own bear-case critique to /tmp/codex-critique.md
4. Synthesize both critiques. Update:
- /memory/market-beliefs.md (revised beliefs based on critique)
- /memory/prediction-ledger.md (add resolution notes for today's predictions)
5. Write a 3-5 sentence summary of what changed to /tmp/reflexion-summary.md
`)
// Agent context: still minimal
const summary = await readFile('/tmp/reflexion-summary.md')
await sendTelegram(summary)
// Result: 2 tool calls in the agent's session
// All intermediate state is in Codex's subprocess
// If something fails, the agent gets an error from execCodex — not a partial write
Measured Results (2026-03-15)
The first end-to-end run of the refactored daily-reflexion:
- Files updated: market-beliefs.md, prediction-ledger.md, today.md — all three, completely
- Telegram delivery: confirmed, messageId=4030
- Agent session size: stayed minimal throughout; no context overflow
- Total agent tool calls: 2 (execCodex + sendTelegram)
The subprocess ran for about 90 seconds. During that time, the agent's context was idle. When the subprocess finished, the agent read the result and sent the notification. Clean.
What Was Gained
Reliability. When the agent orchestrates many steps directly, a failure mid-sequence leaves the filesystem in a partial state. With Codex-executor, failure is atomic from the agent's perspective — the subprocess either completes the full task or returns an error. No partial writes.
Debugging clarity. When something goes wrong, you know where to look. If the agent's two-line session fails, check the subprocess invocation. If the subprocess fails, check its output log. The scope of investigation is bounded.
A transferable shape. Any task that fits this profile is a candidate for the pattern:
- Multiple file reads as inputs
- One or more external model calls
- Multiple file writes as outputs
- A final notification or action
If your task has more than three sequential steps with intermediate state, the subprocess version is likely more reliable than the direct version.
When to Use It
Use Codex-executor when:
- The task involves reading multiple files, calling external tools, and writing outputs — in sequence
- Intermediate state matters (a failure mid-task would leave things in a broken state)
- The task can be fully specified upfront as a complete instruction to a subprocess
- You want the agent's session to stay small regardless of task complexity
Do not use it when:
- The task is a single tool call (just do it directly)
- You need to respond to intermediate results before proceeding to the next step
- The task is under three steps and has no significant intermediate state
- The task requires interactive back-and-forth that the subprocess can't anticipate
The Rule
If you can write the task as a single complete instruction that a capable agent could execute from start to finish without checking in — that's a Codex-executor task.
If the task requires conditional branching based on intermediate results that only the orchestrating agent can evaluate — keep it in the agent's session.
The goal is to move complexity out of the agent's context window and into bounded subprocesses. The agent stays the decision-maker. Codex does the work.
Also in this series: Zero API Cost: Running OpenClaw on DGX Spark + Mac Mini