The six-phase loop

What is the six-phase agentic loop?

The six phases are research, plan, execute, review, verify, ship. Each phase is a skill, in the agent-anatomy sense - a packaged set of instructions the agent loads when the phase is active. Each phase has a clear input, a clear output, and a clear hand-off to the next phase. Each phase is designed to be gated at its boundary; today's concrete Superpowers implementation uses skill instructions to request the discipline, and you may need to wire in a project-specific PreToolUse hook to enforce the gating strictly. Kernel-level phase enforcement is still maturing as of mid-2026.
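A minimal sketch of what such a project-specific hook might look like. It assumes a harness that hands the hook the pending tool call as JSON on stdin and treats exit code 2 as "block this call", which is how Claude Code documents its PreToolUse hooks at the time of writing; verify that against your own harness. The marker file `.agent/plan.approved`, the set of write-tool names, and the function names are hypothetical, chosen for illustration.

```python
import json
import sys
from pathlib import Path

# Hypothetical marker a human creates when approving the plan.
APPROVAL_MARKER = Path(".agent/plan.approved")
# Assumed names of the tools that modify files; adjust to your harness.
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}


def decide(event: dict, plan_approved: bool) -> tuple[int, str]:
    """Return (exit_code, message) for one pending tool call."""
    tool = event.get("tool_name", "")
    if tool in WRITE_TOOLS and not plan_approved:
        # Exit code 2 is assumed to mean "block" in the harness.
        return 2, f"Blocked {tool}: no approved plan. Finish the plan phase first."
    return 0, ""


def main() -> int:
    """Hook entry point: read the event from stdin, explain any block on stderr."""
    event = json.load(sys.stdin)
    code, message = decide(event, APPROVAL_MARKER.exists())
    if message:
        print(message, file=sys.stderr)
    return code
```

Wired up as the hook command, the script would end with `sys.exit(main())`. The decision lives in `decide` so it can be tested without a harness; reads stay allowed, so research and plan can proceed before anything is approved.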

Figure: The six-phase loop. Most failures route back to Plan, not back to Research.

I will walk through the phases in order, and at the end I will tell you what the whole thing looks like when it runs end to end on a real piece of work.

One principle runs under all six phases: a good result needs a well-defined expected outcome. The agent can build almost anything you describe, but it cannot know which thing you meant - defining that is the job of the front of the loop, and it is what the later phases mean by "the spec" and by "the intent." Research pins down what it can and surfaces what it cannot; you settle the open questions it raises; the plan turns the settled outcome into per-task checks; verify measures the result against it. You do not hold that outcome complete before you start - it is an output of research and plan, sharpened every time a failed verify routes back to plan, and for genuinely novel work you define what you can and let research close the rest. Skip the definition and the rest of the loop has nothing to stand on: review has no spec to compare the diff against, verify has nothing to check, and "done" is whatever the agent decided.

Phase one: research.

The agent reads the codebase and produces a research note that establishes the current state, names the relevant files, identifies the existing patterns, and calls out risks. Input: the task description. Output: a markdown document of two to four pages.

What goes in the research note? The files that will be touched. The conventions those files follow. The existing tests that cover the area. Related concepts elsewhere in the codebase that might be relevant. Open questions the agent has - places where the codebase is ambiguous and a human needs to decide. These are not loose ends. They are the decisions that define what "done" will look like, surfaced now so you settle them before the plan commits to them.

Research is the phase teams skip most often, because it produces no code and feels like overhead. Research is also the phase that, in my experience, has the highest leverage within the loop. A bad research note guarantees a bad plan, which guarantees a bad implementation. A good research note makes the rest of the loop dramatically easier, because the plan is grounded in the real state of the code, not in the agent's first-guess hypothesis about the state of the code.

The research artifact is durable. It gets committed alongside the change. Six months later, when a different developer is working on adjacent code and wants to know "why is this thing structured this way," the research note is there. It is the institutional memory the team did not have before.

Phase two: plan.

The agent reads the research note and produces a file-level plan. Each task in the plan names the file to be changed, the change to be made, the verification that proves the change worked. Each task is sized to two to five minutes of work - small enough that a failure is recoverable, big enough that the overhead of task switching does not dominate.

The plan also names what tests need to be added or updated. If the plan does not mention tests, the plan is incomplete and the agent goes back. This is intended to be enforced by a hook; the skill instructions request this rigor, and the hard enforcement is something you wire up per project as your team's maturity warrants.

The plan also states what "done" means for the change as a whole - the outcome verify will check against - not only the per-task verifications. A plan that names files, tasks, and tests but never says what the change is supposed to achieve has decomposed the work without defining it; the outer-loop contract in Appendix B.6 forces a "Done when" line one loop out, and the inner loop needs the same thing at the scale of a single change.
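The shape of a complete plan can be sketched as a small data structure with a completeness check. This is an illustration under my own assumptions, not the format any particular plugin uses; the names `Task`, `Plan`, and `plan_gaps` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    file: str          # the file to be changed
    change: str        # the change to be made
    verification: str  # the check that proves the change worked


@dataclass
class Plan:
    done_when: str                                   # the outcome verify checks against
    tasks: list[Task] = field(default_factory=list)
    tests: list[str] = field(default_factory=list)   # tests to add or update


def plan_gaps(plan: Plan) -> list[str]:
    """List the reasons a plan should go back to the agent; empty means complete."""
    gaps = []
    if not plan.done_when.strip():
        gaps.append("no 'done when' for the change as a whole")
    if not plan.tests:
        gaps.append("no tests named")
    if not plan.tasks:
        gaps.append("no tasks")
    for i, task in enumerate(plan.tasks, start=1):
        if not task.verification.strip():
            gaps.append(f"task {i} ({task.file}) has no verification")
    return gaps
```

A check like this catches only the structural gaps. Whether the tasks are too vague, too large, or wrongly ordered is still the human read at the gate.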

The plan is the gate where a human reviewer matters most. You read the plan. You push back on tasks that are too vague, too large, or wrongly ordered. You add tasks the plan missed. You remove tasks that are out of scope. The agent revises. You approve. Only then does execute start.

The plan review takes a few minutes. It saves an afternoon when the plan was wrong and you would have discovered it during execute, which is much harder to unwind.

Phase three: execute.

This is where the recursive primitive from Chapter 1 - subagents - earns its keep. Execute is the phase where the orchestrator dispatches multiple constrained children, each working on a bounded task in its own isolated context.

The orchestrator dispatches a subagent per task. The isolated context is a key architectural feature, because context contamination is the single biggest reason long-running agent sessions go wrong. Task one's confused reasoning does not pollute task four's clean slate. Each subagent reads only what it needs, makes its assigned change, runs the verification step in the plan, and reports back. The orchestrator assembles the results.

Execute is the phase that produces visible code on the screen. It is also the phase that benefits most from the isolation. Without subagent isolation, an eight-task change in a single agent session ends with a context window stuffed with eight tasks' worth of code, partial results, debugging output, and the agent making decisions in task seven based on garbled memory of decisions in task two. With isolation, every task is fresh. Every task is small. Every task either succeeds or fails on its own merits.

If a task fails, the orchestrator decides whether to retry, route around the failure, or stop and ask the human. Most failures are recoverable - the agent's first attempt missed a detail the research note flagged, the second attempt incorporates the fix. Some failures are blocking - the task as planned cannot be completed, the plan needs revision. The orchestrator distinguishes the two.
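One way to picture the distinction is as a small routing function. This is a sketch, not how any specific orchestrator decides: the retry limit of two is an assumption, the names are hypothetical, and the "route around the failure" option is left out because it depends on the task graph.

```python
from enum import Enum


class Next(Enum):
    RETRY = "retry"
    REVISE_PLAN = "revise plan"
    ASK_HUMAN = "ask human"


def route_failure(blocking: bool, attempts: int, max_attempts: int = 2) -> Next:
    """Pick the orchestrator's next move after a task fails.

    blocking: the task as planned cannot be completed.
    attempts: how many times the task has already been tried.
    """
    if blocking:
        return Next.REVISE_PLAN   # the plan is wrong, not the attempt
    if attempts < max_attempts:
        return Next.RETRY         # recoverable: try again with the fix
    return Next.ASK_HUMAN         # retries exhausted: stop and escalate
```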

What subagent isolation does not solve: orchestrator-level contamination. The orchestrator still holds summaries of every subagent's work, and if the orchestrator's summary of task one's result is imprecise, task four's subagent may make an assumption that contradicts something task one actually established. I have watched a six-subagent execute phase produce mutually inconsistent edits for exactly this reason. The isolation is real and valuable; it is not a silver bullet. The orchestrator is still a single point of context, and its summaries are still a place where drift can enter. Read the orchestrator's task-handoff messages with the same skepticism you would read a junior engineer's standup updates.

The related cost is mediation when parallel subagents make conflicting edits. The cheap mediation: the orchestrator detects that two branches touched the same file in incompatible ways, drops the later branch, and re-runs it sequentially with the first branch's output passed in as context. The expensive mediation: the orchestrator summarizes the conflicting changes, asks a higher-capability model to pick the right merge, then re-applies. Most conflicts are cheap. Some are not, and the expensive ones eat the speed gain you went parallel to capture.

This coordination cost is bounded, and the bound is task independence: the more independent the subagent tasks are, the lower the conflict rate. The way to keep them independent is to scope by file or by module, not by feature. Six subagents each editing one file is safe. Six subagents all editing the same feature across overlapping files is a recipe for the expensive case. In my experience, the teams that hit this problem are usually dispatching too many subagents for the work at hand, and three well-scoped subagents finish faster than eight overlapping ones.
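Scoping by file makes the independence check cheap to run before dispatch. A minimal sketch, assuming each task's scope is known up front as a set of file paths (the task names and paths in the usage below are made up):

```python
from itertools import combinations


def overlapping_scopes(scopes: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of subagent tasks whose file scopes intersect."""
    conflicts = []
    for (a, files_a), (b, files_b) in combinations(sorted(scopes.items()), 2):
        shared = files_a & files_b
        if shared:
            conflicts.append((a, b, shared))
    return conflicts
```

For example, `overlapping_scopes({"t1": {"wire.py"}, "t2": {"builder.py", "api.py"}, "t3": {"api.py", "routing.py"}})` reports that `t2` and `t3` share `api.py`. Any pair it reports is a candidate for sequential execution instead of parallel dispatch.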

Execute is also where the agent encounters governance. Every tool call goes through the permission gate. Every Bash command goes through the security hooks. Every file write goes through the sandbox. If the agent tries something the governance layer disallows, the action is blocked, the agent reports back to the orchestrator, the orchestrator decides how to proceed. The rigor lives in the layers below execute; execute just runs the work.

Phase four: review.

Two reviewers. In sequence.

First, spec compliance. The agent reads the original research note, the approved plan, and the actual diff. It answers a single question: does the implementation match the spec? If yes, it says so. If no, it flags the gap. Spec compliance is a different skill from code quality. A change can be high-quality code that does the wrong thing. A change can be ugly code that does exactly the right thing. The spec reviewer cares only about the first dimension.

Second, code quality. A different agent. A different prompt. It reads only the diff. It asks: is this good code, by the team's standards? Naming. Style. Edge cases. Test coverage. Error handling. Performance considerations. It comments on the diff as a senior reviewer would.

The reason you split these into two reviewers is that doing both at once produces worse output. A reviewer who is simultaneously asking "does this match the spec" and "is this well-written" tends to blur the two. The spec gets weighted by the code quality, or the code quality gets weighted by spec compliance, and you lose the distinct signal each one was supposed to provide. Two reviewers, two concerns, no blur.

The output of review is structured. Each finding has a severity. Critical findings block ship. Important findings get fixed before ship. Suggestions are noted in the PR description. The agent acts on the critical and important findings automatically (within the constraints of the plan), and surfaces the suggestions for the human reviewer to decide.
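The severity scheme above can be sketched as a small data model. The three levels come from the text; the class and function names are hypothetical, and a real review skill may structure its output differently.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # blocks ship
    IMPORTANT = "important"    # fixed before ship
    SUGGESTION = "suggestion"  # noted in the PR description


@dataclass
class Finding:
    reviewer: str        # "spec" or "quality"
    severity: Severity
    note: str


def triage(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Split findings into what the agent must fix and what the human decides."""
    must_fix = [f for f in findings if f.severity is not Severity.SUGGESTION]
    for_human = [f for f in findings if f.severity is Severity.SUGGESTION]
    return {"must_fix": must_fix, "for_human": for_human}


def can_ship(findings: list[Finding]) -> bool:
    """Ship only when no critical or important finding remains open."""
    return not triage(findings)["must_fix"]
```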

How do you review an agent-written diff?

An agent-written diff fails differently than a human-written one, so you read it in a different order. A human diff fails at execution - a typo, an off-by-one, the edge case the author forgot - and you hunt line by line for the mistake, because that is where a tired human leaves them. An agent diff arrives without those. It compiles, it passes lint, it reads like idiomatic code your team would merge without comment. It fails one layer up, at intent and context - a business rule that is plausible and wrong, the right code placed in the wrong layer - and neither of those announces itself on the page. An agent bug looks like the code a good engineer would write for a slightly different task. Fluency is not correctness, and reading for fluency will not catch it.

So spend the first ten minutes above the code, not in it. Start with the diff-stat against the plan, before a single line of implementation: does the shape of the change match the shape of the ask? A file touched that the plan never named is the first flag, not a footnote - an over-scoped diff is the agent deciding something you did not ask it to decide.
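That first check is mechanical enough to script. A sketch, assuming the plan's file list is available as a set of paths and the change sits on a branch cut from a base branch called `main` (both assumptions; `git diff --name-only` is the real command that lists touched paths):

```python
import subprocess


def touched_files(base: str = "main") -> list[str]:
    """Paths changed on the current branch since it diverged from `base`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def unplanned_files(planned: set[str], touched: list[str]) -> list[str]:
    """Files the diff touched that the approved plan never named."""
    return sorted(set(touched) - planned)


def untouched_planned_files(planned: set[str], touched: list[str]) -> list[str]:
    """Files the plan named that the diff never touched."""
    return sorted(planned - set(touched))
```

A non-empty result from `unplanned_files` is the over-scope flag. A non-empty result from `untouched_planned_files` is the opposite question, whether a task was silently dropped.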

Read the tests next, and read what they assert, not whether they pass. Green is the cheapest signal in the diff and the one you already have. A test that asserts what the implementation does instead of what the intent required will pass and still be wrong; the verify phase returns to this, to why the agent's own tests earn a second read.

Then check the boundaries your team has written rules about - the forbidden patterns in AGENTS.md, the layer conventions, the modules that are dangerous to touch. The agent violates a convention confidently and in fluent style, so a boundary crossing does not look wrong; it looks clean, which is exactly what style-reading slides past. And grep every new name the diff introduces - each API, function, or config key you do not recognize - the same cross-check Chapter 6 runs against the confident invention, because the call that does not exist reads as plausibly as the one that does.
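The name check can be roughed out in a few lines. This is a deliberately crude heuristic of my own, not a replacement for grep or for reading the dependency's documentation: it pulls anything that looks like a call out of the added lines and reports names absent from a set you already trust (for example, names collected from the codebase and its dependencies). It will also flag keywords followed by a parenthesis.

```python
import re

# Crude pattern: an identifier immediately followed by an opening parenthesis.
CALL = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def unknown_calls(added_lines: list[str], known_names: set[str]) -> list[str]:
    """Names called in the added lines that appear nowhere in known_names."""
    called = {name for line in added_lines for name in CALL.findall(line)}
    return sorted(called - known_names)
```

Each name it returns is a question to answer before merge: does this function exist, and does it do what the diff assumes?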

Only now, line by line. This is the mechanical layer the two reviewers above already swept - spec compliance, code quality - so you are not repeating their pass; you are spending the minutes they bought you on the two things they cannot judge, business correctness and architectural fit. This is where the plausible-wrong business rule surfaces, and where a stale idiom from the framework version the model knows best reads as a cleanup and ships as a regression. The read does not disappear as the tooling improves. It gets shorter and sharper, aimed at the errors of intent that survive everything upstream - and when it decays instead into a glance at green checkmarks, Chapter 9's pattern four owns what happens next.

Appendix B.8 is this read order as a one-pager.

Phase five: verify.

The verify phase is where the new tests run: the tests that exercise the change. The relevant existing tests already ran during execute (any task that modifies code runs them to make sure nothing broke). Verify is about whether the change is actually correct - correct against the outcome you defined in research and plan - not just whether the existing tests still pass.

For backend logic, verify usually means unit tests and integration tests. The plan named which tests to add; the execute phase added them; verify runs them and reports the results.

For frontend code, verify gets more interesting. Frontend testing has historically been hard - UI tests are fragile, snapshot tests are brittle, manual QA does not scale. The agentic workflow has a real answer here, and it is one of the things I am most enthusiastic about. The answer is Playwright with the accessibility tree.

Playwright with the accessibility tree means this. Playwright is a browser automation library. It drives a real browser (headless or visible) through a sequence of interactions. The accessibility tree is the structure browsers maintain for assistive technology - screen readers, voice control, and the like. The accessibility tree describes the page in semantic terms: there is a button with this label, there is a form field labeled "email", there is a list of items with these names. The accessibility tree is stable. It does not change when you restyle the page, because the labels and roles do not change. CSS refactor? Accessibility tree is the same. Component library swap? As long as the roles and labels survive, the accessibility tree is the same.

Verify in the agentic workflow writes tests against the accessibility tree, not against pixels and not against CSS selectors. The test says "navigate to the user profile page, find the field labeled Priority, change its value, click Save, verify the new value renders." That test passes whether the UI is styled with Tailwind or Bootstrap or Material or nothing. The test passes whether the field is implemented as a select dropdown or a radio group or a custom component. The test asserts the behavior, which is what you care about.
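The test described above might look like this with Playwright's Python sync API. It is a sketch under assumptions: the profile URL is supplied by the caller, the label "Priority" and the button name "Save" are taken from the prose, and the `select_option` step assumes the field is a native select (a radio group would need something like `page.get_by_role("radio", name=...).check()` for that one step). The locator calls `get_by_label`, `get_by_role`, `select_option`, `click`, and `input_value` are real Playwright methods.

```python
def change_priority(page, profile_url: str, new_value: str) -> str:
    """Drive the UI through labels and roles only; return the value shown after saving.

    `page` is expected to be a Playwright sync-API Page.
    """
    page.goto(profile_url)
    # Locate the field by its accessible label, not by CSS class or DOM position.
    page.get_by_label("Priority").select_option(new_value)  # assumes a native <select>
    # Locate the button by role and accessible name.
    page.get_by_role("button", name="Save").click()
    # Re-query the field and read back what the page now shows.
    return page.get_by_label("Priority").input_value()
```

In a pytest-playwright suite, a test would take the `page` fixture and assert that `change_priority(page, url, "high")` returns `"high"`. Nothing in the function refers to a class name, a test id, or a pixel position, which is what lets it survive a restyle.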

I run this with banking teams constantly. They have long histories of failed UI test automation - flaky tests, brittle suites, junior engineers wasting weeks chasing snapshot diffs. The accessibility tree pattern is the first thing that has worked, in my experience, for keeping UI test suites green over months and years. The agent writes the tests. The tests survive refactors. The team trusts the green-light signal.

Can you trust the tests the agent writes?

A green test suite the agent wrote is evidence, not proof.

It is evidence the change does what the test says. It is not proof the test says the right thing. The worked example later in this chapter makes the gap concrete: the audit-log task shipped a passing test that asserted the old log format. Green, and wrong. The suite was happy. The behavior was incorrect. Nothing but Review caught it, because nothing else was looking at what the test claimed - only at whether it passed.

Coverage percentage does not close this gap; it widens the illusion. Coverage measures which lines executed, not whether anything meaningful was checked. An agent optimizing for a coverage gate will write tests that call the code, assert nothing of consequence, and turn the number green. You get the metric and not the safety. A high coverage figure on agent-written tests tells you the code ran during the test. It does not tell you the code is right.

Characterization tests have the same shape of limitation, named in Chapter 8: they lock in current behavior, not correctness. They are genuinely valuable - a regression net that lets the agent refactor without silently changing what the code does. But they will preserve a bug as faithfully as they preserve a feature. A characterization suite that goes green after a refactor proves you did not change the behavior. It says nothing about whether the behavior was ever correct.

So the discipline is the obvious one, applied where teams forget to apply it: review the agent's tests the way you review the agent's code. Read what they assert, not just whether they pass. For backend logic especially, a human or a second agent should check the assertions against the spec - against what the code is supposed to do - not against the implementation that happens to be in front of them. A test written from the implementation will agree with the implementation. That is the failure mode. The assertion has to come from the intent.
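A small illustration of the gap between a test that runs the code and a test that states the intent. The audit-line formats here are hypothetical, invented for the example; only the shape of the failure comes from the audit-log task.

```python
def emit_old(wire_id: str, priority: str) -> str:
    # Hypothetical old format: the priority is silently dropped.
    return f"wire={wire_id}"


def emit_new(wire_id: str, priority: str) -> str:
    # Hypothetical new format the spec asks for.
    return f"wire={wire_id} priority={priority}"


def coverage_only_check(emit) -> bool:
    """Runs every line of the emitter but checks nothing of consequence."""
    return isinstance(emit("W-1", "urgent"), str)


def intent_check(emit) -> bool:
    """Asserts what the spec requires: the priority appears in the audit line."""
    return "priority=urgent" in emit("W-1", "urgent")
```

`coverage_only_check` passes for both emitters, so it would stay green on the old format. `intent_check` fails for `emit_old` and passes for `emit_new`, because its assertion comes from the requirement, not from whatever the implementation happens to return.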

Phase six: ship.

Ship is the phase that produces the artifact your team's normal review process handles. The agent commits the changes with a structured commit message. It pushes the branch and opens a pull request with a structured description: what changed, why, how it was verified, what risks remain, what reviewers are tagged. If a Jira ticket was linked at the start, the agent updates the ticket. If Slack notifications are wired, the agent posts to the relevant channel.

Ship takes thirty seconds. It is the easiest phase. It is also the phase that makes the rest of the loop palatable to the team, because the artifact the agent produces - the pull request - is exactly the artifact the team is already used to reviewing. There is no special "AI lane" in your repository. There is the same pull request review process that every change goes through. The reviewer reads the diff, reads the description, reads the research note linked in the description, reads the test results, approves or requests changes. Same as always.

This is the property that makes agentic delivery work in practice. The agent does not break your existing process. The agent feeds your existing process with better-formulated work.

The full loop, on a small feature, takes about twenty to thirty minutes of total wall-clock time. On a medium feature, an hour. On a large feature, a few hours - and the large feature would have taken days without the agent, so the comparison is favorable.

That total includes both the agent's processing time and the human gate review time. The agent itself runs in maybe a third to half of the wall clock; the rest is you reading the research note, you reviewing the plan, you approving the diff, you watching the verify step pass. The human gates are the rate-limiter on a healthy workflow, not the agent. If your loop is taking three hours on a small feature, the issue is almost certainly that the gates are over-engineered or that you are doing them in a slow back-and-forth instead of a focused pass. The agent will not save you from your own meeting culture.

The friction relative to "just have the agent write the code" is the gate time - the note, the plan, the diff review - and it is bounded. The benefit relative to "ship code without rigor" is substantial.

The loop's timing in rehearsal is not the loop's timing in production. I learned this from a demo I ran for a client team earlier this year. The demo plan called for an architecture-review run that produced an HTML report from a fresh repo in roughly four minutes. In rehearsal, with the agent's AGENTS.md pre-loaded and the repo paths cached, four minutes was achievable. The first live attempt in front of the team took eight minutes per pane and started a clock on the audience's patience that I could feel from the front of the room. The second live attempt, two days later in a different room, took ten minutes.

The pattern was not a bug. It was the predictable difference between a warm-cache run and a cold-start run. The discipline I should have built into the demo plan from the start was the same discipline this chapter teaches: assume the variable matters, plan for the worst-case timing, have a fallback ready when the live system blows your budget. The recovery pattern I now use on every demo is two-layer: a pre-generated fallback artifact in a git branch I can check out in two seconds, and a resumable session I can continue from the rehearsal state if the live session hangs. Neither is glamorous. Both eliminated the live-demo failure mode that I had been improvising around for a year.

A worked example.

To make the loop concrete, here is one feature flowing through all six phases. The feature is small: add a priority field to the Wire record in a regulated banking service. Priority is one of low / normal / high / urgent, defaults to normal, and the urgent flag triggers a separate compliance-review queue.

Research. I asked the agent to read the codebase and produce a research note. The note named four files I would not have found in an hour of grepping: the Wire record itself, the migration directory, the compliance-review-queue service, and the audit-log emitter. It also raised an open question: whether priority should be enum or free-text, given that the regulator's spec uses free-text in some documents and enum in others. I picked enum. That choice was not a detail - it defined the target the later phases would be checked against.

Plan. The agent produced a six-task plan, in order: add the database column with a default; update the Wire record class; update the wire-builder service; update the API contract; update the compliance-routing logic to read the new field; update the audit-log emitter. Each task was constrained to one file or one pair of files. I checked one thing in review: task five depended on task four's API contract change. The order was right, and the agent had flagged the dependency in the task description. Approved.

Execute. Six subagents in parallel, one per task, each constrained to its file. They returned in under three minutes. Five tasks had passing tests on first run. The sixth (audit-log emitter) had a passing test but the test was wrong - it asserted the old log format. Caught at Review.

Review. The spec-compliance reviewer caught that the audit-log task's test was asserting the old format. It also flagged that the migration was missing the down direction. The code-quality reviewer caught nothing of note. The implementer subagent fixed both items and re-ran the relevant tests.

Verify. The agent ran the full test suite (3,400 tests), a smoke test against a staging compliance-routing service, and produced a diff of the API contract change for the regulator's review. All passing. Total agent time: 47 minutes from research to verify.

Ship. PR opened with the research note, the plan, the per-task reports, the spec-compliance and code-quality reviews, and the test evidence attached. Senior reviewer spent eleven minutes on the PR, asked one question (about whether the urgent flag should be observable in the metrics dashboard, which I had not thought about), and approved. Merged. The whole feature, from "let's add a priority field" to merged code, took ninety minutes of clock time across the agent and me.

Ninety minutes, not the twenty or thirty I quoted for a small feature - the difference is the regulated context: a 3,400-test suite, a staging smoke test, and a contract diff for the regulator. Small in scope is not small in ceremony. And the minutes are still not the point; the artifacts are. Every step produced something a senior reviewer could audit. The loop is the discipline that converts the agent's capability into work I can defend.

The whole loop, in one view:

Phase	Artifact	Human gate	Failure caught
Research	Research note (2-4 pages)	Domain plausibility check	Missing context
Plan	File-level task plan	Scope and order review	Bad decomposition
Execute	Diff + per-task reports	None or light	Implementation drift
Review	Spec compliance + quality reports	Senior review	Wrong or weak code
Verify	Test evidence (failing -> passing)	QA or owner review	Behavioral failure
Ship	PR with evidence trail	Normal PR process	Process violation

Sidebar: the team that shipped with half the loop.

The full ceremony is not always the right amount, and pretending otherwise would cost me your trust the first time you watched a team thrive without it. One team I watched ran agentic delivery for the better part of a year with what looked like half the loop. Research was a comment thread on the ticket. The plan was three bullets in the PR description. No research notes, no review agents, no formal gates. Velocity went up. Defects did not.

It worked because of three conditions, and the conditions are the lesson. The team was small and senior and had built the codebase themselves, so the research function was nearly free - the context a research note exists to assemble was already in their heads. The CI suite was genuinely strict, so the verify function ran on every push whether anyone called it a phase or not. And the PR culture was already serious, so review happened, with judgment, because it always had.

Look at that list: research, verify, review. The functions did not disappear. They were built into the team's walls, so the explicit form was redundant. That is the honest reading of the loop - it is not a price you pay to use agents; it is the explicit form of six functions that have to happen somewhere. A team that has made some of the functions ambient - in its infrastructure, in its culture - can run lighter exactly there, and nowhere else.

The test is accounting, not optimism. For each phase you want to skip, name the thing that already does its job. If you cannot name it, the phase stays.

One piece of vocabulary before we move on, because the word loop is getting overloaded in the field. The six phases here are the inner loop: one unit of work, six functions, gated by you. There is also an outer loop in growing circulation - re-invoking the agent on an interval or against a queue, with nobody between iterations, until a condition holds. That is a different instrument with different preconditions, and it is the final pattern in Chapter 9. The dependency runs one way: an outer loop is only as safe as the mechanization of the functions it re-runs, because it compounds whatever discipline - or whatever absence - it wraps.

Context hygiene

Every phase in this loop runs inside a context window, and the window is the agent's working memory. Everything loaded into it - the system prompt, the files read, the tool results, the conversation so far - competes for the same bounded attention the agent needs to reason. Chapter 1 named the bound and said bigger windows only raise the ceiling; context hygiene is how you work inside it. This chapter has already named context contamination as the single biggest reason long-running sessions go wrong. Context hygiene is the practice that prevents it.

The first discipline is to load what the task needs and reference the rest. Pointers over payloads. The architecture document is linked from AGENTS.md, not pasted into it; the research note names the files that matter and where to find them, instead of inlining their contents. The two-hundred-line AGENTS.md budget from Chapter 6 is this same discipline applied to the always-loaded layer - the lines that load at every session start are the most expensive space in the window, so they earn their place or they go.

The session boundary is the instrument. One unit of work per session. The artifacts this loop commits - the research note, the plan, the per-task reports - exist precisely so the next session can start clean and read the state it needs from the repository, instead of dragging a conversation's worth of history behind it. Fresh context per unit of work is the inner-loop form of what Chapter 9's pattern eight does per iteration; the outer-loop argument belongs to that chapter.

Contamination announces itself, if you are watching for it. Four signs. The agent re-answers a question it already settled earlier in the session. It cites a stale version of a file it edited an hour ago. It forgets a constraint it honored twenty minutes back. Its edits get sloppier the longer the session runs. When you see these, do not argue with the session - you cannot debate a context window back into coherence. Commit the durable state, end the session, start clean. The fresh session is not lost progress; it is the progress, read back from the repository without the noise.

Compaction is a handoff, not a continuation. When the harness summarizes a full window to make room it drops detail - that is what summarizing is - so treat a compacted session the way you would treat handing your work to a new engineer at the door: anything that matters and is not written to a file by then is gone. Subagents are the other half of the instrument, isolating each task in its own window so one task's confusion never reaches the next. Their handoff summaries get read with the same skepticism the execute phase already asks of the orchestrator's.

Appendix B.7 is this discipline as a one-pager.

The Superpowers plugin I have referenced is one implementation of the inner loop. There are others - GitHub Spec Kit, BMAD frameworks, custom team-built skill collections. They differ in the details. They share the iterative-loop pattern. Choose what integrates with your workflow, your tools, your compliance constraints. The carrier matters less than the discipline.

Next chapter: the artifact that makes the discipline portable across team members, repositories, and time. AGENTS.md.
