From generating code to shipping software

13 min read

The most common failure I see in teams adopting agentic coding is that they treat the agent as a code generator. They use the agent for the part of the work where code gets typed. They handle every other part of shipping software the way they always have. Then they wonder why the agent's contribution feels marginal, or worse, why the team is spending more time fixing the agent's output than they saved by having the agent type for them.

The mistake is that code generation is no longer the bottleneck.

For the past forty years of software engineering, the constraint on output was, very approximately, how fast a developer could type valid code that did roughly the right thing. Autocomplete helped. Better IDEs helped. AI-assisted typing helped a lot. Each of these tools attacked the same constraint - the time between "I know what I want to write" and "the code is on the screen." Each generation of tooling chipped away at that gap.

Agentic AI breaks the constraint. The agent can produce a hundred lines of code that compiles in the time it takes you to formulate the prompt. Code production has collapsed by an order of magnitude in the kinds of work the agent handles well. It has not gone to zero - the agent still costs latency, tokens, and your time spent formulating and reviewing - but the marginal cost of the next hundred lines of code has dropped from "an hour of typing" to "thirty seconds of waiting." That is the constraint that breaks. So if your method is "agent produces code, you review code, you ship code," your bottleneck moved. The bottleneck is no longer producing code. The bottleneck is can you formulate the work clearly enough that the agent does exactly what is needed.

That sentence is the central methodological claim of this manual.

The bottleneck is formulation discipline. The capability is there. What separates teams that ship well with agents from teams that flail is whether they have built the habit of formulating work in a way the agent can execute correctly. Teams without that habit get vague specifications. Vague specifications produce guessing. Guessing produces output that looks right and isn't. Teams with the practice get sharp specifications. Sharp specifications produce predictable execution. Predictable execution produces output you can ship.

What does "formulation discipline" mean in practice?

First, clarity of domain before any code is written. The agent does not know your business. It does not know your conventions. It does not know the history of why your codebase looks the way it does. Before you ask the agent to make a change, you have to load context - through AGENTS.md, through skills, through the research phase of the workflow, through whatever mechanism. The amount of upfront work required has decreased relative to the pre-agent era (because the agent can do a lot of the loading itself), but it has not gone to zero. It has shifted from "you load context into your own head" to "you load context into the agent's context window."

Second, decomposition by file and by task before any action is taken. The agent will happily attempt a multi-file refactor in a single shot. The output will look impressive. It will also be much harder to review than the same refactor decomposed into five small steps, each of which is a coherent change. The rule is to make the decomposition explicit - write the plan first, then execute the plan, then verify each step independently. This is not new advice; senior engineers have always worked this way on complex changes. The agent makes the cost of failing to decompose much higher, because the agent will produce something that compiles whether it is good or not.

Third, refusing to say "done" without verification. The agent will tell you the change is complete. The agent will be sincere. The agent will also be wrong sometimes, because the agent's internal sense of "this looks right" is not the same as "I ran the tests and they pass." The rule is: there is no "done" without test evidence. Not unit tests as a checkbox - actual behavioral tests that exercise the change. If your change cannot be tested behaviorally, your change probably should not have been an agent task in the first place.

Three habits. Domain clarity before code. Decomposition by file and by task. Test evidence before "done." All three are formulation, not generation. All three are what separates a team that ships software with agents from a team that generates code with agents.

The first two habits are not ends in themselves. Domain clarity and decomposition are how you arrive at a well-defined expected outcome, and the third habit is evidence measured against that result - with no defined result, there is nothing for the evidence to be evidence of.

Now, here is where I should anticipate the objection. You are probably thinking: this sounds like a lot of work for what was supposed to save me time.

First, it is less work than you think. The disciplines above are mostly already practiced by good senior engineers on important changes. They formulate carefully. They decompose. They verify. What they do not always do is package the practice in a form that scales - that travels from one person's habits to a team's habits, from one repository to many, from one project to a year of projects. The agentic workflow lets you encode the practice once and have it apply everywhere. The total cost is lower, even if the per-task cost is comparable.

Second, the alternative is worse. The "let the agent freestyle" approach feels faster in the moment. It produces visible output. The output goes to code review. The reviewer spots the bugs (sometimes). The bugs go back to the agent. The cycle repeats. By the time the change ships, the team has spent more total time than they would have spent if they had imposed rigor at the start. The "discipline takes time" feeling is real. The "lack of discipline takes more time" reality is also real, and usually invisible until you look at cycle time across a quarter.

Across multiple teams: those that imposed rigor early shipped faster on a quarterly basis. Those that let the agent freestyle felt faster in the moment and shipped slower on the quarter. Same teams, same agents, same codebases. Different formulation discipline. Substantially different outcomes.

A heuristic for when formulation discipline pays back. Not every task needs the full loop. The cost of formulating well - the research note, the file-level plan, the test definitions - is paid up front, before the agent writes a line. For trivial work, the formulation cost exceeds the agent's contribution and you lose time.

The practice pays back when at least one of these is true: the work touches more than three files (blast radius is large enough that an unguided agent will get something wrong); the change crosses concerns (database schema plus service layer plus UI, where a single misstep cascades); or you would not do the work yourself by typing it in (it is genuinely novel, or large, or in code you do not know well enough to write by hand). If none of those apply, type it. A two-line bug fix does not need a research note. A one-line config change does not need a plan.

The error mode I see most often is teams that apply the loop to everything (and slow down) or apply it to nothing (and miss the wins). The heuristic above sorts the two. Use the loop when the formulation cost is less than the cost of the agent getting it wrong; skip it when it is not.

The METR randomized controlled trial published in July 2025 measured this directly. Experienced open-source developers, working on their own familiar repositories with AI assistance (Cursor with Claude), took 19% longer to complete tasks than the same developers working without the AI. Before the trial, the developers predicted the AI would make them 24% faster. After the trial, despite the objective slowdown, they still believed the AI had made them roughly 20% faster. That is a 43-point gap between predicted speedup and measured slowdown, and a perception that persists even after the data contradicts it.

METR did not test formulation discipline as a separate variable. They measured raw AI assistance on familiar code. I treat METR's result as a warning: raw AI assistance does not automatically become delivery speed. The missing variable is workflow discipline. Teams that impose process can do better. Teams that do not can stay at or below METR's number, while believing they are faster.

The most vivid version of this contrast I have personally observed involved two teams at the same company, working in adjacent areas of the same product. Team A turned the agent loose immediately. They celebrated the early velocity. By month three they were shipping at roughly the pre-agent baseline but with higher defect rates and visibly drained senior reviewers.

Team B spent month one on the practice. AGENTS.md drafted and iterated. Skills written for the team's specific patterns. Hooks configured. They shipped no agent-led work in month one. By month two they returned to baseline velocity with the practice in place. By month three they were shipping at noticeably better quality and velocity than they had been pre-agent. The investment had compounded.

Same company. Adjacent codebases. Same agent. The difference was formulation discipline imposed early. The cost was a slow first month. The benefit paid back over the quarter. This was a field observation across multiple teams, not a controlled experiment. The contrast was clear enough to change how I teach adoption, with the caveat that adjacent teams in the same company are not independent samples.

The strongest test of the thesis in this chapter came in February through April 2026. Anthropic shipped three product changes in six weeks that materially degraded Claude Code for complex engineering work in some workflows. Adaptive thinking by default on February 9. Default effort dropped from high to medium on March 3. A caching bug in reasoning history retention on March 26. The combined effect was severe enough that an AMD senior director analyzed 6,852 Claude Code sessions and 234,760 tool calls; the analysis showed the model shifting from research-first to edit-first behavior as thinking redaction rolled from 1.5% to 100% of turns. A separate external analysis reported a substantial drop in code quality across the same period. Anthropic published a technical post-mortem on April 23 acknowledging all three root causes.

I watched two adjacent teams during this period. The first team had spent the prior quarter building the infrastructure this manual describes: CLAUDE.md with team-specific rules, plan mode by default, a six-phase loop for non-trivial work, hooks intercepting destructive operations. The second team used the same Claude Code, same models, same plans, but ran them as freestyle dispatch. The second team's velocity halved overnight when the model started taking edit-first shortcuts. The first team's velocity dropped maybe 10% and recovered the moment they tightened plan-mode gates and switched a few skills to higher-effort by default.

That was the experiment running in production. The teams with the discipline absorbed much of the regression. The teams without it felt the full behavioral shift immediately, lost weeks of momentum, and blamed the tool. The methodology is not a hedge against a bad model run. It is the difference between the agent being a useful teammate and the agent being a liability the day the vendor ships a regression.

If you take one thing from this chapter, take this: the cost of imposing the rigor is paid in the first month. The benefit is paid for the rest of time. The math is in your favor if you can survive the first month.

One related observation, because it has cost me time to learn.

The practice does not need to be perfect to start producing benefit. An AGENTS.md with fifteen lines is meaningfully better than no AGENTS.md. A research note that is half-finished is meaningfully better than no research note. A two-task plan that misses three things is meaningfully better than no plan. Perfect is the enemy of started.

The teams that get stuck are often the teams that try to set up the perfect agentic workflow before running it. They want the comprehensive AGENTS.md, the complete set of skills, the full hook configuration, before they ship a single agent-led story. They get bogged down in the setup. They never start. The setup expands to fill all available time, because there is always one more rule that could be added, one more skill that could be written, one more hook that could be configured.

Start with the minimum viable practice. AGENTS.md skeleton, fifteen lines, capturing the things you know are dangerous. Run a few stories through the agent. Notice what goes wrong. Add the rules that would have prevented those specific failures. Iterate. The practice grows from real failures, not from imagined ones. It also grows faster, because each rule has earned its place.

The next chapter is the operational form of the practice. Six phases. Research, plan, execute, review, verify, ship. Each phase is a specific kind of work that the agent does well when properly formulated. Each phase produces an artifact that can be reviewed, committed, and audited. The whole loop is the equivalent of the SDLC for a single change - research is your requirements gathering, plan is your design, execute is your implementation, review is your code review, verify is your testing, ship is your deployment.

If you have ever found yourself wishing you had a junior engineer who followed the team's process every single time - never skipped the design step, never shipped without tests, never marked a story done without verification - the six-phase loop is the answer. The agent will follow the process if you encode the process as code. The plugin we will look at in the next chapter does exactly that.

But first, a moment on what the six phases are not.

They are not the only way to do agentic delivery. There are competing frameworks - GitHub's Spec Kit, BMAD-style methodologies, custom skill collections that teams build for themselves. I will name them again in this chapter and the next, because the lesson is not "use this specific plugin." The lesson is the iterative loop itself. The plugin is the carrier; the discipline is what travels.

They are not a one-size-fits-all process. A two-line bug fix does not need all six phases. A trivial documentation update does not need all six phases. The six-phase loop is for the kinds of changes where the upside of rigor justifies the friction - typically anything that touches business logic, modifies data structures, changes API contracts, or affects more than a handful of files. The decision of which work to put through the full loop and which to skip is itself part of the practice.

They are not the end of the work. The agent ships a pull request. The pull request goes through your normal team review. Humans look at the diff. Humans approve. Humans merge. The agent does not replace the team's existing shipping process; the agent feeds it.

On to the six phases.

Key	Action
`?`	Show this help
`Esc`	Close overlays and menus
`⌘ K` or `Ctrl K`	Open search
`/`	Open search (secondary)
`←` `→`	Previous / next chapter
`g` `g`	Jump to top
`G`	Jump to bottom
`T`	Toggle theme
`-` `+`	Decrease / increase font size