Patterns for brownfield codebases

The traffic light tells you which codebases are ready for the agent. This chapter is about the operational patterns that make agentic work effective on legacy codebases - yellow projects in particular, where the practice matters most. The last pattern is the exception; it is what becomes available when a codebase graduates to green.

I will walk through eight patterns. The first four are the default operating patterns I install on most brownfield teams: worktrees, champions, hooks, and PR review. The final four are maturity patterns for teams past the first few months: mistake-journal review, demo-day backstop, failure watchlist, and the outer loop. A sidebar between the two sets covers governance for companies that sell AI to their own clients - a segment-specific concern rather than a brownfield operating pattern, but one worth naming.

Each one is something I have watched make the difference between an agent that contributes and an agent that frustrates.

Pattern one: worktrees.

Each agent session runs in its own git worktree. A worktree is an additional working directory attached to the same repository, checked out on its own branch. The agent can experiment, fail, retry, and refactor in its worktree without touching anyone else's working copy. When the work is good, the branch goes through normal PR review. When the work is bad, you delete the worktree and start over.

Cost of local-state damage with worktrees: near zero. The agent's bad attempt never touches your working copy or anyone else's.

Cost of failure without worktrees: the agent contends with your unfinished local work, breaks something subtle, and you spend an hour figuring out what changed.

Worktrees are the single most under-appreciated git feature for agentic work. Every developer on an agentic-coding team should have a git worktree add command in their muscle memory. Use them.
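
A minimal sketch of what that muscle memory looks like when scripted. The sibling-directory layout, the agent/ branch prefix, and the function names are assumptions made for illustration; only the git worktree add and git worktree remove subcommands come from git itself.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, task: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` command for one agent session.

    The sibling-directory layout and the `agent/` branch prefix are
    conventions assumed for this sketch, not anything git requires.
    """
    worktree_dir = repo.parent / f"{repo.name}-{task}"
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{task}",  # new branch for this session
        str(worktree_dir),      # where the new working directory goes
        base,                   # commit-ish the branch starts from
    ]


def worktree_remove_command(repo: Path, task: str) -> list[str]:
    """Build the command that discards a bad attempt's working directory."""
    worktree_dir = repo.parent / f"{repo.name}-{task}"
    # --force also discards uncommitted changes inside the worktree.
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(worktree_dir)]


def run(cmd: list[str]) -> None:
    """Execute a prepared git command, raising if git reports an error."""
    subprocess.run(cmd, check=True)
```

Calling run(worktree_add_command(Path("/srv/app"), "fix-login")) would create /srv/app-fix-login on a new branch agent/fix-login. Removing the worktree leaves the branch behind, so a bad attempt also needs its branch deleted if you do not want to keep it.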

Pattern two: champions.

One person on the team owns the AGENTS.md and the mistake journal for a given repository. The champion does the weekly maintenance: read what other developers added to the mistake journal, refactor rules that have accumulated, deprecate rules that no longer apply, update the conventions when the team's practice changes.

The champion rotates quarterly. The first champion has the highest cost - they set up the patterns. Subsequent champions have a lower cost - they maintain. Rotation prevents a single point of failure in the tribal knowledge of "how we use agents here." It also distributes the practice; every senior on the team eventually takes a turn, and every senior internalizes the maintenance.

The champion is not the only person who edits AGENTS.md. Everyone edits it, through pull requests, when they encounter a new pattern or a new mistake. The champion's job is curation, not authorship. The distinction matters: if the champion is the only author, the file becomes one person's opinion of how to do agentic work, and the team's actual practice diverges. If everyone authors and the champion curates, the file represents the team's collective experience, kept disciplined by a single owner at any one time.

Pattern three: two kinds of hooks.

hookify (or your agent's equivalent plugin) lets you write hooks that fire before tool execution. The hook reads the proposed action, evaluates it against custom rules, and either allows it, prompts the user, or blocks it.

The use case for hookify in brownfield work is specific. You have areas of the codebase that are dangerous to modify without senior review - cryptography modules, payment processing, data migration scripts, anything regulatory. You write a hookify rule that blocks the agent from modifying files in those directories, or requires explicit user confirmation when it tries. The rule lives in the repository, committed to git, applied automatically every session.

hookify rules complement AGENTS.md. AGENTS.md tells the agent the team's conventions and forbidden patterns; the agent reads them and applies them by default. hookify enforces the rules structurally; if the agent tries to violate them anyway (because LLMs sometimes do), the hook catches it. AGENTS.md is the polite request. hookify is the firm boundary on the agent. But the agent is not the only author with write access, and the boundary that binds every author is a different kind of hook.

For yellow projects, I recommend establishing hookify rules for at least the regulatory-sensitive areas and the historically broken modules. Five to ten rules is usually enough. Each rule is one line of configuration plus a one-line justification.
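
To make the allow / prompt / block decision concrete, here is a sketch of the logic such a hook performs. This is not hookify's actual rule format, which I am not reproducing here; the PathRule type, the directory names, and the first-match-wins policy are all assumptions for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class PathRule:
    pattern: str        # glob matched against the repo-relative path
    decision: str       # "block" or "prompt"
    justification: str  # the one-line reason kept next to the rule


# Hypothetical rules; the directory names are placeholders.
RULES = [
    PathRule("src/crypto/*", "block", "cryptography changes need senior review"),
    PathRule("src/payments/*", "block", "payment processing is regulated"),
    PathRule("migrations/*", "prompt", "data migrations are hard to reverse"),
]


def evaluate_edit(path: str, rules: list[PathRule] = RULES) -> tuple[str, str]:
    """Return (decision, reason) for a proposed file edit; first match wins."""
    for rule in rules:
        # fnmatchcase's `*` also matches `/`, so nested files are covered.
        if fnmatchcase(path, rule.pattern):
            return rule.decision, rule.justification
    return "allow", "no rule matched"
```

Each entry carries its justification alongside the pattern, which is the same one-line-plus-one-line shape described above.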

The second kind of hook is not on the agent at all. It is on the repository. Git runs client-side hooks at commit and at push, and they fire for whoever is committing - you, your teammate, the agent. That is the leverage. The agent is just another author with write access, so a commit gate applies to it for free, with no agent-specific wiring. You write the check once, against the act of committing, and it covers the agent's code and yours alike.

Split the checks by cost. The pre-commit hook is the fast pass - format, lint, typecheck, a secret scan - the things that run in seconds and should never reach a branch. The pre-push hook is the heavier pass - the fast test suite, the build - which you can afford to wait on because it runs once per push rather than once per commit. Keep the tests in pre-push; a pre-commit hook has to finish in seconds, or people reach for the escape hatch, and every skip trains the habit of skipping. Manage both from a versioned config through a hook manager, so the hooks install for everyone who clones the repository instead of living un-versioned in one machine's .git/hooks.
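
The split by cost can be sketched as a small stage table and a runner. The make targets are hypothetical stand-ins for whatever formatter, linter, type checker, and scanner the team already uses, and a real setup would more likely express this in a hook manager's own config than in a hand-written script.

```python
import subprocess

# Hypothetical Makefile targets standing in for the team's real tools.
CHECKS = {
    # Fast pass: must finish in seconds.
    "pre-commit": [
        ["make", "format-check"],
        ["make", "lint"],
        ["make", "typecheck"],
        ["make", "secret-scan"],
    ],
    # Heavier pass: runs once per push.
    "pre-push": [
        ["make", "test-fast"],
        ["make", "build"],
    ],
}


def run_stage(stage: str, runner=subprocess.call) -> int:
    """Run one stage's checks in order and stop at the first failure.

    Returns 0 if every check passed, otherwise the failing exit code.
    Git aborts the commit or push when the hook exits non-zero.
    """
    for cmd in CHECKS[stage]:
        code = runner(cmd)
        if code != 0:
            return code
    return 0
```

A hook script would call run_stage("pre-commit") or run_stage("pre-push") and exit with the returned code. The runner parameter exists only so the logic can be exercised without running real commands.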

The escape hatch is --no-verify, and the agent can reach for it the same as anyone; you close that door with a hookify deny rule on the flag. The two boundaries cover each other's blind spot: hookify catches the agent mid-session, before the edit lands, and the git gate catches every author, on every path hookify never sees, at the moment the commit is written. Be honest about what the local hooks are, though. They are fast feedback, not enforcement - skippable, configured in a file the agent can itself edit, and not guaranteed to be installed on a given machine. The same checks running in CI are the ones you cannot skip from your laptop, and that hook config together with its CI workflow is pattern eight's gate the agent cannot edit. Either way, deterministic checks do not drift the way probabilistic reviewers can - a secret scanner either finds the key or it does not.
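
A sketch of the check a deny rule on the flag has to make, assuming the hook receives the proposed shell command as a string. The function name is hypothetical, and the matching is deliberately conservative rather than exhaustive: it does not parse clustered short flags such as -nm, and it will flag a commit whose message is literally "-n".

```python
import shlex


def bypasses_git_hooks(command: str) -> bool:
    """True if a proposed shell command would skip client-side git hooks."""
    tokens = shlex.split(command)
    if "git" not in tokens:
        return False
    args = tokens[tokens.index("git") + 1:]
    if "--no-verify" in args:
        return True
    # For `git commit`, -n is the short form of --no-verify.
    # For `git push`, -n means --dry-run, so it is not flagged there.
    return "commit" in args and "-n" in args
```

The asymmetry between commit and push is the detail worth getting right: a rule that denies every -n would also block harmless dry-run pushes.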

Pattern four: PR review toolkit.

Before a senior engineer reviews an agent-produced pull request, a set of review agents goes through the diff first. Silent failure hunter looks for swallowed exceptions, unbounded retries, missing null checks. PR test analyzer identifies which new methods have weak test coverage and recommends specific tests to add. Security scanner checks for the standard vulnerability categories. Documentation reviewer flags missing or stale documentation.

The review agents do not replace human review. They run before it, surfacing the kinds of issues that are mechanical to detect. The human reviewer then focuses on the things only humans can evaluate: business correctness, architectural fit, judgment calls.

The leverage is in the time saved on mechanical findings. Without the review agents, a senior engineer with fifteen minutes for a review spends most of them catching the obvious bugs. With the review agents, those bugs are already flagged: the mechanical pass shrinks to five minutes, and the other ten go to the architectural judgment that the agents cannot make.

Set up the review agents. Wire them into the pull request flow. The agents do the mechanical work; the humans do the human work.

One failure story for this pattern, because it is the pattern with a failure mode the others do not have. A team I worked with installed the full toolkit, and the toolkit was good. The review agents caught real bugs, week after week, and the senior reviewers learned to trust the green checkmarks. Within a quarter, the fifteen-minute human reviews had become three-minute scans. Nobody decided that. Attention drifts away from work that appears to be already done.

Then a PR shipped that every agent had passed. Tests green, no security findings, clean diff. It also applied a discount rule to the wrong customer tier - a business-correctness error, which is exactly the category the review agents do not evaluate and the category this pattern reserves for humans. It ran in production for weeks before a support ticket surfaced it.

The post-mortem was honest: the toolkit had worked precisely as designed. It removed the mechanical findings from human review, and the humans let the judgment work drift away with the mechanical work. Chapter 10 names an archetype called the uncalibrated delegator; this was a whole team becoming one, with good tooling as the alibi.

The fix was not removing the review agents. The fix was making the human floor explicit: a minimum review on business correctness and architectural fit for every agent-touched PR, tracked as a number next to defect rate; Chapter 5's read order for agent diffs is that floor made concrete. The review agents are a filter in front of human judgment. The moment they become a substitute for it, this pattern is making your reviews worse while making them look better.

Four patterns. Worktrees. Champions. Hooks. PR review toolkit. None of them require inventing a new process. All of them slot into how engineering teams already ship code, with the small additions that agentic work requires.

Sidebar: governance for companies that sell AI to their clients.

This one is not a brownfield operating pattern - it is advice for a particular audience segment, set apart from the numbered patterns for that reason. If your company sells AI capabilities to its own clients - not just consuming AI internally, but reselling it as part of your product - then the governance pattern differs from that of a pure consumer of AI. Your demos, your sales calls, your client engagements are all situations where your team's discipline is on display. The client is evaluating whether you know how to do AI responsibly, not just whether the AI works.

This calls for a few additional patterns on top of the general ones.

First, the demos are not allowed to skip the rigor. If you are showing a client how you use an agent to ship code, you show them the research phase, the plan phase, the review phase, the verify phase. You do not just show them the agent generating code and shipping it, because that is the demo that creates client expectations you will not meet in production.

Second, your AGENTS.md is a sales asset. Clients will want to see what your team has codified. A well-maintained AGENTS.md that runs to a hundred lines, with a mistake journal that shows real lessons learned, is more credible than a thousand-line AGENTS.md that reads like it was written by a consultant. Show the discipline, not the volume.

Third, the kill signal framework is something you teach clients. The rubric is more valuable to a client than any specific recommendation you would make, because the lens lets the client evaluate their own codebases without depending on you. Giving away the frame strengthens the trust relationship. Teams that hoard frameworks lose to teams that share them.

The patterns above apply to any team. They apply with extra force to teams whose customers are watching - companies whose engineering quality is a visible product surface, not an internal cost center. Governance maturity is part of those companies' offering, and the discipline this manual describes is what makes the maturity defensible.

Four more patterns. Three are brief, because they appear in the well-functioning teams I have worked with even though they are less often discussed. The fourth is the newest material in this chapter, and it gets the room it needs.

Pattern five: the mistake-journal review.

The mistake journal in AGENTS.md is alive. Every entry is a real failure the team experienced. But the journal grows over time, and not every entry stays load-bearing forever. Some failures are structurally resolved - the underlying cause has been refactored out, the dependency has been replaced, the convention has been internalized to the point where nobody would make the mistake again. Those entries can be retired without losing safety, and retiring them keeps the journal lean.

The champion runs a quarterly review of the journal. For each entry, the question is: has anyone been saved by this rule in the last three months? If yes, keep it. If no, but the rule is still applicable to the codebase, keep it (rules that prevent rare failures are still valuable even if the failure has not recurred recently). If no, and the rule no longer applies because the underlying problem is gone, retire it with a note in the commit message explaining why.
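
The three-way question above reduces to a small decision function. The names are hypothetical; the point of the sketch is that only one combination of answers retires an entry.

```python
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    RETIRE = "retire"


def review_entry(saved_someone_recently: bool, still_applicable: bool) -> Verdict:
    """Apply the quarterly review question to one mistake-journal entry."""
    if saved_someone_recently:
        return Verdict.KEEP   # the rule did real work in the last three months
    if still_applicable:
        return Verdict.KEEP   # rare failures are still worth preventing
    return Verdict.RETIRE     # underlying problem is gone; explain in the commit
```

Written this way, the asymmetry is visible: a rule is retired only when nobody was saved by it and the underlying problem no longer exists.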

The habit keeps the journal from becoming a graveyard. A graveyard of rules is almost as useless as no rules at all, because the team stops trusting any individual rule when there are too many of them, and the agent's context window gets crowded with obsolete instructions.

Pattern six: the demo-day backstop.

When the team is preparing a demo of agentic work - for leadership, for clients, for an internal showcase - there is a temptation to do the demo live, with the agent doing real work in front of the audience. Sometimes this works. Sometimes the agent has a bad day, the network hiccups, the model decides to be unusually verbose. Live demos of probabilistic systems carry real failure risk.

The pattern I recommend: prepare a backup recording of the same demo, done successfully ahead of time. If the live demo is still in trouble thirty seconds in, pivot to the recording. The audience does not need to know the difference. The lesson lands either way.

The backstop is not cheating. The backstop is professional execution. Every senior speaker in any field has a backup plan for the moment the live element fails. The agentic equivalent is a recorded version of the same work, kept ready.

Pattern seven: the failure watchlist.

When the team has been doing agentic work for a few months, you start to notice failure modes that recur. Specific kinds of mistakes the agent makes that you have to correct repeatedly. Specific situations where the workflow breaks down. Specific user behaviors that lead to predictable problems.

The pattern is to maintain a failure watchlist - a document that catalogs these recurring failure modes, the conditions under which they occur, and the team's standard response when they happen. The list grows. The team reviews it together every month or so. New entries get added; old entries that have been structurally fixed get retired.

The watchlist is to operations what the mistake journal is to development. The mistake journal prevents code mistakes; the watchlist prevents process mistakes. Both are committed to the repository. Both are reviewed regularly. Both are how the team's accumulated experience becomes infrastructure.

Pattern eight: the outer loop.

The seven patterns above assume a human in the room. The eighth is the one teams reach for a few months in, when somebody asks the obvious question: if the agent can run a disciplined loop while I watch, why am I watching? The industry's answer is the outer loop - re-invoking the agent automatically, iteration after iteration, until a condition holds. The six-phase loop in Chapter 5 is the inner loop: one unit of work, six functions, gated by you. The outer loop wraps it. When an iteration ends, the next one starts, and nobody is between them.

The idea has a prehistory, and the difference between the two eras is the entire lesson. In 2023, AutoGPT and BabyAGI looped a model against its own opinion of its progress. Nothing external graded an iteration - the model marked its own homework - so every lap compounded drift, and the approach collapsed as a way of shipping software within months. The revival is structurally different. In mid-2025, Geoff Huntley wired a coding agent into a bash while-loop - feed it one prompt file, let it run, repeat forever - and the technique spread under the name Ralph Wiggum. Each iteration starts with a fresh context window, does one unit of work, and ends against graders the model does not control: the compiler, the test suite, the diff. State lives in the repository, not in the conversation. Everything that makes the loop converge sits outside the model.

Between late 2025 and spring 2026, the pattern stopped being a bash trick and became a product surface. GitHub's Copilot coding agent went generally available in September 2025: delegate a task, an agent works in an isolated environment, and the result comes back as a draft pull request. Cursor shipped Cloud Agents in October 2025 - many agents running detached, your laptop closed. Google's Jules added Scheduled Tasks that December for recurring maintenance work. The same month, a ralph-wiggum plugin appeared in the official Claude Code repository, and by spring 2026 the loop was first-class there: /loop re-runs a prompt on an interval or paces itself when you omit one, Routines fire cloud agents from a schedule or a GitHub event, /goal keeps the agent working until a completion condition holds, /autofix-pr watches CI and pushes fixes until the pull request goes green. One shape, many spellings. By Chapter 1's convergence test, the capability has arrived everywhere - though what converged is a workflow wrapped around the primitives, not a new primitive.

The trend is real, and it is also where the discipline gets tested hardest, because the outer loop adds attempts, not judgment. It multiplies whatever your inner loop permits. If every iteration ends against a strict gate, the loop compounds progress: a queue of small verified units gets shorter overnight. If the gate is weak, the same patience compounds slop. Huntley's own name for the failure mode is overbaking - leave the loop running past its job and it keeps inventing work nobody asked for. The agent does not get tired. That is the feature, and unattended, it is also the threat.

So the pattern is not the loop; the pattern is the contract you run it under. Five lines, written before the first unattended iteration.

A stop condition a machine can evaluate - the queue is empty, the suite is green, the budget is spent. A loop without one is not autonomy; it is abandonment.

A budget - tokens, money, iterations, or hours, whichever hits first; an unattended loop is the per-token pricing model's best customer, and the Appendix A math runs overnight too.

A gate the agent cannot edit - tests, lint configuration, CI workflow, and the hookify rules sit behind a deny rule (pattern three). Chapter 5's caveat - a green suite the agent wrote is evidence, not proof - applies twice over when nobody reads the evidence until morning. The cheapest way for a loop to go green is to negotiate with its own grader.

Fresh context per iteration, durable state in the repository - a queue file and a journal, committed, so each iteration starts clean and reads the loop's history from git instead of dragging a degrading context behind it. Chapter 5 called context contamination the single biggest reason long-running sessions go wrong; the outer loop done right is a context-hygiene instrument - forty short clean sessions instead of one long degrading one.

Isolation sized for absence - its own worktree (pattern one), sandbox on, no production credentials, network constrained. An unattended session is the one place where prompt injection meets no human skeptic; Chapter 3's layers are load-bearing here, not optional.

Appendix B.6 is this contract as a one-pager.
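
A sketch of the first three contract lines as control flow: a stop condition, a budget, and a gate evaluated outside the agent. Every name here is hypothetical, the callables stand in for whatever invokes your agent and your CI checks, and stopping on the first gate failure is a conservative simplification of this sketch rather than a rule from the contract.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_iterations: int
    max_seconds: float


def run_outer_loop(
    queue_is_empty: Callable[[], bool],
    run_iteration: Callable[[], None],
    gate_passes: Callable[[], bool],
    budget: Budget,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Re-invoke one unit of work until a machine-checkable condition holds.

    Returns the reason the loop stopped, so the morning reader knows why.
    """
    started = clock()
    for _ in range(budget.max_iterations):
        if queue_is_empty():
            return "queue empty"
        if clock() - started >= budget.max_seconds:
            return "time budget spent"
        run_iteration()        # fresh agent session; state lives in the repo
        if not gate_passes():  # graders the agent cannot edit
            return "gate failed"
    return "queue empty" if queue_is_empty() else "iteration budget spent"
```

The loop has no path that runs forever: every exit is a named reason. Token and money budgets would slot in next to the time check in the same way.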

What goes in the queue matters as much as the contract. Loop-eligible work has many similar units, each machine-verifiable, each reversible: migrations, lint and typing sweeps, dependency bumps, characterization-test backfill, mechanical refactors. Design-heavy single-artifact work is not eligible; more attempts do not add judgment, and the loop will spend your budget proving it. The traffic light from Chapter 8 applies with extra force, because the outer loop is autonomous agent work in its most concentrated form: GREEN codebases only. YELLOW means human-led, and the outer loop has no human in it by definition.

The human floor does not disappear; it moves to the morning. Overnight output arrives as pull requests and gets reviewed as pull requests - pattern four's explicit floor, business correctness and architectural fit, not a glance at the checkmarks, because everything a loop produces arrives wearing green checkmarks. And the loop itself gets kill signals, the same discipline Chapter 8 applies to codebases. Four are enough: the same diff applied and then reverted across iterations; budget burning while the queue does not shrink; the same failure surfacing a third time; any iteration that touched the gate. Any one of them means stop - read the journal, fix the cause, then relaunch. A loop restarted on hope is a loop you have stopped controlling.
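
The four kill signals are mechanical enough to check from the loop's journal. A sketch, assuming each iteration records the fields below; the record shape, the diff hashes, and the three-iteration stall window are assumptions, not something the contract prescribes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Iteration:
    diff_hash: str                # identifies the diff this iteration applied
    reverted_hash: Optional[str]  # diff it undid, if any
    queue_len: int                # queue length after the iteration
    failure: Optional[str]        # failure signature, None on success
    touched_gate: bool            # edited tests, lint config, CI, or hook rules


def kill_signals(history: list[Iteration], stall_window: int = 3) -> list[str]:
    """Return the kill signals present in the loop's history, if any."""
    signals = []

    # 1. The same diff applied and then reverted across iterations.
    applied = set()
    for it in history:
        if it.reverted_hash is not None and it.reverted_hash in applied:
            signals.append("diff applied then reverted")
            break
        applied.add(it.diff_hash)

    # 2. Budget burning while the queue does not shrink.
    recent = history[-stall_window:]
    if len(recent) == stall_window and recent[-1].queue_len >= recent[0].queue_len:
        signals.append("queue not shrinking")

    # 3. The same failure surfacing a third time.
    counts = Counter(it.failure for it in history if it.failure is not None)
    if any(n >= 3 for n in counts.values()):
        signals.append("same failure three times")

    # 4. Any iteration that touched the gate.
    if any(it.touched_gate for it in history):
        signals.append("gate touched")

    return signals
```

A non-empty result means stop, read the journal, and fix the cause before relaunching. The check belongs outside the agent's reach, alongside the gate.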

Two poles mark how far teams take this. Huntley runs the loop raw and prices it like a utility - roughly $10 an hour. At the industrial end, StrongDM's software factory runs fully non-interactive delivery, humans neither writing nor reviewing code, with end-to-end scenarios held outside the codebase as a holdout set the loop cannot weaken - at a token spend Simon Willison pegged near $20,000 a month per engineer. This pattern sits deliberately between the poles: contract, queue, gates outside the agent's reach, and a human reading the pull requests in the morning. Nothing in Chapter 10's ninety days requires the outer loop; it is what month four can look like when the first ninety were honest. Attended, you are the backstop. Unattended, the contract is the backstop - which makes the outer loop the first consumer of every control this manual installs, and the cleanest test of whether they were ever really installed.

Eight patterns total. They will not all apply to every team. The first four - worktrees, champions, hooks, PR review toolkit - apply broadly. The next three - mistake-journal review, demo-day backstop, failure watchlist - are for teams past the first few months. The last one - the outer loop - is for teams past the other seven.

Next chapter: the adoption framework - how a team that has read this manual starts. Three roles, ninety days, specific commitments.
