Readiness: The Kill Signals and the Traffic Light

When should you not use AI coding agents?

An unfashionable claim to open this chapter. There are codebases where you should not use agentic delivery. Not because the agent is bad, not because the team is bad, but because the codebase has properties that make agentic work unsafe, unproductive, or both. The honest thing to do, when a team asks me whether they should put an agent on their legacy monolith, is to evaluate the codebase first, then answer.

The rubric I use for this evaluation is eight kill signals. Each signal is a property of the codebase or the team. The more signals present, the more dangerous it is to put an agent in front of the code. At a certain threshold, you stop and fix the codebase before you bring the agent in.

The rubric is not a rejection of agentic delivery. It is the opposite. It is the discipline that lets you say yes to agentic delivery in the places where it works, by saying no in the places where it does not. Without the kill signals, every project becomes an AI project, including the ones that should not be, and the failures of those bad-fit projects taint the reputation of the entire approach. This is also the chapter built to travel upward: the eight signals and the traffic light are the portfolio conversation - which codebases now, which later, which never - in a form a director can act on.

The signals at a glance:

Signal	What fails	First fix
1. No tests	Agent cannot verify behavior; regressions go uncaught	Characterization tests on the modules in scope
2. No documentation	Agent invents missing context; produces plausible but wrong code	Architecture review workflow (Chapter 7)
3. Tight coupling	Blast radius is unpredictable; one change cascades	Break the worst coupling boundary first
4. Scattered rules	Agent updates one copy of business logic, misses others	Single source of truth for the rule
5. Regulatory constraints	Agent cannot satisfy audit alone; human gate required	Workflow wrapper with human approval steps
6. Team cannot evaluate output	Human review cannot catch domain failure	Restrict scope or add expert reviewer (weighs heavily)
7. Model-context fit	Agent lacks corpus familiarity; performance degrades	Add docs/skills, or route to a model with better fit
8. Velocity-of-change	Framework or version moving under the agent	Version-specific rules in AGENTS.md

Signal one: no tests.

The codebase has no automated test suite, or has a test suite so out of date that nobody runs it. The build succeeds. There is nothing that exercises behavior.

This blocks safe agent-led work, because the agent cannot verify its own changes. The agent will write code that looks correct. The code will compile. The code will pass syntax checks. The code will deploy to the staging environment. The code will then break something in production that nobody notices until a customer complains three weeks later, by which point the change is buried under twenty other changes and the regression is hard to attribute.

Twice, in banking, with the same root cause: a transaction reconciliation module written in the early 2010s by a senior engineer who left the company in 2018. No tests, because "the senior engineer just knew it worked." The next person who modified it broke the reconciliation logic in a way that took two months to surface, by which point the discrepancy had crossed the threshold that triggers a regulatory disclosure. That is the cost of "we never had tests" in a regulated environment. The agent makes it faster - both the productive path and the failure path.

What to do: write the tests first. Not all the tests; just enough to lock the current behavior. Use the agent for that - point it at the module, ask it to write characterization tests that capture what the module currently does. The agent is good at this kind of bulk test generation, because the spec is "whatever the code currently produces." Once the characterization tests exist, you can let the agent modify the module, because the tests will catch behavioral changes.

This converts a kill signal into a manageable risk. It is also a real investment of time. If the team is unwilling to make the investment, the answer is no, you should not put an agent on this code.

One caution: agent-generated characterization tests on legacy code lock current behavior, not correct behavior. Use them as a regression net while you migrate, not as proof of correctness.
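A minimal sketch of what "lock the current behavior" can look like in practice. The `legacy_reconcile` function is a hypothetical stand-in for an untested module, and `characterize` and the case names are illustrative, not taken from any particular test framework. The idea is only that the first run records whatever the code does today and later runs report any case whose output has changed.

```python
import json
from pathlib import Path


def legacy_reconcile(ledger, statement):
    """Hypothetical stand-in for an untested legacy function."""
    # Amounts are compared in whole cents; mismatched statement ids are returned.
    booked = {e["id"]: round(e["amount"] * 100) for e in ledger}
    return sorted(
        e["id"] for e in statement
        if booked.get(e["id"]) != round(e["amount"] * 100)
    )


def characterize(func, cases, snapshot_path):
    """Record outputs on the first run; afterwards, list cases whose output changed."""
    observed = {name: func(*args) for name, args in cases.items()}
    path = Path(snapshot_path)
    if not path.exists():
        # First run: the "spec" is whatever the code currently produces.
        path.write_text(json.dumps(observed, indent=2, sort_keys=True))
        return []
    recorded = json.loads(path.read_text())
    return sorted(name for name in cases if recorded.get(name) != observed[name])


# Illustrative inputs; real cases would come from production-like data.
CASES = {
    "all_match": ([{"id": "t1", "amount": 10.00}], [{"id": "t1", "amount": 10.00}]),
    "amount_differs": ([{"id": "t1", "amount": 10.00}], [{"id": "t1", "amount": 10.01}]),
    "missing_from_ledger": ([], [{"id": "t2", "amount": 5.00}]),
}
```

Calling `characterize(legacy_reconcile, CASES, "reconcile.snapshot.json")` once writes the snapshot. Calling it again after a change returns the names of the cases whose behavior moved, which is a regression signal and, per the caution above, says nothing about whether the recorded behavior was right in the first place.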

Signal two: no documentation.

The codebase has no architectural overview, no in-code comments explaining why decisions were made, no design documents, no decision records. The structure is what it is, and only the original author knows why.

This blocks safe agent-led work because the agent invents context when context is missing. In one team, the senior engineer responsible for commission logic knew the fee tier was calculated differently for the legacy product line because of a regulatory carve-out from 2017. The agent did not. The agent read the code, saw a single calculation, assumed uniform treatment, and proposed a refactor that "simplified" the function. The senior engineer spotted the bug in code review. The senior engineer was now reviewing AI output for forty hours a week instead of building.

The shift lands on the people you can least afford to lose. When the team adopts agentic delivery without first investing in documentation, the team's senior engineers stop building and start reviewing. The throughput might be the same; the experience is much worse, and the senior engineers eventually leave because reviewing AI output all day is not a job anyone took for the love of it.

What to do: run the architecture review workflow from Chapter 7 first. The agent produces documentation in fifteen minutes that would have taken a senior engineer a week. Commit it. Reference it from AGENTS.md. Add a brief domain glossary. Now the agent has context.

This converts the kill signal into a manageable risk. It is also a real investment of time - though much less than writing the documentation from scratch.

Signal three: tight coupling.

The codebase has modules that cannot be changed in isolation. Edit one file, three others break. The dependency graph is a hairball. Imports cross layer boundaries casually. Tests for one module require setting up state in another module. The "domain core" is intertwined with the persistence layer, which is in turn intertwined with the HTTP layer.

This blocks safe agent-led work because the agent's blast radius becomes impossible to predict. The agent edits the file it was asked to edit. The agent runs the tests it can find. The tests pass. The change ships. The change breaks a downstream module that has no test, or whose test does not exercise the relevant code path, and the breakage surfaces in production at month-end batch close.

One example: a customer service module that imported from a loan origination module that imported from a credit scoring module that imported from the customer service module. The cycle was introduced fourteen years ago to "save time" and nobody had refactored it because every attempt to break the cycle would have required an eight-week project. The agent did not know about the cycle. The agent made a change. The change rippled in a way that surfaced months later, in production.

What to do: identify the worst coupling first, and break it. Just the worst piece. The eight-week project does not have to be done before any agentic work happens; the worst piece does, because the worst piece is where the agent's mistakes will compound. Once the worst piece is decoupled, the rest of the codebase is no longer immediately dangerous to the agent - it is normal legacy code, manageable with the other practices in this manual.

If the team is unwilling to invest in decoupling, the answer is yellow at best, red at worst, depending on how bad the coupling is.
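Finding the worst coupling usually starts with finding the cycles. The sketch below runs a depth-first search over a hand-written import map. The map mirrors the customer service example above, the `reporting` module is invented for illustration, and in a real codebase the map would have to be extracted by a dependency analysis tool of your choice.

```python
def find_cycle(imports):
    """Return one import cycle as a list of modules, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {module: WHITE for module in imports}
    stack = []

    def visit(module):
        state[module] = GREY
        stack.append(module)
        for dep in imports.get(module, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so the path closes a cycle.
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[module] = BLACK
        return None

    for module in list(imports):
        if state[module] == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


# Hypothetical import map: module -> modules it imports.
IMPORTS = {
    "customer_service": ["loan_origination"],
    "loan_origination": ["credit_scoring"],
    "credit_scoring": ["customer_service"],
    "reporting": ["customer_service"],
}
```

Here `find_cycle(IMPORTS)` returns the path from `customer_service` through `loan_origination` and `credit_scoring` back to `customer_service`. A cycle report like this is a candidate list for "the worst piece", not a ranking. Which cycle to break first remains a judgment call.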

Signal four: scattered business rules.

The same business rule is expressed in three places. Database constraint, service-layer check, UI validation. When the rule changes - and business rules change constantly - the agent updates one and not the others. The schema accepts a value the service rejects, or the UI lets the user enter a value the schema refuses. The system lies to itself.

Banking example: the maximum transfer amount. Defined in a config file for the UI. Hardcoded in a constant in the service. Constrained by a database trigger. The compliance team updates the config because that is the file they know to read. Six weeks later, a customer cannot submit a valid transfer because the service-layer constant is stale. The compliance team's update was overridden by the copy of the rule that everyone forgot to update.

The agent makes this worse, faster. Without a single source of truth, the agent will guess which copy is canonical, and guess wrong.

What to do: identify the duplications and consolidate them. The pattern is "extract the rule to a single source, derive the other expressions from the source." For banking-style validations, this often means moving rules into a typed rules engine or a configuration service that all layers consult. It is weeks of refactor work, not an afternoon. It pays off whether or not you ever bring in agents, because the duplication was already a bug factory.

If the team will not do the consolidation, the agent's contribution to this codebase will be limited to areas that do not touch the duplicated rules. That is a more restricted use of the agent than the team probably wants, but it is honest about the constraint.
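One way to picture "extract the rule to a single source, derive the other expressions from the source", using the maximum transfer amount. Everything here is an assumption made for illustration: the `TransferLimit` class, the field names, the limit value, and the shape of the derived outputs. It is a sketch of the pattern, not a rules engine.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TransferLimit:
    """Hypothetical single definition of the maximum transfer amount."""
    max_amount: Decimal
    currency: str

    def ui_config(self):
        # What the UI would consume instead of keeping its own config value.
        return {"maxTransferAmount": str(self.max_amount), "currency": self.currency}

    def allows(self, amount):
        # What the service layer would call instead of comparing to a constant.
        return Decimal("0") < amount <= self.max_amount

    def db_check_constraint(self, column="amount"):
        # What a migration would emit instead of a hand-maintained trigger.
        return f"CHECK ({column} > 0 AND {column} <= {self.max_amount})"


# Illustrative value only; the real limit belongs to the compliance team.
LIMIT = TransferLimit(max_amount=Decimal("10000.00"), currency="EUR")
```

With this shape, a limit change is one edit to `LIMIT`, and the UI payload, the service check, and the database constraint are regenerated from it. An agent asked to change the rule then has exactly one place to look.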

Signal five: regulatory constraints.

The codebase is under audit requirements where every change must be traceable to a specific approval, a specific reviewer, a specific evidence trail.

The agent cannot fulfill the audit constraint on its own. The agent does not know which changes are material to the audit and which are not. It does not know which reviewers must sign off on which categories of change. It cannot route approvals; it cannot satisfy "two-person rule" requirements; it cannot produce evidence in the form the auditor expects.

This is not fatal, but it requires that the team wrap the agent in workflow. The agent produces the change. The team routes the approval. The team's existing audit machinery handles the evidence trail. The agent's output is one input to a larger process; it is not the process.

Banking example: any change to anti-money-laundering logic requires sign-off from the compliance officer, a representative from legal, and a senior engineer not involved in the implementation. The agent can produce the change. The agent cannot route the approval. The team must wrap the agent in workflow.

If the regulatory constraints are mild - say, all code changes require a code review by a senior engineer, but there is no formal sign-off matrix - the agent fits cleanly into the existing process. If the constraints are heavy - sign-off matrices, evidence requirements, time-stamped approvals, retention obligations - the agent works only as part of a larger system that already handles those constraints.

The kill signal here is "the team has no system for handling the regulatory constraints, and is hoping the agent will somehow address them." It will not. The system has to exist independently. The agent operates inside it.
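To make "the agent operates inside the system" concrete, here is a sketch of a sign-off check that sits outside the agent. The matrix contents, role names, and the `Change` structure are hypothetical. The independence rule is applied to every role for simplicity, whereas the banking example above only requires it of the senior engineer.

```python
from dataclasses import dataclass, field

# Hypothetical sign-off matrix: change category -> roles that must approve.
SIGN_OFF_MATRIX = {
    "aml_logic": {"compliance_officer", "legal", "senior_engineer"},
    "default": {"senior_engineer"},
}


@dataclass
class Change:
    category: str
    implementers: set  # everyone who produced the change, the agent included
    approvals: dict = field(default_factory=dict)  # role -> approver name


def missing_sign_offs(change):
    """Return the roles still needed before the change may merge."""
    required = SIGN_OFF_MATRIX.get(change.category, SIGN_OFF_MATRIX["default"])
    missing = set()
    for role in required:
        approver = change.approvals.get(role)
        # An approval from someone who implemented the change does not count.
        if approver is None or approver in change.implementers:
            missing.add(role)
    return sorted(missing)
```

The agent never appears in `approvals`. It can only appear in `implementers`, which is the point: its output is an input to the process, and the gate decides whether the evidence is complete.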

Signal six: the team cannot evaluate the output.

This is the most important signal. It is also the one teams misunderstand most often.

The signal is structural. The signal fires when the team's existing capability to evaluate work is insufficient for the work the agent will be doing. A junior developer is assigned a story to add a cryptographic verification step to the payment flow. The junior dispatches the agent. The agent produces a cryptography module that uses AES-256 in CBC mode with a hardcoded initialization vector. To a junior who has not done cryptography before, this looks correct. The code compiles. The unit tests pass. The senior cryptographer is on parental leave for three months. The junior ships it. The vulnerability sits in production for six months before a penetration test finds it.

The agent did exactly what was asked. The agent's output looked right. The agent's output was wrong. The team had no one in a position to evaluate the output at the moment the output was produced.

I see this signal misunderstood as "we need more senior engineers." That is one solution. It does not scale on a six-month timeline. The other solution is structural: pair the agent-generated output with a senior reviewer in a workflow that makes the review unavoidable. A hookify rule that blocks the merge until a senior approves. The PR review toolkit running its agents before the human reviewer sees the diff. An AGENTS.md restriction that the agent cannot touch cryptography modules at all until a specific flag is set. Structural answers, not staffing answers.

The signal is also the most important because it is the one that catches the long-tail catastrophic failures. The other signals catch productivity problems and operational frustrations. This one catches the failures that end up in compliance disclosures and incident retrospectives.
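A structural answer can be as small as a path check in the merge pipeline. The sketch below is not the hookify rule or the PR review toolkit mentioned above. It is a generic illustration with invented paths and an invented reviewer group name, showing the shape of "the review is unavoidable".

```python
from fnmatch import fnmatchcase

# Hypothetical restricted areas and the reviewer group each one requires.
RESTRICTED = {
    "src/crypto/*": "crypto_reviewers",
    "src/payments/signing/*": "crypto_reviewers",
}


def blocked_paths(changed_files, approver_groups):
    """Return changed files in a restricted area that lack the required approval."""
    blocked = []
    for path in changed_files:
        for pattern, group in RESTRICTED.items():
            # fnmatchcase's "*" also matches "/", so nested files are covered.
            if fnmatchcase(path, pattern) and group not in approver_groups:
                blocked.append(path)
                break
    return blocked
```

A merge gate would refuse the change whenever `blocked_paths(...)` is non-empty. In the cryptography story above, the junior's change would have stopped here until someone qualified had looked at it, regardless of who was on leave.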

Signal seven: model-context fit.

The codebase is written in a language, framework, or domain where the agent's pretraining is thin. Older COBOL. Esoteric proprietary DSLs. Internal frameworks with no public footprint. Niche scientific libraries that never made it to the open web.

The agent will produce confident output that is syntactically valid but semantically off. It is pattern-matching against the wrong corpus. A new function will look like idiomatic code for some other framework that resembles yours. A class name will follow a convention from a different language. The mistakes are subtle and consistent, which makes them harder to catch than the obvious failures.

Banking example: a custom in-house workflow engine, written in the early 2010s, with no public documentation and a peculiar syntax for state transitions. The agent produced "improvements" that looked like cleanups, but were applying patterns from a different (unrelated) workflow framework the agent had seen more of. The code compiled. The tests, such as they were, passed. The state machine no longer behaved correctly in three specific edge cases the team did not initially test for.

What to do: identify whether your codebase's framework or domain has a substantial public footprint (open source repos the agent could have seen, public documentation, conference talks). If yes, the model-context fit is reasonable. If no, the agent's confidence will outrun its competence. Treat the codebase as if the team-capability signal (six) were active, even when the other signals are clean.

Signal eight: velocity-of-change.

The codebase's framework or dependencies are shifting under it during the period you want to use the agent. React 18 to 19 migration in flight. Spring Boot major-version bump mid-quarter. A core library deprecating its previous public API. A database migration that the team is part-way through.

The agent's pretraining represents a frozen snapshot of the world. If the world has moved since the snapshot, and your codebase is straddling the move, the agent will be confidently wrong about the new version while being confidently right about the old one. You will not always know which.

Concrete failure mode: I shipped a bug into a React component because Claude was using React 18 idioms in code that needed React 19 patterns. The component worked locally because the local development environment still resolved to a transitive React 18 dependency. It broke in production when the build pulled the upgraded React 19 package. The agent was not wrong about React 18. The agent was wrong about the version of React this particular code path needed to support.

What to do: if your codebase is in a framework or dependency migration, slow the agent down on the migration-touched paths. AGENTS.md should name the target version explicitly ("we are migrating from React 18 to React 19 this quarter; new code uses React 19 idioms; old code may still use React 18 idioms but should be updated when touched"). The agent reads the rule and uses the right idioms for the right context. Without that rule, you will discover the bugs in production.
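The failure mode above, where a transitive dependency quietly resolved to the old major version, is the kind of thing a small check can surface before production does. This sketch assumes you have already extracted the resolved versions from your lockfile by some other means. The data below is hand-written, and `legacy-widgets` is an invented package name.

```python
def major(version):
    """Extract the major version from a plain 'X.Y.Z' string."""
    return int(version.split(".")[0])


def version_drift(target_major, resolved):
    """List where a package resolves to a major version other than the target.

    `resolved` maps a location in the dependency tree to the version found there.
    """
    return {
        where: version
        for where, version in resolved.items()
        if major(version) != target_major
    }


# Hypothetical resolution data for one package during a migration.
RESOLVED_REACT = {
    "app": "19.0.0",
    "app > legacy-widgets": "18.3.1",
}
```

With a target major of 19, `version_drift(19, RESOLVED_REACT)` flags the `legacy-widgets` path as still resolving to 18.3.1. Running a check like this in CI complements the AGENTS.md rule: the rule tells the agent which version to write for, and the check tells you whether the environment agrees.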

Eight signals. No tests. No documentation. Tight coupling. Scattered business rules. Regulatory constraints. Team cannot evaluate output. Model-context fit. Velocity-of-change.

Are AI coding agents production-ready?

The signals are not a rejection of the agent. They are the discipline that lets the agent succeed where it can. What remains is to combine them into a decision rule. The decision rule I use is a traffic light: green, yellow, red.

Green: zero or one signal present. Agent-led work is appropriate. A single developer can dispatch the agent, supervise the six-phase loop, review the output, and ship. Normal velocity. Normal review burden. This is what people imagine when they imagine "AI productivity."

Yellow: two or three signals present. Human-led with agent support. Pair-program with the agent rather than dispatch and review. Slower than green. Real productivity benefit relative to no-agent, but the benefit is in quality (the agent catches things the human misses) more than in velocity (the human is still doing most of the work).

Red: four or more signals present. Fix the codebase first. The agent's contribution to this codebase will be net negative until the kill signals are reduced. The temptation will be to use the agent anyway, because the team's leadership has decided to "adopt AI." Resist. The temptation is what causes the failed AI adoption stories that taint the entire field.

Signals present	Color	Mode
0-1	GREEN	Agent-led, normal velocity
2-3	YELLOW	Human-led, agent support
4+	RED	Stop. Fix codebase first.

Signals (count each present):

1. No tests
2. No documentation
3. Tight coupling
4. Scattered rules
5. Regulatory constraints
6. Team cannot evaluate output
7. Model-context fit
8. Velocity-of-change

Figure: The kill signals and the traffic light decision rule. Signal 6 weighs more heavily than the others.

Three banking examples make this concrete.

Example one. A microservice that handles customer profile updates. Tests at 70% coverage. Architectural overview in the repo's README. Decoupled from other services through a well-defined REST API. Business rules expressed once in a typed validation library. Standard SOX controls but no special audit requirements. Team has senior engineers who routinely handle profile-related changes.

Signal count: zero. This is green. Agent-led work is appropriate. The team can confidently dispatch the agent on profile-related stories, run them through the six-phase loop, ship.

Example two. The legacy payments engine. Tests at 20% coverage, mostly happy-path. Architectural documentation exists but is partially stale. Moderate coupling between the payment domain and the customer domain, with documented seams but some leaky abstractions. Business rules mostly centralized but with a few stragglers in legacy code paths. Regulatory constraints are real (PCI, fraud reporting) and the team has the audit machinery to handle them. Mixed team seniority, with one or two seniors who can evaluate payments-specific output.

Signal count: two and a half (no tests is half, partial documentation is half, moderate coupling is one, regulatory constraints managed is zero because the machinery exists, team can evaluate is zero because seniors are available, scattered rules is half). Round up. This is yellow. Human-led with agent support. The team can use the agent for specific tasks - characterization tests, documentation generation, refactoring of well-bounded modules - but not for autonomous feature delivery on the parts of the codebase where the kill signals are present.

Example three. A custom encryption library written by a contractor in 2017, undocumented, tested only by integration tests that exercise downstream features, tightly coupled to the key management infrastructure, encoding compliance rules in code, and currently maintained by a team where nobody specializes in cryptography.

Signal count: five (no tests of the unit, no documentation, tight coupling, regulatory constraints with custom logic, team cannot evaluate). Red. The agent should not touch this codebase. The team should either find or fund cryptography expertise, write the documentation and tests, and then re-evaluate. Or replace the custom library with a vetted standard library, which is what they should probably do regardless. The agent is not the answer for this code; better engineering practice is.

The traffic light is meant to be applied at the project level, not the company level. A single company will have green, yellow, and red codebases simultaneously. The question is not "is this company ready for AI." The question is "which of this company's codebases is ready for AI today, and what would it take to move the others into a different color."

The most common adoption pattern I see is: start with the green codebases to build the team's experience and confidence, invest in the yellow codebases to move them toward green over a quarter or two, and either reform or retire the red codebases over a longer horizon. The agent is not the goal. The goal is shipping software. The agent is a means.

Apply the rubric once to your portfolio of projects. Sort by color. Notice that the picture is more nuanced than "we are doing AI" or "we are not doing AI." Most companies have a mix. The mix tells you the order of operations.

Two refinements come up almost every time I walk a team through the traffic light.

First: signals are not binary, and the count of two-or-three versus four-or-more is a rule of thumb, not a precise threshold. Real codebases have signals at various intensities. A codebase with "tests exist but are partial" has half of signal one, not all of it. A codebase with "documentation exists but is partially stale" has half of signal two. The way I count is to add up the partial values and round to the nearest whole, with a trailing half rounding up. A codebase with four halves is a two. A codebase with three halves and a full signal is a three. A codebase with five halves is a three. The arithmetic is impressionistic; what matters is which band the codebase lands in, not the exact count.
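The counting arithmetic is small enough to write down. This sketch assumes each signal is scored 0, 0.5, or 1 and that a trailing half rounds up, which is what the worked cases above imply. The function names are illustrative. It deliberately does not model the extra weight given to signal six.

```python
import math


def rounded_count(intensities):
    """Sum per-signal intensities (0, 0.5 or 1) and round halves up."""
    total = sum(intensities)
    # Python's built-in round() uses round-half-to-even, so round(2.5) == 2.
    # That would contradict the worked cases, so round half up explicitly.
    return math.floor(total + 0.5)


def color(count):
    """Map a rounded signal count onto the traffic light."""
    if count <= 1:
        return "GREEN"
    if count <= 3:
        return "YELLOW"
    return "RED"


# The worked cases from the text.
four_halves = rounded_count([0.5, 0.5, 0.5, 0.5])               # 2
three_halves_and_full = rounded_count([0.5, 0.5, 0.5, 1])       # 3
five_halves = rounded_count([0.5, 0.5, 0.5, 0.5, 0.5])          # 3
payments_engine = rounded_count([0.5, 0.5, 1, 0.5, 0, 0])       # 3, from 2.5
```

The payments engine from example two sums to 2.5, rounds to 3, and `color(3)` gives YELLOW, matching the assessment above. The rounding direction only matters at the band edges (1.5 and 3.5), which is exactly where the conversation should take over from the arithmetic.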

The refinement is important because teams sometimes look at a codebase, count strict yeses on the eight signals, get a count of one or two, and conclude green when the actual reality is closer to yellow. The codebase that "technically has tests" but where the tests cover 10% of the code at low quality is not a green codebase. The codebase that "technically has documentation" but where the documentation is three years out of date is not a green codebase. Count by intensity, not by presence.

Second: signal six (team cannot evaluate output) is weighted more heavily than the others. A codebase that scores zero on every other signal but yes on signal six is still a red codebase for the relevant work. The team-capability gap dominates the other signals, because it is the one that produces the truly bad outcomes - shipping code the team cannot verify into production where the failure has compounding consequences. The other signals produce friction and slowdown. Signal six produces incidents that show up in compliance disclosures and incident retrospectives.

If you find signal six is present for a particular kind of work, the answer is not "make the codebase greener on the other signals." The answer is "do not do this kind of work with the agent in this team's current configuration." Either build the team's capability, or partner with someone who has it, or restrict the agent away from the relevant areas. The signals are diagnostic; signal six is the one that requires the most specific structural response.

Three more examples, briefer than the earlier ones, cover shapes the first three did not:

Project	Signals	Color	The lesson
Greenfield API service - no code yet, senior team, clear spec	0	Green-plus	Treat the agent's involvement as a first-class architectural decision: set up AGENTS.md before the first commit, establish conventions while they are cheap to change, bake in test rigor from day one. Greenfield is the home turf of agentic coding, and the payoff builds over the project's lifetime.
Internal tool - 20 developers, written 2019, modest tests, low regulatory exposure, not customer-facing	1	Green-leaning	The right first target for a team new to agentic work: realistic enough to teach something, bounded enough that a mistake costs nobody a customer. Internal tools are ideal training grounds; the learnings transfer to the higher-stakes codebases without the higher stakes.
Vendor-customized fork - hasty contractor layer, no tests or docs on the customizations, severe upstream coupling	4	Red	The diagnostic is sometimes about the codebase and sometimes about the strategy. Here it asks whether the custom layer should exist at all - replace the customizations with upstream features, contribute them back, or maintain them deliberately, but do not use the agent to make the wrong work faster.

Try it yourself

This is a self-assessment you can run on any project in under fifteen minutes. Honesty matters more than score; the score is downstream of the conversation it forces.

Pick an in-flight project where the team is debating whether to lean harder on agentic work.
Open the eight kill signals from this chapter side by side with the project.
For each kill signal, ask the team (not just yourself): is this true of our project right now? Mark each TRIGGERED, BORDERLINE, or CLEAR. Use the team's answers, not your impression of them. Borderline counts as half.
Count the TRIGGERED marks, adding a half for each BORDERLINE and rounding a trailing half up. Apply the rubric: zero or one is GREEN, agent-led work at normal velocity. Two or three is YELLOW, human-led with the agent supporting narrow tasks. Four or more is RED, stop autonomous agent work in this codebase until kill signals close.
Before scoring, write your gut answer (GREEN, YELLOW, or RED) on a separate sheet. Compare it to the rubric result. The gap between intuition and rubric is the data point worth keeping; it tells you which signals you systematically over- or under-weight.
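The steps above can be sketched as a short script a team could run in the room. The signal keys and the returned dictionary are illustrative names, not part of any tool. The scoring follows the chapter: TRIGGERED counts one, BORDERLINE counts half, a trailing half rounds up, and a triggered signal six is treated as RED for the relevant work.

```python
import math

SIGNALS = [
    "no_tests", "no_documentation", "tight_coupling", "scattered_rules",
    "regulatory_constraints", "team_cannot_evaluate", "model_context_fit",
    "velocity_of_change",
]
WEIGHT = {"TRIGGERED": 1.0, "BORDERLINE": 0.5, "CLEAR": 0.0}


def assess(marks, gut):
    """Score the team's marks and compare the result with the gut answer."""
    missing = [s for s in SIGNALS if s not in marks]
    if missing:
        raise ValueError(f"unmarked signals: {missing}")
    count = math.floor(sum(WEIGHT[marks[s]] for s in SIGNALS) + 0.5)
    rubric = "GREEN" if count <= 1 else "YELLOW" if count <= 3 else "RED"
    # Signal six dominates: if the team cannot evaluate the output,
    # the relevant work is red whatever the other signals say.
    if marks["team_cannot_evaluate"] == "TRIGGERED":
        rubric = "RED"
    return {"count": count, "rubric": rubric, "gut": gut, "agrees": rubric == gut}
```

The `agrees` field is the part worth keeping over time. A run of results where the gut answer is greener than the rubric suggests the team under-weights some signal, and the marks show which one.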

Run this exercise quarterly, after major incidents, and before any decision to expand agent scope to a new team or codebase. The traffic light makes the discussion concrete and shared. "We are RED because four signals are TRIGGERED" beats "I have a bad feeling about this" in every meeting that follows.

The specific kill signals will need updating as the field matures. The discipline of scoring will not.
