Specification By Example Was Essential Before AI. It’s Twice As Essential Now.

Psst. If your boss won’t invest in training you in Specification By Example, I’m running out-of-hours workshops on May 12 and 16 specifically for self-funding learners. £99 + UK VAT.

The research I’ve done over the last 3 years into AI-assisted programming, including my own closed-loop experiments, has found that one major factor in the likelihood that an LLM will correctly interpret a specification is whether or not examples are included to clarify the requirements.


Completion rates – as measured by acceptance tests passed – improve dramatically, even in a single pass.

In multiple passes, with feedback from acceptance testing, models given examples converge on an impressive completion rate of ~80%, while without examples they tend to just go around in circles, with completion barely improving.


This should come as no surprise, because we saw a similar effect with dev teams before AI coding tools appeared on the scene. Teams who clarify requirements using examples are much more likely to interpret what the customer (or the product manager, or the business analyst) means correctly.

And, as requirements misunderstandings are typically one of the biggest sources of avoidable rework, they save a lot of time and money correcting mistakes that could have been spotted before a line of code was written.

The LLM equivalent means the same outcomes – features delivered as the prompter intended – in fewer passes, using fewer tokens (and burning down fewer proverbial forests).

Done right, specifications with examples can be translated pretty directly into executable tests that can drive the design and development of working software using techniques like Test-Driven Development.

A specification that uses examples – say, for totaling the items in an order – is test-ready. Essentially, it is a test.

    def test_one_item(self):
        product = Product(id=327, price=159.95, stock=7, hold=1)
        order = Order([Item(product=product, quantity=1)])

        total = order.total()

        self.assertEqual(total, 159.95)

When I’ve provided specifications with examples in TDD training workshops, and measured successful interpretation of requirements by students, I’ve found the same trend that my experiments found with LLMs – it roughly doubles, and often hits 100% completion.

When I don’t include examples… Well, I’ve lost count of the number of times students thought that telling the Mars Rover to turn right moved it x+1, or that Roman numerals should be converted into integers. As a trainer, clarifying with examples saves me and my students a lot of time – misunderstandings like these only get more expensive to correct the later I catch them.

But humans can do things LLMs can’t, like understand, reason and learn. So levels of misinterpretation tend to be lower, because we can apply an understanding of the world and judge whether a requirement makes sense. Misinterpretation by AI coding assistants is a higher risk, and therefore the need to clarify is significantly heightened.

As is the need to use language consistently. While some folks claim – presumably because they haven’t tested it in any meaningful way – that LLMs don’t need code to be human-readable, the evidence is clear that they really, really do.

I’ve seen for myself, many times, how completion rates drop significantly when code isn’t clearly and consistently signposted, using language that has a close conceptual correlation to our specifications. If I call it “sales tax” in one interaction, and “VAT” in another, the model struggles to anchor on a name for that variable, often interpreting them as distinct variables in the code.

Specifying with examples gives us an opportunity to establish a shared vocabulary for describing our problem domain, which aids communication between stakeholders, but also between humans and LLMs.


When developers are stuck for a name for a class or a function that makes the intent of that code clear, I encourage them to write that intent in plain English and take inspiration from that. Specifying with examples can help establish a shared language before a line of code’s been written.
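To sketch that idea (the domain and names here are my own illustrative ones, not from any particular codebase):

```python
# Intent, written in plain English first:
# "Apply the customer's discount rate to the order total."

# Before: the reader - human or LLM - has to guess what t and d mean.
def calc(t, d):
    return t - (t * d)

# After: the identifiers are lifted straight out of the plain-English intent.
def apply_customer_discount(order_total, discount_rate):
    return order_total - (order_total * discount_rate)

print(apply_customer_discount(200.0, 0.25))  # 150.0
```

Both functions compute exactly the same thing; only one of them tells you what that thing is.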

The Mythical Agent-Month



One of the more frustrating aspects of this new “AI era” is watching people rediscover things we’ve known about software development for many decades.

Apparently, for example, it really helps if our specifications include tests when we use AI. You don’t say?

Perhaps most annoying, though, are pronouncements that – “with AI”, of course – small teams outperform large teams. Yes indeed. “With AI”, a team of 6 developers can do the work of a team of 26.

Forget the fact that we’ve known that a team of 6 will tend to outperform a team of 26 for at least 50 years. Or the fact that we know precisely why small teams tend to get more done than large teams.

It’s called “Brooks’ Law”, named after Fred P. Brooks, author of the seminal software engineering book The Mythical Man-Month, published in 1975.


Among the topics Brooks expounds on is the effect of team size on the lines of communication required to keep that team in sync.

The number of lines of communication can be counted as imaginary pair-programming configurations. A team of 2 can only pair with each other, so they are continuously synchronising.

A team of 3 – let’s call them Aunt Flo, Farmer Barleymow and Bod (since we’re in 1975) – can be paired 3 ways to keep in sync:

  • Flo & Barleymow
  • Flo & Bod
  • Bod & Barleymow

A team of 4 requires 6 different pairings to keep everyone in sync. A team of 5 requires 10 unique pairings.

With each additional team member, the lines of communication required to keep everyone on the same page increases non-linearly. As the team grows, the communication overhead explodes.
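The arithmetic is simple to sketch: the lines of communication in a team of n is the number of unique pairs, n(n−1)/2.

```python
def lines_of_communication(team_size: int) -> int:
    # Unique pairings in a team of n: n * (n - 1) / 2
    return team_size * (team_size - 1) // 2

for n in (2, 3, 4, 5, 6, 12, 26):
    print(n, "->", lines_of_communication(n))
# 2 -> 1, 3 -> 3, 4 -> 6, 5 -> 10, 6 -> 15, 12 -> 66, 26 -> 325
```

A team of 6 needs 15 lines of communication; a team of 26 needs 325 – more than 20 times the overhead for just over 4 times the people.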


The upshot of all this is that larger teams spend orders of magnitude more time keeping in sync, or – more commonly – dealing with the downstream consequences of not keeping in sync.

Brooks’ Law simply states that adding manpower to a late software project makes it later.

It also means that adding developers to a team has rapidly diminishing returns. A team of 6 will probably get more done than a team of 3, but not twice as much. And a team of 12 may well get less done than a team of 6.

So am I at all surprised by this revelation that – “with AI” – a team of 6 can deliver more value than a team of 26? No. It’s exactly what I’d expect, with or without AI.

And I’ve seen it play out many times in my career: that cohesive, small team who were getting shit done has a dozen more bodies thrown at them in the mistaken belief that more shit will get done faster.

Typically, productivity slows to a crawl as the team attempts to get everybody pushing in roughly the same direction.

When it comes to software economics, team size – and team makeup – has very high leverage. Always has, and always will, regardless of who’s typing the code.

So it’s curious how, in the 51 years since The Mythical Man-Month was published, the standard management response to slipping schedules and escalating costs has been to increase team size. It’s almost as if they haven’t heard of Brooks’ Law.

Now, here’s where this gets interesting. The combinatorial explosion in lines of communication required to keep team members in sync applies whether those team members are human or AI.

Agents working on the same code need to keep as much as possible on the same page about what’s being done to what. When 2 agents work in parallel on their own branches, the longer they go without synchronising, the more their picture of the plan and of the code drifts apart, and the bigger the risk of conflicts grows.

Add a third, and that overhead triples. But add a fourth, and we’re off to the races.

At this point, agents either spend more time waiting in merge queues, or even more time dealing with the consequences of not waiting to merge safely.

Those nice folks who make Cursor helpfully demonstrated what happens when agent “swarms” are let loose on the same code base at the same time.


Here’s the rub: because merges – to happen safely – have to go in single file, there are real hard limits to how many merges can happen in any period of time. And therefore real hard limits to how many people or agents can be changing the same code base at the same time.

And that’s just the coordination required to avoid breaking the build. That’s before we even think about coordinating over requirements, over architecture, over coding standards etc etc etc.

When I run team workshops, the size of the team is a pretty reliable predictor of which ones will successfully complete the exercise. A team of 4 has a big advantage over a team of 12.

Although I’ve yet to test this empirically, what I’m seeing and hearing is that the number of agents working in parallel on the same set of source files might follow a similar trend.

Indeed, the folks I’ve been watching experiment with agentic software development the longest appear to have landed on a single thread of execution as the optimal solution. Even when there are multiple specialised agents involved, they’re taking turns.

It’s quite possible that the limit of the number of agents working in parallel – without very high separation of concerns from the code other agents are working on – is just one.

When we spin too many plates, the result is usually broken plates.

The AI-Ready Software Developer #24 – Specification Is A Conversation


Psst. If your boss won’t invest in training you in Test-Driven Development, I’m running out-of-hours workshops on April 7 and 11 specifically for self-funding learners. £99 + UK VAT.

A sentiment I see often on social media about “AI”-assisted and agentic coding goes something along the lines of “If you’re just translating specs into code, your job is disappearing”.

It sounds reasonable on the surface, if you believe that’s all many programmers were doing. Someone – say, a product manager or an architect – hands the programmer a specification for a feature, and the programmer just “codes it up” like a pharmacist filling a prescription.

But was that ever really a thing?

In reality, most software specifications are incomplete and ambiguous, and often contain logical contradictions that are hard to spot – because of the incompleteness and the ambiguity.

Think of the movie script that contains the line “A huge battle ensues”. The studio asks “How much will that cost?” The producer has absolutely no idea, because that part of the script still needs to be written. The line’s just a placeholder for more work to flesh out the details. And in software development, just as in movie-making, the devil is in the details. That’s where the time and the money goes.

And that’s the reality of software specifications written in natural languages like English, even ones written by programmers. At best, they’re placeholders for conversations. Extreme Programming actually makes that explicit: a “user story” is not a requirements specification. It’s just a placeholder for a chat with the person who wrote it. That’s why it’s a waste of time making them detailed.

And this means that the programmer’s job is not just to “code up the spec”, it’s to figure out what the specification actually means. What exactly happens in this huge battle?

And because specifications are incomplete and ambiguous and often contradictory, this process inevitably has to walk everything back to figuring out what the need being addressed is in the first place.

This is why, for so many years, I drummed it into product managers, requirements analysts and the like to come to the team not with the “what”, and certainly not with the “how”, but with the “why” – what problem are we aiming to solve?

We then work as a team – leveraging our combined expertise in systems design and development and the problem domain – to learn together how to solve the problem through successive iterations of best guesses informed by rapid user feedback.

Now, I could be wrong, but that doesn’t sound like “just translating specs into code” to me.

The promise of this new generation of “AI” coding tools is that non-programmers will be able to iterate working software by themselves. And this is true, to an extent.

Tools like Claude Code and Cursor have proven themselves to be very useful for generating prototypes and proofs-of-concept with no programmer involvement, enabling business analysts, UX designers, product managers, start-up founders and pastry chefs to test simple ideas quickly and cheaply.

The problem is that without the expert judgement of experienced programmers, it doesn’t mature to reliable, scalable, secure software that stands up to real-world production rigours.

So, at some point, you’ll have to pick up the phone to Programmers-R-Us and get some involved if you want your experiment to scale. Have your cheque book ready!

And this is where the problems really start. You now have kind-of, sort-of working software that validates your idea. There’s a fork in the road here. You can either:

  • Have the programmers find and fix all the problems to make the prototype market-ready
  • Use the prototype as the specification and have the programmers build a production-quality version from scratch with, y’know, tests and architecture and stuff

Let’s go through Door #1.

So, the prototype sort-of works, but there are bugs – oh boy, are there bugs! – and security vulnerabilities and performance bottlenecks and scaling blockers and some gone-off cheddar and discarded prams and all the kind of stuff that LLMs will tend to leave in your code if you let them. Which you did, because you can’t tell a switch statement from a discarded pram.

So the programmers need to test the software thoroughly to find all the usage scenarios where the software doesn’t do what it’s supposed to. And there’s the Catch-22. What is it supposed to do? If only there was a complete and precise specification!

We don’t fare much better with Door #2.

Now your programmers have to reverse-engineer the prototype to figure out what it does. What happens when the user leaves that field blank and clicks “Continue”? What happens when the clock strikes midnight and interest needs to be applied to the account? What happens when the vehicle remains stationary for more than 5 minutes? Edge case after edge case after edge case.

You run into the exact same wall. A complete and precise specification for any non-trivial software is made up of thousands of definitive answers to these kinds of questions. Software systems are the most complex machines we’ve ever built. You can’t specify them on the back of a cigarette packet.

Or, to put it the way a customer once put it to me when we had this discussion about a new feature, “Why are you making it so complicated, Jason?”

“I’m not making it complicated. That’s how complicated what you’re asking for is.”

The system has to handle all of these inputs in some meaningful way, otherwise it will break. If the user’s email address isn’t valid, a whole bunch of features won’t work. Are you happy for the system to just not work for those users?

Then, as it almost always does, the conversation turned into a negotiation about the scope and complexity of that feature for the next release. We can always remove one variable now, and add it in a later iteration. It’s an old physics trick (see: Special Relativity).

And this is why requirements specifications are placeholders for conversations. If there’s no conversation, issues will not get addressed by experts who understand them until much, much later when they’re much, much harder to fix.

This is why, as a tech lead, I almost always – when presented with a “requirements specification” as a fait accompli – pressed the “Reset” button and started the conversation again at “Okay, so what seems to be the problem?”

That’s Door #3 – involve programmers early. Because those conversations have to happen whether you like it or not, and the sooner you have them, the sooner you’ll converge on a workable, production-ready solution.

A simple prototype can help you validate your idea before you pick up that phone, but the more design decisions you make before involving experts, the bigger and badder the catch-up’s going to be later. And you might be surprised – when you have a clear end goal in mind – how simple the simplest proof-of-concepts can be.

I’ve been in this game for 34 years, and in that time I’ve seen countless attempts to demarcate this process of building an understanding of not just what the software needs to do, but why it needs to do it.

They all inevitably walk into the same wall. You cannot pay someone else to understand something for you. It’s like paying someone to revise for your exams.

Software specification is necessarily a conversation between people with needs – and, ideally, money – and people who specialise in meeting needs using computers. T’was ever thus, t’will ever be.

Unless, of course, your specification is complete, consistent and mathematically precise.

And a complete, consistent, mathematically precise specification of a computer program is that computer program. That’s what source code is, and that’s why programming languages were invented.

A person who just translates complete, consistent and mathematically precise specifications into executable code is a compiler.

Fans of Spec-Driven Development may be feeling vindicated now, because you believe your specifications are complete, consistent and precise. If you’ve clarified requirements using examples – to me and you, tests – that might push them closer to that level of rigour.

But even if your specs really are completely complete, and completely consistent and completely precise – and even if LLMs were capable of reliably translating such specifications into code (which they’re not) – you need to remember that they will still be full of assumptions about what’s really needed. Basically, a formal specification is just formalised guesswork.

To quote the Second Doctor, “Logic, my dear Zoe, merely enables one to be wrong with authority”.

The real knowledge isn’t in the spec, or in the code, it’s in the feedback we get when people use it in the real world. In this sense, iterating is the ultimate requirements discipline – it’s where most of the real value gets discovered.

So, by all means, spec away. But don’t spec too far ahead – just enough to test an assumption with user feedback from working software. And user feedback’s like code reviews – the more changes we ask for feedback on, the less attention gets paid to most of them.

Research has found that when users give feedback, they often anchor on one or two standout moments – positive or negative – rather than the entire user experience. Psychologists call it the “peak-end rule” – it’s “LGTM” for user eyeballs.

Spec one change to functionality at a time, build it in rapid, tested iterations, ship it through a reliable delivery pipeline, and then go get that focused feedback. Because the spec very probably will need to change.

And if the spec rarely changes, I’d worry that we aren’t listening to our users. Either that or we got incredibly lucky (or clairvoyant).

It’s all one big, ongoing conversation.

The AI-Ready Software Developer #23 – Speaking Clearly


I see a lot of wrongheaded takes on “AI” coding assistants and agents online, but one of the most misguided goes something along the lines of “Code doesn’t need to be easy for humans to understand anymore”.

Presumably, these people have never stopped to ask themselves why they’re called “language models”, but let’s entertain their chain of thought for a moment.

The theory is that humans won’t need to understand the code generated by LLMs (wrong!) because they won’t need to edit the code themselves (wrong!), and so code should be optimised for things like token efficiency, not readability (WRONG!)

Some even go so far as to propose token-optimised programming languages designed by and especially for LLMs.

Working backwards through this sequence of compounding errors quickly unpicks the logic.

First of all, even if LLMs could design a general-purpose, Turing-complete programming language from scratch – which they can’t – it would require huge numbers of examples to train them to work with such a language. It might work for small DSLs, but languages like Java and Go are a whole other ballpark.

Without that large corpus of training data – showing examples of every aspect of the language many, many times – every request to generate code in it would be out-of-distribution. There’s a good reason why Claude, GPT, Gemini etc are better at generating code in popular languages. So, are we going to write those millions of examples? In a non-human-readable programming language?

Secondly, when you look at the tokens in code, how much of it is language syntax – reserved words, semicolons and so on – and how much of it is names that we have chosen for the things in our code to make it more self-descriptive?

“Ah, but Jason, AI-generated code doesn’t need to be self-descriptive, because we don’t need to understand it.”

I have bad news for you on that front, I’m afraid. Research – both my experiments and larger-scale studies – has found that the less clear code is to human readers, the less clear it is to LLMs. When code is obfuscated, models make more mistakes interpreting it.

This should come as no surprise. Language Models – the clue is in the name. What did folks think they were matching patterns in?

Human-readable code is model-optimized code.

LLMs don’t “understand” code like compilers do, they pattern-match on language.

So:

  • Meaningful identifiers = stronger semantic anchors
  • Consistent naming = better pattern recall
  • Verbose clarity = richer signal

And, just as it’s incredibly helpful on a software team if everybody’s speaking the same shared language to describe the problem they’re setting out to solve – including (especially) in the code itself – it’s also essential that we use consistent shared language in our prompts and in the code. If I ask Claude to change the “sales tax” calculation logic, and in the code the function’s called ‘v_tx’, we’re probably not going to get a glowing result.
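A minimal sketch of the difference (the names and rate here are illustrative):

```python
# Hard to connect to a prompt that says "change the sales tax calculation":
def v_tx(a):
    return a * 0.2

# Anchored to the shared vocabulary of the specification - one concept,
# one name, in the spec, the prompts and the code:
SALES_TAX_RATE = 0.2

def sales_tax(net_amount):
    return net_amount * SALES_TAX_RATE
```

The two functions are computationally identical; only one of them gives the model (or a colleague) a semantic anchor to pattern-match on.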

Confusion of Tongues: The Construction of the Tower of Babel, Lucas van Valckenborch, 1594

And then there’s the take that humans won’t need to edit the code. Some even go so far as to say that LLMs will be generating machine code directly, bypassing the human-readable source.

The fact of the matter is that LLMs are unreliable. Very unreliable. I’ve documented the common experience of agents like Claude Code getting stuck in “doom loops”, when a task or problem goes outside their training data distribution, and they just can’t do it. This is a fact of life when we’re working with the technology, and the general consensus – even among the hyperscalers – is that it’s an unfixable problem at any achievable scale.

There will always be a need for the human in the loop, and there will always be a need for that human to understand the code.

If it’s token efficiency you’re after, the smart thing to do is to focus on reducing the probability of mistakes and unnecessary rework. Deliberately obfuscating our code is going to work against us in that respect.

The Mouse That Roared


Once upon a time, in a land far, far away (just north of London Bridge), there was a medium-sized financial services company with a problem. A big problem.

For 9 months, they had been trying to deploy a new release of their core platform, and every single attempt had exploded in their – and their customers’ – faces.

Teams of testers worked round the clock trying to find the bugs. Developers worked 12-hour days, 6 days a week trying to fix them. But the software just kept getting more and more broken. (As did the teams.)

Because, while the developers were fixing the bugs, the changes they were making were introducing new bugs – along with reintroducing some old favourites – that the testers weren’t finding until weeks later, if they found them at all. Users were 2x more likely to find a bug than the testers, and the stability of the platform was so bad that every release ended up being rolled back within a couple of days.

They were beached.

I’ve mentioned before that the DevOps Research & Assessment classifications of software delivery performance need a new level below “Poor”, which I called “Catastrophically Bad”. This is when every deployment fails, problems can’t be fixed – just reverted – and lead times are effectively infinite. Nothing’s getting delivered – well, nothing that sticks – and there’s no light at the end of that tunnel.

In desperation, an army of consultants and coaches was brought in who – at vast expense – set about “fixing” the process. Backlogs were groomed. Daily meetings were attended. Velocities were tracked. Estimates were tweaked. Test plans were optimised. Graphs and charts were prominently displayed.

And at the end of all that, the consultants and coaches confidently concluded, “Yep, you’re basically f***ed.” They exited the building with… well, let’s just say there wasn’t a leaving card. At least now they had the graphs to prove it. Money well spent, I’m sure we’d all agree.


That’s where I came in. I politely declined to engage with management on their “agile process” – as I have for many years now with all my clients.

Instead, I asked the developers to change one thing about how they worked. I took them into a meeting room, and showed them how to write NUnit tests.

In the next couple of weeks, we wrote some system-level smoke tests that they could run a few times a day just to provide some basic assurance that the thing wasn’t totally borked. Which, for a long time, it was.

Then I instructed them to write an automated “unit” test before any change they made to the code. Fixing a bug? Write a test for it first. Making a change to the logic? Write a test for it first. Refactoring the architecture? Write a test for it first. (And if you can’t, then refactor it with careful manual testing until you can, and then write a test for it.)
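As a sketch of what that looks like in practice (a hypothetical bug fix, not code from that client, and in Python rather than the NUnit/C# they were using):

```python
import unittest

# Reported bug (hypothetical): order totals were ignoring quantity.
# Step 1: reproduce the bug as a failing test. Step 2: change the code
# until the test passes. The test then stays in the suite forever,
# so the bug can't silently come back.

def order_total(items):
    """items is a list of (unit_price, quantity) pairs."""
    return sum(unit_price * quantity for unit_price, quantity in items)

class OrderTotalBugTest(unittest.TestCase):
    def test_total_multiplies_unit_price_by_quantity(self):
        self.assertEqual(order_total([(10.0, 3), (2.5, 2)]), 35.0)
```

Run it with `python -m unittest` before and after the fix: red first, then green.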

It took a couple-or-three months, but as the unit test coverage gradually increased – and in the areas of the code that were changing and breaking most often – the software stabilised enough to finally produce a release that stuck; a first foot forward in over a year.

The second foot forward – the next stable release – came a month after that, and a third a month after that. The platform – and the business built around it – was walking again.


This is where interesting things start to happen to the overall development process. With releases now happening fairly incident-free once a month instead of, well, never, the “agile process” adapted around that. Product managers were now thinking about next month more than they were thinking about next year. (Specifically, where they might be working next year.)

Again, I was invited to engage with them, and again I politely declined. I wasn’t done with this innermost feedback loop yet.

Tests covered about 40% of the code – basically, the code that had been changed in those 3 months – but that coverage was growing every day. By the time it got to around 80% six months later, release cycles had accelerated from monthly, still with a significant reliance on manual regression testing, to weekly, with a small amount of manual testing. And the backlog had been paid down to the point where lead times were in the order of a couple of weeks.


And while the test assurance grew, and the architecture became more modular to accommodate it – which significantly reduced the “blast radius” of changes and merge conflicts – lead times shrank even further.

And, again, the wider process adapted to this new reality, shrinking user feedback loops and becoming more willing to gamble and experiment, knowing that they were now placing much smaller bets.

The management process at this point looked dramatically different to how they were doing things when I arrived. And I hadn’t engaged with it at all, beyond recommending a couple of books about goals.

The organisation had evolved from plan-driven, big-bang releases (that always went bang), to iterative, goal-driven releases that enabled experimentation and learning.

And that all happened in response to shrinking lead times, enabled by accelerating release cycles, made possible by very fast, very frequent automated testing.

We didn’t need to ask for extra budget. We didn’t require software licenses. We didn’t need management to engage or give permission. We didn’t ask anybody outside of the dev teams to change anything. We just f***ing did it – this one little change to how the teams wrote code – and the entire organisation adapted around it.

In the years that followed, that innermost feedback loop got tightened even more, with more investment in skills and automation. And, of course, the wider processes changed, as did the makeup of the teams, but not because we demanded they should. The delivery cycle accelerated, and the business adapted around that. They became more goal-oriented, more feedback-driven, and more experimental, and the organisation started to reflect that.

They now release changes multiple times a day, testing one change in the market at a time, and rapidly feeding back what they learn. Or, as you may know it, actual agility.


Sure, it took a few years for them to get there. But consultants and coaches had spent a year effecting no change at all at a very high cost. I spent a few days a month with them for a couple of years.

They made the mistake that almost all agile transformations make: trying to improve software delivery capability without actually engaging with it. You can’t plan or manage your way to daily releases.

You’re looking at the wrong feedback loop – the wrong cog in the clockwork. The big cogs aren’t driving the little cogs. The little cogs are driving the whole system.


Essential Code Craft – Test-Driven Development – April 7 & 11

Something I hear often: “I’d love to go on one of your courses, but my boss won’t pay for it”.

Codemanship’s new Essential Code Craft training workshops are aimed at software developers who are self-funding their professional growth. If your employer won’t invest in you, perhaps you can invest in you. (Businesses and other VAT-registered entities should visit codemanship.co.uk for details of corporate training for teams.)

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Teams have shortened delivery lead times dramatically, while actually improving the reliability of their releases, and lowering the cost of changing software using TDD.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

In this introductory workshop, you will learn how to solve problems working in TDD micro-cycles, rigorously specifying desired software behaviour using tests, writing the simplest solution code to pass those tests, and refactoring safely to enable a simple, clean design to emerge.

The emphasis will be on learning by doing, with succinct practical instruction and guidance from a 25-year+ TDD practitioner and teacher.

You will work in pairs in your chosen programming language, swapping through Continuous Integration into a shared GitHub* repository after each TDD cycle, reinforcing the relationship between TDD, refactoring, version control and CI.

When you register, you’ll be asked to list up to 3 programming languages you’re comfortable working in (e.g., Java, Ruby, Go), and I’ll use that to pair folks up as best I can for the exercise. (Tip: put at least one popular one on the list – we may struggle to find you a pairing partner for Prolog.)

I’ll demonstrate in either Java, Python, JS or C#, depending on which of those is listed most often by registrants.

* Requires an active personal GitHub account

This workshop includes a 15-minute break

Find out more and register here

The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that – at the very latest – catch the boo-boos before they get deployed. Ideally, we catch the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far past a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.
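The idea is simple enough to sketch in a few lines of Python. The function, the mutation and the “test suites” here are all illustrative – real mutation testing tools generate and run mutants automatically:

```python
# A minimal sketch of the mutation testing idea.
# A "mutant" is a copy of the code with one small behavioural change.

def add(a, b):
    return a + b          # original code


def add_mutant(a, b):
    return a - b          # mutation applied: '+' turned into '-'


def weak_suite(fn):
    # A weak test: 0 + 0 and 0 - 0 are both 0, so this passes for
    # both versions and the mutant "survives".
    return fn(0, 0) == 0


def strong_suite(fn):
    # A stronger test: fails for the mutant, so the mutant is "killed".
    return fn(2, 3) == 5


print(weak_suite(add), weak_suite(add_mutant))      # True True  -> survived
print(strong_suite(add), strong_suite(add_mutant))  # True False -> killed
```

A surviving mutant is the signal: `weak_suite` executes the code (it would score 100% line coverage here) but can’t tell the original from the broken version.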

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can be used to incrementally do this on code that’s changed, so you don’t need to run it again and again against code that isn’t changing, making it practical to do within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to checking for issues like maintainability and security is code review. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

  1. It catches problems straight away, before I start pouring more cement on them
  2. It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
  3. It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.
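Here’s a minimal sketch of that idea using Python’s `ast` module – a toy rule that flags functions with too many parameters. The limit and the sample source are illustrative:

```python
# A toy linter rule: walk the abstract syntax tree of some parsed code
# and flag any function with more than 3 parameters.
import ast

SOURCE = """
def ok(a, b):
    pass

def too_many(a, b, c, d, e):
    pass
"""

MAX_PARAMS = 3  # illustrative threshold


def check_param_counts(source):
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):                 # visit every element in the tree
        if isinstance(node, ast.FunctionDef):   # apply the rule to functions
            count = len(node.args.args)
            if count > MAX_PARAMS:
                violations.append((node.name, count))
    return violations


print(check_param_counts(SOURCE))  # [('too_many', 5)]
```

Each rule is just a predicate applied to nodes in the tree, which is why hundreds of them can run in the time it takes to parse the code.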

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique; commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!
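As a sketch of testing the gate, we can plant a known “crime” and check that an automated inspection catches it. The detector below is deliberately naive – real linters like flake8 handle far more cases – but the plant-and-detect loop is the point:

```python
# Plant an unused import (the "crime"), then check the gate catches it.
import ast

CODE_WITH_CRIME = """
import os          # the planted crime: never used below
import math

def circle_area(r):
    return math.pi * r * r
"""


def unused_imports(source):
    tree = ast.parse(source)
    # Names brought in by plain 'import' statements
    imported = {alias.asname or alias.name.split(".")[0]
                for node in ast.walk(tree) if isinstance(node, ast.Import)
                for alias in node.names}
    # Names actually referenced anywhere in the code
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


print(unused_imports(CODE_WITH_CRIME))  # ['os'] -> the gate caught it
```

If the planted crime survives, you’ve found a gap in the gate – exactly the situation I found with LLM-performed reviews.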

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

Read the whole series

Engineering Leaders: Your AI Adoption Doesn’t Start With AI

In the past few months, I’ve been hearing from more and more teams that the use of AI coding tools is being strongly encouraged in their organisations.

I’ve also been hearing that this mandate often comes with high expectations about the productivity gains leaders expect this technology to bring. But this narrative is rapidly giving way to frustration when these gains fail to materialise.

The best data we have shows that a minority of development teams are reporting modest gains – in the order of 5%-15% – in outcomes like delivery lead times and throughput. The rest appear to be experiencing negative impacts, with lead times growing and the stability of releases getting worse.

The 2025 DevOps Research & Assessment State of AI-assisted Software Development report makes it clear that the teams reporting gains were already high-performing or elite by DORA’s classification, releasing frequently, with short lead times and with far fewer fires in production to put out.

As the report puts it, this is not about tools or technology – and certainly not about AI. It’s about the engineering capability of the team and the surrounding organisation.

It’s about the system.

Teams who design, test, review, refactor, merge and release in bigger batches are overwhelmed by what DORA describes as “downstream chaos” when AI code generation makes those batches even bigger. Queues and delays get longer, and more problems leak into releases.

Teams who design, test, review, refactor, merge and release continuously in small batches tend to get a boost from AI.

In this respect, the team’s ranking within those DORA performance classifications is a reasonably good predictor of the impact on outcomes when AI coding assistants are introduced.

The DORA website helpfully has a “quick check” diagnostic questionnaire that can give you a sense of where your team sits in their performance bands.

Image

(Answer as accurately as you can. Perception and aspiration aren’t capability.)

The overall result is usefully colour-coded. Red is bad, blue is good. Average is Meh. Yep, Meh is a colour.

Image

If your team’s overall performance is in the purple or red, AI code generation’s likely to make things worse.

If your team’s performance is comfortably in the blue, they may well get a little boost. (You can abandon any hopes of 2x, 5x or 10x productivity gains. At the level of team outcomes, that’s pure fiction.)

The upshot of all this is that before you even think about attaching a code-generating firehose to your development process, you need to make sure the team’s already performing at a blue level.

If they’re not, then they’ll need to shrink their batch sizes – take smaller steps, basically – and accelerate their design, test, review, refactor and merge feedback loops.

Before you adopt AI, you need to be AI-ready.

Many teams go in the opposite direction, tackling whole features in a single step – specifying everything, letting the AI generate all the code, testing it after-the-fact, reviewing the code in larger change-sets (“LGTM”), doing large-scale refactorings using AI, and integrating the whole shebang in one big bucketful of changes.

Heavy AI users like Microsoft and Amazon Web Services have kindly been giving us a large-scale demonstration of where that leads – more bugs, more outages, and significant reputational damage.

A smaller percentage of teams are learning that what worked well before AI works even better with it. Micro-iterative practices like Test-Driven Development, Continuous Integration, Continuous Inspection, and real refactoring (one small change at a time) are not just compatible with AI-assisted development, they’re essential for avoiding the “downstream chaos” DORA finds in the purple-to-red teams.

And while many focus on the automation aspects of Continuous Delivery – and a lot of automation is required to accelerate the feedback loops – by far the biggest barrier to pushing teams into the blue is skills.

Yes. SKILLS.

Skills that most developers, regardless of their level of experience, don’t have. The vast majority of developers have never even seen practices like TDD, refactoring and CI being performed for real.

That’s partly because real practitioners are pretty rare, so they’re unlikely to bump into one. But much of it is down to the famously steep learning curves. TDD, for example, takes months of regular practice before you can use it on real production systems.

And, as someone who’s been practising TDD and teaching it for more than 25 years, I know it requires ongoing mindful practice to maintain the habits that make it work. Use it or lose it!

An experienced guide can be incredibly valuable in that journey. It’s unrealistic to expect developers new to these practices to figure it all out for themselves.

Maybe you’re lucky enough to have some of the 1% of software developers – yes, it really is that few – who can actually do this stuff for real. Or even one of the 0.1% who have had a lot of experience helping developers learn it. (Just because they can do it, it doesn’t necessarily follow that they can teach it.)

This is why companies like mine exist. With high-quality training and mentoring from someone who not only has many thousands of hours of practice, but also thousands of hours of experience teaching these skills, the journey can be rapidly accelerated.

I made all the mistakes so that you don’t have to.

And now for the good news: when you build this development capability, the speed-ups in release cycles and lead times, while reliability actually improves, happen whether you’re using AI or not.

The AI-Ready Software Developer #21 – Stuck In A “Doom Loop”? Drop A Gear

Image

I find it helpful to visualise agentic workflows as sequences of dice throws. Take the now-popular “Ralph Wiggum loop”. You want 7. The agent throws the dice, and it’s 5. That fails your test for 7, so the agent reverts the code, flushes the context, and tries again, repeating this cycle until it throws the 7 you wanted.

Different numbers have different probabilities, and therefore might take more or fewer throws to achieve. Throwing 7 might take 6 throws on average. Throwing 2 or 12 might take 36. It depends on the probability distribution associated with 2x six-sided dice.
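A quick simulation bears this out. The trial count and seed are arbitrary:

```python
# Simulate the dice analogy: on average, how many throws of 2d6 does it
# take to hit a given target total? Expect roughly 6 for a 7, 36 for a 12.
import random


def average_throws_until(target, trials=20000, seed=42):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        throws = 1
        while rng.randint(1, 6) + rng.randint(1, 6) != target:
            throws += 1
        total += throws
    return total / trials


print(round(average_throws_until(7)))   # about 6
print(round(average_throws_until(12)))  # about 36
```

Note there’s no row in the distribution for 13 – the loop for an impossible target never terminates, which is the doom loop in miniature.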

Image

The outputs of Large Language Models also depend on probability distributions, and some outputs will take more attempts to achieve than others, depending on the input and on the distribution of data “learned” by the model.

Image
A toy example of the distribution of probabilities of next-token predictions by an LLM. Note how there’s a very obvious choice because that exact token sequence appears often in the training data. The model’s prediction will be confident.

On average, a more complex context – one that asks the model to match on more complex patterns – will tend to require more improbable pattern matches and produce less confident predictions. This is “hallucination” territory. Little Ralph goes round and round in circles trying in vain to throw 13.

I call these “doom loops”.

An adapted agentic workflow recognises when a step has gone outside the model’s distribution (e.g., by the number of failed iterations) and “drops a gear” – meaning it switches back into read-only planning mode and asks the model to break that step down into simpler contributing steps that are less improbable – more in-distribution.

(It’s a phrase I’ve been using informally for years in pairing sessions when a problem’s “gradient” turns out to be steeper than we can handle in a single step. It harks back to an idea in Kent Beck’s book Test-Driven Development By Example, where he talks about how we might want the teeth on a pulley to be smaller the heavier the load. The fiddlier the problem, the smaller the steps.)
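The shape of that control loop can be sketched like this. `attempt_step`, `plan_substeps` and `MAX_ATTEMPTS` are assumptions standing in for whatever your agent framework actually provides – this is the shape, not an implementation:

```python
MAX_ATTEMPTS = 3  # after this many failed passes, assume we're out of distribution


def run_step(step, attempt_step, plan_substeps):
    """Try a step; if it keeps failing, drop a gear and re-plan it as smaller steps.

    attempt_step(step) -> bool: one generate-and-test pass (assumed interface).
    plan_substeps(step) -> list: read-only planning into simpler sub-steps.
    """
    for _ in range(MAX_ATTEMPTS):
        if attempt_step(step):
            return True
    substeps = plan_substeps(step)      # drop a gear: plan, don't generate
    if not substeps:
        return False                    # can't break it down further - a human steps in
    return all(run_step(s, attempt_step, plan_substeps) for s in substeps)


# A fake "agent" for illustration: it can only manage small steps,
# so the big step only succeeds once it's been broken down twice.
def attempt(step):
    return len(step) <= 2


def plan(step):
    mid = len(step) // 2
    return [step[:mid], step[mid:]] if len(step) > 1 else []


print(run_step("abcdefgh", attempt, plan))  # True
```

The recursion is the gearbox: each re-plan halves the “load” until every sub-step is comfortably in-distribution, or until planning runs dry and a human has to take over.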

Image

If you think about it, this is exactly what “multi-step reasoning” is – tackling a more complex problem as a sequence of simpler problems. You can try this experiment to see the impact it has on the accuracy of outputs:

If you’re familiar with the Mars Rover programming exercise, where the surface of Mars is a 2D grid and you can drive a rover over it by sending it sequences of commands (F – forward, B – back, L – left, R – right), you can task a model like GPT-5.x with solving problems like “Given the rover is at (5,9) facing North and it’s told to go ‘RFFRBBBLB’, what is its final position and direction?”
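The exercise is simple enough to implement directly, which makes it easy to check the model’s answers. A minimal sketch, assuming North is +y and East is +x (matching the worked reasoning in this section), and that B moves opposite to the facing direction:

```python
# A minimal Mars Rover command interpreter for checking the model's answers.
HEADINGS = ["N", "E", "S", "W"]                              # clockwise order
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def drive(x, y, facing, commands):
    for c in commands:
        i = HEADINGS.index(facing)
        if c == "R":
            facing = HEADINGS[(i + 1) % 4]     # rotate clockwise
        elif c == "L":
            facing = HEADINGS[(i - 1) % 4]     # rotate anti-clockwise
        else:
            dx, dy = MOVES[facing]
            if c == "B":
                dx, dy = -dx, -dy              # back = opposite of facing
            x, y = x + dx, y + dy
    return x, y, facing


print(drive(5, 9, "N", "RFFRBBBLB"))  # (6, 12, 'E')
```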

Image

In one interaction, instruct the model additionally “Don’t show your working. Respond with the final answer only.” Odds are it’ll get it wrong every time.

Then let it “reason”. It’ll likely get it right first or second time, and it does this by breaking the pattern down:

“R turns the rover clockwise, so it’s at (5,9) facing East.

FF moves the rover forward 2 squares in the +x direction, so it’s at (7,9) facing East.”

etc etc.

When little Ralph tries to throw 13, the model just can’t do it. It’s OOD. When the agent drops a gear, it might try to throw 6+7, both of which are comfortably in-distribution.

Final note: sometimes a problem is just not in the data. It doesn’t matter how many times we throw the dice, or how we break it down – the model just can’t do it. Be ready to step in.

Read the full series here

A New DORA Performance Level – Catastrophically Bad

DORA (DevOps Research & Assessment) has 4 broad levels for dev team performance: Elite, High, Medium, and Low.

A Low-Performing team deploys less than once a month, has lead times for changes > 1 month, sees as many as half their deployments go boom, and takes more than a week to fix them.

An Elite team deploys changes multiple times a day, has lead times typically of < 1 day, failure rates < 15% (fewer than 1 in 8 deployments go boom), and can fix failed releases in under an hour.

I picture a Low-Performing team walking a tightrope between two mountain peaks. It’s a long way to safety (working, shippable code), a long way down if they fall, and a long climb back up to try again.

I picture an Elite team as walking the same length tightrope, tied to wooden posts a few feet apart, just 3 feet off the ground. Safety’s never far away, and a fall’s no big deal. They can quickly recover.

I would like to propose a 5th performance level, one that I’ve seen for real more than once: Catastrophically Bad.

At this level, the tightrope has snapped.

I’ve seen teams stuck in a death spiral where no changes can be deployed because every release goes boom. One example was a financial services company here in London who’d been running themselves ragged trying to stabilise a release for almost a year.

Every deployment had to be rolled back, and every deployment cost them high six-figures in client compensation, not to mention loss of reputation.

You know the drill: testing was done 100% manually and took weeks. While the devs were fixing the bugs testing found, they were introducing all-new bugs (and reintroducing a few old favourites).

A change failure rate of 100% and a lead time of – effectively – infinity.

How does a Catastrophically Bad development process that delivers nothing turn into, at the very least, a Low-Performing process that delivers something? (And yes, this is the mythical “hyper-productivity” the Scrum folks told you about – and no, Scrum isn’t the answer.)

What we did with my client started with 2 fundamental changes:

  • Fast-running automated “smoke” tests, selected by analysing which features broke most often
  • CI traffic lights – a wrapper around version control that forced check-ins to go in single file, and blocked check-ins and check-outs of changes when the light wasn’t “green” (meaning no check-in in progress and the build working).
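The traffic light can be sketched as a simple state machine. The class and the messages here are illustrative, not the actual wrapper we built:

```python
# A sketch of the "CI traffic light": check-ins go in single file, and
# nothing goes in or out while a check-in is in progress or the build is red.
class TrafficLight:

    def __init__(self):
        self.green = True   # green = no check-in in progress and build working

    def try_check_in(self, changes, build):
        if not self.green:
            return "blocked: light is red - wait, or fix the build"
        self.green = False                 # red: a check-in is now in progress
        if build(changes):                 # run the build + smoke tests
            self.green = True              # back to green for the next check-in
            return "checked in"
        return "build broken: light stays red until it's fixed"


light = TrafficLight()
print(light.try_check_in("fix #1", build=lambda c: True))   # checked in
print(light.try_check_in("fix #2", build=lambda c: False))  # build broken: ...
print(light.try_check_in("fix #3", build=lambda c: True))   # blocked: ...
```

The point isn’t the code – it’s that a broken build physically blocks everybody, so fixing it becomes the team’s top priority instead of somebody else’s problem.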

It took 12 weeks to go from Catastrophically Bad to Medium-Performing. (Pro tip for new leaders and process improvement consultants – poorly-performing teams are a gift because they have such low-hanging fruit).

You can build from here. In this case, by showing teams how to safely change legacy code in ways that add more fast-running regression tests and gradually simplify and modularise the parts of the code that are changing the most.

(The title image is taken from a Catastrophically Bad agentic “team”)