Yeah, About Your “Precise” Specification…

Increasingly, I see people who’ve been struggling with LLM-based coding assistants reaching the conclusion that what’s needed is “better” specifications.

If you were to ask me what might make a specification “better”, I’d probably say:

  • Less ambiguous – less open to multiple valid interpretations
  • More complete – fewer gaps where expected system behaviour and other properties are left undefined
  • More consistent – fewer contradictions (e.g., Requirement #1: “Users can opt in to notifications”, Requirement #77: “By default, notifications must be on”)

Of these three factors, ambiguity is top of my list. It can mask contradictions and paper over gaps. When requirements are ambiguous, that takes us into physicist Wolfgang Pauli’s “not even wrong” territory.

It’s hard to know what the software’s supposed to do, and hard to know when it’s not doing it. This is why so many testers tell me that a large part of their job is figuring out what the requirements were in the first place. (Pro tip: bring them into those discussions.)

An ideal software specification therefore has no ambiguity. It’s not open to multiple interpretations. This enables us to spot gaps and inconsistencies more easily. But more importantly, it enables us to know with certainty when the software doesn’t conform to the specification.

We can never know, of course, that it always conforms to the specification. That would require infinite testing in most cases. But it only needs one test to refute it – and that requires the specification to be refutable.

So I guess when I talk about a “better” specification, I’m talking mostly about refutability.

“Precise”. You Keep Using That Word.

Refutability requires precision. And this is where our natural languages let us down. Try as we might to articulate rules in “precise English” or “precise French” or “precise Cantonese”, these languages haven’t evolved for precision.

Language entropy – the tendency of natural language statements to have multiple valid interpretations, and therefore uncertain meaning – is pretty inescapable.

For completely unambiguous statements, we need a formal language – a language with precisely-defined syntax – with formal semantics that precisely define how that syntax is to be interpreted. Statements made in such a language can have one – and only one – interpretation, and it's possible to know with certainty when an example contradicts them.

Computer programmers are very familiar with these formal systems. Programming languages are formal languages, and compilers and interpreters endow them with formal semantics – with precise meaning.

I half-joke, when product managers and software designers ask me where they can find good examples of complete software specifications, that they should look on GitHub. It's full of them.

It's only half a joke, because it's literally true that program source code is a program specification, not an actual program. It expresses all of the rules of a program in a formal language; those rules are then interpreted into lower-level formal languages like x86 assembly language or machine code. These in turn are interpreted into even lower-level representations, until eventually they're interpreted by the machine itself – the ultimate arbiter of meaning.

It’s turtles all the way down, and given a specific stack of turtles, meaning – hardware failures notwithstanding – is completely predictable. The same source code, compiled by the same compiler, executed by the same CPU, will produce the same observable behaviour.

So we have a specification that’s refutable and predictable. The same rules will produce the same behaviour every time, and we can know with certainty when examples break the rules.

But, of course, a computer program does what it does. It will always conform to its program specification, expressed in Java or Python or – okay, maybe not JavaScript – or Go. That doesn’t mean it’s the right program.

So we need to take a step back from the program. Sure, it does what it does. But what is it supposed to do?

Remember those turtles? Well, it would be a mistake to believe the program source code is at the top of the stack. To meaningfully test whether we wrote the right program, we need another formal specification (and I use those words advisedly) that describes the desired properties of the program without being part of the program itself.

Let's think of a simple example. If I have a program that withdraws money from a bank account, and my customer and I agree that withdrawal amounts must be greater than zero and that the account must have sufficient funds to cover them, we might specify that withdrawals should only happen when both conditions are true.

In informal language, a precondition of any withdrawal is that the amount must be greater than zero, and the balance must be greater than or equal to the amount being withdrawn. If the withdraw function is invoked when that condition isn’t met, the program is wrong.

To remove any ambiguity, I'd want to express that in a formal language. I could do it in a programming language: I could insert an assertion at the start of the withdraw function that checks the condition and, e.g., throws an exception if it's not satisfied, or halts execution during testing and reports an error.

e.g. in Python “defensive programming” (we can talk in another blog post about what terrible UX design this is – yes, UX design. In the code. Bazinga!)

def withdraw(self, amount):
    if amount <= 0:
        raise InvalidAmountError()
    if self.balance < amount:
        raise InsufficientFundsError()
    self.balance -= amount

e.g., using inline assertions that are checked during testing

def withdraw(self, amount):
    assert amount > 0
    assert self.balance >= amount
    self.balance -= amount

These approaches are fine, but they’re not a great way to establish what those rules are with our customer in the first place. Are we going to sit down with them and start writing code to capture the requirements?

In the late 1980s, formal languages started to appear specifically with the aim of creating precise external specifications of correct behaviour that aren’t part of the code at all.

The first I used was Z. Z was a notation founded on predicate logic and set theory. Here’s an artist’s impression of a Z specification that ChatGPT hallucinated for me.

Image

Not the most customer-friendly of notations. Other formal specification languages attempted to be more “business-friendly”, like the Object Constraint Language:

context BankAccount::withdraw(amount: Real)
pre: amount > 0
pre: balance >= amount
post: balance = balance@pre - amount

These OCL constraints were designed to extend UML models to make their meaning more precise. I remember being told that it was designed to be used by business people. I found that naivety endearing.
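For comparison, the same constraints can be sketched outside any modelling tool, in plain Python, using a home-rolled design-by-contract decorator. To be clear, this decorator is my own invention for illustration – it's not OCL tooling or any particular contracts library:

```python
import functools

def contract(pre=None, post=None):
    """Minimal design-by-contract decorator (an invented sketch, not a real
    library): check the precondition before the call, and the postcondition
    - with access to the old state, like OCL's balance@pre - after it."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(self, amount):  # fixed (self, amount) signature, for simplicity
            if pre is not None:
                assert pre(self, amount), "precondition violated"
            old_balance = self.balance  # capture balance@pre
            result = fn(self, amount)
            if post is not None:
                assert post(self, amount, old_balance), "postcondition violated"
            return result
        return wrapper
    return decorate

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    @contract(
        pre=lambda self, amount: amount > 0 and self.balance >= amount,
        post=lambda self, amount, old: self.balance == old - amount,
    )
    def withdraw(self, amount):
        self.balance -= amount
```

Run with assertions enabled (Python's default), any call that violates the precondition fails immediately – the contract is refutable.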

To cut a long story short, while formal specification certainly found a home in the niche of high-integrity and critical systems engineering, that same snow never settled on the plains of business and requirements analysis and everyday software development. We were expecting business stakeholders to become programmers. That rarely works out.

But for a time, I used formal specifications – luckily, my customers were electronics engineers and not marketing executives, so most already had programming experience.

Tests As Specifications

We’d firm up a specification using a combination of Z and the Object Modeling Technique (UML wasn’t a thing then) describing precisely what a feature or a function needed to do.

Then I’d analyse that specification and choose test examples.

BankAccount::withdraw

Example #1: invalid amount
    amount = 0
    Outcome: throws InvalidAmountError

Example #2: valid amount and sufficient funds
    amount = 50.0
    balance = 50.0
    Outcome: balance = 0.0

Example #3: insufficient funds
    amount = 50.01
    balance = 50.0
    Outcome: throws InsufficientFundsError
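Those examples translate almost mechanically into executable tests. Here's a sketch using Python's unittest, with the defensive-programming version of the account reproduced so the example is self-contained:

```python
import unittest

class InvalidAmountError(Exception): pass
class InsufficientFundsError(Exception): pass

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise InvalidAmountError()
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount

class WithdrawTests(unittest.TestCase):
    def test_example_1_invalid_amount(self):
        with self.assertRaises(InvalidAmountError):
            BankAccount(balance=50.0).withdraw(0)

    def test_example_2_valid_amount_and_sufficient_funds(self):
        account = BankAccount(balance=50.0)
        account.withdraw(50.0)
        self.assertEqual(account.balance, 0.0)

    def test_example_3_insufficient_funds(self):
        with self.assertRaises(InsufficientFundsError):
            BankAccount(balance=50.0).withdraw(50.01)

if __name__ == "__main__":
    unittest.main()
```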

It turned out that business stakeholders can much more easily understand specific examples than general rules expressed in formal languages. So we flipped the script, and explored examples first, and then generalised them to a formal specification.

It was when I first started learning about “test-first design”, one of the practices of the earliest documented versions of Extreme Programming, that the lightbulb moment came.

If we’ve got tests, do we need the formal specifications at all? Maybe we could cut out the middle-man and go straight to the tests?

This often works well – exploring the precise meaning of requirements using test examples – with non-programming stakeholders.

And many people are discovering that including test examples in our prompts helps LLMs match more accurately by reducing the search space of code patterns. It turns out that models are trained on code samples that have been paired with usage examples (tests, basically), so including examples in the prompt gives them more to match on.

So, if you were to ask me what might make a specification for LLM code generation “better”, I’d definitely say “tests”. (And there was you thinking it was the LLM’s job to dream up tests.)

Visualising The Gaps

That helps reduce ambiguity and the risk of misinterpretation, but what of completeness and consistency?

This is where some kind of generalisation is really needed, but it doesn't have to take us down the Z or OCL road. What we really need is a way to visualise the state space of the problem.

One simple technique I’ve used to good effect is a decision table. This helps me to see how the rules of a function or an action map to different outcomes.

Image

Here, I’ve laid out all the possible combinations of conditions and mapped them to specific outcomes. There’s one simplification we can make – if the amount isn’t greater than zero, we don’t care if the account has sufficient funds.

Image

That maps exactly on to my three original test cases, so I’m confident they’re a complete description of this withdraw function.

Mapping it out like this and exploring test cases encourages us to clarify exactly what the customer expects to happen. When the amount is greater than the balance, exactly what should the software do? It forces us and our customers to consider details that probably wouldn’t have come up otherwise.
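A decision table like this can also be captured directly as data and checked row by row. Here's a sketch in Python – the rows are my reconstruction of the table from the three test cases, since the original is an image:

```python
class InvalidAmountError(Exception): pass
class InsufficientFundsError(Exception): pass

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise InvalidAmountError()
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount

# One row per combination of conditions; an expected error of None means
# "the withdrawal succeeds". Reconstructed from the three test cases.
DECISION_TABLE = [
    # (amount, balance, expected error)
    (0.0,   50.0, InvalidAmountError),      # amount not > 0 (funds don't matter)
    (50.0,  50.0, None),                    # amount > 0, sufficient funds
    (50.01, 50.0, InsufficientFundsError),  # amount > 0, insufficient funds
]

def check_row(amount, balance, expected_error):
    account = BankAccount(balance)
    try:
        account.withdraw(amount)
    except Exception as raised:
        return type(raised) is expected_error
    return expected_error is None and account.balance == balance - amount

assert all(check_row(*row) for row in DECISION_TABLE)
```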

Other tools we can use to visualise system behaviour and rules include Venn diagrams (have we tested every part of the diagram?), state transition diagrams and state transition tables (have we tested every transition from every state?), logic flow diagrams (have we tested every branch and every path?), and good old-fashioned truth tables – the top half of a decision table.

Isn’t This Testing?

“But, Jason, this sounds awfully like what testers do!”

Yup 🙂

Tests are to specifications what experiments are to hypotheses.

If I say “It should throw an error when the account holder tries to withdraw more than their balance” before any code’s been written to do that, I’m specifying what should happen. Hypothesis.

If I try to withdraw £100 from an account with a balance of £99, then that's a test of whether the software satisfies its specification. It's a test of what does happen. Experiment.

This is why I strongly recommend teams bring testing experts into requirements discussions. You’re far more likely to get a complete specification when someone in the room is thinking “Ah, but what if A and B, but not C?”

You can, of course, learn to think more like a tester. I did, so it can’t be that hard.

But there’s really no substitute for someone with deep and wide testing experience in the room.

If a function or a feature is straightforward, we can probably figure out what test cases we’d need to cover in our heads. My initial guesses at tests for the withdraw function were pretty good, it turned out.

But when they’re not straightforward, or when the scenario’s high risk, I’ve found these techniques very valuable.

As a bottom line, I’ve found that tests of some kind are table stakes. They’re the least I’ll include in my specification.

Shared Language

Another thing I’ve found that helps to minimise misinterpretations is establishing a shared model of the concepts we’re talking about in our specifications.

In a training exercise I run often, pairs are asked to use Test-Driven Development to create a simple online retail program. They’re given a set of requirements expressed in plain English and the idea is that they agree tests with the customer (one of them plays that role) to pin down what they think the requirements mean.

e.g.

Add item – add an item to an order. An order item has a product and a quantity. There must be sufficient stock of that product to fulfil the order

Total including shipping – calculate the total amount payable for the order, including shipping to the address

Confirm – when an order is confirmed, the stock levels of every product in the items are adjusted by the item quantity, and then the order is added to the sales history.

A couple of years back, I changed the exercise by giving them a “walking skeleton” – essentially a “Hello, world!” project for their tech stack with a dummy test and a CI build script set up and ready to go – to get them started.

And in that project I added a bare-bones domain model – just classes, fields and relationships – that modeled the concepts used in the requirements.

In UML, it looked something like this.

Image
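Since the diagram is an image, here's roughly what that kind of bare-bones domain model looks like in code – just classes, fields and relationships, no behaviour. The class and field names are my guesses based on the requirements above, not the actual exercise materials:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:
    name: str
    price: float
    stock: int  # units currently in stock

@dataclass
class OrderItem:
    product: Product   # "An order item has a product..."
    quantity: int      # "...and a quantity"

@dataclass
class Order:
    items: List[OrderItem] = field(default_factory=list)
    shipping_address: str = ""  # "including shipping to the address"

@dataclass
class SalesHistory:
    orders: List[Order] = field(default_factory=list)  # confirmed orders
```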

Before I added a domain model, pairs would come up with distinctly different interpretations of the requirements.

With the addition of a domain model, 90% of pairs would land on pretty much the same interpretation. Such is the power of a shared conceptual model of what it is we’re actually talking about.

It doesn't need to be code or a UML diagram – but some expression of the concepts in our requirements and how they're related, in a form we can hopefully all understand, evidently cuts out a lot of misunderstandings.

Precision In UX & UI Design

And, of course, if we’re trying to describe a user interface, pictures can really help there. Wireframes and mock-ups are great, but if we’re trying to describe dynamic behaviour – what happens when I click that button? – I highly recommend storyboards.

A storyboard is just a sequence of snapshots of the UI in specific test scenarios that illustrates what happens with each user interaction. Here’s a great example.

Image
Source: Annie Hay Design https://anniehaydesign.weebly.com/app-design/storyboarding

It’s another way of visualising a test case, just from the user’s perspective. In that sense, it can be a powerful tool in user experience design, helping stakeholders to come to a shared understanding of the user’s journey, and potentially revealing problems with the design early.

Precision != BDUF

Before anybody jumps in with accusations of Big Design Up-Front (BDUF), a quick reminder that I would never suggest trying to specify everything, then implement it, then test it, then merge and release it in one pass. I trust you know me better than that.

When clarity’s needed, I have a pretty well-stocked toolbox of techniques for providing it, as and when it’s needed in a highly iterative process delivering working software in thin slices – one feature at a time, one scenario at a time, one outcome at a time, and one example at a time. Solving one problem at a time in tight feedback loops.

Taking small steps with continuous feedback and opportunities to steer is highly compatible with working with LLM-based coding assistants. It’s actually kind of essential, really. Folks talking about specifying e.g., a whole feature “precisely” and then leaving the agent(s) to get on with it are… Well, you probably know what I think. I’ve seen those trains come off the rails so many times.

And with each step, I stay on-task. I’ll rarely, for example, model domain concepts that aren’t involved in the test cases I’m working on. I’m not one of these “First, I model ALL THE THINGS, then I think about the user’s goals” guys.

And using tests as specifications goes hand-in-glove with a test-driven approach to development, which you may have heard I’m quite partial to.

Believe it or not, agility and precision are completely compatible. How precise you’re being, and the size of the steps you’re taking that end in user feedback from working software, are orthogonal concerns. If you look in the original XP books, you’ll even find – gasp! – UML diagrams.

Hopefully you get some ideas about the kinds of things we can include in a specification to make it more precise, more complete and more consistent.

But at the very least, you might begin to rethink just how good your current specifications actually are.

Prompts Aren’t Code and LLMs Aren’t Compilers

One final thought. The formal systems of computer programming – programming languages, compilers, machine code and so on – and the “turtles” in an LLM-based stack are very different.

Prompts – even expressed in formal languages – aren’t code, and LLMs aren’t compilers. They will rarely produce the exact same output given the exact same input. It’s a category mistake to believe otherwise.

This means that no matter how precise our inputs are, they will not be processed precisely or predictably. Expect surprises.

But less ambiguity will – and I’ve tested this a lot – reduce the number of surprises. And refutability gives us a way to spot the brown M&Ms in the output more easily.

It’s easier to know when the model got it wrong.

Do You Know Where Your Load-Bearing Code Is?

Do you know where your load-bearing code is?

90% of the time, TDD is enough to assure that everyday code is sufficiently reliable.

But some code really, really needs to work. I call it “load-bearing code”, and it’s rare to find a software product or system that doesn’t have any code that’s critical to its users in some way.

In my 3-day Code Craft training workshop, we go beyond Test-Driven Development to look at a couple of more advanced testing techniques that can help us make sure that code that really, really needs to work in all likelihood does.

It raises the question, how do we know which parts of our code are load-bearing, and therefore might warrant going that extra mile?

An obvious indicator is critical paths. If a feature or a usage scenario is a big deal for users and/or for the business, tracing which code lies on the execution path for it can lead us to code that may require higher assurance.

Some teams work with stakeholders to assess risk for usage scenarios, perhaps capturing those assessments alongside the examples they use to drive the design (e.g., in .feature files). Then, when those tests are run, they use instrumentation (e.g., test coverage) to build a "heat map" of their code that graphically illustrates which code is cool – no big deal if this fails – and which code might be white hot – the consequences will be severe if it fails.

(It’s not as hard to build a tool like this as you might think, BTW.)
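To illustrate the idea, here's a toy sketch of the core of such a tool: combine a per-scenario risk score agreed with stakeholders with the lines each scenario's tests execute. In real life the execution data would come from a coverage tool (e.g., coverage.py); here the data and names are invented and hard-coded:

```python
from collections import defaultdict

# Risk score agreed with stakeholders for each usage scenario (invented numbers)
scenario_risk = {"withdraw cash": 9, "view statement": 2}

# Which (file, line) pairs each scenario's tests executed - in practice this
# would be per-test coverage data; hard-coded here for illustration
lines_hit = {
    "withdraw cash":  [("account.py", 10), ("account.py", 11), ("ledger.py", 3)],
    "view statement": [("account.py", 10), ("report.py", 7)],
}

def heat_map(scenario_risk, lines_hit):
    """Sum the risk of every scenario that touches each line: the hottest
    lines are those on the critical paths of high-risk scenarios."""
    heat = defaultdict(int)
    for scenario, lines in lines_hit.items():
        for line in lines:
            heat[line] += scenario_risk[scenario]
    return dict(heat)

heat = heat_map(scenario_risk, lines_hit)
# ("account.py", 10) is touched by both scenarios, so it's the hottest line
```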

A less obvious indicator is dependencies. Code that's widely reused, directly or indirectly, also presents a potentially higher risk. Static analysis tools like NDepend can calculate the "rank" of a method or a class or a package in the system (in the same sense as Google's PageRank) to show where code is widely reused.
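A crude version of the idea can be sketched without any tooling at all, by counting direct and indirect dependents over a dependency graph. The graph below is invented, and tools like NDepend do this properly with a PageRank-style algorithm, but it shows the principle:

```python
DEPENDS_ON = {
    # module -> modules it calls (an invented, acyclic example graph)
    "checkout": ["pricing", "stock"],
    "reporting": ["pricing"],
    "pricing": ["money"],
    "stock": ["money"],
    "money": [],
}

def dependents(module, graph):
    """All modules that depend on `module`, directly or indirectly."""
    direct = {m for m, deps in graph.items() if module in deps}
    indirect = set()
    for m in direct:
        indirect |= dependents(m, graph)
    return direct | indirect

# "money" is the most widely reused, so failures there have the widest blast radius
rank = {m: len(dependents(m, DEPENDS_ON)) for m in DEPENDS_ON}
```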

Monitoring how often code’s executed in production can produce a similar, but dynamic, picture of which code’s used most often.

These are all measures of the potential impact of failure. But what about the likelihood of failure? A function may be on a critical path, and reused widely, but if it’s just adding a list of numbers together, it’s not very likely to fail.

Complex logic, on the other hand, presents many more ways of being wrong – the more complex, the greater that risk.

Code that’s load-bearing and complex should attract our attention.

And code that’s load-bearing, complex and changing often is white hot. That should be balanced by the strength of our testing. The hotter the code, the more exhaustively and the more frequently it might need testing.

Hopefully, with a testing specialist in the team, you will have a good repertoire of software verification techniques to match against the temperature of the code – guided inspection, property-based testing, DBC, decision tables, response matrices, state transition tables, model checking, maybe even proofs of correctness when it really needs to work.
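To give a flavour of one of those techniques: property-based testing checks that general properties hold across many generated inputs, rather than a handful of hand-picked examples. Real tools like Hypothesis do this far better (shrinking failing cases, smarter generators and so on); this hand-rolled stdlib sketch just shows the gist, applied to the withdraw example:

```python
import random

class InvalidAmountError(Exception): pass
class InsufficientFundsError(Exception): pass

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise InvalidAmountError()
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount

def check_withdraw_properties(trials=1000, seed=42):
    """Properties: a rejected withdrawal changes nothing, and a successful
    withdrawal reduces the balance by exactly the amount, never below zero."""
    rng = random.Random(seed)
    for _ in range(trials):
        balance = rng.uniform(-100, 1000)
        amount = rng.uniform(-100, 1000)
        account = BankAccount(balance)
        try:
            account.withdraw(amount)
        except (InvalidAmountError, InsufficientFundsError):
            assert account.balance == balance  # failed withdrawal changes nothing
        else:
            assert amount > 0
            assert account.balance == balance - amount
            assert account.balance >= 0
    return True
```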

But a good start is knowing where your hottest code actually is.

Codemanship’s Code Craft Road Map

One of the goals behind my training courses is to help developers navigate all the various disciplines of what we these days call code craft.

It helps me to have a mental road map of these disciplines, refined from three decades of developing software professionally.

codecraftroadmap

When I posted this on Twitter, a couple of people got in touch to say that they find it helpful, but also that a few of the disciplines were unfamiliar to them. So I thought it might be useful to go through them and summarise what they mean.

  • Foundations – the core enabling practices of code craft
    • Unit Testing – is writing fast-running automated tests to check the logic of our code, that we can run many times a day to ensure any changes we’ve made haven’t broken the software. We currently know of no other practical way of achieving this. Slow tests cause major bottlenecks in the development process, and tend to produce less reliable code that’s more expensive to maintain. Some folk say “unit testing” to mean “tests that check a single function, or a single module”. I mean “tests that have no external dependencies (e.g., a database) and run very fast”.
    • Version Control – is seat belts for programmers. The ability to go back to a previous working version of the code provides essential safety and frees us to be bolder with our code experiments. Version Control Systems these days also enable more effective collaboration between developers working on the same code base. I still occasionally see teams editing live code together, or even emailing source files to each other. That, my friends, is the hard way.
    • Evolutionary Development – is what fast-running unit tests and version control enable. It is one or more programmers and their customers collectively solving problems together through a series of rapid releases of a working solution, getting it less wrong with each pass based on real-world feedback. It is not teams incrementally munching their way through a feature list or any other kind of detailed plan. It’s all about the feedback, which is where we learn what works and what doesn’t. There are many takes on evolutionary development. Mine starts with a testable business goal, and ends with that goal being achieved. Yours should, too. Every release is an experiment, and experiments can fail. So the ability to revert to a previous version of the code is essential. Fast-running unit tests help keep changes to code safe and affordable. If we can’t change the code easily, evolution stalls. All of the practices of code craft are designed to enable rapid and sustained evolution of working software. In short, code craft means more throws of the dice.
  • Team Craft – how developers work together to deliver software
    • Pair Programming – is two programmers working side-by-side (figuratively speaking, because sometimes they might not even be on the same continent), writing code in real time as a single unit. One types the code – the “driver” – and one provides high-level directions – the “navigator”. When we’re driving, it’s easy to miss the bigger picture. Just like on a car journey, in the days before GPS navigation. The person at the wheel needs to be concentrating on the road, so a passenger reads the map and tells them where to go. The navigator also keeps an eye out for hazards the driver may have missed. In programming terms, that could be code quality problems, missing tests, and so on – things that could make the code harder to change later. In that sense, the navigator in a programming pair acts as a kind of quality gate, catching problems the driver may not have noticed. Studies show that pair programming produces better quality code, when it’s done effectively. It’s also a great way to share knowledge within a team. One pairing partner may know, for example, useful shortcuts in their editor that the other doesn’t. If members of a team pair with each other regularly, soon enough they’ll all know those shortcuts. Teams that pair tend to learn faster. That’s why pairing is an essential component of Codemanship training and coaching. But I appreciate that many teams view pairing as “two programmers doing the work of one”, and pair programming can be a tough sell to management. I see it a different way: for me, pair programming is two programmers avoiding the rework of seven.
    • Mob Programming – sometimes, especially in the early stages of development, we need to get the whole team on the same page. I’ve been using mob programming – where the team, or a section of it, all work together in real-time on the same code (typically around a big TV or projector screen) – for nearly 20 years. I’m a fan of how it can bring forward all those discussions and disagreements about design, about the team’s approach, and about the problem domain, airing all those issues early in the process. More recently, I’ve been encouraging teams to mob instead of having team meetings. There’s only so much we can iron out sitting around a table talking. Eventually, I like to see the code. It’s striking how often debates and misunderstandings evaporate when we actually look at the real code and try our ideas for real as a group. For me, the essence of mob programming is: don’t tell me, show me. And with more brains in the room, we greatly increase the odds that someone knows the answer. It’s telling that when we do team exercises on Codemanship workshops, the teams that mob tend to complete the exercises faster than the teams who work in parallel. And, like pair programming, mobbing accelerates team learning. If you have junior or trainee developers on your team, I seriously recommend regular mobbing as well as pairing.
  • Specification By Example – is using concrete examples to drive out a precise understanding of what the customer needs the software to do. It's usually practiced at two levels of abstraction: the behaviour of the system as a whole, and the internal high-level design of the code.
    • Test-Driven Development – is using tests (typically internal unit tests) to evolve the internal design of a system that satisfies an external (“customer”) test. It mandates discovery of internal design in small and very frequent feedback loops, making a few design decisions in each feedback loop. In each feedback loop, we start by writing a test that fails, which describes something we need the code to do that it currently doesn’t. Then we write the simplest solution that will pass that test. Then we review the code and make any necessary improvements – e.g. to remove some duplication, or make the code easier to understand – before moving on to the next failing test. One test at a time, we flesh out a design, discovering the internal logic and useful abstractions like methods/functions, classes/modules, interfaces and so on as we triangulate a working solution. TDD has multiple benefits that tend to make the investment in our tests worthwhile. For a start, if we only write code to pass tests, then at the end we will have all our solution code covered by fast-running tests. TDD produces high test assurance. Also, we’ve found that code that is test-driven tends to be simpler, lower in duplication and more modular. Indeed, TDD forces us to design our solutions in such a way that they are testable. Testable is synonymous with modular. Working in fast feedback loops means we tend to make fewer design decisions before getting feedback, and this tends to bring more focus to each decision. TDD, done well, promotes a form of continuous code review that few other techniques do. TDD also discourages us from writing code we don’t need, since all solution code is written to pass tests. It focuses us on the “what” instead of the “how”. Overly complex or redundant code is reduced. So, TDD tends to produce more reliable code (studies find up to 90% fewer bugs in production), that can be re-tested quickly, and that is simpler and more maintainable.
It’s an effective way to achieve the frequent and sustained release cycles demanded by evolutionary development. We’ve yet to find a better way.
    • Behaviour-Driven Development – is working with the customer at the system level to precisely define not what the functions and modules inside do, but what the system does as a whole. Customer tests – tests we’ve agreed with our customer that describe system behaviour using real examples (e.g., for a £250,000 mortgage paid back over 25 years at 4% interest, the monthly payments should be exactly £1,290) – drive our internal design, telling us what the units in our “unit tests” need to do in order to deliver the system behaviour the customer desires. These tests say nothing about how the required outputs are calculated, and ideally make no mention of the system design itself, leaving the developers and UX folk to figure those design details out. They are purely logical tests, precisely capturing the domain logic involved in interactions with the system. The power of BDD and customer tests (sometimes called “acceptance tests”) is how using concrete examples can help us drive out a shared understanding of what exactly a requirement like “…and then the mortgage repayments are calculated” really means. Automating these tests to pull in the example data provided by our customer forces us to be 100% clear about what the test means, since a computer cannot interpret an ambiguous statement (yet). Customer tests provide an outer “wheel” that drives the inner wheel of unit tests and TDD. We may need to write a bunch of internal units to pass an external customer test, so that outer wheel will turn slower. But it’s important those wheels of BDD and TDD are directly connected. We only write solution code to pass unit tests, and we only write unit tests for logic needed to pass the customer test.
  • Code Quality – refers specifically to the properties of our code that make it easier or harder to change. As teams mature, their focus will often shift away from “making it work” to “making it easier to change, too”. This typically signals a growth in the maturity of the developers as code crafters.
    • Software Design Principles – address the underlying factors in code mechanics that can make code harder to change. On Codemanship courses, we teach two sets of design principles: Simple Design and Modular Design.
      • Simple Design
        • The code must work
        • The code must clearly reveal its intent (i.e., using module names, function names, variable names, constants and so on, to tell the story of what the code does)
        • The code must be low in duplication (unless that makes it harder to understand)
        • The code must be the simplest thing that will work
      • Modular Design (where a “module” could be a class, or component, or a service etc)
        • Modules should do one job
        • Modules should know as little about each other as possible
        • Module dependencies should be easy to swap
    • Refactoring – is the discipline of improving the internal design of our software without changing what it does. More bluntly, it’s making the code easier to change without breaking it. Like TDD, refactoring works in small feedback cycles. We perform a single refactoring – like renaming a class – and then we immediately re-run our tests to make sure we didn’t break anything. Then we do another refactoring (e.g., move that class into a different package) and test again. And then another refactoring, and test. And another, and test. And so on. As you can probably imagine, a good suite of fast-running automated tests is essential here. Refactoring and TDD work hand-in-hand: the tests make refactoring safer, and without a significant amount of refactoring, TDD becomes unsustainable. Working in these small, safe steps, a good developer can quite radically restructure the code whilst ensuring all along the way that the software still works. I was very tempted to put refactoring under Foundation, because it really is a foundational discipline for any kind of programming. But it requires a good “nose” for code quality, and it’s also an advanced skill to learn properly. So I’ve grouped it here under Code Quality. Developers need to learn to recognise code quality problems when they see them, and get hundreds of hours of practice at refactoring the code safely to eliminate them.
    • Legacy Code – is code that is in active use, and therefore probably needs to be updated and improved regularly, but is too expensive and risky to change. This is usually because the code lacks fast-running automated tests. To change legacy code safely, we need to get unit tests around the parts of the code we need to change. To achieve that, we usually need to refactor that code to make it easy to unit test – i.e., to remove external dependencies from that code. This takes discipline and care. But if every change to a legacy system started with these steps, over time the unit test coverage would rise and the internal design would become more and more modular, making changes progressively easier. Most developers are afraid to work on legacy code. But with a little extra discipline, they needn’t be. I actually find it very satisfying to rehabilitate software that’s become a millstone around our customers’ necks. Most code in operation today is legacy code.
    • Continuous Inspection – is how we catch code quality problems early, when they’re easier to fix. Like anything with the word “continuous” in the title, continuous inspection implies frequent automated checking of the code for code quality “bugs” like functions that are too big or too complicated, modules with too many dependencies and so on. In traditional approaches, teams do code reviews to find these kinds of issues. For example, it’s popular these days to require a code review before a developer’s changes can be merged into the master branch of their repo. This creates bottlenecks in the delivery process, though. Code reviews performed by people looking at the code are a form of manual testing. You have to wait for someone to be available to do it, and it may take them some time to review all the changes you’ve made. More advanced teams have removed this bottleneck by automating some or all of their code reviews. It requires some investment to create an effective suite of code quality gates, but the pay-off in speeding up the check-in process usually more than covers it. Teams doing continuous inspection tend to produce code of a significantly higher quality than teams doing manual code reviews.
  • Software Delivery – is all about how the code we write gets to the operational environment that requires it. We typically cover it in two stages: how does code get from the developer’s desktop into a shared repository of code that could be built, tested and released at any time? And how does that code get from the repository onto the end user’s smartphone, or the rented cloud servers, or the TV set-top box as a complete usable product?
    • Continuous Integration – is the practice of developers frequently (at least once a day) merging their changes into a shared repository from which the software can be built, tested and potentially deployed. Often seen as purely a technology issue – “we have a build server” – CI is actually a set of disciplines that the technology only enables if the team applies them. First, it implies that developers don’t go too long before merging their changes into the same branch – usually the master branch or “trunk”. Long-lived developer branches – often referred to as “feature branches” – that go unmerged for days prevent frequent merging (and testing of merged code), and are therefore most definitely not CI. The benefit of frequent tested merges is that we catch conflicts much earlier, and more frequent merges typically mean fewer changes in each merge, and therefore fewer merge conflicts overall. Teams working on long-lived branches often report being stuck in “merge hell” where, say, at the end of the week everyone in the team tries to merge large batches of conflicting changes. In CI, once a developer has merged their changes to the master branch, the code in the repo is built and the tests are run to ensure none of those changes has “broken the build”. It also acts as a double-check that the changes work on a different machine (the build server), which reduces the risk of configuration mistakes. Another implication of CI – if our intent is to have a repository of code that can be deployed at any time – is that the code in the master branch must always work. This means that developers need to check before they merge that the resulting merged code will work. Running a suite of good automated tests beforehand helps to ensure this. Teams who lack those tests – or who don’t run them because they take too long – tend to find that the code in their repo is permanently broken to some degree.
In this case, releases will require a “stabilisation” phase to find the bugs and fix them. So the software can’t be released as soon as the customer wants.
    • Continuous Delivery – means ensuring that our software is always shippable. This encompasses a lot of disciplines. If code is sitting on developers’ desktops or languishing in long-lived branches, we can’t ship it. If the code sitting in our repo is broken, we can’t ship it. If there’s no fast and reliable way to take the code in the repo and deploy it as a working end product to where it needs to go, we can’t ship it. As well as disciplines like TDD and CI, continuous delivery also requires a very significant investment in automating the delivery pipeline – automating builds, automating testing (and making those tests run fast enough), automating code reviews, automating deployments, and so on. And these automated delivery processes need to be fast. If your builds take 3 hours – usually because the tests take so long to run – then that will slow down those all-important customer feedback loops, and slow down the process of learning from our releases and evolving a better design. Build times in particular are like the metabolism of your development process. If development has a slow metabolism, that can lead to all sorts of other problems. You’d be surprised how often I’ve seen teams with myriad difficulties watch those issues magically evaporate after we cut their build+test time down from hours to minutes.
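To make the continuous inspection idea above concrete, here’s a minimal sketch of an automated code quality gate, written in Python using only the standard library. It flags functions that exceed a length threshold – the kind of check a build pipeline could run on every merge instead of waiting for a human reviewer. The threshold and function names here are illustrative assumptions, not any standard tool’s API.

```python
# A minimal continuous-inspection check: parse a source file and flag
# functions that exceed a (hypothetical) length threshold. A real code
# quality gate would check many more things (complexity, dependencies,
# duplication), but the principle is the same: automate the review.
import ast

MAX_FUNCTION_LINES = 20  # illustrative threshold, tune per team


def find_long_functions(source: str, limit: int = MAX_FUNCTION_LINES):
    """Return (name, length) pairs for functions longer than `limit` lines."""
    tree = ast.parse(source)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > limit:
                offenders.append((node.name, length))
    return offenders


if __name__ == "__main__":
    sample = "def tiny():\n    return 1\n"
    print(find_long_functions(sample))  # a two-line function passes the gate
```

A check like this could be wired into the build so that a merge fails fast when it introduces a quality “bug”, removing the review bottleneck the section above describes.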

Now, most of this stuff is known to most developers – or, at the very least, they’ve heard of it. The final two headings caused a few scratched heads. These are more advanced topics that I’ve found teams do need to think about, but usually after they’ve mastered the core disciplines that come before.

  • Managing Code Craft
    • The Case for Code Craft – acknowledges that code craft doesn’t exist in a vacuum, and shouldn’t be seen as an end in itself. We don’t write unit tests because, for example, we’re “professionals”. We write unit tests to make changing code easier and safer. I’ve found it helps enormously to both be clear in my own mind about why I’m doing these things, as well as in persuading teams that they should try them, too. I hear it from teams all the time: “We want to do TDD, but we’re not allowed”. I’ve never had that problem, and my ability to articulate why I’m doing TDD helps.
    • Code Craft Metrics – once you’ve made your case, you’ll need to back it up with hard data. Do the disciplines of code craft really speed up feedback cycles? Do they really reduce bug counts, and does that really save time and money? Do they really reduce the cost of changing code? Do they really help us to sustain the pace of innovation for longer? I’m amazed how few teams track these things. It’s very handy data to have when the boss comes a’knockin’ with their Micro-Manager hat on, ready to tell you how to do your job.
    • Scaling Code Craft – is all about how code craft on a team and within a development organisation just doesn’t magically happen overnight. There are lots of skills and ideas and tools involved, all of which need to be learned. And these are practical skills, like riding a bicycle. You can’t just read a book and go “Hey, I’m a test-driven developer now”. Nope. You’re just someone who knows in theory what TDD is. You’ve got to do TDD to learn TDD, and lots of it. And all that takes time. Most teams who fail to adopt code craft practices do so because they grossly underestimated how much time would be required to learn them. They approach it with such low “energy” that the code craft learning curve might as well be a wall. So I help organisations structure their learning, with a combination of reading, training and mentoring to get teams on the same page, and peer-based practice and learning. To scale that up, you need to be growing your own internal mentors. Ad hoc, “a bit here when it’s needed”, “a smidgen there when we get a moment” simply doesn’t seem to work. You need to have a plan, and you need to invest. And however much you were thinking of investing, it’s not going to be enough.
  • High-Integrity Code Craft
    • Load-Bearing Code – is that portion of code that we find in almost any non-trivial software that is much more critical than the rest. That might be because it’s on an execution path for a critical feature, or because it’s a heavily reused piece of code that lies on many paths for many features. Most teams are not aware of where their load-bearing code is. Most teams don’t give it any thought. And this is where many of the horror stories attributed to bugs in software begin. Teams can improve at identifying load-bearing code, and at applying more exhaustive and rigorous testing techniques to achieve higher levels of assurance when needed. And before you say “Yeah, but none of our code is critical”, I’ll bet a shiny penny there’s a small percentage of your code that really, really, really needs to work. It’s there, lurking in most software, just waiting to send that embarrassing email to everyone in your address book.
    • Guided Inspection – is a powerful way of testing code by reading it. Many studies have shown that code inspections tend to find more bugs than any other kind of testing. In guided inspections, we step through our code line by line, reasoning about what it will do for a specific test case – effectively executing the code in our heads. This is, of course, labour-intensive, but we would typically only do it for load-bearing code, and only when that code itself has changed. If we discover new bugs in an inspection, we feed that back into an automated test that will catch the bug if it ever re-emerges, adding it to our suite of fast-running regression tests.
    • Design By Contract – is a technique for ensuring the correctness of the interactions between components of our system. Every interaction has a contract: a pre-condition that describes when a function or service can be used (e.g., you can only transfer money if your account has sufficient funds), and a post-condition that describes what that function or service should provide to the client (e.g., the money is deducted from your account and credited to the payee’s account). There are also invariants: things that must always be true if the software is working as required (e.g., your account never goes over its limit). Contracts are useful in two ways: for reasoning about the correct behaviour of functions and services, and for embedding expectations about that behaviour inside the code itself as assertions that will fail during testing if an expectation isn’t satisfied. We can test post-conditions using traditional unit tests, but in load-bearing code, teams have found it helpful to assert pre-conditions to ensure that not only do functions and services do what they’re supposed to, but they’re only ever called when they should be. DBC presents us with some useful conceptual tools, as well as programming techniques when we need them. It also paves the way to a much more exhaustive kind of automated testing, namely…
    • Property-Based Testing – sometimes referred to as generative testing, is a form of automated testing where the inputs to the tests themselves are programmatically calculated. For example, we might test that a numerical algorithm works for a range of inputs from 0…1000, at increments of 0.01. Or we might test that a shipping calculation works for all combinations of inputs of country, weight class and mailing class. This is achieved by generalising the expected results in our tests, so instead of asserting that the square root of 4 is 2, we might assert that the square root of any positive number multiplied by itself is equal to the original number. These properties of correct test results look a lot like the contracts we might write when we practice Design By Contract, and therefore we might find experience in writing contracts helpful in building that kind of declarative style of asserting. The beauty of property-based tests is that they scale easily. Generating 1,000 random inputs and generating 10,000 random inputs requires a change of a single character in our test. One character, 9,000 extra test cases. Two additional characters (taking us to 100,000) yield 99,000 more test cases. Property-based tests enable us to achieve quite mind-boggling levels of test assurance with relatively little extra test code, using tools most developers already know.

So there you have it: my code craft road map, in a nutshell. Many of these disciplines are covered in introductory – but practical – detail in the Codemanship TDD course book.

If your team could use a hands-on introduction to code craft, our 3-day hands-on TDD course can give them a head-start.

The Gaps Between The Gaps – The Future of Software Testing

If you recall your high school maths (yes, with an “s”!), think back to calculus. This hugely important idea is built on something surprisingly simple: smaller and smaller slices.

If we want to roughly estimate the area under a curve, we can add up the areas of rectangular slices underneath it. If we want to improve the estimate, we make the slices thinner. Make them thinner still, and the estimate gets even better. In the limit of infinitely thin slices, the sum converges to the exact area – which is, in effect, taking an infinite number of samples.
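The slicing idea is easy to demonstrate numerically. Here’s a short Python sketch approximating the area under y = x² between 0 and 1 (the exact answer from calculus is 1/3), using the midpoint of each slice; the choices of curve and slice counts are mine, purely for illustration.

```python
# Riemann-sum sketch: approximate the area under y = x^2 on [0, 1].
# The exact area is 1/3; thinner slices drive the estimate towards it.
def area_under_square(n_slices: int) -> float:
    width = 1.0 / n_slices
    # Sum the areas of n rectangles, sampling the curve at each midpoint
    return sum(((i + 0.5) * width) ** 2 * width for i in range(n_slices))


for n in (10, 100, 1000):
    print(n, area_under_square(n))  # estimates approach 0.3333...
```

Each tenfold increase in slices shrinks the error by roughly a hundredfold here – the same “smaller gaps, better answer” dynamic the rest of this piece applies to testing.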

In computing, I’ve lived through several revolutions where increasing computing power has meant more and more samples can be taken, until the gaps between them are so small that – to all intents and purposes – the end result is analog. Digital Signal Processing, for example, has reached a level of maturity where digital guitar amplifiers and digital synthesizers and digital tape recorders are indistinguishable from the real thing to the human ear. As sample rates and bit depths increased, and number-crunching power skyrocketed while the cost per FLOP plummeted, we eventually arrived at a point where the question of, say, whether to buy a real tube amplifier or use a digitally modeled tube amplifier is largely a matter of personal preference rather than practical difference.

Software testing’s been quietly undergoing the same revolution. When I started out, automated test suites ran overnight on machines that were thousands of times less powerful than my laptop. Today, I see large unit test suites running in minutes or fractions of minutes on hardware that’s way faster and often cheaper.

Factor in the Cloud, and teams now can chuck what would relatively recently have been classed as “supercomputing” power at their test suites for a few extra dollars each time. While Moore’s Law seems to have stalled at the CPU level, the scaling out of computing power shows no signs of slowing down – more and more cores in more and more nodes for less and less money.

I have a client I worked with to re-engineer a portion of their JUnit test suite for a mission-critical application, adding a staggering 2.5 billion additional property-based test cases (with only an additional 1,000 lines of code, I might add). This extended suite, which reuses but doesn’t replace their day-to-day suite of tests, runs overnight in about 5 1/2 hours on Cloud-based hardware. (They call it “draining the swamp”.)

I can easily imagine that suite running in 5 1/2 minutes in a decade’s time. Or running 250 billion tests overnight.

And it occurred to me that, as the gaps between tests get smaller and smaller, we’re tending towards what is – to all intents and purposes – a kind of proof of correctness for that code. Imagine writing software to guide a probe to the moons of Jupiter. A margin of error of 0.001% in calculations could throw it hundreds of thousands of kilometres off course. How small would the gaps need to be to ensure an accuracy of, say, 1km, or 100m, or 10m? (And yes, I know they can course correct as they get closer, but you catch my drift hopefully.)

When the gaps between the tests are significantly smaller than the allowable margin for error, I think that would constitute an effective proof of correctness. In the same way that when the gaps between audio samples fall far beyond the limits of human hearing, you have effectively analog audio – at least in the perceived quality of the end result.

And the good news is that this testing revolution is already well underway. I’ve been working with clients for quite some time, achieving very high integrity software using little more than the same testing tools we’re almost all using, and off-the-shelf hardware solutions available to almost everyone.