Writing More Readable Tests

Mar 28, 2026

Ideally, tests should provide a clear understanding of the system’s behavior. This means they should take significantly less effort to comprehend than the production code itself. In the era of LLMs, clean, readable tests that act as executable specifications matter more than ever: they have become essential for both guiding LLM inputs and validating LLM outputs.

Why Tests Become Cluttered

Unfortunately, poor readability is one of the most frequent issues I see in test code. When a test is bogged down by details, it is usually due to one of three reasons:

The production code is too complex. A typical case is a test that is busy orchestrating a large number of mocks because the production code has too many dependencies.
A single test tries to cover too many rules at once. This is where the principle of having a single assertion per test comes from. Long lists of assertions are especially common in high-level tests, such as component tests. Because these run significantly slower than unit or micro tests, the desire to pack multiple rules into a single execution is understandable but still unfortunate.
The test exposes too many implementation details. The test is too tightly coupled to the internal workings of the production code instead of focusing on the high-level behavior that needs to be verified. Thoughtworks has made the same case in its writing on DSLs for testing: a good testing vocabulary lets teams describe domain behavior directly, rather than leaking low-level execution details into every example. One option is to focus more on high-level tests, which are naturally less coupled to the system’s internal mechanics. However, the same goal can be achieved with unit tests just as well, as the following example shows.

Example

Let’s say I’m implementing trade settlements for an investing application. In this domain, OrderFill records are created when a customer's order is executed. The settlement information for these fills is stored in OrderFillSettlement records, which are generated when our system receives a SettlementConfirmation event from the central settlement system.

Here is an example of a “bad” test that exposes far too many details (written in Groovy/Spock):

def "partially settles buys using FIFO and marks full then partial fill"() {
  given:
    def fill1 = anOrderFill(
      side: BUY,
      quantity: new OrderQuantity(2000.0),
      averagePrice: new OrderPrice(2.50),
      lastPrice: new OrderPrice(2.50),
      venueReference: new VenueReference(NASDAQ, "ref123"),
      fillTime: Instant.parse("2026-01-20T12:20:00Z"),
      createTime: Instant.parse("2026-01-20T12:20:01Z")
    )
    def fill2 = anOrderFill(
      side: BUY,
      quantity: new OrderQuantity(2000.0),
      averagePrice: new OrderPrice(2.50),
      lastPrice: new OrderPrice(2.50),
      venueReference: new VenueReference(NASDAQ, "ref123"),
      fillTime: Instant.parse("2026-01-20T12:20:02Z"),
      createTime: Instant.parse("2026-01-20T12:20:03Z")
    )
    orderFillRepository
      .findAllByVenueReference(_ as VenueReference) >> [fill1, fill2]
    orderFillSettlementRepository
      .findAllByFillIds(_ as Collection<Id<OrderFill>>) >> []

  when:
    def settlementConfirmation = aSettlementConfirmation(    
      reference: "123abc",
      direction: BUY,   
      senderMessageReference: "sender-ref123",
      settlementReference: new VenueReference(NASDAQ, "ref123"),
      settledQuantity: new OrderQuantity(3000.0),
      settledAmount: new SettledAmount(new OrderCashAmount(7500.0), GBP)
    )
    service.process(settlementConfirmation)

  then:
    1 * orderFillService.settleFill({ OrderFillSettlement settlement ->
      settlement.fillId() == fill1.id()
        && settlement.quantity() == new OrderQuantity(2000.0)
        && settlement.grossAmount() == new OrderCashAmount(5000.0)
      }, true)
    1 * orderFillService.settleFill({ OrderFillSettlement settlement ->
      settlement.fillId() == fill2.id()
        && settlement.quantity() == new OrderQuantity(1000.0)
        && settlement.grossAmount() == new OrderCashAmount(2500.0)
      }, false)
}

Breaking Down the Business Rules

First, it is clear that this test is trying to cover too much. Even the title — “partially settles buys using FIFO and marks full then partial fill” — is an indicator that many rules are crammed into it.

Looking at the test, we can infer that it tries to verify rules such as:

Fills are settled using FIFO (First-In, First-Out) ordering.
Fills are looked up using a settlement or external reference.
“Buy” settlement confirmations correspond to “buy” fills, and the same applies to “sells”.
Fills can be fully or partially settled.

However, there is also incidental data mixed in. For example, is the senderMessageReference significant to the business logic? It may be used for idempotency, but the test does not make that clear. Ideally, every piece of information in a test should be significant for the rule being checked.

This test should be broken up into multiple, focused tests:

“fully settles a buy fill”
“partially settles a buy fill”
“settles a fill on top of a previously settled quantity and amount”
“uses FIFO to settle multiple fills”

Each of these would include only the details necessary to explain its specific business rule.

Hiding Implementation Details

Another problem with the original test is that it is tightly coupled to the specific implementation of finding order fills, checking whether a given settlement event has already been processed, and how fills are settled. If we followed the principle of hiding implementation details, the test body would look the same regardless of whether it executes at the component/system level or as a unit/micro test.

The Improved Test

Here is the refactored test:

def "settles first fill fully and second partially"() { 
  given:
    def fill1 = fill(
      quantity: 2000.0,
      lastPrice: 2.50
    )
    def fill2 = fill(
      quantity: 2000.0,
      lastPrice: 2.50
    )
    havingFills(fill1, fill2)

  when:
    def settlementConfirmation = settlementConfirmation(    
      settledQuantity: 3000.0,
      settledAmount: 7500.0
    )
    process(settlementConfirmation)

  then:
    interaction {
      settlesFill(
        fillId: fill1.id(),
        quantity: 2000.0,
        grossAmount: 5000.0,
        fully: true
      )
      settlesFill(
        fillId: fill2.id(),
        quantity: 1000.0,
        grossAmount: 2500.0,
        fully: false
      )
    }
}

This version hides all details that aren’t relevant to the quantity/amount distribution logic or the fill lookup.

Instead of relying solely on generic test data builders, it uses local factory methods (fill and settlementConfirmation) with default values tailored to this specific specification. It also hides the OrderFillRepository interaction behind a havingFills helper method, and does the same for the verification block with settlesFill. (Note: the interaction block is needed because the expectations are declared inside a helper method rather than written directly in the then: block; without it, Spock would not register them as interactions before the when: block runs.)
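To make the expected numbers easier to follow, the FIFO distribution that this test asserts can be worked through in a few lines of plain Groovy. This is only a minimal sketch of the rule, not the production implementation: allocateFifo and the map-based fills are hypothetical names introduced for illustration, and the gross amount is assumed to be quantity multiplied by lastPrice, which matches the figures used in the test.

```groovy
// Hypothetical sketch of the rule under test (not the production code):
// distribute a settled quantity over fills, oldest fill first.
def allocateFifo(List<Map> fills, BigDecimal settledQuantity) {
  def remaining = settledQuantity
  def allocations = []
  for (fill in fills) {              // fills are assumed to be ordered oldest first
    if (remaining <= 0) break
    // Settle as much of this fill as the remaining quantity allows.
    BigDecimal quantity = remaining.min(fill.quantity)
    allocations << [
      fillId     : fill.id,
      quantity   : quantity,
      grossAmount: quantity * fill.lastPrice,   // assumption: amount = quantity * price
      fully      : quantity == fill.quantity
    ]
    remaining -= quantity
  }
  return allocations
}

def fills = [
  [id: 'fill1', quantity: 2000.0, lastPrice: 2.50],
  [id: 'fill2', quantity: 2000.0, lastPrice: 2.50]
]
def result = allocateFifo(fills, 3000.0)

// The first fill is settled fully, the second one only partially.
assert result[0].fillId == 'fill1'
assert result[0].quantity == 2000.0
assert result[0].grossAmount == 5000.0
assert result[0].fully

assert result[1].fillId == 'fill2'
assert result[1].quantity == 1000.0
assert result[1].grossAmount == 2500.0
assert !result[1].fully
```

The two allocations add up to the settled quantity of 3000 and the settled amount of 7500, which is why those are the only values the refactored test needs to show.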

Implementing the Helper Methods

To make the above test work, here is the implementation for the fill factory method (which uses the generic OrderFillTestFactory.anOrderFill method):

@NamedVariant
OrderFill fill(OrderSide side = BUY,
              BigDecimal quantity,
              BigDecimal lastPrice,
              String venueReference = SETTLEMENT_REFERENCE) {
  return anOrderFill(
    side: side,
    quantity: new OrderQuantity(quantity),
    averagePrice: new OrderPrice(lastPrice),
    lastPrice: new OrderPrice(lastPrice),
    venueReference: new VenueReference(NASDAQ, venueReference),
    // clock is a custom Clock that returns Instant incremented by 1s
    // every time "tick" is called
    fillTime: clock.tick(),
    createTime: clock.tick()
  )
}
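The clock used above is not shown in the article. One possible shape for it, assuming nothing more than the behavior described in the comment (each tick returns an Instant one second later than the previous one), could look like the sketch below. The class name TickingClock, its constructor and the start time are hypothetical.

```groovy
import java.time.Duration
import java.time.Instant

// Hypothetical helper, not the author's actual class: hands out strictly
// increasing timestamps so tests do not have to spell them out.
class TickingClock {
  private Instant current
  private Duration step

  TickingClock(Instant start, Duration step = Duration.ofSeconds(1)) {
    this.current = start
    this.step = step
  }

  // Advances the clock by one step and returns the new instant.
  Instant tick() {
    current = current.plus(step)
    return current
  }
}

def clock = new TickingClock(Instant.parse("2026-01-20T12:20:00Z"))
def fillTime = clock.tick()
def createTime = clock.tick()

assert fillTime == Instant.parse("2026-01-20T12:20:01Z")
assert createTime == Instant.parse("2026-01-20T12:20:02Z")
```

Because every call moves time forward, fills created one after another are naturally ordered by fillTime, which is what a FIFO test relies on without having to show any timestamps.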

The havingFills setup method looks like this:

void havingFills(OrderFill... fills) {
  interaction {
    orderFillRepository
      .findAllByVenueReference(_ as VenueReference) >> asList(fills)
  }
}

Again, interaction is needed for Spock to record this stub outside the given/when/then block.

And finally, the settlesFill verification helper:

@NamedVariant
void settlesFill(Id<OrderFill> fillId,
                 boolean fully,
                 BigDecimal quantity,
                 BigDecimal grossAmount) {
  // Declares the expected interaction; the parameter names double as the
  // named arguments used in the test (fillId, fully, quantity, grossAmount).
  1 * orderFillService.settleFill({ OrderFillSettlement settlement ->
    settlement.fillId() == fillId
      && settlement.quantity() == new OrderQuantity(quantity)
      && settlement.grossAmount() == new OrderCashAmount(grossAmount)
  }, fully)
}

The Three Levels of Test Data Visibility

When deciding which values to hide, I categorise them into three buckets:

Values that explain the tested business rule: These are crucial for understanding the test and should remain directly inside the test body.
Values required for execution, but irrelevant to the rule: These should be abstracted away. You can put them inside test-class-specific factory methods or define them as fields within the test class.
Values that simply need to be valid & present: Place these into generic test data factories/builders that can be reused across the entire test suite.

However, when hiding data, be careful not to swing too far in the opposite direction: hiding crucial information leads to the “magic values” anti-pattern and results in tests that no longer explain anything.
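The three buckets can be illustrated in one place with a simplified, map-based sketch in plain Groovy. Everything here is hypothetical (GenericFactory, SettlementSpecData, the field names and the default values); the real factories in the article build domain objects rather than maps.

```groovy
// Bucket 3: values that only need to be valid and present.
// A generic factory, reusable across the whole suite, supplies the defaults.
class GenericFactory {
  static Map anOrderFill(Map overrides = [:]) {
    return [side: 'BUY', quantity: 1.0, lastPrice: 1.0,
            venue: 'NASDAQ', reference: 'any-ref'] + overrides
  }
}

// Bucket 2: values required for execution but irrelevant to the rule.
// They live in a spec-specific constant and factory method.
class SettlementSpecData {
  static final String SETTLEMENT_REFERENCE = 'ref123'

  static Map fill(Map ruleValues) {
    return GenericFactory.anOrderFill([reference: SETTLEMENT_REFERENCE] + ruleValues)
  }
}

// Bucket 1: values that explain the rule stay visible in the test body.
def fill = SettlementSpecData.fill(quantity: 2000.0, lastPrice: 2.50)

assert fill.quantity == 2000.0       // visible in the test
assert fill.lastPrice == 2.50        // visible in the test
assert fill.reference == 'ref123'    // supplied by the spec-level factory
assert fill.side == 'BUY'            // supplied by the generic factory
```

Only the quantity and price appear at the call site, so a reader can see at a glance which values drive the rule and which are merely there to make the object valid.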

Final Thoughts

Readability comes down to emphasizing what is important and hiding what is unimportant. When a test tries to explain everything everywhere all at once, it’s no different to having an article written without paragraphs and subtitles — it becomes just one big blob.

In the past, I have only selectively pushed for this level of “cleanliness”. Before LLMs, the cost/gain ratio was not always clear. However, now the economics have shifted:

One-Time Setup: I only need to instruct the LLM once and provide it with the right examples to maintain consistency.
Improved LLM Performance: Clearer structures make it much easier for the LLM to understand how the system works.
More Efficient Human Review: Clearer tests simplify the review process, which is clearly becoming a significant bottleneck (some studies report review time increasing by up to 91%).