Behaviour-Driven Development: From Stories to Working Software

April 07, 2026 · 18 min read

The Greenbox team hit 214 subscribers. The sprint cadence is working. Event Storming gave them shared understanding. Example Mapping made their stories concrete. The sprint rhythm turned sticky notes into delivery.

But bugs keep appearing.

Not catastrophic bugs – the payment system works, the delivery scheduling is solid. But edge cases slip through. The delivery date calculation breaks on public holidays because nobody checked. The box-size switch fails if a customer changes on Wednesday instead of Monday. A paused subscriber gets charged because the retry logic doesn’t check pause state. Each one is a twenty-minute fix. Each one costs trust.

The team has concrete examples from their Example Mapping sessions – context, action, outcome, written on cards. But those cards are on a table. The code is on a screen. Somewhere between the two, the details get lost.

A language for examples

The Example Map gave the team examples as Context/Action/Outcome. There’s a step between “cards on a table” and “something a test framework can run.” The team needs a way to express those examples formally enough for a computer to use, while keeping them readable enough that Maya can look at them and say “yes, that’s what I meant.”

The language for this is Gherkin. Three keywords: Given, When, and Then – mapping directly to Context/Action/Outcome.

  • Given sets up the context – what’s true before anything happens.
  • When describes the action – what someone does.
  • Then states the outcome – what should be true afterwards.

A trivial example:

Given it is raining
When I go outside
Then I should get wet

No code. No special syntax. Anyone can read it. It’s the same pattern the team already used on their green cards – just formalised with keywords a test framework can parse.

From Example Map to Gherkin

The Example Map output for “Subscribe to a produce box” is already there:

  • Rule: Customer must choose a box size (Small $25/week, Large $45/week)
  • Rule: Payment must succeed (valid card → confirmed, declined card → retry)
  • Rule: Customer sees their first delivery date (Monday → this Thursday, Friday → next Thursday)

Take the delivery date example from the green card:

Context: delivery day is Thursday, minimum lead time is 3 days. Sarah subscribes on Friday. → First delivery is next Thursday.

Translated to Gherkin:

Given today is Friday
And deliveries happen on Thursdays
And the minimum lead time is 3 days
And a customer has a valid payment method
When they subscribe to the "Small" box
Then their first delivery date should be next Thursday

Mechanical translation. The hard thinking already happened round the table with Maya and the team.

The Feature file

Individual scenarios group into a Feature file – one coherent piece of behaviour. A Background section captures context shared across every scenario.

Feature: Subscribe to a produce box
  Customers want a regular supply of fresh, local produce
  without having to think about it each week.

  Background:
    Given the following box sizes are available:
      | name   | price    |
      | Small  | $25/week |
      | Large  | $45/week |

  Scenario: Subscribing with a valid payment method
    Given a customer has a valid payment method
    When they subscribe to the "Small" box
    Then their subscription should be confirmed
    And they should see their first delivery date

  Scenario: Payment is declined
    Given a customer has an expired credit card
    When they subscribe to the "Small" box
    Then no subscription should be created
    And they should be asked to update their payment method

  Scenario: Subscribing without enough lead time
    Given today is Friday
    And deliveries happen on Thursdays
    And the minimum lead time is 3 days
    And a customer has a valid payment method
    When they subscribe to the "Small" box
    Then their first delivery date should be next Thursday

Each rule from the Example Map maps to one or more scenarios. Each green card becomes concrete data inside a scenario. You’re not staring at a blank file wondering what to write. The conversation already happened. You’re transcribing.
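One Gherkin feature worth knowing, though the team doesn't need it yet: when scenarios differ only in their data – like the box-size rule, where Small and Large behave identically apart from price – a Scenario Outline with an Examples table expresses them once. A sketch (the wording here is illustrative, not taken from the team's feature file):

```gherkin
Scenario Outline: Weekly price matches the chosen box
  Given a customer has a valid payment method
  When they subscribe to the "<box>" box
  Then they should be charged <price> per week

  Examples:
    | box   | price |
    | Small | $25   |
    | Large | $45   |
```

The runner expands each Examples row into its own scenario, so both box sizes stay covered without duplicated steps.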

The BDD cycle: story, unit, code

Now the team has scenarios – acceptance tests describing the agreed behaviour. But you don't implement them in one big push. You work from the outside in, using two nested loops.

The outer loop is the acceptance test – the Gherkin scenario. The inner loop is unit tests driving the implementation. The acceptance test tells you when you’re done. The unit tests tell you how to get there.

  1. Pick a scenario. Run it. RED – it fails because nothing exists yet.
  2. Drop to unit tests. Write a small, focused test. RED.
  3. Write the simplest code that makes it pass. GREEN.
  4. Refactor if needed.
  5. Repeat steps 2–4 until the acceptance test passes. GREEN.
  6. Move to the next scenario.

Worked example: Greenbox subscription in Go

The code that follows is deliberately simple – it shows the BDD rhythm without the noise of a real production system. The discovery techniques produce the same concrete examples regardless of implementation complexity.

Tom and Priya are implementing the subscription story together. They’re sitting side by side for the first time – Priya usually works with headphones on, Tom usually works alone. He notices she names her tests differently. “How do you name tests?” he asks. “I describe what the customer expects, not what the code does,” she says. It’s a small thing. Tom starts doing it too.

Delivery date calculator

They start with the third scenario – delivery date calculation – because it’s pure logic with no external dependencies. Self-contained, well-specified by the Example Map, easy to test in isolation.

The rules:

  • Deliveries happen on Thursdays
  • Minimum lead time is 3 days
  • Subscribe on Monday → this Thursday (3 days – just enough)
  • Subscribe on Friday → next Thursday (less than 3 days to this Thursday, rolls forward)

RED. Unit test first.

// delivery_test.go
package greenbox

import (
    "testing"
    "time"
)

func TestFirstDeliveryDate_MondaySubscription(t *testing.T) {
    monday := time.Date(2026, 3, 23, 10, 0, 0, 0, time.UTC)
    deliveryDay := time.Thursday
    minLeadDays := 3

    got := FirstDeliveryDate(monday, deliveryDay, minLeadDays)

    want := time.Date(2026, 3, 26, 10, 0, 0, 0, time.UTC)
    if !got.Equal(want) {
        t.Errorf("FirstDeliveryDate(%v, Thursday, 3) = %v, want %v",
            monday.Weekday(), got.Weekday(), want.Weekday())
    }
}

Won’t compile. FirstDeliveryDate doesn’t exist yet. That’s the point.

GREEN. Write the function.

// delivery.go
package greenbox

import "time"

func FirstDeliveryDate(from time.Time, deliveryDay time.Weekday, minLeadDays int) time.Time {
    earliest := from.AddDate(0, 0, minLeadDays)
    daysUntil := (int(deliveryDay) - int(earliest.Weekday()) + 7) % 7
    if daysUntil == 0 {
        return earliest
    }
    return earliest.AddDate(0, 0, daysUntil)
}

Test passes.

RED. Edge case from the Example Map: Friday subscription.

func TestFirstDeliveryDate_FridaySubscription(t *testing.T) {
    friday := time.Date(2026, 3, 27, 10, 0, 0, 0, time.UTC)
    deliveryDay := time.Thursday
    minLeadDays := 3

    got := FirstDeliveryDate(friday, deliveryDay, minLeadDays)

    want := time.Date(2026, 4, 2, 10, 0, 0, 0, time.UTC)
    if !got.Equal(want) {
        t.Errorf("FirstDeliveryDate(%v, Thursday, 3) = %v, want %v",
            friday.Format("Monday"), got.Format("Monday 2006-01-02"),
            want.Format("Monday 2006-01-02"))
    }
}

GREEN. Already passes. The modular arithmetic handles it naturally. One of the pleasures of TDD – you write a test expecting failure, and it passes, telling you your implementation is more general than you thought.

Subscription creation

The second piece: creating the subscription, including payment.

RED.

// subscription_test.go
package greenbox

import (
    "testing"
    "time"
)

type fakeGateway struct {
    shouldSucceed bool
    chargedAmount int
}

func (f *fakeGateway) Charge(amountCents int) (bool, error) {
    f.chargedAmount = amountCents
    return f.shouldSucceed, nil
}

func TestSubscribe_ValidPayment(t *testing.T) {
    gw := &fakeGateway{shouldSucceed: true}
    delivery := time.Date(2026, 3, 26, 0, 0, 0, 0, time.UTC)

    sub, err := Subscribe("Small", 2500, gw, delivery)

    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }
    if sub.BoxSize != "Small" {
        t.Errorf("BoxSize = %q, want %q", sub.BoxSize, "Small")
    }
    if sub.PricePerWeek != 2500 {
        t.Errorf("PricePerWeek = %d, want %d", sub.PricePerWeek, 2500)
    }
    if !sub.FirstDelivery.Equal(delivery) {
        t.Errorf("FirstDelivery = %v, want %v", sub.FirstDelivery, delivery)
    }
    if gw.chargedAmount != 2500 {
        t.Errorf("charged %d, want %d", gw.chargedAmount, 2500)
    }
}

GREEN. Simplest thing that passes.

// subscription.go
package greenbox

import (
    "errors"
    "time"
)

var ErrPaymentDeclined = errors.New("payment declined")

type Subscription struct {
    BoxSize       string
    PricePerWeek  int
    FirstDelivery time.Time
}

type PaymentGateway interface {
    Charge(amountCents int) (ok bool, err error)
}

func Subscribe(boxSize string, priceCents int, gw PaymentGateway, firstDelivery time.Time) (*Subscription, error) {
    _, _ = gw.Charge(priceCents)
    return &Subscription{
        BoxSize:       boxSize,
        PricePerWeek:  priceCents,
        FirstDelivery: firstDelivery,
    }, nil
}

Test passes. But the implementation is deliberately naive – it ignores the payment result. The next test will force the fix.

RED. Declined payment.

func TestSubscribe_DeclinedPayment(t *testing.T) {
    gw := &fakeGateway{shouldSucceed: false}
    delivery := time.Date(2026, 3, 26, 0, 0, 0, 0, time.UTC)

    sub, err := Subscribe("Small", 2500, gw, delivery)

    if err != ErrPaymentDeclined {
        t.Errorf("err = %v, want %v", err, ErrPaymentDeclined)
    }
    if sub != nil {
        t.Errorf("subscription should be nil when payment declined")
    }
}

Fails. The current implementation always returns a subscription.

GREEN.

func Subscribe(boxSize string, priceCents int, gw PaymentGateway, firstDelivery time.Time) (*Subscription, error) {
    ok, err := gw.Charge(priceCents)
    if err != nil {
        return nil, err
    }
    if !ok {
        return nil, ErrPaymentDeclined
    }
    return &Subscription{
        BoxSize:       boxSize,
        PricePerWeek:  priceCents,
        FirstDelivery: firstDelivery,
    }, nil
}

Both tests pass. Four unit tests, two source files, clean types, narrow interfaces.

Step definitions: the glue

Step definitions connect Gherkin steps to your application code. When the test runner sees the step ‘When they subscribe to the "Small" box’, it needs a function that calls your real Subscribe code.

// Shared scenario state, declared in the test harness; later Then steps assert on it.
var (
    lastSubscription *greenbox.Subscription
    lastError        error
)

func iSubscribeToTheBox(ctx context.Context, size string) error {
    gw := stripeGateway() // sandbox gateway provided by the harness
    sub, err := greenbox.Subscribe(size, boxPrice(size), gw,
        greenbox.FirstDeliveryDate(time.Now(), time.Thursday, 3))
    if err != nil {
        lastError = err // a declined payment is an outcome to assert, not a test failure
        return nil
    }
    lastSubscription = sub
    return nil
}

Thin on purpose. It delegates to the real functions the team already wrote and tested. No business logic. Just glue.

Three guidelines for keeping them healthy:

  • Keep them thin. If you’re writing if statements or business logic inside a step definition, the logic belongs in domain code where it’s unit-tested.
  • Use consistent language. If the team says “subscribe,” every step says “subscribe.” Inconsistent language means duplicate step definitions doing the same thing with different words.
  • Maintain them like production code. Review in PRs. Refactor when the domain language evolves. Delete when scenarios are removed. If step definitions drift from reality, the team stops trusting the scenarios, stops maintaining them, and BDD quietly dies.

Priya suggests running the Gherkin tests automatically. “We’re writing tests that prove the code does what Maya expects. Why are we running them by hand?” She sets up a GitHub Action – tests run on every pull request. It takes her an afternoon. The first automated run catches a bug in Tom’s payment retry logic that manual testing missed. Tom: “That saved me a day.” Priya: “That saved a customer.”

LLMs as implementation partners

Here’s the thing about everything you just read: an LLM could have written most of it.

Not the Example Map. Not the discovery conversation where Maya explained that deliveries happen on Thursdays and the minimum lead time is three days. Not the moment when Tom asked “what about Friday?” and surfaced an edge case. The LLM wasn’t in the room for that.

But the code? You could hand an LLM the Feature file and say: “Write me a Go implementation with tests that makes these scenarios pass.” And it would produce something remarkably close to what you just read. The behaviour would be right, because the scenarios are concrete and unambiguous. There’s no room for the LLM to guess wrong about what “subscribe” means when the Feature file spells it out.

A caveat. LLMs are good at the happy path. They’ll miss things you didn’t specify: network timeouts, concurrency issues, flaky payment gateways. Code review isn’t optional – budget roughly half your time for reviewing and hardening what comes back. The discovery work is what makes this review possible. Because you have concrete examples, you can check the LLM’s output against something specific. Without that, you’re reviewing code against vibes.

The pipeline:

Event Storming → Example Mapping → BDD Scenarios → Hand to LLM → Review Output → Ship

Everything to the left of “Hand to LLM” is human thinking. Everything to the right is review and refinement. The human work is the thinking. The LLM work is the typing. Both are necessary. Neither is sufficient alone.

While implementing the payment integration, Tom makes a deliberate shortcut: he hardcodes the currency to AUD instead of making it configurable. He writes a comment: // SHORTCUT: AUD only. If we ever go international, this needs to change. Lee sees it during review and says: “That’s a good shortcut. You know it’s there, you know when it’ll matter, and you’ve documented it. Technical debt is fine when it’s conscious.” Tom carries the idea forward: debt is a choice, not an accident. The dangerous kind is the kind you don’t know you’re taking on.

That same week, a subscriber emails Sam on Saturday: “Your website has been showing an error since yesterday afternoon.” Nobody noticed – they don’t monitor the site outside business hours. Sam signs up for a free uptime monitor that pings the site every five minutes and texts her if it’s down. It’s not observability. It’s a text message. But it’s the first time a machine is watching instead of a person.

But are we building the right things?

One thing Tom notices: the LLM generates code faster than he can review it. The code arrives clean and confident, but he can’t always tell if it’s right until he traces through it line by line. The Feature file gives him something concrete to check against. But the speed creates an odd sensation: the bottleneck isn’t writing code any more. It’s knowing whether the code is correct.

A few weeks in, the rhythm is working. Example Mapping eliminates the surprises. BDD catches bugs before production. The code quality is up. The board looks healthy.

But the number that actually matters – active subscribers – is going backwards. They hit 214 at the end of the first sprint cycle. A month later, they’re at 197.

Maya checks the number at her kitchen table one evening. Nadia looks over her shoulder. “Is that good?”

“It’s going the wrong way.”

Churn is eating the growth. For every ten new subscribers, three or four cancel. The team is building well, but subscriber count doesn’t care about code quality.

The frustrating thing is that the team is doing good work. They’ve built a solid subscription system, payment processing, delivery date logic. Tom has been improving the admin tools. Jas redesigned the onboarding flow. Sam is pushing for a farm analytics dashboard. Everyone has a reasonable next thing to build.

But nobody has stepped back to ask: which of these things will actually stop the bleeding? A prettier onboarding flow won’t fix churn. A farm dashboard won’t either. The team is efficiently building features that don’t address the problem.

Maya raises it at the Monday standup. “We’re shipping faster than ever. But we’re shrinking. Something’s wrong and I don’t think the answer is to ship even faster.”

Which stories should the team be building? How do they connect work to the business goal? They need a technique that works backwards from outcomes.

Lee suggests one. It’s called Impact Mapping, and it starts with a single question: why are we building this?

The next chapter, Impact Mapping: Connecting Work to Goals, publishes around 14 April.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.