GreenBox is a produce-box company with 5,000 subscribers across Perth and Melbourne, now expanding to Brisbane. Three squads – Perth, Melbourne, and a remote team scattered across regional towns. Twenty-five people. And a substitution engine that’s about to get its most complex upgrade yet – built by one developer working alone, with predictable results.
Charlotte is leading the engineering organisation now. She joined during the scaling phase and brought the structure that got them from 1,000 to 5,000. Lee still drops in – mostly for the complex discovery work – but the day-to-day coaching is Charlotte’s.
The substitution engine has been the heart of GreenBox since Maya first mapped it on a wall of sticky notes. It started as Maya’s brain, became a decision table, and now runs thousands of matches a week across three cities. But the Perth squad has just picked up a major upgrade: seasonal rules, allergen combinations, and customer preference learning. It’s the most complex code in the system, and it touches every bounded context they’ve drawn.
Tom volunteers to build it.
The solo sprint
Tom is still one of the best developers in the organisation. He’s been with GreenBox since week one, he understands the domain better than almost anyone, and he’s fast with an LLM. He opens a session with Claude and starts prompting.
The first afternoon is electric. Tom hasn’t felt this way in months – maybe since the early weeks, before the workshops and the retros and the cadences. Just him and the machine, building. The LLM generates a seasonal availability model and Tom reshapes it, tightens the types, adds constraints. He works through Ava’s bedtime and Leo’s story and doesn’t hear Sarah come into his office at ten o’clock.
“You’ve got that look,” she says from the doorway.
“What look?”
“The one you had when you first started at GreenBox. When you told me it was the most productive week you’d ever had.”
Tom smiles. “It’s good. It’s really good.”
Day two is even better. He builds the allergen cross-referencing module, the preference learning system, the feedback loop that weights recent customer choices more heavily. The code is elegant – the kind of architecture he used to sketch on the back of napkins and never quite achieve. The LLM fills in the gaps he’d normally spend hours on. By five o’clock he has 2,000 lines of generated code. It handles seasonal availability windows, cross-references allergen profiles against substitution candidates, and learns from customer feedback to prefer certain swaps over others. It compiles. It passes the tests he wrote. He opens a pull request feeling proud. A crafted thing, made well.
Kai reviews it. He’s been with GreenBox since the scaling phase – one of the developers who helped build the Melbourne operation. He stares at the PR for forty minutes and then sends Tom a message: “I can read each function but I can’t follow the logic. Why does it prefer sweet potato over carrot in winter? Where does the allergen exclusion happen relative to the preference weighting?”
Tom reads the comment and feels a surge of irritation – followed immediately by recognition. He’s heard this before. From Maya, in week two, about his subscription model. That same feeling of I built it, it works, why are you questioning it.
Ravi, also reviewing, has a different concern: “This touches the supply matching bounded context and the customer profile bounded context. Did we check whether the contracts between them still hold?”
Maya, who is now head of product, looks at the seasonal rules and spots something immediately: “In winter, never substitute sweet potato for pumpkin – they’re both in season, so if we’re short on one, we’re short on both. This code would suggest sweet potato as a pumpkin substitute in July.”
Tom’s code works. It passes tests. But nobody else understands it. Three people have found three different categories of problem in a single review, and all three problems stem from the same root cause: Tom was the only person thinking when the code was written. The LLM was his implementation partner. Maya’s domain knowledge, Kai’s architectural awareness, and Ravi’s boundary expertise were all missing from the conversation.
Charlotte looks at the PR comments and says, “We’ve been here before. Week one vibes.”
The words land on Tom like cold water. He doesn’t respond. He doesn’t argue. He closes his laptop and stares at the wall for a long time. The framed print of his first merged pull request, which he brought from home when they moved to the bigger office, hangs above his monitor. It used to make him feel accomplished. Right now it makes him feel like a man who hasn’t learned.
She’s right. This is the same failure mode from the very first month – building fast, building alone, building wrong. The difference is that back then, Tom was typing the code himself. Now the LLM types it, so the wrong thing arrives even faster.
That evening, Sarah asks how his day was. Tom is quiet for a while, loading the dishwasher while the kids argue about something in the next room. Then: “Charlotte was right. She keeps being right and I keep needing to hear it twice.”
Sarah dries her hands on the tea towel. “At least you hear it. Your dad never did.”
Tom doesn’t answer. He picks up his phone and texts Priya: I think I needed to learn this lesson twice.
Priya replies at eleven, after she’s fed Refactor and turned off the lights: The good news is you learned it.
The ensemble idea
Charlotte suggests something the team hasn’t tried before: ensemble programming.
Ensemble programming – also called mob programming – is an old idea. The whole team works on one thing, together, at the same time, on the same computer. In the traditional version, one person types (the “driver”) while the rest of the team tells them what to type (the “navigators”). The driver handles the keyboard. The navigators handle the thinking.
Charlotte’s twist: the LLM is the driver. It does all the typing. The whole team navigates.
“Think about it,” she says. “In traditional mob programming, the bottleneck was typing speed. One person at the keyboard, everyone else waiting. That’s why a lot of teams gave up on it – it felt slow. But if the LLM types, there’s no bottleneck at the keyboard. The only bottleneck is how fast the team can think. Which is the whole point.”
Tom is sceptical. He sits at the back of the room with his arms crossed, which is a posture Charlotte has learned to read over the past year. He built the feature in two days solo. How long will it take with six people in a room?
“Less time than fixing the three categories of bug we just found in your PR,” Charlotte says. “And at the end, everyone will understand the code.”
Kai is more open to it. “I’ve spent most of the last two days trying to understand Tom’s PR. If I’d been in the room when it was written, I’d already understand it. That’s two days of review saved.”
Ravi agrees: “I’ve been reading ADRs trying to figure out whether the bounded context contracts still hold. If I’d been there when the design decisions were made, I wouldn’t need to reverse-engineer them from the code.”
The pattern is clear. Tom spent two days writing code. Three people spent a collective four days reviewing, understanding, and finding problems. The total cost of the solo approach was six developer-days. If the ensemble approach costs three sessions of ninety minutes with six people – that’s 27 person-hours, roughly three and a half developer-days – and produces code that doesn’t need a second pass, it’s actually cheaper.
The efficiency argument against ensemble programming assumes the code only needs to be written. In reality, code needs to be written, reviewed, understood, and maintained. When you count all four, ensemble programming often comes out ahead.
Setting up the session
Charlotte books a meeting room with a large screen. She invites six people:
- Charlotte – facilitating
- Maya – domain expertise on substitutions, seasonal produce, and customer expectations
- Tom – technical lead, architecture knowledge
- Kai – developer, responsible for the supply matching context
- Ravi – developer, responsible for the customer profile context
- Priya – testing, edge cases, quality
One laptop connected to the screen, running Claude in a code editor. Everyone can see the conversation. Everyone can see the code as it’s generated.
Charlotte sets ground rules:
- One navigator at a time talks to the LLM
- The navigator rotation is every ten minutes – a timer runs on Charlotte’s phone
- When you’re not the navigator, write observations on sticky notes. Raise them at the next rotation
- No side conversations while the navigator is working
- Anyone can call “stop” if they spot something wrong
She also sets the session length: ninety minutes. Not open-ended. They’ll see how far they get.
The first attempt
The first session is awkward. Genuinely awkward.
Maya starts as navigator. She describes the seasonal substitution rules to the LLM. “In Western Australia, pumpkin is in season from March to August. Sweet potato overlaps from April to September. If we’re short on pumpkin, we shouldn’t suggest sweet potato as a substitute during the overlap months because the same weather that affects pumpkin supply will affect sweet potato supply.”
The LLM generates a seasonality model. It’s clean. Maya nods – the domain logic is right.
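On the screen, it looks something like this – a sketch with invented names, but the shape is what matters:

```python
# Illustrative seasonality model (names and data structure invented).
# Growing seasons per (region, produce), as inclusive month ranges.
SEASONS = {
    ("WA", "pumpkin"): (3, 8),        # March to August
    ("WA", "sweet potato"): (4, 9),   # April to September
}

def in_season(region: str, produce: str, month: int) -> bool:
    start, end = SEASONS.get((region, produce), (1, 0))  # default: never
    return start <= month <= end

def co_seasonal(region: str, unavailable: str, candidate: str, month: int) -> bool:
    """Maya's rule: if both items are in season in the same region this
    month, the same weather hits both supplies - exclude the candidate."""
    return in_season(region, unavailable, month) and in_season(region, candidate, month)

# July: both in season in WA, so sweet potato is ruled out as a
# pumpkin substitute - the exact bug Maya spotted in Tom's solo PR.
assert co_seasonal("WA", "pumpkin", "sweet potato", 7)
```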
Tom takes over. He looks at the generated code and says, “This needs to integrate with the existing substitution pipeline. The current architecture passes a list of unavailable items and gets back a ranked list of substitutes.” He instructs the LLM to restructure the seasonal model as a filter in the existing pipeline.
Then Kai jumps in before his rotation. “Wait – that pipeline is in the supply matching context. The seasonal data lives in the product catalogue context. You’re crossing a boundary.”
Charlotte holds up a hand. “That’s important. Write it on a sticky note. You’re next.”
Kai writes it down, visibly frustrated. When his rotation comes, he explains the bounded context issue to the LLM and asks it to generate the seasonal filter as a separate module that communicates through the existing context boundary. The LLM restructures. Ravi, watching, puts up a sticky note: “The contract between these contexts needs updating – add it to the ADR.”
This is the moment where the ensemble starts to justify itself. In Tom’s solo session, the bounded context violation would have gone unnoticed until code review, when Kai would have left a comment, Tom would have restructured, and Kai would have re-reviewed. Two days of asynchronous back-and-forth compressed into thirty seconds of real-time conversation.
Priya’s turn. She’s been accumulating sticky notes. She has four of them, each written in her small, precise handwriting. She picks the most important one. “What about a customer who’s allergic to nuts and the best seasonal substitute is a nut? Where does allergen filtering happen relative to seasonal filtering?” She instructs the LLM to generate both the allergen filter and the seasonal filter, and to compose them in the correct order. The LLM produces a pipeline:
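```python
# A sketch of the generated pipeline (names illustrative). The filters
# arrive as predicates; note the composition order - seasonal runs
# before allergen, which is the detail Maya challenges next.
def rank_substitutes(candidates, seasonal_ok, allergen_safe, score):
    candidates = [c for c in candidates if seasonal_ok(c)]    # seasonal first
    candidates = [c for c in candidates if allergen_safe(c)]  # allergens second
    return sorted(candidates, key=score, reverse=True)
```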
Maya looks at the pipeline and says, “The order is wrong. Allergen filtering should happen before seasonal filtering. If we filter by season first, we might eliminate a safe substitute and then the allergen filter has fewer options to work with. Filter the dangerous stuff first, then rank what’s left by seasonality.”
The LLM swaps the order. Everyone nods. That’s a domain insight that wouldn’t have surfaced in Tom’s solo session because Tom doesn’t think about allergens the way Maya does. And it wouldn’t have surfaced in Maya’s head alone because she doesn’t think about filter ordering the way a developer does. It took both perspectives in the same room at the same moment.
Finding the rhythm
The first session produces about 400 lines of code. Less than Tom’s solo effort in raw volume. But every line has been seen by six pairs of eyes as it was written. The domain logic is correct because Maya was there. The architecture respects bounded contexts because Kai and Ravi were there. The edge cases are covered because Priya was there. The pipeline design is clean because Tom was there.
The second session, two days later, is smoother. The team has learned the rhythm. The navigator rotation feels natural. People write better sticky notes – more specific, more actionable. Charlotte intervenes less.
Something else improves: the team learns to break the problem into chunks that fit a ten-minute rotation. In the first session, navigators tried to accomplish too much in their slot. Maya would start describing a complex seasonal rule, get halfway through, and have to hand off to Tom with the thought incomplete. By the third session, navigators naturally scope their instructions to produce a complete, testable piece of code within their ten minutes.
Charlotte notices that the sticky notes have evolved too. Early sessions produced sticky notes like “this feels wrong” and “check the edge case.” By session three, they look like “the preference ranking doesn’t account for frequency weighting – a customer who rejected a substitute once shouldn’t be treated the same as one who rejected it five times.” The team’s ability to articulate concerns has sharpened because they’re practising it every session.
Something else shifts around the same time: the LLM starts getting better prompts. Not because the LLM has improved, but because the navigators have learned what makes a good instruction. Maya stops saying “handle the case where two items are both in season” and starts saying “add a co-seasonality check: if the unavailable item and the candidate substitute share the same growing season in the same region, exclude the candidate.” The more precise the instruction, the better the generated code.
Tom notices it too. “The LLM is like a very fast, very literal junior developer. If you give it vague instructions, you get plausible-looking code that’s subtly wrong. If you give it precise instructions, you get exactly what you asked for. Having six people in the room means the instructions are always precise, because someone catches the ambiguity before it reaches the prompt.”
There’s a deeper effect here that takes a few weeks to become visible. The navigators start thinking in terms of instructions rather than implementations. Instead of mentally writing the code and then asking the LLM to type it, they describe the behaviour they want. “Generate a function that takes a list of candidate substitutes and a customer’s allergen profile, and returns only the candidates that are safe for that customer.” That’s a behaviour description, not a code description. It gives the LLM freedom to choose the implementation while constraining the outcome.
This distinction matters. When you describe behaviour, the LLM sometimes chooses an implementation you wouldn’t have thought of – a different data structure, a more elegant algorithm, a simpler approach. When you describe code (“write a for loop that checks each candidate against each allergen”), you’ve already made the design decision and the LLM is just a typist. The ensemble naturally gravitates toward behaviour descriptions because the navigators are a mix of technical and non-technical people. Maya can’t describe code, but she can describe behaviour. Her instructions are often the cleanest because they’re the most abstract.
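The behaviour description quoted above, for example, might come back as something this small – a sketch, with the types invented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    name: str
    allergens: frozenset = frozenset()

def safe_substitutes(candidates: list[Item], allergen_profile: set[str]) -> list[Item]:
    """Return only the candidates that contain none of the customer's
    allergens. The behaviour is fixed; the implementation (here a list
    comprehension) was the generator's choice."""
    return [c for c in candidates if not (c.allergens & allergen_profile)]
```

Nothing in the instruction said “for loop” or “comprehension” – that decision was left to the LLM.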
What the ensemble changes
The traditional argument against mob programming was always efficiency. One keyboard, six people, one-sixth the throughput. Managers looked at the maths and flinched.
The LLM changes that equation completely.
In a solo session, the developer is doing two jobs: thinking about what to build and typing the code. The thinking is the hard part. The typing is the bottleneck. Most of the developer’s cognitive load goes to translation – turning a mental model into syntax.
In an ensemble session with an LLM, nobody is typing code. The LLM handles translation. The entire team’s cognitive load goes to thinking: What should this do? What are the edge cases? Does this fit the architecture? Is the domain logic right?
| Solo with an LLM | Ensemble with an LLM |
|---|---|
| One developer thinks | Whole team thinks |
| LLM types | LLM types |
| Others review later | Everyone sees it live |
| Bugs found in review | Issues caught immediately |
The output of the GreenBox ensemble sessions was measurably better than Tom’s solo work. Not because Tom is a worse developer – he’s excellent. But because the substitution engine sits at the intersection of domain knowledge (Maya), architectural boundaries (Kai, Ravi), edge case awareness (Priya), and technical design (Tom). No single person holds all of that context. The ensemble puts all the context in the room at the same time.
And the code didn’t need a second pass. The PR from the ensemble session had zero review comments. Not because people were being polite – because every concern had already been raised during the session. The review happened live, as the code was being generated.
When it doesn’t work
Ensemble programming isn’t for everything. The team learned this the hard way.
Anika, who leads the Melbourne operations squad, tried running an ensemble session for a routine bug fix – a timezone conversion error in the delivery scheduling system. Four people sat in a room for thirty minutes while one person navigated the LLM through a three-line fix.
“That was a waste of everyone’s time,” Anika said afterwards. “I could have fixed that in five minutes.”
She’s right. Ensemble programming has a setup cost – getting people in a room, establishing context, running rotations. That cost is worth paying when the problem is complex, when multiple knowledge domains intersect, or when the team needs to build shared understanding. It’s not worth paying for a bug that one person can diagnose and fix in a single session.
Charlotte draws a line:
Use ensemble programming when:
- The problem crosses bounded contexts or knowledge domains
- Multiple people’s expertise is needed simultaneously (not sequentially)
- The code will be critical infrastructure that everyone needs to understand
- You’re onboarding new team members to a complex area
- The solo approach has already failed (like Tom’s substitution engine PR)
Don’t use ensemble programming when:
- The problem is well-understood and lives in one person’s domain
- It’s a routine bug fix with a known root cause
- The code is throwaway or experimental
- The team is small enough that a quick conversation covers the knowledge sharing
The Perth squad settles into a rhythm. Complex features get ensemble sessions. Routine work gets solo development with standard code review. The split ends up being roughly 30/70 – about a third of their development time is ensemble work, the rest is solo.
Tom, who was the most sceptical about ensemble programming, becomes one of its best navigators. He has the deepest technical knowledge and his instructions to the LLM are surgically precise. But he also learns something about his own blind spots. “When I work alone,” he tells Charlotte one afternoon, leaning against the doorframe of her office, “I optimise for elegance. I want the code to be clever. In the ensemble, I can see that cleverness is a tax on everyone else’s understanding. Maya doesn’t care about clever code. She cares about code that does the right thing. Having her in the room makes me write simpler instructions, and simpler instructions produce simpler code.”
He pauses. “Sarah – my wife – she told me something once. She said I love making things, but I hate letting anyone help me make them. She said I’m like my dad.” He looks at Charlotte. “My dad builds houses. He’s good at it. But every subcontractor he’s ever worked with has a story about Marco Russo standing over their shoulder telling them they’re doing it wrong.”
Charlotte waits.
“I don’t want to be that person. The ensemble is… it’s the opposite of that. It’s me trusting that the room is smarter than I am. Which it is. It always was. I just couldn’t see it when I was the one typing.”
The knowledge transfer effect
Two months after adopting ensemble programming, something unexpected happens. The Perth squad hires two new developers. Instead of the usual onboarding – reading documentation, pairing with a senior developer for a week, then fumbling through their first stories alone – Charlotte puts them straight into ensemble sessions.
The new developers don’t navigate for the first few sessions. They observe. They write sticky notes. They ask questions during rotations. Within a week, they’ve seen how the substitution engine works, how the bounded contexts interact, how Maya thinks about seasonal produce, and how the team structures prompts for the LLM.
Kai, who onboarded the traditional way during the scaling phase, is impressed. “It took me three months to understand the domain well enough to contribute confidently. These two are raising useful concerns in their second week.”
The ensemble sessions are an accidental knowledge base. Everything the team knows about the domain is demonstrated live, in context, with the reasoning visible. It’s not documented in a wiki that nobody reads. It’s not captured in an onboarding guide that’s six months out of date. It’s happening in front of you, in real time, and you can ask questions.
The navigator protocol
Charlotte refines the ensemble process over several iterations. The biggest improvement is what she calls the “navigator protocol” – a structured way to hand off between navigators that avoids the chaos of the first session.
When the timer goes off, the current navigator:
- Summarises where they are – what the last instruction to the LLM was and what the generated code does
- Reads their remaining sticky notes aloud – things they wanted to address but didn’t get to
- Hands off to the next navigator, who picks up either from the summary or from one of the sticky notes
This takes about sixty seconds per rotation. It sounds like overhead. It’s not. Without it, the new navigator spends their first three minutes figuring out what just happened, and the rest of the team loses context while they catch up. The sixty-second handoff keeps everyone synchronised.
The other refinement is the “stop” protocol. Anyone can say “stop” at any time if they spot something wrong. When someone calls stop, the current navigator pauses, the person who called stop explains the concern, and the team decides whether to address it now or park it on a sticky note for later. Most stops take less than a minute. A few trigger important conversations that reshape the approach.
Ravi calls stop during one session when the LLM generates a function that fetches customer allergen data from the database inside a loop. “That’s an N+1 query. It’ll be fine with 50 customers but it’ll kill the database at scale.” The navigator restructures the prompt to batch the query. Thirty seconds. In a solo session, that N+1 might not have been caught until a performance incident in production.
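The shape of that restructure, roughly – the data-access calls here are invented stand-ins for whatever the real layer exposes:

```python
# Before: allergen lookup inside the loop - one query per customer (N+1).
def profiles_one_by_one(db, customer_ids):
    return {cid: db.fetch_allergens(cid) for cid in customer_ids}

# After: one batched query for the whole delivery run.
def profiles_batched(db, customer_ids):
    return db.fetch_allergens_batch(customer_ids)
```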
Maya calls stop during a different session when the LLM generates substitution logic that treats all box sizes the same. “Small boxes have five items, large boxes have twelve. The substitution constraints are different – in a small box, one bad substitution ruins a fifth of the box. In a large box, it’s less than ten per cent. The weighting needs to be proportional.” Nobody else in the room had considered this. The LLM certainly hadn’t. One sentence from Maya saves a week of customer complaints.
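In code terms, Maya’s rule is tiny – a sketch, with the name invented:

```python
# The cost of one bad substitution scales with the share of the box it
# ruins: 1/5 = 20% for a small box, 1/12 = roughly 8% for a large one.
def substitution_risk_weight(box_size: int) -> float:
    return 1.0 / box_size
```

One line of arithmetic; the hard part was knowing it mattered.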
The cross-squad ensemble
When Melbourne needs to implement their own seasonal rules – different climate, different farms, different growing seasons – Charlotte runs a cross-squad ensemble. Three people from Perth (Tom, Maya, Priya), two from Melbourne (Anika, a developer named Liam who’ll own the Melbourne rules), and Kai joining remotely.
The remote participant is the hardest part. Kai can see the screen via screenshare, but he can’t read the sticky notes. Charlotte adapts: sticky notes get read aloud when they’re raised, and the chat becomes a parallel sticky-note channel for remote participants.
The session is rougher than the co-located ones. The audio lag means Kai occasionally talks over someone. The screen share doesn’t capture the room’s energy. But the core mechanism still works: multiple knowledge domains converging on the same code at the same time.
Liam, who has never seen the substitution engine’s internals, asks questions that the Perth team would never think to ask. “Why is the preference ranking a simple average? Shouldn’t recent preferences count more than old ones?” Tom pauses. “I… hadn’t thought about that. The averaging was what the LLM generated and it seemed reasonable.” Maya agrees with Liam: a customer who hated capsicum last week should be weighted more heavily than one who disliked it six months ago. The team adds exponential decay to the preference model.
A fresh pair of eyes found an improvement that the original team had normalised. That’s the knowledge transfer effect in reverse – it doesn’t just flow from experienced to new. New perspectives reveal assumptions that experienced people have stopped questioning.
The broader change
Ensemble programming with LLMs changes the team’s relationship with code generation in a way Charlotte didn’t expect.
Before the ensemble sessions, developers treated LLM-generated code as their own. Tom prompted the LLM, the code came back, Tom reviewed it, Tom pushed it. The code felt like Tom’s work, and reviewing it felt like reviewing Tom’s work. That created a subtle dynamic where challenging the code felt personal.
In ensemble sessions, nobody owns the code. The LLM generated it based on the team’s collective instructions. Challenging it doesn’t feel like challenging a colleague – it feels like reviewing a draft. The team is more honest, more critical, and more willing to say “that’s wrong, regenerate it.”
Maya noticed it first. “In the old pull request reviews, I’d sometimes spot a domain issue but not mention it because the developer had clearly spent two days on the code and I didn’t want to be the person who sent them back to the drawing board. In the ensemble, I say it immediately because there’s no drawing board to go back to. The LLM just generates a new version.”
The emotional cost of rework drops to near zero when the rework is done by an LLM in real time. That changes the team’s willingness to iterate, to challenge, to get things right.
There’s a second-order effect too. Because rework is cheap, the team experiments more. In a traditional code review, suggesting “what if we tried a completely different approach?” is a big ask – it means the developer throws away their work and starts over. In an ensemble session, it’s a ten-second conversation followed by the LLM generating a new version. The team can compare two approaches side by side in the time it would normally take to write one.
Charlotte sees this happen when Liam’s question about preference averaging comes up. Tom’s original navigator turn had produced the simple averaging approach. Instead of debating exponential decay against it in the abstract, the LLM generates both. The team compares the outputs against a set of test cases Maya provides – real substitution scenarios from the last three months. The exponential decay version handles the “customer changed their mind” case noticeably better. Decision made, with evidence, in ten minutes.
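Side by side, the two candidates look something like this – a sketch, with the half-life constant invented rather than a GreenBox decision:

```python
def average_rejection_score(rejections: list[float]) -> float:
    """Simple average: a rejection six months ago counts exactly as
    much as one from last week."""
    return sum(rejections) / len(rejections) if rejections else 0.0

def decayed_rejection_score(rejections: list[tuple[float, float]],
                            half_life_days: float = 30.0) -> float:
    """Exponential decay: each (score, age_in_days) entry is weighted
    by 2 ** (-age / half_life), so recent rejections dominate."""
    if not rejections:
        return 0.0
    weights = [2.0 ** (-age / half_life_days) for _, age in rejections]
    return sum(s * w for (s, _), w in zip(rejections, weights)) / sum(weights)
```

The half-life is the knob: thirty days here is a placeholder, not a decision.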
In the old world – Tom building alone – that comparison would never have happened. Tom would have picked one approach, implemented it, shipped it, and defended it in code review because he’d invested time in it. The sunk cost fallacy is real, and it’s strongest when the cost is your own labour. When the labour is the LLM’s, there’s no sunk cost to defend.
When to use ensemble programming: a summary
After three months of practice, Charlotte documents the team’s guidelines:
| Situation | Approach | Why |
|---|---|---|
| New feature crossing bounded contexts | Ensemble | Multiple knowledge domains needed simultaneously |
| Complex domain logic (substitutions, pricing) | Ensemble | Domain expert + technical expert need to co-create |
| Onboarding to a complex codebase | Ensemble (new person observes) | Fastest way to absorb context |
| Routine bug fix | Solo + review | One person can handle it |
| Experimental prototype | Solo | Speed over understanding |
| Feature another squad will maintain | Ensemble with that squad | Build understanding before handoff |
The key insight isn’t that ensemble programming is always better. It’s that the LLM changes when it’s worth doing. The setup cost – getting people in a room – hasn’t changed. But the output per session has gone up dramatically because the bottleneck has moved from typing to thinking, and thinking scales with the number of people in the room.
And that’s the real insight. Ensemble programming with an LLM isn’t primarily a coding technique. It’s a thinking technique. The LLM removes the mechanical bottleneck of typing, which means the entire team can focus on the thing that actually matters: making sure the code does the right thing.
The substitution engine ships two weeks after the first ensemble session. It handles seasonal rules, allergen combinations, and customer preference learning. Every developer in the Perth squad understands how it works. The Melbourne squad, who’ll need to maintain it when they implement Melbourne-specific produce rules, sat in on the final session.
Maya reviews the production output after the first week. The substitution quality is noticeably better. Fewer complaints about odd combinations. No repeats of the sweet-potato-for-pumpkin mistake that would have shipped in Tom’s original version.
“The LLM wrote the code,” she says. “But the team wrote the thinking.”
The ensemble sessions are working. So are the Example Mapping sessions. So is every other discovery technique the teams have learned over the past year. The problem is that the squads are now using all of them for everything – including stories where everyone already knows the answer. Workshop fatigue is setting in, and people are spending more energy going through the motions than doing the work that actually needs deep thinking.
Charlotte reaches for a framework that tells the team which approach to use and when to skip discovery altogether. That’s Cynefin (coming 6 October).