LLMs as Thinking Partners: How the Role Evolved

November 24, 2026 · 6 min read

Part of The GreenBox Story — a standalone reference for the full series.

The GreenBox team started by using LLMs to generate code fast. They ended by using LLMs to help them think. The shift wasn’t about better prompts or newer models — it was about better inputs. Discovery techniques produce structured understanding. Structured understanding produces useful LLM output. This post traces that evolution.

The evolution at a glance

| Stage | GreenBox moment | LLM role | What worked | What didn’t | Post |
| --- | --- | --- | --- | --- | --- |
| Code generator | Weeks 1–4, building the wrong thing | “Write me a subscription system” | Fast output, clean code | Built on wrong assumptions, amplified misunderstanding | The Wrong Kind of Fast |
| Implementation partner | BDD/Gherkin, turning examples into code | Generate code from concrete specs | Accurate when given precise examples | Still needs human-written specs | From Stories to Working Software |
| Sprint planning assistant | First sprints, task breakdown | Break down stories, draft acceptance criteria | Speeds up planning | Can’t assess gut-feel sizing | The First Sprints |
| Research/synthesis tool | JTBD interviews, assumption mapping | Transcribe interviews, spot patterns, synthesise data | Catches themes humans miss across many interviews | Misses cultural context, local nuance | Jobs to Be Done, Assumption Mapping |
| Board presentation drafter | Roadmapping, board decks | Draft presentations from data | Fast first draft | Needs heavy editing for narrative and nuance | What Changes First |
| Code generator from domain models | Decision tables, bounded contexts | Generate code from formal tables, generate within context boundaries | Comprehensive, consistent, testable | Needs precise domain models as input | Decision Tables, Domain-Driven Design |
| ADR drafter | Architecture decisions | Draft ADRs from conversation context | Gets written instead of deferred | Misses nuance, overstates certainty | Architecture Decision Records |
| Ensemble tool | Ensemble programming | Types while team navigates | Removes mechanical bottleneck, team focuses on thinking | Solo use misses cross-domain concerns | Ensemble Programming |
| First-pass threat modeller | Threat modelling / STRIDE | Systematic STRIDE enumeration | Covers ~70% of threats, doesn’t get tired | Misses context-specific threats, cultural factors | Threat Modelling |
| Discovery infrastructure | Continuous discovery | Transcription, synthesis, drafting across all practices | Embedded in every part of the weekly cadence | Never replaces the human judgment about what matters | Continuous Discovery |

Phase 1: “Give me the code” (Series 1)

Tom’s week-one approach: describe the feature, let the LLM write the code, ship it. The subscription system was clean, well-structured, and wrong. It handled billing before the team understood what customers were actually subscribing to. The LLM didn’t cause the mistake — it amplified it. Vague understanding in, plausible-looking wrong code out. (The Wrong Kind of Fast)

The shift came with Example Mapping. Once the team had concrete examples — “Given a customer in Melbourne, when they subscribe to a weekly veggie box, then delivery is every Thursday” — the LLM stopped guessing. Gherkin features from Example Map cards became precise prompts. The same LLM that built the wrong thing now built the right thing, because the input changed. (From Stories to Working Software)
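The Melbourne example from the team’s Example Map, written out as a Gherkin scenario (the feature name here is illustrative), shows how little room a prompt like this leaves for guessing:

```gherkin
Feature: Weekly veggie box subscription

  Scenario: Melbourne customer subscribes to a weekly box
    Given a customer in Melbourne
    When they subscribe to a weekly veggie box
    Then delivery is every Thursday
```

Each card on the Example Map became one scenario like this; the set of scenarios, not a prose description, was the prompt.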

The lesson: the LLM amplifies whatever understanding you give it — correct or incorrect, with equal confidence.

Phase 2: “Help me understand the data” (Series 2)

Twenty JTBD interviews produce thousands of words of transcript. The LLM excels here: transcription, pattern-spotting across interviews, clustering themes. It found the “Tuesday anxiety” pattern — customers checking on Monday whether their Thursday box would arrive — across seven separate interviews conducted by three different interviewers. No single interviewer saw the pattern. The LLM did. (Jobs to Be Done)


But it missed context. Regional farming culture, the difference between how a Yarra Valley grower talks about produce and how a Mornington Peninsula grower does — these distinctions matter for partnership strategy, and the LLM flattened them into generic summaries. Assumption Mapping helped the team flag where the LLM’s synthesis needed human correction.

For the board presentation, the LLM drafted a clean deck from the roadmap data. Maya rewrote most of it. The data was right but the story was wrong — the LLM presented facts; Maya needed to present a narrative the board could act on. (What Changes First)

The lesson: LLMs are strong at synthesis across volume. They’re weak at judgment, narrative, and cultural nuance.

Phase 3: “Generate from the model” (Series 3)

This is where LLM usage matured. Decision Tables are formal, complete, unambiguous — exactly the kind of input LLMs handle well. Every condition combination, every outcome, explicitly stated. The LLM generated comprehensive test suites and implementation code from these tables with near-zero defects. The same was true for code generation scoped to bounded contexts — when the LLM knows the boundaries and the ubiquitous language, it stays within them.
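A minimal Python sketch of the decision-table idea (the conditions and fees here are illustrative, not GreenBox’s actual pricing rules): because the table explicitly lists every combination of conditions, both the implementation and an exhaustive test suite fall straight out of it.

```python
# Sketch: driving implementation and tests from a decision table.
# Conditions and outcomes are invented for illustration.
from itertools import product

# Conditions: is the customer a subscriber? Is the order $50 or more?
# Outcome: delivery fee in dollars. Every combination is listed,
# so there are no implicit cases for an LLM (or a human) to guess at.
DECISION_TABLE = {
    # (is_subscriber, order_over_50): fee
    (True, True): 0,
    (True, False): 5,
    (False, True): 5,
    (False, False): 10,
}

def delivery_fee(is_subscriber: bool, order_over_50: bool) -> int:
    """Implementation derived directly from the table above."""
    return DECISION_TABLE[(is_subscriber, order_over_50)]

# The "generated test suite" is just a walk over every row of the table:
# coverage is exhaustive by construction.
for combo in product([True, False], repeat=2):
    assert delivery_fee(*combo) == DECISION_TABLE[combo]
```

The point is the shape of the input, not this particular code: a formal table with every row stated leaves the LLM nothing to invent.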

ADRs revealed a different benefit. The team had been deferring documentation because writing is slow. The LLM drafted ADRs from conversation transcripts — not perfect, but good enough that the team edited rather than wrote from scratch. Decisions that would have gone unrecorded got captured. The risk: the LLM overstated certainty and understated trade-offs, so every draft needed human review for hedging and nuance.

The lesson: formal, structured inputs produce the best LLM outputs. The shift from “write me code” to “implement this specification” is the difference between Phase 1 and Phase 3.

Phase 4: “Think with us” (Series 4)

Ensemble programming changed the relationship entirely. The LLM types while the team navigates. Three or four people debating the right approach, the LLM implementing their decisions in real time. The mechanical bottleneck — someone has to type — disappears. The team focuses on thinking. Solo LLM use produced code that worked but missed cross-domain concerns; ensemble use caught those concerns because multiple perspectives were present.

Threat modelling with STRIDE showed the LLM as systematic first-pass analyst. It enumerated threats at every boundary — spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege — without getting tired or bored. It covered roughly 70% of what the team found important. The remaining 30% required human knowledge of the specific deployment context, customer behaviour patterns, and regulatory environment.
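The mechanical part of that first pass can be sketched in a few lines of Python (the trust boundaries below are invented for illustration): cross every boundary with every STRIDE category, and you get a checklist that never skips a cell out of boredom.

```python
# Sketch: a systematic STRIDE first pass. Boundary names are
# illustrative, not GreenBox's real architecture.
STRIDE = [
    "Spoofing", "Tampering", "Repudiation",
    "Information disclosure", "Denial of service",
    "Elevation of privilege",
]

def first_pass(boundaries: list[str]) -> list[str]:
    """Cross every trust boundary with every STRIDE category --
    the tireless enumeration an LLM handles well."""
    return [f"{b}: {threat}" for b in boundaries for threat in STRIDE]

boundaries = [
    "browser -> web app",
    "web app -> billing API",
    "web app -> database",
]
threats = first_pass(boundaries)
print(len(threats))  # 3 boundaries x 6 categories = 18 candidate threats
```

Deciding which of those candidates actually matter for this deployment, these customers, and this regulatory environment is the 30% that stayed human.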

By Continuous Discovery, the LLM was infrastructure. Transcription, synthesis, pattern-matching, drafting — woven into the weekly cadence at every step. Not a tool the team reached for occasionally, but a layer underneath every practice. The human role shifted entirely to judgment: what matters, what to act on, what to ignore.

The principle

The LLM’s value is proportional to the quality of the thinking that goes into the prompt. Vague instructions produce plausible-looking wrong code. Concrete examples produce accurate implementations. Formal domain models produce comprehensive code. The discovery techniques in this series aren’t just for humans — they produce the structured understanding that makes LLMs genuinely useful.

The anti-pattern

Using the LLM without discovery. “Give me a subscription system” versus “implement these 12 Example Map scenarios as Gherkin features.” The first produces Tom’s week-one disaster — fast, confident, wrong. The second produces working software. The gap between the two isn’t a better prompt template or a more capable model. It’s the discovery work that happened before anyone opened a terminal.

  • The GreenBox Cheat Sheet — every discovery and delivery technique in one place
  • The Planning Onion — every planning layer in one place
  • The GreenBox Story — the full series from first idea to scaled operation
Questions or thoughts? Get in touch.