LLMs as Thinking Partners: How the Role Evolved

November 24, 2026 · 6 min read

Part of The GreenBox Story — a standalone reference for the full series.

The GreenBox team started by using LLMs to generate code fast. They ended by using LLMs to help them think. The shift wasn’t about better prompts or newer models — it was about better inputs. Discovery techniques produce structured understanding. Structured understanding produces useful LLM output. This post traces that evolution.

The evolution at a glance

| Stage | GreenBox moment | LLM role | What worked | What didn’t | Post |
| --- | --- | --- | --- | --- | --- |
| Code generator | Weeks 1–4, building the wrong thing | “Write me a subscription system” | Fast output, clean code | Built on wrong assumptions, amplified misunderstanding | The Wrong Kind of Fast |
| Implementation partner | BDD/Gherkin, turning examples into code | Generate code from concrete specs | Accurate when given precise examples | Still needs human-written specs | From Stories to Working Software |
| Sprint planning assistant | First sprints, task breakdown | Break down stories, draft acceptance criteria | Speeds up planning | Can’t assess gut-feel sizing | The First Sprints |
| Research/synthesis tool | JTBD interviews, assumption mapping | Transcribe interviews, spot patterns, synthesise data | Catches themes humans miss across many interviews | Misses cultural context, local nuance | Jobs to Be Done, Assumption Mapping |
| Board presentation drafter | Roadmapping, board decks | Draft presentations from data | Fast first draft | Needs heavy editing for narrative and nuance | What Changes First |
| Code generator from domain models | Decision tables, bounded contexts | Generate code from formal tables, generate within context boundaries | Comprehensive, consistent, testable | Needs precise domain models as input | Decision Tables, Domain-Driven Design |
| ADR drafter | Architecture decisions | Draft ADRs from conversation context | Gets written instead of deferred | Misses nuance, overstates certainty | Architecture Decision Records |
| Ensemble tool | Ensemble programming | Types while team navigates | Removes mechanical bottleneck, team focuses on thinking | Solo use misses cross-domain concerns | Ensemble Programming |
| First-pass threat modeller | Threat modelling / STRIDE | Systematic STRIDE enumeration | Covers ~70% of threats, doesn’t get tired | Misses context-specific threats, cultural factors | Threat Modelling |
| Discovery infrastructure | Continuous discovery | Transcription, synthesis, drafting across all practices | Embedded in every part of the weekly cadence | Never replaces the human judgment about what matters | Continuous Discovery |

Phase 1: “Give me the code” (Series 1)

Tom’s week-one approach: describe the feature, let the LLM write the code, ship it. The subscription system was clean, well-structured, and wrong. It handled billing before the team understood what customers were actually subscribing to. The LLM didn’t cause the mistake — it amplified it. Vague understanding in, plausible-looking wrong code out. (The Wrong Kind of Fast)

The shift came with Example Mapping. Once the team had concrete examples — “Given a customer in Melbourne, when they subscribe to a weekly veggie box, then delivery is every Thursday” — the LLM stopped guessing. Gherkin features from Example Map cards became precise prompts. The same LLM that built the wrong thing now built the right thing, because the input changed. (From Stories to Working Software)
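The Melbourne example from the team’s Example Map, written out as a Gherkin scenario (the feature name here is illustrative), shows how little room a prompt like this leaves for guessing:

```gherkin
Feature: Weekly veggie box subscription

  Scenario: Melbourne customer subscribes to a weekly box
    Given a customer in Melbourne
    When they subscribe to a weekly veggie box
    Then delivery is every Thursday
```

Each card on the Example Map became one scenario like this; the set of scenarios, not a prose description, was the prompt.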

The lesson: the LLM amplifies whatever understanding you give it — correct or incorrect, with equal confidence.

Phase 2: “Help me understand the data” (Series 2)

Twenty JTBD interviews produce thousands of words of transcript. The LLM excels here: transcription, pattern-spotting across interviews, clustering themes. It found the “Tuesday anxiety” pattern — customers checking on Monday whether their Thursday box would arrive — across seven separate interviews conducted by three different interviewers. No single interviewer saw the pattern. The LLM did. (Jobs to Be Done)


But it missed context. Regional farming culture, the difference between how a Yarra Valley grower talks about produce and how a Mornington Peninsula grower does — these distinctions matter for partnership strategy, and the LLM flattened them into generic summaries. Assumption Mapping helped the team flag where the LLM’s synthesis needed human correction.

For the board presentation, the LLM drafted a clean deck from the roadmap data. Maya rewrote most of it. The data was right but the story was wrong — the LLM presented facts; Maya needed to present a narrative the board could act on. (What Changes First)

The lesson: LLMs are strong at synthesis across volume. They’re weak at judgment, narrative, and cultural nuance.

Phase 3: “Generate from the model” (Series 3)

This is where LLM usage matured. Decision Tables are formal, complete, unambiguous — exactly the kind of input LLMs handle well. Every condition combination, every outcome, explicitly stated. The LLM generated comprehensive test suites and implementation code from these tables with near-zero defects. The same was true for code generation scoped to bounded contexts — when the LLM knows the boundaries and the ubiquitous language, it stays within them.
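A minimal Python sketch of the decision-table idea (the conditions and fees here are illustrative, not GreenBox’s actual pricing rules): because the table explicitly lists every combination of conditions, both the implementation and an exhaustive test suite fall straight out of it.

```python
# Sketch: driving implementation and tests from a decision table.
# Conditions and outcomes are invented for illustration.
from itertools import product

# Conditions: is the customer a subscriber? Is the order $50 or more?
# Outcome: delivery fee in dollars. Every combination is listed,
# so there are no implicit cases for an LLM (or a human) to guess at.
DECISION_TABLE = {
    # (is_subscriber, order_over_50): fee
    (True, True): 0,
    (True, False): 5,
    (False, True): 5,
    (False, False): 10,
}

def delivery_fee(is_subscriber: bool, order_over_50: bool) -> int:
    """Implementation derived directly from the table above."""
    return DECISION_TABLE[(is_subscriber, order_over_50)]

# The "generated test suite" is just a walk over every row of the table:
# coverage is exhaustive by construction.
for combo in product([True, False], repeat=2):
    assert delivery_fee(*combo) == DECISION_TABLE[combo]
```

The point is the shape of the input, not this particular code: a formal table with every row stated leaves the LLM nothing to invent.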

ADRs revealed a different benefit. The team had been deferring documentation because writing is slow. The LLM drafted ADRs from conversation transcripts — not perfect, but good enough that the team edited rather than wrote from scratch. Decisions that would have gone unrecorded got captured. The risk: the LLM overstated certainty and understated trade-offs, so every draft needed human review for hedging and nuance.

The lesson: formal, structured inputs produce the best LLM outputs. The shift from “write me code” to “implement this specification” is the difference between Phase 1 and Phase 3.

Phase 4: “Think with us” (Series 4)

Ensemble programming changed the relationship entirely. The LLM types while the team navigates. Three or four people debating the right approach, the LLM implementing their decisions in real time. The mechanical bottleneck — someone has to type — disappears. The team focuses on thinking. Solo LLM use produced code that worked but missed cross-domain concerns; ensemble use caught those concerns because multiple perspectives were present.

Threat modelling with STRIDE showed the LLM as systematic first-pass analyst. It enumerated threats at every boundary — spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege — without getting tired or bored. It covered roughly 70% of what the team found important. The remaining 30% required human knowledge of the specific deployment context, customer behaviour patterns, and regulatory environment.
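The mechanical part of that first pass can be sketched in a few lines of Python (the trust boundaries below are invented for illustration): cross every boundary with every STRIDE category, and you get a checklist that never skips a cell out of boredom.

```python
# Sketch: a systematic STRIDE first pass. Boundary names are
# illustrative, not GreenBox's real architecture.
STRIDE = [
    "Spoofing", "Tampering", "Repudiation",
    "Information disclosure", "Denial of service",
    "Elevation of privilege",
]

def first_pass(boundaries: list[str]) -> list[str]:
    """Cross every trust boundary with every STRIDE category --
    the tireless enumeration an LLM handles well."""
    return [f"{b}: {threat}" for b in boundaries for threat in STRIDE]

boundaries = [
    "browser -> web app",
    "web app -> billing API",
    "web app -> database",
]
threats = first_pass(boundaries)
print(len(threats))  # 3 boundaries x 6 categories = 18 candidate threats
```

Deciding which of those candidates actually matter for this deployment, these customers, and this regulatory environment is the 30% that stayed human.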

By Continuous Discovery, the LLM was infrastructure. Transcription, synthesis, pattern-matching, drafting — woven into the weekly cadence at every step. Not a tool the team reached for occasionally, but a layer underneath every practice. The human role shifted entirely to judgment: what matters, what to act on, what to ignore.

The principle

The LLM’s value is proportional to the quality of the thinking that goes into the prompt. Vague instructions produce plausible-looking wrong code. Concrete examples produce accurate implementations. Formal domain models produce comprehensive code. The discovery techniques in this series aren’t just for humans — they produce the structured understanding that makes LLMs genuinely useful.

The anti-pattern

Using the LLM without discovery. “Give me a subscription system” versus “implement these 12 Example Map scenarios as Gherkin features.” The first produces Tom’s week-one disaster — fast, confident, wrong. The second produces working software. The gap between the two isn’t a better prompt template or a more capable model. It’s the discovery work that happened before anyone opened a terminal.

  • The GreenBox Cheat Sheet — every discovery and delivery technique in one place
  • The Planning Onion — every planning layer in one place
  • The GreenBox Story — the full series from first idea to scaled operation
Questions or thoughts? Get in touch.