The Workshop: AI Use Case Envisioning

AI Use Case Envisioning takes a vague mandate (“do something with AI”) and turns it into a scored shortlist: a handful of use cases worth piloting, each pinned to a capability and an autonomy level, plus the ideas you’ve deliberately decided not to build with a model. The grid is the forcing function; the autonomy ladder is what keeps the dangerous ones honest.

AI Use Case Envisioning

AI Use Case Envisioning is a structured session that generates candidate uses of AI across a business, filters them on value, feasibility, data readiness, and the cost of being wrong, and lands on a small portfolio of bets worth piloting. Sometimes called AI opportunity mapping, an AI discovery workshop, or a use-case canvas. It borrows the divergent generation of Impact Mapping, the risk grid of Assumption Mapping, and a gate that most enthusiasm skips: is this even an AI-shaped problem? The output is not a model choice and not a backlog of prompts. It is a ranked handful of opportunities, each tagged with what kind of AI it needs and how much it’s allowed to do on its own.

The session exists because the failure it prevents is so common. A leadership team decides the company needs AI. A hackathon produces twelve demos. Six months later there’s a chatbot nobody trusts, a “summariser” wired to the riskiest workflow in the building, and a backlog of half-built ideas with no owner. Nobody asked, up front, which problems were worth a model, which were worth a rule, and which were too costly to get wrong. Envisioning asks those questions first, on a wall, before anyone writes a prompt.

At a glance

Who, for how long: a facilitator, someone who owns the business outcome, two or three people who do the actual work being considered, someone who knows the data, and an engineer who knows what’s buildable. Five to seven people, around half a day.
What you walk out with: a populated value/feasibility grid of candidate use cases, two or three picked pilots each tagged with a capability family and an autonomy level, a parking lot, and an explicit “not an AI problem” list.
When to reach for it: you have a mandate to “use AI” and no shortlist, or a backlog of AI ideas and no way to choose between them. Not for picking a model, designing a single agreed feature, or running a build (those come after).

What’s It For

A property-management agency decides it’s behind on AI. The directors greenlight a fortnight of experiments. The team builds a chatbot that answers tenant questions, a tool that “predicts” which tenants will fall behind on rent, and a feature that auto-replies to maintenance emails. Three months later: the chatbot gives a tenant the wrong notice period and the agency eats the complaint; the arrears “prediction” is a logistic regression that a single overdue-days rule would have beaten; and the auto-reply has been quietly switched off because it once told a tenant a gas leak could wait until Monday.

None of these failed because the technology didn’t work. They failed because nobody asked the questions that come before the build. Which of these is worth a model at all? Which needs a human between the model and the consequence? Where’s the data, and is it any good? What does a wrong answer cost, and who pays it? Each idea was treated as a build task when it was really a portfolio decision.

This is the universal shape. Enthusiasm for AI produces a list of features; what’s missing is a way to weigh them against each other before committing engineers. Envisioning makes the candidates visible, scores them on the axes that actually predict whether an AI project survives contact with production, and forces an early, cheap decision about which two or three to pilot, which to park, and which to solve some other way.

Reach for it when:

You have a mandate to “do something with AI” and no agreed shortlist
You have a pile of AI ideas (a hackathon’s worth, a consultant’s deck) and no way to choose between them
You’re about to fund AI work and want to separate the bets that pay from the ones that look impressive in a demo
You’ve run an Impact Map or a Jobs to be Done round and want to ask which of the deliverables are AI-shaped
A team keeps proposing to “add AI” to workflows and you want a shared way to say yes, not yet, or no

What It’s Not For

Skip it when:

You already know the single use case and need to design or build it (run the build; the envisioning is done)
The question is which model or which service, not which problem (that’s a service-selection exercise; see Picking the AWS AI Service Tier)
The organisation has no appetite to fund any follow-up; envisioning a portfolio nobody will resource is theatre
The problem is plainly deterministic (a calculation, a lookup, a rule) and the only reason “AI” is on the table is fashion

Stop a session that’s already started if:

Every candidate collapses into “a chatbot that does everything,” which means the room is imagining a product, not scoping problems
Nobody in the room knows whether the data exists, so every feasibility score is a guess (adjourn, find the data owner, reconvene)
The session has become a model-architecture debate; that’s a sign the picks are already obvious and you should close and move to build

The session has costs to weigh against the benefits. What you get: a shared, defensible shortlist; a portfolio rather than a pet project; an explicit record of what you chose not to build with AI, which is worth as much as the picks; and a set of pilots scoped tightly enough to learn from in weeks. What it costs: half a day of five to seven people; the discipline to kill seductive ideas; and the follow-through of actually running the pilots, without which the grid is just a wall of optimism.

Definitions & Background

Three lenses do most of the work in this session.

The “is it an AI-shaped problem?” gate. Before scoring a candidate, ask what would solve it. A surprising number of “AI” ideas are a rule, a lookup, a calculation, or a search box, and a model would be a slower, pricier, less predictable version of something deterministic. The gate borrows directly from When Not to Use an LLM: if a wrong answer is unacceptable and the rule is knowable, write the rule. AI earns its place when the input is messy or unstructured, the mapping is fuzzy, and an approximately-right answer (checked, or cheap to be wrong about) beats no answer at all.
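The gate can be sketched as a small decision function. This is a minimal illustration, not a scoring standard: the `Candidate` fields and the third "discuss" outcome are assumptions introduced here to stand in for the questions in the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    rule_is_knowable: bool   # would a rule, lookup, calculation, or search box do the job?
    input_is_messy: bool     # unstructured input or a fuzzy mapping?
    approx_is_useful: bool   # does an approximately-right answer (checked, or
                             # cheap to be wrong about) beat no answer at all?


def gate(c: Candidate) -> str:
    """Return the column a candidate lands in."""
    if c.rule_is_knowable:
        # A model would be a slower, pricier, less predictable version of the rule.
        return "solve it another way"
    if c.input_is_messy and c.approx_is_useful:
        return "AI-shaped"
    # Neither a clear rule nor a clear fit: an assumed third outcome for this sketch.
    return "discuss"


overdue = Candidate("flag overdue rent", rule_is_knowable=True,
                    input_is_messy=False, approx_is_useful=False)
questions = Candidate("answer questions from documents", rule_is_knowable=False,
                      input_is_messy=True, approx_is_useful=True)
print(gate(overdue))    # solve it another way
print(gate(questions))  # AI-shaped
```

In the room the three booleans are argued out loud rather than typed in; the point of writing them down is that the rule check comes first, before anyone gets to discuss how well a model might do.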

The capability families. Every genuine AI candidate falls into one of a small set of shapes, and the shape determines the build, the data, and the risk:

Classify – put an input into one of a few buckets (urgency, category, sentiment). Cheap, evaluable, easy to bound.
Extract – pull structured fields out of unstructured text or documents.
Retrieve and answer – find the relevant source and answer from it, with citations (retrieval-augmented generation, or RAG).
Draft or generate – produce text a human edits and sends.
Summarise – compress a long thing into a short faithful thing.
Perceive – read images, audio, or video (multimodal).
Predict – forecast a number or a probability from historical data. This one is usually not a language-model job; it’s classical machine learning, and it belongs on the “not an LLM” list more often than not.

The autonomy ladder. The single most important AI-specific decision is how much a use case is allowed to do without a human. Four rungs, in increasing order of blast radius:

Suggest – the model proposes; a human reads it as one input and decides everything.
Draft for review – the model produces the artefact; a human edits and commits it.
Act with approval – the model proposes an action and executes only after a human clicks yes.
Act autonomously – the model acts and tells someone afterwards.

The rule: the higher the cost of being wrong, the lower the rung you’re allowed to start on. You can climb the ladder as evidence accrues; you cannot start at the top because the demo looked confident. Pinning every survivor to a rung is what stops envisioning from producing the auto-reply that told a tenant the gas leak could wait.
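The ladder rule can be written down as a ceiling on the starting rung. The sketch below is one possible encoding, assuming a hypothetical three-point cost scale; the `Cost` levels and the `STARTING_CEILING` mapping are illustrative choices, not part of the method.

```python
from enum import IntEnum


class Rung(IntEnum):
    SUGGEST = 1
    DRAFT_FOR_REVIEW = 2
    ACT_WITH_APPROVAL = 3
    ACT_AUTONOMOUSLY = 4


class Cost(IntEnum):  # hypothetical three-point scale for the cost of being wrong
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Assumed ceilings: the higher the cost, the lower the ceiling.
# No cost level allows a use case to start at the top rung.
STARTING_CEILING = {
    Cost.LOW: Rung.ACT_WITH_APPROVAL,
    Cost.MEDIUM: Rung.DRAFT_FOR_REVIEW,
    Cost.HIGH: Rung.SUGGEST,
}


def may_start_at(rung: Rung, cost: Cost) -> bool:
    """True if a use case with this cost of being wrong may start on this rung."""
    return rung <= STARTING_CEILING[cost]


print(may_start_at(Rung.SUGGEST, Cost.HIGH))          # True
print(may_start_at(Rung.ACT_AUTONOMOUSLY, Cost.LOW))  # False
```

The ceiling is a maximum, not a target: a low-cost use case can still start lower than it is allowed to.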

A fourth lens, borrowed from Assumption Mapping, is worth keeping in your pocket: every score on the grid is an assumption. “We have the data” and “a wrong answer here is cheap” are beliefs until tested. The picks coming out of this session are exactly the assumptions to test first in the pilot.

Inputs

Something concrete about the business to generate candidates against. The richest input is a map of the work: an Impact Map, a process walked end to end, a Jobs to be Done list, or simply the team’s own list of where the days go. Without a view of the actual work, the session generates generic ideas (“a chatbot,” “a copilot”) instead of scoped ones.

You also need:

A wall or board for divergent generation, and a pre-drawn value/feasibility grid
Sticky notes in three colours: candidates, capability tags, autonomy tags
Someone who can answer “do we have that data, and is it any good?” in the room, not as a follow-up
An honest read on what a wrong answer costs in each workflow, ideally from the person who handles the complaints
A half-day slot and the right five to seven people (see Who’s Needed)

Outputs

What lands on the wall at the end:

A populated value/feasibility grid with every candidate placed, the high-value/high-feasibility corner pulled out as the pilot shortlist
Two or three picked pilots, each carrying: a one-line problem statement, a capability family, an autonomy rung, the data it would draw on, and the cost of a wrong answer
A parking lot of promising-but-not-yet candidates, each with the one thing that has to change (usually data) before it’s worth revisiting
An explicit “not an AI problem” list, each item with the cheaper thing that solves it (a rule, a lookup, a form field)

Photograph the grid with every note readable before the notes come down.

These outputs feed straight into:

A build. Each pilot becomes a tightly-scoped implementation; the capability family tells you the shape and the autonomy rung tells you where the human sits. In upcoming posts we’ll take two of this session’s picks all the way to code: Triaging Maintenance Requests with a Bedrock Classifier and Answering Tenant Questions from the Lease.
Assumption Mapping: run the grid’s scores as assumptions and test the riskiest before you build.
Prioritisation: when the shortlist is still longer than your capacity, sequence it.
Picking the AWS AI Service Tier: each pilot now needs a service; the capability family narrows the field before you start.

Who’s Needed

Five to seven people, around half a day:

Facilitator. Runs the clock, polices the AI-shaped gate, and stops the room collapsing every idea into one chatbot.
Outcome owner. The person accountable for the business result (a head of operations, a service lead). They decide which pilots get funded, so they place the value scores.
Practitioners. Two or three people who do the work being considered. They know the real volumes, the edge cases, and which “obvious” idea would actually make their day worse. Without them the candidates are imagined, not observed.
Data owner. Someone who knows what data exists, where it lives, how clean it is, and what’s legally usable. Feasibility scores are guesses without them, and data is where AI pilots most often die.
Engineer. Someone who can say “that’s a week” or “that’s a research project” and who knows the capability families well enough to tag candidates honestly.
Risk or compliance, when the domain has teeth (money, safety, regulated advice). They place the cost-of-being-wrong scores and keep the autonomy rungs honest.

Who to leave out:

People with a model to sell, internal or external. They anchor the room on a solution before the problems are scoped.
Large stakeholder groups. If a dozen people need a say, run a pre-session to gather candidates, then envision with the smaller group.
Observers. As in every workshop in this family, observers warp the room.

How To Run It

Phase	Duration	Materials	Key question
Frame the business	20 min	The work map	“Where does the time and pain actually go?”
Diverge on candidates	30 min	Yellow notes, silent	“Where could a model read, write, decide, or perceive?”
Gate: is it AI-shaped?	25 min	Two columns	“Would a rule, lookup, or search beat a model here?”
Tag capability and data	25 min	Coloured tags	“What shape of AI is it, and do we have the data?”
Plot on the grid	40 min	Value/feasibility grid	“How much value? How feasible, given the data?”
Pin autonomy and cost	25 min	Autonomy tags	“What does a wrong answer cost, and where’s the human?”
Pick the portfolio	20 min	Dot votes	“Which two or three do we pilot, and what do we park?”
Total	185 min (~3 hours)

Below, we’ll work through an envisioning session by following Lodgewise, a residential property-management agency that’s decided it’s behind on AI. It manages around three thousand tenancies for landlords across two cities, with forty-odd staff: property managers, a maintenance desk, leasing consultants, accounts. Tenants reach them by email, a web portal, the phone, and an after-hours emergency line. The directors have said the words “we need an AI strategy,” and the head of operations, sceptical and busy, has booked the room.

Phase 1: Frame the business (20 minutes)

Put the work map on the wall. For Lodgewise it’s a one-page walk through a tenancy’s life: enquiry, application, lease signing, move-in, the steady state of rent and maintenance and questions, inspections, renewal or move-out. The facilitator marks where the days actually go, taking that from the practitioners rather than the org chart:

“Forget AI for twenty minutes. Where does this team’s time disappear, and where do things go wrong? Point at the map.”

The maintenance coordinator points at the steady state: hundreds of inbound requests a week, triaged by hand, and the occasional emergency that sits in the queue too long. A property manager points at the same place for a different reason: she answers the same dozen tenant questions over and over, and the answers are all sitting in the lease and the tenant handbook, just not anywhere findable. Accounts points at arrears: by the time it’s visible, it’s three weeks deep. Inspections come up too: a routine inspection is forty photos and an hour of writing.

You’re not solving anything yet. You’re building the shared picture the candidates will attach to.

What to watch for:

The map that’s an org chart. “Where do the days go” gets you the work; “who reports to whom” gets you politics. Keep redirecting to the work.
The hero workflow. One person’s pet pain dominates. Note it, then deliberately ask the others where their time goes.

Phase 2: Diverge on candidates (30 minutes)

Now turn on the AI lens. Silent generation, one idea per note, fifteen minutes:

“Walk the map. Anywhere a model could read something messy, write a first draft, decide a category, find and answer from our documents, or look at a photo, write it down. Don’t judge it yet. One idea per note. I’d rather throw half away than miss the good one.”

Prompt with the capability families if the room stalls: Classify? Extract? Retrieve and answer? Draft? Summarise? Perceive? Predict? Lodgewise’s wall fills up:

Triage inbound maintenance requests by urgency and category, and suggest the trade
Answer tenant questions from the lease and handbook
Draft first-pass replies to routine tenant emails
Read inspection photo sets and flag likely issues
Pull key terms out of new lease PDFs into the system
Predict which tenancies will fall into arrears
Write listing copy from a feature checklist
Spot duplicate and spam maintenance tickets
An after-hours voice bot for the emergency line
Summarise a tenancy’s whole history for a manager taking over a portfolio
Translate tenant comms into community languages

Eleven candidates from a room of six is healthy. Cluster the obvious duplicates, keep the rest.

What to watch for:

The everything-bot. Three notes all say “a chatbot.” Split them by the actual job: answering questions is retrieve-and-answer; drafting replies is generate; triaging is classify. They’re different builds with different risks.
Solutions with no problem. “Use AI for inspections” with no idea what it would do. Push: “do what, exactly, with which input?”

Phase 3: The gate – is it AI-shaped? (25 minutes)

Two columns on a fresh wall: AI-shaped and Solve it another way. Take each candidate and ask the gate question:

“If we had no AI at all, what would solve this? If a rule, a lookup, a calculation, or a search box does the job, it goes in the right-hand column, and that’s a good outcome, not a failure.”

Lodgewise’s gate does real work. Predict which tenancies fall into arrears goes straight right: a rule (“rent more than N days overdue, flag it”) captures most of the value today, and if they ever want a real model it’s classical machine learning on tabular data, not a language model. The room links the reasoning to When Not to Use an LLM and moves on. Spot duplicate tickets is mostly a matching rule on address and time window, with a model only at the fuzzy edges; it goes right with a note. Translate tenant comms is a managed translation service, not a generative project; right column.
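To show how little code the right-hand column can need, here is a sketch of the two rules mentioned above. The threshold of 7 days, the 48-hour window, and the function and parameter names are all placeholders chosen for illustration; Lodgewise's real values would come from accounts and the maintenance desk.

```python
from datetime import date, datetime, timedelta

OVERDUE_THRESHOLD_DAYS = 7  # the "N" in the rule; a placeholder, not a recommendation


def flag_arrears(rent_due: date, today: date, paid: bool,
                 threshold_days: int = OVERDUE_THRESHOLD_DAYS) -> bool:
    """Flag a tenancy when unpaid rent is more than `threshold_days` overdue."""
    if paid:
        return False
    return (today - rent_due).days > threshold_days


def probable_duplicate(address_a: str, opened_a: datetime,
                       address_b: str, opened_b: datetime,
                       window: timedelta = timedelta(hours=48)) -> bool:
    """Same address, opened within the window: likely the same issue reported twice."""
    same_address = address_a.strip().lower() == address_b.strip().lower()
    return same_address and abs(opened_a - opened_b) <= window


print(flag_arrears(date(2024, 3, 1), date(2024, 3, 10), paid=False))  # True
print(probable_duplicate("12 Elm St ", datetime(2024, 3, 1, 9, 0),
                         "12 elm st", datetime(2024, 3, 2, 8, 0)))    # True
```

Both rules are deterministic and easy to test, and their failure modes are visible in the code. The fuzzy edges of duplicate matching (misspelt addresses, differently worded reports) are where a model might later earn a place.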

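What’s left in the AI-shaped column is the genuine list: triage, tenant Q&A, draft replies, inspection photos, lease extraction, tenancy-history summary, listing copy. Seven real candidates, down from eleven: three went to the right-hand column, and the after-hours voice bot was set aside because a wrong answer on an emergency line can be a safety incident (it comes up again in phase 6). The agency has already saved itself from building a logistic regression nobody needed.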

What to watch for:

Reluctance to send ideas right. The room feels like moving a candidate out of the AI column is losing. Reframe: the right-hand column holds the cheapest, most reliable wins in the room.
The “but AI could also” creep. A clean rule gets dressed back up as AI because it sounds better in the board deck. Hold the line: cheaper and more predictable wins.

Phase 4: Tag capability and data (25 minutes)

For each surviving candidate, stick a capability tag and a data verdict. The data owner earns their seat here:

Triage – Classify. Data: years of historical tickets with the category and trade eventually assigned. Good and labelled.
Tenant Q&A – Retrieve and answer. Data: every lease, the tenant handbook, the FAQs. Exists, unstructured, but real.
Draft replies – Generate. Data: a corpus of past replies. Exists but uneven in quality.
Inspection photos – Perceive. Data: thousands of photos, but almost none labelled with what was wrong. Thin.
Lease extraction – Extract. Data: the lease PDFs, but no gold-standard “right answers” to check against yet. Buildable, needs a labelled set.
Tenancy-history summary – Summarise. Data: scattered across the system; assembling the input is most of the work.
Listing copy – Generate. Data: plenty of past listings. Fine.

The data verdict is the quiet killer. Two attractive ideas (inspection photos, lease extraction) have a data problem, not a model problem, and that shows up as a low feasibility score in the next phase rather than as a vague worry.

Phase 5: Plot on the grid (40 minutes)

Move to the grid. The vertical axis is value (how much time, money, or risk this removes); the horizontal axis is feasibility, and feasibility here folds in the data verdict, because an idea you can’t feed is not feasible no matter how clever the model.

The quadrants:

Top-right, pilot now: high value, you can actually build it. Maintenance triage and tenant Q&A land here for Lodgewise. Good data, clear value, a model genuinely suits the problem.
Top-left, big bets: high value, but something (usually data) isn’t ready. Inspection photos and lease extraction sit here. Worth a place in the parking lot with the one thing that has to change written on the note: label a few hundred photos, build a gold set of lease answers.
Bottom-right, quick wins: real but modest value, cheap to do. Listing copy. Pick these up between pilots; they build the team’s muscle on something low-risk.
Bottom-left, park or drop: low value and hard. Be ruthless; this corner is where good intentions go to consume engineers.
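If the room scores candidates numerically, the quadrant placement is a two-threshold lookup. The sketch below assumes a 1 to 5 score on each axis and a cut-off of 4 for "high"; both are hypothetical, and the scores in the example are invented to reproduce the placements described above, not taken from a real session.

```python
def quadrant(value: int, feasibility: int, high: int = 4) -> str:
    """Place a candidate scored 1-5 on each axis; `high` is the assumed cut-off."""
    high_value = value >= high
    high_feasibility = feasibility >= high
    if high_value and high_feasibility:
        return "pilot now"
    if high_value:
        return "big bet"       # valuable, but something (usually data) isn't ready
    if high_feasibility:
        return "quick win"     # modest value, cheap to do
    return "park or drop"


# Illustrative (value, feasibility) scores only.
scores = {
    "maintenance triage": (5, 4),
    "tenant Q&A": (4, 4),
    "inspection photos": (4, 2),
    "lease extraction": (4, 2),
    "listing copy": (2, 5),
}
for name, (value, feasibility) in scores.items():
    print(f"{name}: {quadrant(value, feasibility)}")
```

The numbers matter less than the argument that produces them. Feasibility should already include the data verdict, so a candidate with thin data cannot score high on that axis however simple the model work looks.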

What to watch for:

Everything in the top-right. If the room scores every idea as easy and valuable, the data owner hasn’t pushed hard enough. Make them defend each feasibility score out loud.
Value inflation. “It’ll save hundreds of hours” with no basis. Anchor to the volumes from phase 1: how many tickets a week, how many repeat questions, how long a report takes.

Phase 6: Pin autonomy and cost (25 minutes)

For each candidate in the pilot-now and quick-win corners, place an autonomy tag, and write down what a wrong answer costs. This is the phase that separates this workshop from a generic prioritisation.

Maintenance triage: cost of a wrong answer is real (a misrouted emergency), so it starts at draft for review – the model suggests urgency, category, and trade; the coordinator confirms with one click. Emergencies are routed to a human regardless of model confidence.
Tenant Q&A: a wrong answer is a tenant acting on bad information about their lease, so it starts at suggest – the model answers with citations to the actual clause, and when it isn’t sure it says so and hands off to a property manager. It never takes an action; it only informs.
Listing copy: a wrong answer is an awkward sentence a human deletes, so draft for review is plenty and the risk is trivial.
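The triage rung above implies a routing rule that deliberately ignores model confidence for emergencies. A minimal sketch, assuming a hypothetical suggestion record with `urgency` and `confidence` keys (neither the record shape nor the queue names come from a real system):

```python
def route_triage(suggestion: dict) -> str:
    """Decide who sees a triage suggestion first.

    `suggestion` is a hypothetical record such as
    {"urgency": "emergency", "category": "gas", "trade": "gasfitter", "confidence": 0.62}.
    """
    if suggestion["urgency"] == "emergency":
        # Confidence is deliberately not consulted on this branch.
        return "human now"
    # Draft for review: everything else waits for the coordinator's one-click confirm.
    return "coordinator confirms"


print(route_triage({"urgency": "emergency", "confidence": 0.31}))  # human now
print(route_triage({"urgency": "emergency", "confidence": 0.99}))  # human now
print(route_triage({"urgency": "routine", "confidence": 0.99}))    # coordinator confirms
```

This only covers tickets the model labels as emergencies. An emergency the model mislabels as routine still reaches a person through the review step, which is the reason for starting on that rung.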

Notice the pattern: value put these in the pilot corner; cost-of-being-wrong sets the rung they start on. The after-hours voice bot, had it survived, would have illustrated the opposite extreme: a wrong answer can be a safety incident, no rung is low enough to start at, so it’s deferred until the rest of the practice is mature.

“The demo will always look confident enough for the top rung. We start low not because the model is bad but because the cost of being wrong is ours, not the model’s. We climb the ladder when the evidence says we’ve earned it.”

What to watch for:

Autonomy set by ambition. Someone wants the triage to auto-dispatch trades on day one. Pin it to the cost: what happens when it dispatches a plumber for an electrical fault? Start lower, climb later.
Cost waved away. “It’s only a maintenance ticket.” Ask the person who handles the complaints what the worst plausible wrong answer does.

Phase 7: Pick the portfolio (20 minutes)

Stand back. The pilot-now corner has two strong candidates; the quick-win corner has one. Dot-vote to confirm sequence, not to change the set:

Lodgewise picks two pilots: maintenance triage (classify, draft-for-review) and tenant Q&A (retrieve-and-answer, suggest). Different capability families on purpose, so the team learns two shapes of AI in one quarter.
One quick win: listing copy, picked up by whoever has a slow week.
The parking lot: inspection photos and lease extraction, each with its data prerequisite written on the note and a date to revisit.
The not-an-AI list: arrears (a rule today; classical ML later if ever, never urgently), duplicate tickets (a matching rule), translation (a managed service). Each with the cheaper solution named.

The head of operations who booked the room sceptically leaves with something she didn’t expect: not a grand AI strategy, but two scoped experiments she can fund, a short list of cheap wins, and a documented decision about what the agency is deliberately not building with a model. That last list is the one she pins above her desk.

What Can Go Wrong

The hackathon hangover. The room arrives with twelve demos already built and wants to retrofit a strategy around them. Recovery: Run the gate and the grid on the demos as if they were fresh candidates. Some will survive; the ones that don’t are cheaper to retire now than to keep limping. Stop if: The demos are politically protected and the session is being used to bless them. That’s not envisioning; decline the framing.

The everything-bot. Every candidate dissolves into “one assistant that does it all.” Recovery: Force each candidate back to a single capability family and a single autonomy rung. An assistant that classifies, retrieves, drafts, and acts is four projects and four risk profiles; scope them apart. Stop if: The room can’t or won’t separate them; the organisation wants a product vision, not a use-case portfolio, and that’s a different session.

The feasibility fantasy. Everything scores as easy because nobody in the room actually knows the data. Recovery: Adjourn the grid, send someone to inspect the data, reconvene. A feasibility axis built on guesses produces a portfolio built on guesses. Stop if: There’s no data owner available at all. Reschedule; this session cannot run without one.

The autonomy land-grab. The picks all get pinned to “act autonomously” because that’s the impressive version. Recovery: For each, name the worst plausible wrong action and who wears it. Pin the rung to that, not to the ambition. Stop if: Leadership insists on full autonomy for a high-cost workflow over the room’s objection. Record the objection; that’s a risk decision being made above the team.

The orphaned portfolio. A clean shortlist that nobody is resourced to pilot. Recovery: Cut the portfolio to the one pilot that actually has an owner and a fortnight. One real pilot beats three imaginary ones. Stop if: There’s no appetite to fund anything. End early; don’t manufacture a backlog that will rot.

Next Steps

The session ends; the pilots begin.

Same day, the facilitator:

Photographs the grid, the gate columns, and the autonomy tags, all readable.
Writes up each pilot as a one-pager: problem statement, capability family, autonomy rung, data source, cost of a wrong answer, and the assumption to test first.
Circulates the not-an-AI list with the cheaper solution named for each, and makes sure someone owns the quickest of those wins.
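The one-pager has a fixed shape, so it can be kept as a small structured record alongside the photographs. The sketch below is one way to do that; the class and field names are hypothetical, and the assumption to test first in the example is picked for illustration rather than taken from the Lodgewise session.

```python
from dataclasses import dataclass

CAPABILITY_FAMILIES = ("classify", "extract", "retrieve and answer",
                       "draft or generate", "summarise", "perceive", "predict")
AUTONOMY_RUNGS = ("suggest", "draft for review",
                  "act with approval", "act autonomously")


@dataclass(frozen=True)
class PilotOnePager:
    problem_statement: str
    capability_family: str
    autonomy_rung: str
    data_source: str
    cost_of_wrong_answer: str
    assumption_to_test_first: str

    def __post_init__(self):
        # Reject tags that aren't on the wall's agreed lists.
        if self.capability_family not in CAPABILITY_FAMILIES:
            raise ValueError(f"unknown capability family: {self.capability_family}")
        if self.autonomy_rung not in AUTONOMY_RUNGS:
            raise ValueError(f"unknown autonomy rung: {self.autonomy_rung}")


triage = PilotOnePager(
    problem_statement="Triage inbound maintenance requests by urgency and category, "
                      "and suggest the trade",
    capability_family="classify",
    autonomy_rung="draft for review",
    data_source="historical tickets with the category and trade eventually assigned",
    cost_of_wrong_answer="a misrouted emergency",
    assumption_to_test_first="the data is good enough",  # illustrative choice
)
print(triage.capability_family, "/", triage.autonomy_rung)  # classify / draft for review
```

Validating the two tags keeps "a chatbot" or "full autonomy, eventually" from creeping into the write-up after the session.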

This quarter, the outcome owner:

Funds the pilots as experiments, not products. A pilot’s job is to retire the riskiest assumption from the grid (usually “the data is good enough” or “a wrong answer is cheap enough”), cheaply, in weeks.
Holds each pilot to its autonomy rung. Climbing happens on evidence (measured accuracy, a clean eval set, a quarter without an incident), not on enthusiasm.
Keeps the parking lot honest: when the data prerequisite is met, the parked candidate comes back to the grid, not straight to a build.
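"Climbing happens on evidence" can be turned into an explicit check that the outcome owner signs off on. The sketch below assumes all three kinds of evidence are required together, and that 95% accuracy and 90 incident-free days are the bars; those thresholds and the function name are hypothetical and should be set per pilot.

```python
def may_climb(accuracy: float, eval_set_is_clean: bool, incident_free_days: int,
              min_accuracy: float = 0.95, min_incident_free_days: int = 90) -> bool:
    """True only when all three kinds of evidence are in; enthusiasm is not an input."""
    return (accuracy >= min_accuracy
            and eval_set_is_clean
            and incident_free_days >= min_incident_free_days)


print(may_climb(0.97, True, 120))  # True
print(may_climb(0.97, True, 30))   # False: not yet a quarter without an incident
print(may_climb(0.99, False, 120)) # False: the accuracy number isn't trustworthy
```

A passing check is permission to discuss moving up one rung, not a trigger that moves the pilot automatically.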

Ongoing, the team:

Treats each pilot’s launch as the start of a measurement loop, not the finish line (keeping it honest after it ships). A classify pilot needs an accuracy number and a watched error rate; a retrieve-and-answer pilot needs a groundedness check and a hand-off rate. Picking the AWS AI Service Tier is the next decision for each survivor.
Re-runs envisioning when the business changes or when a quarter’s pilots have taught the team what’s actually feasible. The second session is always sharper than the first.
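The two simplest numbers in that measurement loop can be computed from logs with a few lines. This is a sketch under assumed log shapes: pairs of (predicted, actual) labels for the classify pilot, and records with a `handed_off` flag for the retrieve-and-answer pilot. The groundedness check is not shown, since it depends on how answers and citations are stored.

```python
def accuracy(pairs):
    """Share of (predicted, actual) label pairs that match."""
    if not pairs:
        raise ValueError("no labelled examples to score")
    return sum(1 for predicted, actual in pairs if predicted == actual) / len(pairs)


def hand_off_rate(answers):
    """Share of questions the model handed off to a person instead of answering."""
    if not answers:
        raise ValueError("no answers logged")
    return sum(1 for a in answers if a["handed_off"]) / len(answers)


# Made-up logs for illustration.
triage_log = [("urgent", "urgent"), ("routine", "routine"),
              ("routine", "urgent"), ("routine", "routine")]
qa_log = [{"handed_off": False}, {"handed_off": True},
          {"handed_off": False}, {"handed_off": False}]

print(accuracy(triage_log))      # 0.75
print(1 - accuracy(triage_log))  # 0.25 (error rate)
print(hand_off_rate(qa_log))     # 0.25
```

An overall accuracy figure can hide the errors that matter most. In the made-up log above, the single miss is an urgent ticket labelled routine, which is why the error rate should be watched per category and not only in aggregate.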

Variants

Portfolio level (default). A whole business or department, half a day, five to seven people, one grid, two or three picked pilots plus a not-an-AI list. This is what most teams need, and the rest of this post describes it.

Single-process deep dive. Instead of the whole business, take one process walked in detail (the maintenance flow, the move-in flow) and envision only within it. Faster (ninety minutes), narrower, good when one workflow is obviously where the value is and you just need to scope the AI within it.

Impact-Map-driven. Take an existing Impact Map and ask, of each deliverable, “is this AI-shaped, and if so which capability?” The map supplies the candidates; the gate, the grid, and the autonomy ladder do the rest. Useful when an envisioning session would otherwise start from a blank wall.

Remote. A board (Miro or Mural) with the grid, the two gate columns, and a tagging palette pre-drawn. Generation is silent in the tool; the grid debate moves at the pace of one shared cursor, so budget a little longer. Keep the autonomy tags as a distinct colour so the cost conversation doesn’t get lost in the value one.

Vendor-claims filter. When the candidates are arriving as a vendor’s slide deck rather than the team’s own ideas, run the gate hard: for each claimed use case, ask what data it needs from you, what a wrong answer costs you, and what autonomy rung the vendor is quietly assuming. Most “turnkey AI” pitches assume a higher rung than you’d choose for yourself.