Agents, Chains, and Retrieval

May 20, 2026 · 20 min read

AI Practitioner · AIF-C01 · part of The Exam Room

A product manager wants a “GenAI assistant” for internal operations. Version one answers policy questions. Version two also looks up customer records. Version three drafts emails. Version four files Jira tickets on behalf of the user. Three architectural patterns keep appearing in proposals – chains, retrieval, and agents – and teams routinely reach for the most elaborate one because it’s the newest. Each one fits a different workload shape, so there’s no single “best”; what matters is which piece of the assistant’s job each pattern solves, and when the extra machinery of an AgentA system that wraps an LLM with tools, memory, and a loop, so it can take multi-step actions toward a goal rather than just answering one prompt. actually earns its keep over a plain chain or retrieval call.

The situation

An internal-operations team has commissioned a GenAI assistant. The requirements, ordered by when they landed:

  • v1: Policy Q&A. Engineers ask “what’s our data-retention policy for customer chat logs?”; the assistant answers from the internal policy wiki. One-shot question-answering, grounded in documents.
  • v2: Customer record lookup. Support agents ask “what subscription tier is customer ID 4711 on, and when did they last log in?”; the assistant calls an internal API and returns the answer in natural language. The data isn’t in any document; it’s in a database.
  • v3: Email drafting. After looking up a customer, draft a personalised apology-plus-next-steps email for the agent to review. Combines retrieved facts with generated text.
  • v4: Ticket filing. “Please file a P2 Jira ticket against team payments describing the issue above, with the customer context attached.” The assistant takes an action in an external system based on what was just discussed.

The team has a Bedrock account, access to Claude Sonnet and Nova Pro, an internal REST API for the customer-record lookup, a Jira API, and the policy wiki mirrored to an S3 bucket. What’s unclear is the architecture. Somebody has proposed one big “agent” that handles all four versions uniformly. Somebody else has proposed four separate endpoints, each built on the simplest pattern that works for its job. The team want a recommendation.

What actually matters

These three patterns aren’t rivals for the same problem. They’re answers to different shapes of problem. Picking right is mostly about matching the pattern to the shape of the workload. Policy Q&A is a retrieval shape – knowledge in documents, no actions. A multi-step “do whatever’s needed” flow is an agent shape. Everything in between is some flavour of chain. Trying to solve a retrieval problem with an agent is over-engineering; trying to solve “do whatever’s needed” with a fixed chain is under-engineering.

The patterns sit on a ladder of elaboration. Each rung up buys capability – external data, actions, planning – and pays for it in three currencies: latency, cost, and predictability. The first two scale with the number of model invocations per request. The third is what bites in production: a chain always runs steps 1, 2, 3, in that order; an AgentA system that wraps an LLM with tools, memory, and a loop, so it can take multi-step actions toward a goal rather than just answering one prompt. might call lookup_customer once on Monday and three times on Tuesday, depending on how the ModelA trained set of weights plus the architecture that makes them useful – the thing you load up and run inference against. planned. For any production-facing behaviour, that determinism is a feature – if an auditor asks “what did the assistant do when user X asked Y?”, a chain’s answer is the code; an agent’s is the run trace.

Tool useLetting an LLM call structured functions you’ve defined – search, calculator, database query, API call – instead of trying to do everything in text. is the mechanism that connects models to systems – describe a function, the model decides when to call it, the application executes, the result returns. It’s a model feature, not an agent-only one. A chain whose steps can call tools is still a chain (the topology is fixed by code); a model handed tools and told to “figure it out” is an agent (the topology is chosen by the model). That distinction is the entire architectural call – not “do we use tools?” but “do we let the model choose the path?”

The default error mode, when an agent platform is on the table, is over-elaboration: “we have an agent runtime, so everything becomes an agent call.” The better discipline is to pick the simplest pattern that supports each piece of the workload, and reach for agency only when the path genuinely needs to be chosen by the model rather than written down by the engineer.

What we’ll filter on

Six filters, applied to each of the three patterns.

  1. Determinism of topology – does the same input produce the same sequence of steps?
  2. Supports external data at query time – facts not in the PromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot. or the model’s training?
  3. Supports taking actions (calling APIs, writing to external systems)?
  4. Latency – roughly, how many round-trips to the model per user request?
  5. Cost per request – roughly proportional to number of model invocations?
  6. Observability / ease of audit – can a human reconstruct what happened?

The pattern landscape

  1. Single LLM call. One invocation, one response. The user’s prompt goes in; the model’s completion comes out. No external data, no tools, no multi-step reasoning beyond what fits in one prompt. The baseline – useful for tasks the model can do in one shot given its training (summarise this text, translate this sentence, classify this ticket).

  2. Chain. Multiple LLM calls stitched together in application code. Output of call N feeds into input of call N+1. Topology is hardcoded. Example: “extract facts from this ticket (call 1), then generate a customer-facing summary from the facts (call 2).” Each step is an InvokeModel call; the orchestration is your Lambda or application server.

  3. Retrieval (RAG). A specific two-step chain: retrieve relevant chunks from a document corpus, then generate using the chunks. AWS-native via Bedrock Knowledge Bases (bedrock-agent-runtime:RetrieveAndGenerate) or DIY with EmbeddingA fixed-length vector of floats that represents a piece of text (or image, or other thing) in a space where similar meanings sit close together. + VectorAn ordered list of numbers – in AI usage, almost always an embedding – and by extension the databases that index them for nearest-neighbour search. + InvokeModel. Deterministic topology, one model call per request (two if you count the embedding call).

  4. Chain with tool use. A chain where one or more steps allow the model to call tools. The model’s response might be “call tool X with these arguments”; the application executes the tool and sends the result back; the model continues. The chain topology is still fixed (step 1 can use tools, step 2 generates the final response, etc.) but within a step the model has degrees of freedom.

  5. Agent. A loop, not a chain. The model is given tools and a goal; it plans, the application executes, the model observes, it re-plans. The loop continues until the model emits a “final answer.” AWS-native via Bedrock Agents – define an agent with instructions, action groups (each backed by a Lambda or an OpenAPI schema), optional Knowledge Bases, and optional GuardrailA filter or rule applied to an LLM’s inputs or outputs to keep it inside safe, legal, or on-brand behaviour. ; invoke via bedrock-agent-runtime:InvokeAgent. The runtime handles the plan-act-observe loop and emits a trace showing each step.

  6. Multi-agent orchestration. Multiple specialised agents coordinated by a supervisor agent. A billing agent, a customer-lookup agent, a ticketing agent, and a supervisor that routes requests to the correct one. Bedrock supports this via agent collaborators. Useful at scale when a single agent’s tool count exceeds what it can reason over reliably (typically 15-20 tools); over-engineering at lower scale.

Side by side

Pattern Deterministic External data Actions Latency Cost Auditability
Single call 1 hop Low Trivial
Chain ✗ (unless tool) Partial N hops Low-medium Easy
Retrieval (RAG) 1-2 hops Low Easy
Chain with tool use N+M hops Medium Easy
Agent Variable High Trace-based
Multi-agent Very variable Very high Multi-trace

The trade is obvious reading down the table: as you move from single call to multi-agent, capability increases and predictability, latency, and cost all move the wrong way. Choosing well means picking the lowest row that supports the required capability.

Pattern decision tree

Picking the pattern Does the task need knowledge outside the model's training? NO YES Does it need to take actions in external systems? NO Single LLM call bedrock-runtime: InvokeModel one request, one response YES Is the flow topology fixed? (you know which tools, in order) YES Chain with tool use InvokeModel with tools orchestration in app code deterministic, cheap NO Agent Bedrock Agents model-driven plan-act loop trace is the audit artefact Does it also need to take actions? NO Retrieval (RAG) Bedrock Knowledge Bases RetrieveAndGenerate grounded answers + citations YES Agent + KB Bedrock Agent with Knowledge Base attached + action groups (Lambdas) The four assistant versions on this map: v1 Policy Q&A -> Retrieval · v2 Customer lookup -> Chain with tool use · v3 Email drafting -> Chain · v4 Ticket workflow -> Agent (when the flow genuinely branches)
Two questions split the space cleanly. "One big agent" is almost always over-engineering; "agent when it genuinely needs to plan" is the rule worth holding.

The picks in depth

v1 – Policy Q&A: retrieval. Knowledge lives in documents, no actions, fixed topology (retrieve, then generate). Implementation: a Bedrock Knowledge Base over the policy wiki’s S3 mirror, with semantic chunking and Titan v2 embeddings. The v1 endpoint is one RetrieveAndGenerate call per question. Answers come back with citations; the UI renders clickable links to the source policy. No tools, no agent runtime.

v2 – Customer lookup: chain with tool use, not an agent. The topology isn’t uncertain – ask, optionally call one tool, format. The model never has to plan; it just decides whether the question needed the tool. An agent here would be slower, costlier, and harder to debug for no extra capability the flow needs. Implementation: a tool description lookup_customer(id: int) -> {tier, last_login, plan_details}. The v2 endpoint calls Claude Sonnet with the user’s question, the tool description, and a System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. instructing it to call the tool if the question requires customer data. The model responds either with text (if it didn’t need the tool) or a tool_use block with the customer ID. The application executes the Lambda behind the tool, sends the result back in a tool_result block, and the model produces the final text response. Two model calls in the typical case; one if the model can answer without the tool.

v3 – Email drafting: chain, with retrieval folded in. Fixed topology – retrieve customer + policy context, generate the draft, optionally review for tone/PII. The model doesn’t choose the path. Implementation: two or three steps in application code. Step 1: retrieve policy context relevant to the issue (if applicable) and the customer record. Step 2: pass the retrieved context plus the engineer’s instructions to Claude with a prompt template (“You are drafting a customer email. Tone: empathetic but professional. Include the following points…”). Step 3 (optional): a second model call that reviews the draft for tone and PII compliance.

v4 – Ticket workflow: agent – and this is where the pattern earns its keep. If the flow were “user says file a ticket, assistant extracts context and calls file_jira()”, that would still be a chain with one tool call. The agent pattern starts to make sense when the flow might be: check Jira for a duplicate first, comment on it if one matches, otherwise file new, then notify Slack, then escalate on-call if P1. That’s a decision tree where the model picks the path based on what each call returns – agent territory. Implementation: a Bedrock Agent with:

  • Instructions. “You help engineers file and manage Jira tickets. When asked to file, first search for duplicates; if a similar open ticket exists, prefer commenting on it. Always summarise the ticket back to the user for confirmation before filing. Never change ticket priority without explicit user confirmation.”
  • Action group: Jira. An OpenAPI schema (or Lambda function schema) exposing jira_search(query), jira_create_ticket(team, title, description, priority), jira_comment(ticket_id, text), jira_update(ticket_id, changes).
  • Action group: Notifications. slack_notify(channel, text) for the P1 escalation flow.
  • Knowledge Base attached. For any policy questions that come up mid-flow (“is this a P1 or P2?” requires consulting the severity policy).
  • Guardrails attached. Content filters; PII redaction on output.

The agent runtime handles the loop: receive the user’s request, plan (model decides first action), act (runtime calls the action group’s Lambda), observe (result returns to the model), re-plan, etc. When the model emits a final response, the loop terminates. The runtime produces a trace – a structured record of each plan-act-observe cycle – which is the audit artefact.

An example v4 flow, condensed:

User: "File a P2 ticket with team payments describing the duplicate-charge issue we
       just discussed. Attach customer 4711's context."

Agent:
  Thought: "Search for existing P2 tickets matching this description first."
  Action: jira_search(query="duplicate charge customer 4711")
  Observation: [] (no matches)

  Thought: "Look up customer context to include."
  Action: (uses Knowledge Base to retrieve recent chat history for customer 4711)
  Observation: "Customer 4711, Enterprise tier, reported duplicate charge on 2027-02-26..."

  Thought: "Confirm ticket details with user before filing."
  Response to user: "I'll file a P2 with team payments: 'Duplicate charge for
                    customer 4711 (Enterprise) on 2027-02-26.' Confirm?"

User: "Yes, file it."

Agent:
  Action: jira_create_ticket(team="payments", title="Duplicate charge for customer 4711",
                             description="...", priority="P2")
  Observation: {ticket_id: "PAY-2387"}

  Response to user: "Filed PAY-2387 with team payments. https://..."

Four model calls, three action-group invocations, one knowledge-base lookup, one user confirmation cycle. The trace shows every step with inputs, outputs, and the model’s reasoning. Compare that to the equivalent “chain with tool use” implementation: you’d have to hardcode the search-first-then-file logic, the confirmation step, the knowledge-base lookup – and the moment a user asks something slightly different (e.g. “file this or comment on PAY-2301 if it’s the same issue”), your hardcoded chain misses it. The agent handles variations without code changes; the cost is the non-deterministic topology and the observability burden of reading traces.

A worked dispatch

The v1-to-v4 assistant in production fronts four endpoints, or one endpoint with an intent router. Either way, an individual request’s path looks like:

Request: "What's our data retention policy for chat logs?"
Router classifies: policy question -> route to Knowledge Base
  RetrieveAndGenerate -> answer + citations
  Latency: ~1.5s. Cost: ~$0.003.

Request: "What tier is customer 4711 on?"
Router classifies: customer lookup -> route to chain with tool use
  Claude Sonnet (tool-use capable) -> tool_use: lookup_customer(4711)
  Lambda executes -> tool_result: {tier: "Pro", ...}
  Claude formats -> "Customer 4711 is on the Pro tier, last logged in 2 days ago."
  Latency: ~2s. Cost: ~$0.008.

Request: "Draft an apology email for customer 4711 about their billing issue."
Router classifies: drafting -> route to email chain
  Step 1: retrieve customer + policy context.
  Step 2: generate draft with Claude Sonnet.
  Step 3: (optional) tone/PII review pass.
  Latency: ~4s. Cost: ~$0.015.

Request: "File a P2 Jira ticket for the billing issue, attach customer context,
          and notify #customer-escalations on Slack."
Router classifies: multi-step workflow -> route to Bedrock Agent
  InvokeAgent -> agent runs its plan-act loop:
    search Jira -> retrieve context -> confirm with user -> create ticket -> notify Slack
  Returns final response + trace.
  Latency: ~10-15s across confirmations. Cost: ~$0.05-0.10.

Latency and cost vary by an order of magnitude across the four patterns. If you’d built the policy Q&A as an agent, each simple question would cost 5x more and take 5x longer – for no quality gain. If you’d built the ticket workflow as a chain, you’d have hardcoded the flow and brittle-failed on any variant the code didn’t anticipate. Each pattern matches the shape of its workload.

What’s worth remembering

  1. Chains, retrieval, and agents are different designs for different problems. Not rival approaches to the same problem. Picking wrong usually means over-engineering (agent where chain would do) rather than under-engineering.
  2. Chains are fixed topologies; agents are model-driven topologies. That’s the fundamental distinction. Chains are deterministic in structure (though each LLM call is stochastic). Agents choose their own path through a tool space, so the same request can take different paths.
  3. Tool use is a model feature, not an agent-only feature. A chain with tool use gets many agent-like capabilities (external data, actions) while keeping a fixed topology. This is often the correct middle ground.
  4. Retrieval is a specific chain that deserves its own name. Retrieve, then generate. Bedrock Knowledge Bases is the managed path; DIY with embeddings + vector store is the flexible path.
  5. Bedrock Agents handle the plan-act-observe loop for you. Define action groups (Lambdas or OpenAPI schemas), attach Knowledge Bases and Guardrails, invoke via InvokeAgent. The runtime produces a trace that’s the audit artefact.
  6. Latency and cost scale with the pattern. Single call is 1-2 seconds and fractions of a cent. Retrieval is 1-3 seconds and a few cents. Chain with tool use is 3-5 seconds and an order of magnitude up. Agents are 5-20 seconds and another order of magnitude. Pick the cheapest pattern that works.
  7. Non-determinism is the cost of agency. Agents will take different paths on the same request. That’s what makes them general; it’s also what makes them harder to test and explain. Keep agents to the flows that genuinely need them.
  8. An intent router is the architecture most assistants actually want. Not “one big agent” handling everything. A router that classifies the request and dispatches to the correct pattern (retrieval, chain, agent) keeps the cheap paths cheap and reserves the expensive machinery for the cases that need it.

The temptation, when agents are available, is to use them for everything – “then we don’t have to think about routing.” The result is a system that costs ten times more, takes five times longer, and is harder to audit than it needed to be. The harder, better discipline is to look at each piece of the workload, ask whether the model genuinely needs to choose the path, and reach for an agent only when the answer is yes.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.