A product team wants a chatbot that summarises support tickets. They have the tickets, a cloud account, and zero ML background. Somebody in the room has read the phrase “foundation model” and suggested using one. Between that sentence and a working endpoint sit roughly seven distinct stages, each with its own AWS service and its own decisions. The interesting question isn’t which model to use – it’s which stages this team can skip, which they absolutely cannot, and what AWS gives them at each step.
The situation
A support organisation handles around 8,000 tickets a week across five product lines. Each ticket is a thread of customer messages and agent replies, averaging roughly 1,500 words. Managers want a one-paragraph summary at the top of each ticket, written in the same tone the company uses in its knowledge base, that a reviewer can read in ten seconds.
The team is three backend engineers and a product manager. None of them has trained a model. The company already has an AWS account with a modest budget, the tickets live in an RDS Postgres database, and the security team has said anything sent to a third-party API needs a written exception – AWS-native is the path of least resistance.
“Foundation model” has been floated as a solution but nobody in the room can define it, let alone explain the path from “a foundation model exists somewhere” to “a reviewer sees a summary in the ticket UI tomorrow morning.” The lifecycle is the thing to walk.
What we might want from this
A foundation model, in the sense the industry now uses the term, is a large neural network trained on a broad corpus – text, code, sometimes images – that can be adapted to many downstream tasks without being retrained from scratch. Anthropic’s Claude, Meta’s Llama, Amazon’s Titan and Nova families are examples. “Foundation” is the metaphor: the model is the ground floor, and the application sits on top.
The first thing worth thinking about is that there is no single “use a foundation model” step. There is a sequence: choose the model, get access to it, design the way you’ll prompt it, optionally teach it something about your data, deploy it behind an endpoint, plumb that endpoint into your application, and then watch what it does in production. Each of those is a distinct decision, and AWS sells a distinct service (or at least a distinct API surface) for each one.
The second is that most of those stages are optional. A team that needs a summariser doesn’t have to train, doesn’t have to fine-tune, and in many cases doesn’t even need retrieval – the model has read enough English by now that summarising a ticket is within its baseline capability. Recognising which stages a given problem needs is most of the work; adding stages that aren’t pulling weight is how projects end up with a training pipeline they never use.
The third is the managed-versus-self-managed axis. Bedrock gives you a foundation model behind an API call with no infrastructure – you don’t see the GPUs, you pay per token. SageMaker JumpStart lets you take a model, put it on your own endpoint, and run it on infrastructure you pick. The first is the path for most text-summarisation-shaped problems; the second is the path when data can’t leave your VPC, when the model you want isn’t on Bedrock, or when latency demands a provisioned endpoint rather than an on-demand one.
The fourth is the cost shape. Bedrock’s on-demand pricing is per 1,000 input tokens and per 1,000 output tokens, with different rates for different models. For an 8,000-ticket-per-week workload that’s predictable enough to price up front, but the pricing model matters: a longer summary is more output tokens, and input tokens scale with the length of the ticket, so summarisation costs scale roughly linearly with workload.
The fifth is governance. Once a model is behind an endpoint, every team in the company will want to call it. Who can, for which use cases, logged how, evaluated against what? “We stood up a model” is easy; “we stood up a model and a governance story around it” is the one that survives an audit.
The attributes that matter
Every foundation-model project passes through some subset of seven stages. Scoring each stage against the team’s situation is the filter that decides what gets built.
- Model choice – which foundation model fits the task’s quality, language, context-window, and cost profile?
- Access – managed API (Bedrock) or self-hosted endpoint (SageMaker JumpStart / custom)?
- Adaptation – prompt engineering alone, retrieval-augmented generation, fine-tuning, or continued pre-training?
- Deployment surface – on-demand per-token, provisioned throughput, or a real-time SageMaker endpoint?
- Integration – how does the application call the endpoint and handle responses, errors, and rate limits?
- Evaluation – how do we know the model is getting it right, and how do we track that over time?
- Governance – logging, guardrails, access control, cost attribution.
The lifecycle landscape
1. Model selection and access via Bedrock. Amazon Bedrock is a managed service that puts a catalogue of foundation models – Anthropic Claude, Meta Llama, Amazon Nova and Titan, Mistral, Cohere, AI21 – behind a single API. No infrastructure to provision; access is granted per-model in the Bedrock console (some models require an access request, some are self-serve). Authentication is IAM; calls are bedrock-runtime:InvokeModel or InvokeModelWithResponseStream. For a summarisation task with 8,000 tickets a week, this is the shortest path from “we chose a model” to “the model is callable from Lambda.”
2. Model selection and access via SageMaker JumpStart. JumpStart is a SageMaker feature that lets you pick an open-weights model from a catalogue (Llama, Falcon, Mistral, and Amazon’s own models) and deploy it to a real-time SageMaker endpoint in your VPC with a few clicks or a CloudFormation-friendly SDK call. You pay for the underlying instance (e.g. ml.g5.2xlarge) whether calls come in or not, but the model lives in your account, talks only to your VPC, and there is no per-token pricing. The path when data residency, custom fine-tuning, or steady high throughput push you off on-demand; a minimal deploy sketch follows this list.
3. Prompt engineering. The cheapest form of adaptation. A prompt is just the text you send to the model – instructions, examples, and the input. “Summarise the following support ticket in one paragraph, using a neutral professional tone” followed by the ticket text is a prompt. Good prompt engineering can take a generic model most of the way to task-specific behaviour without touching a training pipeline. No new AWS service; the work lives in your application code.
4. Retrieval-augmented generation (RAG). When the model needs facts it wasn’t trained on – internal product documentation, this quarter’s pricing, an engineer’s runbook – you retrieve relevant documents at request time and include them in the prompt. Bedrock Knowledge Bases is the AWS-managed path: point it at an S3 bucket of documents, it chunks them, embeds each chunk into a vector (a list of numbers that encodes meaning), stores the vectors in an OpenSearch Serverless or Aurora PostgreSQL index, and at query time retrieves the most relevant chunks and injects them into the model’s prompt. The team can do this themselves with Titan or Cohere embedding models plus their own vector store; Knowledge Bases is the zero-plumbing version.
5. Fine-tuning. If prompt engineering and retrieval both fall short – typically because the task needs a voice, format, or domain vocabulary the base model doesn’t produce reliably – fine-tuning adjusts the model’s weights on a task-specific dataset. Bedrock supports fine-tuning a subset of its models (Nova, Titan, Llama) via the console and API: upload JSONL training data to S3, start a fine-tuning job, get a custom model that requires provisioned throughput to serve. Fine-tuning is expensive in dollars and in evaluation time; most projects don’t need it.
6. Deployment. Bedrock offers two throughput models: on-demand (pay per input and output token, no capacity reservation) and provisioned throughput (commit to a number of “model units” for 1 or 6 months in exchange for guaranteed capacity and a different price). Fine-tuned Bedrock models require provisioned throughput. SageMaker endpoints are a third path: provision instances, pay for them continuously, get sub-second predictable latency. The choice depends on whether the workload’s shape is bursty (on-demand wins), steady-high (provisioned throughput wins), or latency-critical (SageMaker endpoint wins).
7. Governance. Bedrock emits CloudTrail events for every InvokeModel call, supports model invocation logging to S3 and CloudWatch Logs so inputs and outputs are captured, and integrates with Bedrock Guardrails (topic denies, PII redaction, profanity filters) configured independently of the model. IAM policies scope which principals can invoke which models; AWS Config and Service Control Policies can prevent unapproved models from being invoked at all. SageMaker endpoints inherit the standard VPC, IAM, and CloudWatch story.
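Before the attribute table, the deploy sketch promised in item 2: a minimal JumpStart path, assuming the SageMaker Python SDK. The model ID, instance type, and payload shape are illustrative; the catalogue and any EULA requirements vary by model and Region.

```python
# Sketch only: deploy an open-weights model from SageMaker JumpStart to a
# real-time endpoint in your own account. Model ID and instance type are
# illustrative; check the JumpStart catalogue for what is actually available.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")  # example ID
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # billed whether or not calls come in
    accept_eula=True,               # gated models (e.g. Llama) require accepting the EULA
)

# Payload shape depends on the model; text-generation models typically take {"inputs": ...}.
print(predictor.predict({"inputs": "Summarise: the dashboard has been stuck loading for two hours..."}))

predictor.delete_endpoint()  # the instance bills until the endpoint is deleted
```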
The attribute table
Mapping the lifecycle stages onto the support-ticket-summariser scenario, with adaptation split into its three options (integration shows up in the worked pipeline later):
| Stage | Needed for this project? | AWS service | Notes |
|---|---|---|---|
| Model selection | ✓ | Bedrock catalogue | Claude / Nova for English summarisation |
| Access | ✓ | Bedrock | On-demand via bedrock-runtime:InvokeModel |
| Prompt engineering | ✓ | (application code) | One well-crafted prompt carries most of the work |
| Retrieval (RAG) | ✗ | Bedrock Knowledge Bases | Ticket is self-contained; no external facts needed |
| Fine-tuning | ✗ | Bedrock Custom Models | Defer until prompting is measured |
| Deployment surface | ✓ | Bedrock on-demand | 8k/week is predictable but bursty; not fine-tuned |
| Evaluation | ✓ | Bedrock Evaluation + SageMaker Clarify | Sample, label, track drift |
| Governance | ✓ | IAM, CloudTrail, Guardrails | PII redaction on input; log everything |
The two stages the team can skip – retrieval and fine-tuning – are the two stages where most “AI projects” burn budget unnecessarily. The ticket is the thing being summarised; the model doesn’t need facts beyond the ticket. Fine-tuning is premature until there’s evidence prompting has plateaued.
The lifecycle as a pipeline
The pick in depth
Bedrock on-demand, Claude or Nova, one well-crafted prompt. Bedrock gives a catalogue of models behind a single SDK. bedrock-runtime:InvokeModel takes a model ID (anthropic.claude-3-5-sonnet-20241022-v2:0, amazon.nova-pro-v1:0, etc.) and a JSON body whose shape depends on the model family. For Claude, the body is {"anthropic_version":"bedrock-2023-05-31","max_tokens":1024,"messages":[{"role":"user","content":"..."}]}. For Nova, {"inferenceConfig":{"max_new_tokens":1024},"messages":[...]}. The response comes back as JSON; the application extracts output.message.content[0].text (Nova) or content[0].text (Claude) and hands it to the UI.
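A minimal Python sketch of that call for the Claude body shape, assuming boto3 credentials with invoke permission on the model; the Region and prompt text are placeholders.

```python
# Sketch: invoke Claude on Bedrock on-demand and extract the summary text.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed Region

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarise the following support ticket in one paragraph: ..."}],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=json.dumps(body),
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])  # Claude shape; Nova would be result["output"]["message"]["content"][0]["text"]
```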
Model choice is an empirical question, not a reading-specs question. Amazon Bedrock has an Evaluation feature that runs a set of prompts through a chosen model and scores the results on dimensions like accuracy, robustness, and toxicity – or against a custom ground-truth dataset. Run a batch of 50 representative tickets through three candidate models (Claude Sonnet, Nova Pro, Llama 3.3 70B), have the product manager score the summaries, pick the model that wins on the cheapest price-per-token that meets the quality bar. The evaluation is a few hours of work; it saves months of arguing about which model “feels better.”
Prompt engineering is where the quality lives. A prompt that says “Summarise this ticket” produces mediocre summaries. A prompt that says “You are writing a one-paragraph summary of a customer support ticket for an internal reviewer. Use a neutral professional tone. Mention the customer’s issue, what the agent did, and whether it’s resolved. Do not include the customer’s name or email. If the ticket is in a language other than English, summarise in English.” produces the correct shape, every time. Give the model a few labelled examples in the prompt – “few-shot prompting” – and the consistency tightens further. None of this touches AWS; it’s application code. It’s also where 80% of the lift comes from.
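A sketch of what that prompt construction can look like in application code. The instructions are the ones quoted above; the few-shot example is a made-up placeholder, not a real ticket.

```python
# Sketch: assemble the summarisation prompt from instructions, a few worked
# examples (few-shot), and the ticket text pulled from the database.
INSTRUCTIONS = (
    "You are writing a one-paragraph summary of a customer support ticket "
    "for an internal reviewer. Use a neutral professional tone. Mention the "
    "customer's issue, what the agent did, and whether it is resolved. "
    "Do not include the customer's name or email. If the ticket is in a "
    "language other than English, summarise in English."
)

# Placeholder example; in practice these are tickets the product manager
# has already summarised by hand.
FEW_SHOT = [
    (
        "Customer: CSV export keeps timing out.\nAgent: Raised the export limit for your account.",
        "A customer reported CSV exports timing out. The agent raised the export limit "
        "for the account and the customer confirmed exports now complete. Resolved.",
    ),
]

def build_prompt(ticket_text: str) -> str:
    parts = [INSTRUCTIONS, ""]
    for example_ticket, example_summary in FEW_SHOT:
        parts += ["Example ticket:", example_ticket, "Example summary:", example_summary, ""]
    parts += ["Ticket follows:", "---", ticket_text]
    return "\n".join(parts)
```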
Deployment-wise, 8,000 tickets a week at roughly 2,000 input tokens and 200 output tokens each works out to 16M input and 1.6M output tokens per week. Claude 3.5 Sonnet on Bedrock bills at roughly $3 per million input tokens and $15 per million output tokens, so: $48 + $24 = $72/week, or about $310/month. On-demand is correct: the workload is small enough that provisioned throughput’s minimum commitment would cost more than the usage, and the traffic bursts to Monday-morning peaks that on-demand handles without capacity planning.
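The same arithmetic as a few lines of code, with the per-million-token rates treated as assumptions to re-check against the current Bedrock pricing page.

```python
# Sketch: weekly and monthly cost estimate for the on-demand summariser.
TICKETS_PER_WEEK = 8_000
INPUT_TOKENS_PER_TICKET = 2_000
OUTPUT_TOKENS_PER_TICKET = 200
PRICE_PER_M_INPUT = 3.00    # USD per million input tokens (assumed rate)
PRICE_PER_M_OUTPUT = 15.00  # USD per million output tokens (assumed rate)

input_tokens = TICKETS_PER_WEEK * INPUT_TOKENS_PER_TICKET    # 16M per week
output_tokens = TICKETS_PER_WEEK * OUTPUT_TOKENS_PER_TICKET  # 1.6M per week
weekly = (input_tokens / 1e6) * PRICE_PER_M_INPUT + (output_tokens / 1e6) * PRICE_PER_M_OUTPUT
print(f"~${weekly:.0f}/week, ~${weekly * 52 / 12:.0f}/month")  # ~$72/week, ~$312/month
```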
Governance is an independent track. A Bedrock Guardrail – a configuration object attached to the invocation – redacts PII from the input before it reaches the model, denies specific topics (medical or legal advice, for example), and filters profanity from the output. CloudTrail records every InvokeModel call with the model ID and the caller’s IAM principal. Bedrock’s model invocation logging, enabled once per account and Region, delivers full input/output pairs to S3 (and optionally CloudWatch Logs), KMS-encrypted, for audit and evaluation-set curation. An IAM policy on the Lambda role restricts which models it can invoke: bedrock:InvokeModel with a Resource scoped to specific model ARNs.
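Two of those hooks as a sketch: the IAM policy document that scopes the Lambda role to one model, and the guardrail parameters passed on the invocation. The model ARN, guardrail identifier, and version are placeholders for whatever the platform team actually creates.

```python
# Sketch: governance hooks. The policy document, model ARN, and guardrail
# identifier/version are placeholders.
import json
import boto3

# IAM policy attached to the Lambda execution role: this model and nothing else.
SUMMARISER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "bedrock:InvokeModel",
        "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
    }],
}

# Guardrail applied at invocation time, referencing a guardrail created separately.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 400,
        "messages": [{"role": "user", "content": "..."}],
    }),
    guardrailIdentifier="ticket-summariser-gr",  # assumed guardrail ID
    guardrailVersion="1",
)
```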
A worked pipeline: one ticket end-to-end
Priya, the PM, wants to see the pipeline work on a real ticket before sign-off. The engineering team has a Lambda wired up.
```
$ aws bedrock-runtime invoke-model \
    --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
    --body '{
      "anthropic_version": "bedrock-2023-05-31",
      "max_tokens": 400,
      "messages": [
        {"role": "user", "content": "You are writing a one-paragraph summary of a customer support ticket for an internal reviewer. Use a neutral professional tone. Mention the customers issue, what the agent did, and whether it is resolved. Do not include the customers name or email. Ticket follows:\n\n---\nCustomer: Hi, my dashboard has been stuck on loading for two hours. Im on the Pro plan.\nAgent: Hi there, sorry to hear. Can you try clearing your browser cache?\nCustomer: Tried that, same issue.\nAgent: Ok, Im seeing an issue on our side with the Pro-plan widget rendering. Engineering is deploying a fix; should be resolved in 30 min.\nCustomer: Ok thanks.\nAgent: Deployed. Can you refresh and confirm?\nCustomer: Working now. Thanks!"}
      ]
    }' \
    --guardrail-identifier ticket-summariser-gr \
    --guardrail-version 1 \
    --cli-binary-format raw-in-base64-out \
    out.json
$ jq -r '.content[0].text' out.json
A Pro-plan customer reported that their dashboard was stuck loading for
two hours. The agent diagnosed a server-side rendering issue affecting
Pro-plan widgets, deployed a fix, and the customer confirmed the
dashboard was working again. The ticket is resolved.
```
What happened behind the scenes:
- IAM authorised the caller for bedrock:InvokeModel on the Claude Sonnet model ARN in the target Region.
- Bedrock applied the ticket-summariser-gr guardrail: scanned the input for PII (no matches, because the prompt specifically said not to include names), scanned the output (no matches), passed both through.
- Bedrock called Anthropic’s model (hosted inside the AWS-Anthropic arrangement – the request never leaves AWS), got the completion, returned it.
- CloudTrail logged the InvokeModel call: principal, model ID, timestamp, and – because invocation logging is enabled – the input and output landed in s3://ticket-summariser-logs/ under a KMS key that only the platform team holds.
- The Lambda wrote the summary to the ticket’s summary column in RDS. The support UI rendered it at the top of the thread next time the ticket was opened.
That’s the loop. For 8,000 tickets a week, an EventBridge rule fires an SQS message per new ticket, the Lambda processes them at whatever concurrency Bedrock’s on-demand quota allows (request a service-quota increase if the default isn’t enough), and the whole pipeline is roughly 200 lines of code plus the guardrail, the IAM policies, and the log bucket.
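A sketch of what that Lambda handler could look like. fetch_ticket_text and save_summary are hypothetical helpers around the RDS database, build_prompt is the prompt-construction sketch from earlier, and the model ID and guardrail identifiers repeat the assumptions above.

```python
# Sketch: SQS-triggered Lambda, one ticket summarised per message.
# fetch_ticket_text, save_summary, and build_prompt are hypothetical helpers.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def handler(event, context):
    for record in event["Records"]:                       # one SQS message per new ticket
        ticket_id = json.loads(record["body"])["ticket_id"]
        ticket_text = fetch_ticket_text(ticket_id)        # hypothetical RDS read

        response = bedrock.invoke_model(
            modelId=MODEL_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 400,
                "messages": [{"role": "user", "content": build_prompt(ticket_text)}],
            }),
            guardrailIdentifier="ticket-summariser-gr",   # assumed guardrail ID
            guardrailVersion="1",
        )
        summary = json.loads(response["body"].read())["content"][0]["text"]
        save_summary(ticket_id, summary)                  # hypothetical RDS write
```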
What’s worth remembering
- Foundation model means a general-purpose pretrained model you adapt, not one you build. Someone else did the training; your job is to choose, access, prompt, and optionally adapt.
- The lifecycle has seven stages but most projects only need five. Retrieval is for when the model needs external facts. Fine-tuning is for when prompting plateaus. Both add real complexity; neither is free.
- Bedrock is the default access path for text generation. Managed, IAM-gated, per-token pricing, no GPUs. The path of least resistance for most business-problem-shaped use cases.
- SageMaker JumpStart is the path when Bedrock isn’t enough. Your own endpoint, in your VPC, with the model you choose; you pay for the instance whether calls come or not.
- Prompt engineering is where 80% of the quality lift comes from. One well-crafted prompt with instructions, tone guidance, and a few examples beats a mediocre prompt against a more expensive model.
- Evaluate empirically, not by vibes. Bedrock Evaluation runs a candidate prompt across multiple models on a fixed dataset; the correct model is the one that wins your evaluation, not the one with the newest press release.
- Deployment surface follows workload shape. Bursty and small: on-demand. Steady-high or fine-tuned: provisioned throughput. Latency-critical or data-residency-bound: SageMaker endpoint.
- Governance is continuous, not a stage. Guardrails, CloudTrail, invocation logging, and IAM scoping belong in the first version, not the hardening pass. Retrofitting them is harder than starting with them.
The path from “use a foundation model” to “a reviewer sees a summary in the ticket UI” isn’t a single step; it’s a pipeline. Most of the stages in that pipeline have obvious AWS-native answers, and most teams that get stuck are the ones who treat adaptation (retrieval or fine-tuning) as compulsory rather than contingent. Start with the shortest path, measure, and only add stages when the evidence says they’d help.