The Model Behind the Product

A SaaS platform ships an in-product “ask the data” assistant at roughly a million LLMLLMA neural network trained to predict the next token in a sequence, large enough that it generalises to tasks it wasn’t explicitly trained for. requests a day, peaking at 30 RPS, split across US and EU customers. P99 first-token under 1.5 s. Real reasoning over retrieved context. Bedrock has seven modelModelA trained set of weights plus the architecture that makes them useful – the thing you load up and run inference against. families and four ways to pay for capacity. Most of the landscape filters itself out once you name the attributes that matter, and the real trick is what you wire around the chosen model so it survives a region going dark.

The situation

A B2B SaaS platform is shipping an in-product assistant. Users ask questions of their own data; the application retrieves relevant records, stitches them into a promptPromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot. , and asks a foundation model to answer. Measured over three months of production traffic:

~1,000,000 requests per day, peaking at 30 RPS during US/EU business-hours overlap.
Median request: ~3,000 input tokensTokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. (system promptSystem promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. + retrieved context + user question), ~400 output tokens.
P99 first-token latency target < 1.5 s. The UI streams the answer.
Quality bar: complex reasoning over structured retrieved context – tables, JSON, pulling answers from multiple documents.
Multi-region failover is hard-required. Customers in both us-east-1 and eu-west-1; a regional Bedrock incident must not take either customer base down.
Bedrock-native. No separate model-serving infrastructure.

What actually matters

Before reaching for a model card, it’s worth asking what the application is actually paying for.

The first question is whose product is this? A model choice is a product choice – it decides who owns the upgrade cadence, who tracks the pricing page, and who gets paged when the answer quality drifts after a point release. On Bedrock, those answers are “Anthropic ships the behaviour, AWS ships the availability, we own the integration.” That shape is cheap to buy into and expensive to reverse; the cost of moving a production RAGRAGA pattern where you retrieve relevant documents at query time and stuff them into the prompt so the model can ground its answer on them. application between model families is rarely smaller than the savings, and any decision worth making pairs the model with the throughput mode underneath it.

The second is what does a bad day look like? At a million requests a day the interesting failure isn’t an individual bad answer – it’s a region going dark for forty-five minutes. Every second of that outage has a customer-visible consequence, and the blast radius is the entire customer base on the affected region unless the architecture spreads the load. That pushes the design toward something the application can call with a single model ID while AWS fans the request out across regions behind the scenes, because the alternative is the application owning its own regional routing table and every deploy carrying the risk of a misrouted call.

The third question is what does the bill look like when the product wins? At ~3 billion input tokens and ~400 million output tokens a day, the gap between a $3/M-input model and a $15/M-input model is the difference between a $400k/month bill and a $2M/month bill. That’s not a line-item on a finance review – it’s a budget conversation with the CFO. The interesting economics aren’t “which model is cheapest” but “where can we spend Haiku prices on questions Haiku can answer, and Sonnet prices on the ones that actually need reasoning?”

The fourth is what happens when the easy answer is wrong? A single-model architecture pays Sonnet rates for the FAQ slice of traffic and gets Haiku-grade failure modes when a region saturates. A two-model architecture splits the traffic by difficulty and buys a load-shed path when Sonnet’s capacity tightens. The sophistication isn’t picking the model – it’s designing the cascade that lets the cheaper model carry the tail.

The fifth is how do we know it’s still working? Model behaviour drifts across point releases. A RAG system prompt calibrated against Sonnet 4.5 doesn’t automatically work the same on 4.6, and the gap between “the answers are slightly worse this week” and “we lost 3% accuracy across the board” is an evaluation pipeline that runs nightly against a golden set. That pipeline is a first-class piece of the design, not an afterthought.

Finally: what buys us the right to change our mind? InferenceInferenceRunning a trained model to produce output – as opposed to training it. profiles that hide the specific region, prompt templates that separate cacheable prefix from volatile context, evaluation harnesses that can A/B a new model version – all of those are optionality the architecture builds in, and they’re the difference between “we upgraded to 4.7 last Tuesday” and “we spent six weeks revalidating the prompt.”

What we’ll filter on

Distilling that exploration into filters we can score each model against:

Reasoning quality on retrieved context. Reasoning across long structured prompts, not just fluent extraction.
First-token latency under 1.5 s at P99 for ~3,000-token inputs. Tail, not average.
Cost-per-token that survives a million requests a day. Daily volume is ~3B input + ~400M output tokens; a 10x pricing gap between families is $60k vs $600k a month.
Bedrock-native multi-region availability across US and EU, surviving one region offline.
Throughput predictability at 30 RPS peak. Traffic is smooth, not spiky – so the capacity question is which mode gives predictable latency without over-buying.

The Bedrock model landscape

Bedrock’s catalogue sorts into seven families.

Anthropic Claude. Three live tiers: Haiku 4.5 (~$1 / $5 per million input / output tokens; 200K context), Sonnet 4.5 and Sonnet 4.6 (~$3 / $15; 200K context), Opus 4.5 and Opus 4.6 (~$15 / $75; 200K context). All three support global and geographic cross-region inference profiles, prompt caching, and vision inputs. Sonnet is available in US East (N. Virginia, Ohio, Oregon) and EU (Frankfurt, Ireland, Paris, Zurich). First-token latency on Sonnet 4.5 in a warm region sits around 1-1.8 s; Haiku 4.5 under a second.

Amazon Nova. Four text tiers: Nova Micro (~$0.03 / $0.12 per million; 128K context), Nova Lite (~$0.06 / $0.24; 300K context), Nova Pro (~$0.80 / $3.20; 300K context), Nova Premier (~$2.50 / $12.50; frontier-class). Nova Pro is cheapest-per-token at its quality tier by a wide margin. Regional availability is broad within the US; EU coverage is thinner and largely via cross-region profiles anchored in US regions.

Meta Llama. Llama 3.1 (8B, 70B, 405B), Llama 3.2 (1B, 11B vision), Llama 4 Maverick and Scout (MoE). Among the lowest pricing on the platform – Llama 3.1 70B around $2.65 / $3.50 per million. The top-end 405B and the Llama 4 MoE models are concentrated in US regions only; cross-region profiles don’t cover EU for the top tiers.

Mistral AI. Mistral Large 3 (~$2 / $6 per million, 128K context) is the flagship; Ministral 3B and Mixtral 8x7B sit lower. Decent mid-tier reasoning, strong multilingual. Doesn’t beat Sonnet on quality or Nova Pro on cost; EU coverage thinner than Claude’s.

Cohere. Command R+ is specifically tuned for RAG – citation generation, grounded answers, tool-use. Available in us-east-1 and us-west-2 only; no native EU. First-class option for US-only RAG; ruled out by the EU requirement.

Amazon Titan. The family has shifted to embeddings (Titan Text Embeddings V2) and image generation. Useful for the embedding side of a RAG pipeline; not the generation model.

AI21 Labs. Jamba 1.5 Mini and Large, 256K context, Jamba hybrid SSM/Transformer. Good at long-context extraction; limited EU presence; mid-tier reasoning.

Side by side

Family	Reasoning on retrieved context	P99 first-token < 1.5 s	Cost at 1M req/day	EU region availability	Predictable at 30 RPS
Anthropic Claude (Sonnet)	✓	✓	✓	✓	✓
Amazon Nova (Pro / Premier)	✓	✓	✓	✗	✓
Meta Llama	✓	✓	✓	✗	✓
Mistral AI	—	✓	✓	—	✓
Cohere Command R+	✓	✓	✓	✗	✓
Amazon Titan	✗	—	—	—	—
AI21 Jamba	—	✓	✓	—	✓

Two families make the shortlist on all five: Anthropic Claude and (in a US-only variant) Amazon Nova. Nova wins on cost-per-token but fails EU availability for the Pro and Premier tiers that would clear the reasoning bar. Cohere’s Command R+ is purpose-built for RAG but currently lives in us-east-1 and us-west-2 only. Claude Sonnet is the only row with all ticks, and the “complex reasoning over structured retrieved context” constraint keeps it there.

Matching the workload to the model

Four gates -- reasoning, residency, latency, cost -- and the seven-family catalogue collapses to Sonnet on a geographic cross-region profile with Haiku as the cost-tier fallback.

Sonnet, in depth

Sonnet is where most production RAG applications land: Opus-grade reasoning on most realistic prompts, Haiku-competitive latency for typical RAG input sizes, mid-tier pricing that makes a million-requests-a-day application viable.

Version choice. 4.5 and 4.6 are both live on Bedrock at the same price. 4.6 is newer; 4.5 has the longer benchmarkBenchmarkA standardised test set used to score and compare models. track record. New applications default to 4.6; calibrated pipelines stay on 4.5 until the re-calibration is done, because behaviours differ across minor versions and a RAG system prompt is typically tuned against one specific model.

Context windowContext windowThe maximum number of tokens an LLM can attend to in a single call – prompt plus output combined. . 200K tokens. A 3K input sits far below the ceiling; context pressure is nowhere near a concern.

Latency profile. First-token for a 3K input in a warm region runs under 1.8 s. That’s close to the 1.5 s target, which makes latency a throughput-mode question rather than a model question – on-demand variance can push P99 above target under peak load.

Four ways to buy capacity. On-demand pays per token with no commitment, variable latency under noisy-neighbour contention. Provisioned throughput buys model units at a guaranteed rate on a 1-month or 6-month commitment – predictable latency, committed spend. Batch inference ships 50% off at a 24-hour SLA, fine for offline jobs. Flex tier ships 50% off at best-effort latency, fine for tolerant async. The right default for 30 RPS peak is on-demand with raised quotas; provisioned earns its place when the peak sustains into the hundreds of RPS or a hard latency SLA demands isolated capacity.

Prompt caching is the cost lever most applications miss. Cache reads cost ~10% of the normal input-token price; cache writes cost ~25% more than normal and populate the cache for ~5 minutes. The scenario’s 3K input is almost certainly ~800 tokens of shared system prompt plus ~1,700 of retrieved context plus ~500 of user question. Marking the system prompt cacheable pays full price once per 5-minute window and 10% everywhere else – a 23% reduction in input cost at the stated numbers, larger if tool definitions and few-shot examples live in the cached prefix. Caching also cuts first-token latency by hundreds of milliseconds, directly against the 1.5 s budget.

Cross-region inference profiles are how the multi-region requirement collapses to a config change. Call us.anthropic.claude-sonnet-4-6 and the US invocation spreads across us-east-1, us-east-2, and us-west-2; call eu.anthropic.claude-sonnet-4-6 and the EU invocation spreads across Frankfurt, Ireland, Paris, and Zurich. If one constituent region fails, the others serve. No code change; on-demand rate applies; no cross-region data-transfer charge on the inference path. Global profiles exist for maximum availability but trade residency; geographic profiles are the right default when US and EU customers are separate.

Cascading for cost. Not every question needs Sonnet. Routing the easy slice – short queries, straightforward extraction – to Haiku 4.5 at roughly a third of Sonnet’s price is where the daily bill bends. Three shapes in the wild: cascade (try Haiku, retry on Sonnet when confidence is low), pre-route (classify first, choose once), load-shed (Sonnet by default, drop to Haiku when Sonnet’s P99 climbs). Cascading is the most common because it degrades gracefully when the judge is uncertain.

A worked architecture

Primary model: Sonnet 4.6 via us.anthropic.claude-sonnet-4-6 for US tenants, eu.anthropic.claude-sonnet-4-6 for EU. The application routes tenants to the right profile by home region.
Throughput mode: on-demand. Quotas raised in advance for peak 30 RPS with margin. Provisioned reviewed quarterly against actual utilisation.
Prompt caching: system prompt (roles, instructions, tool definitions) marked cacheable. Cache hit rate monitored as a first-class metric.
Cost-tier fallback: shed to Haiku 4.5 when profile P99 exceeds 2 s for 5 min. Haiku via the matching geographic profile.
Cross-geography failover: on repeated 5xx from the primary profile, retry once against the other geography. Degraded-residency mode for continuity.
Evaluation: 500-question golden dataset, nightly run via Bedrock Evaluations with Sonnet 4.6 as judge, alert on aggregate drops above 5% WoW.
Embedding model: Titan Text Embeddings V2 in each region, vector store local to where it’s queried.

Rough monthly cost at 1M requests/day, 3K/400 token median, ~25% input cached:

Input: 3,000 x 1M x 30 = 90B/month. Effective ~70B after caching x $3/M = ~$210k.
Output: 400 x 1M x 30 = 12B x $15/M = ~$180k.
Evaluations, embeddings, incidental Haiku: ~$10k.
Total: ~$400k/month, before any volume discounts from the account team.

Routing ~40% of traffic to Haiku via a well-calibrated cascade drops total cost to roughly $260k/month for similar quality on the easy slice. That’s where the investment in evaluation pays off – a trustworthy judge makes the cost curve bend.

What’s worth remembering

Seven Bedrock families, only some in both US and EU. Claude, Nova, Llama, Mistral, Cohere, Titan, AI21. The EU-residency gate is what rules most of the catalogue out for a dual-geography product.
Claude’s three-tier split (Haiku / Sonnet / Opus) maps to working points. Haiku for latency and cost, Sonnet as the production default, Opus as an escalation rather than a daily-driver.
Geographic cross-region inference profiles (us., eu.) give automatic in-geography failover at on-demand rates with no surcharge. The primary mechanism for multi-region availability on Bedrock.
Four throughput modes. On-demand for smooth sub-hundreds-RPS; provisioned for sustained high throughput or hard latency SLAs; batch for 24-hour async; flex for tolerant async.
Prompt caching’s 90% discount on cache reads is the biggest lever on input cost for any RAG workload with a stable system prompt. The 5-minute window and ~25% write premium are the two numbers to remember.
Cascading, pre-routing, and load-shedding are three shapes for splitting traffic between Sonnet and Haiku. Cascading is the common production default; load-shedding is the cleaner degradation path than 503s under peak.
Bedrock Evaluations runs automatic and LLM-as-a-judge modes against a golden dataset. Nightly evaluation plus alert-on-drop is the discipline that catches model-version drift before users do.
Custom Model Import is a niche tool – provisioned-only, limited regions, fine-tuned open-source only. Not a general alternative when Bedrock’s hosted models already clear the quality bar.

The model choice is the easy part. The work that actually matters is the throughput mode, the failover topology, the caching discipline, and the evaluation harness wrapped around the chosen model – all pieces that compound when the application outgrows a single region and a single quality tier.