The situation
The platform team owns Bedrock access for the whole company. Roughly thirty services, a support assistant, a ticket classifier, a marketing-copy drafter, a translation pipeline, a meeting summariser, and twenty-five others, call Bedrock in production, each with its own prompt. The prompts were drafted separately by product teams, copied between codebases, embedded as string literals, sometimes templated with f-strings, sometimes loaded from a Markdown file.
What happened last quarter is going to happen again. A product engineer tweaked the meeting-summariser’s System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. , changed “concise” to “brief” in what looked like a clean-up, deployed to production, and retention on the daily summary email dropped 12% for eight days before anyone correlated the code change. The prompt had no version history the product team could see. The A/B infrastructure didn’t know prompts were a thing to vary. The monitoring dashboard reported Bedrock latency and error rate; it didn’t report whether the output was any good.
Platform’s ask: a prompt management story for the whole company. Version prompts, test them before release, roll them out alongside the code that calls them (or independently, if that’s better), measure their impact, and stop letting string literals in thirty repos be the authoritative copy.
What actually matters
A prompt is text the ModelA trained set of weights plus the architecture that makes them useful – the thing you load up and run inference against. sees before it does anything else. It is, functionally, configuration: it changes behaviour, it’s smaller than code, it wants review, and it wants versioning. The failure modes are the ones every config-management practice was invented to address, drift, shadow copies, untested change, silent regression, no rollback.
The first decision is where prompts live as source of truth. Checked into the service repo? Central repo? A managed Bedrock resource? A database?
The second is how they’re versioned. Git commit hashes? Semantic versions? Bedrock prompt versions? All of the above, coordinated?
The third is how they’re released. Deployed with the code that uses them, or independently? Is a prompt change a deployment, a feature flag flip, or a config push?
The fourth is how they’re tested. Before the change hits production, somebody runs the new prompt against a bank of examples and checks the outputs. Is that bank owned by the prompt author? The product team? Platform?
The fifth is how they’re parameterised. A prompt usually has slots, user input, retrieved context, session state. The templating language matters: f-strings lose their context when you refactor; Jinja gains power but adds a dependency; a simple {variable} substitution is predictable. Managed registries usually have their own template syntax to learn.
The sixth is how they’re attributed. When one prompt feeds thirty services, the bill, the latency, and the quality signal have to be broken down by caller, otherwise the platform team can’t tell which service is driving which problem.
And a softer one: the team that owns the prompt. Who edits it, who approves the edit, who rolls it back? Without a clear answer, every service’s prompt is owned by the last engineer who touched it, which is to say, owned by no one.
What we’ll filter on
- Source-of-truth clarity, one place, many places, or a registry?
- Versioning and rollback, immutable versions, diffs, easy revert?
- Deployment shape, bundled with code, pushed independently, feature-flagged?
- Evaluation coverage, tests run before a prompt ships?
- Per-caller attribution, cost, latency, quality broken down by service?
The prompt-management landscape
-
Bedrock Prompt Management. AWS-native prompt registry. Create a prompt with a template, variables, a selected foundation model, default inference config (TemperatureA knob (usually 0 to 2) that controls how much the model deviates from its highest-probability next token. , top-p, max TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. ), and an optional system message. The working draft is mutable;
CreatePromptVersionfreezes an immutable, numbered snapshot, template, variables, model choice, and inference config all locked together. There are no aliases: nothing inside Bedrock pointsproductionat version 12; a version is referenced by its ARN, and that reference lives wherever you put it. To invoke, you pass the prompt’s ARN (with the version, for anything past the draft) as themodelIdtoConverseorInvokeModel, along with apromptVariablesmap; Bedrock renders the template, runs the model, returns the response. IAM scopes who can create, version, and invoke prompts. Ticks 2 cleanly and 1 partially; quiet on 3, 4, and 5, the routing and measurement layers are yours to build. -
Git-backed templates in a shared repo. A
prompts/directory in a shared repo, one file per prompt, Jinja2 or Handlebars or a plain-text template with named placeholders. A small library in each service loads the prompt, substitutes variables, calls Bedrock. Versioning is Git commits; releases are tags; tests sit next to the templates in CI. Nothing AWS-specific; works identically if Bedrock moves to a different model. Ticks 1, 2, 3, 4 cleanly; 5 depends on observability we add. -
Parameterised prompts in each service’s config. Prompts live in each service’s config file (YAML, JSON), deployed with the service, versioned with the service. Cheapest to set up; the baseline thirty-services-each-doing-their-own-thing pattern, formalised. Ticks 3 cleanly; fails 1 and 5.
-
LangChain’s PromptTemplate + LangSmith. Prompts as code in a shared Python package; LangSmith as the evaluation and observability surface. Prompts versioned in the package, evaluated with LangSmith datasets, observed per-invocation. Strong on 4 and 5; separate SaaS; tied to LangChain’s abstractions.
-
A prompt database. A DynamoDB or Postgres table, or SSM Parameter Store itself, holding prompt bodies, versions, and metadata. Services fetch the active prompt at call time. Flexible, but puts prompt changes one write away from production, fast, and dangerous without a deployment gate.
-
Hybrid: Git + Bedrock Prompt Management + a routing parameter. The pattern most platform teams land on. Prompts authored in Git, reviewed in PRs, evaluated in CI. On merge, a pipeline calls
CreatePromptVersion; the new version’s ARN is the release artefact. A Parameter Store parameter per prompt per stage (/prompts/summariser/production) holds the ARN currently in service; promotion and rollback are parameter writes. Git is the source; Bedrock is the immutable registry; Parameter Store is the alias layer Bedrock doesn’t ship.
Side by side
| Option | Source of truth | Versioning | Deployment | Evaluation | Attribution |
|---|---|---|---|---|---|
| Bedrock Prompt Management | Bedrock resource | Immutable numbered versions | API call | Manual / custom | Build it ourselves |
| Git-backed templates | Repo | Commits, tags | With service | CI-driven | Build it ourselves |
| Per-service config | Each service | With service | With service | Each team’s job | None central |
| LangChain + LangSmith | Python package | Package versions | With service | LangSmith datasets | LangSmith traces |
| Prompt database | DB rows | Row versions | DB write | Optional | Depends |
| Git + Bedrock + SSM (hybrid) | Git, with mirror | Git commits → Bedrock versions | Pipeline + parameter flip | CI + golden set | Invocation logs + caller metrics |
For a platform team with 30 callers, the hybrid wins on the trade-offs. Git carries the authoring workflow; Bedrock Prompt Management carries the immutable version registry; Parameter Store carries the routing, which version each stage is actually serving. No single piece hits all five attributes; the conclusion isn’t “Bedrock does it all”, it’s Prompt Management for versioning plus Parameter Store for routing, with attribution built on top.
The prompt lifecycle, end to end
The pick in depth: Git + Bedrock Prompt Management + Parameter Store
Authoring. Prompts live in prompts/ in a shared repo. Each prompt is a Jinja2 template plus a YAML sidecar with the inference config (temperature, top-p, max tokens, stop sequences), the intended foundation model, and ownership metadata (team, primary contact, service list). PRs require review by the prompt owner; SMEs are added as reviewers via CODEOWNERS based on domain.
Evaluation in CI. A GitHub Action runs on every PR. For each changed prompt, it loads a 50-example smoke set (small, fast, runs in under two minutes), invokes the model with the new template, and scores with a mix of reference metrics (BLEU, ROUGE, exact-match for structured outputs) and LLM-as-judge (another Claude call scoring each output 1-5 on defined rubrics). Thresholds are per-prompt, the summariser has different quality criteria than the ticket classifier. A regression blocks merge.
Release pipeline. On merge to main, a pipeline loops through changed prompts and calls CreatePromptVersion on Bedrock. The version is an immutable snapshot, template, variables, model choice, and inference config frozen together, and its ARN is the release artefact. The pipeline then reruns the evaluation harness against a larger golden set (500-2000 examples), invoking the new version’s ARN directly; same judges as CI, bigger net. If the scores are within tolerance of the version currently serving production, the pipeline writes the new ARN into the prompt’s staging parameter and the canary starts, 5% of traffic, routed by the platform SDK, which resolves the candidate parameter for canary callers and the production parameter for everyone else. 24 hours of metrics; if error rate and user-facing quality signals hold, the pipeline writes the production parameter.
Runtime. The alias layer is ours, not Bedrock’s. One Parameter Store parameter per prompt per stage, /prompts/summariser/production, holds the version ARN currently in service. The platform SDK resolves the parameter (cached with a short TTL, so a flood of invocations doesn’t become a flood of SSM reads), then calls Converse with the version ARN as the modelId and a promptVariables map. Bedrock renders the frozen template, runs the model, returns the response. Services never embed a version number; they embed a parameter name.
Rollback. One parameter write: point /prompts/summariser/production back at the previous version’s ARN. No service redeploy; the change propagates as SDK caches expire, seconds to a minute. Parameter Store keeps its own version history per parameter, so the rollback is itself audited, who repointed what, when, to which ARN. This is the single most important operational property of the whole setup: the “oh no” button exists, it’s fast, and it leaves a paper trail. Bedrock doesn’t ship this button; a parameter per stage is the cheapest honest way to build it.
Per-caller attribution. Bedrock won’t break a shared prompt down by caller on its own, so the platform SDK does: every invocation emits a custom CloudWatch metric dimensioned by caller ID, prompt name, and version, and model invocation logging captures the full request and response for the audit trail. A dashboard breaks down each prompt by which service is calling it, how much they’re spending, and how their latency compares. When one service complains that “the summariser is slow,” platform can see whether it’s slow for everyone or only for them, and if only for them, which argument shape is triggering the slowness.
A worked example: the concise-vs-brief incident, prevented
Someone opens a PR changing “concise” to “brief” in the summariser prompt.
- CI runs the 50-example smoke set. The LLM-as-judge rubric includes a “length appropriateness” criterion. The new prompt scores 3.2/5 on that criterion vs the baseline’s 4.1/5, outputs are now shorter than the ideal. CI posts the regression; reviewer asks “was that intentional?”
- Author decides the intent was wording cleanup, not behaviour change. They revert. Incident prevented in three minutes.
Alternative reality: author insists. Reviewer approves. PR merges.
- Pipeline freezes summariser version 17 with
CreatePromptVersion. The golden-set BenchmarkA standardised test set used to score and compare models. runs the 500-example set against the new version’s ARN. Overall quality score holds, but the length-appropriateness sub-metric is down. Platform’s quality dashboard flags the change for human review before the staging parameter moves. - Product team decides “shorter is fine” and approves. The staging parameter repoints; canary starts at 5%.
- User-facing retention metric in Datadog is wired into the canary gate. 24 hours in, retention on the summariser’s daily email has dipped 4% on the canary users; p<0.01. Pipeline aborts the canary. The production parameter stays on version 16’s ARN.
- Rollback is automatic; no action required. Author sees the abort notification and has data to work with.
What didn’t happen: eight days of silent regression, a confused postmortem, and a product team blaming engineering.
What’s worth remembering
- Prompts are configuration. Treat them like it: version control, review, CI, release pipeline, rollback. The damage of ignoring this is cheap to inflict and expensive to detect.
- Git is the source of truth; Bedrock is the runtime registry. Git gives PRs, diffs, code review, and CI; Bedrock gives immutable numbered versions referenced by ARN; Parameter Store gives the pointer that says which ARN is live.
- A prompt version is invoked by ARN. Pass it as the
modelIdtoConverseorInvokeModelwith apromptVariablesmap; the template renders server-side, inference config included. - Bedrock has no alias, so build one. An SSM parameter per prompt per stage holds the live version ARN; rollback is repointing the parameter, instant, audited via parameter history, and no thirty-service redeploy.
- Evaluation is non-optional. A 50-example smoke set in CI catches the obvious; a 500-example golden set in the pipeline catches the subtle; a production canary catches the rest.
- Per-caller attribution matters when one prompt feeds many services. Bedrock won’t slice a shared prompt by caller; emit a custom metric dimensioned by caller ID, prompt, and version from the SDK, and turn on model invocation logging.
- Prompt ownership is explicit. Each prompt has a named owner team and SME reviewers. Without it, the owner becomes the last person who touched it.
- Bedrock Prompt Management handles inference config too. Temperature, top-p, max tokens, stop sequences travel with the prompt version, changing them is a prompt change, with the same review and rollback.
- Don’t let prompts live as string literals in thirty repos. That’s not a style point; that’s the root cause of every prompt regression that ever ships.
One prompt, a hundred callers, a versioned registry, a release pipeline, and a rollback that’s faster than the Slack thread asking “did we change something?” The service owners still own their prompts; the platform team just stopped letting them own them badly.