Fifty Thousand Documents and a Citation on Every Answer

The situation

A 6,000-person enterprise is standing up an internal assistant. The corpus is ~50,000 documents across four domains – HR policies, engineering runbooks, security guidelines, product specs – totalling ~5 GB of mostly text-dense PDFs, Markdown, Word, and Confluence exports. New documents land weekly, old ones get superseded, a handful are retracted. The assistant has to reflect the current state within a day of a change.

On the answer path:

P95 end-to-end latency < 3 s from question to last token, across retrieval, generation, and network.
Document-level access control. An engineer asking “what are the band-5 engineering salaries?” must get a polite refusal, not an HR document. A security auditor asking about an incident-response runbook gets the runbook. Identity drives what the retriever can see.
Citations on every answer. Every factual claim points back to a source chunk. No citation, no answer.

What actually matters

A RAG system lives or dies at the boundary where identity meets retrieval, so the first question is who owns that boundary? A product team that ships “the assistant” without owning the access-control fabric under it is building a compliance incident with a generative front-end. The design has to make the seam explicit: identity in, filter out, retriever sees only what the caller is allowed to see. Anywhere else in the stack is the wrong place to apply the check – filtering results after retrieval leaves the top-K polluted with chunks the user can’t read, and filtering at generation leaves the citation hanging off something the user shouldn’t have seen in the first place.

The second is what’s the blast radius of a bad answer? An engineer who asks about someone else’s salary and gets a careful decline is fine. An engineer who asks about someone else’s salary and gets the answer is a wrongful-disclosure incident, and the remediation isn’t a prompt tweak – it’s legal notice, HR escalation, and a six-month trust deficit with the workforce that was just asked to share more data with the tool. The cost of a single leakage dominates every other cost on the project. That shape pushes the design toward managed components where the access-control path is a first-class API, not a piece of glue the team maintains.

The third is what’s the cost curve as the corpus grows? Five gigabytes today, seven next year, thirty when the internal wiki finally gets ingested. The ingestion story has to be incremental by default – a full weekly reprocess of 5 GB is doable, a full weekly reprocess of 30 GB eats the evening. The vector store bill scales with dimensions times chunks times replicas, so the embedding model choice is a multi-year commitment to a specific storage footprint. Changing embedding models means reindexing everything, which is why the default is Titan V2 at 1,024 dimensions rather than the cheapest 256 option – the storage saving from dropping dimensions is quick to win and expensive to reverse.
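To make "dimensions times chunks times replicas" concrete, here is a rough sizing sketch. It counts only raw float32 vector payload; the chunks-per-document figure and the replica count are assumptions for illustration, not measurements from this corpus, and real indexes add HNSW graph and metadata overhead on top.

```python
def vector_storage_bytes(chunks: int, dimensions: int, replicas: int = 1,
                         bytes_per_float: int = 4) -> int:
    """Raw vector payload only; ignores index graph and metadata overhead."""
    return chunks * dimensions * bytes_per_float * replicas


# Assumption: ~50,000 documents at a hypothetical 40 chunks each.
CHUNKS = 50_000 * 40

for dims in (256, 512, 1024):
    gib = vector_storage_bytes(CHUNKS, dims, replicas=2) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB")
```

Under those assumptions the raw payload comes out at roughly 3.8, 7.6 and 15.3 GiB for 256, 512 and 1,024 dimensions. The absolute numbers matter less than the linear scaling: every doubling of dimensions doubles the footprint for the life of the index.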

The fourth is what are the failure modes we have to design against? A citation the user can’t load because the S3 object is gated by a different policy. A retrieval that returns zero chunks for a legitimate question because the filter is too tight. A chunking strategy that slices a procedure in half and leaves the generator stitching two halves of two runbooks together. A metadata-sidecar path where a file was added without its .metadata.json and therefore has no allowed_groups, defaulting to nobody or everybody depending on how the filter is composed. Each of those wants a test, a runbook, and a monitoring line – the managed service takes care of about half; the application team owns the other half.
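The missing-sidecar failure mode is the easiest of these to test before ingestion. The sketch below walks a local staging copy of the corpus and reports any document without a usable allowed_groups list, so the pipeline can fail closed. The function name and the local-directory layout are assumptions for illustration; the metadataAttributes wrapper is the sidecar shape Knowledge Bases expects as far as I know, and is worth checking against the current documentation.

```python
import json
from pathlib import Path

SIDECAR_SUFFIX = ".metadata.json"


def audit_sidecars(root: Path) -> list:
    """Return one problem string per document lacking usable allowed_groups."""
    problems = []
    for doc in sorted(root.rglob("*")):
        # Skip directories and the sidecar files themselves.
        if not doc.is_file() or doc.name.endswith(SIDECAR_SUFFIX):
            continue
        sidecar = doc.with_name(doc.name + SIDECAR_SUFFIX)
        if not sidecar.exists():
            problems.append(f"{doc}: missing sidecar")
            continue
        try:
            data = json.loads(sidecar.read_text())
        except json.JSONDecodeError:
            problems.append(f"{doc}: sidecar is not valid JSON")
            continue
        attrs = data.get("metadataAttributes", {}) if isinstance(data, dict) else {}
        groups = attrs.get("allowed_groups") if isinstance(attrs, dict) else None
        if not isinstance(groups, list) or not groups:
            problems.append(f"{doc}: allowed_groups missing or empty")
    return problems
```

Run as a gate before each sync: a non-empty result blocks the ingestion job instead of letting an unlabelled document default to nobody or everybody.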

The fifth is where does a small platform team want to spend its operational attention? Not on owning a vector-store operator, not on writing chunking pipelines, not on re-implementing citation extraction for the fourth time. Managed services buy back that attention at the cost of flexibility; the trade is good when the workload is standard and bad when it has a weird shape. A 50K-document corpus with vanilla group-based access control is standard. A SOX-grade audit requirement with multi-hop ACL joins is weird and wants SQL.

Finally: what does “current state” mean in practice? The brief says “within a day” but the business will discover it means “within an hour” the first time a retracted policy keeps answering questions. The ingestion cadence has to scale from weekly-cron down to per-object event without re-architecting, because the product requirement will tighten under production pressure.

What we’ll filter on

Five filters, and the landscape either clears them or doesn’t.

Document-level access control enforced during retrieval. Not a post-hoc scrub of results – otherwise the top-K is polluted with chunks the user can’t see and quality collapses.
Sub-3-second end-to-end latency at P95. Retrieval under a second, generation streamed, first tokens visible to the user inside one second.
Citations that survive the model summarising or paraphrasing. The generation path has to propagate “which chunk came from which document” all the way to the response.
Incremental weekly ingestion. New files picked up, changed files re-embedded, deleted files removed. Not a full weekly reprocess of 5 GB.
Reasonable operational overhead. A small platform team. Managed components where the differentiation isn’t worth hand-rolling.

The RAG architecture landscape

Five plausible shapes on AWS.

Fine-tune a foundation model on the corpus. No retrieval at all – the knowledge goes into the weights. Weekly refresh means weekly fine-tune cycles at 5-GB scale. Citations are impossible because fine-tuning merges sources into weights with no pointer back. Per-user access control is impossible because once a chunk is in the weights, every user sees it.

Bedrock Knowledge Bases. A managed RAG service that ingests documents from a data source (S3, SharePoint, Confluence, Salesforce, web crawler, custom), chunks them, embeds them through a chosen model, stores the vectors, and exposes two runtime APIs – Retrieve for raw chunks and RetrieveAndGenerate for the full round-trip with citations. Eight supported vector stores: OpenSearch Serverless, OpenSearch managed clusters, S3 Vectors, Aurora pgvector, Neptune Analytics (GraphRAG), Pinecone, Redis Enterprise Cloud, MongoDB Atlas. Four supported embedding models: Titan Embeddings G1 (1,536 dim), Titan Text Embeddings V2 (256 / 512 / 1,024), Cohere Embed English v3 (1,024), Cohere Embed Multilingual v3 (1,024). Metadata filtering during retrieval and citations in generation are first-class.

Custom RAG with Bedrock + OpenSearch Serverless vector engine. Same substrate as Knowledge Bases’ most common configuration, but you write the pipeline: ingestion Lambdas, embedding invocations, k-NN mappings, prompt assembly, citation extraction. Every component is under your control and yours to operate. OpenSearch Serverless supports HNSW with Faiss, cosine / L2 / dot-product metrics, up to 16,000 dimensions, and scales in OCU increments (2-OCU minimum for production, $0.24 per OCU-hour).

Custom RAG with Bedrock + Aurora PostgreSQL pgvector. Same DIY pipeline, but the vector store is Aurora with pgvector 0.5.0+ and HNSW indexes on a vector(n) column. Knowledge Bases can also consume Aurora as a vector store via the RDS Data API plus Secrets Manager. The selling point is SQL: embeddings sit next to the metadata you already keep relationally, and filters become ordinary WHERE clauses.

Custom RAG with Bedrock + Amazon Kendra. Kendra is not a vector database – it’s an intelligent search service with its own ranking models, ML-based relevance tuning, and built-in document-level security. GenAI Enterprise Edition runs $0.32/hour base plus $0.25/hour per storage unit plus $0.07/hour per query unit; Basic Enterprise starts at $1.40/hour. Point it at data sources, hit Retrieve, stuff results into a Bedrock prompt, emit citations from Kendra’s result URIs.

Side by side

Option	Access control in retrieval	<3 s P95	Citations	Incremental sync	Low ops overhead
Fine-tune foundation model	✗	✓	✗	✗	✗
Bedrock Knowledge Bases	✓	✓	✓	✓	✓
Custom RAG on OpenSearch Serverless	✓	✓	✓	✓	✗
Custom RAG on Aurora pgvector	✓	✓	✓	✓	✗
Custom RAG on Kendra	✓	✓	✓	✓	—

Matching the shape to the managed service

Figure: identity in, filter composed server-side, metadata filter applied during retrieval (green dashed line in the diagram), citations emitted by preserving the default prompt template’s $output_format_instructions$ placeholder.

Knowledge Bases, in depth

Chunking. Five strategies: default (~300 tokens, sentence-aware), fixed-size (tunable), hierarchical (child for precision, parent for context), semantic (LLM-driven boundaries with buffer and percentile threshold), no-chunking (one chunk per document, loses page-number citations). For runbooks and policies – structured documents where the correct answer is a two-sentence span but the generator needs surrounding subsection context – hierarchical earns its place. Child 300 tokens, parent 1,500. Parent + child above 8,000 combined tokens hits metadata-size limits; not supported on the S3 Vectors backend.
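The child-300 / parent-1,500 choice can be expressed as a small configuration builder. This is a sketch of the chunking portion of a data source's vector ingestion configuration as I understand the CreateDataSource request shape; the overlap value of 60 tokens is an assumption, not something the text specifies, and the field names are worth confirming against the current API reference.

```python
def hierarchical_chunking(parent_tokens: int = 1500, child_tokens: int = 300,
                          overlap_tokens: int = 60) -> dict:
    """Build a hierarchical chunking config: parent level first, child level second."""
    if child_tokens >= parent_tokens:
        raise ValueError("child chunks must be smaller than parent chunks")
    if parent_tokens + child_tokens > 8000:
        raise ValueError("parent + child exceeds the 8,000 combined-token limit")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                "levelConfigurations": [
                    {"maxTokens": parent_tokens},  # context handed to the generator
                    {"maxTokens": child_tokens},   # unit that is embedded and searched
                ],
                "overlapTokens": overlap_tokens,   # assumed value
            },
        }
    }
```

Validating the sizes in code keeps the 8,000-token ceiling from being discovered during an ingestion job.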

Embedding model. Titan V2 at 1,024 dimensions is the default for an English corpus: cheapest option that clears the quality bar, reasonable per-vector footprint. Dropping to 512 halves vector storage at some retrieval-quality cost. Cohere Embed English v3 is the upgrade when lexical-vs-semantic ranking matters. Dimensions are locked to the embedding model – switching models means reindexing the whole corpus.

Access control through metadata filtering. Every document has a companion <filename>.metadata.json declaring allowed_groups, domain, classification, effective_date. Every retrieval call passes a filter composed server-side from the authenticated caller’s group membership:

{
  "vectorSearchConfiguration": {
    "numberOfResults": 10,
    "filter": {
      "orAll": [
        {"listContains": {"key": "allowed_groups", "value": "engineering"}},
        {"listContains": {"key": "allowed_groups", "value": "on-call"}}
      ]
    }
  }
}

The filter is applied during vector search, not after. Chunks whose metadata doesn’t satisfy it never enter the top-K. Available operators: equals, notEquals, greaterThan(OrEquals), lessThan(OrEquals), in, notIn, startsWith (OpenSearch Serverless only), stringContains, listContains, andAll / orAll (minimum 2 conditions each). Enough for group-based rules; not enough for full ABAC with clearance-level comparisons.

Critical: the filter is composed by a trusted backend on every call. If the browser gets to construct it, there’s no access control at all.
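A minimal sketch of that trusted-backend composition follows. The helper names, the knowledge base ID and the model ARN are placeholders; the request layout follows the RetrieveAndGenerate call in the boto3 bedrock-agent-runtime client as I understand it. Two details are easy to miss: a caller with no groups has to be refused outright, and a caller with exactly one group cannot use orAll because it needs at least two conditions.

```python
def build_group_filter(groups: list) -> dict:
    """Compose the retrieval filter from server-side group membership."""
    unique = sorted({g for g in groups if g})
    if not unique:
        # Fail closed: never fall through to an unfiltered retrieval.
        raise PermissionError("caller has no groups; refusing to retrieve")
    clauses = [{"listContains": {"key": "allowed_groups", "value": g}} for g in unique]
    # orAll requires two or more conditions, so a single group is sent bare.
    return clauses[0] if len(clauses) == 1 else {"orAll": clauses}


def build_request(question: str, groups: list, kb_id: str, model_arn: str,
                  top_k: int = 10) -> dict:
    """Keyword arguments for retrieve_and_generate; IDs are caller-supplied."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": top_k,
                        "filter": build_group_filter(groups),
                    }
                },
            },
        },
    }
```

In the backend this would be passed as boto3.client("bedrock-agent-runtime").retrieve_and_generate(**build_request(...)), with groups taken from the authenticated session and never from the request body.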

Incremental ingestion. StartIngestionJob walks the data source, diffs against the vector store via S3 metadata (ETags), re-embeds what changed, removes vectors for deleted documents. Weekly cron via EventBridge; per-object triggers from S3 event notifications when the product tightens to near-real-time.
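The per-object trigger can be sketched as a function that a Lambda handler would call with the S3 event. The function name is hypothetical and the client is injected so the logic can be exercised without AWS; in production it would be boto3.client("bedrock-agent"), whose start_ingestion_job takes the knowledge base and data source IDs. This sketch does not handle the case where a job is already running, which a real handler would need to catch and retry or debounce.

```python
from typing import Optional


def start_sync(client, kb_id: str, ds_id: str, event: dict) -> Optional[str]:
    """Start one ingestion job for a batch of S3 change records.

    Returns the ingestion job ID, or None if the event carried no S3 records.
    """
    keys = [
        record["s3"]["object"]["key"]
        for record in event.get("Records", [])
        if record.get("eventSource") == "aws:s3"
    ]
    if not keys:
        return None
    # The job diffs the whole data source; the keys are only used for the label.
    response = client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=ds_id,
        description=f"S3 change: {len(keys)} object(s)",
    )
    return response["ingestionJob"]["ingestionJobId"]
```

The same function serves the weekly EventBridge schedule if the scheduled event is mapped to a synthetic record, which is one way to move from weekly to per-object without changing the code path.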

Citations. RetrieveAndGenerate returns a citations array in the response, linking spans of the generated text to the retrieved chunks along with their S3 URIs and metadata. Citations require the $output_format_instructions$ placeholder in the prompt template; removing it to hand-tune instructions silently disables citations.
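Here is a sketch of how an application might enforce "no citation, no answer" on the way out. It assumes the response carries a citations list whose entries hold a generatedResponsePart and a list of retrievedReferences with S3 locations, which matches the RetrieveAndGenerate response shape as I understand it; the function name and refusal text are illustrative.

```python
REFUSAL = "I can't answer that from the documents you have access to."


def render_with_citations(response: dict) -> str:
    """Render generated text with numbered references, or refuse if any part is uncited."""
    numbering = {}  # S3 URI -> reference number, in first-seen order
    pieces = []
    for citation in response.get("citations", []):
        part = citation.get("generatedResponsePart", {}).get("textResponsePart", {})
        text = part.get("text", "").strip()
        if not text:
            continue
        uris = []
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
        if not uris:
            # A generated span with no source behind it fails the whole answer.
            return REFUSAL
        marks = "".join(f"[{numbering.setdefault(u, len(numbering) + 1)}]" for u in uris)
        pieces.append(text + marks)
    if not pieces:
        return REFUSAL
    sources = [f"[{n}] {uri}" for uri, n in numbering.items()]
    return " ".join(pieces) + "\n" + "\n".join(sources)
```

Refusing the whole answer when one span is uncited is the strict reading of the requirement; a softer policy would drop only the uncited span.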

A worked retrieval trace

One question, end to end. An engineer asks “What’s the runbook for rotating the production database password?” Groups ["engineering", "on-call"].

Identity translation. Backend looks up groups, confirms the session is live, composes the retrieval filter.
Embed the query. Titan V2 returns a 1,024-dim vector in ~30-80 ms.
Vector search with filter. Retrieve with numberOfResults: 10 and the orAll filter. OpenSearch Serverless runs HNSW k-NN with metadata filtering during search, returning ten chunks. HR chunks never contribute noise. ~100-250 ms.
Hierarchical replacement. Child chunks sharing a parent collapse to the parent. Ten children might become six parents, each 1,500-token, each with surrounding procedural context.
Prompt assembly. Knowledge Bases populates $search_results$, $query$, and $output_format_instructions$ – removing the last silently disables citations.
Generation. RetrieveAndGenerate calls Claude Sonnet via a cross-region inference profile. First token ~800 ms; a 300-token answer finishes in ~1.8 s.
Citations. Response includes a citations array linking spans of generated text to retrieved chunks plus S3 URIs. The app renders each as a numbered inline reference.
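The hierarchical replacement step in the trace is worth picturing in code. The sketch below is an illustration of the idea only, not the service's implementation: the parent_id and score fields are hypothetical, and it keeps each parent once, ranked by its best-scoring child.

```python
def collapse_to_parents(children: list) -> list:
    """Collapse child hits that share a parent into one entry per parent."""
    best = {}
    for child in children:
        pid = child["parent_id"]
        entry = best.setdefault(pid, {"parent_id": pid, "score": child["score"], "child_hits": 0})
        entry["score"] = max(entry["score"], child["score"])
        entry["child_hits"] += 1
    # Highest-scoring parent first, so the generator sees the best context early.
    return sorted(best.values(), key=lambda p: p["score"], reverse=True)


hits = [
    {"parent_id": "runbook-db#rotate", "score": 0.91},
    {"parent_id": "runbook-db#rollback", "score": 0.74},
    {"parent_id": "runbook-db#rotate", "score": 0.88},
]
for parent in collapse_to_parents(hits):
    print(parent["parent_id"], parent["score"], parent["child_hits"])
```

This is why ten children can become six parents: the result count shrinks while the context per result grows to the parent's size.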

Total end-to-end: embedding 60 ms + vector search 180 ms + orchestration 50 ms + first-token 800 ms + streaming 1,000 ms = ~2.1 s P95. Comfortably inside the 3-second budget.
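The same arithmetic as a checkable budget, using the stage figures from the trace. One caveat on reading it: per-stage percentiles do not add up to an end-to-end percentile, so this is a planning estimate to confirm against measured P95, not a guarantee.

```python
BUDGET_MS = 3_000

# Stage timings taken from the trace above, in milliseconds.
STAGES_MS = {
    "embedding": 60,
    "vector_search": 180,
    "orchestration": 50,
    "first_token": 800,
    "streaming": 1_000,
}

total_ms = sum(STAGES_MS.values())
generation_ms = STAGES_MS["first_token"] + STAGES_MS["streaming"]

print(f"total {total_ms} ms, headroom {BUDGET_MS - total_ms} ms")
# total 2090 ms, headroom 910 ms
print(f"generation share {generation_ms / total_ms:.0%}")
# generation share 86%
```

With generation at roughly 86% of the total, shaving retrieval buys little; answer length and model choice are the levers that move the number.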

When Aurora pgvector earns its place instead

Reach for Aurora pgvector directly when the access-control logic exceeds what metadata-filter operators express: multi-hop joins across user / group / ACL / classification tables, clearance-level ≤ user-clearance via a lookup table, time-windowed validity (effective_date <= now() AND (expiry_date IS NULL OR expiry_date > now())). SQL eats all of that; metadata attributes can’t. Also correct when the ops muscle for Postgres already exists and adding pgvector plus an HNSW index is a smaller jump than owning an OpenSearch Serverless collection, or when transactional consistency between documents and metadata matters (an ACL change and its embedding update atomically, no stale-filter window).
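To show what "SQL eats all of that" looks like, here is a sketch of a parameterised pgvector query combining group membership, clearance and time-windowed validity. The schema (chunks, documents, document_acl, user_groups, users) is entirely hypothetical; the `<=>` cosine-distance operator and the bracketed vector literal are standard pgvector, and the named placeholders follow psycopg's style.

```python
ACL_QUERY = """
SELECT c.id, c.content, c.embedding <=> %(query_vec)s::vector AS distance
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE d.effective_date <= now()
  AND (d.expiry_date IS NULL OR d.expiry_date > now())
  AND d.clearance_level <= (SELECT clearance_level FROM users WHERE id = %(user_id)s)
  AND EXISTS (
        SELECT 1
        FROM document_acl a
        JOIN user_groups ug ON ug.group_id = a.group_id
        WHERE a.document_id = d.id AND ug.user_id = %(user_id)s
      )
ORDER BY c.embedding <=> %(query_vec)s::vector
LIMIT %(top_k)s
"""


def to_vector_literal(vec: list) -> str:
    """Format floats as a pgvector text literal, e.g. [0.500000,0.250000]."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"


def build_acl_query(user_id: str, query_vec: list, top_k: int = 10):
    """Return (sql, params) ready for cursor.execute(sql, params)."""
    if not query_vec:
        raise ValueError("query vector must not be empty")
    params = {"user_id": user_id, "query_vec": to_vector_literal(query_vec), "top_k": top_k}
    return ACL_QUERY, params
```

Because the ACL tables and the embeddings live in one database, a group change is visible to the very next query, which is the transactional-consistency point made above.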

For 50,000 documents with a vanilla group-membership filter, Aurora is overkill. For 5 million documents with SOX-grade audit against a mature Postgres estate, it’s the correct answer.

When Kendra earns its place instead

Kendra is an intelligent-search service that happens to be useful in a RAG pipeline. Favour it when ranking quality on messy natural-language queries matters more than embedding flexibility (Kendra’s ML-based ranking beats plain vector similarity when user phrasing diverges sharply from source text), when document-level access control via user tokens and group context in the Retrieve API is easier to wire than metadata sidecars, and when the maintained connectors (SharePoint, Confluence, ServiceNow, Salesforce, Box, Slack) earn the premium. For 50,000 documents a GenAI Enterprise Edition base runs ~$500-700/month before queries versus OpenSearch Serverless’s 2-OCU minimum at ~$350/month. For the situation as stated, Knowledge Bases wins on cost and flexibility. For “users consistently phrase things weirdly enough that vector similarity misses,” Kendra earns the premium.

What’s worth remembering

Bedrock Knowledge Bases is the managed RAG path. A data source, a chunking strategy, an embedding model, a vector store, and two runtime APIs: Retrieve for raw chunks and RetrieveAndGenerate for the full round-trip with citations.
Chunking is the lever nobody thinks about until answers are wrong. Five strategies; hierarchical (child for precision, parent for generator context) is the pragmatic default for structured documents.
Embedding model locks dimensions and therefore storage footprint. Titan V2 at 1,024 is the sensible English-corpus default; changing embedding models means reindexing.
Metadata filters run during vector search, not after. That’s what makes access control effective rather than cosmetic – disallowed chunks never enter the top-K and never pollute the generator.
Filter operators cover equals, numeric comparisons, in / notIn, stringContains, listContains, startsWith (OpenSearch-Serverless only), andAll / orAll. Enough for group-based access; not enough for multi-hop SQL-style ACL joins.
Identity-to-groups translation happens server-side. The browser never composes filters; that’s the one non-negotiable security boundary in the design.
Citations depend on the $output_format_instructions$ placeholder. Remove it to hand-tune the prompt and citations vanish silently.
Incremental ingestion scales from weekly cron to per-object S3 event triggers without rearchitecting. “Weekly” becomes “within an hour” with a config change, not a redesign.
Aurora pgvector is the upgrade path when access-control logic exceeds metadata-filter operators. Kendra is the upgrade when ranking quality beats embedding flexibility. Fine-tuning is the wrong path entirely for living, access-controlled corpora.

The answer: Bedrock Knowledge Bases on OpenSearch Serverless, Titan Text Embeddings V2 at 1,024 dimensions, hierarchical chunking with child 300 tokens and parent 1,500, metadata sidecars declaring allowed_groups, every RetrieveAndGenerate filtered by the caller’s group membership via orAll + listContains. Weekly EventBridge-triggered StartIngestionJob; Claude Sonnet for generation with the default prompt template preserving $output_format_instructions$. Latency closes at ~2.1 s P95 – generation is the long pole, retrieval barely registers. A configured managed service plus a small orchestration Lambda, not a pipeline to own.