Grounding a Chatbot in Your Own PDFs

May 11, 2026 · 20 min read

AI Practitioner · AIF-C01 · part of The Exam Room

A facilities team has 600 PDFs – equipment manuals, safety procedures, maintenance schedules – sitting on a SharePoint drive. Engineers want a chatbot that answers “how do I reset the chiller on floor 4?” in seconds instead of the current ten-minute PDF hunt. Bedrock Knowledge Bases is the AWS-native answer, and on a marketing diagram it looks like two boxes and a line. Behind those boxes sit six decisions: which embedding model, which chunking strategy, which vector store, how to parse the PDFs, what the retrieval configuration looks like, and which generation model writes the answer. The interesting question isn’t “does it work” – it works – it’s which of those six decisions actually matter for this corpus.

The situation

A facilities-engineering team at a manufacturing site maintains 600 PDFs covering roughly 200 pieces of equipment, 50 safety procedures, and 30 maintenance schedules. Documents range from 5 to 300 pages; the largest are OEM manuals with dense tables, wiring diagrams, and exploded parts views. A handful are scans of older paper manuals where the PDF is essentially a picture of a page.

The engineers – around 40 on rotating shifts – currently type keywords into SharePoint search, open the top three or four hits, and Ctrl-F through them. Time-to-answer for “what’s the torque spec on the chiller’s compressor mount?” averages 8-12 minutes. The team lead has asked whether “one of those AI things” could shorten that to under a minute, with a citation back to the exact manual and page.

There is already an S3 bucket mirroring the SharePoint drive (nightly sync). The team has AWS access; they don’t have ML engineers. “Bedrock Knowledge Bases” has come up; the question is what configuration actually makes it work well for this corpus, versus what configuration just makes it work.

What actually matters

Bedrock Knowledge Bases is a managed service that does retrieval-augmented generation end to end: you point it at an S3 bucket, it chunks the documents, embeds them into vectors, stores the vectors, and at query time retrieves the most relevant chunks and hands them to a generation model. The API surface is small – essentially CreateKnowledgeBase, StartIngestionJob, and RetrieveAndGenerate – but the configuration inside those calls is where the quality lives.

The first thing worth thinking about is the shape of the corpus. 600 PDFs averaging, say, 40 pages each is roughly 24,000 pages of text. Some have rich tables; some are OEM manuals with figure captions and callouts; some are scanned. A generic “chunk every 300 tokens” strategy will split a wiring-diagram table across two chunks, and the retrieved half won’t make sense on its own. Knowing where your corpus sits on the structured-to-unstructured axis drives the chunking choice.

The second is what questions the engineers actually ask. “How do I reset the chiller?” maps well to procedure sections, which have clear headings. “What’s the torque spec on the compressor mount?” maps to a table of values. “What PPE do I need for this maintenance?” maps to a safety section. If most questions land on structured regions (tables, bulleted procedures, numbered safety steps), the retrieval needs to handle structure; if most are paraphrased conceptual questions (“why does the chiller do X?”), pure vector similarity is fine.

The third is scale and cost. 24,000 pages at, say, 500 tokens per page is 12M tokens of corpus. Embedding at Titan v2’s $0.00002/1K tokens is 24 cents to embed the whole thing once; re-embedding on updates is fractions of that. OpenSearch Serverless for the vector store starts at roughly $350/month for a minimum OCU allocation; Aurora PostgreSQL with pgvector can share an existing Aurora cluster at the cost of a few vector columns. The cost floor is mostly vector-store running cost, not embedding or querying.

The fourth is scans. Some of the PDFs are image-only. Text doesn’t come out of them without OCR. Bedrock Knowledge Bases can use Amazon Textract to extract text and structure from image PDFs (as part of an “advanced parsing” configuration that calls out to a foundation model for layout-aware parsing). Without that, the scanned manuals are dark matter – they exist in the index but their chunks are near-empty.

The fifth is citation format. The engineers want “chiller manual page 42, section 3.2” as the citation, not a raw S3 URI. The chunks carry location metadata (S3 URI, page number, and with advanced parsing, section/heading context); the application displays it. If the chunks are parsed badly, the citations are rough.

The sixth is governance. Every retrieval call invokes a foundation model; every call generates CloudTrail events and optionally invocation logs. The engineers aren’t making sensitive queries, but the corpus contains supplier-confidential information (some OEM manuals are marked “not for distribution”). A Bedrock Guardrail on output can redact model numbers or supplier names if the team wants; an IAM policy keeps the Knowledge Base query API scoped to the engineer tool’s Lambda role.
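
As a sketch – the account ID, region, and knowledge-base ID below are placeholders – the scoping policy attached to the engineer tool’s Lambda role might look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "QueryFacilitiesKnowledgeBase",
      "Effect": "Allow",
      "Action": ["bedrock:Retrieve", "bedrock:RetrieveAndGenerate"],
      "Resource": "arn:aws:bedrock:eu-west-1:111122223333:knowledge-base/KB-FACILITIES"
    },
    {
      "Sid": "InvokeGenerationModel",
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
    }
  ]
}

Broadly, both statements are needed: the first lets the role query the index, the second lets it invoke the generation model the call names.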

What we’ll filter on

Six configuration decisions for a Knowledge Base, scored against this particular corpus.

  1. Chunking strategy – fixed-size, semantic, hierarchical, or default?
  2. Embedding model – Titan v2, Cohere English v3, Cohere Multilingual v3?
  3. Vector store – OpenSearch Serverless, Aurora PostgreSQL pgvector, Pinecone, Redis Enterprise Cloud?
  4. Parsing – default text extraction, or advanced parsing using a foundation model to preserve layout?
  5. Retrieval configuration – number of results, hybrid vs. vector-only search, metadata filters?
  6. Generation model and prompt template – which model, with what instructions about grounding, citation, and refusal?

The configuration landscape

1. Chunking strategy. Bedrock Knowledge Bases offers four options. Default chunks into roughly 300-token pieces with ~20% overlap – safe, generic, ignores structure. Fixed-size lets you set chunk size and overlap explicitly. Hierarchical creates a two-level index: larger “parent” chunks for context and smaller “child” chunks for retrieval; the child is matched but the parent is what gets sent to the generation model. Semantic chunks using a foundation model to identify natural boundaries – paragraphs, sections, topic shifts – instead of arbitrary token counts. For dense technical manuals with heading structure, semantic or hierarchical chunking retrieves more cleanly than fixed-size because chunk boundaries match the document’s own logic.
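
For reference, the hierarchical option is configured on the data source; a sketch of the shape, with illustrative (not recommended) token sizes:

{
  "vectorIngestionConfiguration": {
    "chunkingConfiguration": {
      "chunkingStrategy": "HIERARCHICAL",
      "hierarchicalChunkingConfiguration": {
        "levelConfigurations": [
          { "maxTokens": 1500 },
          { "maxTokens": 300 }
        ],
        "overlapTokens": 60
      }
    }
  }
}

The first level is the parent chunk handed to the generation model; the second is the child chunk matched at query time.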

2. Embedding model. Titan Text Embeddings v2 produces 1024-dimensional vectors, costs $0.00002 per 1K tokens, and is multilingual (100+ languages). Cohere Embed English v3 (1024 dims) is English-only and often retrieves slightly better on English-heavy corpora in Cohere’s own benchmarks. Cohere Embed Multilingual v3 (1024 dims) handles non-English. For a mostly-English manuals corpus with occasional non-English OEM content (German machine-tool manuals, Japanese electronics datasheets), Titan v2 is the safe default; Cohere Multilingual if multilingual retrieval quality is proven to be better on a test set.

3. Vector store. OpenSearch Serverless is the zero-plumbing choice – Bedrock can create it for you. Minimum 2 OCUs at roughly $0.24/OCU/hour means a floor of roughly $350/month. Aurora PostgreSQL with pgvector piggy-backs on an existing Aurora cluster: a CREATE EXTENSION vector statement and a vector column on a table. No additional running cost beyond what the cluster already burns, but you manage the schema, the ingestion hooks, and the index tuning. Pinecone and Redis Enterprise Cloud are third-party integrations; useful if the organisation already runs one of them but usually not the first choice for a new build.
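
If the team did have an Aurora PostgreSQL cluster to reuse, the storage side of the knowledge base would point at it instead of an OpenSearch collection – roughly this shape, with every ARN and name a placeholder:

{
  "storageConfiguration": {
    "type": "RDS",
    "rdsConfiguration": {
      "resourceArn": "arn:aws:rds:eu-west-1:111122223333:cluster:facilities-aurora",
      "credentialsSecretArn": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:facilities-kb-db",
      "databaseName": "kb",
      "tableName": "bedrock_kb_chunks",
      "fieldMapping": {
        "primaryKeyField": "id",
        "vectorField": "embedding",
        "textField": "chunk_text",
        "metadataField": "metadata"
      }
    }
  }
}

The trade is visible in the shape: no new infrastructure, but the table, the pgvector extension, and the credentials secret are yours to manage.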

4. Parsing. Default parsing extracts text from PDFs using AWS’s standard extractors – fast, cheap, loses layout. Scanned pages produce no text. Advanced parsing routes documents through a foundation model (Claude, Nova) that sees the page layout – tables, figures, columns – and emits structured text preserving that layout. It costs extra per page (priced like a model invocation); for a 24,000-page corpus that’s a material ingestion bill. It pays back on corpora where layout matters (tables, multi-column, scanned); the default extraction works fine on plain-text PDFs.

5. Retrieval configuration. numberOfResults – how many chunks to retrieve per query – defaults to 5. For a 600-PDF corpus where relevant content might be split across chunks, 6-10 is often better. overrideSearchType controls vector-only vs. hybrid (vector similarity plus keyword BM25). Hybrid matters when exact terms (part numbers, equipment tags) drive the query. Metadata filters let queries constrain retrieval by fields on the source document (“only safety manuals”, “only equipment in building B”) – requires metadata to be attached at ingestion via .metadata.json sidecars in S3.

6. Generation model and prompt template. The generation model is independently configurable: Claude Sonnet, Nova Pro, Llama, any Bedrock-hosted text model. The Knowledge Base has a default prompt template that injects retrieved chunks under $search_results$ and asks the model to answer from them; you can override it with a custom template that specifies citation format, refusal behaviour, and tone.

Side by side

Matching each decision to the facilities-engineering corpus:

| Decision | Default | This corpus | Rationale |
| --- | --- | --- | --- |
| Chunking | Default (300 tokens) | Semantic | Manual sections have natural boundaries; avoid splitting tables |
| Embedding model | Titan v2 | Titan v2 | English-heavy corpus; Titan v2 is already multilingual for the occasional non-English manual |
| Vector store | OpenSearch Serverless | OpenSearch Serverless | Lowest friction; no existing Aurora cluster to piggy-back on |
| Parsing | Default text extraction | Advanced parsing | Scans + tables + figures require layout-aware extraction |
| Retrieval | 5 results, vector-only | 8 results, hybrid | Part numbers and equipment tags need keyword precision |
| Generation model | Claude Sonnet | Claude Sonnet | Quality on drafting technical procedures justifies the token cost |

The two decisions that matter most for this corpus are advanced parsing (scanned manuals are otherwise invisible) and hybrid retrieval (part numbers and equipment tags are exact-match hints that pure vector search can miss). The others are close to defaults.

How the pieces fit together

Ingestion (one-off + on update): the S3 bucket (600 PDFs, SharePoint mirror) feeds advanced parsing (Claude / Nova – preserves tables and layout, handles scans via Textract), then semantic chunking (boundaries respect sections and headings), then Titan v2 embeddings (1024-dim vectors at $0.00002/1K tokens), landing in OpenSearch Serverless (vector index plus keyword index, which is what makes hybrid search possible). Ingestion runs once for the initial corpus; after that, StartIngestionJob re-runs incrementally on changed or added documents (EventBridge schedule, nightly).

Query (per-question, seconds): the engineer tool makes a RetrieveAndGenerate API call; the question is embedded with the same Titan v2 model (it must match the index), hybrid search (vector + keyword) returns the top 8 chunks, prompt assembly combines question and chunks under the custom template, and Claude Sonnet produces a grounded answer with citations. The answer text and retrieved-references list (S3 URI, page number) flow back to the engineer; a typical round trip is 1-3 seconds. CloudTrail records the call; invocation logging captures the full prompt and output to S3 for audit, encrypted under a customer-managed KMS key.
Two phases, seven moving parts. Ingestion is rare and expensive per-page; query is frequent and cheap per-call.

The configuration in depth

Creating the Knowledge Base. CreateKnowledgeBase takes a data source configuration (S3 bucket ARN, optional inclusion/exclusion filters), an embedding model ARN (Titan v2 for this build), a vector store configuration (OpenSearch Serverless collection ARN and field mappings), and an IAM service role Bedrock will assume to read S3 and write to OpenSearch. The field mappings are worth getting right: vectorField names the column holding the 1024-dim vector, textField holds the chunk text, metadataField holds anything else (source URI, page number, section heading).
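
A trimmed sketch of that call for this build – every ARN, name, and ID below is a placeholder:

{
  "name": "facilities-manuals",
  "roleArn": "arn:aws:iam::111122223333:role/BedrockKbServiceRole",
  "knowledgeBaseConfiguration": {
    "type": "VECTOR",
    "vectorKnowledgeBaseConfiguration": {
      "embeddingModelArn": "arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  },
  "storageConfiguration": {
    "type": "OPENSEARCH_SERVERLESS",
    "opensearchServerlessConfiguration": {
      "collectionArn": "arn:aws:aoss:eu-west-1:111122223333:collection/abcd1234efgh5678",
      "vectorIndexName": "facilities-kb-index",
      "fieldMapping": {
        "vectorField": "embedding",
        "textField": "chunk_text",
        "metadataField": "metadata"
      }
    }
  }
}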

Advanced parsing configuration. In the data source config, parsingConfiguration with parsingStrategy: BEDROCK_FOUNDATION_MODEL and a bedrockFoundationModelConfiguration pointing at Claude 3 Haiku (cheapest capable option) or Claude Sonnet (more accurate on complex layouts). The parser sees each PDF page as an image and emits layout-aware text: tables as tables, figures with captions, multi-column text reassembled in reading order. Costs scale per page; budget for a one-off few-hundred-dollar ingestion bill on the initial 24,000 pages.
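
On the data source, that configuration looks roughly like this (Haiku shown; swap the model ARN for Sonnet if layouts prove too complex for it):

{
  "vectorIngestionConfiguration": {
    "parsingConfiguration": {
      "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
      "bedrockFoundationModelConfiguration": {
        "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
      }
    }
  }
}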

Chunking configuration. chunkingConfiguration with chunkingStrategy: SEMANTIC and semanticChunkingConfiguration specifying maxTokens (e.g. 600), bufferSize (e.g. 1), and breakpointPercentileThreshold (e.g. 95). The threshold controls how aggressively the chunker splits: higher values mean fewer, larger chunks; lower means more, smaller. 95 is a reasonable starting point for procedure-style documents; tune by running a test set of queries and looking at whether retrieved chunks contain the whole answer or half of it.
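
The same vectorIngestionConfiguration carries the chunking block; with the starting values above it looks like:

{
  "vectorIngestionConfiguration": {
    "chunkingConfiguration": {
      "chunkingStrategy": "SEMANTIC",
      "semanticChunkingConfiguration": {
        "maxTokens": 600,
        "bufferSize": 1,
        "breakpointPercentileThreshold": 95
      }
    }
  }
}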

Data source sync. StartIngestionJob kicks off ingestion. For the initial run, this parses, chunks, embeds, and indexes the full corpus (24,000 pages taking typically a few hours end-to-end, mostly in advanced parsing). For subsequent runs, Bedrock diffs against the last manifest and only re-processes changed files. An EventBridge Scheduler rule running StartIngestionJob nightly (or hourly if updates are frequent) keeps the index current.
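
The job itself takes a tiny request – the data source ID here is a placeholder – and the same body works whether a human or an EventBridge Scheduler rule issues it:

{
  "knowledgeBaseId": "KB-FACILITIES",
  "dataSourceId": "DS-MANUALS",
  "description": "Nightly incremental sync from the SharePoint mirror"
}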

The retrieval call. RetrieveAndGenerate takes the question text and a configuration:

{
  "input": {"text": "How do I reset the chiller on floor 4?"},
  "retrieveAndGenerateConfiguration": {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
      "knowledgeBaseId": "KB-FACILITIES",
      "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
      "retrievalConfiguration": {
        "vectorSearchConfiguration": {
          "numberOfResults": 8,
          "overrideSearchType": "HYBRID",
          "filter": {
            "equals": {
              "key": "building",
              "value": "B"
            }
          }
        }
      },
      "generationConfiguration": {
        "promptTemplate": {
          "textPromptTemplate": "You are a facilities-engineering assistant. Answer the engineer's question using only the provided manual excerpts. If the excerpts don't cover it, say 'I don't have that procedure in the indexed manuals' rather than guessing. Always cite the source as [manual name, page N, section if given]. Be concise; engineers are often on the floor. Question: $query$\n\nManual excerpts:\n$search_results$"
        },
        "inferenceConfig": {
          "textInferenceConfig": {
            "maxTokens": 800,
            "temperature": 0.1
          }
        }
      }
    }
  }
}

Three points on that call. The filter constrains retrieval to metadata matching building=B, which works because the ingestion step attached building metadata via .metadata.json sidecars in S3 (each PDF has a companion JSON listing building, equipment type, manufacturer). The custom promptTemplate is the refusal-and-citation instruction; without it the default template does a passable job but not a great one. temperature: 0.1 keeps the model close to the retrieved text – higher values invite embellishment, which for a safety-adjacent domain is exactly what isn’t wanted.
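
A sidecar for the Trane manual might look like the sketch below – it sits next to the PDF in S3, named after the source file with a .metadata.json suffix, and the attribute keys are whatever the team chooses (these are illustrative):

{
  "metadataAttributes": {
    "building": "B",
    "equipment_type": "chiller",
    "manufacturer": "Trane"
  }
}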

A worked question

Ravi is on shift and the chiller alarm is blinking. He opens the engineer tool on his tablet.

Ravi: "Chiller floor 4 fault code E-207, what's the reset procedure?"

The tool calls RetrieveAndGenerate. A second later:

Answer:

Fault code E-207 on the floor-4 chiller (Trane CGAM 050, manual v3.2)
indicates a high-pressure cutout on the refrigerant circuit. Reset procedure:

1. Verify the condenser fans are running and unobstructed.
2. Wait 5 minutes for head pressure to normalise.
3. At the controller panel, press MENU -> DIAGNOSTICS -> CLEAR FAULT.
4. If E-207 returns within 10 minutes, escalate to maintenance
   (do not attempt a second reset).

PPE required: safety glasses only for the reset; refrigerant PPE only
if a circuit must be opened.

Sources:
  [Trane CGAM 050 Manual v3.2, page 142, §7.4]
  [Chiller Fault Code Quick Reference, page 3]
  [Site PPE Matrix - Building B, page 12]

What happened under the hood:

  1. Ravi’s question was embedded with Titan v2.
  2. OpenSearch Serverless ran a hybrid query – the vector matched conceptually to “chiller reset”, the keyword half matched “E-207” exactly – returning 8 chunks.
  3. The top chunks were §7.4 of the Trane manual, the entry for E-207 in the quick-reference, and the PPE matrix section for Building B.
  4. Claude Sonnet saw the chunks, the custom prompt template, and produced a grounded answer with the three citations. Every cited fact came from the retrieved text.
  5. The tool rendered clickable citations that link back to the S3 URI and page number of each source document. Tapping “[Trane CGAM 050 Manual v3.2, page 142, §7.4]” opens the PDF at that page.
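
The retrieved-references structure those citations come from looks roughly like this, trimmed to one reference (the exact metadata keys – including the page-number key – depend on the parser and ingestion setup):

{
  "output": {
    "text": "Fault code E-207 on the floor-4 chiller ... escalate to maintenance."
  },
  "citations": [
    {
      "retrievedReferences": [
        {
          "content": { "text": "7.4 Clearing a high-pressure cutout (E-207): verify the condenser fans are running ..." },
          "location": {
            "type": "S3",
            "s3Location": { "uri": "s3://facilities-manuals/trane-cgam-050-v3.2.pdf" }
          },
          "metadata": {
            "building": "B",
            "x-amz-bedrock-kb-document-page-number": 142
          }
        }
      ]
    }
  ]
}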

Round trip: ~2 seconds. The model didn’t invent a fault code (it exists), didn’t invent a page number (it matches the source), and didn’t skip the PPE step (retrieval surfaced the site matrix). Time-to-answer went from roughly ten minutes to a few seconds.

Edge cases the configuration handles

The scanned manual. An older Siemens drive manual is a scan, not a text PDF. Without advanced parsing, its chunks would be near-empty and it would be invisible to retrieval. With advanced parsing, Claude extracts the text from each page image; the chunks carry real content. The OCR quality is imperfect on handwritten annotations, but the typewritten body text extracts cleanly enough for questions about it to land on its pages.

The multi-building filter. Some procedures differ between Building A and Building B (different equipment models, different PPE requirements). Each PDF has a .metadata.json sidecar specifying which building it applies to. The retrieval call’s filter constrains to the engineer’s current building, so “what’s the PPE for confined-space entry?” returns the Building B matrix, not Building A’s.

The torque-spec table. A question like “what’s the torque spec on the compressor mount?” hits a table in the manual. Default parsing would have flattened the table row-by-row and split it across chunks; advanced parsing preserves it. Semantic chunking keeps the table intact within one chunk. The retrieved chunk contains the full table; the model extracts the right row based on the question’s equipment reference.

The no-answer case. An engineer asks “what’s the torque spec on the new HVAC from SupplierCo?” – but the SupplierCo HVAC was installed last week and its manual hasn’t been added yet. Hybrid retrieval returns low-relevance chunks. The custom prompt template’s instruction – “If the excerpts don’t cover it, say ‘I don’t have that procedure in the indexed manuals’ rather than guessing” – kicks in, and the model refuses gracefully, prompting the engineer to add the manual or call the supplier.

What’s worth remembering

  1. Bedrock Knowledge Bases is end-to-end managed RAG. Point it at S3, configure it, call RetrieveAndGenerate. Ingestion, embedding, storage, retrieval, and generation plumbed for you.
  2. Advanced parsing is the right default for document-heavy corpora. It costs real money at ingestion but turns scans into text and preserves tables and layout. Defaults lose all of that.
  3. Semantic chunking respects document structure. Fixed-size chunking splits tables and procedure lists at arbitrary points. Semantic chunking aligns boundaries with the document’s own sections and paragraphs.
  4. Hybrid search beats vector-only when exact terms matter. Part numbers, equipment tags, fault codes – keyword BM25 gets these right; pure vector search can miss them when surface forms don’t match.
  5. Metadata filters are the scoping lever. Sidecar .metadata.json files attach structured attributes to each document; retrieval calls can filter by any of them. This is how you get per-building, per-equipment, per-role retrieval from one index.
  6. Custom prompt templates are where refusal and citation behaviour lives. The default is adequate; a custom template is where you instruct the model to say “I don’t know” instead of inventing, and to format citations the way your UI expects.
  7. Ingestion cost is one-off plus incremental; query cost is per-call. The big bill is the initial advanced-parsing pass over the whole corpus. Subsequent updates only re-parse changed documents; queries are standard Bedrock on-demand.
  8. Invocation logging + CloudTrail + KMS keep the governance story complete. Every RetrieveAndGenerate call emits CloudTrail; invocation logs capture full prompt and response to S3 under a customer-managed key; Knowledge Base IAM is a separate policy from the underlying model policy.

A working facilities chatbot isn’t a single configuration choice – it’s six of them, each justified by the shape of the corpus. Advanced parsing and hybrid retrieval are the two that shift this build from “it mostly works” to “engineers trust it on the floor.” The others are close to defaults, and that’s fine: the defaults exist because they’re sensible starting points. The craft is knowing which defaults to change.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.