Scope of Task 1.5
Task 1.5 in the AIP‑C01 guide covers six core skills for retrieval mechanisms that augment FMs, typically in RAG or knowledge‑base architectures.
These skills span how you segment documents, generate embeddings, search over vectors, handle queries, and integrate retrieval consistently with FMs and agents.
- 1.5.1: Document segmentation
- 1.5.2: Embedding model selection and configuration
- 1.5.3: Vector search deployment and configuration
- 1.5.4: Advanced search architectures (semantic, hybrid, reranking)
- 1.5.5: Query handling systems
- 1.5.6: Consistent access mechanisms (APIs, function calling, MCP)
1.5.1 Document segmentation (chunking)
Document segmentation is how you break source content into units (chunks) that can be embedded and retrieved effectively.
The goal is to preserve enough context for the FM to answer questions, while keeping chunks small enough to index, retrieve, and fit in the context window.
- Fixed-size chunking
  - Example: break text into 512–1,000 tokens with some overlap (e.g., 10–20%) to avoid cutting off important sentences (a minimal sketch follows this list).
  - Often implemented via Lambda or Bedrock-side chunking (e.g., Knowledge Bases, Titan-based workflows).
- Structure-aware / hierarchical chunking
  - Use document structure: headings, sections, paragraphs, tables.
  - Example: for PDFs or manuals, chunk per section or subsection and include the section title and path as metadata.
- Modality-specific chunking
  - Text: tokens or sentences.
  - Tables: row- or cell-level with column headers as metadata.
  - Code: function- or class-level, preserving file path and repository metadata.
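A minimal fixed-size chunking sketch in Python, assuming whitespace tokens as a stand-in for model tokenizer tokens; the chunk size and overlap values are illustrative, not prescriptive:

```python
# Fixed-size chunking with overlap (sketch). Whitespace "tokens" stand in for
# model tokenizer tokens; chunk_size and overlap are illustrative defaults.
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[dict]:
    tokens = text.split()
    step = chunk_size - overlap  # assumes overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append({
            "text": " ".join(window),
            "metadata": {"start_token": start, "num_tokens": len(window)},
        })
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Structure-aware chunking replaces the fixed window with splits on headings or sections and carries the section title and path in the metadata field instead of token offsets.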
Key design considerations for exam questions:
- Retrieval quality
  - Chunks that are too large → irrelevant content and context-window bloat.
  - Chunks that are too small → retrieval misses context; the FM hallucinates or answers partially.
- Latency and cost
  - More chunks → more embeddings, a larger index, and higher cost.
  - Overlap reduces information loss but increases storage and embedding cost.
- Tools and AWS services
  - Bedrock Knowledge Bases offers built-in chunking options (e.g., fixed-size and hierarchical), while custom chunking logic typically runs in Lambda or the ingestion pipeline.
Exam‑style pattern: when the question mentions “long technical manuals”, “hierarchical documentation”, or “legal documents with sections”, prefer structure‑aware or hierarchical chunking over naive fixed‑size splits.
1.5.2 Embedding solutions for semantic search
Embedding models convert text (and sometimes other modalities) into vectors so you can perform semantic search.
Task 1.5 expects you to choose and configure embedding solutions that match domain, latency, and cost requirements.
- Model choice (see the code sketch at the end of this subsection)
  - Amazon Titan Text Embeddings (via Bedrock) with varying dimensionalities optimized for semantic search.
  - Third-party embeddings via Bedrock (e.g., Cohere) where appropriate.
- Dimensionality and index implications
  - Higher dimensions can capture richer semantics but increase storage, index size, and compute.
  - Lower dimensions may be sufficient for narrow domains or low-latency workloads.
- Domain adaptation
  - General-purpose embeddings for broad knowledge bases.
  - Domain-tuned embeddings (if available) for legal, medical, or code retrieval.
- Batch and pipeline design
  - For large corpora, generate embeddings in batched ingestion workflows rather than one document at a time.
Exam-style patterns:
- “Optimize semantic similarity search” → choose Titan or similar semantic embedding models.
- “Need low latency and cost for large corpus” → consider lower-dimension embeddings and efficient batch processing.
- “Multi-language retrieval” → ensure the embedding model supports the relevant languages.
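A minimal sketch of generating an embedding with a Titan text-embedding model through the Bedrock runtime; the model ID, dimensions value, and request fields are assumptions to verify against the current Bedrock documentation for your Region:

```python
# Sketch: text embedding via Amazon Bedrock (Titan Text Embeddings).
# Model ID and request fields are assumptions; check the model card for your Region.
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str, dimensions: int = 512) -> list[float]:
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed model ID
        body=json.dumps({"inputText": text, "dimensions": dimensions}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

vector = embed_text("How do I rotate my access keys?")
```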
1.5.3 Vector search deployment
Vector search stores and indexes embeddings to support nearest‑neighbor retrieval.
Task 1.5 wants you to recognize when and how to use managed options vs. custom vector databases.
- Amazon Bedrock Knowledge Bases
- Amazon OpenSearch Service with vector search
- Amazon Aurora PostgreSQL with pgvector
- Third-party or self-managed vector DBs
  - Often deployed on ECS/EKS for advanced use cases.
- Index layout and sharding
  - Separate indices by domain, tenant, or data sensitivity.
  - Size shards based on corpus size and QPS to avoid hot shards.
- Metadata storage (an index-and-store sketch follows this subsection)
  - Use index metadata fields for document IDs, titles, timestamps, access control, and hierarchy.
  - For S3-backed documents, store the S3 URI and object version for traceability.
- Consistency and updates
  - Decide on eventual vs. strong consistency; most vector stores are eventually consistent.
  - When documents update, regenerate embeddings and update the index in a background workflow (e.g., EventBridge + Lambda + Step Functions).
Exam‑style pattern: “needs hybrid keyword and semantic search with aggregations” → OpenSearch with vector search.
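A minimal sketch of creating a k-NN-enabled OpenSearch index and storing one chunk's vector alongside its metadata; the endpoint, index name, field names, dimension, and omitted authentication are placeholders and assumptions:

```python
# Sketch: k-NN index mapping plus one indexed chunk in OpenSearch.
# Host, index name, field names, and dimension are placeholders; SigV4 auth omitted.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Index mapping: one knn_vector field plus metadata fields for filtering and traceability.
client.indices.create(
    index="product-docs",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 512},
                "text": {"type": "text"},
                "title": {"type": "keyword"},
                "tenant_id": {"type": "keyword"},
                "s3_uri": {"type": "keyword"},
                "doc_version": {"type": "keyword"},
            }
        },
    },
)

# Store one chunk; the embedding would come from the model chosen in 1.5.2.
client.index(
    index="product-docs",
    body={
        "embedding": [0.0] * 512,  # placeholder vector
        "text": "To rotate access keys, create a new key, update clients, then deactivate the old key.",
        "title": "IAM key rotation",
        "tenant_id": "acme",
        "s3_uri": "s3://example-bucket/manuals/iam.pdf",
        "doc_version": "3",
    },
)
```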
1.5.4 Advanced search architectures
Advanced search architectures combine semantic search with traditional signals and re‑ranking to improve relevance.
The exam expects you to identify hybrid and multi‑stage retrieval patterns rather than simple top‑k cosine similarity.
- Semantic search
  - Pure vector search: query → embedding → nearest neighbors.
  - Useful when queries are natural language and terms may not match document wording exactly.
- Hybrid search (lexical + semantic)
  - Combine BM25/keyword search with vector similarity, often in OpenSearch (see the sketch after this list).
  - Example: filter by keyword or metadata first, then apply vector similarity for ranking.
- Filtered / faceted search
  - Use metadata filters (e.g., tenant ID, region, document type, time range) before or alongside vector similarity.
  - Critical for multi-tenant and compliance-bound systems.
- Re-ranking (second-stage retrieval)
  - Retrieve a larger candidate set with vector or hybrid search, then reorder the top results with a reranking model (for example, a rerank model available through Bedrock) before passing context to the FM.
- Multi-index / multi-source retrieval
  - Search multiple indices (e.g., product docs, FAQs, tickets) and unify results.
  - Optionally include index-type metadata so the FM knows the source.
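A minimal sketch of a hybrid query that combines a BM25 match clause, a k-NN clause, and a tenant metadata filter in a single OpenSearch bool query; the endpoint, index, and field names are placeholders, and support for nesting knn inside a bool query depends on the OpenSearch version and k-NN engine in use:

```python
# Sketch: hybrid (lexical + semantic) search with a hard tenant filter in OpenSearch.
# Placeholders throughout; verify knn-in-bool support for your OpenSearch version/engine.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def hybrid_search(query_text: str, query_vector: list[float], tenant_id: str, k: int = 5) -> list[dict]:
    body = {
        "size": k,
        "query": {
            "bool": {
                "filter": [{"term": {"tenant_id": tenant_id}}],  # hard metadata filter
                "should": [
                    {"match": {"text": query_text}},  # lexical (BM25) signal
                    {"knn": {"embedding": {"vector": query_vector, "k": 25}}},  # semantic signal
                ],
            }
        },
    }
    response = client.search(index="product-docs", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```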
Exam-style patterns:
- “Combine keyword filters with semantic relevance” → hybrid search in OpenSearch or similar.
- “Improve accuracy of top results from vector search” → add a reranking stage.
- “Need tenant-aware retrieval with strict isolation” → metadata filters on tenant ID + separate indices.
1.5.5 Query handling systems
Query handling is about transforming and orchestrating user queries so retrieval returns the most useful context.
Task 1.5 emphasizes query expansion, decomposition, routing, and pre‑/post‑processing before calling retrieval.
- Query expansion and reformulation
  - Use an FM on Bedrock to rewrite or expand queries with synonyms, related terms, or clarifications (see the sketch at the end of this subsection).
  - Helpful when user input is short, ambiguous, or uses jargon not common in the corpus.
- Query decomposition
  - Break complex, multi-part questions into simpler sub-queries (via Lambda or agents).
  - Retrieve context separately for each sub-question and then synthesize in the FM.
- Query classification and routing
  - Classify incoming queries into types: FAQ, troubleshooting, policy, creative, etc.
  - Route to different vector indices, models, or even non-RAG paths when no retrieval is needed.
- Query constraints and safety filters
  - Apply authorization filters based on user identity, role, tenant, or data classification before retrieval.
  - Enforce governance (e.g., block queries for restricted topics).
- Workflow orchestration
  - Coordinate multi-step query handling (classify → expand or decompose → retrieve → synthesize) with Lambda, Step Functions, or Bedrock Agents.
Exam-style patterns:
- “Need to handle complex multi-step questions” → query decomposition and multi-stage retrieval.
- “Improve result quality without retraining models” → query expansion and better retrieval, not fine-tuning.
- “Need to dynamically choose between multiple knowledge sources” → query routing.
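A minimal query-rewriting sketch using an FM through the Bedrock Converse API; the model ID and prompt wording are illustrative assumptions:

```python
# Sketch: query expansion/rewriting with an FM via the Bedrock Converse API.
# The model ID is an assumption; use any Bedrock text model your account can access.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def expand_query(user_query: str) -> str:
    prompt = (
        "Rewrite the following search query so it retrieves better results from a "
        "technical knowledge base. Add likely synonyms and expand abbreviations. "
        f"Return only the rewritten query.\n\nQuery: {user_query}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```

The same pattern extends to decomposition (ask for a list of sub-queries) and classification (ask for a label that drives routing).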
1.5.6 Consistent access mechanisms to FMs and retrieval
Consistent access mechanisms ensure FMs can reliably call retrieval tools, vector stores, and other backends in a uniform way.
The exam links this to function calling, tool patterns, and emerging protocols like MCP.
- Function calling interfaces
  - Define tools (e.g., “search_knowledge_base”, “get_customer_profile”) that an FM can call with structured arguments (a tool-definition sketch follows this list).
  - Backends then perform vector search or DB lookups and return structured results for the FM to reason over.
- Model Context Protocol (MCP)
  - Standardizes how clients and servers expose tools and resources to models.
  - MCP servers can wrap vector search, databases, or APIs; MCP clients provide a consistent integration layer.
- Standardized retrieval APIs
  - Expose retrieval via REST/GraphQL endpoints with consistent request/response schemas.
  - Example: POST /search with query, filters, and top_k; the response includes chunks, scores, and metadata.
- Agents and tool orchestration
  - Agent frameworks such as Amazon Bedrock Agents can invoke retrieval as an action or an attached knowledge base, letting the FM decide when to retrieve.
- Consistency across models
  - Design APIs so different FMs can call the same retrieval mechanism without tight coupling to a particular model.
- Observability
  - Log tool calls, latency, retrieved documents, and FM reasoning to debug retrieval quality.
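A minimal sketch of exposing retrieval as a function-calling tool with the Bedrock Converse API; the tool name, input schema, and model ID are illustrative assumptions:

```python
# Sketch: registering a retrieval tool for function calling via the Bedrock Converse API.
# Tool name, schema fields, and model ID are illustrative assumptions.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "search_knowledge_base",
                "description": "Retrieve relevant document chunks for a user question.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "Search query"},
                            "top_k": {"type": "integer", "description": "Number of chunks to return"},
                        },
                        "required": ["query"],
                    }
                },
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
    messages=[{"role": "user", "content": [{"text": "How do I rotate access keys?"}]}],
    toolConfig=tool_config,
)
# If the model chooses to call the tool, the response contains a toolUse block whose
# input matches the schema above; the application runs the search and sends the results
# back in a follow-up toolResult message.
```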
Exam-style patterns:
- “Need standardized access to vector search for multiple agents and FMs” → MCP or standardized function-calling tools.
- “Need to expose retrieval as a reusable building block across apps” → dedicated retrieval microservice or MCP server.
Putting it together: end‑to‑end retrieval for FM augmentation
A complete retrieval mechanism for FM augmentation typically follows this flow:
- Ingestion and segmentation
  - Extract content from sources (S3, RDS, Confluence, etc.).
  - Chunk using fixed-size or structure-aware methods; attach rich metadata.
- Embedding generation
  - Choose an embedding model (e.g., Amazon Titan) and generate vectors, preferably in batched workflows.
- Vector storage and indexing
  - Store vectors and metadata in a vector store (Bedrock Knowledge Bases, OpenSearch, Aurora pgvector, or a third-party DB).
- Query handling
  - Classify, expand, or decompose user queries as needed, applying filters and access controls.
- Retrieval and ranking
  - Perform semantic or hybrid search, apply filters, and optionally rerank with a reranking model.
- FM augmentation
  - Build a prompt that integrates retrieved chunks (with citations and metadata) and send it to the FM (a prompt-assembly sketch follows this list).
- Access pattern standardization
  - Expose the retrieval pipeline via tools, function calls, or MCP so multiple agents and applications can use it consistently.
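A minimal end-to-end augmentation sketch that assembles a prompt from retrieved chunks and calls an FM via the Bedrock Converse API; the chunk fields, model ID, and prompt wording are illustrative assumptions:

```python
# Sketch: building a RAG prompt from retrieved chunks and calling an FM on Bedrock.
# Chunk fields, model ID, and prompt wording are illustrative assumptions.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def answer_with_context(question: str, chunks: list[dict]) -> str:
    # Each chunk is assumed to carry "text" plus metadata such as "title" and "s3_uri".
    context = "\n\n".join(
        f"[{i + 1}] {c.get('title', 'untitled')} ({c.get('s3_uri', 'unknown source')})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    prompt = (
        "Answer the question using only the numbered context passages below. "
        "Cite passages by number and say so if the context is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```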
This end‑to‑end design is what AIP‑C01 Task 1.5 is testing: your ability to choose the right chunking, embeddings, vector search, and query‑handling approaches to deliver high‑quality FM‑augmented answers on AWS.