Retrieval-Augmented Generation (RAG) connects large language models to your organization’s knowledge. Done well, it transforms how teams access information—turning months of documentation into an assistant that answers questions in seconds. Done poorly, it generates confident-sounding nonsense that erodes trust faster than having no system at all.
This guide covers the architecture decisions and implementation patterns that separate production RAG systems from impressive demos. We’ll focus on the choices that matter most: how you chunk documents, select embeddings, tune retrieval, and handle the inevitable edge cases.
Core Architecture
A production RAG system has three main stages, each with its own failure modes and optimization opportunities:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Ingestion    │ ──► │    Retrieval    │ ──► │   Generation    │
│                 │     │                 │     │                 │
│ • Parse docs    │     │ • Embed query   │     │ • Build prompt  │
│ • Chunk text    │     │ • Vector search │     │ • Call LLM      │
│ • Generate      │     │ • Re-rank       │     │ • Format output │
│   embeddings    │     │ • Filter        │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Ingestion happens offline (or near-real-time for dynamic content). This is where you process source documents into a searchable format. Mistakes here propagate through the entire system.
Retrieval happens at query time. Given a user question, you find the most relevant chunks from your corpus. This is usually the bottleneck for answer quality—if you retrieve the wrong context, even the best LLM can’t give a good answer.
Generation synthesizes an answer from retrieved context. This is the most visible part of the system, but often the least impactful to optimize. Get retrieval right first.
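In code, the query-time half of this pipeline is a short composition. A minimal sketch, with the embedding, search, and LLM calls passed in as callables (the names here are illustrative, not any particular library's API):

```python
from typing import Callable, List

def answer_query(
    query: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float]], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Embed the question, retrieve relevant chunks, then generate from them."""
    chunks = search(embed(query))
    context = "\n---\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Each callable is a swap point: you can change the embedding model, the vector store, or the LLM without touching the rest of the pipeline.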
Document Processing and Chunking
How you split documents into chunks is one of the highest-leverage decisions in RAG system design. The goal is to create chunks that are:
- Self-contained: Each chunk should make sense on its own, without requiring surrounding context
- Appropriately sized: Large enough to contain useful information, small enough to be specific
- Semantically coherent: Don’t split in the middle of a thought, code block, or logical section
Naive Chunking (And Why It Fails)
The simplest approach—splitting on character count or paragraph breaks—creates predictable problems:
```python
# Don't do this in production
def naive_chunk(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
```
This breaks sentences mid-word, splits code examples across chunks, and ignores document structure entirely. The resulting chunks are hard to match against user queries because they lack semantic coherence.
Structure-Aware Chunking
Better approaches respect document structure:
```python
import re
from typing import List

def get_last_paragraph(text: str, max_chars: int) -> str:
    """Return the final paragraph of `text`, capped at `max_chars` characters."""
    last = text.strip().split("\n\n")[-1]
    return last[-max_chars:]

def structure_aware_chunk(
    text: str,
    max_chunk_size: int = 1500,
    min_chunk_size: int = 100,
    overlap: int = 200
) -> List[str]:
    """
    Chunk text while respecting structural boundaries.

    Prioritizes splitting at: headers > paragraphs > sentences
    """
    # Split on markdown headers first
    header_pattern = r'\n(?=#{1,6} )'
    sections = re.split(header_pattern, text)

    chunks = []
    current_chunk = ""

    for section in sections:
        # If section fits, add to current chunk
        if len(current_chunk) + len(section) <= max_chunk_size:
            current_chunk += section
        else:
            # Save current chunk if it meets minimum size
            if len(current_chunk) >= min_chunk_size:
                chunks.append(current_chunk.strip())

            # Start new chunk, potentially with overlap
            if overlap > 0 and current_chunk:
                # Include last paragraph of previous chunk
                overlap_text = get_last_paragraph(current_chunk, overlap)
                current_chunk = overlap_text + section
            else:
                current_chunk = section

    # Don't forget the last chunk
    if current_chunk and len(current_chunk) >= min_chunk_size:
        chunks.append(current_chunk.strip())

    return chunks
```
Chunk Size Tradeoffs
There’s no universally optimal chunk size—it depends on your content and queries:
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (200-500 tokens) | Precise retrieval, lower cost | May lack context, more chunks to search | FAQ-style queries, specific facts |
| Medium (500-1000 tokens) | Good balance | Jack of all trades | General-purpose systems |
| Large (1000-2000 tokens) | More context per chunk | Less precise, higher cost | Complex topics, narrative content |
We typically start with 800-1200 tokens and adjust based on retrieval quality metrics.
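Note that these sizes are in tokens, while the chunkers shown here work in characters. A rough conversion, assuming roughly 4 characters per English token (this varies by tokenizer and content), is usually good enough for sizing:

```python
def tokens_to_chars(n_tokens: int, chars_per_token: float = 4.0) -> int:
    """Approximate character budget for a token budget (English-prose heuristic)."""
    return int(n_tokens * chars_per_token)

# An 800-1200 token target corresponds to roughly a 3200-4800 character max_chunk_size
```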
Handling Special Content
Real documents contain more than prose. Handle these explicitly:
Code blocks: Keep code together. A function split across chunks is useless.
```python
def preserve_code_blocks(text: str) -> List[str]:
    """Extract code blocks before chunking, reinsert after."""
    code_pattern = r'```[\s\S]*?```'
    code_blocks = re.findall(code_pattern, text)
    placeholder_text = re.sub(code_pattern, '[[CODE_BLOCK]]', text)

    # Chunk the placeholder text, then restore code blocks in order of appearance
    blocks = iter(code_blocks)
    return [
        re.sub(r'\[\[CODE_BLOCK\]\]', lambda _: next(blocks), chunk)
        for chunk in structure_aware_chunk(placeholder_text)
    ]
```
Tables: Either keep tables intact or convert to a more retrievable format (key-value pairs, natural language descriptions).
Lists: Don’t split numbered lists mid-sequence. The item “3. Configure the settings” is meaningless without items 1 and 2.
Embedding Selection and Optimization
Your embedding model translates text into vectors that capture semantic meaning. The right choice depends on your content, languages, and infrastructure constraints.
Model Comparison
| Model | Dimensions | Strengths | Considerations |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Strong general performance, easy API | Per-token cost, data leaves your infra |
| OpenAI text-embedding-3-small | 1536 | Lower cost, still solid | Slightly lower quality |
| Cohere embed-v3 | 1024 | Excellent multilingual, compression options | Different API patterns |
| BGE-large-en-v1.5 | 1024 | Self-hosted, no API costs | Requires GPU infrastructure |
| E5-large-v2 | 1024 | Strong benchmark performance | Needs instruction prefixes |
The Case for Self-Hosting
If you’re processing sensitive documents or need to control costs at scale, self-hosted embeddings are worth considering:
```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

# Load model once, reuse for all embeddings
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

def embed_texts(texts: List[str]) -> np.ndarray:
    """Generate embeddings for a batch of texts."""
    return model.encode(
        texts,
        normalize_embeddings=True,  # Important for cosine similarity
        show_progress_bar=True
    )
```
A single A10 GPU can embed thousands of documents per minute, making the infrastructure cost trivial compared to API pricing at scale.
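One payoff of `normalize_embeddings=True` above: with unit-length vectors, cosine similarity reduces to a dot product, so a brute-force search over the corpus is a single matrix multiply. A sketch (fine for corpora up to a few hundred thousand chunks before you need an index):

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar documents.

    Assumes all vectors are unit-normalized, so dot product == cosine similarity.
    """
    scores = doc_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]  # highest scores first
```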
Query-Document Asymmetry
Some embedding models perform better when you process queries differently from documents. E5 models, for example, expect prefixes:
```python
# For E5 models
document_text = "passage: " + original_document
query_text = "query: " + user_question
```
Check your model’s documentation. Skipping these prefixes can significantly hurt retrieval quality.
Vector Storage and Retrieval
Once you have embeddings, you need somewhere to store them and a way to search efficiently.
Database Options
Pinecone, Weaviate, Qdrant: Purpose-built vector databases with managed hosting options. Good for getting started quickly.
PostgreSQL + pgvector: Add vector search to your existing Postgres database. Simpler operations if you’re already running Postgres.
```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),  -- Match your model's dimensions
    metadata JSONB
);

-- Create an index for fast similarity search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
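Querying uses pgvector's distance operators; `<=>` is cosine distance, so ordering ascending returns the closest matches. A sketch (the literal vector would normally be bound as a parameter by your client library):

```sql
-- Find the 5 chunks closest to a query embedding
SELECT content, metadata
FROM documents
ORDER BY embedding <=> '[0.011, -0.024, 0.173]'::vector
LIMIT 5;
```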
Elasticsearch with dense vectors: If you’re already using Elasticsearch, it supports vector search alongside traditional keyword search.
Hybrid Search
Pure semantic search misses exact matches. Pure keyword search misses semantic relationships. Hybrid search combines both:
```python
def hybrid_search(
    query: str,
    vector_results: List[Document],
    keyword_results: List[Document],
    vector_weight: float = 0.7
) -> List[Document]:
    """
    Combine vector and keyword search results.

    Uses Reciprocal Rank Fusion (RRF) for score combination.
    Document and get_document come from your retrieval layer.
    """
    k = 60  # RRF constant
    scores = {}

    for rank, doc in enumerate(vector_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + vector_weight / (k + rank)

    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - vector_weight) / (k + rank)

    # Sort by combined score
    ranked_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_document(doc_id) for doc_id in ranked_ids]
```
This approach catches queries where the user uses exact terminology from the documents (keyword match) and queries where they describe concepts in different words (semantic match).
Re-ranking
Initial retrieval optimizes for speed. Re-ranking optimizes for relevance on a smaller candidate set:
```python
from typing import List

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, documents: List[str], top_k: int = 5) -> List[str]:
    """Re-rank documents using a cross-encoder for better relevance."""
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
```
Cross-encoders are slower than bi-encoders (they process query-document pairs rather than independent embeddings) but significantly more accurate. Use them on your top 20-50 candidates to select the final 3-5 for generation.
Prompt Engineering for Generation
With good chunks retrieved, you need a prompt that helps the LLM use them effectively.
Basic RAG Prompt Structure
```
You are a helpful assistant answering questions about [DOMAIN].
Use ONLY the information provided in the context below to answer.
If the context doesn't contain enough information to answer fully, say so.

Context:
---
{retrieved_chunks}
---

Question: {user_question}

Instructions:
- Answer based solely on the provided context
- If you're uncertain, express that uncertainty
- If the context doesn't address the question, say "I don't have information about that"
- Cite specific sections when possible
```
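Filling a template like this in code is a simple join. A minimal sketch (the function and parameter names are illustrative):

```python
from typing import List

PROMPT_TEMPLATE = """You are a helpful assistant answering questions about {domain}.
Use ONLY the information provided in the context below to answer.
If the context doesn't contain enough information to answer fully, say so.

Context:
---
{context}
---

Question: {question}"""

def build_prompt(question: str, chunks: List[str], domain: str) -> str:
    """Assemble the RAG prompt from retrieved chunks."""
    return PROMPT_TEMPLATE.format(
        domain=domain,
        context="\n\n".join(chunks),
        question=question,
    )
```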
Handling Multiple Sources
When chunks come from different documents, help the LLM understand the structure:
```
Context from multiple sources:

[Source: Employee Handbook, Section 4.2]
Vacation requests must be submitted at least two weeks in advance...

[Source: HR Policy Update, March 2024]
The two-week notice requirement may be waived for emergencies...

[Source: Manager Guidelines]
Managers have discretion to approve urgent time-off requests...
```
Preventing Hallucination
LLMs want to be helpful, sometimes to a fault. They’ll confabulate information rather than admit ignorance. Countermeasures:
- Explicit instructions: "If the context doesn't contain the answer, say so" (as shown above)
- Lower temperature: Reduce randomness to keep responses grounded
- Citation requirements: "Quote the specific text that supports your answer"
- Confidence signals: Ask the model to rate its confidence, then filter low-confidence responses
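The confidence-signal approach needs a parsing step on your side. A sketch, assuming you instruct the model to end every response with a line like `Confidence: 4/5` (the format is an illustrative convention, not a model feature):

```python
import re
from typing import Optional

def extract_confidence(answer: str) -> Optional[int]:
    """Parse a trailing 'Confidence: N/5' line the model was asked to append."""
    match = re.search(r"Confidence:\s*(\d)\s*/\s*5\s*$", answer.strip())
    return int(match.group(1)) if match else None

def accept_answer(answer: str, min_confidence: int = 3) -> bool:
    confidence = extract_confidence(answer)
    # Treat a missing rating as low confidence rather than guessing
    return confidence is not None and confidence >= min_confidence
```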
Evaluation and Monitoring
You can’t improve what you don’t measure. RAG systems need evaluation at multiple stages.
Retrieval Metrics
Recall@K: Of the relevant documents, how many appear in the top K results?
Precision@K: Of the top K results, how many are actually relevant?
Mean Reciprocal Rank (MRR): How high does the first relevant result appear?
```python
from typing import Dict, List

def calculate_mrr(queries: List[str], relevance_judgments: Dict[str, List[str]]) -> float:
    """Calculate Mean Reciprocal Rank across queries."""
    reciprocal_ranks = []
    for query in queries:
        results = retrieve(query, top_k=10)  # your retrieval function
        relevant = relevance_judgments.get(query, [])
        for rank, doc in enumerate(results, 1):
            if doc.id in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
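Recall@K and Precision@K can be computed against the same relevance judgments. A sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` the set of IDs judged relevant for the query:

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k
```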
End-to-End Metrics
Answer correctness: Does the generated answer match a ground-truth answer? (Requires a test set)
Faithfulness: Is the answer supported by the retrieved context? (Can be evaluated with LLM-as-judge)
Relevance: Does the answer address what the user asked?
Production Monitoring
Log everything you’ll need to debug issues:
```python
import json
import logging
from datetime import datetime, timezone
from typing import List, Optional

logger = logging.getLogger(__name__)

def log_rag_request(
    query: str,
    retrieved_chunks: List[str],
    retrieval_scores: List[float],
    generated_answer: str,
    latency_ms: float,
    user_feedback: Optional[str] = None
):
    """Log RAG request for monitoring and debugging."""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "num_chunks_retrieved": len(retrieved_chunks),
        "top_retrieval_score": max(retrieval_scores) if retrieval_scores else None,
        "answer_length": len(generated_answer),
        "latency_ms": latency_ms,
        "user_feedback": user_feedback
    }
    # Send to your logging infrastructure
    logger.info(json.dumps(log_entry))
```
Watch for:
- Queries with low retrieval scores (might indicate gaps in your corpus)
- High latency requests (might indicate infrastructure issues)
- Negative user feedback patterns (might indicate systematic problems)
Production Considerations
Latency Budget
RAG adds latency at every stage. A typical breakdown:
| Stage | Latency |
|---|---|
| Query embedding | 50-100ms |
| Vector search | 20-50ms |
| Re-ranking | 100-200ms |
| LLM generation | 500-2000ms |
| Total | 700-2500ms |
If this is too slow, consider:
- Caching frequent queries
- Skipping re-ranking for simple queries
- Using faster (smaller) LLMs for straightforward questions
- Streaming responses to improve perceived latency
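Caching is usually the cheapest of these wins. A minimal in-memory sketch, keyed on the normalized query (a production system would use Redis or similar, with a TTL so cached answers don't go stale after re-indexing):

```python
from typing import Callable, Dict

_cache: Dict[str, str] = {}

def cached_answer(query: str, answer_fn: Callable[[str], str]) -> str:
    """Serve repeated questions from cache; fall through to the RAG pipeline once."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key not in _cache:
        _cache[key] = answer_fn(query)
    return _cache[key]
```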
Cost Modeling
At scale, costs add up. For a system handling 100,000 queries/month:
| Component | Cost Estimate |
|---|---|
| Query embeddings | $5-20 (depends on model) |
| Vector DB hosting | $50-200 |
| LLM generation | $100-500 (depends on model, response length) |
| Re-ranker | $10-50 |
Self-hosting embeddings and using open-source LLMs can reduce costs dramatically, but requires infrastructure investment.
Failure Modes
Plan for these scenarios:
Empty retrieval: No relevant chunks found. Don’t let the LLM hallucinate—return a clear “I don’t have information about that” response.
Contradictory sources: Retrieved chunks disagree. Surface the contradiction to the user rather than arbitrarily picking one.
Outdated information: Your corpus has old data. Include timestamps in chunk metadata and handle version conflicts explicitly.
Context overflow: Retrieved content exceeds the LLM’s context window. Implement truncation strategies that preserve the most relevant information.
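For the context-overflow case, one workable truncation strategy is to keep the highest-scoring chunks that fit a budget while preserving their original order. A sketch using a character budget (in practice you would budget in tokens):

```python
from typing import List

def fit_to_budget(chunks: List[str], scores: List[float], char_budget: int) -> List[str]:
    """Greedily keep the best-scoring chunks that fit, in original document order."""
    by_score = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in by_score:
        if used + len(chunks[i]) <= char_budget:
            kept.add(i)
            used += len(chunks[i])
    # Emit survivors in their original order to keep the context readable
    return [chunks[i] for i in sorted(kept)]
```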
Getting Started
If you’re building your first RAG system:
- Start simple: Basic chunking, a hosted embedding API, and a straightforward prompt. Get something working end-to-end.
- Build an evaluation set: Collect 50-100 realistic queries with expected answers. You'll need this to measure improvements.
- Iterate on retrieval first: Most RAG quality issues are retrieval issues. Don't fine-tune your prompts until you're confident you're retrieving the right context.
- Add complexity gradually: Hybrid search, re-ranking, and advanced chunking strategies each add value, but also add complexity. Measure the impact of each change.
- Plan for maintenance: Documents change. Models improve. Build pipelines that let you re-index and update without starting from scratch.
RAG systems reward iteration. The first version is never the best version—but it’s the foundation everything else builds on.
