Retrieval-Augmented Generation (RAG) connects large language models to your organization’s knowledge. Done well, it transforms how teams access information—turning months of documentation into an assistant that answers questions in seconds. Done poorly, it generates confident-sounding nonsense that erodes trust faster than having no system at all.
This guide covers the architecture decisions and implementation patterns that separate production RAG systems from impressive demos. We’ll focus on the choices that matter most: how you chunk documents, select embeddings, tune retrieval, and handle the inevitable edge cases.
Core Architecture
A production RAG system has three main stages, each with its own failure modes and optimization opportunities:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Ingestion    │ ──► │    Retrieval    │ ──► │   Generation    │
│                 │     │                 │     │                 │
│ • Parse docs    │     │ • Embed query   │     │ • Build prompt  │
│ • Chunk text    │     │ • Vector search │     │ • Call LLM      │
│ • Generate      │     │ • Re-rank       │     │ • Format output │
│   embeddings    │     │ • Filter        │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Ingestion happens offline (or near-real-time for dynamic content). This is where you process source documents into a searchable format. Mistakes here propagate through the entire system.
Retrieval happens at query time. Given a user question, you find the most relevant chunks from your corpus. This is usually the bottleneck for answer quality—if you retrieve the wrong context, even the best LLM can’t give a good answer.
Generation synthesizes an answer from retrieved context. This is the most visible part of the system, but often the least impactful to optimize. Get retrieval right first.
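In code, the query-time half of this pipeline is a short composition. A minimal sketch, with the embedding, search, and LLM calls passed in as callables (the names here are illustrative, not any particular library's API):

```python
from typing import Callable, List

def answer_query(
    query: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float]], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Embed the question, retrieve relevant chunks, then generate from them."""
    chunks = search(embed(query))
    context = "\n---\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Each callable is a swap point: you can change the embedding model, the vector store, or the LLM without touching the rest of the pipeline.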
Document Processing and Chunking
How you split documents into chunks is one of the highest-leverage decisions in RAG system design. The goal is to create chunks that are:
- Self-contained: Each chunk should make sense on its own, without requiring surrounding context
- Appropriately sized: Large enough to contain useful information, small enough to be specific
- Semantically coherent: Don’t split in the middle of a thought, code block, or logical section
Naive Chunking (And Why It Fails)
The simplest approach—splitting on character count or paragraph breaks—creates predictable problems:
```python
# Don't do this in production
def naive_chunk(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
```
This breaks sentences mid-word, splits code examples across chunks, and ignores document structure entirely. The resulting chunks are hard to match against user queries because they lack semantic coherence.
Structure-Aware Chunking
Better approaches respect document structure:
```python
import re
from typing import List

def get_last_paragraph(text: str, max_chars: int) -> str:
    """Return the final paragraph of `text`, capped at `max_chars` characters."""
    last = text.strip().split("\n\n")[-1]
    return last[-max_chars:]

def structure_aware_chunk(
    text: str,
    max_chunk_size: int = 1500,
    min_chunk_size: int = 100,
    overlap: int = 200
) -> List[str]:
    """
    Chunk text while respecting structural boundaries.

    Prioritizes splitting at: headers > paragraphs > sentences
    """
    # Split on markdown headers first
    header_pattern = r'\n(?=#{1,6} )'
    sections = re.split(header_pattern, text)

    chunks = []
    current_chunk = ""

    for section in sections:
        # If section fits, add to current chunk
        if len(current_chunk) + len(section) <= max_chunk_size:
            current_chunk += section
        else:
            # Save current chunk if it meets minimum size
            if len(current_chunk) >= min_chunk_size:
                chunks.append(current_chunk.strip())

            # Start new chunk, potentially with overlap
            if overlap > 0 and current_chunk:
                # Include last paragraph of previous chunk
                overlap_text = get_last_paragraph(current_chunk, overlap)
                current_chunk = overlap_text + section
            else:
                current_chunk = section

    # Don't forget the last chunk
    if current_chunk and len(current_chunk) >= min_chunk_size:
        chunks.append(current_chunk.strip())

    return chunks
```
Chunk Size Tradeoffs
There’s no universally optimal chunk size—it depends on your content and queries:
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (200-500 tokens) | Precise retrieval, lower cost | May lack context, more chunks to search | FAQ-style queries, specific facts |
| Medium (500-1000 tokens) | Good balance | Jack of all trades | General-purpose systems |
| Large (1000-2000 tokens) | More context per chunk | Less precise, higher cost | Complex topics, narrative content |
We typically start with 800-1200 tokens and adjust based on retrieval quality metrics.
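Note that these sizes are in tokens, while the chunkers shown here work in characters. A rough conversion, assuming roughly 4 characters per English token (this varies by tokenizer and content), is usually good enough for sizing:

```python
def tokens_to_chars(n_tokens: int, chars_per_token: float = 4.0) -> int:
    """Approximate character budget for a token budget (English-prose heuristic)."""
    return int(n_tokens * chars_per_token)

# An 800-1200 token target corresponds to roughly a 3200-4800 character max_chunk_size
```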
Handling Special Content
Real documents contain more than prose. Handle these explicitly:
Code blocks: Keep code together. A function split across chunks is useless.
```python
def preserve_code_blocks(text: str) -> List[str]:
    """Extract code blocks before chunking, reinsert after."""
    code_pattern = r'```[\s\S]*?```'
    code_blocks = re.findall(code_pattern, text)
    placeholder_text = re.sub(code_pattern, '[[CODE_BLOCK]]', text)

    # Chunk the placeholder text, then restore code blocks in order of appearance
    blocks = iter(code_blocks)
    return [
        re.sub(r'\[\[CODE_BLOCK\]\]', lambda _: next(blocks), chunk)
        for chunk in structure_aware_chunk(placeholder_text)
    ]
```
Tables: Either keep tables intact or convert to a more retrievable format (key-value pairs, natural language descriptions).
Lists: Don’t split numbered lists mid-sequence. The item “3. Configure the settings” is meaningless without items 1 and 2.
Embedding Selection and Optimization
Your embedding model translates text into vectors that capture semantic meaning. The right choice depends on your content, languages, and infrastructure constraints.
Model Comparison
| Model | Dimensions | Strengths | Considerations |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Strong general performance, easy API | Per-token cost, data leaves your infra |
| OpenAI text-embedding-3-small | 1536 | Lower cost, still solid | Slightly lower quality |
| Cohere embed-v3 | 1024 | Excellent multilingual, compression options | Different API patterns |
| BGE-large-en-v1.5 | 1024 | Self-hosted, no API costs | Requires GPU infrastructure |
| E5-large-v2 | 1024 | Strong benchmark performance | Needs instruction prefixes |
The Case for Self-Hosting
If you’re processing sensitive documents or need to control costs at scale, self-hosted embeddings are worth considering:
```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

# Load model once, reuse for all embeddings
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

def embed_texts(texts: List[str]) -> np.ndarray:
    """Generate embeddings for a batch of texts."""
    return model.encode(
        texts,
        normalize_embeddings=True,  # Important for cosine similarity
        show_progress_bar=True
    )
```
A single A10 GPU can embed thousands of documents per minute, making the infrastructure cost trivial compared to API pricing at scale.
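One payoff of `normalize_embeddings=True` above: with unit-length vectors, cosine similarity reduces to a dot product, so a brute-force search over the corpus is a single matrix multiply. A sketch (fine for corpora up to a few hundred thousand chunks before you need an index):

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar documents.

    Assumes all vectors are unit-normalized, so dot product == cosine similarity.
    """
    scores = doc_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]  # highest scores first
```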
Query-Document Asymmetry
Some embedding models perform better when you process queries differently from documents. E5 models, for example, expect prefixes:
```python
# For E5 models
document_text = "passage: " + original_document
query_text = "query: " + user_question
```
Check your model’s documentation. Skipping these prefixes can significantly hurt retrieval quality.
Vector Storage and Retrieval
Once you have embeddings, you need somewhere to store them and a way to search efficiently.
Database Options
Pinecone, Weaviate, Qdrant: Purpose-built vector databases with managed hosting options. Good for getting started quickly.
PostgreSQL + pgvector: Add vector search to your existing Postgres database. Simpler operations if you’re already running Postgres.
```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),  -- Match your model's dimensions
    metadata JSONB
);

-- Create an index for fast similarity search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
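Querying uses pgvector's distance operators; `<=>` is cosine distance, so ordering ascending returns the closest matches. A sketch (the literal vector would normally be bound as a parameter by your client library):

```sql
-- Find the 5 chunks closest to a query embedding
SELECT content, metadata
FROM documents
ORDER BY embedding <=> '[0.011, -0.024, 0.173]'::vector
LIMIT 5;
```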
Elasticsearch with dense vectors: If you’re already using Elasticsearch, it supports vector search alongside traditional keyword search.
Hybrid Search
Pure semantic search misses exact matches. Pure keyword search misses semantic relationships. Hybrid search combines both:
```python
def hybrid_search(
    query: str,
    vector_results: List[Document],
    keyword_results: List[Document],
    vector_weight: float = 0.7
) -> List[Document]:
    """
    Combine vector and keyword search results.

    Uses Reciprocal Rank Fusion (RRF) for score combination.
    Document and get_document come from your retrieval layer.
    """
    k = 60  # RRF constant
    scores = {}

    for rank, doc in enumerate(vector_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + vector_weight / (k + rank)

    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - vector_weight) / (k + rank)

    # Sort by combined score
    ranked_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_document(doc_id) for doc_id in ranked_ids]
```
This approach catches queries where the user uses exact terminology from the documents (keyword match) and queries where they describe concepts in different words (semantic match).
Re-ranking
Initial retrieval optimizes for speed. Re-ranking optimizes for relevance on a smaller candidate set:
```python
from typing import List

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, documents: List[str], top_k: int = 5) -> List[str]:
    """Re-rank documents using a cross-encoder for better relevance."""
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
```
Cross-encoders are slower than bi-encoders (they process query-document pairs rather than independent embeddings) but significantly more accurate. Use them on your top 20-50 candidates to select the final 3-5 for generation.
Prompt Engineering for Generation
With good chunks retrieved, you need a prompt that helps the LLM use them effectively.
Basic RAG Prompt Structure
```
You are a helpful assistant answering questions about [DOMAIN].
Use ONLY the information provided in the context below to answer.
If the context doesn't contain enough information to answer fully, say so.

Context:
---
{retrieved_chunks}
---

Question: {user_question}

Instructions:
- Answer based solely on the provided context
- If you're uncertain, express that uncertainty
- If the context doesn't address the question, say "I don't have information about that"
- Cite specific sections when possible
```
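Filling a template like this in code is a simple join. A minimal sketch (the function and parameter names are illustrative):

```python
from typing import List

PROMPT_TEMPLATE = """You are a helpful assistant answering questions about {domain}.
Use ONLY the information provided in the context below to answer.
If the context doesn't contain enough information to answer fully, say so.

Context:
---
{context}
---

Question: {question}"""

def build_prompt(question: str, chunks: List[str], domain: str) -> str:
    """Assemble the RAG prompt from retrieved chunks."""
    return PROMPT_TEMPLATE.format(
        domain=domain,
        context="\n\n".join(chunks),
        question=question,
    )
```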
Handling Multiple Sources
When chunks come from different documents, help the LLM understand the structure:
```
Context from multiple sources:

[Source: Employee Handbook, Section 4.2]
Vacation requests must be submitted at least two weeks in advance...

[Source: HR Policy Update, March 2024]
The two-week notice requirement may be waived for emergencies...

[Source: Manager Guidelines]
Managers have discretion to approve urgent time-off requests...
```
Preventing Hallucination
LLMs want to be helpful, sometimes to a fault. They’ll confabulate information rather than admit ignorance. Countermeasures:
- Explicit instructions: "If the context doesn't contain the answer, say so" (as shown above)
- Lower temperature: Reduce randomness to keep responses grounded
- Citation requirements: "Quote the specific text that supports your answer"
- Confidence signals: Ask the model to rate its confidence, then filter low-confidence responses
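The confidence-signal approach needs a parsing step on your side. A sketch, assuming you instruct the model to end every response with a line like `Confidence: 4/5` (the format is an illustrative convention, not a model feature):

```python
import re
from typing import Optional

def extract_confidence(answer: str) -> Optional[int]:
    """Parse a trailing 'Confidence: N/5' line the model was asked to append."""
    match = re.search(r"Confidence:\s*(\d)\s*/\s*5\s*$", answer.strip())
    return int(match.group(1)) if match else None

def accept_answer(answer: str, min_confidence: int = 3) -> bool:
    confidence = extract_confidence(answer)
    # Treat a missing rating as low confidence rather than guessing
    return confidence is not None and confidence >= min_confidence
```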
Evaluation and Monitoring
You can’t improve what you don’t measure. RAG systems need evaluation at multiple stages.
Retrieval Metrics
Recall@K: Of the relevant documents, how many appear in the top K results?
Precision@K: Of the top K results, how many are actually relevant?
Mean Reciprocal Rank (MRR): How high does the first relevant result appear?
```python
from typing import Dict, List

def calculate_mrr(queries: List[str], relevance_judgments: Dict[str, List[str]]) -> float:
    """Calculate Mean Reciprocal Rank across queries."""
    reciprocal_ranks = []
    for query in queries:
        results = retrieve(query, top_k=10)  # your retrieval function
        relevant = relevance_judgments.get(query, [])
        for rank, doc in enumerate(results, 1):
            if doc.id in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
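Recall@K and Precision@K can be computed against the same relevance judgments. A sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` the set of IDs judged relevant for the query:

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k
```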
End-to-End Metrics
Answer correctness: Does the generated answer match a ground-truth answer? (Requires a test set)
Faithfulness: Is the answer supported by the retrieved context? (Can be evaluated with LLM-as-judge)
Relevance: Does the answer address what the user asked?
Production Monitoring
Log everything you’ll need to debug issues:
```python
import json
import logging
from datetime import datetime, timezone
from typing import List, Optional

logger = logging.getLogger(__name__)

def log_rag_request(
    query: str,
    retrieved_chunks: List[str],
    retrieval_scores: List[float],
    generated_answer: str,
    latency_ms: float,
    user_feedback: Optional[str] = None
):
    """Log RAG request for monitoring and debugging."""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "num_chunks_retrieved": len(retrieved_chunks),
        "top_retrieval_score": max(retrieval_scores) if retrieval_scores else None,
        "answer_length": len(generated_answer),
        "latency_ms": latency_ms,
        "user_feedback": user_feedback
    }
    # Send to your logging infrastructure
    logger.info(json.dumps(log_entry))
```
Watch for:
- Queries with low retrieval scores (might indicate gaps in your corpus)
- High latency requests (might indicate infrastructure issues)
- Negative user feedback patterns (might indicate systematic problems)
Production Considerations
Latency Budget
RAG adds latency at every stage. A typical breakdown:
| Stage | Latency |
|---|---|
| Query embedding | 50-100ms |
| Vector search | 20-50ms |
| Re-ranking | 100-200ms |
| LLM generation | 500-2000ms |
| Total | 700-2500ms |
If this is too slow, consider:
- Caching frequent queries
- Skipping re-ranking for simple queries
- Using faster (smaller) LLMs for straightforward questions
- Streaming responses to improve perceived latency
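Caching is usually the cheapest of these wins. A minimal in-memory sketch, keyed on the normalized query (a production system would use Redis or similar, with a TTL so cached answers don't go stale after re-indexing):

```python
from typing import Callable, Dict

_cache: Dict[str, str] = {}

def cached_answer(query: str, answer_fn: Callable[[str], str]) -> str:
    """Serve repeated questions from cache; fall through to the RAG pipeline once."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key not in _cache:
        _cache[key] = answer_fn(query)
    return _cache[key]
```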
Cost Modeling
At scale, costs add up. For a system handling 100,000 queries/month:
| Component | Cost Estimate |
|---|---|
| Query embeddings | $5-20 (depends on model) |
| Vector DB hosting | $50-200 |
| LLM generation | $100-500 (depends on model, response length) |
| Re-ranker | $10-50 |
Self-hosting embeddings and using open-source LLMs can reduce costs dramatically, but requires infrastructure investment.
Failure Modes
Plan for these scenarios:
Empty retrieval: No relevant chunks found. Don’t let the LLM hallucinate—return a clear “I don’t have information about that” response.
Contradictory sources: Retrieved chunks disagree. Surface the contradiction to the user rather than arbitrarily picking one.
Outdated information: Your corpus has old data. Include timestamps in chunk metadata and handle version conflicts explicitly.
Context overflow: Retrieved content exceeds the LLM’s context window. Implement truncation strategies that preserve the most relevant information.
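For the context-overflow case, one workable truncation strategy is to keep the highest-scoring chunks that fit a budget while preserving their original order. A sketch using a character budget (in practice you would budget in tokens):

```python
from typing import List

def fit_to_budget(chunks: List[str], scores: List[float], char_budget: int) -> List[str]:
    """Greedily keep the best-scoring chunks that fit, in original document order."""
    by_score = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in by_score:
        if used + len(chunks[i]) <= char_budget:
            kept.add(i)
            used += len(chunks[i])
    # Emit survivors in their original order to keep the context readable
    return [chunks[i] for i in sorted(kept)]
```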
Getting Started
If you’re building your first RAG system:
- Start simple: Basic chunking, a hosted embedding API, and a straightforward prompt. Get something working end-to-end.
- Build an evaluation set: Collect 50-100 realistic queries with expected answers. You'll need this to measure improvements.
- Iterate on retrieval first: Most RAG quality issues are retrieval issues. Don't fine-tune your prompts until you're confident you're retrieving the right context.
- Add complexity gradually: Hybrid search, re-ranking, and advanced chunking strategies each add value, but also add complexity. Measure the impact of each change.
- Plan for maintenance: Documents change. Models improve. Build pipelines that let you re-index and update without starting from scratch.
RAG systems reward iteration. The first version is never the best version—but it’s the foundation everything else builds on.
