RAG — Retrieval Augmented Generation — has become the dominant pattern for grounding AI in your specific content. It is how most "AI that knows our docs/product/company" systems actually work. Building a capable RAG system in 2026 is far easier than it was two years ago, but it is also more subtle than the naive "embed, retrieve, generate" pipeline that RAG tutorials made famous. Real-world RAG requires careful attention to chunking, retrieval quality, re-ranking, prompting, evaluation, and maintenance. This guide walks through the full pipeline as it actually exists in production systems — the choices that matter, the pitfalls that sink projects, and the advanced techniques that separate good RAG from great.

What RAG actually is

The core pattern explained precisely.

The problem RAG solves. Large language models have general knowledge but do not know your specific content. Training models on your content is expensive and inflexible.

The RAG solution. At query time, retrieve relevant content from your knowledge base. Include in the prompt. Model generates response using retrieved content.

The data flow. User query → embed query → search vector database → retrieve relevant chunks → construct prompt with chunks + query → model generates response → return to user.
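The data flow above can be sketched end to end. This is a toy illustration, not a production pipeline: a bag-of-words counter stands in for a real embedding model and a sorted list stands in for a vector database.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system calls an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # "Search vector database": score every chunk against the query embedding.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "To reset your password, open Settings and choose Security.",
    "Billing invoices are emailed on the first of each month.",
    "Two-factor authentication adds a second login step.",
]
top = retrieve("how do I reset my password", chunks)
# Construct the prompt: retrieved chunks plus the user query.
prompt = (
    "Answer using only this context:\n"
    + "\n---\n".join(top)
    + "\n\nQuestion: how do I reset my password"
)
```

The final `prompt` is what gets sent to the model; everything before it is retrieval.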

Why this works. LLMs are strong at using provided context. Retrieval pulls in just the relevant context for each query. Model does not need to memorise your content; it references it.

This is a much simpler and more flexible pattern than training custom models. Most "AI trained on our data" products are actually RAG systems.

Chunking — the underrated decision

How you split your content determines retrieval quality.

Chunk size. 200-1000 tokens typical. Smaller is more precise but misses context. Larger catches more context but pollutes retrieval.

Chunk boundaries. Respect semantic units. Do not split mid-paragraph. Prefer structure-aware (split at headings).

Overlap. Small overlap (10-20%) between adjacent chunks helps. Boundary information available in both.

Chunk metadata. Source URL. Section. Last updated. Document type. Used for filtering and display.

Hierarchical chunking. Multiple representations — full doc embedding, section embeddings, chunk embeddings. Enables different retrieval strategies.

Content-type specific chunking. Code handled differently from prose. Tables treated specially. Images require separate pipeline.

Chunking is often the difference between useful and useless RAG. Do not skimp on this step.
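A minimal structure-aware chunker can be sketched as follows, splitting at paragraph boundaries and carrying a small overlap between adjacent chunks. Word count stands in for token count here; a real pipeline would use the embedding model's tokenizer, and oversized single paragraphs would need further splitting.

```python
import re

def chunk(text, max_tokens=400, overlap=50):
    """Split text at paragraph boundaries, packing paragraphs into chunks
    of roughly max_tokens words, carrying `overlap` words from the end of
    each chunk into the next so boundary information appears in both."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # overlap carried into the next chunk
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on blank lines rather than fixed character counts is the simplest form of respecting semantic units; splitting at headings follows the same pattern with a different delimiter.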

Embedding models

Choosing the representation.

OpenAI text-embedding-3-small. Baseline. Good quality, reasonable cost, easy to use.

OpenAI text-embedding-3-large. Higher quality for demanding use cases.

Cohere embed-english-v3 or embed-multilingual-v3. Often slightly better retrieval for specific domains.

Voyage AI. Domain-specific embeddings (code, legal, finance). Often outperform general models in those domains.

BGE family (BAAI). Open source. Competitive with commercial. Self-hostable.

Nomic Embed. Open source, high quality, strong multilingual.

Selection criteria. Quality on your domain (test empirically). Cost. Vector size (affects storage). Whether you need self-hosting.

The practical advice. Start with OpenAI text-embedding-3-small. Switch only when you have measured a specific improvement.

Vector database selection

Where to store and search embeddings.

pgvector. PostgreSQL extension. If you already use Postgres, this is often the right choice. Mature, well-supported.

Pinecone. Managed vector database. Easy to start. Good performance. Commercial.

Weaviate. Open source with managed option. Rich feature set.

Chroma. Simple, popular for development.

Qdrant. Strong performance. Good for larger scale.

Elasticsearch and OpenSearch. Traditional search engines now have vector capabilities. Good if you need hybrid search.

Selection criteria. Scale. Hybrid search needs. Existing infrastructure. Budget. Self-hosted vs managed preference.

For most projects, pgvector or Pinecone are good starting points. Switch only for specific reasons.

Basic retrieval and its limits

What the naive pipeline does.

Embed query. Get top-k nearest chunks. Put in prompt. Generate.

Problems with the naive approach. Query-document vocabulary mismatch — user asks "how to reset my password" but docs say "credential recovery procedure." Missing context — retrieved chunk makes sense only with surrounding text. Concept drift — query is about A but chunks about B also have similar embeddings.

Specific failure modes. Retrieving semantically similar but not actually relevant content. Missing content that describes the needed concept under a different name. Picking chunks that are lexically similar but semantically different.

These problems motivate more sophisticated techniques.

Hybrid search

Combining vector and keyword retrieval.

The insight. Vector search catches semantic similarity; keyword search catches exact matches. Many queries benefit from both.

Implementation. Run both searches. Combine results (typically with reciprocal rank fusion or similar). Return combined top-k.
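Reciprocal rank fusion is simple enough to show in full. Each document scores the sum of 1/(k + rank) across the lists it appears in; k = 60 is the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=5):
    """Fuse several ranked lists of document ids (e.g. one from vector
    search, one from BM25) into a single combined ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

vector_hits = ["doc3", "doc1", "doc7"]   # ranked output of vector search
keyword_hits = ["doc1", "doc9", "doc3"]  # ranked output of BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear high in both lists (doc1, doc3) rise to the top; documents found by only one retriever still survive with lower scores.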

BM25. Classical keyword search algorithm. Still excellent baseline.

When hybrid helps most. Queries with specific terminology, product names, code snippets. Less help for pure semantic queries.

Cost. Slightly more complex infrastructure. Usually worth it.

Most production RAG in 2026 uses hybrid search. The performance improvement is real.

Re-ranking

Refining retrieval results.

The problem. Top-k retrieval returns plausible results; not all are actually most relevant. Vector similarity approximates relevance imperfectly.

The solution. Retrieve more candidates (say top-50) than needed. Re-rank them with a more expensive but more accurate model. Take top-k from the re-ranked list.
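The retrieve-wide, re-rank-narrow shape looks like this. The word-overlap scorer below is a stand-in; a production system would call a cross-encoder such as Cohere Rerank at that point.

```python
def rerank(query, candidates, score_fn, final_k=5):
    """Score each (query, chunk) pair with a more accurate (and more
    expensive) model than first-stage retrieval, keeping the best final_k."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:final_k]

def overlap_score(query, chunk):
    # Stand-in scorer: fraction of query words present in the chunk.
    # Replace with a cross-encoder relevance model in production.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

candidates = [
    "reset password in settings",
    "pricing for enterprise plans",
    "password reset via email link",
]
best = rerank("reset my password", candidates, overlap_score, final_k=2)
```

First-stage retrieval would supply the `candidates` list (top-50 or so); only the re-ranked top-k reaches the prompt.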

Re-ranking models. Cohere Rerank, Voyage rerank-lite, BGE reranker. Cross-encoders trained for relevance.

The tradeoff. Added latency and cost versus better retrieval. Usually worth it for quality-sensitive applications.

Effect on quality. Often the single biggest retrieval improvement. Can lift recall@5 from 60% to 85% or better.

Production RAG benefits substantially from re-ranking. Add it unless specific latency constraints prevent it.

Query transformation

Improving the query before retrieval.

Query rewriting. User queries are often poor retrieval queries. Use LLM to rewrite into more retrievable form.

Query expansion. Generate multiple query variations. Retrieve for each. Combine results.

HyDE (Hypothetical Document Embeddings). Generate a hypothetical answer first. Embed that. Retrieve documents similar to hypothetical answer.
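A sketch of HyDE, with loudly-labeled stubs: `generate_hypothetical` stands in for a real LLM call, and the bag-of-words `embed` stands in for an embedding model. The point is the shape of the flow — embed the drafted answer, not the raw query.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system uses an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical(query):
    # Stub for an LLM call that drafts a plausible answer to the query.
    return f"A likely answer to '{query}' would describe credential recovery steps."

def hyde_retrieve(query, chunks, k=1):
    # Embed the hypothetical answer instead of the raw query: answers tend
    # to look more like documents than questions do.
    hypo = embed(generate_hypothetical(query))
    return sorted(chunks, key=lambda c: cosine(hypo, embed(c)), reverse=True)[:k]
```

This directly targets the vocabulary-mismatch failure described earlier: "reset my password" retrieves a chunk titled "credential recovery procedure" because the hypothetical answer bridges the two vocabularies.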

Multi-step queries. Break complex questions into sub-questions. Retrieve for each. Synthesise.

Conversation-aware. In multi-turn conversations, rewrite questions to include context from earlier turns.

These techniques help when naive retrieval struggles. Not always needed.

Metadata filtering

Scoping retrieval by attributes.

The use case. Content with different visibility (public vs logged-in users), different product versions, different languages, different recency.

Implementation. Vector DB supports filtering by metadata during search. Specify filters at query time.
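An in-memory sketch of filter-then-search, mirroring what a vector DB does server-side when you pass a metadata filter at query time. The index items and `dot` similarity are illustrative placeholders.

```python
def filtered_search(query_vec, index, filters, similarity, k=3):
    """Apply metadata filters before similarity scoring, so out-of-scope
    content can never appear in results regardless of its similarity."""
    candidates = [
        item for item in index
        if all(item["meta"].get(key) == value for key, value in filters.items())
    ]
    return sorted(
        candidates, key=lambda item: similarity(query_vec, item["vec"]), reverse=True
    )[:k]

index = [
    {"vec": [1.0, 0.0], "meta": {"lang": "en", "tier": "public"}},
    {"vec": [0.9, 0.1], "meta": {"lang": "es", "tier": "public"}},
    {"vec": [0.8, 0.2], "meta": {"lang": "en", "tier": "enterprise"}},
]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
hits = filtered_search([1.0, 0.0], index, {"lang": "en", "tier": "public"}, dot)
```

Note that filtering happens before scoring, not after: post-filtering the top-k can return fewer results than requested, or none.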

Performance. Well-implemented filtering does not meaningfully slow retrieval; poorly implemented filtering can slow it substantially.

Examples. User asks about "latest version features" — filter to most recent docs. User asks in Spanish — filter to Spanish docs. Enterprise customer — include enterprise-only content.

Filtering is a powerful tool for getting the right content to the right user.

Prompt engineering for RAG

Getting the model to use retrieved context well.

System prompt structure. Role definition. Instructions for using context. Instructions for unknown queries.

Context formatting. How chunks are presented to the model. Delimiters, source citations, structure.

Grounding instructions. "Answer based only on provided context" versus "use context when relevant." Different outcomes.

Citation behavior. Instructing model to cite sources in output. Specific formats.

Fallback behaviour. What should the model do when the context does not contain the answer?

Iterative improvement. Prompt engineering is empirical. Test, measure, iterate.
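The pieces above — context formatting, grounding instructions, citation format, fallback — can all live in one prompt builder. This is one reasonable layout, not a canonical one; the chunk fields and wording are assumptions.

```python
def build_prompt(question, chunks):
    """Assemble a grounded RAG prompt: numbered, source-tagged context
    blocks, grounding instructions, a citation format, and a fallback."""
    context = "\n\n".join(
        f"[{i}] (source: {c['url']})\n{c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources as [1], [2], ... after each claim. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    {"url": "docs/security", "text": "Passwords reset from Settings > Security."},
    {"url": "docs/billing", "text": "Invoices are sent monthly."},
]
prompt = build_prompt("How do I reset my password?", chunks)
```

Numbering the chunks gives the model a stable handle for citations, and tagging each with its source URL lets the application render clickable references.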

Evaluation of RAG

How do you know if it is working?

Retrieval metrics. Recall@k — did we retrieve the right content? Measured on labelled dataset.
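Recall@k is straightforward to compute once you have labelled (query, relevant-chunks) pairs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top-k
    retrieved list, for a single query."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_recall_at_k(labelled, k):
    # labelled: list of (retrieved_ids, relevant_ids) pairs, one per query.
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in labelled]
    return sum(scores) / len(scores)

labelled = [
    (["c1", "c4", "c9"], ["c1"]),        # relevant chunk found at rank 1
    (["c2", "c7", "c3"], ["c3", "c8"]),  # one of two relevant chunks found
]
score = mean_recall_at_k(labelled, k=3)  # (1.0 + 0.5) / 2 = 0.75
```

Run this over the labelled dataset after every retrieval change — chunking tweaks, hybrid search, re-ranking — so improvements are measured rather than assumed.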

Faithfulness. Are model outputs grounded in retrieved content? Measures hallucination.

Answer quality. Are answers useful? Human evaluation typically required.

End-to-end metrics. User satisfaction. Task completion. Whatever matters to users.

Evaluation frameworks. RAGAS, TruLens, specific in-house eval suites.

Continuous evaluation. Production monitoring catches regressions. Not just pre-deployment testing.

Weak evaluation is the most common RAG failure. Invest here.

Common RAG failure modes

What actually breaks in production.

Poor chunks. Chunking that does not respect document structure produces poor retrieval.

Missing content. Knowledge base does not contain needed information. No retrieval technique can retrieve what is not there.

Wrong model usage. Retrieved content present but model does not use it effectively. Prompt engineering issue.

Hallucination despite grounding. Model generates content not supported by retrieval. System prompt adjustments needed.

Stale content. Knowledge base not refreshed. Outdated answers.

Privilege leaks. Retrieval not respecting access controls. Information leaks.

These failures are common. Addressing each requires specific attention.

Agentic RAG

The next-generation pattern.

Traditional RAG. Single retrieval, single generation.

Agentic RAG. Model decides what to retrieve. May retrieve multiple times. May use different tools for different sub-tasks. Reasons about retrieval strategy.

Benefits. Handles complex queries. More flexible. Can decompose tasks.

Costs. More latency. More complexity. More opportunities for failure.

When to use. Complex multi-step queries. Tasks requiring multiple data sources. Queries where single retrieval is not enough.

Implementation. Function calling, tool use, multi-step reasoning chains.
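The control flow of agentic RAG reduces to a loop in which the model chooses the next tool. Below, `plan` is a deterministic stub standing in for an LLM deciding via tool calls, and the tools are toy lambdas; only the loop shape is the point.

```python
def agentic_answer(question, tools, plan, max_steps=4):
    """Minimal agent loop: `plan` inspects what has been gathered so far
    and either names the next tool to run or returns None to finish."""
    gathered = []
    for _ in range(max_steps):  # hard cap guards against runaway loops
        tool_name = plan(question, gathered)
        if tool_name is None:
            break
        gathered.append(tools[tool_name](question))
    return " ".join(gathered)

tools = {
    "search_docs": lambda q: "Docs say: reset passwords in Settings.",
    "search_tickets": lambda q: "Ticket #12 reports reset emails delayed.",
}

def plan(question, gathered):
    # Toy policy: consult docs first, then tickets, then stop. A real
    # system would make this decision with an LLM tool-use call.
    order = ["search_docs", "search_tickets"]
    return order[len(gathered)] if len(gathered) < len(order) else None

answer = agentic_answer("Why can't users reset passwords?", tools, plan)
```

Traditional RAG is this loop with `max_steps=1` and a fixed plan; the agentic version lets the model decide whether one retrieval was enough.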

Agentic RAG is growing but not universal. Start with traditional RAG; add agents when needed.

Multimodal RAG

Beyond text.

Images. Store image embeddings alongside text. Retrieve images relevant to queries.

Tables. Structured data requires specific handling. Extract, embed, retrieve.

Code. Code embeddings different from text. Specialised models.

PDFs with mixed content. Extraction pipelines that handle layout, images, tables.

Tools. Unstructured.io, LlamaParse, ColPali for multimodal handling.

The challenge. Pipelines more complex. Quality varies by content type.

Most RAG in 2026 is text-focused with some multimodal extensions. Pure multimodal RAG still emerging.

Knowledge base maintenance

The ongoing work.

Ingestion pipeline. How does new content reach the RAG system? Manual, scheduled, real-time.

Update handling. When content changes, how does RAG reflect that? Delete old chunks, add new.
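Keying chunks by source document id makes update and delete handling mechanical: replacing a document drops all of its old chunks at once. An in-memory sketch; vector DBs support the same pattern via metadata filters or id prefixes.

```python
class ChunkIndex:
    """Chunks grouped by source document id, so an updated or deleted
    document replaces or removes all of its chunks together — no stale
    chunks from a previous version survive an update."""

    def __init__(self):
        self.by_doc = {}

    def upsert(self, doc_id, chunks):
        self.by_doc[doc_id] = list(chunks)  # old chunks dropped wholesale

    def delete(self, doc_id):
        self.by_doc.pop(doc_id, None)  # deleting a missing doc is a no-op

    def all_chunks(self):
        return [c for chunks in self.by_doc.values() for c in chunks]

index = ChunkIndex()
index.upsert("faq", ["old answer v1"])
index.upsert("faq", ["new answer v2"])  # update: v1 chunks are gone
index.delete("pricing")
```

Without a document-to-chunks mapping like this, updates leave orphaned chunks in the vector DB, which surface as outdated answers.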

Deletion. When content is removed, ensure it is removed from vector DB.

Versioning. Multiple versions of same content? Handle carefully.

Quality over time. Content drift affects quality. Regular evaluation.

Scale growth. Vector DBs have scaling limits. Plan for growth.

Maintenance is a significant ongoing effort. Budget for it.

Worked example: a technical docs RAG

A concrete build.

Content. Roughly 500 pages of product documentation, spread across many web pages.

Week 1. Ingestion script crawling docs. HTML cleaning. Structure-aware chunking at sections and subsections. Chunk size ~400 tokens with 50-token overlap.

Week 2. Embedding with text-embedding-3-small. Storage in pgvector. Basic retrieval pipeline.

Week 3. Query API with Claude Sonnet generation. Basic system prompt. Test with real queries.

Week 4. Add hybrid search with BM25. Add Cohere re-ranking. Measure improvement.

Week 5. Evaluation suite with 100 labelled queries. Measure retrieval quality, answer quality.

Week 6. Deploy to production with monitoring. Initial user feedback.

Outcome. Working RAG system with measurable quality. 85% of user queries answered correctly. 15% require improvement — typically missing content or ambiguous queries.

Total effort. 6 person-weeks. Comparable quality to systems built in months by teams who skipped evaluation investment.

Advanced patterns

Beyond basic RAG.

Conversational RAG with memory. Multi-turn conversations where context builds.

Graph RAG. Knowledge graphs augmenting vector retrieval. Microsoft GraphRAG paper and others.

Self-querying. Model generates structured filters from natural language queries.

Corrective RAG. Detect when retrieval is likely poor, adjust.

These patterns address specific limitations of basic RAG. Worth exploring when basics are mastered.

Cost optimisation

RAG scales costs with usage.

Token costs. Retrieved context increases prompt tokens. Model inference costs grow.

Embedding costs. Indexing costs (one-time or on updates). Query embedding costs (every query).

Infrastructure. Vector database scaling. Inference infrastructure.

Optimisation levers. Smaller models where quality allows. Caching common queries. Chunk size tuning (smaller chunks = smaller prompts = lower costs per query). Re-ranking only for hard queries.

Typical costs. Small RAG: $50-200/month. Mid-scale: $500-5,000/month. Large-scale: $10,000+/month.

When RAG is the wrong solution

Honest limitations.

Tasks not fundamentally about knowledge retrieval. Creative writing, reasoning puzzles, general conversation — RAG adds overhead without benefit.

Very small knowledge bases. If your content fits in the prompt, skip retrieval.

Real-time data. RAG over a vector DB is not ideal for rapidly changing data. Different patterns (API calls, database queries) better.

Structured data. If your data is structured (rows in database), SQL is probably better than RAG.

RAG is a tool. Pick it when it fits; do not force fit where it does not.

Access control in production RAG

An often-missed but critical dimension. Different users see different content. Employees see internal docs; customers see product docs; the enterprise tier sees enterprise content. RAG systems must enforce these boundaries.

The common failure pattern. Developers build the RAG pipeline first and think about access control later. Then, under pressure to ship, they bolt on filtering that is easy to bypass.

The correct pattern. Access control in metadata at indexing time. Filters applied at retrieval time, enforced at the vector DB level, not the application layer. Audit logging of what content each user accessed. Periodic review of access patterns. Testing specifically for access boundary violations.

This matters enormously for enterprise deployments. A leak of confidential documents through a RAG chatbot is a serious incident. Building access control into the design from day one is much cheaper than retrofitting it.

Worked example: migrating from prompt engineering to RAG

A SaaS company ran customer support with Claude and lengthy system prompts containing product information. As the product grew, the system prompt ballooned to 40K tokens. Costs grew. Latency grew. Quality suffered because the model struggled to use so much context effectively.

The migration. Split the 40K-token system prompt into discrete chunks by topic. Embed each chunk. At query time, retrieve top 5 relevant chunks. System prompt shrinks to 3K tokens of instructions plus retrieved context. Cost per query drops 70%. Latency drops. Quality improves because retrieved context is specifically relevant.

The lessons transferable. Very long system prompts are often a symptom of needing RAG. Migration is incremental — you can replace sections of prompt with retrieval progressively. Quality improves not just because the pipeline is more efficient but because the model handles smaller, focused context better than massive undifferentiated context. Teams running into system prompt length limits should consider RAG migration before reaching for bigger context windows.

RAG in regulated industries

Specific considerations for healthcare, finance, legal, and government.

Data residency. Where vector embeddings are stored matters. Self-hosted or regional cloud deployments may be required.

Audit logging. Every retrieval and generation logged for compliance review.

Source citation. Mandatory; responses must be traceable to specific sources.

Version control. Which version of which document was retrieved at a given time.

Content approval. Only approved content enters the knowledge base, with an approval workflow.

These requirements add complexity but are well-established patterns. Organisations in regulated sectors deploy RAG successfully by building these considerations into the architecture from the start rather than as afterthoughts. Engaging compliance and legal teams early in RAG projects prevents expensive redesigns later; the common pattern of "we'll figure out compliance after we prove the concept" usually costs more in the long run than building compliance in from day one.

Good RAG is 20% vector search and 80% everything else — chunking, reranking, prompt engineering, and evaluation. Teams that only focus on the vector database usually underdeliver.

The short version

RAG in 2026 is the dominant pattern for making AI useful on your specific content. The basic pipeline is simple — embed, retrieve, generate. Production-quality RAG requires attention to chunking, hybrid search, re-ranking, prompt engineering, and evaluation. Advanced techniques — query transformation, agentic patterns, graph RAG — address specific limitations. Maintenance is substantial; budget for it. When RAG is right, modern tooling makes it accessible. When it is wrong, other patterns serve better. For most "AI on our content" projects, RAG with good engineering is the right choice. Start with basics, measure rigorously, add sophistication where evaluation shows improvement.
