A comprehensive, step-by-step checklist to build production-quality Retrieval-Augmented Generation systems. Check off each item as you go.
Building a RAG system that works in a demo is easy. Building one that works reliably in production -- with real users, messy documents, and edge-case queries -- is hard. This checklist captures every critical step from document ingestion to production deployment, along with the reasons each step matters and the pitfalls that catch most teams. Use it as a living document: check items off as you implement them, and revisit it when debugging issues.
Inventory all document types (PDF, HTML, DOCX, Markdown, CSV), their volume, and update frequency. Identify which documents are most critical for user queries.
Use specialized parsers for each document type. For PDFs, use libraries that handle OCR, tables, and layout detection (e.g., Unstructured, PyMuPDF). For HTML, strip boilerplate and extract main content.
Experiment with multiple chunking approaches: fixed-size (500-1000 tokens), recursive character splitting, semantic chunking (split by meaning), or document-structure-aware chunking (by sections/headers). Test at least 3 strategies.
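As a baseline to compare the other strategies against, fixed-size chunking with overlap can be sketched in a few lines. This is a minimal sketch that splits on whitespace "tokens" as a stand-in for a real tokenizer (e.g. tiktoken); the size and overlap values are illustrative, not recommendations.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks. Whitespace tokens
    stand in for real tokenizer tokens in this sketch."""
    tokens = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in at least one chunk; semantic and structure-aware chunkers aim for the same property without a hard size limit.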
Attach structured metadata: source document name, section title, page number, creation date, document type, and any domain-specific tags. Store this alongside the vector embedding.
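A metadata schema like the one above can be made explicit with a small dataclass; the field names here are illustrative, and most vector stores accept the resulting dict as a payload next to the embedding.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ChunkMetadata:
    source: str                # source document name
    section: str               # section title
    page: int
    created: str               # ISO-8601 creation date
    doc_type: str              # pdf, html, docx, ...
    tags: list[str] = field(default_factory=list)  # domain-specific tags

meta = ChunkMetadata(source="handbook.pdf", section="Benefits",
                     page=12, created="2024-05-01", doc_type="pdf",
                     tags=["hr"])
payload = asdict(meta)  # dict ready to store alongside the vector
```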
Build an incremental processing pipeline. Track document versions, detect changes, re-chunk and re-embed only modified documents, and remove stale chunks from the index.
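Change detection is the core of the incremental pipeline. One common approach, sketched here under the assumption that each document has a stable id, is to store a content hash per document and diff against it on each run.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_documents(previous: dict[str, str], current: dict[str, str]):
    """Compare stored hashes (previous: id -> hash) against fresh
    content (current: id -> text). Returns ids to re-chunk/re-embed
    and ids whose chunks should be removed from the index."""
    changed = [doc_id for doc_id, text in current.items()
               if previous.get(doc_id) != content_hash(text)]
    stale = [doc_id for doc_id in previous if doc_id not in current]
    return changed, stale
```

New and modified documents land in `changed`; deleted documents land in `stale` so their chunks can be purged.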
Test at least 3 embedding models on your actual data. Consider: OpenAI text-embedding-3-large, Cohere embed-v3, open-source options (BGE, E5, GTE). Benchmark on retrieval precision against a test set of queries.
Evaluate options based on your scale: Chroma or Qdrant for prototyping, Pinecone or Weaviate for managed production, pgvector for staying within PostgreSQL. Configure the index type (HNSW for speed, IVF for memory efficiency).
Combine keyword-based search (BM25/TF-IDF) with semantic vector search. Use reciprocal rank fusion (RRF) or learned weights to merge results. This catches both exact keyword matches and semantic similarities.
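Reciprocal rank fusion itself is only a few lines: each document scores the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as the commonly used constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists (e.g. BM25 and vector search)
    by summing reciprocal-rank scores per document id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists float to the top even when neither ranker puts them first, which is exactly the behavior you want from hybrid retrieval.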
Transform user queries before retrieval: expand abbreviations, correct spelling, resolve pronouns using conversation history, and optionally generate multiple query variants (HyDE or multi-query) for broader recall.
After initial retrieval (top-20 to top-50 candidates), apply a cross-encoder re-ranker (e.g., Cohere Rerank, BGE Reranker) to re-score and re-order results based on query-document relevance. Then take the top-5 to top-10.
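The re-ranking stage reduces to "score every (query, candidate) pair, sort, truncate." In this sketch `score_fn` is a placeholder for the real model call (a sentence-transformers CrossEncoder or a rerank API); the toy word-overlap scorer below is only for illustration.

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    """Re-order retrieval candidates by a cross-encoder-style score.
    score_fn(query, doc) -> float stands in for a real model call."""
    ranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_k]

def word_overlap(query: str, doc: str) -> float:
    """Toy scorer for demonstration only; swap in a real cross-encoder."""
    return float(len(set(query.split()) & set(doc.split())))
```

Because the cross-encoder sees the query and document together, it can score relevance far more precisely than the bi-encoder used for first-stage retrieval, at the cost of a model call per candidate.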
When a chunk is retrieved, also fetch its surrounding context: the parent document section, adjacent chunks, or a document summary. Include this additional context in the LLM prompt to prevent the "lost context" problem.
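The simplest version of this is window expansion: given the index of the retrieved chunk within its document, pull in the neighbors on either side. A sketch, assuming chunks are stored in document order:

```python
def with_neighbors(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Expand a retrieved chunk with its adjacent chunks from the same
    document -- a simple form of window retrieval."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return "\n".join(chunks[lo:hi])
```

The same idea generalizes to fetching the parent section or a document summary keyed by the chunk's metadata.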
Allow retrieval to be scoped by metadata: date ranges, document types, categories, or access permissions. Implement pre-filtering (before vector search) for large exclusions and post-filtering for fine-grained control.
Craft a system prompt that instructs the LLM to answer based on provided context, cite sources, and explicitly state when the context does not contain sufficient information. Structure retrieved chunks with clear delimiters and source attribution.
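Prompt assembly is worth making deterministic and testable. A minimal sketch with numbered source markers and `---` delimiters (the exact wording and delimiter choice are assumptions to tune for your model):

```python
SYSTEM_PROMPT = (
    "Answer only from the context below. Cite sources as [n]. "
    "If the context does not contain enough information, say so explicitly."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks with delimiters and source attribution."""
    blocks = [
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    context = "\n---\n".join(blocks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```

The numbered `[n]` markers give the model something concrete to cite, which makes the citation step in the next item mechanical to parse.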
Require the LLM to cite specific sources for each claim in its response. Map citations back to original documents with page numbers or section references. Display citations to users for verification.
Detect when retrieved context is insufficient (low relevance scores, no matching chunks) and instruct the LLM to say "I don't have enough information to answer this" rather than guessing. Provide suggested alternative queries.
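A simple gate on retrieval scores implements this: only proceed to generation when enough chunks clear a relevance threshold. The threshold here is an assumption; calibrate it against your own score distribution, since it depends on the embedding model and re-ranker.

```python
REFUSAL = "I don't have enough information to answer this."

def retrieval_is_sufficient(scores: list[float], threshold: float = 0.4,
                            min_hits: int = 1) -> bool:
    """Return True when retrieval looks strong enough to answer;
    False routes to the explicit refusal message instead of the LLM."""
    strong = [s for s in scores if s >= threshold]
    return len(strong) >= min_hits
```

When this returns False, respond with `REFUSAL` plus suggested alternative queries rather than letting the model guess.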
Stream the LLM response token-by-token to the user interface rather than waiting for the complete response. Display a loading state during the retrieval phase, then stream the generation.
Create 50-100+ question-answer pairs with source document references. Include diverse query types: factual lookups, multi-hop reasoning, temporal queries, out-of-scope queries, and adversarial inputs. Have domain experts validate the ground truth answers.
Evaluate retrieval independently from generation using information retrieval metrics: Precision@k, Recall@k, Mean Reciprocal Rank (MRR), and NDCG. This tells you if the right chunks are being found, regardless of how the LLM uses them.
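These metrics are small enough to implement directly against your golden test set, which avoids framework lock-in. A sketch of Precision@k, Recall@k, and MRR over document ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```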
Evaluate generation using RAGAS metrics: faithfulness (does the answer stick to the context?), answer relevancy (does it address the question?), and context precision (was the right context used?). Run LLM-as-judge evaluations for nuanced quality assessment.
Instrument every step: query preprocessing, retrieval (which chunks, scores), re-ranking, context assembly, and generation. Use tools like LangSmith, Langfuse, or Arize Phoenix. Log latency, token counts, and costs per request.
Add multi-level caching: exact-match cache (identical queries), semantic cache (similar queries using embedding similarity), and embedding cache (avoid re-embedding the same document chunks). Set appropriate TTLs for each cache level.
Monitor: retrieval latency (p50, p95, p99), generation latency, error rates, cache hit rates, average relevance scores, token usage, and costs. Set alerts for anomalies: sudden drops in relevance scores or spikes in hallucination rates.
Add thumbs up/down buttons, allow users to flag incorrect answers, and collect optional text feedback. Store feedback linked to the full trace (query, retrieved chunks, generated answer) so you can analyze failure patterns.