Every RAG tutorial makes it look easy. Embed your docs, store them in a vector DB, retrieve the top 5, stuff them into the prompt, done. Ship it.
Then you deploy to production and everything falls apart. Retrievals are irrelevant. The LLM hallucinates anyway. Users complain that the AI "doesn't know" stuff that's definitely in your knowledge base.
I've been there. Here's what actually works.
The Basic Pipeline (And Why It's Not Enough)
Standard RAG looks simple:
async function basicRAG(query: string) {
  const embedding = await embed(query);
  const docs = await vectorDB.search(embedding, { topK: 5 });
  const prompt = `Context:\n${docs.map(d => d.content).join('\n\n')}\n\nQuestion: ${query}`;
  return await llm.complete(prompt);
}
This works for demos. In production it fails because:
- Semantic similarity ≠ relevance - "similar" text isn't always useful
- Chunking destroys context - Info gets split awkwardly
- No query understanding - Different phrasings retrieve different docs
- Hallucination persists - Model might ignore context anyway
Let's fix each of these.
Better Chunking Strategies
How you chunk matters more than which vector DB you use. Seriously.
Don't do this:
// Cuts sentences in half, loses structure
function naiveChunk(text: string, size = 500) {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
Do this instead:
function semanticChunk(document) {
  const sections = document.splitByHeaders();
  return sections.flatMap(section => {
    if (section.tokens < 400) {
      // Small section - keep whole
      return [{
        content: section.text,
        metadata: { title: section.header }
      }];
    }
    // Large section - split by paragraphs with overlap
    return section.splitByParagraphs().map(para => ({
      content: para.text,
      metadata: {
        title: section.header,
        context: section.firstParagraph // Keep context
      }
    }));
  });
}
The key insight: preserve natural boundaries. Headers, paragraphs, lists. Don't cut in the middle of an idea.
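Here's one way to do the "split by paragraphs with overlap" step. This is a minimal sketch, assuming paragraphs are separated by blank lines and that counting whitespace-separated words is a good-enough token estimate; swap in a real tokenizer if you have one.
// Minimal sketch: split a large section into paragraph-aligned chunks,
// carrying the last paragraph(s) forward as overlap between chunks.
// "Tokens" are approximated as whitespace-separated words.
interface Chunk {
  content: string;
  metadata: { title: string };
}

function splitWithOverlap(text: string, title: string, maxTokens = 400, overlapParas = 1): Chunk[] {
  const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokens = 0;

  for (const para of paragraphs) {
    const paraTokens = para.split(/\s+/).length;
    if (tokens + paraTokens > maxTokens && current.length > 0) {
      chunks.push({ content: current.join('\n\n'), metadata: { title } });
      // Keep the tail of the previous chunk so ideas aren't cut at the boundary
      current = current.slice(-overlapParas);
      tokens = current.join(' ').split(/\s+/).filter(Boolean).length;
    }
    current.push(para);
    tokens += paraTokens;
  }
  if (current.length > 0) {
    chunks.push({ content: current.join('\n\n'), metadata: { title } });
  }
  return chunks;
}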
Hybrid Search: Keywords + Vectors
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Use both.
async function hybridSearch(query: string) {
  const [vectorResults, keywordResults] = await Promise.all([
    vectorDB.search(await embed(query), { topK: 20 }),
    textSearch.search(query, { topK: 20 })
  ]);
  // Combine with RRF
  return reciprocalRankFusion([
    { results: vectorResults, weight: 0.6 },
    { results: keywordResults, weight: 0.4 }
  ]).slice(0, 10);
}
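reciprocalRankFusion isn't a library call here; it's a small function you write yourself. A minimal sketch of weighted Reciprocal Rank Fusion, assuming every retrieved doc carries a stable id so the two result lists can be matched up (k = 60 is the conventional constant):
// Weighted Reciprocal Rank Fusion. Assumes every retrieved doc has a
// stable `id` so the same doc can be matched across both result lists.
interface RetrievedDoc { id: string; content: string }
interface RankedList { results: RetrievedDoc[]; weight: number }

function reciprocalRankFusion(lists: RankedList[], k = 60): RetrievedDoc[] {
  const fused = new Map<string, { doc: RetrievedDoc; score: number }>();
  for (const { results, weight } of lists) {
    results.forEach((doc, rank) => {
      const entry = fused.get(doc.id) ?? { doc, score: 0 };
      // The classic RRF formula: 1 / (k + rank), weighted per list
      entry.score += weight / (k + rank + 1);
      fused.set(doc.id, entry);
    });
  }
  return [...fused.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.doc);
}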
I've seen this improve retrieval quality by 30-40% in production. It's one of those "why doesn't everyone do this" things.
Query Transformation
Users don't write queries optimized for retrieval. Help them out.
Query Expansion:
async function expandQuery(query: string) {
  const alternatives = await llm.complete(`
Generate 3 alternative ways to ask this:
"${query}"
Return as JSON array.`);
  const allResults = await Promise.all(
    [query, ...JSON.parse(alternatives)].map(q => search(q))
  );
  return deduplicateAndRank(allResults.flat());
}
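deduplicateAndRank just needs to collapse duplicates across the expanded queries and keep the best copy of each doc. A rough sketch, assuming each search result carries an id and a similarity score:
// Rough sketch: keep the highest-scoring copy of each doc, then re-sort.
interface ScoredDoc {
  id: string;
  content: string;
  score: number;
}

function deduplicateAndRank(results: ScoredDoc[]): ScoredDoc[] {
  const best = new Map<string, ScoredDoc>();
  for (const doc of results) {
    const existing = best.get(doc.id);
    if (!existing || doc.score > existing.score) {
      best.set(doc.id, doc);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}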
HyDE (Hypothetical Document Embeddings):
This one's clever - generate a fake answer, then search for docs similar to that answer:
async function hydeSearch(query: string) {
  // Generate hypothetical answer
  const hypothetical = await llm.complete(`
Write a short paragraph answering: "${query}"
Write as if citing a knowledge base.`);
  // Embed the answer, not the question
  const embedding = await embed(hypothetical);
  return await vectorDB.search(embedding, { topK: 10 });
}
Works surprisingly well because answers are more similar to documents than questions are.
Reranking: The Secret Weapon
After retrieval, rerank with a cross-encoder or LLM:
async function rerankResults(query: string, docs: Document[]) {
  const scored = await Promise.all(
    docs.map(async doc => ({
      doc,
      score: await reranker.score(query, doc.content)
    }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map(s => s.doc);
}
Cross-encoders are slower but much more accurate than embedding similarity. Worth it for the final ranking step.
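If you don't have a cross-encoder deployed, an LLM can stand in as the reranker. Here's a sketch of what reranker.score might look like using the same llm.complete helper as the other snippets; the prompt wording and 0-10 scale are just illustrative:
// Illustrative LLM-based reranker. `llm.complete` is the same assumed
// helper used in the other snippets; the scoring prompt is a sketch.
const reranker = {
  async score(query: string, content: string): Promise<number> {
    const raw = await llm.complete(`
Rate how useful this passage is for answering the question.
Question: "${query}"
Passage: """${content}"""
Reply with a single number from 0 (irrelevant) to 10 (directly answers it).`);
    const parsed = parseFloat(raw.trim());
    // Fall back to 0 if the model doesn't return a clean number
    return Number.isNaN(parsed) ? 0 : parsed;
  }
};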
Context Assembly
How you present docs to the LLM matters:
function formatContext(docs: Document[]) {
  return docs.map((doc, i) => `
[Source ${i + 1}]
Title: ${doc.metadata.title}
---
${doc.content}
`).join('\n\n');
}

const prompt = `Answer using ONLY the provided sources.
Cite sources using [Source N].
If sources don't have enough info, say "I don't have enough information."
Sources:
${formatContext(docs)}
Question: ${query}`;
The citation instruction is crucial. It forces the model to ground its answer in the sources and makes hallucinations obvious.
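You can also check citations after generation. A small sketch that flags answers citing source numbers that were never provided - a cheap hallucination signal:
// Flags answers with no citations, or citations to sources that don't exist.
function validateCitations(answer: string, sourceCount: number) {
  const cited = [...answer.matchAll(/\[Source (\d+)\]/g)].map(m => parseInt(m[1], 10));
  const invalid = cited.filter(n => n < 1 || n > sourceCount);
  return {
    hasCitations: cited.length > 0,
    invalidCitations: invalid // e.g. [Source 7] when only 5 sources were provided
  };
}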
The Production Checklist
Before deploying:
- [ ] Test with adversarial queries (prompt injection)
- [ ] Handle empty retrieval results (see the sketch after this list)
- [ ] Monitor retrieval quality metrics
- [ ] Collect user feedback (thumbs up/down)
- [ ] Plan for knowledge base updates
- [ ] Load test vector database
- [ ] Set up latency alerts
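For the empty-results item above, the cheapest fix is to short-circuit before calling the model. A minimal sketch reusing hybridSearch and llm.complete from earlier; buildGroundedPrompt is a hypothetical stand-in for the prompt assembly shown in the Context Assembly section:
// Don't let the model improvise when retrieval comes back empty.
async function answerQuery(query: string): Promise<string> {
  const docs = await hybridSearch(query);
  if (docs.length === 0) {
    // No grounding context: refuse rather than guess
    return "I don't have enough information to answer that.";
  }
  // buildGroundedPrompt = the citation-style prompt from "Context Assembly"
  const prompt = buildGroundedPrompt(query, docs);
  return await llm.complete(prompt);
}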
Measuring Success
Track these metrics:
const metrics = {
  // Are we retrieving relevant docs?
  retrievalRecall: await measureRecall(testCases),
  // Are users happy with answers?
  userSatisfaction: await getFeedbackRate(),
  // How often does the model say "I don't know"?
  noAnswerRate: await countNoAnswers(),
  // Latency
  p95Latency: await getLatencyPercentile(95)
};
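Of these, measureRecall is the only one that needs labeled data: a set of test queries paired with the doc IDs a good retrieval should return. A minimal sketch, assuming each retrieved doc carries an id:
// Average recall@k over a hand-labeled test set.
interface RetrievalTestCase {
  query: string;
  relevantDocIds: string[]; // hand-labeled "should retrieve" docs
}

async function measureRecall(testCases: RetrievalTestCase[], topK = 10): Promise<number> {
  let total = 0;
  for (const testCase of testCases) {
    const retrieved = await hybridSearch(testCase.query);
    const retrievedIds = new Set(retrieved.slice(0, topK).map(d => d.id));
    const hits = testCase.relevantDocIds.filter(id => retrievedIds.has(id)).length;
    total += hits / testCase.relevantDocIds.length;
  }
  return total / testCases.length;
}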
Wrapping Up
Production RAG needs:
- Smart chunking that preserves context
- Hybrid search combining vectors + keywords
- Query transformation to improve retrieval
- Reranking to surface best results
- Proper prompting with citations
- Continuous monitoring
The tutorials show you the 20% that gets you started. This covers some of the 80% that makes it actually work.
Start simple, measure everything, and iterate based on real feedback. That's the only way to build RAG that genuinely helps people.
