Every RAG tutorial makes it look easy. Embed your docs, store them in a vector DB, retrieve the top 5, stuff them into the prompt, done. Ship it.
Then you deploy to production and everything falls apart. Retrievals are irrelevant. The LLM hallucinates anyway. Users complain that the AI "doesn't know" stuff that's definitely in your knowledge base.
I've been there. Here's what actually works.
The Basic Pipeline (And Why It's Not Enough)
Standard RAG looks simple:
async function basicRAG(query: string) {
  const embedding = await embed(query);
  const docs = await vectorDB.search(embedding, { topK: 5 });
  const prompt = `Context:\n${docs.map(d => d.content).join('\n\n')}\n\nQuestion: ${query}`;
  return await llm.complete(prompt);
}
This works for demos. In production it fails because:
- Semantic similarity ≠ relevance - "similar" text isn't always useful
- Chunking destroys context - Info gets split awkwardly
- No query understanding - Different phrasings retrieve different docs
- Hallucination persists - Model might ignore context anyway
Let's fix each of these.
Better Chunking Strategies
How you chunk matters more than which vector DB you use. Seriously.
Don't do this:
// Cuts sentences in half, loses structure
function naiveChunk(text: string, size = 500) {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
Do this instead:
function semanticChunk(document) {
  const sections = document.splitByHeaders();
  return sections.flatMap(section => {
    if (section.tokens < 400) {
      // Small section - keep whole
      return [{
        content: section.text,
        metadata: { title: section.header }
      }];
    }
    // Large section - split by paragraphs with overlap
    return section.splitByParagraphs().map(para => ({
      content: para.text,
      metadata: {
        title: section.header,
        context: section.firstParagraph // Keep context
      }
    }));
  });
}
The key insight: preserve natural boundaries. Headers, paragraphs, lists. Don't cut in the middle of an idea.
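Here's one way to do the "split by paragraphs with overlap" step. This is a minimal sketch, assuming paragraphs are separated by blank lines and that counting whitespace-separated words is a good-enough token estimate; swap in a real tokenizer if you have one.
// Minimal sketch: split a large section into paragraph-aligned chunks,
// carrying the last paragraph(s) forward as overlap between chunks.
// "Tokens" are approximated as whitespace-separated words.
interface Chunk {
  content: string;
  metadata: { title: string };
}

function splitWithOverlap(text: string, title: string, maxTokens = 400, overlapParas = 1): Chunk[] {
  const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokens = 0;

  for (const para of paragraphs) {
    const paraTokens = para.split(/\s+/).length;
    if (tokens + paraTokens > maxTokens && current.length > 0) {
      chunks.push({ content: current.join('\n\n'), metadata: { title } });
      // Keep the tail of the previous chunk so ideas aren't cut at the boundary
      current = current.slice(-overlapParas);
      tokens = current.join(' ').split(/\s+/).filter(Boolean).length;
    }
    current.push(para);
    tokens += paraTokens;
  }
  if (current.length > 0) {
    chunks.push({ content: current.join('\n\n'), metadata: { title } });
  }
  return chunks;
}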
Hybrid Search: Keywords + Vectors
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Use both.
async function hybridSearch(query: string) {
  const [vectorResults, keywordResults] = await Promise.all([
    vectorDB.search(await embed(query), { topK: 20 }),
    textSearch.search(query, { topK: 20 })
  ]);
  // Combine with RRF
  return reciprocalRankFusion([
    { results: vectorResults, weight: 0.6 },
    { results: keywordResults, weight: 0.4 }
  ]).slice(0, 10);
}
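reciprocalRankFusion isn't a library call here; it's a small function you write yourself. A minimal sketch of weighted Reciprocal Rank Fusion, assuming every retrieved doc carries a stable id so the two result lists can be matched up (k = 60 is the conventional constant):
// Weighted Reciprocal Rank Fusion. Assumes every retrieved doc has a
// stable `id` so the same doc can be matched across both result lists.
interface RetrievedDoc { id: string; content: string }
interface RankedList { results: RetrievedDoc[]; weight: number }

function reciprocalRankFusion(lists: RankedList[], k = 60): RetrievedDoc[] {
  const fused = new Map<string, { doc: RetrievedDoc; score: number }>();
  for (const { results, weight } of lists) {
    results.forEach((doc, rank) => {
      const entry = fused.get(doc.id) ?? { doc, score: 0 };
      // The classic RRF formula: 1 / (k + rank), weighted per list
      entry.score += weight / (k + rank + 1);
      fused.set(doc.id, entry);
    });
  }
  return [...fused.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.doc);
}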
I've seen this improve retrieval quality by 30-40% in production. It's one of those "why doesn't everyone do this" things.
Query Transformation
Users don't write queries optimized for retrieval. Help them out.
Query Expansion:
async function expandQuery(query: string) {
  const alternatives = await llm.complete(`
Generate 3 alternative ways to ask this:
"${query}"
Return as JSON array.`);
  const allResults = await Promise.all(
    [query, ...JSON.parse(alternatives)].map(q => search(q))
  );
  return deduplicateAndRank(allResults.flat());
}
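deduplicateAndRank just needs to collapse duplicates across the expanded queries and keep the best copy of each doc. A rough sketch, assuming each search result carries an id and a similarity score:
// Rough sketch: keep the highest-scoring copy of each doc, then re-sort.
interface ScoredDoc {
  id: string;
  content: string;
  score: number;
}

function deduplicateAndRank(results: ScoredDoc[]): ScoredDoc[] {
  const best = new Map<string, ScoredDoc>();
  for (const doc of results) {
    const existing = best.get(doc.id);
    if (!existing || doc.score > existing.score) {
      best.set(doc.id, doc);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}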
HyDE (Hypothetical Document Embeddings):
This one's clever - generate a fake answer, then search for docs similar to that answer:
async function hydeSearch(query: string) {
  // Generate hypothetical answer
  const hypothetical = await llm.complete(`
Write a short paragraph answering: "${query}"
Write as if citing a knowledge base.`);
  // Embed the answer, not the question
  const embedding = await embed(hypothetical);
  return await vectorDB.search(embedding, { topK: 10 });
}
Works surprisingly well because answers are more similar to documents than questions are.
Reranking: The Secret Weapon
After retrieval, rerank with a cross-encoder or LLM:
async function rerankResults(query: string, docs: Document[]) {
  const scored = await Promise.all(
    docs.map(async doc => ({
      doc,
      score: await reranker.score(query, doc.content)
    }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map(s => s.doc);
}
Cross-encoders are slower but much more accurate than embedding similarity. Worth it for the final ranking step.
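If you don't have a cross-encoder deployed, an LLM can stand in as the reranker. Here's a sketch of what reranker.score might look like using the same llm.complete helper as the other snippets; the prompt wording and 0-10 scale are just illustrative:
// Illustrative LLM-based reranker. `llm.complete` is the same assumed
// helper used in the other snippets; the scoring prompt is a sketch.
const reranker = {
  async score(query: string, content: string): Promise<number> {
    const raw = await llm.complete(`
Rate how useful this passage is for answering the question.
Question: "${query}"
Passage: """${content}"""
Reply with a single number from 0 (irrelevant) to 10 (directly answers it).`);
    const parsed = parseFloat(raw.trim());
    // Fall back to 0 if the model doesn't return a clean number
    return Number.isNaN(parsed) ? 0 : parsed;
  }
};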
Context Assembly
How you present docs to the LLM matters:
function formatContext(docs: Document[]) {
  return docs.map((doc, i) => `
[Source ${i + 1}]
Title: ${doc.metadata.title}
---
${doc.content}
`).join('\n\n');
}

const prompt = `Answer using ONLY the provided sources.
Cite sources using [Source N].
If sources don't have enough info, say "I don't have enough information."
Sources:
${formatContext(docs)}
Question: ${query}`;
The citation instruction is crucial. It forces the model to ground its answer in the sources and makes hallucinations obvious.
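You can also check citations after generation. A small sketch that flags answers citing source numbers that were never provided - a cheap hallucination signal:
// Flags answers with no citations, or citations to sources that don't exist.
function validateCitations(answer: string, sourceCount: number) {
  const cited = [...answer.matchAll(/\[Source (\d+)\]/g)].map(m => parseInt(m[1], 10));
  const invalid = cited.filter(n => n < 1 || n > sourceCount);
  return {
    hasCitations: cited.length > 0,
    invalidCitations: invalid // e.g. [Source 7] when only 5 sources were provided
  };
}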
The Production Checklist
Before deploying:
- [ ] Test with adversarial queries (prompt injection)
- [ ] Handle empty retrieval results (see the sketch after this list)
- [ ] Monitor retrieval quality metrics
- [ ] Collect user feedback (thumbs up/down)
- [ ] Plan for knowledge base updates
- [ ] Load test vector database
- [ ] Set up latency alerts
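For the empty-results item above, the cheapest fix is to short-circuit before calling the model. A minimal sketch reusing hybridSearch and llm.complete from earlier; buildGroundedPrompt is a hypothetical stand-in for the prompt assembly shown in the Context Assembly section:
// Don't let the model improvise when retrieval comes back empty.
async function answerQuery(query: string): Promise<string> {
  const docs = await hybridSearch(query);
  if (docs.length === 0) {
    // No grounding context: refuse rather than guess
    return "I don't have enough information to answer that.";
  }
  // buildGroundedPrompt = the citation-style prompt from "Context Assembly"
  const prompt = buildGroundedPrompt(query, docs);
  return await llm.complete(prompt);
}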
Measuring Success
Track these metrics:
const metrics = {
  // Are we retrieving relevant docs?
  retrievalRecall: await measureRecall(testCases),
  // Are users happy with answers?
  userSatisfaction: await getFeedbackRate(),
  // How often does the model say "I don't know"?
  noAnswerRate: await countNoAnswers(),
  // Latency
  p95Latency: await getLatencyPercentile(95)
};
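Of these, measureRecall is the only one that needs labeled data: a set of test queries paired with the doc IDs a good retrieval should return. A minimal sketch, assuming each retrieved doc carries an id:
// Average recall@k over a hand-labeled test set.
interface RetrievalTestCase {
  query: string;
  relevantDocIds: string[]; // hand-labeled "should retrieve" docs
}

async function measureRecall(testCases: RetrievalTestCase[], topK = 10): Promise<number> {
  let total = 0;
  for (const testCase of testCases) {
    const retrieved = await hybridSearch(testCase.query);
    const retrievedIds = new Set(retrieved.slice(0, topK).map(d => d.id));
    const hits = testCase.relevantDocIds.filter(id => retrievedIds.has(id)).length;
    total += hits / testCase.relevantDocIds.length;
  }
  return total / testCases.length;
}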
Wrapping Up
Production RAG needs:
- Smart chunking that preserves context
- Hybrid search combining vectors + keywords
- Query transformation to improve retrieval
- Reranking to surface best results
- Proper prompting with citations
- Continuous monitoring
The tutorials show you the 20% that gets you started. This covers some of the 80% that makes it actually work.
Start simple, measure everything, and iterate based on real feedback. That's the only way to build RAG that genuinely helps people.
