Most RAG tutorials show you the happy path. A clean project, a handful of sample documents, a query that works first time.
That’s fine for getting oriented, but it’s not what building these systems in production actually looks like.
This post is the one I wish had existed when I started working on earlier RAG solutions using Semantic Kernel.
The chunking decisions that degrade silently, the retrieval quality that looks fine in demos and falls apart on real queries, and observability gaps that make debugging feel like guessing.
All in .NET.
This blog is by no means exhaustive, and I continue to find optimisations, but it’s a good starting point.
Let’s dig in.
~
What RAG Is
RAG (Retrieval-Augmented Generation) is a pattern, not a framework. Before asking an LLM to answer a question, you first retrieve relevant content from your own data store and include it in the prompt.
The model answers from that grounded context rather than from training data alone.
A RAG pipeline has five stages:
- Ingestion — load source documents (HTML, Markdown, PDFs, plain text)
- Chunking — split documents into segments small enough to embed meaningfully
- Embedding — convert each chunk into a vector using an embedding model
- Storage — persist vectors to a vector store (SQLite, Elasticsearch, Azure AI Search, etc.)
- Retrieval + Generation — embed the incoming query, find the closest chunks, build a grounded prompt, generate an answer
Simple on paper. The devil is in the implementation choices at each stage.
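To make the stages concrete, here is a toy end-to-end sketch. None of this is Semantic Kernel code: ToyEmbed (a letter-frequency vector) stands in for a real embedding model, and an in-memory list stands in for the vector store.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy RAG skeleton: ingest -> chunk -> embed -> store -> retrieve.
// ToyEmbed is a stand-in for a real embedding model call.
public static class MiniRag
{
    // "Embedding": 26-dim letter-frequency vector, unit-normalised.
    public static float[] ToyEmbed(string text)
    {
        var v = new float[26];
        foreach (var c in text.ToLowerInvariant())
            if (c >= 'a' && c <= 'z') v[c - 'a']++;
        var norm = (float)Math.Sqrt(v.Sum(x => x * x));
        return norm == 0 ? v : v.Select(x => x / norm).ToArray();
    }

    // Vectors are unit-norm, so the dot product is the cosine similarity.
    public static float Cosine(float[] a, float[] b) =>
        Enumerable.Range(0, a.Length).Sum(i => a[i] * b[i]);

    // Storage: each chunk's text kept alongside its vector.
    private static readonly List<(string Text, float[] Vector)> Store = new();

    public static void Ingest(IEnumerable<string> chunks)
    {
        foreach (var chunk in chunks)
            Store.Add((chunk, ToyEmbed(chunk)));
    }

    // Retrieval: embed the query, rank stored chunks by similarity.
    public static List<string> Retrieve(string query, int limit)
    {
        var q = ToyEmbed(query);
        return Store.OrderByDescending(c => Cosine(q, c.Vector))
                    .Take(limit)
                    .Select(c => c.Text)
                    .ToList();
    }
}
```

Every decision discussed below is about replacing one of these toy pieces with something robust.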
~
Chunking: The Step Most Tutorials Rush
Chunking quality has a disproportionate effect on retrieval quality. Chunks too large: vector similarity becomes diluted. Too small: you lose the context that makes a chunk meaningful.
One approach is to use TextChunker.SplitMarkdownParagraphs() from Semantic Kernel. It respects document structure, so paragraphs, headings, and list items don’t get bisected mid-sentence.
var chunks = TextChunker.SplitMarkdownParagraphs(
lines: markdownContent.Split('\n').ToList(),
maxTokensPerParagraph: 512,
overlapTokens: 50
);
The overlapTokens parameter matters. A small overlap (10%-15%) between adjacent chunks ensures that a relevant sentence near a chunk boundary doesn’t disappear from retrieval. Skipping this is a common mistake.
I implemented a custom chunking service on one project to get this right.
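For illustration, a stripped-down sliding-window chunker with overlap might look like the following. Words stand in for tokens here; a real implementation should count tokens with the embedding model’s tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sliding-window chunker with overlap. Words approximate tokens here;
// production code should count tokens with the model's tokenizer.
public static class OverlapChunker
{
    public static List<string> Chunk(string text, int maxWords, int overlapWords)
    {
        if (overlapWords >= maxWords)
            throw new ArgumentException("Overlap must be smaller than the window.");

        var words = text.Split(new[] { ' ', '\n', '\t', '\r' },
                               StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        var step = maxWords - overlapWords; // each advance leaves overlapWords shared

        for (var start = 0; start < words.Length; start += step)
        {
            var take = Math.Min(maxWords, words.Length - start);
            chunks.Add(string.Join(' ', words.Skip(start).Take(take)));
            if (start + take >= words.Length) break; // final chunk emitted
        }
        return chunks;
    }
}
```

With a window of 4 and an overlap of 1, the last word of each chunk reappears as the first word of the next, so boundary sentences stay retrievable.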
Gotcha: HTML content
Convert HTML to Markdown before chunking. Raw HTML bloats chunks with noise such as tags, attributes, and inline styles, all of which degrade embedding quality. Use HtmlAgilityPack to strip the structure first.
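To show the shape of the problem, here is a crude regex-based strip. This is illustrative only; a proper parser such as HtmlAgilityPack handles malformed markup that regex cannot.

```csharp
using System.Text.RegularExpressions;

// Crude HTML -> text pass for illustration only. Real pipelines should
// use a proper parser such as HtmlAgilityPack.
public static class HtmlNoise
{
    public static string Strip(string html)
    {
        // Drop script/style blocks entirely, then remove remaining tags.
        var noScripts = Regex.Replace(html,
            @"<(script|style)[^>]*>.*?</\1>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        var noTags = Regex.Replace(noScripts, "<[^>]+>", " ");
        // Collapse the whitespace the removed tags left behind.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }
}
```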
Gotcha: Mixed content types
A chunk that mixes a code sample with surrounding prose often embeds poorly because the two content types pull the vector in different directions. Chunk code blocks separately and tag them with metadata for filtering at retrieval time. This was an important learning for me.
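A minimal splitter that separates fenced code blocks from prose and tags each chunk might look like this. TypedChunk and the "code"/"prose" labels are my own names, not a library API.

```csharp
using System.Collections.Generic;

// Split markdown into prose and code chunks so each embeds cleanly.
// The ContentType tag is stored as chunk metadata and used as a
// filter at retrieval time.
public record TypedChunk(string Text, string ContentType);

public static class MixedContentSplitter
{
    private static readonly string Fence = new string('`', 3);

    public static List<TypedChunk> Split(string markdown)
    {
        var chunks = new List<TypedChunk>();
        var buffer = new List<string>();
        var inCode = false;

        void Flush(string type)
        {
            var text = string.Join("\n", buffer).Trim();
            if (text.Length > 0) chunks.Add(new TypedChunk(text, type));
            buffer.Clear();
        }

        foreach (var line in markdown.Split('\n'))
        {
            if (line.TrimStart().StartsWith(Fence))
            {
                Flush(inCode ? "code" : "prose"); // a fence toggles content type
                inCode = !inCode;
                continue;
            }
            buffer.Add(line);
        }
        Flush(inCode ? "code" : "prose");
        return chunks;
    }
}
```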
~
The Relevance Threshold Is Not a Magic Number
Semantic Kernel’s SearchAsync takes a minRelevanceScore parameter. Tutorial defaults (0.75–0.80) are not universally correct. The right threshold depends on your corpus and embedding model.
var results = await memory.SearchAsync(
collection: CollectionName,
query: userQuery,
limit: 5,
minRelevanceScore: 0.70
);
Start at 0.70 (or whatever your comfort level is), run representative queries, and look at what gets returned.
Build a manual eval set of 20–30 query/expected-answer pairs and iterate. There is no substitute for looking at actual retrieval results on your specific data.
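One way to make that iteration systematic is to sweep candidate thresholds over the eval set’s scored hits. A sketch, assuming you have already labelled which retrieved chunks were genuinely relevant; ScoredHit and ThresholdSweep are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One retrieved chunk's similarity score, labelled relevant or not.
public record ScoredHit(float Score, bool Relevant);

public static class ThresholdSweep
{
    // For each threshold: precision (kept hits that were relevant) and
    // recall (relevant hits that survived the cut-off).
    public static Dictionary<float, (double Precision, double Recall)> Run(
        List<List<ScoredHit>> evalCases, float[] thresholds)
    {
        var all = evalCases.SelectMany(c => c).ToList();
        var totalRelevant = all.Count(h => h.Relevant);
        var result = new Dictionary<float, (double, double)>();

        foreach (var t in thresholds)
        {
            var kept = all.Where(h => h.Score >= t).ToList();
            var precision = kept.Count == 0
                ? 0 : (double)kept.Count(h => h.Relevant) / kept.Count;
            var recall = totalRelevant == 0
                ? 0 : (double)kept.Count(h => h.Relevant) / totalRelevant;
            result[t] = (precision, recall);
        }
        return result;
    }
}
```

Plotting precision against recall per threshold makes the trade-off visible: raising the threshold usually buys precision at the cost of recall.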
~
Choosing a Vector Store
Match the tool to the stage:
- VolatileMemoryStore — Demos only. Vectors live in RAM, gone on restart.
- SqliteMemoryStore — Local development and early production. File-based, zero infrastructure overhead.
- Elasticsearch — Already in your stack? Use it. Good for hybrid search.
- Azure AI Search — Production on Azure. Managed, scalable.
- Qdrant / Pinecone — Dedicated vector workloads at scale.
SQLite is underrated for early production. It’s a one-line swap from VolatileMemoryStore and handles modest query volumes without infrastructure cost. Migrate later when you actually need to.
~
The One-Time Embedding Check
Once you’re using a persistent store, add a collection existence check before the ingestion loop. Without it, every restart re-embeds the entire corpus — API calls and cost you don’t need.
var collections = await sqliteStore.GetCollectionsAsync().ToListAsync();
if (!collections.Contains(CollectionName))
{
await ragService.IngestDocumentsAsync(documents, CollectionName);
}
else
{
Console.WriteLine("Vectors already stored - skipping ingestion.");
}
Small investment. Saves meaningful API cost at scale.
~
Prompt Construction: Ground It Properly
The difference between a useful RAG system and a hallucinating one often comes down to prompt construction.
A simple prompt you can use:
var sb = new StringBuilder();
sb.AppendLine("Answer the question using ONLY the context below.");
sb.AppendLine("If the answer is not in the context, say so explicitly.");
sb.AppendLine();
sb.AppendLine("CONTEXT:");
foreach (var chunk in retrievedChunks)
{
sb.AppendLine($"[Source: {chunk.Metadata.Id}]");
sb.AppendLine(chunk.Metadata.Text);
sb.AppendLine();
}
sb.AppendLine($"QUESTION: {userQuery}");
The key phrases are “ONLY the context below” and “say so explicitly”. Without explicit grounding instructions, models blend retrieved content with training knowledge, which looks helpful but introduces unfaithful answers.
This isn’t optional.
~
Semantic Caching: The Easy Win Most People Skip
For user-facing or high-volume pipelines, add semantic caching early. Before hitting the vector store and LLM, check whether an incoming query is semantically similar to a recent query already answered.
If the similarity score is above the threshold, return the cached answer directly.
var cachedAnswer = await cacheService.FindSimilarAsync(query, threshold: 0.92f);
if (cachedAnswer != null)
{
return cachedAnswer.Answer; // No vector search, no LLM call
}
At scale this eliminates a large proportion of pipeline calls and cuts latency dramatically for common query patterns. Add this early. Retrofitting it later is more work than it needs to be.
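A minimal in-memory version of such a cache can be built around cosine similarity over stored query embeddings. This is a sketch, not the cacheService API above; the embedding function is injected so a real model client can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal in-memory semantic cache: a hit is any stored query whose
// embedding is within the cosine-similarity threshold of the new one.
public class SemanticCache
{
    private readonly Func<string, float[]> _embed;
    private readonly List<(float[] Vector, string Answer)> _entries = new();

    public SemanticCache(Func<string, float[]> embed) => _embed = embed;

    public void Store(string query, string answer) =>
        _entries.Add((_embed(query), answer));

    public string? FindSimilar(string query, float threshold)
    {
        var q = _embed(query);
        var best = _entries
            .Select(e => (e.Answer, Score: Cosine(q, e.Vector)))
            .OrderByDescending(e => e.Score)
            .FirstOrDefault();
        return best.Score >= threshold ? best.Answer : null;
    }

    private static float Cosine(float[] a, float[] b)
    {
        float dot = 0, na = 0, nb = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return na == 0 || nb == 0 ? 0 : dot / (float)Math.Sqrt(na * nb);
    }
}
```

A production version also needs entry expiry and a cap on cache size; a linear scan is fine until the entry count grows into the tens of thousands.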
~
Observability: Knowing What’s Actually Happening
A RAG pipeline has multiple failure modes and they all look the same from the outside: a bad answer. Without instrumentation you can’t tell whether the problem is in chunking, retrieval, prompt construction, or the model itself.
Consider capturing data using a logging record similar to:
public record RagQueryTrace
{
public string Query { get; init; }
public int ChunksRetrieved { get; init; }
public float TopChunkScore { get; init; }
public float LowestChunkScore { get; init; }
public string[] SourceIds { get; init; }
public string GeneratedAnswer { get; init; }
public double LatencyMs { get; init; }
public bool CacheHit { get; init; }
}
Signals to watch:
- TopChunkScore consistently below 0.75: retrieval is struggling.
- ChunksRetrieved always hitting your limit: try widening search and re-ranking.
- CacheHit always false with high latency: cache threshold may be too tight.
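These signals can be computed over a rolling window of traces. A sketch using a trimmed copy of the record above, keeping only the fields the three checks need:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed copy of RagQueryTrace: just the fields the health checks use.
public record Trace(float TopChunkScore, int ChunksRetrieved, bool CacheHit, double LatencyMs);

public static class RagSignals
{
    // Aggregate the three watch signals over a window of traces.
    public static (double AvgTopScore, double LimitHitRate, double CacheHitRate)
        Summarise(IReadOnlyList<Trace> traces, int retrievalLimit)
    {
        if (traces.Count == 0) throw new ArgumentException("No traces to summarise.");
        return (
            traces.Average(t => t.TopChunkScore),
            traces.Count(t => t.ChunksRetrieved >= retrievalLimit) / (double)traces.Count,
            traces.Count(t => t.CacheHit) / (double)traces.Count
        );
    }
}
```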
Wire up end-to-end tracing with ILogger:
public async Task<string> QueryAsync(string query, string collection)
{
var sw = Stopwatch.StartNew();
_logger.LogInformation("RAG query started. Query={Query}", query);
var chunks = await RetrieveChunksAsync(query, collection);
_logger.LogInformation("Retrieval complete. Chunks={Count}, TopScore={Score:F3}",
chunks.Count, chunks.FirstOrDefault()?.Relevance ?? 0);
var answer = await GenerateAnswerAsync(query, chunks);
_logger.LogInformation("Generation complete. LatencyMs={Ms}", sw.ElapsedMilliseconds);
return answer;
}
Diagnosing bad answers:
- Right chunks not retrieved? – Retrieval problem (threshold, chunking, embedding model)
- Chunks retrieved but answer wrong? – Tighten grounding instructions in the prompt
- Chunks and prompt correct but hallucinated? – Add explicit “do not speculate” to system prompt
Work backwards through the trace when you experience any of the above.
~
Evaluating RAG Quality and Why CI Matters
Most RAG prototypes get evaluated informally. This works until the corpus changes, a threshold gets tweaked, or the embedding model is swapped. Quality silently regresses with no way to detect it.
Build question/answer pairs covering easy queries, hard queries spanning multiple documents, and edge cases where the answer isn’t in the corpus and the system should say “I don’t know”. Three metrics worth tracking include:
- Context Recall: were the right chunks retrieved?
- Faithfulness: does the answer stick to the retrieved context?
- Answer Correctness: does the answer match the expected answer?
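Context recall, the first of the three, reduces to simple set arithmetic once each eval case records which source chunks a correct answer requires. A sketch, with EvalMetrics as a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Context recall for one eval case: what fraction of the source chunks
// that a correct answer needs were actually retrieved?
public static class EvalMetrics
{
    public static double ContextRecall(
        IEnumerable<string> expectedSourceIds,
        IEnumerable<string> retrievedSourceIds)
    {
        var expected = expectedSourceIds.ToHashSet(StringComparer.OrdinalIgnoreCase);
        if (expected.Count == 0) return 1.0; // nothing required, trivially satisfied

        var found = retrievedSourceIds.ToHashSet(StringComparer.OrdinalIgnoreCase);
        found.IntersectWith(expected);
        return (double)found.Count / expected.Count;
    }
}
```

Faithfulness and answer correctness usually need an LLM-as-judge step; context recall is the one metric you can compute mechanically, which makes it a good first CI gate.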
Wire evals into your CI process. For example:
[Fact]
public async Task RagEval_ContextRecall_AboveThreshold()
{
var results = await RunEvalSetAsync(_evalQueries);
var avgRecall = results.Average(r => r.ContextRecall);
Assert.True(avgRecall >= 0.80,
$"Context recall {avgRecall:P0} is below the 80% threshold");
}
The edge case evals are the most important, i.e. queries where the answer genuinely isn’t in the corpus.
These test whether the system correctly says “I don’t know” rather than hallucinating.
Hallucination on out-of-scope queries is the thing that erodes user trust fastest and it’s the thing informal testing almost never catches.
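One way to make that check mechanical is a small refusal detector the eval asserts against. The marker phrases below are examples; align them with whatever refusal wording your grounding prompt instructs.

```csharp
using System;
using System.Linq;

// Out-of-scope eval helper: for a question the corpus cannot answer, the
// system should refuse rather than improvise. The marker phrases are
// examples tied to the grounding prompt's wording.
public static class RefusalCheck
{
    private static readonly string[] RefusalMarkers =
    {
        "not in the context",
        "i don't know",
        "cannot answer",
    };

    public static bool IsRefusal(string answer) =>
        RefusalMarkers.Any(m => answer.Contains(m, StringComparison.OrdinalIgnoreCase));
}
```

In the xUnit eval this becomes Assert.True(RefusalCheck.IsRefusal(answer)) for each out-of-scope query.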
~
What to Watch Out For
Some other things to watch out for:
- Skipping overlap tokens in chunking — sentences near chunk boundaries silently drop out of retrieval. Always set overlapTokens.
- Using tutorial threshold values verbatim — 0.75 or 0.80 is a starting point, not a universal answer. Tune against your actual corpus.
- Re-embedding on every restart — add the collection existence check. It’s five lines and saves real API cost.
- Weak grounding instructions — “use the context below” is not the same as “ONLY the context below”. The difference shows up in production.
- No out-of-scope eval set — hallucination on questions the system can’t answer is the fastest way to lose user trust. Test for it explicitly.
One final reminder: make sure you chunk prose and code examples differently. This really caught me out.
~
Tools Used
Key tools involved in the above included:
- Semantic Kernel (chunking, embedding, retrieval)
- TextChunker.SplitMarkdownParagraphs (structure-aware chunking)
- HtmlAgilityPack (HTML-to-Markdown conversion)
- SqliteMemoryStore / VolatileMemoryStore / Azure AI Search (vector stores)
- ILogger / Stopwatch (observability and tracing)
- xUnit (eval set CI integration)
- .NET 8
The happy path is easy to build. A RAG system that stays reliable as the corpus grows, thresholds get tuned, and real users ask unexpected questions is a different problem.
The patterns above are the ones that made the difference in practice.
~