A production RAG-powered documentation chatbot started throwing intermittent “Server Error” responses.
No obvious pattern, no reliable repro, and the first sign of trouble was an automated alert in staging rather than a user complaint.
The cause turned out to be a limit almost nobody checks for: the embedding model’s context window.
Chat-completion models comfortably handle 128k tokens. The embedding model sitting next to them in the same pipeline caps at 8192.
If you guard the lower bound of your query length but not the upper bound – and most pipelines do exactly that – you have left a cliff edge in production.
Let’s dig in.
~
The Symptom: Intermittent 500s With No Pattern
The “intermittent” part is what made this annoying. Short queries worked. Most conversations worked. Then occasionally a request would 500, and the stack trace pointed straight at the OpenAI embedding API:
Microsoft.SemanticKernel.HttpOperationException: HTTP 400 (invalid_request_error)
This model's maximum context length is 8192 tokens, however you requested
39935 tokens (39935 in your prompt; 0 for the completion). Please reduce
your prompt; or completion length.
Nearly 40,000 tokens requested against an 8192-token ceiling. The HTTP 400 from OpenAI bubbled up as a Semantic Kernel HttpOperationException and surfaced to the user as a generic 500.
The call chain was straightforward once you saw it. It is listed innermost call first, so each arrow points to the caller:
VectorDatabaseService.GenerateVectorsFromSearchQueryText(queryText, removeStopWords)
-> AgentRagService.RetrieveRelevantChunks(query, chatHistory, ...)
-> HelpAgent.ProcessPromptAsync(...)
The pipeline embeds the user’s query at retrieval time so it can perform a vector search. That part is standard. The problem is what eventually arrives at the embedding model.
~
Why This Happens: Embedding Models Have a Tiny Context Window
The important detail is that embedding models and chat-completion models do not share the same context window.
Modern chat models can comfortably process prompts containing tens or even hundreds of thousands of tokens. Embedding models typically operate with much smaller limits. The embedding model used here capped requests at 8,192 tokens.
A RAG pipeline often sends related content through both paths. One path goes to a chat model for generation. Another goes to an embedding model for retrieval. Just because the chat model accepts the input does not mean the embedding model will.
The pipeline already had a lower-bound guard, MinInputQueryPromptLength, to reject empty or trivially short queries. It had no upper bound. GenerateVectorsFromSearchQueryText simply passed the cleaned query directly to the embedding API. Two kinds of input can push that query over the limit:
- A single very long user message – someone pastes an entire code file into the chat.
- Accumulated multi-turn chat history that, combined, exceeds 8192 tokens.
The first case is obvious once you see it. The second is where things become more interesting.
The Harder Failure Mode: Query Reformulation Makes Things Longer
The pipeline has a query reformulation step: a separate large-context LLM rewrites the user’s query into something more retrieval-friendly before it gets embedded. The instinct is to assume reformulation shortens things – it distils intent, strips noise. It does not always. Feed a large-context model a long multi-turn chat history and ask it to rewrite the query, and it can happily echo substantial chunks of that history back into the rewritten query. Code samples get pasted verbatim. Chat turns get verbosely paraphrased. The “reformulated” query comes out larger than the original input.
So the real failure set was:
- User sends a very long single message. Embedding rejects it. The easy case.
- Short user prompt plus multi-turn history, reformulation produces a too-long rewritten query, embedding rejects it. The harder case – and the one that had never been tested, because every unit test fed in a short single-turn query.
If your reformulation path is only ever exercised by happy-path single-turn tests, this failure mode is invisible until production finds it for you.
The Fix: A Two-tier Length Guard
One threshold is not enough here, because there are two different things you want: graceful recovery for “a bit too long,” and a hard stop for “absurdly long.” So the guard inside GenerateVectorsFromSearchQueryText has two tiers.
Tier 2: defensive truncation
The soft tier has the lower threshold, so it is the one an over-long query reaches first. If the cleaned input exceeds an internal truncation threshold of 28,000 characters, the input is truncated and a warning is logged.
The exact character-to-token conversion varies by content, so the threshold is intentionally conservative rather than being tied to a precise token count.
The embedding still succeeds on the retained portion of the query.
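As a rough sanity check on that threshold, the sketch below works out how many characters per token the input must average for 28,000 characters to fit inside 8,192 tokens. It is back-of-the-envelope arithmetic, not a tokenizer, and the TokenBudget class and its method names are hypothetical (they do not appear in the original code).

```csharp
using System;

// Hypothetical helper for headroom arithmetic. This is NOT a tokenizer:
// it only divides characters by an assumed characters-per-token ratio.
public static class TokenBudget
{
    // Minimum average characters per token needed for `charLimit`
    // characters to fit inside `tokenLimit` tokens.
    public static double MinCharsPerToken(int charLimit, int tokenLimit)
        => (double)charLimit / tokenLimit;

    // Estimated token count for `charCount` characters, assuming the
    // content averages `charsPerToken` characters per token (rounded up).
    public static int EstimateTokens(int charCount, double charsPerToken)
        => (int)Math.Ceiling(charCount / charsPerToken);
}
```

TokenBudget.MinCharsPerToken(28_000, 8_192) comes out at roughly 3.42. At an assumed 4 characters per token, 28,000 characters is an estimated 7,000 tokens, which fits. At an assumed 3 characters per token the estimate is 9,334 tokens, which does not. How much headroom the threshold really leaves therefore depends on the content, which is one more reason to watch the truncation warnings.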
private const int EmbeddingInputTruncationLengthChars = 28_000;

// Inside GenerateVectorsFromSearchQueryText, after the query has been cleaned:
if (cleanedInput.Length > EmbeddingInputTruncationLengthChars)
{
    _logger.LogWarning(
        "Query input length {Length} exceeds truncation threshold {Threshold}. " +
        "Truncating before embedding.",
        cleanedInput.Length, EmbeddingInputTruncationLengthChars);

    // Keep the head of the query and drop the tail.
    cleanedInput = cleanedInput[..EmbeddingInputTruncationLengthChars];
}
Truncating the tail is the right call here. Intent generally lives at the start of a query – the user states what they want, then pads it with context, code, and follow-on detail. Cutting the end preserves retrievability far better than cutting the start would. And because it logs a warning, it is observable: if truncation starts happening a lot, you will see it.
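One small refinement worth considering: a C# string slice counts UTF-16 code units, so a cut at a fixed index can land between the two halves of a surrogate pair (an emoji, for example) and leave an unpaired surrogate at the end. Whether that causes trouble downstream depends on the embedding client, so treat the following as a defensive sketch. SafeTruncate is a hypothetical helper, not part of the original service.

```csharp
using System;

// Hypothetical helper: truncates without splitting a UTF-16 surrogate pair.
public static class SafeTruncate
{
    // Returns at most `maxChars` UTF-16 code units from the start of `input`.
    // If the cut would separate a high surrogate from its low surrogate,
    // the dangling high surrogate is dropped as well.
    public static string Truncate(string input, int maxChars)
    {
        if (input is null) throw new ArgumentNullException(nameof(input));
        if (maxChars < 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (input.Length <= maxChars) return input;

        int cut = maxChars;
        if (cut > 0 && char.IsHighSurrogate(input[cut - 1]))
        {
            cut--; // the matching low surrogate would have been cut off
        }
        return input[..cut];
    }
}
```

The result can be one character shorter than the threshold, which is harmless for a guard whose only job is to stay under a limit.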
Tier 1: hard rejection
The hard tier exists for genuinely degenerate input – something so large that truncating it to 28k chars is no longer a reasonable representation of what the user asked. Above a separate, configurable MaxInputQueryPromptLength (default 100,000 characters), it throws a new exception type rather than quietly mangling the input:
// New file: Models/Vector/QueryTooLongException.cs
public class QueryTooLongException : Exception
{
    public QueryTooLongException(string message) : base(message) { }
}

// In VectorDatabaseService.GenerateVectorsFromSearchQueryText:
if (cleanedInput.Length > _maxInputQueryPromptLength)
{
    throw new QueryTooLongException(
        $"Query input length {cleanedInput.Length} exceeds maximum " +
        $"allowed length {_maxInputQueryPromptLength}.");
}
Note the deliberate symmetry: the pipeline already had a QueryTooShortException. The new exception is patterned on it exactly. When you add a new exception type, model it on its existing sibling – the catch blocks, the tests, and the next reader’s expectations all stay consistent for free.
Two thresholds, two jobs. Truncation at 28k chars gives you observable degradation. Rejection at 100k chars gives you a hard safety net.
They are independently tunable, which matters – you may want to move one without touching the other.
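The two snippets are shown separately, so the order of the checks is left implicit. The sketch below assumes the hard rejection runs first, against the untruncated length. If truncation ran first, every input would already be at or below 28,000 characters by the time the 100,000-character check ran, and the rejection could never fire. QueryLengthGuard and the wasTruncated flag are hypothetical; the original performs both checks inline in GenerateVectorsFromSearchQueryText.

```csharp
using System;

public class QueryTooLongException : Exception
{
    public QueryTooLongException(string message) : base(message) { }
}

// Hypothetical wrapper around the two tiers, with the thresholds as
// constructor parameters so each can be tuned independently.
public sealed class QueryLengthGuard
{
    private readonly int _truncationLengthChars;
    private readonly int _maxInputQueryPromptLength;

    public QueryLengthGuard(int truncationLengthChars = 28_000,
                            int maxInputQueryPromptLength = 100_000)
    {
        _truncationLengthChars = truncationLengthChars;
        _maxInputQueryPromptLength = maxInputQueryPromptLength;
    }

    public string Apply(string cleanedInput, out bool wasTruncated)
    {
        wasTruncated = false;

        // Tier 1: hard rejection, checked against the original length.
        if (cleanedInput.Length > _maxInputQueryPromptLength)
        {
            throw new QueryTooLongException(
                $"Query input length {cleanedInput.Length} exceeds maximum " +
                $"allowed length {_maxInputQueryPromptLength}.");
        }

        // Tier 2: defensive truncation for anything between the two thresholds.
        if (cleanedInput.Length > _truncationLengthChars)
        {
            wasTruncated = true;
            return cleanedInput[.._truncationLengthChars];
        }

        return cleanedInput;
    }
}
```

Returning wasTruncated is one possible way to let a caller tell the user that part of the query was dropped. It is an illustration, not something the original code is known to do.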
Catching It in the RAG Service
Throwing the exception is only half of it. The callers – AgentRagService.RetrieveRelevantChunks and AgentBase.RetrieveRelevantChunks – were updated to catch it and return an empty result instead of letting it become a 500, again mirroring the existing QueryTooShortException handling:
catch (QueryTooLongException ex)
{
    _logger.LogWarning(ex, "Query too long for embedding. Returning empty RAG result.");
    return new AgentRAGResult(new List<RAGChunk>(), new IntentClassificationResult(), query);
}
MaxInputQueryPromptLength was then added to all 11 appsettings*.json files that already carried MinInputQueryPromptLength – if a config key has a min, it should have a max sitting right next to it.
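With the same pair of keys repeated across that many files, a startup consistency check is one way to catch a bad value early. The sketch below is an assumption about how such a check could look. QueryLengthOptions and Validate are hypothetical names, and only the two key names and the 100,000 default come from the text above.

```csharp
using System.Collections.Generic;

// Hypothetical options type holding the two config keys side by side.
public sealed class QueryLengthOptions
{
    public int MinInputQueryPromptLength { get; set; }
    public int MaxInputQueryPromptLength { get; set; } = 100_000;

    // Returns one message per inconsistency; an empty list means the
    // configured values are coherent with each other.
    public IReadOnlyList<string> Validate(int truncationLengthChars)
    {
        var problems = new List<string>();

        if (MinInputQueryPromptLength < 0)
            problems.Add("MinInputQueryPromptLength must not be negative.");

        if (MaxInputQueryPromptLength <= MinInputQueryPromptLength)
            problems.Add("MaxInputQueryPromptLength must be greater than MinInputQueryPromptLength.");

        // If the hard limit sits below the truncation threshold, nothing
        // that passes the hard limit is ever long enough to be truncated.
        if (MaxInputQueryPromptLength < truncationLengthChars)
            problems.Add("MaxInputQueryPromptLength is below the truncation threshold.");

        return problems;
    }
}
```

Calling Validate(28_000) once at startup and logging or failing on a non-empty result would surface a mistyped value in any one of the files before a user query does.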
~
Test the Multi-turn Path Explicitly
The tests are where the real lesson lands. Three tests went into VectorDatabaseServiceTests.cs – the reject path, the truncation path, and plain pass-through. Fine. The interesting ones went into a new AgentRagServiceTests.cs:
- Short prompt, no history – happy-path smoke test.
- Too-long prompt at the outset – the service catches QueryTooLongException and returns empty.
- Multi-turn reformulation produces a too-long query. Short user prompt plus a 3-turn history, with the reformulation LLM mocked to return a 200,000-character result. Assert an empty result comes back – not a 500.
- Multi-turn reformulation succeeds – assert the reformulated query, not the original, is the one that gets embedded.
- Reformulation disabled – the original query is embedded directly.
Test 3 is the one that matters. You cannot surface the reformulation-blowup bug with a real LLM in a unit test – you would need to reliably coax it into producing pathologically long output, which you cannot.
So you mock the reformulation LLM and make it return 200k characters on purpose. That is the only way the edge case becomes a deterministic, repeatable test instead of a production incident.
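A framework-free sketch of that idea follows. Every type here is a hypothetical stand-in: IQueryReformulator, BloatingReformulator, and RetrievalPipeline are invented for illustration and are much simpler than the real AgentRagService, which the original tests exercise through mocks. The point is only the shape of the test: a fake reformulator with a fixed output length makes the blowup deterministic.

```csharp
using System;
using System.Collections.Generic;

public class QueryTooLongException : Exception
{
    public QueryTooLongException(string message) : base(message) { }
}

// Hypothetical seam for the reformulation LLM.
public interface IQueryReformulator
{
    string Reformulate(string query, IReadOnlyList<string> chatHistory);
}

// Fake that ignores its input and returns output of a chosen length.
public sealed class BloatingReformulator : IQueryReformulator
{
    private readonly int _outputLength;
    public BloatingReformulator(int outputLength) => _outputLength = outputLength;

    public string Reformulate(string query, IReadOnlyList<string> chatHistory)
        => new string('x', _outputLength);
}

// Simplified stand-in for the retrieval path: reformulate, guard, "search".
public sealed class RetrievalPipeline
{
    private const int MaxInputQueryPromptLength = 100_000;
    private readonly IQueryReformulator _reformulator;

    public RetrievalPipeline(IQueryReformulator reformulator) => _reformulator = reformulator;

    public IReadOnlyList<string> RetrieveRelevantChunks(string query, IReadOnlyList<string> chatHistory)
    {
        try
        {
            string rewritten = _reformulator.Reformulate(query, chatHistory);
            return EmbedAndSearch(rewritten);
        }
        catch (QueryTooLongException)
        {
            // Degrade to an empty result instead of surfacing an error.
            return Array.Empty<string>();
        }
    }

    private static IReadOnlyList<string> EmbedAndSearch(string input)
    {
        if (input.Length > MaxInputQueryPromptLength)
            throw new QueryTooLongException($"Query input length {input.Length} exceeds maximum.");
        return new[] { "chunk" }; // placeholder for a real vector search hit
    }
}
```

Constructing the pipeline with new BloatingReformulator(200_000), passing a short prompt and a three-turn history, and asserting that the result is empty reproduces the failure on every run. The same pipeline with a reformulator that returns short output should return the placeholder chunk.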
~
In Summary
- Guard both ends of every length-sensitive input. If you have a MinInputQueryPromptLength, write the Max in the same sitting.
- Embedding models and chat models have different context windows. The fact that your chat model swallowed 40k tokens tells you nothing about the embedding call.
- Reformulation can make input larger, not smaller. Treat reformulation output as untrusted-length input, exactly like raw user input.
- Mock the path that produces pathologically long output. The happy-path test will never surface the reformulation blowup.
- Graceful degradation is only graceful if the user knows it happened. Silent truncation without UI feedback is incomplete.
~
Closing Thought
The root cause was not AI, embeddings, Semantic Kernel, or OpenAI. It was an unbounded input.
The technology stack made the failure harder to see, but the bug itself was one of the oldest bugs in software engineering: assuming a value would always stay within a reasonable range.
In production, eventually it won’t.
~
This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun in C# is 100% mine.
