A production RAG-powered documentation chatbot started throwing intermittent “Server Error” responses.
No obvious pattern, no reliable repro, and the first sign of trouble was an automated alert in staging rather than a user complaint.
The cause turned out to be a limit almost nobody checks for: the embedding model’s context window.
Chat-completion models comfortably handle 128k tokens. The embedding model sitting next to them in the same pipeline caps at 8192.
If you guard the lower bound of your query length but not the upper bound – and most pipelines do exactly that – you have left a cliff edge in production.
Let’s dig in.
~
The “intermittent” part is what made this annoying. Short queries worked. Most conversations worked. Then occasionally a request would 500, and the stack trace pointed straight at the OpenAI embedding API:
Microsoft.SemanticKernel.HttpOperationException: HTTP 400 (invalid_request_error)
This model's maximum context length is 8192 tokens, however you requested
39935 tokens (39935 in your prompt; 0 for the completion). Please reduce
your prompt; or completion length.
Nearly 40,000 tokens requested against an 8192-token ceiling. The HTTP 400 from OpenAI bubbled up as a Semantic Kernel HttpOperationException and surfaced to the user as a generic 500.
The call chain was straightforward once you saw it:
VectorDatabaseService.GenerateVectorsFromSearchQueryText(queryText, removeStopWords)
-> AgentRagService.RetrieveRelevantChunks(query, chatHistory, ...)
-> HelpAgent.ProcessPromptAsync(...)
The pipeline embeds the user’s query at retrieval time so it can perform a vector search. That part is standard. The problem is what eventually arrives at the embedding model.
~
The important detail is that embedding models and chat-completion models do not share the same context window.
Modern chat models can comfortably process prompts containing tens or even hundreds of thousands of tokens. Embedding models typically operate with much smaller limits. The embedding model used here capped requests at 8,192 tokens.
A RAG pipeline often sends related content through both paths. One path goes to a chat model for generation. Another goes to an embedding model for retrieval. Just because the chat model accepts the input does not mean the embedding model will.
The pipeline already had a lower-bound guard, MinInputQueryPromptLength, to reject empty or trivially short queries. It had no upper bound. GenerateVectorsFromSearchQueryText simply passed the cleaned query directly to the embedding API.
Two kinds of input can blow past that limit. The first, a raw user query that is simply too long, is easy to picture. The second is where this got interesting.
The pipeline has a query reformulation step: a separate large-context LLM rewrites the user’s query into something more retrieval-friendly before it gets embedded. The instinct is to assume reformulation shortens things – it distils intent, strips noise. It does not always. Feed a large-context model a long multi-turn chat history and ask it to rewrite the query, and it can happily echo substantial chunks of that history back into the rewritten query. Code samples get pasted verbatim. Chat turns get verbosely paraphrased. The “reformulated” query comes out larger than the original input.
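So the real failure set was twofold: raw user input that is too long for the embedding model, and reformulated queries that come out larger than the original input and exceed the limit.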
If your reformulation path is only ever exercised by happy-path single-turn tests, this failure mode is invisible until production finds it for you.
One threshold is not enough here, because there are two different things you want: graceful recovery for “a bit too long,” and a hard stop for “absurdly long.” So the guard inside GenerateVectorsFromSearchQueryText has two tiers.
The soft tier fires first. If the cleaned input exceeds an internal truncation threshold of 28,000 characters, the input is truncated and a warning is logged.
The exact character-to-token conversion varies by content, so the threshold is intentionally conservative rather than being tied to a precise token count.
The embedding still succeeds on the retained portion of the query.
private const int EmbeddingInputTruncationLengthChars = 28_000;

if (cleanedInput.Length > EmbeddingInputTruncationLengthChars)
{
    _logger.LogWarning(
        "Query input length {Length} exceeds truncation threshold {Threshold}. " +
        "Truncating before embedding.",
        cleanedInput.Length, EmbeddingInputTruncationLengthChars);
    cleanedInput = cleanedInput[..EmbeddingInputTruncationLengthChars];
}
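The arithmetic behind a character-based threshold can be sketched as follows. This is a rough illustration in Python, not the pipeline's code: the 8192-token limit and the 28,000-character threshold come from the text above, while the chars-per-token densities (4.0 and 3.0) are assumed rules of thumb, not measurements from a tokenizer.

```python
import math

EMBEDDING_TOKEN_LIMIT = 8192        # limit quoted in the error message
TRUNCATION_LENGTH_CHARS = 28_000    # soft threshold used by the guard


def min_chars_per_token(char_limit: int, token_limit: int) -> float:
    """Average chars per token the text must reach for char_limit chars to fit."""
    return char_limit / token_limit


def estimated_tokens(char_count: int, chars_per_token: float) -> int:
    """Estimate a token count from an ASSUMED average density (not a tokenizer)."""
    return math.ceil(char_count / chars_per_token)


break_even = min_chars_per_token(TRUNCATION_LENGTH_CHARS, EMBEDDING_TOKEN_LIMIT)
print(f"{break_even:.2f}")                                # 3.42
print(estimated_tokens(TRUNCATION_LENGTH_CHARS, 4.0))     # 7000 -> fits
print(estimated_tokens(TRUNCATION_LENGTH_CHARS, 3.0))     # 9334 -> would not fit
```

Under these assumptions, 28,000 characters fit as long as the text averages at least about 3.42 characters per token. Content that tokenizes more densely than that, which may include code-heavy input, could still exceed the limit after truncation, so it is worth checking the threshold against your own traffic.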
Truncating the tail is the right call here. Intent generally lives at the start of a query – the user states what they want, then pads it with context, code, and follow-on detail. Cutting the end preserves retrievability far better than cutting the start would. And because it logs a warning, it is observable: if truncation starts happening a lot, you will see it.
The hard tier exists for genuinely degenerate input – something so large that truncating it to 28k chars is no longer a reasonable representation of what the user asked. Above a separate, configurable MaxInputQueryPromptLength (default 100,000 characters), it throws a new exception type rather than quietly mangling the input:
// New file: ModelsVectorQueryTooLongException.cs
public class QueryTooLongException : Exception
{
    public QueryTooLongException(string message) : base(message) { }
}

// In VectorDatabaseService.GenerateVectorsFromSearchQueryText:
if (cleanedInput.Length > _maxInputQueryPromptLength)
{
    throw new QueryTooLongException(
        $"Query input length {cleanedInput.Length} exceeds maximum " +
        $"allowed length {_maxInputQueryPromptLength}.");
}
Note the deliberate symmetry: the pipeline already had a QueryTooShortException. The new exception is patterned on it exactly. When you add a new exception type, model it on its existing sibling – the catch blocks, the tests, and the next reader’s expectations all stay consistent for free.
Two thresholds, two jobs. Truncation at 28k chars gives you observable degradation. Rejection at 100k chars gives you a hard safety net.
They are independently tunable, which matters – you may want to move one without touching the other.
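Put together, the two tiers might look like the sketch below. It is written in Python for brevity and is not the service's actual code: the function and exception names are hypothetical, and the ordering (reject check before truncation) is an assumption. That ordering seems necessary, though, because a reject check that ran after truncating to 28,000 characters could never see a 100,000-character input.

```python
import logging

logger = logging.getLogger("embedding_guard")

TRUNCATION_LENGTH_CHARS = 28_000   # soft tier: truncate and warn
MAX_QUERY_LENGTH_CHARS = 100_000   # hard tier: reject outright


class QueryTooLongError(Exception):
    """Hypothetical counterpart of QueryTooLongException."""


def guard_embedding_input(
    cleaned_input: str,
    truncate_at: int = TRUNCATION_LENGTH_CHARS,
    max_length: int = MAX_QUERY_LENGTH_CHARS,
) -> str:
    # Hard tier first: measure the untruncated input.
    if len(cleaned_input) > max_length:
        raise QueryTooLongError(
            f"Query input length {len(cleaned_input)} exceeds maximum "
            f"allowed length {max_length}."
        )
    # Soft tier: keep the head of the query and make the truncation observable.
    if len(cleaned_input) > truncate_at:
        logger.warning(
            "Query input length %d exceeds truncation threshold %d. Truncating.",
            len(cleaned_input),
            truncate_at,
        )
        return cleaned_input[:truncate_at]
    return cleaned_input


print(len(guard_embedding_input("a" * 50_000)))   # 28000
```

Keeping both limits as parameters mirrors the point about independent tuning: either one can move without touching the other.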
Throwing the exception is only half of it. The callers – AgentRagService.RetrieveRelevantChunks and AgentBase.RetrieveRelevantChunks – were updated to catch it and return an empty result instead of letting it become a 500, again mirroring the existing QueryTooShortException handling:
catch (QueryTooLongException ex)
{
    _logger.LogWarning(ex, "Query too long for embedding. Returning empty RAG result.");
    return new AgentRAGResult(new List<RAGChunk>(), new IntentClassificationResult(), query);
}
MaxInputQueryPromptLength was then added to all 11 appsettings*.json files that already carried MinInputQueryPromptLength – if a config key has a min, it should have a max sitting right next to it.
~
The tests are where the real lesson lands. Three tests went into VectorDatabaseServiceTests.cs – the reject path, the truncation path, and plain pass-through. Fine. The interesting ones went into a new AgentRagServiceTests.cs:
QueryTooLongException and returns empty.
Test 3 is the one that matters. You cannot surface the reformulation-blowup bug with a real LLM in a unit test – you would need to reliably coax it into producing pathologically long output, which you cannot.
So you mock the reformulation LLM and make it return 200k characters on purpose. That is the only way the edge case becomes a deterministic, repeatable test instead of a production incident.
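The shape of such a test can be sketched as follows, in Python rather than the project's C#. Everything here is a hypothetical stand-in: `embed_query` replaces the real embedding call, `retrieve_relevant_chunks` replaces the RAG service, and the reformulator is injected as a plain function so a fake can replace the LLM.

```python
MAX_QUERY_LENGTH_CHARS = 100_000


class QueryTooLongError(Exception):
    """Hypothetical counterpart of QueryTooLongException."""


def embed_query(text: str) -> list[float]:
    # Stand-in for the embedding call; only the length guard is modelled.
    if len(text) > MAX_QUERY_LENGTH_CHARS:
        raise QueryTooLongError(f"Query input length {len(text)} is too long.")
    return [0.0]


def retrieve_relevant_chunks(query: str, reformulate) -> list[str]:
    # The reformulated text, not the raw query, is what reaches the embedder.
    reformulated = reformulate(query)
    try:
        embed_query(reformulated)
    except QueryTooLongError:
        return []          # degrade to an empty result instead of a 500
    return ["chunk"]       # placeholder for real vector-search results


def runaway_reformulator(query: str) -> str:
    # Fake LLM: deterministically returns 200k characters for any input.
    return "x" * 200_000


def well_behaved_reformulator(query: str) -> str:
    return query.strip()


print(retrieve_relevant_chunks("how do I reset my password?", runaway_reformulator))       # []
print(retrieve_relevant_chunks("how do I reset my password?", well_behaved_reformulator))  # ['chunk']
```

The short input query is the point of the fake: the test fails only if the guard looks at the text before reformulation instead of after it.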
~
Guard both ends of every length-sensitive input. If you have a MinInputQueryPromptLength, write the Max in the same sitting.
Embedding models and chat models have different context windows. The fact that your chat model swallowed 40k tokens tells you nothing about the embedding call.
Reformulation can make input larger, not smaller. Treat reformulation output as untrusted-length input, exactly like raw user input.
Mock the path that produces pathologically long output. The happy-path test will never surface the reformulation blowup.
Graceful degradation is only graceful if the user knows it happened. Silent truncation without UI feedback is incomplete.
~
The root cause was not AI, embeddings, Semantic Kernel, or OpenAI. It was an unbounded input.
The technology stack made the failure harder to see, but the bug itself was one of the oldest bugs in software engineering: assuming a value would always stay within a reasonable range.
In production, eventually it won’t.
~
Every developer learns DRY early, and almost everyone learns it wrong.
Don't Repeat Yourself. See two pieces of code that look the same, extract a method, delete the duplicate. I did this for years and wrote some of the worst code I've ever had to maintain.
Every piece of it started as an innocent attempt to not repeat myself.
Here's the part most people skip. The original definition, from Andy Hunt and Dave Thomas in The Pragmatic Programmer, says nothing about code:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
It's about knowledge. A single fact about your domain, like a tax rule or the format of an invoice number, should live in exactly one place. When that fact changes, you change it once instead of hunting for seven copies.
Two pieces of code can look identical and represent completely different knowledge.
Say you validate two addresses. One is a customer's shipping address, the other is a warehouse address. Today the rules are identical:
public bool IsValid(Address address) =>
    !string.IsNullOrWhiteSpace(address.Street) &&
    !string.IsNullOrWhiteSpace(address.City) &&
    !string.IsNullOrWhiteSpace(address.PostalCode);
The DRY reflex says extract one validator and call it from both places. But these are different concepts that happen to share rules this week. The day the warehouse needs a loading-dock code, you're back in the shared method bolting on a flag to keep the other caller working:
public bool IsValid(Address address, bool requireDockCode = false) =>
    !string.IsNullOrWhiteSpace(address.Street) &&
    !string.IsNullOrWhiteSpace(address.City) &&
    !string.IsNullOrWhiteSpace(address.PostalCode) &&
    (!requireDockCode || !string.IsNullOrWhiteSpace(address.DockCode));
That boolean is the tell. The first time a shared method grows a flag so one caller behaves differently, you didn't have duplication. You had two things that looked alike and glued them together. Give it a year and the signature has three more flags, each one a place where the two concepts were never actually the same.
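For contrast, here is a minimal sketch of the alternative: two validators that are allowed to drift. It is in Python rather than C#, and all names are hypothetical. The only shared piece is a generic "has text" check, which plays the role of string.IsNullOrWhiteSpace and carries no domain knowledge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    postal_code: str
    dock_code: str = ""   # only meaningful for warehouses


def has_text(value: str) -> bool:
    """Generic primitive: true when the string contains non-whitespace."""
    return bool(value and value.strip())


def is_valid_shipping_address(address: Address) -> bool:
    # Shipping rules: free to change without touching warehouses.
    return (
        has_text(address.street)
        and has_text(address.city)
        and has_text(address.postal_code)
    )


def is_valid_warehouse_address(address: Address) -> bool:
    # Warehouse rules: same today plus a dock code, no flag required.
    return (
        has_text(address.street)
        and has_text(address.city)
        and has_text(address.postal_code)
        and has_text(address.dock_code)
    )


home = Address("1 Main St", "Springfield", "12345")
print(is_valid_shipping_address(home))    # True
print(is_valid_warehouse_address(home))   # False
```

The repeated three lines are the price; in exchange, neither caller has to know the other exists.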
Duplication is far cheaper than the wrong abstraction.
Copy-paste is visible and local. You can see both copies, and if they drift apart, that was always allowed. The wrong abstraction is invisible and global. Every caller depends on one shape and bends it to fit, the flags pile up, and you end up afraid to touch a method you no longer understand. I've spent more time deleting bad abstractions than I ever saved writing them.
This is the hidden coupling cost I wrote about in the abstractions piece, and DRY-by-reflex is one of the most common ways it sneaks in.
Inside one class, a bad helper is annoying. Across module boundaries, it's structural damage.
Picture a modular monolith with a Billing module and a Shipping module.
Both have an Order.
A well-meaning engineer notices the two classes share fields and pulls them into a shared type both modules reference:
// Shared.Orders, referenced by both Billing and Shipping
public class Order
{
    public Guid Id { get; set; }
    public string CustomerName { get; set; }
    public decimal Total { get; set; }
    // ...whatever either module happens to need
}
Now Billing and Shipping can't evolve independently. A change to billing's order forces a recompile, re-test, and redeploy of shipping. You took two bounded contexts that were supposed to be decoupled and welded them together to save a few properties.
Two modules each owning their own Order is the whole point of keeping data inside its boundaries. The shapes are allowed to be similar, modeling the same real-world thing from two points of view that drift over time. It's the same reason vertical slices tolerate a little repetition, so each slice can change on its own.
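A minimal sketch of what "each module owns its own Order" could look like, in Python with hypothetical type and field names. The two types share an identifier and a customer name but are otherwise free to diverge.

```python
from dataclasses import dataclass, fields
from decimal import Decimal
from uuid import UUID, uuid4


@dataclass
class BillingOrder:
    """Billing's view of an order (hypothetical fields)."""
    id: UUID
    customer_name: str
    total: Decimal
    tax_id: str = ""                 # billing-only concern


@dataclass
class ShippingOrder:
    """Shipping's view of the same real-world order (hypothetical fields)."""
    id: UUID
    customer_name: str
    destination_postal_code: str     # shipping-only concern


order_id = uuid4()
billing = BillingOrder(order_id, "Ada", Decimal("120.00"), tax_id="X1")
shipping = ShippingOrder(order_id, "Ada", "10001")

# Same order, two independent shapes: adding tax_id never touched shipping.
print(billing.id == shipping.id)                                  # True
print("tax_id" in {f.name for f in fields(ShippingOrder)})        # False
```

The modules correlate through the identifier only, so a new billing field requires no change, re-test, or redeploy on the shipping side.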
I don't deduplicate the second time I see something. I wait for the third, and I ask one question: if this rule changes, do both copies have to change together?
Let the code repeat until the right abstraction becomes obvious, because good abstractions are discovered from concrete cases, not guessed up front. Some people call this AHA, for "Avoid Hasty Abstractions."
A practical tell: extract when you can name the concept.
A real domain name like Money, TaxRate, or InvoiceNumber is probably knowledge worth a value object.
If the best name you can find is Helper, Utils, or ProcessData, you're abstracting shape, not knowledge.
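As an illustration of a concept that can be named, here is a small TaxRate value object sketched in Python. The validation range and the rounding to two decimal places are assumptions made for the example, not rules taken from the text.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TaxRate:
    """A named piece of domain knowledge: a rate between 0 and 1 (assumed range)."""
    value: Decimal

    def __post_init__(self) -> None:
        if not (Decimal("0") <= self.value <= Decimal("1")):
            raise ValueError(f"Tax rate must be between 0 and 1, got {self.value}")

    def apply_to(self, net_amount: Decimal) -> Decimal:
        # Rounding to cents is an assumption for this sketch.
        return (net_amount * self.value).quantize(Decimal("0.01"))


vat = TaxRate(Decimal("0.20"))
print(vat.apply_to(Decimal("19.99")))   # 4.00
```

The rule "what a valid rate is and how it applies" now lives in one place with a domain name, which is the kind of extraction the naming test is meant to approve.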
Applied correctly, DRY is invaluable. A business rule belongs in exactly one place. Watch what happens when "an order over $1,000 needs manager approval" gets copy-pasted across three services:
// OrderService
if (order.Total > 1000) { /* require approval */ }
// CheckoutService
if (order.Total > 1000m) { /* require approval */ }
// AdminController - someone bumped the limit here, and only here
if (order.Total > 5000) { /* require approval */ }
You will eventually update two of them and ship a bug. That drifted third copy is exactly how it happens. Push the rule into the domain model where it has one home:
public bool RequiresManagerApproval() => Total > 1000;
That's the single authoritative representation DRY is actually about.
The next time you're about to delete a duplicate, don't ask whether the code looks the same. Ask whether it means the same. That one question will save you more maintenance pain than DRY ever saved you typing.
If you want to see how I draw these boundaries in a real system, with independent modules and abstractions that earn their place, that's the heart of Pragmatic Clean Architecture.
Thanks for reading.
And stay awesome!
This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤖 in C# are 100% mine.
Hi!
If you are building local AI apps in C#, you quickly hit a practical gap:
.NET AI app and agent code is written against IChatClient.
Foundry Local does not come with an IChatClient bridge yet.
That is exactly why I created this library.
I wanted a clean, non-REST, in-process integration where:
the local model is exposed through the standard abstraction (IChatClient).
the same IChatClient can be reused by Microsoft Agent Framework.
So the library provides a thin adapter:
Foundry Local SDK -> FoundryLocalChatClientAdapter -> IChatClient
This lets you write provider-agnostic app code while still running local inference.
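The adapter idea itself is language-neutral, and a minimal Python sketch may make it concrete. Nothing below is the library's API: `ChatClient`, `LocalRuntime` and `LocalRuntimeChatAdapter` are hypothetical stand-ins for IChatClient, the local SDK and the adapter respectively.

```python
from typing import Protocol


class ChatClient(Protocol):
    """The abstraction app code depends on (stand-in for IChatClient)."""
    def get_response(self, prompt: str) -> str: ...


class LocalRuntime:
    """Hypothetical local inference SDK with its own, different API."""
    def generate(self, model_alias: str, text: str) -> str:
        return f"[{model_alias}] echo: {text}"


class LocalRuntimeChatAdapter:
    """Thin adapter: translates the abstraction's call into the runtime's call."""
    def __init__(self, runtime: LocalRuntime, model_alias: str) -> None:
        self._runtime = runtime
        self._model_alias = model_alias

    def get_response(self, prompt: str) -> str:
        return self._runtime.generate(self._model_alias, prompt)


def summarize(client: ChatClient, topic: str) -> str:
    # Provider-agnostic app code: it only knows about ChatClient.
    return client.get_response(f"Summarize: {topic}")


adapter = LocalRuntimeChatAdapter(LocalRuntime(), "demo-model")
print(summarize(adapter, "local AI"))   # [demo-model] echo: Summarize: local AI
```

Swapping the local runtime for a hosted provider would mean writing another adapter, while `summarize` stays unchanged.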
Install:
dotnet add package ElBruno.MAF.FoundryLocal.Adapter
dotnet add package Microsoft.Extensions.Hosting
dotnet add package Microsoft.Agents.AI
Use the wrapper as IChatClient:
using ElBruno.MAF.FoundryLocal;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
var builder = Host.CreateApplicationBuilder(args);

builder.Services.Configure<FoundryLocalOptions>(o =>
{
    o.ModelAlias = "qwen2.5-0.5b";
    o.DownloadIfMissing = true;
    o.UnloadOnExit = true;
});
builder.Services.Configure<ChatRuntimeOptions>(_ => { });
builder.Services.AddSingleton<FoundryLocalModelLifecycleService>();
builder.Services.AddSingleton<IChatClient, FoundryLocalChatClientAdapter>();

using var host = builder.Build();
var chatClient = host.Services.GetRequiredService<IChatClient>();

var response = await chatClient.GetResponseAsync(
[
    new(ChatRole.User, "Explain local-first AI in one paragraph.")
]);
Console.WriteLine(response.Text);
Reuse the same wrapper with Microsoft Agent Framework:
using Microsoft.Agents.AI;

var agent = new ChatClientAgent(
    chatClient: chatClient,
    instructions: "You are a concise local assistant.",
    name: "LocalFoundryAgent",
    description: "Foundry Local + IChatClient adapter",
    tools: null,
    loggerFactory: null,
    services: null);

var agentResponse = await agent.RunAsync(
    "Give me 3 bullet points about local AI in .NET.",
    session: null,
    options: null);
Console.WriteLine(agentResponse);
IChatClient programming model
In short: local model runtime + standard .NET AI abstractions + minimal glue code.
Happy coding!
Greetings
El Bruno
More posts in my blog ElBruno.com.
More info in https://beacons.ai/elbruno
SQL Server 2025 and Azure SQL Database add native support for Base64 encoding and decoding with two T-SQL functions: BASE64_ENCODE and BASE64_DECODE. These functions make it much easier to convert binary data to text-friendly representations, and then convert those strings back to binary data when needed.
This is useful in many everyday scenarios: embedding binary content in JSON, constructing data: URLs for HTML, transmitting binary payloads through text-based protocols, and generating URL-safe tokens. Previously, developers often had to rely on XML tricks, application-side code, CLR functions, or custom conversion logic. Now, this functionality is directly available in T-SQL.
Important: Base64 is an encoding format, not an encryption mechanism. It makes binary data text-friendly, but it does not secure or hide the underlying data.
Here is the complete demo code used in this post.
-- Base64 encoding is a method of converting binary data into an ASCII string format by translating
-- it into a radix-64 representation. Base64 decoding is the reverse process, converting an ASCII
-- string back into binary data. Encoding is often used to safely transmit binary data over text-based
-- protocols, such as HTTP or email, though note that encoding binary data increases its size by
-- approximately 33%.

SELECT StandardEncoded = BASE64_ENCODE(0xCAFECAFE)      -- Note the / in the encoded string; unsafe for URLs
SELECT StandardDecoded = BASE64_DECODE('yv7K/g==')      -- Decodes back to 0xCAFECAFE
SELECT UrlSafeEncoded = BASE64_ENCODE(0xCAFECAFE, 1)    -- Note the encoded string is safe for URLs
SELECT UrlSafeDecoded = BASE64_DECODE('yv7K_g')         -- Both URL-safe and URL-unsafe encodings decode to the same binary data
SELECT InvalidBase64 = BASE64_DECODE('qQ!!')            -- Causes an error due to invalid Base64 characters

-- Constructing a JSON object with an embedded image (also suitable for HTML or XML)
DECLARE @ProductId int = 6
DECLARE @ProductImage varbinary(max) = 0xCAFECAFE89504E470D0A1A0A0000000D49484452
SELECT ProductJson = JSON_OBJECT(
    'productId' : @ProductId,
    'thumbnail' : 'data:image/png;base64,' || BASE64_ENCODE(@ProductImage)
)
GO

-- Constructing a binary token for use in a URL
DECLARE @Token varbinary(max) = CAST('user1|expiry=2025-12-31' as varbinary)
SELECT @Token
DECLARE @UrlSafeEncodedToken varchar(max) = BASE64_ENCODE(@Token, 1)
SELECT @UrlSafeEncodedToken
DECLARE @Url varchar(max) = 'https://api.myapp.com/download?token=' || @UrlSafeEncodedToken
SELECT @Url
BASE64_ENCODE converts a varbinary value into a Base64-encoded varchar value. In other words, it takes binary data and turns it into a string that can safely travel through systems that expect text.
SELECT StandardEncoded = BASE64_ENCODE(0xCAFECAFE)
| StandardEncoded |
|---|
| yv7K/g== |
Notice that the encoded string includes a forward slash (/) and padding characters (==). This is standard Base64 output.
BASE64_DECODE performs the reverse operation. It takes a Base64-encoded varchar value and converts it back into the corresponding varbinary value.
SELECT StandardDecoded = BASE64_DECODE('yv7K/g==')
| StandardDecoded |
|---|
| 0xCAFECAFE |
The result is the original binary value.
Standard Base64 output can include characters such as +, /, and =. These characters can be inconvenient in URLs because they may require escaping or have special meaning depending on where they appear.
The optional second argument to BASE64_ENCODE controls whether SQL Server should generate URL-safe output.
SELECT UrlSafeEncoded = BASE64_ENCODE(0xCAFECAFE, 1)
| UrlSafeEncoded |
|---|
| yv7K_g |
In this version, the slash is replaced with an underscore, and the padding characters are omitted. This makes the encoded value much friendlier for use in URLs.
BASE64_DECODE can decode both standard Base64 strings and URL-safe Base64 strings.
SELECT UrlSafeDecoded = BASE64_DECODE('yv7K_g')
| UrlSafeDecoded |
|---|
| 0xCAFECAFE |
Both yv7K/g== and yv7K_g decode to the same binary value.
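These values can be cross-checked outside the database. The sketch below uses Python's standard base64 module; `decode_any` is a hypothetical helper written for this example. Note one difference: Python's URL-safe encoder keeps the `=` padding, so it is stripped here to match the unpadded output shown above.

```python
import base64
import binascii

raw = bytes.fromhex("CAFECAFE")

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
print(standard)   # yv7K/g==
print(url_safe)   # yv7K_g


def decode_any(text: str) -> bytes:
    """Decode standard or URL-safe Base64, with or without padding."""
    normalized = text.replace("-", "+").replace("_", "/")
    padded = normalized + "=" * (-len(normalized) % 4)
    # validate=True rejects characters outside the Base64 alphabet.
    return base64.b64decode(padded, validate=True)


print(decode_any("yv7K/g==").hex().upper())   # CAFECAFE
print(decode_any("yv7K_g").hex().upper())     # CAFECAFE

try:
    decode_any("qQ!!")
except binascii.Error:
    print("invalid Base64 input rejected")
```

The last call mirrors the strict behaviour a decoder should have: malformed input fails loudly instead of producing unexpected bytes.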
Invalid input raises an error. This is useful because it prevents malformed encoded strings from silently producing incorrect binary data.
SELECT InvalidBase64 = BASE64_DECODE('qQ!!')
Msg 9803, Level 16, State 20
Invalid data for type "Base64Decode".
In this example, the exclamation points are not valid Base64 characters, so decoding fails.
One very practical use case is embedding binary data in JSON. This is common when sending images, thumbnails, documents, or other binary payloads through APIs.
DECLARE @ProductId int = 6
DECLARE @ProductImage varbinary(max) = 0xCAFECAFE89504E470D0A1A0A0000000D49484452
SELECT ProductJson = JSON_OBJECT(
    'productId' : @ProductId,
    'thumbnail' : 'data:image/png;base64,' || BASE64_ENCODE(@ProductImage)
)
{"productId":6,"thumbnail":"data:image/png;base64,yv7K/olQTkcNChoKAAAADUlIRFI="}
The thumbnail property contains a data:image/png;base64, prefix followed by the Base64-encoded binary image data. This same general technique can be used with HTML or XML, not just JSON.
Base64 encoding increases the size of the encoded data by roughly one-third, so it is convenient but not free. For large binary payloads, it may be better to store the binary data separately and reference it by URL or identifier.
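The "roughly one-third" figure follows from Base64 emitting 4 characters for every 3 input bytes. A short Python sketch of that arithmetic, checked against the 20-byte demo image from the example above (`base64_length` is a hypothetical helper name):

```python
import base64


def base64_length(byte_count: int, padded: bool = True) -> int:
    """Number of Base64 characters needed for byte_count bytes."""
    if padded:
        return 4 * ((byte_count + 2) // 3)   # whole 4-char groups, '=' padded
    return (byte_count * 4 + 2) // 3         # same output without the padding


image = bytes.fromhex("CAFECAFE89504E470D0A1A0A0000000D49484452")

print(len(image))                         # 20
print(base64_length(len(image)))          # 28
print(len(base64.b64encode(image)))       # 28
print(base64_length(4, padded=False))     # 6
print(base64_length(3_000_000))           # 4000000
```

For small values the padding makes the overhead look larger (28 characters for 20 bytes is 40%), while for large payloads it settles at one-third, as in the 3 MB to 4 MB case.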
Another common scenario is constructing a token that needs to be included in a URL. The token starts as binary data, gets URL-safe Base64 encoded, and is then appended to a query string.
DECLARE @Token varbinary(max) = CAST('user1|expiry=2025-12-31' as varbinary)
SELECT @Token
DECLARE @UrlSafeEncodedToken varchar(max) = BASE64_ENCODE(@Token, 1)
SELECT @UrlSafeEncodedToken
DECLARE @Url varchar(max) = 'https://api.myapp.com/download?token=' || @UrlSafeEncodedToken
SELECT @Url
First, the string token is represented as binary data:
0x75736572317C6578706972793D323032352D31322D3331
Then it is encoded using URL-safe Base64:
dXNlcjF8ZXhwaXJ5PTIwMjUtMTItMzE
Finally, that encoded token is placed directly in the URL:
https://api.myapp.com/download?token=dXNlcjF8ZXhwaXJ5PTIwMjUtMTItMzE
This is much safer for URLs than using standard Base64 output, because the URL-safe form avoids characters that commonly require escaping in query strings.
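The same token can be reproduced and read back with a few lines of Python, which also shows why "safer for URLs" should not be read as "secret". This is a sketch using the standard base64 module; it assumes the token text is plain ASCII, as in the example.

```python
import base64

token = b"user1|expiry=2025-12-31"

# URL-safe alphabet, padding stripped to match the unpadded form shown above.
encoded = base64.urlsafe_b64encode(token).decode("ascii").rstrip("=")
url = "https://api.myapp.com/download?token=" + encoded

print(token.hex().upper())   # 75736572317C6578706972793D323032352D31322D3331
print(encoded)               # dXNlcjF8ZXhwaXJ5PTIwMjUtMTItMzE
print(url)

# Anyone holding the URL can recover the original token text.
received = url.split("token=", 1)[1]
padded = received + "=" * (-len(received) % 4)
print(base64.urlsafe_b64decode(padded).decode("ascii"))   # user1|expiry=2025-12-31
```

Because the round trip needs no key, a token like this only carries data; anything that must not be read or forged needs encryption or a signature on top of the encoding.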
The new BASE64_ENCODE and BASE64_DECODE functions are small additions that solve a very practical problem. They make it easy to convert binary values into portable text and back again, directly in T-SQL.
Use standard Base64 when you need conventional encoded output, and use URL-safe Base64 when the encoded value will appear in a URL, token, route segment, or query string. Just remember that Base64 is not encryption, and that encoding increases the size of the data.
Welcome to F# Weekly,
A roundup of F# content from this past week:
Microsoft News
Videos
Blogs
New Releases
That’s all for now. Have a great week.
