Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Your AI Agent Already Forgot Half of What You Told It


This is the seventh article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, part four here, part five here, and part six here.

This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of a turn I wasn’t expecting.

In my last article I talked about context and context management and I promised to give you some real practical tips for managing it. This article was originally meant to be about specific, practical context management techniques that were really helpful to me while building Octobatch and the Quality Playbook, two open source projects where I work with AIs to plan and orchestrate all of the work and every line of code is written by AI tools like Claude Code and Cursor.

But as I was writing this, I found that I’d adapted those same techniques to my work writing articles like this one. Which is surprising! I’ve been doing all this work finding ways to help people developing AI skills improve context management, so their skills run more efficiently. It turns out that those same exact techniques apply to anyone using AI tools, even when you’re using chatbots like Claude.ai or ChatGPT.

Full disclosure: I use multiple AI tools to manage this article series. My primary tools are Claude Cowork for brainstorming and managing my article research, notes, and backlog and Gemini’s mobile app for reading drafts aloud and taking my notes while I’m away from my desk. And I want to tell you about something that happened while I was using those tools, because I think it really helps show why context management isn’t just a problem for developers.

While I was writing this article, I was using Gemini’s mobile app to read the draft aloud and take my notes. Partway through the session I asked it to go back and check whether there were earlier notes it hadn’t incorporated yet. It told me it didn’t have access to the previous notes, which seemed weird and insane, since we had just taken those notes a few prompts earlier in the session. I could scroll back up and see them earlier in the conversation, but somehow it didn’t “know” about them.

Here’s what happened. Gemini had compacted our conversation without telling me, and the notes from the first half of the session were just… gone.

If you’ve ever had a web chat AI just seem to forget things you talked about earlier, you’ve experienced context compaction, just like I did. Understanding even the basics of context and context windows can make a big difference in preventing that kind of frustration.

This all reminded me of something I wrote more than two decades ago in Applied Software Project Management (back in 2005!): “Important information is discovered during the discussion that the team will need to refer back to during the development process, and if that information is not written down, the team will have to have the discussion all over again.”

Jenny Greene and I wrote that about human teams and project meetings, but it applies to AI sessions just as well.

Which brings me back to context, which I wrote about in my last article, and which I’ll write more about in the next one, because it’s one of the most important concepts to keep top of mind when working with AI.

Context loss may be invisible, but that doesn’t make it any less frustrating

Context is everything the AI is holding in its working memory during a conversation: what you’ve told it, what it’s told you, any files or instructions it’s read, and whatever internal notes the system has made along the way. All of that lives in a fixed-size context window—think of that as your AI’s short-term memory, the stuff it’s thinking about right now—and when the window fills up, the AI has to start letting things go. Different tools handle this differently: Some truncate older messages, some compress the conversation into a summary (which means details get lost even though the summary looks complete), and some just start behaving inconsistently so you can’t tell whether the AI forgot something or never understood it in the first place. The result is the same: The AI loses track of things you told it, decisions you made together, or details it noticed earlier in the session. And it won’t tell you it forgot. It’ll just keep generating confident-sounding output based on whatever it still has.
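To make the truncation case concrete, here is a toy sketch in Python. It is not how any particular tool manages its context: the ContextWindow class, the word-count stand-in for a tokenizer, and the 12-token budget are all assumptions made up for illustration.

```python
from collections import deque


def rough_token_count(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


class ContextWindow:
    """Toy fixed-size window that silently drops the oldest messages when full."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.dropped = []  # kept only so the demo can show what was lost

    def add(self, message):
        self.messages.append(message)
        # Evict from the oldest end until the window fits its budget again.
        while self._total() > self.max_tokens and len(self.messages) > 1:
            self.dropped.append(self.messages.popleft())

    def _total(self):
        return sum(rough_token_count(m) for m in self.messages)


window = ContextWindow(max_tokens=12)
window.add("note one: the deadline is Friday")
window.add("note two: use the blue template")
window.add("please summarize all of my notes")

print(list(window.messages))  # what the model can still "see"
print(window.dropped)         # what it can no longer see
```

After the third message, the first note has fallen out of the window, and nothing in the remaining messages says so. A real tool also would not keep a dropped list to consult, which is why the loss is invisible from inside the conversation.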

Before we dive in a little deeper, I want to do a quick jargon check. If you’ve seen the terms “skills” and “agents” floating around but aren’t sure what they are, think of skills as libraries for AIs and agents as interactive executables. Those aren’t perfectly precise definitions, but if you’re a developer they’re close enough for this discussion.

When you’re coding skills and agents, you run into context problems quickly. The work you’re asking the AI to do is often complex enough that the context window fills up, and the AI has to start compacting: compressing or dropping older parts of the conversation to make room for new ones. Compaction always seems to happen at the most frustrating and inconvenient time, which makes sense when you think about it. You hit context limits precisely when you’ve put the most information into the conversation, which is exactly when losing that information costs you the most.

That’s why I think it can often help to think of AIs as having the same shortcomings that human teams do, except those shortcomings are exaggerated by their AI nature. A person who forgets something from a meeting last week might remember it when you remind them. An AI that lost something to context compaction won’t, because the information is gone. But there’s something you can do about it, and it turns out the techniques that help are the same whether you’re building autonomous AI skills or just trying to get a chatbot to remember what you told it 20 minutes ago.

I’ve landed on four techniques that I come back to over and over again. Each one exists because at some point the AI forgot something important and I responded by putting that thing in a file where it couldn’t be forgotten. None of them require special tooling. And to my surprise, all of these techniques have turned out to be useful for both building software and managing a writing project like this one, whether I’m chatting with Claude, ChatGPT, or Gemini, or using a desktop tool like Claude Cowork or Codex. These are the techniques I find most valuable:

  • Split discovery from documentation: Don’t ask the AI to figure something out and produce polished output in the same pass.
  • Use handoff documents, not continuation prompts: Before closing a stale session, have the AI write down everything the next session needs to know.
  • Give the AI an acceptance criterion, not a procedure: Tell it what “done” looks like instead of spelling out the steps.
  • Use spec documents as the bridge between AI tools: Make a shared document the single source of truth that all your tools read from.

Split discovery from documentation

When you ask an AI to do something complex, you’re often asking it to do two things at once without realizing it. You’re asking it to figure something out and produce polished output at the same time. The problem is that figuring things out takes attention, and producing output takes attention, and the model only has so much of it. When you combine both tasks in the same prompt, the model starts cutting corners on one of them, and you can’t tell which one it shortchanged.

I ran into this with the Quality Playbook, an open source AI coding skill I built that runs structured code reviews against any codebase. One of the things it does is derive requirements from source code: It reads through the code, identifies what the code promises to do (I call these behavioral contracts), and then produces a requirements document. Originally this all happened in a single pass. The problem was that single-pass requirement generation ran out of attention after about 70 requirements. The model forgot behavioral contracts it had noticed earlier in the code, and the forgetting was completely invisible. There was no stack trace or error message, just incomplete output and no way to know what was missing. I fixed it by splitting the work into two separate prompts:

Read each source file and write down every behavioral contract you observe as a simple list in CONTRACTS.md.

Read CONTRACTS.md and the documentation, then derive requirements from them and write REQUIREMENTS.md.

Then a third pass checks whether every contract has a corresponding requirement, and if there are gaps, goes back to step one for the files with gaps.

The key idea is that CONTRACTS.md is external memory. When the model “forgets” about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap. You can see what was forgotten and fix it.
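As a rough illustration of what "greppable" can mean here, the sketch below checks coverage mechanically. It assumes a convention the article does not specify: every contract in CONTRACTS.md carries an ID like C-001, and each requirement in REQUIREMENTS.md cites the contract IDs it covers. The ID scheme and the function name are hypothetical, not the Quality Playbook's actual format.

```python
import re
from pathlib import Path

# Hypothetical ID scheme: contracts are tagged C-001, C-002, ...
CONTRACT_ID = re.compile(r"\bC-\d+\b")


def uncovered_contracts(contracts_text, requirements_text):
    """Return contract IDs that appear in the contracts file but are never cited in the requirements."""
    declared = set(CONTRACT_ID.findall(contracts_text))
    cited = set(CONTRACT_ID.findall(requirements_text))
    return sorted(declared - cited)


if __name__ == "__main__":
    contracts = Path("CONTRACTS.md")
    requirements = Path("REQUIREMENTS.md")
    # Only run against real files if both exist in the current directory.
    if contracts.exists() and requirements.exists():
        gaps = uncovered_contracts(contracts.read_text(), requirements.read_text())
        for contract_id in gaps:
            print(f"no requirement covers {contract_id}")
```

The point of the design is that the check never depends on what the model remembers. It only compares two files, so a forgotten contract shows up as a missing ID instead of as silence.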

The principle: Don’t ask the AI to figure out what exists and write formatted output in the same pass. The model runs out of attention trying to do both at once. Whenever you’re asking an AI to do something complex, consider whether you’re actually asking it to do two things at once. “Analyze this codebase and write a report” is two tasks. “Read this document and suggest improvements” is two tasks. Split them, and let the first pass write its observations to a file before the second pass starts working with them.

Use handoff documents, not continuation prompts

Anyone who’s spent a long session with an AI coding tool has felt the moment when the context starts to go stale. The AI stops tracking details it was handling fine an hour ago, or it contradicts something it said earlier. The session gets slow, and you’re often restarting because the AI seems to have gotten bogged down and filled up on what you told it. You get the sense that if you keep going, you’re going to spend more time correcting it than making progress.

Most developers respond to their session getting too long in one of two ways: They push through the problem, or they start a fresh session and try to reexplain everything from scratch. Both of those approaches can cause the AI to lose context. The first loses it to compaction; the second loses it to incomplete reexplanation. And both are frustrating, specifically because you just spent so much time building up all that context with the AI.

There’s a third option. Before you close the session, ask the AI to write a handoff document: a file that captures everything the next session needs to know, written while the current session still has full context. The key is that you’re asking the AI to write this while the relevant details are still fresh in the working context, and in a way that it or another AI can read.

I built this into the Quality Playbook as a core part of how phases communicate. When I split the playbook from a single prompt to independent phases, I needed each phase to run as a completely independent session with no context carryover. So each phase got its own kickoff prompt as a standalone file. Here’s the structure each one follows:

Write a handoff document that a fresh session could use to pick up this work cold. Include everything it would need to know.

Every kickoff opens with what prior phases accomplished, includes explicit boundaries about what’s frozen, and names which future phase owns each piece of remaining work, because without those boundaries the AI will helpfully start doing Phase 3 work while you’re still in Phase 2. Each phase also ends with a required forward-looking handoff where the completing agent writes down what the next session needs to know.

The principle: Each handoff is a complete state snapshot. The incoming AI agent never needs to read prior kickoff prompts or chat history. Everything it needs is in the current handoff file: current state, uncommitted changes, immediate next task, pending tasks, file locations, and anything that was discovered during the prior session. A fresh AI session can pick it up cold.
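One way to picture the "complete state snapshot" is as a small template with a slot for each item in that list. The sketch below is an assumption-laden illustration, not the Quality Playbook's format: the Handoff class, its field names, and the Markdown layout are all hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Handoff:
    """Hypothetical template mirroring the snapshot items listed above."""
    current_state: str = ""
    uncommitted_changes: str = ""
    immediate_next_task: str = ""
    pending_tasks: list = field(default_factory=list)
    file_locations: list = field(default_factory=list)
    discoveries: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections that are still empty, so gaps are visible before the session closes."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def to_markdown(self):
        """Render the snapshot as a Markdown document a fresh session can read cold."""
        lines = ["# Handoff"]
        for f in fields(self):
            value = getattr(self, f.name)
            lines.append(f"\n## {f.name.replace('_', ' ').capitalize()}")
            if isinstance(value, list):
                lines.extend(f"- {item}" for item in value)
            else:
                lines.append(value)
        return "\n".join(lines)


handoff = Handoff(current_state="Phase 2 complete", immediate_next_task="Start Phase 3")
print(handoff.missing_sections())
print(handoff.to_markdown())
```

In practice the AI writes the handoff in prose, not through a class like this. The sketch is only meant to show that an empty section is something you can check for before the context that would fill it is gone.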

If you’re deep into a Claude Code or Copilot session and you can feel the context getting stale, ask the AI to write a handoff document before you close the session. Tell it to include everything a fresh session would need to continue the work. Then start a new session and point it at that file. A fresh session with a good handoff document will usually outperform a stale session, because it’s starting with clean context instead of compacted, fragmented context.

Give the AI an acceptance criterion, not a procedure

When you give an AI a multistep task, the natural instinct is to spell out the steps. First do this, then do that, then combine the results. The problem is that step-by-step procedures are the first thing the AI forgets when the context window fills up. It’ll skip steps, merge phases, or quietly drop tasks, and there’s nothing in the procedure itself that would help the AI notice what it missed. The procedure tells the AI what to do, but it doesn’t tell the AI what “done” looks like.

I learned this the hard way with the Quality Playbook. The playbook runs multiple iteration passes over a codebase, and the results need to be cumulative. It keeps a list of all the bugs it finds in the code being tested in a file called BUGS.md. Early on, I gave the AI a procedure to run four times and then update that file:

First run the main pass, then run four iteration passes, then merge the findings into BUGS.md.

The AI did not respond well to that instruction.

It turns out that when you ask an AI to do a very complex task a specific number of times, it can lose count. In fact, from my experimentation, it seems that count is one of the first casualties of context compaction. Most of the time the AI decided three iterations was enough, or merged findings from only two passes, and no matter how many different ways I tried to rephrase that instruction, there was nothing I could come up with that prevented the problem.

However, everything changed when I replaced the “run four times” instruction with an acceptance criterion, or a specific condition that tells the AI when to stop looping:

You are done only when BUGS.md contains the cumulative findings from the main run plus all four iteration passes.

Even when the AI lost track of intermediate steps, it could check the output against the criterion and know whether it was finished. And I could verify the output against the same criterion, which gave me a way to audit the agent’s work without watching every step.

In developer terms, the AI is really bad at loops like for (i = 0; i < 4; i++) because it loses track of the value of the iterator i when it compacts its context. But it’s really good at loops like while (!done) because it can check done based on the current state without relying on history.

The principle behind all this is that an acceptance criterion survives context pressure because the AI can always check “Am I done?” against a concrete test. This is actually the same principle behind test-driven development: write the test before the code so you know when you’re done. The acceptance criterion is the test for your AI session. When you’re giving an AI a task that has multiple steps, don’t describe the steps. Describe what “done” looks like, and let the AI figure out how to get there.
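Here is a minimal Python sketch of the difference between counting and checking. It assumes, purely for illustration, that BUGS.md has one "## " heading per pass ("Main run", "Iteration 1" through "Iteration 4"); those heading names and the run_pass stub are hypothetical, not the playbook's real layout.

```python
import re

# Hypothetical layout: one "## " heading per pass in BUGS.md.
REQUIRED_SECTIONS = ["Main run"] + [f"Iteration {n}" for n in range(1, 5)]


def missing_sections(bugs_md):
    """List the required sections that BUGS.md does not contain yet."""
    headings = {h.strip() for h in re.findall(r"^##[ \t]+(.+)$", bugs_md, flags=re.MULTILINE)}
    return [s for s in REQUIRED_SECTIONS if s not in headings]


def is_done(bugs_md):
    """Acceptance criterion: every required section is present."""
    return not missing_sections(bugs_md)


def run_pass(name):
    """Stub standing in for real review work: returns the section a pass would append."""
    return f"## {name}\n- (findings from {name.lower()})\n\n"


# Pretend the session was compacted after the first iteration and lost count.
bugs_md = "## Main run\n- BUG-1\n\n## Iteration 1\n- BUG-2\n\n"

# No counter to lose: the loop asks the file what is still missing.
while not is_done(bugs_md):
    bugs_md += run_pass(missing_sections(bugs_md)[0])

print(missing_sections(bugs_md))
```

The loop has no iterator to forget. Each time through, it derives the remaining work from the current contents of the file, which is the same check a human can run to audit the result.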

Use spec documents as the bridge between AI tools

Most developers working with AI don’t use just one tool. You might use Claude for design, Cursor for coding, and Copilot for quick edits. You might even use multiple models inside the same tool, like GPT-5.5 and Opus 4.7 in separate Copilot chats inside VS Code. It’s common to have one model for coding, another for review, and a third for orchestration and project management. The problem is that none of these tools or chats know what you told the others. Claude doesn’t know what you decided with Cursor. Two separate Copilot chats in the same editor don’t share context. You’re the one carrying context between them, and that’s exactly the kind of lossy handoff that causes drift. A design decision you made in one conversation gets lost or distorted by the time it reaches the tool that needs to implement it.

The fix is to make the spec document the single source of truth that all your AI tools read from. I used this when building a game prototype, where I had Claude handling design and planning and Cursor doing the coding. They never talked to each other directly, so the spec documents served as the shared contract: Claude wrote the specs, and Cursor read them. The rule I followed was simple:

Never tell the AI coder something that isn’t already in the specs. If you make a design decision in conversation, write it into the spec first, then point the coder at the spec.

If I made a design decision in a conversation with Claude, that decision had to be written into the spec before I told Cursor about it. If I discovered something during implementation, I wrote it into the appropriate doc first, then pointed the coder at it. The spec was always the single source of truth. When Claude and I changed the wound topology (removing one wound type, promoting another), we updated the docs first, then told Cursor to reread them. When we decided to add a new UI element, we wrote it into the UI spec first, then told Cursor to reread the doc.

The key was including rationale in the specs. Not just “show 5 progressive labels” but why: “The player shouldn’t be told what they’re fighting. They should discover it.” This helps the AI coder make better decisions when the spec doesn’t cover an edge case because it knows the intent behind the requirement.

The principle: The spec document is the shared context that all your tools can read. It prevents the drift that happens when design intent lives only in chat history that the other tool can’t see. This technique works any time you’re using more than one AI tool on the same project, which at this point is most projects.

How these techniques combine: Managing this article series

Those four practices came out of AI-driven development work, but they apply to almost any AI work. And while these techniques emerged for me while working on agents and skills, I think it’s valuable to demonstrate them in a nondevelopment context, so I’ll share an example from my work on the article series you’re reading now.

Over time, the process for how my AI assistant and I manage this article backlog evolved organically in conversation, but it was never written down anywhere except in the AI’s context window. Which means every time the session compacted or I started a fresh chat, the process was gone and I had to reexplain it. I caught this when the AI did something slightly wrong and I wanted to confirm we were on the same page. So I asked:

Every time I suggest a new article idea, you add an entry to the backlog, and then create a new markdown file with the source material, right?

That’s split discovery from documentation. I didn’t say “document our process.” I said “confirm what we do.” Discovery first, then documentation as a separate step. If I’d said “write up our process” without confirming first, the AI might have written something plausible but wrong, and I wouldn’t have caught the discrepancy.

Once we’d confirmed the process, I asked the AI to create two files. AGENTS.md is an emerging standard for AI-readable project context—a single file that tells any AI session what it needs to know about a project. You can learn more about the convention at agents.md. CONTEXT.md serves a similar role as a bootstrapping document—it’s less established as a standard, but the practice of asking the AI to dump everything it knows into a context file so the next session can pick it up cold has been one of the most valuable habits I’ve developed. Here’s the prompt I used:

Update the backlog file to explain what it is and how we maintain it. Create a CONTEXT.md with everything you’d need to bootstrap a new chat. Create an AGENTS.md to make it easy to bootstrap with a single-line prompt.

That prompt is a handoff document. I was explicitly asking the AI to write down everything it knew while it still had full context, specifically because I knew that context would be lost to compaction. The CONTEXT.md file is a handoff from this session to whatever fresh session picks up the work next week.

Notice what I didn’t say. I didn’t give step-by-step instructions for what should go in those files. I said “everything you would need to bootstrap this process again in case we lost it” and “a complete dump of all of the context you would need to bootstrap a new chat and get it to the point where this current chat is.” Those are acceptance criteria, not procedures. The AI had to figure out what belonged in those files. If I’d given it a procedure (“first write the publication history, then the voice rules, then the file locations”), it would have followed the list and missed anything I forgot to include. The acceptance criterion is harder to satisfy but more robust: the test is “Could a fresh session bootstrap from these files alone?”

And the AGENTS.md file itself is a spec document as a bridge between tools. It’s the shared contract that any AI session, whether it’s Claude, Gemini, Cowork, or a fresh chat, can read to get aligned with the project. This session wrote it; the next session reads it. The two sessions never communicate directly, so the spec file bridges the gap between them.

That’s all four practices in two prompts, applied to something as ordinary as managing a writing project. It didn’t require pipelines or codebases or batch orchestration. The practices work because they solve the same underlying problem regardless of the domain: important information living in the AI’s context window instead of on disk.

Context management is a development skill

Every practice I’ve described in this article and the last one is something developers have always been told to do: write things down, record your rationale, be deliberate about what you save and what you let go, write ADRs and design docs and inline comments explaining nonobvious choices. We’ve always known we should do more of it. When you’re working with AI, the cost of not doing it becomes immediate and visible.

The practices in this article all come down to the same thing: putting the important information in files where compaction can’t touch it, so you can see what the AI knows and verify that it matches reality. In the next article, I’ll go deeper on the debugging angle: how to use externalized files to understand what your AI is actually doing, with practical techniques that work even if you’re not building agents but are just using a chatbot.

The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It’s also available as part of awesome-copilot.


Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.




Deprecating dotMemory Unit


dotMemory Unit has long served as a unit testing framework for detecting memory issues in .NET code. We are grateful to everyone who has used it as part of their development and testing workflows.

After careful consideration, we have decided to retire dotMemory Unit. The project will no longer receive active maintenance, compatibility updates, or security fixes, and we recommend that users plan to discontinue its use.

Why we’re making this change

This decision is based on several technical, security, and product considerations.

dotMemory Unit has not been actively developed for some time and does not support the latest .NET versions. Bringing it up to modern compatibility, reliability, and security standards would require a substantial architectural redesign.

In addition, dotMemory Unit generates workspaces in a legacy format that is incompatible with recent versions of dotMemory. This creates friction for users and prevents seamless integration with the latest JetBrains profiling tools.

Finally, some of the project’s dependencies are outdated and include known security vulnerabilities. Because dotMemory Unit is no longer actively maintained, we cannot reliably update these dependencies without risking compatibility issues or undertaking a full rebuild. Continuing to distribute or support tooling with unpatched vulnerabilities would not meet our security standards.

What this means for you

If you currently use dotMemory Unit, we strongly recommend that you stop using it, especially in security-sensitive environments.

We understand that dotMemory Unit has been valuable for in-test memory profiling, and we recognize that its deprecation may create a gap in some workflows. At this time, we do not have a direct replacement.

Timeline

[20.05.2026]: dotMemory Unit marked as deprecated on NuGet.org. No further updates, patches, or security fixes will be released. 

[28.05.2026]: Official deprecation notice posted across different channels. Documentation will remain online for some time, but will be updated to reflect the current state of the project and potential risks.

Thank you

We are sincerely grateful to everyone who has trusted dotMemory Unit over the years.


Can AI support creativity? What educators can learn from creative machine learning


Can AI support creativity? The technology is often framed as threatening creative work either by automating it or by encouraging imitation. But Professor Rebecca Fiebrink’s work in creative machine learning suggests a more useful way to think about this relationship. In our March research seminar, she showed how machine learning can help people work with meaningful data, communicate ideas through examples, and build new kinds of creative projects.

Rebecca Fiebrink is Professor of Creative Computing at the Creative Computing Institute, University of the Arts London.

Our current seminar series focuses on teaching applied AI and how educators of subjects beyond computing can make AI and machine learning relevant in their classroom. We were delighted to have Rebecca join us to share insights about the place of machine learning in artistic creation. In her talk, Rebecca explored three connected questions:

  • How machine learning can be valuable to musicians, artists, and other creators
  • What machine learning tools for creators should look like
  • What creators need to know about machine learning in order to use it effectively

Using movement, sound, and image data to teach about machine learning

One of the seminar’s key ideas was that machine learning can help creators work with forms of data that already matter to them. Rebecca showed that useful data can come from many sources, including microphones, webcams, phones, wearables, sensors, and body movement. She argued that collecting data is often relatively easy, while interpreting and using it is much harder. 

This suggests a different starting point for AI education. Instead of beginning with a large dataset prepared by somebody else, learners can start with data that is meaningful in their own context. For instance, data about hand gestures can be linked to different musical rhythms, colours, or game actions.

Visual examples of how hand gestures can be associated with rhythm, video game actions, or visuals using machine learning.
From hand gestures to rhythms and game actions. Images from the speaker’s presentation.

What counts as input?

The seminar also points to a broader shift in how we think about input if we consider creative work. Traditional computing often treats input as something abstract and controlled: a click, a typed command, or a button press. But many creative practices do not work like that. They depend on timing, gesture, rhythm, touch, sound, and movement.

Instead of asking learners to translate everything into words or code first, Fiebrink suggested that educators can use machine learning to allow learners to begin with movement, demonstration, or sound. This is especially relevant in art forms shaped by flow and physical expression, such as music, dance, performance, and interactive media.

Educators can use machine learning to allow learners to begin with movement, demonstration, or sound [instead of with code].

That creates interesting possibilities for teaching. AI does not have to be explored only through screens, prompts, and abstract models. It can also be approached through embodied activities, where learners use gestures, performance, and experimentation to see how an AI system responds. This can make machine learning feel more connected to forms of making that young people already understand.

Teaching machine learning through examples

A second important theme in the seminar was that machine learning allows people to instruct computers through data and examples. Rebecca suggested that this can be especially valuable in creative and embodied work, where what a person wants to express may be difficult to describe in words, maths, or code alone.

Contrasting pictures of painting and violin playing compared to a snapshot of code.
The seminar suggested that data and examples can communicate creative intent in ways that code or language cannot always capture.

One of the strongest examples in the seminar was ‘Wekinator’, a tool Rebecca has been developing since 2008. She described the tool’s approach as ‘interactive machine learning’: users demonstrate training examples, train a model, test it in real time, then modify their examples and repeat the process.

This is a useful example for the classroom because it shows that training a machine learning model is not a single event, after which the model is trained and finished. Instead it is an iterative process. With Wekinator, learners can try something out, observe the result, and improve the system by changing the examples they provide. That makes ideas such as testing, evaluation, and bias much easier to discuss.

Supporting creativity and learner agency

Rebecca also argued that machine learning can help more people become creators. She contrasted large, one-size-fits-all systems that encourage users to imitate existing styles with smaller, more personal systems that can be trained on new data for specific purposes. She captured this contrast clearly, from prompts such as ‘Write music like Bach!’ to examples of personalised tools and interfaces.

Examples from the seminar showing how large models can make it easier for novices to conform to familiar creative styles like those of Bach or Monet.

This is an important distinction in teaching and learning. If learners only use AI tools to reproduce familiar outputs, then creative work can become narrow and formulaic. But if they can build or train systems around their own interests, intentions, and materials, then machine learning can support experimentation and authorship.

If [learners] can build or train systems around their own interests, intentions, and materials, then machine learning can support experimentation and authorship.

Teaching AI without turning it into a black box

In the final part of the seminar, Rebecca moved from examples to teaching principles. One of the clearest was that machine learning should be taught at a high level with minimal maths, but not as a black box.

Learners do not need advanced mathematics to start exploring machine learning meaningfully, but they do need to understand that:

  • Machine learning models are built from data
  • Models make predictions based on patterns
  • People can inspect, test, and improve models

Rebecca also argued that small data and interactive machine learning can be highly effective. She highlighted quick experimentation, creative usefulness, and the opportunity to build intuition about ideas such as outliers, features, regularisation, and bias in data. Small-scale activities can make technical ideas more visible and manageable for learners.

Small-data, interactive machine learning can support experimentation and build understanding of how models work.

Why this matters for teaching

Rebecca ended on an inspiring note: she argued that learning and teaching creative machine learning is both worth doing and possible. She pointed to a growing set of tools that support experimentation and original creative work without much maths or coding, including Wekinator, Teachable Machine, Micro:bit CreateAI, and more.

The seminar also addressed some important limitations. Rebecca warned that commercial tools are not always good at supporting learning or genuine creative work. She also discussed the difficulty of making generative AI tools safe for children, noting the need for built-in filters, moderation, prompt design, and extensive testing. Therefore, what’s important is to think about what learners are actually learning, and to make space for experimentation without losing sight of safety and critical thinking.

Join our next seminar

Our research seminars bring together educators and researchers to explore key questions in computing education.

Next in our series on applied AI, Prof. Gianfranco Polizzi (University of Birmingham, UK) will talk about media literacy in the age of AI. Sign up now to join the seminar on 16 June, 17:00 BST:

The post Can AI support creativity? What educators can learn from creative machine learning appeared first on Raspberry Pi Foundation.


Model Router (with Claude Opus 4.7) and MAI-Image-2e in Microsoft Foundry


The Microsoft Foundry model catalog keeps growing — and I don’t believe I am the only one who thinks picking the right model for each task is becoming a challenge for anyone building agents. Is GPT-5.4 overkill for this prompt? Should I be using a Claude model for this reasoning step? Do I really need to wire up four different deployments just to keep my agent fast, smart, and manageable?

This is where the Model Router model comes in, and it can simplify how agents use models in Microsoft Foundry. As a nice addition, it now routes to Claude models too, including the newest Claude Opus 4.7. In the same wave, Microsoft Foundry also brought in MAI-Image-2e, a faster and more efficient sibling of MAI-Image-2 for text-to-image generation.

Let me walk you through both.

What is the Model Router model?

Model Router is a model in Microsoft Foundry that intelligently routes your prompts in real time to the most suitable large language model behind the scenes. You deploy it once — like any other Foundry model — and from then on your agent (or app) talks to a single deployment, while Model Router decides per request whether the prompt should be handled by a small, fast model or by a top-tier reasoning powerhouse.

The current version is 2025-11-18 (latest) and it is a living version — new models and capabilities are added in place, no version-bump migration needed.

The selection happens based on prompt complexity, reasoning needs, task type, and other attributes — and it does not store your prompts. It honors your deployment data zone boundaries.

Why use Model Router in agents?

For agents, this is a meaningful shift. An agent typically does many small steps — some trivial, some reasoning-heavy — and using the same expensive model for every single one is wasteful. With Model Router, the agent calls one single model deployment, and Model Router does the dispatching:

  • A simple classification step? Routed to a small model — cheap and fast.
  • A complex multi-step reasoning task? Routed to a top reasoning model — accurate.
  • A task where Claude is genuinely the best tool? Routed to Claude.

That makes Model Router an incredibly versatile workhorse — one model your agent calls, many models doing the actual work underneath.

You can also pick a routing mode to bias the decision:

  • Balanced (default) — considers both cost and quality dynamically. Great general-purpose default.
  • Quality — prioritizes accuracy. Best for complex reasoning and critical outputs.
  • Cost — prioritizes cost savings. Ideal for high-volume, budget-sensitive workloads.

And with Model subset, you decide exactly which underlying models are eligible for routing — useful for compliance, cost control, or for guaranteeing a minimum context window across the set. Built-in automatic failover is the icing on the cake — if one model has a transient issue, your request quietly falls over to the next best one.

Claude models in Model Router — including Opus 4.7 (preview)

The latest Model Router version adds a fresh set of Anthropic models alongside the OpenAI, Meta, xAI and DeepSeek line-up:

  • claude-haiku-4-5
  • claude-sonnet-4-5
  • claude-opus-4-1
  • claude-opus-4-6
  • claude-opus-4-7 — Anthropic’s most capable model

One important nuance — and this is the one most people miss on day one: Model Router support for the Claude family is currently in preview, and Claude models must be deployed separately from the Microsoft Foundry model catalog before Model Router can route to them. The OpenAI models in the routing set are run “from inside” Model Router and do not need a separate deployment. Claude is the exception — deploy the Claude variants you want first, then enable them in your Model Router subset, and the magic kicks in.

Worth noting that Claude is not the only one in preview. The 2025-11-18 routing set also marks DeepSeek-V3.1, DeepSeek-V3.2, gpt-oss-120b, Llama-4-Maverick-17B-128E-Instruct-FP8, grok-4, and grok-4-fast-reasoning as Model Router preview entries. The OpenAI GPT-4.x / GPT-5.x family is the GA core today — the rest is a rapidly growing preview frontier.

This is exactly the kind of setup I want for an enterprise agent — let Model Router pick between OpenAI and Claude per prompt, in Balanced or Quality mode, and let me stop arguing with myself about which model to hard-code. Just plan for it as a preview today and validate carefully before you push it into production.

Limits — still important

Here is the catch I always remind customers about: the effective context window of Model Router is the limit of the smallest underlying model. That means an API call with a very large context will only succeed if the prompt happens to be routed to a model that can handle it. There are two ways to reduce that risk:

  • Use Model subset to restrict routing to models that all support the context window you need.
  • Shrink the prompt — summarize it, truncate to the relevant parts, or use document embeddings to retrieve only what matters.
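The first option amounts to taking a minimum over the subset. A minimal sketch of that reasoning follows; the model names and context-window sizes are hypothetical placeholders, so check the Foundry model catalog for real values.

```python
# Hypothetical context-window sizes in tokens, for illustration only.
CONTEXT_WINDOWS = {
    "small-model": 128_000,
    "reasoning-model": 200_000,
    "large-context-model": 1_000_000,
}

def effective_context_window(subset: list[str]) -> int:
    """The router can only guarantee the smallest window in the subset."""
    return min(CONTEXT_WINDOWS[name] for name in subset)

def safe_subset(required_tokens: int) -> list[str]:
    """Keep only the models whose window covers the required prompt size."""
    return sorted(
        name for name, window in CONTEXT_WINDOWS.items()
        if window >= required_tokens
    )

print(effective_context_window(list(CONTEXT_WINDOWS)))  # 128000
print(safe_subset(150_000))  # ['large-context-model', 'reasoning-model']
```

Restricting the subset this way raises the guaranteed window from the smallest model's limit to the smallest limit among the remaining models.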

Region-wise, Model Router is currently (when writing this article) available in East US 2 and Sweden Central, on Global Standard and Data Zone Standard deployments.

Vision inputs are accepted (all underlying models accept image input), but the routing decision itself is based on the text only. Audio input is not supported.

When would I use Model Router?

The biggest reason for me is simple: I want to give my agents fewer model endpoints, ideally just the Model Router one. I configure one Model Router deployment, point my agents at it, and from that moment on Model Router can use whichever model at its disposal best fits the prompt: small and fast for trivial steps, top-tier reasoning for the hard ones, and even Claude models for cases where Anthropic is the right tool (as long as I have deployed Claude models separately to my Foundry first).

That single-endpoint pattern simplifies agent building. My agent code does not need to know which model is best for which step. It does not need a giant switch statement of “if reasoning task → call model X, else call model Y.” It just calls Model Router — and Model Router does the dispatching across everything I have made available to it.
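A minimal sketch of what that looks like in agent code, assuming an OpenAI-compatible chat completions client. The deployment name `model-router` is hypothetical, and reading the serving model from the response's `model` field is an assumption to verify against the Model Router documentation. A stub client stands in for a real one so the sketch runs offline.

```python
from types import SimpleNamespace

def ask_router(client, deployment: str, prompt: str) -> tuple[str, str]:
    """Send one prompt to a single deployment; return (answer, serving model)."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumption: the response's model field names the model that handled the call.
    return response.choices[0].message.content, response.model

class StubCompletions:
    """Offline stand-in that mimics the shape of a chat completions response."""
    def create(self, model, messages):
        return SimpleNamespace(
            model="example-underlying-model",
            choices=[SimpleNamespace(message=SimpleNamespace(content="stub answer"))],
        )

stub_client = SimpleNamespace(chat=SimpleNamespace(completions=StubCompletions()))

# "model-router" is a hypothetical deployment name.
answer, served_by = ask_router(stub_client, "model-router", "Classify this ticket.")
print(served_by)  # example-underlying-model
```

Logging the serving model per call is a cheap way to see how the router is actually distributing an agent's steps.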

And no, Model Router is not a “silver bullet” that is the answer to everything. There are many cases where you will want to control which model to use, and many others where Model Router will just work.

Model Router also adds:

  • A clean way to optimize for cost or quality without rewriting agent code every time a new model lands.
  • An easy way to fold in brand-new models (like Claude Opus 4.7) — deploy them once, add them to the subset, and the agents pick them up automatically.
  • Built-in failover for resilience.

Meet MAI-Image-2e — Microsoft’s faster image model

Now something almost-completely different — but still about models. The second one I want to highlight is MAI-Image-2e, one of Microsoft’s first-party image generation models in Microsoft Foundry.

MAI-Image-2e is a text-to-image generation model that produces high-quality, visually rich images from natural language prompts. It is built on top of MAI-Image-2 with a clear promise — up to 22% faster and four times more efficient than MAI-Image-2, while keeping the same level of quality. For developers building at scale, that is the smartest choice.

Key capabilities:

  • Text-to-image generation — high-quality images from natural language prompts.
  • Photorealistic image synthesis — realistic imagery with consistent visual structure, well suited for concept visualization and content creation.
  • Product, branding and commercial design — product imagery, marketing visuals, brand assets, and commercial creative workflows.

Specs:

  • Input length: up to 32,000 tokens for the prompt.
  • Output: a single PNG image.
  • Image size: both width and height must be at least 768 pixels. The total pixel count (width × height) must not exceed 1,048,576 — equivalent to 1024×1024. Either dimension can exceed 1024 as long as the total stays within that budget — for example 768×1365 is fine.
  • Regions: Global Standard deployment in West Central US, East US, West US, West Europe, Sweden Central, and South India.
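The image-size rule in the specs is easy to check before sending a request. A small sketch, using only the limits stated above; the function name is hypothetical.

```python
MIN_SIDE = 768           # minimum width and height in pixels
MAX_PIXELS = 1_048_576   # total pixel budget, equal to 1024 x 1024

def is_valid_size(width: int, height: int) -> bool:
    """Check both the per-side minimum and the total pixel budget."""
    if width < MIN_SIDE or height < MIN_SIDE:
        return False
    return width * height <= MAX_PIXELS

print(is_valid_size(1024, 1024))  # True: exactly on the budget
print(is_valid_size(768, 1365))   # True: 1,048,320 pixels
print(is_valid_size(768, 1366))   # False: 1,049,088 pixels
print(is_valid_size(512, 1024))   # False: one side is under 768
```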

You can deploy MAI-Image-2e like any other Microsoft Foundry model — from the Foundry portal or with a one-liner in Azure CLI — and you call it through the MAI image generation API endpoint at https://<resource-name>.services.ai.azure.com/mai/v1/images/generations, using Microsoft Entra ID or an API key. Or just experiment with it using the Playground in Microsoft Foundry.

When would I use MAI-Image-2e?

  • High-volume, fast-turnaround scenarios — product imagery at scale, marketing variations, branded assets, anywhere efficiency and cost per image matter.
  • Creative content generation — concept art, illustrations, and design exploration where the speed boost lets you iterate more in the same time.
  • Photorealistic visuals for marketing and commercial use.

If you need the absolute highest-fidelity output and speed is not the priority, MAI-Image-2 is still in the catalog. But for most workflows I have been building lately, MAI-Image-2e is the better default — faster, cheaper, same quality bar.

A hat tip to those interested in Microsoft Foundry

Microsoft Foundry continues to evolve and gain new features — GPT-5.5 and Claude models in Model Router and MAI-Image-2e for image generation are two good examples. Model Router is the piece that makes that composition practical. MAI-Image-2e is the piece that makes high-volume image workloads sustainable.

Hat on, AI flowing: give Model Router a try, if possible with a Claude Opus 4.7 deployment in your subset, and spin up an MAI-Image-2e deployment next to it.

Stay tuned, there is more Foundry goodness coming. Here is a tip for that: have you registered for Microsoft Build 2026? If you have not, do it now, it is happening next week! https://build.microsoft.com/



Debugging the undebuggable: building observability into probabilistic AI systems

Minimalist vector illustration of a person holding a torch on a winding path in a dark forest, serving as a metaphor for AI observability and debugging LLM systems.

Debugging used to be straightforward: A service failed, you checked the logs, followed the stack trace, and fixed the bug. Unfortunately, with AI systems, especially those powered by LLMs and agent workflows, that approach breaks down quickly.

The problem doesn’t just lie in a more complex system. It’s complicated by the fact that failures are no longer deterministic. The system may return a different answer for the same input. A tool might silently fail. Retrieval might return low-quality or noisy context. Nothing overtly “crashes,” but something is clearly wrong.

This tutorial focuses on a practical question: How do we debug a system that doesn’t fail in obvious ways? 

To tackle this question, we’ll build a small AI service and, more importantly, instrument it so that we can actually understand what’s happening inside it.


Why debugging AI systems feels different

Traditional debugging relies on three assumptions:

  • Inputs lead to predictable outputs
  • Failures throw errors
  • Logs tell the full story

None of these holds for AI systems.

Instead, we deal with:

  • Non-deterministic outputs
  • Hidden reasoning steps
  • External dependencies (retrieval, APIs, tools)
  • Large, dynamic prompts

This means debugging must shift from log-based thinking to observability-driven engineering.



What we’re building

We’ll create a simple AI question-answering service with:

  • Retrieval (vector search + reranking)
  • External tool calls
  • LLM reasoning
  • Structured output validation
  • Observability (tracing + logging + token estimation)

The focus is not just on building it, but on making it debuggable.

Architecture overview: a debuggable AI system

Diagram showing the architecture of a debuggable AI system

This architecture highlights a key shift in modern AI systems: Observability is a core component rather than an afterthought. Each stage of the workflow, from retrieval to tool execution to model reasoning, is instrumented, enabling engineers to trace decision-making. This makes it possible to debug not just failures but also unexpected behaviors, which are far more common in AI systems than in traditional software.


Step 1: install dependencies

bash
pip install fastapi uvicorn \
  langchain langchain-openai langchain-community \
  faiss-cpu rank-bm25 \
  httpx tenacity \
  opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  tiktoken pydantic

We explicitly include OpenTelemetry because debugging AI systems without tracing is like flying blind.


Step 2: initialize the model (with production controls)

Python
import os
from langchain_openai import ChatOpenAI

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY must be set")

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    # JSON mode: the prompt itself must also mention JSON.
    model_kwargs={"response_format": {"type": "json_object"}},
    openai_api_key=api_key,
    request_timeout=30,
    max_retries=2,
)

Timeouts and retries are not optional. When something fails, you need to know if it’s your system or the model provider.


Step 3: add retrieval (and make it observable)

Python
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(page_content="Observability helps debug AI systems."),
    Document(page_content="Retrieval quality impacts model output."),
    Document(page_content="Tracing reveals hidden execution paths.")
]

embeddings = OpenAIEmbeddings()
index = FAISS.from_documents(docs, embeddings)

Now, the retrieval function:

Python
from rank_bm25 import BM25Okapi

def retrieve(query: str):
    results = index.similarity_search(query, k=5)

    # Add lexical reranking
    corpus = [doc.page_content.split() for doc in results]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.split())

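    # Sort on score only; tied scores would otherwise fall back to
    # comparing Document objects and raise TypeError.
    ranked = sorted(zip(scores, results), key=lambda pair: pair[0], reverse=True)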

    return [
        {
            "text": doc.page_content,
            "source": doc.metadata.get("source", "internal")
        }
        for _, doc in ranked[:3]
    ]

Debugging insight

If the retrieval is wrong, everything downstream is wrong as well.

Always log what documents were retrieved.
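One way to do that is a structured log record per retrieval call. The sketch below assumes results shaped like the dictionaries returned by `retrieve` (`text` and `source` keys); the logger name and record fields are illustrative choices, not part of the original service.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("retrieval")

def log_retrieval(query: str, results: list[dict]) -> dict:
    """Emit one JSON log line describing what was retrieved, in rank order."""
    record = {
        "event": "retrieval",
        "query": query,
        "num_results": len(results),
        "documents": [
            # A short preview keeps log lines small while staying searchable.
            {"rank": rank, "source": doc["source"], "preview": doc["text"][:80]}
            for rank, doc in enumerate(results, start=1)
        ],
    }
    logger.info(json.dumps(record))
    return record

log_retrieval(
    "What helps debug AI systems?",
    [{"text": "Observability helps debug AI systems.", "source": "internal"}],
)
```

An empty `documents` list in the log is often the fastest explanation for a confident but wrong answer.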


Step 4: safe tool execution

Python
from urllib.parse import urlparse
import httpx
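from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Restrict outbound requests to trusted domains only.
ALLOWED_DOMAINS = {
    "api.trusted-source.com",
    "documentation.org"
}

@retry(
    # Retry only HTTP/transport errors; an allowlist rejection should fail immediately.
    retry=retry_if_exception_type(httpx.HTTPError),
    stop=stop_after_attempt(2),
    wait=wait_exponential(min=1, max=4),
    # Surface the original exception instead of tenacity's RetryError.
    reraise=True,
)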
async def fetch_external(url: str):

    parsed_url = urlparse(url)

    # Prevent SSRF and internal network probing.
    if parsed_url.netloc not in ALLOWED_DOMAINS:
        raise ValueError(
            f"URL domain '{parsed_url.netloc}' is not allowed."
        )

    async with httpx.AsyncClient(
        timeout=10,
        follow_redirects=False
    ) as client:

        response = await client.get(url)
        response.raise_for_status()

    # Truncate response to control token usage.
    return response.text[:3000]

Production AI systems should never allow unrestricted outbound requests from model-generated inputs. Without domain allowlists, agents can become SSRF vectors that can probe internal services, cloud metadata endpoints, or private infrastructure. Restricting outbound access to trusted domains is a minimal production safeguard.


Debugging insight

Tool failures are silent killers. Without retries and logging, you won’t know if:

  • The tool failed
  • The tool returned empty data
  • The model ignored the tool
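The first two cases can be told apart by wrapping every tool call in an explicit result type. This is a sketch under assumptions: `ToolResult` and `call_tool` are hypothetical names, and the fake fetchers stand in for a real function such as `fetch_external`. Detecting that the model ignored a tool needs a comparison against the final answer and is not covered here.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ToolResult:
    status: str      # "ok", "empty", or "failed"
    text: str = ""
    error: str = ""

async def call_tool(fetch, url: str) -> ToolResult:
    """Run a tool and make its outcome explicit instead of silent."""
    try:
        text = await fetch(url)
    except Exception as exc:
        return ToolResult(status="failed", error=f"{type(exc).__name__}: {exc}")
    if not text.strip():
        return ToolResult(status="empty")
    return ToolResult(status="ok", text=text)

# Fake fetchers used only to exercise the three outcomes offline.
async def fake_ok(url: str) -> str:
    return "payload"

async def fake_empty(url: str) -> str:
    return "   "

async def fake_fail(url: str) -> str:
    raise ValueError("domain not allowed")

for fetcher in (fake_ok, fake_empty, fake_fail):
    print(asyncio.run(call_tool(fetcher, "https://example.invalid")).status)
```

Recording the status on every call gives the trace something to show when a tool quietly returns nothing.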

Step 5: token visibility (not exact, but useful)

Python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o-mini")

def estimate_tokens(messages):
    """
    Approximate token usage for OpenAI-style chat payloads.

    Note:
    This is still an estimate. Real usage depends on:
    - system prompts
    - retrieved context
    - tool call arguments
    - provider-specific formatting
    - output tokens
    """

    # Approximate overhead used by OpenAI chat formatting.
    tokens_per_message = 3
    tokens_per_name = 1

    total = 0

    for message in messages:

        total += tokens_per_message

        for key, value in message.items():

            if isinstance(value, str):
                total += len(encoder.encode(value))

            if key == "name":
                total += tokens_per_name

    # Assistant reply priming tokens.
    total += 3

    return total

Token counting should be treated as an operational estimate, not an exact billing mechanism. Real request cost depends on the full message payloads, retrieved context, tool-call arguments, system prompts, provider-side formatting, and generated output tokens. Even approximate tracking, however, is extremely useful for debugging runaway agents and monitoring cost regressions in production systems.

Debugging insight

Unexpected cost spikes often come from:

  • large retrieved context
  • repeated loops
  • oversized prompts
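A simple guard over successive token estimates can surface these patterns early. The sketch below is illustrative: the function name, the budget, and the 1.5x growth threshold are assumptions to tune, not recommended values.

```python
def check_token_growth(
    estimates: list[int], budget: int, growth_factor: float = 1.5
) -> list[str]:
    """Flag steps that exceed the budget or jump sharply from the previous step."""
    warnings = []
    for step, tokens in enumerate(estimates):
        if tokens > budget:
            warnings.append(f"step {step}: {tokens} tokens exceeds budget {budget}")
        previous = estimates[step - 1] if step > 0 else 0
        if previous > 0 and tokens > growth_factor * previous:
            warnings.append(f"step {step}: grew from {previous} to {tokens} tokens")
    return warnings

# Estimates from three successive calls of a hypothetical looping agent.
for warning in check_token_growth([100, 120, 400], budget=300):
    print(warning)
```

Here the third call trips both checks: it is over budget and more than 1.5 times the previous call.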

Step 6: build the agent workflow (deterministic)

Python
def run_workflow(question: str):

    # Step 1: Retrieve
    context = retrieve(question)

    context_text = "\n".join([c["text"] for c in context])

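    # JSON mode requires the word "JSON" to appear somewhere in the messages.
    messages = [
        {"role": "system", "content": "Answer clearly using the provided context. Respond with a JSON object that has an 'answer' key."},
        {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"}
    ]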

    tokens = estimate_tokens(messages)

    response = llm.invoke(messages)

    return {
        "raw_output": response.content,
        "sources": [c["source"] for c in context],
        "token_estimate": tokens
    }

Debugging insight

We’re sticking to a deterministic flow, so we don’t have to deal with tools or agents acting out on their own.


Step 7: validate output (guardrails)

Python
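import json

from pydantic import BaseModel

class OutputSchema(BaseModel):
    answer: str
    sources: list[str]
    token_estimate: int

def validate_output(raw):

    # Fall back to the raw text if the model did not return valid JSON.
    try:
        parsed = json.loads(raw["raw_output"])
    except json.JSONDecodeError:
        parsed = {"answer": raw["raw_output"]}

    # sources and token_estimate come from the workflow, not the model,
    # so they are merged in on both paths.
    validated = OutputSchema(**{
        **parsed,
        "sources": raw["sources"],
        "token_estimate": raw["token_estimate"],
    })
    # model_dump() is the Pydantic v2 replacement for the deprecated dict().
    return validated.model_dump()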

Debugging insight

Failures here tell you:

  • The model ignored instructions
  • The output format changed
  • Something upstream corrupted the context

Step 8: add observability

Python
from fastapi import FastAPI

from opentelemetry import trace

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import (
    HTTPXClientInstrumentor,
)

# Configure tracer provider.
provider = TracerProvider()

# For local debugging:
# export traces to console.
#
# In production, prefer:
# OTLPSpanExporter -> OpenTelemetry Collector -> Jaeger/Grafana/etc.
processor = BatchSpanProcessor(ConsoleSpanExporter())

provider.add_span_processor(processor)

trace.set_tracer_provider(provider)

# Create application.
app = FastAPI()

# Instrument FastAPI and outbound HTTP calls.
FastAPIInstrumentor.instrument_app(app)

HTTPXClientInstrumentor().instrument()

Instrumentation alone does not make traces visible. OpenTelemetry requires a tracer provider, span processor, and exporter to record and emit telemetry data. For local debugging, a console exporter is sufficient. In production systems, traces are typically exported via OTLP collectors to platforms such as Jaeger, Grafana, Tempo, Datadog, or Honeycomb.

Add endpoint:

Python
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

class Query(BaseModel):
    question: str

@app.post("/ask")
async def ask(q: Query):

    result = await run_in_threadpool(run_workflow, q.question)

    output = validate_output(result)

    return output

What you can now debug

With this setup, you can answer questions like:

“Why was the answer wrong?”

  • Check retrieved documents

“Why did the output change?”

  • Compare context and token size

“Why is latency high?”

  • Trace LLM vs tool vs retrieval

“Why is the cost increasing?”

  • Inspect token estimates and context size
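For the latency question, the idea of per-stage timing can be shown without a tracing backend. This is a stdlib-only stand-in rather than OpenTelemetry code: in the service itself, the same stage boundaries would be spans. The helper names are hypothetical, and the sleeps simulate work.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(name: str, timings: dict[str, float]):
    """Accumulate wall-clock seconds spent inside a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)

def slowest_stage(timings: dict[str, float]) -> str:
    """Return the stage that consumed the most time."""
    return max(timings, key=timings.get)

timings: dict[str, float] = {}

with timed_stage("retrieval", timings):
    time.sleep(0.01)  # stands in for vector search and reranking

with timed_stage("llm", timings):
    time.sleep(0.05)  # stands in for the model call

print(slowest_stage(timings))
print({name: round(seconds, 3) for name, seconds in timings.items()})
```

The `finally` clause records the timing even when a stage raises, so failed requests still show where the time went.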

Engineering principle: make AI systems observable

AI systems are not just models. They are pipelines for:

  • retrieval
  • reasoning
  • tools
  • validation

Each part can fail independently. You need visibility at every step, or you’re essentially debugging in the dark.



Production lessons

  1. Logs are not enough

You need traces that show the full execution path.

  2. Retrieval errors look like model errors

Always inspect the context first.

  3. Tool failures are often silent

Add retries and instrumentation.

  4. Token growth is a hidden risk

Monitor prompt size continuously.

  5. Deterministic workflow simplifies debugging

Fewer moving parts = fewer unknowns.


Conclusion

The takeaway here is that, since you’re dealing with a probabilistic system, your debugging tools (and approach) have to change. By introducing things like observability, deterministic workflows, structured validation, and proper tracing, you’re setting yourself up to see where the logic goes sideways (because it will).

The post Debugging the undebuggable: building observability into probabilistic AI systems appeared first on The New Stack.


Random.Code() - Fixing ref struct Bugs in Rocks, Part 3

From: Jason Bock
Duration: 0:00
Views: 5

In this stream, I'll start investigating another bug I've found in Rocks that I think is related to ref readonly usage with ref structs.

https://github.com/JasonBock/Rocks/issues/415

#dotnet #csharp #roslyn
