Improving agent quality at scale is one of the hardest operational problems teams face once agents are running in production. We’ve been working to close the gap between seeing what’s wrong and shipping a better version without breaking everything else. This post explains the thinking behind a new optimization loop for agents, what we learned building it, and how you can run it today.
From craft and intuition to traces, evals, and a quality conundrum
If you’re building production agents, you’ve probably walked a version of this path:
You started with prompt engineering. Wrote the system instruction, iterated on it, got the agent to mostly work. You and the model, in a tight feedback loop of “try it, read the output, tweak the prompt.” This phase is craft. It’s intuition-driven, and it gets you surprisingly far.
Then you added traces: OpenTelemetry, App Insights, whatever your stack uses. This is good engineering practice, but it’s also necessary. You couldn’t understand what the agent was actually doing without them. Now you can see the reasoning chain, the tool calls, the decisions. You have visibility.
Then came evaluation. At first, it was vibes: reading traces, gut-checking whether the output felt right. Over time you got more rigorous. You defined metrics, set guardrails, and established quality bars. Maybe you built a scoring rubric across multiple dimensions (policy compliance, cost-awareness, escalation accuracy). Now you can measure quality, not just feel it. You know your pass rate. You know which scenarios break.
And then you hit the wall.
Let’s say you ship a travel-approver agent. It calls three tools: lookup_travel_policy, check_department_budget, and get_flight_alternatives. It returns an approve, deny, or escalate decision. The first week looks clean. Then finance flags a $4,800 trip that was approved without VP sign-off. You pull the trace: The tools ran, the loop completed, and the output was confident. The agent just never called the budget-check tool.

You find the gap in the instruction, so you add a rule about cost thresholds. Re-run the eval. That case passes. But now the emergency-travel override that used to work flawlessly starts escalating everything. You try a different wording. The emergency case recovers. Two other scenarios regress.
You have the traces. You have the evals. You can see exactly what’s wrong and measure exactly how wrong it is. But fixing it without breaking something else? That’s where you’re stuck.
The data you painstakingly collected just sits there while you manually guess-and-check your way through configuration changes.
And it compounds. This might be tolerable for one agent. But if you’re operating five, 10, 20 agents across different domains, each with their own failure modes, their own evaluators, their own regression risks, the manual loop becomes untenable. You can’t individually nurse each agent through prompt revisions and hope nothing else breaks.
You’re not debugging anymore. You’re searching. And you’re doing it without a map.
Reframing the problem
Most teams treat agent improvement like debugging: Find the broken thing, then fix it. But an agent that skips a budget check isn’t “broken” the way code with a missing null check is broken. Its instruction just doesn’t encode enough constraint for that scenario. There are dozens of possible instruction variants that might fix it, and most of them regress something else.
In traditional software, when a test fails, you know what to fix. The stack trace points at a function. You patch the function, run the test suite, then confirm nothing else broke. With agents, a quality failure can live in many places: the system instruction, the model, a tool description, a skill definition. There’s no stack trace pointing at the broken line. The cause may sit in one of those places or in several at once, so you can’t isolate it the way you’d isolate a bug.
But here’s what you might not have noticed: You already have almost everything you need. Your traces contain the failure signal. Your evaluators contain the quality definition. What’s missing is the loop that connects them. The loop that goes from “I see what’s broken” to “here’s a better configuration, scored against everything, ready to ship.”
We built a system that does for agent configurations what your CI pipeline does for code: Propose a change, score it against the full evaluation suite, and only promote it if quality holds across the board.
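To make "only promote it if quality holds across the board" concrete, here is a minimal sketch of such a gate. It is an illustration, not the optimizer's actual logic: the function name `should_promote`, the `tolerance` parameter, and the scores are all made up for the example; only the dimension names come from the rubric mentioned earlier.

```python
from typing import Dict


def should_promote(baseline: Dict[str, float],
                   candidate: Dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """Promote only if no dimension regresses and at least one improves.

    `tolerance` (hypothetical) is how much of a drop on a single
    dimension is treated as noise rather than a regression.
    """
    no_regression = all(candidate[d] >= baseline[d] - tolerance for d in baseline)
    some_gain = any(candidate[d] > baseline[d] for d in baseline)
    return no_regression and some_gain


# Illustrative scores only; these are not results from a real run.
baseline = {"policy_compliance": 0.80, "cost_awareness": 0.60, "escalation_accuracy": 0.90}
fixes_one_breaks_one = {"policy_compliance": 0.80, "cost_awareness": 0.85, "escalation_accuracy": 0.70}
holds_everywhere = {"policy_compliance": 0.82, "cost_awareness": 0.85, "escalation_accuracy": 0.90}

print(should_promote(baseline, fixes_one_breaks_one))  # False
print(should_promote(baseline, holds_everywhere))      # True
```

The first candidate is the "fix one thing, break another" case: a large gain on one dimension does not buy a pass when another dimension drops.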
If you’ve done hyperparameter tuning, this will feel familiar. The optimizer explores a configuration space the same way a sweep explores learning rates and architectures. The difference is that the search dimensions are instructions, skills, tool definitions, and model selection instead of numeric parameters.
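To show the shape of that configuration space, the sketch below enumerates a tiny grid the way a naive sweep would. This is only an illustration of the search dimensions: the instruction and tool-description variants are hypothetical placeholders, and an exhaustive grid is not how the optimizer picks candidates.

```python
from itertools import product
from typing import Dict, List

# Search dimensions named in the post; the variant labels are hypothetical.
search_space: Dict[str, List[str]] = {
    "instruction": ["baseline", "baseline + cost-threshold rule"],
    "model": ["gpt-4o", "gpt-4.1-mini"],
    "tool_descriptions": ["original", "reworded"],
}


def enumerate_candidates(space: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every combination of one value per dimension."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


candidates = enumerate_candidates(search_space)
print(len(candidates))  # 8
print(candidates[0])
```

Even three dimensions with two options each give eight configurations to evaluate, which is why the space grows too quickly to explore by hand.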
The optimization loop

You already have the pieces:
- An agent running in production (model, instructions, skills, tools)
- Evaluators that score quality across multiple dimensions
- Traces from real usage
The optimizer takes all three as input and runs a four-step loop. Each step is something you’d otherwise grind through manually; the system handles the heavy lifting.
1. The optimizer generates candidates. It searches across instructions, models, skills, and tool definitions. These aren’t random mutations. A reflector model reads traces from your evaluations, identifies why the agent scored poorly, and proposes targeted changes (more on the reflector shortly—it turned out to be the most important piece of the puzzle).
2. Candidates are scored and ranked. Same evaluators, same dataset, deterministic comparison. Every candidate is measured against the same bar your baseline was. Per-dimension scoring (policy compliance, cost-awareness, routing accuracy) means you can see exactly what improved and what regressed.

3. A developer reviews and decides. The loop isn’t completely autonomous. You look at what changed, why the optimizer proposed it, and whether the improvement is real. If it doesn’t look right, you reject and re-run (optionally with updated evaluators or a different search configuration). If it passes your judgment, you approve. This is deliberate. Automation without oversight compounds errors.
4. The winner ships as the next version. Versioned, reversible, auditable. This updates your agent’s configuration. Often the change is small: same model, same tools, better instructions. If the new version underperforms in production, you roll back.
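The four steps above can be sketched as a single function. This is a simplified outline under stated assumptions, not the optimizer's implementation: `Candidate`, `run_round`, and the three callbacks are hypothetical names, ranking by mean score is an assumption, and the `approve` callback stands in for the developer's review.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]


@dataclass
class Candidate:
    name: str
    instruction: str
    rationale: str  # why this change was proposed


def run_round(
    baseline: Candidate,
    propose: Callable[[Candidate, Scores], List[Candidate]],
    evaluate: Callable[[Candidate], Scores],
    approve: Callable[[Candidate, Scores, Scores], bool],
) -> Candidate:
    """Run one pass of the loop and return the version to ship."""
    baseline_scores = evaluate(baseline)
    # Step 1: generate candidates from the baseline and its scores.
    candidates = propose(baseline, baseline_scores)
    # Step 2: score every candidate with the same evaluator, rank by mean score.
    scored = [(c, evaluate(c)) for c in candidates]
    scored.sort(key=lambda pair: sum(pair[1].values()) / len(pair[1]), reverse=True)
    # Step 3: the reviewer sees candidates best-first and may reject any of them.
    for candidate, scores in scored:
        if approve(candidate, baseline_scores, scores):
            return candidate  # Step 4: this becomes the next version.
    return baseline  # Nothing approved: keep what is in production.


# Toy stand-ins so the sketch runs end to end.
def toy_propose(base: Candidate, scores: Scores) -> List[Candidate]:
    return [Candidate("with-threshold",
                      base.instruction + " Escalate trips above the cost threshold.",
                      "budget check was skipped on an expensive trip")]


def toy_evaluate(c: Candidate) -> Scores:
    has_rule = "cost threshold" in c.instruction
    return {"policy_compliance": 0.9 if has_rule else 0.6, "escalation_accuracy": 0.9}


def toy_approve(c: Candidate, before: Scores, after: Scores) -> bool:
    return all(after[d] >= before[d] for d in before)


baseline = Candidate("baseline", "Approve, deny, or escalate travel requests.", "")
winner = run_round(baseline, toy_propose, toy_evaluate, toy_approve)
print(winner.name)  # with-threshold
```

Returning the baseline when nothing is approved mirrors the reject-and-re-run path: a round with no acceptable candidate changes nothing in production.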
After shipping, production telemetry accumulates: user feedback, reviewer overrides, scenarios your eval set didn’t cover. This signal doesn’t flow directly into the optimizer. It flows into you: your decision to update evaluators, add new test cases, and trigger another optimization run. The optimizer works from your evaluations; production tells you what to measure next.
There’s more to say about how the optimizer explores this space internally: the search techniques, the tradeoffs, how the reflector generates hypotheses. That’s beyond what we can cover here. But one finding from inside the optimizer is worth pulling out.
What actually moves the needle
The optimizer isn’t just randomly mutating prompts. The central piece is a reflector: a separate model whose only job is to read failing traces and reason about why the agent scored poorly. It then proposes targeted edits for the next round.
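As a rough picture of what a reflector receives, the sketch below filters evaluation results down to the failing traces and assembles them into a diagnosis prompt. Everything here is an assumption for illustration: the `EvalResult` shape, the 0.5 failure threshold, the scenario names, the scores, and the prompt wording are hypothetical, and no model is actually called.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalResult:
    scenario: str
    tool_calls: List[str]
    decision: str
    scores: Dict[str, float]


def failing(results: List[EvalResult], threshold: float = 0.5) -> List[EvalResult]:
    """Keep results where any dimension scored below the threshold."""
    return [r for r in results if min(r.scores.values()) < threshold]


def build_reflector_prompt(instruction: str,
                           results: List[EvalResult],
                           threshold: float = 0.5) -> str:
    """Assemble the text a reflector model would be asked to reason over."""
    lines = ["Current instruction:", instruction, "", "Failing traces:"]
    for r in failing(results, threshold):
        weak = sorted(d for d, s in r.scores.items() if s < threshold)
        lines.append(f"- {r.scenario}: tools called={r.tool_calls}, "
                     f"decision={r.decision}, low scores={weak}")
    lines += ["", "Explain why each trace scored poorly, "
                  "then propose one targeted edit to the instruction."]
    return "\n".join(lines)


# Illustrative data only.
results = [
    EvalResult("4800-usd-trip", ["lookup_travel_policy", "get_flight_alternatives"],
               "approve", {"policy_compliance": 0.2, "escalation_accuracy": 0.0}),
    EvalResult("routine-trip", ["lookup_travel_policy", "check_department_budget"],
               "approve", {"policy_compliance": 1.0, "escalation_accuracy": 1.0}),
]
prompt = build_reflector_prompt("Approve, deny, or escalate travel requests.", results)
print(prompt)
```

Only the failing trace reaches the prompt, along with the tools it did and did not call, which is the evidence a diagnosis needs.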
Here’s what we found: The quality of that reflector, the model doing the diagnosis, has a disproportionate impact on outcomes. More so than the agent’s own model. More so than tuning other parameters in the search. This held across multiple agent types and domains.
What does that mean concretely? Swapping to a stronger reflector model improved optimization results more than any other single change we could make. The agent could be running gpt-4o or gpt-4.1-mini. It didn’t matter as much as having a reflector that could clearly reason about why something went wrong and what to change about it.
And here’s the implication for how you invest: The meta-cognition layer, the ability to reason about failures, mattered more than any other single lever we tested. Better diagnosis beats better execution. If you’re going to invest in one capability, invest in the quality of your failure analysis.
The engineering behind the reflector (how it reads traces, generates hypotheses, and avoids local optima) is its own story.
The travel-approver: A concrete run
Let’s go back to our earlier travel-approver agent example. Here’s what one optimization run might produce:

The winning candidate was a system-prompt rewrite. Same model, same tools, same skills. Just a better instruction. The optimizer added an explicit cost-threshold rule and an escalation ladder that the baseline lacked.
The $4,800 trip that started this story? The optimized agent calls the budget check, sees the amount exceeds the $3,000 threshold, and routes to VP review. Same scenario, different outcome. The instruction now encodes the constraint explicitly.
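One way to turn this scenario into a regression check is a trace-level evaluator. The sketch below is an assumption about how such a check could look, not the evaluator Foundry generates: the function name and result keys are hypothetical, and it assumes the budget check and the escalation are required only above the $3,000 threshold.

```python
from typing import Dict, List

VP_REVIEW_THRESHOLD_USD = 3000  # threshold from the example above


def check_cost_escalation(amount_usd: float,
                          tool_calls: List[str],
                          decision: str) -> Dict[str, bool]:
    """Check one trace against the cost-threshold rule.

    Assumption: trips at or below the threshold pass both checks trivially.
    """
    over = amount_usd > VP_REVIEW_THRESHOLD_USD
    return {
        "budget_checked": (not over) or "check_department_budget" in tool_calls,
        "escalated_when_required": (not over) or decision == "escalate",
    }


# The trace that started the story: budget check skipped, trip approved.
before = check_cost_escalation(
    4800, ["lookup_travel_policy", "get_flight_alternatives"], "approve")
# The same request after the instruction rewrite.
after = check_cost_escalation(
    4800, ["lookup_travel_policy", "check_department_budget"], "escalate")

print(before)  # {'budget_checked': False, 'escalated_when_required': False}
print(after)   # {'budget_checked': True, 'escalated_when_required': True}
```

Keeping a check like this in the eval set means any later candidate that drops the rule fails on the same scenario.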
When to use this loop, and when to skip it
This loop works better in specific situations. Here’s how to know if it fits yours.
It’s a good fit when:
- You have an agent in production with traces and evaluation data
- Quality issues are cross-cutting: Fixing one thing breaks others
- You’re operating at scale, across multiple agents or ongoing iteration cycles
- The failure mode is at the configuration level: instructions, skills, tool definitions, model selection
It’s probably not the right tool when:
- Your agent is still in early development and you haven’t collected enough traces yet (manual approaches like prompt engineering are still a good path forward)
- The problem is infrastructure: context window too small, tools return bad data, latency
- You have one agent with one failure mode—in that case, just fix it manually
- The task is reasoning-bound (competition math, deep logic chains)—here, you need a model upgrade, not instruction optimization
Key takeaways
Here are the four things we’d carry to any system doing this kind of work:
- Quality is a search problem, not a debugging problem. Define what good looks like, search the configuration space, and rank what works. Stop trying to fix one case at a time.
- Invest in diagnosis. The reflector (the model that reasons about why things went wrong) has more impact than any other single lever. Better failure analysis beats better execution.
- Evaluators are the ceiling. Your optimization is only as good as your quality definition. Start with generated approximations, refine with real data. The first version is never the last.
- Keep the human in the loop. The optimizer proposes; the developer decides. Automation without oversight compounds errors.
How we built this in Microsoft Foundry
We packaged this loop into Agent Optimizer inside Foundry Agent Service, available today through the azd CLI.
Here’s what the travel-approver run looks like from your terminal:
azd ai agent eval init # generate dataset & evaluator from a one-paragraph description
azd ai agent eval run # score the current version (baseline)
azd ai agent optimize # search over candidates
azd ai agent optimize apply --candidate <id> # apply the winner locally
azd deploy # ship as the next version
Five commands. The complexity lives in the optimizer, not in your workflow.

The system handles candidate generation, scoring, ranking, and version management. You handle the decision: approve, reject, or adjust your evaluators and run again.
On getting started without evaluation data: The system includes AI-assisted dataset and evaluator generation based on your agent’s configuration and traces. You describe what the agent should do in a paragraph, and eval init generates a dataset and a multi-dimension evaluator, drawing on traces when they are available. This makes it easier to bootstrap, but generated data is a starting point: the closer your eval data is to real user scenarios and real edge cases, the higher the quality ceiling. If your evaluators can’t distinguish good from bad, the candidates are noise.
We’ve also seen cases where the reflector proposes a fix that passes the eval but introduces regressions on inputs not in the eval set. That’s why the human gate exists. The loop isn’t fully hands-off. You still need someone looking at the candidates before they ship.
What we’re exploring next
The loop as described is agent-level: one agent, one set of instructions, one optimization pass. Two directions we’re actively building toward:
Reducing deployment risk. Right now, shipping a candidate means replacing what’s in production. Full swap. If your eval set is strong, that works. But eval sets are approximations, and production traffic has a longer tail than any test suite. We’re building A/B-style deployment: Promote a candidate alongside the current version, route a fraction of traffic to it, and compare outcomes against the same evaluators that scored it in the loop. The developer gate doesn’t end at “approve.” It extends into production. Roll forward when evidence accumulates; roll back the moment it doesn’t.
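The "route a fraction of traffic" step can be sketched with a deterministic hash-based split. This is a common general technique shown as a minimal example, not a description of how Foundry will implement it; the function name `route` and the request identifiers are hypothetical.

```python
import hashlib


def route(request_id: str, candidate_fraction: float) -> str:
    """Send a stable fraction of requests to the candidate version.

    Hashing the request ID keeps the assignment deterministic, so the
    same ID always lands on the same version.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "candidate" if bucket < candidate_fraction else "current"


ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(i, 0.10) == "candidate" for i in ids) / len(ids)
print(f"candidate share: {share:.3f}")  # close to 0.10
```

Because the split is a pure function of the ID, rolling back is a matter of setting the fraction to zero, and outcomes can be attributed to a version without storing the assignment.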
Widening the search space. Today, the optimizer searches over instructions, skills, tool definitions, and model selection. That covers most failure modes. But sometimes the bottleneck is upstream of the agent itself: retrieval settings that return noise, knowledge gaps no instruction can fix, or tool sets that don’t match the task. We’re integrating Foundry IQ (managed knowledge grounding) and Foundry Toolbox (curated tool sets) as tunable dimensions. The optimizer can then search over retrieval configuration, which knowledge sources to ground on, and how tool sets are composed. Same scoring rubric, wider surface area. You stop running those experiments by hand.
There’s more here we’d like to share, especially as we continue to learn and explore this space. The optimizer’s architecture, the engineering discipline behind it, the edge cases that taught us the most—those are stories worth telling. Stay tuned to Command Line for more.
Try it out
Agent Optimizer is in public preview. If your agents are stuck in the cycle of “fix one thing, break two others,” try it out and give us feedback.
The post The agent optimization loop and how we built it in Foundry appeared first on Command Line.












