Trace every run end-to-end, generate synthetic datasets to stress-test on demand, fire automated Red Team attacks at your own agents, and pin down why evaluations fail — all from the Microsoft Foundry control plane. Lock in guardrails that inspect every tool call at runtime, define the risks once, and enforce them across every agent run.
Mohammad Abuomar, Responsible AI Principal Architect, shares how to turn a coding agent into production-ready software inside Foundry.
Describe the agent, set the row count, confirm.
Your test set lands in seconds. Microsoft Foundry’s synthetic dataset generator builds eval data on demand. Get started.
Pin down why your agent fails evaluations.
Foundry’s Analyze Results uses AI to cluster failures, name the root cause, and recommend specific fixes. Check it out.
Lock down agent behavior with the Task Adherence Guardrail.
It inspects every tool call against the original task and blocks the off-script ones. Try it in Microsoft Foundry.
QUICK LINKS:
00:00 — Microsoft Foundry control plane
00:33 — See a finished agent
02:30 — See where the agent started
03:19 — Traces
04:04 — Built-in monitoring
04:34 — Evaluation types
05:51 — Red team evaluations
07:08 — Evaluation results
08:14 — Built-in Guardrails
08:14 — Wrap up
Link References
Get everything you need in Microsoft Foundry at https://ai.azure.com
Unfamiliar with Microsoft Mechanics?
As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.
Keep getting this insider knowledge, join us on social:
Video Transcript:
-If you want to build agents that meet your expectations for output quality, performance, safety, and cost, it’s not just about the model or framework you select. The testing, evaluation, and the controls surrounding your agent matter. And that’s what the Microsoft Foundry control plane is designed to do, with tools you can use during development to make sure your agents raise the bar across every important dimension. Today, I’m going to walk through the process of building a coding agent and demonstrate where the controls in Foundry come in to make it better. I’ll start by showing my finished agent, then after that, I’ll show you the steps I took using Foundry to make it production ready. This agent is designed to take a simple user prompt, then find what it needs to build apps automatically.
-So, I’ll paste in my prompt asking it to generate a Windows desktop app for personal cashflow management. It needs to be fast, use WebUI, and easy to use for broad appeal. I’m also asking it to make it safe, secure, and follow privacy best practices. And it needs to be easy for a developer to read, maintain, and to add to it. I’ll submit the request and it gets to work, with its reasoning on the left and code on the right. This process takes several minutes to complete with a few interactions in between, so to save a little time, I’ll skip to the result. We can see the agent’s reasoning and plan, with its technology stack, approach and initial action. Then, we can follow all of the steps it performed to author and configure the app and its dependencies.
-Below that, is the React code and JavaScript. It asked whether to proceed writing this as an Electron and React setup, and I confirmed. Then it started to write, test and iterate on the app, followed by another question whether to implement more features or focus on security. And I responded to do both. It then finished writing the app and finally it outlined the steps to run the app locally.
-So, let’s test it out. I’ll move over to my terminal window running PowerShell and start it. And here is the generated app. It’s fully functional with user authentication. I can enter my first item, Travel Expenses, and the amount, and there’s a Category dropdown menu with pre-configured options, and I’ll choose “Transportation”. And it writes that record into the local data store. This is a simple, production-ready app that the agent was able to create in just a few minutes. But it didn’t start out this way, and if you’ve built agents or apps yourself, you’ll know a lot of what doesn’t get shown is the testing, iteration, and refinement work to end up with production-ready code. Let’s change that.
-Let’s go back in time to where this agent started. I’m in Visual Studio Code and this is my agent, which I built using the Foundry SDK. Here are the defined tools for it to use, WebSearch and CodeInterpreter. And on the left, we can see the full list of local tools. Like interacting with the file system, as well as git, patching, registry, local search and running shell scripts. And here in the center is the key SDK line that creates the agent, adds the tools, deployment name and so on.
-So, the agent is functional and I’ve also started manual testing. And this is where Foundry controls let me stress‑test the agent to see what works and what doesn’t and see the details for each run. In the Microsoft Foundry portal, I have my agent open and the Traces tab. These are OTel traces of all of the runs for this agent, with the newest runs on top, everything here is backed by Azure Monitor. And I can click into any conversation or Trace ID to view the Input + Output turns for that session. They’re easier to parse than standard logs, speeding up reviews. We can also see the system message, user input, and what the agent did. Along with the agent’s reasoning, the technology stack it used, and the app features. Below that, we can see the development process as well as tool outputs Beyond that, with built-in monitoring, you can get a roll-up view of all activities for our agent with key metrics I’m in the Monitor tab. It shows me the estimated cost and token usage so far. This agent is new so I haven’t configured Evaluations yet, but we’ll get to those in a moment.
-Next, you’ll see Operational metrics like the number of agent runs and how many successfully completed or failed, token consumption, tool calls made by the agent, and the error rate over time. Evaluations are where a lot more testing automation comes in to help you improve agent faster. I’m in the Evaluations tab, I need to create my first one. The options are: Automatic Evaluation, where you can automate the process using AI; Human Evaluation, where someone tests the agent and completes surveys; and Red team, where an agent runs automated attacks to expose vulnerabilities. I’ll start with Automatic Evaluation and hit Create. It starts with defining a target. My agent and the version I want are already selected. For data, I can upload an existing dataset or save time by creating a synthetic dataset, which is very cool. This generates data automatically, you just select the number of rows you want. I’ll guide it with a prompt, “Create a dataset for evaluating a coding agent.” I’ll skip the reference file and just Confirm. That automatically generates 90 rows of data to test with.
-Next, I’ll choose the evaluation Criteria. There are several built-in evaluators for Agents. Below that are evaluators for Quality. These are editable, so I’ll remove Coherence, Fluency, and Groundedness because my agent doesn’t need them. For Safety, there are seven evaluators, and I’ll keep them as-is and move on to Review, then Submit it. These Automatic Evaluations can take several minutes to complete, so while it’s working, I’ll move into Red Teaming, which is now becoming a core part of AI testing to spot vulnerabilities early on. I’ve started creating my first red team evaluation. Let’s look at the standard configuration for risk categories. You can modify these. It can check for unsafe categories plus ungrounded attributes, code vulnerabilities, and task adherence. It shows the tools that the agent can access. I’ll provide descriptions for web_search, to search the internet for relevant SDKs, and the code_interpreter to run code for the coding agent. Then I’ll Save it.
-Next, I’ll change Seed queries from 5 to 10 per category for more testing. In the Attack strategies, I can see exactly what the red teaming agents will try to do and select the ones most relevant to my agent. Each tile describes the attack type that will be tested. I’ll choose AsciiSmuggler, Base64, Jailbreak, StringJoin, UnicodeSubstitution, and IndirectJailbreak. Now, I can review the prohibited actions, including things like attempts to change password, and more. These are all things attackers might try to do with your agent, and we’re automating those tests for you. I’ll hit Submit to get everything started. Now, with two evaluations running, to save a little time, I’ll fast forward to the results of the evaluations.
-Here, we can see the two runs. I’ll open the Automatic Evaluation first. Then clicking into the Run shows the details for each evaluator. If I scroll to the right, you’ll see that we’re green almost across the board. One glaring exception is the TaskCompletion score at 59%, which is below my bar, so it’s something to fix. One of my favorite capabilities in evaluation is using AI to analyze the results. I’ll start the analysis, and it creates a nice cluster analysis showing the main issues. I mentioned TaskCompletion before. Here, you can see “incomplete resolution” and “action plan issues”. Drilling in, looks that there is a “lack of actionable output” and the AI suggests specific ways to fix it. This saved me time to find ways to improve my agent.
-Now, let’s review our Red Teaming evaluation. I’m at the top level view and I’ll click in to see the issues. Immediately, I can see that the Task adherence is red, which is also related to TaskCompletion. We can fix this using a built-in guardrail to check for task adherence. Guardrails define what risks to detect, from which point in the process, and how to respond. Let’s go to the agent playground. Scrolling down to Guardrails, I can see only the default model guardrail is set. Let’s add another by clicking Manage guardrails and Create. Here, I can define the risks and controls I want to enforce. I’ll start with Risk, and these are the types of risks we can detect and mitigate. There’s an option for “Task adherence” that I’ll choose. This guardrail checks any tool call made by the agent to ensure it’s used appropriately to “adhere” to the task.
-Now, I just need hit Submit to activate this guardrail. And the TaskCompletion issue should now be fixed. In fact, here I’ve run another evaluation, and we can see that TaskCompletion is now green and everything meets our overall quality goals. With that, my agent is ready for broader use. And while I focused today on a single agent and using Foundry controls to test it, expose vulnerabilities, and make it better, Foundry also provides fleet-wide performance visibility across all agents and enables centrally applied and enforced policies and configurations to keep agents compliant.
-To find out more and get started with these and other controls, you’ll find everything you need in Microsoft Foundry at ai.azure.com. Subscribe to Mechanics for the latest tech updates, and thanks for watching.