Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Introducing azure-functions-skills: An AI-Era Workspace for Azure Functions (Preview)


AI coding agents connected to Azure Functions skills

Today we’re announcing azure-functions-skills in public preview: a one-command way to give your favorite coding agent (GitHub Copilot CLI, Claude Code, Codex CLI, VS Code) the skills, agent definition, MCP servers, hooks, and instructions it needs to ship secure-by-default, scale-ready Azure Functions — end-to-end.

AI coding agents now write the first draft of your function, scaffold the infrastructure, and run the deploy command. But ask a general-purpose agent to build for Azure Functions and the output is usually a step behind. It leans on older programming models that have been superseded, and it has no knowledge of newer capabilities: the serverless agents runtime, Flex Consumption defaults, the new Azure MCP template service, the latest binding shapes, this week’s runtime improvements, or Go language support. Worse, the code it produces often leaves hardcoded keys, connection strings, and other secrets sitting in your function for you to clean up later, picks patterns that don’t scale (client-per-invocation, blocking I/O on the hot path), and skips identity-based access entirely. The code compiles, but it isn’t secure, isn’t current, and isn’t using what Azure Functions offers today.

azure-functions-skills closes that gap. The skills steer the agent toward managed identity, Key Vault references, Flex Consumption, and the binding and concurrency patterns that scale — and the built-in doctor catches the rest before deploy.

Try it now: npx @azure/functions-skills install

In about 5 minutes you’ll have a working Functions project scaffolded with managed identity, a deploy-ready workflow, and a doctor HTML report you can wire into CI.

Requirements: Node 18+, an Azure subscription, and one of: GitHub Copilot CLI, Claude Code, Codex CLI, or VS Code.

Availability: azure-functions-skills is in public preview on npm as @azure/functions-skills and on the GitHub Copilot CLI / Claude Code / Codex plugin marketplaces. The skill set is intentionally small at launch and will grow with each Azure Functions release.

What is azure-functions-skills?

azure-functions-skills is a plugin for AI coding agents. It builds on the broader azure-skills plugin for cross-Azure scenarios, and it ships:

  • Skills. Task-focused playbooks the agent loads on demand (setup, create, deploy, diagnostics, best-practices, health-status, inventory, doctor, feedback).
  • An agent definition (functions-copilot) that routes user requests to the right skill and proposes the next workflow when one finishes.
  • MCP server configuration, hooks, and instruction files (copilot-instructions.md, CLAUDE.md, AGENTS.md). Everything the agent needs to behave consistently across hosts.
  • A companion CLI, @azure/functions-skills, that installs all of the above with one command, lets you run the agent (chat), and validates your project before deployment (doctor).

Names you’ll see in this post: @azure/functions-skills — the npm package and CLI you run. azure-functions-skills — the plugin (skills + instructions) the CLI installs. functions-copilot — the agent definition that routes you to the right skill.

Two design choices shape every feature:

  1. Skill discovery is a first-class product surface. Skill names and granularity are tuned so the agent picks the right one at the moment a developer asks for it, and so a developer browsing the catalog can recognize what each skill is for. Where a request belongs to the broader Azure surface, we route into the azure-skills plugin rather than reinvent it.
  2. The agent responds in the language you write in. Ask in Japanese, get Japanese. Ask in English, get English. The instruction files are wired so the host agent honors the conversation language consistently.

What ships in the preview

Skill catalog

The azure-functions-agents skill is included from launch and supports the Azure Functions serverless agents runtime that just launched at Build 2026.

Skill | What it does
azure-functions-setup | Detects Azure CLI / azd / Core Tools / language runtimes / the azure-skills plugin on your machine and walks you through installing what’s missing.
azure-functions-create | Scaffolds new Functions projects, or adds functions to an existing project, using the Azure MCP template service so you always start from the latest templates.
azure-functions-agents 🚀 | Scaffolds, extends, deploys, and troubleshoots event-driven AI agents on the Azure Functions serverless agents runtime (azurefunctions-agents-runtime) that just launched at Build 2026. Picks the best deployable GPT model based on subscription / region quota, wires Microsoft Foundry, Connector Namespaces, and remote MCP servers, and offloads code execution or web browsing to Azure Container Apps dynamic sessions.
azure-functions-deploy | Hands off to the azure-skills prepare → validate → deploy workflow with Functions-specific guidance (Flex Consumption, functionAppConfig, private networking, identity).
azure-functions-best-practices | Reviews an existing Function App against current best practices and proposes prioritized, approval-gated remediations.
azure-functions-diagnostics | Investigates deployment failures, runtime errors, trigger / binding issues, and logging gaps.
azure-functions-health-status | Collects the current running state, metrics, Application Insights signals, Resource Health, and Activity Log.
azure-functions-inventory | Collects static specifications: SKU, runtime, networking, identity, settings, functions, and trigger inventory.
azure-functions-doctor | Pre-deployment validation, used by the doctor CLI command below.
azure-functions-feedback | Turns observations from a session into a previewed GitHub issue or PR against this repo.

The set is intentionally small at launch. It already includes azure-functions-agents so you can scaffold and deploy on the Azure Functions serverless agents runtime that just launched at Build 2026. A skill to assist migrating worker code to Go is next.

Have a skill you’d like to see? Open an issue at https://github.com/Azure/azure-functions-skills/issues, or just run azure-functions-feedback mid-session and the skill itself will prepare the issue draft for you.

The CLI: install, chat, doctor

install: one command for every host

Each AI coding agent has its own plugin install flow, and several of them spread the work across multiple steps. The GitHub Copilot CLI plugin, in particular, can only be installed at user scope. That’s useful for skills, but not what you want for project-specific MCP servers, hooks, or instruction files that should live with your repository.

install collapses all of that into one command and applies the right split by default:

  • Plugin (skills) → user scope. Available to every project on your machine.
  • Workspace artifacts (MCP, agent definition, hooks, CLAUDE.md / AGENTS.md) → the current directory. Committable alongside your code.

This keeps your user-scope agent context clean and makes the Azure Functions skills findable every time you open the workspace. If you want everything in the project, add --local:

# GitHub Copilot CLI (default: plugin user-scope, workspace artifacts here)
npx @azure/functions-skills install --agent ghcp

# Everything in the project
npx @azure/functions-skills install --agent ghcp --local

Use --agent claude for Claude Code or --agent codex for Codex CLI. The CLI also absorbs future plugin-flow changes so the command stays stable for users.

chat: start the agent with the right context

chat launches your installed agent of choice, already wired into the functions-copilot agent definition.

npx @azure/functions-skills chat

A typical first message looks like this:

“Create a Python HTTP trigger that reads from Cosmos DB using managed identity, and add a Service Bus output binding.”

The agent picks the right skills (create, then best-practices), uses the Azure MCP template service for the latest scaffold, and wires identity-based access by default. No keys in your repo.

The first time you run chat in a workspace, the setup skill auto-fires. It walks through prerequisites (Azure CLI, Azure Developer CLI, Core Tools, language runtimes, the azure-skills plugin) and offers to install anything missing, so a developer brand-new to Azure Functions can get to a working environment without bouncing between docs.

After setup, the agent suggests the most useful next skill based on your project state, which makes the rest of the catalog easy to discover.

chat launches the functions-copilot agent; on first run, the setup skill auto-fires to verify prerequisites

Everything after -- is passed through to the underlying agent, so any agent-native flag you rely on still works. Subsequent chat runs skip setup because the per-workspace state lives under .azure-functions-skills/.

VS Code users get the same experience: open the workspace, pick the functions-copilot agent, and run the setup skill from there.

Selecting the functions-copilot custom agent from the GitHub Copilot Chat agent picker in VS Code

doctor: shift-left for the two biggest incident causes

Do you know the top two causes of Azure Functions support incidents reported to our team?

  1. User code defects
  2. Function App misconfiguration

Together, they account for roughly half of the Azure Functions support incidents we see internally — based on our analysis of Customer Reported Incidents (CRIs) in Q1 CY2026, about 53% were related to customer code or configuration issues. Preventing this class of issue before deploy time eliminates a large fraction of the problems customers report.

doctor checks a workspace for exactly those issues. It runs in two tiers:

  • Tier 1 (deterministic, no LLM): host.json shape, runtime version, trigger configuration, extension bundle range, deprecated settings, lockfile presence, tracked .env files, and a set of supply-chain checks (lifecycle scripts, unpinned production dependencies, install-script dependencies, and more) informed by the recent npm / PyPI compromises.
  • Tier 2 (semantic, LLM via --deep): Uses your coding agent to find issues that need to read the code: client-per-invocation patterns, blocking I/O on the hot path, hardcoded secrets, Durable Functions non-determinism (Date.now(), Math.random(), network calls in orchestrators), credential collection patterns, and more.
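
To make one of the Tier 2 findings concrete, here is a minimal Python sketch of the client-per-invocation pattern and the usual fix of hoisting the client to module scope. FakeStorageClient and both handler names are hypothetical stand-ins for illustration, not part of any Azure SDK or of doctor itself.

```python
import itertools

# Counts how many client objects have been constructed in this process.
_constructed = itertools.count(1)


class FakeStorageClient:
    """Hypothetical stand-in for an SDK client that is costly to construct."""

    def __init__(self) -> None:
        self.instance_id = next(_constructed)

    def read(self, name: str) -> str:
        return f"{name} via client #{self.instance_id}"


def handler_client_per_invocation(name: str) -> str:
    # Anti-pattern: a new client is built on every invocation.
    return FakeStorageClient().read(name)


# Preferred: build the client once at module scope so warm invocations reuse it.
_shared_client = FakeStorageClient()


def handler_shared_client(name: str) -> str:
    return _shared_client.read(name)
```

Calling handler_shared_client repeatedly reports the same client number each time, while handler_client_per_invocation reports a new one on every call.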

Run it locally and get a self-contained HTML report (the --deep --accept-deep-risk flags opt into Tier 2 LLM checks; safe to run locally, see the CI note below before using in pipelines):

npx @azure/functions-skills doctor --dir . \
  --deep --accept-deep-risk \
  --agent github-copilot \
  --format html --output doctor-report.html

A representative run looks like this:

Tier 1 (deterministic)
  ✓ host.json shape ok
  ✓ runtime version pinned (~4)
  ⚠ extension bundle range too broad   host.json:5
  ⚠ unpinned production dependency      semver:^7.0.0 → pin to 7.5.4
  ✗ tracked .env file with secret keys  .env:3

Tier 2 (semantic, via --deep)
  ⚠ blocking I/O on hot path            app/orders.py:42  (use async client)
  ✗ hardcoded connection string         app/cosmos.py:11  (use Key Vault reference)
  ⚠ client-per-invocation pattern       app/blob.py:18    (hoist client to module scope)

Summary: 2 critical, 4 warnings — see doctor-report.html
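
The "unpinned production dependency" warning in the sample output is the kind of check that needs no LLM. The sketch below shows one way such a deterministic rule could look. It is an illustration only, not doctor's actual implementation; find_unpinned_dependencies and the sample manifest are hypothetical.

```python
import json
import re

# Exact versions only, such as "7.5.4"; ranges like "^7.0.0" or "~1.2" do not match.
_EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+$")


def find_unpinned_dependencies(package_json_text: str) -> list[str]:
    """Return names under "dependencies" whose version is not an exact x.y.z."""
    manifest = json.loads(package_json_text)
    production = manifest.get("dependencies", {})
    return sorted(
        name for name, spec in production.items()
        if not _EXACT_VERSION.match(spec)
    )


# Hypothetical manifest: one ranged production dependency, one pinned,
# and a ranged dev dependency that this check deliberately ignores.
sample = json.dumps({
    "dependencies": {"semver": "^7.0.0", "left-pad": "1.3.0"},
    "devDependencies": {"eslint": "^9.0.0"},
})
print(find_unpinned_dependencies(sample))  # ['semver']
```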

Doctor HTML report showing Tier 1 host.json and dependency findings plus Tier 2 semantic findings from the coding agent

The same command can run in CI. Wire it into your deployment pipeline and you have shift-left for the configuration and code-quality issues that drive the majority of incidents, caught while the developer (or the agent acting for them) can still fix the diff cheaply.

A word on running --deep in CI

--deep runs the coding agent with file-write and shell-execution permissions, so any input the agent sees becomes a potential prompt-injection surface. We default to refusing --deep on pull_request events. You can opt in with AZURE_FUNCTIONS_DOCTOR_TRUST_PR=1 for trusted mirror pipelines.

The recommended pattern:

  • PR validation: --no-deep (Tier 1 only). Fast, deterministic, safe to run on untrusted PR content.
  • Post-merge / release: --deep on push: main, ideally gated behind a GitHub Environment with required reviewers and a scoped secret for the agent token.
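
The recommended pattern can be expressed as a small wrapper that picks the doctor flags from the CI event. This is a sketch under assumptions: doctor_args is a hypothetical helper, and it uses only the flags named in this post (--dir, --deep, --accept-deep-risk, --no-deep).

```python
def doctor_args(event_name: str, ref: str) -> list[str]:
    """Pick doctor flags for a CI run, following the recommended pattern."""
    base = ["npx", "@azure/functions-skills", "doctor", "--dir", "."]
    # Tier 2 only after merge: a push to the main branch.
    if event_name == "push" and ref == "refs/heads/main":
        return base + ["--deep", "--accept-deep-risk"]
    # Pull requests and anything unrecognized: Tier 1 only.
    return base + ["--no-deep"]
```

In GitHub Actions, the two arguments would typically come from the GITHUB_EVENT_NAME and GITHUB_REF environment variables, and the returned list can be passed to subprocess.run.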

See docs/doctor-guide.md and SECURITY.md for the full security model.

Where each skill fits

When you want to… | Use
Get your local environment ready for Functions development | azure-functions-setup
Start a new project or add a function | azure-functions-create
Build a scheduled or event-driven AI agent (daily briefing, inbox digest, connector-triggered workflow) | azure-functions-agents
Deploy to Azure | azure-functions-deploy
Catch problems before deployment | doctor CLI (or azure-functions-doctor)
Review an existing app against current best practices | azure-functions-best-practices
Investigate a failing or misbehaving Function App | azure-functions-diagnostics
Check the live health of a running app | azure-functions-health-status
Send us feedback or a feature request | azure-functions-feedback

functions-copilot routes your request to the appropriate skill, and proposes the next step after each workflow.

Getting started

Pick the agent you already use; the rest of the flow is the same.

# 1. Install the plugin (default: skills at user scope, workspace artifacts here)
npx @azure/functions-skills install --agent ghcp     # GitHub Copilot CLI
npx @azure/functions-skills install --agent claude   # Claude Code
npx @azure/functions-skills install --agent codex    # Codex CLI

# 2. Launch the agent (setup skill auto-fires on first run)
npx @azure/functions-skills chat

# 3. Validate before deploy (--deep enables Tier 2 LLM checks; safe locally, see CI note)
npx @azure/functions-skills doctor --deep --accept-deep-risk \
  --agent github-copilot \
  --format html --output doctor-report.html

VS Code: after step 1, open the workspace in VS Code, select the functions-copilot agent in GitHub Copilot Chat, and run the setup skill. Same first-run experience as chat, just inside the IDE.

Prefer the skills scoped to the current project only? Add --local to step 1.

Full docs, CI recipes, and the supply-chain check reference live at https://github.com/Azure/azure-functions-skills.

We want your feedback

azure-functions-skills is open source, MIT licensed, and developed in the open. The repository is the right place to:

  • Ask for skills you wish were there: open an issue, or run azure-functions-feedback mid-session and have the skill prepare the draft for you.
  • Report bugs or suggest improvements. Every issue is read.
  • Contribute a skill or doc. See CONTRIBUTING.md.

Repository: https://github.com/Azure/azure-functions-skills

We’re building the AI-era developer experience for Azure Functions in the open. Star the repo, open an issue, or run azure-functions-feedback mid-session and have the skill draft the issue for you. Tell us what to ship next.

The post Introducing azure-functions-skills: An AI-Era Workspace for Azure Functions (Preview) appeared first on Azure SDK Blog.


Introducing Open-Source Skills for AWS SDK Best Practices


We released a set of AWS SDK Skills as part of the open-source Agent Toolkit for AWS. These are AI skills that teach coding agents how to follow AWS SDK best practices. The project is available on GitHub under the Apache-2.0 license.

The problem

AI coding agents know the general shape of AWS SDK usage, but they get the details wrong. They generate incorrect API names, use incorrect parameter types, and miss SDK-specific patterns like paginators, waiters, and high-level APIs such as the transfer manager for Amazon Simple Storage Service (Amazon S3). These errors are especially common for newer SDKs like the AWS SDK for Swift, where agents generate code that looks plausible but fails to compile.

As developers increasingly rely on AI agents to write AWS SDK code, we need to make sure those agents produce code that compiles, follows best practices, and uses each SDK the way it was intended to be used.

What’s in a skill

Skills are modular packages that give AI coding agents specialized SDK knowledge. Each skill is authored by the SDK team that owns the language, so it reflects the things agents consistently get wrong for that specific SDK. A skill includes:

  • SKILL.md — core instructions with SDK-specific patterns and concrete examples
  • references/ — on-demand documentation for deeper topics, loaded only when needed
  • scripts/ — automation for build, test, and validation workflows

Skills are agent-agnostic. They work with any coding agent that supports the open skills format.

Common mistakes skills help prevent

Code that doesn’t compile. This is the most common failure mode for newer SDKs where the agent’s training data is thin or out of date. The AWS SDK for Swift uses Swift concurrency throughout. Operations are async-throwing, and so are the convenience client constructors. Agents frequently miss this and produce code that looks reasonable but doesn’t build:

// What agents tend to write. Does not compile.
let client = S3Client()
let response = client.listBuckets(input: ListBucketsInput())

Both lines are wrong: S3Client() is async throws, and so is listBuckets. With the Swift skill installed, the agent writes the modern Swift concurrency form:

let config = try await S3Client.S3ClientConfig(region: "us-west-2")
let client = S3Client(config: config)
let response = try await client.listBuckets(input: ListBucketsInput())

The first version sends the developer back to the docs to figure out why a plausible-looking line won’t build. The second one runs.

Code that runs but performs poorly or costs more. Agents often skip SDK features that exist precisely to make AWS calls efficient: paginators for ListObjects and similar APIs, waiters for resource-state polling, and the SDK’s high-level file methods like upload_file / download_file for large transfers. A handwritten loop that calls ListObjects without pagination silently drops results past the first page, polling code without waiters burns API calls and risks throttling, and manual file I/O for S3 transfers gives up multipart uploads and parallelism. The code compiles and often appears to work in small tests, but breaks once you’re dealing with real data volumes. With a skill installed, the agent reaches for the right SDK feature for the job: paginators for list operations, waiters for state polling, and the high-level transfer methods for files.
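
As a minimal Python illustration of the paginator point, the sketch below lists every key under a prefix by letting the Boto3 paginator follow continuation tokens. iter_object_keys is a hypothetical helper name; get_paginator("list_objects_v2") and paginate are the Boto3 calls it relies on, and the client is passed in so the function stays easy to test.

```python
from typing import Any, Iterator


def iter_object_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every key under a prefix, page by page, via the paginator."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

With Boto3 installed and credentials configured, a call might look like list(iter_object_keys(boto3.client("s3"), "my-bucket")), where the bucket name is a placeholder.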

Code that runs but has subtle bugs. Manually marshalling DynamoDB types like {"S": "value"} is easy to get slightly wrong in ways that fail only on certain inputs. Catching a generic Exception instead of typed exceptions like ConditionalCheckFailedException makes retry logic swallow real failures. With a skill installed, the agent reaches for the document client (which handles the conversion correctly) and uses typed exceptions tied to the actual operations it’s calling.
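
A short Python sketch of the typed-exception point follows, assuming a Boto3 DynamoDB Table resource, which accepts plain Python values and handles the {"S": ...} conversion itself. create_if_absent and the key attribute name pk are hypothetical.

```python
from typing import Any


def create_if_absent(table: Any, item: dict) -> bool:
    """Insert item unless its key exists; return False if the condition fails.

    `table` is assumed to be a boto3 DynamoDB Table resource, and "pk" is an
    assumed partition key attribute name.
    """
    typed_error = table.meta.client.exceptions.ConditionalCheckFailedException
    try:
        table.put_item(Item=item, ConditionExpression="attribute_not_exists(pk)")
    except typed_error:
        # Expected outcome: the item already exists. Any other error propagates.
        return False
    return True
```

Catching only the typed exception means throttling, validation, and network errors still surface to the caller instead of being mistaken for "already exists".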

Measuring the impact

We evaluate each skill against a benchmark of real SDK tasks (Amazon S3 operations, Amazon DynamoDB queries, client configuration, presigned URL generation, credential management) and grade the generated code on whether it compiles, passes lint, and actually does what the task asked for (judged by an LLM). Every task runs twice: once with no skill installed, and once with the relevant skill loaded.

Across our test suite, code generated with a skill installed consistently passed more checks than code generated without one.

Available skills

The following table summarizes the skills available at launch:

Skill | SDK | What it covers
aws-sdk-swift-usage | AWS SDK for Swift | Async patterns, struct-based config types, client initialization
aws-sdk-js-v3-usage | AWS SDK for JavaScript v3 | Package structure, client styles, middleware, runtime validation
aws-sdk-python-usage | Boto3 / botocore | Client vs. resource interfaces, paginators, waiters, error handling

Get started

You’ll need a coding agent that supports the open skills format. To install a skill from the Agent Toolkit for AWS, run:

npx skills add aws/agent-toolkit-for-aws/skills --skill <skill>

Replace <skill> with the one you want:

  • aws-sdk-swift-usage
  • aws-sdk-js-v3-usage
  • aws-sdk-python-usage

Or pass --skill multiple times to install more than one.

If your favorite SDK is missing or you’ve seen agents make mistakes that aren’t covered yet, open an issue or submit a skill. Visit the repository on GitHub to try it out.


What’s new in Observability at Build 2026


Observability for AI and Agent Workloads

AI agents are moving from prototype to production. To help teams ship and operate agents with the same rigor as the rest of their stack, Azure Monitor and the Azure Copilot Observability agent bring end-to-end agent observability, grounded in OpenTelemetry so signals are portable across the toolchain.

Agent Observability in Azure Monitor

Agents are now a first-class artifact in Azure Monitor. New views showing the agent fleet, automated evaluations, cost breakdown, trace tree, and human-in-the-loop evals give you the tools you need to gain observability into your agents. Microsoft Foundry is where you build your agents and set up evals, and Azure Monitor is your full-stack observability solution across all layers and components of your distributed service. It is now easier to get started with a streamlined Microsoft OpenTelemetry Distro that powers all observability and governance surfaces, including Foundry, Azure Monitor, and Agent 365.

Learn more: aka.ms/agent-obs-blog  |  Trace agents with the Agent Framework

Azure Copilot Observability agent – what’s new

The Observability agent, part of Azure Monitor, enables engineers to investigate issues and explore system behavior using natural language over telemetry data. At Build 2026, new updates expand its capabilities across both chat and investigation workflows. These enhancements include broader investigation entry points (such as AKS and Application Insights), deeper cross-resource analysis, and integration with Microsoft Foundry AI agents. Together, they provide end-to-end visibility into AI-driven systems, help teams move faster from detection to root cause, and enable sharing of investigation results for collaboration and follow-up.

Learn more: Azure Copilot observability agent (preview) - Azure Monitor | Microsoft Learn | https://aka.ms/ObsAgentBlogBuild26

ROI of Agents in Foundry (Public Preview)

A new ROI view in Microsoft Foundry quantifies the business value of deployed agents — correlating cost, usage, and outcomes — so teams can see which agents are paying off and where to invest next.

Smarter, simpler monitoring with Azure Monitor

As cloud environments grow, monitoring gets harder. Teams need less alert noise, less manual tuning, tighter security, and more accessible monitoring, so they can catch real issues faster and spend less time managing the tool.

Resource-scoped querying of Azure Monitor Workspace metrics — Generally Available

Azure Monitor workspaces now let users scope PromQL queries to one or more Azure resources (for example, Virtual Machines, AKS, or Application Insights) without direct access to the Azure Monitor workspaces (AMWs) where the metrics are stored. This streamlines the user experience and brings parity with how resource-scoped queries work on Log Analytics workspaces today.

Learn more: Resource-scoped queries for Azure Monitor workspace - Azure Monitor | Microsoft Learn

Dynamic thresholds for log search alerts — Generally Available

Log search alerts now support dynamic thresholds at GA, using machine learning to learn each rule’s normal behavior from historical query results and automatically account for hourly, daily, and weekly seasonality. Thresholds are calculated per dimension combination, so multi-dimensional scenarios like AKS pod-restart spikes or resource-inventory drift get tailored baselines out of the box — with no manual tuning and no extra charge beyond the standard log search alert rate.

Learn more: Alert rules with dynamic thresholds overview

Simple log alerts in Azure Monitor — Generally Available

Simple log alerts are a type of log search alert that evaluates each row individually instead of aggregating over a time window, delivering low-latency detection for scenarios like failed automation jobs or critical Windows events. They also support Basic Logs, so customers can keep the cost savings of Basic-plan telemetry — including Application Insights traces — without giving up the ability to alert on it. Flexible trigger recurrence lets teams tune sensitivity and reduce noise without sacrificing responsiveness.

Learn more: Create a simple log search alert in Azure Monitor

Expanded OpenTelemetry Support

Modern environments span many clouds, languages, and platforms, and customers have increasingly standardized on OpenTelemetry (OTel) to instrument them consistently. The challenge is turning all that telemetry into real insight. Azure Monitor brings OpenTelemetry metrics, logs, and traces into one place where teams can troubleshoot quickly, visualize what’s happening, and act on it. From VMs and servers to applications and AI coding agents, you get faster triage, troubleshooting, and ready-made dashboards, all from one centralized experience in Azure.

OpenTelemetry App Troubleshooting via OTLP Ingestion — Generally Available

Azure Monitor now offers flexible, enterprise-ready OpenTelemetry ingestion and data storage to power application performance monitoring experiences. Use standard OpenTelemetry instrumentation and OTLP export to send metrics, logs, and traces to Azure Monitor data collection endpoints. Then monitor, triage, and troubleshoot application and platform performance using Application Insights and pre-built Grafana dashboards entirely based on OpenTelemetry data.

Learn more: https://aka.ms/AzureMonitorOTLPDirectGAblog | Ingest OpenTelemetry data into Azure Monitor

Monitor AI coding agents with OpenTelemetry (GA)

With Azure Monitor’s OpenTelemetry support, you can collect OpenTelemetry Protocol (OTLP) signals from AI coding agents such as GitHub Copilot and Claude Code, and route them into Azure Monitor for end-to-end visibility. Ingested OTLP data is stored with OpenTelemetry semantics for logs and traces. Application Insights provides curated agent views for troubleshooting, detailed trace visualizations, end-to-end transaction views, and dedicated Grafana dashboards for coding agent monitoring.

Once OpenTelemetry metrics are ingested in Azure Monitor, they can be used to create SLIs.

Learn more: https://aka.ms/AzureMonitorOTLPCodingAgentsblog | https://learn.microsoft.com/en-us/azure/azure-monitor/app/agents-view

OpenTelemetry Metrics, Visualizations, and Enhanced Monitoring for Azure VMs and Arc Servers — Generally Available

Azure Monitor now supports OpenTelemetry (OTel) metrics and visualizations for Azure Virtual Machines and Arc-enabled Servers, delivering an enhanced, unified monitoring experience. This release brings together key monitoring capabilities including recommended alerts, out-of-the-box Grafana dashboards, and at-scale configuration into a single experience. Customers can more easily monitor Guest OS health, accelerate troubleshooting, and optimize both performance and monitoring costs across their environments.

Learn more: https://aka.ms/vmiv2docs | Collect and customize OpenTelemetry metrics for Azure virtual machines - Azure Monitor

One Data Platform for Any Source and Any Destination, built on OpenTelemetry

OpenTelemetry brings consistency to telemetry collection, but production systems need control over how that data is processed and delivered. Learn how Azure Monitor enables centralized governance, multi-stage transformations, unprecedented scale (billions in EPS), and flexible routing for all your telemetry to turn OTel signals into actionable observability.

Learn more: https://aka.ms/datacollectionbuild2026

SLI and SLO — Generally Available

Azure Monitor now supports service-level indicators and objectives. These special metrics can now measure availability and latency from an end user’s critical journey. Define an SLI across the resources (Service Groups) that make up a service, set an SLO target on your Azure OTel and Prometheus metrics, and track error budget and burn rate from a single view. This lets engineering and SRE teams align day-to-day work to the customer experience, not just per-resource health.
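
For readers new to the terms, error budget and burn rate follow from simple arithmetic. The sketch below uses the common SRE definitions; it assumes nothing about how Azure Monitor computes these values internally, and the request counts are illustrative.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, for example 0.999 -> 0.001."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget; 1.0 means on pace."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


# Illustrative numbers: 30 failures in 10,000 requests against a 99.9% SLO.
print(round(burn_rate(30, 10_000, 0.999), 2))  # 3.0
```

A burn rate of 3.0 means the budget is being consumed three times faster than the SLO allows, so it would run out in a third of the SLO window if the rate held.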

Learn more: Create service level indicators in Azure Monitor.

Get started

We’re continuing Azure Monitor’s investment in efficient, end-to-end observability for developers, SREs, and IT pros. To learn more, connect with our experts at the Build Lightning Session "Broken, costly? Debug and operate AI agents with Azure Monitor" on June 3rd at 9:50 AM PT.

 


Introducing the Azure Functions serverless agents runtime (preview)


We're thrilled to announce the Azure Functions serverless agents runtime, now in public preview. It brings a new, markdown-first programming model for building AI agents as a first-class workload on Azure Functions, with the event-driven triggers, scale-to-zero economics, and operational integrations you know and love from the platform.

A few things you could build in a matter of minutes:

  • A daily briefing agent that wakes up on a timer, scours the web, and drops a summary in your Outlook inbox every morning.
  • A Teams chat agent that triggers on every message and answers your team's questions, looking up data across your connected systems.
  • An on-call troubleshooting agent that investigates incidents by querying logs in Azure Data Explorer and reports back what it found.

Each one is a single markdown file with instructions plus a trigger, deployed like any other function app and running on the Flex Consumption plan.

Why a serverless agents runtime

Building production agents today usually means stitching together a framework, a hosting layer, message queues, identity, secrets, observability, and a long list of per-service integrations. Most of that work is plumbing, not the agent.

Azure Functions has spent years making event-driven compute simple: declare a trigger, write the handler, get autoscale and managed identity for free. The serverless agents runtime applies that same model to agents:

  • Agents are the unit of work. You define behavior in natural language, not boilerplate.
  • Trigger agents from almost any event. HTTP requests, timers, queues, database changes, Teams messages, Outlook mail, and more.
  • Tools, MCP servers, connectors, and sandboxed execution are declared, not coded.
  • Deploy and operate like any function app. Flex Consumption for scale-to-zero and per-second billing, managed identity, VNet integration, Application Insights, and the same deployment tools you already use.

Markdown-first: what an agent looks like

An agent is a .agent.md file. Your app can have multiple agents, each with its own metadata that declares the trigger. The markdown body becomes the agent's instructions.

Here's a timer-triggered agent that summarizes the day's tech news and emails it:

---
name: Daily Tech News Email
description: Fetches top tech news and emails a summary daily.
trigger:
  type: timer_trigger
  args:
    schedule: "0 0 15 * * *"
---

You are a news assistant. When triggered, do the following:

1. Scour the web for today's top tech news headlines. Use reputable sources; Include links to the original articles.
2. Summarize the top stories in a concise, well-formatted HTML email body.
3. Email the summary to $TO_EMAIL with the subject "Daily Tech News Summary" followed by today's date.

That's the whole function. Drop the file into your app, deploy, and it runs on the schedule. No framework wiring, no service-specific integration code.
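To make the schedule string easier to read, here is a minimal Python sketch that labels the fields of the six-field NCRONTAB expression used by Azure Functions timer triggers. The helper name `parse_ncrontab` is hypothetical and is not part of the runtime; it only splits and names the fields.

```python
# Field order of the six-field NCRONTAB format used by Azure Functions timers.
FIELDS = ("second", "minute", "hour", "day", "month", "day_of_week")


def parse_ncrontab(expr: str) -> dict[str, str]:
    """Split a six-field NCRONTAB expression into named fields (no validation of values)."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


schedule = parse_ncrontab("0 0 15 * * *")
for name, value in schedule.items():
    print(f"{name:12} {value}")
```

Read this way, the example fires at second 0, minute 0, hour 15 on every day of every month, which is 15:00 once a day (UTC unless the app configures another time zone).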

Your agents can share configuration and capabilities through a few files alongside the agent definitions. agents.config.yaml declares system tools and the default model. mcp.json lists the MCP servers your agents can call, including MCP-enabled Azure connections. A /tools folder holds custom Python tools and a /skills folder holds reusable prompt fragments. Everything here is optional and available to every agent automatically when present.

In this example, the agent uses a Container Apps dynamic session to browse the web with Playwright, and a Microsoft Office 365 connection (exposed as an MCP server) to send the email:

    # agents.config.yaml
    system_tools:
      dynamic_sessions_code_interpreter:
        endpoint: $ACA_SESSION_POOL_ENDPOINT
    model: $AZURE_OPENAI_DEPLOYMENT

    // mcp.json
    {
      "servers": {
        "office365": {
          "type": "http",
          "url": "$MICROSOFT_365_CONNECTION_MCP_ENDPOINT",
          "auth": {
            "scope": "https://apihub.azure.com/.default"
          }
        }
      }
    }

The function app's managed identity authenticates to the connection's MCP endpoint, so there are no secrets to manage. Any Azure connector that supports MCP, or any remote MCP server, can be added the same way.
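The configuration files above reference values such as `$MICROSOFT_365_CONNECTION_MCP_ENDPOINT` through `$NAME` placeholders. The post does not say how the runtime resolves them, so the following is only a sketch of one way to check such a file locally, assuming the placeholders map to app settings exposed as environment variables. The function names `load_mcp_config` and `find_unresolved` are hypothetical.

```python
import json
from string import Template

MCP_JSON = """
{
  "servers": {
    "office365": {
      "type": "http",
      "url": "$MICROSOFT_365_CONNECTION_MCP_ENDPOINT",
      "auth": { "scope": "https://apihub.azure.com/.default" }
    }
  }
}
"""


def load_mcp_config(text: str, env: dict[str, str]):
    """Parse mcp.json text and substitute $NAME placeholders in every string value."""
    def expand(value):
        if isinstance(value, str):
            # safe_substitute leaves unknown placeholders untouched instead of raising.
            return Template(value).safe_substitute(env)
        if isinstance(value, dict):
            return {key: expand(item) for key, item in value.items()}
        if isinstance(value, list):
            return [expand(item) for item in value]
        return value

    return expand(json.loads(text))


def find_unresolved(value) -> list[str]:
    """Collect string values that still contain a '$' after substitution."""
    if isinstance(value, str):
        return [value] if "$" in value else []
    if isinstance(value, dict):
        return [s for item in value.values() for s in find_unresolved(item)]
    if isinstance(value, list):
        return [s for item in value for s in find_unresolved(item)]
    return []


# Hypothetical setting value, used only for illustration.
settings = {"MICROSOFT_365_CONNECTION_MCP_ENDPOINT": "https://example.invalid/mcp"}
config = load_mcp_config(MCP_JSON, settings)
print(config["servers"]["office365"]["url"])
print(find_unresolved(load_mcp_config(MCP_JSON, {})))
```

A check like this can run before deployment to flag a missing setting early; in a real app you would pass `os.environ` instead of the illustrative dictionary.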

Any of these global settings can be overridden per agent in the agent's metadata.

What you get in the preview

  • Triggers across the Azure Functions catalog. HTTP, Timer, Queue, Service Bus, Event Hubs, Cosmos DB, Blob, Event Grid, plus new connection-backed triggers like Teams messages, Outlook mail, and calendar events.
  • 1,400+ Azure connectors as tools. Create a connection, enable its MCP endpoint, and an agent can send mail, post to Teams, create records, query data, all without integration code or auth plumbing.
  • Any remote MCP server as tools. List the server in mcp.json and your agents can call its tools.
  • Sandboxed code and browser automation. Run code or a Playwright-powered browser in Azure Container Apps dynamic sessions, isolated per agent session.
  • Built-in chat UI, HTTP API, and MCP server endpoint with no extra code.
  • Custom Python tools in a tools/ folder and reusable skills in a skills/ folder, shared across agents.
  • Pluggable model providers. Microsoft Foundry, Azure OpenAI, and OpenAI out of the box.

Where this fits

The serverless agents runtime is designed for the agents most enterprises actually need to build:

  • Scheduled background agents that summarize, monitor, or reconcile on a timer.
  • Event-driven assistants that react to messages, emails, alerts, and database changes.
  • Cross-system agents that tie multiple SaaS and enterprise apps together through connections. Trigger with a Teams message, look up the customer in Salesforce, send an email, and update a database record, all from one agent.
  • Conversational front-ends that pair an HTTP or chat-UI entry point with the same agents your event triggers invoke.
  • Agents as MCP servers that other agents and MCP clients can integrate with directly.

We want your feedback

The serverless agents runtime is in public preview, and we're actively building it out with input from real customer workloads. Tell us what you build, what's missing, and where the model should go next.

Get started

Docs: aka.ms/azure-functions-agents-docs

Building agents on Azure Functions has never been easier. We can't wait to see what you create with the serverless agents runtime!

Azure Functions at Build 2026 Update

Azure Functions took another big leap at Build 2026. It is now the best programming model for event-driven apps and agents, on the best infrastructure to write secure code that scales. The headline features: serverless agents, connectors to M365, Teams, and more, Go, MCP, and Durable Tasks.

Microsoft Copilot scales AI workflows to hundreds of millions with Durable Task Scheduler

Before we start with all the announcements, we want to highlight a new case study. As Microsoft Copilot scaled to support complex, long-running AI workflows, engineering teams needed a more reliable and consistent orchestration model. By standardizing on Durable Task Scheduler in Azure Functions, Copilot unified state management, retries, and recovery across services, helping run hundreds of millions of executions weekly while improving resilience and delivery speed.

Read the customer story: https://aka.ms/microsoft-copilot-dts

Serverless agents runtime (Preview)

→ Full post

Azure Functions now has a first-class programming model for AI agents. Define an agent in a .agent.md file with markdown instructions plus metadata that declares the trigger and tools, and deploy it exactly like any other Function. No framework to wire up, no hosting infrastructure to manage.

Any Azure Functions trigger can run an agent: HTTP, Timer, Service Bus, Event Hubs, Cosmos DB, or the new connection-backed triggers (Teams message, Outlook mail, calendar events, SharePoint item). Agents get access to MCP tool servers, sandboxed code and browser execution via Azure Container Apps dynamic sessions, and the full 1,400+ connector catalog. Built-in surfaces like chat UI, HTTP chat API, and MCP server endpoint are opt-in with no extra code.

The operational model is exactly what you already know: Flex Consumption for scale-to-zero and per-second billing, managed identity for auth, Application Insights for traces, azd for deployment.

Here's a timer-triggered agent that summarizes the day's tech news and emails it:

    ---
    name: Daily Tech News Email
    description: Fetches top tech news and emails a summary daily.
    trigger:
      type: timer_trigger
      args:
        schedule: "0 0 15 * * *"
    ---
    You are a news assistant. When triggered, do the following:

    1. Scour the web for today's top tech news headlines. Use reputable sources; Include links to the original articles.
    2. Summarize the top stories in a concise, well-formatted HTML email body.
    3. Email the summary to $TO_EMAIL with the subject "Daily Tech News Summary" followed by today's date.

That's the whole function!

Managed connectors (Preview)

→ Full post

Azure Functions now includes the same 1,400+ managed connectors behind Logic Apps and Power Platform as first-class triggers in your Functions code, plus typed SDKs for invoking connector actions from your function body. Built jointly with the Connectors team on the new Connector Namespace service, so connectors feel native to Functions and the library that already powers thousands of Logic Apps workflows is now available to Functions developers.

React to SaaS events with first-class triggers like Office 365 new-email, Teams message-posted, SharePoint item-created, Dataverse row-changed, Salesforce record-updated, calendar events, and more using the [ConnectorTrigger] attribute. Call connector actions from your code via strongly-typed clients like OutlookClient, TeamsClient, Office365UsersClient, DataverseClient, and SalesforceClient. 

    public class ProcessEmail(TeamsClient teams)
    {
        [Function("OnNewEmail")]
        public async Task Run([ConnectorTrigger] Office365OnNewEmailTriggerPayload payload)
        {
            foreach (var email in payload.Body?.Value ?? [])
            {
                await teams.PostMessageToConversationAsync("Flow bot", "Channel", new PostMessageRequest
                {
                    Recipient = new() { GroupId = _teamId, ChannelId = _channelId },
                    MessageBody = $"<b>New email</b> from {email.From}: {email.Subject}"
                });
            }
        }
    }

MCP updates

→ Full MCP extension post

The Azure Functions MCP extension now covers all the MCP primitives: tool, resource, and prompt triggers are supported in .NET, Java, Python, TypeScript, and JavaScript. The extension also supports MCP Apps for interactive UI, where your tools can return rendered widgets instead of plain text.

And for .NET developers, a new fluent builder API makes it easier to compose MCP servers by chaining tool and resource definitions in a declarative style:

    builder.ConfigureMcpTool("sayhello")
        .WithProperty("name", McpToolPropertyType.String, "Name of the user", required: true)
        .WithMetadata("ui", new { resourceUri = "ui://index.html" });

Finally, built-in MCP authentication now offers a one-click configuration experience in the Azure portal, and a new AI tab in your function app lets you enable MCP auth without manual app registration or wiring.

New Azure Functions CLI (Preview)

V5 is here: a ground-up, next-gen build of the Azure Functions CLI. Now in public preview, this release gives local Functions development a refresh.

Configuration profiles let you define your deployment targets up front, so func init can scaffold a project with full‑fidelity host settings in a single command. That means no more surprises when you deploy, earlier access to new platform capabilities, and improved reliability across environments.

New func setup preps your machine for .NET, Node, Python, or Go in one command. The func quickstart command scaffolds complete, ready-to-run apps from a curated catalog. And a new interactive func run dashboard gives you a live TTY UI with a function browser, log navigation, and keyboard shortcuts.

Existing func workflows for create, run, publish, and deploy carry forward unchanged, so you can try v5 alongside your current projects. Give it a spin and let us know what you think. Full command reference: Azure Functions local runtime and tools reference (v5)

Azure Functions VS Code Template Gallery (Preview)

The latest version of the Azure Functions extension for VS Code introduces a new Template Gallery, giving you single-click access to complete, ready-to-deploy templates. The gallery is hand-curated and maintained by the Functions team to keep every template aligned with the latest releases and best practices, including Azure Developer CLI (AZD) enablement and recommended settings. It already covers the majority of supported languages and triggers, and will continue to expand with the newest Azure Functions features. The same templates are available across both VS Code and the new Functions CLI (func quickstart).

 

Go language support (Preview)

→ Full post

Azure Functions now supports Go as a first-class language, available on Flex Consumption. The programming model is code-first and idiomatic: HTTP handlers are plain http.HandlerFunc, non-HTTP triggers take a context.Context and a typed payload, and the project layout is a standard Go module. The go build, go test, and go mod tidy commands just work.

    package main

    import (
        "fmt"
        "net/http"

        "github.com/azure/azure-functions-golang-worker/sdk"
        "github.com/azure/azure-functions-golang-worker/worker"
    )

    func main() {
        app := sdk.FunctionApp()
        app.HTTP("hello", hello,
            sdk.WithMethods("GET", "POST"),
            sdk.WithAuth("anonymous"),
        )
        worker.Start(app)
    }

    func hello(w http.ResponseWriter, r *http.Request) {
        name := r.URL.Query().Get("name")
        if name == "" {
            name = "world"
        }
        fmt.Fprintf(w, "Hello, %s!", name)
    }

Triggers in preview: HTTP, Timer, Service Bus, Event Hubs, Event Grid, Cosmos DB, and Blob Storage. No function.json, no interop shims, no generated metadata to keep in sync.

On-demand Sandboxes for Durable Task Scheduler (Private Preview)

→ Full post 

Move individual orchestration steps to managed, isolated compute while your orchestrator stays exactly where it is. Declare which activities should run as serverless, point at a container image, and DTS handles provisioning, scaling, and teardown. No infrastructure to manage, no idle costs, no orchestrator changes.

Each execution runs in a clean, microVM-backed sandbox with per-activity or per-invocation isolation, ideal for native toolchains (ffmpeg, LibreOffice, Pandoc), CPU-heavy preprocessing (OCR, image work), cross-runtime steps (a Python inference activity called from a .NET orchestrator), sandboxed execution of customer plugins or LLM-generated code, and bursty workloads that can't justify always-on infrastructure.

Sign up for On-demand Sandboxes Private Preview Today →

Azure Functions Skills for coding agents (Preview)

→ Full post

Bring Azure Functions expertise to your coding agent. Azure Functions Skills equips GitHub Copilot CLI, Claude Code, and Codex with Functions-specific knowledge like trigger and binding patterns, language anti-patterns, runtime versions, and deployment best practices, so your agent gives accurate guidance instead of generic advice. One command installs guided workflows to create, deploy, diagnose, and review Functions apps. The standout is the doctor command: it uses LLM-powered semantic analysis to catch configuration mistakes and code issues like missing error handling, blocking I/O, hardcoded secrets, durable-orchestrator non-determinism, and supply-chain risks before you deploy, available as both a local CLI command and a GitHub Actions pre-deploy gate.

Try it now! npx @azure/functions-skills install

Built-in Grafana dashboards (Generally available)

Every function app now has a single pane of glass for operations with zero setup. A new Grafana dashboards entry in the function app's portal TOC opens a prebuilt dashboard purpose-built for Functions: execution count, success/failure rates, p50/p95/p99 duration, resource utilization, scale activity, and recent errors linked to Application Insights logs all in one view, scoped to your app. It's powered by Azure Monitor managed Grafana, so there's nothing to provision, wire up, or pay extra for. Duplicate and customize it to make it your own, save it to your subscription, and share it with your team.

 

Start using built-in Grafana Dashboards today!

TLS/SSL certificate support for Flex Consumption (Preview)

Azure Functions Flex Consumption now supports TLS/SSL certificates through a new site-scoped certificate model in public preview. Each function app can hold up to 3 private (.pfx) and 3 public (.cer) certificates. Certificates can be uploaded directly, imported from Azure Key Vault, or issued as free App Service Managed Certificates, enabling custom domains, client-certificate authentication, and mutual TLS scenarios on Flex Consumption.

See Configure site-scoped certificates, infrastructure as code instructions, and the cross-plan certificate comparison for details.

Rolling Updates for Flex Consumption (Generally Available)

Rolling updates are now generally available in the Flex Consumption plan, delivering zero-downtime deployments with a simple configuration change.

Instead of forcefully restarting all instances during code or configuration updates, the platform gracefully replaces live instances by draining batches every few seconds while dynamically scaling out the latest version to meet demand. This approach ensures uninterrupted execution and resilient throughput across HTTP, non-HTTP, and Durable workloads - even during intensive scale-out scenarios.

Learn more at Site update strategies in Flex Consumption.

OS-level dependencies with containers on Flex Consumption (coming soon)

Bring your own OS-level dependencies to Flex Consumption without giving up serverless. Package your Functions worker and app code as a container image with a standard Dockerfile (Chromium for Playwright, native toolchains, custom system libraries, whatever your app needs) and run it on the Flex Consumption plan. You get the things that make Flex valuable: dynamic, event-driven scaling across all triggers and the pay-per-execution billing model. This is expected in the next couple of months. 

Sign up to get early access and updates →

How to engage

Everything announced this week is being actively shaped by real workloads. We want to hear from you.

The agent optimization loop and how we built it in Foundry

Improving agent quality at scale is one of the hardest operational problems teams face once agents are running in production. We’ve been working to close the gap between seeing what’s wrong and shipping a better version without breaking everything else. This post explains the thinking behind a new optimization loop for agents, what we learned building it, and how you can run it today.

From craft and intuition to traces, evals, and a quality conundrum

If you’re building production agents, you’ve probably walked a version of this path:

You started with prompt engineering. Wrote the system instruction, iterated on it, got the agent to mostly work. You and the model, in a tight feedback loop of “try it, read the output, tweak the prompt.” This phase is craft. It’s intuition-driven, and it gets you surprisingly far.

Then you added traces: OpenTelemetry, App Insights, whatever your stack uses. This is good engineering practice, but it’s also necessary. You couldn’t understand what the agent was actually doing without them. Now you can see the reasoning chain, the tool calls, the decisions. You have visibility.

Then came evaluation. At first, it was vibes: reading traces, gut-checking whether the output felt right. Over time you got more rigorous. You defined metrics, set guardrails, and established quality bars. Maybe you built a scoring rubric across multiple dimensions (policy compliance, cost-awareness, escalation accuracy). Now you can measure quality, not just feel it. You know your pass rate. You know which scenarios break.

And then you hit the wall.

Let’s say you ship a travel-approver agent. It calls three tools: lookup_travel_policy, check_department_budget, and get_flight_alternatives. It returns an approve, deny, or escalate decision. The first week looks clean. Then finance flags a $4,800 trip that was approved without VP sign-off. You pull the trace: The tools ran, the loop completed, and the output was confident. The agent just never called the budget-check tool.

[Figure: CLI trace of a failed run showing the travel-approver agent completed without calling the budget-check tool]

You find the gap in the instruction, so you add a rule about cost thresholds. Re-run the eval. That case passes. But now the emergency-travel override that used to work flawlessly starts escalating everything. You try a different wording. The emergency case recovers. Two other scenarios regress.

You have the traces. You have the evals. You can see exactly what’s wrong and measure exactly how wrong it is. But fixing it without breaking something else? That’s where you’re stuck.

The data you painstakingly collected just sits there while you manually guess-and-check your way through configuration changes.

And it compounds. This might be tolerable for one agent. But if you’re operating five, 10, 20 agents across different domains, each with their own failure modes, their own evaluators, their own regression risks, the manual loop becomes untenable. You can’t individually nurse each agent through prompt revisions and hope nothing else breaks.

You’re not debugging anymore. You’re searching. And you’re doing it without a map.

Reframing the problem

Most teams treat agent improvement like debugging: Find the broken thing, then fix it. But an agent that skips a budget check isn’t “broken” the way a null pointer validation is broken. Its instruction just doesn’t encode enough constraint for that scenario. There are dozens of possible instruction variants that might fix it, and most of them regress something else.

In traditional software, when a test fails, you know what to fix. The stack trace points at a function. You patch the function, run the test suite, then confirm nothing else broke. With agents, quality failures could live in any of a dozen places: the system instruction, the model, a tool description, a skill definition. There’s no stack trace pointing at the broken line. The problem could be in any of those places, or several at once, and you can’t isolate it the way you’d isolate a bug.

But here’s what you might not have noticed: You already have almost everything you need. Your traces contain the failure signal. Your evaluators contain the quality definition. What’s missing is the loop that connects them. The loop that goes from “I see what’s broken” to “here’s a better configuration, scored against everything, ready to ship.”

We built a system that does for agent configurations what your CI pipeline does for code: Propose a change, score it against the full evaluation suite, and only promote it if quality holds across the board.

If you’ve done hyperparameter tuning, this will feel familiar. The optimizer explores a configuration space the same way a sweep explores learning rates and architectures. The difference is that the search dimensions are instructions, skills, tool definitions, and model selection instead of numeric parameters.
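The "only promote it if quality holds across the board" rule can be sketched as a small gate over per-dimension scores. This is an illustrative sketch and not the optimizer's actual logic: the function names, the tolerance parameter, and the scores are all assumptions.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> dict[str, float]:
    """Return the score change for every dimension that dropped by more than tolerance."""
    return {
        dim: candidate.get(dim, 0.0) - score
        for dim, score in baseline.items()
        if candidate.get(dim, 0.0) < score - tolerance
    }


def should_promote(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """Promote only if some dimension improves and none regresses."""
    improved = any(candidate.get(dim, 0.0) > score for dim, score in baseline.items())
    return improved and not regressions(baseline, candidate, tolerance)


# Hypothetical scores for illustration only.
baseline = {"policy_compliance": 0.75, "cost_awareness": 0.5, "escalation_accuracy": 0.75}
fixes_cost_breaks_escalation = {"policy_compliance": 0.75, "cost_awareness": 0.875, "escalation_accuracy": 0.5}
fixes_cost_only = {"policy_compliance": 0.75, "cost_awareness": 0.875, "escalation_accuracy": 0.75}

print(should_promote(baseline, fixes_cost_breaks_escalation))
print(should_promote(baseline, fixes_cost_only))
```

The first candidate mirrors the "fix one thing, break another" pattern described earlier and is rejected; the second improves one dimension without lowering any other and passes the gate.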

The optimization loop

[Figure: The four-step agent optimization loop: generate candidates, score and rank, developer review, ship the winner]

You already have the pieces:

  • An agent running in production (model, instructions, skills, tools)
  • Evaluators that score quality across multiple dimensions
  • Traces from real usage

The optimizer takes all three as input and runs a four-step loop. Each step is something you’d otherwise grind through manually; the system handles the heavy lifting.

1. The optimizer generates candidates. It searches across instructions, models, skills, and tool definitions. These aren’t random mutations. A reflector model reads traces from your evaluations, identifies why the agent scored poorly, and proposes targeted changes (more on the reflector shortly—it turned out to be the most important piece of the puzzle).

2. Candidates are scored and ranked. Same evaluators, same dataset, deterministic comparison. Every candidate is measured against the same bar your baseline was. Per-dimension scoring (policy compliance, cost-awareness, routing accuracy) means you can see exactly what improved and what regressed.

[Figure: Per-dimension evaluator rubric scoring each candidate configuration against the baseline]

3. A developer reviews and decides. The loop isn’t completely autonomous. You look at what changed, why the optimizer proposed it, and whether the improvement is real. If it doesn’t look right, you reject and re-run (optionally with updated evaluators or a different search configuration). If it passes your judgment, you approve. This is deliberate. Automation without oversight compounds errors.

4. The winner ships as the next version. Versioned, reversible, auditable. This updates your agent’s configuration: same model, same tools, better instructions. If the new version underperforms in production, you roll back.
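The four steps above can be sketched as a short Python skeleton. It is a sketch of the loop's shape under stated assumptions, not the Foundry implementation: `propose` stands in for the reflector, `evaluate` for your evaluators, `approve` for the developer review, and the toy functions below are hypothetical stubs that only make the example runnable.

```python
from typing import Callable

Config = dict[str, str]    # e.g. {"instructions": "..."}
Scores = dict[str, float]  # one score per evaluator dimension


def overall(scores: Scores) -> float:
    """Collapse per-dimension scores into one ranking key (plain mean)."""
    return sum(scores.values()) / len(scores)


def optimize(baseline: Config,
             propose: Callable[[Config, Scores], list[Config]],
             evaluate: Callable[[Config], Scores],
             approve: Callable[[Config, Scores], bool]) -> Config:
    baseline_scores = evaluate(baseline)
    # Step 1: generate candidates from the baseline and its scores.
    candidates = propose(baseline, baseline_scores)
    # Step 2: score every candidate with the same evaluators, best first.
    scored = [(candidate, evaluate(candidate)) for candidate in candidates]
    scored.sort(key=lambda pair: overall(pair[1]), reverse=True)
    for candidate, scores in scored:
        if overall(scores) <= overall(baseline_scores):
            break  # remaining candidates are no better than the baseline
        # Step 3: a developer reviews and decides.
        if approve(candidate, scores):
            # Step 4: the approved candidate becomes the next version.
            return candidate
    return baseline


# Hypothetical stubs for illustration only.
def toy_evaluate(config: Config) -> Scores:
    text = config["instructions"]
    return {
        "cost_awareness": 1.0 if "budget" in text else 0.0,
        "escalation_accuracy": 1.0 if "escalate" in text else 0.0,
    }


def toy_propose(config: Config, scores: Scores) -> list[Config]:
    base = config["instructions"]
    return [
        {**config, "instructions": base + " Always check the department budget."},
        {**config, "instructions": base + " Always check the department budget and escalate trips over the threshold."},
    ]


baseline = {"instructions": "Decide on travel requests."}
winner = optimize(baseline, toy_propose, toy_evaluate, approve=lambda c, s: True)
print(winner["instructions"])
```

If the reviewer rejects every candidate, the function returns the baseline unchanged, which matches the post's point that nothing ships without a human decision.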

After shipping, production telemetry accumulates: user feedback, reviewer overrides, scenarios your eval set didn’t cover. This signal doesn’t flow directly into the optimizer. It flows into you: your decision to update evaluators, add new test cases, and trigger another optimization run. The optimizer works from your evaluations; production tells you what to measure next.

There’s more to say about how the optimizer explores this space internally: the search techniques, the tradeoffs, how the reflector generates hypotheses. That’s beyond what we can cover here. But one finding from inside the optimizer is worth pulling out.

What actually moves the needle

The optimizer isn’t just randomly mutating prompts. The central piece is a reflector: a separate model whose only job is to read failing traces and reason about why the agent scored poorly. It then proposes targeted edits for the next round.

Here’s what we found: The quality of that reflector, the model doing the diagnosis, has a disproportionate impact on outcomes. More so than the agent’s own model. More so than tuning other parameters in the search. This held across multiple agent types and domains.

What does that mean concretely? Swapping to a stronger reflector model improved optimization results more than any other single change we could make. The agent could be running gpt-4o or gpt-4.1-mini. It didn’t matter as much as having a reflector that could clearly reason about why something went wrong and what to change about it.

And here’s the implication for how you invest: The meta-cognition layer, the ability to reason about failures, matters more than anything else. Better diagnosis beats better execution. If you’re going to invest in one capability, invest in the quality of your failure analysis.

The engineering behind the reflector (how it reads traces, generates hypotheses, and avoids local optima) is its own story.

The travel-approver: A concrete run

Let’s go back to our earlier travel-approver agent example. Here’s what one optimization run might produce:

[Figure: Optimization run results for the travel-approver agent showing the winning system-prompt rewrite and per-dimension score gains]

The winning candidate was a system-prompt rewrite. Same model, same tools, same skills. Just a better instruction. The optimizer added an explicit cost-threshold rule and an escalation ladder that the baseline lacked.

The $4,800 trip that started this story? The optimized agent calls the budget check, sees the amount exceeds the $3,000 threshold, and routes to VP review. Same scenario, different outcome. The instruction now encodes the constraint explicitly.
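An evaluator for this scenario could check the trace directly. The sketch below is an assumption about what such a check might look like, not the evaluator the team used: the `Run` record and the dimension names are hypothetical, and it treats "routes to VP review" as the escalate decision. The tool names and the $3,000 threshold come from the example above.

```python
from dataclasses import dataclass

VP_REVIEW_THRESHOLD_USD = 3_000  # threshold named in the example


@dataclass
class Run:
    """Hypothetical summary of one agent run, extracted from its trace."""
    amount_usd: float
    tool_calls: list[str]
    decision: str  # "approve", "deny", or "escalate"


def evaluate_run(run: Run) -> dict[str, bool]:
    """Score one run on two pass/fail dimensions."""
    needs_vp_review = run.amount_usd > VP_REVIEW_THRESHOLD_USD
    return {
        # The original failure: the budget tool was never called.
        "budget_checked": "check_department_budget" in run.tool_calls,
        # Trips over the threshold must be escalated; others pass this check.
        "escalated_when_required": run.decision == "escalate" if needs_vp_review else True,
    }


before = Run(4800, ["lookup_travel_policy", "get_flight_alternatives"], "approve")
after = Run(4800, ["lookup_travel_policy", "check_department_budget", "get_flight_alternatives"], "escalate")

print(evaluate_run(before))
print(evaluate_run(after))
```

The baseline run fails both checks and the optimized run passes both, which is the kind of per-dimension signal the loop needs to rank candidates.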

When to use this loop, and when to skip it

This loop works better in specific situations. Here’s how to know if it fits yours.

It’s a good fit when:

  • You have an agent in production with traces and evaluation data
  • Quality issues are cross-cutting: Fixing one thing breaks others
  • You’re operating at scale, across multiple agents or ongoing iteration cycles
  • The failure mode is at the configuration level: instructions, skills, tool definitions, model selection

It’s probably not the right tool when:

  • Your agent is still in early development and you haven’t earned enough traces yet (manual approaches like prompt engineering are still a good path forward)
  • The problem is infrastructure: context window too small, tools return bad data, latency
  • You have one agent with one failure mode—in that case, just fix it manually
  • The task is reasoning-bound (competition math, deep logic chains)—here, you need a model upgrade, not instruction optimization

Key takeaways

Here are the four things we’d carry to any system doing this kind of work:

  1. Quality is a search problem, not a debugging problem. Define what good looks like, search the configuration space, and rank what works. Stop trying to fix one case at a time.
  2. Invest in diagnosis. The reflector (the model that reasons about why things went wrong) has more impact than any other single lever. Better failure analysis beats better execution.
  3. Evaluators are the ceiling. Your optimization is only as good as your quality definition. Start with generated approximations, refine with real data. The first version is never the last.
  4. Keep the human in the loop. The optimizer proposes; the developer decides. Automation without oversight compounds errors.

How we built this in Microsoft Foundry

We packaged this loop into Agent Optimizer inside of Foundry Agent Service, available today through the azd CLI.

Here’s what the travel-approver run looks like from your terminal:

    azd ai agent eval init                        # generate dataset & evaluator from a one-paragraph description
    azd ai agent eval run                         # score the current version (baseline)
    azd ai agent optimize                         # search over candidates
    azd ai agent optimize apply --candidate <id>  # apply the winner locally
    azd deploy                                    # ship as the next version

Five commands. The complexity lives in the optimizer, not in your workflow.

[Figure: Foundry evaluation results view summarizing the optimized agent's scores across dimensions]

The system handles candidate generation, scoring, ranking, and version management. You handle the decision: approve, reject, or adjust your evaluators and run again.

On getting started without evaluation data: The system includes AI-assisted dataset and evaluator generation based on your agent’s configuration and traces. You describe what the agent should do in a paragraph, and eval init generates a multi-dimension evaluator using the traces if available. This makes it easier to bootstrap. The closer your eval data is to real user scenarios and real edge cases, the higher the quality ceiling. Your evaluators are the ceiling on optimization quality. If they can’t distinguish good from bad, the candidates are noise.

We’ve also seen cases where the reflector proposes a fix that passes the eval but introduces regressions on inputs not in the eval set. That’s why the human gate exists. The loop isn’t fully hands-off. You still need someone looking at the candidates before they ship.

What we’re exploring next

The loop as described is agent-level: one agent, one set of instructions, one optimization pass. Two directions we’re actively building toward:

Reducing deployment risk. Right now, shipping a candidate means replacing what’s in production. Full swap. If your eval set is strong, that works. But eval sets are approximations, and production traffic has a longer tail than any test suite. We’re building A/B-style deployment: Promote a candidate alongside the current version, route a fraction of traffic to it, and compare outcomes against the same evaluators that scored it in the loop. The developer gate doesn’t end at “approve.” It extends into production. Roll forward when evidence accumulates; roll back the moment it doesn’t.
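Routing a fraction of traffic to a candidate is commonly done with a stable hash, so the same request or user always lands on the same version. The sketch below illustrates that general technique only; it is not a description of how Foundry will implement it, and `pick_version` and its parameters are hypothetical.

```python
import hashlib


def pick_version(request_id: str, candidate_fraction: float) -> str:
    """Deterministically assign a request to 'candidate' or 'current'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_fraction else "current"


# Send roughly 10% of traffic to the candidate version.
assignments = [pick_version(f"request-{i}", 0.10) for i in range(10_000)]
share = assignments.count("candidate") / len(assignments)
print(f"candidate share: {share:.3f}")
```

Because the assignment depends only on the identifier, rolling back is a matter of setting the fraction to 0, and rolling forward means raising it as evidence accumulates.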

Widening the search space. Today, the optimizer searches over instructions, skills, tool definitions, and model selection. That covers most failure modes. But sometimes the bottleneck is upstream of the agent itself: retrieval settings that return noise, knowledge gaps no instruction can fix, or tool sets that don’t match the task. We’re integrating Foundry IQ (managed knowledge grounding) and Foundry Toolbox (curated tool sets) as tunable dimensions. The optimizer can then search over retrieval configuration, which knowledge sources to ground on, and how tool sets are composed. Same scoring rubric, wider surface area. You stop running those experiments by hand.

There’s more here we’d like to share, especially as we continue to learn and explore this space. The optimizer’s architecture, the engineering discipline behind it, the edge cases that taught us the most—those are stories worth telling. Stay tuned to Command Line for more.

Try it out

Agent optimizer is in public preview. If your agents are stuck in the cycle of “fix one thing, break two others,” try it out and give us feedback.

The post The agent optimization loop and how we built it in Foundry appeared first on Command Line.
