The Challenge: Ops is critical but takes time from innovation
Microsoft operates always-on, mission-critical production systems at extraordinary scale. Thousands of services, millions of deployments, and constant change are the reality of modern cloud engineering. These are titan systems that power organizations around the globe—including our own—with extremely low risk tolerance for downtime. While operations work like incident investigation, response and recovery, and remediation is essential, it’s also disruptive to innovation.
For engineers, operational toil often means being pulled away from feature work to diagnose alerts, sift through logs, correlate metrics across systems, or respond to incidents at any hour. On-call rotations and manual investigations slow teams down and introduce burnout. What's more, in the era of AI, demand for operational excellence has spiked to new heights. It became clear that traditional human-only processes couldn't meet the scale and complexity needs for system maintenance especially in the AI world where code shipping velocity has increased exponentially.
At the same time, we needed to integrate with the AI landscape which continues to evolve at a breakneck pace. New models, new tooling, and new best practices released constantly, fragmenting ecosystems between different platforms for observability, DevOps, incident management, and security. Beyond simply automating tasks, we needed to build an adaptable approach that could integrate with existing systems and improve over time.
Microsoft needed a fundamentally different way to perform operations—one that reduced toil, accelerated response, and gave engineers the time to focus on building great products.
The Solution: How we build Azure SRE Agent using agentic workflows
To address these challenges, Microsoft built Azure SRE Agent, an AI-powered operations agent that serves as an always-on SRE partner for engineers. In practice, Azure SRE Agent continuously observes production environments to detect and investigate incidents. It reasons across signals like logs, metrics, code changes, and other deployment records to perform root cause analysis. It supports engineers from triage to resolution and it’s used in a variety of autonomy levels from assistive investigation to automating remediation proposals. Everything occurs within governance guardrails and human approval checks grounded in role‑based access controls and clear escalation paths. What’s more, Azure SRE Agent learns from past incidents, outcomes, and human feedback to improve over time. But just as important as what was built is how it was built.
Azure SRE Agent was created using the agentic workflow approach—building agents with agents. Rather than treating AI as a bolt-on tool, Microsoft embedded specialized agents across the entire software development lifecycle (SDLC) to collaborate with developers, from planning through operations.
The diagram above outlines the agents used at each stage of development. They come together to form a full lifecycle:
- Plan & Code: Agents support spec‑driven development to unlock faster inner loop cycles for developers and even product managers. With AI, we can not only draft spec documentation that defines feature requirements for UX and software development agents but also create prototypes and check in code to staging which now enables PMs/UX/Engineering to rapidly iterate, generate and improve code even for early-stage merges.
- Verify, Test & Deploy: Agents for code quality review, security, evaluation, and deployment agents work together to shift left on quality and security issues. They also continuously assess reliability, ensure performance, and enforce consistent release best practices.
- Operate & Optimize: Azure SRE Agent handles ongoing operational work from investigating alerts, to assisting with remediation, and even resolving some issues autonomously. Moreover, it learns continuously over time and we provide Azure SRE Agent with its own specialized instance of Azure SRE Agent to maintain itself and catalyze feedback loops.
While agents surface insights, propose actions, mitigate issues and suggest long term code or IaC fixes autonomously, humans remain in the loop for oversight, approval, and decision-making when required. This combination of autonomy and governance proved critical for safe operations at scale. We also designed Azure SRE agent to integrate across existing systems. Our team uses custom agents, Model Context Protocol (MCP) and Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, business process and operational tools to add intelligence on top of established workflows rather than replacing them.
Built this way, Azure SRE Agent was not just a new tool but a new operational system. And at Microsoft’s scale, transformative systems lead to transformative outcomes.
The Impact: Reducing toil at enterprise scale
The impact of Azure SRE Agent is felt most clearly in day-to-day operations. By automating investigations and assisting with remediation, the agent reduces burden for on-call engineers and accelerates time to resolution.
Internally at Microsoft in the last nine months, we've seen:
- 35,000+ incidents have been handled autonomously by Azure SRE Agent.
- 50,000+ developer hours have been saved by reducing manual investigation and response work.
- Teams experienced a reduced on-call burden and faster time-to-mitigation during incidents.
To share a couple of specific cases, the Azure Container Apps and Azure App Service product group teams have had tremendous success with Azure SRE Agent. Engineers for Azure Container Apps had overwhelmingly positive (89%) responses to the root cause analysis (RCA) results from Azure SRE agent, covering over 90% of incidents. Meanwhile, Azure App Service has brought their time-to-mitigation for live-site incidents (LSIs) down to 3 minutes, a drastic improvement from the 40.5-hour average with human-only activity.
And this impact is felt within the developer experience. When asked developers about how the agent has changed ops work, one of our engineers had this to say:
“[It’s] been a massive help in dealing with quota requests which were being done manually at first. I can also say with high confidence that there have been quite a few CRIs that the agent was spot on/ gave the right RCA / provided useful clues that helped navigate my initial investigation in the right direction RATHER than me having to spend time exploring all different possibilities before arriving at the correct one. Since the Agent/AI has already explored all different combinations and narrowed it down to the right one, I can pick the investigation up from there and save me countless hours of logs checking.”
- Software Engineer II, Microsoft Engineering
Beyond the impact of the agent itself, the agentic workflow process has also completely redefined how we build.
Key learnings: Agentic workflow process and impact
It's very easy to think of agents as another form of advanced automation, but it's important to understand that Azure SRE agent is also a collaborative tool. Engineers can prompt the agent in their investigations to surface relevant context (logs, metrics, and related code changes) to propose actions far faster and easier than traditional troubleshooting. What’s more, they can also extend it for data analysis and dashboarding. Now engineers can focus on the agent’s findings to approve actions or intervene when necessary. The result is a human-AI partnership that scales operations expertise without sacrificing control.
While the process took time and experimentation to refine, the payoff has been extraordinary; our team is building high-quality features faster than ever since we introduced specialized agents for each stage of the SDLC. While these results were achieved inside Microsoft, the underlying patterns are broadly applicable.
First, building agents with agents is essential to scaling, as manual development quickly became a bottleneck; agents dramatically accelerated inner loop iteration through code generation, review, debugging, security fixes, etc. When it comes to agents, specialization matters, because generic agents plateau quickly. Real impact comes from agents equipped with domain‑specific skills, context, and access to the right tools and data.
Microsoft also learned to integrate deeply with existing systems, embedding agents into established telemetry, workflows, and platforms rather than attempting to replace them. Throughout this process, maintaining tight human‑in‑the‑loop governance proved critical. Autonomy had to be balanced with clear approval boundaries, role‑based access, and safety checks to build trust.
Finally, teams learned to invest in continuous feedback and evaluation, using ongoing measurement to improve agents over time and understand where automation added value versus where human judgment should remain central.
Want to learn more?
Azure SRE Agent is one example of how agentic workflows can transform both product development and operations at scale. Teams at Microsoft are on a mission of leading the industry by example, not just sharing results. We invite you to take the practical learnings from this blog and apply the same principles in your own environments.
