We spent a long time chasing model upgrades, polishing prompts, and debating orchestration strategies. The gains were visible in offline evals, but they didn't translate into the reliability and outcomes we wanted in production. The real breakthrough came when we started caring much more about what we were adding to the context, when, and in what form. In other words: context engineering.
Every context decision involves tradeoffs: latency, autonomy (how far the agent goes without asking), user oversight, pre-work (retrieve/verify/compute before answering), how the agent decides it has sufficient evidence, and the cost of being wrong. Push on one dimension, and you usually pay for it elsewhere.
This blog is our journey building Azure SRE Agent - a cloud AI agent that takes care of your Azure resources and handles your production incidents autonomously. We'll talk about how we got here, what broke along the way, which context patterns survived contact with production, and what we are doing next to treat context engineering as the primary lever for reliable AI-driven SRE.
Tool Explosion, Under-Reasoned
We started where everyone starts: scoped tools and prescriptive prompts. We didn't trust the model in prod, so we constrained it. Every action got its own tool. Every tool got its own guardrails.
Azure is a sprawling ecosystem - hundreds of services, each with its own APIs, failure modes, and operational quirks. Within 2 weeks, we had 100+ tools and a prompt that read like a policy manual.
The cracks showed fast. User hits an edge case? Add a tool. Tool gets misused? Add guardrails. Guardrails too restrictive? Add exceptions. The backlog grew faster than we could close it.
Worse, the agent couldn't generalize. It was competent at the scenarios we'd already encoded and brittle everywhere else. We hadn't built an agent - we'd built a workflow with an LLM stapled on.
Insight #1: If you don't trust the model to reason, you'll build brittle workflows instead of an agent.
Wide tools beat many tools
Our first real breakthrough came from asking a different question: what if, instead of 100 narrow tools, we gave the model two wide ones?
We introduced `az` and `kubectl` CLI commands as first-class tools. These aren't "tools" in the traditional sense - they're entire command-line ecosystems. But from the model's perspective, they're just two entries: "execute this Azure CLI command" and "execute this Kubernetes command."
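To make that concrete, here is a minimal sketch of what a wide-tool setup can look like; the schema shape and the `run_az`/`run_kubectl` names are illustrative rather than our actual implementation, and a real deployment still needs allow-lists and read-only defaults around the executor.

```python
import subprocess

# Two wide tool entries cover the entire az/kubectl surface area (illustrative schema).
WIDE_TOOLS = [
    {
        "name": "run_az",
        "description": "Execute an Azure CLI command, e.g. 'az webapp list --output json'.",
        "parameters": {"command": {"type": "string", "description": "Full az command line"}},
    },
    {
        "name": "run_kubectl",
        "description": "Execute a kubectl command, e.g. 'kubectl get pods -n prod'.",
        "parameters": {"command": {"type": "string", "description": "Full kubectl command line"}},
    },
]

def run_cli(command: str, timeout: int = 60) -> str:
    """Run the command the model produced and return its output for the next turn."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=timeout)
    return result.stdout or result.stderr
```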
The impact was immediate:
- Context compression: Two tools instead of hundreds. Massive headroom recovered.
- Capability expansion: The model now had access to the entire az/kubectl surface area, not just the subset we had wrapped.
- Better reasoning: LLMs already "know" these CLIs from training data. By hiding them behind custom abstractions, we were fighting their priors.
This was our first hint of a deeper principle:
Insight #2: Don't fight the model's existing knowledge - lean on it.
Multi-Agent Architectures: Promise, Pain, and the Pivot
Encouraged by the success of generic tools, we went further and built a full multi-agent system with handoffs. A "handoff" meant one agent explicitly transferring control - along with the running context and intermediate results - to another agent.
Human teams are organized by specialty, so we mirrored that structure: specialized sub-agents with focused personas, each owning one Azure service, handing off when investigations crossed boundaries.
The theory was elegant: lazy tool loading.
- The orchestrator knows about sub-agents, not individual tools.
- User asks about Kubernetes? Hand off to the K8s agent.
- Networking question? Route to the networking agent.
- Each agent loads only its own tools. Context stays lean.
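As a rough sketch of that routing model (all names hypothetical), the orchestrator's context lists sub-agents rather than tools, and a handoff carries the running context to the next specialist:

```python
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    """Hypothetical sub-agent: one persona plus the tools for one Azure service."""
    name: str
    system_prompt: str
    tools: list = field(default_factory=list)
    handoff_targets: list = field(default_factory=list)  # sub-agents it can reach directly

# The orchestrator only sees this registry of sub-agents, never the underlying tools.
REGISTRY = {
    "k8s": SubAgent("k8s", "You are an AKS expert...", tools=["kubectl"], handoff_targets=["networking"]),
    "networking": SubAgent("networking", "You are a virtual-network expert...", tools=["az network"]),
}

def handoff(context: dict, target_name: str) -> SubAgent:
    """Transfer control, the running context, and intermediate results to another agent."""
    context.setdefault("handoff_chain", []).append(target_name)
    return REGISTRY[target_name]
```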
It worked beautifully at small scale. Then we grew to 50+ sub-agents and it fell apart.
The results showed a bimodal distribution: when handoffs worked, everything worked; when they didn't, the agent got lost. We saw a clear cliff - problems requiring more than four handoffs almost always failed.
The following patterns emerged:
- Discovery problems.
Each sub-agent only knew the sub-agents it could directly call. Users would ask reasonable questions and get "I don't know how to help with that" - not because the capability didn't exist, but because the orchestrator didn't know that the right sub-agent was buried three hops away.
- System prompt fragility.
Each sub-agent has its own system prompt. A poorly tuned sub-agent doesn't just fail locally - its conflicting instructions affect the entire reasoning chain. The orchestrator's context gets polluted with confused intermediate outputs, and suddenly nothing works. One bad agent drags down the whole interaction, and we had over 50 sub-agents at this point.
- Infinite loops.
In the worst cases, agents started bouncing work around without making progress. The orchestrator would call a sub-agent, which would defer back to the orchestrator or another sub-agent, and so on. From the user's perspective, nothing moved forward; under the hood, we were burning tokens and latency on a "you handle it / no, you handle it" loop. Hop limits and loop detection (see the sketch after this list) helped, but they also undercut the original clean design.
- Tunnel vision.
Human experts have overlapping domains - a Kubernetes engineer knows enough networking to suspect a route issue, enough about storage to rule it out. This overlap makes human handoffs intelligent. Our agents had hard boundaries. They either surrendered prematurely or developed tunnel vision, chasing symptoms in their domain while the root cause sat elsewhere.
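For reference, the loop guards in the infinite-loops case were simple; here is a minimal sketch of the kind of hop limit and ping-pong detection we mean (thresholds illustrative):

```python
MAX_HOPS = 4  # beyond this depth, investigations almost always failed for us

def should_abort(handoff_chain: list[str]) -> bool:
    """Stop routing when the handoff chain gets too deep or starts repeating."""
    if len(handoff_chain) > MAX_HOPS:
        return True
    # "You handle it / no, you handle it": the same pair keeps trading the problem.
    if len(handoff_chain) >= 4 and handoff_chain[-4:-2] == handoff_chain[-2:]:
        return True
    return False
```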
Insight #3: Multi-agent systems are hard to scale - coordination is the real work.
The failures revealed a familiar pattern. With narrow tools, we'd constrained what the model could do - and paid in coverage gaps. With domain-scoped agents, we'd constrained what it could explore - and paid in coordination overhead. Same overcorrection, different layer.
The fix was to collapse dozens of specialists into a small set of generalists. This was only possible because we already had generic tools. We also moved domain knowledge from system prompts into files the agents could read on demand (which later morphed into an agent-skills capability, inspired by Anthropic).
Our system evolved: fewer agents, broader tools, and on-demand knowledge replaced brittle routing and rigid boundaries. Reliability improved as we stopped depending on the handoff roulette.
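A minimal sketch of the on-demand knowledge pattern; the directory layout and function names are assumptions for illustration, not the agent's actual file format:

```python
from pathlib import Path

# e.g. knowledge/aks/dns-failures.md, knowledge/app-service/cold-starts.md (hypothetical layout)
KNOWLEDGE_DIR = Path("knowledge")

def list_topics() -> list[str]:
    """A cheap index the agent scans instead of carrying every domain in its system prompt."""
    return sorted(str(p.relative_to(KNOWLEDGE_DIR)) for p in KNOWLEDGE_DIR.rglob("*.md"))

def read_topic(relative_path: str) -> str:
    """Pull one knowledge file into context only when the investigation needs it."""
    return (KNOWLEDGE_DIR / relative_path).read_text()
```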
Insight #4: Invest context budget in capabilities, not constraints.
A Real Example: The Agent Debugging Itself
Case in point: Our own Azure OpenAI infrastructure deployment started failing. We asked the SRE agent to debug it.
Without any predefined workflow, it checked deployment logs, spotted a quota error, queried our subscription limits, found the correct support request category, and filed a ticket with the support team. The next morning, we had an email confirming our quota increase.
Our old architecture couldn't have done this - we had no Cognitive Services sub-agent, no support request tool. But with `az` as a wide tool and cross-domain knowledge, the model could navigate Azure's surface area the same way a human would.
This is what we mean by capability expansion. We never anticipated this scenario. With generalist agents and wide tools, we didn't need to.
Context Management Techniques for Deep Agents
After consolidating tools and agents, we focused on context management for long-running conversations.
1. The Code Interpreter Revelation
Consider metrics analysis. We started with the naive approach: dump all metrics into the context window and ask the model to find anomalies.
This was backwards. We were taking deterministic, structured data and pushing it through a probabilistic system. We were asking an LLM to do what a single pandas one-liner could do. We ended up paying in tokens, latency, and accuracy (models don't like zero-valued metrics).
Worse, it kind of worked. For short windows. For simple queries. Just enough success to hide how fundamentally wrong the approach was. Classic "works in demo, fails in prod."
The fix was obvious in hindsight: let the model write code.
- Don't send 50K tokens of metrics into the context.
- Send the metrics to a code interpreter.
- Let the model write the pandas/numpy analysis.
- Execute it. Return only the results and their analysis.
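For example, instead of pasting the raw series into the prompt, the model can emit something like this for the interpreter to run (the file name and column names are illustrative):

```python
import pandas as pd

# metrics.csv: timestamp,value - exported by the metrics tool, never pasted into context.
df = pd.read_csv("metrics.csv", parse_dates=["timestamp"]).set_index("timestamp").sort_index()

# Flag points more than 3 standard deviations from a 1-hour rolling mean.
rolling = df["value"].rolling("1h")
anomalies = df[(df["value"] - rolling.mean()).abs() > 3 * rolling.std()]

# Only this small summary returns to the model's context.
print(anomalies.head(20).to_string())
print(f"{len(anomalies)} anomalous points out of {len(df)}")
```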
Metrics analysis had been our biggest source of tool failures. After this change: zero failures. And because we weren't paying the token tax anymore, we could extend time ranges by an order of magnitude.
Insight #5: LLMs are orchestrators, not calculators.
Use them to decide what computation to run, then let actual code perform the computation.
2. Planning and Compaction
We also added two other patterns: a todo-style planner and more aggressive compaction.
- Todo planner: Represent the plan as an explicit checklist outside the model's context, and let the model update it instead of re-deriving the workflow on every turn.
- Compaction: Continuously shrink history into summaries and structured state (e.g., key incident facts), so the context stays a small working set rather than an ever-growing log.
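A minimal sketch of both patterns, with the plan and the compacted state living outside the prompt (the structure is illustrative):

```python
import json
from pathlib import Path

PLAN_FILE = Path("plan.json")

def update_plan(steps: list[dict]) -> None:
    """The model edits this checklist instead of re-deriving the workflow every turn.
    Each step looks like {"task": "...", "status": "todo" | "in_progress" | "done"}."""
    PLAN_FILE.write_text(json.dumps(steps, indent=2))

def compact(history: list[str], summarize) -> dict:
    """Shrink raw turn history into a small structured working set."""
    return {
        "summary": summarize(history[:-5]),  # e.g. an LLM call over the older turns
        "key_facts": [],                     # key incident facts extracted so far
        "recent_turns": history[-5:],        # keep only the tail verbatim
    }
```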
Insight #6: Externalizing plans and compacting history effectively "stretch" the usable context window.
3. Progressive Disclosure with Files
With code interpretation working, we hit the next wall: tool calls returning absurd amounts of data.
Real example: an internal App Service Control Plane log table against which a user fires off a SELECT *-style query. The table has ~3,000 columns. A single-digit number of log entries expands to 200K+ tokens. The context window is gone. The model chokes. The user gets an error.
Our solution was session-based interception.
Tool calls that can return large payloads never go straight into context. Instead, their output is written as a "file" into a sandboxed environment where the data can be:
- Inspected ("what columns exist?")
- Filtered ("show only the error-related columns")
- Analyzed via code ("find rows where latency > p99")
- Summarized before anything enters the model's context
The model never sees the raw 200K tokens. It sees a reference to a session and a set of tools to interact with that session. We turned an unbounded context explosion into a bounded, interactive exploration. You have seen this with coding agents, and the idea is similar: can the model find its own way through a large amount of data?
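Conceptually, the interception layer looks something like this; the token budget, session layout, and function names are illustrative:

```python
import json
import uuid
from pathlib import Path

SESSION_DIR = Path("sessions")
TOKEN_BUDGET = 4_000  # rough cutoff; anything bigger becomes a file reference

def intercept(tool_output: str) -> str:
    """Return small outputs directly; spill large ones into a session file the model explores with code."""
    if len(tool_output) // 4 <= TOKEN_BUDGET:  # crude chars-to-tokens estimate
        return tool_output
    SESSION_DIR.mkdir(exist_ok=True)
    path = SESSION_DIR / f"{uuid.uuid4().hex[:8]}.json"
    path.write_text(tool_output)
    # The model only ever sees this stub, never the raw 200K tokens.
    return json.dumps({
        "session_file": str(path),
        "note": "Output too large for context. Inspect, filter, and summarize it via the code interpreter.",
    })
```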
Insight #7: Treat large tool outputs as data sources, not context.
4. What's Next: Tool Call Chaining
The next update we're working on is tool call chaining. The idea started with our effort to express Troubleshooting Guides (TSGs) as code.
A lot of agent workflows are predictable: "run this query, fetch these logs, slice this data, summarize the result." Today, we often force the model to walk that path one tool call at a time:
Model → Tool A → Model → Tool B → Model → Tool C → Model → … → Response
The alternative:
Model → [Script: Tool A → Tool B → Tool C → … → Final Output] → Model → Response
The model writes a small script that chains the tools together. The platform executes the script and returns consolidated results. Three roundtrips become one. Context overhead drops by 60-70%.
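A rough sketch of the shape this takes; `run_az` is the hypothetical wide-tool wrapper from earlier, and the script body is what the model writes instead of making three separate tool calls:

```python
# Script authored by the model, executed once by the platform's sandbox.
import json
import subprocess

def run_az(command: str) -> str:
    """Thin wrapper around the wide az tool (hypothetical)."""
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

# Tool A -> Tool B -> summarize, chained deterministically with no model roundtrips in between.
groups = json.loads(run_az("az group list --output json"))
summary = {}
for group in groups[:10]:
    resources = json.loads(run_az(f"az resource list --resource-group {group['name']} --output json"))
    summary[group["name"]] = len(resources)

# Only the consolidated result goes back to the model.
print(json.dumps(summary, indent=2))
```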
This also unlocks something subtle: deterministic workflows inside probabilistic systems. Long-running operations that must happen in a specific order can be encoded as scripts. The model decides what should happen; the script guarantees how it happens. Anthropic recently published a similar capability with Programmatic tool calling.
The Meta Lesson
Six months ago, we thought we were building an SRE agent. In reality, we were building a context engineering system that happens to do Site Reliability Engineering.
Better models are table stakes, but what moved the needle was what we controlled: generalist capabilities and disciplined context management.
Karpathy's analogy holds: if context windows are the agent's "RAM," then context engineering is memory management: what to load, what to compress, what to page out, and what to compute externally. As you fill it up, model quality often drops non-linearly - "lost in the middle," "not adhering to my instructions," and plain old long-context degradation show up well before you hit the advertised limits. More tokens don't just cost latency; they quietly erode accuracy.
We're not done. Most of what we have done is "try it, observe, watch it break, tighten the loop." But the patterns that keep working - wide tools, code execution, context compaction, tool chaining - are the same ones we see rediscovered across other agent stacks. In the end, the throughline is simple: give the model fewer, cleaner choices and spend your effort making the context it sees small, structured, and easy to operate on.