Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

The agent optimization loop and how we built it in Foundry


Improving agent quality at scale is one of the hardest operational problems teams face once agents are running in production. We’ve been working to close the gap between seeing what’s wrong and shipping a better version without breaking everything else. This post explains the thinking behind a new optimization loop for agents, what we learned building it, and how you can run it today.

From craft and intuition to traces, evals, and a quality conundrum

If you’re building production agents, you’ve probably walked a version of this path:

You started with prompt engineering. Wrote the system instruction, iterated on it, got the agent to mostly work. You and the model, in a tight feedback loop of “try it, read the output, tweak the prompt.” This phase is craft. It’s intuition-driven, and it gets you surprisingly far.

Then you added traces: OpenTelemetry, App Insights, whatever your stack uses. This is good engineering practice, but it’s also necessary. You couldn’t understand what the agent was actually doing without them. Now you can see the reasoning chain, the tool calls, the decisions. You have visibility.

Then came evaluation. At first, it was vibes: reading traces, gut-checking whether the output felt right. Over time you got more rigorous. You defined metrics, set guardrails, and established quality bars. Maybe you built a scoring rubric across multiple dimensions (policy compliance, cost-awareness, escalation accuracy). Now you can measure quality, not just feel it. You know your pass rate. You know which scenarios break.

And then you hit the wall.

Let’s say you ship a travel-approver agent. It calls three tools: lookup_travel_policy, check_department_budget, and get_flight_alternatives. It returns an approve, deny, or escalate decision. The first week looks clean. Then finance flags a $4,800 trip that was approved without VP sign-off. You pull the trace: The tools ran, the loop completed, and the output was confident. The agent just never called the budget-check tool.

[Figure: CLI trace of a failed run showing the travel-approver agent completed without calling the budget-check tool]

You find the gap in the instruction, so you add a rule about cost thresholds. Re-run the eval. That case passes. But now the emergency-travel override that used to work flawlessly starts escalating everything. You try a different wording. The emergency case recovers. Two other scenarios regress.

You have the traces. You have the evals. You can see exactly what’s wrong and measure exactly how wrong it is. But fixing it without breaking something else? That’s where you’re stuck.

The data you painstakingly collected just sits there while you manually guess-and-check your way through configuration changes.

And it compounds. This might be tolerable for one agent. But if you’re operating five, 10, 20 agents across different domains, each with their own failure modes, their own evaluators, their own regression risks, the manual loop becomes untenable. You can’t individually nurse each agent through prompt revisions and hope nothing else breaks.

You’re not debugging anymore. You’re searching. And you’re doing it without a map.

Reframing the problem

Most teams treat agent improvement like debugging: Find the broken thing, then fix it. But an agent that skips a budget check isn’t “broken” the way code with a missing null-pointer check is broken. Its instruction just doesn’t encode enough constraint for that scenario. There are dozens of possible instruction variants that might fix it, and most of them regress something else.

In traditional software, when a test fails, you know what to fix. The stack trace points at a function. You patch the function, run the test suite, then confirm nothing else broke. With agents, quality failures could live in any of a dozen places: the system instruction, the model, a tool description, a skill definition. There’s no stack trace pointing at the broken line. The problem could be in any of those places, or several at once, and you can’t isolate it the way you’d isolate a bug.

But here’s what you might not have noticed: You already have almost everything you need. Your traces contain the failure signal. Your evaluators contain the quality definition. What’s missing is the loop that connects them. The loop that goes from “I see what’s broken” to “here’s a better configuration, scored against everything, ready to ship.”


We built a system that does for agent configurations what your CI pipeline does for code: Propose a change, score it against the full evaluation suite, and only promote it if quality holds across the board.
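A minimal sketch of that promotion gate, assuming each evaluation run produces one score per dimension. The function name, the tolerance parameter, and the scores are hypothetical and chosen only for illustration; the dimension names are borrowed from the rubric mentioned earlier.

```python
def should_promote(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """Promote only if no dimension regresses and at least one improves."""
    no_regression = all(candidate[d] >= baseline[d] - tolerance for d in baseline)
    some_gain = any(candidate[d] > baseline[d] for d in baseline)
    return no_regression and some_gain


# Illustrative scores only (not results from any real run).
baseline = {"policy_compliance": 0.70, "cost_awareness": 0.55, "escalation_accuracy": 0.80}
fixes_one_breaks_one = {"policy_compliance": 0.70, "cost_awareness": 0.90, "escalation_accuracy": 0.60}
holds_everywhere = {"policy_compliance": 0.75, "cost_awareness": 0.90, "escalation_accuracy": 0.80}

print(should_promote(baseline, fixes_one_breaks_one))  # False: escalation accuracy regressed
print(should_promote(baseline, holds_everywhere))      # True
```

The gate compares dimensions one by one rather than averaging them, so a large gain in one dimension cannot hide a regression in another.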

If you’ve done hyperparameter tuning, this will feel familiar. The optimizer explores a configuration space the same way a sweep explores learning rates and architectures. The difference is that the search dimensions are instructions, skills, tool definitions, and model selection instead of numeric parameters.

The optimization loop

[Figure: The four-step agent optimization loop: generate candidates, score and rank, developer review, ship the winner]

You already have the pieces:

  • An agent running in production (model, instructions, skills, tools)
  • Evaluators that score quality across multiple dimensions
  • Traces from real usage

The optimizer takes all three as input and runs a four-step loop. Each step is something you’d otherwise grind through manually; the system handles the heavy lifting.

1. The optimizer generates candidates. It searches across instructions, models, skills, and tool definitions. These aren’t random mutations. A reflector model reads traces from your evaluations, identifies why the agent scored poorly, and proposes targeted changes (more on the reflector shortly—it turned out to be the most important piece of the puzzle).

2. Candidates are scored and ranked. Same evaluators, same dataset, deterministic comparison. Every candidate is measured against the same bar your baseline was. Per-dimension scoring (policy compliance, cost-awareness, routing accuracy) means you can see exactly what improved and what regressed.

[Figure: Per-dimension evaluator rubric scoring each candidate configuration against the baseline]

3. A developer reviews and decides. The loop isn’t completely autonomous. You look at what changed, why the optimizer proposed it, and whether the improvement is real. If it doesn’t look right, you reject and re-run (optionally with updated evaluators or a different search configuration). If it passes your judgment, you approve. This is deliberate. Automation without oversight compounds errors.

4. The winner ships as the next version. Versioned, reversible, auditable. This updates your agent’s configuration: same model, same tools, better instructions. If the new version underperforms in production, you roll back.

After shipping, production telemetry accumulates: user feedback, reviewer overrides, scenarios your eval set didn’t cover. This signal doesn’t flow directly into the optimizer. It flows into you: your decision to update evaluators, add new test cases, and trigger another optimization run. The optimizer works from your evaluations; production tells you what to measure next.
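The four steps can be sketched as a single function. This is an illustration under stated assumptions, not the Agent Optimizer implementation: the `propose`, `evaluate`, and `approve` callables are hypothetical stand-ins for the reflector-driven candidate generator, your evaluators, and the developer review, and ranking by mean score is just one possible choice.

```python
from typing import Callable

Config = dict[str, str]
Scores = dict[str, float]


def optimization_round(
    baseline: Config,
    propose: Callable[[Config], list[Config]],            # step 1: generate candidates
    evaluate: Callable[[Config], Scores],                 # step 2: same evaluators for all
    approve: Callable[[Config, Scores, Scores], bool],    # step 3: developer gate
) -> Config:
    """Return the config to ship (step 4), or the baseline if nothing is approved."""
    base_scores = evaluate(baseline)
    scored = [(c, evaluate(c)) for c in propose(baseline)]
    # Rank by mean score across dimensions, best first (an assumed ranking rule).
    scored.sort(key=lambda cs: sum(cs[1].values()) / len(cs[1]), reverse=True)
    for candidate, scores in scored:
        if approve(candidate, base_scores, scores):
            return candidate
    return baseline


# Toy data: scores are invented, keyed by a hypothetical instruction label.
toy_scores = {
    "v1": {"policy": 0.7, "cost": 0.5},
    "v2-threshold-rule": {"policy": 0.7, "cost": 0.9},
    "v2-strict": {"policy": 0.6, "cost": 0.95},
}
winner = optimization_round(
    baseline={"instructions": "v1"},
    propose=lambda cfg: [{"instructions": "v2-strict"}, {"instructions": "v2-threshold-rule"}],
    evaluate=lambda cfg: toy_scores[cfg["instructions"]],
    approve=lambda cfg, base, new: all(new[d] >= base[d] for d in base),
)
print(winner["instructions"])  # v2-threshold-rule
```

Returning the baseline when no candidate is approved mirrors the reject-and-re-run path: nothing ships unless the reviewer says so.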

There’s more to say about how the optimizer explores this space internally: the search techniques, the tradeoffs, how the reflector generates hypotheses. That’s beyond what we can cover here. But one finding from inside the optimizer is worth pulling out.

What actually moves the needle

The optimizer isn’t just randomly mutating prompts. The central piece is a reflector: a separate model whose only job is to read failing traces and reason about why the agent scored poorly. It then proposes targeted edits for the next round.

Here’s what we found: The quality of that reflector, the model doing the diagnosis, has a disproportionate impact on outcomes. More so than the agent’s own model. More so than tuning other parameters in the search. This held across multiple agent types and domains.

What does that mean concretely? Swapping to a stronger reflector model improved optimization results more than any other single change we could make. The agent could be running gpt-4o or gpt-4.1-mini. It didn’t matter as much as having a reflector that could clearly reason about why something went wrong and what to change about it.


And here’s the implication for how you invest: The meta-cognition layer, the ability to reason about failures, matters more than anything else. Better diagnosis beats better execution. If you’re going to invest in one capability, invest in the quality of your failure analysis.

The engineering behind the reflector (how it reads traces, generates hypotheses, and avoids local optima) is its own story.

The travel-approver: A concrete run

Let’s go back to our earlier travel-approver agent example. Here’s what one optimization run might produce:

[Figure: Optimization run results for the travel-approver agent showing the winning system-prompt rewrite and per-dimension score gains]

The winning candidate was a system-prompt rewrite. Same model, same tools, same skills. Just a better instruction. The optimizer added an explicit cost-threshold rule and an escalation ladder that the baseline lacked.

The $4,800 trip that started this story? The optimized agent calls the budget check, sees the amount exceeds the $3,000 threshold, and routes to VP review. Same scenario, different outcome. The instruction now encodes the constraint explicitly.
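A deterministic check for this scenario could look like the sketch below. It assumes an evaluator can see the trip cost, the list of tool calls from the trace, and the final decision; the function name and its argument shape are hypothetical, while the tool names and the $3,000 threshold come from the example above.

```python
VP_REVIEW_THRESHOLD_USD = 3_000  # threshold quoted in the example


def cost_awareness_check(trip_cost_usd: float, tool_calls: list[str], decision: str) -> bool:
    """Pass only if the budget tool ran and over-threshold trips were escalated."""
    if "check_department_budget" not in tool_calls:
        return False
    if trip_cost_usd > VP_REVIEW_THRESHOLD_USD and decision != "escalate":
        return False
    return True


# Baseline behavior: approved without ever calling the budget check.
print(cost_awareness_check(4_800, ["lookup_travel_policy", "get_flight_alternatives"], "approve"))  # False

# Optimized behavior: budget check called, trip routed to VP review.
print(cost_awareness_check(
    4_800,
    ["lookup_travel_policy", "check_department_budget", "get_flight_alternatives"],
    "escalate",
))  # True
```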

When to use this loop, and when to skip it

This loop works better in specific situations. Here’s how to know if it fits yours.

It’s a good fit when:

  • You have an agent in production with traces and evaluation data
  • Quality issues are cross-cutting: Fixing one thing breaks others
  • You’re operating at scale, across multiple agents or ongoing iteration cycles
  • The failure mode is at the configuration level: instructions, skills, tool definitions, model selection

It’s probably not the right tool when:

  • Your agent is still in early development and you haven’t earned enough traces yet (manual approaches like prompt engineering are still a good path forward)
  • The problem is infrastructure: context window too small, tools return bad data, latency
  • You have one agent with one failure mode—in that case, just fix it manually
  • The task is reasoning-bound (competition math, deep logic chains)—here, you need a model upgrade, not instruction optimization

Key takeaways

Here are the four things we’d carry to any system doing this kind of work:

  1. Quality is a search problem, not a debugging problem. Define what good looks like, search the configuration space, and rank what works. Stop trying to fix one case at a time.
  2. Invest in diagnosis. The reflector (the model that reasons about why things went wrong) has more impact than any other single lever. Better failure analysis beats better execution.
  3. Evaluators are the ceiling. Your optimization is only as good as your quality definition. Start with generated approximations, refine with real data. The first version is never the last.
  4. Keep the human in the loop. The optimizer proposes; the developer decides. Automation without oversight compounds errors.

How we built this in Microsoft Foundry

We packaged this loop into Agent Optimizer inside of Foundry Agent Service, available today through the azd CLI.

Here’s what the travel-approver run looks like from your terminal:

    azd ai agent eval init                          # generate dataset & evaluator from a one-paragraph description
    azd ai agent eval run                           # score the current version (baseline)
    azd ai agent optimize                           # search over candidates
    azd ai agent optimize apply --candidate <id>    # apply the winner locally
    azd deploy                                      # ship as the next version

Five commands. The complexity lives in the optimizer, not in your workflow.

[Figure: Foundry evaluation results view summarizing the optimized agent's scores across dimensions]

The system handles candidate generation, scoring, ranking, and version management. You handle the decision: approve, reject, or adjust your evaluators and run again.

On getting started without evaluation data: The system includes AI-assisted dataset and evaluator generation based on your agent’s configuration and traces. You describe what the agent should do in a paragraph, and eval init generates a multi-dimension evaluator, using the traces if they’re available. This makes it easier to bootstrap, but generated evaluators are only a starting point: The closer your eval data is to real user scenarios and real edge cases, the higher the quality ceiling. Your evaluators are the ceiling on optimization quality. If they can’t distinguish good from bad, the candidates are noise.

We’ve also seen cases where the reflector proposes a fix that passes the eval but introduces regressions on inputs not in the eval set. That’s why the human gate exists. The loop isn’t fully hands-off. You still need someone looking at the candidates before they ship.

What we’re exploring next

The loop as described is agent-level: one agent, one set of instructions, one optimization pass. Two directions we’re actively building toward:

Reducing deployment risk. Right now, shipping a candidate means replacing what’s in production. Full swap. If your eval set is strong, that works. But eval sets are approximations, and production traffic has a longer tail than any test suite. We’re building A/B-style deployment: Promote a candidate alongside the current version, route a fraction of traffic to it, and compare outcomes against the same evaluators that scored it in the loop. The developer gate doesn’t end at “approve.” It extends into production. Roll forward when evidence accumulates; roll back the moment it doesn’t.
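One common way to route a fixed fraction of traffic is a deterministic hash split, sketched below. This is an assumption about how such a split could work, not a description of what Foundry does; the function name and the request-ID format are hypothetical.

```python
import hashlib


def route_version(request_id: str, candidate_fraction: float) -> str:
    """Deterministically send a fixed fraction of requests to the candidate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_fraction else "current"


# Roughly 10% of synthetic request IDs land on the candidate.
routes = [route_version(f"req-{i}", 0.10) for i in range(10_000)]
share = routes.count("candidate") / len(routes)
print(f"candidate share: {share:.3f}")
```

Hashing the request ID (or a user or session ID) keeps the assignment stable across retries, which makes per-version outcomes easier to compare, and setting the fraction to 0 acts as an immediate rollback.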

Widening the search space. Today, the optimizer searches over instructions, skills, tool definitions, and model selection. That covers most failure modes. But sometimes the bottleneck is upstream of the agent itself: retrieval settings that return noise, knowledge gaps no instruction can fix, or tool sets that don’t match the task. We’re integrating Foundry IQ (managed knowledge grounding) and Foundry Toolbox (curated tool sets) as tunable dimensions. The optimizer can then search over retrieval configuration, which knowledge sources to ground on, and how tool sets are composed. Same scoring rubric, wider surface area. You stop running those experiments by hand.

There’s more here we’d like to share, especially as we continue to learn and explore this space. The optimizer’s architecture, the engineering discipline behind it, the edge cases that taught us the most—those are stories worth telling. Stay tuned to Command Line for more.

Try it out

Agent optimizer is in public preview. If your agents are stuck in the cycle of “fix one thing, break two others,” try it out and give us feedback.

The post The agent optimization loop and how we built it in Foundry appeared first on Command Line.


Grounding at scale: Engineering the retrieval system for the agentic web


Humans and AI don’t search the same way. As people increasingly turn to chatbots and agents for information, grounding that AI—connecting it to fresh, relevant, and authoritative information—takes on new importance as foundational infrastructure. Microsoft’s grounding layer already powers most of the world’s major AI assistants. And today at Build, we took that work further with Web IQ, a new grounding system for the agentic web. 

Web IQ delivers industry-leading quality, sub-165ms P95 latency (~2.5× faster than the nearest alternative), and token-efficient retrieval, all while respecting publishers’ preferences. The same infrastructure that powers Copilot, ChatGPT, enterprise systems from Nasdaq, and others is now available as a neutral, MCP-native, model-agnostic platform. In this post, we’ll explore the architectural challenge, the Web IQ stack, and how we optimized for speed at scale.

Grounding redefines the optimization problem 

Most discussions of AI systems still start with models. But once those systems are deployed at scale—especially in search, copilots, and agentic workflows—the dominant bottleneck shifts. The central problem becomes grounding: what information reaches the model, how fresh it is, how much context can be included, and how quickly that evidence can be delivered. 

In a grounding system, those requirements collapse into three tightly coupled constraints: latency, quality, and token efficiency. 

In classical search, these dimensions can often be traded off relatively independently: A slower system can still be useful if it returns strong document results, and an imperfect ranking can still succeed if the user can inspect and repair the outcome. Inside an AI inference loop, that decoupling disappears.

In AI search and agentic systems, grounding sits inside the inference loop. Retrieval directly shapes generation, tokens determine both cost and latency, and missing or stale context propagates into reasoning errors rather than degrading gracefully. The optimization target is therefore no longer a ranking function in isolation, but rather a coupled system operating under latency, quality, and token-efficiency constraints. In that setting, grounding goes from a component to a system architecture problem. 

Semantic‑first as a system design principle 

Before describing Web IQ’s architecture, it helps to name the underlying shift more precisely: Large‑scale retrieval is moving from hybrid stacks, where lexical systems dominate first‑stage recall and dense models re-rank, toward semantic‑first systems in which representation learning defines the primary retrieval space. 

That shift is now practical because modern embedding models preserve substantially more of the relevance signal at retrieval time, and ANN infrastructure is mature enough to search that space under production latency constraints. Just as importantly, retrieval is no longer limited to one vector per document. Instead, the effective unit can be a passage, span, or a small set of learned representations that retain finer interaction structure until late in the pipeline. 

  • Content is indexed as semantic representations rather than only lexical postings 
  • Candidate generation operates over neighborhoods in the embedding space, often at passage or sub-document granularity
  • Fine relevance signals can be deferred to later interaction stages instead of being collapsed entirely into a single early score
  • Lexical matching remains useful as a constraint, calibration signal, and fallback for exactness-sensitive cases

Rather than eliminate hybridization, this relocates it. In a semantic‑first stack, dense retrieval becomes the default access path, while later stages recover precision through richer interaction, filtering, calibration, and task-specific refinement. That choice propagates through the system: how content is chunked, how representations are trained, what the ANN index must preserve, and how evidence is assembled for downstream reasoning. 
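A toy illustration of that ordering: dense similarity selects the candidates, and lexical matching is applied afterward as a constraint. Everything here is an assumption made for the example: the two-dimensional vectors are invented, the brute-force scan stands in for an ANN index, and `semantic_first_search` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_first_search(query_vec, passages, k=2, must_contain=None):
    """Dense retrieval is the default path; an optional exact-term filter runs afterward."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    candidates = ranked[:k]
    if must_contain is not None:
        candidates = [p for p in candidates if must_contain.lower() in p["text"].lower()]
    return [p["text"] for p in candidates]


# Invented passages with invented embeddings.
passages = [
    {"text": "DiskANN supports streaming updates", "vec": [0.9, 0.1]},
    {"text": "Graph indexes can absorb inserts without rebuilds", "vec": [0.8, 0.3]},
    {"text": "Travel policy for VP approval", "vec": [0.1, 0.9]},
]
query = [1.0, 0.2]

print(semantic_first_search(query, passages))
# ['DiskANN supports streaming updates', 'Graph indexes can absorb inserts without rebuilds']
print(semantic_first_search(query, passages, must_contain="DiskANN"))
# ['DiskANN supports streaming updates']
```

The second passage shares no keywords with a query about DiskANN but is still retrieved by proximity in the embedding space; the lexical filter only narrows the result when exactness is requested.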

This direction has been visible inside Bing for years: shift more of the retrieval quality into learned representations, reduce dependence on head-query interaction logs, and expose content that lexical access paths and click priors systematically underserve. The long-term implication is a retrieval stack whose first stage is semantic by construction and whose later stages recover fine-grained matching only where it matters. 

Web IQ is the first grounding system built end-to-end around that retrieval premise. 

A reference architecture for grounding: The Web IQ stack 

At the base of Web IQ is a retrieval system operating at global scale, but the key design choice is that documents are no longer the primary unit of access. The system is organized around both semantic representations of content and the operational question that follows from that choice: how to search a global embedding space with high recall, bounded latency, and enough structure preserved for downstream grounding. 

That immediately elevates two components from implementation details to system primitives: the embedding model, which determines what notions of relevance are geometrically recoverable, and the ANN index, which determines whether that geometry can be searched fast enough and updated often enough to reflect the live state of the corpus. 

Harrier: Embedding as the geometry of the system 

In a semantic‑first system, the embedding model defines the retrieval geometry. It determines which documents, passages, or sub-document units are near a query, which distinctions are preserved under compression into vectors, and which relevance signals must be recovered later through more expensive interaction. 

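Formally, Harrier, our family of custom-trained and open-source multilingual text embedding models, learns a mapping

$$f_\theta : \text{text} \to \mathbb{R}^d$$

and scores a query $q$ against a content unit $x$ by cosine similarity:

$$s(q, x) = \frac{f_\theta(q) \cdot f_\theta(x)}{\lVert f_\theta(q) \rVert \, \lVert f_\theta(x) \rVert}$$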

The formulation is simple, but the systems implication is severe: Retrieval can only surface structure that the embedding space preserves. If multilingual equivalence, paraphrase robustness, entity specificity, or fine topical distinctions aren’t encoded well enough in the representation, the downstream stack can at best compensate partially and at additional cost. 

Harrier is trained using large-scale contrastive learning, combining billions of weakly supervised pairs with high-quality curated examples and synthetic data generation.  

The goal is not merely high benchmark retrieval accuracy. The model must produce a space that remains stable across languages, robust to phrasing variation, efficient under ANN search, and aligned with the kinds of evidence selection and reasoning tasks the grounding layer performs later in the pipeline.

A key design choice in Harrier is the use of decoder‑only architectures with last‑token pooling and normalization, producing dense representations that are operationally consistent across tasks. That differs from the older encoder-centric embedding pattern and reflects a tighter coupling between retrieval models and the broader LLM stack. 
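Last-token pooling and normalization can be sketched in a few lines. This is a generic illustration, not Harrier's code: it assumes the decoder's per-token hidden states are already available as plain lists, that padding sits on the right, and the function names are hypothetical.

```python
import math


def last_token_pool(hidden_states: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Pick the hidden state of the last non-padding token (assumes right padding)."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def embed(hidden_states: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Pool the last real token, then normalize."""
    return l2_normalize(last_token_pool(hidden_states, attention_mask))


# Three token positions, the last one is padding.
states = [[1.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(embed(states, mask))  # [0.6, 0.8]
```

Because every embedding has unit length after normalization, cosine similarity between two embeddings reduces to a plain dot product, which is convenient for an ANN index.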

In practice, Harrier builds on modern decoder backbones and is refined through staged training: broad pretraining to inherit linguistic and world knowledge, contrastive specialization to shape retrieval behavior on domain data, and distillation into smaller deployment variants. Distillation matters not only for cost; it’s what allows the system to preserve a compatible embedding geometry across deployment tiers while pushing latency and throughput in the right direction. 

The result is an embedding model that is competitive on public benchmarks and, more importantly, behaves predictably under production workloads where distribution shift, multilingual traffic, and latency constraints matter more than leaderboard position.  

DiskANN: When geometry meets reality 

If Harrier defines the geometry, DiskANN3 defines what is operationally achievable inside it.  

Approximate nearest neighbor search is often presented as an algorithmic trick—at web scale, it’s an operating constraint that determines the memory footprint, recall-latency frontier, and freshness envelope of the entire retrieval system. 

DiskANN3 matters because it provides high-recall streaming search and operational flexibility on memory vs. throughput. 

It decouples the update and query logic, which controls index quality, from storage details. This allows high-recall search across memory regimes: from disk-resident indices, which avoid the requirement that the full graph and vectors live in memory, to purely memory-based indices for the highest throughput, and the spectrum in between.

But the more consequential issue isn’t static search quality; it’s whether the index can absorb continuous updates without losing stability. 

In a grounding system, retrieval is only as current as the index, and stale graph structure shows up immediately as missed evidence, longer prompts, and more retries downstream. 

In Web IQ that means distributed ANN graphs, streaming update paths, and mutation strategies that avoid frequent full rebuilds. Rather than simply fast query-time traversal, the objective is a semantic index that can remain both searchable and live. 

New update logic in DiskANN3 makes the update problem explicit: Proximity graphs are hard to mutate because local connectivity is fragile, and naive deletions or insertions can degrade search quality or force rebuilds. Solving that moves the system toward a truly streaming semantic index that takes only a few milliseconds to make new content searchable and retains high search quality without full index rebuilds. This is essential for providing accurate grounding to AI agents.
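The fragility of proximity graphs is easy to see on a toy example. The sketch below uses a simplified greedy walk, not DiskANN's actual search or update logic (production graph indexes typically keep a wider candidate list and repair edges on mutation); the four-node graph and its coordinates are invented.

```python
import math


def greedy_search(graph, vectors, query, entry):
    """Walk to whichever neighbor is closest to the query until no neighbor improves."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(vectors[n], query), default=current)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current
        current = best


# Four points on a line, each linked to its immediate neighbors.
vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (3.0, 0.0)}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
query = (2.9, 0.0)

print(greedy_search(graph, vectors, query, "a"))  # d

# Naive deletion of "c": drop the node and every edge to it, with no repair.
pruned = {n: [m for m in nbrs if m != "c"] for n, nbrs in graph.items() if n != "c"}
print(greedy_search(pruned, vectors, query, "a"))  # b (d is now unreachable from a)
```

After the naive deletion the true nearest neighbor still exists in the index but can no longer be reached from the entry point, which is the kind of silent recall loss that forces either edge repair or a rebuild.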

Evidence objects: Controlling token economics 

Once retrieval produces candidates, the next problem is context construction: selecting and packaging the evidence that the model will actually consume. 

Web IQ departs from the document-centric search stack. Beyond just handing whole documents to the model, it can construct evidence objects: passage-level units with provenance, structural metadata, and enough local context to remain interpretable when detached from the source page. The aim is to preserve the evidence needed for reasoning without paying the token cost of full-document recall.  

That changes the optimization target from document relevance to information density per token. Better evidence objects reduce prompt size, improve reasoning quality by concentrating the relevant facts, and preserve attribution so that outputs remain inspectable. This is the practical meaning of returning the most relevant chunks rather than entire documents.  
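One way to picture "information density per token" is a greedy packer over passage-level evidence. This is a sketch under stated assumptions, not Web IQ's data model: the field names, the relevance scores, the token counts, and the example.com URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvidenceObject:
    passage: str
    source_url: str      # provenance
    section_title: str   # structural metadata
    relevance: float     # score from an upstream ranker (assumed)
    tokens: int          # positive token count from the model's tokenizer (assumed)


def pack_evidence(candidates: list[EvidenceObject], token_budget: int) -> list[EvidenceObject]:
    """Greedily fill the budget, preferring the highest relevance per token."""
    chosen, used = [], 0
    for ev in sorted(candidates, key=lambda e: e.relevance / e.tokens, reverse=True):
        if used + ev.tokens <= token_budget:
            chosen.append(ev)
            used += ev.tokens
    return chosen


candidates = [
    EvidenceObject("Long background section ...", "https://example.com/a", "Overview", 0.9, 300),
    EvidenceObject("Direct answer passage ...", "https://example.com/b", "Pricing", 0.8, 100),
    EvidenceObject("Supporting detail ...", "https://example.com/c", "FAQ", 0.5, 120),
]
picked = pack_evidence(candidates, token_budget=250)
print([ev.source_url for ev in picked])
# ['https://example.com/b', 'https://example.com/c']
```

The most relevant passage in absolute terms is left out because it would consume the whole budget on its own; two denser passages fit instead, and each keeps its source URL for attribution.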

Orchestration: The hidden system layer 

At the top of the stack sits orchestration, which has become one of the most important components precisely because AI queries aren’t limited to short keyword expressions. Instead, they’re often long, compositional, and dependent on prior conversational state. 

The orchestration layer interprets those requests, maps them onto retrieval strategies, executes those strategies across distributed infrastructure, and assembles evidence under strict latency and context-window constraints. Because it operates statefully against short-term memory and partial prior results, this layer is better thought of as execution planning for grounding rather than as a thin wrapper around search. 

Optimizing for speed at scale 

A grounding system also must be fast enough to remain inside an interactive inference loop. In practice, that means designing towards 100ms search latency—not as a marketing target, but as a systems target. Once retrieval, evidence construction, and orchestration sit on the critical path of generation, every additional millisecond increases both user-visible delay and the probability of cascading retries. 

At that scale, performance is governed less by median latency than by the tail. The system therefore must be engineered around microsecond-level budget discipline across network hops, storage access, ANN traversal, and model execution, with aggressive control of tail amplification, careful failure handling, and degradation paths that preserve correctness when subsystems are slow or unavailable. Speed isn’t one optimization; it’s a property of the entire distributed pipeline. 
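A short calculation shows why the tail dominates once a request fans out. It assumes each parallel call independently has a 5% chance of exceeding its own P95 latency; real subsystems are rarely fully independent, so treat the numbers as an illustration of the trend rather than a measurement.

```python
def prob_any_slow(fanout: int, per_call_quantile: float = 0.95) -> float:
    """Chance that at least one of `fanout` independent calls exceeds its own quantile latency."""
    return 1.0 - per_call_quantile ** fanout


for fanout in (1, 10, 100):
    print(f"fan-out {fanout:>3}: {prob_any_slow(fanout):.3f}")
# fan-out   1: 0.050
# fan-out  10: 0.401
# fan-out 100: 0.994
```

With a fan-out of 100, almost every request waits on at least one call that is in its own slow tail, so the end-to-end latency is set by per-component tail behavior rather than by the median.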

That in turn makes efficiency a first-order design principle. Embedding models and re-ranking stages have to run on extremely efficient kernels and inference engines; data movement has to be minimized; and batching, caching, and memory layout have to be tuned for real workloads rather than benchmarks. The result is a culture of relentless performance work: shaving tail latency, reducing waste in every stage, and treating throughput, reliability, and latency as coupled properties of the same system. 

The web as substrate: Bing, crawling, and the system beneath grounding 

All of the layers above assume something more fundamental: a high-fidelity, continuously updated representation of the web. Far from a static dataset, that substrate is a dynamic, adversarial, multi-stakeholder system whose content, structure, and incentives change continuously. 

For agentic grounding, crawl quality is upstream of answer quality. If the system doesn’t discover the right pages, revisit them at the right cadence, or parse them into stable representations, retrieval can’t recover the missing evidence later. At web scale, that makes crawling and indexing first-class systems problems: deciding what to fetch, when to revisit it, how to normalize heterogeneous content, and how to propagate updates through a distributed index without taking the system offline or destabilizing retrieval semantics. 

The web is also an ecosystem, not just a corpus. A production crawler must operate with politeness, respect publisher constraints, and preserve attribution, usage and quality signals from crawl through index construction and into evidence objects. Those constraints are part of the grounding system itself because the model can only cite and reason over evidence that has been collected, interpreted, and packaged responsibly. 

Another complication is that the web responds to retrieval systems: Content is optimized for ranking, deduplicated, and continuously reshaped. Covering trillions of pages therefore takes more than bandwidth. It requires sophisticated models for discovery, canonicalization, spam detection, language understanding, and change prediction, together with trust and quality defenses that keep a semantic-first stack stable under continuous drift. 

That’s why a long-lived system like Bing matters to Web IQ. Broad coverage isn’t only a matter of crawl volume; it depends on years of accumulated infrastructure, change models, publisher integration, anti-spam signals, and operational feedback. For agentic grounding, that history matters because a system can only ground against the web it has learned to discover, understand, and maintain over time. 

A system perspective on grounding 

The point here isn’t that any individual component is unprecedented. Embedding models, ANN indexes, crawlers, and orchestration layers all existed before. What changes in Web IQ is that they’re treated as one coupled system, organized around semantic-first retrieval, and optimized for the constraints that agentic grounding imposes. 

Taken together, the system perspective is straightforward: 

  • Embeddings define what is geometrically retrievable
  • ANN infrastructure determines whether that representation can be served with sufficient recall, freshness, and latency
  • Evidence objects determine how efficiently the model can consume grounded context
  • Orchestration, performance engineering, and crawl quality determine whether the pipeline can operate reliably at web scale

At that point, grounding is no longer an extension of search. It is a core infrastructure layer for agentic AI.

The post Grounding at scale: Engineering the retrieval system for the agentic web appeared first on Command Line.


Microsoft Scout: From personal project to enterprise-ready personal agent


Early this year, OpenClaw demos were seemingly everywhere, though many seemed to amount at best to a cool party trick (“Look, my agent ordered a pizza”). But it got long-time Microsoft employee Omar Shahine thinking: How useful could claws actually be? 

Very, it turned out. In his spare time, Shahine created Lobster, a personal AI assistant built on OpenClaw. It has its own Apple ID and email address, so he can text with it from any device with iMessage. He initially split Lobster into a trio of agents, each with its own security profile and tool access (eight weeks in, that number had increased to nine always-on agents). Lobster handles travel logistics, sends family reminders ahead of time, and generally helps Shahine and his family stay organized and get things done. And after Shahine presented Lobster to Microsoft’s AI Accelerator group, it landed him a new job: bringing OpenClaw to M365 and the cloud as CVP of what was deemed “Project Lobster.”

At the same time, Microsoft Member of Technical Staff Jakob Werner was pursuing a similar idea with a twist: a desktop app-based agent inspired by OpenClaw. The goal was to deliver a powerful enterprise-secure personal AI assistant that anyone within Microsoft could use. In just a couple of weeks, what was referred to internally as “Clawpilot” had already been downloaded by thousands of Microsoft employees, and that community continues to grow.

When Shahine started assembling a small team of enthusiastic builders—Ocean’s 11, naturally—Werner quickly joined their ranks. The two recently caught up in Redmond, Washington, to compare notes on building these always-on, autonomous agents and navigating the worlds of enterprise security, agentic memory, and more.

Embracing the spirit of open source 

The Project Lobster team is representative of a new way of working within Microsoft, fueled by AI advancements. It’s a tight-knit group that prefers to collaborate asynchronously. There’s a general consensus against meetings. Everyone contributes to the codebase, including Shahine. And there’s no traditional executive assistant among their ranks: Each team member actively uses prototypes throughout the day to fully immerse themselves in the tech as they’re building it. There’s even a growing open-source community around the team that mirrors what’s found with open-source projects outside Microsoft’s walls. 

“I’ve never seen a project inside the company where so many people showed up with their ideas and their code and did the work to produce a PR,” says Shahine. 

In fact, internal excitement around Project Lobster has been such that the team fielded pull requests (PRs) left and right during the early building phase, which they reviewed to determine whether they met the bar to make it into the product. Even some of Shahine’s changes didn’t make the cut. The focus had to remain on the central goal of the product: Creating an always-on personal agent for work. An AI helper that learns your goals, adapts to your daily work patterns, and acts with context, identifying issues before they surface, keeping projects on track and driving outcomes without constant input. An agent that can detect when a calendar is overbooked and propose specific changes before the week begins or identify when a decision is stalled and draft a targeted follow-up to unblock it. 

“We have to determine if a given PR changes the central idea of the product or not—and the speed of that review is human speed, not AI speed,” notes Werner. “Anyone can make a PR super quickly now. We’re trying to help the community and teach contributors how to review PRs.” 

While the work began as an internal experiment, it quickly turned into a customer-focused effort that has culminated in the introduction of Microsoft Scout, an always-on personal agent powered by OpenClaw open-source technology.

From experiment to enterprise-ready product 

Microsoft Scout operates autonomously—with its own identity—acting on your behalf. It works across cloud, desktop, and web browser, so it can connect across the surfaces you use—Teams, Outlook, OneDrive, and SharePoint—and the systems where work lives, including email, calendar, and contacts. 

Unlike your average claw in the wild, Microsoft Scout combines OpenClaw code with enterprise identity, governance, and security. Every package is ingested through a curated, signed Microsoft supply chain, and every tool call, model request, and network hop is mediated by a zero-trust runtime—the agent’s container is treated as untrusted, with Microsoft-controlled identity, tokens, and policy sitting outside it. With Agent 365, admins get a single control plane, and Microsoft Purview gives security teams the same compliance and DLP signal they already get from other M365 surfaces. 

“It’s a super powerful tool,” acknowledges Werner. “And to be enterprise secure, we needed to make sure the data governance was right, that the privacy was right, and that it doesn’t cancel a meeting and send all your personal information to that email chain. If I send my agent to you, it shouldn’t tell you everything about me. These areas are possible to contain, but we also had to do it in a balanced way that doesn’t restrict the possibilities down to nothing.” 

It’s a tradeoff worth making. And with Microsoft’s tried and trusted enterprise security offerings and ongoing research and innovation in the space, the team had a solid foundation from which to address the challenge. 

The role of agentic memory 

In order for an always-on personal AI agent to be truly useful, it needs to be proactive—and that requires context powered by Work IQ. Over time, Microsoft Scout understands the way you work, uses the same productivity tools you use, and takes things off your plate without the need for constant prompts. It learns your goals, adapts to your daily work patterns, and acts with intent. Unlike previous technological waves, this is software that’s truly personalized. That’s transformative, but it’s not without tradeoffs. 

“OpenClaw, Claude Code, GitHub Copilot CLI, these are agentic coding harnesses that are basically remembering—writing things down just like people do,” Shahine notes. “They write things down like a diary. But just like it needs to remember things, it needs to forget some things, too.” 

As an example, Shahine points back to the introduction of memory to ChatGPT. He spent some time telling ChatGPT that his daughter was 17 while his son was 13. But a year later, that information remained static. The system didn’t have a concept that some facts need to change over time, while other pieces of information—like your name—will stay exactly the same. 

“In the design phase, I was thinking about the human and how humans memorize things,” says Werner. “I forget things that are irrelevant because I didn’t use them. So I built a system where, if I’m going to use it repeatedly, it’s going to stick. But if I’m not going to use it regularly, I want the system to forget. I don’t want to have an infinite diary of things, right? So there’s kind of layers of memory, and it kind of disappears over time if it’s not used. Meanwhile, the relevance of other pieces of memory grows as you use them more.” 
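
One way to picture the behavior Werner describes is a store in which each memory has a strength that halves on a fixed schedule and is bumped every time the memory is used. The sketch below is only an illustration under that assumption: the DecayingMemory class, the half-life parameter, and the forget threshold are all hypothetical and are not taken from Microsoft Scout, and the "layers" he mentions are not modeled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical usage-weighted memory store; every name here is illustrative.
public sealed class DecayingMemory
{
    private sealed class Entry
    {
        public string Value = "";
        public double Strength;   // relevance at the moment of last use
        public DateTime LastUsed;
    }

    private readonly Dictionary<string, Entry> _entries = new();
    private readonly TimeSpan _halfLife;
    private readonly double _forgetBelow;

    public DecayingMemory(TimeSpan halfLife, double forgetBelow = 0.25)
    {
        _halfLife = halfLife;
        _forgetBelow = forgetBelow;
    }

    // Current relevance: the stored strength, halved once per elapsed half-life.
    private double Relevance(Entry e, DateTime now) =>
        e.Strength * Math.Pow(0.5, (now - e.LastUsed).TotalSeconds / _halfLife.TotalSeconds);

    public void Remember(string key, string value, DateTime now) =>
        _entries[key] = new Entry { Value = value, Strength = 1.0, LastUsed = now };

    // Using a memory reinforces it: decayed relevance plus one.
    public string? Recall(string key, DateTime now)
    {
        if (!_entries.TryGetValue(key, out var e)) return null;
        e.Strength = Relevance(e, now) + 1.0;
        e.LastUsed = now;
        return e.Value;
    }

    // Removes every memory whose relevance has decayed below the threshold.
    public int Forget(DateTime now)
    {
        var stale = _entries.Where(kv => Relevance(kv.Value, now) < _forgetBelow)
                            .Select(kv => kv.Key)
                            .ToList();
        foreach (var key in stale) _entries.Remove(key);
        return stale.Count;
    }

    public bool Contains(string key) => _entries.ContainsKey(key);
}
```

With a seven-day half-life, a fact recalled once a week stays well above the threshold, while one that is never touched again decays to 1/16 of its original strength after four weeks and is dropped by Forget.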

Forming a new center of gravity 

When they first joined forces, Werner introduced Shahine to the concept of gravity—the framework around which he operated. 

“To build a truly great product, I don’t think I can make it myself,” Werner explains. “We need to collaborate with other people. But how do we influence other people to collaborate with us? And the mindset I use and try to instill in my team is gravity. We build something and make it so big in influence—not in the number of features, but in its influence—that when exciting new ideas pop up, they want to try and join the gravity of our work rather than dissolve focus.” 

“And I didn’t really know what you were talking about until my new role was announced,” admits Shahine. “But since then, I’ve received hundreds if not thousands of messages from people who want to help, people who want to learn, people who want to show me what they did, and customers who want to know ASAP when they’re going to get their hands on what we’re building. There are a lot of other words for that—user pull, signal—but your mantra of gravity really resonates with me now.” 


Microsoft employees have already been using an early Microsoft Scout desktop experience. We built this to learn how always-on agents show up in real work, and we’re seeing it take on coordination, surface risks earlier, and keep work moving without constant prompting. 

We’re now extending that early experience to Frontier organizations. Microsoft Scout is available as an experimental release through Frontier, giving customers a chance to explore how it can fit into their own workflows. 

Access requires Frontier enrollment, Intune policy configuration, and an opt-in attestation. Users with a GitHub Copilot license can then download and install the experience. Learn more.

The post Microsoft Scout: From personal project to enterprise-ready personal agent appeared first on Command Line.


Azure DevOps and GitHub: Journeying into the AI Era

AI is changing how software gets planned, built, and reviewed. As teams adopt agentic development, the platform underneath those workflows matters more. They need tools that bring planning, coding, security, and collaboration together—and can keep pace with how development is evolving.

That’s why we’re delivering the newest agentic capabilities on GitHub across planning, coding, code review, and security. For teams driving active development, that often means moving repositories to GitHub to unlock the latest AI-powered workflows, while continuing to use Azure Boards and Pipelines. For teams that need more time migrating repos to GitHub, we’re continuing to invest in Azure DevOps with improvements focused on security and code quality in the workflows they rely on today.

Unlocking agentic development on GitHub

Organizations that move repositories to GitHub unlock the full agentic development experience, including a broad choice of models from Anthropic, OpenAI, Google, and others in Copilot, along with agentic workflows across GitHub, VS Code, mobile, CLI, and integrations like Teams, Slack, and Azure Boards. Enterprise governance is built in through an agent control plane that provides visibility, audit, and policy management.

Teams can continue using Azure DevOps capabilities such as Azure Boards, Pipelines, and Test Plans alongside their GitHub repos. Azure DevOps basic usage rights are included with GitHub Enterprise, making it easier to adopt a hybrid model without additional overhead. Enterprise Live Migrations and deeper GitHub–Azure DevOps integration make it practical to transition at enterprise scale, as reflected in Microsoft’s own migration journey.

Enterprise Live Migrations — Preview

Large repos and complex organizations need migration options that respect how their teams operate. Enterprise Live Migrations (ELM), now available in preview, enables organizations to migrate multiple repositories together with minimal cutover downtime. ELM supports starting migration without locking the Azure DevOps repository, allowing developers to continue working while teams transition at a pace that fits their portfolio and operational needs. Repository history, branches, and metadata are preserved throughout the migration process. For organizations choosing GitHub for source control while continuing to use Azure DevOps for planning, CI/CD and testing, ELM is designed to make that transition significantly easier and less disruptive. A script-based migration experience is available today in preview, with a more streamlined end-to-end experience coming soon. Join the waitlist for early access to Enterprise Live Migration.

Customer zero: How Microsoft is moving to GitHub internally

Microsoft is using this model internally too. In the Copilot, Agents, and Platform team, a small migration team has moved 1,575 repositories from Azure Repos to GitHub Enterprise, helping drive a company-wide shift at scale. Today, more than 3,000 developers are on GitHub, and 45% of pull requests now happen there.

AI is the forcing function. GitHub is where Copilot, agents, and new AI workflows ship first and run at scale, so moving source control there gives teams faster access to the latest capabilities. At the same time, teams can continue to use Azure Pipelines and Azure Boards in Azure DevOps while they move their code to GitHub, without waiting for a full platform transformation.

For a deeper look at how the team approached the migration, including practical best practices and lessons learned, see the related blog post.

Connecting agents to your DevOps context

As teams adopt hybrid workflows across GitHub and Azure DevOps, agents need context from the systems where work gets planned, built, reviewed, and shipped. GitHub already provides a remote MCP server for repository and workflow context. The new Azure DevOps remote MCP server in preview extends that model to Azure DevOps, bringing work items, builds, pull requests, test plans, and other project context directly into agentic workflows.

The service is fully hosted by Microsoft, geo-routed, and stateless, with no Azure DevOps customer data persisted within the MCP service itself. It’s also available in Azure AI Foundry for building custom agents, and we’re working to make it available in Copilot Studio and Microsoft 365 Copilot so teams can connect Azure DevOps to their agents without managing additional infrastructure.

Security and quality improvements for Azure DevOps repositories

For teams continuing to run repos on Azure DevOps, we’re bringing AI-powered capabilities directly into the workflows they already use. The updates below focus on two areas that matter most: improving code quality in pull requests and helping developers remediate security issues faster.

Copilot Code Review for Azure DevOps — Preview

With Copilot Code Review, now in preview, developers can request an automated review on any pull request and receive inline comments and suggestions directly in the Azure DevOps pull request experience. Organization administrators maintain full control over adoption and enablement through settings at the org, project, or repository level. Usage is billed using GitHub AI Credits through the same Azure subscription your organization already uses for Azure DevOps, with no separate license management required. Copilot Code Review is designed to complement, not replace, human reviewers. It doesn’t block merges or count toward required reviewers. Instead, it acts as an additional layer of review that helps surface issues earlier, allowing engineers to focus their attention on the changes that matter most. Join the waitlist for early access to Copilot Code Review for Azure DevOps.

Autofix for CodeQL in Azure DevOps — Preview

Security issues are only valuable to surface if teams can fix them efficiently. Autofix for CodeQL alerts, now in preview for Azure Repos as part of GitHub Advanced Security for Azure DevOps, brings AI-powered remediation directly into the code security workflow. With Autofix, developers can request a suggested fix for an open CodeQL alert, and Copilot automatically creates a pull request with the proposed remediation ready for review. It’s the same Autofix experience GitHub customers have used to significantly reduce time-to-remediation for security issues. Preview today covers a subset of CodeQL alerts and will expand over time toward broader coverage across code scanning alerts as we move toward general availability. Join the waitlist for early access to autofix for CodeQL in Azure DevOps.

Apple Silicon Macs for Azure Pipelines, with pay-per-minute billing

Apple Silicon Macs — a capability many customers have been waiting for — are now available as Microsoft-hosted agents in Azure Pipelines with pay-per-minute billing. Teams can now run faster Apple-native builds for iOS and macOS without managing their own Mac infrastructure or planning capacity for parallel jobs, while paying through a flexible usage-based model that scales with pipeline demand.

Moving forward

GitHub is where the newest AI-powered development workflows are taking shape, and for many teams that will increasingly mean moving source control there over time. We’re investing to make that transition practical, whether you’re migrating repositories now, adopting a hybrid Azure DevOps + GitHub model, or moving repositories in phases based on complexity and business priorities.

Across each of those paths, the goal is the same: help teams move forward with stronger security, better code quality, and access to the latest agentic capabilities.

The post Azure DevOps and GitHub: Journeying into the AI Era appeared first on Azure DevOps Blog.


5 Heart Animations Using .NET MAUI

Try your hand at some simple, engaging animations in .NET MAUI. Here are five heart animations you can master today!

I’m a big fan of animations that capture the user’s attention—the kind that don’t just make an app look pretty, but also have the ability to guide users and clearly communicate what’s happening. When used correctly, animations become a key part of design and usability. ✨

In .NET MAUI, we have simple yet very powerful animations that allow us to add that dynamic touch without unnecessary complexity. With just a few lines of code, we can transform common interactions into more natural and enjoyable experiences.

I genuinely love this article, because in it we’ll learn how to create five super-creative animations that elevate our application and showcase the beautiful and powerful things we can achieve using the basic animations available in .NET MAUI.

1. Heart ECG Monitor

OMG, can we do this in .NET MAUI?? Yess!!! (This is my favorite animation, hahaha).

Let’s create an ECG monitor–style animation, where we can see the movement of the line. To achieve this, we’ll work with both XAML and code-behind.

Let’s start with the XAML.

The design includes a heart and some lines that simulate the monitor’s movement. Both elements will be vertically aligned, so the layout that will contain our design is a VerticalStackLayout.

Inside this layout, we’ll add:

➖ An Image for the heart
➖ And a GraphicsView to draw and animate the monitor lines

<VerticalStackLayout Padding="24" Spacing="16">
   
    <!-- Heart --> 
    <Image x:Name="Heart" 
    Source="redheart" 
    WidthRequest="35" 
    HeightRequest="35" 
    HorizontalOptions="Center" />
    
    <!-- ECG --> 
    <GraphicsView x:Name="Ecg" 
    HeightRequest="120" />

</VerticalStackLayout>

Perfect! Now we’ll continue with the code that brings the monitor and the heart to life. For the purpose of this example, we’ll add everything directly in the code-behind. And how do we locate the code-behind? For example, if your XAML file is called MainPage.xaml, the code-behind is in MainPage.xaml.cs.

Variable Declaration

WindowSize: This constant controls how many samples are visible on the screen at the same time. The higher this value is, the more heartbeats fit on the monitor before they scroll out of view, so a larger WindowSize makes the heartbeat appear slower.

Beat: We define an array to represent one complete heartbeat. Values close to zero represent the baseline, while larger positive or negative values represent the peaks and dips.

_samples: This list holds what is currently displayed on the screen. Initially, we fill it with zeros to display a flat line.

_drawable: This field holds the EcgDrawable object responsible for drawing the monitor signal.

_i: Finally, we have the index used to step through the Beat array.

// Variables

const int WindowSize = 140;
static readonly float[] Beat =
{
    0, 0, 0, .12f, .25f, .12f, 0, -.25f, 1f, -.35f, 0, 0, .20f, .35f, .20f, 0, 0, 0, 0
};

readonly List<float> _samples = Enumerable.Repeat(0f, WindowSize).ToList();
readonly EcgDrawable _drawable;
int _i;

Continuing with the Constructor

Once we have all the required variables defined, the next step is to connect the components and start the animation inside the constructor.

First, we create an instance of EcgDrawable and assign it to the GraphicsView. This allows the view to know how to render the ECG signal based on the provided samples.

Next, we configure a timer using the Dispatcher. This timer is responsible for updating the signal at a regular interval, which helps us achieve a smooth and continuous animation.

On each timer tick:

  • The next value from the heartbeat pattern is added to the samples list.
  • The window size is preserved by removing the oldest value.
  • The GraphicsView is invalidated to force a redraw.

Finally, we start the timer, and the heart monitor animation begins automatically.

In code, it looks like this:

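public HeartAnimations()
{
    InitializeComponent();
    _drawable = new EcgDrawable(_samples);
    Ecg.Drawable = _drawable;

    var t = Dispatcher.CreateTimer();
    t.Interval = TimeSpan.FromMilliseconds(16); // roughly 60 updates per second
    t.Tick += (_, __) =>
    {
        // Append the next heartbeat value, wrapping the index so it never overflows.
        _samples.Add(Beat[_i]);
        _i = (_i + 1) % Beat.Length;

        // Keep the window at a fixed size by dropping the oldest value.
        if (_samples.Count > WindowSize) _samples.RemoveAt(0);

        Ecg.Invalidate(); // force a redraw
    };
    t.Start();
}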
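
If you want to check the windowing logic without running the app, the tick handler can be pulled out into a small UI-free class. This is just a sketch: EcgWindow and its members are names I made up for illustration, and it wraps the index with a modulo on every tick.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// UI-free version of the tick logic (hypothetical helper, not part of .NET MAUI).
public sealed class EcgWindow
{
    private static readonly float[] Beat =
    {
        0, 0, 0, .12f, .25f, .12f, 0, -.25f, 1f, -.35f, 0, 0, .20f, .35f, .20f, 0, 0, 0, 0
    };

    private readonly List<float> _samples;
    private readonly int _windowSize;
    private int _i;

    public EcgWindow(int windowSize)
    {
        _windowSize = windowSize;
        // Start with a flat line.
        _samples = Enumerable.Repeat(0f, windowSize).ToList();
    }

    public IReadOnlyList<float> Samples => _samples;

    // One timer tick: append the next beat value, then drop the oldest one.
    public void Tick()
    {
        _samples.Add(Beat[_i]);
        _i = (_i + 1) % Beat.Length; // wrap the index instead of letting it grow
        if (_samples.Count > _windowSize) _samples.RemoveAt(0);
    }
}
```

With the values used in this article, one beat is 19 samples, so at 16 ms per tick it repeats about every 304 ms, and a 140-sample window shows a little over seven beats at once.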

Finally, let’s add the EcgDrawable class.

This class is responsible for visually rendering the ECG signal inside the GraphicsView component. It takes the list of values that represent the heart signal and transforms them into a continuous line, automatically calculating:

  • The vertical position of the signal
  • The horizontal distribution of each point

In code, this translates to the following:

class EcgDrawable(IReadOnlyList<float> samples) : IDrawable
{
    public void Draw(ICanvas c, RectF r)
    {
        if (samples.Count < 2) return;

        c.StrokeColor = Colors.Red;
        c.StrokeSize = 2;

        var baseline = r.Top + r.Height * .55f;  // vertical position of a zero sample
        var amp = r.Height * .35f;               // height of a sample with value 1
        var dx = r.Width / (samples.Count - 1);  // horizontal spacing between points

        var p = new PathF();
        p.MoveTo(r.Left, baseline - samples[0] * amp);
        for (int i = 1; i < samples.Count; i++)
            p.LineTo(r.Left + i * dx, baseline - samples[i] * amp);

        c.DrawPath(p);
    }
}
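
To see what the drawable computes, here is the same arithmetic as a plain function that returns the points instead of drawing them. It is a sketch for experimenting: EcgGeometry and ToPoints are hypothetical names, and the rectangle is passed as four floats so the code does not need any MAUI types.

```csharp
using System;
using System.Collections.Generic;

// Same math as EcgDrawable.Draw, without the canvas (hypothetical helper).
public static class EcgGeometry
{
    public static (float X, float Y)[] ToPoints(
        IReadOnlyList<float> samples, float left, float top, float width, float height)
    {
        if (samples.Count < 2) return Array.Empty<(float X, float Y)>();

        float baseline = top + height * .55f;   // y of a zero sample
        float amp = height * .35f;              // height of a sample with value 1
        float dx = width / (samples.Count - 1); // spacing so the last point lands on the right edge

        var points = new (float X, float Y)[samples.Count];
        for (int i = 0; i < samples.Count; i++)
            points[i] = (left + i * dx, baseline - samples[i] * amp);
        return points;
    }
}
```

For a view that is 120 units high, the baseline sits at about y = 66 and a peak of 1 reaches about y = 24. Positive samples move up the screen because canvas y grows downward.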

And that’s it! We’ve just finished building the Heart ECG Monitor animation. See? It wasn’t that hard! I encourage you to try it out and experiment with it on your own.

Now, let’s move on to the next animation!

2. Heart Ripple Pulse

This animation represents the heartbeat and the waves generated by that pulse.

Here, we combine several simple animations and use a Grid as the main container to overlay all the elements in the same space, making them feel like a single, unified animation.

We’ll start with the XAML, where we will define:

➖ Two Border components in red color, representing the ripple waves.

➖ A heart image, which will act as the main element at the center.

<Grid HorizontalOptions="Center" 
    VerticalOptions="Start" 
    Margin="0,100" 
    WidthRequest="140" 
    HeightRequest="140">
    
    <Border x:Name="Ripple1" 
    Stroke="Red" 
    StrokeThickness="3" 
    Opacity="0" 
    Scale="0.2" 
    StrokeShape="RoundRectangle 999" />
     
    <Border x:Name="Ripple2" 
    Stroke="Red" 
    StrokeThickness="3" 
    Opacity="0" 
    Scale="0.2" 
    StrokeShape="RoundRectangle 999" />
     
    <Image x:Name="Heart" 
    Source="redheart.png" 
    WidthRequest="48" 
    HeightRequest="48" 
    HorizontalOptions="Center" 
    VerticalOptions="Center" />

</Grid>

To bring our animation to life, we’ll work with the OnAppearing method and a method called Loop.

OnAppearing: To explore something different, this time we’ll start the animation from the OnAppearing method (✍️ please make sure to declare the _running bool field, and set it back to false, for example in OnDisappearing, when the loop should stop).

When the screen appears, we call the Loop method, which is responsible for running the animation cycle.

protected override void OnAppearing() 
{ 
    base.OnAppearing(); 
    _running = true; 
    _ = Loop(); 
}

Loop(): This is where the magic happens. ✨ Inside this method:

  • We trigger the heart pop, a subtle scale effect on the heart
  • We animate the two circular lines to create the ripple effect
  • We use a delay to give each animation enough time and make them clearly distinguishable

Thanks to this approach, we achieve a smooth, easy-to-follow and well-synchronized animation.

In code, this would look like the following:

async Task Loop()
{
    while (_running)
    {
        // Reset
        Ripple1.Opacity = Ripple2.Opacity = 0;
        Ripple1.Scale = Ripple2.Scale = 0.2;
        Heart.Scale = 1;

        // Heart pop
        _ = Heart.ScaleTo(1.12, 90, Easing.CubicOut)
            .ContinueWith(_ => MainThread.InvokeOnMainThreadAsync(
                () => Heart.ScaleTo(1.0, 120, Easing.CubicInOut)));

        // Ripple 1
        _ = Task.WhenAll(
                Ripple1.FadeTo(0.45, 60, Easing.CubicOut),
                Ripple1.ScaleTo(1.8, 700, Easing.CubicOut))
            .ContinueWith(_ => MainThread.InvokeOnMainThreadAsync(
                () => Ripple1.FadeTo(0, 250, Easing.CubicIn)));

        // Ripple 2
        await Task.Delay(120);
        _ = Task.WhenAll(
                Ripple2.FadeTo(0.35, 60, Easing.CubicOut),
                Ripple2.ScaleTo(1.8, 700, Easing.CubicOut))
            .ContinueWith(_ => MainThread.InvokeOnMainThreadAsync(
                () => Ripple2.FadeTo(0, 250, Easing.CubicIn)));

        await Task.Delay(650);
    }
}

And that’s it! Let’s keep going and jump into the next animation!

3. Blinking Heart

In this animation, we’ll make a heart blink in a subtle and elegant way. We’ll start with the XAML. In this case, the structure is very simple, since we only need a single heart image, centered on the screen. From this single element, we’ll build the entire animation.

<Image x:Name="Heart"  
    Source="redheart.png"  
    WidthRequest="56"  
    HeightRequest="56"  
    HorizontalOptions="Center"  
    VerticalOptions="Center" />

We’ll continue working with OnAppearing to start the animation, and additionally we’ll create a method called BlinkLoop() to handle the animation logic. This one is composed of FadeTo and ScaleTo animations, which allow us to play with the icon’s opacity and size, creating the blinking effect.

In code, it would look like the following:

protected override void OnAppearing()
{
    base.OnAppearing();
    _running = true;
    _ = BlinkLoop();
}

async Task BlinkLoop()
{
    while (_running)
    {
        await Task.WhenAll(
            Heart.FadeTo(0.35, 220, Easing.CubicInOut),
            Heart.ScaleTo(0.95, 220, Easing.CubicInOut));

        await Task.WhenAll(
            Heart.FadeTo(1, 220, Easing.CubicInOut),
            Heart.ScaleTo(1, 220, Easing.CubicInOut));
    }
}

All set! Ready to move on to the next animation ✨

4. Rotating Heart

Now we’ll create an animation where our heart rotates. ❤️ This type of animation is very useful, for example, as a loading indicator in your screens.

To get started with the implementation, we’ll begin by adding the Image in XAML, as shown below:

<Image x:Name="Heart" 
    Source="redheart.png" 
    WidthRequest="64" 
    HeightRequest="64" 
    HorizontalOptions="Center" 
    VerticalOptions="Center" />

And in the code-behind, it’s time to bring the heart to life! ❤️

We’ll use the OnAppearing method, and for the animation (✍️ please make sure to declare the _isRunning bool variable), we’ll rely on RotateTo(), as shown below:

protected override async void OnAppearing()
{
    base.OnAppearing();
    _isRunning = true;

    while (_isRunning)
    {
        await Heart.RotateTo(360, 4000, Easing.Linear);
        Heart.Rotation = 0;
    }
}

Done! Let’s bring the next animation to life ✨

5. Jumping Heart

For our final animation, we’ll make the heart jump up and down in a continuous loop.

This animation also works great as a loading indicator, adding a playful touch to your screen.

Let’s start by adding the Image to the XAML:

<Image x:Name="Heart" 
    Source="redheart.png" 
    WidthRequest="64" 
    HeightRequest="64" 
    HorizontalOptions="Center" 
    VerticalOptions="Center" />

In the code-behind, we’ll create the animation by working with the TranslateTo method and the TranslationY property, as shown below:

protected override async void OnAppearing()
{
    base.OnAppearing();
    _isRunning = true;
    Heart.TranslationY = -120;

    while (_isRunning)
    {
        await Heart.TranslateTo(0, 0, 600, Easing.CubicOut);
        await Heart.TranslateTo(0, -12, 120, Easing.CubicInOut);
        await Heart.TranslateTo(0, 0, 120, Easing.CubicInOut);
        await Heart.TranslateTo(0, -120, 450, Easing.CubicIn);
    }
}

And that’s all—thanks for building this with me!

Conclusion

And that’s it! We’ve successfully created five awesome, ready-to-use animations with .NET MAUI. Each one was designed to be easy to implement, demonstrating just how powerful even the most basic animations can be when used correctly.

I hope this article has inspired you and sparked new ideas for incorporating animations into your current and future .NET MAUI projects ✨

As always, if you have any questions or would like me to explore any of these animations in more detail, feel free to leave a comment — I’d be happy to help!

See you in the next article!


Azure Cosmos DB MCP Toolkit Is Now Generally Available — Bringing Your Database to AI Agents at Scale

Since we introduced the Azure Cosmos DB MCP Toolkit at Ignite 2025 in preview, the response has been clear: developers want a straightforward way to connect AI agents to their production databases. Customers asked for stability, broader embedding provider support, and a smoother path from experimentation to production.

Today, we’re announcing the general availability of the Azure Cosmos DB MCP Toolkit (v1.1.2), now with deeper Microsoft Foundry integration, multi-provider embedding support, and the reliability improvements you asked for.

The Problem: Getting AI Agents to Talk to Your Data Is Harder Than It Should Be

Building an AI agent is one thing. Getting that agent to securely read, search, and reason over your actual production data is another challenge entirely.

Most teams face the same friction:

  • Custom integration code for every agent-to-database connection
  • Security concerns around giving LLMs direct database access
  • Embedding lock-in to a single provider, making it hard to switch or optimize
  • Brittle scripts that break when configurations change or permissions shift

You end up spending more time wiring plumbing than building the intelligent experiences your users actually want.

What’s New in the GA Release

The v1.1.2 GA release focuses on three areas customers asked for most: flexibility, reliability, and developer experience.

Multi-Provider Embedding Support

Vector search is no longer locked to a single embedding provider. The toolkit now supports:

  • Azure AI Services (Cognitive Services endpoints)
  • Azure AI Foundry project endpoints
  • OpenAI native API

The system automatically detects your endpoint type based on URL pattern — no manual configuration flags needed. A new IEmbeddingClient abstraction layer means you can swap providers without changing your agent code.

         
{
  "OPENAI_ENDPOINT": "https://your-resource.cognitiveservices.azure.com/",
  "OPENAI_API_KEY": "your-key",
  "OPENAI_EMBEDDING_DEPLOYMENT": "text-embedding-ada-002"
}

Whether you’re using Azure AI Services, a Foundry project endpoint, or OpenAI directly — the same configuration pattern works. The toolkit figures out the rest.
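
To make the idea of URL-based detection concrete, here is a minimal sketch of how an endpoint could be classified by its host name and rejected early when it matches nothing. It is not the toolkit's implementation: the EmbeddingProvider enum, the EmbeddingEndpoint class, and the specific host suffixes are assumptions chosen for illustration.

```csharp
using System;

// Hypothetical provider categories, mirroring the three options listed above.
public enum EmbeddingProvider { AzureAIServices, FoundryProject, OpenAI }

public static class EmbeddingEndpoint
{
    // Classifies an endpoint by host name.
    // The suffixes are assumptions for illustration, not the toolkit's actual rules.
    public static EmbeddingProvider Detect(string endpoint)
    {
        if (!Uri.TryCreate(endpoint, UriKind.Absolute, out var uri) || uri.Scheme != Uri.UriSchemeHttps)
            throw new ArgumentException($"'{endpoint}' is not an absolute https URL.", nameof(endpoint));

        string host = uri.Host.ToLowerInvariant();

        if (host == "api.openai.com")
            return EmbeddingProvider.OpenAI;
        if (host.EndsWith(".services.ai.azure.com", StringComparison.Ordinal))
            return EmbeddingProvider.FoundryProject;
        if (host.EndsWith(".cognitiveservices.azure.com", StringComparison.Ordinal) ||
            host.EndsWith(".openai.azure.com", StringComparison.Ordinal))
            return EmbeddingProvider.AzureAIServices;

        // Fail at startup with a message that says what was expected.
        throw new ArgumentException(
            $"Unrecognized embedding endpoint host '{host}'. " +
            "Expected an Azure AI Services, Foundry project, or OpenAI URL.",
            nameof(endpoint));
    }
}
```

Running a check like this once at startup means a mistyped endpoint surfaces as a clear configuration error rather than as a failed embedding call later on.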

 

Improved Reliability and Error Handling

We heard the feedback on rough edges during preview. The GA release includes these fixes:

  • Role assignment scripts (Assign-Role-To-Users.ps1, Assign-Role-To-Current-User.ps1, Verify-Role-Assignments.ps1) now handle edge cases correctly
  • Structured error responses — Role-denied tool calls return a proper 403 JSON-RPC response instead of a 500 error
  • Foundry connection parameter handling works correctly when using project names
  • Startup validation rejects invalid endpoint configurations early with actionable guidance

Better MCP Transport and Compatibility

  • MCP HTTP transport is now properly registered for SDK endpoint mapping
  • External MCP clients connect reliably at /mcp
  • The web UI and SDK endpoints coexist without conflicts

Microsoft Foundry Integration

The MCP Toolkit integrates directly with Microsoft Foundry, giving your agents access to Cosmos DB in just a few clicks:

  1. Navigate to your Foundry project
  2. Go to Build → Create agent
  3. Select + Add in the tools section
  4. Select the Catalog tab
  5. Choose Azure Cosmos DB and click Create

That’s it. Your agent can now query databases, perform vector searches, and discover schemas — all through the standardized MCP protocol with enterprise-grade security (Entra ID, RBAC, managed identities).

 

What You Can Build

The toolkit exposes 8 MCP tools that cover the most common agent-to-database patterns:

 

Tool                     What It Does
list_databases           Discover all databases in your account
list_collections         Explore containers within a database
get_recent_documents     Retrieve recent documents sorted by timestamp
find_document_by_id      Look up specific documents by ID
text_search              Search by property values using CONTAINS
vector_search            Semantic search using vector embeddings
get_approximate_schema   Sample and infer container schemas

 

Example: AI-Powered Support Agent

A support agent receives “What’s the status of order #12345?” and autonomously:

  1. Calls find_document_by_id to retrieve the order
  2. Reads shipping status and estimated delivery
  3. Responds with a personalized, accurate answer — no human lookup required
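Step 1 of this flow corresponds to a single MCP tools/call request. The Python sketch below builds that JSON-RPC payload without sending it. The tools/call method and its name and arguments parameters follow the MCP specification; the argument names (databaseId, containerId, id) and their values are hypothetical, so check the tool's published input schema before relying on them.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC request ids, unique within this process


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


# Argument names and values below are placeholders, not the toolkit's schema.
request = build_tool_call(
    "find_document_by_id",
    {"databaseId": "store", "containerId": "orders", "id": "12345"},
)
print(request)
```

An MCP client would send a body like this to the server's /mcp endpoint after the usual MCP initialization handshake, then pass the returned document to the model to compose the reply.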

Example: Knowledge Base with RAG

A documentation agent uses vector_search to find semantically relevant articles, synthesizes answers from multiple sources, and cites specific documents — all backed by Cosmos DB’s global distribution and low latency.

Getting Started

    # Clone the toolkit
    git clone https://github.com/AzureCosmosDB/MCPToolKit.git

    # Configure your environment
    cp .env.example .env
    # Set your Cosmos DB connection, embedding endpoint, and auth settings

    # Run the MCP server
    dotnet run

For detailed setup including Entra ID authentication and managed identity configuration, see the Quick Start Guide.

If you’re setting up your development environment for Azure Cosmos DB, watch the session “Azure Cosmos DB Dev Environment with AI” from Azure Cosmos DB Conf 2026.

The Bottom Line

The Azure Cosmos DB MCP Toolkit v1.1 is production-ready, open source, and designed to get out of your way. Swap between Azure AI Services, Azure AI Foundry, or OpenAI embeddings without touching your agent code. Add Cosmos DB tools to a Foundry agent straight from the catalog. Run it at scale with proper error handling, validated configurations, and enterprise-grade RBAC, then extend it however you need.

If you’ve been waiting for GA to move forward — now’s the time. If you’ve been running the preview, upgrade to v1.1 for the multi-provider embedding support and stability fixes.

 

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.

To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn. Join the discussion with other developers on the #nosql channel on the Microsoft Open Source Discord.

 

The post Azure Cosmos DB MCP Toolkit Is Now Generally Available — Bringing Your Database to AI Agents at Scale appeared first on Azure Cosmos DB Blog.
