Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Microsoft starts removing Copilot buttons from Windows 11 apps


Microsoft is starting to remove "unnecessary" Copilot buttons from its Windows 11 apps. In the latest version of the Notepad app for Windows Insiders, Microsoft has removed the Copilot button in favor of a "writing tools" menu. The Copilot button in the Snipping Tool app also no longer appears when you select an area to capture.

The change is part of the "reducing unnecessary Copilot entry points, starting with apps like Snipping Tool, Photos, Widgets and Notepad" work that Microsoft promised to complete as part of its broader plan to fix Windows 11. While the Copilot buttons are being removed, it looks like the underlying AI features are here to stay, …

Read the full story at The Verge.

Read the whole story
alvinashcraft
14 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents


Last week we announced the Agent Governance Toolkit on the Microsoft Open Source Blog, an open-source project that brings runtime security governance to autonomous AI agents. In that announcement, we covered the why: AI agents are making autonomous decisions in production, and the security patterns that kept systems safe for decades need to be applied to this new class of workload.

 

In this post, we'll go deeper into the how: the architecture, the implementation details, and what it takes to run governed agents in production.

The Problem: Production Infrastructure Meets Autonomous Agents

If you manage production infrastructure, you already know the playbook: least privilege, mandatory access controls, process isolation, audit logging, and circuit breakers for cascading failures. These patterns have kept production systems safe for decades.

Now imagine a new class of workload arriving on your infrastructure: AI agents that autonomously execute code, call APIs, read databases, and spawn sub-processes. They reason about what to do, select tools, and act in loops. And in many current deployments, they do all of this without the security controls you'd demand of any other production workload.

That gap is what led us to build the Agent Governance Toolkit: an open-source project that applies proven security concepts from operating systems, service meshes, and SRE to the emerging world of autonomous AI agents.

To frame this in familiar terms: most AI agent frameworks today are like running every process as root, with no access controls, no isolation, and no audit trail. The Agent Governance Toolkit is the kernel, the service mesh, and the SRE platform for AI agents.

When an agent calls a tool, say, `DELETE FROM users WHERE created_at < NOW()`, there is typically no policy layer checking whether that action is within scope. There is no identity verification when one agent communicates with another. There is no resource limit preventing an agent from making 10,000 API calls in a minute. And there is no circuit breaker to contain cascading failures when things go wrong.

OWASP Agentic Security Initiative

In December 2025, OWASP published the Agentic AI Top 10: the first formal taxonomy of risks specific to autonomous AI agents. The list reads like a security engineer's nightmare: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents, and more.

If you've ever hardened a production server, these risks will feel both familiar and urgent. The Agent Governance Toolkit is designed to help address all 10 of these risks through deterministic policy enforcement, cryptographic identity, execution isolation, and reliability engineering patterns.

Note: The OWASP Agentic Security Initiative has since adopted the ASI 2026 taxonomy (ASI01–ASI10). The toolkit's copilot-governance package now uses these identifiers with backward compatibility for the original AT numbering.

Architecture: Nine Packages, One Governance Stack

The toolkit is structured as a v3.0.0 Public Preview monorepo with nine independently installable packages:

  • Agent OS: stateless policy engine that intercepts agent actions before execution, with configurable pattern matching and semantic intent classification
  • Agent Mesh: cryptographic identity (DIDs with Ed25519), the Inter-Agent Trust Protocol (IATP), and trust-gated communication between agents
  • Agent Hypervisor: execution rings inspired by CPU privilege levels, saga orchestration for multi-step transactions, and shared session management
  • Agent Runtime: runtime supervision with kill switches, dynamic resource allocation, and execution lifecycle management
  • Agent SRE: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery; production reliability practices adapted for AI agents
  • Agent Compliance: automated governance verification with compliance grading and regulatory framework mapping (EU AI Act, NIST AI RMF, HIPAA, SOC 2)
  • Agent Lightning: reinforcement learning training governance with policy-enforced runners and reward shaping
  • Agent Marketplace: plugin lifecycle management with Ed25519 signing, trust-tiered capability gating, and SBOM generation
  • Integrations: 20+ framework adapters for LangChain, CrewAI, AutoGen, Semantic Kernel, Google ADK, Microsoft Agent Framework, OpenAI Agents SDK, and more

Agent OS: The Policy Engine

Agent OS intercepts agent tool calls before they execute:

from agent_os import StatelessKernel, ExecutionContext, Policy

kernel = StatelessKernel()
ctx = ExecutionContext(
    agent_id="analyst-1",
    policies=[
        Policy.read_only(),                    # No write operations
        Policy.rate_limit(100, "1m"),          # Max 100 calls/minute
        Policy.require_approval(
            actions=["delete_*", "write_production_*"],
            min_approvals=2,
            approval_timeout_minutes=30,
        ),
    ],
)

result = await kernel.execute(
    action="delete_user_record",
    params={"user_id": 12345},
    context=ctx,
)

The policy engine works in two layers: configurable pattern matching (with sample rule sets for SQL injection, privilege escalation, and prompt injection that users customize for their environment) and a semantic intent classifier that helps detect dangerous goals regardless of phrasing. When an action is classified as `DESTRUCTIVE_DATA`, `DATA_EXFILTRATION`, or `PRIVILEGE_ESCALATION`, the engine blocks it, routes it for human approval, or downgrades the agent's trust level, depending on the configured policy.

Important: All policy rules, detection patterns, and sensitivity thresholds are externalized to YAML configuration files. The toolkit ships with sample configurations in `examples/policies/` that must be reviewed and customized before production deployment. No built-in rule set should be considered exhaustive. Policy languages supported: YAML, OPA Rego, and Cedar.
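As a hedged illustration only, a policy file in this style might look like the sketch below. The key names here are hypothetical, not the toolkit's actual schema; the shipped samples in `examples/policies/` are the authoritative reference.

```yaml
# Hypothetical policy file; key names are illustrative, not the real schema.
policies:
  - name: read-only-analyst
    effect: deny
    actions: ["write_*", "delete_*"]
  - name: rate-limit
    effect: throttle
    limit: 100
    window: 1m
  - name: destructive-approval
    effect: require_approval
    actions: ["delete_*", "write_production_*"]
    min_approvals: 2
    approval_timeout_minutes: 30
```

The point of externalizing rules this way is that security teams can review and tune policy without touching agent code.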

The kernel is stateless by design: each request carries its own context. This means you can deploy it behind a load balancer, as a sidecar container in Kubernetes, or in a serverless function, with no shared state to manage. On AKS or any Kubernetes cluster, it fits naturally into existing deployment patterns. Helm charts are available for agent-os, agent-mesh, and agent-sre.

Agent Mesh: Zero-Trust Identity for Agents

In service mesh architectures, services prove their identity via mTLS certificates before communicating. AgentMesh applies the same principle to AI agents using decentralized identifiers (DIDs) with Ed25519 cryptography and the Inter-Agent Trust Protocol (IATP):

from agentmesh import AgentIdentity, TrustBridge

identity = AgentIdentity.create(
    name="data-analyst",
    sponsor="alice@company.com",          # Human accountability
    capabilities=["read:data", "write:reports"],
)
# identity.did -> "did:mesh:data-analyst:a7f3b2..."

bridge = TrustBridge()
verification = await bridge.verify_peer(
    peer_id="did:mesh:other-agent",
    required_trust_score=700,  # Must score >= 700/1000
)

A critical feature is trust decay: an agent's trust score decreases over time without positive signals. An agent trusted last week but silent since then gradually becomes untrusted, modeling the reality that trust requires ongoing demonstration, not a one-time grant.
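The decay mechanic can be sketched with a simple exponential model. This is a minimal illustration of the concept; the toolkit's actual decay curve and parameters are not specified here.

```python
import math

def decayed_trust(score: float, days_since_signal: float,
                  half_life_days: float = 14.0) -> float:
    """Exponentially decay a trust score toward zero while no positive
    signals arrive; half_life_days controls how fast trust erodes."""
    return score * math.exp(-math.log(2) * days_since_signal / half_life_days)

# An agent trusted at 800 that has been silent for one half-life
# drops to ~400, falling below a 700 trust gate:
stale_score = decayed_trust(800, days_since_signal=14)
```

With a 14-day half-life, a week of silence already costs roughly 30 percent of the score, which matches the idea that trust requires ongoing demonstration.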

Delegation chains enforce scope narrowing: a parent agent with read+write permissions can delegate only read access to a child agent, never escalate.
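Scope narrowing is essentially set intersection: a child can never hold a capability its parent lacks. A minimal sketch of the rule (the helper below is hypothetical, not the toolkit's API):

```python
def delegate(parent_caps: set[str], requested: set[str]) -> set[str]:
    """Grant only capabilities the parent itself holds; any request
    beyond the parent's scope is an escalation attempt and is refused."""
    escalation = requested - parent_caps
    if escalation:
        raise PermissionError(f"cannot escalate: {sorted(escalation)}")
    return requested & parent_caps

parent = {"read:data", "write:reports"}
child_caps = delegate(parent, {"read:data"})   # subset: granted
# delegate(parent, {"admin:all"}) would raise PermissionError
```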

Agent Hypervisor: Execution Rings

CPU architectures use privilege rings (Ring 0 for kernel, Ring 3 for userspace) to isolate workloads. The Agent Hypervisor applies this model to AI agents:

  • Ring 0 (Kernel), trust score ≥ 900: full system access; can modify policies
  • Ring 1 (Supervisor), trust score ≥ 700: cross-agent coordination and elevated tool access
  • Ring 2 (User), trust score ≥ 400: standard tool access within assigned scope
  • Ring 3 (Untrusted), trust score < 400: read-only, sandboxed execution only

New and untrusted agents start in Ring 3 and earn their way up: exactly the principle of least privilege that production engineers apply to every other workload.

Each ring enforces per-agent resource limits: maximum execution time, memory caps, CPU throttling, and request rate limits. If a Ring 2 agent attempts a Ring 1 operation, it gets blocked, just like a userspace process trying to access kernel memory.

These ring definitions and their associated trust score thresholds are fully configurable via policy. Organizations can define custom ring structures, adjust the number of rings, set different trust score thresholds for transitions, and configure per-ring resource limits to match their security requirements.
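Ring assignment from a trust score is straightforward threshold matching. The sketch below mirrors the default thresholds above; treat the numbers as sample policy, since ring structures and thresholds are configurable.

```python
# Sample ring policy mirroring the defaults above; fully configurable in practice.
RING_THRESHOLDS = [(900, 0), (700, 1), (400, 2)]  # (minimum score, ring)

def assign_ring(trust_score: int) -> int:
    """Map a trust score to the most privileged ring it qualifies for;
    anything below the lowest threshold lands in Ring 3 (untrusted)."""
    for min_score, ring in RING_THRESHOLDS:
        if trust_score >= min_score:
            return ring
    return 3  # read-only, sandboxed execution only

assert assign_ring(950) == 0   # kernel-level trust
assert assign_ring(710) == 1
assert assign_ring(399) == 3   # untrusted newcomer
```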

The hypervisor also provides saga orchestration for multi-step operations. When an agent executes a sequence (draft email → send → update CRM) and the final step fails, compensating actions fire in reverse. Borrowed from distributed transaction patterns, this ensures multi-agent workflows stay consistent even when individual steps fail.
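The compensation flow can be sketched generically: run each step in order, and if one fails, fire the compensators of the completed steps in reverse. The helper below is an illustrative stand-in, not the toolkit's orchestration API.

```python
def run_saga(steps):
    """Execute (action, compensate) pairs in order; on any failure,
    run the compensators of completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()  # compensating actions fire in reverse
            return False
        completed.append(compensate)
    return True

log = []

def fail_step():
    raise RuntimeError("CRM update failed")  # simulate the final step failing

saga = [
    (lambda: log.append("draft email"), lambda: log.append("undo draft")),
    (lambda: log.append("send email"), lambda: log.append("undo send")),
    (fail_step, lambda: None),
]
ok = run_saga(saga)
# log is now: ['draft email', 'send email', 'undo send', 'undo draft']
```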

Agent SRE: SLOs and Circuit Breakers for Agents

If you practice SRE, you measure services by SLOs and manage risk through error budgets. Agent SRE extends this to AI agents:

When an agent's safety SLI drops below 99 percent, meaning more than 1 percent of its actions violate policy, the system automatically restricts the agent's capabilities until it recovers. This is the same error-budget model that SRE teams use for production services, applied to agent behavior.
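In error-budget terms: a 99 percent safety SLO gives the agent a budget of 1 percent of its actions; once policy violations consume it, capabilities are restricted. A minimal sketch of that bookkeeping (illustrative, not the agent-sre API):

```python
def error_budget_remaining(total_actions: int, violations: int,
                           slo: float = 0.99) -> float:
    """Fraction of the error budget left; a value <= 0 means the agent
    has burned its budget and should have capabilities restricted."""
    budget = total_actions * (1 - slo)  # violations the SLO tolerates
    if budget == 0:
        return 0.0
    return (budget - violations) / budget

# 1,000 actions under a 99% safety SLO tolerate 10 policy violations:
assert abs(error_budget_remaining(1000, 5) - 0.5) < 1e-9  # half the budget left
assert error_budget_remaining(1000, 12) < 0               # exhausted: restrict
```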

We also built nine chaos engineering fault injection templates: network delays, LLM provider failures, tool timeouts, trust score manipulation, memory corruption, and concurrent access races. Because the only way to know if your agent system is resilient is to break it intentionally.

Agent SRE integrates with your existing observability stack through adapters for Datadog, PagerDuty, Prometheus, OpenTelemetry, Langfuse, LangSmith, Arize, MLflow, and more. Message broker adapters support Kafka, Redis, NATS, Azure Service Bus, AWS SQS, and RabbitMQ.

Compliance and Observability

If your organization already maps to CIS Benchmarks, NIST AI RMF, or other frameworks for infrastructure compliance, the OWASP Agentic Top 10 is the equivalent standard for AI agent workloads. The toolkit's agent-compliance package provides automated governance grading against these frameworks.

The toolkit is framework-agnostic, with 20+ adapters that hook into each framework's native extension points, so adding governance to an existing agent is typically a few lines of configuration, not a rewrite.

The toolkit exports metrics to any OpenTelemetry-compatible platform: Prometheus, Grafana, Datadog, Arize, or Langfuse. If you're already running an observability stack for your infrastructure, agent governance metrics flow through the same pipeline.

Key metrics include: policy decisions per second, trust score distributions, ring transitions, SLO burn rates, circuit breaker state, and governance workflow latency.

Getting Started

# Install all packages
pip install agent-governance-toolkit[full]

# Or individual packages
pip install agent-os-kernel agent-mesh agent-sre

The toolkit is available across language ecosystems: Python, TypeScript (`@microsoft/agentmesh-sdk` on npm), Rust, Go, and .NET (`Microsoft.AgentGovernance` on NuGet).

Azure Integrations

While the toolkit is platform-agnostic, we've included integrations that provide the fastest path to production on Azure:

Azure Kubernetes Service (AKS): Deploy the policy engine as a sidecar container alongside your agents. Helm charts provide production-ready manifests for agent-os, agent-mesh, and agent-sre.

Azure AI Foundry Agent Service: Use the built-in middleware integration for agents deployed through Azure AI Foundry.

OpenClaw Sidecar: One compelling deployment scenario is running OpenClaw, the open-source autonomous agent, inside a container with the Agent Governance Toolkit deployed as a sidecar. This gives you policy enforcement, identity verification, and SLO monitoring over OpenClaw's autonomous operations. On Azure Kubernetes Service (AKS), the deployment is a standard pod with two containers: OpenClaw as the primary workload and the governance toolkit as the sidecar, communicating over localhost. We have a reference architecture and Helm chart available in the repository.

The same sidecar pattern works with any containerized agent; OpenClaw is a particularly compelling example because of the interest in autonomous agent safety.

Tutorials and Resources

34+ step-by-step tutorials covering policy engines, trust, compliance, MCP security, observability, and cross-platform SDK usage are available in the repository.

git clone https://github.com/microsoft/agent-governance-toolkit
cd agent-governance-toolkit
pip install -e "packages/agent-os[dev]" -e "packages/agent-mesh[dev]" -e "packages/agent-sre[dev]"

# Run the demo
python -m agent_os.demo

What's Next

AI agents are becoming autonomous decision-makers in production infrastructure, executing code, managing databases, and orchestrating services. The security patterns that kept production systems safe for decades (least privilege, mandatory access controls, process isolation, audit logging) are exactly what these new workloads need. We built them. They're open source.

We're building this in the open because agent security is too important for any single organization to solve alone:

  • Security research: Adversarial testing, red-team results, and vulnerability reports strengthen the toolkit for everyone.
  • Community contributions: Framework adapters, detection rules, and compliance mappings from the community expand coverage across ecosystems.

We are committed to open governance. We're releasing this project under Microsoft today, and we aspire to move it into a foundation home, such as the AI and Data Foundation (AAIF), where it can benefit from cross-industry stewardship. We're actively engaging with foundation partners on this path.

The Agent Governance Toolkit is open source under the MIT license. Contributions welcome at github.com/microsoft/agent-governance-toolkit.


The "IQ Layer": Microsoft’s Blueprint for the Agentic Enterprise


The "IQ Layer": Microsoft’s Blueprint for the Agentic Enterprise

Modern enterprises have experimented with artificial intelligence for years, yet many deployments have struggled to move beyond basic automation and conversational interfaces. The fundamental limitation has not been the reasoning power of AI models—it has been their lack of organizational context.

In most organizations, AI systems historically lacked visibility into how work actually happens. They could process language and generate responses, but they could not fully understand business realities such as:

  • Who is responsible for a project
  • What internal metrics represent
  • Where corporate policies are stored
  • How teams collaborate across tools and departments

Without this contextual awareness, AI often produced answers that sounded intelligent but lacked real business value.

To address this challenge, Microsoft introduced a new architectural model known as the IQ Layer. This framework establishes a structured intelligence layer across the enterprise, enabling AI systems to interpret work activity, enterprise data, and organizational knowledge.

The architecture is built around three integrated intelligence domains:

  • Work IQ
  • Fabric IQ
  • Foundry IQ

Together, these layers allow AI systems to move beyond simple responses and deliver insights that are aligned with real organizational context.

 

The Three Foundations of Enterprise Context

For AI to evolve from a helpful assistant into a trusted decision-support partner, it must understand multiple dimensions of enterprise operations. Microsoft addresses this need by organizing contextual intelligence into three distinct layers.

  • Work IQ: collaboration and work activity signals, built on Microsoft 365, Microsoft Teams, and Microsoft Graph
  • Fabric IQ: structured enterprise data understanding, built on Microsoft Fabric, Power BI, and OneLake
  • Foundry IQ: knowledge retrieval and AI reasoning, built on Azure AI Foundry, Azure AI Search, and Microsoft Purview

Each layer contributes a unique type of intelligence that enables enterprise AI systems to understand the organization from different perspectives.

Work IQ — Understanding How Work Gets Done

The first layer, Work IQ, focuses on the signals generated by daily collaboration and communication across an organization.

Built on top of Microsoft Graph, Work IQ analyses activity patterns across the Microsoft 365 ecosystem, including:

  • Email communication
  • Virtual meetings
  • Shared documents
  • Team chat conversations
  • Calendar interactions
  • Organizational relationships

These signals help AI systems map how work actually flows across teams.

Rather than requiring users to provide background context manually, AI can infer critical information automatically, such as:

  • Project stakeholders
  • Communication networks
  • Decision makers
  • Subject matter experts

For example, if an employee asks:

"What is the latest update on the migration project?"

Work IQ can analyse multiple collaboration sources including:

  • Project discussions in Microsoft Teams
  • Meeting transcripts
  • Shared project documentation
  • Email discussions

As a result, AI responses become grounded in real workplace activity instead of generic information.

Fabric IQ — Understanding Enterprise Data

While Work IQ focuses on collaboration signals, Fabric IQ provides insight into structured enterprise data.

Operating within Microsoft Fabric, this layer transforms raw datasets into meaningful business concepts.

Instead of interpreting information as isolated tables and columns, Fabric IQ enables AI systems to reason about business entities such as:

  • Customers
  • Products
  • Orders
  • Revenue metrics
  • Inventory levels

By leveraging semantic models from Power BI and unified storage through OneLake, Fabric IQ establishes a shared data language across the organization.

This allows AI systems to answer strategic questions such as:

"Why did revenue decline last quarter?"

Instead of simply retrieving numbers, the AI can analyse multiple business drivers, including:

  • Product performance trends
  • Regional sales variations
  • Customer behaviour segments
  • Supply chain disruptions

The outcome is not just data access, but decision-oriented insight.

Foundry IQ — Understanding Enterprise Knowledge

The third layer, Foundry IQ, addresses another major enterprise challenge: fragmented knowledge repositories.

Organizations store valuable information across numerous systems, including:

  • SharePoint repositories
  • Policy documents
  • Contracts
  • Technical documentation
  • Internal knowledge bases
  • Corporate wikis

Historically, connecting these knowledge sources to AI required complex retrieval-augmented generation (RAG) architectures.

Foundry IQ simplifies this process through services within Azure AI Foundry and Azure AI Search.

Capabilities include:

  • Automated document indexing
  • Semantic search capabilities
  • Document grounding for AI responses
  • Access-aware information retrieval

Integration with Microsoft Purview ensures that governance policies remain intact. Sensitivity labels, compliance rules, and access permissions continue to apply when AI systems retrieve and process information.

This ensures that users only receive information they are authorized to access.

From Chatbots to Autonomous Enterprise Agents

The full potential of the IQ architecture becomes clear when all three layers operate together.

This integrated intelligence model forms the basis of what Microsoft describes as the Agentic Enterprise—an environment where AI systems function as proactive digital collaborators rather than passive assistants.

Instead of simple chat interfaces, organizations will deploy AI agents capable of understanding context, reasoning about business situations, and initiating actions.

Example Scenario: Supply Chain Disruption

Consider a scenario where a shipment delay threatens delivery commitments.

Within the IQ architecture:

Fabric IQ
Detects anomalies in shipment or logistics data and identifies potential risks to delivery schedules.

Foundry IQ
Retrieves supplier contracts and evaluates service-level agreements to determine whether penalties or mitigation clauses apply.

Work IQ
Identifies the logistics manager responsible for the account and prepares a contextual briefing tailored to their communication patterns.

Tasks that previously required hours of investigation can now be completed by AI systems within minutes.

Governance Embedded in the Architecture

For enterprise leaders, security and compliance remain critical considerations in AI adoption.

Microsoft designed the IQ framework with governance deeply embedded in its architecture.

Key governance capabilities include:

Permission-Aware Intelligence

AI responses respect user permissions enforced through Microsoft Entra ID, ensuring individuals only see information they are authorized to access.

Compliance Enforcement

Data classification and protection policies defined in Microsoft Purview continue to apply throughout AI workflows.

Observability and Monitoring

Organizations can monitor AI agents and automation processes through tools such as Microsoft Copilot Studio and other emerging agent management platforms.

This provides transparency and operational control over AI-driven systems.

The Strategic Shift: AI as Enterprise Infrastructure

Perhaps the most significant implication of the IQ architecture is the transformation of AI from a standalone tool into a foundational enterprise capability.

In earlier deployments, organizations treated AI as isolated applications or experimental tools.

With the IQ Layer approach, AI becomes deeply integrated across core platforms including:

  • Microsoft 365
  • Microsoft Fabric
  • Azure AI Foundry

This integrated intelligence allows AI systems to behave more like experienced digital employees.

They can:

  • Understand organizational workflows
  • Analyse complex data relationships
  • Retrieve institutional knowledge
  • Collaborate with human teams

Enterprises that successfully implement these intelligence layers will be better positioned to make faster decisions, respond to change more effectively, and unlock new levels of operational intelligence.

References:

  1. Work IQ MCP overview (preview) - Microsoft Copilot Studio | Microsoft Learn
  2. What is Fabric IQ (preview)? - Microsoft Fabric | Microsoft Learn
  3. What is Foundry IQ? - Microsoft Foundry | Microsoft Learn
  4. From Data Platform to Intelligence Platform: Introducing Microsoft Fabric IQ | Microsoft Fabric Blog | Microsoft Fabric

 


Episode 567: Building Voice and Streaming Apps for the Enterprise with Alberto


Brandon interviews Alberto González, CTO of WebRTC.ventures, about building voice and streaming applications for the enterprise. They dig into how developers can integrate WebRTC into their apps, the unique challenges it presents, and where AI fits in. Plus, Alberto shares tips on kitesurfing and why you should visit Barcelona.

Episode Links:

Contact Alberto:

Special Guest: Alberto González.





Download audio: https://aphid.fireside.fm/d/1437767933/9b74150b-3553-49dc-8332-f89bbbba9f92/73ddac53-3ab3-47d4-9691-3a5ea89b93c1.mp3

Episode 61: The Power We Hold: Women Spies, Storytelling, and the Courage to Create


In this episode, we’re joined by Cadie Hopkins and Megan Jaffer, co-founders of Iron Butterfly Media, who have made it their mission to amplify the voices of women across the global intelligence community. They have created their first feature-length documentary, “The Power We Hold: Stories from Women Spies,” which released in March 2026. Cadie and Megan feature ten amazing women in their film, and talk about why they felt compelled to bring these powerful stories to a broader audience. They reflect on the unique leadership lessons women in intelligence bring to the table—from resilience and collaboration to a more “democratic” model of leadership. They also explore what it’s like to build a creative company together, how they brought their documentary to life with donor support, and why storytelling can be such a powerful catalyst for change.





Download audio: https://anchor.fm/s/44cc6cdc/podcast/play/117918357/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-3%2F38487f90-931c-f3fa-5924-15bd12aeea32.mp3

How Drasi used GitHub Copilot to find documentation bugs


For early-stage open-source projects, the “Getting started” guide is often the first real interaction a developer has with the project. If a command fails, an output doesn’t match, or a step is unclear, most users won’t file a bug report—they will just move on.

Drasi, a CNCF sandbox project that detects changes in your data and triggers immediate reactions, is supported by our small team of four engineers in Microsoft Azure’s Office of the Chief Technology Officer, and we move fast. We have comprehensive tutorials, but we are shipping code faster than we can manually test them.

Detect and react to your first database change using Drasi
The team didn’t realize how big this gap was until late 2025, when GitHub updated its Dev Container infrastructure, bumping the minimum Docker version. The update broke the Docker daemon connection—and every single tutorial stopped working. Because we relied on manual testing, we didn’t immediately know the extent of the damage. Any developer trying Drasi during that window would have hit a wall.

This incident forced a realization: with advanced AI coding assistants, documentation testing can be converted to a monitoring problem.

The problem: Why does documentation break?
Documentation usually breaks for two reasons:

  1. The curse of knowledge
    Experienced developers write documentation with implicit context. When we write “wait for the query to bootstrap,” we know to run drasi list query and watch for the Running status, or even better—run the drasi wait command. A new user has no such context. Neither does an AI agent. They read the instructions literally and don’t know what to do. They get stuck on the “how,” while we only document the “what.”
  2. Silent drift
    Documentation doesn’t fail loudly like code does. When you rename a configuration file in your codebase, the build fails immediately. But when your documentation still references the old filename, nothing happens. The drift accumulates silently until a user reports confusion.

This is compounded for tutorials like ours, which spin up sandbox environments with Docker, k3d, and sample databases. When any upstream dependency changes—a deprecated flag, a bumped version, or a new default—our tutorials can break silently.

The solution: Agents as synthetic users
To solve this, we treated tutorial testing as a simulation problem. We built an AI agent that acts as a “synthetic new user.”

This agent has three critical characteristics:

  • It is naïve: it has no prior knowledge of Drasi; it knows only what is explicitly written in the tutorial.
  • It is literal: it executes every command exactly as written. If a step is missing, it fails.
  • It is unforgiving: it verifies every expected output. If the doc says "You should see 'Success'" and the command line interface (CLI) just returns silently, the agent flags it and fails fast.
The stack: GitHub Copilot CLI and Dev Containers
We built a solution using GitHub Actions, Dev Containers, Playwright, and the GitHub Copilot CLI.

Our tutorials require heavy infrastructure:

  • A full Kubernetes cluster (k3d)
  • Docker-in-Docker
  • Real databases (such as PostgreSQL and MySQL)
We needed an environment that exactly matches what our human users experience. If users run in a specific Dev Container on GitHub Codespaces, our test must run in that same Dev Container.

The architecture
Inside the container, we invoke the Copilot CLI with a specialized system prompt (view the full prompt here):


copilot -p "$(cat prompt.md)" \
  --allow-all-tools \
  --allow-all-paths \
  --deny-tool 'fetch' \
  --deny-tool 'websearch' \
  --deny-tool 'githubRepo' \
  --deny-tool 'shell(curl *)' \
  # ... additional deny-tool flags
  --allow-url localhost \
  --allow-url 127.0.0.1
Invoked in prompt mode (`-p`), the CLI gives us an agent that can execute terminal commands, write files, and run browser scripts, just like a human developer sitting at their terminal. For the agent to simulate a real user, it needs these capabilities.

To enable the agents to open webpages and interact with them as any human following the tutorial steps would, we also install Playwright on the Dev Container. The agent also takes screenshots which it then compares against those provided in the documentation.

Security model
Our security model is built around one principle: the container is the boundary.

Rather than trying to restrict individual commands (a losing game when the agent needs to run arbitrary node scripts for Playwright), we treat the entire Dev Container as an isolated sandbox and control what crosses its boundaries: no outbound network access beyond localhost, a Personal Access Token (PAT) with only “Copilot Requests” permission, ephemeral containers destroyed after each run, and a maintainer-approval gate for triggering workflows.

Dealing with non-determinism
One of the biggest challenges with AI-based testing is non-determinism. Large language models (LLMs) are probabilistic—sometimes the agent retries a command; other times it gives up.

We handled this with a three-stage retry with model escalation (start with Gemini-Pro, on failure try with Claude Opus), semantic comparison for screenshots instead of pixel-matching, and verification of core-data fields rather than volatile values.
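The escalation loop itself is simple: try the cheaper model first and step up only on failure. A sketch of the pattern, where the runner function is a placeholder for invoking the Copilot CLI against a tutorial:

```python
MODELS = ["gemini-pro", "claude-opus"]  # escalation order, cheaper model first

def evaluate_with_escalation(run_tutorial, models=MODELS):
    """Run the tutorial evaluation with each model in order,
    returning on the first passing run."""
    for model in models:
        if run_tutorial(model):   # run_tutorial invokes the agent, returns pass/fail
            return True, model
    return False, models[-1]

attempts = []

def fake_run(model):
    attempts.append(model)
    return model == "claude-opus"  # simulate: first model fails, escalation passes

assert evaluate_with_escalation(fake_run) == (True, "claude-opus")
assert attempts == ["gemini-pro", "claude-opus"]
```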

We also have a list of tight constraints in our prompts that prevent the agent from going on a debugging journey, directives to control the structure of the final report, and also skip directives that tell the agent to bypass optional tutorial sections like setting up external services.

Artifacts for debugging
When a run fails, we need to know why. Since the agent is running in a transient container, we can’t just Secure Shell (SSH) in and look around.

So our agent preserves evidence of every run: screenshots of web UIs, terminal output of critical commands, and a final markdown report detailing its reasoning, as shown here:

Drasi Getting Started Tutorial Evaluation

Environment

  • Timestamp: 2026-02-20T13:32:07.998Z
  • Directory: /workspaces/learning/tutorial/getting-started

Step 1: Setup Drasi Environment

  • Skipped as per instructions (already in DevContainer).
  • Verified environment setup by checking resources folder existence.

Step 2: Create PostgreSQL Source

  • Command: drasi apply -f ./resources/hello-world-source.yaml

………………………………
………… more steps ……….
………………………………

Scenario 1: hello-world-from

  • Initial check: “Brian Kernighan” present. (Screenshot: 09_hello-world-from.png)
  • Action: Insert ‘Allen’, ‘Hello World’.
  • Verification: “Allen” appeared in UI. (Screenshot: 10_hello-world-from-updated.png)
  • Result: PASSED

……………………………………………………
….. more validation by playwright taking screenshots …..
……………………………………………………

Conclusion

The tutorial instructions were clear and the commands executed successfully. The expected behavior matches the actual behavior observed via the Debug Reaction UI.

STATUS: SUCCESS

These artifacts are uploaded to the GitHub Action run summary, allowing us to “time travel” back to the exact moment of failure and see what the agent saw.

[Screenshot: the agent's report output in a folder with other files]
Parsing the agent’s report
With LLMs, getting a definitive “Pass/Fail” signal that a machine can understand can be challenging. An agent might write a long, nuanced conclusion rather than a clear verdict.

To make this actionable in a CI/CD pipeline, we had to do some prompt engineering. We explicitly instructed the agent to end every report with a single, unambiguous status line, like the STATUS: SUCCESS line in the report above.

In our GitHub Action, we then simply grep for that status line to set the exit code of the workflow.

Simple techniques like this bridge the gap between AI’s fuzzy, probabilistic outputs and CI’s binary pass/fail expectations.
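Concretely, the gate can be a few lines of shell in the workflow. In this sketch the report filename and its contents are stand-ins; only the STATUS: SUCCESS marker comes from the report format shown above.

```shell
# Sketch of the CI pass/fail gate. The filename and report contents
# below are stand-ins for the agent's actual output.
REPORT="agent-report.md"
printf 'Conclusion: all steps behaved as documented.\nSTATUS: SUCCESS\n' > "$REPORT"

# Grep for the definitive status line to decide the workflow's exit code.
if grep -q '^STATUS: SUCCESS$' "$REPORT"; then
  echo "tutorial evaluation passed"
else
  echo "tutorial evaluation failed"
  exit 1
fi
```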

Automation
We now have an automated version of the workflow that runs weekly. It evaluates all our tutorials in parallel; each tutorial gets its own sandbox container and a fresh perspective from the agent acting as a synthetic user. If any tutorial evaluation fails, the workflow is configured to file an issue on our GitHub repo.

This workflow can optionally also run on pull requests, but to prevent attacks we have added a maintainer-approval requirement and use a pull_request_target trigger, which means that even on pull requests from external contributors, the workflow that executes is the one in our main branch.

Running the Copilot CLI requires a PAT, which is stored in the environment secrets for our repo. To make sure it does not leak, each run requires maintainer approval, except the automated weekly run, which runs only against the main branch of our repo.
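Put together, the triggers and approval setup might look roughly like this workflow skeleton. The job and environment names are placeholders, and routing approval through a protected environment is an assumption about how the maintainer-approval requirement is wired up; pull_request_target and scheduled triggers are standard GitHub Actions features.

```yaml
# Illustrative skeleton; names and the cron schedule are placeholders.
on:
  schedule:
    - cron: "0 6 * * 1"      # weekly run, main branch only
  pull_request_target:        # executes the workflow definition from main,
    branches: [main]          # even for PRs from external contributors

jobs:
  evaluate-tutorials:
    runs-on: ubuntu-latest
    environment: tutorial-eval   # assumed: protected environment requiring
                                 # maintainer approval before secrets are exposed
    steps:
      - uses: actions/checkout@v4
      # ...run the agent inside the Dev Container with the scoped PAT...
```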

What we found: Bugs that matter
Since implementing this system, we have run over 200 “synthetic user” sessions. The agent identified 18 distinct issues, including some serious environment problems and documentation issues like the ones below. Fixing them improved the docs for everyone, not just the bot.

Implicit dependencies: In one tutorial, we instructed users to create a tunnel to a service. The agent ran the command, and then—following the next instruction—killed the process to run the next command.
The fix: We realized we hadn’t told the user to keep that terminal open. We added a warning: “This command blocks. Open a new terminal for subsequent steps.”
Missing verification steps: We wrote: “Verify the query is running.” The agent got stuck: “How, exactly?”
The fix: We replaced the vague instruction with an explicit command: drasi wait -f query.yaml.
Format drift: Our CLI output had evolved. New columns were added; older fields were deprecated. The documentation screenshots still showed the 2024 version of the interface. A human tester might gloss over this (“it looks mostly right”). The agent flagged every mismatch, forcing us to keep our examples up to date.
AI as a force multiplier
We often hear about AI replacing humans, but in this case, the AI is providing us with a workforce we never had.

To replicate what our system does—running six tutorials across fresh environments every week—we would need a dedicated QA resource or a significant budget for manual testing. For a four-person team, that is impossible. By deploying these Synthetic Users, we have effectively hired a tireless QA engineer who works nights, weekends, and holidays.

Our tutorials are now validated weekly by synthetic users—try the Getting Started guide yourself and see the results firsthand. And if you’re facing the same documentation drift in your own project, consider GitHub Copilot CLI not just as a coding assistant, but as an agent—give it a prompt, a container, and a goal—and let it do the work a human doesn’t have time for.

The post How Drasi used GitHub Copilot to find documentation bugs appeared first on Microsoft Azure Blog.
