Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

From capability to responsibility: Securing our global digital ecosystem with next‑generation AI


Cybersecurity is at a turning point. Advanced AI models are dramatically accelerating vulnerability discovery and creating conditions ripe for exploitation, underscored by the announcement of Claude Mythos Preview. This marks a shift, and whether this technology will favor defenders or attackers will depend on the choices we make now. 

With the right safeguards, these capabilities can help trusted defenders identify and fix vulnerabilities across critical systems in hospitals, power grids, water, and telecommunications. Released irresponsibly or not properly secured, however, those same capabilities could be abused by malicious actors, threatening the foundations of our digital ecosystem. 

Much of the discussion has rightly focused on risks. As advanced AI models speed up the discovery of vulnerabilities, the way we fix them must speed up too. That means stronger pre-deployment risk assessments and close collaboration between governments, frontier AI developers, software providers, and the broader ecosystem to ensure these tools reduce, rather than increase, cyber risk. This is particularly important as AI systems themselves have become high‑value targets, requiring stronger protection of models, systems, data, and underlying infrastructure. 

This is ultimately an international challenge. Neither software supply chains nor threat actors stop at borders. Neither can our response. Meeting this moment will require shared approaches across countries, sectors, and systems—rooted in trust, shared standards, resilience, and responsible use. 

This moment is also an opportunity. Security has been and remains the top priority at Microsoft. Over the last two years, through our Secure Future Initiative, we have strengthened our security foundations for this age of AI, in part by using AI to accelerate vulnerability discovery and remediation. We have also invested in fundamental AI for security research, including the development of open-source industry benchmarks that can be used to evaluate whether models are ready for real-world security work. We are accelerating that work through deeper public-private collaboration and in partnership with AI developers, including through Anthropic’s Project Glasswing and OpenAI’s Trusted Access for Cyber program. 

Securing our digital ecosystem with next‑generation AI is within reach, but it is not automatic. 

Building secure foundations for the era of frontier AI  

Ensuring advanced AI technologies are used to strengthen cybersecurity requires deliberate and urgent action. We are sharing the following recommendations as practical steps governments, industry, and the broader ecosystem can take to ensure these tools, often referred to as “frontier AI,” reinforce the security foundations on which digital societies depend. We hope to continue partnering with model providers, industry, and government to improve security outcomes for all. 

1. Reinforce core cybersecurity practices  

Advanced AI can strengthen cybersecurity only when strong, consistent cyber hygiene is already in place. As frontier AI accelerates vulnerability discovery and response, core practices such as rapid patching, access control, and system resilience become more critical, not less. 

Security gains in the frontier AI era depend on close coordination between technology providers advancing new capabilities and the organizations responsible for operating, updating, and securing real‑world systems. Without this coordination, advanced AI cannot deliver durable improvements in security. No organization can solve these cybersecurity problems alone. 

That is why sustained investment in what we know works remains essential: secure‑by‑design product lifecycles, Zero Trust architectures, multi‑factor authentication, least‑privileged access, and ongoing security training. Equally important are broad adoption and harmonization of established cybersecurity frameworks to ensure consistent resilience across AI‑enabled systems, and trusted cloud environments that enable these practices at scale, supporting secure data handling, continuous patching, and the secure deployment of AI‑enabled tools for defenders. 

2. Release advanced capabilities responsibly 

As frontier AI systems gain reasoning, coding, and agentic capabilities, some of the most serious security risks arise before deployment, including realistic misuse involving multi‑step reasoning, tool use, and reconnaissance. Technical safety benchmarks remain important, but they are insufficient without rigorous, real‑world testing.  

As a result, governments are increasingly establishing pre‑deployment evaluations that combine technical testing with threat modeling. These assessments are most effective when frontier developers work closely with organizations that track national‑security risks. Investing in secure evaluation environments and modern testing methods can help governments keep pace as capabilities advance.  

Responsible release practices, including phased and controlled access, are a critical extension of this approach. Our work with Anthropic on Project Glasswing offers one practical model, enabling trusted defenders to evaluate advanced capabilities in constrained settings prior to broader release. Similarly, OpenAI and Microsoft work closely through the Trusted Access for Cyber program, and we already support OpenAI’s use of scoped, early deployments for safety and security testing. 

Responsibility does not end at release. Organizations that deploy frontier models are often best positioned to detect emerging misuse and should monitor, mitigate, and share threat information. Microsoft is working with peers through the Frontier Model Forum to advance best practices for evaluating and managing cyber risk and enable information sharing. Governments should encourage continued industry collaboration to restrict access for identified threat actors and counter adversarial or malicious use of advanced AI. 

3. Modernize vulnerability management 

AI is changing both the speed of vulnerability discovery and what constitutes meaningful security risk. Faster discovery only improves security if triage, validation, and remediation can keep up. 

As AI accelerates discovery, vulnerability management must shift from tracking raw volume to reducing real‑world risk. That means prioritizing vulnerabilities that are genuinely exploitable, assigning clear responsibility for triage and remediation, and using phased, risk‑based disclosure when private coordination improves safety. Above all, systems must be designed around validation and realistic remediation capacity, not the assumption that more findings automatically lead to better security. 

Developers of frontier AI models should embed vulnerability coordination and disclosure directly into responsible‑release frameworks, and work with governments and industry to ensure findings are routed to the right owners, acted on early, and supported by clear coordination pathways. 

4. Fix faster: Strengthen and accelerate response and remediation 

As AI accelerates vulnerability discovery, remediation must keep pace. Initiatives such as DARPA’s AI Cyber Challenge show how AI can help both find and fix flaws in open‑source software. Hardening defenses requires investment not just in detection tools but in the people, processes, and infrastructure responsible for fixing vulnerabilities, especially in critical sectors. 

Much of the software underpinning critical infrastructure relies on open‑source components maintained by small teams or volunteers with limited security capacity. A surge in AI‑enabled discovery risks overwhelming existing triage and disclosure processes. Efforts such as the GitHub Secure Open Source Fund, alongside investments by Microsoft and others through the Linux Foundation, Alpha‑Omega, and OpenSSF, are helping maintainers adapt in ways that are practical and aligned with existing workflows. 

Governments should treat remediation capacity as a core resilience priority, including sustained investment in and support for maintainers, surge capacity during large discovery events, and modernized disclosure pathways—recognizing that effective remediation still largely depends on human judgment, coordination, and time.  

5. Advance AI security internationally 

AI security is essential to deploying AI at scale. Because AI systems, supply chains, and the risks they introduce operate across borders, national approaches alone will not be sufficient. 

Governments and industry should work together to build interoperable international foundations for AI security, including risk evaluation, coordinated vulnerability disclosure, and information sharing. Priorities should include strengthening the defensive use of AI, preventing misuse through shared norms and safeguards, and securing AI systems and the broader AI technology stack. 

Global participation is critical. Countries and organizations with limited cybersecurity resources or legacy infrastructure are often the most exposed. International cooperation should prioritize capacity building, ensuring that the security benefits of AI are realized broadly and equitably. 

AI security is not just a safeguard; it is an enabler for innovation and growth. By acting collectively and moving quickly, governments and industry can strengthen global digital resilience and unlock the trusted adoption of AI across economies, critical infrastructure, and public services.

Meeting the moment: Use frontier AI capabilities to build trust and confidence  

Meeting this moment is ultimately about trust: not in any single technology or provider, but in our collective ability to introduce advanced AI responsibly.  

Used deliberately and built on strong security foundations, these capabilities can strengthen cybersecurity and reinforce confidence in the systems society depends on. The choice is not between innovation and security but whether we enable them to reinforce one another. 

That outcome is within reach. With governments, industry, and infrastructure operators aligned, advanced AI can be deployed in ways that match real‑world defensive capacity and support trusted, lawful action. Done right and working together, frontier AI can help protect the digital infrastructure that underpins modern life and build lasting confidence in its resilience. 

The post From capability to responsibility: Securing our global digital ecosystem with next‑generation AI appeared first on Microsoft On the Issues.


Dispatches from O'Reilly: Fast Paths and Slow Paths

Selective control in autonomous AI systems: Why governing every decision breaks autonomy—and how runtime control actually works at scale

Agentic Data Science Pair Programming With marimo pair


How do you add agent skills to your data science workflow? How can a coding agent assist with data wrangling and research? This week on the show, Trevor Manz from marimo joins us to discuss marimo pair.

Trevor is a founding engineer at marimo, where he’s been working on integrating LLM tools with marimo. We discuss the balancing act of building a skill and determining how to give an agent access to all the variables in a notebook. He shares how they built a specialized reactive REPL that eliminates hidden state and allows the agent to continue constructing a reproducible Python program.

We dig into installing and getting started with marimo pair. Trevor also covers several of the tasks an agent can tackle in a data science workflow.

Video Course Spotlight: Getting Started With marimo Notebooks

Discover how marimo notebooks simplify coding with reactive updates, UI elements, and sandboxing for safe, shareable notebooks.

Topics:

  • 00:00:00 – Introduction
  • 00:02:26 – Trevor’s role at marimo
  • 00:03:08 – Current AI tools in marimo
  • 00:06:26 – Describing marimo notebooks
  • 00:10:11 – What is marimo pair?
  • 00:18:49 – Building an agent skill
  • 00:27:34 – Setup & installation
  • 00:31:16 – Video Course Spotlight
  • 00:32:42 – Examples of EDA and data wrangling
  • 00:45:46 – Experimenting inside of a notebook
  • 00:50:40 – Managing context
  • 00:53:25 – Accessing additional libraries
  • 00:57:16 – Recent tools and updates from the marimo community
  • 00:59:31 – What are you excited about in the world of Python?
  • 01:01:10 – What do you want to learn next?
  • 01:02:26 – How can people follow your work online?
  • 01:03:13 – Thanks and goodbye

Show Links:

Level up your Python skills with our expert-led courses:

Support the podcast & join our community of Pythonistas





Download audio: https://dts.podtrac.com/redirect.mp3/files.realpython.com/podcasts/RPP_E293_02_Trevor_marimo-pair.7cd045a838d8.mp3

OpenAI’s Big Reset + A.I. in the Doctor’s Office + Talkie, a pre-1930s LLM

Will the rising tide of A.I. adoption lift all boats?

Episode 514 - Literally Writing The Book on Technical Interviews w/ David McCarter


If you want to check out all the things torc.dev has going on, head to linktr.ee/taylordesseyn for more information on how to get plugged in!





Download audio: https://anchor.fm/s/ce6260/podcast/play/119323363/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-4-1%2Fb0bcfd12-a9a6-4254-12c2-60540d2ff829.mp3

A Virtual Agent team at Docker: How the Coding Agent Sandboxes team uses a fleet of agents to ship faster


I work on Coding Agent Sandboxes, aka “sbx” at Docker. The project provides secure, microVM-based isolation for running AI coding agents like Claude Code, Gemini, Codex, Docker Agent and Kiro. Agents get full autonomy inside a sandbox (their own Docker daemon, network, filesystem) without touching your host system. Over the past couple of weeks, we built something on top of it: a virtual team of seven AI agent roles that test the product, triage issues, post release notes, and even fix bugs, all running autonomously in CI. We call it the Fleet.

The Fleet is built on Claude Code skills: markdown files that give an agent a persona, a set of responsibilities, and the tools it’s allowed to use. Think of a skill not as a script that says “run these steps,” but as a role description that says “you are the build engineer, here’s what you know and how you make decisions.” That distinction matters because agents need judgment, not just instructions. When a test fails unexpectedly, a script stops. A role investigates.

The same skill file, the same behavior, whether it runs on a developer’s laptop or in CI.

Local First, CI Second

Coding Agent Sandboxes is a CLI tool (sbx) that manages sandbox lifecycles: create, start, stop, remove, configure networking, mount workspaces, and more. It runs on MacOS, Linux and Windows. Every release needs testing across both platforms, across upgrade paths between versions, and under sustained load to catch resource leaks. The team also needs daily visibility into what shipped, and a way to triage the growing issue backlog without it becoming a full-time job.

We could have written traditional test scripts and reporting tools. Instead, we built agent roles that handle these tasks autonomously, both on our laptops and in CI.

The design principle behind the Fleet is simple: every skill runs on your machine first.

When we built the /cli-tester skill (the Fleet’s exploratory tester, more on that below), we didn’t start by writing a GitHub workflow. We started by invoking it locally. We watched it build the binaries, exercise the CLI commands, find issues, and report them. We tweaked the skill until it did the right thing in our terminal. Only then did we wire it into a workflow.

This matters because the alternative is painful. If you build CI-only agents, you debug them through commit-push-wait-read-logs cycles. Every iteration takes minutes. When the skill runs locally first, the iteration takes seconds. You see the agent think. You see where it gets confused. You fix the skill file, re-invoke, and try again.

CI is just another runtime for the same skill. The /cli-tester that runs nightly on MacOS, Linux and Windows runners is the exact same skill we invoke from our terminals. The workflow sets up the environment, checks out the code, and calls the skill. That’s it. No separate “CI version.” No translation layer. One skill, two runtimes.

This is what makes the Fleet practical. You’re not maintaining two systems. You’re maintaining one set of skills and a set of workflows that invoke them.
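As a rough sketch of the one-skill-two-runtimes idea (names are hypothetical; `run_skill` is a stand-in for however the agent is actually invoked, not a real command from the project):

```shell
#!/bin/sh
# Hypothetical wrapper: the same skill entry point on a laptop or a CI runner.
# Only the runtime label differs; the skill invocation itself is identical.
run_skill() {
  echo "invoking skill '$1' (runtime: $2)"
}

# GitHub Actions runners export CI=true; a developer laptop normally does not.
if [ -n "${CI:-}" ]; then
  RUNTIME=ci
else
  RUNTIME=local
fi

OUT=$(run_skill cli-tester "$RUNTIME")
echo "$OUT"
```

The point of the sketch is the shape, not the stub: whatever the wrapper does, the skill it calls is the same file in both environments, so debugging happens in the fast local path first.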

The Roster

The skills directory has 20 skills in total. Most are foundational knowledge (architecture, code style, Go conventions, security, testing patterns). Seven of them are the Fleet: the roles that run autonomously on CI. Each one is a SKILL.md file that describes a persona, not a procedure.


/build-engineer is the foundation that other skills stand on. It references topic files for building binaries, container templates, and local installs. It knows the Taskfile.yml, the docker-bake.hcl, and the platform-specific build flags. It doesn’t run on CI by itself. Other skills load it when they need to compile anything.

/project-manager is the team’s memory. It deduplicates findings against existing issues and PRs before creating new ones, manages the GitHub Projects board (setting status, priority, and labels), and handles interactive triage when running locally. On CI, it switches to fully automatic mode: no questions asked, just deduplicate and create. It uses GraphQL pagination to scan the entire project board, not just the first page. Every other skill that discovers something calls the project-manager before opening an issue.

/product-owner translates commit-speak into human language. It collects merged PRs from a date range, categorizes them (New Features, Bug Fixes, Improvements, Documentation, Maintenance), and rewrites each one in plain English. “feat(cli): add TZ env passthrough” becomes “Docker Sandboxes now automatically use your local timezone.” On CI, it outputs Slack Block Kit JSON. Locally, it renders a markdown table. It filters out noise from bots (Dependabot bumps, workflow-only changes) and skips posting when there’s nothing meaningful to report.

/cli-tester is the exploratory tester of the Fleet, and it’s the largest skill by far. Unlike traditional test scripts that assert expected output and fail on any deviation, the cli-tester investigates what it finds. When output doesn’t match expectations, it asks why before filing a bug.

It defines 52+ test scenarios organized into 14 tiers: Core Lifecycle, Agent Smoke, Workspace, Network Policy, Sandbox Features, Blueprint, CLI UX, Environment, Code Tasks, Agent Network, Reliability, Collaboration, Error Recovery, and Human-Only (skipped in CI). It builds the binaries through the build-engineer, triages findings through the project-manager, and loads product scenarios defined by the actual Product Manager on the team. It monitors disk space during testing, posts an executive summary to Slack when it finishes, and runs nightly on CI across MacOS, Linux and Windows.

It also powers a slash command on GitHub. When someone comments /cli-tester-review on a pull request, CI spins up three runners (MacOS, Linux and Windows), each loading the skill to exercise the PR’s changes on that platform. The agents explore the code, run the scenarios, and post their findings as comments directly on the pull request.

/performance-tester runs in two modes. Lifecycle Endurance repeatedly cycles create/stop/rm to detect reliability issues and resource leaks, producing xUnit JSON output. Code Exploration Benchmark clones a real Git repository and compares host-vs-sandbox I/O performance and Claude Code session behavior. Both modes measure disk usage over time and flag regressions. The goal is catching the slow degradation that no single test run would notice.

/upgrade-tester runs a four-phase test plan. Phase A creates pre-upgrade state (sandboxes, configurations). Phase B installs the new version. Phase C verifies everything still works after the upgrade. Phase D optionally downgrades and verifies again. It takes two version tags as input, builds the binaries for each, creates VMs, and produces an executive summary with pass/fail per phase. Upgrade regressions are the kind of bug that’s invisible in a single-version test suite.

/software-engineer operates in two modes. Reactive: when someone adds the agent-fix label to a GitHub issue, a MacOS runner picks it up and runs a ralph-loop to work the issue, contributing a PR with minimal, focused changes. Proactive: weekly, it runs in architect mode, scanning the codebase for quality issues, producing up to five findings, triaging them through the project-manager, then spawning three MacOS runners in parallel to fix three of them. Each runner delivers a PR targeting a specific simplification or tech-debt reduction.

Skills That Compose

Individual skills are useful. Skills that load other skills are a team.

The seven Fleet roles sit on top of thirteen foundational skills: architecture, code style, Go conventions, software design, security, testing patterns, development workflow, git worktrees, and others. The foundational skills encode project knowledge. The Fleet roles encode behavior. A Fleet role loads the foundational skills it needs, the same way a new team member reads the project’s contributing guide before writing code.

The /cli-tester doesn’t know how to build binaries. It loads the /build-engineer for that. It doesn’t know whether the bug it found is a duplicate. It loads the /project-manager to check. The tester focuses on testing. The builder focuses on building. The manager focuses on triaging. Each role stays in its lane, and the composition creates something none of them could do alone.

The /software-engineer follows the same pattern. It loads the /build-engineer so it can compile the project, and it loads coding best practices and software design conventions so its output meets the team’s standards. The skill doesn’t try to encode everything. It delegates to the foundational skills.

The /performance-tester loads the /cli-tester, extending it with duration and metrics. Instead of duplicating the testing logic, it reuses it and adds a measurement layer on top.

This is the skills-as-roles principle in practice. When you design skills as personas with clear responsibilities (instead of step-by-step commands), they compose naturally. A tester that loads a builder and a manager is doing the same thing a human tester does: asking a colleague to compile the project and checking with the PM before filing a bug. The difference is that the “asking” happens through skill composition instead of a Slack message.
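A minimal toy of that delegation pattern, with shell functions standing in for skill loads (the function names and messages are invented for illustration, not the real SKILL.md mechanism):

```shell
#!/bin/sh
# Hypothetical sketch of skill composition: each role delegates to
# foundational "skills" (modeled here as functions) rather than
# re-implementing what they know.
build_engineer() { echo "binaries built"; }
project_manager() { echo "triage: checked '$1' against existing issues"; }

cli_tester() {
  build_engineer                     # tester asks the builder to compile
  finding="create/stop cycle leaves a stale mount"
  project_manager "$finding"         # and the PM to deduplicate before filing
  echo "scenario run complete"
}

FLEET_OUT=$(cli_tester)
echo "$FLEET_OUT"
```

The tester's own body stays small because everything outside its lane is a delegated call, which mirrors how the real roles compose.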

The Ralph-Loop Is the Engine

The Ralph Wiggum loop is a pattern popularized by Geoffrey Huntley in 2025: a Bash loop that keeps feeding an AI coding agent the same task until the work is done. At its simplest, it’s while :; do cat PROMPT.md | claude-code ; done. Each iteration spawns a fresh agent with a clean context window. The agent reads the task, implements one piece, runs the tests, commits if they pass, and exits. The loop restarts, and the next iteration picks up where the previous one left off. Instead of hoping for first-try perfection, you design for iteration.

Our implementation of this pattern is called a Ralph-loop. The Fleet skills define what each agent role knows. The Ralph-loop defines how the iteration runs.

Our Ralph-loop is a composite GitHub Action backed by a shell script that adds a layer on top of the basic pattern: a separate worker and reviewer. It fetches the issue context, creates a working branch, and iterates: the worker implements changes and writes a summary, the reviewer evaluates the diff and decides SHIP or REVISE. If REVISE, the feedback goes back to the worker for another pass. Up to five iterations by default. If the reviewer says SHIP, the loop pushes the branch, creates a PR, and comments on the original issue.

The worker and reviewer run as separate Claude invocations with different models. The worker uses Opus for implementation. The reviewer uses Opus with 1M context to evaluate the full diff against the task requirements. Each one loads the /software-engineer skill (which in turn loads the build-engineer and coding best practices), so they share the same project knowledge but apply it from different perspectives.

Separating generation from evaluation is deliberate. The same agent that wrote the code shouldn’t evaluate whether the code is good. It’s the oldest principle in quality assurance: the person who built the thing shouldn’t be the only person who tests it. The worker’s job is to solve the problem. The reviewer’s job is to decide whether the problem is actually solved.

The Ralph-loop works locally too. The same ralph-loop.sh script that CI calls can be invoked from your terminal with --issue-number 42. Locally, it parses CLI arguments instead of reading environment variables, and outputs plain text instead of streaming JSON. Same loop, same prompts, same iteration pattern. We debugged the worker and reviewer prompts on our laptops before they ever ran in CI.

The workflows handle scheduling and triggering: nightly cron for the testers, label events for the software-engineer, weekly cron for the architect mode. The Ralph-loop handles the iteration pattern. The skills handle the domain knowledge. Three layers, each with a clear job.

This separation is what made the Fleet possible to build in a couple of weeks. We didn’t have to reinvent the automation loop for every role. The Ralph-loop already knew how to iterate. We just needed to give each role its own skill file and wire the triggers.
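The worker/reviewer iteration described above can be sketched as follows. This is a toy model: `worker` and `reviewer` stand in for the separate Claude invocations, and the verdict logic is faked (one forced revision, then SHIP) purely so the control flow is visible:

```shell
#!/bin/sh
# Hypothetical sketch of the worker/reviewer ralph-loop.
# In the real pipeline each call is a separate Claude invocation with its
# own model and context; here they are stubbed out.
MAX_ITERS=5

worker() { echo "iteration $1: implemented a change and wrote a summary"; }
reviewer() {
  # Stub verdict: demand one revision, then ship.
  if [ "$1" -ge 2 ]; then echo SHIP; else echo REVISE; fi
}

i=1
verdict=REVISE
while [ "$i" -le "$MAX_ITERS" ] && [ "$verdict" != "SHIP" ]; do
  worker "$i"
  verdict=$(reviewer "$i")
  echo "iteration $i verdict: $verdict"
  i=$((i + 1))
done
# On SHIP the real loop pushes the branch and opens a PR; if MAX_ITERS is
# exhausted without a SHIP, it stops without shipping anything.
```

Separating the two calls is what makes the cap on iterations meaningful: the loop ends either because an independent evaluator approved the diff or because the budget ran out, never because the worker declared itself done.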

What the Fleet Ships

The Fleet has been running for a couple of weeks. Here’s what it delivers.

Automated issue resolution. A team member labels an issue with agent-fix. The CI grabs a MacOS runner, reads the issue, and starts working. The result is a pull request that addresses the issue. Not every PR lands without changes, but the first draft is there for review, often within the hour.

Daily release notes. The product-owner traverses the git log every day and posts a Slack summary for stakeholders. No one has to manually compile “what shipped this week.” The stakeholders see progress in real time, at the speed the team actually moves.

Nightly exploratory testing. The cli-tester runs every night on MacOS, Linux, and Windows. It loads the product scenarios that the Product Manager has defined, exercises the CLI, and opens issues for anything it finds. Before opening an issue, it checks for duplicates through the project-manager. When it finishes, it posts a Slack message with the results.

Performance and upgrade testing. The performance-tester and upgrade-tester run on CI across both platforms. Disk usage regressions, behavioral differences between sandbox and non-sandbox modes, and version compatibility issues get caught before they reach a human reporter.

Weekly tech-debt reduction. Every week, the software-engineer runs in architect mode. It reviews the codebase, identifies three spots where code can be simplified or legacy patterns can be cleaned up, spawns three parallel runners, and delivers three PRs. Each one is a small, focused improvement. Over time, they compound.

What We Don’t Automate

The Fleet creates pull requests. It does not merge them.

That’s the trust boundary, and it’s deliberate. Merge decisions stay with humans. So do architectural choices, scope decisions, and prioritization. The agents do the work. The team decides what work matters and whether the output meets the bar.

The supervision model scales the same way it works on a developer’s laptop. When we run multiple agents locally in parallel worktrees, we review their output before merging. With the Fleet, the team supervises seven agent roles running on CI. The shape of the oversight is the same: review the output, approve or adjust, move on. The difference is that the agents don’t need anyone’s laptop to start working.

The Fleet is not replacing the team. It’s extending it. Seven roles that handle repetitive, well-defined work so humans can focus on work that requires judgment, context, and taste. The Fleet has many arms, but the team still steers the ship.

What We Learnt Building the Fleet

Start with the foundation, not the flashiest skill. We started with the /cli-tester because testing the CLI felt like the highest-value target. But it needed to build binaries, triage issues, and load product scenarios, all things that depended on other skills we hadn’t written yet. We should have started with the /build-engineer, the skill everything else stands on. The second skill was better because of what we learned from the first. Don’t design the full fleet upfront.

Build locally first, deploy to CI second. The commit-push-wait-read-logs cycle is where velocity goes to die. If you can’t debug a skill in your terminal, it’s not ready for a workflow. Some behaviors only surface on CI runners (different OS, permissions, network constraints), and those iterations cost hours of wall-clock time. Minimize what can only be tested in CI.

Write skills as roles, not scripts. Ask yourself: “If a new team member joined tomorrow with this exact role, what would I tell them?” What do they need to know? What tools can they use? How should they handle ambiguity? That conversation is your SKILL.md. “You are the build engineer, here’s what you know” produces better judgment than “run these five steps.” When something unexpected happens, a role investigates. A script stops.

Compose skills like you compose teams. The /cli-tester doesn’t know how to build binaries or triage bugs. It loads the /build-engineer and /project-manager for that. Each role stays in its lane. The composition creates what none of them could do alone.

Separate generation from evaluation. The agent that wrote the code shouldn’t be the only one that reviews it. Our Ralph-loop uses a worker and a reviewer for a reason: the oldest principle in quality assurance applies to agents too.

Triage matters more than detection. The /cli-tester initially filed issues for every unexpected output. Transient failures, timing-dependent behavior, environment quirks: everything became an issue. The signal-to-noise ratio got bad enough that the team started ignoring findings. Getting the triage right (deduplication, confirming before filing) took longer than building the tester itself.

And one more thing: all Fleet agents, even on ephemeral CI runners, run inside Coding Agent Sandboxes. We test with what our users use.
