Microsoft is looking into ways it can integrate OpenClaw-style features into 365 Copilot, according to a report from The Information. The test reportedly comes as part of efforts to make its 365 Copilot AI assistant "run autonomously around the clock" while completing tasks on behalf of users.
Omar Shahine, Microsoft's corporate vice president, confirmed to The Information that the company is "exploring the potential of technologies like OpenClaw in an enterprise context." OpenClaw is an open-source platform that allows users to create AI-powered agents that run locally on a user's device. The platform rose in popularity earlier this year, …
In complex, long-running agentic systems, maintaining alignment and coherent reasoning between agents requires careful design. In this second article of our series, we explore these challenges and the mechanisms we built to keep teams of agents working productively over long time spans. We present a range of complementary techniques that balance the conflicting requirements of continuity and creativity.
In our first article, we introduced our agentic security investigation service. We described how teams of AI agents collaboratively investigate security alerts. A Director orchestrates the investigation, many specialist Experts gather evidence, and a Critic reviews the Experts’ findings. We suggest you read the series in order.
To briefly recap, our investigation process proceeds through a series of defined phases. Each phase implements a distinct set of agent interactions. Within phases, we may have multiple rounds, where each round is one full iteration through the phase. There’s no preset limit on the number of rounds that make up an investigation: investigations continue until concluded by the Director agent.
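The phase-and-round control flow described above can be sketched in a few lines of Python. Everything here is illustrative: the callback signature, the outcome strings, and the safety cap are our own assumptions, not the actual system's interfaces.

```python
def run_investigation(run_round, phases, max_rounds_per_phase=100):
    """Run each phase as a series of rounds.

    run_round(phase, n) stands in for one full iteration through a phase
    and returns "continue", "phase_complete", or "concluded".
    The real system has no preset round limit; max_rounds_per_phase is
    only a safety valve for this sketch.
    """
    for phase in phases:
        for n in range(1, max_rounds_per_phase + 1):
            outcome = run_round(phase, n)
            if outcome == "concluded":
                # The Director has decided the investigation is finished.
                return "concluded"
            if outcome == "phase_complete":
                break  # advance to the next phase
    return "all_phases_done"
```

The point of the sketch is the shape of the loop: phases are fixed, rounds within a phase are open-ended, and only the Director's decision terminates the run.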
The Challenge of Long-run Coherence
Language model APIs are stateless: to provide continuity between requests, the caller must provide the complete message history with each request. Agent frameworks solve the state management problem for users by accumulating message history between API calls. This fills the agent’s context window, which provides a hard limit on how much information the agent can handle. Even approaching an agent’s context window limit can degrade the quality of responses. For short-run applications, no extra context window management is typically required.
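A minimal sketch of that accumulation pattern, using the common chat-API message shape; `call_model` is a stand-in for an inference API call, and no particular framework is assumed:

```python
def chat_turn(history, user_message, call_model):
    """One conversational turn against a stateless API.

    The caller must resend the complete message history with every
    request; the "memory" lives entirely client-side.
    call_model(messages) stands in for the inference API.
    """
    history = history + [{"role": "user", "content": user_message}]
    reply = call_model(history)  # full history travels with the request
    history = history + [{"role": "assistant", "content": reply}]
    return history, reply
```

Each turn appends two messages, so the payload (and context window consumption) grows monotonically; this is the behavior that long-running applications must eventually manage.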
High-level overview of how agent frameworks manage context across inference API calls
Complex security investigations can span hundreds of inference requests and generate megabytes of output, requiring special handling. Multi-agent applications, like ours, add further complexities. For each agent to optimally execute its role, it requires a tailored view of the investigation state. Each view must be carefully balanced. If agents are not anchored to the wider team, the investigation will be disconnected and incoherent. Conversely, sharing too much information stifles creativity and encourages confirmation bias.
Our solution uses three complementary context channels:
Director’s Journal: The Director’s structured working memory
Critic’s Review: Annotated findings report with credibility scores
Critic’s Timeline: Consolidated chronological findings with credibility scores
Each channel serves a different purpose, and together they provide the context each agent needs without overwhelming any of them.
How our agents consume and produce different context sources
Specimen Content
We include edited extracts of the Journal, Review, and Timeline from one investigation in this article. These extracts should give a meaningful sense of what these context resources look like in practice. They have been edited to generalize the content, but they are derived from a real investigation. The alert was generated in response to the loading of a kernel module. In fact, the event was a false positive caused by a developer installing a package in a development environment, and the triggered detection rule being overly sensitive. Specimen extracts are shown in italics.
The Director’s Journal
The Director is responsible for orchestrating the investigation: deciding what questions to ask, which Experts to engage, and when to conclude the investigation. To make coherent decisions across rounds, it needs memory of what’s been discovered and decided.
The Director has a journaling tool. The Director’s system prompt encourages it to update the Journal often and use it for short notes. The Journal captures decisions, observations, hypotheses, and open questions in a structured format. It serves as the Director’s working memory.
Entry Types
The Journal supports six entry types:
| Type | Purpose | Example |
| --- | --- | --- |
| decision | Strategic choices | “Focus investigation on authentication anomalies rather than network activity” |
| observation | Patterns noticed | “Multiple failed logins preceded the successful authentication” |
| finding | Confirmed facts | “User authenticated from IP 203.0.113.45, not in historical baseline” |
| question | Open items | “Was the VPN connection established before or after the suspicious activity?” |
| action | Steps taken/planned | “Requested Cloud Expert to examine EC2 instance activity” |
| hypothesis | Working theories | “This pattern suggests credential stuffing rather than account compromise” |
In addition to classifying its entries, the Director can also assign priority, list follow-up actions, and include citation references to evidential artifacts. When the journaling tool is used, each entry is annotated with the investigation context: the phase, round number, and timestamp. The tool itself does nothing more than accumulate entries.
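A journaling tool along these lines can be sketched as follows. The field names, rendering format, and class shape are our own illustration, not the production implementation:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ENTRY_TYPES = {"decision", "observation", "finding",
               "question", "action", "hypothesis"}

@dataclass
class Journal:
    """Accumulate-only journal: the tool does nothing beyond appending
    entries, each annotated with the current phase, round, and time."""
    phase: str = ""
    round_no: int = 0
    entries: list = field(default_factory=list)

    def add(self, entry_type, text, priority="medium",
            follow_ups=(), citations=()):
        if entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {entry_type}")
        self.entries.append({
            "type": entry_type,
            "text": text,
            "priority": priority,
            "follow_ups": list(follow_ups),
            "citations": list(citations),   # references to evidential artifacts
            "phase": self.phase,
            "round": self.round_no,
            "time": datetime.now(timezone.utc).isoformat(),
        })

    def render(self):
        # Chronological plain-text view, suitable for injection into prompts.
        return "\n".join(
            f"[{e['phase']}/r{e['round']}] {e['type'].upper()}: {e['text']}"
            for e in self.entries)
```

The deliberate simplicity matters: because the tool only accumulates and renders, the Journal's usefulness comes entirely from how the Director is prompted to use it.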
Every agent receives the current content of the Director’s Journal in their prompt, presented chronologically. Their system prompts include guidance that explains the Director’s role, their relationship to the Director, the purpose of the Journal, and how to interpret it.
How It Maintains Alignment
The Journal allows the Director to lead the investigation towards a conclusion, to observe and measure its progress, to identify dead-ends, and to make course corrections in response. It provides the common narrative that keeps other agents on track.
Breakdown of entry types in the Journal
Example Director’s Journal
Time: 09:32:21 | Priority: High
Identified event as endpoint process start, user running as root on dev workstation. Command is a package hook script (not direct modprobe). Listed key questions about user roles, host type, and log sources.
Noted cgroup indicates user session, hostname suggests personal workstation. Realized command runs during package installation sequence, not actual module loading.
Time: 09:34:06 | Priority: High
Alert rule matching “kmod” in script path, not actual modprobe execution. Host identified as personal dev environment. Activity appears to be legitimate system administration.
Time: 09:35:15 | Priority: Medium
Confirmed dev environment grants root by design. Summarized investigation resources (endpoint queries, process ancestry, SSH cert logs). Preliminary assessment: false positive.
Time: 09:35:51 | Priority: High
Need to verify parent process chain, check for actual modprobe/insmod execution, reconstruct full session activity. Formulated targeted query for endpoint expert.
Summarized all findings. Root cause: detection rule matched pathname not actual operation. Recommended action: tune detection rule to distinguish hook scripts from real modprobe.
The Critic’s Review Tools
To progress the investigation, the Director poses questions to Experts. Each Expert has a subject domain and tools to allow them to interrogate relevant data sources. At the end of their run, the Experts produce findings, citing investigation artifacts (tool calls) to support their conclusions. Even with strict guidelines, this process is not, by itself, sufficiently robust. Language models are known to hallucinate, and a proportion of the Experts’ findings could either be invented or grossly misinterpret the data.
The Critic’s role is to assess the Experts’ work, checking that reported findings are supported by evidence and that interpretations are sound. To do this accurately, it needs to be able to inspect not only each Expert’s claims and the cited evidence, but the methodology.
In the Review task, the Critic examines all the Experts’ findings in a single pass. Aggregating the findings together allows it to identify where the findings support or contradict each other. Due to the number of findings that can be produced, it’s not practical to provide all of the information to the Critic directly. Instead, the Critic receives a summary report and uses a suite of tools to examine the cited evidence.
How Critic’s review tools are used
We provide the Critic with four tools:
| Tool | Purpose |
| --- | --- |
| get_tool_call | Inspect the arguments and metadata of any tool call |
| get_tool_result | Examine the actual output returned by a tool use |
| get_toolset_info | List what tools were available to a specific Expert |
| list_toolsets | List all available toolsets organized by Expert |
Collectively, these tools let the Critic examine both the evidence and the data-gathering methodology. When an Expert cites tooluse_abc123 as supporting a finding, the Critic can use get_tool_call to examine the tool parameters used to obtain the result, and get_tool_result to see exactly what data the Expert was looking at. It can also use get_toolset_info to read each tool’s inline documentation and determine whether the tool was used correctly, and list_toolsets to establish whether the Director erred by posing a question to an Expert that was not equipped to answer it, or whether an Expert made a poor tool selection.
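One plausible way to back these four tools is an artifact store that records every Expert tool call for later audit. The interfaces below are hypothetical sketches of the tools named above, not the real implementation:

```python
class ArtifactStore:
    """Records Expert tool activity so the Critic can audit cited evidence.
    Method names mirror the four review tools; shapes are illustrative."""

    def __init__(self):
        self._calls = {}     # call_id -> {expert, tool, args}
        self._results = {}   # call_id -> raw tool output
        self._toolsets = {}  # expert -> {tool_name: inline documentation}

    def register_toolset(self, expert, tools):
        self._toolsets[expert] = dict(tools)

    def record(self, call_id, expert, tool, args, result):
        self._calls[call_id] = {"expert": expert, "tool": tool, "args": args}
        self._results[call_id] = result

    # --- the Critic's four review tools ---
    def get_tool_call(self, call_id):
        """Inspect the arguments and metadata of any tool call."""
        return self._calls[call_id]

    def get_tool_result(self, call_id):
        """Examine the actual output returned by a tool use."""
        return self._results[call_id]

    def get_toolset_info(self, expert):
        """List the tools (with documentation) available to an Expert."""
        return self._toolsets.get(expert, {})

    def list_toolsets(self):
        """List all available toolsets organized by Expert."""
        return {e: sorted(ts) for e, ts in self._toolsets.items()}
```

With a store like this, a cited artifact ID is enough for the Critic to recover both what was asked of the data source and what came back.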
The Review Scoring System
The output of the Critic’s Review task is an annotated findings report containing an overall summary and scored findings. Not all findings are equally reliable. A finding corroborated by multiple sources deserves more weight than speculation based on partial data. By assigning numeric scores, we enable:
Informed decision-making: Highly credible findings can be prioritized
Timeline quality: Only credible findings make it into the consolidated timeline
Audit trails: Staff can quickly identify which conclusions need scrutiny
Operational insights: Dashboards illustrating system performance
The Critic’s Rubric
We use a five-level credibility scale:
| Score | Label | Criteria |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | Supported by multiple sources with no contradictory indicators |
| 0.7-0.89 | Highly-plausible | Corroborated by a single source |
| 0.5-0.69 | Plausible | Mixed evidence support |
| 0.3-0.49 | Speculative | Poor evidence support |
| 0.0-0.29 | Misguided | No evidence provided or misinterpreted |
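Mapping a numeric credibility score onto this rubric is a simple band lookup. A small sketch, with band boundaries taken directly from the five-level scale above:

```python
# Lower bound (inclusive) of each credibility band, highest first.
RUBRIC = [
    (0.9, "Trustworthy"),
    (0.7, "Highly-plausible"),
    (0.5, "Plausible"),
    (0.3, "Speculative"),
    (0.0, "Misguided"),
]

def credibility_label(score):
    """Return the rubric label for a credibility score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    for lower, label in RUBRIC:
        if score >= lower:
            return label
    return "Misguided"  # unreachable, but keeps the function total
```

A lookup like this is also what makes the downstream uses listed earlier mechanical: timeline inclusion, dashboards, and audit filters can all key off the label or the raw score.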
The following table shows the distribution of classifications over 170,000 reviewed findings. Slightly over a quarter of findings don’t meet the plausibility threshold.
| Score | Label | % |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | 37.7 |
| 0.7-0.89 | Highly-plausible | 25.4 |
| 0.5-0.69 | Plausible | 11.1 |
| 0.3-0.49 | Speculative | 10.4 |
| 0.0-0.29 | Misguided | 15.4 |
It’s reasonable to question whether the Critic’s Review provides a false sense of assurance, given that the review itself is also conducted by model inference. We approach this problem with a range of complementary mitigations.
The first mitigation is to use a stronger model for the Critic. Because the Critic only reviews submitted findings rather than the entire Expert run, the number of tokens required is kept within reasonable limits. While stronger models are still subject to hallucination, research suggests they err less frequently. Equally important is the capacity of the Critic to interpret nuances in the evidence, which is also improved with a stronger model.
The second mitigation is the formulation of the Critic’s instructions. Language models are more likely to hallucinate when posed larger, open-ended questions. The agent is instructed to only make a judgement on the submitted findings.
Example Critic’s Review
Cloud Expert delivered a strong investigation with a comprehensive search query retrieving 6,046 session events and correctly identifying: (1) legitimate package operations, (2) kernel regeneration during system updates, (3) modprobe --show-depends queries for boot ramdisk configuration (not actual module loading), and (4) false positive detection rule matching on hook script name rather than kernel operations.
Annotated Findings
[0.92] Package operations triggered legitimate kernel regeneration on the target development host. Comprehensive query shows package management operations with expected package names confirmed in process event fields.
[0.90] Parent process executed hooks including framebuffer, mdadm, and busybox scripts as part of normal operation. Parent process spawned multiple child processes executing hook scripts.
[0.88] Modprobe operations were information-gathering queries (--show-depends --ignore-install flags) for thermal, dm-cache, raid0 modules, not actual kernel module insertion. Verified executable=/usr/bin/kmod with flags that query dependencies without loading.
[0.87] Activity is expected system maintenance on a personal development environment by an authorized user with expected roles and root access during business hours.
[0.85] Alert triggered on shell script name pattern rather than actual modprobe/insmod execution. Detection rule overly-broad: flagged dash interpreter running script with ‘kmod’ in pathname.
The third mitigation is the Critic’s Timeline task, which we will now describe.
Critic’s Timeline
The Critic’s Timeline task immediately follows the Review task in the investigation sequence. It is challenged to construct the most plausible consolidated timeline from three sources:
The most recent Review
The previous Critic’s Timeline
The Director’s Journal
Whereas the Review task is token intensive and requires the correct use of many tools, Timeline assembly operates entirely on data in the prompt. The intuition is that the more narrowly scoped task leaves a greater capacity for reasoning in the problem domain, rather than methods of data gathering or judgements of Expert methodology.
Consolidation Rules
The Critic follows explicit rules when assembling Timelines:
Include only events supported by credible citations – Speculation doesn’t belong on the Timeline
Remove duplicate entries describing the same event – An event shouldn’t appear twice because two Experts mentioned it
When timestamps conflict, prefer sources with stronger evidence – A log entry timestamp beats an inferred time
Maintain chronological ordering based on best available evidence – Events must flow logically in time
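These four rules translate naturally into a single consolidation pass. The event shape (`key`, `timestamp`, `credibility`, `description`) and the 0.5 plausibility threshold are illustrative assumptions for this sketch, not the system's actual data model:

```python
def consolidate(events, threshold=0.5):
    """Apply the Timeline consolidation rules to candidate events.

    Drops sub-threshold (speculative) findings, dedupes events that share
    an identity key, lets the stronger-evidence entry win timestamp
    conflicts, and returns the survivors in chronological order.
    """
    best = {}
    for ev in events:
        if ev["credibility"] < threshold:
            continue  # speculation doesn't belong on the Timeline
        current = best.get(ev["key"])
        if current is None or ev["credibility"] > current["credibility"]:
            # Stronger evidence wins, including its timestamp.
            best[ev["key"]] = ev
    return sorted(best.values(), key=lambda e: e["timestamp"])
```

In the real system the Critic performs this reasoning in natural language rather than code, but the rules it is instructed to follow have exactly this deterministic core.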
Gap Identification
Not every Timeline is complete. The Critic identifies significant gaps that should be addressed:
Evidential gaps: Missing data that would strengthen conclusions
Temporal gaps: Unexplained periods between events
Logical inconsistencies: Events that don’t fit the emerging narrative
We limit gap identification to the top 3 most significant gaps. This focuses the Director’s attention on what matters most rather than presenting an exhaustive list of unknowns.
The Critic is instructed to score the Timeline using a narrative-building rubric.
| Score | Label | Meaning |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | Strong corroboration across multiple sources, consistent timestamps, no significant gaps |
| 0.7-0.89 | Highly-plausible | Good evidence support, minor gaps present, mostly consistent Timeline |
| 0.5-0.69 | Plausible | Some uncertainty in event ordering, notable gaps exist |
The Timeline task raises the bar for hallucinated findings by enforcing narrative coherence. To be preserved, each finding must be consistent with the full chain of evidence; findings that contradict or lack support from the broader narrative are pruned. A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with.
Example Critic’s Timeline
Confidence Score: 0.83
False positive security alert triggered during legitimate system maintenance on a personal development environment. Detection rule incorrectly flagged a package hook script based on pathname string matching, rather than actual kernel module loading operations. All modprobe executions were dependency queries (--show-depends flags) for boot ramdisk configuration, not live kernel modifications. Activity occurred during business hours with proper audit trail preservation, consistent with the development environment’s intended use.
Event Sequence
09:29:01Z – User session begins on development workstation
09:30:39Z – Package management operations initiated by developer
09:30:48Z – Package management triggered system maintenance hooks
09:31:26Z – ALERT TRIGGERED – Hook script invoked
09:31:27Z – modprobe information-gathering for modules to determine ramdisk dependencies
09:31:29Z – modprobe dependency queries complete
09:31:29Z – Additional hook scripts executed as part of ramdisk regeneration process
Evidence Gaps
Exact session initiation timestamp unknown – session activity observed from 09:29:01Z but SSH login event not captured
Specific command that initiated apt/dpkg operations not identified – timeline shows package operations beginning at 09:30:39Z but triggering command not documented
Secondary analyst failed to locate parent process using incorrect field name and missed modprobe operations by searching wrong path – reduces confidence in independent verification
Message History
As we explained in the introduction, agentic frameworks manage message history by accumulating messages and tool calls through the chain of inference requests that make up each agent invocation. In long-run agentic applications, you cannot simply carry the message history forward indefinitely. As more of the model’s context window is consumed, costs and inference latencies increase, model performance declines, and eventually the accumulated messages will exceed the context window.
Our approach is to rely entirely on the context channels presented in this article: the Journal, Review, and Timeline. Besides these resources, we do not pass any message history forward between agent invocations. Collectively, these channels provide a means of online context summarisation, negating the need for extensive message histories. Even if context windows were infinitely large, passing message history between rounds would not necessarily be desirable: the accumulated context could impede the agents’ capacity to respond appropriately to new information.
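In outline, each agent invocation's prompt is rebuilt from the three channels alone. A sketch of that assembly, with hypothetical section headers of our own invention:

```python
def build_agent_context(journal_text, review_text, timeline_text, role_guidance):
    """Assemble an agent's prompt from the three context channels only.

    No message history is carried between invocations: the Journal,
    Review, and Timeline serve as online context summarisation.
    Empty channels (e.g. no Review yet in round one) are simply omitted.
    """
    sections = [
        ("ROLE GUIDANCE", role_guidance),
        ("DIRECTOR'S JOURNAL", journal_text),
        ("CRITIC'S REVIEW", review_text),
        ("CRITIC'S TIMELINE", timeline_text),
    ]
    return "\n\n".join(
        f"## {title}\n{body}" for title, body in sections if body)
```

Because the channels are regenerated or appended-to each round, the prompt size is bounded by the size of the summaries rather than by the length of the investigation.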
Conclusion
Maintaining alignment and orientation in multi-agent investigations requires deliberate design. Each agent should have specific responsibilities, and a view of the investigation state tailored to its task. With proper design, context window limitations are not a major obstacle to building complex, long-running agentic applications.
We addressed these challenges with complementary mechanisms:
Journal: Structured, shared memory for investigation orchestration
Review: Credibility-scored findings that prune out inaccuracies and hallucinations
Timeline: Most plausible chronology, constructed from credible evidence
These mechanisms work together to maintain coherence across rounds, while preserving the benefits of specialized agent roles. The Director can make informed strategic decisions. Experts can build on previous understanding. The Critic can objectively evaluate findings. The result is investigations that are more thorough and more trustworthy than any single agent could produce alone.
In our next article, we’ll explore how artifacts serve as a communication channel between investigation participants, examining the artifact system that connects findings to evidence and enables the verification workflows described in this article.
Acknowledgements
We want to give a shout out to everyone who has contributed to this journey:
Chris Smith
Abhi Rathod
Dave Russell
Nate Reeves
In my last Week in Review post, I mentioned how much time I’ve been spending on AI-Driven Development Lifecycle (AI-DLC) workshops with customers this year. A common theme in those sessions is the need for better cost visibility. Teams are moving fast with AI, but as they go from experimenting to full production, finance and leadership really need to know who is using which resources and at what cost. That’s why I was so excited to see the launch of Amazon Bedrock’s new support for cost allocation by IAM user and role this week. This lets you tag IAM principals with attributes like team or cost center and then activate those tags in your Billing and Cost Management console. The resulting cost data flows into AWS Cost Explorer and the detailed Cost and Usage Report, giving you a clear line of sight into model inference spending. Whether you’re scaling agents across teams, tracking foundation model use by department, or running tools like Claude Code on Amazon Bedrock, this new feature is a game changer for tracking and managing your AI investments. You can get all the details on setting this up in the IAM principal cost allocation documentation.
Now, let’s get into this week’s AWS news…
Headlines

Amazon Bedrock now offers Claude Mythos Preview

Anthropic’s most sophisticated AI model to date is now available on Amazon Bedrock as a gated research preview through Project Glasswing. Claude Mythos introduces a new model class focused on cybersecurity, capable of identifying sophisticated security vulnerabilities in software, analyzing large codebases, and delivering state of the art performance across cybersecurity, coding, and complex reasoning tasks. Security teams can use it to discover and address vulnerabilities in critical software before threats emerge. Access is currently limited to allowlisted organizations, with Anthropic and AWS prioritizing internet-critical companies and open source maintainers.
AWS Agent Registry for centralized agent discovery and governance now in preview

AWS launched Agent Registry through Amazon Bedrock AgentCore, providing organizations with a private catalog for discovering and managing AI agents, tools, skills, MCP servers, and custom resources. The registry helps teams locate existing capabilities rather than duplicating them, with semantic and keyword search, approval workflows, and CloudTrail audit trails. It is accessible via the AgentCore Console, AWS CLI, SDK, and as an MCP server queryable from IDEs.
Last week’s launches

Here are some launches and updates from this past week that caught my attention:
Announcing Amazon S3 Files, making S3 buckets accessible as file systems — Amazon S3 Files transforms S3 buckets into shared file systems that connect any AWS compute resource directly with your S3 data. Built on Amazon EFS technology, it delivers full file system semantics with low latency performance, caching actively used data and providing multiple terabytes per second of aggregate read throughput. Applications can access S3 data through both file system and S3 APIs simultaneously without code modifications or data migration.
Amazon OpenSearch Service supports Managed Prometheus and agent tracing — Amazon OpenSearch Service now provides a unified observability platform that consolidates metrics, logs, traces, and AI agent tracing into a single interface. The update includes native Prometheus integration with direct PromQL query support, RED metrics monitoring, and OpenTelemetry GenAI semantic convention support for LLM execution visibility. Operations teams can correlate slow traces to logs and overlay Prometheus metrics on dashboards without switching between tools.
Amazon WorkSpaces Advisor now available for AI powered troubleshooting — AWS launched Amazon WorkSpaces Advisor, an AI powered administrative tool that uses generative AI to help IT administrators troubleshoot Amazon WorkSpaces Personal deployments. It analyzes WorkSpace configurations, detects problems automatically, and provides actionable recommendations to restore service and optimize performance.
Amazon Braket adds support for Rigetti’s 108 qubit Cepheus QPU — Amazon Braket now offers access to Rigetti’s Cepheus-1-108Q device, the first 100+ qubit superconducting quantum processor on the platform. The modular design features twelve 9 qubit chiplets with CZ gates that offer enhanced resilience to phase errors. It supports multiple frameworks including Braket SDK, Qiskit, CUDA-Q, and Pennylane, with pulse level control for researchers.
For a full list of AWS announcements, be sure to keep an eye on the What’s New with AWS page.
Other AWS news

Here are some additional posts and resources that you might find interesting:
Understanding Amazon Bedrock model lifecycle — Machine learning blog post that walks through the stages foundation models go through in Bedrock from availability through deprecation, helping teams plan for model updates and manage version dependencies in production.
Deploy OpenClaw on AWS: Choose the right options for your AI workload — Builder Center guide comparing four AWS deployment options for OpenClaw: Amazon Lightsail for individual developers, Amazon EC2 for startups needing deeper AWS integration, Amazon Bedrock AgentCore for serverless multiuser scenarios, and Amazon EKS for enterprises requiring VM level isolation and advanced orchestration.
We’re bringing back the Kiro startup credits program — Kiro is relaunching its startup credits initiative, offering eligible early stage companies complimentary access to Kiro Pro+ for up to one year. The three tier program (Starter, Growth, Scale) provides 2 to 30 users based on team size, with rolling applications accepted globally.
Upcoming AWS events

Check your calendar and sign up for upcoming AWS events:
What’s Next with AWS (April 28, Virtual) — Join this livestream at 9am PT for a candid discussion about how agentic AI is transforming how businesses operate. Featuring AWS CEO Matt Garman, SVP Colleen Aubrey, and OpenAI leaders discussing emerging agent capabilities, Amazon’s internal experiences, and new agentic solutions and platform capabilities.
This post is a collaboration between Docker and Arm, demonstrating how Docker MCP Toolkit and the Arm MCP Server work together to scan Hugging Face Spaces for Arm64 Readiness.
In our previous post, we walked through migrating a legacy C++ application with AVX2 intrinsics to Arm64 using Docker MCP Toolkit and the Arm MCP Server – code conversion, SIMD intrinsic rewrites, compiler flag changes, the full stack. This post is about a different and far more common failure mode.
When we tried to run ACE-Step v1.5, a 3.5B parameter music generation model from Hugging Face, on an Arm64 MacBook, the installation failed not with a cryptic kernel error but with a pip error. The flash-attn wheel in requirements.txt was hardcoded to a linux_x86_64 URL, no Arm64 wheel existed at that address, and the container would not build. It’s a deceptively simple problem that turns out to affect roughly 80% of Hugging Face Docker Spaces: not the code, not the Dockerfile, but a single hardcoded dependency URL that nobody noticed because nobody had tested on Arm.
To diagnose this systematically, we built a 7-tool MCP chain that can analyse any Hugging Face Space for Arm64 readiness in about 15 minutes. By the end of this guide you’ll understand exactly why ACE-Step v1.5 fails on Arm64, what the two specific blockers are, and how the chain surfaces them automatically.
Why Hugging Face Spaces Matter for Arm
Hugging Face hosts over one million Spaces, a significant portion of which use the Docker SDK, meaning developers write a Dockerfile and Hugging Face builds and serves the container directly. The problem is that nearly all of those containers were built and tested exclusively on linux/amd64, which creates a deployment wall for three fast-growing Arm64 targets that are increasingly relevant for AI workloads.
| Target | Hardware | Why it matters |
| --- | --- | --- |
| Cloud | AWS Graviton, Azure Cobalt, Google Axion | 20-40% cost reduction vs. x86 |
| Edge/Robotics | NVIDIA Jetson Thor, DGX Spark | GR00T, LeRobot, Isaac all target Arm64 |
| Local dev | Apple Silicon M1-M4 | Most popular developer machine, zero cloud cost |
The failure mode isn’t always obvious, and it tends to show up in one of two distinct patterns. The first is a missing container manifest – the image has no arm64 layer and Docker refuses to pull it, which is at least straightforward to diagnose. The second is harder to catch: the Dockerfile and base image are perfectly fine, but a dependency in requirements.txt points to a platform-specific wheel URL. The build starts, reaches pip install, and fails with a platform mismatch error that gives no clear indication of where to look. ACE-Step v1.5 is a textbook example of the second pattern, and the MCP chain catches both in minutes.
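A quick way to catch the second pattern before starting a build is to scan requirements.txt for platform-locked wheel tags. This sketch uses a small, non-exhaustive list of common x86-only tags; it is a pre-flight heuristic, not a substitute for the full MCP chain analysis:

```python
import re

# Illustrative, non-exhaustive set of platform tags that lock a wheel
# to a non-Arm64 platform (e.g. a hardcoded flash-attn linux_x86_64 URL).
_PLATFORM_TAGS = re.compile(r"(manylinux[\w.]*_x86_64|linux_x86_64|win_amd64)")

def find_platform_locked_wheels(requirements_text):
    """Return (line, offending_tag) pairs for platform-pinned requirements."""
    hits = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = _PLATFORM_TAGS.search(line)
        if match:
            hits.append((line, match.group(1)))
    return hits
```

Running a check like this against a Space's requirements file surfaces the hardcoded-wheel failure mode in seconds, instead of deep inside a failed `pip install`.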
The 7-Tool MCP Chain
Docker MCP Toolkit orchestrates the analysis through a secure MCP Gateway. Each tool runs in an isolated Docker container. The seven tools in the chain are:
Caption: The 7-tool MCP chain architecture diagram
The tools:
Hugging Face MCP – Discovers the Space, identifies SDK type (Docker vs. Gradio)
Skopeo (via Arm MCP Server) – Inspects the container registry, reports supported architectures
migrate-ease (via Arm MCP Server) – Scans source code for x86-specific intrinsics, hardcoded paths, arch-locked libraries
GitHub MCP – Reads Dockerfile, pyproject.toml, requirements.txt from the repository
Arm Knowledge Base (via Arm MCP Server) – Searches learn.arm.com for build strategies and optimization guides
Sequential Thinking – Combines findings into a structured migration verdict
The natural question at this point is whether you could simply rebuild your Docker image for Arm64 and be done with it; for many applications, you could. But knowing in advance whether the rebuild will actually succeed is a different problem. Your Dockerfile might depend on a base image that doesn’t publish Arm64 builds. Your Python dependencies might not have aarch64 wheels. Your code might use x86-specific system calls. The MCP chain checks all of this automatically before you invest time in a build that may not work.
Setting Up Visual Studio Code with Docker MCP Toolkit
Prerequisites
Before you begin, make sure you have:
A machine with 8 GB RAM minimum (16GB recommended)
The latest Docker Desktop release
VS Code with GitHub Copilot extension
GitHub account with personal access token
Step 1. Enable Docker MCP Toolkit
Open Docker Desktop and enable the MCP Toolkit from Settings.
To enable:
Open Docker Desktop
Go to Settings > Beta Features
Caption: Enabling Docker MCP Toolkit under Docker Desktop
Toggle Docker MCP Toolkit ON
Click Apply
Step 2. Add Required MCP Servers from Catalog
Add the following four MCP Servers from the Catalog. You can find them by selecting “Catalog” in the Docker Desktop MCP Toolkit, or by following these links:
Arm MCP Server – Architecture analysis, migrate-ease scanning, skopeo inspection, and Arm knowledge base
GitHub MCP Server – Repository analysis, code reading, and pull request creation
Sequential Thinking MCP Server – Step-by-step reasoning support for complex migration decisions
Hugging Face MCP Server – Space metadata, model information, and repository contents from the Hugging Face Hub
Caption: Searching for Arm MCP Server in the Docker MCP Catalog
Step 3. Configure the Servers
Configure the Arm MCP Server
To access your local code for the migrate-ease scan and MCA tools, the Arm MCP Server needs a directory configured to point to your local code.
Caption: Arm MCP Server configuration
Once you click ‘Save’, the Arm MCP Server will know where to look for your code. To grant access to a different directory later, update this path.
Available Arm Migration Tools
Click Tools to view all six MCP tools available under Arm MCP Server:
Caption: List of MCP tools provided by the Arm MCP Server
knowledge_base_search – Semantic search of Arm learning resources, intrinsics documentation, and software compatibility
migrate_ease_scan – Code scanner supporting C++, Python, Go, JavaScript, and Java for Arm compatibility analysis
check_image – Docker image architecture verification (checks if images support Arm64)
skopeo – Remote container image inspection without downloading
mca – Machine Code Analyzer for assembly performance analysis and IPC predictions
sysreport_instructions – System architecture information gathering
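Under the hood, each of these tools is invoked over the Model Context Protocol, which rides on JSON-RPC 2.0. As a rough sketch of the wire format (the `arguments` payload here is hypothetical; each server publishes its own schema, and Docker MCP Toolkit handles the transport for you):

```python
import json

# Sketch of the JSON-RPC 2.0 message an MCP client sends to invoke a tool.
# The "arguments" payload is hypothetical; each server defines its own schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "check_image",
        "arguments": {"image": "python:3.11-slim"},
    },
}

payload = json.dumps(request)
print(payload)
```

In practice you never build these messages by hand; GitHub Copilot and the MCP Gateway do it for you. The sketch is only to show that “tools” are just named, schema-validated remote calls.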
Configure the GitHub MCP Server
The GitHub MCP Server lets GitHub Copilot read repositories, create pull requests, manage issues, and commit changes.
Caption: Steps to configure GitHub Official MCP Server
Configure Authentication:
Select GitHub official
Choose your preferred authentication method
For Personal Access Token, get the token from GitHub > Settings > Developer Settings
Caption: Setting up Personal Access Token in GitHub MCP Server
Configure the Sequential Thinking MCP Server
Click “Sequential Thinking”
No configuration needed
Caption: Sequential MCP Server requires zero configuration
This server helps GitHub Copilot break down complex migration decisions into logical steps.
Configure the Hugging Face MCP Server
The Hugging Face MCP Server provides access to Space metadata, model information, and repository contents directly from the Hugging Face Hub.
Click “Hugging Face”
No additional configuration needed for public Spaces
For private Spaces, add your Hugging Face API token
Step 4. Add the Servers to VS Code
The Docker MCP Toolkit makes it incredibly easy to configure MCP servers for clients like VS Code.
To configure, click “Clients” and scroll down to Visual Studio Code. Click the “Connect” button:
Caption: Setting up Visual Studio Code as MCP Client
Now open VS Code and click on the ‘Extensions’ icon in the left toolbar:
Caption: Configuring MCP_DOCKER under VS Code Extensions
Click the MCP_DOCKER gear, and click ‘Start Server’:
Caption: Starting MCP Server under VS Code
Step 5. Verify Connection
Open GitHub Copilot Chat in VS Code and ask:
What Arm migration and Hugging Face tools do you have access to?
You should see tools from all four servers listed. If you see them, your connection works. Let’s scan a Hugging Face Space.
Caption: Playing around with GitHub Copilot
Real-World Demo: Scanning ACE-Step v1.5
Now that you’ve connected GitHub Copilot to Docker MCP Toolkit, let’s scan a real Hugging Face Space for Arm64 readiness and uncover the exact Arm64 blocker we hit when trying to run it locally.
Target: ACE-Step v1.5 – a 3.5B parameter music generation model
Time to scan: 15 minutes
Infrastructure cost: $0 (all tools run locally in Docker containers)
The Workflow
Docker MCP Toolkit orchestrates the scan through a secure MCP Gateway that routes requests to specialized tools: the Arm MCP Server inspects images and scans code, Hugging Face MCP discovers the Space, GitHub MCP reads the repository, and Sequential Thinking synthesizes the verdict.
Step 1. Give GitHub Copilot Scan Instructions
Open your project in VS Code. In GitHub Copilot Chat, paste this prompt:
Your goal is to analyze the Hugging Face Space "ACE-Step/ACE-Step-v1.5" for Arm64 migration readiness. Use the MCP tools to help with this analysis.
Steps to follow:
1. Use Hugging Face MCP to discover the Space and identify its SDK type (Docker or Gradio)
2. Use skopeo to inspect the container image - check what architectures are currently supported
3. Use GitHub MCP to read the repository - examine pyproject.toml, Dockerfile, and requirements
4. Run migrate_ease_scan on the source code to find any x86-specific dependencies or intrinsics
5. Use knowledge_base_search to find Arm64 build strategies for any issues discovered
6. Use sequential thinking to synthesize all findings into a migration verdict
At the end, provide a clear GO / NO-GO verdict with a summary of required changes.
Step 2. Watch Docker MCP Toolkit Execute
GitHub Copilot orchestrates the scan using Docker MCP Toolkit. Here’s what happens:
Phase 1: Space Discovery
GitHub Copilot starts by querying the Hugging Face MCP server to retrieve Space metadata.
Caption: GitHub Copilot uses Hugging Face MCP to discover the Space and identify its SDK type.
The tool returns that ACE-Step v1.5 uses the Docker SDK – meaning Hugging Face serves it as a pre-built container image, not a Gradio app. This is critical: Docker SDK Spaces have Dockerfiles we can analyze and rebuild, while Gradio SDK Spaces are built by Hugging Face’s infrastructure, which we can’t control.
Phase 2: Container Image Inspection
Next, Copilot uses the Arm MCP Server’s skopeo tool to inspect the container image without downloading it.
Caption: The skopeo tool reports that the container image has no Arm64 build available. The container won’t start on Arm hardware.
Result: the manifest includes only linux/amd64. No Arm64 build exists. This is the first concrete data point: the container will fail on any Arm hardware. But it is not the full story.
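You can reproduce this check by parsing the image’s manifest list, the JSON that `skopeo inspect --raw` returns for a multi-arch image. A minimal sketch, using a hardcoded amd64-only sample shaped like the real output rather than a live registry call:

```python
import json

# Hardcoded sample shaped like a Docker manifest list; a real check would feed
# in the output of `skopeo inspect --raw docker://<image>` instead.
raw_manifest = """
{
  "schemaVersion": 2,
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
"""

manifest = json.loads(raw_manifest)
architectures = {m["platform"]["architecture"] for m in manifest["manifests"]}

# An image is Arm64-ready only if the manifest list advertises an arm64 build
arm64_ready = "arm64" in architectures
print(architectures, arm64_ready)
```

This is exactly the question the skopeo tool answers on your behalf, without downloading the image layers.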
Phase 3: Source Code Analysis
Copilot uses GitHub MCP to read the repository’s key files, including the Dockerfile and requirements.txt. Two issues stand out:
flash-attn is pinned to a hardcoded linux_x86_64 wheel URL. On an aarch64 system, pip downloads this wheel and immediately rejects it: “not a supported wheel on this platform.” This is the exact error I hit.
triton>=3.0.0 is also flagged: triton published no aarch64 wheels on PyPI for Linux before version 3.5.0, so resolutions that land on an older release fail on Arm hardware.
Neither of these is a code problem. The Python source code is architecture-neutral. The fix is in the dependency declarations.
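The reason pip rejects the wheel is encoded right in its filename: the last dash-separated field of a wheel name is its platform tag. A quick sketch of that convention (pip’s real logic lives in the `packaging` library; the compatible-tag set below is a simplified illustration, not pip’s full list):

```python
# Wheel filenames follow {dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl,
# so the platform tag is the last dash-separated field before .whl.
def wheel_platform_tag(filename: str) -> str:
    return filename[: -len(".whl")].split("-")[-1]

tag = wheel_platform_tag(
    "flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl"
)
print(tag)

# On an aarch64 machine a linux_x86_64 tag can never match, so pip raises
# "not a supported wheel on this platform". (Simplified tag set for illustration.)
compatible_on_aarch64 = tag in {"linux_aarch64", "manylinux2014_aarch64", "any"}
print(compatible_on_aarch64)
```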
Phase 4: Architecture Compatibility Scan
Copilot runs the migrate_ease_scan tool with the Python scanner on the codebase.
Caption: The migrate_ease_scan tool analyzes the Python source code and finds zero x86-specific dependencies. No intrinsics, no hardcoded paths, no architecture-locked libraries.
The application source code itself returns 0 architecture issues — no x86 intrinsics, no platform-specific system calls. But the scan also flags the dependency manifest. Two blockers in requirements.txt:

| Dependency | Issue | Arm64 Fix |
| --- | --- | --- |
| flash-attn (linux wheel) | Hardcoded linux_x86_64 URL | Add the matching aarch64 wheel URL; flash-attn 2.7+ also publishes aarch64 wheels on PyPI |
| triton>=3.0.0 | No aarch64 PyPI wheels for Linux before 3.5.0 | The >=3.0.0 constraint resolves to 3.5.0+, which ships aarch64 wheels |
Phase 5: Arm Knowledge Base Lookup
Copilot queries the Arm MCP Server’s knowledge base for solutions to the discovered issues.
Caption: GitHub Copilot uses the knowledge_base_search tool to find Docker buildx multi-arch strategies from learn.arm.com.
The knowledge base returns documentation on:
flash-attn aarch64 wheel availability from version 2.7+
PyTorch Arm64 optimization guides for Graviton and Apple Silicon
Best practices for CUDA 13.0 on aarch64 (Jetson Thor / DGX Spark)
triton alternatives for CPU inference paths on Arm
Phase 6: Synthesis and Verdict
Sequential Thinking combines all findings into a structured verdict:
| Check | Result | Blocks? |
| --- | --- | --- |
| Container manifest | amd64 only | Yes, needs rebuild |
| Base image python:3.11-slim | Multi-arch (arm64 available) | No |
| System packages (ffmpeg, libsndfile1) | Available in Debian arm64 | No |
| torch==2.9.1 | aarch64 wheels published | No |
| flash-attn linux wheel | Hardcoded linux_x86_64 URL | Yes, add arm64 URL alongside |
| triton>=3.0.0 | aarch64 wheels available from 3.5.0+ | No, resolves automatically |
| Source code (migrate-ease) | 0 architecture issues | No |
| Compiler flags in Dockerfile | None x86-specific | No |
Verdict: CONDITIONAL GO. Zero code changes. Zero Dockerfile changes. One dependency fix is required.
Here are the exact changes needed in requirements.txt:
# BEFORE — only x86_64
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'
# AFTER — add an aarch64 line alongside, each guarded by platform_machine
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_aarch64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'
# triton — no change needed, 3.5.0+ has aarch64 wheels, resolves automatically
triton>=3.0.0; sys_platform != 'win32'
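The `; sys_platform == ...` suffixes are PEP 508 environment markers: pip evaluates them against the installing machine and skips any requirement whose marker is false. A toy evaluator, supporting only the `and`-joined `==`/`!=` comparisons used above (real resolvers use `packaging.markers`):

```python
# Toy PEP 508 marker evaluator; handles only `and`-joined ==/!= clauses,
# which is all the requirements lines above need.
def marker_matches(marker: str, env: dict) -> bool:
    for clause in marker.split(" and "):
        name, op, value = clause.split(" ", 2)
        value = value.strip("'\"")
        if op == "==" and env[name] != value:
            return False
        if op == "!=" and env[name] == value:
            return False
    return True

# What pip would see on an Arm64 Linux box running Python 3.11
arm64_env = {
    "sys_platform": "linux",
    "python_version": "3.11",
    "platform_machine": "aarch64",
}

aarch64_line = "sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'"
x86_line = "sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'"

print(marker_matches(aarch64_line, arm64_env))  # aarch64 wheel is selected
print(marker_matches(x86_line, arm64_env))      # x86_64 wheel is skipped
```

This is why guarding both wheel URLs by `platform_machine` lets a single requirements.txt serve x86_64 and Arm64 installs without conflict.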
Phase 7: Create the Pull Request
After completing the scan, Copilot uses GitHub MCP to propose the fix. Since the only blocker is the hardcoded linux_x86_64 wheel URL on line 32 of requirements.txt, the change is surgical: one line added, nothing removed.
The fix adds the equivalent linux_aarch64 wheel from the same release alongside the existing x86_64 entry, conditioned on platform_machine == 'aarch64':
# BEFORE — only x86_64, fails on Arm
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'
# AFTER — add arm64 line alongside, each conditioned by platform_machine
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_aarch64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'
Caption: PR #14 on Hugging Face – Ready to merge
The key insight: the upstream maintainer already published the arm64 wheel in the same release. The fix wasn’t a rebuild or a code change – it was adding one line that references an artifact that already existed. The MCP chain found it in 15 minutes. Without it, a developer hitting this pip error would spend hours tracking it down.
Let’s be clear about what changes when you add the Arm MCP Server to Docker MCP Toolkit.
Without Arm MCP: You ask GitHub Copilot to check your Hugging Face Space for Arm64 compatibility. Copilot responds with general advice: “Check if your base image supports arm64”, “Look for x86-specific code”, “Try rebuilding with buildx”. You manually inspect Docker Hub, grep through the codebase, check each dependency on PyPI, and hit a pip install failure you cannot easily diagnose. The flash-attn URL issue alone can take an hour to track down.
With Arm MCP + Docker MCP Toolkit: You ask the same question. Within minutes, it uses skopeo to verify the base image, runs migrate_ease_scan on your actual codebase, flags the hardcoded linux_x86_64 wheel URLs in requirements.txt, queries knowledge_base_search for the correct fix, and synthesizes a structured CONDITIONAL GO verdict with every check documented.
Real images get inspected. Real code gets scanned. Real dependency files get analyzed. The difference is Docker MCP Toolkit gives GitHub Copilot access to actual Arm migration tooling, not just general knowledge.
Manual Process vs. MCP Chain
Manual process:
Clone the Hugging Face Space repository (10 minutes)
Inspect the container manifest for architecture support (5 minutes)
Read through pyproject.toml and requirements.txt (20 minutes)
Check PyPI for Arm64 wheel availability across all dependencies (30 minutes)
Analyze the Dockerfile for hardcoded architecture assumptions (10 minutes)
Research CUDA/cuDNN Arm64 support for the required versions (20 minutes)
Write up findings and recommended changes (15 minutes)
Total: 2-3 hours per Space
With Docker MCP Toolkit:
Give GitHub Copilot the scan instructions (5 minutes)
Review the migration report (5 minutes)
Submit a PR with changes (5 minutes)
Total: 15 minutes per Space
What This Suggests at Scale
ACE-Step is a standard Python AI application: PyTorch, Gradio, pip dependencies, a slim Dockerfile. This pattern covers the majority of Docker SDK Spaces on Hugging Face.
The Arm64 wall for these apps is not always visible. The Dockerfile looks clean. The base image supports arm64. The Python code has no intrinsics. But buried in requirements.txt is a hardcoded wheel URL pointing at a linux_x86_64 binary, and nobody finds it until they actually try to run the container on Arm hardware.
That is the 80% problem: 80% of Hugging Face Docker Spaces have never been tested on Arm. Not because the code will not work, but because nobody checked. The MCP chain is a systematic check that takes 15 minutes instead of an afternoon of debugging pip errors.
That has real cost implications:
Graviton inference runs 20-40% cheaper for the same workloads. Every amd64-only Space leaves that savings untouched.
NVIDIA Physical AI (GR00T, LeRobot, Isaac) deploys on Jetson Thor. Developers find models on Hugging Face, but the containers fail to build on target hardware.
Apple Silicon is the most common developer laptop. Local inference means faster iteration and no cloud bill.
How Docker MCP Toolkit Changes Development
Docker MCP Toolkit changes how developers interact with specialized knowledge and capabilities. Rather than learning new tools, installing dependencies, or managing credentials, developers connect their AI assistant once and immediately access containerized expertise.
The benefits extend beyond Hugging Face scanning:
Consistency — Same 7-tool chain produces the same structured analysis for any container
Security — Each tool runs in an isolated Docker container, preventing tool interference
Reproducibility — Scans behave identically across environments
Composability — Add or swap tools as the ecosystem evolves
Discoverability — Docker MCP Catalog makes finding the right server straightforward
Most importantly, developers remain in their existing workflow. VS Code. GitHub Copilot. Git. No context switching to external tools or dashboards.
Wrapping Up
You have just scanned a real Hugging Face Space for Arm64 readiness using Docker MCP Toolkit, the Arm MCP Server, and GitHub Copilot. What we found with ACE-Step v1.5 is representative of what you will find across Hugging Face: code that is architecture-neutral, a Dockerfile that is already clean, but a requirements.txt with hardcoded x86_64 wheel URLs that silently break Arm64 builds.
The MCP chain surfaces this in 15 minutes. Without it, you are staring at a pip error with no clear path to the cause.