Microsoft is looking into ways it can integrate OpenClaw-style features into 365 Copilot, according to a report from The Information. The test reportedly comes as part of efforts to make its 365 Copilot AI assistant "run autonomously around the clock" while completing tasks on behalf of users.
Omar Shahine, Microsoft's corporate vice president, confirmed to The Information that the company is "exploring the potential of technologies like OpenClaw in an enterprise context." OpenClaw is an open-source platform that allows users to create AI-powered agents that run locally on a user's device. The platform rose in popularity earlier this year, …
In complex, long-running agentic systems, maintaining alignment and coherent reasoning between agents requires careful design. In this second article of our series, we explore these challenges and the mechanisms we built to keep teams of agents working productively over long time spans. We present a range of complementary techniques that balance the conflicting requirements of continuity and creativity.
In our first article, we introduced our agentic security investigation service. We described how teams of AI agents collaboratively investigate security alerts. A Director orchestrates the investigation, many specialist Experts gather evidence, and a Critic reviews the Experts’ findings. We suggest you read the series in order.
To briefly recap, our investigation process proceeds through a series of defined phases. Each phase implements a distinct set of agent interactions. Within phases, we may have multiple rounds, where each round is one full iteration through the phase. There’s no preset limit on the number of rounds that make up an investigation: investigations continue until concluded by the Director agent.
Language model APIs are stateless: to provide continuity between requests, the caller must provide the complete message history with each request. Agent frameworks solve the state management problem for users by accumulating message history between API calls. This fills the agent’s context window, which provides a hard limit on how much information the agent can handle. Even approaching an agent’s context window limit can degrade the quality of responses. For short-run applications, no extra context window management is typically required.
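The caller-side bookkeeping that frameworks automate can be sketched in a few lines of Python. The model here is a stub standing in for a real API client, and all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """Minimal sketch of what an agent framework does for you:
    the model API is stateless, so the caller re-sends the full
    message history on every request."""
    messages: list = field(default_factory=list)

    def ask(self, model, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        # Every call sends the *entire* accumulated history.
        reply = model(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Stub model: answers with the number of messages it was sent,
# illustrating that each request carries the whole history so far.
def stub_model(messages):
    return f"saw {len(messages)} messages"

conv = Conversation()
print(conv.ask(stub_model, "hello"))  # saw 1 messages
print(conv.ask(stub_model, "again"))  # saw 3 messages
```

The second call already carries three messages; left unchecked, this growth is exactly what fills the context window.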

Complex security investigations can span hundreds of inference requests and generate megabytes of output, requiring special handling. Multi-agent applications, like ours, add further complexities. For each agent to optimally execute its role, it requires a tailored view of the investigation state. Each view must be carefully balanced. If agents are not anchored to the wider team, the investigation will be disconnected and incoherent. Conversely, sharing too much information stifles creativity and encourages confirmation bias.
Our solution uses three complementary context channels: the Director’s Journal, the Critic’s Review, and the Critic’s Timeline.
Each channel serves a different purpose, and together they provide the context each agent needs without overwhelming any of them.

We include edited extracts of the Journal, Review, and Timeline from one investigation in this article. These extracts should give a meaningful sense of what these context resources look like in practice. They have been edited to generalize the content, but they are derived from a real investigation. The alert was generated in response to the loading of a kernel module. In fact, the event was a false positive caused by a developer installing a package in a development environment, and the triggered detection rule being overly sensitive. Specimen extracts are shown in italics.
The Director is responsible for orchestrating the investigation: deciding what questions to ask, which Experts to engage, and when to conclude the investigation. To make coherent decisions across rounds, it needs memory of what’s been discovered and decided.
The Director has a journaling tool. The Director’s system prompt encourages it to update the Journal often and use it for short notes. The Journal captures decisions, observations, hypotheses, and open questions in a structured format. It serves as the Director’s working memory.
The Journal supports six entry types:
| Type | Purpose | Example |
|---|---|---|
| decision | Strategic choices | “Focus investigation on authentication anomalies rather than network activity” |
| observation | Patterns noticed | “Multiple failed logins preceded the successful authentication” |
| finding | Confirmed facts | “User authenticated from IP 203.0.113.45, not in historical baseline” |
| question | Open items | “Was the VPN connection established before or after the suspicious activity?” |
| action | Steps taken/planned | “Requested Cloud Expert to examine EC2 instance activity” |
| hypothesis | Working theories | “This pattern suggests credential stuffing rather than account compromise” |
In addition to classifying its entries, the Director can also assign priority, list follow-up actions, and include citation references to evidential artifacts. When the journaling tool is used, each entry is annotated with the investigation context: the phase, round number, and timestamp. The tool itself does nothing more than accumulate entries.
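A journaling tool with this contract is simple to sketch. The class and field names below are illustrative, not the production implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The six entry types from the table above.
ENTRY_TYPES = {"decision", "observation", "finding",
               "question", "action", "hypothesis"}

@dataclass
class JournalEntry:
    entry_type: str
    text: str
    priority: str
    phase: str
    round_no: int
    timestamp: str

class Journal:
    """Accumulates entries; each is annotated with investigation context."""
    def __init__(self):
        self.entries = []

    def add(self, entry_type, text, priority, phase, round_no):
        if entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {entry_type}")
        self.entries.append(JournalEntry(
            entry_type, text, priority, phase, round_no,
            datetime.now(timezone.utc).isoformat(timespec="seconds")))

    def render(self) -> str:
        # Chronological rendering, as injected into every agent's prompt.
        return "\n".join(
            f"[{e.timestamp}] ({e.entry_type}, {e.priority}) {e.text}"
            for e in self.entries)
```

The tool deliberately has no retrieval or editing logic; as noted above, it does nothing more than accumulate entries.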
Every agent receives the current content of the Director’s Journal in its prompt, presented in chronological order. Each agent’s system prompt includes guidance that explains the Director’s role, the agent’s relationship to the Director, the purpose of the Journal, and how to interpret it.
The Journal allows the Director to lead the investigation towards a conclusion, to observe and measure its progress, to identify dead-ends, and to make course corrections in response. It provides the common narrative that keeps other agents on track.

Time: 09:32:21 | Priority: High
Identified event as endpoint process start, user running as root on dev workstation. Command is a package hook script (not direct modprobe). Listed key questions about user roles, host type, and log sources.
Time: 09:32:29 | Priority: Medium
Identified 4 relevant expert domains needed: endpoint telemetry, identity/access (user roles), configuration management (host config), user-behavior (activity patterns).
Time: 09:33:10 | Priority: Medium
Noted cgroup indicates user session, hostname suggests personal workstation. Realized command runs during package installation sequence, not actual module loading.
Time: 09:34:06 | Priority: High
Alert rule matching “kmod” in script path, not actual modprobe execution. Host identified as personal dev environment. Activity appears to be legitimate system administration.
Time: 09:35:15 | Priority: Medium
Confirmed dev environment grants root by design. Summarized investigation resources (endpoint queries, process ancestry, SSH cert logs). Preliminary assessment: false positive.
Time: 09:35:51 | Priority: High
Need to verify parent process chain, check for actual modprobe/insmod execution, reconstruct full session activity. Formulated targeted query for endpoint expert.
Time: 09:40:47 | Priority: Critical
Reviewed expert findings (6,046 events showing package install triggered event). All 4 experts agree: FALSE POSITIVE. Timeline confidence 0.83. Decision: advance to conclude.
Time: 09:41:15 | Priority: High
Summarized all findings. Root cause: detection rule matched pathname not actual operation. Recommended action: tune detection rule to distinguish hook scripts from real modprobe.
To progress the investigation, the Director poses questions to Experts. Each Expert has a subject domain and tools to allow them to interrogate relevant data sources. At the end of their run, the Experts produce findings, citing investigation artifacts (tool calls) to support their conclusions. Even with strict guidelines, this process is not, by itself, sufficiently robust. Language models are known to hallucinate, and a proportion of the Experts’ findings could either be invented or grossly misinterpret the data.
The Critic’s role is to assess the Experts’ work, checking that reported findings are supported by evidence and that interpretations are sound. To do this accurately, it needs to be able to inspect not only each Expert’s claims and the cited evidence, but also the methodology.
In the Review task, the Critic examines all the Experts’ findings in a single pass. Aggregating the findings together allows it to identify where the findings support or contradict each other. Due to the number of findings that can be produced, it’s not practical to provide all of the information to the Critic directly. Instead, the Critic receives a summary report and uses a suite of tools to examine the cited evidence.

We provide the Critic with four tools:
| Tool | Purpose |
|---|---|
| get_tool_call | Inspect the arguments and metadata of any tool call |
| get_tool_result | Examine the actual output returned by a tool use |
| get_toolset_info | List what tools were available to a specific Expert |
| list_toolsets | List all available toolsets organized by Expert |
Collectively, these tools allow the Critic to examine both the evidence and the data-gathering methodology. When an Expert cites tooluse_abc123 as supporting a finding, the Critic can use get_tool_call to examine the tool parameters used to obtain the result, and get_tool_result to see exactly what data the Expert was looking at. It can also use get_toolset_info to access each tool’s inline documentation and determine whether the tool was used correctly, and list_toolsets to understand whether the Director erred by posing a question to an Expert that was not equipped to answer it, or whether an Expert made a poor tool selection.
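The four tools amount to lookups over a store of recorded tool calls. A minimal sketch, with hypothetical names and an in-memory dict standing in for the real artifact system:

```python
class ArtifactStore:
    """Hypothetical backing store for the Critic's four inspection tools.
    Every Expert tool call is recorded under a tool-use ID."""
    def __init__(self):
        self._calls = {}     # call_id -> {expert, tool, args, result}
        self._toolsets = {}  # expert  -> {tool_name: inline documentation}

    def record(self, call_id, expert, tool, args, result):
        self._calls[call_id] = dict(expert=expert, tool=tool,
                                    args=args, result=result)

    def register_toolset(self, expert, tools):
        self._toolsets[expert] = dict(tools)

    def get_tool_call(self, call_id):
        # Arguments and metadata of a cited tool call.
        c = self._calls[call_id]
        return {"expert": c["expert"], "tool": c["tool"], "args": c["args"]}

    def get_tool_result(self, call_id):
        # The actual output the Expert was looking at.
        return self._calls[call_id]["result"]

    def get_toolset_info(self, expert):
        # What tools (and documentation) were available to this Expert.
        return self._toolsets.get(expert, {})

    def list_toolsets(self):
        # All toolsets, organized by Expert.
        return {expert: sorted(tools) for expert, tools in self._toolsets.items()}
```

The point of the design is that the Critic never receives raw Expert transcripts; it pulls only the artifacts it chooses to inspect.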
The output of the Critic’s Review task is an annotated findings report containing an overall summary and scored findings. Not all findings are equally reliable. A finding corroborated by multiple sources deserves more weight than speculation based on partial data. By assigning numeric scores, we enable:
Operational insights: Dashboards illustrating system performance
We use a five-level credibility scale:
| Score | Label | Criteria |
|---|---|---|
| 0.9-1.0 | Trustworthy | Supported by multiple sources with no contradictory indicators |
| 0.7-0.89 | Highly-plausible | Corroborated by a single source |
| 0.5-0.69 | Plausible | Mixed evidence support |
| 0.3-0.49 | Speculative | Poor evidence support |
| 0.0-0.29 | Misguided | No evidence provided or misinterpreted |
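Applied mechanically, the scale is just a threshold ladder. A small illustrative helper (ours, not the production code):

```python
def credibility_label(score: float) -> str:
    """Map a Critic credibility score to its label on the five-level scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= 0.9:
        return "Trustworthy"
    if score >= 0.7:
        return "Highly-plausible"
    if score >= 0.5:
        return "Plausible"
    if score >= 0.3:
        return "Speculative"
    return "Misguided"
```

For example, the 0.92 finding in the extract below lands in Trustworthy, while anything under 0.5 falls below the plausibility threshold.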
The following table shows the distribution of classifications over 170,000 reviewed findings. Slightly over a quarter of findings don’t meet the plausibility threshold.
| Score | Label | % |
|---|---|---|
| 0.9-1.0 | Trustworthy | 37.7 |
| 0.7-0.89 | Highly-plausible | 25.4 |
| 0.5-0.69 | Plausible | 11.1 |
| 0.3-0.49 | Speculative | 10.4 |
| 0.0-0.29 | Misguided | 15.4 |
It’s reasonable to question whether the Critic’s Review merely provides a false sense of assurance, since the Review itself is also conducted by model inference. We approach this problem from several directions with a range of mitigations.
The first mitigation is to use a stronger model for the Critic. Because the Critic only reviews submitted findings rather than the entire Expert run, the number of tokens required is kept within reasonable limits. While stronger models are still subject to hallucination, research suggests they err less frequently. Equally important is the capacity of the Critic to interpret nuances in the evidence, which is also improved with a stronger model.
The second mitigation is the formulation of the Critic’s instructions. Language models are more likely to hallucinate when posed larger, open-ended questions. The agent is instructed to only make a judgement on the submitted findings.
Cloud Expert delivered a strong investigation with a comprehensive search query retrieving 6,046 session events and correctly identifying: (1) legitimate package operations, (2) kernel regeneration during system updates, (3) modprobe --show-depends queries for boot ramdisk configuration (not actual module loading), and (4) false positive detection rule matching on hook script name rather than kernel operations.
[0.92] Package operations triggered legitimate kernel regeneration on the target development host. Comprehensive query shows package management operations with expected package names confirmed in process event fields.
[0.90] Parent process executed hooks including framebuffer, mdadm, and busybox scripts as part of normal operation. Parent process spawned multiple child processes executing hook scripts.
[0.88] Modprobe operations were information-gathering queries (--show-depends --ignore-install flags) for thermal, dm-cache, raid0 modules, not actual kernel module insertion. Verified executable=/usr/bin/kmod with flags that query dependencies without loading.
[0.87] Activity is expected system maintenance on a personal development environment by an authorized user with expected roles and root access during business hours.
[0.85] Alert triggered on shell script name pattern rather than actual modprobe/insmod execution. Detection rule overly-broad: flagged dash interpreter running script with ‘kmod’ in pathname.
The third mitigation is the Critic’s Timeline task, which we will now describe.
The Critic’s Timeline task immediately follows the Review task in the investigation sequence. It is challenged to construct the most plausible consolidated timeline from three sources:
Whereas the Review task is token intensive and requires the correct use of many tools, Timeline assembly operates entirely on data in the prompt. The intuition is that the more narrowly scoped task leaves a greater capacity for reasoning in the problem domain, rather than methods of data gathering or judgements of Expert methodology.
The Critic follows explicit rules when assembling Timelines:
Maintain chronological ordering based on best available evidence – Events must flow logically in time
Not every Timeline is complete. The Critic identifies significant gaps that should be addressed:
We limit gap identification to the top 3 most significant gaps. This focuses the Director’s attention on what matters most rather than presenting an exhaustive list of unknowns.
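The top-3 cut is a simple significance filter. A sketch, assuming each gap carries a numeric significance; the scoring itself is illustrative, not the production rubric:

```python
import heapq

def top_gaps(gaps, limit=3):
    """Keep only the most significant Timeline gaps.

    `gaps` is a list of (significance, description) pairs, where a
    higher significance means the gap matters more to the Director.
    """
    return [desc for _, desc in
            heapq.nlargest(limit, gaps, key=lambda g: g[0])]

gaps = [(0.2, "exact SSH session start unknown"),
        (0.9, "no telemetry between 09:36 and 09:40"),
        (0.5, "parent PID of hook script unverified"),
        (0.7, "package source repository unconfirmed")]
print(top_gaps(gaps))  # the three highest-significance descriptions
```

Everything below the cut is dropped rather than surfaced, keeping the Director focused on a short, actionable list.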
The Critic is instructed to score the Timeline using a narrative-building rubric.
| Score | Label | Meaning |
|---|---|---|
| 0.9-1.0 | Trustworthy | Strong corroboration across multiple sources, consistent timestamps, no significant gaps |
| 0.7-0.89 | Highly-plausible | Good evidence support, minor gaps present, mostly consistent Timeline |
| 0.5-0.69 | Plausible | Some uncertainty in event ordering, notable gaps exist |
| 0.3-0.49 | Speculative | Poor evidence support, significant gaps, conflicted narrative |
| 0.0-0.29 | Invalid | No evidence, confounding inconsistencies present |
The Timeline task raises the bar for hallucinated findings by enforcing narrative coherence. To be preserved, each finding must be consistent with the full chain of evidence; findings that contradict or lack support from the broader narrative are pruned. A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with.
Confidence Score: 0.83
False positive security alert triggered during legitimate system maintenance on a personal development environment. Detection rule incorrectly flagged a package hook script based on pathname string matching, rather than actual kernel module loading operations. All modprobe executions were dependency queries (--show-depends flags) for boot ramdisk configuration, not live kernel modifications. Activity occurred during business hours with proper audit trail preservation, consistent with the development environment’s intended use.
09:29:01Z – User session begins on development workstation
09:30:39Z – Package management operations initiated by developer
09:30:48Z – Package management triggered system maintenance hooks
09:31:26Z – ALERT TRIGGERED – Hook script invoked
09:31:27Z – modprobe information-gathering for modules to determine ramdisk dependencies
09:31:29Z – modprobe dependency queries complete
09:31:29Z – Additional hook scripts executed as part of ramdisk regeneration process
As we explained in the introduction, agentic frameworks manage message history by accumulating messages and tool calls through the chain of inference requests that make up each agent invocation. In long-run agentic applications, you cannot simply carry the message history forward indefinitely. As more of the model’s context window is consumed, costs and inference latencies increase, model performance declines, and eventually the accumulated messages will exceed the context window.
Our approach is to rely entirely on the context channels presented in this article: the Journal, Review, and Timeline. Besides these resources, we do not pass any message history forward between agent invocations. Collectively, these channels provide a means of online context summarisation, negating the need for extensive message histories. Even if context windows were infinitely large, passing message history between rounds would not necessarily be desirable: the accumulated context could impede the agents’ capacity to respond appropriately to new information.
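Concretely, each agent invocation can start from a freshly assembled prompt built from the three channels rather than a carried-forward transcript. A sketch with illustrative section names:

```python
def build_agent_prompt(role_instructions, journal, review, timeline, task):
    """Assemble an agent's per-round prompt from the three context channels.

    No message history is carried forward between rounds; the Journal,
    Review, and Timeline are the only state that crosses invocations.
    Section headings are illustrative, not the production format.
    """
    sections = [
        ("ROLE", role_instructions),
        ("DIRECTOR JOURNAL", journal),
        ("CRITIC REVIEW", review),
        ("CONSOLIDATED TIMELINE", timeline),
        ("CURRENT TASK", task),
    ]
    # Empty channels (e.g. no Review yet in round 1) are simply omitted.
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)
```

Because each channel is a bounded summary rather than a transcript, prompt size stays roughly constant no matter how many rounds the investigation runs.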
Maintaining alignment and orientation in multi-agent investigations requires deliberate design. Each agent should have specific responsibilities, and a view of the investigation state tailored to its task. With proper design, context window limitations are not a major obstacle to building complex, long-running agentic applications.
We addressed these challenges with complementary mechanisms: the Director’s Journal as shared working memory, the Critic’s evidence-checked Review with credibility scoring, and the consolidated Timeline that enforces narrative coherence.
These mechanisms work together to maintain coherence across rounds, while preserving the benefits of specialized agent roles. The Director can make informed strategic decisions. Experts can build on previous understanding. The Critic can objectively evaluate findings. The result is investigations that are more thorough and more trustworthy than any single agent could produce alone.
In our next article, we’ll explore how artifacts serve as a communication channel between investigation participants, examining the artifact system that connects findings to evidence and enables the verification workflows described in this article.
We wanted to give a shout out to all the people that have contributed to this journey:
Interested in taking on interesting projects, making people’s work lives easier, or just building some pretty cool forms? We’re hiring! 
In my last Week in Review post, I mentioned how much time I’ve been spending on AI-Driven Development Lifecycle (AI-DLC) workshops with customers this year. A common theme in those sessions is the need for better cost visibility. Teams are moving fast with AI, but as they go from experimenting to full production, finance and leadership really need to know who is using which resources and at what cost. That’s why I was so excited to see the launch of Amazon Bedrock’s new support for cost allocation by IAM user and role this week. This lets you tag IAM principals with attributes like team or cost center and then activate those tags in your Billing and Cost Management console. The resulting cost data flows into AWS Cost Explorer and the detailed Cost and Usage Report, giving you a clear line of sight into model inference spending. Whether you’re scaling agents across teams, tracking foundation model use by department, or running tools like Claude Code on Amazon Bedrock, this new feature is a game changer for tracking and managing your AI investments. You can get all the details on setting this up in the IAM principal cost allocation documentation.
Now, let’s get into this week’s AWS news…
Headlines
Amazon Bedrock now offers Claude Mythos in preview
Anthropic’s most sophisticated AI model to date is now available on Amazon Bedrock as a gated research preview through Project Glasswing. Claude Mythos introduces a new model class focused on cybersecurity, capable of identifying sophisticated security vulnerabilities in software, analyzing large codebases, and delivering state-of-the-art performance across cybersecurity, coding, and complex reasoning tasks. Security teams can use it to discover and address vulnerabilities in critical software before threats emerge. Access is currently limited to allowlisted organizations, with Anthropic and AWS prioritizing internet-critical companies and open-source maintainers.
AWS Agent Registry for centralized agent discovery and governance now in preview
AWS launched Agent Registry through Amazon Bedrock AgentCore, providing organizations with a private catalog for discovering and managing AI agents, tools, skills, MCP servers, and custom resources. The registry helps teams locate existing capabilities rather than duplicating them, with semantic and keyword search, approval workflows, and CloudTrail audit trails. It is accessible via the AgentCore Console, AWS CLI, SDK, and as an MCP server queryable from IDEs.
Last week’s launches
Here are some launches and updates from this past week that caught my attention:
For a full list of AWS announcements, be sure to keep an eye on the What’s New with AWS page.
Other AWS news
Here are some additional posts and resources that you might find interesting:
Upcoming AWS events
Check your calendar and sign up for upcoming AWS events:
Browse here for upcoming AWS-led in-person and virtual events, startup events, and developer-focused events.
That’s all for this week. Check back next Monday for another Weekly Roundup!
~ micah
This post is a collaboration between Docker and Arm, demonstrating how Docker MCP Toolkit and the Arm MCP Server work together to scan Hugging Face Spaces for Arm64 Readiness.
In our previous post, we walked through migrating a legacy C++ application with AVX2 intrinsics to Arm64 using Docker MCP Toolkit and the Arm MCP Server – code conversion, SIMD intrinsic rewrites, compiler flag changes, the full stack. This post is about a different and far more common failure mode.
When we tried to run ACE-Step v1.5, a 3.5B parameter music generation model from Hugging Face, on an Arm64 MacBook, the installation failed not with a cryptic kernel error but with a pip error. The flash-attn wheel in requirements.txt was hardcoded to a linux_x86_64 URL, no Arm64 wheel existed at that address, and the container would not build. It’s a deceptively simple problem that turns out to affect roughly 80% of Hugging Face Docker Spaces: not the code, not the Dockerfile, but a single hardcoded dependency URL that nobody noticed because nobody had tested on Arm.
To diagnose this systematically, we built a 7-tool MCP chain that can analyse any Hugging Face Space for Arm64 readiness in about 15 minutes. By the end of this guide you’ll understand exactly why ACE-Step v1.5 fails on Arm64, what the two specific blockers are, and how the chain surfaces them automatically.
Hugging Face hosts over one million Spaces, a significant portion of which use the Docker SDK, meaning developers write a Dockerfile and Hugging Face builds and serves the container directly. The problem is that nearly all of those containers were built and tested exclusively on linux/amd64, which creates a deployment wall for three fast-growing Arm64 targets that are increasingly relevant for AI workloads.
| Target | Hardware | Why it matters |
|---|---|---|
| Cloud | AWS Graviton, Azure Cobalt, Google Axion | 20-40% cost reduction vs. x86 |
| Edge/Robotics | NVIDIA Jetson Thor, DGX Spark | GR00T, LeRobot, Isaac all target Arm64 |
| Local dev | Apple Silicon M1-M4 | Most popular developer machine, zero cloud cost |
The failure mode isn’t always obvious, and it tends to show up in one of two distinct patterns. The first is a missing container manifest – the image has no arm64 layer and Docker refuses to pull it, which is at least straightforward to diagnose. The second is harder to catch: the Dockerfile and base image are perfectly fine, but a dependency in requirements.txt points to a platform-specific wheel URL. The build starts, reaches pip install, and fails with a platform mismatch error that gives no clear indication of where to look. ACE-Step v1.5 is a textbook example of the second pattern, and the MCP chain catches both in minutes.
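The second pattern can be caught with a crude static check before any build starts. The heuristic below is our own illustration, not one of the MCP tools; it only flags direct-URL requirements whose wheel filename embeds an x86-only platform tag:

```python
import re

# Substrings that mark a direct-URL wheel as x86-only. Heuristic, not
# a full PEP 508 / wheel-tag parser.
X86_MARKERS = ("x86_64", "win_amd64")

def find_x86_pinned_wheels(requirements_text: str):
    """Return package names pinned to x86-only wheel URLs -- the
    ACE-Step failure pattern, where the build only dies at pip install."""
    hits = []
    for line in requirements_text.splitlines():
        line = line.strip()
        # Only direct-URL requirements ("name @ https://...") can
        # hardcode a platform; plain version pins resolve per-platform.
        if line.startswith("#") or "@" not in line:
            continue
        if any(marker in line for marker in X86_MARKERS):
            name = re.split(r"\s*@", line, maxsplit=1)[0]
            hits.append(name)
    return hits

reqs = """
torch==2.9.1
flash-attn @ https://example.com/flash_attn-2.8.3-cp311-cp311-linux_x86_64.whl
"""
print(find_x86_pinned_wheels(reqs))  # ['flash-attn']
```

A check like this costs milliseconds and would have surfaced the ACE-Step blocker before the first docker build.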
Docker MCP Toolkit orchestrates the analysis through a secure MCP Gateway. Each tool runs in an isolated Docker container. The seven tools in the chain are:
Caption: The 7-tool MCP chain architecture diagram
The tools:
- GitHub MCP reads Dockerfile, pyproject.toml, and requirements.txt from the repository

The natural question at this point is whether you could simply rebuild your Docker image for Arm64 and be done with it, and for many applications, you could. But knowing in advance whether the rebuild will actually succeed is a different problem. Your Dockerfile might depend on a base image that doesn’t publish Arm64 builds. Your Python dependencies might not have aarch64 wheels. Your code might use x86-specific system calls. The MCP chain checks all of this automatically before you invest time in a build that may not work.
Before you begin, make sure you have:
Open Docker Desktop and enable the MCP Toolkit from Settings.
To enable:
Caption: Enabling Docker MCP Toolkit under Docker Desktop
Add the following four MCP Servers from the Catalog. You can find them by selecting “Catalog” in the Docker Desktop MCP Toolkit, or by following these links:
Arm MCP Server – migrate-ease scanning, skopeo inspection, and Arm knowledge base
Caption: Searching for Arm MCP Server in the Docker MCP Catalog
To access your local code for the migrate-ease scan and MCA tools, the Arm MCP Server needs a directory configured to point to your local code.
Caption: Arm MCP Server configuration
Once you click ‘Save’, the Arm MCP Server will know where to look for your code. If you want to give a different directory access in the future, you’ll need to change this path.
Available Arm Migration Tools
Click Tools to view all six MCP tools available under Arm MCP Server:
Caption: List of MCP tools provided by the Arm MCP Server
- knowledge_base_search – Semantic search of Arm learning resources, intrinsics documentation, and software compatibility
- migrate_ease_scan – Code scanner supporting C++, Python, Go, JavaScript, and Java for Arm compatibility analysis
- check_image – Docker image architecture verification (checks if images support Arm64)
- skopeo – Remote container image inspection without downloading
- mca – Machine Code Analyzer for assembly performance analysis and IPC predictions
- sysreport_instructions – System architecture information gathering

The GitHub MCP Server lets GitHub Copilot read repositories, create pull requests, manage issues, and commit changes.
Caption: Steps to configure GitHub Official MCP Server
Configure Authentication:
Caption: Setting up Personal Access Token in GitHub MCP Server
Caption: Sequential MCP Server requires zero configuration
This server helps GitHub Copilot break down complex migration decisions into logical steps.
The Hugging Face MCP Server provides access to Space metadata, model information, and repository contents directly from the Hugging Face Hub.
The Docker MCP Toolkit makes it incredibly easy to configure MCP servers for clients like VS Code.
To configure, click “Clients” and scroll down to Visual Studio Code. Click the “Connect” button:
Caption: Setting up Visual Studio Code as MCP Client
Now open VS Code and click on the ‘Extensions’ icon in the left toolbar:
Caption: Configuring MCP_DOCKER under VS Code Extensions
Click the MCP_DOCKER gear, and click ‘Start Server’:
Caption: Starting MCP Server under VS Code
Open GitHub Copilot Chat in VS Code and ask:
What Arm migration and Hugging Face tools do you have access to?
You should see tools from all four servers listed. If you see them, your connection works. Let’s scan a Hugging Face Space.
Caption: Playing around with GitHub Copilot
Now that you’ve connected GitHub Copilot to Docker MCP Toolkit, let’s scan a real Hugging Face Space for Arm64 readiness and uncover the exact Arm64 blocker we hit when trying to run it locally.
Docker MCP Toolkit orchestrates the scan through a secure MCP Gateway that routes requests to specialized tools: the Arm MCP Server inspects images and scans code, Hugging Face MCP discovers the Space, GitHub MCP reads the repository, and Sequential Thinking synthesizes the verdict.
Step 1. Give GitHub Copilot Scan Instructions
Open your project in VS Code. In GitHub Copilot Chat, paste this prompt:
Your goal is to analyze the Hugging Face Space "ACE-Step/ACE-Step-v1.5" for Arm64 migration readiness. Use the MCP tools to help with this analysis.
Steps to follow:
1. Use Hugging Face MCP to discover the Space and identify its SDK type (Docker or Gradio)
2. Use skopeo to inspect the container image - check what architectures are currently supported
3. Use GitHub MCP to read the repository - examine pyproject.toml, Dockerfile, and requirements
4. Run migrate_ease_scan on the source code to find any x86-specific dependencies or intrinsics
5. Use knowledge_base_search to find Arm64 build strategies for any issues discovered
6. Use sequential thinking to synthesize all findings into a migration verdict
At the end, provide a clear GO / NO-GO verdict with a summary of required changes.
Step 2. Watch Docker MCP Toolkit Execute
GitHub Copilot orchestrates the scan using Docker MCP Toolkit. Here’s what happens:
Phase 1: Space Discovery
GitHub Copilot starts by querying the Hugging Face MCP server to retrieve Space metadata.
Caption: GitHub Copilot uses Hugging Face MCP to discover the Space and identify its SDK type.
The tool returns that ACE-Step v1.5 uses the Docker SDK – meaning Hugging Face serves it as a pre-built container image, not a Gradio app. This is critical: Docker SDK Spaces have Dockerfiles we can analyze and rebuild, while Gradio SDK Spaces are built by Hugging Face’s infrastructure we can’t control.
Phase 2: Container Image Inspection
Next, Copilot uses the Arm MCP Server’s skopeo tool to inspect the container image without downloading it.
Caption: The skopeo tool reports that the container image has no Arm64 build available. The container won’t start on Arm hardware.
Result: the manifest includes only linux/amd64. No Arm64 build exists. This is the first concrete data point: the container will fail on any Arm hardware. But it is not the full story.
Phase 3: Source Code Analysis
Copilot uses GitHub MCP to read the repository’s key files. Here is the actual Dockerfile from the Space:
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
DEBIAN_FRONTEND=noninteractive \
TORCHAUDIO_USE_TORCHCODEC=0
RUN apt-get update && \
apt-get install -y --no-install-recommends git libsndfile1 build-essential && \
apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev && \
rm -rf /var/lib/apt/lists/*
RUN useradd -m -u 1000 user
RUN mkdir -p /data && chown user:user /data && chmod 755 /data
ENV HOME=/home/user \
PATH=/home/user/.local/bin:$PATH \
GRADIO_SERVER_NAME=0.0.0.0 \
GRADIO_SERVER_PORT=7860
WORKDIR $HOME/app
COPY --chown=user:user requirements.txt .
COPY --chown=user:user acestep/third_parts/nano-vllm ./acestep/third_parts/nano-vllm
USER user
RUN pip install --no-cache-dir --user -r requirements.txt
RUN pip install --no-deps ./acestep/third_parts/nano-vllm
COPY --chown=user:user . .
EXPOSE 7860
CMD ["python", "app.py"]
The Dockerfile itself looks clean:
But the real problem is in requirements.txt. This is what I hit when I tried to install ACE-Step locally:
# nano-vllm dependencies
triton>=3.0.0; sys_platform != 'win32'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'
Two immediate blockers:
1. flash-attn is pinned to a hardcoded linux_x86_64 wheel URL. On an aarch64 system, pip downloads this wheel and immediately rejects it: “not a supported wheel on this platform.” This is the exact error I hit.
2. triton>=3.0.0 has no aarch64 wheel on PyPI for Linux. It will fail on Arm hardware.

Neither of these is a code problem. The Python source code is architecture-neutral. The fix is in the dependency declarations.
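Why pip rejects the wheel is visible in the filename itself: the last dash-separated component before .whl is the platform tag, which must be compatible with the installing machine. A small illustrative helper (not part of the MCP chain):

```python
import platform

def wheel_platform_tag(wheel_url: str) -> str:
    """Extract the platform tag from a wheel filename, e.g.
    '...-cp311-cp311-linux_x86_64.whl' -> 'linux_x86_64'."""
    filename = wheel_url.rsplit("/", 1)[-1]
    return filename.removesuffix(".whl").rsplit("-", 1)[-1]

url = ("https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/"
       "download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl")
tag = wheel_platform_tag(url)
print(tag)  # linux_x86_64

# On an aarch64 host, platform.machine() returns 'aarch64', which is
# incompatible with the 'linux_x86_64' tag, so pip rejects the wheel
# before any application code ever runs.
compatible = platform.machine() in tag
```

Pure-Python wheels carry the tag `any` and install everywhere; it is only these compiled, platform-tagged wheels that need per-architecture builds.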
Phase 4: Architecture Compatibility Scan
Copilot runs the migrate_ease_scan tool with the Python scanner on the codebase.
Caption: The migrate_ease_scan tool analyzes the Python source code and finds zero x86-specific dependencies. No intrinsics, no hardcoded paths, no architecture-locked libraries.
The application source code itself returns 0 architecture issues — no x86 intrinsics, no platform-specific system calls. But the scan also flags the dependency manifest. Two blockers in requirements.txt:
| Dependency | Issue | Arm64 Fix |
|---|---|---|
| flash-attn (linux wheel) | Hardcoded linux_x86_64 URL | Use flash-attn 2.7+ via PyPI — publishes aarch64 wheels natively |
| triton>=3.0.0 | No aarch64 PyPI wheel for Linux | Exclude on aarch64 or use triton-nightly aarch64 build |
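The source-code half of that scan can be approximated with a simple pattern search. The sketch below greps for a few x86-specific markers; the pattern list is hypothetical and far smaller than migrate-ease's actual rule set:

```python
import re

# A few x86-specific patterns, similar in spirit to what an architecture
# scanner looks for (hypothetical list, not migrate-ease's real rules):
X86_PATTERNS = [
    r"\bimmintrin\.h\b",   # x86 SIMD intrinsics header
    r"\b_mm_\w+",          # SSE/AVX intrinsic calls
    r"linux_x86_64",       # hardcoded platform tags in wheel/binary references
]

def scan_for_x86(text: str) -> list[str]:
    """Return the x86-specific patterns found in a source or config file."""
    return [p for p in X86_PATTERNS if re.search(p, text)]

print(scan_for_x86("import torch\nprint(torch.__version__)"))  # [] — architecture-neutral
print(scan_for_x86("#include <immintrin.h>"))  # flags the intrinsics header
```

Pure Python code like ACE-Step's comes back empty, which is exactly the "0 architecture issues" result above; the only hits are in the dependency manifest.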
Phase 5: Arm Knowledge Base Lookup
Copilot queries the Arm MCP Server’s knowledge base for solutions to the discovered issues.
Caption: GitHub Copilot uses the knowledge_base_search tool to find Docker buildx multi-arch strategies from learn.arm.com.
The knowledge base returns documentation on Docker buildx multi-arch build strategies and related Arm migration guidance from learn.arm.com.
Phase 6: Synthesis and Verdict
Sequential Thinking combines all findings into a structured verdict:
| Check | Result | Blocks? |
|---|---|---|
| Container manifest | amd64 only | Yes, needs rebuild |
| Base image python:3.11-slim | Multi-arch (arm64 available) | No |
| System packages (ffmpeg, libsndfile1) | Available in Debian arm64 | No |
| torch==2.9.1 | aarch64 wheels published | No |
| flash-attn linux wheel | Hardcoded linux_x86_64 URL | Yes, add arm64 URL alongside |
| triton>=3.0.0 | aarch64 wheels available from 3.5.0+ | No, resolves automatically |
| Source code (migrate-ease) | 0 architecture issues | No |
| Compiler flags in Dockerfile | None x86-specific | No |
Verdict: CONDITIONAL GO. Zero code changes. Zero Dockerfile changes. One dependency fix is required.
Here are the exact changes needed in requirements.txt:
```
# BEFORE — only x86_64
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'

# AFTER — add arm64 line alongside x86_64
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_aarch64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'

# triton — no change needed, 3.5.0+ has aarch64 wheels, resolves automatically
triton>=3.0.0; sys_platform != 'win32'
```
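The environment markers on those lines are what make the two flash-attn entries mutually exclusive. The sketch below is a toy evaluator for simple `var == value` markers joined by `and`; real marker resolution follows PEP 508 and is done by pip itself (the `marker_env` helper only shows where the real values come from):

```python
import platform
import sys

def marker_env() -> dict:
    """Subset of the PEP 508 marker variables pip evaluates per requirement line."""
    return {
        "sys_platform": sys.platform,
        "platform_machine": platform.machine(),
        "python_version": ".".join(platform.python_version_tuple()[:2]),
    }

def line_applies(marker: str, env: dict) -> bool:
    """Toy evaluator: simple 'var == value' / 'var != value' clauses joined by 'and'."""
    for clause in marker.split(" and "):
        var, op, value = clause.split(maxsplit=2)
        value = value.strip("'\"")
        ok = env[var] == value if op == "==" else env[var] != value
        if not ok:
            return False
    return True

# On an Arm64 Linux box running Python 3.11:
env = {"sys_platform": "linux", "platform_machine": "aarch64", "python_version": "3.11"}
print(line_applies("sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'", env))  # True: aarch64 wheel selected
print(line_applies("sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'", env))  # False: x86_64 wheel skipped
```

Because exactly one marker is true on any given machine, pip installs exactly one flash-attn wheel, which is why the same requirements.txt now works on both architectures.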
After those two fixes, the build command is:
```shell
docker buildx build --platform linux/arm64 -t ace-step:arm64 .
```
That single command unlocks three deployment paths.
Phase 7: Create the Pull Request
After completing the scan, Copilot uses GitHub MCP to propose the fix. Since the only blocker is the hardcoded linux_x86_64 wheel URL on line 32 of requirements.txt, the change is surgical: one line added, nothing removed.
The fix adds the equivalent linux_aarch64 wheel from the same release alongside the existing x86_64 entry, conditioned on platform_machine == 'aarch64':
```
# BEFORE — only x86_64, fails on Arm
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'

# AFTER — add the arm64 line alongside, conditioned by platform_machine
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_aarch64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'
```
Caption: PR #14 on Hugging Face – Ready to merge
The key insight: the upstream maintainer already published the arm64 wheel in the same release. The fix wasn’t a rebuild or a code change – it was adding one line that references an artifact that already existed. The MCP chain found it in 15 minutes. Without it, a developer hitting this pip error would spend hours tracking it down.
PR: https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5/discussions/14
Let’s be clear about what changes when you add the Arm MCP Server to Docker MCP Toolkit.
Real images get inspected. Real code gets scanned. Real dependency files get analyzed. The difference is Docker MCP Toolkit gives GitHub Copilot access to actual Arm migration tooling, not just general knowledge.
Manual process: 2-3 hours per Space in total.
With Docker MCP Toolkit: 15 minutes per Space.
ACE-Step is a standard Python AI application: PyTorch, Gradio, pip dependencies, a slim Dockerfile. This pattern covers the majority of Docker SDK Spaces on Hugging Face.
The Arm64 wall for these apps is not always visible. The Dockerfile looks clean. The base image supports arm64. The Python code has no intrinsics. But buried in requirements.txt is a hardcoded wheel URL pointing at a linux_x86_64 binary, and nobody finds it until they actually try to run the container on Arm hardware.
That is the 80% problem: 80% of Hugging Face Docker Spaces have never been tested on Arm. Not because the code will not work, but because nobody checked. The MCP chain is a systematic check that takes 15 minutes instead of an afternoon of debugging pip errors.
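A systematic check like this is also easy to automate. The sketch below flags direct-URL requirements pinned to `linux_x86_64` wheels that have no `aarch64` counterpart anywhere in the same file; it is a toy heuristic, not migrate-ease's implementation, and the URL in the sample is made up:

```python
def find_x86_only_wheels(requirements_text: str) -> list[str]:
    """Flag direct-URL requirements pinned to linux_x86_64 wheels when the
    file contains no linux_aarch64 counterpart at all."""
    lines = [l.strip() for l in requirements_text.splitlines() if l.strip()]
    has_aarch64 = any("linux_aarch64" in l for l in lines)
    flagged = []
    for line in lines:
        if "@" in line and "linux_x86_64" in line and not has_aarch64:
            flagged.append(line.split("@")[0].strip())
    return flagged

reqs = """
torch==2.9.1
flash-attn @ https://example.com/flash_attn-2.8.3-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux'
"""
print(find_x86_only_wheels(reqs))  # ['flash-attn'], the silent Arm64 blocker
```

Run over a directory of Spaces, a check like this surfaces in seconds the exact class of blocker that cost an afternoon of manual pip debugging here.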
That has real cost implications.
Docker MCP Toolkit changes how developers interact with specialized knowledge and capabilities. Rather than learning new tools, installing dependencies, or managing credentials, developers connect their AI assistant once and immediately access containerized expertise.
The benefits extend beyond Hugging Face scanning.
Most importantly, developers remain in their existing workflow. VS Code. GitHub Copilot. Git. No context switching to external tools or dashboards.
You have just scanned a real Hugging Face Space for Arm64 readiness using Docker MCP Toolkit, the Arm MCP Server, and GitHub Copilot. What we found with ACE-Step v1.5 is representative of what you will find across Hugging Face: code that is architecture-neutral, a Dockerfile that is already clean, but a requirements.txt with hardcoded x86_64 wheel URLs that silently break Arm64 builds.
The MCP chain surfaces this in 15 minutes. Without it, you are staring at a pip error with no clear path to the cause.
Ready to try it? Open Docker Desktop and explore the MCP Catalog. Start with the Arm MCP Server, then add the GitHub, Sequential Thinking, and Hugging Face MCP servers. Point the chain at any Hugging Face Space you’re working with and see what comes back.

Sometimes the best way to learn is to dive straight into the code! This week, we’re highlighting several hands-on repositories that demonstrate the latest Angular patterns and how to integrate Google Gemini for real-time user assistance.
Check out these essential code samples and templates.
Antonio Cardenas @yeoudev provides a masterclass in using ViewContainerRef. This repository and StackBlitz demo show you exactly how to handle dynamic component instantiation in a clean, scalable way.
Need to start a project fast? Antonio Cardenas @yeoudev offers a “Vibe Coding” template that comes ready with basic CRUD functionality and helpful scripts to clean up boilerplate, letting you focus on your unique logic.
Deborah Kurata @deborahkurata brings her signature clarity to the new Signal-based httpResource. This fun “Pirates” example demonstrates how to fetch and manage data using the newest reactive primitives in Angular.
Ankit Sharma @ankitsharma_007 shows the power of the Google Gemini API in a lightweight Angular app. This project provides real-time grammar corrections as you type — a perfect blueprint for adding AI utility to your own editors.
Have you built a cool helper tool with Gemini or tried the new httpResource? Your code samples could be the missing piece for another developer!
Keep the community growing! Use #AngularSparkles and #AngularAI to share your latest GitHub repos and StackBlitz demos!
Mastering Dynamic Components, HTTP Resources, and AI Writing Assistants 🛠️ was originally published in Angular Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.