Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Booking.com confirms hackers accessed customers’ data

1 Share
The travel giant notified customers that their personal data, including names, email addresses, physical addresses, and phone numbers, may have been accessed in a security incident.
Read the whole story
alvinashcraft
37 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Microsoft is testing OpenClaw-like AI bots for 365 Copilot

1 Share
The OpenClaw logo on a red background.

Microsoft is looking into ways it can integrate OpenClaw-style features into 365 Copilot, according to a report from The Information. The test reportedly comes as part of efforts to make its 365 Copilot AI assistant "run autonomously around the clock" while completing tasks on behalf of users.

Omar Shahine, Microsoft's corporate vice president, confirmed to The Information that the company is "exploring the potential of technologies like OpenClaw in an enterprise context." OpenClaw is an open-source platform that allows users to create AI-powered agents that run locally on a user's device. The platform rose in popularity earlier this year, …

Read the full story at The Verge.


Managing context in long-run agentic applications

1 Share

Excerpt

In complex, long-running agentic systems, maintaining alignment and coherent reasoning between agents requires careful design. In this second article of our series, we explore these challenges and the mechanisms we built to keep teams of agents working productively over long time spans. We present a range of complementary techniques that balance the conflicting requirements of continuity and creativity.


In our first article, we introduced our agentic security investigation service. We described how teams of AI agents collaboratively investigate security alerts. A Director orchestrates the investigation, many specialist Experts gather evidence, and a Critic reviews the Experts’ findings. We suggest you read the series in order.

To briefly recap, our investigation process proceeds through a series of defined phases. Each phase implements a distinct set of agent interactions. Within phases, we may have multiple rounds, where each round is one full iteration through the phase. There’s no preset limit on the number of rounds that make up an investigation: investigations continue until concluded by the Director agent.

The Challenge of Long-run Coherence

Language model APIs are stateless: to provide continuity between requests, the caller must provide the complete message history with each request. Agent frameworks solve the state management problem for users by accumulating message history between API calls. This fills the agent’s context window, which provides a hard limit on how much information the agent can handle. Even approaching an agent’s context window limit can degrade the quality of responses. For short-run applications, no extra context window management is typically required.
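The accumulation pattern described above can be sketched in a few lines of Python. This is purely illustrative: `call_model` is a hypothetical stand-in for a real inference API, not any particular framework's interface.

```python
# Sketch of how an agent framework accumulates message history across
# stateless inference calls. `call_model` is a hypothetical placeholder.

def call_model(messages):
    # A real implementation would send `messages` to an LLM inference API.
    # Here we just count the user turns to fabricate a deterministic reply.
    return {"role": "assistant",
            "content": f"reply #{sum(m['role'] == 'user' for m in messages)}"}

class Agent:
    def __init__(self, system_prompt):
        self.history = [{"role": "system", "content": system_prompt}]

    def send(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        reply = call_model(self.history)   # full history replayed every request
        self.history.append(reply)         # context grows monotonically
        return reply["content"]

agent = Agent("You are an investigator.")
agent.send("Summarize the alert.")
agent.send("What happened next?")
# history now holds: system + 2 user turns + 2 assistant replies
```

Because every call replays the whole list, the context window fills steadily; this is the growth that long-run applications must manage.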

High-level overview of how agent frameworks manage context across inference API calls

Complex security investigations can span hundreds of inference requests and generate megabytes of output, requiring special handling. Multi-agent applications, like ours, add further complexities. For each agent to optimally execute its role, it requires a tailored view of the investigation state. Each view must be carefully balanced. If agents are not anchored to the wider team, the investigation will be disconnected and incoherent. Conversely, sharing too much information stifles creativity and encourages confirmation bias.

Our solution uses three complementary context channels:

  • Director’s Journal: The Director’s structured working memory
  • Critic’s Review: Annotated findings report with credibility scores
  • Critic’s Timeline: Consolidated chronological findings with credibility scores

Each channel serves a different purpose, and together they provide the context each agent needs without overwhelming any of them.

How our agents consume and produce different context sources

Specimen Content

We include edited extracts of the Journal, Review, and Timeline from one investigation in this article. These extracts should give a meaningful sense of what these context resources look like in practice. They have been edited to generalize the content, but they are derived from a real investigation. The alert was generated in response to the loading of a kernel module. In fact, the event was a false positive caused by a developer installing a package in a development environment, and the triggered detection rule being overly sensitive. Specimen extracts are shown in italics.

The Director’s Journal

The Director is responsible for orchestrating the investigation: deciding what questions to ask, which Experts to engage, and when to conclude the investigation. To make coherent decisions across rounds, it needs memory of what’s been discovered and decided. 

The Director has a journaling tool. The Director’s system prompt encourages it to update the Journal often and use it for short notes. The Journal captures decisions, observations, hypotheses, and open questions in a structured format. It serves as the Director’s working memory.

Entry Types

The Journal supports six entry types:

| Type | Purpose | Example |
| --- | --- | --- |
| decision | Strategic choices | “Focus investigation on authentication anomalies rather than network activity” |
| observation | Patterns noticed | “Multiple failed logins preceded the successful authentication” |
| finding | Confirmed facts | “User authenticated from IP 203.0.113.45, not in historical baseline” |
| question | Open items | “Was the VPN connection established before or after the suspicious activity?” |
| action | Steps taken/planned | “Requested Cloud Expert to examine EC2 instance activity” |
| hypothesis | Working theories | “This pattern suggests credential stuffing rather than account compromise” |

In addition to classifying its entries, the Director can also assign priority, list follow-up actions, and include citation references to evidential artifacts. When the journaling tool is used, each entry is annotated with the investigation context: the phase, round number, and timestamp. The tool itself does nothing more than accumulate entries.
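A minimal sketch of such a journaling tool, assuming the entry structure described above. Field names, defaults, and the phase/round values are illustrative, not the service's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The six entry types the Journal supports.
ENTRY_TYPES = {"decision", "observation", "finding",
               "question", "action", "hypothesis"}

@dataclass
class Journal:
    phase: str = "triage"      # illustrative phase name
    round: int = 1
    entries: list = field(default_factory=list)

    def add(self, entry_type, text, priority="Medium",
            follow_ups=(), citations=()):
        """Accumulate one entry; the tool does nothing more than this."""
        assert entry_type in ENTRY_TYPES, f"unknown entry type: {entry_type}"
        self.entries.append({
            "type": entry_type,
            "text": text,
            "priority": priority,
            "follow_ups": list(follow_ups),
            "citations": list(citations),
            # Annotations added by the tool, not by the model:
            "phase": self.phase,
            "round": self.round,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

journal = Journal()
journal.add("hypothesis", "Pattern suggests a false positive", priority="High")
journal.add("action", "Ask Endpoint Expert to verify parent process chain",
            citations=["tooluse_abc123"])
```

Keeping the tool this dumb is deliberate: all interpretation of the Journal happens in the agents' prompts, not in the tool.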

Every agent receives the current content of the Director’s Journal in their prompt, presented in chronological order. Their system prompts include guidance that explains the Director’s role, their relationship to the Director, the purpose of the Journal, and how to interpret it.

How It Maintains Alignment

The Journal allows the Director to lead the investigation towards a conclusion, to observe and measure its progress, to identify dead-ends, and to make course corrections in response. It provides the common narrative that keeps other agents on track.

Breakdown of entry types in the Journal

Example Director’s Journal

Time: 09:32:21 | Priority: High

Identified event as endpoint process start, user running as root on dev workstation. Command is a package hook script (not direct modprobe). Listed key questions about user roles, host type, and log sources.

Time: 09:32:29 | Priority: Medium

Identified 4 relevant expert domains needed: endpoint telemetry, identity/access (user roles), configuration management (host config), user-behavior (activity patterns).

Time: 09:33:10 | Priority: Medium

Noted cgroup indicates user session, hostname suggests personal workstation. Realized command runs during package installation sequence, not actual module loading.

Time: 09:34:06 | Priority: High

Alert rule matching “kmod” in script path, not actual modprobe execution. Host identified as personal dev environment. Activity appears to be legitimate system administration.

Time: 09:35:15 | Priority: Medium

Confirmed dev environment grants root by design. Summarized investigation resources (endpoint queries, process ancestry, SSH cert logs). Preliminary assessment: false positive.

Time: 09:35:51 | Priority: High

Need to verify parent process chain, check for actual modprobe/insmod execution, reconstruct full session activity. Formulated targeted query for endpoint expert.

Time: 09:40:47 | Priority: Critical

Reviewed expert findings (6,046 events showing package install triggered event). All 4 experts agree: FALSE POSITIVE. Timeline confidence 0.83. Decision: advance to conclude.

Time: 09:41:15 | Priority: High

Summarized all findings. Root cause: detection rule matched pathname not actual operation. Recommended action: tune detection rule to distinguish hook scripts from real modprobe.

The Critic’s Review Tools

To progress the investigation, the Director poses questions to Experts. Each Expert has a subject domain and tools to allow them to interrogate relevant data sources. At the end of their run, the Experts produce findings, citing investigation artifacts (tool calls) to support their conclusions. Even with strict guidelines, this process is not, by itself, sufficiently robust. Language models are known to hallucinate, and a proportion of the Experts’ findings could either be invented or grossly misinterpret the data.

The Critic’s role is to assess the Experts’ work, checking that reported findings are supported by evidence and that interpretations are sound. To do this accurately, it needs to be able to inspect not only each Expert’s claims and the cited evidence, but also the methodology behind them.

In the Review task, the Critic examines all the Experts’ findings in a single pass. Aggregating the findings together allows it to identify where the findings support or contradict each other. Due to the number of findings that can be produced, it’s not practical to provide all of the information to the Critic directly. Instead, the Critic receives a summary report and uses a suite of tools to examine the cited evidence.

How Critic’s review tools are used

We provide the Critic with four tools:

| Tool | Purpose |
| --- | --- |
| get_tool_call | Inspect the arguments and metadata of any tool call |
| get_tool_result | Examine the actual output returned by a tool use |
| get_toolset_info | List what tools were available to a specific Expert |
| list_toolsets | List all available toolsets organized by Expert |

Collectively, these tools allow the Critic to examine both the evidence and the data-gathering methodology. When an Expert cites tooluse_abc123 as supporting a finding, the Critic can use get_tool_call to examine the tool parameters used to obtain the result, and get_tool_result to see exactly what data the Expert was looking at. It can also use get_toolset_info to access each tool’s inline documentation and determine whether the tool was used correctly, and list_toolsets to understand whether the Director erred by posing a question to an Expert that was not equipped to answer it, or whether an Expert made a poor tool selection.
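To make the division of labor concrete, here is a sketch of the four tools backed by a hypothetical in-memory artifact store. The stored tool call and all its fields are invented for illustration; the real service resolves these against persisted investigation artifacts.

```python
# Hypothetical in-memory registry backing the four review tools.

TOOL_CALLS = {
    "tooluse_abc123": {
        "expert": "endpoint",
        "tool": "query_process_events",
        "args": {"host": "dev-ws-01", "since": "09:29:01Z"},
        "result": "6046 events: package install triggered hook scripts",
    },
}
TOOLSETS = {"endpoint": ["query_process_events", "get_process_ancestry"]}

def get_tool_call(call_id):
    """Inspect the arguments and metadata of a cited tool call."""
    c = TOOL_CALLS[call_id]
    return {"tool": c["tool"], "args": c["args"], "expert": c["expert"]}

def get_tool_result(call_id):
    """Examine the actual output the Expert was looking at."""
    return TOOL_CALLS[call_id]["result"]

def get_toolset_info(expert):
    """List what tools were available to a specific Expert."""
    return TOOLSETS[expert]

def list_toolsets():
    """List all available toolsets organized by Expert."""
    return {e: list(tools) for e, tools in TOOLSETS.items()}
```

The key property is that every cited artifact is addressable: a finding that cites an ID the Critic cannot resolve is immediately suspect.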

The Review Scoring System

The output of the Critic’s Review task is an annotated findings report containing an overall summary and scored findings. Not all findings are equally reliable. A finding corroborated by multiple sources deserves more weight than speculation based on partial data. By assigning numeric scores, we enable:

  1. Informed decision-making: Highly credible findings can be prioritized
  2. Timeline quality: Only credible findings make it into the consolidated timeline
  3. Audit trails: Staff can quickly identify which conclusions need scrutiny
  4. Operational insights: Dashboards illustrating system performance

The Critic’s Rubric

We use a five-level credibility scale:

| Score | Label | Criteria |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | Supported by multiple sources with no contradictory indicators |
| 0.7-0.89 | Highly-plausible | Corroborated by a single source |
| 0.5-0.69 | Plausible | Mixed evidence support |
| 0.3-0.49 | Speculative | Poor evidence support |
| 0.0-0.29 | Misguided | No evidence provided or misinterpreted |
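The rubric bands map naturally to a small scoring helper. This is a sketch of how the scale could be encoded, not the service's actual implementation:

```python
def credibility_label(score):
    """Map a numeric credibility score to the five-level rubric band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= 0.9:
        return "Trustworthy"
    if score >= 0.7:
        return "Highly-plausible"
    if score >= 0.5:
        return "Plausible"
    if score >= 0.3:
        return "Speculative"
    return "Misguided"
```

For example, the specimen finding scored 0.92 lands in the Trustworthy band, while anything under 0.5 fails the plausibility threshold.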

The following table shows the distribution of classifications over 170,000 reviewed findings. Slightly over a quarter of findings don’t meet the plausibility threshold.

| Score | Label | % |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | 37.7 |
| 0.7-0.89 | Highly-plausible | 25.4 |
| 0.5-0.69 | Plausible | 11.1 |
| 0.3-0.49 | Speculative | 10.4 |
| 0.0-0.29 | Misguided | 15.4 |
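As a quick sanity check on the "slightly over a quarter" figure, summing the two bands below the plausibility threshold:

```python
# Distribution from the table above; sum the two sub-plausible bands.
distribution = {
    "Trustworthy": 37.7,
    "Highly-plausible": 25.4,
    "Plausible": 11.1,
    "Speculative": 10.4,
    "Misguided": 15.4,
}
below_plausible = distribution["Speculative"] + distribution["Misguided"]
print(round(below_plausible, 1))  # 25.8 percent fall below the threshold
```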

It’s reasonable to question whether the Critic’s Review provides a false sense of assurance, since it too is conducted by model inference. We approach this problem from several directions with a range of mitigations.

The first mitigation is to use a stronger model for the Critic. Because the Critic only reviews submitted findings rather than the entire Expert run, the number of tokens required is kept within reasonable limits. While stronger models are still subject to hallucination, research suggests they err less frequently. Equally important is the capacity of the Critic to interpret nuances in the evidence, which is also improved with a stronger model.

The second mitigation is the formulation of the Critic’s instructions. Language models are more likely to hallucinate when posed larger, open-ended questions. The agent is instructed to only make a judgement on the submitted findings.

Example Critic’s Review

Cloud Expert delivered a strong investigation with a comprehensive search query retrieving 6,046 session events and correctly identifying: (1) legitimate package operations, (2) kernel regeneration during system updates, (3) modprobe --show-depends queries for boot ramdisk configuration (not actual module loading), and (4) false positive detection rule matching on hook script name rather than kernel operations.

Annotated Findings

[0.92] Package operations triggered legitimate kernel regeneration on the target development host. Comprehensive query shows package management operations with expected package names confirmed in process event fields.

[0.90] Parent process executed hooks including framebuffer, mdadm, and busybox scripts as part of normal operation. Parent process spawned multiple child processes executing hook scripts.

[0.88] Modprobe operations were information-gathering queries (--show-depends --ignore-install flags) for thermal, dm-cache, raid0 modules, not actual kernel module insertion. Verified executable=/usr/bin/kmod with flags that query dependencies without loading.

[0.87] Activity is expected system maintenance on a personal development environment by an authorized user with expected roles and root access during business hours.

[0.85] Alert triggered on shell script name pattern rather than actual modprobe/insmod execution. Detection rule overly-broad: flagged dash interpreter running script with ‘kmod’ in pathname.

The third mitigation is the Critic’s Timeline task, which we will now describe.

Critic’s Timeline

The Critic’s Timeline task immediately follows the Review task in the investigation sequence. It is challenged to construct the most plausible consolidated timeline from three sources:

  1. The most recent Review
  2. The previous Critic’s Timeline
  3. The Director’s Journal

Whereas the Review task is token intensive and requires the correct use of many tools, Timeline assembly operates entirely on data in the prompt. The intuition is that the more narrowly scoped task leaves a greater capacity for reasoning in the problem domain, rather than methods of data gathering or judgements of Expert methodology.

Consolidation Rules

The Critic follows explicit rules when assembling Timelines:

  1. Include only events supported by credible citations – Speculation doesn’t belong on the Timeline
  2. Remove duplicate entries describing the same event – An event shouldn’t appear twice because two Experts mentioned it
  3. When timestamps conflict, prefer sources with stronger evidence – A log entry timestamp beats an inferred time
  4. Maintain chronological ordering based on best available evidence – Events must flow logically in time
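These consolidation rules can be sketched as a single pipeline, assuming each candidate finding carries a credibility score, a timestamp, an event key for de-duplication, and an evidence-strength value (all field names are illustrative):

```python
PLAUSIBLE = 0.5  # minimum credibility to enter the Timeline

def consolidate(findings):
    # Rule 1: drop speculation below the plausibility threshold.
    credible = [f for f in findings if f["score"] >= PLAUSIBLE]
    # Rules 2 and 3: one entry per event, preferring stronger evidence
    # (so a log-derived timestamp beats an inferred one).
    by_event = {}
    for f in credible:
        prev = by_event.get(f["event"])
        if prev is None or f["evidence_strength"] > prev["evidence_strength"]:
            by_event[f["event"]] = f
    # Rule 4: chronological ordering of the surviving entries.
    return sorted(by_event.values(), key=lambda f: f["timestamp"])

timeline = consolidate([
    {"event": "session_start", "timestamp": "09:29:01Z", "score": 0.90, "evidence_strength": 2},
    {"event": "session_start", "timestamp": "09:28:00Z", "score": 0.70, "evidence_strength": 1},
    {"event": "pkg_install",   "timestamp": "09:30:39Z", "score": 0.92, "evidence_strength": 2},
    {"event": "rootkit",       "timestamp": "09:31:00Z", "score": 0.20, "evidence_strength": 0},
])
# The duplicate session_start collapses to the log-backed entry, and the
# speculative "rootkit" finding never reaches the Timeline.
```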

Gap Identification

Not every Timeline is complete. The Critic identifies significant gaps that should be addressed:

  1. Evidential gaps: Missing data that would strengthen conclusions
  2. Temporal gaps: Unexplained periods between events
  3. Logical inconsistencies: Events that don’t fit the emerging narrative

We limit gap identification to the top 3 most significant gaps. This focuses the Director’s attention on what matters most rather than presenting an exhaustive list of unknowns.
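Assuming the Critic attaches a numeric significance to each gap (an illustrative modeling choice, not a documented field), the cut to the top 3 is a simple sort-and-slice:

```python
def top_gaps(gaps, limit=3):
    """Keep only the most significant gaps, most significant first."""
    return sorted(gaps, key=lambda g: g["significance"], reverse=True)[:limit]

gaps = [
    {"kind": "temporal",   "significance": 0.4},
    {"kind": "evidential", "significance": 0.9},
    {"kind": "logical",    "significance": 0.7},
    {"kind": "evidential", "significance": 0.2},
]
picked = top_gaps(gaps)  # drops the least significant gap
```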

The Critic is instructed to score the Timeline using a narrative-building rubric.

| Score | Label | Meaning |
| --- | --- | --- |
| 0.9-1.0 | Trustworthy | Strong corroboration across multiple sources, consistent timestamps, no significant gaps |
| 0.7-0.89 | Highly-plausible | Good evidence support, minor gaps present, mostly consistent Timeline |
| 0.5-0.69 | Plausible | Some uncertainty in event ordering, notable gaps exist |
| 0.3-0.49 | Speculative | Poor evidence support, significant gaps, conflicted narrative |
| 0.0-0.29 | Invalid | No evidence, confounding inconsistencies present |

The Timeline task raises the bar for hallucinated findings by enforcing narrative coherence. To be preserved, each finding must be consistent with the full chain of evidence; findings that contradict or lack support from the broader narrative are pruned. A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with.

Example Critic’s Timeline

Confidence Score: 0.83

False positive security alert triggered during legitimate system maintenance on a personal development environment. Detection rule incorrectly flagged a package hook script based on pathname string matching, rather than actual kernel module loading operations. All modprobe executions were dependency queries (--show-depends flags) for boot ramdisk configuration, not live kernel modifications. Activity occurred during business hours with proper audit trail preservation, consistent with the development environment’s intended use.

Event Sequence

09:29:01Z – User session begins on development workstation

09:30:39Z – Package management operations initiated by developer

09:30:48Z – Package management triggered system maintenance hooks

09:31:26Z – ALERT TRIGGERED – Hook script invoked

09:31:27Z – modprobe information-gathering for modules to determine ramdisk dependencies

09:31:29Z – modprobe dependency queries complete

09:31:29Z – Additional hook scripts executed as part of ramdisk regeneration process

Evidence Gaps

  • Exact session initiation timestamp unknown – session activity observed from 09:29:01Z but SSH login event not captured
  • Specific command that initiated apt/dpkg operations not identified – timeline shows package operations beginning at 09:30:39Z but triggering command not documented
  • Secondary analyst failed to locate parent process using incorrect field name and missed modprobe operations by searching wrong path – reduces confidence in independent verification

Message History

As we explained in the introduction, agentic frameworks manage message history by accumulating messages and tool calls through the chain of inference requests that make up each agent invocation. In long-run agentic applications, you cannot simply carry the message history forward indefinitely. As more of the model’s context window is consumed, costs and inference latencies increase, model performance declines, and eventually the accumulated messages will exceed the context window.

Our approach is to rely entirely on the context channels presented in this article: the Journal, Review, and Timeline. Beyond these resources, we do not pass any message history forward between agent invocations. Collectively, these channels provide a means of online context summarization, removing the need for extensive message histories. Even if context windows were infinitely large, passing message history between rounds would not necessarily be desirable: the accumulated context could impede the agents’ capacity to respond appropriately to new information.
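The per-invocation assembly can be sketched as follows; the prompt section headers and field names here are purely illustrative, not the service's actual prompt format:

```python
# Each agent invocation starts from a fresh prompt built from the three
# context channels alone -- no message history is carried forward.

def build_prompt(role_instructions, journal_entries, review, timeline):
    sections = [role_instructions]
    if journal_entries:
        sections.append("## Director's Journal\n" + "\n".join(
            f"- [{e['type']}] {e['text']}" for e in journal_entries))
    if review:
        sections.append("## Critic's Review\n" + review)
    if timeline:
        sections.append("## Critic's Timeline\n" + timeline)
    return "\n\n".join(sections)

prompt = build_prompt(
    "You are the Endpoint Expert.",
    [{"type": "question", "text": "Was modprobe actually executed?"}],
    review="[0.85] Alert triggered on script name pattern.",
    timeline="09:31:26Z ALERT TRIGGERED",
)
```

Every round pays a fixed, bounded context cost: the three channels act as the summary, so the prompt size tracks the investigation's distilled state rather than its raw message volume.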

Conclusion

Maintaining alignment and orientation in multi-agent investigations requires deliberate design. Each agent should have specific responsibilities, and a view of the investigation state tailored to its task. With proper design, context window limitations are not a major obstacle to building complex, long-running agentic applications.

We addressed these challenges with complementary mechanisms:

  • Journal: Structured, shared memory for investigation orchestration
  • Review: Credibility-scored findings that prune out inaccuracies and hallucinations
  • Timeline: Most plausible chronology, constructed from credible evidence

These mechanisms work together to maintain coherence across rounds, while preserving the benefits of specialized agent roles. The Director can make informed strategic decisions. Experts can build on previous understanding. The Critic can objectively evaluate findings. The result is investigations that are more thorough and more trustworthy than any single agent could produce alone.

In our next article, we’ll explore how artifacts serve as a communication channel between investigation participants, examining the artifact system that connects findings to evidence and enables the verification workflows described in this article.

Acknowledgements

We wanted to give a shout out to all the people who have contributed to this journey:

  • Chris Smith
  • Abhi Rathod
  • Dave Russell
  • Nate Reeves

 


AWS Weekly Roundup: Claude Mythos Preview in Amazon Bedrock, AWS Agent Registry, and more (April 13, 2026)

1 Share

In my last Week in Review post, I mentioned how much time I’ve been spending on AI-Driven Development Lifecycle (AI-DLC) workshops with customers this year. A common theme in those sessions is the need for better cost visibility. Teams are moving fast with AI, but as they go from experimenting to full production, finance and leadership really need to know who is using which resources and at what cost. That’s why I was so excited to see the launch of Amazon Bedrock’s new support for cost allocation by IAM user and role this week. This lets you tag IAM principals with attributes like team or cost center and then activate those tags in your Billing and Cost Management console. The resulting cost data flows into AWS Cost Explorer and the detailed Cost and Usage Report, giving you a clear line of sight into model inference spending. Whether you’re scaling agents across teams, tracking foundation model use by department, or running tools like Claude Code on Amazon Bedrock, this new feature is a game changer for tracking and managing your AI investments. You can get all the details on setting this up in the IAM principal cost allocation documentation.

Now, let’s get into this week’s AWS news…

Headlines
Amazon Bedrock now offers Claude Mythos Preview
Anthropic’s most sophisticated AI model to date is now available on Amazon Bedrock as a gated research preview through Project Glasswing. Claude Mythos introduces a new model class focused on cybersecurity, capable of identifying sophisticated security vulnerabilities in software, analyzing large codebases, and delivering state of the art performance across cybersecurity, coding, and complex reasoning tasks. Security teams can use it to discover and address vulnerabilities in critical software before threats emerge. Access is currently limited to allowlisted organizations, with Anthropic and AWS prioritizing internet critical companies and open source maintainers.

AWS Agent Registry for centralized agent discovery and governance now in preview
AWS launched Agent Registry through Amazon Bedrock AgentCore, providing organizations with a private catalog for discovering and managing AI agents, tools, skills, MCP servers, and custom resources. The registry helps teams locate existing capabilities rather than duplicating them, with semantic and keyword search, approval workflows, and CloudTrail audit trails. It is accessible via the AgentCore Console, AWS CLI, SDK, and as an MCP server queryable from IDEs.

Last week’s launches
Here are some launches and updates from this past week that caught my attention:

  • Announcing Amazon S3 Files, making S3 buckets accessible as file systems — Amazon S3 Files transforms S3 buckets into shared file systems that connect any AWS compute resource directly with your S3 data. Built on Amazon EFS technology, it delivers full file system semantics with low latency performance, caching actively used data and providing multiple terabytes per second of aggregate read throughput. Applications can access S3 data through both file system and S3 APIs simultaneously without code modifications or data migration.
  • Amazon OpenSearch Service supports Managed Prometheus and agent tracing — Amazon OpenSearch Service now provides a unified observability platform that consolidates metrics, logs, traces, and AI agent tracing into a single interface. The update includes native Prometheus integration with direct PromQL query support, RED metrics monitoring, and OpenTelemetry GenAI semantic convention support for LLM execution visibility. Operations teams can correlate slow traces to logs and overlay Prometheus metrics on dashboards without switching between tools.
  • Amazon WorkSpaces Advisor now available for AI powered troubleshooting — AWS launched Amazon WorkSpaces Advisor, an AI powered administrative tool that uses generative AI to help IT administrators troubleshoot Amazon WorkSpaces Personal deployments. It analyzes WorkSpace configurations, detects problems automatically, and provides actionable recommendations to restore service and optimize performance.
  • Amazon Braket adds support for Rigetti’s 108 qubit Cepheus QPU — Amazon Braket now offers access to Rigetti’s Cepheus-1-108Q device, the first 100+ qubit superconducting quantum processor on the platform. The modular design features twelve 9 qubit chiplets with CZ gates that offer enhanced resilience to phase errors. It supports multiple frameworks including Braket SDK, Qiskit, CUDA-Q, and Pennylane, with pulse level control for researchers.

For a full list of AWS announcements, be sure to keep an eye on the What’s New with AWS page.

Other AWS news
Here are some additional posts and resources that you might find interesting:

Upcoming AWS events
Check your calendar and sign up for upcoming AWS events:

  • What’s Next with AWS (April 28, Virtual) Join this livestream at 9am PT for a candid discussion about how agentic AI is transforming how businesses operate. Featuring AWS CEO Matt Garman, SVP Colleen Aubrey, and OpenAI leaders discussing emerging agent capabilities, Amazon’s internal experiences, and new agentic solutions and platform capabilities.

Browse here for upcoming AWS-led in-person and virtual events, startup events, and developer-focused events.


That’s all for this week. Check back next Monday for another Weekly Roundup!

~ micah


How to Analyze Hugging Face for Arm64 Readiness

1 Share

This post is a collaboration between Docker and Arm, demonstrating how Docker MCP Toolkit and the Arm MCP Server work together to scan Hugging Face Spaces for Arm64 Readiness.

In our previous post, we walked through migrating a legacy C++ application with AVX2 intrinsics to Arm64 using Docker MCP Toolkit and the Arm MCP Server – code conversion, SIMD intrinsic rewrites, compiler flag changes, the full stack. This post is about a different and far more common failure mode.

When we tried to run ACE-Step v1.5, a 3.5B parameter music generation model from Hugging Face, on an Arm64 MacBook, the installation failed not with a cryptic kernel error but with a pip error. The flash-attn wheel in requirements.txt was hardcoded to a linux_x86_64 URL, no Arm64 wheel existed at that address, and the container would not build. It’s a deceptively simple problem that turns out to affect roughly 80% of Hugging Face Docker Spaces: not the code, not the Dockerfile, but a single hardcoded dependency URL that nobody noticed because nobody had tested on Arm.
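A pinned requirements.txt entry of this shape will break any non-x86-64 build. The URL below is a hypothetical stand-in, not the Space's actual pin; PEP 508 environment markers offer one way to scope such a pin to the platforms it can actually satisfy:

```text
# Hardcoded x86-64 wheel: pip fails on Arm64 because no wheel exists at this URL.
flash-attn @ https://example.com/flash_attn-2.5.0-cp310-linux_x86_64.whl

# One portable alternative: restrict the pin to x86-64 with an environment
# marker, and handle other platforms separately (source build or another package).
flash-attn @ https://example.com/flash_attn-2.5.0-cp310-linux_x86_64.whl ; platform_machine == "x86_64"
```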

To diagnose this systematically, we built a 7-tool MCP chain that can analyze any Hugging Face Space for Arm64 readiness in about 15 minutes. By the end of this guide you’ll understand exactly why ACE-Step v1.5 fails on Arm64, what the two specific blockers are, and how the chain surfaces them automatically.

Why Hugging Face Spaces Matter for Arm

Hugging Face hosts over one million Spaces, a significant portion of which use the Docker SDK, meaning developers write a Dockerfile and Hugging Face builds and serves the container directly. The problem is that nearly all of those containers were built and tested exclusively on linux/amd64, which creates a deployment wall for three fast-growing Arm64 targets that are increasingly relevant for AI workloads.

| Target | Hardware | Why it matters |
| --- | --- | --- |
| Cloud | AWS Graviton, Azure Cobalt, Google Axion | 20-40% cost reduction vs. x86 |
| Edge/Robotics | NVIDIA Jetson Thor, DGX Spark | GR00T, LeRobot, Isaac all target Arm64 |
| Local dev | Apple Silicon M1-M4 | Most popular developer machine, zero cloud cost |

The failure mode isn’t always obvious, and it tends to show up in one of two distinct patterns. The first is a missing container manifest – the image has no arm64 layer and Docker refuses to pull it, which is at least straightforward to diagnose. The second is harder to catch: the Dockerfile and base image are perfectly fine, but a dependency in requirements.txt points to a platform-specific wheel URL. The build starts, reaches pip install, and fails with a platform mismatch error that gives no clear indication of where to look. ACE-Step v1.5 is a textbook example of the second pattern, and the MCP chain catches both in minutes.
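The first pattern can be checked mechanically: a multi-arch image's manifest list enumerates the architectures it supports. This sketch parses a hypothetical manifest list (shaped like the raw output of a registry inspection, e.g. via Skopeo; digests and entries are invented) and checks for an arm64 entry:

```python
import json

# Hypothetical manifest list for a multi-arch image; the structure mirrors
# a Docker/OCI manifest list, but the entries here are invented.
MANIFEST_LIST = json.loads("""
{
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}, "digest": "sha256:aaa"},
    {"platform": {"architecture": "arm64", "os": "linux"}, "digest": "sha256:bbb"}
  ]
}
""")

def supports_arch(manifest_list, arch, os_name="linux"):
    """Return True if the manifest list advertises a layer for `arch`."""
    return any(
        m.get("platform", {}).get("architecture") == arch
        and m.get("platform", {}).get("os") == os_name
        for m in manifest_list.get("manifests", [])
    )
```

An image missing the arm64 entry fails fast at pull time; the second pattern (a platform-locked wheel URL) only surfaces later, during pip install, which is why the chain inspects requirements files as well as manifests.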

The 7-Tool MCP Chain

Docker MCP Toolkit orchestrates the analysis through a secure MCP Gateway. Each tool runs in an isolated Docker container. The seven tools in the chain are:

Caption: The 7-tool MCP chain architecture diagram

  1. Hugging Face MCP – Discovers the Space, identifies SDK type (Docker vs. Gradio)
  2. Skopeo (via Arm MCP Server) – Inspects the container registry, reports supported architectures
  3. migrate-ease (via Arm MCP Server) – Scans source code for x86-specific intrinsics, hardcoded paths, arch-locked libraries
  4. GitHub MCP – Reads Dockerfile, pyproject.toml, requirements.txt from the repository
  5. Arm Knowledge Base (via Arm MCP Server) – Searches learn.arm.com for build strategies and optimization guides
  6. Sequential Thinking – Combines findings into a structured migration verdict
  7. Docker MCP Gateway – Routes requests, manages container lifecycle

The natural question at this point is whether you could simply rebuild your Docker image for Arm64 and be done with it and for many applications, you could. But knowing in advance whether the rebuild will actually succeed is a different problem. Your Dockerfile might depend on a base image that doesn’t publish Arm64 builds. Your Python dependencies might not have aarch64 wheels. Your code might use x86-specific system calls. The MCP chain checks all of this automatically before you invest time in a build that may not work.

Setting Up Visual Studio Code with Docker MCP Toolkit

Prerequisites

Before you begin, make sure you have:

  • A machine with 8 GB RAM minimum (16 GB recommended)
  • The latest Docker Desktop release
  • VS Code with GitHub Copilot extension
  • GitHub account with personal access token

Step 1. Enable Docker MCP Toolkit

Open Docker Desktop and enable the MCP Toolkit from Settings.

To enable:

  1. Open Docker Desktop
  2. Go to Settings > Beta Features
Enabling Docker MCP Toolkit under Docker Desktop

Caption: Enabling Docker MCP Toolkit under Docker Desktop

  3. Toggle Docker MCP Toolkit ON
  4. Click Apply

Step 2. Add Required MCP Servers from Catalog

Add the following four MCP servers, which you can find by selecting “Catalog” in the Docker Desktop MCP Toolkit: the Arm MCP Server, GitHub (official), Sequential Thinking, and Hugging Face.

Searching for Arm MCP Server in the Docker MCP Catalog

Caption: Searching for Arm MCP Server in the Docker MCP Catalog

Step 3. Configure the Servers

  1. Configure the Arm MCP Server

To access your local code for the migrate-ease scan and MCA tools, the Arm MCP Server needs a directory configured to point to your local code.

Arm MCP Server configuration

Caption: Arm MCP Server configuration

Once you click ‘Save’, the Arm MCP Server will know where to look for your code. If you want to give a different directory access in the future, you’ll need to change this path.

Available Arm Migration Tools

Click Tools to view all six MCP tools available under Arm MCP Server:

List of MCP tools provided by the Arm MCP Server

Caption: List of MCP tools provided by the Arm MCP Server

  • knowledge_base_search – Semantic search of Arm learning resources, intrinsics documentation, and software compatibility
  • migrate_ease_scan – Code scanner supporting C++, Python, Go, JavaScript, and Java for Arm compatibility analysis
  • check_image – Docker image architecture verification (checks if images support Arm64)
  • skopeo – Remote container image inspection without downloading
  • mca – Machine Code Analyzer for assembly performance analysis and IPC predictions
  • sysreport_instructions – System architecture information gathering

  2. Configure the GitHub MCP Server

The GitHub MCP Server lets GitHub Copilot read repositories, create pull requests, manage issues, and commit changes.

Steps to configure GitHub Official MCP Server

Caption: Steps to configure GitHub Official MCP Server

Configure Authentication:

  1. Select GitHub official
  2. Choose your preferred authentication method
  3. For Personal Access Token, get the token from GitHub > Settings > Developer Settings
Setting up Personal Access Token in GitHub MCP Server

Caption: Setting up Personal Access Token in GitHub MCP Server

  3. Configure the Sequential Thinking MCP Server
  • Click “Sequential Thinking”
  • No configuration needed
Sequential MCP Server requires zero configuration

Caption: Sequential MCP Server requires zero configuration

This server helps GitHub Copilot break down complex migration decisions into logical steps.

  4. Configure the Hugging Face MCP Server

The Hugging Face MCP Server provides access to Space metadata, model information, and repository contents directly from the Hugging Face Hub.

  • Click “Hugging Face”
  • No additional configuration needed for public Spaces
  • For private Spaces, add your HuggingFace API token

Step 4. Add the Servers to VS Code

The Docker MCP Toolkit makes it incredibly easy to configure MCP servers for clients like VS Code.

To configure, click “Clients” and scroll down to Visual Studio Code. Click the “Connect” button:

Setting up Visual Studio Code as MCP Client

Caption: Setting up Visual Studio Code as MCP Client

Now open VS Code and click on the ‘Extensions’ icon in the left toolbar:

Configuring MCP_DOCKER under VS Code Extensions

Caption: Configuring MCP_DOCKER under VS Code Extensions

Click the MCP_DOCKER gear, and click ‘Start Server’:

Starting MCP Server under VS Code

Caption: Starting MCP Server under VS Code

Step 5. Verify Connection

Open GitHub Copilot Chat in VS Code and ask:

What Arm migration and Hugging Face tools do you have access to?

You should see tools from all four servers listed. If you see them, your connection works. Let’s scan a Hugging Face Space.

Playing around with GitHub Copilot

Caption: Playing around with GitHub Copilot


Real-World Demo: Scanning ACE-Step v1.5

Now that you’ve connected GitHub Copilot to Docker MCP Toolkit, let’s scan a real Hugging Face Space for Arm64 readiness and uncover the exact Arm64 blocker we hit when trying to run it locally.

  • Target: ACE-Step v1.5 – a 3.5B parameter music generation model 
  • Time to scan: 15 minutes 
  • Infrastructure cost: $0 (all tools run locally in Docker containers) 

The Workflow

Docker MCP Toolkit orchestrates the scan through a secure MCP Gateway that routes requests to specialized tools: the Arm MCP Server inspects images and scans code, Hugging Face MCP discovers the Space, GitHub MCP reads the repository, and Sequential Thinking synthesizes the verdict. 

Step 1. Give GitHub Copilot Scan Instructions

Open your project in VS Code. In GitHub Copilot Chat, paste this prompt:

Your goal is to analyze the Hugging Face Space "ACE-Step/ACE-Step-v1.5" for Arm64 migration readiness. Use the MCP tools to help with this analysis.

Steps to follow:
1. Use Hugging Face MCP to discover the Space and identify its SDK type (Docker or Gradio)
2. Use skopeo to inspect the container image - check what architectures are currently supported
3. Use GitHub MCP to read the repository - examine pyproject.toml, Dockerfile, and requirements
4. Run migrate_ease_scan on the source code to find any x86-specific dependencies or intrinsics
5. Use knowledge_base_search to find Arm64 build strategies for any issues discovered
6. Use sequential thinking to synthesize all findings into a migration verdict

At the end, provide a clear GO / NO-GO verdict with a summary of required changes.

Step 2. Watch Docker MCP Toolkit Execute

GitHub Copilot orchestrates the scan using Docker MCP Toolkit. Here’s what happens:

Phase 1: Space Discovery

GitHub Copilot starts by querying the Hugging Face MCP server to retrieve Space metadata.

GitHub Copilot uses HuggingFace MCP to discover the Space and identify its SDK type.

Caption: GitHub Copilot uses Hugging Face MCP to discover the Space and identify its SDK type.

The tool returns that ACE-Step v1.5 uses the Docker SDK – meaning Hugging Face serves it as a pre-built container image, not a Gradio app. This is critical: Docker SDK Spaces ship Dockerfiles we can analyze and rebuild, while Gradio SDK Spaces are built by Hugging Face’s infrastructure, which we can’t control.

Phase 2: Container Image Inspection

Next, Copilot uses the Arm MCP Server’s skopeo tool to inspect the container image without downloading it.

The skopeo tool reports that the container image has no arm64 build available. The container won't start on Arm hardware.

Caption: The skopeo tool reports that the container image has no Arm64 build available. The container won’t start on Arm hardware.

Result: the manifest includes only linux/amd64. No Arm64 build exists. This is the first concrete data point: the container will fail on any Arm hardware. But this is not the full story.
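
To make that check concrete, here is a small illustrative sketch, separate from the MCP chain itself, of what a tool like skopeo surfaces: a multi-arch image’s manifest list enumerates one entry per platform, and an amd64-only image simply has no arm64 entry. The sample JSON below is made up for the demo.

```python
import json

# Given the raw JSON for an image's manifest list (as returned by a remote
# registry inspection), collect the platforms the image supports.
def supported_architectures(manifest_json: str) -> set[str]:
    manifest = json.loads(manifest_json)
    return {
        entry["platform"]["architecture"]
        for entry in manifest.get("manifests", [])
        if "platform" in entry
    }

# Fabricated amd64-only manifest list, mimicking the ACE-Step result:
sample = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
    ],
})
archs = supported_architectures(sample)
print("arm64 available" if "arm64" in archs else "amd64 only: will not run on Arm")
```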

Phase 3: Source Code Analysis

Copilot uses GitHub MCP to read the repository’s key files. Here is the actual Dockerfile from the Space:

FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    DEBIAN_FRONTEND=noninteractive \
    TORCHAUDIO_USE_TORCHCODEC=0

RUN apt-get update && \
    apt-get install -y --no-install-recommends git libsndfile1 build-essential && \
    apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev && \
    rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
RUN mkdir -p /data && chown user:user /data && chmod 755 /data

ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH \
    GRADIO_SERVER_NAME=0.0.0.0 \
    GRADIO_SERVER_PORT=7860

WORKDIR $HOME/app
COPY --chown=user:user requirements.txt .
COPY --chown=user:user acestep/third_parts/nano-vllm ./acestep/third_parts/nano-vllm
USER user

RUN pip install --no-cache-dir --user -r requirements.txt
RUN pip install --no-deps ./acestep/third_parts/nano-vllm

COPY --chown=user:user . .
EXPOSE 7860
CMD ["python", "app.py"]

The Dockerfile itself looks clean:

  • python:3.11-slim already publishes multi-arch builds including arm64
  • No -mavx2, no -march=x86-64 compiler flags
  • build-essential, ffmpeg, libsndfile1 are all available in Debian’s arm64 repositories
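
The compiler-flag check from the list above can be approximated with a simple pattern scan. This is a hypothetical helper, far cruder than migrate-ease, shown only to illustrate what “no x86-specific flags” means in practice:

```python
import re

# Hypothetical helper mirroring one of the chain's Dockerfile checks:
# flag compiler options that lock a build to x86. The real migrate-ease
# scanner covers far more (intrinsics, inline asm, arch-locked packages).
X86_FLAG = re.compile(r"-m(?:avx|sse)\w*|-march=(?:x86-64|haswell|skylake)")

def x86_flag_lines(dockerfile_text: str) -> list[str]:
    """Return Dockerfile lines containing x86-only compiler flags."""
    return [ln for ln in dockerfile_text.splitlines() if X86_FLAG.search(ln)]

clean = "FROM python:3.11-slim\nRUN pip install --no-cache-dir -r requirements.txt"
dirty = 'RUN CFLAGS="-mavx2 -march=x86-64" pip install somepkg'
print(x86_flag_lines(clean))  # []
print(x86_flag_lines(dirty))  # the flagged RUN line
```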

But the real problem is in requirements.txt. This is what I hit when I tried to install ACE-Step locally:

# nano-vllm dependencies
triton>=3.0.0; sys_platform != 'win32'

flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'

Two immediate blockers:

  • flash-attn is pinned to a hardcoded linux_x86_64 wheel URL. On an aarch64 system, pip downloads this wheel and immediately rejects it: “not a supported wheel on this platform.” This is the exact error I hit.
  • triton>=3.0.0 has no aarch64 wheel on PyPI for Linux. It will fail on Arm hardware.

Neither of these is a code problem. The Python source code is architecture-neutral. The fix is in the dependency declarations.
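
The reason the hardcoded wheel fails is mechanical: pip compares the platform tag baked into the wheel filename against the running machine. The sketch below is a drastically simplified stand-in for pip’s real tag logic (the packaging library’s tag matching), with hypothetical filenames:

```python
# Simplified model of pip's wheel compatibility check.
# Wheel filename convention: {dist}-{version}-{python}-{abi}-{platform}.whl
def wheel_platform_tag(wheel_filename: str) -> str:
    """Extract the platform tag from a wheel filename."""
    return wheel_filename[:-len(".whl")].split("-")[-1]

def is_compatible(wheel_filename: str, machine: str) -> bool:
    """True if the wheel's platform tag matches the machine (or is universal)."""
    tag = wheel_platform_tag(wheel_filename)
    return tag == "any" or tag.endswith(machine)

wheel = "flash_attn-2.8.3-cp311-cp311-linux_x86_64.whl"
print(is_compatible(wheel, "x86_64"))   # True
print(is_compatible(wheel, "aarch64"))  # False: "not a supported wheel on this platform"
```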

Phase 4: Architecture Compatibility Scan

Copilot runs the migrate_ease_scan tool with the Python scanner on the codebase.

The migrate_ease_scan tool analyzes the Python source code and finds zero x86-specific dependencies. No intrinsics, no hardcoded paths, no architecture-locked libraries.

Caption: The migrate_ease_scan tool analyzes the Python source code and finds zero x86-specific dependencies. No intrinsics, no hardcoded paths, no architecture-locked libraries.

The application source code itself returns 0 architecture issues — no x86 intrinsics, no platform-specific system calls. But the scan also flags the dependency manifest. Two blockers in requirements.txt:

| Dependency | Issue | Arm64 Fix |
| --- | --- | --- |
| flash-attn (linux wheel) | Hardcoded linux_x86_64 URL | Use flash-attn 2.7+ via PyPI — publishes aarch64 wheels natively |
| triton>=3.0.0 | No aarch64 PyPI wheel for Linux | Exclude on aarch64 or use triton-nightly aarch64 build |

Phase 5: Arm Knowledge Base Lookup

Copilot queries the Arm MCP Server’s knowledge base for solutions to the discovered issues.

GitHub Copilot uses the knowledge_base_search tool to find Docker buildx multi-arch strategies from learn.arm.com.

Caption: GitHub Copilot uses the knowledge_base_search tool to find Docker buildx multi-arch strategies from learn.arm.com.

The knowledge base returns documentation on:

  • flash-attn aarch64 wheel availability from version 2.7+
  • PyTorch Arm64 optimization guides for Graviton and Apple Silicon
  • Best practices for CUDA 13.0 on aarch64 (Jetson Thor / DGX Spark)
  • triton alternatives for CPU inference paths on Arm

Phase 6: Synthesis and Verdict

Sequential Thinking combines all findings into a structured verdict:

| Check | Result | Blocks? |
| --- | --- | --- |
| Container manifest | amd64 only | Yes, needs rebuild |
| Base image python:3.11-slim | Multi-arch (arm64 available) | No |
| System packages (ffmpeg, libsndfile1) | Available in Debian arm64 | No |
| torch==2.9.1 | aarch64 wheels published | No |
| flash-attn linux wheel | Hardcoded linux_x86_64 URL | YES, add arm64 URL alongside |
| triton>=3.0.0 | aarch64 wheels available from 3.5.0+ | No, resolves automatically |
| Source code (migrate-ease) | 0 architecture issues | No |
| Compiler flags in Dockerfile | None x86-specific | No |

Verdict: CONDITIONAL GO. Zero code changes. Zero Dockerfile changes. One dependency fix is required.


Here are the exact changes needed in requirements.txt:

# BEFORE — only x86_64

flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'


# AFTER — add arm64 line alongside x86_64
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_aarch64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine != 'aarch64'

# triton — no change needed, 3.5.0+ has aarch64 wheels, resolves automatically
triton>=3.0.0; sys_platform != 'win32'
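
As a sanity check on why the paired markers work, here is a tiny hand-rolled illustration, not pip’s real marker engine, and with the wheel URLs elided as placeholders, showing that each platform selects exactly one flash-attn line:

```python
# Toy evaluator for the two platform_machine markers used above.
# pip's real engine evaluates full PEP 508 marker expressions; this only
# handles the exact pair of conditions in the requirements fix.
def applicable_lines(lines: list[str], machine: str) -> list[str]:
    """Return the requirement lines whose marker applies on `machine`."""
    chosen = []
    for line in lines:
        req, _, marker = line.partition(";")
        marker = marker.strip()
        if "platform_machine == 'aarch64'" in marker and machine != "aarch64":
            continue
        if "platform_machine != 'aarch64'" in marker and machine == "aarch64":
            continue
        chosen.append(req.strip())
    return chosen

lines = [
    "flash-attn @ .../linux_aarch64.whl ; platform_machine == 'aarch64'",
    "flash-attn @ .../linux_x86_64.whl ; platform_machine != 'aarch64'",
]
print(applicable_lines(lines, "aarch64"))  # only the aarch64 wheel
print(applicable_lines(lines, "x86_64"))   # only the x86_64 wheel
```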

After those two fixes, the build command is:

docker buildx build --platform linux/arm64 -t ace-step:arm64 .

That single command unlocks three deployment paths:

  • NVIDIA Arm64 — Jetson Thor, DGX Spark (aarch64 + CUDA 13.0)
  • Cloud Arm64 — AWS Graviton, Azure Cobalt, Google Axion (20-40% cost savings)
  • Apple Silicon — M1-M4 Macs with MPS acceleration (local inference, $0 cloud cost)

Phase 7: Create the Pull Request

After completing the scan, Copilot uses GitHub MCP to propose the fix. Since the only blocker is the hardcoded linux_x86_64 wheel URL on line 32 of requirements.txt, the change is surgical: one line added, nothing removed.

The fix adds the equivalent linux_aarch64 wheel from the same release alongside the existing x86_64 entry, conditioned on platform_machine == 'aarch64':

# BEFORE — only x86_64, fails silently on Arm
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'

# AFTER — add arm64 line alongside, conditioned by platform_machine
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl ; sys_platform == 'linux' and python_version == '3.11'
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_aarch64.whl ; sys_platform == 'linux' and python_version == '3.11' and platform_machine == 'aarch64'

PR #14 on Hugging Face - Ready to merge

Caption: PR #14 on Hugging Face – Ready to merge

The key insight: the upstream maintainer already published the arm64 wheel in the same release. The fix wasn’t a rebuild or a code change – it was adding one line that references an artifact that already existed. The MCP chain found it in 15 minutes. Without it, a developer hitting this pip error would spend hours tracking it down.

PR: https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5/discussions/14

Without Arm MCP vs. With Arm MCP

Let’s be clear about what changes when you add the Arm MCP Server to Docker MCP Toolkit.

  • Without Arm MCP: You ask GitHub Copilot to check your Hugging Face Space for Arm64 compatibility. Copilot responds with general advice: “Check if your base image supports arm64”, “Look for x86-specific code”, “Try rebuilding with buildx”. You manually inspect Docker Hub, grep through the codebase, check each dependency on PyPI, and hit a pip install failure you cannot easily diagnose. The flash-attn URL issue alone can take an hour to track down.
  • With Arm MCP + Docker MCP Toolkit: You ask the same question. Within minutes, it uses skopeo to verify the base image, runs migrate_ease_scan on your actual codebase, flags the hardcoded linux_x86_64 wheel URLs in requirements.txt, queries knowledge_base_search for the correct fix, and synthesizes a structured CONDITIONAL GO verdict with every check documented.

Real images get inspected. Real code gets scanned. Real dependency files get analyzed. The difference is Docker MCP Toolkit gives GitHub Copilot access to actual Arm migration tooling, not just general knowledge.

Manual Process vs. MCP Chain

Manual process:

  1. Clone the Hugging Face Space repository (10 minutes)
  2. Inspect the container manifest for architecture support (5 minutes)
  3. Read through pyproject.toml and requirements.txt (20 minutes)
  4. Check PyPI for Arm64 wheel availability across all dependencies (30 minutes)
  5. Analyze the Dockerfile for hardcoded architecture assumptions (10 minutes)
  6. Research CUDA/cuDNN Arm64 support for the required versions (20 minutes)
  7. Write up findings and recommended changes (15 minutes)

Total: 2-3 hours per Space

With Docker MCP Toolkit:

  1. Give GitHub Copilot the scan instructions (5 minutes)
  2. Review the migration report (5 minutes)
  3. Submit a PR with changes (5 minutes)

Total: 15 minutes per Space

What This Suggests at Scale

ACE-Step is a standard Python AI application: PyTorch, Gradio, pip dependencies, a slim Dockerfile. This pattern covers the majority of Docker SDK Spaces on Hugging Face.

The Arm64 wall for these apps is not always visible. The Dockerfile looks clean. The base image supports arm64. The Python code has no intrinsics. But buried in requirements.txt is a hardcoded wheel URL pointing at a linux_x86_64 binary, and nobody finds it until they actually try to run the container on Arm hardware.

That is the 80% problem: 80% of Hugging Face Docker Spaces have never been tested on Arm. Not because the code will not work, but because nobody checked. The MCP chain is a systematic check that takes 15 minutes instead of an afternoon of debugging pip errors.

That has real cost implications:

  • Graviton inference runs 20-40% cheaper for the same workloads. Every amd64-only Space leaves that savings untouched.
  • NVIDIA Physical AI (GR00T, LeRobot, Isaac) deploys on Jetson Thor. Developers find models on Hugging Face, but the containers fail to build on target hardware.
  • Apple Silicon is the most common developer laptop. Local inference means faster iteration and no cloud bill.

How Docker MCP Toolkit Changes Development

Docker MCP Toolkit changes how developers interact with specialized knowledge and capabilities. Rather than learning new tools, installing dependencies, or managing credentials, developers connect their AI assistant once and immediately access containerized expertise.

The benefits extend beyond Hugging Face scanning:

  • Consistency — Same 7-tool chain produces the same structured analysis for any container
  • Security — Each tool runs in an isolated Docker container, preventing tool interference
  • Reproducibility — Scans behave identically across environments
  • Composability — Add or swap tools as the ecosystem evolves
  • Discoverability — Docker MCP Catalog makes finding the right server straightforward

Most importantly, developers remain in their existing workflow. VS Code. GitHub Copilot. Git. No context switching to external tools or dashboards.

Wrapping Up

You have just scanned a real Hugging Face Space for Arm64 readiness using Docker MCP Toolkit, the Arm MCP Server, and GitHub Copilot. What we found with ACE-Step v1.5 is representative of what you will find across Hugging Face: code that is architecture-neutral, a Dockerfile that is already clean, but a requirements.txt with hardcoded x86_64 wheel URLs that silently break Arm64 builds.

The MCP chain surfaces this in 15 minutes. Without it, you are staring at a pip error with no clear path to the cause.

Ready to try it? Open Docker Desktop and explore the MCP Catalog. Start with the Arm MCP Server, then add the GitHub, Sequential Thinking, and Hugging Face MCP servers. Point the chain at any Hugging Face Space you’re working with and see what comes back.

Learn More

Mastering Dynamic Components, HTTP Resources, and AI Writing Assistants 🛠️


Sometimes the best way to learn is to dive straight into the code! This week, we’re highlighting several hands-on repositories that demonstrate the latest Angular patterns and how to integrate Google Gemini for real-time user assistance.

Check out these essential code samples and templates.

Advanced Dynamic Component Creation

Antonio Cardenas @yeoudev provides a masterclass in using ViewContainerRef. This repository and StackBlitz demo show you exactly how to handle dynamic component instantiation in a clean, scalable way.

Angular Vibe Coding: The Ultimate CRUD Template

Need to start a project fast? Antonio Cardenas @yeoudev offers a “Vibe Coding” template that comes ready with basic CRUD functionality and helpful scripts to clean up boilerplate, letting you focus on your unique logic.

Mastering httpResource with Pirates

Deborah Kurata @deborahkurata brings her signature clarity to the new Signal-based httpResource. This fun “Pirates” example demonstrates how to fetch and manage data using the newest reactive primitives in Angular.

Build an AI-Powered Grammar Assistant

Ankit Sharma @ankitsharma_007 shows the power of the Google Gemini API in a lightweight Angular app. This project provides real-time grammar corrections as you type — a perfect blueprint for adding AI utility to your own editors.

Have you built a cool helper tool with Gemini or tried the new httpResource? Your code samples could be the missing piece for another developer!

Keep the community growing! Use #AngularSparkles and #AngularAI to share your latest GitHub repos and StackBlitz demos!


Mastering Dynamic Components, HTTP Resources, and AI Writing Assistants 🛠️ was originally published in Angular Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
