Why edge AI development is still hard
AI is no longer confined to cloud experiments. Developers are increasingly expected to deliver AI inside apps, devices, and edge systems where responsiveness, privacy, resilience, and local control are essential. But building those experiences for production is still difficult.
Teams often have to solve model packaging, runtime fragmentation, hardware differences, and deployment complexity before they can ship a single reliable feature. That slows iteration and makes it harder to move from prototype to product.
At Microsoft Build 2026, we’re announcing updates across Foundry Local and Foundry Local on Azure Local that help developers build once and run AI closer to where data is created and decisions are made. These updates expand platform support, improve control over inference and acceleration, add new on-device APIs, and simplify deployment across disconnected, regulated, and sovereign environments.
What’s new in Foundry Local
The latest Foundry Local updates focus on the areas developers care about most: broader platform reach, familiar APIs, better runtime control, and simpler access to hardware acceleration. Together, these improvements help teams move faster from experimentation to production on AI PCs, edge devices, and enterprise infrastructure.
Foundry Local
Last month we announced the 1.1.0 release of Foundry Local (Foundry Local 1.1: Live Transcription, Embeddings, and Responses API | Microsoft Foundry Blog), Microsoft’s cross-platform local AI solution that lets developers bring AI directly into their applications with no cloud dependency, no network latency, and no per-token costs.
The 1.1.0 release added:
- Live audio transcription for real-time speech-to-text scenarios like captioning, voice UIs, and meeting transcription.
- Text embeddings for semantic search, RAG, clustering, and similarity matching use cases.
- Responses API support for structured agentic interactions, including tool calling and multimodal vision-language input.
- WebGPU execution provider plugin delivered separately to reduce the default package size for applications that don’t need it.
- Reduced JavaScript package size by replacing the koffi FFI layer with a custom Node-API C addon.
- Broader .NET compatibility by targeting lower framework versions in the C# SDK.
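To make the semantic search and similarity matching use cases concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names are hypothetical rather than part of the Foundry Local SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy vectors standing in for embeddings of three documents (values invented).
docs = {
    "invoice": [0.9, 0.1, 0.0],
    "recipe": [0.0, 0.2, 0.9],
    "receipt": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
ranking = rank_by_similarity(query, docs)
print(ranking)
```

In a real application the vectors would come from an embedding model, and the ranking step stays the same.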
Today we are announcing the 1.2.0 release of Foundry Local, which expands language support in the Live Transcription API, broadens device support on Linux, improves cancellation and execution provider workflows, adds new on-device API options, and strengthens the Windows acceleration story with Windows ML (WinML) 2.0.
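What’s new in 1.2.0
- Expanded language support in the Live Transcription API: the Python snippet below runs a live transcription session with a multilingual streaming ASR model and the language set to "auto".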
from foundry_local_sdk import Configuration, FoundryLocalManager
config = Configuration(app_name="my_app")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance
model = manager.catalog.get_model(
"nvidia-nemotron-3.5-asr-streaming-multilingual-0.6b"
)
model.download()
model.load()
session = model.get_audio_client().create_live_transcription_session()
session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "auto" # or "de", "zh-CN", "en", ...
session.start()
session.append(pcm_bytes) # push audio chunks from a mic/file
for result in session.get_stream():
    print(result.content[0].text) # clean text, inline language tags stripped
session.stop()
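The snippet above pushes audio with session.append(pcm_bytes) but does not show where the chunks come from. One possible way to produce them from a WAV file, using only the Python standard library, is sketched below. The helper name iter_pcm_chunks, the 100 ms chunk size, and the 16-bit sample width are assumptions for illustration; the source does not state the PCM format the session expects.

```python
import io
import wave


def iter_pcm_chunks(wav_file, chunk_ms=100, expected_rate=16000, expected_channels=1):
    """Yield raw PCM byte chunks of about chunk_ms milliseconds from a WAV file."""
    with wave.open(wav_file, "rb") as wav:
        # Reject audio that does not match the sample_rate / channels set on the session.
        if wav.getframerate() != expected_rate or wav.getnchannels() != expected_channels:
            raise ValueError("audio must match the session's sample_rate and channels")
        frames_per_chunk = expected_rate * chunk_ms // 1000
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


# Build one second of 16 kHz mono 16-bit silence in memory to exercise the helper.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)
buf.seek(0)

chunks = list(iter_pcm_chunks(buf))
print(len(chunks), len(chunks[0]))
```

Each yielded chunk could then be passed to session.append(chunk) in place of the pcm_bytes placeholder.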
- Faster model downloads via cross-region catalog: Foundry Local now fronts the model catalog with Azure Traffic Manager, routing each user to the best-performing region, so end users see noticeably faster first-run model downloads. No code changes required — developers just need to bump to the v1.2.0 SDK.
- Download and EP cancellation across all 5 SDKs: Cancel model and execution-provider downloads from C#, Python, JavaScript, Rust, and C++ using each language’s native cancellation pattern. Try out: https://github.com/microsoft/Foundry-Local/blob/main/README.md
- Inference cancellation: Cancel in-flight chat completions and transcription sessions cleanly when users move on, without wasted compute or orphaned streams. Try out: https://github.com/microsoft/Foundry-Local/blob/main/README.md
- Per-EP download progress in Python: Surface per-provider download progress in Python instead of a generic spinner. Try out: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
- Upgraded to Windows ML (WinML) 2.0: The Foundry Local WinML packages now ship with the latest WinML 2.0, removing the previous Windows App SDK runtime dependency and bootstrap step so Python, JavaScript, Rust, and C++ apps get NPU and GPU acceleration with no extra installation or initialization code. Try out: https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview
- WebGPU execution provider for WinML: Expand GPU acceleration coverage across more Windows hardware with the new WebGPU execution provider for the WinML SDK. Try out: https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview
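The cancellation and per-provider progress items above follow a familiar pattern: a long-running operation reports progress through a callback and checks a cancellation signal between units of work. The sketch below illustrates that pattern in plain Python with threading.Event. It is not the Foundry Local SDK API; simulated_download, DownloadCancelled, and the "example-ep" name are hypothetical, and the SDK's actual Python cancellation mechanism is documented in its repository.

```python
import threading


class DownloadCancelled(Exception):
    """Raised when the caller cancels before the download finishes."""


def simulated_download(name, total_chunks, on_progress, cancel_event):
    """Stand-in for a long download: checks for cancellation, then reports progress."""
    for done in range(1, total_chunks + 1):
        if cancel_event.is_set():
            raise DownloadCancelled(name)
        on_progress(name, 100 * done // total_chunks)
    return name


progress_log = []
cancel = threading.Event()


def on_progress(name, percent):
    # Record per-provider progress instead of showing a generic spinner.
    progress_log.append((name, percent))
    # Simulate the user cancelling once this provider reaches 50%.
    if percent >= 50:
        cancel.set()


try:
    simulated_download("example-ep", 4, on_progress, cancel)
    outcome = "completed"
except DownloadCancelled:
    outcome = "cancelled"

print(outcome, progress_log)
```

Checking the signal between chunks keeps cancellation cooperative: the operation stops at a clean boundary rather than being killed mid-write.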
Foundry Local in action: voice input in GitHub Copilot CLI
The GitHub Copilot CLI’s voice input is built on Foundry Local. When you dictate a prompt in the terminal, audio is captured from your mic, streamed into a Foundry Local live transcription session running the Nemotron ASR Streaming model, and the partial + final results are piped straight into the CLI’s input buffer — all on-device, no cloud hop, no audio leaving the machine.
To enable it, run /voice on, then hold Space to speak into the Copilot CLI (or press Ctrl+k v to toggle).

There is no private API or custom integration here. The CLI uses the same create_live_transcription_session() entry point shown in the snippet above, with the same sample_rate / channels / language="auto" settings, the same append(pcm_bytes) push model, and the same get_stream() iterator. Cancellation when you hit Esc mid-utterance uses the new 1.2.0 inference cancellation path. If you have the Copilot CLI installed, run a few prompts with voice and look at:
- End-to-end latency from speech to token: that’s your floor for what a streaming-ASR UX feels like on the user’s hardware.
- Quality: in our internal testing, the model delivers roughly 8% word error rate (WER).
- Resource usage while transcribing: the model uses a low single-digit percentage of CPU.
If the behavior works for your use case, you can reproduce it in your own app in a few lines using any of the five SDKs — no extra services to stand up, no per-minute transcription bill.
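If you want to check the quality figure against your own recordings, word error rate is straightforward to compute: it is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal implementation with a made-up sentence pair; it does no case or punctuation normalization, which real evaluations usually apply first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word in a twelve-word utterance.
reference = "schedule a meeting with the design team for next tuesday at noon"
hypothesis = "schedule a meeting with the design team for next thursday at noon"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.1%}")
```

One wrong word in twelve works out to about 8.3%, which gives a feel for what a WER in that range means in practice.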
How developers are using Foundry Local
Foundry Local is already being used across privacy-sensitive, performance-sensitive, and hardware-diverse scenarios. From local assistants and document workflows to multimodal context collection and enterprise AI pipelines, developers are using it to reduce platform complexity and deliver production-ready AI experiences faster.
Privacy-first and secure local AI
Across consumer apps and enterprise workflows, developers are using Foundry Local to keep sensitive data closer to the device while delivering faster, more responsive AI experiences.
Foxit PDF Editor AI Assistant
Foxit uses Foundry Local to bring secure, local AI into document workflows such as question answering, summarization, translation, and document understanding. The result is a more practical path to on-device AI that helps keep sensitive information closer to the user while simplifying deployment at scale.
“Foundry Local gives us a practical way to bring powerful AI experiences directly into PDF workflows while keeping sensitive data closer to the user. Just as importantly, its managed local model approach helps simplify deployment, improve reliability, and reduce the operational burden of delivering on-device AI at scale.” – Queena Wei, SVP of Product at Foxit
Raycast
Raycast uses Foundry Local to make privacy-first, on-device AI more accessible to end users. By simplifying model discovery and local interaction, it helps bring local AI into everyday workflows with less friction.
“The integration of Foundry Local into Raycast gives our users the perfect option for privacy-first local AI. With it, they can easily leverage a variety of powerful models optimized for their Windows devices. Foundry Local made it super easy for us to implement the first step, a platform to browse and install models and a quick chat interface to use them, no internet required.” – Thomas Paul Mann, CEO & Founder at Raycast
Rakuten
Rakuten uses Foundry Local to bring responsive, privacy-sensitive AI experiences directly onto the device while balancing local responsiveness with broader cloud-connected capabilities. The result is a hybrid experience that feels more natural to end users while improving efficiency behind the scenes.
“Through our partnership with HP, Rakuten AI for Desktop uses Foundry Local to bring AI closer to the user — running responsive, privacy-sensitive experiences directly on the device while reducing cloud inference costs. Combined with Rakuten AI’s cloud intelligence and ecosystem integrations, this enables a hybrid AI experience that feels native to the desktop and scales efficiently for more advanced tasks.” – Vasanth Raju, Head of AI Product at Rakuten Group
PhonePe
PhonePe uses Foundry Local to power AI-driven transaction insights in its digital payments app with strong data protection. This helps deliver more responsive, privacy-conscious AI experiences without requiring personal financial information to leave the device.
Liquid AI’s ShieldFlow
ShieldFlow is an on-device privacy layer that redacts sensitive data and blocks prompt injection before any prompt leaves the device. Through Foundry Local, ShieldFlow runs efficiently on CPUs on every Windows device, including AI PCs, and enterprises can pull customized Liquid Foundation Models (LFMs) tuned to their own policies and roll them out across their Windows fleet through a single managed runtime.
Hardware portability and cross-device optimization
For teams building across different chips and execution environments, Foundry Local helps reduce hardware-specific complexity and accelerate deployment across devices.
Cephable
Cephable is a private AI assistant that runs entirely on device, enabling voice control, dictation, content generation, and task automation across apps. With Foundry Local, Cephable’s AI features run faster, support more models across NPU, GPU, and CPU, and let the team focus on building the assistant instead of managing silicon-specific optimizations.
“Since shifting from our custom inferencing implementation to Foundry Local, our engineers have been able to ship core features faster. We’re saving dozens of hours on optimizing models and managing build pipelines to handle the right acceleration in the right version of our app package. This directly leads to a better user experience and more choice for our users.” – Cordellia Yokum, Director and Principal Architect at Cephable
FlowyAIPC
FlowyAIPC builds an intelligent assistant for the era of heterogeneous AIPC silicon. FlowyAIPC integrates Foundry Local and Windows ML to solve the fundamental challenge of model-hardware decoupling across Intel, AMD, Qualcomm, and NVIDIA chips spanning CPU, NPU, iGPU, and dGPU.
“By leveraging Foundry Local’s automatic hardware detection and execution-provider abstraction, FlowyAIPC dynamically routes AI workloads to the optimal compute unit without user intervention: lightweight inference and sustained background tasks tap the NPU for power efficiency, while demanding generative workloads seamlessly spill to the GPU or CPU.” – Guoliang QI, CEO at StarwaveAI
AnythingLLM
AnythingLLM is a local-first, zero-configuration AI desktop application that allows enterprises to run LLMs completely on-device. Instead of maintaining separate runtimes for each hardware configuration, AnythingLLM uses Foundry Local to deliver on-device AI across a broad range of silicon platforms.
“With the rapid pace of AI software, maintaining custom runtimes for every specialized NPU and hardware configuration on the market creates a massive development bottleneck. The Foundry Local SDK helps us solve this by providing optimized, hardware-level, vendor agnostic performance out of the box, allowing us to deliver a consistent and secure local AI experience to our Windows users globally without the engineering overhead.” – Timothy Carambat, Founder & CEO at AnythingLLM
LUCI Desktop by Memories.ai
Memories.ai uses Foundry Local to run multimodal models efficiently across Qualcomm, Intel, and AMD devices in LUCI Desktop, which provides an on-device context layer for PCs. That portability helps the team scale on-device research and multimodal workflows without extensive per-chip optimization.
“Foundry Local SDK took the silicon-portability problem off our plate — one SDK, simple APIs, and our multimodal models run efficiently across Qualcomm, Intel, and AMD without weeks of per-chip optimization. It lets us scale our on-device research globally on day one and keeps our team focused on the harder problems above the silicon layer.” – Shawn Shen, CEO at Memories.ai
Model HQ by LLMWare
Model HQ enables enterprise teams to build and run RAG pipelines and multi-step agents locally on AI PCs and private servers using a no-code interface. By integrating Foundry Local, Model HQ delivers fast, offline-capable AI experiences directly on Windows devices built on chips from AMD, Intel, Qualcomm, and NVIDIA.
“The Foundry Local SDK made it incredibly easy for us to integrate NPU-optimized local AI models directly into Model HQ and rapidly deliver high-performance on-device NPU inferencing with minimal engineering overhead. It has significantly accelerated our ability to fully leverage emerging NPU compute capabilities for fast, efficient, and power-optimized local AI experiences.” – Darren Oberst, Co-Founder at LLMWare
Taken together, these customer stories show what Foundry Local means for developers in practice: fewer runtime and hardware-specific hurdles, faster paths from prototype to production, and more control over how AI runs on real devices. Whether you’re building privacy-sensitive apps, deploying across diverse silicon, or operationalizing local RAG and agent workflows, Foundry Local helps you spend less time stitching infrastructure together and more time shipping experiences that work.
Foundry Local on Azure Local
At Build, we’re also introducing Foundry Local on Azure Local in preview: a new on-premises AI platform for running models, agents, and tools at enterprise scale.
Designed for organizations that seek control, compliance, and low-latency execution, Foundry Local on Azure Local runs as containerized Kubernetes workloads on Azure Local and is orchestrated through Azure Arc. It helps teams deploy consistently across edge, hybrid, and fully disconnected environments while keeping AI close to the data and operations that depend on it.
Register to get access to Foundry Local on Azure Local preview: https://aka.ms/FoundryLocalAzure_PreviewRequest
Here are some of the key preview capabilities announced today:
- Custom MCP tools – Extend agents with custom tool servers using the Model Context Protocol (MCP) standard.
- GitHub Enterprise Local – Build and deploy AI apps end to end on-premises with local repos, CI/CD pipelines, and integrated security scanning. https://aka.ms/GHEL
- Azure Local for small form factor devices – Extend Azure Local to industrial PCs and ruggedized devices for manufacturing and retail edge deployments, with turnkey AI inference and Azure Arc-based device management. https://aka.ms/AzureSFF
- Watch the demo – aka.ms/AzureSFFLaunchDemo
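For readers new to MCP, the sketch below shows roughly what a custom tool looks like at the message level. MCP is built on JSON-RPC 2.0, and a client discovers and invokes tools with the tools/list and tools/call methods. This is a simplified illustration of those two message shapes as I understand them from the MCP specification: it omits the initialization handshake and the transport, the get_line_status tool and its canned reply are invented, and a production tool server would normally be built with an MCP SDK. How tool servers are registered with Foundry Local on Azure Local is not covered here.

```python
import json

# Hypothetical tool definition: name, description, and schema are invented for illustration.
TOOLS = {
    "get_line_status": {
        "description": "Return the status of a production line.",
        "inputSchema": {
            "type": "object",
            "properties": {"line_id": {"type": "string"}},
            "required": ["line_id"],
        },
    }
}


def get_line_status(line_id):
    """Canned reply standing in for a real lookup against plant systems."""
    return f"Line {line_id}: running"


def _reply(request, key, payload):
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), key: payload})


def handle_request(raw):
    """Answer a single JSON-RPC request for tools/list or tools/call."""
    request = json.loads(raw)
    method = request.get("method")
    if method == "tools/list":
        tools = [{"name": name, **spec} for name, spec in TOOLS.items()]
        return _reply(request, "result", {"tools": tools})
    if method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "get_line_status":
            return _reply(request, "error", {"code": -32602, "message": "Unknown tool"})
        text = get_line_status(**params.get("arguments", {}))
        return _reply(request, "result", {"content": [{"type": "text", "text": text}], "isError": False})
    return _reply(request, "error", {"code": -32601, "message": "Method not found"})


call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "get_line_status", "arguments": {"line_id": "A7"}}}
response = json.loads(handle_request(json.dumps(call)))
print(response["result"]["content"][0]["text"])
```

Declaring an input schema per tool lets the agent validate arguments before it calls the tool.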
Early momentum is already visible across sovereign, industrial, and disconnected scenarios where organizations need AI to run reliably under strict operational and compliance constraints.
“In energy operations, AI needs to run where the work happens – at remote facilities, offshore platforms, and field locations where connectivity is often limited, and safety is paramount. Foundry Local on Azure Local gives us a path to bring AI-driven decision-making closer to our operational data, with the governance our industry demands. The ability to deploy and run AI workloads consistently across edge and field environments, even when disconnected, is critical as we advance Chevron’s vision for autonomous and intelligent operations.” – Ed Moore, OT Strategist and Distinguished Engineer at Chevron
Together, these capabilities help organizations support both sovereign AI requirements, such as data control and compliance, and industrial edge scenarios that depend on real-time, localized execution.
Get started
If you want to start building with Foundry Local, begin with the documentation and Edge AI for Beginners, explore the available samples, and test local inference in your own application workflow. From there, you can evaluate the right model, runtime, and hardware path for your scenario, whether you’re building for AI PCs, enterprise apps, edge devices, or disconnected environments.
If you’re following Microsoft Build 2026, these related sessions can help you go deeper into the announcements and developer scenarios supported by these releases:
The post Accelerate Edge AI Development with Foundry Local appeared first on Microsoft Foundry Blog.