Today we’re announcing the 1.1.0 release of Foundry Local — Microsoft’s cross-platform local AI solution that lets developers bring AI directly into their applications with no cloud dependency, no network latency, and no per-token costs.
This release adds the following:
- Live audio transcription for real-time speech-to-text scenarios like captioning, voice UIs, and meeting transcription.
- Text embeddings for semantic search, RAG, clustering, and similarity matching use cases.
- Responses API support for structured agentic interactions, including tool calling and multimodal vision-language input.
- WebGPU execution provider plugin delivered separately to reduce the default package size for applications that don’t need it.
- Reduced JavaScript package size by replacing the koffi FFI layer with a custom Node-API C addon.
- Broader .NET compatibility by targeting lower framework versions in the C# SDK.
What’s new
Live Transcription API
Foundry Local now supports real-time speech-to-text streaming directly from a microphone — ideal for live captioning, voice-driven UIs, meeting transcription, and accessibility scenarios. The new Live Transcription API lets you push raw PCM audio chunks and receive transcription results as they arrive, with clear is_final markers distinguishing interim from finalized text.
The API is built around a simple session-based pattern available across all SDK language bindings (JavaScript, C#, Python, Rust):
- Load a streaming speech model from the catalog
- Create a live transcription session with audio settings (sample rate, channels, language)
- Start the session and begin appending audio data
- Consume transcription results via an async stream
Example usage
Throughout this article, examples use the Python SDK language binding; equivalent JavaScript, Rust, and C# versions of each example are available in the Foundry Local samples on GitHub.
"""
Live microphone transcription using Foundry Local.
This script loads a streaming speech model, captures audio from the
microphone via PyAudio, and prints transcription results in real time.
Requirements:
    pip install foundry-local-sdk pyaudio
"""
import threading
import pyaudio
from foundry_local_sdk import Configuration, FoundryLocalManager
# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------
config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance
# ---------------------------------------------------------------------------
# 2. Download and load the streaming speech model
# ---------------------------------------------------------------------------
model = manager.catalog.get_model("nemotron-speech-streaming-en-0.6b")
if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n Download complete.")
model.load()
# ---------------------------------------------------------------------------
# 3. Create a live transcription session
# ---------------------------------------------------------------------------
audio_client = model.get_audio_client()
session = audio_client.create_live_transcription_session()
session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "en"
session.start()
# ---------------------------------------------------------------------------
# 4. Read transcription results in a background thread
# ---------------------------------------------------------------------------
def read_results():
    for result in session.get_stream():
        text = result.content[0].text if result.content else ""
        if result.is_final:
            print(f"\n [FINAL] {text}")
        elif text:
            print(text, end="", flush=True)
read_thread = threading.Thread(target=read_results, daemon=True)
read_thread.start()
# ---------------------------------------------------------------------------
# 5. Capture microphone audio and feed it to the session
# ---------------------------------------------------------------------------
RATE, CHANNELS, CHUNK = 16000, 1, 480 # 30 ms frames
pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK,
)
print("Speak into your microphone. Press Ctrl+C to stop.\n")
try:
    while True:
        pcm_data = stream.read(CHUNK, exception_on_overflow=False)
        session.append(pcm_data)
except KeyboardInterrupt:
    print("\nStopping...")
# ---------------------------------------------------------------------------
# 6. Cleanup
# ---------------------------------------------------------------------------
stream.close()
pa.terminate()
session.stop()
read_thread.join(timeout=5)
model.unload()
Optimized for on-device streaming ASR
To identify the best model for real-time on-device transcription, we conducted a systematic empirical study across over 50 configurations spanning encoder-decoder, transducer, and LLM-based ASR architectures — including OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR — evaluated across batch, chunked, and streaming inference modes.
From this study, we identified NVIDIA’s Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implemented the complete streaming inference pipeline in ONNX Runtime and applied multiple post-training quantization strategies — including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization — combined with graph-level operator fusion. These optimizations reduced the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline.
Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56s algorithmic latency — establishing a new quality-efficiency Pareto point for on-device streaming ASR.
The model is available in the Foundry catalog as nemotron-speech-streaming-en-0.6b.
For the full methodology and benchmark results, see our paper: Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference (arXiv:2604.14493).
Embeddings API for semantic search scenarios
Foundry Local now supports text embedding generation across all four SDKs (C#, JavaScript, Python, and Rust). Embeddings unlock a wide range of local AI scenarios including semantic search, RAG (retrieval-augmented generation), clustering, and similarity matching — all running entirely on-device.
The Embeddings API supports both single and batch input, with configurable dimensions and encoding format. Responses follow the OpenAI embeddings format for seamless cloud-to-edge portability.
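To make the two call styles concrete, here is a minimal sketch of single and batch embedding generation with the Python SDK, assuming a model loaded as in the full example below:
# Minimal sketch: single vs. batch embedding calls.
# Assumes `model` is an embedding model loaded as shown in the full example below.
client = model.get_embedding_client()

# Single input: one vector back.
single = client.generate_embedding("What is semantic search?")
vector = single.data[0].embedding

# Batch input: one vector per document, in input order.
batch = client.generate_embeddings(["first document", "second document"])
vectors = [item.embedding for item in batch.data]

# Responses mirror the OpenAI embeddings format: a `data` list whose
# items each carry an `embedding` array.
print(len(vector), len(vectors))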
Example usage
The following example pairs Foundry Local embeddings with ChromaDB to build a fully local semantic search pipeline — documents are embedded and indexed in-memory, then natural-language queries are matched to the most relevant results.
"""
Semantic search using Foundry Local embeddings and ChromaDB.
This script loads an embedding model locally, indexes a set of documents
into an in-memory ChromaDB collection, and performs natural-language
semantic queries against them — all running on-device.
Requirements:
    pip install foundry-local-sdk chromadb
"""
import chromadb
from foundry_local_sdk import Configuration, FoundryLocalManager
# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------
config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance
# ---------------------------------------------------------------------------
# 2. Enable additional hardware acceleration for end users
# ---------------------------------------------------------------------------
manager.download_and_register_eps(
    progress_callback=lambda ep, progress: print(
        f"\r Downloading EP '{ep}': {progress:.1f}%", end="", flush=True
    )
)
print("\n EP registration complete.\n")
print("Available EPs:")
for ep in manager.discover_eps():
    print(f" {ep.name} (registered: {ep.is_registered})")
print()
# ---------------------------------------------------------------------------
# 3. Download and load an embedding model
# ---------------------------------------------------------------------------
model = manager.catalog.get_model("qwen3-embedding-0.6b")
if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n Download complete.")
model.load()
client = model.get_embedding_client()
# ---------------------------------------------------------------------------
# 4. Build a knowledge base
# ---------------------------------------------------------------------------
documents = [
    "Python is a high-level programming language known for its readability and versatility.",
    "Rust is a systems programming language focused on safety, speed, and concurrency.",
    "Machine learning is a subset of artificial intelligence that learns from data.",
    "The capital of France is Paris, known for the Eiffel Tower.",
    "Docker containers package applications with their dependencies for consistent deployment.",
    "PostgreSQL is a powerful open-source relational database system.",
    "Neural networks are computing systems inspired by biological brain structures.",
    "Kubernetes orchestrates containerized workloads across clusters of machines.",
    "The Python GIL limits true multi-threading for CPU-bound tasks.",
    "Vector databases store and search high-dimensional embeddings efficiently.",
]
print("Generating embeddings for knowledge base...")
batch_response = client.generate_embeddings(documents)
embeddings = [item.embedding for item in batch_response.data]
print(f"Indexed {len(embeddings)} documents ({len(embeddings[0])} dimensions each)")
# ---------------------------------------------------------------------------
# 5. Store embeddings in ChromaDB
# ---------------------------------------------------------------------------
chroma = chromadb.Client()
collection = chroma.create_collection(
    name="knowledge_base", metadata={"hnsw:space": "cosine"}
)
collection.add(
    ids=[f"doc-{i}" for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents,
)
# ---------------------------------------------------------------------------
# 6. Semantic search
# ---------------------------------------------------------------------------
queries = [
    "What programming language is good for beginners?",
    "How do I deploy applications in production?",
    "Tell me about AI and deep learning",
]
for query in queries:
    query_embedding = client.generate_embedding(query).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    print(f'\n🔍 Query: "{query}"')
    for doc, distance in zip(results["documents"][0], results["distances"][0]):
        print(f" [{1 - distance:.3f}] {doc}")
# ---------------------------------------------------------------------------
# 7. Cleanup
# ---------------------------------------------------------------------------
model.unload()
Responses API
Foundry Local now includes an Open Responses API client, bringing structured agentic AI capabilities to on-device inference. The Responses API provides a higher-level abstraction over chat completions with built-in support for:
- Streaming — token-by-token server-sent events
- Multi-turn conversations — chain responses with previous_response_id
- Tool calling — define function tools and handle tool call/result round-trips (a sketch follows this list)
- Vision — pass images alongside text input (model-dependent)
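To make the tool call/result round-trip concrete, here is a minimal sketch using the OpenAI SDK against the local web service, reusing the same client and model setup as the vision example below. The get_weather function and its schema are hypothetical placeholders, and the sketch assumes a loaded model that supports function tools:
import json

# Hypothetical local tool the model may choose to call.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny", "temp_c": 21}

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# First turn: the model decides whether to call the tool.
response = client.responses.create(
    model=model.id,
    input="What's the weather in Paris right now?",
    tools=tools,
)

# Second turn: run any requested tool calls and return the results,
# chaining the turns with previous_response_id.
for item in response.output:
    if item.type == "function_call":
        result = get_weather(**json.loads(item.arguments))
        followup = client.responses.create(
            model=model.id,
            previous_response_id=response.id,
            input=[{
                "type": "function_call_output",
                "call_id": item.call_id,
                "output": json.dumps(result),
            }],
            tools=tools,
        )
        print(followup.output_text)
Chaining with previous_response_id lets the service carry conversation state across turns instead of the client resending the full history.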
Example usage
With the Foundry Local 1.1 release we’ve also added Qwen3 VL to the model catalog — a natively multimodal vision-language model that can reason over images and text together. Smaller variants, such as the 2B model used below, are optimized for on-device inference, making it practical to run vision tasks locally without cloud dependencies.
This enables scenarios like document understanding, diagram analysis, UI screenshot interpretation, and visual question answering — all running entirely on-device. For example, the following code streams a description of an image from Qwen3 VL using the Responses API:
"""
Image description using Foundry Local and the OpenAI Responses API.
This script loads a vision-language model locally via Foundry Local,
starts the built-in web service, and uses the OpenAI SDK's Responses API
to stream a description of an image — all running on-device.
Requirements:
    pip install foundry-local-sdk openai Pillow
"""
import base64
import io
import urllib.request
from PIL import Image
from openai import OpenAI
from foundry_local_sdk import Configuration, FoundryLocalManager
# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------
config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance
# ---------------------------------------------------------------------------
# 2. Enable additional hardware acceleration for end users
# ---------------------------------------------------------------------------
manager.download_and_register_eps(
    progress_callback=lambda ep, progress: print(
        f"\r Downloading EP '{ep}': {progress:.1f}%", end="", flush=True
    )
)
print("\n EP registration complete.\n")
print("Available EPs:")
for ep in manager.discover_eps():
    print(f" {ep.name} (registered: {ep.is_registered})")
print()
# ---------------------------------------------------------------------------
# 3. Download and load the vision model
# ---------------------------------------------------------------------------
model = manager.catalog.get_model("qwen3-vl-2b-instruct")
if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n Download complete.")
print("Loading model...")
model.load()
print("Model ready.\n")
# ---------------------------------------------------------------------------
# 4. Start the Foundry Local web service
# ---------------------------------------------------------------------------
manager.start_web_service()
base_url = manager.urls[0].rstrip("/") + "/v1"
client = OpenAI(base_url=base_url, api_key="notneeded")
# ---------------------------------------------------------------------------
# 5. Prepare the image
# ---------------------------------------------------------------------------
image_url = (
    "https://github.com/microsoft/Foundry-Local/blob/main/"
    "samples/python/web-server-responses-vision/src/test_image.jpg?raw=true"
)
print(f"Fetching image: {image_url}")
with urllib.request.urlopen(image_url) as resp:
    img = Image.open(io.BytesIO(resp.read()))
img.thumbnail((512, 512))
buf = io.BytesIO()
img.save(buf, format="JPEG")
image_b64 = base64.b64encode(buf.getvalue()).decode()
# ---------------------------------------------------------------------------
# 6. Call the Responses API with vision input
# ---------------------------------------------------------------------------
vision_input = [
    {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe what you see in this image."},
            {
                "type": "input_image",
                "image_data": image_b64,
                "media_type": "image/jpeg",
            },
        ],
    }
]
print("Streaming response:\n")
stream = client.responses.create(
    model=model.id,
    input="placeholder",  # overridden by the structured payload in extra_body
    extra_body={"input": vision_input},
    stream=True,
)
for event in stream:
    if getattr(event, "type", None) == "response.output_text.delta":
        print(getattr(event, "delta", ""), end="", flush=True)
print("\n")
# ---------------------------------------------------------------------------
# 7. Cleanup
# ---------------------------------------------------------------------------
client.close()
manager.stop_web_service()
model.unload()
print("Done.")
WebGPU Execution Provider Plugin
The WebGPU execution provider is now delivered as a separate plugin rather than being bundled with the Windows ONNX Runtime package. This change reduces the default package size for applications that don’t need WebGPU, while keeping it available as an on-demand plugin for scenarios that require it. The plugin is automatically acquired via the standard execution provider download mechanism — no changes are needed in your application code.
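As a quick sanity check, you can confirm the plugin was acquired using the same EP discovery calls shown in the examples above. This is a minimal sketch assuming an initialized manager; the exact name the WebGPU provider reports under is an assumption here:
# Acquire any execution provider plugins available for this machine,
# including WebGPU where applicable.
manager.download_and_register_eps()

# List what was registered and look for a WebGPU entry.
# NOTE: the exact EP name string is an assumption, not a documented value.
for ep in manager.discover_eps():
    print(f"{ep.name} (registered: {ep.is_registered})")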
.NET SDK: Broader Compatibility
The C# SDK packages now target lower framework versions, broadening compatibility for applications that haven’t yet upgraded to the latest .NET runtime:
Microsoft.AI.Foundry.Local now targets netstandard2.0 (previously net9.0) — compatible with .NET Framework 4.6.1+, .NET Core 2.0+, Mono, Xamarin, and Unity. This makes it straightforward to add local AI capabilities to existing .NET applications regardless of runtime version.
Microsoft.AI.Foundry.Local.WinML now targets net8.0 (previously net9.0) — providing Windows hardware acceleration via GPU/NPU execution providers while maintaining broad compatibility across modern .NET LTS runtimes.
Reduced JavaScript package size
The JavaScript SDK’s native interop layer has been rewritten from koffi (a runtime FFI library) to a purpose-built Node-API C addon with prebuilt binaries shipped per platform. This removes the large koffi dependency from the package while keeping the SDK’s public API surface unchanged.
The benefits include:
- ~27 MB smaller install footprint — eliminates the koffi transitive dependency tree
- Faster load times — prebuilt .node binaries load directly without runtime FFI setup
- Better stability — Node-API provides a stable ABI across Node.js versions, avoiding breakage on engine upgrades
- No native compilation required — prebuilt addons for each platform (Windows, macOS, Linux) ship with the npm package, so npm install just works without a C toolchain
Get Started
Update to Foundry Local 1.1.0 by installing the latest SDK for your language:
# Python
pip install foundry-local-sdk --upgrade # macOS/Linux
pip install foundry-local-sdk-winml --upgrade # Windows
# JavaScript
npm install foundry-local-sdk@latest # macOS/Linux
npm install foundry-local-sdk-winml@latest # Windows
# C#
dotnet add package Microsoft.AI.Foundry.Local # macOS/Linux
dotnet add package Microsoft.AI.Foundry.Local.WinML # Windows
# Rust
cargo add foundry-local-sdk # macOS/Linux
cargo add foundry-local-sdk --features winml # Windows
What’s coming in the next release
- C++ language binding — already available for early testing and feedback in the Foundry Local GitHub repo.
- Smaller package size — further reductions to the core runtime footprint.
- Audio enhancements — word- and segment-level timestamps, and additional language support for live transcription.
Learn more
See the Foundry Local samples on GitHub for complete, runnable versions of every example in this post in all four SDK languages.