
Foundry Local 1.1: Live Transcription, Embeddings, and Responses API


Today we’re announcing the 1.1.0 release of Foundry Local — Microsoft’s cross-platform local AI solution that lets developers bring AI directly into their applications with no cloud dependency, no network latency, and no per-token costs.

This release adds the following:

  • Live audio transcription for real-time speech-to-text scenarios like captioning, voice UIs, and meeting transcription.
  • Text embeddings for semantic search, RAG, clustering, and similarity matching use cases.
  • Responses API support for structured agentic interactions, including tool calling and multimodal vision-language input.
  • WebGPU execution provider plugin delivered separately to reduce the default package size for applications that don’t need it.
  • Reduced JavaScript package size by replacing the koffi FFI layer with a custom Node-API C addon.
  • Broader .NET compatibility by targeting lower framework versions in the C# SDK.

What’s new

Live Transcription API

Foundry Local now supports real-time speech-to-text streaming directly from a microphone — ideal for live captioning, voice-driven UIs, meeting transcription, and accessibility scenarios. The new Live Transcription API lets you push raw PCM audio chunks and receive transcription results as they arrive, with clear is_final markers distinguishing interim from finalized text.

The API is built around a simple session-based pattern available across all SDK language bindings (JavaScript, C#, Python, Rust):

  1. Load a streaming speech model from the catalog
  2. Create a live transcription session with audio settings (sample rate, channels, language)
  3. Start the session and begin appending audio data
  4. Consume transcription results via an async stream

Example usage

Throughout this article, the examples use the Python SDK language binding; equivalent JavaScript, Rust, and C# versions of each example are also available. See the Foundry Local samples on GitHub.

"""
Live microphone transcription using Foundry Local.

This script loads a streaming speech model, captures audio from the
microphone via PyAudio, and prints transcription results in real time.

Requirements:
    pip install foundry-local-sdk pyaudio
"""

import threading
import pyaudio
from foundry_local_sdk import Configuration, FoundryLocalManager

# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# ---------------------------------------------------------------------------
# 2. Download and load the streaming speech model
# ---------------------------------------------------------------------------

model = manager.catalog.get_model("nemotron-speech-streaming-en-0.6b")

if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r  Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n  Download complete.")

model.load()

# ---------------------------------------------------------------------------
# 3. Create a live transcription session
# ---------------------------------------------------------------------------

audio_client = model.get_audio_client()
session = audio_client.create_live_transcription_session()
session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "en"

session.start()

# ---------------------------------------------------------------------------
# 4. Read transcription results in a background thread
# ---------------------------------------------------------------------------

def read_results():
    for result in session.get_stream():
        text = result.content[0].text if result.content else ""
        if result.is_final:
            print(f"\n  [FINAL] {text}")
        elif text:
            print(text, end="", flush=True)

read_thread = threading.Thread(target=read_results, daemon=True)
read_thread.start()

# ---------------------------------------------------------------------------
# 5. Capture microphone audio and feed it to the session
# ---------------------------------------------------------------------------

RATE, CHANNELS, CHUNK = 16000, 1, 480  # 30 ms frames
pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK,
)

print("Speak into your microphone. Press Ctrl+C to stop.\n")
try:
    while True:
        pcm_data = stream.read(CHUNK, exception_on_overflow=False)
        session.append(pcm_data)
except KeyboardInterrupt:
    print("\nStopping...")

# ---------------------------------------------------------------------------
# 6. Cleanup
# ---------------------------------------------------------------------------

stream.close()
pa.terminate()
session.stop()
read_thread.join(timeout=5)
model.unload()

Optimized for on-device streaming ASR

To identify the best model for real-time on-device transcription, we conducted a systematic empirical study across over 50 configurations spanning encoder-decoder, transducer, and LLM-based ASR architectures — including OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR — evaluated across batch, chunked, and streaming inference modes.

From this study, we identified NVIDIA’s Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implemented the complete streaming inference pipeline in ONNX Runtime and applied multiple post-training quantization strategies — including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization — combined with graph-level operator fusion. These optimizations reduced the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline.

Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56s algorithmic latency — establishing a new quality-efficiency Pareto point for on-device streaming ASR.

The model is available in the Foundry catalog as nemotron-speech-streaming-en-0.6b.

For the full methodology and benchmark results, see our paper: Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference (arXiv:2604.14493).

Embeddings API for semantic search scenarios

Foundry Local now supports text embedding generation across all four SDKs (C#, JavaScript, Python, and Rust). Embeddings unlock a wide range of local AI scenarios including semantic search, RAG (retrieval-augmented generation), clustering, and similarity matching — all running entirely on-device.

The Embeddings API supports both single and batch input, with configurable dimensions and encoding format. Responses follow the OpenAI embeddings format for seamless cloud-to-edge portability.

Example usage

The following example pairs Foundry Local embeddings with ChromaDB to build a fully local semantic search pipeline — documents are embedded and indexed in-memory, then natural-language queries are matched to the most relevant results.

"""
Semantic search using Foundry Local embeddings and ChromaDB.

This script loads an embedding model locally, indexes a set of documents
into an in-memory ChromaDB collection, and performs natural-language
semantic queries against them — all running on-device.

Requirements:
    pip install foundry-local-sdk chromadb
"""

import chromadb

from foundry_local_sdk import Configuration, FoundryLocalManager

# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# ---------------------------------------------------------------------------
# 2. Enable additional hardware acceleration for end users
# ---------------------------------------------------------------------------

manager.download_and_register_eps(
    progress_callback=lambda ep, progress: print(
        f"\r  Downloading EP '{ep}': {progress:.1f}%", end="", flush=True
    )
)
print("\n  EP registration complete.\n")

print("Available EPs:")
for ep in manager.discover_eps():
    print(f"  {ep.name} (registered: {ep.is_registered})")
print()

# ---------------------------------------------------------------------------
# 3. Download and load an embedding model
# ---------------------------------------------------------------------------

model = manager.catalog.get_model("qwen3-embedding-0.6b")

if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r  Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n  Download complete.")

model.load()

client = model.get_embedding_client()

# ---------------------------------------------------------------------------
# 4. Build a knowledge base
# ---------------------------------------------------------------------------

documents = [
    "Python is a high-level programming language known for its readability and versatility.",
    "Rust is a systems programming language focused on safety, speed, and concurrency.",
    "Machine learning is a subset of artificial intelligence that learns from data.",
    "The capital of France is Paris, known for the Eiffel Tower.",
    "Docker containers package applications with their dependencies for consistent deployment.",
    "PostgreSQL is a powerful open-source relational database system.",
    "Neural networks are computing systems inspired by biological brain structures.",
    "Kubernetes orchestrates containerized workloads across clusters of machines.",
    "The Python GIL limits true multi-threading for CPU-bound tasks.",
    "Vector databases store and search high-dimensional embeddings efficiently.",
]

print("Generating embeddings for knowledge base...")
batch_response = client.generate_embeddings(documents)
embeddings = [item.embedding for item in batch_response.data]
print(f"Indexed {len(embeddings)} documents ({len(embeddings[0])} dimensions each)")

# ---------------------------------------------------------------------------
# 5. Store embeddings in ChromaDB
# ---------------------------------------------------------------------------

chroma = chromadb.Client()
collection = chroma.create_collection(
    name="knowledge_base", metadata={"hnsw:space": "cosine"}
)
collection.add(
    ids=[f"doc-{i}" for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents,
)

# ---------------------------------------------------------------------------
# 6. Semantic search
# ---------------------------------------------------------------------------

queries = [
    "What programming language is good for beginners?",
    "How do I deploy applications in production?",
    "Tell me about AI and deep learning",
]

for query in queries:
    query_embedding = client.generate_embedding(query).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=3)

    print(f'\n🔍 Query: "{query}"')
    for doc, distance in zip(results["documents"][0], results["distances"][0]):
        print(f"   [{1 - distance:.3f}] {doc}")

# ---------------------------------------------------------------------------
# 7. Cleanup
# ---------------------------------------------------------------------------

model.unload()
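
Because responses follow the OpenAI embeddings format, the same model can also be served through the Foundry Local web service and queried with the standard OpenAI client. The following is a minimal sketch, not part of the official samples, combining the SDK surface shown above with the web-service pattern from the Responses example later in this article:

from openai import OpenAI

from foundry_local_sdk import Configuration, FoundryLocalManager

# Initialize Foundry Local and load the embedding model, as above.
config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

model = manager.catalog.get_model("qwen3-embedding-0.6b")
if not model.is_cached:
    model.download(lambda progress: None)
model.load()

# Expose the OpenAI-compatible endpoint and point the OpenAI SDK at it.
manager.start_web_service()
base_url = manager.urls[0].rstrip("/") + "/v1"
client = OpenAI(base_url=base_url, api_key="notneeded")

response = client.embeddings.create(
    model=model.id,
    input=["Foundry Local generates embeddings entirely on-device."],
)
print(f"{len(response.data[0].embedding)} dimensions")

manager.stop_web_service()
model.unload()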

Responses API

Foundry Local now includes an Open Responses API client, bringing structured agentic AI capabilities to on-device inference. The Responses API provides a higher-level abstraction over chat completions with built-in support for:

  • Streaming — token-by-token server-sent events
  • Multi-turn conversations — chain responses with previous_response_id (see the sketch after this list)
  • Tool calling — define function tools and handle tool call/result round-trips
  • Vision — pass images alongside text input (model-dependent)
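
As a quick illustration of multi-turn chaining, here is a minimal sketch that continues a conversation by passing previous_response_id. It assumes client is an OpenAI client pointed at the Foundry Local endpoint and model is a loaded catalog model, exactly as set up in the full example below:

# Minimal multi-turn sketch (not from the official samples). Assumes
# "client" is an OpenAI client pointed at the local /v1 endpoint and
# "model" is a loaded catalog model, as set up in the full example below.
first = client.responses.create(
    model=model.id,
    input="Summarize what Foundry Local does in one sentence.",
)
print(first.output_text)

# Chain the follow-up turn to the previous response by ID.
follow_up = client.responses.create(
    model=model.id,
    input="Now list two scenarios where running locally matters.",
    previous_response_id=first.id,
)
print(follow_up.output_text)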

Example usage

With the Foundry Local 1.1 release we’ve also added Qwen3.5 VLM to the model catalog — a natively multimodal vision-language model that can reason over images and text together. Smaller variants (3B, 7B) are optimized for on-device inference, making it practical to run vision tasks locally without cloud dependencies.

This enables scenarios like document understanding, diagram analysis, UI screenshot interpretation, and visual question answering — all running entirely on-device. For example, the following code streams a description of an image from the Qwen3.5 VLM using the Responses API:

"""
Image description using Foundry Local and the OpenAI Responses API.

This script loads a vision-language model locally via Foundry Local,
starts the built-in web service, and uses the OpenAI SDK's Responses API
to stream a description of an image — all running on-device.

Requirements:
    pip install foundry-local-sdk openai Pillow
"""

import base64
import io
import urllib.request

from PIL import Image
from openai import OpenAI

from foundry_local_sdk import Configuration, FoundryLocalManager

# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# ---------------------------------------------------------------------------
# 2. Enable additional hardware acceleration for end users
# ---------------------------------------------------------------------------

manager.download_and_register_eps(
    progress_callback=lambda ep, progress: print(
        f"\r  Downloading EP '{ep}': {progress:.1f}%", end="", flush=True
    )
)
print("\n  EP registration complete.\n")

print("Available EPs:")
for ep in manager.discover_eps():
    print(f"  {ep.name} (registered: {ep.is_registered})")
print()

# ---------------------------------------------------------------------------
# 3. Download and load the vision model
# ---------------------------------------------------------------------------

model = manager.catalog.get_model("qwen3-vl-2b-instruct")

if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r  Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n  Download complete.")

print("Loading model...")
model.load()
print("Model ready.\n")

# ---------------------------------------------------------------------------
# 4. Start the Foundry Local web service
# ---------------------------------------------------------------------------

manager.start_web_service()
base_url = manager.urls[0].rstrip("/") + "/v1"
client = OpenAI(base_url=base_url, api_key="notneeded")

# ---------------------------------------------------------------------------
# 5. Prepare the image
# ---------------------------------------------------------------------------

image_url = (
    "https://github.com/microsoft/Foundry-Local/blob/main/"
    "samples/python/web-server-responses-vision/src/test_image.jpg?raw=true"
)

print(f"Fetching image: {image_url}")
with urllib.request.urlopen(image_url) as resp:
    img = Image.open(io.BytesIO(resp.read()))

img.thumbnail((512, 512))
buf = io.BytesIO()
img.save(buf, format="JPEG")
image_b64 = base64.b64encode(buf.getvalue()).decode()

# ---------------------------------------------------------------------------
# 6. Call the Responses API with vision input
# ---------------------------------------------------------------------------

vision_input = [
    {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe what you see in this image."},
            {
                "type": "input_image",
                "image_data": image_b64,
                "media_type": "image/jpeg",
            },
        ],
    }
]

print("Streaming response:\n")
stream = client.responses.create(
    model=model.id,
    input="placeholder",
    extra_body={"input": vision_input},
    stream=True,
)

for event in stream:
    if getattr(event, "type", None) == "response.output_text.delta":
        print(getattr(event, "delta", ""), end="", flush=True)
print("\n")

# ---------------------------------------------------------------------------
# 7. Cleanup
# ---------------------------------------------------------------------------

client.close()
manager.stop_web_service()
model.unload()
print("Done.")

WebGPU Execution Provider Plugin

The WebGPU execution provider is now delivered as a separate plugin rather than being bundled with the Windows ONNX Runtime package. This change reduces the default package size for applications that don’t need WebGPU, while keeping it available as an on-demand plugin for scenarios that require it. The plugin is automatically acquired via the standard execution provider download mechanism — no changes are needed in your application code.
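
For illustration, here is a minimal sketch of that mechanism, assuming the SDK surface used in the examples above; when the plugin applies to the device, WebGPU is fetched alongside any other execution providers:

from foundry_local_sdk import Configuration, FoundryLocalManager

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# The standard EP download also covers separately delivered plugins such
# as WebGPU -- no WebGPU-specific application code is required.
manager.download_and_register_eps(
    progress_callback=lambda ep, progress: None
)

for ep in manager.discover_eps():
    print(f"{ep.name} (registered: {ep.is_registered})")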

.NET SDK: Broader Compatibility

The C# SDK packages now target lower framework versions, broadening compatibility for applications that haven’t yet upgraded to the latest .NET runtime:

  • Microsoft.AI.Foundry.Local now targets netstandard2.0 (previously net9.0) — compatible with .NET Framework 4.6.1+, .NET Core 2.0+, Mono, Xamarin, and Unity. This makes it straightforward to add local AI capabilities to existing .NET applications regardless of runtime version.
  • Microsoft.AI.Foundry.Local.WinML now targets net8.0 (previously net9.0) — providing Windows hardware acceleration via GPU/NPU execution providers while maintaining broad compatibility across modern .NET LTS runtimes.

Reduced JavaScript package size

The JavaScript SDK’s native interop layer has been rewritten from koffi (a runtime FFI library) to a purpose-built Node-API C addon with prebuilt binaries shipped per platform. This removes the large koffi dependency from the package while keeping the SDK’s public API surface unchanged.

The benefits include:

  • ~27 MB smaller install footprint — eliminates the koffi transitive dependency tree
  • Faster load times — prebuilt .node binaries load directly without runtime FFI setup
  • Better stability — Node-API provides a stable ABI across Node.js versions, avoiding breakage on engine upgrades
  • No native compilation required — prebuilt addons for each platform (Windows, macOS, Linux) ship with the npm package, so npm install just works without a C toolchain

Get Started

Update to Foundry Local 1.1.0 by installing the latest SDK for your language:

# Python
pip install foundry-local-sdk --upgrade # macOS/Linux
pip install foundry-local-sdk-winml --upgrade # Windows

# JavaScript
npm install foundry-local-sdk@latest # macOS/Linux
npm install foundry-local-sdk-winml@latest # Windows

# C#
dotnet add package Microsoft.AI.Foundry.Local # macOS/Linux
dotnet add package Microsoft.AI.Foundry.Local.WinML # Windows

# Rust
cargo add foundry-local-sdk # macOS/Linux
cargo add foundry-local-sdk --features winml # Windows

What’s coming in the next release

  • C++ language binding — already available for early testing and feedback in the Foundry Local GitHub repo.
  • Smaller package size — further reductions to the core runtime footprint.
  • Audio enhancements — word and segment level timestamps, and additional language support for live transcription.


The post Foundry Local 1.1: Live Transcription, Embeddings, and Responses API appeared first on Microsoft Foundry Blog.


Linux Mint vs. Elementary OS: I compared both distros, and here's my advice

If you're looking for a user-friendly Linux distribution, your destination could depend on your starting point.

Connecting the dots for accurate AI

At HumanX, Ryan is joined by Philip Rathle, CTO at Neo4j to discuss what knowledge context means for AI agents, how limitations like stale training data make the model-only approach to agents a bad fit for enterprise environments, and how Graph RAG raises the bar for accuracy and reduces context rot by combining vectors with a knowledge graph so agents are more targeted and connected.

Microsoft PowerToys now lets you control your monitor from the taskbar - here's how

Instead of pressing buttons on your monitor or hunting through your Windows settings, here's how you can now adjust your display directly from the system tray - plus other new PowerToys perks.

A Data Center Drained 30 Million Gallons of Water Unnoticed

A Georgia data center developed by QTS used nearly 30 million gallons of water through two unaccounted-for connections before residents complained about low water pressure and the county utility discovered the issue. "All told, the developer, Quality Technology Services, owed nearly $150,000 for using more than 29 million gallons of unaccounted-for water," reports Politico. "That is equivalent to 44 Olympic-size swimming pools and far exceeds the peak limit agreed to during the data center planning process." From the report:

The details were revealed in a May 15, 2025 letter from the Fayette County water system to Quality Technology Services, which outlined the retroactive charge of $147,474. The letter did not specify how many months the unpaid bill covered, but when asked about it Wednesday, Vanessa Tigert, the Fayette County water system director, said it was likely about four months. A QTS spokesperson said the timeframe was 9-15 months.

Once the data center was notified, it paid all retroactive charges, a QTS spokesperson said in an email, noting the unmetered water consumption occurred while the county converted its system to smart meters. The Fayette County water system confirmed the data center's meters are now fully integrated and tracked.

Tigert, the water system director, blamed the issue on a procedural mix-up. "Fayette County is a suburb, it's mostly residential, and we don't have much commercial meters in our system anyway," she said. "And so we didn't realize our connection point wasn't working."

The incident became public last week when a county resident obtained the 2025 letter to QTS through a public records request and posted it on Facebook, prompting outrage from residents concerned about the data center's water consumption. [...]

Tigert, who sent the 2025 letter to QTS, said the utility didn't know about the water hookups because the connection process "got mixed up" as the county transitioned to a cloud-based system while also trying to accommodate an industrial customer. Tigert also said her staff is small and at capacity. "Just like any water system, we don't have enough staff. We can't keep staff," she said. "I've got one person that's doing inspections and plan review, and so he's spread pretty thin." She said it's possible her staff did know about hookups but that she hadn't been able to locate the inspection report. "I may have hit 'send' too soon," she said about the 2025 letter to QTS. While the utility charged the data center a higher construction rate for the unapproved water consumption, Tigert confirmed the utility did not penalize or fine the data center.

For what it's worth, the Blackstone-owned company says its data centers use a closed-loop cooling system that does not consume water for cooling. The reason for last year's high water use, according to QTS, was the temporary construction work such as concrete, dust control, and site preparation. Once the campus is fully operational, it should only use a small amount of water for things like bathrooms and kitchens. But that point could still be years away, as construction and expansion in Fayetteville may continue for another three to five years.

Read more of this story at Slashdot.


Running Foundry Agent Service on Azure Container Apps


Microsoft’s Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade agentic platform. Learn best practices from our engineering teams with real-world lessons, architectural patterns, and operational strategies for pressure-tested solutions in building, operating, and scaling AI apps and agent fleets across the organization.

Challenge: Scaling agents to production changes the requirements

As teams move from experimenting with AI agents to running them in production, the questions they ask begin to change. Early prototypes often focus on whether an agent can reason well enough to generate useful output. But once agents are placed into real systems where they must continuously serve users and respond to events, new concerns quickly take center stage: reliability, scale, observability, security, and long‑running operations.

A common misconception at this stage is to think of an agent as a simple chatbot wrapped around an API. In practice, an AI agent is something very different. It is a service that listens, thinks, and acts, ingesting unstructured inputs, reasoning over context, and producing outputs that may span multiple phases. Treating agents as services means teams often need more than they initially expect: dependable compute, strong security, and real-time visibility to run agents safely and effectively at scale.

When we kick off an agent loop, the input we provide shapes the context the agent recalls for the task, the data it connects to, the tools it calls, and the reasoning steps it outlines for itself to generate an output. An agent’s needs differ from those of a traditional service in hosting, scaling, identity, security, and observability: it is a probabilistic system that requires secure, auditable access to many resources while delivering the responsiveness users expect from any software.

This isn’t the first time the software industry has needed to evolve its thinking around infrastructure. When modern application architectures began shifting from monolithic apps toward microservices, existing infrastructure wasn’t built with that model in mind. As systems were decomposed into independent services, teams quickly discovered they needed a new runtime architecture that properly accommodated microservice needs. The modern app era brought new levels of performance, reliability, and scalability, but it also required rebuilding application infrastructure around container orchestration and new operational patterns.

AI agents represent a similar inflection point. Infrastructure designed for request‑response applications or stateless workloads wasn’t built with long‑running, tool‑calling, AI‑driven workflows in mind. As the builders of Foundry Agent Service, we knew that traditional architectures wouldn’t hold up to bursty agentic workflows that need to aggregate data across sources, connect to several tools simultaneously, and reason through execution plans to produce the required output. Rather than building new infrastructure from scratch, the choice to build on Azure Container Apps was clear. With over a million apps hosted on Azure Container Apps, it was the tried-and-true solution we needed to keep our team focused on building agent intelligence and behavior instead of the plumbing underneath.

Solution: Building Foundry Agent Service on a resilient agent runtime foundation

Foundry Agent Service is Microsoft’s fully managed platform for building, deploying, and scaling AI agents as production services. Builders start by choosing their preferred framework or immediately building an agent inside Foundry, while Foundry Agent Service handles the operational complexity required to run agents at scale.

Let’s use the example of a sales agent in Foundry Agent Service. You might have a salesperson who prompts a sales agent with “Help me prepare for my upcoming meeting with customer Contoso.” The agent is going to kick off several processes across data and tools to generate the best answer: Work IQ to understand Teams conversations with Contoso, Fabric IQ for current product usage and forecast trends, Foundry IQ to do an AI search over internal sales materials, and even GitHub Copilot SDK to generate and execute code that can draft PowerPoint and Word artifacts for the meeting. And this is just one agent; more than 20,000 customers rely on Foundry Agent Service.

At the core of Foundry Agent Service is a dedicated agent runtime built on Azure Container Apps that explicitly meets our demands for production agents. Running the agent runtime on flexible cloud infrastructure lets builders focus on creating powerful agent experiences without worrying about under-the-hood compute and configuration.

This runtime is built around five foundational pillars:

  1. Fast startup and resume. Agents are event‑driven and often bursty. Responsiveness depends on the ability to start or resume execution quickly when events arrive.
  2. Built‑in agent tool execution. Agents must securely execute tool calls like APIs, workflows, and services as part of their reasoning process, without fragile glue code or ad‑hoc orchestration.
  3. State persistence and restore. Many agent workflows are long‑running and multi‑phase. The runtime must allow agents to reason, pause, and resume with safely preserved state.
  4. Strong isolation per agent task. As agents execute code and tools dynamically, isolation is critical to prevent data leakage and contain blast radius.
  5. Secure by default. Identity, access, and execution controls are enforced at the runtime layer rather than bolted on after the fact.

Together, these pillars define what it means to run AI agents as first‑class production services.

Impact: How Azure Container Apps powers agent runtime

Building and operating agent infrastructure from scratch introduces unnecessary complexity and risk. Azure Container Apps has been pressure‑tested at Microsoft scale, proving to be a powerful serverless foundation for running AI workloads, and it aligns naturally with the needs of an agent runtime.

It provides serverless, event‑driven scaling with fast startup and scale‑to‑zero, which is critical for agents with unpredictable execution patterns. Execution is secure by default, with built‑in identity, isolation, and security boundaries enforced at the platform layer. Azure Container Apps natively supports running MCP servers and executing full agent workflows, while Container Apps jobs enable on‑demand tool execution for discrete units of work without custom orchestration.

For scenarios involving AI‑generated or untrusted code, dynamic sessions allow execution in isolated sandboxes, keeping blast radius contained. Azure Container Apps also supports running model inference directly within the container boundary, helping preserve data residency and reduce unnecessary data movement.

Learnings for your agent runtime foundation

Make infrastructure flexible with serverless architecture. AI systems move too fast to create infrastructure from scratch. With bursty, unpredictable agent workloads, sub‑second startup times and serverless scaling are critical.

Simplify heavy lifting. Developers should focus on agent behavior, tool invocation, and workflow design instead of infrastructure plumbing. With trusted cloud infrastructure, pain points like running agents in isolated sandboxes, applying security policy to agent identities, and securing connections to virtual networks are already solved. When you simplify the operational overhead, you make it easier for developers to focus on meaningful innovation.

Invest in visibility and monitoring. Strong observability enables faster iteration, safer evolution, and continuous self‑correction for both humans and agents as systems adapt over time.

