Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

What is Code

Increasingly, humans delegate writing code to agents. Will there even be source code in the future? To wrestle with this question, we have to understand what code is. Unmesh Joshi sees code as having two distinct but intertwined purposes: instructions to a machine and a conceptual model of the problem domain. He explores why it's vital to build a vocabulary to talk to the machine, how programming languages are thinking tools, and how this affects our future as we work with LLMs.

Foundry Local 1.1: Live Transcription, Embeddings, and Responses API

Today we’re announcing the 1.1.0 release of Foundry Local — Microsoft’s cross-platform local AI solution that lets developers bring AI directly into their applications with no cloud dependency, no network latency, and no per-token costs.

This release adds the following:

  • Live audio transcription for real-time speech-to-text scenarios like captioning, voice UIs, and meeting transcription.
  • Text embeddings for semantic search, RAG, clustering, and similarity matching use cases.
  • Responses API support for structured agentic interactions, including tool calling and multimodal vision-language input.
  • WebGPU execution provider plugin delivered separately to reduce the default package size for applications that don’t need it.
  • Reduced JavaScript package size by replacing the koffi FFI layer with a custom Node-API C addon.
  • Broader .NET compatibility by targeting lower framework versions in the C# SDK.

What’s new

Live Transcription API

Foundry Local now supports real-time speech-to-text streaming directly from a microphone — ideal for live captioning, voice-driven UIs, meeting transcription, and accessibility scenarios. The new Live Transcription API lets you push raw PCM audio chunks and receive transcription results as they arrive, with clear is_final markers distinguishing interim from finalized text.

The API is built around a simple session-based pattern available across all SDK language bindings (JavaScript, C#, Python, Rust):

  1. Load a streaming speech model from the catalog
  2. Create a live transcription session with audio settings (sample rate, channels, language)
  3. Start the session and begin appending audio data
  4. Consume transcription results via an async stream

Example usage

Throughout this article, the examples use the Python SDK. Equivalent JavaScript, Rust, and C# bindings are available for every example; see the Foundry Local samples on GitHub.

"""
Live microphone transcription using Foundry Local.

This script loads a streaming speech model, captures audio from the
microphone via PyAudio, and prints transcription results in real time.

Requirements:
    pip install foundry-local-sdk pyaudio
"""

import threading
import pyaudio
from foundry_local_sdk import Configuration, FoundryLocalManager

# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# ---------------------------------------------------------------------------
# 2. Download and load the streaming speech model
# ---------------------------------------------------------------------------

model = manager.catalog.get_model("nemotron-speech-streaming-en-0.6b")

if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r  Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n  Download complete.")

model.load()

# ---------------------------------------------------------------------------
# 3. Create a live transcription session
# ---------------------------------------------------------------------------

audio_client = model.get_audio_client()
session = audio_client.create_live_transcription_session()
session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "en"

session.start()

# ---------------------------------------------------------------------------
# 4. Read transcription results in a background thread
# ---------------------------------------------------------------------------

def read_results():
    for result in session.get_stream():
        text = result.content[0].text if result.content else ""
        if result.is_final:
            print(f"\n  [FINAL] {text}")
        elif text:
            print(text, end="", flush=True)

read_thread = threading.Thread(target=read_results, daemon=True)
read_thread.start()

# ---------------------------------------------------------------------------
# 5. Capture microphone audio and feed it to the session
# ---------------------------------------------------------------------------

RATE, CHANNELS, CHUNK = 16000, 1, 480  # 30 ms frames
pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK,
)

print("Speak into your microphone. Press Ctrl+C to stop.\n")
try:
    while True:
        pcm_data = stream.read(CHUNK, exception_on_overflow=False)
        session.append(pcm_data)
except KeyboardInterrupt:
    print("\nStopping...")

# ---------------------------------------------------------------------------
# 6. Cleanup
# ---------------------------------------------------------------------------

stream.close()
pa.terminate()
session.stop()
read_thread.join(timeout=5)
model.unload()
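
The `CHUNK = 480` value in the capture loop follows from the sample rate and the frame duration. As a small sketch of that arithmetic for 16-bit PCM (the helper name `frame_size` is made up for illustration and is not part of the Foundry Local SDK):

```python
def frame_size(sample_rate_hz: int, frame_ms: int, channels: int = 1,
               bytes_per_sample: int = 2) -> tuple[int, int]:
    """Return (samples per channel, total bytes) for one PCM frame."""
    samples, remainder = divmod(sample_rate_hz * frame_ms, 1000)
    if remainder:
        raise ValueError("frame duration does not map to a whole number of samples")
    return samples, samples * channels * bytes_per_sample


# 16 kHz, 30 ms, mono, 16-bit samples (pyaudio.paInt16 is 2 bytes per sample)
samples, size_bytes = frame_size(16000, 30)
print(samples, size_bytes)
```

Shorter frames lower the capture delay per chunk at the cost of more `append` calls per second; the right trade-off depends on the application.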

Optimized for on-device streaming ASR

To identify the best model for real-time on-device transcription, we conducted a systematic empirical study across over 50 configurations spanning encoder-decoder, transducer, and LLM-based ASR architectures — including OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR — evaluated across batch, chunked, and streaming inference modes.

From this study, we identified NVIDIA’s Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implemented the complete streaming inference pipeline in ONNX Runtime and applied multiple post-training quantization strategies — including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization — combined with graph-level operator fusion. These optimizations reduced the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline.
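
To make the quantization step more concrete, here is a minimal sketch of symmetric round-to-nearest quantization to 4 bits for one group of weights. It is a generic textbook illustration in plain Python, not the pipeline used for the Nemotron model; the function names and the sample weights are made up.

```python
def quantize_rtn_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one weight group to int4."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7; signed int4 covers [-8, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


weights = [0.7, -0.32, 0.12, 0.0]  # illustrative values only
codes, scale = quantize_rtn_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a 4-bit code plus one shared scale per group, and the rounding error stays within half a quantization step (`scale / 2`). The importance-weighted and mixed-precision schemes mentioned above refine this basic idea and are not shown here.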

Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56s algorithmic latency — establishing a new quality-efficiency Pareto point for on-device streaming ASR.
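
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a standard dynamic-programming implementation in plain Python; it does not reproduce the text normalization or the benchmarks used in the study, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")
```

On this scale, "within 1% absolute" means the two WER values differ by at most one percentage point, rather than by 1% of the baseline value.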

The model is available in the Foundry catalog as nemotron-speech-streaming-en-0.6b.

For the full methodology and benchmark results, see our paper: Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference (arXiv:2604.14493).

Embeddings API for semantic search scenarios

Foundry Local now supports text embedding generation across all four SDKs (C#, JavaScript, Python, and Rust). Embeddings unlock a wide range of local AI scenarios including semantic search, RAG (retrieval-augmented generation), clustering, and similarity matching — all running entirely on-device.

The Embeddings API supports both single and batch input, with configurable dimensions and encoding format. Responses follow the OpenAI embeddings format for seamless cloud-to-edge portability.
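
For orientation, an OpenAI-format embeddings response is a JSON object with a `data` list whose items carry an `index` and an `embedding`. The sketch below parses a hand-written payload of that shape using only the standard library; the vector values and the model name are placeholders, and it assumes Foundry Local returns the same fields, as the paragraph above indicates.

```python
import json

# Hand-written payload in the OpenAI embeddings response shape.
sample_response = """
{
  "object": "list",
  "model": "example-embedding-model",
  "data": [
    {"object": "embedding", "index": 1, "embedding": [0.0, 1.0, 0.0]},
    {"object": "embedding", "index": 0, "embedding": [0.6, 0.8, 0.0]}
  ],
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}
"""


def extract_vectors(payload: str) -> list[list[float]]:
    """Return the embeddings ordered by `index`, i.e. in input order."""
    body = json.loads(payload)
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = extract_vectors(sample_response)
print(len(vectors), "vectors of", len(vectors[0]), "dimensions")
```

Sorting by `index` keeps each vector aligned with the input that produced it, which matters when a batch result is written to a vector store.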

Example usage

The following example pairs Foundry Local embeddings with ChromaDB to build a fully local semantic search pipeline — documents are embedded and indexed in-memory, then natural-language queries are matched to the most relevant results.

"""
Semantic search using Foundry Local embeddings and ChromaDB.

This script loads an embedding model locally, indexes a set of documents
into an in-memory ChromaDB collection, and performs natural-language
semantic queries against them — all running on-device.

Requirements:
    pip install foundry-local-sdk chromadb
"""

import chromadb

from foundry_local_sdk import Configuration, FoundryLocalManager

# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# ---------------------------------------------------------------------------
# 2. Enable additional hardware acceleration for end users
# ---------------------------------------------------------------------------

manager.download_and_register_eps(
    progress_callback=lambda ep, progress: print(
        f"\r  Downloading EP '{ep}': {progress:.1f}%", end="", flush=True
    )
)
print("\n  EP registration complete.\n")

print("Available EPs:")
for ep in manager.discover_eps():
    print(f"  {ep.name} (registered: {ep.is_registered})")
print()

# ---------------------------------------------------------------------------
# 3. Download and load an embedding model
# ---------------------------------------------------------------------------

model = manager.catalog.get_model("qwen3-embedding-0.6b")

if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r  Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n  Download complete.")

model.load()

client = model.get_embedding_client()

# ---------------------------------------------------------------------------
# 4. Build a knowledge base
# ---------------------------------------------------------------------------

documents = [
    "Python is a high-level programming language known for its readability and versatility.",
    "Rust is a systems programming language focused on safety, speed, and concurrency.",
    "Machine learning is a subset of artificial intelligence that learns from data.",
    "The capital of France is Paris, known for the Eiffel Tower.",
    "Docker containers package applications with their dependencies for consistent deployment.",
    "PostgreSQL is a powerful open-source relational database system.",
    "Neural networks are computing systems inspired by biological brain structures.",
    "Kubernetes orchestrates containerized workloads across clusters of machines.",
    "The Python GIL limits true multi-threading for CPU-bound tasks.",
    "Vector databases store and search high-dimensional embeddings efficiently.",
]

print("Generating embeddings for knowledge base...")
batch_response = client.generate_embeddings(documents)
embeddings = [item.embedding for item in batch_response.data]
print(f"Indexed {len(embeddings)} documents ({len(embeddings[0])} dimensions each)")

# ---------------------------------------------------------------------------
# 5. Store embeddings in ChromaDB
# ---------------------------------------------------------------------------

chroma = chromadb.Client()
collection = chroma.create_collection(
    name="knowledge_base", metadata={"hnsw:space": "cosine"}
)
collection.add(
    ids=[f"doc-{i}" for i in range(len(documents))],
    embeddings=embeddings,
    documents=documents,
)

# ---------------------------------------------------------------------------
# 6. Semantic search
# ---------------------------------------------------------------------------

queries = [
    "What programming language is good for beginners?",
    "How do I deploy applications in production?",
    "Tell me about AI and deep learning",
]

for query in queries:
    query_embedding = client.generate_embedding(query).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=3)

    print(f'\n🔍 Query: "{query}"')
    for doc, distance in zip(results["documents"][0], results["distances"][0]):
        print(f"   [{1 - distance:.3f}] {doc}")

# ---------------------------------------------------------------------------
# 7. Cleanup
# ---------------------------------------------------------------------------

model.unload()
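
The score printed by the search loop is `1 - distance`. For a Chroma collection configured with the cosine space, the distance is one minus the cosine similarity, so the printed value is the cosine similarity itself. A plain-Python sketch of the same ranking, using tiny made-up vectors in place of real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], corpus: dict, k: int = 2) -> list:
    """Rank corpus entries by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    return sorted(scored, reverse=True)[:k]


corpus = {
    "doc-0": [1.0, 0.0, 0.0],
    "doc-1": [0.0, 1.0, 0.0],
    "doc-2": [0.6, 0.8, 0.0],
}
for score, doc_id in top_k([1.0, 0.0, 0.0], corpus):
    print(f"[{score:.3f}] {doc_id}")
```

A brute-force scan like this is fine for a handful of documents; an index such as the HNSW one used by ChromaDB becomes worthwhile as the collection grows.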

Responses API

Foundry Local now includes an Open Responses API client, bringing structured agentic AI capabilities to on-device inference. The Responses API provides a higher-level abstraction over chat completions with built-in support for:

  • Streaming — token-by-token server-sent events
  • Multi-turn conversations — chain responses with previous_response_id
  • Tool calling — define function tools and handle tool call/result round-trips
  • Vision — pass images alongside text input (model-dependent)
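
To illustrate the tool-calling round-trip, the sketch below builds the pieces in the OpenAI Responses format: a function tool definition, a `function_call` output item, and the `function_call_output` item sent back on the next turn. It runs without a model. The `get_weather` tool, its stub implementation, and the sample `call_id` are invented for illustration, and whether a given local model emits tool calls is model-dependent.

```python
import json

# Function tool definition in the Responses API shape.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def get_weather(city: str) -> dict:
    """Hypothetical local tool; a real one would query a data source."""
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def handle_tool_call(item: dict) -> dict:
    """Run the requested function and build the item to send back."""
    arguments = json.loads(item["arguments"])  # arguments arrive as a JSON string
    result = TOOLS[item["name"]](**arguments)
    return {
        "type": "function_call_output",
        "call_id": item["call_id"],  # ties the result to the original call
        "output": json.dumps(result),
    }


# Example of an output item a model might return when it decides to call the tool.
model_item = {
    "type": "function_call",
    "call_id": "call_123",
    "name": "get_weather",
    "arguments": '{"city": "Paris"}',
}

follow_up_input = [handle_tool_call(model_item)]
print(json.dumps(follow_up_input, indent=2))
```

In a real exchange, `weather_tool` would be passed in the `tools` list of the first request, and the follow-up request would send `follow_up_input` as `input` with `previous_response_id` set to the id of the response that contained the call.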

Example usage

With the Foundry Local 1.1 release we’ve also added Qwen3.5 VLM to the model catalog — a natively multimodal vision-language model that can reason over images and text together. Smaller variants (3B, 7B) are optimized for on-device inference, making it practical to run vision tasks locally without cloud dependencies.

This enables scenarios like document understanding, diagram analysis, UI screenshot interpretation, and visual question answering — all running entirely on-device. For example, the following code streams a description of an image from the Qwen3.5 VLM using the Responses API:

"""
Image description using Foundry Local and the OpenAI Responses API.

This script loads a vision-language model locally via Foundry Local,
starts the built-in web service, and uses the OpenAI SDK's Responses API
to stream a description of an image — all running on-device.

Requirements:
    pip install foundry-local-sdk openai Pillow
"""

import base64
import io
import urllib.request

from PIL import Image
from openai import OpenAI

from foundry_local_sdk import Configuration, FoundryLocalManager

# ---------------------------------------------------------------------------
# 1. Initialize Foundry Local
# ---------------------------------------------------------------------------

config = Configuration(app_name="foundry_local_samples")
FoundryLocalManager.initialize(config)
manager = FoundryLocalManager.instance

# ---------------------------------------------------------------------------
# 2. Enable additional hardware acceleration for end users
# ---------------------------------------------------------------------------

manager.download_and_register_eps(
    progress_callback=lambda ep, progress: print(
        f"\r  Downloading EP '{ep}': {progress:.1f}%", end="", flush=True
    )
)
print("\n  EP registration complete.\n")

print("Available EPs:")
for ep in manager.discover_eps():
    print(f"  {ep.name} (registered: {ep.is_registered})")
print()

# ---------------------------------------------------------------------------
# 3. Download and load the vision model
# ---------------------------------------------------------------------------

model = manager.catalog.get_model("qwen3-vl-2b-instruct")

if not model.is_cached:
    print("Downloading model...")
    model.download(
        lambda progress: print(f"\r  Progress: {progress:.1f}%", end="", flush=True)
    )
    print("\n  Download complete.")

print("Loading model...")
model.load()
print("Model ready.\n")

# ---------------------------------------------------------------------------
# 4. Start the Foundry Local web service
# ---------------------------------------------------------------------------

manager.start_web_service()
base_url = manager.urls[0].rstrip("/") + "/v1"
client = OpenAI(base_url=base_url, api_key="notneeded")

# ---------------------------------------------------------------------------
# 5. Prepare the image
# ---------------------------------------------------------------------------

image_url = (
    "https://github.com/microsoft/Foundry-Local/blob/main/"
    "samples/python/web-server-responses-vision/src/test_image.jpg?raw=true"
)

print(f"Fetching image: {image_url}")
with urllib.request.urlopen(image_url) as resp:
    img = Image.open(io.BytesIO(resp.read()))

img.thumbnail((512, 512))
if img.mode != "RGB":
    img = img.convert("RGB")  # JPEG cannot store alpha or palette modes
buf = io.BytesIO()
img.save(buf, format="JPEG")
image_b64 = base64.b64encode(buf.getvalue()).decode()

# ---------------------------------------------------------------------------
# 6. Call the Responses API with vision input
# ---------------------------------------------------------------------------

vision_input = [
    {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe what you see in this image."},
            {
                "type": "input_image",
                "image_data": image_b64,
                "media_type": "image/jpeg",
            },
        ],
    }
]

print("Streaming response:\n")
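# The typed `input` argument gets a placeholder; extra_body is merged into the
# request body and replaces it with the structured vision payload.
stream = client.responses.create(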
    model=model.id,
    input="placeholder",
    extra_body={"input": vision_input},
    stream=True,
)

for event in stream:
    if getattr(event, "type", None) == "response.output_text.delta":
        print(getattr(event, "delta", ""), end="", flush=True)
print("\n")

# ---------------------------------------------------------------------------
# 7. Cleanup
# ---------------------------------------------------------------------------

client.close()
manager.stop_web_service()
model.unload()
print("Done.")

WebGPU Execution Provider Plugin

The WebGPU execution provider is now delivered as a separate plugin rather than being bundled with the Windows ONNX Runtime package. This change reduces the default package size for applications that don’t need WebGPU, while keeping it available as an on-demand plugin for scenarios that require it. The plugin is automatically acquired via the standard execution provider download mechanism — no changes are needed in your application code.

.NET SDK: Broader Compatibility

The C# SDK packages now target lower framework versions, broadening compatibility for applications that haven’t yet upgraded to the latest .NET runtime:

  • Microsoft.AI.Foundry.Local now targets netstandard2.0 (previously net9.0) — compatible with .NET Framework 4.6.1+, .NET Core 2.0+, Mono, Xamarin, and Unity. This makes it straightforward to add local AI capabilities to existing .NET applications regardless of runtime version.
  • Microsoft.AI.Foundry.Local.WinML now targets net8.0 (previously net9.0) — providing Windows hardware acceleration via GPU/NPU execution providers while maintaining broad compatibility across modern .NET LTS runtimes.

Reduced JavaScript package size

The JavaScript SDK’s native interop layer has been rewritten, replacing koffi (a runtime FFI library) with a purpose-built Node-API C addon that ships prebuilt binaries for each platform. This removes the large koffi dependency from the package while keeping the SDK’s public API surface unchanged.

The benefits include:

  • ~27 MB smaller install footprint — eliminates the koffi transitive dependency tree
  • Faster load times — prebuilt .node binaries load directly without runtime FFI setup
  • Better stability — Node-API provides a stable ABI across Node.js versions, avoiding breakage on engine upgrades
  • No native compilation required — prebuilt addons for each platform (Windows, macOS, Linux) ship with the npm package, so npm install just works without a C toolchain

Get Started

Update to Foundry Local 1.1.0 by installing the latest SDK for your language:

# Python
pip install foundry-local-sdk --upgrade # macOS/Linux
pip install foundry-local-sdk-winml --upgrade # Windows

# JavaScript
npm install foundry-local-sdk@latest # macOS/Linux
npm install foundry-local-sdk-winml@latest # Windows

# C#
dotnet add package Microsoft.AI.Foundry.Local # macOS/Linux
dotnet add package Microsoft.AI.Foundry.Local.WinML # Windows

# Rust
cargo add foundry-local-sdk # macOS/Linux
cargo add foundry-local-sdk --features winml # Windows
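
After upgrading the Python package, a quick check can confirm which SDK version is installed. This sketch uses only the standard library. It assumes the distribution names match the `pip install` names above and that the Python package version follows the 1.1.0 release number; neither is confirmed here.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def at_least(version: str, minimum: tuple) -> bool:
    """Compare leading numeric components, e.g. '1.1.0' against (1, 1, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # pre-release or local suffixes such as '0rc1' are ignored
        parts.append(int(piece))
    parts += [0] * (len(minimum) - len(parts))  # treat '1.1' as '1.1.0'
    return tuple(parts) >= minimum


# Assumed distribution names, taken from the install commands above.
for name in ("foundry-local-sdk", "foundry-local-sdk-winml"):
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed")
    else:
        status = "ok" if at_least(version, (1, 1, 0)) else "older than 1.1.0"
        print(f"{name}: {version} ({status})")
```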

What’s coming in the next release

  • C++ language binding — already available for early testing and feedback in the Foundry Local GitHub repo.
  • Smaller package size — further reductions to the core runtime footprint.
  • Audio enhancements — word and segment level timestamps, and additional language support for live transcription.

The post Foundry Local 1.1: Live Transcription, Embeddings, and Responses API appeared first on Microsoft Foundry Blog.

Linux Mint vs. Elementary OS: I compared both distros, and here's my advice

If you're looking for a user-friendly Linux distribution, your destination could depend on your starting point.

Connecting the dots for accurate AI

At HumanX, Ryan is joined by Philip Rathle, CTO at Neo4j to discuss what knowledge context means for AI agents, how limitations like stale training data make the model-only approach to agents a bad fit for enterprise environments, and how Graph RAG raises the bar for accuracy and reduces context rot by combining vectors with a knowledge graph so agents are more targeted and connected.

Microsoft PowerToys now lets you control your monitor from the taskbar - here's how

Instead of pressing buttons on your monitor or hunting through your Windows settings, here's how you can now adjust your display directly from the system tray - plus other new PowerToys perks.

A Data Center Drained 30 Million Gallons of Water Unnoticed

A Georgia data center developed by QTS used nearly 30 million gallons of water through two unaccounted-for connections before residents complained about low water pressure and the county utility discovered the issue. "All told, the developer, Quality Technology Services, owed nearly $150,000 for using more than 29 million gallons of unaccounted-for water," reports Politico. "That is equivalent to 44 Olympic-size swimming pools and far exceeds the peak limit agreed to during the data center planning process." From the report: The details were revealed in a May 15, 2025 letter from the Fayette County water system to Quality Technology Services, which outlined the retroactive charge of $147,474. The letter did not specify how many months the unpaid bill covered, but when asked about it Wednesday, Vanessa Tigert, the Fayette County water system director, said it was likely about four months. A QTS spokesperson said the timeframe was 9-15 months. Once the data center was notified, it paid all retroactive charges, a QTS spokesperson said in an email, noting the unmetered water consumption occurred while the county converted its system to smart meters. The Fayette County water system confirmed the data center's meters are now fully integrated and tracked. Tigert, the water system director, blamed the issue on a procedural mix-up. "Fayette County is a suburb, it's mostly residential, and we don't have much commercial meters in our system anyway," she said. "And so we didn't realize our connection point wasn't working." The incident became public last week when a county resident obtained the 2025 letter to QTS through a public records request and posted it on Facebook, prompting outrage from residents concerned about the data center's water consumption. [...] 
Tigert, who sent the 2025 letter to QTS, said the utility didn't know about the water hookups because the connection process "got mixed up" as the county transitioned to a cloud-based system while also trying to accommodate an industrial customer. Tigert also said her staff is small and at capacity. "Just like any water system, we don't have enough staff. We can't keep staff," she said. "I've got one person that's doing inspections and plan review, and so he's spread pretty thin." She said it's possible her staff did know about hookups but that she hadn't been able to locate the inspection report. "I may have hit 'send' too soon," she said about the 2025 letter to QTS. While the utility charged the data center a higher construction rate for the unapproved water consumption, Tigert confirmed the utility did not penalize or fine the data center. For what it's worth, the Blackstone-owned company says its data centers use a closed-loop cooling system that does not consume water for cooling. The reason for last year's high water use, according to QTS, was the temporary construction work such as concrete, dust control, and site preparation. Once the campus is fully operational, it should only use a small amount of water for things like bathrooms and kitchens. But that point could still be years away, as construction and expansion in Fayetteville may continue for another three to five years.

Read more of this story at Slashdot.
