Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Building an On-Device Voice Assistant with Microsoft Foundry Local

Why on-device voice still matters

Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works, but it also means three round trips, three separate bills, and three places your user's voice gets logged.

The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini, you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress.

This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable.

What we're building

A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000. The page captures microphone audio, posts it to /api/transcribe, then streams the chat reply back over Server-Sent Events from /api/chat. All inference runs locally through two Foundry Local models loaded into the same process.

The shape of the pipeline:

Microphone (browser MediaRecorder)
   │  WebM/Opus blob
   ▼
Client-side WAV encoder (16 kHz, mono, PCM-16)
   │  multipart/form-data
   ▼
FastAPI /api/transcribe
   │
   ▼
Nemotron Speech Streaming En 0.6B  (Foundry Local audio client)
   │  transcript text
   ▼
Chat LLM e.g. qwen2.5-0.5b         (Foundry Local chat client)
   │  streamed tokens
   ▼
FastAPI /api/chat → SSE → browser bubble

The version that bit us: foundry-local-sdk >= 1.1.0

Before any code, the single most important fact about this project:

The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found.

The module name also changed in 1.1.0 — it is now foundry_local_sdk (with the underscore-sdk suffix), not foundry_local. The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about.

Pin it explicitly:

pip install --upgrade "foundry-local-sdk>=1.1.0,<2"

And verify before anything else:

python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))"
# expect: sdk 1.1.0 (or any later 1.x release allowed by the pin)
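
If you would rather fail fast inside the application than rely on a manual check, the same lookup can be wrapped in a small guard. This is a minimal sketch, not part of the sample: the helper names (parse_version, sdk_is_new_enough, check_installed) are hypothetical, and the comparison deliberately ignores pre-release tags such as "rc1".

```python
import importlib.metadata as metadata

MINIMUM = (1, 1, 0)  # first SDK line that can resolve the Nemotron alias, per the text above


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.1.0' into (1, 1, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad '1.1' to (1, 1, 0) so tuple comparison behaves
        parts.append(0)
    return tuple(parts)


def sdk_is_new_enough(installed: str, minimum: tuple[int, ...] = MINIMUM) -> bool:
    """True when the installed version string is at least the minimum."""
    return parse_version(installed) >= minimum


def check_installed(package: str = "foundry-local-sdk") -> bool:
    """Look up the installed distribution; a missing package counts as too old."""
    try:
        return sdk_is_new_enough(metadata.version(package))
    except metadata.PackageNotFoundError:
        return False
```

Calling check_installed() at startup and raising a clear error when it returns False turns the "model not found" failure into an actionable message.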

Loading both models from one manager

The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client — get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client.

The wrapper used in the repo (src/foundry_client.py) does this:

from foundry_local_sdk import Configuration, FoundryLocalManager

FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron"))
manager = FoundryLocalManager.instance

chat_model = manager.load_model("qwen2.5-0.5b")
stt_model  = manager.load_model("nemotron-speech-streaming-en-0.6b")

chat_client  = chat_model.get_chat_client()
audio_client = stt_model.get_audio_client()

Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB.

The transcription surprise

The first naive approach was the obvious one:

with open(wav_path, "rb") as f:
    result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b")

That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime.

The fix is to use the streaming session API instead — a different native entry point (core_interop.start_audio_stream) that the streaming model does support. The repo isolates this in src/_nemotron_live.py:

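import threading
import wave

def transcribe_wav_live(audio_client, wav_path, *, language="en"):
    """Transcribe a PCM WAV file through the live (streaming) session API."""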
    with wave.open(str(wav_path), "rb") as w:
        sample_rate  = w.getframerate()
        channels     = w.getnchannels()
        sample_width = w.getsampwidth()
        pcm          = w.readframes(w.getnframes())

    session = audio_client.create_live_transcription_session()
    session.settings.sample_rate     = sample_rate
    session.settings.channels        = channels
    session.settings.bits_per_sample = sample_width * 8
    session.settings.language        = language
    session.start()

    # Feed PCM in ~100 ms chunks from a worker thread, then stop.
    bytes_per_sec = sample_rate * channels * sample_width
    chunk_bytes   = max(bytes_per_sec // 10, 1024)

    def _pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            session.stop()

    threading.Thread(target=_pusher, daemon=True).start()

    parts = []
    for resp in session.get_stream():
        for cp in getattr(resp, "content", []) or []:
            text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or ""
            if text:
                parts.append(text)
    return " ".join(p.strip() for p in parts if p.strip()).strip()

Three things to notice:

  • Push from a worker thread, read from the calling thread. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel; otherwise you deadlock the session.
  • Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency.
  • Always session.stop(). Without it the generator never terminates and the request hangs.
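
The three points above can be reproduced without the SDK. The sketch below uses a stand-in FakeSession built on a bounded queue; it is not the Foundry Local API, only an assumed model of "blocking write, blocking read" behaviour, and the names chunk_size_bytes and push_and_drain are hypothetical. It shows the chunk-size arithmetic and why the writer has to live on its own thread.

```python
import queue
import threading


def chunk_size_bytes(sample_rate: int, channels: int, sample_width: int,
                     chunk_ms: int = 100, floor: int = 1024) -> int:
    """Bytes of PCM covering chunk_ms of audio, never smaller than floor."""
    bytes_per_sec = sample_rate * channels * sample_width
    return max(bytes_per_sec * chunk_ms // 1000, floor)


class FakeSession:
    """Stand-in for a live transcription session (NOT the real SDK class)."""

    _DONE = object()

    def __init__(self, capacity: int = 2):
        # Tiny buffer: append() blocks as soon as the reader falls behind.
        self._q = queue.Queue(maxsize=capacity)

    def append(self, chunk: bytes) -> None:
        self._q.put(chunk)  # blocks until the reader makes room

    def stop(self) -> None:
        self._q.put(self._DONE)  # lets get_stream() terminate

    def get_stream(self):
        while True:
            item = self._q.get()
            if item is self._DONE:
                return
            yield len(item)  # pretend result: bytes consumed


def push_and_drain(pcm: bytes, chunk_bytes: int) -> list[int]:
    """Write chunks from a worker thread while the caller drains results."""
    session = FakeSession()

    def pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            session.stop()  # without this the reader loop never ends

    threading.Thread(target=pusher, daemon=True).start()
    return list(session.get_stream())
```

For 16 kHz mono PCM-16, chunk_size_bytes(16000, 1, 2) gives 3200 bytes per 100 ms. If pusher() ran inline before get_stream() was consumed, the third append() would block forever on the full buffer, which is the deadlock described above.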

The other transcription surprise: browsers don't send WAV

Inside the browser, MediaRecorder produces a compressed container rather than WAV; Chromium-based browsers default to audio/webm;codecs=opus. That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency, which is exactly the kind of friction this project exists to remove.

The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus and resamples to the context's sample rate, so the page can decode the recording at 16 kHz, mix to mono, and emit a PCM-16 WAV blob in roughly 30 lines of JavaScript:

// Inside src/static/index.html
async function webmToWav(blob) {
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
  const buf = await ctx.decodeAudioData(await blob.arrayBuffer());
  // Mix to mono
  const ch  = buf.numberOfChannels;
  const mono = new Float32Array(buf.length);
  for (let c = 0; c < ch; c++) {
    const data = buf.getChannelData(c);
    for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch;
  }
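  await ctx.close();  // release audio resources once decoding is done
  // Use the decoded buffer's actual rate so the WAV header is always truthful.
  return encodeWav(mono, buf.sampleRate);
}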

function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view   = new DataView(buffer);
  // RIFF header
  writeStr(view, 0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true);
  writeStr(view, 8, "WAVE");
  // fmt chunk
  writeStr(view, 12, "fmt ");
  view.setUint32(16, 16, true);              // PCM chunk size
  view.setUint16(20, 1, true);               // PCM format
  view.setUint16(22, 1, true);               // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);  // byte rate
  view.setUint16(32, 2, true);               // block align
  view.setUint16(34, 16, true);              // bits per sample
  // data chunk
  writeStr(view, 36, "data");
  view.setUint32(40, samples.length * 2, true);
  // PCM-16 samples
  let o = 44;
  for (let i = 0; i < samples.length; i++, o += 2) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
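  return new Blob([view], { type: "audio/wav" });
}

// Helper used above; not shown in the original excerpt, so this is a minimal assumed version.
function writeStr(view, offset, str) {
  for (let i = 0; i < str.length; i++) view.setUint8(offset + i, str.charCodeAt(i));
}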

Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side.
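
Because the server trusts the browser to produce the right format, a cheap header check before transcription can turn a confusing runtime failure into a clear 4xx response. The following is a sketch using only the standard library; check_wav and its expected_rate default are assumptions for illustration, not code from the sample.

```python
import io
import wave


def check_wav(data: bytes, *, expected_rate: int = 16000) -> float:
    """Return the clip duration in seconds, or raise ValueError if the upload
    is not 16-bit mono PCM at expected_rate."""
    try:
        with wave.open(io.BytesIO(data), "rb") as w:
            channels = w.getnchannels()
            width = w.getsampwidth()
            rate = w.getframerate()
            frames = w.getnframes()
    except (wave.Error, EOFError) as exc:
        # wave raises these for non-RIFF, truncated, or non-PCM input.
        raise ValueError(f"not a readable PCM WAV file: {exc}") from exc

    if channels != 1:
        raise ValueError(f"expected mono audio, got {channels} channels")
    if width != 2:
        raise ValueError(f"expected 16-bit samples, got {width * 8}-bit")
    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
    return frames / rate
```

In a FastAPI handler, a ValueError from this check could be mapped to an HTTP 400 before any temp file is written.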

Wiring it into FastAPI

The server (src/app.py) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request:

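# Needs: import os, tempfile
@app.post("/api/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    data = await audio.read()
    # delete=False so the file can be reopened by path (required on Windows).
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        text = _ai_client.transcribe(path)
    finally:
        os.unlink(path)  # never leave raw audio on disk after the request
    return {"text": text}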


@app.post("/api/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            _sse(_ai_client.stream_completion(req.messages)),
            media_type="text/event-stream",
        )
    return {"text": _ai_client.chat_completion(req.messages)}

Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost.
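
The _sse helper is not shown in the excerpt above. As an illustration of what such a helper has to do, here is a minimal sketch under assumptions: the function name sse_frames, the {"token": ...} payload shape, and the "[DONE]" end marker are all hypothetical choices, not the sample's actual wire format.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each streamed token in a Server-Sent Events frame.

    Each frame is a 'data:' line followed by a blank line. Tokens are
    JSON-encoded so that embedded newlines cannot split a frame.
    """
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed end-of-stream marker so the client knows when to stop reading.
    yield "data: [DONE]\n\n"
```

JSON-encoding matters because a raw newline inside a token would otherwise be read by the browser as the end of the event.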

What it looks like

The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone.

Performance, honestly

This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation:

  • First model load (cold cache): tens of seconds — downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b.
  • Subsequent loads (warm cache): a few seconds per model.
  • End-to-end transcription of a 5-second utterance: well under a second after warm-up.
  • First chat token from qwen2.5-0.5b: typically 200–500 ms; full short reply within 1–2 s.

On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model.
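
To reproduce figures like these on your own hardware, it helps to measure time-to-first-token separately from total time. The sketch below works on any token iterator; time_stream is a hypothetical helper, not part of the sample, and it would wrap whatever generator your chat client returns.

```python
import time
from typing import Iterable, Optional


def time_stream(stream: Iterable[str]) -> tuple[Optional[float], float, int]:
    """Consume a token stream and return
    (seconds to first token or None if empty, total seconds, token count)."""
    start = time.perf_counter()
    first: Optional[float] = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start  # latency until the first token arrived
        count += 1
    return first, time.perf_counter() - start, count
```

Run it a few times after warm-up and compare the first-token value rather than the total, since total time grows with reply length.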

Trade-offs to know about

  • Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog.
  • STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b. Multilingual variants are likely to follow.
  • Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real.
  • No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour.

Responsible AI considerations

Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility:

  • Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red button and a "Recording…" banner for that reason.
  • Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device.
  • Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters.

Try it

  1. Clone github.com/leestott/fl-nemotron.
  2. ./setup.ps1 (or ./setup.sh) to create a virtualenv and install the pinned SDK.
  3. python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models.
  4. .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 (on macOS/Linux a standard virtualenv puts the equivalent at .venv/bin/uvicorn)
  5. Open http://127.0.0.1:8000 in a real browser and click the 🎤 button.

Where to go next

Key takeaways

  • Pin foundry-local-sdk >= 1.1.0. Earlier SDKs cannot see the Nemotron Speech Streaming model.
  • Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe().
  • Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS.
  • Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks.
  • A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process — no cloud, no keys, no data egress.
alvinashcraft
46 minutes ago
Pennsylvania, USA

How ‘bee’s knees’ became high praise, and why do recipes sound so bossy?

1188. This week, we look at how “the bee's knees” went from meaning something tiny to the cheeriest slang of the 1920s — and why it outlasted the cat's pajamas and the clam's overshoes. Then, we look at why recipes boss you around with phrases like “fold in cheese” and how cookbook language evolved from chatty medieval notes into clipped, no-nonsense commands.


The "recipe" segment was by Karen Lunde, a career writer and former Quick & Dirty Tips editor. She writes I'll Go First, a Substack where she shares personal essays and memoir, then hands you a weekly writing prompt and a metaphorical pen. Find her on igofirst.org.


🔗 Join the Grammar Girl Patreon.

🔗 Share your familect recording in Speakpipe or by leaving a voicemail at 833-214-GIRL (833-214-4475)

🔗 Watch my LinkedIn Learning writing courses.

🔗 Subscribe to the newsletter.

🔗 Find an edited transcript.

🔗 Get Grammar Girl books.


| HOST: Mignon Fogarty


| Grammar Girl is part of the Quick and Dirty Tips podcast network.

  • Audio Engineer: Castria Communications
  • Director of Podcast: Holly Hutchings
  • Advertising Operations Specialist: Morgan Christianson
  • Marketing and Video: Nat Hoopes, Rebekah Sebastian
  • Podcast Associate: Maram Elnagheeb


| Theme music by Catherine Rannus.


| Grammar Girl Social Media: YouTube, TikTok, Facebook, Threads, Instagram, LinkedIn, Mastodon, Bluesky.


Hosted on Acast. See acast.com/privacy for more information.





Download audio: https://sphinx.acast.com/p/open/s/69c1476c007cdcf83fc0964b/e/6a0fa1ad163f1001830b603a/media.mp3

SDU Show 96 with guest Jerry Nixon

SDU Show 96 features Jerry Nixon, Microsoft Principal Product Manager for Data and AI, discussing the SQL MCP Server.



Download audio: http://sqldownunder.blob.core.windows.net/podcasts/SDU96FullShow.mp3

Why Agents Still Need Humans

From: AIDailyBrief
Duration: 22:19
Views: 1,610

NLW explores the next wave of human-agent collaboration, using Dan Shipper’s “After Automation” essay and Every’s agent experiments to argue that automation is creating more expert human work, not less. The episode looks at shared team agents, the “human sandwich” model, the limits of fully autonomous OpenClaw-style agents, and why Codex and Claude Code point toward a more semi-synchronous future of managing agent work across devices.
After Automation: https://every.to/p/after-automation

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Get it ad free at http://patreon.com/aidailybrief
Learn more about the show https://aidailybrief.ai/


#724 – All Heat, No Useful Work






Download audio: https://traffic.libsyn.com/theamphour/TheAmpHour-724-AllHeatNoUsefulWork.mp3

Get To Know MCP Tunnels

From: Den Delimarsky
Duration: 13:41
Views: 100

Got MCP servers running inside a private network and no good way to reach them without poking holes in your firewall? That is exactly what MCP tunnels solve.

In this video I walk through how tunnels let Claude Managed Agents talk to MCP servers on your private network over an outbound-only connection. No inbound ports, no exposing services to the public internet, no allowlisting IP ranges on your origin.

It works well with Docker - but also if you want to deploy this right into your Kubernetes cluster!

#engineering #tutorial #claude #mcp
