Why on-device voice still matters
Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works — but it also means three round trips, three per-token bills, and three places your user's voice gets logged.
The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini, you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress.
This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable.
What we're building
A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000. The page captures microphone audio, posts it to /api/transcribe, then streams the chat reply back over Server-Sent Events from /api/chat. All inference runs locally through two Foundry Local models loaded into the same process.
The shape of the pipeline:
Microphone (browser MediaRecorder)
│ WebM/Opus blob
▼
Client-side WAV encoder (16 kHz, mono, PCM-16)
│ multipart/form-data
▼
FastAPI /api/transcribe
│
▼
Nemotron Speech Streaming En 0.6B (Foundry Local audio client)
│ transcript text
▼
Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client)
│ streamed tokens
▼
FastAPI /api/chat → SSE → browser bubble
The version that bit us: foundry-local-sdk >= 1.1.0
Before any code, the single most important fact about this project:
The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found.
The module name also changed in 1.1.0: it is now foundry_local_sdk (with the _sdk suffix), not foundry_local. The foundry-local-core wheel is bundled, so there is no separate MSI / winget install to worry about.
Pin it explicitly:
pip install --upgrade "foundry-local-sdk>=1.1.0,<2"
And verify before anything else:
python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))"
# expect: sdk 1.1.0 (or any later 1.x release)
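The same check can run at application startup so that an old SDK fails with a readable message rather than a late model not found. The following is a minimal sketch, not code from the repo: the helper names (parse_version, sdk_is_new_enough, check_sdk) are hypothetical, and the parser only looks at the leading numeric part of each version component, so a pre-release such as 1.1.0rc1 is treated like 1.1.0.

```python
from importlib import metadata

MIN_SDK = (1, 1, 0)  # first release line whose catalog lists the Nemotron alias


def parse_version(text):
    """Turn '1.1.0' or '1.2.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def sdk_is_new_enough(installed, minimum=MIN_SDK):
    """Compare version tuples, padding the shorter one with zeros."""
    have = parse_version(installed)
    width = max(len(have), len(minimum))
    have += (0,) * (width - len(have))
    need = tuple(minimum) + (0,) * (width - len(minimum))
    return have >= need


def check_sdk(dist_name="foundry-local-sdk"):
    """Raise a readable error at startup instead of a late 'model not found'."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        raise RuntimeError(f"{dist_name} is not installed") from None
    if not sdk_is_new_enough(installed):
        raise RuntimeError(f"{dist_name} {installed} is too old; need >= 1.1.0")
    return installed
```

Calling check_sdk() before FoundryLocalManager.initialize(...) would be one reasonable place for it.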
Loading both models from one manager
The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client — get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client.
The wrapper used in the repo (src/foundry_client.py) does this:
from foundry_local_sdk import Configuration, FoundryLocalManager
FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron"))
manager = FoundryLocalManager.instance
chat_model = manager.load_model("qwen2.5-0.5b")
stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b")
chat_client = chat_model.get_chat_client()
audio_client = stt_model.get_audio_client()
Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB.
The transcription surprise
The first naive approach was the obvious one:
with open(wav_path, "rb") as f:
    result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b")
That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime.
The fix is to use the streaming session API instead — a different native entry point (core_interop.start_audio_stream) that the streaming model does support. The repo isolates this in src/_nemotron_live.py:
import threading
import wave

def transcribe_wav_live(audio_client, wav_path, *, language="en"):
    # Read the raw PCM frames plus the format details the session needs.
    with wave.open(str(wav_path), "rb") as w:
        sample_rate = w.getframerate()
        channels = w.getnchannels()
        sample_width = w.getsampwidth()
        pcm = w.readframes(w.getnframes())

    session = audio_client.create_live_transcription_session()
    session.settings.sample_rate = sample_rate
    session.settings.channels = channels
    session.settings.bits_per_sample = sample_width * 8
    session.settings.language = language
    session.start()

    # Feed PCM in ~100 ms chunks from a worker thread, then stop.
    bytes_per_sec = sample_rate * channels * sample_width
    chunk_bytes = max(bytes_per_sec // 10, 1024)

    def _pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            session.stop()  # without this the result generator never ends

    threading.Thread(target=_pusher, daemon=True).start()

    # Drain results on the calling thread while the worker keeps pushing.
    parts = []
    for resp in session.get_stream():
        for cp in getattr(resp, "content", []) or []:
            text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or ""
            if text:
                parts.append(text)
    return " ".join(p.strip() for p in parts if p.strip()).strip()
Three things to notice:
- Push from a worker thread, read from the calling thread. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel; otherwise you deadlock the session.
- Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency.
- Always call session.stop(). Without it the generator never terminates and the request hangs.
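The threading pattern is easier to see without the native runtime in the way. The sketch below uses a stand-in FakeSession built on a small bounded queue. It is an assumption for illustration only and not the Foundry Local API, but it blocks in the same two places: append() when the buffer is full, get_stream() when it is empty. The names FakeSession, chunk_size, and drain are hypothetical.

```python
import queue
import threading


class FakeSession:
    """Stand-in for a live transcription session (NOT the Foundry Local API).

    append() blocks when the small internal buffer is full, and get_stream()
    blocks until data arrives, which is the behaviour that causes a deadlock
    if both are driven from the same thread.
    """

    _DONE = object()

    def __init__(self, capacity=2):
        self._buf = queue.Queue(maxsize=capacity)

    def append(self, chunk):
        self._buf.put(chunk)          # blocks while the buffer is full

    def stop(self):
        self._buf.put(self._DONE)     # sentinel that ends the stream

    def get_stream(self):
        while True:
            item = self._buf.get()    # blocks until something is available
            if item is self._DONE:
                return
            yield len(item)           # pretend result: bytes consumed


def chunk_size(sample_rate, channels, sample_width, ms=100, floor=1024):
    """Bytes in `ms` milliseconds of PCM, never below `floor`."""
    bytes_per_sec = sample_rate * channels * sample_width
    return max(bytes_per_sec * ms // 1000, floor)


def drain(session, pcm, chunk_bytes):
    """Push from a worker thread; read results on the calling thread."""
    def pusher():
        try:
            for off in range(0, len(pcm), chunk_bytes):
                session.append(pcm[off:off + chunk_bytes])
        finally:
            session.stop()            # always terminate the generator

    worker = threading.Thread(target=pusher, daemon=True)
    worker.start()
    results = list(session.get_stream())
    worker.join()
    return results
```

With a buffer capacity of 2 and ten chunks to send, calling append() ten times on a single thread before reading would block on the third chunk. Moving the writes to a worker is what lets the reader make progress.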
The other transcription surprise: browsers don't send WAV
Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus. That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency — which is exactly the kind of friction this project exists to remove.
The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in roughly 40 lines of JavaScript:
// Inside src/static/index.html
async function webmToWav(blob) {
  // A 16 kHz context makes decodeAudioData resample to 16 kHz for us.
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
  const buf = await ctx.decodeAudioData(await blob.arrayBuffer());
  await ctx.close(); // release the context; browsers cap how many can be open
  // Mix to mono
  const ch = buf.numberOfChannels;
  const mono = new Float32Array(buf.length);
  for (let c = 0; c < ch; c++) {
    const data = buf.getChannelData(c);
    for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch;
  }
  return encodeWav(mono, 16000);
}

// Write an ASCII tag such as "RIFF" byte by byte at the given offset.
function writeStr(view, offset, str) {
  for (let i = 0; i < str.length; i++) view.setUint8(offset + i, str.charCodeAt(i));
}

function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  // RIFF header
  writeStr(view, 0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true);
  writeStr(view, 8, "WAVE");
  // fmt chunk
  writeStr(view, 12, "fmt ");
  view.setUint32(16, 16, true);             // PCM chunk size
  view.setUint16(20, 1, true);              // PCM format
  view.setUint16(22, 1, true);              // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  // data chunk
  writeStr(view, 36, "data");
  view.setUint32(40, samples.length * 2, true);
  // PCM-16 samples: clamp to [-1, 1], then scale to the int16 range
  let o = 44;
  for (let i = 0; i < samples.length; i++, o += 2) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return new Blob([view], { type: "audio/wav" });
}
Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side.
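Because the server trusts the browser to do the encoding, a cheap sanity check on the upload can catch a stale page or a non-browser client that sends something else. The sketch below uses only the standard-library wave module. The function name check_wav and its error messages are hypothetical, and nothing here is confirmed to exist in the repo.

```python
import io
import wave


def check_wav(data, *, rate=16000, channels=1, sample_width=2):
    """Return the clip duration in seconds, or raise ValueError."""
    try:
        with wave.open(io.BytesIO(data), "rb") as w:
            got = (w.getframerate(), w.getnchannels(), w.getsampwidth())
            frames = w.getnframes()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a readable PCM WAV: {exc}") from exc
    if got != (rate, channels, sample_width):
        raise ValueError(f"expected {(rate, channels, sample_width)}, got {got}")
    if frames == 0:
        raise ValueError("WAV contains no audio frames")
    return frames / rate
```

An endpoint could call this on the uploaded bytes and turn a ValueError into an HTTP 400 before any model work starts.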
Wiring it into FastAPI
The server (src/app.py) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request:
import os
import tempfile

@app.post("/api/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    data = await audio.read()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        text = _ai_client.transcribe(path)
    finally:
        os.unlink(path)  # never leave raw audio on disk
    return {"text": text}

@app.post("/api/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            _sse(_ai_client.stream_completion(req.messages)),
            media_type="text/event-stream",
        )
    return {"text": _ai_client.chat_completion(req.messages)}
Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost.
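The _sse helper is not shown in this post, so the following is only a sketch of what such a wrapper might look like, paired with the parsing a client would do on the other end. The JSON payload shape ({"token": ...}) and the [DONE] end marker are assumptions, not the repo's confirmed wire format.

```python
import json


def sse_events(tokens):
    """Wrap each token in an SSE 'data:' frame, then send a final marker."""
    for tok in tokens:
        # JSON-encode so newlines inside a token cannot break the framing.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield "data: [DONE]\n\n"


def parse_sse(stream_text):
    """Inverse of sse_events, as a client might do on the fetch() side."""
    out = []
    for frame in stream_text.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        out.append(json.loads(payload)["token"])
    return out
```

JSON-encoding each token matters because SSE frames are delimited by blank lines: a raw newline inside a model token would otherwise end the frame early.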
What it looks like
The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone.
Performance, honestly
This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation:
- First model load (cold cache): tens of seconds; downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b.
- Subsequent loads (warm cache): a few seconds per model.
- End-to-end transcription of a 5-second utterance: well under a second after warm-up.
- First chat token from qwen2.5-0.5b: typically 200–500 ms; full short reply within 1–2 s.
On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model.
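To reproduce the first-token and full-reply figures on your own hardware, it is enough to time the token iterator as it is consumed. The helper below is a minimal sketch with a hypothetical name (time_stream). It works on any iterable of strings, so it makes no assumptions about the chat client's API.

```python
import time


def time_stream(tokens, clock=time.perf_counter):
    """Consume a token iterator; return (text, first_token_s, total_s)."""
    start = clock()
    first = None
    parts = []
    for tok in tokens:
        if first is None:
            first = clock() - start   # latency until the first token arrives
        parts.append(tok)
    total = clock() - start
    return "".join(parts), first, total
```

Pass it whatever generator your streaming call returns. first_token_s is None if the stream produced nothing.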
Trade-offs to know about
- Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning; the wrapper accepts any chat alias in the Foundry Local catalog.
- STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b. Multilingual variants are likely to follow.
- Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real.
- No agent framework yet. This sample is deliberately a single-turn loop over a chat client: there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour.
Responsible AI considerations
Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility:
- Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason.
- Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device.
- Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters.
Try it
- Clone github.com/leestott/fl-nemotron.
- Run ./setup.ps1 (or ./setup.sh) to create a virtualenv and install the pinned SDK.
- Run python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models.
- Start the server with .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 (Windows path shown).
- Open http://127.0.0.1:8000 in a real browser and click the 🎤 button.
Where to go next
- Foundry Local documentation — official docs for the runtime, catalog, and SDK.
- microsoft/Foundry-Local — upstream samples and issue tracker.
- NVIDIA Nemotron model family — background on the speech and language models being published into the catalog.
- leestott/fl-nemotron — the full source for this post.
Key takeaways
- Pin foundry-local-sdk >= 1.1.0. Earlier SDKs cannot see the Nemotron Speech Streaming model.
- Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe().
- Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS.
- Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks.
- A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process — no cloud, no keys, no data egress.

