AI Innovation is accelerating — and Foundry Labs is where you can stay up-to-date.
The pace of AI innovation isn't just fast — it's fundamentally different from anything we've seen before. New architectures, new modalities, new benchmarks being broken week after week. For developers, keeping up isn't a nice-to-have; it's a competitive necessity.
But staying at the cutting edge is hard when the cutting edge keeps moving.
That's exactly why we created Microsoft Foundry Labs. It's the place where Microsoft's earliest AI experiments and research prototypes become accessible to builders: a sandbox where you can explore, evaluate, and experiment with what's next, first-hand.
Today, we're sharing a roundup of recent additions to Foundry Labs — from speech and vision to multimodal AI that's redefining what's possible at the edge.
MAI-Transcribe-1, MAI-Voice-1 & MAI-Image-2: Microsoft's First-Party AI Stack, Now in Foundry
Recently, we released three models from Microsoft AI (MAI) that are available to builders exclusively in Foundry, now in public preview:
- MAI-Transcribe-1 is Microsoft's first-generation speech recognition model, delivering enterprise-grade accuracy across 25 languages at approximately 50% lower GPU cost than leading alternatives. It achieves an industry-leading 3.9% average Word Error Rate on the FLEURS benchmark — outperforming GPT-Transcribe, Gemini 3.1 Flash, and Whisper-large-v3 — while running at 2.5x the batch transcription speed of Microsoft's existing Azure Fast offering.
- MAI-Voice-1 is a high-fidelity speech generation model capable of producing 60 seconds of expressive, natural-sounding audio in under one second on a single GPU. It preserves speaker identity and emotional nuance across long-form content — and now supports custom voice creation from just a few seconds of audio.
- MAI-Image-2 is Microsoft's highest-capability text-to-image model, debuting at #3 on the Arena.ai leaderboard for image model families. It delivers at least 2x faster image generation on Foundry and Copilot compared to its predecessor, with improvements in natural lighting, skin tone accuracy, and in-image text clarity. Enterprise partners like WPP are already building with it at scale.
Together, these models give developers a complete end-to-end audio and visual AI stack — all under one platform, with the reliability and pricing transparency that enterprises need.
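As a rough sketch of what using the stack could look like once a model is deployed to your Foundry project, here's a hypothetical call to MAI-Transcribe-1 through an OpenAI-compatible endpoint. The endpoint URL, route, and deployment name below are placeholders, not confirmed API details; check your Foundry deployment for the real values.

```python
# Hypothetical sketch: transcribing audio with a Foundry deployment of MAI-Transcribe-1.
# Assumes an OpenAI-compatible endpoint and the deployment name "mai-transcribe-1";
# the base URL, route, and name are placeholders, not documented values.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-foundry-resource>/openai/v1",  # placeholder endpoint
    api_key="<your-api-key>",
)

with open("meeting.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="mai-transcribe-1",  # assumed deployment name
        file=audio_file,
    )

print(result.text)
```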
Harrier-oss-v1: State-of-the-Art Multilingual Text Embeddings
Search, retrieval, and semantic understanding are at the core of virtually every AI-powered application — and the quality of your text embeddings determines how well those experiences work across languages and domains. That's why we're excited to introduce harrier-oss-v1, a new family of open-source multilingual text embedding models, on Microsoft Foundry.
Harrier uses a decoder-only architecture with last-token pooling and L2 normalization to produce dense text embeddings — a design that enables it to excel across a wide range of downstream tasks including retrieval, clustering, semantic similarity, classification, bitext mining, and reranking.
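As a minimal sketch of that design (not Harrier's official usage code), last-token pooling with L2 normalization typically looks like the following with Hugging Face Transformers; the checkpoint name below is a placeholder.

```python
# Minimal sketch of last-token pooling + L2 normalization for a decoder-only
# embedding model. The model ID is a placeholder, not a confirmed checkpoint name.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "microsoft/harrier-oss-v1-0.6b"  # assumed; check the Foundry Model Catalog
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # decoder tokenizers often lack a pad token

texts = ["What is the capital of France?", "Paris is the capital of France."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state            # (batch, seq_len, dim)

# Last-token pooling: take the hidden state of the final non-padding token
# (assumes right padding).
last_index = batch["attention_mask"].sum(dim=1) - 1
embeddings = hidden[torch.arange(hidden.size(0)), last_index]

# L2 normalization so cosine similarity reduces to a dot product.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings @ embeddings.T)                          # pairwise similarities
```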
The family comes in three sizes to fit different latency and accuracy requirements:
| Model | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|---|---|---|---|---|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 |
| harrier-oss-v1-0.6b | 0.6B | 1,024 | 32,768 | 69.0 |
| harrier-oss-v1-27b | 27B | 5,376 | 32,768 | 74.3 |
All three variants achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of their release date. The 270M and 0.6B variants are further enhanced through knowledge distillation from the larger 27B model — meaning you get competitive performance even at smaller scale.
With support for 94 languages — including Arabic, Chinese, Japanese, Korean, Hindi, Indonesian, and dozens of European languages — Harrier is purpose-built for global applications. And because it's instruction-tuned, you can customize embedding behavior for different scenarios simply by prepending a one-sentence natural language instruction to your query — no fine-tuning required.
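For illustration, an instruction-prefixed query might look like the sketch below. The instruction wording and template are assumptions, so check the model card for the recommended format; the embedding step itself follows the pooling sketch above.

```python
# Illustrative only: the instruction wording and any required template are assumptions.
def build_query(instruction: str, query: str) -> str:
    """Prepend a one-sentence natural-language instruction to steer the embedding."""
    return f"{instruction} {query}"

retrieval_query = build_query(
    "Given a customer support question, retrieve passages that answer it.",
    "How do I reset my router?",
)
clustering_query = build_query(
    "Identify the topic of the following news headline.",
    "Central bank raises interest rates for the third time this year.",
)
# Each string is then embedded as usual; documents are typically embedded
# without an instruction prefix so the index can be reused across tasks.
```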
Whether you're building multilingual RAG pipelines, cross-lingual document search, or semantic similarity features, Harrier gives you a production-ready embedding model that scales from edge to enterprise.
- Learn more about harrier-oss-v1 in Foundry Labs
- Deploy harrier-oss-v1 from the Foundry Model Catalog
Phi-4-Reasoning-Vision-15B: Small Model, Big Reasoning
Vision models have historically been great at perception — identifying objects, reading text, describing scenes. But perception alone isn't enough for the next generation of agentic applications. What developers need is a model that can reason over what it sees.
That's exactly what Phi-4-Reasoning-Vision-15B delivers.
This new addition to the Phi-4 family combines high-resolution visual perception with selective, task-aware reasoning — giving developers the ability to toggle reasoning on or off at runtime, balancing latency and accuracy based on their use case.
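As a hypothetical sketch of what a call could look like once the model is deployed to your Foundry project, here's a chart-reasoning request through an OpenAI-compatible chat endpoint. The endpoint, deployment name, and the way reasoning is toggled (shown here as a system-prompt hint) are assumptions rather than documented API.

```python
# Hypothetical sketch: querying a Foundry deployment of Phi-4-Reasoning-Vision-15B
# via an OpenAI-compatible chat endpoint. Endpoint, deployment name, and the
# reasoning toggle (a system hint here) are assumptions, not documented behavior.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-foundry-resource>/openai/v1",  # placeholder endpoint
    api_key="<your-api-key>",
)

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi-4-reasoning-vision-15b",  # assumed deployment name
    messages=[
        {"role": "system", "content": "Reason step by step before answering."},  # assumed toggle
        {"role": "user", "content": [
            {"type": "text", "text": "What was the largest quarter-over-quarter change shown in this chart?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]},
    ],
)
print(response.choices[0].message.content)
```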
Key use cases include:
- Diagram-based math and document understanding — parse charts, tables, and visual problem sets with structured inference
- GUI interpretation and grounding — ideal for computer-use agent (CUA) scenarios where the model needs to interpret screens and drive actions
- Scientific and analytical reasoning — process complex visual inputs and produce multi-step, grounded conclusions
- Education — build tutoring apps where students upload worksheets or diagrams and receive guided, step-by-step explanations
Despite being a compact 15B-parameter model, Phi-4-Reasoning-Vision-15B holds its own against significantly larger models — achieving 88.2% on ScreenSpot_v2 and 83.3% on ChartQA in internal benchmarks.
It's the right model when you need vision reasoning that's fast, efficient, and production-ready.
- Learn more about Phi-4-Reasoning-Vision-15B on Foundry Labs
- Deploy Phi-4-Reasoning-Vision-15B on Microsoft Foundry
VibeVoice ASR: Longform, Structured Speech Recognition at Scale
Real-world audio is messy. Hour-long meetings, multi-speaker conversations, domain-specific jargon, and seamless code-switching between languages — these are the scenarios where most speech recognition systems fall apart. VibeVoice ASR was built specifically to solve that.
Developed by Microsoft Research, VibeVoice ASR is a unified speech-to-text model that transcribes up to 60 minutes of continuous audio in a single pass — no manual chunking, no stitching, no context loss.
What makes it different is the richness of its output. Rather than returning a wall of text, VibeVoice ASR jointly performs:
- Transcription — what was said
- Speaker diarization — who said it
- Timestamping — when they said it
All in one unified inference pass, without requiring any post-processing pipeline.
Additional capabilities include:
- Customized hotwords — inject domain-specific vocabulary, names, or technical terms to improve accuracy in specialized contexts
- 50+ language support — with native code-switching, no explicit language configuration required
VibeVoice ASR is also fully integrated with the Hugging Face Transformers ecosystem and discoverable in the Foundry Model Catalog, making it easy to evaluate and deploy using familiar tooling.
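As a rough sketch of what that looks like with Transformers, the snippet below loads the model through the standard speech-recognition pipeline; the checkpoint name and the exact output schema (speaker labels, timestamp format) are assumptions, so consult the model card for specifics.

```python
# Rough sketch of running VibeVoice ASR with Hugging Face Transformers.
# The checkpoint name and output fields below are assumptions; check the model card.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="microsoft/VibeVoice-ASR",  # assumed checkpoint name
)

# A single pass over a long recording; no manual chunking or stitching.
result = asr("hour_long_meeting.wav", return_timestamps=True)

print(result["text"])                          # full transcript
for chunk in result.get("chunks", []):
    print(chunk["timestamp"], chunk["text"])   # per-segment timing
```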
GigaTIME: Population-Scale Tumor Immune Microenvironment Modeling
Understanding how tumors interact with the immune system is one of the most complex — and consequential — challenges in precision oncology. Multiplex immunofluorescence (mIF) imaging can illuminate that relationship at the cellular level, but at thousands of dollars per sample, it's rarely feasible at scale.
GigaTIME changes that.
Developed by Microsoft Research in collaboration with Providence and the University of Washington, GigaTIME is a multimodal AI model that translates routine, low-cost hematoxylin and eosin (H&E) pathology slides — already a standard part of cancer care at just $5–$10 per sample — into high-resolution virtual multiplex immunofluorescence (mIF) images across 21 protein channels.
Trained on 40 million cells with paired H&E and mIF data, GigaTIME was applied to 14,256 cancer patients across 51 hospitals, generating a virtual population of ~300,000 mIF images spanning 24 cancer types and 306 cancer subtypes. The result: 1,234 statistically significant associations between tumor immune cell states and clinical attributes like biomarkers, staging, and survival — independently validated on 10,200 patients from The Cancer Genome Atlas (TCGA).
This was the first population-scale study of the tumor immune microenvironment based on spatial proteomics — a class of study previously out of reach due to mIF data scarcity.
GigaTIME is now publicly available on Foundry Labs and Hugging Face, open for researchers and developers to explore and build on.
What's Next
Foundry Labs is where Microsoft's most ambitious AI research becomes accessible to builders. Whether you're building voice agents, multimodal pipelines, or intelligent document processors — the tools are here, and they're only getting better.
Stay tuned — there's more coming soon:
- Explore more AI innovations on Foundry Labs
- Join the Microsoft Foundry Discord community to shape the future of AI together

