Continuing Microsoft AI momentum in Foundry
Since launching MAI-Image-2-Efficient, MAI-Image-2, MAI-Voice-1, and MAI-Transcribe-1 in Microsoft Foundry this spring, we've been laser-focused on one thing: giving developers the most complete first-party AI stack to build with.
Today, at Microsoft Build 2026, we're taking the next step. We're announcing the availability of new models from Microsoft AI (MAI) in Microsoft Foundry across 4 modalities:
- Text/Reasoning: MAI-Thinking-1 is our first large language model, designed to deliver strong reasoning, math, and general intelligence at a fraction of the cost of other models.
- Image: MAI-Image-2.5 is an updated image generation model that adds image-to-image editing and a suite of "control with preservation" capabilities, once again debuting at No. 3 on Arena.ai for image generation model families. MAI-Image-2.5 Flash, a faster and more efficient option, is also available in Foundry.
- Voice: MAI-Voice-2 is an updated multilingual text-to-speech model that brings voice cloning and voice prompting to more than 15 languages. MAI-Voice-2 Flash, a faster and more efficient option, is coming soon.
- Speech: MAI-Transcribe-1.5 is an updated speech-to-text model that supports 43 total languages and adds content biasing and improved accuracy, retaining its #1 spot on the FLEURS benchmark [1].
These are the same models already powering experiences across Copilot, Bing, PowerPoint, and Azure Speech, and now they're available in Foundry for developers to build with.
Read on for a deeper look at each model and how to start building.
MAI-Thinking-1: Medium-size model that stands among the strongest in its weight class
MAI-Thinking-1 is MAI’s first large language model, and it's purpose-built for the workloads enterprises run at scale. In building MAI-Thinking-1, we listened to customer feedback on leading models and made a clear bet: deliver strong reasoning, math, and general intelligence at a price-performance point that makes high-volume, always-on AI workloads economically viable.
MAI-Thinking-1 uses a Mixture-of-Experts (MoE) architecture that selectively activates only the parts of the model needed for each request. The result: capability can scale without compute per request scaling linearly with it. MAI-Thinking-1 is well-suited for enterprise use cases that often require deep context: analyzing long documents, complex multi-step reasoning, and processing extended agent traces without chunking and stitching.
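To make the idea of selective activation concrete, here is a minimal, generic sketch of top-k expert routing in plain Python. It is not MAI-Thinking-1's actual router: the number of experts, the value of k, the gate scores, and the toy "experts" are all hypothetical, chosen only to show that most experts are skipped for a given input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and blend their outputs."""
    routed = route_top_k(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    active = [i for i, _ in routed]
    return output, active


# Four toy "experts" (hypothetical): each one just scales its input.
experts = [lambda x, scale=scale: scale * x for scale in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical gate scores; a real router computes these from the input.
gate_scores = [0.1, 2.0, 0.3, 1.0]

output, active = moe_forward(1.0, experts, gate_scores, k=2)
print(active)  # experts 1 and 3 run; experts 0 and 2 are never called
```

In this toy setup only two of the four experts execute per input, which is the sense in which total capacity can grow faster than per-request compute.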
MAI-Thinking-1 matches Claude Opus 4.6 on SWE-Bench Pro at substantially lower cost, while initial testing shows parity in preference with models such as Sonnet 4.6. We trained it from the ground up on clean data, without distillation from third-party models.
MAI-Image-2.5: Control with preservation for enterprise creative workflows
We’re also introducing the MAI-Image-2.5 family of models. This includes MAI-Image-2.5 for maximum fidelity and MAI-Image-2.5 Flash for fast, scalable production workloads. MAI-Image-2.5 debuted at No. 3 on Arena.ai and makes meaningful gains in text rendering, stylized illustration, and commercial imagery. Additionally, we're adding the editing surface enterprise creative teams have been asking for and optimizing for the way creative work actually gets done.
MAI-Image-2.5 introduces image-to-image editing with a suite of other capabilities that add control while preserving identity and brand:
- Identity & character consistency: Preserves recognizable faces (plus hair, clothing, full-body identity) across stylization, pose, and layout changes — built for branded characters, spokespeople, and social campaigns.
- Style & scene control: Applies full-frame restyling (anime, color grading, film grain, de-aging) and restructures shots by adding, removing, or repositioning objects and adjusting human pose and interactions.
- Text, graphics & layout control: Generates typography, logos, and responsive text edits from natural cues ("make the text more rounded"), and produces PPT-ready infographics and slides with coherent hierarchy, alignment, and template adherence — including targeted edits like "convert to a 3-step flow."
These new features come with efficiency gains that we are passing directly to customers. Together, they deliver the best Elo-based price-to-performance in the market, giving customers the flexibility to optimize production image workflows for quality, speed, or cost.
MAI-Voice-2 and MAI-Transcribe-1.5: A more accurate, multilingual audio stack
Voice and speech continue to be the primary interface for the next generation of AI agents, and with MAI-Voice-2 and MAI-Transcribe-1.5, we're closing some of the biggest gaps that have kept general models out of enterprise voice workflows.
MAI-Voice-2: One voice, many languages
MAI-Voice-2 adds two headline capabilities, identity preservation and voice prompting, and expands to 15+ languages in a single unified system:
- Identity preservation recreates the unique vocal identity of a specific person, so the model can "speak as" that individual across markets – useful for consistent branded voices, localized spokesperson and celebrity campaigns, personalized digital assistants, and accessibility solutions.
- Voice prompting takes a short audio sample as a reference for tone, emotion, accent, pacing, and speaking style and lets developers control delivery without managing a separate voice library.
Both capabilities now operate across all supported languages, so a single cloned voice or reference style carries naturally across markets without separate systems per language.
MAI-Transcribe-1.5: Faster and more accurate transcription
MAI-Transcribe-1.5 doubles down on the best-in-class speed and cost of MAI-Transcribe-1: it is now up to 5x more efficient than Gemini 3.1 Flash, ScribeV2, and gpt-4o-transcribe on the Artificial Analysis leaderboard. It also adds two highly requested capabilities:
- Entity biasing primes the model with domain context (names, brand terms, industry vocabulary) so it transcribes specialized words correctly instead of guessing the closest common spelling. This was a heavily requested feature from our customers, and it addresses a long-standing failure mode for general speech models in sports, business, medical, and technical workflows.
- Improved accuracy holds up in the conditions that enterprises operate in every day (cross-talk, background noise, and long-form meetings), where general models tend to drift. On FLEURS (the standard multilingual benchmark across 25 languages), Word Error Rate (WER) improved from 3.9% to 3.7%, maintaining our position as the most accurate model on this benchmark [1].
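For readers less familiar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal, generic implementation, not the FLEURS scoring pipeline: it assumes whitespace tokenization and applies no text normalization, and the sample sentences are made up. It also works out the relative reduction implied by the 3.9% to 3.7% figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which the error rate dropped relative to the old rate."""
    return (old_rate - new_rate) / old_rate


# Made-up example: one word dropped out of six reference words.
sample_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Figures quoted above: 3.9% -> 3.7% WER on FLEURS.
fleurs_gain = relative_reduction(3.9, 3.7)
print(f"{sample_wer:.3f} {fleurs_gain:.3f}")
```

By this arithmetic, moving from 3.9% to 3.7% is a 0.2-point absolute drop, or roughly a 5% relative reduction in word errors.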
Try them today
Try the models today in Microsoft Foundry:
- MAI-Thinking-1: In private preview; request access here.
- MAI-Image-2.5: Available directly in the Foundry Model Catalog. Pricing starts at $5 USD per 1M tokens for text input, $8 USD per 1M tokens for image input, and $47 USD per 1M tokens for image output.
- MAI-Image-2.5 Flash: Available directly in the Foundry Model Catalog. Pricing starts at $1.75 USD per 1M tokens for text and image input and $33 USD per 1M tokens for image output.
- MAI-Voice-2: Available through Azure Speech. Pricing starts at $22 USD per 1M characters.
- MAI-Transcribe-1.5: Available through Azure Speech. Pricing starts at $0.36 USD per hour.
- Experiment in MAI Playground: Try MAI models at the MAI Playground.
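As a rough budgeting aid, the list prices above can be turned into a small estimator. This is a sketch under stated assumptions: the prices are the "starts at" figures quoted in this post, so actual bills may differ; the function names are hypothetical; and the post does not say how many tokens an image consumes, so token counts are left as caller-supplied inputs.

```python
# "Starts at" list prices quoted in this post (USD).
IMAGE_USD_PER_MILLION_TOKENS = {
    "MAI-Image-2.5": {"text_in": 5.00, "image_in": 8.00, "image_out": 47.00},
    "MAI-Image-2.5 Flash": {"text_in": 1.75, "image_in": 1.75, "image_out": 33.00},
}
VOICE_2_USD_PER_MILLION_CHARS = 22.00
TRANSCRIBE_1_5_USD_PER_HOUR = 0.36


def image_cost(model, text_in_tokens=0, image_in_tokens=0, image_out_tokens=0):
    """Estimated USD cost of an image workload, given token counts."""
    price = IMAGE_USD_PER_MILLION_TOKENS[model]
    return (
        text_in_tokens * price["text_in"]
        + image_in_tokens * price["image_in"]
        + image_out_tokens * price["image_out"]
    ) / 1_000_000


def voice_cost(characters):
    """Estimated USD cost of synthesizing this many characters with MAI-Voice-2."""
    return characters * VOICE_2_USD_PER_MILLION_CHARS / 1_000_000


def transcribe_cost(audio_seconds):
    """Estimated USD cost of transcribing this much audio with MAI-Transcribe-1.5."""
    return audio_seconds / 3600 * TRANSCRIBE_1_5_USD_PER_HOUR


# Hypothetical workloads, for illustration only.
flash_job = image_cost("MAI-Image-2.5 Flash",
                       text_in_tokens=200_000, image_out_tokens=1_000_000)
narration = voice_cost(500_000)           # half a million characters
meetings = transcribe_cost(100 * 3600)    # 100 hours of audio
print(f"${flash_job:.2f} ${narration:.2f} ${meetings:.2f}")
```

Keeping the prices in one table makes it easy to compare MAI-Image-2.5 against MAI-Image-2.5 Flash for the same token counts before committing to a production workflow.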
References
[1] 1st on overall WER on the FLEURS benchmark. Out of the top 25 global languages, MAI-Transcribe-1.5 ranks 1st by FLEURS in 11 core languages. It wins against Whisper-large-v3 on the remaining 14 and Gemini 3.1 Flash on 11 of those 14.
