That’s the promise behind the new pricing model for Azure AI Content Understanding. We’ve restructured how you pay for document, audio, and video analysis—moving from rigid field-based pricing to a flexible, token-based system that lets you pay only for what you use.
Whether you're extracting layout from documents or identifying actions in a video, the new pricing structure delivers up to 60% cost savings for many typical tasks and more control over your spend.
Why We’re Moving to Token-Based Pricing
Field-based pricing was easy to understand, but it didn’t reflect the real work being done. Some fields are simple. Others require deep reasoning, cross-referencing, and contextual understanding.
So we asked: What if pricing scaled with complexity?
Enter tokens. Tokens are the atomic units of language models—think of them like syllables. By pricing based on tokens, we can:
- Reflect actual compute usage
- Align with generative AI model pricing
- Offer more predictability to developers
What’s Included in the New Pricing Model?
The new pricing structure has three components: Content Extraction, Field Extraction, and Contextualization. Each component is essential for building high-quality content processing tasks.
Figure: Overall Content Understanding framework for multimodal file processing
1. Content Extraction 🧾
Content Extraction is the essential first step for transforming unstructured input—whether it’s a document, audio, video, or image—into a standardized, reusable format. This process alone delivers significant value, as it allows you to consistently access and utilize information from any source, no matter how varied or complex the original data might be. Content Extraction also serves as the foundation for the more advanced data processing of Field Extraction.
We’re lowering the price significantly for both Document Content Extraction and the Face Grouping & Identification add-on for video.
Pricing Breakdown:

| Modality | Feature | Unit | New Price | % Change |
| --- | --- | --- | --- | --- |
| Document | Content Extraction (now includes Layout and Formula) | per 1,000 pages | $5.00 | 61% lower |
| Audio | Content Extraction | per hour | $0.36 | No change |
| Video | Content Extraction | per hour | $1.00 | No change |
| Video | Add-on: Video Face Grouping & Identification | per hour | $2.00 | 40% lower |
| Image | Content Extraction | N/A | N/A | N/A |

2. Field Extraction 🧠
Field Extraction is where your custom schema comes to life. Using generative models like GPT-4o and o3-mini, we extract the specific fields you define—whether it’s invoice totals, contract terms, or customer sentiment. With this update, we’ve aligned pricing directly to token usage, matching the regional rates of GPT-4o for Standard and o3-mini for Pro. You can now choose the mode depending on your use case. These tokens will be charged based on the actual content processed by the generative models for field extraction using the standard Azure OpenAI tokenizer.
Pro and Standard modes provide two distinct ways to process content as part of the 2025-05-01-preview version of the APIs. Standard mode efficiently extracts structured fields from individual files using your defined schema, while Pro mode is tailored for advanced scenarios involving multi-step reasoning and can process multiple files with reference data. For a more detailed comparison of the two modes, see "Azure AI Content Understanding standard and pro modes" on Microsoft Learn. Initially, Pro mode supports only documents, but this will expand over time.
Pricing Breakdown:

| Mode | Token Type | Unit | Price |
| --- | --- | --- | --- |
| Standard | Input Tokens | per 1M tokens | $2.75 |
| Standard | Output Tokens | per 1M tokens | $11.00 |
| Pro | Input Tokens | per 1M tokens | $1.21 |
| Pro | Output Tokens | per 1M tokens | $4.84 |

Note that although the price per 1M tokens is lower for Pro mode, it typically consumes substantially more tokens than Standard mode.
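To make the per-token arithmetic concrete, here is a minimal Python sketch of the field extraction charge. The prices come from the table above; the function name and the token counts in the usage lines are hypothetical, chosen only to illustrate how a Pro job that consumes more tokens can cost more despite its lower rate.

```python
# USD per 1M tokens, copied from the Field Extraction pricing table.
FIELD_EXTRACTION_PRICES = {
    "standard": {"input": 2.75, "output": 11.00},
    "pro": {"input": 1.21, "output": 4.84},
}


def field_extraction_cost(mode: str, input_tokens: int, output_tokens: int) -> float:
    """Return the field extraction charge in USD for one job."""
    prices = FIELD_EXTRACTION_PRICES[mode]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


# Hypothetical token counts: the Pro job is assumed to use 4x the tokens.
standard = field_extraction_cost("standard", 100_000, 5_000)
pro = field_extraction_cost("pro", 400_000, 20_000)
print(f"Standard: ${standard:.2f}, Pro: ${pro:.2f}")
```

With these assumed counts the Standard job comes to $0.33 and the Pro job to $0.58, so the per-token rate alone does not tell you which mode is cheaper for a given workload.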
3. Contextualization 🔍
Accurate field extraction depends on context, which is why we've introduced a separate charge for Contextualization. It covers processes such as output normalization, adding source references, and calculating confidence scores to enhance accuracy and consistency. It also enables in-context learning, which allows you to continually refine analyzers with feedback. It’s an investment in quality with real value: data like confidence scores can enable more straight-through processing, reducing cost and improving quality. These features are now priced transparently so you can see exactly where your value comes from. Contextualization tokens are always used as part of analyzers that run field extraction.
Pricing Breakdown:
Customers are charged Contextualization tokens based on the size of the files (documents, images, audio or video) that are processed. Tokens for Standard and Pro have different prices.

| Mode | Token Type | Unit | Price |
| --- | --- | --- | --- |
| Standard | Contextualization Tokens | per 1M tokens | $1.00 |
| Pro | Contextualization Tokens | per 1M tokens | $1.50 |

Unlike the Field Extraction tokens, which are calculated using the Azure OpenAI tokenizers, Contextualization tokens are consumed at a fixed rate based on the size of the input file.
For example, a 1-page document processed in Standard mode consumes 1,000 contextualization tokens, as shown in the table below. At $1.00 per 1M tokens, contextualization for that document costs $0.001.

| Units | Contextualization Tokens | Effective Standard Price per Unit |
| --- | --- | --- |
| 1 page[1] | 1,000 contextualization tokens | $1 per 1,000 pages |
| 1 image | 1,000 contextualization tokens | $1 per 1,000 images |
| 1 hour audio | 100,000 contextualization tokens | $0.10 per hour |
| 1 hour video | 1,000,000 contextualization tokens | $1 per hour |

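The fixed-rate lookup above can be sketched in a few lines of Python. The token rates and prices are taken from the two Contextualization tables; the function and dictionary names are hypothetical. The sketch also assumes that Pro mode consumes tokens at the same per-unit rate as Standard and that partial hours scale linearly, neither of which is stated explicitly here.

```python
# Contextualization tokens consumed per unit of input (from the table above).
TOKENS_PER_UNIT = {
    "page": 1_000,
    "image": 1_000,
    "audio_hour": 100_000,
    "video_hour": 1_000_000,
}

# USD per 1M contextualization tokens.
PRICE_PER_MILLION = {"standard": 1.00, "pro": 1.50}


def contextualization_cost(unit: str, quantity: float, mode: str = "standard"):
    """Return (tokens, cost in USD) for `quantity` units of one modality.

    Assumption: the per-unit token rate is the same in both modes and
    fractional quantities (e.g. half an hour) scale linearly.
    """
    tokens = TOKENS_PER_UNIT[unit] * quantity
    return tokens, tokens * PRICE_PER_MILLION[mode] / 1_000_000


tokens, cost = contextualization_cost("page", 1)
print(f"1 page: {tokens} tokens, ${cost:.3f}")
```

Under these assumptions, one hour of audio in Standard mode works out to 100,000 tokens and $0.10, matching the effective price in the table.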
📊 Pricing examples
Let’s walk through three detailed examples of how the new pricing structure works out in practice.
📄 Example 1: Document Content Extraction Only (1,000 Pages)
Scenario: You want to extract layout and formulas from a 1,000-page document—no field extraction, just the raw content.
- Old Pricing (Preview.1):
  - Document content extraction: $5.00
  - Layout add-on: $5.00
  - Formula add-on: $3.00
  - Total: $13.00 per 1,000 pages
- New Pricing (Preview.2):
  - Document content extraction (now includes layout + formula): $5.00 per 1,000 pages

Note that pricing is prorated when you process a fraction of 1,000 pages.
✅ Savings: ~61.5% reduction in cost ($13.00 down to $5.00) for the same functionality.
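As a rough illustration of the proration note and the savings figure in this example, the following Python sketch uses the prices listed above. The function name and the 250-page job are hypothetical, and linear proration is an assumption about how partial batches are billed.

```python
OLD_PRICE_PER_1K_PAGES = 5.00 + 5.00 + 3.00  # Preview.1: extraction + layout + formula
NEW_PRICE_PER_1K_PAGES = 5.00                # Preview.2: layout and formula included


def content_extraction_cost(pages: int, price_per_1k: float = NEW_PRICE_PER_1K_PAGES) -> float:
    """Prorate the per-1,000-page price linearly (assumed billing behavior)."""
    return pages / 1_000 * price_per_1k


savings = 1 - NEW_PRICE_PER_1K_PAGES / OLD_PRICE_PER_1K_PAGES
print(f"250 pages: ${content_extraction_cost(250):.2f}")
print(f"Savings vs. Preview.1: {savings:.1%}")
```

Under that assumption a 250-page job costs $1.25, and the reduction from $13.00 to $5.00 is 61.5%.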
🧠 Example 2: Document Field Extraction (1,000 Pages)
Scenario: You want to extract structured fields from a 1,000-page document using Standard Mode.
- Assumptions:
  - Each page generates ~2,600 input tokens:
    - ~2,000 tokens for content extraction output
    - ~300 tokens for field schema output
    - ~300 tokens for metaprompts
  - Total input tokens = 2.6M (1,000 pages × 2,600 tokens)
  - Output tokens = 90K (assuming ~20 fields per page with short responses)
  - Contextualization = 1M tokens (1,000 tokens per page)
- Step-by-Step Pricing:
  - Content Extraction: $5.00
  - Field Extraction:
    - Input Tokens: 2.6M × $2.75/M = $7.15
    - Output Tokens: 90K × $11/M = $0.99
  - Contextualization: 1M × $1.00/M = $1.00
- Total (Preview.2):
  - $5.00 (CE) + $7.15 (Input) + $0.99 (Output) + $1.00 (Contextualization) = $14.14
- Old Pricing (Preview.1):
  - Flat rate: $30.00 per 1,000 pages

✅ Savings: Over 50% reduction in cost, with more transparent, usage-based billing.
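The arithmetic in this example can be reproduced with a short Python sketch. All prices and token counts come from the assumptions listed above; the function name and parameter names are hypothetical, and the defaults simply encode this scenario (Standard mode, 90 output tokens per page on average).

```python
def document_field_extraction_cost(
    pages: int,
    input_tokens_per_page: int = 2_000 + 300 + 300,  # extraction output + schema + metaprompts
    output_tokens_per_page: int = 90,                # 90K output tokens over 1,000 pages
    context_tokens_per_page: int = 1_000,
) -> dict:
    """Break down the Standard-mode cost in USD for a document job."""
    costs = {
        "content_extraction": pages / 1_000 * 5.00,
        "input": pages * input_tokens_per_page * 2.75 / 1_000_000,
        "output": pages * output_tokens_per_page * 11.00 / 1_000_000,
        "contextualization": pages * context_tokens_per_page * 1.00 / 1_000_000,
    }
    costs["total"] = sum(costs.values())
    return costs


costs = document_field_extraction_cost(1_000)
savings = 1 - costs["total"] / 30.00  # Preview.1 flat rate per 1,000 pages
print(f"Total: ${costs['total']:.2f}, savings: {savings:.1%}")
```

For 1,000 pages this gives a total of $14.14, a reduction of about 52.9% from the $30.00 flat rate. Changing the per-page token assumptions shows how sensitive the total is to schema size and output length.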
🎥 Example 3: Video Field Extraction (1 Hour)
Scenario: You want to extract structured data from 1 hour of video content at a segment level. Segments are short, 15-30 seconds on average, resulting in a substantial number of output segments.
- Assumptions:
  - Input tokens: 7,500 tokens per minute (based on sampled frames, transcription, schema prompts, and metaprompts), or 450K tokens per hour
  - Output tokens: 900 tokens per minute (assuming 10-20 short structured fields per segment with auto segmentation), or 54K tokens per hour
  - Contextualization: 1M tokens per hour of video
- Step-by-Step Pricing:
  - Content Extraction: $1.00
  - Field Extraction:
    - Input Tokens: 450K × $2.75/M = $1.24
    - Output Tokens: 54K × $11/M = $0.59
  - Contextualization: 1M × $1.00/M = $1.00
- Total (Preview.2):
  - $1.00 (CE) + $1.24 (Input) + $0.59 (Output) + $1.00 (Contextualization) = $3.83
- Old Pricing (Preview.1):
  - Flat rate: $10.00 per hour

✅ Savings: Over 60% reduction in cost, with more transparent, usage-based billing.
Note: Actual cost savings will vary based on the specifics of the input and output.
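The video example can be checked the same way. This Python sketch scales the per-minute token assumptions from the example to a given duration; the function name is hypothetical, and it assumes that content extraction and contextualization charges scale linearly with duration.

```python
def video_field_extraction_cost(
    minutes: float,
    input_tokens_per_min: int = 7_500,
    output_tokens_per_min: int = 900,
) -> dict:
    """Break down the Standard-mode cost in USD for a video of `minutes` length."""
    hours = minutes / 60
    costs = {
        "content_extraction": hours * 1.00,                      # $1.00 per hour
        "input": minutes * input_tokens_per_min * 2.75 / 1_000_000,
        "output": minutes * output_tokens_per_min * 11.00 / 1_000_000,
        "contextualization": hours * 1_000_000 * 1.00 / 1_000_000,  # 1M tokens per hour
    }
    costs["total"] = sum(costs.values())
    return costs


video = video_field_extraction_cost(60)
video_savings = 1 - video["total"] / 10.00  # Preview.1 flat rate per hour
print(f"Total: ${video['total']:.2f}, savings: {video_savings:.1%}")
```

For one hour this yields $1.24 for input tokens, $0.59 for output tokens, and a total of $3.83, about 61.7% below the $10.00 flat rate.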
📚 Learn More
--------------------------
What could you build now that pricing is not a blocker?
Let us know how you’re using Content Understanding—we’d love to feature your story in a future post.
--------------------------
[1] For documents without explicit pages (e.g., txt, html), every 3,000 UTF-16 characters are counted as one page.
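A minimal Python sketch of the page-equivalent rule in this footnote follows. It assumes "UTF-16 characters" means UTF-16 code units (so a character outside the Basic Multilingual Plane counts twice), and it returns a fractional page count because the rounding behavior for partial pages is not specified here. Both function names are hypothetical.

```python
CHARS_PER_PAGE = 3_000  # from the footnote above


def utf16_length(text: str) -> int:
    """Count UTF-16 code units (assumed meaning of 'UTF-16 characters')."""
    return len(text.encode("utf-16-le")) // 2


def equivalent_pages(text: str) -> float:
    """Fractional page count; how partial pages are rounded is not specified."""
    return utf16_length(text) / CHARS_PER_PAGE


print(equivalent_pages("a" * 4_500))
```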