Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

NVIDIA Blackwell Architecture Sweeps MLPerf Training v5.1 Benchmarks


The NVIDIA Blackwell architecture powered the fastest time to train across every MLPerf Training v5.1 benchmark, marking a clean sweep in the latest round of results. As developers experiment with new architectures, and models continue to grow in size, more training compute is essential. Meeting this need for delivered compute requires innovation across every layer of the AI stack—from chips and systems to software—advancing performance at an unprecedented pace. 

MLPerf Training v5.1 is the latest in the long-running series of industry benchmarks designed to measure AI training performance. This version measures the time to train seven models, representing a wide range of use cases, each to a specified target accuracy. The Blackwell architecture, which powers both NVIDIA Blackwell and NVIDIA Blackwell Ultra GPUs, delivered the highest performance on every benchmark at maximum scale and at each submitted scale.

Benchmark                    | Time to train | Maximum submission scale
Llama 3.1 405B pretraining   | 10 minutes    | 5,120 Blackwell GPUs
Llama 3.1 8B pretraining     | 5.2 minutes   | 512 Blackwell Ultra GPUs
Llama 2 70B LoRA fine-tuning | 0.40 minutes  | 512 Blackwell Ultra GPUs
FLUX.1                       | 12.5 minutes  | 1,152 Blackwell GPUs
DLRM-DCNv2                   | 0.71 minutes  | 64 Blackwell GPUs
R-GAT                        | 0.84 minutes  | 256 Blackwell GPUs
RetinaNet                    | 1.4 minutes   | 512 Blackwell GPUs
Table 1. The NVIDIA platform delivers the fastest time to train on every model currently tested in MLPerf Training

MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0082, 5.1-0002, 5.1-0004, 5.1-0060, 5.1-0070, 5.1-0072. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.    

The NVIDIA platform was also the only one to submit results on all benchmarks. In this post, we take a closer look at these results and the technology innovations that powered them. 

NVIDIA makes the industry’s first FP4 training submissions with NVFP4

Innovation in low-precision AI data formats is a key enabler of the performance gains delivered by the Blackwell architecture, which powers Blackwell and Blackwell Ultra GPUs. The architecture incorporates hardware acceleration for FP4 data formats, including the NVIDIA-designed NVFP4 format. Blackwell GPUs deliver peak FP4 throughput per clock that is twice their FP8 throughput. Blackwell Ultra GPUs build on that innovation, increasing peak FP4 throughput per clock to 3x that of FP8.

As shown in the paper, Pretraining Large Language Models with NVFP4, NVFP4 provides better accuracy for the same number of tokens used during training, or achieves the same accuracy using significantly fewer tokens, compared to the industry MXFP4 data format. This means faster time to train to a specified accuracy and faster time to deployment with lower training costs.

This round, NVIDIA adopted NVFP4 in every large language model (LLM) in MLPerf Training by incorporating many of the techniques recommended in the paper. NVIDIA submissions also carefully applied “healing”—a process by which higher precisions are used during certain parts of the training process—to improve accuracy. Specifically, NVIDIA submissions kept the last few training iterations in FP8 precision.  

These submissions required innovation at every layer of the technology stack: hardware acceleration of NVFP4 directly in Blackwell and Blackwell Ultra silicon; acceleration libraries, including NVIDIA cuBLAS, NVIDIA Transformer Engine, and NVIDIA Megatron-Core; and new numerical techniques.
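
To give a sense of what a block-scaled 4-bit training format involves, the following is a minimal NumPy sketch of NVFP4-style quantization. The 16-element block size, the E2M1 value grid, and the per-block scale are assumptions drawn from public descriptions of the format; this is an illustration, not NVIDIA's actual encoding or training recipe.

    import numpy as np

    # Magnitudes representable by a 4-bit E2M1 element (sign handled by the grid below).
    E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    FULL_GRID = np.concatenate([-E2M1_GRID[:0:-1], E2M1_GRID])
    BLOCK = 16  # assumed micro-block size

    def quantize_block_scaled_fp4(x):
        """Quantize a 1-D tensor with one scale per 16-element block, then
        dequantize so the rounding error can be inspected. Illustrative only."""
        blocks = x.reshape(-1, BLOCK)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
        scales = np.where(scales == 0.0, 1.0, scales)            # avoid divide-by-zero
        scaled = blocks / scales
        nearest = np.abs(scaled[..., None] - FULL_GRID).argmin(axis=-1)
        return (FULL_GRID[nearest] * scales).reshape(x.shape)    # dequantized values

    x = np.random.randn(4096).astype(np.float32)
    print(np.abs(quantize_block_scaled_fp4(x) - x).mean())       # mean rounding error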

Blackwell Ultra delivers a large leap for LLM training

NVIDIA submitted the first MLPerf Training results on Blackwell Ultra using an NVIDIA AI cluster codenamed “Theia,” after the Greek goddess of sight and vision. It features a total of 512 Blackwell Ultra GPUs, built from multiple NVIDIA GB300 NVL72 rack-scale systems connected using NVIDIA Quantum-X800 InfiniBand. 

Blackwell Ultra GPUs incorporate several important enhancements compared to Blackwell GPUs, including: 

  • 1.5x peak NVFP4 throughput. Blackwell Ultra GPUs feature updated Tensor Cores that increase peak FP4 throughput per clock by 1.5x compared to Blackwell GPUs. This helps accelerate math-bound GEMM operations. 
  • 2x Softmax for attention. Blackwell Ultra GPUs feature an upgraded special function unit (SFU), providing 2x accelerated throughput for key softmax operations, which can be critical for the attention layer. In MLPerf benchmarks, this results in up to 1.3x speedup in the attention block.
  • 1.5x larger HBM3e capacity. Blackwell Ultra GPUs incorporate higher-capacity HBM3e stacks, now 12-Hi compared to 8-Hi in Blackwell GPUs. On the Llama 2 70B LoRA benchmark, this allowed the entire model to fit in the memory of a single GPU, with no CPU offloading required, eliminating model-parallel communication overheads and improving GEMM efficiency.

Blackwell Ultra GPU innovations, adoption of NVFP4 format, and software optimizations delivered large increases in pretraining and LLM fine-tuning performance with the same number of GPUs compared to the most recent NVIDIA submissions using the Hopper architecture.

Figure 1. Relative Llama 3.1 405B pretraining and Llama 2 70B LoRA fine-tuning performance at 512-GPU and 8-GPU scales, respectively. Relative to the Hopper baseline (1x), the GB200 NVL72 submissions in v5.0 delivered roughly 2x and 3x, and the GB300 NVL72 submissions in v5.1 delivered more than 4x and roughly 5x, on Llama 3.1 405B pretraining and Llama 2 70B LoRA fine-tuning, respectively.

MLPerf Training v4.1, v5.0, and v5.1, closed division. Results from entries: 4.1-0050, 5.0-0076, 5.0-0067, 5.1-0058, 5.1-0060. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Additionally, the latest NVIDIA Quantum-X800 networking platform—composed of NVIDIA ConnectX-8 SuperNICs, NVIDIA Quantum-X800 InfiniBand switches, and NVIDIA LinkX cables—was used to connect the multiple GB300 NVL72 racks that form the Theia cluster. This marks the industry’s first and only 800 Gb/s networking submitted to MLPerf Training. 

NVIDIA Blackwell sets new Llama 3.1 405B training record

On Llama 3.1 405B, the largest and most challenging benchmark in MLPerf Training v5.1, NVIDIA set a new time-to-train record of 10 minutes, powered by 5,120 Blackwell GPUs. This is a 2.7x speedup compared to the fastest submission using Blackwell GPUs last round.*

Two major factors contributed to this large speedup. With the use of NVFP4 training recipes and general software enhancements, the submission using 2,560 Blackwell GPUs achieved a score of 18.79 minutes. This is 3x faster than the previous NVIDIA submissions with the same number of NVIDIA Hopper architecture GPUs.* Effective performance per Blackwell GPU also increased by 42% when comparing the 2,496 Blackwell GPU submission last round to the 2,560 Blackwell GPU submission this round.*

* MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0067, 5.0-0002, 5.0-0003, 5.0-0004, 5.1-0003, 5.1-0004, 5.1-0071. Performance-per-GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted.  The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.  

Figure 2. Performance scaling with the number of Blackwell GPUs submitted in MLPerf Training v5.0 and v5.1. At 2,560 GPUs, per-GPU performance in v5.1 is 1.4x that of the 2,496-GPU submission in v5.0; at 5,120 GPUs, performance at maximum scale is 2.7x higher.

MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0001, 5.0-0002, 5.0-0003, 5.0-0004, 5.0-0005, 5.0-0013, 5.0-0014, 5.1-0003, 5.1-0004,  5.1-0071. Performance-per-GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted.  The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.   

This submission also used a total of 5,120 Blackwell GPUs, more than double the largest submitted scale of 2,496 Blackwell GPUs in the prior round, connected using NVLink for scale-up within a rack and NVIDIA Quantum-2 InfiniBand for scale-out across racks. Performance increased by 2.7x, more than the roughly 2x increase in GPU count, meaning the gains came from both the larger scale and the higher effective performance per GPU.

Scaling efficiency, the fraction of the ideal linear speedup achieved as GPUs are added, was 85% when scaling 10x from 512 Blackwell GPUs to 5,120 Blackwell GPUs.

This is critical as it enables model builders to scale training runs, accelerating time to train and time to revenue, while ensuring that each of those incremental GPUs achieves high utilization. 
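
As a concrete illustration of how that 85% figure can be read, here is a small Python sketch of the scaling-efficiency arithmetic. The 512-GPU training time is a hypothetical value implied by the published efficiency, not an official result.

    def scaling_efficiency(time_small, gpus_small, time_large, gpus_large):
        """Achieved speedup divided by the ideal (linear) speedup."""
        achieved = time_small / time_large
        ideal = gpus_large / gpus_small
        return achieved / ideal

    # 10x more GPUs at 85% efficiency implies an 8.5x speedup, so a
    # hypothetical ~85-minute run at 512 GPUs would drop to 10 minutes
    # at 5,120 GPUs.
    print(scaling_efficiency(85.0, 512, 10.0, 5120))  # -> 0.85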

Blackwell Ultra sets the bar for Llama 3.1 8B training performance

To ensure that MLPerf Training results represent modern AI use cases, the benchmark suite is regularly updated. This round, BERT-large was replaced by Llama 3.1 8B, which provides a substantial increase in capability and training complexity while remaining a compact, accessible LLM that a broad range of platforms can train.

The NVIDIA platform delivered the highest performance on the Llama 3.1 8B training benchmark, both in terms of performance at a given number of GPUs and performance at scale. 

Llama 3.1 8B submissions also benefited from several full-stack optimizations. 

One was the use of NVFP4 training recipes, which enabled performance increases while maintaining accuracy, even with a much smaller model. 

Next, with increased context lengths, attention becomes a critical component of end-to-end LLM pretraining performance. Previous NVIDIA LLM pretraining submissions used BF16 precision for the inputs of the batched-matrix-multiply (BMM) computations in the attention block. This round, NVIDIA submissions used FP8 precision for the attention BMM inputs on the Llama 3.1 8B pretraining benchmark, applying FP8 to the attention BMMs in both the forward and backward passes.

Our FP8 recipe achieved up to 1.3x better performance in the attention kernel of MLPerf benchmarks compared to the BF16 counterpart while still meeting the accuracy requirements of the benchmark. 

The FP8 attention recipe used for the pretraining benchmarks this round uses per-tensor current scaling FP8 for query (Q), key (K), and value (V) tensors, as well as the gradient of output (dO) used in backward propagation. FP8 attention resulted in a 5% end-to-end speedup in the Llama 3.1 8B model. The FP8 attention implementation, for delayed scaling and current scaling recipes, is available in the NVIDIA cuDNN library, which is used in NVIDIA MLPerf submissions through the NVIDIA Transformer Engine library.
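
To make "per-tensor current scaling" concrete, here is a minimal NumPy sketch of the idea: each tensor's scale is derived from its current absolute maximum just before the cast. This is an illustration only, not the cuDNN or Transformer Engine implementation; the E4M3 maximum constant and the tensor shape are assumptions.

    import numpy as np

    FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format

    def cast_fp8_current_scaling(x):
        """Per-tensor current scaling: derive the scale from the tensor's
        current amax, scale into the FP8 range, and keep the scale so the
        BMM output can be dequantized afterward."""
        amax = np.abs(x).max()
        scale = FP8_E4M3_MAX / max(amax, 1e-12)
        x_fp8 = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
        return x_fp8, scale

    # Q, K, V (and dO in the backward pass) would each be cast this way
    # before the attention batched matrix multiplies.
    q = np.random.randn(8, 16, 128, 64).astype(np.float32)  # illustrative B, H, S, D
    q_fp8, q_scale = cast_fp8_current_scaling(q)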

Other software optimizations implemented for the pretraining models, focused on eliminating device-to-device memory copies and tensor concatenations, include the following (illustrated in the sketch after this list):

  • Implementing a fused RoPE kernel in Transformer Engine that takes the combined QKV input and outputs separate Q, K, and V tensors. This avoids splitting the Q, K, and V tensors in the forward pass and concatenating the dQ, dK, and dV tensors in the backward pass.
  • Avoiding a conversion of the attention input to the BSHD layout by using the SBHD attention layout instead. This change was implemented in Megatron-LM. In this notation, B is batch size, S is sequence length, H is the number of attention heads, and D is the head dimension, consistent with Transformer Engine notation.
  • Fusing amax computation into the producer operations.    
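
A rough NumPy illustration of the copies these fusions remove is shown below. The shapes and the combined QKV buffer are assumptions for illustration, not the actual Transformer Engine kernels.

    import numpy as np

    # Layout notation: B = batch, S = sequence, H = heads, D = head dimension.
    S, B, H, D = 2048, 4, 32, 128

    # Combined QKV projection output kept in an SBHD-style layout.
    qkv = np.random.randn(S, B, H, 3 * D).astype(np.float32)

    # Unfused forward pass: splitting materializes three new tensors (extra copies).
    q, k, v = np.split(qkv, 3, axis=-1)

    # Unfused backward pass: dQ/dK/dV are concatenated back into one buffer (another copy).
    dq, dk, dv = np.ones_like(q), np.ones_like(k), np.ones_like(v)
    dqkv = np.concatenate([dq, dk, dv], axis=-1)

    # A fused RoPE kernel reads the combined qkv buffer directly and writes its
    # outputs in one pass, so neither the split nor the concatenation is needed.
    # Keeping the SBHD layout end to end likewise avoids a transpose to BSHD,
    # which would otherwise be yet another device-to-device copy.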

Highest performance on new FLUX.1 benchmark

Another benchmark update was the addition of the FLUX.1 image generation model, replacing Stable Diffusion v2. On this test, NVIDIA once again set the bar, delivering the fastest time to train at scale of 12.5 minutes using 1,152 Blackwell GPUs. NVIDIA was also the only platform to submit to this benchmark, highlighting both the performance and versatility of the NVIDIA training stack. 

Llama 2 70B LoRA software optimizations

This round, several fusion optimizations were implemented that significantly benefited the Llama 2 70B LoRA fine-tuning benchmark. The core idea is a LoRALinearLayer module that combines the LoRA adapters and the frozen GEMM in the same module. Building this abstraction makes it possible to fuse the cast operations, the scaling operations, and the addition to the frozen GEMM output.
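
The sketch below shows the shape of such a module in plain NumPy. The class name comes from the text, but the rank, initialization, and dtypes are illustrative assumptions, and the real speedup comes from fusing the cast, scale, and add into the GEMM kernels rather than from this Python structure.

    import numpy as np

    class LoRALinearLayer:
        """Frozen base weight plus low-rank LoRA adapters in one module, so the
        cast, LoRA scaling, and addition to the frozen GEMM output can all be
        fused. Illustrative sketch only."""

        def __init__(self, in_features, out_features, rank=16, alpha=32.0):
            self.w_frozen = np.random.randn(out_features, in_features).astype(np.float16)
            self.lora_a = (0.01 * np.random.randn(rank, in_features)).astype(np.float16)
            self.lora_b = np.zeros((out_features, rank), dtype=np.float16)  # starts as a no-op
            self.scaling = alpha / rank

        def forward(self, x):
            y = x @ self.w_frozen.T                                    # frozen GEMM
            y += ((x @ self.lora_a.T) @ self.lora_b.T) * self.scaling  # fused-in LoRA update
            return y

    layer = LoRALinearLayer(in_features=8192, out_features=8192)
    out = layer.forward(np.random.randn(16, 8192).astype(np.float16))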

Key takeaways

NVIDIA is innovating on a one-year rhythm across GPU, CPU, scale-up networking, scale-out networking, system architecture, and software to drive up performance, drive down the cost of intelligence, and pave the way for new AI breakthroughs.

See more NVIDIA performance data on the Data Center Deep Learning Product Performance Hub and Performance Explorer pages.


Releasing Windows 11 Build 22631.6269 to the Release Preview Channel

Hello Windows Insiders, today we’re releasing Windows 11 Build 22631.6269 (KB5070312) to Insiders in the Release Preview Channel on Windows 11, version 23H2 (Build 22631). This non-security update includes the following quality improvements. The bold text in brackets indicates the item or area of each change.
  • [Country and Operator Settings Asset (COSA)]
    • Fixed: This update brings profiles up to date for certain mobile operators.
  • [File Explorer]
    • Fixed: This update addresses an issue where File Explorer sometimes didn’t respond to mouse clicks until you closed and reopened it.
    • Fixed: This update addresses an issue with extracting .tar files when file or folder names contain more than 34 commonly used Chinese characters.
  • [Group Policy and Configuration]
    • Fixed: This update addresses an issue where the HideRecommendedSection policy didn’t work in Windows 11 Enterprise multi-session environments, such as Azure Virtual Desktop (AVD). Even when configured using Group Policy or Configuration Service Provider (CSP), recommendations still appeared in AVD sessions.
Thanks, Windows Insider Program Team

GPT‑5.1 in Foundry: A Workhorse for Reasoning, Coding, and Chat


The pace of AI innovation is accelerating, and developers—across startups and global enterprises—are at the heart of this transformation. Today marks a significant moment for enterprise AI innovation: Azure AI Foundry is unveiling OpenAI’s GPT-5.1 series, the next generation of reasoning, analytics, and conversational intelligence.

The following models will be rolling out in Foundry today:

  • GPT-5.1: adaptive, more efficient reasoning
  • GPT-5.1-chat: chat with new chain-of-thought for end-users
  • GPT-5.1-codex: optimized for long-running conversations with enhanced tools and agentic workflows
  • GPT-5.1-codex-mini: a compact variant for resource-constrained environments
What’s new with the GPT-5.1 series

The GPT-5.1 series is built to respond to users faster in a variety of situations through adaptive reasoning, which varies thinking time more significantly to improve latency and cost efficiency across the series. This is combined with other improvements, including better tooling, enhanced stepwise reasoning visibility, multimodal intelligence, and enterprise-grade compliance.

GPT-5.1: Adaptive and Efficient Reasoning

GPT-5.1 is the mainline model, engineered to deliver adaptive, stepwise reasoning that adjusts its approach based on the complexity of each task. Core capabilities include:

  • Adaptive reasoning for nuanced, context-aware thinking time
  • Multimodal intelligence: supporting text, image, and audio inputs/outputs
  • Enterprise-grade performance, security, and compliance

This model’s flexibility empowers developers to tackle a wide spectrum of tasks—from simple queries to deep, multi-step workflows for enterprise-grade solutions. With its ability to intelligently balance speed, cost, and intelligence, GPT-5.1 sets a new standard for both performance and efficiency in AI-powered development.

GPT-5.1-chat: Elevating Interactive Experiences with Smart, Safe Conversations

GPT-5.1-chat powers fast, context-aware chat experiences with adaptive reasoning and robust safety guardrails. With chain-of-thought surfaced in chat for the first time, it takes the interactive experience to the next level. It’s tuned for safety and instruction following, making it ideal for customer support, IT helpdesk, HR, and sales enablement. Multimodal chat (text, image, and audio) improves long-turn consistency for real problem solving, delivering brand-aligned, safe conversations and supporting next-best-action recommendations.

GPT-5.1-codex and GPT-5.1-codex-mini: Frontier Models for Agentic Coding

GPT-5.1-codex builds on the foundation set by GPT-5-codex, advancing developer tooling with:

  • Enhanced reasoning frameworks for stepwise, context-aware code analysis and generation
  • Enhanced tool handling for certain development scenarios
  • Multimodal intelligence for richer developer experiences when coding

With Foundry’s enterprise-grade security and governance, GPT-5.1-codex is ideal for automated code generation and review, accelerating development cycles with intelligent code suggestions, refactoring, and bug detection.

GPT-5.1-codex-mini is a compact, efficient variant optimized for resource-constrained environments. It maintains near state-of-the-art performance, multimodal intelligence, and the same safety stack and tool access as GPT-5.1-codex, making it best for cost-effective, scalable solutions in education, startups, and cost-conscious settings.

Together, these Codex models empower teams to innovate faster and with greater confidence.

Selecting Your AI Engine: Match Model Strengths to Your Business Goals

One of the advantages of the GPT-5.1 series is unified access to deep reasoning, adaptive chat, and advanced coding—all in one place. Here’s how to match model strengths to your needs:

  • Opt for GPT-5.1 for general AI application use—tasks like analytics, research, legal/financial review, or consolidating large documents and codebases. It’s the model of choice for reliability and high-impact outputs.
  • Go with GPT-5.1-chat for interactive assistants and product UX, especially when complex cases call for adaptive reasoning. Reasoning hints and adaptive thinking time help manage customers’ perception of latency.
  • Leverage GPT-5.1-codex for deep, stepwise reasoning in complex code generation, refactoring, or multi-step analysis—ideal for demanding agentic workflows and enterprise automation.
  • Utilize GPT-5.1-codex-mini for efficient, cost-effective coding intelligence in broad-scale deployment, education, or resource-constrained environments—delivering near-mainline performance in a compact model.
Deployment and Pricing 

Model              | Deployment         | Available Regions   | Input ($/M tokens) | Cached Input ($/M tokens) | Output ($/M tokens)
GPT-5.1            | Standard Global    | Global              | $1.25              | $0.125                    | $10.00
GPT-5.1            | Standard Data Zone | Data Zone (US & EU) | $1.38              | $0.14                     | $11.00
GPT-5.1-chat       | Standard Global    | Global              | $1.25              | $0.125                    | $10.00
GPT-5.1-codex      | Standard Global    | Global              | $1.25              | $0.125                    | $10.00
GPT-5.1-codex-mini | Standard Global    | Global              | $0.25              | $0.025                    | $2.00
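
As a quick sanity check on how these per-million-token prices translate into workload cost, here is a small Python sketch. The token counts are made-up examples, and real bills depend on caching behavior and the deployment type you choose.

    # Prices in $ per million tokens, taken from the table above (Standard Global).
    PRICES = {
        "gpt-5.1":            {"input": 1.25, "cached_input": 0.125, "output": 10.00},
        "gpt-5.1-codex-mini": {"input": 0.25, "cached_input": 0.025, "output": 2.00},
    }

    def estimate_cost(model, input_tokens, cached_tokens, output_tokens):
        p = PRICES[model]
        return (input_tokens * p["input"]
                + cached_tokens * p["cached_input"]
                + output_tokens * p["output"]) / 1_000_000

    # Hypothetical workload: 50M fresh input tokens, 20M cached, 10M output.
    print(estimate_cost("gpt-5.1", 50e6, 20e6, 10e6))             # -> 165.0 ($)
    print(estimate_cost("gpt-5.1-codex-mini", 50e6, 20e6, 10e6))  # -> 33.0 ($)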

Start Building Today

The GPT-5.1 series is now available in Foundry Models. Whether you’re building for enterprise, small and medium-sized business, or launching the next digital-native app, these models and the Foundry platform are designed to help you innovate faster, safer, and at scale.
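
For reference, a minimal sketch of calling a GPT-5.1 deployment through the openai Python package's Azure client is shown below. The endpoint, API version, and deployment name are placeholders and assumptions; replace them with the values from your own Foundry resource.

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<your-api-key>",   # or use Microsoft Entra ID auth instead
        api_version="2024-10-21",   # assumed; use the version your resource supports
    )

    response = client.chat.completions.create(
        model="gpt-5-1",  # the deployment name you created in Foundry (assumption)
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Draft a short project status update."},
        ],
    )
    print(response.choices[0].message.content)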

 

 


What is Copilot Tuning?

From: Microsoft Developer
Duration: 0:43
Views: 90

What if you could make Microsoft 365 Copilot sound and work just like your organization? Dona Sarkar explains how Copilot Tuning lets you fine-tune Copilot with your own data, workflows, and voice.

Learn more: https://msft.it/6059tOnRl


98: Customizable select


In this episode of The CSS Podcast, Una and Bramus cover building customizable select menus. Have you ever had to build a dropdown menu where you want to do something as simple as change the color, or add little flag icons? You know how hard it can be! Discover how the web platform is solving this once and for all with the new customizable select API.

Resources:
Customizable select demos → https://goo.gle/43G5ruv 
 
Una Kravets (co-host)
Bluesky | Twitter | YouTube | Website
Making the web more colorful @googlechrome 
Bramus Van Damme (co-host)
Bluesky | Mastodon | YouTube | Website
@GoogleChrome CSS DevRel; @CSSWG; Scuba Diver 🤿





Download audio: https://traffic.libsyn.com/secure/thecsspodcast/TCP098_v1.mp3?dest-id=1891556

C# Could Overtake Java in TIOBE Index
