Nvidia claims 291 MLPerf wins with 288-GPU setup; AMD MI355X crosses 1M tokens/sec

TL;DR

MLCommons published MLPerf Inference v6.0 results on April 1, 2026, with Nvidia, AMD, and Intel each claiming top spots in different configurations. Nvidia's 288-GPU GB300-NVL72 system achieved 2.49 million tokens per second on DeepSeek-R1, while AMD's MI355X crossed one million tokens per second for the first time. Direct comparisons remain difficult as each chipmaker targets different market segments and benchmarks.

April 2, 2026 · 3:05 PM · 3 min read

MLCommons published the results of MLPerf Inference v6.0 on April 1, 2026, introducing multimodal and video models to the industry's top inference benchmark for the first time. All three major chipmakers—Nvidia, AMD, and Intel—submitted results, each claiming performance leadership. However, the results are only partially comparable due to different system configurations, models, and scenarios.

Nvidia's Scale Dominance

Nvidia showcases its records primarily on DeepSeek-R1 and GPT-OSS-120B, in some cases using 288-GPU configurations. The GB300-NVL72 system with Blackwell Ultra GPUs achieved the highest throughput across all new workloads. On DeepSeek-R1 in the server scenario, Nvidia posted a 2.7x performance jump over submissions from six months earlier on identical hardware, a gain attributed entirely to software optimizations.

These improvements came from fusing basic compute operations, reducing GPU overhead, and using the open-source Nvidia Dynamo framework to separately optimize text input processing and token generation. For mixture-of-experts models like DeepSeek-R1, Wide Expert Parallel distributes weights across more GPUs to prevent bottlenecks. Multi-Token Prediction generates multiple tokens simultaneously in interactive scenarios with small batch sizes.

In its largest MLPerf submission ever, Nvidia connected four GB300-NVL72 systems (288 GPUs total) via Quantum-X800 InfiniBand, achieving approximately 2.49 million tokens per second on DeepSeek-R1 in the offline scenario. Fourteen partners submitted results on the Nvidia platform, the most of any platform this round. Nvidia claims 291 cumulative MLPerf wins since 2018, nine times more than all other submitters combined.
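As a rough way to read the headline figure, the GPU total and an approximate per-GPU share of the throughput can be worked out from the numbers above. The sketch below is illustrative arithmetic only: the function name is hypothetical, and an even split across GPUs is an assumption, since the article does not report per-GPU results.

```python
# Figures quoted in the article; the per-GPU value is a rough normalization,
# not a measured result.
SYSTEMS = 4                      # GB300-NVL72 systems in the submission
GPUS_PER_SYSTEM = 72             # consistent with the stated 288-GPU total
TOTAL_TOKENS_PER_SEC = 2.49e6    # approximate DeepSeek-R1 offline throughput


def per_gpu_throughput(total_tokens_per_sec: float, gpu_count: int) -> float:
    """Average tokens/sec per GPU, assuming throughput is spread evenly."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return total_tokens_per_sec / gpu_count


gpu_count = SYSTEMS * GPUS_PER_SYSTEM
average = per_gpu_throughput(TOTAL_TOKENS_PER_SEC, gpu_count)

print(f"GPUs in submission: {gpu_count}")
print(f"Average per GPU: about {average:,.0f} tokens/sec")
```

Under these assumptions, the submission averages roughly 8,600 tokens per second per GPU.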

AMD's Single-Node Parity

AMD's Instinct MI355X, built on the CDNA 4 architecture (3nm, up to 288 GB HBM3E), crossed one million tokens per second for the first time, though only with multi-node scaling across up to 94 GPUs. Compared with the previous-generation MI325X, the MI355X delivered a 3.1x throughput jump on the Llama 2 70B server benchmark.

In single-node comparisons using eight GPUs, AMD claims the MI355X matched Nvidia's B200 on Llama 2 70B in the offline scenario (100%), came within 3% in the server scenario (97%), and pulled ahead in the interactive scenario (119%). Against the B300, those figures dropped to 92%, 93%, and 104% respectively. On GPT-OSS-120B, AMD exceeded the B200 by 11-15% but trailed the B300, reaching 82-91% of its performance.
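One way to read AMD's two sets of Llama 2 70B ratios together is to divide them, which gives the B300-over-B200 gap they imply. This is a back-of-the-envelope sketch, assuming both ratios in each scenario rest on the same MI355X result (the article does not confirm this). The dictionary and function names are made up for illustration.

```python
# MI355X throughput relative to each Nvidia GPU on Llama 2 70B (8-GPU nodes),
# as claimed by AMD in the article. 1.00 means parity.
MI355X_VS_B200 = {"offline": 1.00, "server": 0.97, "interactive": 1.19}
MI355X_VS_B300 = {"offline": 0.92, "server": 0.93, "interactive": 1.04}


def implied_b300_over_b200(vs_b200: dict, vs_b300: dict) -> dict:
    """Return the implied B300/B200 ratio per scenario.

    (MI355X / B200) / (MI355X / B300) = B300 / B200, which only holds if
    the MI355X figure is the same in both comparisons (an assumption).
    """
    return {scenario: vs_b200[scenario] / vs_b300[scenario] for scenario in vs_b200}


implied = implied_b300_over_b200(MI355X_VS_B200, MI355X_VS_B300)
for scenario, ratio in implied.items():
    print(f"{scenario:12s} B300 is about {ratio:.2f}x B200")
```

Under that assumption, the implied B300 advantage over the B200 is roughly 1.09x offline, 1.04x server, and 1.14x interactive.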

Critically, AMD did not submit results for DeepSeek-R1, where Nvidia posts its strongest numbers. AMD's text-to-video submission competed in the Open division rather than the Closed division, limiting direct comparability. Nine partners submitted AMD results, scoring within 4% of AMD's own measurements. Dell and MangoBoost created the first heterogeneous MLPerf submission, mixing MI300X, MI325X, and MI355X GPUs across US and Korean sites and hitting roughly 142,000 tokens per second on Llama 2 70B.

Intel's Strategic Pivot

Intel sidesteps direct data center competition, instead showcasing Arc Pro B70 and B65 GPUs alongside Xeon 6 processors for workstations and edge inference. Four Arc Pro B70 cards provide 128 GB of VRAM for 120-billion-parameter models. The B70 delivers 1.8x the performance of the B60. Software optimizations on B60 hardware achieved gains of up to 1.18x over MLPerf v5.1. Intel emphasized that it is the only server processor maker submitting standalone CPU results; over half of all v6.0 submissions use Xeon as the host CPU.
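To see why 128 GB matters for a 120-billion-parameter model, the weight footprint can be estimated at a few precisions. This is a simplified sketch: the bit widths are assumptions (the article does not say which precision Intel used), 1 GB is taken as 10^9 bytes, and KV cache, activations, and runtime overhead are ignored. The function name is hypothetical.

```python
CARDS = 4
TOTAL_VRAM_GB = 128      # four Arc Pro B70 cards, per the article
PARAMS = 120e9           # "120-billion-parameter models"


def weight_footprint_gb(params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


print(f"VRAM per card: {TOTAL_VRAM_GB / CARDS:.0f} GB")
for bits in (16, 8, 4):  # assumed precisions, for illustration only
    size = weight_footprint_gb(PARAMS, bits)
    verdict = "fits" if size <= TOTAL_VRAM_GB else "does not fit"
    print(f"{bits:2d}-bit weights: {size:6.1f} GB -> {verdict} in {TOTAL_VRAM_GB} GB")
```

By this estimate, 16-bit weights (240 GB) would not fit, 8-bit weights (120 GB) would leave only about 8 GB for everything else, and 4-bit weights (60 GB) would leave comfortable headroom.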

Benchmark Limitations

MLPerf v6.0 adds five new tests: a DeepSeek-R1 interactive scenario (with a 5x higher minimum token rate), the Qwen3-VL-235B vision-language model, GPT-OSS-120B, the WAN-2.2-T2V text-to-video model, and the DLRMv3 recommendation benchmark. Only Nvidia submitted results for all new models and scenarios.

Notably absent were submissions from Google (Ironwood TPU chips) and inference specialists like Cerebras. These results demonstrate that while MLPerf remains the industry standard, it doesn't produce a straightforward leaderboard. Each chipmaker naturally highlights configurations where its products excel, making comparative analysis challenging without examining underlying test parameters.

Nvidia is driving the definition of the MLPerf Endpoints benchmark within MLCommons, which is designed to measure real-world API performance under actual traffic patterns rather than standardized conditions.

What This Means

Nvidia's 288-GPU scaling and software optimization gains extend its absolute performance lead at massive scale, though these configurations remain out of reach for most organizations. AMD's single-node parity with the B200 (and its advantages on some models) suggests a real competitive alternative in mid-scale deployments, constrained only by limited model coverage. Intel's workstation and edge focus acknowledges different market dynamics outside hyperscaler data centers. The lack of breakthrough single-chip performance from any vendor, together with AMD's absence on the largest models, indicates that Nvidia's architectural and software lead remains substantial despite AMD's generational improvements.

Source: the-decoder.com
