benchmark NVIDIA

Nvidia claims 291 MLPerf wins with 288-GPU setup; AMD MI355X crosses 1M tokens/sec

TL;DR

MLCommons published MLPerf Inference v6.0 results on April 1, 2026, with Nvidia, AMD, and Intel each claiming top spots in different configurations. Nvidia's 288-GPU GB300-NVL72 system achieved 2.49 million tokens per second on DeepSeek-R1, while AMD's MI355X crossed one million tokens per second for the first time. Direct comparisons remain difficult as each chipmaker targets different market segments and benchmarks.


MLCommons published the results of MLPerf Inference v6.0 on April 1, 2026, introducing multimodal and video models to the industry's top inference benchmark for the first time. All three major chipmakers—Nvidia, AMD, and Intel—submitted results, each claiming performance leadership. However, the results are only partially comparable due to different system configurations, models, and scenarios.

Nvidia's Scale Dominance

Nvidia showcased its records primarily on DeepSeek-R1 and GPT-OSS-120B, at times using 288-GPU configurations. The GB300-NVL72 system with Blackwell Ultra GPUs achieved the highest throughput across all new workloads. Nvidia posted a 2.7x performance jump on DeepSeek-R1 in the server scenario compared with its submission six months ago on identical hardware, gains the company attributes entirely to software optimizations.

These improvements came from fusing basic compute operations, reducing GPU overhead, and using the open-source Nvidia Dynamo framework to separately optimize text input processing and token generation. For mixture-of-experts models like DeepSeek-R1, Wide Expert Parallel distributes weights across more GPUs to prevent bottlenecks. Multi-Token Prediction generates multiple tokens simultaneously in interactive scenarios with small batch sizes.
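The core idea behind Wide Expert Parallel, spreading a mixture-of-experts model's expert weights over more GPUs so no single device holds (and serves) too many experts, can be sketched with a toy placement function. This is an illustrative assumption-laden sketch, not Nvidia's actual Dynamo or Wide Expert Parallel implementation; the expert count is loosely modeled on DeepSeek-style MoE models.

```python
# Toy sketch of expert-parallel placement for a mixture-of-experts model.
# Hypothetical code for illustration; not Nvidia's Dynamo/WideEP implementation.

def place_experts(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """Round-robin experts across GPUs.

    Wider expert parallelism (more GPUs) means fewer experts, and thus
    less expert weight traffic, per device.
    """
    placement = {gpu: [] for gpu in range(num_gpus)}
    for expert in range(num_experts):
        placement[expert % num_gpus].append(expert)
    return placement

# Illustrative scale: 256 routed experts.
narrow = place_experts(256, 8)   # 32 experts per GPU
wide = place_experts(256, 64)    # 4 experts per GPU

print(max(len(v) for v in narrow.values()))  # 32
print(max(len(v) for v in wide.values()))    # 4
```

The point of "wide" placement is the second number: each GPU hosts far fewer expert weights, so routing a batch of tokens is less likely to bottleneck on any one device.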

In its largest MLPerf submission ever, Nvidia connected four GB300-NVL72 systems (288 GPUs total) via Quantum-X800 InfiniBand, achieving approximately 2.49 million tokens per second on DeepSeek-R1 in the offline scenario. Fourteen partners submitted results on the Nvidia platform—the most of any platform this round. Nvidia claims 291 cumulative MLPerf wins since 2018, nine times more than all other submitters combined.
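To put the headline number in perspective, the 288-GPU aggregate can be normalized to per-GPU throughput with simple arithmetic (back-of-envelope only; the per-GPU figure is derived here, not a published MLPerf result):

```python
# Normalize the reported 288-GPU DeepSeek-R1 offline result to per-GPU throughput.
# Illustrative arithmetic; the derived per-GPU figure is not an official number.

total_tokens_per_s = 2_490_000   # ~2.49M tokens/s, as reported
gpus = 4 * 72                    # four GB300-NVL72 racks of 72 GPUs each

per_gpu = total_tokens_per_s / gpus
print(f"{per_gpu:,.0f} tokens/s per GPU")
```

This kind of normalization is the only way to compare results across submissions with wildly different GPU counts, which is exactly why raw leaderboard numbers are misleading on their own.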

AMD's Single-Node Parity

AMD's Instinct MI355X on the CDNA 4 architecture (3nm, up to 288 GB HBM3E) crossed one million tokens per second for the first time, though it relied on multi-node scaling with up to 94 GPUs. Compared to the previous-generation MI325X, the MI355X delivered a 3.1x throughput jump on Llama 2 70B server benchmarks.

In single-node comparisons using eight GPUs, AMD claims the MI355X matched Nvidia's B200 on the Llama 2 70B offline scenario (100%), achieved 97% parity in the server scenario, and reached 119% in the interactive scenario. Against the B300, those figures dropped to 92%, 93%, and 104% respectively. On GPT-OSS-120B, AMD exceeded the B200 by 11-15% but trailed the B300, reaching 82-91% of its performance.
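The parity figures above are simply one system's throughput expressed as a percentage of a baseline's. A minimal helper makes the convention explicit (the throughput inputs below are hypothetical placeholders; MLPerf publishes the absolute tokens/s from which such percentages derive):

```python
# Parity as used in vendor comparisons: challenger throughput as a
# percentage of a baseline. Input values below are hypothetical.

def parity_pct(challenger_tps: float, baseline_tps: float) -> int:
    """Return challenger throughput as a rounded percentage of baseline."""
    return round(100 * challenger_tps / baseline_tps)

print(parity_pct(97_000, 100_000))   # 97  -> near parity
print(parity_pct(119_000, 100_000))  # 119 -> challenger ahead
```

Note that a 119% "parity" figure means the challenger was faster than the baseline on that scenario, not merely equal to it.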

Critically, AMD did not submit results for DeepSeek-R1, where Nvidia posts its strongest numbers. AMD's text-to-video submission competed in the Open division rather than the Closed division, limiting direct comparability. Nine partners submitted AMD results, all scoring within 4% of AMD's own measurements. Dell and MangoBoost created the first heterogeneous MLPerf submission, mixing MI300X, MI325X, and MI355X GPUs across US and Korean sites, hitting roughly 142,000 tokens per second on Llama 2 70B.

Intel's Strategic Pivot

Intel skipped direct data center competition, instead showcasing Arc Pro B70 and B65 GPUs alongside Xeon 6 processors for workstation and edge inference. Four Arc Pro B70 cards provide 128 GB of VRAM, enough for 120-billion-parameter models. The B70 delivers 1.8x the performance of the B60, and software optimizations alone yielded up to 1.18x gains on B60 hardware over MLPerf v5.1. Intel emphasized being the only server processor maker to submit standalone CPU results; more than half of all v6.0 submissions used a Xeon as the host CPU.

Benchmark Limitations

MLPerf v6.0 adds five new tests: a DeepSeek-R1 interactive scenario (with a 5x higher minimum token rate), the Qwen3-VL-235B vision-language model, GPT-OSS-120B, the WAN-2.2-T2V text-to-video model, and the DLRMv3 recommendation benchmark. Only Nvidia submitted results for all new models and scenarios.

Notably absent were submissions from Google (Ironwood TPU chips) and inference specialists like Cerebras. These results demonstrate that while MLPerf remains the industry standard, it doesn't produce a straightforward leaderboard. Each chipmaker naturally highlights configurations where its products excel, making comparative analysis challenging without examining underlying test parameters.

Nvidia is driving the definition of an MLPerf Endpoints benchmark within MLCommons, designed to measure real-world API performance under actual traffic patterns rather than standardized test conditions.

What This Means

Nvidia's 288-GPU scaling capability and software optimization gains extend its absolute performance lead at massive scale, though these configurations remain inaccessible to most organizations. AMD's single-node parity with the B200 (and advantages on some models) suggests real competitive optionality in mid-scale deployments, though constrained by limited model coverage. Intel's workstation and edge focus signals an acknowledgment of different market dynamics outside hyperscaler data centers. The lack of breakthrough single-chip performance from any vendor, and AMD's absence on the largest models, indicate that Nvidia's current architectural and software lead remains substantial despite AMD's generational improvements.

Related Articles

product update

NVIDIA Nemotron 3 Super now available on Amazon Bedrock with 256K context window

NVIDIA Nemotron 3 Super, a hybrid Mixture of Experts model with 120B parameters and 12B active parameters, is now available as a fully managed model on Amazon Bedrock. The model supports up to 256K token context length and claims 5x higher throughput efficiency over the previous Nemotron Super and 2x higher accuracy on reasoning tasks.

product update

NVIDIA Nemotron 3 Nano now available on Amazon Bedrock as serverless model

Amazon Bedrock now offers NVIDIA's Nemotron 3 Nano as a fully managed serverless model, expanding its Nemotron portfolio alongside previously available Nemotron 2 Nano 9B and Nemotron 2 Nano VL 12B variants. The addition enables developers to deploy NVIDIA's smallest inference-optimized model without managing infrastructure.

model release

NVIDIA Optimizes Google Gemma 4 for Local Agentic AI on RTX and Spark

NVIDIA has optimized Google's Gemma 4 models for local deployment on RTX and Spark platforms, targeting the emerging wave of on-device agentic AI. The optimization enables small, efficient models to access real-time local context for autonomous decision-making without cloud dependency.

model release

NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains

NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, an 88-billion-parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built using the Puzzle post-training neural architecture search framework, the model achieves a 1.63× throughput improvement in long-context (64K/64K) scenarios and up to a 2.82× improvement on single H100 GPUs compared to its parent gpt-oss-120B, while matching or exceeding its accuracy across reasoning effort levels.
