Nvidia claims 291 MLPerf wins with 288-GPU setup; AMD MI355X crosses 1M tokens/sec
MLCommons published MLPerf Inference v6.0 results on April 1, 2026, with Nvidia, AMD, and Intel each claiming top spots in different configurations. Nvidia's 288-GPU GB300-NVL72 system achieved 2.49 million tokens per second on DeepSeek-R1, while AMD's MI355X crossed one million tokens per second for the first time. Direct comparisons remain difficult as each chipmaker targets different market segments and benchmarks.
MLCommons published the results of MLPerf Inference v6.0 on April 1, 2026, introducing multimodal and video models to the industry's top inference benchmark for the first time. All three major chipmakers—Nvidia, AMD, and Intel—submitted results, each claiming performance leadership. However, the results are only partially comparable due to different system configurations, models, and scenarios.
Nvidia's Scale Dominance
Nvidia showcases its records primarily on DeepSeek-R1 and GPT-OSS-120B, sometimes using 288-GPU configurations. The GB300-NVL72 system with Blackwell Ultra GPUs achieved the highest throughput across all new workloads. Nvidia achieved a 2.7x performance jump on DeepSeek-R1 in the server scenario compared to submissions six months ago on identical hardware—gains attributed entirely to software optimizations.
These improvements came from fusing basic compute operations, reducing GPU overhead, and using the open-source Nvidia Dynamo framework to separately optimize text input processing and token generation. For mixture-of-experts models like DeepSeek-R1, Wide Expert Parallel distributes weights across more GPUs to prevent bottlenecks. Multi-Token Prediction generates multiple tokens simultaneously in interactive scenarios with small batch sizes.
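The effect of spreading experts over more GPUs can be illustrated with a toy placement. This is a minimal sketch, not Nvidia's implementation: the expert count of 256 is a hypothetical illustration value, the round-robin rule is an assumption chosen for simplicity, and `assign_experts` and `experts_per_gpu` are made-up helper names.

```python
import math


def experts_per_gpu(num_experts: int, num_gpus: int) -> int:
    """Largest number of experts any one GPU holds under an even split."""
    return math.ceil(num_experts / num_gpus)


def assign_experts(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """Round-robin placement: expert i is stored on GPU i % num_gpus."""
    placement: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for expert in range(num_experts):
        placement[expert % num_gpus].append(expert)
    return placement


NUM_EXPERTS = 256  # hypothetical expert count, for illustration only

for gpus in (8, 72):
    placement = assign_experts(NUM_EXPERTS, gpus)
    heaviest = max(len(experts) for experts in placement.values())
    print(f"{gpus} GPUs: at most {heaviest} experts per GPU")
```

With more GPUs, each one stores and serves fewer experts, which is the intuition behind the bottleneck relief described above. Real systems also have to balance routing traffic and communication cost, which this sketch ignores.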
In its largest MLPerf submission ever, Nvidia connected four GB300-NVL72 systems (288 GPUs total) via Quantum-X800 InfiniBand, achieving approximately 2.49 million tokens per second on DeepSeek-R1 in the offline scenario. Fourteen partners submitted results on the Nvidia platform, the most of any platform this round. Nvidia claims 291 cumulative MLPerf wins since 2018, nine times more than all other submitters combined.
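To put the headline number in per-GPU terms, the aggregate throughput can be divided by the GPU count. A rough sketch using only the figures quoted above. It assumes throughput is spread evenly across GPUs, which the results do not state, and `per_gpu_throughput` is a hypothetical helper name.

```python
GPUS_PER_SYSTEM = 72           # implied by four systems totalling 288 GPUs
NUM_SYSTEMS = 4
TOTAL_TOKENS_PER_SEC = 2.49e6  # approximate figure quoted for DeepSeek-R1 offline


def per_gpu_throughput(total_tokens_per_sec: float, num_gpus: int) -> float:
    """Average tokens per second per GPU, assuming an even split."""
    return total_tokens_per_sec / num_gpus


total_gpus = GPUS_PER_SYSTEM * NUM_SYSTEMS
per_gpu = per_gpu_throughput(TOTAL_TOKENS_PER_SEC, total_gpus)
print(f"{total_gpus} GPUs, about {per_gpu:,.0f} tokens/sec per GPU")
```

That works out to roughly 8,600 tokens per second per GPU. The figure is only meaningful for this model and scenario and should not be compared against per-GPU numbers from other workloads.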
AMD's Single-Node Parity
AMD's Instinct MI355X on CDNA 4 architecture (3nm, up to 288 GB HBM3E) crossed one million tokens per second for the first time, though only with multi-node scaling across up to 94 GPUs. Compared to the previous-generation MI325X, the MI355X delivered a 3.1x throughput jump on Llama 2 70B server benchmarks.
In single-node comparisons using eight GPUs, AMD claims the MI355X matched Nvidia's B200 in the Llama 2 70B offline scenario (100%), reached 97% of its throughput in the server scenario, and hit 119% in the interactive scenario. Against the B300, those figures dropped to 92%, 93%, and 104%, respectively. On GPT-OSS-120B, AMD exceeded the B200 by 11-15% but trailed the B300, reaching only 82-91% of its performance.
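AMD's two sets of Llama 2 70B percentages also imply how far the B300 sits ahead of the B200: dividing the MI355X-versus-B200 figure by the MI355X-versus-B300 figure cancels out the AMD part. A minimal sketch, assuming both sets of figures share the same MI355X baseline (the article does not confirm this) and accepting that the inputs are rounded to whole percentages. `implied_b300_over_b200` is a hypothetical helper name.

```python
# MI355X throughput as a fraction of each Nvidia part, per AMD's claims
# (Llama 2 70B, eight-GPU single node).
MI355X_VS_B200 = {"offline": 1.00, "server": 0.97, "interactive": 1.19}
MI355X_VS_B300 = {"offline": 0.92, "server": 0.93, "interactive": 1.04}


def implied_b300_over_b200(vs_b200: dict[str, float],
                           vs_b300: dict[str, float]) -> dict[str, float]:
    """(MI355X/B200) / (MI355X/B300) = B300/B200 for each scenario."""
    return {scenario: vs_b200[scenario] / vs_b300[scenario]
            for scenario in vs_b200}


ratios = implied_b300_over_b200(MI355X_VS_B200, MI355X_VS_B300)
for scenario, ratio in ratios.items():
    print(f"{scenario}: B300 is about {ratio:.2f}x the B200")
```

On these inputs the implied B300 advantage over the B200 lands between roughly 4% and 14% depending on the scenario. Because the source percentages are rounded, the derived ratios are indicative at best.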
Critically, AMD did not submit results for DeepSeek-R1, where Nvidia posts its strongest numbers. AMD's text-to-video submission competed in the Open division rather than the Closed division, limiting direct comparability. Nine partners submitted results on AMD hardware, scoring within 4% of AMD's own measurements. Dell and MangoBoost created the first heterogeneous MLPerf submission, mixing MI300X, MI325X, and MI355X GPUs across US and Korean sites and hitting roughly 142,000 tokens per second on Llama 2 70B.
Intel's Strategic Pivot
Intel skips direct data center competition, instead showcasing Arc Pro B70 and B65 GPUs alongside Xeon 6 processors for workstations and edge inference. Four Arc Pro B70 cards provide 128 GB of VRAM for 120-billion-parameter models. The B70 delivers 1.8x the performance of the B60. Software optimizations on B60 hardware achieved performance gains of up to 1.18x over MLPerf v5.1. Intel emphasized that it is the only server processor maker submitting standalone CPU results; over half of all v6.0 submissions use Xeon as the host CPU.
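Whether a 120-billion-parameter model fits in 128 GB depends mostly on how many bits each weight occupies. A back-of-the-envelope sketch, assuming decimal gigabytes and counting weights only. KV cache, activations, and runtime overhead are ignored, and the precisions listed are illustrative rather than what Intel actually submitted. `weight_footprint_gb` is a hypothetical helper name.

```python
TOTAL_VRAM_GB = 128   # four Arc Pro B70 cards, per the article
NUM_CARDS = 4
PARAMS = 120e9        # 120 billion parameters


def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Weights-only memory in decimal GB for a given precision."""
    return params * bits_per_weight / 8 / 1e9


print(f"VRAM per card: {TOTAL_VRAM_GB / NUM_CARDS:.0f} GB")
for bits in (16, 8, 4):
    footprint = weight_footprint_gb(PARAMS, bits)
    verdict = "fits" if footprint <= TOTAL_VRAM_GB else "does not fit"
    print(f"{bits}-bit weights: {footprint:.0f} GB, {verdict} in {TOTAL_VRAM_GB} GB")
```

Under these assumptions, 16-bit weights need about 240 GB and do not fit, while 8-bit (about 120 GB) and 4-bit (about 60 GB) weights do. The 8-bit case leaves very little headroom for anything besides the weights.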
Benchmark Limitations
MLPerf v6.0 adds five new tests: a DeepSeek-R1 interactive scenario (with a 5x higher minimum token rate), the Qwen3-VL-235B vision-language model, GPT-OSS-120B, the WAN-2.2-T2V text-to-video model, and the DLRMv3 recommendation benchmark. Only Nvidia submitted results for all new models and scenarios.
Notably absent were submissions from Google (Ironwood TPU chips) and inference specialists like Cerebras. These results demonstrate that while MLPerf remains the industry standard, it doesn't produce a straightforward leaderboard. Each chipmaker naturally highlights configurations where its products excel, making comparative analysis challenging without examining underlying test parameters.
Nvidia is driving the definition of the MLPerf Endpoints benchmark within MLCommons, which is designed to measure real-world API performance under actual traffic patterns rather than standardized conditions.
What This Means
Nvidia's 288-GPU scaling capability and software optimization gains extend its absolute performance lead at massive scale, though these configurations remain inaccessible to most organizations. AMD's single-node parity with the B200 (and advantages on some models) suggests a real competitive alternative in mid-scale deployments, constrained only by limited model coverage. Intel's workstation and edge focus signals an acknowledgment that market dynamics differ outside hyperscaler data centers. The lack of breakthrough single-chip performance from any vendor, together with AMD's absence on the largest models, indicates that Nvidia's current architectural and software lead remains substantial despite AMD's generational improvements.