Meta's TRIBE v2 AI Predicts Brain Activity from Images, Audio, and Speech
Meta's FAIR lab released TRIBE v2, an AI model that predicts how the human brain responds to visual, auditory, and language stimuli. Trained on more than 1,000 hours of fMRI data from 720 subjects, the model maps its predictions to 70,000 voxels (the 3D units that make up an fMRI scan), and those predictions often match group-average brain responses more closely than individual subjects' scans do.
Architecture and Training
TRIBE v2 processes three input types through separate pre-trained Meta models: Llama 3.2 for text, Wav2Vec2-BERT 2.0 for audio, and V-JEPA 2 for video. These encoders generate embeddings that capture the semantic content of each stimulus. A transformer then processes all three representations jointly, identifying patterns across stimuli and subjects, and a final person-specific layer translates the output into predicted brain-activation maps.
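The pipeline described above can be sketched in heavily simplified form. Everything here is an illustrative stand-in, not Meta's implementation: the dimensions are tiny, the frozen encoders are replaced by random arrays, the joint transformer by a single shared projection, and the person-specific layer by one linear readout.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, N_VOXELS = 8, 16, 100   # timesteps, embedding dim, voxels (70,000 in the real model)

# Stand-ins for the three frozen encoder outputs (text, audio, video embeddings).
text_emb = rng.standard_normal((T, D))
audio_emb = rng.standard_normal((T, D))
video_emb = rng.standard_normal((T, D))

def fuse(text, audio, video):
    """Toy stand-in for the joint transformer: concatenate the three
    modality streams and mix them through one shared projection."""
    x = np.concatenate([text, audio, video], axis=-1)   # (T, 3*D)
    w_shared = rng.standard_normal((3 * D, D))
    return np.tanh(x @ w_shared)                        # (T, D)

def subject_head(features, w_subject):
    """Person-specific linear readout into voxel space."""
    return features @ w_subject                         # (T, N_VOXELS)

shared = fuse(text_emb, audio_emb, video_emb)
w_subject = rng.standard_normal((D, N_VOXELS)) / np.sqrt(D)
pred_bold = subject_head(shared, w_subject)
print(pred_bold.shape)  # (8, 100)
```

The key design point carried over from the article: the fusion stage is shared across people, while only the final readout is subject-specific, which is what lets the model generalize to new subjects by fitting just that last layer.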
The model generalizes to new subjects without retraining and shows steady accuracy improvements with additional training data—a scaling pattern similar to large language models.
Performance and Validation
In testing, TRIBE v2's predictions often correlated more strongly with actual group-average brain responses than individual subjects' scans did. On the Human Connectome Project dataset (7-tesla fMRI), the model's correlation with group averages was roughly double that of the median individual subject.
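The comparison above can be made concrete with a toy calculation. Encoding models of this kind are commonly scored with voxel-wise Pearson correlation against the measured time courses (an assumption here, since the article does not spell out the metric); the synthetic data below just illustrates how a low-noise model prediction can track the group average better than a single noisy scan.

```python
import numpy as np

def voxelwise_correlation(pred, target):
    """Pearson r between predicted and measured time courses, per voxel.
    Both arrays are shaped (timepoints, voxels)."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(1)
group_avg = rng.standard_normal((50, 20))                        # group-average responses
model_pred = group_avg + 0.5 * rng.standard_normal((50, 20))     # faithful model prediction
subject_scan = group_avg + 2.0 * rng.standard_normal((50, 20))   # noisy individual scan

r_model = float(np.median(voxelwise_correlation(model_pred, group_avg)))
r_subject = float(np.median(voxelwise_correlation(subject_scan, group_avg)))
assert r_model > r_subject  # the cleaner prediction tracks the group average better
```

This is also why "beating individual scans" is plausible rather than paradoxical: a single subject's scan carries measurement noise that a model trained across many subjects can average away.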
Compared to optimized linear baseline models, TRIBE v2 showed significant improvements across all datasets tested. Its predecessor, TRIBE v1, won the Algonauts 2025 competition against 263 other teams despite being trained on only four subjects and predicting just 1,000 voxels; v2 expands both the scope and the accuracy of that approach substantially.
Replicating Decades of Neuroscience
Using controlled test protocols from the Individual Brain Charting dataset, researchers validated TRIBE v2 against classical neuroscience findings. In visual tasks, the model correctly identified specialized brain regions for faces, places, bodies, and characters. In language experiments, it replicated known patterns: distinguishing speech from silence, emotional from physical pain processing, and showing expected left-hemisphere activation for complete sentences versus word lists.
By selectively disabling input channels, the team mapped which sensory modality drives activity in specific regions: audio predicts activity near the auditory cortex, video maps to the visual cortex, and text activates language areas. In multimodal regions such as the temporal-parietal-occipital junction, using all three channels improved prediction accuracy by up to 50 percent over any single channel.
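The ablation logic above can be sketched as follows. This is a minimal stand-in, not the paper's procedure: a synthetic "multimodal voxel" mixes three toy channel embeddings, and zeroing channels before a least-squares readout shows how prediction accuracy drops when a region's driving modalities are removed.

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 200, 8  # timepoints, per-channel embedding dim (illustrative sizes)

# Toy embeddings for the three input channels.
channels = {m: rng.standard_normal((T, D)) for m in ("text", "audio", "video")}

# A synthetic multimodal voxel whose signal mixes all three channels,
# like the temporal-parietal-occipital junction described above.
weights = {m: rng.standard_normal(D) for m in channels}
voxel = sum(channels[m] @ weights[m] for m in channels)

def predict(active):
    """Fit a least-squares readout with inactive channels zeroed out,
    and return the correlation between the fit and the voxel signal."""
    x = np.concatenate(
        [channels[m] if m in active else np.zeros((T, D)) for m in channels],
        axis=1,
    )
    beta, *_ = np.linalg.lstsq(x, voxel, rcond=None)
    fit = x @ beta
    return float(np.corrcoef(fit, voxel)[0, 1])

r_all = predict({"text", "audio", "video"})
r_text_only = predict({"text"})
assert r_all > r_text_only  # ablating channels lowers accuracy in a multimodal region
```

For a voxel driven by a single modality, by contrast, ablating the other two channels would leave the fit essentially unchanged, which is how this kind of analysis localizes modality-specific regions.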
Significant Limitations
TRIBE v2 treats the brain as a passive sensory receiver, without modeling active decision-making or motor output. Because fMRI measures neural activity indirectly via blood flow, multi-second delays obscure millisecond-scale neural dynamics. The model also covers only three sensory channels; smell, touch, and balance remain unmapped. And it cannot capture developmental changes or clinical conditions.
Accuracy varies by stimulus type and brain region, with some areas showing notably lower prediction quality than others.
Availability and Impact
Meta has released TRIBE v2's code, weights, and an interactive demo free of charge. The researchers propose three primary use cases: prototyping expensive neuroscience experiments computationally before committing lab time, building more brain-like AI architectures, and accelerating neuroscience research by easing measurement bottlenecks.
For neuroscience, the model could substantially lower research costs by allowing researchers to prototype experiments computationally before committing resources to actual fMRI studies.
What This Means
TRIBE v2 demonstrates that large-scale multimodal AI models trained on neuroimaging data can capture generalizable patterns of human brain function. This has immediate practical value for neuroscience labs, potentially cutting experimental timelines and costs. However, the model's limitations—treating the brain as passive, missing temporal resolution, and incomplete sensory coverage—mean it complements rather than replaces empirical neuroscience. The scaling improvements with more training data suggest future versions will improve as fMRI datasets expand.