Meta's TRIBE v2 AI predicts brain activity from images, audio, and speech with 70,000-voxel fMRI mapping

TL;DR

Meta's FAIR lab released TRIBE v2, an AI model that predicts human brain activity from images, audio, and text. Trained on over 1,000 hours of fMRI data from 720 subjects, the model maps predictions to 70,000 voxels, and its predictions often match group-average brain responses more closely than individual subjects' own scans do.


Meta's FAIR lab released TRIBE v2, an AI model that predicts how the human brain responds to visual, auditory, and language stimuli. Trained on more than 1,000 hours of fMRI data from 720 subjects, the model maps predictions to 70,000 voxels, the 3D units that make up an fMRI scan, and those predictions often match group-average responses more closely than individual subjects' scans do.

Architecture and Training

TRIBE v2 processes three input types through separate pretrained Meta models: Llama 3.2 for text, Wav2Vec2-BERT 2.0 for audio, and V-JEPA 2 for video. These encoders generate embeddings that capture the semantic content of each stimulus. A transformer then processes all three representations jointly, identifying patterns across stimuli and subjects, and a final subject-specific layer translates the fused output into predicted brain activation maps.
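The article does not give layer dimensions, so the PyTorch sketch below only illustrates the described shape of the pipeline: projected modality embeddings, a shared fusion transformer, and a subject-specific readout. All sizes, names, and the pooling scheme are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalBrainEncoder(nn.Module):
    """Toy TRIBE-style encoder: three modality streams fused by a shared
    transformer, followed by a subject-specific linear readout."""

    def __init__(self, d_text=2048, d_audio=1024, d_video=1024,
                 d_model=512, n_subjects=4, n_voxels=1000):
        super().__init__()
        # Project each frozen encoder's embeddings into a shared space.
        self.proj_text = nn.Linear(d_text, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_video = nn.Linear(d_video, d_model)
        # Shared transformer attends across modalities and time.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # One readout head per subject. (The full model's 720-subject,
        # 70,000-voxel readout would be far larger in practice.)
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, n_voxels) for _ in range(n_subjects)])

    def forward(self, text_emb, audio_emb, video_emb, subject_id):
        # Inputs: (batch, time, d_modality) embeddings from the frozen
        # pretrained encoders (Llama 3.2, Wav2Vec2-BERT 2.0, V-JEPA 2).
        tokens = torch.cat([self.proj_text(text_emb),
                            self.proj_audio(audio_emb),
                            self.proj_video(video_emb)], dim=1)
        fused = self.fusion(tokens)                       # (batch, 3*time, d_model)
        return self.heads[subject_id](fused.mean(dim=1))  # (batch, n_voxels)
```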

The model generalizes to new subjects without retraining and shows steady accuracy improvements with additional training data—a scaling pattern similar to large language models.
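As a hypothetical illustration of that scaling pattern, a standard way to summarize it is a power-law fit of accuracy against training hours in log-log space; the numbers below are invented for demonstration.

```python
import numpy as np

hours = np.array([50, 100, 250, 500, 1000])          # training data (hours)
accuracy = np.array([0.12, 0.15, 0.19, 0.23, 0.28])  # e.g. mean voxel correlation

# A linear fit in log-log space recovers the power-law exponent b in
# accuracy ~ a * hours^b, the same functional form used for LLM scaling laws.
b, log_a = np.polyfit(np.log(hours), np.log(accuracy), 1)
print(f"accuracy ~ {np.exp(log_a):.3f} * hours^{b:.2f}")
```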

Performance and Validation

In testing, TRIBE v2's predictions often correlated more strongly with the actual group-average brain response than individual subjects' own scans did. On the Human Connectome Project dataset (7-Tesla fMRI), the model reached twice the median individual subject's correlation with the group average.
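A minimal sketch of the comparison this implies, assuming (time, voxels) response arrays with placeholder data: correlate both the model's prediction and each subject's scan against the group-average response, voxel by voxel, and compare medians. The leave-one-out averaging is our addition, to keep the subject scores unbiased.

```python
import numpy as np

def voxelwise_corr(x, y):
    """Pearson correlation per voxel between two (time, voxels) arrays."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0)) + 1e-8
    return (xc * yc).sum(axis=0) / denom

# Placeholder data: (subjects, time, voxels) measured BOLD plus one model
# prediction for the same stimulus; a real analysis would load actual scans.
bold = np.random.randn(10, 200, 1000)
pred = np.random.randn(200, 1000)

group_avg = bold.mean(axis=0)
model_score = np.median(voxelwise_corr(pred, group_avg))
# Leave each subject out of the average they are compared against.
subject_scores = [
    np.median(voxelwise_corr(bold[i], np.delete(bold, i, axis=0).mean(axis=0)))
    for i in range(len(bold))
]
print(model_score, np.median(subject_scores))
```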

Compared with optimized linear baseline models, TRIBE v2 showed significant improvements across all datasets tested. The predecessor TRIBE v1, which was trained on only four subjects and predicted 1,000 voxels, won the Algonauts 2025 competition against 263 other teams; v2 substantially expands that system's scope and accuracy.
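The article does not specify the baseline, but a typical linear encoding baseline in this literature is ridge regression from stimulus embeddings to voxel responses. A minimal sketch with placeholder dimensions:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

X = np.random.randn(500, 3072)   # (timepoints, concatenated embedding dims)
Y = np.random.randn(500, 1000)   # (timepoints, voxels)

# Fit one regularized linear map from features to all voxels, with the
# ridge strength chosen automatically by built-in cross-validation.
baseline = RidgeCV(alphas=np.logspace(-2, 4, 7))
baseline.fit(X[:400], Y[:400])
pred = baseline.predict(X[400:])  # held-out predictions to score against fMRI
```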

Replicating Decades of Neuroscience

Using controlled test protocols from the Individual Brain Charting dataset, researchers validated TRIBE v2 against classical neuroscience findings. In visual tasks, the model correctly identified specialized brain regions for faces, places, bodies, and characters. In language experiments, it replicated known patterns, distinguishing speech from silence and emotional from physical pain processing, and showing the expected left-hemisphere activation for complete sentences versus word lists.
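A sketch of how such a "virtual localizer" could work, with a stubbed-in predictor standing in for the real model: predict responses to two stimulus categories and subtract the mean maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(stimulus):
    """Stand-in for TRIBE-style inference: returns a (n_voxels,) predicted
    activation map for one stimulus; real use would call the released model."""
    return rng.standard_normal(1000)

face_maps = np.stack([predict(img) for img in range(20)])   # 20 face stimuli
place_maps = np.stack([predict(img) for img in range(20)])  # 20 place stimuli

# Classical localizer contrast: mean(faces) - mean(places). Strongly
# positive voxels mark candidate face-selective regions (FFA-like).
contrast = face_maps.mean(axis=0) - place_maps.mean(axis=0)
face_selective = np.argsort(contrast)[-100:]
```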

By selectively disabling input channels, the team mapped which sensory modality drives activity in specific regions: audio predicts activity near the auditory cortex, video maps to the visual cortex, and text activates language areas. In multimodal regions such as the temporal-parietal-occipital junction, using all three channels improved prediction accuracy by up to 50 percent compared with single channels.
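A minimal sketch of that ablation, assuming an encoder with the interface from the earlier sketch: zero out one modality's embeddings, re-run the model, and see which voxels lose accuracy.

```python
import torch

def ablate(encoder, text, audio, video, subject_id, drop):
    """Re-run the encoder with one modality's embeddings zeroed out."""
    inputs = {"text": text, "audio": audio, "video": video}
    inputs[drop] = torch.zeros_like(inputs[drop])  # silence one channel
    return encoder(inputs["text"], inputs["audio"], inputs["video"], subject_id)

# Voxels whose accuracy drops most when audio is removed are, by this
# logic, audio-driven (e.g. near auditory cortex):
#   full         = voxelwise_corr(encoder(t, a, v, 0), measured)
#   no_audio     = voxelwise_corr(ablate(encoder, t, a, v, 0, "audio"), measured)
#   audio_driven = full - no_audio
```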

Significant Limitations

TRIBE v2 treats the brain as a passive sensory receiver without modeling active decision-making or motor output. fMRI's indirect measurement via blood flow introduces multi-second delays, obscuring millisecond-scale neural dynamics. The model covers only three sensory channels; smell, touch, and balance remain unmapped. It cannot capture developmental changes or clinical conditions.
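To see why the blood-flow delay matters, here is a small illustration (not from the paper): convolving a brief neural event with a canonical double-gamma hemodynamic response function, a standard approximation, yields a BOLD signal that peaks several seconds later.

```python
import numpy as np
from scipy.stats import gamma

t = np.arange(0, 30, 0.1)                        # seconds
hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)  # canonical double-gamma HRF
neural = np.zeros_like(t)
neural[10] = 1.0                                 # brief neural event at t = 1 s
bold = np.convolve(neural, hrf)[: len(t)]        # what the scanner sees
print(f"BOLD peaks {t[bold.argmax()] - t[10]:.1f} s after the neural event")
```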

Accuracy varies by stimulus type and brain region, with some areas showing notably lower prediction quality than others.

Availability and Impact

Meta has freely released TRIBE v2's code, weights, and an interactive demo. The researchers propose three primary use cases: planning expensive neuroscience experiments computationally before committing lab time, building more brain-like AI architectures, and accelerating neuroscience research by reducing measurement bottlenecks.

For neuroscience, the model could substantially lower research costs by allowing researchers to prototype experiments computationally before committing resources to actual fMRI studies.

What This Means

TRIBE v2 demonstrates that large-scale multimodal AI models trained on neuroimaging data can capture generalizable patterns of human brain function. This has immediate practical value for neuroscience labs, potentially cutting experimental timelines and costs. However, the model's limitations—treating the brain as passive, missing temporal resolution, and incomplete sensory coverage—mean it complements rather than replaces empirical neuroscience. The scaling improvements with more training data suggest future versions will improve as fMRI datasets expand.

