HSSBench: New benchmark reveals MLLMs struggle with humanities and social sciences reasoning
Researchers have released HSSBench, a new benchmark designed to evaluate multimodal large language models on humanities and social sciences tasks, areas where current benchmarks are sparse. The benchmark contains over 13,000 samples across six key categories in multiple languages, and testing shows that even state-of-the-art models struggle significantly with the cross-disciplinary reasoning that HSS domains require.
A new benchmark called HSSBench has exposed significant weaknesses in how multimodal large language models handle humanities and social sciences tasks, areas that existing evaluation frameworks have historically overlooked in favor of STEM reasoning.
Benchmark Overview
HSSBench contains over 13,000 meticulously designed samples covering six key categories across multiple languages, including the six official UN languages. The benchmark was created through a novel data generation pipeline that combines domain experts with automated agents to iteratively refine each sample.
The researchers tested more than 20 mainstream multimodal models on the benchmark. Results show that even state-of-the-art models perform significantly below human-level accuracy on HSS tasks.
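To make the setup concrete, the sketch below shows one way an HSSBench-style multiple-choice sample and a per-category accuracy pass could be represented in Python. The field names, category label, file path, and toy data are assumptions made for illustration; they are not the benchmark's published schema or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical sample record; every field name here is assumed, not taken from HSSBench.
@dataclass
class HSSSample:
    sample_id: str
    category: str        # one of the six HSS categories (label assumed)
    language: str        # e.g. "en", "fr", "zh" -- the six official UN languages in the full set
    image_path: str      # visual context the question refers to
    question: str
    choices: list[str]
    answer_index: int    # index of the correct entry in choices

def accuracy_by_category(samples: list[HSSSample],
                         predictions: dict[str, int]) -> dict[str, float]:
    """Per-category accuracy, with model predictions keyed by sample_id."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in samples:
        total[s.category] = total.get(s.category, 0) + 1
        if predictions.get(s.sample_id) == s.answer_index:
            correct[s.category] = correct.get(s.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}

# Toy usage with a single made-up sample and a matching prediction.
samples = [
    HSSSample("q001", "History", "en", "images/q001.png",
              "Which period does the depicted artifact belong to?",
              ["Bronze Age", "Classical antiquity", "Middle Ages", "Renaissance"], 1),
]
print(accuracy_by_category(samples, {"q001": 1}))   # {'History': 1.0}
```

Reporting accuracy per category (and per language) rather than as a single aggregate is what lets a benchmark like this localize where cross-disciplinary reasoning breaks down.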
What Makes HSS Tasks Different
According to the paper, humanities and social sciences tasks demand fundamentally different cognitive approaches from those exercised by STEM-oriented benchmarks:
- Horizontal interdisciplinary thinking rather than vertical step-by-step reasoning
- Deep integration of knowledge across related fields
- Linking abstract concepts to corresponding visual representations
- Cross-disciplinary reasoning abilities that connect information across domains
Current benchmarks for multimodal models primarily emphasize general knowledge and STEM-style problem-solving, which explains why this capability gap went largely unmeasured until now.
The Data Generation Pipeline
The researchers developed a specialized methodology for creating HSS benchmark samples. Domain experts collaborate with automated agents to generate and iteratively refine each sample, ensuring quality and relevance to actual humanities and social sciences challenges.
This approach differs from typical benchmark creation, which often relies on crowdsourcing or automated generation alone. The hybrid expert-agent pipeline appears designed to capture the nuanced requirements of humanities reasoning.
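The article describes this pipeline only at a high level. As a rough illustration, the loop below shows how an agent-critique, expert-review cycle could be organized; the function names, quality checks, and stopping rule are all assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

# Minimal sketch of an expert-in-the-loop refinement cycle. The checks and names
# below are placeholders standing in for whatever the real pipeline does.

@dataclass
class DraftSample:
    question: str
    choices: list[str]
    notes: list[str] = field(default_factory=list)   # accumulated critique

def agent_critique(draft: DraftSample) -> list[str]:
    """Stand-in for an automated agent pass: flag obvious structural problems."""
    issues = []
    if len(draft.choices) < 4:
        issues.append("needs at least four answer choices")
    if not draft.question.endswith("?"):
        issues.append("question should be phrased as a question")
    return issues

def refine(draft: DraftSample,
           revise: Callable[[DraftSample, list[str]], DraftSample],
           expert_accepts: Callable[[DraftSample], bool],
           max_rounds: int = 3) -> DraftSample:
    """Alternate automated critique with expert review until a draft is accepted."""
    for _ in range(max_rounds):
        issues = agent_critique(draft)
        if issues:
            draft = revise(draft, issues)          # agent addresses its own critique
            continue
        if expert_accepts(draft):                  # domain expert signs off
            return draft
        draft = revise(draft, ["expert requested changes"])
    return draft                                   # best effort if iteration stalls

# Toy usage: the revise step just records feedback, and the "expert" is a trivial callback.
def toy_revise(d: DraftSample, feedback: list[str]) -> DraftSample:
    d.notes.extend(feedback)
    return d

draft = DraftSample("Which dynasty is depicted in the mural?",
                    ["Tang", "Song", "Ming", "Qing"])
print(refine(draft, toy_revise, expert_accepts=lambda d: True).notes)   # []
```

The point of such a structure is that automated agents handle cheap, repeatable checks while scarce expert attention goes to final acceptance, which is plausibly how a hybrid pipeline scales to more than 13,000 samples.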
Implications for MLLM Development
The benchmark's findings suggest that current multimodal models may be optimized primarily for factual retrieval and STEM reasoning. Their difficulty with HSS tasks points to a gap in the ability to synthesize knowledge across disciplines and to interpret visual information in complex conceptual contexts.
This could have practical consequences for deploying MLLMs in fields like history, literature, anthropology, cultural studies, and policy research—domains where interdisciplinary knowledge integration is central to the work.
What This Means
HSSBench fills a genuine gap in MLLM evaluation. Current multimodal benchmarks (MMMU, VQA variants, and the like) are heavily skewed toward scientific and technical reasoning, leaving humanities-focused capabilities largely unexamined. This benchmark provides the evaluation framework needed to identify and measure progress on cross-disciplinary reasoning, a capability that matters for real-world applications beyond technical domains. Expect researchers to use HSSBench as a standard for tracking improvements in humanities reasoning, much as MMLU tracks general knowledge for text-only models.