HSSBench: New benchmark reveals MLLMs struggle with humanities and social sciences reasoning
Researchers have released HSSBench, a new benchmark designed to evaluate multimodal large language models on humanities and social sciences tasks, areas where current benchmarks are sparse. The benchmark contains over 13,000 samples across six key categories in multiple languages, and testing shows that even state-of-the-art models struggle significantly with the cross-disciplinary reasoning that HSS domains require.
A new benchmark called HSSBench has exposed significant weaknesses in how multimodal large language models handle humanities and social sciences tasks, areas that existing evaluation frameworks have historically overlooked in favor of STEM reasoning.
Benchmark Overview
HSSBench contains over 13,000 meticulously designed samples covering six key categories across multiple languages, including the six official UN languages. The benchmark was created through a novel data generation pipeline that combines domain experts with automated agents to iteratively refine each sample.
The researchers tested more than 20 mainstream multimodal models on the benchmark. Results show that even state-of-the-art models perform significantly below human-level accuracy on HSS tasks.
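To make the setup concrete, the sketch below shows one way an HSSBench-style multiple-choice sample and a per-category accuracy pass could be represented in Python. The field names, category label, file path, and toy data are assumptions made for illustration; they are not the benchmark's published schema or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical sample record; every field name here is assumed, not taken from HSSBench.
@dataclass
class HSSSample:
    sample_id: str
    category: str        # one of the six HSS categories (label assumed)
    language: str        # e.g. "en", "fr", "zh" -- the six official UN languages in the full set
    image_path: str      # visual context the question refers to
    question: str
    choices: list[str]
    answer_index: int    # index of the correct entry in choices

def accuracy_by_category(samples: list[HSSSample],
                         predictions: dict[str, int]) -> dict[str, float]:
    """Per-category accuracy, with model predictions keyed by sample_id."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in samples:
        total[s.category] = total.get(s.category, 0) + 1
        if predictions.get(s.sample_id) == s.answer_index:
            correct[s.category] = correct.get(s.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}

# Toy usage with a single made-up sample and a matching prediction.
samples = [
    HSSSample("q001", "History", "en", "images/q001.png",
              "Which period does the depicted artifact belong to?",
              ["Bronze Age", "Classical antiquity", "Middle Ages", "Renaissance"], 1),
]
print(accuracy_by_category(samples, {"q001": 1}))   # {'History': 1.0}
```

Reporting accuracy per category (and per language) rather than as a single aggregate is what lets a benchmark like this localize where cross-disciplinary reasoning breaks down.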
What Makes HSS Tasks Different
According to the paper, humanities and social sciences tasks demand fundamentally different cognitive approaches from those exercised by STEM-oriented benchmarks:
- Horizontal interdisciplinary thinking rather than vertical step-by-step reasoning
- Deep integration of knowledge across related fields
- Linking abstract concepts to corresponding visual representations
- Cross-disciplinary reasoning abilities that connect information across domains
Current benchmarks for multimodal models primarily emphasize general knowledge and STEM-style problem-solving, which explains why this capability gap went largely unmeasured until now.
The Data Generation Pipeline
The researchers developed a specialized methodology for creating HSS benchmark samples. Domain experts collaborate with automated agents to generate and iteratively refine each sample, ensuring quality and relevance to actual humanities and social sciences challenges.
This approach differs from typical benchmark creation, which often relies on crowdsourcing or automated generation alone. The hybrid expert-agent pipeline appears designed to capture the nuanced requirements of humanities reasoning.
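The article describes this pipeline only at a high level. As a rough illustration, the loop below shows how an agent-critique, expert-review cycle could be organized; the function names, quality checks, and stopping rule are all assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

# Minimal sketch of an expert-in-the-loop refinement cycle. The checks and names
# below are placeholders standing in for whatever the real pipeline does.

@dataclass
class DraftSample:
    question: str
    choices: list[str]
    notes: list[str] = field(default_factory=list)   # accumulated critique

def agent_critique(draft: DraftSample) -> list[str]:
    """Stand-in for an automated agent pass: flag obvious structural problems."""
    issues = []
    if len(draft.choices) < 4:
        issues.append("needs at least four answer choices")
    if not draft.question.endswith("?"):
        issues.append("question should be phrased as a question")
    return issues

def refine(draft: DraftSample,
           revise: Callable[[DraftSample, list[str]], DraftSample],
           expert_accepts: Callable[[DraftSample], bool],
           max_rounds: int = 3) -> DraftSample:
    """Alternate automated critique with expert review until a draft is accepted."""
    for _ in range(max_rounds):
        issues = agent_critique(draft)
        if issues:
            draft = revise(draft, issues)          # agent addresses its own critique
            continue
        if expert_accepts(draft):                  # domain expert signs off
            return draft
        draft = revise(draft, ["expert requested changes"])
    return draft                                   # best effort if iteration stalls

# Toy usage: the revise step just records feedback, and the "expert" is a trivial callback.
def toy_revise(d: DraftSample, feedback: list[str]) -> DraftSample:
    d.notes.extend(feedback)
    return d

draft = DraftSample("Which dynasty is depicted in the mural?",
                    ["Tang", "Song", "Ming", "Qing"])
print(refine(draft, toy_revise, expert_accepts=lambda d: True).notes)   # []
```

The point of such a structure is that automated agents handle cheap, repeatable checks while scarce expert attention goes to final acceptance, which is plausibly how a hybrid pipeline scales to more than 13,000 samples.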
Implications for MLLM Development
The benchmark's findings suggest that current multimodal models may be optimized primarily for factual retrieval and STEM reasoning. Their difficulty with HSS tasks points to a gap in the ability to synthesize knowledge across disciplines and to interpret visual information in complex conceptual contexts.
This could have practical consequences for deploying MLLMs in fields like history, literature, anthropology, cultural studies, and policy research—domains where interdisciplinary knowledge integration is central to the work.
What This Means
HSSBench fills a genuine gap in MLLM evaluation. Current multimodal benchmarks (MMMU, VQA variants, and the like) are heavily skewed toward scientific and technical reasoning, leaving humanities-focused capabilities largely unexamined. This benchmark provides the evaluation framework needed to identify and measure progress on cross-disciplinary reasoning, a capability that matters for real-world applications beyond technical domains. Expect researchers to use HSSBench as a standard for tracking improvements in humanities reasoning, much as MMLU tracks general knowledge for text-only models.