benchmark
HSSBench: New benchmark reveals MLLMs struggle with humanities and social sciences reasoning
Researchers have released HSSBench, a benchmark for evaluating multimodal large language models (MLLMs) on humanities and social sciences (HSS) tasks, an area where existing benchmarks remain sparse. It contains over 13,000 samples spanning six key categories in multiple languages, and evaluations show that even state-of-the-art models struggle with the cross-disciplinary reasoning these domains require.