T2S-Bench benchmark reveals text-to-structure reasoning gap across 45 AI models
Researchers introduced T2S-Bench, a new benchmark with 1,800 samples spanning 6 scientific domains and 32 structural types, evaluating text-to-structure reasoning in 45 mainstream models. The benchmark reveals substantial capability gaps: average accuracy on multi-hop reasoning tasks is only 52.1%, while Structure-of-Thought (SoT) prompting alone yields an average +5.7% improvement across eight text-processing tasks.
New Benchmark Exposes Text-to-Structure Reasoning Weakness in Large Language Models
A new research paper introduces T2S-Bench, the first benchmark specifically designed to measure how well large language models convert text into structured formats—a capability fundamental to tasks like information extraction, knowledge graph construction, and document understanding.
The benchmark evaluates 45 mainstream models across 1,800 carefully constructed samples spanning 6 scientific domains and 32 different structural types. Results reveal significant performance gaps. The average accuracy on multi-hop reasoning tasks stands at just 52.1%, and even the best-performing models achieve only 58.1% node accuracy in end-to-end extraction tasks.
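Node accuracy in this context can be read as the fraction of gold-standard nodes a model recovers in its extracted structure. The paper's exact metric may differ; the function below is only an illustrative sketch of that idea:

```python
# Illustrative node-accuracy computation: the share of gold nodes that
# appear in the model's extracted structure. This is a sketch of the
# general idea, not necessarily the paper's exact scoring rule.

def node_accuracy(predicted_nodes, gold_nodes):
    """Return the fraction of gold nodes recovered by the model (0.0-1.0)."""
    if not gold_nodes:
        return 1.0
    gold = set(gold_nodes)
    hits = sum(1 for node in set(predicted_nodes) if node in gold)
    return hits / len(gold)

# Example: the model recovers 2 of 3 gold entities.
score = node_accuracy({"aspirin", "ibuprofen"}, {"aspirin", "ibuprofen", "naproxen"})
```

Under a metric like this, the reported 58.1% would mean that even the strongest models miss roughly four in ten target nodes end to end.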
Structure-of-Thought Prompting
Alongside the benchmark, researchers propose Structure-of-Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures before generating final outputs. Testing on Qwen2.5-7B-Instruct shows SoT delivers consistent improvements: an average +5.7% gain across eight diverse text-processing tasks, with fine-tuning on T2S-Bench data pushing improvements to +8.6%.
The approach mirrors human problem-solving: humans extract key points, infer relationships between them, and organize information into structures that guide reasoning. SoT replicates this process by having models generate structural representations as intermediate steps.
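As an illustration only (the paper's exact prompt wording is not given here), an SoT-style prompt might ask the model to emit an explicit intermediate structure before producing its final answer. The template and instruction wording below are assumptions:

```python
# Hypothetical Structure-of-Thought prompt template. The wording and
# the triple-based structure format are illustrative assumptions, not
# the paper's actual prompt.

SOT_TEMPLATE = """\
Text: {text}

Step 1: List the key entities mentioned in the text.
Step 2: List the relations between them as (head, relation, tail) triples.
Step 3: Using only the structure from Steps 1-2, answer the question.

Question: {question}
"""

def build_sot_prompt(text, question):
    """Fill the template so the model builds a structure before answering."""
    return SOT_TEMPLATE.format(text=text, question=question)

prompt = build_sot_prompt(
    "Aspirin inhibits COX-1, which produces thromboxane.",
    "What enzyme does aspirin act on?",
)
```

The key design choice is that the structured representation is generated in-context, so the final answer is conditioned on it, mirroring the extract-then-reason pattern the researchers describe.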
Benchmark Design and Coverage
T2S-Bench covers scientific domains including chemistry, materials science, biomedicine, physics, computer science, and earth science. The 32 structural types range from molecular structures to semantic relationships, entity networks, and process flows. Researchers emphasize "rigorous construction" to ensure accuracy, fairness, and quality across all samples.
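Concretely, a text-to-structure sample of this kind pairs a source passage with a gold target structure. The field names below are hypothetical, sketched for illustration rather than taken from the released schema:

```python
from dataclasses import dataclass, field

# Hypothetical shape of a T2S-Bench-style sample. The actual released
# dataset may use different field names and serialization.

@dataclass
class T2SSample:
    domain: str                 # e.g. "biomedicine", one of the 6 domains
    structure_type: str         # e.g. "entity_network", one of the 32 types
    text: str                   # source passage to convert
    nodes: list = field(default_factory=list)   # gold nodes
    edges: list = field(default_factory=list)   # gold (head, relation, tail) edges

sample = T2SSample(
    domain="biomedicine",
    structure_type="entity_network",
    text="Aspirin inhibits COX-1.",
    nodes=["Aspirin", "COX-1"],
    edges=[("Aspirin", "inhibits", "COX-1")],
)
```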
The evaluation's breadth (45 models on the benchmark itself, with SoT tested on three model families across eight tasks) provides broad evidence that text-to-structure reasoning remains a weak point across current LLMs, regardless of architecture or training approach.
Reproducibility and Access
The researchers have publicly released both the T2S-Bench dataset and the evaluation code at https://t2s-bench.github.io/T2S-Bench-Page/, enabling other researchers to test new models and develop improved approaches.
What This Means
T2S-Bench fills a measurement gap: prior work evaluated broad reasoning or specific downstream tasks, but no standardized benchmark existed specifically for text-to-structure conversion. The benchmark shows this capability lags significantly behind other LLM competencies. The +5.7% baseline improvement from SoT prompting—without fine-tuning—suggests this gap may be addressable through better prompting strategies or structured training data. For organizations relying on LLMs for knowledge extraction, information retrieval, or document analysis, these results indicate current models are operating well below theoretical capability on this task class.