AttackSeqBench measures LLM capabilities for cybersecurity threat analysis
Researchers introduced AttackSeqBench, a benchmark for evaluating how well large language models understand and reason about cyber attack sequences in threat intelligence reports. The evaluation tested 7 LLMs and 5 reasoning models across multiple tasks, revealing gaps in their ability to extract actionable security insights from unstructured cybersecurity data.
Researchers have released AttackSeqBench, a new benchmark designed to systematically evaluate how well large language models understand and reason about adversarial attack sequences extracted from cyber threat intelligence (CTI) reports.
What the Benchmark Measures
CTI reports document observations of cyber threats by synthesizing evidence about adversaries' actions and intent. Security practitioners typically must extract and analyze attack sequences from these verbose, unstructured reports by hand, a labor-intensive process. AttackSeqBench evaluates LLMs' reasoning capabilities across three dimensions of adversarial behavior: tactics, techniques, and procedural sequences of adversarial actions.
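To make the task concrete, the following is a minimal sketch of what an attack-sequence reasoning item and its exact-match scoring might look like. The field names, tactic labels, and example text are illustrative assumptions, not the actual AttackSeqBench schema:

```python
from dataclasses import dataclass

@dataclass
class AttackSeqQuestion:
    # Hypothetical benchmark item: given the tactics observed so far in a
    # CTI report excerpt, choose the most plausible next tactic.
    context: str                  # CTI report excerpt
    observed_sequence: list[str]  # tactics observed so far, in order
    choices: list[str]            # candidate next tactics
    answer: str                   # gold label

def accuracy(questions: list[AttackSeqQuestion], predictions: list[str]) -> float:
    # Exact-match accuracy of model predictions against gold labels.
    correct = sum(p == q.answer for q, p in zip(questions, predictions))
    return correct / len(questions)

q = AttackSeqQuestion(
    context="The actor delivered a spearphishing attachment, then ran a macro payload.",
    observed_sequence=["Initial Access", "Execution"],
    choices=["Persistence", "Exfiltration", "Impact"],
    answer="Persistence",
)
print(accuracy([q], ["Persistence"]))  # 1.0
```

A real harness would render each item as a prompt, parse the model's chosen option, and aggregate accuracy per reasoning dimension, but the scoring reduces to this kind of exact-match comparison.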
The benchmark satisfies three design principles: Extensibility (ability to incorporate new threat data), Reasoning Scalability (evaluating progressively complex reasoning), and Domain-Specific Epistemic Expandability (incorporating evolving cybersecurity knowledge).
Evaluation Results
The researchers benchmarked:
- 7 LLMs (the abstract does not name the specific models)
- 5 LRMs (large reasoning models, a category distinct from standard LLMs)
- 4 post-training strategies for cybersecurity optimization
They evaluated performance across 3 benchmark settings and 3 benchmark tasks, though the abstract does not report specific accuracy scores or model names.
Key Findings
The evaluation identified both strengths and limitations in current LLMs for CTI understanding. The abstract indicates that while LLMs show promise in narrow cybersecurity tasks such as entity extraction and knowledge graph construction, their broader ability to reason over behavioral sequences and complex attack patterns remains underexplored and limited.
The researchers note that their work contributes to a deeper understanding of how LLMs can support CTI report analysis and cybersecurity operations more broadly.
Availability
The benchmark construction code, evaluation scripts, and corresponding dataset are publicly available on GitHub at https://github.com/hulkima/AttackSeqBench.
What This Means
AttackSeqBench fills a gap in LLM evaluation by focusing on a specialized but critical security task: extracting tactical and technical threat sequences from CTI reports. While current models excel at narrow NLP tasks, this benchmark suggests they still struggle with the complex, multi-step reasoning required to understand how adversaries operate. For security teams, this means current LLMs cannot yet be trusted as autonomous CTI analysts; human review remains essential for threat analysis.