LLM News

Every LLM release, update, and milestone.

research

ms-Mamba outperforms Transformer models on time-series forecasting with fewer parameters

Researchers introduced ms-Mamba, a multi-scale Mamba architecture for time-series forecasting that outperforms recent Transformer and Mamba-based models while using significantly fewer parameters. On the Solar-Energy dataset, ms-Mamba achieved 0.229 mean-squared error versus 0.240 for S-Mamba while using only 3.53M parameters compared to 4.77M.
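The reported numbers (0.229 vs. 0.240) are mean-squared error, the standard forecasting metric. As a quick illustration of what is being compared, here is a minimal MSE computation on made-up toy values, not the Solar-Energy dataset:

```python
# Illustrative only: mean-squared error, the metric the ms-Mamba results
# are reported in. The series below are toy numbers, not real data.
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual   = [0.8, 1.1, 0.9, 1.4]
forecast = [0.7, 1.2, 1.0, 1.3]
print(round(mse(actual, forecast), 3))  # 0.01
```

Lower is better, so ms-Mamba's 0.229 edges out S-Mamba's 0.240 while also using fewer parameters (3.53M vs. 4.77M).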

benchmark

AMA-Bench reveals major gaps in LLM agent memory systems with real-world evaluation

Researchers introduce AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based autonomous agents using real-world trajectories and synthetic scaling. Existing memory systems underperform because they lack causal structure and rely on lossy similarity-based retrieval. The proposed AMA-Agent system, which combines causality graphs with tool-augmented retrieval, achieves 57.22% accuracy, outperforming baselines by 11.16 percentage points.
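To see what "lossy similarity-based retrieval" means, here is a minimal sketch (not AMA-Agent's actual method, and with toy hand-picked embeddings rather than a learned encoder): memories are ranked purely by cosine similarity to a query vector, so causal links between stored events play no role in what gets retrieved, which is the gap the benchmark probes.

```python
# Minimal sketch of plain similarity-based memory retrieval: rank stored
# memories by cosine similarity to a query embedding and keep the top k.
# Causal relationships between memories are ignored entirely.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query, memories, k=2):
    """Return the k memory texts most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query, m["vec"]), reverse=True)
    return [m["text"] for m in ranked[:k]]

memories = [
    {"text": "booked flight",    "vec": [1.0, 0.1]},  # toy embeddings
    {"text": "flight cancelled", "vec": [0.9, 0.2]},
    {"text": "ordered lunch",    "vec": [0.0, 1.0]},
]
print(retrieve([1.0, 0.0], memories))  # ['booked flight', 'flight cancelled']
```

A causality-graph approach would instead follow explicit links between events (e.g. the cancellation caused a rebooking) rather than relying on embedding proximity alone.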

2 min read · via arxiv.org
benchmark · OpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming that most tasks are flawed enough to reject correct solutions. The company also argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.

2 min read · via the-decoder.com