LLM News

Every LLM release, update, and milestone.

research

ms-Mamba outperforms Transformer models on time-series forecasting with fewer parameters

Researchers introduced ms-Mamba, a multi-scale Mamba architecture for time-series forecasting that outperforms recent Transformer and Mamba-based models while using significantly fewer parameters. On the Solar-Energy dataset, ms-Mamba achieved 0.229 mean-squared error versus 0.240 for S-Mamba while using only 3.53M parameters compared to 4.77M.
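The reported numbers (0.229 vs. 0.240) are mean-squared error, the standard forecasting metric. As a quick illustration of what is being compared, here is a minimal MSE computation on made-up toy values, not the Solar-Energy dataset:

```python
# Illustrative only: mean-squared error, the metric the ms-Mamba results
# are reported in. The series below are toy numbers, not real data.
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual   = [0.8, 1.1, 0.9, 1.4]
forecast = [0.7, 1.2, 1.0, 1.3]
print(round(mse(actual, forecast), 3))  # 0.01
```

Lower is better, so ms-Mamba's 0.229 edges out S-Mamba's 0.240 while also using fewer parameters (3.53M vs. 4.77M).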

benchmark

AMA-Bench reveals major gaps in LLM agent memory systems with real-world evaluation

Researchers introduce AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based autonomous agents using real-world trajectories and synthetic scaling. Existing memory systems underperform because they lack causal structure and rely on lossy similarity-based retrieval. The proposed AMA-Agent system, which combines causality graphs with tool-augmented retrieval, achieves 57.22% accuracy, outperforming baselines by 11.16 percentage points.
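To see what "lossy similarity-based retrieval" means, here is a minimal sketch (not AMA-Agent's actual method, and with toy hand-picked embeddings rather than a learned encoder): memories are ranked purely by cosine similarity to a query vector, so causal links between stored events play no role in what gets retrieved, which is the gap the benchmark probes.

```python
# Minimal sketch of plain similarity-based memory retrieval: rank stored
# memories by cosine similarity to a query embedding and keep the top k.
# Causal relationships between memories are ignored entirely.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query, memories, k=2):
    """Return the k memory texts most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query, m["vec"]), reverse=True)
    return [m["text"] for m in ranked[:k]]

memories = [
    {"text": "booked flight",    "vec": [1.0, 0.1]},  # toy embeddings
    {"text": "flight cancelled", "vec": [0.9, 0.2]},
    {"text": "ordered lunch",    "vec": [0.0, 1.0]},
]
print(retrieve([1.0, 0.0], memories))  # ['booked flight', 'flight cancelled']
```

A causality-graph approach would instead follow explicit links between events (e.g. the cancellation caused a rebooking) rather than relying on embedding proximity alone.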

2 min read · via arxiv.org
benchmark · OpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming that most tasks are flawed enough to reject correct solutions. The company also argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.

2 min read · via the-decoder.com