LLM News

Every LLM release, update, and milestone.

research

Timer-S1: 8.3B time series foundation model achieves state-of-the-art forecasting on GIFT-Eval

Researchers have introduced Timer-S1, a Mixture-of-Experts time series foundation model with 8.3 billion total parameters and 750 million activated parameters per token. The model achieves state-of-the-art forecasting performance on the GIFT-Eval leaderboard, with the best MASE and CRPS scores among pre-trained models.
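For readers unfamiliar with MASE, the following is a minimal sketch of the standard Mean Absolute Scaled Error definition: the forecast's mean absolute error divided by the in-sample error of a seasonal-naive forecast. The function name and the `season` parameter are illustrative choices, not taken from the Timer-S1 paper or the GIFT-Eval tooling, which may aggregate scores differently.

```python
def mase(actual, forecast, history, season=1):
    """Mean Absolute Scaled Error.

    actual, forecast: values over the forecast horizon.
    history: training series used to scale the error.
    season: lag of the naive baseline (1 = previous value).
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal lag")

    # Mean absolute error of the forecast over the horizon.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # In-sample mean absolute error of the seasonal-naive forecast.
    naive_errors = [
        abs(history[t] - history[t - season])
        for t in range(season, len(history))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive baseline error is zero; MASE is undefined")

    return mae / scale
```

A value below 1 means the forecast beats the naive baseline on average; lower is better.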

2 min read · via arxiv.org
research

New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.

research

1.58-bit BitNet models naturally support structured sparsity with minimal accuracy loss

Researchers have demonstrated that 1.58-bit quantized language models are naturally more compatible with semi-structured N:M sparsity than full-precision models. The Sparse-BitNet framework applies both techniques together, achieving up to 1.30× speedups in training and inference while showing smaller accuracy degradation than full-precision baselines at equivalent sparsity levels.
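To make the N:M pattern concrete, here is a minimal sketch of magnitude-based N:M pruning on a single weight row: in every group of M consecutive weights, only the N largest by absolute value are kept. The function name and the 2:4 default are assumptions for illustration; this is not the Sparse-BitNet training procedure, which the summary does not describe in detail.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")

    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(v if i in keep else 0 for i, v in enumerate(group))
    return pruned
```

A ternary group such as `[1, 0, -1, 0]` already satisfies the 2:4 pattern and passes through unchanged, which gives some intuition for why weights restricted to {-1, 0, 1} might tolerate this kind of pruning well.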

2 min read · via arxiv.org
research

Progressive Residual Warmup improves LLM pretraining stability and convergence speed

Researchers propose Progressive Residual Warmup (ProRes), a pretraining technique that staggers layer learning by gradually warming residual connections from 0 to 1, with deeper layers taking longer to activate. The method demonstrates faster convergence, stronger generalization, and improved downstream performance across multiple model scales and initialization schemes.
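One way to picture the staggered warmup is a per-layer scaling factor that ramps from 0 to 1, with deeper layers ramping over more steps. The sketch below assumes a linear ramp and hypothetical parameters (`base_warmup_steps`, `extra_steps_per_layer`); the actual ProRes schedule and hyperparameters may differ.

```python
def residual_scale(step, layer_index,
                   base_warmup_steps=1000, extra_steps_per_layer=100):
    """Hypothetical linear warmup factor for one layer's residual branch.

    Returns a value in [0, 1]. Deeper layers (larger layer_index) use a
    longer warmup, so they reach full strength later in training.
    """
    warmup_steps = base_warmup_steps + extra_steps_per_layer * layer_index
    return min(1.0, max(0.0, step / warmup_steps))


def schedule_at(step, num_layers):
    """Scaling factors for all layers at a given training step."""
    return [residual_scale(step, layer) for layer in range(num_layers)]
```

Under this assumed schedule, layer 0 is fully active at step 1,000 while layer 10 is only halfway there, so shallow layers settle before deeper ones start contributing at full strength.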

research

Study shows LLMs can fact-check using internal knowledge without external retrieval

A new arXiv paper challenges the dominant retrieval-based fact-checking approach by demonstrating that LLMs can verify factual claims using only their parametric knowledge. The study introduces INTRA, a method leveraging internal model representations that outperforms logit-based approaches and shows robust generalization across long-tail knowledge, multilingual claims, and long-form generation.

research

Researchers Identify 'Contextual Inertia' Bug in LLMs During Multi-Turn Conversations

Researchers have identified a critical failure mode in large language models called 'contextual inertia', in which models ignore new information in multi-turn conversations and rigidly stick to previous reasoning. A new training method called RLSTA uses single-turn performance as an anchor to stabilize multi-turn reasoning and recover performance lost to this phenomenon.

research

Researchers propose Mixture of Universal Experts to scale MoE models via depth-width transformation

Researchers have introduced Mixture of Universal Experts (MoUE), a generalization of Mixture-of-Experts architectures that adds a new scaling dimension called virtual width. The approach reuses a shared expert pool across layers while maintaining fixed per-token computation, achieving up to 1.3% improvements over standard MoE baselines and enabling 4.2% gains when converting existing MoE checkpoints.

research

New framework improves VLM spatial reasoning through minimal information selection

A new research paper introduces MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that improves Vision-Language Models' ability to reason about 3D spatial relationships. The method addresses two key bottlenecks: inadequate 3D understanding from 2D-centric training and reasoning failures from redundant information.

research

ButterflyMoE achieves 150× memory reduction for mixture-of-experts models via geometric rotations

Researchers introduce ButterflyMoE, a technique that replaces independent expert weight matrices with learned geometric rotations applied to a shared quantized substrate. The method reduces memory scaling from linear to sub-linear in the number of experts, achieving 150× compression at 256 experts with negligible accuracy loss on language modeling tasks.

research

Research: Contrastive refinement reduces AI model over-refusal without sacrificing safety

Researchers propose DCR (Discernment via Contrastive Refinement), a pre-alignment technique that reduces the tendency of safety-aligned language models to reject benign prompts while preserving rejection of genuinely harmful content. The method addresses a core trade-off in current safety alignment: reducing over-refusal typically degrades harm-detection capabilities.

research

New Method Reduces AI Over-Refusal Without Sacrificing Safety Alignment

A new alignment technique called Discernment via Contrastive Refinement (DCR) addresses a persistent problem in safety-aligned LLMs: over-refusal, where models reject benign requests as toxic. The method uses contrastive refinement to help models better distinguish genuinely harmful prompts from superficially toxic ones, reducing refusals while preserving safety.

research

Researchers use LLMs to simulate misinformation susceptibility across demographics with 92% accuracy

Researchers have developed BeliefSim, a framework that uses Large Language Models to simulate how different demographic groups respond to misinformation by modeling their underlying beliefs. The approach achieved 92% accuracy in predicting susceptibility across multiple datasets and conditioning strategies.

research

Spectral Surgery: Training-Free Method Improves LoRA Adapters Without Retraining

Researchers propose Spectral Surgery, a training-free refinement method that improves Low-Rank Adaptation (LoRA) adapters by decomposing trained weights via SVD and selectively reweighting singular values based on gradient-estimated component sensitivity. The approach achieves consistent gains across Llama-3.1-8B and Qwen3-8B—up to +4.4 points on CommonsenseQA and +2.4 pass@1 on HumanEval—by adjusting only ~1,000 scalar coefficients.

research

Study reveals preference leakage bias when LLMs judge synthetically-trained models

A new arXiv paper identifies preference leakage, a fundamental contamination problem in LLM-based evaluation where language models used as judges systematically favor models trained on data they synthesized. The researchers confirm the bias occurs across multiple model families and benchmarks, making it harder to detect than previously known LLM judge biases.

research

Researchers identify and fix critical toggle control failure in multimodal GUI agents

A new arXiv paper identifies a significant blind spot in multimodal agents: they fail to reliably execute toggle control instructions on graphical user interfaces, particularly when the current state already matches the desired state. Researchers propose State-aware Reasoning (StaR), a method that improves toggle instruction accuracy by over 30% across four existing multimodal agents while also enhancing general task performance.

research

New RLVR method reformulates reward-based LLM training as classification problem

A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than scalar weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.

research

Diffusion language models memorize less training data than autoregressive models, study finds

A new arXiv study systematically characterizes memorization behavior in diffusion language models (DLMs) and finds they exhibit substantially lower memorization-based leakage of personally identifiable information compared to autoregressive language models. The research establishes a theoretical framework showing that sampling resolution directly correlates with exact training data extraction.

research

CoDAR framework shows continuous diffusion language models can match discrete approaches

A new paper identifies token rounding as the primary bottleneck limiting continuous diffusion language models (DLMs) and proposes CoDAR, a two-stage framework that combines continuous embedding-space diffusion with a contextual autoregressive decoder. Experiments on LM1B and OpenWebText show CoDAR achieves competitive performance with discrete diffusion approaches while offering tunable fluency-diversity trade-offs.

research

New benchmark reveals LLMs lose controllability at finer behavioral levels

A new arXiv paper introduces SteerEval, a hierarchical benchmark for measuring how well large language models can be controlled across language features, sentiment, and personality. The research reveals that existing steering methods degrade significantly at finer-grained behavioral specification levels, raising concerns for deployment in sensitive domains.

research

VC-STaR: Researchers use visual contrast to reduce hallucinations in VLM reasoning

Researchers propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a self-improving framework that addresses a fundamental challenge in vision language models: hallucinations in visual reasoning. The approach uses contrastive VQA pairs—visually similar images with equivalent questions—to improve how VLMs identify relevant visual cues and generate more accurate reasoning paths.

research

Researchers propose DiSE, a self-evaluation method for diffusion language models

Researchers have proposed DiSE, a self-evaluation method designed to assess output quality in diffusion language models (dLLMs) by computing token regeneration probabilities. The technique enables efficient confidence quantification for models that generate text bidirectionally rather than sequentially, addressing a key limitation in quality assessment.

research

WAFFLE fine-tuning improves multimodal models for web development by 9 percentage points

Researchers introduce WAFFLE, a fine-tuning methodology that enhances multimodal models' ability to convert UI designs into HTML code. The approach uses structure-aware attention mechanisms and contrastive learning to bridge the gap between visual UI designs and text-based HTML, achieving up to 9 percentage point improvements on benchmark tasks.

research

DynFormer rethinks Transformers for physics simulations, cutting PDE solver errors by 95%

Researchers propose DynFormer, a Transformer variant designed specifically for solving partial differential equations (PDEs) that models physical systems at multiple scales simultaneously. By replacing uniform attention with specialized modules for different physical scales, DynFormer achieves up to 95% error reduction compared to existing neural operator baselines while consuming significantly less GPU memory.

research

New safety steering technique reduces unsafe T2I outputs without degrading image quality

Researchers introduce Conditioned Activation Transport (CAT), a technique that reduces unsafe content generation in text-to-image models during inference without the quality degradation seen in previous linear steering approaches. The method uses a contrastive dataset of 2,300 safe/unsafe prompt pairs and geometry-based conditioning to target only unsafe activation regions.

research

AI agent outperforms 9 of 10 human hackers in live penetration testing study

A new AI agent framework called ARTEMIS discovered 9 valid vulnerabilities in live penetration testing against a university network with ~8,000 hosts, outperforming 9 of 10 human cybersecurity professionals. The system achieved an 82% valid submission rate and costs $18/hour compared to $60/hour for professional penetration testers, though it struggles with GUI-based tasks and produces higher false-positive rates.

research

DiaBlo: Diagonal Block Finetuning Matches Full Model Performance With Lower Cost

Researchers propose DiaBlo, a parameter-efficient finetuning (PEFT) method that updates only diagonal blocks of model weight matrices, achieving comparable performance to full-model finetuning while maintaining LoRA-level efficiency. The approach eliminates low-rank matrix dependencies and provides theoretical guarantees of convergence.
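As a rough illustration of what updating only diagonal blocks means, the sketch below builds a 0/1 mask over a square weight matrix in which only entries inside the diagonal blocks are trainable. The function names and the square-matrix restriction are assumptions made for brevity, not details from the DiaBlo paper.

```python
def diagonal_block_mask(dim, num_blocks):
    """0/1 mask for a dim x dim matrix: 1 inside diagonal blocks, 0 elsewhere."""
    if dim % num_blocks != 0:
        raise ValueError("dim must be divisible by num_blocks")
    size = dim // num_blocks
    # Entry (r, c) is trainable when row and column fall in the same block.
    return [
        [1 if r // size == c // size else 0 for c in range(dim)]
        for r in range(dim)
    ]


def trainable_fraction(dim, num_blocks):
    """Share of the matrix entries that would receive updates."""
    mask = diagonal_block_mask(dim, num_blocks)
    return sum(sum(row) for row in mask) / (dim * dim)
```

With `num_blocks` blocks, only `1 / num_blocks` of the entries are updated, so the block count plays a role loosely comparable to the rank in LoRA when trading parameter count against capacity.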

research

Alignment tuning shrinks LLM output diversity by 2-5x, new research shows

A new arXiv paper introduces the Branching Factor (BF), a metric quantifying output diversity in large language models, and finds that alignment tuning reduces this diversity by 2-5x overall—and up to 10x at early generation positions. The research suggests alignment doesn't fundamentally change model behavior but instead steers outputs toward lower-entropy token sequences already present in base models.

research

SiNGER framework improves vision transformer distillation by suppressing high-norm artifacts

Researchers introduce SiNGER (Singular Nullspace-Guided Energy Reallocation), a knowledge distillation framework that improves how Vision Transformer features transfer to smaller student models. The method suppresses high-norm artifacts that degrade representation quality while preserving informative signals from teacher models.

research

MedXIAOHE: New medical vision-language model claims state-of-the-art performance on clinical benchmarks

Researchers have published MedXIAOHE, a medical multimodal foundation model designed for clinical applications. According to the authors, the model achieves state-of-the-art performance across diverse medical benchmarks and surpasses several closed-source multimodal systems on multiple capabilities.

research

DeepXiv-SDK releases three-layer agentic interface for scientific literature access

DeepXiv-SDK introduces a three-layer agentic data interface designed to give LLM agents efficient, cost-aware access to scientific literature. The system transforms unstructured data into normalized JSON, offers retrieval tools via CLI, MCP, and Python SDK, and currently covers the complete arXiv corpus with daily synchronization.

2 min read · via arxiv.org