Study shows RL training enables LLMs to abstain on unanswerable temporal questions, outperforming GPT-4o

A new arXiv study presents the first systematic evaluation of training large language models to abstain—refuse to answer—on temporal questions they cannot reliably answer. Using reinforcement learning with abstention-aware rewards, researchers achieved 3.46-5.80% higher accuracy on temporal QA benchmarks than GPT-4o, while improving true positive rates on unanswerable questions by 20%.

Large language models typically generate confident but inaccurate answers rather than admitting uncertainty. A new research paper addresses this critical reliability problem by training models to abstain—explicitly refuse to answer—when they lack sufficient evidence, particularly on temporal reasoning tasks.

The study, posted to arXiv as paper 2602.04755, presents the first empirical investigation of jointly optimizing abstention behavior and temporal reasoning in LLMs. Researchers framed abstention as a teachable skill and developed a training pipeline combining Chain-of-Thought supervision with reinforcement learning guided by abstention-aware reward signals.

Key Performance Results

Experiments using Qwen2.5-1.5B-Instruct as the base model demonstrated substantial improvements:

  • 3.46% higher Exact Match accuracy on TimeQA-Easy compared to GPT-4o
  • 5.80% higher Exact Match accuracy on TimeQA-Hard compared to GPT-4o
  • 20% improvement in True Positive rate on unanswerable questions versus supervised fine-tuning alone
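Metrics like these can be scored with simple functions. The sketch below assumes a single abstention token and basic answer normalization; both are illustrative choices, not the paper's exact evaluation protocol:

```python
def exact_match(pred: str, gold: str) -> bool:
    """Exact Match after simple normalization (lowercase, collapsed whitespace)."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)

def abstention_tp_rate(preds, answerable, abstain_token="unanswerable"):
    """Fraction of genuinely unanswerable questions the model refused.

    `preds` are model outputs; `answerable[i]` is True if question i has a gold answer.
    """
    refusals = [
        p.strip().lower() == abstain_token
        for p, a in zip(preds, answerable)
        if not a  # only score the unanswerable subset
    ]
    return sum(refusals) / len(refusals) if refusals else 0.0
```

A higher true positive rate here means the model refuses more of the questions it genuinely cannot answer, which is the behavior the RL reward targets.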

The RL-trained model learned to appropriately refuse questions rather than generating plausible-sounding but incorrect answers—a critical capability for deployment in domains where false confidence creates safety risks.

Training Methodology

Researchers compared two primary training approaches:

Supervised Fine-Tuning (SFT): Direct instruction on question-answer pairs with abstention labels. The study found SFT induces overconfidence, harming model reliability despite maintaining reasonable performance.

Reinforcement Learning (RL): Trains models with reward signals that explicitly incentivize correct answers while penalizing both false positives and false negatives on unanswerable questions. RL substantially improved accuracy and raised true positive rates, though analysis indicated it exhibits reliability risks similar to SFT's under certain conditions.
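A reward with this shape can be sketched as follows. The specific magnitudes and the abstention token are hypothetical placeholders, since the paper's exact reward design is not reproduced here:

```python
def abstention_reward(pred: str, gold: str, answerable: bool,
                      abstain_token: str = "unanswerable") -> float:
    """Hypothetical abstention-aware reward; the values are illustrative."""
    abstained = pred.strip().lower() == abstain_token
    if answerable:
        if abstained:
            return -0.5  # false negative: refused a question that had an answer
        # reward correct answers, penalize confident wrong ones
        return 1.0 if pred.strip().lower() == gold.strip().lower() else -1.0
    if abstained:
        return 1.0       # true positive: correctly refused an unanswerable question
    return -1.0          # false positive: answered when it should have abstained
```

Note the asymmetry: a wrong refusal on an answerable question is penalized less than a confident wrong answer, which nudges the policy toward abstaining under uncertainty rather than guessing.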

Information Type Effectiveness

The study systematically evaluated how different information sources affect temporal reasoning with abstention:

  • Explicit Chain-of-Thought supervision: Most effective, providing clear reasoning pathways
  • Implicit cues (original context, temporal sub-context): Limited benefit when reasoning must handle temporal ambiguity
  • Knowledge graphs: Provided minimal improvement over baseline approaches

This finding suggests that explicit reasoning supervision is necessary for models to handle temporal uncertainty reliably, rather than relying on implicit contextual signals.

Implications for LLM Reliability

The research identifies a fundamental tension: while RL improves prediction accuracy and reduces false positives on unanswerable questions, both training approaches maintain underlying reliability vulnerabilities. The study argues this indicates abstention and reasoning optimization require continued investigation beyond standard fine-tuning approaches.

Temporal QA represents a particularly challenging domain because models must track facts across different time periods without conflating information. The ability to abstain on genuinely unanswerable temporal questions has direct applications in news summarization, historical fact verification, and time-sensitive information retrieval systems.
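To make the temporal-tracking difficulty concrete, here is a toy answerability check over time-scoped facts. The fact store and interface are invented for illustration and are not the paper's setup:

```python
from typing import Optional

# Toy fact store: (subject, attribute) -> [(start_year, end_year, value), ...]
FACTS = {
    ("Angela Merkel", "position"): [(2005, 2021, "Chancellor of Germany")],
}

def answer_or_abstain(subject: str, attribute: str, year: int) -> Optional[str]:
    """Return the value that held at `year`, or None to abstain."""
    for start, end, value in FACTS.get((subject, attribute), []):
        if start <= year <= end:
            return value
    return None  # no fact covers this period, so the safe move is to abstain
```

Querying the 2005-2021 window returns the stored fact, while a year outside it yields None: exactly the kind of "no evidence for this period" case where a trained model should refuse rather than extrapolate.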

What This Means

This research demonstrates that abstention is trainable rather than inherent, and that reinforcement learning with appropriate reward signals can substantially improve both accuracy and appropriate refusal behavior. However, the finding that both SFT and RL maintain certain reliability vulnerabilities suggests that current training paradigms may have fundamental limitations for building maximally reliable LLMs. For practitioners, the results indicate that RL-based approaches should be prioritized over pure supervised fine-tuning when deployment requires appropriate uncertainty estimation, particularly in temporal reasoning applications.