Researchers propose WIM rating system to replace subjective numerical scores in LLM training
A new research paper introduces the What Is Missing (WIM) rating system, which generates model output rankings from natural-language feedback rather than subjective numerical scores. The approach integrates into existing LLM training pipelines and claims to reduce ties and increase training signal clarity compared to discrete ratings.
The Problem With Current Methods
Existing LLM preference learning approaches—including Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO)—rely on direct numerical rankings or ratings assigned by human judges. According to the research, single numerical ratings are a poor proxy for evaluating the quality of natural language outputs, as they fail to capture nuanced aspects of model performance and introduce subjectivity into the training signal.
How WIM Works
Instead of assigning a discrete numerical score, a judge writes natural-language feedback describing what the model output is missing. The system then:
- Embeds both the model output and feedback using a sentence embedding model
- Computes cosine similarity between the resulting vectors
- Converts the similarity score into a rating that ranks model outputs
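The pipeline above can be sketched in a few lines of Python. This is a toy illustration only: the `embed` function here is a bag-of-words stand-in for a real sentence embedding model, and the mapping from similarity to rating is an assumption, since the paper's exact conversion is not given in this summary.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding. A real WIM pipeline would use a
    trained sentence embedding model; this stand-in only illustrates
    the vector-space step."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def wim_rating(model_output, feedback):
    """Convert the output/feedback similarity into a scalar rating.
    Using the raw similarity directly is an assumption; the paper
    may apply a different mapping."""
    return cosine_similarity(embed(model_output), embed(feedback))
```

Because each rating is derived from a specific piece of feedback text, the rating stays traceable: to debug a surprising score, one can inspect the feedback string that produced it.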
This approach generates interpretable ratings in a limited but useful sense: each scalar rating can be traced back to the specific missing-information text that produced it, enabling qualitative debugging of preference labels.
Key Findings
The researchers observed that compared to discrete numerical ratings, WIM produces:
- Fewer ties in preference pairs
- Larger rating deltas (greater differentiation between outputs)
- A more consistently available learning signal in pairwise preference data
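The tie-reduction effect is easy to see with a small sketch. Discrete 1-to-5 ratings frequently collide, producing tied pairs that carry no preference signal, while continuous similarity-derived ratings almost never do. The ratings below are hypothetical values chosen for illustration, not results from the paper.

```python
from itertools import combinations

def preference_pairs(ratings):
    """Build pairwise preferences from per-output ratings.
    Returns (pairs, ties): pairs is a list of
    (preferred_index, other_index), and ties counts pairs that
    yield no learning signal."""
    pairs, ties = [], 0
    for i, j in combinations(range(len(ratings)), 2):
        if ratings[i] > ratings[j]:
            pairs.append((i, j))
        elif ratings[j] > ratings[i]:
            pairs.append((j, i))
        else:
            ties += 1
    return pairs, ties

# Discrete scores collide; only 3 of 6 pairs are usable here.
discrete_pairs, discrete_ties = preference_pairs([4, 4, 5, 4])

# Hypothetical continuous WIM-style ratings: all 6 pairs usable.
wim_pairs, wim_ties = preference_pairs([0.62, 0.71, 0.88, 0.64])
```

The larger rating deltas the researchers report play the same role: even when two outputs are close in quality, a continuous score usually still orders them, so the pair contributes to training.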
Practical Integration
The WIM system is designed for practical adoption. It integrates into existing training pipelines without requiring changes to underlying learning algorithms. The method can also be combined with other rating techniques and works as input to any preference learning method, making it compatible with current infrastructure.
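Because WIM ultimately emits a scalar rating per output, plugging it into an existing pipeline amounts to a thin adapter. The sketch below converts rated outputs into (chosen, rejected) text pairs of the kind a DPO-style trainer consumes; the function name and record format are hypothetical, not from the paper.

```python
def to_preference_pairs(outputs, ratings):
    """Convert WIM-rated outputs into (chosen, rejected) records
    for a DPO-style preference trainer. Hypothetical adapter: the
    paper's actual integration details are not specified here."""
    records = []
    for i in range(len(outputs)):
        for j in range(len(outputs)):
            if ratings[i] > ratings[j]:
                records.append({"chosen": outputs[i],
                                "rejected": outputs[j]})
    return records
```

Since the adapter only touches the data-preparation step, the underlying learning algorithm (PPO, DPO, or otherwise) is unchanged, which is what makes the claimed drop-in compatibility plausible.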
What This Means
WIM addresses a fundamental issue in LLM alignment: the quality and interpretability of the preference data used to train models. By replacing opaque numerical judgments with natural-language feedback that each rating can be traced back to, the approach promises both a stronger training signal and better debugging of preference labels. This could particularly benefit teams working on model improvement, where understanding why outputs differ matters as much as knowing that they do. The system's compatibility with existing methods suggests potential for near-term adoption, though real-world effectiveness will depend on implementation details and the quality of the feedback judges provide.