SureLock cuts masked diffusion language model decoding compute by 30-50%

Researchers propose SureLock, a technique that cuts decoding FLOPs for masked diffusion language models by 30-50% on LLaDA-8B by skipping attention and feed-forward computation for tokens whose predictions have converged. The method caches key-value pairs for locked positions while continuing to compute for unlocked tokens, reducing per-iteration complexity from O(N²d) to O(MNd).

Researchers have developed SureLock, a computational optimization technique for masked diffusion language models that eliminates redundant calculations during token generation, achieving 30-50% FLOP reductions while maintaining generation quality.

The Problem

Masked diffusion language models generate sequences through iterative sampling that progressively unmasks tokens across multiple steps. The standard approach recomputes the attention and feed-forward blocks for every token position at every iteration, including tokens whose predictions have stabilized and no longer change meaningfully. This wastes substantial computation, particularly as fixed tokens accumulate over the course of decoding.
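To make the redundancy concrete, here is a toy sketch of the standard decode loop (all names are hypothetical, not the LLaDA implementation): every iteration runs the model over every position, even ones that were unmasked long ago.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, T = 16, 32, 8    # toy sequence length, vocab size, sampling steps
MASK = V - 1           # reserved mask token id

def model_logits(tokens):
    # Stand-in for the transformer forward pass. The point: it is run
    # over *all* N positions every call, regardless of how many are
    # still masked.
    return rng.standard_normal((len(tokens), V - 1))

tokens = np.full(N, MASK)
forward_positions = 0
for _ in range(T):
    logits = model_logits(tokens)            # recomputes every position
    forward_positions += N
    masked = np.flatnonzero(tokens == MASK)
    # Unmask the N/T most confident masked positions this step.
    conf = logits[masked].max(axis=-1)
    pick = masked[np.argsort(conf)[-(N // T):]]
    tokens[pick] = logits[pick].argmax(axis=-1)

print((tokens == MASK).any(), forward_positions)  # prints: False 128
```

After 8 steps the sequence is fully unmasked, yet 16 × 8 = 128 position computations were performed; by the final steps almost all of that work goes to tokens that no longer change.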

SureLock's Approach

The technique introduces a "locking" mechanism: when the posterior probability distribution at an unmasked token position stabilizes across successive iterations (the "sure condition"), SureLock locks that position. Once locked, the model skips the query projection and feed-forward sublayers for that token in subsequent iterations. The token's attention keys and values remain cached, however, so unlocked positions can continue attending to locked tokens as normal.
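A minimal single-head sketch of this mechanism, assuming a simple L1 drift test for the sure condition (the threshold `TAU` and all variable names are hypothetical; the real criterion and the FFN are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 16          # toy sequence length and model dimension
TAU = 1e-3            # hypothetical "sure condition" stability threshold

# Stand-in attention weights for one head.
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

locked = np.zeros(N, dtype=bool)
k_cache = np.zeros((N, d))
v_cache = np.zeros((N, d))
prev_post = None

def step(x, posteriors):
    """One decode iteration with token locking: K/V are refreshed only
    for unlocked positions (locked K/V come from the cache), and queries
    are computed only for the M unlocked positions, so locked tokens
    skip the rest of the block while remaining attendable."""
    global prev_post
    active = ~locked
    k_cache[active] = x[active] @ Wk
    v_cache[active] = x[active] @ Wv
    q = x[active] @ Wq                      # M queries, not N
    scores = q @ k_cache.T / np.sqrt(d)     # (M, N): attend over all keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = x.copy()                          # locked positions pass through
    out[active] = w @ v_cache
    # Lock positions whose posterior stopped moving since last iteration.
    if prev_post is not None:
        locked[np.abs(posteriors - prev_post).sum(axis=-1) < TAU] = True
    prev_post = posteriors.copy()
    return out

x = rng.standard_normal((N, d))
flat = np.full((N, 4), 0.25)                # iteration-1 posteriors
moved = flat.copy()
moved[4:] = [0.70, 0.10, 0.10, 0.10]        # positions 4-7 still shifting
x = step(x, flat)    # no previous posterior yet: nothing locks
x = step(x, moved)   # positions 0-3 unchanged, so they lock
print(locked)        # first four positions True, rest False
```

The key design point this illustrates: locked tokens drop out of the query/FFN path but their cached keys and values keep feeding the attention of everything still being decoded.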

This inference-time change reduces the per-iteration computational complexity from O(N²d), where N is the sequence length and d the model dimension, to O(MNd), where M ≤ N is the number of still-unlocked positions. Since M shrinks as decoding progresses, the savings compound across iterations.
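A back-of-envelope calculation shows how that compounding plays out under an idealized schedule in which a constant number of positions locks per step (the settings below are arbitrary; the paper's 30-50% figure comes from real runs on LLaDA-8B, not this toy):

```python
# Idealized FLOP comparison: N tokens, T iterations, N/T positions
# locking per step once their predictions stabilize.
N, d, T = 1024, 4096, 256

baseline = T * N * N * d                 # every step: N queries over N keys
locked_per_step = N // T
surelock = sum((N - t * locked_per_step) * N * d for t in range(T))

print(f"FLOP reduction: {1 - surelock / baseline:.0%}")  # prints: FLOP reduction: 50%
```

With a linear locking schedule the average M is about N/2, so the attention FLOPs drop by roughly half; slower or later locking yields smaller savings, which is consistent with the reported 30-50% range.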

Empirical Results

On LLaDA-8B, SureLock achieved 30-50% algorithmic FLOP reductions relative to the same sampler without token locking, while generation quality remained comparable to the unmodified baseline, indicating the technique does not degrade model outputs.

Theoretical Justification

The researchers provide a theoretical analysis showing that monitoring only the local KL divergence at the lock step suffices to bound the deviation in the final token probabilities. This validates the core design choice: the locking decision can be made locally, without tracking probabilities globally, which keeps the overhead of the check small.
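A sketch of what such a local check looks like in practice, assuming a hypothetical threshold `EPS` (the paper's actual bound and threshold are not specified here):

```python
import numpy as np

def local_kl(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions at one position."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

EPS = 1e-4  # hypothetical lock threshold

# A position whose posterior did not move this step vs. one still shifting.
stable_prev = np.array([0.90, 0.05, 0.03, 0.02])
stable_curr = stable_prev.copy()
moving_curr = np.array([0.60, 0.30, 0.05, 0.05])

print(local_kl(stable_prev, stable_curr) < EPS)  # prints: True  (lock it)
print(local_kl(stable_prev, moving_curr) < EPS)  # prints: False (keep computing)
```

The appeal of the result is that this per-position, per-step quantity is all that must be inspected: no history of full-sequence distributions needs to be retained to guarantee the final probabilities stay close.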

What This Means

Masked diffusion models represent an alternative decoding paradigm to standard autoregressive generation. If these models see wider adoption, SureLock's optimization could significantly reduce inference costs at scale: a 30-50% FLOP reduction translates into lower latency and resource consumption to the extent that decoding is compute-bound. Real-world impact, however, depends on masked diffusion LMs becoming competitive with autoregressive approaches in production settings. The technique is theoretically grounded and empirically validated on a single 8B-parameter model, but its scaling behavior and applicability across architectures remain open questions.
