FreeAct framework relaxes quantization constraints for multimodal and diffusion LLMs
Researchers propose FreeAct, a quantization framework that abandons static one-to-one transformation constraints to handle dynamic activation patterns in multimodal and diffusion LLMs. The method assigns token-specific transformation matrices to activations while keeping weights unified, demonstrating up to 5.3% performance improvements over existing approaches.
A new research paper proposes FreeAct, a quantization framework designed to handle dynamic activation patterns in large language models more effectively than existing transformation-based methods.
The Problem with Current Quantization Methods
Quantization remains essential for reducing LLM memory and computational requirements. Recent transformation-based approaches have improved quantization performance by projecting feature spaces onto smoother manifolds using orthogonal matrices. However, these methods enforce a rigid one-to-one transformation constraint that treats all inputs identically—a significant limitation for models that process varied input types.
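The one-to-one constraint can be illustrated with a small NumPy sketch (not the paper's code; all names here are illustrative). An orthogonal matrix Q rotates the activations, its transpose is folded into the weights, and the product is exactly preserved; the rotation's benefit is that it spreads outlier channels across coordinates before quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))
X[:, 2] *= 50.0                    # an outlier channel, as often seen in LLM activations
W = rng.normal(size=(d, d))

# A random orthogonal matrix Q (Q @ Q.T == I), obtained via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# One-to-one constraint: the activation-side transform Q is exactly
# undone by the weight-side transform Q.T, so the output is unchanged.
assert np.allclose((X @ Q) @ (Q.T @ W), X @ W)

# The rotation mixes the outlier channel into all coordinates, which
# typically flattens the range a per-tensor quantizer must cover.
print("max |X|   =", np.abs(X).max())
print("max |X Q| =", np.abs(X @ Q).max())
```

Because the same Q is applied to every input token, this scheme cannot adapt when different token types occupy very different activation regions.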
Diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs) present particular challenges. Vision tokens, text tokens, and masked tokens exhibit distinct activation distributions that static transformations cannot accommodate effectively.
How FreeAct Works
FreeAct addresses this limitation by relaxing the static constraint and enabling dynamic activation transformations. The framework leverages the rank-deficient nature of activations to derive an extended solution space beyond simple inverse matrices. This theoretical foundation allows the method to decouple activation transformations from weight transformations.
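The role of rank deficiency can be made concrete with a small sketch (an assumption-laden illustration, not the paper's derivation). If the activations X only span an r-dimensional subspace, then any transform T that acts as the identity on that subspace, but is otherwise free, leaves X @ T @ W unchanged, even though T is not the inverse of anything on the weight side:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 3
# Rank-deficient activations: every row lies in an r-dimensional subspace.
X = rng.normal(size=(4, r)) @ rng.normal(size=(r, d))
W = rng.normal(size=(d, d))

# Projector N onto the null directions of X's row space (via SVD).
_, s, Vt = np.linalg.svd(X)
rank = int((s > 1e-10).sum())
N = np.eye(d) - Vt[:rank].T @ Vt[:rank]

# T = I + (anything supported on the null directions): since X @ N == 0,
# we get X @ T == X, so the matmul output is preserved exactly.
T = np.eye(d) + N @ rng.normal(size=(d, d))
assert not np.allclose(T, np.eye(d))        # T is genuinely different from I
assert np.allclose(X @ T, X)
assert np.allclose((X @ T) @ W, X @ W)
```

The free null-space component is what enlarges the solution space beyond exact inverse pairs, and it is per-input: different activation subspaces admit different valid transforms.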
In practice, FreeAct identifies token-specific dynamics and assigns distinct transformation matrices on the activation side. Crucially, weight transformations remain unified and static, avoiding unnecessary complexity in the weight space while enabling fine-grained control over activations.
This asymmetric approach—dynamic activations paired with static weights—represents a departure from conventional quantization strategies that apply uniform transformations across all components.
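The pairing above can be sketched as follows. This is a hypothetical construction under stated assumptions (rank-deficient per-type activations; a shared orthogonal weight-side rotation Qw), not FreeAct's actual algorithm: each token type gets its own activation transform built from the shared rotation plus a free component in that type's null directions, so one static transformed weight matrix serves them all.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 3
W = rng.normal(size=(d, d))
Qw, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_t = Qw.T @ W   # single static weight-side transform, applied once offline

def token_transform(X_type, seed):
    """Hypothetical per-token-type activation transform: the shared
    rotation Qw plus a free term supported on the null directions of
    this token type's rank-deficient activations."""
    _, s, Vt = np.linalg.svd(X_type)
    rank = int((s > 1e-10).sum())
    N = np.eye(d) - Vt[:rank].T @ Vt[:rank]       # null-direction projector
    M = np.random.default_rng(seed).normal(size=(d, d))
    return (np.eye(d) + N @ M) @ Qw               # X @ T @ W_t == X @ W

# Two token types (say, vision and text) with different activation subspaces.
X_vis = rng.normal(size=(5, r)) @ rng.normal(size=(r, d))
X_txt = rng.normal(size=(5, r)) @ rng.normal(size=(r, d))

T_vis = token_transform(X_vis, 0)
T_txt = token_transform(X_txt, 1)

# Distinct activation-side transforms, one unified weight transform,
# and the full-precision output is preserved for each token type.
assert not np.allclose(T_vis, T_txt)
assert np.allclose((X_vis @ T_vis) @ W_t, X_vis @ W)
assert np.allclose((X_txt @ T_txt) @ W_t, X_txt @ W)
```

Keeping the weight side static means the transformed weights can be quantized once offline, while the per-type activation transforms are selected at inference time by token type.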
Experimental Results
The researchers conducted extensive experiments across both diffusion LLMs and multimodal models. FreeAct demonstrated significant performance improvements over baseline quantization methods, achieving up to 5.3% improvement in tested scenarios. The paper includes detailed analyses examining how token-type-specific transformations contribute to these gains.
The distinction between vision and text token handling appears particularly important for multimodal models, where a single static transformation would compromise accuracy across heterogeneous data types.
What This Means
FreeAct suggests that efficient quantization requires acknowledging the fundamental heterogeneity in modern LLM architectures. As models handle increasingly diverse input modalities and employ specialized token types (masked, padding, special tokens), one-size-fits-all quantization approaches leave performance on the table.
The practical impact depends on adoption. If implemented in production inference systems, FreeAct could reduce memory overhead and latency for multimodal and diffusion models without sacrificing accuracy. The method's focus on activation-side transformations also suggests potential compatibility with existing weight-quantization techniques.
The authors note that code will be publicly released, enabling community evaluation and potential integration into quantization frameworks used by major model developers.