FLoC cuts video AI token load by 50%+ without retraining, using a facility location algorithm
Researchers propose FLoC, a training-free visual token compression framework that selects representative subsets of video tokens by optimizing a facility location objective with the lazy greedy algorithm. The method works with any video-based large multimodal model without retraining, its greedy selection carries near-optimality guarantees, and it was evaluated on benchmarks including Video-MME, MLVU, LongVideoBench, and EgoSchema.
Token Bottleneck in Long Video AI Models Faces New Solution
A new compression framework called FLoC addresses a fundamental scalability problem in video understanding models: the rapid growth of visual tokens as video length increases.
Large multimodal models (LMMs) designed for video understanding generate enormous numbers of visual tokens, typically many per frame (one for each spatial patch), making long video processing computationally prohibitive. FLoC applies facility location theory, traditionally used in logistics optimization, to this problem.
How FLoC Works
The framework operates on a simple principle: identify and keep only the most representative and diverse visual tokens within a fixed budget, discarding redundant information. It uses the lazy greedy algorithm to make token selection fast and efficient.
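In facility location terms, each kept token "covers" every other token in proportion to their similarity, and a subset's score is the total coverage it provides. A minimal sketch of that objective, assuming a precomputed pairwise similarity matrix (the variable names are illustrative, not from the paper):

```python
import numpy as np

def facility_location(sim, selected):
    """Facility-location score of a token subset.

    sim: (n, n) nonnegative pairwise similarity matrix over all tokens.
    selected: indices of the kept tokens.
    Every token is credited with its best similarity to the kept set,
    so the score rewards subsets that represent the whole video well.
    """
    return sim[:, selected].max(axis=1).sum()

sim = np.array([[1.0, 0.5],
                [0.5, 1.0]])
facility_location(sim, [0])      # 1.5: token 1 only partially covered
facility_location(sim, [0, 1])   # 2.0: both tokens fully covered
```

Because adding a token can only raise each token's best coverage, this objective is monotone and submodular, which is what makes greedy selection provably effective.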
Key characteristics of FLoC:
- Training-free: No model retraining required
- Model-agnostic: Works with any video-LMM architecture
- Query-agnostic: Doesn't require task-specific tuning
- Near-optimal selection: lazy greedy inherits the (1 − 1/e) approximation guarantee of greedy submodular maximization
The method selects tokens based on representativeness (how well they capture the video content) and diversity (avoiding redundant selections), balancing both objectives within computational constraints.
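A compact sketch of lazy greedy selection under these assumptions (illustrative code, not the authors' implementation): marginal gains are cached in a max-heap, and a candidate is re-scored only when it reaches the top, which is safe because submodularity guarantees gains never increase as the selected set grows.

```python
import heapq
import numpy as np

def lazy_greedy_select(sim, budget):
    """Pick `budget` token indices maximizing the facility-location
    objective f(S) = sum_v max_{s in S} sim[v, s] via lazy greedy.
    sim: (n, n) nonnegative pairwise similarity matrix."""
    n = sim.shape[0]
    coverage = np.zeros(n)  # current best similarity to the selected set
    # Initial marginal gains are each column's total similarity (coverage = 0).
    heap = [(-sim[:, j].sum(), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < budget and heap:
        _, j = heapq.heappop(heap)
        # Re-evaluate this candidate's gain against the current coverage.
        gain = np.maximum(sim[:, j] - coverage, 0.0).sum()
        if not heap or gain >= -heap[0][0]:
            # Still beats every cached gain, so it beats every true gain too.
            selected.append(j)
            coverage = np.maximum(coverage, sim[:, j])
        else:
            heapq.heappush(heap, (-gain, j))  # stale; re-queue with fresh gain
    return selected
```

For example, with three tokens where token 0 resembles both others while tokens 1 and 2 differ from each other, the first pick is the broadly representative token 0, and the second pick adds diversity rather than a near-duplicate.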
Benchmark Performance
Researchers evaluated FLoC on four large-scale benchmarks:
- Video-MME: Standard video understanding evaluation
- MLVU: Multi-task long video understanding
- LongVideoBench: Extended video sequences
- EgoSchema: First-person egocentric video
FLoC reportedly surpassed recent compression techniques across all benchmarks while maintaining processing efficiency. Specific compression ratios and performance deltas were not disclosed in the abstract.
Practical Impact
This approach has immediate implications for deploying video-LMMs in resource-constrained environments—mobile devices, edge servers, and cost-sensitive cloud inference. By reducing token volume without retraining, teams can integrate FLoC into existing video AI pipelines with minimal engineering effort.
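As a sketch of what that integration might look like, here is a hypothetical preprocessing step inserted between a vision encoder and the language model. The function name, budget value, and plain (non-lazy) greedy scoring are illustrative assumptions, not the FLoC implementation:

```python
import numpy as np

def compress_tokens(tokens, budget):
    # Hypothetical drop-in step between a vision encoder and the language
    # model: keep `budget` representative rows of `tokens` via a plain
    # greedy pass over a cosine-similarity facility-location objective.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = np.clip(normed @ normed.T, 0.0, None)  # nonnegative similarities
    coverage = np.zeros(len(tokens))
    selected = []
    for _ in range(min(budget, len(tokens))):
        # Marginal gain of each candidate column given current coverage.
        gains = np.maximum(sim - coverage[:, None], 0.0).sum(axis=0)
        gains[selected] = -1.0  # never re-pick an already-selected token
        j = int(gains.argmax())
        selected.append(j)
        coverage = np.maximum(coverage, sim[:, j])
    return tokens[selected]

# e.g. visual_tokens: (num_frames * patches_per_frame, hidden_dim)
# compressed = compress_tokens(visual_tokens, budget=256)
```

Because the step only filters rows of the token matrix, it can sit in front of an existing model without touching its weights, which is the practical appeal of training-free compression.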
The facility location formulation is mathematically principled, offering theoretical guarantees on compression quality rather than relying on heuristics or learned selection patterns.
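Concretely, the guarantee invoked here is the classic (1 − 1/e) bound for greedy maximization of a monotone submodular function, which the facility location objective satisfies. In standard notation (a sketch, not the paper's exact formulation):

```latex
f(S) \;=\; \sum_{v \in V} \max_{s \in S} \operatorname{sim}(v, s),
\qquad
f(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)
\max_{\substack{S \subseteq V \\ |S| \le k}} f(S).
```

Lazy greedy returns exactly the same set as plain greedy, only faster, so the bound carries over unchanged.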
What This Means
FLoC removes a major practical barrier to deploying advanced video understanding models at scale. Because it requires no model retraining and works across different architectures, it could become a standard preprocessing step in video-LMM applications. The theoretical grounding in facility location optimization suggests the approach is likely generalizable to other multimodal compression problems beyond video—text-image pairs, audio-visual data, and sensor streams could potentially benefit from similar techniques.