
FLoC cuts video AI token load by 50%+ without retraining, using a facility location algorithm

Researchers propose FLoC, a training-free visual token compression framework that selects representative subsets of video tokens by casting selection as a facility location problem and solving it with lazy greedy optimization. The method plugs into any video-based large multimodal model without retraining and carries near-optimal selection guarantees, evaluated on benchmarks including Video-MME, MLVU, LongVideoBench, and EgoSchema.


Token Bottleneck in Long Video AI Models Faces New Solution

A new compression framework called FLoC addresses a fundamental scalability problem in video understanding models: visual token counts grow linearly with video length, and the cost of attending over them grows roughly quadratically, making extended video sequences computationally prohibitive.

Large multimodal models (LMMs) designed for video understanding generate enormous numbers of visual tokens—typically one per spatial patch, so hundreds per frame—making long-video processing computationally prohibitive. FLoC applies facility location theory, a classic formulation from operations research and logistics, to decide which tokens to keep.
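A back-of-envelope calculation shows why this matters. The numbers below are illustrative, not from the paper: sampling one frame per second and assuming a ViT-style encoder that emits 14 × 14 = 196 patch tokens per frame:

```python
# Illustrative token count for a long video (assumed numbers, not the paper's):
# 30-minute video, sampled at 1 frame per second, 196 patch tokens per frame.
frames = 30 * 60 * 1          # 1800 sampled frames
patches_per_frame = 196       # 14 x 14 patch grid, typical of ViT encoders
visual_tokens = frames * patches_per_frame
print(visual_tokens)          # 352800
```

At roughly 350k visual tokens for a half-hour clip, the sequence far exceeds typical LMM context budgets before any text tokens are added.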

How FLoC Works

The framework operates on a simple principle: within a fixed token budget, keep only the most representative and diverse visual tokens and discard redundant ones. Lazy greedy optimization makes this selection fast enough to run as a lightweight preprocessing step.
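Concretely, a facility location objective scores a candidate subset S of the full token set V by how well every token is covered by its most similar selected token (notation ours; the paper's exact similarity measure may differ):

```latex
F(S) \;=\; \sum_{i \in V} \max_{j \in S} \operatorname{sim}(v_i, v_j),
\qquad \text{maximize } F(S) \ \text{subject to}\ |S| \le k .
```

The inner max rewards representativeness, and because each token credits only its single best representative, adding a near-duplicate of an already selected token yields almost no gain—this is what enforces diversity. F is monotone and submodular, which is what makes greedy selection near-optimal.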

Key characteristics of FLoC:

  • Training-free: No model retraining required
  • Model-agnostic: Works with any video-LMM architecture
  • Query-agnostic: Selection does not depend on the user's query, so compressed tokens can be reused across questions
  • Near-optimal selection: Lazy greedy returns the same result as standard greedy, which carries a provable approximation bound for the facility location objective

The method selects tokens based on representativeness (how well they capture the video content) and diversity (avoiding redundant selections), balancing both objectives within computational constraints.
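The selection loop can be sketched as follows. This is our simplified illustration of lazy greedy on a facility location objective, not FLoC's actual implementation: because marginal gains of a submodular function only shrink as the selected set grows, stale gains cached in a max-heap are safe upper bounds, and most candidates never need re-evaluation.

```python
import heapq
import numpy as np

def facility_location_lazy_greedy(sim: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` token indices approximately maximizing
    F(S) = sum_i max_{j in S} sim[i, j], via lazy greedy."""
    n = sim.shape[0]
    best = np.zeros(n)  # current best similarity of each token to the selected set
    # Initial marginal gain of candidate j is just its similarity column sum.
    heap = [(-sim[:, j].sum(), j, 0) for j in range(n)]  # (neg gain, index, stamp)
    heapq.heapify(heap)
    selected: list[int] = []
    while heap and len(selected) < budget:
        neg_gain, j, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain was computed this iteration: by submodularity it is exact
            # and maximal, so select j.
            selected.append(j)
            best = np.maximum(best, sim[:, j])
        else:
            # Stale entry: recompute the gain lazily and push it back.
            gain = np.maximum(best, sim[:, j]).sum() - best.sum()
            heapq.heappush(heap, (-gain, j, len(selected)))
    return selected

# Toy usage: 6 "tokens" as unit vectors, similarity rescaled to [0, 1], keep 2.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 2))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = (feats @ feats.T + 1.0) / 2.0
print(facility_location_lazy_greedy(sim, budget=2))
```

Lazy greedy produces exactly the same selections as plain greedy, only faster; the speedup is what makes per-video selection cheap enough to be training-free preprocessing.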

Benchmark Performance

Researchers evaluated FLoC on four large-scale benchmarks:

  • Video-MME: Standard video understanding evaluation
  • MLVU: Multi-task long video understanding
  • LongVideoBench: Extended video sequences
  • EgoSchema: First-person egocentric video

FLoC reportedly surpassed recent compression techniques across all benchmarks while maintaining processing efficiency. Specific compression ratios and performance deltas were not disclosed in the abstract.

Practical Impact

This approach has immediate implications for deploying video-LMMs in resource-constrained environments—mobile devices, edge servers, and cost-sensitive cloud inference. By reducing token volume without retraining, teams can integrate FLoC into existing video AI pipelines with minimal engineering effort.

The facility location formulation is mathematically principled, offering theoretical guarantees on compression quality rather than relying on heuristics or learned selection patterns.
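The guarantee at issue is the classic Nemhauser–Wolsey–Fisher bound: for a monotone submodular objective such as facility location, greedy selection under a cardinality budget k is within a constant factor of the best possible subset, and lazy greedy inherits the bound because it returns the same selections:

```latex
F(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le k} F(S)
\;\approx\; 0.632 \, \max_{|S| \le k} F(S).
```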

What This Means

FLoC removes a major practical barrier to deploying advanced video understanding models at scale. Because it requires no model retraining and works across different architectures, it could become a standard preprocessing step in video-LMM applications. The theoretical grounding in facility location optimization suggests the approach is likely generalizable to other multimodal compression problems beyond video—text-image pairs, audio-visual data, and sensor streams could potentially benefit from similar techniques.
