Researchers extend Vision Mamba sequence length 4x with separator-based pretraining
Researchers have introduced STAR (Separators for AutoRegressive pretraining), a method that extends Vision Mamba's input sequence length by 4x through strategic separator insertion between images. The STAR-B model achieved 83.5% accuracy on ImageNet-1k, demonstrating improved long-range dependency modeling in vision tasks.
The paper's central move is a simple change to autoregressive pretraining: quadruple Vision Mamba's input sequence length by concatenating multiple images, while keeping each image at its original dimensions.
The Problem
Vision Mamba, a state space model architecture, has emerged as an efficient alternative to transformers for vision tasks, particularly excelling at processing long sequences. However, current autoregressive pretraining approaches for Vision Mamba remain constrained to short sequences, failing to fully leverage the model's core strength: efficient handling of extended sequence dependencies.
The Solution: STAR
The researchers introduce a straightforward but effective innovation: inserting identical separator tokens before each image in the training sequence. These separators demarcate image boundaries within concatenated sequences, allowing the model to learn long-range dependencies across multiple images without resizing or downsampling the images themselves.
Key technical details:
- Separator insertion point: Immediately before each image's token sequence
- Sequence extension: Achieves 4x increase in input sequence length
- Dataset preservation: Original image dimensions maintained
- Training approach: Autoregressive pretraining compatible with Mamba's causal mechanism
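The paper does not publish reference code here, but the separator mechanism can be sketched in a few lines. The sketch below is a hypothetical illustration: token names, patch counts, and the `build_long_sequence` helper are assumptions, not the authors' implementation. It shows the core idea of concatenating several images' patch tokens into one long causal sequence, with the same separator token inserted before each image.

```python
def build_long_sequence(image_tokens, sep_token):
    # Concatenate per-image patch-token lists into one long autoregressive
    # sequence, inserting the identical separator before each image.
    # (Hypothetical sketch of the STAR idea; in practice the tokens are
    # patch embeddings and the separator is a learnable vector.)
    seq = []
    for tokens in image_tokens:
        seq.append(sep_token)   # same separator marks every image boundary
        seq.extend(tokens)      # patch tokens kept at original resolution
    return seq

# Four images of 196 patch tokens each (a 14x14 grid) -> ~4x longer sequence
SEP = "<sep>"
images = [[f"img{i}_patch{j}" for j in range(196)] for i in range(4)]
long_seq = build_long_sequence(images, SEP)
print(len(long_seq))  # 788 = 4*196 patches + 4 separators
```

Because Mamba's recurrence is already causal, this longer sequence can be fed to the same next-token prediction objective unchanged; the separators simply tell the model where one image ends and the next begins.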
Results
The STAR-B model, trained using this method, achieved 83.5% accuracy on ImageNet-1k classification, placing it among competitive Vision Mamba implementations. The improvement suggests that longer sequence pretraining effectively enhances the model's ability to capture and utilize long-range spatial dependencies within vision tasks.
Significance
This work addresses a practical gap in Vision Mamba research. While the architecture is theoretically suited for long sequences, training methods hadn't fully exploited this property. By enabling 4x longer training sequences through a minimal architectural modification, the approach demonstrates that Vision Mamba's efficiency gains extend to practical vision benchmarks.
The separator-based approach is notably simple—requiring no additional parameters or complex modifications—making it immediately applicable to existing Vision Mamba implementations.
What This Means
Vision transformers and attention-based models dominate modern computer vision, but state space models like Mamba offer computational advantages for sequence processing. This research validates that these advantages translate to vision tasks when properly leveraged during pretraining. The 4x sequence extension with maintained image quality suggests Vision Mamba may become a viable alternative to transformers for vision applications where efficiency matters. However, the 83.5% ImageNet-1k score, while solid, doesn't yet exceed state-of-the-art transformer baselines, indicating this remains an evolving research direction rather than a replacement technology.
The work appears on arXiv (2603.03806) and had not been peer-reviewed or accepted at a top-tier venue at the time of writing.