model release

NemoStation releases Marlin-2B: 2-billion parameter video VLM lands between Tarsier-34B and Gemini-1.5-Pro on dense captioning

TL;DR

NemoStation has released Marlin-2B, a 2-billion parameter video vision-language model that produces structured scene and event captions with second-precise timestamps. The model tops the CaReBench dense captioning leaderboard and sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K, while matching Gemini-2.0-Flash on temporal grounding benchmarks.


NemoStation Releases Marlin-2B Video VLM

NemoStation has released Marlin-2B, a 2-billion parameter video vision-language model designed for dense captioning and temporal grounding. The model sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K dense captioning benchmarks while running on a single consumer GPU.

Performance

Marlin-2B tops the CaReBench leaderboard for dense video captioning in its weight class. On Tencent's TimeLens-Bench (Charades, ActivityNet, QVHighlights), the model beats Qwen2.5-VL-7B by 6.4 mIoU and matches Gemini-2.0-Flash on temporal grounding tasks, according to NemoStation.
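For readers unfamiliar with the metric, the sketch below shows how temporal IoU between a predicted and a reference (start, end) span is commonly computed, and how it averages into mIoU. It is a generic illustration, not TimeLens-Bench's evaluation code; the function names are hypothetical, and benchmark tables usually report the mean on a 0-100 scale.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Span], golds: Sequence[Span]) -> float:
    """Average temporal IoU over paired predictions and references."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)
```

A prediction of (2, 6) against a reference of (4, 8) overlaps for 2 seconds out of a 6-second union, giving an IoU of one third.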

The model produces structured outputs in two modes: caption mode generates "Scene" paragraphs plus timestamped "Events" with start-end boundaries, while find mode resolves natural-language queries to (start, end) time ranges in seconds.

Architecture and Training

Marlin-2B is a fine-tune of Qwen3.5-2B with the video-capable visual tower intact. NemoStation trained the model in two stages on a single H100:

  • Stage 1: Supervised fine-tuning on approximately 400,000 high-quality clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens datasets, plus dense re-annotations from Gemini-3-Flash
  • Stage 2: Preference optimization via SimPO using teacher-distilled preference pairs scored by Gemini-3-Flash
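As background on Stage 2, the following is a minimal sketch of the SimPO objective for a single preference pair: each response is rewarded by its length-normalized log-probability scaled by beta, and the loss pushes the preferred response ahead of the rejected one by a target margin gamma. The hyperparameter values and the function name are illustrative assumptions; NemoStation has not published its training settings.

```python
import math
from typing import Sequence


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(
    chosen_logps: Sequence[float],
    rejected_logps: Sequence[float],
    beta: float = 2.0,   # assumed value, for illustration only
    gamma: float = 1.0,  # assumed target reward margin
) -> float:
    """SimPO loss for one pair, given per-token log-probs of each response."""
    if not chosen_logps or not rejected_logps:
        raise ValueError("both responses need at least one token")
    # Length-normalized rewards: average token log-prob scaled by beta.
    reward_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    reward_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return _softplus(-margin)
```

Unlike DPO, this objective needs no reference model, which keeps memory requirements lower during preference optimization.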

The training corpus combines sparse public annotations with dense re-annotations specifically tuned for temporally grounded atomic events with explicit time boundaries.

Technical Specifications

The model processes video at 2.0 FPS with a maximum of 200,704 pixels per frame (the area of a 448×448 image). It caps total frames at 240, which at that sampling rate covers about 2 minutes of video. The model requires transformers ≥5.7.0, torch ≥2.11.0, and torchcodec for video decoding.
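The arithmetic behind these limits can be checked with a short sketch. The constants come from the figures above; the helper functions are hypothetical, and the resize logic is a simplification that ignores any patch-size rounding the actual preprocessor may apply. How the model handles videos longer than the frame cap is not stated, so the sketch simply caps the count.

```python
import math
from typing import Tuple

FPS = 2.0             # sampling rate stated in the article
MAX_FRAMES = 240      # frame cap stated in the article
MAX_PIXELS = 200_704  # per-frame pixel budget stated in the article


def max_covered_seconds() -> float:
    """Longest duration sampled at FPS before the frame cap is reached."""
    return MAX_FRAMES / FPS


def sampled_frames(duration_s: float) -> int:
    """Frames drawn from a clip of the given length, capped at MAX_FRAMES."""
    return min(MAX_FRAMES, max(1, math.floor(duration_s * FPS)))


def fit_within_pixel_budget(
    width: int, height: int, max_pixels: int = MAX_PIXELS
) -> Tuple[int, int]:
    """Downscale (width, height), keeping aspect ratio, to fit the budget."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
```

For example, a 30-second clip yields 60 frames, while a 10-minute clip hits the 240-frame cap, and a 1920×1080 frame would be scaled down to fit within the same pixel area as a 448×448 image.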

Marlin-2B is vLLM- and swift-deploy-compatible. The model exposes two convenience methods (.caption and .find) that return parsed dictionaries, plus raw .generate() access for custom prompts.

Pricing and Availability

Pricing has not been disclosed. The model is available on Hugging Face with custom modeling code that requires trust_remote_code=True. NemoStation states that a recipe paper detailing the training methodology is forthcoming.

What This Means

Marlin-2B demonstrates that specialized training on temporally grounded video data can produce a 2B model competitive with much larger open models (Tarsier-34B) and proprietary systems on specific video understanding tasks. The model's ability to run on consumer hardware while matching Gemini-2.0-Flash on temporal grounding suggests efficient video VLMs are viable for production deployment. However, NemoStation acknowledges that specialized 7B+ models (TimeLens-7B/8B, MiMo-VL, Time-R1) still lead on these benchmarks due to task-specific training data. Marlin-2B's strength is as a general-purpose model at 2B scale.
