NemoStation releases Marlin-2B: 2-billion-parameter video VLM achieves dense captioning performance between Tarsier-34B and Gemini-1.5-Pro

TL;DR

NemoStation has released Marlin-2B, a 2-billion-parameter video vision-language model that produces structured scene and event captions with timestamps precise to the second. According to NemoStation, the model tops the CaReBench dense captioning leaderboard in its weight class, sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K, and matches Gemini-2.0-Flash on temporal grounding benchmarks.

May 20, 2026 · 7:51 AM · 2 min read

NemoStation Releases Marlin-2B Video VLM

NemoStation has released Marlin-2B, a 2-billion-parameter video vision-language model designed for dense captioning and temporal grounding. The model sits between Tarsier-34B and Gemini-1.5-Pro on the DREAM-1K dense captioning benchmark while running on a single consumer GPU.

Performance

Marlin-2B tops the CaReBench leaderboard for dense video captioning in its weight class. On Tencent's TimeLens-Bench (Charades, ActivityNet, QVHighlights), the model beats Qwen2.5-VL-7B by 6.4 mIoU points and matches Gemini-2.0-Flash on temporal grounding tasks, according to NemoStation.
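For readers unfamiliar with the metric, mIoU in temporal grounding is usually the mean intersection-over-union between predicted and ground-truth time ranges, typically reported multiplied by 100. The sketch below shows that standard definition only; the function names are illustrative, and the exact evaluation protocol TimeLens-Bench uses is not described here.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, truth: Span) -> float:
    """Intersection over union of two time ranges."""
    # Overlap is zero when the ranges do not intersect.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Span], truths: List[Span]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)
```

Under this definition, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so its IoU is one third.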

The model produces structured outputs in two modes: caption mode generates "Scene" paragraphs plus timestamped "Events" with start-end boundaries, while find mode resolves natural-language queries to (start, end) time ranges in seconds.
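To make the two output shapes concrete, here is a minimal sketch of how a caption-mode result might be held and rendered. The dictionary keys ("scene", "events", "start", "end", "description") and the sample content are assumptions made for illustration; the article does not specify the schema the model actually returns.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second count as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_caption(result: dict) -> str:
    """Render a hypothetical caption-mode result as readable text."""
    lines = [f"Scene: {result['scene']}"]
    for event in sorted(result["events"], key=lambda e: e["start"]):
        if event["end"] < event["start"]:
            raise ValueError("event ends before it starts")
        start = format_timestamp(event["start"])
        end = format_timestamp(event["end"])
        lines.append(f"[{start}-{end}] {event['description']}")
    return "\n".join(lines)


# Hypothetical caption-mode result (invented sample, not real model output).
example = {
    "scene": "A kitchen with a person standing at the counter.",
    "events": [
        {"start": 12, "end": 75, "description": "The person chops vegetables."},
        {"start": 0, "end": 12, "description": "The person washes their hands."},
    ],
}

# A find-mode result would reduce to a single (start, end) pair in seconds.
find_result = (12, 75)
```

Calling render_caption(example) sorts the events by start time and prints one bracketed time range per event beneath the scene paragraph.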

Architecture and Training

Marlin-2B is a fine-tune of Qwen3.5-2B with the video-capable visual tower intact. NemoStation trained the model in two stages on a single H100:

Stage 1: Supervised fine-tuning on approximately 400,000 high-quality clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens datasets, plus dense re-annotations from Gemini-3-Flash
Stage 2: Preference optimization via SimPO using teacher-distilled preference pairs scored by Gemini-3-Flash
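
For context on Stage 2, SimPO (Simple Preference Optimization) scores each response by its length-normalized log-probability under the policy and trains the preferred response to beat the rejected one by a target margin, without a reference model. The sketch below computes that loss for one preference pair in plain Python. The beta and gamma values are illustrative defaults, not NemoStation's settings, which the article does not give.

```python
import math
from typing import Sequence


def avg_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of a response."""
    return sum(token_logprobs) / len(token_logprobs)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def simpo_loss(
    chosen_logprobs: Sequence[float],
    rejected_logprobs: Sequence[float],
    beta: float = 2.0,   # reward scale (illustrative value)
    gamma: float = 1.0,  # target reward margin (illustrative value)
) -> float:
    """SimPO loss for one (chosen, rejected) pair of per-token log-probs."""
    margin = beta * (avg_logprob(chosen_logprobs) - avg_logprob(rejected_logprobs)) - gamma
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

The loss shrinks as the chosen response's average log-probability pulls ahead of the rejected one's by more than gamma / beta.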

The training corpus combines sparse public annotations with dense re-annotations specifically tuned for temporally grounded atomic events with explicit time boundaries.

Technical Specifications

The model processes video at 2.0 FPS with a maximum of 200,704 pixels per frame (equivalent to 448×448). It caps total frames at 240, which at 2.0 FPS covers about 2 minutes (120 seconds) of video. The model requires transformers ≥5.7.0, torch ≥2.11.0, and torchcodec for video decoding.
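
The arithmetic behind these limits can be checked with a short sketch. The constants come from the figures above; the helper names are illustrative, and the resizing rule (uniform downscale, rounding down) is an assumption rather than the model's documented preprocessing. How the model handles videos that exceed the frame cap is not stated, so the sketch only reports the capped frame count.

```python
import math
from typing import Tuple

FPS = 2.0             # sampling rate from the spec
MAX_FRAMES = 240      # total frame cap from the spec
MAX_PIXELS = 200_704  # per-frame pixel budget from the spec (448 * 448)


def frames_for(duration_s: float) -> int:
    """Number of frames sampled at FPS, limited by the frame cap."""
    return min(int(duration_s * FPS), MAX_FRAMES)


def fit_within_budget(width: int, height: int) -> Tuple[int, int]:
    """Downscale (width, height) uniformly so the area fits MAX_PIXELS.

    Assumed rule for illustration: keep aspect ratio, never upscale,
    round each side down.
    """
    scale = min(1.0, math.sqrt(MAX_PIXELS / (width * height)))
    return max(1, int(width * scale)), max(1, int(height * scale))


# Longest duration covered at full sampling rate, in seconds.
max_covered_s = MAX_FRAMES / FPS
```

Here max_covered_s evaluates to 120.0, matching the 2-minute figure, and a 448×448 frame passes through fit_within_budget unchanged.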

Marlin-2B is vLLM- and swift-deploy-compatible. The model exposes two convenience methods (.caption and .find) that return parsed dictionaries, plus raw .generate() access for custom prompts.

Pricing and Availability

NemoStation has not disclosed pricing. The model is available on Hugging Face and ships custom modeling code, so loading it requires trust_remote_code=True. NemoStation says a recipe paper detailing the training methodology is forthcoming.

What This Means

Marlin-2B demonstrates that specialized training on temporally grounded video data can make a 2B model competitive with much larger open models (34B) and with proprietary systems on specific video understanding tasks. Its ability to run on consumer hardware while matching Gemini-2.0-Flash on temporal grounding suggests that efficient video VLMs are viable for production deployment. However, NemoStation acknowledges that specialized 7B+ models (TimeLens-7B/8B, MiMo-VL, Time-R1) still lead on these benchmarks because of their task-specific training data. Marlin-2B's strength is as a general-purpose model at 2B scale.

Source: huggingface.co
