model release

UAE's TIIUAE releases Falcon Perception: 0.6B early-fusion model for open-vocabulary grounding

TL;DR

TIIUAE has released Falcon Perception, a 0.6B-parameter early-fusion Transformer that combines image patches and text in a single sequence for open-vocabulary object grounding and segmentation. The model achieves 68.0 Macro-F1 on SA-Co (vs. 62.3 for SAM 3) and introduces PBench, a diagnostic benchmark that isolates performance across five capability levels. TIIUAE also released Falcon OCR, a 0.3B model reaching 80.3 on olmOCR and 88.6 on OmniDocBench.


Falcon Perception: Single-Backbone Approach to Vision-Language Grounding

TIIUAE has released Falcon Perception, a 0.6B-parameter early-fusion Transformer designed for open-vocabulary object grounding and segmentation from natural language prompts. The architecture processes image patches and text tokens in a unified sequence using a hybrid attention mask, departing from the standard modular pipeline approach that separates vision encoders from language fusion.

Architecture: Early Fusion with Hybrid Attention

Falcon Perception replaces traditional multi-stage pipelines (frozen vision backbone → separate fusion decoder → post-processing) with a single Transformer backbone. The model uses a hybrid attention pattern: image tokens attend bidirectionally to all image tokens to build global visual context, while text and task tokens attend causally to everything before them. This allows the same backbone to function as both a bidirectional visual encoder and an autoregressive predictor.
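A minimal sketch of what such a hybrid mask could look like, assuming image tokens precede text and task tokens in the sequence (the release does not publish the exact token layout): image rows get full bidirectional attention within the image block, while text rows get standard causal attention over everything at or before their own position.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean mask where mask[i, j] = True means token i may attend to token j.

    Assumed layout: image tokens first, then text/task tokens.
    Image tokens attend bidirectionally to all image tokens; text tokens
    attend causally to every token at or before their own position.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image rows: full bidirectional attention over the image block.
    mask[:n_image, :n_image] = True
    # Text rows: causal attention, including over the image prefix.
    rows = np.arange(n_image, n)[:, None]
    cols = np.arange(n)[None, :]
    mask[n_image:, :] = cols <= rows
    return mask
```

With a mask like this, the same weights act as a bidirectional visual encoder over the image block and as an autoregressive decoder over the text suffix.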

Output generation follows a deliberate three-step "Chain-of-Perception" structure:

  1. Coordinate token: Predicts instance center
  2. Size token: Predicts spatial extent
  3. Segmentation token: Produces full-resolution binary mask via dot product with upsampled image features

This coarse-to-fine decomposition avoids expensive token-by-token mask generation while maintaining variable-length instance prediction. Coordinate and size heads use Fourier feature encoding to overcome spectral bias in neural networks. Segmentation heads operate as lightweight dot-product layers rather than separate mask-query machinery.
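The two head types described above can be sketched as follows. This is an illustrative reconstruction, not the released implementation: the band count, frequency schedule, and mask threshold are assumptions.

```python
import numpy as np

def fourier_encode(x: np.ndarray, num_bands: int = 6) -> np.ndarray:
    """Map normalized coordinates in [0, 1] to sin/cos features at
    exponentially spaced frequencies, a standard remedy for the spectral
    bias that makes raw-coordinate regression blurry."""
    freqs = (2.0 ** np.arange(num_bands)) * np.pi   # (num_bands,)
    angles = x[..., None] * freqs                   # (..., num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def dot_product_mask(image_features: np.ndarray, seg_token: np.ndarray) -> np.ndarray:
    """Binary mask from a single dot product between upsampled per-pixel
    features of shape (H, W, D) and one segmentation-token embedding (D,).
    The zero threshold stands in for whatever cutoff the model learns."""
    logits = image_features @ seg_token             # (H, W)
    return logits > 0.0
```

The appeal of the dot-product head is that per-instance cost is one matrix-vector product over the feature map, instead of a separate mask-query decoder per instance.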

Performance and Benchmarking

On the SA-Co benchmark, Falcon Perception reaches 68.0 Macro-F1, outperforming SAM 3's 62.3. However, the model shows a presence calibration gap (MCC 0.64, versus 0.82 for SAM 3), meaning it struggles to reliably predict when a queried object is absent from the image.
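The Matthews correlation coefficient cited for presence prediction treats "object present vs. absent" as a binary classification and is computed from the four confusion counts; a quick reference implementation:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for binary presence prediction.
    Ranges from -1 (inverted predictions) through 0 (chance) to +1 (perfect).
    Unlike accuracy, it stays honest when positives and negatives are
    imbalanced, which is why it suits absence calibration."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        return 0.0  # degenerate confusion matrix: no signal either way
    return (tp * tn - fp * fn) / denom
```

A model that hallucinates objects on absent queries inflates `fp` and drags MCC down even when its masks on true positives are excellent, which is exactly the gap the benchmark surfaces.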

TIIUAE introduced PBench, a diagnostic benchmark that disaggregates performance across five capability levels (L0–L4) plus a dense-crowdedness stress test:

  • L0: Simple object recognition
  • L1: Attributes and subtypes
  • L2: OCR-guided identification
  • L3: Spatial understanding
  • L4: Relations and interactions
  • Dense Crowdedness: Stress test with hundreds of instances per image

This structure isolates failure modes and guides future development priorities.

Training Methodology

Falcon Perception was trained on a 54M-image dataset with 195M positive expressions and 488M hard negatives. The training pipeline includes:

  • Multi-teacher distillation: Initialized from DINOv3 (ViT-H) and SigLIP2 to provide strong visual and language-aligned foundations
  • Hierarchical image clustering: Via DINOv3 embeddings for uniform concept coverage
  • VLM-driven annotation: Generated dense descriptions categorized by PBench complexity levels (60% basic, 40% advanced)
  • Negative mining: Created semantic, visual, and fine-grained hard negatives
  • Ensemble consensus: SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance
  • Strict 1:1 positive-to-negative ratio: Prioritizes presence calibration as a first-class training objective
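The ensemble-consensus step above can be sketched as an IoU agreement check. Whether "agree" means pairwise IoU between teacher masks or IoU against a merged reference is not specified in the release; this sketch assumes pairwise agreement among the three teachers.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def ensemble_accepts(masks: list, threshold: float = 0.8) -> bool:
    """Auto-accept a candidate annotation only when every pair of teacher
    masks (e.g. SAM 3, Qwen3-VL-30B, Moondream3) overlaps above the
    IoU threshold; disagreements would be routed to review instead."""
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if iou(masks[i], masks[j]) <= threshold:
                return False
    return True
```

Gating automatic acceptance on cross-model agreement trades annotation volume for label precision, which matters when 488M hard negatives are meant to teach the model what is *not* there.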

Falcon OCR

TIIUAE simultaneously released Falcon OCR, a 0.3B-parameter model achieving 80.3 on olmOCR and 88.6 on OmniDocBench with the highest throughput among open-source OCR models.

What This Means

Falcon Perception demonstrates that unified, early-fusion architectures can compete with or exceed modular vision-language pipelines while remaining interpretable and scalable. The 0.6B parameter count positions it for edge deployment and inference speed advantages. The primary limitation—presence calibration—is well-diagnosed and addressable with targeted data and training techniques. PBench's capability-level breakdown provides a reproducible methodology for future vision-language model evaluation beyond saturated benchmarks like RefCOCO. The simultaneous OCR release and emphasis on open-vocabulary grounding signal TIIUAE's focus on practical industrial deployment of perception systems.

