UAE's TIIUAE releases Falcon Perception: 0.6B early-fusion model for open-vocabulary grounding
TIIUAE has released Falcon Perception, a 0.6B-parameter early-fusion Transformer that combines image patches and text in a single sequence for open-vocabulary object grounding and segmentation. The model achieves 68.0 Macro-F1 on SA-Co (vs. 62.3 for SAM 3) and introduces PBench, a diagnostic benchmark that isolates performance across five capability levels. TIIUAE also released Falcon OCR, a 0.3B model reaching 80.3 on olmOCR and 88.6 on OmniDocBench.
Falcon Perception: Single-Backbone Approach to Vision-Language Grounding
TIIUAE has released Falcon Perception, a 0.6B-parameter early-fusion Transformer designed for open-vocabulary object grounding and segmentation from natural language prompts. The architecture processes image patches and text tokens in a unified sequence using a hybrid attention mask, departing from the standard modular pipeline approach that separates vision encoders from language fusion.
Architecture: Early Fusion with Hybrid Attention
Falcon Perception replaces traditional multi-stage pipelines (frozen vision backbone → separate fusion decoder → post-processing) with a single Transformer backbone. The model uses a hybrid attention pattern: image tokens attend bidirectionally to all image tokens to build global visual context, while text and task tokens attend causally to everything before them. This allows the same backbone to function as both a bidirectional visual encoder and an autoregressive predictor.
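The hybrid mask can be pictured as a block-structured boolean matrix: the image-image block is fully visible, while text and task positions see the entire image prefix plus earlier text. Below is a minimal sketch assuming image tokens precede text tokens in the sequence; token counts and the function name are illustrative, not from the release.

```python
import torch

def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Boolean mask where True means position j is visible to position i.

    Assumes the sequence layout [image tokens ... | text/task tokens ...].
    Image tokens attend bidirectionally within the image block;
    text/task tokens attend causally to everything before them.
    """
    n = num_image_tokens + num_text_tokens
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Image block: full bidirectional attention among image tokens.
    mask[:num_image_tokens, :num_image_tokens] = True

    # Text/task rows: causal attention over the whole prefix (image + earlier text).
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    mask[num_image_tokens:, :] = causal[num_image_tokens:, :]

    return mask

# Example: 4 image patches followed by 3 text tokens.
print(hybrid_attention_mask(4, 3).int())
```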
Output generation follows a deliberate three-step "Chain-of-Perception" structure:
- Coordinate token: Predicts instance center
- Size token: Predicts spatial extent
- Segmentation token: Produces full-resolution binary mask via dot product with upsampled image features
This coarse-to-fine decomposition avoids expensive token-by-token mask generation while maintaining variable-length instance prediction. Coordinate and size heads use Fourier feature encoding to overcome spectral bias in neural networks. Segmentation heads operate as lightweight dot-product layers rather than separate mask-query machinery.
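To make the two kinds of heads concrete, the sketch below pairs a standard Fourier-feature coordinate encoding with a dot-product mask head. All shapes, names, and the surrounding pipeline are assumptions for illustration; only the overall structure (sin/cos Fourier features for coordinate and size prediction, mask logits as a dot product between a segmentation-token embedding and upsampled per-pixel features) follows the description above.

```python
import math
import torch

def fourier_features(xy: torch.Tensor, num_freqs: int = 16) -> torch.Tensor:
    """NeRF-style Fourier encoding of normalized 2-D coordinates in [0, 1].

    Mapping low-dimensional coordinates to sin/cos features at multiple
    frequencies is the standard remedy for spectral bias in coordinate heads.
    xy: (B, 2) -> (B, 4 * num_freqs).
    """
    freqs = 2.0 ** torch.arange(num_freqs, device=xy.device) * math.pi
    angles = xy.unsqueeze(-1) * freqs                       # (B, 2, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

def dot_product_mask(seg_token: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
    """Mask logits as a dot product between a segmentation-token embedding (B, D)
    and upsampled per-pixel image features (B, D, H, W) -> (B, H, W)."""
    return torch.einsum("bd,bdhw->bhw", seg_token, pixel_feats)

# Illustrative shapes only: batch of 2, 256-dim embeddings, 128x128 feature map.
xy = torch.rand(2, 2)
seg_token, pixel_feats = torch.randn(2, 256), torch.randn(2, 256, 128, 128)
coords_enc = fourier_features(xy)                           # (2, 64)
mask_logits = dot_product_mask(seg_token, pixel_feats)      # (2, 128, 128)
```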
Performance and Benchmarking
On the SA-Co benchmark, Falcon Perception reaches 68.0 Macro-F1, outperforming SAM 3's 62.3. However, the model shows a presence calibration gap (MCC 0.64 vs. 0.82), meaning it struggles to reliably predict when objects are absent.
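Presence calibration here is the binary decision of whether the queried concept appears in the image at all, and MCC (Matthews correlation coefficient) scores exactly that decision. A minimal illustration with scikit-learn, using made-up labels rather than benchmark data:

```python
from sklearn.metrics import matthews_corrcoef

# 1 = concept present in the image, 0 = absent (toy labels, not benchmark data).
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]  # over-predicts presence: false positives on absent queries

# Prints roughly 0.58: well below 1.0, reflecting weak presence calibration.
print(matthews_corrcoef(y_true, y_pred))
```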
TIIUAE introduced PBench, a diagnostic benchmark that disaggregates performance across five capability levels (L0-L4) plus a crowdedness stress test:
- L0: Simple object recognition
- L1: Attributes and subtypes
- L2: OCR-guided identification
- L3: Spatial understanding
- L4: Relations and interactions
- Dense Crowdedness: Stress test with hundreds of instances per image
This structure isolates failure modes and guides future development priorities.
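Scoring a benchmark organized this way means reporting per-level averages rather than one pooled number; a toy sketch of that grouping (field names and records are hypothetical):

```python
from collections import defaultdict

# Hypothetical per-example results tagged with their PBench capability level.
results = [
    {"level": "L0", "f1": 0.91}, {"level": "L0", "f1": 0.88},
    {"level": "L3", "f1": 0.64}, {"level": "L4", "f1": 0.52},
]

by_level = defaultdict(list)
for r in results:
    by_level[r["level"]].append(r["f1"])

# Per-level averages isolate which capability (attributes, OCR, spatial, relations) fails.
for level, scores in sorted(by_level.items()):
    print(level, sum(scores) / len(scores))
```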
Training Methodology
Falcon Perception was trained on a 54M-image dataset with 195M positive expressions and 488M hard negatives. The training pipeline includes:
- Multi-teacher distillation: Initialized from DINOv3 (ViT-H) and SigLIP2 to provide strong visual and language-aligned foundations
- Hierarchical image clustering: Via DINOv3 embeddings for uniform concept coverage
- VLM-driven annotation: Generated dense descriptions categorized by PBench complexity levels (60% basic, 40% advanced)
- Negative mining: Created semantic, visual, and fine-grained hard negatives
- Ensemble consensus: SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance (see the sketch after this list)
- Strict 1:1 positive-to-negative ratio: Prioritizes presence calibration as a first-class training objective
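A minimal sketch of how such an agreement gate could work for box proposals: every pair of teacher predictions must exceed the IoU threshold before an annotation is auto-accepted. The function names and the (x1, y1, x2, y2) box format are assumptions, not details from the release.

```python
from itertools import combinations

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def consensus_accept(predictions, iou_thresh=0.8):
    """Auto-accept an annotation only if every pair of teacher predictions agrees.

    `predictions` maps teacher name -> predicted box for the same expression.
    """
    boxes = list(predictions.values())
    return all(box_iou(a, b) > iou_thresh for a, b in combinations(boxes, 2))

# Toy example: three teacher models proposing near-identical boxes.
teachers = {
    "sam3":       (10, 10, 110, 210),
    "qwen3_vl":   (12, 11, 111, 208),
    "moondream3": (11, 12, 109, 212),
}
print(consensus_accept(teachers))  # True: all pairwise IoUs exceed 0.8
```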
Falcon OCR
TIIUAE simultaneously released Falcon OCR, a 0.3B-parameter model achieving 80.3 on olmOCR and 88.6 on OmniDocBench with the highest throughput among open-source OCR models.
What This Means
Falcon Perception demonstrates that unified, early-fusion architectures can compete with or exceed modular vision-language pipelines while remaining interpretable and scalable. The 0.6B parameter count positions it for edge deployment and inference speed advantages. The primary limitation—presence calibration—is well-diagnosed and addressable with targeted data and training techniques. PBench's capability-level breakdown provides a reproducible methodology for future vision-language model evaluation beyond saturated benchmarks like RefCOCO. The simultaneous OCR release and emphasis on open-vocabulary grounding signal TIIUAE's focus on practical industrial deployment of perception systems.
Related Articles
Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens
Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with 33K context window, priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.
Microsoft Releases Fara-7B: 7B Parameter Computer Use Agent Trained in 2.5 Days on 64 H100s
Microsoft Research has released Fara-7B, a 7-billion parameter small language model designed for computer automation tasks. The model, which took 2.5 days to train on 64 H100 GPUs, can navigate websites to complete tasks like booking restaurants and shopping, using screenshots as input with a 128K token context window.
IBM Releases 97M-Parameter Granite Embedding Model With 60.3 MTEB Score — Highest Retrieval Quality Under 100M Parameters
IBM released two new multilingual embedding models under Apache 2.0: a 97M-parameter compact model scoring 60.3 on MTEB Multilingual Retrieval (highest in its size class) and a 311M full-size model scoring 65.2. Both support 200+ languages with enhanced retrieval for 52 languages, handle 32K-token context (64x increase over predecessors), and include code retrieval across 9 programming languages.
Baidu Releases Qianfan-OCR-Fast Model with 66K Context at $0.68 Per 1M Input Tokens
Baidu has released Qianfan-OCR-Fast, a multimodal model specialized for optical character recognition tasks. The model offers a 66,000 token context window and is priced at $0.68 per 1M input tokens and $2.81 per 1M output tokens.