UAE's TIIUAE releases Falcon Perception: 0.6B early-fusion model for open-vocabulary grounding

TL;DR

TIIUAE has released Falcon Perception, a 0.6B-parameter early-fusion Transformer that combines image patches and text in a single sequence for open-vocabulary object grounding and segmentation. The model achieves 68.0 Macro-F1 on SA-Co (vs. 62.3 for SAM 3) and introduces PBench, a diagnostic benchmark that isolates performance across five capability levels. TIIUAE also released Falcon OCR, a 0.3B model reaching 80.3 on olmOCR and 88.6 on OmniDocBench.

Falcon Perception: Single-Backbone Approach to Vision-Language Grounding

TIIUAE has released Falcon Perception, a 0.6B-parameter early-fusion Transformer designed for open-vocabulary object grounding and segmentation from natural language prompts. The architecture processes image patches and text tokens in a unified sequence using a hybrid attention mask, departing from the standard modular pipeline approach that separates vision encoders from language fusion.

Architecture: Early Fusion with Hybrid Attention

Falcon Perception replaces traditional multi-stage pipelines (frozen vision backbone → separate fusion decoder → post-processing) with a single Transformer backbone. The model uses a hybrid attention pattern: image tokens attend bidirectionally to all image tokens to build global visual context, while text and task tokens attend causally to everything before them. This allows the same backbone to function as both a bidirectional visual encoder and an autoregressive predictor.
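The hybrid pattern described above can be illustrated with a small mask builder. This is a minimal sketch, not TIIUAE's implementation: the function name, the token layout (all image tokens first, then text and task tokens), and the choice to let each text token attend to itself are assumptions made for illustration.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumed layout (hypothetical): image tokens occupy the first positions,
    text/task tokens follow.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Image query: bidirectional over image tokens only.
                mask[q][k] = k < num_image_tokens
            else:
                # Text/task query: causal, sees itself and everything earlier.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    for row in hybrid_attention_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With 3 image tokens and 2 text tokens, the printed rows are `xxx..`, `xxx..`, `xxx..`, `xxxx.`, `xxxxx`: a fully visible image block in the top-left corner and a causal staircase for the text rows.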

Output generation follows a deliberate three-step "Chain-of-Perception" structure:

  1. Coordinate token: Predicts instance center
  2. Size token: Predicts spatial extent
  3. Segmentation token: Produces full-resolution binary mask via dot product with upsampled image features

This coarse-to-fine decomposition avoids expensive token-by-token mask generation while still allowing a variable number of instances per image. The coordinate and size heads use Fourier feature encoding to counter spectral bias, the tendency of neural networks to fit low-frequency functions first, which would otherwise hurt precise localization. The segmentation head is a lightweight dot-product layer rather than separate mask-query machinery.
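Two of the ingredients above, Fourier feature encoding and a dot-product mask head, are simple enough to sketch in a few lines. The code below is illustrative only: the function names, the doubling frequency schedule, and the zero threshold are assumptions, and the actual model may use a different frequency scheme and learned projections.

```python
import math


def fourier_features(value: float, num_frequencies: int = 4) -> list[float]:
    """Encode a scalar (e.g. a normalized coordinate in [0, 1]) as sin/cos pairs.

    Assumption: frequencies double at each step (pi, 2*pi, 4*pi, ...).
    """
    features = []
    for i in range(num_frequencies):
        angle = (2.0 ** i) * math.pi * value
        features.extend([math.sin(angle), math.cos(angle)])
    return features


def dot_product_mask(seg_embedding, pixel_features, threshold=0.0):
    """Binary mask from one segmentation embedding and per-pixel feature vectors.

    pixel_features is a nested list [row][col][channel]; a pixel is marked 1
    when its dot product with seg_embedding exceeds the (assumed) threshold.
    """
    return [
        [
            1 if sum(s * f for s, f in zip(seg_embedding, pixel)) > threshold else 0
            for pixel in row
        ]
        for row in pixel_features
    ]
```

The point of the second function is cost: once the image features are upsampled, producing a mask for each instance is a single dot product per pixel, so the autoregressive part of the model only has to emit one segmentation token per instance.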

Performance and Benchmarking

On the SA-Co benchmark, Falcon Perception reaches 68.0 Macro-F1, outperforming SAM 3's 62.3. However, the model shows a presence calibration gap: its MCC is 0.64 versus 0.82 for SAM 3, meaning it is less reliable at predicting when a queried object is absent from the image.
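For readers unfamiliar with the two metrics, the sketch below uses the standard textbook definitions of F1, macro-averaged F1, and the Matthews correlation coefficient (MCC). It is not the SA-Co evaluation code: the benchmark's own matching and averaging rules may differ, and the function names are chosen here for illustration.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class_counts) -> float:
    """Unweighted mean of per-class F1; each item is a (tp, fp, fn) tuple."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary present/absent decision."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC is a natural fit for presence calibration because it penalizes a model that always answers "present". With 50 images that contain the object and 50 that do not, such a model gets `mcc(tp=50, tn=0, fp=50, fn=0) == 0.0` despite perfect recall.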

TIIUAE also introduced PBench, a diagnostic benchmark that breaks performance down by five capability levels (L0 to L4), plus a separate dense-crowd stress test:

  • L0: Simple object recognition
  • L1: Attributes and subtypes
  • L2: OCR-guided identification
  • L3: Spatial understanding
  • L4: Relations and interactions
  • Dense Crowdedness: Stress test with hundreds of instances per image

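Reporting results per level shows which capability a model fails at (for example, OCR-guided identification versus spatial understanding) instead of hiding it inside a single aggregate score, which helps prioritize future development.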
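A per-level breakdown of this kind amounts to a group-by over evaluation records. The sketch below assumes each record is a hypothetical `(level, score)` pair; it does not reflect PBench's actual data format or scoring rules.

```python
from collections import defaultdict


def scores_by_level(records):
    """Average a per-sample score within each capability level.

    records: iterable of (level, score) pairs, e.g. ("L2", 0.5).
    The record layout is an assumption made for illustration.
    """
    buckets = defaultdict(list)
    for level, score in records:
        buckets[level].append(score)
    return {level: sum(vals) / len(vals) for level, vals in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up toy values, not real benchmark results.
    toy = [("L0", 1.0), ("L0", 0.5), ("L3", 0.0)]
    print(scores_by_level(toy))
```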

Training Methodology

Falcon Perception was trained on a 54M-image dataset with 195M positive expressions and 488M hard negatives. The training pipeline includes:

  • Multi-teacher distillation: Initialized from DINOv3 (ViT-H) and SigLIP2 to provide strong visual and language-aligned foundations
  • Hierarchical image clustering: Via DINOv3 embeddings for uniform concept coverage
  • VLM-driven annotation: Generated dense descriptions categorized by PBench complexity levels (60% basic, 40% advanced)
  • Negative mining: Created semantic, visual, and fine-grained hard negatives
  • Ensemble consensus: SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance
  • Strict 1:1 positive-to-negative ratio: Prioritizes presence calibration as a first-class training objective
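The ensemble consensus step can be pictured as an agreement check on the masks produced by each annotator model. The sketch below is an assumption-laden illustration: it treats "agree" as every pair of masks having IoU above 0.8, represents masks as nested lists of 0/1, and treats two empty masks as agreeing. The actual pipeline may compare boxes rather than masks, or compare against a reference instead of pairwise.

```python
from itertools import combinations


def mask_iou(a, b) -> float:
    """Intersection over union of two equally sized binary masks (nested 0/1 lists)."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Assumption: two empty masks count as perfect agreement.
    return inter / union if union else 1.0


def consensus_accept(masks, threshold: float = 0.8) -> bool:
    """Accept automatically only if every pair of annotator masks exceeds the IoU threshold."""
    return all(mask_iou(a, b) > threshold for a, b in combinations(masks, 2))
```

Under this reading, a sample with three masks is accepted only when all three pairwise IoUs clear the threshold; anything else would need some other form of review.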

Falcon OCR

Alongside Falcon Perception, TIIUAE released Falcon OCR, a 0.3B-parameter model that scores 80.3 on olmOCR and 88.6 on OmniDocBench, with what TIIUAE reports as the highest throughput among open-source OCR models.

What This Means

Falcon Perception suggests that a unified early-fusion architecture can match or exceed modular vision-language pipelines on open-vocabulary grounding while keeping a simpler single-backbone design. At 0.6B parameters, it is small enough to be a candidate for edge deployment and low-latency inference. Its primary limitation, presence calibration, is well diagnosed and may be addressable with targeted data and training techniques. PBench's capability-level breakdown offers a reproducible way to evaluate vision-language models beyond saturated benchmarks such as RefCOCO. The simultaneous OCR release and the emphasis on open-vocabulary grounding signal TIIUAE's focus on practical industrial deployment of perception systems.
