Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens
Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with a 33K-token context window, is priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.
Perceptron has released Perceptron Mk1 (Mark One), a multimodal vision-language model built for video understanding and embodied reasoning tasks. The model processes image and video inputs paired with natural language queries, returning either structured annotations or natural language responses.
Pricing and Context
Perceptron Mk1 is priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, with a 33K token context window. The model is available through OpenRouter's API routing service.
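At those rates, per-request costs are easy to estimate. The sketch below is a minimal back-of-envelope calculator using only the numbers stated above (the $0.15/$1.50 per 1M token prices and the 33K context window); it is not an official billing formula.

```python
# Back-of-envelope cost estimate for Perceptron Mk1 at the listed
# OpenRouter rates: $0.15 per 1M input tokens, $1.50 per 1M output tokens.

INPUT_RATE = 0.15 / 1_000_000   # USD per input token
OUTPUT_RATE = 1.50 / 1_000_000  # USD per output token
CONTEXT_WINDOW = 33_000         # stated 33K-token context window

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost of a single request."""
    if input_tokens + output_tokens > CONTEXT_WINDOW:
        raise ValueError("request exceeds the 33K-token context window")
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 20K-token video prompt with a 1K-token answer:
print(f"${estimate_cost(20_000, 1_000):.4f}")  # → $0.0045
```

Output tokens dominate the bill: at a 10:1 price ratio, a long generated answer can cost more than a much larger video prompt.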
Core Capabilities
According to Perceptron, Mk1 excels at multiple video understanding tasks including video question answering, summarization, and event detection. For image inputs, the model handles:
- Point-by-example grounding from multimodal prompts
- OCR and document parsing on real-world inputs
- Open vocabulary object detection and counting
- Hand pose estimation
Structured Annotation System
The model's distinctive feature is its optional structured annotation output. By default, Mk1 returns natural language text only. Users can request spatial localization through the annotation_format parameter:
- "point" for point annotations on images
- "box" for bounding boxes
- "polygon" for polygon masks
- "clip" for temporal segments (start/end timestamps) in video
Annotations are emitted inline with text only when explicitly requested.
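As a sketch, a request for bounding-box annotations through OpenRouter's OpenAI-compatible chat completions endpoint might be assembled as below. The model slug "perceptron/mk1", the example image URL, and the placement of annotation_format as a top-level request field are assumptions for illustration; consult OpenRouter's model page for the actual names.

```python
import json

# Hypothetical request body for OpenRouter's chat completions endpoint
# (POST https://openrouter.ai/api/v1/chat/completions). The model slug
# "perceptron/mk1" and the top-level "annotation_format" field are
# assumptions, not confirmed by documentation.
payload = {
    "model": "perceptron/mk1",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Detect and box every forklift."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/warehouse.jpg"},
                },
            ],
        }
    ],
    # One of: "point", "box", "polygon", "clip" (clip applies to video).
    "annotation_format": "box",
}

body = json.dumps(payload)
```

Omitting annotation_format would, per the default behavior described above, yield a plain natural-language response.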
Optional Reasoning Mode
Mk1 includes an optional reasoning mode that can be enabled per request. This trades increased latency for deeper analysis on complex tasks, allowing the model to show step-by-step thinking processes. OpenRouter provides access to the reasoning_details array in API responses.
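OpenRouter's per-request reasoning parameter and the reasoning_details response array are documented OpenRouter features, though their exact interaction with Mk1 is an assumption here. A sketch of enabling the mode and collecting the reasoning trace, using a simplified stand-in response rather than a captured API reply:

```python
# Sketch: enabling the optional reasoning mode per request via OpenRouter's
# "reasoning" parameter, then reading the reasoning_details array from the
# response. The response shape below is a simplified stand-in, not output
# captured from a live API call.
request = {
    "model": "perceptron/mk1",  # assumed slug
    "messages": [
        {"role": "user", "content": "When does the forklift enter the frame?"}
    ],
    "reasoning": {"enabled": True},  # trades latency for deeper analysis
}

def extract_reasoning(response: dict) -> list[str]:
    """Pull reasoning text out of a response's reasoning_details array."""
    message = response["choices"][0]["message"]
    return [d.get("text", "") for d in message.get("reasoning_details", [])]

# Simplified stand-in response:
sample = {
    "choices": [{"message": {
        "content": "The forklift enters at 00:12.",
        "reasoning_details": [
            {"type": "reasoning.text", "text": "Scanning early frames..."}
        ],
    }}]
}
print(extract_reasoning(sample))  # → ['Scanning early frames...']
```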
What This Means
Perceptron Mk1 enters a crowded multimodal model market with a focus on structured output formats and video understanding. At $1.50 per 1M output tokens, it sits in the premium tier, comparable to GPT-4 Vision pricing. The optional reasoning mode and granular annotation controls suggest the model targets developers building computer vision pipelines and video analysis applications rather than general-purpose chat interfaces. The company has not disclosed benchmark scores or a parameter count, making direct performance comparisons difficult.
Related Articles
Google releases Gemini 3.1 Flash Lite with 1M context at $0.25 per million input tokens
Google has released Gemini 3.1 Flash Lite, a high-efficiency multimodal model with a 1,048,576 token context window priced at $0.25 per million input tokens and $1.50 per million output tokens. The model supports text, image, video, audio, and PDF inputs with four thinking levels for cost-performance optimization.
Google DeepMind Releases Gemma 4 E4B with Multi-Token Prediction for 2x Faster Inference
Google DeepMind released the Gemma 4 E4B assistant model using Multi-Token Prediction (MTP) architecture that accelerates inference by up to 2x through speculative decoding. The 4.5B effective parameter model supports 128K context windows and handles text, image, and audio input with pricing not yet disclosed.
Zyphra Releases ZAYA1-8B: 8.4B Parameter MoE Model with 760M Active Parameters Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.
Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction
Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The model uses 3.8B active parameters from a 25.2B total parameter MoE architecture with 128 experts and a 256K token context window.