computer-vision
8 articles tagged with computer-vision
Krea Releases 12-Billion Parameter Text-to-Image Model with 8-Step Generation
Krea.ai has released Krea 2 Turbo, a 12-billion-parameter diffusion transformer model for text-to-image generation. The open-weight model generates images in 8 inference steps and supports resolutions up to 2048x2048 pixels.
Baidu Releases Unlimited-OCR, a 3B-Parameter Document Parsing Model Based on DeepSeek-OCR
Baidu has released Unlimited-OCR, a 3-billion-parameter model for optical character recognition and document parsing. The model supports single-page and multi-page document processing with a 32,768-token context window and runs on NVIDIA GPUs using bfloat16 precision.
Mistral AI fine-tunes Pixtral-12B on satellite imagery, boosting classification accuracy from 56% to 91%
Mistral AI has published research showing that fine-tuning its Pixtral-12B vision-language model on satellite imagery raises classification accuracy from 56% to 91% on the Aerial Image Dataset. Using Low-Rank Adaptation (LoRA) with 8,000 training samples across 30 scene categories, the company cut the hallucination rate from 5% to 0.1% for under $10 in compute costs.
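For readers unfamiliar with LoRA, the idea is to freeze a pretrained weight matrix W and train only two small matrices A and B whose product forms a low-rank update. The snippet below is a minimal sketch of that general technique in plain Python, not Mistral's training code; the layer sizes, the rank r = 8, the scaling factor alpha, and all function names are illustrative assumptions that do not come from the article.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, A and B are trainable."""
    r = len(A)  # rank of the update = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, r):
    """Return (parameters in the full matrix, trainable parameters under LoRA)."""
    return d_out * d_in, r * (d_in + d_out)

# Hypothetical toy layer: 4 inputs, 6 outputs, rank-2 adapter.
d_out, d_in, r, alpha = 6, 4, 2, 4.0
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is a no-op before training
x = [1.0, 2.0, 3.0, 4.0]

y = lora_forward(W, A, B, x, alpha)

# Parameter savings for one assumed 4096 x 4096 projection at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora)
```

Because only A and B receive gradients, the trainable parameter count grows with r * (d_in + d_out) rather than d_in * d_out, which is one reason LoRA fine-tuning runs can be cheap.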
Apple adds cloud-powered AI image editing to iOS 27 Photos app with Clean Up upgrade, Extend and Reframe tools
Apple announced three AI-powered photo editing tools for iOS 27: an upgraded Clean Up feature for object removal, Extend for generating content around image borders, and Reframe for changing photo angles. The features use a combination of on-device and cloud AI models, marking a shift from Apple's previous on-device-only approach.
Ideogram Releases First Open-Weight Image Model With 9.3B Parameters and 2K Native Resolution
Ideogram has released Ideogram 4, a 9.3B parameter open-weight text-to-image model trained from scratch. The model features structured JSON prompting, native 2K resolution output, and ranks as the top open-weight model on Design Arena. It is available in fp8 and nf4 quantizations under a non-commercial license.
Google adds visual automation triggers to Gemini for Home, lets cameras initiate smart home routines
Google is rolling out camera-based automations for Gemini for Home, allowing smart home cameras to trigger routines based on visual events like package deliveries or glass breaking. The update includes improved multi-request handling and reduced error rates for the AI assistant that replaced Google Assistant on smart home devices.
Amazon Bedrock adds three video analysis workflows for multimodal understanding at scale
Amazon Bedrock has introduced three video analysis workflows that use multimodal foundation models to extract insights from video content at scale. The three approaches (frame-based, shot-based, and multimodal embedding) target different use cases and cost-performance trade-offs, with open-source reference implementations available on GitHub.
Segmind releases SegMoE, a mixture-of-experts diffusion model for faster image generation
Segmind has released SegMoE, a mixture-of-experts (MoE) diffusion model designed to accelerate image generation while reducing computational overhead. The model applies MoE techniques traditionally used in large language models to the diffusion model architecture, enabling selective expert activation during inference.
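Selective expert activation is commonly implemented with top-k gating: a router scores every expert, and only the k highest-scoring experts run for a given input. The following is a minimal sketch of that general mechanism in plain Python. It is not SegMoE's actual routing code; the toy experts, the gate logits (which a real model would compute with a learned router), and the function names are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Pick the k highest-probability experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, gate_logits, experts, k=2):
    """Run only the selected experts and return their weighted sum plus the indices used."""
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # unselected experts are never called
        out = [o + weight * v for o, v in zip(out, y)]
    return out, [idx for idx, _ in routes]

# Hypothetical experts: each one simply scales its input by a fixed factor.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # assumed router scores for one input

output, used = moe_layer([1.0, 1.0], gate_logits, experts, k=2)
print(used, output)
```

With k = 2 of 4 experts, each input pays the compute cost of two experts while the model keeps the capacity of all four, which is the trade-off the MoE approach aims for.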