code-generation
27 articles tagged with code-generation
Mistral Releases Codestral Embed, Code-Specialized Embedding Model at $0.15 Per Million Tokens
Mistral AI has released Codestral Embed, its first code-specialized embedding model, priced at $0.15 per million tokens. The model features an 8192-token context window and claims to outperform Voyage Code 3, Cohere Embed v4.0, and OpenAI's large embedding model on code retrieval benchmarks.
Anthropic releases Claude Fable 5, first public version of Mythos model for code generation
Anthropic has released Claude Fable 5, the first publicly available version of its Mythos model line. University of Pennsylvania AI researcher Ethan Mollick reports the model can execute multi-page specifications for up to 12 hours and generate complete video games from single prompts in Claude Code.
Replit Agent now generates custom Shopify storefronts in 10 minutes from a single prompt
Replit launched an integration allowing its AI Agent to design and deploy custom Shopify storefronts from natural language prompts. The system generates the front end, provisions a Shopify store, and adds products in a single conversation, with the entire process from first prompt to accepting orders taking roughly 10 minutes.
Anthropic releases Claude Opus 4.8 with Dynamic Workflows for multi-agent tasks
Anthropic released Claude Opus 4.8 on Thursday, 41 days after Opus 4.7, making it the company's fastest upgrade cycle to date. The model includes a new Dynamic Workflows feature designed to manage complex tasks across hundreds of parallel subagents, with pricing unchanged from previous Opus releases.
Mistral releases Leanstral, 6B-parameter open-source model for Lean 4 formal proof verification
Mistral AI released Leanstral, the first open-source code agent designed specifically for Lean 4 formal proof verification. The model uses 6B active parameters in a sparse 120B architecture and is available under the Apache 2.0 license with free API access.
Mistral Releases Codestral 25.08 with 30% Higher Completion Acceptance, Ships Full Enterprise Coding Stack
Mistral AI released Codestral 25.08, showing 30% more accepted code completions and 10% higher retention rates. The company also shipped Devstral Small, a 24B-parameter agentic coding model scoring 53.6% on SWE-Bench Verified, alongside new embedding and IDE integration tools aimed at enterprise deployment.
GitHub engineer builds roguelike dungeon generator from codebases using Copilot CLI
A GitHub engineer has developed an extension using GitHub Copilot CLI that procedurally generates roguelike dungeons from existing codebases. The project demonstrates practical applications of GitHub's AI-powered command-line tool for creative development tasks.
Augment Code launches Cosmos, an operating system for multi-agent software development workflows
Augment Code has released Cosmos into public preview, positioning it as an operating system for agentic software development. The platform coordinates AI agents across the full software development lifecycle with shared memory, multi-model routing through its Prism system (which the company claims saves 20-30% of tokens), and what the company calls specialized agents that learn from team feedback.
IBM releases Granite 4.1-8B with 131K context window and enhanced tool-calling capabilities
IBM has released Granite 4.1-8B, an 8-billion-parameter long-context model with a 131,072-token context window. The model achieves 85.37% on HumanEval and 73.84% on MMLU 5-shot, with enhanced tool-calling capabilities reaching 68.27% on BFCL v3. Released under the Apache 2.0 license, it supports 12 languages.
OpenRouter Launches Pareto Code Router with Dynamic Model Selection Based on Quality Threshold
OpenRouter has released Pareto Code Router, a dynamic routing system that automatically selects from a curated list of coding models based on a user-defined quality threshold. Users set a min_coding_score between 0 and 1, and the router selects an appropriate model from its shortlist without requiring commitment to a specific model.
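The threshold-based selection described above can be sketched as follows. This is a minimal illustration and not OpenRouter's implementation or API: the shortlist entries, their scores and prices, the `Candidate` type, and the "cheapest model that meets the threshold" rule are all assumptions made for the example. Only the `min_coding_score` name and its 0 to 1 range come from the summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str            # hypothetical model identifier
    coding_score: float  # hypothetical quality score in [0, 1]
    price: float         # hypothetical cost per million tokens, in USD


# Illustrative shortlist; every entry here is invented for the example.
SHORTLIST: List[Candidate] = [
    Candidate("model-small", 0.55, 0.20),
    Candidate("model-medium", 0.72, 1.00),
    Candidate("model-large", 0.90, 5.00),
]


def select_model(min_coding_score: float,
                 shortlist: List[Candidate] = SHORTLIST) -> Optional[Candidate]:
    """Return the cheapest candidate whose score meets the threshold, or None."""
    if not 0.0 <= min_coding_score <= 1.0:
        raise ValueError("min_coding_score must be between 0 and 1")
    eligible = [c for c in shortlist if c.coding_score >= min_coding_score]
    # Assumed tie-break: among qualifying models, prefer the lowest price.
    return min(eligible, key=lambda c: c.price) if eligible else None
```

Under these assumptions, raising the threshold trades cost for quality: a low threshold resolves to the cheapest entry, and a threshold above every score returns no model at all.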
GitHub Copilot Individual Plans Change Structure, Details Not Yet Disclosed
GitHub has announced changes to its Copilot Individual subscription plans, citing the need for reliability and predictability for existing customers. The company has not yet disclosed specific details about pricing adjustments, feature modifications, or implementation timelines.
Roblox Assistant adds multi-step planning mode and AI-driven playtesting to automate game development
Roblox is deploying agentic features to its Assistant tool that plan, build, and test games through multi-step workflows. The enhanced Planning Mode analyzes code, asks clarifying questions, and creates editable action plans before implementation, while new AI-driven playtesting tools automatically identify and fix bugs.
Z.ai releases GLM-5.1, 754B parameter open-weight model with improved code generation
Z.ai has released GLM-5.1, a 754-billion-parameter open-weight model matching the size of its predecessor GLM-5. The model demonstrates improved ability to generate complex, multi-part outputs such as HTML pages with SVG graphics and CSS animations, and is available via Hugging Face and OpenRouter.
GitHub Copilot CLI adds Rubber Duck for second-opinion analysis across model families
GitHub has added a feature called Rubber Duck to Copilot CLI that queries multiple AI model families to provide alternative perspectives on code suggestions. The feature acts as a second opinion mechanism, allowing developers to compare recommendations from different model architectures.
Zhipu AI releases GLM-5V-Turbo: multimodal model generates front-end code from design mockups
Zhipu AI released GLM-5V-Turbo, a multimodal coding model that converts design mockups directly into executable front-end code. The model processes images, video, and text with a 200,000-token context window and 128,000-token max output, priced at $1.20 per million input tokens and $4 per million output tokens.
Google DeepMind releases Gemma 4 with 4 model sizes, 256K context, and multimodal reasoning
Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (3.8B active), and 31B (30.7B parameters). All models support text and image input with 128K-256K context windows, while E2B and E4B add native audio capabilities and reasoning modes across 140+ languages.
Alibaba releases Qwen 3.6 Plus with 1M context window, free tier now available
Alibaba's Qwen division released Qwen 3.6 Plus on April 2, 2026, offering free access to a model with a 1,000,000-token context window. The model combines linear attention with sparse mixture-of-experts routing and scores 78.8 on SWE-bench Verified for software engineering tasks.
Alibaba's Qwen3.5-Omni learns to write code from speech and video without explicit training
Alibaba has released Qwen3.5-Omni, an omnimodal model handling text, images, audio, and video with a 256,000-token context window. The model reportedly outperforms Google's Gemini 3.1 Pro on audio tasks with support for 74 languages in speech recognition, a 6x increase from its predecessor. It also shows an unexpected emergent capability: writing working code from spoken instructions and video input, which the team did not explicitly train it to do.
Anthropic's Claude Code Auto Mode enables automatic execution of safe commands while blocking risky actions
Anthropic has released Auto Mode for Claude Code, a middle-ground safety feature that automatically executes safe local operations while blocking risky actions like external deployments and mass deletions. A Claude Sonnet 4.6 classifier evaluates each command based on conversation context, and the system reverts to manual approval after three consecutive blocks or twenty total blocks. The feature is available as a research preview for Team plan users, with Enterprise and API access expected shortly.
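The fallback rule described above (manual approval after three consecutive blocks or twenty total blocks) can be sketched as a pair of counters. This is a hypothetical illustration, not Anthropic's code: the `AutoModeGate` class and its method names are invented, and the assumption that an allowed command resets the consecutive counter is ours, since the summary does not say.

```python
class AutoModeGate:
    """Hypothetical tracker for blocked commands in an auto-approval session."""

    def __init__(self, max_consecutive: int = 3, max_total: int = 20) -> None:
        self.max_consecutive = max_consecutive  # limit taken from the summary
        self.max_total = max_total              # limit taken from the summary
        self.consecutive_blocks = 0
        self.total_blocks = 0

    def record(self, blocked: bool) -> None:
        """Record the classifier's verdict for one command."""
        if blocked:
            self.consecutive_blocks += 1
            self.total_blocks += 1
        else:
            # Assumption: an allowed command breaks the consecutive streak.
            self.consecutive_blocks = 0

    @property
    def manual_approval_required(self) -> bool:
        """True once either block limit has been reached."""
        return (self.consecutive_blocks >= self.max_consecutive
                or self.total_blocks >= self.max_total)
```

In this sketch, the total counter never resets, so a long session with occasional blocks eventually falls back to manual approval even if no streak of three ever occurs.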
Mistral's Leanstral code verification agent outperforms Claude Sonnet at roughly one-fifteenth of the cost
Mistral has released Leanstral, a 120B-parameter code verification agent built for the Lean 4 proof assistant, claiming it outperforms larger open-source models and offers significant cost advantages over Anthropic's Claude suite. The model achieves a pass@2 score of 26.3, beating Claude Sonnet by 2.6 points, while costing $36 to run compared to Sonnet's $549.
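The cost and score figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative, and the only inputs are the numbers given in the summary.

```python
# Figures taken from the summary above.
leanstral_cost = 36.0    # USD to run the benchmark
sonnet_cost = 549.0      # USD to run the benchmark
leanstral_pass2 = 26.3   # pass@2 score
margin = 2.6             # points by which Leanstral leads Sonnet

# Leanstral's cost as a fraction of Sonnet's, and the inverse multiple.
cost_fraction = leanstral_cost / sonnet_cost
cost_multiple = sonnet_cost / leanstral_cost

# Sonnet's implied pass@2 score.
sonnet_pass2 = round(leanstral_pass2 - margin, 1)

print(f"cost fraction: {cost_fraction:.3f}")
print(f"cost multiple: {cost_multiple:.2f}")
print(f"implied Sonnet pass@2: {sonnet_pass2}")
```

On these numbers Sonnet costs about 15 times as much as Leanstral, which puts Leanstral's cost at under 7% of Sonnet's, and Sonnet's implied pass@2 score is 23.7.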
OpenAI's GPT-5.4 mini now available in GitHub Copilot
OpenAI has released GPT-5.4 mini, the lightweight variant of its agentic coding model GPT-5.4, in GitHub Copilot. The model represents OpenAI's highest-performing mini offering to date for code generation and completion tasks.
Anthropic launches Code Review tool to automatically analyze AI-generated code
Anthropic has launched Code Review, a multi-agent system within Claude Code that automatically analyzes AI-generated code and flags logic errors. The tool addresses enterprise concerns about managing the increasing volume of code produced by AI systems.
Anthropic adds scheduled background tasks to Claude Code Desktop
Anthropic has added scheduled task functionality to Claude Code Desktop, allowing users to set up recurring automation that runs in the background. The feature enables Claude to perform routine developer operations like checking error logs and creating pull requests for fixable bugs at specified intervals.
OpenAI's GPT-5.4 now generally available in GitHub Copilot
OpenAI's GPT-5.4, an agentic coding model, is now generally available in GitHub Copilot. The model was tested on real-world software development scenarios and demonstrated improved coding capabilities.
Tabnine launches Enterprise Context Engine to ground AI coding in production environments
Tabnine has introduced its Enterprise Context Engine, designed to give AI models the contextual understanding needed to operate safely within real production development environments. The tool addresses a gap between raw model capability and practical enterprise deployment, where understanding an organization's codebase, dependencies, and architecture is critical.
OpenAI says SWE-bench Verified is broken: most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.
GitHub deprecates selected Anthropic and OpenAI models from Copilot
GitHub deprecated selected Anthropic and OpenAI models across all Copilot experiences on February 17, 2026. The deprecation affects Copilot Chat, inline edits, ask mode, agent mode, and code completions. Specific model names and transition timelines were not disclosed in the initial announcement.