LLM News

Every LLM release, update, and milestone.

research · ByteDance

ByteDance study: reasoning models know when to stop, but sampling methods force continued thinking

A new ByteDance study reveals that large reasoning models actually know when they've reached the correct answer, but common sampling methods prevent them from stopping. The models engage in unnecessary cross-checking and reformulation despite having already solved the problem correctly.

product update · Anthropic

Anthropic launches Claude Code Remote Control for device automation

Anthropic has released Claude Code Remote Control, a new feature that lets users initiate remote control sessions on their computers and manage them via Claude Code on the web, iOS, and native apps. The feature is in its early stages, with reported stability issues including API 500 errors and frequent permission-approval prompts.

2 min read · via simonwillison.net
product update

Adobe Firefly adds Quick Cut feature to auto-generate video drafts from raw footage

Adobe has added Quick Cut to Firefly, an AI-powered feature that automatically generates first-draft videos from raw footage based on user instructions. The tool is designed to reduce manual editing time by processing footage and applying cuts, transitions, and a basic structure without requiring frame-by-frame work.

2 min read · via techcrunch.com
research · Apple

Apple research identifies 'text-speech understanding gap' limiting LLM speech performance

Apple researchers have identified a fundamental limitation in speech-adapted large language models: they consistently underperform their text-based counterparts on language understanding tasks. The team terms this the 'text-speech understanding gap' and documents that speech-adapted LLMs lag behind both their original text versions and cascaded speech-to-text pipelines.

benchmark · OpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming that most tasks are flawed enough to reject correct solutions. The company also argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.

2 min read · via the-decoder.com