reinforcement-learning

8 articles tagged with reinforcement-learning

April 30, 2026
changelog · OpenAI

OpenAI Fixed GPT-5.5's Goblin Obsession by Explicitly Banning Mythical Creature References

OpenAI discovered that its GPT-5.1 through GPT-5.4 models had developed an increasing fixation on goblins, gremlins, and other mythical creatures. The issue was traced back to reinforcement learning rewards used to develop a discontinued 'Nerdy personality' feature; the learned behavior persisted across model generations.

April 11, 2026
benchmark

AI models guess instead of asking for help, ProactiveBench study shows

Researchers introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Of 22 models tested—including GPT-4.1, GPT-5.2, and o4-mini—almost none proactively requested clarification; instead they hallucinated answers or refused to respond. A reinforcement learning approach showed models can be trained to ask for help, improving performance from 17.5% to 37-38%, though significant gaps remain.
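
The study's actual reward design isn't given here, but the core training idea can be sketched: reward a clarifying question only when information is genuinely missing. The snippet below is a hypothetical illustration; `proactivity_reward` and its string heuristics are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a ProactiveBench-style RL reward signal.
# The heuristics for detecting a clarification request are assumptions.

def proactivity_reward(response: str, info_is_missing: bool) -> float:
    """Reward asking for help only when visual information is truly missing."""
    asked_for_help = response.strip().endswith("?") or "clarify" in response.lower()
    if info_is_missing:
        return 1.0 if asked_for_help else -1.0  # penalize hallucinated guesses
    return -0.5 if asked_for_help else 1.0      # penalize unnecessary questions

# A model that guesses when the image is ambiguous gets penalized:
print(proactivity_reward("The object is a red apple.", info_is_missing=True))      # -1.0
print(proactivity_reward("Could you clarify which object you mean?", True))        # 1.0
```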

April 8, 2026
product update · Amazon Web Services

Amazon Bedrock adds reinforcement fine-tuning best practices for Nova and open source models

Amazon Bedrock now supports Reinforcement Fine-Tuning (RFT) for customizing Amazon Nova and open source models using reward signals instead of labeled datasets. AWS reports up to 66% accuracy improvements over base models with reduced customization complexity. The approach works best for tasks whose outputs can be verified for correctness (code, math) or judged against clear evaluation criteria (moderation, summarization).
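
As a rough illustration of what "verifiable correctness" means as a reward signal, the sketch below grades a model's final numeric answer programmatically. This is not Amazon Bedrock's API; `math_reward` and its naive answer-parsing heuristic are assumptions for illustration only.

```python
# Illustrative verifiable-correctness grader in the spirit of RFT reward
# signals; not Amazon Bedrock's actual interface.

def math_reward(model_answer: str, expected: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the model's final numeric token matches, else 0.0."""
    try:
        value = float(model_answer.strip().split()[-1])  # naive final-token parse
    except ValueError:
        return 0.0
    return 1.0 if abs(value - expected) <= tol else 0.0

print(math_reward("The answer is 42", expected=42.0))  # 1.0
print(math_reward("Roughly 40", expected=42.0))        # 0.0
```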

April 6, 2026
research

Alibaba's HopChain framework fixes vision model failures in multi-step reasoning tasks

Researchers from Alibaba's Qwen team and Tsinghua University developed HopChain, a framework that automatically generates multi-step image questions to address how vision-language models fail during complex reasoning tasks. The method improved results on 20 of 24 tested benchmarks by forcing models to re-examine images at each reasoning step, preventing early perceptual errors from cascading through subsequent steps.
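
A minimal sketch of the re-examination idea, assuming a hypothetical `vlm(image, prompt)` callable (not Alibaba's actual interface): each hop queries the image again rather than reasoning from a single initial pass.

```python
# HopChain-style stepwise re-grounding, sketched: re-query the image at each
# reasoning hop so an early perceptual mistake can be corrected rather than
# cascading through later steps. `vlm` is a hypothetical stand-in.

def answer_multi_hop(vlm, image, hops: list[str]) -> str:
    context = ""
    for hop_question in hops:
        prompt = f"{context}\nQuestion: {hop_question}"
        context += f"\nQ: {hop_question}\nA: {vlm(image, prompt)}"
    return context.strip()

# Usage with a stubbed model:
def fake_vlm(image, prompt):
    return f"(answer grounded in {image})"

print(answer_multi_hop(fake_vlm, "photo.jpg", ["What is held?", "What color is it?"]))
```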

April 5, 2026
research

Alibaba's Qwen team develops algorithm that more than doubles reasoning chain length in math problems

Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that assigns different weights to tokens based on their influence on subsequent reasoning steps, rather than treating all tokens equally. Testing on Qwen2.5-32B-Base showed reasoning chains growing from roughly 4,000 to more than 10,000 tokens, with AIME 2024 accuracy improving from 50% to 58%, outperforming Deepseek-R1-Zero-Math-32B (47%) and OpenAI's o1-mini (56%). The team plans to open-source the system.
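
The exact future-KL influence estimate isn't reproduced here, but the core departure from uniform token weighting can be sketched as a per-token weighted REINFORCE loss. The `weights` values below are assumed stand-ins for FIPO's influence scores.

```python
# Toy sketch of the FIPO idea: scale each token's policy-gradient term by an
# influence weight instead of weighting all tokens equally. The real
# future-KL influence estimate is not implemented here.

import math

def weighted_pg_loss(logprobs: list[float], advantage: float,
                     weights: list[float]) -> float:
    """REINFORCE-style loss with per-token influence weights."""
    return -advantage * sum(w * lp for w, lp in zip(weights, logprobs))

logprobs = [math.log(0.9), math.log(0.5), math.log(0.7)]
uniform = weighted_pg_loss(logprobs, advantage=1.0, weights=[1.0, 1.0, 1.0])
fipo_ish = weighted_pg_loss(logprobs, advantage=1.0, weights=[0.2, 1.8, 1.0])
print(uniform, fipo_ish)  # the pivotal second token dominates the second loss
```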

March 25, 2026
research · Apple

Apple's RubiCap model generates better image captions with 3-7B parameters than 72B competitors

Apple researchers developed RubiCap, a framework for training dense image captioning models that achieve state-of-the-art results at 2B, 3B, and 7B parameter scales. The 7B model outperforms models up to 72 billion parameters on multiple benchmarks including CapArena and CaptionQA, while the 3B variant matches larger 32B models, suggesting efficient dense captioning doesn't require massive scale.

product update · Amazon Web Services

Amazon Bedrock adds reinforcement fine-tuning with OpenAI-compatible APIs

Amazon Bedrock now enables reinforcement fine-tuning (RFT) across multiple model families, including Amazon Nova and open-weight models such as OpenAI's GPT-OSS 20B and Qwen 3 32B. The service automates the end-to-end customization workflow using GRPO (Group Relative Policy Optimization), allowing models to learn from feedback on multiple responses rather than static training datasets, with support for OpenAI-compatible APIs.
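
GRPO's central trick of scoring each sampled response relative to its group, rather than against a learned value function, can be sketched in a few lines. This is a simplified illustration, not Bedrock's internals.

```python
# Sketch of GRPO's group-relative advantage: sample several responses per
# prompt, then normalize each reward by the group's mean and spread, so no
# separate value network is needed.

import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, scored by some reward function:
print(grpo_advantages([0.0, 1.0, 1.0, 0.5]))  # above-mean responses get pushed up
```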

March 19, 2026
model release

Cursor releases Composer 2 at $0.50/$2.50 per 1M tokens, undercutting Claude Opus and GPT-5.4 on pricing

Cursor released Composer 2, a code-specialized model priced at $0.50 per million input tokens and $2.50 per million output tokens—roughly 90% cheaper than Claude Opus 4.6 ($5.00/$25.00) and roughly 80% cheaper than GPT-5.4 ($2.50/$15.00). The model scores 61.3 on Cursor's internal CursorBench, competitive with Claude Opus 4.6 (58.2) but below GPT-5.4 Thinking (63.9).
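
The quoted savings follow directly from the listed per-million-token prices; the short script below reproduces the arithmetic. Model names and prices are taken from the summary above, with input and output rates compared separately since the workload mix is unknown.

```python
# Reproduce the pricing comparison from the listed (input, output) prices
# per 1M tokens.

prices = {
    "Composer 2":      (0.50, 2.50),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4":         (2.50, 15.00),
}

base_in, base_out = prices["Composer 2"]
for name, (p_in, p_out) in prices.items():
    if name == "Composer 2":
        continue
    saving_in = 1 - base_in / p_in
    saving_out = 1 - base_out / p_out
    print(f"{name}: {saving_in:.0%} cheaper on input, {saving_out:.0%} on output")
# Claude Opus 4.6: 90% / 90%; GPT-5.4: 80% / ~83%
```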