product update · Amazon Web Services

Amazon Bedrock adds reinforcement fine-tuning best practices for Nova and open source models

TL;DR

Amazon Bedrock now supports Reinforcement Fine-Tuning (RFT) for customizing Amazon Nova and open source models using reward signals instead of labeled datasets. AWS reports up to 66% accuracy improvements over base models with reduced customization complexity. The approach works best for tasks with verifiable correctness (code, math) or subjective evaluation (moderation, summarization).



Amazon Web Services has published comprehensive best practices for Reinforcement Fine-Tuning (RFT) on Amazon Bedrock, a technique that customizes foundation models using reward signals rather than static labeled datasets. According to AWS, RFT delivers up to 66% accuracy gains over base models while reducing customization cost and complexity.

How RFT Works

Unlike supervised fine-tuning (SFT), which trains on correct input–output pairs, RFT uses a dataset of inputs paired with a reward function. The reward function can be rule-based, a trained grader model, or an LLM acting as a judge. During training, the model generates candidate responses, the reward function scores each response, and the model weights update to increase the probability of high-reward outputs. This iterative cycle steers the model toward behaviors that maximize the reward signal.
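The generate–score–update cycle described above can be sketched as follows. Everything here is a toy stand-in, not Bedrock's training internals: the "model" guesses among canned answers and the reward is a rule-based exact match.

```python
import random
from typing import Callable, List, Tuple

def rft_step(generate: Callable[[str], str],
             reward_fn: Callable[[str, str], float],
             prompts: List[str],
             num_candidates: int = 4) -> List[Tuple[str, str, float]]:
    """One RFT iteration: sample candidate responses per prompt and score
    each with the reward function. A policy-optimization step (e.g. PPO or
    GRPO, elided here) would then upweight the high-reward responses."""
    scored = []
    for prompt in prompts:
        for _ in range(num_candidates):
            response = generate(prompt)
            scored.append((prompt, response, reward_fn(prompt, response)))
    return scored

# Toy stand-ins: a "model" that guesses, and a rule-based exact-match reward.
answers = {"2+2": "4"}
toy_generate = lambda p: random.choice(["3", "4", "5"])
exact_match = lambda p, r: 1.0 if r == answers[p] else 0.0

samples = rft_step(toy_generate, exact_match, ["2+2"], num_candidates=8)
```

Over repeated iterations, only the candidates scoring 1.0 would be reinforced, which is how the reward signal substitutes for labeled outputs.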

AWS identifies two primary categories where RFT excels:

Reinforcement Learning with Verifiable Rewards (RLVR): Tasks where correctness can be automatically verified through rules or tests. Examples include code generation (unit-test pass rates), math reasoning (exact answers), structured data extraction (schema validation), and API orchestration (successful task completion).

Reinforcement Learning with AI Feedback (RLAIF): Subjective tasks where another model evaluates quality against a rubric. Applications include content moderation, chatbots, creative writing, and summarization.
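For the RLVR case, a rule-based reward such as a unit-test pass rate can be a plain function. This sketch assumes a hypothetical `solution` entry point and illustrative test cases; it returns the fraction of tests a generated candidate passes.

```python
# Hypothetical RLVR-style reward: fraction of unit tests that a generated
# code candidate passes. Entry-point name and tests are illustrative.
def unit_test_reward(code: str, tests: list) -> float:
    namespace = {}
    try:
        exec(code, namespace)  # NOTE: sandbox untrusted code in production
    except Exception:
        return 0.0  # code that doesn't even run earns zero reward
    passed = 0
    for inputs, expected in tests:
        try:
            if namespace["solution"](*inputs) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(tests)

candidate = "def solution(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
reward = unit_test_reward(candidate, tests)  # 1.0 for this candidate
```

A fractional pass rate, rather than all-or-nothing scoring, gives the training loop a gradient of quality to climb.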

Dataset Requirements and Guidelines

Amazon Bedrock's RFT supports datasets of 100 to 10,000 training samples, with requirements varying by task complexity. AWS provides tiered guidance:

  • 100–200 examples: Initial experimentation to validate prompts, reward functions, and measurable improvements
  • 200–5,000 examples: Typical implementations providing stronger generalization and consistent performance across prompt variations
  • 5,000–10,000 examples: Complex reasoning tasks, specialized domains, or sophisticated reward functions requiring robustness across diverse inputs

AWS emphasizes that dataset quality fundamentally determines RFT outcomes and that training data must follow the OpenAI chat-completion format, supplied as JSONL files.
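A single training record in that format might look like the following sketch. The field contents are illustrative, and since RFT pairs inputs with a reward function rather than labeled outputs, the record shown carries only the prompt side.

```python
import json

# One illustrative training record in OpenAI chat-completion format,
# serialized as a single JSONL line.
record = {
    "messages": [
        {"role": "system",
         "content": "Solve the problem. End with '#### <answer>'."},
        {"role": "user",
         "content": "Natalia sold 48 clips in April and half as many "
                    "in May. How many clips did she sell in total?"},
    ]
}
line = json.dumps(record)  # one line per record in the JSONL file
```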

Mathematical Reasoning Case Study

AWS demonstrates RFT effectiveness using the GSM8K (Grade School Math 8K) dataset, showing how the approach improves mathematical problem-solving. Unlike standard fine-tuning that encourages pattern-matching, RFT can define reward functions that assign full credit for exact answers while providing partial credit for correct intermediate reasoning steps. This allows models to discover valid solution approaches with relatively small datasets (100–1,000 examples) while maintaining structured output formats.

The example shows a math problem requiring multi-step reasoning with intermediate verification, where RFT can guide the model toward breaking problems into logical steps and following required formatting—capabilities that supervised fine-tuning typically struggles to achieve.
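A reward function along those lines might look like this sketch, which assumes GSM8K's `#### <answer>` convention for marking the final answer and uses an illustrative 0.2 partial credit for responses that follow the required format but get the answer wrong.

```python
import re

# Hypothetical GSM8K-style reward: full credit for the exact final answer,
# partial credit for following the required '#### <answer>' format, and
# zero for ignoring the format. The 0.2 partial-credit value is illustrative.
def gsm8k_reward(response: str, gold_answer: str) -> float:
    match = re.search(r"####\s*(-?[\d,]+)", response)
    if match is None:
        return 0.0                       # required format missing
    predicted = match.group(1).replace(",", "")
    if predicted == gold_answer:
        return 1.0                       # exact final answer
    return 0.2                           # formatted, but incorrect
```

Splitting the reward this way lets training reinforce the output structure even on problems the model has not yet learned to solve.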

Practical Implementation

On Amazon Bedrock, both rule-based and model-based reward approaches are implemented as custom AWS Lambda functions that the platform invokes during the training loop. AWS guidance covers:

  • Reward function strategy and design
  • Hyperparameter tuning informed by experiments across multiple models and use cases
  • Training progress monitoring using Amazon Bedrock metrics
  • Use cases including code generation, structured extraction, and content moderation

The approach works with Amazon Nova and supported open source models available through Bedrock.
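A minimal rule-based reward handler in Lambda form might look like the following sketch. The event and response shapes here are assumptions for illustration, not Bedrock's documented payload schema, which should be checked before wiring a real function into the training loop.

```python
# Sketch of a rule-based reward function as an AWS Lambda handler.
# The 'completion' / 'reference_answer' event fields and the {'reward': ...}
# response shape are assumptions for illustration only.
def lambda_handler(event, context):
    completion = event.get("completion", "")
    expected = event.get("reference_answer", "")
    # Exact-match rule; a real handler might run schema validation,
    # unit tests, or call a grader model instead.
    reward = 1.0 if completion.strip() == expected.strip() else 0.0
    return {"reward": reward}
```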

What This Means

AWS is positioning RFT as a practical alternative to supervised fine-tuning for scenarios where labeled datasets are expensive or impractical to curate. The 66% accuracy improvement claim and support for datasets as small as 100 examples could significantly lower the barrier to model customization for specialized tasks. However, AWS's emphasis on dataset quality and the requirement for well-designed reward functions suggests RFT success depends heavily on implementation details beyond dataset size. The guidance toward 200–5,000 examples for typical implementations indicates that "small dataset" claims should be interpreted conservatively for production deployments.

