product update

Amazon Web Services

AWS releases Nova Forge SDK data mixing guide to preserve general capabilities during fine-tuning

TL;DR

Amazon Web Services published a practical guide for fine-tuning Amazon Nova models using the Nova Forge SDK's data mixing capabilities. According to AWS, blending customer data with Amazon-curated datasets preserved near-baseline MMLU scores while delivering a 12-point F1 improvement on a Voice of Customer classification task spanning 1,420 leaf categories.


Amazon Web Services published a hands-on guide for fine-tuning Amazon Nova models with the Nova Forge SDK's data mixing capabilities, which let developers adapt a model to domain-specific data without sacrificing its general abilities.

Performance claims

According to AWS, blending customer data with Amazon-curated datasets preserved near-baseline MMLU scores while delivering a 12-point F1 improvement on a Voice of Customer classification task spanning 1,420 leaf categories. By contrast, AWS claims fine-tuning an open-source model on customer data alone caused a near-total loss of general capabilities.

Technical implementation

The guide covers a five-stage workflow: environment setup with Nova Forge SDK installation, data preparation with sanitization and validation, training configuration including SageMaker HyperPod runtime setup, model training using supervised fine-tuning with Low-Rank Adaptation (LoRA), and model evaluation against public benchmarks.
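The five stages can be outlined as a configuration skeleton. This is a minimal sketch for exposition only: every key name and value below is an assumption, not the Nova Forge SDK's documented schema.

```python
# Illustrative outline of the five-stage workflow described above.
# All field names and values are assumptions, not SDK parameters.
workflow = {
    "environment": {"sdk": "nova-forge", "runtime": "sagemaker-hyperpod"},
    "data_prep": {"sanitize_delimiters": True, "validate_tokens": True},
    "training_config": {"method": "sft", "adapter": "lora",
                        "lora_rank": 16},     # rank is a placeholder value
    "training": {"max_steps": 5},             # short smoke test first (per AWS)
    "evaluation": {"benchmarks": ["mmlu"]},   # public-benchmark check
}

STAGES = ["environment", "data_prep", "training_config",
          "training", "evaluation"]
for stage in STAGES:
    assert stage in workflow  # each stage carries its own config block
```

Keeping each stage as a separate config block mirrors the guide's structure: data preparation and validation are settled before any GPU time is spent.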

The SDK enforces token-level validation on training data to prevent conflicts with Nova's internal chat template. Special delimiters like System:, User:, and Assistant: must be sanitized before training to avoid corrupting the training signal.
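The guide does not reproduce the SDK's sanitization API, so the following is only a rough sketch of the idea: escape reserved role delimiters so they cannot be mistaken for chat-template markers. The function name and escaping scheme are assumptions, not Nova Forge SDK code.

```python
# Role delimiters that collide with Nova's internal chat template,
# per the guide. Escaping them is one plausible sanitization strategy.
RESERVED_DELIMITERS = ("System:", "User:", "Assistant:")

def sanitize_turn_text(text: str) -> str:
    """Escape reserved delimiters in raw training text so they can't be
    parsed as template markers. Hypothetical helper, not SDK code."""
    for delim in RESERVED_DELIMITERS:
        # Replace "Role:" with "Role\:"; the exact scheme is an assumption.
        text = text.replace(delim, delim.replace(":", "\\:"))
    return text

print(sanitize_turn_text("User: reported that Assistant: replied late"))
# User\: reported that Assistant\: replied late
```

Whatever the real scheme looks like, the point stands: unsanitized role markers in customer data would be indistinguishable from genuine template boundaries and would corrupt the training signal.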

Infrastructure requirements

The walkthrough uses 4 ml.p5.48xlarge GPU instances for both training and evaluation. AWS recommends starting with a short test run (max_steps=5) to validate configuration before committing to full training runs. Prerequisites include an AWS account with Amazon Nova Forge access, a provisioned SageMaker HyperPod cluster with GPU instances, an Amazon SageMaker MLflow application for experiment tracking, and appropriate IAM permissions.
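The recommended short-test pattern amounts to overriding the step budget before a full run. The sketch below illustrates that pattern only; launch_training is a hypothetical stand-in, not a Nova Forge SDK or SageMaker call.

```python
def launch_training(config: dict) -> dict:
    """Hypothetical stand-in for a training launcher; a real run would
    go through SageMaker HyperPod. Here we just echo the config."""
    return {"steps_run": config["max_steps"], "status": "completed"}

base_config = {"model": "nova", "adapter": "lora", "max_steps": 1000}

# Smoke test: cap at max_steps=5 to validate the configuration cheaply
# before committing 4 ml.p5.48xlarge instances to a full run.
smoke = launch_training({**base_config, "max_steps": 5})
assert smoke["status"] == "completed"

# Only after the smoke test passes, launch with the full step budget.
full = launch_training(base_config)
```

At these instance sizes, a five-step dry run is cheap insurance against discovering a data or config error hours into a full training job.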

Dataset example

The guide demonstrates the workflow using the MedReason dataset from Hugging Face, which contains approximately 32,700 medical question-answer pairs. The Nova Forge SDK supports JSONL, JSON, and CSV input formats and provides a JSONLDatasetLoader that converts raw data into the structured turn-based format Nova models expect during training.
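The guide names JSONLDatasetLoader but does not reproduce its output schema. As a generic illustration of the underlying conversion, a Q/A record can be mapped to a turn-based message list like this; the question/answer keys and role/content schema are assumptions, not the SDK's documented format.

```python
import json

def to_turns(jsonl_line: str) -> list[dict]:
    """Convert one raw JSONL Q/A record into a turn-based message list.
    The 'question'/'answer' keys and 'role'/'content' schema are
    illustrative, not the Nova Forge SDK's actual format."""
    record = json.loads(jsonl_line)
    return [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": record["answer"]},
    ]

# Example medical Q/A row in the spirit of the MedReason dataset.
raw = '{"question": "What causes scurvy?", "answer": "Vitamin C deficiency."}'
turns = to_turns(raw)
```

The real loader presumably also applies the token-level validation and delimiter sanitization described earlier before the records reach training.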

What this means

Data mixing addresses a critical challenge in model fine-tuning: maintaining general capabilities while adapting to specific domains. AWS's 12-point F1 improvement claim suggests meaningful performance gains are possible without catastrophic forgetting. However, the requirement for expensive GPU infrastructure (ml.p5.48xlarge instances) and the proprietary nature of Amazon's curated datasets may limit adoption to larger organizations already invested in AWS infrastructure. The detailed sanitization requirements highlight the fragility of chat template-based training approaches.

Related Articles

product update

Amazon Nova Micro Fine-Tuned Text-to-SQL Models Now Available on Bedrock On-Demand Inference at $0.80/Month for 22,000 Queries

AWS has enabled fine-tuned Amazon Nova Micro models to run on Bedrock's on-demand inference for text-to-SQL generation. According to AWS testing, a sample workload of 22,000 queries per month costs $0.80 monthly using the serverless approach, compared to higher costs with persistent model hosting. The solution uses LoRA fine-tuning on the sql-create-context dataset containing over 78,000 SQL examples.

product update

AWS Reduces Video Search Routing Cost 95% Using Nova Premier-to-Micro Model Distillation

Amazon Web Services released a model distillation pipeline on Amazon Bedrock that transfers video search routing intelligence from Nova Premier to Nova Micro. According to AWS, the approach reduces inference cost by over 95% and latency by 50% compared to using Claude Haiku for intent routing.

product update

Amazon Launches Nova Multimodal Embeddings for Video Semantic Search Across Visual, Audio, and Text Signals

Amazon released Nova Multimodal Embeddings on Amazon Bedrock, a unified embedding model that processes text, documents, images, video, and audio into a shared 1024-dimensional semantic vector space. The model supports up to 30 seconds of video per embedding and enables semantic search across all modalities simultaneously without converting video to text first.

product update

AWS launches Automated Reasoning checks in Amazon Bedrock for mathematically verified AI compliance

AWS has released Automated Reasoning checks in Amazon Bedrock Guardrails, a feature that uses formal mathematical verification to validate AI outputs against defined rules. Unlike LLM-as-a-judge approaches that use one probabilistic model to validate another, Automated Reasoning provides mathematically proven, auditable compliance evidence for regulated industries.
