product update

AWS releases four multimodal evaluators for image-to-text AI tasks in Strands Evals SDK

TL;DR

AWS has added four multimodal evaluators to its Strands Evals SDK that judge image-to-text AI outputs by directly analyzing source images. The evaluators—Overall Quality, Correctness, Faithfulness, and Instruction Following—use multimodal large language models to detect visual hallucinations, factual errors, and instruction violations that text-only judges miss.

2 min read
0

AWS releases four multimodal evaluators for image-to-text AI tasks in Strands Evals SDK

AWS has added four multimodal large language model (MLLM)-as-a-Judge evaluators to its Strands Evals SDK for evaluating image-to-text AI tasks. The evaluators send source images directly to judge models alongside text inputs, addressing a core limitation of text-only evaluation methods that cannot verify whether AI outputs are grounded in visual content.

The four evaluators

The new evaluators target common image-to-text tasks including image captioning, visual question answering, chart interpretation, document field extraction, OCR, and screenshot summarization:

  1. Overall Quality: Likert 1-5 scale rating for response quality, catching poor relevance, inaccuracy, and shallow answers
  2. Correctness: Binary score for factual accuracy and completeness, detecting wrong attributes, counts, positions, and omissions
  3. Faithfulness: Binary score for image grounding, identifying invented objects, unsupported inferences, and hallucinations
  4. Instruction Following: Binary score for adherence to query constraints, catching format violations and off-topic content

All four evaluators support both reference-based mode (comparing against gold answers for labeled test sets) and reference-free mode (judging from the image alone for live production data).

Technical implementation

The evaluators integrate with the existing Strands Evals Case → Experiment → Report workflow. According to AWS, they accept images through an ImageData type and share a common MultimodalOutputEvaluator base class.

The judge models run on Amazon Bedrock and return both a numerical score and a reasoning string for debugging. AWS states that developers can plug these evaluators into continuous integration pipelines to catch visual hallucinations automatically.

Requirements

To use the evaluators, developers need:

  • Python 3.10 or later
  • strands-agents-evals package (installed via pip)
  • AWS account with Amazon Bedrock access
  • AWS IAM credentials with InvokeModel permission for judge models

The evaluators work as drop-in replacements for text-only judges in existing Strands Evals workflows.

Why multimodal evaluation matters

AWS cites Gartner research predicting that by 2030, 80% of enterprise software will be multimodal, up from less than 10% in 2024. Text-only evaluators cannot detect when an AI model:

  • Names a chart trend that doesn't exist in the actual chart
  • Hallucinates products, labels, or people not present in images
  • Extracts incorrect data from documents
  • Invents interface elements not shown in screenshots

What this means

This release addresses a significant gap in production AI evaluation. As more enterprise applications incorporate vision capabilities for invoice processing, dashboard summarization, and visual search, the inability to automatically verify image grounding has forced companies to choose between expensive human review and unreliable text-only proxies. AWS's implementation provides automated multimodal evaluation within an existing SDK framework, though the announcement does not disclose judge model accuracy benchmarks, pricing details, or latency measurements for the evaluation process itself. The evaluators' effectiveness will depend heavily on the underlying judge model selection on Bedrock, which AWS notes requires balancing accuracy, cost, and latency trade-offs.

Related Articles

product update

AWS brings NVIDIA Nemotron and OpenAI GPT OSS models to GovCloud for secure government AI workloads

Amazon Bedrock now supports NVIDIA Nemotron and OpenAI GPT OSS models in AWS GovCloud (US) Regions. The launch includes OpenAI's GPT OSS models (120B and 20B parameters, 128K context) and NVIDIA Nemotron 3 family (9B to 120B parameters, 1M context), providing government agencies FedRAMP High and DoD SRG Level 5-compliant AI inference on U.S. soil.

product update

AWS adds metadata filtering to AgentCore Memory, improving agent retrieval accuracy from 40% to 64%

Amazon has added metadata filtering to its AgentCore Memory service for AI agents. In AWS evaluations across 151 questions, the feature improved overall question-answering accuracy from 40% to 64%, with context-dependent questions jumping from 16% to 69% accuracy. The update allows agents to filter memory retrieval by attributes like priority, department, or time range before semantic search runs.

product update

AWS to Release Anthropic's Claude Fable 5 on Bedrock with Cybersecurity Guardrails

Amazon Web Services announced it will make Anthropic's Claude Fable 5 models available on Bedrock starting tomorrow, featuring guardrails designed to prevent cybersecurity misuse. When guardrails are triggered, the system automatically falls back to Claude Opus 4.8.

product update

AWS launches managed entitlements for Bedrock to distribute third-party model access across multi-account organizations

AWS has introduced managed entitlements for Amazon Bedrock, allowing organizations to subscribe to third-party models like Anthropic Claude and Cohere from a central account and distribute access across member accounts without requiring AWS Marketplace permissions. The feature uses AWS License Manager to create grants that share model entitlements with specific accounts or entire organizational units.

Comments

Loading...