AWS releases four multimodal evaluators for image-to-text AI tasks in Strands Evals SDK

TL;DR

AWS has added four multimodal evaluators to its Strands Evals SDK. They judge image-to-text AI outputs by analyzing the source image directly. The evaluators (Overall Quality, Correctness, Faithfulness, and Instruction Following) use multimodal large language models to detect visual hallucinations, factual errors, and instruction violations that text-only judges miss.

AWS has added four MLLM-as-a-Judge evaluators to its Strands Evals SDK for evaluating image-to-text AI tasks; each uses a multimodal large language model (MLLM) as the judge. The evaluators send source images directly to the judge model alongside the text inputs. This addresses a core limitation of text-only evaluation, which cannot verify whether an AI output is grounded in the visual content.

The four evaluators

The new evaluators target common image-to-text tasks including image captioning, visual question answering, chart interpretation, document field extraction, OCR, and screenshot summarization:

  1. Overall Quality: Likert 1-5 scale rating for response quality, catching poor relevance, inaccuracy, and shallow answers
  2. Correctness: Binary score for factual accuracy and completeness, detecting wrong attributes, counts, positions, and omissions
  3. Faithfulness: Binary score for image grounding, identifying invented objects, unsupported inferences, and hallucinations
  4. Instruction Following: Binary score for adherence to query constraints, catching format violations and off-topic content

All four evaluators support both reference-based mode (comparing against gold answers for labeled test sets) and reference-free mode (judging from the image alone for live production data).
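The score types and the two modes described above can be modeled in a few lines. The following is a minimal illustrative sketch, not the Strands Evals API: `ScoreKind`, `JudgeRequest`, and `is_valid_score` are hypothetical names, and representing a binary score as 0 or 1 is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScoreKind(Enum):
    LIKERT_1_TO_5 = "likert"
    BINARY = "binary"


# Score type per evaluator, as listed in the article.
EVALUATOR_SCORES = {
    "Overall Quality": ScoreKind.LIKERT_1_TO_5,
    "Correctness": ScoreKind.BINARY,
    "Faithfulness": ScoreKind.BINARY,
    "Instruction Following": ScoreKind.BINARY,
}


@dataclass
class JudgeRequest:
    """Hypothetical container for one judging call (not an SDK type)."""
    image_path: str
    query: str
    response: str
    reference: Optional[str] = None  # gold answer, present only for labeled test sets

    @property
    def mode(self) -> str:
        # A gold answer means reference-based; otherwise the judge sees only the image.
        return "reference-based" if self.reference is not None else "reference-free"


def is_valid_score(evaluator: str, score: int) -> bool:
    """Check a score against the range its evaluator is described as using."""
    kind = EVALUATOR_SCORES[evaluator]
    if kind is ScoreKind.LIKERT_1_TO_5:
        return 1 <= score <= 5
    return score in (0, 1)  # assumption: binary scores are encoded as 0 or 1
```

A labeled regression set would populate `reference`, while sampled production traffic would leave it as `None`.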

Technical implementation

The evaluators integrate with the existing Strands Evals Case → Experiment → Report workflow. According to AWS, they accept images through an ImageData type and share a common MultimodalOutputEvaluator base class.

The judge models run on Amazon Bedrock and return both a numerical score and a reasoning string for debugging. AWS states that developers can plug these evaluators into continuous integration pipelines to catch visual hallucinations automatically.
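One way to use a score plus a reasoning string in a continuous integration pipeline is a pass-rate gate. The sketch below is an assumption-laden illustration rather than SDK code: `JudgeResult` and `gate` are hypothetical names, scores are assumed to be normalized to the range 0 to 1, and the default thresholds are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical stand-in for a judge verdict: a score plus its reasoning."""
    case_name: str
    score: float  # assumption: normalized to [0, 1]
    reasoning: str


def gate(
    results: list[JudgeResult],
    pass_score: float = 1.0,
    min_pass_rate: float = 0.95,
) -> tuple[bool, list[str]]:
    """Return (ok, failure_messages) for a batch of judge results.

    A case passes when its score reaches pass_score; the batch passes when
    the share of passing cases reaches min_pass_rate.
    """
    if not results:
        # An empty run should not silently pass the build.
        return False, ["no results to evaluate"]
    failures = [
        f"{r.case_name}: {r.reasoning}" for r in results if r.score < pass_score
    ]
    passed = len(results) - len(failures)
    return passed / len(results) >= min_pass_rate, failures
```

A CI job could print the returned failure messages and exit with a non-zero status when the first element is `False`, so the judge's reasoning appears directly in the build log.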

Requirements

To use the evaluators, developers need:

  • Python 3.10 or later
  • strands-agents-evals package (installed via pip)
  • AWS account with Amazon Bedrock access
  • AWS IAM credentials with InvokeModel permission for judge models

The evaluators work as drop-in replacements for text-only judges in existing Strands Evals workflows.
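The first two requirements listed above can be checked locally before running an evaluation. This is a minimal sketch using only the Python standard library; `preflight` is a hypothetical helper, and it assumes the installed distribution name matches the `strands-agents-evals` package name given above. It does not check Amazon Bedrock access or IAM permissions, which need an actual AWS call.

```python
import sys
from importlib import metadata


def preflight(
    package: str = "strands-agents-evals",
    min_python: tuple[int, int] = (3, 10),
) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Requirement: Python 3.10 or later.
    if sys.version_info[:2] < min_python:
        required = ".".join(str(part) for part in min_python)
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {required}+ is required, found {found}")

    # Requirement: the evals package is installed in this environment.
    try:
        metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed (try: pip install {package})")

    return problems
```

Calling `preflight()` at the start of a test session and failing fast on a non-empty list avoids spending judge-model calls on a misconfigured environment.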

Why multimodal evaluation matters

AWS cites Gartner research predicting that by 2030, 80% of enterprise software will be multimodal, up from less than 10% in 2024. Text-only evaluators cannot detect when an AI model:

  • Names a chart trend that doesn't exist in the actual chart
  • Hallucinates products, labels, or people not present in images
  • Extracts incorrect data from documents
  • Invents interface elements not shown in screenshots

What this means

This release addresses a gap in production AI evaluation: the lack of an automated way to verify that a model's output is grounded in the image it describes. As more enterprise applications incorporate vision capabilities for invoice processing, dashboard summarization, and visual search, that gap has forced companies to choose between expensive human review and unreliable text-only proxies.

AWS's implementation provides automated multimodal evaluation within an existing SDK framework. However, the announcement does not disclose judge model accuracy benchmarks, pricing details, or latency measurements for the evaluation process itself. The evaluators' effectiveness will depend heavily on which judge model is selected on Bedrock, a choice that AWS notes requires balancing accuracy, cost, and latency.
