Microsoft releases ASSERT, open-source framework for testing application-specific AI behavior using natural language

TL;DR

Microsoft released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open-source framework that converts natural language descriptions of expected AI behavior into structured test cases. The tool addresses a gap in AI evaluation by testing application-specific behaviors that general benchmarks cannot capture.

June 2, 2026 · 7:20 PM · 2 min read

The framework takes plain-language descriptions of an AI model's expected behavior and policies, converts them into structured sets of acceptable and unacceptable behaviors, generates test scenarios, runs them against the target system, and scores the results. ASSERT also records the paths AI systems take, including intermediate actions and tool calls, enabling developers to inspect where failures occur.
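
To make that flow concrete, here is a minimal Python sketch of the same loop: a structured spec, a set of scenarios, a run that records each step, and a score. It is not ASSERT's API. Every name in it (Spec, Step, Result, toy_agent, evaluate, pass_rate) is hypothetical, the spec and scenarios are hand-written rather than generated from natural language, and the "agent" is a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    """One recorded action in a run (hypothetical trace format)."""
    kind: str    # "tool_call" or "final_answer"
    name: str    # tool name, or "" for the final answer
    detail: str


@dataclass
class Spec:
    """Hand-written stand-in for a spec derived from plain language."""
    required_tools: Set[str]   # acceptable behavior: these must be used
    forbidden_tools: Set[str]  # unacceptable behavior: these must not be used


@dataclass
class Result:
    scenario: str
    passed: bool
    trace: List[Step]
    failed_at: Optional[int]  # index of the first forbidden step, if any


def toy_agent(scenario: str) -> List[Step]:
    """Stand-in for the system under test; returns its recorded path."""
    trace = [Step("tool_call", "search_docs", scenario)]
    if "email" in scenario.lower():
        trace.append(Step("tool_call", "send_email", "summary"))
    trace.append(Step("final_answer", "", "done"))
    return trace


def first_violation(trace: List[Step], spec: Spec) -> Optional[int]:
    """Return the index of the first forbidden tool call, or None."""
    for i, step in enumerate(trace):
        if step.kind == "tool_call" and step.name in spec.forbidden_tools:
            return i
    return None


def evaluate(agent: Callable[[str], List[Step]],
             scenarios: List[str], spec: Spec) -> List[Result]:
    """Run every scenario, keep the full trace, and mark pass or fail."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario)
        used = {s.name for s in trace if s.kind == "tool_call"}
        bad = first_violation(trace, spec)
        passed = bad is None and spec.required_tools <= used
        results.append(Result(scenario, passed, trace, bad))
    return results


def pass_rate(results: List[Result]) -> float:
    """Fraction of scenarios that passed."""
    return sum(r.passed for r in results) / len(results)


if __name__ == "__main__":
    spec = Spec(required_tools={"search_docs"}, forbidden_tools={"send_email"})
    scenarios = ["Summarize the Q3 report", "Email the Q3 report to a partner"]
    for r in evaluate(toy_agent, scenarios, spec):
        print(r.scenario, "PASS" if r.passed else f"FAIL at step {r.failed_at}")
```

Keeping the whole trace on each result, rather than only a pass or fail flag, is what lets a developer see which intermediate step broke the policy.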

How ASSERT works

Developers provide high-level behavioral rules in natural language. For example, a developer could specify that a document research AI agent shouldn't send emails outside the company, should limit confidential information to C-level executives, and must provide concise summaries with prior context. ASSERT uses these rules to generate test cases that verify ongoing compliance.
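
As an illustration of what one such rule might look like once it is reduced to a checkable condition, the sketch below tests the "no emails outside the company" rule against a list of recorded tool calls. This is an assumption about how a rule could be checked, not how ASSERT does it: the ToolCall shape, the send_email tool name, its "to" argument, and the example.com domain are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

COMPANY_DOMAIN = "example.com"  # hypothetical company domain


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation made by the agent."""
    tool: str
    args: Dict[str, object]


def external_recipients(calls: List[ToolCall],
                        domain: str = COMPANY_DOMAIN) -> List[str]:
    """Return every email recipient whose address is outside `domain`.

    Only exact-domain addresses count as internal; subdomains are
    treated as external in this sketch.
    """
    outside = []
    for call in calls:
        if call.tool != "send_email":
            continue
        for address in call.args.get("to", []):
            if not address.lower().endswith("@" + domain):
                outside.append(address)
    return outside


if __name__ == "__main__":
    calls = [
        ToolCall("search_docs", {"query": "Q3 revenue"}),
        ToolCall("send_email", {"to": ["ana@example.com", "bo@partner.org"]}),
    ]
    violations = external_recipients(calls)
    print("violations:", violations)
```

A rule like "provide concise summaries with prior context" does not reduce to a simple string check like this one, which is part of why generated scenarios and scoring are needed on top of deterministic checks.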

Developers can add system context, tools, and constraints to customize evaluation scope. The framework supports testing during development, after deployment, and for continuous monitoring.

Filling an evaluation gap

"One of the things we've learned is that evaluations are absolutely critical to making good decisions," said Sarah Bird, chief product officer of Responsible AI at Microsoft. "What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific."

According to Microsoft, ASSERT addresses a gap that broader benchmarks cannot fill. While general evaluations measure model capabilities across standard metrics, they don't capture behaviors shaped by specific application contexts, policies, and tools.

Industry context

The release aligns with a broader shift in AI evaluation practices. As models become more capable, researchers increasingly focus on repeatable testing and regression checks. Stanford's HELM, MLCommons' AILuminate, and evaluation groups like METR have rolled out benchmarks measuring model behavior under different conditions.

The framework is now available as open source.

What this means

ASSERT is a practical response to a real deployment problem: companies need to verify that AI systems behave correctly within their specific contexts, not just on generic benchmarks. By automating the translation of policy requirements into test cases, Microsoft is targeting the gap between model capability testing and application-specific behavior verification. That could accelerate AI deployment by making it easier to establish and maintain compliance with organizational policies. How effective the framework proves will depend on how well it generates comprehensive test coverage from natural language specifications.

Source: techcrunch.com
