Microsoft releases ASSERT, open-source framework for testing application-specific AI behavior using natural language
Microsoft released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open-source framework that converts natural language descriptions of expected AI behavior into structured test cases. The tool addresses a gap in AI evaluation by testing application-specific behaviors that general benchmarks cannot capture.
The framework takes plain-language descriptions of an AI model's expected behavior and policies, converts them into structured sets of acceptable and unacceptable behaviors, generates test scenarios, runs them against the target system, and scores the results. ASSERT also records the paths AI systems take, including intermediate actions and tool calls, enabling developers to inspect where failures occur.
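The article does not describe ASSERT's actual API, so the following is only a minimal Python sketch of the general pipeline described above: behaviors marked acceptable or unacceptable, a scenario run that records a trace, and a score. Every name here (Step, Behavior, run_scenario, score_trace, toy_agent) is hypothetical, and the behaviors are hand-written, whereas ASSERT is said to derive them from natural language.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One recorded action in a run (hypothetical trace format)."""
    kind: str                      # "tool_call" or "message"
    name: str                      # tool name or message label
    args: dict = field(default_factory=dict)


@dataclass
class Behavior:
    """One expected behavior; hand-written here, not generated from a spec."""
    description: str
    acceptable: bool               # True: should occur, False: should not occur
    matches: Callable[[list], bool]  # does the trace exhibit this behavior?


def run_scenario(agent, prompt):
    """Run the system under test and return the trace it recorded."""
    trace = []
    agent(prompt, trace)
    return trace


def score_trace(trace, behaviors):
    """Return (fraction of behaviors satisfied, per-behavior pass/fail)."""
    results = {b.description: b.matches(trace) == b.acceptable for b in behaviors}
    return sum(results.values()) / len(results), results


def toy_agent(prompt, trace):
    """Stand-in for a real agent: it searches, emails, then answers."""
    trace.append(Step("tool_call", "search_docs", {"query": prompt}))
    trace.append(Step("tool_call", "send_email", {"to": "partner@example.org"}))
    trace.append(Step("message", "final_answer", {"text": "Summary: ..."}))


behaviors = [
    Behavior("searches documents before answering", True,
             lambda t: any(s.name == "search_docs" for s in t)),
    Behavior("sends email", False,
             lambda t: any(s.name == "send_email" for s in t)),
]

trace = run_scenario(toy_agent, "Summarize the Q3 report")
overall, details = score_trace(trace, behaviors)
print(overall)   # 0.5
print(details)
```

Because the trace is kept alongside the score, a failing behavior can be traced back to the specific step that caused it, which is the inspection capability the article attributes to ASSERT.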
How ASSERT works
Developers provide high-level behavioral rules in natural language. For example, a developer could specify that a document research AI agent must not send emails outside the company, should share confidential information only with C-level executives, and must provide concise summaries that include prior context. ASSERT uses these rules to generate test cases that verify ongoing compliance.
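To make the first rule in that example concrete, here is a hedged Python sketch of what a check for "no emails outside the company" could look like once a run's tool calls have been recorded. It is not ASSERT's implementation: the trace layout (a list of dicts with "tool" and "args"), the send_email tool name, and the contoso.com domain are all assumptions made for illustration.

```python
COMPANY_DOMAIN = "contoso.com"  # assumed placeholder domain


def external_email_violations(trace, company_domain=COMPANY_DOMAIN):
    """Return (step index, recipient) for every email sent outside the company."""
    violations = []
    for i, step in enumerate(trace):
        if step.get("tool") != "send_email":
            continue  # only email tool calls are relevant to this rule
        for recipient in step.get("args", {}).get("to", []):
            domain = recipient.rsplit("@", 1)[-1].lower()
            if domain != company_domain:
                violations.append((i, recipient))
    return violations


# Hypothetical recorded run: one internal email, then one with an outside recipient.
trace = [
    {"tool": "search_docs", "args": {"query": "Q3 revenue"}},
    {"tool": "send_email", "args": {"to": ["cfo@contoso.com"]}},
    {"tool": "send_email", "args": {"to": ["cfo@contoso.com", "friend@example.org"]}},
]

print(external_email_violations(trace))  # [(2, 'friend@example.org')]
```

Returning the step index along with the recipient shows where in the run the rule was broken, not just that it was. Rules such as "concise summaries" cannot be checked this mechanically and would need a judgment-based scorer.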
Developers can add system context, tools, and constraints to customize evaluation scope. The framework supports testing during development, after deployment, and for continuous monitoring.
Filling an evaluation gap
"One of the things we've learned is that evaluations are absolutely critical to making good decisions," said Sarah Bird, chief product officer of Responsible AI at Microsoft. "What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific."
According to Microsoft, ASSERT addresses a gap that broader benchmarks cannot fill. While general evaluations measure model capabilities across standard metrics, they don't capture behaviors shaped by specific application contexts, policies, and tools.
Industry context
The release aligns with a broader shift in AI evaluation practices. As models become more capable, researchers increasingly focus on repeatable testing and regression checks. Projects such as Stanford's HELM and MLCommons' AILuminate, along with evaluation groups like METR, have rolled out benchmarks measuring model behavior under different conditions.
The framework is now available as open source.
What this means
ASSERT represents a practical response to a real deployment problem: companies need to verify AI systems behave correctly within their specific contexts, not just on generic benchmarks. By automating the translation of policy requirements into test cases, Microsoft is addressing the gap between model capability testing and application-specific behavior verification. This could accelerate AI deployment by making it easier to establish and maintain compliance with organizational policies, though the framework's effectiveness will depend on how well it generates comprehensive test coverage from natural language specifications.