AWS Adds OS-Level Control to Bedrock AgentCore Browser for Native UI Automation

TL;DR

AWS announced OS Level Actions for Amazon Bedrock AgentCore Browser, extending agent automation beyond the browser's Document Object Model to interact with native operating system UI. The feature enables agents to control system dialogs, security prompts, and context menus through direct mouse and keyboard commands at the OS level.

Amazon Web Services announced OS Level Actions for Amazon Bedrock AgentCore Browser, a new capability that extends AI agent automation beyond the browser's Document Object Model (DOM) to interact with native operating system UI elements.

The Technical Gap

Existing browser automation tooling, such as Playwright and the Chrome DevTools Protocol (CDP), operates within the web layer and can reach only content exposed through the DOM. This creates a hard boundary: native OS elements, including system print dialogs, security prompts, certificate choosers, macOS privacy dialogs, Windows Security prompts, and browser context menus, remain invisible and inaccessible to standard web automation.

According to AWS, this gap particularly affects vision-enabled agents that capture screenshots and receive model instructions with coordinates. When native UI appears, the agent can see what to do in the screenshot but has no mechanism to execute the action through CDP.

Eight Supported Actions

OS Level Actions are organized into three categories with eight specific operations:

Mouse Control:

  • mouseClick — optional x, y coordinates, button type (LEFT/RIGHT), click count (1-10)
  • mouseMove — moves cursor to specified x, y coordinates
  • mouseDrag — drags from start to end coordinates
  • mouseScroll — scrolls with delta values (-1000 to 1000 range)

Keyboard Input:

  • keyType — types strings up to 10,000 characters
  • keyPress — presses individual keys repeatedly (1-100 times)
  • keyShortcut — executes key combinations (up to five keys simultaneously)

Visual Capture:

  • screenshot — captures full OS desktop as base64-encoded PNG
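
The limits in the list above lend themselves to a client-side check before an action is sent. The following is a minimal sketch: the action names and numeric ranges come from the list, while the argument field names (clickCount, deltaX, deltaY, text, presses, keys) are illustrative assumptions and not the documented request schema.

```python
from typing import Any

# Ranges quoted in the article. The dictionary keys are labels for this sketch.
LIMITS = {
    "click_count": (1, 10),
    "scroll_delta": (-1000, 1000),
    "key_presses": (1, 100),
}
MAX_TYPE_CHARS = 10_000
MAX_SHORTCUT_KEYS = 5
UNCONSTRAINED = ("mouseMove", "mouseDrag", "screenshot")


def _in_range(name: str, value: int) -> None:
    low, high = LIMITS[name]
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}..{high}")


def validate_action(action_type: str, args: dict[str, Any]) -> None:
    """Raise ValueError if args break a limit stated in the article.

    Argument field names are hypothetical; adapt them to the real schema.
    """
    if action_type == "mouseClick":
        _in_range("click_count", args.get("clickCount", 1))
        if args.get("button", "LEFT") not in ("LEFT", "RIGHT"):
            raise ValueError("button must be LEFT or RIGHT")
    elif action_type == "mouseScroll":
        _in_range("scroll_delta", args.get("deltaX", 0))
        _in_range("scroll_delta", args.get("deltaY", 0))
    elif action_type == "keyType":
        if len(args.get("text", "")) > MAX_TYPE_CHARS:
            raise ValueError("text exceeds 10,000 characters")
    elif action_type == "keyPress":
        _in_range("key_presses", args.get("presses", 1))
    elif action_type == "keyShortcut":
        if not 1 <= len(args.get("keys", [])) <= MAX_SHORTCUT_KEYS:
            raise ValueError("a shortcut takes one to five keys")
    elif action_type not in UNCONSTRAINED:
        raise ValueError(f"unknown action type: {action_type}")
```

Rejecting an out-of-range request locally avoids spending a round trip on a call that would presumably come back FAILED.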

Implementation Pattern

The feature operates through an action-screenshot-reaction loop via the InvokeBrowser API. Each action call carries one operation with type and arguments, returns SUCCESS or FAILED status, and is tied to the browser session through the x-amzn-browser-session-id header.

In each iteration, the agent dispatches an action and AgentCore executes it at the OS level. The agent then captures a screenshot of the resulting state, sends it to a vision model for reasoning, and chooses the next action based on the observed changes.
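
The loop described above can be sketched without committing to any SDK call. In this sketch, invoke stands in for whatever function sends one action through InvokeBrowser, and decide stands in for the vision-model step. Both are hypothetical callables supplied by the caller, and the request and response field names ("type", "arguments", "status", "data") are assumptions rather than the documented schema.

```python
import base64
from typing import Callable, Optional

Action = dict  # assumed shape: {"type": "mouseClick", "arguments": {...}}


def run_loop(
    invoke: Callable[[Action], dict],
    decide: Callable[[bytes], Optional[Action]],
    first_action: Action,
    max_steps: int = 10,
) -> int:
    """Act, take a screenshot, decide; repeat until decide returns None.

    Returns the number of non-screenshot actions executed.
    """
    action: Optional[Action] = first_action
    steps = 0
    while action is not None and steps < max_steps:
        result = invoke(action)
        if result.get("status") != "SUCCESS":
            raise RuntimeError(f"action failed: {action['type']}")
        steps += 1

        # Observe the new desktop state before choosing what to do next.
        shot = invoke({"type": "screenshot", "arguments": {}})
        if shot.get("status") != "SUCCESS":
            raise RuntimeError("screenshot failed")
        png_bytes = base64.b64decode(shot["data"])

        # The vision model (hypothetical here) returns the next action or None.
        action = decide(png_bytes)
    return steps
```

The max_steps cap is a design choice of this sketch, not a feature of the service: it keeps an agent that never reaches a stopping condition from looping indefinitely.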

Technical Requirements

OS Level Actions require:

  • IAM execution role with three specific permissions: bedrock-agentcore:InvokeBrowser, bedrock-agentcore:StartBrowserSession, and bedrock-agentcore:StopBrowserSession
  • Browser resource configuration through the bedrock-agentcore-control control plane client
  • Session initialization with viewport settings that determine coordinate space and screenshot dimensions
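
As an illustration of the first requirement, the three permissions can be written as an IAM policy document. This is a sketch only: the action names are the ones listed above, but the statement ID and the wildcard resource are assumptions, and a production policy would normally be scoped to specific browser resources.

```python
import json

# The three permissions named in the article.
REQUIRED_ACTIONS = [
    "bedrock-agentcore:InvokeBrowser",
    "bedrock-agentcore:StartBrowserSession",
    "bedrock-agentcore:StopBrowserSession",
]


def build_policy(resource: str = "*") -> dict:
    """Return an IAM policy document granting the three actions.

    The "*" default is a placeholder assumption; pass a resource ARN to narrow it.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AgentCoreBrowserOsActions",  # hypothetical label
                "Effect": "Allow",
                "Action": REQUIRED_ACTIONS,
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```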

The feature is available for new and existing browser configurations without additional setup. Key names for keyboard actions must be lowercase. Supported names include single characters (a-z, 0-9) and named keys such as enter, tab, space, backspace, delete, escape, ctrl, alt, and shift.
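
Because key names must be lowercase, an agent that receives key names from a model may want to normalize them before building a keyboard action. The sketch below accepts only the names quoted above. The full supported set may well be larger, so the allow-list here is an assumption to extend, not a complete reference.

```python
import string

# Only the keys named in the article; treat this as a partial list.
NAMED_KEYS = {
    "enter", "tab", "space", "backspace", "delete",
    "escape", "ctrl", "alt", "shift",
}
SINGLE_CHARS = set(string.ascii_lowercase + string.digits)
MAX_SHORTCUT_KEYS = 5  # "up to five keys simultaneously"


def normalize_key(name: str) -> str:
    """Lowercase a key name and check it against the known subset."""
    key = name.strip().lower()
    if key not in NAMED_KEYS and key not in SINGLE_CHARS:
        raise ValueError(f"key {name!r} is not in the subset named in the article")
    return key


def normalize_shortcut(keys: list[str]) -> list[str]:
    """Normalize each key of a shortcut and enforce the five-key limit."""
    if not 1 <= len(keys) <= MAX_SHORTCUT_KEYS:
        raise ValueError("a shortcut takes one to five keys")
    return [normalize_key(k) for k in keys]
```

For example, normalize_shortcut(["Ctrl", "SHIFT", "P"]) yields the lowercase names the service expects, while an unrecognized name raises an error before any request is made.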

Limitations

AWS notes that some context menu items may not function as expected due to the virtualized environment in which browser sessions run. The screenshot action is the only operation that returns data beyond status codes.

What This Means

This addresses a genuine automation gap in production workflows where web agents encounter OS-level UI that standard browser automation cannot reach. The action-screenshot-reaction pattern mirrors how human users interact with computers, which makes it viable for agents to handle workflows that cross the web-to-OS boundary.

The feature's availability without additional configuration for existing AgentCore Browser deployments suggests AWS is positioning it as a standard capability rather than an optional add-on. The 10,000-character limit on text input and the 100-press limit on key repetition suggest the limits were sized for practical automation scenarios rather than programmatic abuse.
