Product update | Amazon Web Services

AWS Adds OS-Level Control to Bedrock AgentCore Browser for Native UI Automation

TL;DR

AWS announced OS Level Actions for Amazon Bedrock AgentCore Browser, extending agent automation beyond the browser's Document Object Model to interact with native operating system UI. The feature enables agents to control system dialogs, security prompts, and context menus through direct mouse and keyboard commands at the OS level.


Amazon Web Services announced OS Level Actions for Amazon Bedrock AgentCore Browser, a new capability that extends AI agent automation beyond the browser's Document Object Model (DOM) to interact with native operating system UI elements.

The Technical Gap

Existing browser automation tools such as Playwright and the Chrome DevTools Protocol (CDP) operate within the web layer and can reach only content exposed through the DOM. That creates a hard boundary: native OS elements, including system print dialogs, security prompts, certificate choosers, macOS privacy dialogs, Windows Security prompts, and browser context menus, remain invisible and inaccessible to standard web automation.

According to AWS, this gap particularly affects vision-enabled agents that capture screenshots and receive model instructions with coordinates. When native UI appears, the agent can see what to do in the screenshot but has no mechanism to execute the action through CDP.

Eight Supported Actions

OS Level Actions are organized into three categories with eight specific operations:

Mouse Control:

  • mouseClick — optional x, y coordinates, button type (LEFT/RIGHT), click count (1-10)
  • mouseMove — moves cursor to specified x, y coordinates
  • mouseDrag — drags from start to end coordinates
  • mouseScroll — scrolls with delta values (-1000 to 1000 range)

Keyboard Input:

  • keyType — types strings up to 10,000 characters
  • keyPress — presses individual keys repeatedly (1-100 times)
  • keyShortcut — executes key combinations (up to five keys simultaneously)

Visual Capture:

  • screenshot — captures full OS desktop as base64-encoded PNG
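The limits listed above can be checked on the client before a request is sent. The following is a minimal sketch, not the official request schema: the `type`/`arguments` envelope follows the article's description of an action call, while the argument field names (`x`, `y`, `button`, `clickCount`, `deltaX`, `deltaY`, `text`, `key`, `presses`, `keys`) and all function names are hypothetical.

```python
# Sketch only: argument field names below are assumptions, not the AWS schema.

def _check_range(name, value, low, high):
    """Raise ValueError if value falls outside the inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def mouse_click(x=None, y=None, button="LEFT", click_count=1):
    """Build a mouseClick action; coordinates are optional."""
    if button not in ("LEFT", "RIGHT"):
        raise ValueError("button must be LEFT or RIGHT")
    _check_range("click_count", click_count, 1, 10)
    args = {"button": button, "clickCount": click_count}
    if x is not None and y is not None:
        args.update({"x": x, "y": y})
    return {"type": "mouseClick", "arguments": args}


def mouse_scroll(delta_x=0, delta_y=0):
    """Build a mouseScroll action with deltas in the -1000..1000 range."""
    _check_range("delta_x", delta_x, -1000, 1000)
    _check_range("delta_y", delta_y, -1000, 1000)
    return {"type": "mouseScroll", "arguments": {"deltaX": delta_x, "deltaY": delta_y}}


def key_type(text):
    """Build a keyType action for strings of up to 10,000 characters."""
    if len(text) > 10_000:
        raise ValueError("text must be at most 10,000 characters")
    return {"type": "keyType", "arguments": {"text": text}}


def key_press(key, presses=1):
    """Build a keyPress action that repeats one key 1 to 100 times."""
    _check_range("presses", presses, 1, 100)
    return {"type": "keyPress", "arguments": {"key": key, "presses": presses}}


def key_shortcut(*keys):
    """Build a keyShortcut action with one to five simultaneous keys."""
    if not 1 <= len(keys) <= 5:
        raise ValueError("a shortcut takes between one and five keys")
    return {"type": "keyShortcut", "arguments": {"keys": list(keys)}}


# Example: open the print dialog, then right-click at a screen position.
print_dialog = key_shortcut("ctrl", "p")
context_menu = mouse_click(400, 300, button="RIGHT")
```

Validating locally keeps out-of-range values from costing a round trip, but the service remains the authority on what it accepts.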

Implementation Pattern

The feature operates through an action-screenshot-reaction loop via the InvokeBrowser API. Each action call carries one operation with type and arguments, returns SUCCESS or FAILED status, and is tied to the browser session through the x-amzn-browser-session-id header.

In each iteration, the agent dispatches an action and AgentCore executes it at the OS level. The agent then captures a screenshot of the resulting state, sends it to a vision model for reasoning, and chooses the next action based on the observed changes.
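The control flow of that loop can be sketched independently of any SDK. In the sketch below, `dispatch`, `take_screenshot`, and `decide` are hypothetical callables supplied by the caller: in a real agent the first two would wrap InvokeBrowser calls and the third would query a vision model. Only the SUCCESS/FAILED status values come from the article.

```python
from typing import Callable, List, Optional, Tuple

Action = dict  # e.g. {"type": "mouseClick", "arguments": {...}}


def run_loop(
    dispatch: Callable[[Action], str],
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes], Optional[Action]],
    first_action: Action,
    max_steps: int = 20,
) -> List[Tuple[Action, str]]:
    """Run the action-screenshot-reaction loop and return (action, status) pairs.

    The loop stops when `decide` returns None (task finished), when an action
    reports a status other than SUCCESS, or after `max_steps` iterations.
    """
    history: List[Tuple[Action, str]] = []
    action: Optional[Action] = first_action
    for _ in range(max_steps):
        if action is None:
            break
        status = dispatch(action)           # one operation per call
        history.append((action, status))
        if status != "SUCCESS":
            break                           # leave recovery to the caller
        screenshot = take_screenshot()      # observe the resulting state
        action = decide(screenshot)         # the model picks the next step
    return history
```

Passing the three steps in as callables keeps the loop testable without a live browser session, and the `max_steps` cap guards against a model that never decides it is done.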

Technical Requirements

OS Level Actions require:

  • IAM execution role with three specific permissions: bedrock-agentcore:InvokeBrowser, bedrock-agentcore:StartBrowserSession, and bedrock-agentcore:StopBrowserSession
  • Browser resource configuration through the bedrock-agentcore-control control plane client
  • Session initialization with viewport settings that determine coordinate space and screenshot dimensions
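As an illustration of the first requirement, the three permissions could be expressed as an IAM policy document along these lines. This is a sketch: the action names are the ones listed above, but the wildcard `Resource` is a placeholder assumption, and a real policy would normally be scoped to specific browser resources.

```python
import json


def build_browser_policy(resource="*"):
    """Return an IAM policy document granting the three listed actions.

    `resource` defaults to "*" purely as a placeholder; narrow it in practice.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock-agentcore:InvokeBrowser",
                    "bedrock-agentcore:StartBrowserSession",
                    "bedrock-agentcore:StopBrowserSession",
                ],
                "Resource": resource,
            }
        ],
    }


# Serialize to JSON for pasting into the IAM console or an IaC template.
policy_json = json.dumps(build_browser_policy(), indent=2)
print(policy_json)
```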

The feature is available for new and existing browser configurations without additional setup.

Key names for keyboard actions must be lowercase. This applies to single characters (a-z, 0-9) as well as named keys such as enter, tab, space, backspace, delete, escape, ctrl, alt, and shift.
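Because key names are case-sensitive, an agent that takes key names from model output may want to normalize them first. A minimal sketch, assuming the named keys mentioned above are only a subset of what the service supports; `normalize_key` and `is_listed_key` are hypothetical helper names.

```python
import string

# Named keys mentioned in the article; the full supported set may be larger.
NAMED_KEYS = {
    "enter", "tab", "space", "backspace", "delete",
    "escape", "ctrl", "alt", "shift",
}
SINGLE_CHARS = set(string.ascii_lowercase + string.digits)


def normalize_key(name: str) -> str:
    """Trim whitespace and lowercase a key name such as 'Enter' or ' CTRL '."""
    key = name.strip().lower()
    if not key:
        raise ValueError("key name must not be empty")
    return key


def is_listed_key(name: str) -> bool:
    """Report whether the normalized name is one of the keys listed above."""
    return normalize_key(name) in NAMED_KEYS | SINGLE_CHARS


# Model output often capitalizes key names; normalize before building an action.
shortcut = [normalize_key(k) for k in ("Ctrl", "Shift", "P")]
```

A name that fails `is_listed_key` is not necessarily invalid; it only means the key was not among those named here, so the check is better used for logging than for rejecting input.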

Limitations

AWS notes that some context menu items may not function as expected due to the virtualized environment in which browser sessions run. The screenshot action is the only operation that returns data beyond status codes.
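Since the screenshot comes back as a base64-encoded PNG, an agent can decode it and confirm that its dimensions match the session viewport before mapping model coordinates onto it. The sketch below reads the width and height from the PNG header using only the standard library. It takes the base64 string directly, because the name of the response field that carries it is not given in the article.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(b64_png: str) -> tuple:
    """Decode a base64 PNG and return (width, height) from its IHDR chunk.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type ("IHDR"),
    then width and height as big-endian unsigned 32-bit integers.
    """
    data = base64.b64decode(b64_png)
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("payload is not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def matches_viewport(b64_png: str, viewport_width: int, viewport_height: int) -> bool:
    """Check that screenshot pixels line up with the expected coordinate space."""
    return png_size(b64_png) == (viewport_width, viewport_height)
```

If the sizes differ, coordinates returned by the vision model would need to be rescaled before being used in a mouse action.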

What This Means

OS Level Actions address a real automation gap in production workflows, where web agents run into OS-level UI that standard browser automation cannot reach. The action-screenshot-reaction pattern mirrors how a person operates a computer, which makes it practical for agents to handle workflows that cross the web-to-OS boundary.

Because the feature works with existing AgentCore Browser configurations without extra setup, AWS appears to be positioning it as a standard capability rather than an optional add-on. The 10,000-character limit on text input and the 100-press limit on key repetition suggest the limits were sized for practical automation scenarios rather than programmatic abuse.
