DesktopEnv Interface Documentation
Overview
The DesktopEnv class is the core interface for interacting with virtual desktop environments in OSWorld.
It provides a standardized way to control virtual machines across different providers (VMware, VirtualBox, Docker, AWS, etc.) and execute actions on desktop environments through various action spaces.
This class serves as the main entry point for agents to interact with real computer environments, enabling tasks such as:
Making observations by taking screenshots and/or capturing accessibility trees
Executing mouse and keyboard actions
Managing virtual machine states
Evaluating task completion
Setting up and tearing down task environments
Class Definition
class DesktopEnv:
    def __init__(self,
                 provider_name: str = "vmware",
                 region: str = None,
                 path_to_vm: str = None,
                 snapshot_name: str = "init_state",
                 action_space: str = "pyautogui",
                 cache_dir: str = "cache",
                 screen_size: Tuple[int] = (int(os.environ.get("SCREEN_WIDTH", 1920)), int(os.environ.get("SCREEN_HEIGHT", 1080))),
                 headless: bool = False,
                 require_a11y_tree: bool = True,
                 require_terminal: bool = False,
                 os_type: str = "Ubuntu",
                 enable_proxy: bool = False,
                 client_password: str = "",
                 **kwargs)
Initialization Parameters
Core Parameters
- provider_name : str, default="vmware"
The virtualization provider to use. Supported options:
"vmware": VMware Workstation Pro/Fusion
"virtualbox": Oracle VirtualBox
"docker": Docker containers with GUI support
"aws": Amazon Web Services EC2 instances
Each provider has different capabilities and requirements. See provider-specific documentation for details.
- region : str, optional
The region of the virtual machine. Its meaning varies by provider: for the AWS provider, it should be the region that hosts the virtual machine; other providers ignore it.
- path_to_vm : str, optional
For VMware/VirtualBox providers, the file path to the virtual machine configuration file (e.g., .vmx file for VMware).
- snapshot_name : str, default="init_state"
For the VMware provider, "init_state" is the default name of the snapshot we set on the client virtual machine. For cloud providers such as AWS, this parameter should be the AMI ID of the virtual machine. Restoring the snapshot returns the environment to a known clean state for each task.
- action_space : str, default="pyautogui"
Defines the action interface for agent interactions. Supported action spaces:
"pyautogui": Python automation using the PyAutoGUI library
"computer_13": The action space we designed for our agent, containing 13 actions such as click, drag, scroll, and type
"claude_computer_use": The action space from Claude Computer Use
- cache_dir : str, default="cache"
The directory used to cache files downloaded from the cloud (initial files and ground-truth files for environment setup and evaluation) and files retrieved from the client machine (for example, files modified by the agent).
Note
Cloud files are downloaded automatically on first use; afterwards, the system checks this directory before downloading again. Important: if you modify task files without changing their names, the cached copies may not be overwritten. Delete them manually to make sure updates take effect.
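The manual cleanup described in the note could be scripted along the following lines. This is a minimal sketch, not part of DesktopEnv: the helper name purge_cached_files is hypothetical, and it assumes cached files may sit in subdirectories of cache_dir, so it searches recursively.

```python
from pathlib import Path
from typing import List


def purge_cached_files(cache_dir: str, pattern: str) -> List[str]:
    """Delete cached files matching a glob pattern and return their paths.

    Hypothetical helper for illustration; DesktopEnv does not provide it.
    """
    removed: List[str] = []
    root = Path(cache_dir)
    if not root.is_dir():
        # Nothing cached yet, so there is nothing to remove.
        return removed
    # Search subdirectories too, since the cache layout is an assumption here.
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed
```

For example, purge_cached_files("cache", "*.xlsx") would remove every cached spreadsheet so that the next run downloads fresh copies.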
- screen_size : Tuple[int], default=(1920, 1080)
The screen size of the virtual machine you are targeting. Currently this parameter does not change the resolution; it is only used as an index to select a machine. On AWS, requesting a resolution for which no machine exists raises an error; on VMware and Docker, this parameter is ignored.
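The default in the class definition is read from the SCREEN_WIDTH and SCREEN_HEIGHT environment variables. The sketch below mirrors that lookup in a standalone function; the name screen_size_from_env is hypothetical and exists only to illustrate the fallback behavior.

```python
import os
from typing import Tuple


def screen_size_from_env(default: Tuple[int, int] = (1920, 1080)) -> Tuple[int, int]:
    """Read SCREEN_WIDTH and SCREEN_HEIGHT, falling back to the given defaults.

    Hypothetical helper mirroring the default expression in the signature.
    """
    width = int(os.environ.get("SCREEN_WIDTH", default[0]))
    height = int(os.environ.get("SCREEN_HEIGHT", default[1]))
    return (width, height)
```

Note that the signature evaluates its default when the class is defined, so the environment variables presumably need to be set before the module is imported.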
- headless : bool, default=False
Whether to run the virtual machine in headless mode (without GUI display). Useful for batch processing and server deployments.
- require_a11y_tree : bool, default=True
Whether accessibility tree capture is required. When enabled, the environment will extract and provide accessibility information alongside screenshots.
Warning
The quality and speed of accessibility tree extraction depend strongly on the software vendor. When enabled, each observation may take much longer than a plain screenshot (about 10 seconds on average; for some software, several minutes or more). Use with caution.
Our vision is that accessibility trees will be phased out in the future.
- require_terminal : bool, default=False
Whether to require terminal information from the virtual machine. When enabled, the environment attempts to extract terminal information from the virtual machine to inform the agent's decisions.
- os_type : str, default="Ubuntu"
Operating system type of the target virtual machine:
"Ubuntu": Ubuntu Linux distributions
"Windows": Microsoft Windows
- enable_proxy : bool, default=False
Whether to enable a proxy for the virtual machine. If enabled, the environment uses the dataimpulse proxy server to access the internet.
Important
Make sure you have configured the dataimpulse proxy server according to our instructions, that your account has sufficient balance, and that you have copied the username and password to dataimpulse.json; otherwise the network connection will drop.
- client_password : str, optional
Password for the guest operating system user account. Required for:
Proxy configuration setup
Tasks requiring sudo privileges
Administrative operations within the VM
Default credentials vary by provider (e.g., "password" for vmware/virtualbox, "osworld-public-evaluation" for aws).
Additional Configuration Notes
Port Configuration
In the initialization function, several ports are configured for client-server communication. These ports are pre-configured in our provided client machines. If you need to reconfigure machines, please pay attention to these port settings:
# Default port configuration
self.server_port = 5000
self.chromium_port = 9222
self.vnc_port = 8006
self.vlc_port = 8080
Docker Provider Special Handling
For Docker, our implementation is special because Docker containers from the same machine share the same IP address. Therefore, we need to control different ports to support multiple parallel environments. A port allocation mechanism is implemented in the provider, with special handling as shown below:
# Get the ip from the virtual machine, and setup the controller
vm_ip_ports = self.provider.get_ip_address(self.path_to_vm).split(':')
self.vm_ip = vm_ip_ports[0]
# Get the ports from the virtual machine (for Docker provider only)
if len(vm_ip_ports) > 1:
    self.server_port = int(vm_ip_ports[1])
    self.chromium_port = int(vm_ip_ports[2])
    self.vnc_port = int(vm_ip_ports[3])
    self.vlc_port = int(vm_ip_ports[4])
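The same parsing can be tried outside the class. The sketch below is a standalone illustration under stated assumptions: the function name parse_vm_address and the dictionary keys are hypothetical, the defaults are the ports listed under Port Configuration, and the length check is an added safeguard that the snippet above does not perform.

```python
from typing import Dict, Tuple

# Defaults taken from the Port Configuration section.
DEFAULT_PORTS = {"server": 5000, "chromium": 9222, "vnc": 8006, "vlc": 8080}


def parse_vm_address(address: str) -> Tuple[str, Dict[str, int]]:
    """Split 'ip' or 'ip:server:chromium:vnc:vlc' into an IP and a port map.

    Hypothetical helper; it mirrors the order of ports used in the snippet above.
    """
    parts = address.split(":")
    ip = parts[0]
    ports = dict(DEFAULT_PORTS)
    if len(parts) > 1:
        # Added safeguard: fail early instead of raising IndexError later.
        if len(parts) != 5:
            raise ValueError(f"expected an IP plus 4 ports, got {address!r}")
        names = ("server", "chromium", "vnc", "vlc")
        ports = {name: int(value) for name, value in zip(names, parts[1:])}
    return ip, ports
```

A plain address such as "192.168.1.5" keeps the default ports, while a Docker-style address such as "localhost:5001:9223:8007:8081" overrides all four.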
Proxy Configuration
When you set enable_proxy to True, make sure you have configured the dataimpulse proxy server according to our proxy setup instructions and that your account has sufficient balance. You must also copy the username and password to the dataimpulse.json file; otherwise the network connection will drop.
Key Methods
reset()
def reset(self, task_config: Dict) -> observation
Resets the environment to an initial state and sets up the task configuration.
Parameters:
- task_config : Dict
Task configuration dictionary containing:
"id": Unique task identifier
"instruction": Natural language task description for the agent
"config": List of setup commands and scripts for task initialization
"evaluator": Evaluation criteria and success conditions
"trajectory": (Optional) Reference trajectory for task completion
"snapshot": (Optional) VM snapshot to restore before task execution
"proxy": (Optional) Boolean indicating if proxy is required for the task
Returns:
observation : The initial observation after environment reset
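Before calling reset(), it can help to check a task configuration against the keys listed above. The following is a minimal sketch: the helper check_task_config is hypothetical, and treating the four keys not marked optional as required is an assumption drawn from the list, not a documented rule.

```python
from typing import Dict, List

# Assumption: keys not marked "(Optional)" in the list above are required.
REQUIRED_KEYS = ("id", "instruction", "config", "evaluator")


def check_task_config(task_config: Dict) -> List[str]:
    """Return a list of problems found in a task configuration (empty if none).

    Hypothetical helper for illustration; DesktopEnv does not provide it.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in task_config]
    if "config" in task_config and not isinstance(task_config["config"], list):
        problems.append("'config' should be a list of setup steps")
    if "proxy" in task_config and not isinstance(task_config["proxy"], bool):
        problems.append("'proxy' should be a boolean")
    return problems
```

An empty result means the dictionary has the expected shape; it says nothing about whether the evaluator itself is valid.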
step()
def step(self, action: str, pause: Optional[float] = None) -> Tuple[observation, reward, done, info]
Executes an action in the environment and returns the resulting state.
Parameters:
- action : str
Action string formatted according to the specified action_space. Examples:
PyAutoGUI: "pyautogui.click(100, 200)"
Computer 13: {"action": "click", "coordinate": [100, 200]}
- pause : float, optional
CRITICAL PARAMETER: Time to sleep after action execution before capturing the observation.
Warning
The pause parameter (often set via the sleep_after_execution parameter in run_multienv functions) is extremely important for reliable agent performance. In our practical experience, this parameter significantly affects the quality of observations captured after action execution.
Why this matters:
If the sleep time is too short, the agent may capture observations before the action has fully taken effect
GUI applications need time to process commands and update their interfaces
Web pages require time to load and render new content
Network delays can cause UI state changes to be delayed
Without adequate pause time, agents may observe the pre-action state instead of the post-action state, leading to confusion and poor performance
As of 2024-07-29, current AI models and agents are already quite fragile, and inadequate pause timing makes them even more unstable and confused.
Returns:
observation : Current environment observation (screenshot, a11y tree, or both)
reward : Task completion reward (typically 0 during execution, 1 on success)
done : Boolean indicating if the task/episode has ended
info : Additional information dictionary with metadata
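To show how pause fits into an interaction loop, here is a minimal sketch that resets the environment and applies a fixed list of actions with the same pause after each one. The function run_actions is hypothetical; it only relies on the reset() and step() signatures documented above, and the 2.0 second default is an illustrative value rather than a recommendation.

```python
from typing import Any, Dict, Iterable, List


def run_actions(env: Any, task_config: Dict, actions: Iterable[str],
                pause: float = 2.0) -> List[Dict]:
    """Reset the environment, then apply actions until one ends the episode.

    Hypothetical helper: `env` is any object exposing reset() and step()
    with the signatures documented above.
    """
    history: List[Dict] = []
    env.reset(task_config=task_config)
    for action in actions:
        # The pause gives the GUI time to settle before the observation is taken.
        obs, reward, done, info = env.step(action, pause=pause)
        history.append({"action": action, "reward": reward, "done": done})
        if done:
            break
    return history
```

If observations still look stale with a given value, increasing pause is the first thing to try.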
close()
def close()
Closes the environment and cleans up resources. This method should be called when the environment is no longer needed.
Parameters:
None
Returns:
None
get_observation()
def get_observation() -> Dict
Captures the current state observation from the virtual machine.
Returns:
- observation : Dict
Dictionary containing observation data:
"screenshot": Base64-encoded screenshot image
"accessibility_tree": Accessibility tree structure (if enabled)
"terminal": Terminal information (if enabled)
"timestamp": Observation timestamp
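Assuming the "screenshot" field is a Base64 string as described above, it can be turned back into raw image bytes with the standard library. The helper screenshot_bytes is hypothetical; if your version of the environment already returns raw bytes, the decoding step is unnecessary.

```python
import base64
from typing import Dict


def screenshot_bytes(observation: Dict) -> bytes:
    """Decode the Base64 screenshot in an observation into raw image bytes.

    Hypothetical helper; assumes observation["screenshot"] is Base64 text.
    """
    encoded = observation["screenshot"]
    return base64.b64decode(encoded)
```

The returned bytes can be written to a file opened in binary mode or passed to an image library.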
evaluate()
def evaluate() -> float
Evaluates the current task completion status using the configured evaluator.
The evaluation process involves three main components:
Result Getters: Extract current state from various sources (screenshots, files, command outputs, etc.)
Expected Getters: Extract expected/reference state for comparison (optional)
Metrics: Compare result and expected states to compute success scores
Evaluation Flow:
Post-configuration Setup: Executes any post-configuration commands defined in evaluator. Some tasks require additional operations to facilitate verification, such as saving files, activating modification synchronization, etc.
Special Case Handling:
- For "infeasible" tasks: Returns 1 if last action was "FAIL", otherwise 0
- For regular tasks: Returns 0 if last action was "FAIL"
Result Extraction: Uses result_getter to extract current state based on evaluator["result"] configuration
Expected State Extraction: Uses expected_getter to extract reference state (if evaluator["expected"] exists)
Metric Computation: Applies metric function to compare states and compute score
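The special-case step in the flow above can be sketched as a small function that either returns an early score or signals that normal evaluation should continue. The name special_case_score is hypothetical, and identifying an infeasible task by the metric name "infeasible" is an assumption made for illustration.

```python
from typing import Optional


def special_case_score(metric_name: str, last_action: Optional[str]) -> Optional[float]:
    """Return an early score for the FAIL special cases, or None to continue.

    Hypothetical helper; assumes infeasible tasks use the metric name "infeasible".
    """
    agent_failed = last_action == "FAIL"
    if metric_name == "infeasible":
        # Declaring failure is the correct outcome for an infeasible task.
        return 1.0 if agent_failed else 0.0
    if agent_failed:
        # The agent gave up on a task that was feasible.
        return 0.0
    return None
```

A return value of None means neither special case applies, so result extraction and metric computation proceed as described.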
Single vs Multiple Metrics:
Single Metric (most common):
# Example: Check if screenshot contains specific text
evaluator = {
    "func": "check_include_exclude",
    "result": {"type": ...},
    "expected": {"type": ...}
}
# Returns: float score (0.0 to 1.0)
Multiple Metrics with AND/OR logic:
# Example: All conditions must be met
example = {
    "id": "example-task",
    "instruction": "Clear rubbish bin",
    "config": [],
    "evaluator": {
        "func": [
            "metric_1",
            "metric_2"
        ],
        "conj": "and",  # Return positive reward if all metrics succeed
        # "conj": "or"  # Return positive reward if any metric succeeds
        "result": [
            {"type": "xxx", ...},
            {"type": "xxx", ...},
        ],
        "expected": [
            {"type": "xxx", ...},
            {"type": "xxx", ...},
        ],
        "options": [
            {"xxx": ...},
            {"xxx": ...}
        ]
    }
}
# Returns: 0 if any metric fails, otherwise average of all scores
Multiple Metrics with OR logic:
# Example: Any condition can satisfy the task
evaluator = {
    "func": "check_alternative_outcomes",
    "conj": "or",  # Any metric can succeed
    "result": [
        {"type": "screenshot"},
        {"type": "vm_command_line", "command": "ls /tmp/"}
    ]
}
# Returns: 1 if any metric succeeds, otherwise maximum score
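The two combination rules stated in the comments above can be written out as a short function. This is a sketch under stated assumptions, not the library's implementation: the name combine_scores is hypothetical, a metric is taken to "fail" when its score is 0 and to "succeed" when its score is 1.

```python
from typing import Sequence


def combine_scores(scores: Sequence[float], conj: str = "and") -> float:
    """Combine per-metric scores using the AND/OR rules described above.

    Hypothetical helper; "fail" is assumed to mean 0 and "succeed" to mean 1.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if conj == "and":
        # Any failed metric zeroes the result; otherwise average the scores.
        if any(score == 0 for score in scores):
            return 0.0
        return sum(scores) / len(scores)
    if conj == "or":
        # Any fully successful metric is enough; otherwise take the best score.
        if any(score == 1 for score in scores):
            return 1.0
        return float(max(scores))
    raise ValueError(f"unknown conj: {conj!r}")
```

With scores [1.0, 0.5], "and" yields 0.75 and "or" yields 1.0; with [0.2, 0.6], "or" yields 0.6.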
Returns:
score : float - Numerical score (0.0 = failure, 1.0 = complete success, intermediate values possible)
get_info()
def get_info() -> Dict
Returns information about the current environment state and configuration.
Returns:
- info : Dict
Environment information including:
"vm_ip": Virtual machine IP address
"server_port": Communication server port
"action_space": Current action space configuration
"provider": Provider information
Usage Examples
Basic Usage
from desktop_env.desktop_env import DesktopEnv

# Initialize environment with VMware provider
env = DesktopEnv(
    provider_name="vmware",
    action_space="pyautogui"
)

# Define a simple task
task_config = {
    "id": "example-task",
    "instruction": "Clear rubbish bin",
    "config": [],
    "evaluator": {
        "func": "check_include_exclude",
        "result": {"type": "screenshot"},
        "expected": {"type": "rule", "rules": {"include": ["desktop"]}}
    }
}

# Reset environment and execute action
obs = env.reset(task_config=task_config)
obs, reward, done, info = env.step("pyautogui.click(500, 300)", pause=2.0)

# Clean up when finished
env.close()
Docker Provider Example
# Initialize with Docker provider
env = DesktopEnv(
    provider_name="docker",
    os_type="Ubuntu",
    headless=True,
    client_password="password"
)
AWS Provider Example
# Initialize with AWS provider for large-scale evaluation
env = DesktopEnv(
    provider_name="aws",
    os_type="Ubuntu",
    client_password="osworld-public-evaluation"
)
With Accessibility Tree
# Environment with accessibility tree support
env = DesktopEnv(
    provider_name="vmware",
    require_a11y_tree=True
)
Provider-Specific Considerations
- VMware Provider
Requires VMware Workstation Pro or VMware Fusion
Supports snapshots for quick environment reset
Best performance on bare metal machines
Supports macOS on Apple Silicon chips (via VMware Fusion)
- Docker Provider
Recommended for cloud/server deployments
Supports parallel environments with automatic port allocation
Requires KVM support for optimal performance
May have limitations on GUI applications
- AWS Provider
Enables large-scale parallel evaluation
Region-specific instance types and AMI requirements
Network latency considerations for interactive tasks
Cost optimization through spot instances and scheduled termination
- VirtualBox Provider
Free alternative to VMware
Limited parallelism support
May have performance limitations on some systems
Security and Credentials
Default Credentials: Use secure passwords and change defaults in production
Network Security: Configure firewalls appropriately for the communication ports
Cloud Security: Use IAM roles and security groups for AWS deployments
Proxy Configuration: Ensure proxy credentials are securely stored and rotated
Troubleshooting
Common Issues:
Port Conflicts: Ensure communication ports (5000, 9222, 8006, 8080) are available
Network Connectivity: Verify VM can reach external networks if required
Performance Issues: Check sleep_after_execution timing and VM resource allocation
Provider Errors: Consult provider-specific logs and documentation
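For the port conflict item, a quick local check can be done with the standard socket module. The sketch below is an illustration only: port_is_free and free_ports are hypothetical names, and the check covers ports on the local host, which is presumably most relevant when ports are mapped to the host (as with Docker) rather than living inside a separate VM.

```python
import socket
from typing import Dict

# Default ports from the Port Configuration section.
DEFAULT_PORTS = {"server": 5000, "chromium": 9222, "vnc": 8006, "vlc": 8080}


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding fails when another socket already holds the port.
            return False
    return True


def free_ports(ports: Dict[str, int] = DEFAULT_PORTS) -> Dict[str, bool]:
    """Map each named port to whether it is currently free on the local host."""
    return {name: port_is_free(port) for name, port in ports.items()}
```

Calling free_ports() before starting an environment gives a per-port view of which defaults are already taken.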
See Also
OSWorld Task Examples Explanation - Task configuration format
run_experiment - Running experiments with DesktopEnv
Install Provider - Provider-specific setup guides
Quick Start - Quick start guide
Proxy Guideline - Proxy configuration guide