DesktopEnv Interface Documentation

Overview

The DesktopEnv class is the core interface for interacting with virtual desktop environments in OSWorld. It provides a standardized way to control virtual machines across different providers (VMware, VirtualBox, Docker, AWS, etc.) and execute actions on desktop environments through various action spaces.

This class serves as the main entry point for agents to interact with real computer environments, enabling tasks such as:

- Making observations by taking screenshots and/or capturing accessibility trees
- Executing mouse and keyboard actions
- Managing virtual machine states
- Evaluating task completion
- Setting up and tearing down task environments

Class Definition

class DesktopEnv:
    def __init__(self,
                 provider_name: str = "vmware",
                 region: str = None,
                 path_to_vm: str = None,
                 snapshot_name: str = "init_state",
                 action_space: str = "pyautogui",
                 cache_dir: str = "cache",
                 screen_size: Tuple[int] = (int(os.environ.get("SCREEN_WIDTH", 1920)), int(os.environ.get("SCREEN_HEIGHT", 1080))),
                 headless: bool = False,
                 require_a11y_tree: bool = True,
                 require_terminal: bool = False,
                 os_type: str = "Ubuntu",
                 enable_proxy: bool = False,
                 client_password: str = "",
                 **kwargs)

Initialization Parameters

Core Parameters

provider_name : str, default="vmware"

The virtualization provider to use. Supported options:

- "vmware": VMware Workstation Pro/Fusion
- "virtualbox": Oracle VirtualBox
- "docker": Docker containers with GUI support
- "aws": Amazon Web Services EC2 instances

Each provider has different capabilities and requirements. See provider-specific documentation for details.

region : str, optional

The region of the virtual machine. Only the AWS provider uses this parameter, where it should be the AWS region in which the virtual machine runs. Other providers ignore it.

path_to_vm : str, optional

For VMware/VirtualBox providers, the file path to the virtual machine configuration file (e.g., .vmx file for VMware).

snapshot_name : str, default="init_state"

For the VMware provider, "init_state" is the default name of the snapshot we set up on the client virtual machine. For cloud providers such as AWS, this parameter should be the AMI ID of the virtual machine. Restoring from it returns the environment to a known clean state for each task.

action_space : str, default="pyautogui"

Defines the action interface for agent interactions. Supported action spaces:

- "pyautogui": Python automation using the PyAutoGUI library
- "computer_13": An action space we designed for agents, containing 13 actions such as click, drag, scroll, and type
- "claude_computer_use": The action space from Claude Computer Use

cache_dir : str, default="cache"

The directory used to cache files downloaded from the cloud (initial files and ground-truth files for environment setup and evaluation) and files retrieved from the client machine (for example, files modified by the agent).

Note

Cloud files are downloaded automatically on first use. After that, the system checks this directory before downloading. Important: if you modify task files without changing their names, the cached copies may not be overwritten. Delete them manually to make sure the updates take effect.
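The cache-first behavior described in this note can be sketched as follows. This is only an illustration, not the actual OSWorld code: fetch_with_cache and its download callback are hypothetical names introduced here.

```python
import os
from typing import Callable


def fetch_with_cache(cache_dir: str, filename: str,
                     download: Callable[[str], None]) -> str:
    """Return the cached path, calling download(path) only on a cache miss.

    `download` is a hypothetical callback that writes the file to `path`.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        # Cache miss: fetch the file. A file that already exists under the
        # same name is reused as-is, even if the remote copy has changed.
        download(path)
    return path
```

Because the lookup is keyed on the file name alone, a stale copy is reused until it is deleted, which is why manual deletion is needed after editing task files in place.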

screen_size : Tuple[int], default=(1920, 1080)

The screen size of the virtual machine you intend to use. This parameter does not change the resolution; it is used only as an index. On AWS, an error is reported when no machine with the corresponding resolution exists; on VMware and Docker, this parameter is ignored.
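The default shown in the class definition is read from the SCREEN_WIDTH and SCREEN_HEIGHT environment variables, falling back to 1920 and 1080. A minimal sketch of that expression as a standalone function (default_screen_size is a hypothetical helper, not part of DesktopEnv):

```python
import os
from typing import Mapping, Tuple


def default_screen_size(environ: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Mirror the default expression from the constructor signature."""
    width = int(environ.get("SCREEN_WIDTH", 1920))
    height = int(environ.get("SCREEN_HEIGHT", 1080))
    return (width, height)
```

Python evaluates default arguments once, when the function is defined. With the signature as shown, the environment variables would therefore need to be set before the module defining DesktopEnv is imported.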

headless : bool, default=False

Whether to run the virtual machine in headless mode (without GUI display). Useful for batch processing and server deployments.

require_a11y_tree : bool, default=True

Whether accessibility tree capture is required. When enabled, the environment will extract and provide accessibility information alongside screenshots.

Warning

The quality and speed of obtaining accessibility tree content are strongly correlated with the software provider. When enabled, each observation may take much longer to acquire than a screenshot alone (about 10 seconds on average; for some software it may take several minutes or more). Use with caution.

Our vision is that accessibility trees will be phased out in the future.

require_terminal : bool, default=False

Whether terminal information is required. When enabled, the environment attempts to extract terminal information from the virtual machine to inform the agent's decisions.

os_type : str, default="Ubuntu"

Operating system type of the target virtual machine:

- "Ubuntu": Ubuntu Linux distributions
- "Windows": Microsoft Windows

enable_proxy : bool, default=False

Whether to enable a proxy for the virtual machine. If enabled, the environment uses the dataimpulse proxy server to access the internet.

Important

Make sure you have configured the dataimpulse proxy server according to our instructions, that the account has sufficient balance, and that you have copied the username and password to dataimpulse.json. Otherwise the network connection will be lost.

client_password : str, optional

Password for the guest operating system user account. Required for:

- Proxy configuration setup
- Tasks requiring sudo privileges
- Administrative operations within the VM

Default credentials vary by provider (e.g., "password" for vmware/virtualbox, "osworld-public-evaluation" for aws).

Additional Configuration Notes

Port Configuration

In the initialization function, several ports are configured for client-server communication. These ports are pre-configured in our provided client machines. If you need to reconfigure machines, please pay attention to these port settings:

# Default port configuration
self.server_port = 5000
self.chromium_port = 9222
self.vnc_port = 8006
self.vlc_port = 8080

Docker Provider Special Handling

Docker requires special handling because containers on the same machine share the same IP address. To support multiple parallel environments, each environment must therefore use different ports. The provider implements a port allocation mechanism, and the initialization code handles its result as shown below:

# Get the ip from the virtual machine, and setup the controller
vm_ip_ports = self.provider.get_ip_address(self.path_to_vm).split(':')
self.vm_ip = vm_ip_ports[0]
# Get the ports from the virtual machine (for Docker provider only)
if len(vm_ip_ports) > 1:
    self.server_port = int(vm_ip_ports[1])
    self.chromium_port = int(vm_ip_ports[2])
    self.vnc_port = int(vm_ip_ports[3])
    self.vlc_port = int(vm_ip_ports[4])
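The parsing above can also be written as a standalone helper that falls back to the default ports when the address carries no port list. This is a sketch under assumptions: the Endpoints dataclass and parse_endpoints are hypothetical names, and the check for exactly four ports is an added safeguard that the original snippet does not perform.

```python
from dataclasses import dataclass


@dataclass
class Endpoints:
    """IP address plus the four communication ports (defaults from above)."""
    ip: str
    server_port: int = 5000
    chromium_port: int = 9222
    vnc_port: int = 8006
    vlc_port: int = 8080


def parse_endpoints(address: str) -> Endpoints:
    """Parse "ip" or "ip:server:chromium:vnc:vlc" into an Endpoints value."""
    parts = address.split(":")
    if len(parts) == 1:
        # No ports in the string: keep the default port configuration.
        return Endpoints(ip=parts[0])
    if len(parts) != 5:
        raise ValueError(f"expected 'ip' or 'ip' plus four ports, got {address!r}")
    server, chromium, vnc, vlc = (int(p) for p in parts[1:])
    return Endpoints(parts[0], server, chromium, vnc, vlc)
```

For example, parse_endpoints("127.0.0.1:5001:9223:8007:8081") yields separate ports for one of several parallel containers, while a bare IP keeps the defaults.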

Proxy Configuration

When you set enable_proxy to True, make sure you have configured the dataimpulse proxy server according to our proxy setup instructions and that the account has sufficient balance. You must also copy the username and password to the dataimpulse.json file; otherwise the network connection will be lost.

Key Methods

reset()

def reset(self, task_config: Dict) -> observation

Resets the environment to an initial state and sets up the task configuration.

Parameters:

task_config : Dict
Task configuration dictionary containing:
- "id": Unique task identifier
- "instruction": Natural language task description for the agent
- "config": List of setup commands and scripts for task initialization
- "evaluator": Evaluation criteria and success conditions
- "trajectory": (Optional) Reference trajectory for task completion
- "snapshot": (Optional) VM snapshot to restore before task execution
- "proxy": (Optional) Boolean indicating if proxy is required for the task

Returns:

observation : The initial observation after environment reset
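A task_config dictionary can be checked against the keys listed above before it is passed to reset(). The sketch below assumes that the four keys not marked "(Optional)" are required; that reading, and the helper name missing_task_keys, are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Keys listed above without the "(Optional)" marker (assumed to be required).
REQUIRED_KEYS = ("id", "instruction", "config", "evaluator")
# Keys listed above as optional.
OPTIONAL_KEYS = ("trajectory", "snapshot", "proxy")


def missing_task_keys(task_config: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from task_config."""
    return [key for key in REQUIRED_KEYS if key not in task_config]
```

An empty result means every required key is present; the function does not inspect the values.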

step()

def step(self, action: str, pause: Optional[float] = None) -> Tuple[observation, reward, done, info]

Executes an action in the environment and returns the resulting state.

Parameters:

action : str
Action string formatted according to the specified action_space. Examples:
- PyAutoGUI: "pyautogui.click(100, 200)"
- Computer 13: {"action": "click", "coordinate": [100, 200]}

pause : float, optional
Critical parameter: time to sleep after action execution before capturing the observation.

Warning

The pause parameter (often set via the sleep_after_execution parameter in run_multienv functions) is extremely important for reliable agent performance. In our practical experience, this parameter significantly affects the quality of observations captured after action execution.

Why this matters:
- If the sleep time is too short, the agent may capture observations before the action has fully taken effect
- GUI applications need time to process commands and update their interfaces
- Web pages require time to load and render new content
- Network delays can cause UI state changes to be delayed
- Without adequate pause time, agents may observe the pre-action state instead of the post-action state, leading to confusion and poor performance
As of 2024-07-29, current AI models and agents are already quite fragile, and inadequate pause timing makes them even more unstable and confused.

Returns:

observation : Current environment observation (screenshot, a11y tree, or both)
reward : Task completion reward (typically 0 during execution, 1 on success)
done : Boolean indicating if the task/episode has ended
info : Additional information dictionary with metadata
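The following sketch shows how an agent loop might pass pause on every step and stop when done is returned. It runs against a stub so that it needs no virtual machine: FakeEnv and run_episode are hypothetical names, and the stub's rewards are invented purely to exercise the loop.

```python
import time
from typing import Any, Dict, List, Optional, Tuple


class FakeEnv:
    """Stub with the same step() return shape as described above."""

    def __init__(self, episode_length: int) -> None:
        self.remaining = episode_length

    def step(self, action: str, pause: Optional[float] = None
             ) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        if pause:
            time.sleep(pause)  # give the (imaginary) GUI time to settle
        self.remaining -= 1
        done = self.remaining <= 0
        return {"screenshot": b""}, (1.0 if done else 0.0), done, {}


def run_episode(env: Any, actions: List[str], pause: float) -> Tuple[int, float]:
    """Execute actions in order, stopping early once the episode is done."""
    steps, reward = 0, 0.0
    for action in actions:
        _obs, reward, done, _info = env.step(action, pause=pause)
        steps += 1
        if done:
            break
    return steps, reward
```

With a real environment, the same loop shape applies; only the choice of pause changes how settled the observation is when it is captured.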

close()

def close(self) -> None

Closes the environment and cleans up resources. This method should be called when the environment is no longer needed.

Parameters:

None

Returns:

None
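One way to make sure close() is always called, even when an action raises, is the standard library's contextlib.closing. The sketch below uses a stub in place of a real environment; StubEnv and run_with_cleanup are hypothetical names.

```python
from contextlib import closing
from typing import Any, Callable


class StubEnv:
    """Stand-in that only records whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def run_with_cleanup(env: Any, work: Callable[[Any], Any]) -> Any:
    """Run work(env) and call env.close() afterwards, even on error."""
    with closing(env):
        return work(env)
```

closing() works with any object that has a close() method, so the same pattern should apply to a DesktopEnv instance without changes to the class.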

get_observation()

def get_observation(self) -> Dict

Captures the current state observation from the virtual machine.

Returns:

observation : Dict
Dictionary containing observation data:
- "screenshot": Base64-encoded screenshot image
- "accessibility_tree": Accessibility tree structure (if enabled)
- "terminal": Terminal information (if enabled)
- "timestamp": Observation timestamp
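Assuming the "screenshot" value is Base64 text as described above, it can be turned back into image bytes with the standard base64 module. The PNG check is an extra assumption about the image format, and both helper names are hypothetical.

```python
import base64
from typing import Any, Dict

# First eight bytes of every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def screenshot_bytes(observation: Dict[str, Any]) -> bytes:
    """Decode the Base64 "screenshot" entry into raw image bytes."""
    raw = observation["screenshot"]
    if isinstance(raw, str):
        raw = raw.encode("ascii")
    return base64.b64decode(raw)


def looks_like_png(data: bytes) -> bool:
    """Return True if the data starts with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)
```

If the environment already returns raw image bytes instead of Base64, the decoding step should be skipped.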

evaluate()

def evaluate(self) -> float

Evaluates the current task completion status using the configured evaluator.

The evaluation process involves three main components:

- Result Getters: Extract current state from various sources (screenshots, files, command outputs, etc.)
- Expected Getters: Extract expected/reference state for comparison (optional)
- Metrics: Compare result and expected states to compute success scores

Evaluation Flow:

- Post-configuration Setup: Executes any post-configuration commands defined in the evaluator. Some tasks require additional operations to facilitate verification, such as saving files or activating modification synchronization.
- Special Case Handling:
  - For "infeasible" tasks: Returns 1 if the last action was "FAIL", otherwise 0
  - For regular tasks: Returns 0 if the last action was "FAIL"
- Result Extraction: Uses result_getter to extract the current state based on the evaluator["result"] configuration
- Expected State Extraction: Uses expected_getter to extract the reference state (if evaluator["expected"] exists)
- Metric Computation: Applies the metric function to compare the states and compute the score
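The special-case step in the flow above can be sketched as a small function that either settles the score immediately or signals that the metrics should run. How a task is marked infeasible is not specified here, so the infeasible flag and the function name special_case_score are hypothetical.

```python
from typing import Optional


def special_case_score(infeasible: bool, last_action: Optional[str]) -> Optional[float]:
    """Return a final score for the special cases, or None to continue.

    None means no special case applies, so evaluation proceeds to result
    extraction and metric computation.
    """
    failed = last_action == "FAIL"
    if infeasible:
        # An infeasible task is solved only by declaring failure.
        return 1.0 if failed else 0.0
    if failed:
        # Giving up on a regular task scores zero.
        return 0.0
    return None
```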

Single vs Multiple Metrics:

Single Metric (most common):

# Example: Check if screenshot contains specific text
evaluator = {
    "func": "check_include_exclude",
    "result": {"type": ...},
    "expected": {"type": ...}
}
# Returns: float score (0.0 to 1.0)

Multiple Metrics with AND/OR logic:

# Example: All conditions must be met
example = {
    "id": "example-task",
    "instruction": "Clear rubbish bin",
    "config": [],
    "evaluator": {
        "func": [
            "metric_1",
            "metric_2"
        ],
        "conj": "and",  # Return positive reward if all metrics succeed
        # "conj": "or"  # Return positive reward if any metric succeeds
        "result": [
            {"type": "xxx", ...},
            {"type": "xxx", ...},
        ],
        "expected": [
            {"type": "xxx", ...},
            {"type": "xxx", ...},
        ],
        "options": [
            {"xxx": ...},
            {"xxx": ...}
        ]
    }
}
# Returns: 0 if any metric fails, otherwise average of all scores

Multiple Metrics with OR logic:

# Example: Any condition can satisfy the task
evaluator = {
    "func": "check_alternative_outcomes",
    "conj": "or",  # Any metric can succeed
    "result": [
        {"type": "screenshot"},
        {"type": "vm_command_line", "command": "ls /tmp/"}
    ]
}
# Returns: 1 if any metric succeeds, otherwise maximum score
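The two "Returns" comments above describe how per-metric scores are combined. A minimal sketch of that rule follows, assuming a metric "fails" when its score is 0 and "succeeds" when its score is 1; combine_scores is a hypothetical name and not the evaluator's actual implementation.

```python
from typing import Sequence


def combine_scores(scores: Sequence[float], conj: str = "and") -> float:
    """Combine per-metric scores using "and" or "or" logic."""
    if not scores:
        raise ValueError("at least one metric score is required")
    if conj == "and":
        # Any failed metric fails the task; otherwise average the scores.
        if any(score == 0 for score in scores):
            return 0.0
        return sum(scores) / len(scores)
    if conj == "or":
        # Any fully successful metric passes the task; otherwise take the best.
        if any(score == 1 for score in scores):
            return 1.0
        return max(scores)
    raise ValueError(f"unknown conj value: {conj!r}")
```

Under "and", partial scores are averaged only when no metric scored 0. Under "or", the result always equals the highest score, so the early return mainly matters when metrics are evaluated one at a time and evaluation can stop at the first success.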

Returns:

score : float - Numerical score (0.0 = failure, 1.0 = complete success, intermediate values possible)

get_info()

def get_info(self) -> Dict

Returns information about the current environment state and configuration.

Returns:

info : Dict
Environment information including:
- "vm_ip": Virtual machine IP address
- "server_port": Communication server port
- "action_space": Current action space configuration
- "provider": Provider information

Usage Examples

Basic Usage

from desktop_env.desktop_env import DesktopEnv

# Initialize environment with VMware provider
env = DesktopEnv(
    provider_name="vmware",
    action_space="pyautogui"
)

# Define a simple task
task_config = {
    "id": "example-task",
    "instruction": "Clear rubbish bin",
    "config": [],
    "evaluator": {
        "func": "check_include_exclude",
        "result": {"type": "screenshot"},
        "expected": {"type": "rule", "rules": {"include": ["desktop"]}}
    }
}

# Reset environment and execute action
obs = env.reset(task_config=task_config)
obs, reward, done, info = env.step("pyautogui.click(500, 300)", pause=2.0)

# Clean up when finished
env.close()

Docker Provider Example

# Initialize with Docker provider
env = DesktopEnv(
    provider_name="docker",
    os_type="Ubuntu",
    headless=True,
    client_password="password"
)

AWS Provider Example

# Initialize with AWS provider for large-scale evaluation
env = DesktopEnv(
    provider_name="aws",
    os_type="Ubuntu",
    client_password="osworld-public-evaluation"
)

With Accessibility Tree

# Environment with accessibility tree support
env = DesktopEnv(
    provider_name="vmware",
    require_a11y_tree=True
)

Provider-Specific Considerations

VMware Provider

- Requires VMware Workstation Pro or VMware Fusion
- Supports snapshots for quick environment reset
- Best performance on bare metal machines
- Supports macOS on Apple Silicon chips (via VMware Fusion)

Docker Provider

- Recommended for cloud/server deployments
- Supports parallel environments with automatic port allocation
- Requires KVM support for optimal performance
- May have limitations on GUI applications

AWS Provider

- Enables large-scale parallel evaluation
- Region-specific instance types and AMI requirements
- Network latency considerations for interactive tasks
- Cost optimization through spot instances and scheduled termination

VirtualBox Provider

- Free alternative to VMware
- Limited parallelism support
- May have performance limitations on some systems

Security and Credentials

- Default Credentials: Use secure passwords and change defaults in production
- Network Security: Configure firewalls appropriately for the communication ports
- Cloud Security: Use IAM roles and security groups for AWS deployments
- Proxy Configuration: Ensure proxy credentials are securely stored and rotated

Troubleshooting

Common Issues:

- Port Conflicts: Ensure communication ports (5000, 9222, 8006, 8080) are available
- Network Connectivity: Verify VM can reach external networks if required
- Performance Issues: Check sleep_after_execution timing and VM resource allocation
- Provider Errors: Consult provider-specific logs and documentation
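For the port conflict item, a quick availability check can be run on the machine that has to bind the ports. The sketch below tries to bind each default port on the loopback interface; is_port_free and port_report are hypothetical helpers, and a port that is free on the host says nothing about ports inside the guest.

```python
import socket
from typing import Dict, Iterable

# Default communication ports listed above.
DEFAULT_PORTS = (5000, 9222, 8006, 8080)


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


def port_report(ports: Iterable[int] = DEFAULT_PORTS) -> Dict[int, bool]:
    """Map each port to whether it is currently free on this machine."""
    return {port: is_port_free(port) for port in ports}
```

The result is only a snapshot: another process may claim a port between the check and the moment the environment starts.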
