Public Evaluation Platform User Guide

We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks. The system follows a Host-Client architecture:

Host Instance: The central controller that stores code, configurations, and manages task execution.
Client Instances: Worker nodes automatically launched to perform tasks in parallel.

The architecture consists of a host machine (where you git clone and set up the OSWorld host environment) that controls multiple virtual machines for testing and potential training purposes. Each virtual machine serves as an OSWorld client environment using pre-configured AMI images and runs osworld.service to execute actions and commands from the host machine. To prevent security breaches, proper security groups and subnets must be configured for both the host and virtual machines.

1. Platform Deployment & Connection

Below, we assume you have no prior AWS configuration experience. You may freely skip or replace any graphical operations with API calls if you know how to do so.

1.1 Launch the Host Instance

Please create an instance in the AWS EC2 graphical interface to build the Host Machine.

Our recommended instance types are as follows. To run fewer than 5 VM environments in parallel (--num_envs < 5), t3.medium is sufficient. To run fewer than 15 VM environments in parallel (--num_envs < 15), t3.large is sufficient. To run more than 15 VM environments in parallel, choose a machine with more vCPUs and memory, such as c4.8xlarge.
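
The sizing guidance above can be captured in a small helper. This is only a sketch: the function name `recommend_host_instance_type` is hypothetical (it is not part of OSWorld), and treating exactly 15 environments like the "more than 15" case is an assumption, since the guidance does not cover that boundary.

```python
def recommend_host_instance_type(num_envs: int) -> str:
    """Map a planned --num_envs value to the host instance type suggested above."""
    if num_envs < 1:
        raise ValueError("num_envs must be a positive integer")
    if num_envs < 5:
        return "t3.medium"
    if num_envs < 15:
        return "t3.large"
    # Assumption: exactly 15 is treated like "more than 15".
    return "c4.8xlarge"
```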

For the AMI, we recommend using Ubuntu Server 24.04 LTS (HVM), SSD Volume Type, though other options will also work since the host machine doesn’t have strict requirements.

For storage space, please consider the number of experiments you plan to run. We recommend at least 50 GB.

For security group configuration, please configure according to your specific requirements. We provide a monitor service that runs on port 8080 by default. You need to open this port in the host’s security group to use this functionality.

Select the default VPC. We will return to this setting later so that the virtual machines are configured with the same VPC.

1.2 Connect to the Host Instance

Step 1: Prepare Your SSH Key

When launching the instance, choose “Create new key pair” and download the .pem file (e.g. osworld-host-key.pem). Save it locally.

Set appropriate permissions:
```
chmod 400 <your_key_file_path>
```
Find your instance’s public IP and DNS:
- Go to the EC2 Instances page on the AWS Console.
- Locate your Host instance by its ID.
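
If you prefer the API over the console, the same address information is available in the response of EC2's DescribeInstances call. The sketch below only parses such a response: `extract_public_endpoints` is a hypothetical helper name, and the sample dictionary is made-up illustration data. With boto3 installed and credentials configured, the response would come from `boto3.client("ec2").describe_instances(InstanceIds=[...])`.

```python
from typing import Optional


def extract_public_endpoints(response: dict, instance_id: str) -> Optional[dict]:
    """Return the public IP and DNS name of one instance from a DescribeInstances response."""
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("InstanceId") == instance_id:
                return {
                    # Either value may be missing or empty if the instance has no public address.
                    "public_ip": instance.get("PublicIpAddress"),
                    "public_dns": instance.get("PublicDnsName"),
                }
    return None


# Made-up sample data in the shape of a DescribeInstances response.
sample = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "PublicIpAddress": "203.0.113.10",
                    "PublicDnsName": "ec2-203-0-113-10.compute-1.amazonaws.com",
                }
            ]
        }
    ]
}
print(extract_public_endpoints(sample, "i-0123456789abcdef0"))
```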

Step 2: Connect via SSH or VSCode

SSH:

ssh -i <your_key_path> ubuntu@<your_public_dns>

VSCode/Cursor Remote SSH configuration:

Host host_example
    HostName <your_public_dns>
    User ubuntu
    IdentityFile <your_key_path>

Step 3: Set up the host machine

After you connect to the host machine, clone the latest OSWorld repository and set up the environment. Please ensure that the Python version is >= 3.10.

# Clone the OSWorld repository
git clone https://github.com/xlang-ai/OSWorld

# Change directory into the cloned repository
cd OSWorld

# Optional: Create a Conda environment for OSWorld
# conda create -n osworld python=3.10
# conda activate osworld

# Install required dependencies
pip install -r requirements.txt
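
Since the setup requires Python >= 3.10, it can help to confirm the interpreter version before installing anything. A minimal sketch follows; `python_is_supported` is a hypothetical helper name, not an OSWorld function.

```python
import sys


def python_is_supported(version_info=sys.version_info, minimum=(3, 10)) -> bool:
    """Return True when the interpreter meets the minimum version stated above."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is older than 3.10")
```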

You may run into environment issues when installing the requirements. These are usually resolved by installing and configuring the missing system dependencies with apt install, and AI tools like Claude Code can help fix them quickly.

With that, the host machine setup is almost done!

1.3 Set up the virtual machine

We need to programmatically scale virtual machines. Therefore, we will use the graphical interface to configure and obtain the necessary environment variables, then set these environment variables on the host machine so that the OSWorld code can read them to automatically scale and run experiments.

For the client machines, we have already prepared pre-configured images in different regions, listed in https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/aws/manager.py:

IMAGE_ID_MAP = {
    "us-east-1": {
        (1920, 1080): "ami-0d23263edb96951d8"
    },
    "ap-east-1": {
        (1920, 1080): "ami-0c092a5b8be4116f5"
    }
}
# Tell us if you need more regions; we can migrate the images from one region to another.

Therefore, you don’t need to configure the virtual machine environments and related variables. If you need to add additional functionality, you can configure it based on these images. If you want to reconfigure from scratch, please refer to the files and instructions under https://github.com/xlang-ai/OSWorld/tree/main/desktop_env/server.
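
To illustrate how such a map can be consulted, here is a small lookup sketch. The map contents are copied from the snippet above; `lookup_ami` is a hypothetical helper written for this guide and is not claimed to match how manager.py resolves images internally.

```python
IMAGE_ID_MAP = {
    "us-east-1": {(1920, 1080): "ami-0d23263edb96951d8"},
    "ap-east-1": {(1920, 1080): "ami-0c092a5b8be4116f5"},
}


def lookup_ami(region: str, screen_size=(1920, 1080)) -> str:
    """Return the pre-configured AMI ID for a region and screen size, or raise a clear error."""
    try:
        return IMAGE_ID_MAP[region][tuple(screen_size)]
    except KeyError:
        supported = sorted(IMAGE_ID_MAP)
        raise ValueError(
            f"No pre-configured AMI for region={region!r}, screen_size={screen_size}; "
            f"regions with images: {supported}"
        ) from None
```

Failing early with the list of supported regions makes it obvious when AWS_REGION points at a region that has no image yet.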

Step 1: Security Group for OSWorld Virtual Machines

OSWorld requires certain ports to be open, such as port 5000 for backend connections to OSWorld services, port 5910 for VNC visualization, port 9222 for Chrome control, etc. The AWS_SECURITY_GROUP_ID variable represents the security group configuration for virtual machines serving as OSWorld environments. Please complete the configuration and set this environment variable to the ID of the configured security group.

⚠️ Important: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:

Inbound Rules (8 rules required)

Type	Protocol	Port Range	Source	Description
SSH	TCP	22	0.0.0.0/0	SSH access
HTTP	TCP	80	172.31.0.0/16	HTTP traffic
Custom TCP	TCP	5000	172.31.0.0/16	OSWorld backend service
Custom TCP	TCP	5910	0.0.0.0/0	NoVNC visualization port
Custom TCP	TCP	8006	172.31.0.0/16	VNC service port
Custom TCP	TCP	8080	172.31.0.0/16	VLC service port
Custom TCP	TCP	8081	172.31.0.0/16	Additional service port
Custom TCP	TCP	9222	172.31.0.0/16	Chrome control port

Once finished, record the ID of this security group, as you will need to set it as the environment variable AWS_SECURITY_GROUP_ID on the host machine before starting the client code.

Outbound Rules (1 rule required)

Type	Protocol	Port Range	Destination	Description
All traffic	All	All	0.0.0.0/0	Allow all outbound traffic
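
If you would rather create the inbound rules through the API than click through the console, the table above can be expressed as data. The sketch below builds the `IpPermissions` structure accepted by the EC2 AuthorizeSecurityGroupIngress call; the names `INBOUND_RULES` and `build_ip_permissions` are hypothetical, and the ports, sources, and descriptions are copied from the inbound rules table.

```python
VPC_CIDR = "172.31.0.0/16"
ANYWHERE = "0.0.0.0/0"

# (port, source CIDR, description), copied from the inbound rules table above.
INBOUND_RULES = [
    (22, ANYWHERE, "SSH access"),
    (80, VPC_CIDR, "HTTP traffic"),
    (5000, VPC_CIDR, "OSWorld backend service"),
    (5910, ANYWHERE, "NoVNC visualization port"),
    (8006, VPC_CIDR, "VNC service port"),
    (8080, VPC_CIDR, "VLC service port"),
    (8081, VPC_CIDR, "Additional service port"),
    (9222, VPC_CIDR, "Chrome control port"),
]


def build_ip_permissions(rules=INBOUND_RULES) -> list:
    """Convert the rule tuples into the IpPermissions structure used by the EC2 API."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": description}],
        }
        for port, cidr, description in rules
    ]
```

With boto3, the result could be passed as `ec2.authorize_security_group_ingress(GroupId="sg-xxxx", IpPermissions=build_ip_permissions())`, where `ec2` is a `boto3.client("ec2")` and `sg-xxxx` is a placeholder for your own security group ID.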

Step 2: Record VPC Configuration for Client Machines from Host Machine

To isolate the entire evaluation stack, we run both the host machine and all client virtual machines inside the same VPC.

The setup is straightforward:

Open the host instance in the EC2 console and note the VPC ID and Subnet ID shown in its network settings.
Record the Subnet ID as you will need to set it as the environment variable AWS_SUBNET_ID on the host machine before starting the client code.
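
Several inbound rules in Step 1 only accept traffic from 172.31.0.0/16, so the host's private IP needs to fall inside that range for the host to reach the clients on those ports. The following is a minimal sketch of that check using Python's standard `ipaddress` module; `host_can_reach_clients` is a hypothetical helper and the example addresses are made up.

```python
import ipaddress


def host_can_reach_clients(host_private_ip: str, allowed_cidr: str = "172.31.0.0/16") -> bool:
    """Return True when the host's private IP is inside the CIDR allowed by the client rules."""
    return ipaddress.ip_address(host_private_ip) in ipaddress.ip_network(allowed_cidr)


print(host_can_reach_clients("172.31.45.12"))  # made-up address inside the range
print(host_can_reach_clients("10.0.1.5"))      # made-up address outside the range
```

If the check fails for your host, either place the host in a subnet within that range or adjust the Source column of the inbound rules to match your own VPC CIDR.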

1.4 Get AWS Access Key & Secret Access Key

Click on Security Credentials from the drop-down menu under your account in the top-right corner.

In the Access keys section, click “Create access key” to generate your own key.

If this method doesn’t work, or if you want to follow better security practices, create the access key through IAM instead:

Navigate to IAM in the AWS Console
Click on Users in the left sidebar
Select your username or create a new IAM user
Go to the Security credentials tab
Click Create access key
Choose the appropriate use case (e.g., “Command Line Interface (CLI)”)
Download or copy the Access Key ID and Secret Access Key

Note: For production environments, it’s recommended to use IAM roles instead of access keys when possible, or create dedicated IAM users with minimal required permissions rather than using root account credentials.

As with the security group and subnet IDs, you will later set these keys as environment variables on the host machine.

2. Environment Setup

Great! Now, back on the host machine, we can start running experiments. All of the following operations are performed on the host machine, under the OSWorld directory.

2.1 Google Drive Integration (Optional)

Task Overview: OSWorld includes 8 Google Drive tasks out of a total of 369 tasks. Due to Google’s increasingly strict security policies, these tasks often encounter initialization and setup issues.

Common Setup Problems:

IP address changes triggering verification requests
OAuth2.0 authentication failures in virtualized environments
Google’s automated detection of unusual access patterns
Account lockouts due to security policy violations

Configuration Instructions:

Follow the instructions in Google Account Guideline, specifically the section “Generating credentials.json for Public Eval”. This configuration is necessary if you want to evaluate all 369 tasks.

Alternative Approaches:

If you encounter persistent setup issues with Google Drive tasks, you have two acceptable options:

Complete Setup (369 tasks): Continue troubleshooting following the detailed guide in Google Account Guideline until all Google Drive tasks work properly.
Skip Google Drive Tasks (361 tasks): Exclude the 8 Google Drive tasks and evaluate the remaining 361 tasks. This approach is officially supported and acceptable for benchmark evaluation.

Note

For Evaluation Submissions: When reporting your results, please clearly specify whether you evaluated all 369 tasks or excluded the 8 Google Drive tasks (361 tasks). Both approaches are valid, but this information is essential for fair comparison with other submissions.

You can skip this step during the debugging stage, as the Google Drive tasks represent only a small portion of the total benchmark and their setup complexity often outweighs their contribution during initial development phases.
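
If you choose to skip the Google Drive tasks, one way to do so is to filter the test set metadata before running. The sketch below assumes the metadata file is a JSON object mapping each domain name to a list of task IDs; that structure, the helper names `exclude_tasks` and `count_tasks`, and the toy IDs are all assumptions for illustration. The real IDs of the 8 Google Drive tasks are not listed in this guide and must be looked up in the repository.

```python
def exclude_tasks(test_meta: dict, excluded_ids: set) -> dict:
    """Return a copy of the metadata with the excluded task IDs removed from every domain."""
    return {
        domain: [task_id for task_id in task_ids if task_id not in excluded_ids]
        for domain, task_ids in test_meta.items()
    }


def count_tasks(test_meta: dict) -> int:
    """Count the task IDs across all domains."""
    return sum(len(task_ids) for task_ids in test_meta.values())


# Toy metadata with placeholder IDs, not real OSWorld task IDs.
toy_meta = {"chrome": ["task-a", "task-b"], "multi_apps": ["task-c", "gdrive-1", "gdrive-2"]}
filtered = exclude_tasks(toy_meta, {"gdrive-1", "gdrive-2"})
print(count_tasks(toy_meta), count_tasks(filtered))
```

The filtered dictionary could then be written to a new JSON file and supplied through --test_all_meta_path. Checking that the count drops from 369 to 361 on the real file is a quick sanity test.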

2.2 Proxy Setup

Register at DataImpulse.
Purchase a US residential IP package (approximately $1 per 1GB).

Configure your credentials in OSWorld/evaluation_examples/settings/proxy/dataimpulse.json:

[
    {
        "host": "gw.dataimpulse.com",
        "port": 823,
        "username": "your_username",
        "password": "your_password",
        "protocol": "http",
        "provider": "dataimpulse",
        "type": "residential",
        "country": "US",
        "note": "Dataimpulse Residential Proxy"
    }
]
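
Before launching a long run, it can be worth checking that each entry in this file is complete. The sketch below validates an entry and assembles it into the conventional `protocol://user:password@host:port` proxy URL form. How OSWorld itself consumes the file is not shown here, so `build_proxy_url` and `REQUIRED_KEYS` are hypothetical names used only for this check.

```python
import json
from urllib.parse import quote

REQUIRED_KEYS = ("host", "port", "username", "password", "protocol")


def build_proxy_url(entry: dict) -> str:
    """Build a protocol://user:password@host:port URL from one proxy entry."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"proxy entry is missing keys: {missing}")
    # Percent-encode credentials so characters such as '@' or ':' do not break the URL.
    user = quote(str(entry["username"]), safe="")
    password = quote(str(entry["password"]), safe="")
    return f"{entry['protocol']}://{user}:{password}@{entry['host']}:{entry['port']}"


# Inline copy of the example configuration, with placeholder credentials.
config_text = """
[
    {
        "host": "gw.dataimpulse.com",
        "port": 823,
        "username": "your_username",
        "password": "your_password",
        "protocol": "http"
    }
]
"""
for entry in json.loads(config_text):
    print(build_proxy_url(entry))
```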

We have set proxy to True in the config JSON files for those proxy-sensitive tasks. OSWorld will automatically wrap these tasks with a proxy when DesktopEnv’s enable_proxy=True, while other tasks will not be affected. We recommend using a proxy. If you don’t need it at all, please set enable_proxy=False in the experiment’s .py file:

env = DesktopEnv(
    ...
    enable_proxy=False,
    ...
)

(We do not explain the DesktopEnv interface in detail here; please read through the code to understand it.)

Note that disabling the proxy will cause some tasks under the Chrome domain to fail.

2.3 Set Environment Variables

# export OPENAI_API_KEY_CUA="your_openai_api_key" # if you use openai API
# export ANTHROPIC_API_KEY="your_anthropic_api_key" # if you use anthropic API
# export DASHSCOPE_API_KEY="your_dashscope_api_key" # if you use dashscope API from alibaba qwen
# export DOUBAO_API_KEY="your_doubao_api_key" DOUBAO_API_URL="your_doubao_api_url" # if you use doubao seed API from bytedance ui_tars
export AWS_ACCESS_KEY_ID="your_access_key" # the access key created earlier
export AWS_SECRET_ACCESS_KEY="your_secret_access_key" # the secret access key created earlier
export AWS_REGION="your_aws_region" # e.g. us-east-1; if unset, it defaults to us-east-1
export AWS_SECURITY_GROUP_ID="sg-xxxx" # the security group we mentioned before
export AWS_SUBNET_ID="subnet-xxxx" # the subnet we mentioned before
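
A missing variable is easier to diagnose before any virtual machine is launched. The sketch below checks the AWS variables listed above and applies the us-east-1 default for AWS_REGION; `check_aws_env` and `REQUIRED_VARS` are hypothetical names, and the function is not part of OSWorld.

```python
import os

REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SECURITY_GROUP_ID",
    "AWS_SUBNET_ID",
)


def check_aws_env(env=None) -> dict:
    """Return the resolved settings, raising if a required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    # AWS_REGION is optional; this guide states that it defaults to us-east-1.
    settings["AWS_REGION"] = env.get("AWS_REGION") or "us-east-1"
    return settings
```

Calling `check_aws_env()` with no arguments reads the current process environment; passing a dictionary makes the check easy to test.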

3. Running Evaluations

Use the run_multienv_xxx.py scripts to launch tasks in parallel.

Example (with the OpenAI CUA agent):

# Set --client_password to the password configured on the client machine
# Run OpenAI CUA
python run_multienv_openaicua.py \
--headless \
--observation_type screenshot \
--model computer-use-preview \
--result_dir ./results_operator \
--test_all_meta_path evaluation_examples/test_all.json \
--region us-east-1 \
--max_steps 50 \
--num_envs 5 \
--client_password osworld-public-evaluation

# Run Anthropic (via AWS Bedrock); modify the agent if you want to use the Anthropic endpoint instead
python run_multienv_claude.py \
--headless \
--observation_type screenshot \
--action_space claude_computer_use \
--model claude-4-sonnet-20250514 \
--result_dir ./results_claude \
--test_all_meta_path evaluation_examples/test_all.json \
--max_steps 50 \
--num_envs 5 \
--provider_name aws \
--client_password osworld-public-evaluation

Key Parameters:

--num_envs: Number of parallel environments
--max_steps: Max steps per task
--result_dir: Output directory for results
--test_all_meta_path: Path to the test set metadata
--region: AWS region
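
When sweeping over several configurations, it can be convenient to assemble these command lines programmatically. The sketch below builds an argument list from keyword options using only flags that appear in the examples above; `build_eval_command` is a hypothetical helper, not an OSWorld API.

```python
import shlex


def build_eval_command(script: str, **options) -> list:
    """Build an argument list such as ['python', script, '--num_envs', '5', ...]."""
    command = ["python", script]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            # Boolean switch such as --headless takes no value.
            command.append(flag)
        elif value is not None and value is not False:
            command.extend([flag, str(value)])
    return command


command = build_eval_command(
    "run_multienv_openaicua.py",
    headless=True,
    observation_type="screenshot",
    model="computer-use-preview",
    result_dir="./results_operator",
    test_all_meta_path="evaluation_examples/test_all.json",
    region="us-east-1",
    max_steps=50,
    num_envs=5,
    client_password="osworld-public-evaluation",
)
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems.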

The run scripts are usually named run_multienv_xxx.py and live in the repository root, while the agent implementations are under the mm_agents folder. Add your own according to your needs.

4. Viewing Results

4.1 Web Monitoring Tool

cd monitor
pip install -r requirements.txt
python main.py

Then, open your Host’s public IP on port 8080 in a browser (e.g. http://<host-public-ip>:8080).

For more, see: MONITOR_README

4.2 VNC Remote Desktop Access

We pre-install VNC on every virtual machine so you can watch it while tasks are running. You can access it at http://<client-public-ip>:5910/vnc.html. The default password in our AMI is osworld-public-evaluation, which is set to prevent unauthorized access.
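
If the monitor page or the VNC page does not load, a plain TCP reachability check can help tell a blocked port (for example, a missing security group rule) from a problem in the service itself. This is a minimal sketch using Python's standard `socket` module; `is_port_open` is a hypothetical helper name.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `is_port_open("<host-public-ip>", 8080)` checks the monitor and `is_port_open("<client-public-ip>", 5910)` checks the VNC page, with the placeholders replaced by your own addresses.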

5. Contact the team to update leaderboard and fix errors (optional)

If you want your results to be displayed on the leaderboard, please send a message to the OSWorld leaderboard maintainers (tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) and open a pull request. We can update the results in the self-reported section.

If you want your results to be verified and displayed in the verified leaderboard section, we need you to schedule a meeting with us to run your agent code on our side to obtain results and have us report them. Alternatively, if you are from a trusted institution, you can share your monitor and trajectories with us.

If you discover new errors or the environment has undergone some changes, please contact us via GitHub issues or email.