Public Evaluation Platform User Guide

We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks. The system follows a Host-Client architecture:

  • Host Instance: The central controller that stores code, configurations, and manages task execution.

  • Client Instances: Worker nodes automatically launched to perform tasks in parallel.

The architecture consists of a host machine (where you git clone and set up the OSWorld host environment) that controls multiple virtual machines for testing and potential training purposes. Each virtual machine serves as an OSWorld client environment using pre-configured AMI images and runs osworld.service to execute actions and commands from the host machine. To prevent security breaches, proper security groups and subnets must be configured for both the host and virtual machines.

1. Platform Deployment & Connection

Below, we assume you have no prior AWS configuration experience. You may freely skip or replace any graphical operations with API calls if you know how to do so.

1.1 Launch the Host Instance

Please create an instance in the AWS EC2 graphical interface to build the Host Machine.

Our recommended instance type settings are as follows: if you want to run fewer than 5 VM environments in parallel (--num_envs < 5), t3.medium will be sufficient; if you want to run fewer than 15 VM environments in parallel (--num_envs < 15), t3.large will be sufficient; however, if you want to use more than 15 VM environments in parallel, it’s better to choose a machine with more vCPUs and memory, such as c4.8xlarge.
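The sizing guidance above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and treating exactly 15 environments like "more than 15" is an assumption, since the guidance does not cover that boundary.

```python
def recommend_host_instance_type(num_envs: int) -> str:
    """Map a --num_envs value to the host instance type suggested above."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    if num_envs < 5:
        return "t3.medium"
    if num_envs < 15:
        return "t3.large"
    # Assumption: exactly 15 is treated the same as "more than 15".
    return "c4.8xlarge"


if __name__ == "__main__":
    for n in (3, 10, 30):
        print(n, recommend_host_instance_type(n))
```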

For the AMI, we recommend using Ubuntu Server 24.04 LTS (HVM), SSD Volume Type, though other options will also work since the host machine doesn’t have strict requirements.

For storage space, please consider the number of experiments you plan to run. We recommend at least 50GB or more.

For security group configuration, please configure according to your specific requirements. We provide a monitor service that runs on port 8080 by default; open this port if you want to use it.

Set the VPC as default, and we will return to it later to configure the virtual machines with the same setting.

1.2 Connect to the Host Instance

Step 1: Prepare Your SSH Key

  • When launching the instance, choose “Create new key pair” and download the .pem file (e.g. osworld-host-key.pem). Save it locally.

  • Set appropriate permissions:

    chmod 400 <your_key_file_path>
    
  • Find your instance’s public IP and DNS:

    • Go to the EC2 Instances page on the AWS Console.

    • Locate your Host instance by its ID.

Step 2: Connect via SSH or VSCode

  • SSH:

    ssh -i <your_key_path> ubuntu@<your_public_dns>
    
  • VSCode/Cursor Remote SSH configuration:

    Host host_example
        HostName <your_public_dns>
        User ubuntu
        IdentityFile <your_key_path>
    

Step 3: Set up the host machine

After you connect to the host machine, clone the latest OSWorld and set up the environment. Please ensure that the Python version is >= 3.10.

# Clone the OSWorld repository
git clone https://github.com/xlang-ai/OSWorld

# Change directory into the cloned repository
cd OSWorld

# Optional: Create a Conda environment for OSWorld
# conda create -n osworld python=3.10
# conda activate osworld

# Install required dependencies
pip install -r requirements.txt

When installing the requirements, you may run into environment issues. These are usually solvable by installing and configuring the missing system dependencies with apt install, and AI tools such as Claude Code can help resolve them quickly.

The host machine setup is now almost complete.

1.3 Set up the virtual machine

We need to programmatically scale virtual machines. Therefore, we will use the graphical interface to configure and obtain the necessary environment variables, then set these environment variables on the host machine so that the OSWorld code can read them to automatically scale and run experiments.

For the client machines, we provide pre-configured images in different regions, listed in https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/aws/manager.py:

IMAGE_ID_MAP = {
    "us-east-1": {
        (1920, 1080): "ami-0d23263edb96951d8"
    },
    "ap-east-1": {
        (1920, 1080): "ami-0c092a5b8be4116f5"
    }
}
# Tell us if you need more regions; we can migrate the image from one region to another.

Therefore, you don’t need to configure the virtual machine environments and related variables. If you need to add additional functionality, you can configure it based on these images. If you want to reconfigure from scratch, please refer to the files and instructions under https://github.com/xlang-ai/OSWorld/tree/main/desktop_env/server.
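As an illustration of how such a map can be consulted, the sketch below looks up an AMI ID by region and screen size and fails with a readable error when no image exists. The map is copied from the snippet above and may be out of date relative to manager.py; the helper name lookup_ami is hypothetical and is not part of OSWorld.

```python
# Copied from the IMAGE_ID_MAP shown above; the real map lives in
# desktop_env/providers/aws/manager.py and may have changed since.
IMAGE_ID_MAP = {
    "us-east-1": {(1920, 1080): "ami-0d23263edb96951d8"},
    "ap-east-1": {(1920, 1080): "ami-0c092a5b8be4116f5"},
}


def lookup_ami(region: str, screen_size: tuple[int, int] = (1920, 1080)) -> str:
    """Return the AMI ID for a region and screen size, or raise a clear error."""
    try:
        return IMAGE_ID_MAP[region][screen_size]
    except KeyError:
        raise ValueError(
            f"No pre-built image for region={region!r}, screen_size={screen_size}"
        ) from None


if __name__ == "__main__":
    print(lookup_ami("us-east-1"))
```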

Step 1: Security Group for OSWorld Virtual Machines

OSWorld requires certain ports to be open, such as port 5000 for backend connections to OSWorld services, port 5910 for VNC visualization, port 9222 for Chrome control, etc. The AWS_SECURITY_GROUP_ID variable represents the security group configuration for virtual machines serving as OSWorld environments. Please complete the configuration and set this environment variable to the ID of the configured security group.

⚠️ Important: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:

Inbound Rules (8 rules required)

    Type        Protocol  Port Range  Source         Description
    SSH         TCP       22          0.0.0.0/0      SSH access
    HTTP        TCP       80          172.31.0.0/16  HTTP traffic
    Custom TCP  TCP       5000        172.31.0.0/16  OSWorld backend service
    Custom TCP  TCP       5910        0.0.0.0/0      NoVNC visualization port
    Custom TCP  TCP       8006        172.31.0.0/16  VNC service port
    Custom TCP  TCP       8080        172.31.0.0/16  VLC service port
    Custom TCP  TCP       8081        172.31.0.0/16  Additional service port
    Custom TCP  TCP       9222        172.31.0.0/16  Chrome control port

Once finished, record the AWS_SECURITY_GROUP_ID as you will need to set it as the environment variable AWS_SECURITY_GROUP_ID on the host machine before starting the client code.
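If you prefer to script the inbound rules listed above rather than click through the console, they can be kept as data. The sketch below makes no AWS calls. It only converts the rules into the list-of-dicts shape that boto3's authorize_security_group_ingress accepts as its IpPermissions argument. The names INBOUND_RULES and to_ip_permissions are hypothetical.

```python
# (port, source CIDR, description) for each inbound rule listed above.
INBOUND_RULES = [
    (22, "0.0.0.0/0", "SSH access"),
    (80, "172.31.0.0/16", "HTTP traffic"),
    (5000, "172.31.0.0/16", "OSWorld backend service"),
    (5910, "0.0.0.0/0", "NoVNC visualization port"),
    (8006, "172.31.0.0/16", "VNC service port"),
    (8080, "172.31.0.0/16", "VLC service port"),
    (8081, "172.31.0.0/16", "Additional service port"),
    (9222, "172.31.0.0/16", "Chrome control port"),
]


def to_ip_permissions(rules):
    """Convert rule tuples into an IpPermissions-style list (one entry per port)."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": description}],
        }
        for port, cidr, description in rules
    ]


if __name__ == "__main__":
    import json

    print(json.dumps(to_ip_permissions(INBOUND_RULES), indent=2))
```

Keeping the rules in one list makes it easy to confirm that all 8 are present before applying them.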

Outbound Rules (1 rule required)

    Type         Protocol  Port Range  Destination  Description
    All traffic  All       All         0.0.0.0/0    Allow all outbound traffic

Step 2: Record VPC Configuration for Client Machines from Host Machine

To isolate the entire evaluation stack, we run both the host machine and all client virtual machines inside a dedicated VPC.

The setup is straightforward:

  1. Launch the host instance from the EC2 console and note the VPC ID and Subnet ID shown in its network settings.

  2. Record the Subnet ID as you will need to set it as the environment variable AWS_SUBNET_ID on the host machine before starting the client code.

1.4 Get AWS Access Key ID & Secret Access Key

Click on Security Credentials from the drop-down menu under your account in the top-right corner.

In the Access keys section, click “Create access key” to generate your own key.

If this method doesn’t work, please go to IAM → Users → select your username → Security credentials tab → Create access key.

Alternatively, you can create access keys through IAM for better security practices:

  1. Navigate to IAM in the AWS Console

  2. Click on Users in the left sidebar

  3. Select your username or create a new IAM user

  4. Go to the Security credentials tab

  5. Click Create access key

  6. Choose the appropriate use case (e.g., “Command Line Interface (CLI)”)

  7. Download or copy the Access Key ID and Secret Access Key

Note: For production environments, it’s recommended to use IAM roles instead of access keys when possible, or create dedicated IAM users with minimal required permissions rather than using root account credentials.

As with the security group and subnet IDs, you will later set these keys as environment variables on the host machine.

2. Environment Setup

Great! Now back to the host machine, we can start running experiments! All the following operations are performed on the host machine environment, under the OSWorld path.

2.1 Google Drive Integration (Optional)

Follow the instructions in Google Account Guideline, specifically the section “Generating credentials.json for Public Eval”. This part is necessary if using public evaluation.

You can skip this step while debugging: it affects only 8 Google Drive tasks, and Google’s policies make the setup increasingly cumbersome.

2.2 Proxy Setup

  • Register at DataImpulse.

  • Purchase a US residential IP package (approximately $1 per 1GB).

  • Configure your credentials in OSWorld/evaluation_examples/settings/proxy/dataimpulse.json:

    [
        {
            "host": "gw.dataimpulse.com",
            "port": 823,
            "username": "your_username",
            "password": "your_password",
            "protocol": "http",
            "provider": "dataimpulse",
            "type": "residential",
            "country": "US",
            "note": "Dataimpulse Residential Proxy"
        }
    ]
    

We have set proxy to True in the config JSON files for those proxy-sensitive tasks. OSWorld will automatically wrap these tasks with a proxy when DesktopEnv’s enable_proxy=True, while other tasks will not be affected. We recommend using a proxy. If you don’t need it at all, please set enable_proxy=False in the experiment’s .py file:

env = DesktopEnv(
    ...
    enable_proxy=False,
    ...
)

(We do not explain the DesktopEnv interface in detail here; please read through the code to understand it.)

Note that disabling the proxy will cause some tasks under the Chrome domain to fail.
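Before launching a run, it can help to sanity-check the proxy file. The sketch below reads a list shaped like the dataimpulse.json example above and builds one proxy URL per entry. It is not how OSWorld itself consumes the file; the function name load_proxy_urls and the choice of required fields are assumptions made for illustration.

```python
import json
from urllib.parse import quote

REQUIRED_FIELDS = {"host", "port", "username", "password", "protocol"}


def load_proxy_urls(path: str) -> list[str]:
    """Read a proxy list like dataimpulse.json and return one URL per entry."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    urls = []
    for entry in entries:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"proxy entry is missing fields: {sorted(missing)}")
        # Percent-encode credentials so characters like '@' or ':' stay valid.
        user = quote(str(entry["username"]), safe="")
        password = quote(str(entry["password"]), safe="")
        urls.append(
            f"{entry['protocol']}://{user}:{password}@{entry['host']}:{entry['port']}"
        )
    return urls
```

The resulting URL can be passed to a tool such as curl to confirm that the credentials work before starting a long evaluation.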

2.3 Set Environment Variables

# export OPENAI_API_KEY_CUA="your_openai_api_key" # if you use openai API
# export ANTHROPIC_API_KEY="your_anthropic_api_key" # if you use anthropic API
# export DASHSCOPE_API_KEY="your_dashscope_api_key" # if you use dashscope API from alibaba qwen
# export DOUBAO_API_KEY="" DOUBAO_API_URL="" # if you use doubao seed API from bytedance ui_tars
export AWS_ACCESS_KEY_ID="your_access_key" # key we mentioned before
export AWS_SECRET_ACCESS_KEY="your_secret_access_key" # key we mentioned before
export AWS_REGION="your_aws_region" # e.g. us-east-1; if unset, it defaults to us-east-1
export AWS_SECURITY_GROUP_ID="sg-xxxx" # the security group we mentioned before
export AWS_SUBNET_ID="subnet-xxxx" # the subnet we mentioned before
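A missing variable typically surfaces only once instances fail to launch, so a quick pre-flight check can save time. The following is a minimal sketch, assuming the four AWS variables above are required and AWS_REGION falls back to us-east-1 as noted in the comment. The helper check_aws_env is hypothetical and not part of OSWorld.

```python
import os

REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SECURITY_GROUP_ID",
    "AWS_SUBNET_ID",
)


def check_aws_env(environ=os.environ) -> dict:
    """Return the AWS settings, raising if a required variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: environ[name] for name in REQUIRED_VARS}
    # AWS_REGION is optional; the guide says it defaults to us-east-1.
    settings["AWS_REGION"] = environ.get("AWS_REGION") or "us-east-1"
    return settings


if __name__ == "__main__":
    # Print only the non-secret values.
    cfg = check_aws_env()
    print(cfg["AWS_REGION"], cfg["AWS_SECURITY_GROUP_ID"], cfg["AWS_SUBNET_ID"])
```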

3. Running Evaluations

Use the run_multienv_xxx.py scripts to launch tasks in parallel.

Example (with the OpenAI CUA agent):

# --client_password must match the password set on the client machine
# Run OpenAI CUA
python run_multienv_openaicua.py \
--headless \
--observation_type screenshot \
--model computer-use-preview \
--result_dir ./results_operator \
--test_all_meta_path evaluation_examples/test_all.json \
--region us-east-1 \
--max_steps 50 \
--num_envs 5 \
--client_password osworld-public-evaluation

# Run Anthropic (via AWS Bedrock), please modify agent if you want Anthropic endpoint
python run_multienv_claude.py \
--headless \
--observation_type screenshot \
--action_space claude_computer_use \
--model claude-4-sonnet-20250514 \
--result_dir ./results_claude \
--test_all_meta_path evaluation_examples/test_all.json \
--max_steps 50 \
--num_envs 5 \
--provider_name aws \
--client_password osworld-public-evaluation

Key Parameters:

  • --num_envs: Number of parallel environments

  • --max_steps: Max steps per task

  • --result_dir: Output directory for results

  • --test_all_meta_path: Path to the test set metadata

  • --region: AWS region

The run scripts are usually named run_multienv_xxx.py and live in the repository root, while agent implementations live in the mm_agents folder. Add your own according to your needs.
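When sweeping over several settings, it can be convenient to assemble the command line programmatically. The sketch below uses only the flags shown in the OpenAI CUA example above. Other run scripts may accept a different set of flags (the Claude example, for instance, passes --action_space and --provider_name), so treat build_run_command as a hypothetical helper to adapt rather than a general launcher.

```python
import shlex
import sys


def build_run_command(
    script: str,
    model: str,
    result_dir: str,
    num_envs: int = 5,
    max_steps: int = 50,
    region: str = "us-east-1",
    client_password: str = "osworld-public-evaluation",
) -> list[str]:
    """Assemble an argv list mirroring the OpenAI CUA example above."""
    return [
        sys.executable, script,
        "--headless",
        "--observation_type", "screenshot",
        "--model", model,
        "--result_dir", result_dir,
        "--test_all_meta_path", "evaluation_examples/test_all.json",
        "--region", region,
        "--max_steps", str(max_steps),
        "--num_envs", str(num_envs),
        "--client_password", client_password,
    ]


if __name__ == "__main__":
    cmd = build_run_command(
        "run_multienv_openaicua.py", "computer-use-preview", "./results_operator"
    )
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```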

4. Viewing Results

4.1 Web Monitoring Tool

cd monitor
pip install -r requirements.txt
python main.py

Then, open your Host’s public IP on port 8080 in a browser (e.g. http://<host-public-ip>:8080).

For more, see: MONITOR_README

4.2 VNC Remote Desktop Access

VNC is pre-installed on every virtual machine so you can watch it while tasks run. Access it at http://<client-public-ip>:5910/vnc.html. The default password in our AMI is osworld-public-evaluation, set to deter unauthorized access.
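If the VNC page does not load, the usual cause is a security group rule that blocks the port. The sketch below builds the URL described above and checks whether a TCP connection to a port succeeds. The helper names vnc_url and is_port_open are hypothetical, and a successful connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket


def vnc_url(client_public_ip: str, port: int = 5910) -> str:
    """Build the noVNC URL for a client machine."""
    return f"http://{client_public_ip}:{port}/vnc.html"


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Checks a local port so the script runs anywhere; substitute a client's
    # public IP and port 5910 to test a real VM.
    print(vnc_url("127.0.0.1"), is_port_open("127.0.0.1", 5910, timeout=1.0))
```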

5. Contact the team to update leaderboard and fix errors (optional)

If you want your results to be displayed on the leaderboard, please send a message to the OSWorld leaderboard maintainers (tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) and open a pull request. We can update the results in the self-reported section.

If you want your results to be verified and displayed in the verified leaderboard section, we need you to schedule a meeting with us to run your agent code on our side to obtain results and have us report them. Alternatively, if you are from a trusted institution, you can share your monitor and trajectories with us.

If you discover new errors or the environment has undergone some changes, please contact us via GitHub issues or email.
