Public Evaluation Platform User Guide
==========================================

We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks. The system follows a Host-Client architecture:

- **Host Instance**: The central controller that stores code and configurations and manages task execution.
- **Client Instances**: Worker nodes that are launched automatically to perform tasks in parallel.

The host machine (where you ``git clone`` OSWorld and set up the host environment) controls multiple virtual machines used for testing and, potentially, training. Each virtual machine is an OSWorld client environment launched from a pre-configured AMI; it runs ``osworld.service`` to execute the actions and commands sent by the host machine. To prevent security breaches, proper security groups and subnets must be configured for both the host and the virtual machines.

1. Platform Deployment & Connection
-----------------------------------

Below, we assume you have no prior AWS configuration experience. If you know how, you may skip any graphical operation or replace it with API calls.

1.1 Launch the Host Instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create an instance in the AWS EC2 graphical interface to serve as the Host Machine.

For the instance type, we recommend:

- ``t3.medium`` if you run fewer than 5 VM environments in parallel (``--num_envs`` < 5);
- ``t3.large`` if you run fewer than 15 VM environments in parallel (``--num_envs`` < 15);
- a machine with more vCPUs and memory, such as ``c4.8xlarge``, if you run 15 or more VM environments in parallel.

For the AMI, we recommend Ubuntu Server 24.04 LTS (HVM), SSD Volume Type, though other options also work because the host machine has no strict requirements.

For storage space, please consider the number of experiments you plan to run.
We recommend at least 50 GB.

For the security group, configure it according to your specific requirements. We provide a monitor service that runs on port 8080 by default; you need to open this port to use it.

Set the VPC to the default one. We will return to it later to configure the virtual machines with the same setting.

1.2 Connect to the Host Instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Step 1: Prepare Your SSH Key
"""""""""""""""""""""""""""""

* When launching the instance, choose "Create new key pair" and download the ``.pem`` file (e.g. ``osworld-host-key.pem``). Save it locally.

  .. image:: /_static/assets/pubeval1.png
     :alt: pubeval1
     :width: 80%
     :align: center

* Set appropriate permissions on the key file:

  .. code-block:: bash

     chmod 400 <your_key_file_path>

* Find your instance's **public IP** and **DNS**:

  - Go to the EC2 **Instances** page on the AWS Console.
  - Locate your Host instance by its ID.

  .. image:: /_static/assets/pubeval2.png
     :alt: pubeval2
     :width: 80%
     :align: center

* On the instance detail page:

  - **Public IP/DNS**: used for browser/VNC access and for the SSH connection
  - **Instance metadata**: e.g. storage, which can be adjusted after launch

  .. image:: /_static/assets/pubeval3.png
     :alt: pubeval3
     :width: 80%
     :align: center

Step 2: Connect via SSH or VSCode
""""""""""""""""""""""""""""""""""

* SSH:

  .. code-block:: bash

     ssh -i <your_key_file_path> ubuntu@<your_public_dns>

* VSCode/Cursor Remote SSH configuration:

  .. code-block:: text

     Host host_example
         HostName <your_public_dns>
         User ubuntu
         IdentityFile <your_key_file_path>

Step 3: Set up the host machine
"""""""""""""""""""""""""""""""

After you connect to the host machine, clone the latest OSWorld and set up the environment. Please ensure that the Python version is >= 3.10.

.. code-block:: bash

   # Clone the OSWorld repository
   git clone https://github.com/xlang-ai/OSWorld

   # Change directory into the cloned repository
   cd OSWorld

   # Optional: Create a Conda environment for OSWorld
   # conda create -n osworld python=3.10
   # conda activate osworld

   # Install required dependencies
   pip install -r requirements.txt

When installing the requirements, you may run into general environment issues, but these are solvable. You will typically need ``apt install`` to install and configure some system dependencies. AI tools such as Claude Code can help fix these issues quickly.

The host machine setup is then almost complete.

1.3 Set up the virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Virtual machines need to be scaled programmatically. We therefore use the graphical interface only to configure and obtain the necessary values, then set them as environment variables on the host machine so that the OSWorld code can read them to scale automatically and run experiments.

For the client machines, we have already prepared pre-configured images in different regions. They are listed in https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/aws/manager.py:

.. code-block:: python

   IMAGE_ID_MAP = {
       "us-east-1": {
           (1920, 1080): "ami-0d23263edb96951d8"
       },
       "ap-east-1": {
           (1920, 1080): "ami-0c092a5b8be4116f5"
       }
   }  # Tell us if you need more; we can migrate the image to another region.

You therefore don't need to configure the virtual machine environments or their related variables. If you need additional functionality, you can build on top of these images. If you want to reconfigure from scratch, please refer to the files and instructions under https://github.com/xlang-ai/OSWorld/tree/main/desktop_env/server.

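
As a minimal sketch of how the region-to-AMI map above can be consumed, the snippet below copies ``IMAGE_ID_MAP`` from this guide and wraps it in a hypothetical ``resolve_ami`` helper. The helper is an illustration only and is not part of OSWorld; the map in ``manager.py`` remains authoritative and may have changed since this guide was written. The default region ``us-east-1`` follows the default mentioned in this guide.

```python
from typing import Dict, Tuple

# Copied from this guide for illustration; the authoritative map lives in
# desktop_env/providers/aws/manager.py and may differ from this copy.
IMAGE_ID_MAP: Dict[str, Dict[Tuple[int, int], str]] = {
    "us-east-1": {(1920, 1080): "ami-0d23263edb96951d8"},
    "ap-east-1": {(1920, 1080): "ami-0c092a5b8be4116f5"},
}

DEFAULT_REGION = "us-east-1"
DEFAULT_SCREEN_SIZE = (1920, 1080)


def resolve_ami(
    region: str = DEFAULT_REGION,
    screen_size: Tuple[int, int] = DEFAULT_SCREEN_SIZE,
) -> str:
    """Hypothetical helper: return the pre-built AMI ID for a region and screen size."""
    if region not in IMAGE_ID_MAP:
        raise ValueError(
            f"No pre-built OSWorld AMI for region {region!r}; "
            f"available regions: {sorted(IMAGE_ID_MAP)}"
        )
    images_by_size = IMAGE_ID_MAP[region]
    if screen_size not in images_by_size:
        raise ValueError(
            f"No pre-built OSWorld AMI for screen size {screen_size} in {region!r}; "
            f"available sizes: {sorted(images_by_size)}"
        )
    return images_by_size[screen_size]


if __name__ == "__main__":
    # Look up the image for the default region and screen size.
    print(resolve_ami())
```

Failing early with the list of available regions makes it clearer when ``AWS_REGION`` points at a region that has no pre-built image yet, which is the case where you would ask the maintainers to migrate one.
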

Step 1: Security Group for OSWorld Virtual Machines
""""""""""""""""""""""""""""""""""""""""""""""""""""

OSWorld requires certain ports to be open, such as port 5000 for backend connections to OSWorld services, port 5910 for VNC visualization, and port 9222 for Chrome control. The ``AWS_SECURITY_GROUP_ID`` variable identifies the security group used by the virtual machines that serve as OSWorld environments. Please create the security group as described below and set this environment variable to its ID.

**⚠️ Important**: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:

Inbound Rules (8 rules required)
'''''''''''''''''''''''''''''''''

+-------------+----------+------------+----------------+---------------------------+
| Type        | Protocol | Port Range | Source         | Description               |
+=============+==========+============+================+===========================+
| SSH         | TCP      | 22         | 0.0.0.0/0      | SSH access                |
+-------------+----------+------------+----------------+---------------------------+
| HTTP        | TCP      | 80         | 172.31.0.0/16  | HTTP traffic              |
+-------------+----------+------------+----------------+---------------------------+
| Custom TCP  | TCP      | 5000       | 172.31.0.0/16  | OSWorld backend service   |
+-------------+----------+------------+----------------+---------------------------+
| Custom TCP  | TCP      | 5910       | 0.0.0.0/0      | NoVNC visualization port  |
+-------------+----------+------------+----------------+---------------------------+
| Custom TCP  | TCP      | 8006       | 172.31.0.0/16  | VNC service port          |
+-------------+----------+------------+----------------+---------------------------+
| Custom TCP  | TCP      | 8080       | 172.31.0.0/16  | VLC service port          |
+-------------+----------+------------+----------------+---------------------------+
| Custom TCP  | TCP      | 8081       | 172.31.0.0/16  | Additional service port   |
+-------------+----------+------------+----------------+---------------------------+
| Custom TCP  | TCP      | 9222       | 172.31.0.0/16  | Chrome control port       |
+-------------+----------+------------+----------------+---------------------------+

The ``172.31.0.0/16`` source matches the CIDR block of the AWS default VPC. If your VPC uses a different range, adjust these rules accordingly.

Outbound Rules (1 rule required)
'''''''''''''''''''''''''''''''''

+-------------+----------+------------+-------------+---------------------------+
| Type        | Protocol | Port Range | Destination | Description               |
+=============+==========+============+=============+===========================+
| All traffic | All      | All        | 0.0.0.0/0   | Allow all outbound traffic|
+-------------+----------+------------+-------------+---------------------------+

Once finished, record the security group ID. You will need to set it as the environment variable ``AWS_SECURITY_GROUP_ID`` on the host machine before starting the client code.

Step 2: Record VPC Configuration for Client Machines from Host Machine
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

To isolate the entire evaluation stack, we run the host machine and all client virtual machines inside the same VPC. The setup is straightforward:

1. In the EC2 console, open the host instance launched in Section 1.1 and note the **VPC ID** and **Subnet ID** shown in its network settings.
2. Record the **Subnet ID**. You will need to set it as the environment variable ``AWS_SUBNET_ID`` on the host machine before starting the client code.

   .. image:: /_static/assets/pubeval_subnet.png
      :alt: pubeval_subnet
      :width: 80%
      :align: center

1.4 Get AWS Access Keys & Secret Access Key
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Click **Security Credentials** in the drop-down menu under your account name in the top-right corner.

.. image:: /_static/assets/pubeval4.png
   :alt: pubeval4
   :width: 25%
   :align: center

In the **Access keys** section, click **"Create access key"** to generate your own key.

.. image:: /_static/assets/pubeval5.png
   :alt: pubeval5
   :width: 100%
   :align: center

If this method doesn't work, or if you prefer the better security practice of a dedicated IAM user, create the access key through IAM instead:

1. Navigate to **IAM** in the AWS Console
2. Click on **Users** in the left sidebar
3. Select your username or create a new IAM user
4. Go to the **Security credentials** tab
5. Click **Create access key**
6. Choose the appropriate use case (e.g., "Command Line Interface (CLI)")
7. Download or copy the Access Key ID and Secret Access Key

**Note**: For production environments, it's recommended to use IAM roles instead of access keys when possible, or to create dedicated IAM users with minimal required permissions rather than using root account credentials.

As with the security group and subnet IDs, you will later set both keys as environment variables on the host machine.

2. Environment Setup
--------------------

Now, back on the **host machine**, we can start running experiments. All of the following operations are performed in the host machine environment, from the OSWorld directory.

2.1 Google Drive Integration (Optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Follow the instructions in :doc:`../installation/google_account_guideline`, specifically the section "Generating ``credentials.json`` for Public Eval".

This step is required for a full public evaluation. You can skip it while debugging: it affects only 8 Google Drive tasks, and Google's policies make the setup increasingly cumbersome.

.. image:: /_static/assets/pubeval_gdrive_auth.jpg
   :alt: pubeval_gdrive_auth
   :width: 80%
   :align: center

2.2 Proxy Setup
^^^^^^^^^^^^^^^

- Register at DataImpulse.
- Purchase a US residential IP package (approximately $1 per 1 GB).
- Configure your credentials in ``OSWorld/evaluation_examples/settings/proxy/dataimpulse.json``:

.. code-block:: json

   [
       {
           "host": "gw.dataimpulse.com",
           "port": 823,
           "username": "your_username",
           "password": "your_password",
           "protocol": "http",
           "provider": "dataimpulse",
           "type": "residential",
           "country": "US",
           "note": "Dataimpulse Residential Proxy"
       }
   ]

Proxy-sensitive tasks have ``proxy`` set to ``true`` in their config JSON files. When DesktopEnv is created with ``enable_proxy=True``, OSWorld automatically wraps these tasks with a proxy; other tasks are not affected. We recommend using a proxy. If you don't need it at all, set ``enable_proxy=False`` in the experiment's ``.py`` file:

.. code-block:: python

   env = DesktopEnv(
       ...
       enable_proxy=False,
       ...
   )

(We do not explain the DesktopEnv interface in detail here; please read through the code to understand it.)

Note that disabling the proxy will cause some tasks under the Chrome domain to fail.

2.3 Set Environment Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # export OPENAI_API_KEY_CUA="your_openai_api_key"    # if you use the OpenAI API
   # export ANTHROPIC_API_KEY="your_anthropic_api_key"  # if you use the Anthropic API
   # export DASHSCOPE_API_KEY="your_dashscope_api_key"  # if you use the DashScope API from Alibaba (Qwen)
   # export DOUBAO_API_KEY="your_doubao_api_key"        # if you use the Doubao Seed API from ByteDance (UI-TARS)
   # export DOUBAO_API_URL="your_doubao_api_url"
   export AWS_ACCESS_KEY_ID="your_access_key"             # access key ID from Section 1.4
   export AWS_SECRET_ACCESS_KEY="your_secret_access_key"  # secret access key from Section 1.4
   export AWS_REGION="your_aws_region"                    # e.g. us-east-1; if unset, it defaults to us-east-1
   export AWS_SECURITY_GROUP_ID="sg-xxxx"                 # the security group from Section 1.3, Step 1
   export AWS_SUBNET_ID="subnet-xxxx"                     # the subnet from Section 1.3, Step 2

3. Running Evaluations
----------------------

Use the ``run_multienv_xxx.py`` scripts to launch tasks in parallel. Examples (with the OpenAI CUA agent and the Claude agent):

.. code-block:: bash

   # Set --client_password to the password configured on the client machines

   # Run OpenAI CUA
   python run_multienv_openaicua.py \
       --headless \
       --observation_type screenshot \
       --model computer-use-preview \
       --result_dir ./results_operator \
       --test_all_meta_path evaluation_examples/test_all.json \
       --region us-east-1 \
       --max_steps 50 \
       --num_envs 5 \
       --client_password osworld-public-evaluation

   # Run Anthropic (via AWS Bedrock); modify the agent if you want to use the Anthropic endpoint
   python run_multienv_claude.py \
       --headless \
       --observation_type screenshot \
       --action_space claude_computer_use \
       --model claude-4-sonnet-20250514 \
       --result_dir ./results_claude \
       --test_all_meta_path evaluation_examples/test_all.json \
       --max_steps 50 \
       --num_envs 5 \
       --provider_name aws \
       --client_password osworld-public-evaluation

Key Parameters:

- ``--num_envs``: Number of parallel environments
- ``--max_steps``: Max steps per task
- ``--result_dir``: Output directory for results
- ``--test_all_meta_path``: Path to the test set metadata
- ``--region``: AWS region

The run scripts are usually named ``run_multienv_xxx.py`` and live in the repository root, while the agent implementations live in the ``mm_agents`` folder. Add your own according to your needs.

4. Viewing Results
------------------

4.1 Web Monitoring Tool
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   cd monitor
   pip install -r requirements.txt
   python main.py

Then open your Host's **public IP** on port ``8080`` in a browser (e.g. ``http://<host-public-ip>:8080``).

For more, see: `MONITOR_README <./monitor/README.md>`_

.. image:: /_static/assets/pubeval_monitor1.jpg
   :alt: pubeval_monitor
   :width: 80%
   :align: center

.. image:: /_static/assets/pubeval_monitor2.jpg
   :alt: pubeval_monitor
   :width: 80%
   :align: center

4.2 VNC Remote Desktop Access
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

VNC is pre-installed on every virtual machine, so you can watch a machine while its tasks are running.
You can access each machine over VNC at ``http://<client-public-ip>:5910/vnc.html``. To deter attacks, the default password in our AMI is ``osworld-public-evaluation``.

5. Contact the team to update leaderboard and fix errors (optional)
--------------------------------------------------------------------

If you want your results to be displayed on the leaderboard, please send a message to the OSWorld leaderboard maintainers (tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) and open a pull request. We can then add the results to the self-reported section.

If you want your results to be verified and displayed in the verified leaderboard section, you need to schedule a meeting with us so that we can run your agent code on our side, obtain the results, and report them ourselves. Alternatively, if you are from a trusted institution, you can share your monitor and trajectories with us.

If you discover new errors, or if the environment has changed, please contact us via GitHub issues or email.

See Also
--------

* :doc:`run_evaluation` - Basic evaluation guide for local environment setup
* :doc:`../installation/google_account_guideline` - Google account setup and OAuth2.0 configuration
* :doc:`../installation/index` - Complete installation instructions
* :doc:`../community/faq` - Frequently asked questions and troubleshooting