Run OSWorld

Agent Baselines

Attention

Important Configuration Requirements:

  • Google Account Tasks: Some tasks require Google account access and OAuth2.0 configuration. Please refer to google_account_guideline for detailed setup instructions.

  • Proxy Configuration: Some tasks may require proxy settings to function properly (whether a proxy is needed depends on how aggressively the target websites block traffic from your network location). Please refer to your system’s proxy configuration documentation.

  • Impact of Missing Configuration: If these configurations are not properly set up, the corresponding tasks will fail to execute correctly, leading to lower evaluation scores.

If you wish to run the baseline agent used in our paper, you can execute the following commands; the example below uses the GPT-4o pure-screenshot setting:

# Set the OPENAI_API_KEY environment variable with your API key

export OPENAI_API_KEY='changeme'

# Optionally, set OPENAI_BASE_URL to use a custom OpenAI-compatible API endpoint

export OPENAI_BASE_URL='http://your-custom-endpoint.com/v1'  # Optional: defaults to https://api.openai.com
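The run scripts pick these variables up from the environment. A minimal sketch of how an OpenAI-compatible client might resolve them (the helper name and the exact fallback endpoint here are assumptions, not the scripts' actual code):

```python
import os

def resolve_openai_config():
    """Resolve API credentials from the environment, as run.py is expected to do."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running run.py")
    # Fall back to the public endpoint when no custom base URL is given
    # (assumed default; check your client library's documentation).
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return api_key, base_url
```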

# Single-threaded execution (deprecated; using the VMware provider as an example)

python run.py \
    --provider_name vmware \
    --path_to_vm Ubuntu/Ubuntu.vmx \
    --headless \
    --observation_type screenshot \
    --model gpt-4o \
    --sleep_after_execution 3 \
    --max_steps 15 \
    --result_dir ./results \
    --client_password password
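Two of these flags shape the agent's control loop: --max_steps caps how many actions the agent may take per task, and --sleep_after_execution pauses after each action so the GUI can settle before the next screenshot. An illustrative sketch of that loop (the callables here are hypothetical stand-ins, not OSWorld's actual internals):

```python
import time

def run_episode(agent_step, env_execute, max_steps=15, sleep_after_execution=3.0):
    """Run at most max_steps actions, pausing after each so the desktop
    can settle before the next observation is captured."""
    for step in range(max_steps):
        action = agent_step()       # decide the next action from the latest observation
        done = env_execute(action)  # execute it inside the VM
        time.sleep(sleep_after_execution)
        if done:
            return step + 1         # number of steps actually used
    return max_steps                # budget exhausted without finishing
```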

# Parallel execution (example switching the provider to Docker)

python run_multienv.py \
    --provider_name docker \
    --headless \
    --observation_type screenshot \
    --model gpt-4o \
    --sleep_after_execution 3 \
    --max_steps 15 \
    --num_envs 10 \
    --client_password password
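With --num_envs 10, the benchmark tasks are distributed across ten parallel environments. A round-robin partition like the sketch below conveys the idea (illustrative only; run_multienv.py's actual scheduling may differ):

```python
def partition_tasks(task_ids, num_envs):
    """Round-robin split of the task list across parallel environments."""
    buckets = [[] for _ in range(num_envs)]
    for i, task in enumerate(task_ids):
        buckets[i % num_envs].append(task)
    return buckets
```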

The results, including screenshots, actions, and video recordings of the agent completing each task, will be saved in ./results (or whichever result_dir you specified). You can then run the following command to obtain the aggregate result:

python show_result.py

Result Calculation and Google Drive Tasks

OSWorld contains a total of 369 tasks, including 8 Google Drive tasks that require special configuration. Due to Google’s strict security policies and IP-based restrictions, these Google Drive tasks may fail to initialize properly even when following the configuration guidelines in google_account_guideline.

Common Issues with Google Drive Tasks:

  • IP address changes triggering additional verification

  • OAuth2.0 authentication failures in virtual environments

  • Google’s automated bot detection systems

  • Account security restrictions for new devices/locations

Recommended Solutions:

You have two acceptable approaches when encountering Google Drive task setup issues:

  1. Manual Configuration (Recommended): Manually configure the 8 Google Drive tasks following the troubleshooting steps in google_account_guideline. This allows you to evaluate all 369 tasks.

  2. Exclude Google Drive Tasks (Acceptable): Skip the 8 Google Drive tasks and evaluate the remaining 361 tasks. This approach is officially supported and acceptable for benchmark comparisons.
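If you take the second approach, the exclusion can be as simple as filtering the task list before evaluation. The sketch below assumes each task record carries a domain-like field identifying Google Drive tasks; the field name and value are hypothetical, so adapt them to how your task files are actually organized:

```python
def filter_tasks(tasks, exclude_google_drive=False):
    """Optionally drop Google Drive tasks when they cannot be configured.
    The 'domain' key and 'google_drive' value are illustrative assumptions."""
    if not exclude_google_drive:
        return tasks
    return [t for t in tasks if t.get("domain") != "google_drive"]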

Attention

Important: Both approaches (369 tasks or 361 tasks) are valid for evaluation purposes. When reporting results, please clearly specify which task set was used to ensure fair comparisons with other submissions.

The show_result.py script automatically calculates success rates based on the tasks that were actually attempted, so excluding the Google Drive tasks does not affect the calculation methodology.
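Concretely, "based on the tasks that were actually attempted" means skipped tasks are excluded from the denominator rather than counted as failures. An illustrative sketch of that calculation (not the script's actual code; here None marks a task that was never attempted):

```python
def success_rate(scores):
    """Mean score over attempted tasks only; None entries (skipped tasks)
    are excluded from both numerator and denominator."""
    attempted = [s for s in scores if s is not None]
    if not attempted:
        return 0.0
    return sum(attempted) / len(attempted)
```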

Evaluation

Local Evaluation

Please start by reading through the agent interface and the environment interface. Implement the agent interface correctly and import your customized agent in run.py or run_multienv.py. Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
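As a starting point, a custom agent can be sketched as a class with a predict() method that maps the task instruction and the latest observation to a list of actions. The exact signature below is an assumption modeled on the baseline agent; verify it against the agent interface documentation before relying on it:

```python
class NoOpAgent:
    """Minimal placeholder agent. Assumed interface: predict(instruction, obs)
    returns (model_response, actions); reset() clears per-task state."""

    def __init__(self, action_space="pyautogui"):
        self.action_space = action_space

    def predict(self, instruction, obs):
        # A real agent would send obs["screenshot"] and the instruction to a
        # model; this placeholder just waits without acting.
        response = "placeholder response"
        actions = ["WAIT"]
        return response, actions

    def reset(self):
        # Clear any per-task state (conversation history, step counters, ...).
        pass
```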

Public Evaluation

If you want your results verified and displayed on the verified leaderboard, schedule a meeting with us (current maintainers: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) so that we can run your agent code on our side and report the results. You need to upload your agent implementation under the OSWorld framework and allow us to disclose it (you may choose not to expose your model API to the public), along with a report that lets the public understand what happens behind the scenes. Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us. Please carefully follow the Public Evaluation Platform User Guide to get results.

See Also

For additional setup and configuration guidance: