FAQ
What is the username and password for the virtual machines?
For the vmware, virtualbox, and docker providers, the Ubuntu account uses the username "user" and the password "password".
For cloud service providers such as aws, the default password is osworld-public-evaluation, chosen to prevent attacks that exploit weak passwords.
If you make further modifications, remember to set the client_password variable and pass it to DesktopEnv and Agent (if supported) when running experiments.
Some features, such as setting up a proxy, require the environment to have the client VM password in order to obtain sudo privileges. For some OSWorld tasks, the agent also needs the password to obtain sudo privileges and complete the task.
Windows: TBD
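The Ubuntu defaults above can be collected in a small lookup helper. This is only a sketch: the dictionary and the function name resolve_client_password are hypothetical and not part of OSWorld, and the commented-out DesktopEnv call assumes the constructor accepts provider_name and client_password keyword arguments, which you should verify against your OSWorld version.

```python
from typing import Optional

# Default client VM passwords per provider, as listed in this FAQ (Ubuntu only).
DEFAULT_CLIENT_PASSWORDS = {
    "vmware": "password",
    "virtualbox": "password",
    "docker": "password",
    "aws": "osworld-public-evaluation",
}


def resolve_client_password(provider_name: str, override: Optional[str] = None) -> str:
    """Return the sudo password to use for the client VM.

    An explicit override wins; otherwise fall back to the documented default.
    """
    if override is not None:
        return override
    if provider_name not in DEFAULT_CLIENT_PASSWORDS:
        raise ValueError(
            f"No default password known for provider {provider_name!r}; pass one explicitly."
        )
    return DEFAULT_CLIENT_PASSWORDS[provider_name]


client_password = resolve_client_password("aws")

# Assumed usage (not verified here); pass the same value to your agent if it supports it:
# env = DesktopEnv(provider_name="aws", client_password=client_password)
```

If you changed the password inside your own VM image, pass it as the override so that the environment and the agent both receive the same value.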
How do I set up the account and credentials for Google and Google Drive?
See Account Guideline.
How can I configure a proxy for the VM (if I’m behind the GFW, or I don’t want some of my tasks to be identified as a bot and get lower scores)?
If you want to set it up yourself, please refer to the Proxy Guideline. We also provide a pre-configured solution based on dataimpulse; see the proxy-setup section in PUBLIC_EVALUATION_GUIDELINE.
How do I submit an evaluation?
Local Evaluation
Please start by reading through the agent interface and the environment interface.
Correctly implement the agent interface and import your customized version in the run.py or run_multienv.py file.
Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
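As a rough illustration of what a custom agent might look like, here is a minimal sketch. It assumes the agent interface consists of a reset method and a predict(instruction, obs) method that returns a response string together with a list of actions, and that "WAIT" and "DONE" are accepted special actions. These are assumptions, not a statement of the OSWorld API; the class name WaitThenDoneAgent is hypothetical, so check the agent interface documentation for the exact signatures before adapting it.

```python
from typing import Any, Dict, List, Tuple


class WaitThenDoneAgent:
    """Toy agent: waits for a few steps, then declares the task done.

    The method names and return shape are assumptions about the agent
    interface; replace the body of predict with a call to your model.
    """

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self) -> None:
        # Clear per-episode state before a new task starts.
        self.steps_taken = 0

    def predict(self, instruction: str, obs: Dict[str, Any]) -> Tuple[str, List[str]]:
        # A real agent would inspect obs (e.g. a screenshot) and the instruction here.
        self.steps_taken += 1
        if self.steps_taken >= self.max_steps:
            return "Step budget reached; stopping.", ["DONE"]
        return f"Step {self.steps_taken}: waiting.", ["WAIT"]
```

You would then import a class like this in run.py or run_multienv.py in place of the default agent.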
Public Evaluation
If you want your results to be verified and displayed on the verified leaderboard, you need to schedule a meeting with us (current maintainers: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) so that we can run your agent code on our side and report the results. You need to upload your agent implementation under the OSWorld framework and allow us to disclose it (you may choose not to expose your model API to the public), along with a report that lets the public understand what is happening behind the scenes. Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us. Please carefully follow the Public Evaluation Guideline to get results.
What about the Google Drive tasks? Can I skip them?
Task Overview: OSWorld contains 369 total tasks, including 8 Google Drive tasks that require Google account setup and OAuth2.0 configuration.
Common Issues: Due to Google’s strict security policies and IP-based restrictions, these 8 Google Drive tasks may fail to initialize properly even with correct configuration, especially when:
Running from different IP addresses or cloud environments
Using virtual machines or automated evaluation systems
Encountering Google’s bot detection mechanisms
Facing account security restrictions
Acceptable Solutions:
Complete Setup (369 tasks): Follow the detailed setup guide in Google Account Guideline and troubleshoot until all Google Drive tasks work properly.
Skip Google Drive Tasks (361 tasks): Exclude the 8 problematic Google Drive tasks and evaluate the remaining 361 tasks. This approach is officially supported and acceptable.
For Result Reporting: When submitting or comparing results, clearly specify whether you evaluated 369 tasks (with Google Drive) or 361 tasks (without Google Drive). Both approaches are valid for benchmark evaluation purposes.
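The reporting rule above comes down to stating the denominator explicitly. The helper below is a hypothetical sketch (describe_run is not part of OSWorld) that derives the evaluated task count from the figures in this FAQ and attaches it to a result line; it assumes you aggregate results as a sum of per-task scores.

```python
TOTAL_TASKS = 369          # full OSWorld task set, per this FAQ
GOOGLE_DRIVE_TASKS = 8     # tasks that need Google account setup and OAuth2.0


def describe_run(total_score: float, include_google_drive: bool) -> str:
    """Format a result line that makes the evaluated task count explicit."""
    evaluated = TOTAL_TASKS if include_google_drive else TOTAL_TASKS - GOOGLE_DRIVE_TASKS
    if not 0 <= total_score <= evaluated:
        raise ValueError(f"total_score must be between 0 and {evaluated}")
    rate = 100.0 * total_score / evaluated
    setting = "with Google Drive" if include_google_drive else "without Google Drive"
    return f"{total_score:g}/{evaluated} tasks ({rate:.2f}%), {setting}"


print(describe_run(100, include_google_drive=False))
print(describe_run(100, include_google_drive=True))
```

Because the two settings use different denominators, the same raw score yields a different success rate in each, which is why the setting should always accompany the number.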
What are the running times and costs under different settings? (Deprecated)
| Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
|---|---|---|
| GPT-4V (screenshot) | 10h | $100 ($10) |
| Gemini-ProV (screenshot) | 15h | $0 ($0) |
| Claude-3 Opus (screenshot) | 15h | $150 ($15) |
| GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.