Run OSWorld
===========

Agent Baselines
---------------

.. attention::

   **Important Configuration Requirements:**

   * **Google Account Tasks**: Some tasks require Google account access and OAuth 2.0 configuration. Please refer to :doc:`google_account_guideline` for detailed setup instructions.
   * **Proxy Configuration**: Some tasks may require proxy settings to function properly (this depends on the strength of website defenses against your network location). Please refer to your system's proxy configuration documentation.
   * **Impact of Missing Configuration**: If these configurations are not set up properly, the corresponding tasks will fail to execute correctly, leading to lower evaluation scores.

If you wish to run the baseline agent used in our paper, you can execute the following commands; the example below uses the GPT-4o pure-screenshot setting.

Set the **OPENAI_API_KEY** environment variable to your API key:

.. code-block:: bash

   export OPENAI_API_KEY='changeme'

Optionally, set **OPENAI_BASE_URL** to use a custom OpenAI-compatible API endpoint:

.. code-block:: bash

   export OPENAI_BASE_URL='http://your-custom-endpoint.com/v1'  # Optional: defaults to https://api.openai.com

Single-threaded execution (deprecated; using the ``vmware`` provider as an example):

.. code-block:: bash

   python run.py \
       --provider_name vmware \
       --path_to_vm Ubuntu/Ubuntu.vmx \
       --headless \
       --observation_type screenshot \
       --model gpt-4o \
       --sleep_after_execution 3 \
       --max_steps 15 \
       --result_dir ./results \
       --client_password password

Parallel execution (example showing how to switch the provider to ``docker``):

.. code-block:: bash

   python run_multienv.py \
       --provider_name docker \
       --headless \
       --observation_type screenshot \
       --model gpt-4o \
       --sleep_after_execution 3 \
       --max_steps 15 \
       --num_envs 10 \
       --client_password password

The results, which include screenshots, actions, and video recordings of the agent's task completion, are saved in the ``./results`` directory (or whichever ``result_dir`` you specified). You can then run the following command to obtain the aggregated score:

.. code-block:: bash

   python show_result.py

Evaluation
----------

Local Evaluation
^^^^^^^^^^^^^^^^

Please start by reading through the agent interface and the environment interface. Correctly implement the agent interface and import your customized agent in ``run.py`` or ``run_multienv.py``. Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.

Public Evaluation
^^^^^^^^^^^^^^^^^

If you want your results to be verified and displayed on the verified leaderboard, you need to schedule a meeting with us (current maintainers: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) so that we can run your agent code on our side and report the results. You need to upload your agent implementation under the OSWorld framework and allow us to disclose it (you may choose not to expose your model API to the public), along with a report that allows the public to understand what is happening behind the scenes. Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us. Please carefully follow :doc:`run_public_evaluation` to obtain results.
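As a rough illustration of what "implement the agent interface" can look like, here is a minimal sketch of a custom agent. This is a hypothetical skeleton, not the actual OSWorld interface: the class name ``EchoAgent``, the ``predict(instruction, obs)`` signature, the ``"screenshot"`` observation key, and the ``DONE``/``FAIL`` action strings are all illustrative assumptions; consult the agent and environment interfaces in the repository for the real method names and return types before adapting it.

.. code-block:: python

   # Hypothetical sketch of a custom agent; the real OSWorld agent
   # interface may differ -- check the repository before adapting this.
   from typing import Any


   class EchoAgent:
       """A trivial agent that immediately ends each episode.

       Replace the body of ``predict`` with calls to your own model.
       ``obs`` is assumed to carry a screenshot under the "screenshot"
       key, matching the ``--observation_type screenshot`` setting used
       in the commands above.
       """

       def __init__(self, max_steps: int = 15):
           self.max_steps = max_steps
           self.step_count = 0

       def reset(self) -> None:
           self.step_count = 0

       def predict(self, instruction: str, obs: dict[str, Any]) -> tuple[str, list[str]]:
           self.step_count += 1
           # A real agent would send the screenshot and instruction to a
           # model here and parse executable actions from its reply.
           if self.step_count >= self.max_steps:
               return "step budget exhausted", ["FAIL"]
           return "nothing to do", ["DONE"]


   agent = EchoAgent(max_steps=2)
   agent.reset()
   response, actions = agent.predict("open the browser", {"screenshot": b""})
   print(actions)  # ["DONE"] on the first step

Once a class like this exists, the intent is that you import it in ``run.py`` or ``run_multienv.py`` in place of the baseline agent and run the same commands as above.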
See Also
--------

For additional setup and configuration guidance:

* :doc:`google_account_guideline` - Step-by-step guide for setting up Google accounts and OAuth 2.0 credentials for Google Drive tasks
* :doc:`environment_explanation` - Detailed explanation of the OSWorld environment architecture and components
* :doc:`task_example_explanation` - Comprehensive guide to understanding and working with OSWorld tasks
* :doc:`add_new_agent` - Instructions for implementing and integrating custom agents into the OSWorld framework
* :doc:`run_public_evaluation` - Requirements and process for verified leaderboard submission
* :doc:`../installation/index` - Complete installation instructions
* :doc:`../community/faq` - Frequently asked questions and troubleshooting