Adding New Agents to OSWorld
============================

This guide explains how to extend OSWorld's agent interface to integrate your custom agent for evaluation on the OSWorld benchmark. The agent interface is located in the ``mm_agents/`` directory and provides a standardized framework for multimodal agent implementations.

Core Agent Interface
--------------------

Base Agent Class
~~~~~~~~~~~~~~~~

The foundation of OSWorld's agent system is the base agent class. There are no strict requirements on your agent implementation; you can define all parameters according to your needs. The following example can be modified as needed:

.. code-block:: python

    class YourCustomAgent:
        def __init__(self, model_name, observation_type="screenshot",
                     max_steps=15, client_password=None):
            """
            Initialize the agent with configuration parameters

            Args:
                model_name (str): Name/identifier of the underlying model
                observation_type (str): Type of observation ("screenshot", "a11y_tree", "som")
                max_steps (int): Maximum number of steps for task execution
                client_password (str, optional): VM password for sudo operations;
                    whether you need it depends on your agent implementation
            """

        def predict(self, instruction, obs, **kwargs):
            """
            Core prediction method - main agent decision-making logic

            Args:
                instruction: Task instruction string
                obs: Current environment state
                **kwargs: step_count, history, etc.

            Returns:
                response: Raw response from the underlying model
                actions: List of actions in OSWorld action format
            """

        def reset(self, _logger=None):
            """Reset agent state for a new task"""

Essential Methods
~~~~~~~~~~~~~~~~~

**predict() Method**

The ``predict()`` method is the core interface where your agent's decision-making logic resides:

.. code-block:: python

    def predict(self, instruction, obs, **kwargs):
        """Main prediction logic"""
        # Query your model with the instruction and current observation,
        # then parse its output into OSWorld-formatted actions.
        return response, actions

**Parameters:**

- ``instruction``: Natural language task description
- ``obs``: Current environment state (screenshot, accessibility tree, etc.)
- ``**kwargs``: Your custom parameters

**Returns:**

- ``response``: Raw response from your model; useful for debugging your agent or printing progress to the user
- ``actions``: A list of actions, each in an OSWorld action format (either ``pyautogui`` or ``computer_13``)

**reset() Method**

The ``reset()`` method is also essential for your agent implementation. It is called at the beginning of each new task to initialize your agent's state:

.. code-block:: python

    def reset(self, _logger=None):
        """Reset agent state for a new task"""
        global logger
        logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")

        # Clear agent history
        self.thoughts = []
        self.actions = []
        self.observations = []

        # You can add more reset logic here based on your agent's needs

**Parameters:**

- ``_logger`` (optional): Logger instance for the agent to use during task execution

**Purpose:** Initialize your agent's internal state, clear history (thoughts, actions, observations), set up logging, and prepare for a new task execution. This ensures your agent starts fresh for each evaluation task.
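Putting the two methods together, the sketch below shows what a minimal screenshot-only agent might look like. The ``MinimalScreenshotAgent`` name, the ``call_model()`` helper, and the code-block parsing are illustrative assumptions rather than a required structure; substitute your own model client and output parsing. It assumes the screenshot is available under ``obs["screenshot"]``, and the returned action strings (for example ``pyautogui.click(300, 400)``) use the ``pyautogui`` action space.

.. code-block:: python

    import logging
    import re


    class MinimalScreenshotAgent:
        """Minimal example agent: screenshot observations, pyautogui-style actions."""

        def __init__(self, model_name, max_steps=15, client_password="password"):
            self.model_name = model_name
            self.max_steps = max_steps
            self.client_password = client_password
            self.thoughts, self.actions, self.observations = [], [], []

        def call_model(self, instruction, screenshot):
            # Hypothetical placeholder: send the instruction and the screenshot
            # bytes to your model backend and return its raw text response.
            raise NotImplementedError("Plug in your own model client here")

        def predict(self, instruction, obs, **kwargs):
            """Return (raw model response, list of OSWorld-formatted action strings)."""
            response = self.call_model(instruction, obs["screenshot"])

            # Illustrative parsing: extract fenced code blocks containing pyautogui
            # calls, e.g. "pyautogui.click(300, 400)". Adapt to your model's output.
            blocks = re.findall(r"```(?:python)?\s*(.*?)```", response, re.DOTALL)
            actions = [block.strip() for block in blocks]

            self.observations.append(obs)
            self.actions.append(actions)
            return response, actions

        def reset(self, _logger=None):
            """Clear per-task state before each new task."""
            self.logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")
            self.thoughts, self.actions, self.observations = [], [], []

The ``(response, actions)`` return pair matches what the execution pipeline described in the next section expects from ``predict()``.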
Integration and Usage
---------------------

Final Agent Placement
~~~~~~~~~~~~~~~~~~~~~

Your agent implementation will ultimately be integrated into OSWorld's execution pipeline through the ``lib_run_single.py`` module. This file contains the core task execution logic that coordinates between your agent and the OSWorld environment.

The integration flow works as follows:

1. **Agent Implementation**: Your agent class (with ``predict()`` and ``reset()`` methods)
2. **lib_run_single.py**: Core execution logic that calls your agent methods
3. **run.py/run_multienv.py/run_multienv_xxx.py**: Entry points that import and use ``lib_run_single``

**Coordination Requirements:**

- Your agent's ``predict()`` method will be called by the execution pipeline in ``lib_run_single.py``
- The ``reset()`` method will be called at the start of each task
- Your agent should handle the configured observation format and return valid actions
- Error handling should be robust, as your agent will run in automated evaluation pipelines

**File Structure:**

::

    OSWorld/
    ├── mm_agents/
    │   ├── your_agent.py       # Your agent implementation
    │   └── __init__.py
    ├── lib_run_single.py       # Core execution logic (calls your agent)
    ├── run.py                  # Single-environment entry point
    ├── run_multienv.py         # Multi-environment entry point
    ├── run_multienv_xxx.py     # Multi-environment entry point you add yourself if needed
    └── ...

This architecture ensures your agent integrates seamlessly with OSWorld's evaluation pipeline while maintaining flexibility in your implementation approach.

Running Evaluation
~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Single-threaded evaluation
    python run.py \
        --provider_name vmware \
        --observation_type screenshot \
        --model your-custom-agent \
        --max_steps 15 \
        --client_password password \
        --result_dir ./results

    # Multi-environment parallel evaluation
    python run_multienv.py \
        --provider_name docker \
        --observation_type screenshot \
        --model your-custom-agent \
        --num_envs 4 \
        --max_steps 15 \
        --client_password password

Note again that different VM providers use different default credentials:

* **VMware/VirtualBox/Docker**: username ``user``, password ``password``
* **AWS**: username ``osworld-public-evaluation``, auto-generated password

See Also
--------

* :doc:`environment_explanation` - Detailed explanation of the OSWorld environment architecture and components
* :doc:`task_example_explanation` - Comprehensive guide to understanding and working with OSWorld tasks
* :doc:`run_public_evaluation` - Requirements and process for verified leaderboard submission