Adding New Agents to OSWorld
============================

This guide explains how to extend OSWorld's agent interface to integrate your custom agent for evaluation on the OSWorld benchmark. The agent interface is located in the ``mm_agents/`` directory and provides a standardized framework for multimodal agent implementations.

Core Agent Interface
--------------------

Base Agent Class
~~~~~~~~~~~~~~~~

The foundation of OSWorld's agent system is the base agent class. There are no strict requirements on your agent implementation; you can define all parameters according to your needs. The following example can be modified as needed:

.. code-block:: python

    class YourCustomAgent:
        def __init__(self, model_name, observation_type="screenshot",
                     max_steps=15, client_password=None):
            """
            Initialize the agent with configuration parameters

            Args:
                model_name (str): Name/identifier of the underlying model
                observation_type (str): Type of observation ("screenshot", "a11y_tree", "som")
                max_steps (int): Maximum number of steps for task execution
                client_password (str, optional): VM password for sudo operations;
                    whether you need it depends on your agent implementation
            """

        def predict(self, instruction, obs, **kwargs):
            """
            Core prediction method - main agent decision-making logic

            Args:
                instruction: Task instruction string
                obs: Current environment state
                **kwargs: step_count, history, etc.

            Returns:
                response: Raw response from the underlying model
                actions: List of actions in OSWorld action format
            """

        def reset(self, _logger=None):
            """Reset agent state for a new task"""

Essential Methods
~~~~~~~~~~~~~~~~~

**predict() Method**

The ``predict()`` method is the core interface where your agent's decision-making logic resides:

.. code-block:: python

    def predict(self, instruction, obs, **kwargs):
        """Main prediction logic"""
        # Query your model with the instruction and current observation,
        # then parse its output into OSWorld-formatted actions.
        return response, actions

**Parameters:**

- ``instruction``: Natural language task description
- ``obs``: Current environment state (screenshot, accessibility tree, etc.)
- ``**kwargs``: Your custom parameters

**Returns:**

- ``response``: Raw response from your model; useful for debugging your agent or printing progress to the user
- ``actions``: A list of actions, each in an OSWorld action format (either ``pyautogui`` or ``computer_13``)

**reset() Method**

The ``reset()`` method is also essential for your agent implementation. It is called at the beginning of each new task to initialize your agent's state:

.. code-block:: python

    def reset(self, _logger=None):
        """Reset agent state for a new task"""
        global logger
        logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")

        # Clear agent history
        self.thoughts = []
        self.actions = []
        self.observations = []

        # You can add more reset logic here based on your agent's needs

**Parameters:**

- ``_logger`` (optional): Logger instance for the agent to use during task execution

**Purpose:** Initialize your agent's internal state, clear history (thoughts, actions, observations), set up logging, and prepare for a new task execution. This ensures your agent starts fresh for each evaluation task.
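Putting the two methods together, the sketch below shows what a minimal screenshot-only agent might look like. The ``MinimalScreenshotAgent`` name, the ``call_model()`` helper, and the code-block parsing are illustrative assumptions rather than a required structure; substitute your own model client and output parsing. It assumes the screenshot is available under ``obs["screenshot"]``, and the returned action strings (for example ``pyautogui.click(300, 400)``) use the ``pyautogui`` action space.

.. code-block:: python

    import logging
    import re


    class MinimalScreenshotAgent:
        """Minimal example agent: screenshot observations, pyautogui-style actions."""

        def __init__(self, model_name, max_steps=15, client_password="password"):
            self.model_name = model_name
            self.max_steps = max_steps
            self.client_password = client_password
            self.thoughts, self.actions, self.observations = [], [], []

        def call_model(self, instruction, screenshot):
            # Hypothetical placeholder: send the instruction and the screenshot
            # bytes to your model backend and return its raw text response.
            raise NotImplementedError("Plug in your own model client here")

        def predict(self, instruction, obs, **kwargs):
            """Return (raw model response, list of OSWorld-formatted action strings)."""
            response = self.call_model(instruction, obs["screenshot"])

            # Illustrative parsing: extract fenced code blocks containing pyautogui
            # calls, e.g. "pyautogui.click(300, 400)". Adapt to your model's output.
            blocks = re.findall(r"```(?:python)?\s*(.*?)```", response, re.DOTALL)
            actions = [block.strip() for block in blocks]

            self.observations.append(obs)
            self.actions.append(actions)
            return response, actions

        def reset(self, _logger=None):
            """Clear per-task state before each new task."""
            self.logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")
            self.thoughts, self.actions, self.observations = [], [], []

The ``(response, actions)`` return pair matches what the execution pipeline described in the next section expects from ``predict()``.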
Integration and Usage
---------------------

Final Agent Placement
~~~~~~~~~~~~~~~~~~~~~

Your agent implementation will ultimately be integrated into OSWorld's execution pipeline through the ``lib_run_single.py`` module. This file contains the core task execution logic that coordinates between your agent and the OSWorld environment.

The integration flow works as follows:

1. **Agent Implementation**: Your agent class (with ``predict()`` and ``reset()`` methods)
2. **lib_run_single.py**: Core execution logic that calls your agent methods
3. **run.py/run_multienv.py/run_multienv_xxx.py**: Entry points that import and use ``lib_run_single``

**Coordination Requirements:**

- Your agent's ``predict()`` method will be called by the execution pipeline in ``lib_run_single.py``
- The ``reset()`` method will be called at the start of each task
- Your agent should handle the configured observation format and return valid actions
- Error handling should be robust, as your agent will run in automated evaluation pipelines

**File Structure:**

::

    OSWorld/
    ├── mm_agents/
    │   ├── your_agent.py       # Your agent implementation
    │   └── __init__.py
    ├── lib_run_single.py       # Core execution logic (calls your agent)
    ├── run.py                  # Single-environment entry point
    ├── run_multienv.py         # Multi-environment entry point
    ├── run_multienv_xxx.py     # Multi-environment entry point you add yourself if needed
    └── ...

This architecture ensures your agent integrates seamlessly with OSWorld's evaluation pipeline while maintaining flexibility in your implementation approach.

Running Evaluation
~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Single-threaded evaluation
    python run.py \
        --provider_name vmware \
        --observation_type screenshot \
        --model your-custom-agent \
        --max_steps 15 \
        --client_password password \
        --result_dir ./results

    # Multi-environment parallel evaluation
    python run_multienv.py \
        --provider_name docker \
        --observation_type screenshot \
        --model your-custom-agent \
        --num_envs 4 \
        --max_steps 15 \
        --client_password password

Note again that different VM providers use different default credentials:

* **VMware/VirtualBox/Docker**: username ``user``, password ``password``
* **AWS**: username ``osworld-public-evaluation``, auto-generated password

See Also
--------

* :doc:`environment_explanation` - Detailed explanation of the OSWorld environment architecture and components
* :doc:`task_example_explanation` - Comprehensive guide to understanding and working with OSWorld tasks
* :doc:`run_public_evaluation` - Requirements and process for verified leaderboard submission