Adding New Agents to OSWorld

This guide explains how to extend OSWorld’s agent interface to integrate your custom agent for evaluation on the OSWorld benchmark. The agent interface is located in the mm_agents/ directory and provides a standardized framework for multimodal agent implementations.

Core Agent Interface

Base Agent Class

OSWorld does not impose strict requirements on your agent class: you can define the constructor parameters according to your needs, as long as the class exposes the predict() and reset() methods described below. The following skeleton can be modified as needed:

class YourCustomAgent:
    def __init__(self, model_name, observation_type="screenshot", max_steps=15, client_password=None):
        """
        Initialize the agent with configuration parameters

        Args:
            model_name (str): Name/identifier of the underlying model
            observation_type (str): Type of observation ("screenshot", "a11y_tree", "som")
            max_steps (int): Maximum number of steps for task execution
            client_password (str, optional): VM password for sudo operations, depends on your agent implementation
        """

    def predict(self, instruction, obs, **kwargs):
        """
        Core prediction method - main agent decision-making logic

        Args:
            instruction: Task instruction string
            obs: Current environment state
            **kwargs: step_count, history, etc.

        Returns:
            response: Raw response from your model
            actions: List of actions in OSWorld format
        """

    def reset(self, _logger=None):
        """Reset agent state for a new task"""

Essential Methods

predict() Method

The predict() method is the core interface where your agent’s decision-making logic resides:

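def predict(self, instruction, obs, **kwargs):
    """Main prediction logic"""

    # Your agent logic here: query the model, then parse its output
    response = ""  # raw model output
    actions = []   # parsed actions in OSWorld action format

    return response, actions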

Parameters:
- instruction: Natural language task description
- obs: Current environment state (screenshot, accessibility tree, etc.)
- **kwargs: Your custom parameters

Returns:
- response: Raw response from your model, useful for debugging your agent or for printing progress to the user
- actions: A list of actions, each in an OSWorld action format (either pyautogui or computer_13)
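As a minimal sketch of the (response, actions) contract described above, the toy agent below parses action lines out of a raw model response. The names fake_model and SketchAgent, the "Action:" line convention, and the dictionary-shaped observation are all hypothetical and only serve the illustration; a real agent would call its own model and use its own parsing.

```python
def fake_model(instruction, obs):
    """Hypothetical stand-in for a real model call; returns raw text."""
    return "Thought: open the menu.\nAction: pyautogui.click(100, 200)"


class SketchAgent:
    """Toy agent illustrating the predict() return values (not an OSWorld class)."""

    def __init__(self):
        self.actions = []

    def predict(self, instruction, obs, **kwargs):
        # Raw model output, kept for debugging and logging
        response = fake_model(instruction, obs)
        # Assumed convention: every line starting with "Action:" holds one
        # pyautogui-style action string
        actions = [
            line[len("Action:"):].strip()
            for line in response.splitlines()
            if line.startswith("Action:")
        ]
        self.actions.append(actions)
        return response, actions


agent = SketchAgent()
# The observation shape used here is an assumption for the sketch
response, actions = agent.predict("Open the menu", {"screenshot": b""})
print(actions)
```

Returning the raw response alongside the parsed actions makes it possible to inspect what the model actually said when a parsed action turns out to be invalid.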

reset() Method

The reset() method is also essential for your agent implementation. This method is called at the beginning of each new task to initialize your agent’s state:

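import logging  # at the top of your agent module

def reset(self, _logger=None):
    """Reset agent state for a new task"""
    # Rebind the module-level logger so other methods log to the per-task logger
    global logger
    logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")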

    # Clear agent history
    self.thoughts = []
    self.actions = []
    self.observations = []

    # You can add more reset logic here based on your agent's needs

Parameters:
- _logger (optional): Logger instance for the agent to use during task execution

Purpose: Initialize your agent’s internal state, clear history (thoughts, actions, observations), set up logging, and prepare for a new task execution. This ensures your agent starts fresh for each evaluation task.
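The effect of reset() can be checked with a small, self-contained sketch. HistoryAgent is a hypothetical class that only mirrors the reset logic shown above; the logger name "task-1" is likewise made up for the example.

```python
import logging

logger = None  # module-level logger, rebound by reset()


class HistoryAgent:
    """Hypothetical agent that keeps only the per-task history lists."""

    def __init__(self):
        self.thoughts = []
        self.actions = []
        self.observations = []

    def reset(self, _logger=None):
        global logger
        logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")
        # Clear agent history so the next task starts from a clean state
        self.thoughts = []
        self.actions = []
        self.observations = []


agent = HistoryAgent()
agent.actions.append("pyautogui.click(10, 10)")  # leftover state from a previous task

agent.reset()  # no logger passed: falls back to "desktopenv.agent"
default_logger_name = logger.name

agent.reset(logging.getLogger("task-1"))  # per-task logger supplied by the caller
task_logger_name = logger.name
```

After either call the history lists are empty, and the module-level logger points at the logger that was passed in, or at the default one when none was given.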

Integration and Usage

Final Agent Placement

Your agent implementation will ultimately be integrated into OSWorld’s execution pipeline through the lib_run_single.py module. This file contains the core task execution logic that coordinates between your agent and the OSWorld environment.

The integration flow works as follows:

Agent Implementation: Your agent class (with predict() and reset() methods)
lib_run_single.py: Core execution logic that calls your agent methods
run.py/run_multienv.py/run_multienv_xxx.py: Entry points that import and use lib_run_single

Coordination Requirements:

Your agent’s predict() method will be called by the execution pipeline in lib_run_single.py
The reset() method will be called at the start of each task
Your agent should handle the observation formats and return valid actions
Error handling should be robust as your agent will run in automated evaluation pipelines
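The coordination described above can be sketched as a simple loop. This is not the actual content of lib_run_single.py: StubEnv, StubAgent, run_single_task, the two-value return of step(), and the "DONE" marker are assumptions made for the illustration, and the real environment API may differ.

```python
import logging


class StubEnv:
    """Hypothetical environment; not the real OSWorld environment API."""

    def reset(self):
        return {"screenshot": b""}

    def step(self, action):
        # In this stub, the string "DONE" ends the episode (an assumption)
        return {"screenshot": b""}, action == "DONE"


class StubAgent:
    """Hypothetical agent that clicks twice and then signals completion."""

    def __init__(self):
        self.history = []

    def reset(self, _logger=None):
        self.history = []

    def predict(self, instruction, obs, **kwargs):
        action = "DONE" if len(self.history) >= 2 else "pyautogui.click(10, 10)"
        self.history.append(action)
        return f"raw model output for step {len(self.history)}", [action]


def run_single_task(agent, env, instruction, max_steps=15):
    log = logging.getLogger("desktopenv.agent")
    agent.reset(log)  # called once at the start of each task
    obs = env.reset()
    executed, done, step = [], False, 0
    while not done and step < max_steps:
        try:
            response, actions = agent.predict(instruction, obs)
        except Exception:
            # Keep an automated run alive: log the failure and stop this task
            log.exception("predict() failed at step %d", step)
            break
        for action in actions:
            obs, done = env.step(action)
            executed.append(action)
            if done:
                break
        step += 1
    return executed


executed = run_single_task(StubAgent(), StubEnv(), "Open the menu")
print(executed)
```

Wrapping predict() in a try/except at the call site is one way to keep a single failing task from aborting a whole evaluation run; handling errors inside the agent itself is an equally valid choice.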

File Structure:

OSWorld/
├── mm_agents/
│   ├── your_agent.py          # Your agent implementation
│   └── __init__.py
├── lib_run_single.py          # Core execution logic (calls your agent)
├── run.py                     # Single-environment entry point
├── run_multienv.py            # Multi-environment entry point
├── run_multienv_xxx.py        # Multi-environment entry point added by you if needed
└── ...

This architecture ensures your agent integrates seamlessly with OSWorld’s evaluation pipeline while maintaining flexibility in your implementation approach.
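If you add your own entry point such as run_multienv_xxx.py, its command-line interface might look like the sketch below. The flag names are taken from the commands in the Running Evaluation section; the defaults, the help texts, and the build_parser function are assumptions and are not copied from the existing OSWorld scripts.

```python
import argparse


def build_parser():
    """Hypothetical argument parser for a custom entry point."""
    parser = argparse.ArgumentParser(description="Custom OSWorld entry point (sketch)")
    parser.add_argument("--provider_name", default="vmware", help="VM provider")
    parser.add_argument("--observation_type", default="screenshot", help="Observation type")
    parser.add_argument("--model", required=True, help="Name of your agent or model")
    parser.add_argument("--num_envs", type=int, default=1, help="Parallel environments")
    parser.add_argument("--max_steps", type=int, default=15, help="Step budget per task")
    parser.add_argument("--client_password", default="password", help="VM password")
    parser.add_argument("--result_dir", default="./results", help="Output directory")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line
args = build_parser().parse_args(["--model", "your-custom-agent", "--num_envs", "4"])
print(args.model, args.num_envs, args.max_steps)
```

The parsed values would then be passed to your agent's constructor and to the execution logic in lib_run_single.py.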

Running Evaluation

# Single-threaded evaluation
python run.py \
    --provider_name vmware \
    --observation_type screenshot \
    --model your-custom-agent \
    --max_steps 15 \
    --client_password password \
    --result_dir ./results

# Multi-environment parallel evaluation
python run_multienv.py \
    --provider_name docker \
    --observation_type screenshot \
    --model your-custom-agent \
    --num_envs 4 \
    --max_steps 15 \
    --client_password password

Note that different VM providers use different default credentials, which affects the value you pass to --client_password:

VMware/VirtualBox/Docker: username user, password password
AWS: username osworld-public-evaluation, auto-generated password
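The defaults listed above can be kept in a small lookup table so that an entry point picks the right credentials for each provider. This is only a sketch: the lowercase provider keys "virtualbox" and "aws", the DEFAULT_CREDENTIALS table, and the default_credentials helper are assumptions, and None stands in for the AWS password because it is auto-generated.

```python
# Default (username, password) pairs per provider, as listed above.
# None marks a password that is auto-generated and must be supplied at run time.
DEFAULT_CREDENTIALS = {
    "vmware": ("user", "password"),
    "virtualbox": ("user", "password"),
    "docker": ("user", "password"),
    "aws": ("osworld-public-evaluation", None),
}


def default_credentials(provider_name):
    """Return the default (username, password) pair for a provider (hypothetical helper)."""
    key = provider_name.lower()
    if key not in DEFAULT_CREDENTIALS:
        raise ValueError(f"Unknown provider: {provider_name!r}")
    return DEFAULT_CREDENTIALS[key]


username, password = default_credentials("docker")
aws_username, aws_password = default_credentials("aws")
```

For AWS, the returned password is None, which signals that the caller has to provide the generated password explicitly, for example through --client_password.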