Adding New Agents to OSWorld
This guide explains how to extend OSWorld’s agent interface to integrate your custom agent for evaluation on the OSWorld benchmark.
The agent interface is located in the mm_agents/ directory and provides a standardized framework for multimodal agent implementations.
Core Agent Interface
Base Agent Class
The foundation of OSWorld’s agent system is the agent class. There are no strict requirements on your agent implementation, and you can define whatever parameters you need. The following example can be adapted as needed:
class YourCustomAgent:
    def __init__(self, model_name, observation_type="screenshot", max_steps=15, client_password=None):
        """
        Initialize the agent with configuration parameters.

        Args:
            model_name (str): Name/identifier of the underlying model
            observation_type (str): Type of observation ("screenshot", "a11y_tree", "som")
            max_steps (int): Maximum number of steps for task execution
            client_password (str, optional): VM password for sudo operations; whether it is needed depends on your agent implementation
        """
        self.model_name = model_name
        self.observation_type = observation_type
        self.max_steps = max_steps
        self.client_password = client_password

    def predict(self, instruction, obs, **kwargs):
        """
        Core prediction method: the agent's main decision-making logic.

        Args:
            instruction: Task instruction string
            obs: Current environment state
            **kwargs: Your custom parameters (step_count, history, etc.)
        Returns:
            response: Raw response from your model
            actions: List of actions in an OSWorld action format
        """
        raise NotImplementedError

    def reset(self, _logger=None):
        """Reset agent state for a new task"""
Essential Methods
predict() Method
The predict() method is the core interface where your agent’s decision-making logic resides:
def predict(self, instruction, obs, **kwargs):
    """Main prediction logic"""
    # Your agent logic here: query the model, then parse its output into actions
    return response, actions
Parameters:
- instruction: Natural language task description
- obs: Current environment state (screenshot, accessibility tree, etc.)
- **kwargs: Your custom parameters
Returns:
- response: Raw response from your model, useful for debugging your agent or printing progress to the user
- actions: A list of actions, each in one of OSWorld’s action formats (either pyautogui or computer_13)
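To make the return contract concrete, here is a minimal sketch of a predict() that returns both the raw response and a parsed action list. The class name StubAgent, the helper _call_model, and the "screenshot" key in obs are assumptions for illustration, not part of OSWorld; the model call is replaced by a fixed string.

```python
from typing import Any, Dict, List, Tuple


class StubAgent:
    """Hypothetical toy agent used only to illustrate the predict() contract."""

    def _call_model(self, instruction: str, obs: Dict[str, Any]) -> str:
        # Stand-in for a real model call; returns free text plus one action line.
        return "I will click the button.\npyautogui.click(100, 200)"

    def predict(self, instruction: str, obs: Dict[str, Any], **kwargs) -> Tuple[str, List[str]]:
        response = self._call_model(instruction, obs)
        # Keep only the lines that look like pyautogui calls.
        actions = [
            line.strip()
            for line in response.splitlines()
            if line.strip().startswith("pyautogui.")
        ]
        return response, actions


agent = StubAgent()
response, actions = agent.predict("Open the file manager", {"screenshot": b""})
```

A real agent would replace _call_model with a request to its model and would likely need a stricter parser than a prefix check.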
reset() Method
The reset() method is also essential for your agent implementation. This method is called at the beginning of each new task to initialize your agent’s state:
import logging

logger = None  # Module-level logger, assigned in reset()

def reset(self, _logger=None):
    """Reset agent state for a new task"""
    global logger
    logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")
    # Clear agent history
    self.thoughts = []
    self.actions = []
    self.observations = []
    # You can add more reset logic here based on your agent's needs
Parameters:
- _logger (optional): Logger instance for the agent to use during task execution
Purpose: Initialize your agent’s internal state, clear history (thoughts, actions, observations), set up logging, and prepare for a new task execution. This ensures your agent starts fresh for each evaluation task.
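As one possible variation, assuming you prefer to avoid module-level state, the logger can be kept on the instance instead of in a global. The class name HistoryAgent is hypothetical; the sketch also shows that leftover history from a previous task is cleared by reset().

```python
import logging


class HistoryAgent:
    """Hypothetical agent that stores its logger on the instance."""

    def __init__(self):
        self.logger = logging.getLogger("desktopenv.agent")
        self.thoughts, self.actions, self.observations = [], [], []

    def reset(self, _logger=None):
        # Use the supplied logger if given, otherwise fall back to the default.
        self.logger = _logger if _logger is not None else logging.getLogger("desktopenv.agent")
        # Clear history so the next task starts from a clean state.
        self.thoughts = []
        self.actions = []
        self.observations = []


agent = HistoryAgent()
agent.actions.append(["pyautogui.click(1, 2)"])  # leftover state from a previous task
agent.reset(logging.getLogger("task_2"))
```

Either approach works as long as every piece of per-task state is reinitialized in reset().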
Integration and Usage
Final Agent Placement
Your agent implementation will ultimately be integrated into OSWorld’s execution pipeline through the lib_run_single.py module.
This file contains the core task execution logic that coordinates between your agent and the OSWorld environment.
The integration flow works as follows:
1. Agent Implementation: Your agent class (with predict() and reset() methods)
2. lib_run_single.py: Core execution logic that calls your agent methods
3. run.py / run_multienv.py / run_multienv_xxx.py: Entry points that import and use lib_run_single
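The flow above can be sketched as a small loop. This is not the actual code in lib_run_single.py: FakeEnv, ClickAgent, and run_task are hypothetical names, and the environment's reset()/step() signatures are simplified assumptions chosen so the example runs on its own.

```python
from typing import Any, Dict, List, Tuple


class FakeEnv:
    """Hypothetical stand-in for the OSWorld environment; finishes after 3 actions."""

    def __init__(self):
        self.executed: List[str] = []

    def reset(self) -> Dict[str, Any]:
        self.executed = []
        return {"screenshot": b""}

    def step(self, action: str) -> Tuple[Dict[str, Any], bool]:
        self.executed.append(action)
        done = len(self.executed) >= 3
        return {"screenshot": b""}, done


class ClickAgent:
    """Hypothetical agent that always proposes the same click."""

    def reset(self, _logger=None):
        self.actions = []

    def predict(self, instruction, obs, **kwargs):
        action = "pyautogui.click(10, 10)"
        self.actions.append(action)
        return action, [action]


def run_task(agent, env, instruction: str, max_steps: int = 15) -> int:
    agent.reset()          # fresh agent state for each task
    obs = env.reset()
    steps, done = 0, False
    while not done and steps < max_steps:
        response, actions = agent.predict(instruction, obs)
        for action in actions:
            obs, done = env.step(action)
            if done:
                break
        steps += 1
    return steps


env = FakeEnv()
steps = run_task(ClickAgent(), env, "Click the icon")
```

The point of the sketch is the calling order: reset() once per task, then predict() once per step, with each returned action forwarded to the environment.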
Coordination Requirements:
- Your agent’s predict() method will be called by the execution pipeline in lib_run_single.py
- The reset() method will be called at the start of each task
- Your agent should handle the observation formats and return valid actions
- Error handling should be robust, as your agent will run in automated evaluation pipelines
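One way to approach the error-handling requirement is to retry transient model failures before giving up. The helper call_with_retries and the simulated flaky function are hypothetical, and the retry count and delay are arbitrary illustrative defaults; how your agent should react once retries are exhausted depends on your pipeline.

```python
import time
from typing import Callable


def call_with_retries(fn: Callable[[], str], attempts: int = 3, delay: float = 0.0) -> str:
    """Call fn, retrying on any exception up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: unattended runs should not crash here
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"model call failed after {attempts} attempts") from last_error


# Simulated model call that fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "pyautogui.press('enter')"


result = call_with_retries(flaky)
```

Inside predict(), the model request would be wrapped in such a helper so that a single timeout does not abort an entire evaluation run.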
File Structure:
OSWorld/
├── mm_agents/
│ ├── your_agent.py # Your agent implementation
│ └── __init__.py
├── lib_run_single.py # Core execution logic (calls your agent)
├── run.py # Single-environment entry point
├── run_multienv.py # Multi-environment entry point
├── run_multienv_xxx.py # Multi-environment entry point added by you if needed
└── ...
This architecture ensures your agent integrates seamlessly with OSWorld’s evaluation pipeline while maintaining flexibility in your implementation approach.
Running Evaluation
# Single-threaded evaluation
python run.py \
--provider_name vmware \
--observation_type screenshot \
--model your-custom-agent \
--max_steps 15 \
--client_password password \
--result_dir ./results
# Multi-environment parallel evaluation
python run_multienv.py \
--provider_name docker \
--observation_type screenshot \
--model your-custom-agent \
--num_envs 4 \
--max_steps 15 \
--client_password password
Note that different VM providers have different default credentials:
- VMware/VirtualBox/Docker: username user, password password
- AWS: username osworld-public-evaluation, auto-generated password
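If your own launcher script needs these defaults, they could be kept in a small lookup table. This is a sketch only: the helper default_credentials and the provider key spellings ("virtualbox", "aws") are assumptions, and the AWS password is represented as None because it is auto-generated.

```python
from typing import Dict, Optional, Tuple

# Default credentials as listed above; provider keys are assumed spellings.
DEFAULT_CREDENTIALS: Dict[str, Tuple[str, Optional[str]]] = {
    "vmware": ("user", "password"),
    "virtualbox": ("user", "password"),
    "docker": ("user", "password"),
    "aws": ("osworld-public-evaluation", None),  # password is auto-generated
}


def default_credentials(provider_name: str) -> Tuple[str, Optional[str]]:
    """Return (username, password) for a provider; password may be None."""
    try:
        return DEFAULT_CREDENTIALS[provider_name.lower()]
    except KeyError:
        raise ValueError(f"unknown provider: {provider_name!r}") from None


username, password = default_credentials("docker")
```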
See Also
DesktopEnv Interface Documentation - Detailed explanation of the OSWorld environment architecture and components
OSWorld Task Examples Explanation - Comprehensive guide to understanding and working with OSWorld tasks
Public Evaluation Platform User Guide - Requirements and process for verified leaderboard submission