OSWorld Task Examples Explanation
This document explains the structure and meaning of task example JSON files in OSWorld, which is crucial for understanding and extending the benchmark.
Overview
OSWorld tasks are designed to benchmark agents’ abilities when interacting with GUI environments. Each task is defined as a JSON file containing all necessary information to set up, execute, and evaluate the task.
Task examples are stored in the `./examples` directory, where each data item follows a standardized JSON format that defines the task setup, execution context, and evaluation criteria.
JSON Schema Structure
Each task example JSON file contains the following fields:
Core Fields
- id (string, required)
A unique identifier for the task. This is typically a UUID that distinguishes this task from all others in the dataset.
- instruction (string, required)
The natural language instruction describing what the agent should accomplish. This is the primary task description that guides the agent’s behavior.
- source (string, optional)
The origin or reference for this task example. This could be:
A website URL that inspired the task
A forum discussion
A research paper
Manual creation by the dataset authors
- related_apps (array of strings, required)
List of applications that are involved in completing this task. These apps may be:
Pre-opened as part of the initial setup
Required to be opened during task execution
Used for evaluation purposes
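Taken together, the core fields form the top level of every task file. A minimal skeleton with placeholder values (the `config` and `evaluator` fields are described in the sections that follow) might look like this; the instruction and app name here are illustrative, not taken from the dataset:

```json
{
  "id": "00000000-0000-0000-0000-000000000000",
  "instruction": "Rename the file on my Desktop to report.pdf.",
  "source": "authors",
  "related_apps": ["os"],
  "config": [],
  "evaluator": {}
}
```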
Setup and Configuration
- config (array of objects, required)
Scripts and commands to set up the initial state of the task environment. This includes:
Launching applications
Opening specific files or URLs
Setting up network proxies or debugging connections
Configuring system states
Each config item has a `type` field that determines the setup action and a `parameters` field that specifies the action details.
- snapshot (string, deprecated)
Previously used to specify a pre-configured environment snapshot. This field is deprecated in favor of the more flexible `config` system.
Evaluation
- evaluator (object, required)
Defines how the task completion should be evaluated. Contains:
`func`: The evaluation function to use
`result`: Expected output location and format
`expected`: Ground truth or reference for comparison
- trajectory (string, deprecated)
Previously pointed to annotated human demonstration trajectories. This is being phased out in favor of more robust evaluation methods.
Environment Settings
- proxy (boolean, optional)
Indicates whether the task requires network proxy configuration. When `true`:
Network traffic may need to be routed through specific proxies
Useful for tasks requiring specific network conditions
May affect how web-based applications behave
Note
If you turn on the proxy, always make sure you:
have enough balance in the proxy service account; otherwise you will see internet connection failures
provide the correct client machine password to DesktopEnv; otherwise the proxy settings will fail to apply
- fixed_ip (boolean, optional)
Specifies whether the task requires a fixed IP address configuration:
`true`: Task needs consistent IP addressing
`false`: Task can work with dynamic IP assignment
Important for tasks involving network-dependent applications, or websites that detect IP changes as part of their bot protection.
- possibility_of_env_change (string, optional)
Indicates the likelihood that the environment state may change during task execution:
`"low"`: Environment is static and almost never changes, e.g. a task on local slides
`"medium"`: Some environmental changes may occur but are unlikely, e.g. a widely adopted blog or an old website that is rarely updated
`"high"`: Significant environmental changes are expected, e.g. a website whose elements, layout, or content are updated frequently
This helps in planning task execution strategies and evaluation approaches.
Note
You can add additional fields to the JSON as notes or comments, as long as you don’t break the original structure.
Detailed Example Analysis
Chrome PDF Download Task
Let’s examine a comprehensive example that demonstrates how to convert a webpage to PDF using Chrome:
```json
{
  "id": "e1e75309-3ddb-4d09-92ec-de869c928143",
  "instruction": "Computer, can you turn the webpage I'm looking at into a PDF file, save it to my Desktop with the default filename and set the margins to none?",
  "source": "https://in5stepstutorials.com/google-chrome/save-web-page-as-pdf-in-chrome.php",
  "config": [
    {
      "type": "launch",
      "parameters": {
        "command": [
          "google-chrome",
          "--remote-debugging-port=1337"
        ]
      }
    },
    {
      "type": "launch",
      "parameters": {
        "command": [
          "socat",
          "tcp-listen:9222,fork",
          "tcp:localhost:1337"
        ]
      }
    },
    {
      "type": "chrome_open_tabs",
      "parameters": {
        "urls_to_open": [
          "https://lilianweng.github.io/posts/2023-06-23-agent/"
        ]
      }
    }
  ],
  "related_apps": ["chrome"],
  "evaluator": {
    "func": "compare_pdfs",
    "result": {
      "type": "vm_file",
      "path": "/home/user/Desktop/LLM Powered Autonomous Agents _ Lil'Log.pdf",
      "dest": "LLM Powered Autonomous Agents _ Lil'Log.pdf"
    },
    "expected": {
      "type": "pdf_from_url",
      "path": "https://lilianweng.github.io/posts/2023-06-23-agent/",
      "dest": "LLM Powered Autonomous Agents _ Lil'Log_gold.pdf"
    }
  },
  "proxy": true,
  "fixed_ip": false,
  "possibility_of_env_change": "medium"
}
```
Field-by-Field Analysis
- Task Identity and Purpose
`id`: Unique identifier for this specific Chrome PDF task
`instruction`: Clear natural language description of the desired action
`source`: Reference to the tutorial that inspired this task
- Initial Setup Configuration
The `config` array sets up the testing environment:
Chrome Launch: Starts Google Chrome with remote debugging enabled on port 1337
Proxy Setup: Uses `socat` to create a proxy from port 9222 to the Chrome debugging port
Page Loading: Opens the target webpage that needs to be converted to PDF
- Application Context
`related_apps`: Only Chrome is needed for this task
- Evaluation Strategy
`func`: Uses `compare_pdfs` to verify the generated PDF matches expectations
`result`: Specifies where the agent should save the PDF file
`expected`: Defines the ground truth PDF generated from the same URL
- Environment Configuration
`proxy: true`: Task requires proxy setup since the website may ban your IP
`fixed_ip: false`: Dynamic IP addressing is acceptable since the website is not sensitive to IP changes
`possibility_of_env_change: "medium"`: Some changes may occur since Lilian Weng could modify her blog, though this is unlikely
Advanced Configuration and Evaluation
Understanding Config and Evaluator Functions
Note
For detailed information about available setup configurations, post-configurations, and evaluation functions, we recommend reading the source code to understand function signatures and how they are used by DesktopEnv:
Setup configurations: desktop_env/controllers/setup.py
Evaluation functions: desktop_env/evaluators
Desktop environment: desktop_env/desktop_env.py
The evaluation system consists of two main components:
Getters (desktop_env/evaluators/getters): Functions that retrieve information from various sources (VM files, VM states, cloud files, webpage information, etc.) to gather data for task completion assessment.
Metrics (desktop_env/evaluators/metrics): Functions that process the retrieved data to determine whether a task has been successfully completed.
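The relationship between the two components can be illustrated with a minimal, hypothetical metric written in the style of the functions listed below. The name `contains_expected_text` and its exact signature are illustrative assumptions, not part of the actual OSWorld API; consult desktop_env/evaluators/metrics for the real conventions:

```python
def contains_expected_text(result_path: str, expected: str) -> float:
    """Hypothetical metric: return 1.0 if the file retrieved by a getter
    contains the expected text, and 0.0 otherwise (including when the
    agent never produced the file)."""
    try:
        with open(result_path, encoding="utf-8") as f:
            content = f.read()
    except OSError:
        # A missing or unreadable result file counts as task failure.
        return 0.0
    return 1.0 if expected in content else 0.0
```

In this sketch, a getter (e.g. a `vm_file` entry under `result`) would be responsible for downloading the file from the VM and handing its local path to the metric as `result_path`.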
Available Getter Functions
Available Metric Functions
```python
from .basic_os import (
    check_gnome_favorite_apps,
    is_utc_0,
    check_text_enlarged,
    check_moved_jpgs,
    is_in_vm_clickboard
)
from .chrome import (
    is_expected_tabs,
    is_expected_bookmarks,
    compare_pdfs,
    compare_htmls,
    compare_archive,
    is_cookie_deleted,
    is_shortcut_on_desktop,
    check_font_size,
    check_enabled_experiments,
    check_history_deleted,
    is_expected_search_query,
    is_expected_active_tab,
    is_expected_url_pattern_match,
    is_added_to_steam_cart,
    is_expected_installed_extensions,
    compare_pdf_images,
    is_expected_active_tab_approximate
)
from .docs import (
    compare_font_names,
    compare_subscript_contains,
    has_page_numbers_in_footers,
    compare_docx_lines,
    evaluate_colored_words_in_tables,
    check_highlighted_words,
    evaluate_strike_through_last_paragraph,
    evaluate_conversion,
    evaluate_spacing,
    check_italic_font_size_14,
    evaluate_alignment,
    get_unique_train_ids,
    check_no_duplicates,
    compare_init_lines,
    find_default_font,
    contains_page_break,
    compare_docx_files,
    compare_docx_tables,
    compare_line_spacing,
    compare_insert_equation,
    compare_highlighted_text,
    is_first_line_centered,
    check_file_exists,
    check_tabstops,
    compare_contains_image,
    compare_docx_files_and_ignore_new_lines,
    compare_docx_images,
    compare_image_text,
    compare_references,
    compare_unique_train_records
)
from .general import (
    check_csv,
    check_accessibility_tree,
    run_sqlite3,
    check_json,
    check_list,
    exact_match,
    match_in_list,
    is_in_list,
    fuzzy_match,
    check_include_exclude,
    check_direct_json_object,
    compare_time_in_speedtest_results,
    is_included_all_json_objects,
    is_gold_text_included_in_pdf,
    check_line_number,
    file_contains,
    compare_terminal_and_txt,
    fuzzy_place_math,
    compare_python_pure_text,
    diff_text_file,
    literal_match
)
from .gimp import (
    check_structure_sim_resized,
    check_brightness_decrease_and_structure_sim,
    check_contrast_increase_and_structure_sim,
    check_saturation_increase_and_structure_sim,
    check_image_size,
    check_image_mirror,
    check_palette_and_structure_sim,
    check_textbox_on_leftside,
    check_green_background,
    check_file_exists_and_structure_sim,
    check_triangle_position,
    check_structure_sim,
    check_config_status,
    compare_image_list,
    increase_saturation,
    decrease_brightness,
    check_file_exists,
    compare_triangle_positions,
    check_sharper,
    check_image_file_size
)
from .libreoffice import check_libre_locale
from .others import compare_epub, check_mp3_meta
from .pdf import check_pdf_pages
from .slides import (
    check_presenter_console_disable,
    check_image_stretch_and_center,
    check_slide_numbers_color,
    compare_pptx_files,
    check_strikethrough,
    check_slide_orientation_Portrait,
    evaluate_presentation_fill_to_rgb_distance,
    check_left_panel,
    check_transition,
    check_page_number_colors,
    check_auto_saving_time
)
from .table import (
    compare_table,
    compare_csv,
    compare_conference_city_in_order
)
from .thunderbird import (
    check_thunderbird_prefs,
    check_thunderbird_filter,
    check_thunderbird_folder
)
from .vlc import (
    is_vlc_playing,
    is_vlc_recordings_folder,
    is_vlc_fullscreen,
    compare_images,
    compare_audios,
    compare_videos,
    check_qt_bgcone,
    check_one_instance_when_started_from_file,
    check_qt_minimal_view,
    check_qt_max_volume,
    check_qt_slider_colours,
    check_global_key_play_pause
)
from .vscode import (
    compare_text_file,
    compare_config,
    compare_answer,
    compare_result_files,
    is_extension_installed,
    check_json_settings,
    check_json_keybindings,
    check_python_file_by_test_suite,
    check_python_file_by_gold_file,
    check_html_background_image,
    compare_zip_files
)

def infeasible():
    pass
```
Note
When reusing existing functions to create new tasks, we recommend carefully reading our implementations to ensure you fully understand the characteristics and requirements of these functions.
Multiple Answer Evaluation with OR Logic
When a task has multiple acceptable answers, you can use the `"conj": "or"` connector. For tasks requiring all conditions to be met simultaneously, use `"conj": "and"`.
Example: LibreOffice Writer Task with Multiple Valid Outcomes
This example demonstrates how the `"conj": "or"` field allows any one of the three evaluation functions to pass for the task to be considered successful. Each evaluation compares the same result file against different expected outcomes.
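The referenced LibreOffice Writer task file is not reproduced here; the sketch below shows the general shape of an `"or"`-connected evaluator, with hypothetical file paths, URLs, and rule values chosen for illustration only:

```json
{
  "evaluator": {
    "conj": "or",
    "func": ["compare_docx_files", "compare_docx_files", "compare_docx_files"],
    "result": [
      { "type": "vm_file", "path": "/home/user/Desktop/result.docx", "dest": "result.docx" },
      { "type": "vm_file", "path": "/home/user/Desktop/result.docx", "dest": "result.docx" },
      { "type": "vm_file", "path": "/home/user/Desktop/result.docx", "dest": "result.docx" }
    ],
    "expected": [
      { "type": "cloud_file", "path": "https://example.com/gold_v1.docx", "dest": "gold_v1.docx" },
      { "type": "cloud_file", "path": "https://example.com/gold_v2.docx", "dest": "gold_v2.docx" },
      { "type": "cloud_file", "path": "https://example.com/gold_v3.docx", "dest": "gold_v3.docx" }
    ]
  }
}
```

Each entry in `func` is paired with the corresponding entries in `result` and `expected`; with `"conj": "or"`, the task passes if any one pairing passes.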
Multi-Parameter File Handling with Selective Input
When dealing with multiple file parameters, you can use the `"gives"` field to control which files are actually passed to the evaluation function. This is useful when you want to download multiple files but only provide specific ones to the evaluator.
Example: Selective File Input with the “gives” Field
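The full task file is not reproduced here; the sketch below reconstructs the relevant evaluator portion from the description that follows, with hypothetical cloud-storage URLs standing in for the real ones:

```json
{
  "evaluator": {
    "func": "compare_unique_train_records",
    "expected": {
      "multi": true,
      "type": "cloud_file",
      "path": [
        "https://example.com/files/HK_train_record_Gold.docx",
        "https://example.com/files/HK_train_record.docx"
      ],
      "dest": [
        "HK_train_record_Gold.docx",
        "HK_train_record.docx"
      ],
      "gives": [0, 1]
    }
  }
}
```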
In this example:
Multiple Files: Two files are downloaded from cloud storage (`HK_train_record_Gold.docx` and `HK_train_record.docx`)
Selective Input: The `"gives": [0, 1]` field specifies that both files (indices 0 and 1) should be passed to the evaluation function
Multi-parameter Support: The `"multi": true` flag enables handling multiple file parameters
Controlled Evaluation: Only the specified files are provided to the `compare_unique_train_records` function, even though more files could be available
This pattern is particularly useful when you need reference files for evaluation but don’t want to include all downloaded files in the actual comparison.
Config Types Reference
Common configuration types used in task setup:
- launch
Executes system commands to start applications or services.
Example:
```json
{
  "type": "launch",
  "parameters": {
    "command": ["application-name", "--option1", "--option2"]
  }
}
```
- chrome_open_tabs
Opens specific URLs in Chrome browser tabs.
Example:
```json
{
  "type": "chrome_open_tabs",
  "parameters": {
    "urls_to_open": ["https://example.com", "https://another-site.com"]
  }
}
```
- activate_window
Brings a specific application window to focus.
Example:
```json
{
  "type": "activate_window",
  "parameters": {
    "window_name": "Document.docx - LibreOffice Writer",
    "strict": true
  }
}
```
- execute
Runs arbitrary commands or scripts within the environment.
Example:
```json
{
  "type": "execute",
  "parameters": {
    "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('ctrl', 's');"]
  }
}
```
- sleep
Introduces delays between operations.
Example:
```json
{
  "type": "sleep",
  "parameters": {
    "seconds": 0.5
  }
}
```
See Also
DesktopEnv Interface Documentation - Detailed explanation of the OSWorld environment architecture and components
Adding New Agents to OSWorld - Instructions for implementing and integrating custom agents into the OSWorld framework
Public Evaluation Platform User Guide - Requirements and process for verified leaderboard submission