OSWorld Task Examples Explanation

This document explains the structure and meaning of task example JSON files in OSWorld, which is crucial for understanding and extending the benchmark.

Overview

OSWorld tasks are designed to benchmark agents’ abilities when interacting with GUI environments. Each task is defined as a JSON file containing all necessary information to set up, execute, and evaluate the task.

Task examples are stored in the ./examples directory, where each data item follows a standardized JSON format that defines the task setup, execution context, and evaluation criteria.

JSON Schema Structure

Each task example JSON file contains the following fields:

Core Fields

id (string, required)

A unique identifier for the task. This is typically a UUID that distinguishes this task from all others in the dataset.

instruction (string, required)

The natural language instruction describing what the agent should accomplish. This is the primary task description that guides the agent’s behavior.

source (string, optional)

The origin or reference for this task example. This could be:

  • A website URL that inspired the task

  • A forum discussion

  • A research paper

  • Manual creation by the dataset authors

related_apps (array of strings, required)

List of applications that are involved in completing this task. These apps may be:

  • Pre-opened as part of the initial setup

  • Required to be opened during task execution

  • Used for evaluation purposes

Setup and Configuration

config (array of objects, required)

Scripts and commands to set up the initial state of the task environment. This includes:

  • Launching applications

  • Opening specific files or URLs

  • Setting up network proxies or debugging connections

  • Configuring system states

Each config item has a type field that determines the setup action, and parameters that specify the action details.
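
For example, a minimal config item that launches an application could look like this (the gedit command is just an illustration):

{
  "type": "launch",
  "parameters": {
    "command": ["gedit"]
  }
}

The commonly used types are covered in the Config Types Reference below.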

snapshot (string, deprecated)

Previously used to specify a pre-configured environment snapshot. This field is deprecated in favor of the more flexible config system.

Evaluation

evaluator (object, required)

Defines how task completion should be evaluated (a minimal sketch follows this list). Contains:

  • func: The evaluation function to use

  • result: Expected output location and format

  • expected: Ground truth or reference for comparison
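
A minimal sketch of an evaluator; the paths and the gold-file URL are placeholders, and compare_docx_files is one of the metric functions listed later on this page:

{
  "evaluator": {
    "func": "compare_docx_files",
    "result": {
      "type": "vm_file",
      "path": "/home/user/Desktop/report.docx",
      "dest": "report.docx"
    },
    "expected": {
      "type": "cloud_file",
      "path": "https://example.com/report_gold.docx",
      "dest": "report_gold.docx"
    }
  }
}

Here the vm_file getter retrieves the agent's output from the VM, the cloud_file getter downloads a reference file, and both are passed to compare_docx_files.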

trajectory (string, deprecated)

Previously pointed to annotated human demonstration trajectories. This is being phased out in favor of more robust evaluation methods.

Environment Settings

proxy (boolean, optional)

Indicates whether the task requires network proxy configuration. When true:

  • Network traffic may need to be routed through specific proxies

  • Useful for tasks requiring specific network conditions

  • May affect how web-based applications behave

Note

If you enable the proxy, always make sure you:

  • maintain sufficient balance with the proxy service; otherwise, internet connections will fail

  • provide the correct client machine password to DesktopEnv; otherwise, the proxy settings will fail to apply

fixed_ip (boolean, optional)

Specifies whether the task requires a fixed IP address configuration:

  • true: Task needs consistent IP addressing

  • false: Task can work with dynamic IP assignment

  • Important for tasks involving network-dependent applications or websites that detect IP changes as bot protection

possibility_of_env_change (string, optional)

Indicates the likelihood that the environment state may change during task execution:

  • "low": Environment is static and almost impossible to change, e.g. a tasks on local slides

  • "medium": Some environmental changes may occur but unlikely, e.g. a widely adapted blog, some old website that is not updated

  • "high": Significant environmental changes are found and expected, e.g. a website that is updated frequently about the elements, layout, or content

This helps in planning task execution strategies and evaluation approaches.

Note

You can add additional fields to the JSON as notes or comments, as long as you don’t break the original structure.

Detailed Example Analysis

Chrome PDF Download Task

Let’s examine a comprehensive example that demonstrates how to convert a webpage to PDF using Chrome:

{
  "id": "e1e75309-3ddb-4d09-92ec-de869c928143",
  "instruction": "Computer, can you turn the webpage I'm looking at into a PDF file, save it to my Desktop with the default filename and set the margins to none?",
  "source": "https://in5stepstutorials.com/google-chrome/save-web-page-as-pdf-in-chrome.php",
  "config": [
    {
      "type": "launch",
      "parameters": {
        "command": [
          "google-chrome",
          "--remote-debugging-port=1337"
        ]
      }
    },
    {
      "type": "launch",
      "parameters": {
        "command": [
          "socat",
          "tcp-listen:9222,fork",
          "tcp:localhost:1337"
        ]
      }
    },
    {
      "type": "chrome_open_tabs",
      "parameters": {
        "urls_to_open": [
          "https://lilianweng.github.io/posts/2023-06-23-agent/"
        ]
      }
    }
  ],
  "related_apps": ["chrome"],
  "evaluator": {
    "func": "compare_pdfs",
    "result": {
      "type": "vm_file",
      "path": "/home/user/Desktop/LLM Powered Autonomous Agents _ Lil'Log.pdf",
      "dest": "LLM Powered Autonomous Agents _ Lil'Log.pdf"
    },
    "expected": {
      "type": "pdf_from_url",
      "path": "https://lilianweng.github.io/posts/2023-06-23-agent/",
      "dest": "LLM Powered Autonomous Agents _ Lil'Log_gold.pdf"
    }
  },
  "proxy": true,
  "fixed_ip": false,
  "possibility_of_env_change": "medium"
}

Field-by-Field Analysis

Task Identity and Purpose
  • id: Unique identifier for this specific Chrome PDF task

  • instruction: Clear natural language description of the desired action

  • source: Reference to the tutorial that inspired this task

Initial Setup Configuration

The config array sets up the testing environment:

  1. Chrome Launch: Starts Google Chrome with remote debugging enabled on port 1337

  2. Port Forwarding: Uses socat to forward connections on port 9222 to Chrome's debugging port 1337

  3. Page Loading: Opens the target webpage that needs to be converted to PDF

Application Context
  • related_apps: Only Chrome is needed for this task

Evaluation Strategy
  • func: Uses compare_pdfs to verify the generated PDF matches expectations

  • result: Specifies where the agent should save the PDF file

  • expected: Defines the ground truth PDF generated from the same URL

Environment Configuration
  • proxy: true: Task requires proxy setup since the website may ban your IP

  • fixed_ip: false: Dynamic IP addressing is acceptable since the website is not sensitive to IP changes

  • possibility_of_env_change: "medium": Some changes may occur since Lilian Weng could modify her blog, though this is unlikely

Advanced Configuration and Evaluation

Understanding Config and Evaluator Functions

Note

For detailed information about available setup configurations, post-configurations, and evaluation functions, we recommend reading the source code to understand the function signatures and how they are used by DesktopEnv.

The evaluation system consists of two main components:

  • Getters (desktop_env/evaluators/getters): Functions that retrieve information from various sources (VM files, VM states, cloud files, webpage information, etc.) to gather data for task completion assessment.

  • Metrics (desktop_env/evaluators/metrics): Functions that process the retrieved data to determine whether a task has been successfully completed.

Available Getter Functions

Available Metric Functions

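# Metric functions exported by the desktop_env/evaluators/metrics package: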
from .basic_os import (
    check_gnome_favorite_apps,
    is_utc_0,
    check_text_enlarged,
    check_moved_jpgs,
    is_in_vm_clickboard
)
from .chrome import (
    is_expected_tabs,
    is_expected_bookmarks,
    compare_pdfs,
    compare_htmls,
    compare_archive,
    is_cookie_deleted,
    is_shortcut_on_desktop,
    check_font_size,
    check_enabled_experiments,
    check_history_deleted,
    is_expected_search_query,
    is_expected_active_tab,
    is_expected_url_pattern_match,
    is_added_to_steam_cart,
    is_expected_installed_extensions,
    compare_pdf_images,
    is_expected_active_tab_approximate
)
from .docs import (
    compare_font_names,
    compare_subscript_contains,
    has_page_numbers_in_footers,
    compare_docx_lines,
    evaluate_colored_words_in_tables,
    check_highlighted_words,
    evaluate_strike_through_last_paragraph,
    evaluate_conversion,
    evaluate_spacing,
    check_italic_font_size_14,
    evaluate_alignment,
    get_unique_train_ids,
    check_no_duplicates,
    compare_init_lines,
    find_default_font,
    contains_page_break,
    compare_docx_files,
    compare_docx_tables,
    compare_line_spacing,
    compare_insert_equation,
    compare_highlighted_text,
    is_first_line_centered,
    check_file_exists,
    check_tabstops,
    compare_contains_image,
    compare_docx_files_and_ignore_new_lines,
    compare_docx_images,
    compare_image_text,
    compare_references,
    compare_unique_train_records
)
from .general import (
    check_csv,
    check_accessibility_tree,
    run_sqlite3,
    check_json,
    check_list,
    exact_match,
    match_in_list,
    is_in_list,
    fuzzy_match,
    check_include_exclude,
    check_direct_json_object,
    compare_time_in_speedtest_results,
    is_included_all_json_objects,
    is_gold_text_included_in_pdf,
    check_line_number,
    file_contains,
    compare_terminal_and_txt,
    fuzzy_place_math,
    compare_python_pure_text,
    diff_text_file,
    literal_match
)
from .gimp import (
    check_structure_sim_resized,
    check_brightness_decrease_and_structure_sim,
    check_contrast_increase_and_structure_sim,
    check_saturation_increase_and_structure_sim,
    check_image_size,
    check_image_mirror,
    check_palette_and_structure_sim,
    check_textbox_on_leftside,
    check_green_background,
    check_file_exists_and_structure_sim,
    check_triangle_position,
    check_structure_sim,
    check_config_status,
    compare_image_list,
    increase_saturation,
    decrease_brightness,
    check_file_exists,
    compare_triangle_positions,
    check_sharper,
    check_image_file_size
)
from .libreoffice import check_libre_locale
from .others import compare_epub, check_mp3_meta
from .pdf import check_pdf_pages
from .slides import (
    check_presenter_console_disable,
    check_image_stretch_and_center,
    check_slide_numbers_color,
    compare_pptx_files,
    check_strikethrough,
    check_slide_orientation_Portrait,
    evaluate_presentation_fill_to_rgb_distance,
    check_left_panel,
    check_transition,
    check_page_number_colors,
    check_auto_saving_time
)
from .table import (
    compare_table,
    compare_csv,
    compare_conference_city_in_order
)
from .thunderbird import (
    check_thunderbird_prefs,
    check_thunderbird_filter,
    check_thunderbird_folder
)
from .vlc import (
    is_vlc_playing,
    is_vlc_recordings_folder,
    is_vlc_fullscreen,
    compare_images,
    compare_audios,
    compare_videos,
    check_qt_bgcone,
    check_one_instance_when_started_from_file,
    check_qt_minimal_view,
    check_qt_max_volume,
    check_qt_slider_colours,
    check_global_key_play_pause
)
from .vscode import (
    compare_text_file,
    compare_config,
    compare_answer,
    compare_result_files,
    is_extension_installed,
    check_json_settings,
    check_json_keybindings,
    check_python_file_by_test_suite,
    check_python_file_by_gold_file,
    check_html_background_image,
    compare_zip_files
)


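# Placeholder metric for tasks marked as infeasible; there is no concrete check to run.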
def infeasible():
    pass

Note

When reusing existing functions to create new tasks, we recommend carefully reading our implementations to ensure you fully understand the characteristics and requirements of these functions.

Multiple Answer Evaluation with OR Logic

When a task has multiple acceptable answers, you can use the "conj": "or" connector. For tasks requiring all conditions to be met simultaneously, use "conj": "and".

Example: LibreOffice Writer Task with Multiple Valid Outcomes

This example demonstrates how the "conj": "or" field lets any one of the three evaluation functions suffice for the task to be considered successful. Each evaluation compares the same result file against a different expected outcome.
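
A minimal sketch of what such an evaluator can look like, assuming that func, result, and expected become parallel lists under "conj" (all paths and gold-file URLs below are placeholders):

{
  "evaluator": {
    "conj": "or",
    "func": ["compare_docx_files", "compare_docx_files", "compare_docx_files"],
    "result": [
      {"type": "vm_file", "path": "/home/user/Desktop/edited.docx", "dest": "edited.docx"},
      {"type": "vm_file", "path": "/home/user/Desktop/edited.docx", "dest": "edited.docx"},
      {"type": "vm_file", "path": "/home/user/Desktop/edited.docx", "dest": "edited.docx"}
    ],
    "expected": [
      {"type": "cloud_file", "path": "https://example.com/gold_a.docx", "dest": "gold_a.docx"},
      {"type": "cloud_file", "path": "https://example.com/gold_b.docx", "dest": "gold_b.docx"},
      {"type": "cloud_file", "path": "https://example.com/gold_c.docx", "dest": "gold_c.docx"}
    ]
  }
}

With "conj": "and" instead, all three comparisons would have to pass.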

Multi-Parameter File Handling with Selective Input

When dealing with multiple file parameters, you can use the "gives" field to control which files are actually passed to the evaluation function. This is useful when you want to download multiple files but only provide specific ones to the evaluator.

Example: Selective File Input with the “gives” Field

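A sketch of how such an evaluator can be written; the cloud-storage URLs are placeholders, while the file names and fields come from the description below:

{
  "evaluator": {
    "func": "compare_unique_train_records",
    "result": {
      "type": "vm_file",
      "path": "/home/user/HK_train_record.docx",
      "dest": "HK_train_record_result.docx"
    },
    "expected": {
      "type": "cloud_file",
      "multi": true,
      "path": [
        "https://example.com/HK_train_record_Gold.docx",
        "https://example.com/HK_train_record.docx"
      ],
      "dest": [
        "HK_train_record_Gold.docx",
        "HK_train_record.docx"
      ],
      "gives": [0, 1]
    }
  }
}
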
In this example:

  • Multiple Files: Two files are downloaded from cloud storage (HK_train_record_Gold.docx and HK_train_record.docx)

  • Selective Input: The "gives": [0, 1] field specifies that both files (indices 0 and 1) should be passed to the evaluation function

  • Multi-parameter Support: The "multi": true flag enables handling multiple file parameters

  • Controlled Evaluation: Only the specified files are provided to the compare_unique_train_records function, even though more files could be available

This pattern is particularly useful when you need reference files for evaluation but don’t want to include all downloaded files in the actual comparison.

Config Types Reference

Common configuration types used in task setup:

launch

Executes system commands to start applications or services.

Example:

{
  "type": "launch",
  "parameters": {
    "command": ["application-name", "--option1", "--option2"]
  }
}

chrome_open_tabs

Opens specific URLs in Chrome browser tabs.

Example:

{
  "type": "chrome_open_tabs",
  "parameters": {
    "urls_to_open": ["https://example.com", "https://another-site.com"]
  }
}

activate_window

Brings a specific application window to focus.

Example:

{
  "type": "activate_window",
  "parameters": {
    "window_name": "Document.docx - LibreOffice Writer",
    "strict": true
  }
}

execute

Runs arbitrary commands or scripts within the environment.

Example:

{
  "type": "execute",
  "parameters": {
    "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('ctrl', 's');"]
  }
}

sleep

Introduces delays between operations.

Example:

{
  "type": "sleep",
  "parameters": {
    "seconds": 0.5
  }
}

See Also