OSWorld Task Examples Explanation

This document explains the structure and meaning of task example JSON files in OSWorld, which is crucial for understanding and extending the benchmark.

Overview

OSWorld tasks are designed to benchmark agents’ abilities when interacting with GUI environments. Each task is defined as a JSON file containing all necessary information to set up, execute, and evaluate the task.

Task examples are stored in the ./examples directory, where each data item follows a standardized JSON format that defines the task setup, execution context, and evaluation criteria.

JSON Schema Structure

Each task example JSON file contains the following fields:

Core Fields

id (string, required)

A unique identifier for the task. This is typically a UUID that distinguishes this task from all others in the dataset.

instruction (string, required)

The natural language instruction describing what the agent should accomplish. This is the primary task description that guides the agent’s behavior.

source (string, optional)

The origin or reference for this task example. This could be:

  • A website URL that inspired the task

  • A forum discussion

  • A research paper

  • Manual creation by the dataset authors

related_apps (array of strings, required)

List of applications that are involved in completing this task. These apps may be:

  • Pre-opened as part of the initial setup

  • Required to be opened during task execution

  • Used for evaluation purposes

Setup and Configuration

config (array of objects, required)

Scripts and commands to set up the initial state of the task environment. This includes:

  • Launching applications

  • Opening specific files or URLs

  • Setting up network proxies or debugging connections

  • Configuring system states

Each config item has a type field that determines the setup action, and parameters that specify the action details.
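
For example, a minimal config item that launches an application could look like this (the gedit command is just an illustration):

{
  "type": "launch",
  "parameters": {
    "command": ["gedit"]
  }
}

The commonly used types are covered in the Config Types Reference below.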

snapshot (string, deprecated)

Previously used to specify a pre-configured environment snapshot. This field is deprecated in favor of the more flexible config system.

Evaluation

evaluator (object, required)

Defines how task completion should be evaluated (a minimal sketch follows this list). Contains:

  • func: The evaluation function to use

  • result: Expected output location and format

  • expected: Ground truth or reference for comparison
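
A minimal sketch of an evaluator; the paths and the gold-file URL are placeholders, and compare_docx_files is one of the metric functions listed later on this page:

{
  "evaluator": {
    "func": "compare_docx_files",
    "result": {
      "type": "vm_file",
      "path": "/home/user/Desktop/report.docx",
      "dest": "report.docx"
    },
    "expected": {
      "type": "cloud_file",
      "path": "https://example.com/report_gold.docx",
      "dest": "report_gold.docx"
    }
  }
}

Here the vm_file getter retrieves the agent's output from the VM, the cloud_file getter downloads a reference file, and both are passed to compare_docx_files.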

trajectory (string, deprecated)

Previously pointed to annotated human demonstration trajectories. This is being phased out in favor of more robust evaluation methods.

Environment Settings

proxy (boolean, optional)

Indicates whether the task requires network proxy configuration. When true:

  • Network traffic may need to be routed through specific proxies

  • Useful for tasks requiring specific network conditions

  • May affect how web-based applications behave

Note

If you enable the proxy, always make sure you:

  • maintain sufficient balance with the proxy service; otherwise, internet connections will fail

  • provide the correct client machine password to DesktopEnv; otherwise, the proxy settings will fail to apply

fixed_ip (boolean, optional)

Specifies whether the task requires a fixed IP address configuration:

  • true: Task needs consistent IP addressing

  • false: Task can work with dynamic IP assignment

  • Important for tasks involving network-dependent applications or websites that detect IP changes as bot protection

possibility_of_env_change (string, optional)

Indicates the likelihood that the environment state may change during task execution:

  • "low": Environment is static and almost impossible to change, e.g. a tasks on local slides

  • "medium": Some environmental changes may occur but unlikely, e.g. a widely adapted blog, some old website that is not updated

  • "high": Significant environmental changes are found and expected, e.g. a website that is updated frequently about the elements, layout, or content

This helps in planning task execution strategies and evaluation approaches.

Note

You can add additional fields to the JSON as notes or comments, as long as you don’t break the original structure.

Detailed Example Analysis

Chrome PDF Download Task

Let’s examine a comprehensive example that demonstrates how to convert a webpage to PDF using Chrome:

{
  "id": "e1e75309-3ddb-4d09-92ec-de869c928143",
  "instruction": "Computer, can you turn the webpage I'm looking at into a PDF file, save it to my Desktop with the default filename and set the margins to none?",
  "source": "https://in5stepstutorials.com/google-chrome/save-web-page-as-pdf-in-chrome.php",
  "config": [
    {
      "type": "launch",
      "parameters": {
        "command": [
          "google-chrome",
          "--remote-debugging-port=1337"
        ]
      }
    },
    {
      "type": "launch",
      "parameters": {
        "command": [
          "socat",
          "tcp-listen:9222,fork",
          "tcp:localhost:1337"
        ]
      }
    },
    {
      "type": "chrome_open_tabs",
      "parameters": {
        "urls_to_open": [
          "https://lilianweng.github.io/posts/2023-06-23-agent/"
        ]
      }
    }
  ],
  "related_apps": ["chrome"],
  "evaluator": {
    "func": "compare_pdfs",
    "result": {
      "type": "vm_file",
      "path": "/home/user/Desktop/LLM Powered Autonomous Agents _ Lil'Log.pdf",
      "dest": "LLM Powered Autonomous Agents _ Lil'Log.pdf"
    },
    "expected": {
      "type": "pdf_from_url",
      "path": "https://lilianweng.github.io/posts/2023-06-23-agent/",
      "dest": "LLM Powered Autonomous Agents _ Lil'Log_gold.pdf"
    }
  },
  "proxy": true,
  "fixed_ip": false,
  "possibility_of_env_change": "medium"
}

Field-by-Field Analysis

Task Identity and Purpose
  • id: Unique identifier for this specific Chrome PDF task

  • instruction: Clear natural language description of the desired action

  • source: Reference to the tutorial that inspired this task

Initial Setup Configuration

The config array sets up the testing environment:

  1. Chrome Launch: Starts Google Chrome with remote debugging enabled on port 1337

  2. Port Forwarding: Uses socat to forward connections on port 9222 to Chrome's debugging port 1337

  3. Page Loading: Opens the target webpage that needs to be converted to PDF

Application Context
  • related_apps: Only Chrome is needed for this task

Evaluation Strategy
  • func: Uses compare_pdfs to verify the generated PDF matches expectations

  • result: Specifies where the agent should save the PDF file

  • expected: Defines the ground truth PDF generated from the same URL

Environment Configuration
  • proxy: true: Task requires proxy setup since the website may ban your IP

  • fixed_ip: false: Dynamic IP addressing is acceptable since the website is not sensitive to IP changes

  • possibility_of_env_change: "medium": Some changes may occur since Lilian Weng could modify her blog, though this is unlikely

Advanced Configuration and Evaluation

Understanding Config and Evaluator Functions

Note

For detailed information about available setup configurations, post-configurations, and evaluation functions, we recommend reading the source code to understand the function signatures and how they are used by DesktopEnv.

The evaluation system consists of two main components:

  • Getters (desktop_env/evaluators/getters): Functions that retrieve information from various sources (VM files, VM states, cloud files, webpage information, etc.) to gather data for task completion assessment.

  • Metrics (desktop_env/evaluators/metrics): Functions that process the retrieved data to determine whether a task has been successfully completed.

Available Getter Functions

Available Metric Functions

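# Metric functions exported by the desktop_env/evaluators/metrics package: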
from .basic_os import (
    check_gnome_favorite_apps,
    is_utc_0,
    check_text_enlarged,
    check_moved_jpgs,
    is_in_vm_clickboard
)
from .chrome import (
    is_expected_tabs,
    is_expected_bookmarks,
    compare_pdfs,
    compare_htmls,
    compare_archive,
    is_cookie_deleted,
    is_shortcut_on_desktop,
    check_font_size,
    check_enabled_experiments,
    check_history_deleted,
    is_expected_search_query,
    is_expected_active_tab,
    is_expected_url_pattern_match,
    is_added_to_steam_cart,
    is_expected_installed_extensions,
    compare_pdf_images,
    is_expected_active_tab_approximate
)
from .docs import (
    compare_font_names,
    compare_subscript_contains,
    has_page_numbers_in_footers,
    compare_docx_lines,
    evaluate_colored_words_in_tables,
    check_highlighted_words,
    evaluate_strike_through_last_paragraph,
    evaluate_conversion,
    evaluate_spacing,
    check_italic_font_size_14,
    evaluate_alignment,
    get_unique_train_ids,
    check_no_duplicates,
    compare_init_lines,
    find_default_font,
    contains_page_break,
    compare_docx_files,
    compare_docx_tables,
    compare_line_spacing,
    compare_insert_equation,
    compare_highlighted_text,
    is_first_line_centered,
    check_file_exists,
    check_tabstops,
    compare_contains_image,
    compare_docx_files_and_ignore_new_lines,
    compare_docx_images,
    compare_image_text,
    compare_references,
    compare_unique_train_records
)
from .general import (
    check_csv,
    check_accessibility_tree,
    run_sqlite3,
    check_json,
    check_list,
    exact_match,
    match_in_list,
    is_in_list,
    fuzzy_match,
    check_include_exclude,
    check_direct_json_object,
    compare_time_in_speedtest_results,
    is_included_all_json_objects,
    is_gold_text_included_in_pdf,
    check_line_number,
    file_contains,
    compare_terminal_and_txt,
    fuzzy_place_math,
    compare_python_pure_text,
    diff_text_file,
    literal_match
)
from .gimp import (
    check_structure_sim_resized,
    check_brightness_decrease_and_structure_sim,
    check_contrast_increase_and_structure_sim,
    check_saturation_increase_and_structure_sim,
    check_image_size,
    check_image_mirror,
    check_palette_and_structure_sim,
    check_textbox_on_leftside,
    check_green_background,
    check_file_exists_and_structure_sim,
    check_triangle_position,
    check_structure_sim,
    check_config_status,
    compare_image_list,
    increase_saturation,
    decrease_brightness,
    check_file_exists,
    compare_triangle_positions,
    check_sharper,
    check_image_file_size
)
from .libreoffice import check_libre_locale
from .others import compare_epub, check_mp3_meta
from .pdf import check_pdf_pages
from .slides import (
    check_presenter_console_disable,
    check_image_stretch_and_center,
    check_slide_numbers_color,
    compare_pptx_files,
    check_strikethrough,
    check_slide_orientation_Portrait,
    evaluate_presentation_fill_to_rgb_distance,
    check_left_panel,
    check_transition,
    check_page_number_colors,
    check_auto_saving_time
)
from .table import (
    compare_table,
    compare_csv,
    compare_conference_city_in_order
)
from .thunderbird import (
    check_thunderbird_prefs,
    check_thunderbird_filter,
    check_thunderbird_folder
)
from .vlc import (
    is_vlc_playing,
    is_vlc_recordings_folder,
    is_vlc_fullscreen,
    compare_images,
    compare_audios,
    compare_videos,
    check_qt_bgcone,
    check_one_instance_when_started_from_file,
    check_qt_minimal_view,
    check_qt_max_volume,
    check_qt_slider_colours,
    check_global_key_play_pause
)
from .vscode import (
    compare_text_file,
    compare_config,
    compare_answer,
    compare_result_files,
    is_extension_installed,
    check_json_settings,
    check_json_keybindings,
    check_python_file_by_test_suite,
    check_python_file_by_gold_file,
    check_html_background_image,
    compare_zip_files
)


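# Placeholder metric for tasks marked as infeasible; there is no concrete check to run.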
def infeasible():
    pass

Note

When reusing existing functions to create new tasks, we recommend carefully reading our implementations to ensure you fully understand the characteristics and requirements of these functions.

Multiple Answer Evaluation with OR Logic

When a task has multiple acceptable answers, you can use the "conj": "or" connector. For tasks requiring all conditions to be met simultaneously, use "conj": "and".

Example: LibreOffice Writer Task with Multiple Valid Outcomes

This example demonstrates how the "conj": "or" field lets any one of the three evaluation functions suffice for the task to be considered successful. Each evaluation compares the same result file against a different expected outcome.
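
A minimal sketch of what such an evaluator can look like, assuming that func, result, and expected become parallel lists under "conj" (all paths and gold-file URLs below are placeholders):

{
  "evaluator": {
    "conj": "or",
    "func": ["compare_docx_files", "compare_docx_files", "compare_docx_files"],
    "result": [
      {"type": "vm_file", "path": "/home/user/Desktop/edited.docx", "dest": "edited.docx"},
      {"type": "vm_file", "path": "/home/user/Desktop/edited.docx", "dest": "edited.docx"},
      {"type": "vm_file", "path": "/home/user/Desktop/edited.docx", "dest": "edited.docx"}
    ],
    "expected": [
      {"type": "cloud_file", "path": "https://example.com/gold_a.docx", "dest": "gold_a.docx"},
      {"type": "cloud_file", "path": "https://example.com/gold_b.docx", "dest": "gold_b.docx"},
      {"type": "cloud_file", "path": "https://example.com/gold_c.docx", "dest": "gold_c.docx"}
    ]
  }
}

With "conj": "and" instead, all three comparisons would have to pass.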

Multi-Parameter File Handling with Selective Input

When dealing with multiple file parameters, you can use the "gives" field to control which files are actually passed to the evaluation function. This is useful when you want to download multiple files but only provide specific ones to the evaluator.

Example: Selective File Input with the “gives” Field

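A sketch of how such an evaluator can be written; the cloud-storage URLs are placeholders, while the file names and fields come from the description below:

{
  "evaluator": {
    "func": "compare_unique_train_records",
    "result": {
      "type": "vm_file",
      "path": "/home/user/HK_train_record.docx",
      "dest": "HK_train_record_result.docx"
    },
    "expected": {
      "type": "cloud_file",
      "multi": true,
      "path": [
        "https://example.com/HK_train_record_Gold.docx",
        "https://example.com/HK_train_record.docx"
      ],
      "dest": [
        "HK_train_record_Gold.docx",
        "HK_train_record.docx"
      ],
      "gives": [0, 1]
    }
  }
}
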
In this example:

  • Multiple Files: Two files are downloaded from cloud storage (HK_train_record_Gold.docx and HK_train_record.docx)

  • Selective Input: The "gives": [0, 1] field specifies that both files (indices 0 and 1) should be passed to the evaluation function

  • Multi-parameter Support: The "multi": true flag enables handling multiple file parameters

  • Controlled Evaluation: Only the specified files are provided to the compare_unique_train_records function, even though more files could be available

This pattern is particularly useful when you need reference files for evaluation but don’t want to include all downloaded files in the actual comparison.

Config Types Reference

Common configuration types used in task setup:

launch

Executes system commands to start applications or services.

Example:

{
  "type": "launch",
  "parameters": {
    "command": ["application-name", "--option1", "--option2"]
  }
}

chrome_open_tabs

Opens specific URLs in Chrome browser tabs.

Example:

{
  "type": "chrome_open_tabs",
  "parameters": {
    "urls_to_open": ["https://example.com", "https://another-site.com"]
  }
}

activate_window

Brings a specific application window to focus.

Example:

{
  "type": "activate_window",
  "parameters": {
    "window_name": "Document.docx - LibreOffice Writer",
    "strict": true
  }
}

execute

Runs arbitrary commands or scripts within the environment.

Example:

{
  "type": "execute",
  "parameters": {
    "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('ctrl', 's');"]
  }
}

sleep

Introduces delays between operations.

Example:

{
  "type": "sleep",
  "parameters": {
    "seconds": 0.5
  }
}

See Also