OSWorld Task Examples Explanation
=================================

This document explains the structure and meaning of the task example JSON
files in OSWorld, which is essential for understanding and extending the
benchmark.

Overview
--------

OSWorld tasks benchmark agents' abilities to interact with GUI
environments. Each task is defined as a JSON file containing all the
information needed to set up, execute, and evaluate it. Task examples are
stored in the ``./examples`` directory, where each data item follows a
standardized JSON format that defines the task setup, execution context,
and evaluation criteria.

JSON Schema Structure
---------------------

Each task example JSON file contains the following fields:

Core Fields
~~~~~~~~~~~

**id** (string, required)
    A unique identifier for the task. This is typically a UUID that
    distinguishes this task from all others in the dataset.

**instruction** (string, required)
    The natural language instruction describing what the agent should
    accomplish. This is the primary task description that guides the
    agent's behavior.

**source** (string, optional)
    The origin or reference for this task example. This could be:

    - A website URL that inspired the task
    - A forum discussion
    - A research paper
    - Manual creation by the dataset authors

**related_apps** (array of strings, required)
    List of applications involved in completing this task. These apps may
    be:

    - Pre-opened as part of the initial setup
    - Required to be opened during task execution
    - Used for evaluation purposes

Setup and Configuration
~~~~~~~~~~~~~~~~~~~~~~~

**config** (array of objects, required)
    Scripts and commands that set up the initial state of the task
    environment.
    This includes:

    - Launching applications
    - Opening specific files or URLs
    - Setting up network proxies or debugging connections
    - Configuring system states

    Each config item has a ``type`` field that determines the setup
    action and a ``parameters`` field that specifies the action details.

**snapshot** (string, deprecated)
    Previously used to specify a pre-configured environment snapshot.
    This field is deprecated in favor of the more flexible ``config``
    system.

Evaluation
~~~~~~~~~~

**evaluator** (object, required)
    Defines how task completion should be evaluated. Contains:

    - ``func``: The evaluation function to use
    - ``result``: Expected output location and format
    - ``expected``: Ground truth or reference for comparison

**trajectory** (string, deprecated)
    Previously pointed to annotated human demonstration trajectories.
    This is being phased out in favor of more robust evaluation methods.

Environment Settings
~~~~~~~~~~~~~~~~~~~~

**proxy** (boolean, optional)
    Indicates whether the task requires network proxy configuration. When
    ``true``:

    - Network traffic may need to be routed through specific proxies
    - Useful for tasks requiring specific network conditions
    - May affect how web-based applications behave

    .. note::
       If you turn on the proxy, always make sure you:

       - keep enough balance in the proxy service; otherwise the internet
         connection will fail
       - provide the correct client machine password to DesktopEnv;
         otherwise the proxy settings will fail

**fixed_ip** (boolean, optional)
    Specifies whether the task requires a fixed IP address configuration:

    - ``true``: Task needs consistent IP addressing
    - ``false``: Task can work with dynamic IP assignment
    - Important for tasks involving network-dependent applications or
      websites that detect IP changes as bot protection

**possibility_of_env_change** (string, optional)
    Indicates how likely the environment state is to change during task
    execution:

    - ``"low"``: The environment is static and almost never changes, e.g.
      tasks on local slides
    - ``"medium"``: Some environmental changes may occur but are
      unlikely, e.g. a widely adopted blog, or an old website that is no
      longer updated
    - ``"high"``: Significant environmental changes are expected, e.g. a
      website whose elements, layout, or content are updated frequently

    This helps in planning task execution strategies and evaluation
    approaches.

.. note::
   You can add additional fields to the JSON as notes or comments, as
   long as you don't break the original structure.

Detailed Example Analysis
-------------------------

Chrome PDF Download Task
~~~~~~~~~~~~~~~~~~~~~~~~

Let's examine a comprehensive example that demonstrates how to convert a
webpage to PDF using Chrome:

.. code-block:: json

   {
     "id": "e1e75309-3ddb-4d09-92ec-de869c928143",
     "instruction": "Computer, can you turn the webpage I'm looking at into a PDF file, save it to my Desktop with the default filename and set the margins to none?",
     "source": "https://in5stepstutorials.com/google-chrome/save-web-page-as-pdf-in-chrome.php",
     "config": [
       {
         "type": "launch",
         "parameters": {
           "command": [
             "google-chrome",
             "--remote-debugging-port=1337"
           ]
         }
       },
       {
         "type": "launch",
         "parameters": {
           "command": [
             "socat",
             "tcp-listen:9222,fork",
             "tcp:localhost:1337"
           ]
         }
       },
       {
         "type": "chrome_open_tabs",
         "parameters": {
           "urls_to_open": [
             "https://lilianweng.github.io/posts/2023-06-23-agent/"
           ]
         }
       }
     ],
     "related_apps": ["chrome"],
     "evaluator": {
       "func": "compare_pdfs",
       "result": {
         "type": "vm_file",
         "path": "/home/user/Desktop/LLM Powered Autonomous Agents _ Lil'Log.pdf",
         "dest": "LLM Powered Autonomous Agents _ Lil'Log.pdf"
       },
       "expected": {
         "type": "pdf_from_url",
         "path": "https://lilianweng.github.io/posts/2023-06-23-agent/",
         "dest": "LLM Powered Autonomous Agents _ Lil'Log_gold.pdf"
       }
     },
     "proxy": true,
     "fixed_ip": false,
     "possibility_of_env_change": "medium"
   }

Field-by-Field Analysis
~~~~~~~~~~~~~~~~~~~~~~~

**Task Identity and Purpose**

- ``id``: Unique identifier for this specific Chrome PDF
  task
- ``instruction``: Clear natural language description of the desired
  action
- ``source``: Reference to the tutorial that inspired this task

**Initial Setup Configuration**

The ``config`` array sets up the testing environment:

1. **Chrome Launch**: Starts Google Chrome with remote debugging enabled
   on port 1337
2. **Port Forwarding**: Uses ``socat`` to forward port 9222 to the
   Chrome debugging port 1337
3. **Page Loading**: Opens the target webpage that needs to be converted
   to PDF

**Application Context**

- ``related_apps``: Only Chrome is needed for this task

**Evaluation Strategy**

- ``func``: Uses ``compare_pdfs`` to verify that the generated PDF
  matches expectations
- ``result``: Specifies where the agent should save the PDF file
- ``expected``: Defines the ground truth PDF generated from the same URL

**Environment Configuration**

- ``proxy: true``: The task requires a proxy setup since the website may
  ban your IP
- ``fixed_ip: false``: Dynamic IP addressing is acceptable since the
  website is not sensitive to IP changes
- ``possibility_of_env_change: "medium"``: Some changes may occur since
  Lilian Weng could modify her blog, though this is unlikely

Advanced Configuration and Evaluation
-------------------------------------

Understanding Config and Evaluator Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::
   For detailed information about available setup configurations,
   post-configurations, and evaluation functions, we recommend reading
   the source code to understand the function signatures and how they
   are used by DesktopEnv:

   - **Setup configurations**: ``desktop_env/controllers/setup.py``
   - **Evaluation functions**: ``desktop_env/evaluators``
   - **Desktop environment**: ``desktop_env/desktop_env.py``

The evaluation system consists of two main components:

- **Getters** (``desktop_env/evaluators/getters``): Functions that
  retrieve information from various sources (VM files, VM states, cloud
  files, webpage information, etc.)
  to gather data for task completion assessment.
- **Metrics** (``desktop_env/evaluators/metrics``): Functions that
  process the retrieved data to determine whether a task has been
  completed successfully.

Available Getter Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from .chrome import (
       get_default_search_engine, get_cookie_data, get_bookmarks,
       get_open_tabs_info, get_pdf_from_url, get_shortcuts_on_desktop,
       get_history, get_page_info, get_enabled_experiments,
       get_chrome_language, get_chrome_font_size, get_profile_name,
       get_number_of_search_results, get_googledrive_file,
       get_active_tab_info, get_enable_do_not_track,
       get_enable_enhanced_safety_browsing, get_new_startup_page,
       get_find_unpacked_extension_path, get_data_delete_automacally,
       get_active_tab_html_parse, get_active_tab_url_parse,
       get_gotoRecreationPage_and_get_html_content, get_url_dashPart,
       get_active_url_from_accessTree, get_find_installed_extension_name,
       get_info_from_website, get_macys_product_url_parse,
       get_url_path_parse  # Alias for backward compatibility
   )
   from .file import get_cloud_file, get_vm_file, get_cache_file, get_content_from_vm_file
   from .general import get_vm_command_line, get_vm_terminal_output, get_vm_command_error
   from .gimp import get_gimp_config_file
   from .impress import get_audio_in_slide, get_background_image_in_slide
   from .info import get_vm_screen_size, get_vm_window_size, get_vm_wallpaper, get_list_directory
   from .misc import get_rule, get_accessibility_tree, get_rule_relativeTime, get_time_diff_range
   from .replay import get_replay
   from .vlc import get_vlc_playing_info, get_vlc_config, get_default_video_player
   from .vscode import get_vscode_config
   from .calc import get_conference_city_in_order

Available Metric Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

   from .basic_os import (
       check_gnome_favorite_apps, is_utc_0, check_text_enlarged,
       check_moved_jpgs, is_in_vm_clickboard
   )
   from .chrome import (
       is_expected_tabs, is_expected_bookmarks, compare_pdfs,
       compare_htmls, compare_archive, is_cookie_deleted,
       is_shortcut_on_desktop, check_font_size,
       check_enabled_experiments, check_history_deleted,
       is_expected_search_query, is_expected_active_tab,
       is_expected_url_pattern_match, is_added_to_steam_cart,
       is_expected_installed_extensions, compare_pdf_images,
       is_expected_active_tab_approximate
   )
   from .docs import (
       compare_font_names, compare_subscript_contains,
       has_page_numbers_in_footers, compare_docx_lines,
       evaluate_colored_words_in_tables, check_highlighted_words,
       evaluate_strike_through_last_paragraph, evaluate_conversion,
       evaluate_spacing, check_italic_font_size_14, evaluate_alignment,
       get_unique_train_ids, check_no_duplicates, compare_init_lines,
       find_default_font, contains_page_break, compare_docx_files,
       compare_docx_tables, compare_line_spacing, compare_insert_equation,
       compare_highlighted_text, is_first_line_centered,
       check_file_exists, check_tabstops, compare_contains_image,
       compare_docx_files_and_ignore_new_lines, compare_docx_images,
       compare_image_text, compare_references, compare_unique_train_records
   )
   from .general import (
       check_csv, check_accessibility_tree, run_sqlite3, check_json,
       check_list, exact_match, match_in_list, is_in_list, fuzzy_match,
       check_include_exclude, check_direct_json_object,
       compare_time_in_speedtest_results, is_included_all_json_objects,
       is_gold_text_included_in_pdf, check_line_number, file_contains,
       compare_terminal_and_txt, fuzzy_place_math,
       compare_python_pure_text, diff_text_file, literal_match
   )
   from .gimp import (
       check_structure_sim_resized,
       check_brightness_decrease_and_structure_sim,
       check_contrast_increase_and_structure_sim,
       check_saturation_increase_and_structure_sim, check_image_size,
       check_image_mirror, check_palette_and_structure_sim,
       check_textbox_on_leftside, check_green_background,
       check_file_exists_and_structure_sim, check_triangle_position,
       check_structure_sim, check_config_status, compare_image_list,
       increase_saturation, decrease_brightness, check_file_exists,
       compare_triangle_positions, check_sharper, check_image_file_size
   )
   from .libreoffice import check_libre_locale
   from .others import compare_epub, check_mp3_meta
   from .pdf import check_pdf_pages
   from .slides import (
       check_presenter_console_disable, check_image_stretch_and_center,
       check_slide_numbers_color, compare_pptx_files, check_strikethrough,
       check_slide_orientation_Portrait,
       evaluate_presentation_fill_to_rgb_distance, check_left_panel,
       check_transition, check_page_number_colors, check_auto_saving_time
   )
   from .table import (
       compare_table, compare_csv, compare_conference_city_in_order
   )
   from .thunderbird import (
       check_thunderbird_prefs, check_thunderbird_filter,
       check_thunderbird_folder
   )
   from .vlc import (
       is_vlc_playing, is_vlc_recordings_folder, is_vlc_fullscreen,
       compare_images, compare_audios, compare_videos, check_qt_bgcone,
       check_one_instance_when_started_from_file, check_qt_minimal_view,
       check_qt_max_volume, check_qt_slider_colours,
       check_global_key_play_pause
   )
   from .vscode import (
       compare_text_file, compare_config, compare_answer,
       compare_result_files, is_extension_installed, check_json_settings,
       check_json_keybindings, check_python_file_by_test_suite,
       check_python_file_by_gold_file, check_html_background_image,
       compare_zip_files
   )

   def infeasible():
       pass

.. note::
   When reusing existing functions to create new tasks, we recommend
   carefully reading our implementations to ensure you fully understand
   the characteristics and requirements of these functions.

Multiple Answer Evaluation with OR Logic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a task has multiple acceptable answers, you can use the
``"conj": "or"`` connector. For tasks requiring all conditions to be met
simultaneously, use ``"conj": "and"``.
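Conceptually, the connector works like a logical reduction over the
per-function scores. The sketch below is illustrative only (the
``combine_scores`` helper is hypothetical, not DesktopEnv's actual
implementation), but it captures the semantics of the two connectors:

.. code-block:: python

   # Illustrative sketch only: how a "conj" connector could combine the
   # scores returned by the evaluation functions listed under "func".
   # combine_scores is a hypothetical helper, not part of DesktopEnv.

   def combine_scores(scores, conj="and"):
       """Combine per-function scores (1.0 = pass, 0.0 = fail)."""
       if conj == "or":
           # Any single passing comparison is enough.
           return float(max(scores))
       # "and": every comparison must pass.
       return float(min(scores))

   # With "conj": "or", matching any one of three gold files suffices:
   print(combine_scores([0.0, 1.0, 0.0], conj="or"))   # 1.0
   print(combine_scores([0.0, 1.0, 0.0], conj="and"))  # 0.0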
**Example: LibreOffice Writer Task with Multiple Valid Outcomes**

.. code-block:: json

   {
     ...
     "evaluator": {
       "postconfig": [
         {
           "type": "activate_window",
           "parameters": {
             "window_name": "CCCH9003_Tutorial_guidelines.docx - LibreOffice Writer",
             "strict": true
           }
         },
         {
           "type": "sleep",
           "parameters": { "seconds": 0.5 }
         },
         {
           "type": "execute",
           "parameters": {
             "command": [
               "python",
               "-c",
               "import pyautogui; import time; pyautogui.hotkey('ctrl', 's'); time.sleep(0.5); "
             ]
           }
         }
       ],
       "func": ["compare_docx_files", "compare_docx_files", "compare_docx_files"],
       "conj": "or",
       "expected": [
         {
           "type": "cloud_file",
           "path": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/libreoffice_writer/88fe4b2d-3040-4c70-9a70-546a47764b48/CCCH9003_Tutorial_guidelines_Gold_1.docx",
           "dest": "CCCH9003_Tutorial_guidelines_Gold_1.docx"
         },
         {
           "type": "cloud_file",
           "path": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/libreoffice_writer/88fe4b2d-3040-4c70-9a70-546a47764b48/CCCH9003_Tutorial_guidelines_Gold_2.docx",
           "dest": "CCCH9003_Tutorial_guidelines_Gold_2.docx"
         },
         {
           "type": "cloud_file",
           "path": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/libreoffice_writer/88fe4b2d-3040-4c70-9a70-546a47764b48/CCCH9003_Tutorial_guidelines_Gold_3.docx",
           "dest": "CCCH9003_Tutorial_guidelines_Gold_3.docx"
         }
       ],
       "result": [
         {
           "type": "vm_file",
           "path": "/home/user/Desktop/CCCH9003_Tutorial_guidelines.docx",
           "dest": "CCCH9003_Tutorial_guidelines.docx"
         },
         {
           "type": "vm_file",
           "path": "/home/user/Desktop/CCCH9003_Tutorial_guidelines.docx",
           "dest": "CCCH9003_Tutorial_guidelines.docx"
         },
         {
           "type": "vm_file",
           "path": "/home/user/Desktop/CCCH9003_Tutorial_guidelines.docx",
           "dest": "CCCH9003_Tutorial_guidelines.docx"
         }
       ],
       "options": [
         { "ignore_blanks": false },
         { "ignore_blanks": false },
         { "ignore_blanks": false }
       ]
     },
     ...
   }

This example demonstrates how the ``"conj": "or"`` field allows any one
of the three evaluation functions to pass for the task to be considered
successful. Each evaluation compares the same result file against a
different expected outcome.

Multi-Parameter File Handling with Selective Input
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When dealing with multiple file parameters, you can use the ``"gives"``
field to control which files are actually passed to the evaluation
function. This is useful when you want to download multiple files but
only provide specific ones to the evaluator.

**Example: Selective File Input with the "gives" Field**

.. code-block:: json

   {
     ...
     "evaluator": {
       "postconfig": [
         {
           "type": "activate_window",
           "parameters": {
             "window_name": "HK_train_record.docx - LibreOffice Writer",
             "strict": true
           }
         },
         {
           "type": "sleep",
           "parameters": { "seconds": 0.5 }
         },
         {
           "type": "execute",
           "parameters": {
             "command": [
               "python",
               "-c",
               "import pyautogui; import time; pyautogui.hotkey('ctrl', 's'); time.sleep(0.5); "
             ]
           }
         }
       ],
       "func": "compare_unique_train_records",
       "result": {
         "type": "vm_file",
         "path": "/home/user/Desktop/HK_train_record.docx",
         "dest": "HK_train_record.docx"
       },
       "expected": {
         "type": "cloud_file",
         "path": [
           "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/libreoffice_writer/6f81754e-285d-4ce0-b59e-af7edb02d108/HK_train_record_Gold.docx",
           "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/libreoffice_writer/6f81754e-285d-4ce0-b59e-af7edb02d108/HK_train_record.docx"
         ],
         "dest": [
           "HK_train_record_Gold.docx",
           "HK_train_record_Original.docx"
         ],
         "multi": true,
         "gives": [0, 1]
       }
     },
     ...
   }

In this example:

- **Multiple Files**: Two files are downloaded from cloud storage
  (``HK_train_record_Gold.docx`` and ``HK_train_record.docx``)
- **Selective Input**: The ``"gives": [0, 1]`` field specifies that both
  files (indices 0 and 1) should be passed to the evaluation function
- **Multi-parameter Support**: The ``"multi": true`` flag enables
  handling multiple file parameters
- **Controlled Evaluation**: Only the specified files are provided to
  the ``compare_unique_train_records`` function, even though more files
  could be available

This pattern is particularly useful when you need reference files for
evaluation but don't want to include all downloaded files in the actual
comparison.

Config Types Reference
----------------------

Common configuration types used in task setup:

**launch**
    Executes system commands to start applications or services. Example:

    .. code-block:: json

       {
         "type": "launch",
         "parameters": {
           "command": ["application-name", "--option1", "--option2"]
         }
       }

**chrome_open_tabs**
    Opens specific URLs in Chrome browser tabs. Example:

    .. code-block:: json

       {
         "type": "chrome_open_tabs",
         "parameters": {
           "urls_to_open": ["https://example.com", "https://another-site.com"]
         }
       }

**activate_window**
    Brings a specific application window into focus. Example:

    .. code-block:: json

       {
         "type": "activate_window",
         "parameters": {
           "window_name": "Document.docx - LibreOffice Writer",
           "strict": true
         }
       }

**execute**
    Runs arbitrary commands or scripts within the environment. Example:

    .. code-block:: json

       {
         "type": "execute",
         "parameters": {
           "command": ["python", "-c", "import pyautogui; pyautogui.hotkey('ctrl', 's');"]
         }
       }

**sleep**
    Introduces delays between operations. Example:
    .. code-block:: json

       {
         "type": "sleep",
         "parameters": { "seconds": 0.5 }
       }

See Also
--------

* :doc:`environment_explanation` - Detailed explanation of the OSWorld
  environment architecture and components
* :doc:`add_new_agent` - Instructions for implementing and integrating
  custom agents into the OSWorld framework
* :doc:`run_public_evaluation` - Requirements and process for verified
  leaderboard submission