Vision and Roadmap
==================

Vision
------

Roadmap
-------

This section gives a high-level roadmap for the project. We will update it as we make progress.
If you are interested in contributing to the project, see ``CONTRIBUTING.md`` for details.

Road Map for Environment Infrastructure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ✓ Explore VMware, and whether it can be connected to and controlled through the mouse package
- ✓ Explore Windows and macOS, and whether they can be installed

  - macOS is closed source and cannot be legally installed
  - Windows is available legally and can be installed

- ✓ Build a gym-like Python interface for controlling the VM
- ✓ Record actions (mouse movements, clicks, keyboard input) for human annotation, with support for replaying and compressing the recordings
- ✓ Build a simple task, e.g. open a browser, open a website, click on a button, and close the browser
- ✓ Set up a pipeline and build agent implementation (zero-shot) for the task
- ✓ Decide which tasks inside DesktopEnv to focus on, and begin packaging the environment for public release
- ✓ Start annotating examples for testing (annotation for training was originally planned and later struck from this item)
- ✓ Error handling during file passing and file opening, etc.
- ✓ Add accessibility tree from the OS into the observation space
- ✓ Add pre-process and post-process action support for benchmarking setup and evaluation
- ✓ Experiment logging and visualization system
- ✓ Add more tasks, maybe scale to 300 for v1.0.0, and create a dynamic leaderboard
- ✓ Multiprocess support, which can make reinforcement learning more efficient
- ✓ Add support for automatic VM download and configuration, enable auto-scaling management
- ✓ VPN setup doc for those who need it
- ✓ Support running on platforms that have nested virtualization (AWS)
- ✓ Run without the VMware Pro virtual machine platform, e.g. on VirtualBox or other platforms
- ✓ Scale dataset to 20K+ tasks across diverse applications and websites (achieved via OpenCUA AgentNet with 21K tasks)
- ✓ Develop cross-platform annotation infrastructure supporting Windows, macOS, and Ubuntu (achieved via AgentNet Tool)
- ✓ Build data processing pipeline to convert raw demonstrations into clean trajectories (achieved via AgentNet Method)
- ☐ Add VNC-based video streaming as observation/actions for potential online-video understanding models to tackle tasks (achieved via AgentNet Tool screen recording)
- ☐ Support running on platforms that have nested virtualization (GCP)
- ☐ Prepare for the first release of the Windows VM image for the environment

Road Map of Annotation Tool
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ✓ Improve the annotation tool based on DuckTrack/OpenAdapt, making it more robust and aligned with the accessibility tree (achieved via AgentNet Tool)
- ✓ Annotate the steps of doing the task (achieved via 21K human-annotated tasks in AgentNet Dataset)
- ✓ Crawl all resources we explored from the internet, and make it easy to access (achieved via AgentNet Dataset covering 140+ applications and 190+ websites)
- ✓ Set up ways for the crowdsourcing/community to contribute new examples (achieved via open-source AgentNet Tool and dataset release)
- ✓ Develop reflective Chain-of-Thought reasoning augmentation for training data (achieved via OpenCUA pipeline)
- ✓ Create offline evaluation benchmark for stable and fast assessment (achieved via AgentNetBench)

Road Map of Foundation Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ✓ Train and release open-source computer-use agent models (achieved via OpenCUA-7B and OpenCUA-32B)
- ✓ Achieve competitive performance with proprietary models on OSWorld benchmark (OpenCUA-32B achieves 32.5% on OSWorld-Verified)
- ✓ Support multi-image history and mixed-domain training for robust performance
- ✓ Demonstrate effective scaling with increased training data and test-time computation
- ☐ Explore reinforcement learning and self-improvement techniques for agent training
- ☐ Develop specialized models for specific domains or applications