Vision and Roadmap
==================

Vision
------

Roadmap
-------

Here we provide a high-level roadmap for the project. We will update this roadmap as we make progress.
If you are interested in contributing to the project, please see `CONTRIBUTING.md` for more details.

Road Map for Environment Infrastructure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ✓ Explore VMware, and whether it can be connected and controlled through the mouse package
- ✓ Explore Windows and macOS, and whether they can be installed

  - macOS is closed source and cannot be legally installed
  - Windows is available legally and can be installed

- ✓ Build a gym-like Python interface for controlling the VM
- ✓ Record actions (mouse movement, clicks, keyboard input) for human annotation, with replay and compression
- ✓ Build a simple task, e.g. open a browser, open a website, click on a button, and close the browser
- ✓ Set up a pipeline and build an agent implementation (zero-shot) for the task
- ✓ Decide which tasks inside DesktopEnv to focus on, and start wrapping up the environment for public release
- ✓ Start to annotate the examples for ~~training~~ and testing
- ✓ Error handling during file passing, file opening, etc.
- ✓ Add the accessibility tree from the OS into the observation space
- ✓ Add pre-process and post-process action support for benchmarking setup and evaluation
- ✓ Experiment logging and visualization system
- ✓ Add more tasks, maybe scale to 300 for v1.0.0, and create a dynamic leaderboard
- ✓ Multiprocess support, which can make reinforcement learning more efficient
- ✓ Add support for automatic VM download and configuration, and enable auto-scaling management
- ✓ VPN setup doc for those who need it
- ✓ Support running on platforms that have nested virtualization: AWS
- ✓ Run without the VMware Pro virtual machine platform, e.g. on VirtualBox or other platforms
- ✓ Scale dataset to 20K+ tasks across diverse applications and websites (achieved via OpenCUA AgentNet with 21K tasks)
- ✓ Develop cross-platform annotation infrastructure supporting Windows, macOS, and Ubuntu (achieved via AgentNet Tool)
- ✓ Build data processing pipeline to convert raw demonstrations into clean trajectories (achieved via AgentNet Method)
- ☐ Add VNC-based video streaming as observation/actions for potential online-video understanding models to tackle tasks (achieved via AgentNet Tool screen recording)
- ☐ Support running on platforms that have nested virtualization: GCP
- ☐ Prepare for the first release of the Windows VM image for the environment

Road Map of Annotation Tool
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ✓ Improve the annotation tool based on DuckTrack/OpenAdapt, and make it more robust and aligned with the accessibility tree (achieved via AgentNet Tool)
- ✓ Annotate the steps of doing the task (achieved via 21K human-annotated tasks in AgentNet Dataset)
- ✓ Crawl all resources we explored from the internet, and make them easy to access (achieved via AgentNet Dataset covering 140+ applications and 190+ websites)
- ✓ Set up ways for the crowdsourcing/community to contribute new examples (achieved via open-source AgentNet Tool and dataset release)
- ✓ Develop reflective Chain-of-Thought reasoning augmentation for training data (achieved via OpenCUA pipeline)
- ✓ Create offline evaluation benchmark for stable and fast assessment (achieved via AgentNetBench)

Road Map of Foundation Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ✓ Train and release open-source computer-use agent models (achieved via OpenCUA-7B and OpenCUA-32B)
- ✓ Achieve competitive performance with proprietary models on OSWorld benchmark (OpenCUA-32B achieves 32.5% on OSWorld-Verified)
- ✓ Support multi-image history and mixed-domain training for robust performance
- ✓ Demonstrate effective scaling with increased training data and test-time computation
- ☐ Explore reinforcement learning and self-improvement techniques for agent training
- ☐ Develop specialized models for specific domains or applications