Vision and Roadmap

Vision

Roadmap

This is a high-level roadmap for the project, which we will update as we make progress. If you are interested in contributing, please see CONTRIBUTING.md for details.

Roadmap for Environment Infrastructure

  • ✓ Explore VMware and whether it can be connected to and controlled through the mouse package

  • ✓ Explore whether Windows and macOS can be installed: macOS is closed source and cannot be legally installed; Windows is legally available and can be installed

  • ✓ Build a gym-like Python interface for controlling the VM

  • ✓ Record human actions (mouse movement, clicks, keyboard input) for annotation, with support for replaying and compressing the recordings

  • ✓ Build a simple task, e.g. open a browser, open a website, click on a button, and close the browser

  • ✓ Set up a pipeline and build a zero-shot agent implementation for the task

  • ✓ Decide which tasks inside DesktopEnv to focus on, and start preparing the environment for public release

  • ✓ Start to annotate the examples for ~~training~~ and testing

  • ✓ Error handling during file passing and file opening, etc.

  • ✓ Add accessibility tree from the OS into the observation space

  • ✓ Add pre-process and post-process action support for benchmarking setup and evaluation

  • ✓ Experiment logging and visualization system

  • ✓ Add more tasks, possibly scaling to 300 for v1.0.0, and create a dynamic leaderboard

  • ✓ Add multiprocess support, which can make reinforcement learning more efficient

  • ✓ Add support for automatic VM download and configuration, and enable auto-scaling management

  • ✓ VPN setup doc for those who need it

  • ✓ Support running on platforms that have nested virtualization, e.g. AWS

  • ✓ Support running without VMware Pro, e.g. on VirtualBox or other virtualization platforms

  • ☐ Support running on platforms that have nested virtualization, e.g. GCP

  • ☐ Prepare the first release of the Windows VM image for the environment

  • ☐ Add VNC-based video streaming as an observation/action channel, so that online video-understanding models can tackle tasks
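
To make the "gym-like Python interface" and the "record, replay, and compress actions" items above more concrete, here is a minimal sketch. It is not the project's actual API: `ToyDesktopEnv`, `compress_moves`, and the action dictionary format are hypothetical names chosen for illustration, and no VM is contacted. It only shows the reset/step loop shape and one simple way to compress a recording, by keeping just the last of each run of consecutive mouse moves.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Action = Dict[str, Any]  # hypothetical format, e.g. {"type": "click", "x": 5, "y": 5}


@dataclass
class ToyDesktopEnv:
    """Hypothetical stand-in for a VM-backed environment (assumption: no real VM)."""
    max_steps: int = 5
    steps: int = 0
    trajectory: List[Action] = field(default_factory=list)

    def reset(self) -> Dict[str, Any]:
        # Start a new episode and clear the recorded actions.
        self.steps = 0
        self.trajectory = []
        return self._observe()

    def step(self, action: Action) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        # Record the action; a real environment would forward it to the VM here.
        self.steps += 1
        self.trajectory.append(action)
        done = action.get("type") == "done" or self.steps >= self.max_steps
        reward = 0.0  # placeholder: a real setup would evaluate the resulting VM state
        return self._observe(), reward, done, {}

    def _observe(self) -> Dict[str, Any]:
        # Placeholders where a screenshot and an accessibility tree would go.
        return {"screenshot": None, "accessibility_tree": None, "step": self.steps}


def compress_moves(actions: List[Action]) -> List[Action]:
    """Drop every mouse move that is immediately followed by another move."""
    kept: List[Action] = []
    for i, action in enumerate(actions):
        nxt: Optional[Action] = actions[i + 1] if i + 1 < len(actions) else None
        if action["type"] == "move" and nxt is not None and nxt["type"] == "move":
            continue
        kept.append(action)
    return kept


if __name__ == "__main__":
    env = ToyDesktopEnv()
    env.reset()
    demo = [
        {"type": "move", "x": 1, "y": 1},
        {"type": "move", "x": 5, "y": 5},
        {"type": "click", "x": 5, "y": 5},
        {"type": "done"},
    ]
    for act in demo:
        obs, reward, done, info = env.step(act)
    print(compress_moves(env.trajectory))
```

Replaying a recording in this sketch would amount to calling `reset()` and feeding the stored actions back through `step()` in order.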

Roadmap for the Annotation Tool

  • ☐ Improve the annotation tool based on DuckTrack/OpenAdapt, and make it more robust and aligned with the accessibility tree

  • ☐ Annotate the steps taken to complete each task

  • ☐ Crawl all the resources we explored on the internet, and make them easy to access

  • ☐ Set up ways for crowdsourced and community contributors to submit new examples