Computer

  1. ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation Arxiv

    Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou pdf, 2023.12

  2. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents Arxiv

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu pdf, 2024.1

  3. OS-Copilot: Towards Generalist Computer Agents with Self-Improvement Arxiv

    Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong pdf, 2024.2

  4. ScreenAgent: A Vision Language Model-driven Computer Control Agent Arxiv

    Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang pdf, 2024.2

  5. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web Arxiv

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov pdf, 2024.2

  6. Cradle: Empowering Foundation Agents Towards General Computer Control Arxiv

    Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu pdf, 2024.3

  7. AgentStudio: A Toolkit for Building General Virtual Agents Arxiv

    Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan pdf, 2024.3

  8. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Arxiv

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu pdf, 2024.4

  9. GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents Arxiv

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun pdf, 2024.6

  10. VideoGUI: A Benchmark for GUI Automation from Instructional Videos Arxiv

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou pdf, 2024.6

  11. Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding Arxiv

    Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang pdf, 2024.6

  12. GUI Action Narrator: Where and When Did That Action Take Place? Arxiv

    Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou pdf, 2024.6