🌐 Website | 🤗 Hugging Face Models | 🔧 Deployment | 📑 Paper | 🖥️ UI-TARS-desktop | 🏄 Midscene (Browser Automation) | 🫨 Discord
We also offer UI-TARS-desktop, a version that runs on your local device; to use it, visit https://github.com/bytedance/UI-TARS-desktop. For web automation with UI-TARS, refer to the open-source project Midscene.js.
- 🌟 2025.04.16: We shared the latest progress of the UI-TARS-1.5 model in our blog; it excels at playing games and performing GUI tasks, and we open-sourced UI-TARS-1.5-7B.
- ✨ 2025.03.23: We updated our inference scripts to match the official OSWorld repository, so you can now reproduce our results with the official OSWorld inference scripts.
UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model, capable of effectively performing diverse tasks within virtual worlds.
Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning: the model reasons through its thoughts before taking action, which significantly enhances its performance and adaptability, particularly under inference-time scaling. The new 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.
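To make the reason-then-act loop concrete, here is a minimal sketch of one turn against a deployed model, assuming it is served behind an OpenAI-compatible endpoint (e.g., via vLLM). The URL, model name, and the exact Thought/Action wording in the final comment are illustrative assumptions, not the official interface; see the deployment guide and prompts.py linked below for the real templates.

```python
# Minimal sketch of one reason-then-act turn, assuming an OpenAI-compatible
# endpoint (e.g., served by vLLM). Endpoint URL, model name, and the exact
# prompt/response format are illustrative assumptions, not the official API.
import base64

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="empty")

# Encode the current screenshot so it can be sent inline as a data URL.
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ui-tars-1.5-7b",  # whatever name the server registers
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text", "text": "Open the Settings app."},
        ],
    }],
)

# The model first writes out its reasoning, then a single action, e.g.:
#   Thought: I need to click the gear icon in the taskbar.
#   Action: click(start_box='(612,914)')
print(response.choices[0].message.content)
```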
- See the deployment guide here.
- For coordinate processing, refer to here (a toy sketch follows this list).
- For full action-space parsing, refer to OSWorld's uitars_agent.py.
- For prompt templates, refer to prompts.py.
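The linked files are authoritative for the action grammar and coordinate conventions. Purely as a companion sketch, the snippet below shows the general shape of splitting one model turn into thought and action and mapping a coordinate to screen pixels, under two assumptions that must be verified against the coordinates guide and uitars_agent.py: a click action carrying an (x,y) point, and coordinates normalized to a 0-1000 grid.

```python
import re

# Toy parser for one model turn, assuming output of the form
#   Thought: ...\nAction: click(start_box='(x,y)')
# and coordinates normalized to a 0-1000 grid. The real grammar and
# coordinate handling live in the linked coordinates guide and OSWorld's
# uitars_agent.py; this is only a sketch of the idea.

MODEL_SPACE = 1000.0  # assumed normalized coordinate range


def parse_turn(text: str) -> tuple[str, str]:
    """Split a model turn into its thought and action strings."""
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.S)
    action = re.search(r"Action:\s*(.*)", text, re.S)
    if not action:
        raise ValueError(f"no action found in: {text!r}")
    return (thought.group(1) if thought else "", action.group(1).strip())


def to_screen(action: str, width: int, height: int) -> tuple[int, int]:
    """Map a normalized (x,y) in the action string to pixel coordinates."""
    m = re.search(r"\((\d+),\s*(\d+)\)", action)
    if not m:
        raise ValueError(f"no coordinates in: {action!r}")
    x, y = int(m.group(1)), int(m.group(2))
    return round(x / MODEL_SPACE * width), round(y / MODEL_SPACE * height)


turn = "Thought: Click the gear icon.\nAction: click(start_box='(612,914)')"
thought, action = parse_turn(turn)
print(thought, "->", action, "->", to_screen(action, 1920, 1080))
```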
Online Benchmark Evaluation
| Benchmark type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|---|
| Computer Use | OSWorld (100 steps) | 42.5 | 36.4 | 28 | 38.1 (200 steps) |
| Computer Use | Windows Agent Arena (50 steps) | 42.1 | - | - | 29.8 |
| Browser Use | WebVoyager | 84.8 | 87 | 84.1 | 87 |
| Browser Use | Online-Mind2Web | 75.8 | 71 | 62.9 | 71 |
| Phone Use | Android World | 64.2 | - | - | 59.5 |
Grounding Capability Evaluation
| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|
| ScreenSpot-V2 | 94.2 | 87.9 | 87.6 | 91.6 |
| ScreenSpotPro | 61.6 | 23.4 | 27.7 | 43.6 |
Poki Games
| Model | 2048 | cubinko | energy | free-the-key | Gem-11 | hex-frvr | Infinity-Loop | Maze:Path-of-Light | shapes | snake-solver | wood-blocks-3d | yarn-untangle | laser-maze-puzzle | tiles-master |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI CUA | 31.04 | 0.00 | 32.80 | 0.00 | 46.27 | 92.25 | 23.08 | 35.00 | 52.18 | 42.86 | 2.02 | 44.56 | 80.00 | 78.27 |
| Claude 3.7 | 43.05 | 0.00 | 41.60 | 0.00 | 0.00 | 30.76 | 2.31 | 82.00 | 6.26 | 42.86 | 0.00 | 13.77 | 28.00 | 52.18 |
| UI-TARS-1.5 | 100.00 | 0.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
Minecraft
| Task Type | Task Name | VPT | DreamerV3 | Previous SOTA | UI-TARS-1.5 w/o Thought | UI-TARS-1.5 w/ Thought |
|---|---|---|---|---|---|---|
| Mine Blocks | (oak_log) | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 |
| Mine Blocks | (obsidian) | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
| Mine Blocks | (white_bed) | 0.0 | 0.0 | 0.1 | 0.4 | 0.6 |
| Mine Blocks | 200 Tasks Avg. | 0.06 | 0.03 | 0.32 | 0.35 | 0.42 |
| Kill Mobs | (mooshroom) | 0.0 | 0.0 | 0.1 | 0.3 | 0.4 |
| Kill Mobs | (zombie) | 0.4 | 0.1 | 0.6 | 0.7 | 0.9 |
| Kill Mobs | (chicken) | 0.1 | 0.0 | 0.4 | 0.5 | 0.6 |
| Kill Mobs | 100 Tasks Avg. | 0.04 | 0.03 | 0.18 | 0.25 | 0.31 |
Here we compare performance across different UI-TARS model scales on the OSWorld and ScreenSpotPro benchmarks.
| Benchmark Type | Benchmark | UI-TARS-72B-DPO | UI-TARS-1.5-7B | UI-TARS-1.5 |
|---|---|---|---|---|
| Computer Use | OSWorld | 24.6 | 27.5 | 42.5 |
| GUI Grounding | ScreenSpotPro | 38.1 | 49.6 | 61.6 |
While UI-TARS-1.5 represents a significant advancement in multimodal agent capabilities, we acknowledge several important limitations:
- Misuse: Given its enhanced performance in GUI tasks, including successfully navigating authentication challenges like CAPTCHA, UI-TARS-1.5 could potentially be misused for unauthorized access or automation of protected content. To mitigate this risk, extensive internal safety evaluations are underway.
- Computation: UI-TARS-1.5 still requires substantial computational resources, particularly for large-scale tasks or extended gameplay scenarios.
- Hallucination: UI-TARS-1.5 may occasionally generate inaccurate descriptions, misidentify GUI elements, or take suboptimal actions based on incorrect inferences—especially in ambiguous or unfamiliar environments.
- Model scale: The released UI-TARS-1.5-7B focuses primarily on enhancing general computer-use capabilities and is not specifically optimized for game-based scenarios, where the full UI-TARS-1.5 model still holds a significant advantage.
We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at [email protected].
Looking ahead, we envision UI-TARS evolving into increasingly sophisticated agentic experiences capable of performing real-world actions, thereby empowering platforms such as doubao to accomplish more complex tasks for you :)
If you find our paper and model useful in your research, feel free to cite us:
@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}