|  |  | 
					
						
						|  | --- | 
					
						
						|  | license: apache-2.0 | 
					
						
						|  | language: | 
					
						
						|  | - en | 
					
						
						|  | pipeline_tag: image-text-to-text | 
					
						
						|  | tags: | 
					
						
						|  | - multimodal | 
					
						
						|  | - gui | 
					
						
						|  | library_name: transformers | 
					
						
						|  | --- | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | # UI-TARS-1.5 Model | 
					
						
						|  |  | 
					
						
						|  | We share the latest progress on UI-TARS-1.5, a model that excels at playing games and performing GUI tasks, in [our blog](https://seed-tars.com/1.5/). | 
					
						
						|  |  | 
					
						
						|  | ## Introduction | 
					
						
						|  |  | 
					
						
						|  | UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model. It can effectively perform diverse tasks within virtual worlds. | 
					
						
						|  |  | 
					
						
						|  | Building on the foundational architecture introduced in [our recent paper](https://arxiv.org/abs/2501.12326), UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. The model reasons through a thought before taking each action, which significantly improves its performance and adaptability, particularly under inference-time scaling. The 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models. | 
					
						
					
						
						|  | <p align="center"> | 
					
						
						|  | <video controls width="480"> | 
					
						
						|  | <source src="https://huggingface.co/datasets/JjjFangg/Demo_video/resolve/main/GUI_demo.mp4" type="video/mp4"> | 
					
						
						|  | </video> | 
					
						
						|  |  | 
					
						
						|  | </p> | 
					
						
						|  | <p align="center"> | 
					
						
						|  | <video controls width="480"> | 
					
						
						|  | <source src="https://huggingface.co/datasets/JjjFangg/Demo_video/resolve/main/Game_demo.mp4" type="video/mp4"> | 
					
						
						|  | </video> | 
					
						
						|  | </p> | 
					
						
						|  |  | 
					
						
					
						
						|  | Code: https://github.com/bytedance/UI-TARS | 
					
						
						|  |  | 
					
						
						|  | Application: https://github.com/bytedance/UI-TARS-desktop | 
					
						
						|  |  | 
					
						
						|  | ## Performance | 
					
						
						|  | **Online Benchmark Evaluation** | 
					
						
						|  | | Benchmark type | Benchmark                                                                                                                                       | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA       | | 
					
						
						|  | |----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------|-------------|-------------|----------------------| | 
					
						
						|  | | **Computer Use** | [OSWorld](https://arxiv.org/abs/2404.07972) (100 steps)                                                                                        | **42.5**     | 36.4        | 28          | 38.1 (200 steps)     | | 
					
						
						|  | |                | [Windows Agent Arena](https://arxiv.org/abs/2409.08264) (50 steps)                                                                              | **42.1**     | -           | -           | 29.8                 | | 
					
						
						|  | | **Browser Use**  | [WebVoyager](https://arxiv.org/abs/2401.13919)                                                                                                 | 84.8         | **87**      | 84.1        | 87                   | | 
					
						
						|  | |                | [Online-Mind2Web](https://arxiv.org/abs/2504.01382)                                                                                              | **75.8**     | 71          | 62.9        | 71                   | | 
					
						
						|  | | **Phone Use**    | [AndroidWorld](https://arxiv.org/abs/2405.14573)                                                                                               | **64.2**     | -           | -           | 59.5                 | | 
					
						
						|  |  | 
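
As a reading aid, the margins between UI-TARS-1.5 and the previous SOTA in the online benchmark table above can be recomputed with a few lines of Python. This is only an illustrative sketch: the names `ONLINE_SCORES` and `margins_over_previous_sota` are hypothetical, and the numbers are copied by hand from the table. Note that the OSWorld entry compares a 100-step budget against a 200-step previous SOTA, so that margin is not a like-for-like comparison.

```python
# Scores copied by hand from the Online Benchmark Evaluation table.
# Each entry maps a benchmark to (UI-TARS-1.5 score, previous SOTA score).
# The dictionary and function names are illustrative, not part of any released API.
ONLINE_SCORES = {
    "OSWorld": (42.5, 38.1),  # 100 steps vs. 200 steps for the previous SOTA
    "Windows Agent Arena": (42.1, 29.8),
    "WebVoyager": (84.8, 87.0),
    "Online-Mind2Web": (75.8, 71.0),
    "AndroidWorld": (64.2, 59.5),
}


def margins_over_previous_sota(scores):
    """Return {benchmark: UI-TARS-1.5 minus previous SOTA}, rounded to one decimal."""
    return {name: round(ours - prev, 1) for name, (ours, prev) in scores.items()}


if __name__ == "__main__":
    for name, delta in margins_over_previous_sota(ONLINE_SCORES).items():
        # A negative margin means the previous SOTA is still ahead.
        print(f"{name}: {delta:+.1f}")
```

A negative value, as for WebVoyager, marks a benchmark where the previous SOTA remains ahead.
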
					
						
						|  |  | 
					
						
						|  | **Grounding Capability Evaluation** | 
					
						
						|  | | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA | | 
					
						
						|  | |-----------|-------------|------------|------------|----------------| | 
					
						
						|  | | [ScreenSpot-V2](https://arxiv.org/pdf/2410.23218) | **94.2** | 87.9 | 87.6 | 91.6 | | 
					
						
						|  | | [ScreenSpot-Pro](https://arxiv.org/pdf/2504.07981v1) | **61.6** | 23.4 | 27.7 | 43.6 | | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | **Poki Game** | 
					
						
						|  |  | 
					
						
						|  | | Model       | [2048](https://poki.com/en/g/2048) | [cubinko](https://poki.com/en/g/cubinko) | [energy](https://poki.com/en/g/energy) | [free-the-key](https://poki.com/en/g/free-the-key) | [Gem-11](https://poki.com/en/g/gem-11) | [hex-frvr](https://poki.com/en/g/hex-frvr) | [Infinity-Loop](https://poki.com/en/g/infinity-loop) | [Maze:Path-of-Light](https://poki.com/en/g/maze-path-of-light) | [shapes](https://poki.com/en/g/shapes) | [snake-solver](https://poki.com/en/g/snake-solver) | [wood-blocks-3d](https://poki.com/en/g/wood-blocks-3d) | [yarn-untangle](https://poki.com/en/g/yarn-untangle) | [laser-maze-puzzle](https://poki.com/en/g/laser-maze-puzzle) | [tiles-master](https://poki.com/en/g/tiles-master) | | 
					
						
						|  | |-------------|-----------|--------------|-------------|-------------------|-------------|---------------|---------------------|--------------------------|-------------|--------------------|----------------------|---------------------|------------------------|---------------------| | 
					
						
						|  | | OpenAI CUA  | 31.04     | 0.00         | 32.80       | 0.00              | 46.27       | 92.25         | 23.08               | 35.00                    | 52.18       | 42.86              | 2.02                 | 44.56               | 80.00                  | 78.27               | | 
					
						
						|  | | Claude 3.7  | 43.05     | 0.00         | 41.60       | 0.00              | 0.00        | 30.76         | 2.31                | 82.00                    | 6.26        | 42.86              | 0.00                 | 13.77               | 28.00                  | 52.18               | | 
					
						
						|  | | UI-TARS-1.5 | 100.00    | 0.00         | 100.00      | 100.00            | 100.00      | 100.00        | 100.00              | 100.00                   | 100.00      | 100.00             | 100.00               | 100.00              | 100.00                 | 100.00              | | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | **Minecraft** | 
					
						
						|  |  | 
					
						
						|  | | Task Type   | Task Name           | [VPT](https://openai.com/index/vpt/) | [DreamerV3](https://www.nature.com/articles/s41586-025-08744-2) | Previous SOTA | UI-TARS-1.5 w/o Thought | UI-TARS-1.5 w/ Thought | | 
					
						
						|  | |-------------|---------------------|----------|----------------|--------------------|------------------|-----------------| | 
					
						
						|  | | Mine Blocks | (oak_log)               | 0.8      | 1.0            | 1.0                | 1.0              | 1.0             | | 
					
						
						|  | |             | (obsidian)          | 0.0      | 0.0            | 0.0                | 0.2              | 0.3             | | 
					
						
						|  | |             | (white_bed)               | 0.0      | 0.0            | 0.1                | 0.4              | 0.6             | | 
					
						
						|  | |             | **200 Tasks Avg.**  | 0.06     | 0.03           | 0.32               | 0.35             | 0.42            | | 
					
						
						|  | | Kill Mobs   | (mooshroom)            | 0.0      | 0.0            | 0.1                | 0.3              | 0.4             | | 
					
						
						|  | |             | (zombie)            | 0.4      | 0.1            | 0.6                | 0.7              | 0.9             | | 
					
						
						|  | |             | (chicken)          | 0.1      | 0.0            | 0.4                | 0.5              | 0.6             | | 
					
						
						|  | |             | **100 Tasks Avg.**  | 0.04     | 0.03           | 0.18               | 0.25             | 0.31            | | 
					
						
						|  |  | 
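
The effect of reasoning before acting can be read off the two UI-TARS-1.5 columns of the Minecraft table above. The sketch below recomputes the absolute and relative gains from the two task-average rows. The names `MINECRAFT_AVERAGES` and `thought_gain` are hypothetical, and the values are copied by hand from the table, so treat this as a worked arithmetic example rather than an evaluation script.

```python
# Task-average scores copied by hand from the Minecraft table:
# (UI-TARS-1.5 w/o Thought, UI-TARS-1.5 w/ Thought).
# Names below are illustrative only.
MINECRAFT_AVERAGES = {
    "Mine Blocks (200 tasks avg.)": (0.35, 0.42),
    "Kill Mobs (100 tasks avg.)": (0.25, 0.31),
}


def thought_gain(without_thought, with_thought):
    """Return (absolute gain, relative gain in percent) from enabling thought."""
    diff = with_thought - without_thought
    absolute = round(diff, 2)
    relative_pct = round(100 * diff / without_thought, 1)
    return absolute, relative_pct


if __name__ == "__main__":
    for task, (without, with_) in MINECRAFT_AVERAGES.items():
        absolute, relative_pct = thought_gain(without, with_)
        print(f"{task}: +{absolute:.2f} absolute, +{relative_pct:.1f}% relative")
```
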
					
						
						|  | ## Model Scale Comparison | 
					
						
						|  |  | 
					
						
						|  | The table below compares UI-TARS models of different scales on the OSWorld and ScreenSpot-Pro benchmarks. | 
					
						
						|  |  | 
					
						
						|  | | **Benchmark Type** | **Benchmark**                      | **UI-TARS-72B-DPO** | **UI-TARS-1.5-7B** | **UI-TARS-1.5** | | 
					
						
						|  | |--------------------|------------------------------------|---------------------|--------------------|-----------------| | 
					
						
						|  | | Computer Use       | [OSWorld](https://arxiv.org/abs/2404.07972)             | 24.6                | 27.5               | **42.5**        | | 
					
						
						|  | | GUI Grounding      | [ScreenSpot-Pro](https://arxiv.org/pdf/2504.07981v1) | 38.1                | 49.6               | **61.6**        | | 
					
						
						|  |  | 
					
						
						|  | The released UI-TARS-1.5-7B focuses primarily on general computer use and is not specifically optimized for game-based scenarios, where UI-TARS-1.5 still holds a significant advantage. | 
					
						
						|  |  | 
					
						
						|  | ## What's Next | 
					
						
						|  | We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at [email protected]. | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ## Citation | 
					
						
						|  | If you find our paper and model useful in your research, please consider citing our work. | 
					
						
						|  |  | 
					
						
						|  | ```BibTeX | 
					
						
						|  | @article{qin2025ui, | 
					
						
						|  | title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents}, | 
					
						
						|  | author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others}, | 
					
						
						|  | journal={arXiv preprint arXiv:2501.12326}, | 
					
						
						|  | year={2025} | 
					
						
						|  | } | 
					
						
						|  | ``` |