This model was converted with mlx_vlm from ByteDance-Seed/UI-TARS-1.5-7B.
## Model Description
UI-TARS-1.5 is ByteDance's open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.
The released UI-TARS-1.5-7B focuses primarily on enhancing general computer use capabilities and is not specifically optimized for game-based scenarios, where UI-TARS-1.5 still holds a significant advantage.
| Benchmark Type | Benchmark | UI-TARS-1.5-7B | UI-TARS-1.5 |
|---|---|---|---|
| Computer Use | OSWorld | 27.5 | 42.5 |
| GUI Grounding | ScreenSpotPro | 49.6 | 61.6 |
P.S. The table above shows the performance of UI-TARS-1.5-7B and UI-TARS-1.5 on OSWorld and ScreenSpotPro.
## Quick Start
```bash
mlx_vlm.generate --model flin775/UI-Tars-1.5-7B-4bit-mlx \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts’ names and their corresponding grounding boxes([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
```
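For programmatic use, a minimal Python sketch along the lines of the mlx_vlm README is shown below. The helper names (`load`, `load_config`, `apply_chat_template`, `generate`) and their argument order have changed between mlx_vlm releases, so treat this as an assumption-laden example for a recent version rather than the canonical API.

```python
# Minimal sketch, assuming a recent mlx_vlm release that exposes
# load / load_config / apply_chat_template / generate as in its README.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "flin775/UI-Tars-1.5-7B-4bit-mlx"

# Load the 4-bit MLX weights together with the processor and model config.
model, processor = load(model_path)
config = load_config(model_path)

images = ["https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg"]
prompt = (
    "List all contacts' names and their corresponding grounding boxes "
    "([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, "
    "return the results in JSON format."
)

# Wrap the raw prompt in the model's chat template before generation.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))

# Argument order (prompt vs. image) differs across mlx_vlm versions; this follows the newer form.
output = generate(model, processor, formatted_prompt, images, max_tokens=1024, verbose=False)
print(output)
```

Greedy decoding (temperature 0.0, as in the CLI example) can be requested via the sampling keyword argument, whose name also varies across mlx_vlm versions (`temp` in older releases, `temperature` in newer ones).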