This model was converted with mlx_vlm from ByteDance-Seed/UI-TARS-1.5-7B.

Model Description

UI-TARS-1.5 is ByteDance's open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.

The released UI-TARS-1.5-7B focuses primarily on general computer use and is not specifically optimized for game-based scenarios, where the full UI-TARS-1.5 model still holds a significant advantage.

Benchmark Type   Benchmark       UI-TARS-1.5-7B   UI-TARS-1.5
Computer Use     OSWorld         27.5             42.5
GUI Grounding    ScreenSpotPro   49.6             61.6

P.S. The table above reports the performance of UI-TARS-1.5-7B and UI-TARS-1.5 on OSWorld and ScreenSpotPro.

Quick Start

mlx_vlm.generate --model flin775/UI-Tars-1.5-7B-4bit-mlx \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts’ names and their corresponding grounding boxes([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
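
The prompt above asks for JSON, so the reply usually needs to be parsed before the boxes can be used. The following is a minimal sketch of that post-processing step, not part of mlx_vlm. It assumes the model returns a JSON array whose entries carry a `name` string and a `bbox` list of four numbers; both key names and the helper `extract_boxes` are hypothetical, so adjust them to match the output you actually see.

```python
import json
import re


def extract_boxes(reply: str) -> list[dict]:
    """Pull a JSON array out of a model reply and keep well-formed entries.

    Assumption (not confirmed by the model card): each entry looks like
    {"name": "...", "bbox": [x1, y1, x2, y2]}.
    """
    # The reply may wrap the JSON in prose or a Markdown fence,
    # so take everything from the first "[" to the last "]".
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        name = item.get("name")  # hypothetical key
        bbox = item.get("bbox")  # hypothetical key
        # Keep only entries with a string name and exactly four numeric coordinates.
        if (
            isinstance(name, str)
            and isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)
        ):
            valid.append({"name": name, "bbox": bbox})
    return valid


if __name__ == "__main__":
    sample = 'Contacts found: [{"name": "Alice", "bbox": [10, 20, 110, 60]}]'
    print(extract_boxes(sample))
```

Malformed or incomplete entries are dropped and an unparseable reply yields an empty list, which keeps downstream code from failing when the output is cut off by `--max-tokens`.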