This model was converted with mlx_vlm from ByteDance-Seed/UI-TARS-1.5-7B.

Model Description

UI-TARS-1.5 is ByteDance's open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.

The released UI-TARS-1.5-7B focuses primarily on enhancing general computer use capabilities and is not specifically optimized for game-based scenarios, where UI-TARS-1.5 still holds a significant advantage.

Benchmark Type	Benchmark	UI-TARS-1.5-7B	UI-TARS-1.5
Computer Use	OSWorld	27.5	42.5
GUI Grounding	ScreenSpotPro	49.6	61.6

P.S. The table above reports the performance of UI-TARS-1.5-7B and UI-TARS-1.5 on OSWorld and ScreenSpotPro.

Quick Start

mlx_vlm.generate --model flin775/UI-Tars-1.5-7B-4bit-mlx \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts’ names and their corresponding grounding boxes([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
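The prompt above asks for JSON, but the reply may arrive wrapped in a Markdown fence or surrounded by extra text. The following is a minimal sketch for post-processing such a reply in Python. It assumes, hypothetically, that the model returns a JSON array of objects with "name" and "box" keys; the actual key names are not documented here and may differ, and the helper name extract_boxes is illustrative only.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built at runtime


def extract_boxes(response):
    """Return [(name, [x1, y1, x2, y2]), ...] parsed from a model response.

    Assumption: the response holds a JSON array of objects with
    "name" and "box" keys (a hypothetical schema, not a documented one).
    """
    # Prefer the contents of a fenced block if the model emitted one.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, response, re.DOTALL)
    payload = fenced.group(1) if fenced else response

    # Keep only the outermost [...] span to drop surrounding commentary.
    start, end = payload.find("["), payload.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    records = json.loads(payload[start:end + 1])

    results = []
    for record in records:
        box = record.get("box") if isinstance(record, dict) else None
        # Accept only well-formed boxes: four numbers with x1 < x2 and y1 < y2.
        if (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
            and box[0] < box[2]
            and box[1] < box[3]
        ):
            results.append((record.get("name", ""), box))
    return results
```

Pass the raw text printed by the command to extract_boxes. Malformed or inverted boxes are dropped silently rather than raising an error. Whether the coordinates refer to the original image or a resized copy is not stated here, so verify that before mapping boxes back onto the screenshot.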
