Qwen3-VL-30B-A3B-Instruct Android Control LoRA Fine-tuned Model
Model Overview
This model is a fine-tuned version of Qwen's Qwen3-VL-30B-A3B-Instruct base model, adapted with LoRA for Android UI control tasks. It demonstrates strong performance on GUI grounding, particularly in coordinate prediction accuracy for click actions.
Key Performance Highlights
Strong GUI Grounding Performance:
- Click L2 Distance: 87.04 pixels, competitive with the other models in the benchmark (a computation sketch follows below)
- Demonstrates strong coordinate prediction capabilities for GUI interaction tasks
- Solid performance across other action types (input text match: 0.8455, scroll direction match: 0.8689)
The model demonstrates strong spatial understanding for GUI elements, making it suitable for automated UI testing and accessibility applications.
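Click L2 distance is the Euclidean distance, in pixels, between the predicted and the ground-truth click coordinates. A minimal sketch assuming that standard definition (the helper name is illustrative and not taken from the evaluation code):

```python
import math

def click_l2_distance(pred_x: float, pred_y: float, gt_x: float, gt_y: float) -> float:
    """Euclidean (L2) distance in pixels between predicted and ground-truth click points."""
    return math.hypot(pred_x - gt_x, pred_y - gt_y)

# Example: prediction (561, 535) vs. ground truth (540, 520) -> ~25.8 px
print(click_l2_distance(561, 535, 540, 520))
```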
Training Data
- Dataset: OfficerChul/Android-Control-84k
- Data Format: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)
- Training Samples: 84,000+ UI interaction examples
Training Data Format Example
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```
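To sanity-check a record in this format, one option is to overlay the target click on its screenshot. A minimal sketch, assuming the screenshots are available locally at the relative paths stored in the images field (the output filename is illustrative):

```python
import json
from PIL import Image, ImageDraw

# A record mirroring the format shown above.
record = {
    "messages": [
        {"role": "user", "content": "<image>Click on the Recording 2"},
        {"role": "assistant", "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"},
    ],
    "images": ["and_ctrl/out_episode_18557_step_001.png"],
}

# Parse the target action from the assistant turn and draw it on the screenshot.
action = json.loads(record["messages"][-1]["content"])
image = Image.open(record["images"][0]).convert("RGB")
if action["action_type"] == "click":
    x, y = action["x"], action["y"]
    draw = ImageDraw.Draw(image)
    draw.ellipse([x - 15, y - 15, x + 15, y + 15], outline="red", width=4)
image.save("sample_with_target.png")
```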
Training Method
LoRA fine-tuning was performed using the LLaMA-Factory framework.
Training Configuration (qwen_3_vl_30b.yaml)
- Base Model: Qwen/Qwen3-VL-30B-A3B-Instruct
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank: 8
  - Target modules: all
  - Image max pixels: 128,000
- Training Parameters:
  - Batch size: 4 (gradient accumulation: 48, effective batch size: 192)
  - Learning rate: 1e-4
  - Epochs: 5
  - LR scheduler: Cosine
  - Warmup ratio: 0.1
  - Optimizer: AdamW (fused)
  - Precision: bf16
  - Weight decay: 0.01
  - Cutoff length: 2048 tokens
- Additional Settings:
  - Gradient checkpointing enabled
  - Flash Attention 2 enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-3 utilized
  - Validation size: 5%
  - Evaluation steps: 100
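For reference, the settings above roughly correspond to a LLaMA-Factory YAML file like the sketch below. This is a hedged reconstruction from the bullet points, not the verbatim qwen_3_vl_30b.yaml; the key names follow common LLaMA-Factory conventions, and entries such as the dataset name and DeepSpeed config path are placeholders.

```yaml
# Approximate reconstruction (illustrative, not the verbatim qwen_3_vl_30b.yaml)
model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
image_max_pixels: 128000

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
freeze_vision_tower: false          # vision tower trainable
freeze_multi_modal_projector: false # projector trainable

dataset: android_control_84k        # placeholder dataset name
cutoff_len: 2048

per_device_train_batch_size: 4
gradient_accumulation_steps: 48
learning_rate: 1.0e-4
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
optim: adamw_torch_fused
weight_decay: 0.01
bf16: true
gradient_checkpointing: true
flash_attn: fa2
deepspeed: examples/deepspeed/ds_z3_config.json  # placeholder ZeRO-3 config path

val_size: 0.05
eval_strategy: steps
eval_steps: 100
```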
Training Results
- Total Steps: 2,055
- Final Training Loss: 0.2086
- Final Evaluation Loss: 0.1190
- Training Runtime: ~104 hours
- Samples per Second: 1.049
Supported Action Types
- click: Click on specific coordinates (x, y)
- long_press: Long press action
- scroll: Scroll (up/down/left/right)
- input_text: Text input
- navigate_back: Navigate back
- navigate_home: Navigate to home screen
- open_app: Open application
- wait: Wait action
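Each prediction is a small JSON object whose fields depend on action_type (as in the click example above). As a hedged illustration only, the sketch below maps such predictions to common adb shell input commands; the field names beyond action_type/x/y (direction, text, app_name), the swipe coordinates, and the helper names are assumptions for demonstration, not part of this model or its training data.

```python
import json
import subprocess

def run_adb(*args: str) -> None:
    """Run an adb shell command (assumes adb is on PATH and a device is connected)."""
    subprocess.run(["adb", "shell", *args], check=True)

def execute_action(prediction: str) -> None:
    """Dispatch a predicted action JSON string to adb input commands (illustrative only)."""
    action = json.loads(prediction)
    kind = action["action_type"]
    if kind == "click":
        run_adb("input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "long_press":
        # A long press can be emulated as a swipe whose start and end points coincide.
        x, y = str(action["x"]), str(action["y"])
        run_adb("input", "swipe", x, y, x, y, "800")
    elif kind == "scroll":
        # Fixed-length swipes for illustration; real distances depend on screen size.
        swipes = {
            "up":    ("540", "1500", "540", "500"),
            "down":  ("540", "500", "540", "1500"),
            "left":  ("900", "1000", "200", "1000"),
            "right": ("200", "1000", "900", "1000"),
        }
        run_adb("input", "swipe", *swipes[action["direction"]])
    elif kind == "input_text":
        run_adb("input", "text", action["text"].replace(" ", "%s"))
    elif kind == "navigate_back":
        run_adb("input", "keyevent", "KEYCODE_BACK")
    elif kind == "navigate_home":
        run_adb("input", "keyevent", "KEYCODE_HOME")
    elif kind == "open_app":
        # Assumes the prediction carries a launchable package name.
        run_adb("monkey", "-p", action["app_name"], "-c", "android.intent.category.LAUNCHER", "1")
    elif kind == "wait":
        pass  # no device command; the caller can sleep between steps
```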
Usage
The merged model can be directly loaded using the Hugging Face Transformers library.
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_path = "OfficerChul/Qwen3-VL-30B-Android-Control"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare your UI screenshot and instruction
image = Image.open("path/to/screenshot.png")
instruction = "Click on the Settings button"

# Prepare conversation; structured user content lets the processor insert
# the image placeholder tokens expected by the chat template
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    },
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens (drop the prompt)
generated = outputs[:, inputs["input_ids"].shape[-1]:]
result = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(result)
```
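The generated text should be a single JSON action object in the same format as the training targets. A minimal, hedged post-processing sketch (the brace-extraction fallback is an assumption to tolerate extra text around the JSON):

```python
import json

def parse_action(result: str) -> dict:
    """Parse the model's generated text into an action dict (illustrative)."""
    text = result.strip()
    try:
        return json.loads(text)
    except ValueError:
        # Fallback: extract the first {...} span if extra text surrounds the JSON.
        start, end = text.find("{"), text.rfind("}") + 1
        return json.loads(text[start:end])

action = parse_action(result)
print(action["action_type"], action)
```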
Evaluation Results
Comprehensive Benchmark Comparison
| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) |
| OfficerChul/Qwen2.5-VL-3B-Instruct | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) |
| OfficerChul/InfiGUI-G1-3B | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) |
| OfficerChul/Qwen3-VL-30B-A3B-Instruct_lora_sft | 0.5907 | 87.04 | 0.8455 | 0.8689 |
| Qwen/Qwen2.5-VL-72B-Instruct | 0.6594 | 64.98 (n=125) | 0.8879 (n=107) | 0.2925 (n=106) |
| OfficerChul/Qwen2.5-VL-72B-Instruct | 0.8838 | 529.23 | 0.9032 | 0.9512 |
| google/gemma-3n-E4B-it | 0.5398 | 824.09 | 0.7521 | 0.5217 |
| OfficerChul/gemma-3n-E4B-it | 0.5088 | 878.66 (n=124) | 0.8763 (n=97) | 0.3689 (n=103) |
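For clarity on what the table columns measure, below is a hedged sketch of how the four metrics could be aggregated over paired predictions and references; it is not the evaluation script used for this benchmark, and the field names simply mirror the training-data format above.

```python
import math
from statistics import mean

def aggregate_metrics(predictions: list[dict], references: list[dict]) -> dict:
    """Aggregate action type accuracy, click L2 distance, and text/scroll match rates (illustrative)."""
    type_hits, click_dists, text_hits, scroll_hits = [], [], [], []
    for pred, ref in zip(predictions, references):
        type_hits.append(pred.get("action_type") == ref["action_type"])
        if ref["action_type"] == "click":
            click_dists.append(math.hypot(pred.get("x", 0) - ref["x"], pred.get("y", 0) - ref["y"]))
        elif ref["action_type"] == "input_text":
            text_hits.append(pred.get("text", "").strip() == ref["text"].strip())
        elif ref["action_type"] == "scroll":
            scroll_hits.append(pred.get("direction") == ref["direction"])
    return {
        "action_type_accuracy": mean(type_hits),
        "click_l2_distance": mean(click_dists) if click_dists else None,
        "input_text_match": mean(text_hits) if text_hits else None,
        "scroll_direction_match": mean(scroll_hits) if scroll_hits else None,
    }
```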
License
This model follows the Apache 2.0 license of the Qwen3-VL base model.
Acknowledgments
- Base model: Qwen3-VL-30B-A3B-Instruct by Qwen team
- Training framework: LLaMA-Factory
- Dataset: Android-Control-84k
Notes
- This model was developed for research purposes in mobile UI automation and accessibility enhancement
- The strong GUI grounding performance makes it suitable for applications requiring precise coordinate prediction
- Proper validation is required before use in production environments
- For best results, ensure input images are clear and at appropriate resolution
Generated with LLaMA-Factory | For questions or issues, please open an issue on the model repository.