Qwen3-VL-30B-A3B-Instruct Android Control LoRA Fine-tuned Model

Model Overview

This model is a LoRA fine-tune of the Qwen/Qwen3-VL-30B-A3B-Instruct base model for Android UI control tasks. It performs strongly on GUI grounding, particularly in coordinate prediction accuracy for click actions.

Key Performance Highlights

Strong GUI Grounding Performance:

  • Click L2 Distance: 87.04 pixels, among the lowest (best) values in the benchmark comparison below
  • Demonstrates strong coordinate prediction capabilities for GUI interaction tasks
  • Solid performance across other action types (input text match: 0.8455, scroll direction match: 0.8689)

The model demonstrates strong spatial understanding for GUI elements, making it suitable for automated UI testing and accessibility applications.
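
Here, Click L2 Distance is taken to be the Euclidean (L2) distance, in pixels, between the predicted and ground-truth click coordinates (lower is better). A minimal sketch of that computation, under that assumption:

import math

def click_l2_distance(pred_xy, gt_xy):
    """Euclidean (L2) pixel distance between predicted and ground-truth clicks."""
    (px, py), (gx, gy) = pred_xy, gt_xy
    return math.hypot(px - gx, py - gy)

# A prediction 30 px right and 40 px below the target is 50 px off.
print(click_l2_distance((591, 575), (561, 535)))  # 50.0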

Training Data

  • Dataset: OfficerChul/Android-Control-84k
  • Data Format: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)
  • Training Samples: 84,000+ UI interaction examples

Training Data Format Example

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
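
To inspect samples in this format, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch, assuming a standard "train" split and the column names shown in the example above:

from datasets import load_dataset

# Load the Android-Control-84k training data (the split name is an assumption).
ds = load_dataset("OfficerChul/Android-Control-84k", split="train")

sample = ds[0]
print(sample.keys())       # expect columns such as "messages" and "images"
print(sample["messages"])  # system / user / assistant turns as shown above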

Training Method

LoRA fine-tuning was performed with the LLaMA-Factory framework.

Training Configuration (qwen_3_vl_30b.yaml)

  • Base Model: Qwen/Qwen3-VL-30B-A3B-Instruct
  • Training Method: LoRA (Low-Rank Adaptation)
  • LoRA Configuration:
    • Rank: 8
    • Target modules: all
    • Image max pixels: 128,000
  • Training Parameters:
    • Batch size: 4 (gradient accumulation: 48, effective batch size: 192)
    • Learning rate: 1e-4
    • Epochs: 5
    • LR scheduler: Cosine
    • Warmup ratio: 0.1
    • Optimizer: AdamW (fused)
    • Precision: bf16
    • Weight decay: 0.01
    • Cutoff length: 2048 tokens
  • Additional Settings:
    • Gradient checkpointing enabled
    • Flash Attention 2 enabled
    • Vision tower, multi-modal projector, and language model all trainable
    • DeepSpeed ZeRO-3 utilized
    • Validation size: 5%
    • Evaluation steps: 100
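
For readers more familiar with peft than LLaMA-Factory, a rough peft-style equivalent of the adapter settings above might look like the sketch below. This is illustrative only: LLaMA-Factory builds the adapter internally, "Target modules: all" is mapped here to peft's "all-linear" option, and values not listed above (e.g. lora_alpha, dropout) are left at their defaults.

from peft import LoraConfig

# Approximate peft equivalent of the LoRA settings listed above (illustrative).
lora_config = LoraConfig(
    r=8,                          # LoRA rank
    target_modules="all-linear",  # "Target modules: all" -> every linear projection
    task_type="CAUSAL_LM",
)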

Training Results

  • Total Steps: 2,055
  • Final Training Loss: 0.2086
  • Final Evaluation Loss: 0.1190
  • Training Runtime: ~104 hours
  • Samples per Second: 1.049
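
As a rough sanity check, the reported step count is consistent with the configuration above (assuming roughly 84,000 samples, the 5% validation split, and the effective batch size of 192; rounding and the exact dataset size shift the number slightly):

# Back-of-the-envelope check of the reported ~2,055 optimizer steps.
total_samples = 84_000                 # approximate dataset size
train_samples = total_samples * 0.95   # 5% held out for validation -> 79,800
effective_batch = 4 * 48               # batch size * gradient accumulation = 192
steps_per_epoch = train_samples / effective_batch
print(steps_per_epoch * 5)             # ~2,078 over 5 epochs, close to 2,055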

Supported Action Types

  • click: Click on specific coordinates (x, y)
  • long_press: Long press action
  • scroll: Scroll (up/down/left/right)
  • input_text: Text input
  • navigate_back: Navigate back
  • navigate_home: Navigate to home screen
  • open_app: Open application
  • wait: Wait action
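
A predicted action can be executed on a connected device, for example through adb. The sketch below is an unofficial mapping from the action schema above to adb shell input commands; the "direction" and "text" field names, swipe coordinates, gesture durations, and the simplified open_app/wait handling are all assumptions made for illustration:

import json
import subprocess
import time

def adb(*args):
    """Run an adb shell command on the connected device."""
    subprocess.run(["adb", "shell", *args], check=True)

def execute_action(raw_output, screen_w=1080, screen_h=2400):
    """Parse a model response such as {"action_type": "click", "x": 561, "y": 535}
    and execute it with adb. Screen dimensions are illustrative defaults."""
    action = json.loads(raw_output)
    kind = action["action_type"]

    if kind == "click":
        adb("input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "long_press":
        # A long press is approximated as an 800 ms swipe that stays in place.
        x, y = str(action["x"]), str(action["y"])
        adb("input", "swipe", x, y, x, y, "800")
    elif kind == "scroll":
        # Map a scroll direction to a swipe gesture from the screen center.
        cx, cy = screen_w // 2, screen_h // 2
        dx, dy = {"up": (0, 500), "down": (0, -500),
                  "left": (500, 0), "right": (-500, 0)}[action["direction"]]
        adb("input", "swipe", str(cx), str(cy), str(cx + dx), str(cy + dy), "300")
    elif kind == "input_text":
        adb("input", "text", action["text"].replace(" ", "%s"))
    elif kind == "navigate_back":
        adb("input", "keyevent", "KEYCODE_BACK")
    elif kind == "navigate_home":
        adb("input", "keyevent", "KEYCODE_HOME")
    elif kind == "open_app":
        # Launching by app name needs a name -> package mapping, not shown here.
        raise NotImplementedError(f"open_app is not handled in this sketch: {action}")
    elif kind == "wait":
        time.sleep(1.0)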

Usage

The merged model can be directly loaded using the Hugging Face Transformers library.

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_path = "OfficerChul/Qwen3-VL-30B-Android-Control"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    trust_remote_code=True,
    device_map="auto"
)

# Prepare your UI screenshot
image = Image.open("path/to/screenshot.png")
instruction = "Click on the Settings button"

# Prepare conversation
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
        "role": "user",
        # The chat template inserts the image placeholder from this structured content.
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, dropping the echoed prompt.
generated = outputs[:, inputs["input_ids"].shape[1]:]
result = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(result)  # e.g. {"action_type": "click", "x": 561, "y": 535}
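
Following the training format above, the response should be a single JSON action string, which can be parsed directly (assuming the model emits only the JSON object) and then dispatched, for example with the execute_action sketch shown earlier:

import json

# e.g. '{"action_type": "click", "x": 561, "y": 535}'
action = json.loads(result)
print(action["action_type"], action.get("x"), action.get("y"))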

Evaluation Results

Comprehensive Benchmark Comparison

| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match |
| --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) |
| OfficerChul/Qwen2.5-VL-3B-Instruct | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) |
| OfficerChul/InfiGUI-G1-3B | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) |
| OfficerChul/Qwen3-VL-30B-A3B-Instruct_lora_sft (this model) | 0.5907 | 87.04 | 0.8455 | 0.8689 |
| Qwen/Qwen2.5-VL-72B-Instruct | 0.6594 | 64.98 (n=125) | 0.8879 (n=107) | 0.2925 (n=106) |
| OfficerChul/Qwen2.5-VL-72B-Instruct | 0.8838 | 529.23 | 0.9032 | 0.9512 |
| google/gemma-3n-E4B-it | 0.5398 | 824.09 | 0.7521 | 0.5217 |
| OfficerChul/gemma-3n-E4B-it | 0.5088 | 878.66 (n=124) | 0.8763 (n=97) | 0.3689 (n=103) |

Click L2 Distance is measured in pixels (lower is better); the remaining columns are accuracy/match rates (higher is better), with n giving the number of evaluated examples where reported.

License

This model follows the Apache 2.0 license of the Qwen3-VL base model.

Acknowledgments

This model was fine-tuned with the LLaMA-Factory framework on the OfficerChul/Android-Control-84k dataset.

Notes

  • This model was developed for research purposes in mobile UI automation and accessibility enhancement
  • The strong GUI grounding performance makes it suitable for applications requiring precise coordinate prediction
  • Proper validation is required before deploying the model in production environments
  • For best results, ensure input images are clear and at appropriate resolution

Generated with LLaMA-Factory | For questions or issues, please open an issue on the model repository.
