Qwen3-VL-30B-A3B-Instruct Android Control LoRA Fine-tuned Model
Model Overview
This model is a fine-tuned version of Qwen's Qwen3-VL-30B-A3B-Instruct base model, adapted with LoRA for Android UI control tasks. It demonstrates strong performance on GUI grounding, particularly in coordinate prediction accuracy for click actions.
Key Performance Highlights
Strong GUI Grounding Performance:
- Click L2 Distance: 87.04 pixels, competitive with the other models in the benchmark (a computation sketch follows below)
- Demonstrates strong coordinate prediction capabilities for GUI interaction tasks
- Solid performance across other action types (input text match: 0.8455, scroll direction match: 0.8689)
The model demonstrates strong spatial understanding for GUI elements, making it suitable for automated UI testing and accessibility applications.
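Click L2 distance is the Euclidean distance, in pixels, between the predicted and the ground-truth click coordinates. A minimal sketch assuming that standard definition (the helper name is illustrative and not taken from the evaluation code):

```python
import math

def click_l2_distance(pred_x: float, pred_y: float, gt_x: float, gt_y: float) -> float:
    """Euclidean (L2) distance in pixels between predicted and ground-truth click points."""
    return math.hypot(pred_x - gt_x, pred_y - gt_y)

# Example: prediction (561, 535) vs. ground truth (540, 520) -> ~25.8 px
print(click_l2_distance(561, 535, 540, 520))
```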
Training Data
- Dataset: OfficerChul/Android-Control-84k
- Data Format: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)
- Training Samples: 84,000+ UI interaction examples
Training Data Format Example
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```
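To sanity-check a record in this format, one option is to overlay the target click on its screenshot. A minimal sketch, assuming the screenshots are available locally at the relative paths stored in the images field (the output filename is illustrative):

```python
import json
from PIL import Image, ImageDraw

# A record mirroring the format shown above.
record = {
    "messages": [
        {"role": "user", "content": "<image>Click on the Recording 2"},
        {"role": "assistant", "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"},
    ],
    "images": ["and_ctrl/out_episode_18557_step_001.png"],
}

# Parse the target action from the assistant turn and draw it on the screenshot.
action = json.loads(record["messages"][-1]["content"])
image = Image.open(record["images"][0]).convert("RGB")
if action["action_type"] == "click":
    x, y = action["x"], action["y"]
    draw = ImageDraw.Draw(image)
    draw.ellipse([x - 15, y - 15, x + 15, y + 15], outline="red", width=4)
image.save("sample_with_target.png")
```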
Training Method
LoRA fine-tuning was performed using the LLaMA-Factory framework.
Training Configuration (qwen_3_vl_30b.yaml)
- Base Model: Qwen/Qwen3-VL-30B-A3B-Instruct
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank: 8
  - Target modules: all
  - Image max pixels: 128,000
- Training Parameters:
  - Batch size: 4 (gradient accumulation: 48, effective batch size: 192)
  - Learning rate: 1e-4
  - Epochs: 5
  - LR scheduler: Cosine
  - Warmup ratio: 0.1
  - Optimizer: AdamW (fused)
  - Precision: bf16
  - Weight decay: 0.01
  - Cutoff length: 2048 tokens
- Additional Settings:
  - Gradient checkpointing enabled
  - Flash Attention 2 enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-3 utilized
  - Validation size: 5%
  - Evaluation steps: 100
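For reference, the settings above roughly correspond to a LLaMA-Factory YAML file like the sketch below. This is a hedged reconstruction from the bullet points, not the verbatim qwen_3_vl_30b.yaml; the key names follow common LLaMA-Factory conventions, and entries such as the dataset name and DeepSpeed config path are placeholders.

```yaml
# Approximate reconstruction (illustrative, not the verbatim qwen_3_vl_30b.yaml)
model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
image_max_pixels: 128000

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
freeze_vision_tower: false          # vision tower trainable
freeze_multi_modal_projector: false # projector trainable

dataset: android_control_84k        # placeholder dataset name
cutoff_len: 2048

per_device_train_batch_size: 4
gradient_accumulation_steps: 48
learning_rate: 1.0e-4
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
optim: adamw_torch_fused
weight_decay: 0.01
bf16: true
gradient_checkpointing: true
flash_attn: fa2
deepspeed: examples/deepspeed/ds_z3_config.json  # placeholder ZeRO-3 config path

val_size: 0.05
eval_strategy: steps
eval_steps: 100
```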
Training Results
- Total Steps: 2,055
- Final Training Loss: 0.2086
- Final Evaluation Loss: 0.1190
- Training Runtime: ~104 hours
- Samples per Second: 1.049
Supported Action Types
- click: Click on specific coordinates (x, y)
- long_press: Long press action
- scroll: Scroll (up/down/left/right)
- input_text: Text input
- navigate_back: Navigate back
- navigate_home: Navigate to home screen
- open_app: Open application
- wait: Wait action
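Each prediction is a small JSON object whose fields depend on action_type (as in the click example above). As a hedged illustration only, the sketch below maps such predictions to common adb shell input commands; the field names beyond action_type/x/y (direction, text, app_name), the swipe coordinates, and the helper names are assumptions for demonstration, not part of this model or its training data.

```python
import json
import subprocess

def run_adb(*args: str) -> None:
    """Run an adb shell command (assumes adb is on PATH and a device is connected)."""
    subprocess.run(["adb", "shell", *args], check=True)

def execute_action(prediction: str) -> None:
    """Dispatch a predicted action JSON string to adb input commands (illustrative only)."""
    action = json.loads(prediction)
    kind = action["action_type"]
    if kind == "click":
        run_adb("input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "long_press":
        # A long press can be emulated as a swipe whose start and end points coincide.
        x, y = str(action["x"]), str(action["y"])
        run_adb("input", "swipe", x, y, x, y, "800")
    elif kind == "scroll":
        # Fixed-length swipes for illustration; real distances depend on screen size.
        swipes = {
            "up":    ("540", "1500", "540", "500"),
            "down":  ("540", "500", "540", "1500"),
            "left":  ("900", "1000", "200", "1000"),
            "right": ("200", "1000", "900", "1000"),
        }
        run_adb("input", "swipe", *swipes[action["direction"]])
    elif kind == "input_text":
        run_adb("input", "text", action["text"].replace(" ", "%s"))
    elif kind == "navigate_back":
        run_adb("input", "keyevent", "KEYCODE_BACK")
    elif kind == "navigate_home":
        run_adb("input", "keyevent", "KEYCODE_HOME")
    elif kind == "open_app":
        # Assumes the prediction carries a launchable package name.
        run_adb("monkey", "-p", action["app_name"], "-c", "android.intent.category.LAUNCHER", "1")
    elif kind == "wait":
        pass  # no device command; the caller can sleep between steps
```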
Usage
The merged model can be directly loaded using the Hugging Face Transformers library.
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_path = "OfficerChul/Qwen3-VL-30B-Android-Control"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare your UI screenshot and instruction
image = Image.open("path/to/screenshot.png")
instruction = "Click on the Settings button"

# Prepare conversation; structured user content lets the processor insert
# the image placeholder tokens expected by the chat template
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    },
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens (drop the prompt)
generated = outputs[:, inputs["input_ids"].shape[-1]:]
result = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(result)
```
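The generated text should be a single JSON action object in the same format as the training targets. A minimal, hedged post-processing sketch (the brace-extraction fallback is an assumption to tolerate extra text around the JSON):

```python
import json

def parse_action(result: str) -> dict:
    """Parse the model's generated text into an action dict (illustrative)."""
    text = result.strip()
    try:
        return json.loads(text)
    except ValueError:
        # Fallback: extract the first {...} span if extra text surrounds the JSON.
        start, end = text.find("{"), text.rfind("}") + 1
        return json.loads(text[start:end])

action = parse_action(result)
print(action["action_type"], action)
```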
Evaluation Results
Comprehensive Benchmark Comparison
| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) |
| OfficerChul/Qwen2.5-VL-3B-Instruct | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) |
| OfficerChul/InfiGUI-G1-3B | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) |
| OfficerChul/Qwen3-VL-30B-A3B-Instruct_lora_sft | 0.5907 | 87.04 | 0.8455 | 0.8689 |
| Qwen/Qwen2.5-VL-72B-Instruct | 0.6594 | 64.98 (n=125) | 0.8879 (n=107) | 0.2925 (n=106) |
| OfficerChul/Qwen2.5-VL-72B-Instruct | 0.8838 | 529.23 | 0.9032 | 0.9512 |
| google/gemma-3n-E4B-it | 0.5398 | 824.09 | 0.7521 | 0.5217 |
| OfficerChul/gemma-3n-E4B-it | 0.5088 | 878.66 (n=124) | 0.8763 (n=97) | 0.3689 (n=103) |
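For clarity on what the table columns measure, below is a hedged sketch of how the four metrics could be aggregated over paired predictions and references; it is not the evaluation script used for this benchmark, and the field names simply mirror the training-data format above.

```python
import math
from statistics import mean

def aggregate_metrics(predictions: list[dict], references: list[dict]) -> dict:
    """Aggregate action type accuracy, click L2 distance, and text/scroll match rates (illustrative)."""
    type_hits, click_dists, text_hits, scroll_hits = [], [], [], []
    for pred, ref in zip(predictions, references):
        type_hits.append(pred.get("action_type") == ref["action_type"])
        if ref["action_type"] == "click":
            click_dists.append(math.hypot(pred.get("x", 0) - ref["x"], pred.get("y", 0) - ref["y"]))
        elif ref["action_type"] == "input_text":
            text_hits.append(pred.get("text", "").strip() == ref["text"].strip())
        elif ref["action_type"] == "scroll":
            scroll_hits.append(pred.get("direction") == ref["direction"])
    return {
        "action_type_accuracy": mean(type_hits),
        "click_l2_distance": mean(click_dists) if click_dists else None,
        "input_text_match": mean(text_hits) if text_hits else None,
        "scroll_direction_match": mean(scroll_hits) if scroll_hits else None,
    }
```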
License
This model follows the Apache 2.0 license of the Qwen3-VL base model.
Acknowledgments
- Base model: Qwen3-VL-30B-A3B-Instruct by Qwen team
- Training framework: LLaMA-Factory
- Dataset: Android-Control-84k
Notes
- This model was developed for research purposes in mobile UI automation and accessibility enhancement
- The strong GUI grounding performance makes it suitable for applications requiring precise coordinate prediction
- Proper validation is required before use in production environments
- For best results, ensure input images are clear and at appropriate resolution
Generated with LLaMA-Factory | For questions or issues, please open an issue on the model repository.