LLaVA-OneVision: Easy Visual Task Transfer
Paper: arXiv:2408.03326
LLaVA-OneVision is an instruction-tuned multimodal vision-language model that pairs a pretrained Qwen-2 language model with a visual encoder, enabling understanding and reasoning across text and images.
Original paper: LLaVA-OneVision: Easy Visual Task Transfer
This model uses LLaVA-OneVision with Qwen-2 as the language backbone, providing rich multimodal reasoning and generation capabilities. It is well suited for applications such as image-grounded question answering, multimodal dialogue, and tasks requiring aligned understanding of visual and textual information; a minimal usage sketch is shown below.
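The sketch below illustrates image-grounded question answering through the Hugging Face `transformers` integration of LLaVA-OneVision. It is a minimal sketch under stated assumptions, not the deployment path for the configuration in the table below: the checkpoint name `llava-hf/llava-onevision-qwen2-7b-ov-hf`, the example image URL, and the generation settings are assumptions and should be replaced with the checkpoint referenced in the Model Link.

```python
# Minimal sketch: ask a question about a single image with LLaVA-OneVision.
# Assumes the transformers port of the model; the checkpoint name below is
# an assumption, swap in the checkpoint from the Model Link in this card.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load an example image (assumed URL) and pose a question grounded in it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

# Generate and decode the grounded answer.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same chat-template pattern extends to multi-turn multimodal dialogue by appending prior user and assistant turns to `conversation` before generating.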
Model Configuration:
| Model | Device | Model Link |
|---|---|---|
| LLaVA-OneVision | N1-655 | Model_Link |