Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
AI & ML interests
Computer Vision
Papers
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
- ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
InternVL3.5-Flash is a fast variant of InternVL3.5 that uses a semantic-aware dynamic high-resolution strategy.
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution (Paper • 2510.12793 • Published • 2)
- OpenGVLab/InternVL3_5-241B-A28B-Flash (Image-Text-to-Text • 242B • Updated • 111 • 4)
- OpenGVLab/InternVL3_5-38B-Flash (Image-Text-to-Text • 40B • Updated • 153 • 5)
- OpenGVLab/InternVL3_5-30B-A3B-Flash (Image-Text-to-Text • 31B • Updated • 1.77k • 5)
This collection includes all released checkpoints of InternVL3.5, covering different training stages (e.g., Pretraining, SFT, MPO, Cascade RL).
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (Paper • 2508.18265 • Published • 201)
- OpenGVLab/InternVL3_5-241B-A28B-HF (Image-Text-to-Text • 241B • Updated • 59 • 11)
- OpenGVLab/InternVL3_5-38B-HF (Image-Text-to-Text • 38B • Updated • 1.39k • 5)
- OpenGVLab/InternVL3_5-30B-A3B-HF (Image-Text-to-Text • 31B • Updated • 2.78k • 5)
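
For orientation, here is a minimal sketch of how one of the transformers-native -HF checkpoints above might be queried. It assumes the -HF suffix means the checkpoint supports the generic transformers image-text-to-text pipeline (as the InternVL transformers integration does) and that you have enough GPU memory for the chosen model; treat it as illustrative rather than official usage, and check the model card first.

```python
# Minimal sketch: querying an InternVL3.5 -HF checkpoint via the generic
# transformers "image-text-to-text" pipeline. Assumes a recent transformers
# version that supports this pipeline task and a transformers-native checkpoint.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="OpenGVLab/InternVL3_5-30B-A3B-HF",
    device_map="auto",            # shard across available GPUs
    torch_dtype=torch.bfloat16,   # half precision to fit the checkpoint
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```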
Sequential Diffusion Language Models
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
[NeurIPS 2024 Spotlight (Ranking Top 10), TPAMI 2025] Parameter-Inverted Image Pyramid Networks
- Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding (Paper • 2501.07783 • Published • 7)
- OpenGVLab/PIIP (Object Detection • Updated • 5)
- OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B (Image-Text-to-Text • 7B • Updated • 27)
- OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B (Image-Text-to-Text • 7B • Updated • 30)
- OpenGVLab/InternVideo2_5_Chat_8B (Video-Text-to-Text • 8B • Updated • 11.1k • 85)
- OpenGVLab/InternVL_2_5_HiCo_R16 (Video-Text-to-Text • 8B • Updated • 1.9k • 6)
- OpenGVLab/InternVL_2_5_HiCo_R64 (Video-Text-to-Text • 8B • Updated • 105 • 3)
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling (Paper • 2501.12386 • Published • 1)
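
The video chat checkpoints above ship their own modeling code, and each model card documents its exact inference interface. The sketch below only gestures at the common pattern (loading with remote code plus calling a model-specific chat method); the `chat` call shown is a hypothetical stand-in, not the documented signature, so follow the model card for the real API and video preprocessing.

```python
# Hedged sketch only: loading a video chat checkpoint that ships custom code.
# The exact chat/generation interface is model-specific; the `chat` call
# below is a HYPOTHETICAL stand-in for whatever the model card documents.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/InternVideo2_5_Chat_8B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

# Hypothetical call shape; consult the model card for the real signature
# and the expected frame sampling / resolution for video inputs.
response = model.chat(tokenizer, "video.mp4", "Summarize what happens in this clip.")
print(response)
```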
Faster and more powerful VideoChat.
- OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448 (Video-Text-to-Text • 2B • Updated • 648 • 26)
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res224 (Video-Text-to-Text • 8B • Updated • 130 • 7)
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res448 (Video-Text-to-Text • 8B • Updated • 2.26k • 12)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling (Paper • 2501.00574 • Published • 6)
Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (Paper • 2411.10442 • Published • 87)
- OpenGVLab/InternVL2_5-78B-MPO (Image-Text-to-Text • 78B • Updated • 80 • 54)
- OpenGVLab/InternVL2_5-38B-MPO (Image-Text-to-Text • 38B • Updated • 1.08k • 20)
- OpenGVLab/InternVL2_5-26B-MPO (Image-Text-to-Text • 26B • Updated • 86 • 14)
A Pioneering Open-Source Alternative to GPT-4V
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Paper • 2404.16821 • Published • 57)
- OpenGVLab/InternVL-Chat-V1-5 (Image-Text-to-Text • 26B • Updated • 2.66k • 416)
- OpenGVLab/InternViT-6B-448px-V1-5 (Image Feature Extraction • 6B • Updated • 1.6k • 78)
- OpenGVLab/InternViT-300M-448px (Image Feature Extraction • 0.3B • Updated • 3.9k • 60)
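
The InternViT entries above are standalone vision encoders rather than chat models. A minimal feature-extraction sketch follows; it assumes the checkpoint's remote modeling code and a CLIP-style image processor, as the model cards describe, so verify against the card before relying on it.

```python
# Minimal sketch: image feature extraction with an InternViT encoder.
# Assumes the checkpoint ships custom modeling code (hence
# trust_remote_code=True) and a CLIP-style preprocessor config.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

repo = "OpenGVLab/InternViT-300M-448px"
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()
processor = CLIPImageProcessor.from_pretrained(repo)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)

# Pooled image embedding; per-patch features are in outputs.last_hidden_state.
print(outputs.pooler_output.shape)
```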
Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
InternVideo2
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (Paper • 2403.15377 • Published • 26)
- OpenGVLab/InternVideo2-Chat-8B (Video-Text-to-Text • 8B • Updated • 237 • 23)
- OpenGVLab/InternVideo2_chat_8B_HD (Video-Text-to-Text • 8B • Updated • 119 • 18)
- OpenGVLab/InternVideo2_Chat_8B_InternLM2_5 (Video-Text-to-Text • 9B • Updated • 54 • 7)
State Space Model for Efficient Video Understanding
A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (Paper • 2211.05778 • Published)
- OpenGVLab/internimage_t_1k_224 (Image Classification • 29.9M • Updated • 140 • 2)
- OpenGVLab/internimage_s_1k_224 (Image Classification • 50.1M • Updated • 255 • 1)
- OpenGVLab/internimage_b_1k_224 (Image Classification • 97.5M • Updated • 368 • 1)
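
As a usage sketch for the InternImage classifiers listed above: these are ImageNet-1k models, and something like the following should work if the checkpoints expose a transformers image-classification interface with remote code. Both the auto-processor and the class names here are assumptions to check against the model cards.

```python
# Minimal sketch: ImageNet-1k classification with an InternImage checkpoint.
# Assumes the repo provides custom modeling and preprocessing code for
# transformers (hence trust_remote_code=True); verify on the model card.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "OpenGVLab/internimage_t_1k_224"
processor = AutoImageProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageClassification.from_pretrained(
    repo, trust_remote_code=True
).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])
```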
[NeurIPS 2025] Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
This collection includes only the InternVL3.5 checkpoints that have completed the full training pipeline (i.e., Pretraining, SFT, MPO, Cascade RL).
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (Paper • 2508.18265 • Published • 201)
- OpenGVLab/InternVL3_5-241B-A28B-HF (Image-Text-to-Text • 241B • Updated • 59 • 11)
- OpenGVLab/InternVL3_5-38B-HF (Image-Text-to-Text • 38B • Updated • 1.39k • 5)
- OpenGVLab/InternVL3_5-30B-A3B-HF (Image-Text-to-Text • 31B • Updated • 2.78k • 5)
[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (Paper • 2504.10479 • Published • 298)
- OpenGVLab/InternVL3-1B (Image-Text-to-Text • 0.9B • Updated • 69.3k • 74)
- OpenGVLab/InternVL3-2B (Image-Text-to-Text • 2B • Updated • 60.3k • 38)
- OpenGVLab/InternVL3-8B (Image-Text-to-Text • 8B • Updated • 276k • 100)
A Pioneering Monolithic MLLM
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training (Paper • 2410.08202 • Published • 4)
- Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models (Paper • 2507.12566 • Published • 14)
- OpenGVLab/Mono-InternVL-2B (Image-Text-to-Text • 3B • Updated • 8.35k • 36)
- OpenGVLab/Mono-InternVL-2B-S1-1 (Image-Text-to-Text • 3B • Updated • 56)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking (Paper • 2303.16727 • Published)
- OpenGVLab/VideoMAEv2-Base (Video Classification • 86.2M • Updated • 16k • 8)
- OpenGVLab/VideoMAEv2-Large (Video Classification • 0.3B • Updated • 5.1k • 1)
- OpenGVLab/VideoMAEv2-Huge (Video Classification • 0.6B • Updated • 388 • 1)
Better than InternVL 2.0
- InternVL (Space • 500): ⚡Interact with a multimodal chatbot that analyzes images and text
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Paper • 2412.05271 • Published • 159)
- OpenGVLab/InternVL2_5-78B (Image-Text-to-Text • 78B • Updated • 258 • 192)
- OpenGVLab/InternVL2_5-78B-AWQ (Image-Text-to-Text • Updated • 137 • 14)
Expanding Performance Boundaries of Open-Source MLLM
Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (Paper • 2312.14238 • Published • 20)
- OpenGVLab/InternViT-6B-224px (Image Feature Extraction • Updated • 87 • 24)
- OpenGVLab/InternVL-14B-224px (Image Feature Extraction • 14B • Updated • 503 • 35)
- OpenGVLab/InternVL-Chat-V1-2-Plus (Image-Text-to-Text • 40B • Updated • 94 • 34)
Adaptation Models for Specific Domains
- OpenGVLab/Mini-InternVL2-4B-DA-DriveLM (Image-Text-to-Text • 4B • Updated • 129 • 3)
- OpenGVLab/Mini-InternVL2-4B-DA-Medical (Image-Text-to-Text • 4B • Updated • 77 • 6)
- OpenGVLab/Mini-InternVL2-4B-DA-BDD (Image-Text-to-Text • 4B • Updated • 51)
- OpenGVLab/Mini-InternVL2-2B-DA-DriveLM (Image-Text-to-Text • 2B • Updated • 109)
Chat-Centric Video Understanding
A Large-Scale Video-Text Dataset
Improved Baselines with Pyramid Vision Transformer