Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
AI & ML interests
Computer Vision
Papers
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
- ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
InternVL3.5-Flash is a fast variant of InternVL3.5 that uses a semantic-aware dynamic high-resolution strategy.
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution (Paper • 2510.12793 • Published • 2)
- OpenGVLab/InternVL3_5-241B-A28B-Flash (Image-Text-to-Text • 242B • Updated • 111 • 4)
- OpenGVLab/InternVL3_5-38B-Flash (Image-Text-to-Text • 40B • Updated • 153 • 5)
- OpenGVLab/InternVL3_5-30B-A3B-Flash (Image-Text-to-Text • 31B • Updated • 1.77k • 5)
This collection includes all released checkpoints of InternVL3.5, covering different training stages (e.g., Pretraining, SFT, MPO, Cascade RL).
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (Paper • 2508.18265 • Published • 201)
- OpenGVLab/InternVL3_5-241B-A28B-HF (Image-Text-to-Text • 241B • Updated • 59 • 11)
- OpenGVLab/InternVL3_5-38B-HF (Image-Text-to-Text • 38B • Updated • 1.39k • 5)
- OpenGVLab/InternVL3_5-30B-A3B-HF (Image-Text-to-Text • 31B • Updated • 2.78k • 5)
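
For orientation, here is a minimal sketch of how one of the transformers-native -HF checkpoints above might be queried. It assumes the -HF suffix means the checkpoint supports the generic transformers image-text-to-text pipeline (as the InternVL transformers integration does) and that you have enough GPU memory for the chosen model; treat it as illustrative rather than official usage, and check the model card first.

```python
# Minimal sketch: querying an InternVL3.5 -HF checkpoint via the generic
# transformers "image-text-to-text" pipeline. Assumes a recent transformers
# version that supports this pipeline task and a transformers-native checkpoint.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="OpenGVLab/InternVL3_5-30B-A3B-HF",
    device_map="auto",            # shard across available GPUs
    torch_dtype=torch.bfloat16,   # half precision to fit the checkpoint
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```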
Sequential Diffusion Language Models
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
[NeurIPS 2024 Spotlight (Ranking Top 10), TPAMI 2025] Parameter-Inverted Image Pyramid Networks
- Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding (Paper • 2501.07783 • Published • 7)
- OpenGVLab/PIIP (Object Detection • Updated • 5)
- OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B (Image-Text-to-Text • 7B • Updated • 27)
- OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B (Image-Text-to-Text • 7B • Updated • 30)
- OpenGVLab/InternVideo2_5_Chat_8B (Video-Text-to-Text • 8B • Updated • 11.1k • 85)
- OpenGVLab/InternVL_2_5_HiCo_R16 (Video-Text-to-Text • 8B • Updated • 1.9k • 6)
- OpenGVLab/InternVL_2_5_HiCo_R64 (Video-Text-to-Text • 8B • Updated • 105 • 3)
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling (Paper • 2501.12386 • Published • 1)
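
The video chat checkpoints above ship their own modeling code, and each model card documents its exact inference interface. The sketch below only gestures at the common pattern (loading with remote code plus calling a model-specific chat method); the `chat` call shown is a hypothetical stand-in, not the documented signature, so follow the model card for the real API and video preprocessing.

```python
# Hedged sketch only: loading a video chat checkpoint that ships custom code.
# The exact chat/generation interface is model-specific; the `chat` call
# below is a HYPOTHETICAL stand-in for whatever the model card documents.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/InternVideo2_5_Chat_8B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

# Hypothetical call shape; consult the model card for the real signature
# and the expected frame sampling / resolution for video inputs.
response = model.chat(tokenizer, "video.mp4", "Summarize what happens in this clip.")
print(response)
```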
Faster and more powerful VideoChat.
- OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448 (Video-Text-to-Text • 2B • Updated • 648 • 26)
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res224 (Video-Text-to-Text • 8B • Updated • 130 • 7)
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res448 (Video-Text-to-Text • 8B • Updated • 2.26k • 12)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling (Paper • 2501.00574 • Published • 6)
Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (Paper • 2411.10442 • Published • 87)
- OpenGVLab/InternVL2_5-78B-MPO (Image-Text-to-Text • 78B • Updated • 80 • 54)
- OpenGVLab/InternVL2_5-38B-MPO (Image-Text-to-Text • 38B • Updated • 1.08k • 20)
- OpenGVLab/InternVL2_5-26B-MPO (Image-Text-to-Text • 26B • Updated • 86 • 14)
A Pioneering Open-Source Alternative to GPT-4V
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Paper • 2404.16821 • Published • 57)
- OpenGVLab/InternVL-Chat-V1-5 (Image-Text-to-Text • 26B • Updated • 2.66k • 416)
- OpenGVLab/InternViT-6B-448px-V1-5 (Image Feature Extraction • 6B • Updated • 1.6k • 78)
- OpenGVLab/InternViT-300M-448px (Image Feature Extraction • 0.3B • Updated • 3.9k • 60)
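
The InternViT entries above are standalone vision encoders rather than chat models. A minimal feature-extraction sketch follows; it assumes the checkpoint's remote modeling code and a CLIP-style image processor, as the model cards describe, so verify against the card before relying on it.

```python
# Minimal sketch: image feature extraction with an InternViT encoder.
# Assumes the checkpoint ships custom modeling code (hence
# trust_remote_code=True) and a CLIP-style preprocessor config.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

repo = "OpenGVLab/InternViT-300M-448px"
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()
processor = CLIPImageProcessor.from_pretrained(repo)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)

# Pooled image embedding; per-patch features are in outputs.last_hidden_state.
print(outputs.pooler_output.shape)
```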
Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
InternVideo2
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (Paper • 2403.15377 • Published • 26)
- OpenGVLab/InternVideo2-Chat-8B (Video-Text-to-Text • 8B • Updated • 237 • 23)
- OpenGVLab/InternVideo2_chat_8B_HD (Video-Text-to-Text • 8B • Updated • 119 • 18)
- OpenGVLab/InternVideo2_Chat_8B_InternLM2_5 (Video-Text-to-Text • 9B • Updated • 54 • 7)
State Space Model for Efficient Video Understanding
A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (Paper • 2211.05778 • Published)
- OpenGVLab/internimage_t_1k_224 (Image Classification • 29.9M • Updated • 140 • 2)
- OpenGVLab/internimage_s_1k_224 (Image Classification • 50.1M • Updated • 255 • 1)
- OpenGVLab/internimage_b_1k_224 (Image Classification • 97.5M • Updated • 368 • 1)
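
As a usage sketch for the InternImage classifiers listed above: these are ImageNet-1k models, and something like the following should work if the checkpoints expose a transformers image-classification interface with remote code. Both the auto-processor and the class names here are assumptions to check against the model cards.

```python
# Minimal sketch: ImageNet-1k classification with an InternImage checkpoint.
# Assumes the repo provides custom modeling and preprocessing code for
# transformers (hence trust_remote_code=True); verify on the model card.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "OpenGVLab/internimage_t_1k_224"
processor = AutoImageProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageClassification.from_pretrained(
    repo, trust_remote_code=True
).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])
```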
[NeurIPS 2025] Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
This collection includes only the InternVL3.5 checkpoints that have completed the full training pipeline (i.e., Pretraining, SFT, MPO, Cascade RL).
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (Paper • 2508.18265 • Published • 201)
- OpenGVLab/InternVL3_5-241B-A28B-HF (Image-Text-to-Text • 241B • Updated • 59 • 11)
- OpenGVLab/InternVL3_5-38B-HF (Image-Text-to-Text • 38B • Updated • 1.39k • 5)
- OpenGVLab/InternVL3_5-30B-A3B-HF (Image-Text-to-Text • 31B • Updated • 2.78k • 5)
[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (Paper • 2504.10479 • Published • 298)
- OpenGVLab/InternVL3-1B (Image-Text-to-Text • 0.9B • Updated • 69.3k • 74)
- OpenGVLab/InternVL3-2B (Image-Text-to-Text • 2B • Updated • 60.3k • 38)
- OpenGVLab/InternVL3-8B (Image-Text-to-Text • 8B • Updated • 276k • 100)
A Pioneering Monolithic MLLM
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training (Paper • 2410.08202 • Published • 4)
- Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models (Paper • 2507.12566 • Published • 14)
- OpenGVLab/Mono-InternVL-2B (Image-Text-to-Text • 3B • Updated • 8.35k • 36)
- OpenGVLab/Mono-InternVL-2B-S1-1 (Image-Text-to-Text • 3B • Updated • 56)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking (Paper • 2303.16727 • Published)
- OpenGVLab/VideoMAEv2-Base (Video Classification • 86.2M • Updated • 16k • 8)
- OpenGVLab/VideoMAEv2-Large (Video Classification • 0.3B • Updated • 5.1k • 1)
- OpenGVLab/VideoMAEv2-Huge (Video Classification • 0.6B • Updated • 388 • 1)
Better than InternVL 2.0
- InternVL (Space • 500): ⚡Interact with a multimodal chatbot that analyzes images and text
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Paper • 2412.05271 • Published • 159)
- OpenGVLab/InternVL2_5-78B (Image-Text-to-Text • 78B • Updated • 258 • 192)
- OpenGVLab/InternVL2_5-78B-AWQ (Image-Text-to-Text • Updated • 137 • 14)
Expanding Performance Boundaries of Open-Source MLLM
Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (Paper • 2312.14238 • Published • 20)
- OpenGVLab/InternViT-6B-224px (Image Feature Extraction • Updated • 87 • 24)
- OpenGVLab/InternVL-14B-224px (Image Feature Extraction • 14B • Updated • 503 • 35)
- OpenGVLab/InternVL-Chat-V1-2-Plus (Image-Text-to-Text • 40B • Updated • 94 • 34)
Adaptation Models for Specific Domains
- OpenGVLab/Mini-InternVL2-4B-DA-DriveLM (Image-Text-to-Text • 4B • Updated • 129 • 3)
- OpenGVLab/Mini-InternVL2-4B-DA-Medical (Image-Text-to-Text • 4B • Updated • 77 • 6)
- OpenGVLab/Mini-InternVL2-4B-DA-BDD (Image-Text-to-Text • 4B • Updated • 51)
- OpenGVLab/Mini-InternVL2-2B-DA-DriveLM (Image-Text-to-Text • 2B • Updated • 109)
Chat-Centric Video Understanding
A Large-Scale Video-Text Dataset
Improved Baselines with Pyramid Vision Transformer