Spaces:
Running
on
Zero
Running
on
Zero
| title: MiniCPM-V-4.5 Multimodal Chat | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 5.44.0 | |
| app_file: app.py | |
| pinned: false | |
| license: apache-2.0 | |
| # MiniCPM-V-4.5 Multimodal Chat π | |
| A powerful Gradio interface for the MiniCPM-V-4.5 multimodal model - a GPT-4V level MLLM with only 8B parameters! | |
| ## Features | |
| - πΈ **Image Understanding**: Analyze single or multiple images with high-resolution support (up to 1.8M pixels) | |
| - π₯ **Video Understanding**: Process videos with high refresh rate (up to 10 FPS) and efficient compression | |
| - π **Document Parsing**: Strong OCR capabilities and PDF document parsing | |
| - π§ **Thinking Modes**: Choose between fast thinking for efficiency or deep thinking for complex problems | |
| - π **Multilingual**: Support for 30+ languages | |
| - βοΈ **Customizable**: Adjust FPS, context size, temperature, and system prompts | |
| ## Model Capabilities | |
| MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks: | |
| - Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks | |
| - Leading OCR performance on OCRBench | |
| - Efficient video token compression (96x rate) | |
| - Trustworthy behaviors with multilingual support | |
| ## Usage | |
| 1. **Upload**: Choose an image or video file | |
| 2. **Configure**: Adjust settings like FPS (for videos), context size, and temperature | |
| 3. **Prompt**: Enter your question or use the system prompt for specific instructions | |
| 4. **Generate**: Click the generate button to get the model's response | |
| ## Examples | |
| - "What objects do you see in this image?" | |
| - "Describe the main action happening in this video" | |
| - "Read and transcribe any text visible in the image" | |
| - "Analyze this image from an artistic perspective" | |
| ## Technical Details | |
| - **Architecture**: Built on Qwen3-8B and SigLIP2-400M | |
| - **Parameters**: 8B total parameters | |
| - **Video Processing**: 3D-Resampler with temporal understanding | |
| - **Resolution**: Supports images up to 1344x1344 pixels | |
| - **Efficiency**: 4x fewer visual tokens than most MLLMs | |
| ## License | |
| This model is released under the MiniCPM Model License. Free for academic research and commercial use after registration. | |
| ## Citation | |
| ```bibtex | |
| @article{yao2024minicpm, | |
| title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, | |
| author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, | |
| journal={Nat Commun 16, 5509 (2025)}, | |
| year={2025} | |
| } | |
| ``` |