Spaces:

orrzxz
/

MiniCPM-V-4_5

Running on Zero

App Files Files Community

MiniCPM-V-4_5 / README.md

orrzxz

Update README.md

fcbedeb verified 2 months ago

preview code

raw

history blame contribute delete

2.49 kB

	---
	title: MiniCPM-V-4.5 Multimodal Chat
	emoji: 🚀
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 5.44.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	---

	# MiniCPM-V-4.5 Multimodal Chat 🚀

	A powerful Gradio interface for the MiniCPM-V-4.5 multimodal model - a GPT-4V level MLLM with only 8B parameters!

	## Features

	- 📸 Image Understanding: Analyze single or multiple images with high-resolution support (up to 1.8M pixels)
	- 🎥 Video Understanding: Process videos with high refresh rate (up to 10 FPS) and efficient compression
	- 📄 Document Parsing: Strong OCR capabilities and PDF document parsing
	- 🧠 Thinking Modes: Choose between fast thinking for efficiency or deep thinking for complex problems
	- 🌍 Multilingual: Support for 30+ languages
	- ⚙️ Customizable: Adjust FPS, context size, temperature, and system prompts

	## Model Capabilities

	MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks:
	- Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks
	- Leading OCR performance on OCRBench
	- Efficient video token compression (96x rate)
	- Trustworthy behaviors with multilingual support

	## Usage

	1. Upload: Choose an image or video file
	2. Configure: Adjust settings like FPS (for videos), context size, and temperature
	3. Prompt: Enter your question or use the system prompt for specific instructions
	4. Generate: Click the generate button to get the model's response

	## Examples

	- "What objects do you see in this image?"
	- "Describe the main action happening in this video"
	- "Read and transcribe any text visible in the image"
	- "Analyze this image from an artistic perspective"

	## Technical Details

	- Architecture: Built on Qwen3-8B and SigLIP2-400M
	- Parameters: 8B total parameters
	- Video Processing: 3D-Resampler with temporal understanding
	- Resolution: Supports images up to 1344x1344 pixels
	- Efficiency: 4x fewer visual tokens than most MLLMs

	## License

	This model is released under the MiniCPM Model License. Free for academic research and commercial use after registration.

	## Citation

	```bibtex
	@article{yao2024minicpm,
	title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
	author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
	journal={Nat Commun 16, 5509 (2025)},
	year={2025}
	}
	```