SmolVLM-0.6B-Cap: A Small yet Optimized VLM for Image Captioning
The ideas behind this model are drawn from the open-source project Qwen3-Smol on GitHub.
However, the original model is not packaged as an integrated model that can be loaded with the Transformers library via model = AutoModelForImageTextToText.from_pretrained().
In this repository, we propose SmolVLM-0.6B-Cap, which is built upon the Qwen2.5-0.5B-Instruct model and the vision encoder of HuggingFaceTB SmolVLM-0.5B-Instruct. We highlight the following strengths of our model:
- Small size: SmolVLM-0.6B-Cap has no more than 0.6B parameters, about 1.5 times smaller than the original model.
- Composited Structure: Following SOTA VLMs such as LLaVA, Qwen-VL, GLM-V, and SmolVLM, SmolVLM-0.6B-Cap adopts a sandwich structure consisting of a vision encoder, a text encoder, and an MLP connector (see the structural sketch after this list). Notably, the vision encoder is drawn from the HuggingFaceTB SmolVLM-0.5B-Instruct model, and the text encoder is the Qwen2.5-0.5B-Instruct model.
- Bilingual support: Unlike the English-only original SmolVLM models, SmolVLM-0.6B-Cap supports both Chinese and English captions.
- Optimized Capacity: Despite its modest size, SmolVLM-0.6B-Cap is optimized on massive image-caption pairs in both Chinese and English, demonstrating significant advantages over generalist VLMs with several times more parameters.
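To make the composited structure concrete, here is a minimal sketch of how the vision encoder, MLP connector, and language model fit together. The class and attribute names are illustrative assumptions, not the actual implementation; the vision encoder here is any module mapping pixel values to patch features.

import torch
import torch.nn as nn

class CaptionVLMSketch(nn.Module):
    """Illustrative sandwich structure: vision encoder -> MLP connector -> language model."""
    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g., the SmolVLM-0.5B-Instruct vision tower
        self.connector = nn.Sequential(         # MLP connector projecting image features
            nn.Linear(vision_dim, text_dim),    # into the language model's embedding space
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model    # e.g., the Qwen2.5-0.5B-Instruct backbone

    def forward(self, pixel_values, input_ids):
        image_features = self.vision_encoder(pixel_values)             # (B, N_img, vision_dim)
        image_embeds = self.connector(image_features)                  # (B, N_img, text_dim)
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # prepend image tokens
        return self.language_model(inputs_embeds=inputs_embeds)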
Comparisons
| Model | Params | BLEU | BERTScore | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| Ovis2.5 | 2B | 0.3402 | 0.8393 | 0.4877 | 0.2000 | 0.3320 |
| Qwen2.5-VL | 3B | 0.4394 | 0.8623 | 0.5208 | 0.2260 | 0.3555 |
| GLM-4.1V | 9B | 0.4016 | 0.8731 | 0.4847 | 0.1745 | 0.3158 |
| SmolVLM-0.6B-Cap | 0.6B | 0.4927 | 0.8706 | 0.5519 | 0.3045 | 0.4133 |
Notes: 500 images randomly sampled from the SA1B dataset are used for evaluation.
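The metrics above (BLEU, BERTScore, ROUGE) can be reproduced along the following lines with the Hugging Face evaluate library; the exact evaluation script is not released with this card, so the library choice and settings below are assumptions.

import evaluate

# Toy prediction/reference pair; in practice these are the 500 sampled captions.
predictions = ["a dog runs across a grassy field"]
references = [["a dog is running on the grass"]]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=[r[0] for r in references])
bertscore = evaluate.load("bertscore").compute(predictions=predictions,
                                               references=[r[0] for r in references],
                                               lang="en")

print("BLEU:", bleu["bleu"])
print("ROUGE-1/2/L:", rouge["rouge1"], rouge["rouge2"], rouge["rougeL"])
print("BERTScore F1:", sum(bertscore["f1"]) / len(bertscore["f1"]))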
Usage
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

device = "cuda:0"
model_path = 'Jax-dan/SmolVLM-0.6B-Cap'

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

image = Image.open("<your_image_path>").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "请描述这张图片"},  # "Please describe this image"
        ],
    },
]

# Apply the chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Prepare the inputs
inputs = processor(
    text=text,
    images=[image],
    return_tensors="pt"
)

# Generate the caption
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)

# Decode only the newly generated tokens
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated_ids, skip_special_tokens=True)
print(response)
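Because the model is bilingual, an English prompt works as well; the exact wording is not prescribed by this card, so the instruction below is just an example:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]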
Future Work
While SmolVLM-0.6B-Cap is positioned as a foundational model for image captioning, it can be further extended to any image-text-to-text task. The following items will soon be added to this collection:
- Instruct Tuning: Autonomous generation of instruction data and self-tuning with the SmolVLM-0.6B model.
- In-context Image Caption: Tuning the SmolVLM-0.6B model to caption images embedded in a long document context.