SmolVLM-0.6B-Cap: A Small yet Optimized VLM for Image Captioning

This model builds on ideas from the open-source Qwen3-Smol project on GitHub. However, the original model is not packaged as an integrated model that can be loaded with the Transformers library via model = AutoModelForImageTextToText.from_pretrained().

Caption comparison on two sample images (image thumbnails omitted):

Example 1 (sunset over water):
  • Qwen2.5-VL-3B: This image captures a serene sunset over a body of water, likely a lake or sea.
  • Ovis2.5-2B: The image captures a stunning sunset over a body of water, likely a lake or sea.
  • SmolVLM-0.6B-Cap: This photograph captures a serene scene of a calm lake at sunset.

Example 2 (white cartoon puppy; Chinese captions, English translations in parentheses):
  • Qwen2.5-VL-3B: 这是一张非常可爱的卡通狗狗图片。狗狗全身是白色的,眼睛大而明亮,呈现出蓝色的光泽。 (This is a very cute cartoon dog picture. The dog is all white, with big, bright eyes that give off a blue sheen.)
  • Ovis2.5-2B: 这张图片展示了一只可爱的白色小狗,它有着大大的蓝色眼睛和粉红色的舌头,显得非常活泼和快乐。 (This picture shows a cute white puppy with big blue eyes and a pink tongue, looking very lively and happy.)
  • SmolVLM-0.6B-Cap: 这是一张展示一只白色的雪白毛色的小狗在蓝色背景上的照片。小狗穿着一件蓝色的连衣裙,它的眼睛明亮而好奇,嘴巴张开,露出微笑。背景中有一些白色的花朵和星星,营造出一种宁静而美丽的氛围。 (This is a photo of a snow-white puppy against a blue background. The puppy is wearing a blue dress; its eyes are bright and curious, and its mouth is open in a smile. There are white flowers and stars in the background, creating a calm and beautiful atmosphere.)

In this repository, we propose SmolVLM-0.6B-Cap, which is built upon the Qwen2.5-0.5B-Instruct model and the vision encoder of HuggingFaceTB's SmolVLM-0.5B-Instruct. We highlight the following strengths of our model:

  • Small size: SmolVLM-0.6B-Cap has no more than 0.6B parameters, roughly 1.5 times fewer than the original model.
  • Composite structure: Following SOTA VLMs such as LLaVA, Qwen-VL, GLM-V, and SmolVLM, SmolVLM-0.6B-Cap adopts a sandwich architecture consisting of a vision encoder, a language model, and an MLP connector (see the sketch after this list). The vision encoder is taken from the HuggingFaceTB SmolVLM-0.5B-Instruct model, and the language model is Qwen2.5-0.5B-Instruct.
  • Bilingual support: Unlike the English-only original SmolVLM models, SmolVLM-0.6B-Cap supports both Chinese and English captions.
  • Optimized capacity: Despite its small size, SmolVLM-0.6B-Cap is trained on a large corpus of Chinese and English image-caption pairs, and it shows clear advantages over generalist VLMs with several times as many parameters.
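
The composite structure can be illustrated in code. The sketch below is a minimal, simplified rendering of the vision-encoder / MLP-connector / language-model sandwich; the class and attribute names (SandwichVLM, vision_encoder, connector, language_model) are illustrative and do not necessarily match the released checkpoint, which in practice splices image embeddings at image-token placeholder positions rather than simply prepending them.

import torch
import torch.nn as nn

class SandwichVLM(nn.Module):
    """Simplified sketch of the vision encoder + MLP connector + LM sandwich."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., the SmolVLM vision tower
        self.connector = nn.Sequential(           # MLP mapping vision features
            nn.Linear(vision_dim, text_dim),      # into the LM embedding space
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model      # e.g., Qwen2.5-0.5B-Instruct

    def forward(self, pixel_values, input_ids, attention_mask):
        # 1) Encode the image into a sequence of patch features: (B, N_img, vision_dim).
        vision_feats = self.vision_encoder(pixel_values)
        # 2) Project the patch features into the LM embedding space: (B, N_img, text_dim).
        image_embeds = self.connector(vision_feats)
        # 3) Prepend the image embeddings to the token embeddings and run the LM.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        image_mask = torch.ones(
            image_embeds.shape[:2], dtype=attention_mask.dtype, device=attention_mask.device
        )
        full_mask = torch.cat([image_mask, attention_mask], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds, attention_mask=full_mask)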

Comparisons

Model Params BLEU BERTScore ROUGE-1 ROUGE-2 ROUGE-L
Ovis2.5 2B 0.3402 0.8393 0.4877 0.2000 0.3320
Qwen2.5-VL 3B 0.4394 0.8623 0.5208 0.2260 0.3555
GLM-4.1V 9B 0.4016 0.8731 0.4847 0.1745 0.3158
SmolVLM-Cap 0.6B 0.4927 0.8706 0.5519 0.3045 0.4133

Notes: 500 images randomly sampled from the SA1B dataset are used for evaluation.
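
The exact evaluation script is not included in this repository. The snippet below is a minimal sketch, assuming the Hugging Face evaluate library, of how BLEU, ROUGE, and BERTScore could be computed from generated and reference captions; the two caption lists are placeholder data.

import evaluate

# Placeholder data: generated captions and reference captions for the sampled images.
predictions = ["This photograph captures a serene scene of a calm lake at sunset."]
references = ["A calm lake at sunset under a colorful sky."]

bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[r] for r in references]
)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print("BLEU:", bleu["bleu"])
print("ROUGE-1/2/L:", rouge["rouge1"], rouge["rouge2"], rouge["rougeL"])
print("BERTScore F1:", sum(bertscore["f1"]) / len(bertscore["f1"]))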

Usage

from transformers import AutoProcessor, AutoModelForImageTextToText

from PIL import Image
import torch

device = "cuda:0"

model_path = 'Jax-dan/SmolVLM-0.6B-Cap'
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

image = Image.open("<your_image_path>").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "请描述这张图片"},
        ],
    },
]

# Apply the chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Preprocess the text and image inputs
inputs = processor(
    text=text,
    images=[image],
    return_tensors="pt"
)

# Generate the caption
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)

# Decode only the newly generated tokens
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated_ids, skip_special_tokens=True)
print(response)
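
Since the model also supports English captions, the same chat format can be reused with an English instruction; the prompt wording below is only an example, and the remaining steps are unchanged.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]
# The subsequent steps (apply_chat_template, processor, generate, decode) stay the same.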

Future Works

While SmolVLM-0.6B-Cap is positioned as a foundation model for image captioning, it can be further extended to other image-text-to-text tasks. The following work will soon be added to this collection:

  • Instruction tuning: Autonomous generation of instruction data and self-tuning of the SmolVLM-0.6B model.
  • In-context image captioning: Tuning the SmolVLM-0.6B model to caption images embedded in a long document context.