SmolVLM-0.6B-Cap: A Small yet Optimized VLM for Image Captioning

This model builds on ideas from the open-source Qwen3-Smol project on GitHub. However, the original model is not packaged as an integrated model that can be loaded with the Transformers library via model = AutoModelForImageTextToText.from_pretrained().

Caption comparison on two sample images (image thumbnails omitted):

Example 1 (sunset over water):
  • Qwen2.5-VL-3B: This image captures a serene sunset over a body of water, likely a lake or sea.
  • Ovis2.5-2B: The image captures a stunning sunset over a body of water, likely a lake or sea.
  • SmolVLM-0.6B-Cap: This photograph captures a serene scene of a calm lake at sunset.

Example 2 (white cartoon puppy; Chinese captions, English translations in parentheses):
  • Qwen2.5-VL-3B: 这是一张非常可爱的卡通狗狗图片。狗狗全身是白色的,眼睛大而明亮,呈现出蓝色的光泽。 (This is a very cute cartoon dog picture. The dog is all white, with big, bright eyes that give off a blue sheen.)
  • Ovis2.5-2B: 这张图片展示了一只可爱的白色小狗,它有着大大的蓝色眼睛和粉红色的舌头,显得非常活泼和快乐。 (This picture shows a cute white puppy with big blue eyes and a pink tongue, looking very lively and happy.)
  • SmolVLM-0.6B-Cap: 这是一张展示一只白色的雪白毛色的小狗在蓝色背景上的照片。小狗穿着一件蓝色的连衣裙,它的眼睛明亮而好奇,嘴巴张开,露出微笑。背景中有一些白色的花朵和星星,营造出一种宁静而美丽的氛围。 (This is a photo of a snow-white puppy against a blue background. The puppy is wearing a blue dress; its eyes are bright and curious, and its mouth is open in a smile. There are white flowers and stars in the background, creating a calm and beautiful atmosphere.)

In this repository, we propose SmolVLM-0.6B-Cap, which is built upon the Qwen2.5-0.5B-Instruct model and the vision encoder of HuggingFaceTB's SmolVLM-0.5B-Instruct. We highlight the following strengths of our model:

  • Small size: SmolVLM-0.6B-Cap has no more than 0.6B parameters, roughly 1.5 times fewer than the original model.
  • Composite structure: Following SOTA VLMs such as LLaVA, Qwen-VL, GLM-V, and SmolVLM, SmolVLM-0.6B-Cap adopts a sandwich architecture consisting of a vision encoder, a language model, and an MLP connector (see the sketch after this list). The vision encoder is taken from the HuggingFaceTB SmolVLM-0.5B-Instruct model, and the language model is Qwen2.5-0.5B-Instruct.
  • Bilingual support: Unlike the English-only original SmolVLM models, SmolVLM-0.6B-Cap supports both Chinese and English captions.
  • Optimized capacity: Despite its small size, SmolVLM-0.6B-Cap is trained on a large corpus of Chinese and English image-caption pairs, and it shows clear advantages over generalist VLMs with several times as many parameters.
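
The composite structure can be illustrated in code. The sketch below is a minimal, simplified rendering of the vision-encoder / MLP-connector / language-model sandwich; the class and attribute names (SandwichVLM, vision_encoder, connector, language_model) are illustrative and do not necessarily match the released checkpoint, which in practice splices image embeddings at image-token placeholder positions rather than simply prepending them.

import torch
import torch.nn as nn

class SandwichVLM(nn.Module):
    """Simplified sketch of the vision encoder + MLP connector + LM sandwich."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., the SmolVLM vision tower
        self.connector = nn.Sequential(           # MLP mapping vision features
            nn.Linear(vision_dim, text_dim),      # into the LM embedding space
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model      # e.g., Qwen2.5-0.5B-Instruct

    def forward(self, pixel_values, input_ids, attention_mask):
        # 1) Encode the image into a sequence of patch features: (B, N_img, vision_dim).
        vision_feats = self.vision_encoder(pixel_values)
        # 2) Project the patch features into the LM embedding space: (B, N_img, text_dim).
        image_embeds = self.connector(vision_feats)
        # 3) Prepend the image embeddings to the token embeddings and run the LM.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        image_mask = torch.ones(
            image_embeds.shape[:2], dtype=attention_mask.dtype, device=attention_mask.device
        )
        full_mask = torch.cat([image_mask, attention_mask], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds, attention_mask=full_mask)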

Comparisons

Model Params BLEU BERTScore ROUGE-1 ROUGE-2 ROUGE-L
Ovis2.5 2B 0.3402 0.8393 0.4877 0.2000 0.3320
Qwen2.5-VL 3B 0.4394 0.8623 0.5208 0.2260 0.3555
GLM-4.1V 9B 0.4016 0.8731 0.4847 0.1745 0.3158
SmolVLM-Cap 0.6B 0.4927 0.8706 0.5519 0.3045 0.4133

Notes: 500 images randomly sampled from the SA1B dataset are used for evaluation.
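
The exact evaluation script is not included in this repository. The snippet below is a minimal sketch, assuming the Hugging Face evaluate library, of how BLEU, ROUGE, and BERTScore could be computed from generated and reference captions; the two caption lists are placeholder data.

import evaluate

# Placeholder data: generated captions and reference captions for the sampled images.
predictions = ["This photograph captures a serene scene of a calm lake at sunset."]
references = ["A calm lake at sunset under a colorful sky."]

bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[r] for r in references]
)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print("BLEU:", bleu["bleu"])
print("ROUGE-1/2/L:", rouge["rouge1"], rouge["rouge2"], rouge["rougeL"])
print("BERTScore F1:", sum(bertscore["f1"]) / len(bertscore["f1"]))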

Usage

from transformers import AutoProcessor, AutoModelForImageTextToText

from PIL import Image
import torch

device = "cuda:0"

model_path = 'Jax-dan/SmolVLM-0.6B-Cap'
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

image = Image.open("<your_image_path>").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "请描述这张图片"},
        ],
    },
]

# Apply the chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Preprocess the text and image inputs
inputs = processor(
    text=text,
    images=[image],
    return_tensors="pt"
)

# Generate the caption
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)

# Decode only the newly generated tokens
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated_ids, skip_special_tokens=True)
print(response)
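
Since the model also supports English captions, the same chat format can be reused with an English instruction; the prompt wording below is only an example, and the remaining steps are unchanged.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]
# The subsequent steps (apply_chat_template, processor, generate, decode) stay the same.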

Future Works

While SmolVLM-0.6B-Cap is positioned as a foundation model for image captioning, it can be further extended to other image-text-to-text tasks. The following work will soon be added to this collection:

  • Instruction tuning: Autonomous generation of instruction data and self-tuning of the SmolVLM-0.6B model.
  • In-context image captioning: Tuning the SmolVLM-0.6B model to caption images embedded in a long document context.