# Model Card for Qwen3-VL-8B-catmus-medieval

This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct for transcribing medieval manuscripts from images. It was trained with TRL on the CATMuS/medieval dataset.
## Model Description

This vision-language model specializes in transcribing text from images of medieval manuscripts. Given an image of manuscript text, the model generates the corresponding transcription.
## Performance

The model was evaluated on 100 examples from the CATMuS/medieval dataset (test split).

### Metrics
| Metric | Base Model | Fine-tuned Model | Relative Error Reduction |
|---|---|---|---|
| Character Error Rate (CER) | 0.3778 (37.78%) | 0.1997 (19.97%) | 47.14% |
| Word Error Rate (WER) | 0.8300 (83.00%) | 0.5457 (54.57%) | 34.25% |
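The evaluation script is not included in this card. As a point of reference, the sketch below shows how CER and WER can be computed with the `jiwer` library; the use of `jiwer` here is an assumption, not the card's documented tooling.

```python
# Sketch: scoring transcriptions with jiwer (assumed tooling, not from the card).
import jiwer

# Reference transcriptions and model outputs (strings taken from Example 1 below).
references = ["paulꝯ ad thessalonicenses .iii."]
hypotheses = ["paulꝰ ad thessalonicenses .iii."]

cer = jiwer.cer(references, hypotheses)  # character error rate
wer = jiwer.wer(references, hypotheses)  # word error rate
print(f"CER: {cer:.4f}  WER: {wer:.4f}")
```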
### Sample Predictions

Here are some example transcriptions comparing the base model and fine-tuned model:
**Example 1:**
- Reference: paulꝯ ad thessalonicenses .iii.
- Base Model: paul9adthellalomconceB·iii·
- Fine-tuned Model: paulꝰ ad thessalonicenses .iii.
**Example 2:**
- Reference: acceptad mi humilde seruicio. e dissipad. e plantad en el
- Base Model: acceptad mi humilde servicio, è dissipad, è plantat en el
- Fine-tuned Model: acceptad mi humilde seruicio. e dissipad. e plantad en el
**Example 3:**
- Reference: ꝙ mattheus illam dictionem ponat
- Base Model: q mattheus illam dictionem ponat
- Fine-tuned Model: ꝙ mattheus illam dictiõnem ponat
**Example 4:**
- Reference: Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- Base Model: fuge quoniam cade hic quia tama ferebar.
- Fine-tuned Model: Fuge qd̾ uoneas. eadẽ ħ ꝗꝗ sana ferebat:
**Example 5:**
- Reference: a prima coniugatione ue
- Base Model: aprimaconiugazioneue
- Fine-tuned Model: a prima coniugatione ue
## Quick start

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load the base model and attach the fine-tuned LoRA adapter
base_model = "Qwen/Qwen3-VL-8B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-8B-catmus"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the chat message with the image and the transcription prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Tokenize the prompt and generate the transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcription)
```
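If you plan to run many transcriptions, you can optionally fold the adapter weights into the base model using peft's standard `merge_and_unload()` method; this step is not part of the card's instructions, just a common convenience, and the output directory name below is a placeholder.

```python
# Optional: merge the LoRA weights into the base model for faster inference.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen3-vl-8b-catmus-merged")  # placeholder path
processor.save_pretrained("qwen3-vl-8b-catmus-merged")
```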
## Use Cases

This model is designed for:

- Transcribing medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Handwritten text recognition (HTR/OCR) for specialized historical texts
## Training procedure
This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-8B-Instruct base model.
### Training Data

The model was trained on CATMuS/medieval, a dataset containing images of medieval manuscript text with corresponding transcriptions; the corpus includes Latin as well as other languages written in Latin scripts (see Example 2 above).
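The dataset is hosted on the Hugging Face Hub and can be loaded with the `datasets` library. The snippet below only inspects the splits, since the exact column schema is documented on the dataset card rather than here.

```python
from datasets import load_dataset

# Load CATMuS/medieval from the Hugging Face Hub.
dataset = load_dataset("CATMuS/medieval")
print(dataset)  # shows available splits and column names
```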
### Training Configuration
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Training Method: Supervised Fine-Tuning (SFT) with LoRA
- LoRA Configuration:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- Training Arguments:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16
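The training script itself is not reproduced in this card. As a sketch of how the hyperparameters above map onto peft and TRL objects, one plausible configuration looks like this; the output directory is a placeholder, and any argument not listed above is a default or an assumption.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA setup mirroring the values listed above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Training arguments mirroring the values listed above.
training_args = SFTConfig(
    output_dir="qwen3-vl-8b-catmus",  # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    fp16=True,  # mixed precision
)
```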
### Framework versions
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1
## Limitations

- The model is specialized for medieval manuscripts and may not perform well on other types of text or images
- Performance may vary depending on image quality, resolution, and handwriting style
- The model has been trained on a specific dataset and may require fine-tuning for other manuscript collections
## Citations

If you use this model, please cite the base model and the training framework:

**Qwen3-VL**
```bibtex
@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}
```
**TRL (Transformer Reinforcement Learning)**

```bibtex
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year = 2020,
  journal = {GitHub repository},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
README generated automatically on 2025-10-24 10:40:41