# Model Card for Qwen3-VL-8B-catmus-medieval

This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct for transcribing medieval manuscripts from images. It was trained with TRL on the CATMuS/medieval dataset.
## Model Description

This vision-language model specializes in transcribing text from images of medieval manuscripts. Given an image of manuscript text, the model generates the corresponding transcription.
## Performance

The model was evaluated on 100 examples from the CATMuS/medieval dataset (test split).

### Metrics
| Metric | Base Model | Fine-tuned Model | Relative Error Reduction |
|---|---|---|---|
| Character Error Rate (CER) | 0.3778 (37.78%) | 0.1997 (19.97%) | 47.14% |
| Word Error Rate (WER) | 0.8300 (83.00%) | 0.5457 (54.57%) | 34.25% |
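The evaluation script is not included in this card. As a point of reference, the sketch below shows how CER and WER can be computed with the `jiwer` library; the use of `jiwer` here is an assumption, not the card's documented tooling.

```python
# Sketch: scoring transcriptions with jiwer (assumed tooling, not from the card).
import jiwer

# Reference transcriptions and model outputs (strings taken from Example 1 below).
references = ["paulꝯ ad thessalonicenses .iii."]
hypotheses = ["paulꝰ ad thessalonicenses .iii."]

cer = jiwer.cer(references, hypotheses)  # character error rate
wer = jiwer.wer(references, hypotheses)  # word error rate
print(f"CER: {cer:.4f}  WER: {wer:.4f}")
```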
### Sample Predictions

Here are some example transcriptions comparing the base model and fine-tuned model:
**Example 1:**
- Reference: paulꝯ ad thessalonicenses .iii.
- Base Model: paul9adthellalomconceB·iii·
- Fine-tuned Model: paulꝰ ad thessalonicenses .iii.
**Example 2:**
- Reference: acceptad mi humilde seruicio. e dissipad. e plantad en el
- Base Model: acceptad mi humilde servicio, è dissipad, è plantat en el
- Fine-tuned Model: acceptad mi humilde seruicio. e dissipad. e plantad en el
**Example 3:**
- Reference: ꝙ mattheus illam dictionem ponat
- Base Model: q mattheus illam dictionem ponat
- Fine-tuned Model: ꝙ mattheus illam dictiõnem ponat
**Example 4:**
- Reference: Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- Base Model: fuge quoniam cade hic quia tama ferebar.
- Fine-tuned Model: Fuge qd̾ uoneas. eadẽ ħ ꝗꝗ sana ferebat:
**Example 5:**
- Reference: a prima coniugatione ue
- Base Model: aprimaconiugazioneue
- Fine-tuned Model: a prima coniugatione ue
## Quick start

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load the base model and attach the fine-tuned LoRA adapter
base_model = "Qwen/Qwen3-VL-8B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-8B-catmus"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the chat message with the image and the transcription prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Tokenize the prompt and generate the transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcription)
```
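If you plan to run many transcriptions, you can optionally fold the adapter weights into the base model using peft's standard `merge_and_unload()` method; this step is not part of the card's instructions, just a common convenience, and the output directory name below is a placeholder.

```python
# Optional: merge the LoRA weights into the base model for faster inference.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen3-vl-8b-catmus-merged")  # placeholder path
processor.save_pretrained("qwen3-vl-8b-catmus-merged")
```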
## Use Cases

This model is designed for:

- Transcribing medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Handwritten text recognition (HTR/OCR) for specialized historical texts
## Training procedure
This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-8B-Instruct base model.
### Training Data

The model was trained on CATMuS/medieval, a dataset containing images of medieval manuscript text with corresponding transcriptions; the corpus includes Latin as well as other languages written in Latin scripts (see Example 2 above).
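The dataset is hosted on the Hugging Face Hub and can be loaded with the `datasets` library. The snippet below only inspects the splits, since the exact column schema is documented on the dataset card rather than here.

```python
from datasets import load_dataset

# Load CATMuS/medieval from the Hugging Face Hub.
dataset = load_dataset("CATMuS/medieval")
print(dataset)  # shows available splits and column names
```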
### Training Configuration
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Training Method: Supervised Fine-Tuning (SFT) with LoRA
- LoRA Configuration:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- Training Arguments:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16
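The training script itself is not reproduced in this card. As a sketch of how the hyperparameters above map onto peft and TRL objects, one plausible configuration looks like this; the output directory is a placeholder, and any argument not listed above is a default or an assumption.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA setup mirroring the values listed above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Training arguments mirroring the values listed above.
training_args = SFTConfig(
    output_dir="qwen3-vl-8b-catmus",  # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    fp16=True,  # mixed precision
)
```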
### Framework versions
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1
## Limitations

- The model is specialized for medieval manuscripts and may not perform well on other types of text or images
- Performance may vary depending on image quality, resolution, and handwriting style
- The model has been trained on a specific dataset and may require fine-tuning for other manuscript collections
## Citations

If you use this model, please cite the base model and the training framework:

**Qwen3-VL**
```bibtex
@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}
```
**TRL (Transformer Reinforcement Learning)**

```bibtex
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year = 2020,
  journal = {GitHub repository},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
README generated automatically on 2025-10-24 10:40:41