Model Card for Qwen3-VL-8B-catmus-medieval

This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct for transcribing medieval Latin manuscripts from images. It has been trained using TRL on the CATMuS/medieval dataset.

Model Description

This vision-language model specializes in transcribing text from images of medieval Latin manuscripts. Given an image of manuscript text, the model generates the corresponding transcription.

Performance

The model was evaluated on 100 examples from the CATMuS/medieval dataset (test split).

Metrics

| Metric | Base Model | Fine-tuned Model | Improvement (relative error reduction) |
|---|---|---|---|
| Character Error Rate (CER) | 0.3778 (37.78%) | 0.1997 (19.97%) | 47.14% |
| Word Error Rate (WER) | 0.8300 (83.00%) | 0.5457 (54.57%) | 34.25% |
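
CER and WER are normalized edit distances at the character and word level, respectively (lower is better). The exact evaluation script is not included in this card; the sketch below shows one way to compute the same metrics with the jiwer library (pip install jiwer), using two of the sample predictions from the next section as stand-in data.

import jiwer

# References and model outputs would normally come from the CATMuS/medieval test split;
# the two pairs below are just sample predictions reproduced from this card.
references = [
    "paulꝯ ad thessalonicenses .iii.",
    "a prima coniugatione ue",
]
predictions = [
    "paulꝰ ad thessalonicenses .iii.",
    "a prima coniugatione ue",
]

cer = jiwer.cer(references, predictions)  # character error rate over the whole set
wer = jiwer.wer(references, predictions)  # word error rate over the whole set
print(f"CER: {cer:.4f}  WER: {wer:.4f}")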

Sample Predictions

Here are some example transcriptions comparing the base model and fine-tuned model:

Example 1:

  • Reference: paulꝯ ad thessalonicenses .iii.
  • Base Model: paul9adthellalomconceB·iii·
  • Fine-tuned Model: paulꝰ ad thessalonicenses .iii.

Example 2:

  • Reference: acceptad mi humilde seruicio. e dissipad. e plantad en el
  • Base Model: acceptad mi humilde servicio, è dissipad, è plantat en el
  • Fine-tuned Model: acceptad mi humilde seruicio. e dissipad. e plantad en el

Example 3:

  • Reference: ꝙ mattheus illam dictionem ponat
  • Base Model: q mattheus illam dictionem ponat
  • Fine-tuned Model: ꝙ mattheus illam dictiõnem ponat

Example 4:

  • Reference: Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
  • Base Model: fuge quoniam cade hic quia tama ferebar.
  • Fine-tuned Model: Fuge qd̾ uoneas. eadẽ ħ ꝗꝗ sana ferebat:

Example 5:

  • Reference: a prima coniugatione ue
  • Base Model: aprimaconiugazioneue
  • Fine-tuned Model: a prima coniugatione ue

Quick start

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load model and processor
base_model = "Qwen/Qwen3-VL-8B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-8B-catmus"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Generate transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(transcription)
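
For repeated or large-scale transcription, the LoRA adapter can optionally be merged into the base weights so that a single standalone checkpoint can be loaded without PEFT at inference time. This is a generic PEFT pattern rather than something shipped with this model; the local output path below is only an example.

# Continues from the Quick start above: `model` is the PeftModel wrapping the base model.
merged_model = model.merge_and_unload()  # folds the LoRA weights into the base weights
merged_model.save_pretrained("qwen3-vl-8b-catmus-merged")  # example local path
processor.save_pretrained("qwen3-vl-8b-catmus-merged")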

Use Cases

This model is designed for:

  • Transcribing medieval Latin manuscripts
  • Digitizing historical manuscripts
  • Supporting historical research and archival work
  • Optical Character Recognition (OCR) for specialized historical texts

Training procedure

This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-8B-Instruct base model.

Training Data

The model was trained on CATMuS/medieval, a dataset containing images of medieval Latin manuscripts with corresponding text transcriptions.
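
The dataset can be inspected directly with the datasets library. The snippet below is a minimal sketch: the split name is an assumption (check the dataset card for the actual splits), and streaming is used only to avoid downloading the full dataset.

from datasets import load_dataset

# Stream a single example rather than downloading the full dataset.
# The "train" split name is an assumption; see the dataset card for the actual splits.
dataset = load_dataset("CATMuS/medieval", split="train", streaming=True)
sample = next(iter(dataset))
print(sample.keys())  # inspect the available columns (image, transcription, metadata, ...)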

Training Configuration

  • Base Model: Qwen/Qwen3-VL-8B-Instruct
  • Training Method: Supervised Fine-Tuning (SFT) with LoRA
  • LoRA Configuration:
    • Rank (r): 16
    • Alpha: 32
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Dropout: 0.1
  • Training Arguments:
    • Epochs: 3
    • Batch size per device: 2
    • Gradient accumulation steps: 4
    • Learning rate: 5e-05
    • Optimizer: AdamW
    • Mixed precision: FP16

Framework versions

  • TRL: 0.23.0
  • Transformers: 4.57.1
  • PyTorch: 2.8.0
  • Datasets: 4.1.1
  • Tokenizers: 0.22.1

Limitations

  • The model is specialized for medieval Latin manuscripts and may not perform well on other types of text or images
  • Performance may vary depending on image quality, resolution, and handwriting style
  • The model has been trained on a specific dataset and may require fine-tuning for other manuscript collections

Citations

If you use this model, please cite the base model and training framework:

Qwen3-VL

@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}

TRL (Transformer Reinforcement Learning)

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}

README generated automatically on 2025-10-24 10:40:41
