---
language: ca
datasets:
- mozilla-foundation/common_voice_17_0
- projecte-aina/3catparla_asr
tags:
- audio
- automatic-speech-recognition
- catalan
- whisper-large-v3
- projecte-aina
- barcelona-supercomputing-center
- bsc
- punctuated-data
license: apache-2.0
model-index:
- name: langtech-veu/whisper-large-v3-ca-punctuated-3370h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 17.0 (Test)
      type: mozilla-foundation/common_voice_17_0
      split: test
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 5.5
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 17.0 (Dev)
      type: mozilla-foundation/common_voice_17_0
      split: validation
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 5.07
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Balearic fem)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Balearic female
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 6.06
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Balearic male)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Balearic male
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 5.34
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Central fem)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Central female
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.08
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Central male)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Central male
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.56
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Northern fem)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Northern female
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.21
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Northern male)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Northern male
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.28
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Northwestern fem)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Northwestern female
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 3.87
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Northwestern male)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Northwestern male
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.86
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Valencian fem)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Valencian female
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.67
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents (Valencian male)
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      split: Valencian male
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 4.6
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# whisper-large-v3-ca-punctuated-3370h

## Table of Contents

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

## Model Description

The "whisper-large-v3-ca-punctuated-3370h" is an acoustic model for Automatic Speech Recognition in Catalan. It is the result of fine-tuning the model ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) on a combination of Catalan data from [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) (2,659 hours) and 710 hours of data released by the [Projecte AINA](https://projecteaina.cat/) from Barcelona, Spain, totalling 3,369 hours and 53 minutes.

A key advantage of this model is that it was trained on meticulously transcribed data, including punctuation and capitalization. As a result, its transcriptions preserve these features and are more structured and readable than the output of ASR models trained on normalized (unpunctuated, lowercase) text.

## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan. It is intended to transcribe Catalan audio files to plain text with punctuation and capitalization.

## How to Get Started with the Model

For a working version of the code in this section, see our [Notebook](https://colab.research.google.com/drive/1MHiPrffNTwiyWeUyMQvSdSbfkef_8aJC?usp=sharing). To invoke this model from the notebook, substitute the instances of "projecte-aina/whisper-large-v3-ca-3catparla" with "langtech-veu/whisper-large-v3-ca-punctuated-3370h".

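
For transcribing a single audio file rather than evaluating a whole dataset, a minimal sketch using the `pipeline` API of transformers might look like the following. It assumes that `torch` and `transformers` are installed; the helper names `build_asr` and `transcribe`, and the 30-second chunk length, are illustrative choices and do not come from the notebook.

```python
from typing import Any, Callable, Dict

MODEL_NAME = "langtech-veu/whisper-large-v3-ca-punctuated-3370h"


def build_asr(model_name: str = MODEL_NAME):
    """Create a transformers ASR pipeline (downloads the model on first use)."""
    # Imported lazily so that `transcribe` can be used or tested
    # without loading the heavy libraries.
    import torch
    from transformers import pipeline

    # Use the first GPU when one is available, otherwise fall back to CPU.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    return pipeline(
        "automatic-speech-recognition",
        model=model_name,
        device=device,
        chunk_length_s=30,  # assumption: split long recordings into 30 s windows
    )


def transcribe(audio_path: str, asr: Callable[[str], Dict[str, Any]]) -> str:
    """Return the transcription of one audio file, stripped of outer whitespace."""
    return asr(audio_path)["text"].strip()
```

With the model downloaded, `print(transcribe("sample.wav", build_asr()))` would print the punctuated and capitalized transcription of a hypothetical file `sample.wav`. Passing a file path to the pipeline typically requires `ffmpeg` to be available on the system.
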
### Installation

To use this model, install [datasets](https://huggingface.co/docs/datasets/installation) and [transformers](https://huggingface.co/docs/transformers/installation).

Create a virtual environment:

```bash
python -m venv /path/to/venv
```

Activate the environment:

```bash
source /path/to/venv/bin/activate
```

Install the modules:

```bash
pip install datasets transformers
```

### For Inference

To transcribe audio in Catalan using this model, you can follow this example:

```bash
# Install prerequisites
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
```

```python
# This code requires a GPU.
# Note (November 2024): load_metric is no longer part of datasets,
# so the WER metric is loaded with evaluate's load instead.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and model.
MODEL_NAME = "langtech-veu/whisper-large-v3-ca-punctuated-3370h"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

# Load the dataset.
from datasets import load_dataset, Audio
ds = load_dataset("projecte-aina/parlament_parla", split="test")

# Downsample to 16 kHz.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Transcribe one example and store the normalized reference and prediction.
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)

    return batch

# Run the evaluation.
result = ds.map(map_to_pred)

# Compute the overall WER (as a percentage).
from evaluate import load

wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
```

## Training Details

### Training data

The specific datasets used to create the model are:

- [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- ["3CatParla"](https://huggingface.co/datasets/projecte-aina/3catparla_asr) (soon to be published)

### Training procedure

This model is the result of fine-tuning the model ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) by following this [tutorial](https://huggingface.co/blog/fine-tune-whisper) provided by Hugging Face.

### Training Hyperparameters

* language: catalan
* hours of training audio: 3,369 hours and 53 minutes
* learning rate: 1e-5
* sample rate: 16000
* train batch size: 32 (x4 GPUs)
* gradient accumulation steps: 1
* eval batch size: 32
* save total limit: 4
* max steps: 77660
* warmup steps: 7766
* eval steps: 7766
* save steps: 7766

## Citation

If this model contributes to your research, please cite the work:

```bibtex
@misc{mena2025whisperpunctuated,
      title={Acoustic Model in Catalan: whisper-large-v3-ca-punctuated-3370h.},
      author={Hernandez Mena, Carlos Daniel},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/langtech-veu/whisper-large-v3-ca-punctuated-3370h},
      year={2025}
}
```

## Additional Information

### Author

The fine-tuning process was performed in April 2025 in the [Language Technologies laboratory](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena).

### Contact

For further information, please send an email to .

### Copyright

Copyright(c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

The training of the model was possible thanks to the computing time provided by [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.