carlosdanielhernandezmena committed on
Commit 74d00f4 · verified · 1 Parent(s): 688fac1

Adding info to the README file

Files changed (1)
  1. README.md +206 -7
README.md CHANGED
@@ -1,15 +1,214 @@
  ---
- license: apache-2.0
  datasets:
  - mozilla-foundation/common_voice_17_0
  - projecte-aina/3catparla_asr
- language:
- - ca
- metrics:
- - wer
- - cer
  base_model:
  - openai/whisper-large-v3
  pipeline_tag: text-to-speech
  library_name: transformers
- ---
---
language: ca
datasets:
- mozilla-foundation/common_voice_17_0
- projecte-aina/3catparla_asr
tags:
- audio
- automatic-speech-recognition
- catalan
- whisper-large-v3
- projecte-aina
- barcelona-supercomputing-center
- bsc
- punctuated-data
license: apache-2.0
model-index:
- name: whisper-large-v3-ca-punctuated-3370h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: 3CatParla (Test)
      type: projecte-aina/3catparla_asr
      split: test
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 0.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: 3CatParla (Dev)
      type: projecte-aina/3catparla_asr
      split: dev
      args:
        language: ca
    metrics:
    - name: WER
      type: wer
      value: 0.92
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# whisper-large-v3-ca-punctuated-3370h

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>

## Model Description

"whisper-large-v3-ca-punctuated-3370h" is an acoustic model suitable for Automatic Speech Recognition in Catalan. It is the result of fine-tuning the model ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) with a combination of Catalan data from [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) (2,659 hours) and 710 hours of data released by [Projecte AINA](https://projecteaina.cat/) from Barcelona, Spain, for a total of 3,369 hours and 53 minutes.

A key advantage of this model is that it was trained on meticulously transcribed data, including punctuation and capitalization. As a result, the output transcriptions preserve these features, delivering more structured and readable output than standard ASR models.
70
+
71
+ ## Intended Uses and Limitations
72
+
73
+ This model can be used for Automatic Speech Recognition (ASR) in Catalan. The model is intended to transcribe audio files in Catalan to plain text with punctuation and capitalization.
74
+
75
+ ## How to Get Started with the Model
76
+
77
+ To see a functional version of this code, please see our our [Notebook](https://colab.research.google.com/drive/1MHiPrffNTwiyWeUyMQvSdSbfkef_8aJC?usp=sharing) and, in order to invoke this model, just substitute the instances of "projecte-aina/whisper-large-v3-ca-3catparla" with "langtech-veu/whisper-large-v3-ca-punctuated-3370h".
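
Once the packages from the Installation section below are installed, the generic `transformers` ASR pipeline can also be used for a quick test. The snippet below is a minimal sketch rather than the card's official recipe: the audio file name, the `chunk_length_s` value, and the device selection are illustrative assumptions.

```python
# Minimal sketch: transcribe a single Catalan audio file with the ASR pipeline.
# "audio.wav", chunk_length_s and the device choice are illustrative assumptions.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="langtech-veu/whisper-large-v3-ca-punctuated-3370h",
    device=0 if torch.cuda.is_available() else -1,
)

# chunk_length_s enables chunked long-form transcription.
result = asr("audio.wav", chunk_length_s=30)
print(result["text"])  # punctuated, capitalized Catalan transcription
```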
78
+
79
+ ### Installation
80
+
81
+ In order to use this model, you may install [datasets](https://huggingface.co/docs/datasets/installation) and [transformers](https://huggingface.co/docs/transformers/installation):
82
+
83
+ Create a virtual environment:
84
+ ```bash
85
+ python -m venv /path/to/venv
86
+ ```
87
+ Activate the environment:
88
+ ```bash
89
+ source /path/to/venv/bin/activate
90
+ ```
91
+ Install the modules:
92
+ ```bash
93
+ pip install datasets transformers
94
+ ```

### For Inference
To transcribe audio in Catalan with this model, you can follow this example:

```bash
# Install prerequisites
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
```

```python
# This code requires a GPU.

# Note (November 2024): load_metric is no longer part of datasets;
# use evaluate's load instead.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and model.
MODEL_NAME = "langtech-veu/whisper-large-v3-ca-punctuated-3370h"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

# Load the dataset.
from datasets import load_dataset, Audio
ds = load_dataset("projecte-aina/3catparla_asr", split="test")

# Downsample to 16 kHz.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Process the dataset.
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)

    return batch

# Run the predictions.
result = ds.map(map_to_pred)

# Compute the overall WER.
from evaluate import load

wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
```
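
The evaluation above normalizes punctuation and casing away before scoring. To see the punctuated, capitalized transcription that this model is designed to produce, decode without normalization. The following is a minimal sketch that reuses the `model`, `processor`, and `ds` objects from the block above; the sample index is arbitrary:

```python
# Minimal sketch: print the raw (punctuated, capitalized) transcription of one sample.
# Reuses `model`, `processor`, and `ds` from the previous block; index 0 is arbitrary.
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

with torch.no_grad():
    predicted_ids = model.generate(input_features.to("cuda"))[0]

# skip_special_tokens removes Whisper's control tokens but keeps punctuation and casing.
print(processor.decode(predicted_ids, skip_special_tokens=True))
```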
<!--
**Test Result**: 0.96
-->

## Training Details

### Training data

The specific datasets used to create the model are [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) and ["3CatParla"](https://huggingface.co/datasets/projecte-aina/3catparla_asr).
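
As an illustration of how these two corpora can be loaded and prepared with `datasets`, the sketch below loads the Catalan configuration of Common Voice 17.0 and 3CatParla and resamples both to 16 kHz. It is not the actual training pipeline: the splits, column alignment, and any filtering or mixing used for the real run are not described in this card, and Common Voice requires accepting its terms of use on the Hub.

```python
# Illustrative loading of the two training corpora; not the actual training pipeline.
from datasets import load_dataset, concatenate_datasets, Audio

# Catalan configuration of Common Voice 17.0 (gated; requires accepting the terms).
common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "ca", split="train")
catparla = load_dataset("projecte-aina/3catparla_asr", split="train")

# Whisper expects 16 kHz audio.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
catparla = catparla.cast_column("audio", Audio(sampling_rate=16_000))

# concatenate_datasets needs matching features, so in practice the columns would
# first be renamed/reduced to a common schema (e.g. an "audio" and a text column).
# combined = concatenate_datasets([common_voice, catparla])
```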

### Training procedure

This model is the result of fine-tuning the model ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) by following this [tutorial](https://huggingface.co/blog/fine-tune-whisper) provided by Hugging Face.

### Training Hyperparameters

* language: catalan
* hours of training audio: 3369 hours and 53 minutes
* learning rate: 1e-5
* sample rate: 16000
* train batch size: 32 (x4 GPUs)
* gradient accumulation steps: 1
* eval batch size: 32
* save total limit: 4
* max steps: 77660
* warmup steps: 7766
* eval steps: 7766
* save steps: 7766
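
Following the fine-tuning tutorial referenced above, these hyperparameters would map roughly onto the `Seq2SeqTrainingArguments` below. This is an illustrative reconstruction, not the actual training configuration: the output directory, precision, and reporting settings are assumptions, and argument names can differ between `transformers` versions.

```python
# Illustrative mapping of the listed hyperparameters onto Seq2SeqTrainingArguments,
# in the style of the Hugging Face fine-tune-whisper tutorial. Only the numeric
# values come from this card; output_dir, fp16 and report_to are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-ca-punctuated-3370h",  # assumed
    per_device_train_batch_size=32,   # train batch size: 32 (x4 GPUs)
    per_device_eval_batch_size=32,    # eval batch size: 32
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=7766,
    max_steps=77660,
    eval_strategy="steps",            # "evaluation_strategy" in older versions
    eval_steps=7766,
    save_steps=7766,
    save_total_limit=4,
    fp16=True,                        # assumed precision
    predict_with_generate=True,
    report_to=["tensorboard"],        # assumed
)
```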

## Citation
If this model contributes to your research, please cite the work:
```bibtex
@misc{mena2025whisperpunctuated,
      title={Acoustic Model in Catalan: whisper-large-v3-ca-punctuated-3370h.},
      author={Hernandez Mena, Carlos Daniel},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/langtech-veu/whisper-large-v3-ca-punctuated-3370h},
      year={2025}
}
```

## Additional Information

### Author

The fine-tuning process was performed during April 2025 in the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena).

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2025 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

The training of the model was possible thanks to the compute time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.