# Training DTLN with Hugging Face Datasets

Complete guide for training a production-quality voice denoising model using real datasets from the Hugging Face Hub.

## Overview

This guide shows you how to train the DTLN (Dual-signal Transformation LSTM Network) model using:
- **Clean Speech**: LibriSpeech, Common Voice, or other speech datasets
- **Noise**: DNS Challenge, FSD50K, or environmental noise datasets
- **Quantization-Aware Training (QAT)**: For efficient INT8 deployment
- **TensorFlow Lite**: For edge deployment

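Under the hood, training pairs are synthesized by mixing clean speech with noise at randomized signal-to-noise ratios. The exact logic lives in `train_with_hf_datasets.py`; the following is a minimal sketch of the mixing step, not the script's literal code:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix clean speech with a noise clip at a target SNR in dB."""
    # Loop or trim the noise to match the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so clean power / noise power hits the target SNR.
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. a random SNR in [0, 15] dB per training sample:
# noisy = mix_at_snr(clean_audio, noise_audio, np.random.uniform(0.0, 15.0))
```
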
## Quick Start

### 1. Install Dependencies

```bash
pip install -r training_requirements.txt
```

### 2. Train with Default Settings (LibriSpeech + DNS Challenge)

```bash
python train_with_hf_datasets.py \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000 \
    --lstm-units 128 \
    --output-dir ./models
```

This will:
- Download LibriSpeech clean speech (`train.clean.100` split)
- Download the DNS Challenge noise dataset
- Train for 50 epochs with quantization-aware training
- Save the best model to `./models/best_model.h5`

### 3. Convert to TensorFlow Lite

```bash
python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
```

## Available Datasets

### Clean Speech Datasets

#### LibriSpeech (Default)
- **Dataset**: `librispeech_asr`
- **Split**: `train.clean.100` (100 hours) or `train.clean.360` (360 hours)
- **Language**: English
- **Quality**: High-quality read speech

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.100
```

#### Common Voice
- **Dataset**: `mozilla-foundation/common_voice_11_0`
- **Split**: `train`
- **Language**: Multiple (specify with a config, e.g. "en")
- **Quality**: User-contributed speech

```bash
python train_with_hf_datasets.py \
    --clean-dataset "mozilla-foundation/common_voice_11_0" \
    --clean-split train
```

#### VoxPopuli
- **Dataset**: `facebook/voxpopuli`
- **Split**: `train`
- **Language**: Multiple European languages
- **Quality**: European Parliament recordings

#### Other Options
- `google/fleurs` - Multilingual speech
- `facebook/multilingual_librispeech` - Multiple languages
- Any HF dataset with an `audio` column

### Noise Datasets

#### DNS Challenge (Default)
- **Dataset**: `dns-challenge/dns-challenge-4`
- **Split**: `train`
- **Content**: Diverse environmental noises

```bash
python train_with_hf_datasets.py \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train
```

#### FSD50K (Freesound Dataset)
- **Dataset**: `Fhrozen/FSD50k`
- **Split**: `train`
- **Content**: 50,000+ sound events

#### WHAM! Noise
- **Dataset**: `JorisCos/WHAM`
- **Split**: `train`
- **Content**: Ambient noise samples

#### Create Your Own
If a noise dataset isn't available, the script falls back to synthetic white noise. For better results, package your own noise recordings as a Hub dataset, as sketched below.

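A minimal way to do that with the `datasets` library (the file paths and repo name below are placeholders):

```python
from datasets import Audio, Dataset

# Collect local noise recordings and type the column as Audio.
files = ["noise/office.wav", "noise/street.wav", "noise/cafe.wav"]  # placeholders
ds = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(sampling_rate=16000))

# Publish to the Hub (run `huggingface-cli login` first).
ds.push_to_hub("your-username/your-noise-dataset")
```
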
## Training Configuration

### Basic Training

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000
```

### Advanced Training

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 5000 \
    --lstm-units 128 \
    --learning-rate 0.001 \
    --output-dir ./models/production \
    --cache-dir ./hf_cache
```

### Small Model (for constrained devices)

```bash
python train_with_hf_datasets.py \
    --epochs 50 \
    --lstm-units 64 \
    --batch-size 8 \
    --samples-per-epoch 500
```

### Large Model (maximum quality)

```bash
python train_with_hf_datasets.py \
    --epochs 100 \
    --lstm-units 256 \
    --batch-size 32 \
    --samples-per-epoch 10000
```

## Training Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--clean-dataset` | `librispeech_asr` | HF dataset for clean speech |
| `--noise-dataset` | `dns-challenge/dns-challenge-4` | HF dataset for noise |
| `--clean-split` | `train.clean.100` | Dataset split for clean speech |
| `--noise-split` | `train` | Dataset split for noise |
| `--epochs` | 50 | Number of training epochs |
| `--batch-size` | 16 | Batch size (reduce if OOM) |
| `--samples-per-epoch` | 1000 | Samples per epoch |
| `--lstm-units` | 128 | LSTM units (64/128/256) |
| `--learning-rate` | 0.001 | Adam learning rate |
| `--no-qat` | False | Disable quantization-aware training |
| `--cache-dir` | None | Cache directory for datasets |
| `--output-dir` | `./models` | Output directory |

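QAT is on by default (`--no-qat` disables it). In Keras, quantization-aware training is usually set up with `tensorflow-model-optimization`; the following is a hedged sketch of the general pattern, not the script's actual integration. Note that stock `quantize_model` has limited coverage for recurrent layers, so a DTLN-style model may need per-layer annotation:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def wrap_for_qat(model: tf.keras.Model) -> tf.keras.Model:
    # Inserts fake-quantization ops so the weights learn to tolerate INT8.
    qat_model = tfmot.quantization.keras.quantize_model(model)
    qat_model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="mse",
        metrics=["mae"],
    )
    return qat_model
```
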
## Expected Results

### Training Progress

```
Epoch 1/50
63/63 [==============================] - 145s 2s/step - loss: 0.0234 - mae: 0.1023
Epoch 2/50
63/63 [==============================] - 142s 2s/step - loss: 0.0189 - mae: 0.0854
...
Epoch 50/50
63/63 [==============================] - 141s 2s/step - loss: 0.0042 - mae: 0.0312
```

### Model Performance

| Metric | Expected Value |
|--------|----------------|
| Final loss | 0.003 - 0.006 |
| MAE | 0.02 - 0.04 |
| SNR improvement | 12-15 dB |
| PESQ | 3.0 - 3.5 |
| STOI | 0.90 - 0.95 |

### Model Size

| Configuration | Parameters | FP32 Size | INT8 Size |
|---------------|------------|-----------|-----------|
| Small (64 units) | ~150K | ~600 KB | ~150 KB |
| Medium (128 units) | ~400K | ~1.6 MB | ~400 KB |
| Large (256 units) | ~1.3M | ~5.2 MB | ~1.3 MB |

## Using the Trained Model

### Load and Test

```python
import tensorflow as tf
import numpy as np
import soundfile as sf

# Load the trained model
model = tf.keras.models.load_model('./models/best_model.h5')

# Load noisy audio
noisy_audio, sr = sf.read('noisy_audio.wav')

# Ensure the expected sample rate (16 kHz)
if sr != 16000:
    import librosa
    noisy_audio = librosa.resample(noisy_audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Denoise (add a batch dimension, then strip it from the output)
enhanced_audio = model.predict(noisy_audio[np.newaxis, :])[0]

# Save the result
sf.write('enhanced_audio.wav', enhanced_audio, sr)
```

### Convert to TFLite INT8

```bash
python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
```

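The calibration directory feeds the converter's representative dataset, which is how INT8 activation ranges are estimated. Conceptually the conversion looks like this (a sketch, not the literal contents of `convert_to_tflite.py`; the input shape is a placeholder):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("./models/best_model.h5")

def representative_dataset():
    # In practice, yield real audio windows loaded from ./calibration_data.
    for _ in range(200):
        yield [np.random.randn(1, 16000).astype(np.float32)]  # placeholder shape

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("./models/dtln_int8.tflite", "wb") as f:
    f.write(converter.convert())
```
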
### Optimize for a Hardware Accelerator (Arm Ethos-U)

```bash
# Install the Vela compiler
pip install ethos-u-vela

# Optimize for Ethos-U55
vela \
    --accelerator-config ethos-u55-256 \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram \
    ./models/dtln_int8.tflite
```

## Troubleshooting

### Out of Memory (OOM)

**Problem**: Training crashes with an OOM error.

**Solutions**:
```bash
# Reduce batch size
--batch-size 8

# Reduce samples per epoch
--samples-per-epoch 500

# Reduce LSTM units
--lstm-units 64
```

### Dataset Download Fails

**Problem**: Cannot download a dataset from Hugging Face.

**Solutions**:
1. Check your internet connection
2. Log in to Hugging Face: `huggingface-cli login`
3. Try a different dataset
4. Use local files with the original `train_dtln.py`

### Slow Training

**Problem**: Training is very slow.

**Solutions**:
```bash
# Streaming datasets are already the default.
# Reduce samples per epoch:
--samples-per-epoch 500

# Use a GPU (automatically detected). Verify with:
# python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

### Poor Model Quality

**Problem**: The model doesn't denoise well.

**Solutions**:
1. Train longer: `--epochs 100`
2. Use more samples: `--samples-per-epoch 5000`
3. Use a larger model: `--lstm-units 256`
4. Use better datasets (LibriSpeech 360 h instead of 100 h)
5. Check the training loss - it should be below 0.01

## Advanced Topics

### Custom Dataset

To use your own dataset, it must be on the Hugging Face Hub with an `audio` column:

```bash
python train_with_hf_datasets.py \
    --clean-dataset "your-username/your-speech-dataset" \
    --noise-dataset "your-username/your-noise-dataset"
```

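For reference, loading such a dataset in streaming mode (the training script's default, which avoids downloading the full corpus up front) looks like this; the repo name is a placeholder:

```python
from datasets import Audio, load_dataset

ds = load_dataset("your-username/your-speech-dataset", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # decode/resample on the fly

sample = next(iter(ds))
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
```
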
### Multi-GPU Training

By default TensorFlow trains on a single device; to spread training across several GPUs, wrap model construction in a distribution strategy:

```python
# In train_with_hf_datasets.py, add:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = dtln.build_model()
    # ... rest of the model setup
```

### Transfer Learning

Start from a pre-trained checkpoint:

```python
# Load pre-trained weights
model = tf.keras.models.load_model('pretrained_model.h5')

# Continue training
history = model.fit(train_generator, epochs=20)
```

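A common variant when adapting to a new noise domain is to freeze the early layers and fine-tune the rest at a lower learning rate (a sketch; how many layers to freeze depends on the architecture `build_model` produces):

```python
# Freeze everything except the last two layers, then recompile.
for layer in model.layers[:-2]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse", metrics=["mae"])
history = model.fit(train_generator, epochs=20)
```
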
### Monitoring with TensorBoard

```bash
# During training, logs are saved to ./models/logs/
# View them with:
tensorboard --logdir ./models/logs

# Then open http://localhost:6006 in a browser.
```

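If you need to wire the logging up yourself, the standard Keras callback suffices (the log directory below matches the guide's default, assuming that is where the script writes):

```python
import tensorflow as tf

tb = tf.keras.callbacks.TensorBoard(log_dir="./models/logs", update_freq="epoch")
# history = model.fit(train_generator, epochs=50, callbacks=[tb])
```
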
## Production Deployment

### 1. Train the Production Model

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 10000 \
    --lstm-units 128 \
    --output-dir ./models/production
```

### 2. Evaluate on a Test Set

```bash
# Write a small test script (evaluate_model.py here) that measures PESQ, STOI, and SNR
python evaluate_model.py \
    --model ./models/production/best_model.h5 \
    --test-dir ./test_data
```

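The metrics come from the `pesq` and `pystoi` packages; here is a sketch of what such a script computes per clean/enhanced pair (illustrative, since `evaluate_model.py` is something you write yourself):

```python
import numpy as np
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def evaluate_pair(clean_path: str, enhanced_path: str) -> dict:
    clean, sr = sf.read(clean_path)
    enhanced, _ = sf.read(enhanced_path)
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]

    # SNR of the enhanced signal against the clean reference.
    residual = enhanced - clean
    snr_db = 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-10))

    return {
        "pesq": pesq(sr, clean, enhanced, "wb"),  # wideband mode for 16 kHz
        "stoi": stoi(clean, enhanced, sr),
        "snr_db": snr_db,
    }
```
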
### 3. Convert to TFLite INT8

```bash
python convert_to_tflite.py \
    --model ./models/production/best_model.h5 \
    --output ./models/dtln_production_int8.tflite
```

### 4. Deploy to an Edge Device

See `alif_e7_voice_denoising_guide.md` for hardware deployment details.

## Resources

- **HF Datasets**: https://huggingface.co/datasets
- **Audio Datasets**: https://huggingface.co/datasets?task_categories=audio
- **DTLN Paper**: https://arxiv.org/abs/2005.07551
- **TFLite Guide**: https://www.tensorflow.org/lite

## Citation

```bibtex
@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L. and Meyer, Bernd T.},
  booktitle={Interspeech},
  year={2020}
}
```