# Training DTLN with Hugging Face Datasets

Complete guide for training a production-quality voice denoising model using real datasets from the Hugging Face Hub.

## Overview

This guide shows you how to train the DTLN (Dual-signal Transformation LSTM Network) model using:
- **Clean Speech**: LibriSpeech, Common Voice, or other speech datasets
- **Noise**: DNS Challenge, FSD50K, or environmental noise datasets
- **Quantization-Aware Training (QAT)**: For efficient INT8 deployment
- **TensorFlow Lite**: For edge deployment

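Under the hood, training pairs are synthesized by mixing clean speech with noise at randomized signal-to-noise ratios. The exact logic lives in `train_with_hf_datasets.py`; the following is a minimal sketch of the mixing step, not the script's literal code:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix clean speech with a noise clip at a target SNR in dB."""
    # Loop or trim the noise to match the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so clean power / noise power hits the target SNR.
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. a random SNR in [0, 15] dB per training sample:
# noisy = mix_at_snr(clean_audio, noise_audio, np.random.uniform(0.0, 15.0))
```
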
## Quick Start

### 1. Install Dependencies

```bash
pip install -r training_requirements.txt
```

### 2. Train with Default Settings (LibriSpeech + DNS Challenge)

```bash
python train_with_hf_datasets.py \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000 \
    --lstm-units 128 \
    --output-dir ./models
```

This will:
- Download LibriSpeech clean speech (`train.clean.100` split)
- Download the DNS Challenge noise dataset
- Train for 50 epochs with quantization-aware training
- Save the best model to `./models/best_model.h5`

### 3. Convert to TensorFlow Lite

```bash
python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
```

## Available Datasets

### Clean Speech Datasets

#### LibriSpeech (Default)
- **Dataset**: `librispeech_asr`
- **Split**: `train.clean.100` (100 hours) or `train.clean.360` (360 hours)
- **Language**: English
- **Quality**: High-quality read speech

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.100
```

#### Common Voice
- **Dataset**: `mozilla-foundation/common_voice_11_0`
- **Split**: `train`
- **Language**: Multiple (specify with a config, e.g. "en")
- **Quality**: User-contributed speech

```bash
python train_with_hf_datasets.py \
    --clean-dataset "mozilla-foundation/common_voice_11_0" \
    --clean-split train
```

#### VoxPopuli
- **Dataset**: `facebook/voxpopuli`
- **Split**: `train`
- **Language**: Multiple European languages
- **Quality**: European Parliament recordings

#### Other Options
- `google/fleurs` - Multilingual speech
- `facebook/multilingual_librispeech` - Multiple languages
- Any HF dataset with an `audio` column

### Noise Datasets

#### DNS Challenge (Default)
- **Dataset**: `dns-challenge/dns-challenge-4`
- **Split**: `train`
- **Content**: Diverse environmental noises

```bash
python train_with_hf_datasets.py \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train
```

#### FSD50K (Freesound Dataset)
- **Dataset**: `Fhrozen/FSD50k`
- **Split**: `train`
- **Content**: 50,000+ sound events

#### WHAM! Noise
- **Dataset**: `JorisCos/WHAM`
- **Split**: `train`
- **Content**: Ambient noise samples

#### Create Your Own
If a noise dataset isn't available, the script falls back to synthetic white noise. For better results, package your own noise recordings as a Hub dataset, as sketched below.

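A minimal way to do that with the `datasets` library (the file paths and repo name below are placeholders):

```python
from datasets import Audio, Dataset

# Collect local noise recordings and type the column as Audio.
files = ["noise/office.wav", "noise/street.wav", "noise/cafe.wav"]  # placeholders
ds = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(sampling_rate=16000))

# Publish to the Hub (run `huggingface-cli login` first).
ds.push_to_hub("your-username/your-noise-dataset")
```
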
## Training Configuration

### Basic Training

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000
```

### Advanced Training

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 5000 \
    --lstm-units 128 \
    --learning-rate 0.001 \
    --output-dir ./models/production \
    --cache-dir ./hf_cache
```

### Small Model (for constrained devices)

```bash
python train_with_hf_datasets.py \
    --epochs 50 \
    --lstm-units 64 \
    --batch-size 8 \
    --samples-per-epoch 500
```

### Large Model (maximum quality)

```bash
python train_with_hf_datasets.py \
    --epochs 100 \
    --lstm-units 256 \
    --batch-size 32 \
    --samples-per-epoch 10000
```

## Training Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--clean-dataset` | `librispeech_asr` | HF dataset for clean speech |
| `--noise-dataset` | `dns-challenge/dns-challenge-4` | HF dataset for noise |
| `--clean-split` | `train.clean.100` | Dataset split for clean speech |
| `--noise-split` | `train` | Dataset split for noise |
| `--epochs` | 50 | Number of training epochs |
| `--batch-size` | 16 | Batch size (reduce if OOM) |
| `--samples-per-epoch` | 1000 | Samples per epoch |
| `--lstm-units` | 128 | LSTM units (64/128/256) |
| `--learning-rate` | 0.001 | Adam learning rate |
| `--no-qat` | False | Disable quantization-aware training |
| `--cache-dir` | None | Cache directory for datasets |
| `--output-dir` | `./models` | Output directory |

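QAT is on by default (`--no-qat` disables it). In Keras, quantization-aware training is usually set up with `tensorflow-model-optimization`; the following is a hedged sketch of the general pattern, not the script's actual integration. Note that stock `quantize_model` has limited coverage for recurrent layers, so a DTLN-style model may need per-layer annotation:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def wrap_for_qat(model: tf.keras.Model) -> tf.keras.Model:
    # Inserts fake-quantization ops so the weights learn to tolerate INT8.
    qat_model = tfmot.quantization.keras.quantize_model(model)
    qat_model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="mse",
        metrics=["mae"],
    )
    return qat_model
```
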
## Expected Results

### Training Progress

```
Epoch 1/50
63/63 [==============================] - 145s 2s/step - loss: 0.0234 - mae: 0.1023
Epoch 2/50
63/63 [==============================] - 142s 2s/step - loss: 0.0189 - mae: 0.0854
...
Epoch 50/50
63/63 [==============================] - 141s 2s/step - loss: 0.0042 - mae: 0.0312
```

### Model Performance

| Metric | Expected Value |
|--------|----------------|
| Final loss | 0.003 - 0.006 |
| MAE | 0.02 - 0.04 |
| SNR improvement | 12-15 dB |
| PESQ | 3.0 - 3.5 |
| STOI | 0.90 - 0.95 |

### Model Size

| Configuration | Parameters | FP32 Size | INT8 Size |
|---------------|------------|-----------|-----------|
| Small (64 units) | ~150K | ~600 KB | ~150 KB |
| Medium (128 units) | ~400K | ~1.6 MB | ~400 KB |
| Large (256 units) | ~1.3M | ~5.2 MB | ~1.3 MB |

## Using the Trained Model

### Load and Test

```python
import tensorflow as tf
import numpy as np
import soundfile as sf

# Load the trained model
model = tf.keras.models.load_model('./models/best_model.h5')

# Load noisy audio
noisy_audio, sr = sf.read('noisy_audio.wav')

# Ensure the expected sample rate (16 kHz)
if sr != 16000:
    import librosa
    noisy_audio = librosa.resample(noisy_audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Denoise (add a batch dimension, then strip it from the output)
enhanced_audio = model.predict(noisy_audio[np.newaxis, :])[0]

# Save the result
sf.write('enhanced_audio.wav', enhanced_audio, sr)
```

### Convert to TFLite INT8

```bash
python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
```

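The calibration directory feeds the converter's representative dataset, which is how INT8 activation ranges are estimated. Conceptually the conversion looks like this (a sketch, not the literal contents of `convert_to_tflite.py`; the input shape is a placeholder):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("./models/best_model.h5")

def representative_dataset():
    # In practice, yield real audio windows loaded from ./calibration_data.
    for _ in range(200):
        yield [np.random.randn(1, 16000).astype(np.float32)]  # placeholder shape

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("./models/dtln_int8.tflite", "wb") as f:
    f.write(converter.convert())
```
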
### Optimize for a Hardware Accelerator (Arm Ethos-U)

```bash
# Install the Vela compiler
pip install ethos-u-vela

# Optimize for Ethos-U55
vela \
    --accelerator-config ethos-u55-256 \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram \
    ./models/dtln_int8.tflite
```

## Troubleshooting

### Out of Memory (OOM)

**Problem**: Training crashes with an OOM error.

**Solutions**:
```bash
# Reduce batch size
--batch-size 8

# Reduce samples per epoch
--samples-per-epoch 500

# Reduce LSTM units
--lstm-units 64
```

### Dataset Download Fails

**Problem**: Cannot download a dataset from Hugging Face.

**Solutions**:
1. Check your internet connection
2. Log in to Hugging Face: `huggingface-cli login`
3. Try a different dataset
4. Use local files with the original `train_dtln.py`

### Slow Training

**Problem**: Training is very slow.

**Solutions**:
```bash
# Streaming datasets are already the default.
# Reduce samples per epoch:
--samples-per-epoch 500

# Use a GPU (automatically detected). Verify with:
# python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

### Poor Model Quality

**Problem**: The model doesn't denoise well.

**Solutions**:
1. Train longer: `--epochs 100`
2. Use more samples: `--samples-per-epoch 5000`
3. Use a larger model: `--lstm-units 256`
4. Use better datasets (LibriSpeech 360 h instead of 100 h)
5. Check the training loss - it should be below 0.01

## Advanced Topics

### Custom Dataset

To use your own dataset, it must be on the Hugging Face Hub with an `audio` column:

```bash
python train_with_hf_datasets.py \
    --clean-dataset "your-username/your-speech-dataset" \
    --noise-dataset "your-username/your-noise-dataset"
```

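For reference, loading such a dataset in streaming mode (the training script's default, which avoids downloading the full corpus up front) looks like this; the repo name is a placeholder:

```python
from datasets import Audio, load_dataset

ds = load_dataset("your-username/your-speech-dataset", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # decode/resample on the fly

sample = next(iter(ds))
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
```
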
### Multi-GPU Training

By default TensorFlow trains on a single device; to spread training across several GPUs, wrap model construction in a distribution strategy:

```python
# In train_with_hf_datasets.py, add:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = dtln.build_model()
    # ... rest of the model setup
```

### Transfer Learning

Start from a pre-trained checkpoint:

```python
# Load pre-trained weights
model = tf.keras.models.load_model('pretrained_model.h5')

# Continue training
history = model.fit(train_generator, epochs=20)
```

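A common variant when adapting to a new noise domain is to freeze the early layers and fine-tune the rest at a lower learning rate (a sketch; how many layers to freeze depends on the architecture `build_model` produces):

```python
# Freeze everything except the last two layers, then recompile.
for layer in model.layers[:-2]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse", metrics=["mae"])
history = model.fit(train_generator, epochs=20)
```
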
### Monitoring with TensorBoard

```bash
# During training, logs are saved to ./models/logs/
# View them with:
tensorboard --logdir ./models/logs

# Then open http://localhost:6006 in a browser.
```

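If you need to wire the logging up yourself, the standard Keras callback suffices (the log directory below matches the guide's default, assuming that is where the script writes):

```python
import tensorflow as tf

tb = tf.keras.callbacks.TensorBoard(log_dir="./models/logs", update_freq="epoch")
# history = model.fit(train_generator, epochs=50, callbacks=[tb])
```
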
## Production Deployment

### 1. Train the Production Model

```bash
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 10000 \
    --lstm-units 128 \
    --output-dir ./models/production
```

### 2. Evaluate on a Test Set

```bash
# Write a small test script (evaluate_model.py here) that measures PESQ, STOI, and SNR
python evaluate_model.py \
    --model ./models/production/best_model.h5 \
    --test-dir ./test_data
```

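The metrics come from the `pesq` and `pystoi` packages; here is a sketch of what such a script computes per clean/enhanced pair (illustrative, since `evaluate_model.py` is something you write yourself):

```python
import numpy as np
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def evaluate_pair(clean_path: str, enhanced_path: str) -> dict:
    clean, sr = sf.read(clean_path)
    enhanced, _ = sf.read(enhanced_path)
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]

    # SNR of the enhanced signal against the clean reference.
    residual = enhanced - clean
    snr_db = 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-10))

    return {
        "pesq": pesq(sr, clean, enhanced, "wb"),  # wideband mode for 16 kHz
        "stoi": stoi(clean, enhanced, sr),
        "snr_db": snr_db,
    }
```
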
### 3. Convert to TFLite INT8

```bash
python convert_to_tflite.py \
    --model ./models/production/best_model.h5 \
    --output ./models/dtln_production_int8.tflite
```

### 4. Deploy to an Edge Device

See `alif_e7_voice_denoising_guide.md` for hardware deployment details.

## Resources

- **HF Datasets**: https://huggingface.co/datasets
- **Audio Datasets**: https://huggingface.co/datasets?task_categories=audio
- **DTLN Paper**: https://arxiv.org/abs/2005.07551
- **TFLite Guide**: https://www.tensorflow.org/lite

## Citation

```bibtex
@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L. and Meyer, Bernd T.},
  booktitle={Interspeech},
  year={2020}
}
```