# Training DTLN with Hugging Face Datasets

Complete guide for training a production-quality voice denoising model using real datasets from the Hugging Face Hub.

## Overview
This guide shows you how to train the DTLN (Dual-signal Transformation LSTM Network) model using:
- Clean Speech: LibriSpeech, Common Voice, or other speech datasets
- Noise: DNS Challenge, FSD50K, or environmental noise datasets
- Quantization-Aware Training (QAT): For efficient INT8 deployment
- TensorFlow Lite: For edge deployment
## Quick Start

### 1. Install Dependencies

```bash
pip install -r training_requirements.txt
```
### 2. Train with Default Settings (LibriSpeech + DNS Challenge)

```bash
python train_with_hf_datasets.py \
  --epochs 50 \
  --batch-size 16 \
  --samples-per-epoch 1000 \
  --lstm-units 128 \
  --output-dir ./models
```
This will:
- Download LibriSpeech clean speech (`train.clean.100` split)
- Download the DNS Challenge noise dataset
- Train for 50 epochs with quantization-aware training
- Save the best model to `./models/best_model.h5` (see the checkpoint sketch below)
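How the best model is kept is not shown above; in Keras this is typically a `ModelCheckpoint` callback. A minimal sketch, assuming the script monitors the training loss (the exact metric and wiring in `train_with_hf_datasets.py` are assumptions):

```python
import tensorflow as tf

# Hypothetical wiring: keep only the best weights seen so far.
# The actual script may monitor a validation metric instead of "loss".
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "./models/best_model.h5",
    monitor="loss",
    save_best_only=True,
)
# model.fit(train_generator, epochs=50, callbacks=[checkpoint])
```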
### 3. Convert to TensorFlow Lite

```bash
python convert_to_tflite.py \
  --model ./models/best_model.h5 \
  --output ./models/dtln_int8.tflite \
  --calibration-dir ./calibration_data
```
## Available Datasets

### Clean Speech Datasets

#### LibriSpeech (Default)

- Dataset: `librispeech_asr`
- Split: `train.clean.100` (100 hours) or `train.clean.360` (360 hours)
- Language: English
- Quality: High-quality read speech

```bash
python train_with_hf_datasets.py \
  --clean-dataset librispeech_asr \
  --clean-split train.clean.100
```
#### Common Voice

- Dataset: `mozilla-foundation/common_voice_11_0`
- Split: `train`
- Language: Multiple (specify with the config, e.g., "en")
- Quality: User-contributed speech

```bash
python train_with_hf_datasets.py \
  --clean-dataset "mozilla-foundation/common_voice_11_0" \
  --clean-split train
```
#### VoxPopuli

- Dataset: `facebook/voxpopuli`
- Split: `train`
- Language: Multiple European languages
- Quality: European Parliament recordings
#### Other Options

- `google/fleurs` - Multilingual speech
- `facebook/multilingual_librispeech` - Multiple languages
- Any HF dataset with an audio column (see the streaming sketch below)
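Any of these can be consumed the same way; a minimal sketch of streaming an audio-column dataset with the `datasets` library and resampling to the 16 kHz rate the model expects (dataset and split names match the LibriSpeech example above; using the "all" config to expose the `train.clean.100` split is an assumption about the dataset layout):

```python
from datasets import load_dataset, Audio

# Stream instead of downloading the full archive; the "all" config
# exposes the train.clean.100 split used elsewhere in this guide.
ds = load_dataset("librispeech_asr", "all", split="train.clean.100", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

sample = next(iter(ds))
waveform = sample["audio"]["array"]  # 1-D float array at 16 kHz
```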
### Noise Datasets

#### DNS Challenge (Default)

- Dataset: `dns-challenge/dns-challenge-4`
- Split: `train`
- Content: Diverse environmental noises

```bash
python train_with_hf_datasets.py \
  --noise-dataset "dns-challenge/dns-challenge-4" \
  --noise-split train
```
#### FSD50K (Freesound Dataset)

- Dataset: `Fhrozen/FSD50k`
- Split: `train`
- Content: 50,000+ sound events
#### WHAM! Noise

- Dataset: `JorisCos/WHAM`
- Split: `train`
- Content: Ambient noise samples
#### Create Your Own

If a noise dataset isn't available, the script will fall back to synthetic white noise.
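The noisy training inputs are created by mixing clean speech with noise at some SNR; a minimal sketch of that mixing and the white-noise fallback (the helper below is illustrative, not the script's actual API):

```python
import numpy as np
from typing import Optional

def mix_at_snr(clean: np.ndarray, noise: Optional[np.ndarray], snr_db: float) -> np.ndarray:
    """Mix clean speech with noise at the requested SNR (illustrative helper)."""
    if noise is None:
        # Fallback when no noise dataset is available: synthetic white noise
        noise = np.random.randn(len(clean)).astype(np.float32)
    # Loop/trim the noise clip to the length of the clean clip
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db
    p_clean = np.mean(clean ** 2) + 1e-10
    p_noise = np.mean(noise ** 2) + 1e-10
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + noise

# Example: a 1-second stand-in clip mixed at a random SNR between 0 and 15 dB
clean = np.random.randn(16000).astype(np.float32)
noisy = mix_at_snr(clean, None, snr_db=float(np.random.uniform(0, 15)))
```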
## Training Configuration

### Basic Training

```bash
python train_with_hf_datasets.py \
  --clean-dataset librispeech_asr \
  --noise-dataset "dns-challenge/dns-challenge-4" \
  --epochs 50 \
  --batch-size 16 \
  --samples-per-epoch 1000
```
### Advanced Training

```bash
python train_with_hf_datasets.py \
  --clean-dataset librispeech_asr \
  --clean-split train.clean.360 \
  --noise-dataset "dns-challenge/dns-challenge-4" \
  --noise-split train \
  --epochs 100 \
  --batch-size 32 \
  --samples-per-epoch 5000 \
  --lstm-units 128 \
  --learning-rate 0.001 \
  --output-dir ./models/production \
  --cache-dir ./hf_cache
```
### Small Model (for constrained devices)

```bash
python train_with_hf_datasets.py \
  --epochs 50 \
  --lstm-units 64 \
  --batch-size 8 \
  --samples-per-epoch 500
```
### Large Model (maximum quality)

```bash
python train_with_hf_datasets.py \
  --epochs 100 \
  --lstm-units 256 \
  --batch-size 32 \
  --samples-per-epoch 10000
```
### Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--clean-dataset` | `librispeech_asr` | HF dataset for clean speech |
| `--noise-dataset` | `dns-challenge/dns-challenge-4` | HF dataset for noise |
| `--clean-split` | `train.clean.100` | Dataset split for clean speech |
| `--noise-split` | `train` | Dataset split for noise |
| `--epochs` | 50 | Number of training epochs |
| `--batch-size` | 16 | Batch size (reduce if OOM) |
| `--samples-per-epoch` | 1000 | Samples per epoch |
| `--lstm-units` | 128 | LSTM units (64/128/256) |
| `--learning-rate` | 0.001 | Adam learning rate |
| `--no-qat` | False | Disable quantization-aware training |
| `--cache-dir` | None | Cache directory for datasets |
| `--output-dir` | `./models` | Output directory |
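Quantization-aware training (toggled off with `--no-qat`) is typically applied through `tensorflow_model_optimization`; a minimal sketch, assuming the standard tfmot Keras API (note that recurrent layers such as LSTM have limited out-of-the-box QAT support in some tfmot versions, so the script's actual wrapping may be more selective):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def apply_qat(model: tf.keras.Model) -> tf.keras.Model:
    # Insert fake-quantization nodes so training learns weights that
    # survive INT8 conversion (sketch; the script may differ).
    qat_model = tfmot.quantization.keras.quantize_model(model)
    qat_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mae")
    return qat_model
```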
## Expected Results

### Training Progress

```text
Epoch 1/50
63/63 [==============================] - 145s 2s/step - loss: 0.0234 - mae: 0.1023
Epoch 2/50
63/63 [==============================] - 142s 2s/step - loss: 0.0189 - mae: 0.0854
...
Epoch 50/50
63/63 [==============================] - 141s 2s/step - loss: 0.0042 - mae: 0.0312
```
### Model Performance
| Metric | Expected Value |
|---|---|
| Final Loss | 0.003 - 0.006 |
| MAE | 0.02 - 0.04 |
| SNR Improvement | 12-15 dB |
| PESQ | 3.0 - 3.5 |
| STOI | 0.90 - 0.95 |
### Model Size
| Configuration | Parameters | FP32 Size | INT8 Size |
|---|---|---|---|
| Small (64 units) | ~150K | ~600 KB | ~150 KB |
| Medium (128 units) | ~400K | ~1.6 MB | ~400 KB |
| Large (256 units) | ~1.3M | ~5.2 MB | ~1.3 MB |
## Using the Trained Model

### Load and Test

```python
import tensorflow as tf
import numpy as np
import soundfile as sf

# Load the trained model
model = tf.keras.models.load_model('./models/best_model.h5')

# Load noisy audio
noisy_audio, sr = sf.read('noisy_audio.wav')

# Ensure the expected 16 kHz sample rate
if sr != 16000:
    import librosa
    noisy_audio = librosa.resample(noisy_audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Denoise (add a batch dimension, then drop it from the output)
enhanced_audio = model.predict(noisy_audio[np.newaxis, :])[0]

# Save the result
sf.write('enhanced_audio.wav', enhanced_audio, sr)
```
### Convert to TFLite INT8

```bash
python convert_to_tflite.py \
  --model ./models/best_model.h5 \
  --output ./models/dtln_int8.tflite \
  --calibration-dir ./calibration_data
```
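A minimal sketch of what a full-integer conversion script like `convert_to_tflite.py` presumably does: the representative dataset drives INT8 calibration. The input shape and random calibration data below are assumptions; the real script reads clips from `--calibration-dir`:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("./models/best_model.h5")

def representative_dataset():
    # Yield calibration samples shaped like the model input
    # (assumed here to be 1 s of 16 kHz audio; adjust to your model).
    for _ in range(100):
        yield [np.random.randn(1, 16000).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("./models/dtln_int8.tflite", "wb") as f:
    f.write(converter.convert())
```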
### Optimize for Hardware Accelerator (Arm Ethos-U)

```bash
# Install the Vela compiler
pip install ethos-u-vela

# Optimize for Ethos-U55
vela \
  --accelerator-config ethos-u55-256 \
  --system-config Ethos_U55_High_End_Embedded \
  --memory-mode Shared_Sram \
  ./models/dtln_int8.tflite
```
## Troubleshooting

### Out of Memory (OOM)

Problem: Training crashes with an OOM error

Solutions:

```bash
# Reduce batch size
--batch-size 8

# Reduce samples per epoch
--samples-per-epoch 500

# Reduce LSTM units
--lstm-units 64
```
### Dataset Download Fails

Problem: Cannot download a dataset from Hugging Face

Solutions:
- Check your internet connection
- Log in to Hugging Face: `huggingface-cli login`
- Try a different dataset
- Use local files with the original `train_dtln.py`
### Slow Training

Problem: Training is very slow

Solutions:

```bash
# Use streaming datasets (already the default)

# Reduce samples per epoch
--samples-per-epoch 500

# Use a GPU (automatically detected); check with:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```
### Poor Model Quality

Problem: The model doesn't denoise well

Solutions:
- Train longer: `--epochs 100`
- Use more samples: `--samples-per-epoch 5000`
- Use a larger model: `--lstm-units 256`
- Use better datasets (LibriSpeech 360h instead of 100h)
- Check the training loss - it should be < 0.01
## Advanced Topics

### Custom Dataset

To use your own dataset, it must be on the Hugging Face Hub with an audio column:

```bash
python train_with_hf_datasets.py \
  --clean-dataset "your-username/your-speech-dataset" \
  --noise-dataset "your-username/your-noise-dataset"
```
### Multi-GPU Training

TensorFlow trains on a single GPU by default. To use multiple GPUs, wrap model construction in a `MirroredStrategy` scope:

```python
import tensorflow as tf

# In train_with_hf_datasets.py, build the model inside the strategy scope:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = dtln.build_model()
    # ... rest of the training code (compile, fit)
```
### Transfer Learning

Start from a pre-trained checkpoint:

```python
import tensorflow as tf

# Load pre-trained weights
model = tf.keras.models.load_model('pretrained_model.h5')

# Continue training on the new data
history = model.fit(train_generator, epochs=20)
```
### Monitoring with TensorBoard

```bash
# During training, logs are saved to ./models/logs/
# View with:
tensorboard --logdir ./models/logs

# Then open a browser at http://localhost:6006
```
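If you adapt the training script yourself, these logs come from a standard Keras `TensorBoard` callback; a minimal sketch of wiring it up (the log directory matches the path above, but the exact callback configuration in `train_with_hf_datasets.py` is an assumption):

```python
import tensorflow as tf

# Write per-epoch scalars that `tensorboard --logdir ./models/logs` can display
# (assumed wiring; the actual script may configure this differently).
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="./models/logs", update_freq="epoch")
# model.fit(train_generator, epochs=50, callbacks=[tb_callback])
```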
## Production Deployment

### 1. Train Production Model

```bash
python train_with_hf_datasets.py \
  --clean-dataset librispeech_asr \
  --clean-split train.clean.360 \
  --epochs 100 \
  --batch-size 32 \
  --samples-per-epoch 10000 \
  --lstm-units 128 \
  --output-dir ./models/production
```
### 2. Evaluate on Test Set

```bash
# Create a test script to measure PESQ, STOI, and SNR
python evaluate_model.py \
  --model ./models/production/best_model.h5 \
  --test-dir ./test_data
```
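`evaluate_model.py` is something you create (see the comment above); a minimal sketch of computing the three metrics with the `pesq` and `pystoi` packages (`pip install pesq pystoi`), assuming paired 16 kHz clean/enhanced WAV files:

```python
import numpy as np
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def evaluate_pair(clean_path: str, enhanced_path: str) -> dict:
    """Compare an enhanced file against its clean reference (both 16 kHz)."""
    clean, sr = sf.read(clean_path)
    enhanced, _ = sf.read(enhanced_path)
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]
    # SNR of the enhanced signal relative to the clean reference
    residual = enhanced - clean
    snr_db = 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-10))
    return {
        "pesq": pesq(sr, clean, enhanced, "wb"),  # wide-band PESQ
        "stoi": stoi(clean, enhanced, sr),
        "snr_db": float(snr_db),
    }

# Example usage (placeholder paths):
# print(evaluate_pair("test_data/clean_001.wav", "test_data/enhanced_001.wav"))
```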
### 3. Convert to TFLite INT8

```bash
python convert_to_tflite.py \
  --model ./models/production/best_model.h5 \
  --output ./models/dtln_production_int8.tflite
```
### 4. Deploy to Edge Device

See `alif_e7_voice_denoising_guide.md` for hardware deployment details.
## Resources
- HF Datasets: https://huggingface.co/datasets
- Audio Datasets: https://huggingface.co/datasets?task_categories=audio
- DTLN Paper: https://arxiv.org/abs/2005.07551
- TFLite Guide: https://www.tensorflow.org/lite
## Citation

```bibtex
@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L and Meyer, Bernd T},
  booktitle={Interspeech},
  year={2020}
}
```