
Training DTLN with Hugging Face Datasets

Complete guide for training a production-quality voice denoising model using real datasets from Hugging Face Hub.

Overview

This guide shows you how to train the DTLN (Dual-signal Transformation LSTM Network) model using:

  • Clean Speech: LibriSpeech, Common Voice, or other speech datasets
  • Noise: DNS Challenge, FSD50K, or environmental noise datasets
  • Quantization-Aware Training (QAT): For efficient INT8 deployment
  • TensorFlow Lite: For edge deployment
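
For orientation, DTLN stacks two LSTM masking stages: the first predicts a mask over STFT magnitudes, the second operates on a learned time-domain feature representation. Below is a heavily simplified, single-stage sketch of the masking idea, not the full two-stage model the training script builds:

import tensorflow as tf

def build_masking_stage(fft_bins=257, lstm_units=128):
    # One masking stage: an LSTM predicts a sigmoid mask that is applied
    # multiplicatively to the noisy STFT magnitude frames.
    mag = tf.keras.Input(shape=(None, fft_bins))  # (time, freq) frames
    x = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(mag)
    mask = tf.keras.layers.Dense(fft_bins, activation='sigmoid')(x)
    enhanced = tf.keras.layers.Multiply()([mag, mask])
    return tf.keras.Model(mag, enhanced)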

Quick Start

1. Install Dependencies

pip install -r training_requirements.txt

2. Train with Default Settings (LibriSpeech + DNS Challenge)

python train_with_hf_datasets.py \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000 \
    --lstm-units 128 \
    --output-dir ./models

This will:

  • Download LibriSpeech clean speech (train.clean.100 split)
  • Download DNS Challenge noise dataset
  • Train for 50 epochs with quantization-aware training
  • Save best model to ./models/best_model.h5

3. Convert to TensorFlow Lite

python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
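
If you are curious what convert_to_tflite.py has to do under the hood, full-integer quantization in TensorFlow Lite follows a standard recipe. The sketch below assumes one-second 16 kHz input windows and uses random arrays as a stand-in for the real clips in --calibration-dir:

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('./models/best_model.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data():
    # Replace with real audio windows read from --calibration-dir; the
    # shape must match the model input (one-second 16 kHz clips assumed).
    for _ in range(100):
        yield [np.random.randn(1, 16000).astype(np.float32)]

converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open('./models/dtln_int8.tflite', 'wb') as f:
    f.write(converter.convert())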

Available Datasets

Clean Speech Datasets

LibriSpeech (Default)

  • Dataset: librispeech_asr
  • Split: train.clean.100 (100 hours) or train.clean.360 (360 hours)
  • Language: English
  • Quality: High-quality read speech
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.100
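
To preview what the script consumes, you can stream the dataset yourself. Note that librispeech_asr requires a config name; "all" exposes the train.clean.100 split used here:

from datasets import load_dataset

# Streaming avoids downloading the full 100-hour corpus up front.
ds = load_dataset("librispeech_asr", "all", split="train.clean.100", streaming=True)

for example in ds.take(2):
    audio = example["audio"]  # dict with 'array' and 'sampling_rate'
    print(audio["sampling_rate"], audio["array"].shape)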

Common Voice

  • Dataset: mozilla-foundation/common_voice_11_0
  • Split: train
  • Language: Multiple (specify with config, e.g., "en")
  • Quality: User-contributed speech
python train_with_hf_datasets.py \
    --clean-dataset "mozilla-foundation/common_voice_11_0" \
    --clean-split train

VoxPopuli

  • Dataset: facebook/voxpopuli
  • Split: train
  • Language: Multiple European languages
  • Quality: European Parliament recordings

Other Options

  • google/fleurs - Multilingual speech
  • facebook/multilingual_librispeech - Multiple languages
  • Any HF dataset with audio column

Noise Datasets

DNS Challenge (Default)

  • Dataset: dns-challenge/dns-challenge-4
  • Split: train
  • Content: Diverse environmental noises
python train_with_hf_datasets.py \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train

FSD50K (Freesound Dataset)

  • Dataset: Fhrozen/FSD50k
  • Split: train
  • Content: 50,000+ sound events

WHAM! Noise

  • Dataset: JorisCos/WHAM
  • Split: train
  • Content: Ambient noise samples

Create Your Own

If a noise dataset isn't available, the script will fall back to synthetic white noise.
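
Whichever noise source is used, training pairs are built by mixing clean speech with noise at a controlled signal-to-noise ratio. A minimal sketch of that mixing step (the 0–20 dB SNR range is an assumption, not necessarily what the script uses):

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Tile or crop the noise to match the clean clip, then scale it so the
    # mixture has the requested SNR.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# White-noise fallback at a random SNR between 0 and 20 dB.
clean = np.random.randn(16000).astype(np.float32)  # stand-in for a speech clip
noise = np.random.randn(16000).astype(np.float32)  # synthetic white noise
noisy = mix_at_snr(clean, noise, snr_db=np.random.uniform(0, 20))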

Training Configuration

Basic Training

python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000

Advanced Training

python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 5000 \
    --lstm-units 128 \
    --learning-rate 0.001 \
    --output-dir ./models/production \
    --cache-dir ./hf_cache

Small Model (for constrained devices)

python train_with_hf_datasets.py \
    --epochs 50 \
    --lstm-units 64 \
    --batch-size 8 \
    --samples-per-epoch 500

Large Model (maximum quality)

python train_with_hf_datasets.py \
    --epochs 100 \
    --lstm-units 256 \
    --batch-size 32 \
    --samples-per-epoch 10000

Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--clean-dataset` | `librispeech_asr` | HF dataset for clean speech |
| `--noise-dataset` | `dns-challenge/dns-challenge-4` | HF dataset for noise |
| `--clean-split` | `train.clean.100` | Dataset split for clean speech |
| `--noise-split` | `train` | Dataset split for noise |
| `--epochs` | `50` | Number of training epochs |
| `--batch-size` | `16` | Batch size (reduce if OOM) |
| `--samples-per-epoch` | `1000` | Samples per epoch |
| `--lstm-units` | `128` | LSTM units (64/128/256) |
| `--learning-rate` | `0.001` | Adam learning rate |
| `--no-qat` | `False` | Disable quantization-aware training |
| `--cache-dir` | `None` | Cache directory for datasets |
| `--output-dir` | `./models` | Output directory |

Expected Results

Training Progress

Epoch 1/50
63/63 [==============================] - 145s 2s/step - loss: 0.0234 - mae: 0.1023
Epoch 2/50
63/63 [==============================] - 142s 2s/step - loss: 0.0189 - mae: 0.0854
...
Epoch 50/50
63/63 [==============================] - 141s 2s/step - loss: 0.0042 - mae: 0.0312

Model Performance

| Metric | Expected Value |
|---|---|
| Final Loss | 0.003 - 0.006 |
| MAE | 0.02 - 0.04 |
| SNR Improvement | 12-15 dB |
| PESQ | 3.0 - 3.5 |
| STOI | 0.90 - 0.95 |

Model Size

| Configuration | Parameters | FP32 Size | INT8 Size |
|---|---|---|---|
| Small (64 units) | ~150K | ~600 KB | ~150 KB |
| Medium (128 units) | ~400K | ~1.6 MB | ~400 KB |
| Large (256 units) | ~1.3M | ~5.2 MB | ~1.3 MB |
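
The FP32 and INT8 sizes follow directly from the parameter count (4 bytes vs. 1 byte per weight, plus a small overhead for the model structure). A quick sanity check:

import tensorflow as tf

model = tf.keras.models.load_model('./models/best_model.h5')
params = model.count_params()
print(f"{params / 1e3:.0f}K parameters")
print(f"~{params * 4 / 1e6:.1f} MB at FP32, ~{params / 1e6:.1f} MB at INT8")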

Using Trained Model

Load and Test

import tensorflow as tf
import numpy as np
import soundfile as sf

# Load model
model = tf.keras.models.load_model('./models/best_model.h5')

# Load noisy audio
noisy_audio, sr = sf.read('noisy_audio.wav')

# Collapse multi-channel recordings to mono and use float32
if noisy_audio.ndim > 1:
    noisy_audio = noisy_audio.mean(axis=1)
noisy_audio = noisy_audio.astype(np.float32)

# Ensure correct sample rate (16 kHz)
if sr != 16000:
    import librosa
    noisy_audio = librosa.resample(noisy_audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Denoise
enhanced_audio = model.predict(noisy_audio[np.newaxis, :])[0]

# Save result
sf.write('enhanced_audio.wav', enhanced_audio, sr)

Convert to TFLite INT8

python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
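
To sanity-check the INT8 model on the host before deployment, run it through the TFLite interpreter. The (1, 16000) input shape below is an assumption; adapt it to whatever shape the converter reports:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='./models/dtln_int8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize float audio into the model's int8 domain, run, and dequantize.
audio = np.zeros((1, 16000), dtype=np.float32)  # placeholder input window
scale, zero_point = inp['quantization']
interpreter.set_tensor(inp['index'], np.round(audio / scale + zero_point).astype(np.int8))
interpreter.invoke()
o_scale, o_zero = out['quantization']
enhanced = (interpreter.get_tensor(out['index']).astype(np.float32) - o_zero) * o_scale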

Optimize for Hardware Accelerator (Arm Ethos-U)

# Install Vela compiler
pip install ethos-u-vela

# Optimize for Ethos-U55
vela \
    --accelerator-config ethos-u55-256 \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram \
    ./models/dtln_int8.tflite
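
Vela writes the optimized model (suffixed `_vela.tflite`) along with a performance summary to its output directory (`./output` by default).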

Troubleshooting

Out of Memory (OOM)

Problem: Training crashes with OOM error

Solutions:

# Reduce batch size
--batch-size 8

# Reduce samples per epoch
--samples-per-epoch 500

# Reduce LSTM units
--lstm-units 64

Dataset Download Fails

Problem: Cannot download dataset from Hugging Face

Solutions:

  1. Check internet connection
  2. Login to Hugging Face: huggingface-cli login
  3. Try a different dataset
  4. Use local files with original train_dtln.py

Slow Training

Problem: Training is very slow

Solutions:

# Use streaming datasets (already default)
# Reduce samples per epoch
--samples-per-epoch 500

# Use GPU (automatically detected)
# Check: python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Poor Model Quality

Problem: Model doesn't denoise well

Solutions:

  1. Train longer: --epochs 100
  2. Use more samples: --samples-per-epoch 5000
  3. Use larger model: --lstm-units 256
  4. Use better datasets (LibriSpeech 360h instead of 100h)
  5. Check training loss - should be < 0.01

Advanced Topics

Custom Dataset

To use your own dataset, it must be on Hugging Face Hub with an audio column:

python train_with_hf_datasets.py \
    --clean-dataset "your-username/your-speech-dataset" \
    --noise-dataset "your-username/your-noise-dataset"
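
The training script expects 16 kHz mono audio. If your dataset was recorded at a different rate, you can have the datasets library resample on the fly by casting the audio column:

from datasets import Audio, load_dataset

ds = load_dataset("your-username/your-speech-dataset", split="train")
# Decode and resample to 16 kHz whenever the audio column is accessed.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))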

Multi-GPU Training

TensorFlow automatically uses multiple GPUs if available. To control:

# In train_with_hf_datasets.py, add:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = dtln.build_model()
    # ... rest of training code
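
Note that with `MirroredStrategy` the configured batch size is the global batch size: it is split evenly across the visible GPUs, so `--batch-size` can usually be raised in proportion to the number of devices.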

Transfer Learning

Start from a pre-trained checkpoint:

import tensorflow as tf

# Load pre-trained weights
model = tf.keras.models.load_model('pretrained_model.h5')

# Fine-tune with a lower learning rate (match the loss used by the training script)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse', metrics=['mae'])

# Continue training (train_generator comes from the training script)
history = model.fit(train_generator, epochs=20)

Monitoring with TensorBoard

# During training, logs are saved to ./models/logs/
# View with:
tensorboard --logdir ./models/logs

# Open browser to: http://localhost:6006
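
The log directory is populated by a standard Keras callback; if you need to change what gets recorded, the wiring inside the training script presumably looks like this:

import tensorflow as tf

# `model` and `train_generator` come from the training script; the callback
# only needs log_dir to match what `tensorboard --logdir` points at.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir='./models/logs')
model.fit(train_generator, epochs=50, callbacks=[tb_cb])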

Production Deployment

1. Train Production Model

python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 10000 \
    --lstm-units 128 \
    --output-dir ./models/production

2. Evaluate on Test Set

# Create test script to measure PESQ, STOI, SNR
python evaluate_model.py \
    --model ./models/production/best_model.h5 \
    --test-dir ./test_data
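
evaluate_model.py is left to you, but the core metric computation with the pesq and pystoi packages is short (the file names below are placeholders):

import numpy as np
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, sr = sf.read('test_data/clean.wav')        # reference signal
enhanced, _ = sf.read('test_data/enhanced.wav')   # model output
n = min(len(clean), len(enhanced))
clean, enhanced = clean[:n], enhanced[:n]

print('PESQ:', pesq(sr, clean, enhanced, 'wb'))   # wideband mode for 16 kHz
print('STOI:', stoi(clean, enhanced, sr))
residual = clean - enhanced                        # treat the difference as noise
print('SNR (dB):', 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-10)))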

3. Convert to TFLite INT8

python convert_to_tflite.py \
    --model ./models/production/best_model.h5 \
    --output ./models/dtln_production_int8.tflite

4. Deploy to Edge Device

See alif_e7_voice_denoising_guide.md for hardware deployment details.

Resources

Citation

@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L and Meyer, Bernd T},
  booktitle={Interspeech},
  year={2020}
}