
Training DTLN with Hugging Face Datasets

Complete guide for training a production-quality voice denoising model using real datasets from Hugging Face Hub.

Overview

This guide shows you how to train the DTLN (Dual-signal Transformation LSTM Network) model using:

  • Clean Speech: LibriSpeech, Common Voice, or other speech datasets
  • Noise: DNS Challenge, FSD50K, or environmental noise datasets
  • Quantization-Aware Training (QAT): For efficient INT8 deployment
  • TensorFlow Lite: For edge deployment
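
For orientation, DTLN stacks two LSTM masking stages: the first predicts a mask over STFT magnitudes, the second operates on a learned time-domain feature representation. Below is a heavily simplified, single-stage sketch of the masking idea, not the full two-stage model the training script builds:

import tensorflow as tf

def build_masking_stage(fft_bins=257, lstm_units=128):
    # One masking stage: an LSTM predicts a sigmoid mask that is applied
    # multiplicatively to the noisy STFT magnitude frames.
    mag = tf.keras.Input(shape=(None, fft_bins))  # (time, freq) frames
    x = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(mag)
    mask = tf.keras.layers.Dense(fft_bins, activation='sigmoid')(x)
    enhanced = tf.keras.layers.Multiply()([mag, mask])
    return tf.keras.Model(mag, enhanced)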

Quick Start

1. Install Dependencies

pip install -r training_requirements.txt

2. Train with Default Settings (LibriSpeech + DNS Challenge)

python train_with_hf_datasets.py \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000 \
    --lstm-units 128 \
    --output-dir ./models

This will:

  • Download LibriSpeech clean speech (train.clean.100 split)
  • Download DNS Challenge noise dataset
  • Train for 50 epochs with quantization-aware training
  • Save best model to ./models/best_model.h5

3. Convert to TensorFlow Lite

python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
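
If you are curious what convert_to_tflite.py has to do under the hood, full-integer quantization in TensorFlow Lite follows a standard recipe. The sketch below assumes one-second 16 kHz input windows and uses random arrays as a stand-in for the real clips in --calibration-dir:

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('./models/best_model.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data():
    # Replace with real audio windows read from --calibration-dir; the
    # shape must match the model input (one-second 16 kHz clips assumed).
    for _ in range(100):
        yield [np.random.randn(1, 16000).astype(np.float32)]

converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open('./models/dtln_int8.tflite', 'wb') as f:
    f.write(converter.convert())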

Available Datasets

Clean Speech Datasets

LibriSpeech (Default)

  • Dataset: librispeech_asr
  • Split: train.clean.100 (100 hours) or train.clean.360 (360 hours)
  • Language: English
  • Quality: High-quality read speech
python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.100
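
To preview what the script consumes, you can stream the dataset yourself. Note that librispeech_asr requires a config name; "all" exposes the train.clean.100 split used here:

from datasets import load_dataset

# Streaming avoids downloading the full 100-hour corpus up front.
ds = load_dataset("librispeech_asr", "all", split="train.clean.100", streaming=True)

for example in ds.take(2):
    audio = example["audio"]  # dict with 'array' and 'sampling_rate'
    print(audio["sampling_rate"], audio["array"].shape)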

Common Voice

  • Dataset: mozilla-foundation/common_voice_11_0
  • Split: train
  • Language: Multiple (specify with config, e.g., "en")
  • Quality: User-contributed speech
python train_with_hf_datasets.py \
    --clean-dataset "mozilla-foundation/common_voice_11_0" \
    --clean-split train

VoxPopuli

  • Dataset: facebook/voxpopuli
  • Split: train
  • Language: Multiple European languages
  • Quality: European Parliament recordings

Other Options

  • google/fleurs - Multilingual speech
  • facebook/multilingual_librispeech - Multiple languages
  • Any HF dataset with audio column

Noise Datasets

DNS Challenge (Default)

  • Dataset: dns-challenge/dns-challenge-4
  • Split: train
  • Content: Diverse environmental noises
python train_with_hf_datasets.py \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train

FSD50K (Freesound Dataset)

  • Dataset: Fhrozen/FSD50k
  • Split: train
  • Content: 50,000+ sound events

WHAM! Noise

  • Dataset: JorisCos/WHAM
  • Split: train
  • Content: Ambient noise samples

Create Your Own

If a noise dataset isn't available, the script will fall back to synthetic white noise.
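
Whichever noise source is used, training pairs are built by mixing clean speech with noise at a controlled signal-to-noise ratio. A minimal sketch of that mixing step (the 0–20 dB SNR range is an assumption, not necessarily what the script uses):

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Tile or crop the noise to match the clean clip, then scale it so the
    # mixture has the requested SNR.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# White-noise fallback at a random SNR between 0 and 20 dB.
clean = np.random.randn(16000).astype(np.float32)  # stand-in for a speech clip
noise = np.random.randn(16000).astype(np.float32)  # synthetic white noise
noisy = mix_at_snr(clean, noise, snr_db=np.random.uniform(0, 20))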

Training Configuration

Basic Training

python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --epochs 50 \
    --batch-size 16 \
    --samples-per-epoch 1000

Advanced Training

python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --noise-dataset "dns-challenge/dns-challenge-4" \
    --noise-split train \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 5000 \
    --lstm-units 128 \
    --learning-rate 0.001 \
    --output-dir ./models/production \
    --cache-dir ./hf_cache

Small Model (for constrained devices)

python train_with_hf_datasets.py \
    --epochs 50 \
    --lstm-units 64 \
    --batch-size 8 \
    --samples-per-epoch 500

Large Model (maximum quality)

python train_with_hf_datasets.py \
    --epochs 100 \
    --lstm-units 256 \
    --batch-size 32 \
    --samples-per-epoch 10000

Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--clean-dataset` | `librispeech_asr` | HF dataset for clean speech |
| `--noise-dataset` | `dns-challenge/dns-challenge-4` | HF dataset for noise |
| `--clean-split` | `train.clean.100` | Dataset split for clean speech |
| `--noise-split` | `train` | Dataset split for noise |
| `--epochs` | `50` | Number of training epochs |
| `--batch-size` | `16` | Batch size (reduce if OOM) |
| `--samples-per-epoch` | `1000` | Samples per epoch |
| `--lstm-units` | `128` | LSTM units (64/128/256) |
| `--learning-rate` | `0.001` | Adam learning rate |
| `--no-qat` | `False` | Disable quantization-aware training |
| `--cache-dir` | `None` | Cache directory for datasets |
| `--output-dir` | `./models` | Output directory |

Expected Results

Training Progress

Epoch 1/50
63/63 [==============================] - 145s 2s/step - loss: 0.0234 - mae: 0.1023
Epoch 2/50
63/63 [==============================] - 142s 2s/step - loss: 0.0189 - mae: 0.0854
...
Epoch 50/50
63/63 [==============================] - 141s 2s/step - loss: 0.0042 - mae: 0.0312

Model Performance

| Metric | Expected Value |
|---|---|
| Final Loss | 0.003 - 0.006 |
| MAE | 0.02 - 0.04 |
| SNR Improvement | 12-15 dB |
| PESQ | 3.0 - 3.5 |
| STOI | 0.90 - 0.95 |

Model Size

| Configuration | Parameters | FP32 Size | INT8 Size |
|---|---|---|---|
| Small (64 units) | ~150K | ~600 KB | ~150 KB |
| Medium (128 units) | ~400K | ~1.6 MB | ~400 KB |
| Large (256 units) | ~1.3M | ~5.2 MB | ~1.3 MB |
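
The FP32 and INT8 sizes follow directly from the parameter count (4 bytes vs. 1 byte per weight, plus a small overhead for the model structure). A quick sanity check:

import tensorflow as tf

model = tf.keras.models.load_model('./models/best_model.h5')
params = model.count_params()
print(f"{params / 1e3:.0f}K parameters")
print(f"~{params * 4 / 1e6:.1f} MB at FP32, ~{params / 1e6:.1f} MB at INT8")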

Using Trained Model

Load and Test

import tensorflow as tf
import numpy as np
import soundfile as sf

# Load model
model = tf.keras.models.load_model('./models/best_model.h5')

# Load noisy audio
noisy_audio, sr = sf.read('noisy_audio.wav')

# Collapse multi-channel recordings to mono and use float32
if noisy_audio.ndim > 1:
    noisy_audio = noisy_audio.mean(axis=1)
noisy_audio = noisy_audio.astype(np.float32)

# Ensure correct sample rate (16 kHz)
if sr != 16000:
    import librosa
    noisy_audio = librosa.resample(noisy_audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Denoise
enhanced_audio = model.predict(noisy_audio[np.newaxis, :])[0]

# Save result
sf.write('enhanced_audio.wav', enhanced_audio, sr)

Convert to TFLite INT8

python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_int8.tflite \
    --calibration-dir ./calibration_data
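
To sanity-check the INT8 model on the host before deployment, run it through the TFLite interpreter. The (1, 16000) input shape below is an assumption; adapt it to whatever shape the converter reports:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='./models/dtln_int8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize float audio into the model's int8 domain, run, and dequantize.
audio = np.zeros((1, 16000), dtype=np.float32)  # placeholder input window
scale, zero_point = inp['quantization']
interpreter.set_tensor(inp['index'], np.round(audio / scale + zero_point).astype(np.int8))
interpreter.invoke()
o_scale, o_zero = out['quantization']
enhanced = (interpreter.get_tensor(out['index']).astype(np.float32) - o_zero) * o_scale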

Optimize for Hardware Accelerator (Arm Ethos-U)

# Install Vela compiler
pip install ethos-u-vela

# Optimize for Ethos-U55
vela \
    --accelerator-config ethos-u55-256 \
    --system-config Ethos_U55_High_End_Embedded \
    --memory-mode Shared_Sram \
    ./models/dtln_int8.tflite
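
Vela writes the optimized model (suffixed `_vela.tflite`) along with a performance summary to its output directory (`./output` by default).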

Troubleshooting

Out of Memory (OOM)

Problem: Training crashes with OOM error

Solutions:

# Reduce batch size
--batch-size 8

# Reduce samples per epoch
--samples-per-epoch 500

# Reduce LSTM units
--lstm-units 64

Dataset Download Fails

Problem: Cannot download dataset from Hugging Face

Solutions:

  1. Check internet connection
  2. Login to Hugging Face: huggingface-cli login
  3. Try a different dataset
  4. Use local files with original train_dtln.py

Slow Training

Problem: Training is very slow

Solutions:

# Use streaming datasets (already default)
# Reduce samples per epoch
--samples-per-epoch 500

# Use GPU (automatically detected)
# Check: python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Poor Model Quality

Problem: Model doesn't denoise well

Solutions:

  1. Train longer: --epochs 100
  2. Use more samples: --samples-per-epoch 5000
  3. Use larger model: --lstm-units 256
  4. Use better datasets (LibriSpeech 360h instead of 100h)
  5. Check training loss - should be < 0.01

Advanced Topics

Custom Dataset

To use your own dataset, it must be on Hugging Face Hub with an audio column:

python train_with_hf_datasets.py \
    --clean-dataset "your-username/your-speech-dataset" \
    --noise-dataset "your-username/your-noise-dataset"
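
The training script expects 16 kHz mono audio. If your dataset was recorded at a different rate, you can have the datasets library resample on the fly by casting the audio column:

from datasets import Audio, load_dataset

ds = load_dataset("your-username/your-speech-dataset", split="train")
# Decode and resample to 16 kHz whenever the audio column is accessed.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))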

Multi-GPU Training

TensorFlow automatically uses multiple GPUs if available. To control:

# In train_with_hf_datasets.py, add:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = dtln.build_model()
    # ... rest of training code
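
Note that with `MirroredStrategy` the configured batch size is the global batch size: it is split evenly across the visible GPUs, so `--batch-size` can usually be raised in proportion to the number of devices.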

Transfer Learning

Start from a pre-trained checkpoint:

import tensorflow as tf

# Load pre-trained weights
model = tf.keras.models.load_model('pretrained_model.h5')

# Fine-tune with a lower learning rate (match the loss used by the training script)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse', metrics=['mae'])

# Continue training (train_generator comes from the training script)
history = model.fit(train_generator, epochs=20)

Monitoring with TensorBoard

# During training, logs are saved to ./models/logs/
# View with:
tensorboard --logdir ./models/logs

# Open browser to: http://localhost:6006
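
The log directory is populated by a standard Keras callback; if you need to change what gets recorded, the wiring inside the training script presumably looks like this:

import tensorflow as tf

# `model` and `train_generator` come from the training script; the callback
# only needs log_dir to match what `tensorboard --logdir` points at.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir='./models/logs')
model.fit(train_generator, epochs=50, callbacks=[tb_cb])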

Production Deployment

1. Train Production Model

python train_with_hf_datasets.py \
    --clean-dataset librispeech_asr \
    --clean-split train.clean.360 \
    --epochs 100 \
    --batch-size 32 \
    --samples-per-epoch 10000 \
    --lstm-units 128 \
    --output-dir ./models/production

2. Evaluate on Test Set

# Create test script to measure PESQ, STOI, SNR
python evaluate_model.py \
    --model ./models/production/best_model.h5 \
    --test-dir ./test_data
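
evaluate_model.py is left to you, but the core metric computation with the pesq and pystoi packages is short (the file names below are placeholders):

import numpy as np
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, sr = sf.read('test_data/clean.wav')        # reference signal
enhanced, _ = sf.read('test_data/enhanced.wav')   # model output
n = min(len(clean), len(enhanced))
clean, enhanced = clean[:n], enhanced[:n]

print('PESQ:', pesq(sr, clean, enhanced, 'wb'))   # wideband mode for 16 kHz
print('STOI:', stoi(clean, enhanced, sr))
residual = clean - enhanced                        # treat the difference as noise
print('SNR (dB):', 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-10)))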

3. Convert to TFLite INT8

python convert_to_tflite.py \
    --model ./models/production/best_model.h5 \
    --output ./models/dtln_production_int8.tflite

4. Deploy to Edge Device

See alif_e7_voice_denoising_guide.md for hardware deployment details.

Resources

Citation

@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L and Meyer, Bernd T},
  booktitle={Interspeech},
  year={2020}
}