# Plapre Simple - Phoneme-based TTS Model
A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.
## Model Overview
This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).
## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

- Standard tokens (4): `<pad>`, `<unk>`, `<bos>`, `<eos>`
- Phonemes (109): IPA phoneme characters from `phoneme_list.json`
- Audio tokens (65,536): `<audio_0>` to `<audio_65535>` (representing neural codec codes)
- Special structure tokens (8): `<phoneme_start>`, `<phoneme_end>`, `<audio_start>`, `<audio_end>`, `<ref_audio_start>`, `<ref_audio_end>`, `<ref_text_start>`, `<ref_text_end>`
- Placeholder tokens (128): `<placeholder_0>` to `<placeholder_127>` (reserved for future use)
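This layout implies a straightforward block-wise ID assignment. The sketch below is illustrative only; the block ordering and helper names are assumptions, and the actual construction lives in the `PhonemeTokenizer` class in `train_simple.py`.

```python
import json

# Illustrative vocabulary construction (block order is an assumption; see
# PhonemeTokenizer in train_simple.py for the real implementation).
def build_vocab(phoneme_list_path="phoneme_list.json"):
    vocab = {}

    def add(tokens):
        for t in tokens:
            vocab[t] = len(vocab)

    add(["<pad>", "<unk>", "<bos>", "<eos>"])                   # 4 standard tokens
    with open(phoneme_list_path) as f:
        add(json.load(f))                                        # 109 IPA phonemes
    add([f"<audio_{i}>" for i in range(65536)])                  # 65,536 audio tokens
    add(["<phoneme_start>", "<phoneme_end>", "<audio_start>", "<audio_end>",
         "<ref_audio_start>", "<ref_audio_end>", "<ref_text_start>", "<ref_text_end>"])
    add([f"<placeholder_{i}>" for i in range(128)])              # reserved for future use
    return vocab
```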
## Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```
### Example Sequence

For the text "You know, when":

- Text: "You know, when"
- Phonemes: `juː nˈoʊ, wˌɛn`
- Training sequence:

```
<phoneme_start> j u ː n ˈ o ʊ , w ˌ ɛ n <phoneme_end> <audio_start> <audio_2151> <audio_43235> <audio_56802> ... (audio tokens) <audio_end>
```
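A minimal sketch of how such a sequence could be assembled from phonemes and codec codes. The helper is hypothetical (not part of the released code), and dropping whitespace between phoneme words is an assumption based on the example above.

```python
# Hypothetical helper: assemble one training sequence as a list of token strings.
# Assumes phonemes are split character-by-character and whitespace is dropped,
# and that audio codes are integers produced by the neural codec.
def build_training_sequence(phonemes: str, audio_codes: list[int]) -> list[str]:
    phoneme_tokens = [ch for ch in phonemes if not ch.isspace()]
    audio_tokens = [f"<audio_{code}>" for code in audio_codes]
    return (["<phoneme_start>"] + phoneme_tokens + ["<phoneme_end>", "<audio_start>"]
            + audio_tokens + ["<audio_end>"])

# Example: build_training_sequence("juː nˈoʊ, wˌɛn", [2151, 43235, 56802])
```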
## Training Objective

- The model uses causal language modeling (next-token prediction)
- Phoneme tokens are masked in the loss (labels set to -100)
- Only the audio tokens contribute to the loss and are predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input
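As a rough illustration of this masking (variable names are assumptions, and whether `<audio_start>` itself is included in the loss is not specified here):

```python
import torch

# Illustrative label masking: copy the input IDs, then hide everything up to
# and including <audio_start> from the loss by setting those labels to -100.
def make_labels(input_ids: torch.Tensor, audio_start_id: int) -> torch.Tensor:
    labels = input_ids.clone()
    audio_start_pos = (input_ids == audio_start_id).nonzero()[0].item()
    labels[: audio_start_pos + 1] = -100  # phoneme prefix is not trained on
    return labels
```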
## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character-by-character (each IPA symbol is a separate token).
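One way to reproduce this phonemization is the `phonemizer` package with its espeak backend; whether the training code uses `phonemizer` or calls espeak-ng directly is not stated here, so treat this as an equivalent sketch.

```python
from phonemizer import phonemize

# espeak-ng based phonemization matching the settings listed above.
text = "You know, when"
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)  # e.g. "juː nˈoʊ, wˌɛn"
```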
## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `audio_token_start_id + n`
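The two conversions implied by this mapping look roughly like the sketch below; the concrete value of `audio_token_start_id` depends on the tokenizer's block layout and is an assumption here.

```python
# Offset of <audio_0> in the vocabulary; the exact value depends on how the
# tokenizer lays out its blocks (assumed here: 4 standard tokens + 109 phonemes).
AUDIO_TOKEN_START_ID = 113

def audio_code_to_token_id(code: int) -> int:
    return AUDIO_TOKEN_START_ID + code

def token_id_to_audio_code(token_id: int) -> int:
    return token_id - AUDIO_TOKEN_START_ID
```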
## Model Details
- Base model: Qwen3-0.6B-Base
- Vocabulary size: 66,192 tokens
- Training dataset: neuphonic/emilia-yodas-english-neucodec
- Batch size: 16 (effective)
- Precision: bfloat16
- Attention: Flash Attention 2
## Usage

To use this model, you'll need:

- The custom `PhonemeTokenizer` class (see `train_simple.py`)
- espeak-ng for phonemization
- A neural audio codec decoder for converting audio tokens to waveforms

```python
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Your inference code here
```
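A rough inference sketch under stated assumptions: the `encode_phonemes` method, the `audio_end_id` attribute, and the decoding step are hypothetical placeholders; adapt them to the actual `PhonemeTokenizer` interface in `train_simple.py` and to your codec.

```python
import torch
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple", torch_dtype=torch.bfloat16)
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# 1. Phonemize the input text and build the generation prompt:
#    <phoneme_start> [phonemes] <phoneme_end> <audio_start>
#    (encode_phonemes is a hypothetical method name).
prompt_ids = tokenizer.encode_phonemes("You know, when")

# 2. Autoregressively generate audio tokens until <audio_end>
#    (audio_end_id is a hypothetical attribute).
output_ids = model.generate(
    torch.tensor([prompt_ids]),
    max_new_tokens=1024,
    eos_token_id=tokenizer.audio_end_id,
)

# 3. Drop the prompt, map <audio_n> token IDs back to codec codes, and decode
#    them to a waveform with the neural codec (e.g. NeuCodec) - not shown here.
audio_token_ids = output_ids[0, len(prompt_ids):].tolist()
```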
## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in vocabulary
- `README.md` - This file
## Training Details
Trained using the Hugging Face Transformers Trainer with:
- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW
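For reference, these settings correspond roughly to the following `TrainingArguments`; unlisted fields (output directory, scheduler, epochs, logging) are assumptions.

```python
from transformers import TrainingArguments

# Approximate reproduction of the listed hyperparameters.
training_args = TrainingArguments(
    output_dir="plapre-simple",
    learning_rate=2e-4,
    warmup_steps=1000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    bf16=True,
    optim="adamw_torch",
)
```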
## License
Inherits license from Qwen3-0.6B-Base.