# Plapre Simple - Phoneme-based TTS Model
A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.
## Model Overview
This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).
## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

- Standard tokens (4): `<pad>`, `<unk>`, `<bos>`, `<eos>`
- Phonemes (109): IPA phoneme characters from `phoneme_list.json`
- Audio tokens (65,536): `<audio_0>` to `<audio_65535>` (representing neural codec codes)
- Special structure tokens (8): `<phoneme_start>`, `<phoneme_end>`, `<audio_start>`, `<audio_end>`, `<ref_audio_start>`, `<ref_audio_end>`, `<ref_text_start>`, `<ref_text_end>`
- Placeholder tokens (128): `<placeholder_0>` to `<placeholder_127>` (reserved for future use)
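This layout implies a straightforward block-wise ID assignment. The sketch below is illustrative only; the block ordering and helper names are assumptions, and the actual construction lives in the `PhonemeTokenizer` class in `train_simple.py`.

```python
import json

# Illustrative vocabulary construction (block order is an assumption; see
# PhonemeTokenizer in train_simple.py for the real implementation).
def build_vocab(phoneme_list_path="phoneme_list.json"):
    vocab = {}

    def add(tokens):
        for t in tokens:
            vocab[t] = len(vocab)

    add(["<pad>", "<unk>", "<bos>", "<eos>"])                   # 4 standard tokens
    with open(phoneme_list_path) as f:
        add(json.load(f))                                        # 109 IPA phonemes
    add([f"<audio_{i}>" for i in range(65536)])                  # 65,536 audio tokens
    add(["<phoneme_start>", "<phoneme_end>", "<audio_start>", "<audio_end>",
         "<ref_audio_start>", "<ref_audio_end>", "<ref_text_start>", "<ref_text_end>"])
    add([f"<placeholder_{i}>" for i in range(128)])              # reserved for future use
    return vocab
```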
## Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```
### Example Sequence

For the text "You know, when":

- Text: "You know, when"
- Phonemes: `juː nˈoʊ, wˌɛn`
- Training sequence:

```
<phoneme_start> j u ː n ˈ o ʊ , w ˌ ɛ n <phoneme_end> <audio_start> <audio_2151> <audio_43235> <audio_56802> ... (audio tokens) <audio_end>
```
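A minimal sketch of how such a sequence could be assembled from phonemes and codec codes. The helper is hypothetical (not part of the released code), and dropping whitespace between phoneme words is an assumption based on the example above.

```python
# Hypothetical helper: assemble one training sequence as a list of token strings.
# Assumes phonemes are split character-by-character and whitespace is dropped,
# and that audio codes are integers produced by the neural codec.
def build_training_sequence(phonemes: str, audio_codes: list[int]) -> list[str]:
    phoneme_tokens = [ch for ch in phonemes if not ch.isspace()]
    audio_tokens = [f"<audio_{code}>" for code in audio_codes]
    return (["<phoneme_start>"] + phoneme_tokens + ["<phoneme_end>", "<audio_start>"]
            + audio_tokens + ["<audio_end>"])

# Example: build_training_sequence("juː nˈoʊ, wˌɛn", [2151, 43235, 56802])
```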
## Training Objective

- The model uses causal language modeling (next-token prediction)
- Phoneme tokens are masked in the loss (labels set to -100)
- Only the audio tokens contribute to the loss and are predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input
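As a rough illustration of this masking (variable names are assumptions, and whether `<audio_start>` itself is included in the loss is not specified here):

```python
import torch

# Illustrative label masking: copy the input IDs, then hide everything up to
# and including <audio_start> from the loss by setting those labels to -100.
def make_labels(input_ids: torch.Tensor, audio_start_id: int) -> torch.Tensor:
    labels = input_ids.clone()
    audio_start_pos = (input_ids == audio_start_id).nonzero()[0].item()
    labels[: audio_start_pos + 1] = -100  # phoneme prefix is not trained on
    return labels
```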
## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character-by-character (each IPA symbol is a separate token).
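One way to reproduce this phonemization is the `phonemizer` package with its espeak backend; whether the training code uses `phonemizer` or calls espeak-ng directly is not stated here, so treat this as an equivalent sketch.

```python
from phonemizer import phonemize

# espeak-ng based phonemization matching the settings listed above.
text = "You know, when"
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)  # e.g. "juː nˈoʊ, wˌɛn"
```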
## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `audio_token_start_id + n`
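The two conversions implied by this mapping look roughly like the sketch below; the concrete value of `audio_token_start_id` depends on the tokenizer's block layout and is an assumption here.

```python
# Offset of <audio_0> in the vocabulary; the exact value depends on how the
# tokenizer lays out its blocks (assumed here: 4 standard tokens + 109 phonemes).
AUDIO_TOKEN_START_ID = 113

def audio_code_to_token_id(code: int) -> int:
    return AUDIO_TOKEN_START_ID + code

def token_id_to_audio_code(token_id: int) -> int:
    return token_id - AUDIO_TOKEN_START_ID
```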
## Model Details
- Base model: Qwen3-0.6B-Base
- Vocabulary size: 66,192 tokens
- Training dataset: neuphonic/emilia-yodas-english-neucodec
- Batch size: 16 (effective)
- Precision: bfloat16
- Attention: Flash Attention 2
## Usage

To use this model, you'll need:

- The custom `PhonemeTokenizer` class (see `train_simple.py`)
- espeak-ng for phonemization
- A neural audio codec decoder for converting audio tokens to waveforms

```python
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Your inference code here
```
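A rough inference sketch under stated assumptions: the `encode_phonemes` method, the `audio_end_id` attribute, and the decoding step are hypothetical placeholders; adapt them to the actual `PhonemeTokenizer` interface in `train_simple.py` and to your codec.

```python
import torch
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple", torch_dtype=torch.bfloat16)
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# 1. Phonemize the input text and build the generation prompt:
#    <phoneme_start> [phonemes] <phoneme_end> <audio_start>
#    (encode_phonemes is a hypothetical method name).
prompt_ids = tokenizer.encode_phonemes("You know, when")

# 2. Autoregressively generate audio tokens until <audio_end>
#    (audio_end_id is a hypothetical attribute).
output_ids = model.generate(
    torch.tensor([prompt_ids]),
    max_new_tokens=1024,
    eos_token_id=tokenizer.audio_end_id,
)

# 3. Drop the prompt, map <audio_n> token IDs back to codec codes, and decode
#    them to a waveform with the neural codec (e.g. NeuCodec) - not shown here.
audio_token_ids = output_ids[0, len(prompt_ids):].tolist()
```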
## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in vocabulary
- `README.md` - This file
## Training Details
Trained using the Hugging Face Transformers Trainer with:
- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW
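For reference, these settings correspond roughly to the following `TrainingArguments`; unlisted fields (output directory, scheduler, epochs, logging) are assumptions.

```python
from transformers import TrainingArguments

# Approximate reproduction of the listed hyperparameters.
training_args = TrainingArguments(
    output_dir="plapre-simple",
    learning_rate=2e-4,
    warmup_steps=1000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    bf16=True,
    optim="adamw_torch",
)
```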
## License
Inherits license from Qwen3-0.6B-Base.