
# Plapre Simple - Phoneme-based TTS Model

A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.

## Model Overview

This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).

## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

1. **Standard tokens (4):** `<pad>`, `<unk>`, `<bos>`, `<eos>`
2. **Phonemes (109):** IPA phoneme characters from `phoneme_list.json`
3. **Audio tokens (65,536):** `<audio_0>` to `<audio_65535>`, representing neural codec codes
4. **Special structure tokens (8):**
   - `<phoneme_start>`, `<phoneme_end>`
   - `<audio_start>`, `<audio_end>`
   - `<ref_audio_start>`, `<ref_audio_end>`
   - `<ref_text_start>`, `<ref_text_end>`
5. **Placeholder tokens (128):** `<placeholder_0>` to `<placeholder_127>`, reserved for future use
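
Assuming token IDs are assigned in the order listed above, the vocabulary could be assembled roughly as follows; `build_vocab` is an illustrative helper, not the actual construction in `train_simple.py`:

```python
# Illustrative vocabulary assembly, assuming IDs follow the order above.
STANDARD_TOKENS = ["<pad>", "<unk>", "<bos>", "<eos>"]
STRUCTURE_TOKENS = [
    "<phoneme_start>", "<phoneme_end>",
    "<audio_start>", "<audio_end>",
    "<ref_audio_start>", "<ref_audio_end>",
    "<ref_text_start>", "<ref_text_end>",
]

def build_vocab(phoneme_list):
    tokens = (
        STANDARD_TOKENS                               # 4 standard tokens
        + phoneme_list                                # 109 IPA phonemes
        + [f"<audio_{n}>" for n in range(65536)]      # 65,536 audio tokens
        + STRUCTURE_TOKENS                            # 8 structure tokens
        + [f"<placeholder_{n}>" for n in range(128)]  # 128 placeholders
    )
    return {tok: i for i, tok in enumerate(tokens)}
```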

### Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```

### Example Sequence

For the text "You know, when":

1. Text: `You know, when`
2. Phonemes: `juː nˈoʊ, wˌɛn`
3. Training sequence:

   ```
   <phoneme_start>
   j u ː   n ˈ o ʊ ,   w ˌ ɛ n
   <phoneme_end>
   <audio_start>
   <audio_2151> <audio_43235> <audio_56802> ... (audio tokens)
   <audio_end>
   ```
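As a rough illustration, such a sequence could be assembled as follows; `build_training_sequence` and the `vocab` lookup are illustrative helpers, not the actual `PhonemeTokenizer` API:

```python
# Illustrative sketch: assemble a training sequence of token IDs from a
# phoneme string and the codec codes of the matching audio clip.
# `vocab` maps token strings to IDs (see the vocabulary sketch above).
def build_training_sequence(vocab, phoneme_str, audio_codes):
    ids = [vocab["<phoneme_start>"]]
    for ch in phoneme_str:  # one token per IPA character; spaces and
        # punctuation are assumed to be part of the phoneme vocabulary
        ids.append(vocab.get(ch, vocab["<unk>"]))
    ids.append(vocab["<phoneme_end>"])
    ids.append(vocab["<audio_start>"])
    ids.extend(vocab[f"<audio_{n}>"] for n in audio_codes)
    ids.append(vocab["<audio_end>"])
    return ids
```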

## Training Objective

- The model uses causal language modeling (next-token prediction).
- Phoneme tokens are masked out of the loss (their labels are set to -100); see the sketch below.
- Only the audio tokens are trained to be predicted from the phoneme context.
- This teaches the model to generate audio tokens conditioned on phoneme input.
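
A minimal sketch of that masking, assuming the tokenized sequence is a Python list and that everything up to and including `<audio_start>` counts as context (`make_labels` is an illustrative helper, not code from `train_simple.py`):

```python
# Illustrative label masking for causal LM training: positions up to and
# including <audio_start> get the ignore index -100, so only the audio
# tokens (and <audio_end>) contribute to the loss.
def make_labels(input_ids, audio_start_id):
    labels = list(input_ids)
    boundary = input_ids.index(audio_start_id)
    for i in range(boundary + 1):
        labels[i] = -100  # phoneme context is excluded from the loss
    return labels
```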

## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character by character (each IPA symbol is a separate token).
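
For reference, the `phonemizer` package exposes these exact settings through its espeak backend; using `phonemizer` here is an assumption about how the training script drives espeak-ng:

```python
from phonemizer import phonemize

# Assumption: the phonemizer package's espeak backend, configured with the
# settings listed above. The training script may invoke espeak-ng differently.
phonemes = phonemize(
    "You know, when",
    language="en-us",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
    strip=True,
)
print(phonemes)        # "juː nˈoʊ, wˌɛn" (matches the example above)
print(list(phonemes))  # character-level tokens, one per IPA symbol
```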

## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `audio_token_start_id + n`
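
Because the mapping is a fixed offset, the round trip is trivial; `audio_token_start_id` is simply the tokenizer's ID for `<audio_0>`:

```python
# Round trip between codec codes and vocabulary token IDs.
# audio_token_start_id is the tokenizer's ID for <audio_0>.
def code_to_token_id(code, audio_token_start_id):
    return audio_token_start_id + code

def token_id_to_code(token_id, audio_token_start_id):
    return token_id - audio_token_start_id
```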

## Model Details

- Base model: Qwen3-0.6B-Base
- Vocabulary size: 66,192 tokens
- Training dataset: `neuphonic/emilia-yodas-english-neucodec`
- Effective batch size: 16
- Precision: bfloat16
- Attention: Flash Attention 2

## Usage

To use this model, you'll need:

1. The custom `PhonemeTokenizer` class (see `train_simple.py`)
2. espeak-ng for phonemization
3. A neural audio codec decoder for converting audio tokens to waveforms
```python
from transformers import AutoModelForCausalLM

from train_simple import PhonemeTokenizer

# Load the fine-tuned model and the custom phoneme tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")
```
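
A rough inference sketch following the sequence format above; `encode_phoneme_prompt` and `audio_end_id` are hypothetical names for whatever `PhonemeTokenizer` actually provides, and decoding the generated audio tokens to a waveform with the neural codec (e.g. NeuCodec) is not shown:

```python
import torch

# Hypothetical API: encode phonemes as
# <phoneme_start> ... <phoneme_end> <audio_start>
prompt_ids = tokenizer.encode_phoneme_prompt("juː nˈoʊ, wˌɛn")
input_ids = torch.tensor([prompt_ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=512,
        eos_token_id=tokenizer.audio_end_id,  # hypothetical: ID of <audio_end>
    )

# Keep only the newly generated audio tokens, map token IDs back to codec
# codes (see "Audio Token Encoding"), then decode with the neural codec.
audio_token_ids = output[0, input_ids.shape[1]:]
```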

## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in the vocabulary
- `README.md` - This file

## Training Details

Trained using the Hugging Face Transformers `Trainer` with the following hyperparameters (sketched below):

- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation steps: 4
- Per-device batch size: 4
- Optimizer: AdamW
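
A minimal sketch of how these hyperparameters map onto `TrainingArguments`; `output_dir` and anything not listed on this card are assumptions:

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters above onto TrainingArguments.
# output_dir and any argument not listed in this card are assumptions.
args = TrainingArguments(
    output_dir="plapre-simple",
    learning_rate=2e-4,
    warmup_steps=1000,
    gradient_accumulation_steps=4,  # 4 per device x 4 steps = 16 effective
    per_device_train_batch_size=4,
    optim="adamw_torch",
    bf16=True,  # bfloat16 precision (see Model Details)
)
```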

## License

This model inherits its license from Qwen3-0.6B-Base.
