Apply for community grant: Company project (gpu)

#1
by FireRedTeam - opened

FireRedTTS‑2 is a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody.

Highlights

  • Long Conversational Speech Generation: It currently supports 3 minutes dialogues with 4 speakers and can be easily scaled to longer conversations with more speakers by extending training corpus.
  • Multilingual Support: It supports multiple languages including English, Chinese, Japanese, Korean, French, German, and Russian. Support zero-shot voice cloning for cross-lingual and code-switching scenarios.
  • Ultra-Low Latency: Building on the new 12.5Hz streaming speech tokenizer, we employ a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-by-sentence generation and reducing first-packet latency,Specifically, on an L20 GPU, our first-packet latency as low as 140ms while maintaining high-quality audio output.
  • Strong Stability: Our model achieves high similarity and low WER/CER in both monologue and dialogue tests.
  • Random Timbre Generation: Useful for creating ASR/speech interaction data.

Sign up or log in to comment