Update the order of code and add pipeline usage! (#22)
- Update the order of code and add pipeline usage! (578a8302e3794675594bb02d971245bb7cf5690e)
Co-authored-by: Vaibhav Srivastav <[email protected]>
README.md
CHANGED
|
@@ -47,44 +47,35 @@ Extensive evaluations show the superiority of the proposed SpeechT5 framework on
|
|
| 47 |
|
| 48 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
| 49 |
|
| 50 |
-
##
|
| 51 |
-
|
| 52 |
-
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
| 53 |
-
|
| 54 |
-
You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
|
| 55 |
-
|
| 56 |
-
## Downstream Use [optional]
|
| 57 |
-
|
| 58 |
-
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
|
| 59 |
-
|
| 60 |
-
[More Information Needed]
|
| 61 |
-
|
| 62 |
-
## Out-of-Scope Use
|
| 63 |
-
|
| 64 |
-
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
|
| 65 |
-
|
| 66 |
-
[More Information Needed]
|
| 67 |
|
| 68 |
-
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
-
|
| 79 |
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
-
|
| 84 |
|
| 85 |
```python
|
| 86 |
# Following pip packages need to be installed:
|
| 87 |
-
# !pip install
|
| 88 |
|
| 89 |
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
| 90 |
from datasets import load_dataset
|
|
@@ -111,6 +102,37 @@ sf.write("speech.wav", speech.numpy(), samplerate=16000)
|
|
| 111 |
|
| 112 |
Refer to [this Colab notebook](https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ) for an example of how to fine-tune SpeechT5 for TTS on a different dataset or a new language.
|
| 113 |
|
| 114 |
# Training Details
|
| 115 |
|
| 116 |
## Training Data
|
|
|
|
| 47 |
|
| 48 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
| 49 |
|
| 50 |
+
## How to Get Started With the Model
|
| 51 |
|
| 52 |
+
You can access the SpeechT5 model via the `text-to-speech` pipeline in just a couple of lines of code!
|
| 53 |
|
| 54 |
+
```python
|
| 55 |
+
# Following pip packages need to be installed:
|
| 56 |
+
# !pip install transformers sentencepiece datasets soundfile torch
|
| 57 |
|
| 58 |
+
from transformers import pipeline
|
| 59 |
+
from datasets import load_dataset
|
| 60 |
+
import soundfile as sf
import torch  # required for torch.tensor(...) below
|
| 61 |
|
| 62 |
+
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
|
| 63 |
|
| 64 |
+
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
|
| 65 |
+
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
|
| 66 |
+
# You can replace this embedding with your own as well.
|
| 67 |
|
| 68 |
+
speech = synthesiser("Hello what is happening", forward_params={"speaker_embeddings": speaker_embeddings})
|
| 69 |
|
| 70 |
+
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
|
| 71 |
|
| 72 |
+
```
|
| 73 |
|
| 74 |
+
For more fine-grained control, you can use the processor and the model's generation code directly to convert text into a mono 16 kHz speech waveform.
|
| 75 |
|
| 76 |
```python
|
| 77 |
# Following pip packages need to be installed:
|
| 78 |
+
# !pip install transformers sentencepiece datasets
|
| 79 |
|
| 80 |
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
| 81 |
from datasets import load_dataset
|
|
|
|
| 102 |
|
| 103 |
Refer to [this Colab notebook](https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ) for an example of how to fine-tune SpeechT5 for TTS on a different dataset or a new language.
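
The pipeline snippet above saves its result with `soundfile`, reading the waveform from `speech["audio"]` and the rate from `speech["sampling_rate"]`. If `soundfile` is not available, a rough alternative is to write the samples with Python's standard `wave` module. The sketch below is only an illustration: `write_wav_16bit` is a hypothetical helper, the sine tone is a stand-in for real model output, and it assumes the audio is a 1-D sequence of floats in the range [-1, 1].

```python
import math
import struct
import wave


def write_wav_16bit(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of transformers.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # clamp out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for the pipeline output (assumption): one second of a 440 Hz tone
# at 16 kHz, shaped like the dict used in the snippet above.
sampling_rate = 16000
speech = {
    "audio": [
        0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate)
        for n in range(sampling_rate)
    ],
    "sampling_rate": sampling_rate,
}

write_wav_16bit("speech_pcm16.wav", speech["audio"], speech["sampling_rate"])
```

With real pipeline output, the same call should work as long as the audio is flattened to one dimension first; converting to 16-bit PCM loses a little precision compared with writing floats through `soundfile`.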
|
| 104 |
|
| 105 |
+
|
| 106 |
+
## Direct Use
|
| 107 |
+
|
| 108 |
+
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
| 109 |
+
|
| 110 |
+
You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
|
| 111 |
+
|
| 112 |
+
## Downstream Use [optional]
|
| 113 |
+
|
| 114 |
+
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
|
| 115 |
+
|
| 116 |
+
[More Information Needed]
|
| 117 |
+
|
| 118 |
+
## Out-of-Scope Use
|
| 119 |
+
|
| 120 |
+
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
|
| 121 |
+
|
| 122 |
+
[More Information Needed]
|
| 123 |
+
|
| 124 |
+
# Bias, Risks, and Limitations
|
| 125 |
+
|
| 126 |
+
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
| 127 |
+
|
| 128 |
+
[More Information Needed]
|
| 129 |
+
|
| 130 |
+
## Recommendations
|
| 131 |
+
|
| 132 |
+
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
|
| 133 |
+
|
| 134 |
+
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
|
| 135 |
+
|
| 136 |
# Training Details
|
| 137 |
|
| 138 |
## Training Data
|