[](https://huggingface.co/spaces/amaai-lab/text2midi)
</div>

**text2midi** is the first end-to-end model for generating MIDI files from textual descriptions. By leveraging pretrained large language models and a powerful autoregressive transformer decoder, **text2midi** allows users to create symbolic music that aligns with detailed textual prompts, including musical attributes like chords, tempo, and style. The details of the model are described in [this paper](https://arxiv.org/abs/2412.16526).
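
As a rough mental model (an illustrative sketch, not the official text2midi code), the setup can be pictured as caption embeddings from a pretrained text encoder conditioning an autoregressive transformer decoder over symbolic-music tokens; all names, dimensions, and the tokenization below are hypothetical:

```python
# Illustrative sketch only -- NOT the official text2midi architecture.
# Assumes caption embeddings from some pretrained text encoder and MIDI
# represented as a sequence of discrete event tokens.
import torch
import torch.nn as nn

class CaptionConditionedDecoder(nn.Module):
    def __init__(self, text_dim=768, vocab_size=30000, d_model=512,
                 n_heads=8, n_layers=6, max_len=1024):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)  # caption states -> decoder space
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # next-token logits

    def forward(self, caption_states, midi_tokens):
        # caption_states: (B, T_text, text_dim) from the frozen text encoder
        # midi_tokens:    (B, T_midi) ids of MIDI event tokens generated so far
        memory = self.text_proj(caption_states)
        pos = torch.arange(midi_tokens.size(1), device=midi_tokens.device)
        x = self.tok_emb(midi_tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(midi_tokens.size(1))
        h = self.decoder(x, memory, tgt_mask=mask.to(midi_tokens.device))
        return self.lm_head(h)  # (B, T_midi, vocab_size)
```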
🔥 Live demo available on [HuggingFace Spaces](https://huggingface.co/spaces/amaai-lab/text2midi).

## Datasets

The model was trained on two datasets: [SymphonyNet](https://symphonynet.github.io/) for semi-supervised pretraining and MidiCaps for finetuning on MIDI generation from captions.

The [MidiCaps dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps) is a large-scale dataset of 168k MIDI files paired with rich text captions. These captions contain musical attributes such as key, tempo, style, and mood, making the dataset ideal for text-to-MIDI generation tasks, as described in [this paper](https://arxiv.org/abs/2406.02255).
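
For a quick look at the captions, the dataset can be pulled straight from the Hugging Face Hub with the `datasets` library. A minimal sketch; the split name and the `caption` field are assumptions, so check the dataset card for the actual schema:

```python
# Hypothetical peek at MidiCaps via the Hugging Face `datasets` library.
# The "train" split and "caption" field names are assumptions -- see the
# dataset card at https://huggingface.co/datasets/amaai-lab/MidiCaps.
from datasets import load_dataset

midicaps = load_dataset("amaai-lab/MidiCaps", split="train")
print(len(midicaps))           # expect on the order of 168k examples
print(midicaps[0]["caption"])  # one rich text caption (assumed field name)
```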
## Inference

We support inference on CUDA, MPS, and CPU. Please make sure you have pip-installed the correct requirements file (requirements.txt for CUDA, requirements-mac.txt for MPS):

```bash
python model/transformer_model.py --caption "<your intended description>"
```
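
If you script around the model yourself, a common way to honor all three backends is to probe them in order of preference. This is a generic PyTorch idiom, not necessarily how `transformer_model.py` picks its device:

```python
import torch

# Generic backend probe: prefer CUDA, then Apple-silicon MPS, then CPU.
# Mirrors the supported backends above; not copied from text2midi itself.
def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())  # e.g. device(type='cuda')
```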
## Citation

If you use text2midi in your research, please cite:

```
@inproceedings{bhandari2025text2midi,
  title={text2midi: Generating Symbolic Music from Captions},
  author={Keshav Bhandari and Abhinaba Roy and Kyra Wang and Geeta Puri and Simon Colton and Dorien Herremans},
  booktitle={Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)},
  year={2025}
}
```
## Results of the Listening Study