# MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer

AudioCraft provides the code and models for MAGNeT, [Masked Audio Generation using a Single Non-Autoregressive Transformer][arxiv].

MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions.
It is a masked generative non-autoregressive Transformer trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
Unlike prior work on masked generative audio Transformers, such as [SoundStorm](https://arxiv.org/abs/2305.09636) and [VampNet](https://arxiv.org/abs/2307.04686),
MAGNeT doesn't require semantic token conditioning, model cascading or audio prompting, and performs full text-to-audio generation with a single non-autoregressive Transformer.

Check out our [sample page][magnet_samples] or test the available demo!

We use 16K hours of licensed music to train MAGNeT. Specifically, we rely on an internal dataset
of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
## Model Card

See [the model card](../model_cards/MAGNET_MODEL_CARD.md).

## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).
## Usage

We currently offer two ways to interact with MAGNeT:
1. You can use the gradio demo locally by running [`python -m demos.magnet_app --share`](../demos/magnet_app.py).
2. You can play with MAGNeT by running the jupyter notebook at [`demos/magnet_demo.ipynb`](../demos/magnet_demo.ipynb) locally (if you have a GPU).
## API

We provide a simple API and 6 pre-trained models. The pre-trained models are:
- `facebook/magnet-small-10secs`: 300M model, text to music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-10secs)
- `facebook/magnet-medium-10secs`: 1.5B model, text to music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-10secs)
- `facebook/magnet-small-30secs`: 300M model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-30secs)
- `facebook/magnet-medium-30secs`: 1.5B model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-30secs)
- `facebook/audio-magnet-small`: 300M model, text to sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-small)
- `facebook/audio-magnet-medium`: 1.5B model, text to sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-medium)

In order to use MAGNeT locally **you must have a GPU**. We recommend 16GB of memory, especially for
the medium-sized models.
See below for a quick example of using the API.
```python
import torchaudio
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')
descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
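Generation can also be tuned before calling `generate`. The sketch below continues the example above and assumes the `set_generation_params` API defined in `audiocraft/models/magnet.py`; the argument names and values shown are illustrative, so check that file for the authoritative signature and defaults.

```python
# Illustrative values only -- see audiocraft/models/magnet.py for the
# authoritative signature of set_generation_params.
model.set_generation_params(
    use_sampling=True,       # sample from the token distribution instead of argmax
    top_p=0.9,               # nucleus sampling threshold
    temperature=3.0,         # rescales the token distribution before sampling
    decoding_steps=[20, 10, 10, 10],  # iterative decoding steps, one entry per codebook level
)
wav = model.generate(['warm lo-fi piano loop'])
```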
## 🤗 Transformers Usage

Coming soon...
## Training

The [MagnetSolver](../audiocraft/solvers/magnet.py) implements MAGNeT's training pipeline.
It defines a masked generation task over multiple streams of discrete tokens
extracted from a pre-trained EnCodec model (see the [EnCodec documentation](./ENCODEC.md)
for more details on how to train such a model).
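For intuition, here is a toy, self-contained sketch of the cosine masking schedule that underlies this kind of masked generation task. It is not the solver code: the actual implementation masks contiguous spans of tokens and operates over all codebook streams.

```python
import math

import torch

def cosine_mask_rate(step: int, total_steps: int) -> float:
    # MaskGIT-style schedule: the masking rate starts near 1 and decays to 0.
    return math.cos(math.pi * step / (2 * total_steps))

# Toy stream of 50 Hz tokens for a 10-second sample (500 positions, 1024-way codebook).
tokens = torch.randint(0, 1024, (1, 500))
rate = cosine_mask_rate(step=2, total_steps=20)
mask = torch.rand(tokens.shape) < rate          # True marks positions to re-predict
masked_tokens = tokens.masked_fill(mask, 1024)  # 1024 stands in for a special mask id
```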
Note that **we do NOT provide any of the datasets** used for training MAGNeT.
We provide a dummy dataset containing just a few examples for illustrative purposes.

Please read the [TRAINING documentation](./TRAINING.md) first, in particular the Environment Setup section.
### Example configurations and grids

We provide configurations to reproduce the released models and our research.
MAGNeT solver configurations are available in [config/solver/magnet](../config/solver/magnet),
in particular:
* MAGNeT model for text-to-music:
[`solver=magnet/magnet_32khz`](../config/solver/magnet/magnet_32khz.yaml)
* MAGNeT model for text-to-sound:
[`solver=magnet/audio_magnet_16khz`](../config/solver/magnet/audio_magnet_16khz.yaml)

We provide 3 model scales: `model/lm/model_scale=small` (300M), `medium` (1.5B), and `large` (3.3B).

Please find some example grids to train MAGNeT at
[audiocraft/grids/magnet](../audiocraft/grids/magnet/).
```shell
# text-to-music
dora grid magnet.magnet_32khz --dry_run --init
# text-to-sound
dora grid magnet.audio_magnet_16khz --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is set up.
```
### Dataset and metadata

Learn more in the [datasets section](./DATASETS.md).

#### Music Models

MAGNeT's underlying dataset is an AudioDataset augmented with music-specific metadata.
The MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files.
#### Sound Models

Audio-MAGNeT's underlying dataset is an AudioDataset augmented with description metadata.
The Audio-MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files or through a specified external folder.
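For illustration only, a sidecar metadata file for `track.wav` could be produced as follows. The exact keys expected by the dataset implementation are documented in the [datasets section](./DATASETS.md), so treat the field names below as assumptions.

```python
import json

# Hypothetical sidecar metadata for track.wav, saved as track.json next to it.
# The exact field set is defined by the dataset implementation (see DATASETS.md).
meta = {
    "description": "Funky disco groove with a four-on-the-floor beat",
    "title": "Disco Jam",
    "genre": "disco",
    "duration": 10.0,
    "sample_rate": 32000,
}
with open("track.json", "w") as f:
    json.dump(meta, f, indent=2)
```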
### Audio tokenizers

See [MusicGen](./MUSICGEN.md).

### Fine-tuning existing models

You can initialize your model to one of the pretrained models by using the `continue_from` argument, in particular:
```bash
# Using a pretrained MAGNeT model.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//pretrained/facebook/magnet-medium-10secs conditioner=text2music

# Using another model you already trained with a Dora signature SIG.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//sig/SIG conditioner=text2music

# Or provide a path manually.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=/checkpoints/my_other_xp/checkpoint.th
```
**Warning:** You are responsible for selecting the other parameters in a way that makes them compatible
with the model you are fine-tuning. Configuration is NOT automatically inherited from the model you continue from. In particular, make sure to select the proper `conditioner` and `model/lm/model_scale`.

**Warning:** We currently do not support fine-tuning a model with slightly different layers. If you decide
to change some parts, like the conditioning or some other parts of the model, you are responsible for manually crafting a checkpoint file from which we can safely run `load_state_dict`.
If you decide to do so, make sure your checkpoint is saved with `torch.save` and contains a dict
`{'best_state': {'model': model_state_dict_here}}`. Directly give the path to `continue_from` without a `//pretrained/` prefix.
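As a minimal sketch of that layout (the `nn.Linear` below is just a stand-in for your manually modified model):

```python
import torch
from torch import nn

# Stand-in for the model whose weights you crafted by hand.
model = nn.Linear(4, 4)
# Wrap the state dict in the layout that `continue_from` expects.
torch.save({'best_state': {'model': model.state_dict()}}, 'my_finetune_init.th')
```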
### Evaluation stage

For the 6 pretrained MAGNeT models, objective metrics can be reproduced using the following grids:
```shell
# text-to-music
REGEN=1 dora grid magnet.magnet_pretrained_32khz_eval --dry_run --init
# text-to-sound
REGEN=1 dora grid magnet.audio_magnet_pretrained_16khz_eval --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is set up.
```

See [MusicGen](./MUSICGEN.md) for more details.
### Generation stage

See [MusicGen](./MUSICGEN.md).

### Playing with the model

Once you have launched some experiments, you can easily get access
to the Solver with the latest trained model using the following snippet.

```python
from audiocraft.solvers.magnet import MagnetSolver

solver = MagnetSolver.get_eval_solver_from_sig('SIG', device='cpu', batch_size=8)
solver.model
solver.dataloaders
```
### Importing / Exporting models

We currently do not support loading a model from the Hugging Face implementation or exporting to it.
If you want to export your model in a way that is compatible with the `audiocraft.models.MAGNeT`
API, you can run:

```python
from audiocraft.utils import export
from audiocraft import train

xp = train.main.get_xp_from_sig('SIG_OF_LM')
export.export_lm(xp.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/state_dict.bin')

# You also need to bundle the EnCodec model you used!
## Case 1) you trained your own
xp_encodec = train.main.get_xp_from_sig('SIG_OF_ENCODEC')
export.export_encodec(xp_encodec.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/compression_state_dict.bin')
## Case 2) you used a pretrained model. Give the name you used without the //pretrained/ prefix.
## This will not dump the actual model, simply a pointer to the right model to download.
export.export_pretrained_compression_model('facebook/encodec_32khz', '/checkpoints/my_audio_lm/compression_state_dict.bin')
```
Now you can load your custom model with:

```python
import audiocraft.models

magnet = audiocraft.models.MAGNeT.get_pretrained('/checkpoints/my_audio_lm/')
```

### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).
## FAQ

#### What are top-k, top-p, temperature and classifier-free guidance?

Check out [@FurkanGozukara's tutorial](https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Tutorials/AI-Music-Generation-Audiocraft-Tutorial.md#more-info-about-top-k-top-p-temperature-and-classifier-free-guidance-from-chatgpt).
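For a quick intuition, here is a generic sketch of top-k sampling with temperature (not MAGNeT's actual decoding code): temperature rescales the logits before the softmax, and top-k restricts sampling to the k most likely tokens.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 250, temperature: float = 1.0) -> torch.Tensor:
    # Temperature rescales the logits: <1 sharpens the distribution, >1 flattens it.
    scaled = logits / temperature
    # Top-k keeps only the k most likely tokens; everything else gets probability 0.
    topk_values, topk_indices = scaled.topk(k, dim=-1)
    probs = torch.softmax(topk_values, dim=-1)
    # Sample among the survivors, then map back to vocabulary indices.
    choice = torch.multinomial(probs, num_samples=1)
    return topk_indices.gather(-1, choice)

# Toy usage: a batch of 2 logit vectors over a 1024-token vocabulary.
next_tokens = sample_top_k(torch.randn(2, 1024), k=250, temperature=0.9)
```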
#### Should I use FSDP or autocast?

The two are mutually exclusive (because FSDP does autocast on its own).
You can use autocast up to 1.5B (medium), if you have enough RAM on your GPU.
FSDP makes everything more complex but will free up some memory for the actual
activations by sharding the optimizer state.
## Citation

```
@misc{ziv2024masked,
      title={Masked Audio Generation using a Single Non-Autoregressive Transformer},
      author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
      year={2024},
      eprint={2401.04577},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```
## License

See license information in the [model card](../model_cards/MAGNET_MODEL_CARD.md).

[arxiv]: https://arxiv.org/abs/2401.04577
[magnet_samples]: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/