[CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of pre-processed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data, so it was cleaned with the following steps:

- Exact match deduplication
- Filtering:
    - Average line length < 100 characters
    - Maximum line length < 1000 characters
    - Alphanumeric characters fraction > 0.25
    - Remove auto-generated files (keyword search)

For more details, see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
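
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing script: the function names, the MD5-based duplicate check, the keyword list, and the choice to scan only the first five lines are all assumptions made for this example, and line lengths are assumed to be measured in characters.

```python
import hashlib

# Thresholds from the list above; lengths are assumed to be in characters.
MAX_MEAN_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25
# Hypothetical keywords; the real script's list may differ.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def content_hash(text):
    """Hash the file content so exact duplicates map to the same key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def passes_filters(text):
    """Return True if a file satisfies every filtering heuristic."""
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= MAX_MEAN_LINE_LENGTH:
        return False
    if max(lengths) >= MAX_LINE_LENGTH:
        return False
    alnum_fraction = sum(c.isalnum() for c in text) / len(text)
    if alnum_fraction <= MIN_ALNUM_FRACTION:
        return False
    # Keyword search, restricted here to the first few lines (an assumption).
    header = "\n".join(lines[:5]).lower()
    if any(keyword in header for keyword in AUTOGEN_KEYWORDS):
        return False
    return True


def clean(files):
    """Drop exact duplicates, then keep only files that pass the filters."""
    seen = set()
    kept = []
    for text in files:
        digest = content_hash(text)
        if digest in seen:
            continue
        seen.add(digest)
        if passes_filters(text):
            kept.append(text)
    return kept
```

Calling `clean` on a list of file contents returns the subset that survives both steps, in the original order. Deduplicating before filtering means each distinct file is checked only once.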