code-generation-space / datasets /codegen.txt

loubnabnl HF Staff

	[CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) is a model for conversational program synthesis, where each problem is solved interactively in multiple steps, each consisting of a natural language specification from the user and a synthesized subprogram from the system.
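
The multi-step interaction described above can be sketched as a simple loop. This is a minimal illustration, not CodeGen's actual interface: `solve_in_turns` and `toy_generate` are hypothetical names, and `toy_generate` is a canned stand-in for a real model call. It also assumes each specification is passed to the model as a code comment.

```python
from typing import Callable, List


def solve_in_turns(specs: List[str], generate: Callable[[str], str]) -> str:
    """Build a program turn by turn from natural language specifications."""
    program = ""
    for spec in specs:
        # The user's specification is appended to the program so far as a comment.
        prompt = program + f"# {spec}\n"
        # The system continues the prompt with a synthesized subprogram.
        subprogram = generate(prompt)
        program = prompt + subprogram.rstrip("\n") + "\n"
    return program


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns canned code."""
    last_spec = prompt.rstrip("\n").splitlines()[-1]
    if "list" in last_spec:
        return "xs = [3, 1, 2]"
    return "xs = sorted(xs)"


if __name__ == "__main__":
    print(solve_in_turns(["Create a list of numbers", "Sort it"], toy_generate))
```

Each turn sees every earlier specification and subprogram, so later steps can refer to names defined in earlier ones.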

	It was sequentially trained on three datasets:
	- [The Pile](https://huggingface.co/datasets/the_pile)
	- A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
	- 217GB of Python data from GitHub repositories

	The second and third datasets used the following preprocessing:
	- Exact match deduplication
	- Filtering, keeping only files with:
	    - Average line length < 100 characters
	    - Maximum line length < 1000 characters
	    - At most 90% of characters being decimal or hexadecimal digits
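
The preprocessing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the datasets: the function names are hypothetical, exact matches are detected with a SHA-256 hash of the file contents, line lengths are measured in characters, and the digit fraction is computed over non-whitespace characters.

```python
import hashlib
from typing import Iterable, Iterator

HEX_DIGITS = set("0123456789abcdefABCDEF")


def passes_filters(
    text: str,
    max_avg_line_len: int = 100,
    max_line_len: int = 1000,
    max_digit_fraction: float = 0.9,
) -> bool:
    """Return True if a file passes the three filtering heuristics."""
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    # Average and maximum line length, measured in characters (assumption).
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False
    if max(lengths) >= max_line_len:
        return False
    # Fraction of decimal or hexadecimal digits among non-whitespace
    # characters (ignoring whitespace is an assumption).
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    digit_fraction = sum(c in HEX_DIGITS for c in chars) / len(chars)
    return digit_fraction <= max_digit_fraction


def preprocess(files: Iterable[str]) -> Iterator[str]:
    """Drop exact duplicates, then apply the filters."""
    seen = set()
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical file was already seen
        seen.add(digest)
        if passes_filters(text):
            yield text
```

Hashing keeps memory use proportional to the number of distinct files rather than their total size, which matters at the hundreds-of-gigabytes scale mentioned above.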

	Remark:
	The reported data sizes are after preprocessing.