code-generation-space / datasets /github_code.txt

loubnabnl HF Staff

	We also released the [GitHub code dataset](https://huggingface.co/datasets/lvwerra/github-code), 1TB of code from GitHub repositories in 32 programming languages. If you don't want to download the full dataset because of memory constraints, you can load it in streaming mode, which creates an iterable dataset:

	```python
	from datasets import load_dataset

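	# streaming=True returns an iterable dataset that yields samples lazily
	ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
	# Fetch and print the first sample
	print(next(iter(ds)))

	# Output: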
	{
	'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
	'repo_name': 'MirekSz/webpack-es6-ts',
	'path': 'app/mods/mod190.js',
	'language': 'JavaScript',
	'license': 'isc',
	'size': 73
	}

	```
	You can see that, in addition to the code, each sample includes metadata: the repository name (`repo_name`), file path (`path`), language, license, and file size (`size`).
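
Because each streamed sample is a plain dictionary, the metadata fields can be used to select samples on the fly. The following is a minimal sketch, not part of the dataset's own tooling: `filter_samples` and `take` are hypothetical helper names, and `demo_samples` is a small stand-in list (the first record is the sample shown above, the second is made up for illustration) so the example runs without downloading anything.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Sample = Dict[str, Any]


def filter_samples(samples: Iterable[Sample], language: str, max_size: int) -> Iterator[Sample]:
    """Lazily yield samples written in `language` whose `size` is at most `max_size`."""
    for sample in samples:
        if sample["language"] == language and sample["size"] <= max_size:
            yield sample


def take(samples: Iterable[Sample], n: int) -> List[Sample]:
    """Collect at most `n` samples, so a streamed dataset is never fully consumed."""
    return list(islice(samples, n))


# Stand-in for the streamed dataset (assumption: same fields as the sample above).
demo_samples: List[Sample] = [
    {
        "code": "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
        "repo_name": "MirekSz/webpack-es6-ts",
        "path": "app/mods/mod190.js",
        "language": "JavaScript",
        "license": "isc",
        "size": 73,
    },
    {
        # Hypothetical record, invented for this example only.
        "code": "print('hello')\n",
        "repo_name": "example-user/example-repo",
        "path": "hello.py",
        "language": "Python",
        "license": "mit",
        "size": 15,
    },
]

selected = take(filter_samples(demo_samples, "JavaScript", max_size=1000), 5)
print([sample["path"] for sample in selected])
```

With the real dataset, the same helpers should work by passing the streamed `ds` object in place of `demo_samples`, since it is iterated one sample at a time.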

	For model-specific information about the pretraining dataset, please select a model below: