update description
- datasets/incoder.txt +3 -3
datasets/incoder.txt
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
[InCoder](https://huggingface.co/facebook/incoder-6B) was trained on trained on 216 GB of data from Github and Stackoverflow from 28 programming languages. 52 GB
|
| 2 |
|
| 3 |
The Github data used the following filtering:
|
| 4 |
- Average line length < 100
|
|
@@ -6,10 +6,10 @@ The Github data used the following filtering:
|
|
| 6 |
- Alphanumeric characters fraction > 0.4
|
| 7 |
- Remove auto-generated files (keyword search)
|
| 8 |
|
| 9 |
-
The second
|
| 10 |
- all questions that have at least one answer
|
| 11 |
- up to ten answers with a non-negative score (sorted by score) per question
|
| 12 |
- up to five comments per question/answer
|
| 13 |
-
Exact match deduplication was performed
|
| 14 |
|
| 15 |
For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
|
|
|
|
| 1 |
+
[InCoder](https://huggingface.co/facebook/incoder-6B) was trained on **216 GB** of data from Github and Stackoverflow from 28 programming languages. 52 GB is in Python, 107 GB is in other programming languages, and 57 GB is content from Stackoverflow that isn't code.
|
| 2 |
|
| 3 |
The Github data used the following filtering:
|
| 4 |
- Average line length < 100
|
|
|
|
| 6 |
- Alphanumeric characters fraction > 0.4
|
| 7 |
- Remove auto-generated files (keyword search)
|
| 8 |
|
| 9 |
+
The second component of the data consists of questions, answers, and comments from StackOverflow. It includes:
|
| 10 |
- all questions that have at least one answer
|
| 11 |
- up to ten answers with a non-negative score (sorted by score) per question
|
| 12 |
- up to five comments per question/answer
|
| 13 |
+
Exact match deduplication was performed on code files.
|
| 14 |
|
| 15 |
For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
|
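The filtering and selection rules listed in the updated description can be sketched in Python as below. This is a rough illustration, not the InCoder authors' pipeline: the function names, the keyword list used to detect auto-generated files, the use of SHA-256 for exact matching, and the descending sort order for answers are all assumptions. Only the rules visible in this diff are covered; the file line hidden between the two hunks is not reproduced.

```python
import hashlib

# Thresholds quoted in the description above.
MAX_AVG_LINE_LENGTH = 100
MIN_ALNUM_FRACTION = 0.4
# Hypothetical keyword list: the description only says "keyword search".
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def keep_file(text: str) -> bool:
    """Return True if a source file passes the filters visible in the diff."""
    lines = text.splitlines()
    if not lines:
        return False
    # Average line length must be below 100 characters.
    avg_len = sum(len(line) for line in lines) / len(lines)
    if avg_len >= MAX_AVG_LINE_LENGTH:
        return False
    # Fraction of alphanumeric characters must exceed 0.4.
    alnum_fraction = sum(ch.isalnum() for ch in text) / len(text)
    if alnum_fraction <= MIN_ALNUM_FRACTION:
        return False
    # Drop files that look auto-generated (assumed keyword match).
    lowered = text.lower()
    return not any(keyword in lowered for keyword in AUTOGEN_KEYWORDS)


def dedup_exact(files):
    """Keep the first copy of each byte-identical file (hashing is an assumption)."""
    seen = set()
    unique = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def select_answers(answers, max_answers=10):
    """Keep up to ten answers with a non-negative score, highest score first (assumed order)."""
    eligible = [a for a in answers if a["score"] >= 0]
    eligible.sort(key=lambda a: a["score"], reverse=True)
    return eligible[:max_answers]


def select_comments(comments, max_comments=5):
    """Keep up to five comments per question or answer (taking the first five is an assumption)."""
    return comments[:max_comments]
```

A question would be kept only if it has at least one answer, after which `select_answers` and `select_comments` trim its answers and comments. The sizes quoted in the description are consistent with each other: 52 GB + 107 GB + 57 GB = 216 GB.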