update
datasets/codegen.txt CHANGED (+1 -1)
@@ -11,7 +11,7 @@ The second and third datasets used the following preprocessing:
 - Exact match deduplication
 - Average line length < 100 tokens
 - Maximum line length < 1000 MB
--
+- Characters being decimal or hexadecimal digits >90%
 
 **Remark**:
 The reported data sizes are after preprocessing.
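
The preprocessing steps listed in the changed file can be sketched roughly as follows. This is a minimal illustration and not the pipeline actually used for the dataset. The function names `keep_file` and `preprocess` are hypothetical, and several details are assumptions: line lengths are measured in characters (the file states the limits in tokens and MB), each listed threshold is treated as a filter that drops files exceeding it, whitespace is ignored when computing the digit fraction, and exact-match deduplication is done by hashing the full file contents.

```python
import hashlib
import string

# Characters counted as decimal or hexadecimal digits: 0-9, a-f, A-F.
HEX_DIGITS = set(string.hexdigits)


def keep_file(text, max_avg_line_len=100, max_line_len=1000, max_digit_frac=0.9):
    """Return True if a file passes the three filters (assumed semantics)."""
    lines = text.splitlines()
    if not lines:
        return False  # assumption: empty files are dropped

    lengths = [len(line) for line in lines]

    # Filter 1: average line length must stay below the threshold.
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False

    # Filter 2: the longest line must stay below the threshold.
    if max(lengths) >= max_line_len:
        return False

    # Filter 3: drop files that are mostly decimal/hexadecimal digits.
    # Assumption: whitespace is excluded from the character count.
    chars = [c for c in text if not c.isspace()]
    if chars and sum(c in HEX_DIGITS for c in chars) / len(chars) > max_digit_frac:
        return False

    return True


def preprocess(files):
    """Exact-match deduplicate, then apply the filters, preserving order."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a file already seen
        seen.add(digest)
        if keep_file(text):
            kept.append(text)
    return kept
```

For example, given a small source file, an exact copy of it, a hex dump, and a file consisting of one very long line, `preprocess` would keep only the first file under these assumptions.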