codelion's Collections

Pre-training Dataset Samples

updated 16 days ago

A collection of pre-training dataset samples in three sizes: 10M, 100M, and 1B tokens. Intended for quick experiments and ablations.

Upvote
5

  • codelion/dclm-baseline-1B

    Viewer • Updated Jul 7 • 774k • 99

  • codelion/fineweb-edu-1B

    Viewer • Updated Jul 7 • 970k • 59

  • sumuks/essential-web-v1.0-sample-1B

    Viewer • Updated Jul 3 • 1.83M • 21

  • codelion/finepdfs-1B

    Viewer • Updated Sep 8 • 186k • 113

  • codelion/dclm-baseline-100M

    Viewer • Updated Jul 6 • 77.2k • 24

  • codelion/fineweb-edu-100M

    Viewer • Updated Jul 6 • 115k • 30

  • sumuks/essential-web-v1.0-sample-100M

    Viewer • Updated Jul 3 • 183k • 36

  • codelion/finepdfs-100M

    Viewer • Updated Sep 8 • 18.6k • 1.55k • 1

  • codelion/dclm-baseline-10M

    Viewer • Updated Jul 6 • 7.95k • 7

  • codelion/fineweb-edu-10M

    Viewer • Updated Jul 6 • 9.46k • 13

  • sumuks/essential-web-v1.0-sample-10M

    Viewer • Updated Jul 3 • 18.3k • 20 • 1

  • codelion/finepdfs-10M

    Viewer • Updated Sep 8 • 7.54k • 15
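
The entries above follow a regular naming pattern (a corpus prefix plus a size suffix), so a small helper can pick a sample by corpus and token budget. The following is a minimal sketch, not part of the collection itself: the repo IDs are copied from the listing, while the `SAMPLES` mapping, the short corpus keys, the helper names `repo_id` and `load_sample`, and the `"train"` split name are assumptions made for illustration. Loading relies on `load_dataset` from the `datasets` library, imported lazily so the ID helper works without it.

```python
# Corpus prefixes taken from the collection listing; the short keys are our own labels.
SAMPLES = {
    "dclm-baseline": "codelion/dclm-baseline",
    "fineweb-edu": "codelion/fineweb-edu",
    "essential-web": "sumuks/essential-web-v1.0-sample",
    "finepdfs": "codelion/finepdfs",
}

# Token budgets offered in the collection.
SIZES = ("10M", "100M", "1B")


def repo_id(corpus: str, size: str) -> str:
    """Build the Hub repo ID for a corpus sample of the given size."""
    if corpus not in SAMPLES:
        raise ValueError(f"unknown corpus {corpus!r}; expected one of {sorted(SAMPLES)}")
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return f"{SAMPLES[corpus]}-{size}"


def load_sample(corpus: str, size: str, split: str = "train"):
    """Stream a sample from the Hub.

    Assumption: each repo exposes a split named "train"; check the dataset
    viewer if loading fails. Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # lazy import: only needed when loading

    return load_dataset(repo_id(corpus, size), split=split, streaming=True)


if __name__ == "__main__":
    # Print every repo ID in the collection, smallest samples first.
    for size in SIZES:
        for corpus in SAMPLES:
            print(repo_id(corpus, size))
```

Starting an ablation on a 10M sample and moving to the 100M or 1B variant then only requires changing the `size` argument. Column names differ between corpora, so inspect the first streamed record before writing a tokenization step.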