The reasoning datasets that defined 2025. Part 1 of Datasets Wrapped 2025. #DatasetsWrapped2025
Daniel van Strien PRO
AI & ML interests
Machine Learning Librarian
Recent Activity
updated a dataset about 3 hours ago: data-is-better-together/fineweb-c-progress
updated a dataset about 15 hours ago: librarian-bots/dataset-columns
updated a dataset 2 days ago: librarian-bots/arxiv-metadata-snapshot
hub-tldr
Creating a smol model for tl;dr-ing the hub
- davanstrien/Smol-Hub-tldr
  Text Generation • 0.4B • Updated • 34 • 11
- Semantic Hugging Face Hub Search
  Running • 84
  Find datasets and models using semantic search
- davanstrien/hub-tldr-dataset-summaries-llama
  Viewer • Updated • 5k • 92 • 1
- davanstrien/hub-tldr-model-summaries-llama
  Viewer • Updated • 5k • 83 • 1
synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
- Genstruct 7B
  Runtime error • 8
- Instruction Synthesizer
  Runtime error • Featured • 86
  Generate instruction-response pairs from text
- Magpie
  Running on Zero • Featured • 72
  Generate and rate instruction-response pairs
- Bonito
  Runtime error • 11
  Generate task-specific instructions and responses from text
Synthetic (text) Dataset Generation
Papers about synthetic dataset generation
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
  Paper • 2404.14361 • Published • 2
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
  Paper • 2403.04190 • Published • 1
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 31
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
  Paper • 2404.14445 • Published
Historic language modeling
This collection contains models, datasets and spaces related to historic language models, i.e. language models trained on historic data
- dbmdz/bert-base-finnish-europeana-cased
  Fill-Mask • 0.1B • Updated • 28
- dbmdz/bert-base-historic-english-cased
  Fill-Mask • 0.1B • Updated • 34 • 1
- Livingwithmachines/erwt-year
  Fill-Mask • Updated • 61
- dbmdz/bert-base-historic-dutch-cased
  Fill-Mask • 0.1B • Updated • 155 • 2
Image Preference Optimization Datasets
Datasets suitable for Image Preference Optimization based on their column names
Reasoning Required?
- davanstrien/reasoning-required
  Viewer • Updated • 5k • 277 • 19
- davanstrien/ModernBERT-based-Reasoning-Required
  Text Classification • 0.1B • Updated • 40 • 10
- davanstrien/fineweb-with-reasoning-scores-and-topics
  Viewer • Updated • 10k • 70 • 1
- davanstrien/fine-reasoning-questions
  Viewer • Updated • 244 • 127 • 19
Maths reasoning
Maths reasoning datasets found using https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search
- Semantic Hugging Face Hub Search
  Running • 84
  Find datasets and models using semantic search
- open-r1/OpenR1-Math-220k
  Viewer • Updated • 450k • 12.9k • 687
- simplescaling/s1K-1.1
  Viewer • Updated • 1k • 3.47k • 141
- MU-NLPC/Calc-ape210k
  Viewer • Updated • 404k • 1.85k • 25
sentence-transformers-from-synthetic-data
Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model
- bigcode/self-oss-instruct-sc2-exec-filter-50k
  Viewer • Updated • 50.7k • 336 • 104
- davanstrien/similarity-dataset-sc2-8b
  Viewer • Updated • 2.32k • 150 • 6
- davanstrien/code-prompt-similarity-model
  Sentence Similarity • 0.1B • Updated • 16 • 6
- davanstrien/abstract-wiki
  Viewer • Updated • 5k • 70 • 2
haiku
A collection of synthetic datasets built to help open language models write better haikus through the use of DPO
Probably DPO datasets
A collection of datasets that probably support DPO
- HuggingFaceH4/ultrafeedback_binarized
  Viewer • Updated • 187k • 9.56k • 316
- mlabonne/orpo-dpo-mix-40k
  Viewer • Updated • 44.2k • 539 • 297
- argilla/OpenHermesPreferences
  Viewer • Updated • 989k • 2.08k • 210
- argilla/distilabel-capybara-dpo-7k-binarized
  Viewer • Updated • 7.56k • 2.49k • 182
query-to-hub-datasets-viewer-project
Datasets Wrapped 2025: Reasoning
The reasoning datasets that defined 2025. Part 1 of Datasets Wrapped 2025. #DatasetsWrapped2025