FineData

Team

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).

Recent Activity

guipenedo new activity 1 day ago

HuggingFaceFW/finewiki:Filtered Cebuano?

hynky new activity 4 days ago

HuggingFaceFW/finepdfs:OCR or not classifier

hynky new activity 4 days ago

HuggingFaceFW/finepdfs:A Few Questions About the Implementation Details of the finepdfs Project

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

guipenedo

in HuggingFaceFW/finewiki 1 day ago

Filtered Cebuano?

#3 opened 4 days ago by

hynky

in HuggingFaceFW/finepdfs 4 days ago

OCR or not classifier

#6 opened about 2 months ago by

A Few Questions About the Implementation Details of the finepdfs Project

#24 opened 11 days ago by

guipenedo

in HuggingFaceFW/finewiki 4 days ago

docs: fix typo

#2 opened 5 days ago by

guipenedo

updated a dataset 5 days ago

HuggingFaceFW/clean-wikipedia

Viewer • Updated 5 days ago • 61.2M • 1.95k • 22

guipenedo

updated a Space 5 days ago

README

guipenedo

published a dataset 5 days ago

HuggingFaceFW/finewiki

Viewer • Updated 4 days ago • 61.6M • 3.74k • 110

guipenedo

published a Space 5 days ago

FineWiki Viewer

Viewer to explore the finewiki dataset

hynky

in HuggingFaceFW/finepdfs_lang_classification 5 days ago

datatrove

#2 opened 5 days ago by

hynky

published a dataset 5 days ago

HuggingFaceFW/finepdfs_lang_classification_tmp

Updated 5 days ago • 4

hynky

in HuggingFaceFW/finepdfs 6 days ago

Deciding on extraction path

#10 opened about 1 month ago by

Were the original PDFs saved?

#2 opened about 2 months ago by

Docling output

#4 opened about 2 months ago by

Can additional corpuses further train this model?

#13 opened about 1 month ago by

hynky

updated a dataset 6 days ago

HuggingFaceFW/ocr-annotations

Viewer • Updated 6 days ago • 1.62k • 133 • 11

guipenedo

updated a collection 6 days ago

📄 FinePDFs

78 items • Updated 6 days ago • 11

hynky

in HuggingFaceFW/finepdfs 6 days ago

The "file_path" data field appears to primarily contain cc-index paths rather than WARC paths.

#16 opened about 1 month ago by

adfadvab

#15 opened about 1 month ago by

GitHub Link or XGBoost Model

#22 opened 26 days ago by