The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5 • 50
Lapa v0.1.2 Release Collection Release of SOTA Ukrainian LLM and Datasets • 19 items • Updated about 15 hours ago • 15
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Paper • 2005.11401 • Published May 22, 2020 • 13
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only Paper • 2306.01116 • Published Jun 1, 2023 • 41