Hub documentation
DuckDB
DuckDB
DuckDB is an in-process SQL OLAP database management system.
You can use the Hugging Face paths (hf://) to access data on the Hub:

The DuckDB CLI (Command Line Interface) is a single, dependency-free executable. There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their clients page.
For installation details, visit the installation page.
Starting from version v0.10.3, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub via URLs with the hf:// scheme. Here are some features you can leverage with this powerful tool:
- Query public datasets and your own gated and private datasets
- Analyze datasets and perform SQL operations
- Combine datasets and export it to different formats
- Conduct vector similarity search on embedding datasets
- Implement full-text search on datasets
For a complete list of DuckDB features, visit the DuckDB documentation.
To start the CLI, execute the following command in the installation folder:
./duckdb
Forging the Hugging Face URL
To access Hugging Face datasets, use the following URL format:
hf://datasets/{my-username}/{my-dataset}/{path_to_file} - my-username, the user or organization of the dataset, e.g.
ibm - my-dataset, the dataset name, e.g:
duorc - path_to_parquet_file, the parquet file path which supports glob patterns, e.g
**/*.parquet, to query all parquet files
You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the
refs/convert/parquetrevision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet.To reference the
refs/convert/parquetrevision of a dataset, use the following syntax:hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}Here is a sample URL following the above syntax:
hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet
Let’s start with a quick demo to query all the rows of a dataset:
FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;Or using traditional SQL syntax:
SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
Update on GitHub