Tahoe-x1

Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles, including the Tahoe-100M perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than other cell-state models.

🤖 Model Sizes

We provide pretrained weights for three model sizes:

  • Tx1-70M: ~70M parameters
  • Tx1-1B: ~1.3B parameters
  • Tx1-3B: ~3B parameters

🚀 Quickstart

Load a model directly from Hugging Face and generate cell embeddings:

from tahoe_x1.model import ComposerTX
import scanpy as sc

# Load model from Hugging Face in a single line
# Options: "70m", "1b", or "3b"
model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
    repo_id="tahoebio/Tahoe-x1",
    model_size="3b"
)

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Generate embeddings (see tutorials for full example)
# Cell embeddings are stored in adata.obsm
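
A minimal sketch of that missing step, using the get_batch_embeddings helper documented under Advanced Usage below (the adata.obsm key name is illustrative, not fixed by the API):

from tahoex.tasks import get_batch_embeddings

# Compute cell and gene embeddings in batches
cell_embs, gene_embs = get_batch_embeddings(
    adata=adata,
    model=model,
    vocab=vocab,
    model_cfg=model_cfg,
    collator_cfg=collator_cfg,
    return_gene_embeddings=True,
)
adata.obsm["X_tx1"] = cell_embs  # illustrative key name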

📦 Installation

To use the models, install the tahoex package from GitHub:

# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1
cd tahoe-x1

# Install using Docker (recommended) or uv
# See installation guide: https://github.com/tahoebio/tahoe-x1#installation

Docker provides the most reproducible environment and is the recommended installation route. For a native installation, use uv or pip for dependency management.
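
As a minimal sketch of a native setup, assuming the repository declares its dependencies in a standard pyproject.toml (see the linked installation guide for the authoritative steps):

# Create and activate a virtual environment, then install in editable mode
uv venv
source .venv/bin/activate
uv pip install -e .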

Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using ComposerTX.from_hf(). Training data is hosted publicly on S3 (s3://tahoe-hackathon-data) and will be downloaded as needed.
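
Since the bucket is public, the training data can also be browsed directly; a sketch using the AWS CLI, assuming anonymous read access is enabled:

# List the public training-data bucket without AWS credentials
aws s3 ls s3://tahoe-hackathon-data/ --no-sign-request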

📚 Tutorials

Please refer to the tutorials in the GitHub repository for detailed examples:

  • Clustering Tutorial (clustering_tutorial.ipynb): Cell clustering and UMAP visualization with Tahoe-x1 embeddings
  • Training Tutorial (training_tutorial.ipynb): Step-by-step guide to training and fine-tuning Tahoe-x1 models

🧬 Generating Cell and Gene Embeddings

Using Configuration Files

  1. Create a configuration file (see scripts/inference/configs/predict.yaml in the GitHub repo):
# Key configuration options:
#   - paths.hf_repo_id: Hugging Face repository (tahoebio/Tahoe-x1)
#   - paths.hf_model_size: model size (70m, 1b, or 3b)
#   - paths.adata_output: where to save AnnData output including embeddings
#   - predict.return_gene_embeddings: True (for extracting gene embeddings)

See the Clustering Tutorial for a full example of preparing input data and configuration files; a minimal configuration sketch appears after step 2.

  2. Run the embedding script:

python scripts/inference/predict_embeddings.py path/to/config.yaml

# Optional: override config values via command line
python scripts/inference/predict_embeddings.py path/to/config.yaml \
    --paths.model_name=tx --batch_size=128
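
As referenced in step 1, a minimal sketch of a predict.yaml. Only the four keys listed above are confirmed; adata_input is a hypothetical field for the input file, so check the repository's predict.yaml for the actual schema:

# Illustrative config; fields beyond the documented keys are assumptions
paths:
  hf_repo_id: tahoebio/Tahoe-x1
  hf_model_size: 3b
  adata_input: your_data.h5ad          # hypothetical input field
  adata_output: outputs/embeddings.h5ad
predict:
  return_gene_embeddings: true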

Advanced Usage

For memory-efficient gene embedding extraction over large datasets, use the lower-level API:

from tahoex.tasks import get_batch_embeddings

cell_embs, gene_embs = get_batch_embeddings(
    adata=adata,
    model=model,
    vocab=vocab,
    model_cfg=model_cfg,
    collator_cfg=collator_cfg,
    return_gene_embeddings=True
)

Cell embeddings are saved to adata.obsm and gene embeddings to adata.varm (if return_gene_embeddings=True).
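
With cell embeddings in hand, downstream analysis follows the standard scanpy workflow used in the Clustering Tutorial; a minimal sketch, where the obsm key name is our choice rather than one fixed by the pipeline:

import scanpy as sc

adata.obsm["X_tx1"] = cell_embs          # store under an explicit key
sc.pp.neighbors(adata, use_rep="X_tx1")  # kNN graph on Tahoe-x1 embeddings
sc.tl.leiden(adata)                      # Leiden clustering (requires leidenalg)
sc.tl.umap(adata)                        # 2-D embedding for visualization
sc.pl.umap(adata, color="leiden")        # color cells by cluster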

๐Ÿ‹๏ธ Training and Fine-tuning

Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the GitHub repository for detailed instructions.

Quick Training Example

# Use the test configuration to train a small model on Tahoe-100M
composer scripts/train.py -f configs/test_config.yaml

# Fine-tune from a pretrained checkpoint
composer scripts/train.py -f configs/finetune_config.yaml \
    --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
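
Multi-GPU runs go through the Composer launcher; a sketch assuming a single 8-GPU node (-n sets the local process count):

# Launch the same run across 8 local GPUs
composer -n 8 scripts/train.py -f configs/test_config.yaml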

For more details on training infrastructure, datasets, and benchmarks, please visit the GitHub repository.

📊 Model Details

Architecture

  • Base Architecture: Transformer-based encoder pretrained with masked language modeling (MLM; a generic sketch follows this list)
  • Training Data: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
  • Training Infrastructure: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision
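
As an illustrative sketch of the MLM objective over tokenized gene sequences (the 15% mask rate, [MASK] token, and -100 ignore index are generic MLM conventions, not confirmed Tahoe-x1 settings):

import torch

def mlm_corrupt(gene_tokens: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    # Choose a random subset of token positions to mask.
    masked = torch.rand(gene_tokens.shape) < mask_prob
    # Labels: loss is computed only at masked positions; -100 is the
    # ignore index for torch.nn.functional.cross_entropy.
    labels = gene_tokens.clone()
    labels[~masked] = -100
    # Inputs: overwrite masked positions with the [MASK] token id.
    corrupted = gene_tokens.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels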

Benchmarks

Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:

  • DepMap Essentiality: Predicting gene dependencies in cancer cell lines
  • MSigDB Hallmarks: Recovering pathway memberships from gene embeddings
  • Cell-Type Classification: Classifying cell types across multiple tissues
  • Perturbation Prediction: Predicting transcriptional responses to perturbations

See the paper for detailed benchmark results.

📄 Citation

If you use Tahoe-x1 in your research, please cite:

@article{gandhi2025tahoe,
  title        = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
  author       = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
  journal      = {bioRxiv},
  year         = {2025},
  doi          = {10.1101/2025.10.23.683759},
  url          = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
  publisher    = {Cold Spring Harbor Laboratory}
}

📜 License

Model weights and code are released under the Apache 2.0 license.

📧 Contact

For questions or collaboration inquiries, please reach out via the GitHub repository (https://github.com/tahoebio/tahoe-x1).
