# Tahoe-x1
Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles, including the Tahoe-100M perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than other cell-state models.
Quick Links:
- ✨ Blog Post - Read our announcement
- 📄 Preprint - Read our preprint on bioRxiv
- 💻 GitHub Repository - Access the code
- 🎮 Interactive Demo - Try the model with no code required!
## 🤗 Model Sizes
We provide pretrained weights for three model sizes:
- Tx1-70M: ~70M parameters
- Tx1-1B: ~1.3B parameters
- Tx1-3B: ~3B parameters
## 🚀 Quickstart
Load a model directly from Hugging Face and generate cell embeddings:
```python
from tahoe_x1.model import ComposerTX
import scanpy as sc

# Load model from Hugging Face in a single line
# Options: "70m", "1b", or "3b"
model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
    repo_id="tahoebio/Tahoe-x1",
    model_size="3b",
)

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Generate embeddings (see tutorials for a full example)
# Cell embeddings are stored in adata.obsm
```
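Once cell embeddings are in adata.obsm, the standard scanpy workflow applies directly. Below is a minimal sketch of the clustering-and-UMAP analysis covered in the Clustering Tutorial; the "X_tx1" key name is an assumption about where you stored the embeddings, not a key the package necessarily writes:

```python
# Minimal sketch, assuming cell embeddings were stored in adata.obsm
# under a key such as "X_tx1" (the key name is an assumption)
sc.pp.neighbors(adata, use_rep="X_tx1")  # kNN graph on Tahoe-x1 embeddings
sc.tl.umap(adata)                        # 2-D layout for visualization
sc.tl.leiden(adata)                      # graph clustering (needs leidenalg)
sc.pl.umap(adata, color="leiden")
```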
## 📦 Installation
To use the models, install the tahoex package from GitHub:
```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1
cd tahoe-x1

# Install using Docker (recommended) or uv
# See the installation guide: https://github.com/tahoebio/tahoe-x1#installation
```
Docker installation provides the best reproducibility and is recommended. For native installation, use uv or pip to manage dependencies.
Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using ComposerTX.from_hf(). Training data is hosted publicly on S3 (s3://tahoe-hackathon-data) and will be downloaded as needed.
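If you prefer to pre-fetch the checkpoint files yourself (for example, on a machine that will later run offline), the standard huggingface_hub client can do so. A minimal sketch; the allow_patterns value is a guess at the repository's per-size file layout, so inspect the repo's file list and adjust:

```python
from huggingface_hub import snapshot_download

# Download (or reuse from cache) files from the model repository.
# The "3b/*" pattern is a hypothetical per-size subfolder; drop
# allow_patterns to fetch everything.
local_dir = snapshot_download(
    repo_id="tahoebio/Tahoe-x1",
    allow_patterns=["3b/*"],
)
print(local_dir)  # local cache path containing the downloaded files
```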
## 📚 Tutorials
Please refer to the tutorials in the GitHub repository for detailed examples:
| Tutorial | Description | Link |
|---|---|---|
| Clustering Tutorial | Cell clustering and UMAP visualization with Tahoe-x1 embeddings | clustering_tutorial.ipynb |
| Training Tutorial | Step-by-step guide to training and fine-tuning Tahoe-x1 models | training_tutorial.ipynb |
## 🧬 Generating Cell and Gene Embeddings

### Using Configuration Files
1. Create a configuration file (see scripts/inference/configs/predict.yaml in the GitHub repo):

   ```yaml
   # Key configuration options:
   # - paths.hf_repo_id: Hugging Face repository (tahoebio/Tahoe-x1)
   # - paths.hf_model_size: model size (70m, 1b, or 3b)
   # - paths.adata_output: where to save the AnnData output, including embeddings
   # - predict.return_gene_embeddings: True (for extracting gene embeddings)
   ```

   See the Clustering Tutorial for a full example of preparing input data and configuration files.

2. Run the embedding script (a sketch for inspecting its output follows these steps):

   ```bash
   python scripts/inference/predict_embeddings.py path/to/config.yaml

   # Optional: override config values via the command line
   python scripts/inference/predict_embeddings.py path/to/config.yaml \
       --paths.model_name=tx --batch_size=128
   ```
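Once predict_embeddings.py finishes, the embeddings live in the AnnData file written to paths.adata_output. A minimal inspection sketch, assuming a hypothetical output filename; the exact .obsm/.varm key names depend on the configuration, so list them rather than hard-coding:

```python
import scanpy as sc

# Hypothetical output path; use whatever paths.adata_output was set to
adata = sc.read_h5ad("tx1_embeddings.h5ad")

# Cell embeddings land in adata.obsm and, when predict.return_gene_embeddings
# is enabled, gene embeddings in adata.varm; print the keys to find the names
print(list(adata.obsm.keys()))
print(list(adata.varm.keys()))
```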
### Advanced Usage
For memory-efficient gene embedding extraction over large datasets, use the lower-level API:
```python
from tahoex.tasks import get_batch_embeddings

# model, vocab, model_cfg, and collator_cfg come from ComposerTX.from_hf
# (see the Quickstart above); adata is your AnnData object
cell_embs, gene_embs = get_batch_embeddings(
    adata=adata,
    model=model,
    vocab=vocab,
    model_cfg=model_cfg,
    collator_cfg=collator_cfg,
    return_gene_embeddings=True,
)
```
Cell embeddings are saved to adata.obsm and gene embeddings to adata.varm (if return_gene_embeddings=True).
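As one illustration of what the gene embeddings enable (the MSigDB hallmark benchmark below recovers pathway membership from them), here is a minimal sketch that ranks genes by cosine similarity in embedding space. It continues from the snippet above; the nearest-neighbor query is an illustrative analysis, not the paper's benchmark protocol:

```python
import numpy as np

# L2-normalize so that dot products become cosine similarities
unit = gene_embs / np.clip(
    np.linalg.norm(gene_embs, axis=1, keepdims=True), 1e-12, None
)
sim = unit @ unit.T  # (n_genes, n_genes) cosine-similarity matrix

# Five nearest neighbors of the first gene (skipping the self-match)
query = 0
neighbors = np.argsort(-sim[query])[1:6]
print(adata.var_names[query], "->", list(adata.var_names[neighbors]))
```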
## 🏋️ Training and Fine-tuning
Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the GitHub repository for detailed instructions.
### Quick Training Example
```bash
# Use the test configuration to train a small model on Tahoe-100M
composer scripts/train.py -f configs/test_config.yaml

# Fine-tune from a pretrained checkpoint
composer scripts/train.py -f configs/finetune_config.yaml \
    --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
```
For more details on training infrastructure, datasets, and benchmarks, please visit the GitHub repository.
## 📊 Model Details

### Architecture
- Base Architecture: Transformer-based encoder pretrained with masked language modeling (MLM)
- Training Data: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
- Training Infrastructure: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision
### Benchmarks
Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:
- DepMap Essentiality: Predicting gene dependencies in cancer cell lines
- MSigDB Hallmarks: Recovering pathway memberships from gene embeddings
- Cell-Type Classification: Classifying cell types across multiple tissues
- Perturbation Prediction: Predicting transcriptional responses to perturbations
See the paper for detailed benchmark results.
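For intuition on how cell embeddings feed classification benchmarks like these, here is a minimal linear-probe sketch using scikit-learn; the obsm key and the cell_type column are assumptions about your AnnData object, and this is not the paper's exact evaluation protocol:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical key/column names; adjust to your AnnData object
X = adata.obsm["X_tx1"]          # Tahoe-x1 cell embeddings
y = adata.obs["cell_type"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear-probe accuracy: {accuracy_score(y_te, probe.predict(X_te)):.3f}")
```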
## 📝 Citation
If you use Tahoe-x1 in your research, please cite:
```bibtex
@article{gandhi2025tahoe,
  title     = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
  author    = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
  journal   = {bioRxiv},
  year      = {2025},
  doi       = {10.1101/2025.10.23.683759},
  url       = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
  publisher = {Cold Spring Harbor Laboratory}
}
```
## 📄 License
Model weights and code are released under the Apache 2.0 license.
## 📧 Contact
For questions or collaboration inquiries, please:
- Open an issue on GitHub or Hugging Face
- Email us at [email protected]