ClinOracle
Contents
Code
- representations/ - scripts to train and generate transcriptome effect representations on Tahoe-100M
- clinical_data_curation/ - scripts to curate clinical trial data
- approval_prediction_benchmark.ipynb - benchmark on clinical approval prediction
- classifier.py - benchmarking classifier implementation
Data
- clinical_evidence_data/ - curated clinical evidence data on Tahoe drugs
- data_for_classifier/ - input data for benchmarks
- data/ - misc processed data
Team Members
- Emma Dann - Stanford University & Gladstone Institutes - [email protected]
- Tony Zeng - Stanford University - [email protected]
- Ross Giglio - Columbia University - [email protected]
- Kevin Hoffer-Hawlik - Columbia University - [email protected]
- Meer Mustafa - BigHat Biosciences - [email protected]
Project
Pharmacotranscriptomic representations to predict clinical trial success
Overview
Large in vitro perturbation screens like Tahoe-100M make it possible to assess whether transcriptional responses are predictive of measures of clinical success such as drug approval.
Motivation
Despite rigorous research efforts, clinical success and drug approval remain difficult to predict in early drug development.
Methods
Clinical trial information
We used LLMs to collect clinical trial and adverse effects data associated with the chemical agents screened in Tahoe-100M. We then annotated which drugs were tested or reached approval for a condition affecting one of the screened organs.
Transcriptome effects representations
- E-distance: overall transcriptional shift from DMSO for each drug in each cell line. We selected the dose with the maximum E-distance for each drug-cell line pair.
- LDVAE: VAE with a linear decoder for gene program interpretability (trained on plates 1-4, with embeddings generated for the full dataset)
- MrVI: sample-aware VAE representation. Using the pseudobulked Tahoe-100M data, we trained a MrVI model with each sample defined as cell_drug and the union of the highly variable genes identified within each cell line as features. We generated two latent embeddings, the 10-dimensional u-space and the 30-dimensional z-space, which were used as input to the classifier.
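As a rough illustration of the E-distance step, the sketch below computes an energy distance between treated and DMSO profiles and then keeps the dose with the largest value for each drug-cell line pair. It assumes plain Euclidean distance and the all-pairs estimator. The function names, the toy profiles, and the dictionary layout are hypothetical, and the scripts in representations/ may use a different distance or embedding space.

```python
import math
from itertools import product


def mean_pairwise_distance(xs, ys):
    """Mean Euclidean distance over all pairs (x, y), x from xs and y from ys."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def energy_distance(treated, control):
    """Twice the mean between-group distance minus both mean within-group distances.

    Within-group means run over all pairs, self-pairs included (an assumption
    of this sketch, not necessarily the estimator used in the repository).
    """
    return (
        2 * mean_pairwise_distance(treated, control)
        - mean_pairwise_distance(treated, treated)
        - mean_pairwise_distance(control, control)
    )


def select_max_dose(edist):
    """Map {(drug, cell_line, dose): e_distance} to {(drug, cell_line): (dose, e_distance)}."""
    best = {}
    for (drug, cell_line, dose), value in edist.items():
        pair = (drug, cell_line)
        if pair not in best or value > best[pair][1]:
            best[pair] = (dose, value)
    return best


# Toy example with made-up 2-D profiles and hypothetical drug / cell line names.
dmso = [(0.0, 0.0), (0.0, 1.0)]
low_dose = [(0.5, 0.0), (0.5, 1.0)]
high_dose = [(3.0, 0.0), (3.0, 1.0)]
edist = {
    ("drug_A", "line_1", 0.1): energy_distance(low_dose, dmso),
    ("drug_A", "line_1", 1.0): energy_distance(high_dose, dmso),
}
best_dose = select_max_dose(edist)
```

In practice the profiles would be cells or pseudobulk samples in a reduced space rather than 2-D tuples, and a vectorized implementation would be preferable at Tahoe-100M scale.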
Benchmark set-up
We use logistic regression on the transcriptome-effect representations to predict whether a drug was approved for a tissue of interest, splitting drugs into train and test sets and evaluating the precision-recall curve for the test drugs. We consider the rate of approvals per organ as a technical confounder to be accounted for.
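The evaluation logic can be sketched as follows. This is a minimal illustration, not the code in classifier.py: it shows a drug-level split, a per-organ approval rate computed on training drugs (one possible reading of the approval rate baseline), and average precision as a summary of the precision-recall curve. The classifier itself is left out. All function names, the row layout `(drug, organ, approved)`, and the toy data are assumptions.

```python
import random
from collections import defaultdict


def split_drugs(drugs, test_fraction=0.25, seed=0):
    """Assign whole drugs to train or test so that no drug appears in both."""
    unique = sorted(set(drugs))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def organ_approval_rates(rows):
    """rows: iterable of (drug, organ, approved) -> {organ: fraction approved}."""
    counts = defaultdict(lambda: [0, 0])
    for _, organ, approved in rows:
        counts[organ][0] += int(approved)
        counts[organ][1] += 1
    return {organ: pos / n for organ, (pos, n) in counts.items()}


def average_precision(labels, scores):
    """Mean of the precision values at each positive, ranking by descending score.

    Ties are broken by input order in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


# Toy rows with hypothetical drugs and organs: (drug, organ, approved).
rows = [
    ("d1", "lung", 1), ("d1", "skin", 0),
    ("d2", "lung", 0), ("d2", "skin", 0),
    ("d3", "lung", 1), ("d3", "skin", 1),
    ("d4", "lung", 1), ("d4", "skin", 0),
]
train_drugs, test_drugs = split_drugs([r[0] for r in rows])
train_rows = [r for r in rows if r[0] in train_drugs]
test_rows = [r for r in rows if r[0] in test_drugs]

# Baseline: score each test (drug, organ) pair by the organ's training approval rate.
rates = organ_approval_rates(train_rows)
baseline_scores = [rates.get(organ, 0.0) for _, organ, _ in test_rows]
baseline_ap = average_precision([r[2] for r in test_rows], baseline_scores)
```

A representation would be judged useful under this set-up if the average precision of the classifier scores on the test drugs exceeds `baseline_ap`.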
Results
None of the unsupervised multi-dimensional representations outperformed the approval rate baseline, while we found that e-distance is consistently negatively associated with approval for conditions affecting the target tissue.
Discussion and Future Work
With the concept established, we propose expanding the benchmark by testing additional representations of the data, including MrVI single-cell sample-sample distances, differential gene expression or program expression, and cell counts. The framework is set up to test additional and more advanced prediction targets, such as clinical trial phase success and adverse effect (AE) rate or severity.
