# PreprocessReadMe.md

The `preprocess.py` script preprocesses gene expression data for use in our models. It takes `data.raw.X` or `data.X` as input, applies a series of preprocessing steps, and prepares the data for training or inference.

# General Workflow

The script follows these main steps:

0. **Load Data and Metadata**: The script starts by loading the gene expression data from an AnnData file and the metadata from a JSON file.
1. **Set Raw Layer**: It checks whether `data.raw.X` is set in the AnnData object. If not, it sets it from the integer counts in `data.X`.
2. **Initialize Processed Layer**: It initializes `data.layers['processed']` in the AnnData object, which is the layer that all later preprocessing steps modify.
3. **Filter Genes by Reference ID**: It filters genes based on reference IDs if specified in the hyperparameters.
4. **Remove Assays**: It removes the specified assays from the data.
5. **Filter Cells by Gene Counts**: It filters out cells with gene counts below a specified threshold.
6. **Filter Cells by Mitochondrial Fraction**: It removes cells whose mitochondrial gene fraction exceeds a specified threshold.
7. **Filter Highly Variable Genes**: It filters genes to retain only highly variable ones, using the specified method.
8. **Normalize Data**: It applies row-wise (per-cell) normalization and scaling to the data.
9. **Scale Columns by Median**: It scales columns (genes) by median values taken from a specified dictionary.
10. **Log Transform**: It applies a log(1 + x) transformation to the data.
11. **Compute Medians**: It computes and saves medians of the processed data if specified.
12. **Update Metadata**: It updates the metadata with cell counts and processing arguments.
13. **Save and Cleanup**: It saves the processed data and metadata to disk and performs garbage collection.

# Preprocessing Arguments

The script uses several preprocessing arguments to control its behavior.
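
As a rough illustration, the arguments described in this section could be collected in a single dictionary like the one sketched here. The variable name `preprocess_args`, the idea of passing the arguments as a dictionary, and every value shown are hypothetical placeholders, not defaults taken from `preprocess.py`.

```python
# Hypothetical example only: the argument names come from this README,
# but every value is a placeholder, not a default used by preprocess.py.
preprocess_args = {
    "reference_id_only": True,          # filter genes by reference ID
    "remove_assays": ["example_assay"],  # placeholder assay name
    "min_gene_counts": 1000,            # cells below this are dropped
    "max_mitochondrial_prop": 0.1,      # cells above this are dropped
    "hvg_method": None,                 # valid method names are defined by the script
    "normalized_total": 10000,          # target total expression per cell
    "median_dict": "medians.json",      # placeholder path to a JSON file
    "median_column": "example_column",  # placeholder lookup column
    "log1p": True,                      # apply log(1 + x)
    "compute_medians": False,           # skip computing new medians
}

# Basic sanity checks on the placeholder values.
assert 0.0 <= preprocess_args["max_mitochondrial_prop"] <= 1.0
assert preprocess_args["min_gene_counts"] >= 0
```
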
Here is an explanation of each argument and the step it influences:

- `reference_id_only`
  - Description: Specifies whether to filter genes by reference ID.
  - Impact: If enabled, the script filters genes based on reference IDs.
- `remove_assays`
  - Description: List of assays to remove from the data.
  - Impact: The script removes the specified assays from the data.
- `min_gene_counts`
  - Description: Minimum gene counts required for a cell to be retained.
  - Impact: The script filters out cells with gene counts below this threshold.
- `max_mitochondrial_prop`
  - Description: Maximum mitochondrial gene fraction allowed for a cell.
  - Impact: The script removes cells with a mitochondrial gene fraction above this threshold.
- `hvg_method`
  - Description: Method to use for filtering highly variable genes.
  - Impact: The script filters genes to retain only highly variable ones using the specified method.
- `normalized_total`
  - Description: Value to normalize the total gene expression to.
  - Impact: The script applies row-wise (per-cell) normalization and scaling so that total expression matches this value.
- `median_dict`
  - Description: Path to a JSON file containing median values for scaling columns.
  - Impact: The script scales columns based on median values from the specified dictionary.
- `median_column`
  - Description: Column name to use for looking up median values.
  - Impact: The script uses this column to look up median values for scaling.
- `log1p`
  - Description: Indicates whether to apply a log(1 + x) transformation to the data.
  - Impact: If enabled, the script applies a log(1 + x) transformation to the data.
- `compute_medians`
  - Description: Indicates whether to compute and save medians of the processed data.
  - Impact: If enabled, the script computes and saves medians of the processed data.
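
To make the numeric steps more concrete, here is a minimal NumPy sketch of how the cell filters, normalization, median scaling, and log transform could fit together on a small dense matrix. It is not the code in `preprocess.py`: the function `preprocess_counts`, the `is_mito` and `gene_medians` inputs, and the default values are all hypothetical. It also assumes that rows are cells and columns are genes, that "gene counts" means the number of genes with nonzero counts in a cell, and that median scaling divides each gene column by its median.

```python
import numpy as np


def preprocess_counts(counts, is_mito, min_gene_counts=2,
                      max_mitochondrial_prop=0.5, normalized_total=10.0,
                      gene_medians=None, log1p=True):
    """Hypothetical sketch of the numeric steps; rows are cells, columns are genes."""
    counts = np.asarray(counts, dtype=float)
    is_mito = np.asarray(is_mito, dtype=bool)

    # Filter cells by gene counts (assumed: number of genes with nonzero counts).
    genes_detected = (counts > 0).sum(axis=1)
    keep = genes_detected >= min_gene_counts

    # Filter cells by mitochondrial fraction (mitochondrial counts / total counts).
    totals = counts.sum(axis=1)
    mito_frac = np.divide(counts[:, is_mito].sum(axis=1), totals,
                          out=np.zeros_like(totals), where=totals > 0)
    keep &= mito_frac <= max_mitochondrial_prop

    processed = counts[keep]

    # Normalize each cell (row) so its total equals normalized_total.
    row_sums = processed.sum(axis=1, keepdims=True)
    processed = np.divide(processed * normalized_total, row_sums,
                          out=np.zeros_like(processed), where=row_sums > 0)

    # Scale each gene (column) by its median, if medians are supplied.
    if gene_medians is not None:
        processed = processed / np.asarray(gene_medians, dtype=float)

    # Apply log(1 + x).
    if log1p:
        processed = np.log1p(processed)

    return processed, keep


# Toy data: 4 cells x 4 genes; the last gene is treated as mitochondrial.
counts = [[4, 0, 6, 0],
          [1, 0, 0, 0],
          [0, 1, 0, 9],
          [5, 5, 0, 0]]
is_mito = [False, False, False, True]

processed, keep = preprocess_counts(counts, is_mito)
print(keep)             # which cells survived both filters
print(processed.shape)  # retained cells x genes
```

In this toy input the second cell has only one detected gene and the third has a mitochondrial fraction of 0.9, so both are removed by the two filters. The real script works on an AnnData object and writes to the processed layer, which this sketch leaves out.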