# PreprocessReadMe.md

The `preprocess.py` script preprocesses gene expression data for use in our models. It reads counts from `data.raw.X` or `data.X`, applies the preprocessing steps described below, and prepares the result for training or inference.

# General Workflow
The script follows these main steps:
0. **Load Data and Metadata**: The script starts by loading the gene expression data from an AnnData file and metadata from a JSON file.
1. **Set Raw Layer**: It checks whether `data.raw.X` is set in the AnnData object. If not, it sets it from the integer counts in `data.X`.
2. **Initialize Processed Layer**: It initializes `data.layers['processed']` in the AnnData object, which is the layer that all subsequent preprocessing steps modify.
3. **Filter Genes by Reference ID**: It filters genes based on reference IDs if specified in the hyperparameters.
4. **Remove Assays**: It removes specified assays from the data.
5. **Filter Cells by Gene Counts**: It filters out cells with gene counts below a specified threshold.
6. **Filter Cells by Mitochondrial Fraction**: It removes cells with a high mitochondrial gene fraction.
7. **Filter Highly Variable Genes**: It filters genes to retain only highly variable ones using specified methods.
8. **Normalize Data**: It normalizes the data by scaling each row (cell level) so that its total expression equals a specified value.
9. **Scale Columns by Median**: It scales columns based on median values from a specified dictionary.
10. **Log Transform**: It applies a `log(1 + x)` (log1p) transformation to the data.
11. **Compute Medians**: It computes and saves medians of the processed data if specified.
12. **Update Metadata**: It updates the metadata with cell counts and processing arguments.
13. **Save and Cleanup**: It saves the processed data and metadata to disk and performs garbage collection.
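
The cell filtering, normalization, and log steps above can be sketched on a plain NumPy matrix. This is a minimal illustration, not the code in `preprocess.py`: the function names `filter_cells` and `normalize_and_log` are hypothetical, rows are assumed to be cells and columns genes (the AnnData convention), and "gene counts" is assumed to mean the total counts per cell.

```python
import numpy as np


def filter_cells(counts, is_mito, min_gene_counts, max_mitochondrial_prop):
    """Return a boolean mask of the cells (rows) to keep.

    Assumption: a cell's "gene counts" is its total count over all genes.
    """
    totals = counts.sum(axis=1)
    mito = counts[:, is_mito].sum(axis=1)
    # Cells with zero total counts get a mitochondrial fraction of 0.
    mito_frac = np.divide(mito, totals, out=np.zeros_like(mito), where=totals > 0)
    return (totals >= min_gene_counts) & (mito_frac <= max_mitochondrial_prop)


def normalize_and_log(counts, normalized_total, log1p=True):
    """Scale each cell to `normalized_total`, then optionally apply log(1 + x).

    Assumes every row has a non-zero total (true after filtering with
    min_gene_counts >= 1).
    """
    totals = counts.sum(axis=1, keepdims=True)
    scaled = counts / totals * normalized_total
    return np.log1p(scaled) if log1p else scaled


# Toy data: 3 cells x 4 genes, the last gene is mitochondrial.
counts = np.array(
    [
        [10, 0, 5, 5],   # total 20, mitochondrial fraction 0.25
        [1, 0, 0, 0],    # total 1, below the count threshold
        [2, 2, 2, 14],   # total 20, mitochondrial fraction 0.70
    ],
    dtype=float,
)
is_mito = np.array([False, False, False, True])

keep = filter_cells(counts, is_mito, min_gene_counts=5, max_mitochondrial_prop=0.5)
processed = normalize_and_log(counts[keep], normalized_total=10_000)
```

Only the first cell passes both filters here. Undoing the log with `np.expm1` recovers a row that sums to `normalized_total`, which is a quick sanity check for this kind of normalization.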


# Preprocessing Arguments
The script uses several preprocessing arguments to control its behavior. Each argument is explained below, along with the step it influences:

- `reference_id_only`
    - Description: Specifies whether to filter genes by reference ID.
    - Impact: If enabled, the script filters genes based on reference IDs.
- `remove_assays`
    - Description: List of assays to remove from the data.
    - Impact: The script removes specified assays from the data.
- `min_gene_counts`
    - Description: Minimum gene counts required for cells to be retained.
    - Impact: The script filters out cells with gene counts below this threshold.
- `max_mitochondrial_prop`
    - Description: Maximum mitochondrial gene fraction allowed for cells.
    - Impact: The script removes cells with a mitochondrial gene fraction above this threshold.
- `hvg_method`
    - Description: Method to use for filtering highly variable genes.
    - Impact: The script filters genes to retain only highly variable ones using the specified method.
- `normalized_total`
    - Description: Value to normalize the total gene expression to.
    - Impact: The script normalizes the data by scaling each row (cell level) so that its total expression equals this value.
- `median_dict`
    - Description: Path to a JSON file containing median values for scaling columns.
    - Impact: The script scales columns based on median values from the specified dictionary.
- `median_column`
    - Description: Column name to use for looking up median values.
    - Impact: The script uses this column to look up median values for scaling.
- `log1p`
    - Description: Indicates whether to apply a `log(1 + x)` (log1p) transformation to the data.
    - Impact: If enabled, the script applies a `log(1 + x)` transformation to the data.
- `compute_medians`
    - Description: Indicates whether to compute and save medians of the processed data.
    - Impact: If enabled, the script computes and saves medians of the processed data.
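
As an illustration of how these arguments fit together, the sketch below builds one possible set of values and lists the workflow steps it would switch on. Everything concrete here is an assumption: the values, the `medians.json` path, the `assay` column name, the helper `enabled_steps`, and the rule that `None`, `False`, or an empty list disables a step. This README does not specify how the script receives its arguments or which values `hvg_method` accepts.

```python
import json

# Hypothetical argument values, for illustration only.
EXAMPLE_ARGS = {
    "reference_id_only": True,
    "remove_assays": ["example_assay"],  # placeholder assay name
    "min_gene_counts": 200,
    "max_mitochondrial_prop": 0.2,
    "hvg_method": None,                  # accepted method names are not listed here
    "normalized_total": 10000,
    "median_dict": "medians.json",       # placeholder path
    "median_column": "assay",            # placeholder column name
    "log1p": True,
    "compute_medians": False,
}

# Argument -> workflow step it controls, following the list above.
STEP_FOR_ARG = {
    "reference_id_only": "Filter Genes by Reference ID",
    "remove_assays": "Remove Assays",
    "min_gene_counts": "Filter Cells by Gene Counts",
    "max_mitochondrial_prop": "Filter Cells by Mitochondrial Fraction",
    "hvg_method": "Filter Highly Variable Genes",
    "normalized_total": "Normalize Data",
    "median_dict": "Scale Columns by Median",
    "log1p": "Log Transform",
    "compute_medians": "Compute Medians",
}


def _is_set(value):
    # Assumption: None, False, or an empty list means "skip this step".
    return value is not None and value is not False and value != []


def enabled_steps(args):
    """Return the names of the steps that the given arguments would enable."""
    return [step for arg, step in STEP_FOR_ARG.items() if _is_set(args.get(arg))]


steps = enabled_steps(EXAMPLE_ARGS)
# The arguments survive a JSON round trip, so they can be stored alongside metadata.
round_tripped = json.loads(json.dumps(EXAMPLE_ARGS))
```

With these values, the highly variable gene filter and the median computation are skipped, and every other step runs. `median_column` has no step of its own because it only selects which entry of `median_dict` is used for scaling.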