Help & Documentation
Quick Start Guide
Explore
Browse T-cell datasets using filters. Visualize gene expression patterns across conditions.
Predict
Modify gene expression levels and predict how cells would respond to perturbations.
Optimize
Define a target cell state and find optimal gene combinations to achieve it.
Understanding the Filters
| Filter | Description | Available Options |
|---|---|---|
| Condition / Perturbation | Condition label used for Explore-phase browsing. In the current release this is a coarse categorization (not a full experimental perturbation ontology yet). | control vaccination infection drug cytokine genetic stimulation diet_metabolic cancer autoimmune environment other |
| T-cell Type | The T-cell subpopulation based on marker gene expression and reference-based annotation. | CD4 (Helper T-cells), CD8 (Cytotoxic T-cells), Treg (Regulatory T-cells) |
| Donor Type | Health status of the donor from which cells were isolated. | healthy diseased unknown |
| Tissue | Anatomical location from which T-cells were collected. | blood lung liver brain skin + more |
Understanding the Visualizations
UMAP Plot
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that visualizes high-dimensional gene expression data in 2D.
- Each point represents a cell
- Proximity indicates similar expression profiles
- Clusters represent distinct cell states
- Gray points show all data; colored points show your filter selection
Expression Heatmap
The heatmap displays mean expression levels of highly variable genes (HVGs) across your selected datasets.
- Columns represent individual genes
- Color intensity indicates expression level (blue=low, red=high)
- Dot below indicates uncertainty/variance
- Hover over cells to see exact values
Color Scales
Expression Level: blue (low) to red (high)
Uncertainty (Variance): indicated by the dot below each heatmap column
Prediction Models
The Predict tab simulates gene knockout effects using statistical models trained on observational (control-only) T-cell data. This section explains how each method works.
How Prediction Works
- Baseline: Compute average expression from control samples matching your filters
- Knockout: Set selected gene(s) expression to zero
- Propagate: Use learned gene-gene relationships to predict downstream effects
- Compare: Show predicted vs baseline expression (Δ = delta)
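The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not INSITE's implementation: the gene names, the toy log-expression values, and the `propagate` hook are all hypothetical. With no hook supplied, the other genes stay at baseline, which corresponds to the Naive Knockout method.

```python
from statistics import mean

def predict_knockout(control_cells, genes, knocked_out, propagate=None):
    """Baseline -> knockout -> propagate -> compare, on toy data.

    control_cells: list of per-cell expression lists (one value per gene).
    propagate: optional function(baseline, predicted, ko_idx) -> predicted;
               a hypothetical hook for a model that spreads the effect.
    """
    # Baseline: average expression of each gene over the control cells.
    baseline = [mean(cell[j] for cell in control_cells) for j in range(len(genes))]
    # Knockout: clamp the selected genes to zero.
    ko_idx = [genes.index(g) for g in knocked_out]
    predicted = list(baseline)
    for j in ko_idx:
        predicted[j] = 0.0
    # Propagate: with no model supplied, other genes stay at baseline.
    if propagate is not None:
        predicted = propagate(baseline, predicted, ko_idx)
    # Compare: delta = predicted - baseline.
    delta = {g: predicted[j] - baseline[j] for j, g in enumerate(genes)}
    return baseline, predicted, delta

# Two made-up control cells measured on three genes.
genes = ["CD3E", "IL2RA", "FOXP3"]
cells = [[2.0, 1.0, 0.5], [2.4, 1.4, 0.7]]
baseline, predicted, delta = predict_knockout(cells, genes, ["FOXP3"])
```

Here only `FOXP3` receives a non-zero delta; a propagation model would change the other entries as well.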
Method 1: Naive Knockout
Plain English: Simply sets the knocked-out gene's expression to zero. Other genes remain unchanged. This is the baseline "null model" that assumes no downstream effects.
When to use: As a control/reference to see what "no propagation" looks like.
Mathematical Formulation:
$$\hat{x}_i = \begin{cases} 0 & \text{if gene } i \text{ is knocked out} \\ \mu_i & \text{otherwise} \end{cases}$$
Method 2: Gaussian Conditional
Plain English: Assumes gene expression follows a multivariate Gaussian distribution. When you "clamp" a gene to zero, the model computes the conditional expectation of all other genes given this constraint. Uses a precision matrix (inverse covariance) learned from control data.
When to use: Default choice. Captures linear correlations between genes.
Mathematical Formulation:
Given precision matrix $\mathbf{P} = \Sigma^{-1}$, partition into knocked-out ($S$) and remaining ($R$) genes:
$$\hat{\mathbf{x}}_R = \boldsymbol{\mu}_R - \mathbf{P}_{RR}^{-1} \mathbf{P}_{RS} (\mathbf{0} - \boldsymbol{\mu}_S)$$
This is the standard Gaussian conditional mean formula, written in terms of the precision matrix and evaluated at $\mathbf{x}_S = \mathbf{0}$.
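As a minimal NumPy sketch of this computation, the function below takes a mean vector and a precision matrix, clamps the knocked-out genes to zero, and returns the conditional mean of the remaining genes. The function name, the 3-gene toy values, and the cross-check are assumptions for illustration. The precision-form expression used is spelled out in the code comment and verified against the covariance form.

```python
import numpy as np

def gaussian_conditional_knockout(mu, P, ko_idx):
    """Conditional mean of the remaining genes when the genes in ko_idx are clamped to 0.

    mu: (n,) mean log-expression; P: (n, n) precision matrix (inverse covariance).
    """
    n = mu.shape[0]
    S = np.asarray(ko_idx)
    R = np.setdiff1d(np.arange(n), S)
    x_S = np.zeros(len(S))                      # knockout value
    # Precision-form conditional mean: mu_R - P_RR^{-1} P_RS (x_S - mu_S).
    rhs = P[np.ix_(R, S)] @ (x_S - mu[S])
    x_hat = mu.copy()
    x_hat[R] = mu[R] - np.linalg.solve(P[np.ix_(R, R)], rhs)
    x_hat[S] = x_S
    return x_hat

# Toy 3-gene example (values are made up).
mu = np.array([3.0, 2.0, 1.5])
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
P = np.linalg.inv(Sigma)
x_hat = gaussian_conditional_knockout(mu, P, ko_idx=[1])

# Cross-check against the covariance form: mu_R + Sigma_RS Sigma_SS^{-1} (x_S - mu_S).
R, S = [0, 2], [1]
expected = mu[R] + Sigma[np.ix_(R, S)] @ np.linalg.solve(Sigma[np.ix_(S, S)], -mu[S])
```

Using `np.linalg.solve` rather than an explicit inverse keeps the sketch close to how such systems are usually solved in practice.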
Method 3: Invariant Precision
Plain English: Same math as Gaussian Conditional, but the precision matrix is learned differently. Before computing correlations, each dataset (BioProject) is mean-centered to remove batch effects. This captures within-environment gene relationships that may be more stable across different experimental conditions.
When to use: When you suspect batch effects in training data, or want more robust predictions.
Mathematical Formulation:
For each environment $e$, center data:
$$\tilde{\mathbf{x}}^{(e)} = \mathbf{x}^{(e)} - \boldsymbol{\mu}^{(e)}$$
Compute precision from pooled centered data, with regularization strength $\lambda$:
$$\mathbf{P}_{\text{inv}} = \left(\tilde{\Sigma}_{\text{pooled}} + \lambda \mathbf{I}\right)^{-1}$$
The inference formula is the same as for Gaussian Conditional.
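The centering and pooling steps might look like the following NumPy sketch. The function name, the value of `lam`, the BioProject ids, and the simulated data are all assumptions. Dividing the pooled covariance by the total number of cells is a simplification chosen here, not something the formulas above specify.

```python
import numpy as np

def invariant_precision(X, env_labels, lam=0.1):
    """Precision matrix from per-environment mean-centered data.

    X: (cells, genes) log-expression; env_labels: (cells,) dataset id per cell;
    lam: ridge term added to the diagonal (arbitrary value for illustration).
    """
    X = np.asarray(X, dtype=float)
    env_labels = np.asarray(env_labels)
    X_centered = np.empty_like(X)
    for env in np.unique(env_labels):
        rows = env_labels == env
        # Subtract this environment's own mean from its cells.
        X_centered[rows] = X[rows] - X[rows].mean(axis=0)
    # Pooled covariance of the centered data (zero-mean within each environment).
    sigma_pooled = X_centered.T @ X_centered / X_centered.shape[0]
    return np.linalg.inv(sigma_pooled + lam * np.eye(X.shape[1]))

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
base[:, 1] += 0.8 * base[:, 0]                  # genes 0 and 1 co-vary within each dataset
labels = np.repeat(["PRJ_A", "PRJ_B"], 100)     # hypothetical BioProject ids
offsets = np.where(labels[:, None] == "PRJ_B", 5.0, 0.0)  # batch shift on every gene
P_inv = invariant_precision(base + offsets, labels)
```

Because each dataset is centered on its own mean, the constant batch shift applied to `PRJ_B` leaves `P_inv` unchanged.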
Method 4: Deep Denoising Autoencoder (DAE)
Plain English: A neural network trained to "denoise" corrupted gene expression. During training, random genes are masked to zero and the network learns to reconstruct them. For prediction, we mask the knocked-out genes and let the network predict their downstream effects.
When to use: May capture non-linear relationships. Best for exploratory analysis.
Architecture:
$\text{Input}(250) \rightarrow \text{Dense}(128, \text{GELU}) \rightarrow \text{Dense}(250)$
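To make the architecture concrete, here is a NumPy sketch of the forward pass together with the corruption and knockout-masking steps described for this method. The weights are random and untrained, the tanh approximation of GELU is used, and clamping the knocked-out genes in the output is an assumption. It shows the data flow only and does not reproduce INSITE's trained model.

```python
import numpy as np

N_GENES, HIDDEN = 250, 128
rng = np.random.default_rng(0)

def gelu(z):
    # Tanh approximation of GELU.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

# Untrained placeholder weights; a real model would learn these by minimizing the MSE loss.
W1 = rng.normal(scale=0.05, size=(N_GENES, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.05, size=(HIDDEN, N_GENES))
b2 = np.zeros(N_GENES)

def forward(x):
    """Input(250) -> Dense(128, GELU) -> Dense(250)."""
    return gelu(x @ W1 + b1) @ W2 + b2

def corrupt(x, p=0.4, sigma=0.07):
    """Training-time corruption: zero each gene with probability p, then add Gaussian noise."""
    keep = rng.random(x.shape) >= p
    return x * keep + rng.normal(scale=sigma, size=x.shape)

def dae_knockout(baseline, ko_idx):
    """Inference: mask the knocked-out genes and let the network reconstruct the profile."""
    x = baseline.copy()
    x[ko_idx] = 0.0
    x_hat = forward(x)
    x_hat[ko_idx] = 0.0      # assumption: keep the knocked-out genes clamped in the output
    return x_hat

baseline = rng.random(N_GENES)                  # made-up baseline profile
x_tilde = corrupt(baseline)
x_hat = dae_knockout(baseline, ko_idx=[3, 17])
loss = np.mean((forward(x_tilde) - baseline) ** 2)   # MSE(f(x_tilde), x)
```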
Training:
$$\tilde{\mathbf{x}} = \text{mask}(\mathbf{x}, p=0.4) + \mathcal{N}(0, \sigma^2), \quad \sigma = 0.07$$
$$\mathcal{L} = \text{MSE}(f(\tilde{\mathbf{x}}), \mathbf{x})$$
Gaussian Conditional Expectation
For a multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, partitioned into sets $S$ (knocked out) and $R$ (remaining):
Covariance form:
$$\mathbb{E}[\mathbf{X}_R \mid \mathbf{X}_S = \mathbf{x}_S] = \boldsymbol{\mu}_R + \Sigma_{RS}\Sigma_{SS}^{-1}(\mathbf{x}_S - \boldsymbol{\mu}_S)$$
Precision form (used in INSITE):
$$\mathbb{E}[\mathbf{X}_R \mid \mathbf{X}_S = \mathbf{x}_S] = \boldsymbol{\mu}_R - \mathbf{P}_{RR}^{-1}\mathbf{P}_{RS}(\mathbf{x}_S - \boldsymbol{\mu}_S)$$
where $\mathbf{P} = \Sigma^{-1}$ is the precision matrix. The precision form is computationally more stable when $\Sigma$ is near-singular.
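The two forms are linked by the block structure of $\mathbf{P}\Sigma = \mathbf{I}$. As a short worked step (standard linear algebra, not specific to INSITE), the off-diagonal $(R, S)$ block of that identity gives:

$$\begin{aligned} \mathbf{P}_{RR}\Sigma_{RS} + \mathbf{P}_{RS}\Sigma_{SS} &= \mathbf{0} \\ \Rightarrow\quad \Sigma_{RS}\Sigma_{SS}^{-1} &= -\mathbf{P}_{RR}^{-1}\mathbf{P}_{RS} \end{aligned}$$

Substituting this into the covariance form yields the precision form, in which the matrix inverse acts on the $RR$ block of $\mathbf{P}$.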
Log-space Transformation
Gene expression is modeled in log-space for better Gaussian approximation:
$$y = \log(1 + x)$$
Knockout in log-space: $y_{\text{KO}} = \log(1 + 0) = 0$
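Assuming the transform is $y = \log(1 + x)$, as the knockout expression implies, the mapping and its inverse can be sketched with the standard library. The function names and the example value are hypothetical.

```python
import math

def to_log_space(x):
    """y = log(1 + x); log1p stays accurate for small x."""
    return math.log1p(x)

def from_log_space(y):
    """Inverse transform: x = exp(y) - 1."""
    return math.expm1(y)

y_ko = to_log_space(0.0)        # a knocked-out gene maps to exactly 0
y = to_log_space(3.5)           # normalized expression of 3.5 (made-up value)
x_back = from_log_space(y)      # recovers 3.5 up to rounding
```

Because zero expression maps to zero in log-space, clamping a gene to 0 means the same thing on both scales.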
Numerical Stability
To apply the inverse of a precision block to a vector $\mathbf{b}$, as the precision form requires, we use a regularized linear solve instead of explicit inversion. This prevents numerical instability when that block is ill-conditioned.
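One common way to do this is to add a small ridge term to the diagonal and call a linear solver. The sketch below assumes that approach. The function name and the value of `eps` are hypothetical, since the regularization INSITE actually uses is not stated here.

```python
import numpy as np

def regularized_solve(A, b, eps=1e-6):
    """Solve (A + eps * I) z = b without forming the inverse of A.

    eps is a hypothetical ridge value chosen for illustration.
    """
    A = np.asarray(A, dtype=float)
    return np.linalg.solve(A + eps * np.eye(A.shape[0]), b)

# Nearly singular 2x2 block: the two rows are almost identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-12]])
b = np.array([1.0, 1.0])
z = regularized_solve(A, b)
```

The ridge term bounds the condition number of the system, so the solution stays finite and well-behaved even though `A` itself is close to singular.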
Understanding the Output
| Output | Description |
|---|---|
| $\Delta$ (Delta) | Predicted expression minus baseline: $\Delta = \hat{x} - \mu$. Positive = upregulated, Negative = downregulated. |
| Top Upregulated | Genes with largest positive $\Delta$. These are predicted to increase when KO genes are knocked out. |
| Top Downregulated | Genes with largest negative $\Delta$. These are predicted to decrease (in addition to the KO gene itself). |
| Method Comparison | Side-by-side view of $\Delta$ values from all 4 methods. Agreement across methods suggests a more robust prediction. |
| CSV Export | Download full predictions for all 250 genes across all methods for downstream analysis. |
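As an illustration of how these outputs relate, the sketch below computes $\Delta$ from made-up baseline and predicted values, ranks the genes, and writes a CSV. The expression values and the CSV column names are assumptions, not INSITE's actual export schema.

```python
import csv
import io

def top_genes(delta, k=2):
    """Return the k most upregulated and k most downregulated genes by delta."""
    ranked = sorted(delta.items(), key=lambda item: item[1], reverse=True)
    up = [(g, d) for g, d in ranked[:k] if d > 0]
    down = [(g, d) for g, d in ranked[::-1][:k] if d < 0]
    return up, down

# Made-up baseline and predicted log-expression for a FOXP3 knockout.
baseline = {"FOXP3": 0.9, "IL2RA": 1.4, "CTLA4": 1.1, "IFNG": 0.3}
predicted = {"FOXP3": 0.0, "IL2RA": 1.0, "CTLA4": 0.9, "IFNG": 0.5}
delta = {g: predicted[g] - baseline[g] for g in baseline}
up, down = top_genes(delta)

# CSV export with hypothetical column names.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["gene", "baseline", "predicted", "delta"])
for g in baseline:
    writer.writerow([g, baseline[g], predicted[g], delta[g]])
csv_text = buffer.getvalue()
```

Note that the knocked-out gene itself appears at the top of the downregulated list, which matches the remark in the table above.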
About the Data
Data Source
INSITE uses T-cell single-cell RNA sequencing data from the scBaseCount repository, a comprehensive database of uniformly processed scRNA-seq data from public repositories (SRA, GEO).
Processing Pipeline
- Cell Selection: T-cells identified using SingleR reference-based annotation and marker gene scoring (CD3D, CD3E, CD4, CD8A, FOXP3)
- Subtype Classification: CD4, CD8, and Treg subtypes assigned based on dominant marker expression
- Quality Control: Cells filtered by UMI count, gene count, and mitochondrial percentage
- Normalization: Log-normalized expression values per 10,000 counts
- HVG Selection: Top 2,000 highly variable genes selected for visualization
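The quality-control and normalization steps can be sketched as follows, assuming the common convention of scaling each cell to 10,000 total counts before applying $\log(1 + x)$. The QC thresholds, function names, and counts are placeholders. The actual cutoffs used by the pipeline are not given here.

```python
import math

def passes_qc(total_umis, n_genes, pct_mito,
              min_umis=500, min_genes=200, max_pct_mito=10.0):
    """QC filter; all three thresholds are placeholder values."""
    return total_umis >= min_umis and n_genes >= min_genes and pct_mito <= max_pct_mito

def log_normalize(counts, scale=10_000):
    """Scale a cell's counts to `scale` total counts, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * scale) for c in counts]

# Toy cell: raw UMI counts for four genes (made-up numbers).
counts = [0, 5, 20, 975]
norm = log_normalize(counts)
```

After this step, genes with zero counts remain at zero and cells with different sequencing depths become comparable.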
Perturbation Categories
For this Explore-phase release, INSITE clusters unstructured scBaseCount perturbation text into coarse categories for fast browsing:
- control: Baseline/unperturbed samples
- vaccination: Vaccine-induced responses
- infection: Pathogen exposure (viral, bacterial)
- drug: Pharmacological treatments
- cytokine: Cytokine treatments
- genetic: CRISPR, RNAi, engineering, CAR/TCR
- stimulation: CD3/CD28, PMA/ionomycin, activation
- diet_metabolic: Diet, glucose, metabolic context
- cancer: Oncology-related contexts
- autoimmune: Autoimmune disease contexts
- environment: Hypoxia/oxygen and stress
- other: Other contexts not captured above
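How the free-text labels are grouped is not detailed here. As one simple stand-in, the sketch below uses plain keyword matching rather than whatever clustering INSITE applies. The keyword lists and the first-match ordering are assumptions loosely based on the category descriptions above.

```python
# Keyword lists are illustrative assumptions, checked in the order listed.
CATEGORY_KEYWORDS = {
    "vaccination": ["vaccin"],
    "infection": ["infect", "virus", "viral", "bacteri"],
    "genetic": ["crispr", "rnai", "knockout", "car-t", "tcr"],
    "stimulation": ["cd3/cd28", "pma", "ionomycin", "activat"],
    "cytokine": ["il-2", "ifn", "tgf", "cytokine"],
    "drug": ["drug", "inhibitor", "treated with"],
    "diet_metabolic": ["diet", "glucose", "metabolic"],
    "cancer": ["tumor", "cancer"],
    "autoimmune": ["autoimmun"],
    "environment": ["hypoxia", "oxygen", "stress"],
    "control": ["control", "untreated", "unperturbed", "baseline"],
}

def categorize(text):
    """Return the first category whose keyword appears in the free-text label."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return category
    return "other"

labels = ["CRISPR knockout of PDCD1", "anti-CD3/CD28 beads",
          "Untreated PBMC", "Sorted by flow cytometry"]
categories = [categorize(label) for label in labels]
```

Checking `control` last means a label such as "vaccinated vs control" is assigned to the perturbation rather than to the baseline. That ordering is a design choice of this sketch.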