Help - INSITE

Quick Start Guide

1. Explore

Browse T-cell datasets using filters. Visualize gene expression patterns across conditions.

2. Predict

Modify gene expression levels and predict how cells would respond to perturbations.

3. Optimize

Define a target cell state and find optimal gene combinations to achieve it.

Understanding the Filters

Filter	Description	Available Options
Condition / Perturbation	Condition label used for Explore-phase browsing. In the current release this is a coarse categorization (not a full experimental perturbation ontology yet).	control, vaccination, infection, drug, cytokine, genetic, stimulation, diet_metabolic, cancer, autoimmune, environment, other
T-cell Type	The T-cell subpopulation based on marker gene expression and reference-based annotation.	CD4 (Helper T-cells), CD8 (Cytotoxic T-cells), Treg (Regulatory T-cells)
Donor Type	Health status of the donor from which cells were isolated.	healthy, diseased, unknown
Tissue	Anatomical location from which T-cells were collected.	blood, lung, liver, brain, skin, and more

Smart Filtering: Options that would result in zero matches are automatically grayed out based on your current selections.
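
The smart-filtering rule can be illustrated with a small sketch. This is not INSITE's implementation: the records, the field names (`condition`, `cell_type`, `donor`, `tissue`) and the helper `enabled_options` are hypothetical and only show one way an option could be checked for zero matches.

```python
# Hypothetical dataset records; field names and values are illustrative only.
DATASETS = [
    {"condition": "control", "cell_type": "CD4", "donor": "healthy", "tissue": "blood"},
    {"condition": "infection", "cell_type": "CD8", "donor": "diseased", "tissue": "lung"},
    {"condition": "control", "cell_type": "CD8", "donor": "healthy", "tissue": "blood"},
]


def matches(record, selection):
    """True if the record agrees with every filter that has a selected value."""
    return all(record[field] == value for field, value in selection.items())


def enabled_options(datasets, selection, field):
    """Values of `field` that still leave at least one matching dataset."""
    # Ignore the current choice for `field` itself so the user can switch values.
    others = {k: v for k, v in selection.items() if k != field}
    return {d[field] for d in datasets if matches(d, others)}


selection = {"condition": "control"}
tissue_options = enabled_options(DATASETS, selection, "tissue")
print(tissue_options)  # any tissue not in this set would be grayed out
```

With "control" selected, only "blood" stays enabled in this toy catalog, because no control dataset was collected from lung.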

Understanding the Visualizations

UMAP Plot

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that visualizes high-dimensional gene expression data in 2D.

Each point represents a cell
Proximity indicates similar expression profiles
Clusters represent distinct cell states
Gray points show all data; colored points show your filter selection

Expression Heatmap

The heatmap displays mean expression levels of highly variable genes (HVGs) across your selected datasets.

Columns represent individual genes
Color intensity indicates expression level (blue=low, red=high)
The dot below each gene indicates uncertainty (variance)
Hover over a heatmap cell to see exact values

Color Scales

Expression Level: Low (blue), Med, High (red)

Uncertainty (Variance): Low (purple), Med, High (orange/yellow)

Prediction Models

The Predict tab simulates gene knockout effects using statistical models trained on observational (control-only) T-cell data. This section explains how each method works.

How Prediction Works

Baseline: Compute average expression from control samples matching your filters
Knockout: Set selected gene(s) expression to zero
Propagate: Use learned gene-gene relationships to predict downstream effects
Compare: Show predicted vs baseline expression (Δ = delta)

Method 1: Naive Knockout

Plain English: Simply sets the knocked-out gene's expression to zero. Other genes remain unchanged. This is the baseline "null model" that assumes no downstream effects.

When to use: As a control/reference to see what "no propagation" looks like.

Mathematical Formulation:

$$\hat{x}_i = \begin{cases} 0 & \text{if gene } i \text{ is knocked out} \\ \mu_i & \text{otherwise} \end{cases}$$
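
A minimal sketch of the naive knockout rule, assuming the baseline is held as a gene-to-mean mapping. The function name `naive_knockout` and the expression values are illustrative, not taken from INSITE.

```python
def naive_knockout(baseline, knocked_out):
    """Return a copy of `baseline` with the knocked-out genes set to zero.

    baseline    : dict mapping gene name -> baseline mean expression (mu_i)
    knocked_out : set of gene names to knock out
    """
    return {gene: (0.0 if gene in knocked_out else mu) for gene, mu in baseline.items()}


baseline = {"CD3E": 2.1, "FOXP3": 0.4, "IL2RA": 1.3}  # illustrative values
predicted = naive_knockout(baseline, {"FOXP3"})
print(predicted)
```

Only FOXP3 changes; every other gene keeps its baseline value, which is what makes this method a useful reference for "no propagation".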

Method 2: Gaussian Conditional

Plain English: Assumes gene expression follows a multivariate Gaussian distribution. When you "clamp" a gene to zero, the model computes the conditional expectation of all other genes given this constraint. The model uses a precision matrix (inverse covariance) learned from control data.

When to use: Default choice. Captures linear correlations between genes.

Mathematical Formulation:

Given precision matrix $\mathbf{P} = \Sigma^{-1}$, partition into knocked-out ($S$) and remaining ($R$) genes:

$$\hat{\mathbf{x}}_R = \boldsymbol{\mu}_R - \mathbf{P}_{RS} \mathbf{P}_{SS}^{-1} (\mathbf{0} - \boldsymbol{\mu}_S)$$

This is the standard Gaussian conditional mean formula.

Method 3: Invariant Precision

Plain English: Same math as Gaussian Conditional, but the precision matrix is learned differently. Before computing correlations, each dataset (BioProject) is mean-centered to remove batch effects. This captures within-environment gene relationships that may be more stable across different experimental conditions.

When to use: When you suspect batch effects in training data, or want more robust predictions.

Mathematical Formulation:

For each environment $e$, center data:

$$\tilde{\mathbf{x}}^{(e)} = \mathbf{x}^{(e)} - \boldsymbol{\mu}^{(e)}$$

Compute precision from pooled centered data:

$$\mathbf{P}_{\text{inv}} = \left(\tilde{\Sigma}_{\text{pooled}} + \lambda \mathbf{I}\right)^{-1}$$

The inference formula is the same as for Gaussian Conditional.
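
The centering and pooling steps above can be sketched in NumPy. This is an illustration under stated assumptions, not INSITE's code: the function name `invariant_precision`, the default `lam=1e-2` (the value of $\lambda$ for this step is not given here), and the random toy data are all assumptions.

```python
import numpy as np


def invariant_precision(X, env, lam=1e-2):
    """Precision matrix from data centered within each environment.

    X   : (n_cells, n_genes) expression matrix
    env : (n_cells,) environment label per row (e.g. a BioProject ID)
    lam : ridge term added to the diagonal (assumed value)
    """
    X = np.asarray(X, dtype=float)
    env = np.asarray(env)
    centered = np.empty_like(X)
    for e in np.unique(env):
        rows = env == e
        centered[rows] = X[rows] - X[rows].mean(axis=0)  # remove per-environment mean
    # Pooled covariance of the centered data (column means are already zero).
    sigma = centered.T @ centered / len(X)
    return np.linalg.inv(sigma + lam * np.eye(X.shape[1]))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
env = np.repeat(["A", "B"], 100)
P = invariant_precision(X, env)

# Shifting one environment by a constant offset leaves the result unchanged.
X_shifted = X.copy()
X_shifted[env == "B"] += 10.0
P_shifted = invariant_precision(X_shifted, env)
print(np.allclose(P, P_shifted))
```

The last check shows the point of the method: a constant per-dataset offset, one simple kind of batch effect, has no influence on the learned precision matrix.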

Method 4: Deep Denoising Autoencoder (DAE)

Plain English: A neural network trained to "denoise" corrupted gene expression. During training, random genes are masked to zero and the network learns to reconstruct them. For prediction, we mask the knocked-out genes and let the network predict their downstream effects.

When to use: May capture non-linear relationships. Best for exploratory analysis.

Architecture:

$\text{Input}(250) \rightarrow \text{Dense}(128, \text{GELU}) \rightarrow \text{Dense}(250)$

Training:

$$\tilde{\mathbf{x}} = \text{mask}(\mathbf{x}, p=0.4) + \mathcal{N}(0, \sigma^2), \quad \sigma = 0.07$$

$$\mathcal{L} = \text{MSE}(f(\tilde{\mathbf{x}}), \mathbf{x})$$
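
The corruption step, the network shape, and the loss can be sketched in plain NumPy. This is only a forward-pass illustration with randomly initialized weights. The optimizer, initialization scale, batch size, and the tanh approximation of GELU are assumptions, and no training loop is shown.

```python
import numpy as np

N_GENES, HIDDEN = 250, 128
rng = np.random.default_rng(0)


def corrupt(x, p=0.4, sigma=0.07):
    """Mask each gene to zero with probability p, then add Gaussian noise."""
    keep = rng.random(x.shape) >= p
    return x * keep + rng.normal(0.0, sigma, size=x.shape)


def gelu(z):
    """GELU activation (tanh approximation)."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(0.0, 0.05, size=(N_GENES, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.05, size=(HIDDEN, N_GENES))
b2 = np.zeros(N_GENES)


def forward(x):
    """Input(250) -> Dense(128, GELU) -> Dense(250)."""
    return gelu(x @ W1 + b1) @ W2 + b2


def mse(pred, target):
    """Mean squared error between reconstruction and clean input."""
    return float(np.mean((pred - target) ** 2))


x = rng.random((8, N_GENES))        # toy batch of 8 expression profiles
loss = mse(forward(corrupt(x)), x)  # reconstruct the clean input from the corrupted one
print(loss)
```

At prediction time the same forward pass would be applied to a profile in which the knocked-out genes are set to zero, as described above.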

Detailed Mathematical Derivation

Gaussian Conditional Expectation

For a multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, partitioned into sets $S$ (knocked out) and $R$ (remaining):

Covariance form:

$$\mathbb{E}[\mathbf{X}_R \mid \mathbf{X}_S = \mathbf{x}_S] = \boldsymbol{\mu}_R + \Sigma_{RS}\Sigma_{SS}^{-1}(\mathbf{x}_S - \boldsymbol{\mu}_S)$$

Precision form (used in INSITE):

$$\mathbb{E}[\mathbf{X}_R \mid \mathbf{X}_S = \mathbf{x}_S] = \boldsymbol{\mu}_R - \mathbf{P}_{RS}\mathbf{P}_{SS}^{-1}(\mathbf{x}_S - \boldsymbol{\mu}_S)$$

where $\mathbf{P} = \Sigma^{-1}$ is the precision matrix. The precision form is computationally more stable when $\Sigma$ is near-singular.
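
As a worked illustration, the covariance form above can be evaluated with NumPy on a three-gene toy example. The function name `conditional_mean` and all numbers are assumptions made for this sketch. It uses the covariance form rather than the precision form and is not INSITE's code.

```python
import numpy as np


def conditional_mean(mu, sigma, S, x_S):
    """E[X_R | X_S = x_S] in covariance form; returns (R indices, mean of X_R)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    S = np.asarray(S)
    R = np.setdiff1d(np.arange(len(mu)), S)  # remaining gene indices
    sigma_RS = sigma[np.ix_(R, S)]
    sigma_SS = sigma[np.ix_(S, S)]
    # Solve Sigma_SS z = (x_S - mu_S) rather than forming the inverse.
    z = np.linalg.solve(sigma_SS, np.asarray(x_S, dtype=float) - mu[S])
    return R, mu[R] + sigma_RS @ z


mu = np.array([2.0, 3.0, 1.0])
sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
R, x_hat = conditional_mean(mu, sigma, S=[0], x_S=[0.0])
print(R, x_hat)
```

Clamping gene 0 to zero lowers the expected value of the correlated gene 1 from 3.0 to 2.0, while the uncorrelated gene 2 stays at 1.0.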

Log-space Transformation

Gene expression is modeled in log-space for better Gaussian approximation:

$$y = \log(1 + x) \quad \text{(log1p transform)}$$

$$\hat{x} = \exp(\hat{y}) - 1 \quad \text{(expm1 inverse)}$$

Knockout in log-space: $y_{\text{KO}} = \log(1 + 0) = 0$
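
The transform pair maps directly onto Python's standard `math.log1p` and `math.expm1`. The wrapper names `to_log_space` and `from_log_space` are only illustrative.

```python
import math


def to_log_space(x):
    """log1p transform: y = log(1 + x)."""
    return math.log1p(x)


def from_log_space(y):
    """expm1 inverse: x = exp(y) - 1."""
    return math.expm1(y)


y_ko = to_log_space(0.0)                       # a knocked-out gene maps to 0 in log-space
roundtrip = from_log_space(to_log_space(3.5))  # recovers the original value up to rounding
print(y_ko, roundtrip)
```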

Numerical Stability

To solve $\mathbf{P}_{SS}^{-1}\mathbf{b}$, we use regularized linear solve instead of direct inversion:

$$\text{solve}\left((\mathbf{P}_{SS} + \lambda \mathbf{I}), \mathbf{b}\right) \quad \text{where } \lambda = 10^{-6}$$

This prevents numerical instability when $\mathbf{P}_{SS}$ is ill-conditioned.
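
A sketch of the regularized solve using `numpy.linalg.solve`. The function name `regularized_solve` and the singular toy block are assumptions chosen to show why the ridge term helps.

```python
import numpy as np


def regularized_solve(P_SS, b, lam=1e-6):
    """Solve (P_SS + lam * I) z = b without forming an explicit inverse."""
    P_SS = np.asarray(P_SS, dtype=float)
    ridge = lam * np.eye(P_SS.shape[0])
    return np.linalg.solve(P_SS + ridge, np.asarray(b, dtype=float))


# An exactly singular block: a plain solve on this matrix would fail.
P_SS = np.array([[1.0, 1.0],
                 [1.0, 1.0]])
b = np.array([1.0, 1.0])
z = regularized_solve(P_SS, b)
print(z)
```

For a well-conditioned block the ridge term changes the answer only marginally, so the same code path can be used in both cases.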

Understanding the Output

Output	Description
$\Delta$ (Delta)	Predicted expression minus baseline: $\Delta = \hat{x} - \mu$. Positive = upregulated, Negative = downregulated.
Top Upregulated	Genes with largest positive $\Delta$. These are predicted to increase when KO genes are knocked out.
Top Downregulated	Genes with largest negative $\Delta$. These are predicted to decrease (in addition to the KO gene itself).
Method Comparison	Side-by-side view of $\Delta$ values from all 4 methods. Agreement across methods suggests more robust prediction.
CSV Export	Download full predictions for all 250 genes across all methods for downstream analysis.
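
The $\Delta$ values and the two ranked lists can be computed as sketched below. The function name `rank_deltas`, the gene labels, and the choice to leave the knocked-out genes out of the ranked lists are assumptions made for illustration.

```python
def rank_deltas(predicted, baseline, knocked_out, top_n=2):
    """Compute delta = predicted - baseline and rank the non-KO genes."""
    deltas = {g: predicted[g] - baseline[g] for g in baseline}
    # Assumption: the knocked-out genes themselves are left out of the rankings.
    others = {g: d for g, d in deltas.items() if g not in knocked_out}
    up = sorted((g for g, d in others.items() if d > 0), key=lambda g: -others[g])[:top_n]
    down = sorted((g for g, d in others.items() if d < 0), key=lambda g: others[g])[:top_n]
    return deltas, up, down


baseline = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 1.5}    # illustrative values
predicted = {"A": 0.0, "B": 1.4, "C": 0.25, "D": 1.5}  # gene A knocked out
deltas, up, down = rank_deltas(predicted, baseline, {"A"})
print(up, down)
```

Here gene B is the only upregulated gene and gene C the only downregulated one; gene D is unchanged and appears in neither list.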

About the Data

Data Source

INSITE uses T-cell single-cell RNA sequencing data from the scBaseCount repository, a comprehensive database of uniformly processed scRNA-seq data from public repositories (SRA, GEO).

Processing Pipeline

Cell Selection: T-cells identified using SingleR reference-based annotation and marker gene scoring (CD3D, CD3E, CD4, CD8A, FOXP3)
Subtype Classification: CD4, CD8, and Treg subtypes assigned based on dominant marker expression
Quality Control: Cells filtered by UMI count, gene count, and mitochondrial percentage
Normalization: Log-normalized expression values per 10,000 counts
HVG Selection: Top 2,000 highly variable genes selected for visualization
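
The normalization step can be sketched as follows, assuming the common convention of scaling each cell to 10,000 total counts and then applying log1p. The exact procedure used upstream is not specified here, and the function name `log_normalize` and the counts are illustrative.

```python
import math


def log_normalize(counts, scale=10_000):
    """Scale one cell's counts to `scale` total counts, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to normalize
    return [math.log1p(c / total * scale) for c in counts]


cell = [0, 5, 15, 80]  # raw UMI counts for four genes in one cell (illustrative)
normalized = log_normalize(cell)
print(normalized)
```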

Perturbation Categories

For this Explore-phase release, INSITE clusters unstructured scBaseCount perturbation text into coarse categories for fast browsing:

control: Baseline/unperturbed samples
vaccination: Vaccine-induced responses
infection: Pathogen exposure (viral, bacterial)
drug: Pharmacological treatments
cytokine: Cytokine treatments
genetic: CRISPR, RNAi, engineering, CAR/TCR
stimulation: CD3/CD28, PMA/ionomycin, activation
diet_metabolic: Diet, glucose, metabolic context
cancer: Oncology-related contexts
autoimmune: Autoimmune disease contexts
environment: Hypoxia/oxygen and stress
other: Other contexts not captured above

Frequently Asked Questions

Why are some filter options grayed out?

Options are disabled when selecting them would result in zero matching datasets. This "smart filtering" helps you navigate to valid data combinations without hitting dead ends.

What does the uncertainty dot mean?

The colored dot below each gene in the heatmap represents variance/uncertainty in expression. Purple indicates consistent expression across cells, while orange/yellow indicates high variability. High uncertainty suggests the gene may be differentially expressed or affected by batch effects.

How are mean expression profiles calculated?

Expression profiles are weighted averages across all cells in your selected datasets. Each dataset's contribution is proportional to its cell count, ensuring larger studies have more influence while preventing any single dataset from dominating.

Can I compare two conditions?

Yes! Click the + button to add a second filter row. Each filter row will be displayed as a separate colored set in the UMAP and as a separate heatmap panel, allowing side-by-side comparison.
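
As a rough illustration of the cell-count weighting of mean expression profiles described above, the sketch below combines per-dataset mean profiles in proportion to their cell counts. The function name `weighted_mean_profile` and the numbers are hypothetical.

```python
def weighted_mean_profile(dataset_means, cell_counts):
    """Average per-dataset mean profiles, weighting each by its cell count."""
    total = sum(cell_counts)
    n_genes = len(dataset_means[0])
    return [
        sum(means[g] * n for means, n in zip(dataset_means, cell_counts)) / total
        for g in range(n_genes)
    ]


means = [[1.0, 2.0], [3.0, 6.0]]  # two datasets, two genes (illustrative)
counts = [300, 100]               # cells per dataset
profile = weighted_mean_profile(means, counts)
print(profile)
```

The dataset with 300 cells pulls the combined profile three times as strongly as the one with 100 cells.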
