Help & Documentation

Quick Start Guide
  1. Explore: Browse T-cell datasets using filters. Visualize gene expression patterns across conditions.
  2. Predict: Modify gene expression levels and predict how cells would respond to perturbations.
  3. Optimize: Define a target cell state and find optimal gene combinations to achieve it.

Understanding the Filters
  • Condition / Perturbation: Condition label used for Explore-phase browsing. In the current release this is a coarse categorization (not yet a full experimental perturbation ontology). Available options: control, vaccination, infection, drug, cytokine, genetic, stimulation, diet_metabolic, cancer, autoimmune, environment, other.
  • T-cell Type: The T-cell subpopulation, based on marker gene expression and reference-based annotation. Available options: CD4 (helper T-cells), CD8 (cytotoxic T-cells), Treg (regulatory T-cells).
  • Donor Type: Health status of the donor from which cells were isolated. Available options: healthy, diseased, unknown.
  • Tissue: Anatomical location from which T-cells were collected. Available options: blood, lung, liver, brain, skin, and more.
Smart Filtering: Options that would result in zero matches are automatically grayed out based on your current selections.
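As a rough illustration of the smart-filtering idea, the sketch below computes which values of one filter still yield at least one match given the other current selections. The record layout, field names, and the `enabled_options` helper are hypothetical and are not taken from the INSITE code.

```python
from typing import Dict, List, Set

# Hypothetical dataset records; field names are illustrative only.
DATASETS: List[Dict[str, str]] = [
    {"condition": "control", "cell_type": "CD4", "donor": "healthy", "tissue": "blood"},
    {"condition": "infection", "cell_type": "CD8", "donor": "diseased", "tissue": "lung"},
    {"condition": "control", "cell_type": "Treg", "donor": "healthy", "tissue": "skin"},
]

def matches(record: Dict[str, str], selection: Dict[str, str]) -> bool:
    """True if the record agrees with every filter that has a selected value."""
    return all(record[field] == value for field, value in selection.items())

def enabled_options(datasets: List[Dict[str, str]],
                    selection: Dict[str, str], field: str) -> Set[str]:
    """Values of `field` that still give at least one match under the other filters."""
    others = {f: v for f, v in selection.items() if f != field}
    return {d[field] for d in datasets if matches(d, others)}

current = {"condition": "control"}
tissue_options = enabled_options(DATASETS, current, "tissue")
# Any tissue value not in `tissue_options` would be grayed out.
```

With the three example records, selecting `control` leaves `blood` and `skin` enabled, so `lung` would be grayed out.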
Understanding the Visualizations
UMAP Plot

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that visualizes high-dimensional gene expression data in 2D.

  • Each point represents a cell
  • Proximity indicates similar expression profiles
  • Clusters represent distinct cell states
  • Gray points show all data; colored points show your filter selection
Expression Heatmap

The heatmap displays mean expression levels of highly variable genes (HVGs) across your selected datasets.

  • Columns represent individual genes
  • Color intensity indicates expression level (blue=low, red=high)
  • The dot below each gene indicates uncertainty (variance)
  • Hover over a heatmap cell to see exact values
Color Scales

Expression Level: Low, Med, High (blue = low, red = high)

Uncertainty (Variance): Low, Med, High (purple = low, orange/yellow = high)
Prediction Models

The Predict tab simulates gene knockout effects using statistical models trained on observational (control-only) T-cell data. This section explains how each method works.

How Prediction Works
  1. Baseline: Compute average expression from control samples matching your filters
  2. Knockout: Set selected gene(s) expression to zero
  3. Propagate: Use learned gene-gene relationships to predict downstream effects
  4. Compare: Show predicted vs baseline expression (Δ = delta)
Method 1: Naive Knockout

Plain English: Simply sets the knocked-out gene's expression to zero. Other genes remain unchanged. This is the baseline "null model" that assumes no downstream effects.

When to use: As a control/reference to see what "no propagation" looks like.

Mathematical Formulation:

$$\hat{x}_i = \begin{cases} 0 & \text{if gene } i \text{ is knocked out} \\ \mu_i & \text{otherwise} \end{cases}$$
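
The naive rule is small enough to write out directly. This is a minimal sketch assuming the baseline is a NumPy vector of mean expression values; the function name and the example numbers are illustrative only.

```python
import numpy as np

def naive_knockout(mu: np.ndarray, ko_idx: list) -> np.ndarray:
    """Return a copy of the baseline with the knocked-out genes set to zero."""
    x_hat = np.array(mu, dtype=float)   # copy, so the baseline is left untouched
    x_hat[ko_idx] = 0.0
    return x_hat

mu = np.array([1.2, 0.4, 2.0, 0.0])     # illustrative baseline means
x_hat = naive_knockout(mu, ko_idx=[2])
delta = x_hat - mu                       # non-zero only at the knocked-out gene
```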
Method 2: Gaussian Conditional

Plain English: Assumes gene expression follows a multivariate Gaussian distribution. When you "clamp" a gene to zero, the model computes the conditional expectation of all other genes given this constraint. Uses a precision matrix (inverse covariance) learned from control data.

When to use: Default choice. Captures linear correlations between genes.

Mathematical Formulation:

Given precision matrix $\mathbf{P} = \Sigma^{-1}$, partition into knocked-out ($S$) and remaining ($R$) genes:

$$\hat{\mathbf{x}}_R = \boldsymbol{\mu}_R - \mathbf{P}_{RR}^{-1} \mathbf{P}_{RS} (\mathbf{0} - \boldsymbol{\mu}_S)$$

This is the standard Gaussian conditional mean formula.

Method 3: Invariant Precision

Plain English: Same math as Gaussian Conditional, but the precision matrix is learned differently. Before computing correlations, each dataset (BioProject) is mean-centered to remove batch effects. This captures within-environment gene relationships that may be more stable across different experimental conditions.

When to use: When you suspect batch effects in training data, or want more robust predictions.

Mathematical Formulation:

For each environment $e$, center data:

$$\tilde{\mathbf{x}}^{(e)} = \mathbf{x}^{(e)} - \boldsymbol{\mu}^{(e)}$$

Compute precision from pooled centered data:

$$\mathbf{P}_{\text{inv}} = \left(\tilde{\Sigma}_{\text{pooled}} + \lambda \mathbf{I}\right)^{-1}$$

Inference uses the same conditional-mean formula as Gaussian Conditional, with $\mathbf{P}_{\text{inv}}$ in place of $\mathbf{P}$.
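
The centering and pooling steps can be sketched in a few lines of NumPy. The function name, the default value of `lam`, and the choice to divide by the total number of cells are assumptions made for illustration; the text above does not specify them.

```python
import numpy as np

def invariant_precision(X, env, lam=1e-2):
    """Center each environment separately, pool, and invert the ridge-regularized covariance."""
    X = np.asarray(X, dtype=float)
    env = np.asarray(env)
    X_centered = np.empty_like(X)
    for e in np.unique(env):
        rows = env == e
        X_centered[rows] = X[rows] - X[rows].mean(axis=0)   # remove the per-environment mean
    n, g = X_centered.shape
    sigma_pooled = X_centered.T @ X_centered / n            # assumption: normalize by n
    return np.linalg.inv(sigma_pooled + lam * np.eye(g))

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 5))                     # synthetic cells x genes
env = np.repeat(["A", "B"], 100)                     # two synthetic BioProjects
shift = np.where(env[:, None] == "A", 0.0, 3.0)      # constant batch offset for environment B
P_inv = invariant_precision(base + shift, env)
P_ref = invariant_precision(base, env)
```

Because each environment is centered before pooling, a constant offset applied to one environment leaves the estimated precision matrix unchanged, which is the sense in which the estimate ignores mean-level batch effects.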

Method 4: Deep Denoising Autoencoder (DAE)

Plain English: A neural network trained to "denoise" corrupted gene expression. During training, random genes are masked to zero and the network learns to reconstruct them. For prediction, we mask the knocked-out genes and let the network predict their downstream effects.

When to use: May capture non-linear relationships. Best for exploratory analysis.

Architecture:

$\text{Input}(250) \rightarrow \text{Dense}(128, \text{GELU}) \rightarrow \text{Dense}(250)$

Training:

$$\tilde{\mathbf{x}} = \text{mask}(\mathbf{x}, p=0.4) + \mathcal{N}(0, \sigma^2), \quad \sigma = 0.07$$ $$\mathcal{L} = \text{MSE}(f(\tilde{\mathbf{x}}), \mathbf{x})$$
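
To make the shapes and the corruption step concrete, here is a NumPy-only sketch of the forward pass, the masking-plus-noise corruption, and the MSE loss. The weights are random and untrained, the tanh approximation of GELU is used, and all function names are illustrative assumptions rather than the actual INSITE implementation.

```python
import numpy as np

N_GENES, HIDDEN = 250, 128

def gelu(z):
    """Tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def init_params(rng):
    """Small random weights; untrained, for shape-checking only."""
    return {
        "W1": rng.normal(scale=0.05, size=(N_GENES, HIDDEN)), "b1": np.zeros(HIDDEN),
        "W2": rng.normal(scale=0.05, size=(HIDDEN, N_GENES)), "b2": np.zeros(N_GENES),
    }

def forward(params, x):
    """Input(250) -> Dense(128, GELU) -> Dense(250)."""
    h = gelu(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def corrupt(x, rng, p=0.4, sigma=0.07):
    """Zero each entry with probability p, then add Gaussian noise everywhere."""
    keep = rng.random(x.shape) >= p
    return x * keep + rng.normal(scale=sigma, size=x.shape)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.random((8, N_GENES))                        # synthetic batch of 8 profiles
loss = mse(forward(params, corrupt(x, rng)), x)     # training objective for one batch

# At prediction time the knocked-out genes are masked and the network fills in the rest.
x_ko = x[0].copy()
x_ko[[3, 17]] = 0.0                                 # hypothetical knockout indices
x_pred = forward(params, x_ko)
```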

Gaussian Conditional Expectation

For a multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, partitioned into sets $S$ (knocked out) and $R$ (remaining):

Covariance form:

$$\mathbb{E}[\mathbf{X}_R \mid \mathbf{X}_S = \mathbf{x}_S] = \boldsymbol{\mu}_R + \Sigma_{RS}\Sigma_{SS}^{-1}(\mathbf{x}_S - \boldsymbol{\mu}_S)$$

Precision form (used in INSITE):

$$\mathbb{E}[\mathbf{X}_R \mid \mathbf{X}_S = \mathbf{x}_S] = \boldsymbol{\mu}_R - \mathbf{P}_{RR}^{-1}\mathbf{P}_{RS}(\mathbf{x}_S - \boldsymbol{\mu}_S)$$

where $\mathbf{P} = \Sigma^{-1}$ is the precision matrix. The two forms agree because $\Sigma_{RS}\Sigma_{SS}^{-1} = -\mathbf{P}_{RR}^{-1}\mathbf{P}_{RS}$. Working with a regularized $\mathbf{P}$ avoids inverting blocks of $\Sigma$ directly, which helps when $\Sigma$ is near-singular.
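
As a numerical sanity check, the sketch below computes the conditional mean in covariance form and compares it with the precision-form expression $\boldsymbol{\mu}_R - \mathbf{P}_{RR}^{-1}\mathbf{P}_{RS}(\mathbf{x}_S - \boldsymbol{\mu}_S)$ on a small synthetic Gaussian. The function names and the random test matrix are illustrative assumptions.

```python
import numpy as np

def cond_mean_cov(mu, sigma, S, x_S):
    """Covariance form: mu_R + Sigma_RS Sigma_SS^{-1} (x_S - mu_S)."""
    R = [i for i in range(len(mu)) if i not in S]
    gain = np.linalg.solve(sigma[np.ix_(S, S)], x_S - mu[S])
    return mu[R] + sigma[np.ix_(R, S)] @ gain

def cond_mean_prec(mu, P, S, x_S):
    """Precision form: mu_R - P_RR^{-1} P_RS (x_S - mu_S)."""
    R = [i for i in range(len(mu)) if i not in S]
    rhs = P[np.ix_(R, S)] @ (x_S - mu[S])
    return mu[R] - np.linalg.solve(P[np.ix_(R, R)], rhs)

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
sigma = A @ A.T + 6.0 * np.eye(6)    # synthetic, well-conditioned covariance
mu = rng.random(6) + 1.0
P = np.linalg.inv(sigma)

S = [2, 4]                            # knocked-out gene indices
x_S = np.zeros(len(S))                # knockout clamps these genes to zero
via_cov = cond_mean_cov(mu, sigma, S, x_S)
via_prec = cond_mean_prec(mu, P, S, x_S)
```

Both functions return the predicted means of the four remaining genes, and the two results should match to floating-point tolerance.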

Log-space Transformation

Gene expression is modeled in log-space for better Gaussian approximation:

$$y = \log(1 + x) \quad \text{(log1p transform)}$$ $$\hat{x} = \exp(\hat{y}) - 1 \quad \text{(expm1 inverse)}$$

Knockout in log-space: $y_{\text{KO}} = \log(1 + 0) = 0$
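
A short NumPy sketch of the round trip, using made-up expression values. Note that a predicted log-space value below zero maps back to a negative expression value; whether such values are clipped is not stated here, so the sketch leaves them as they are.

```python
import numpy as np

x = np.array([0.0, 1.0, 9.0, 250.0])   # illustrative expression values
y = np.log1p(x)                         # forward transform: log(1 + x)
x_back = np.expm1(y)                    # inverse transform: exp(y) - 1

y_ko = np.log1p(0.0)                    # a knocked-out gene is 0 in both spaces
x_neg = np.expm1(-0.1)                  # a negative log-space prediction maps below zero
```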

Numerical Stability

To apply $\mathbf{P}_{RR}^{-1}$ to a vector $\mathbf{b}$, we use a regularized linear solve instead of explicit inversion:

$$\text{solve}\left((\mathbf{P}_{RR} + \lambda \mathbf{I}), \mathbf{b}\right) \quad \text{where } \lambda = 10^{-6}$$

This prevents numerical instability when $\mathbf{P}_{RR}$ is ill-conditioned.
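
A minimal sketch of such a regularized solve with NumPy, applied to a deliberately singular 2×2 matrix that stands in for a precision-matrix block. The helper name `regularized_solve` is an assumption for illustration.

```python
import numpy as np

def regularized_solve(A, b, lam=1e-6):
    """Solve (A + lam * I) z = b without forming an explicit inverse."""
    A = np.asarray(A, dtype=float)
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), b)

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # singular: the two rows are identical
b = np.array([1.0, 1.0])
z = regularized_solve(A, b)     # well-defined thanks to the lam * I term
```

The small diagonal term makes the system solvable while changing the solution only slightly when the matrix is already well-conditioned.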

Understanding the Output
  • $\Delta$ (Delta): Predicted expression minus baseline: $\Delta = \hat{x} - \mu$. Positive = upregulated, negative = downregulated.
  • Top Upregulated: Genes with the largest positive $\Delta$. These are predicted to increase when the selected genes are knocked out.
  • Top Downregulated: Genes with the largest negative $\Delta$. These are predicted to decrease (in addition to the KO gene itself).
  • Method Comparison: Side-by-side view of $\Delta$ values from all 4 methods. Agreement across methods suggests a more robust prediction.
  • CSV Export: Download full predictions for all 250 genes across all methods for downstream analysis.
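
The $\Delta$ ranking can be sketched as follows. The gene names and numbers are made up for illustration and are not model output; excluding the knocked-out gene from the ranked lists is an assumption, as is the `top_changes` helper itself.

```python
import numpy as np

def top_changes(genes, x_hat, mu, ko_genes, k=3):
    """Rank genes by delta = x_hat - mu, skipping the knocked-out genes themselves."""
    delta = np.asarray(x_hat, dtype=float) - np.asarray(mu, dtype=float)
    order = np.argsort(delta)                                  # most negative first
    keep = [i for i in order if genes[i] not in ko_genes]
    down = [(genes[i], float(delta[i])) for i in keep if delta[i] < 0][:k]
    up = [(genes[i], float(delta[i])) for i in reversed(keep) if delta[i] > 0][:k]
    return up, down

genes = ["IL2", "IFNG", "FOXP3", "GZMB", "TNF"]
mu = np.array([1.0, 2.0, 1.5, 0.5, 1.0])        # illustrative baseline
x_hat = np.array([1.4, 1.0, 0.0, 0.5, 1.1])     # illustrative prediction, FOXP3 knocked out
up, down = top_changes(genes, x_hat, mu, ko_genes={"FOXP3"})
```

Here `up` lists IL2 then TNF, and `down` lists IFNG; GZMB is unchanged and appears in neither list.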
About the Data
Data Source

INSITE uses T-cell single-cell RNA sequencing data from the scBaseCount repository, a comprehensive database of uniformly processed scRNA-seq data from public repositories (SRA, GEO).

Processing Pipeline
  1. Cell Selection: T-cells identified using SingleR reference-based annotation and marker gene scoring (CD3D, CD3E, CD4, CD8A, FOXP3)
  2. Subtype Classification: CD4, CD8, and Treg subtypes assigned based on dominant marker expression
  3. Quality Control: Cells filtered by UMI count, gene count, and mitochondrial percentage
  4. Normalization: Counts scaled to 10,000 per cell, then log-transformed
  5. HVG Selection: Top 2,000 highly variable genes selected for visualization
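
The normalization and HVG steps can be sketched as below, assuming a cells × genes count matrix in which every cell has a non-zero total (as would hold after quality control). Ranking genes by plain variance is a simplification used for illustration; the actual HVG selection method is not specified above, and both function names are hypothetical.

```python
import numpy as np

def log_normalize(counts, target_sum=10_000):
    """Scale each cell to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)      # per-cell library size
    return np.log1p(counts / totals * target_sum)

def top_variable_genes(expr, n_top):
    """Indices of the n_top genes with the highest variance across cells."""
    return np.argsort(expr.var(axis=0))[::-1][:n_top]

counts = np.array([[10, 0, 90],
                   [5, 5, 40],
                   [0, 20, 80]])                     # 3 cells x 3 genes, illustrative
expr = log_normalize(counts)
hvg_idx = top_variable_genes(expr, n_top=2)
```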
Perturbation Categories

For this Explore-phase release, INSITE clusters unstructured scBaseCount perturbation text into coarse categories for fast browsing:

  • control: Baseline/unperturbed samples
  • vaccination: Vaccine-induced responses
  • infection: Pathogen exposure (viral, bacterial)
  • drug: Pharmacological treatments
  • cytokine: Cytokine treatments
  • genetic: CRISPR, RNAi, engineering, CAR/TCR
  • stimulation: CD3/CD28, PMA/ionomycin, activation
  • diet_metabolic: Diet, glucose, metabolic context
  • cancer: Oncology-related contexts
  • autoimmune: Autoimmune disease contexts
  • environment: Hypoxia/oxygen and stress
  • other: Other contexts not captured above
Frequently Asked Questions

Why are some filter options grayed out?

Options are disabled when selecting them would result in zero matching datasets. This "smart filtering" helps you navigate to valid data combinations without hitting dead ends.

What does the colored dot below each gene mean?

The colored dot below each gene in the heatmap represents variance/uncertainty in expression. Purple indicates consistent expression across cells, while orange/yellow indicates high variability. High uncertainty suggests the gene may be differentially expressed or affected by batch effects.

How are expression profiles calculated?

Expression profiles are weighted averages across all cells in your selected datasets. Each dataset's contribution is proportional to its cell count, ensuring larger studies have more influence while preventing any single dataset from dominating.

Can I compare multiple filter selections?

Yes! Click the + button to add a second filter row. Each filter row will be displayed as a separate colored set in the UMAP and as a separate heatmap panel, allowing side-by-side comparison.
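
For the cell-count weighting of expression profiles described in this FAQ, a minimal NumPy sketch with made-up numbers is shown below. It implements plain proportional weighting only; any additional safeguard that limits a single dataset's influence is not described here and is not modeled.

```python
import numpy as np

dataset_means = np.array([[1.0, 2.0],     # dataset 1: mean expression of two genes
                          [3.0, 6.0]])    # dataset 2
cell_counts = np.array([100, 300])        # cells contributed by each dataset

# Weighted average across datasets, with weights proportional to cell count.
profile = np.average(dataset_means, axis=0, weights=cell_counts)
```

With these numbers the second dataset carries three quarters of the weight, giving a combined profile of 2.5 and 5.0 for the two genes.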