Advances in single-cell sequencing and CRISPR technologies have enabled detailed case-control comparisons and experimental perturbations at single-cell resolution. However, uncovering causal relationships in observational genomic data remains challenging due to selection bias and inadequate adjustment for unmeasured confounders, particularly in heterogeneous datasets. To address these challenges, we introduce causarray [Du26], a doubly robust causal inference framework for analyzing array-based genomic data at both bulk-cell and single-cell levels. causarray integrates a generalized confounder adjustment method to account for unmeasured confounders and employs semiparametric inference with flexible machine learning techniques to ensure robust statistical estimation of treatment effects.
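To make "doubly robust" concrete, here is a minimal sketch of the textbook augmented inverse-probability-weighted (AIPW) estimator for an average treatment effect. It is a generic illustration, not causarray's implementation: the function name `aipw_ate`, the simulated data, and the hand-specified nuisance models are all assumptions made for this example.

```python
import numpy as np

def aipw_ate(y, a, mu1, mu0, e):
    """Generic AIPW estimate of E[Y(1) - Y(0)] with a plug-in standard error.

    y   : observed outcomes
    a   : binary treatment indicator (0/1)
    mu1 : predicted outcome under treatment, per unit
    mu0 : predicted outcome under control, per unit
    e   : predicted propensity P(A = 1 | X), per unit
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    # Outcome-model contrast plus inverse-probability-weighted residuals.
    psi = (mu1 - mu0
           + a * (y - mu1) / e
           - (1.0 - a) * (y - mu0) / (1.0 - e))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))

# Simulated data (hypothetical): one confounder x, true effect 2.0.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))
a = rng.binomial(1, e_true)
y = 1.0 + 2.0 * a + x + rng.normal(size=n)

# Case 1: correct propensity, deliberately wrong outcome model (all zeros).
ate_ps, se_ps = aipw_ate(y, a, mu1=np.zeros(n), mu0=np.zeros(n), e=e_true)

# Case 2: correct outcome model, deliberately wrong propensity (constant 0.5).
ate_om, se_om = aipw_ate(y, a, mu1=3.0 + x, mu0=1.0 + x, e=np.full(n, 0.5))

print(f"correct propensity only: {ate_ps:.2f} (SE {se_ps:.2f})")
print(f"correct outcome only:    {ate_om:.2f} (SE {se_om:.2f})")
```

Both estimates should land near the true effect even though one nuisance model is wrong in each case, which is the property the term "doubly robust" refers to.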
We recommend using causarray in a conda environment:

    # create a new conda environment and install the necessary packages
    conda create -n causarray python=3.12 -y
    # activate the environment
    conda activate causarray

The module can be installed via PyPI:

    pip install causarray

For optimal parallel performance, we recommend installing llvm-openmp if using conda:

    conda install -c conda-forge llvm-openmp

For R users, reticulate can be used to call causarray from R.
The documentation and tutorials using both Python and R are available at causarray.readthedocs.io.
| Tutorial | Language | Description | Link |
|---|---|---|---|
| Perturb-seq [Jin20] | Python | CRISPR screen analysis on excitatory neurons | Notebook |
| Perturb-seq [Jin20] | R | Same analysis using reticulate | Notebook |
| Genome-wide CRISPRi screen [Replogle22] | Python | Batch fitting on 200 perturbations from a K562 genome-wide CRISPRi screen | Notebook |
| Case-control: SEA-AD [Gabitto24] | Python | Causal inference on observational single-cell data (Alzheimer's disease) | Notebook |
For screens with hundreds to thousands of perturbations, use gcate_lfc_batch so that peak memory is bounded by one batch at a time:

    from causarray import gcate_lfc_batch

    df_res = gcate_lfc_batch(
        Y, X, A, r,
        batch_size=10,            # perturbations per batch (or use n_batches= for a fixed count)
        max_cells=2000,           # max pert cells per batch (ctrl added on top)
        n_ctrl=2000,              # fixed ctrl subsample shared across batches
        cache_path='results.h5',  # resume if interrupted
        verbose=True,
    )

See the Replogle-E-K562 tutorial for a demonstration on 200 perturbations from a genome-wide CRISPRi screen.
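The batching idea can be sketched independently of the library. The snippet below is an illustration under stated assumptions, not causarray's internals: `iter_perturbation_batches` is a hypothetical helper, `A` is assumed to be a binary cells-by-perturbations indicator matrix, and control cells are assumed to be the rows with no perturbation. It shows how a fixed control subsample plus a cap on perturbed cells keeps each batch's row count bounded by `max_cells + n_ctrl`.

```python
import numpy as np

def iter_perturbation_batches(A, batch_size=10, max_cells=2000, n_ctrl=2000, seed=0):
    """Yield (perturbation_columns, cell_rows) for each batch.

    Hypothetical helper: assumes A is a binary (cells x perturbations)
    matrix and that control cells are rows with no perturbation at all.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A)

    # Draw the control subsample once so every batch shares the same controls.
    ctrl = np.flatnonzero(A.sum(axis=1) == 0)
    if len(ctrl) > n_ctrl:
        ctrl = rng.choice(ctrl, size=n_ctrl, replace=False)
    ctrl = np.sort(ctrl)

    for start in range(0, A.shape[1], batch_size):
        cols = np.arange(start, min(start + batch_size, A.shape[1]))
        # Cells that received any perturbation in this batch.
        pert = np.flatnonzero(A[:, cols].any(axis=1))
        if len(pert) > max_cells:
            pert = rng.choice(pert, size=max_cells, replace=False)
        yield cols, np.concatenate([np.sort(pert), ctrl])

# Toy data: 100 control cells, 200 cells with one of 25 perturbations each.
rng = np.random.default_rng(1)
n_cells, n_perts = 300, 25
A = np.zeros((n_cells, n_perts), dtype=int)
targets = rng.integers(0, n_perts, size=200)
A[np.arange(100, 300), targets] = 1

batches = list(iter_perturbation_batches(A, batch_size=10, max_cells=50, n_ctrl=40))
for cols, cells in batches:
    print(f"perturbations {cols[0]}-{cols[-1]}: {len(cells)} cells")
```

Only the rows in `cells` would need to be loaded for a given batch, which is why peak memory depends on the batch settings rather than on the total number of perturbations.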
See CHANGELOG for a full version history.
[Du26] Jin-Hong Du, Maya Shen, Hansruedi Mathys, and Kathryn Roeder. "Uncovering causal relationships in single cell omic studies with causarray". In: Briefings in Bioinformatics (2026).
[Gabitto24] Mariano I. Gabitto et al. "Integrated multimodal cell atlas of Alzheimer's disease". In: Nature Neuroscience (2024).
[Jin20] Xin Jin et al. "In vivo Perturb-seq reveals neuronal and glial abnormalities associated with autism risk genes". In: Science (2020).
[Replogle22] Joseph M. Replogle et al. "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq". In: Cell (2022).