MetaCell#
Metacells are small, transcriptionally homogeneous groups of single cells that are treated as the unit of analysis instead of individual cells. They denoise sparse single-cell counts while preserving cell-state granularity — unlike Leiden clusters (too coarse for state-level analysis) or single cells (too sparse for many downstream tools).
This section has three layers:
Layer |
Tutorial |
When |
|---|---|---|
1. Recommended workflow |
Day-one user. Run the recommended backend (SEACells, soft membership) end-to-end and drive a downstream pipeline (DEG, pseudobulk, marker dotplot). |
|
2. Multi-sample workflow |
You have ≥2 batches / donors / conditions. Build per-sample-aware metacells on a Harmony-corrected embedding. |
|
3. Backend zoo |
You want to compare all 7 backends side-by-side on your own data ( |
Metacell vs pseudobulk — what’s the difference?#
Both metacells and pseudobulk produce aggregated count profiles, but they answer different questions and have different statistical properties.
Pseudobulk |
Metacell |
|
|---|---|---|
Granularity |
One profile per sample × celltype (or sample × cluster) — usually 5–50 profiles total. |
One profile per metacell — typically |
Aggregation key |
Pre-existing labels (sample, celltype). |
Learned partition based on transcriptional similarity (graph / archetype / VQ-VAE / …). |
Within-group purity |
Whatever the labels imply (often messy — “Beta cells from sample 3” still has substate variation). |
Optimized to be transcriptionally homogeneous — each metacell ≈ one cell state. |
Sample / batch awareness |
Native — sample is the aggregation key. |
Optional — most backends are sample-agnostic by default; multi-sample workflows need a corrected embedding (see t_metacell_multisample). |
What it’s good for |
Cohort-level DE (DESeq2 / edgeR / limma): “does gene X change between healthy and IBD donors averaged over T cells?” |
State-level analysis with denoised counts: cell-cell communication, GRN inference, RNA velocity, marker discovery, trajectory smoothing. |
What it’s NOT good for |
Within-celltype state granularity, trajectory inference, cell-cell communication. |
Cohort-level DE (you’d be testing thousands of “samples”, inflating power and breaking the variance model). |
Typical N profiles |
~10s |
~100s–1000s |
Rule of thumb: if your statistical model is expression ~ condition and
treats samples as the unit of replication, you want pseudobulk. If your
model is “give me per-cell-state expression but with less noise”, you want
metacell.
The two are also composable: you can pseudobulk metacells per sample (e.g. for cross-sample DE within a metacell type), or compute metacells within each sample independently and then concatenate. omicverse’s t_metacell_multisample shows the latter.
Architecture#
ov.single.MetaCell(adata, method=...) dispatches to seven backends, each
writing a unified AnnData schema:
adata.obs['metacell_id'] categorical — universal
adata.obs['metacell_conf'] float — universal
adata.obsm['X_metacell'] latent — when backend has 'latent' capability
adata.obsm['metacell_soft'] sparse — when backend has 'soft' capability
adata.uns['metacell'] metadata — method, n_metacells, runtime, ...
Downstream tools (CellPhoneDB / LIANA / SCENIC / DESeq2) consume any
backend’s output via this schema — you never need an if method == ...
branch.
The seven backends, with their differentiating capability:
seacells— soft kernel archetypal analysis (Persad 2023, Nat Biotech). Default recommendation.metaq— VQ-VAE codebook with closed-form out-of-sample projection (Li 2025, Nat Comms). Use when new samples will arrive after the metacell map is built.supercell— kNN + walktrap with graining hierarchy cache (Bilous 2022, BMC Bioinf).kmeans— sklearn baseline (fast, codebook, out-of-sample).random— honest lower bound.geosketch— density-aware sketching (Hie 2019, Cell Systems).mc2— divide-and-conquer (Ben-Kiki 2022, Genome Biology).pip install metacells.
See zoo/index for the full per-backend tour.