SuperCell — kNN graph + walktrap community detection#
Bilous et al. Metacells untangle large and complex single-cell transcriptome networks. BMC Bioinformatics 23, 336 (2022). doi:10.1186/s12859-022-04861-1
Algorithm. Build a k_knn-nearest-neighbour graph on a low-dim
embedding (default X_pca), then run walktrap — short random walks merge
cells before they have time to escape a tightly-connected neighbourhood.
The resulting hierarchical dendrogram is cut at the level that yields
n_metacells communities.
Capabilities. hierarchical (re-cut the cached dendrogram at a
different graining without rebuilding the graph).
Strengths. Very fast (~O(N log N)). Hierarchical → you can sweep graining factors cheaply. Pure-Python re-implementation in omicverse via igraph (no R dependency).
Weaknesses. No latent space; no soft assignments. mcRigor benchmarks
have flagged SuperCell + MC2 for producing some heterogeneous “dubious”
metacells under default settings — check dubious_rate carefully.
1. Setup#
# Standard imports + omicverse defaults.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import omicverse as ov
import scvelo as scv # only used for the demo dataset
ov.plot_set()
🔬 Starting plot initialization...
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
• [CUDA 0] NVIDIA H100 80GB HBM3
Memory: 79.1 GB | Compute: 9.0
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
🔖 Version: 2.2.0 📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.
2. Load and preprocess#
# Pancreas scRNA-seq (Bastidas-Ponce et al. 2019). Standard omicverse
# preprocess flow: qc -> preprocess -> scale -> pca -> neighbors -> umap.
adata = scv.datasets.pancreas()
adata = ov.pp.qc(adata,
tresh={'mito_perc': 0.20, 'nUMIs': 500, 'detected_genes': 250},
mt_startswith='mt-')
adata = ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000)
adata.layers['lognorm'] = adata.X.copy() # mcRigor reads this
adata = adata[:, adata.var.highly_variable_features]
ov.pp.scale(adata)
ov.pp.pca(adata, layer='scaled', n_pcs=30)
adata.obsm['X_pca'] = adata.obsm['scaled|original|X_pca']
ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
ov.pp.umap(adata)
print('adata:', adata.shape, 'celltypes:', sorted(adata.obs['clusters'].unique()))
🖥️ Using CPU mode for QC...
📊 Step 1: Calculating QC Metrics
✓ Gene Family Detection:
┌──────────────────────────────┬────────────────────┬────────────────────┐
│ Gene Family │ Genes Found │ Detection Method │
├──────────────────────────────┼────────────────────┼────────────────────┤
│ Mitochondrial │ 13 │ Auto (MT-) │
├──────────────────────────────┼────────────────────┼────────────────────┤
│ Ribosomal │ 0 ⚠️ │ Auto (RPS/RPL) │
├──────────────────────────────┼────────────────────┼────────────────────┤
│ Hemoglobin │ 0 ⚠️ │ Auto (regex) │
└──────────────────────────────┴────────────────────┴────────────────────┘
✓ QC Metrics Summary:
┌─────────────────────────┬────────────────────┬─────────────────────────┐
│ Metric │ Mean │ Range (Min - Max) │
├─────────────────────────┼────────────────────┼─────────────────────────┤
│ nUMIs │ 6675 │ 3020 - 18524 │
├─────────────────────────┼────────────────────┼─────────────────────────┤
│ Detected Genes │ 2516 │ 1473 - 4492 │
├─────────────────────────┼────────────────────┼─────────────────────────┤
│ Mitochondrial % │ 0.7% │ 0.2% - 4.3% │
├─────────────────────────┼────────────────────┼─────────────────────────┤
│ Ribosomal % │ 0.0% │ 0.0% - 0.0% │
├─────────────────────────┼────────────────────┼─────────────────────────┤
│ Hemoglobin % │ 0.0% │ 0.0% - 0.0% │
└─────────────────────────┴────────────────────┴─────────────────────────┘
📈 Original cell count: 3,696
🔧 Step 2: Quality Filtering (SEURAT)
Thresholds: mito≤0.2, nUMIs≥500, genes≥250
📊 Seurat Filter Results:
• nUMIs filter (≥500): 0 cells failed (0.0%)
• Genes filter (≥250): 0 cells failed (0.0%)
• Mitochondrial filter (≤0.2): 0 cells failed (0.0%)
✓ Filters applied successfully
✓ Combined QC filters: 0 cells removed (0.0%)
🎯 Step 3: Final Filtering
Parameters: min_genes=200, min_cells=3
Ratios: max_genes_ratio=1, max_cells_ratio=1
✓ Final filtering: 0 cells, 12,261 genes removed
🔍 Step 4: Doublet Detection
💡 Running pyscdblfinder (Python port of R scDblFinder)
🔍 Running scdblfinder detection...
[ScDblFinder] wrote scDblFinder_score + scDblFinder_class — threshold=0.387
✓ scDblFinder completed: 66 doublets removed (1.8%)
╭─ SUMMARY: qc ──────────────────────────────────────────────────────╮
│ Duration: 16.9835s │
│ Shape: 3,696 x 27,998 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● OBS │ ✚ cell_complexity (float) │
│ │ ✚ detected_genes (int) │
│ │ ✚ hb_perc (float) │
│ │ ✚ mito_perc (float) │
│ │ ✚ nUMIs (float) │
│ │ ✚ n_counts (float) │
│ │ ✚ n_genes (int) │
│ │ ✚ n_genes_by_counts (int) │
│ │ ✚ passing_mt (bool) │
│ │ ✚ passing_nUMIs (bool) │
│ │ ✚ passing_ngenes (bool) │
│ │ ✚ pct_counts_hb (float) │
│ │ ✚ pct_counts_mt (float) │
│ │ ✚ pct_counts_ribo (float) │
│ │ ✚ ribo_perc (float) │
│ │ ✚ total_counts (float) │
│ │
│ ● VAR │ ✚ hb (bool) │
│ │ ✚ mt (bool) │
│ │ ✚ ribo (bool) │
│ │
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-19 17:26:20] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 15737/15737 genes are kept.
Among 15737 genes, 15736 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
🔍 Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
⚠️ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
✅ Count Normalization Completed Successfully!
✓ Processed: 3,630 cells × 15,736 genes
✓ Runtime: 0.25s
🔍 Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
✅ Experimental HVG Selection Completed Successfully!
✓ Selected: 2,000 highly variable genes out of 15,736 total (12.7%)
✓ Results added to AnnData object:
• 'highly_variable': Boolean vector (adata.var)
• 'highly_variable_rank': Float vector (adata.var)
• 'highly_variable_nbatches': Int vector (adata.var)
• 'highly_variable_intersection': Boolean vector (adata.var)
• 'means': Float vector (adata.var)
• 'variances': Float vector (adata.var)
• 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 1.41 seconds.
✅ Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
│ Duration: 1.7922s │
│ Shape: 3,630 x 15,737 -> 3,630 x 15,736 │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● VAR │ ✚ highly_variable (bool) │
│ │ ✚ highly_variable_features (bool) │
│ │ ✚ highly_variable_rank (float) │
│ │ ✚ means (float) │
│ │ ✚ n_cells (int) │
│ │ ✚ percent_cells (float) │
│ │ ✚ residual_variances (float) │
│ │ ✚ robust (bool) │
│ │ ✚ variances (float) │
│ │
│ ● UNS │ ✚ history_log │
│ │ ✚ hvg │
│ │ ✚ log1p │
│ │
│ ● LAYERS │ ✚ counts (sparse matrix, 3630x15736) │
│ │
╰────────────────────────────────────────────────────────────────────╯
╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│ Duration: 0.6501s │
│ Shape: 3,630 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● LAYERS │ ✚ scaled (array, 3630x2000) │
│ │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
with n_comps=30
🖥️ Using sklearn PCA for CPU computation
🖥️ sklearn PCA backend: CPU computation
📊 PCA input data type: ArrayView, shape: (3630, 2000), dtype: float64
🔧 PCA solver used: covariance_eigh
finished✅ (1.97s)
╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│ Duration: 1.9803s │
│ Shape: 3,630 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ scaled|original|cum_sum_eigenvalues │
│ │ ✚ scaled|original|pca_var_ratios │
│ │
│ ● OBSM │ ✚ scaled|original|X_pca (array, 3630x30) │
│ │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU to calculate neighbors...
🔍 K-Nearest Neighbors Graph Construction:
Mode: cpu
Neighbors: 15
Method: umap
Metric: euclidean
Representation: X_pca
🔍 Computing neighbor distances...
🔍 Computing connectivity matrix...
💡 Using UMAP-style connectivity
✓ Graph is fully connected
✅ KNN Graph Construction Completed Successfully!
✓ Processed: 3,630 cells with 15 neighbors each
✓ Results added to AnnData object:
• 'neighbors': Neighbors metadata (adata.uns)
• 'distances': Distance matrix (adata.obsp)
• 'connectivities': Connectivity matrix (adata.obsp)
╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
│ Duration: 8.5516s │
│ Shape: 3,630 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-19 17:26:33] Running UMAP in 'cpu' mode...
🖥️ Using Scanpy CPU UMAP...
🔍 UMAP Dimensionality Reduction:
Mode: cpu
Method: umap
Components: 2
Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'}
🔍 Computing UMAP parameters...
🔍 Computing UMAP embedding (classic method)...
✅ UMAP Dimensionality Reduction Completed Successfully!
✓ Embedding shape: 3,630 cells × 2 dimensions
✓ Results added to AnnData object:
• 'X_umap': UMAP coordinates (adata.obsm)
• 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.
╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
│ Duration: 0.7343s │
│ Shape: 3,630 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ umap │
│ │ └─ params: {'a': 0.5830300199950147, 'b': 1.334166993228519}│
│ │
╰────────────────────────────────────────────────────────────────────╯
adata: (3630, 2000) celltypes: ['Alpha', 'Beta', 'Delta', 'Ductal', 'Epsilon', 'Ngn3 high EP', 'Ngn3 low EP', 'Pre-endocrine']
3. Fit SuperCell#
k_knn=5 is the R upstream default; walktrap_steps=4 controls how far
random walks roam before being cut.
mc = ov.single.MetaCell(
adata.copy(), method='supercell', n_metacells=100,
use_rep='X_pca', k_knn=5, walktrap_steps=4, random_state=0,
).fit()
print(f'fit done: {mc.method}, runtime={mc._fit_result.runtime_s:.2f} s')
fit done: supercell, runtime=0.52 s
4. AnnData schema after fit#
Every backend writes the same fields into adata — that’s what lets the
downstream helpers below work without branching on the backend.
# Inspect what the fit wrote into adata via the unified schema.
print(f'method : {mc.method}')
print(f'capabilities: {sorted(mc.capabilities)}')
print(f'n_metacells : {np.unique(mc._fit_result.assignments).size}')
print(f'runtime : {mc._fit_result.runtime_s:.3f} s')
print(f'uns : {dict(mc.adata.uns["metacell"])}')
method : supercell
capabilities: ['hierarchical']
n_metacells : 100
runtime : 0.521 s
uns : {'method': 'supercell', 'n_metacells': 100, 'n_iter': 1, 'converged': True, 'runtime_s': 0.5209763050079346, 'random_state': 0, 'capabilities': ['hierarchical']}
5. Aggregate to a metacell AnnData#
predicted(method='hard', layer='counts', summary='sum') returns a
metacell × gene AnnData with raw count totals preserved — the format that
downstream tools (SCENIC, CellPhoneDB, pseudobulk DE) actually want.
ad_mc = mc.predicted(method='hard', layer='counts', summary='sum',
celltype_label='clusters')
print(f'metacell AnnData: {ad_mc.shape}')
print(f'mean cells/metacell: {ad_mc.obs["n_cells"].mean():.1f}')
ad_mc.obs.head()
metacell AnnData: (100, 2000)
mean cells/metacell: 36.3
| n_cells | clusters | clusters_purity | |
|---|---|---|---|
| mc-0 | 9 | Beta | 0.666667 |
| mc-1 | 201 | Ductal | 0.791045 |
| mc-2 | 28 | Alpha | 0.857143 |
| mc-3 | 71 | Ductal | 0.985915 |
| mc-4 | 85 | Ngn3 high EP | 0.882353 |
6. Benchmarking metrics (purity / separation / compactness)#
7. mcRigor: statistical validation#
For each metacell, mcRigor permutes the (cells × genes) submatrix in two
ways and asks: is the observed gene–gene covariance bigger than the null
distribution at this metacell’s size? Metacells whose mcDiv exceeds the
size-stratified threshold are flagged as 'dubious'.
# mcRigor's double-permutation null. dubious_rate = fraction of cells in
# heterogeneous metacells; rigor_score = 1 - 0.5*(dubious_rate + zero_rate).
rep = mc.check_rigor(layer_lognorm='lognorm', n_rep=20,
feature_use=1000, random_state=0)
print(f'rigor_score : {rep.score:.3f}')
print(f'dubious_rate: {rep.dubious_rate:.3f}')
print(f'zero_rate : {rep.zero_rate:.3f}')
print(f'# metacells : {rep.n_metacells}')
rigor_score : 0.499
dubious_rate: 0.755
zero_rate : 0.248
# metacells : 100
7.1 Per-metacell mcDiv vs size#
8. UMAP with metacell centroids#
# UMAP coloured by celltype with metacell centroids overlaid in dark grey.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5, 4))
ov.pl.embedding(mc.adata, basis='X_umap', color='clusters', ax=ax, show=False,
frameon='small', title='SuperCell (n=100)', size=12)
labels = mc._fit_result.assignments
pts = np.array([mc.adata.obsm['X_umap'][labels == u].mean(axis=0)
for u in np.unique(labels)])
ax.scatter(pts[:, 0], pts[:, 1], s=24, c='#222',
edgecolors='white', linewidths=0.6, zorder=5)
plt.tight_layout(); plt.show()
9. Per-celltype purity boxplot#
10. Metacell-level UMAP#
A common downstream use of metacells is to treat them as a much smaller “atlas” of pseudo-cells and re-run the standard omicverse preprocess → PCA → UMAP loop on them. Cell-type signal should survive.
# Treat the metacell AnnData as a smaller dataset and run the standard
# omicverse preprocess -> pca -> neighbors -> umap loop on it.
ad_mc = ov.pp.preprocess(ad_mc, mode='shiftlog|pearson',
n_HVGs=min(2000, ad_mc.n_vars))
ad_mc = ad_mc[:, ad_mc.var.highly_variable_features]
ov.pp.scale(ad_mc)
ov.pp.pca(ad_mc, layer='scaled', n_pcs=min(30, ad_mc.n_obs - 1))
ad_mc.obsm['X_pca'] = ad_mc.obsm['scaled|original|X_pca']
ov.pp.neighbors(ad_mc, n_neighbors=min(15, ad_mc.n_obs - 1), use_rep='X_pca')
ov.pp.umap(ad_mc)
ov.pl.embedding(ad_mc, basis='X_umap', color='clusters',
frameon='small', title='metacell-level UMAP', size=80)
🔍 [2026-05-19 17:26:56] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
🔍 Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
⚠️ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
✅ Count Normalization Completed Successfully!
✓ Processed: 100 cells × 2,000 genes
✓ Runtime: 0.00s
🔍 Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
✅ Experimental HVG Selection Completed Successfully!
✓ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
✓ Results added to AnnData object:
• 'highly_variable': Boolean vector (adata.var)
• 'highly_variable_rank': Float vector (adata.var)
• 'highly_variable_nbatches': Int vector (adata.var)
• 'highly_variable_intersection': Boolean vector (adata.var)
• 'means': Float vector (adata.var)
• 'variances': Float vector (adata.var)
• 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.03 seconds.
✅ Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
│ Duration: 0.0438s │
│ Shape: 100 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ REFERENCE_MANU │
│ │ ✚ _ov_provenance │
│ │ ✚ history_log │
│ │ ✚ hvg │
│ │ ✚ log1p │
│ │ ✚ status │
│ │ ✚ status_args │
│ │
│ ● LAYERS │ ✚ counts (sparse matrix, 100x2000) │
│ │
╰────────────────────────────────────────────────────────────────────╯
╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│ Duration: 0.0153s │
│ Shape: 100 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● LAYERS │ ✚ scaled (array, 100x2000) │
│ │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
with n_comps=30
🖥️ Using sklearn PCA for CPU computation
🖥️ sklearn PCA backend: CPU computation
📊 PCA input data type: ArrayView, shape: (100, 2000), dtype: float64
🔧 PCA solver used: covariance_eigh
finished✅ (1.30s)
╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│ Duration: 1.303s │
│ Shape: 100 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ pca │
│ │ └─ params: {'zero_center': True, 'use_highly_variable': Tr...│
│ │ ✚ scaled|original|cum_sum_eigenvalues │
│ │ ✚ scaled|original|pca_var_ratios │
│ │
│ ● OBSM │ ✚ X_pca (array, 100x30) │
│ │ ✚ scaled|original|X_pca (array, 100x30) │
│ │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU to calculate neighbors...
🔍 K-Nearest Neighbors Graph Construction:
Mode: cpu
Neighbors: 15
Method: umap
Metric: euclidean
Representation: X_pca
🔍 Computing neighbor distances...
🔍 Computing connectivity matrix...
💡 Using UMAP-style connectivity
✓ Graph is fully connected
✅ KNN Graph Construction Completed Successfully!
✓ Processed: 100 cells with 15 neighbors each
✓ Results added to AnnData object:
• 'neighbors': Neighbors metadata (adata.uns)
• 'distances': Distance matrix (adata.obsp)
• 'connectivities': Connectivity matrix (adata.obsp)
╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
│ Duration: 0.1154s │
│ Shape: 100 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ neighbors │
│ │ └─ params: {'n_neighbors': 15, 'method': 'umap', 'random_s...│
│ │
│ ● OBSP │ ✚ connectivities (sparse matrix, 100x100) │
│ │ ✚ distances (sparse matrix, 100x100) │
│ │
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-19 17:26:58] Running UMAP in 'cpu' mode...
🖥️ Using Scanpy CPU UMAP...
🔍 UMAP Dimensionality Reduction:
Mode: cpu
Method: umap
Components: 2
Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'}
🔍 Computing UMAP parameters...
🔍 Computing UMAP embedding (classic method)...
✅ UMAP Dimensionality Reduction Completed Successfully!
✓ Embedding shape: 100 cells × 2 dimensions
✓ Results added to AnnData object:
• 'X_umap': UMAP coordinates (adata.obsm)
• 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.
╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
│ Duration: 0.0118s │
│ Shape: 100 x 2,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ umap │
│ │ └─ params: {'a': 0.5830300199950147, 'b': 1.334166993228519}│
│ │
│ ● OBSM │ ✚ X_umap (array, 100x2) │
│ │
╰────────────────────────────────────────────────────────────────────╯
11. Top markers per celltype on the metacell AnnData#
# Find top markers per celltype on the metacell AnnData (omicverse helper —
# drops the categories with <2 metacells automatically and reports cell-type
# fractions ``pts`` along with the gene names).
counts = ad_mc.obs['clusters'].value_counts()
keep = counts[counts >= 2].index.tolist()
ad_mc_for_de = ad_mc[ad_mc.obs['clusters'].isin(keep)].copy()
ad_mc_for_de.obs['clusters'] = ad_mc_for_de.obs['clusters'].astype('category')
ov.single.find_markers(ad_mc_for_de, groupby='clusters', method='wilcoxon',
key_added='rank_genes_groups', pts=True, use_gpu=False)
ov.single.get_markers(ad_mc_for_de, n_genes=3, key='rank_genes_groups')
🔍 Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
✅ Done | 8 groups × 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
| group | rank | names | scores | logfoldchanges | pvals | pvals_adj | pct_group | pct_rest | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alpha | 1 | Smarca1 | 6.067086 | 3.528083 | 1.302521e-09 | 2.127080e-06 | 1.0 | 0.941176 |
| 1 | Alpha | 2 | Ttr | 5.912633 | 2.369870 | 3.366814e-09 | 2.127080e-06 | 1.0 | 1.000000 |
| 2 | Alpha | 3 | Tmem27 | 5.864367 | 6.051901 | 4.508499e-09 | 2.127080e-06 | 1.0 | 0.788235 |
| 3 | Beta | 1 | Sytl4 | 5.392514 | 7.145541 | 6.947866e-08 | 3.869363e-05 | 1.0 | 0.348315 |
| 4 | Beta | 2 | Gng12 | 5.370481 | 4.113616 | 7.852677e-08 | 3.869363e-05 | 1.0 | 0.932584 |
| 5 | Beta | 3 | Gm27033 | 5.293367 | 4.532875 | 1.200848e-07 | 3.869363e-05 | 1.0 | 0.707865 |
| 6 | Delta | 1 | Rbp4 | 3.377268 | 5.227028 | 7.320963e-04 | 9.230787e-02 | 1.0 | 0.989583 |
| 7 | Delta | 2 | Sst | 3.377268 | 12.785007 | 7.320963e-04 | 9.230787e-02 | 1.0 | 0.583333 |
| 8 | Delta | 3 | Spock3 | 3.377268 | 5.802761 | 7.320963e-04 | 9.230787e-02 | 1.0 | 0.458333 |
| 9 | Ductal | 1 | Lurap1l | 6.893820 | 3.140599 | 5.431379e-12 | 8.347573e-10 | 1.0 | 0.950000 |
| 10 | Ductal | 2 | Cldn10 | 6.893820 | 4.276017 | 5.431379e-12 | 8.347573e-10 | 1.0 | 0.812500 |
| 11 | Ductal | 3 | Cyr61 | 6.893820 | 7.576176 | 5.431379e-12 | 8.347573e-10 | 1.0 | 0.600000 |
| 12 | Epsilon | 1 | Ghrl | 4.675616 | 11.245130 | 2.930724e-06 | 1.437596e-03 | 1.0 | 0.913043 |
| 13 | Epsilon | 2 | Card19 | 4.650205 | 2.920076 | 3.316050e-06 | 1.437596e-03 | 1.0 | 0.934783 |
| 14 | Epsilon | 3 | Gm11837 | 4.650205 | 6.151111 | 3.316050e-06 | 1.437596e-03 | 1.0 | 0.586957 |
| 15 | Ngn3 high EP | 1 | Cbfa2t3 | 6.418805 | 3.734952 | 1.373487e-10 | 6.710507e-08 | 1.0 | 0.867470 |
| 16 | Ngn3 high EP | 2 | Gadd45a | 6.382100 | 5.235402 | 1.746765e-10 | 6.710507e-08 | 1.0 | 0.891566 |
| 17 | Ngn3 high EP | 3 | Cdk4 | 6.345394 | 1.235358 | 2.218565e-10 | 6.710507e-08 | 1.0 | 1.000000 |
| 18 | Ngn3 low EP | 1 | Grin3a | 3.324498 | 7.126914 | 8.857774e-04 | 1.327895e-01 | 1.0 | 0.302083 |
| 19 | Ngn3 low EP | 2 | Cat | 3.324498 | 3.847967 | 8.857774e-04 | 1.327895e-01 | 1.0 | 0.875000 |
| 20 | Ngn3 low EP | 3 | Atoh8 | 3.324498 | 6.508173 | 8.857774e-04 | 1.327895e-01 | 1.0 | 0.302083 |
| 21 | Pre-endocrine | 1 | Bax | 6.486627 | 1.032453 | 8.777949e-11 | 1.755590e-07 | 1.0 | 1.000000 |
| 22 | Pre-endocrine | 2 | Cystm1 | 6.291986 | 1.170726 | 3.134307e-10 | 2.089538e-07 | 1.0 | 1.000000 |
| 23 | Pre-endocrine | 3 | Cxxc5 | 6.105807 | 1.100205 | 1.022825e-09 | 3.409416e-07 | 1.0 | 1.000000 |
12. SuperCell exclusive: re-cut at different graining#
The walktrap dendrogram is cached after .fit(). Cutting at a different
n_metacells is essentially free — no graph rebuild needed. Perfect for
graining-factor sweeps.
# Sweep four graining factors. γ = n_cells / n_metacells.
gammas = [10, 25, 50, 100]
sweeps = mc.fit_multi_gamma(gammas)
rows = []
for g, fit in sweeps.items():
n_mc = fit.backend_meta['n_metacells']
sib = ov.single.MetaCell(adata.copy(), method='supercell',
n_metacells=n_mc, use_rep='X_pca')
sib._fit_result = fit
sib._write_schema()
p = sib.compute_purity('clusters').purity.mean()
rows.append({'gamma': g, 'n_metacells': n_mc, 'mean_purity': round(p, 3)})
sweep_df = pd.DataFrame(rows)
sweep_df
| gamma | n_metacells | mean_purity | |
|---|---|---|---|
| 0 | 10.0 | 363 | 0.890 |
| 1 | 25.0 | 145 | 0.879 |
| 2 | 50.0 | 73 | 0.869 |
| 3 | 100.0 | 36 | 0.847 |
# Line plot: mean purity vs graining γ.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4.5, 3))
ax.plot(sweep_df['gamma'], sweep_df['mean_purity'], 'o-', linewidth=1.5)
ax.set_xlabel('graining γ (n_cells / n_metacells)')
ax.set_ylabel('mean purity'); ax.set_ylim(0.4, 1.05)
ax.set_title('SuperCell purity vs graining')
plt.tight_layout(); plt.show()
13. Save / load roundtrip#
# Save/load roundtrip — every backend supports this.
import tempfile, os
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as f:
path = f.name
mc.save(path)
mc2 = ov.single.MetaCell(adata.copy(), method='supercell', n_metacells=100,
use_rep='X_pca', random_state=0)
mc2.load(path)
print(f'saved+loaded {path}')
os.remove(path)
saved+loaded /tmp/tmpp0v27s5m.pkl
14. Takeaways#
SuperCell is the fastest principled metacell method here. For tens of thousands of cells it’s ~10–100× faster than SEACells.
The hierarchical cache (
fit_multi_gamma) is perfect for “what graining should I use?” experiments.Watch
dubious_ratefrom mcRigor — SuperCell can produce heterogeneous metacells at low graining (high γ) on some datasets. Ifdubious_rate > 0.10consider switching to SEACells or tighteningk_knn.