SuperCell — kNN graph + walktrap community detection#

Bilous et al. Metacells untangle large and complex single-cell transcriptome networks. BMC Bioinformatics 23, 336 (2022). doi:10.1186/s12859-022-04861-1

Algorithm. Build a k_knn-nearest-neighbour graph on a low-dim embedding (default X_pca), then run walktrap — short random walks merge cells before they have time to escape a tightly-connected neighbourhood. The resulting hierarchical dendrogram is cut at the level that yields n_metacells communities.

Capabilities. hierarchical (re-cut the cached dendrogram at a different graining without rebuilding the graph).

Strengths. Very fast (~O(N log N)). Hierarchical → you can sweep graining factors cheaply. Pure-Python re-implementation in omicverse via igraph (no R dependency).

Weaknesses. No latent space; no soft assignments. mcRigor benchmarks have flagged SuperCell + MC2 for producing some heterogeneous “dubious” metacells under default settings — check dubious_rate carefully.

1. Setup#

# Standard imports + omicverse defaults.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import omicverse as ov
import scvelo as scv      # only used for the demo dataset

ov.plot_set()
🔬 Starting plot initialization...
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] NVIDIA H100 80GB HBM3
      Memory: 79.1 GB | Compute: 9.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.2.0   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

2. Load and preprocess#

# Pancreas scRNA-seq (Bastidas-Ponce et al. 2019). Standard omicverse
# preprocess flow: qc -> preprocess -> scale -> pca -> neighbors -> umap.
adata = scv.datasets.pancreas()
adata = ov.pp.qc(adata,
                 tresh={'mito_perc': 0.20, 'nUMIs': 500, 'detected_genes': 250},
                 mt_startswith='mt-')
adata = ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000)
adata.layers['lognorm'] = adata.X.copy()        # mcRigor reads this
adata = adata[:, adata.var.highly_variable_features]
ov.pp.scale(adata)
ov.pp.pca(adata, layer='scaled', n_pcs=30)
adata.obsm['X_pca'] = adata.obsm['scaled|original|X_pca']
ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
ov.pp.umap(adata)
print('adata:', adata.shape, 'celltypes:', sorted(adata.obs['clusters'].unique()))
🖥️ Using CPU mode for QC...

📊 Step 1: Calculating QC Metrics

   ✓ Gene Family Detection:
   ┌──────────────────────────────┬────────────────────┬────────────────────┐
   │ Gene Family                 Genes Found       Detection Method   │
   ├──────────────────────────────┼────────────────────┼────────────────────┤
   │ Mitochondrial                │ 13                 │ Auto (MT-)         │
   ├──────────────────────────────┼────────────────────┼────────────────────┤
   │ Ribosomal                    │ 0 ⚠️               │ Auto (RPS/RPL)     │
   ├──────────────────────────────┼────────────────────┼────────────────────┤
   │ Hemoglobin                   │ 0 ⚠️               │ Auto (regex)       │
   └──────────────────────────────┴────────────────────┴────────────────────┘

   ✓ QC Metrics Summary:
   ┌─────────────────────────┬────────────────────┬─────────────────────────┐
   │ Metric                 Mean              Range (Min - Max)       │
   ├─────────────────────────┼────────────────────┼─────────────────────────┤
   │ nUMIs                   │ 6675               │ 3020 - 18524            │
   ├─────────────────────────┼────────────────────┼─────────────────────────┤
   │ Detected Genes          │ 2516               │ 1473 - 4492             │
   ├─────────────────────────┼────────────────────┼─────────────────────────┤
   │ Mitochondrial %         │ 0.7%               │ 0.2% - 4.3%             │
   ├─────────────────────────┼────────────────────┼─────────────────────────┤
   │ Ribosomal %             │ 0.0%               │ 0.0% - 0.0%             │
   ├─────────────────────────┼────────────────────┼─────────────────────────┤
   │ Hemoglobin %            │ 0.0%               │ 0.0% - 0.0%             │
   └─────────────────────────┴────────────────────┴─────────────────────────┘

   📈 Original cell count: 3,696

🔧 Step 2: Quality Filtering (SEURAT)
   Thresholds: mito≤0.2, nUMIs≥500, genes≥250
   📊 Seurat Filter Results:
     • nUMIs filter (≥500): 0 cells failed (0.0%)
     • Genes filter (≥250): 0 cells failed (0.0%)
     • Mitochondrial filter (≤0.2): 0 cells failed (0.0%)
   ✓ Filters applied successfully
   ✓ Combined QC filters: 0 cells removed (0.0%)

🎯 Step 3: Final Filtering
   Parameters: min_genes=200, min_cells=3
   Ratios: max_genes_ratio=1, max_cells_ratio=1
✓ Final filtering: 0 cells, 12,261 genes removed

🔍 Step 4: Doublet Detection
   💡 Running pyscdblfinder (Python port of R scDblFinder)
   🔍 Running scdblfinder detection...
[ScDblFinder] wrote scDblFinder_score + scDblFinder_class — threshold=0.387
   ✓ scDblFinder completed: 66 doublets removed (1.8%)

╭─ SUMMARY: qc ──────────────────────────────────────────────────────╮
  Duration: 16.9835s                                                
  Shape:    3,696 x 27,998 (Unchanged)                              
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   OBS    cell_complexity (float)                              
 detected_genes (int)                                 
 hb_perc (float)                                      
 mito_perc (float)                                    
 nUMIs (float)                                        
 n_counts (float)                                     
 n_genes (int)                                        
 n_genes_by_counts (int)                              
 passing_mt (bool)                                    
 passing_nUMIs (bool)                                 
 passing_ngenes (bool)                                
 pct_counts_hb (float)                                
 pct_counts_mt (float)                                
 pct_counts_ribo (float)                              
 ribo_perc (float)                                    
 total_counts (float)                                 
                                                                    
   VAR    hb (bool)                                            
 mt (bool)                                            
 ribo (bool)                                          
                                                                    
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-19 17:26:20] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 15737/15737 genes are kept.
    Among 15737 genes, 15736 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   ⚠️ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']
✅ Count Normalization Completed Successfully!
   ✓ Processed: 3,630 cells × 15,736 genes
   ✓ Runtime: 0.25s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100
✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 2,000 highly variable genes out of 15,736 total (12.7%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 1.41 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
  Duration: 1.7922s                                                 
  Shape:    3,630 x 15,737 -> 3,630 x 15,736                        
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   VAR    highly_variable (bool)                               
 highly_variable_features (bool)                      
 highly_variable_rank (float)                         
 means (float)                                        
 n_cells (int)                                        
 percent_cells (float)                                
 residual_variances (float)                           
 robust (bool)                                        
 variances (float)                                    
                                                                    
   UNS    history_log                                          
 hvg                                                  
 log1p                                                
                                                                    
   LAYERS counts (sparse matrix, 3630x15736)                   
                                                                    
╰────────────────────────────────────────────────────────────────────╯
╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
  Duration: 0.6501s                                                 
  Shape:    3,630 x 2,000 (Unchanged)                               
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   LAYERS scaled (array, 3630x2000)                            
                                                                    
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ArrayView, shape: (3630, 2000), dtype: float64
🔧 PCA solver used: covariance_eigh
    finished✅ (1.97s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
  Duration: 1.9803s                                                 
  Shape:    3,630 x 2,000 (Unchanged)                               
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   UNS    scaled|original|cum_sum_eigenvalues                  
 scaled|original|pca_var_ratios                       
                                                                    
   OBSM   scaled|original|X_pca (array, 3630x30)               
                                                                    
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU to calculate neighbors...

🔍 K-Nearest Neighbors Graph Construction:
   Mode: cpu
   Neighbors: 15
   Method: umap
   Metric: euclidean
   Representation: X_pca
   🔍 Computing neighbor distances...
🔍 Computing connectivity matrix...
   💡 Using UMAP-style connectivity
✓ Graph is fully connected

✅ KNN Graph Construction Completed Successfully!
   ✓ Processed: 3,630 cells with 15 neighbors each
   ✓ Results added to AnnData object:
     • 'neighbors': Neighbors metadata (adata.uns)
     • 'distances': Distance matrix (adata.obsp)
     • 'connectivities': Connectivity matrix (adata.obsp)

╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
  Duration: 8.5516s                                                 
  Shape:    3,630 x 2,000 (Unchanged)                               
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-19 17:26:33] Running UMAP in 'cpu' mode...
🖥️ Using Scanpy CPU UMAP...

🔍 UMAP Dimensionality Reduction:
   Mode: cpu
   Method: umap
   Components: 2
   Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'}
   🔍 Computing UMAP parameters...
   🔍 Computing UMAP embedding (classic method)...
✅ UMAP Dimensionality Reduction Completed Successfully!
   ✓ Embedding shape: 3,630 cells × 2 dimensions
   ✓ Results added to AnnData object:
     • 'X_umap': UMAP coordinates (adata.obsm)
     • 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.

╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
  Duration: 0.7343s                                                 
  Shape:    3,630 x 2,000 (Unchanged)                               
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   UNS    umap                                                 
└─ params: {'a': 0.5830300199950147, 'b': 1.334166993228519}
                                                                    
╰────────────────────────────────────────────────────────────────────╯
adata: (3630, 2000) celltypes: ['Alpha', 'Beta', 'Delta', 'Ductal', 'Epsilon', 'Ngn3 high EP', 'Ngn3 low EP', 'Pre-endocrine']

3. Fit SuperCell#

k_knn=5 is the R upstream default; walktrap_steps=4 controls how far random walks roam before being cut.

mc = ov.single.MetaCell(
    adata.copy(), method='supercell', n_metacells=100,
    use_rep='X_pca', k_knn=5, walktrap_steps=4, random_state=0,
).fit()
print(f'fit done: {mc.method}, runtime={mc._fit_result.runtime_s:.2f} s')
fit done: supercell, runtime=0.52 s

4. AnnData schema after fit#

Every backend writes the same fields into adata — that’s what lets the downstream helpers below work without branching on the backend.

# Inspect what the fit wrote into adata via the unified schema.
print(f'method      : {mc.method}')
print(f'capabilities: {sorted(mc.capabilities)}')
print(f'n_metacells : {np.unique(mc._fit_result.assignments).size}')
print(f'runtime     : {mc._fit_result.runtime_s:.3f} s')
print(f'uns         : {dict(mc.adata.uns["metacell"])}')
method      : supercell
capabilities: ['hierarchical']
n_metacells : 100
runtime     : 0.521 s
uns         : {'method': 'supercell', 'n_metacells': 100, 'n_iter': 1, 'converged': True, 'runtime_s': 0.5209763050079346, 'random_state': 0, 'capabilities': ['hierarchical']}

5. Aggregate to a metacell AnnData#

predicted(method='hard', layer='counts', summary='sum') returns a metacell × gene AnnData with raw count totals preserved — the format that downstream tools (SCENIC, CellPhoneDB, pseudobulk DE) actually want.

ad_mc = mc.predicted(method='hard', layer='counts', summary='sum',
                     celltype_label='clusters')
print(f'metacell AnnData: {ad_mc.shape}')
print(f'mean cells/metacell: {ad_mc.obs["n_cells"].mean():.1f}')
ad_mc.obs.head()
metacell AnnData: (100, 2000)
mean cells/metacell: 36.3
n_cells clusters clusters_purity
mc-0 9 Beta 0.666667
mc-1 201 Ductal 0.791045
mc-2 28 Alpha 0.857143
mc-3 71 Ductal 0.985915
mc-4 85 Ngn3 high EP 0.882353

6. Benchmarking metrics (purity / separation / compactness)#

# Compute purity / separation / compactness AND show the 3-panel histogram
# in one call (ov.pl.metacell_metrics returns the per-metacell tables too).
purity, separation, compactness = ov.pl.metacell_metrics(
    mc, label_key='clusters', use_rep='X_pca',
)

7. mcRigor: statistical validation#

For each metacell, mcRigor permutes the (cells × genes) submatrix in two ways and asks: is the observed gene–gene covariance bigger than the null distribution at this metacell’s size? Metacells whose mcDiv exceeds the size-stratified threshold are flagged as 'dubious'.

# mcRigor's double-permutation null. dubious_rate = fraction of cells in
# heterogeneous metacells; rigor_score = 1 - 0.5*(dubious_rate + zero_rate).
rep = mc.check_rigor(layer_lognorm='lognorm', n_rep=20,
                     feature_use=1000, random_state=0)
print(f'rigor_score : {rep.score:.3f}')
print(f'dubious_rate: {rep.dubious_rate:.3f}')
print(f'zero_rate   : {rep.zero_rate:.3f}')
print(f'# metacells : {rep.n_metacells}')
rigor_score : 0.499
dubious_rate: 0.755
zero_rate   : 0.248
# metacells : 100

7.1 Per-metacell mcDiv vs size#

# mcDiv vs metacell size, overlaid with size-stratified threshold.
ov.pl.rigor_scatter(rep)
../../../_images/d10a5301d449f42a9d02ffebca2ec7eea3d96cf8e8066a7e718bea6a4c0ecba1.png
<Axes: xlabel='metacell size', ylabel='mcDiv  (T_org / T_colperm)'>

8. UMAP with metacell centroids#

# UMAP coloured by celltype with metacell centroids overlaid in dark grey.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5, 4))
ov.pl.embedding(mc.adata, basis='X_umap', color='clusters', ax=ax, show=False,
                frameon='small', title='SuperCell (n=100)', size=12)
labels = mc._fit_result.assignments
pts = np.array([mc.adata.obsm['X_umap'][labels == u].mean(axis=0)
                for u in np.unique(labels)])
ax.scatter(pts[:, 0], pts[:, 1], s=24, c='#222',
           edgecolors='white', linewidths=0.6, zorder=5)
plt.tight_layout(); plt.show()

9. Per-celltype purity boxplot#

# Per-celltype boxplot of metacell purity.
ov.pl.metacell_purity_box(mc, label_key='clusters')
../../../_images/76aa1550967170a7de53f3a7b4ac3035c381921089e95225ec8042f32ef2daf5.png
<Axes: xlabel='majority celltype', ylabel='purity'>

10. Metacell-level UMAP#

A common downstream use of metacells is to treat them as a much smaller “atlas” of pseudo-cells and re-run the standard omicverse preprocess → PCA → UMAP loop on them. Cell-type signal should survive.

# Treat the metacell AnnData as a smaller dataset and run the standard
# omicverse preprocess -> pca -> neighbors -> umap loop on it.
ad_mc = ov.pp.preprocess(ad_mc, mode='shiftlog|pearson',
                          n_HVGs=min(2000, ad_mc.n_vars))
ad_mc = ad_mc[:, ad_mc.var.highly_variable_features]
ov.pp.scale(ad_mc)
ov.pp.pca(ad_mc, layer='scaled', n_pcs=min(30, ad_mc.n_obs - 1))
ad_mc.obsm['X_pca'] = ad_mc.obsm['scaled|original|X_pca']
ov.pp.neighbors(ad_mc, n_neighbors=min(15, ad_mc.n_obs - 1), use_rep='X_pca')
ov.pp.umap(ad_mc)
ov.pl.embedding(ad_mc, basis='X_umap', color='clusters',
                frameon='small', title='metacell-level UMAP', size=80)
🔍 [2026-05-19 17:26:56] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   ⚠️ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']

✅ Count Normalization Completed Successfully!
   ✓ Processed: 100 cells × 2,000 genes
   ✓ Runtime: 0.00s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.03 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
  Duration: 0.0438s                                                 
  Shape:    100 x 2,000 (Unchanged)                                 
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   UNS    REFERENCE_MANU                                       
 _ov_provenance                                       
 history_log                                          
 hvg                                                  
 log1p                                                
 status                                               
 status_args                                          
                                                                    
   LAYERS counts (sparse matrix, 100x2000)                     
                                                                    
╰────────────────────────────────────────────────────────────────────╯

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
  Duration: 0.0153s                                                 
  Shape:    100 x 2,000 (Unchanged)                                 
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   LAYERS scaled (array, 100x2000)                             
                                                                    
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ArrayView, shape: (100, 2000), dtype: float64
🔧 PCA solver used: covariance_eigh
    finished✅ (1.30s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
  Duration: 1.303s                                                  
  Shape:    100 x 2,000 (Unchanged)                                 
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   UNS    pca                                                  
└─ params: {'zero_center': True, 'use_highly_variable': Tr...
 scaled|original|cum_sum_eigenvalues                  
 scaled|original|pca_var_ratios                       
                                                                    
   OBSM   X_pca (array, 100x30)                                
 scaled|original|X_pca (array, 100x30)                
                                                                    
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU to calculate neighbors...

🔍 K-Nearest Neighbors Graph Construction:
   Mode: cpu
   Neighbors: 15
   Method: umap
   Metric: euclidean
   Representation: X_pca
   🔍 Computing neighbor distances...
   🔍 Computing connectivity matrix...
   💡 Using UMAP-style connectivity
   ✓ Graph is fully connected

✅ KNN Graph Construction Completed Successfully!
   ✓ Processed: 100 cells with 15 neighbors each
   ✓ Results added to AnnData object:
     • 'neighbors': Neighbors metadata (adata.uns)
     • 'distances': Distance matrix (adata.obsp)
     • 'connectivities': Connectivity matrix (adata.obsp)

╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
  Duration: 0.1154s                                                 
  Shape:    100 x 2,000 (Unchanged)                                 
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   UNS    neighbors                                            
└─ params: {'n_neighbors': 15, 'method': 'umap', 'random_s...
                                                                    
   OBSP   connectivities (sparse matrix, 100x100)              
 distances (sparse matrix, 100x100)                   
                                                                    
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-19 17:26:58] Running UMAP in 'cpu' mode...
🖥️ Using Scanpy CPU UMAP...

🔍 UMAP Dimensionality Reduction:
   Mode: cpu
   Method: umap
   Components: 2
   Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'}
   🔍 Computing UMAP parameters...
   🔍 Computing UMAP embedding (classic method)...

✅ UMAP Dimensionality Reduction Completed Successfully!
   ✓ Embedding shape: 100 cells × 2 dimensions
   ✓ Results added to AnnData object:
     • 'X_umap': UMAP coordinates (adata.obsm)
     • 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.

╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
  Duration: 0.0118s                                                 
  Shape:    100 x 2,000 (Unchanged)                                 
                                                                    
  CHANGES DETECTED                                                  
  ────────────────                                                  
   UNS    umap                                                 
└─ params: {'a': 0.5830300199950147, 'b': 1.334166993228519}
                                                                    
   OBSM   X_umap (array, 100x2)                                
                                                                    
╰────────────────────────────────────────────────────────────────────╯
../../../_images/ff1c32ddc6b685b96e54d93de7016c79ec95bfd0b671fa567df824b66fb7318b.png

11. Top markers per celltype on the metacell AnnData#

# Find top markers per celltype on the metacell AnnData (omicverse helper —
# drops the categories with <2 metacells automatically and reports cell-type
# fractions ``pts`` along with the gene names).
counts = ad_mc.obs['clusters'].value_counts()
keep = counts[counts >= 2].index.tolist()
ad_mc_for_de = ad_mc[ad_mc.obs['clusters'].isin(keep)].copy()
ad_mc_for_de.obs['clusters'] = ad_mc_for_de.obs['clusters'].astype('category')

ov.single.find_markers(ad_mc_for_de, groupby='clusters', method='wilcoxon',
                       key_added='rank_genes_groups', pts=True, use_gpu=False)
ov.single.get_markers(ad_mc_for_de, n_genes=3, key='rank_genes_groups')
🔍 Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
✅ Done | 8 groups × 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
group rank names scores logfoldchanges pvals pvals_adj pct_group pct_rest
0 Alpha 1 Smarca1 6.067086 3.528083 1.302521e-09 2.127080e-06 1.0 0.941176
1 Alpha 2 Ttr 5.912633 2.369870 3.366814e-09 2.127080e-06 1.0 1.000000
2 Alpha 3 Tmem27 5.864367 6.051901 4.508499e-09 2.127080e-06 1.0 0.788235
3 Beta 1 Sytl4 5.392514 7.145541 6.947866e-08 3.869363e-05 1.0 0.348315
4 Beta 2 Gng12 5.370481 4.113616 7.852677e-08 3.869363e-05 1.0 0.932584
5 Beta 3 Gm27033 5.293367 4.532875 1.200848e-07 3.869363e-05 1.0 0.707865
6 Delta 1 Rbp4 3.377268 5.227028 7.320963e-04 9.230787e-02 1.0 0.989583
7 Delta 2 Sst 3.377268 12.785007 7.320963e-04 9.230787e-02 1.0 0.583333
8 Delta 3 Spock3 3.377268 5.802761 7.320963e-04 9.230787e-02 1.0 0.458333
9 Ductal 1 Lurap1l 6.893820 3.140599 5.431379e-12 8.347573e-10 1.0 0.950000
10 Ductal 2 Cldn10 6.893820 4.276017 5.431379e-12 8.347573e-10 1.0 0.812500
11 Ductal 3 Cyr61 6.893820 7.576176 5.431379e-12 8.347573e-10 1.0 0.600000
12 Epsilon 1 Ghrl 4.675616 11.245130 2.930724e-06 1.437596e-03 1.0 0.913043
13 Epsilon 2 Card19 4.650205 2.920076 3.316050e-06 1.437596e-03 1.0 0.934783
14 Epsilon 3 Gm11837 4.650205 6.151111 3.316050e-06 1.437596e-03 1.0 0.586957
15 Ngn3 high EP 1 Cbfa2t3 6.418805 3.734952 1.373487e-10 6.710507e-08 1.0 0.867470
16 Ngn3 high EP 2 Gadd45a 6.382100 5.235402 1.746765e-10 6.710507e-08 1.0 0.891566
17 Ngn3 high EP 3 Cdk4 6.345394 1.235358 2.218565e-10 6.710507e-08 1.0 1.000000
18 Ngn3 low EP 1 Grin3a 3.324498 7.126914 8.857774e-04 1.327895e-01 1.0 0.302083
19 Ngn3 low EP 2 Cat 3.324498 3.847967 8.857774e-04 1.327895e-01 1.0 0.875000
20 Ngn3 low EP 3 Atoh8 3.324498 6.508173 8.857774e-04 1.327895e-01 1.0 0.302083
21 Pre-endocrine 1 Bax 6.486627 1.032453 8.777949e-11 1.755590e-07 1.0 1.000000
22 Pre-endocrine 2 Cystm1 6.291986 1.170726 3.134307e-10 2.089538e-07 1.0 1.000000
23 Pre-endocrine 3 Cxxc5 6.105807 1.100205 1.022825e-09 3.409416e-07 1.0 1.000000
# Dotplot of top markers per metacell-level celltype.
ov.pl.markers_dotplot(ad_mc_for_de, groupby='clusters', n_genes=3,
                      key='rank_genes_groups')

12. SuperCell exclusive: re-cut at different graining#

The walktrap dendrogram is cached after .fit(). Cutting at a different n_metacells is essentially free — no graph rebuild needed. Perfect for graining-factor sweeps.

# Sweep four graining factors. γ = n_cells / n_metacells.
gammas = [10, 25, 50, 100]
sweeps = mc.fit_multi_gamma(gammas)

rows = []
for g, fit in sweeps.items():
    n_mc = fit.backend_meta['n_metacells']
    sib = ov.single.MetaCell(adata.copy(), method='supercell',
                              n_metacells=n_mc, use_rep='X_pca')
    sib._fit_result = fit
    sib._write_schema()
    p = sib.compute_purity('clusters').purity.mean()
    rows.append({'gamma': g, 'n_metacells': n_mc, 'mean_purity': round(p, 3)})
sweep_df = pd.DataFrame(rows)
sweep_df
gamma n_metacells mean_purity
0 10.0 363 0.890
1 25.0 145 0.879
2 50.0 73 0.869
3 100.0 36 0.847
# Line plot: mean purity vs graining γ.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4.5, 3))
ax.plot(sweep_df['gamma'], sweep_df['mean_purity'], 'o-', linewidth=1.5)
ax.set_xlabel('graining γ (n_cells / n_metacells)')
ax.set_ylabel('mean purity'); ax.set_ylim(0.4, 1.05)
ax.set_title('SuperCell purity vs graining')
plt.tight_layout(); plt.show()

13. Save / load roundtrip#

# Save/load roundtrip — every backend supports this.
import tempfile, os
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as f:
    path = f.name
mc.save(path)

mc2 = ov.single.MetaCell(adata.copy(), method='supercell', n_metacells=100,
                          use_rep='X_pca', random_state=0)
mc2.load(path)
print(f'saved+loaded {path}')
os.remove(path)
saved+loaded /tmp/tmpp0v27s5m.pkl

14. Takeaways#

  • SuperCell is the fastest principled metacell method here. For tens of thousands of cells it’s ~10–100× faster than SEACells.

  • The hierarchical cache (fit_multi_gamma) is perfect for “what graining should I use?” experiments.

  • Watch dubious_rate from mcRigor — SuperCell can produce heterogeneous metacells at low graining (high γ) on some datasets. If dubious_rate > 0.10 consider switching to SEACells or tightening k_knn.