Side-by-side comparison of all metacell backends#

Each backend in the metacell-zoo was developed by a different group with different assumptions. This notebook runs all six always-available backends (seacells, metaq, supercell, kmeans, random, geosketch) on the same pancreas data and ranks them by:

  • runtime โ€” how long does it take?

  • purity โ€” fraction of each metacell that belongs to a single celltype

  • rigor_score โ€” mcRigorโ€™s composite score (higher = better)

  • dubious_rate โ€” fraction of cells in heterogeneous metacells

  • compactness โ€” mean distance from cells to their metacell centroid in PCA space

ov.single.compare_metacell_backends runs all of this in one call.

1. Setup#

# Standard imports + omicverse defaults.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import omicverse as ov
import scvelo as scv      # only used for the demo dataset

ov.plot_set()
๐Ÿ”ฌ Starting plot initialization...
๐Ÿงฌ Detecting GPU devicesโ€ฆ
โœ… NVIDIA CUDA GPUs detected: 1
    โ€ข [CUDA 0] NVIDIA H100 80GB HBM3
      Memory: 79.1 GB | Compute: 9.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

๐Ÿ”– Version: 2.2.0   ๐Ÿ“š Tutorials: https://omicverse.readthedocs.io/
โœ… plot_set complete.

2. Load and preprocess#

# Pancreas scRNA-seq (Bastidas-Ponce et al. 2019). Standard omicverse
# preprocess flow: qc -> preprocess -> scale -> pca -> neighbors -> umap.
adata = scv.datasets.pancreas()
adata = ov.pp.qc(adata,
                 tresh={'mito_perc': 0.20, 'nUMIs': 500, 'detected_genes': 250},
                 mt_startswith='mt-')
adata = ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000)
adata.layers['lognorm'] = adata.X.copy()        # mcRigor reads this
adata = adata[:, adata.var.highly_variable_features]
ov.pp.scale(adata)
ov.pp.pca(adata, layer='scaled', n_pcs=30)
adata.obsm['X_pca'] = adata.obsm['scaled|original|X_pca']
ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
ov.pp.umap(adata)
print('adata:', adata.shape, 'celltypes:', sorted(adata.obs['clusters'].unique()))
๐Ÿ–ฅ๏ธ Using CPU mode for QC...

๐Ÿ“Š Step 1: Calculating QC Metrics

   โœ“ Gene Family Detection:
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚ Gene Family                  โ”‚ Genes Found        โ”‚ Detection Method   โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Mitochondrial                โ”‚ 13                 โ”‚ Auto (MT-)         โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Ribosomal                    โ”‚ 0 โš ๏ธ               โ”‚ Auto (RPS/RPL)     โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Hemoglobin                   โ”‚ 0 โš ๏ธ               โ”‚ Auto (regex)       โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

   โœ“ QC Metrics Summary:
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚ Metric                  โ”‚ Mean               โ”‚ Range (Min - Max)       โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ nUMIs                   โ”‚ 6675               โ”‚ 3020 - 18524            โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Detected Genes          โ”‚ 2516               โ”‚ 1473 - 4492             โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Mitochondrial %         โ”‚ 0.7%               โ”‚ 0.2% - 4.3%             โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Ribosomal %             โ”‚ 0.0%               โ”‚ 0.0% - 0.0%             โ”‚
   โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
   โ”‚ Hemoglobin %            โ”‚ 0.0%               โ”‚ 0.0% - 0.0%             โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

   ๐Ÿ“ˆ Original cell count: 3,696

๐Ÿ”ง Step 2: Quality Filtering (SEURAT)
   Thresholds: mitoโ‰ค0.2, nUMIsโ‰ฅ500, genesโ‰ฅ250
   ๐Ÿ“Š Seurat Filter Results:
     โ€ข nUMIs filter (โ‰ฅ500): 0 cells failed (0.0%)
     โ€ข Genes filter (โ‰ฅ250): 0 cells failed (0.0%)
     โ€ข Mitochondrial filter (โ‰ค0.2): 0 cells failed (0.0%)
   โœ“ Filters applied successfully
   โœ“ Combined QC filters: 0 cells removed (0.0%)

๐ŸŽฏ Step 3: Final Filtering
   Parameters: min_genes=200, min_cells=3
   Ratios: max_genes_ratio=1, max_cells_ratio=1
โœ“ Final filtering: 0 cells, 12,261 genes removed

๐Ÿ” Step 4: Doublet Detection
   ๐Ÿ’ก Running pyscdblfinder (Python port of R scDblFinder)
   ๐Ÿ” Running scdblfinder detection...
[ScDblFinder] wrote scDblFinder_score + scDblFinder_class โ€” threshold=0.387
   โœ“ scDblFinder completed: 66 doublets removed (1.8%)

โ•ญโ”€ SUMMARY: qc โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 17.1113s                                                โ”‚
โ”‚  Shape:    3,696 x 27,998 (Unchanged)                              โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— OBS    โ”‚ โœš cell_complexity (float)                              โ”‚
โ”‚           โ”‚ โœš detected_genes (int)                                 โ”‚
โ”‚           โ”‚ โœš hb_perc (float)                                      โ”‚
โ”‚           โ”‚ โœš mito_perc (float)                                    โ”‚
โ”‚           โ”‚ โœš nUMIs (float)                                        โ”‚
โ”‚           โ”‚ โœš n_counts (float)                                     โ”‚
โ”‚           โ”‚ โœš n_genes (int)                                        โ”‚
โ”‚           โ”‚ โœš n_genes_by_counts (int)                              โ”‚
โ”‚           โ”‚ โœš passing_mt (bool)                                    โ”‚
โ”‚           โ”‚ โœš passing_nUMIs (bool)                                 โ”‚
โ”‚           โ”‚ โœš passing_ngenes (bool)                                โ”‚
โ”‚           โ”‚ โœš pct_counts_hb (float)                                โ”‚
โ”‚           โ”‚ โœš pct_counts_mt (float)                                โ”‚
โ”‚           โ”‚ โœš pct_counts_ribo (float)                              โ”‚
โ”‚           โ”‚ โœš ribo_perc (float)                                    โ”‚
โ”‚           โ”‚ โœš total_counts (float)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— VAR    โ”‚ โœš hb (bool)                                            โ”‚
โ”‚           โ”‚ โœš mt (bool)                                            โ”‚
โ”‚           โ”‚ โœš ribo (bool)                                          โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” [2026-05-19 17:38:39] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 15737/15737 genes are kept.
    Among 15737 genes, 15736 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']
โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 3,630 cells ร— 15,736 genes
   โœ“ Runtime: 0.25s

๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100
โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 15,736 total (12.7%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 1.46 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 1.857s                                                  โ”‚
โ”‚  Shape:    3,630 x 15,737 -> 3,630 x 15,736                        โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— VAR    โ”‚ โœš highly_variable (bool)                               โ”‚
โ”‚           โ”‚ โœš highly_variable_features (bool)                      โ”‚
โ”‚           โ”‚ โœš highly_variable_rank (float)                         โ”‚
โ”‚           โ”‚ โœš means (float)                                        โ”‚
โ”‚           โ”‚ โœš n_cells (int)                                        โ”‚
โ”‚           โ”‚ โœš percent_cells (float)                                โ”‚
โ”‚           โ”‚ โœš residual_variances (float)                           โ”‚
โ”‚           โ”‚ โœš robust (bool)                                        โ”‚
โ”‚           โ”‚ โœš variances (float)                                    โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— UNS    โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 3630x15736)                   โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€ SUMMARY: scale โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.6593s                                                 โ”‚
โ”‚  Shape:    3,630 x 2,000 (Unchanged)                               โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— LAYERS โ”‚ โœš scaled (array, 3630x2000)                            โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
computing PCA๐Ÿ”
    with n_comps=30
   ๐Ÿ–ฅ๏ธ Using sklearn PCA for CPU computation
   ๐Ÿ–ฅ๏ธ sklearn PCA backend: CPU computation
   ๐Ÿ“Š PCA input data type: ArrayView, shape: (3630, 2000), dtype: float64
๐Ÿ”ง PCA solver used: covariance_eigh
    finishedโœ… (1.89s)

โ•ญโ”€ SUMMARY: pca โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 1.8986s                                                 โ”‚
โ”‚  Shape:    3,630 x 2,000 (Unchanged)                               โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš scaled|original|cum_sum_eigenvalues                  โ”‚
โ”‚           โ”‚ โœš scaled|original|pca_var_ratios                       โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— OBSM   โ”‚ โœš scaled|original|X_pca (array, 3630x30)               โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ–ฅ๏ธ Using Scanpy CPU to calculate neighbors...

๐Ÿ” K-Nearest Neighbors Graph Construction:
   Mode: cpu
   Neighbors: 15
   Method: umap
   Metric: euclidean
   Representation: X_pca
   ๐Ÿ” Computing neighbor distances...
๐Ÿ” Computing connectivity matrix...
   ๐Ÿ’ก Using UMAP-style connectivity
โœ“ Graph is fully connected

โœ… KNN Graph Construction Completed Successfully!
   โœ“ Processed: 3,630 cells with 15 neighbors each
   โœ“ Results added to AnnData object:
     โ€ข 'neighbors': Neighbors metadata (adata.uns)
     โ€ข 'distances': Distance matrix (adata.obsp)
     โ€ข 'connectivities': Connectivity matrix (adata.obsp)

โ•ญโ”€ SUMMARY: neighbors โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 8.6527s                                                 โ”‚
โ”‚  Shape:    3,630 x 2,000 (Unchanged)                               โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” [2026-05-19 17:38:53] Running UMAP in 'cpu' mode...
๐Ÿ–ฅ๏ธ Using Scanpy CPU UMAP...

๐Ÿ” UMAP Dimensionality Reduction:
   Mode: cpu
   Method: umap
   Components: 2
   Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'}
   ๐Ÿ” Computing UMAP parameters...
   ๐Ÿ” Computing UMAP embedding (classic method)...
โœ… UMAP Dimensionality Reduction Completed Successfully!
   โœ“ Embedding shape: 3,630 cells ร— 2 dimensions
   โœ“ Results added to AnnData object:
     โ€ข 'X_umap': UMAP coordinates (adata.obsm)
     โ€ข 'umap': UMAP parameters (adata.uns)
โœ… UMAP completed successfully.

โ•ญโ”€ SUMMARY: umap โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.7885s                                                 โ”‚
โ”‚  Shape:    3,630 x 2,000 (Unchanged)                               โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš umap                                                 โ”‚
โ”‚           โ”‚ โ””โ”€ params: {'a': 0.5830300199950147, 'b': 1.334166993228519}โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
adata: (3630, 2000) celltypes: ['Alpha', 'Beta', 'Delta', 'Ductal', 'Epsilon', 'Ngn3 high EP', 'Ngn3 low EP', 'Pre-endocrine']

3. The single-call benchmark#

bench = ov.single.compare_metacell_backends(
    adata,
    backends=['seacells', 'metaq', 'supercell', 'kmeans', 'random', 'geosketch'],
    n_metacells=100,
    use_rep='X_pca',
    layer='counts',
    layer_lognorm='lognorm',
    eval_label='clusters',
    device='cpu',           # seacells GPU path needs cupy
    n_rigor_rep=20,
    random_state=0,
    train_epoch=80, warm_epochs=10, codebook_init='random',  # metaq-only kwargs
)
bench[['runtime_s', 'dubious_rate', 'rigor_score',
       'mean_purity', 'mean_compactness', 'n_metacells']].round(3)
Welcome to SEACells!
Parameter graph_construction = union being used to build KNN graph...
Building kernel on X_pca
WARNING: adata.X seems to be already log-transformed.
[Epoch 1] RNA: Loss Rec=1.6497
[Epoch 1] RNA: Loss Rec=0.8152 Loss Rec Q=0.8190 | Codebook: Loss C=0.0038
[Epoch 20] RNA: Loss Rec=0.7487 Loss Rec Q=0.7667 | Codebook: Loss C=0.0095
[Epoch 40] RNA: Loss Rec=0.7331 Loss Rec Q=0.7618 | Codebook: Loss C=0.0106
[Epoch 60] RNA: Loss Rec=0.7214 Loss Rec Q=0.7591 | Codebook: Loss C=0.0104
[Epoch 80] RNA: Loss Rec=0.7126 Loss Rec Q=0.7565 | Codebook: Loss C=0.0110
runtime_s dubious_rate rigor_score mean_purity mean_compactness n_metacells
backend
seacells 16.119 0.653 0.420 0.896 8.044 100
metaq 424.741 0.695 0.400 0.879 8.791 100
supercell 0.452 0.801 0.346 0.874 8.186 100
kmeans 1.998 0.616 0.439 0.884 8.027 100
random 0.000 1.000 0.284 0.282 20.182 100
geosketch 0.251 0.910 0.280 0.898 7.541 100

4. Rank by purity / runtime / rigor#

import matplotlib.pyplot as plt
ranked = bench.sort_values('mean_purity', ascending=True)
fig, axes = plt.subplots(1, 3, figsize=(13, 3.5))
ranked['mean_purity'].plot.barh(ax=axes[0], color='#1f77b4')
axes[0].set_xlabel('mean purity'); axes[0].set_xlim(0, 1)
axes[0].axvline(ranked['mean_purity']['random'], color='red', linestyle='--')

bench['runtime_s'].plot.barh(ax=axes[1], color='#2ca02c', logx=True)
axes[1].set_xlabel('runtime (s, log)')

bench['rigor_score'].plot.barh(ax=axes[2], color='#9467bd')
axes[2].set_xlabel('rigor_score'); axes[2].set_xlim(0, 1.05)
for ax in axes:
    ax.invert_yaxis()
plt.tight_layout(); plt.show()

5. UMAP overlay grid#

Project all six partitions back onto the source UMAP.

# Refit each backend keeping the MetaCell object so we can use the
# ov.pl.metacell_centroids helper for each panel.
backends = ['seacells', 'metaq', 'supercell', 'kmeans', 'random', 'geosketch']
fits = {}
for b in backends:
    extra = dict(train_epoch=80, warm_epochs=10, codebook_init='random') if b == 'metaq' else {}
    fits[b] = ov.single.MetaCell(
        adata.copy(), method=b, n_metacells=100,
        use_rep='X_pca',
        device='cpu' if b != 'metaq' else 'cuda',
        random_state=0, **extra,
    ).fit()
print('refit done')
Welcome to SEACells!
Parameter graph_construction = union being used to build KNN graph...
Building kernel on X_pca
WARNING: adata.X seems to be already log-transformed.
[Epoch 1] RNA: Loss Rec=1.7653
[Epoch 1] RNA: Loss Rec=0.9919 Loss Rec Q=1.0011 | Codebook: Loss C=0.0033
[Epoch 20] RNA: Loss Rec=0.9290 Loss Rec Q=0.9383 | Codebook: Loss C=0.0076
[Epoch 40] RNA: Loss Rec=0.9147 Loss Rec Q=0.9292 | Codebook: Loss C=0.0095
[Epoch 60] RNA: Loss Rec=0.9028 Loss Rec Q=0.9254 | Codebook: Loss C=0.0095
[Epoch 80] RNA: Loss Rec=0.8912 Loss Rec Q=0.9230 | Codebook: Loss C=0.0099
refit done
fig, axes = plt.subplots(2, 3, figsize=(13, 8))
for ax, name in zip(axes.flatten(), backends):
    a = fits[name].adata
    ov.pl.embedding(a, basis='X_umap', color='clusters', ax=ax, show=False,
                    frameon='small',
                    title=f'{name}  (purity={bench.loc[name, "mean_purity"]:.3f})',
                    legend_loc=None, size=8)
    labels = fits[name]._fit_result.assignments
    pts = np.array([a.obsm['X_umap'][labels == u].mean(axis=0)
                    for u in np.unique(labels)])
    ax.scatter(pts[:, 0], pts[:, 1], s=20, c='#222',
               edgecolors='white', linewidths=0.6, zorder=5)
plt.tight_layout(); plt.show()

6. Purity histograms across backends#

import seaborn as sns
fig, axes = plt.subplots(2, 3, figsize=(13, 5.5), sharex=True, sharey=True)
for ax, name in zip(axes.flatten(), backends):
    p = fits[name].compute_purity('clusters')['purity']
    sns.histplot(p, bins=20, ax=ax, color='#1f77b4')
    ax.set_title(f'{name}  (mean={p.mean():.2f})')
    ax.set_xlabel('purity'); ax.set_xlim(0, 1.05)
plt.tight_layout(); plt.show()

7. Marker recovery on the aggregated AnnData#

Aggregate with each backend, then ask omicverse for the top marker per celltype in the metacell AnnData. The โ€œrightโ€ markers (Ins1/Ins2 for Beta, Gcg for Alpha, etc.) should pop out regardless of which backend you used.

top1 = {}
for b in backends:
    ad_mc = fits[b].predicted(method='hard', layer='counts', summary='sum',
                              celltype_label='clusters')
    ad_mc = ov.pp.preprocess(ad_mc, mode='shiftlog|pearson',
                              n_HVGs=min(2000, ad_mc.n_vars))
    counts = ad_mc.obs['clusters'].value_counts()
    keep = counts[counts >= 2].index.tolist()
    ad_mc = ad_mc[ad_mc.obs['clusters'].isin(keep)].copy()
    ad_mc.obs['clusters'] = ad_mc.obs['clusters'].astype('category')
    ov.single.find_markers(ad_mc, groupby='clusters', method='wilcoxon',
                           key_added='rank_genes_groups', pts=True, use_gpu=False)
    top1[b] = pd.DataFrame(ad_mc.uns['rank_genes_groups']['names']).iloc[0]
pd.DataFrame(top1)
๐Ÿ” [2026-05-19 17:49:27] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']

โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 100 cells ร— 2,000 genes
   โœ“ Runtime: 0.00s

๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.10 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.1051s                                                 โ”‚
โ”‚  Shape:    100 x 2,000 (Unchanged)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš REFERENCE_MANU                                       โ”‚
โ”‚           โ”‚ โœš _ov_provenance                                       โ”‚
โ”‚           โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚           โ”‚ โœš status                                               โ”‚
โ”‚           โ”‚ โœš status_args                                          โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 100x2000)                     โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โœ… Done | 8 groups ร— 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐Ÿ” [2026-05-19 17:49:27] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 0 highly-expressed genes from normalization computation
   Excluded genes: []

โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 100 cells ร— 2,000 genes
   โœ“ Runtime: 0.00s

๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.02 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.0253s                                                 โ”‚
โ”‚  Shape:    100 x 2,000 (Unchanged)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš REFERENCE_MANU                                       โ”‚
โ”‚           โ”‚ โœš _ov_provenance                                       โ”‚
โ”‚           โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚           โ”‚ โœš status                                               โ”‚
โ”‚           โ”‚ โœš status_args                                          โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 100x2000)                     โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 7 | n_genes: 50
โœ… Done | 7 groups ร— 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐Ÿ” [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']

โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 100 cells ร— 2,000 genes
   โœ“ Runtime: 0.00s
๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.02 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.0246s                                                 โ”‚
โ”‚  Shape:    100 x 2,000 (Unchanged)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš REFERENCE_MANU                                       โ”‚
โ”‚           โ”‚ โœš _ov_provenance                                       โ”‚
โ”‚           โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚           โ”‚ โœš status                                               โ”‚
โ”‚           โ”‚ โœš status_args                                          โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 100x2000)                     โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โœ… Done | 8 groups ร— 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐Ÿ” [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']

โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 100 cells ร— 2,000 genes
   โœ“ Runtime: 0.00s

๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.02 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.0252s                                                 โ”‚
โ”‚  Shape:    100 x 2,000 (Unchanged)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš REFERENCE_MANU                                       โ”‚
โ”‚           โ”‚ โœš _ov_provenance                                       โ”‚
โ”‚           โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚           โ”‚ โœš status                                               โ”‚
โ”‚           โ”‚ โœš status_args                                          โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 100x2000)                     โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โœ… Done | 8 groups ร— 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐Ÿ” [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 0 highly-expressed genes from normalization computation
   Excluded genes: []

โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 100 cells ร— 2,000 genes
   โœ“ Runtime: 0.00s

๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.02 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.0276s                                                 โ”‚
โ”‚  Shape:    100 x 2,000 (Unchanged)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš REFERENCE_MANU                                       โ”‚
โ”‚           โ”‚ โœš _ov_provenance                                       โ”‚
โ”‚           โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚           โ”‚ โœš status                                               โ”‚
โ”‚           โ”‚ โœš status_args                                          โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 100x2000)                     โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 5 | n_genes: 50
โœ… Done | 5 groups ร— 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐Ÿ” [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 2000/2000 genes are kept.
    Among 2000 genes, 2000 genes are robust.
โœ… Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

๐Ÿ” Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   โš ๏ธ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['Ghrl']

โœ… Count Normalization Completed Successfully!
   โœ“ Processed: 100 cells ร— 2,000 genes
   โœ“ Runtime: 0.00s
๐Ÿ” Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

โœ… Experimental HVG Selection Completed Successfully!
   โœ“ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
   โœ“ Results added to AnnData object:
     โ€ข 'highly_variable': Boolean vector (adata.var)
     โ€ข 'highly_variable_rank': Float vector (adata.var)
     โ€ข 'highly_variable_nbatches': Int vector (adata.var)
     โ€ข 'highly_variable_intersection': Boolean vector (adata.var)
     โ€ข 'means': Float vector (adata.var)
     โ€ข 'variances': Float vector (adata.var)
     โ€ข 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.02 seconds.
โœ… Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

โ•ญโ”€ SUMMARY: preprocess โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚  Duration: 0.0236s                                                 โ”‚
โ”‚  Shape:    100 x 2,000 (Unchanged)                                 โ”‚
โ”‚                                                                    โ”‚
โ”‚  CHANGES DETECTED                                                  โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                  โ”‚
โ”‚  โ— UNS    โ”‚ โœš REFERENCE_MANU                                       โ”‚
โ”‚           โ”‚ โœš _ov_provenance                                       โ”‚
โ”‚           โ”‚ โœš history_log                                          โ”‚
โ”‚           โ”‚ โœš hvg                                                  โ”‚
โ”‚           โ”‚ โœš log1p                                                โ”‚
โ”‚           โ”‚ โœš status                                               โ”‚
โ”‚           โ”‚ โœš status_args                                          โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ— LAYERS โ”‚ โœš counts (sparse matrix, 100x2000)                     โ”‚
โ”‚                                                                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
๐Ÿ” Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โœ… Done | 8 groups ร— 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
seacells metaq supercell kmeans random geosketch
Alpha Sphkap Tmem27 Smarca1 Smarca1 Irx2 Zcchc18
Beta Gng12 Mapt Sytl4 Gng12 Sec61b Sytl4
Delta Rbp4 NaN Rbp4 Scn3a NaN Sst
Ductal Nudt19 Nudt19 Lurap1l Nudt19 Spp1 Hpgd
Epsilon Ghrl Guca2b Ghrl Gm11837 NaN Anpep
Ngn3 high EP Numbl Numbl Cbfa2t3 Cbfa2t3 Cotl1 Smarcd2
Ngn3 low EP Cited4 Cat Grin3a Angptl4 NaN Atoh8
Pre-endocrine Eif3e Cystm1 Bax Neat1 Fev Fev

8. Recommendation table#

A rough decision tree for this kind of dataset (mid-size mouse scRNA-seq atlas, no batch / multi-modal complications):

Downstream task

Pick

UMAP / Leiden / pretty plots

kmeans or geosketch โ€” theyโ€™re within ~5 % of SEACells on purity at 1โ€“100ร— lower runtime

Differential expression / pseudobulk DESeq2

seacells or metaq โ€” proper aggregation preserves count semantics

Cellโ€“cell communication (CellPhoneDB / LIANA)

seacells (soft membership matters)

Gene regulatory networks (SCENIC)

seacells or metaq

Atlas with new samples arriving over time

metaq โ€” only backend with closed-form out-of-sample projection

Multi-million cell dataset

metaq (linear-time) or mc2 (divide-and-conquer)

Quick sanity check that the pipeline works at all

kmeans first; random to confirm the floor

The most important rule: include random in your benchmark. If your principled metacell pipeline isnโ€™t beating random on purity / rigor, you have a bug, not metacells.