Side-by-side comparison of all metacell backends#
Each backend in the metacell-zoo was developed by a different
group with different assumptions. This notebook runs all six
always-available backends (seacells, metaq, supercell, kmeans,
random, geosketch) on the same pancreas data and ranks them by:
runtime โ how long does it take?
purity โ fraction of each metacell that belongs to a single celltype
rigor_score โ mcRigorโs composite score (higher = better)
dubious_rate โ fraction of cells in heterogeneous metacells
compactness โ mean distance from cells to their metacell centroid in PCA space
ov.single.compare_metacell_backends runs all of this in one call.
1. Setup#
# Standard imports + omicverse defaults.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import omicverse as ov
import scvelo as scv # only used for the demo dataset
ov.plot_set()
๐ฌ Starting plot initialization...
๐งฌ Detecting GPU devicesโฆ
โ
NVIDIA CUDA GPUs detected: 1
โข [CUDA 0] NVIDIA H100 80GB HBM3
Memory: 79.1 GB | Compute: 9.0
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
๐ Version: 2.2.0 ๐ Tutorials: https://omicverse.readthedocs.io/
โ
plot_set complete.
2. Load and preprocess#
# Pancreas scRNA-seq (Bastidas-Ponce et al. 2019). Standard omicverse
# preprocess flow: qc -> preprocess -> scale -> pca -> neighbors -> umap.
adata = scv.datasets.pancreas()
adata = ov.pp.qc(adata,
tresh={'mito_perc': 0.20, 'nUMIs': 500, 'detected_genes': 250},
mt_startswith='mt-')
adata = ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000)
adata.layers['lognorm'] = adata.X.copy() # mcRigor reads this
adata = adata[:, adata.var.highly_variable_features]
ov.pp.scale(adata)
ov.pp.pca(adata, layer='scaled', n_pcs=30)
adata.obsm['X_pca'] = adata.obsm['scaled|original|X_pca']
ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
ov.pp.umap(adata)
print('adata:', adata.shape, 'celltypes:', sorted(adata.obs['clusters'].unique()))
๐ฅ๏ธ Using CPU mode for QC...
๐ Step 1: Calculating QC Metrics
โ Gene Family Detection:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโ
โ Gene Family โ Genes Found โ Detection Method โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโค
โ Mitochondrial โ 13 โ Auto (MT-) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโค
โ Ribosomal โ 0 โ ๏ธ โ Auto (RPS/RPL) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโค
โ Hemoglobin โ 0 โ ๏ธ โ Auto (regex) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโ
โ QC Metrics Summary:
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Metric โ Mean โ Range (Min - Max) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ nUMIs โ 6675 โ 3020 - 18524 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Detected Genes โ 2516 โ 1473 - 4492 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Mitochondrial % โ 0.7% โ 0.2% - 4.3% โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Ribosomal % โ 0.0% โ 0.0% - 0.0% โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Hemoglobin % โ 0.0% โ 0.0% - 0.0% โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ Original cell count: 3,696
๐ง Step 2: Quality Filtering (SEURAT)
Thresholds: mitoโค0.2, nUMIsโฅ500, genesโฅ250
๐ Seurat Filter Results:
โข nUMIs filter (โฅ500): 0 cells failed (0.0%)
โข Genes filter (โฅ250): 0 cells failed (0.0%)
โข Mitochondrial filter (โค0.2): 0 cells failed (0.0%)
โ Filters applied successfully
โ Combined QC filters: 0 cells removed (0.0%)
๐ฏ Step 3: Final Filtering
Parameters: min_genes=200, min_cells=3
Ratios: max_genes_ratio=1, max_cells_ratio=1
โ Final filtering: 0 cells, 12,261 genes removed
๐ Step 4: Doublet Detection
๐ก Running pyscdblfinder (Python port of R scDblFinder)
๐ Running scdblfinder detection...
[ScDblFinder] wrote scDblFinder_score + scDblFinder_class โ threshold=0.387
โ scDblFinder completed: 66 doublets removed (1.8%)
โญโ SUMMARY: qc โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 17.1113s โ
โ Shape: 3,696 x 27,998 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ OBS โ โ cell_complexity (float) โ
โ โ โ detected_genes (int) โ
โ โ โ hb_perc (float) โ
โ โ โ mito_perc (float) โ
โ โ โ nUMIs (float) โ
โ โ โ n_counts (float) โ
โ โ โ n_genes (int) โ
โ โ โ n_genes_by_counts (int) โ
โ โ โ passing_mt (bool) โ
โ โ โ passing_nUMIs (bool) โ
โ โ โ passing_ngenes (bool) โ
โ โ โ pct_counts_hb (float) โ
โ โ โ pct_counts_mt (float) โ
โ โ โ pct_counts_ribo (float) โ
โ โ โ ribo_perc (float) โ
โ โ โ total_counts (float) โ
โ โ
โ โ VAR โ โ hb (bool) โ
โ โ โ mt (bool) โ
โ โ โ ribo (bool) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ [2026-05-19 17:38:39] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 15737/15737 genes are kept.
Among 15737 genes, 15736 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
โ
Count Normalization Completed Successfully!
โ Processed: 3,630 cells ร 15,736 genes
โ Runtime: 0.25s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 15,736 total (12.7%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 1.46 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 1.857s โ
โ Shape: 3,630 x 15,737 -> 3,630 x 15,736 โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ VAR โ โ highly_variable (bool) โ
โ โ โ highly_variable_features (bool) โ
โ โ โ highly_variable_rank (float) โ
โ โ โ means (float) โ
โ โ โ n_cells (int) โ
โ โ โ percent_cells (float) โ
โ โ โ residual_variances (float) โ
โ โ โ robust (bool) โ
โ โ โ variances (float) โ
โ โ
โ โ UNS โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 3630x15736) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ SUMMARY: scale โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.6593s โ
โ Shape: 3,630 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ LAYERS โ โ scaled (array, 3630x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
computing PCA๐
with n_comps=30
๐ฅ๏ธ Using sklearn PCA for CPU computation
๐ฅ๏ธ sklearn PCA backend: CPU computation
๐ PCA input data type: ArrayView, shape: (3630, 2000), dtype: float64
๐ง PCA solver used: covariance_eigh
finishedโ
(1.89s)
โญโ SUMMARY: pca โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 1.8986s โ
โ Shape: 3,630 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ scaled|original|cum_sum_eigenvalues โ
โ โ โ scaled|original|pca_var_ratios โ
โ โ
โ โ OBSM โ โ scaled|original|X_pca (array, 3630x30) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ฅ๏ธ Using Scanpy CPU to calculate neighbors...
๐ K-Nearest Neighbors Graph Construction:
Mode: cpu
Neighbors: 15
Method: umap
Metric: euclidean
Representation: X_pca
๐ Computing neighbor distances...
๐ Computing connectivity matrix...
๐ก Using UMAP-style connectivity
โ Graph is fully connected
โ
KNN Graph Construction Completed Successfully!
โ Processed: 3,630 cells with 15 neighbors each
โ Results added to AnnData object:
โข 'neighbors': Neighbors metadata (adata.uns)
โข 'distances': Distance matrix (adata.obsp)
โข 'connectivities': Connectivity matrix (adata.obsp)
โญโ SUMMARY: neighbors โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 8.6527s โ
โ Shape: 3,630 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ [2026-05-19 17:38:53] Running UMAP in 'cpu' mode...
๐ฅ๏ธ Using Scanpy CPU UMAP...
๐ UMAP Dimensionality Reduction:
Mode: cpu
Method: umap
Components: 2
Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'}
๐ Computing UMAP parameters...
๐ Computing UMAP embedding (classic method)...
โ
UMAP Dimensionality Reduction Completed Successfully!
โ Embedding shape: 3,630 cells ร 2 dimensions
โ Results added to AnnData object:
โข 'X_umap': UMAP coordinates (adata.obsm)
โข 'umap': UMAP parameters (adata.uns)
โ
UMAP completed successfully.
โญโ SUMMARY: umap โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.7885s โ
โ Shape: 3,630 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ umap โ
โ โ โโ params: {'a': 0.5830300199950147, 'b': 1.334166993228519}โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
adata: (3630, 2000) celltypes: ['Alpha', 'Beta', 'Delta', 'Ductal', 'Epsilon', 'Ngn3 high EP', 'Ngn3 low EP', 'Pre-endocrine']
3. The single-call benchmark#
bench = ov.single.compare_metacell_backends(
adata,
backends=['seacells', 'metaq', 'supercell', 'kmeans', 'random', 'geosketch'],
n_metacells=100,
use_rep='X_pca',
layer='counts',
layer_lognorm='lognorm',
eval_label='clusters',
device='cpu', # seacells GPU path needs cupy
n_rigor_rep=20,
random_state=0,
train_epoch=80, warm_epochs=10, codebook_init='random', # metaq-only kwargs
)
bench[['runtime_s', 'dubious_rate', 'rigor_score',
'mean_purity', 'mean_compactness', 'n_metacells']].round(3)
Welcome to SEACells!
Parameter graph_construction = union being used to build KNN graph...
Building kernel on X_pca
WARNING: adata.X seems to be already log-transformed.
[Epoch 1] RNA: Loss Rec=1.6497
[Epoch 1] RNA: Loss Rec=0.8152 Loss Rec Q=0.8190 | Codebook: Loss C=0.0038
[Epoch 20] RNA: Loss Rec=0.7487 Loss Rec Q=0.7667 | Codebook: Loss C=0.0095
[Epoch 40] RNA: Loss Rec=0.7331 Loss Rec Q=0.7618 | Codebook: Loss C=0.0106
[Epoch 60] RNA: Loss Rec=0.7214 Loss Rec Q=0.7591 | Codebook: Loss C=0.0104
[Epoch 80] RNA: Loss Rec=0.7126 Loss Rec Q=0.7565 | Codebook: Loss C=0.0110
| runtime_s | dubious_rate | rigor_score | mean_purity | mean_compactness | n_metacells | |
|---|---|---|---|---|---|---|
| backend | ||||||
| seacells | 16.119 | 0.653 | 0.420 | 0.896 | 8.044 | 100 |
| metaq | 424.741 | 0.695 | 0.400 | 0.879 | 8.791 | 100 |
| supercell | 0.452 | 0.801 | 0.346 | 0.874 | 8.186 | 100 |
| kmeans | 1.998 | 0.616 | 0.439 | 0.884 | 8.027 | 100 |
| random | 0.000 | 1.000 | 0.284 | 0.282 | 20.182 | 100 |
| geosketch | 0.251 | 0.910 | 0.280 | 0.898 | 7.541 | 100 |
4. Rank by purity / runtime / rigor#
import matplotlib.pyplot as plt
ranked = bench.sort_values('mean_purity', ascending=True)
fig, axes = plt.subplots(1, 3, figsize=(13, 3.5))
ranked['mean_purity'].plot.barh(ax=axes[0], color='#1f77b4')
axes[0].set_xlabel('mean purity'); axes[0].set_xlim(0, 1)
axes[0].axvline(ranked['mean_purity']['random'], color='red', linestyle='--')
bench['runtime_s'].plot.barh(ax=axes[1], color='#2ca02c', logx=True)
axes[1].set_xlabel('runtime (s, log)')
bench['rigor_score'].plot.barh(ax=axes[2], color='#9467bd')
axes[2].set_xlabel('rigor_score'); axes[2].set_xlim(0, 1.05)
for ax in axes:
ax.invert_yaxis()
plt.tight_layout(); plt.show()
5. UMAP overlay grid#
Project all six partitions back onto the source UMAP.
# Refit each backend keeping the MetaCell object so we can use the
# ov.pl.metacell_centroids helper for each panel.
backends = ['seacells', 'metaq', 'supercell', 'kmeans', 'random', 'geosketch']
fits = {}
for b in backends:
extra = dict(train_epoch=80, warm_epochs=10, codebook_init='random') if b == 'metaq' else {}
fits[b] = ov.single.MetaCell(
adata.copy(), method=b, n_metacells=100,
use_rep='X_pca',
device='cpu' if b != 'metaq' else 'cuda',
random_state=0, **extra,
).fit()
print('refit done')
Welcome to SEACells!
Parameter graph_construction = union being used to build KNN graph...
Building kernel on X_pca
WARNING: adata.X seems to be already log-transformed.
[Epoch 1] RNA: Loss Rec=1.7653
[Epoch 1] RNA: Loss Rec=0.9919 Loss Rec Q=1.0011 | Codebook: Loss C=0.0033
[Epoch 20] RNA: Loss Rec=0.9290 Loss Rec Q=0.9383 | Codebook: Loss C=0.0076
[Epoch 40] RNA: Loss Rec=0.9147 Loss Rec Q=0.9292 | Codebook: Loss C=0.0095
[Epoch 60] RNA: Loss Rec=0.9028 Loss Rec Q=0.9254 | Codebook: Loss C=0.0095
[Epoch 80] RNA: Loss Rec=0.8912 Loss Rec Q=0.9230 | Codebook: Loss C=0.0099
refit done
fig, axes = plt.subplots(2, 3, figsize=(13, 8))
for ax, name in zip(axes.flatten(), backends):
a = fits[name].adata
ov.pl.embedding(a, basis='X_umap', color='clusters', ax=ax, show=False,
frameon='small',
title=f'{name} (purity={bench.loc[name, "mean_purity"]:.3f})',
legend_loc=None, size=8)
labels = fits[name]._fit_result.assignments
pts = np.array([a.obsm['X_umap'][labels == u].mean(axis=0)
for u in np.unique(labels)])
ax.scatter(pts[:, 0], pts[:, 1], s=20, c='#222',
edgecolors='white', linewidths=0.6, zorder=5)
plt.tight_layout(); plt.show()
6. Purity histograms across backends#
import seaborn as sns
fig, axes = plt.subplots(2, 3, figsize=(13, 5.5), sharex=True, sharey=True)
for ax, name in zip(axes.flatten(), backends):
p = fits[name].compute_purity('clusters')['purity']
sns.histplot(p, bins=20, ax=ax, color='#1f77b4')
ax.set_title(f'{name} (mean={p.mean():.2f})')
ax.set_xlabel('purity'); ax.set_xlim(0, 1.05)
plt.tight_layout(); plt.show()
7. Marker recovery on the aggregated AnnData#
Aggregate with each backend, then ask omicverse for the top marker per celltype in the metacell AnnData. The โrightโ markers (Ins1/Ins2 for Beta, Gcg for Alpha, etc.) should pop out regardless of which backend you used.
top1 = {}
for b in backends:
ad_mc = fits[b].predicted(method='hard', layer='counts', summary='sum',
celltype_label='clusters')
ad_mc = ov.pp.preprocess(ad_mc, mode='shiftlog|pearson',
n_HVGs=min(2000, ad_mc.n_vars))
counts = ad_mc.obs['clusters'].value_counts()
keep = counts[counts >= 2].index.tolist()
ad_mc = ad_mc[ad_mc.obs['clusters'].isin(keep)].copy()
ad_mc.obs['clusters'] = ad_mc.obs['clusters'].astype('category')
ov.single.find_markers(ad_mc, groupby='clusters', method='wilcoxon',
key_added='rank_genes_groups', pts=True, use_gpu=False)
top1[b] = pd.DataFrame(ad_mc.uns['rank_genes_groups']['names']).iloc[0]
pd.DataFrame(top1)
๐ [2026-05-19 17:49:27] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
โ
Count Normalization Completed Successfully!
โ Processed: 100 cells ร 2,000 genes
โ Runtime: 0.00s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.10 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.1051s โ
โ Shape: 100 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ REFERENCE_MANU โ
โ โ โ _ov_provenance โ
โ โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ โ status โ
โ โ โ status_args โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 100x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โ
Done | 8 groups ร 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐ [2026-05-19 17:49:27] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 0 highly-expressed genes from normalization computation
Excluded genes: []
โ
Count Normalization Completed Successfully!
โ Processed: 100 cells ร 2,000 genes
โ Runtime: 0.00s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.02 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.0253s โ
โ Shape: 100 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ REFERENCE_MANU โ
โ โ โ _ov_provenance โ
โ โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ โ status โ
โ โ โ status_args โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 100x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 7 | n_genes: 50
โ
Done | 7 groups ร 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐ [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
โ
Count Normalization Completed Successfully!
โ Processed: 100 cells ร 2,000 genes
โ Runtime: 0.00s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.02 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.0246s โ
โ Shape: 100 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ REFERENCE_MANU โ
โ โ โ _ov_provenance โ
โ โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ โ status โ
โ โ โ status_args โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 100x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โ
Done | 8 groups ร 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐ [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
โ
Count Normalization Completed Successfully!
โ Processed: 100 cells ร 2,000 genes
โ Runtime: 0.00s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.02 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.0252s โ
โ Shape: 100 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ REFERENCE_MANU โ
โ โ โ _ov_provenance โ
โ โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ โ status โ
โ โ โ status_args โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 100x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โ
Done | 8 groups ร 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐ [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 0 highly-expressed genes from normalization computation
Excluded genes: []
โ
Count Normalization Completed Successfully!
โ Processed: 100 cells ร 2,000 genes
โ Runtime: 0.00s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.02 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.0276s โ
โ Shape: 100 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ REFERENCE_MANU โ
โ โ โ _ov_provenance โ
โ โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ โ status โ
โ โ โ status_args โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 100x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 5 | n_genes: 50
โ
Done | 5 groups ร 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
๐ [2026-05-19 17:49:28] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 2000/2000 genes are kept.
Among 2000 genes, 2000 genes are robust.
โ
Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
๐ Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
โ ๏ธ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['Ghrl']
โ
Count Normalization Completed Successfully!
โ Processed: 100 cells ร 2,000 genes
โ Runtime: 0.00s
๐ Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
โ
Experimental HVG Selection Completed Successfully!
โ Selected: 2,000 highly variable genes out of 2,000 total (100.0%)
โ Results added to AnnData object:
โข 'highly_variable': Boolean vector (adata.var)
โข 'highly_variable_rank': Float vector (adata.var)
โข 'highly_variable_nbatches': Int vector (adata.var)
โข 'highly_variable_intersection': Boolean vector (adata.var)
โข 'means': Float vector (adata.var)
โข 'variances': Float vector (adata.var)
โข 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 0.02 seconds.
โ
Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
โญโ SUMMARY: preprocess โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ Duration: 0.0236s โ
โ Shape: 100 x 2,000 (Unchanged) โ
โ โ
โ CHANGES DETECTED โ
โ โโโโโโโโโโโโโโโโ โ
โ โ UNS โ โ REFERENCE_MANU โ
โ โ โ _ov_provenance โ
โ โ โ history_log โ
โ โ โ hvg โ
โ โ โ log1p โ
โ โ โ status โ
โ โ โ status_args โ
โ โ
โ โ LAYERS โ โ counts (sparse matrix, 100x2000) โ
โ โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
๐ Finding marker genes | method: wilcoxon | groupby: clusters | n_groups: 8 | n_genes: 50
โ
Done | 8 groups ร 50 genes | corr: benjamini-hochberg | stored in adata.uns['rank_genes_groups']
| seacells | metaq | supercell | kmeans | random | geosketch | |
|---|---|---|---|---|---|---|
| Alpha | Sphkap | Tmem27 | Smarca1 | Smarca1 | Irx2 | Zcchc18 |
| Beta | Gng12 | Mapt | Sytl4 | Gng12 | Sec61b | Sytl4 |
| Delta | Rbp4 | NaN | Rbp4 | Scn3a | NaN | Sst |
| Ductal | Nudt19 | Nudt19 | Lurap1l | Nudt19 | Spp1 | Hpgd |
| Epsilon | Ghrl | Guca2b | Ghrl | Gm11837 | NaN | Anpep |
| Ngn3 high EP | Numbl | Numbl | Cbfa2t3 | Cbfa2t3 | Cotl1 | Smarcd2 |
| Ngn3 low EP | Cited4 | Cat | Grin3a | Angptl4 | NaN | Atoh8 |
| Pre-endocrine | Eif3e | Cystm1 | Bax | Neat1 | Fev | Fev |
8. Recommendation table#
A rough decision tree for this kind of dataset (mid-size mouse scRNA-seq atlas, no batch / multi-modal complications):
Downstream task |
Pick |
|---|---|
UMAP / Leiden / pretty plots |
|
Differential expression / pseudobulk DESeq2 |
|
Cellโcell communication (CellPhoneDB / LIANA) |
|
Gene regulatory networks (SCENIC) |
|
Atlas with new samples arriving over time |
|
Multi-million cell dataset |
|
Quick sanity check that the pipeline works at all |
|
The most important rule: include random in your benchmark. If your
principled metacell pipeline isnโt beating random on purity / rigor, you
have a bug, not metacells.