Spatial clustering with GraphST + pymclustR#
GraphST is a contrastive-learning spatial embedder. Top-tier in the Nature Methods 2024-04 benchmark.
This notebook runs the GraphST spatial embedder on the
Maynard 151676 dorsolateral prefrontal cortex Visium sample
(3 460 spots × 10 747 genes) and clusters the resulting embedding with
pymclustR, a pure-Python
re-implementation of CRAN mclust (no rpy2 / R dependency).
Pre-processed input lives at
/scratch/users/steorra/analysis/omicverse_dev/omicverse-test/notebooks/data/cluster_svg.h5ad, which is the canonical fixture used in the originalt_cluster_spacetutorial.
0. Load AnnData + Ground Truth#
import omicverse as ov
import scanpy as sc
import pandas as pd, os, anndata as ad
ov.style(font_path='Arial')
# Load the pre-processed AnnData (3460 spots × 10747 genes — the same
# input the original spatial-clustering tutorial was developed against).
DATA_DIR = '/scratch/users/steorra/analysis/omicverse_dev/omicverse-test/data/151676'
H5AD = '/scratch/users/steorra/analysis/omicverse_dev/omicverse-test/notebooks/data/cluster_svg.h5ad'
adata = ad.read_h5ad(H5AD)
truth = pd.read_csv(os.path.join(DATA_DIR, '151676_truth.txt'),
sep='\t', header=None, index_col=0)
truth.columns = ['Ground Truth']
adata.obs['Ground Truth'] = truth['Ground Truth'].reindex(adata.obs_names)
print('shape:', adata.shape, ' annotated:',
adata.obs['Ground Truth'].notna().sum())
adata
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
• [CUDA 0] NVIDIA H100 80GB HBM3
Memory: 79.1 GB | Compute: 9.0
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
🔖 Version: 2.1.2rc1 📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.
shape: (3460, 10747) annotated: 3431
AnnData object with n_obs × n_vars = 3460 × 10747
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'Ground Truth'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'space_variable_features', 'highly_variable'
uns: 'REFERENCE_MANU', 'spatial'
obsm: 'spatial'
layers: 'counts'
1. Embed with GraphST#
GraphST (Long et al., Nat. Comm. 2023) jointly embeds gene expression and spatial coordinates with a contrastive learning objective. It was rated one of the best spatial-clustering algorithms in Nature Methods’s 2024-04 benchmark.
methods_kwargs = {'GraphST': {'device': 'cuda:0', 'n_pcs': 30}}
adata = ov.space.clusters(adata, methods=['GraphST'],
methods_kwargs=methods_kwargs, lognorm=1e4)
The GraphST method is used to embed the spatial data.
Begin to train ST data...
Optimization finished for ST data!
computing PCA🔍
with n_comps=30
🖥️ Using sklearn PCA for CPU computation
🖥️ sklearn PCA backend: CPU computation
📊 PCA input data type: ArrayView, shape: (3460, 3000), dtype: float32
🔧 PCA solver used: covariance_eigh
finished✅ (3.39s)
╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│ Duration: 3.3994s │
│ Shape: 3,460 x 3,000 (Unchanged) │
│ │
│ CHANGES DETECTED │
│ ──────────────── │
│ ● UNS │ ✚ graphst|original|cum_sum_eigenvalues │
│ │ ✚ graphst|original|pca_var_ratios │
│ │ ✚ pca │
│ │ └─ params: {'zero_center': True, 'use_highly_variable': Tr...│
│ │ ✚ status │
│ │ ✚ status_args │
│ │
│ ● OBSM │ ✚ X_pca (array, 3460x30) │
│ │ ✚ graphst|original|X_pca (array, 3460x30) │
│ │
╰────────────────────────────────────────────────────────────────────╯
GraphST embedding has been saved in adata.obsm["GraphST_embedding"] and adata.obsm["graphst|original|X_pca"]
The GraphST embedding are stored in adata.obsm["GraphST_embedding"].
Shape: (3460, 3000)
2. Cluster with pymclustR (no rpy2 / R needed)#
method='pymclustR' is a drop-in pure-Python replacement for the
historical mclust_R backend — same 14 covariance parameterisations,
same EM, same BIC.
ov.utils.cluster(adata, use_rep='graphst|original|X_pca', method='pymclustR',
n_components=10, modelNames='EEE', random_state=42)
adata.obs['pymclustR_GraphST'] = ov.utils.refine_label(adata, radius=30, key='pymclustR')
adata.obs['pymclustR_GraphST'] = adata.obs['pymclustR_GraphST'].astype('category')
res = ov.space.merge_cluster(adata, groupby='pymclustR_GraphST',
use_rep='graphst|original|X_pca',
threshold=0.2, plot=True)
3. Spatial visualisation#
4. ARI vs Maynard ground truth#
from sklearn.metrics.cluster import adjusted_rand_score
obs = adata.obs.dropna(subset=['Ground Truth'])
ari_raw = adjusted_rand_score(obs['pymclustR'], obs['Ground Truth'])
ari_ref = adjusted_rand_score(obs['pymclustR_GraphST'], obs['Ground Truth'])
print(f'GraphST + pymclustR (raw): ARI = {ari_raw:.4f}')
print(f'GraphST + pymclustR (refined): ARI = {ari_ref:.4f}')
GraphST + pymclustR (raw): ARI = 0.3541
GraphST + pymclustR (refined): ARI = 0.3916