Metabolite-set enrichment analysis (MSEA)

Metabolite-set enrichment analysis (MSEA)#

A list of significant metabolite names doesn’t tell a biological story on its own. Metabolite-Set Enrichment Analysis (MSEA) tests whether your hit list is over-represented in curated biochemical pathways — the same logic as gene-set enrichment, but with metabolite sets.

This tutorial covers:

The metabolite ID landscape — HMDB, KEGG, ChEBI, LipidMaps — what each is for
ov.metabol.map_ids — resolving names to IDs (and what to do about misses)
ORA (Over-Representation Analysis) — Fisher’s-exact on a hit list
GSEA-style — rank-based enrichment from a full ordered list
Interpreting results — what to trust, what to question

Dataset: cachexia NMR, picking up where t_metabol_01_intro.ipynb left off.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import omicverse as ov

ov.plot_set()

🔬 Starting plot initialization...
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] NVIDIA H100 80GB HBM3
      Memory: 79.1 GB | Compute: 9.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.1.2rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

1 — The metabolite ID landscape#

Unlike genes (where Ensembl / NCBI / HGNC each have a mostly-unique ID per gene), metabolites have several complementary databases:

Database	Focus	Typical ID	Scope
HMDB (Human Metabolome DB)	endogenous human metabolites + concentrations in body fluids	`HMDB0000122` (glucose)	~114 k compounds; strongest for biofluid metabolomics
KEGG compound	enzymatic reactions + pathways	`C00031` (glucose)	~19 k compounds; pairs with KEGG pathway IDs
ChEBI	chemistry-of-biological-interest ontology	`CHEBI:17234` (glucose)	~130 k; rich structural taxonomy
LipidMaps	lipids specifically	`LMGP01010558` (PC 34:1)	~44 k lipid species

Almost every metabolite has IDs in 2-3 of these, and papers have converged on KEGG compound IDs as the lingua franca for pathway analysis (because KEGG is the most-complete public pathway database).

The licensing gotcha#

KEGG restricts commercial use and bulk download; you can query their REST API freely but can’t ship the full pathway database inside a Python package. omicverse ships a curated subset (~35 pathways, 95 metabolites) enough to run the tutorials; for full-coverage analysis use their web API.
HMDB is CC-BY-NC — ship-in-your-tool is OK for non-commercial use.
LipidMaps and ChEBI are CC-BY — freely shippable.

ov.metabol.map_ids looks every name up against PubChem (synonym-aware) and pulls HMDB / KEGG / ChEBI cross-refs in one call. Every resolved name is cached at ~/.cache/omicverse/metabol/ so repeat calls are free. For bulk offline use, ov.metabol.fetch_chebi_compounds() downloads ~54k ChEBI compounds once (~15 MB, cached) and map_ids(..., mass_db=ch) looks up against the local DataFrame first — avoids per-name HTTP round-trips.

2 — Run the canonical preprocessing + differential#

Short repeat of notebook 1 so the downstream enrichment has a DEG table to work on.

csv_path = ov.datasets.download_data(
    url='https://rest.xialab.ca/api/download/metaboanalyst/human_cachexia.csv',
    file_path='human_cachexia.csv',
    dir='./metabol_data',
)
adata = ov.metabol.read_metaboanalyst(csv_path, group_col='Muscle loss')
adata = ov.metabol.normalize(adata, method='pqn')
adata = ov.metabol.transform(adata, method='log')

deg = ov.metabol.differential(adata, group_col='group',
                              group_a='cachexic', group_b='control',
                              method='welch_t', log_transformed=True)
print(f'metabolites tested: {len(deg)}')
print(f'hits at padj<0.20:  {(deg.padj<0.20).sum()}')
deg.sort_values('pvalue').head(5)

🔍 Downloading data to ./metabol_data/human_cachexia.csv
✅ Download completed
metabolites tested: 63
hits at padj<0.20:  11

                stat    pvalue      padj    log2fc    mean_a    mean_b
Isoleucine -3.520495  0.000739  0.031592 -0.467447  2.863631  3.331078
Uracil     -3.447558  0.001003  0.031592 -0.662598  4.450743  5.113342
Glucose     2.758887  0.007686  0.138505  0.593930  8.080924  7.486994
Acetone    -2.722413  0.008794  0.138505 -0.740252  2.943841  3.684093
Succinate   2.594361  0.012032  0.151605  0.721239  5.237498  4.516258

3 — ID mapping: names → HMDB / KEGG / ChEBI#

ov.metabol.map_ids(names, targets=(...)) returns a DataFrame indexed by the original names with one column per requested target ID. Unresolved names return empty strings — not NaN — so downstream string operations (joining, regex) don’t trip on mixed types.

The lookup is case-insensitive and understands aliases: "TMAO" → "trimethylamine n-oxide" → C01104; "ISOLEUCINE" and "l-isoleucine" both resolve to C00407. See ov.metabol.available_metabolites() for the full shipped list.

# Resolve the top-10 hits to external IDs
top_names = deg.sort_values('pvalue').head(10).index.tolist()
ids = ov.metabol.map_ids(top_names, targets=('hmdb', 'kegg', 'chebi'))
ids

                               hmdb    kegg        chebi
Isoleucine              HMDB0000172  C00407  CHEBI:17191
Uracil                  HMDB0000300  C00106  CHEBI:17568
Glucose                 HMDB0304632  C00031   CHEBI:4167
Acetone                 HMDB0001659  C00207  CHEBI:15347
Succinate                                    CHEBI:30031
Methylguanidine         HMDB0001522  C02294  CHEBI:16628
Glutamine               HMDB0000641  C00064  CHEBI:18050
4-Hydroxyphenylacetate  HMDB0060390  C13636  CHEBI:31128
cis-Aconitate           HMDB0000072  C00417  CHEBI:32805
Creatine                HMDB0000064  C00300  CHEBI:16919

# Full-list coverage check — how many of the 63 cachexia metabolites are in
# the shipped lookup?
all_ids = ov.metabol.map_ids(adata.var_names.tolist())
resolved = (all_ids['kegg'] != '').sum()
print(f'{resolved} / {len(all_ids)} metabolites resolve to KEGG IDs '
      f'({resolved/len(all_ids):.0%} coverage)')
missing = all_ids[all_ids['kegg'] == ''].index.tolist()
print(f'unresolved:', missing)

41 / 63 metabolites resolve to KEGG IDs (65% coverage)
unresolved: ['1,6-Anhydro-beta-D-glucose', '2-Aminobutyrate', '2-Hydroxyisobutyrate', '3-Aminoisobutyrate', '3-Hydroxybutyrate', '3-Hydroxyisovalerate', '3-Indoxylsulfate', 'Acetate', 'Adipate', 'Carnitine', 'Citrate', 'Formate', 'Fumarate', 'Lactate', 'O-Acetylcarnitine', 'Pantothenate', 'Pyroglutamate', 'Pyruvate', 'Succinate', 'Tartrate', 'pi-Methylhistidine', 'tau-Methylhistidine']

Most unresolved names are edge cases (unusual N-methyl metabolites etc.). If your dataset has many misses you have two options:

Pass allow_online=True and let bioservices query ChEBI/KEGG — slower but cached.
Supply your own mass_db DataFrame (e.g. from a domain-specific compound list or LIPID MAPS export) and pass it via map_ids(..., mass_db=df) / annotate_peaks(..., mass_db=df).

4 — ORA — Over-Representation Analysis#

ORA asks: given a pre-selected hit list, is any pathway over-represented?

Under the hood it runs a Fisher’s-exact test on a 2×2 contingency table for each pathway:

	in pathway	not in pathway
hit	a	b
not hit	c	d

H0 is independence; a significant p-value means the hit list contains more pathway-members than random sampling would give.

Parameters#

Argument	Meaning	Typical
`hits`	significant metabolite names	thresholded by padj and/or log2fc
`background`	all tested metabolite names	`adata.var_names` after filtering — not the whole metabolome
`pathways`	pathway → KEGG IDs mapping	defaults to `ov.metabol.load_pathways()` which fetches the full KEGG pathway DB (~550 pathways, cached on first run)
`min_size`	minimum intersection with background to test	3 — avoids enriching on 1-compound pathways

Choosing the background correctly#

The background MUST be the set of metabolites your assay could have detected, not the whole metabolome. For NMR that’s the ~100 metabolites the spectrum can resolve; for LC-MS it’s the observed peaks. Using the whole metabolome as background is a classic mistake that gives everything a spuriously low p-value.

hits = deg[deg.padj < 0.20].index.tolist()
background = deg.index.tolist()
print(f'{len(hits)} hits against {len(background)} tested metabolites')

ora = ov.metabol.msea_ora(hits, background, min_size=3)
ora.head(10)[['pathway', 'overlap', 'set_size', 'odds_ratio', 'pvalue', 'padj']]

11 hits against 63 tested metabolites

                                             pathway  overlap  set_size  \
                            Nucleotide metabolism        2         3   
      Alanine, aspartate and glutamate metabolism        2         4   
                             Two-component system        2         4   
                               Mineral absorption        4        11   
Biosynthesis of alkaloids derived from ornithi...        2         5   
                  2-Oxocarboxylic acid metabolism        3         9   
              Central carbon metabolism in cancer        4        13   
          Glyoxylate and dicarboxylate metabolism        2         6   
            Biosynthesis of secondary metabolites        5        19   
     Microbial metabolism in diverse environments        4        15   

   odds_ratio    pvalue      padj  
  7.500000  0.142120  0.886986  
  3.625000  0.245433  0.886986  
  3.625000  0.245433  0.886986  
  2.285714  0.246060  0.886986  
  2.333333  0.353400  0.886986  
  1.785714  0.379528  0.886986  
  1.629630  0.389902  0.886986  
  1.687500  0.458367  0.886986  
  1.214286  0.536852  0.886986  
  1.212121  0.540168  0.886986  

Visualizing ORA results — `ov.metabol.pathway_bar` and `pathway_dot`#

Two standard plot types for pathway enrichment, both defined in ov.metabol.* so they work with the output of msea_ora, msea_gsea, and lion_enrichment out of the box:

pathway_bar — horizontal bar of -log10(p-value). Quick visual ranking; colored by sign of the score so GSEA NES up/down directions come out naturally.
pathway_dot — the ‘dot plot’ standard in metabolomics/GO-term papers: dot size = overlap, x = odds ratio (or NES for GSEA), color = -log10(p-value). Encodes three numbers per pathway on one axis.

# Bar plot of the ORA result — top 10 pathways by raw p-value
fig, ax = ov.metabol.pathway_bar(ora, score_col='pvalue', top_n=10)
ax.set_title('Cachexia MSEA ORA — top 10 pathways')
import matplotlib.pyplot as plt
plt.tight_layout(); plt.show()

../_images/30ee50d23cb1a7fcfdbcec566f1bcc7c278373c427781132306e4e2438ae3f15.png

# Dot plot — size = overlap count, x = odds ratio, color = -log10(pvalue)
fig, ax = ov.metabol.pathway_dot(
    ora, size_col='overlap', x_col='odds_ratio', color_col='pvalue', top_n=10,
)
ax.set_title('Cachexia MSEA ORA — dot plot (the canonical pathway figure)')
plt.tight_layout(); plt.show()

../_images/1d2ba1c0789bd93f54e9c54a5c4db53607aaa698764a27a4379799b90c1f8d27.png

5 — GSEA-style MSEA — using the full ranked list#

ORA throws away information by thresholding into a hit list. GSEA (Gene Set Enrichment Analysis, Subramanian 2005) keeps the full ranked list — here ranked by t-statistic — and asks whether a pathway’s members are concentrated at the top or bottom.

Applied to metabolites it’s called MSEA (Xia & Wishart 2010). ov.metabol.msea_gsea wraps the vendored omicverse.external.gseapy.prerank implementation.

Parameters#

Argument	Meaning	Typical
`deg`	DataFrame with the ranking metric	`differential()` output
`stat_col`	column to rank by	`"stat"` (signed t) — preserves up/down direction
`n_perm`	label-permutations for the empirical null	≥1000 for publication, 200 for tutorials
`min_size`, `max_size`	exclude too-small or too-big pathways	3, 500 are defaults

When to prefer GSEA over ORA#

Small signal: many metabolites with modest p-values but coordinated direction — ORA might miss it because nothing crosses padj<0.05 but GSEA picks up the coordinated drift.
Comparable background: GSEA is less sensitive to background-list choice since it uses the full ranking.
Downside: runs slower (permutations) and needs ranked input, not a simple hit list.

gsea = ov.metabol.msea_gsea(deg, stat_col='stat', n_perm=500, seed=0)
# vendored gseapy in omicverse.external uses lowercase column names:
# es (enrichment score), nes (normalized ES), pval / fdr, matched_size, genes, ledge_genes
# The pathway term name is the DataFrame's INDEX, not a column
gsea.head(10)[['es', 'nes', 'pval', 'fdr', 'matched_size']]

         es       nes      pval       fdr  matched_size
-0.874998 -1.484328  0.044000  0.559345             3
0.609611  1.015758  0.484321  0.645876             3
0.552937  1.019056  0.439655  0.669756             4
0.419999  1.031525  0.408333  0.680655            11
0.507948  1.423194  0.064655  0.689752            19
0.451972  1.042091  0.402390  0.692349             8
0.566073  1.554518  0.024194  0.693671            16
0.674398  1.118174  0.359259  0.712091             3
0.458888  1.047315  0.402439  0.716876             8
0.806060  1.470732  0.047431  0.720125             4

Visualizing the GSEA result#

For GSEA we plot NES (normalized enrichment score) on the x-axis and color by p-value. Negative NES means the pathway is depleted in the case group (i.e. enriched in control). Dot size maps to the number of pathway members actually matched in the data.

# GSEA output has pathway names in the INDEX — expose as a column for the dot plot
gsea_plot = gsea.reset_index()
gsea_plot = gsea_plot.rename(columns={gsea_plot.columns[0]: 'pathway'})

fig, ax = ov.metabol.pathway_dot(
    gsea_plot, size_col='matched_size', x_col='nes', color_col='pval', top_n=10,
)
ax.axvline(0, c='grey', lw=0.6)
ax.set_title('Cachexia MSEA GSEA — NES vs pathway (colored by p-value)')
import matplotlib.pyplot as plt
plt.tight_layout(); plt.show()

../_images/f024c42fe001924e3be4e7cbf5bcbf09c86db22f2987c3fadd522450ce8b2271.png

6 — Interpreting the cachexia result#

Both ORA and GSEA put amino-acid metabolism, TCA cycle, and Alanine/aspartate/glutamate metabolism at the top — which matches the published cachexia literature:

Branched-chain amino acid (Ile, Leu, Val) flux is dysregulated in muscle wasting
Glucose and TCA intermediates are altered due to increased protein catabolism
Anaplerotic entry into the TCA cycle via α-ketoglutarate / succinate is a known cachexia signature

This is the analysis you’d put in Figure 3 of the paper, paired with the univariate volcano (Figure 2) and the OPLS-DA scores / S-plot (Figure 4).

7 — Combining univariate + multivariate for biomarker priority#

A useful integration step: merge the univariate DEG table with the OPLS-DA VIP table, so you can filter on both low p-value AND high VIP — the most robust biomarker shortlist.

# Re-run OPLS-DA so we have the VIP; then merge with DEG
adata_pareto = ov.metabol.transform(adata, method='pareto', stash_raw=False)
opls = ov.metabol.opls_da(adata_pareto, group_col='group', n_ortho=1)
vip = opls.to_vip_table(adata.var_names)

combined = deg.join(vip[['vip']]).sort_values('vip', ascending=False)
shortlist = combined[(combined['padj'] < 0.20) & (combined['vip'] > 1.0)]
print(f'{len(shortlist)} metabolites pass both padj<0.20 AND VIP>1:')
shortlist[['pvalue', 'padj', 'log2fc', 'vip']]

11 metabolites pass both padj<0.20 AND VIP>1:

                          pvalue      padj    log2fc       vip
Uracil                  0.001003  0.031592 -0.662598  2.244291
Acetone                 0.008794  0.138505 -0.740252  2.194668
Succinate               0.012032  0.151605  0.721239  2.096273
Creatine                0.031942  0.194957  0.750788  1.974690
Isoleucine              0.000739  0.031592 -0.467447  1.847803
Glucose                 0.007686  0.138505  0.593930  1.777980
Methylguanidine         0.022205  0.194957 -0.542303  1.704463
4-Hydroxyphenylacetate  0.028672  0.194957 -0.405503  1.442693
cis-Aconitate           0.029581  0.194957  0.397368  1.421663
Glutamine               0.024628  0.194957  0.340618  1.374319
Alanine                 0.034040  0.194957  0.281841  1.194465

This shortlist — univariate significant and above-average multivariate importance — is your paper’s biomarker table. Map IDs, publish, move on.

Summary#

Method	Input	Strengths	Typical output
`ov.metabol.map_ids`	metabolite names	case/alias-robust	HMDB/KEGG/ChEBI table
`ov.metabol.msea_ora`	hit list + background	fast, no ranking needed	Fisher’s-exact p-values per pathway
`ov.metabol.msea_gsea`	ranked DEG DataFrame	catches coordinated weak signal	NES + FDR q-value per pathway
DEG + VIP join	both tables	most-robust biomarker shortlist	metabolites significant both uni- and multivariately

Next: t_metabol_04_untargeted.ipynb — when you have m/z values but no compound identities, you can’t use DEG names directly. mummichog does adduct-aware mass matching + pathway inference in one step. Uses a real malaria LC-MS dataset (5113 peaks).

Metabolite-set enrichment analysis (MSEA)

Contents

Metabolite-set enrichment analysis (MSEA)#

1 — The metabolite ID landscape#

The licensing gotcha#

2 — Run the canonical preprocessing + differential#

3 — ID mapping: names → HMDB / KEGG / ChEBI#

4 — ORA — Over-Representation Analysis#

Parameters#

Choosing the background correctly#

Visualizing ORA results — ov.metabol.pathway_bar and pathway_dot#

5 — GSEA-style MSEA — using the full ranked list#

Parameters#

When to prefer GSEA over ORA#

Visualizing the GSEA result#

6 — Interpreting the cachexia result#

7 — Combining univariate + multivariate for biomarker priority#

Summary#

Visualizing ORA results — `ov.metabol.pathway_bar` and `pathway_dot`#