Biomarker discovery — univariate AUC + multivariate panel

Biomarker discovery — univariate AUC + multivariate panel#

Clinical metabolomics studies typically end with a biomarker question: can a small set of metabolites separate cases from controls? Two functions address this:

ov.metabol.roc_feature — per-feature AUC (polarity-invariant, so you don’t need to know a priori which direction each metabolite should change). Optional bootstrap 95% CI.
ov.metabol.biomarker_panel — nested cross-validation of a multi-metabolite panel with RF / logistic regression / SVM. Reports per-fold AUC, feature importance, and an optional permutation null p-value.

We use the MetaboAnalyst Cachexia dataset (77 samples × 63 urinary metabolites, binary Muscle loss label).

0 — Setup and data#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import omicverse as ov

csv_path = ov.datasets.download_data(
    url='https://rest.xialab.ca/api/download/metaboanalyst/human_cachexia.csv',
    file_path='human_cachexia.csv',
    dir='metabol_demo',
)
adata = ov.metabol.read_metaboanalyst(csv_path, group_col='Muscle loss')
adata = ov.metabol.impute(adata, method='qrilc', seed=0)
adata = ov.metabol.normalize(adata, method='pqn')
adata = ov.metabol.transform(adata, method='log')
adata

🔍 Downloading data to metabol_demo/human_cachexia.csv
⚠️ File metabol_demo/human_cachexia.csv already exists

AnnData object with n_obs × n_vars = 77 × 63
    obs: 'group'
    var: 'missing_frac'
    uns: 'metabol'
    layers: 'raw'

1 — Per-feature ROC AUC#

roc_feature returns a sorted DataFrame of AUC per metabolite. With ci=True it adds 95% bootstrap CIs (slower).

auc = ov.metabol.roc_feature(
    adata,
    group_col='group',
    pos_group='cachexic',
    neg_group='control',
    ci=True, n_bootstrap=500, seed=0,
)
auc.head(15)

                          auc    ci_low   ci_high
Isoleucine           0.724823  0.604165  0.831538
Uracil               0.720922  0.602093  0.826518
Creatine             0.705674  0.584924  0.815645
Acetone              0.679433  0.547649  0.795713
Pantothenate         0.676596  0.559742  0.781275
Succinate            0.669504  0.533756  0.797196
Glucose              0.658156  0.542094  0.769908
N,N-Dimethylglycine  0.656028  0.532805  0.774316
Methylguanidine      0.651773  0.527971  0.773909
Glutamine            0.641844  0.523199  0.775583
Hypoxanthine         0.637589  0.512414  0.764835
Alanine              0.636879  0.516322  0.762654
Tartrate             0.627660  0.516927  0.751250
cis-Aconitate        0.624113  0.514316  0.754925
Betaine              0.621631  0.509081  0.752555

AUC distribution#

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(auc['auc'], bins=20, edgecolor='k', color='C0')
ax.axvline(0.7, color='C3', ls='--', label='AUC = 0.7')
ax.set_xlabel('Per-feature ROC AUC')
ax.set_ylabel('# metabolites')
ax.legend()
fig.tight_layout()
plt.show()
print(f'{(auc["auc"] >= 0.7).sum()} metabolites with AUC >= 0.7')

../_images/c6f11db71f3c70f16004bc9709bf5b8c04cd0cd97353dbcc87e703638a7e6286.png

3 metabolites with AUC >= 0.7

2 — Multi-metabolite panel with nested CV#

biomarker_panel runs 5-outer × 3-inner nested CV. For features=10 it pre-screens to the top 10 by univariate AUC — note this leaks the test fold (see docstring caveat); pass an explicit feature list from an independent screening cohort for publication estimates.

panel = ov.metabol.biomarker_panel(
    adata,
    group_col='group',
    pos_group='cachexic', neg_group='control',
    features=10,
    classifier='lr',        # try 'rf' or 'svm' too
    cv_outer=5, cv_inner=3,
    n_permutations=100,     # permutation null
    seed=0,
)
print(f'Mean outer-fold AUC : {panel.mean_auc:.3f} ± {panel.std_auc:.3f}')
print(f'Permutation p-value : {panel.permutation_pvalue:.3f}')
print(f'Top features        :')
print(panel.feature_importance)

Mean outer-fold AUC : 0.843 ± 0.091
Permutation p-value : 0.010
Top features        :
Isoleucine             0.340771
Pantothenate           0.230042
Glucose                0.228501
Creatine               0.215480
Uracil                 0.202967
Succinate              0.166891
Methylguanidine        0.117726
Acetone                0.117388
Glutamine              0.116021
N,N-Dimethylglycine    0.098620
Name: importance, dtype: float64

Per-fold AUC and feature importance#

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
ax1.bar(range(1, panel.cv_outer + 1), panel.outer_aucs,
        color='C0', edgecolor='k')
ax1.axhline(0.5, color='gray', ls='--')
ax1.set_xlabel('Outer CV fold')
ax1.set_ylabel('Held-out AUC')
ax1.set_ylim(0, 1)

panel.feature_importance.iloc[::-1].plot.barh(ax=ax2, color='C2',
                                                edgecolor='k')
ax2.set_xlabel('Mean importance across outer folds')
fig.tight_layout()
plt.show()

../_images/912e46c7e8a31712e21c67e206e71c94355dac2f688bdf626b3d780a1ca6a999.png

Out-of-fold ROC curve for the panel#

from sklearn.metrics import roc_curve, auc as _auc
fpr, tpr, _ = roc_curve(panel.outer_labels, panel.outer_predictions)
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, color='C3', lw=2,
        label=f'nested CV AUC = {_auc(fpr, tpr):.3f}')
ax.plot([0, 1], [0, 1], color='gray', ls='--')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend(loc='lower right')
ax.set_aspect('equal')
fig.tight_layout()
plt.show()

../_images/d78e202c160c65a845ef7c4e8917e443209196aca83baf3028e387ef1d568baa.png

Takeaways#

Start with roc_feature for a quick univariate screen.
Escalate to biomarker_panel to estimate the joint AUC of a small panel with unbiased nested-CV.
The permutation p (small, here likely < 0.05) is the key sanity check — if it’s not small your panel is overfitting noise.