Batch effect and drift correction for LC-MS

Batch effect and drift correction for LC-MS#

LC-MS untargeted runs suffer from two kinds of unwanted variation that ComBat alone cannot handle:

Signal drift within a batch — the injection-time-dependent decay of signal due to source fouling, column aging, etc. Drift is continuous in injection order, not categorical.
Batch shifts between batches — quantified by ComBat when the batch variable is known.

This tutorial shows the recommended combo on a synthetic LC-MS run with planted drift + QC pool samples. Synthetic data lets us verify that correction works before you apply it to your own study.

Pipeline:

ov.metabol.drift_correct — per-feature LOESS fit on QC injection order
ov.metabol.serrf — QC-based Random Forest (Fan 2019) that borrows strength from correlated co-features
ov.bulk.batch_correction — ComBat for residual between-batch shifts

0 — Setup and simulated LC-MS run#

Simulate 60 samples over 2 batches with 10 QC samples per batch, 40 features. First 15 features have a time-dependent exponential drift; real samples have a small case/ctrl effect on features 20–24.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from anndata import AnnData
import omicverse as ov

rng = np.random.default_rng(0)

n_real_per_batch = 20
n_qc_per_batch = 10
n_batches = 2
n_features = 40
drift_features = 15
drift_strength = 1.5

rows, idx = [], []
order = 0
for b in range(n_batches):
    plan = ['QC']*n_qc_per_batch + ['real']*n_real_per_batch
    rng.shuffle(plan)
    for k, kind in enumerate(plan):
        rows.append({
            'sample_type': kind,
            'group': 'QC' if kind == 'QC' else ('case' if rng.random() > 0.5 else 'ctrl'),
            'batch': f'b{b}',
            'order': order,
        })
        idx.append(f'b{b}_s{k}')
        order += 1

obs = pd.DataFrame(rows, index=idx)
n = len(obs)

# Baseline signal + drift
X = rng.lognormal(mean=4.0, sigma=0.3, size=(n, n_features))
max_order = float(obs['order'].max())
t = obs['order'].to_numpy() / max_order
X[:, :drift_features] *= np.exp(drift_strength * t)[:, None]

# Inject a between-batch additive shift on features 30-39 (ComBat target)
b_mask = (obs['batch'] == 'b1').to_numpy()
X[b_mask, 30:] *= 1.4

# Tiny biology effect on features 20-24
case = (obs['group'] == 'case').to_numpy()
X[case, 20:25] *= 1.8

var = pd.DataFrame(index=[f'feat{i}' for i in range(n_features)])
adata = AnnData(X=X, obs=obs, var=var)
adata.shape

(60, 40)

1 — Baseline: QC CV% before any correction#

Features with drift should have inflated CV% on the pool of QC samples — that is the diagnostic we use to evaluate every correction step.

def qc_cv_pct(adata, qc_col='sample_type', qc_label='QC', layer=None):
    mask = (adata.obs[qc_col] == qc_label).to_numpy()
    X = adata.X if layer is None else adata.layers[layer]
    X = np.asarray(X)[mask]
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return np.where(mu > 0, 100 * sd / mu, np.nan)

cv_raw = qc_cv_pct(adata)
print(f'Before correction — drift features mean CV: {cv_raw[:drift_features].mean():.1f}%')
print(f'                  stable features mean CV: {cv_raw[drift_features:30].mean():.1f}%')

Before correction — drift features mean CV: 52.7%
                  stable features mean CV: 27.0%

2 — Step 1: per-feature LOESS drift correction#

drift_correct fits a LOESS curve through the QC samples on each feature (independently) and divides every sample’s value by that fitted curve. This is the classic first-pass correction.

adata_drift = ov.metabol.drift_correct(
    adata,
    injection_order='order',
    qc_mask=(adata.obs['sample_type'] == 'QC').to_numpy(),
)
cv_drift = qc_cv_pct(adata_drift)
print(f'After drift_correct — drift features CV: {cv_drift[:drift_features].mean():.1f}%')

After drift_correct — drift features CV: 27.9%

3 — Step 2: SERRF#

serrf borrows strength from correlated co-features to model each feature’s expected QC baseline. For drifts that are not monotonic (e.g. dips + recoveries) SERRF catches what LOESS cannot.

SERRF stores the original matrix in .layers['raw'] and per-feature CV% before/after in .var:

adata_serrf = ov.metabol.serrf(
    adata,
    qc_col='sample_type', qc_label='QC',
    batch_col='batch',
    top_k=8, n_estimators=100, seed=0,
)

summary = pd.DataFrame({
    'CV_raw': adata_serrf.var['cv_qc_raw'],
    'CV_serrf': adata_serrf.var['cv_qc_serrf'],
})
summary['improvement_%'] = (summary['CV_raw'] - summary['CV_serrf']) / summary['CV_raw'] * 100
summary.head(15)

           CV_raw   CV_serrf  improvement_%
feat0   56.712720  41.317210      27.146485
feat1   50.464151  41.509849      17.743888
feat2   51.106795  28.093674      45.029473
feat3   57.893748  41.949325      27.540836
feat4   54.810053  42.488580      22.480315
feat5   66.226161  43.664605      34.067438
feat6   49.466170  30.746594      37.843189
feat7   46.728802  36.454210      21.987707
feat8   63.352723  32.301074      49.013913
feat9   58.121089  39.154273      32.633276
feat10  53.929962  36.316246      32.660353
feat11  39.701253  30.850989      22.292154
feat12  41.694817  32.690616      21.595491
feat13  50.296132  37.530333      25.381274
feat14  49.698648  29.473423      40.695725

Plot CV% before vs after SERRF on drift features:

fig, ax = plt.subplots(figsize=(6, 4))
is_drift = np.arange(n_features) < drift_features
ax.scatter(summary['CV_raw'][is_drift], summary['CV_serrf'][is_drift],
           color='C3', label='drift features', s=40)
ax.scatter(summary['CV_raw'][~is_drift], summary['CV_serrf'][~is_drift],
           color='C0', label='stable features', s=40, alpha=0.5)
lim = max(summary['CV_raw'].max(), summary['CV_serrf'].max()) * 1.05
ax.plot([0, lim], [0, lim], '--', color='gray', lw=1)
ax.set_xlabel('QC CV% (raw)')
ax.set_ylabel('QC CV% (after SERRF)')
ax.legend()
fig.tight_layout()
plt.show()

../_images/2976d90fa529e77fb44bfa31d6f15ace60e59f4dc12d552e05244f738f474d2b.png

4 — Step 3: ComBat for between-batch shifts#

SERRF corrects within-batch drift but assumes you already dealt with discrete batch shifts. We use ov.bulk.batch_correction (which wraps combat.pycombat) for that. It works on any AnnData with a batch column.

ov.bulk.batch_correction(adata_serrf, batch_key='batch',
                           key_added='batch_correction')
print('ComBat output in adata_serrf.layers["batch_correction"]')
print('shape:', adata_serrf.layers['batch_correction'].shape)

Found 2 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
Storing batch correction result in adata.layers['batch_correction']
ComBat output in adata_serrf.layers["batch_correction"]
shape: (60, 40)

5 — Sample-level outlier detection (`sample_qc`)#

QC / batch correction assumes your samples are themselves usable. A single mis-injected or contaminated sample can dominate a PLS-DA component and hide real biology. ov.metabol.sample_qc fits PCA on adata.X and returns two SIMCA-style diagnostics:

Hotelling T² — distance inside the PCA model. Flags unusual but within-model profiles.
DModX — distance to the residual plane. Flags novel profiles that don’t fit the PCA subspace at all.

A sample flagged by either metric deserves a look before downstream stats. Run on the SERRF-corrected real samples (QC pools excluded):

real_mask = (adata_serrf.obs['sample_type'] == 'real').to_numpy()
adata_real = adata_serrf[real_mask].copy()

qc_diag = ov.metabol.sample_qc(adata_real, n_components=3, alpha=0.95)
print(f'{qc_diag["is_outlier"].sum()} outlier sample(s) out of {len(qc_diag)}')
qc_diag.sort_values('T2', ascending=False).head(10)

8 outlier sample(s) out of 40

              T2     DModX   T2_crit  DModX_crit  T2_flag  DModX_flag  \
b1_s6   9.214746  0.615842  9.039977    0.889306     True       False   
b1_s2   7.518378  0.670938  9.039977    0.889306    False       False   
b1_s15  6.808285  0.676331  9.039977    0.889306    False       False   
b1_s24  5.191500  0.986138  9.039977    0.889306    False        True   
b1_s8   4.994036  0.926844  9.039977    0.889306    False        True   
b1_s26  4.939299  0.788794  9.039977    0.889306    False       False   
b0_s1   4.724327  0.799997  9.039977    0.889306    False       False   
b0_s3   4.563427  0.737520  9.039977    0.889306    False       False   
b1_s0   3.888096  0.869859  9.039977    0.889306    False       False   
b0_s13  3.760173  0.625570  9.039977    0.889306    False       False   

        is_outlier  
b1_s6         True  
b1_s2        False  
b1_s15       False  
b1_s24        True  
b1_s8         True  
b1_s26       False  
b0_s1        False  
b0_s3        False  
b1_s0        False  
b0_s13       False  

T² vs DModX scatter#

Quadrants of this plot tell you why a sample is outlying:

Top-right (high T², high DModX) — the strongest outlier kind.
Top-left — fits the model but at its edge.
Bottom-right — doesn’t fit the model; novel pattern.

fig, ax = plt.subplots(figsize=(6, 4))
flag = qc_diag['is_outlier'].to_numpy()
ax.scatter(qc_diag['T2'][~flag], qc_diag['DModX'][~flag],
           color='C0', edgecolor='k', s=40, label='normal')
ax.scatter(qc_diag['T2'][flag], qc_diag['DModX'][flag],
           color='C3', edgecolor='k', s=60, label='outlier')
ax.axvline(qc_diag['T2_crit'].iloc[0], color='gray', ls='--', lw=1)
ax.axhline(qc_diag['DModX_crit'].iloc[0], color='gray', ls='--', lw=1)
ax.set_xlabel('Hotelling T²')
ax.set_ylabel('DModX')
ax.legend()
fig.tight_layout()
plt.show()

../_images/bc1824b88bae96315297cd55ba1fa7128dab90a127861963209d58bbca72e798.png

Summary#

For a typical LC-MS study the recommended order is:

adata = ov.metabol.drift_correct(adata, injection_order=..., qc_mask=...)
adata = ov.metabol.serrf(adata, qc_col=..., batch_col=...)
ov.bulk.batch_correction(adata, batch_key=...)    # if needed

After this, standard downstream steps (impute, normalize, transform, differential, plsda/opls_da, MSEA) apply without modification.