Lipidomics with LIPID MAPS and LION

Lipidomics with LIPID MAPS and LION#

Lipidomics data has distinctive structural properties that make a dedicated analysis path worthwhile:

Features follow the LIPID MAPS shorthand ("PC 34:1", "TAG 54:3", "Cer d18:1/24:0") — you can extract lipid class + total carbons + total double bonds from the feature name alone
Lipid classes (PC, PE, TAG, Cer, SM, …) are biologically-meaningful groupings — per-class totals often tell the story as clearly as per-species analysis
The LION ontology (Molenaar et al. 2019, bioRxiv) groups lipid classes by shared biology (membrane vs storage, subcellular location, signalling role) and is the lipidomics analogue of GO for genes

This tutorial uses a real breast-cancer tissue lipidomics dataset from the lipidr R package (Mohamed et al. 2020, Bioinformatics): 20 patient samples × 412 lipid species, Cancer vs Benign. Every step below is on real published data — no synthetic spike-ins.

Covered:

parse_lipid — decode LIPID MAPS shorthand into structured fields
annotate_lipids — stamp class / carbons / double-bonds on adata.var
aggregate_by_class — collapse the per-species matrix to class-level totals
lion_enrichment — over-representation against the LION ontology
Standard univariate + multivariate pipeline on the lipid matrix
When to analyze at species-level vs class-level

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import anndata as ad

import omicverse as ov

ov.plot_set()

🔬 Starting plot initialization...
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] NVIDIA H100 80GB HBM3
      Memory: 79.1 GB | Compute: 9.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.1.2rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

1 — Download the dataset#

The lipidr package ships two files in its inst/extdata/:

brca_matrix.csv — 412 lipid species × 20 samples (features-in-rows)
brca_clin.csv — sample-level clinical metadata (SampleType, Race, Stage, Tumor.Type)

Both live on GitHub (MIT-licensed) so we can pull them directly with ov.datasets.download_data. Total size: ~100 KB each.

matrix_path = ov.datasets.download_data(
    url='https://raw.githubusercontent.com/ahmohamed/lipidr/master/inst/extdata/brca_matrix.csv',
    file_path='brca_matrix.csv', dir='./metabol_data',
)
clin_path = ov.datasets.download_data(
    url='https://raw.githubusercontent.com/ahmohamed/lipidr/master/inst/extdata/brca_clin.csv',
    file_path='brca_clin.csv', dir='./metabol_data',
)

🔍 Downloading data to ./metabol_data/brca_matrix.csv
✅ Download completed
🔍 Downloading data to ./metabol_data/brca_clin.csv
✅ Download completed

2 — Load into AnnData#

The layout is features-in-rows (412 lipids × 20 samples), so we transpose and join the clinical table on the sample ID. We also drop the 2 Metastasis samples to get a clean two-group comparison — Cancer (n=8) vs Benign (n=10).

The matrix is intensity-normalized (data comes from Purwaha et al. 2018, already log2-ish scaled by lipidr). We treat it as normalized concentration and skip PQN.

mat = pd.read_csv(matrix_path, index_col=0)      # 412 × 20, features-in-rows
clin = pd.read_csv(clin_path).set_index('Sample')

# transpose to samples-in-rows; align clinical metadata
X_df = mat.T                                       # 20 × 412
adata = ad.AnnData(
    X=X_df.to_numpy(dtype=np.float64),
    obs=clin.loc[X_df.index].copy(),
    var=pd.DataFrame(index=X_df.columns),
)
# Drop Metastasis for a clean 2-group comparison
adata = adata[adata.obs['SampleType'].isin(['Cancer', 'Benign'])].copy()
adata.obs['group'] = adata.obs['SampleType'].astype(str)
print(adata)
print('group split:', adata.obs['group'].value_counts().to_dict())
print('first 5 lipids:', list(adata.var_names[:5]))

AnnData object with n_obs × n_vars = 18 × 413
    obs: 'SampleType', 'Race', 'Stage', 'Tumor.Type', 'group'
group split: {'Benign': 10, 'Cancer': 8}
first 5 lipids: ['CE 16:0', 'CE 16:1', 'CE 18:0', 'CE 18:1', 'CE 18:2']

3 — Parse LIPID MAPS shorthand#

ov.metabol.parse_lipid(name) decodes a single LIPID MAPS shorthand string. It returns a LipidIdentity dataclass with lipid_class, total_carbons, total_db, backbone, and raw. If the string doesn’t look like a lipid, the function returns None — so downstream code can filter or annotate.

Shorthand rules (abbreviated)#

Shorthand	Components	Example
`<CLASS> <C>:<DB>`	sum-composition level — class, total carbons, total double bonds	`PC 34:1` (PC, 34 C, 1 DB)
`<CLASS> <backbone>/<C>:<DB>`	sphingolipids — sphingosine (or variant) backbone + N-acyl	`Cer d18:1/24:0` (Cer, d18:1 sphingosine, C24 saturated acyl)

Lipid classes the parser understands are in ov.metabol.LIPID_CLASSES (extensible). Glycerophospholipids (PC, PE, PS, PG, PI, PA, BMP + their lyso-forms), sphingolipids (SM, Cer + glyco-forms), neutral lipids (TAG, DAG, MAG, CE), and fatty acids are all covered.

# Demonstrate on a few canonical examples + some brca entries
for name in ['PC 34:1', 'TAG 54:3', 'Cer d18:1/24:0', 'LPC 18:0',
             adata.var_names[0], adata.var_names[200]]:
    lid = ov.metabol.parse_lipid(name)
    print(f'  {name:25s} → {lid}')

PC 34:1                   → LipidIdentity(lipid_class='PC', total_carbons=34, total_db=1, backbone=None, raw='PC 34:1')
  TAG 54:3                  → LipidIdentity(lipid_class='TAG', total_carbons=54, total_db=3, backbone=None, raw='TAG 54:3')
  Cer d18:1/24:0            → LipidIdentity(lipid_class='CER', total_carbons=24, total_db=0, backbone='d18:1', raw='Cer d18:1/24:0')
  LPC 18:0                  → LipidIdentity(lipid_class='LPC', total_carbons=18, total_db=0, backbone=None, raw='LPC 18:0')
  CE 16:0                   → LipidIdentity(lipid_class='CE', total_carbons=16, total_db=0, backbone=None, raw='CE 16:0')
  PC 16:1/24:1              → LipidIdentity(lipid_class='PC', total_carbons=16, total_db=1, backbone=None, raw='PC 16:1/24:1')

4 — Annotate the full matrix#

ov.metabol.annotate_lipids(adata) applies parse_lipid to every adata.var_name and writes four new columns to adata.var:

Column	Type	Meaning
`lipid_class`	str	e.g. `"PC"`, `"TAG"`, `"CER"`, `None` for unparseable
`total_carbons`	float	sum of acyl-chain carbons (nan if unparseable)
`total_db`	float	total double bonds
`lipid_backbone`	str	`"d18:1"` for ceramides, else `None`

The function returns a copy, not in-place, so the original is preserved for comparison.

adata = ov.metabol.annotate_lipids(adata)
print('class distribution:')
print(adata.var['lipid_class'].value_counts())
print(f'\nunparseable rows: {adata.var["lipid_class"].isna().sum()}')

class distribution:
lipid_class
PC    172
TG    155
DG     59
CE      8
PA      3
Name: count, dtype: int64

unparseable rows: 16

19 unparseable lipids — those are exotic species (plasmenyl-PC, plasmenyl-PE) that use a slightly different shorthand the parser doesn’t cover yet. You could extend LIPID_CLASSES or drop them for class-level analysis; we’ll drop them below.

5 — Class-level aggregation#

A common first-pass view of lipidomics data is per-class total intensity: does the tumor have more total TAG than the benign controls? More total PC? aggregate_by_class(adata, agg='sum') collapses the n_species × n_samples matrix into a n_classes × n_samples one, with the default aggregation being the sum across species.

Aggregation options:

`agg=`	What it computes	When to use
`"sum"`	per-sample sum of all species in a class	“total PC abundance” — most common
`"mean"`	per-sample mean	normalize for class size, useful when classes have very different member counts
`"median"`	per-sample median	robust to outlier species

class_adata = ov.metabol.aggregate_by_class(adata, agg='sum')
class_adata.obs = adata.obs          # propagate clinical metadata
print('class-level matrix:', class_adata.shape)
print('\nclass → n_species:')
print(class_adata.var)

class-level matrix: (18, 5)

class → n_species:
    n_species
CE          8
DG         59
PA          3
PC        172
TG        155

# Quick per-class stacked-bar by group — the first plot every lipidomics paper shows
import seaborn as sns

df = pd.DataFrame(class_adata.X, index=class_adata.obs_names, columns=class_adata.var_names)
df['group'] = class_adata.obs['group'].values
df_long = df.groupby('group').mean().T
df_pct = df_long.div(df_long.sum(axis=0), axis=1) * 100

fig, ax = plt.subplots(figsize=(5, 4))
df_pct.T.plot(kind='bar', stacked=True, ax=ax, colormap='tab20', width=0.6)
ax.set_ylabel('% of total lipid intensity')
ax.set_xlabel('group')
ax.legend(loc='center left', bbox_to_anchor=(1.01, 0.5), frameon=False, fontsize=8)
ax.set_title('Lipid-class composition by group')
plt.tight_layout(); plt.show()

../_images/443aee6ade7e110f304d243d32f65ec9858e2ef7fa6db589f67582746614b145.png

On brca the main class-level signal visible from this plot is a shift in TAG / CE balance between tumor and benign — consistent with the published Purwaha 2018 finding that tumor tissue has altered neutral-lipid storage.

6 — Species-level differential + volcano#

Now we run per-species differential analysis — same machinery as notebook 1, on the lipid matrix.

# Drop unparseable species and re-run on the clean matrix
adata_clean = adata[:, adata.var['lipid_class'].notna()].copy()
print('clean matrix:', adata_clean.shape)

# Data is already log-scaled by lipidr, so we use log_transformed=True
deg = ov.metabol.differential(
    adata_clean, group_col='group',
    group_a='Cancer', group_b='Benign',
    method='welch_t', log_transformed=True,
)
print(f'significant at padj<0.05: {(deg.padj<0.05).sum()}')
print(f'significant at padj<0.10: {(deg.padj<0.10).sum()}')

fig, ax = ov.metabol.volcano(deg, padj_thresh=0.10, log2fc_thresh=0.3, label_top_n=8)
ax.set_title(f'Lipid species — Cancer vs Benign (n={adata_clean.n_obs})')
plt.tight_layout(); plt.show()

clean matrix: (18, 397)
significant at padj<0.05: 4
significant at padj<0.10: 69

../_images/7eeb7d8031e03f0ae8d102f0a382c7b958f5927a77e586c5086aca7a123391a9.png

7 — LION ontology enrichment#

LION (LIpid Ontology, Molenaar 2019) organizes lipid classes into categories:

Class — which chemical family (Glycerophospholipids, Sphingolipids, Neutral lipids, Fatty acids)
Function — biological role (Membrane constituents, Energy storage, Signalling lipids)
Subcellular — predominant localization (Plasma membrane outer/inner leaflet, Mitochondrial, ER)

ov.metabol.lion_enrichment(hits, background) runs Fisher’s-exact ORA of these terms against a hit list. By default it calls ov.metabol.fetch_lion_associations() which downloads the full ~150-term LION ontology from GitHub (martijnmolenaar/lipidontology.com, ~14 MB) on first use and caches it at ~/.cache/omicverse/metabol/. Pass your own ontology= dict for domain-specific ontologies or to stay offline.

Parameters#

Argument	Meaning
`hits`	lipid names in LIPID MAPS shorthand
`background`	all tested lipids
`ontology`	override the local LION subset (dict)
`min_size`	skip terms with fewer overlapping classes in the background

hits = deg[deg.padj < 0.10].index.tolist()
background = deg.index.tolist()
print(f'{len(hits)} differential lipid species at padj<0.10')

lion = ov.metabol.lion_enrichment(hits, background, min_size=2)
if lion.empty:
    print('no LION terms reached min_size — relax padj threshold')
else:
    lion.head(10)[['term', 'category', 'overlap', 'set_size',
                   'odds_ratio', 'pvalue', 'padj']]

69 differential lipid species at padj<0.10

8 — When to analyze species-level vs class-level#

A common confusion in lipidomics papers: whether to report at the species level (PC 34:1) or class level (PC). Short answer — do both, and use them for different questions:

Question	Level	Reason
“Is total PC up or down in tumor?”	class	sums absorb per-species noise; enough power with small n
“Which specific lipid species drives separation?”	species	OPLS-DA VIP + S-plot; report single-species
“Does membrane / storage / signalling partition shift?”	LION term	aggregates across functional roles, not just chemistry

In a paper you would typically show all three: per-class stacked bar (Figure 2), species volcano (Figure 3), LION enrichment (Figure 4 — or a ridgeplot of the top LION terms’ effect sizes).

Summary#

Step	Function	Notes
Parse name	`ov.metabol.parse_lipid`	returns None for unrecognized shorthand
Annotate matrix	`ov.metabol.annotate_lipids`	adds 4 cols to adata.var
Collapse to class	`ov.metabol.aggregate_by_class`	`agg="sum"` / `"mean"` / `"median"`
Species differential	`ov.metabol.differential`	same machinery as targeted metabolomics
LION enrichment	`ov.metabol.lion_enrichment`	category = class / function / subcellular

What’s not in this v0.2 lipidomics path (roadmap for v0.3)#

linex2-style lipid reaction network (requires shipping the reaction table; needs a license review)
Saturation/carbon-number heatmaps (heatmap by class × (carbons, DB)) — easy to build on top of annotate_lipids, PR welcome
LipidBlast / LipidSearch identification — that’s an upstream step, same as mzML → peak table
Sphingolipid-specific tools (sphingosine headgroup deconvolution)