Differential abundance: Wilcoxon vs DESeq2 vs ANCOM-BC#

Microbiome differential abundance (DA) is the question “which taxa differ between groups?” — the microbiome analogue of a DEG test. Three families of methods dominate the literature, each with different modelling assumptions and different failure modes:

Method

Data model

Handles sparsity?

Handles compositionality?

R dep?

Wilcoxon

Mann-Whitney U on relative abundance — non-parametric

ok (ignores zero-inflation)

no

no

pyDESeq2

Negative-binomial GLM on raw counts — same model as bulk RNA-seq

yes (median-of-ratios size factors)

partial

no

ANCOM-BC

Bias-corrected log-ratios — CLR-like, designed for compositional data

yes

yes

no

The three calls are already wired in omicverse.micro:

ov.micro.DA(adata).wilcoxon(group_key='group', rank='genus')
ov.micro.DA(adata).deseq2(group_key='group',   rank='genus')
ov.micro.DA(adata).ancombc(group_key='group',  rank='genus')

This notebook runs all three on the same 20-sample mouse-gut 16S dataset (the mothur MiSeq SOP, produced by the main 16S tutorial) and quantifies how much they agree.

Key reference: Nearing et al. 2022 (Nat. Commun. 13, 342) benchmarked 14 DA methods on 38 16S studies and concluded that ANCOM-BC and LEfSe produced the most reproducible results across studies, while Wilcoxon was the most conservative and DESeq2 the most liberal of the parametric methods. The plots in this notebook reproduce that qualitative conclusion on a single dataset.

1. Setup#

We reuse the AnnData produced by t_16s_amplicon.ipynb — 20 samples × 598 ASVs, with obs['group'] labelling the Early vs Late time-points of the Kozich et al. 2013 mouse-gut time-series.

import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import omicverse as ov
import anndata as ad

ov.plot_set()
print('omicverse:', ov.__version__)

# Path to the AnnData saved by t_16s_amplicon.ipynb.
adata_path = Path(os.environ.get(
    'OMICVERSE_MICRO_SOP_H5AD',
    '/scratch/users/steorra/analysis/omicverse_dev/cache/16s/run_mothur_sop/mothur_sop_16s.h5ad',
))
adata_full = ad.read_h5ad(adata_path)
# The SOP ships one Mock community sample (group='Mock') as a sequencing
# control — drop it so the DA comparison is a clean two-group test.
adata = adata_full[adata_full.obs['group'].isin(['Early', 'Late'])].copy()
print('n_samples by group:', adata.obs['group'].value_counts().to_dict())
adata
🔬 Starting plot initialization...
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] NVIDIA H100 80GB HBM3
      Memory: 79.1 GB | Compute: 9.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.1.2rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

omicverse: 2.1.2rc1
n_samples by group: {'Late': 10, 'Early': 9}
AnnData object with n_obs × n_vars = 19 × 598
    obs: 'sample', 'day', 'group', 'shannon', 'observed_otus', 'simpson'
    var: 'sequence', 'domain', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'taxonomy', 'sintax_raw', 'sintax_confidence'
    uns: 'micro', 'pipeline'
    obsm: 'braycurtis_pcoa'
    obsp: 'braycurtis'

2. Collapse ASVs to genus#

All three methods consume the same genus-level table so comparisons aren’t confounded by different ASV-grouping choices. ANCOM-BC in particular shines at higher ranks where sparsity is less extreme.

adata_genus = ov.micro.collapse_taxa(adata, rank='genus')
print('samples × genera:', adata_genus.shape)
print('group sizes:\n', adata_genus.obs['group'].value_counts())
samples × genera: (19, 51)
group sizes:
 group
Late     10
Early     9
Name: count, dtype: int64

3. Run the three DA methods#

All three share the same signature — group_key, optional group_a / group_b (default: the two sorted unique values), optional rank (already collapsed here, so rank=None), and a min_prevalence filter.

wx = ov.micro.DA(adata_genus).wilcoxon(
    group_key='group', group_a='Early', group_b='Late', min_prevalence=0.1,
)
print('Wilcoxon — tested', len(wx), 'genera; significant @ FDR 0.05:',
      int((wx['fdr_bh'] < 0.05).sum()))
wx.head()
Wilcoxon — tested 47 genera; significant @ FDR 0.05: 12
         feature  mean_Early  mean_Late  log2FC(Late/Early)  U_stat   p_value  \
8   Anaerotignum    0.002203   0.000299           -2.882933    89.0  0.000374   
46           Zea    0.002203   0.000031           -6.159047    84.5  0.000575   
7   Anaeroplasma    0.004734   0.000312           -3.921901    86.0  0.000783   
31   Muribaculum    0.000375   0.001669            2.153997     4.0  0.000939   
45  Turicibacter    0.003258   0.019708            2.596515     4.0  0.000944   

    prevalence   fdr_bh  
8     0.842105  0.00887  
46    0.473684  0.00887  
7     0.684211  0.00887  
31    0.894737  0.00887  
45    0.947368  0.00887  
ds = ov.micro.DA(adata_genus).deseq2(
    group_key='group', group_a='Early', group_b='Late', min_prevalence=0.1,
)
print('DESeq2  — tested', len(ds), 'genera; significant @ FDR 0.05:',
      int((ds['fdr_bh'] < 0.05).sum()))
ds.head()
DESeq2  — tested 47 genera; significant @ FDR 0.05: 17
                      feature   base_mean  log2FC(Late/Early)  log2FC_se  \
17  Clostridium_sensu_stricto    7.065780            3.942470   0.659801   
31                Muribaculum    5.890730            2.618368   0.482056   
25              Lactobacillus  168.339099            2.408439   0.468433   
8                Anaerotignum    5.756394           -2.615394   0.565759   
46                        Zea    4.089835           -4.534474   1.041442   

        stat       p_value        fdr_bh  prevalence  
17  5.975239  2.297528e-09  8.960359e-08    0.736842  
31  5.431665  5.583076e-08  1.088700e-06    0.894737  
25  5.141478  2.725855e-07  3.543612e-06    1.000000  
8  -4.622810  3.785763e-06  3.691119e-05    0.842105  
46 -4.354035  1.336543e-05  1.042504e-04    0.473684  
ab = ov.micro.DA(adata_genus).ancombc(
    group_key='group', min_prevalence=0.1,
)
sig_col = 'q_value' if 'q_value' in ab.columns else 'fdr_bh'
print('ANCOM-BC — tested', len(ab), 'genera; significant @ q 0.05:',
      int((ab[sig_col] < 0.05).sum()))
ab.head()
ANCOM-BC — tested 47 genera; significant @ q 0.05: 16
         feature       lfc        se         W   p_value   q_value  diff_abn  \
0     Unassigned -0.134216  0.107723 -1.245934  0.212788  1.000000     False   
1  Acetatifactor -1.014216  0.322101 -3.148748  0.001640  0.050831     False   
2  Acinetobacter  0.096679  0.206527  0.468119  0.639700  1.000000     False   
3  Adlercreutzia  0.049084  0.224929  0.218220  0.827257  1.000000     False   
4    Akkermansia -0.631061  0.260633 -2.421265  0.015467  0.417598     False   

   prevalence  
0    1.000000  
1    1.000000  
2    0.105263  
3    0.947368  
4    0.105263  

4. How much do the three methods agree?#

Counting significant features is the crudest comparison but the most actionable one for a reviewer. The Venn below shows the overlap of the sets of genera each method flagged as significant at FDR (or q) < 0.05.

sig_wx = set(wx.loc[wx['fdr_bh']  < 0.05, 'feature'])
sig_ds = set(ds.loc[ds['fdr_bh']  < 0.05, 'feature'])
ancombc_sig_col = 'q_value' if 'q_value' in ab.columns else 'fdr_bh'
sig_ab = set(ab.loc[ab[ancombc_sig_col] < 0.05, 'feature'])

# Pure-matplotlib Venn so the notebook works without matplotlib_venn.
def _venn3_counts(a, b, c):
    return {
        'a_only':  len(a - b - c),
        'b_only':  len(b - a - c),
        'c_only':  len(c - a - b),
        'ab':      len((a & b) - c),
        'ac':      len((a & c) - b),
        'bc':      len((b & c) - a),
        'abc':     len(a & b & c),
    }

counts = _venn3_counts(sig_wx, sig_ds, sig_ab)
print('Wilcoxon only :', counts['a_only'])
print('DESeq2 only   :', counts['b_only'])
print('ANCOM-BC only :', counts['c_only'])
print('Wilcoxon ∩ DESeq2            :', counts['ab'])
print('Wilcoxon ∩ ANCOM-BC          :', counts['ac'])
print('DESeq2   ∩ ANCOM-BC          :', counts['bc'])
print('All three                    :', counts['abc'])

try:
    from matplotlib_venn import venn3
    fig, ax = plt.subplots(figsize=(5, 5))
    venn3(
        [sig_wx, sig_ds, sig_ab],
        set_labels=('Wilcoxon', 'DESeq2', 'ANCOM-BC'),
        ax=ax,
    )
    ax.set_title('Genera significant at FDR/q < 0.05')
    plt.show()
except ImportError:
    print('\n(install matplotlib_venn for a rendered Venn diagram)')
Wilcoxon only : 0
DESeq2 only   : 4
ANCOM-BC only : 4
Wilcoxon ∩ DESeq2            : 2
Wilcoxon ∩ ANCOM-BC          : 1
DESeq2   ∩ ANCOM-BC          : 2
All three                    : 9

(install matplotlib_venn for a rendered Venn diagram)

5. Do the methods agree on effect size?#

Even when they disagree on significance, do the three methods at least point in the same direction (same sign of log2FC)? The scatter below uses the set of genera each pair of methods both tested.

# Align the three tables on a common feature index.
wx_fc = wx.set_index('feature')['log2FC(Late/Early)']
ds_fc = ds.set_index('feature')['log2FC(Late/Early)']
ab_fc = ab.set_index('feature')['lfc']   # ANCOM-BC uses natural-log lfc
# Convert ANCOM-BC ln-fold-change to log2 for apples-to-apples.
ab_fc_log2 = ab_fc / np.log(2)

common = wx_fc.index.intersection(ds_fc.index).intersection(ab_fc_log2.index)
print('genera tested by all 3 methods:', len(common))

fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=False, sharey=False)
pairs = [
    ('Wilcoxon', 'DESeq2',   wx_fc[common], ds_fc[common]),
    ('Wilcoxon', 'ANCOM-BC', wx_fc[common], ab_fc_log2[common]),
    ('DESeq2',   'ANCOM-BC', ds_fc[common], ab_fc_log2[common]),
]
for ax, (xn, yn, x, y) in zip(axes, pairs):
    ax.scatter(x.values, y.values, s=12, alpha=0.6, color='#4682B4')
    lim = max(abs(np.nanmin([x.min(), y.min()])),
              abs(np.nanmax([x.max(), y.max()]))) * 1.1
    ax.plot([-lim, lim], [-lim, lim], 'k--', lw=0.7, alpha=0.5)
    ax.axhline(0, color='grey', lw=0.5)
    ax.axvline(0, color='grey', lw=0.5)
    ax.set_xlabel(f'{xn} log2FC (Late/Early)')
    ax.set_ylabel(f'{yn} log2FC (Late/Early)')
    r = np.corrcoef(x.dropna().values,
                    y.reindex(x.dropna().index).values)[0, 1]
    ax.set_title(f'{xn} vs {yn}  (Pearson r = {r:.2f})')
plt.tight_layout()
plt.show()
genera tested by all 3 methods: 47
../_images/dab00e8cee396ae9c3a8b59c77aef0c04601b46fc59cc1b60399a7dd9cbcf069.png

6. Which method should you use?#

The 2022 Nearing benchmark and the later Calgaro et al. 2020 mixOmics-DA review converge on the following practical guidance:

Scenario

Best first choice

Why

Exploratory screen, small cohort (n < 20 per group)

Wilcoxon

robust, no distributional assumption, conservative — few false positives

Well-powered cohort, compositional nature matters

ANCOM-BC

explicit bias correction for sequencing-depth–induced compositionality

RNA-seq-style analysis on high-count, low-sparsity table (e.g. shotgun MAG counts)

pyDESeq2

mature negative-binomial model; reuses the same dispersions you’d trust from bulk

You must publish one method

ANCOM-BC + Wilcoxon intersection

features flagged by both are robust across assumptions

Tunables that matter. All three ov.micro.DA methods accept min_prevalence to drop ultra-rare features before testing — the default 0.1 (≥10% of samples) is a good starting point. For 16S ASV-level tests without collapsing, bump to 0.2–0.3; for collapsed genus tables with only a few hundred rows, 0.1 is fine.

What’s stored on the AnnData. Every DA call writes its result table into adata.uns['micro']['da'] under a method-specific key, so you can run all three and save one h5ad:

list(adata_genus.uns['micro']['da'].keys())
# ['wilcoxon_group_Early_vs_Late_asv',
#  'deseq2_group_Early_vs_Late_asv',
#  'ancombc_group_asv']

References#

  • Nearing, J. T., Douglas, G. M., Hayes, M. G., MacDonald, J., Desai, D. K., Allward, N., Jones, C. M. A., Wright, R. J., Dhanani, A. S., Comeau, A. M., & Langille, M. G. I. (2022). Microbiome differential abundance methods produce different results across 38 datasets. Nature Communications, 13(1), 342. https://doi.org/10.1038/s41467-022-28034-z

  • Lin, H., & Peddada, S. D. (2020). Analysis of compositions of microbiomes with bias correction. Nature Communications, 11(1), 3514. https://doi.org/10.1038/s41467-020-17041-7

  • Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8

  • Muller, E., Algavi, Y. M., & Borenstein, E. (2021). The gut microbiome-metabolome dataset collection: a curated resource for integrative meta-analysis. npj Biofilms and Microbiomes, 8(1), 79. https://doi.org/10.1038/s41522-022-00345-5