Differential abundance: Wilcoxon vs DESeq2 vs ANCOM-BC

Differential abundance: Wilcoxon vs DESeq2 vs ANCOM-BC#

Microbiome differential abundance (DA) is the question “which taxa differ between groups?” — the microbiome analogue of a DEG test. Three families of methods dominate the literature, each with different modelling assumptions and different failure modes:

Method	Data model	Handles sparsity?	Handles compositionality?	R dep?
Wilcoxon	Mann-Whitney U on relative abundance — non-parametric	ok (ignores zero-inflation)	no	no
pyDESeq2	Negative-binomial GLM on raw counts — same model as bulk RNA-seq	yes (median-of-ratios size factors)	partial	no
ANCOM-BC	Bias-corrected log-ratios — CLR-like, designed for compositional data	yes	yes	no

The three calls are already wired in omicverse.micro:

ov.micro.DA(adata).wilcoxon(group_key='group', rank='genus')
ov.micro.DA(adata).deseq2(group_key='group',   rank='genus')
ov.micro.DA(adata).ancombc(group_key='group',  rank='genus')

This notebook runs all three on the same 20-sample mouse-gut 16S dataset (the mothur MiSeq SOP, produced by the main 16S tutorial) and quantifies how much they agree.

Key reference: Nearing et al. 2022 (Nat. Commun. 13, 342) benchmarked 14 DA methods on 38 16S studies and concluded that ANCOM-BC and LEfSe produced the most reproducible results across studies, while Wilcoxon was the most conservative and DESeq2 the most liberal of the parametric methods. The plots in this notebook reproduce that qualitative conclusion on a single dataset.

1. Setup#

We reuse the AnnData produced by t_16s_amplicon.ipynb — 20 samples × 598 ASVs, with obs['group'] labelling the Early vs Late time-points of the Kozich et al. 2013 mouse-gut time-series.

import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import omicverse as ov
import anndata as ad

ov.plot_set()
print('omicverse:', ov.__version__)

# Path to the AnnData saved by t_16s_amplicon.ipynb.
adata_path = Path(os.environ.get(
    'OMICVERSE_MICRO_SOP_H5AD',
    '/scratch/users/steorra/analysis/omicverse_dev/cache/16s/run_mothur_sop/mothur_sop_16s.h5ad',
))
adata_full = ad.read_h5ad(adata_path)
# The SOP ships one Mock community sample (group='Mock') as a sequencing
# control — drop it so the DA comparison is a clean two-group test.
adata = adata_full[adata_full.obs['group'].isin(['Early', 'Late'])].copy()
print('n_samples by group:', adata.obs['group'].value_counts().to_dict())
adata

🔬 Starting plot initialization...
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] NVIDIA H100 80GB HBM3
      Memory: 79.1 GB | Compute: 9.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.1.2rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

omicverse: 2.1.2rc1
n_samples by group: {'Late': 10, 'Early': 9}

AnnData object with n_obs × n_vars = 19 × 598
    obs: 'sample', 'day', 'group', 'shannon', 'observed_otus', 'simpson'
    var: 'sequence', 'domain', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'taxonomy', 'sintax_raw', 'sintax_confidence'
    uns: 'micro', 'pipeline'
    obsm: 'braycurtis_pcoa'
    obsp: 'braycurtis'

2. Collapse ASVs to genus#

All three methods consume the same genus-level table so comparisons aren’t confounded by different ASV-grouping choices. ANCOM-BC in particular shines at higher ranks where sparsity is less extreme.

adata_genus = ov.micro.collapse_taxa(adata, rank='genus')
print('samples × genera:', adata_genus.shape)
print('group sizes:\n', adata_genus.obs['group'].value_counts())

samples × genera: (19, 51)
group sizes:
 group
Late     10
Early     9
Name: count, dtype: int64

3. Run the three DA methods#

All three share the same signature — group_key, optional group_a / group_b (default: the two sorted unique values), optional rank (already collapsed here, so rank=None), and a min_prevalence filter.

wx = ov.micro.DA(adata_genus).wilcoxon(
    group_key='group', group_a='Early', group_b='Late', min_prevalence=0.1,
)
print('Wilcoxon — tested', len(wx), 'genera; significant @ FDR 0.05:',
      int((wx['fdr_bh'] < 0.05).sum()))
wx.head()

Wilcoxon — tested 47 genera; significant @ FDR 0.05: 12

         feature  mean_Early  mean_Late  log2FC(Late/Early)  U_stat   p_value  \
 Anaerotignum    0.002203   0.000299           -2.882933    89.0  0.000374   
         Zea    0.002203   0.000031           -6.159047    84.5  0.000575   
 Anaeroplasma    0.004734   0.000312           -3.921901    86.0  0.000783   
 Muribaculum    0.000375   0.001669            2.153997     4.0  0.000939   
Turicibacter    0.003258   0.019708            2.596515     4.0  0.000944   

    prevalence   fdr_bh  
   0.842105  0.00887  
  0.473684  0.00887  
   0.684211  0.00887  
  0.894737  0.00887  
  0.947368  0.00887  

ds = ov.micro.DA(adata_genus).deseq2(
    group_key='group', group_a='Early', group_b='Late', min_prevalence=0.1,
)
print('DESeq2  — tested', len(ds), 'genera; significant @ FDR 0.05:',
      int((ds['fdr_bh'] < 0.05).sum()))
ds.head()

DESeq2  — tested 47 genera; significant @ FDR 0.05: 17

                      feature   base_mean  log2FC(Late/Early)  log2FC_se  \
Clostridium_sensu_stricto    7.065780            3.942470   0.659801   
              Muribaculum    5.890730            2.618368   0.482056   
            Lactobacillus  168.339099            2.408439   0.468433   
              Anaerotignum    5.756394           -2.615394   0.565759   
                      Zea    4.089835           -4.534474   1.041442   

        stat       p_value        fdr_bh  prevalence  
5.975239  2.297528e-09  8.960359e-08    0.736842  
5.431665  5.583076e-08  1.088700e-06    0.894737  
5.141478  2.725855e-07  3.543612e-06    1.000000  
-4.622810  3.785763e-06  3.691119e-05    0.842105  
-4.354035  1.336543e-05  1.042504e-04    0.473684  

ab = ov.micro.DA(adata_genus).ancombc(
    group_key='group', min_prevalence=0.1,
)
sig_col = 'q_value' if 'q_value' in ab.columns else 'fdr_bh'
print('ANCOM-BC — tested', len(ab), 'genera; significant @ q 0.05:',
      int((ab[sig_col] < 0.05).sum()))
ab.head()

ANCOM-BC — tested 47 genera; significant @ q 0.05: 16

         feature       lfc        se         W   p_value   q_value  diff_abn  \
   Unassigned -0.134216  0.107723 -1.245934  0.212788  1.000000     False   
Acetatifactor -1.014216  0.322101 -3.148748  0.001640  0.050831     False   
Acinetobacter  0.096679  0.206527  0.468119  0.639700  1.000000     False   
Adlercreutzia  0.049084  0.224929  0.218220  0.827257  1.000000     False   
  Akkermansia -0.631061  0.260633 -2.421265  0.015467  0.417598     False   

   prevalence  
  1.000000  
  1.000000  
  0.105263  
  0.947368  
  0.105263  

4. How much do the three methods agree?#

Counting significant features is the crudest comparison but the most actionable one for a reviewer. The Venn below shows the overlap of the sets of genera each method flagged as significant at FDR (or q) < 0.05.

sig_wx = set(wx.loc[wx['fdr_bh']  < 0.05, 'feature'])
sig_ds = set(ds.loc[ds['fdr_bh']  < 0.05, 'feature'])
ancombc_sig_col = 'q_value' if 'q_value' in ab.columns else 'fdr_bh'
sig_ab = set(ab.loc[ab[ancombc_sig_col] < 0.05, 'feature'])

# Pure-matplotlib Venn so the notebook works without matplotlib_venn.
def _venn3_counts(a, b, c):
    return {
        'a_only':  len(a - b - c),
        'b_only':  len(b - a - c),
        'c_only':  len(c - a - b),
        'ab':      len((a & b) - c),
        'ac':      len((a & c) - b),
        'bc':      len((b & c) - a),
        'abc':     len(a & b & c),
    }

counts = _venn3_counts(sig_wx, sig_ds, sig_ab)
print('Wilcoxon only :', counts['a_only'])
print('DESeq2 only   :', counts['b_only'])
print('ANCOM-BC only :', counts['c_only'])
print('Wilcoxon ∩ DESeq2            :', counts['ab'])
print('Wilcoxon ∩ ANCOM-BC          :', counts['ac'])
print('DESeq2   ∩ ANCOM-BC          :', counts['bc'])
print('All three                    :', counts['abc'])

try:
    from matplotlib_venn import venn3
    fig, ax = plt.subplots(figsize=(5, 5))
    venn3(
        [sig_wx, sig_ds, sig_ab],
        set_labels=('Wilcoxon', 'DESeq2', 'ANCOM-BC'),
        ax=ax,
    )
    ax.set_title('Genera significant at FDR/q < 0.05')
    plt.show()
except ImportError:
    print('\n(install matplotlib_venn for a rendered Venn diagram)')

Wilcoxon only : 0
DESeq2 only   : 4
ANCOM-BC only : 4
Wilcoxon ∩ DESeq2            : 2
Wilcoxon ∩ ANCOM-BC          : 1
DESeq2   ∩ ANCOM-BC          : 2
All three                    : 9

(install matplotlib_venn for a rendered Venn diagram)

5. Do the methods agree on effect size?#

Even when they disagree on significance, do the three methods at least point in the same direction (same sign of log2FC)? The scatter below uses the set of genera each pair of methods both tested.

# Align the three tables on a common feature index.
wx_fc = wx.set_index('feature')['log2FC(Late/Early)']
ds_fc = ds.set_index('feature')['log2FC(Late/Early)']
ab_fc = ab.set_index('feature')['lfc']   # ANCOM-BC uses natural-log lfc
# Convert ANCOM-BC ln-fold-change to log2 for apples-to-apples.
ab_fc_log2 = ab_fc / np.log(2)

common = wx_fc.index.intersection(ds_fc.index).intersection(ab_fc_log2.index)
print('genera tested by all 3 methods:', len(common))

fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=False, sharey=False)
pairs = [
    ('Wilcoxon', 'DESeq2',   wx_fc[common], ds_fc[common]),
    ('Wilcoxon', 'ANCOM-BC', wx_fc[common], ab_fc_log2[common]),
    ('DESeq2',   'ANCOM-BC', ds_fc[common], ab_fc_log2[common]),
]
for ax, (xn, yn, x, y) in zip(axes, pairs):
    ax.scatter(x.values, y.values, s=12, alpha=0.6, color='#4682B4')
    lim = max(abs(np.nanmin([x.min(), y.min()])),
              abs(np.nanmax([x.max(), y.max()]))) * 1.1
    ax.plot([-lim, lim], [-lim, lim], 'k--', lw=0.7, alpha=0.5)
    ax.axhline(0, color='grey', lw=0.5)
    ax.axvline(0, color='grey', lw=0.5)
    ax.set_xlabel(f'{xn} log2FC (Late/Early)')
    ax.set_ylabel(f'{yn} log2FC (Late/Early)')
    r = np.corrcoef(x.dropna().values,
                    y.reindex(x.dropna().index).values)[0, 1]
    ax.set_title(f'{xn} vs {yn}  (Pearson r = {r:.2f})')
plt.tight_layout()
plt.show()

genera tested by all 3 methods: 47

../_images/dab00e8cee396ae9c3a8b59c77aef0c04601b46fc59cc1b60399a7dd9cbcf069.png

6. Which method should you use?#

The 2022 Nearing benchmark and the later Calgaro et al. 2020 mixOmics-DA review converge on the following practical guidance:

Scenario	Best first choice	Why
Exploratory screen, small cohort (n < 20 per group)	Wilcoxon	robust, no distributional assumption, conservative — few false positives
Well-powered cohort, compositional nature matters	ANCOM-BC	explicit bias correction for sequencing-depth–induced compositionality
RNA-seq-style analysis on high-count, low-sparsity table (e.g. shotgun MAG counts)	pyDESeq2	mature negative-binomial model; reuses the same dispersions you’d trust from bulk
You must publish one method	ANCOM-BC + Wilcoxon intersection	features flagged by both are robust across assumptions

Tunables that matter. All three ov.micro.DA methods accept min_prevalence to drop ultra-rare features before testing — the default 0.1 (≥10% of samples) is a good starting point. For 16S ASV-level tests without collapsing, bump to 0.2–0.3; for collapsed genus tables with only a few hundred rows, 0.1 is fine.

What’s stored on the AnnData. Every DA call writes its result table into adata.uns['micro']['da'] under a method-specific key, so you can run all three and save one h5ad:

list(adata_genus.uns['micro']['da'].keys())
# ['wilcoxon_group_Early_vs_Late_asv',
#  'deseq2_group_Early_vs_Late_asv',
#  'ancombc_group_asv']

References#

Nearing, J. T., Douglas, G. M., Hayes, M. G., MacDonald, J., Desai, D. K., Allward, N., Jones, C. M. A., Wright, R. J., Dhanani, A. S., Comeau, A. M., & Langille, M. G. I. (2022). Microbiome differential abundance methods produce different results across 38 datasets. Nature Communications, 13(1), 342. https://doi.org/10.1038/s41467-022-28034-z
Lin, H., & Peddada, S. D. (2020). Analysis of compositions of microbiomes with bias correction. Nature Communications, 11(1), 3514. https://doi.org/10.1038/s41467-020-17041-7
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8
Muller, E., Algavi, Y. M., & Borenstein, E. (2021). The gut microbiome-metabolome dataset collection: a curated resource for integrative meta-analysis. npj Biofilms and Microbiomes, 8(1), 79. https://doi.org/10.1038/s41522-022-00345-5