omicverse.bulk.pyWGCNA


omicverse.bulk.pyWGCNA(name='WGCNA', TPMcutoff=1, powers=None, RsquaredCut=0.9, MeanCut=100, networkType='signed hybrid', TOMType='signed', minModuleSize=50, naColor='grey', cut=inf, MEDissThres=0.2, species=None, level='gene', anndata=None, geneExp=None, geneExpPath=None, sep=',', geneInfo=None, sampleInfo=None, save=False, outputPath=None, figureType='pdf')

Weighted Gene Co-expression Network Analysis.

Identifies highly co-expressed gene modules and relates them to clinical traits / sample metadata. Standard WGCNA workflow:

  1. Preprocess — remove low-expressed genes (TPM cutoff) and outlier samples (Euclidean distance to mean).

  2. Soft-thresholding — pick a power that yields scale-free topology in the gene-gene correlation network.

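  3. Adjacency + TOM – adjacency is the gene-gene correlation raised to the soft-threshold power (|cor|^power for an unsigned network; signed and signed hybrid networks treat negative correlations differently); the topological overlap matrix (TOM) measures how much of their neighbourhood two genes share.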

  4. Dynamic tree cut — hierarchical clustering on 1 - TOM; tree cut yields gene modules (named by colour).

  5. Module eigengenes — first principal component of each module’s expression matrix.

  6. Module-trait correlation — Pearson correlation of each module eigengene against numeric sample traits, with FDR-corrected p-values.
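Step 3 of the workflow above can be sketched in plain NumPy. This is a minimal illustration, not the code path the class uses: the function names `adjacency` and `tom_similarity` are hypothetical, the three network-type formulas are the standard WGCNA definitions, and the TOM shown is the unsigned form, which is adequate here because the adjacency values are non-negative.

```python
import numpy as np


def adjacency(cor, power, network_type="signed hybrid"):
    """Turn a gene-gene correlation matrix into a weighted adjacency matrix."""
    cor = np.asarray(cor, dtype=float)
    if network_type == "unsigned":
        return np.abs(cor) ** power
    if network_type == "signed":
        return ((1.0 + cor) / 2.0) ** power
    if network_type == "signed hybrid":
        # Negative correlations are set to zero before raising to the power.
        return np.where(cor > 0, cor, 0.0) ** power
    raise ValueError(f"unknown network_type: {network_type!r}")


def tom_similarity(adj):
    """Unsigned topological overlap for a symmetric adjacency matrix in [0, 1]."""
    a = np.array(adj, dtype=float)
    np.fill_diagonal(a, 0.0)               # ignore self-connections
    shared = a @ a                         # strength of shared neighbours
    k = a.sum(axis=1)                      # connectivity of each gene
    tom = (shared + a) / (np.minimum.outer(k, k) + 1.0 - a)
    np.fill_diagonal(tom, 1.0)
    return tom


# Toy correlation matrix for three genes (illustrative values only).
cor = np.array([[1.0, 0.5, -0.5],
                [0.5, 1.0, 0.0],
                [-0.5, 0.0, 1.0]])
adj = adjacency(cor, power=2)
tom = tom_similarity(adj)
dissimilarity = 1.0 - tom                  # input to hierarchical clustering
```

The `dissimilarity` matrix is what step 4 clusters; genes with no shared neighbours end up at distance 1.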

Parameters:
  • name (str) – Analysis label, used for output file names.

  • species (str) – Organism (e.g. "mus musculus", "homo sapiens").

  • geneExp (pandas.DataFrame) – Expression matrix shaped (genes × samples). Sample identifiers are the column names, gene identifiers are the index. Note this is the TRANSPOSE of the typical samples × genes layout used by AnnData.

  • TPMcutoff (float, default 1) – Per-gene TPM threshold; genes whose maximum across samples falls below this are dropped during preprocess.

  • powers (list[int], optional) – Candidate soft-threshold powers. Defaults to a 1–30 sweep.

  • networkType ({"signed", "unsigned", "signed hybrid"}) – How adjacency is computed from correlation.

  • minModuleSize (int, default 50) – Smallest module size kept by the dynamic tree cut.

  • save (bool, default False) – Whether to persist results to disk.
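To make the TPMcutoff rule concrete, here is a small pandas sketch of the filter as described above: a gene is dropped when its maximum across samples is below the cutoff. The helper name `drop_low_tpm_genes` and the toy values are hypothetical, and how the class treats a gene whose maximum equals the cutoff exactly is not stated in this page, so the toy data avoids that case.

```python
import pandas as pd


def drop_low_tpm_genes(gene_exp, tpm_cutoff=1.0):
    """Keep genes (rows) whose highest TPM across samples reaches the cutoff."""
    max_per_gene = gene_exp.max(axis=1)
    return gene_exp.loc[max_per_gene >= tpm_cutoff]


# genes x samples, the layout the geneExp parameter is documented to expect
toy = pd.DataFrame(
    {"s1": [0.2, 5.0, 0.0], "s2": [0.9, 0.1, 0.0], "s3": [0.4, 12.0, 3.0]},
    index=["geneA", "geneB", "geneC"],
)
kept = drop_low_tpm_genes(toy, tpm_cutoff=1)
print(kept.index.tolist())  # ['geneB', 'geneC']
```

geneA never reaches 1 TPM in any sample, so it is removed; geneB and geneC each exceed the cutoff in at least one sample and are kept.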

Notes

Expression CSVs are often stored samples × genes; in that case pass data.T so the constructor receives genes × samples.

Methods (call in this order; each step populates the attributes named in its entry). Use the high-level runWGCNA() to chain preprocessing and module detection, or the explicit methods below for finer control:

  • preprocess() — drop low-TPM genes, drop outlier samples (updates self.datExpr).

  • calculate_soft_threshold() — scale-free fit power scan; sets self.power (int, not self.softPower) and self.sft (DataFrame with R²/slope/k per power).

  • calculating_adjacency_matrix() — sets self.adjacency.

  • calculating_TOM_similarity_matrix() — sets self.TOM.

  • calculate_geneTree() — sets self.geneTree (linkage matrix).

  • calculate_dynamicMods(kwargs_function={...}) — sets self.dynamicMods and self.datExpr.var['dynamicColors'].

  • calculate_gene_module(kwargs_function={...}) — merges close modules, sets self.datExpr.var['moduleColors'], self.datExpr.var['moduleLabels'], self.MEs, self.datME.

  • findModules() — convenience that runs the soft-threshold + adjacency + TOM + tree + module merge as one call (preferred).

  • runWGCNA() — runs preprocess() then findModules().

  • analyseWGCNA(geneList=None) — module–trait correlation; sets self.moduleTraitCor and self.moduleTraitPvalue. Requires sample metadata (set via updateSampleInfo(...) or passed via sampleInfo at construction).
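The power scan behind calculate_soft_threshold() can be approximated as follows. This is a simplified sketch under stated assumptions, not the upstream implementation: the helpers `scale_free_r2` and `pick_power` are hypothetical, the adjacency is unsigned for brevity, connectivity is binned by histogram bin centres, and the fallback to the best-fitting power when nothing clears the cut is a choice made for this example.

```python
import numpy as np


def scale_free_r2(connectivity, n_bins=10):
    """Signed R^2 of the fit log10 p(k) ~ log10 k over binned connectivity."""
    k = np.asarray(connectivity, dtype=float)
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)
    if keep.sum() < 3:                      # too few points for a meaningful fit
        return 0.0
    x = np.log10(centers[keep])
    y = np.log10(counts[keep] / counts.sum())
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    # A scale-free network has a negative slope, so report R^2 with that sign.
    return float(-np.sign(slope) * r2)


def pick_power(cor, powers, r2_cut=0.9):
    """Return (chosen power, rows of (power, signed R^2, mean connectivity))."""
    cor = np.asarray(cor, dtype=float)
    table = []
    for p in powers:
        adj = np.abs(cor) ** p              # unsigned adjacency, for simplicity
        k = adj.sum(axis=1) - 1.0           # drop the self-connection
        table.append((p, scale_free_r2(k), float(k.mean())))
    passing = [row[0] for row in table if row[1] >= r2_cut]
    # Lowest power that clears the cut; otherwise fall back to the best fit.
    chosen = min(passing) if passing else max(table, key=lambda row: row[1])[0]
    return chosen, table


rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 200))           # toy data: 30 samples x 200 genes
cor = np.corrcoef(expr, rowvar=False)
power, sft = pick_power(cor, powers=range(1, 11))
```

The `sft` rows play the role of the self.sft table (one row per candidate power) and `power` the role of self.power; mean connectivity falls as the power rises, which is the trade-off the R² cut balances.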

Attributes (populated in this order as the methods above run). The class is a thin shim that delegates to the upstream PyWGCNA implementation; these are the actual attribute names on the returned instance, which agents commonly misspell:

  • self.geneExpr – AnnData holding the original input expression, with samples as observations (obs) and genes as variables (var).

  • self.datExpr – AnnData in the same layout, filtered after preprocess(). Because genes are the var axis, per-gene module annotations live on self.datExpr.var.

  • self.power (int) – chosen soft-threshold power. The attribute is power, NOT softPower. Set after calculate_soft_threshold() or findModules(); before that it is 0.

  • self.sft (pandas.DataFrame) — scale-free fit table per candidate power (columns: Power, SFT.R.sq, slope, mean(k), …). Set together with self.power.

  • self.adjacency (pandas.DataFrame) — gene-gene weighted adjacency. None until calculating_adjacency_matrix() / findModules() runs.

  • self.TOM (numpy.ndarray) — topological overlap matrix. None until calculating_TOM_similarity_matrix() / findModules() runs.

  • self.geneTree — scipy linkage matrix from 1 - TOM.

  • self.dynamicMods — initial dynamic-tree-cut module integer labels per gene.

  • self.datExpr.var['dynamicColors'] — initial module colour per gene (string, e.g. 'turquoise').

  • self.datExpr.var['moduleColors'] — final module colour per gene (after merging close modules). Use this for downstream.

  • self.datExpr.var['moduleLabels'] — integer label per gene aligned to moduleColors.

  • self.MEs (pandas.DataFrame) — module eigengenes, samples × modules. Do not compute this manually — the class already provides it; manual mean-by-mask is not equivalent (eigengene = first PC of the module’s expression, not the mean).

  • self.datME — pre-merge eigengene matrix; usually self.MEs is what you want.

  • self.moduleTraitCor (pandas.DataFrame) — module × trait Pearson correlations. None until analyseWGCNA() runs.

  • self.moduleTraitPvalue (pandas.DataFrame) — parallel p-value table. None until analyseWGCNA() runs.
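The note on self.MEs says an eigengene is the first principal component of a module's expression rather than a plain mean. A minimal NumPy sketch of that idea follows; `module_eigengene` is a hypothetical helper, the per-gene standardisation and the sign alignment with average expression follow common WGCNA practice and are assumptions here, and the toy module is simulated.

```python
import numpy as np


def module_eigengene(module_expr):
    """First principal component of a samples x genes module matrix."""
    x = np.asarray(module_expr, dtype=float)
    # Standardise each gene so that no single gene dominates the component.
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    eigengene = u[:, 0]                     # one value per sample, unit length
    # The sign of a principal component is arbitrary; align it with the
    # average standardised expression of the module.
    if np.corrcoef(eigengene, x.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return eigengene, explained


rng = np.random.default_rng(0)
latent = rng.normal(size=12)                # shared signal across 12 samples
loadings = np.array([1.0, 0.8, 1.2, 0.6, 0.5])
module_expr = np.outer(latent, loadings) + 0.1 * rng.normal(size=(12, 5))
eigengene, explained = module_eigengene(module_expr)
```

In practice, read the eigengenes from self.MEs rather than recomputing them; the sketch only shows why a mean over a gene mask gives different values and scale.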

Examples

>>> import pandas as pd
>>> import omicverse as ov
>>> # CSV stored samples × genes, with sample IDs in the first column
>>> data = pd.read_csv('expressionList.csv', index_col=0)
>>> wgcna = ov.bulk.pyWGCNA(
...     name='5xFAD',
...     species='mus musculus',
...     geneExp=data.T,            # transpose to genes × samples
...     TPMcutoff=1,
...     networkType='signed hybrid',
... )
>>> wgcna.preprocess()
>>> wgcna.findModules()
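Once modules exist, the module-trait step described for analyseWGCNA() correlates each eigengene with each numeric trait. The sketch below shows only the Pearson part in NumPy, with a hypothetical helper `module_trait_correlation` and made-up toy values; p-values and their FDR correction are left out, and this is not the library's own code.

```python
import numpy as np


def module_trait_correlation(eigengenes, traits):
    """Pearson r between every eigengene column and every trait column."""
    me = np.asarray(eigengenes, dtype=float)   # samples x modules
    tr = np.asarray(traits, dtype=float)       # samples x traits
    me_z = (me - me.mean(axis=0)) / me.std(axis=0)
    tr_z = (tr - tr.mean(axis=0)) / tr.std(axis=0)
    # The mean product of z-scores is the Pearson correlation.
    return me_z.T @ tr_z / me.shape[0]         # modules x traits


# Toy eigengenes for two modules across six samples (illustrative values).
eigengenes = np.array([[1.0, 0.5],
                       [2.0, -1.0],
                       [3.0, 0.2],
                       [4.0, 1.5],
                       [5.0, -0.3],
                       [6.0, 0.9]])
# Two toy traits: one rises with module 0, the other falls with it.
traits = np.column_stack([2.0 * eigengenes[:, 0], 7.0 - eigengenes[:, 0]])
cor_table = module_trait_correlation(eigengenes, traits)
```

The result has the same module × trait orientation as self.moduleTraitCor, so rows index modules and columns index traits.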