Gene network module#

For bulk or pseudobulk samples, shared co-expression modules for genes can be derived. Genes expression are commonly internally correlated, reflecting genomic modules that derives specific function cooperations.

WGCNA help cluster genes by expression correlations, and derives gene modules that tend to vary within samples. Commonly, to capture enough variability, at least 15 bulk samples is required to obtain robust and sensible result.

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()

[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 778.72 MiB
[i] virtual memory: 5.95 GiB

We will use the pseudo-bulk samples for WGCNA

[3]:

expm = em.load_experiment('expm/scrna', load_samples = True, load_subset = 'mono-neutro')

   ━━━━━━━━━━━━━━━━━━━━━━━━━━ loading samples          5 / 5     (00:05 < 00:00)

[4]:

print(expm)

annotated data of size 9754 × 19651
subset mono-neutro of size 9754 × 19651
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> leiden <cat> <c> sc3.5 <cat> <c>
          sc3.10 <cat> <c> sc3.20 <cat> <c> sc3.30 <cat> <c> cell.type <cat> <c>
          kde.umap <f64> <f/kde> psbulk <cat> <o> cytotrace.score <f64> <f>
          cytotrace.potency <cat> <o> cytotrace.relative <f64> <f> cytotrace.score.preknn <f64> <f>
          cytotrace.potency.preknn <cat> <o> ppt.pseudotime <f64> <f> ppt.seg <cat> <o>
          ppt.edge <cat> <o> ppt.milestones <cat> <o>
    var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
          subtype <cat> <c/gsubtype> gene <cat> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <cat> <c> uid <o> <o/ugene> vst.hvg <bool> <bool/hvg> vst.all.means <f64> <f>
          vst.all.vars <f64> <f> vst.all.vars.norm <f64> <f> vst.all.hvg.rank <f32> <f>
          vst.all.hvg <bool> <bool>
 layers : counts <f32> <i/counts> magic <f64> <f/normal/imputed> norm <f32> <f>
   obsm : cnmf.10 <df> <f/embedding/usage> diffmap <arr:f32(5)> <f/embedding>
          harmony <arr:f32(35)> <f> knn <arr:i32(100)> <i/knni> knn.d <arr:f32(100)> <f/knnd>
          knn.d.nn30.diffmap <arr:f32(30)> <f> knn.nn30.diffmap <arr:i32(30)> <i>
          pca <arr:f64(35)> <f/embedding/pca> ppt <arr:f64(2000)> <ppt-assign>
          umap <arr:f32(2)> <f/embedding> umap.diff <arr:f32(2)> <f/embedding>
   varm : cnmf.10 <arr:f64(10)> <f/weights> cnmf.coef.10 <arr:f64(10)> <f/usage-coef>
          pca <arr:f64(35)> <f/weights>
   obsp : connectivities <csr:f32> <f/connectivity> connectivities.nn30.diffmap <csr:f32> <f>
          distances <csr:f32> <f/distance> distances.nn30.diffmap <csr:f32> <f>
    uns : cell.type.colors <o> cell.type_colors <o> cnmf <cnmf> cnmf.args <o>
          cnmf.density.10 <cnmf-density> cnmf.dist.10 <f/connectivity> cnmf.stats <cnmf-stats>
          commands <system> diffmap <o> kde.umap <kde-stats> leiden <o> leiden.colors <o>
          magic.errors <magic-errors> magic.t <magic-t> markers <markers> neighbors <knn>
          nn30.diffmap <knn> pca <dict> ppt <ppt> ppt.graph <o> ppt.milestones.colors <o>
          ppt.pseudotime <o> ppt.root <i> ppt.seg.colors <o> sc3.10.colors <o> sc3.20.colors <o>
          sc3.30.colors <o> sc3.5.colors <o> slots <system> trace <trace> umap <o> umap.diff <o>

[*] composed of samples:
  distal                rna   mmu    batch b1        of size 6451 × 21689
  niche                 rna   mmu    batch b2        of size 3656 × 21970
  normal                rna   mmu    batch b3        of size 4537 × 21917
  integrated-metacell   rna   autogen   batch autogen   of size 1277 × 18903
  psbulk                rna   autogen   batch autogen   of size 18 × 8928

The pseudobulk matrix has already been filtered. See Statistics vignitte for details

[5]:

fig = expm.rna.plot_pseudobulk_features(
    run_on_samples = ['psbulk'],
    group = 'group',
    lib_size = None,
    min_count = 10,
    min_total_count = 15,
    large_n = 10,
    min_sample_prop = 0.7,
    cmap = 'viridis',
    min_cell_prop = 0.2,
    min_samples = 2,
    figsize = (7, 3),
    dpi = 100
)

[7]:

expm.rna.log_normalize(run_on_samples = ['psbulk'])

[8]:

expm.rna.view('psbulk')

annotated data of size 18 × 8928
    obs : cell.type <cat> <c> sample <cat> <c/sample> group <cat> <c> counts <f32> <f>
          cells <i64> <i>
    var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
          subtype <cat> <c/gsubtype> gene <cat> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <cat> <c> uid <o> <o/ugene> vst.hvg <bool> <bool/hvg> vst.all.means <f64> <f>
          vst.all.vars <f64> <f> vst.all.vars.norm <f64> <f> vst.all.hvg.rank <f32> <f>
          vst.all.hvg <bool> <bool>
 layers : proportions <f64> <f/proportions> norm <f32> <f/linear> lognorm <f32> <f/normal>
          counts <f32> <i/counts>
    uns : commands <system> slots <system>

WGCNA should be run on log normalized data

[10]:

fig = expm.rna.plot_wgcna_soft_threshold(
    run_on_samples = ['psbulk'],
    layer = None,
)

[i] soft-thresholding power = 16 (scale-free r^2 cutoff met).

Select the power for WGCNA, if left empty, it will automatically inferred.

[11]:

expm.rna.wgcna(
    run_on_samples = ['psbulk'],
    layer = None, key_added = 'wgcna',
    power = 6
)

[i] preparing expression matrix...
[i] using user-specified soft-thresholding power = 6.
[i] computing adjacency matrix...
[i] computing topological overlap matrix...
[i] hierarchical clustering of (1 - tom)...
[i] dynamic hybrid tree cut (min_module_size = 30, deep_split = 2)...
[i] initial modules = 50 (plus 2853 unassigned).
[i] merging modules with eigengene dissimilarity < 0.25...
[i] final modules = 15.

[12]:

fig = expm.rna.plot_wgcna_eigengene_dendrogram(
    run_on_samples = ['psbulk'],
    key_added = 'wgcna',
    method = 'pearson',
    cmap = 'viridis',
    figsize = (5, 3),
    dpi = 100
)

[14]:

fig = expm.rna.plot_wgcna_eigengene_adjacency_heatmap(
    run_on_samples = ['psbulk'],
    key_added = 'wgcna',
    method = 'pearson',
    cmap = 'viridis',
    figsize = (4, 4),
    dpi = 100
)

[16]:

fig = expm.rna.plot_wgcna_tom_heatmap(
    run_on_samples = ['psbulk'],
    key_added = 'wgcna',
    layer = None,
    power = 5,
    network_type = 'signed hybrid',
    tom_type = 'signed', tom_denom = 'min',
    show_module_boundaries = True,
    boundary_color = 'white',
    boundary_linewidth = 0.2,
    cmap = 'viridis',
    vmin = 0.0, vmax = 0.25,
    figsize = (6.0, 6.0), dpi = 100,
)

[!] tom matrix was not found in adata.uns; recomputing from expression.

[17]:

fig = expm.rna.plot_wgcna_gene_dendrogram(
    run_on_samples = ['psbulk'],
    key_added = 'wgcna',
    linewidth = 0.5,
    figsize = (10.0, 3.0),
    dpi = 100,
)

[26]:

expm.save()

[i] main dataset write to expm/scrna/subsets/mono-neutro.h5mu
[i] saving individual samples. (pass `save_samples = False` to skip)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━ modality [rna]           5 / 5     (00:01 < 00:00)