Differential Accessibility Analysis#

Once we have a peak-by-sample count matrix, we can identify chromatin regions that show statistically significant differences in accessibility between experimental conditions. Furthermore, we can discover conserved transcription factor binding motifs within these differentially accessible regions, providing mechanistic insights into the regulatory programs underlying phenotypic differences.

Analysis workflow:

  • Annotate peaks to nearby genes

  • Identify differentially accessible peaks (DESeq2)

  • Visualize differential peaks at key loci

  • Extract peak sequences from the reference genome

  • Match sequences against known motif databases (JASPAR)

  • Test motifs for enrichment in differential vs. background peaks

  • Visualize sequence logos and transcription factor footprints

  • Compute ChromVAR motif variability scores across samples

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em

em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 778.63 MiB
[i] virtual memory: 5.95 GiB

[3]:
expm = em.load_experiment('expm/bk-atac')
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ loading samples          2 / 2     (00:04 < 00:00)
[!] integrated mudata object is not generated.


[4]:
expm.atac['bulk-atac'].obs
[4]:
n.fragments pct.dup pct.mito location sample batch group modality taxa tsse
bulk-atac:c1 25596750 0.000264 0.0 atac/control-1.bam c1 b-1 control atac-bulk hsa/grch38 7.181466
bulk-atac:c2 29933508 0.000356 0.0 atac/control-2.bam c2 b-1 control atac-bulk hsa/grch38 4.210467
bulk-atac:c3 26024477 0.000272 0.0 atac/control-3.bam c3 b-1 control atac-bulk hsa/grch38 7.388964
bulk-atac:e1 24406481 0.000328 0.0 atac/hp-1.bam e1 b-1 expm atac-bulk hsa/grch38 9.118949
bulk-atac:e2 26010889 0.000294 0.0 atac/hp-2.bam e2 b-1 expm atac-bulk hsa/grch38 7.051591
bulk-atac:e3 23710917 0.000301 0.0 atac/hp-3.bam e3 b-1 expm atac-bulk hsa/grch38 8.239926
[5]:
print(expm)
[!] dataset not integrated.
[*] composed of samples:
  bulk-atac   atac      hsa    batch autogen   of size 6 × 0
  bulk-atac   atac-peak hsa    batch autogen   of size 6 × 48519

[6]:
expm.atac.view('bulk-atac')
annotated data of size 6 × 0
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
   obsm : paired <csr:ui32(3088286401)> <fragments>
    uns : assembly <assembly> assembly.size <asm-size> frag.sizes <frag-size>
          pct.tss.overlap <tss-overlap> peaks.group <dict/peaks/grouped-peaks>
          peaks.merged <dict/peaks/merged-peaks> tss.profile <tss-profile> tsse <overall-tsse>
[7]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <cat> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <cat> <o/strand> summit <i64> <i>
    uns : assembly <assembly> assembly.size <asm-size>
[9]:
expm.atac_peak.list_available_tools('bulk-atac', light = True)
[9]:
[{'tool': 'annotate_peak', 'kind': 'computation'},
 {'tool': 'expression_linkage', 'kind': 'computation'},
 {'tool': 'retrieve_sequence', 'kind': 'computation'}]

Annotating Peaks to Genes#

ATAC-seq peaks represent open chromatin regions, but they don’t come with gene names. A common and simple annotation strategy is the nearest TSS (transcription start site) method:

  1. For each peak, find the nearest annotated TSS in the reference genome

  2. Record the distance (in bp) from the peak center/summit to the TSS

  3. Classify peaks by genomic context:

    • Promoter: Within ±2 kb of a TSS

    • Proximal: 2-10 kb from a TSS

    • Distal intergenic: >10 kb from any TSS (potential enhancers)

    • Intronic / Exonic: Overlapping gene bodies

The annotation adds columns to .var including:

  • type: Genomic region classification

  • tss.nearest: Distance to the nearest TSS

  • ugene: Ensembl gene ID

  • gene: Gene symbol

  • tss.dist: Signed distance to TSS (negative = upstream)

Caveat: The nearest-TSS approach is simplistic. Enhancers can regulate genes hundreds of kb away, and the nearest gene is not always the regulatory target. More sophisticated methods (e.g., correlation-based linking, Hi-C) exist but are beyond this tutorial.

[10]:
expm.atac_peak.annotate_peak(run_on_samples = ['bulk-atac'])
[11]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <cat> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <cat> <o/strand> summit <i64> <i> type <o> <c/genomic-annot>
          tss.nearest <o> <o> ugene <o> <o/ugene> gene <o> <o/gene> tss.dist <f64> <i>
    uns : assembly <assembly> assembly.size <asm-size>
[13]:
expm.atac_peak['bulk-atac'].var[['type', 'tss.nearest', 'ugene', 'gene', 'tss.dist']]
[13]:
type tss.nearest ugene gene tss.dist
peak:grch38:chr1:9853-10354 promoter ENSG00000290825 rna:hsa:g1 DDX11L16 -1016.0
peak:grch38:chr1:15983-16484 promoter ENSG00000310526 rna:hsa:g2 WASH7P -865.0
peak:grch38:chr1:181228-181729 promoter ENSG00000308415 rna:hsa:g18 DDX11L2 147.0
peak:grch38:chr1:180549-181050 promoter ENSG00000308415 rna:hsa:g18 DDX11L2 -130.0
peak:grch38:chr1:191240-191741 first-intron ENSG00000310527 rna:hsa:g19 WASH9P 1386.0
... ... ... ... ... ...
peak:grch38:chrY:56706793-56707294 intergenic ENSG00000235857 rna:hsa:g78922 CTBP2P1 -148199.0
peak:grch38:chrY:56727895-56728396 intergenic ENSG00000235857 rna:hsa:g78922 CTBP2P1 -127097.0
peak:grch38:chrY:56734538-56735039 intergenic ENSG00000235857 rna:hsa:g78922 CTBP2P1 -120454.0
peak:grch38:chrY:56742463-56742964 intergenic ENSG00000235857 rna:hsa:g78922 CTBP2P1 -112529.0
peak:grch38:chrY:56763268-56763769 intergenic ENSG00000235857 rna:hsa:g78922 CTBP2P1 -91724.0

48519 rows × 5 columns

Differential Accessibility Analysis with DESeq2#

To identify peaks that are significantly more or less accessible between conditions, we use the DESeq2 method — originally developed for bulk RNA-seq differential expression analysis, but widely adapted for ATAC-seq differential peak analysis.

Why DESeq2 for ATAC-seq peaks?

  • ATAC-seq peak counts are count data (integer values), similar to RNA-seq read counts

  • Peaks exhibit overdispersion (variance > mean), which DESeq2 handles via negative binomial modeling

  • DESeq2 uses empirical Bayes shrinkage to stabilize dispersion estimates for low-count peaks

  • The model accounts for library size differences via size factor normalization

Statistical model: $ \text{Count} \sim `:nbsphinx-math:text{NB}`(\mu, \alpha) $ $ \log`(:nbsphinx-math:mu`) = \beta_0 + \beta_1 \cdot `:nbsphinx-math:text{condition}` $

Where:

  • \(\mu\) is the mean count (adjusted for library size)

  • \(\alpha\) is the dispersion parameter

  • \(\beta_1\) is the log2 fold change between conditions

Parameters:

  • formula='~ group': Model accessibility as a function of biological group

  • variable='group': The variable of interest for contrast

  • experiment='expm', control='control': Defines the comparison direction (expm vs control)

[15]:
expm.atac_peak.markers_deseq(
    run_on_samples = ['bulk-atac'], counts = 'X',
    metadata = ['sample', 'group'], formula = '~ group',
    variable = 'group', experiment = 'expm', control = 'control',
)
[i] fitting size factors ...
[i] using None as control genes, passed at deseq_dataset initialization
[i]   done in 0.01 seconds.
[i] fitting dispersions...
[i]   done in 7.14 seconds.
[i] fitting dispersion trend curve ...
[i]   done in 0.83 seconds.
[i] fitting map dispersions ...
[i]   done in 5.02 seconds.
[i] fitting log fold changes ...
[i]   done in 9.43 seconds.
[i] calculating cook's distance ...
[i]   done in 0.02 seconds.
[i] replacing 0 outlier genes.

For example, we can filter for differentially accessible peaks near the TCF7 locus — a key transcription factor in T-cell development and function. The results show peaks annotated to TCF7 with their mean accessibility, log2 fold change (lfc), and statistical significance (q-value).

[ ]:
upreg = expm.atac_peak.get_markers(
    run_on_samples = ['bulk-atac'],
    de_slot = 'markers', max_q = 1,
    min_lfc = 0.1, max_lfc = 10000
)['bulk-atac-peaks'][
    ['mean', 'lfc', 'lfc.se', 'names', 'gene', 'q']
].sort_values('lfc')

upreg.loc[upreg['gene'] == 'TCF7', :]
[i] fetched diff `expm` over `control` (14047 genes)

mean lfc lfc.se names gene q
37300 29.540347 0.156705 0.336371 peak:grch38:chr5:134113861-134114362 TCF7 0.999857
37302 60.085868 0.259853 0.241759 peak:grch38:chr5:134132392-134132893 TCF7 0.999857
37304 30.137926 0.752380 0.323924 peak:grch38:chr5:134140398-134140899 TCF7 0.999857
[20]:
figs = expm.atac.plot_peaks(
    run_on_samples = ['bulk-atac'],
    group_key = 'group', peaks_key = 'peaks.group',
    # either specifying a gene name for convenience
    gene = 'TCF7', upstream = 5000, downstream = 5000,
    # or specify chromosome coordinates manually
    chr = None, xf = None, xt = None,
    figsize = (4, 3.2), dpi = 100,
    gene_track_height = 0.45, chrom_track_height = 0.10,
    peak_annotation_height = 0.1,
    min_frag_length = None, max_frag_length = 180,
    plot_consensus_peak = 'peaks.merged',
    title = None, showgn = True,

    do_tight_layout = False
)
../_images/atac_d1-atac-peaks_17_0.png

Extracting Peak Sequences#

To analyze motif enrichment within differentially accessible peaks, we first need to extract the underlying DNA sequences from the reference genome. The run_atacp_retrieve_sequence() function:

  1. Takes each peak interval (chromosome, start, end)

  2. Extracts the corresponding nucleotide sequence from the reference genome

  3. Optionally extends the sequence to a fixed length (seqlen=200 bp) centered on the peak summit

  4. Stores the sequences for downstream motif analysis

The extracted sequences are stored in the .var DataFrame, where each row contains the genomic coordinates and the retrieved sequence.

[17]:
expm.atac_peak.retrieve_sequence(
    run_on_samples = ['bulk-atac'],
    seqlen = 200
)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ fetching chromosome     24 / 24    (00:18 < 00:00)
[18]:
expm.atac_peak['bulk-atac'].var[['strand.pos']]
[18]:
strand.pos
peak:grch38:chr1:9853-10354 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC...
peak:grch38:chr1:15983-16484 CCTGAGCGCGGGCATCCTGTGTGCAGATACTCCCTGCTTCCTCTCT...
peak:grch38:chr1:181228-181729 GGCGCAGAGACACATGCTAGCGCGCCCAGGGGAGGAGGCGTGGCGC...
peak:grch38:chr1:180549-181050 CCTAACCCTAACCCTACCCTAACCCAACCCTAACCCAACCCTAACT...
peak:grch38:chr1:191240-191741 TGACTGTCATCCCCTCCAGTCTCTGCACACTCCCAGCTGCAGCAGA...
... ...
peak:grch38:chrY:56706793-56707294 GGAATGGAGTGCAGTGCAATGGAATGGAATGGAATGGAATGCAATG...
peak:grch38:chrY:56727895-56728396 ATGGAATGGAATCGAATGCAATGGAATGCTATGGAATGGAATGGAA...
peak:grch38:chrY:56734538-56735039 AAAGGAATCGAATGGAAGGGAATGAAATTGAATCAACACGAATGGA...
peak:grch38:chrY:56742463-56742964 AATGTAATGGAATGGAGTGCAGTGCAATGGAATGGAATGGAATGGA...
peak:grch38:chrY:56763268-56763769 GAATAGAATGGAACGAAATTTCACGGAATGGAATCAAACTGAATGG...

48519 rows × 1 columns

[19]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <cat> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <cat> <o/strand> summit <i64> <i> type <o> <c/genomic-annot>
          tss.nearest <o> <o> ugene <o> <o/ugene> gene <o> <o/gene> tss.dist <f64> <i>
          strand.pos <o> <o/sequence> strand.neg <o> <o/sequence>
    uns : assembly <assembly> assembly.size <asm-size> markers
[20]:
expm.atac_peak.list_available_tools('bulk-atac', True)
[20]:
[{'tool': 'annotate_peak', 'kind': 'computation'},
 {'tool': 'expression_linkage', 'kind': 'computation'},
 {'tool': 'motif_match', 'kind': 'computation'},
 {'tool': 'retrieve_sequence', 'kind': 'computation'}]

Next, we scan the extracted peak sequences against known transcription factor motifs from the JASPAR database. The run_atacp_motif_match() function:

  1. Uses curated position weight matrices (PWMs) from the JASPAR database (a comprehensive collection of eukaryotic transcription factor binding profiles)

  2. Scans each peak sequence for motif occurrences using a sliding window approach

  3. Scores each position using a log-likelihood ratio (motif score vs. background)

  4. Calls a motif match when the score exceeds a threshold (default: p < 1e-5)

Parameters:

  • subset='jaspar': Use only JASPAR-annotated motifs (high-quality, curated)

  • length=200: The sequence window length to scan

  • threshold=0.00001: Stringent p-value threshold for motif matching

The results store the position and score of each motif match within each peak.

[21]:
expm.atac_peak.motif_match(
    run_on_samples = ['bulk-atac'],
    motif_matches = 'motifs',
    subset = 'jaspar',
    length = 200,
    threshold = 0.00001
)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ generating pssm      10249 / 10249 (00:00 < 00:00)
[i] matching motifs for foreground ...

   ━━━━━━━━━━━━━━━━━━━━━━━━━━ foreground matching   1936 / 1936  (02:49 < 00:00)
[22]:
expm.atac_peak.list_available_tools('bulk-atac', True)
[22]:
[{'tool': 'annotate_peak', 'kind': 'computation'},
 {'tool': 'chromvar', 'kind': 'computation'},
 {'tool': 'expression_linkage', 'kind': 'computation'},
 {'tool': 'footprint', 'kind': 'computation'},
 {'tool': 'motif_enrichment', 'kind': 'computation'},
 {'tool': 'motif_match', 'kind': 'computation'},
 {'tool': 'retrieve_sequence', 'kind': 'computation'}]

After identifying motif occurrences, we perform a statistical enrichment test to determine which motifs are significantly overrepresented in differentially accessible peaks compared to a background set.

Enrichment method:

  1. Foreground: The set of differentially accessible peaks (e.g., upregulated in experimental condition)

  2. Background: A matched set of non-differential peaks (by default, randomly sampled from all peaks)

  3. Contingency table: Count motif occurrences in foreground vs. background

  4. Statistical test: Chi-squared test (or Fisher’s exact test for small counts) on the 2×2 contingency table

Parameters:

  • background='background': Use the background model to sample matched null peaks

  • n=5000: Sample 5,000 background peaks for comparison

  • random_seed=42: Reproducible random sampling

[23]:
expm.atac_peak.motif_enrichment(
    run_on_samples = ['bulk-atac'],
    background = 'background',
    n = 5000,
    length = 200,
    random_seed = 42,
    threshold = 0.00001,
    motif_matches = 'motifs',
    key_added = 'motifs.enrich',
    subset = 'jaspar'
)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ generating pssm      10249 / 10249 (00:00 < 00:00)
[i] generating background regions ...
[i] fetching chromosomes, round 1 ...
[i] fetching chromosomes, round 2 ...
[i] fetching chromosomes, round 3 ...
[i] fetching chromosomes, round 4 ...
[i] fetching chromosomes, round 5 ...
[i] fetching chromosomes, round 6 ...
[i] fetching chromosomes, round 7 ...
[i] fetching chromosomes, round 8 ...
[i] fetching chromosomes, round 9 ...
[i] fetching chromosomes, round 10 ...
[i] fetching chromosomes, round 11 ...
[i] matching motifs for background ...

   ━━━━━━━━━━━━━━━━━━━━━━━━━━ background matching   1936 / 1936  (00:05 < 00:00)

We can examine the results table, sorted by statistical significance (chi-squared statistic, descending). This reveals which transcription factor motifs are most strongly enriched in the differential peaks, suggesting which regulatory programs may be driving the chromatin response.

[24]:
m = expm.atac_peak['bulk-atac'].uns['motifs.enrich'].copy()
m.loc[~m['chisq'].isna(), :].sort_values('chisq', ascending = False)
[24]:
motif fg.hit fg gc bg.hit bg fg.ratio bg.ratio fc chisq p log10.p
1255 motif:18810 934 48519 0.560541 529 5000 0.019250 0.1058 0.181949 1.277047e+03 0.000000 inf
731 motif:18013 7062 48519 0.580299 31 5000 0.145551 0.0062 23.476004 7.656322e+02 0.000000 inf
1150 motif:18349 229 48519 0.520852 194 5000 0.004720 0.0388 0.121644 6.714164e+02 0.000000 inf
1175 motif:18478 200 48519 0.507400 144 5000 0.004122 0.0288 0.143128 4.322554e+02 0.000000 inf
734 motif:19009 4501 48519 0.594672 42 5000 0.092768 0.0084 11.043784 4.153527e+02 0.000000 inf
... ... ... ... ... ... ... ... ... ... ... ... ...
1501 motif:36464 78 48519 0.457179 8 5000 0.001608 0.0016 1.004761 1.639540e-04 0.989784 0.004460
536 motif:18443 116 48519 0.634914 12 5000 0.002391 0.0024 0.996173 1.602428e-04 0.989900 0.004409
554 motif:18530 39 48519 0.483974 4 5000 0.000804 0.0008 1.004761 8.191109e-05 0.992779 0.003147
388 motif:18578 184 48519 0.489103 19 5000 0.003792 0.0038 0.997981 7.059359e-05 0.993296 0.002921
665 motif:35629 165 48519 0.440455 17 5000 0.003401 0.0034 1.000215 7.119881e-07 0.999327 0.000292

1532 rows × 12 columns

[26]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <cat> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <cat> <o/strand> summit <i64> <i> type <o> <c/genomic-annot>
          tss.nearest <o> <o> ugene <o> <o/ugene> gene <o> <o/gene> tss.dist <f64> <i>
          strand.pos <o> <o/sequence> strand.neg <o> <o/sequence>
   varm : motifs <csc:ui8(1936)> <f/motif-matches> motifs.locs <csc:i32(1936)>
    uns : assembly <assembly> assembly.size <asm-size> markers motifs.names motifs.seqlen
          background <motif-background> motifs.enrich <motif-enrich>

Motif Visualization#

The motif IDs in the enrichment table (e.g., motif:18810) are internal codes referencing the exprmat motif database. We can visualize a motif’s sequence logo — a graphical representation of its position weight matrix:

  • The x-axis shows the position within the motif (in bp)

  • The y-axis shows the information content (bits), ranging from 0 (completely random) to 2 (perfectly conserved)

  • Each nucleotide (A, C, G, T) is stacked at each position, with its height proportional to its frequency

  • Taller stacks indicate more conserved positions

Sequence logos are generated using the plot_sequence_logo() function.

[34]:
from exprmat.data.cre import plot_sequence_logo, search_transcription_factor

mot = plot_sequence_logo('motif:18810')
../_images/atac_d1-atac-peaks_32_0.png

We can also query the database for transcription factor - motif associations. For example, to find which JASPAR motifs are known to be bound by TCF7 (TCF1), a T-cell-specific transcription factor. This helps interpret which enriched motifs correspond to biologically relevant regulators.

[35]:
search_transcription_factor('hsa', 'TCF7', 'jaspar')
[35]:
[('mc:311', 'motif:18190'),
 ('mc:311', 'motif:18688'),
 ('mc:311', 'motif:19071'),
 ('mc:4295', 'motif:18389'),
 ('mc:119', 'motif:18944'),
 ('mc:119', 'motif:18945'),
 ('mc:119', 'motif:18948'),
 ('mc:119', 'motif:18952'),
 ('mc:119', 'motif:18953'),
 ('mc:9071', 'motif:18390')]

Transcription Factor Footprint Analysis#

When a transcription factor binds to its cognate motif, it protects the underlying DNA from Tn5 transposition. This creates a footprint — a local dip in the Tn5 cut frequency centered on the binding site, flanked by increased accessibility (the “open” chromatin shoulders).

The run_atacp_footprint() function:

  1. Identifies all motif occurrences in the genome

  2. For each occurrence, aggregates the Tn5 cut frequency ±250 bp around the motif center

  3. Normalizes by the average cut frequency in flanking regions

  4. Produces a meta-profile showing the average footprint pattern

Interpreting footprints:

  • A clear dip at the motif center indicates active TF binding

  • Flanking peaks correspond to accessible regions around the protected binding site

  • Differences between conditions (e.g., control vs. experimental) reveal condition-specific TF binding

Here we compute the footprint for motif motif:18944 (TCF7 binding motif) and visualize the resulting profile.

[28]:
expm.atac_peak.footprint(
    run_on_samples = ['bulk-atac'],
    adata_atac = expm.atac['bulk-atac'],
    motif = 'motif:18944',
    motif_matches = 'motifs',
    groupby = 'group',
    key_added = 'footprint.tcf7-1'
)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ summarizing peaks     1075 / 1075  (02:30 < 00:00)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ summarizing peaks     1075 / 1075  (02:00 < 00:00)
[30]:
from exprmat.plotting.utils import line

fig = line(
    x = [x - 250 for x in range(501)],
    y = expm.atac_peak['bulk-atac'].uns['footprint.tcf7-1'],
    figsize = (4, 1.5),
)
../_images/atac_d1-atac-peaks_37_0.png
[31]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <cat> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <cat> <o/strand> summit <i64> <i> type <o> <c/genomic-annot>
          tss.nearest <o> <o> ugene <o> <o/ugene> gene <o> <o/gene> tss.dist <f64> <i>
          strand.pos <o> <o/sequence> strand.neg <o> <o/sequence>
   varm : motifs <csc:ui8(1936)> <f/motif-matches> motifs.locs <csc:i32(1936)>
    uns : assembly <assembly> assembly.size <asm-size> markers motifs.names motifs.seqlen
          background <motif-background> motifs.enrich <motif-enrich> footprint.tcf7-1 <footprint>
[35]:
expm.atac_peak.list_available_tools('bulk-atac', True)
[35]:
[{'tool': 'annotate_peak', 'kind': 'computation'},
 {'tool': 'chromvar', 'kind': 'computation'},
 {'tool': 'expression_linkage', 'kind': 'computation'},
 {'tool': 'footprint', 'kind': 'computation'},
 {'tool': 'motif_enrichment', 'kind': 'computation'},
 {'tool': 'motif_match', 'kind': 'computation'},
 {'tool': 'retrieve_sequence', 'kind': 'computation'}]

ChromVAR Analysis#

ChromVAR (Chromatin Variation Across Regions) is a method for quantifying motif variability across samples. It addresses the question: “Which transcription factor motifs show the most variable accessibility across my samples?”

How ChromVAR works:

  1. For each motif, calculate the accessibility deviation — how much more or less accessible a sample is at regions containing that motif, compared to the genome-wide average

  2. Correct for GC-content bias (since GC-rich regions have different accessibility properties)

  3. Apply a background correction using a set of matched control regions

  4. Output a motif-by-sample variability matrix

Output: The atac-chromvar modality contains a matrix where:

  • Rows are samples

  • Columns are motifs

  • Values are the deviation z-scores (how many standard deviations each sample’s accessibility at that motif deviates from expected)

This allows identification of motifs that distinguish sample groups (e.g., which TF motifs are consistently more accessible in experimental vs. control samples).

[36]:
chromvar = expm.atac_peak.chromvar(
    run_on_samples = ['bulk-atac'],
    chunk_size = 10000,
    background = 'bg.intern',
    motif_matches = 'motifs',
    gc_key = 'gc', subset = 'jaspar',
    length = 200, threshold = 1e-5
)
[!] cannot find background peaks in the input object .varm
[!] calculating internal background peaks and inserting to .varm['bg.intern']
[i] computing expectation reads per cell and peak ...
[i] computing observed + bg motif deviations...
[i] calculating for rows (cells) 0 - 6 ...

   ━━━━━━━━━━━━━━━━━━━━━━━━━ background iterations    50 / 50    (00:08 < 00:00)
[i] created subset [bulk-atac] from sample [bulk-atac]


[37]:
print(expm)
[!] dataset not integrated.
[*] composed of samples:
  bulk-atac   atac       hsa    batch autogen   of size 6 × 0
  bulk-atac   atac-peak  hsa    batch autogen   of size 6 × 48519
  bulk-atac   atac-motif hsa    batch autogen   of size 6 × 1936

[39]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <cat> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <cat> <o/strand> summit <i64> <i> type <o> <c/genomic-annot>
          tss.nearest <o> <o> ugene <o> <o/ugene> gene <o> <o/gene> tss.dist <f64> <i>
          strand.pos <o> <o/sequence> strand.neg <o> <o/sequence> gc <f64> <f>
   varm : motifs <csc:ui8(1936)> <f/motif-matches> motifs.locs <csc:i32(1936)>
          bg.intern <arr:i32(50)> <motif-background-index>
    uns : assembly <assembly> assembly.size <asm-size> markers motifs.names motifs.seqlen
          background <motif-background> motifs.enrich <motif-enrich> footprint.tcf7-1 <footprint>
[38]:
expm.modalities['atac-motif']['bulk-atac']
[38]:
AnnData object with n_obs × n_vars = 6 × 1936
    obs: 'n.fragments', 'pct.dup', 'pct.mito', 'location', 'sample', 'batch', 'group', 'modality', 'taxa', 'tsse'
[ ]:
em.pd.DataFrame(
    expm.atac_motif['bulk-atac-peaks'].X,
    index = expm.atac_motif['bulk-atac-peaks'].obs_names,
    columns = expm.atac_motif['bulk-atac-peaks'].var_names
).iloc[:, 0:6]
motif:18200 motif:17986 motif:17990 motif:17992 motif:18399 motif:18983
bulk-atac:c1 -1.715671 1.849657 1.455513 1.831647 1.065492 1.441480
bulk-atac:c2 0.003073 -0.969662 -0.461601 -0.940206 -0.562150 -0.533376
bulk-atac:c3 -1.448396 3.508345 0.699794 3.322296 1.392191 1.790036
bulk-atac:e1 0.885136 -1.268004 -0.321142 -1.894677 -0.303776 -1.591238
bulk-atac:e2 1.420448 -3.043641 -1.137865 -2.807431 -1.212565 -0.457472
bulk-atac:e3 0.062925 1.518287 0.627722 2.024162 0.460492 0.830303

Save the Processed Dataset#

All analysis results — including peak annotations, differential accessibility results, motif matches, enrichment statistics, footprint profiles, and ChromVAR deviation scores — are saved to disk for future reference and publication figures.

[43]:
expm.save()
[i] saving individual samples. (pass `save_samples = False` to skip)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━ modality [atac]          1 / 1     (00:03 < 00:00)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ modality [atac-peak]     1 / 1     (00:01 < 00:00)
   ━━━━━━━━━━━━━━━━━━━━━━━━━ modality [atac-motif]     1 / 1     (00:00 < 00:00)