Bulk ATAC-seq Analysis#

This notebook demonstrates how to process bulk ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) data that has already been aligned to a reference genome. The exprmat package supports two input formats:

  1. BAM files (Binary Alignment/Map): The recommended format when available. BAM files store complete alignment information including:

    • Per-read mapping coordinates, CIGAR strings, and mapping qualities

    • Mate pair information for paired-end reads

    • Read sequences and quality scores

    • This enables the most accurate fragment analysis, duplicate removal, and quality filtering

  2. BigWig files: A common format in public databases (e.g., ENCODE, Roadmap Epigenomics). BigWig stores per-base or per-bin signal values at reduced resolution. Important limitations:

    • Signal has been summarized/aggregated, losing individual read information

    • Sequence information (nucleotides) is entirely discarded

    • PCR duplicate removal and fragment length filtering are no longer possible

    • Only suitable when BAM files are unavailable

ATAC-seq Background#

ATAC-seq uses the Tn5 transposase to simultaneously fragment and tag accessible chromatin regions with sequencing adapters. Key properties:

  • Nucleosome-free regions (~50-180 bp fragments) represent open chromatin (promoters, enhancers)

  • Mono-nucleosomal fragments (~180-400 bp) represent nucleosome-occupied regions

  • The fragment size distribution is a key quality metric, with a characteristic ~10.4 bp periodicity reflecting DNA helical pitch

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em

em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 776.45 MiB
[i] virtual memory: 5.95 GiB

In this section, we first demonstrate how to load bulk ATAC-seq samples from BAM files into the exprmat experiment framework. The data source directory is inspected to confirm the file structure.

[3]:
! tree -sh ./atac
[4.0K]  ./atac
├── [2.3G]  control-1.bam
├── [2.3M]  control-1.bam.bai
├── [148M]  control-1.fragments.tsv.gz
├── [2.3G]  control-1.sample.bam
├── [2.7G]  control-2.bam
├── [2.3M]  control-2.bam.bai
├── [178M]  control-2.fragments.tsv.gz
├── [2.7G]  control-2.sample.bam
├── [2.3G]  control-3.bam
├── [2.3M]  control-3.bam.bai
├── [149M]  control-3.fragments.tsv.gz
├── [2.3G]  control-3.sample.bam
├── [2.1G]  hp-1.bam
├── [2.3M]  hp-1.bam.bai
├── [137M]  hp-1.fragments.tsv.gz
├── [2.1G]  hp-1.sample.bam
├── [2.3G]  hp-2.bam
├── [2.4M]  hp-2.bam.bai
├── [150M]  hp-2.fragments.tsv.gz
├── [2.3G]  hp-2.sample.bam
├── [2.0G]  hp-3.bam
├── [2.3M]  hp-3.bam.bai
├── [134M]  hp-3.fragments.tsv.gz
└── [2.1G]  hp-3.sample.bam

1 directory, 24 files
[3]:
meta = em.metadata(
    locations    = [
        'atac/control-1.bam',
        'atac/control-2.bam',
        'atac/control-3.bam',
        'atac/hp-1.bam',
        'atac/hp-2.bam',
        'atac/hp-3.bam',
    ],
    modality     = ['atac-bulk'] * 6,
    default_taxa = ['hsa/grch38'] * 6,
    batches      = ['b-1'] * 6,
    names        = ['c1', 'c2', 'c3', 'e1', 'e2', 'e3'],
    groups       = ['control', 'control', 'control', 'expm', 'expm', 'expm'],
)

Each ATAC-seq sample requires one row in the metadata table with modality='atac-bulk'. The genome assembly version used for read alignment is critical — different assemblies (e.g., GRCh38 vs GRCh37 for human) have different coordinate systems, making cross-assembly comparisons invalid. The assembly is specified after the species name separated by / (e.g., hsa/grch38).

Available assemblies: The exprmat database ships multiple reference assemblies. For example, mouse genomes include both grcm38 (based on mm10) and grcm39 (based on mm39). Always use the assembly matching your alignment pipeline.

The metadata table also assigns biological groups (control vs expm) for downstream differential analysis, and batch identifiers to account for technical variation.

[4]:
meta.dataframe
[4]:
location sample batch group modality taxa
0 atac/control-1.bam c1 b-1 control atac-bulk hsa/grch38
1 atac/control-2.bam c2 b-1 control atac-bulk hsa/grch38
2 atac/control-3.bam c3 b-1 control atac-bulk hsa/grch38
3 atac/hp-1.bam e1 b-1 expm atac-bulk hsa/grch38
4 atac/hp-2.bam e2 b-1 expm atac-bulk hsa/grch38
5 atac/hp-3.bam e3 b-1 expm atac-bulk hsa/grch38
[5]:
expm = em.experiment(meta, dump = 'expm/bk-atac')
[i] reading sample c1 [atac-bulk] ...
[i] reading sample c2 [atac-bulk] ...
[i] reading sample c3 [atac-bulk] ...
[i] reading sample e1 [atac-bulk] ...
[i] reading sample e2 [atac-bulk] ...
[i] reading sample e3 [atac-bulk] ...

The bulk ATAC-seq data is stored in an automatically generated atac modality object. Note that its .var (feature/variable) count is initially 0 — this is by design for compatibility with the single-cell ATAC data format. For scATAC-seq, the expression matrix is typically a bin count matrix (genome tiled into bins). However, for bulk ATAC-seq, constructing such a matrix genome-wide is not meaningful. Instead, we work directly with BAM-derived fragment information and only construct a peak-based matrix later (see the “Peak Matrix” section below).

[3]:
expm = em.load_experiment('expm/bk-atac')
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ loading samples          1 / 1     (00:01 < 00:00)
[!] integrated mudata object is not generated.


[6]:
print(expm)
[!] dataset not integrated.
[*] composed of samples:
  bulk-atac   atac  hsa    batch autogen   of size 6 × 0

[8]:
expm.atac['bulk-atac'].obs
[8]:
n.fragments pct.dup pct.mito location sample batch group modality taxa
bulk-atac:c1 25596750 0.000264 0.0 atac/control-1.bam c1 b-1 control atac-bulk hsa/grch38
bulk-atac:c2 29933508 0.000356 0.0 atac/control-2.bam c2 b-1 control atac-bulk hsa/grch38
bulk-atac:c3 26024477 0.000272 0.0 atac/control-3.bam c3 b-1 control atac-bulk hsa/grch38
bulk-atac:e1 24406481 0.000328 0.0 atac/hp-1.bam e1 b-1 expm atac-bulk hsa/grch38
bulk-atac:e2 26010889 0.000294 0.0 atac/hp-2.bam e2 b-1 expm atac-bulk hsa/grch38
bulk-atac:e3 23710917 0.000301 0.0 atac/hp-3.bam e3 b-1 expm atac-bulk hsa/grch38
[19]:
expm.atac._init_atac(run_on_samples = expm.all_samples('atac'))
[23]:
expm.atac.view('bulk-atac')
annotated data of size 6 × 0
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <o> <c/batch> group <o> <c> modality <o> <c/modality>
          taxa <o> <c/taxa>
   obsm : paired <csr:ui32(3088286401)> <fragments>
    uns : assembly.size <asm-size> assembly <assembly>
[3]:
expm = em.load_experiment('expm/bk-atac')
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ loading samples          3 / 3     (00:01 < 00:00)
[!] integrated mudata object is not generated.


ATAC-seq Quality Control#

Before downstream analysis, we assess the quality of each ATAC-seq sample using the same visualization tools available for single-cell ATAC data. Key QC metrics include:

  • Fragment size distribution: A healthy ATAC-seq library shows a characteristic pattern with peaks at ~50 bp (nucleosome-free), ~200 bp (mono-nucleosome), ~400 bp (di-nucleosome), etc.

  • TSS enrichment score: Measures signal enrichment at transcription start sites relative to flanking regions. High TSS enrichment (>5-10) indicates good signal-to-noise.

  • Fraction of reads in peaks (FRiP): The proportion of reads overlapping called peaks. Typically 10-30% for high-quality bulk ATAC-seq.

  • Duplication rate: High duplication rates (>50%) may indicate PCR over-amplification or low starting material.

[24]:
figs = expm.atac.plot_qc(run_on_samples = True)
../_images/atac_b1-atac-bam_17_0.png
[25]:
expm.atac.view('bulk-atac')
annotated data of size 6 × 0
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <o> <c/batch> group <o> <c> modality <o> <c/modality>
          taxa <o> <c/taxa> tsse <f64> <f/tsse>
   obsm : paired <csr:ui32(3088286401)> <fragments>
    uns : assembly.size <asm-size> assembly <assembly> frag.sizes <frag-size> tsse <overall-tsse>
          pct.tss.overlap <tss-overlap> tss.profile <tss-profile>

Peak Calling#

For bulk ATAC-seq data, we do not need unsupervised clustering (unlike scATAC-seq). Instead, we directly call peaks — regions of significant chromatin accessibility — on each group of samples. The peak calling procedure:

  1. Aggregates fragment alignments across all samples within each group

  2. Uses MACS2-style algorithms to identify regions with significantly more signal than the local background

  3. Controls the false discovery rate via the q-value threshold

  4. Exports peak tables per group into the .uns slot of the ATAC modality

Key parameter: ``max_frag_size=180`` — We restrict peak calling to nucleosome-free fragments (≤180 bp), which represent true open chromatin regions. Longer fragments include nucleosomal DNA and would blur the signal.

The groupby='group' parameter tells the function to call peaks separately for each biological group (control and experimental), producing group-specific peak sets.

[26]:
expm.atac.call_peaks_fragments(
    run_on_samples = ['bulk-atac'],

    # call peaks from fragments
    groupby = 'group',
    qvalue = 0.05,
    call_broad_peaks = False,
    broad_cutoff = 0.1,
    replicate = 'sample',
    replicate_qvalue = None,
    max_frag_size = 180,  # only call nucleosome-free fragments
    selections = None,
    nolambda = False,
    shift = -100,
    extsize = 200,
    min_len = None,
    blacklist = None,

    # directly call groups of peaks
    key_added = 'peaks.group',
    inplace = True,
    n_jobs = 8
)
[i] exporting fragments ...
[i] calling peaks ...

[14]:
expm.atac.view('bulk-atac')
annotated data of size 6 × 0
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
   obsm : paired <csr:ui32(3088286401)> <fragments>
    uns : assembly <assembly> assembly.size <asm-size> frag.sizes <frag-size>
          pct.tss.overlap <tss-overlap> peaks.group <dict/peaks/grouped-peaks>
          peaks.merged <dict/peaks/merged-peaks> tss.profile <tss-profile> tsse <overall-tsse>

We can visualize the called peaks in a genome browser-style track plot. The plot_atac_peaks() function generates a composite figure showing:

  1. Coverage tracks per group: The aggregated read depth for control and experimental samples

  2. Peak annotation track: The called peak intervals for each group

  3. Gene track: Reference gene annotations for the selected region

  4. Chromosome ideogram: The genomic context

Here we visualize peaks around the TCF7 gene locus (a T-cell transcription factor), using a ±5 kb window around the gene body. The peak annotation track shows clear differences between control and experimental groups in the TCF7 promoter/enhancer regions.

[5]:
figs = expm.atac.plot_peaks(
    run_on_samples = ['bulk-atac'],
    group_key = 'group', peaks_key = 'peaks.group',
    # either specifying a gene name for convenience
    gene = 'TCF7', upstream = 5000, downstream = 5000,
    # or specify chromosome coordinates manually
    chr = None, xf = None, xt = None,
    figsize = (4, 3.5), dpi = 100,
    gene_track_height = 0.4, chrom_track_height = 0.10,
    peak_annotation_height = 0.1,
    min_frag_length = None, max_frag_length = 180,
    title = None, showgn = True,

    do_tight_layout = False
)
../_images/atac_b1-atac-bam_23_0.png

Before constructing the peak-by-sample count matrix, we need to establish a consensus peak set — the intersection of peaks identified across all groups. This step, called peak merging, reconciles peak coordinates from different groups into a unified set of intervals that can be used for count quantification across all samples.

IDR-based Peak Merging (ENCODE Standard)#

IDR (Irreproducible Discovery Rate) is the ENCODE-recommended method for merging peaks from bulk ATAC-seq and ChIP-seq experiments. The IDR approach:

  1. Takes replicate peak calls and ranks them by statistical significance

  2. Evaluates the consistency of peak rankings across replicates

  3. Identifies the set of peaks that are consistently reproducible above a specified rank threshold

This method is particularly useful when you have biological replicates, as it ensures only reproducible peaks are retained. The flavor='idr' parameter selects this method, which internally uses the idr Python package (a port of the original ENCODE IDR tool).

[4]:
expm.atac.merge_peaks_fragments(
    run_on_samples = ['bulk-atac'],
    key_added = 'peaks.merged',
    key_groups = 'peaks.group',
    flavor = 'idr'
)
[i] merging peaks ...
[i] ranking peaks ...

We inspect the merged consensus peaks stored in .uns['peaks.merged']. This DataFrame contains the genomic coordinates (chromosome, start, end) of all consensus peaks. The merged peaks are also plotted at the bottom of the locus view — these represent the union of reproducible open chromatin regions across all groups.

[5]:
expm.atac['bulk-atac'].uns['peaks.merged']
[5]:
chr start end strand score fc p q summit signal.x signal.y
0 chr20 2101614 2103533 . 1025 7.587683 63.241371 60.665310 457 419.0 606.0
1 chr19 47273238 47275238 . 481 4.275474 37.156929 34.723652 383 134.0 347.0
2 chr5 138464000 138466031 . 1158 9.895336 626.924805 623.349915 196 158.0 1000.0
3 chr20 33993010 33994675 . 2000 22.268004 849.103821 845.262634 222 1000.0 1000.0
4 chr3 9396141 9397827 . 2000 6.767930 162.998184 160.120667 206 1000.0 1000.0
... ... ... ... ... ... ... ... ... ... ... ...
44503 chr4 15786800 15787259 . 55 2.565707 7.645530 5.579120 199 NaN NaN
44504 chr8 63179767 63180248 . 54 3.018109 7.620480 5.465224 241 NaN NaN
44505 chr9 97154891 97155469 . 52 3.000378 7.368090 5.221102 449 NaN NaN
44506 chr5 37519258 37519720 . 50 2.879079 7.138310 5.003808 221 NaN NaN
44507 chr18 2604253 2604589 . 47 2.528090 6.743580 4.706681 135 NaN NaN

44508 rows × 11 columns

[6]:
expm.atac.view('bulk-atac')
annotated data of size 6 × 0
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
   obsm : paired <csr:ui32(3088286401)> <fragments>
    uns : assembly <assembly> assembly.size <asm-size> frag.sizes <frag-size>
          pct.tss.overlap <tss-overlap> peaks.group <dict/peaks/merged-peaks>
          peaks.merged <dict/peaks/merged-peaks> tss.profile <tss-profile> tsse <overall-tsse>
[16]:
figs = expm.atac.plot_peaks(
    run_on_samples = ['bulk-atac'],
    group_key = 'group', peaks_key = 'peaks.group',
    # either specifying a gene name for convenience
    gene = 'TCF7', upstream = 5000, downstream = 5000,
    # or specify chromosome coordinates manually
    chr = None, xf = None, xt = None,
    figsize = (4, 3.5), dpi = 100,
    gene_track_height = 0.4, chrom_track_height = 0.10,
    peak_annotation_height = 0.1,
    min_frag_length = None, max_frag_length = 180,
    plot_consensus_peak = 'peaks.merged',
    title = None, showgn = True,
    do_tight_layout = False
)
../_images/atac_b1-atac-bam_30_0.png

SnapATAC-based Peak Merging#

SnapATAC provides an efficient alternative algorithm for peak merging. Originally designed for single-cell ATAC-seq data, it works equally well for bulk data. The algorithm:

  1. Takes peak centers from all groups

  2. Merges peaks whose centers are within half_width base pairs of each other

  3. Uses a greedy merging strategy: iteratively merges overlapping intervals

  4. Produces a consensus set with the merge resolution controlled by half_width (default: 250 bp)

This method is computationally efficient and produces comparable results to IDR when replicates are of good quality. The half_width parameter determines how close two peak summits need to be to be merged — smaller values produce more refined boundaries, larger values produce broader consensus regions.

[17]:
expm.atac.merge_peaks_fragments(
    run_on_samples = ['bulk-atac'],
    key_added = 'peaks.merged',
    key_groups = 'peaks.group',
    flavor = 'snapatac',
    half_width = 250
)
[18]:
expm.atac['bulk-atac'].uns['peaks.merged']
[18]:
expm control chr start end strand summit
0 True True chr1 9853 10354 . 250
1 True True chr1 15983 16484 . 250
2 True True chr1 181228 181729 . 250
3 True True chr1 180549 181050 . 250
4 True True chr1 191240 191741 . 250
... ... ... ... ... ... ... ...
48514 False True chrY 56706793 56707294 . 250
48515 True True chrY 56727895 56728396 . 250
48516 True True chrY 56734538 56735039 . 250
48517 True True chrY 56742463 56742964 . 250
48518 True True chrY 56763268 56763769 . 250

48519 rows × 7 columns

[19]:
expm.atac.view('bulk-atac')
annotated data of size 6 × 0
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
   obsm : paired <csr:ui32(3088286401)> <fragments>
    uns : assembly <assembly> assembly.size <asm-size> frag.sizes <frag-size>
          pct.tss.overlap <tss-overlap> peaks.group <dict/peaks/grouped-peaks>
          peaks.merged <dict/peaks/merged-peaks> tss.profile <tss-profile> tsse <overall-tsse>
[ ]:
figs = expm.atac.plot_peaks(
    run_on_samples = ['bulk-atac'], group_key = 'group', peaks_key = 'peaks.group',
    # either specifying a gene name for convenience
    gene = 'TCF7', upstream = 5000, downstream = 5000,
    # or specify chromosome coordinates manually
    chr = None, xf = None, xt = None,
    figsize = (4, 3.5), dpi = 100,
    gene_track_height = 0.4, chrom_track_height = 0.10,
    peak_annotation_height = 0.1,
    min_frag_length = None, max_frag_length = 180,
    plot_consensus_peak = 'peaks.merged',
    title = None, showgn = True
)
../_images/atac_b1-atac-bam_35_0.png

Peak Matrix Construction and Gene Activity Inference#

In the final processing step, we perform two critical operations:

Peak-by-Sample Count Matrix (atac-peak modality)#

We count the number of transposition events (Tn5 insertion sites) that fall within each consensus peak for each sample. This process:

  • Counts only paired-end insertions (both read1 and read2 start sites), which represent true Tn5 cutting events

  • Filters fragments to the nucleosome-free range (≤180 bp)

  • Uses a chunked approach (chunk_size=500) for memory efficiency

  • Automatically creates a new modality named atac-p (ATAC peaks), where observations are samples and variables are peaks

2. Gene Activity Matrix#

Chromatin accessibility at promoters and gene bodies can be used to estimate transcriptional activity. This inference:

  • Maps each gene to its promoter region (±2 kb around TSS)

  • Sums the accessibility signal within those regions

  • Produces an inferred gene expression matrix (simulated rna modality)

  • This inferred matrix can be used for integration with RNA-seq data, enabling cross-modality comparisons

ATAC-seq gene activity inference is based on the observation that open chromatin at promoters strongly correlates with gene expression, though the relationship is indirect and promoter-proximal accessibility is only one of many regulatory mechanisms.

[23]:
expm.atac.make_peak_matrix_fragments(
    run_on_samples = ['bulk-atac'],
    chunk_size = 500,
    min_frag_size = None,
    max_frag_size = 180,
    counting_strategy = 'paired-insertion',
    value_type = 'target',
    summary_type = 'sum'
)
[i] created subset [bulk-atac] from sample [bulk-atac]

[24]:
expm.atac_peak['bulk-atac'].var
[24]:
expm control chr start end strand summit
peak:grch38:chr1:9853-10354 True True chr1 9853 10354 . 250
peak:grch38:chr1:15983-16484 True True chr1 15983 16484 . 250
peak:grch38:chr1:181228-181729 True True chr1 181228 181729 . 250
peak:grch38:chr1:180549-181050 True True chr1 180549 181050 . 250
peak:grch38:chr1:191240-191741 True True chr1 191240 191741 . 250
... ... ... ... ... ... ... ...
peak:grch38:chrY:56706793-56707294 False True chrY 56706793 56707294 . 250
peak:grch38:chrY:56727895-56728396 True True chrY 56727895 56728396 . 250
peak:grch38:chrY:56734538-56735039 True True chrY 56734538 56735039 . 250
peak:grch38:chrY:56742463-56742964 True True chrY 56742463 56742964 . 250
peak:grch38:chrY:56763268-56763769 True True chrY 56763268 56763769 . 250

48519 rows × 7 columns

[30]:
expm.atac_peak.view('bulk-atac')
annotated data of size 6 × 48519
    obs : n.fragments <ui64> <i> pct.dup <f64> <f> pct.mito <f64> <f> location <o> <o/location>
          sample <o> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> tsse <f64> <f/tsse>
    var : expm <bool> <bool> control <bool> <bool> chr <o> <c/chromosome> start <i64> <i>
          end <i64> <i> strand <o> <o/strand> summit <i64> <i>
    uns : assembly <assembly> assembly.size <asm-size>

Remove peaks with very low total accessibility across all samples. The min_sum=3 threshold discards peaks whose summed count across all samples is less than 3. This aggressive filtering step:

  • Eliminates noisy, poorly-represented peaks

  • Reduces the multiple testing burden in downstream differential analysis

  • Focuses analysis on robustly detected open chromatin regions

[32]:
expm.atac_peak.filter_genes_by_sum(run_on_samples = ['bulk-atac'], min_sum = 3)
[33]:
print(expm)
[!] dataset not integrated.
[*] composed of samples:
  bulk-atac   atac      hsa    batch autogen   of size 6 × 0
  bulk-atac   atac-peak hsa    batch autogen   of size 6 × 48519

Save the Processed Dataset#

All changes made to the experiment — including QC filters, peak calls, merged peaks, peak matrix, and gene activity matrix — are persisted to disk. The experiment can later be reloaded with em.load_experiment('bk-atac') without repeating any of the preprocessing steps.

[34]:
expm.save()
[i] saving individual samples. (pass `save_samples = False` to skip)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━ modality [atac]          1 / 1     (00:03 < 00:00)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ modality [atac-peak]     1 / 1     (00:00 < 00:00)