Immune Repertoire Tracing#

T-cell receptors (TCRs) and B-cell receptors (BCRs) are generated through somatic recombination of V(D)J gene segments, producing an immensely diverse repertoire of antigen-specific receptors. Single-cell RNA-seq data can simultaneously capture both the transcriptome and the immune receptor sequences from individual lymphocytes, enabling the tracing of clonal lineages and the linking of clonotype identity to functional state. This notebook demonstrates immune repertoire analysis using the exprmat package, covering the integration of TCR/BCR sequencing data with gene expression profiles, clonotype clustering, analysis of clonal expansion, and characterization of the transcriptional programs associated with specific clonal lineages.

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()

[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 777.80 MiB
[i] virtual memory: 5.95 GiB

We need two sets of data from the 10X pipeline: the corresponding RNA sequencing data, and the secondary data table filtered_contig_annotations.csv.

[4]:

! eza --group-directories-first --tree --long scrna/pbmc-tcr

drwxr-x---    - yangz  4 May 10:09 scrna/pbmc-tcr
drwxr-x---    - yangz  4 May 10:09 ├── rna
.rw-r-----  64k yangz  4 May 10:09 │  ├── barcodes.tsv.gz
.rw-r----- 342k yangz  4 May 10:09 │  ├── features.tsv.gz
.rw-r----- 101M yangz  4 May 10:09 │  └── matrix.mtx.gz
drwxr-x---    - yangz  4 May 10:09 └── tcr
.rw-r-----  35M yangz  4 May 10:09    ├── airr_rearrangement.tsv
.rw-r----- 182k yangz  4 May 10:09    ├── cell_barcodes.json
.rw-r----- 898k yangz  4 May 10:09    ├── clonotypes.csv
.rw-r----- 2.1M yangz  4 May 10:09    ├── concat_ref.bam
.rw-r----- 790k yangz  4 May 10:09    ├── concat_ref.bam.bai
.rw-r----- 9.8M yangz  4 May 10:09    ├── concat_ref.fasta
.rw-r----- 547k yangz  4 May 10:09    ├── concat_ref.fasta.fai
.rw-r----- 1.5M yangz  4 May 10:09    ├── consensus.bam
.rw-r----- 790k yangz  4 May 10:09    ├── consensus.bam.bai
.rw-r----- 5.6M yangz  4 May 10:09    ├── consensus.fasta
.rw-r----- 529k yangz  4 May 10:09    ├── consensus.fasta.fai
.rw-r----- 6.4M yangz  4 May 10:09    ├── consensus_annotations.csv
.rw-r-----  14k yangz  4 May 10:09    ├── donor_regions.fa
.rw-r----- 8.1M yangz  4 May 10:09    ├── filtered_contig.fasta
.rw-r-----  16M yangz  4 May 10:09    ├── filtered_contig.fastq
.rw-r----- 9.0M yangz  4 May 10:09    └── filtered_contig_annotations.csv

[8]:

meta = em.metadata(
    locations    = [
        'scrna/pbmc-tcr/rna',
        'scrna/pbmc-tcr/tcr',
    ],
    modality     = ['rna', 'rna/tcr'],
    default_taxa = ['hsa', 'hsa'],
    batches      = ['b-1', 'b-1'],
    names        = ['pbmc', 'pbmc'],
    groups       = ['pbmc', 'pbmc'],
)

[6]:

meta.dataframe

[6]:

	location	sample	batch	group	modality	taxa
0	scrna/pbmc-tcr/rna	pbmc	b-1	pbmc	rna	hsa
1	scrna/pbmc-tcr/tcr	pbmc	b-1	pbmc	rna/tcr	hsa

[9]:

expm = em.experiment(meta, dump = 'expm/tcr')

[i] reading sample pbmc [rna] ...
[i] 958 genes (out of 38606) not in the reference gene list.
[i] total 37648 genes mapped. 37557 unique genes.
[i] reading sample pbmc [rna/tcr] ...

Currently, the TCR table has not been added to the dataset. The lookup and integration of TCR-matched cells occurs in the merge function.

Quality Control and Visualization#

We first perform basic cleaning of the dataset.

[10]:

expm.rna.qc(
    run_on_samples = True,
    mt_seqid = 'chrM',
    mt_percent = 0.15,
    ribo_genes = None,
    ribo_percent = None,
    outlier_mode = 'mads',
    outlier_n = 5,
    doublet_method = 'no',
    min_cells = 3,
    min_genes = 50
)

[i] found 13 mitochondrial genes (expected 13)
[i] found 102 ribosomal genes
quality controlling sample [pbmc] ...
raw dataset contains 13409 cells, 26253 genes
filtered dataset contains 12641 cells, 22325 genes

[11]:

figs = expm.rna.plot_qc(run_on_samples = True)

[12]:

expm.rna.qc(
    run_on_samples = True,
    mt_seqid = 'chrM',
    mt_percent = 0.15,
    ribo_genes = None,
    ribo_percent = None,
    outlier_mode = 'mads',
    outlier_n = 5,
    doublet_method = 'scrublet',
    min_cells = 3,
    min_genes = 800
)

[i] found 13 mitochondrial genes (expected 13)
[i] found 102 ribosomal genes
quality controlling sample [pbmc] ...
raw dataset contains 13409 cells, 26253 genes
[i] preprocessing observation count matrix ...
[i] simulating doublets ...
[i] embedding using pca ...
[i] calculating doublet scores ...
[i] detected doublet rate: 3.7 %
[i] estimated detectable doublet fraction: 80.5 %
[i] overall doublet rate: 4.6 %
filtered dataset contains 11229 cells, 22165 genes

[13]:

figs = expm.rna.plot_qc(run_on_samples = True)

[14]:

expm.rna.filter(run_on_samples = True)

[15]:

expm.rna.log_normalize(
    run_on_samples = True,
    key_norm = 'norm',
    key_lognorm = 'lognorm'
)

expm.rna.select_hvg(
    run_on_samples = True,
    key_lognorm = 'lognorm',
    method = 'vst',
    dest = 'vst',
    n_top_genes = 1500
)

expm.merge(join = 'outer')

[i] reading tcr table from expm/tcr/tcr/pbmc.tsv.gz ...
[!] tcr table contains 1289 duplicated contigs for one cell.
[!] 5463 out of 6154 (88.8%) tcr detections mapped.
[i] 11229 out of 11229 (100.0%) tcr detections mapped.

During the integration process, the TCR dataset is automatically loaded and matched to the corresponding cells, populating the obs columns.

[16]:

print(expm)

annotated data of size 11229 × 22165
integrated dataset of size 11229 × 22165
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> tra <o> trb <o> clone <o> clone.id <o>
          trav <o> traj <o> trbv <o> trbj <o>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene>
 layers : counts <f32> <i/counts>
    uns : slots <system>

[*] composed of samples:
  pbmc    rna   hsa    batch b-1     of size 11229 × 22165

[17]:

expm['rna'].obs[['clone.id', 'trav', 'traj', 'trbv', 'trbj']]

[17]:

	clone.id	trav	traj	trbv	trbj
pbmc:2	pbmc:c:1	TRAV21	TRAJ41	TRBV11-1	TRBJ2-2
pbmc:3	pbmc:c:2	TRAV19	TRAJ17	TRBV20-1	TRBJ2-7
pbmc:4	na	na	na	na	na
pbmc:5	pbmc:c:3	TRAV38-2/DV8	TRAJ53	TRBV5-1	TRBJ1-4
pbmc:6	pbmc:c:4	TRAV14/DV4	TRAJ54	TRBV2	TRBJ2-3
...	...	...	...	...	...
pbmc:13403	pbmc:c:5132	TRAV8-4	TRAJ26	TRBV2	TRBJ2-2
pbmc:13404	na	na	na	na	na
pbmc:13407	na	na	na	na	na
pbmc:13408	na	na	na	na	na
pbmc:13409	pbmc:c:5133	TRAV26-1	TRAJ32	TRBV10-3	TRBJ2-6

11229 rows × 5 columns

Dimensionality Reduction and Clustering#

We use conventional methods for dimensionality reduction of the 5’ data. Because 5’ sequencing captures immunoglobulin gene diversity, we choose to exclude these genes during dimensionality reduction to prevent different clones from clustering separately.

[18]:

expm.rna.select_hvg(
    key_lognorm = 'X',
    method = 'vst',
    dest = 'vst',
    batch_key = 'batch',
    n_top_genes = 1500
)

expm.rna.scale_pca(
    hvg = 'vst.hvg',
    key_lognorm = 'X',
    key_scaled = 'scaled',
    key_added = 'pca', n_comps = 35,
    keep_sparse = False,
    random_state = 42,
    svd_solver = 'arpack'
)

expm.rna.knn(
    use_rep = 'pca',
    n_comps = None,
    n_neighbors = 30,
    knn = True,
    method = "umap",
    transformer = None,
    metric = "euclidean",
    metric_kwds = {},
    random_state = 42,
    key_added = 'nn30'
)

expm.rna.umap(
    min_dist = 0.3,
    spread = 1,
    n_components = 2,
    maxiter = None,
    alpha = 1,
    gamma = 1,
    negative_sample_rate = 5,
    init_pos = "spectral",
    random_state = 42,
    a = None, b = None,
    key_added = 'umap',
    neighbors_key = "nn30"
)

[21]:

expm.rna.leiden(
    resolution = 0.5,
    restrict_to = None,
    random_state = 42,
    key_added = 'leiden',
    adjacency = None,
    directed = None,
    use_weights = True,
    n_iterations = 2,
    partition_type = None,
    neighbors_key = 'nn30',
    obsp = None,
    flavor = 'igraph'
)

[22]:

fig = expm.rna.plot_embedding(
    basis = 'umap', color = 'leiden',
    figsize = (3, 3), dpi = 100, legend_col = 2, legend = False,
    ptsize = 1, annotate_style = 'text', contour_plot = True
)

TCR Diversity Analysis#

Compute basic TCR statistics.

[23]:

expm.rna.calculate_tcr_metrics(
    expanded_clone = 5
)

[25]:

print(expm)

annotated data of size 11229 × 22165
integrated dataset of size 11229 × 22165
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> tra <o> trb <o> clone <o> clone.id <o>
          trav <o> traj <o> trbv <o> trbj <o> leiden <cat> <c> trab <bool> tcr.clone.sum <i64> <i>
          tcr.clone.size <i64> <i> tcr.expanded <bool> <bool> tcr.clone.size.rel <f64> <f>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene> vst.means <f64> vst.vars <f64> vst.vars.norm <f64>
          vst.hvg.rank <f32> vst.hvg <bool>
 layers : counts <f32> <i/counts>
   obsm : pca <arr:f64(35)> <f/embedding/pca> knn.nn30 <arr:i32(30)> knn.d.nn30 <arr:f32(30)>
          umap <arr:f32(2)> <f/embedding>
   varm : pca <arr:f64(35)> <f/weights>
   obsp : connectivities.nn30 <csr:f32> distances.nn30 <csr:f32>
    uns : slots <system> commands <system> pca <dict> nn30 <knn> umap leiden leiden.colors

[*] composed of samples:
  pbmc    rna   hsa    batch b-1     of size 11229 × 22165

[24]:

expm['rna'].obs[['trab', 'tcr.clone.sum', 'tcr.clone.size', 'tcr.expanded']]

[24]:

	trab	tcr.clone.sum	tcr.clone.size	tcr.expanded
pbmc:2	True	5463	1	False
pbmc:3	True	5463	1	False
pbmc:4	False	5463	0	False
pbmc:5	True	5463	1	False
pbmc:6	True	5463	1	False
...	...	...	...	...
pbmc:13403	True	5463	1	False
pbmc:13404	False	5463	0	False
pbmc:13407	False	5463	0	False
pbmc:13408	False	5463	0	False
pbmc:13409	True	5463	1	False

11229 rows × 4 columns

[30]:

fig = expm.rna.plot_embedding_mask(
    basis = 'umap', color = 'trab', values = [True],
    figsize = (3, 3), dpi = 100, legend = False, ptsize = 1,
)

Some Shannon entropy-based methods can provide shared statistics at the single-cell lineage tracing cluster level.

[ ]:

expm.rna.calculate_startracs_metrics(
    clonotype = 'clone.id',
    cluster = 'leiden',
    tissue = None
)

[32]:

print(expm)

annotated data of size 11229 × 22165
integrated dataset of size 11229 × 22165
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> tra <o> trb <o> clone <o> clone.id <o>
          trav <o> traj <o> trbv <o> trbj <o> leiden <cat> <c> trab <bool> tcr.clone.sum <i64> <i>
          tcr.clone.size <i64> <i> tcr.expanded <bool> <bool> tcr.clone.size.rel <f64> <f>
          cluster.expansion <f64> <f> cluster.plasticity <f64> <f> clone.trans <f64> <f>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene> vst.means <f64> vst.vars <f64> vst.vars.norm <f64>
          vst.hvg.rank <f32> vst.hvg <bool>
 layers : counts <f32> <i/counts>
   obsm : pca <arr:f64(35)> <f/embedding/pca> knn.nn30 <arr:i32(30)> knn.d.nn30 <arr:f32(30)>
          umap <arr:f32(2)> <f/embedding>
   varm : pca <arr:f64(35)> <f/weights>
   obsp : connectivities.nn30 <csr:f32> distances.nn30 <csr:f32>
    uns : slots <system> commands <system> pca <dict> nn30 <knn> umap leiden leiden.colors

[*] composed of samples:
  pbmc    rna   hsa    batch b-1     of size 11229 × 22165

[35]:

fig = expm.rna.plot_multiple_embedding(
    basis = 'umap', ncols = 2, features = [
        'cluster.expansion', 'cluster.plasticity',
    ],
    figsize = (6, 3), dpi = 100, legend_col = 2, legend = False,
    ptsize = 2, annotate_style = 'text',
    cmap = 'reds/r'
)

[36]:

expm.save(save_samples = False)

[i] main dataset write to expm/tcr/integrated.h5mu