Immune Repertoire Tracing#

T-cell receptors (TCRs) and B-cell receptors (BCRs) are generated through somatic recombination of V(D)J gene segments, producing an immensely diverse repertoire of antigen-specific receptors. Single-cell RNA-seq data can simultaneously capture both the transcriptome and the immune receptor sequences from individual lymphocytes, enabling the tracing of clonal lineages and the linking of clonotype identity to functional state. This notebook demonstrates immune repertoire analysis using the exprmat package, covering the integration of TCR/BCR sequencing data with gene expression profiles, clonotype clustering, analysis of clonal expansion, and characterization of the transcriptional programs associated with specific clonal lineages.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 777.80 MiB
[i] virtual memory: 5.95 GiB

We need two sets of data from the 10X pipeline: the corresponding RNA sequencing data, and the secondary data table filtered_contig_annotations.csv.

[4]:
! eza --group-directories-first --tree --long scrna/pbmc-tcr
drwxr-x---    - yangz  4 May 10:09 scrna/pbmc-tcr
drwxr-x---    - yangz  4 May 10:09 ├── rna
.rw-r-----  64k yangz  4 May 10:09 │  ├── barcodes.tsv.gz
.rw-r----- 342k yangz  4 May 10:09 │  ├── features.tsv.gz
.rw-r----- 101M yangz  4 May 10:09 │  └── matrix.mtx.gz
drwxr-x---    - yangz  4 May 10:09 └── tcr
.rw-r-----  35M yangz  4 May 10:09    ├── airr_rearrangement.tsv
.rw-r----- 182k yangz  4 May 10:09    ├── cell_barcodes.json
.rw-r----- 898k yangz  4 May 10:09    ├── clonotypes.csv
.rw-r----- 2.1M yangz  4 May 10:09    ├── concat_ref.bam
.rw-r----- 790k yangz  4 May 10:09    ├── concat_ref.bam.bai
.rw-r----- 9.8M yangz  4 May 10:09    ├── concat_ref.fasta
.rw-r----- 547k yangz  4 May 10:09    ├── concat_ref.fasta.fai
.rw-r----- 1.5M yangz  4 May 10:09    ├── consensus.bam
.rw-r----- 790k yangz  4 May 10:09    ├── consensus.bam.bai
.rw-r----- 5.6M yangz  4 May 10:09    ├── consensus.fasta
.rw-r----- 529k yangz  4 May 10:09    ├── consensus.fasta.fai
.rw-r----- 6.4M yangz  4 May 10:09    ├── consensus_annotations.csv
.rw-r-----  14k yangz  4 May 10:09    ├── donor_regions.fa
.rw-r----- 8.1M yangz  4 May 10:09    ├── filtered_contig.fasta
.rw-r-----  16M yangz  4 May 10:09    ├── filtered_contig.fastq
.rw-r----- 9.0M yangz  4 May 10:09    └── filtered_contig_annotations.csv
[8]:
meta = em.metadata(
    locations    = [
        'scrna/pbmc-tcr/rna',
        'scrna/pbmc-tcr/tcr',
    ],
    modality     = ['rna', 'rna/tcr'],
    default_taxa = ['hsa', 'hsa'],
    batches      = ['b-1', 'b-1'],
    names        = ['pbmc', 'pbmc'],
    groups       = ['pbmc', 'pbmc'],
)
[6]:
meta.dataframe
[6]:
location sample batch group modality taxa
0 scrna/pbmc-tcr/rna pbmc b-1 pbmc rna hsa
1 scrna/pbmc-tcr/tcr pbmc b-1 pbmc rna/tcr hsa
[9]:
expm = em.experiment(meta, dump = 'expm/tcr')
[i] reading sample pbmc [rna] ...
[i] 958 genes (out of 38606) not in the reference gene list.
[i] total 37648 genes mapped. 37557 unique genes.
[i] reading sample pbmc [rna/tcr] ...

Currently, the TCR table has not been added to the dataset. The lookup and integration of TCR-matched cells occurs in the merge function.

Quality Control and Visualization#

We first perform basic cleaning of the dataset.

[10]:
expm.rna.qc(
    run_on_samples = True,
    mt_seqid = 'chrM',
    mt_percent = 0.15,
    ribo_genes = None,
    ribo_percent = None,
    outlier_mode = 'mads',
    outlier_n = 5,
    doublet_method = 'no',
    min_cells = 3,
    min_genes = 50
)
[i] found 13 mitochondrial genes (expected 13)
[i] found 102 ribosomal genes
quality controlling sample [pbmc] ...
raw dataset contains 13409 cells, 26253 genes
filtered dataset contains 12641 cells, 22325 genes
[11]:
figs = expm.rna.plot_qc(run_on_samples = True)
../_images/scrna_k1-tcr-bcr_11_0.png
[12]:
expm.rna.qc(
    run_on_samples = True,
    mt_seqid = 'chrM',
    mt_percent = 0.15,
    ribo_genes = None,
    ribo_percent = None,
    outlier_mode = 'mads',
    outlier_n = 5,
    doublet_method = 'scrublet',
    min_cells = 3,
    min_genes = 800
)
[i] found 13 mitochondrial genes (expected 13)
[i] found 102 ribosomal genes
quality controlling sample [pbmc] ...
raw dataset contains 13409 cells, 26253 genes
[i] preprocessing observation count matrix ...
[i] simulating doublets ...
[i] embedding using pca ...
[i] calculating doublet scores ...
[i] detected doublet rate: 3.7 %
[i] estimated detectable doublet fraction: 80.5 %
[i] overall doublet rate: 4.6 %
filtered dataset contains 11229 cells, 22165 genes
[13]:
figs = expm.rna.plot_qc(run_on_samples = True)
../_images/scrna_k1-tcr-bcr_13_0.png
[14]:
expm.rna.filter(run_on_samples = True)
[15]:
expm.rna.log_normalize(
    run_on_samples = True,
    key_norm = 'norm',
    key_lognorm = 'lognorm'
)

expm.rna.select_hvg(
    run_on_samples = True,
    key_lognorm = 'lognorm',
    method = 'vst',
    dest = 'vst',
    n_top_genes = 1500
)

expm.merge(join = 'outer')
[i] reading tcr table from expm/tcr/tcr/pbmc.tsv.gz ...
[!] tcr table contains 1289 duplicated contigs for one cell.
[!] 5463 out of 6154 (88.8%) tcr detections mapped.
[i] 11229 out of 11229 (100.0%) tcr detections mapped.

During the integration process, the TCR dataset is automatically loaded and matched to the corresponding cells, populating the obs columns.

[16]:
print(expm)
annotated data of size 11229 × 22165
integrated dataset of size 11229 × 22165
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> tra <o> trb <o> clone <o> clone.id <o>
          trav <o> traj <o> trbv <o> trbj <o>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene>
 layers : counts <f32> <i/counts>
    uns : slots <system>

[*] composed of samples:
  pbmc    rna   hsa    batch b-1     of size 11229 × 22165

[17]:
expm['rna'].obs[['clone.id', 'trav', 'traj', 'trbv', 'trbj']]
[17]:
clone.id trav traj trbv trbj
pbmc:2 pbmc:c:1 TRAV21 TRAJ41 TRBV11-1 TRBJ2-2
pbmc:3 pbmc:c:2 TRAV19 TRAJ17 TRBV20-1 TRBJ2-7
pbmc:4 na na na na na
pbmc:5 pbmc:c:3 TRAV38-2/DV8 TRAJ53 TRBV5-1 TRBJ1-4
pbmc:6 pbmc:c:4 TRAV14/DV4 TRAJ54 TRBV2 TRBJ2-3
... ... ... ... ... ...
pbmc:13403 pbmc:c:5132 TRAV8-4 TRAJ26 TRBV2 TRBJ2-2
pbmc:13404 na na na na na
pbmc:13407 na na na na na
pbmc:13408 na na na na na
pbmc:13409 pbmc:c:5133 TRAV26-1 TRAJ32 TRBV10-3 TRBJ2-6

11229 rows × 5 columns

Dimensionality Reduction and Clustering#

We use conventional methods for dimensionality reduction of the 5’ data. Because 5’ sequencing captures immunoglobulin gene diversity, we choose to exclude these genes during dimensionality reduction to prevent different clones from clustering separately.

[18]:
expm.rna.select_hvg(
    key_lognorm = 'X',
    method = 'vst',
    dest = 'vst',
    batch_key = 'batch',
    n_top_genes = 1500
)

expm.rna.scale_pca(
    hvg = 'vst.hvg',
    key_lognorm = 'X',
    key_scaled = 'scaled',
    key_added = 'pca', n_comps = 35,
    keep_sparse = False,
    random_state = 42,
    svd_solver = 'arpack'
)

expm.rna.knn(
    use_rep = 'pca',
    n_comps = None,
    n_neighbors = 30,
    knn = True,
    method = "umap",
    transformer = None,
    metric = "euclidean",
    metric_kwds = {},
    random_state = 42,
    key_added = 'nn30'
)

expm.rna.umap(
    min_dist = 0.3,
    spread = 1,
    n_components = 2,
    maxiter = None,
    alpha = 1,
    gamma = 1,
    negative_sample_rate = 5,
    init_pos = "spectral",
    random_state = 42,
    a = None, b = None,
    key_added = 'umap',
    neighbors_key = "nn30"
)
[21]:
expm.rna.leiden(
    resolution = 0.5,
    restrict_to = None,
    random_state = 42,
    key_added = 'leiden',
    adjacency = None,
    directed = None,
    use_weights = True,
    n_iterations = 2,
    partition_type = None,
    neighbors_key = 'nn30',
    obsp = None,
    flavor = 'igraph'
)
[22]:
fig = expm.rna.plot_embedding(
    basis = 'umap', color = 'leiden',
    figsize = (3, 3), dpi = 100, legend_col = 2, legend = False,
    ptsize = 1, annotate_style = 'text', contour_plot = True
)
../_images/scrna_k1-tcr-bcr_22_0.png

TCR Diversity Analysis#

Compute basic TCR statistics.

[23]:
expm.rna.calculate_tcr_metrics(
    expanded_clone = 5
)
[25]:
print(expm)
annotated data of size 11229 × 22165
integrated dataset of size 11229 × 22165
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> tra <o> trb <o> clone <o> clone.id <o>
          trav <o> traj <o> trbv <o> trbj <o> leiden <cat> <c> trab <bool> tcr.clone.sum <i64> <i>
          tcr.clone.size <i64> <i> tcr.expanded <bool> <bool> tcr.clone.size.rel <f64> <f>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene> vst.means <f64> vst.vars <f64> vst.vars.norm <f64>
          vst.hvg.rank <f32> vst.hvg <bool>
 layers : counts <f32> <i/counts>
   obsm : pca <arr:f64(35)> <f/embedding/pca> knn.nn30 <arr:i32(30)> knn.d.nn30 <arr:f32(30)>
          umap <arr:f32(2)> <f/embedding>
   varm : pca <arr:f64(35)> <f/weights>
   obsp : connectivities.nn30 <csr:f32> distances.nn30 <csr:f32>
    uns : slots <system> commands <system> pca <dict> nn30 <knn> umap leiden leiden.colors

[*] composed of samples:
  pbmc    rna   hsa    batch b-1     of size 11229 × 22165

[24]:
expm['rna'].obs[['trab', 'tcr.clone.sum', 'tcr.clone.size', 'tcr.expanded']]
[24]:
trab tcr.clone.sum tcr.clone.size tcr.expanded
pbmc:2 True 5463 1 False
pbmc:3 True 5463 1 False
pbmc:4 False 5463 0 False
pbmc:5 True 5463 1 False
pbmc:6 True 5463 1 False
... ... ... ... ...
pbmc:13403 True 5463 1 False
pbmc:13404 False 5463 0 False
pbmc:13407 False 5463 0 False
pbmc:13408 False 5463 0 False
pbmc:13409 True 5463 1 False

11229 rows × 4 columns

[30]:
fig = expm.rna.plot_embedding_mask(
    basis = 'umap', color = 'trab', values = [True],
    figsize = (3, 3), dpi = 100, legend = False, ptsize = 1,
)
../_images/scrna_k1-tcr-bcr_27_0.png

Some Shannon entropy-based methods can provide shared statistics at the single-cell lineage tracing cluster level.

[ ]:
expm.rna.calculate_startracs_metrics(
    clonotype = 'clone.id',
    cluster = 'leiden',
    tissue = None
)
[32]:
print(expm)
annotated data of size 11229 × 22165
integrated dataset of size 11229 × 22165
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> tra <o> trb <o> clone <o> clone.id <o>
          trav <o> traj <o> trbv <o> trbj <o> leiden <cat> <c> trab <bool> tcr.clone.sum <i64> <i>
          tcr.clone.size <i64> <i> tcr.expanded <bool> <bool> tcr.clone.size.rel <f64> <f>
          cluster.expansion <f64> <f> cluster.plasticity <f64> <f> clone.trans <f64> <f>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene> vst.means <f64> vst.vars <f64> vst.vars.norm <f64>
          vst.hvg.rank <f32> vst.hvg <bool>
 layers : counts <f32> <i/counts>
   obsm : pca <arr:f64(35)> <f/embedding/pca> knn.nn30 <arr:i32(30)> knn.d.nn30 <arr:f32(30)>
          umap <arr:f32(2)> <f/embedding>
   varm : pca <arr:f64(35)> <f/weights>
   obsp : connectivities.nn30 <csr:f32> distances.nn30 <csr:f32>
    uns : slots <system> commands <system> pca <dict> nn30 <knn> umap leiden leiden.colors

[*] composed of samples:
  pbmc    rna   hsa    batch b-1     of size 11229 × 22165

[35]:
fig = expm.rna.plot_multiple_embedding(
    basis = 'umap', ncols = 2, features = [
        'cluster.expansion', 'cluster.plasticity',
    ],
    figsize = (6, 3), dpi = 100, legend_col = 2, legend = False,
    ptsize = 2, annotate_style = 'text',
    cmap = 'reds/r'
)
../_images/scrna_k1-tcr-bcr_31_0.png
[36]:
expm.save(save_samples = False)
[i] main dataset write to expm/tcr/integrated.h5mu