Spliced Dynamics Inference#

We can infer the gene expression using dynamics of the spliced and unspliced proportions of a gene’s transcripts.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 777.88 MiB
[i] virtual memory: 5.95 GiB

[3]:
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux)  platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 779.01 MiB
[i] virtual memory: 5.95 GiB

Initializing the splicing expression matrix from four count matrices#

The data below provides a standard 10X-format expression matrix, along with two additional matrices recording spliced and unspliced count matrices: We need the following file structure to automatically load the splicing matrices. These four MTX matrices and two TSV files are required.

[4]:
! tree -sh scrna/pancreas
[4.0K]  scrna/pancreas
├── [ 94K]  annotation.tsv
├── [ 15K]  barcodes.tsv.gz
├── [200K]  features.tsv.gz
├── [ 28M]  matrix.mtx.gz
├── [ 50M]  pancreas.h5ad
├── [9.3M]  spanning.mtx.gz
├── [ 28M]  spliced.mtx.gz
└── [9.3M]  unspliced.mtx.gz

1 directory, 8 files
[8]:
meta = em.metadata(
    # two samples from a murine tumor infiltrating cd8+ t cell dataset.
    locations    = [
        'scrna/pancreas',
        'scrna/pancreas',
    ],
    modality     = ['rna', 'rna/splicing'],
    default_taxa = ['mmu', 'mmu'],
    batches      = ['b-1', 'b-1'],
    names        = ['pancreas', 'pancreas'],
    groups       = ['pancreas', 'pancreas'],
)
[6]:
meta.dataframe
[6]:
location sample batch group modality taxa
0 scrna/pancreas pancreas b-1 pancreas rna mmu
1 scrna/pancreas pancreas b-1 pancreas rna/splicing mmu
[9]:
expm = em.experiment(meta, dump = 'expm/pancreas')
[i] reading sample pancreas [rna] ...
[i] 433 genes (out of 27998) not in the reference gene list.
[i] total 27565 genes mapped. 27543 unique genes.
[i] reading sample pancreas [rna/splicing] ...

[10]:
print(expm)
[!] dataset not integrated.
[*] composed of samples:
  pancreas   rna   mmu    batch b-1     of size 3696 × 27543

[ ]:
fig = expm.plot_rna_spliced_proportions(groupby = 'cell.type', figsize = (8, 2.5))
../_images/scrna_i1-velocity_10_0.png

Quality control and visualization#

RNA dynamics inference generally requires a dataset that has already undergone marker gene selection, dimensionality reduction, clustering, and visualization. We first process the dataset using standard methods — see the Quality Control chapter for details.

[ ]:
expm.rna.qc(
    run_on_samples = True,
    mt_seqid = 'chrM',
    mt_percent = 0.15,
    ribo_genes = None,
    ribo_percent = None,
    outlier_mode = 'mads',
    outlier_n = 10,
    doublet_method = 'no',
    min_cells = 3,
    min_genes = 50,
)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━ processing samples       1 / 1     (00:04 < 00:00)
[12]:
figs = expm.rna.plot_qc(run_on_samples = True)
../_images/scrna_i1-velocity_13_0.png
[13]:
expm.merge(join = 'outer')
[14]:
expm.rna.log_normalize(
    key_norm = 'norm',
    key_lognorm = 'lognorm'
)

expm.rna.select_hvg(
    key_lognorm = 'X',
    method = 'vst',
    dest = 'vst',
    batch_key = 'batch',
    n_top_genes = 1500
)
[15]:
expm.rna.scale_pca(
    hvg = 'vst.hvg',
    key_lognorm = 'lognorm',
    key_scaled = 'scaled',
    key_added = 'pca', n_comps = 35,
    keep_sparse = False,
    random_state = 42,
    svd_solver = 'arpack'
)
[17]:
expm.rna.knn(
    use_rep = 'pca',
    n_comps = None,
    n_neighbors = 30,
    knn = True,
    method = "umap",
    transformer = None,
    metric = "cosine",
    metric_kwds = {},
    random_state = 42,
    key_added = 'neighbors'
)

expm.rna.leiden(
    resolution = 0.5,
    restrict_to = None,
    random_state = 42,
    key_added = 'leiden',
    adjacency = None,
    directed = None,
    use_weights = True,
    n_iterations = 2,
    partition_type = None,
    neighbors_key = None,
    obsp = None,
    flavor = 'igraph'
)

expm.rna.umap(
    min_dist = 0.5,
    spread = 1,
    n_components = 2,
    maxiter = None,
    alpha = 1,
    gamma = 1,
    negative_sample_rate = 5,
    init_pos = "spectral",
    random_state = 42,
    a = None, b = None,
    key_added = 'umap',
    neighbors_key = "neighbors"
)

Now that we have obtained a dimensionality reduction representation and performed clustering, we can directly adopt the clustering information provided by the authors’ metadata. Note that the current dataset already contains three standard splicing matrices in the layers: spliced, unspliced, and ambiguous. These three matrices represent the count matrices for spliced transcripts, unspliced transcripts, and unclassified spanning reads, respectively.

[18]:
annots = em.pd.read_table('scrna/pancreas/annotation.tsv', index_col = 0)
expm['rna'].obs['cell.type'] = annots['clusters'].tolist()
expm['rna'].obs['cell.type'] = expm['rna'].obs['cell.type'].astype('category')
annots
[18]:
clusters
index
AAACCTGAGAGGGATA Pre-endocrine
AAACCTGAGCCTTGAT Ductal
AAACCTGAGGCAATTA Alpha
AAACCTGCATCATCCC Ductal
AAACCTGGTAAGTGGC Ngn3 high EP
... ...
TTTGTCAAGTGACATA Pre-endocrine
TTTGTCAAGTGTGGCA Ngn3 high EP
TTTGTCAGTTGTTTGG Ductal
TTTGTCATCGAATGCT Alpha
TTTGTCATCTGTTTGT Epsilon

3696 rows × 1 columns

[19]:
fig = expm.rna.plot_multiple_embedding(
    basis = 'umap', features = ['cell.type', 'leiden'], ncols = 2, cmap = 'set1',
    figsize = (6, 3), dpi = 100, legend_col = 2, legend = False,
    ptsize = 2, annotate_style = 'text', annotate_fontsize = 8, contour_plot = True
)
../_images/scrna_i1-velocity_20_0.png
[20]:
print(expm)
annotated data of size 3696 × 17584
integrated dataset of size 3696 × 17584
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <bool> <bool/qc> leiden <cat> <c> cell.type <cat>
    var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
          subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <o> <c> uid <o> <o/ugene> vst.means <f64> <f> vst.vars <f64> <f>
          vst.vars.norm <f64> <f> vst.hvg.rank <f32> <f> vst.hvg <bool> <bool>
 layers : spliced <f64> <i/counts/spliced> unspliced <f64> <i/counts/unspliced>
          ambiguous <f64> <i/counts/ambiguous> norm <f32> <f/linear> lognorm <f32> <f/normal>
          counts <f32> <i/counts>
   obsm : pca <arr:f64(35)> <f/embedding/pca> knn <arr:i32(30)> <i/knni>
          knn.d <arr:f64(30)> <f/knnd> umap <arr:f32(2)> <f/embedding>
   varm : pca <arr:f64(35)> <f/weights>
   obsp : connectivities <csr:f32> <f/connectivity> distances <csr:f64> <f/distance>
    uns : slots <system> commands <system> pca <dict> neighbors <knn> leiden <o> umap
          cell.type.colors leiden.colors

[*] composed of samples:
  pancreas   rna   mmu    batch b-1     of size 3696 × 17584

RNA velocity inference#

The package already implements stochastic velocity inference.

[21]:
expm.rna.velocity(
    neighbor_key = 'neighbors', neighbor_connectivity = 'connectivities',
    n_neighbors = 30, hvg = 'vst.hvg', velocity_key = 'velocity',
    n_cpus = 100,
    kwargs_filter = {},
    kwargs_velocity = { 'mode': 'stochastic' },
    kwargs_velocity_graph = {},
    kwargs_terminal_state = {},
    kwargs_pseudotime = { 'save_diffmap': True }
)
[i] normalized count data: spliced, unspliced.
[i] computing moments based on connectivities
[i] computing velocities ...
[i] computing velocity graph (using 100/144 cores)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━                       3696 / 3696  (00:09 < 00:00)
[i] computing terminal states ...
[i] identified 1 region of root cells and 1 region of end points .

velocity.pseudotime, velocity.length, velocity.confidence, and velocity.confidence.transition — four velocity quality parameters — have been computed and automatically filled into the obs slot. We can now evaluate the reliability of the velocity estimates.

[22]:
print(expm)
annotated data of size 3696 × 17584
integrated dataset of size 3696 × 17584
contains modalities: rna

 modality [rna]
    obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
          taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f32> <i> n.genes <i64> <i>
          n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
          filter <cat> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
          is.doublet <bool> <bool> qc <cat> <bool/qc> leiden <cat> <c> cell.type <cat>
          n.umi.unspliced <f64> <i> n.umi.spliced <f64> <i> velocity.self.transition <f32>
          root.cells <f64> <i> endpoints <f64> <i> velocity.pseudotime <f64> velocity.length <f32>
          velocity.confidence <f64> velocity.confidence.transition <f32>
    var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
          subtype <cat> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
          assembly <cat> <c> uid <o> <o/ugene> vst.means <f64> <f> vst.vars <f64> <f>
          vst.vars.norm <f64> <f> vst.hvg.rank <f32> <f> vst.hvg <bool> <bool> velocity.gamma <f32>
          velocity.qreg.ratio <f32> velocity.r2 <f64> velocity.genes <bool>
 layers : spliced <f64> <i/counts/spliced> unspliced <f64> <i/counts/unspliced>
          ambiguous <f64> <i/counts/ambiguous> norm <f32> <f/linear> lognorm <f32> <f/normal>
          counts <f32> <i/counts> spliced.counts <f64> <i/counts/spliced>
          unspliced.counts <f64> <i/counts/unspliced> ms <f32> <f/ms> mu <f32> <f/mu>
          velocity <f32> <f/vector> variance.velocity <f32>
   obsm : pca <arr:f64(35)> <f/embedding/pca> knn <arr:i32(30)> <i/knni>
          knn.d <arr:f64(30)> <f/knnd> umap <arr:f32(2)> <f/embedding> vdiff <arr:f64(10)> <f>
   varm : pca <arr:f64(35)> <f/weights>
   obsp : connectivities <csr:f32> <f/connectivity> distances <csr:f64> <f/distance>
    uns : slots <system> commands <system> pca <dict> neighbors <knn> leiden <o> umap
          cell.type.colors leiden.colors velocity.params velocity.graph velocity.graph.neg

[*] composed of samples:
  pancreas   rna   mmu    batch b-1     of size 3696 × 17584

[23]:
fig = expm.rna.plot_multiple_embedding(
    basis = 'umap', features = [
        'velocity.pseudotime', 'velocity.length', 'velocity.confidence', 'velocity.confidence.transition'
    ], ncols = 4, cmap = 'turbo',
    figsize = (10, 2.5), dpi = 100, legend_col = 2, legend = False,
    ptsize = 2, annotate_style = 'text', annotate_fontsize = 8, contour_plot = False
)
../_images/scrna_i1-velocity_26_0.png

Given a 2D base embedding, we can project the high-dimensional velocity vectors onto UMAP for visualization.

[25]:
fig = expm.rna.plot_velocity_streamline(
    basis = 'umap', vkey = "velocity",
    neighbor_key = 'neighbors', frameon = False,
    color = 'cell.type', contour_plot = False, figsize = (3, 3), dpi = 100,
    density = 1.5, annotate_style = 'text', annotate_fontsize = 8
)
../_images/scrna_i1-velocity_28_0.png

Saving the dataset#

Save the integrated dataset.

[26]:
expm.save(save_samples = False)
[i] main dataset write to expm/pancreas/integrated.h5mu