Spliced Dynamics Inference#
We can infer the gene expression using dynamics of the spliced and unspliced proportions of a gene’s transcripts.
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux) platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 777.88 MiB
[i] virtual memory: 5.95 GiB
[3]:
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux) platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 779.01 MiB
[i] virtual memory: 5.95 GiB
Initializing the splicing expression matrix from four count matrices#
The data below provides a standard 10X-format expression matrix, along with two additional matrices recording spliced and unspliced count matrices: We need the following file structure to automatically load the splicing matrices. These four MTX matrices and two TSV files are required.
[4]:
! tree -sh scrna/pancreas
[4.0K] scrna/pancreas
├── [ 94K] annotation.tsv
├── [ 15K] barcodes.tsv.gz
├── [200K] features.tsv.gz
├── [ 28M] matrix.mtx.gz
├── [ 50M] pancreas.h5ad
├── [9.3M] spanning.mtx.gz
├── [ 28M] spliced.mtx.gz
└── [9.3M] unspliced.mtx.gz
1 directory, 8 files
[8]:
meta = em.metadata(
# two samples from a murine tumor infiltrating cd8+ t cell dataset.
locations = [
'scrna/pancreas',
'scrna/pancreas',
],
modality = ['rna', 'rna/splicing'],
default_taxa = ['mmu', 'mmu'],
batches = ['b-1', 'b-1'],
names = ['pancreas', 'pancreas'],
groups = ['pancreas', 'pancreas'],
)
[6]:
meta.dataframe
[6]:
| location | sample | batch | group | modality | taxa | |
|---|---|---|---|---|---|---|
| 0 | scrna/pancreas | pancreas | b-1 | pancreas | rna | mmu |
| 1 | scrna/pancreas | pancreas | b-1 | pancreas | rna/splicing | mmu |
[9]:
expm = em.experiment(meta, dump = 'expm/pancreas')
[i] reading sample pancreas [rna] ...
[i] 433 genes (out of 27998) not in the reference gene list.
[i] total 27565 genes mapped. 27543 unique genes.
[i] reading sample pancreas [rna/splicing] ...
[10]:
print(expm)
[!] dataset not integrated.
[*] composed of samples:
pancreas rna mmu batch b-1 of size 3696 × 27543
[ ]:
fig = expm.plot_rna_spliced_proportions(groupby = 'cell.type', figsize = (8, 2.5))
Quality control and visualization#
RNA dynamics inference generally requires a dataset that has already undergone marker gene selection, dimensionality reduction, clustering, and visualization. We first process the dataset using standard methods — see the Quality Control chapter for details.
[ ]:
expm.rna.qc(
run_on_samples = True,
mt_seqid = 'chrM',
mt_percent = 0.15,
ribo_genes = None,
ribo_percent = None,
outlier_mode = 'mads',
outlier_n = 10,
doublet_method = 'no',
min_cells = 3,
min_genes = 50,
)
━━━━━━━━━━━━━━━━━━━━━━━━━━ processing samples 1 / 1 (00:04 < 00:00)
[12]:
figs = expm.rna.plot_qc(run_on_samples = True)
[13]:
expm.merge(join = 'outer')
[14]:
expm.rna.log_normalize(
key_norm = 'norm',
key_lognorm = 'lognorm'
)
expm.rna.select_hvg(
key_lognorm = 'X',
method = 'vst',
dest = 'vst',
batch_key = 'batch',
n_top_genes = 1500
)
[15]:
expm.rna.scale_pca(
hvg = 'vst.hvg',
key_lognorm = 'lognorm',
key_scaled = 'scaled',
key_added = 'pca', n_comps = 35,
keep_sparse = False,
random_state = 42,
svd_solver = 'arpack'
)
[17]:
expm.rna.knn(
use_rep = 'pca',
n_comps = None,
n_neighbors = 30,
knn = True,
method = "umap",
transformer = None,
metric = "cosine",
metric_kwds = {},
random_state = 42,
key_added = 'neighbors'
)
expm.rna.leiden(
resolution = 0.5,
restrict_to = None,
random_state = 42,
key_added = 'leiden',
adjacency = None,
directed = None,
use_weights = True,
n_iterations = 2,
partition_type = None,
neighbors_key = None,
obsp = None,
flavor = 'igraph'
)
expm.rna.umap(
min_dist = 0.5,
spread = 1,
n_components = 2,
maxiter = None,
alpha = 1,
gamma = 1,
negative_sample_rate = 5,
init_pos = "spectral",
random_state = 42,
a = None, b = None,
key_added = 'umap',
neighbors_key = "neighbors"
)
Now that we have obtained a dimensionality reduction representation and performed clustering, we can directly adopt the clustering information provided by the authors’ metadata. Note that the current dataset already contains three standard splicing matrices in the layers: spliced, unspliced, and ambiguous. These three matrices represent the count matrices for spliced transcripts, unspliced transcripts, and unclassified spanning reads, respectively.
[18]:
annots = em.pd.read_table('scrna/pancreas/annotation.tsv', index_col = 0)
expm['rna'].obs['cell.type'] = annots['clusters'].tolist()
expm['rna'].obs['cell.type'] = expm['rna'].obs['cell.type'].astype('category')
annots
[18]:
| clusters | |
|---|---|
| index | |
| AAACCTGAGAGGGATA | Pre-endocrine |
| AAACCTGAGCCTTGAT | Ductal |
| AAACCTGAGGCAATTA | Alpha |
| AAACCTGCATCATCCC | Ductal |
| AAACCTGGTAAGTGGC | Ngn3 high EP |
| ... | ... |
| TTTGTCAAGTGACATA | Pre-endocrine |
| TTTGTCAAGTGTGGCA | Ngn3 high EP |
| TTTGTCAGTTGTTTGG | Ductal |
| TTTGTCATCGAATGCT | Alpha |
| TTTGTCATCTGTTTGT | Epsilon |
3696 rows × 1 columns
[19]:
fig = expm.rna.plot_multiple_embedding(
basis = 'umap', features = ['cell.type', 'leiden'], ncols = 2, cmap = 'set1',
figsize = (6, 3), dpi = 100, legend_col = 2, legend = False,
ptsize = 2, annotate_style = 'text', annotate_fontsize = 8, contour_plot = True
)
[20]:
print(expm)
annotated data of size 3696 × 17584
integrated dataset of size 3696 × 17584
contains modalities: rna
modality [rna]
obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
is.doublet <bool> <bool> qc <bool> <bool/qc> leiden <cat> <c> cell.type <cat>
var : chr <o> <c/chromosome> start <i64> <i> end <i64> <i> strand <o> <c/strand> id <o> <o>
subtype <o> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
assembly <o> <c> uid <o> <o/ugene> vst.means <f64> <f> vst.vars <f64> <f>
vst.vars.norm <f64> <f> vst.hvg.rank <f32> <f> vst.hvg <bool> <bool>
layers : spliced <f64> <i/counts/spliced> unspliced <f64> <i/counts/unspliced>
ambiguous <f64> <i/counts/ambiguous> norm <f32> <f/linear> lognorm <f32> <f/normal>
counts <f32> <i/counts>
obsm : pca <arr:f64(35)> <f/embedding/pca> knn <arr:i32(30)> <i/knni>
knn.d <arr:f64(30)> <f/knnd> umap <arr:f32(2)> <f/embedding>
varm : pca <arr:f64(35)> <f/weights>
obsp : connectivities <csr:f32> <f/connectivity> distances <csr:f64> <f/distance>
uns : slots <system> commands <system> pca <dict> neighbors <knn> leiden <o> umap
cell.type.colors leiden.colors
[*] composed of samples:
pancreas rna mmu batch b-1 of size 3696 × 17584
RNA velocity inference#
The package already implements stochastic velocity inference.
[21]:
expm.rna.velocity(
neighbor_key = 'neighbors', neighbor_connectivity = 'connectivities',
n_neighbors = 30, hvg = 'vst.hvg', velocity_key = 'velocity',
n_cpus = 100,
kwargs_filter = {},
kwargs_velocity = { 'mode': 'stochastic' },
kwargs_velocity_graph = {},
kwargs_terminal_state = {},
kwargs_pseudotime = { 'save_diffmap': True }
)
[i] normalized count data: spliced, unspliced.
[i] computing moments based on connectivities
[i] computing velocities ...
[i] computing velocity graph (using 100/144 cores)
━━━━━━━━━━━━━━━━━━━━━━━━━━ 3696 / 3696 (00:09 < 00:00)
[i] computing terminal states ...
[i] identified 1 region of root cells and 1 region of end points .
velocity.pseudotime, velocity.length, velocity.confidence, and velocity.confidence.transition — four velocity quality parameters — have been computed and automatically filled into the obs slot. We can now evaluate the reliability of the velocity estimates.
[22]:
print(expm)
annotated data of size 3696 × 17584
integrated dataset of size 3696 × 17584
contains modalities: rna
modality [rna]
obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f32> <i> n.genes <i64> <i>
n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
filter <cat> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
is.doublet <bool> <bool> qc <cat> <bool/qc> leiden <cat> <c> cell.type <cat>
n.umi.unspliced <f64> <i> n.umi.spliced <f64> <i> velocity.self.transition <f32>
root.cells <f64> <i> endpoints <f64> <i> velocity.pseudotime <f64> velocity.length <f32>
velocity.confidence <f64> velocity.confidence.transition <f32>
var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
subtype <cat> <c/gsubtype> gene <o> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
assembly <cat> <c> uid <o> <o/ugene> vst.means <f64> <f> vst.vars <f64> <f>
vst.vars.norm <f64> <f> vst.hvg.rank <f32> <f> vst.hvg <bool> <bool> velocity.gamma <f32>
velocity.qreg.ratio <f32> velocity.r2 <f64> velocity.genes <bool>
layers : spliced <f64> <i/counts/spliced> unspliced <f64> <i/counts/unspliced>
ambiguous <f64> <i/counts/ambiguous> norm <f32> <f/linear> lognorm <f32> <f/normal>
counts <f32> <i/counts> spliced.counts <f64> <i/counts/spliced>
unspliced.counts <f64> <i/counts/unspliced> ms <f32> <f/ms> mu <f32> <f/mu>
velocity <f32> <f/vector> variance.velocity <f32>
obsm : pca <arr:f64(35)> <f/embedding/pca> knn <arr:i32(30)> <i/knni>
knn.d <arr:f64(30)> <f/knnd> umap <arr:f32(2)> <f/embedding> vdiff <arr:f64(10)> <f>
varm : pca <arr:f64(35)> <f/weights>
obsp : connectivities <csr:f32> <f/connectivity> distances <csr:f64> <f/distance>
uns : slots <system> commands <system> pca <dict> neighbors <knn> leiden <o> umap
cell.type.colors leiden.colors velocity.params velocity.graph velocity.graph.neg
[*] composed of samples:
pancreas rna mmu batch b-1 of size 3696 × 17584
[23]:
fig = expm.rna.plot_multiple_embedding(
basis = 'umap', features = [
'velocity.pseudotime', 'velocity.length', 'velocity.confidence', 'velocity.confidence.transition'
], ncols = 4, cmap = 'turbo',
figsize = (10, 2.5), dpi = 100, legend_col = 2, legend = False,
ptsize = 2, annotate_style = 'text', annotate_fontsize = 8, contour_plot = False
)
Given a 2D base embedding, we can project the high-dimensional velocity vectors onto UMAP for visualization.
[25]:
fig = expm.rna.plot_velocity_streamline(
basis = 'umap', vkey = "velocity",
neighbor_key = 'neighbors', frameon = False,
color = 'cell.type', contour_plot = False, figsize = (3, 3), dpi = 100,
density = 1.5, annotate_style = 'text', annotate_fontsize = 8
)
Saving the dataset#
Save the integrated dataset.
[26]:
expm.save(save_samples = False)
[i] main dataset write to expm/pancreas/integrated.h5mu