Statistics#
We introduce several basic utility functions for summarizing and computing statistics on sparse matrices.
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux) platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 778.73 MiB
[i] virtual memory: 5.95 GiB
[3]:
expm = em.load_experiment('expm/scrna', load_subset = 'mono-neutro')
━━━━━━━━━━━━━━━━━━━━━━━━━━ loading samples 3 / 3 (00:03 < 00:00)
[4]:
print(expm)
annotated data of size 9754 × 19651
subset mono-neutro of size 9754 × 19651
contains modalities: rna
modality [rna]
obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
is.doublet <bool> <bool> qc <bool> <bool/qc> leiden <cat> <c> sc3.5 <cat> <c>
sc3.10 <cat> <c> sc3.20 <cat> <c> sc3.30 <cat> <c> cell.type <cat> <c>
var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
subtype <cat> <c/gsubtype> gene <cat> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
assembly <cat> <c> uid <o> <o/ugene> vst.hvg <bool> <bool/hvg> vst.all.means <f64> <f>
vst.all.vars <f64> <f> vst.all.vars.norm <f64> <f> vst.all.hvg.rank <f32> <f>
vst.all.hvg <bool> <bool>
layers : counts <f32> <i/counts> norm <f32> <f>
obsm : cnmf.10 <df> <f/embedding/usage> harmony <arr:f32(35)> <f> knn <arr:i32(100)> <i/knni>
knn.d <arr:f32(100)> <f/knnd> pca <arr:f64(35)> <f/embedding/pca>
umap <arr:f32(2)> <f/embedding>
varm : cnmf.10 <arr:f64(10)> <f/weights> cnmf.coef.10 <arr:f64(10)> <f/usage-coef>
pca <arr:f64(35)> <f/weights>
obsp : connectivities <csr:f32> <f/connectivity> distances <csr:f32> <f/distance>
uns : cell.type.colors cell.type_colors cnmf <cnmf> cnmf.args <o>
cnmf.density.10 <cnmf-density> cnmf.dist.10 <f/connectivity> cnmf.stats <cnmf-stats>
commands <system> leiden <o> leiden.colors <o> markers <markers> neighbors <knn>
pca <dict> sc3.10.colors <o> sc3.20.colors <o> sc3.30.colors <o> sc3.5.colors <o>
slots <system> umap <o>
[*] composed of samples:
distal rna mmu batch b1 of size 6451 × 21689
niche rna mmu batch b2 of size 3656 × 21970
normal rna mmu batch b3 of size 4537 × 21917
Cluster proportion statistics#
Compute the proportions of categorical variables across dataset rows (cells). The major parameter specifies the primary grouping variable, while minor specifies the categorical variable for which proportions are calculated.
[7]:
expm.rna.proportion(
major = 'sample', minor = 'cell.type',
normalize = 'index'
)['integrated']
[7]:
| cell.type | DCp | MDP | MM | Mo | Neu | Prog | iMac |
|---|---|---|---|---|---|---|---|
| sample | |||||||
| distal | 0.023297 | 0.062052 | 0.080421 | 0.035842 | 0.774194 | 0.023970 | 0.000224 |
| niche | 0.001878 | 0.160345 | 0.079985 | 0.088246 | 0.563275 | 0.024784 | 0.081487 |
| normal | 0.065855 | 0.159878 | 0.051389 | 0.031976 | 0.666159 | 0.023601 | 0.001142 |
Gaussian kernel density estimation#
Given a low-dimensional representation, we can compute the density of the scatter point distribution to visualize how cells are concentrated in different regions of the embedding space.
[8]:
fig = expm.rna.plot_embedding(
basis = 'umap', color = 'cell.type',
figsize = (3, 3), dpi = 100, legend_col = 2, legend = False,
ptsize = 2, annotate_style = 'text', annotate_fontsize = 8, contour_plot = True
)
[9]:
expm.rna.kde(
basis = "umap",
groupby = "sample",
key_added = "kde.umap",
components = "1,2"
)
[10]:
kde = expm.rna.plot_kde(
basis = 'umap', kde = 'kde.umap', grouping_key = 'sample',
background = 'cell.type', annotate = True, annotate_fontsize = 8, ptsize = 2,
figsize = (9, 3), dpi = 100, groups = None, ncols = 3, cmap = 'turbo'
)
Summary statistics#
The summary function creates a three-dimensional table using the on, across, and split parameters. Each cell value in the table is computed from the data, orient, and method parameters. The returned object is an AnnData structure where rows correspond to on, columns correspond to across, and layers contain each split.
When orient is set to 'obs', the on, across, and split parameters must be column names in the obs table. Similarly, when orient is set to 'var', these must be column names in the var table. Specific rows or columns are selected by the three filter conditions, and the method function is applied to produce the cell values of the three-dimensional table. If the filtered result is an empty set, the corresponding cell is NaN.
When method is set to the special method n, it returns the count of observations (rows or columns, not the count of values within the statistical range).
[13]:
adata = expm.rna.summary(
data = 'X', method = 'n', method_args = {},
orient = 'obs', on = 'sample',
across = 'cell.type', split = 'sample',
attached_metadata_on = None, attached_metadata_across = None,
attach_method_on = 'first', attach_method_across = 'first'
)['integrated']
[14]:
em.pd.DataFrame(
adata.layers['niche'],
index = adata.obs_names,
columns = adata.var_names
)
[14]:
| Neu | MDP | MM | Mo | DCp | Prog | iMac | |
|---|---|---|---|---|---|---|---|
| distal | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| niche | 1500.0 | 427.0 | 213.0 | 235.0 | 5.0 | 66.0 | 217.0 |
| normal | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Aggregate statistics#
The aggregate function collapses the original data table along rows or columns, applying a specified method (such as sum or mean) to groups defined by observation or variable keys.
[15]:
adata = expm.rna.aggregate(
data = 'counts', method = 'sum', method_args = {},
obs_key = 'sample', var_key = None
)['integrated']
[16]:
em.pd.DataFrame(
adata.X,
index = adata.obs_names,
columns = adata.var['gene']
).iloc[:, 0:10]
[16]:
| gene | Bzw1 | Ppp1r9b | Samd14 | Clk1 | Pdk2 | Itga3 | Ppil3 | Tac4 | Kat7 | Gm11520 |
|---|---|---|---|---|---|---|---|---|---|---|
| distal | 4153.0 | 1685.0 | 29.0 | 6723.0 | 54.0 | 0.0 | 456.0 | 2.0 | 1604.0 | 144.0 |
| niche | 3908.0 | 2584.0 | 42.0 | 4752.0 | 46.0 | 11.0 | 442.0 | 6.0 | 1415.0 | 110.0 |
| normal | 3470.0 | 1561.0 | 21.0 | 4831.0 | 72.0 | 1.0 | 446.0 | 8.0 | 1292.0 | 129.0 |
Pseudobulking#
Pseudobulking merges cells from a specified category by counts, and continue the analysis normalization as if they were coming from bulk dataset.
[17]:
expm.rna.annotate_paste(
froms = ['sample', 'cell.type'],
into = 'psbulk',
slot = 'rna',
sep = ':'
)
psbulk
distal:Neu 3456
normal:Neu 1750
niche:Neu 1500
niche:MDP 427
normal:MDP 420
distal:MM 359
distal:MDP 277
niche:Mo 235
niche:iMac 217
niche:MM 213
normal:DCp 173
distal:Mo 160
normal:MM 135
distal:Prog 107
distal:DCp 104
normal:Mo 84
niche:Prog 66
normal:Prog 62
niche:DCp 5
normal:iMac 3
distal:iMac 1
Name: count, dtype: int64
[19]:
expm.rna.pseudobulk(
obs_key = 'psbulk',
sample_added = 'psbulk',
obs_metadata = ['cell.type', 'sample', 'group'],
merge_obs_metadata = 'mode'
)
[i] created subset [psbulk] from sample [integrated]
[20]:
print(expm)
annotated data of size 9754 × 19651
subset mono-neutro of size 9754 × 19651
contains modalities: rna
modality [rna]
obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
is.doublet <bool> <bool> qc <bool> <bool/qc> leiden <cat> <c> sc3.5 <cat> <c>
sc3.10 <cat> <c> sc3.20 <cat> <c> sc3.30 <cat> <c> cell.type <cat> <c>
kde.umap <f64> <f/kde> psbulk <cat>
var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
subtype <cat> <c/gsubtype> gene <cat> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
assembly <cat> <c> uid <o> <o/ugene> vst.hvg <bool> <bool/hvg> vst.all.means <f64> <f>
vst.all.vars <f64> <f> vst.all.vars.norm <f64> <f> vst.all.hvg.rank <f32> <f>
vst.all.hvg <bool> <bool>
layers : counts <f32> <i/counts> norm <f32> <f>
obsm : cnmf.10 <df> <f/embedding/usage> harmony <arr:f32(35)> <f> knn <arr:i32(100)> <i/knni>
knn.d <arr:f32(100)> <f/knnd> pca <arr:f64(35)> <f/embedding/pca>
umap <arr:f32(2)> <f/embedding>
varm : cnmf.10 <arr:f64(10)> <f/weights> cnmf.coef.10 <arr:f64(10)> <f/usage-coef>
pca <arr:f64(35)> <f/weights>
obsp : connectivities <csr:f32> <f/connectivity> distances <csr:f32> <f/distance>
uns : cell.type.colors cell.type_colors cnmf <cnmf> cnmf.args <o>
cnmf.density.10 <cnmf-density> cnmf.dist.10 <f/connectivity> cnmf.stats <cnmf-stats>
commands <system> leiden <o> leiden.colors <o> markers <markers> neighbors <knn>
pca <dict> sc3.10.colors <o> sc3.20.colors <o> sc3.30.colors <o> sc3.5.colors <o>
slots <system> umap <o> kde.umap <kde-stats>
[*] composed of samples:
distal rna mmu batch b1 of size 6451 × 21689
niche rna mmu batch b2 of size 3656 × 21970
normal rna mmu batch b3 of size 4537 × 21917
psbulk rna autogen batch autogen of size 21 × 19651
[21]:
fig = expm.rna.plot_pseudobulk_samples(
run_on_samples = ['psbulk'],
groupby = None, figsize = (3, 3),
min_cells = 10, min_counts = 1000,
log = True,
)
[22]:
expm.rna.filter_pseudobulk_samples(
run_on_samples = ['psbulk'],
min_cells = 10, min_counts = 1000,
)
[23]:
expm.rna.view('psbulk')
annotated data of size 18 × 19651
obs : cell.type <o> <c> sample <o> <c/sample> group <o> <c> counts <f32> <f> cells <i64> <i>
var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
subtype <cat> <c/gsubtype> gene <cat> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
assembly <cat> <c> uid <o> <o/ugene> vst.hvg <bool> <bool/hvg> vst.all.means <f64> <f>
vst.all.vars <f64> <f> vst.all.vars.norm <f64> <f> vst.all.hvg.rank <f32> <f>
vst.all.hvg <bool> <bool>
layers : proportions <f64> <f/proportions>
uns : slots <system> commands <system>
[24]:
fig = expm.rna.plot_pseudobulk_features(
run_on_samples = ['psbulk'],
group = 'group',
lib_size = None,
min_count = 10,
min_total_count = 15,
large_n = 10,
min_sample_prop = 0.7,
cmap = 'viridis',
min_cell_prop = 0.2,
min_samples = 2,
figsize = (7, 3),
dpi = 100
)
[25]:
expm.rna.filter_pseudobulk_features(
run_on_samples = ['psbulk'],
group = 'group',
lib_size = None,
min_count = 10,
min_total_count = 15,
large_n = 10,
min_sample_prop = 0.7,
min_cell_prop = 0.2,
min_samples = 2
)
[27]:
expm.save()
[i] main dataset write to expm/scrna/subsets/mono-neutro.h5mu
[i] saving individual samples. (pass `save_samples = False` to skip)
━━━━━━━━━━━━━━━━━━━━━━━━━━ modality [rna] 4 / 4 (00:01 < 00:00)