Transcription Factor Activity Inference#
SCENIC+ provides method of TF target ranking based inference for activated or suppressed transcriptional factor from single cell expression
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import exprmat as em
# set working directory
em.setwd('../../../data')
ver = em.version()
[i] exprmat 0.2.66 / exprmat-db 0.2.66
[i] os: posix (linux) platform version: 6.8.0-90-generic
[i] loaded configuration from /home/data/yangz/.exprmatrc
[i] current working directory: /home/data/yangz/packages/exprmat/data
[i] current database directory: /home/data/yangz/packages/database (0.2.66)
[i] resident memory: 777.45 MiB
[i] virtual memory: 5.95 GiB
[3]:
expm = em.load_experiment('expm/scrna', load_samples = True, load_subset = 'mono-neutro')
━━━━━━━━━━━━━━━━━━━━━━━━━━ loading samples 5 / 5 (00:05 < 00:00)
[4]:
print(expm)
annotated data of size 9754 × 19651
subset mono-neutro of size 9754 × 19651
contains modalities: rna
modality [rna]
obs : sample <cat> <c/sample> batch <cat> <c/batch> group <cat> <c> modality <cat> <c/modality>
taxa <cat> <c/taxa> barcode <o> <o> ubc <o> <o> n.umi <f64> <i> n.genes <i64> <i>
n.mito <f64> <f> n.ribo <f64> <f> pct.mito <f64> <f> pct.ribo <f64> <f>
filter <bool> <bool> score.doublet <f64> <f> score.doublet.se <f64> <f>
is.doublet <bool> <bool> qc <bool> <bool/qc> leiden <cat> <c> sc3.5 <cat> <c>
sc3.10 <cat> <c> sc3.20 <cat> <c> sc3.30 <cat> <c> cell.type <cat> <c>
kde.umap <f64> <f/kde> psbulk <cat> <o> cytotrace.score <f64> <f>
cytotrace.potency <cat> <o> cytotrace.relative <f64> <f> cytotrace.score.preknn <f64> <f>
cytotrace.potency.preknn <cat> <o> ppt.pseudotime <f64> <f> ppt.seg <cat> <o>
ppt.edge <cat> <o> ppt.milestones <cat> <o>
var : chr <cat> <c/chromosome> start <i64> <i> end <i64> <i> strand <cat> <c/strand> id <o> <o>
subtype <cat> <c/gsubtype> gene <cat> <o/gene> tlen <f64> <i/tlen> cdslen <i64> <i/cdslen>
assembly <cat> <c> uid <o> <o/ugene> vst.hvg <bool> <bool/hvg> vst.all.means <f64> <f>
vst.all.vars <f64> <f> vst.all.vars.norm <f64> <f> vst.all.hvg.rank <f32> <f>
vst.all.hvg <bool> <bool>
layers : counts <f32> <i/counts> magic <f64> <f/normal/imputed> norm <f32> <f>
obsm : cnmf.10 <df> <f/embedding/usage> diffmap <arr:f32(5)> <f/embedding>
harmony <arr:f32(35)> <f> knn <arr:i32(100)> <i/knni> knn.d <arr:f32(100)> <f/knnd>
knn.d.nn30.diffmap <arr:f32(30)> <f> knn.nn30.diffmap <arr:i32(30)> <i>
pca <arr:f64(35)> <f/embedding/pca> ppt <arr:f64(2000)> <ppt-assign>
umap <arr:f32(2)> <f/embedding> umap.diff <arr:f32(2)> <f/embedding>
varm : cnmf.10 <arr:f64(10)> <f/weights> cnmf.coef.10 <arr:f64(10)> <f/usage-coef>
pca <arr:f64(35)> <f/weights>
obsp : connectivities <csr:f32> <f/connectivity> connectivities.nn30.diffmap <csr:f32> <f>
distances <csr:f32> <f/distance> distances.nn30.diffmap <csr:f32> <f>
uns : cell.type.colors <o> cell.type_colors <o> cnmf <cnmf> cnmf.args <o>
cnmf.density.10 <cnmf-density> cnmf.dist.10 <f/connectivity> cnmf.stats <cnmf-stats>
commands <system> diffmap <o> kde.umap <kde-stats> leiden <o> leiden.colors <o>
magic.errors <magic-errors> magic.t <magic-t> markers <markers> neighbors <knn>
nn30.diffmap <knn> pca <dict> ppt <ppt> ppt.graph <o> ppt.milestones.colors <o>
ppt.pseudotime <o> ppt.root <i> ppt.seg.colors <o> sc3.10.colors <o> sc3.20.colors <o>
sc3.30.colors <o> sc3.5.colors <o> slots <system> trace <trace> umap <o> umap.diff <o>
[*] composed of samples:
distal rna mmu batch b1 of size 6451 × 21689
niche rna mmu batch b2 of size 3656 × 21970
normal rna mmu batch b3 of size 4537 × 21917
integrated-metacell rna autogen batch autogen of size 1277 × 18903
psbulk rna autogen batch autogen of size 18 × 8928
SCENIC#
To obtain more stable result and reduce computation, running SCENIC on metacells is a good choice.
[5]:
expm.rna['integrated-metacell'].layers['counts'] = em.np.expm1(
expm.rna['integrated-metacell'].X
).floor()
The calculation takes a long time. Due to the instability of jupyter kernels when executing long-time runs, we recommend running it in Python REPL
[7]:
expm.rna.infer_tf_activity(
run_on_samples = ['integrated-metacell'],
layer = 'counts',
gene = 'gene',
method = 'grnboost2',
taxa = 'mmu',
ncpus = 100,
seed = 42,
features = 'genes',
cistromes = 'motifs',
thresholds = [0.75, 0.90],
top_n_targets = [50],
top_n_regulators = [5, 10, 50],
min_genes = 20,
mask_dropout = True,
keep_only_activating = True,
rank_threshold = 5000,
auc_threshold = 0.05,
nes_threshold = 3.0,
max_similarity_fdr = 0.001,
min_orthologous_identity = 0,
chunk_size = 100,
weighted = False,
key_adjacencies = 'tf.adjacencies',
key_added = 'tf'
)
[i] loaded 1860 transcriptional factors.
[i] starting grnboost2 using 100 processes
━━━━━━━━━━━━━━━━━━━━━ inferring partial network 18903 / 18903 (56:56 < 00:00)
[i] finished in 3419.57559466362 seconds.
[i] calculating pearson correlations ...
[i] using 100 workers.
[i] creating regulons from a dataframe of enriched features.
[i] additional columns saved: []
━━━━━━━━━━━━━━━━━━━━━━━━━━ auc 266 / 266 (00:05 < 00:00)
[8]:
expm.rna['integrated-metacell'].uns['tf.adjacencies']
[8]:
| tf | target | importance | |
|---|---|---|---|
| 1244 | Ltf | Ngp | 1.225779e+01 |
| 1244 | Ltf | Abca13 | 9.880056e+00 |
| 1094 | Irf7 | Ifit1bl1 | 9.697832e+00 |
| 1244 | Ltf | Camp | 9.575884e+00 |
| 94 | Irf4 | Ccr7 | 9.397725e+00 |
| ... | ... | ... | ... |
| 1274 | Elf4 | Slc25a21 | 1.083091e-19 |
| 292 | Ezr | Prg2 | 5.567626e-20 |
| 906 | Mrps25 | Slc25a21 | 4.510558e-20 |
| 781 | Lcorl | Prg2 | 2.973287e-20 |
| 719 | Sfpq | Usf3 | 2.380551e-20 |
997664 rows × 3 columns
[10]:
expm.rna['integrated-metacell'].obsm['tf'].iloc[:, 0:6]
[10]:
| regulon | Ahr (+) | Arid3a (+) | Arid5b (+) | Arntl (+) | Atf1 (+) | Atf2 (+) |
|---|---|---|---|---|---|---|
| mc:integrated:1 | 0.022785 | 0.041407 | 0.137302 | 0.000000 | 0.041860 | 0.032423 |
| mc:integrated:2 | 0.041360 | 0.091019 | 0.081349 | 0.000000 | 0.069738 | 0.025587 |
| mc:integrated:3 | 0.010312 | 0.029286 | 0.111905 | 0.000000 | 0.047406 | 0.032212 |
| mc:integrated:4 | 0.032939 | 0.067835 | 0.130159 | 0.000000 | 0.085915 | 0.072000 |
| mc:integrated:5 | 0.052145 | 0.078238 | 0.073810 | 0.000000 | 0.076212 | 0.030032 |
| ... | ... | ... | ... | ... | ... | ... |
| mc:integrated:1273 | 0.040842 | 0.084477 | 0.263889 | 0.000000 | 0.057523 | 0.013037 |
| mc:integrated:1274 | 0.075650 | 0.068866 | 0.194444 | 0.000000 | 0.087294 | 0.048021 |
| mc:integrated:1275 | 0.025464 | 0.106507 | 0.127513 | 0.014932 | 0.061451 | 0.029122 |
| mc:integrated:1276 | 0.041945 | 0.065663 | 0.160053 | 0.000000 | 0.077026 | 0.027894 |
| mc:integrated:1277 | 0.023798 | 0.029643 | 0.108598 | 0.000000 | 0.056603 | 0.028339 |
1277 rows × 6 columns
[24]:
fig = expm.rna.plot_multiple_embedding(
run_on_samples = ['integrated-metacell'],
basis = 'umap', features = ['Klf4 (+)', 'Irf8 (+)', 'Mafb (+)', 'Cebpb (+)'], ncols = 4,
sort = True, figsize = (10, 2.5), dpi = 100, legend_col = 2
)
[25]:
em.memory()
[i] resident memory: 4.40 GiB
[i] virtual memory: 20.30 GiB
[ ]:
expm.save()