Metadata

Creating Metadata

The first step in loading a dataset is to create a metadata table, either by script or by hand. The table is a TSV file. You can build it with the exprmat.metadata class and save it as TSV, or write it with any other tool. It must contain at least six columns with fixed names:

  • location: Specifies the location of the raw dataset.

    This can be a folder or a file. The expected format is determined by the modality column, and a single modality may accept several file formats. The table below lists the supported formats:

    Acceptable file types for each modality

    Modality

    Acceptable formats as locations

    rna

    • A folder containing three files: matrix.mtx(.gz), barcodes.tsv(.gz) and features.tsv(.gz). The legacy 10X file name genes.tsv is also supported.

    • A `.h5ad` or `.h5` file. If the variable names in the file start with the rna: prefix, the file is treated as a reusable file already processed by the exprmat pipeline. Its obs table is then expected to contain the columns sample, modality, batch, gene and taxa, and any missing column raises an error. If the variable names do not start with rna:, the file is treated as output from another pipeline (e.g. a standard 10X H5 file), and the obs columns are overwritten with the values given in the metadata table, even if the H5 file already contains them. Variable names are normalized to the rna: prefix automatically.

    • A `.csv` or `.tsv` file. Rows are genes and columns are cells. The matrix is transposed automatically on loading, so do not transpose it yourself.

    rna/splicing

    • A `.loom` file from the standard velocyto pipeline.

    • A folder containing barcodes.tsv.gz, features.tsv.gz, spanning.mtx.gz, spliced.mtx.gz and unspliced.mtx.gz. These typically come from separate feature-counting pipelines or from BGI’s DNBC4 tools.

    rna/tcr

    • A `.csv` file from 10X’s standard CellRanger pipelines: filtered_contig_annotations.csv.

    rna/bcr

    • A `.csv` file from 10X’s standard CellRanger pipelines: filtered_contig_annotations.csv.

    rna/raw

    • A folder containing 10X compatible matrix market files

    • A 10X compatible .h5 file

    atac

    • A `.tsv.gz` file containing mapped fragments. The genomic reference is determined automatically from taxa. By default the latest genome reference is used; a user-defined version must be configured manually.

    atac-bulk

    • A `.bam` file with the ATAC-seq sequence alignment. This is the recommended input and supports more functionality than a .bigwig file.

    • A `.bigwig` file containing the summarized bins.

    cite

    • A folder containing 10X compatible matrix market files

    • A 10X compatible .h5 file. Such files contain 10X feature types other than RNA. After loading, the data is split into one rna modality and one cite modality.

    xenium/c

    • A folder containing cells.zarr.zip and cell_feature_matrix.zarr.zip, used to load segmented Xenium output from Xenium Onboard Analysis (XOA).

    xenium/s

    • A folder containing morphology_focus and transcripts.zarr.zip, used for raw spot-level Xenium data.

    visium

    • A Visium v1 output folder containing the spatial folder and the filtered feature-barcode matrix.

    hd/c

    • Cell-segmented Visium HD data, containing segmented_outputs and barcode_mappings.parquet.

    hd/2, hd/8, hd/16

    • Bin-level Visium HD data, containing square_xxxum and barcode_mappings.parquet.

    mif

    • A `.tiff` file with channelnames.txt under the same directory for multiplex IF or CODEX data

  • modality: Specifies the modality type of the data sample. See the table above.

  • sample: Specifies the name of the sample.

    Sample names must be unique within each independent modality but can be repeated across different independent modalities. Samples of dependent modalities must find their parent sample with the same name in an independent modality.

    Important

    The modality you set in the metadata table is not always the modality the program maintains after the experiment is loaded. A user-specified modality consists of a root class, optionally followed by modifiers separated by slashes (called sub-modalities). A sub-modality tells the program to load the data with a special routine. After loading, the program merges these distinct sub-modalities into unified modalities that have registered functions and accessors. The unified modality names are rna, atac, cite, spatial-cell and spatial-bin. Although rna and atac may look like copies of the input modality (they are not), the latter three may be completely reorganized and generated by the loader. For example, when the input modality is cite, the program splits the dataset into one rna and one cite modality. The spatial- modalities are not accepted as user input: specify the distinct technique name instead, and the loader decides which spatial modality to use.

    A unified modality cannot contain duplicate sample names. For example, since xenium/c and mif both result in the spatial-cell modality, the same sample name cannot be used for both.

  • batch: Specifies the batch of the sample, used for batch integration and correction.

  • group: Free-form content, generally used to store experimental group conditions.

  • taxa: The species name of the sample.

    For RNA data, we look up the species gene list from the built-in database using the species name. For ATAC data, we find the latest reference genome based on the species name. By default, the installed database only contains the hsa and mmu species. You can build or download databases for other species of interest yourself.
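The rule that a unified modality cannot hold duplicate sample names can be checked before loading. The sketch below is plain Python and is not part of exprmat. The `UNIFIED` lookup is a hypothetical, partial table that only encodes cases stated in the note above (a cite input is split into rna and cite; xenium/c and mif both become spatial-cell), and `find_conflicts` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical, partial lookup built only from the cases described above.
# The real loader decides the unified modality itself.
UNIFIED = {
    "rna": ["rna"],
    "atac": ["atac"],
    "cite": ["rna", "cite"],        # a cite input is split into rna and cite
    "xenium/c": ["spatial-cell"],
    "mif": ["spatial-cell"],
}


def find_conflicts(rows):
    """Return (unified modality, sample) pairs claimed by more than one row."""
    seen = defaultdict(list)
    for modality, sample in rows:
        for unified in UNIFIED[modality]:
            seen[(unified, sample)].append(modality)
    return {key: mods for key, mods in seen.items() if len(mods) > 1}


# (modality, sample) pairs as they would appear in the metadata table.
rows = [("xenium/c", "tumor"), ("mif", "tumor"), ("rna", "tumor")]
conflicts = find_conflicts(rows)
print(conflicts)
```

Here the xenium/c and mif rows collide on spatial-cell, while the rna row with the same sample name is reported as fine because it ends up in a different unified modality.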

Creating a metadata table directly using the constructor

import exprmat as em

meta = em.metadata(
    locations    = [
        # one sample from over-expression group.
        './oe1/filtered',
        './oe1/velocyto/splicing.loom',
        # one sample from wild type group.
        './wt1/filtered',
        './wt1/velocyto/splicing.loom',
    ],
    modality     = ['rna', 'rna.splicing', 'rna', 'rna.splicing'],
    default_taxa = ['mmu'] * 4,
    batches      = ['b1', 'b1', 'b2', 'b2'],
    names        = ['oe', 'oe', 'wt', 'wt'],
    groups       = [
        'somatic(cond(cd8, oe(A)))',
        'somatic(cond(cd8, oe(A)))',
        'somatic(wt)',
        'somatic(wt)'
    ]
)

Note

Due to backward compatibility, the parameter names in the exprmat.metadata constructor do not exactly match the TSV column names. For example, the default_taxa parameter corresponds to the taxa column, and the names parameter corresponds to the sample column. If you decide to write the metadata table manually, use the column names specified in the example table below.
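The correspondence between constructor parameters and TSV columns can be written down as a small lookup. The sketch below is plain Python and does not call exprmat; `PARAM_TO_COLUMN` and `to_rows` are illustrative names, and the mapping follows the column names described on this page.

```python
# Constructor parameter name -> TSV column name, as described on this page.
PARAM_TO_COLUMN = {
    "locations": "location",
    "names": "sample",
    "batches": "batch",
    "groups": "group",
    "modality": "modality",
    "default_taxa": "taxa",
}


def to_rows(**params):
    """Turn constructor-style lists into one dict per sample, keyed by column name."""
    columns = [PARAM_TO_COLUMN[name] for name in params]
    return [dict(zip(columns, values)) for values in zip(*params.values())]


rows = to_rows(
    locations=["./oe1/filtered", "./wt1/filtered"],
    modality=["rna", "rna"],
    default_taxa=["mmu", "mmu"],
    names=["oe", "wt"],
)
print(rows[0])
```

Each resulting dict uses the TSV column names, which is the form to follow when writing the table by hand.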

Finally, you should call exprmat.metadata.save() to save it as a TSV file on disk.

Saving the metadata table

meta.save('metadata.tsv')

This will be saved as a TSV file:

location                      sample  batch  group                      modality      taxa
./oe1/filtered                oe      b1     somatic(cond(cd8, oe(A)))  rna           mmu
./oe1/velocyto/splicing.loom  oe      b1     somatic(cond(cd8, oe(A)))  rna.splicing  mmu
./wt1/filtered                wt      b2     somatic(wt)                rna           mmu
./wt1/velocyto/splicing.loom  wt      b2     somatic(wt)                rna.splicing  mmu
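If you prefer to produce such a table without exprmat, any tool that writes real tab characters will do. The following is a minimal sketch using Python's standard csv module. It writes to an in-memory buffer for illustration, and the two sample rows are taken from the example above; to write a file instead, pass an object opened with `open('metadata.tsv', 'w', newline='')`.

```python
import csv
import io

# The six required column names, in the order shown in the table above.
COLUMNS = ["location", "sample", "batch", "group", "modality", "taxa"]
samples = [
    ["./oe1/filtered", "oe", "b1", "somatic(cond(cd8, oe(A)))", "rna", "mmu"],
    ["./wt1/filtered", "wt", "b2", "somatic(wt)", "rna", "mmu"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(COLUMNS)       # header row; the table has no row names
writer.writerows(samples)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Because the delimiter is a tab, the commas and spaces inside the group values are written unquoted.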

metadata Reference

You can provide information for the six required columns in the constructor and use define_column() to define new custom data columns. You can manipulate this table object directly via metadata.dataframe. This class is a simple wrapper around a Pandas DataFrame. Nevertheless, we recommend following certain naming conventions to make the built-in methods easier to understand.

class exprmat.metadata(locations, modality, default_taxa, names=None, batches=None, groups=None, df=None)

Metadata at sample level. This class is a wrapper around a sample-level metadata table in pandas format, and can be dumped to and loaded from plain text tables.

Parameters:
  • locations (list[str]) – File system paths to the folders or files that hold the data. How the data at each location is loaded is determined by modality and by what the location points to. For example, when modality is rna, the program searches for 10X-compatible mtx and tsv files if a folder is given, or reads h5ad/h5 files if a file is given directly.

  • modality (list[str]) – The modality of the data at each location. Supported values include:

    - rna (Single cell RNA-seq)
    - rna.splicing (Supplementary spliced/unspliced counts in loom or mtx)
    - rna.tcr (10X 5’ TCR annotations)
    - atac (Single cell ATAC-seq)

  • default_taxa (list[str]) – The reference taxa of the sample.

  • names (list[str] | None) – Sample names in the loaded experiment. If left as None, the folder names are used. Because folder names may repeat, uniqueness is not enforced in that case, which can lead to unexpected results, so setting this parameter explicitly is recommended.

  • batches (list[str] | None) – Batch information. If not set, this column is filled with the uniform value ‘.’.

  • groups (list[str] | None) – Experimental groupings. If not set, this column is filled with the uniform value ‘.’.

Notes

The inner metadata table follows several naming conventions. Auto-generated tables always have the columns location and sample for locations and sample names, batch and group for batches and experimental groupings, modality for the library modality, and taxa for the default taxa used where features carry no prefix. Other metadata columns are optional and can be appended by the user, but must not reuse any of these six column names.

You may load a TSV table as metadata using load_metadata().

define_column(name, default)

Create a new column in the metadata table with the given name and fill it with the default value. The values can then be refined with conditions of finer scope. We recommend encoding all conditions in carefully chosen sample names and using simple conditional filters on those names.

Note that the column name should not duplicate one of the pre-defined ones (location, modality, sample, batch, taxa and group) unless you are sure about what you are doing.
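The behavior described for define_column() can be pictured on a plain list of dicts. The sketch below is a stand-in written for illustration, not exprmat's implementation. Refusing the six pre-defined names outright is a choice made here to keep the example short; the text above only advises against reusing them.

```python
# The six pre-defined column names listed above.
RESERVED = {"location", "modality", "sample", "batch", "taxa", "group"}


def define_column(rows, name, default):
    """Add column `name` to every row, filled with `default`."""
    if name in RESERVED:
        # Stricter than the documented advice: this sketch simply refuses.
        raise ValueError(f"'{name}' is one of the six pre-defined columns")
    for row in rows:
        row[name] = default
    return rows


rows = [{"sample": "oe"}, {"sample": "wt"}]
define_column(rows, "treatment", "none")
print(rows)
```

Every row now carries the new column with the default value, ready to be refined by the conditional setters described below.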

save(fpath)

Write the metadata object into a disk table file.

set_fraction(key='sample', dest='group', sep='.', fraction=0, fallback='.')

Set a column to one fraction of the key column, obtained by splitting the key on a separator.

Parameters:
  • key (str) – The key column to split. This column must exist.

  • dest (str) – The column whose values are set from the selected fraction of the key column.

  • sep (str) – The separator used to split values in the key column. The piece at the sequential index given by fraction becomes the metadata value.

  • fallback (str) – The value used to fill the metadata when the key column is badly formatted.
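The split-and-pick idea behind set_fraction() can be sketched on a plain list of dicts. This is an illustrative stand-in, not exprmat's implementation. Two points are assumptions of the sketch: the fraction index is zero-based (suggested by the default of 0), and a key with too few pieces is what counts as badly formatted.

```python
def set_fraction(rows, key="sample", dest="group", sep=".", fraction=0, fallback="."):
    """Split row[key] on `sep` and store piece number `fraction` in row[dest]."""
    for row in rows:
        pieces = str(row[key]).split(sep)
        try:
            row[dest] = pieces[fraction]
        except IndexError:
            # Assumption: too few pieces is treated as a bad format.
            row[dest] = fallback
    return rows


rows = [{"sample": "oe.rep1"}, {"sample": "wt"}]
set_fraction(rows, key="sample", dest="replicate", sep=".", fraction=1, fallback="unknown")
print(rows)
```

The first sample name has a second piece, so it is copied into the new column; the second does not, so the fallback is used.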

set_if_contains(key='sample', dest='group', contains='.', value='.')

Alter the content of a column where the key column contains a given string.

Parameters:
  • key (str) – The key column to match the conditional pattern. This column must exist.

  • dest (str) – The column whose values are altered where the condition on the key column holds.

  • contains (str) – The string to search for in the values of the key column. Both the key values and this argument must be strings.

  • value (str) – If the condition is true, set the dest column to this value.

set_if_ends(key='sample', dest='group', ends='.', value='.')

Alter the content of a column where the key column ends with a given string.

Parameters:
  • key (str) – The key column to match the conditional pattern. This column must exist.

  • dest (str) – The column whose values are altered where the condition on the key column holds.

  • ends (str) – Test whether the values in the key column end with this string.

  • value (str) – If the condition is true, set the dest column to this value.

set_if_starts(key='sample', dest='group', starts='.', value='.')

Alter the content of a column where the key column starts with a given string.

Parameters:
  • key (str) – The key column to match the conditional pattern. This column must exist.

  • dest (str) – The column whose values are altered where the condition on the key column holds.

  • starts (str) – Test whether the values in the key column start with this string.

  • value (str) – If the condition is true, set the dest column to this value.
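The three conditional setters (set_if_contains, set_if_ends and set_if_starts) share one pattern: test the key column, and write a value into dest where the test holds. The sketch below shows that pattern on a plain list of dicts. The single `set_if` helper is hypothetical and is not an exprmat function; that rows failing the test keep their previous value is also an assumption of the sketch.

```python
def set_if(rows, test, key="sample", dest="group", value="."):
    """Write `value` into row[dest] for every row whose key passes `test`."""
    for row in rows:
        if test(str(row[key])):
            row[dest] = value
    return rows


rows = [
    {"sample": "oe_rep1", "group": ".", "replicate": "1"},
    {"sample": "wt_rep1", "group": ".", "replicate": "1"},
    {"sample": "wt_rep2", "group": ".", "replicate": "1"},
]

# Analogue of set_if_starts: sample names beginning with "oe".
set_if(rows, lambda s: s.startswith("oe"), value="overexpression")
# Analogue of set_if_contains: sample names containing "wt".
set_if(rows, lambda s: "wt" in s, value="wild-type")
# Analogue of set_if_ends: sample names ending with "rep2".
set_if(rows, lambda s: s.endswith("rep2"), dest="replicate", value="2")

print(rows)
```

This is the reason for the earlier recommendation to encode conditions in the sample names: a few prefix, suffix and substring tests are then enough to fill the other columns.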

set_paste(key1, key2, dest, sep=':')

Paste the stringified values of two key columns together with a separator and store the result in dest.
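As a rough picture of set_paste(), the sketch below joins two columns on a plain list of dicts. It is an illustrative stand-in rather than exprmat's implementation, and the order of the joined values (key1, then the separator, then key2) is an assumption.

```python
def set_paste(rows, key1, key2, dest, sep=":"):
    """Store str(row[key1]) + sep + str(row[key2]) in row[dest]."""
    for row in rows:
        row[dest] = f"{row[key1]}{sep}{row[key2]}"
    return rows


rows = [{"sample": "oe", "batch": "b1"}, {"sample": "wt", "batch": "b2"}]
set_paste(rows, "sample", "batch", "label")
print([row["label"] for row in rows])
```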

Reading Metadata from Disk

You can read a TSV metadata table using exprmat.load_metadata().

Caution

If you choose to create the metadata table manually, please ensure your table is Tab-separated. The TSV table must have column headers and no row names. The column order can be rearranged, but the default order is generally recommended. Note that some editors automatically convert tabs to spaces without proper configuration. Make sure your tabs are actual tab characters. For example, in Visual Studio Code, you need to configure the following setting:

[Image: _images/change-tab-1.png]

load_metadata Reference

Reads metadata from a saved TSV table. This function checks that the table contains the required columns and raises an error if any of them is missing.

exprmat.load_metadata(fpath)

Read the metadata table from disk.
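To check a hand-written table before handing it to exprmat, the required-column test can be reproduced with the standard library. The sketch below is not exprmat's loader. `read_metadata` is a hypothetical helper, and the exception type and message are choices of this example. It also shows what happens when an editor has replaced tabs with spaces.

```python
import csv
import io

REQUIRED = ["location", "sample", "batch", "group", "modality", "taxa"]


def read_metadata(handle):
    """Read a tab-separated table and verify the six required columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


good = (
    "location\tsample\tbatch\tgroup\tmodality\ttaxa\n"
    "./wt1/filtered\twt\tb2\tsomatic(wt)\trna\tmmu\n"
)
rows = read_metadata(io.StringIO(good))

# Tabs converted to spaces leave one long, unrecognised header field.
bad = good.replace("\t", "    ")
error_message = None
try:
    read_metadata(io.StringIO(bad))
except ValueError as err:
    error_message = str(err)

print(rows[0]["sample"], error_message)
```

With spaces instead of tabs, the whole header is read as a single column name, so every required column is reported as missing.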

Creating a Dataset from Metadata

Once the metadata is ready, you have a guide that tells the package how and where to read each sample’s data. You need to create a dataset based on this guide. The package will automatically process and load each sample according to the specified types, creating an in-memory dataset. See exprmat.experiment.