structure
gene tokenizers
Functions to get tokens from a set of genes, given their Ensembl IDs. For now, two different models are used:
- RNABert: for non-coding genes
- ESM3: for protein-coding genes
Given gene IDs and a FASTA file, the models are used to compute an embedding of each gene.
This could potentially be applied to genes with mutations and to genes from different species.
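The routing described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `parse_fasta`, `embed_genes`, the two placeholder embedders, and the demo IDs are all hypothetical, and the placeholders return dummy vectors where ESM3 or RNABert would actually be called.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {record id: sequence}."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


# Hypothetical stand-ins: a real pipeline would call ESM3 / RNABert here.
def protein_embedder(seq: str) -> List[float]:
    return [float(len(seq)), 1.0]


def rna_embedder(seq: str) -> List[float]:
    return [float(len(seq)), 0.0]


def embed_genes(ids, sequences, biotypes,
                protein_model=protein_embedder, rna_model=rna_embedder):
    """Embed each gene with the model matching its biotype."""
    embeddings = {}
    for gene_id in ids:
        if gene_id not in sequences:
            continue  # no sequence available for this id
        is_coding = biotypes.get(gene_id) == "protein_coding"
        model = protein_model if is_coding else rna_model
        embeddings[gene_id] = model(sequences[gene_id])
    return embeddings


# Made-up records, for illustration only.
fasta = ">ENSG_A demo\nMKT\nAY\n>ENSG_B demo\nACGU\n"
seqs = parse_fasta(fasta)
emb = embed_genes(
    ["ENSG_A", "ENSG_B"],
    seqs,
    {"ENSG_A": "protein_coding", "ENSG_B": "lncRNA"},
)
```

Keeping the two models behind a common callable interface means a third model (for another biotype or species) could be added without touching the routing loop.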
data_loaders
From scDataLoader (see more in the available READMEs and on the website: https://jkobject.com/scDataLoader).
For now, it works with either one or many AnnData objects, or a LaminDB Collection of AnnData objects.
It also lets you preprocess your AnnData objects.
They can be stored locally or remotely.
It stores them in a Dataset class, creates the DataLoaders from a DataModule
class, and collates the results using a Collator function.
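As an illustration of the collation step, here is a minimal sketch of a collator that turns a list of per-cell records into a fixed-size batch. The function name `collate_top_genes`, the record layout, and the "keep the most expressed genes" strategy are assumptions made for this example; refer to the scDataLoader documentation for the actual Collator.

```python
from typing import Dict, List


def collate_top_genes(cells: List[Dict[str, list]], max_len: int) -> Dict[str, list]:
    """Keep the `max_len` most expressed genes of each cell and stack them.

    Each cell is a dict with an "expr" list of counts, indexed by gene position.
    Cells with fewer than `max_len` genes are padded with index -1 and count 0.
    """
    batch_genes, batch_expr = [], []
    for cell in cells:
        expr = cell["expr"]
        # Sort gene positions by decreasing count (stable for ties).
        order = sorted(range(len(expr)), key=lambda i: -expr[i])[:max_len]
        pad = max_len - len(order)
        batch_genes.append(order + [-1] * pad)
        batch_expr.append([expr[i] for i in order] + [0] * pad)
    return {"genes": batch_genes, "x": batch_expr}


batch = collate_top_genes([{"expr": [0, 5, 2, 9]}, {"expr": [3, 1]}], max_len=3)
```

Returning gene positions alongside counts lets a downstream model look up the matching gene tokens for each retained value.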
model
Extends Lightning's LightningModule to implement all the functions needed for:
- training
- validation
- testing
- prediction (inference)
The model is subdivided into multiple parts:
- encoder
- transformer
- decoder
- the fsq module
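For readers unfamiliar with the last component: "fsq" presumably refers to finite scalar quantization, in which each latent dimension is bounded and then rounded to a small set of levels. The sketch below shows that idea in plain Python. `fsq_quantize` and its `levels` parameter are hypothetical names, and this is not the model's actual module, which would operate on tensors and typically use a straight-through gradient estimator.

```python
import math
from typing import List


def fsq_quantize(z: List[float], levels: int = 5) -> List[float]:
    """Quantize each value to one of `levels` evenly spaced points in [-1, 1]."""
    if levels < 3 or levels % 2 == 0:
        raise ValueError("this sketch only handles an odd number of levels >= 3")
    half = (levels - 1) / 2
    # tanh bounds each value to (-1, 1); rounding snaps it to the nearest level.
    return [round(math.tanh(v) * half) / half for v in z]


codes = fsq_quantize([0.0, 0.3, 10.0, -10.0], levels=5)
```

With `levels=5`, every dimension can only take the values -1, -0.5, 0, 0.5, or 1, so a latent vector with d dimensions maps to one of 5^d discrete codes.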
trainer & cli
The model uses Lightning's training toolkit and CLI tools.
To use the CLI, just call scprint2 ... (this calls the __main__.py entry point).
Additional training-specific information is passed to the model through
trainer.py. Specific training schemes are available as YAML files under the config
folder. Moreover, the model can be trained on multiple compute
types; SLURM scripts are available under the slurm folder.
tasks
Implements different tasks that a pretrained model can perform. For now:
- GRN prediction: given a single-cell dataset and a group (cell type, cluster, ...), will output a GRnnData completed with a predicted GRN from the attention of the model.
- denoising: from a single-cell dataset, will modify the count matrices to predict what they would have looked like had the cells been sequenced deeper, according to the model.
- embedding: from a single-cell dataset, will create embeddings (low-dimensional representations) of each cell, as well as predictions of the cell labels the model has been trained on (cell type, disease, ethnicity, sex...). It will also output a UMAP and the predicted expression from the ZINB, post bottleneck (similar to a VAE decoder prediction).
- generation: from a set of conditions (cell type, disease state, ethnicity, sex...), will generate new cell profiles matching these conditions.
- imputation: from a single-cell dataset with missing values (genes), will impute the missing genes according to the model.
- finetuning: from a pretrained model and a new single-cell dataset, will finetune the model to better fit the new dataset and potentially correct for batch effects if provided with batch labels.
- gene embedding: from a list of genes and an expression profile, will output an embedding for each gene.
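To make the GRN prediction task more concrete, the sketch below averages per-cell gene-by-gene attention matrices over a group of cells and keeps the strongest outgoing edges for each gene. It is an illustration under stated assumptions: `grn_from_attention`, its inputs, the row-as-source convention, and the top-k rule are hypothetical and do not reflect the package's actual API or aggregation scheme.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def grn_from_attention(
    attn_per_cell: List[Matrix], genes: List[str], top_k: int = 1
) -> Dict[str, List[Tuple[str, float]]]:
    """Average attention across cells and keep the top_k targets per gene."""
    n = len(genes)
    n_cells = len(attn_per_cell)
    # Element-wise mean of the attention matrices across cells.
    mean = [
        [sum(a[i][j] for a in attn_per_cell) / n_cells for j in range(n)]
        for i in range(n)
    ]
    grn = {}
    for i, source in enumerate(genes):
        # Rank candidate targets by mean attention, ignoring self-edges.
        ranked = sorted(
            ((genes[j], mean[i][j]) for j in range(n) if j != i),
            key=lambda edge: -edge[1],
        )
        grn[source] = ranked[:top_k]
    return grn


# Two toy cells from the same group, three genes (made-up values).
attn = [
    [[0.0, 0.8, 0.2], [0.5, 0.0, 0.5], [0.25, 0.75, 0.0]],
    [[0.0, 0.4, 0.6], [1.0, 0.0, 0.0], [0.25, 0.75, 0.0]],
]
grn = grn_from_attention(attn, ["G1", "G2", "G3"], top_k=1)
```

Averaging over a group rather than using a single cell reduces the noise of any one attention map, which is why the task takes a group (cell type, cluster, ...) as input.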