structure
gene embedders
Function to get embeddings from a set of genes, given their ensembl ids. For now use 2 different models:
RNABert
: for non coding genesESM2
: for protein coding genes
given ids, a fasta file, will use the models to compute an embedding of each gene.
This could be potentially applied to genes with mutations and from different species.
data_loaders
From scDataloader. (see more in the available readmes and website https://jkobject.com/scDataLoader)
For now can work with either one to many AnnData's or a laminDB Collection of AnnDatas
allows you to preprocess your anndatas too.
They can be stored locally or remotely
stores them in a Dataset
class.
Creates the DataLoaders from a Datamodule
Class.
Collates the results using a Collator
function.
model
Extends from lightning data module to implement all the necessary functions to do:
- training
- validation
- testing
- prediction (inference)
is subdivided into multiple parts:
- encoder
- transformer
- decoder
trainer & cli
the model uses lightning's training toolkit and CLI tools.
to use CLI, just call scprint ...
(will call the __main__.py
function).
Additional, training-specific informations are passed to the model using the trainer.py function.
specific training schemes are available under the config
folder as yaml
files.
Moreover the model can be trained on multiple compute types. SLURM scripts are available under the slurm
folder.
tasks
Implement different tasks that a pretrained model would perform. for now:
- GRN prediction: given a single cell dataset and a group (cell type, cluster, ...) will output a
GRnnData
completed with a predicted GRN from the attention of the model. - denoising: from a single cell dataset, will modify the count matrices to predict what it would have looked like if it had been sequenced deeper, according to the model.
- embedding: from a single cell dataset, will create embeddings (low dimensional representations) of each cells, as well as prediction of the cell labels the model has been trained on (cell type, disease, ethnicity, sex...). It will also output a umap and predicted expression from the zinb, post bottleneck (similar to a VAE decoder prediction)