scPRINT-2: 🏃🏃Your next-gen single cell foundation model


scPRINT-2 is a single-cell RNA-seq foundation model built by Jérémie Kalfon in the Cantini Lab. It uses novel architecture, encoding, decoding, training paradigms and losses.

scPRINT-2 has been pretrained on more than 350 million cells from more than 22,000 datasets and 16 species.

scPRINT-2 can be used to perform the following analyses in a zero-shot mode:

expression denoising & imputation: increase the resolution of your scRNAseq data and discover un-measured genes' expression
cell embedding and batch correction: generate a low-dimensional representation of your dataset at syntax level (organism, disease, cell type, sequencer, ...)
label prediction: predict the cell type, disease, sequencer, sex, age, tissue of origin and ethnicity of your cells.
gene network inference: generate a gene network from any cell or cell cluster in your scRNAseq dataset
cross species integration: scPRINT-2 has been trained on 16 species and can be used to integrate data from different species.

Examples of scPRINT-2 fine-tuning exist for:

new species: finetune scPRINT-2 on a new organism
classification: finetune scPRINT-2 on your own cell type / disease / age labels / more...
batch correction of your datasets / atlas: finetune scPRINT-2 to integrate data across species, technologies, and labs.

scPRINT-2 is a foundation model and can be fine-tuned to perform many other analyses.

Read the manuscript if you would like to know more about scPRINT-2, or have a look at some of my X-plainers.

🎊 Test scPRINT and scDataloader on this simple example Google Colab

Use `scPRINT-2`

For the moment scPRINT-2 has been tested on macOS and Linux (Ubuntu 20.04) with Python 3.10+. Its installation takes on average 2 minutes with uv but much longer with conda. We highly recommend using uv to manage your Python virtual environments!

Here is a link to our --still maintained-- previous generation model which contains larger size models: scPRINT-1 (don't forget to star it as well!):

try scPRINT-1 in superbio.ai!

HERE

Try scPRINT-1 on a Google Colab notebook!

To know about: lamin.ai

To use scPRINT-2, you will need to use lamin.ai. This is required to load biological information like genes, cell types, organisms, etc. (but also to manage the pre-training datasets if this is something you want to set up).

install

Here is how to install uv

uv venv <env-name> --python 3.11
source <env-name>/bin/activate
# then one of:
uv pip install scprint2
# OR, for the dev dependencies (building, etc.):
#   uv pip install "scprint2[dev]"
# OR, to use multiple flash attention functionalities written only in triton:
#   uv pip install "scprint2[flash]"
#   (only if you have a compatible GPU, e.g. not available for Apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility; it is not useful if you just run inference using the main scPRINT / scPRINT-2 models)
# OR both:
#   pip install "scprint2[dev,flash]"

lamin init --storage ./testdb --name testdb --modules bionty
scprint2 easy_setup

⚠️ ./testdb is set in this example, but be mindful about where you want to store your data: it might get quite big as you use it, which matters if you are working on a specific disk partition.

If you are new to lamin and have to run lamin init, you will also need to run scprint2 easy_setup. This populates your ontologies and adds some additional gene names, because scPRINT-2 uses ontologies to define its cell types, diseases, sexes, ethnicities, etc. (link to view ontologies)

You can do it via the command:

scdataloader populate all

⚠️ It is ok to get warnings with this function

or with this function:

from scdataloader.utils import populate_my_ontology, _adding_scbasecamp_genes

# either populate everything (can take 2-10 min)
populate_my_ontology()

# or populate only the minimum for scPRINT-1 to run some inferences (denoising, GRN inference)
populate_my_ontology(
    organisms=["NCBITaxon:10090", "NCBITaxon:9606"],
    sex=["PATO:0000384", "PATO:0000383"],
    celltypes=None,
    ethnicities=None,
    assays=None,
    tissues=None,
    diseases=None,
    dev_stages=None,
)
_adding_scbasecamp_genes()  # to add when using scPRINT-2

It will also download the default checkpoint of a pretrained scPRINT-2 model from our Hugging Face page. But you can use other ones if you prefer:

$ hf download jkobject/scPRINT v2-medium.ckpt --local-dir .

A notebook for setting-up scPRINT-2 and lamin is also available here

We make use of some additional packages we developed alongside scPRINT-2 (they are also shipped with scprint-2 already).

Please refer to their documentation for more information:

scDataLoader: a dataloader for training large cell models.
GRnnData: a package to work with gene networks from single cell data.
benGRN: a package to benchmark gene network inference methods from single cell data.
simpler-flash: a package to easily use different versions of flash attention in pytorch models.
hierarchical-classifier: a package to do hierarchical classification with pytorch when your labels can be mapped to a graph.

pytorch and GPUs

scPRINT-2 can run on machines without GPUs, but it will be slow. It is highly recommended to use a GPU for inference.
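
A quick way to see what hardware PyTorch can use before launching inference is sketched below. This is a minimal sketch, not part of scPRINT-2: the helper name `pick_device` is hypothetical, and it only reports what PyTorch detects. Whether every scPRINT-2 feature works on each backend is not covered here.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Apple-silicon backend; getattr guards older PyTorch builds without it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"PyTorch device available: {device}")
```

If this prints `cpu`, expect inference to be slow on anything but small datasets.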

Most of the time, everything works out of the box; otherwise, please open an issue

from scprint2 import scPRINT2

model = scPRINT2.load_from_checkpoint(
    '../data/temp/last.ckpt', precpt_gene_emb=None, gene_pos_file=None,)

You will know more by following the get-started notebook.

Usage

To get a sense of how scPRINT-2 works, have a look at our get-started notebook.

scPRINT-2's basic commands

This is a template of how you would use scPRINT-2 most of the time:

# import stuff
import scanpy as sc
from lightning.pytorch import Trainer
from scprint2 import scPRINT2
# task classes (Denoiser, Embedder, GNInfer); the import path is assumed to be scprint2.tasks
from scprint2.tasks import Denoiser
from scdataloader import DataModule

# set up a datamodule to train scprint2 from scratch
datamodule = DataModule(...)
# set up the model parameters
model = scPRINT2(...)
# to train / fit / test the model, set up a trainer
trainer = Trainer(...)
# call the fit function
trainer.fit(model, datamodule=datamodule)
# to do predictions, use one of Denoiser, Embedder, GNInfer
denoiser = Denoiser(...)
adata = sc.read_h5ad(...)
denoiser(model, adata=adata)
...

scPRINT-2's basic command line

Then fine-tune or analyse on your data

$ scprint2 fit/train/predict/test/denoise/embed/gninfer/impute/gene_emb/generate/finetune --config config/[medium|large|vlarge] ...

To denoise a dataset:

$ scprint2 denoise --adata my_human_anndata.h5ad --ckpt_path v2-medium.ckpt --species "NCBITaxon:9606" --output_filename denoised.h5ad

To do embedding and classification on a dataset (the current version also runs a PCA and UMAP, so it might need a lot of RAM if run as is):

$ scprint2 embed --adata my_human_anndata.h5ad --ckpt_path v2-medium.ckpt --species "NCBITaxon:9606" --output_filename embedded.h5ad

To do gene network inference on a dataset:

$ scprint2 gninfer --adata my_human_anndata.h5ad --ckpt_path v2-medium.ckpt --species "NCBITaxon:9606" --cell_type 'cell_type_name_from-cell_type-obs_col' --output_filename grn.h5ad

To re-train scPRINT-2 from scratch or from a checkpoint:

$ scprint2 fit --config config/base_v2.yml --config config/pretrain_large.yml --ckpt_path large.ckpt

Find out more about the commands by running scprint2 --help or scprint2 [command] --help.

More examples of using the command line are available in the docs.
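
If you want to launch these commands from a Python script (for example to loop over several datasets), the command line can be assembled as a list of arguments. This is only a sketch: the helper `build_scprint2_command` is hypothetical, and it uses only the flags shown in the examples above (`--adata`, `--ckpt_path`, `--species`, `--output_filename`).

```python
import shlex


def build_scprint2_command(task, adata, ckpt_path, species, output_filename, extra=None):
    """Assemble the argv list for a `scprint2 <task>` call.

    `extra` is an optional dict of additional flags, e.g. {"cell_type": "T cell"}.
    """
    argv = [
        "scprint2", task,
        "--adata", adata,
        "--ckpt_path", ckpt_path,
        "--species", species,
        "--output_filename", output_filename,
    ]
    for flag, value in (extra or {}).items():
        argv += [f"--{flag}", str(value)]
    return argv


argv = build_scprint2_command(
    "denoise",
    adata="my_human_anndata.h5ad",
    ckpt_path="v2-medium.ckpt",
    species="NCBITaxon:9606",
    output_filename="denoised.h5ad",
)
# Print the command as it would be typed in a shell.
print(shlex.join(argv))
# To actually run it: import subprocess; subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with values such as cell type names that contain spaces.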

Example notebooks

get-started: how to set things up
run scPRINT-2 on a new species: how to fine-tune scPRINT-2 on a new organism. You will also need to generate embeddings and gene locations for your organism; see the FAQ below.
do gene-network inference with scPRINT-2: how to use scPRINT-2 to infer gene regulatory networks from your scRNAseq data (the first part is about getting ground truth data with benGRN)
generate cell embeddings and cell label predictions from my data: how to use scPRINT-2 to generate cell embeddings and predict cell type
generate gene output embeddings from my gene expression data: how to use scPRINT-2 to generate gene embeddings from your scRNAseq data
do counterfactual gene expression prediction with scPRINT-2: how to use scPRINT-2 to impute gene expression under different conditions
do denoising with scPRINT-2: how to use scPRINT-2 to denoise your scRNAseq data
do imputation with scPRINT-2 (e.g. on Xenium Panel data): how to use scPRINT-2 to impute missing genes in your scRNAseq data
run scPRINT-2 on some Xenium spatial transcriptomics data: how to use scPRINT-2 to analyse spatial transcriptomics data
fine-tune scPRINT-2 for cell type classification and/or batch correction: how to fine-tune scPRINT-2 on your own cell type labels

Documentation

For more information on usage, please see the documentation in https://www.jkobject.com/scPRINT-2/

Docker

By using the scPRINT-2 Docker image, you can bypass the complexities of manual package installation, ensuring a consistent deployment environment. Included in this repository is a Dockerfile that lets you craft a container for the project; you have the choice to either build this image on your own or conveniently pull it from Docker Hub.

Make sure that you have the docker command line interface installed on your system.

A recommended way to install Docker with the correct NVIDIA drivers on Linux is to use this script

/!\ A more up-to-date Docker image is made as part of the open-problems benchmark and is available on their GitHub for all tasks where scPRINT-2 is benchmarked.

Simple tests:

An installation of scPRINT-2 and a simple test of the denoiser are performed on each commit to the main branch with a GitHub Actions and pytest workflow. It also provides an expected runtime for the installation and run of scPRINT-2. We now explore the different usages of scPRINT-2:

FAQ

I have a dataset and want a quick analysis:

-> use superbio

I have a dataset and want some more control over what is going on and which model to use:

You will need to understand a few things, like lamindb, scdataloader, and scprint-2's inference tool.

-> start with a quick intro using the Google Colab notebook

-> look at the other FAQ entries based on your desired use case

What does my anndata need to contain to be run with scPRINT-2?

-> your anndata only needs to contain the species ontology id in its obs['organism_ontology_term_id'] (e.g. "NCBITaxon:9606"). It also needs to contain .var_names or .var.index with gene ids defined as ENSEMBL_IDs or HUGO_SYMBOL.

-> That's it. You can then follow the preprocessing steps from various example notebooks to align your anndata to our gene set, make sure that it fits our requirements and then send it to the model!
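
The two requirements above can be checked before running anything. The sketch below is not part of scPRINT-2: the helper `check_scprint2_inputs` and the Ensembl ID pattern are assumptions made for illustration, and the check is deliberately loose (it does not verify that gene symbols are valid).

```python
import re

ORGANISM_COLUMN = "organism_ontology_term_id"
# Assumption: Ensembl gene IDs look like ENSG00000141510 or ENSMUSG00000059552.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def check_scprint2_inputs(obs, var_names):
    """Return a list of problems; an empty list means the basic requirements are met.

    obs: mapping of column name -> values (a dict, or adata.obs)
    var_names: iterable of gene identifiers (e.g. adata.var_names)
    """
    problems = []
    if ORGANISM_COLUMN not in obs:
        problems.append(f"obs is missing the '{ORGANISM_COLUMN}' column")
    else:
        bad = sorted({str(v) for v in obs[ORGANISM_COLUMN]
                      if not str(v).startswith("NCBITaxon:")})
        if bad:
            problems.append(f"unexpected organism ids: {bad}")
    names = [str(n) for n in var_names]
    if not names:
        problems.append("no gene names found")
    n_ensembl = sum(bool(ENSEMBL_GENE_ID.match(n)) for n in names)
    if 0 < n_ensembl < len(names):
        problems.append("var names mix Ensembl ids with other identifiers")
    return problems


# Toy example with a plain dict standing in for adata.obs.
obs = {ORGANISM_COLUMN: ["NCBITaxon:9606", "NCBITaxon:9606"]}
print(check_scprint2_inputs(obs, ["ENSG00000141510", "ENSG00000012048"]))
```

With a real dataset, the call would be `check_scprint2_inputs(adata.obs, adata.var_names)`.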

I want to generate an atlas-level embedding

-> Refer to the notebook nice_umap_explain.ipynb.

I need to generate gene tokens using pLLMs

To run scPRINT-2, you can use the option to define the gene tokens using protein language model embeddings of genes. This is done by providing the path to a parquet file of the precomputed set of embeddings for each gene name to scPRINT-2 via "precpt_gene_emb"

-> To generate this file please refer to the notebook generate_gene_embeddings.

I want to re-train scPRINT-2 from scratch on my own data

-> Refer to the documentation page pretrain scprint-2

I want to regenerate the scPRINT-2 training corpus

-> Have a look at the scDataLoader's README to understand how to do this.

I want to fine-tune scPRINT-2 on my own data

-> make sure that you have done a run of scPRINT-2's inference, e.g. this one

-> then please refine your question: do you want fine-tuning to predict labels, to do batch correction, or to make scPRINT-2 work on your species? Have a look at the usage section and the rest of the FAQ to find the relevant information.

How can I find out if scPRINT-2 was trained on my data?

If your data is available in cellxgene or is listed in Arc's scBaseCount, scPRINT-2 was likely trained on it. However, some cells and datasets were dropped due to low-quality data, and some were randomly removed to be part of the validation/test sets.

Can I use scPRINT-2 on organisms other than humans?

scPRINT-2 has been pretrained on 16 organisms. First check in model.organisms or in our manuscript whether yours is one of them, or is closely related to one of them. If so, use that organism and make sure that the gene names can be easily mapped by scdataloader's preprocess function.

If not, scPRINT-2 can still be used on organisms that are not part of its training set; for this, have a look at this notebook. You will also need to compute gene embeddings and gene locations for your organism's genetic data. Have a look at both notebooks.

If you want to use scPRINT-2 on organisms that are very different from the ones it was trained on, you might then need to apply some fine-tuning; have a look at the fine-tuning notebook too.

How long does scPRINT-2 take? What kind of resources do I need? (or alternatively: can I run scPRINT-2 locally?)

Please look at Table 1 and Supplementary Tables 1-2 of our manuscript to know more about computational resources. But know that you will likely need at least one high-performance GPU.

I have different scRNASeq batches. Should I integrate my data before running scPRINT-2?

scPRINT-2 takes raw counts as input, so please don't use integrated data. Just give the raw counts to scPRINT-2 and it will take care of the rest. For better results, you can fine-tune scPRINT-2 on your batches to better integrate them. See the fine-tuning notebook. You can replace the cross-species MMD loss with a cross-batch MMD loss.
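
If you are unsure whether a matrix still holds raw counts, a simple heuristic is to check that its values are non-negative integers. The sketch below is an illustration only: `looks_like_raw_counts` is a hypothetical helper, not a scPRINT-2 function, and passing the check does not prove the data was never transformed.

```python
def looks_like_raw_counts(matrix, max_checked=10_000):
    """Heuristic: True if the first `max_checked` values are non-negative integers.

    `matrix` is any iterable of rows (nested lists, or a dense NumPy array).
    """
    checked = 0
    for row in matrix:
        for value in row:
            # Negative, fractional, NaN, or infinite values rule out raw counts.
            if value < 0 or not float(value).is_integer():
                return False
            checked += 1
            if checked >= max_checked:
                return True
    # An empty matrix gives no evidence either way; report False.
    return checked > 0


raw = [[0, 3, 1], [2, 0, 0]]
log_normalized = [[0.0, 1.38, 0.69], [1.09, 0.0, 0.0]]
print(looks_like_raw_counts(raw), looks_like_raw_counts(log_normalized))
```

For a sparse `adata.X` in CSR or CSC format, one option is to pass `[adata.X.data]` so that only the stored values are scanned.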

I have new labels for my data that scPRINT-2 doesn't predict, how can I fine-tune it to predict them?

First have a look at scPRINT-2's inference capabilities and check out the fine-tuning notebooks.

In your case, you will need to reuse the fine-tuning notebook but also update the output layers of the classifier to predict your new labels. You can do this by changing the number of output classes in the classifier head to match the number of new labels you have. You will also need to update scPRINT-2's mat_labels_hierarchy attribute to include your new labels and their relationships if they are hierarchical and you want this to be taken into account; otherwise update it with an empty vector.

Make sure to also update the label_decoders attribute in the model so that it includes, at the right index, your new label decoder / classifier and the names of your new labels.

Then you can proceed with the fine-tuning as usual, using your dataset with the new labels in the obs of your anndata. (I am sure ChatGPT can help you with it too.)
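
The bookkeeping behind these steps (how many output classes, and which index maps to which label name) can be sketched in plain Python. Everything here is hypothetical: `build_label_encoding` is not a scPRINT-2 function, and the exact structure that `label_decoders` and `mat_labels_hierarchy` expect should be taken from the fine-tuning notebook, not from this example.

```python
def build_label_encoding(labels):
    """Map each distinct label to a stable integer index, and back.

    Returns (encoder, decoder): name -> index and index -> name.
    """
    classes = sorted(set(labels))
    encoder = {name: index for index, name in enumerate(classes)}
    decoder = {index: name for name, index in encoder.items()}
    return encoder, decoder


# Toy labels standing in for a new column in adata.obs.
obs_labels = ["responder", "non_responder", "responder", "unknown"]
encoder, decoder = build_label_encoding(obs_labels)

# Number of output classes the new classifier head would need.
n_classes = len(encoder)
# Integer targets to train against.
encoded = [encoder[label] for label in obs_labels]
print(n_classes, encoded, decoder)
```

Sorting the class names keeps the index assignment reproducible between runs, which matters when you reload a fine-tuned checkpoint and decode its predictions.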

Where can I find the input gene embeddings?

If you think you need the gene embeddings file for loading the model from a checkpoint, you don't need to recompute them, as the embeddings are also stored in the model weights. You just need to load the weights like this:

from scdataloader.utils import load_genes
from scprint2 import scPRINT2

model = scPRINT2.load_from_checkpoint(
    '../../data/temp/last.ckpt',
    precpt_gene_emb=None,
    gene_pos_file=None,
)

# remove gene embeddings that scPRINT-2 was trained with but that are no longer found in the lamin ontology
missing = set(model.genes) - set(load_genes(model.organisms).index)
if len(missing) > 0:
    print(
        "Warning: some gene mismatches exist between model and ontology: solving...",
    )
    model._rm_genes(missing)

But if you want to, you can also recreate the gene embedding file through this notebook. Just call the functions, and it should recreate the file itself.

The file itself is also available on Hugging Face

/!\ Please understand that what I mean by gene embedding is the immutable input gene embeddings encoding the gene name. scPRINT-2 directly takes raw counts as input and takes care of doing the embedding on the fly. (it does similarly for a gene's location in the genome).

Development

dev install

If you want to use the latest version of scPRINT-2 and work on the code yourself, use git clone and pip install -e instead of pip install.

git clone https://github.com/cantinilab/scPRINT-2
git clone https://github.com/jkobject/scDataLoader
git clone https://github.com/cantinilab/GRnnData
git clone https://github.com/jkobject/benGRN
pip install -e "./scPRINT-2[dev]"
pip install -e "./scDataLoader[dev]"
pip install -e "./GRnnData[dev]"
pip install -e "./benGRN[dev]"

Reproducibility

To reproduce the paper please use the version / tag 1.6.4 and you will have to git clone the repo to have access to all the pre-training functionalities!

⚠️ When re-training scPRINT-2 from scratch, by default, every N epochs, the test() function will be called. It uses paths to pre-downloaded test datasets (see https://github.com/cantinilab/scPRINT-2/issues/12). Replace them with your own paths if you want to use these test functions. The datasets are also made available on hf.co: https://huggingface.co/jkobject/scPRINT-2/tree/main

Building the Docker Image

To build the Docker image from the provided Dockerfile, run the following command from the root directory of this repository:

docker build -t scprint2:latest -f Dockerfile .

Pulling the Docker Image from Docker Hub

If you don't want to build the image yourself, you can pull it directly from Docker Hub:

docker pull jkobject/scprint2:1.0.0
docker tag jkobject/scprint2:1.0.0 scprint2:latest

Running the Docker Container

Once you have the image (either by building it or pulling it), you can start a container with:

docker run --gpus all --rm -it scprint2:latest bash

Please note: When running the Docker container, ensure you mount any necessary folders using the -v option to access them inside the container.
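
As an illustration of the -v option, the sketch below assembles the same `docker run` command with one or more mounted folders. The helper `docker_run_command` and the example paths are hypothetical; only the Docker flags themselves (`--gpus`, `--rm`, `-it`, `-v`) are standard.

```python
import shlex
from pathlib import Path


def docker_run_command(mounts, image="scprint2:latest"):
    """Build a `docker run` argv that bind-mounts host folders into the container.

    mounts: dict of host directory -> path inside the container.
    """
    argv = ["docker", "run", "--gpus", "all", "--rm", "-it"]
    for host_dir, container_dir in mounts.items():
        # Bind mounts need an absolute host path.
        argv += ["-v", f"{Path(host_dir).absolute()}:{container_dir}"]
    argv += [image, "bash"]
    return argv


# Example: expose a local ./data folder as /data inside the container.
argv = docker_run_command({"./data": "/data"})
print(shlex.join(argv))
```

Files written under /data inside the container then persist in ./data on the host after the container exits.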

Participate

Read the CONTRIBUTING.md file.

Read the training runs document to know more about how pre-training was performed and its behavior.

Code coverage is not right as I am using the command line interface for now. >50% of the code is covered by my current unit tests.

Acknowledgement: python template laminDB lightning

Created by Jérémie Kalfon