scPRINT use case on BPH¶
In this use-case, also presented in Figure 5 of our manuscript, we perform an extensive analysis of a multi studies dataset of benign prostatic hyperplasia.
Our biological question is to check if there exist pre-cancerous cells that exhibits behaviors of mature cancer cells at this early stage of the disease.
In those cells, we want to know which genes might be implicated in cell state changes, and explore potentially novel targets in the treatment of prostate cancer and BPH.
We will start with a fresh datasets coming from the cellXgene database and representing 2 studies of BPH.
We will first explore these dataset to understand:
- what are the cell types that are present in the data
- what are the cell distributions (cell distributions? what are they?)
- what sequencers were used, etc.
We also want to confirm existing target in prostate cancer through precancerous lesion analysis, and find potentially novel ones that would serve as less invasive BPH treatments than current ones.
Finally we want to know how these targets interacts and are involved in biological pathways.
We now showcase how to use scPRINT across its different functionalities to answer some of these questions.
table of contents:¶
- Downloading and preprocessing
- Embedding and annotations
- Annotation cleanup
- Clustering and differential expression
- Denoising and differential expression
- Gene network inference
In the notebook cancer_usecase_part2.ipynb you will see how to analyse cell type specific gene regulatory networks.
from scprint import scPrint
from scdataloader import Preprocessor, utils
from scprint.tasks import GNInfer, Embedder, Denoiser, withknn
from bengrn import BenGRN, get_sroy_gt, compute_genie3
from bengrn.base import train_classifier
from grnndata import utils as grnutils
from grnndata import read_h5ad
from anndata.utils import make_index_unique
from anndata import concat
import scanpy as sc
from matplotlib import pyplot as plt
import numpy as np
import lamindb as ln
%load_ext autoreload
%autoreload 2
import torch
torch.set_float32_matmul_precision('medium')
→ connected lamindb: jkobject/scprint
Downloading and preprocessing¶
We now use lamindb to easily access cellxgene and download a dataset of normal and benign prostatic hyperplasia tissues.
data is available here https://cellxgene.cziscience.com/e/574e9f9e-f8b4-41ef-bf19-89a9964fd9c7.cxg/ .
We then use scDataloader's preprocessing method. This method is quite extensive and does a few things.. find our more about it on its documentation.
On our end we are using the preprocessor to make sure that the the gene expression that we have are raw counts and that we have enough information to use scPRINT (i.e., enough genes expressed and enough counts per cells across the dataset).
Finally, the preprocessor will also increase the size of the expression matrix to be a fixed set of genes defined by the latest version of ensemble.
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
prostate_adata = cx_dataset.artifacts.filter(key='cell-census/2023-12-15/h5ads/574e9f9e-f8b4-41ef-bf19-89a9964fd9c7.h5ad').one().load()
sc.pl.umap(prostate_adata)
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/asyncio/sslproto.py:320: ResourceWarning: unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x7fd088b36f80> _warn(f"unclosed transport {self!r}", ResourceWarning, source=self) ResourceWarning: Enable tracemalloc to get the object allocation traceback
# preprocessing using scDataloader
prostate_adata.obs.drop(columns="is_primary_data", inplace=True)
preprocessor = Preprocessor(do_postp=False)
prostate_adata = preprocessor(prostate_adata)
Dropping layers: KeysView(Layers with keys: ) checking raw counts removed 0 non primary cells, 83451 renamining filtered out 0 cells, 83451 renamining Removed 76 genes. validating
/home/ml4ig1/Documents code/scPRINT/scDataLoader/scdataloader/preprocess.py:241: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]` data_utils.validate(adata, organism=adata.obs.organism_ontology_term_id[0])
startin QC Seeing 13047 outliers (15.63% of total dataset): done
Embedding and annotations¶
We now start to load a large version of scPRINT from a specific checkpoint. Please download the checkpoints following the instructions in the README.
We will then use out Embedder class to embed the data and annotate the cells. These classes are how we parametrize and access the different functions of scPRINT
. Find out more about its parameters in our documentation.
model = scPrint.load_from_checkpoint('../data/temp/o2uniqsx/epoch=18-step=133000.ckpt',
# make sure that you check if you have a GPU with flashattention or not (see README)
# you need this else, the model will look for the file with the embeddings instead of getting them from the weights
precpt_gene_emb = None)
RuntimeError caught: scPrint is not attached to a `Trainer`.
embedder = Embedder(
# can work on random genes or most variables etc..
how="random expr",
# number of genes to use
max_len=4000,
add_zero_genes=0,
# for the dataloading
num_workers=8,
# we will only use the cell type embedding here.
pred_embedding = ["cell_type_ontology_term_id"]
)#, "disease_ontology_term_id"])
Using 16bit Automatic Mixed Precision (AMP) GPU available: True (cuda), used: True TPU available: False, using: 0 TPU cores IPU available: False, using: 0 IPUs HPU available: False, using: 0 HPUs
# create the embedding
prostate_adata, metrics = embedder(model, prostate_adata, cache=False, output_expression="none")
100%|██████████| 1560/1560 [28:20<00:00, 1.09s/it]
AnnData object with n_obs × n_vars = 99826 × 424 obs: 'pred_cell_type_ontology_term_id', 'pred_disease_ontology_term_id', 'pred_assay_ontology_term_id', 'pred_self_reported_ethnicity_ontology_term_id', 'pred_sex_ontology_term_id', 'pred_organism_ontology_term_id', 'conv_pred_cell_type_ontology_term_id', 'conv_pred_disease_ontology_term_id', 'conv_pred_assay_ontology_term_id', 'conv_pred_self_reported_ethnicity_ontology_term_id'
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
couldn't log to tensorboard couldn't log to wandb couldn't log to wandb cell_type_ontology_term_id accuracy: 0.7250215374752068 disease_ontology_term_id accuracy: 0.8394506441207702 assay_ontology_term_id accuracy: 0.998988239536794 self_reported_ethnicity_ontology_term_id accuracy: 0.9504037024422495 sex_ontology_term_id accuracy: 0.9948911105323263 organism_ontology_term_id accuracy: 1.0
prostate_adata
AnnData object with n_obs × n_vars = 99826 × 70116 obs: 'assay_ontology_term_id', 'donor_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'organism_ontology_term_id', 'suspension_type', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'nnz', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'init_dataset', 'pred_cell_type_ontology_term_id', 'pred_disease_ontology_term_id', 'pred_assay_ontology_term_id', 'pred_self_reported_ethnicity_ontology_term_id', 'pred_sex_ontology_term_id', 'pred_organism_ontology_term_id', 'conv_pred_cell_type_ontology_term_id', 'conv_pred_disease_ontology_term_id', 'conv_pred_assay_ontology_term_id', 'conv_pred_self_reported_ethnicity_ontology_term_id', 'sprint_leiden' uns: 'init_dataset_colors' obsm: 'X_pca', 'X_umap', 'scprint_umap', 'scprint'
Annotation cleanup¶
scPRINT generates predictions over hundreds of possible labels for each cell.
It is often advised to "cleanup" the predictions, e.g. making sure to remove low frequency cells and misslabellings.
Here, we use the most straightforward approach which is to remove any annotations that appear a small number of times.
A better approach would be doing majority voting over cell clusters as it would aggregate and smoothout the predictions over multiple cells. it would also remove most of the low frequency mistakes in the predictions.
We will also have a look at the embeddings of scPRINT
by plotting its UMAP visualization.
#cleaning up the cell types
prostate_adata.obs['cleaned_pred_cell_type_ontology_term_id'] = prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].astype(str)
prostate_adata.obs.loc[~prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].isin([k for k, v in prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].value_counts().items() if v > 400]), 'cleaned_pred_cell_type_ontology_term_id'] = "other"
prostate_adata.obs['cleaned_pred_cell_type_ontology_term_id'].value_counts().plot.pie()
<Axes: ylabel='count'>
# cleaning up the diseases
prostate_adata.obs['cleaned_pred_disease_ontology_term_id'] = prostate_adata.obs['conv_pred_disease_ontology_term_id'].astype(str)
prostate_adata.obs.loc[~prostate_adata.obs['conv_pred_disease_ontology_term_id'].isin([k for k, v in prostate_adata.obs['conv_pred_disease_ontology_term_id'].value_counts().items() if v > 1000]), 'cleaned_pred_disease_ontology_term_id'] = "other"
prostate_adata.obs['cleaned_pred_disease_ontology_term_id'].value_counts().plot.pie()
<Axes: ylabel='count'>
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['cleaned_pred_cell_type_ontology_term_id'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['cleaned_pred_disease_ontology_term_id'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['assay'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['disease'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['development_stage'])
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id.value_counts().head(20)
cleaned_pred_cell_type_ontology_term_id basal cell of epidermis 22449 mammary alveolar cell 9688 vasa recta descending limb cell 8777 CD1c-positive myeloid dendritic cell 5782 effector memory CD8-positive, alpha-beta T cell 5410 smooth muscle cell of the pulmonary artery 5007 other 4972 pancreatic acinar cell 4639 basal epithelial cell of prostatic duct 3405 serous cell of epithelium of trachea 3327 prostate gland microvascular endothelial cell 2747 paneth cell of epithelium of small intestine 2698 mast cell 2365 IgG-negative class switched memory B cell 2348 fibroblast of connective tissue of nonglandular part of prostate 2058 club cell 1794 pulmonary artery endothelial cell 1239 CD141-positive myeloid dendritic cell 1186 mucous neck cell 1039 smooth muscle cell of prostate 857 peptic cell 766 T-helper 17 cell 735 CD14-positive, CD16-negative classical monocyte 671 CD16-negative, CD56-bright natural killer cell, human 666 lung pericyte 663 alveolar type 2 fibroblast cell 646 type II pneumocyte 618 naive B cell 606 effector CD4-positive, alpha-beta T cell 506 fibroblast of connective tissue of glandular part of prostate 464 luminal epithelial cell of mammary gland 437 inflammatory macrophage 436 fibroblast of mammary gland 416 basal cell of epithelium of bronchus 409 Name: count, dtype: int64
# we save for next time
prostate_adata.write_h5ad("../../data/prostate_adata_o2uniqsx.h5ad")
prostate_adata = sc.read_h5ad("../../data/prostate_adata_o2uniqsx.h5ad")
Clustering and differential expression¶
We will now cluster using the louvain algorithm on a kNN graph.
Once we detect a cluster of interest we will perform differential expression analysis on it. Taking as example some B-cell clusters, we will use scanpy's implementation of rank_gene_groups for our differential expression
# do louvain mutliple times
sc.pp.neighbors(prostate_adata, n_neighbors=10, use_rep="scprint")
# using multiple resolutions can help spotting smaller clusters
sc.tl.louvain(prostate_adata, resolution=0.5, key_added="louvain_0.5")
sc.tl.louvain(prostate_adata, resolution=1.0, key_added="louvain_1.0")
# check clusters
sc.pl.embedding(prostate_adata, basis="scprint_umap", color='louvain_1.0', show=False, legend_loc="on data")
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
<Axes: title={'center': 'louvain_1.0'}, xlabel='scprint_umap1', ylabel='scprint_umap2'>
# check cluster 9
i=9
loc = prostate_adata.obs['louvain_1.0']==str(i)
prostate_adata.obs[loc].conv_pred_disease_ontology_term_id.value_counts().head(2), prostate_adata.obs[loc].conv_pred_cell_type_ontology_term_id.value_counts().head(5)
(conv_pred_disease_ontology_term_id benign prostatic hyperplasia 3554 normal 18 Name: count, dtype: int64, conv_pred_cell_type_ontology_term_id urethra urothelial cell 1703 luminal cell of prostate epithelium 703 mucous neck cell 566 basal epithelial cell of prostatic duct 372 club cell 174 Name: count, dtype: int64)
# check cluster 11
i=10
loc = prostate_adata.obs['louvain_1.0']==str(i)
prostate_adata.obs[loc].cleaned_pred_disease_ontology_term_id.value_counts().head(2), prostate_adata.obs[loc].cleaned_pred_cell_type_ontology_term_id.value_counts().head(5)
(cleaned_pred_disease_ontology_term_id benign prostatic hyperplasia 2900 normal 68 Name: count, dtype: int64, cleaned_pred_cell_type_ontology_term_id IgG-negative class switched memory B cell 2879 other 45 CD4-positive, alpha-beta thymocyte 16 effector CD8-positive, alpha-beta T cell 12 CD1c-positive myeloid dendritic cell 8 Name: count, dtype: int64)
# We have find a nice IgG-negative class switched memory B cell cluster. let's use it and define a clean annotation for a plot
loc = loc&(prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id=="IgG-negative class switched memory B cell")
prostate_adata.obs[loc].cleaned_pred_disease_ontology_term_id.value_counts().head(2)
cleaned_pred_disease_ontology_term_id benign prostatic hyperplasia 2810 normal 60 Name: count, dtype: int64
# making a "focus" annotation for the B-cell to generate a nice plot of the B-cell cluster only
prostate_adata.obs['focus'] = "other"
prostate_adata.obs.loc[loc, 'focus'] = "memory B cell"
prostate_adata.obs.loc[loc & (prostate_adata.obs['cleaned_pred_disease_ontology_term_id']=='benign prostatic hyperplasia'), 'focus'] = "BPH associated memory B cell"
prostate_adata.obs['focus'].value_counts()
focus other 80572 BPH associated memory B cell 2810 memory B cell 69 Name: count, dtype: int64
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
color = sns.color_palette()[1]
fig, ax = plt.subplots(figsize=(2, 2))
rect = patches.Rectangle((0, 0), 1, 1, facecolor=color)
ax.add_patch(rect)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
plt.show()
# looking at the B-cell cluster. We can see some normal and BPH-associated memory B-cells
sc.pl.embedding(prostate_adata[(prostate_adata.obs['louvain_1.0']==str(i)) & (prostate_adata.obsm['scprint_umap'][:,0]>4)], basis="scprint_umap",color='focus', show=False, size=8, title="Switched memory B-cell cluster", legend_loc="right margin")
... storing 'focus' as categorical /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
<Axes: title={'center': 'focus'}, xlabel='scprint_umap_rot1', ylabel='scprint_umap_rot2'>
# here we use a function of scdataloader to add some more gene annotations to our dataset so we can view HGNC symbols
genedf = utils.load_genes(prostate_adata.obs.organism_ontology_term_id.iloc[0])
# columns that create issues when saving anndatas
prostate_adata.var = genedf.loc[prostate_adata.var.index].drop(columns=["stable_id", "created_at", "updated_at"])
# now the diff expression between B-cells and the rest
sc.tl.rank_genes_groups(prostate_adata, groupby='cleaned_pred_cell_type_ontology_term_id', groups=['IgG-negative class switched memory B cell'], reference='other', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
... storing 'focus' as categorical ... storing 'symbol' as categorical ... storing 'ncbi_gene_ids' as categorical ... storing 'biotype' as categorical ... storing 'description' as categorical ... storing 'synonyms' as categorical ... storing 'organism' as categorical
WARNING: It seems you use rank_genes_groups on the raw count data. Please logarithmize your data before calling rank_genes_groups.
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: overflow encountered in expm1 self.expm1_func(mean_rest) + 1e-9 /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:422: RuntimeWarning: divide by zero encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2(
Denoising and differential expression¶
What we found out from our previous analysis is that there is not a lot of normal (i.e. healthy) B-cells in our cluster, most of them are BPH associated. In this case, if we wanted to compare BPH B-cells to normal B-cells we might be very underpowered...
Instead of going to look for some other dataset, let's use scPRINT
to increase the depth of the expression profile of the cells, virtually adding more signal to our dataset.
We will use the Denoiser
class (see more about the class in our documentation) in a similar way Trainer
is used in pytorch lightning to denoise the expression profile of the cells.
We will then show the results of differential expression analysis before and after denoising.
# in case you started from here
prostate_adata = sc.read_h5ad("../../data/temp/prostate_adata_o2uniqsx.h5ad")
prostate_adata
--------------------------------------------------------------------------- FileNotFoundError Traceback (most recent call last) Cell In[4], line 2 1 # in case you started from here ----> 2 prostate_adata = sc.read_h5ad("../../data/temp/prostate_adata_o2uniqsx.h5ad") 3 prostate_adata File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/anndata/_io/h5ad.py:234, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size) 226 raise NotImplementedError( 227 "Currently only `X` and `raw/X` can be read as sparse." 228 ) 230 rdasp = partial( 231 read_dense_as_sparse, sparse_format=as_sparse_fmt, axis_chunk=chunk_size 232 ) --> 234 with h5py.File(filename, "r") as f: 236 def callback(func, elem_name: str, elem, iospec): 237 if iospec.encoding_type == "anndata" or elem_name.endswith("/"): File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/h5py/_hl/files.py:561, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, **kwds) 552 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, 553 locking, page_buf_size, min_meta_keep, min_raw_keep, 554 alignment_threshold=alignment_threshold, 555 alignment_interval=alignment_interval, 556 meta_block_size=meta_block_size, 557 **kwds) 558 fcpl = make_fcpl(track_order=track_order, fs_strategy=fs_strategy, 559 fs_persist=fs_persist, fs_threshold=fs_threshold, 560 fs_page_size=fs_page_size) --> 561 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr) 563 if isinstance(libver, tuple): 564 self._libver = libver File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/h5py/_hl/files.py:235, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr) 233 if swmr and swmr_support: 234 flags |= h5f.ACC_SWMR_READ --> 235 fid = h5f.open(name, flags, fapl=fapl) 236 elif mode == 'r+': 237 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl) File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper() File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper() File h5py/h5f.pyx:102, in h5py.h5f.open() FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '../../data/temp/prostate_adata_o2uniqsx.h5ad', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
# here we compare memory B-cell in BPH to normal memory B cells before denoising
sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key) 3804 try: -> 3805 return self._engine.get_loc(casted_key) 3806 except KeyError as err: File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc() File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item() File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'focus' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[11], line 2 1 # here we compare memory B-cell in BPH to normal memory B cells before denoising ----> 2 sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test') 3 # Plot the most differentially expressed genes 4 sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol') File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/legacy_api_wrap/__init__.py:80, in legacy_api.<locals>.wrapper.<locals>.fn_compatible(*args_all, **kw) 77 @wraps(fn) 78 def fn_compatible(*args_all: P.args, **kw: P.kwargs) -> R: 79 if len(args_all) <= n_positional: ---> 80 return fn(*args_all, **kw) 82 args_pos: P.args 83 args_pos, args_rest = args_all[:n_positional], args_all[n_positional:] File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:637, in rank_genes_groups(adata, groupby, mask_var, use_raw, groups, reference, n_genes, rankby_abs, pts, key_added, copy, method, corr_method, tie_correct, layer, **kwds) 635 if reference != "rest" and reference not in set(groups_order): 636 groups_order += [reference] --> 637 if reference != "rest" and reference not in adata.obs[groupby].cat.categories: 638 cats = adata.obs[groupby].cat.categories.tolist() 639 raise ValueError( 640 f"reference = {reference} needs to be one of groupby = {cats}." 641 ) File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key) 4100 if self.columns.nlevels > 1: 4101 return self._getitem_multilevel(key) -> 4102 indexer = self.columns.get_loc(key) 4103 if is_integer(indexer): 4104 indexer = [indexer] File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key) 3807 if isinstance(casted_key, slice) or ( 3808 isinstance(casted_key, abc.Iterable) 3809 and any(isinstance(x, slice) for x in casted_key) 3810 ): 3811 raise InvalidIndexError(key) -> 3812 raise KeyError(key) from err 3813 except TypeError: 3814 # If we have a listlike key, _check_indexing_error will raise 3815 # InvalidIndexError. Otherwise we fall through and re-raise 3816 # the TypeError. 3817 self._check_indexing_error(key) KeyError: 'focus'
# here we compare memory B-cell in BPH to normal memory B cells before denoising
sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
WARNING: It seems you use rank_genes_groups on the raw count data. Please logarithmize your data before calling rank_genes_groups.
# perform denoising
denoise = Denoiser(
# could be on top differentially expressed genes or on random expressed genes in the cell (completed by random non expressed gene)
how="random expr"
# the size of the minibatch (need to fit in memory)
batch_size=20,
# the number of genes to use
max_len=5000,
# the number of cells to use (here more than what we will use so we will use everything)
max_cells=10_000,
doplot=False,
# how much do we want to increase the depth / counts of the cells (here, 10x)
predict_depth_mult=10,
)
idx, genes, expr = denoise(model, prostate_adata[prostate_adata.obs['focus']!="other"])
100%|██████████| 82/82 [00:50<00:00, 1.64it/s]
# now what we are doing here it to complete the expression profile with the denoised values. this is not done by default for now
i = 0
prostate_adata.X = prostate_adata.X.tolil()
for idx, val in enumerate(prostate_adata.obs['focus']!="other"):
if val:
prostate_adata.X[idx, prostate_adata.var.index.get_indexer(np.array(model.genes)[genes[i]])] = expr[i]
i += 1
prostate_adata.X = prostate_adata.X.tocsr()
# now we compare memory B-cell in BPH to normal memory B cells after denoising
sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
prostate_adata.write_h5ad("../../data/temp/prostate_adata_denoised.h5ad")
prostate_adata = sc.read_h5ad("../../data/temp/prostate_combined_o2uniqsx_v2.h5ad")
prostate_adata
AnnData object with n_obs × n_vars = 83451 × 70116 obs: 'Sample', 'Lineage', 'Population', 'resolution_0.1', 'resolution_0.2', 'resolution_0.3', 'resolution_0.4', 'resolution_0.5', 'resolution_0.75', 'resolution_1', 'resolution_2', 'resolution_3', 'resolution_4', 'resolution_5', 'nCount_RNA', 'nFeature_RNA', 'percent.mito', 'percent.ribo', 'Stress1', 'assay_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'disease_ontology_term_id', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'donor_id', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'nnz', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'pred_cell_type_ontology_term_id', 'pred_disease_ontology_term_id', 'pred_assay_ontology_term_id', 'pred_self_reported_ethnicity_ontology_term_id', 'pred_sex_ontology_term_id', 'pred_organism_ontology_term_id', 'conv_pred_cell_type_ontology_term_id', 'conv_pred_disease_ontology_term_id', 'conv_pred_assay_ontology_term_id', 'conv_pred_self_reported_ethnicity_ontology_term_id', 'sprint_leiden', 'cleaned_pred_cell_type_ontology_term_id', 'cleaned_pred_disease_ontology_term_id' var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'uid', 'symbol', 'ncbi_gene_ids', 'biotype', 'description', 'synonyms', 'organism_id', 'public_source_id', 'created_by_id', 'mt', 'ribo', 'hb', 'organism', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts' uns: 'assay_colors', 'cleaned_pred_cell_type_ontology_term_id_colors', 'cleaned_pred_disease_ontology_term_id_colors', 'development_stage_colors', 'disease_colors', 'unseen_genes' obsm: 'X_pca', 'X_umap', 'scprint', 'scprint_umap'
Gene network inference¶
Finally we will use scPRINT to infer gene networks on another cell of interest, the fibroblasts, in both normal and BPH conditions.
We will use the GRNfer
class to infer gene networks. (see the cancer_usecase_part2.ipynb for more details on how to analyse the gene networks.)
prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id.value_counts().head(20)
cleaned_pred_cell_type_ontology_term_id basal epithelial cell of prostatic duct 25740 prostate gland microvascular endothelial cell 12165 urethra urothelial cell 10952 CD1c-positive myeloid dendritic cell 6898 other 5869 aortic smooth muscle cell 3970 effector CD8-positive, alpha-beta T cell 3925 pancreatic acinar cell 3711 luminal cell of prostate epithelium 3435 fibroblast of connective tissue of nonglandular part of prostate 3334 mucous neck cell 3260 IgG-negative class switched memory B cell 2932 basophil 1862 fibroblast of connective tissue of glandular part of prostate 1815 pancreatic ductal cell 1651 CD4-positive, alpha-beta thymocyte 1516 club cell 1226 smooth muscle cell of prostate 1000 effector memory CD8-positive, alpha-beta T cell 991 peptic cell 872 mature conventional dendritic cell 721 effector CD4-positive, alpha-beta T cell 686 mast cell 451 renal interstitial pericyte 440 retinal blood vessel endothelial cell 404 Name: count, dtype: int64
loc = (prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id=="fibroblast of connective tissue of glandular part of prostate")
prostate_adata.obs[loc]['cleaned_pred_disease_ontology_term_id'].value_counts()
cleaned_pred_disease_ontology_term_id benign prostatic hyperplasia 1482 normal 673 other 4 Name: count, dtype: int64
prostate_adata.obs[loc]['louvain_1.0'].value_counts().head(10)
louvain_1.0 5 1458 4 671 17 10 8 8 7 7 0 3 9 2 6 0 3 0 1 0 10 0 11 0 12 0 13 0 14 0 15 0 16 0 2 0 18 0 Name: count, dtype: int64
loc = loc & (prostate_adata.obs['louvain_1.0']==str(5))
prostate_adata.obs[loc]['cleaned_pred_disease_ontology_term_id'].value_counts()
cleaned_pred_disease_ontology_term_id benign prostatic hyperplasia 790 normal 664 other 4 Name: count, dtype: int64
prostate_adata.obs['fibro'] = None
prostate_adata.obs.loc[loc, 'fibro']="fibroblasts"
prostate_adata.obs.loc[loc & (prostate_adata.obs.cleaned_pred_disease_ontology_term_id=="benign prostatic hyperplasia"), 'fibro']="BPH associated fibroblasts"
sc.pl.embedding(prostate_adata[prostate_adata.obs['louvain_1.0'].isin(["5","4"]) & (prostate_adata.obsm['scprint_umap'][:,1]>10)], basis="scprint_umap_rot",color='fibro', show=False, size=8, title="Fibroblasts cluster", legend_loc="right margin")
... storing 'focus' as categorical ... storing 'fibro' as categorical /home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map))
grn_inferer = GNInfer(
# here we use the most variable genes across the fibroblasts vs the rest
how="most var across",
#how="random expr",
# we will preprocess the attention matrix with softmax
preprocess="softmax",
# we don't aggregate the heads here, we will do it manually
head_agg='none',
# here if we generate the attention matrices by performing a task, like denoising or by just passing the expression profile through the model
forward_mode="none",
# the number of genes to use (here the 4000 most variable genes)
num_genes=4000,
# the max number of cell to use per cell type
max_cells=300,
doplot=False,
batch_size=16,
# the column in anndata the defines the cell type
cell_type_col='fibro',
# list of layers to use
layer=list(range(model.nlayers))[:]
)
# I was missing this from the model (not really necessary)
model.organisms = ['NCBITaxon:9606', 'NCBITaxon:10090']
prostate_adata.obs.fibro = prostate_adata.obs.fibro.astype(str)
prostate_adata.obs[prostate_adata.obs.fibro=="BPH associated fibroblasts"].disease.value_counts()
# compute GRNs on fibroblasts, we use all the atetetion layers
grn = grn_inferer(model, prostate_adata, cell_type="fibroblasts")
# highlight differential links on genes that are expressed in both
grn.varp['all'] = grn.varp['GRN'].copy()
# now we aggregate the heads by taking their average
grn.varp['GRN'] = grn.varp['GRN'].mean(-1)
grn.write_h5ad("../../data/temp/prostate_fibro_grn_all.h5ad")
#same on the BPH associated fibroblasts
# I wanted to use only the ones that the labellers had defined as coing from BPH
prostate_adata.obs.loc[(prostate_adata.obs.fibro=="BPH associated fibroblasts") & ( prostate_adata.obs.disease=="benign prostatic hyperplasia"), 'fibro'] = "true BPH associated fibroblasts"
grn_c = grn_inferer(model, prostate_adata, cell_type="true cancer associated fibroblasts")
# highlight differential links on genes that are expressed in both
grn_c.varp['all'] = grn_c.varp['GRN'].copy()
grn_c.varp['GRN'] = grn_c.varp['GRN'].mean(-1)
grn_c.write_h5ad("../../data/temp/prostate_BPH_fibro_grn_all.h5ad")