scprint
  • Home
  • structure
  • pre-training
  • usage example

example notebooks

  • scPRINT use case on BPH
    • table of contents:
      • 🎴 YOU NEED TO FIRST HAVE A POPULATED LAMINDB INSTANCE (see README) 🎴
    • Downloading and preprocessing
    • Embedding and annotations
    • Annotation cleanup
    • Clustering and differential expression
    • Denoising and differential expression
    • Gene network inference
  • scPRINT use case on BPH (part 2, GN analysis)

documentation

  • model
  • tasks
  • cli
  • embedders
  • utils
scprint
  • example notebooks
  • scPRINT use case on BPH

scPRINT use case on BPH¶

In this use-case, also presented in Figure 5 of our manuscript, we perform an extensive analysis of a multi studies dataset of benign prostatic hyperplasia.

Our biological question is to check if there exist pre-cancerous cells that exhibits behaviors of mature cancer cells at this early stage of the disease.

In those cells, we want to know which genes might be implicated in cell state changes, and explore potentially novel targets in the treatment of prostate cancer and BPH.

We will start with a fresh datasets coming from the cellXgene database and representing 2 studies of BPH.

We will first explore these dataset to understand:

  • what are the cell types that are present in the data
  • what are the cell distributions (cell distributions? what are they?)
  • what sequencers were used, etc.

We also want to confirm existing target in prostate cancer through precancerous lesion analysis, and find potentially novel ones that would serve as less invasive BPH treatments than current ones.

Finally we want to know how these targets interacts and are involved in biological pathways.

We now showcase how to use scPRINT across its different functionalities to answer some of these questions.

table of contents:¶

  1. Downloading and preprocessing
  2. Embedding and annotations
  3. Annotation cleanup
  4. Clustering and differential expression
  5. Denoising and differential expression
  6. Gene network inference

In the notebook cancer_usecase_part2.ipynb you will see how to analyse cell type specific gene regulatory networks.

🎴 YOU NEED TO FIRST HAVE A POPULATED LAMINDB INSTANCE (see README) 🎴¶

In [1]:
Copied!
from scprint import scPrint
from scdataloader import Preprocessor, utils
from scprint.tasks import GNInfer, Embedder, Denoiser, withknn

from bengrn import BenGRN, get_sroy_gt, compute_genie3
from bengrn.base import train_classifier

from grnndata import utils as grnutils
from grnndata import read_h5ad

from anndata.utils import make_index_unique
from anndata import concat
import scanpy as sc

from matplotlib import pyplot as plt
import numpy as np
import lamindb as ln

%load_ext autoreload
%autoreload 2 

import torch
torch.set_float32_matmul_precision('medium')
from scprint import scPrint from scdataloader import Preprocessor, utils from scprint.tasks import GNInfer, Embedder, Denoiser, withknn from bengrn import BenGRN, get_sroy_gt, compute_genie3 from bengrn.base import train_classifier from grnndata import utils as grnutils from grnndata import read_h5ad from anndata.utils import make_index_unique from anndata import concat import scanpy as sc from matplotlib import pyplot as plt import numpy as np import lamindb as ln %load_ext autoreload %autoreload 2 import torch torch.set_float32_matmul_precision('medium')
→ connected lamindb: jkobject/scprint

Downloading and preprocessing¶

We now use lamindb to easily access cellxgene and download a dataset of normal and benign prostatic hyperplasia tissues.

data is available here https://cellxgene.cziscience.com/e/574e9f9e-f8b4-41ef-bf19-89a9964fd9c7.cxg/ .

We then use scDataloader's preprocessing method. This method is quite extensive and does a few things.. find our more about it on its documentation.

On our end we are using the preprocessor to make sure that the the gene expression that we have are raw counts and that we have enough information to use scPRINT (i.e., enough genes expressed and enough counts per cells across the dataset).

Finally, the preprocessor will also increase the size of the expression matrix to be a fixed set of genes defined by the latest version of ensemble.

In [2]:
Copied!
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
In [3]:
Copied!
prostate_adata = cx_dataset.artifacts.filter(key='cell-census/2023-12-15/h5ads/574e9f9e-f8b4-41ef-bf19-89a9964fd9c7.h5ad').one().load()
sc.pl.umap(prostate_adata)
prostate_adata = cx_dataset.artifacts.filter(key='cell-census/2023-12-15/h5ads/574e9f9e-f8b4-41ef-bf19-89a9964fd9c7.h5ad').one().load() sc.pl.umap(prostate_adata)
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/asyncio/sslproto.py:320: ResourceWarning: unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x7fd088b36f80>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
No description has been provided for this image
In [7]:
Copied!
# preprocessing using scDataloader
prostate_adata.obs.drop(columns="is_primary_data", inplace=True)
preprocessor = Preprocessor(do_postp=False)
prostate_adata = preprocessor(prostate_adata)
# preprocessing using scDataloader prostate_adata.obs.drop(columns="is_primary_data", inplace=True) preprocessor = Preprocessor(do_postp=False) prostate_adata = preprocessor(prostate_adata)
Dropping layers:  KeysView(Layers with keys: )
checking raw counts
removed 0 non primary cells, 83451 renamining
filtered out 0 cells, 83451 renamining
Removed 76 genes.
validating
/home/ml4ig1/Documents code/scPRINT/scDataLoader/scdataloader/preprocess.py:241: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  data_utils.validate(adata, organism=adata.obs.organism_ontology_term_id[0])
startin QC

Seeing 13047 outliers (15.63% of total dataset):
done

Embedding and annotations¶

We now start to load a large version of scPRINT from a specific checkpoint. Please download the checkpoints following the instructions in the README.

We will then use out Embedder class to embed the data and annotate the cells. These classes are how we parametrize and access the different functions of scPRINT. Find out more about its parameters in our documentation.

In [6]:
Copied!
model = scPrint.load_from_checkpoint('../data/temp/o2uniqsx/epoch=18-step=133000.ckpt', 
    # make sure that you check if you have a GPU with flashattention or not (see README)
    # you need this else, the model will look for the file with the embeddings instead of getting them from the weights
    precpt_gene_emb = None)
model = scPrint.load_from_checkpoint('../data/temp/o2uniqsx/epoch=18-step=133000.ckpt', # make sure that you check if you have a GPU with flashattention or not (see README) # you need this else, the model will look for the file with the embeddings instead of getting them from the weights precpt_gene_emb = None)
RuntimeError caught: scPrint is not attached to a `Trainer`.

In [10]:
Copied!
embedder = Embedder( 
                    # can work on random genes or most variables etc..
                    how="random expr", 
                    # number of genes to use
                    max_len=4000, 
                    add_zero_genes=0, 
                    # for the dataloading
                    num_workers=8, 
                    # we will only use the cell type embedding here.
                    pred_embedding = ["cell_type_ontology_term_id"]
                    )#, "disease_ontology_term_id"])
embedder = Embedder( # can work on random genes or most variables etc.. how="random expr", # number of genes to use max_len=4000, add_zero_genes=0, # for the dataloading num_workers=8, # we will only use the cell type embedding here. pred_embedding = ["cell_type_ontology_term_id"] )#, "disease_ontology_term_id"])
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
In [ ]:
Copied!
# create the embedding
prostate_adata, metrics = embedder(model, prostate_adata, cache=False, output_expression="none")
# create the embedding prostate_adata, metrics = embedder(model, prostate_adata, cache=False, output_expression="none")
100%|██████████| 1560/1560 [28:20<00:00,  1.09s/it]
AnnData object with n_obs × n_vars = 99826 × 424
    obs: 'pred_cell_type_ontology_term_id', 'pred_disease_ontology_term_id', 'pred_assay_ontology_term_id', 'pred_self_reported_ethnicity_ontology_term_id', 'pred_sex_ontology_term_id', 'pred_organism_ontology_term_id', 'conv_pred_cell_type_ontology_term_id', 'conv_pred_disease_ontology_term_id', 'conv_pred_assay_ontology_term_id', 'conv_pred_self_reported_ethnicity_ontology_term_id'
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
No description has been provided for this image
couldn't log to tensorboard
couldn't log to wandb

couldn't log to wandb
     cell_type_ontology_term_id
     accuracy: 0.7250215374752068
 
     disease_ontology_term_id
     accuracy: 0.8394506441207702
 
     assay_ontology_term_id
     accuracy: 0.998988239536794
 
     self_reported_ethnicity_ontology_term_id
     accuracy: 0.9504037024422495
 
     sex_ontology_term_id
     accuracy: 0.9948911105323263
 
     organism_ontology_term_id
     accuracy: 1.0
 
In [16]:
Copied!
prostate_adata
prostate_adata
Out[16]:
AnnData object with n_obs × n_vars = 99826 × 70116
    obs: 'assay_ontology_term_id', 'donor_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'organism_ontology_term_id', 'suspension_type', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'nnz', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'init_dataset', 'pred_cell_type_ontology_term_id', 'pred_disease_ontology_term_id', 'pred_assay_ontology_term_id', 'pred_self_reported_ethnicity_ontology_term_id', 'pred_sex_ontology_term_id', 'pred_organism_ontology_term_id', 'conv_pred_cell_type_ontology_term_id', 'conv_pred_disease_ontology_term_id', 'conv_pred_assay_ontology_term_id', 'conv_pred_self_reported_ethnicity_ontology_term_id', 'sprint_leiden'
    uns: 'init_dataset_colors'
    obsm: 'X_pca', 'X_umap', 'scprint_umap', 'scprint'

Annotation cleanup¶

scPRINT generates predictions over hundreds of possible labels for each cell.

It is often advised to "cleanup" the predictions, e.g. making sure to remove low frequency cells and misslabellings.

Here, we use the most straightforward approach which is to remove any annotations that appear a small number of times.

A better approach would be doing majority voting over cell clusters as it would aggregate and smoothout the predictions over multiple cells. it would also remove most of the low frequency mistakes in the predictions.

We will also have a look at the embeddings of scPRINT by plotting its UMAP visualization.

In [30]:
Copied!
#cleaning up the cell types
prostate_adata.obs['cleaned_pred_cell_type_ontology_term_id'] = prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].astype(str)
prostate_adata.obs.loc[~prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].isin([k for k, v in prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].value_counts().items() if v > 400]), 'cleaned_pred_cell_type_ontology_term_id'] = "other"
prostate_adata.obs['cleaned_pred_cell_type_ontology_term_id'].value_counts().plot.pie()
#cleaning up the cell types prostate_adata.obs['cleaned_pred_cell_type_ontology_term_id'] = prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].astype(str) prostate_adata.obs.loc[~prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].isin([k for k, v in prostate_adata.obs['conv_pred_cell_type_ontology_term_id'].value_counts().items() if v > 400]), 'cleaned_pred_cell_type_ontology_term_id'] = "other" prostate_adata.obs['cleaned_pred_cell_type_ontology_term_id'].value_counts().plot.pie()
Out[30]:
<Axes: ylabel='count'>
No description has been provided for this image
In [31]:
Copied!
# cleaning up the diseases
prostate_adata.obs['cleaned_pred_disease_ontology_term_id'] = prostate_adata.obs['conv_pred_disease_ontology_term_id'].astype(str)
prostate_adata.obs.loc[~prostate_adata.obs['conv_pred_disease_ontology_term_id'].isin([k for k, v in prostate_adata.obs['conv_pred_disease_ontology_term_id'].value_counts().items() if v > 1000]), 'cleaned_pred_disease_ontology_term_id'] = "other"
prostate_adata.obs['cleaned_pred_disease_ontology_term_id'].value_counts().plot.pie()
# cleaning up the diseases prostate_adata.obs['cleaned_pred_disease_ontology_term_id'] = prostate_adata.obs['conv_pred_disease_ontology_term_id'].astype(str) prostate_adata.obs.loc[~prostate_adata.obs['conv_pred_disease_ontology_term_id'].isin([k for k, v in prostate_adata.obs['conv_pred_disease_ontology_term_id'].value_counts().items() if v > 1000]), 'cleaned_pred_disease_ontology_term_id'] = "other" prostate_adata.obs['cleaned_pred_disease_ontology_term_id'].value_counts().plot.pie()
Out[31]:
<Axes: ylabel='count'>
No description has been provided for this image
In [38]:
Copied!
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['cleaned_pred_cell_type_ontology_term_id'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['cleaned_pred_disease_ontology_term_id'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['assay'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['disease'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['development_stage'])
sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['cleaned_pred_cell_type_ontology_term_id']) sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['cleaned_pred_disease_ontology_term_id']) sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['assay']) sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['disease']) sc.pl.embedding(prostate_adata, basis="scprint_umap", color=['development_stage'])
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
No description has been provided for this image
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
No description has been provided for this image
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
No description has been provided for this image
In [ ]:
Copied!
prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id.value_counts().head(20)
prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id.value_counts().head(20)
cleaned_pred_cell_type_ontology_term_id
basal cell of epidermis                                             22449
mammary alveolar cell                                                9688
vasa recta descending limb cell                                      8777
CD1c-positive myeloid dendritic cell                                 5782
effector memory CD8-positive, alpha-beta T cell                      5410
smooth muscle cell of the pulmonary artery                           5007
other                                                                4972
pancreatic acinar cell                                               4639
basal epithelial cell of prostatic duct                              3405
serous cell of epithelium of trachea                                 3327
prostate gland microvascular endothelial cell                        2747
paneth cell of epithelium of small intestine                         2698
mast cell                                                            2365
IgG-negative class switched memory B cell                            2348
fibroblast of connective tissue of nonglandular part of prostate     2058
club cell                                                            1794
pulmonary artery endothelial cell                                    1239
CD141-positive myeloid dendritic cell                                1186
mucous neck cell                                                     1039
smooth muscle cell of prostate                                        857
peptic cell                                                           766
T-helper 17 cell                                                      735
CD14-positive, CD16-negative classical monocyte                       671
CD16-negative, CD56-bright natural killer cell, human                 666
lung pericyte                                                         663
alveolar type 2 fibroblast cell                                       646
type II pneumocyte                                                    618
naive B cell                                                          606
effector CD4-positive, alpha-beta T cell                              506
fibroblast of connective tissue of glandular part of prostate         464
luminal epithelial cell of mammary gland                              437
inflammatory macrophage                                               436
fibroblast of mammary gland                                           416
basal cell of epithelium of bronchus                                  409
Name: count, dtype: int64
In [39]:
Copied!
# we save for next time
prostate_adata.write_h5ad("../../data/prostate_adata_o2uniqsx.h5ad")
# we save for next time prostate_adata.write_h5ad("../../data/prostate_adata_o2uniqsx.h5ad")
In [2]:
Copied!
prostate_adata = sc.read_h5ad("../../data/prostate_adata_o2uniqsx.h5ad")
prostate_adata = sc.read_h5ad("../../data/prostate_adata_o2uniqsx.h5ad")

Clustering and differential expression¶

We will now cluster using the louvain algorithm on a kNN graph.

Once we detect a cluster of interest we will perform differential expression analysis on it. Taking as example some B-cell clusters, we will use scanpy's implementation of rank_gene_groups for our differential expression

In [17]:
Copied!
# do louvain mutliple times
sc.pp.neighbors(prostate_adata, n_neighbors=10, use_rep="scprint")
# using multiple resolutions can help spotting smaller clusters
sc.tl.louvain(prostate_adata, resolution=0.5, key_added="louvain_0.5")
sc.tl.louvain(prostate_adata, resolution=1.0, key_added="louvain_1.0")
# do louvain mutliple times sc.pp.neighbors(prostate_adata, n_neighbors=10, use_rep="scprint") # using multiple resolutions can help spotting smaller clusters sc.tl.louvain(prostate_adata, resolution=0.5, key_added="louvain_0.5") sc.tl.louvain(prostate_adata, resolution=1.0, key_added="louvain_1.0")
In [20]:
Copied!
# check clusters
sc.pl.embedding(prostate_adata, basis="scprint_umap", color='louvain_1.0', show=False, legend_loc="on data")
# check clusters sc.pl.embedding(prostate_adata, basis="scprint_umap", color='louvain_1.0', show=False, legend_loc="on data")
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
Out[20]:
<Axes: title={'center': 'louvain_1.0'}, xlabel='scprint_umap1', ylabel='scprint_umap2'>
No description has been provided for this image
In [21]:
Copied!
# check cluster 9
i=9
loc = prostate_adata.obs['louvain_1.0']==str(i)
prostate_adata.obs[loc].conv_pred_disease_ontology_term_id.value_counts().head(2), prostate_adata.obs[loc].conv_pred_cell_type_ontology_term_id.value_counts().head(5)
# check cluster 9 i=9 loc = prostate_adata.obs['louvain_1.0']==str(i) prostate_adata.obs[loc].conv_pred_disease_ontology_term_id.value_counts().head(2), prostate_adata.obs[loc].conv_pred_cell_type_ontology_term_id.value_counts().head(5)
Out[21]:
(conv_pred_disease_ontology_term_id
 benign prostatic hyperplasia    3554
 normal                            18
 Name: count, dtype: int64,
 conv_pred_cell_type_ontology_term_id
 urethra urothelial cell                    1703
 luminal cell of prostate epithelium         703
 mucous neck cell                            566
 basal epithelial cell of prostatic duct     372
 club cell                                   174
 Name: count, dtype: int64)
In [22]:
Copied!
# check cluster 11
i=10
loc = prostate_adata.obs['louvain_1.0']==str(i)
prostate_adata.obs[loc].cleaned_pred_disease_ontology_term_id.value_counts().head(2), prostate_adata.obs[loc].cleaned_pred_cell_type_ontology_term_id.value_counts().head(5)
# check cluster 11 i=10 loc = prostate_adata.obs['louvain_1.0']==str(i) prostate_adata.obs[loc].cleaned_pred_disease_ontology_term_id.value_counts().head(2), prostate_adata.obs[loc].cleaned_pred_cell_type_ontology_term_id.value_counts().head(5)
Out[22]:
(cleaned_pred_disease_ontology_term_id
 benign prostatic hyperplasia    2900
 normal                            68
 Name: count, dtype: int64,
 cleaned_pred_cell_type_ontology_term_id
 IgG-negative class switched memory B cell    2879
 other                                          45
 CD4-positive, alpha-beta thymocyte             16
 effector CD8-positive, alpha-beta T cell       12
 CD1c-positive myeloid dendritic cell            8
 Name: count, dtype: int64)
In [23]:
Copied!
# We have find a nice IgG-negative class switched memory B cell cluster. let's use it and define a clean annotation for a plot
loc = loc&(prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id=="IgG-negative class switched memory B cell")
prostate_adata.obs[loc].cleaned_pred_disease_ontology_term_id.value_counts().head(2)
# We have find a nice IgG-negative class switched memory B cell cluster. let's use it and define a clean annotation for a plot loc = loc&(prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id=="IgG-negative class switched memory B cell") prostate_adata.obs[loc].cleaned_pred_disease_ontology_term_id.value_counts().head(2)
Out[23]:
cleaned_pred_disease_ontology_term_id
benign prostatic hyperplasia    2810
normal                            60
Name: count, dtype: int64
In [24]:
Copied!
# making a "focus" annotation for the B-cell to generate a nice plot of the B-cell cluster only
prostate_adata.obs['focus'] = "other"
prostate_adata.obs.loc[loc, 'focus'] = "memory B cell"
prostate_adata.obs.loc[loc & (prostate_adata.obs['cleaned_pred_disease_ontology_term_id']=='benign prostatic hyperplasia'), 'focus'] = "BPH associated memory B cell"
prostate_adata.obs['focus'].value_counts()
# making a "focus" annotation for the B-cell to generate a nice plot of the B-cell cluster only prostate_adata.obs['focus'] = "other" prostate_adata.obs.loc[loc, 'focus'] = "memory B cell" prostate_adata.obs.loc[loc & (prostate_adata.obs['cleaned_pred_disease_ontology_term_id']=='benign prostatic hyperplasia'), 'focus'] = "BPH associated memory B cell" prostate_adata.obs['focus'].value_counts()
Out[24]:
focus
other                           80572
BPH associated memory B cell     2810
memory B cell                      69
Name: count, dtype: int64
In [25]:
Copied!
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import seaborn as sns
color = sns.color_palette()[1]
fig, ax = plt.subplots(figsize=(2, 2))
rect = patches.Rectangle((0, 0), 1, 1, facecolor=color)
ax.add_patch(rect)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
plt.show()
import matplotlib.pyplot as plt import matplotlib.patches as patches import seaborn as sns color = sns.color_palette()[1] fig, ax = plt.subplots(figsize=(2, 2)) rect = patches.Rectangle((0, 0), 1, 1, facecolor=color) ax.add_patch(rect) ax.set_xlim(0, 1) ax.set_ylim(0, 1) ax.axis('off') plt.show()
No description has been provided for this image
In [32]:
Copied!
# looking at the B-cell cluster. We can see some normal and BPH-associated memory B-cells
sc.pl.embedding(prostate_adata[(prostate_adata.obs['louvain_1.0']==str(i)) & (prostate_adata.obsm['scprint_umap'][:,0]>4)], basis="scprint_umap",color='focus', show=False, size=8, title="Switched memory B-cell cluster", legend_loc="right margin")
# looking at the B-cell cluster. We can see some normal and BPH-associated memory B-cells sc.pl.embedding(prostate_adata[(prostate_adata.obs['louvain_1.0']==str(i)) & (prostate_adata.obsm['scprint_umap'][:,0]>4)], basis="scprint_umap",color='focus', show=False, size=8, title="Switched memory B-cell cluster", legend_loc="right margin")
... storing 'focus' as categorical
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
Out[32]:
<Axes: title={'center': 'focus'}, xlabel='scprint_umap_rot1', ylabel='scprint_umap_rot2'>
No description has been provided for this image
In [ ]:
Copied!
# here we use a function of scdataloader to add some more gene annotations to our dataset so we can view HGNC symbols
genedf = utils.load_genes(prostate_adata.obs.organism_ontology_term_id.iloc[0])
# columns that create issues when saving anndatas
prostate_adata.var = genedf.loc[prostate_adata.var.index].drop(columns=["stable_id", "created_at", "updated_at"])
# here we use a function of scdataloader to add some more gene annotations to our dataset so we can view HGNC symbols genedf = utils.load_genes(prostate_adata.obs.organism_ontology_term_id.iloc[0]) # columns that create issues when saving anndatas prostate_adata.var = genedf.loc[prostate_adata.var.index].drop(columns=["stable_id", "created_at", "updated_at"])
In [ ]:
Copied!
# now the diff expression between B-cells and the rest
sc.tl.rank_genes_groups(prostate_adata, groupby='cleaned_pred_cell_type_ontology_term_id', groups=['IgG-negative class switched memory B cell'], reference='other', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
# now the diff expression between B-cells and the rest sc.tl.rank_genes_groups(prostate_adata, groupby='cleaned_pred_cell_type_ontology_term_id', groups=['IgG-negative class switched memory B cell'], reference='other', method='t-test') # Plot the most differentially expressed genes sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol') # super strong B cell markers
... storing 'focus' as categorical
... storing 'symbol' as categorical
... storing 'ncbi_gene_ids' as categorical
... storing 'biotype' as categorical
... storing 'description' as categorical
... storing 'synonyms' as categorical
... storing 'organism' as categorical
WARNING: It seems you use rank_genes_groups on the raw count data. Please logarithmize your data before calling rank_genes_groups.
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: overflow encountered in expm1
  self.expm1_func(mean_rest) + 1e-9
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:422: RuntimeWarning: divide by zero encountered in log2
  self.stats[group_name, 'logfoldchanges'] = np.log2(
No description has been provided for this image

Denoising and differential expression¶

What we found out from our previous analysis is that there is not a lot of normal (i.e. healthy) B-cells in our cluster, most of them are BPH associated. In this case, if we wanted to compare BPH B-cells to normal B-cells we might be very underpowered...

Instead of going to look for some other dataset, let's use scPRINT to increase the depth of the expression profile of the cells, virtually adding more signal to our dataset.

We will use the Denoiser class (see more about the class in our documentation) in a similar way Trainer is used in pytorch lightning to denoise the expression profile of the cells.

We will then show the results of differential expression analysis before and after denoising.

In [4]:
Copied!
# in case you started from here
prostate_adata = sc.read_h5ad("../../data/temp/prostate_adata_o2uniqsx.h5ad")
prostate_adata
# in case you started from here prostate_adata = sc.read_h5ad("../../data/temp/prostate_adata_o2uniqsx.h5ad") prostate_adata
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 2
      1 # in case you started from here
----> 2 prostate_adata = sc.read_h5ad("../../data/temp/prostate_adata_o2uniqsx.h5ad")
      3 prostate_adata

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/anndata/_io/h5ad.py:234, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    226         raise NotImplementedError(
    227             "Currently only `X` and `raw/X` can be read as sparse."
    228         )
    230 rdasp = partial(
    231     read_dense_as_sparse, sparse_format=as_sparse_fmt, axis_chunk=chunk_size
    232 )
--> 234 with h5py.File(filename, "r") as f:
    236     def callback(func, elem_name: str, elem, iospec):
    237         if iospec.encoding_type == "anndata" or elem_name.endswith("/"):

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/h5py/_hl/files.py:561, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, **kwds)
    552     fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0,
    553                      locking, page_buf_size, min_meta_keep, min_raw_keep,
    554                      alignment_threshold=alignment_threshold,
    555                      alignment_interval=alignment_interval,
    556                      meta_block_size=meta_block_size,
    557                      **kwds)
    558     fcpl = make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    559                      fs_persist=fs_persist, fs_threshold=fs_threshold,
    560                      fs_page_size=fs_page_size)
--> 561     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    563 if isinstance(libver, tuple):
    564     self._libver = libver

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/h5py/_hl/files.py:235, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    233     if swmr and swmr_support:
    234         flags |= h5f.ACC_SWMR_READ
--> 235     fid = h5f.open(name, flags, fapl=fapl)
    236 elif mode == 'r+':
    237     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5f.pyx:102, in h5py.h5f.open()

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '../../data/temp/prostate_adata_o2uniqsx.h5ad', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
In [11]:
Copied!
# here we compare memory B-cell in BPH to normal memory B cells before denoising
sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
# here we compare memory B-cell in BPH to normal memory B cells before denoising sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test') # Plot the most differentially expressed genes sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol') # super strong B cell markers
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'focus'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[11], line 2
      1 # here we compare memory B-cell in BPH to normal memory B cells before denoising
----> 2 sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
      3 # Plot the most differentially expressed genes
      4 sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/legacy_api_wrap/__init__.py:80, in legacy_api.<locals>.wrapper.<locals>.fn_compatible(*args_all, **kw)
     77 @wraps(fn)
     78 def fn_compatible(*args_all: P.args, **kw: P.kwargs) -> R:
     79     if len(args_all) <= n_positional:
---> 80         return fn(*args_all, **kw)
     82     args_pos: P.args
     83     args_pos, args_rest = args_all[:n_positional], args_all[n_positional:]

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:637, in rank_genes_groups(adata, groupby, mask_var, use_raw, groups, reference, n_genes, rankby_abs, pts, key_added, copy, method, corr_method, tie_correct, layer, **kwds)
    635     if reference != "rest" and reference not in set(groups_order):
    636         groups_order += [reference]
--> 637 if reference != "rest" and reference not in adata.obs[groupby].cat.categories:
    638     cats = adata.obs[groupby].cat.categories.tolist()
    639     raise ValueError(
    640         f"reference = {reference} needs to be one of groupby = {cats}."
    641     )

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'focus'
In [16]:
Copied!
# here we compare memory B-cell in BPH to normal memory B cells before denoising
sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
# here we compare memory B-cell in BPH to normal memory B cells before denoising sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test') # Plot the most differentially expressed genes sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol') # super strong B cell markers
WARNING: It seems you use rank_genes_groups on the raw count data. Please logarithmize your data before calling rank_genes_groups.
No description has been provided for this image
In [19]:
Copied!
# perform denoising
denoise = Denoiser( 
                    # could be on top differentially expressed genes or on random expressed genes in the cell (completed by random non expressed gene)
                    how="random expr"
                    # the size of the minibatch (need to fit in memory)
                    batch_size=20, 
                    # the number of genes to use
                    max_len=5000, 
                    # the number of cells to use (here more than what we will use so we will use everything)
                    max_cells=10_000, 
                    doplot=False, 
                    # how much do we want to increase the depth / counts of the cells (here, 10x)
                    predict_depth_mult=10, 
                    )
idx, genes, expr = denoise(model, prostate_adata[prostate_adata.obs['focus']!="other"])
# perform denoising denoise = Denoiser( # could be on top differentially expressed genes or on random expressed genes in the cell (completed by random non expressed gene) how="random expr" # the size of the minibatch (need to fit in memory) batch_size=20, # the number of genes to use max_len=5000, # the number of cells to use (here more than what we will use so we will use everything) max_cells=10_000, doplot=False, # how much do we want to increase the depth / counts of the cells (here, 10x) predict_depth_mult=10, ) idx, genes, expr = denoise(model, prostate_adata[prostate_adata.obs['focus']!="other"])
100%|██████████| 82/82 [00:50<00:00,  1.64it/s]
In [20]:
Copied!
# now what we are doing here it to complete the expression profile with the denoised values. this is not done by default for now
i = 0
prostate_adata.X = prostate_adata.X.tolil()
for idx, val in enumerate(prostate_adata.obs['focus']!="other"):
    if val:
        prostate_adata.X[idx, prostate_adata.var.index.get_indexer(np.array(model.genes)[genes[i]])] = expr[i]
        i += 1
prostate_adata.X = prostate_adata.X.tocsr()
# now what we are doing here it to complete the expression profile with the denoised values. this is not done by default for now i = 0 prostate_adata.X = prostate_adata.X.tolil() for idx, val in enumerate(prostate_adata.obs['focus']!="other"): if val: prostate_adata.X[idx, prostate_adata.var.index.get_indexer(np.array(model.genes)[genes[i]])] = expr[i] i += 1 prostate_adata.X = prostate_adata.X.tocsr()
In [21]:
Copied!
# now we compare memory B-cell in BPH to normal memory B cells after denoising
sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test')
# Plot the most differentially expressed genes
sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol')
# super strong B cell markers
# now we compare memory B-cell in BPH to normal memory B cells after denoising sc.tl.rank_genes_groups(prostate_adata, groupby='focus', groups=['BPH associated memory B cell'], reference='memory B cell', method='t-test') # Plot the most differentially expressed genes sc.pl.rank_genes_groups(prostate_adata, n_genes=25, sharey=False, gene_symbols='symbol') # super strong B cell markers
No description has been provided for this image
In [9]:
Copied!
prostate_adata.write_h5ad("../../data/temp/prostate_adata_denoised.h5ad")
prostate_adata.write_h5ad("../../data/temp/prostate_adata_denoised.h5ad")
In [10]:
Copied!
prostate_adata = sc.read_h5ad("../../data/temp/prostate_combined_o2uniqsx_v2.h5ad")
prostate_adata
prostate_adata = sc.read_h5ad("../../data/temp/prostate_combined_o2uniqsx_v2.h5ad") prostate_adata
Out[10]:
AnnData object with n_obs × n_vars = 83451 × 70116
    obs: 'Sample', 'Lineage', 'Population', 'resolution_0.1', 'resolution_0.2', 'resolution_0.3', 'resolution_0.4', 'resolution_0.5', 'resolution_0.75', 'resolution_1', 'resolution_2', 'resolution_3', 'resolution_4', 'resolution_5', 'nCount_RNA', 'nFeature_RNA', 'percent.mito', 'percent.ribo', 'Stress1', 'assay_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'disease_ontology_term_id', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'donor_id', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'nnz', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'pred_cell_type_ontology_term_id', 'pred_disease_ontology_term_id', 'pred_assay_ontology_term_id', 'pred_self_reported_ethnicity_ontology_term_id', 'pred_sex_ontology_term_id', 'pred_organism_ontology_term_id', 'conv_pred_cell_type_ontology_term_id', 'conv_pred_disease_ontology_term_id', 'conv_pred_assay_ontology_term_id', 'conv_pred_self_reported_ethnicity_ontology_term_id', 'sprint_leiden', 'cleaned_pred_cell_type_ontology_term_id', 'cleaned_pred_disease_ontology_term_id'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'uid', 'symbol', 'ncbi_gene_ids', 'biotype', 'description', 'synonyms', 'organism_id', 'public_source_id', 'created_by_id', 'mt', 'ribo', 'hb', 'organism', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'assay_colors', 'cleaned_pred_cell_type_ontology_term_id_colors', 'cleaned_pred_disease_ontology_term_id_colors', 'development_stage_colors', 'disease_colors', 'unseen_genes'
    obsm: 'X_pca', 'X_umap', 'scprint', 'scprint_umap'

Gene network inference¶

Finally we will use scPRINT to infer gene networks on another cell of interest, the fibroblasts, in both normal and BPH conditions.

We will use the GRNfer class to infer gene networks. (see the cancer_usecase_part2.ipynb for more details on how to analyse the gene networks.)

In [38]:
Copied!
prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id.value_counts().head(20)
prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id.value_counts().head(20)
Out[38]:
cleaned_pred_cell_type_ontology_term_id
basal epithelial cell of prostatic duct                             25740
prostate gland microvascular endothelial cell                       12165
urethra urothelial cell                                             10952
CD1c-positive myeloid dendritic cell                                 6898
other                                                                5869
aortic smooth muscle cell                                            3970
effector CD8-positive, alpha-beta T cell                             3925
pancreatic acinar cell                                               3711
luminal cell of prostate epithelium                                  3435
fibroblast of connective tissue of nonglandular part of prostate     3334
mucous neck cell                                                     3260
IgG-negative class switched memory B cell                            2932
basophil                                                             1862
fibroblast of connective tissue of glandular part of prostate        1815
pancreatic ductal cell                                               1651
CD4-positive, alpha-beta thymocyte                                   1516
club cell                                                            1226
smooth muscle cell of prostate                                       1000
effector memory CD8-positive, alpha-beta T cell                       991
peptic cell                                                           872
mature conventional dendritic cell                                    721
effector CD4-positive, alpha-beta T cell                              686
mast cell                                                             451
renal interstitial pericyte                                           440
retinal blood vessel endothelial cell                                 404
Name: count, dtype: int64
In [33]:
Copied!
loc = (prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id=="fibroblast of connective tissue of glandular part of prostate")
prostate_adata.obs[loc]['cleaned_pred_disease_ontology_term_id'].value_counts()
loc = (prostate_adata.obs.cleaned_pred_cell_type_ontology_term_id=="fibroblast of connective tissue of glandular part of prostate") prostate_adata.obs[loc]['cleaned_pred_disease_ontology_term_id'].value_counts()
Out[33]:
cleaned_pred_disease_ontology_term_id
benign prostatic hyperplasia    1482
normal                           673
other                              4
Name: count, dtype: int64
In [52]:
Copied!
prostate_adata.obs[loc]['louvain_1.0'].value_counts().head(10)
prostate_adata.obs[loc]['louvain_1.0'].value_counts().head(10)
Out[52]:
louvain_1.0
5     1458
4      671
17      10
8        8
7        7
0        3
9        2
6        0
3        0
1        0
10       0
11       0
12       0
13       0
14       0
15       0
16       0
2        0
18       0
Name: count, dtype: int64
In [34]:
Copied!
loc = loc & (prostate_adata.obs['louvain_1.0']==str(5))
prostate_adata.obs[loc]['cleaned_pred_disease_ontology_term_id'].value_counts()
loc = loc & (prostate_adata.obs['louvain_1.0']==str(5)) prostate_adata.obs[loc]['cleaned_pred_disease_ontology_term_id'].value_counts()
Out[34]:
cleaned_pred_disease_ontology_term_id
benign prostatic hyperplasia    790
normal                          664
other                             4
Name: count, dtype: int64
In [35]:
Copied!
prostate_adata.obs['fibro'] = None
prostate_adata.obs.loc[loc, 'fibro']="fibroblasts"
prostate_adata.obs.loc[loc & (prostate_adata.obs.cleaned_pred_disease_ontology_term_id=="benign prostatic hyperplasia"), 'fibro']="BPH associated fibroblasts"
prostate_adata.obs['fibro'] = None prostate_adata.obs.loc[loc, 'fibro']="fibroblasts" prostate_adata.obs.loc[loc & (prostate_adata.obs.cleaned_pred_disease_ontology_term_id=="benign prostatic hyperplasia"), 'fibro']="BPH associated fibroblasts"
In [37]:
Copied!
sc.pl.embedding(prostate_adata[prostate_adata.obs['louvain_1.0'].isin(["5","4"]) & (prostate_adata.obsm['scprint_umap'][:,1]>10)], basis="scprint_umap_rot",color='fibro', show=False, size=8, title="Fibroblasts cluster", legend_loc="right margin")
sc.pl.embedding(prostate_adata[prostate_adata.obs['louvain_1.0'].isin(["5","4"]) & (prostate_adata.obsm['scprint_umap'][:,1]>10)], basis="scprint_umap_rot",color='fibro', show=False, size=8, title="Fibroblasts cluster", legend_loc="right margin")
... storing 'focus' as categorical
... storing 'fibro' as categorical
/home/ml4ig1/miniconda3/envs/scprint/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
No description has been provided for this image
In [8]:
Copied!
grn_inferer = GNInfer(
                    # here we use the most variable genes across the fibroblasts vs the rest
                    how="most var across",
                    #how="random expr",
                    # we will preprocess the attention matrix with softmax
                    preprocess="softmax",
                    # we don't aggregate the heads here, we will do it manually
                    head_agg='none',
                    # here if we generate the attention matrices by performing a task, like denoising or by just passing the expression profile through the model
                    forward_mode="none",
                    # the number of genes to use (here the 4000 most variable genes)
                    num_genes=4000,
                    # the max number of cell to use per cell type
                    max_cells=300,
                    doplot=False,
                    batch_size=16,
                    # the column in anndata the defines the cell type
                    cell_type_col='fibro',
                    # list of layers to use
                    layer=list(range(model.nlayers))[:]
                    )
grn_inferer = GNInfer( # here we use the most variable genes across the fibroblasts vs the rest how="most var across", #how="random expr", # we will preprocess the attention matrix with softmax preprocess="softmax", # we don't aggregate the heads here, we will do it manually head_agg='none', # here if we generate the attention matrices by performing a task, like denoising or by just passing the expression profile through the model forward_mode="none", # the number of genes to use (here the 4000 most variable genes) num_genes=4000, # the max number of cell to use per cell type max_cells=300, doplot=False, batch_size=16, # the column in anndata the defines the cell type cell_type_col='fibro', # list of layers to use layer=list(range(model.nlayers))[:] )
In [9]:
Copied!
# I was missing this from the model (not really necessary)
model.organisms = ['NCBITaxon:9606', 'NCBITaxon:10090']
prostate_adata.obs.fibro = prostate_adata.obs.fibro.astype(str)
prostate_adata.obs[prostate_adata.obs.fibro=="BPH associated fibroblasts"].disease.value_counts()
# I was missing this from the model (not really necessary) model.organisms = ['NCBITaxon:9606', 'NCBITaxon:10090'] prostate_adata.obs.fibro = prostate_adata.obs.fibro.astype(str) prostate_adata.obs[prostate_adata.obs.fibro=="BPH associated fibroblasts"].disease.value_counts()
In [ ]:
Copied!
# compute GRNs on fibroblasts, we use all the atetetion layers
grn = grn_inferer(model, prostate_adata, cell_type="fibroblasts")
# highlight differential links on genes that are expressed in both 
grn.varp['all'] = grn.varp['GRN'].copy()
# now we aggregate the heads by taking their average
grn.varp['GRN'] = grn.varp['GRN'].mean(-1)
grn.write_h5ad("../../data/temp/prostate_fibro_grn_all.h5ad")
# compute GRNs on fibroblasts, we use all the atetetion layers grn = grn_inferer(model, prostate_adata, cell_type="fibroblasts") # highlight differential links on genes that are expressed in both grn.varp['all'] = grn.varp['GRN'].copy() # now we aggregate the heads by taking their average grn.varp['GRN'] = grn.varp['GRN'].mean(-1) grn.write_h5ad("../../data/temp/prostate_fibro_grn_all.h5ad")
In [11]:
Copied!
#same on the BPH associated fibroblasts
# I wanted to use only the ones that the labellers had defined as coing from BPH
prostate_adata.obs.loc[(prostate_adata.obs.fibro=="BPH associated fibroblasts") & ( prostate_adata.obs.disease=="benign prostatic hyperplasia"), 'fibro'] = "true BPH associated fibroblasts"

grn_c = grn_inferer(model, prostate_adata, cell_type="true cancer associated fibroblasts")
# highlight differential links on genes that are expressed in both
grn_c.varp['all'] = grn_c.varp['GRN'].copy()
grn_c.varp['GRN'] = grn_c.varp['GRN'].mean(-1)
grn_c.write_h5ad("../../data/temp/prostate_BPH_fibro_grn_all.h5ad")
#same on the BPH associated fibroblasts # I wanted to use only the ones that the labellers had defined as coing from BPH prostate_adata.obs.loc[(prostate_adata.obs.fibro=="BPH associated fibroblasts") & ( prostate_adata.obs.disease=="benign prostatic hyperplasia"), 'fibro'] = "true BPH associated fibroblasts" grn_c = grn_inferer(model, prostate_adata, cell_type="true cancer associated fibroblasts") # highlight differential links on genes that are expressed in both grn_c.varp['all'] = grn_c.varp['GRN'].copy() grn_c.varp['GRN'] = grn_c.varp['GRN'].mean(-1) grn_c.write_h5ad("../../data/temp/prostate_BPH_fibro_grn_all.h5ad")
Previous Next

Built with MkDocs using a theme provided by Read the Docs.
« Previous Next »