Documentation for the `tasks` modules

`scprint.tasks.cell_emb`

Classes:

Name	Description
`Embedder`	A class to embed and annotate cells using a model.

Functions:

Name	Description
`compute_classification`	Compute classification metrics for the given annotated data.
`compute_corr`	Compute the correlation between the output and target matrices.
`default_benchmark`	Run the default benchmark for embedding and annotation using the scPRINT model.

`Embedder`

Embedder is a class to embed and annotate cells using a model.

Parameters:

batch_size (int, default: 64 ) –

The size of the batches to be used in the DataLoader. Defaults to 64.
num_workers (int, default: 8 ) –

The number of worker processes to use for data loading. Defaults to 8.
how (str, default: 'random expr' ) –

The method to be used for selecting valid genes. Defaults to "random expr".
max_len (int, default: 2000 ) –

The maximum length of the gene sequence. Defaults to 2000.
add_zero_genes (int, default: 0 ) –

The number of zero genes to add to the gene sequence. Defaults to 0.
precision (str, default: '16-mixed' ) –

The precision to be used in the Trainer. Defaults to "16-mixed".
pred_embedding (List[str], default: ['cell_type_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'sex_ontology_term_id'] ) –

The list of labels to be used for plotting embeddings. Defaults to [ "cell_type_ontology_term_id", "disease_ontology_term_id", "self_reported_ethnicity_ontology_term_id", "sex_ontology_term_id", ].
doclass (bool, default: True ) –

Whether to perform classification. Defaults to True.
doplot (bool, default: True ) –

Whether to generate plots. Defaults to True.
keep_all_cls_pred (bool, default: False ) –

Whether to keep all class predictions. Defaults to False.
dtype (dtype, default: float16 ) –

Data type for computations. Defaults to torch.float16.
output_expression (str, default: 'none' ) –

The method to output expression data. Options are "none", "all", "sample". Defaults to "none".
save_every (int, default: 40000 ) –

The number of cells to save at a time. Defaults to 40_000.
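
As a rough illustration of how the `how` parameter is consumed: in the `__call__` source, `"most var"` is translated to `"some"` before it reaches the collator, and a gene list is forwarded only for `"most var"` and `"some"`. The helper below is a pure-Python restatement of those two conditional expressions. `resolve_gene_selection` is a hypothetical name used for illustration and is not part of scPRINT.

```python
from typing import List, Sequence, Tuple


def resolve_gene_selection(how: str, genelist: Sequence[str]) -> Tuple[str, List[str]]:
    """Hypothetical helper mirroring the conditionals used when building the Collator."""
    # "most var" is resolved up front (highly variable genes), then handed over as "some".
    collator_how = how if how != "most var" else "some"
    # A gene list is only forwarded for the list-based modes.
    collator_genelist = list(genelist) if how in ("most var", "some") else []
    return collator_how, collator_genelist


print(resolve_gene_selection("most var", ["GENE_A", "GENE_B"]))  # ('some', ['GENE_A', 'GENE_B'])
print(resolve_gene_selection("random expr", ["GENE_A"]))         # ('random expr', [])
```

With the default `how="random expr"`, any `genelist` passed to the constructor is therefore ignored by the collator.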

Methods:

Name	Description
`__call__`	Embed and annotate the cells of an `AnnData` object with the given model.

Source code in scprint/tasks/cell_emb.py

def __init__(
    self,
    batch_size: int = 64,
    num_workers: int = 8,
    how: str = "random expr",
    max_len: int = 2000,
    doclass: bool = True,
    add_zero_genes: int = 0,
    precision: str = "16-mixed",
    pred_embedding: List[str] = [
        "cell_type_ontology_term_id",
        "disease_ontology_term_id",
        "self_reported_ethnicity_ontology_term_id",
        "sex_ontology_term_id",
    ],
    doplot: bool = True,
    keep_all_cls_pred: bool = False,
    dtype: torch.dtype = torch.float16,
    output_expression: str = "none",
    genelist: List[str] = [],
    get_gene_emb: bool = False,
    save_every: int = 40_000,
):
    """
    Embedder is a class to embed and annotate cells using a model.

    Args:
        batch_size (int, optional): The size of the batches to be used in the DataLoader. Defaults to 64.
        num_workers (int, optional): The number of worker processes to use for data loading. Defaults to 8.
        how (str, optional): The method to be used for selecting valid genes. Defaults to "random expr".
        max_len (int, optional): The maximum length of the gene sequence. Defaults to 2000.
        add_zero_genes (int, optional): The number of zero genes to add to the gene sequence. Defaults to 0.
        precision (str, optional): The precision to be used in the Trainer. Defaults to "16-mixed".
        pred_embedding (List[str], optional): The list of labels to be used for plotting embeddings. Defaults to [ "cell_type_ontology_term_id", "disease_ontology_term_id", "self_reported_ethnicity_ontology_term_id", "sex_ontology_term_id", ].
        doclass (bool, optional): Whether to perform classification. Defaults to True.
        doplot (bool, optional): Whether to generate plots. Defaults to True.
        keep_all_cls_pred (bool, optional): Whether to keep all class predictions. Defaults to False.
        dtype (torch.dtype, optional): Data type for computations. Defaults to torch.float16.
        output_expression (str, optional): The method to output expression data. Options are "none", "all", "sample". Defaults to "none".
        save_every (int, optional): The number of cells to save at a time. Defaults to 40_000.
    """
    self.batch_size = batch_size
    self.num_workers = num_workers
    self.how = how
    self.max_len = max_len
    self.add_zero_genes = add_zero_genes
    self.pred_embedding = pred_embedding
    self.keep_all_cls_pred = keep_all_cls_pred
    self.precision = precision
    self.doplot = doplot
    self.dtype = dtype
    self.doclass = doclass
    self.output_expression = output_expression
    self.genelist = genelist
    self.get_gene_emb = get_gene_emb
    self.save_every = save_every

`__call__`

Embed and annotate the cells of `adata` using the given model.

Parameters:

model (Module) –

The scPRINT model to be used for embedding and annotation.
adata (AnnData) –

The annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.

Raises:

ValueError –

If the model does not have a logger attribute.
ValueError –

If the model does not have a global_step attribute.

Returns:

AnnData –

The input data annotated with the scPRINT embedding (stored in `obsm["scprint_emb"]`), the predictions, and, when available, the UMAP and Leiden results.
dict –

Accuracy per predicted class, keyed as `<class>_accuracy`. Empty when classification is skipped.

The source below returns the pair `(adata, metrics)`.

Source code in scprint/tasks/cell_emb.py

def __call__(self, model: torch.nn.Module, adata: AnnData, cache=False):
    """
    __call__ function to call the embedding

    Args:
        model (torch.nn.Module): The scPRINT model to be used for embedding and annotation.
        adata (AnnData): The annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.

    Raises:
        ValueError: If the model does not have a logger attribute.
        ValueError: If the model does not have a global_step attribute.

    Returns:
        AnnData: The input data annotated with the embedding, predictions, and clustering results.
        dict: Accuracy per predicted class (empty when classification is skipped).
    """
    # one of "all" "sample" "none"
    model.predict_mode = "none"
    model.keep_all_cls_pred = self.keep_all_cls_pred
    # Add at least the organism you are working with
    if self.how == "most var":
        sc.pp.highly_variable_genes(
            adata, flavor="seurat_v3", n_top_genes=self.max_len
        )
        self.genelist = adata.var.index[adata.var.highly_variable]
    adataset = SimpleAnnDataset(adata, obs_to_output=["organism_ontology_term_id"])
    col = Collator(
        organisms=model.organisms,
        valid_genes=model.genes,
        how=self.how if self.how != "most var" else "some",
        max_len=self.max_len,
        add_zero_genes=self.add_zero_genes,
        genelist=self.genelist if self.how in ["most var", "some"] else [],
    )
    dataloader = DataLoader(
        adataset,
        collate_fn=col,
        batch_size=self.batch_size,
        num_workers=self.num_workers,
        shuffle=False,
    )
    model.eval()
    model.on_predict_epoch_start()
    device = model.device.type
    model.doplot = self.doplot
    with (
        torch.no_grad(),
        torch.autocast(device_type=device, dtype=self.dtype),
    ):
        for batch in tqdm(dataloader):
            gene_pos, expression, depth = (
                batch["genes"].to(device),
                batch["x"].to(device),
                batch["depth"].to(device),
            )
            model._predict(
                gene_pos,
                expression,
                depth,
                predict_mode="none",
                pred_embedding=self.pred_embedding,
                get_gene_emb=self.get_gene_emb,
                max_size_in_mem=self.save_every,
            )
            torch.cuda.empty_cache()
    model.log_adata(name="predict_part_" + str(model.counter))
    try:
        mdir = (
            model.logger.save_dir if model.logger.save_dir is not None else "data"
        )
    except Exception:
        mdir = "data"
    pred_adata = []
    for i in range(model.counter + 1):
        file = (
            mdir
            + "/step_"
            + str(model.global_step)
            + "_"
            + model.name
            + "_predict_part_"
            + str(i)
            + "_"
            + str(model.global_rank)
            + ".h5ad"
        )
        pred_adata.append(sc.read_h5ad(file))
    pred_adata = concat(pred_adata)
    if self.output_expression == "sample":
        adata.layers["sampled"] = (
            utils.zinb_sample(
                torch.from_numpy(pred_adata.layers["scprint_mu"]),
                torch.from_numpy(pred_adata.layers["scprint_theta"]),
                torch.from_numpy(pred_adata.layers["scprint_pi"]),
            )
            .cpu()
            .numpy()
        )
    else:
        pass
    pred_adata.obs.index = adata.obs.index
    try:
        adata.obsm["X_scprint_umap"] = pred_adata.obsm["X_umap"]
    except Exception:
        print("too few cells to embed into a umap")
    try:
        adata.obsm["scprint_leiden"] = pred_adata.obsm["leiden"]
    except Exception:
        print("too few cells to compute a clustering")
    adata.obsm["scprint_emb"] = pred_adata.obsm["scprint_emb"]
    for key, value in pred_adata.uns.items():
        adata.uns[key] = value

    pred_adata.obs.index = adata.obs.index
    adata.obs = pd.concat([adata.obs, pred_adata.obs], axis=1)
    if self.keep_all_cls_pred:
        allclspred = model.pred
        columns = []
        for cl in model.classes:
            n = model.label_counts[cl]
            columns += [model.label_decoders[cl][i] for i in range(n)]
        allclspred = pd.DataFrame(
            allclspred, columns=columns, index=adata.obs.index
        )
        # pd.concat expects a list of objects; join column-wise as done above
        adata.obs = pd.concat([adata.obs, allclspred], axis=1)

    metrics = {}
    if self.doclass and not self.keep_all_cls_pred:
        for cl in model.classes:
            res = []
            if cl not in adata.obs.columns:
                continue
            class_topred = model.label_decoders[cl].values()

            if cl in model.labels_hierarchy:
                # class_groupings = {
                #    k: [
                #        i.ontology_id
                #        for i in bt.CellType.filter(k).first().children.all()
                #    ]
                #    for k in set(adata.obs[cl].unique()) - set(class_topred)
                # }
                cur_labels_hierarchy = {
                    model.label_decoders[cl][k]: [
                        model.label_decoders[cl][i] for i in v
                    ]
                    for k, v in model.labels_hierarchy[cl].items()
                }
            else:
                cur_labels_hierarchy = {}

            for pred, true in adata.obs[["pred_" + cl, cl]].values:
                if pred == true:
                    res.append(True)
                    continue
                if len(cur_labels_hierarchy) > 0:
                    if true in cur_labels_hierarchy:
                        res.append(pred in cur_labels_hierarchy[true])
                        continue
                    elif true not in class_topred:
                        raise ValueError(
                            f"true label {true} not in available classes"
                        )
                    elif true != "unknown":
                        res.append(False)
                elif true not in class_topred:
                    raise ValueError(f"true label {true} not in available classes")
                elif true != "unknown":
                    res.append(False)
                # else true is unknown
                # else we pass
            if len(res) == 0:
                # true was always unknown
                res = [1]
            if self.doplot:
                print("    ", cl)
                print("     accuracy:", sum(res) / len(res))
                print(" ")
            metrics.update({cl + "_accuracy": sum(res) / len(res)})
    return adata, metrics
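
The accuracy loop at the end of `__call__` can be restated in plain Python. The sketch below assumes labels are plain strings and that the hierarchy maps a coarse label to the set of finer labels accepted for it. `hierarchical_accuracy` and the example labels are hypothetical and only illustrate the scoring rule: an exact match is correct, a listed child of a coarse true label is correct, `"unknown"` true labels are skipped, and a true label outside the known classes raises a `ValueError`.

```python
from typing import Dict, Iterable, List, Sequence, Set


def hierarchical_accuracy(
    preds: Sequence[str],
    trues: Sequence[str],
    hierarchy: Dict[str, Set[str]],
    known_labels: Iterable[str],
) -> float:
    """Accuracy where predicting a child of a coarse true label counts as correct."""
    known = set(known_labels)
    res: List[bool] = []
    for pred, true in zip(preds, trues):
        if pred == true:
            res.append(True)
        elif true in hierarchy:
            # Coarse true label: accept any of its listed children.
            res.append(pred in hierarchy[true])
        elif true not in known:
            raise ValueError(f"true label {true} not in available classes")
        elif true != "unknown":
            res.append(False)
        # A mismatched true label of "unknown" is skipped entirely.
    if not res:
        return 1.0  # every true label was "unknown"
    return sum(res) / len(res)


hierarchy = {"T cell": {"CD4 T cell", "CD8 T cell"}}
known = ["T cell", "CD4 T cell", "CD8 T cell", "B cell", "unknown"]
preds = ["CD4 T cell", "B cell", "B cell", "B cell"]
trues = ["T cell", "B cell", "CD8 T cell", "unknown"]
# Two of the three scored cells are correct; the "unknown" cell is not scored.
print(hierarchical_accuracy(preds, trues, hierarchy, known))
```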

`compute_classification`

Compute classification metrics for the given annotated data.

Parameters:

adata (AnnData) –

The annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.
classes (List[str]) –

List of class labels to be used for classification.
label_decoders (Dict[str, Any]) –

Dictionary of label decoders for each class.
labels_hierarchy (Dict[str, Any]) –

Dictionary representing the hierarchy of labels.
metric_type (List[str], default: ['macro', 'micro', 'weighted'] ) –

List of metric types to compute. Defaults to ["macro", "micro", "weighted"].

Returns:	`Dict[str, Dict[str, float]]` – A dictionary containing classification metrics for each class.

Source code in scprint/tasks/cell_emb.py

def compute_classification(
    adata: AnnData,
    classes: List[str],
    label_decoders: Dict[str, Any],
    labels_hierarchy: Dict[str, Any],
    metric_type: List[str] = ["macro", "micro", "weighted"],
) -> Dict[str, Dict[str, float]]:
    """
    Compute classification metrics for the given annotated data.

    Args:
        adata (AnnData): The annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.
        classes (List[str]): List of class labels to be used for classification.
        label_decoders (Dict[str, Any]): Dictionary of label decoders for each class.
        labels_hierarchy (Dict[str, Any]): Dictionary representing the hierarchy of labels.
        metric_type (List[str], optional): List of metric types to compute. Defaults to ["macro", "micro", "weighted"].

    Returns:
        Dict[str, Dict[str, float]]: A dictionary containing classification metrics for each class.
    """
    metrics = {}
    for label in classes:
        res = []
        if label not in adata.obs.columns:
            continue
        labels_topred = label_decoders[label].values()
        if label in labels_hierarchy:
            parentdf = (
                bt.CellType.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")[["parents__ontology_id"]]
            )
            parentdf.parents__ontology_id = parentdf.parents__ontology_id.astype(str)
            class_groupings = {
                k: get_descendants(k, parentdf) for k in set(adata.obs[label].unique())
            }
        for pred, true in adata.obs[["pred_" + label, label]].values:
            if pred == true:
                res.append(true)
                continue
            if label in labels_hierarchy:
                if true in class_groupings:
                    res.append(true if pred in class_groupings[true] else "")
                    continue
                elif true not in labels_topred:
                    raise ValueError(f"true label {true} not in available classes")
            elif true not in labels_topred:
                raise ValueError(f"true label {true} not in available classes")
            res.append("")
        metrics[label] = {}
        metrics[label]["accuracy"] = np.mean(np.array(res) == adata.obs[label].values)
        for x in metric_type:
            metrics[label][x] = f1_score(
                np.array(res), adata.obs[label].values, average=x
            )
    return metrics
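
Each entry of `metric_type` is passed as the `average` argument of `f1_score`. As a simplified illustration of what the three default modes mean for single-label data, the sketch below computes them in plain Python. It is not scikit-learn's implementation, and `f1_averages` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_averages(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Simplified macro / micro / weighted F1 for single-label multiclass data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was missed

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true instances per class
    return {
        # macro: unweighted mean of the per-class scores
        "macro": sum(per_class.values()) / len(labels),
        # micro: one score from the pooled counts
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # weighted: per-class scores weighted by class support
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


print(f1_averages(["a", "a", "b", "c"], ["a", "b", "b", "c"]))
```

Macro averaging treats rare and frequent cell types equally, which is why it tends to be the strictest of the three on imbalanced annotations.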

`compute_corr`

Compute the correlation between the output and target matrices.

Parameters:

out (ndarray) –

The output matrix.
to (ndarray) –

The target matrix.
doplot (bool, default: True ) –

Whether to generate a plot of the correlation coefficients. Defaults to True.
compute_mean_regress (bool, default: False ) –

Whether to compute mean regression. Defaults to False.
plot_corr_size (int, default: 64 ) –

The size of the plot for correlation. Defaults to 64.

Returns:	`dict` – A dictionary containing the computed metrics.

Source code in scprint/tasks/cell_emb.py

def compute_corr(
    out: np.ndarray,
    to: np.ndarray,
    doplot: bool = True,
    compute_mean_regress: bool = False,
    plot_corr_size: int = 64,
) -> dict:
    """
    Compute the correlation between the output and target matrices.

    Args:
        out (np.ndarray): The output matrix.
        to (np.ndarray): The target matrix.
        doplot (bool, optional): Whether to generate a plot of the correlation coefficients. Defaults to True.
        compute_mean_regress (bool, optional): Whether to compute mean regression. Defaults to False.
        plot_corr_size (int, optional): The size of the plot for correlation. Defaults to 64.

    Returns:
        dict: A dictionary containing the computed metrics.
    """
    metrics = {}
    corr_coef, p_value = spearmanr(
        out,
        to.T,
    )
    corr_coef[p_value > 0.05] = 0
    # corr_coef[]
    # only on non zero values,
    # compare a1-b1 corr with a1-b(n) corr. should be higher

    # Plot correlation coefficient
    val = plot_corr_size + 2 if compute_mean_regress else plot_corr_size
    metrics.update(
        {"recons_corr": np.mean(corr_coef[val:, :plot_corr_size].diagonal())}
    )
    if compute_mean_regress:
        metrics.update(
            {
                "mean_regress": np.mean(
                    corr_coef[
                        plot_corr_size : plot_corr_size + 2,
                        :plot_corr_size,
                    ].flatten()
                )
            }
        )
    if doplot:
        plt.figure(figsize=(10, 5))
        plt.imshow(corr_coef, cmap="coolwarm", interpolation="none", vmin=-1, vmax=1)
        plt.colorbar()
        plt.title('Correlation Coefficient of expr and i["x"]')
        plt.show()
    return metrics
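
`compute_corr` relies on `spearmanr` for rank correlation. For intuition only, the sketch below computes the Spearman coefficient of two vectors in plain Python as the Pearson correlation of their average ranks. It does not compute p-values and does not reproduce the matrix form used in the source. The names `average_ranks`, `pearson`, and `spearman` are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


print(average_ranks([5, 1, 5]))  # [2.5, 1.0, 2.5]
print(spearman([1, 2, 3, 4], [10, 100, 1000, 10000]))
```

Because only ranks matter, any strictly increasing transformation of the expression values leaves the coefficient unchanged.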

`default_benchmark`

Run the default benchmark for embedding and annotation using the scPRINT model.

Parameters:

model (Module) –

The scPRINT model to be used for embedding and annotation.
default_dataset (str, default: 'pancreas' ) –

The default dataset to use for benchmarking. Options are "pancreas", "lung", or a path to a dataset. Defaults to "pancreas".
do_class (bool, default: True ) –

Whether to perform classification. Defaults to True.
coarse (bool, default: False ) –

Whether to use coarse cell type annotations. Defaults to False.

Returns:	`dict` – A dictionary containing the benchmark metrics.

Source code in scprint/tasks/cell_emb.py

def default_benchmark(
    model: torch.nn.Module,
    default_dataset: str = "pancreas",
    do_class: bool = True,
    coarse: bool = False,
) -> dict:
    """
    Run the default benchmark for embedding and annotation using the scPRINT model.

    Args:
        model (torch.nn.Module): The scPRINT model to be used for embedding and annotation.
        default_dataset (str, optional): The default dataset to use for benchmarking. Options are "pancreas", "lung", or a path to a dataset. Defaults to "pancreas".
        do_class (bool, optional): Whether to perform classification. Defaults to True.
        coarse (bool, optional): Whether to use coarse cell type annotations. Defaults to False.

    Returns:
        dict: A dictionary containing the benchmark metrics.
    """
    if default_dataset == "pancreas":
        adata = sc.read(
            FILE_LOC + "/../../data/pancreas_atlas.h5ad",
            backup_url="https://figshare.com/ndownloader/files/24539828",
        )
        adata.obs["cell_type_ontology_term_id"] = adata.obs["celltype"].replace(
            COARSE if coarse else FINE
        )
        adata.obs["assay_ontology_term_id"] = adata.obs["tech"].replace(
            COARSE if coarse else FINE
        )
    elif default_dataset == "lung":
        adata = sc.read(
            FILE_LOC + "/../../data/lung_atlas.h5ad",
            backup_url="https://figshare.com/ndownloader/files/24539942",
        )
        adata.obs["cell_type_ontology_term_id"] = adata.obs["cell_type"].replace(
            COARSE if coarse else FINE
        )
    else:
        adata = sc.read_h5ad(default_dataset)
        adata.obs["batch"] = adata.obs["assay_ontology_term_id"]
        adata.obs["cell_type"] = adata.obs["cell_type_ontology_term_id"]
    preprocessor = Preprocessor(
        use_layer="counts",
        is_symbol=True,
        force_preprocess=True,
        skip_validate=True,
        do_postp=False,
    )
    adata.obs["organism_ontology_term_id"] = "NCBITaxon:9606"
    adata = preprocessor(adata.copy())
    embedder = Embedder(
        pred_embedding=["cell_type_ontology_term_id"] if do_class else [],
        doclass=(default_dataset not in ["pancreas", "lung"]) and do_class,
        max_len=4000,
        keep_all_cls_pred=False,
        output_expression="none",
    )
    embed_adata, metrics = embedder(model, adata.copy())

    bm = Benchmarker(
        embed_adata,
        batch_key="tech" if default_dataset == "pancreas" else "batch",
        label_key="celltype" if default_dataset == "pancreas" else "cell_type",
        embedding_obsm_keys=["scprint"],
        n_jobs=6,
    )
    bm.benchmark()
    metrics.update({"scib": bm.get_results(min_max_scale=False).T.to_dict()["scprint"]})
    metrics["classif"] = compute_classification(
        embed_adata, model.classes, model.label_decoders, model.labels_hierarchy
    )
    return metrics
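
The dataset-dependent choices made in `default_benchmark` can be summarized in a small lookup. The sketch below restates the conditional expressions from the source for the `Embedder` and `Benchmarker` arguments. `benchmark_settings` is a hypothetical helper and not part of scPRINT.

```python
from typing import Any, Dict


def benchmark_settings(default_dataset: str, do_class: bool = True) -> Dict[str, Any]:
    """Hypothetical summary of the dataset-dependent arguments used by the benchmark."""
    builtin = default_dataset in ("pancreas", "lung")
    return {
        # obs columns handed to the scib Benchmarker
        "batch_key": "tech" if default_dataset == "pancreas" else "batch",
        "label_key": "celltype" if default_dataset == "pancreas" else "cell_type",
        # the Embedder's own accuracy report is only enabled for user-supplied datasets
        "doclass": (not builtin) and do_class,
        "pred_embedding": ["cell_type_ontology_term_id"] if do_class else [],
    }


print(benchmark_settings("pancreas"))
print(benchmark_settings("my_dataset.h5ad"))
```

For the two built-in datasets, classification quality is reported only through the `classif` entry produced by `compute_classification`.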

`scprint.tasks.grn`

Classes:

Name	Description
`GNInfer`	A class to infer gene regulatory networks from a dataset using a scPRINT model.

Functions:

Name	Description
`default_benchmark`	Run the default scPRINT GRN benchmark.

`GNInfer`

GNInfer is a class to infer gene regulatory networks from a dataset using a scPRINT model.

Parameters:

layer (Optional[list[int]], default: None ) –

List of layers to use for the inference. Defaults to None.
batch_size (int, default: 64 ) –

Batch size for processing. Defaults to 64.
num_workers (int, default: 8 ) –

Number of workers for data loading. Defaults to 8.
drop_unexpressed (bool, default: False ) –

Whether to drop unexpressed genes. Defaults to False.
num_genes (int, default: 3000 ) –

Number of genes to consider. Defaults to 3000.
precision (str, default: '16-mixed' ) –

Precision type for computations. Defaults to "16-mixed".
cell_type_col (str, default: 'cell_type' ) –

Column name for cell type information. Defaults to "cell_type".
how (str, default: 'random expr' ) –

Method to select genes. Options are "random expr", "most var within", "most var across", "given", "most expr". Defaults to "random expr".
preprocess (str, default: 'softmax' ) –

Preprocessing method. Options are "softmax", "sinkhorn", "none". Defaults to "softmax".
head_agg (str, default: 'mean' ) –

Aggregation method for heads. Options are "mean", "sum", "none". Defaults to "mean".
filtration (str, default: 'thresh' ) –

Filtration method for the adjacency matrix. Options are "thresh", "top-k", "mst", "known", "none". Defaults to "thresh".
k (int, default: 10 ) –

Number of top connections to keep if filtration is "top-k". Defaults to 10.
apc (bool, default: False ) –

Whether to apply Average Product Correction. Defaults to False.
known_grn (optional, default: None ) –

Known gene regulatory network to use as a reference. Defaults to None.
symmetrize (bool, default: False ) –

Whether to symmetrize the adjacency matrix. Defaults to False.
doplot (bool, default: True ) –

Whether to generate plots. Defaults to True.
max_cells (int, default: 0 ) –

Maximum number of cells to consider. Defaults to 0.
forward_mode (str, default: 'none' ) –

Mode for forward pass. Defaults to "none".
genes (list, default: [] ) –

List of genes to consider. Defaults to an empty list.
loc (str, default: './' ) –

Location to save results. Defaults to "./".
dtype (dtype, default: float16 ) –

Data type for computations. Defaults to torch.float16.
locname (str, default: '' ) –

Name for the location. Defaults to an empty string.
add_emb_in_model (bool, default: False ) –

Whether to add cell embeddings in the grn. Defaults to False.

Methods:

Name	Description
`__call__`	Run gene regulatory network inference on the given data.

Source code in scprint/tasks/grn.py

def __init__(
    self,
    layer: Optional[List[int]] = None,
    batch_size: int = 64,
    num_workers: int = 8,
    drop_unexpressed: bool = False,
    num_genes: int = 3000,
    precision: str = "16-mixed",
    cell_type_col: str = "cell_type",
    how: str = "random expr",  # random expr, most var within, most var across, given
    max_len: int = 3000,
    preprocess: str = "softmax",  # sinkhorn, softmax, none
    head_agg: str = "mean",  # mean, sum, none
    filtration: str = "thresh",  # thresh, top-k, mst, known, none
    k: int = 10,
    apc: bool = False,
    known_grn: Optional[Any] = None,
    symmetrize: bool = False,
    doplot: bool = True,
    comp_attn: bool = True,
    max_cells: int = 0,
    forward_mode: str = "none",
    genes: List[str] = [],
    loc: str = "./",
    dtype: torch.dtype = torch.float16,
    locname: str = "",
    add_emb_in_model: bool = False,
):
    """
    GNInfer is a class to infer gene regulatory networks from a dataset using a scPRINT model.

    Args:
        layer (Optional[list[int]], optional): List of layers to use for the inference. Defaults to None.
        batch_size (int, optional): Batch size for processing. Defaults to 64.
        num_workers (int, optional): Number of workers for data loading. Defaults to 8.
        drop_unexpressed (bool, optional): Whether to drop unexpressed genes. Defaults to False.
        num_genes (int, optional): Number of genes to consider. Defaults to 3000.
        precision (str, optional): Precision type for computations. Defaults to "16-mixed".
        cell_type_col (str, optional): Column name for cell type information. Defaults to "cell_type".
        how (str, optional): Method to select genes. Options are "random expr", "most var within", "most var across", "given", "most expr". Defaults to "random expr".
        preprocess (str, optional): Preprocessing method. Options are "softmax", "sinkhorn", "none". Defaults to "softmax".
        head_agg (str, optional): Aggregation method for heads. Options are "mean", "sum", "none". Defaults to "mean".
        filtration (str, optional): Filtration method for the adjacency matrix. Options are "thresh", "top-k", "mst", "known", "none". Defaults to "thresh".
        k (int, optional): Number of top connections to keep if filtration is "top-k". Defaults to 10.
        apc (bool, optional): Whether to apply Average Product Correction. Defaults to False.
        known_grn (optional): Known gene regulatory network to use as a reference. Defaults to None.
        symmetrize (bool, optional): Whether to symmetrize the adjacency matrix. Defaults to False.
        doplot (bool, optional): Whether to generate plots. Defaults to True.
        max_cells (int, optional): Maximum number of cells to consider. Defaults to 0.
        forward_mode (str, optional): Mode for forward pass. Defaults to "none".
        genes (list, optional): List of genes to consider. Defaults to an empty list.
        loc (str, optional): Location to save results. Defaults to "./".
        dtype (torch.dtype, optional): Data type for computations. Defaults to torch.float16.
        locname (str, optional): Name for the location. Defaults to an empty string.
        add_emb_in_model (bool, optional): Whether to add cell embeddings in the grn. Defaults to False.

    """
    self.batch_size = batch_size
    self.num_workers = num_workers
    self.layer = layer
    self.locname = locname
    self.how = how
    assert (
        self.how
        in [
            "most var within",
            "most var across",
            "random expr",
            "given",
            "most expr",
        ]
    ), "how must be one of 'most var within', 'most var across', 'random expr', 'given', 'most expr'"
    self.num_genes = num_genes
    self.preprocess = preprocess
    self.cell_type_col = cell_type_col
    self.filtration = filtration
    self.doplot = doplot
    self.genes = genes
    self.apc = apc
    self.dtype = dtype
    self.forward_mode = forward_mode
    self.k = k
    self.max_len = max_len
    self.symmetrize = symmetrize
    self.known_grn = known_grn
    self.head_agg = head_agg
    self.max_cells = max_cells
    self.curr_genes = None
    self.drop_unexpressed = drop_unexpressed
    self.precision = precision
    self.comp_attn = comp_attn
    self.add_emb_in_model = add_emb_in_model
    if self.filtration != "none" and self.head_agg == "none":
        raise ValueError("filtration must be 'none' when head_agg is 'none'")

`__call__`

Run gene regulatory network inference with the given model and data.

Parameters:

model (Module) –

The model to be used for generating the network.
adata (AnnData) –

Annotated data matrix of shape n_obs × n_vars, where n_obs is the number of cells and n_vars is the number of genes.
cell_type (str, default: None ) –

Specific cell type to filter the data. Defaults to None.

Returns:

AnnData –

Annotated data matrix with predictions and annotations.
np.ndarray –

Filtered adjacency matrix.

Source code in scprint/tasks/grn.py

def __call__(self, model: torch.nn.Module, adata: AnnData, cell_type=None):
    """
    __call__ runs the method

    Args:
        model (torch.nn.Module): The model to be used for generating the network
        adata (AnnData): Annotated data matrix of shape `n_obs` × `n_vars`. `n_obs` is the number of cells and `n_vars` is the number of genes.
        cell_type (str, optional): Specific cell type to filter the data. Defaults to None.

    Returns:
        AnnData: Annotated data matrix with predictions and annotations.
        np.ndarray: Filtered adjacency matrix.
    """
    # Add at least the organism you are working with
    if self.layer is None:
        self.layer = list(range(model.nlayers))
    self.n_cell_embs = model.attn.additional_tokens
    subadata = self.predict(model, adata, self.layer, cell_type)
    adjacencies = self.aggregate(model.attn.get(), model.genes)
    if self.head_agg == "none":
        return self.save(
            adjacencies[self.n_cell_embs :, self.n_cell_embs :, :],
            subadata,
        )
    else:
        return self.save(
            self.filter(adjacencies)[self.n_cell_embs :, self.n_cell_embs :],
            subadata,
        )

`default_benchmark`

Run the default scPRINT GRN benchmark.

Parameters:

model (Any) –

The scPRINT model to be used for the benchmark.
default_dataset (str, default: 'sroy' ) –

The default dataset to use for benchmarking. Defaults to "sroy".
cell_types (List[str], default: [] ) –

List of cell types to include in the benchmark. Defaults to [].
maxlayers (int, default: 16 ) –

Maximum number of layers to use from the model. Defaults to 16.
maxgenes (int, default: 5000 ) –

Maximum number of genes to consider. Defaults to 5000.
batch_size (int, default: 32 ) –

Batch size for processing. Defaults to 32.
maxcells (int, default: 1024 ) –

Maximum number of cells to consider. Defaults to 1024.

Returns:	`dict` – A dictionary containing the benchmark metrics.
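
The `maxlayers` parameter keeps only the last layers of the model: the source slices `list(range(model.nlayers))` from `max(0, model.nlayers - maxlayers)` onward. A minimal sketch of that selection follows. `select_last_layers` is a hypothetical name, and the layer counts are illustrative.

```python
from typing import List


def select_last_layers(nlayers: int, maxlayers: int = 16) -> List[int]:
    """Indices of the last `maxlayers` layers, or all layers if the model has fewer."""
    return list(range(nlayers))[max(0, nlayers - maxlayers):]


print(select_last_layers(4, 2))   # [2, 3]
print(select_last_layers(12))     # all 12 layers, since 12 <= 16
```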

Source code in scprint/tasks/grn.py

def default_benchmark(
    model: Any,
    default_dataset: str = "sroy",
    cell_types: List[str] = [],
    maxlayers: int = 16,
    maxgenes: int = 5000,
    batch_size: int = 32,
    maxcells: int = 1024,
):
    """
    default_benchmark function to run the default scPRINT GRN benchmark

    Args:
        model (Any): The scPRINT model to be used for the benchmark.
        default_dataset (str, optional): The default dataset to use for benchmarking. Defaults to "sroy".
        cell_types (List[str], optional): List of cell types to include in the benchmark. Defaults to [].
        maxlayers (int, optional): Maximum number of layers to use from the model. Defaults to 16.
        maxgenes (int, optional): Maximum number of genes to consider. Defaults to 5000.
        batch_size (int, optional): Batch size for processing. Defaults to 32.
        maxcells (int, optional): Maximum number of cells to consider. Defaults to 1024.

    Returns:
        dict: A dictionary containing the benchmark metrics.
    """
    metrics = {}
    layers = list(range(model.nlayers))[max(0, model.nlayers - maxlayers) :]
    clf_omni = None
    if default_dataset == "sroy":
        preprocessor = Preprocessor(
            is_symbol=True,
            force_preprocess=True,
            skip_validate=True,
            do_postp=False,
            min_valid_genes_id=5000,
            min_dataset_size=64,
        )
        clf_self = None
        todo = [
            ("han", "human", "full"),
            ("mine", "human", "full"),
            ("han", "human", "chip"),
            ("han", "human", "ko"),
            ("tran", "mouse", "full"),
            ("zhao", "mouse", "full"),
            ("tran", "mouse", "chip"),
            ("tran", "mouse", "ko"),
        ]
        for da, spe, gt in todo:
            if gt != "full":
                continue
            if "NCBITaxon:10090" not in model.organisms and spe == "mouse":
                continue
            print(da + "_" + gt)
            preadata = get_sroy_gt(get=da, species=spe, gt=gt)
            adata = preprocessor(preadata.copy())
            grn_inferer = GNInfer(
                layer=layers,
                how="most var within",
                preprocess="softmax",
                head_agg="none",
                filtration="none",
                forward_mode="none",
                num_genes=maxgenes,
                num_workers=8,
                max_cells=maxcells,
                doplot=False,
                batch_size=batch_size,
            )
            grn = grn_inferer(model, adata)
            grn.varp["all"] = grn.varp["GRN"]
            grn.var["ensembl_id"] = grn.var.index
            grn.var["symbol"] = make_index_unique(grn.var["symbol"].astype(str))
            grn.var.index = grn.var["symbol"]
            grn.varp["GRN"] = grn.varp["all"].mean(-1).T
            metrics["mean_" + da + "_" + gt] = BenGRN(
                grn, do_auc=True, doplot=False
            ).compare_to(other=preadata)
            grn.varp["GRN"] = grn.varp["GRN"].T
            if spe == "human":
                metrics["mean_" + da + "_" + gt + "_base"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).scprint_benchmark()

            ## OMNI
            if clf_omni is None:
                grn.varp["GRN"] = grn.varp["all"]
                _, m, clf_omni = train_classifier(
                    grn,
                    C=1,
                    train_size=0.9,
                    class_weight={1: 800, 0: 1},
                    shuffle=True,
                    return_full=False,
                )
                joblib.dump(clf_omni, "clf_omni.pkl")
                metrics["omni_classifier"] = m
            coef = clf_omni.coef_[0] if clf_omni.coef_.shape[0] == 1 else clf_omni.coef_
            grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1)
            if spe == "human":
                metrics["omni_" + da + "_" + gt + "_base"] = BenGRN(
                    grn, do_auc=True, doplot=True
                ).scprint_benchmark()
            grn.varp["GRN"] = grn.varp["GRN"].T
            metrics["omni_" + da + "_" + gt] = BenGRN(
                grn, do_auc=True, doplot=False
            ).compare_to(other=preadata)

            ## SELF
            if clf_self is None:
                grn.varp["GRN"] = np.transpose(grn.varp["all"], (1, 0, 2))
                _, m, clf_self = train_classifier(
                    grn,
                    other=preadata,
                    C=1,
                    train_size=0.5,
                    class_weight={1: 40, 0: 1},
                    shuffle=False,
                    return_full=False,
                )
                metrics["self_classifier"] = m
            coef = clf_self.coef_[0] if clf_self.coef_.shape[0] == 1 else clf_self.coef_
            grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
            metrics["self_" + da + "_" + gt] = BenGRN(
                grn, do_auc=True, doplot=False
            ).compare_to(other=preadata)
            if spe == "human":
                grn.varp["GRN"] = grn.varp["GRN"].T
                metrics["self_" + da + "_" + gt + "_base"] = BenGRN(
                    grn, do_auc=True, doplot=True
                ).scprint_benchmark()

            ## chip / ko
            if (da, spe, "chip") in todo:
                preadata = get_sroy_gt(get=da, species=spe, gt="chip")
                grn.varp["GRN"] = grn.varp["all"].mean(-1).T
                metrics["mean_" + da + "_" + "chip"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).compare_to(other=preadata)
                grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
                metrics["omni_" + da + "_" + "chip"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).compare_to(other=preadata)

                grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
                metrics["self_" + da + "_" + "chip"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).compare_to(other=preadata)
            if (da, spe, "ko") in todo:
                preadata = get_sroy_gt(get=da, species=spe, gt="ko")
                grn.varp["GRN"] = grn.varp["all"].mean(-1).T
                metrics["mean_" + da + "_" + "ko"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).compare_to(other=preadata)
                grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
                metrics["omni_" + da + "_" + "ko"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).compare_to(other=preadata)
                grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
                metrics["self_" + da + "_" + "ko"] = BenGRN(
                    grn, do_auc=True, doplot=False
                ).compare_to(other=preadata)
            del grn
    elif default_dataset == "gwps":
        if not os.path.exists(FILEDIR + "/../../data/perturb_gt.h5ad"):
            adata = get_perturb_gt()
            adata.write_h5ad(FILEDIR + "/../../data/perturb_gt.h5ad")
        else:
            adata = read_h5ad(FILEDIR + "/../../data/perturb_gt.h5ad")
        preprocessor = Preprocessor(
            force_preprocess=True,
            skip_validate=True,
            do_postp=False,
            min_valid_genes_id=maxgenes,
            min_dataset_size=64,
        )
        nadata = preprocessor(adata.copy())
        adata.var["isTF"] = False
        adata.var.loc[adata.var.gene_name.isin(grnutils.TF), "isTF"] = True
        adata.var["isTF"].sum()
        grn_inferer = GNInfer(
            layer=layers,
            how="most var within",
            preprocess="softmax",
            head_agg="none",
            filtration="none",
            forward_mode="none",
            num_genes=maxgenes,
            max_cells=maxcells,
            doplot=False,
            num_workers=8,
            batch_size=batch_size,
        )
        grn = grn_inferer(model, nadata)
        grn.varp["all"] = grn.varp["GRN"]

        grn.varp["GRN"] = grn.varp["all"].mean(-1).T
        metrics["mean"] = BenGRN(grn, do_auc=True, doplot=False).compare_to(other=adata)
        grn.var["ensembl_id"] = grn.var.index
        grn.var.index = grn.var["symbol"]
        grn.varp["GRN"] = grn.varp["all"].mean(-1)
        metrics["mean_base"] = BenGRN(
            grn, do_auc=True, doplot=False
        ).scprint_benchmark()

        grn.varp["GRN"] = grn.varp["all"]
        grn.var.index = grn.var["ensembl_id"]
        _, m, clf_omni = train_classifier(
            grn,
            C=1,
            train_size=0.9,
            class_weight={1: 800, 0: 1},
            shuffle=True,
            doplot=False,
            return_full=False,
            use_col="gene_name",
        )
        coef = clf_omni.coef_[0] if clf_omni.coef_.shape[0] == 1 else clf_omni.coef_
        grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
        metrics["omni"] = BenGRN(grn, do_auc=True, doplot=False).compare_to(other=adata)
        metrics["omni_classifier"] = m
        grn.var.index = grn.var["symbol"]
        grn.varp["GRN"] = grn.varp["GRN"].T
        metrics["omni_base"] = BenGRN(
            grn, do_auc=True, doplot=False
        ).scprint_benchmark()
        grn.varp["GRN"] = np.transpose(grn.varp["all"], (1, 0, 2))
        grn.var.index = grn.var["ensembl_id"]
        _, m, clf_self = train_classifier(
            grn,
            other=adata,
            C=1,
            train_size=0.5,
            class_weight={1: 40, 0: 1},
            doplot=False,
            shuffle=False,
            return_full=False,
            use_col="ensembl_id",
        )
        coef = clf_self.coef_[0] if clf_self.coef_.shape[0] == 1 else clf_self.coef_
        grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1).T
        metrics["self"] = BenGRN(grn, do_auc=True, doplot=False).compare_to(other=adata)
        metrics["self_classifier"] = m
        grn.var.index = grn.var["symbol"]
        grn.varp["GRN"] = grn.varp["GRN"].T
        metrics["self_base"] = BenGRN(
            grn, do_auc=True, doplot=False
        ).scprint_benchmark()
    else:
        # max_genes=4000
        adata = sc.read_h5ad(default_dataset)
        adata.var["isTF"] = False
        adata.var.loc[adata.var.symbol.isin(grnutils.TF), "isTF"] = True
        for celltype in cell_types:
            # print(celltype)
            # grn_inferer = GNInfer(
            #    layer=layers,
            #    how="random expr",
            #    preprocess="softmax",
            #    head_agg="max",
            #    filtration="none",
            #    forward_mode="none",
            #    num_workers=8,
            #    num_genes=2200,
            #    max_cells=maxcells,
            #    doplot=False,
            #    batch_size=batch_size,
            # )
            #
            # grn = grn_inferer(model, adata[adata.X.sum(1) > 500], cell_type=celltype)
            # grn.var.index = make_index_unique(grn.var["symbol"].astype(str))
            # metrics[celltype + "_scprint"] = BenGRN(
            #    grn, doplot=False
            # ).scprint_benchmark()
            # del grn
            # gc.collect()
            grn_inferer = GNInfer(
                layer=layers,
                how="most var across",
                preprocess="softmax",
                head_agg="none",
                filtration="none",
                forward_mode="none",
                num_workers=8,
                num_genes=maxgenes,
                max_cells=maxcells,
                doplot=False,
                batch_size=batch_size,
            )
            grn = grn_inferer(model, adata[adata.X.sum(1) > 500], cell_type=celltype)
            grn.var.index = make_index_unique(grn.var["symbol"].astype(str))
            grn.varp["all"] = grn.varp["GRN"]
            grn.varp["GRN"] = grn.varp["GRN"].mean(-1)
            metrics[celltype + "_scprint_mean"] = BenGRN(
                grn, doplot=False
            ).scprint_benchmark()
            if clf_omni is None:
                grn.varp["GRN"] = grn.varp["all"]
                _, m, clf_omni = train_classifier(
                    grn,
                    C=1,
                    train_size=0.6,
                    max_iter=300,
                    class_weight={1: 800, 0: 1},
                    return_full=False,
                    shuffle=True,
                    doplot=False,
                )
                joblib.dump(clf_omni, "clf_omni.pkl")
                metrics["classifier"] = m
            coef = clf_omni.coef_[0] if clf_omni.coef_.shape[0] == 1 else clf_omni.coef_
            grn.varp["GRN"] = grn.varp["all"][:, :, coef > 0].mean(-1)
            metrics[celltype + "_scprint_class"] = BenGRN(
                grn, doplot=False
            ).scprint_benchmark()
            del grn
            gc.collect()
    return metrics
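The benchmark repeatedly reduces the per-head attention tensor stored in `grn.varp["all"]` to a single gene-by-gene matrix, either by averaging over all heads or by averaging only the heads whose classifier coefficient is positive. A small NumPy sketch of that reduction, assuming a random stand-in tensor and made-up coefficients (neither comes from a real model or classifier):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_heads = 4, 6

# Stand-in for grn.varp["all"]: one gene x gene attention matrix per head.
all_heads = rng.random((n_genes, n_genes, n_heads))

# Hypothetical classifier coefficients, one per head.
coef = np.array([0.5, -0.2, 0.0, 1.3, -0.7, 0.1])

# "mean" variant: average over every head.
grn_mean = all_heads.mean(-1)

# "omni" / "self" variant: keep only heads with a positive coefficient.
selected = coef > 0
grn_selected = all_heads[:, :, selected].mean(-1)

print(selected.sum(), grn_mean.shape, grn_selected.shape)
```

Both results have shape `(n_genes, n_genes)`; the `.T` applied in several places of the source only swaps the row and column roles before comparison with a ground truth.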

`scprint.tasks.denoise`

Classes:

Name	Description
`Denoiser`

Functions:

Name	Description
`default_benchmark`	default_benchmark function used to run the default denoising benchmark of scPRINT
`split_molecules`	Splits molecules into two (potentially overlapping) groups.

`Denoiser`

Denoiser class for denoising scRNA-seq data using a scPRINT model

Parameters:

batch_size (int, default: 10 ) –

Batch size for processing. Defaults to 10.
num_workers (int, default: 1 ) –

Number of workers for data loading. Defaults to 1.
max_len (int, default: 5000 ) –

Maximum number of genes to consider. Defaults to 5000.
precision (str, default: '16-mixed' ) –

Precision type for computations. Defaults to "16-mixed".
how (str, default: 'most var' ) –

Method to select genes. Options are "most var". Defaults to "most var".
max_cells (int, default: 500000 ) –

Number of cells to use for plotting correlation. Defaults to 500000.
doplot (bool, default: False ) –

Whether to generate plots. Defaults to False.
predict_depth_mult (int, default: 4 ) –

Multiplier for prediction depth. Defaults to 4.
downsample (Optional[float], default: None ) –

Fraction of data to downsample. Defaults to None.
devices (List[int]) –

List of device IDs to use. Defaults to [0]. Documented in the docstring but absent from the `__init__` signature shown in the source.
dtype (dtype, default: float16 ) –

Data type for computations. Defaults to torch.float16.
genelist (Optional[List[str]], default: None ) –

List of gene names to use. Defaults to None.
save_every (int, default: 100000 ) –

The number of cells to save at a time. Defaults to 100_000.

Methods:

Name	Description
`__call__`	Denoises the given annotated data with the scPRINT model.

Source code in scprint/tasks/denoise.py

def __init__(
    self,
    batch_size: int = 10,
    num_workers: int = 1,
    max_len: int = 5_000,
    precision: str = "16-mixed",
    how: str = "most var",
    max_cells: int = 500_000,
    doplot: bool = False,
    predict_depth_mult: int = 4,
    downsample: Optional[float] = None,
    dtype: torch.dtype = torch.float16,
    genelist: Optional[List[str]] = None,
    save_every: int = 100_000,
):
    """
    Denoiser class for denoising scRNA-seq data using a scPRINT model

    Args:
        batch_size (int, optional): Batch size for processing. Defaults to 10.
        num_workers (int, optional): Number of workers for data loading. Defaults to 1.
        max_len (int, optional): Maximum number of genes to consider. Defaults to 5000.
        precision (str, optional): Precision type for computations. Defaults to "16-mixed".
        how (str, optional): Method to select genes. Options are "most var". Defaults to "most var".
        max_cells (int, optional): Number of cells to use for plotting correlation. Defaults to 500000.
        doplot (bool, optional): Whether to generate plots. Defaults to False.
        predict_depth_mult (int, optional): Multiplier for prediction depth. Defaults to 4.
        downsample (Optional[float], optional): Fraction of data to downsample. Defaults to None.
        devices (List[int], optional): List of device IDs to use. Defaults to [0].
        dtype (torch.dtype, optional): Data type for computations. Defaults to torch.float16.
        genelist (Optional[List[str]], optional): List of gene names to use. Defaults to None.
        save_every (int, optional): The number of cells to save at a time. Defaults to 100_000.
    """
    self.batch_size = batch_size
    self.num_workers = num_workers
    self.max_len = max_len
    self.max_cells = max_cells
    self.doplot = doplot
    self.predict_depth_mult = predict_depth_mult
    self.how = how
    self.downsample = downsample
    self.precision = precision
    self.dtype = dtype
    self.genelist = genelist
    self.save_every = save_every

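`__call__`

Denoises the given annotated data with the scPRINT model.

Parameters:

model (Module) –

The scPRINT model to be used for denoising.
adata (AnnData) –

The annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.

Returns:	`tuple` – `(metrics, random_indices, pred_adata)`: the correlation metrics (`None` unless `downsample` is set), the sampled cell indices (`None` if no subsampling was done), and the denoised `AnnData`.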

Source code in scprint/tasks/denoise.py

def __call__(self, model: torch.nn.Module, adata: AnnData):
    """
    __call__ calling the function

    Args:
        model (torch.nn.Module): The scPRINT model to be used for denoising.
        adata (AnnData): The annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.

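    Returns:
        tuple: (metrics, random_indices, pred_adata), i.e. the correlation metrics (None unless downsample is set), the sampled cell indices (None if no subsampling was done), and the denoised AnnData.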
    """
    # Select random number
    if self.downsample is not None:
        num = np.random.randint(0, 1000000)
        while os.path.exists(f"collator_output_{num}.txt"):
            num = np.random.randint(0, 1000000)
    random_indices = None
    if self.max_cells < adata.shape[0]:
        random_indices = np.random.randint(
            low=0, high=adata.shape[0], size=self.max_cells
        )
        adataset = SimpleAnnDataset(
            adata[random_indices], obs_to_output=["organism_ontology_term_id"]
        )
    else:
        adataset = SimpleAnnDataset(
            adata, obs_to_output=["organism_ontology_term_id"]
        )
    if self.how == "most var":
        sc.pp.highly_variable_genes(
            adata, flavor="seurat_v3", n_top_genes=self.max_len, span=0.99
        )
        self.genelist = adata.var.index[adata.var.highly_variable]

    col = Collator(
        organisms=model.organisms,
        valid_genes=model.genes,
        max_len=self.max_len,
        how="some" if self.how == "most var" else self.how,
        genelist=self.genelist if self.how != "random expr" else [],
        downsample=self.downsample,
        save_output=f"collator_output_{num}.txt"
        if self.downsample is not None
        else None,
    )
    dataloader = DataLoader(
        adataset,
        collate_fn=col,
        batch_size=self.batch_size,
        num_workers=self.num_workers,
        shuffle=False,
    )

    model.doplot = self.doplot
    model.on_predict_epoch_start()
    model.eval()
    device = model.device.type
    with torch.no_grad(), torch.autocast(device_type=device, dtype=self.dtype):
        for batch in tqdm(dataloader):
            gene_pos, expression, depth = (
                batch["genes"].to(device),
                batch["x"].to(device),
                batch["depth"].to(device),
            )
            model._predict(
                gene_pos,
                expression,
                depth,
                predict_mode="denoise",
                depth_mult=self.predict_depth_mult,
                max_size_in_mem=self.save_every,
            )
    torch.cuda.empty_cache()
    model.log_adata(name="predict_part_" + str(model.counter))
    try:
        mdir = (
            model.logger.save_dir if model.logger.save_dir is not None else "data"
        )
    except Exception:
        mdir = "data"
    pred_adata = []
    for i in range(model.counter + 1):
        file = (
            mdir
            + "/step_"
            + str(model.global_step)
            + "_"
            + model.name
            + "_predict_part_"
            + str(i)
            + "_"
            + str(model.global_rank)
            + ".h5ad"
        )
        pred_adata.append(sc.read_h5ad(file))
    pred_adata = concat(pred_adata)
    metrics = None
    if self.downsample is not None:
        noisy = np.loadtxt(f"collator_output_{num}.txt")
        loc = np.loadtxt(f"collator_output_{num}.txt_loc")
        os.remove(f"collator_output_{num}.txt")
        os.remove(f"collator_output_{num}.txt_loc")
        # Sort loc indices per row and apply same sorting to noisy expression matrix
        sorted_indices = np.array([np.argsort(row) for row in loc])
        # Create row indices array for advanced indexing
        row_indices = np.arange(len(loc))[:, np.newaxis]
        # Sort loc and noisy using the row-wise indices
        del loc
        noisy = noisy[row_indices, sorted_indices]
        del sorted_indices, row_indices

        reco = pred_adata.layers["scprint_mu"].data.reshape(pred_adata.shape[0], -1)
        adata = (
            adata[random_indices, adata.var.index.isin(pred_adata.var.index)]
            if random_indices is not None
            else adata[:, adata.var.index.isin(pred_adata.var.index)]
        )
        true = adata.X[
            :,
            pred_adata.layers["scprint_mu"][
                :, pred_adata.var.index.isin(adata.var.index)
            ]
            .toarray()
            .any(axis=0),
        ].toarray()

        corr_coef, p_value = spearmanr(
            np.vstack([reco[true != 0], noisy[true != 0], true[true != 0]]).T
        )
        metrics = {
            "reco2noisy": corr_coef[0, 1],
            "reco2full": corr_coef[0, 2],
            "noisy2full": corr_coef[1, 2],
        }
        # corr_coef[p_value > 0.05] = 0
        # if self.doplot:
        #    plt.figure(figsize=(10, 5))
        #    plt.imshow(
        #        corr_coef, cmap="coolwarm", interpolation="none", vmin=-1, vmax=1
        #    )
        #    plt.colorbar()
        #    plt.title("Expression Correlation Coefficient")
        #    plt.show()
        # metrics = {
        #    "reco2noisy": np.mean(
        #        corr_coef[
        #            self.max_cells : self.max_cells * 2, : self.max_cells
        #        ].diagonal()
        #    ),
        #    "reco2full": np.mean(
        #        corr_coef[self.max_cells * 2 :, : self.max_cells].diagonal()
        #    ),
        #    "noisy2full": np.mean(
        #        corr_coef[
        #            self.max_cells * 2 :,
        #            self.max_cells : self.max_cells * 2,
        #        ].diagonal()
        #    ),
        # }
    return metrics, random_indices, pred_adata
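When `downsample` is set, the noisy expression values saved by the collator are reordered row by row so that each row follows the sorted gene locations in `loc`. A toy NumPy sketch of that row-wise reordering, using made-up arrays in place of the files written by the collator:

```python
import numpy as np

# Toy stand-ins: gene positions per cell and the matching noisy expression values.
loc = np.array([[2, 0, 1],
                [1, 2, 0]])
noisy = np.array([[30.0, 10.0, 20.0],
                  [20.0, 30.0, 10.0]])

# Per-row sort order of the gene positions.
sorted_indices = np.argsort(loc, axis=1)

# Column vector of row indices so that advanced indexing pairs row i with its own order.
row_indices = np.arange(len(loc))[:, np.newaxis]
noisy_sorted = noisy[row_indices, sorted_indices]

# np.take_along_axis expresses the same operation more directly.
same = np.take_along_axis(noisy, sorted_indices, axis=1)

print(noisy_sorted)
```

In this toy input each row of `noisy` ends up ordered by its gene position, which aligns the columns across cells.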

`default_benchmark`

default_benchmark function used to run the default denoising benchmark of scPRINT

Parameters:

model (Any) –

The scPRINT model to be used for the benchmark.
default_dataset (str, default: FILE_DIR + '/../../data/gNNpgpo6gATjuxTE7CCp.h5ad' ) –

Path to the default dataset to use for benchmarking. Defaults to FILE_DIR + "/../../data/gNNpgpo6gATjuxTE7CCp.h5ad".
max_len (int, default: 5000 ) –

Maximum number of genes to consider. Defaults to 5000.

Returns:	`dict` – A dictionary containing the benchmark metrics.

Source code in scprint/tasks/denoise.py

def default_benchmark(
    model: Any,
    default_dataset: str = FILE_DIR
    + "/../../data/gNNpgpo6gATjuxTE7CCp.h5ad",  # r4iCehg3Tw5IbCLiCIbl
    max_len: int = 5000,
):
    """
    default_benchmark function used to run the default denoising benchmark of scPRINT

    Args:
        model (Any): The scPRINT model to be used for the benchmark.
        default_dataset (str, optional): Path to the default dataset to use for benchmarking. Defaults to FILE_DIR + "/../../data/gNNpgpo6gATjuxTE7CCp.h5ad".
        max_len (int, optional): Maximum number of genes to consider. Defaults to 5000.

    Returns:
        dict: A dictionary containing the benchmark metrics.
    """
    adata = sc.read_h5ad(default_dataset)
    denoise = Denoiser(
        batch_size=40,
        max_len=max_len,
        max_cells=10_000,
        doplot=False,
        num_workers=8,
        predict_depth_mult=10,
        downsample=0.7,
    )
    return denoise(model, adata)[0]

`split_molecules`

Splits molecules into two (potentially overlapping) groups.

Parameters:

umis (ndarray) –

Array of molecules to split.
data_split (float) –

Proportion of molecules to assign to the first group.
overlap_factor (float, default: 0.0 ) –

Overlap correction factor, if desired.
random_state (RandomState, default: None ) –

For reproducible sampling.

Returns:	`Tuple[ndarray, ndarray]` – `umis_X` and `umis_Y`, representing `split` and `~(1 - split)` counts sampled from the input array.

Source code in scprint/tasks/denoise.py

def split_molecules(
    umis: np.ndarray,
    data_split: float,
    overlap_factor: float = 0.0,
    random_state: np.random.RandomState = None,
) -> Tuple[np.ndarray, np.ndarray]:
    """Splits molecules into two (potentially overlapping) groups.
    :param umis: Array of molecules to split
    :param data_split: Proportion of molecules to assign to the first group
    :param overlap_factor: Overlap correction factor, if desired
    :param random_state: For reproducible sampling
    :return: umis_X and umis_Y, representing ``split`` and ``~(1 - split)`` counts
             sampled from the input array
    """
    if random_state is None:
        random_state = np.random.RandomState()

    umis_X_disjoint = random_state.binomial(umis, data_split - overlap_factor)
    umis_Y_disjoint = random_state.binomial(
        umis - umis_X_disjoint, (1 - data_split) / (1 - data_split + overlap_factor)
    )
    overlap_factor = umis - umis_X_disjoint - umis_Y_disjoint
    umis_X = umis_X_disjoint + overlap_factor
    umis_Y = umis_Y_disjoint + overlap_factor

    return umis_X, umis_Y
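A short usage sketch for `split_molecules`, assuming a small made-up count vector. The function body is copied from the source above so the example runs without scPRINT installed; the seed and input values are arbitrary choices for illustration.

```python
from typing import Optional, Tuple

import numpy as np


def split_molecules(
    umis: np.ndarray,
    data_split: float,
    overlap_factor: float = 0.0,
    random_state: Optional[np.random.RandomState] = None,
) -> Tuple[np.ndarray, np.ndarray]:
    """Copy of the function above, reproduced to keep the example self-contained."""
    if random_state is None:
        random_state = np.random.RandomState()
    umis_X_disjoint = random_state.binomial(umis, data_split - overlap_factor)
    umis_Y_disjoint = random_state.binomial(
        umis - umis_X_disjoint, (1 - data_split) / (1 - data_split + overlap_factor)
    )
    overlap = umis - umis_X_disjoint - umis_Y_disjoint
    return umis_X_disjoint + overlap, umis_Y_disjoint + overlap


umis = np.array([10, 0, 25, 3, 100])  # made-up UMI counts for five genes

# Without overlap, every molecule lands in exactly one group.
x, y = split_molecules(umis, data_split=0.7, random_state=np.random.RandomState(0))

# With a positive overlap factor, some molecules are counted in both groups.
x2, y2 = split_molecules(
    umis, data_split=0.7, overlap_factor=0.1, random_state=np.random.RandomState(0)
)
print(x, y)
print(x2, y2)
```

With `overlap_factor=0.0` the two outputs sum back to the input counts; with a positive overlap factor their sum can exceed the input because shared molecules are added to both groups.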
