Guide assignment

This package contains implementations of 11 different guide assignment methods which we grouped into four main categories based on whether information is shared across gRNAs, cells or both. Details on all functions can be found in our preprint.

Independent

UMI threshold (UMI_t): The simplest approach shares no information across the gRNA-cell matrix and checks each value separately against a fixed threshold. If a cell has at least as many counts for a gRNA as the user-defined threshold, it is assigned to this gRNA. To find a suitable threshold, a list of thresholds can be passed to the function, which then creates one assignment output file per threshold in the specified output_dir.

crispat.ga_umi(input_file, thresholds, output_dir)

Guide assignment with fixed UMI thresholds

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
thresholds (list) – list of integers to use as thresholds (create assignment output file for each t in the list)
output_dir (str) – directory in which to store the resulting assignment

Returns:

None
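The thresholding rule described above can be sketched in a few lines of plain Python. This is an illustration of the rule only, not crispat's implementation: the nested dictionary `counts` and the helper `assign_by_umi_threshold` are hypothetical stand-ins for the gRNA-cell matrix and for `ga_umi`.

```python
def assign_by_umi_threshold(counts, threshold):
    """Return (cell, gRNA) pairs whose UMI count is at least `threshold`.

    `counts` is a hypothetical nested dict {cell: {gRNA: UMI count}}.
    """
    return [
        (cell, grna)
        for cell, per_grna in counts.items()
        for grna, n in per_grna.items()
        if n >= threshold
    ]


# Toy data (made up for illustration).
counts = {
    "cell_1": {"gRNA_A": 12, "gRNA_B": 0},
    "cell_2": {"gRNA_A": 3, "gRNA_B": 7},
    "cell_3": {"gRNA_A": 1, "gRNA_B": 2},
}

# One assignment per threshold, mirroring the one-output-per-threshold idea.
assignments = {t: assign_by_umi_threshold(counts, t) for t in [2, 5, 10]}
```

Because every value is checked independently, a cell can be assigned to several gRNAs at a low threshold (cell_2 at threshold 2) and to none at a high one.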

Across gRNAs

Maximum: This method assigns each cell the gRNA with the highest UMI count in that cell.

crispat.ga_max(input_file, output_dir, UMI_threshold=0)

Guide assignment in which the most abundant gRNA per cell is assigned

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
output_dir (str) – directory in which to store the resulting assignment
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)

Returns:

None
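As a rough sketch of the maximum rule, the snippet below picks the most abundant gRNA per cell and then applies the optional UMI threshold. The `counts` dictionary and the `assign_max` helper are hypothetical, and the tie-breaking behaviour (first gRNA encountered wins) is an assumption that may differ from crispat.

```python
def assign_max(counts, umi_threshold=0):
    """Assign each cell its most abundant gRNA.

    `counts` is a hypothetical nested dict {cell: {gRNA: UMI count}}.
    Cells whose top count is below `umi_threshold` are removed afterwards.
    Ties go to the first gRNA encountered (an assumption for this sketch).
    """
    assignment = {}
    for cell, per_grna in counts.items():
        grna, n = max(per_grna.items(), key=lambda kv: kv[1])
        if n >= umi_threshold:
            assignment[cell] = (grna, n)
    return assignment


# Toy data (made up for illustration).
counts = {
    "cell_1": {"gRNA_A": 12, "gRNA_B": 0},
    "cell_2": {"gRNA_A": 3, "gRNA_B": 7},
    "cell_3": {"gRNA_A": 1, "gRNA_B": 2},
}

all_cells = assign_max(counts)
filtered = assign_max(counts, umi_threshold=5)
```

Without the extra threshold every cell receives exactly one gRNA, even when its top count is as low as 2 (cell_3), which is why the additional UMI_threshold can be useful.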

Relative frequency threshold (Ratio_X%): This method assigns each cell the gRNA with the highest count in that cell, provided this count makes up at least X% of the cell's total gRNA counts.

crispat.ga_ratio(input_file, thresholds, output_dir, add_UMI_counts=True, UMI_threshold=0)

Guide assignment in which the most abundant gRNA per cell is assigned if it comprises more than X% of the total counts in a cell

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
thresholds (list) – list of ratio thresholds to use (generates one output file per ratio)
output_dir (str) – directory in which to store the resulting assignment
add_UMI_counts (bool) – if True, UMI counts are added to the output. To improve run time, set it to False
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)

Returns:

None
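The ratio rule can be illustrated as follows. This is a minimal sketch under stated assumptions: `counts` and `assign_by_ratio` are hypothetical, thresholds are expressed as fractions between 0 and 1 (the unit expected by `ga_ratio` is not specified here), and the comparison uses "at least", following the method description.

```python
def assign_by_ratio(counts, ratio_threshold):
    """Assign the top gRNA of a cell if it holds enough of the cell's counts.

    `counts` is a hypothetical nested dict {cell: {gRNA: UMI count}};
    `ratio_threshold` is a fraction in [0, 1] (an assumption for this sketch).
    """
    assignment = {}
    for cell, per_grna in counts.items():
        total = sum(per_grna.values())
        if total == 0:
            continue  # no gRNA counts at all, nothing to assign
        grna, n = max(per_grna.items(), key=lambda kv: kv[1])
        if n / total >= ratio_threshold:
            assignment[cell] = grna
    return assignment


# Toy data (made up for illustration).
counts = {
    "cell_1": {"gRNA_A": 12, "gRNA_B": 0},  # top ratio 1.0
    "cell_2": {"gRNA_A": 3, "gRNA_B": 7},   # top ratio 0.7
    "cell_3": {"gRNA_A": 1, "gRNA_B": 2},   # top ratio about 0.67
}

lenient = assign_by_ratio(counts, 0.5)
strict = assign_by_ratio(counts, 0.8)
```

Unlike the fixed UMI threshold, this rule depends only on the composition of a cell's gRNA counts, so cell_3 passes the lenient threshold despite having just three counts in total.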

Across cells

Poisson-Gaussian mixture model (Poisson-Gauss): For every gRNA, this method fits a Poisson-Gaussian mixture model on the log2-transformed non-zero UMI counts of this gRNA over all cells across all batches. Next, all cells for which the probability of observing the guide counts from the Gaussian component is higher than for the Poisson (background) component are assigned to this gRNA.

crispat.ga_poisson_gauss(input_file, output_dir, start_gRNA=0, step=None, n_iter=500, n_counts=None, UMI_threshold=0, n_jobs=1, make_plots=True)

Guide assignment in which a Poisson-Gaussian mixture model is fitted to the non-zero log-transformed UMI counts

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
output_dir (str) – directory in which to store the resulting assignment
start_gRNA (int, optional) – index of the start gRNA when parallelizing assignment for gRNA sets
step (int, optional) – number of gRNAs for which the assignment is done (if set to None, assignment for all gRNAs in the data)
n_iter (int, optional) – number of steps for training the model
n_counts (int, optional) – subsample the gRNA counts per cell to a total of n_counts. If None (default), the UMI count matrix is used without any downsampling.
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)
n_jobs (int, optional) – number of worker processes used to fit gRNAs in parallel. Default 1 (serial). For n_jobs > 1, per-gRNA fits are dispatched to a process pool; results are reassembled in input order so output is identical to the serial run.
make_plots (bool, optional) – if False, skip writing per-gRNA loss and fitted-model PNGs. Default True to preserve existing behavior. Set to False for batch / cluster runs to reduce per-worker overhead.

Returns:

None
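The start_gRNA and step parameters make it possible to split the gRNAs into sets that are processed by separate calls, for example as separate cluster jobs. The sketch below only computes the (start_gRNA, step) pairs for such a split; `grna_chunks` and `n_grnas` are hypothetical names, and clamping the last step to the remaining number of gRNAs is a cautious choice rather than a documented requirement.

```python
def grna_chunks(n_grnas, step):
    """Yield (start_gRNA, step) pairs that together cover all gRNA indices."""
    for start in range(0, n_grnas, step):
        # Clamp the last chunk so it never runs past the final gRNA.
        yield start, min(step, n_grnas - start)


chunks = list(grna_chunks(n_grnas=10, step=4))

# Each pair could then be passed to one call, for example:
#   crispat.ga_poisson_gauss(input_file, output_dir, start_gRNA=start, step=size)
# where input_file and output_dir are placeholders for your own paths.
```

For 10 gRNAs and a step of 4, this yields three chunks starting at indices 0, 4 and 8.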

Gaussian-Gaussian mixture model (Gauss): For every gRNA, this method fits a Gaussian-Gaussian mixture model on the log10-transformed UMI counts of this gRNA with a pseudocount of 1 over all cells in a batch.

crispat.ga_gauss(input_file, output_dir, start_gRNA=0, step=None, batch_list=None, UMI_threshold=0, n_iter=250, nonzero=False, inference='vi', n_jobs=1, make_plots=True)

Guide assignment in which a Gaussian mixture model is fitted to the log-transformed UMI counts similar to the approach used in Cell Ranger. Two different inference methods are provided that can be selected with the ‘inference’ parameter.

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
output_dir (str) – directory in which to store the resulting assignment
start_gRNA (int, optional) – index of the start gRNA when parallelizing assignment for gRNA sets
step (int, optional) – number of gRNAs for which the assignment is done (if set to None, assignment for all gRNAs in the data)
batch_list (list, optional) – list of batches for which to fit the mixture model. If None, the mixture model is fitted for all batches
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)
n_iter (int, optional) – number of steps for training the model
nonzero (bool, optional) – if True fit the mixture model on the nonzero values only, otherwise all values are used
inference (str) – choice of the inference method, either “vi” (default) for variational inference via pyro or “em” for using an EM algorithm
n_jobs (int, optional) – number of worker processes used to fit gRNAs in parallel. Default 1 (serial). For n_jobs > 1, the per-gRNA model fits are dispatched to a process pool; results are reassembled in input order so output is identical to the serial run.
make_plots (bool, optional) – if False, skip writing per-gRNA loss and fitted-model PNGs (only relevant for inference=’vi’). Default True to preserve existing behavior. Set to False for batch / cluster runs to reduce per-worker overhead.

Returns:

None
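To make the idea of a two-component Gaussian mixture on log10(count + 1) values concrete, here is a small self-contained EM sketch for a single gRNA. It is not crispat's code: the function names, the initialisation, the lower bound on the standard deviation and the rule of assigning cells whose posterior for the higher-mean component exceeds 0.5 are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def responsibilities(values, weights, means, sds):
    """Posterior probability of component 1 for every value."""
    out = []
    for x in values:
        p0 = weights[0] * normal_pdf(x, means[0], sds[0])
        p1 = weights[1] * normal_pdf(x, means[1], sds[1])
        out.append(p1 / (p0 + p1 + 1e-300))  # tiny constant avoids 0/0
    return out


def fit_two_gaussians(values, n_iter=100, min_sd=0.05):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, sds)."""
    lo, hi = min(values), max(values)
    # Component 0 starts at the minimum, component 1 at the maximum.
    weights, means = [0.5, 0.5], [lo, hi]
    sds = [max((hi - lo) / 4.0, min_sd)] * 2
    for _ in range(n_iter):
        resp1 = responsibilities(values, weights, means, sds)  # E-step
        for k in (0, 1):  # M-step: update each component in turn
            r = resp1 if k == 1 else [1.0 - p for p in resp1]
            total = max(sum(r), 1e-12)
            weights[k] = total / len(values)
            means[k] = sum(ri * x for ri, x in zip(r, values)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / total
            sds[k] = max(math.sqrt(var), min_sd)
    return weights, means, sds


# Toy UMI counts of one gRNA across 12 cells (made up for illustration).
counts = [0, 0, 1, 0, 2, 1, 0, 0, 40, 55, 60, 35]
values = [math.log10(c + 1) for c in counts]  # pseudocount of 1

weights, means, sds = fit_two_gaussians(values)
post_high = responsibilities(values, weights, means, sds)
assigned = [i for i, p in enumerate(post_high) if p > 0.5]
```

On this toy data the two components separate cleanly into a background group and the four cells with high counts, which are the ones collected in `assigned`.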

Across gRNAs and cells

In the last group of methods, information is shared across cells and across gRNAs. Since ga_poisson, ga_negative_binomial and ga_binomial have the longest run time, these functions are automatically parallelized to run over all available CPUs. To change this default behaviour, set parallelize to False or specify the number of processes (n_processes) that should be used instead of all available CPUs.

2-Beta mixture model (2-Beta): Like the Ratio_X% approach, this method calculates for every cell the relative frequency of a gRNA as the ratio of its counts over the total number of gRNA counts. Using the highest ratio for every cell the method then fits a mixture model of two Beta distributions across all cells from a given batch to determine a threshold on the ratio based on where the two Beta distributions intersect. This results in one threshold per batch without distinguishing between gRNAs. This threshold is then used as X in the Ratio_X% approach.

crispat.ga_2beta(input_file, output_dir, n_iter=500, batch_list=None, add_UMI_counts=True, UMI_threshold=0)

Guide assignment in which a mixture model of 2-Beta distributions is fitted to the ratios of the most abundant gRNAs per cell to determine a batch-specific threshold on the ratios

Parameters:

input_file (str) – Path to the stored anndata object with the gRNA counts
output_dir (str) – Directory in which to store the resulting assignment
n_iter (int, optional) – Number of steps for training the model (default is 500)
batch_list (list, optional) – List of batches for which to fit the mixture model. If None (default), all available batches are used
add_UMI_counts (bool, optional) – if true, UMI counts are added to the output. To improve run time, set it to False
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)

Returns:

None
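Once the two Beta components are fitted, the threshold is the ratio at which they intersect. The following sketch locates that point by bisection for given component parameters. The parameter values are made up, the function names are hypothetical, and comparing the weighted densities (rather than the unweighted ones) is an assumption; the fitting step itself is not shown.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def intersection(w_low, low, w_high, high, lo=0.01, hi=0.99, tol=1e-9):
    """Bisection for the ratio where the high component overtakes the low one."""

    def diff(x):
        return w_high * beta_pdf(x, *high) - w_low * beta_pdf(x, *low)

    # The search interval must bracket a sign change.
    assert diff(lo) < 0 < diff(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical fitted components: Beta(2, 8) for low ratios, Beta(8, 2) for high ones.
threshold = intersection(0.5, (2, 8), 0.5, (8, 2))

# The threshold then plays the role of X in the Ratio_X% rule.
top_ratios = {"cell_1": 0.95, "cell_2": 0.40, "cell_3": 0.62}
assigned = sorted(cell for cell, r in top_ratios.items() if r >= threshold)
```

With these symmetric example parameters the components cross at a ratio of 0.5, so cell_1 and cell_3 pass the threshold.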

3-Beta mixture model (3-Beta): This method follows the same approach as the 2-Beta model but fits a 3-component Beta mixture and defines the threshold X as the intersection of the two highest components. The third component might thus capture cells infected with two gRNAs.

crispat.ga_3beta(input_file, output_dir, n_iter=500, batch_list=None, add_UMI_counts=True, UMI_threshold=0)

Guide assignment in which a mixture model of 3 Beta distributions is fitted to the ratios of the most abundant gRNAs to determine a batch-specific threshold on the ratios

Parameters:

input_file (str) – Path to the stored anndata object with the gRNA counts
output_dir (str) – Directory in which to store the resulting assignment
n_iter (int, optional) – Number of steps for training the model
batch_list (list, optional) – List of batches for which to fit the mixture model. If none (default), all available batches are used
add_UMI_counts (bool, optional) – if True, UMI counts are added to the output. To improve run time, set it to False
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with too few UMI counts (default: no additional UMI threshold)

Returns:

None

Latent variable Poisson generalized linear model (Poisson): For every gRNA, this method fits a Poisson mixture on the UMI counts of this gRNA across all cells with mean \(\lambda = e^{\beta_0+\beta_1 p_c+\beta_2 b_c+\log(s_c)}\) with \(\beta_0 \in \mathbb{R}, \beta_1 \in \mathbb{R}^+, \beta_2 \in \mathbb{R}^n\), perturbation state \(p_c \in \{0,1\}\) and cell-specific covariates (sequencing depth \(s_c\) and one-hot encoded batch \(b_c\) for n batches). This approach is based on the R package SCEPTRE.

crispat.ga_poisson(input_file, output_dir, start_gRNA=0, gRNA_step=None, batch_list=None, UMI_threshold=0, n_iter=2500, subsample_size=15000, parallelize=True, n_processes=None, mem_limit='10GB')

Guide assignment with a Poisson mixture model based on SCEPTRE approach

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
output_dir (str) – directory in which to store the resulting assignment
start_gRNA (int, optional) – index of the start gRNA when parallelizing assignment for gRNA sets
gRNA_step (int, optional) – number of gRNAs for which the assignment is done (if set to None, assignment for all gRNAs in the data)
batch_list (list, optional) – list of batches for which to fit the mixture model. If none (default), all available batches are used.
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)
n_iter (int, optional) – number of steps for training the model
subsample_size (int, optional) – number of cells to use for each step
parallelize (bool, optional) – whether to parallelize the computation over the gRNA (default = True)
n_processes (int, optional) – specifies number of processes to use for parallelization if parallelize = True. If set to None (default), all available CPUs will be used (if this number is not higher than the number of gRNAs).
mem_limit (str, optional) – set memory limit for the dask cluster (default: 10GB)

Returns:

None
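As a worked example of the mean in the Poisson model, the snippet below evaluates the formula for one cell in its unperturbed and perturbed state. All coefficient values are invented for illustration and `poisson_mean` is a hypothetical helper; in the actual method the coefficients are learnt and the perturbation state is latent.

```python
import math


def poisson_mean(beta0, beta1, beta2, perturbed, batch_onehot, seq_depth):
    """Mean of the Poisson model: exp(b0 + b1*p_c + b2.b_c + log(s_c))."""
    batch_term = sum(b * x for b, x in zip(beta2, batch_onehot))
    linear = beta0 + beta1 * perturbed + batch_term
    return math.exp(linear + math.log(seq_depth))


# Made-up coefficients: two batches, the cell belongs to the second one.
beta0, beta1, beta2 = -8.0, 4.0, [0.0, 0.5]
batch_onehot, seq_depth = [0, 1], 2000

lam_background = poisson_mean(beta0, beta1, beta2, 0, batch_onehot, seq_depth)
lam_perturbed = poisson_mean(beta0, beta1, beta2, 1, batch_onehot, seq_depth)
```

Since beta1 is constrained to be positive, the perturbed mean always exceeds the background mean by a factor of exp(beta1), while the log(s_c) offset scales both means linearly with sequencing depth.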

Latent variable Negative Binomial generalized linear model (Negative Binomial): This method is a modified version of the Poisson method using a Negative Binomial distribution instead of Poisson distribution. The overdispersion is learnt as an additional parameter in the model.

crispat.ga_negative_binomial(input_file, output_dir, start_gRNA=0, gRNA_step=None, batch_list=None, UMI_threshold=0, n_iter=2500, subsample_size=15000, parallelize=True, n_processes=None, mem_limit='10GB')

Guide assignment with a negative binomial mixture model based on SCEPTRE mixture approach (Negative Binomial instead of Poisson as in SCEPTRE)

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
output_dir (str) – directory in which to store the resulting assignment
start_gRNA (int, optional) – index of the start gRNA when parallelizing assignment for gRNA sets
gRNA_step (int, optional) – number of gRNAs for which the assignment is done (if set to None, assignment for all gRNAs in the data)
batch_list (list, optional) – list of batches for which to fit the mixture model. If None (default), all available batches are used
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)
n_iter (int) – number of steps for training the model
subsample_size (int) – number of cells to use for each step
parallelize (bool, optional) – whether to parallelize the computation over the gRNA (default = True)
n_processes (int, optional) – specifies number of processes to use for parallelization if parallelize = True. If set to None (default), all available CPUs will be used (if this number is not higher than the number of gRNAs).
mem_limit (str, optional) – set memory limit for the dask cluster (default: 10GB)

Returns:

None

Latent variable Binomial generalized linear model (Binomial): For every gRNA, this method fits a binomial distribution \(B(N_c, \theta_c)\) with \(N_c\) being the total number of gRNA counts per cell and \(\theta_c=\mathrm{sigmoid}(e^{\beta_0+\beta_1 p_c+\beta_2 b_c})\) with \(\beta_0 \in \mathbb{R}, \beta_1 \in \mathbb{R}^+, \beta_2 \in \mathbb{R}^n\), perturbation state \(p_c \in \{0,1\}\) and one-hot encoded batch \(b_c\).

crispat.ga_binomial(input_file, output_dir, start_gRNA=0, gRNA_step=None, batch_list=None, UMI_threshold=0, n_iter=3000, subsample_size=15000, parallelize=True, n_processes=None, mem_limit='10GB')

Guide assignment in which a binomial mixture model is fitted to the gRNA counts

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
output_dir (str) – directory in which to store the resulting assignment
start_gRNA (int, optional) – index of the start gRNA when parallelizing assignment for gRNA sets
gRNA_step (int, optional) – number of gRNAs for which the assignment is done (if set to None, assignment for all gRNAs in the data)
batch_list (list, optional) – list of batches for which to fit the mixture model. If none (default), all available batches are used.
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)
n_iter (int, optional) – number of steps for training the model
subsample_size (int, optional) – number of cells to use for each step
parallelize (bool, optional) – whether to parallelize the computation over the gRNA (default = True)
n_processes (int, optional) – specifies number of processes to use for parallelization if parallelize = True. If set to None (default), all available CPUs will be used (if this number is not higher than the number of gRNAs).
mem_limit (str, optional) – set memory limit for the dask cluster (default: 10GB)

Returns:

None

Quantile approach (Top_X% cells): For every gRNA, this method chooses the top X% of cells with the highest abundance of this gRNA, excluding cells with zero counts for the gRNA.

crispat.ga_quantiles(input_file, thresholds, output_dir, UMI_threshold=0)

Guide assignment in which the X% non-zero cells with highest ratios are assigned per gRNA

Parameters:

input_file (str) – path to the stored anndata object with the gRNA counts
thresholds (list) – list of quantile thresholds for which to return the assignment
output_dir (str) – directory in which to store the resulting assignment
UMI_threshold (int, optional) – Additional UMI threshold for assigned cells which is applied after creating the initial assignment to remove cells with fewer UMI counts than this threshold (default: no additional UMI threshold)

Returns:

None