pandora.bootstrap module

pandora.bootstrap.bootstrap_and_embed_multiple(dataset: EigenDataset, n_bootstraps: int = 100, result_dir: pathlib.Path | None = None, smartpca: Executable = 'smartpca', embedding: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, n_components: int = 10, seed: int | None = None, threads: int | None = None, redo: bool = False, keep_bootstraps: bool = False, bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: float = 0.05, smartpca_optional_settings: Dict | None = None, logger: loguru.Logger | None = None) → List[EigenDataset]

Draws n_bootstraps bootstrap datasets of the provided EigenDataset and performs a PCA or MDS analysis (as specified by embedding) for each bootstrap.

If bootstrap_convergence_check is set, this method draws at most n_bootstraps bootstraps; see Notes below for further details. Unless threads=1, the computation is performed in parallel.

Parameters:

dataset (EigenDataset): Dataset object to base the bootstrap replicates on.
n_bootstraps (int, default=100): Number of bootstrap replicates to draw. If bootstrap_convergence_check is set, this is the upper limit on the number of replicates.
result_dir (Optional[pathlib.Path], default=None): Directory in which to store all result files. If None, the bootstraps are stored in the directory of the input dataset.
smartpca (Executable, default="smartpca"): Path pointing to an executable of the EIGENSOFT smartpca tool. By default, Pandora expects "smartpca" in the PATH variable.
embedding (EmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA): Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.
n_components (int, default=10): Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.
seed (int, default=None): Seed to initialize the random number generator with.
threads (int, default=None): Number of threads to use for parallel bootstrap generation. Default is to use all system threads.
redo (bool, default=False): Whether to rerun analyses in case the result files are already present.
keep_bootstraps (bool, default=False): Whether to store all intermediate bootstrap ind, geno, and snp files. Note that setting this to True might require a substantial amount of disk space depending on the size of your dataset.
bootstrap_convergence_check (bool, default=True): Whether to automatically determine bootstrap convergence. If True, only as many replicates are computed as are required for convergence according to our heuristic (see Notes below).
bootstrap_convergence_tolerance (float, default=0.05): Determines the deviation tolerance when checking for bootstrap convergence. A value of X means that we allow deviations of up to \(X \cdot 100\%\) between pairwise bootstrap comparisons and still assume convergence.
smartpca_optional_settings (Dict, default=None): Additional smartpca settings. The following options are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Note that this option is only used for PCA analyses.
logger (loguru.Logger, default=None): Optional logger instance, used to log debug messages.
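The restriction on smartpca_optional_settings can be checked before calling the function. The following is a minimal sketch, not Pandora's own validation code: the helper name validate_smartpca_settings and the choice to raise a ValueError are assumptions made for illustration, and only the list of disallowed option names is taken from the parameter description above.

```python
from typing import Dict, Optional

# Option names listed as "not allowed" in the parameter description above.
DISALLOWED_SMARTPCA_OPTIONS = frozenset(
    {
        "genotypename",
        "snpname",
        "indivname",
        "evecoutname",
        "evaloutname",
        "numoutevec",
        "maxpops",
    }
)


def validate_smartpca_settings(settings: Optional[Dict[str, object]]) -> Dict[str, object]:
    """Hypothetical helper: reject settings that use a disallowed option name."""
    if settings is None:
        return {}
    clashes = sorted(DISALLOWED_SMARTPCA_OPTIONS.intersection(settings))
    if clashes:
        raise ValueError(f"Disallowed smartpca options: {', '.join(clashes)}")
    return dict(settings)


# numoutlieriter is a regular smartpca option and not on the disallowed list.
checked = validate_smartpca_settings({"numoutlieriter": 0})
```

The checked dictionary could then be passed as smartpca_optional_settings. How Pandora itself reacts to a disallowed option is not stated here.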

Returns:

bootstraps (List[EigenDataset]): List of the computed bootstrap replicates (at most n_bootstraps) as EigenDataset objects. Each of the resulting datasets has either bootstrap.pca or bootstrap.mds set (that is, not None), depending on the selected embedding option.

Notes

Bootstrap Convergence (“Bootstopping”): While more bootstraps yield more reliable stability analysis results, computing a large number of replicates is computationally expensive for typical genotype datasets. We thus suggest a trade-off between the accuracy of the stability estimate and the resource usage. To this end, we implement a bootstopping procedure intended to determine convergence of the bootstrapping procedure. Once every max(10, threads) (*) replicates, we perform the following heuristic convergence check: Let \(N\) be the number of replicates computed when performing the convergence check. We first create 10 random subsets of size \(int(N/2)\) by sampling from all \(N\) replicates. We then compute the Pandora Stability (PS) for each of the 10 subsets and compute the relative difference of PS values between all possible pairs of subsets \((PS_1, PS_2)\) as \(\frac{\left|PS_1 - PS_2\right|}{PS_2}\). We assume convergence if all pairwise relative differences are below \(X \cdot 100\%\), where \(X\) is the set tolerance. If we determine that the bootstrap has converged, all remaining bootstrap computations are cancelled.

(*) The reasoning for checking every max(10, threads) replicates is the following: if Pandora runs on a machine with e.g. 48 provided threads, 48 bootstraps will be computed in parallel and will terminate at approximately the same time. If we checked convergence every 10 replicates, we would have to perform 4 checks, three of which are unnecessary: since the 48 replicates are already computed anyway, we might as well use them instead of throwing away 30 in case 10 would have been sufficient.
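The convergence check described in the Notes above can be sketched in a few lines. This is an illustration under stated assumptions, not Pandora's implementation: the function name bootstraps_converged is hypothetical, the stability argument stands in for the Pandora Stability computation, subsets are drawn without replacement, and each unordered pair of subsets is compared once. The toy data uses plain floats as replicates and their mean as a stand-in stability score.

```python
import itertools
import random
from statistics import mean
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def bootstraps_converged(
    replicates: Sequence[T],
    stability: Callable[[Sequence[T]], float],
    tolerance: float = 0.05,
    n_subsets: int = 10,
    seed: Optional[int] = None,
) -> bool:
    """Heuristic check: do random half-size subsets agree on the stability score?"""
    rng = random.Random(seed)
    subset_size = len(replicates) // 2  # int(N/2)
    if subset_size < 1:
        return False  # too few replicates to compare anything

    # Assumption: subsets are sampled without replacement.
    ps_values = [
        stability(rng.sample(list(replicates), subset_size))
        for _ in range(n_subsets)
    ]

    # Relative difference |PS_1 - PS_2| / PS_2 for each pair of subsets.
    # Assumes strictly positive stability scores (no division by zero).
    return all(
        abs(ps_1 - ps_2) / ps_2 < tolerance
        for ps_1, ps_2 in itertools.combinations(ps_values, 2)
    )


# Toy stand-in: replicates are floats, "stability" is their mean.
toy_replicates = [0.90 + 0.001 * i for i in range(20)]
print(bootstraps_converged(toy_replicates, mean, seed=42))  # True: subset means differ by about 2% at most
```

Because the relative difference divides by PS_2, the result can depend on the order within a pair; the source does not say whether both orders are checked.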

pandora.bootstrap.bootstrap_and_embed_multiple_numpy(dataset: NumpyDataset, n_bootstraps: int = 100, embedding: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, n_components: int = 10, seed: int | None = None, threads: int | None = None, distance_metric: Callable[[numpy.ndarray, pandas.Series], Tuple[numpy.ndarray, pandas.Series]] = euclidean_sample_distance, imputation: str | None = 'mean', bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: float = 0.05) → List[NumpyDataset]

Draws n_bootstraps bootstrap datasets of the provided NumpyDataset and performs a PCA or MDS analysis (as specified by embedding) for each bootstrap.

If bootstrap_convergence_check is set, this method draws at most n_bootstraps bootstraps; see Notes below for further details. Unless threads=1, the computation is performed in parallel.

Parameters:

dataset (NumpyDataset)

Dataset object to base the bootstrap replicates on.

n_bootstraps (int, default=100)

Number of bootstrap replicates to draw.

embedding (EmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA)

Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.

n_components (int, default=10)

Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.

seed (int, default=None)

Seed to initialize the random number generator with.

threads (int, default=None)

Number of threads to use for parallel bootstrap generation. Default is to use all system threads.

distance_metric (Callable[[npt.NDArray, pd.Series, str], Tuple[npt.NDArray, pd.Series]], default=euclidean_sample_distance)

Distance metric to use for computing the distance matrix input for MDS. This is expected to be a function that receives the numpy array of sequences, the population for each sequence, and the imputation method as input, and outputs the distance matrix and the respective populations for each row. The resulting distance matrix is of size \((n, m)\) and the resulting populations are expected to be of size \((n, 1)\). Default is distance_metrics::euclidean_sample_distance (the pairwise Euclidean distance of all samples).

imputation (Optional[str], default="mean")

Imputation method to use. Available options are:

mean: Imputes missing values with the average of the respective SNP.
remove: Removes all SNPs with at least one missing value.
None: Does not impute missing data.

Note that depending on the distance_metric, not all imputation methods are supported. See the respective documentations in the distance_metrics module.

bootstrap_convergence_check (bool, default=True)

Whether to automatically determine bootstrap convergence. If True, will only compute as many replicates as required for convergence according to our heuristic (see Notes below).

bootstrap_convergence_tolerance (float, default=0.05)

Determines the level of deviation tolerance when checking for bootstrap convergence. A value of X means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
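To make the three imputation options in the parameter list above concrete, here is a small dependency-free sketch. It is not Pandora's code: the function name impute is hypothetical, the data is held in plain nested lists rather than a NumPy array, missing values are represented as None, and a SNP with no observed values is left untouched under "mean". All of these are assumptions made for illustration.

```python
from typing import List, Optional

# Rows are samples, columns are SNPs; None marks a missing value (assumption).
Genotypes = List[List[Optional[float]]]


def impute(data: Genotypes, imputation: Optional[str]) -> Genotypes:
    """Hypothetical helper mirroring the documented options: mean, remove, None."""
    if imputation is None:
        # No imputation: return an unchanged copy.
        return [row[:] for row in data]

    n_snps = len(data[0]) if data else 0
    columns = [[row[j] for row in data] for j in range(n_snps)]

    if imputation == "mean":
        # Per-SNP average over the observed values only.
        means: List[Optional[float]] = []
        for column in columns:
            observed = [value for value in column if value is not None]
            means.append(sum(observed) / len(observed) if observed else None)
        return [
            [means[j] if value is None else value for j, value in enumerate(row)]
            for row in data
        ]

    if imputation == "remove":
        # Keep only SNPs without any missing value.
        keep = [
            j for j, column in enumerate(columns)
            if all(value is not None for value in column)
        ]
        return [[row[j] for j in keep] for row in data]

    raise ValueError(f"Unknown imputation method: {imputation!r}")


toy = [[0, 1, None], [2, None, 1], [1, 1, 2]]
print(impute(toy, "mean"))    # [[0, 1, 1.5], [2, 1.0, 1], [1, 1, 2]]
print(impute(toy, "remove"))  # [[0], [2], [1]]
```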

Returns:

bootstraps (List[NumpyDataset]): List of the computed bootstrap replicates (at most n_bootstraps) as NumpyDataset objects. Each of the resulting datasets has either bootstrap.pca or bootstrap.mds set (that is, not None), depending on the selected embedding option.

Notes

Bootstrap Convergence (“Bootstopping”): While more bootstraps yield more reliable stability analysis results, computing a large number of replicates is computationally expensive for typical genotype datasets. We thus suggest a trade-off between the accuracy of the stability estimate and the resource usage. To this end, we implement a bootstopping procedure intended to determine convergence of the bootstrapping procedure. Once every max(10, threads) (*) replicates, we perform the following heuristic convergence check: Let \(N\) be the number of replicates computed when performing the convergence check. We first create 10 random subsets of size \(int(N/2)\) by sampling from all \(N\) replicates. We then compute the Pandora Stability (PS) for each of the 10 subsets and compute the relative difference of PS values between all possible pairs of subsets \((PS_1, PS_2)\) as \(\frac{\left|PS_1 - PS_2\right|}{PS_2}\). We assume convergence if all pairwise relative differences are below \(X \cdot 100\%\), where \(X\) is the set tolerance. If we determine that the bootstrap has converged, all remaining bootstrap computations are cancelled.

(*) The reasoning for checking every max(10, threads) replicates is the following: if Pandora runs on a machine with e.g. 48 provided threads, 48 bootstraps will be computed in parallel and will terminate at approximately the same time. If we checked convergence every 10 replicates, we would have to perform 4 checks, three of which are unnecessary: since the 48 replicates are already computed anyway, we might as well use them instead of throwing away 30 in case 10 would have been sufficient.
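The check interval from the footnote can be worked through numerically. The sketch below is an illustration only: the function name convergence_check_points is hypothetical, and it assumes that a check happens exactly at each multiple of max(10, threads) up to n_bootstraps, which the text above does not spell out.

```python
import os
from typing import List, Optional


def convergence_check_points(n_bootstraps: int, threads: Optional[int] = None) -> List[int]:
    """Replicate counts at which a convergence check would run (assumed schedule)."""
    if threads is None:
        # Mirrors the documented default of using all system threads.
        threads = os.cpu_count() or 1
    interval = max(10, threads)
    return list(range(interval, n_bootstraps + 1, interval))


print(convergence_check_points(100, threads=48))  # [48, 96]
print(convergence_check_points(100, threads=4))   # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```

With 48 threads, the first check happens only after the first parallel batch of 48 replicates has finished, rather than four times within that batch.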