pandora.bootstrap module
- pandora.bootstrap.bootstrap_and_embed_multiple(dataset: EigenDataset, n_bootstraps: int = 100, result_dir: pathlib.Path | None = None, smartpca: Executable = 'smartpca', embedding: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, n_components: int = 10, seed: int | None = None, threads: int | None = None, redo: bool = False, keep_bootstraps: bool = False, bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: float = 0.05, smartpca_optional_settings: Dict | None = None, logger: loguru.Logger | None = None) -> List[EigenDataset]
Draws n_bootstraps bootstrap datasets of the provided EigenDataset and performs PCA/MDS analysis (as specified by embedding) for each bootstrap.
If bootstrap_convergence_check is set, this method will draw at most n_bootstraps bootstraps. See Notes below for further details. Note that unless threads=1, the computation is performed in parallel.
- Parameters:
- dataset : EigenDataset
Dataset object to base the bootstrap replicates on.
- n_bootstraps : int, default=100
Number of bootstrap replicates to draw. If bootstrap_convergence_check is set, this is the upper limit on the number of replicates.
- result_dir : Optional[pathlib.Path], default=None
Directory in which to store all result files. If None, the bootstraps are stored in the directory of the input dataset.
- smartpca : Executable, default="smartpca"
Path pointing to an executable of the EIGENSOFT smartpca tool. By default, Pandora expects "smartpca" to be available via the PATH variable.
- embedding : EmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA
Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.
- n_components : int, default=10
Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.
- seed : int, default=None
Seed to initialize the random number generator with.
- threads : int, default=None
Number of threads to use for parallel bootstrap generation. Default is to use all system threads.
- redo : bool, default=False
Whether to rerun analyses in case the result files are already present.
- keep_bootstraps : bool, default=False
Whether to store all intermediate bootstrap ind, geno, and snp files. Note that setting this to True might require a substantial amount of disk space depending on the size of your dataset.
- bootstrap_convergence_check : bool, default=True
Whether to automatically determine bootstrap convergence. If True, will only compute as many replicates as required for convergence according to our heuristic (see Notes below).
- bootstrap_convergence_tolerance : float, default=0.05
Determines the deviation tolerance when checking for bootstrap convergence. A value of X means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
- smartpca_optional_settings : Dict, default=None
Additional smartpca settings. The following options are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Note that this option is only used for PCA analyses.
- logger : loguru.Logger, default=None
Optional logger instance, used to log debug messages.
- Returns:
- bootstraps : List[EigenDataset]
List of up to n_bootstraps bootstrap replicates as EigenDataset objects. Each of the resulting datasets will have either bootstrap.pca != None or bootstrap.mds != None, depending on the selected embedding option.
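As a minimal sketch of how the returned list might be consumed, the helper below picks whichever embedding is populated on each replicate. Only the attribute names pca and mds are taken from the description above; the helper name select_embeddings is hypothetical, and stand-in objects are used so that the snippet runs without Pandora or smartpca installed.

```python
from types import SimpleNamespace
from typing import Any, List


def select_embeddings(bootstraps: List[Any]) -> List[Any]:
    """Return the populated embedding (pca or mds) of each bootstrap replicate."""
    embeddings = []
    for bootstrap in bootstraps:
        # One of the two attributes is expected to be set, depending on
        # the embedding option that was passed to the bootstrap call.
        pca = getattr(bootstrap, "pca", None)
        mds = getattr(bootstrap, "mds", None)
        embedding = pca if pca is not None else mds
        if embedding is None:
            raise ValueError("Replicate has neither a PCA nor an MDS embedding.")
        embeddings.append(embedding)
    return embeddings


# Stand-in objects (not real EigenDataset instances) mimicking PCA replicates.
fake_bootstraps = [SimpleNamespace(pca=f"pca_{i}", mds=None) for i in range(3)]
print(select_embeddings(fake_bootstraps))
```

With Pandora installed, the list passed to the helper would instead come from a call such as bootstrap_and_embed_multiple(dataset, n_bootstraps=100, seed=42, threads=4), where dataset is an existing EigenDataset.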
Notes
Bootstrap Convergence ("Bootstopping"): While more bootstraps yield more reliable stability analysis results, computing a large number of replicates is computationally expensive for typical genotype datasets. We thus suggest a trade-off between the accuracy of the stability estimate and the resource usage. To this end, we implement a bootstopping procedure intended to determine convergence of the bootstrapping procedure. Once every max(10, threads) (*) replicates, we perform the following heuristic convergence check: Let \(N\) be the number of replicates computed when performing the convergence check. We first create 10 random subsets of size \(int(N/2)\) by sampling from all \(N\) replicates. We then compute the Pandora Stability (PS) for each of the 10 subsets and compute the relative difference of PS values between all possible pairs of subsets \((PS_1, PS_2)\) by computing \(\frac{\left|PS_1 - PS_2\right|}{PS_2}\). We assume convergence if all pairwise relative differences are below \(X * 100\%\), where X is the set tolerance. If we determine that the bootstrap has converged, all remaining bootstrap computations are cancelled.
(*) The reasoning for checking every max(10, threads) replicates is the following: if Pandora runs on a machine with, e.g., 48 provided threads, 48 bootstraps will be computed in parallel and will terminate at approximately the same time. If we checked convergence every 10 replicates, we would have to perform 4 checks, three of which are unnecessary: since the 48 replicates are already computed anyway, we might as well use them instead of discarding most of them in case 10 would have been sufficient.
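The convergence check described in the Notes can be sketched as follows. This is an illustration under stated assumptions, not Pandora's implementation: the function name bootstrap_converged and the stability callback (a stand-in for the Pandora Stability computation) are hypothetical, subsets are assumed to be drawn without replacement, and both orderings of each pair are checked because the formula divides by \(PS_2\).

```python
import itertools
import random
import statistics
from typing import Callable, Optional, Sequence


def bootstrap_converged(
    replicates: Sequence,
    stability: Callable[[Sequence], float],
    tolerance: float = 0.05,
    n_subsets: int = 10,
    seed: Optional[int] = None,
) -> bool:
    """Heuristic check: do random half-size subsets agree on the stability value?"""
    if len(replicates) < 2:
        return False
    rng = random.Random(seed)
    subset_size = len(replicates) // 2  # int(N / 2)
    # Assumption: each subset is sampled without replacement from all N replicates.
    ps_values = [
        stability(rng.sample(list(replicates), subset_size))
        for _ in range(n_subsets)
    ]
    # Relative difference |PS_1 - PS_2| / PS_2 for every ordered pair of subsets.
    for ps_1, ps_2 in itertools.permutations(ps_values, 2):
        if ps_2 == 0 or abs(ps_1 - ps_2) / ps_2 >= tolerance:
            return False
    return True


# Stand-in data: each "replicate" is just a number and the stand-in
# stability measure is the mean of the subset.
identical_replicates = [0.9] * 20
print(bootstrap_converged(identical_replicates, statistics.mean, seed=0))
```

In a real run, the replicates would be the embedded bootstrap datasets and the callback would compute the PS over the given subset.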
- pandora.bootstrap.bootstrap_and_embed_multiple_numpy(dataset: NumpyDataset, n_bootstraps: int = 100, embedding: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, n_components: int = 10, seed: int | None = None, threads: int | None = None, distance_metric: Callable[[numpy.ndarray, pandas.Series], Tuple[numpy.ndarray, pandas.Series]] = euclidean_sample_distance, imputation: str | None = 'mean', bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: float = 0.05) -> List[NumpyDataset]
Draws n_bootstraps bootstrap datasets of the provided NumpyDataset and performs PCA/MDS analysis (as specified by embedding) for each bootstrap.
If bootstrap_convergence_check is set, this method will draw at most n_bootstraps bootstraps. See Notes below for further details. Note that unless threads=1, the computation is performed in parallel.
- Parameters:
- dataset : NumpyDataset
Dataset object to base the bootstrap replicates on.
- n_bootstraps : int, default=100
Number of bootstrap replicates to draw. If bootstrap_convergence_check is set, this is the upper limit on the number of replicates.
- embedding : EmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA
Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.
- n_components : int, default=10
Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.
- seed : int, default=None
Seed to initialize the random number generator with.
- threads : int, default=None
Number of threads to use for parallel bootstrap generation. Default is to use all system threads.
- distance_metric : Callable[[npt.NDArray, pd.Series, str], Tuple[npt.NDArray, pd.Series]], default=euclidean_sample_distance
Distance metric to use for computing the distance matrix input for MDS. This is expected to be a function that receives the numpy array of sequences, the population for each sequence, and the imputation method as input, and outputs the distance matrix and the respective populations for each row. The resulting distance matrix is of size \((n, m)\) and the resulting populations are expected to be of size \((n, 1)\). Default is distance_metrics.euclidean_sample_distance (the pairwise Euclidean distance of all samples).
- imputation : Optional[str], default="mean"
Imputation method to use. Available options are:
  - mean: Imputes missing values with the average of the respective SNP.
  - remove: Removes all SNPs with at least one missing value.
  - None: Does not impute missing data.
Note that depending on the distance_metric, not all imputation methods are supported. See the respective documentation in the distance_metrics module.
- bootstrap_convergence_check : bool, default=True
Whether to automatically determine bootstrap convergence. If True, will only compute as many replicates as required for convergence according to our heuristic (see Notes below).
- bootstrap_convergence_tolerance : float, default=0.05
Determines the level of deviation tolerance when checking for bootstrap convergence. A value of X means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
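To illustrate the shape of a callable that could be passed as distance_metric, the following sketch implements a pairwise Manhattan distance together with the three imputation options listed above. The function name example_manhattan_distance is hypothetical, and the sketch assumes that missing values are encoded as NaN and that a square sample-by-sample matrix is an acceptable result; neither assumption is confirmed by this page.

```python
from typing import Optional, Tuple

import numpy as np
import pandas as pd


def example_manhattan_distance(
    input_data: np.ndarray, populations: pd.Series, imputation: Optional[str]
) -> Tuple[np.ndarray, pd.Series]:
    """Pairwise Manhattan distance between samples (rows of input_data)."""
    data = np.asarray(input_data, dtype=float)
    if imputation == "mean":
        # Replace each missing entry with the mean of its SNP (column).
        column_means = np.nanmean(data, axis=0)
        data = np.where(np.isnan(data), column_means, data)
    elif imputation == "remove":
        # Drop every SNP (column) that has at least one missing value.
        data = data[:, ~np.isnan(data).any(axis=0)]
    elif imputation is not None:
        raise ValueError(f"Unsupported imputation method: {imputation}")
    elif np.isnan(data).any():
        raise ValueError("This sketch cannot handle missing values without imputation.")
    # Broadcast to shape (n, n, m) and sum the absolute differences over SNPs.
    distances = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)
    return distances, populations


# Toy data: 3 samples, 3 SNPs, one missing value (assumed to be NaN).
genotypes = np.array([[0.0, 1.0, 2.0], [2.0, np.nan, 0.0], [1.0, 1.0, 1.0]])
populations = pd.Series(["popA", "popB", "popA"])
distances, pops = example_manhattan_distance(genotypes, populations, "mean")
print(distances)
```

Raising an error for unsupported imputation values mirrors the note that not every metric supports every imputation method.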
- Returns:
- bootstraps : List[NumpyDataset]
List of up to n_bootstraps bootstrap replicates as NumpyDataset objects. Each of the resulting datasets will have either bootstrap.pca != None or bootstrap.mds != None, depending on the selected embedding option.
Notes
Bootstrap Convergence ("Bootstopping"): While more bootstraps yield more reliable stability analysis results, computing a large number of replicates is computationally expensive for typical genotype datasets. We thus suggest a trade-off between the accuracy of the stability estimate and the resource usage. To this end, we implement a bootstopping procedure intended to determine convergence of the bootstrapping procedure. Once every max(10, threads) (*) replicates, we perform the following heuristic convergence check: Let \(N\) be the number of replicates computed when performing the convergence check. We first create 10 random subsets of size \(int(N/2)\) by sampling from all \(N\) replicates. We then compute the Pandora Stability (PS) for each of the 10 subsets and compute the relative difference of PS values between all possible pairs of subsets \((PS_1, PS_2)\) by computing \(\frac{\left|PS_1 - PS_2\right|}{PS_2}\). We assume convergence if all pairwise relative differences are below \(X * 100\%\), where X is the set tolerance. If we determine that the bootstrap has converged, all remaining bootstrap computations are cancelled.
(*) The reasoning for checking every max(10, threads) replicates is the following: if Pandora runs on a machine with, e.g., 48 provided threads, 48 bootstraps will be computed in parallel and will terminate at approximately the same time. If we checked convergence every 10 replicates, we would have to perform 4 checks, three of which are unnecessary: since the 48 replicates are already computed anyway, we might as well use them instead of discarding most of them in case 10 would have been sufficient.
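The arithmetic behind the footnote can be reproduced with a few lines. The helper names check_interval and check_points are hypothetical, and the sketch only counts at which replicate numbers a check would fall for a given interval; how Pandora schedules the checks internally is not shown on this page.

```python
from typing import List


def check_interval(threads: int) -> int:
    """Number of replicates between two convergence checks."""
    return max(10, threads)


def check_points(n_replicates: int, interval: int) -> List[int]:
    """Replicate counts at which a check falls within the first n_replicates."""
    return list(range(interval, n_replicates + 1, interval))


threads = 48
# Checking every 10 replicates while 48 finish at roughly the same time:
print(check_points(48, 10))
# Checking every max(10, threads) replicates instead:
print(check_points(48, check_interval(threads)))
```

With a fixed interval of 10, four checks fall within the first 48 replicates; with the interval tied to the thread count, a single check covers the whole batch.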