pandora.sliding_window module

pandora.sliding_window.sliding_window_embedding(dataset: EigenDataset, n_windows: int, result_dir: Path, smartpca: str | Path, embedding: EmbeddingAlgorithm, n_components: int, threads: int | None = None, redo: bool = False, keep_windows: bool = False, smartpca_optional_settings: Dict | None = None) → List[EigenDataset]

Separates the given EigenDataset into n_windows sliding-window datasets and performs PCA/MDS analysis (as specified by embedding) for each window.

Note that unless threads=1, the computation is performed in parallel.

Parameters:

dataset (EigenDataset): Dataset object to separate into windows.
n_windows (int): Number of sliding windows to separate the dataset into.
result_dir (pathlib.Path): Directory in which to store all result files.
smartpca (Executable): Path pointing to an executable of the EIGENSOFT smartpca tool.
embedding (EmbeddingAlgorithm): Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.
n_components (int): Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.
seed (int, default=None): Seed to initialize the random number generator with.
threads (int, default=None): Number of threads to use for parallel window embedding. Default is to use all system threads.
redo (bool, default=False): Whether to rerun analyses in case the result files are already present.
keep_windows (bool, default=False): Whether to store all intermediate window-dataset ind, geno, and snp files. Note that setting this to True might require a substantial amount of disk space.
smartpca_optional_settings (Dict, default=None): Additional smartpca settings. The following options are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Note that this option is only used when embedding == EmbeddingAlgorithm.PCA.

Returns:

windows (List[EigenDataset]): List of n_windows subsets as EigenDataset objects. Each of the resulting window datasets will have either window.pca != None or window.mds != None depending on the selected embedding option.

pandora.sliding_window.sliding_window_embedding_numpy(dataset: NumpyDataset, n_windows: int, embedding: EmbeddingAlgorithm, n_components: int, threads: int | None = None, distance_metric: Callable[[numpy.ndarray, pandas.Series], Tuple[numpy.ndarray, pandas.Series]] = euclidean_sample_distance, imputation: str | None = 'mean') → List[NumpyDataset]

Separates the given NumpyDataset into n_windows sliding-window datasets and performs PCA/MDS analysis (as specified by embedding) for each window.

Note that unless threads=1, the computation is performed in parallel.

Parameters:

dataset (NumpyDataset)

Dataset object to separate into windows.

n_windows (int)

Number of sliding windows to separate the dataset into.

embedding (EmbeddingAlgorithm)

Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.

n_components (int)

Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.

threads (int, default=None)

Number of threads to use for parallel window embedding. Default is to use all system threads.

distance_metric (Callable[[npt.NDArray, pd.Series, str], Tuple[npt.NDArray, pd.Series]], default=euclidean_sample_distance)

Distance metric used to compute the distance matrix that serves as input for MDS. This is expected to be a function that receives the numpy array of sequences, the population of each sequence, and the imputation method, and returns the distance matrix together with the respective population for each row. The resulting distance matrix is of size (n, m) and the resulting populations are expected to be of size (n, 1). Default is distance_metrics::euclidean_sample_distance (the pairwise Euclidean distance of all samples).
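As an illustration of the callable shape described for distance_metric, the following is a minimal sketch of a custom metric. The function name manhattan_sample_distance is hypothetical and not part of pandora, and the sketch assumes that missing values are encoded as NaN and that the callable receives the three arguments listed in the description (data, populations, imputation method).

```python
import numpy as np
import pandas as pd


def manhattan_sample_distance(input_data, populations, imputation):
    """Hypothetical custom metric: pairwise Manhattan (L1) distance between samples."""
    if imputation == "mean":
        # Assumption: missing entries are NaN; replace each with the mean of its SNP column.
        snp_means = np.nanmean(input_data, axis=0)
        input_data = np.where(np.isnan(input_data), snp_means, input_data)
    # Broadcast (n, 1, m) against (1, n, m) and sum absolute differences over SNPs.
    distances = np.abs(input_data[:, None, :] - input_data[None, :, :]).sum(axis=2)
    # No samples are dropped here, so the populations are returned unchanged.
    return distances, populations


example_data = np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [1.0, np.nan, 1.0]])
example_populations = pd.Series(["pop_a", "pop_a", "pop_b"])
distance_matrix, kept_populations = manhattan_sample_distance(
    example_data, example_populations, "mean"
)
```

Under these assumptions, such a function could be passed as distance_metric=manhattan_sample_distance when embedding is EmbeddingAlgorithm.MDS.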

imputation (Optional[str], default="mean")

Imputation method to use. Available options are:

"mean": Imputes missing values with the average of the respective SNP.
"remove": Removes all SNPs with at least one missing value.
None: Does not impute missing data.

Note that depending on the distance_metric, not all imputation methods are supported. See the respective documentations in the distance_metrics module.
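To make the three imputation options concrete, here is a small numpy sketch of what each one does to a samples-by-SNPs matrix. This is not pandora's implementation; the helper name impute is hypothetical, and the sketch assumes missing values are encoded as NaN.

```python
import numpy as np


def impute(genotypes, imputation):
    """Hypothetical helper mirroring the documented imputation options."""
    if imputation == "mean":
        # Replace each missing value with the average of its SNP (column).
        snp_means = np.nanmean(genotypes, axis=0)
        return np.where(np.isnan(genotypes), snp_means, genotypes)
    if imputation == "remove":
        # Keep only SNPs (columns) without any missing value.
        complete_snps = ~np.isnan(genotypes).any(axis=0)
        return genotypes[:, complete_snps]
    if imputation is None:
        # Leave missing data untouched.
        return genotypes
    raise ValueError(f"Unsupported imputation method: {imputation}")


genotypes = np.array([[0.0, 2.0, np.nan], [2.0, 2.0, 1.0], [1.0, np.nan, 1.0]])
mean_imputed = impute(genotypes, "mean")
snps_removed = impute(genotypes, "remove")
not_imputed = impute(genotypes, None)
```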

Returns:

windows (List[NumpyDataset]): List of n_windows subsets as NumpyDataset objects. Each of the resulting window datasets will have either window.pca != None or window.mds != None depending on the selected embedding option.
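The general idea of a per-window embedding can be sketched with plain numpy. The sketch below is an illustration only and not pandora's implementation: the function name embed_windows is hypothetical, the windows are assumed to be contiguous and non-overlapping blocks along the SNP axis (the library's actual windowing may differ), and PCA is computed through an SVD of the centered window.

```python
import numpy as np


def embed_windows(genotypes, n_windows, n_components):
    """Hypothetical sketch: PCA scores for each block of SNPs."""
    embeddings = []
    # Assumption: contiguous, non-overlapping windows along the SNP axis.
    for window in np.array_split(genotypes, n_windows, axis=1):
        # Center each SNP before the decomposition.
        centered = window - window.mean(axis=0)
        # For centered data, U * S gives the principal component scores.
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        embeddings.append(u[:, :n_components] * s[:n_components])
    return embeddings


rng = np.random.default_rng(0)
# Toy data: 6 samples, 20 SNPs, genotype values in {0, 1, 2}.
toy_genotypes = rng.integers(0, 3, size=(6, 20)).astype(float)
window_embeddings = embed_windows(toy_genotypes, n_windows=4, n_components=2)
```

Each entry of window_embeddings holds one row per sample and one column per component, which corresponds to the per-window result attached to each returned window dataset.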