pandora.sliding_window module
- pandora.sliding_window.sliding_window_embedding(dataset: EigenDataset, n_windows: int, result_dir: Path, smartpca: str | Path, embedding: EmbeddingAlgorithm, n_components: int, threads: int | None = None, redo: bool = False, keep_windows: bool = False, smartpca_optional_settings: Dict | None = None) -> List[EigenDataset]
Separates the given EigenDataset into n_windows sliding-window datasets and performs PCA/MDS analysis (as specified by embedding) for each window.
Note that unless threads=1, the computation is performed in parallel.
- Parameters:
- dataset: EigenDataset
Dataset object to separate into windows.
- n_windows: int
Number of sliding windows to separate the dataset into.
- result_dir: pathlib.Path
Directory in which to store all result files.
- smartpca: Executable
Path pointing to an executable of the EIGENSOFT smartpca tool.
- embedding: EmbeddingAlgorithm
Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.
- n_components: int
Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.
- seed: int, default=None
Seed to initialize the random number generator with.
- threads: int, default=None
Number of threads to use for parallel window embedding. Default is to use all system threads.
- redo: bool, default=False
Whether to rerun analyses in case the result files are already present.
- keep_windows: bool, default=False
Whether to store all intermediate window-dataset ind, geno, and snp files. Note that setting this to True might require a substantial amount of disk space.
- smartpca_optional_settings: Dict, default=None
Additional smartpca settings. Not allowed are the following options: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Note that this option is only used when embedding == EmbeddingAlgorithm.PCA.
- Returns:
- windows: List[EigenDataset]
List of n_windows subsets as EigenDataset objects. Each of the resulting window datasets will have either window.pca != None or window.mds != None, depending on the selected embedding option.
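To make the windowing step concrete, the following is a minimal sketch of how a genotype matrix could be cut into n_windows column windows. It is not Pandora's implementation: the helper names window_bounds and split_into_windows are hypothetical, plain Python lists stand in for the dataset, and the windows are assumed to be contiguous and non-overlapping, which may differ from how Pandora actually places its windows.

```python
from typing import List, Sequence, Tuple


def window_bounds(n_snps: int, n_windows: int) -> List[Tuple[int, int]]:
    """Return (start, end) SNP index pairs, end exclusive, one per window."""
    if n_windows < 1:
        raise ValueError("n_windows must be at least 1")
    if n_snps < n_windows:
        raise ValueError("cannot create more windows than there are SNPs")
    base, extra = divmod(n_snps, n_windows)
    bounds = []
    start = 0
    for i in range(n_windows):
        # The first `extra` windows get one more SNP, so sizes differ by at most 1.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def split_into_windows(
    genotypes: Sequence[Sequence[int]], n_windows: int
) -> List[List[List[int]]]:
    """Slice every sample row into the same SNP windows (rows are samples)."""
    n_snps = len(genotypes[0])
    return [
        [list(row[start:end]) for row in genotypes]
        for start, end in window_bounds(n_snps, n_windows)
    ]


# Two samples, seven SNPs, split into three windows.
genotypes = [
    [0, 1, 2, 0, 1, 2, 0],
    [2, 1, 0, 2, 1, 0, 2],
]
windows = split_into_windows(genotypes, n_windows=3)
print([len(window[0]) for window in windows])  # prints [3, 2, 2]
```

Every window keeps all samples and only a subset of the SNPs, so each window can be embedded independently of the others, which is what makes the per-window work easy to run in parallel.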
- pandora.sliding_window.sliding_window_embedding_numpy(dataset: NumpyDataset, n_windows: int, embedding: EmbeddingAlgorithm, n_components: int, threads: int | None = None, distance_metric: Callable[[numpy.ndarray, pandas.Series], Tuple[numpy.ndarray, pandas.Series]] = euclidean_sample_distance, imputation: str | None = 'mean') -> List[NumpyDataset]
Separates the given NumpyDataset into n_windows sliding-window datasets and performs PCA/MDS analysis (as specified by embedding) for each window.
Note that unless threads=1, the computation is performed in parallel.
- Parameters:
- dataset: NumpyDataset
Dataset object to separate into windows.
- n_windows: int
Number of sliding windows to separate the dataset into.
- embedding: EmbeddingAlgorithm
Dimensionality reduction technique to apply. Allowed options are EmbeddingAlgorithm.PCA for PCA analysis and EmbeddingAlgorithm.MDS for MDS analysis.
- n_components: int
Number of dimensions to reduce the data to. The recommended number is 10 for PCA and 2 for MDS.
- threads: int, default=None
Number of threads to use for parallel window embedding. Default is to use all system threads.
- distance_metric: Callable[[npt.NDArray, pd.Series, str], Tuple[npt.NDArray, pd.Series]], default=euclidean_sample_distance
Distance metric to use for computing the distance matrix input for MDS. This is expected to be a function that receives the numpy array of sequences, the population for each sequence, and the imputation method as input and outputs the distance matrix and the respective populations for each row. The resulting distance matrix is of size (n, m) and the resulting populations are expected to be of size (n, 1). Default is distance_metrics::euclidean_sample_distance (the pairwise Euclidean distance of all samples).
- imputation: Optional[str], default="mean"
Imputation method to use. Available options are:
- "mean": Imputes missing values with the average of the respective SNP.
- "remove": Removes all SNPs with at least one missing value.
- None: Does not impute missing data.
Note that depending on the distance_metric, not all imputation methods are supported. See the respective documentation in the distance_metrics module.
- Returns:
- windows: List[NumpyDataset]
List of n_windows subsets as NumpyDataset objects. Each of the resulting window datasets will have either window.pca != None or window.mds != None, depending on the selected embedding option.
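The contract described for distance_metric (data, populations, and imputation method in; distance matrix and populations out) together with the three imputation options can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the library's euclidean_sample_distance: the names impute and toy_euclidean_distance are hypothetical, plain Python lists replace the numpy array and pandas Series so the block runs without dependencies, NaN is assumed to mark a missing genotype, and the fallback of 0.0 for a SNP with no observed values is an arbitrary choice.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def impute(data: Sequence[Sequence[float]], imputation: Optional[str]) -> Matrix:
    """Apply one of the documented options: "mean", "remove", or None."""
    rows = [[float(value) for value in row] for row in data]  # work on a copy
    if imputation is None:
        return rows  # missing values are left untouched
    n_snps = len(rows[0])
    if imputation == "remove":
        # Keep only SNP columns without any missing value.
        keep = [j for j in range(n_snps) if not any(math.isnan(r[j]) for r in rows)]
        return [[r[j] for j in keep] for r in rows]
    if imputation == "mean":
        for j in range(n_snps):
            observed = [r[j] for r in rows if not math.isnan(r[j])]
            # Assumption: a SNP with no observed value is filled with 0.0.
            mean = sum(observed) / len(observed) if observed else 0.0
            for r in rows:
                if math.isnan(r[j]):
                    r[j] = mean
        return rows
    raise ValueError(f"unsupported imputation: {imputation!r}")


def toy_euclidean_distance(
    data: Sequence[Sequence[float]],
    populations: Sequence[str],
    imputation: Optional[str],
) -> Tuple[Matrix, List[str]]:
    """Return the pairwise Euclidean sample distances and the row populations."""
    filled = impute(data, imputation)
    if any(math.isnan(value) for row in filled for value in row):
        # This metric cannot handle missing data, so None is rejected here.
        raise ValueError("missing data present; use 'mean' or 'remove'")
    distances = [[math.dist(a, b) for b in filled] for a in filled]
    return distances, list(populations)


nan = float("nan")
data = [[0, 2, nan], [2, 2, 1], [0, 0, 1]]
pops = ["A", "A", "B"]
dist, out_pops = toy_euclidean_distance(data, pops, "mean")
print(dist[0][1], out_pops)  # prints 2.0 ['A', 'A', 'B']
```

Rejecting imputation=None inside the metric mirrors the note above that not every metric supports every imputation method; a metric that tolerates missing values could accept None instead.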