pandora.pandora module

class pandora.pandora.Pandora(pandora_config: PandoraConfig)

Bases: object

Pandora class for encapsulating a Pandora run and its results.

Parameters:

pandora_configPandoraConfig: PandoraConfig object used to determine the analyses to run

Attributes:

pandora_configPandoraConfig

PandoraConfig object used to determine the analyses to run

datasetEigenDataset

EigenDataset object that contains the input data provided by the user

replicatesList[EigenDataset]

List of bootstrap replicates / sliding windows of self.dataset. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

pairwise_stabilitiespd.DataFrame

Pandas dataframe containing the Pandora stability scores for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

pandora_stabilityfloat

Overall Pandora stability of the dataset under bootstrapping or sliding-window analysis. This is None until self.bootstrap_embeddings() or self.sliding_window() has been called.

pairwise_cluster_stabilitiespd.DataFrame

Pandas dataframe containing the Pandora cluster stability scores for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

pandora_cluster_stabilityfloat

Overall Pandora cluster stability of the dataset under bootstrapping or sliding-window analysis. This is None until self.bootstrap_embeddings() or self.sliding_window() has been called.

sample_support_valuespd.DataFrame

Pandas dataframe containing the support values for all samples of self.dataset for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

Methods

`bootstrap_embeddings`()	Draws bootstrap replicates of `self.dataset` and computes and compares the respective embedding for all bootstrap replicates.
`embed_dataset`()	Performs dimensionality reduction on `self.dataset`.
`log_and_save_replicates_results`()	Logs the results of the bootstrap/sliding-window analyses using `pandora.logging.logger` and also saves the results of the analyses to the respective files as specified by `self.pandora_config`.
`sliding_window`()	Separates `self.dataset` into `self.pandora_config.n_replicates` overlapping windows and computes and compares the respective embedding for all of these windows.
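A minimal usage sketch of the class, built only from the signatures documented on this page. The helper name `run_bootstrap_analysis`, the replicate count, and the seed are illustrative assumptions; whether additional setup (for example creating `result_dir` beforehand) is required is not covered here.

```python
from pathlib import Path


def run_bootstrap_analysis(dataset_prefix: Path, result_dir: Path) -> float:
    """Run a bootstrap stability analysis and return the overall stability."""
    # Imported lazily so this sketch can be loaded without Pandora installed.
    from pandora.pandora import Pandora, PandoraConfig

    config = PandoraConfig(
        dataset_prefix=dataset_prefix,
        result_dir=result_dir,
        n_replicates=20,  # illustrative value; the documented default is 100
        seed=42,  # illustrative fixed seed for reproducibility
    )
    pandora = Pandora(config)

    # Documented to set replicates, pairwise_stabilities, pandora_stability, ...
    pandora.bootstrap_embeddings()
    # Documented to raise PandoraException if no results were computed yet.
    pandora.log_and_save_replicates_results()
    return pandora.pandora_stability
```

Calling `run_bootstrap_analysis(Path("data/example"), Path("results"))` would require Pandora and its external tools to be installed; the paths are placeholders.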

bootstrap_embeddings() → None

Draws bootstrap replicates of self.dataset and computes and compares the respective embedding for all bootstrap replicates.

The parameters (e.g. which method to use) are determined by the settings configured in self.pandora_config. If run successfully, the following attributes of self will be set:

self.replicates

self.pairwise_stabilities

self.pandora_stability

self.pairwise_cluster_stabilities

self.pandora_cluster_stability

self.sample_support_values

Returns:

None
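To illustrate what drawing a bootstrap replicate means in general, the following sketch resamples indices with replacement using a seeded random number generator. It is not Pandora's implementation: the function name `draw_bootstrap_indices` and the assumption that resampling happens over SNP indices are illustrative only.

```python
import random
from typing import List


def draw_bootstrap_indices(n_snps: int, seed: int) -> List[int]:
    """Sample n_snps indices with replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    # Sampling with replacement: some indices repeat, others are left out.
    return sorted(rng.choices(range(n_snps), k=n_snps))


replicate = draw_bootstrap_indices(n_snps=8, seed=0)
```

Using the same seed yields the same indices, which is the general idea behind storing bootstrapped indices for reproducibility.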

embed_dataset() → None

Performs dimensionality reduction on self.dataset.

The parameters (e.g. which method to use) are determined by the settings configured in self.pandora_config.

Returns:

None

Raises:

PandoraConfigException

If self.pandora_config.embedding_algorithm is not a valid EmbeddingAlgorithm.

log_and_save_replicates_results() → None

Logs the results of the bootstrap/sliding-window analyses using pandora.logging.logger and also saves the results of the analyses to the respective files as specified by self.pandora_config.

Returns:

None

Raises:

PandoraException

If the results were not computed yet and thus there are no results to log.

sliding_window() → None

Separates self.dataset into self.pandora_config.n_replicates overlapping windows and computes and compares the respective embedding for all of these windows.

The parameters (e.g. which method to use) are determined by the settings configured in self.pandora_config. If run successfully, the following attributes of self will be set:

self.replicates

self.pairwise_stabilities

self.pandora_stability

self.pairwise_cluster_stabilities

self.pandora_cluster_stability

self.sample_support_values

Returns:

None
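As a rough illustration of splitting a dataset into overlapping windows, the sketch below partitions item indices into `n_windows` ranges where adjacent ranges share `overlap` items. How Pandora actually sizes and overlaps its windows is not stated here, so the function `overlapping_windows` and its `overlap` parameter are assumptions made for this example.

```python
from typing import List


def overlapping_windows(n_items: int, n_windows: int, overlap: int) -> List[range]:
    """Split indices 0..n_items-1 into n_windows ranges with shared borders."""
    if n_windows <= 0 or n_items < n_windows or overlap < 0:
        raise ValueError("need 0 < n_windows <= n_items and overlap >= 0")
    step = n_items // n_windows
    windows = []
    for i in range(n_windows):
        start = i * step
        if i == n_windows - 1:
            # The last window absorbs any remainder.
            end = n_items
        else:
            # Extend into the next window by `overlap` items.
            end = min(n_items, start + step + overlap)
        windows.append(range(start, end))
    return windows


windows = overlapping_windows(n_items=10, n_windows=3, overlap=2)
```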

class pandora.pandora.PandoraConfig(*, dataset_prefix: Path, result_dir: Path, file_format: FileFormat = FileFormat.EIGENSTRAT, convertf: str | Path = 'convertf', n_replicates: Annotated[int, Ge(ge=0)] = 100, keep_replicates: bool = False, bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: Annotated[float, Ge(ge=0)] = 0.05, n_components: Annotated[int, Ge(ge=0)] = 10, embedding_algorithm: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, smartpca: str | Path = 'smartpca', smartpca_optional_settings: Dict[str, Any] | None = None, embedding_populations: Path | None = None, support_value_rogue_cutoff: float = 0.5, kmeans_k: int | None = None, analysis_mode: AnalysisMode = AnalysisMode.BOOTSTRAP, redo: bool = False, seed: int = 1743406135, threads: Annotated[int, Gt(gt=0)] = 2, result_decimals: Annotated[int, Ge(ge=0)] = 2, verbosity: int = 1, plot_results: bool = False, plot_dim_x: Annotated[int, Ge(ge=0)] = 0, plot_dim_y: Annotated[int, Ge(ge=0)] = 1)

Bases: BaseModel

Pydantic dataclass encapsulating the settings required to run Pandora.

Parameters:

dataset_prefixpathlib.Path: File path prefix pointing to the dataset to use for the Pandora analyses. Pandora will look for files called <input>.* so make sure all files have the same prefix.
result_dirpathlib.Path: Directory in which to store all (intermediate) results.
file_formatFileFormat, default=FileFormat.EIGENSTRAT: Format of the input dataset. Can be ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, PACKEDANCESTRYMAP. Default is EIGENSTRAT.
convertfExecutable, default="convertf": File path pointing to an executable of Eigensoft's convertf tool. Convertf is used if the provided dataset is not in EIGENSTRAT format. Default is 'convertf'. This will only work if convertf is installed system-wide.
n_replicatesPositiveInt, default=100: Number of bootstrap replicates or sliding windows to compute. In case of bootstrapping, make sure to also set the bootstrap_convergence_check parameter as desired.
keep_replicatesbool, default=False: Whether to store all intermediate dataset files (.geno, .snp, .ind). Note that this will result in substantial storage consumption. Default is False. Note that the bootstrapped indices are stored as checkpoints for full reproducibility in any case.
bootstrap_convergence_checkbool, default=True: Whether to heuristically determine convergence of the bootstrapping procedure. If true, instead of computing n_replicates bootstraps and embeddings, Pandora will check for convergence once every max(10, threads) bootstrap embeddings are computed. If, according to our heuristic (see bootstrap.py for more details), the bootstrap procedure converged, all remaining tasks are cancelled and the stability is determined using only the number of replicates computed when convergence is determined. Note that this parameter is only relevant if analysis_mode is AnalysisMode.BOOTSTRAP.
bootstrap_convergence_toleranceNonNegativeFloat, default=0.05: Determines the level of deviation tolerance when checking for bootstrap convergence. A value of \(X\) means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
n_componentsPositiveInt, default=10: Number of dimensions to output and compare for PCA and MDS analyses. The recommended number is 10 for PCA and 2 for MDS. Default is 10, in correspondence with the default PCA embedding.
embedding_algorithmEmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA: Embedding to compute during the stability analysis. Can be either EmbeddingAlgorithm.PCA or EmbeddingAlgorithm.MDS.
smartpcaExecutable, default="smartpca": File path pointing to an executable of Eigensoft's smartpca tool. Smartpca is used for PCA analyses on the provided dataset. Default is 'smartpca'. This will only work if smartpca is installed system-wide.
smartpca_optional_settingsDict[str, Any], default=None: Optional additional settings to use when performing PCA with smartpca. Pandora supports all smartpca options except the following, which are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Use the following schema to set the options: dict(shrinkmode=True, numoutlieriter=1)
embedding_populationspathlib.Path, default=None: File containing a newline-separated list of population names. Only these populations will be used for the dimensionality reduction. In case of PCA analyses, all remaining samples in the dataset will be projected onto the PCA results.
support_value_rogue_cutofffloat, default=0.5: When plotting the support values, only samples with a support value lower than the support_value_rogue_cutoff will be annotated with their sample IDs. Note that all samples in the respective plot are color-coded according to their support value in any case.
kmeans_kPositiveInt, default=None: Number of clusters k to use for K-Means clustering of the dimensionality reduction embeddings. If not set, the optimal number of clusters will be automatically determined according to the Bayesian Information Criterion (BIC).
analysis_modeAnalysisMode, default=AnalysisMode.BOOTSTRAP: Whether Pandora should do bootstrap analysis or sliding-window analysis.
redobool, default=False: Whether to rerun all analyses in case the results files from a previous run are already present. Careful: this will overwrite existing results!
seedint, default=None: Seed to initialize the random number generator. This setting is recommended for reproducible analyses. Default is the current unix timestamp.
threadsNonNegativeInt, default=None: Number of threads to use for the analysis. Default is the number of CPUs available.
result_decimalsNonNegativeInt, default=2: Number of decimals to round the stability scores and support values in the output. Default is two decimals.
verbosityint, default=1: Verbosity of the output logging of Pandora.
- 0 = quiet, prints only errors and the results (loglevel = ERROR)
- 1 = verbose, prints all intermediate information (loglevel = INFO)
- 2 = debug, prints intermediate information and debug messages (loglevel = DEBUG)
plot_resultsbool, default=False: Whether to plot all dimensionality reduction results and sample support values.
plot_dim_xNonNegativeInt, default=0: Dimension to plot on the x-axis. Note that the dimensions are zero-indexed. To plot the first dimension set plot_dim_x = 0
plot_dim_yNonNegativeInt, default=1: Dimension to plot on the y-axis. Note that the dimensions are zero-indexed. To plot the second dimension set plot_dim_y = 1
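The description of bootstrap_convergence_check states that convergence is checked once every max(10, threads) bootstrap embeddings. The sketch below lists the replicate counts at which such checks would fall under that rule. The helper `convergence_checkpoints` is hypothetical, and the assumption that checks land exactly on multiples of the interval is not confirmed by this page.

```python
from typing import List


def convergence_checkpoints(n_replicates: int, threads: int) -> List[int]:
    """Replicate counts at which a convergence check would be due."""
    interval = max(10, threads)  # rule quoted from the parameter description
    return list(range(interval, n_replicates + 1, interval))


default_checks = convergence_checkpoints(n_replicates=100, threads=2)
many_threads_checks = convergence_checkpoints(n_replicates=100, threads=16)
```

With few threads the interval stays at 10; with more than 10 threads the interval grows to the thread count, so fewer checks happen for the same n_replicates.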

Attributes:

bootstrap_result_dir: Path in which to store all (intermediate) bootstrap results.
configfile: Returns a path to the pandora config yaml.
convertf_result_dir: Path where to store converted input files.
loglevel: Converts the int log-level to the respective logging module constant.
model_extra: Get extra fields set during validation.
model_fields_set: Returns the set of fields that have been explicitly set on this model instance.
pairwise_stability_result_file: Returns a path to a csv file where all pairwise stability results should be written to.
pandora_logfile: Returns a path to the Pandora logfile where all results should be logged to.
plot_dir: Path in which to store all plots.
projected_sample_support_values_csv: Returns a path to a csv file where all sample support values for projected samples should be written to.
result_file: Returns a path to the Pandora results file where all final stability results should be written to.
sample_support_values_csv: Returns a path to a csv file where all sample support values should be written to.
sliding_window_result_dir: Path in which to store all (intermediate) sliding-window results.

Methods

`copy`(*[, include, exclude, update, deep])	Returns a copy of the model.
`get_configuration`()	Creates a dictionary mapping of all settings in self.
`log_results_files`()	Logs the absolute file paths of all files written during an execution of Pandora.
`model_construct`([_fields_set])	Creates a new instance of the Model class with validated data.
`model_copy`(*[, update, deep])	Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#model_copy
`model_dump`(*[, mode, include, exclude, ...])	Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#modelmodel_dump
`model_dump_json`(*[, indent, include, ...])	Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#modelmodel_dump_json
`model_json_schema`([by_alias, ref_template, ...])	Generates a JSON schema for a model class.
`model_parametrized_name`(params)	Compute the class name for parametrizations of generic classes.
`model_post_init`(_BaseModel__context)	Override this method to perform additional initialization after __init__ and model_construct.
`model_rebuild`(*[, force, raise_errors, ...])	Try to rebuild the pydantic-core schema for the model.
`model_validate`(obj, *[, strict, ...])	Validate a pydantic model instance.
`model_validate_json`(json_data, *[, strict, ...])	Usage docs: https://docs.pydantic.dev/2.10/concepts/json/#json-parsing
`model_validate_strings`(obj, *[, strict, context])	Validate the given object with string data against the Pydantic model.
`save_config`()	Saves the configurations of self in yaml format in self.configfile.

construct
dict
from_orm
json
parse_file
parse_obj
parse_raw
schema
schema_json
update_forward_refs
validate

analysis_mode: AnalysisMode

bootstrap_convergence_check: bool

bootstrap_convergence_tolerance: Annotated[float, Ge(ge=0)]

property bootstrap_result_dir: Path

Path in which to store all (intermediate) bootstrap results.

Returns:

pathlib.Path: Filepath to the bootstrap results directory.

property configfile: Path

Returns a path to the pandora config yaml.

self.save_config() saves all PandoraConfig options to this file.

Returns:

pathlib.Path: Filepath to the config file.

convertf: str | Path

property convertf_result_dir: Path

Path where to store converted input files.

Returns:

pathlib.Path: Filepath to the converted input files directory.

dataset_prefix: Path

embedding_algorithm: EmbeddingAlgorithm

embedding_populations: Path | None

file_format: FileFormat

get_configuration() → Dict[str, Any]

Creates a dictionary mapping of all settings in self.

Returns:

Dict[str, Any]: Dictionary representation of all settings in self. Filepaths are translated to absolute path strings, enums are represented by their value.
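The conversion described above (paths to absolute path strings, enums to their values) can be sketched generically as follows. This is not Pandora's code: `to_plain_dict` and the `ExampleFormat` enum are hypothetical stand-ins used only to show the idea.

```python
from enum import Enum
from pathlib import Path
from typing import Any, Dict


class ExampleFormat(Enum):
    """Hypothetical stand-in for an enum-valued setting such as file_format."""

    EIGENSTRAT = "EIGENSTRAT"


def to_plain_dict(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Convert Path values to absolute strings and Enum members to values."""
    plain: Dict[str, Any] = {}
    for key, value in settings.items():
        if isinstance(value, Path):
            plain[key] = str(value.absolute())
        elif isinstance(value, Enum):
            plain[key] = value.value
        else:
            plain[key] = value
    return plain


plain = to_plain_dict(
    {
        "result_dir": Path("results"),
        "file_format": ExampleFormat.EIGENSTRAT,
        "n_replicates": 100,
    }
)
```

A dictionary of this shape contains only plain strings and numbers, which is what makes it straightforward to write to a YAML or JSON file.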

keep_replicates: bool

kmeans_k: int | None

log_results_files() → None

Logs the absolute file paths of all files written during an execution of Pandora.

Returns:

None

property loglevel: int

Converts the int log-level to the respective logging module constant.

Returns:

int: logging module loglevel based on the verbosity specified in self.
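The mapping from verbosity to log level documented for the verbosity parameter (0 = ERROR, 1 = INFO, 2 = DEBUG) can be written with the standard logging module as below. The function name `verbosity_to_loglevel` and the choice to raise ValueError for other values are assumptions of this sketch, not documented behaviour.

```python
import logging


def verbosity_to_loglevel(verbosity: int) -> int:
    """Map the documented verbosity levels to logging module constants."""
    levels = {0: logging.ERROR, 1: logging.INFO, 2: logging.DEBUG}
    if verbosity not in levels:
        # Assumption: unknown verbosity values are rejected.
        raise ValueError(f"unsupported verbosity: {verbosity}")
    return levels[verbosity]


level = verbosity_to_loglevel(1)
```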

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

n_components: Annotated[int, Ge(ge=0)]

n_replicates: Annotated[int, Ge(ge=0)]

property pairwise_stability_result_file: Path

Returns a path to a csv file where all pairwise stability results should be written to.

Returns:

pathlib.Path: Filepath to a csv file for pairwise stability results.

property pandora_logfile: Path

Returns a path to the Pandora logfile where all results should be logged to.

Returns:

pathlib.Path: Filepath to the Pandora logfile.

plot_dim_x: Annotated[int, Ge(ge=0)]

plot_dim_y: Annotated[int, Ge(ge=0)]

property plot_dir: Path

Path in which to store all plots.

Returns:

pathlib.Path: Filepath to the plots directory.

plot_results: bool

property projected_sample_support_values_csv: Path

Returns a path to a csv file where all sample support values for projected samples should be written to.

Returns:

pathlib.Path: Filepath to a csv file for support value results for projected samples.

redo: bool

result_decimals: Annotated[int, Ge(ge=0)]

result_dir: Path

property result_file: Path

Returns a path to the Pandora results file where all final stability results should be written to.

Returns:

pathlib.Path: Filepath to the Pandora results file.

property sample_support_values_csv: Path

Returns a path to a csv file where all sample support values should be written to.

Returns:

pathlib.Path: Filepath to a csv file for support value results for all samples.

save_config() → None

Saves the configurations of self in yaml format in self.configfile.

Will additionally log the Pandora version used for reproducibility. The resulting config file can be used as input for a subsequent Pandora execution.

Returns:

None

seed: int

property sliding_window_result_dir: Path

Path in which to store all (intermediate) sliding-window results.

Returns:

pathlib.Path: Filepath to the sliding-window results directory.

smartpca: str | Path

smartpca_optional_settings: Dict[str, Any] | None
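The smartpca_optional_settings parameter description lists seven options that are not allowed. A pre-flight check along those lines could look like the sketch below. The helper `check_smartpca_settings` is hypothetical and raises a plain ValueError; which exception Pandora itself raises for such settings is not stated here.

```python
from typing import Any, Dict

# Option names listed as not allowed in the parameter description.
DISALLOWED_SMARTPCA_OPTIONS = frozenset(
    {
        "genotypename",
        "snpname",
        "indivname",
        "evecoutname",
        "evaloutname",
        "numoutevec",
        "maxpops",
    }
)


def check_smartpca_settings(settings: Dict[str, Any]) -> None:
    """Raise ValueError if settings contain a disallowed smartpca option."""
    forbidden = sorted(DISALLOWED_SMARTPCA_OPTIONS.intersection(settings))
    if forbidden:
        raise ValueError(f"options not allowed: {', '.join(forbidden)}")


# The schema from the parameter description passes the check.
check_smartpca_settings(dict(shrinkmode=True, numoutlieriter=1))
```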

support_value_rogue_cutoff: float

threads: Annotated[int, Gt(gt=0)]

verbosity: int

pandora.pandora.convert_to_eigenstrat_format(convertf: str | Path, convertf_result_dir: Path, dataset_prefix: Path, file_format: FileFormat, redo: bool = False) → Path

Converts the given dataset from the given file_format to EIGENSTRAT format and stores it in the convertf_result_dir.

Results in three new files:

{convertf_result_dir}/{dataset_prefix.name}.geno
{convertf_result_dir}/{dataset_prefix.name}.snp
{convertf_result_dir}/{dataset_prefix.name}.ind

Parameters:

convertfExecutable

Executable of the EIGENSOFT convertf program.

convertf_result_dirpathlib.Path

Filepath where the output should be stored.

dataset_prefixpathlib.Path

Prefix of the filepath pointing to the respective dataset files that should be converted.

file_formatFileFormat

Format of the input files.

redobool, default=False: Whether to rerun the conversion if the output files are already present.

Returns:

convert_prefixpathlib.Path: Filepath prefix pointing to the converted genotype files in EIGENSTRAT format.
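The three output paths named above follow directly from the two path arguments. The sketch below builds them with pathlib; `expected_eigenstrat_files` is a hypothetical helper for illustration and does not call convertf or Pandora.

```python
from pathlib import Path
from typing import List


def expected_eigenstrat_files(
    convertf_result_dir: Path, dataset_prefix: Path
) -> List[Path]:
    """Build {convertf_result_dir}/{dataset_prefix.name}.{geno,snp,ind}."""
    return [
        convertf_result_dir / f"{dataset_prefix.name}{suffix}"
        for suffix in (".geno", ".snp", ".ind")
    ]


files = expected_eigenstrat_files(Path("results/convertf"), Path("data/example"))
```

The suffix is appended by string concatenation rather than `Path.with_suffix`, so a prefix whose name already contains a dot keeps its full name.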

pandora.pandora.pandora_config_from_configfile(configfile: Path) → PandoraConfig

Creates a new PandoraConfig object using the provided yaml configuration file.

Parameters:

configfilepathlib.Path: Configuration file in YAML format

Returns:

PandoraConfig: PandoraConfig object with the settings according to the given yaml file. Uses the default settings as specified in the PandoraConfig class for options not explicitly specified in the configfile.

Raises:

PandoraConfigException

If the config file does not specify a dataset_prefix.
If the config file does not specify a result_dir.
If the PandoraConfig object could not be initialized. This is most likely due to misspecified config options.
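The first two error conditions amount to a check for required keys in the parsed configuration. A minimal sketch of such a check on an already-parsed dictionary is shown below. The helper `missing_required_options` is hypothetical, and treating a key with a None value as missing is an assumption of this example.

```python
from typing import Any, Dict, List

# The two options documented as required in the config file.
REQUIRED_OPTIONS = ("dataset_prefix", "result_dir")


def missing_required_options(config: Dict[str, Any]) -> List[str]:
    """Return the required option names that are absent or set to None."""
    return [name for name in REQUIRED_OPTIONS if config.get(name) is None]


missing = missing_required_options({"dataset_prefix": "data/example"})
```

Checking for these keys before constructing the configuration object makes it possible to report a specific message for each missing option.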