pandora.pandora module

class pandora.pandora.Pandora(pandora_config: PandoraConfig)

Bases: object

Pandora class for encapsulating a Pandora run and its results.

Parameters:
pandora_configPandoraConfig

PandoraConfig object used to determine the analyses to run

Attributes:
pandora_configPandoraConfig

PandoraConfig object used to determine the analyses to run

datasetEigenDataset

EigenDataset object that contains the input data provided by the user

replicatesList[EigenDataset]

List of bootstrap replicates / sliding windows of self.dataset. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

pairwise_stabilitiespd.DataFrame
Pandas dataframe containing the Pandora stability scores for all pairwise replicate comparisons.

This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

pandora_stabilityfloat

Overall Pandora stability of the dataset under bootstrapping or sliding-window analysis. This is None until self.bootstrap_embeddings() or self.sliding_window() has been called.

pairwise_cluster_stabilitiespd.DataFrame
Pandas dataframe containing the Pandora cluster stability scores for all pairwise replicate comparisons.

This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

pandora_cluster_stabilityfloat

Overall Pandora cluster stability of the dataset under bootstrapping or sliding-window analysis. This is None until self.bootstrap_embeddings() or self.sliding_window() has been called.

sample_support_valuespd.DataFrame

Pandas dataframe containing the support values for all samples of self.dataset for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.

Methods

bootstrap_embeddings()

Draws bootstrap replicates of self.dataset and computes and compares the respective embedding for all bootstrap replicates.

embed_dataset()

Performs dimensionality reduction on self.dataset.

log_and_save_replicates_results()

Logs the results of the bootstrap/sliding-window analyses using pandora.logging.logger and also saves the results of the analyses to the respective files as specified by self.pandora_config.

sliding_window()

Separates self.dataset into self.pandora_config.n_replicates overlapping windows and computes and compares the respective embedding for all of these windows.

bootstrap_embeddings() → None

Draws bootstrap replicates of self.dataset and computes and compares the respective embedding for all bootstrap replicates.

The parameters (e.g. which method to use) are determined by the settings in self.pandora_config. On success, the following attributes of self are set:

  • self.replicates

  • self.pairwise_stabilities

  • self.pandora_stability

  • self.pairwise_cluster_stabilities

  • self.pandora_cluster_stability

  • self.sample_support_values

Returns:
None
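
To make the notion of a bootstrap replicate concrete, here is a minimal sketch. It is not Pandora's implementation: it assumes that a replicate is obtained by resampling SNP indices with replacement, and the helper name draw_bootstrap_indices and its parameters are hypothetical.

```python
import random
from typing import List


def draw_bootstrap_indices(n_snps: int, seed: int) -> List[int]:
    """Hypothetical helper: resample SNP indices with replacement."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    # Draw n_snps indices with replacement and sort them to keep SNP order.
    return sorted(rng.choices(range(n_snps), k=n_snps))


# One list of resampled indices per bootstrap replicate.
replicates = [draw_bootstrap_indices(n_snps=8, seed=s) for s in range(3)]
```

Seeding each draw separately means a single replicate can be recomputed without redrawing all the others.
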
embed_dataset() → None

Performs dimensionality reduction on self.dataset.

The parameters (e.g. which method to use) are determined by the settings in self.pandora_config.

Returns:
None
Raises:
PandoraConfigException
  • If self.pandora_config.embedding_algorithm is not a valid EmbeddingAlgorithm.
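
The dispatch on the configured algorithm can be pictured roughly as follows. This is only a sketch: the two classes are stand-ins for Pandora's EmbeddingAlgorithm and PandoraConfigException, the enum values are assumed, and select_embedding is a hypothetical helper.

```python
from enum import Enum


class EmbeddingAlgorithm(Enum):
    # Stand-in for Pandora's enum; the member values are assumed.
    PCA = "PCA"
    MDS = "MDS"


class PandoraConfigException(Exception):
    """Stand-in for the exception documented above."""


def select_embedding(algorithm: object) -> str:
    """Return a label for the embedding to compute, or raise on invalid input."""
    if algorithm is EmbeddingAlgorithm.PCA:
        return "pca"
    if algorithm is EmbeddingAlgorithm.MDS:
        return "mds"
    raise PandoraConfigException(f"Unrecognized embedding algorithm: {algorithm!r}")
```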

log_and_save_replicates_results() → None

Logs the results of the bootstrap/sliding-window analyses using pandora.logging.logger and also saves the results of the analyses to the respective files as specified by self.pandora_config.

Returns:
None
Raises:
PandoraException
  • If the results were not computed yet and thus there are no results to log.

sliding_window() → None

Separates self.dataset into self.pandora_config.n_replicates overlapping windows and computes and compares the respective embedding for all of these windows.

The parameters (e.g. which method to use) are determined by the settings in self.pandora_config. On success, the following attributes of self are set:

  • self.replicates

  • self.pairwise_stabilities

  • self.pandora_stability

  • self.pairwise_cluster_stabilities

  • self.pandora_cluster_stability

  • self.sample_support_values

Returns:
None
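
As an illustration of overlapping windows, the sketch below splits a range of SNP indices into a fixed number of windows that each extend a few SNPs into the next one. The windowing scheme, the overlap parameter and the helper name overlapping_windows are assumptions made for this example; Pandora's actual scheme may differ.

```python
from typing import List, Tuple


def overlapping_windows(n_snps: int, n_windows: int, overlap: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: half-open (start, end) index ranges of overlapping windows."""
    if n_windows <= 0 or n_snps < n_windows:
        raise ValueError("need 0 < n_windows <= n_snps")
    stride = n_snps // n_windows
    windows = []
    for i in range(n_windows):
        start = i * stride
        # The last window runs to the end of the data; every other window
        # extends `overlap` SNPs into the following window.
        end = n_snps if i == n_windows - 1 else min(n_snps, start + stride + overlap)
        windows.append((start, end))
    return windows


windows = overlapping_windows(n_snps=100, n_windows=4, overlap=5)
# windows == [(0, 30), (25, 55), (50, 80), (75, 100)]
```
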
class pandora.pandora.PandoraConfig(*, dataset_prefix: Path, result_dir: Path, file_format: FileFormat = FileFormat.EIGENSTRAT, convertf: str | Path = 'convertf', n_replicates: Annotated[int, Ge(ge=0)] = 100, keep_replicates: bool = False, bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: Annotated[float, Ge(ge=0)] = 0.05, n_components: Annotated[int, Ge(ge=0)] = 10, embedding_algorithm: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, smartpca: str | Path = 'smartpca', smartpca_optional_settings: Dict[str, Any] | None = None, embedding_populations: Path | None = None, support_value_rogue_cutoff: float = 0.5, kmeans_k: int | None = None, analysis_mode: AnalysisMode = AnalysisMode.BOOTSTRAP, redo: bool = False, seed: int = 1743406135, threads: Annotated[int, Gt(gt=0)] = 2, result_decimals: Annotated[int, Ge(ge=0)] = 2, verbosity: int = 1, plot_results: bool = False, plot_dim_x: Annotated[int, Ge(ge=0)] = 0, plot_dim_y: Annotated[int, Ge(ge=0)] = 1)

Bases: BaseModel

Pydantic dataclass encapsulating the settings required to run Pandora.

Parameters:
dataset_prefixpathlib.Path

File path prefix pointing to the dataset to use for the Pandora analyses. Pandora will look for files called <input>.* so make sure all files have the same prefix.

result_dirpathlib.Path

Directory in which to store all (intermediate) results.

file_formatFileFormat, default=FileFormat.EIGENSTRAT

Format of the input dataset. Can be ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, PACKEDANCESTRYMAP. Default is EIGENSTRAT.

convertfExecutable, default=”convertf”

File path pointing to an executable of Eigensoft’s convertf tool. Convertf is used if the provided dataset is not in EIGENSTRAT format. Default is ‘convertf’, which only works if convertf is installed system-wide.

n_replicatesPositiveInt, default=100

Number of bootstrap replicates or sliding windows to compute. In case of bootstrapping, make sure to also set the bootstrap_convergence_check parameter as desired.

keep_replicatesbool, default=False

Whether to store all intermediate dataset files (.geno, .snp, .ind). Note that this will consume a substantial amount of storage. Default is False. The bootstrapped indices are stored as checkpoints for full reproducibility in any case.

bootstrap_convergence_checkbool, default=True

Whether to heuristically determine convergence of the bootstrapping procedure. If true, instead of computing n_replicates bootstraps and embeddings, Pandora will check for convergence once every max(10, threads) bootstrap embeddings are computed. If, according to our heuristic (see bootstrap.py for more details), the bootstrap procedure has converged, all remaining tasks are cancelled and the stability is determined using only the replicates computed up to that point. Note that this parameter is only relevant if analysis_mode is AnalysisMode.BOOTSTRAP.

bootstrap_convergence_toleranceNonNegativeFloat, default=0.05

Determines the level of deviation tolerance when checking for bootstrap convergence. A value of \(X\) means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
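
The following sketch illustrates the two convergence settings together. The check schedule follows the max(10, threads) rule stated for bootstrap_convergence_check; the tolerance criterion (relative deviation between every pair of estimates) is only an assumed reading of the description above, and the real heuristic lives in bootstrap.py. Both function names are hypothetical.

```python
from itertools import combinations
from typing import List


def check_points(n_replicates: int, threads: int) -> List[int]:
    """Replicate counts at which a convergence check would run."""
    step = max(10, threads)
    return list(range(step, n_replicates + 1, step))


def within_tolerance(estimates: List[float], tolerance: float) -> bool:
    """Assumed criterion: every pair of estimates differs by at most
    tolerance * 100 percent, relative to the larger of the two."""
    for a, b in combinations(estimates, 2):
        larger = max(abs(a), abs(b))
        if larger > 0 and abs(a - b) / larger > tolerance:
            return False
    return True
```

With the defaults (n_replicates=100, tolerance 0.05) and four threads, check_points(100, 4) yields a check after every tenth replicate, and estimates such as 0.90, 0.92 and 0.91 would count as converged under this assumed criterion.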

n_componentsPositiveInt, default=10

Number of dimensions to output and compare for PCA and MDS analyses. The recommended number is 10 for PCA and 2 for MDS. Default is 10, matching the default PCA embedding.

embedding_algorithmEmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA

Embedding to compute during the stability analysis. Can be either EmbeddingAlgorithm.PCA or EmbeddingAlgorithm.MDS.

smartpcaExecutable, default=”smartpca”

File path pointing to an executable of Eigensoft’s smartpca tool. Smartpca is used for PCA analyses on the provided dataset. Default is ‘smartpca’, which only works if smartpca is installed system-wide.

smartpca_optional_settingsDict[str, Any], default=None

Optional additional settings to use when performing PCA with smartpca. Pandora supports all smartpca options except the following, which are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Pass the options as a dictionary, for example dict(shrinkmode=True, numoutlieriter=1).
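
A settings dictionary can be checked against the disallowed option names before it is passed on. The sketch below is illustrative only; forbidden_options_used is a hypothetical helper and not part of Pandora's API.

```python
from typing import Any, Dict, List

# Option names listed above as not allowed.
FORBIDDEN_SMARTPCA_OPTIONS = {
    "genotypename", "snpname", "indivname",
    "evecoutname", "evaloutname", "numoutevec", "maxpops",
}


def forbidden_options_used(settings: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: disallowed option names present in settings."""
    return sorted(set(settings) & FORBIDDEN_SMARTPCA_OPTIONS)


settings = dict(shrinkmode=True, numoutlieriter=1)
```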

embedding_populationspathlib.Path, default=None

File containing a newline-separated list of population names. Only these populations will be used for the dimensionality reduction. In case of PCA analyses, all remaining samples in the dataset will be projected onto the PCA results.

support_value_rogue_cutofffloat, default=0.5

When plotting the support values, only samples with a support value lower than the support_value_rogue_cutoff will be annotated with their sample IDs. Note that all samples in the respective plot are color-coded according to their support value in any case.
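
The cutoff acts as a strict threshold on the per-sample support values. A minimal sketch, with made-up support values and a hypothetical helper name:

```python
from typing import Dict, List


def samples_to_annotate(support_values: Dict[str, float], cutoff: float = 0.5) -> List[str]:
    """Hypothetical helper: IDs of samples whose support value is below the cutoff."""
    return [sample for sample, support in support_values.items() if support < cutoff]


# Made-up support values, for illustration only.
support = {"sample_a": 0.95, "sample_b": 0.42, "sample_c": 0.5}
```

Here only sample_b would be labelled, because a support value equal to the cutoff is not lower than it.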

kmeans_kPositiveInt, default=None

Number of clusters k to use for K-Means clustering of the dimensionality reduction embeddings. If not set, the optimal number of clusters will be automatically determined according to the Bayesian Information Criterion (BIC).

analysis_modeAnalysisMode, default=AnalysisMode.BOOTSTRAP

Whether Pandora should do bootstrap analysis or sliding-window analysis.

redobool, default=False

Whether to rerun all analyses in case the results files from a previous run are already present. Careful: this will overwrite existing results!

seedint, default=None

Seed to initialize the random number generator. This setting is recommended for reproducible analyses. Default is the current unix timestamp.

threadsNonNegativeInt, default=None

Number of threads to use for the analysis. Default is the number of CPUs available.

result_decimalsNonNegativeInt, default=2

Number of decimals to round the stability scores and support values in the output. Default is two decimals.

verbosityint, default=1

Verbosity of the output logging of Pandora:

  • 0 = quiet, prints only errors and the results (loglevel = ERROR)

  • 1 = verbose, prints all intermediate infos (loglevel = INFO)

  • 2 = debug, prints intermediate infos and debug messages (loglevel = DEBUG)

plot_resultsbool, default=False

Whether to plot all dimensionality reduction results and sample support values.

plot_dim_xNonNegativeInt, default=0

Dimension to plot on the x-axis. Note that the dimensions are zero-indexed. To plot the first dimension set plot_dim_x = 0

plot_dim_yNonNegativeInt, default=1

Dimension to plot on the y-axis. Note that the dimensions are zero-indexed. To plot the second dimension set plot_dim_y = 1

Attributes:
bootstrap_result_dir

Directory in which all (intermediate) bootstrap results are stored.

configfile

Returns a path to the pandora config yaml.

convertf_result_dir

Path where to store converted input files.

loglevel

Converts the int log-level to the respective logging module constant.

model_extra

Get extra fields set during validation.

model_fields_set

Returns the set of fields that have been explicitly set on this model instance.

pairwise_stability_result_file

Returns a path to a csv file where all pairwise stability results should be written to.

pandora_logfile

Returns a path to the Pandora logfile where all results should be logged to.

plot_dir

Directory in which all plots are stored.

projected_sample_support_values_csv

Returns a path to a csv file where all sample support values for projected samples should be written to.

result_file

Returns a path to the Pandora results file where all final stability results should be written to.

sample_support_values_csv

Returns a path to a csv file where all sample support values should be written to.

sliding_window_result_dir

Directory in which all (intermediate) sliding-window results are stored.

Methods

copy(*[, include, exclude, update, deep])

Returns a copy of the model.

get_configuration()

Creates a dictionary mapping of all settings in self.

log_results_files()

Logs the absolute file paths of all files written during an execution of Pandora.

model_construct([_fields_set])

Creates a new instance of the Model class with validated data.

model_copy(*[, update, deep])

Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#model_copy

model_dump(*[, mode, include, exclude, ...])

Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#modelmodel_dump

model_dump_json(*[, indent, include, ...])

Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#modelmodel_dump_json

model_json_schema([by_alias, ref_template, ...])

Generates a JSON schema for a model class.

model_parametrized_name(params)

Compute the class name for parametrizations of generic classes.

model_post_init(_BaseModel__context)

Override this method to perform additional initialization after __init__ and model_construct.

model_rebuild(*[, force, raise_errors, ...])

Try to rebuild the pydantic-core schema for the model.

model_validate(obj, *[, strict, ...])

Validate a pydantic model instance.

model_validate_json(json_data, *[, strict, ...])

Usage docs: https://docs.pydantic.dev/2.10/concepts/json/#json-parsing

model_validate_strings(obj, *[, strict, context])

Validate the given object with string data against the Pydantic model.

save_config()

Saves the configurations of self in yaml format in self.configfile.

construct

dict

from_orm

json

parse_file

parse_obj

parse_raw

schema

schema_json

update_forward_refs

validate

analysis_mode: AnalysisMode
bootstrap_convergence_check: bool
bootstrap_convergence_tolerance: Annotated[float, Ge(ge=0)]
property bootstrap_result_dir: Path

Directory in which all (intermediate) bootstrap results are stored.

Returns:
pathlib.Path

Filepath to the bootstrap results directory.

property configfile: Path

Returns a path to the pandora config yaml.

self.save_config() saves all PandoraConfig options to this file.

Returns:
pathlib.Path

Filepath to the config file.

convertf: str | Path
property convertf_result_dir: Path

Path where to store converted input files.

Returns:
pathlib.Path

Filepath to the converted input files directory.

dataset_prefix: Path
embedding_algorithm: EmbeddingAlgorithm
embedding_populations: Path | None
file_format: FileFormat
get_configuration() → Dict[str, Any]

Creates a dictionary mapping of all settings in self.

Returns:
Dict[str, Any]

Dictionary representation of all settings in self. Filepaths are translated to absolute path strings, enums are represented by their value.
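
The conversion described here (absolute path strings, enum values) can be sketched as below. This is an illustration of the idea rather than Pandora's code; to_plain_dict is a hypothetical helper that operates on an ordinary dictionary.

```python
from enum import Enum
from pathlib import Path
from typing import Any, Dict


def to_plain_dict(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the conversion described above."""
    plain = {}
    for key, value in settings.items():
        if isinstance(value, Path):
            plain[key] = str(value.absolute())  # absolute path as a string
        elif isinstance(value, Enum):
            plain[key] = value.value  # enum member replaced by its value
        else:
            plain[key] = value
    return plain
```

A dictionary of this shape contains only plain values, which makes it straightforward to write out as YAML.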

keep_replicates: bool
kmeans_k: int | None
log_results_files() → None

Logs the absolute file paths of all files written during an execution of Pandora.

Returns:
None
property loglevel: int

Converts the int log-level to the respective logging module constant.

Returns:
int

logging module loglevel based on the verbosity specified in self.
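
The mapping from the documented verbosity levels (0, 1, 2) to logging constants can be sketched as follows. The helper name is hypothetical, and raising on any other value is an assumption of this sketch.

```python
import logging


def verbosity_to_loglevel(verbosity: int) -> int:
    """Hypothetical helper mapping the documented verbosity levels to logging constants."""
    levels = {0: logging.ERROR, 1: logging.INFO, 2: logging.DEBUG}
    if verbosity not in levels:
        # Assumption: values other than 0, 1 and 2 are rejected.
        raise ValueError(f"verbosity must be 0, 1 or 2, got {verbosity}")
    return levels[verbosity]
```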

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_components: Annotated[int, Ge(ge=0)]
n_replicates: Annotated[int, Ge(ge=0)]
property pairwise_stability_result_file: Path

Returns a path to a csv file where all pairwise stability results should be written to.

Returns:
pathlib.Path

Filepath to a csv file for pairwise stability results.

property pandora_logfile: Path

Returns a path to the Pandora logfile where all results should be logged to.

Returns:
pathlib.Path

Filepath to the Pandora logfile.

plot_dim_x: Annotated[int, Ge(ge=0)]
plot_dim_y: Annotated[int, Ge(ge=0)]
property plot_dir: Path

Directory in which all plots are stored.

Returns:
pathlib.Path

Filepath to the plots directory.

plot_results: bool
property projected_sample_support_values_csv: Path

Returns a path to a csv file where all sample support values for projected samples should be written to.

Returns:
pathlib.Path

Filepath to a csv file for support value results for projected samples.

redo: bool
result_decimals: Annotated[int, Ge(ge=0)]
result_dir: Path
property result_file: Path

Returns a path to the Pandora results file where all final stability results should be written to.

Returns:
pathlib.Path

Filepath to the Pandora results file.

property sample_support_values_csv: Path

Returns a path to a csv file where all sample support values should be written to.

Returns:
pathlib.Path

Filepath to a csv file for support value results for all samples.

save_config() → None

Saves the configurations of self in yaml format in self.configfile.

Will additionally log the Pandora version used for reproducibility. The resulting config file can be used as input for a subsequent Pandora execution.

Returns:
None
seed: int
property sliding_window_result_dir: Path

Directory in which all (intermediate) sliding-window results are stored.

Returns:
pathlib.Path

Filepath to the sliding-window results directory.

smartpca: str | Path
smartpca_optional_settings: Dict[str, Any] | None
support_value_rogue_cutoff: float
threads: Annotated[int, Gt(gt=0)]
verbosity: int
pandora.pandora.convert_to_eigenstrat_format(convertf: str | Path, convertf_result_dir: Path, dataset_prefix: Path, file_format: FileFormat, redo: bool = False) → Path

Converts the given dataset from the given file_format to EIGENSTRAT format and stores it in the convertf_result_dir.

Results in three new files:

  • {convertf_result_dir}/{dataset_prefix.name}.geno

  • {convertf_result_dir}/{dataset_prefix.name}.snp

  • {convertf_result_dir}/{dataset_prefix.name}.ind
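
The three output paths listed above can be derived with pathlib as in this sketch. expected_eigenstrat_files is a hypothetical helper and the example paths are made up.

```python
from pathlib import Path
from typing import Dict


def expected_eigenstrat_files(convertf_result_dir: Path, dataset_prefix: Path) -> Dict[str, Path]:
    """Hypothetical helper: the three output paths listed above, keyed by extension."""
    convert_prefix = convertf_result_dir / dataset_prefix.name
    # Append the extension as text so that dots in the dataset name are preserved.
    return {ext: Path(f"{convert_prefix}.{ext}") for ext in ("geno", "snp", "ind")}


files = expected_eigenstrat_files(Path("results/convertf"), Path("data/example"))
```

The extension is appended as a string rather than with Path.with_suffix, since with_suffix would replace everything after the last dot in a dataset name such as example.v2.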

Parameters:
convertfExecutable

Executable of the EIGENSOFT convertf program.

convertf_result_dirpathlib.Path

Filepath where the output should be stored.

dataset_prefixpathlib.Path

Prefix of the filepath pointing to the respective dataset files that should be converted.

file_formatFileFormat

Format of the input files.

redobool, default=False

Whether to rerun the conversion if the output files are already present.

Returns:
convert_prefixpathlib.Path

Filepath prefix pointing to the converted genotype files in EIGENSTRAT format.

pandora.pandora.pandora_config_from_configfile(configfile: Path) → PandoraConfig

Creates a new PandoraConfig object using the provided yaml configuration file.

Parameters:
configfilepathlib.Path

Configuration file in YAML format.

Returns:
PandoraConfig

PandoraConfig object with the settings according to the given yaml file. Uses the default settings as specified in the PandoraConfig class for optional options not explicitly specified in the configfile.

Raises:
PandoraConfigException
  • If the config file does not specify a dataset_prefix.

  • If the config file does not specify a result_dir.

  • If the PandoraConfig object could not be initialized. This is most likely due to misspecified config options.
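
The first two error conditions amount to a check for required options on the parsed configuration. A minimal sketch, assuming the YAML file has already been parsed into a dictionary; the exception class is a stand-in for Pandora's PandoraConfigException and check_required_options is a hypothetical helper.

```python
from typing import Any, Dict


class PandoraConfigException(Exception):
    """Stand-in for the exception documented above."""


REQUIRED_OPTIONS = ("dataset_prefix", "result_dir")


def check_required_options(config: Dict[str, Any]) -> None:
    """Hypothetical helper: raise if a required option is missing from the parsed config."""
    for option in REQUIRED_OPTIONS:
        if config.get(option) is None:
            raise PandoraConfigException(f"The config file must specify {option}.")
```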