pandora.pandora module
- class pandora.pandora.Pandora(pandora_config: PandoraConfig)
Bases: object
Pandora class for encapsulating a Pandora run and its results.
- Parameters:
- pandora_config : PandoraConfig
PandoraConfig object used to determine the analyses to run.
- Attributes:
- pandora_config : PandoraConfig
PandoraConfig object used to determine the analyses to run.
- dataset : EigenDataset
EigenDataset object that contains the input data provided by the user.
- replicates : List[EigenDataset]
List of bootstrap replicates / sliding windows of self.dataset. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.
- pairwise_stabilities : pd.DataFrame
Pandas dataframe containing the Pandora stability scores for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.
- pandora_stability : float
Overall Pandora stability of the dataset under bootstrapping or sliding-window analysis. This is None until self.bootstrap_embeddings() or self.sliding_window() has been called.
- pairwise_cluster_stabilities : pd.DataFrame
Pandas dataframe containing the Pandora cluster stability scores for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.
- pandora_cluster_stability : float
Overall Pandora cluster stability of the dataset under bootstrapping or sliding-window analysis. This is None until self.bootstrap_embeddings() or self.sliding_window() has been called.
- sample_support_values : pd.DataFrame
Pandas dataframe containing the support values for all samples of self.dataset for all pairwise replicate comparisons. This is empty until self.bootstrap_embeddings() or self.sliding_window() has been called.
Methods
bootstrap_embeddings(): Draws bootstrap replicates of self.dataset and computes and compares the respective embedding for all bootstrap replicates.
embed_dataset(): Performs dimensionality reduction on self.dataset.
log_and_save_replicates_results(): Logs the results of the bootstrap/sliding-window analyses using pandora.logging.logger and also saves the results of the analyses to the respective files as specified by self.pandora_config.
sliding_window(): Separates self.dataset into self.pandora_config.n_replicates overlapping windows and computes and compares the respective embedding for all of these windows.
- bootstrap_embeddings() -> None
Draws bootstrap replicates of self.dataset and computes and compares the respective embedding for all bootstrap replicates.
The parameters (e.g. what method to use) are determined based on the configured settings in self.pandora_config. If run successfully, the following attributes of self will be set: self.replicates, self.pairwise_stabilities, self.pandora_stability, self.pairwise_cluster_stabilities, self.pandora_cluster_stability, self.sample_support_values.
- Returns:
- None
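As a rough illustration of what drawing a bootstrap replicate involves, the sketch below samples indices with replacement using only the standard library. The helper name draw_bootstrap_indices and the use of one seed per replicate are assumptions made for this example; they are not taken from Pandora's implementation.

```python
import random
from typing import List


def draw_bootstrap_indices(n_items: int, seed: int) -> List[int]:
    """Sample n_items indices with replacement (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.randrange(n_items) for _ in range(n_items))


# Three replicates of a toy dataset with 8 items, one seed per replicate.
replicates = [draw_bootstrap_indices(8, seed) for seed in range(3)]
for indices in replicates:
    print(indices)
```

Because sampling is done with replacement, some indices typically appear more than once in a replicate while others are missing.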
- embed_dataset() -> None
Performs dimensionality reduction on self.dataset.
The parameters (e.g. what method to use) are determined based on the configured settings in self.pandora_config.
- Returns:
- None
- Raises:
- PandoraConfigException
If self.pandora_config.embedding_algorithm is not a valid EmbeddingAlgorithm.
- log_and_save_replicates_results() -> None
Logs the results of the bootstrap/sliding-window analyses using pandora.logging.logger and also saves the results of the analyses to the respective files as specified by self.pandora_config.
- Returns:
- None
- Raises:
- PandoraException
If the results were not computed yet and thus there are no results to log.
- sliding_window() -> None
Separates self.dataset into self.pandora_config.n_replicates overlapping windows and computes and compares the respective embedding for all of these windows.
The parameters (e.g. what method to use) are determined based on the configured settings in self.pandora_config. If run successfully, the following attributes of self will be set: self.replicates, self.pairwise_stabilities, self.pandora_stability, self.pairwise_cluster_stabilities, self.pandora_cluster_stability, self.sample_support_values.
- Returns:
- None
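The idea of splitting a dataset into overlapping windows can be sketched as follows. This is only an illustration: the helper overlapping_windows and its overlap parameter are hypothetical, and the way Pandora actually chooses window boundaries and overlap is not described here.

```python
from typing import List, Tuple


def overlapping_windows(n_items: int, n_windows: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges that overlap their neighbours."""
    step = n_items // n_windows
    windows = []
    for w in range(n_windows):
        start = max(0, w * step - overlap)
        # The last window absorbs any remainder so every item is covered.
        end = n_items if w == n_windows - 1 else min(n_items, (w + 1) * step + overlap)
        windows.append((start, end))
    return windows


print(overlapping_windows(n_items=20, n_windows=4, overlap=2))
# [(0, 7), (3, 12), (8, 17), (13, 20)]
```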
- class pandora.pandora.PandoraConfig(*, dataset_prefix: Path, result_dir: Path, file_format: FileFormat = FileFormat.EIGENSTRAT, convertf: str | Path = 'convertf', n_replicates: Annotated[int, Ge(ge=0)] = 100, keep_replicates: bool = False, bootstrap_convergence_check: bool = True, bootstrap_convergence_tolerance: Annotated[float, Ge(ge=0)] = 0.05, n_components: Annotated[int, Ge(ge=0)] = 10, embedding_algorithm: EmbeddingAlgorithm = EmbeddingAlgorithm.PCA, smartpca: str | Path = 'smartpca', smartpca_optional_settings: Dict[str, Any] | None = None, embedding_populations: Path | None = None, support_value_rogue_cutoff: float = 0.5, kmeans_k: int | None = None, analysis_mode: AnalysisMode = AnalysisMode.BOOTSTRAP, redo: bool = False, seed: int = 1743406135, threads: Annotated[int, Gt(gt=0)] = 2, result_decimals: Annotated[int, Ge(ge=0)] = 2, verbosity: int = 1, plot_results: bool = False, plot_dim_x: Annotated[int, Ge(ge=0)] = 0, plot_dim_y: Annotated[int, Ge(ge=0)] = 1)
Bases: BaseModel
Pydantic dataclass encapsulating the settings required to run Pandora.
- Parameters:
- dataset_prefix : pathlib.Path
File path prefix pointing to the dataset to use for the Pandora analyses. Pandora will look for files called <input>.*, so make sure all files have the same prefix.
- result_dir : pathlib.Path
Directory in which to store all (intermediate) results.
- file_format : FileFormat, default=FileFormat.EIGENSTRAT
Format of the input dataset. Can be ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, or PACKEDANCESTRYMAP. Default is EIGENSTRAT.
- convertf : Executable, default="convertf"
File path pointing to an executable of Eigensoft’s convertf tool. Convertf is used if the provided dataset is not in EIGENSTRAT format. Default is ‘convertf’. This will only work if convertf is installed system-wide.
- n_replicates : PositiveInt, default=100
Number of bootstrap replicates or sliding windows to compute. In case of bootstrapping, make sure to also set the bootstrap_convergence_check parameter as desired.
- keep_replicates : bool, default=False
Whether to store all intermediate dataset files (.geno, .snp, .ind). Note that this will consume a substantial amount of storage. Default is False. The bootstrapped indices are stored as checkpoints for full reproducibility in any case.
- bootstrap_convergence_check : bool, default=True
Whether to heuristically determine convergence of the bootstrapping procedure. If true, instead of computing n_replicates bootstraps and embeddings, Pandora will check for convergence once every max(10, threads) bootstrap embeddings are computed. If the bootstrap procedure has converged according to our heuristic (see bootstrap.py for more details), all remaining tasks are cancelled and the stability is determined using only the replicates computed up to that point. Note that this parameter is only relevant if analysis_mode is AnalysisMode.BOOTSTRAP.
- bootstrap_convergence_tolerance : NonNegativeFloat, default=0.05
Determines the level of deviation tolerance when checking for bootstrap convergence. A value of \(X\) means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
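To make the meaning of the tolerance concrete, the sketch below checks whether a set of stability estimates agree pairwise within a relative tolerance. It is an assumption-laden illustration only: the helper within_tolerance and the use of a relative difference are invented for this example, and the actual heuristic lives in bootstrap.py.

```python
from itertools import combinations
from typing import Sequence


def within_tolerance(estimates: Sequence[float], tolerance: float) -> bool:
    """True if every pair of estimates differs by at most tolerance (relative)."""
    for a, b in combinations(estimates, 2):
        reference = max(abs(a), abs(b))
        if reference > 0 and abs(a - b) / reference > tolerance:
            return False
    return True


# With tolerance 0.05, deviations of up to 5% are still treated as converged.
print(within_tolerance([0.90, 0.92, 0.91], tolerance=0.05))  # True
print(within_tolerance([0.90, 0.80, 0.91], tolerance=0.05))  # False
```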
- n_components : PositiveInt, default=10
Number of dimensions to output and compare for PCA and MDS analyses. The recommended number is 10 for PCA and 2 for MDS. Default is 10, in correspondence with the default PCA embedding.
- embedding_algorithm : EmbeddingAlgorithm, default=EmbeddingAlgorithm.PCA
Embedding to compute during the stability analysis. Can be either EmbeddingAlgorithm.PCA or EmbeddingAlgorithm.MDS.
- smartpca : Executable, default="smartpca"
File path pointing to an executable of Eigensoft’s smartpca tool. Smartpca is used for PCA analyses on the provided dataset. Default is ‘smartpca’. This will only work if smartpca is installed system-wide.
- smartpca_optional_settings : Dict[str, Any], default=None
Optional additional settings to use when performing PCA with smartpca. Pandora supports all smartpca options except the following, which are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Use the following schema to set the options: dict(shrinkmode=True, numoutlieriter=1)
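A settings dictionary can be checked against the disallowed options before it is passed on. The following is a minimal sketch; the helper check_smartpca_settings and the choice of ValueError are assumptions for illustration and do not reflect how Pandora itself validates these settings.

```python
from typing import Any, Dict

# Options the description above lists as not allowed.
RESERVED_OPTIONS = {
    "genotypename", "snpname", "indivname",
    "evecoutname", "evaloutname", "numoutevec", "maxpops",
}


def check_smartpca_settings(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError if a reserved option is present (hypothetical helper)."""
    forbidden = sorted(RESERVED_OPTIONS.intersection(settings))
    if forbidden:
        raise ValueError(f"Options not allowed: {', '.join(forbidden)}")
    return settings


ok = check_smartpca_settings(dict(shrinkmode=True, numoutlieriter=1))
print(ok)
```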
- embedding_populations : pathlib.Path, default=None
File containing a newline-separated list of population names. Only these populations will be used for the dimensionality reduction. In case of PCA analyses, all remaining samples in the dataset will be projected onto the PCA results.
- support_value_rogue_cutoff : float, default=0.5
When plotting the support values, only samples with a support value lower than the support_value_rogue_cutoff will be annotated with their sample IDs. Note that all samples in the respective plot are color-coded according to their support value in any case.
- kmeans_k : PositiveInt, default=None
Number of clusters k to use for K-Means clustering of the dimensionality reduction embeddings. If not set, the optimal number of clusters will be automatically determined according to the Bayesian Information Criterion (BIC).
- analysis_mode : AnalysisMode, default=AnalysisMode.BOOTSTRAP
Whether Pandora should do bootstrap analysis or sliding-window analysis.
- redo : bool, default=False
Whether to rerun all analyses in case the results files from a previous run are already present. Careful: this will overwrite existing results!
- seed : int, default=None
Seed to initialize the random number generator. This setting is recommended for reproducible analyses. Default is the current unix timestamp.
- threads : NonNegativeInt, default=None
Number of threads to use for the analysis. Default is the number of CPUs available.
- result_decimals : NonNegativeInt, default=2
Number of decimals to round the stability scores and support values in the output. Default is two decimals.
- verbosity : int, default=1
Verbosity of the output logging of Pandora.
  - 0 = quiet, prints only errors and the results (loglevel = ERROR)
  - 1 = verbose, prints all intermediate infos (loglevel = INFO)
  - 2 = debug, prints intermediate infos and debug messages (loglevel = DEBUG)
- plot_results : bool, default=False
Whether to plot all dimensionality reduction results and sample support values.
- plot_dim_x : NonNegativeInt, default=0
Dimension to plot on the x-axis. Note that the dimensions are zero-indexed. To plot the first dimension, set plot_dim_x = 0.
- plot_dim_y : NonNegativeInt, default=1
Dimension to plot on the y-axis. Note that the dimensions are zero-indexed. To plot the second dimension, set plot_dim_y = 1.
- Attributes:
bootstrap_result_dir: Path in which to store all (intermediate) bootstrap results.
configfile: Returns a path to the pandora config yaml.
convertf_result_dir: Path where to store converted input files.
loglevel: Converts the int log-level to the respective logging module constant.
model_extra: Get extra fields set during validation.
model_fields_set: Returns the set of fields that have been explicitly set on this model instance.
pairwise_stability_result_file: Returns a path to a csv file where all pairwise stability results should be written to.
pandora_logfile: Returns a path to the Pandora logfile where all results should be logged to.
plot_dir: Path in which to store all plots.
projected_sample_support_values_csv: Returns a path to a csv file where all sample support values for projected samples should be written to.
result_file: Returns a path to the Pandora results file where all final stability results should be written to.
sample_support_values_csv: Returns a path to a csv file where all sample support values should be written to.
sliding_window_result_dir: Path in which to store all (intermediate) sliding-window results.
Methods
copy(*[, include, exclude, update, deep]): Returns a copy of the model.
get_configuration(): Creates a dictionary mapping of all settings in self.
log_results_files(): Logs the absolute file paths of all files written during an execution of Pandora.
model_construct([_fields_set]): Creates a new instance of the Model class with validated data.
model_copy(*[, update, deep]): Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#model_copy
model_dump(*[, mode, include, exclude, ...]): Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#modelmodel_dump
model_dump_json(*[, indent, include, ...]): Usage docs: https://docs.pydantic.dev/2.10/concepts/serialization/#modelmodel_dump_json
model_json_schema([by_alias, ref_template, ...]): Generates a JSON schema for a model class.
model_parametrized_name(params): Compute the class name for parametrizations of generic classes.
model_post_init(_BaseModel__context): Override this method to perform additional initialization after __init__ and model_construct.
model_rebuild(*[, force, raise_errors, ...]): Try to rebuild the pydantic-core schema for the model.
model_validate(obj, *[, strict, ...]): Validate a pydantic model instance.
model_validate_json(json_data, *[, strict, ...]): Usage docs: https://docs.pydantic.dev/2.10/concepts/json/#json-parsing
model_validate_strings(obj, *[, strict, context]): Validate the given object with string data against the Pydantic model.
save_config(): Saves the configurations of self in yaml format in self.configfile.
construct
dict
from_orm
json
parse_file
parse_obj
parse_raw
schema
schema_json
update_forward_refs
validate
- analysis_mode: AnalysisMode
- bootstrap_convergence_check: bool
- bootstrap_convergence_tolerance: Annotated[float, Ge(ge=0)]
- property bootstrap_result_dir: Path
Path in which to store all (intermediate) bootstrap results.
- Returns:
- pathlib.Path
Filepath to the bootstrap results directory.
- property configfile: Path
Returns a path to the pandora config yaml.
self.save_config() will save all PandoraConfig options in this file.
- Returns:
- pathlib.Path
Filepath to the config file.
- convertf: str | Path
- property convertf_result_dir: Path
Path where to store converted input files.
- Returns:
- pathlib.Path
Filepath to the converted input files directory.
- dataset_prefix: Path
- embedding_algorithm: EmbeddingAlgorithm
- embedding_populations: Path | None
- file_format: FileFormat
- get_configuration() -> Dict[str, Any]
Creates a dictionary mapping of all settings in self.
- Returns:
- Dict[str, Any]
Dictionary representation of all settings in self. Filepaths are translated to absolute path strings, enums are represented by their value.
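The translation described above (paths to absolute path strings, enums to their values) can be sketched with the standard library alone. The Mode enum and the helper to_plain_dict are hypothetical stand-ins for this example and are not part of Pandora.

```python
from enum import Enum
from pathlib import Path
from typing import Any, Dict


class Mode(Enum):  # hypothetical stand-in for an enum-valued setting
    BOOTSTRAP = "BOOTSTRAP"


def to_plain_dict(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Translate paths to absolute path strings and enums to their values."""
    plain = {}
    for key, value in settings.items():
        if isinstance(value, Path):
            plain[key] = str(value.absolute())
        elif isinstance(value, Enum):
            plain[key] = value.value
        else:
            plain[key] = value
    return plain


plain = to_plain_dict(
    {"result_dir": Path("results"), "analysis_mode": Mode.BOOTSTRAP, "redo": False}
)
print(plain)
```

A dictionary in this form contains only plain Python values, which makes it straightforward to serialize, for example to yaml.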
- keep_replicates: bool
- kmeans_k: int | None
- log_results_files() -> None
Logs the absolute file paths of all files written during an execution of Pandora.
- Returns:
- None
- property loglevel: int
Converts the int log-level to the respective logging module constant.
- Returns:
- int
logging module loglevel based on the verbosity specified in self.
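The mapping from verbosity to logging constants documented for the verbosity parameter (0 = ERROR, 1 = INFO, 2 = DEBUG) can be written as a small lookup. This is a sketch; the helper loglevel_for and its behaviour for other values (raising ValueError) are assumptions, not Pandora's code.

```python
import logging

# Mapping stated in the verbosity description: 0 = ERROR, 1 = INFO, 2 = DEBUG.
VERBOSITY_TO_LOGLEVEL = {0: logging.ERROR, 1: logging.INFO, 2: logging.DEBUG}


def loglevel_for(verbosity: int) -> int:
    """Return the logging constant for a verbosity of 0, 1 or 2."""
    if verbosity not in VERBOSITY_TO_LOGLEVEL:
        # Assumption: reject anything outside the documented range.
        raise ValueError(f"verbosity must be 0, 1 or 2, got {verbosity}")
    return VERBOSITY_TO_LOGLEVEL[verbosity]


print(logging.getLevelName(loglevel_for(1)))  # INFO
```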
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- n_components: Annotated[int, Ge(ge=0)]
- n_replicates: Annotated[int, Ge(ge=0)]
- property pairwise_stability_result_file: Path
Returns a path to a csv file where all pairwise stability results should be written to.
- Returns:
- pathlib.Path
Filepath to a csv file for pairwise stability results.
- property pandora_logfile: Path
Returns a path to the Pandora logfile where all results should be logged to.
- Returns:
- pathlib.Path
Filepath to the Pandora logfile.
- plot_dim_x: Annotated[int, Ge(ge=0)]
- plot_dim_y: Annotated[int, Ge(ge=0)]
- property plot_dir: Path
Path in which to store all plots.
- Returns:
- pathlib.Path
Filepath to the plots directory.
- plot_results: bool
- property projected_sample_support_values_csv: Path
Returns a path to a csv file where all sample support values for projected samples should be written to.
- Returns:
- pathlib.Path
Filepath to a csv file for support value results for projected samples.
- redo: bool
- result_decimals: Annotated[int, Ge(ge=0)]
- result_dir: Path
- property result_file: Path
Returns a path to the Pandora results file where all final stability results should be written to.
- Returns:
- pathlib.Path
Filepath to the Pandora results file.
- property sample_support_values_csv: Path
Returns a path to a csv file where all sample support values should be written to.
- Returns:
- pathlib.Path
Filepath to a csv file for support value results for all samples.
- save_config() -> None
Saves the configurations of self in yaml format in self.configfile.
Will additionally log the Pandora version used for reproducibility. The resulting config file can be used as input for a subsequent Pandora execution.
- Returns:
- None
- seed: int
- property sliding_window_result_dir: Path
Path in which to store all (intermediate) sliding-window results.
- Returns:
- pathlib.Path
Filepath to the sliding-window results directory.
- smartpca: str | Path
- smartpca_optional_settings: Dict[str, Any] | None
- support_value_rogue_cutoff: float
- threads: Annotated[int, Gt(gt=0)]
- verbosity: int
- pandora.pandora.convert_to_eigenstrat_format(convertf: str | Path, convertf_result_dir: Path, dataset_prefix: Path, file_format: FileFormat, redo: bool = False) -> Path
Converts the given dataset from the given file_format to EIGENSTRAT format and stores it in the convertf_result_dir.
Results in three new files:
  - {convertf_result_dir}/{dataset_prefix.name}.geno
  - {convertf_result_dir}/{dataset_prefix.name}.snp
  - {convertf_result_dir}/{dataset_prefix.name}.ind
- Parameters:
- convertf : Executable
Executable of the EIGENSOFT convertf program.
- convertf_result_dir : pathlib.Path
Filepath where the output should be stored.
- dataset_prefix : pathlib.Path
Prefix of the filepath pointing to the respective dataset files that should be converted.
- file_format : FileFormat
Format of the input files.
- redo : bool, default=False
Whether to rerun the conversion if the output files are already present.
- Returns:
- convert_prefix : pathlib.Path
Filepath prefix pointing to the converted genotype files in EIGENSTRAT format.
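The three output paths follow directly from the naming pattern given above. A minimal sketch, assuming a hypothetical helper expected_eigenstrat_files and example directory names that are not part of Pandora:

```python
from pathlib import Path
from typing import List


def expected_eigenstrat_files(convertf_result_dir: Path, dataset_prefix: Path) -> List[Path]:
    """List the .geno, .snp and .ind paths for a converted dataset."""
    convert_prefix = convertf_result_dir / dataset_prefix.name
    # Append the suffix to the full name so prefixes containing dots stay intact.
    return [Path(f"{convert_prefix}{suffix}") for suffix in (".geno", ".snp", ".ind")]


files = expected_eigenstrat_files(Path("results/convertf"), Path("data/example"))
for path in files:
    print(path)
```

The suffix is appended as a string rather than set with Path.with_suffix, because with_suffix would replace part of a prefix name that already contains a dot.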
- pandora.pandora.pandora_config_from_configfile(configfile: Path) -> PandoraConfig
Creates a new PandoraConfig object using the provided yaml configuration file.
- Parameters:
- configfile : pathlib.Path
Configuration file in yaml format.
- Returns:
- PandoraConfig
PandoraConfig object with the settings according to the given yaml file. Uses the default settings as specified in the PandoraConfig class for optional options not explicitly specified in the configfile.
- Raises:
- PandoraConfigException
If the config file does not specify a dataset_prefix.
If the config file does not specify a result_dir.
If the PandoraConfig object could not be initialized. This is most likely due to misspecified config options.
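The first two error conditions amount to checking for two required keys in the parsed configuration. The sketch below works on an already parsed mapping; the helper missing_required_keys is hypothetical and only illustrates the check, not how Pandora reads or validates the yaml file.

```python
from typing import Any, Dict, List

# The two settings the config file must specify, per the Raises section above.
REQUIRED_KEYS = ("dataset_prefix", "result_dir")


def missing_required_keys(raw_config: Dict[str, Any]) -> List[str]:
    """Return the required settings that are absent from a parsed config mapping."""
    return [key for key in REQUIRED_KEYS if raw_config.get(key) is None]


print(missing_required_keys({"dataset_prefix": "data/example", "n_replicates": 50}))
# ['result_dir']
```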