Pandora Command Line Interface Configuration
Configuration options:
- dataset_prefix: File path prefix pointing to the dataset to use for the Pandora analyses. Pandora will look for files called <dataset_prefix>.*, so make sure all files have the same prefix.
- result_dir: Directory in which to store all (intermediate) results.
- file_format, default = EIGENSTRAT: Name of the file format your dataset is in. Supported formats are ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, PACKEDANCESTRYMAP. For more information see Section Input data below.
- convertf, default = convertf: File path pointing to an executable of Eigensoft’s convertf tool. convertf is used if the provided dataset is not in EIGENSTRAT format. The default only works if convertf is installed systemwide.
- bootstrap_convergence_check, default = True: If true, instead of computing n_replicates bootstraps and embeddings, Pandora will check for convergence each time max(10, threads) bootstrap embeddings have been computed. If according to our heuristic (see TODO for more details) the bootstrap procedure has converged, all remaining tasks are cancelled and the stability is determined using only the replicates computed up to the point of convergence. Due to the runtime overhead of the convergence check compared to the runtime of MDS computations, we only advise using this convergence check for PCA analyses. Note that this parameter is only relevant if analysis_mode is AnalysisMode.BOOTSTRAP.
- bootstrap_convergence_tolerance, default = 0.05: Determines the level of deviation tolerance when checking for bootstrap convergence. A value of \(X\) means that we allow deviations of up to \(X * 100\%\) between pairwise bootstrap comparisons and still assume convergence.
- n_replicates, default = 100: Number of bootstrap replicates or sliding windows to compute.
- keep_replicates, default = false: Whether to store all intermediate dataset files (.geno, .snp, .ind). Note that this will consume a substantial amount of storage. In case of bootstrapping, the bootstrapped indices are stored as checkpoints for full reproducibility in any case.
- n_components, default = 10: Number of components to compute and compare for PCA or MDS analyses. We recommend 10 for PCA analyses and 2 for MDS analyses. The default is 10 since the default for embedding_algorithm is PCA.
- embedding_algorithm, default = PCA: Dimensionality reduction technique you want to use. Allowed options are PCA and MDS.
- smartpca, default = smartpca: File path pointing to an executable of Eigensoft’s smartpca tool. smartpca is used for PCA analyses on the provided dataset. The default only works if smartpca is installed systemwide.
- smartpca_optional_settings, default = not set: Optional additional settings to use when performing PCA with smartpca. See SmartPCA section below for more details.
- embedding_populations, default = not set: File containing a newline-separated list of population names. Only these populations will be used for the dimensionality reduction. In case of PCA analyses, all remaining samples in the dataset will be projected onto the PCA results.
- support_value_rogue_cutoff, default = 0.5: When plotting the support values, only samples with a support value lower than the support_value_rogue_cutoff will be annotated with their sample IDs. Note that all samples in the respective plot are color-coded according to their support value in any case.
- kmeans_k, default = not set: Number of clusters k to use for K-Means clustering of the dimensionality reduction embeddings. If not set, the optimal number of clusters will be automatically determined according to the Bayesian Information Criterion (BIC).
- analysis_mode, default = BOOTSTRAP: Whether to run bootstrap analysis or sliding-window analysis. Allowed options are BOOTSTRAP and SLIDING_WINDOW.
- redo, default = False: Whether to rerun all analyses in case the results files from a previous run are already present. Careful: this will overwrite existing results!
- seed, default = current unix timestamp: Seed to initialize the random number generator. Setting this value is recommended for reproducible analyses.
- threads, default = number of system threads: Number of threads to use for the analysis.
- result_decimals, default = 2: Number of decimals to round the stability scores and support values to in the output.
- verbosity, default = 1: Verbosity of the output logging of Pandora.
  - 0 = quiet, prints only errors and the results (loglevel = ERROR)
  - 1 = verbose, prints all intermediate infos (loglevel = INFO)
  - 2 = debug, prints intermediate infos and debug messages (loglevel = DEBUG)
- plot_results, default = False: Whether to plot all dimensionality reduction results and sample support values.
- plot_dim_x, default = 0: Dimension to plot on the x-axis. Note that the dimensions are zero-indexed. To plot the first dimension set plot_dim_x = 0.
- plot_dim_y, default = 1: Dimension to plot on the y-axis. Note that the dimensions are zero-indexed. To plot the second dimension set plot_dim_y = 1.
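As a rough illustration of how these options fit together, the following sketch renders a minimal configuration as text. It assumes the configuration is a flat YAML mapping of option names to values, in line with the pandora.yaml output file and the SmartPCA schema in this document. The `render_config` helper, the example paths, and the seed value are hypothetical placeholders; only the option names and allowed values are taken from the list above.

```python
def render_config(options: dict) -> str:
    """Render a flat option mapping as simple `key: value` YAML lines."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            # YAML booleans are conventionally written in lowercase.
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical paths and seed; replace them with your own values.
options = {
    "dataset_prefix": "data/example",
    "result_dir": "results/example",
    "file_format": "EIGENSTRAT",
    "embedding_algorithm": "PCA",
    "n_components": 10,
    "analysis_mode": "BOOTSTRAP",
    "n_replicates": 100,
    "bootstrap_convergence_check": True,
    "seed": 42,
    "plot_results": False,
}

config_text = render_config(options)
print(config_text)
```

Saving the printed text to a file gives a starting point for a configuration. Options that are left out fall back to the defaults listed above.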
SmartPCA optional settings
Pandora supports all smartPCA settings. For a list of possible settings, see the SmartPCA documentation.
Not allowed are the following options: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. Use the following schema to set the options:
smartpca_optional_settings:
  shrinkmode: YES
  numoutlieriter: 1
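Before running Pandora, you may want to check your own settings against the list of options that are not allowed. The following is a minimal sketch of such a pre-check. The `find_disallowed` helper is hypothetical and is not part of Pandora; the set of names is copied from the list above.

```python
# Options that are not allowed in smartpca_optional_settings.
DISALLOWED = {
    "genotypename", "snpname", "indivname",
    "evecoutname", "evaloutname", "numoutevec", "maxpops",
}


def find_disallowed(settings: dict) -> list:
    """Return the disallowed option names present in `settings`, sorted."""
    return sorted(key for key in settings if key in DISALLOWED)


ok = {"shrinkmode": "YES", "numoutlieriter": 1}
bad = {"shrinkmode": "YES", "numoutevec": 5, "maxpops": 20}

print(find_disallowed(ok))   # []
print(find_disallowed(bad))  # ['maxpops', 'numoutevec']
```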
Input data
Pandora supports a variety of input formats: any file format that can be converted to Eigensoft’s Eigenstrat format using the convertf program. Pandora expects the three input files (SNP, GENO, IND files) to have the same prefix, and the file endings should follow the convention in the table below.
| File Format | Expected file endings |
|---|---|
| Ancestrymap | |
| Eigenstrat | |
| PED | |
| PackedPED | |
| PackedAncestrymap | |
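Since Pandora expects all input files to share one prefix, a quick check before starting a run can catch a misnamed file. The sketch below does this for the .geno, .snp, and .ind endings, which this document names for the intermediate dataset files; treating them as the expected Eigenstrat input endings is an assumption. The `missing_files` helper and the example prefix are hypothetical.

```python
from pathlib import Path

# Assumed endings of the three dataset files (GENO, SNP, IND).
EIGENSTRAT_ENDINGS = (".geno", ".snp", ".ind")


def missing_files(dataset_prefix: str, endings=EIGENSTRAT_ENDINGS) -> list:
    """Return the expected files for `dataset_prefix` that do not exist."""
    expected = [Path(f"{dataset_prefix}{ending}") for ending in endings]
    return [str(path) for path in expected if not path.is_file()]


# Hypothetical prefix; an empty list means all three files were found.
print(missing_files("data/example"))
```

For other formats, pass the endings that apply to your dataset through the `endings` argument.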
Pandora's bootstrapping and sliding-window analyses operate on files in the Eigenstrat format. All other file formats are therefore automatically converted to Eigenstrat with the convertf tool before the analyses start. Make sure to set the convertf option correctly in your config file before running Pandora.
Note that Pandora does not apply any kind of preprocessing to your data. Make sure to run any appropriate preprocessing (e.g. LD-pruning) prior to Pandora.
Output files
Running Pandora from the command line will produce a number of (intermediate) output files. The following list describes these files and their content. Note that all file names are relative to the result_dir specified in the configuration file.
- pandora.log: The main Pandora log file. Everything you see in your terminal will also be written to this log file.
- pandora.yaml: On program start, Pandora will save a verbose version of the configuration in this file. You can use this file to reproduce your results.
- pandora.txt: Main results file. The summary of the Pandora run will be written to this file, including the Pandora Stability, Pandora Cluster Stability and the summary of the Pandora support values.
- pandora.replicates.csv: Verbose comparison output. This file will contain the Pandora Stability and Pandora Cluster Stability for all pairwise results of bootstrap replicates/windows. Each row corresponds to one comparison, with the first column indicating the indices of the compared bootstraps/windows.
- pandora.supportValues.csv: This file contains the Pandora support value for all samples in the dataset. Each row corresponds to one sample. The CSV has one column, PSV, containing the respective Pandora support value.
- pandora.supportValues.projected.csv: In case you specified a list of populations that should only be used for the PCA embedding, all remaining samples will be projected onto the resulting embedding. This file will contain the same support value data as pandora.supportValues.csv, but only for projected samples.
- bootstrap/: If you selected the bootstrap analyses, this directory will contain the following files for each bootstrap replicate:
  - *.ckp: Pandora checkpoint file that stores the random seed used for this bootstrap as well as the SNP indices.
  - *.eval, *.evec: The results of the smartpca PCA embedding (in case of PCA analyses).
  - *.fst: The results of the smartpca Fst computation (in case of MDS analyses).
  - In case you specified keep_replicates: true in your config, there will also be the bootstrapped dataset files (*.geno, *.snp, *.ind).
- windows/: If you selected the sliding-window analyses, this directory will contain the following files for each window of the dataset:
  - *.eval, *.evec: The results of the smartpca PCA embedding (in case of PCA analyses).
  - *.fst: The results of the smartpca Fst computation (in case of MDS analyses).
  - In case you specified keep_replicates: true in your config, there will also be the dataset files for the windows (*.geno, *.snp, *.ind).
- plots/: If you set plot_results: true in your config, this directory will contain all plots Pandora generated during the execution. The names of the files should be self-explanatory. As of version 1.0.8, we provide each plot in two formats: PDF and HTML. You can open the HTML file in any browser to see an interactive version of the plot.
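The support value file lends itself to simple post-processing. The sketch below lists samples whose support value falls below a cutoff, mirroring the meaning of support_value_rogue_cutoff. It assumes that the first column of the CSV holds the sample ID and that the support values are in a column named PSV; check the header of your own file before relying on this. The `rogue_samples` helper and the example content are made up for illustration.

```python
import csv
import io


def rogue_samples(csv_file, cutoff: float = 0.5) -> list:
    """Return (sample_id, PSV) pairs whose support value is below `cutoff`.

    Assumes the first column holds the sample ID and a column named
    "PSV" holds the support value.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    psv_index = header.index("PSV")
    rogues = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        psv = float(row[psv_index])
        if psv < cutoff:
            rogues.append((row[0], psv))
    return rogues


# Made-up example content standing in for pandora.supportValues.csv.
example = io.StringIO(",PSV\nsample_a,0.97\nsample_b,0.42\nsample_c,0.88\n")
print(rogue_samples(example))  # [('sample_b', 0.42)]
```

To use it on real output, open pandora.supportValues.csv from your result_dir with `open(path, newline="")` and pass the file object to `rogue_samples`.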