pandora.dataset module

class pandora.dataset.EigenDataset(file_prefix: Path, embedding_populations: Path | None = None, sample_ids: Series | None = None, populations: Series | None = None, n_snps: int | None = None)[source]

Bases: object

Class structure to represent a population genetics dataset in Eigenstrat format.

This class provides methods to perform PCA and MDS analyses using the Eigensoft smartpca tool. It further provides methods to generate a bootstrap replicate dataset (SNPs resampled with replacement) and to generate overlapping sliding windows of sub-datasets. Note that for the bootstrap and windowing methods to work, the respective geno, ind, and snp files need to be in EIGENSTRAT format, share the same file prefix, and have the file endings .geno, .ind, and .snp.

Note that Pandora does not check on init whether all input files are present, because a dataset may be initialized even if the files are missing. This is useful for saving storage when drawing many bootstrap replicates. If the input files are missing, you need to pass sample_ids, populations, and n_snps. The files are still required if you want to run self.bootstrap, self.run_pca, self.run_mds, or self.get_windows.

Parameters:

file_prefix (pathlib.Path): File path prefix pointing to the ind, geno, and snp files in EIGENSTRAT format. All methods assume that all three files have the same prefix and have the file endings .geno, .ind, and .snp.
embedding_populations (pathlib.Path, default=None): Path pointing to a file containing a newline-separated list of population names. Only samples belonging to these populations will be used for PCA analyses. If not set, all samples will be used.
sample_ids (pd.Series, default=None): pandas Series containing a sample ID for each sequence in the dataset. If not set, Pandora will set this based on the content of self._ind_file.
populations (pd.Series, default=None): pandas Series containing the population for each sample in the dataset. If not set, Pandora will set this based on the content of self._ind_file.
n_snps (int, default=None): Number of SNPs in the dataset. If not set, Pandora will set this based on the content of self._geno_file.

Attributes:

file_prefix (pathlib.Path): File path prefix pointing to the ind, geno, and snp files in EIGENSTRAT format. All methods assume that all three files have the same prefix and have the file endings .geno, .ind, and .snp.
name (str): Name of the dataset. Inferred as the name of the provided file_prefix.
embedding_populations (List[str]): List of populations used for PCA analysis with smartpca.
sample_ids (pd.Series): Pandas series containing the sample IDs for the dataset.
populations (pd.Series): Pandas series containing the populations for all samples in the dataset.
projected_samples (pd.DataFrame): Subset of self.sample_ids that contains only the sample IDs that do not belong to embedding_populations.
n_snps (int): Number of SNPs in the dataset.
pca (Embedding): PCA Embedding object resulting from a smartpca run on the provided dataset. This is None until self.run_pca() has been called.
mds (Embedding): MDS Embedding object resulting from an MDS computation. This is None until self.run_mds() has been called.

Methods

`bootstrap`(bootstrap_prefix, seed[, redo])	Creates a bootstrap dataset based on the content of self.
`check_files`()	Checks whether all input files (geno, snp, ind) are in correct format according to the EIGENSOFT specification.
`files_exist`()	Checks whether all required input files (geno, snp, ind) exist.
`fst_population_distance`(smartpca, result_prefix)	Computes the FST genetic distance matrix using EIGENSOFT's smartpca tool.
`get_population_info`()	Reads the populations from `self._ind_file`.
`get_projected_samples`()	Returns a pandas series with sample IDs of projected samples.
`get_sample_info`()	Reads the sample IDs from `self._ind_file`.
`get_sequence_length`()	Counts and returns the number of SNPs in `self._geno_file`.
`get_windows`(result_dir[, n_windows])	Creates n_windows new `EigenDataset` objects as overlapping sliding windows over self.
`remove_input_files`()	Removes all three input files (`self._ind_file`, `self._geno_file`, `self._snp_file`).
`run_mds`(smartpca[, n_components, ...])	Performs MDS analysis using the data provided in this class and assigns its MDS Embedding result to self.mds.
`run_pca`([smartpca, n_components, ...])	Runs the EIGENSOFT smartpca on the dataset and assigns its PCA Embedding result to self.pca.

Raises:

PandoraException: If not all input files (.geno, .ind, and .snp) exist, but sample_ids, populations and/or n_snps is None.

bootstrap(bootstrap_prefix: Path, seed: int, redo: bool = False) → EigenDataset[source]

Creates a bootstrap dataset based on the content of self.

Bootstraps the dataset by resampling SNPs with replacement.

Parameters:

bootstrap_prefix (pathlib.Path)

Prefix of the file path to write the bootstrap dataset to. The resulting files will be bootstrap_prefix.geno, bootstrap_prefix.ind, and bootstrap_prefix.snp.

seed (int)

Seed to initialize the random number generator before drawing the replicates.

redo (bool, default=False)

Whether to redo the bootstrap if all output files are present.

Returns:

EigenDataset: A new dataset object containing the bootstrap replicate data.
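The resampling step can be illustrated with a minimal, standard-library sketch. It is not Pandora's implementation: the helper name `bootstrap_snps` and the toy data are hypothetical. It relies only on the EIGENSTRAT layout, where line i of the .geno file and line i of the .snp file describe the same SNP, so the same indices are drawn for both.

```python
import random
from typing import List, Tuple


def bootstrap_snps(
    geno_lines: List[str], snp_lines: List[str], seed: int
) -> Tuple[List[str], List[str]]:
    """Resample SNPs with replacement (hypothetical helper, not Pandora's code)."""
    if len(geno_lines) != len(snp_lines):
        raise ValueError("geno and snp must describe the same number of SNPs")
    rng = random.Random(seed)
    n_snps = len(geno_lines)
    # Draw n_snps SNP indices with replacement.
    indices = rng.choices(range(n_snps), k=n_snps)
    # Apply the same indices to both files so they stay aligned.
    return [geno_lines[i] for i in indices], [snp_lines[i] for i in indices]


# Toy data: 4 SNPs (lines) x 4 samples (characters); 9 marks a missing call.
geno = ["0129", "2210", "1102", "0021"]
snp = [
    "rs1 1 0.0 100 A C",
    "rs2 1 0.0 200 G T",
    "rs3 1 0.0 300 A G",
    "rs4 1 0.0 400 C T",
]
boot_geno, boot_snp = bootstrap_snps(geno, snp, seed=42)
```

Because only SNPs are resampled, the sample annotation in the .ind file would stay unchanged in such a replicate. Reusing the same seed reproduces the same replicate.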

check_files() → None[source]

Checks whether all input files (geno, snp, ind) are in correct format according to the EIGENSOFT specification.

Returns:

None

Raises:

PandoraException: If any of the three files is malformatted.

files_exist() → bool[source]

Checks whether all required input files (geno, snp, ind) exist.

Returns:

bool: Whether all three input files are present.
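As a rough illustration of such a check, the sketch below tests for the three files next to a given prefix. The function name `eigen_files_exist` is hypothetical and the code is an assumption about how a check like this could look, not a copy of Pandora's method.

```python
import tempfile
from pathlib import Path


def eigen_files_exist(file_prefix: Path) -> bool:
    """Return True only if prefix.geno, prefix.snp, and prefix.ind all exist."""
    return all(
        Path(f"{file_prefix}{suffix}").exists()
        for suffix in (".geno", ".snp", ".ind")
    )


with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "example"
    Path(f"{prefix}.geno").touch()
    Path(f"{prefix}.snp").touch()
    only_two_files = eigen_files_exist(prefix)  # .ind is still missing
    Path(f"{prefix}.ind").touch()
    all_three_files = eigen_files_exist(prefix)
```

The suffix is appended with an f-string rather than `Path.with_suffix`, so a prefix that itself contains a dot is not truncated.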

fst_population_distance(smartpca: str | Path, result_prefix: Path, redo: bool = False) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series][source]

Computes the FST genetic distance matrix using EIGENSOFT’s smartpca tool.

The resulting FST matrix will be stored in fst_file and returned as numpy array.

Parameters:

smartpca (Executable): Path pointing to an executable of the EIGENSOFT smartpca tool.
result_prefix (pathlib.Path): Prefix where to store the results of the smartpca FST computation. On successful execution, two files will be created: the FST result ({prefix}.fst) and a smartpca log file ({prefix}.smartpca.log).
redo (bool): Whether to recompute the FST matrix in case the result file is already present.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise FST distances between all unique populations. The shape of this matrix is (n_unique_populations, n_unique_populations).
populations (pd.Series): Pandas Series containing a population name for each row in the distance matrix. The values of this series are the unique populations.

Raises:

RuntimeError: If the smartpca FST distance matrix computation failed.

get_population_info() → Series[source]

Reads the populations from self._ind_file.

Returns:

pd.Series: Pandas series containing the population for each sample in the order in the ind file.

Raises:

PandoraException: if the respective .ind file does not exist

get_projected_samples() → Series[source]

Returns a pandas series with sample IDs of projected samples.

Whether a sample is projected or used to compute the embedding is decided based on the presence and content of self._embedding_populations_file.

Returns:

pd.Series: Pandas series containing only sample IDs of projected samples.
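The selection rule can be sketched with plain lists standing in for the pandas Series. The helper `projected_sample_ids` is hypothetical. It follows the descriptions in this section: samples whose population is not among the embedding populations are projected, and if no embedding populations are given, all samples are used for the embedding.

```python
from typing import List


def projected_sample_ids(
    sample_ids: List[str], populations: List[str], embedding_populations: List[str]
) -> List[str]:
    """Return the IDs of samples that are projected (hypothetical helper)."""
    if not embedding_populations:
        # No restriction given: every sample is used for the embedding.
        return []
    embedding = set(embedding_populations)
    return [
        sample_id
        for sample_id, population in zip(sample_ids, populations)
        if population not in embedding
    ]


projected = projected_sample_ids(
    sample_ids=["s1", "s2", "s3"],
    populations=["popA", "popB", "ancient"],
    embedding_populations=["popA", "popB"],
)
```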

get_sample_info() → Series[source]

Reads the sample IDs from self._ind_file.

Returns:

pd.Series: Pandas series containing the sample IDs of the dataset in the order in the ind file.

Raises:

PandoraException: if the respective ind file does not exist

get_sequence_length() → int[source]

Counts and returns the number of SNPs in self._geno_file.

Returns:

int: Number of SNPs in self._geno_file.

Raises:

PandoraException: if the respective geno file does not exist

get_windows(result_dir: Path, n_windows: int = 100) → List[EigenDataset][source]

Creates n_windows new EigenDataset objects as overlapping sliding windows over self.

Let \(K\) be the number of SNPs in self and \(N\) = n_windows. Each dataset will have a window size of \(int(K / N + K / (2 N))\) SNPs. The stride is \(int(K / N)\) and the overlap between windows is thus \(int(K / (2 N))\) SNPs. Note that the last EigenDataset will contain fewer SNPs as there is no following window to overlap with. However, due to rounding, the last dataset is not necessarily shorter by exactly the overlap.

Parameters:

result_dir (pathlib.Path): Directory where to store the resulting Dataset files. Each window Dataset will be named window_{i} for i in range(n_windows).
n_windows (int, default=100): Number of windowed datasets to generate.

Returns:

windows (List[EigenDataset]): List of n_windows new EigenDataset objects as overlapping windows over self.
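The window arithmetic can be made concrete with a short sketch. It assumes that the overlap term is meant as K / (2N), that window i starts at i times the stride, and that the last window is cut off at the end of the data. The helper `window_bounds` is hypothetical and only computes index ranges; it does not write any files.

```python
from typing import List, Tuple


def window_bounds(n_snps: int, n_windows: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) SNP index ranges (hypothetical helper)."""
    stride = int(n_snps / n_windows)
    # Assumption: window size = stride plus half a stride of overlap.
    window_size = int(n_snps / n_windows + n_snps / (2 * n_windows))
    bounds = []
    for i in range(n_windows):
        start = i * stride
        # The last window has no successor and is truncated at n_snps.
        end = min(start + window_size, n_snps)
        bounds.append((start, end))
    return bounds


bounds = window_bounds(n_snps=100, n_windows=5)
# -> [(0, 30), (20, 50), (40, 70), (60, 90), (80, 100)]
```

With K = 100 and N = 5 the stride is 20 and the overlap is 10, so each window spans 30 SNPs except the last one, which spans 20. With K = 103 the last window spans 23 SNPs under these assumptions, which shows why it is not always shorter by exactly the overlap.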

remove_input_files() → None[source]

Removes all three input files (self._ind_file, self._geno_file, self._snp_file).

This is useful if you want to save storage space and don’t need the input files anymore (e.g. for bootstrap replicates).

Returns:

None

run_mds(smartpca: str | Path, n_components: int = 2, result_dir: Path | None = None, redo: bool = False) → None[source]

Performs MDS analysis using the data provided in this class and assigns its MDS Embedding result to self.mds.

The FST matrix is generated using the EIGENSOFT smartpca tool. The subsequent MDS analysis is performed using the scikit-allel MDS implementation (PCoA).

Please note that since EIGENSOFT is used for the FST computation, all samples with the population set to Ignore will not be used for the MDS computation.

Parameters:

smartpca (Executable)

Path pointing to an executable of the EIGENSOFT smartpca tool.

n_components (int, default=2)

Number of components to reduce the data to.

result_dir (pathlib.Path, default=self._file_dir)

Directory where to store the data. Calling this method will create two files:

result_dir / (self.name + ".fst"): contains the FST distance matrix.
result_dir / (self.name + ".smartpca.log"): contains the smartpca log.

Default is self._file_dir.

redo (bool, default=False)

Whether to recompute the FST matrix in case the result file is already present.

Returns:

None

Raises:

PandoraException: If the number of components is >= the number of SNPs in self.input_data.
RuntimeError: If the smartpca FST distance matrix computation failed.

run_pca(smartpca: str | Path = 'smartpca', n_components: int = 10, result_dir: Path | None = None, redo: bool = False, smartpca_optional_settings: Dict | None = None) → None[source]

Runs the EIGENSOFT smartpca on the dataset and assigns its PCA Embedding result to self.pca.

Additional smartpca options can be passed as dictionary in smartpca_optional_settings, e.g. smartpca_optional_settings = dict(numoutlieriter=0, shrinkmode=True).

Parameters:

smartpca (Executable, default="smartpca"): Path pointing to an executable of the EIGENSOFT smartpca tool. Default is "smartpca". This will only work if smartpca is installed system-wide.
n_components (int, default=10): Number of principal components to output.
result_dir (pathlib.Path, default=self._file_dir): File path pointing to where the results are written.
redo (bool, default=False): Whether to redo the analysis if all output files are already present and correct.
smartpca_optional_settings (Dict, default=None): Additional smartpca settings. The following options are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. If not set, the default settings of your smartpca executable are used.

Returns:

None

Raises:

PandoraException

If the number of principal components is >= the number of SNPs in self.input_data.
If not all input files in EIGENSTRAT format are present (geno, snp, ind files).

RuntimeError

If the smartpca run failed.
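Before calling run_pca, a settings dictionary can be checked against the option names that the parameter description lists as not allowed. The sketch below is a hypothetical pre-check written for illustration. How Pandora itself reacts to a disallowed option is not stated in this section.

```python
from typing import Any, Dict, List, Optional

# Option names listed as not allowed in smartpca_optional_settings.
RESERVED_SMARTPCA_OPTIONS = frozenset(
    {
        "genotypename",
        "snpname",
        "indivname",
        "evecoutname",
        "evaloutname",
        "numoutevec",
        "maxpops",
    }
)


def reserved_options_used(settings: Optional[Dict[str, Any]]) -> List[str]:
    """Return the disallowed option names found in settings (hypothetical helper)."""
    if settings is None:
        return []
    return sorted(set(settings) & RESERVED_SMARTPCA_OPTIONS)


allowed = reserved_options_used(dict(numoutlieriter=0, shrinkmode=True))
conflicting = reserved_options_used(dict(numoutevec=5, shrinkmode=True))
```

Here `allowed` is empty, while `conflicting` names numoutevec, whose role is covered by the n_components parameter.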

class pandora.dataset.NumpyDataset(input_data: ndarray[Any, dtype[_ScalarType_co]], sample_ids: Series, populations: Series, missing_value: float | int = nan, dtype: dtype = np.uint8)[source]

Bases: object

Class structure to represent a population genetics dataset in numerical format.

This class provides methods to perform PCA (using scikit-learn) and MDS (using scikit-allel) analyses on the provided numerical data. It further provides methods to generate a bootstrap replicate dataset (SNPs resampled with replacement) and to generate overlapping sliding windows of sub-datasets.

Parameters:

input_data (npt.NDArray): Numpy Array containing the input data to use.
sample_ids (pd.Series[str]): Pandas Series containing the sample IDs of the sequences contained in input_data. Expects the number of sample_ids to match the first dimension of input_data.
populations (pd.Series[str]): Pandas Series containing the populations of the sequences contained in input_data. Expects the number of populations to match the first dimension of input_data. This population annotation can be used to group sequences, for example according to population structure in population genetics datasets or different cell types in gene expression data. These populations are relevant e.g. for MDS analyses using a per-population distance metric or when plotting the results of a PCA/MDS analysis using the plotting.plot_populations method.
missing_value (Union[float, int], default=np.nan): Value to treat as missing value. All missing values in input_data will be replaced with a special value depending on the specified dtype:

For floating point types: np.nan
For signed integer types: -1
For unsigned integer types: the highest possible integer for the respective type

dtype (np.dtype, default=np.uint8): Numpy datatype to use to represent the input data. The more complex the data type, the higher its memory footprint. The default is the datatype with the smallest possible memory footprint: np.uint8 (unsigned 8-bit integer). For most genotype data with only 0, 1, 2, and missing as genotype calls this is the best option.

Attributes:

input_data (npt.NDArray): Numpy Array containing the input data to use.
sample_ids (pd.Series[str]): Pandas Series containing a sample ID for each row in input_data.
populations (pd.Series[str]): Pandas Series containing a population name for each row in input_data.
pca (Embedding): PCA Embedding object resulting from a PCA analysis run on the provided dataset. This is None until self.run_pca() has been called.
mds (Embedding): MDS Embedding object resulting from an MDS computation. This is None until self.run_mds() has been called.

Methods

`bootstrap`([seed])	Creates a bootstrap dataset based on the content of self.
`get_windows`([n_windows])	Creates `n_windows` new `NumpyDataset` objects as overlapping sliding windows over self.
`run_mds`([n_components, distance_metric, ...])	Performs MDS analysis using the data provided in this class.
`run_pca`([n_components, imputation])	Performs PCA analysis on `self.input_data` reducing the data to `n_components` dimensions.

Raises:

PandoraException: If the number of samples and number of populations differ (need to provide one population per sample).

bootstrap(seed: int | None = None) → NumpyDataset[source]

Creates a bootstrap dataset based on the content of self.

Bootstraps the dataset by resampling SNPs with replacement.

Parameters:

seed (Optional[int], default=None): Optional seed to initialize the random number generator before drawing the replicates.

Returns:

NumpyDataset: A new dataset object containing the bootstrapped input_data.

get_windows(n_windows: int = 100) → List[NumpyDataset][source]

Creates n_windows new NumpyDataset objects as overlapping sliding windows over self.

Let \(K\) be the number of SNPs in self and \(N\) = n_windows. Each dataset will have a window size of \(int(K / N + K / (2 N))\) SNPs. The stride is \(int(K / N)\) and the overlap between windows is thus \(int(K / (2 N))\) SNPs. Note that the last NumpyDataset will contain fewer SNPs as there is no following window to overlap with. However, due to rounding, the last dataset is not necessarily shorter by exactly the overlap.

Parameters:

n_windows (int, default=100): Number of windowed datasets to generate.

Returns:

windows (List[NumpyDataset]): List of n_windows new NumpyDataset objects as overlapping windows over self.

run_mds(n_components: int = 2, distance_metric: Callable[[ndarray[Any, dtype[_ScalarType_co]], Series, str], Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]] = <function euclidean_sample_distance>, imputation: str | None = 'mean') → None[source]

Performs MDS analysis using the data provided in this class.

The distance matrix is generated using the provided distance_metric callable. The subsequent MDS analysis is performed using the scikit-allel MDS implementation (PCoA). The result of the MDS analysis is an Embedding object assigned to self.mds.

Parameters:

n_components (int, default=2)

Number of components to reduce the data to.

distance_metric (Callable[[npt.NDArray, pd.Series, str], Tuple[npt.NDArray, pd.Series]], default=euclidean_sample_distance)

Distance metric to use for computing the distance matrix input for MDS. This is expected to be a function that receives the numpy array of sequences, the population for each sequence, and the imputation method as input and should output the distance matrix and the respective populations for each row. The resulting distance matrix is of size \((n, m)\) and the resulting populations is expected to be of size \((n, 1)\). Default is distance_metrics::euclidean_sample_distance (the pairwise Euclidean distance of all samples).

imputation (Optional[str], default="mean")

Imputation method to use. Available options are:

mean: Imputes missing values with the average of the respective SNP
remove: Removes all SNPs with at least one missing value.
None: Does not impute missing data.

Note that depending on the distance_metric, not all imputation methods are supported. See the respective documentations in the distance_metrics module.

Returns:

None

Raises:

PandoraException

If the distance_metric did not return one population per row in the returned distance matrix.
If the number of components is >= the number of rows in the distance matrix.
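The contract of a distance_metric callable can be sketched as follows. This is an illustrative stand-in, not part of Pandora: it uses plain lists instead of a numpy array and a pandas Series so that it runs with the standard library alone, and the metric `manhattan_sample_distance` is hypothetical. The point is the shape of the interface: data, populations, and imputation method in, distance matrix and one population per row out.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def manhattan_sample_distance(
    data: Matrix, populations: List[str], imputation: Optional[str]
) -> Tuple[Matrix, List[str]]:
    """Pairwise Manhattan distance between samples (hypothetical metric).

    The imputation argument is accepted to match the expected signature,
    but this sketch does not impute and rejects missing values instead.
    """
    if any(math.isnan(value) for row in data for value in row):
        raise ValueError("this sketch does not impute; remove missing values first")
    n_samples = len(data)
    distances = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            d = float(sum(abs(a - b) for a, b in zip(data[i], data[j])))
            distances[i][j] = distances[j][i] = d
    # One population per row of the distance matrix, in row order.
    return distances, list(populations)


genotypes = [[0, 1, 2], [2, 1, 0], [0, 1, 1]]
matrix, row_populations = manhattan_sample_distance(
    genotypes, ["popA", "popB", "popA"], None
)
```

A per-sample metric like this one returns the populations unchanged. A per-population metric would instead return one row and one population entry per unique population.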

run_pca(n_components: int = 10, imputation: str | None = 'mean') → None[source]

Performs PCA analysis on self.input_data reducing the data to n_components dimensions.

Uses the scikit-learn PCA implementation. The result of the PCA analysis is an Embedding object assigned to self.pca.

Parameters:

n_components (int, default=10)

Number of components to reduce the data to. Default is 10.

imputation (Optional[str], default="mean")

Imputation method to use. Available options for PCA are:

mean: Imputes missing values with the average of the respective SNP
remove: Removes all SNPs with at least one missing value.
None: Does not impute missing data. Note that this option is only valid if self.input_data does not contain NaN values.

Returns:

None

Raises:

PandoraException

If the number of principal components is >= the number of SNPs in self.input_data.
If imputation is None but self.input_data contains NaN values.
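The three imputation options can be illustrated on a small samples-by-SNPs matrix. The sketch below uses nested lists and `math.nan` instead of a numpy array, and the helper `impute` is hypothetical. It mirrors the option descriptions in this section but is not Pandora's code.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def impute(data: Matrix, imputation: Optional[str]) -> Matrix:
    """Apply 'mean', 'remove', or None to a samples x SNPs matrix (sketch)."""
    n_snps = len(data[0])
    columns = [[row[j] for row in data] for j in range(n_snps)]
    if imputation is None:
        if any(math.isnan(v) for column in columns for v in column):
            raise ValueError("imputation=None requires data without NaN values")
        return [list(row) for row in data]
    if imputation == "remove":
        # Keep only SNPs (columns) without any missing value.
        keep = [
            j for j, column in enumerate(columns)
            if not any(math.isnan(v) for v in column)
        ]
        return [[row[j] for j in keep] for row in data]
    if imputation == "mean":
        # Replace each missing value with the mean of the observed calls
        # of the same SNP (column).
        means = []
        for column in columns:
            observed = [v for v in column if not math.isnan(v)]
            means.append(sum(observed) / len(observed) if observed else math.nan)
        return [
            [means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in data
        ]
    raise ValueError(f"unknown imputation method: {imputation!r}")


data = [
    [0.0, math.nan, 2.0],
    [2.0, 1.0, 2.0],
    [1.0, 2.0, math.nan],
]
mean_imputed = impute(data, "mean")
snps_removed = impute(data, "remove")
```

In this toy matrix, mean imputation fills the two gaps with 1.5 and 2.0, while remove keeps only the first SNP, the only one without a missing call.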

pandora.dataset.numpy_dataset_from_eigenfiles(eigen_prefix: Path) → NumpyDataset[source]

Loads a genotype dataset in EIGENSTRAT format as NumpyDataset.

This method only requires the genotype and individual files, as the SNP metadata in the .snp file is not used. Note that the dataset needs to be in EIGENSTRAT format, with both files sharing the same file prefix and having the file endings .geno and .ind, respectively.

Parameters:

eigen_prefix (pathlib.Path): File path prefix pointing to the ind and geno genotype files in EIGENSTRAT format. This method assumes that all EIGEN files have the same prefix and have the file endings .geno and .ind. (Note that the .snp file is not required.)

Returns:

NumpyDataset: NumpyDataset containing the genotype data provided in the EIGEN files located at eigen_prefix.

Raises:

PandoraException: If not all required input files (.geno and .ind) for the given eigen_prefix exist.
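The conversion from EIGENSTRAT text to a samples-by-SNPs matrix can be sketched as below. This is a simplified illustration, not the function's actual implementation: the helper `parse_eigenstrat` is hypothetical, works on in-memory lines, and returns nested lists. It relies on the EIGENSTRAT layout, where each .geno line is one SNP with one character per sample (9 marks a missing call) and each .ind line holds sample ID, sex, and population.

```python
import math
from typing import List, Tuple


def parse_eigenstrat(
    geno_lines: List[str], ind_lines: List[str]
) -> Tuple[List[List[float]], List[str], List[str]]:
    """Return (samples x SNPs matrix, sample IDs, populations) as plain lists."""
    sample_ids, populations = [], []
    for line in ind_lines:
        sample_id, _sex, population = line.split()
        sample_ids.append(sample_id)
        populations.append(population)

    n_samples = len(sample_ids)
    matrix: List[List[float]] = [[] for _ in range(n_samples)]
    for snp_line in geno_lines:
        calls = snp_line.strip()
        if len(calls) != n_samples:
            raise ValueError("geno line length does not match the number of samples")
        # Transpose on the fly: the file is SNP-major, the matrix sample-major.
        for i, call in enumerate(calls):
            matrix[i].append(math.nan if call == "9" else float(call))
    return matrix, sample_ids, populations


geno_lines = ["012", "290", "111"]  # 3 SNPs x 3 samples
ind_lines = ["s1 M popA", "s2 F popA", "s3 U popB"]
genotypes, ids, pops = parse_eigenstrat(geno_lines, ind_lines)
```

The resulting rows line up with the sample IDs and populations, which matches the requirement that the first dimension of input_data equals the number of sample_ids.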