pandora.dataset module
- class pandora.dataset.EigenDataset(file_prefix: Path, embedding_populations: Path | None = None, sample_ids: Series | None = None, populations: Series | None = None, n_snps: int | None = None)
Bases: object
Class structure to represent a population genetics dataset in EIGENSTRAT format.
This class provides methods to perform PCA and MDS analyses using the EIGENSOFT smartpca tool. It also provides methods to generate a bootstrap replicate dataset (SNPs resampled with replacement) and to generate overlapping sliding windows of sub-datasets. For the bootstrap and windowing methods to work, the geno, ind, and snp files need to be in EIGENSTRAT format, share the same file prefix, and have the file endings .geno, .ind, and .snp.
Note that Pandora does not check on init whether all input files are present, so you can initialize a dataset even if the files are missing. This is useful for saving storage when drawing many bootstrap replicates. If the input files are missing, you need to pass sample_ids, populations, and n_snps. You still need the files if you want to run self.bootstrap, self.run_pca, self.run_mds, or self.get_windows.
- Parameters:
- file_prefix : pathlib.Path
File path prefix pointing to the ind, geno, and snp files in EIGENSTRAT format. All methods assume that all three files have the same prefix and have the file endings .geno, .ind, and .snp.
- embedding_populations : pathlib.Path, default=None
Path to a file containing a newline-separated list of population names. Only samples belonging to these populations will be used for PCA analyses. If not set, all samples will be used.
- sample_ids : pd.Series, default=None
Pandas Series containing a sample ID for each sequence in the dataset. If not set, Pandora will set this based on the content of self._ind_file.
- populations : pd.Series, default=None
Pandas Series containing the population for each sample in the dataset. If not set, Pandora will set this based on the content of self._ind_file.
- n_snps : int, default=None
Number of SNPs in the dataset. If not set, Pandora will set this based on the content of self._geno_file.
- Attributes:
- file_prefix : pathlib.Path
File path prefix pointing to the ind, geno, and snp files in EIGENSTRAT format. All methods assume that all three files have the same prefix and have the file endings .geno, .ind, and .snp.
- name : str
Name of the dataset. Inferred as the name of the provided file_prefix.
- embedding_populations : List[str]
List of populations used for PCA analysis with smartpca.
- sample_ids : pd.Series
Pandas Series containing the sample IDs for the dataset.
- populations : pd.Series
Pandas Series containing the populations for all samples in the dataset.
- projected_samples : pd.DataFrame
Subset of self.sample_ids; contains only the sample IDs that do not belong to embedding_populations.
- n_snps : int
Number of SNPs in the dataset.
- pca : Embedding
PCA Embedding object resulting from a smartpca run on the provided dataset. This is None until self.run_pca() has been called.
- mds : Embedding
MDS Embedding object resulting from an MDS computation. This is None until self.run_mds() has been called.
Methods
bootstrap(bootstrap_prefix, seed[, redo]): Creates a bootstrap dataset based on the content of self.
check_files(): Checks whether all input files (geno, snp, ind) are in correct format according to the EIGENSOFT specification.
files_exist(): Checks whether all required input files (geno, snp, ind) exist.
fst_population_distance(smartpca, result_prefix): Computes the FST genetic distance matrix using EIGENSOFT's smartpca tool.
get_population_info(): Reads the populations from self._ind_file.
get_projected_samples(): Returns a pandas Series with sample IDs of projected samples.
get_sample_info(): Reads the sample IDs from self._ind_file.
get_sequence_length(): Counts and returns the number of SNPs in self._geno_file.
get_windows(result_dir[, n_windows]): Creates n_windows new EigenDataset objects as overlapping sliding windows over self.
remove_input_files(): Removes all three input files (self._ind_file, self._geno_file, self._snp_file).
run_mds(smartpca[, n_components, ...]): Performs MDS analysis using the data provided in this class and assigns its MDS Embedding result to self.mds.
run_pca([smartpca, n_components, ...]): Runs the EIGENSOFT smartpca on the dataset and assigns its PCA Embedding result to self.pca.
- Raises:
- PandoraException
If not all input files (.geno, .ind, and .snp) exist, but sample_ids, populations, and/or n_snps is None.
- bootstrap(bootstrap_prefix: Path, seed: int, redo: bool = False) -> EigenDataset
Creates a bootstrap dataset based on the content of self.
Bootstraps the dataset by resampling SNPs with replacement.
- Parameters:
- bootstrap_prefix : pathlib.Path
Prefix of the file path to write the bootstrap dataset to. The resulting files will be bootstrap_prefix.geno, bootstrap_prefix.ind, and bootstrap_prefix.snp.
- seed : int
Seed to initialize the random number generator before drawing the replicates.
- redo : bool, default=False
Whether to redo the bootstrap if all output files are present.
- Returns:
- EigenDataset
A new dataset object containing the bootstrap replicate data.
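To illustrate what resampling SNPs with replacement means for EIGENSTRAT files, here is a minimal standard-library sketch. It assumes the usual EIGENSTRAT layout in which line i of the .geno file and line i of the .snp file describe the same SNP. The helper `bootstrap_snp_lines` and the toy data are hypothetical and are not Pandora's implementation.

```python
import random


def bootstrap_snp_lines(geno_lines, snp_lines, seed):
    """Resample SNPs with replacement, keeping geno and snp lines paired.

    Hypothetical helper: assumes geno_lines[i] and snp_lines[i] describe
    the same SNP.
    """
    if len(geno_lines) != len(snp_lines):
        raise ValueError("geno and snp must contain the same number of SNPs")
    rng = random.Random(seed)  # seeded generator makes the replicate reproducible
    n_snps = len(geno_lines)
    # Draw n_snps indices with replacement; sorting keeps the original SNP order.
    indices = sorted(rng.randrange(n_snps) for _ in range(n_snps))
    return [geno_lines[i] for i in indices], [snp_lines[i] for i in indices]


# Toy data: four SNPs genotyped in four samples (9 marks a missing call).
geno = ["0129", "2210", "1111", "0020"]
snp = [
    "rs1 1 0.0 100 A G",
    "rs2 1 0.0 200 C T",
    "rs3 1 0.0 300 G A",
    "rs4 1 0.0 400 T C",
]
boot_geno, boot_snp = bootstrap_snp_lines(geno, snp, seed=42)
```

The .ind file is unaffected by this kind of resampling, since only SNPs (rows of the geno and snp files) are redrawn.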
- check_files() -> None
Checks whether all input files (geno, snp, ind) are in correct format according to the EIGENSOFT specification.
- Returns:
- None
- Raises:
- PandoraException
If any of the three files is malformatted.
- files_exist() -> bool
Checks whether all required input files (geno, snp, ind) exist.
- Returns:
- bool
Whether all three input files are present.
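The existence check amounts to testing three paths derived from the shared prefix. The following is a minimal sketch under that assumption; `eigen_files_exist` is a hypothetical stand-in and not Pandora's code.

```python
import pathlib
import tempfile


def eigen_files_exist(file_prefix):
    """Return True if <prefix>.geno, <prefix>.ind, and <prefix>.snp all exist.

    Hypothetical stand-in for the check described above.
    """
    return all(
        pathlib.Path(f"{file_prefix}{ending}").exists()
        for ending in (".geno", ".ind", ".snp")
    )


with tempfile.TemporaryDirectory() as tmp:
    prefix = pathlib.Path(tmp) / "example"
    missing = eigen_files_exist(prefix)  # nothing has been written yet
    for ending in (".geno", ".ind", ".snp"):
        pathlib.Path(f"{prefix}{ending}").touch()
    present = eigen_files_exist(prefix)  # all three files now exist
```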
- fst_population_distance(smartpca: str | Path, result_prefix: Path, redo: bool = False) -> Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]
Computes the FST genetic distance matrix using EIGENSOFT’s smartpca tool.
The resulting FST matrix will be stored in fst_file and returned as a numpy array.
- Parameters:
- smartpca : Executable
Path pointing to an executable of the EIGENSOFT smartpca tool.
- result_prefix : pathlib.Path
Prefix where to store the results of the smartpca FST computation. On successful execution, two files will be created: the FST result ({prefix}.fst) and a smartpca log file ({prefix}.smartpca.log).
- redo : bool
Whether to recompute the FST matrix in case the result file is already present.
- Returns:
- distance_matrix : npt.NDArray
Distance matrix of pairwise FST distances between all unique populations. The shape of this matrix is (n_unique_populations, n_unique_populations).
- populations : pd.Series
Pandas Series containing a population name for each row in the distance matrix. The values of this series are the unique populations.
- Raises:
- RuntimeError
If the smartpca FST distance matrix computation failed.
- get_population_info() -> Series
Reads the populations from self._ind_file.
- Returns:
- pd.Series
Pandas Series containing the population for each sample, in the order of the ind file.
- Raises:
- PandoraException
If the respective .ind file does not exist.
- get_projected_samples() -> Series
Returns a pandas Series with the sample IDs of projected samples.
Whether a sample is projected or used to compute the embedding is decided based on the presence and content of self._embedding_populations_file.
- Returns:
- pd.Series
Pandas Series containing only the sample IDs of projected samples.
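The selection rule can be sketched in a few lines: a sample is projected when its population is not one of the embedding populations. This sketch assumes that an empty or missing list of embedding populations means every sample is used for the embedding, so nothing is projected. The function name and the toy labels are hypothetical.

```python
def projected_samples(sample_ids, populations, embedding_populations):
    """Return IDs of samples whose population is not an embedding population.

    Hypothetical helper. An empty embedding_populations list is treated as
    "all samples are used for the embedding", so nothing is projected.
    """
    if not embedding_populations:
        return []
    keep = set(embedding_populations)
    return [sid for sid, pop in zip(sample_ids, populations) if pop not in keep]


ids = ["s1", "s2", "s3", "s4"]
pops = ["French", "Han", "Ancient", "French"]
projected = projected_samples(ids, pops, ["French", "Han"])
```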
- get_sample_info() -> Series
Reads the sample IDs from self._ind_file.
- Returns:
- pd.Series
Pandas Series containing the sample IDs of the dataset, in the order of the ind file.
- Raises:
- PandoraException
If the respective ind file does not exist.
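Both sample IDs and populations come from the same .ind file. As a rough illustration, the sketch below parses .ind lines assuming the standard EIGENSTRAT layout of three whitespace-separated columns (sample ID, sex, population label). `parse_ind_lines` is a hypothetical helper that returns plain lists rather than pandas Series.

```python
def parse_ind_lines(lines):
    """Split EIGENSTRAT .ind lines into sample IDs and population labels.

    Assumes three whitespace-separated columns per line:
    sample ID, sex (M/F/U), population label.
    """
    sample_ids, populations = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sample_id, _sex, population = line.split()[:3]
        sample_ids.append(sample_id)
        populations.append(population)
    return sample_ids, populations


ind_lines = ["  SAMPLE_1 M French", "SAMPLE_2\tF\tHan", "", "SAMPLE_3 U Ignore"]
sample_ids, populations = parse_ind_lines(ind_lines)
```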
- get_sequence_length() -> int
Counts and returns the number of SNPs in self._geno_file.
- Returns:
- int
Number of SNPs in self._geno_file.
- Raises:
- PandoraException
If the respective geno file does not exist.
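In an unpacked (text) EIGENSTRAT .geno file, each line holds one SNP, so counting SNPs reduces to counting lines. The sketch below assumes that text layout; `count_snps` is a hypothetical helper and would not apply to packed binary genotype files.

```python
import pathlib
import tempfile


def count_snps(geno_file):
    """Count non-empty lines in an unpacked EIGENSTRAT .geno file (one line per SNP)."""
    with open(geno_file) as handle:
        return sum(1 for line in handle if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    geno_path = pathlib.Path(tmp) / "example.geno"
    geno_path.write_text("0129\n2210\n1111\n")  # three SNPs, four samples
    n_snps = count_snps(geno_path)
```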
- get_windows(result_dir: Path, n_windows: int = 100) -> List[EigenDataset]
Creates n_windows new EigenDataset objects as overlapping sliding windows over self.
Let \(K\) = number of SNPs in self and \(N\) = n_windows. Each dataset will have a window size of \(int(K / N + K / (2N))\) SNPs. The stride is \(int(K / N)\) and the overlap between windows is thus \(int(K / (2N))\) SNPs. Note that the last EigenDataset will contain fewer SNPs, as there is no following window to overlap with. However, due to rounding, the final dataset will not necessarily contain exactly one overlap fewer SNPs.
- Parameters:
- result_dir : pathlib.Path
Directory where to store the resulting dataset files. Each window dataset will be named window_{i} for i in range(n_windows).
- n_windows : int, default=100
Number of windowed datasets to generate.
- Returns:
- windows : List[EigenDataset]
List of n_windows new EigenDataset objects as overlapping windows over self.
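The window arithmetic can be made concrete with a small sketch. It assumes the overlap is meant to be half a stride, i.e. K / (2N), and that the final window is simply clipped at K. How Pandora handles rounding remainders is not specified here, so `window_bounds` is an illustration only.

```python
def window_bounds(n_snps, n_windows):
    """Return (start, end) SNP index pairs for overlapping sliding windows.

    Sketch of the scheme described above: stride = int(K / N),
    window size = int(K / N + K / (2 * N)); windows are clipped at K.
    """
    stride = int(n_snps / n_windows)
    window_size = int(n_snps / n_windows + n_snps / (2 * n_windows))
    bounds = []
    for i in range(n_windows):
        start = i * stride
        end = min(start + window_size, n_snps)  # the last window has no successor
        bounds.append((start, end))
    return bounds


bounds = window_bounds(n_snps=1000, n_windows=10)
```

With K = 1000 and N = 10, the stride is 100 and the window size is 150, so consecutive windows share 50 SNPs and the final window (900 to 1000) holds only 100.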
- remove_input_files() -> None
Removes all three input files (self._ind_file, self._geno_file, self._snp_file).
This is useful if you want to save storage space and don’t need the input files anymore (e.g. for bootstrap replicates).
- Returns:
- None
- run_mds(smartpca: str | Path, n_components: int = 2, result_dir: Path | None = None, redo: bool = False) -> None
Performs MDS analysis using the data provided in this class and assigns its MDS Embedding result to self.mds.
The FST matrix is generated using the EIGENSOFT smartpca tool. The subsequent MDS analysis is performed using the scikit-allel MDS implementation (PCoA).
Please note that since EIGENSOFT is used for the FST computation, all samples with the population set to Ignore will not be used for the MDS computation.
- Parameters:
- smartpca : Executable
Path pointing to an executable of the EIGENSOFT smartpca tool.
- n_components : int, default=2
Number of components to reduce the data to.
- result_dir : pathlib.Path, default=self._file_dir
Directory where to store the data. Calling this method will create two files:
result_dir / (self.name + ".fst"): contains the FST distance matrix.
result_dir / (self.name + ".smartpca.log"): contains the smartpca log.
Default is self._file_dir.
- redo : bool, default=False
Whether to recompute the FST matrix in case the result file is already present.
- Returns:
- None
- Raises:
- PandoraException
If the number of components is >= the number of SNPs in self.input_data.
- RuntimeError
If the smartpca FST distance matrix computation failed.
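For readers unfamiliar with PCoA, the step that turns a population distance matrix into coordinates can be sketched with plain NumPy. This is the textbook classical MDS procedure (double-centering followed by an eigendecomposition), written here as an illustration. Pandora delegates this step to scikit-allel, and the function name `pcoa` and the toy matrix below are assumptions.

```python
import numpy as np


def pcoa(distances, n_components=2):
    """Classical MDS (PCoA) of a symmetric distance matrix.

    Illustrative NumPy sketch, not the scikit-allel implementation.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:n_components]  # largest first
    # Clip tiny negative eigenvalues caused by floating point noise.
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Toy distance matrix between three populations placed on a line at 0, 1, 3.
toy = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
coords = pcoa(toy, n_components=1)
```

Because the toy distances are exactly one-dimensional, the pairwise distances between the returned coordinates reproduce the input matrix. Real FST matrices are generally not exactly Euclidean, which is why the sketch clips negative eigenvalues.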
- run_pca(smartpca: str | Path = 'smartpca', n_components: int = 10, result_dir: Path | None = None, redo: bool = False, smartpca_optional_settings: Dict | None = None) -> None
Runs the EIGENSOFT smartpca on the dataset and assigns its PCA Embedding result to self.pca.
Additional smartpca options can be passed as a dictionary in smartpca_optional_settings, e.g. smartpca_optional_settings = dict(numoutlieriter=0, shrinkmode=True).
- Parameters:
- smartpca : Executable, default="smartpca"
Path pointing to an executable of the EIGENSOFT smartpca tool. The default, 'smartpca', only works if smartpca is installed system-wide.
- n_components : int, default=10
Number of principal components to output.
- result_dir : pathlib.Path, default=self._file_dir
File path pointing to where the results are written.
- redo : bool, default=False
Whether to redo the analysis if all output files are already present and correct.
- smartpca_optional_settings : Dict, default=None
Additional smartpca settings. The following options are not allowed: genotypename, snpname, indivname, evecoutname, evaloutname, numoutevec, maxpops. If not set, the default settings of your smartpca executable are used.
- Returns:
- None
- Raises:
- PandoraException
If the number of principal components is >= the number of SNPs in self.input_data.
If not all input files in EIGENSTRAT format are present (geno, snp, ind files).
- RuntimeError
If the smartpca run failed.
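Since some smartpca options are not allowed in smartpca_optional_settings, it can be convenient to check a settings dictionary before passing it on. The following is a hypothetical pre-check built from the option names listed above. How Pandora itself reacts to a reserved option is not specified in this section.

```python
RESERVED_SMARTPCA_OPTIONS = {
    "genotypename", "snpname", "indivname",
    "evecoutname", "evaloutname", "numoutevec", "maxpops",
}


def reserved_options_used(smartpca_optional_settings):
    """Return the sorted reserved option names present in a settings dict.

    Hypothetical pre-check; None is treated like an empty dict.
    """
    settings = smartpca_optional_settings or {}
    return sorted(RESERVED_SMARTPCA_OPTIONS & set(settings))


ok_settings = dict(numoutlieriter=0, shrinkmode=True)
bad_settings = dict(numoutlieriter=0, numoutevec=20, snpname="other.snp")
```

A caller could raise an error or drop the offending keys whenever `reserved_options_used` returns a non-empty list.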
- class pandora.dataset.NumpyDataset(input_data: ndarray[Any, dtype[_ScalarType_co]], sample_ids: Series, populations: Series, missing_value: float | int = nan, dtype: dtype = numpy.uint8)
Bases: object
Class structure to represent a population genetics dataset in numerical format.
This class provides methods to perform PCA (using scikit-learn) and MDS (using scikit-allel) analyses on the provided numerical data. It also provides methods to generate a bootstrap replicate dataset (SNPs resampled with replacement) and to generate overlapping sliding windows of sub-datasets.
- Parameters:
- input_data : npt.NDArray
Numpy array containing the input data to use.
- sample_ids : pd.Series[str]
Pandas Series containing the sample IDs of the sequences contained in input_data. Expects the number of sample_ids to match the first dimension of input_data.
- populations : pd.Series[str]
Pandas Series containing the populations of the sequences contained in input_data. Expects the number of populations to match the first dimension of input_data. This population annotation can be used to group sequences, for example according to population structure in population genetics datasets or different cell types in gene expression data. These populations are relevant e.g. for MDS analyses using a per-population distance metric or when plotting the results of a PCA/MDS analysis using the plotting.plot_populations method.
- missing_value : Union[float, int], default=np.nan
Value to treat as missing value. All missing values in input_data will be replaced with a special value depending on the specified dtype:
For floating point types: np.nan
For signed integer types: -1
For unsigned integer types: the highest possible integer for the respective type
- dtype : np.dtype, default=np.uint8
Numpy datatype to use to represent the input data. The more complex the data type, the higher its memory footprint. The default is the datatype with the smallest possible memory footprint: np.uint8 (unsigned 8-bit integer). For most genotype data with only 0, 1, 2, and missing as genotype calls, this is the best option.
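The dtype-dependent encoding of missing values can be sketched as follows. Both helpers are hypothetical and only mirror the rule stated above (np.nan for floats, -1 for signed integers, the maximum value for unsigned integers). They are not taken from Pandora's source.

```python
import numpy as np


def missing_marker(dtype):
    """Return the value used to encode missing data for a given dtype.

    Hypothetical helper following the rule described above.
    """
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.floating):
        return np.nan
    if np.issubdtype(dtype, np.signedinteger):
        return -1
    if np.issubdtype(dtype, np.unsignedinteger):
        return np.iinfo(dtype).max
    raise TypeError(f"unsupported dtype: {dtype}")


def encode_missing(data, missing_value, dtype):
    """Replace missing_value in data with the dtype-specific marker."""
    data = np.asarray(data, dtype=float)
    if np.isnan(missing_value):
        mask = np.isnan(data)
    else:
        mask = data == missing_value
    out = np.where(mask, 0, data).astype(dtype)  # cast the valid entries first
    out[mask] = missing_marker(dtype)  # then write the marker
    return out


# Genotype calls with 9 as the missing value, stored as unsigned 8-bit integers.
encoded = encode_missing([[0, 1, 9], [2, 9, 1]], missing_value=9, dtype=np.uint8)
```

With np.uint8 the marker is 255, which cannot collide with the genotype calls 0, 1, and 2.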
- Attributes:
- input_data : npt.NDArray
Numpy array containing the input data to use.
- sample_ids : pd.Series[str]
Pandas Series containing a sample ID for each row in input_data.
- populations : pd.Series[str]
Pandas Series containing a population name for each row in input_data.
- pca : Embedding
PCA Embedding object resulting from a PCA analysis run on the provided dataset. This is None until self.run_pca() has been called.
- mds : Embedding
MDS Embedding object resulting from an MDS computation. This is None until self.run_mds() has been called.
Methods
bootstrap([seed]): Creates a bootstrap dataset based on the content of self.
get_windows([n_windows]): Creates n_windows new NumpyDataset objects as overlapping sliding windows over self.
run_mds([n_components, distance_metric, ...]): Performs MDS analysis using the data provided in this class.
run_pca([n_components, imputation]): Performs PCA analysis on self.input_data, reducing the data to n_components dimensions.
- Raises:
- PandoraException
If the number of samples and number of populations differ (need to provide one population per sample).
- bootstrap(seed: int | None = None) -> NumpyDataset
Creates a bootstrap dataset based on the content of self.
Bootstraps the dataset by resampling SNPs with replacement.
- Parameters:
- seed : Optional[int], default=None
Optional seed to initialize the random number generator before drawing the replicates.
- Returns:
- NumpyDataset
A new dataset object containing the bootstrapped input_data.
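On a (samples x SNPs) array, resampling SNPs with replacement means redrawing column indices. A minimal NumPy sketch, assuming SNPs are stored as columns and using np.random.default_rng as the generator (an assumption, not necessarily what Pandora uses):

```python
import numpy as np


def bootstrap_snps(input_data, seed=None):
    """Resample the SNP columns of a (samples x SNPs) array with replacement.

    Illustrative sketch; returns the replicate and the drawn column indices.
    """
    data = np.asarray(input_data)
    rng = np.random.default_rng(seed)
    n_snps = data.shape[1]
    columns = rng.integers(0, n_snps, size=n_snps)  # indices drawn with replacement
    return data[:, columns], columns


genotypes = np.array([[0, 1, 2, 0], [2, 2, 1, 0], [1, 0, 0, 2]], dtype=np.uint8)
replicate, columns = bootstrap_snps(genotypes, seed=0)
```

The replicate has the same shape as the input, and passing the same seed yields the same replicate.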
- get_windows(n_windows: int = 100) -> List[NumpyDataset]
Creates n_windows new NumpyDataset objects as overlapping sliding windows over self.
Let \(K\) = number of SNPs in self and \(N\) = n_windows. Each dataset will have a window size of \(int(K / N + K / (2N))\) SNPs. The stride is \(int(K / N)\) and the overlap between windows is thus \(int(K / (2N))\) SNPs. Note that the last NumpyDataset will contain fewer SNPs, as there is no following window to overlap with. However, due to rounding, the final dataset will not necessarily contain exactly one overlap fewer SNPs.
- Parameters:
- n_windows : int, default=100
Number of windowed datasets to generate.
- Returns:
- windows : List[NumpyDataset]
List of n_windows new NumpyDataset objects as overlapping windows over self.
- run_mds(n_components: int = 2, distance_metric: Callable[[ndarray[Any, dtype[_ScalarType_co]], Series, str], Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]] = euclidean_sample_distance, imputation: str | None = 'mean') -> None
Performs MDS analysis using the data provided in this class.
The distance matrix is generated using the provided distance_metric callable. The subsequent MDS analysis is performed using the scikit-allel MDS implementation (PCoA). The result of the MDS analysis is an Embedding object assigned to self.mds.
- Parameters:
- n_components : int, default=2
Number of components to reduce the data to.
- distance_metric : Callable[[npt.NDArray, pd.Series, str], Tuple[npt.NDArray, pd.Series]], default=euclidean_sample_distance
Distance metric to use for computing the distance matrix input for MDS. This is expected to be a function that receives the numpy array of sequences, the population for each sequence, and the imputation method as input, and outputs the distance matrix and the respective populations for each row. The resulting distance matrix is of size \((n, m)\) and the resulting populations is expected to be of size \((n, 1)\). Default is distance_metrics.euclidean_sample_distance (the pairwise Euclidean distance of all samples).
- imputation : Optional[str], default="mean"
Imputation method to use. Available options are:
mean: Imputes missing values with the average of the respective SNP.
remove: Removes all SNPs with at least one missing value.
None: Does not impute missing data.
Note that depending on the distance_metric, not all imputation methods are supported. See the respective documentation in the distance_metrics module.
- Returns:
- None
- Raises:
- PandoraException
If the distance_metric did not return one population per row in the returned distance matrix.
If the number of components is >= the number of rows in the distance matrix.
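A custom distance_metric only has to follow the callable shape described above. The sketch below defines a hypothetical Manhattan-distance metric. It is not part of Pandora's distance_metrics module, it only handles mean imputation, and it uses a plain list in place of a pd.Series to stay dependency-free.

```python
import numpy as np


def manhattan_sample_distance(input_data, populations, imputation):
    """Pairwise Manhattan distance between samples (hypothetical metric).

    Takes the data, the per-sample populations, and the imputation method;
    returns the distance matrix and one population per row.
    """
    data = np.asarray(input_data, dtype=float)
    if imputation == "mean":
        snp_means = np.nanmean(data, axis=0)
        data = np.where(np.isnan(data), snp_means, data)
    elif np.isnan(data).any():
        raise ValueError("missing values require imputation='mean' in this sketch")
    # Broadcast to (n, n, n_snps) and sum absolute differences over SNPs.
    distances = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)
    return distances, populations


data = np.array([[0.0, 2.0, np.nan], [1.0, 2.0, 2.0], [2.0, 0.0, 0.0]])
labels = ["popA", "popA", "popB"]  # a list stands in for pd.Series here
distances, row_populations = manhattan_sample_distance(data, labels, "mean")
```

Because this metric is per-sample, it returns the populations unchanged, so there is exactly one population per row of the distance matrix.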
- run_pca(n_components: int = 10, imputation: str | None = 'mean') -> None
Performs PCA analysis on self.input_data, reducing the data to n_components dimensions.
Uses the scikit-learn PCA implementation. The result of the PCA analysis is an Embedding object assigned to self.pca.
- Parameters:
- n_components : int, default=10
Number of components to reduce the data to.
- imputation : Optional[str], default="mean"
Imputation method to use. Available options for PCA are:
mean: Imputes missing values with the average of the respective SNP.
remove: Removes all SNPs with at least one missing value.
None: Does not impute missing data. Note that this option is only valid if self.input_data does not contain NaN values.
- Returns:
- None
- Raises:
- PandoraException
If the number of principal components is >= the number of SNPs in self.input_data.
If imputation is None but self.input_data contains NaN values.
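The three imputation options can be illustrated on a small float array. This is a sketch of the behaviour described above, assuming missing calls are encoded as NaN and SNPs are columns. `impute` is a hypothetical helper, and it raises a ValueError where Pandora would raise a PandoraException.

```python
import numpy as np


def impute(input_data, imputation):
    """Apply the 'mean', 'remove', or None imputation described above.

    Illustrative sketch operating on a float (samples x SNPs) array.
    """
    data = np.asarray(input_data, dtype=float)
    missing = np.isnan(data)
    if imputation == "mean":
        # Replace each missing call with the mean of its SNP (column).
        return np.where(missing, np.nanmean(data, axis=0), data)
    if imputation == "remove":
        # Drop every SNP that has at least one missing call.
        return data[:, ~missing.any(axis=0)]
    if imputation is None:
        if missing.any():
            raise ValueError("imputation=None requires data without NaN values")
        return data
    raise ValueError(f"unknown imputation method: {imputation}")


calls = np.array([[0.0, 1.0, np.nan], [2.0, 1.0, 2.0], [1.0, np.nan, 0.0]])
mean_imputed = impute(calls, "mean")
snps_removed = impute(calls, "remove")
```

In this toy array only the first SNP is complete, so "remove" keeps a single column, while "mean" fills both gaps with the column mean of 1.0.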
- pandora.dataset.numpy_dataset_from_eigenfiles(eigen_prefix: Path) -> NumpyDataset
Loads a genotype dataset in EIGENSTRAT format as NumpyDataset.
This method only requires the genotype and individual files, as the SNP metadata in the .snp file is not used. Note that the dataset needs to be in EIGENSTRAT format, and the files need to share the same file prefix and have the file endings .geno and .ind for the respective file type.
- Parameters:
- eigen_prefix : pathlib.Path
File path prefix pointing to the ind and geno genotype files in EIGENSTRAT format. This method assumes that all EIGEN files have the same prefix and have the file endings .geno and .ind. (Note that the .snp file is not required.)
- Returns:
- NumpyDataset
NumpyDataset containing the genotype data provided in the EIGEN files located at eigen_prefix.
- Raises:
- PandoraException
If not all input files (.geno, .ind, and .snp) for the given eigen_prefix exist.
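To show how genotype text maps onto a numerical array, the sketch below converts unpacked EIGENSTRAT .geno lines into a (samples x SNPs) matrix. It assumes one line per SNP, one character per sample, and 9 as the missing-call code. `geno_lines_to_array` is a hypothetical helper and not the loader's actual implementation.

```python
import numpy as np


def geno_lines_to_array(geno_lines):
    """Convert unpacked EIGENSTRAT .geno lines to a (samples x SNPs) float array.

    Assumes one line per SNP and one character per sample, with 9 marking a
    missing call; missing calls become np.nan.
    """
    rows = [
        [int(char) for char in line.strip()]
        for line in geno_lines
        if line.strip()
    ]
    snps_by_samples = np.array(rows, dtype=float)
    snps_by_samples[snps_by_samples == 9] = np.nan
    return snps_by_samples.T  # transpose so that each row is one sample


geno_lines = ["0129", "2210", "1111"]  # three SNPs, four samples
genotype_matrix = geno_lines_to_array(geno_lines)
```

The transpose matters: the file is SNP-major, whereas NumpyDataset expects the first dimension of input_data to match the number of samples.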