pandora.distance_metrics module

pandora.distance_metrics.euclidean_population_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise Euclidean distances between all unique populations.

Parameters:

input_data (npt.NDArray)

NumPy array containing the genetic input data to use.

populations (pd.Series[str])

pandas Series containing a population name for each row in input_data.

imputation (Optional[str])

Imputation method to use. Available options are:

"mean": Imputes missing values with the average of the respective SNP.
"remove": Removes all SNPs with at least one missing value.
None: No imputation. This option is only valid if input_data does not contain NaN values.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise Euclidean distances between all unique populations. The array is of shape (n_unique_populations, n_unique_populations).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. The values of this series are the unique populations.

Raises:

PandoraException: If imputation is None but input_data contains NaN values.
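The three imputation options can be illustrated with plain NumPy. This is a minimal sketch of the documented behavior, not the library's implementation: the helper name `impute` is hypothetical, and it raises a `ValueError` where the documented functions raise a `PandoraException`.

```python
import numpy as np


def impute(data, method):
    """Hypothetical helper mirroring the three documented imputation options."""
    data = np.asarray(data, dtype=float)
    if method == "mean":
        # Replace each NaN with the mean of the observed values of its column (SNP).
        column_means = np.nanmean(data, axis=0)
        return np.where(np.isnan(data), column_means, data)
    if method == "remove":
        # Keep only the columns (SNPs) that contain no missing value at all.
        return data[:, ~np.isnan(data).any(axis=0)]
    if method is None:
        # Without imputation, the data must already be free of NaN values.
        if np.isnan(data).any():
            raise ValueError("imputation=None requires data without NaN values")
        return data
    raise ValueError(f"unknown imputation method: {method!r}")


data = np.array([
    [0.0, 1.0, np.nan],
    [2.0, 1.0, 2.0],
    [1.0, np.nan, 0.0],
])
mean_imputed = impute(data, "mean")
removed = impute(data, "remove")
```

With "mean", both missing entries are filled with the mean of their column; with "remove", only the first column survives because it is the only SNP without a missing value.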

pandora.distance_metrics.euclidean_sample_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise Euclidean distances between all samples (rows) in input_data.

Parameters:

input_data (npt.NDArray)

NumPy array containing the genetic input data to use.

populations (pd.Series[str])

pandas Series containing a population name for each row in input_data. Note that this population information is not used for the distance computation, as the distance is computed per sample. The parameter is only required to provide a uniform interface for per-sample and per-population distances.

imputation (Optional[str])

Imputation method to use. Available options are:

"mean": Imputes missing values with the average of the respective SNP.
"remove": Removes all SNPs with at least one missing value.
None: No imputation. This option is only valid if input_data does not contain NaN values.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise Euclidean distances between all samples. The array is of shape (n_samples, n_samples).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. This is identical to the passed series of populations since this information is not used for distance computation.

Raises:

PandoraException: If imputation is None but input_data contains NaN values.
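A pairwise Euclidean distance matrix of shape (n_samples, n_samples) can be computed with NumPy broadcasting. The following is a sketch under the assumption that the input contains no missing values; the function name `pairwise_euclidean` is hypothetical and not part of pandora.

```python
import numpy as np


def pairwise_euclidean(data):
    """Hypothetical helper: Euclidean distance between every pair of rows."""
    data = np.asarray(data, dtype=float)
    # Broadcasting yields an array of shape (n_samples, n_samples, n_snps)
    # holding the per-SNP difference for every pair of samples.
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


data = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
])
distances = pairwise_euclidean(data)
```

The result is symmetric with zeros on the diagonal, and each row of the matrix corresponds to the same row of the input data.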

pandora.distance_metrics.fst_population_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise FST distances between all unique populations.

Parameters:

input_data (npt.NDArray): NumPy array containing the genetic input data to use.
populations (pd.Series[str]): pandas Series containing a population name for each row in input_data.
imputation (Optional[str]): Imputation method to use. For the FST population distance, only imputation=None is supported.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise FST distances between all unique populations. The array is of shape (n_unique_populations, n_unique_populations).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. The values of this series are the unique populations.

pandora.distance_metrics.hamming_sample_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise Hamming distances between all samples (rows) in input_data.

Parameters:

input_data (npt.NDArray): NumPy array containing the genetic input data to use.
populations (pd.Series[str]): pandas Series containing a population name for each row in input_data. Note that this population information is not used for the distance computation, as the distance is computed per sample. The parameter is only required to provide a uniform interface for per-sample and per-population distances.
imputation (Optional[str]): Imputation method to use. This parameter is not used and exists only for compatibility with the interface required by the dataset.run_mds method.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise Hamming distances between all samples. The array is of shape (n_samples, n_samples).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. This is identical to the passed series of populations since this information is not used for distance computation.
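The Hamming distance between two samples compares their genotypes position by position. The sketch below assumes the normalized variant (fraction of differing SNPs, rather than their count) and complete data; the documentation above does not state which variant the library uses, and the name `pairwise_hamming` is hypothetical.

```python
import numpy as np


def pairwise_hamming(data):
    """Hypothetical helper: fraction of differing positions for every pair of rows."""
    data = np.asarray(data)
    # Compare every pair of rows element-wise, then average the mismatches per pair.
    mismatches = data[:, None, :] != data[None, :, :]
    return mismatches.mean(axis=-1)


genotypes = np.array([
    [0, 1, 2, 0],
    [0, 1, 1, 0],
    [2, 1, 0, 0],
])
hamming = pairwise_hamming(genotypes)
```

In this toy input, the first two samples differ at one of four SNPs, while the third sample differs from each of the others at two of four SNPs.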

pandora.distance_metrics.manhattan_population_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise Manhattan distances between all unique populations.

Parameters:

input_data (npt.NDArray)

NumPy array containing the genetic input data to use.

populations (pd.Series[str])

pandas Series containing a population name for each row in input_data.

imputation (Optional[str])

Imputation method to use. Available options are:

"mean": Imputes missing values with the average of the respective SNP.
"remove": Removes all SNPs with at least one missing value.
None: No imputation. This option is only valid if input_data does not contain NaN values.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise Manhattan distances between all unique populations. The array is of shape (n_unique_populations, n_unique_populations).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. The values of this series are the unique populations.

Raises:

PandoraException: If imputation is None but input_data contains NaN values.

pandora.distance_metrics.manhattan_sample_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise Manhattan distances between all samples (rows) in input_data.

Parameters:

input_data (npt.NDArray)

NumPy array containing the genetic input data to use.

populations (pd.Series[str])

pandas Series containing a population name for each row in input_data. Note that this population information is not used for the distance computation, as the distance is computed per sample. The parameter is only required to provide a uniform interface for per-sample and per-population distances.

imputation (Optional[str])

Imputation method to use. Available options are:

"mean": Imputes missing values with the average of the respective SNP.
"remove": Removes all SNPs with at least one missing value.
None: No imputation. This option is only valid if input_data does not contain NaN values.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise Manhattan distances between all samples. The array is of shape (n_samples, n_samples).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. This is identical to the passed series of populations since this information is not used for distance computation.

Raises:

PandoraException: If imputation is None but input_data contains NaN values.
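The Manhattan distance sums the absolute per-SNP differences between two samples. As a sketch only, assuming complete data and using a hypothetical helper name (`pairwise_manhattan`), it can be written as:

```python
import numpy as np


def pairwise_manhattan(data):
    """Hypothetical helper: sum of absolute differences for every pair of rows."""
    data = np.asarray(data, dtype=float)
    # Absolute per-SNP difference for every pair of samples, summed over the SNPs.
    abs_diff = np.abs(data[:, None, :] - data[None, :, :])
    return abs_diff.sum(axis=-1)


data = np.array([
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])
manhattan = pairwise_manhattan(data)
```

The first two samples differ by 2 at the first and last SNP, giving a distance of 4; the third sample lies halfway between them, at distance 2 from each.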

pandora.distance_metrics.missing_corrected_hamming_sample_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, imputation: str | None) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise Hamming distances between all samples (rows) in input_data. Compared to hamming_sample_distance, this method additionally corrects for missing data (see Notes below).

Parameters:

input_data (npt.NDArray): NumPy array containing the genetic input data to use.
populations (pd.Series[str]): pandas Series containing a population name for each row in input_data. Note that this population information is not used for the distance computation, as the distance is computed per sample. The parameter is only required to provide a uniform interface for per-sample and per-population distances.
imputation (Optional[str]): Imputation method to use. This parameter is not used and exists only for compatibility with the interface required by the dataset.run_mds method.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise Hamming distances between all samples. The array is of shape (n_samples, n_samples).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. This is identical to the passed series of populations since this information is not used for distance computation.

Notes

Instead of the plain Hamming distance \(d(i, j)\) between two samples \(i\) and \(j\), this metric corrects for the fraction of missing data in both samples (\(m_i\), \(m_j\)). For the correction, missing values are only considered if they are missing in either of the two samples, but not in both. We denote the fraction of data missing in both samples as \(m_{i,j}\). Thus, the missing-corrected Hamming distance \(d_m(i, j)\) is computed as:

\[d_m(i, j) = \frac{d(i, j)}{m_i + m_j - m_{i, j}}\]

Note that this distance metric corresponds to the PLINK --distance 'flat-missing' computation.
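The quantities in the formula above can be made concrete with a small example. The sketch below assumes that missing values are encoded as NaN and takes the plain distance \(d(i, j)\) as a given number, since these notes do not specify how it is computed in the presence of missing data; the helper names are hypothetical. By inclusion-exclusion, the denominator \(m_i + m_j - m_{i,j}\) equals the fraction of SNPs missing in at least one of the two samples.

```python
import numpy as np


def missing_fractions(sample_i, sample_j):
    """Hypothetical helper returning (m_i, m_j, m_ij) for two samples."""
    missing_i = np.isnan(sample_i)
    missing_j = np.isnan(sample_j)
    m_i = missing_i.mean()                 # fraction missing in sample i
    m_j = missing_j.mean()                 # fraction missing in sample j
    m_ij = (missing_i & missing_j).mean()  # fraction missing in both samples
    return m_i, m_j, m_ij


def corrected_distance(d_ij, m_i, m_j, m_ij):
    """Applies the formula exactly as stated in the notes above."""
    return d_ij / (m_i + m_j - m_ij)


sample_i = np.array([0.0, np.nan, 2.0, np.nan])
sample_j = np.array([0.0, np.nan, np.nan, 1.0])
m_i, m_j, m_ij = missing_fractions(sample_i, sample_j)
denominator = m_i + m_j - m_ij

# d(i, j) = 0.25 is an arbitrary illustrative value, not derived from the samples.
d_m = corrected_distance(0.25, m_i, m_j, m_ij)
```

Here \(m_i = m_j = 0.5\) and \(m_{i,j} = 0.25\), so the denominator is 0.75. Note that the expression is undefined when neither sample has any missing value.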

pandora.distance_metrics.population_distance(input_data: ndarray[Any, dtype[_ScalarType_co]], populations: Series, distance_metric: Callable[[ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]]], ndarray[Any, dtype[_ScalarType_co]]]) → Tuple[ndarray[Any, dtype[_ScalarType_co]], Series]

Computes and returns the distance matrix of pairwise distances between all unique populations using the provided distance metric.

Parameters:

input_data (npt.NDArray): NumPy array containing the genetic input data to use.
populations (pd.Series[str]): pandas Series containing a population name for each row in input_data.
distance_metric (Callable[[npt.NDArray, npt.NDArray], npt.NDArray]): Distance metric function to use for the pairwise population distance computation. Needs to be a callable that takes two NumPy arrays as input (each array contains the data of all samples in input_data for one specific population) and returns the pairwise distances between the samples of both populations.

Returns:

distance_matrix (npt.NDArray): Distance matrix of pairwise distances between all unique populations. The array is of shape (n_unique_populations, n_unique_populations).
populations (pd.Series): pandas Series containing a population name for each row in the distance matrix. The values of this series are the unique populations.
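How a per-population matrix can be built from a metric callable is sketched below. Several details are assumptions rather than documented behavior: the between-population sample distances are aggregated by their mean, labels are kept in order of first appearance, and a NumPy array of labels stands in for the pandas Series. The names `population_distance_sketch` and `euclidean_between` are hypothetical.

```python
import itertools

import numpy as np


def population_distance_sketch(data, populations, distance_metric):
    """Hypothetical sketch: one distance per pair of unique populations."""
    data = np.asarray(data, dtype=float)
    populations = np.asarray(populations)
    # Unique population labels, in order of first appearance.
    _, first_index = np.unique(populations, return_index=True)
    unique_populations = populations[np.sort(first_index)]

    n = len(unique_populations)
    matrix = np.zeros((n, n))
    pairs = itertools.combinations(enumerate(unique_populations), 2)
    for (a, pop_a), (b, pop_b) in pairs:
        samples_a = data[populations == pop_a]
        samples_b = data[populations == pop_b]
        # Assumption: summarize all between-population sample distances by their mean.
        value = distance_metric(samples_a, samples_b).mean()
        matrix[a, b] = matrix[b, a] = value
    return matrix, unique_populations


def euclidean_between(samples_a, samples_b):
    """Euclidean distances between every sample of one population and every sample of the other."""
    diff = samples_a[:, None, :] - samples_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


data = np.array([
    [0.0, 0.0],
    [0.0, 2.0],
    [3.0, 4.0],
])
populations = ["A", "A", "B"]
matrix, labels = population_distance_sketch(data, populations, euclidean_between)
```

The callable receives the rows of two populations and returns all cross-population sample distances; the sketch reduces them to a single entry of the (n_unique_populations, n_unique_populations) matrix.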