pandora.embedding module

class pandora.embedding.Embedding(embedding: DataFrame, n_components: int)

Bases: object

Base wrapper class for the result of a PCA or MDS embedding.

Methods

cluster([kmeans_k])

Fits K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.

get_optimal_kmeans_k([k_boundaries])

Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).

cluster(kmeans_k: int = None) → KMeans

Fits K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.

Parameters:
kmeans_k : int

Number of clusters. If not set, the optimal number of clusters is determined automatically.

Returns:
KMeans

Scikit-learn KMeans object that is fitted to self.embedding.
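
As a rough illustration of what fitting K-Means to an embedding involves, the sketch below builds a small dataframe in the column layout Pandora expects and fits scikit-learn's KMeans to the D{i} columns directly. It does not call Pandora; the toy values, the fixed k of 2, and the n_init and random_state settings are assumptions made for this example only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy embedding in the documented layout: sample_id, population, D0..D{n-1}.
embedding = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
        "population": ["A", "A", "A", "B", "B", "B"],
        "D0": [0.0, 0.1, -0.1, 5.0, 5.1, 4.9],
        "D1": [0.0, -0.1, 0.1, 5.0, 4.9, 5.1],
    }
)
n_components = 2

# Extract the (n_samples, n_components) matrix from the D{i} columns.
embedding_matrix = embedding[[f"D{i}" for i in range(n_components)]].to_numpy()

# Fit K-Means with a fixed k; n_init and random_state are illustrative choices.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embedding_matrix)
labels = kmeans.labels_
```

The fitted object exposes the usual scikit-learn attributes, such as labels_ and cluster_centers_.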

get_optimal_kmeans_k(k_boundaries: Tuple[int, int] = None) → int

Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).

Parameters:
k_boundaries : Tuple[int, int], default=None

Minimum and maximum number of clusters to consider, given as (min_k, max_k). If None, the boundaries are determined automatically: if self.embedding.populations is not identical for all samples, max_k is the number of distinct populations; otherwise max_k is the square root of the number of samples. The minimum min_k is then min(max_k, 3).

Returns:
int

The optimal number of clusters between min_k and max_k.
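
The automatic boundary rule described for k_boundaries can be sketched as follows. The helper default_k_boundaries is hypothetical and not part of Pandora, and rounding the square root down to an integer is an assumption, since the text above does not say how the value is rounded.

```python
import math
from typing import Sequence, Tuple


def default_k_boundaries(populations: Sequence[str]) -> Tuple[int, int]:
    """Hypothetical helper mirroring the documented rule for (min_k, max_k)."""
    n_distinct = len(set(populations))
    if n_distinct > 1:
        # Populations differ between samples: use the number of distinct ones.
        max_k = n_distinct
    else:
        # All samples share one population: use sqrt(n_samples).
        # Rounding down is an assumption of this sketch.
        max_k = math.isqrt(len(populations))
    min_k = min(max_k, 3)
    return min_k, max_k


bounds_mixed = default_k_boundaries(["A", "B", "A", "C", "D"])
bounds_single = default_k_boundaries(["A"] * 16)
```

With four distinct populations the sketch yields (3, 4); with 16 samples from a single population it also yields (3, 4), because the square root of 16 is 4.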

class pandora.embedding.MDS(embedding: DataFrame, n_components: int, stress: float)

Bases: Embedding

Initializes a new MDS object.

Parameters:
embedding : pd.DataFrame

Pandas dataframe containing the sample ID, population and embedding vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:

  • sample_id (str): ID of the respective sample.

  • population (str): Name of the respective population.

  • D{i} for i in range(n_components) (float): data for the i-th embedding dimension for each sample, 0-indexed, so the first dimension corresponds to column D0

n_components : int

Number of components the data was fitted for.

stress : float

Stress of the fitted MDS.

Raises:
PandoraException
  • If embedding does not contain a “sample_id” column.

  • If embedding does not contain a “population” column.

  • If embedding does not contain (the correct amount of) D{i} columns.

Attributes:
embedding : pd.DataFrame

Pandas dataframe containing the sample ID, population and embedding vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:

  • sample_id (str): ID of the respective sample.

  • population (str): Name of the respective population.

  • D{i} for i in range(n_components) (float): data for the i-th embedding dimension for each sample, 0-indexed, so the first dimension corresponds to column D0

n_components : int

Number of components the data was fitted for.

stress : float

Stress of the fitted MDS.

embedding_matrix : npt.NDArray[float]

Numpy ndarray of shape (n_samples, n_components) containing the MDS result matrix.

sample_ids : pd.Series[str]

Pandas series containing the IDs of all samples.

populations : pd.Series[str]

Pandas series containing the population for each sample in sample_ids.

Methods

cluster([kmeans_k])

Fits K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.

get_optimal_kmeans_k([k_boundaries])

Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
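
To make the expected dataframe layout concrete, the following sketch builds a small embedding with the documented columns and checks it before extracting the embedding matrix. The helper check_embedding_columns is hypothetical; it only mimics the conditions listed under Raises and uses ValueError as a stand-in for PandoraException.

```python
import pandas as pd


def check_embedding_columns(embedding: pd.DataFrame, n_components: int):
    """Hypothetical validation mirroring the documented column requirements."""
    for required in ("sample_id", "population"):
        if required not in embedding.columns:
            raise ValueError(f"Column '{required}' is missing.")
    expected = {f"D{i}" for i in range(n_components)}
    actual = set(embedding.columns) - {"sample_id", "population"}
    if actual != expected:
        raise ValueError("Expected exactly the columns D0..D{n_components - 1}.")
    # Return the (n_samples, n_components) matrix in component order.
    return embedding[[f"D{i}" for i in range(n_components)]].to_numpy()


mds_data = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "population": ["A", "A", "B"],
        "D0": [0.12, 0.10, -0.31],
        "D1": [0.40, 0.38, -0.05],
    }
)
matrix = check_embedding_columns(mds_data, n_components=2)
```

The columns are 0-indexed, so a two-component embedding uses D0 and D1.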

class pandora.embedding.PCA(embedding: DataFrame, n_components: int, explained_variances: ndarray[Any, dtype[float]])

Bases: Embedding

Wrapper class encapsulating PCA results.

Parameters:
embedding : pd.DataFrame

Pandas dataframe containing the sample ID, population and PC-Vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:

  • sample_id (str): ID of the respective sample.

  • population (str): Name of the respective population.

  • D{i} for i in range(n_components) (float): data for the i-th PC for each sample, 0-indexed, so the first PC corresponds to column D0

n_components : int

Number of principal components corresponding to the PCA data.

explained_variances : npt.NDArray[float]

Numpy ndarray containing the explained variances for each PC (shape=(n_components,)).

Raises:
PandoraException
  • If explained_variances is not a 1D numpy array or contains more/fewer values than n_components.

  • If embedding does not contain a “sample_id” column.

  • If embedding does not contain a “population” column.

  • If embedding does not contain (the correct amount of) D{i} columns.

Attributes:
embedding : pd.DataFrame

Pandas dataframe with shape (n_samples, n_components + 2) that contains the PCA results. The dataframe contains one row per sample and has the following columns:

  • sample_id (str): ID of the respective sample.

  • population (str): Name of the respective population.

  • D{i} for i in range(n_components) (float): data for the i-th PC for each sample, 0-indexed, so the first PC corresponds to column D0.

explained_variances : npt.NDArray[float]

Numpy ndarray containing the explained variances for each PC (shape=(n_components,)).

n_components : int

Number of principal components.

embedding_matrix : npt.NDArray[float]

Numpy ndarray of shape (n_samples, n_components) containing the PCA result matrix.

sample_ids : pd.Series[str]

Pandas series containing the IDs of all samples.

populations : pd.Series[str]

Pandas series containing the population for each sample in sample_ids.

Methods

cluster([kmeans_k])

Fits K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.

get_optimal_kmeans_k([k_boundaries])

Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
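
The shape requirement on explained_variances can be illustrated with a short check. The helper check_explained_variances is hypothetical and only restates the first condition listed under Raises; ValueError stands in for PandoraException.

```python
import numpy as np


def check_explained_variances(explained_variances: np.ndarray, n_components: int) -> None:
    """Hypothetical check: 1D array with exactly one value per component."""
    if explained_variances.ndim != 1:
        raise ValueError("explained_variances must be a 1D numpy array.")
    if explained_variances.shape[0] != n_components:
        raise ValueError("Expected one explained variance per component.")


# Illustrative values only: one explained variance per PC, shape (n_components,).
variances = np.array([0.42, 0.17])
check_explained_variances(variances, n_components=2)
```

A 2D array, or an array whose length differs from n_components, would be rejected by this sketch.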

pandora.embedding.check_smartpca_results(evec: Path, eval: Path) → None

Checks whether the smartpca run finished properly and whether its result files contain all required information.

Parameters:
evec : pathlib.Path

Filepath pointing to a .evec result file of a smartpca run.

eval : pathlib.Path

Filepath pointing to a .eval result file of a smartpca run.

Returns:
None
Raises:
PandoraException

If either the evec file or the eval file is incorrect.

pandora.embedding.from_sklearn_mds(embedding: DataFrame, sample_ids: Series, populations: Series, stress: float) → MDS

Creates a new MDS object based on an MDS embedding pandas dataframe.

Note that embedding is expected to have a column entitled population. This is needed because the input distance matrices for MDS may be summary statistics over all samples of one population. The resulting MDS object, however, duplicates the results for each sample given in sample_ids to match the original input data.

Parameters:
embedding : pd.DataFrame

MDS embedding data as pandas DataFrame. Each row corresponds to a single sample or population. The embedding is expected to have a column entitled "population" denoting the respective population of the row.

sample_ids : pd.Series

Pandas Series containing the IDs of the samples the embedding data is for. Note that the number of sample IDs can be larger than the number of rows in the embedding. This is the case if the embedding was computed per population but the data should be mapped to each sample. The number of sample IDs needs to match the number of entries in populations.

populations : pd.Series

Pandas Series containing the population for each sample in sample_ids. The number of entries needs to match the number of sample IDs.

stress : float

Goodness of the MDS fit for the data.

Returns:
MDS

MDS object encapsulating the MDS data

Raises:
PandoraException
  • If embedding does not contain a "population" column.

  • If the number of samples and number of populations are not identical. Exactly one population is required for each sample.
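
The duplication of per-population results for every sample can be pictured with a pandas merge. This is only a sketch of the idea, not Pandora's implementation: the helper expand_to_samples is hypothetical, ValueError stands in for PandoraException, and the use of a left merge is an assumption chosen to keep the order of sample_ids.

```python
import pandas as pd


def expand_to_samples(
    embedding: pd.DataFrame, sample_ids: pd.Series, populations: pd.Series
) -> pd.DataFrame:
    """Hypothetical helper: map per-population embedding rows to each sample."""
    if "population" not in embedding.columns:
        raise ValueError("The embedding needs a 'population' column.")
    if len(sample_ids) != len(populations):
        raise ValueError("Exactly one population is required for each sample.")
    samples = pd.DataFrame(
        {"sample_id": sample_ids.to_numpy(), "population": populations.to_numpy()}
    )
    # A left merge keeps one row per sample, in the order of sample_ids.
    return samples.merge(embedding, on="population", how="left")


# One embedding row per population (illustrative values).
per_population = pd.DataFrame(
    {"population": ["A", "B"], "D0": [0.1, -0.3], "D1": [0.5, 0.2]}
)
sample_ids = pd.Series(["s1", "s2", "s3"])
populations = pd.Series(["A", "A", "B"])

per_sample = expand_to_samples(per_population, sample_ids, populations)
```

Here samples s1 and s2 both receive the embedding vector of population A, so the result has one row per sample and the columns sample_id, population, D0 and D1.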

pandora.embedding.from_smartpca(evec: Path, eval: Path) → PCA

Creates a PCA object based on the results of a smartpca run.

Parameters:
evec : pathlib.Path

Filepath pointing to a .evec result file of a smartpca run.

eval : pathlib.Path

Filepath pointing to a .eval result file of a smartpca run.

Returns:
PCA

PCA object of the results of the respective smartpca run.

Raises:
PandoraException

If either the evec file or the eval file is incorrect.
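
For orientation, the sketch below shows how the rows of a .evec file might map onto the sample_id, population and D{i} columns. It assumes the commonly seen layout of a header line starting with #eigvals: followed by one whitespace-separated row per sample (sample ID, PC values, population). The function parse_evec_text is hypothetical, the values are made up for illustration, and none of the checks performed by from_smartpca or check_smartpca_results are reproduced here.

```python
from typing import Dict, List, Union

# Made-up example content in the assumed .evec layout.
EXAMPLE_EVEC = """\
    #eigvals:     4.060     2.137
      sample1     0.0123    -0.0456    PopA
      sample2    -0.0789     0.0321    PopB
"""


def parse_evec_text(text: str) -> List[Dict[str, Union[str, float]]]:
    """Hypothetical parser: one record per sample with D0..D{n-1} columns."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    if not header.lstrip().startswith("#eigvals:"):
        raise ValueError("Expected a header line starting with '#eigvals:'.")
    records = []
    for row in rows:
        # First field is the sample ID, last is the population, rest are PCs.
        sample_id, *pcs, population = row.split()
        record = {"sample_id": sample_id, "population": population}
        record.update({f"D{i}": float(value) for i, value in enumerate(pcs)})
        records.append(record)
    return records


records = parse_evec_text(EXAMPLE_EVEC)
```

Each record can then be turned into one dataframe row, with the first PC stored in column D0.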