pandora.embedding module

class pandora.embedding.Embedding(embedding: DataFrame, n_components: int, explained_variances: ndarray[Any, dtype[float]])[source]

Bases: object

Class structure encapsulating PCA or MDS results.

Parameters:

embedding (pd.DataFrame)

Pandas dataframe containing the sample ID, population and embedding vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:

sample_id (str): ID of the respective sample.

population (str): Name of the respective population.

D{i} for i in range(n_components) (float): data for the i-th embedding vector for each sample, 0-indexed, so the first embedding vector corresponds to column D0

n_components (int)

Number of components in the embedding data.

explained_variances (npt.NDArray[float])

Numpy ndarray containing the explained variances for each embedding vector (shape=(n_components,))

Attributes:

embedding (pd.DataFrame)

Pandas dataframe with shape (n_samples, n_components + 2) that contains the embedding results. The dataframe contains one row per sample and has the following columns:

sample_id (str): ID of the respective sample.

population (str): Name of the respective population.

D{i} for i in range(n_components) (float): data for the i-th embedding vector for each sample, 0-indexed, so the first embedding vector corresponds to column D0.

explained_variances (npt.NDArray[float])

Numpy ndarray containing the explained variances for each embedding vector (shape=(n_components,))

n_components (int)

Number of components.

embedding_matrix (npt.NDArray[float])

Numpy ndarray of shape (n_samples, n_components) containing the embedding matrix.

sample_ids (pd.Series[str])

Pandas series containing the IDs of all samples.

populations (pd.Series[str])

Pandas series containing the population for each sample in sample_ids.
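The relationship between the attributes can be illustrated with a small sketch. It does not import pandora; it only builds a toy dataframe with the documented columns and derives the other attributes from it. The derivation shown (selecting the D{i} columns for embedding_matrix) is an assumption based on the attribute descriptions above, not a copy of the class internals.

```python
import numpy as np
import pandas as pd

n_components = 2

# Toy input with the documented columns: sample_id, population, D0, D1.
embedding = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "population": ["popA", "popA", "popB"],
        "D0": [0.12, 0.10, -0.31],
        "D1": [-0.05, 0.02, 0.27],
    }
)

# Assumed derivation of the remaining attributes from the dataframe columns.
component_columns = [f"D{i}" for i in range(n_components)]
embedding_matrix = embedding[component_columns].to_numpy()
sample_ids = embedding["sample_id"]
populations = embedding["population"]

print(embedding.shape)         # (3, 4), i.e. (n_samples, n_components + 2)
print(embedding_matrix.shape)  # (3, 2), i.e. (n_samples, n_components)
```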

Methods

`cluster`([kmeans_k])	Fits a K-Means cluster to the embedding data and returns a scikit-learn fitted KMeans object.
`get_optimal_kmeans_k`([k_boundaries])	Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).

Raises:

PandoraException

If explained_variances is not a 1D numpy array or contains more or fewer values than n_components.
If embedding does not contain a “sample_id” column.
If embedding does not contain a “population” column.
If embedding does not contain the correct number of D{i} columns.
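The documented error conditions can be checked ahead of time with a few lines of pandas and numpy. The helper below is hypothetical (`check_embedding_inputs` is not part of pandora) and raises a plain ValueError instead of PandoraException. It also assumes that no columns other than sample_id, population, and D0 to D{n_components - 1} are allowed, which the documentation does not state explicitly.

```python
import numpy as np
import pandas as pd


def check_embedding_inputs(embedding, n_components, explained_variances):
    """Hypothetical pre-check mirroring the documented error conditions."""
    explained_variances = np.asarray(explained_variances)
    if explained_variances.ndim != 1 or explained_variances.shape[0] != n_components:
        raise ValueError("explained_variances must be 1D with n_components values")

    for required in ("sample_id", "population"):
        if required not in embedding.columns:
            raise ValueError(f"embedding is missing the '{required}' column")

    # Assumption: every remaining column must be one of D0 .. D{n_components - 1}.
    expected = {f"D{i}" for i in range(n_components)}
    found = set(embedding.columns) - {"sample_id", "population"}
    if found != expected:
        raise ValueError(f"expected columns {sorted(expected)}, found {sorted(found)}")
```

Calling the helper before constructing an Embedding surfaces shape and column problems early, with a message that names the offending input.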

cluster(kmeans_k: int | None = None) → KMeans[source]

Fits a K-Means cluster to the embedding data and returns a scikit-learn fitted KMeans object.

Parameters:

kmeans_k (int): Number of clusters. If not set, the optimal number of clusters is determined automatically.

Returns:

KMeans: Scikit-learn KMeans object that is fitted to self.embedding.
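As a rough illustration of what a fitted KMeans object provides, the sketch below fits scikit-learn's KMeans directly to a toy embedding matrix. It does not call pandora; the `n_init` and `random_state` values are assumptions chosen to make the example reproducible, and the settings used inside `cluster` may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy embedding matrix of shape (n_samples, n_components)
# with two well-separated groups of three samples each.
embedding_matrix = np.array(
    [
        [0.00, 0.10],
        [0.10, 0.00],
        [0.05, 0.05],
        [5.00, 5.10],
        [5.10, 5.00],
        [5.05, 5.05],
    ]
)

# n_init and random_state are illustrative assumptions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embedding_matrix)

labels = kmeans.labels_            # cluster index per sample
centers = kmeans.cluster_centers_  # shape (n_clusters, n_components)
```

The returned object exposes the usual scikit-learn attributes, so `labels_` gives one cluster assignment per sample and `cluster_centers_` gives the centroids in embedding space.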

get_optimal_kmeans_k(k_boundaries: Tuple[int, int] | None = None) → int[source]

Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).

Parameters:

k_boundaries (Tuple[int, int], default=None): Minimum and maximum number of clusters. If None, the boundaries are determined automatically: if the samples do not all belong to the same population, the maximum max_k is the number of distinct populations; otherwise max_k is the square root of the number of samples. The minimum min_k is min(max_k, 3).

Returns:

int: The optimal number of clusters between min_k and max_k.
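The automatic boundary rule described for k_boundaries can be sketched in plain Python. The function name `default_k_boundaries` is hypothetical, and rounding the square root down to an integer is an assumption, since the documentation does not say how it is rounded.

```python
import math


def default_k_boundaries(populations):
    """Hypothetical sketch of the documented default (min_k, max_k) rule."""
    n_distinct = len(set(populations))
    if n_distinct > 1:
        # Samples come from more than one population.
        max_k = n_distinct
    else:
        # All samples share one population: fall back to sqrt(n_samples).
        # Assumption: the square root is rounded down.
        max_k = math.isqrt(len(populations))
    min_k = min(max_k, 3)
    return min_k, max_k


print(default_k_boundaries(["A", "A", "B", "C", "C"]))  # (3, 3)
print(default_k_boundaries(["A"] * 16))                 # (3, 4)
```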

pandora.embedding.check_smartpca_results(evec: Path, eval: Path) → None[source]

Checks whether the smartpca results finished properly and contain all required information.

Parameters:

evec (pathlib.Path): Filepath pointing to a .evec result file of a smartpca run.
eval (pathlib.Path): Filepath pointing to a .eval result file of a smartpca run.

Returns:

None

Raises:

PandoraException: If either the evec file or the eval file is incorrect.

pandora.embedding.from_smartpca(evec: Path, eval: Path) → Embedding[source]

Creates an Embedding object based on the results of a smartpca run.

Parameters:

evec (pathlib.Path): Filepath pointing to a .evec result file of a smartpca run.
eval (pathlib.Path): Filepath pointing to a .eval result file of a smartpca run.

Returns:

Embedding: Embedding object of the results of the respective smartpca run.

Raises:

PandoraException: If either the evec file or the eval file is incorrect.

pandora.embedding.mds_from_dataframe(embedding: DataFrame, sample_ids: Series, populations: Series, explained_variances: ndarray[Any, dtype[float]]) → Embedding[source]

Creates a new Embedding object based on an MDS embedding pandas dataframe.

Note that embedding is expected to have a column entitled populations. This is needed since the input distance matrices for MDS may be summary statistics for all samples of one population. The resulting MDS object however will duplicate the results for each sample given in sample_ids to match the original input data.

Parameters:

embedding (pd.DataFrame): MDS embedding data as pandas DataFrame. Each row corresponds to a single sample or population. The embedding is expected to have a column entitled "population" denoting the respective population of the row.
sample_ids (pd.Series): Pandas Series containing IDs of samples the embedding data is for. Note that the number of sample IDs can be larger than the number of rows in the embedding. This is the case if the embedding was computed per population but the data should be mapped for each sample. The number of sample IDs needs to match the number of populations.
populations (pd.Series): Pandas Series containing the population for each sample in sample_ids. The number of populations needs to match the number of sample IDs.
explained_variances (npt.NDArray[float]): Numpy ndarray containing the explained variances for each MDS embedding vector (shape=(n_components,)).

Returns:

Embedding: Embedding object encapsulating the MDS data

Raises:

PandoraException

If embedding does not contain a "populations" column.
If the number of samples and number of populations are not identical. Exactly one population is required for each sample.
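The per-population to per-sample duplication described above can be pictured as a left join on the population column. The sketch below uses only pandas and is not the pandora implementation. The column name "population" follows the Parameters description, and the variable names are hypothetical.

```python
import pandas as pd

# Per-population MDS result: one row per population.
mds_per_population = pd.DataFrame(
    {"population": ["popA", "popB"], "D0": [0.4, -0.4], "D1": [0.1, -0.1]}
)

# One population entry per sample ID, as required by the documentation.
sample_ids = pd.Series(["s1", "s2", "s3"])
populations = pd.Series(["popA", "popB", "popA"])
if len(sample_ids) != len(populations):
    raise ValueError("Exactly one population is required for each sample.")

# Duplicate each population's coordinates for every sample of that population.
samples = pd.DataFrame({"sample_id": sample_ids, "population": populations})
per_sample = samples.merge(
    mds_per_population, on="population", how="left", validate="many_to_one"
)
```

The result has one row per sample with the columns sample_id, population, D0, and D1, which matches the layout expected by Embedding.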