pandora.embedding module
- class pandora.embedding.Embedding(embedding: DataFrame, n_components: int, explained_variances: ndarray[Any, dtype[float]])
Bases: object
Class structure encapsulating PCA or MDS results.
- Parameters:
- embedding : pd.DataFrame
Pandas dataframe containing the sample ID, population, and embedding vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:
  - sample_id (str): ID of the respective sample.
  - population (str): Name of the respective population.
  - D{i} for i in range(n_components) (float): Data for the i-th embedding vector for each sample, 0-indexed, so the first embedding vector corresponds to column D0.
- n_components : int
Number of components corresponding to the embedding data.
- explained_variances : npt.NDArray[float]
Numpy ndarray containing the explained variances for each embedding vector (shape=(n_components,)).
- Attributes:
- embedding : pd.DataFrame
Pandas dataframe with shape (n_samples, n_components + 2) that contains the embedding results. The dataframe contains one row per sample and has the following columns:
  - sample_id (str): ID of the respective sample.
  - population (str): Name of the respective population.
  - D{i} for i in range(n_components) (float): Data for the i-th embedding vector for each sample, 0-indexed, so the first embedding vector corresponds to column D0.
- explained_variances : npt.NDArray[float]
Numpy ndarray containing the explained variances for each embedding vector (shape=(n_components,)).
- n_components : int
Number of components.
- embedding_matrix : npt.NDArray[float]
Numpy ndarray of shape (n_samples, n_components) containing the embedding matrix.
- sample_ids : pd.Series[str]
Pandas series containing the IDs of all samples.
- populations : pd.Series[str]
Pandas series containing the population for each sample in sample_ids.
Methods
cluster([kmeans_k]): Fits a K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.
get_optimal_kmeans_k([k_boundaries]): Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
- Raises:
- PandoraException
  - If explained_variances is not a 1D numpy array or contains more or fewer values than n_components.
  - If embedding does not contain a "sample_id" column.
  - If embedding does not contain a "population" column.
  - If embedding does not contain D{i} columns, or does not contain the correct number of them.
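The column layout and the conditions listed under Raises can be illustrated with a small check. This is only a sketch of the documented contract, not Pandora's own validation code: `expected_columns` and `find_layout_problems` are hypothetical helpers, and plain Python lists stand in for the dataframe columns and the `explained_variances` array.

```python
def expected_columns(n_components):
    """Column names the documentation lists for an embedding with n_components."""
    return ["sample_id", "population"] + [f"D{i}" for i in range(n_components)]


def find_layout_problems(columns, n_components, n_explained_variances):
    """Return a list of problems; an empty list means the layout matches the docs.

    Hypothetical helper for illustration only, not part of Pandora.
    """
    problems = []

    # The two metadata columns are required by name.
    for name in ("sample_id", "population"):
        if name not in columns:
            problems.append(f"missing column: {name}")

    # The embedding columns must be exactly D0 .. D{n_components - 1}.
    found = {c for c in columns if c.startswith("D") and c[1:].isdigit()}
    wanted = {f"D{i}" for i in range(n_components)}
    if found != wanted:
        problems.append(
            f"expected D columns {sorted(wanted)}, found {sorted(found)}"
        )

    # One explained variance per embedding vector.
    if n_explained_variances != n_components:
        problems.append(
            f"explained_variances has {n_explained_variances} values, "
            f"expected {n_components}"
        )
    return problems


if __name__ == "__main__":
    print(expected_columns(2))
    print(find_layout_problems(["sample_id", "D0"], 2, 3))
```

According to the Raises section above, the real class signals such problems with a PandoraException; the sketch only collects them in a list.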
- cluster(kmeans_k: int | None = None) -> KMeans
Fits a K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.
- Parameters:
- kmeans_k : int, default=None
Number of clusters. If not set, the optimal number of clusters is determined automatically.
- Returns:
- KMeans
Scikit-learn KMeans object that is fitted to self.embedding.
- get_optimal_kmeans_k(k_boundaries: Tuple[int, int] | None = None) -> int
Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
- Parameters:
- k_boundaries : Tuple[int, int], default=None
Minimum and maximum number of clusters. If None is given, the boundaries are determined automatically: if self.embedding.populations is not identical for all samples, the maximum max_k is the number of distinct populations; otherwise max_k is the square root of the number of samples. The minimum min_k is min(max_k, 3).
- Returns:
- int
The optimal number of clusters between min_k and max_k.
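The automatic boundary rule described for k_boundaries can be sketched as follows. This is a minimal illustration, not Pandora's implementation: `default_k_boundaries` is a hypothetical name, a plain list of population labels stands in for the pandas series, and rounding the square root down to an integer is an assumption the documentation does not spell out.

```python
import math


def default_k_boundaries(populations):
    """Return (min_k, max_k) following the documented rule.

    Hypothetical helper for illustration; `populations` holds one label per sample.
    """
    n_distinct = len(set(populations))
    if n_distinct > 1:
        # Populations differ between samples: use the number of distinct populations.
        max_k = n_distinct
    else:
        # All samples share one population: use the square root of the sample count.
        # Assumption: the square root is rounded down to an integer.
        max_k = math.isqrt(len(populations))
    min_k = min(max_k, 3)
    return min_k, max_k


if __name__ == "__main__":
    print(default_k_boundaries(["A", "B", "A", "C", "D"]))
    print(default_k_boundaries(["A"] * 16))
```

The BIC-based choice of k within these boundaries is not shown here, since the documentation does not describe how the criterion is computed.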
- pandora.embedding.check_smartpca_results(evec: Path, eval: Path) -> None
Checks whether the smartpca run finished properly and whether its results contain all required information.
- Parameters:
- evec : pathlib.Path
Filepath pointing to a .evec result file of a smartpca run.
- eval : pathlib.Path
Filepath pointing to a .eval result file of a smartpca run.
- Returns:
- None
- Raises:
- PandoraException
If either the evec file or the eval file is incorrect.
- pandora.embedding.from_smartpca(evec: Path, eval: Path) -> Embedding
Creates an Embedding object based on the results of a smartpca run.
- Parameters:
- evec : pathlib.Path
Filepath pointing to a .evec result file of a smartpca run.
- eval : pathlib.Path
Filepath pointing to a .eval result file of a smartpca run.
- Returns:
- Embedding
Embedding object of the results of the respective smartpca run.
- Raises:
- PandoraException
If either the evec file or the eval file is incorrect.
- pandora.embedding.mds_from_dataframe(embedding: DataFrame, sample_ids: Series, populations: Series, explained_variances: ndarray[Any, dtype[float]]) -> Embedding
Creates a new Embedding object based on an MDS embedding pandas dataframe.
Note that embedding is expected to have a column entitled "population". This is needed since the input distance matrices for MDS may be summary statistics for all samples of one population. The resulting Embedding object, however, will duplicate the results for each sample given in sample_ids to match the original input data.
- Parameters:
- embedding : pd.DataFrame
MDS embedding data as pandas DataFrame. Each row corresponds to a single sample or population. The embedding is expected to have a column entitled "population" denoting the respective population of the row.
- sample_ids : pd.Series
Pandas Series containing the IDs of the samples the embedding data is for. Note that the number of sample IDs can be larger than the number of rows in the embedding. This is the case if the embedding was computed per population but the data should be mapped to each sample. The number of sample IDs must match the number of entries in populations.
- populations : pd.Series
Pandas Series containing the population for each sample in sample_ids. The number of entries must match the number of sample IDs.
- explained_variances : npt.NDArray[float]
Numpy ndarray containing the explained variances for each MDS embedding vector (shape=(n_components,)).
- Returns:
- Embedding
Embedding object encapsulating the MDS data
- Raises:
- PandoraException
  - If embedding does not contain a "population" column.
  - If the number of samples and the number of populations are not identical. Exactly one population is required for each sample.
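The duplication of per-population results for each sample, as described for mds_from_dataframe, can be sketched in plain Python. This is an illustration under stated assumptions, not Pandora's code: `expand_to_samples` is a hypothetical helper, a dict of lists stands in for the population-level dataframe, lists stand in for the pandas series, and a ValueError stands in for the PandoraException.

```python
def expand_to_samples(population_rows, sample_ids, populations):
    """Build one embedding row per sample from population-level embedding values.

    Hypothetical helper for illustration only.
    population_rows: dict mapping population name -> list of embedding values.
    sample_ids, populations: equally long lists, one population per sample.
    """
    if len(sample_ids) != len(populations):
        # Stand-in for the PandoraException described in the Raises section.
        raise ValueError("Exactly one population is required for each sample.")

    rows = []
    for sample_id, population in zip(sample_ids, populations):
        # Every sample of a population receives a copy of that population's values.
        values = population_rows[population]
        row = {"sample_id": sample_id, "population": population}
        row.update({f"D{i}": value for i, value in enumerate(values)})
        rows.append(row)
    return rows


if __name__ == "__main__":
    per_population = {"popA": [0.1, 0.2], "popB": [-0.3, 0.4]}
    for row in expand_to_samples(
        per_population, ["s1", "s2", "s3"], ["popA", "popA", "popB"]
    ):
        print(row)
```

Here two population-level rows expand to three sample rows, with s1 and s2 sharing the values of popA.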