pandora.embedding module
- class pandora.embedding.Embedding(embedding: DataFrame, n_components: int)
Bases: object
Base wrapper class for the result of a PCA/MDS embedding.
Methods
cluster([kmeans_k]): Fits a K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.
get_optimal_kmeans_k([k_boundaries]): Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
- cluster(kmeans_k: int = None) -> KMeans
Fits a K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.
- Parameters:
- kmeans_k : int, default=None
Number of clusters. If not set, the optimal number of clusters is determined automatically.
- Returns:
- KMeans
Scikit-learn KMeans object that is fitted to self.embedding.
- get_optimal_kmeans_k(k_boundaries: Tuple[int, int] = None) -> int
Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
- Parameters:
- k_boundaries : Tuple[int, int], default=None
Minimum and maximum number of clusters. If None is given, the boundaries are determined automatically: if self.embedding.populations is not identical for all samples, the maximum max_k is the number of distinct populations; otherwise max_k is the square root of the number of samples. The minimum min_k is min(max_k, 3).
- Returns:
- int
The optimal number of clusters between min_k and max_k.
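The default boundary rule described for k_boundaries can be illustrated with a small standalone sketch. The helper name `default_k_boundaries` is hypothetical and not part of Pandora, and rounding the square root down is an assumption, since the description above does not say how it is rounded.

```python
import math
from typing import Sequence, Tuple


def default_k_boundaries(populations: Sequence[str]) -> Tuple[int, int]:
    """Hypothetical helper mirroring the documented default for k_boundaries."""
    n_samples = len(populations)
    n_distinct = len(set(populations))
    if n_distinct > 1:
        # Populations differ between samples: use the number of distinct populations.
        max_k = n_distinct
    else:
        # All samples share one population: fall back to sqrt(n_samples).
        # Rounding down is an assumption; the documentation does not specify it.
        max_k = math.isqrt(n_samples)
    min_k = min(max_k, 3)
    return min_k, max_k
```

The function accepts any sequence of population labels, so a pandas Series such as the populations attribute could be passed in the same way as a plain list.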
- class pandora.embedding.MDS(embedding: DataFrame, n_components: int, stress: float)
Bases: Embedding
Initializes a new MDS object.
- Parameters:
- embedding : pd.DataFrame
Pandas dataframe containing the sample ID, population and embedding vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:
  - sample_id (str): ID of the respective sample.
  - population (str): Name of the respective population.
  - D{i} for i in range(n_components) (float): data for the i-th embedding dimension for each sample, 0-indexed, so the first dimension corresponds to column D0.
- n_components : int
Number of components the data was fitted for.
- stress : float
Stress of the fitted MDS.
- Raises:
- PandoraException
  - If embedding does not contain a “sample_id” column.
  - If embedding does not contain a “population” column.
  - If embedding does not contain (the correct number of) D{i} columns.
- Attributes:
- embedding : pd.DataFrame
Pandas dataframe containing the sample ID, population and embedding vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:
  - sample_id (str): ID of the respective sample.
  - population (str): Name of the respective population.
  - D{i} for i in range(n_components) (float): data for the i-th embedding dimension for each sample, 0-indexed, so the first dimension corresponds to column D0.
- n_components : int
Number of components the data was fitted for.
- stress : float
Stress of the fitted MDS.
- embedding_matrix : npt.NDArray[float]
Numpy ndarray of shape (n_samples, n_components) containing the MDS result matrix.
- sample_ids : pd.Series[str]
Pandas series containing the IDs of all samples.
- populations : pd.Series[str]
Pandas series containing the population for each sample in sample_ids.
Methods
cluster([kmeans_k]): Fits a K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.
get_optimal_kmeans_k([k_boundaries]): Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
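The column requirements listed for the embedding dataframe can be sketched as a standalone check. This is only an illustration of the documented conditions, not Pandora's implementation: the function name `check_embedding_columns` is hypothetical, and it raises ValueError where Pandora is documented to raise PandoraException.

```python
import re
from typing import Iterable


def check_embedding_columns(columns: Iterable[str], n_components: int) -> None:
    """Hypothetical stand-in for the documented column checks.

    Raises ValueError where Pandora is documented to raise PandoraException.
    """
    columns = list(columns)
    for required in ("sample_id", "population"):
        if required not in columns:
            raise ValueError(f"Missing column: {required}")
    # Collect every column named D<number> and compare with the expected D0..D{n-1}.
    found = {c for c in columns if re.fullmatch(r"D\d+", c)}
    expected = {f"D{i}" for i in range(n_components)}
    if found != expected:
        raise ValueError(f"Expected columns {sorted(expected)}, found {sorted(found)}")
```

For a pandas dataframe, the check could be applied to its column labels, for example `check_embedding_columns(df.columns, n_components)`.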
- class pandora.embedding.PCA(embedding: DataFrame, n_components: int, explained_variances: ndarray[Any, dtype[float]])
Bases: Embedding
Class structure encapsulating PCA results.
This class provides a wrapper for PCA results.
- Parameters:
- embedding : pd.DataFrame
Pandas dataframe containing the sample ID, population and PC vector of all samples. The dataframe should contain one row per sample. Pandora expects the following columns:
  - sample_id (str): ID of the respective sample.
  - population (str): Name of the respective population.
  - D{i} for i in range(n_components) (float): data for the i-th PC for each sample, 0-indexed, so the first PC corresponds to column D0.
- n_components : int
Number of principal components corresponding to the PCA data.
- explained_variances : npt.NDArray[float]
Numpy ndarray containing the explained variances for each PC (shape=(n_components,)).
- Raises:
- PandoraException
  - If explained_variances is not a 1D numpy array or contains more/fewer values than n_components.
  - If embedding does not contain a “sample_id” column.
  - If embedding does not contain a “population” column.
  - If embedding does not contain (the correct number of) D{i} columns.
- Attributes:
- embedding : pd.DataFrame
Pandas dataframe with shape (n_samples, n_components + 2) that contains the PCA results. The dataframe contains one row per sample and has the following columns:
  - sample_id (str): ID of the respective sample.
  - population (str): Name of the respective population.
  - D{i} for i in range(n_components) (float): data for the i-th PC for each sample, 0-indexed, so the first PC corresponds to column D0.
- explained_variances : npt.NDArray[float]
Numpy ndarray containing the explained variances for each PC (shape=(n_components,)).
- n_components : int
Number of principal components.
- embedding_matrix : npt.NDArray[float]
Numpy ndarray of shape (n_samples, n_components) containing the PCA result matrix.
- sample_ids : pd.Series[str]
Pandas series containing the IDs of all samples.
- populations : pd.Series[str]
Pandas series containing the population for each sample in sample_ids.
Methods
cluster([kmeans_k]): Fits a K-Means clustering to the embedding data and returns the fitted scikit-learn KMeans object.
get_optimal_kmeans_k([k_boundaries]): Determines the optimal number of clusters k for K-Means clustering according to the Bayesian Information Criterion (BIC).
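As a rough illustration of the input layout described for PCA, the following sketch assembles a dataframe with the expected columns using only pandas and numpy. All sample IDs, population labels and numeric values are made up for the example, and nothing here calls Pandora itself.

```python
import numpy as np
import pandas as pd

n_components = 2
sample_ids = ["s1", "s2", "s3"]          # made-up sample IDs
populations = ["popA", "popA", "popB"]   # made-up population labels
# Made-up PC coordinates with shape (n_samples, n_components).
matrix = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# One D{i} column per component, 0-indexed.
embedding = pd.DataFrame(matrix, columns=[f"D{i}" for i in range(n_components)])
embedding.insert(0, "sample_id", sample_ids)
embedding.insert(1, "population", populations)

# Made-up explained variances: a 1D array with one value per PC.
explained_variances = np.array([0.7, 0.2])

# The documented shapes: (n_samples, n_components + 2) and (n_components,).
assert embedding.shape == (len(sample_ids), n_components + 2)
assert explained_variances.shape == (n_components,)
```

With Pandora installed, `embedding`, `n_components` and `explained_variances` would correspond to the three constructor arguments in the signature above.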
- pandora.embedding.check_smartpca_results(evec: Path, eval: Path) -> None
Checks whether the smartpca results finished properly and contain all required information.
- Parameters:
- evec : pathlib.Path
Filepath pointing to a .evec result file of a smartpca run.
- eval : pathlib.Path
Filepath pointing to a .eval result file of a smartpca run.
- Returns:
- None
- Raises:
- PandoraException
If either the evec file or the eval file is incorrect.
- pandora.embedding.from_sklearn_mds(embedding: DataFrame, sample_ids: Series, populations: Series, stress: float) -> MDS
Creates a new MDS object based on an MDS embedding pandas dataframe.
Note that embedding is expected to have a column entitled "population". This is needed since the input distance matrices for MDS may be summary statistics for all samples of one population. The resulting MDS object, however, will duplicate the results for each sample given in sample_ids to match the original input data.
- Parameters:
- embedding : pd.DataFrame
MDS embedding data as pandas DataFrame. Each row corresponds to a single sample or population. The embedding is expected to have a column entitled "population" denoting the respective population of the row.
- sample_ids : pd.Series
Pandas Series containing IDs of samples the embedding data is for. Note that the number of sample IDs can be larger than the number of rows in the embedding. This is the case if the embedding was computed per population but the data should be mapped for each sample. The number of sample IDs needs to match the number of populations.
- populations : pd.Series
Pandas Series containing the population for each sample in sample_ids. The number of populations needs to match the number of sample IDs.
- stress : float
Goodness of the MDS fit for the data.
- Returns:
- MDS
MDS object encapsulating the MDS data
- Raises:
- PandoraException
  - If embedding does not contain a "population" column.
  - If the number of samples and number of populations are not identical. Exactly one population is required for each sample.
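The per-sample duplication described for from_sklearn_mds can be pictured with a plain pandas merge. This is a sketch of the idea only, not Pandora's implementation; it assumes the population column is named "population" as in the parameter description, and all IDs and values are made up.

```python
import pandas as pd

# Embedding computed per population: one row per population (made-up values).
embedding = pd.DataFrame(
    {"population": ["popA", "popB"], "D0": [0.1, 0.9], "D1": [0.2, 0.8]}
)
sample_ids = pd.Series(["s1", "s2", "s3"])
populations = pd.Series(["popA", "popB", "popA"])

# Exactly one population is required per sample.
if len(sample_ids) != len(populations):
    raise ValueError("Number of sample IDs and populations must be identical.")

samples = pd.DataFrame({"sample_id": sample_ids, "population": populations})
# A left merge keeps the sample order and copies each population's coordinates
# onto every sample that belongs to that population.
per_sample = samples.merge(embedding, on="population", how="left")
```

Here `per_sample` has one row per sample, and samples s1 and s3 share the coordinates of popA.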
- pandora.embedding.from_smartpca(evec: Path, eval: Path) -> PCA
Creates a PCA object based on the results of a smartpca run.
- Parameters:
- evec : pathlib.Path
Filepath pointing to a .evec result file of a smartpca run.
- eval : pathlib.Path
Filepath pointing to a .eval result file of a smartpca run.
- Returns:
- PCA
PCA object of the results of the respective smartpca run.
- Raises:
- PandoraException
If either the evec file or the eval file is incorrect.
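To give a feel for what reading a .evec file involves, here is a minimal parsing sketch. It assumes the commonly seen smartpca layout, in which the first line starts with `#eigvals:` followed by the eigenvalues and every later line holds a sample ID, the PC values, and the population label as the last field. The function name `parse_evec_text` is hypothetical, and this is not how Pandora itself reads the file.

```python
from typing import Dict, List, Tuple


def parse_evec_text(text: str) -> Tuple[List[float], List[Dict[str, object]]]:
    """Hypothetical parser for the assumed .evec layout (see lead-in)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty evec content")
    header = lines[0].split()
    if header[0] != "#eigvals:":
        raise ValueError("First line does not start with '#eigvals:'")
    eigenvalues = [float(value) for value in header[1:]]
    rows: List[Dict[str, object]] = []
    for line in lines[1:]:
        # Assumed field order: sample ID, PC values, population label.
        sample_id, *values, population = line.split()
        row: Dict[str, object] = {"sample_id": sample_id, "population": population}
        for i, value in enumerate(values):
            row[f"D{i}"] = float(value)  # 0-indexed, matching the D{i} columns
        rows.append(row)
    return eigenvalues, rows
```

The returned rows use the same column names as the embedding dataframe described for PCA. Sample IDs or population names containing whitespace are not handled by this sketch.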