Plotting and further insights
The raw results of Pandora might be enough for you, but sometimes you really want to know what is going on with your data and why the results are the way they are. This is probably especially true for the support values of single samples.
In this jupyter notebook, we provide some examples on how you can inspect your results with a focus on the support values.
These examples are also intended to demonstrate the plotting functionalities of Pandora.
The dataset we are using here is the West-Eurasian subset of the Human Origins data (Lazaridis, I., Patterson, N., Mittnik, A. et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409–413 (2014). https://doi.org/10.1038/nature13673) merged with 13 Çayönü individuals (N. Ezgi Altınışık et al., A genomic snapshot of demographic and cultural dynamism in Upper Mesopotamia during the Neolithic Transition.Sci. Adv.8,eabo3609(2022). https://doi.org/10.1126/sciadv.abo3609)
[1]:
# First import some Pandora library functionality and other packages we are going to need for the following inspection
import pandas as pd
import pathlib
from pandora.embedding import from_smartpca
from pandora.embedding_comparison import EmbeddingComparison
from pandora.plotting import plot_embedding_comparison_rogue_samples, plot_support_values, plot_populations, plot_projections
# and setup the paths to the exemplary results we are going to inspect
results_dir = pathlib.Path("../_static/cayonu_results")
bootstrap_results = results_dir / "bootstrap"
support_values_file = results_dir / "pandora.supportValues.csv"
projected_support_values_file = results_dir / "pandora.supportValues.projected.csv"
[2]:
# we next load the EigenDataset for the Cayonu dataset
# note that we only use a pickled object for this as the EIGENSTRAT files would be too large for uploading to GitHub
import pickle
cayonu_dataset = pickle.load((results_dir / "dataset.pickle").open("rb"))
# here is how you would normally initialize this dataset:
# cayonu_dataset = EigenDataset(
# file_prefix=pathlib.Path(prefix/to/cayonu_files),
# embedding_populations=pathlib.Path(prefix/to/embedding_populations.txt)
# )
# load the pre-computed PCA object
cayonu_dataset.pca = from_smartpca(
evec=results_dir / "cayonu.evec",
eval=results_dir / "cayonu.eval"
)
First, we visualize the dataset’s unbootstrapped PCA using the precomputed results and annotate all populations
[3]:
fig = plot_populations(cayonu_dataset.pca)
fig.update_legends(orientation="h")
fig.update_layout(height=1000, width=750)
next, let’s only plot the projected Çayönü samples
[4]:
fig = plot_projections(cayonu_dataset.pca, cayonu_dataset.embedding_populations)
fig.update_layout(height=750, width=750)
Next, let’s have a look at the support values for all samples, and for the projected samples
[5]:
psv_all_samples = pd.read_csv(support_values_file, index_col=[0])
psv_all_samples.sort_values("PSV")
[5]:
| PSV | |
|---|---|
| ITS2 | 0.424616 |
| ITS5 | 0.424936 |
| Jordan445 | 0.476375 |
| HGDP00625 | 0.487708 |
| cay033 | 0.492772 |
| ... | ... |
| HGDP00887 | 0.942582 |
| UkrBel618 | 0.942695 |
| Mordovians28 | 0.942815 |
| UkrBel614 | 0.943761 |
| HGDP00894 | 0.944473 |
826 rows × 1 columns
[6]:
psv_projected_samples = pd.read_csv(projected_support_values_file, index_col=[0])
psv_projected_samples.sort_values("PSV")
[6]:
| PSV | |
|---|---|
| cay033 | 0.492772 |
| cay014 | 0.537452 |
| cay015 | 0.539045 |
| cay022 | 0.561840 |
| cay020 | 0.579423 |
| cay027 | 0.584384 |
| cay012 | 0.590237 |
| cay011 | 0.609492 |
| cay016 | 0.627005 |
| cay018 | 0.627732 |
| cay1820 | 0.654619 |
| cay013 | 0.712516 |
| cay007 | 0.806723 |
A table is nice, but a plot is nicer for sure!
The following plot color-codes all samples according to their support, and additionally annotates all samples with a support value below 0.5
[7]:
plot_support_values(cayonu_dataset.pca, psv_all_samples.PSV, support_value_rogue_cutoff=0.5)
We can also make some more sense about why these samples have such a low PSV. To do so, we plot two bootstrapped PCA results against each other and annotate the “rogue” samples.
To demonstrate the rogueness of rogue samples, we need to match two PCA results (using Procrustes Transformation) and then highlight the positions of the respective samples in both plots. Since we can only match two embeddings at once, we can only plot the rogue samples for pairs of embeddings.
A few notes regarding the following plot:
The shape of the PCA does not match the shape of the previous plots anymore. This is a direct consequence of the Procrustes Transformation.
Each color corresponds to a single sample. Each color has two points: a dot and a star representing the two bootstrap replicates.
Here we only plot samples with a PSV <= 0.45 to better demonstrate how to use this “rogueness” plot
[8]:
# first, we load two bootstrap PCA results
# for samples with low support you can basically use any two bootstraps and chances are the results will be as obvious as the example belpw
bootstrap0 = from_smartpca(
evec=bootstrap_results / "bootstrap_0.evec",
eval=bootstrap_results / "bootstrap_0.eval"
)
bootstrap1 = from_smartpca(
evec=bootstrap_results / "bootstrap_1.evec",
eval=bootstrap_results / "bootstrap_1.eval"
)
# next, we create a EmbeddingComparison object that handles all the Procrustes stuff for us
comparison = EmbeddingComparison(bootstrap0, bootstrap1)
# finally, we can plot the comparison and highlight samples with a PSV <= 0.45
plot_embedding_comparison_rogue_samples(comparison, psv_all_samples.PSV, support_value_rogue_cutoff=0.45)
You can now make sense of why ITS2 and ITS5 (both South-Italian samples) have such low PSVs: both seem to be projected to substantially different points in the embedding space depending on the bootstrap we are looking at. If we plot both bootstrap replicates separately (and untransformed), we can see that ITS2’s cloest neighbor in Bootstrap #0 is an Ashkenazi-Jew samples, whereas in Bootstrap #1 it is a Greek sample. Similarly, for ITS5 the closest neighbors are a Maltese sample or another Italian-South sample.
[9]:
plot_support_values(bootstrap0, psv_all_samples.PSV, support_value_rogue_cutoff=0.45)
[10]:
plot_support_values(bootstrap1, psv_all_samples.PSV, support_value_rogue_cutoff=0.45)