Plotting and further insights

The raw results of Pandora might be enough for you, but sometimes you really want to know what is going on with your data and why the results are the way they are. This is probably especially true for the support values of single samples.

In this jupyter notebook, we provide some examples on how you can inspect your results with a focus on the support values.

These examples are also intended to demonstrate the plotting functionalities of Pandora.

The dataset we are using here is the West-Eurasian subset of the Human Origins data (Lazaridis, I., Patterson, N., Mittnik, A. et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409–413 (2014). https://doi.org/10.1038/nature13673) merged with 13 Çayönü individuals (N. Ezgi Altınışık et al., A genomic snapshot of demographic and cultural dynamism in Upper Mesopotamia during the Neolithic Transition.Sci. Adv.8,eabo3609(2022). https://doi.org/10.1126/sciadv.abo3609)

[1]:

# First import some Pandora library functionality and other packages we are going to need for the following inspection
import pandas as pd
import pathlib

from pandora.embedding import from_smartpca
from pandora.embedding_comparison import EmbeddingComparison
from pandora.plotting import plot_embedding_comparison_rogue_samples, plot_support_values, plot_populations, plot_projections

# and setup the paths to the exemplary results we are going to inspect
results_dir = pathlib.Path("../_static/cayonu_results")
bootstrap_results = results_dir / "bootstrap"
support_values_file = results_dir / "pandora.supportValues.csv"
projected_support_values_file = results_dir / "pandora.supportValues.projected.csv"

[2]:

# we next load the EigenDataset for the Cayonu dataset
# note that we only use a pickled object for this as the EIGENSTRAT files would be too large for uploading to GitHub
import pickle

cayonu_dataset = pickle.load((results_dir / "dataset.pickle").open("rb"))

# here is how you would normally initialize this dataset:
# cayonu_dataset = EigenDataset(
#     file_prefix=pathlib.Path(prefix/to/cayonu_files),
#     embedding_populations=pathlib.Path(prefix/to/embedding_populations.txt)
# )

# load the pre-computed PCA object
cayonu_dataset.pca = from_smartpca(
    evec=results_dir / "cayonu.evec",
    eval=results_dir / "cayonu.eval"
)

First, we visualize the dataset’s unbootstrapped PCA using the precomputed results and annotate all populations

[3]:

fig = plot_populations(cayonu_dataset.pca)
fig.update_legends(orientation="h")
fig.update_layout(height=1000, width=750)

next, let’s only plot the projected Çayönü samples

[4]:

fig = plot_projections(cayonu_dataset.pca, cayonu_dataset.embedding_populations)
fig.update_layout(height=750, width=750)

Next, let’s have a look at the support values for all samples, and for the projected samples

[5]:

psv_all_samples = pd.read_csv(support_values_file, index_col=[0])
psv_all_samples.sort_values("PSV")

[5]:

	PSV
ITS2	0.424616
ITS5	0.424936
Jordan445	0.476375
HGDP00625	0.487708
cay033	0.492772
...	...
HGDP00887	0.942582
UkrBel618	0.942695
Mordovians28	0.942815
UkrBel614	0.943761
HGDP00894	0.944473

826 rows × 1 columns

[6]:

psv_projected_samples = pd.read_csv(projected_support_values_file, index_col=[0])
psv_projected_samples.sort_values("PSV")

[6]:

	PSV
cay033	0.492772
cay014	0.537452
cay015	0.539045
cay022	0.561840
cay020	0.579423
cay027	0.584384
cay012	0.590237
cay011	0.609492
cay016	0.627005
cay018	0.627732
cay1820	0.654619
cay013	0.712516
cay007	0.806723

A table is nice, but a plot is nicer for sure!

The following plot color-codes all samples according to their support, and additionally annotates all samples with a support value below 0.5

[7]:

plot_support_values(cayonu_dataset.pca, psv_all_samples.PSV, support_value_rogue_cutoff=0.5)

We can also make some more sense about why these samples have such a low PSV. To do so, we plot two bootstrapped PCA results against each other and annotate the “rogue” samples.

To demonstrate the rogueness of rogue samples, we need to match two PCA results (using Procrustes Transformation) and then highlight the positions of the respective samples in both plots. Since we can only match two embeddings at once, we can only plot the rogue samples for pairs of embeddings.

A few notes regarding the following plot:

The shape of the PCA does not match the shape of the previous plots anymore. This is a direct consequence of the Procrustes Transformation.
Each color corresponds to a single sample. Each color has two points: a dot and a star representing the two bootstrap replicates.
Here we only plot samples with a PSV <= 0.45 to better demonstrate how to use this “rogueness” plot

[8]:

# first, we load two bootstrap PCA results
# for samples with low support you can basically use any two bootstraps and chances are the results will be as obvious as the example belpw

bootstrap0 = from_smartpca(
    evec=bootstrap_results / "bootstrap_0.evec",
    eval=bootstrap_results / "bootstrap_0.eval"
)

bootstrap1 = from_smartpca(
    evec=bootstrap_results / "bootstrap_1.evec",
    eval=bootstrap_results / "bootstrap_1.eval"
)

# next, we create a EmbeddingComparison object that handles all the Procrustes stuff for us
comparison = EmbeddingComparison(bootstrap0, bootstrap1)

# finally, we can plot the comparison and highlight samples with a PSV <= 0.45
plot_embedding_comparison_rogue_samples(comparison, psv_all_samples.PSV, support_value_rogue_cutoff=0.45)

You can now make sense of why ITS2 and ITS5 (both South-Italian samples) have such low PSVs: both seem to be projected to substantially different points in the embedding space depending on the bootstrap we are looking at. If we plot both bootstrap replicates separately (and untransformed), we can see that ITS2’s cloest neighbor in Bootstrap #0 is an Ashkenazi-Jew samples, whereas in Bootstrap #1 it is a Greek sample. Similarly, for ITS5 the closest neighbors are a Maltese sample or another Italian-South sample.

[9]:

plot_support_values(bootstrap0, psv_all_samples.PSV, support_value_rogue_cutoff=0.45)

[10]:

plot_support_values(bootstrap1, psv_all_samples.PSV, support_value_rogue_cutoff=0.45)