The NumPy-based Interface
=========================
The following code snippets provide examples on how to use the numpy-based Pandora interface.

PCA Analysis
------------
The following example demonstrates you how to do the PCA bootstrap stability analysis using the alternative ``numpy`` interface of Pandora.
Instead of providing the file path to EIGEN files, you directly provide Pandora with a ``numpy`` data matrix.

.. code-block:: python

    import pandas as pd
    import numpy as np

    from pandora.bootstrap import bootstrap_and_embed_multiple_numpy
    from pandora.custom_types import EmbeddingAlgorithm
    from pandora.dataset import NumpyDataset
    from pandora.embedding_comparison import BatchEmbeddingComparison

    # this is the same geno type data as we used above, but already typed out as numpy matrix
    geno_data = np.asarray([
        [1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
        [1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
        [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
        [0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
        [0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
    )
    # create a new pandas Series for the respective sample IDs and populations
    sample_ids = pd.Series(["SAMPLE0", "SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"])
    populations = pd.Series(["pop0", "pop1", "pop2", "pop3", "pop4"])

    # finally initialize the dataset using the geno data and the sample metadata
    dataset = NumpyDataset(geno_data, sample_ids, populations)

    # create 10 bootstrap replicates based on the dataset and compute the PCA embedding for each bootstrap replicate
    bootstrap_replicates = bootstrap_and_embed_multiple_numpy(
        dataset=dataset,
        n_bootstraps=10,
        embedding=EmbeddingAlgorithm.PCA,  # tell Pandora to compute PCA for each of the bootstrap replicates
        n_components=2,  # here we only compute 2 PCs
        seed=42, # set the seed for full reproducibility of the bootstraps
        threads=2  # compute the bootstraps in parallel using 2 threads
    )

    # finally, using all bootstrap PCA objects, we create a container for comparing all replicates and getting the overall PS score
    batch_comparison = BatchEmbeddingComparison([b.pca for b in bootstrap_replicates])
    pandora_stability = batch_comparison.compare()

    print("Pandora Stability (PS): ", round(pandora_stability, 2))


This will print something like ``Pandora Stability (PS):  0.96``.


MDS Analysis
------------
This example shows how to perform an MDS analysis using the NumPy-based interface. You can select the distance metric you wish to compute as input for MDS analysis.
The first example uses one of the pre-implemented distance metrics, afterwards I will demonstrate how you can define your own custom distance metric.

**Important:** while the above examples all also work in a Jupyter notebook, the following example will only run if you paste it into a python
file and run it from command line. The reason for this is the custom distance metric we will pass for MDS
(which is a python ``Callable`` and ``bootstrap_and_embed_multiple_numpy`` uses multiprocessing which causes some trouble when not wrapped in the ``if name == "main":``)

.. code-block:: python

    import pandas as pd
    import numpy as np

    from pandora.bootstrap import bootstrap_and_embed_multiple_numpy
    from pandora.custom_types import EmbeddingAlgorithm
    from pandora.dataset import NumpyDataset
    from pandora.distance_metrics import manhattan_population_distance
    from pandora.embedding_comparison import BatchEmbeddingComparison


    if __name__ == "__main__":
        # this is identical to the example with PCA above: define the data and init the dataset
        geno_data = np.asarray([
            [1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
            [1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
            [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
            [0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
            [0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
        )
        sample_ids = pd.Series(["SAMPLE0", "SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"])
        populations = pd.Series(["pop0", "pop1", "pop2", "pop3", "pop4"])

        dataset = NumpyDataset(geno_data, sample_ids, populations)

        # instead of PCA, this time we pass MDS as embedding method
        # in this case we also need to pass a Callable, we use the above euclidean function in this example
        bootstrap_replicates = bootstrap_and_embed_multiple_numpy(
            dataset=dataset,
            n_bootstraps=10,  # again compute 10 bootstrap datasets
            embedding=EmbeddingAlgorithm.MDS,  # and perform MDS analysis for each bootstrap
            distance_metric=manhattan_population_distance,  # use the Manhattan distance between populations for MDS computation
            n_components=2,
            seed=42,
            threads=2
        )

        batch_comparison = BatchEmbeddingComparison([b.mds for b in bootstrap_replicates])
        pandora_stability = batch_comparison.compare()

        print("Pandora Stability (PS): ", round(pandora_stability, 2))


Again we will se an output like ``Pandora Stability (PS):  0.92.``

Custom distance metric
^^^^^^^^^^^^^^^^^^^^^^
If you want to use a distance metric that is not implemented in Pandora, you can define one very easily as I will show you with the following
example in which we will use the ``scikit-learn`` pairwise ``cosine_distances`` function. You can define a per-sample and a per-population metric like this:

.. code-block:: python

    from sklearn.metrics.pairwise import cosine_distances

    from pandora.distance_metrics import population_distance
    from pandora.imputation import impute_data


    def cosine_sample_distance(input_data: npt.NDArray, populations: pd.Series, imputation: Optional[str]) -> Tuple[npt.NDArray, pd.Series]:
        # first we impute the data, note that depending on the distance metric you are using not all imputations make sense
        # you can of course also implement your own custom imputation method and not use the provided ``impute_data`` functionality.
        input_data = impute_data(input_data, imputation)
        return cosine_distances(input_data, input_data), populations

    def cosine_population_distance(input_data: npt.NDArray, populations: pd.Series, imputation: Optional[str]) -> Tuple[npt.NDArray, pd.Series]:
        # first we impute the data, note that depending on the distance metric you are using not all imputations make sense
        # you can of course also implement your own custom imputation method and not use the provided ``impute_data`` functionality.
        input_data = impute_data(input_data, imputation)
        return population_distance(input_data, populations, cosine_distances)


For the per-population metric, we make use of Pandora's ``population_distance`` function. Provided a numpy data array and the respective populations,
as well as the desired pairwise distance metric, ``population_distance`` will take care of the population grouping.
You can then use one of your custom distance metrics by passing it as ``distance_metric`` to the respective function calls in Pandora
(e.g. the above example of ``bootstrap_and_embed_multiple_numpy``).


Sliding-Window Analysis
-----------------------
The above examples show the usage of the numpy-based interface for the Pandora bootstrap analysis. Pandora additionally provides methods to estimate the stability of dimensionality reduction along a genome.
The code for these analyses is basically the same as above, but instead of ``bootstrap_and_embed_multiple_numpy``, we will use the ``sliding_window_embedding_numpy`` method.
The following example demonstrates a sliding-window PCA analysis.

.. code-block:: python

    import pandas as pd
    import numpy as np

    from pandora.custom_types import EmbeddingAlgorithm
    from pandora.dataset import NumpyDataset
    from pandora.embedding_comparison import BatchEmbeddingComparison
    from pandora.sliding_window import sliding_window_embedding_numpy

    # for the sliding window analysis we use a larger array as example
    geno_data = np.asarray([
        [1, 0, 2, 0, 2, 0, 2, 0, 1, 2, 1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
        [1, 1, 1, 0, 1, 0, 2, 0, 2, 2, 1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
        [1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
        [0, 1, 0, 2, 0, 1, 1, 1, 0, 0, 0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
        [0, 2, 1, 2, 0, 1, 0, 2, 1, 1, 0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
    )
    # create a new pandas Series for the respective sample IDs and populations
    sample_ids = pd.Series(["SAMPLE0", "SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"])
    populations = pd.Series(["pop0", "pop1", "pop2", "pop3", "pop4"])

    # finally initialize the dataset using the geno data and the sample metadata
    dataset = NumpyDataset(geno_data, sample_ids, populations)

    # for this example, we separate the dataset into 5 windows and compute the PCA embedding for each of the windows
    sliding_windows = sliding_window_embedding_numpy(
        dataset=dataset,
        n_windows=5,
        embedding=EmbeddingAlgorithm.PCA,  # tell Pandora to compute PCA for each of the windows
        n_components=2,  # here we only compute 2 PCs
        threads=2,  # compute the bootstraps in parallel using 2 threads
    )

    # finally, using all windowed PCA objects, we create a container for comparing all replicates and getting the overall PS score
    batch_comparison = BatchEmbeddingComparison([w.pca for w in sliding_windows])
    pandora_stability = batch_comparison.compare()

    print("Pandora Stability (PS): ", round(pandora_stability, 2))


This will print something like ``Pandora Stability (PS):  0.93``.


Loading Eigen-files as NumpyDataset
-----------------------------------
Instead of defining your data as a numpy dataset manually, you can also load genotype dataset files in EIGENSTRAT format.
You can simply use the ``pandora.dataset.numpy_dataset_from_eigenfiles`` method to do so:

.. code-block:: python

    import pathlib

    from pandora.dataset import numpy_dataset_from_eigenfiles

    # set the prefix to the .geno, .ind, and .snp files
    # if your dataset is not in EIGENSTRAT format, check out Pandora's conversion module
    dataset_prefix = pathlib.Path("path/to/eigenfiles")
    # numpy_dataset_from_eigenfiles expects three files:
    # - path/to/eigenfiles.snp
    # - path/to/eigenfiles.geno
    # - path/to/eigenfiles.ind
    dataset = numpy_dataset_from_eigenfiles(dataset_prefix)