Working with data from R
========================

Many computational biologists are more familiar in working with R instead of Python and thus might have their own (pre)processing pipeline setup in R.
You can export your data from R as `.RData` file using the ``save`` function and we can then read this data in Python using the ``pyreadr`` package.
This might not work universally for all kinds of data, if you have any issues with exporting your data feel free to contact us and we can work things out together!

First of all, we need to install the ``pyreadr`` package. You can do so either via conda::

    conda install pyreadr -c conda-forge

or via pip::

    pip install pyreadr


Suppose we have the file called ``example.RData``. You can then load this data in python like this:

.. code-block:: python

    import pyreadr
    data = pyreadr.read_r("example.RData")
    print(data.keys())


``data`` is now a Python ``OrderedDict`` and the above code also prints the keys of this dict. Suppose our data in R had the attributes ``geno_data``, ``populations`` and ``sample_ids``, then this will print ``odict_keys(['geno_data', 'sample_ids', 'populations'])``.
We can then access e.g. the ``geno_data`` simply via a dict access and pyreadr will return the data as pandas dataframe. The following code-snipped shows you how you can then transform this data
to a Pandora NumpyDataset.


.. code-block:: python

    import numpy as np
    import pyreadr

    from pandora.dataset import NumpyDataset

    # in our this is an OrderedDict with keys geno_data, populations, and sample_ids
    r_data = pyreadr.read_r("example.RData")

    # these are all pandas dataframes
    geno_data = r_data["geno_data"]
    sample_ids = r_data["sample_ids"]
    populations = r_data["populations"]

    # convert the dataframes to the required input formats for the NumpyDataset
    # geno_data needs to be a numpy NDArray
    geno_data = geno_data.to_numpy()
    # also we need to properly set the nan values in order for the imputation to work
    geno_data = np.nan_to_num(geno_data, nan=np.nan)

    # sample IDs and populations need to be pandas Series, not dataframes
    # in case the sample_ids in R were just a vector, the dataframes will have a column with the same key as above
    # the dataframe will however simply have the
    sample_ids = sample_ids.sample_ids
    populations = populations.populations

    # using this data, we can initialize a NumpyDataset
    dataset = NumpyDataset(geno_data, sample_ids, populations)