converting .rda to pandas dataframe - python

I have some .rda files that I need to access with Python.
My code looks like this:
import rpy2.robjects as robjects
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
df = robjects.r.load("datafile.rda")
df2 = pandas2ri.ri2py_dataframe(df)
where df2 is a pandas dataframe. However, it only contains the header of the .rda file! I have searched back and forth. None of the solutions proposed seem to be working.
Does anyone have an idea how to efficiently convert an .rda dataframe to a pandas dataframe?

Thank you for your useful question. I tried the two approaches proposed in the other answers to handle my problem.
For feather, I faced this issue:
pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file
For rpy2, as mentioned by @Orange: "pandas2ri.ri2py_dataframe does not seem to exist any longer in rpy2 version 3.0.3" or later.
I searched for another workaround and found pyreadr useful for me and maybe for those who are facing the same problems as I am: https://github.com/ofajardo/pyreadr
Usage: https://gist.github.com/LeiG/8094753a6cc7907c716f#gistcomment-2795790
pip install pyreadr
import pyreadr
result = pyreadr.read_r('/path/to/file.RData') # also works for Rds, rda
# done! let's see what we got
# result is a dictionary where the keys are the names of the R objects
# and the values are the corresponding Python objects (pandas data frames)
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1

You could try the feather library, a language-agnostic dataframe format that can be written from R and read from Python (and vice versa).
# Install feather
devtools::install_github("wesm/feather/R")
library(feather)
path <- "your_file_path"
write_feather(datafile, path)
Then install the Python package:
$ pip install feather-format
And load your data file:
import feather
path = 'your_file_path'
datafile = feather.read_dataframe(path)
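The standalone feather / feather-format packages shown above are the older route; with a recent pandas plus pyarrow install the same file can usually be read directly (a small sketch, assuming pyarrow is installed):

import pandas as pd

# pandas delegates Feather reading to pyarrow under the hood
datafile = pd.read_feather('your_file_path')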

Alternatively, load the .rda file in R via rpy2 and collect all of its objects with R's mget (or eapply) to build a Python dictionary of dataframes.
RPy2
import os
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()
base = importr('base')

base.load("datafile.rda")
rdf_List = base.mget(base.ls())

# ITERATE THROUGH LIST OF R DFs
pydf_dict = {}
for i, f in enumerate(base.names(rdf_List)):
    pydf_dict[f] = pandas2ri.ri2py_dataframe(rdf_List[i])

for k, v in pydf_dict.items():
    print(v.head())
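As noted in the pyreadr answer, pandas2ri.ri2py_dataframe no longer exists in rpy2 3.x. A rough equivalent of the loop above using the rpy2 3.x conversion API (localconverter plus rpy2py, as shown in the rpy2 docs) would be:

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

base = importr('base')
base.load("datafile.rda")
rdf_List = base.mget(base.ls())

pydf_dict = {}
with localconverter(robjects.default_converter + pandas2ri.converter):
    for i, f in enumerate(base.names(rdf_List)):
        # rpy2py replaces ri2py_dataframe in rpy2 >= 3.0
        pydf_dict[f] = robjects.conversion.rpy2py(rdf_List[i])

for k, v in pydf_dict.items():
    print(v.head())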

Related

Access dataframes in rpy2.robjects.methods.RS4

I am trying to import *.rds files in Python.
import rpy2
from rpy2.robjects import pandas2ri
import rpy2.robjects as robjects
pandas2ri.activate()
readRDS = robjects.r['readRDS']
temp_data = readRDS('filename.rds')
The resulting temp_data is of type rpy2.robjects.methods.RS4.
type(temp_data)
Out[4]: rpy2.robjects.methods.RS4
I know that the RDS file should contain dataframes. How can I access the dataframes in temp_data?
Update
I came across these documentation pages but they didn't help me: https://rpy.sourceforge.io/rpy2/doc-dev/html/_modules/rpy2/robjects/methods.html and https://rpy2.github.io/doc/v2.9.x/html/generated_rst/s4class.html
I know that the RDS file should contain dataframes. How can I access the dataframes in temp_data?
The RDS may contain data frames as nested elements, but what you observe indicates that the outermost object is an RS4 object. There is documentation about such objects here: https://rpy2.github.io/doc/v3.4.x/html/robjects_oop.html#s4-objects
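For a first look inside such an object, the rpy2 docs describe a dict-like slots accessor on RS4 instances; a tentative sketch (the slot name 'data' below is only a placeholder, use one of the names printed for your object):

# list the slot names of the S4 object
print(tuple(temp_data.slots.keys()))

# pull out one slot by name ('data' is a placeholder)
inner = temp_data.slots['data']

# if that slot is itself an R list or data.frame, its named elements
# can then be reached R-style with rx2, e.g. inner.rx2("some_column")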

How do I create a dataframe from a GeoJSON file without using geopandas?

I'm looking to turn a GeoJSON file into a pandas dataframe that I can work with using Python. However, for some reason, the geojson package will not install on my computer.
So I wanted to know how I could turn a GeoJSON file into a dataframe without using the geojson package.
This is what I have so far
import json
import pandas as pd

with open('Local_Authority_Districts_(December_2020)_UK_BGC.geojson') as f:
    data = json.load(f)
Here is a link to the GeoJSON file that I'm working with. I'm new to Python, so any help would be much appreciated. https://drive.google.com/file/d/1V4WljiJcASqq9ksh8CHM_2nBC0K2PR18/view?usp=sharing
You could use geopandas. It's as easy as this:
import geopandas as gpd
gdf = gpd.read_file('Local_Authority_Districts_(December_2020)_UK_BGC.geojson')
You can turn the resulting geodataframe into a regular dataframe with:
import pandas as pd
df = pd.DataFrame(gdf)
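Since the question explicitly asks for a route without extra geo packages, here is a rough sketch using only json and pandas: pd.json_normalize flattens the properties of each feature into columns, while the geometry is kept as a raw nested column (pandas >= 1.0 assumed for json_normalize):

import json
import pandas as pd

with open('Local_Authority_Districts_(December_2020)_UK_BGC.geojson') as f:
    data = json.load(f)

# a GeoJSON FeatureCollection stores one dict per feature under "features";
# json_normalize flattens the nested "properties" dicts into columns
df = pd.json_normalize(data["features"])
print(df.columns)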

Convert a pandas time series to xts format using rpy2

I am trying to call the function 'Nelson.Siegel' in the 'YieldCurve' package using rpy2. 'Nelson.Siegel' takes an xts object (rates) and a list of maturities as inputs. It seems that I have to convert a pandas data frame into xts format, and I am not sure how to achieve it. I am also not sure whether I am calling the Nelson.Siegel function in the correct way. Any help will be appreciated.
I tried using pandas2ri.activate() to convert the data from pandas to R, but it seems I still need to convert the result into xts format. I tried importing as.xts from the xts package, but it doesn't work together with rpy2.
import pandas as pd
import numpy as np
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
base = importr('base')
utils = importr('utils')
utils.install_packages('YieldCurve', repos="http://cran.us.r-project.org")
Yieldcurve= importr('YieldCurve')
NelsonSiegel = robjects.r('Nelson.Siegel')
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Maturity = [0.5, 1, 2]
df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)),
                  columns=["1", "2", "3"],
                  index=pd.date_range("20190101", periods=10))
NSParam = NelsonSiegel(df, Maturity)
Error message: Error in is.finite(if (is.character(from)) from <- as.numeric(from) else from) :
default method not implemented for type 'list'
Specifying that Maturity should be an R vector, rather than letting the converter assume that a list is wanted, might solve this:
Maturity = robjects.vectors.FloatVector([0.5, 1, 2])  # FloatVector rather than IntVector, since 0.5 is not an integer
Otherwise, first check whether your pandas data frame is safely converted to an R data frame:
df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)),
                  columns=["1", "2", "3"],
                  index=pd.date_range("20190101", periods=10))
base.print(df)
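For the xts part of the question, one tentative route building on the code above (xts is assumed to be installed; importr maps the keyword order_by to R's order.by argument, and the pandas DatetimeIndex is passed as an R Date vector):

xts = importr('xts')

# build an R Date vector from the pandas DatetimeIndex
r_dates = base.as_Date(robjects.StrVector(df.index.strftime("%Y-%m-%d")))

# coerce the converted data frame to a matrix and wrap it in an xts object
r_xts = xts.xts(base.as_matrix(df), order_by=r_dates)

NSParam = NelsonSiegel(r_xts, robjects.vectors.FloatVector([0.5, 1, 2]))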

Error: No module named 'xlrd'. How do I import Excel files with Python and pandas properly? (please close this)

I realized that there may be something wrong with my local dev environment.
I tried my code on Colab.
It worked well.
import pandas as pd
df = pd.read_excel('hurun-2018-top50.xlsx')
Thank you all.
Please close this question.
------- following is the original description ---------
I am trying to import an Excel file with Python and pandas.
I already pip installed the "xlrd" module.
I googled a lot and tried several different methods; none of them worked.
Here is my code.
import pandas as pd
from pandas import ExcelFile
from pandas import ExcelWriter
df = pd.read_excel('hurun-2018-top50.xlsx', index_col=0)
df = pd.read_excel('hurun-2018-top50.xlsx', sheetname='Sheet1')
df = pd.read_excel('hurun-2018-top50.xlsx')
Any response will be appreciated.
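For anyone who lands here with the same ImportError on a recent pandas install: newer pandas versions (1.2+) read .xlsx files through openpyxl rather than xlrd, so one workaround (a sketch, reusing the file name from above) is to install openpyxl and name the engine explicitly:

pip install openpyxl

import pandas as pd

# pandas 1.2+ no longer uses xlrd for .xlsx; openpyxl handles it
df = pd.read_excel('hurun-2018-top50.xlsx', engine='openpyxl')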

Is there a way to access R data frame column names in python/rpy2?

I have an R data frame, saved in Database02.Rda. Loading it
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
works fine. However:
print(robjects.r.names("df"))
yields
NULL
Also, as an example, column 214 (213 if we count starting with 0) is named REGION.
print(robjects.r.table(robjects.r["df"][213]))
works fine:
Region 1 Region 2 ...
9811 3451 ...
but we should also be able to do
print(robjects.r.table("df$REGION"))
This, however, results in
df$REGION
1
(which it does also for column names that do not exist at all); also:
print(robjects.r.table(robjects.r["df"]["REGION"]))
gives an error:
TypeError: SexpVector indices must be integers, not str
Now, the docs say that names cannot be used for subsetting in Python. Am I correct to assume that the column names are not imported with the rest of the data when loading the data frame with python/rpy2? Am I thus correct that the easiest way to access them is to save and load them as a separate list and construct a dict or so in Python mapping the names to the column index numbers? This does not seem very generic, however. Is there a way to extract the column names directly?
The versions of R, python, rpy2 I use are:
R: 3.2.2
python: 3.5.0
rpy2: 2.7.8
When doing the following, you are loading whatever objects are in Database02.Rda into R's "global environment".
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
robjects.globalenv is an Environment. You can list its content with:
tuple(robjects.globalenv.keys())
Now, my understanding is that one of your objects is called df. You can access it with:
df = robjects.globalenv['df']
If df is a list or a data frame, you can access its named elements with rx2 (the doc is your friend here again). To get the one called REGION, do:
df.rx2("REGION")
Listing all named elements in a list or data frame is easy:
tuple(df.names)
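Putting these pieces together, a minimal sketch (assuming the object saved in Database02.Rda is indeed named df, and working at the plain rpy2 level without any pandas conversion activated):

import rpy2.robjects as robjects

robjects.r.load("Database02.Rda")
print(tuple(robjects.globalenv.keys()))  # names of the objects that were loaded

df = robjects.globalenv['df']            # the R data frame
print(tuple(df.names))                   # its column names

region = df.rx2("REGION")                # R-style [[ extraction by name
print(robjects.r.table(region))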
If you run R code from Python, the global environment answer will not work. But kudos to @lgautier, the creator/maintainer of this package. In R the dollar sign $ is used frequently; this is what I learned:
print(pamk_clusters$pamobject$clusinfo)
will not work, and its equivalent
print(pamk_clusters[["pamobject"]][["clusinfo"]])
also will not work ... however, after some digging in the "man"
http://rpy2.readthedocs.io/en/version_2.7.x/vector.html#extracting-r-style
Access to R-style extracting/subsetting is granted through the two delegators rx and rx2, representing the R functions [ and [[ respectively.
This works as expected
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))
I commented in the forums about "man" clarity:
https://bitbucket.org/rpy2/rpy2/issues/436/acessing-dataframe-elements-using-rpy2
I am using rpy2 on Win7 with ipython. To help others dig through the formatting, here is a setup that seems to work:
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr

base = importr('base')
utils = importr('utils')
utils.chooseCRANmirror(ind=1)
cluster = importr('cluster')
stats = importr('stats')
#utils.install_packages("fpc")
fpc = importr('fpc')

import pickle
with open('points', 'rb') as fp:
    points = pickle.load(fp)
# data above is stored as a binary object
# online: http://www.mshaffer.com/arizona/dissertation/points

import rpy2.robjects.numpy2ri as npr
npr.activate()

k = robjects.IntVector(range(3, 8))  # r-syntax 3:7 # I expect 5
pamk_clusters = fpc.pamk(points, k)

print(base.summary(pamk_clusters))
base.print(base.summary(pamk_clusters))
utils.str(pamk_clusters)

# R-style $ access does not work from Python ...
# print(pamk_clusters$pamobject$clusinfo)
# base.print(pamk_clusters$pamobject$clusinfo)
# ... and neither does [[ ]]-style indexing:
# print(pamk_clusters[["pamobject"]][["clusinfo"]])

# rx2 is the rpy2 equivalent of [[ and works:
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))

pam_clusters = cluster.pam(points, 5)       # much slower
kmeans_clusters = stats.kmeans(points, 5)   # much faster
utils.str(kmeans_clusters)
print(kmeans_clusters.rx2("cluster"))
R has been a standard for statistical computing for nearly 25 years, and it is based on the forty-year-old S language, from an era when computing efficiency mattered a lot.
https://en.wikipedia.org/wiki/R_(programming_language)
Again, @lgautier, thank you for making R more readily accessible within Python.
