Convert pandas timeseries to xts format using rpy2 - python

I am trying to call the function 'Nelson.Siegel' in the 'YieldCurve' package using rpy2. 'Nelson.Siegel' takes an xts object (rates) and a list of maturities (Maturity) as inputs, so it seems that I have to convert my pandas data frame into xts format, and I am not sure how to achieve that. I am also not sure whether I am calling the Nelson.Siegel function in the correct way. Any help will be appreciated.
I tried using pandas2ri.activate() to convert from pandas to R, but it seems that I still need to turn the result into xts format. I also tried to import as.xts from the xts package, but it doesn't work together with rpy2.
import pandas as pd
import numpy as np
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
base = importr('base')
utils = importr('utils')
utils.install_packages('YieldCurve', repos="http://cran.us.r-project.org")
Yieldcurve = importr('YieldCurve')
NelsonSiegel = robjects.r('Nelson.Siegel')
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Maturity = [0.5, 1, 2]
df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)),
                  columns=["1", "2", "3"],
                  index=pd.date_range("20190101", periods=10))
NSParam = NelsonSiegel(df, Maturity)
Error message: Error in is.finite(if (is.character(from)) from <- as.numeric(from) else from) :
default method not implemented for type 'list'

Specifying that Maturity should be a vector rather than letting the converter assume that a list is wanted might solve this (a FloatVector, since the maturities are not whole numbers):
Maturity = robjects.vectors.FloatVector([0.5, 1, 2])
Otherwise, first check whether your pandas data frame is safely converted to an R data frame:
df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)),
                  columns=["1", "2", "3"],
                  index=pd.date_range("20190101", periods=10))
base.print(df)
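If the data frame converts and prints cleanly, the remaining step is building the xts object on the R side. Below is a minimal sketch, not from the original answer, that passes the values and the dates to xts::xts() explicitly and then calls Nelson.Siegel; the column-major flattening and the as.Date conversion assume a daily index of numeric rates.
import numpy as np
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# Sketch only: build the xts object explicitly instead of relying on pandas2ri.
xts = importr('xts')
importr('YieldCurve')
NelsonSiegel = robjects.r('Nelson.Siegel')

df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)),
                  columns=["1", "2", "3"],
                  index=pd.date_range("20190101", periods=10))

# Dates are passed as ISO strings and turned into an R Date vector.
r_dates = robjects.r['as.Date'](robjects.StrVector(list(df.index.strftime('%Y-%m-%d'))))
# R matrices fill column by column, hence the Fortran-order flatten.
r_rates = robjects.r['matrix'](robjects.FloatVector(df.values.astype(float).flatten(order='F')),
                               nrow=df.shape[0])
rates_xts = xts.xts(r_rates, order_by=r_dates)   # order_by maps to R's order.by

Maturity = robjects.FloatVector([0.5, 1, 2])
NSParam = NelsonSiegel(rates_xts, Maturity)
print(NSParam)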

Related

converting pandas dataframe to rda and getting encoding error

I am trying to convert a pandas dataframe into an rda file. I found this code to do this:
import rpy2
from rpy2 import robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r_data = pandas2ri.py2rpy(df)
robjects.r.assign("df", r_data)
robjects.r("save(df, file='test.rda')")
But I keep getting the following error:
AttributeError: 'list' object has no attribute 'encode'
I'm not sure what this means, or whether it is because I need to convert the df to UTF-8, and I'm not sure how to go about doing that.
Note: I am using python 3.10
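No answer is included here, but if the end goal is simply an .rda file containing the frame, the pyreadr package mentioned further down this page can also write one without going through rpy2 at all. A hedged sketch, assuming plain numeric/string columns:
import pandas as pd
import pyreadr

# Sketch only: write the pandas frame straight to .rda; df_name is the object
# name the file will expose when loaded in R.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
pyreadr.write_rdata("test.rda", df, df_name="df")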

Rpy2 (working with Dataframes) - How to solve the Python(NaN) and R(NA) conflict?

I have a pandas DataFrame, result, with 2 columns:
doy (day_of_year: independent variable, values 1, 2, 3, ..., 365)
bookings (dependent variable, 279 numeric values and 86 NaN values)
Please find a portion of the DataFrame below:
My goal is to impute the missing values using R (randomForest::rfImpute) for further spectral analysis.
So, I am using rpy2 to run R code inside my Python script. I have imported the necessary packages/libraries and have also activated pandas2ri.
import random
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
base = importr('base')
utils = importr('utils')
randomForest = importr('randomForest')
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r_df = pandas2ri.py2ri(result)
type(r_df)
print(r_df.head())
print(base.summary(r_df))
random.seed(222)
result.imputed = randomForest.rfImpute('bookings', 'doy', data=r_df)
But whenever I run the code, I get the error: No NA values found in bookings.
It's clear that the R code fails to interpret the missing values.
I have also tried to replace NaN with NA in the R dataFrame r_df,
robjects.r('r_df[is.nan(as.numeric(r_df))] = NA')
but when I run the code, I get the error: object r_df is not found.
Is there a way around this issue? As of now, I am a bit stuck and can't seem to find helpful documentation.
Please find below some other outputs of separate lines of code.
(...)
robjects.r('r_df[is.nan(as.numeric(r_df))] = NA')
but when I run the code, I get the error: object r_df is not found.
r_df is the name of a Python variable (a name defined in your Python code), so embedded R does not know about it.
Is there a way around this issue? As of now, I am a bit stuck and can't seem to find helpful documentation.
What about this:
https://rpy2.github.io/doc/v2.9.x/html/robjects_rinstance.html#r-the-instance-of-r
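In other words, the object first has to be bound to a name inside the embedded R process before R code can refer to it. A minimal sketch of that idea, with a stand-in frame (py2rpy is the rpy2 3.x spelling; older versions use py2ri):
import numpy as np
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

pandas2ri.activate()

# Stand-in for the question's frame: day of year plus bookings with some NaN.
result = pd.DataFrame({"doy": range(1, 11),
                       "bookings": [5.0, np.nan, 7.0, 8.0, np.nan,
                                    3.0, 4.0, np.nan, 9.0, 2.0]})

# Bind the converted frame to the name 'r_df' inside R's global environment,
# so that R code passed to robjects.r() can see it by that name.
robjects.globalenv['r_df'] = pandas2ri.py2rpy(result)

# Now R code referring to r_df works, e.g. turning NaN into NA in the bookings column:
robjects.r('r_df$bookings[is.nan(r_df$bookings)] <- NA')
print(robjects.r('summary(r_df)'))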

Using Dask with Python causes issues when running Pandas code

I am trying to work with Dask because my dataframe has become large and pandas by itself can't process it. I read my dataset in as follows and get the following result, which looks odd; I'm not sure why it's not outputting the dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import dask.bag as db
import json
%matplotlib inline
Leads = db.read_text('Leads 6.4.18.txt')
Leads
This returns (instead of my pandas dataframe):
dask.bag<bag-fro..., npartitions=1>
Then when I try to rename a few columns:
Leads_updated = Leads.rename(columns={'Business Type': 'Business_Type',
                                      'Lender Type': 'Lender_Type'})
Leads_updated
I get:
AttributeError: 'Bag' object has no attribute 'rename'
Can someone please explain what I am not doing correctly? The objective is to use Dask for all these steps since the data is too big for regular Python/pandas. My understanding is that the syntax used under Dask should be the same as pandas.
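No answer is included above, but the error itself points at the cause: db.read_text returns a Bag of raw text lines, and a Bag has no pandas-like API. A hedged sketch of the usual alternative, assuming the file is a delimited table (the tab separator is a guess):
# Sketch only: dask.dataframe mirrors the pandas API (rename, head, etc.), unlike dask.bag.
import dask.dataframe as dd

leads = dd.read_csv('Leads 6.4.18.txt', sep='\t')   # separator is an assumption
leads = leads.rename(columns={'Business Type': 'Business_Type',
                              'Lender Type': 'Lender_Type'})
print(leads.head())   # head() computes a small sample eagerly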

converting .rda to pandas dataframe

I have some .rda files that I need to access with Python.
My code looks like this:
import rpy2.robjects as robjects
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
df = robjects.r.load("datafile.rda")
df2 = pandas2ri.ri2py_dataframe(df)
where df2 is a pandas dataframe. However, it only contains the header of the .rda file! I have searched back and forth. None of the solutions proposed seem to be working.
Does anyone have an idea how to efficiently convert an .rda dataframe to a pandas dataframe?
Thank you for your useful question. I tried the two ways proposed above to handle my problem.
For feather, I faced this issue:
pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file
For rpy2, as mentioned by @Orange: "pandas2ri.ri2py_dataframe does not seem to exist any longer in rpy2 version 3.0.3" or later.
I searched for another workaround and found pyreadr useful for me and maybe for those who are facing the same problems as I am: https://github.com/ofajardo/pyreadr
Usage: https://gist.github.com/LeiG/8094753a6cc7907c716f#gistcomment-2795790
pip install pyreadr
import pyreadr
result = pyreadr.read_r('/path/to/file.RData') # also works for Rds, rda
# done! let's see what we got
# result is a dictionary where keys are the names of the objects and the
# values are python objects (pandas data frames)
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
You could try using the feather library, developed as a language-agnostic dataframe format that can be used from either R or Python.
# Install feather
devtools::install_github("wesm/feather/R")
library(feather)
path <- "your_file_path"
write_feather(datafile, path)
Then install in python
$ pip install feather-format
And load in your datafile
import feather
path = 'your_file_path'
datafile = feather.read_dataframe(path)
As mentioned, consider loading the .rda file and collecting its objects with R's mget (or eapply) to build a Python dictionary of data frames.
RPy2
import os
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
pandas2ri.activate()
base = importr('base')
base.load("datafile.rda")
rdf_List = base.mget(base.ls())
# ITERATE THROUGH LIST OF R DFs
pydf_dict = {}
for i, f in enumerate(base.names(rdf_List)):
    pydf_dict[f] = pandas2ri.ri2py_dataframe(rdf_List[i])

for k, v in pydf_dict.items():
    print(v.head())
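The loop above uses rpy2 2.x's ri2py_dataframe; with rpy2 3.x the same idea can be written with the explicit converter instead. A sketch under that assumption, following the localconverter pattern from the rpy2 pandas documentation:
# Sketch for rpy2 3.x: convert each R data frame in the loaded list with the pandas converter.
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

base = importr('base')
base.load("datafile.rda")
rdf_List = base.mget(base.ls())
names = list(base.names(rdf_List))

pydf_dict = {}
with localconverter(ro.default_converter + pandas2ri.converter):
    for i, name in enumerate(names):
        pydf_dict[name] = ro.conversion.rpy2py(rdf_List[i])

for k, v in pydf_dict.items():
    print(k, v.head(), sep="\n")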

Is there a way to access R data frame column names in python/rpy2?

I have an R data frame, saved in Database02.Rda. Loading it
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
works fine. However:
print(robjects.r.names("df"))
yields
NULL
Also, as an example, column 214 (213 if we count starting with 0) is named REGION.
print(robjects.r.table(robjects.r["df"][213]))
works fine:
Region 1 Region 2 ...
9811 3451 ...
but we should also be able to do
print(robjects.r.table("df$REGION"))
This, however, results in
df$REGION
1
(which it does also for column names that do not exist at all); also:
print(robjects.r.table(robjects.r["df"]["REGION"]))
gives an error:
TypeError: SexpVector indices must be integers, not str
Now, the docs say names cannot be used for subsetting in Python. Am I correct to assume that the column names are not imported with the rest of the data when loading the data frame with python/rpy2? Am I thus correct that the easiest way to access them is to save and load them as a separate list and construct a dict or so in Python mapping the names to the column index numbers? This does not seem very generic, however. Is there a way to extract the column names directly?
The versions of R, python, rpy2 I use are:
R: 3.2.2
python: 3.5.0
rpy2: 2.7.8
When doing the following, you are loading whatever objects are in Database02.Rda into R's "global environment".
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
robjects.globalenv is an Environment. You can list its contents with:
tuple(robjects.globalenv.keys())
Now, my understanding is that one of your objects is called df. You can access it with:
df = robjects.globalenv['df']
If df is a list or a data frame, you can access its named elements with rx2 (the doc is your friend here again). To get the one called REGION, do:
df.rx2("REGION")
Listing all named elements in a list or data frame is easy:
tuple(df.names)
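Putting those pieces together for the question's example (a sketch; 'df' and "REGION" are the names used in the question):
import rpy2.robjects as robjects

robjects.r.load("Database02.Rda")

df = robjects.globalenv['df']          # the data frame loaded from the .Rda file
print(tuple(df.names))                 # all column names
region = df.rx2("REGION")              # R-style df[["REGION"]]
print(robjects.r.table(region))        # same table as indexing by position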
If you run R code in Python, the global environment answer will not work. But kudos to @lgautier, the creator/maintainer of this package. In R the dollar sign $ is used frequently. Here is what I learned:
print(pamk_clusters$pamobject$clusinfo)
will not work, and its equivalent
print(pamk_clusters[["pamobject"]][["clusinfo"]])
also will not work ... however, after some digging in the "man"
http://rpy2.readthedocs.io/en/version_2.7.x/vector.html#extracting-r-style
Access to R-style extracting/subsetting is granted through the two delegators rx and rx2, representing the R functions [ and [[ respectively.
This works as expected
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))
I commented on the project's issue tracker about the clarity of the docs:
https://bitbucket.org/rpy2/rpy2/issues/436/acessing-dataframe-elements-using-rpy2
I am using rpy2 on Win7 with ipython. To help others dig through the formatting, here is a setup that seems to work:
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr
base = importr('base')
utils = importr('utils')
utils.chooseCRANmirror(ind=1)
cluster = importr('cluster')
stats = importr('stats')
#utils.install_packages("fpc")
fpc = importr('fpc')
import pickle
with open('points', 'rb') as fp:
    points = pickle.load(fp)
# data above is stored as binary object
# online: http://www.mshaffer.com/arizona/dissertation/points
import rpy2.robjects.numpy2ri as npr
npr.activate()
k = robjects.IntVector(range(3, 8)) # r-syntax 3:7 # I expect 5
pamk_clusters = fpc.pamk(points,k)
print( base.summary(pamk_clusters) )
base.print( base.summary(pamk_clusters) )
utils.str(pamk_clusters)
# As noted above, the R-style $ and [[ ]] forms do not work from Python:
# print(pamk_clusters$pamobject$clusinfo)               # R syntax, a SyntaxError in Python
# base.print(pamk_clusters$pamobject$clusinfo)          # same problem
# print(pamk_clusters[["pamobject"]][["clusinfo"]])     # string indexing raises TypeError
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))   # this is the one that works
pam_clusters = cluster.pam(points,5) # much slower
kmeans_clusters = stats.kmeans(points,5) # much faster
utils.str(kmeans_clusters)
print(kmeans_clusters.rx2("cluster"))
R has been a standard for statistical computing for nearly 25 years, and is based on the forty-year-old S language, from back when computing efficiency mattered a lot.
https://en.wikipedia.org/wiki/R_(programming_language)
Again, @lgautier, thank you for making R more readily accessible within Python.
