Unable to move R analyses output back to Python (rpy2)

I am trying to pass some data from Python to R and then return the results to Python, but I can't get it to work.
I can pass my data to R, run my custom function on it, and even get the output. Where I am stuck is getting the statistical output back into Python as a dataframe. I have tried using rpy2 and also exporting to a .csv file to re-import, but neither method works. When I try to push the output back to pandas I get an error saying it can't be coerced. When I try saving to a .csv, I can't get it to work with my "results" object. From what I have read, checking what is in the R global environment might help me figure it out, but I haven't been able to work out how to do that either.
Any helpful comments are appreciated.
#import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
base = importr('base')
utils = importr('utils')
name = 'test_subject'
#Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10] # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3] # number of responses per bin
#Convert data to R objects
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame ''')
df = makeDataFrame(x = set1, y = set2)
# Create curve fitting function
curve_fit = robjects.r('''
curve_fit <- function(df, plot = FALSE){ control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
printEval = FALSE, warnOnly = TRUE)
fit <- nls(y ~ d+a*exp(-.5*((x-t0)/b)^2)+c*(x-t0),
data = df,
start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
algorithm = "port",
control = control)
if (plot){
fitFnc <- function(x) predict(fit, list(x=x))
par(mfrow = c(1, 1))
plot(df$x, df$y, xlim = c(0,45))
curve(fitFnc, from=.5, to=45, add = TRUE)
}
return(list("params" = summary(fit),
"r2" = cor(predict(fit), df$y)^2))
}''')
#run function on data
results = curve_fit(df, plot = True)
#Show Results
print('results', results)
print(type(results))

The problem was in
return(list("params" = summary(fit), "r2" = cor(predict(fit), df$y)^2))
The first item in the list, "params", was a summary table from R. While it printed in Python as the data I wanted, it was a single object that could not be subdivided, as it was essentially an image of an R output table. What I needed to return was a data frame, as shown in the code below.
return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
This returned an R data frame, which is a list of column vectors, that I could then convert to a numpy array and manipulate in Python.
Here is the full code.
#import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
import numpy as np
rpy2.robjects.numpy2ri.activate()
base = importr('base')
utils = importr('utils')
#Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10] # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3] # number of responses per bin
#Convert data to R objects and place in data frame
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame ''')
df = makeDataFrame(x = set1, y = set2)
# Create curve fitting function in r
curve_fit = robjects.r('''
#Fit function
curve_fit <- function(df, plot = FALSE){ control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
printEval = FALSE, warnOnly = TRUE)
#Specify formula to fit
fit <- nls(y ~ d+a*exp(-.5*((x-t0)/b)^2)+c*(x-t0),
data = df,
start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
algorithm = "port",
control = control)
# Create plot of curve
if (plot){
fitFnc <- function(x) predict(fit, list(x=x))
par(mfrow = c(1, 1))
plot(df$x, df$y, xlim = c(0,45))
curve(fitFnc, from=.5, to=45, add = TRUE)}
#returns data in R dataframe
return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
}''')
#run function on data
results = curve_fit(df, plot = True)
results = np.array(results) #convert to numpy array
#Show Results
print('results', results)
print(type(results))
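
One detail worth noting about the conversion above: an R data frame is stored as a list of columns, so the converted array is likely to come out with one entry per column rather than one per fitted parameter. The sketch below is pure Python with made-up numbers and hypothetical names (columns, param_names, columns_to_rows). It is not rpy2 output, only an illustration of regrouping column-wise data into one record per parameter.

```python
# Hypothetical column-wise layout, mimicking how an R data frame stores
# a coefficient table. All numbers are made up for illustration.
columns = {
    "Estimate": [4.0, 2.5, 5.0],
    "Std. Error": [0.5, 0.25, 0.1],
    "r2": [0.9, 0.9, 0.9],
}
param_names = ["a", "b", "t0"]  # assumed row names of the table


def columns_to_rows(columns, row_names):
    """Regroup column vectors into one dict per row, keyed by row name."""
    names = list(columns)
    rows = {}
    for i, row_name in enumerate(row_names):
        rows[row_name] = {name: columns[name][i] for name in names}
    return rows


table = columns_to_rows(columns, param_names)
print(table["b"]["Estimate"])
```

With a numpy array the same regrouping is a transpose, which is why one of the related answers on this page prints np.array(final).T.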

Related

Error calculating r squared with statsmodels for multiple yfinance data in a DataFrame

I recently began learning Python, but jumped straight in with a complex project I had already started in Excel. I have followed different guides for the code so far, tweaking it to my needs.
I am using 'yfinance' to gather data for multiple cryptocurrencies over a specific time period from Yahoo! Finance. I am also using 'statsmodels' to obtain alpha, beta and r squared from a DataFrame that holds all the cryptocurrencies plus an additional column with the market return (the x variable).
I am getting the following error: ValueError: endog and exog matrices are different sizes. I saw another question/answer regarding this error, but it did not seem to relate to my issue.
The error takes place in line 87 [model = sm.OLS(Y2,X_)] of the following code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()
df1 = pdr.get_data_yahoo("BTC-USD", start="2015-01-01", end="2020-01-01")
df2 = pdr.get_data_yahoo("ETH-USD", start="2015-01-01", end="2020-01-01")
df3 = pdr.get_data_yahoo("XRP-USD", start="2015-01-01", end="2020-01-01")
df4 = pdr.get_data_yahoo("BCH-USD", start="2015-01-01", end="2020-01-01")
df5 = pdr.get_data_yahoo("USDT-USD", start="2015-01-01", end="2020-01-01")
df6 = pdr.get_data_yahoo("BSV-USD", start="2015-01-01", end="2020-01-01")
df7 = pdr.get_data_yahoo("LTC-USD", start="2015-01-01", end="2020-01-01")
df8 = pdr.get_data_yahoo("BNB-USD", start="2015-01-01", end="2020-01-01")
df9 = pdr.get_data_yahoo("EOS-USD", start="2015-01-01", end="2020-01-01")
df10 = pdr.get_data_yahoo("LINK-USD", start="2015-01-01", end="2020-01-01")
df11 = pdr.get_data_yahoo("XMR-USD", start="2015-01-01", end="2020-01-01")
df12 = pdr.get_data_yahoo("BTG-USD", start="2015-01-01", end="2020-01-01")
return_btc = df1.Close.pct_change()[1:]
return_eth = df2.Close.pct_change()[1:]
return_xrp = df3.Close.pct_change()[1:]
return_bch = df4.Close.pct_change()[1:]
return_usdt = df5.Close.pct_change()[1:]
return_bsv = df6.Close.pct_change()[1:]
return_ltc = df7.Close.pct_change()[1:]
return_bnb = df8.Close.pct_change()[1:]
return_eos = df9.Close.pct_change()[1:]
return_link = df10.Close.pct_change()[1:]
return_xmr = df11.Close.pct_change()[1:]
return_btg = df12.Close.pct_change()[1:]
d = {"BTC Return":return_btc, "ETH Return":return_eth, "XRP Return":return_xrp, "BCH Return":return_bch,
"USDT Return":return_usdt, "BSV Return":return_bsv, "LTC Return":return_ltc, "BNB Return":return_bnb,
"EOS Return":return_eos, "LINK Return":return_link, "XMR Return":return_xmr, "BTG Return":return_btg}
df = pd.DataFrame(d) # new data frame with all returns data
df = pd.DataFrame(d, columns=["Date", "BTC Return", "ETH Return", "XRP Return", "BCH Return", "USDT Return", "BSV Return",
"LTC Return", "BNB Return", "EOS Return", "LINK Return", "XMR Return", "BTG Return"])
avg_row = df.mean(axis=1)
return_mkt = avg_row
d1 = {"BTC Return":return_btc, "ETH Return":return_eth, "XRP Return":return_xrp, "BCH Return":return_bch,
"USDT Return":return_usdt, "BSV Return":return_bsv, "LTC Return":return_ltc, "BNB Return":return_bnb,
"EOS Return":return_eos, "LINK Return":return_link, "XMR Return":return_xmr, "BTG Return":return_btg, "MKT Return":return_mkt}
df = pd.DataFrame(d1)
print(df)
import statsmodels.api as sm
from statsmodels import regression
X = return_mkt.values
Y1 = return_btc
Y2 = return_eth
#Y3 = return_xrp
def linreg(x,y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y,x).fit()
    # we are removing the constant
    x = x[:, 1]
    return model.params[0], model.params[1]
X_ = sm.add_constant(X) # artificially add intercept to x, as advised in the docs
model = sm.OLS(Y1,X_)
results = model.fit()
rsquared = results.rsquared
alpha, beta = linreg(X,Y1)
def linreg(x,y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y,x).fit()
    # we are removing the constant
    x = x[:, 1]
    return model.params[0], model.params[1]
X_ = sm.add_constant(X) # artificially add intercept to x, as advised in the docs
model = sm.OLS(Y2,X_)
results = model.fit()
rsquared = results.rsquared
alpha, beta = linreg(X,Y2)
The error is located in the second def, as I am trying to compute the previously mentioned statistics for each cryptocurrency. Thus, the 1st def is for BTC (Y1), the 2nd def is for ETH (Y2), and so on (Y3,...).
The entire code was fine when I had only the function for BTC at the end, the error occurred when I tried to add more of the same function for the others.

Fundamentally, the problem is that because Ethereum (and all other cryptos) started later than bitcoin, there are null values for the price every day for the first few years, which can't be handled. So you have to take just the values where they are not null.
However, there are many things in your code which you could factor out so that you don't repeat yourself unnecessarily. You made an attempt at that with the linreg function, but then you re-defined it for the second crypto, which shouldn't be necessary.
Here is a quick re-write which addresses both the fundamental problem and hopefully illustrates what I mean above. The output is a dataframe with the statistics you're looking for, by cryptocurrency. The goal is to write as much of the code 'generically', and then just provide a list of cryptos that you are interested in.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas_datareader import data as pdr
import datetime
import yfinance as yf
import statsmodels.api as sm
from statsmodels import regression
yf.pdr_override()
cryptos = ["BTC", "ETH", "XRP"] # Here you can specify the cryptos you want. I just used 3 for demonstration
# The rest of the code is not specific to any one crypto
def get_and_process_data(c):
    raw_data = pdr.get_data_yahoo(c + '-USD', start="2015-01-01", end="2020-01-01")
    return raw_data.Close.pct_change()[1:]
df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})
df['avg_return'] = df.mean(axis=1) # avg market return
print(df)
def model(x, y):
    # Fit y = alpha + beta*x by OLS; a single fit gives all three statistics
    X = sm.add_constant(x) # artificially add intercept to x, as advised in the docs
    fit = sm.OLS(y, X).fit()
    rsquared = fit.rsquared
    # params is a pandas Series here, so index it by position with .iloc
    alpha = fit.params.iloc[0]
    beta = fit.params.iloc[1]
    return rsquared, alpha, beta
results = pd.DataFrame({c: model(df[df[c].notnull()]['avg_return'], df[df[c].notnull()][c]) for c in cryptos}).transpose()
results.columns = ['rsquared', 'alpha', 'beta']
print(results)
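
The core idea of this answer, dropping the dates on which an asset has no return before regressing it on the market, can be sketched without pandas or statsmodels. The version below is a minimal illustration using only the standard library. The helper names (drop_missing_pairs, simple_ols) and the sample returns are hypothetical, and the closed-form formulas are the textbook ones for a single-regressor OLS rather than anything taken from statsmodels.

```python
import math


def drop_missing_pairs(x, y):
    """Keep only positions where both series have a usable value."""
    return [
        (a, b)
        for a, b in zip(x, y)
        if a is not None and b is not None
        and not math.isnan(a) and not math.isnan(b)
    ]


def simple_ols(pairs):
    """Closed-form OLS of y on x with an intercept: returns alpha, beta, r-squared."""
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    rsquared = (sxy ** 2) / (sxx * syy)
    return alpha, beta, rsquared


# Made-up returns: the asset has no data for the first two dates.
market = [0.01, 0.02, -0.01, 0.03, 0.00]
asset = [float("nan"), float("nan"), -0.02, 0.06, 0.00]

alpha, beta, rsquared = simple_ols(drop_missing_pairs(market, asset))
print(alpha, beta, rsquared)
```

Filtering both series with the same mask is what keeps x and y the same length, which is the condition the ValueError in the question complains about.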

How to create a dashboard with widgets (selector) and interactivity (tap stream) between plots in HoloViews/Bokeh?

I'm trying to create a dashboard that consists of two plots (a heatmap and a line graph) and one widget (a selector):
When you select an option from the widget, both plots are updated;
When you tap on the first plot, the second plot is updated based on the tap info.
Currently I'm trying to do it in HoloViews. It seems that this should be very easy to do, but I somehow can't wrap my head around it.
The code below shows what it should look like. However, the selector is not connected to the dashboard in any way, since I don't know how to do that.
import pandas as pd
import numpy as np
import panel as pn
import holoviews as hv
hv.extension('bokeh')
def create_test_df(k_features, n_tickers=5, m_windows=5):
    start_date = pd.Timestamp('01-01-2020')
    window_len = pd.Timedelta(days=1)
    cols = ['window_dt', 'ticker'] + [f'feature_{i}' for i in range(k_features)]
    data = {c: [] for c in cols}
    for w in range(m_windows):
        window_dt = start_date + w*window_len
        for t in range(n_tickers):
            ticker = f'ticker_{t}'
            data['window_dt'].append(window_dt)
            data['ticker'].append(ticker)
            for f in range(k_features):
                data[f'feature_{f}'].append(np.random.rand())
    return pd.DataFrame(data)
k_features = 3
features = [f'feature_{i}' for i in range(k_features)]
df = create_test_df(k_features)
selector = pn.widgets.Select(options=features)
heatmap = hv.HeatMap(df[['window_dt', 'ticker', f'{selector.value}']])
posxy = hv.streams.Tap(source=heatmap, x='01-01-2020', y='ticker_4')
def tap_heatmap(x, y):
    scalar = np.random.randn()
    x = np.linspace(-2*np.pi, 2*np.pi, 100)
    data = list(zip(x, np.sin(x*scalar)))
    return hv.Curve(data)
pn.Row(heatmap, hv.DynamicMap(tap_heatmap, streams=[posxy]), selector)

OK, I finally got it. It turned out to be simple (just as expected) but not quite intuitive. Basically, a different approach to implementing the selector (dropdown menu) should be used: the heatmap becomes a DynamicMap with a key dimension, and HoloViews generates the dropdown itself. Working code for this example is below:
import pandas as pd
import numpy as np
import panel as pn
import holoviews as hv
hv.extension('bokeh')
def create_test_df(k_features, n_tickers=5, m_windows=5):
    start_date = pd.Timestamp('01-01-2020')
    window_len = pd.Timedelta(days=1)
    cols = ['window_dt', 'ticker'] + [f'feature_{i}' for i in range(k_features)]
    data = {c: [] for c in cols}
    for w in range(m_windows):
        window_dt = start_date + w*window_len
        for t in range(n_tickers):
            ticker = f'ticker_{t}'
            data['window_dt'].append(window_dt)
            data['ticker'].append(ticker)
            for f in range(k_features):
                data[f'feature_{f}'].append(np.random.rand())
    return pd.DataFrame(data)
def load_heatmap(feature):
    return hv.HeatMap(df[['window_dt', 'ticker', f'{feature}']])
def tap_heatmap(x, y):
    scalar = np.random.randn()
    x = np.linspace(-2*np.pi, 2*np.pi, 100)
    data = list(zip(x, np.sin(x*scalar)))
    return hv.Curve(data)
k_features = 3
features = [f'feature_{i}' for i in range(k_features)]
df = create_test_df(k_features)
heatmap_dmap = hv.DynamicMap(load_heatmap, kdims='Feature').redim.values(Feature=features)
posxy = hv.streams.Tap(source=heatmap_dmap, x='01-01-2020', y='ticker_0')
sidegraph_dmap = hv.DynamicMap(tap_heatmap, streams=[posxy])
pn.Row(heatmap_dmap, sidegraph_dmap)

Creating a vector of values based off a test using a for loop

This feels like it should be a simple problem, but I am new to Python. In R I would use a foreach loop, which gives me an option to combine the results.
I have tried a for loop that lets me print out all the values I need, but I want them collected into a vector of values that I can use later.
from scipy.stats import gamma
import scipy.stats as stats
import numpy as np
import random
data2 = np.random.gamma(1,2, size = 500)
gammT = np.log(data2 + 1)
mean = np.mean(gammT)
sd = np.std(gammT)
a = (mean/ sd)**2
b = (sd**2)/ mean
for i in range(1,100):
    gammT = random.sample(list(gammT), 500)
    gamm = np.random.gamma(a,b, size = len(gammT))
    s = stats.anderson_ksamp([gammT,gamm])
    s = s[2]
    print(s)
So I am able to print all the values I want, but I want them all gathered together in a vector of values. I have tried appending and making lists, but I am not able to get them together.

from scipy.stats import gamma
import scipy.stats as stats
import numpy as np
import random
data2 = np.random.gamma(1,2, size = 500) # sample data from the question
gammT = np.log(data2 + 1)
mean = np.mean(gammT)
sd = np.std(gammT)
a = (mean/ sd)**2
b = (sd**2)/ mean
#initialize empty list
result=[]
for i in range(100):
    # removed (1,100) you only need range(100) for 100 elements
    gammT = random.sample(list(gammT), 500)
    gamm = np.random.gamma(a,b, size = len(gammT))
    s = stats.anderson_ksamp([gammT,gamm])
    s = s[2]
    #append calculation to list
    result.append(s)
    print(s)
print(result)
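
The same collect-into-a-list pattern can also be written as a list comprehension, which is probably the closest Python analogue to R's foreach with a combine step. In the minimal sketch below, one_statistic is a hypothetical stand-in for the Anderson-Darling call and uses only the standard library, so the numbers do not correspond to the ones in the question.

```python
import random


def one_statistic(sample_a, sample_b):
    """Hypothetical stand-in for a two-sample test: difference in means."""
    return sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b)


random.seed(0)  # fixed seed so repeated runs give the same values

# Observed sample, drawn here from a gamma distribution for illustration.
observed = [random.gammavariate(1.0, 2.0) for _ in range(500)]

# One statistic per simulated comparison, gathered directly into a list.
results = [
    one_statistic(observed, [random.gammavariate(1.0, 2.0) for _ in range(500)])
    for _ in range(100)
]

print(len(results))
```

If an array is needed afterwards, the finished list can be passed to np.array in one step, which is usually simpler than growing an array inside the loop.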

Getting Rpy2 to work with rgdal to spatially subset points

So I have some R code that already works. This takes a bunch of points from data and spatially subsets against a shapefile in the method of http://robinlovelace.net/r/2014/07/29/clipping-with-r.html.
#data is a .csv file with lon lat points
data_points <- SpatialPoints(data)
proj4string(data_points) <- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
data_ll <- spTransform(data_points, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
melbourne <- readOGR("melbourne_australia.land.coastline","melbourne_australia_land_coast") #this is a shapefile from https://mapzen.com/data/metro-extracts
subset <- data_ll[melbourne,]
plot(melbourne)
points(subset)
I'm trying to convert this to the corresponding rpy2 script. So far I have;
import pandas as pd
import numpy as np
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rgdal = importr('rgdal')
base = importr('base')
rpy2.robjects.numpy2ri.activate()
data = pd.read_csv('sim.csv')
data = data.values
coordinates = ro.r['coordinates']
proj4string = ro.r['proj4string']
spTransform = ro.r['spTransform']
readOGR = ro.r['readOGR']
SpatialPoints = ro.r['SpatialPoints']
CRS = ro.r['CRS']
class_r = ro.r['class']
key = CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
data_points = SpatialPoints(data, proj4string = key)
data_ll = spTransform(data_points, key)
melbourne = readOGR("melbourne_australia.land.coastline", "melbourne_australia_land_coast")
subset = data_ll[melbourne,]
which fails on the last line with the error TypeError: 'RS4' object is not subscriptable. Does anyone have some idea of what is going on?

One way to do this is to convert the R code to a function and then import it as a package.
This is the R code:
library(rgdal)
library(sp)
#import data
data <- read.csv("sim.csv", header = F)
subset_points <- function(data){
data_points <- SpatialPoints(data, proj4string=CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
data_ll <- spTransform(data_points, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
sink("/dev/null")
melbourne <- readOGR("melbourne_australia.land.coastline", "melbourne_australia_land_coast")
sink()
subset <- data_ll[melbourne,]
final <- as.data.frame(subset)
return(final)
}
and this is the Python code:
import os
import numpy as np
import pandas as pd
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage
# read the R source that defines subset_points
with open('subset_data.R') as fh:
    rcode = os.linesep.join(fh.readlines())
subset = SignatureTranslatedAnonymousPackage(rcode, "subset")
rpy2.robjects.numpy2ri.activate()
data = pd.read_csv('sim.csv')
data = data.values
final = subset.subset_points(data)
print(np.array(final).T)
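
As for the TypeError in the question: Python's square-bracket syntax only works on objects whose class defines __getitem__, whereas data_ll[melbourne,] in R is dispatched to a subsetting method for the spatial class. The sketch below is plain Python with hypothetical names (OpaqueHandle, subset) and does not use rpy2. It only illustrates why such a wrapper object raises this error, and why routing the operation through an explicit function call, as the answer does by doing the subsetting inside R, avoids it.

```python
class OpaqueHandle:
    """Hypothetical wrapper that holds data but defines no __getitem__."""

    def __init__(self, values):
        self._values = list(values)


def subset(handle, keep):
    """Explicit function that performs the selection on the wrapped data."""
    return [value for value, flag in zip(handle._values, keep) if flag]


points = OpaqueHandle([10, 20, 30])

try:
    points[0]  # square brackets need __getitem__, which the class lacks
except TypeError as exc:
    message = str(exc)

selected = subset(points, [True, False, True])
print(message)
print(selected)
```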

Getting standard errors from regressions using rpy2

I am using rpy2 for regressions. The returned object is a list that includes the coefficients, residuals, fitted values, rank of the fitted model, etc.
However, I can't find the standard errors (or the R^2) in the fit object. When running lm directly in R, the standard errors are displayed by the summary command, but I can't access them directly in the model's data frame.
How can I extract this info using rpy2?
Sample python code is
from scipy import random
from numpy import hstack, array, matrix
from rpy2 import robjects
from rpy2.robjects.packages import importr
def test_regress():
    stats=importr('stats')
    x=random.uniform(0,1,100).reshape([100,1])
    y=1+x+random.uniform(0,1,100).reshape([100,1])
    x_in_r=create_r_matrix(x, x.shape[1])
    y_in_r=create_r_matrix(y, y.shape[1])
    formula=robjects.Formula('y~x')
    env = formula.environment
    env['x']=x_in_r
    env['y']=y_in_r
    fit=stats.lm(formula)
    coeffs=array(fit[0])
    resids=array(fit[1])
    fitted_vals=array(fit[4])
    return(coeffs, resids, fitted_vals)
def create_r_matrix(py_array, ncols):
    if type(py_array)==type(matrix([1])) or type(py_array)==type(array([1])):
        py_array=py_array.tolist()
    r_vector=robjects.FloatVector(flatten_list(py_array))
    r_matrix=robjects.r['matrix'](r_vector, ncol=ncols)
    return r_matrix
def flatten_list(source):
    return([item for sublist in source for item in sublist])
test_regress()

So this seems to work for me:
def test_regress():
    stats=importr('stats')
    base=importr('base') # needed for base.summary below
    x=random.uniform(0,1,100).reshape([100,1])
    y=1+x+random.uniform(0,1,100).reshape([100,1])
    x_in_r=create_r_matrix(x, x.shape[1])
    y_in_r=create_r_matrix(y, y.shape[1])
    formula=robjects.Formula('y~x')
    env = formula.environment
    env['x']=x_in_r
    env['y']=y_in_r
    fit=stats.lm(formula)
    coeffs=array(fit[0])
    resids=array(fit[1])
    fitted_vals=array(fit[4])
    modsum = base.summary(fit)
    rsquared = array(modsum[7])
    se = array(modsum.rx2('coefficients')[2:4])
    return(coeffs, resids, fitted_vals, rsquared, se)
Although, as I said, this is literally my first foray into RPy2, so there may be a better way to do that. But this version appears to output arrays containing the R-squared value along with the standard errors.
You can use print(modsum.names) to see the names of the components of the R object (kind of like names(modsum) in R) and then .rx and .rx2 are the equivalent of [ and [[ in R.
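
The slice [2:4] in the code above works because R stores a matrix column by column, so in a coefficient table with two rows (intercept and x) the second column, "Std. Error", occupies flat positions 2 and 3. The sketch below reproduces that layout in plain Python. The numbers are made up and the helper name column is hypothetical.

```python
# Hypothetical 2x4 coefficient table (rows: intercept, x). Numbers are made up.
col_names = ["Estimate", "Std. Error", "t value", "Pr(>|t|)"]
table = [
    [1.50, 0.06, 25.0, 0.0],
    [1.00, 0.10, 10.0, 0.0],
]
n_rows = len(table)

# Column-major flattening: walk down each column before moving to the next.
flat = [table[r][c] for c in range(len(col_names)) for r in range(n_rows)]

# With two coefficients, the standard errors sit at flat positions 2 and 3.
std_errors = flat[2:4]


def column(flat, n_rows, col_index):
    """Return one column of a column-major flattened matrix."""
    start = col_index * n_rows
    return flat[start:start + n_rows]


print(std_errors)
print(column(flat, n_rows, col_names.index("Std. Error")))
```

The hard-coded 2:4 therefore only fits a model with exactly two coefficients. With more regressors the slice bounds would need to be computed from the number of rows, as column does here.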

@joran: Pretty good. I'd say that it is pretty much the way to do it.
from rpy2 import robjects
from rpy2.robjects.packages import importr
base = importr('base')
stats = importr('stats') # import only once !
def test_regress():
    x = base.matrix(stats.runif(100), nrow = 100)
    y = (x.ro + base.matrix(stats.runif(100), nrow = 100)).ro + 1 # not so nice
    formula = robjects.Formula('y~x')
    env = formula.environment
    env['x'] = x
    env['y'] = y
    fit = stats.lm(formula)
    coefs = stats.coef(fit)
    resids = stats.residuals(fit)
    fitted_vals = stats.fitted(fit)
    modsum = base.summary(fit)
    rsquared = modsum.rx2('r.squared')
    se = modsum.rx2('coefficients')[2:4]
    return (coefs, resids, fitted_vals, rsquared, se)
