I would like to embed some R libraries in my Python script using rpy2. I already successfully embedded "stats.lm", but now I would like to embed "randomForest".
import pandas as pd
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
import rpy2.robjects as robjects
randomForest=importr('randomForest')
pandas2ri.activate()
#read data
df = pd.read_csv('train.csv',index_col=0)
rdf = pandas2ri.py2ri(df)
#check
print(type(rdf))
print(rdf)
#Random Forest
formula = 'target ~ .'
fit_full = randomForest(formula, data=rdf)
The output is:
Traceback (most recent call last):
File "<ipython-input-5-776f4072f19e>", line 2, in <module>
fit_full = randomForest(formula, data=rdf)
TypeError: 'InstalledSTPackage' object is not callable
I have already used this package successfully in R to model this dataset. "train.csv" is a matrix of some tens of thousands of samples (rows) and about 94 columns: 93 features (class integer) and 1 target (class factor). The target column has 9 classes (Class_1, ..., Class_9).
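For reference, the TypeError arises because importr('randomForest') returns a package object, which is not itself callable; the package's randomForest() function is an attribute of that object, and the model formula should be an R Formula rather than a plain Python string. A minimal, untested sketch of calling it directly (assuming the target column converts to a factor) would be:
import pandas as pd
from rpy2.robjects import Formula, pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()
rf_pkg = importr('randomForest')   # a package object, not a function

df = pd.read_csv('train.csv', index_col=0)
rdf = pandas2ri.py2ri(df)          # pandas DataFrame -> R data.frame

# call the function exposed by the package and pass a real R formula
fit_full = rf_pkg.randomForest(Formula('target ~ .'), data=rdf)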
----------------- EDIT -----------------
A partial solution could be to embed the code directly in an R function that contains both the model and the prediction:
import rpy2.robjects as robjects
import rpy2
from rpy2.robjects import pandas2ri
rpy2.__version__
robjects.r('''
f <- function() {
  library(randomForest)
  train <- read.csv("train.csv")
  train1 <- train[sample(c(1:60000), 5000, replace = TRUE), 2:95]
  train1.rf <- randomForest(target ~ ., data = train1,
                            importance = TRUE,
                            do.trace = 100)
  pred <- as.data.frame(predict(train1.rf, train1[1:100, 1:93]))
}
''')
r_f = robjects.globalenv['f']
pred=pandas2ri.ri2py(r_f())
but I'm still wondering if there is a better solution (that stores the model "train1.rf", too).
This is what I was searching for:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
import pandas as pd
import random
pandas2ri.activate()
df = pd.read_csv('train.csv',index_col=0)
train=df.iloc[random.sample(range(1,60000), 5000),0:94]
test=df.iloc[random.sample(range(1,60000), 100),0:93]
rtrain = pandas2ri.py2ri(train)
print(rtrain)
rtest = pandas2ri.py2ri(test)
print(rtest)
robjects.r('''
f <- function(train) {
library(randomForest)
train1.rf <- randomForest(target ~ ., data = train, importance = TRUE, do.trace = 100)
}
''')
r_f = robjects.globalenv['f']
rf_model = r_f(rtrain)
robjects.r('''
g <- function(model,test) {
pred <- as.data.frame(predict(model, test))
}
''')
r_g = robjects.globalenv['g']
pred=pandas2ri.ri2py(r_g(rf_model,rtest))
Related
I am attempting to pass some data from Python to R and then return the results to Python, but I can't seem to get it to work.
I am successful in passing my data to R and running my custom function on the data, and I even get the output. Where I am stuck is getting the statistical output back into Python as a dataframe. I have tried using rpy2 and even exporting to a .csv file to re-import, but I can't get either method to work. When I try to push it back to pandas, I get an error that it can't be coerced. When it comes to saving to a .csv, I can't get it to work with my "results" object. From my reading, it seems that checking what is in the R global environment may help me figure this out, but I haven't been able to work out how to do that either.
Any helpful comments are appreciated.
#import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
base = importr('base')
utils = importr('utils')
name = 'test_subject'
#Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10] # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3] # number of responses per bin
#Convert data to R objects
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame ''')
df = makeDataFrame(x = set1, y = set2)
# Create curve fitting function
curve_fit = robjects.r('''
curve_fit <- function(df, plot = FALSE){
  control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
                         printEval = FALSE, warnOnly = TRUE)
  fit <- nls(y ~ d+a*exp(-.5*((x-t0)/b)^2)+c*(x-t0),
             data = df,
             start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
             algorithm = "port",
             control = control)
  if (plot){
    fitFnc <- function(x) predict(fit, list(x=x))
    par(mfrow = c(1, 1))
    plot(df$x, df$y, xlim = c(0,45))
    curve(fitFnc, from=.5, to=45, add = TRUE)
  }
  return(list("params" = summary(fit),
              "r2" = cor(predict(fit), df$y)^2))
}''')
#run function on data
results = curve_fit(df, plot = True)
#Show Results
print('results', results)
print(type(results))
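As a side note on the global-environment question above: listing what is currently defined on the R side is a one-liner; here is a minimal sketch, assuming the single embedded R session that rpy2 drives:
import rpy2.robjects as robjects

# names currently defined in R's global environment
print(robjects.r('ls(envir = globalenv())'))

# any of them can then be pulled back into Python by name,
# e.g. (hypothetical name):
# some_r_object = robjects.globalenv['results']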
The problem was in
return(list("params" = summary(fit), "r2" = cor(predict(fit), df$y)^2))
The first item in the list, "params", was a summary table from R. While this printed in Python as the data I wanted, it was a single object that could not be subdivided, since it was essentially an image of an R output table. What I needed to return instead was a data frame, as shown in the code below.
return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
This returned a list of objects that I could then convert to a numpy array and manipulate in python.
Here is the full code.
#import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
import numpy as np
rpy2.robjects.numpy2ri.activate()
base = importr('base')
utils = importr('utils')
#Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10] # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3] # number of responses per bin
#Convert data to R objects and place in data frame
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame ''')
df = makeDataFrame(x = set1, y = set2)
# Create curve fitting function in r
curve_fit = robjects.r('''
#Fit function
curve_fit <- function(df, plot = FALSE){
  control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
                         printEval = FALSE, warnOnly = TRUE)
  #Specify formula to fit
  fit <- nls(y ~ d+a*exp(-.5*((x-t0)/b)^2)+c*(x-t0),
             data = df,
             start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
             algorithm = "port",
             control = control)
  # Create plot of curve
  if (plot){
    fitFnc <- function(x) predict(fit, list(x=x))
    par(mfrow = c(1, 1))
    plot(df$x, df$y, xlim = c(0,45))
    curve(fitFnc, from=.5, to=45, add = TRUE)
  }
  #returns data in R dataframe
  return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
}''')
#run function on data
results = curve_fit(df, plot = True)
results = np.array(results) #convert to numpy array
#Show Results
print('results', results)
print(type(results))
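If the coefficient names and column labels matter, an alternative to the plain numpy conversion is the pandas converter; a small sketch, assuming rpy2 2.x where pandas2ri.activate() enables automatic data.frame conversion:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

# with the pandas converter active, the R data.frame returned by curve_fit()
# comes back as a pandas DataFrame, keeping the coefficient names
results_df = curve_fit(df, plot=False)
print(results_df)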
I am trying to run DEseq2 from Python using rpy2.
How should I pass the design matrix?
My script is as follows:
from numpy import *
from numpy.random import multinomial, random
from rpy2 import robjects
import rpy2.robjects.numpy2ri
robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
deseq = importr('DESeq2')
# Generate some data. 1000 genes, 10 samples
n = 1000
probabilities = random(n)
probabilities /= sum(probabilities)
data = zeros((n,10), int)
for i in range(10):
    data[:,i] = multinomial(1000000, probabilities)
# Make the data frame
d = {}
categories = ('1','2') * 5
d["key_1"] = robjects.IntVector(categories)
dataframe = robjects.DataFrame(d)
# Create the design matrix, and run DESeqDataSetFromMatrix
design = "~ key_1" # <--- I guess this is wrong
dds = deseq.DESeqDataSetFromMatrix(countData=data, colData=dataframe,design=design)
The error I am getting is
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rpy2-2.8.5-py3.6-macosx-10.11-x86_64.egg/rpy2/rinterface/__init__.py:186: RRuntimeWarning: Error: $ operator is invalid for atomic vectors
warnings.warn(x, RRuntimeWarning)
Traceback (most recent call last):
File "testrpy.py", line 23, in <module>
dds = deseq.DESeqDataSetFromMatrix(countData=data, colData=dataf,design=design)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rpy2-2.8.5-py3.6-macosx-10.11-x86_64.egg/rpy2/robjects/functions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rpy2-2.8.5-py3.6-macosx-10.11-x86_64.egg/rpy2/robjects/functions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error: $ operator is invalid for atomic vectors
My guess is that the design argument is not correct.
Does anybody have an example of running DEseq via rpy2?
Thanks.
Ah ! You were almost there:
# Create the design matrix, and run DESeqDataSetFromMatrix
design = "~ key_1" # <--- I guess this is wrong
design is a string, but I guess that it should be a formula. Formulae are language objects in R.
Try with:
from rpy2.robjects import Formula
design = Formula("~ key_1")
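With the formula object in place, the call from the script above becomes (a sketch reusing the same variables):
dds = deseq.DESeqDataSetFromMatrix(countData=data, colData=dataframe,
                                   design=design)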
I am doing a quantile regression on the engel dataset with rpy2 (2.7.6):
import statsmodels as sm
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
quantreg = importr('quantreg')
data = sm.datasets.engel.load_pandas().data
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
However this generates the following error:
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
Traceback (most recent call last):
File "<ipython-input-22-02ee1015737c>", line 1, in <module>
quantreg.rq('foodexp ~ income', data=data, tau=0.5)
File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
RRuntimeError: Error in y - x %*% z$coef : non-conformable arrays
From what I understand, non-conformable arrays in this case would mean there are some missing values or the 'arrays' being used are different sizes. I can confirm that this is NOT the case:
data.count()
Out[26]:
income 235
foodexp 235
dtype: int64
data.shape
Out[27]: (235, 2)
What else could this error mean? Is it possible that the conversion from DataFrame to data.frame in rpy2 is not working correctly or maybe I'm missing something here? Can anyone else confirm this error?
Just in case here is some info regarding the version of R and Python.
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Dec 7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]
on win32
Any help would be appreciated.
Edit 1:
If I load the dataset directly from R I don't get an error:
from rpy2.robjects import r
r.data('engel')
data = r['engel']
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
So I think there is something wrong with the conversion with pandas2ri. The same error occurs when I try to convert the DataFrame to data.frame manually with pandas2ri.py2ri.
Edit 2:
Interestingly enough, if I use the deprecated pandas.rpy.common.convert_to_r_dataframe, the error goes away:
import pandas.rpy.common as com
rdata = com.convert_to_r_dataframe(data)
qreg = quantreg.rq('foodexp ~ income', data=rdata, tau=0.5)
There is definitely a bug in pandas2ri which is also confirmed here.
As answered on the rpy2 issue tracker:
The root of the issue seems to be that the columns in the pandas data frame are converted to Array objects each with only one column.
>>> pandas2ri.py2ri_pandasdataframe(data)
<DataFrame - Python:0x7f8af3c2afc8 / R:0x92958b0>
[Array, Array]
income: <class 'rpy2.robjects.vectors.Array'>
<Array - Python:0x7f8af57ef908 / R:0x92e1bf0>
[420.157651, 541.411707, 901.157457, ..., 581.359892, 743.077243, 1057.676711]
foodexp: <class 'rpy2.robjects.vectors.Array'>
<Array - Python:0x7f8af3c2ab88 / R:0x92e7600>
[255.839425, 310.958667, 485.680014, ..., 468.000798, 522.601906, 750.320163]
The distinction is a subtle one, but this seems to be confusing the quantreg package. Other R functions appear to work regardless of whether the object is a one-column array or a vector.
Turning the columns to R vectors appears to be what is required to solve the problem:
from rpy2.robjects.vectors import FloatVector
mydata=pandas2ri.py2ri_pandasdataframe(data)
from rpy2.robjects.packages import importr
base=importr('base')
mydata[0]=base.as_vector(mydata[0])
mydata[1]=base.as_vector(mydata[1])
# now this is working
qreg = quantreg.rq('foodexp ~ income', data=mydata, tau=0.5)
Now I would like to gather more data about whether this could solve the issue without breaking anything else. For this I turned the fix into a custom converter derived from the pandas converter:
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import Converter, localconverter
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri, vectors, conversion  # 'conversion' is used in py2ri_pandasseries below
import numpy
my_converter = Converter('my converter',
template=pandas2ri.converter)
base=importr('base')
def ndarray_forcevector(obj):
    func = numpy2ri.converter.py2ri.registry[numpy.ndarray]
    # current conversion as performed by numpy
    res = func(obj)
    if len(obj.shape) == 1:
        # force into an R vector
        res = base.as_vector(res)
    return res
@my_converter.py2ri.register(pandas2ri.PandasSeries)
def py2ri_pandasseries(obj):
    # this is a copy of the function with the same name in pandas2ri, with
    # the call to ndarray_forcevector() as the only difference
    if obj.dtype == '<M8[ns]':
        # time series
        d = [vectors.IntVector([x.year for x in obj]),
             vectors.IntVector([x.month for x in obj]),
             vectors.IntVector([x.day for x in obj]),
             vectors.IntVector([x.hour for x in obj]),
             vectors.IntVector([x.minute for x in obj]),
             vectors.IntVector([x.second for x in obj])]
        res = vectors.ISOdatetime(*d)
        # FIXME: can the POSIXct be created from the POSIXct constructor ?
        # (is '<M8[ns]' mapping to Python datetime.datetime ?)
        res = vectors.POSIXct(res)
    else:
        # converted as a numpy array
        res = ndarray_forcevector(obj)
    # "index" is equivalent to "names" in R
    if obj.ndim == 1:
        res.do_slot_assign('names',
                           vectors.StrVector(tuple(str(x) for x in obj.index)))
    else:
        res.do_slot_assign('dimnames',
                           vectors.SexpVector(conversion.py2ri(obj.index)))
    return res
The easiest way to use this new converter might be in a context manager:
with localconverter(default_converter + my_converter) as cv:
    qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
So I have some R code that already works. It takes a bunch of points from data and spatially subsets them against a shapefile, following the method of http://robinlovelace.net/r/2014/07/29/clipping-with-r.html.
#data is a .csv file with lon lat points
data_points <- SpatialPoints(data)
proj4string(data_points) <- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
data_ll <- spTransform(data_points, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
melbourne <- readOGR("melbourne_australia.land.coastline","melbourne_australia_land_coast") #this is a shapefile from https://mapzen.com/data/metro-extracts
subset <- data_ll[melbourne,]
plot(melbourne)
points(subset)
I'm trying to convert this to the corresponding rpy2 script. So far I have:
import pandas as pd
import numpy as np
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rgdal = importr('rgdal')
base = importr('base')
rpy2.robjects.numpy2ri.activate()
data = pd.read_csv('sim.csv')
data = data.values
coordinates = ro.r['coordinates']
proj4string = ro.r['proj4string']
spTransform = ro.r['spTransform']
readOGR = ro.r['readOGR']
SpatialPoints = ro.r['SpatialPoints']
CRS = ro.r['CRS']
class_r = ro.r['class']
key = CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
data_points = SpatialPoints(data, proj4string = key)
data_ll = spTransform(data_points, key)
melbourne = readOGR("melbourne_australia.land.coastline", "melbourne_australia_land_coast")
subset = data_ll[melbourne,]
which fails on the last line with the error TypeError: 'RS4' object is not subscriptable. Does anyone have some idea of what is going on?
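As a side note, the TypeError appears because data_ll is an rpy2 RS4 proxy, which does not implement Python's [] indexing. One possible workaround, sketched and untested, is to call R's [ operator explicitly instead of using Python subscript syntax:
# untested sketch: invoke R's "[" generic directly on the S4 objects,
# mirroring data_ll[melbourne, ] from the original R code
r_bracket = ro.r['[']
subset = r_bracket(data_ll, melbourne)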
One way to do this is to convert the R code to a function and then import it as a package.
This is the R code:
library(rgdal)
library(sp)
#import data
data <- read.csv("sim.csv", header = F)
subset_points <- function(data){
data_points <- SpatialPoints(data, proj4string=CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
data_ll <- spTransform(data_points, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
sink("/dev/null")
melbourne <- readOGR("melbourne_australia.land.coastline", "melbourne_australia_land_coast")
sink()
subset <- data_ll[melbourne,]
final <- as.data.frame(subset)
return(final)
}
and this is the Python code:
import os
import pandas as pd
import numpy as np
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage

with open('subset_data.R') as fh:
    rcode = os.linesep.join(fh.readlines())
subset = SignatureTranslatedAnonymousPackage(rcode, "subset")
rpy2.robjects.numpy2ri.activate()
data = pd.read_csv('sim.csv')
data = data.values
final = subset.subset_points(data)
print(np.array(final).T)
I am using rpy2 for regressions. The returned object is a list that includes coefficients, residuals, fitted values, the rank of the fitted model, etc.
However, I can't find the standard errors (nor the R^2) in the fit object. Running lm directly in R, the standard errors are displayed with the summary command, but I can't access them directly in the model's data frame.
How can I extract this info using rpy2?
Sample python code is
from scipy import random
from numpy import hstack, array, matrix
from rpy2 import robjects
from rpy2.robjects.packages import importr
def test_regress():
    stats=importr('stats')
    x=random.uniform(0,1,100).reshape([100,1])
    y=1+x+random.uniform(0,1,100).reshape([100,1])
    x_in_r=create_r_matrix(x, x.shape[1])
    y_in_r=create_r_matrix(y, y.shape[1])
    formula=robjects.Formula('y~x')
    env = formula.environment
    env['x']=x_in_r
    env['y']=y_in_r
    fit=stats.lm(formula)
    coeffs=array(fit[0])
    resids=array(fit[1])
    fitted_vals=array(fit[4])
    return(coeffs, resids, fitted_vals)

def create_r_matrix(py_array, ncols):
    if type(py_array)==type(matrix([1])) or type(py_array)==type(array([1])):
        py_array=py_array.tolist()
    r_vector=robjects.FloatVector(flatten_list(py_array))
    r_matrix=robjects.r['matrix'](r_vector, ncol=ncols)
    return r_matrix

def flatten_list(source):
    return([item for sublist in source for item in sublist])

test_regress()
So this seems to work for me:
def test_regress():
    stats=importr('stats')
    x=random.uniform(0,1,100).reshape([100,1])
    y=1+x+random.uniform(0,1,100).reshape([100,1])
    x_in_r=create_r_matrix(x, x.shape[1])
    y_in_r=create_r_matrix(y, y.shape[1])
    formula=robjects.Formula('y~x')
    env = formula.environment
    env['x']=x_in_r
    env['y']=y_in_r
    fit=stats.lm(formula)
    coeffs=array(fit[0])
    resids=array(fit[1])
    fitted_vals=array(fit[4])
    modsum = base.summary(fit)
    rsquared = array(modsum[7])
    se = array(modsum.rx2('coefficients')[2:4])
    return(coeffs, resids, fitted_vals, rsquared, se)
Although, as I said, this is literally my first foray into RPy2, so there may be a better way to do that. But this version appears to output arrays containing the R-squared value along with the standard errors.
You can use print(modsum.names) to see the names of the components of the R object (kind of like names(modsum) in R) and then .rx and .rx2 are the equivalent of [ and [[ in R.
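For example, a small sketch of the indexing difference:
print(modsum.names)                   # like names(modsum) in R
rsq_list  = modsum.rx('r.squared')    # like modsum['r.squared']  -> a length-1 list
rsq_value = modsum.rx2('r.squared')   # like modsum[['r.squared']] -> the value itself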
@joran: Pretty good. I'd say that it is pretty much the way to do it.
from rpy2 import robjects
from rpy2.robjects.packages import importr
base = importr('base')
stats = importr('stats') # import only once !
def test_regress():
    x = base.matrix(stats.runif(100), nrow = 100)
    y = (x.ro + base.matrix(stats.runif(100), nrow = 100)).ro + 1 # not so nice
    formula = robjects.Formula('y~x')
    env = formula.environment
    env['x'] = x
    env['y'] = y
    fit = stats.lm(formula)
    coefs = stats.coef(fit)
    resids = stats.residuals(fit)
    fitted_vals = stats.fitted(fit)
    modsum = base.summary(fit)
    rsquared = modsum.rx2('r.squared')
    se = modsum.rx2('coefficients')[2:4]
    return (coefs, resids, fitted_vals, rsquared, se)