Getting Rpy2 to work with rgdal to spatially subset points - python

So I have some R code that already works. It takes a set of points from a .csv file and spatially subsets them against a shapefile, following the method of http://robinlovelace.net/r/2014/07/29/clipping-with-r.html.
#data is a .csv file with lon lat points
data_points <- SpatialPoints(data)
proj4string(data_points) <- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
data_ll <- spTransform(data_points, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
melbourne <- readOGR("melbourne_australia.land.coastline","melbourne_australia_land_coast") #this is a shapefile from https://mapzen.com/data/metro-extracts
subset <- data_ll[melbourne,]
plot(melbourne)
points(subset)
I'm trying to convert this to the corresponding rpy2 script. So far I have:
import pandas as pd
import numpy as np
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rgdal = importr('rgdal')
base = importr('base')
rpy2.robjects.numpy2ri.activate()
data = pd.read_csv('sim.csv')
data = data.values
coordinates = ro.r['coordinates']
proj4string = ro.r['proj4string']
spTransform = ro.r['spTransform']
readOGR = ro.r['readOGR']
SpatialPoints = ro.r['SpatialPoints']
CRS = ro.r['CRS']
class_r = ro.r['class']
key = CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
data_points = SpatialPoints(data, proj4string = key)
data_ll = spTransform(data_points, key)
melbourne = readOGR("melbourne_australia.land.coastline", "melbourne_australia_land_coast")
subset = data_ll[melbourne,]
which fails on the last line with the error TypeError: 'RS4' object is not subscriptable. Does anyone have some idea of what is going on?

One way to do this is to convert the R code to a function and then import it as a package.
This is the R code:
library(rgdal)
library(sp)
#import data
data <- read.csv("sim.csv", header = F)
subset_points <- function(data){
data_points <- SpatialPoints(data, proj4string=CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
data_ll <- spTransform(data_points, CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
sink("/dev/null")
melbourne <- readOGR("melbourne_australia.land.coastline", "melbourne_australia_land_coast")
sink()
subset <- data_ll[melbourne,]
final <- as.data.frame(subset)
return(final)
}
and this is the Python code:
import os
import numpy as np
import pandas as pd
import rpy2.robjects as ro
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage
with open('subset_data.R') as fh:
    rcode = os.linesep.join(fh.readlines())
subset = SignatureTranslatedAnonymousPackage(rcode, "subset")
rpy2.robjects.numpy2ri.activate()
data = pd.read_csv('sim.csv')
data = data.values
final = subset.subset_points(data)
print(np.array(final).T)
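As a side note on the original error: rpy2 exposes R's S4 objects as RS4, which does not implement Python's subscript syntax, so data_ll[melbourne,] cannot work as written. A minimal sketch of an alternative, assuming the objects from the question's script are in scope, is to fetch R's [ operator and call it as a function (the exact arguments may need adjusting for sp's subset method):
# Hedged sketch: call R's subscript operator explicitly, mirroring the R
# expression data_ll[melbourne, ] from the question.
r_bracket = ro.r['[']
subset_points = r_bracket(data_ll, melbourne)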

Related

Masking a variable with lat and lon but needed 3d array

What I am trying to do is mask a variable from a netCDF file with a numpy array, according to a specific location, but it gives me a 1-D array that I cannot use for plotting. Here is my code.
from netCDF4 import Dataset
import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
file = './sample_data/NSS.AMBX.NK.D08214.S0740.E0931.B5312324.WI.nc'
data = Dataset(file,mode='r')
fcdBT89gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_1'][:])
fcdBT150gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_2'][:])
fcdBT183_1gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_3'][:])
fcdBT183_3gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_4'][:])
fcdBT183_7gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_5'][:])
lats = data.groups['Geolocation_Time_Fields']['latitude'] # latitude values
lons = data.groups['Geolocation_Time_Fields']['longitude'] # longitude values
latlar = np.asarray(lats[:]) # latitudes
lonlar = np.asarray(lons[:]) # longitudes
lo = ma.masked_outside(lonlar,105,110)
la = ma.masked_outside(latlar,30,35)
merged_coord=~ma.mask_or(la.mask,lo.mask)
h = plt.plot(fcdBT150gHz[merged_coord])
The output is plotted against the array index, but I need latitudes on the x axis.
In case you need the shapes of the variables:
lo.shape = (2495, 90)
la.shape = (2495, 90)
fcdBT150gHz[merged_coord].shape = (701,)
Maybe I did not use the right approach for masking. The data is available if needed.
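For context, indexing a 2-D array with a 2-D boolean mask returns a flattened 1-D array of the selected elements, which is why fcdBT150gHz[merged_coord] loses the latitude axis. Below is a minimal sketch, using synthetic arrays in place of the netCDF variables, of keeping the selection paired with its latitudes when plotting:
import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt

# Synthetic stand-ins with the same 2-D swath layout as the real variables.
latlar = np.random.uniform(25, 40, size=(2495, 90))
lonlar = np.random.uniform(100, 115, size=(2495, 90))
tb150 = np.random.uniform(150, 300, size=(2495, 90))

# True where a pixel falls inside the lat/lon box (same as ~mask_or above).
inside = (latlar >= 30) & (latlar <= 35) & (lonlar >= 105) & (lonlar <= 110)
tb150_masked = ma.masked_where(~inside, tb150)  # keeps the (2495, 90) shape

# Plot brightness temperature against latitude for the selected pixels only.
plt.plot(latlar[inside], tb150[inside], '.')
plt.xlabel('latitude')
plt.ylabel('brightness temperature')
plt.show()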

Unable to move R analyses output back to Python (rpy2)

I am attempting to pass some data from Python to R and then return the results to Python, but can't seem to get it to work.
I am successful in passing my data to R, running my custom function on the data, and even getting the output. Where I am stuck is getting the statistical output back into Python as a dataframe. I have tried using rpy2 and even exporting the output to a .csv file to re-import, but can't get either method to work. When I try to push it back to pandas I get an error that it can't be coerced. When it comes to saving to a .csv, I can't seem to get it to work using my "results" object. From what I have read, checking what is in the R global environment may help me figure it out, but I haven't been able to figure out how to do that either.
Any helpful comments are appreciated.
#import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
base = importr('base')
utils = importr('utils')
name = 'test_subject'
#Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10] # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3] # number of responses per bin
#Convert data to R objects
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame ''')
df = makeDataFrame(x = set1, y = set2)
# Create curve fitting function
curve_fit = robjects.r('''
curve_fit <- function(df, plot = FALSE){ control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
printEval = FALSE, warnOnly = TRUE)
fit <- nls(y ~ d+a*exp(-.5*((x-t0)/b)^2)+c*(x-t0),
data = df,
start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
algorithm = "port",
control = control)
if (plot){
fitFnc <- function(x) predict(fit, list(x=x))
par(mfrow = c(1, 1))
plot(df$x, df$y, xlim = c(0,45))
curve(fitFnc, from=.5, to=45, add = TRUE)
}
return(list("params" = summary(fit),
"r2" = cor(predict(fit), df$y)^2))
}''')
#run function on data
results = curve_fit(df, plot = True)
#Show Results
print('results', results)
print(type(results))
The problem was in
return(list("params" = summary(fit), "r2" = cor(predict(fit), df$y)^2))
The first item in the list, "params", was a summary table from R. While this printed in Python as the data I wanted, it was a single object that could not be subdivided, since it was essentially an image of an R output table. What I needed to return was a dataframe, as shown in the code below.
return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
This returned a list of objects that I could then convert to a numpy array and manipulate in python.
Here is the full code.
#import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
import numpy as np
rpy2.robjects.numpy2ri.activate()
base = importr('base')
utils = importr('utils')
#Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10] # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3] # number of responses per bin
#Convert data to R objects and place in data frame
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame ''')
df = makeDataFrame(x = set1, y = set2)
# Create curve fitting function in r
curve_fit = robjects.r('''
#Fit function
curve_fit <- function(df, plot = FALSE){ control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
printEval = FALSE, warnOnly = TRUE)
#Specify formula to fit
fit <- nls(y ~ d+a*exp(-.5*((x-t0)/b)^2)+c*(x-t0),
data = df,
start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
algorithm = "port",
control = control)
# Create plot of curve
if (plot){
fitFnc <- function(x) predict(fit, list(x=x))
par(mfrow = c(1, 1))
plot(df$x, df$y, xlim = c(0,45))
curve(fitFnc, from=.5, to=45, add = TRUE)}
#returns data in R dataframe
return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
}''')
#run function on data
results = curve_fit(df, plot = True)
results = np.array(results) #convert to numpy array
#Show Results
print('results', results)
print(type(results))
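If labelled columns are preferred over a bare numpy array, a minimal sketch of converting the returned R data.frame with pandas2ri (the rpy2 2.x-style ri2py used elsewhere on this page; rpy2 3.x uses conversion contexts instead):
from rpy2.robjects import pandas2ri

# Re-run the fit and convert the returned R data.frame to pandas so the
# coefficient names and the r2 column survive as labelled columns.
results_r = curve_fit(df, plot=False)
results_df = pandas2ri.ri2py(results_r)
print(results_df)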

How to use numpy return_inverse table on a different array?

I have a lookup table created using -
lookupTable, data_training_panda_y_indexed = np.unique(data_training_panda_y, return_inverse=True)
However, I want to apply the lookupTable to a different array, data_cross_validation_panda_y.
data_training_panda_y is a list of strings which can take these values: Incoming, Outgoing, Neutral.
So, lookupTable is an ndarray: ('Incoming', 'Outgoing', 'Neutral').
Code so far:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from numpy import dtype
from _codecs import lookup
#Load data
data = np.genfromtxt('../Data/bezdekIris.csv',delimiter=',',usecols=[0,1,2,3,4],dtype=None)
labels = np.genfromtxt('../Data/bezdekIris.csv',delimiter=',',usecols=[4],dtype=None)
#Shuffle the rows
np.random.shuffle(data)
#Cut the data into 3 parts
data_rows = np.size(data, 0)
training_rows = int(round(0.6*data_rows))
cross_validation_rows = int(round(0.2*data_rows))
testing_rows = data_rows - training_rows - cross_validation_rows
data_training_panda = pd.DataFrame(data[:training_rows])
data_training_panda_X = data_training_panda.iloc[:,0:4]
data_training_panda_y = data_training_panda.iloc[:,4]
data_cross_validation_panda = pd.DataFrame(data[training_rows:training_rows+cross_validation_rows])
data_cross_validation_panda_X = data_cross_validation_panda.iloc[:,0:4]
data_cross_validation_panda_y = data_cross_validation_panda.iloc[:,4]
data_testing_panda = pd.DataFrame(data[training_rows+cross_validation_rows:])
data_testing_panda_X = data_testing_panda.iloc[:,0:4]
data_testing_panda_y = data_testing_panda.iloc[:,4]
#Take out the labels from the 3 parts
lookupTable, data_training_panda_y_indexed = np.unique(data_training_panda_y, return_inverse=True)
#Label the CV and Testing
data_cross_validation_panda_y_indexed = np.array([])
data_testing_panda_y_indexed = np.array([])
Sample data from bezdekIris.csv:
5.1,3.5,1.4,0.2,Incoming
4.9,3.0,1.4,0.2,Outgoing
4.7,3.2,1.3,0.2,Neutral
Using searchsorted could be a solution.
data_cross_validation_panda_y_indexed = np.searchsorted(lookupTable, data_cross_validation_panda_y)
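This works because np.unique returns lookupTable in sorted order, which is exactly what np.searchsorted needs; a label that never appeared in the training data would silently map to an insertion index, so that case may need guarding. A small self-contained sketch of the idea:
import numpy as np

train_y = np.array(['Incoming', 'Outgoing', 'Neutral', 'Incoming'])
cv_y = np.array(['Neutral', 'Incoming', 'Outgoing'])

# Build the lookup table from the training labels only.
lookupTable, train_y_indexed = np.unique(train_y, return_inverse=True)
# Apply the same table to a different array of labels.
cv_y_indexed = np.searchsorted(lookupTable, cv_y)

print(lookupTable)      # ['Incoming' 'Neutral' 'Outgoing'] (sorted)
print(train_y_indexed)  # [0 2 1 0]
print(cv_y_indexed)     # [1 0 2]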

Calling R library "randomForest" from python using rpy2

I would like to embed some R libraries in my Python script by using rpy2. I already embedded "stats.lm" successfully, but now I would like to embed "randomForest".
import pandas as pd
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
import rpy2.robjects as robjects
randomForest=importr('randomForest')
pandas2ri.activate()
#read data
df = pd.read_csv('train.csv',index_col=0)
rdf = pandas2ri.py2ri(df)
#check
print(type(rdf))
print(rdf)
#Random Forest
formula = 'target ~ .'
fit_full = randomForest(formula, data=rdf)
The output is:
Traceback (most recent call last):
File "<ipython-input-5-776f4072f19e>", line 2, in <module>
fit_full = randomForest(formula, data=rdf)
TypeError: 'InstalledSTPackage' object is not callable
I have already used this package successfully in R to model this dataset. "train.csv" is a matrix of some tens of thousands of samples (rows) and about 94 columns: 93 features (class integer) and 1 target (class factor). The target column has 9 classes (Class_1, ..., Class_9).
----------------- EDIT -----------------
A partial solution could be to directly embed the code in a function that contains the model and prediction:
import rpy2.robjects as robjects
import rpy2
from rpy2.robjects import pandas2ri
rpy2.__version__
robjects.r('''
f <- function() {
library(randomForest)
train <- read.csv("train.csv")
train1 <- train[sample(c(1:60000), 5000, replace = TRUE),2:95]
train1.rf <- randomForest(target ~ ., data = train1,
importance = TRUE,
do.trace = 100)
pred <- as.data.frame(predict(train1.rf, train1[1:100,1:93]))
}
''')
r_f = robjects.globalenv['f']
pred=pandas2ri.ri2py(r_f())
but I'm still wondering if there is a better solution (that stores the model "train1.rf", too).
This is what I was searching for:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
import pandas as pd
import random
pandas2ri.activate()
df = pd.read_csv('train.csv',index_col=0)
train=df.iloc[random.sample(range(1,60000), 5000),0:94]
test=df.iloc[random.sample(range(1,60000), 100),0:93]
rtrain = pandas2ri.py2ri(train)
print(rtrain)
rtest = pandas2ri.py2ri(test)
print(rtest)
robjects.r('''
f <- function(train) {
library(randomForest)
train1.rf <- randomForest(target ~ ., data = train, importance = TRUE, do.trace = 100)
}
''')
r_f = robjects.globalenv['f']
rf_model=(r_f(rtrain))
robjects.r('''
g <- function(model,test) {
pred <- as.data.frame(predict(model, test))
}
''')
r_g = robjects.globalenv['g']
pred=pandas2ri.ri2py(r_g(rf_model,rtest))
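For completeness, the original one-call attempt can also be made to work: importr returns a package object, which is not itself callable (hence the TypeError), but the randomForest() function is available as an attribute of it, and the formula can be wrapped in robjects.Formula. A minimal sketch under those assumptions, reusing rtrain and rtest from above, which also keeps the fitted model around as an R object on the Python side:
from rpy2.robjects import Formula
from rpy2.robjects.packages import importr

randomForest = importr('randomForest')

# Call the function attribute, not the package object; rpy2's signature
# translation maps the R argument do.trace to do_trace.
rf_model = randomForest.randomForest(Formula('target ~ .'), data=rtrain,
                                     importance=True, do_trace=100)
pred = pandas2ri.ri2py(robjects.r['predict'](rf_model, rtest))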

Convert DF into Numpy Array for calculations

I have the data in dataframe format that I will use for a linear regression calculation with a user-built function. Here is the code:
from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = df.to_array(x)
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
However, I am getting the error:
AttributeError Traceback (most recent call last)
<ipython-input-131-272f1b4d26ba> in <module>()
1 import copy
2
----> 3 xw = df.to_array(x)
AttributeError: 'int' object has no attribute 'to_array'
I am not sure where the problem is. I need to pass an array of values (x in this case) to the function to execute some matrix operations.
The insert function was working during step-by-step code development but for some reason is failing here.
I tried:
xw = copy.deepcopy(x)
with no success
Any thoughts?
It is x.as_matrix(), not df.to_array(x).
Please refer to the pandas documentation for more detail on as_matrix().
Here is the code that works:
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = x.as_matrix()
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
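As a side note, as_matrix() was later deprecated in pandas (and removed in pandas 1.0) in favour of .to_numpy() / .values; a minimal sketch of the same step with the newer accessor, assuming the same x dataframe:
xw = x.to_numpy()                 # or x.values on older pandas versions
xw = np.insert(xw, 0, 1, axis=1)  # prepend a column of ones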
