Using rpy2 with caret attempts classification instead of regression - python

I have data that I have created and preprocessed in Python that I would like to import to R and perform a k-fold cross-validated LASSO fit using glmnet. I want control over which observations are used in each fold, so I want to use caret to do this.
However, I have found that caret interprets my data as a classification instead of a regression problem, and promptly fails. Here is what I hope is a reproducible example:
import numpy as np
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import numpy2ri
from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
numpy2ri.activate()
# Import essential R packages
glmnet = importr('glmnet')
caret = importr('caret')
base = importr('base')
# Define X and y input
dummy_x = pd.DataFrame(np.random.rand(10000, 5), columns=('a', 'b', 'c', 'd', 'e'))
dummy_y = np.random.rand(10000)
# Convert pandas DataFrame to R data.frame
with localconverter(robjects.default_converter + pandas2ri.converter):
    dummy_x_R = robjects.conversion.py2rpy(dummy_x)
# Use caret to perform the fit using default settings
caret_test = caret.train(**{'x': dummy_x_R, 'y': dummy_y, 'method': 'glmnet'})
rpy2 fails, giving this cryptic error message from R:
RRuntimeError: Error: Metric RMSE not applicable for classification models
What could be causing this? According to this previous question, it may be the case that caret is assuming that at least one of my variables is an integer type, and so defaults to thinking this is a classification instead of a regression problem.
However, I have checked both X and y using typeof, and they are clearly doubles:
>>> base.sapply(dummy_x_R, 'typeof')
array(['double', 'double', 'double', 'double', 'double'], dtype='<U6')
>>> base.sapply(dummy_y, 'typeof')
array(['double', 'double', 'double', ..., 'double', 'double', 'double'],
      dtype='<U6')
Why am I getting this error? All the default settings to train assume a regression model, so why does caret assume a classification model when used in this way?

In situations like this, the first step is to identify whether the unexpected outcome originates on the Python/rpy2 side or on the R side.
The conversion from pandas to R, or from numpy to R, appears to work as expected, at least for array types:
>>> [x.typeof for x in dummy_x_R]
[<RTYPES.REALSXP: 14>,
 <RTYPES.REALSXP: 14>,
 <RTYPES.REALSXP: 14>,
 <RTYPES.REALSXP: 14>,
 <RTYPES.REALSXP: 14>]
I am guessing that this is what you did for dummy_y:
>>> from rpy2.robjects import numpy2ri
>>> with localconverter(robjects.default_converter + numpy2ri.converter):
...     dummy_y_R = robjects.conversion.py2rpy(dummy_y)
>>> dummy_y_R.typeof
<RTYPES.REALSXP: 14>
However, a rather subtle conversion detail is at the root of the issue. dummy_y_R has a "shape" (attribute dim in R), while caret expects a shape-less R array (a "vector" in R lingo) in order to perform a regression. One can force dummy_y to be an R vector with:
caret_test = caret.train(**{'x': dummy_x_R,
                            'y': robjects.FloatVector(dummy_y),
                            'method': 'glmnet'})
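A quick way to see the difference (a sketch, reusing the objects and the base handle defined above) is to compare the dim attribute of the two conversions:
# the numpy2ri conversion carries a dim attribute (a shape),
# while FloatVector yields a plain, shape-less R vector
print(base.dim(dummy_y_R))                      # e.g. [1] 10000
print(base.dim(robjects.FloatVector(dummy_y)))  # NULL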

To use R methods, be sure all inputs are R objects. Therefore, consider converting the dummy_y numpy array to an R vector, which you can do with base.as_double:
...
dummy_y_R = base.as_double(dummy_y)
caret.train(x=dummy_x_R, y=dummy_y_R, method='glmnet')

Related

Getting transformed X values from OLS model using statsmodels

I am trying to do a linear regression. With the results I want to multiply each x by its own estimated coefficient: xi·βi.
However, I am doing a lot of transformations on xi.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I would need pow(Literacy, 2) and log_plus_1(Wealth). Since they go into the model, I was hoping to get them back out of it too, instead of re-transforming the data from the original dataset.
In R I would use res$model to get it.
The data is stored as attributes of the model, e.g. the design matrix is mod.exog, the dependent or response variable is mod.endog.
(I'm not sure I remember correctly the details of the following: The data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the data for the explanatory variables in prediction in the same way as the data was transformed in creating the model.
predict only returns the prediction, not the internally transformed exog.
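As a minimal sketch of the above, reusing mod from the question (attribute names as described, though they may vary across statsmodels versions):
design = mod.exog            # transformed design matrix as a numpy array
print(mod.exog_names)        # column names, e.g. 'pow(Literacy, 2)'
orig = mod.data.orig_exog    # the pandas DataFrame that patsy returned
print(orig.head())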

Output of a statsmodels regression

I would like to perform a simple linear regression using statsmodels. I've tried several different methods by now, but I just can't get it to work. The code that I have constructed doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the (5) lags of the returns. The DataFrame df13 contains the lags and also the direction for each observed date. I tried this code, and as I mentioned it doesn't give an error, but it says:
Optimization terminated successfully.
         Current function value: 0.682314
         Iterations 5
However, I would like to see the typical table with all the beta values, their significance etc.
Also, what would you say, since Direction is a binary variable may it be better to use a logit instead of a linear model? However, in the assignment it appeared as a linear model.
And lastly, I am sorry it's not displayed here correctly, but I don't know how to write it as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your DataFrames to numpy arrays. You can do this with:
Xnum = X.to_numpy()
Ynum = Y.to_numpy()
And try passing those to the regressors.

How to use LinearRegression with categorical variables in sklearn

I am trying to perform a speed comparison test between Python and R, and am struggling with an issue: LinearRegression under sklearn with categorical variables.
Code R:
# Start the clock!
ptm <- proc.time()
ptm
test_data = read.csv("clean_hold.out.csv")
# Regression Model
model_liner = lm(test_data$HH_F ~ ., data = test_data)
# Stop the clock
new_ptm <- proc.time() - ptm
Code Python:
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
start = time.time()
test_data = pd.read_csv("./clean_hold.out.csv")
x_train = [col for col in test_data.columns[1:] if col != 'HH_F']
y_train = ['HH_F']
model_linear = LinearRegression(normalize=False)
model_linear.fit(test_data[x_train], test_data[y_train])
but it does not work for me:
return X.astype(np.float32 if X.dtype == np.int32 else np.float64)
ValueError: could not convert string to float: Bee True
I tried another approach:
test_data = pd.read_csv("./clean_hold.out.csv").to_dict()
v = DictVectorizer(sparse=False)
X = v.fit_transform(test_data)
However, I caught another error:
File "C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 258, in transform
    Xa[i, vocab[f]] = dtype(v)
TypeError: float() argument must be a string or a number
I don't understand how I should resolve these issues in Python ...
Example of data:
http://screencast.com/t/hYyyu7nU9hQm
You have to do some encoding before using fit.
There are several classes that can be used:
LabelEncoder: turns your strings into incremental values
OneHotEncoder: uses the one-of-K algorithm to transform your strings into integers
I wanted to have a scalable solution, but didn't get any answer. I selected OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have a lot of different strings the matrix will grow very quickly and a lot of memory will be required. A sketch of that approach follows.
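Here is a sketch of one-hot encoding with the modern scikit-learn API (the column name HH_F is taken from the question; everything else is illustrative):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
test_data = pd.read_csv("./clean_hold.out.csv")
X = test_data.drop(columns=['HH_F'])
y = test_data['HH_F']
# one-hot encode the object (string) columns, pass numeric columns through
categorical = X.select_dtypes(include='object').columns
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough')
model = make_pipeline(preprocess, LinearRegression())
model.fit(X, y)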

OLS using statsmodels.formula.api versus statsmodels.api

Can anyone explain to me the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods.
What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on @stellasia's answer, because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model (which is kind of silly). In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formulas.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
import numpy as np
import statsmodels.api as sm
# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence or absence of an intercept:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api:
x1 = sm.add_constant(x1)
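A brief sketch of the corrected fit, reusing y1 and the augmented x1 from the question; the coefficients should then agree with the formula-API output shown earlier:
model2 = sm.OLS(y1, x1).fit()   # x1 now includes the constant column
print(model2.params)            # const ~7.0326, TV ~0.04754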
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.Logit was not converging.
My sm.formula.logit was converging, however.
The data going in was exactly the same.
I changed the solver method to 'newton' and sm.Logit converged as well.
Is it possible the two versions have different default solver methods?
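For reference, a sketch of selecting the optimizer explicitly (y and X here are placeholders for the data in this anecdote):
import statsmodels.api as sm
# fit() accepts a method argument; 'newton' is one of several optimizers,
# alongside e.g. 'bfgs' and 'lbfgs'
result = sm.Logit(y, X).fit(method='newton')
print(result.summary())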

What is the Python equivalent to R's predict function for linear models?

What is the Python equivalent of R's predict function for linear models?
I'm sure there is something in scipy that can help here, but is there an equivalent function?
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/predict.lm.html
Scipy has plenty of regression tools with predict methods; though IMO, pandas is the Python library that comes closest to replicating R's functionality, complete with predict methods. The following snippets in R and Python demonstrate the similarities.
R linear regression:
data(trees)
linmodel <- lm(Volume~., data = trees[1:20,])
linpred <- predict(linmodel, trees[21:31,])
plot(linpred, trees$Volume[21:31])
The same data set in Python, using pandas' ols:
import pandas as pd
from pandas.stats.api import ols
import matplotlib.pyplot as plt
trees = pd.read_csv('trees.csv')
linmodel = ols(y = trees['Volume'][0:20], x = trees[['Girth', 'Height']][0:20])
linpred = linmodel.predict(x = trees[['Girth', 'Height']][20:31])
plt.scatter(linpred,trees['Volume'][20:31])
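Note that pandas.stats.api.ols was removed in later pandas releases; here is a sketch of the modern route via statsmodels (assuming the same trees.csv), which also mirrors R's predict:
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
trees = pd.read_csv('trees.csv')
linmodel = smf.ols('Volume ~ Girth + Height', data=trees[0:20]).fit()
linpred = linmodel.predict(trees[['Girth', 'Height']][20:31])
plt.scatter(linpred, trees['Volume'][20:31])
plt.show()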
