Different accuracy for Python (scikit-learn) and R (e1071)

For the same dataset (here BUPA) and the same parameters, I get different accuracies.
What did I overlook?
R implementation:
data_file = "bupa.data"
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7 ~ ., data = dataset[train,],
              type = "C-classification",
              kernel = "linear",
              cost = 1,
              cross = 10)
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get an accuracy of 0.94, but when I do the following in Python (scikit-learn):
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
import pandas as pd
from sklearn import svm, grid_search
f = open("data/bupa.data")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get an accuracy of 0.67. Please help me.

I came across this post having the same issue - wildly different accuracy between scikit-learn and the e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters for use in predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling approach needs to be taken on both training and test data. I only thought to check this after encountering and reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling parameters (the mean and standard deviation of the columns of Bupa). You can use these to appropriately scale your data in Python (see below for an idea). Alternatively, you can scale the entire dataset together in Python and then do the train/test split; either way should now give you identical predictions.
def manual_scale(a, means, sds):
    a1 = a - means
    a1 = a1 / sds
    return a1
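For instance, a rough sketch of how this could be used with the split from the question (assuming the X_train, X_test, y_train, y_test arrays produced by train_test_split there): derive the scaling parameters from the training data only and reuse them for the test data, mirroring what e1071 does internally.
import numpy as np
from sklearn import svm

# compute scaling parameters on the training data only
means = X_train.mean(axis=0)
sds = X_train.std(axis=0, ddof=1)  # ddof=1 matches R's sd()/scale()

# apply the same transformation to both train and test data
X_train_s = manual_scale(X_train, means, sds)
X_test_s = manual_scale(X_test, means, sds)

clf = svm.SVC(kernel='linear', C=1).fit(X_train_s, y_train)
print(clf.score(X_test_s, y_test))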

When using Support Vector Regression in Python/sklearn and R/e1071, both the x and y variables need to be scaled (and unscaled afterwards).
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.preprocessing
import sklearn.svm
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):
    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow = len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow = len(x_test))
    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
                     y=ry_train,
                     kernel='radial',
                     cost=cost,
                     epsilon=epsilon,
                     gamma=gamma,
                     scale=scale)
    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))
    return ry_pred
# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.title(title)
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    plt.xlabel('observed')
    plt.ylabel('predicted')
    plt.legend(loc=0)
    return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
               cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
               cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1) in R vs. 1/N in Python, where N is the sample size). However, for typical sample sizes, this should be negligible.
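To illustrate the difference between the two estimates:
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
print(a.std())         # ~1.118, 1/N denominator (what StandardScaler uses)
print(a.std(ddof=1))   # ~1.291, 1/(N-1) denominator (what R's scale()/sd() use)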

I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I simply used the StandardScaler instance obtained from scaling the training set. It is important to use it just for transforming, not for fitting and transforming (like above), i.e.:
X_test = sc_X.transform(X_test)
This allowed me to obtain substantial agreement between the R and scikit-learn results.
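A convenient way to avoid this kind of mistake (a sketch, assuming a train/test split named X_train, X_test, y_train, y_test as in the question) is to bundle the scaler and the classifier in a Pipeline, so the scaler is only ever fit on the training portion, also inside cross-validation:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
model.fit(X_train, y_train)          # scaler is fit on the training data only
print(model.score(X_test, y_test))   # test data is transformed with the training parameters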

Related

XGBoost can't predict a simple sinusoidal function

I created a very simple function to test XGBoost.
X is an array of 1000 values evenly spaced between 0 and 7*np.pi.
Y is simply "1 + 0.5*np.sin(x)".
I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB metrics
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting the test RMSE to be reasonably low, but it was not.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than the x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach, for example:
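A minimal sketch, assuming LightGBM >= 3.0 (which introduced the linear_tree booster parameter) and the X_train/y_train/X_test arrays from the question; the parameter is passed through to the underlying booster:
from lightgbm import LGBMRegressor

# linear_tree=True fits a linear model in each leaf, so the ensemble can
# continue a trend beyond the range of x seen during training
lgb_reg = LGBMRegressor(linear_tree=True, random_state=42)
lgb_reg.fit(X_train, y_train)
y_lgb_pred = lgb_reg.predict(X_test)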
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (it has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
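For illustration, a deliberately smaller model (the parameter values below are just illustrative, using the same data as in the question) makes the piecewise-constant, look-up-table behaviour easier to see:
from xgboost import XGBRegressor

XGB_small = XGBRegressor(n_estimators=10, max_depth=2, random_state=42)
XGB_small.fit(X_train, y_train)
y_small_pred = XGB_small.predict(X)  # predictions are still flat beyond the training range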

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First, I drop the deliveries column because of its high correlation with miles. Although gasprice should arguably be removed as well, I keep it so that I can perform multiple linear regression rather than simple linear regression.
Finally, I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the columns and coefficients and print them
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coefficients every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case #1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
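For example, a sketch of that formula interface (assuming dfafter contains the columns hours, miles and gasprice used in the question):
import statsmodels.formula.api as smf

# R-style formula: regress hours on miles and gasprice (intercept added automatically)
est = smf.ols('hours ~ miles + gasprice', data=dfafter).fit()
print(est.summary())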

How to split data into train and test? Is cross-validation possible? M-estimation or OLS?

I have 26 observations to which I want to apply a simple linear regression, but when I split the data into 70% train and 30% test, the results on the test data (R squared / p-value) are usually not good. Is it because the test sample is too small? Are 8 or 9 observations not enough? What should I do? There is no random state, so the algorithm chooses the data randomly.
I am also wondering how to choose between OLS and M-estimation (which is more resistant to outliers, which my data has; see below, since variable B is impacted by other variables besides A) for my dataset.
This is the code I have so far; I am also looking to do cross-validation on the training data.
Is that possible with the number of observations I have?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\PL32_PMM_03_09_2018_SP_Level.xlsx",'Sheet1')
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
print(data1)
X = data1.iloc[0:len(data1),1].values.reshape(-1, 1)
Y = data1.iloc[0:len(data1),2].values.reshape(-1, 1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.33)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
X2 = sm.add_constant(X_train)
est = sm.OLS(Y_train, X2)
est2 = est.fit()
print(est2.summary())
X3 = sm.add_constant(X_test)
est3 = sm.OLS(Y_test, X3)
est4 = est3.fit()
print(est4.summary())
This is an example of the data I have, and my goal is not to build a good predictive model but to describe the impact of variable A on B. Also, when analyzing the whole dataset together, the results are always better than when splitting the data.
Variable A Variable B
87.000 573.000
90.000 99.000
258.000 339.000
180.000 618.000
0 69.000
90.000 621.000
90.000 231.000
210.000 345.000
255.000 255.000
0 0
213.000 372.000
405.000 405.000
162.000 162.000
405.000 405.000
0 186.000
105.000 252.000
474.000 501.000
531.000 531.000
549.000 549.000
525.000 525.000
360.000 660.000
546.000 546.000
645.000 645.000
561.000 600.000
978.000 1.104.000
960.000 960.000
Also, I plotted the results using sklearn and analyzed them based on statsmodels. Can I assume that the plotted results correspond to the statsmodels values, or is there something to change in the code?
Y=df["Column name"]
X=df[[ "All other Columns"]]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)
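With only about 26 observations, one option (a sketch, using the X and Y arrays already defined in the question) is leave-one-out cross-validation instead of a single small hold-out split:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# each observation is used once as a test point; score each fold by squared error
scores = cross_val_score(LinearRegression(), X, Y.ravel(),
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
print(-scores.mean())  # average squared error over all leave-one-out folds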
Good luck

Need help with SVM in Python and MATLAB

I am a beginner in machine learning and Python. I am trying to apply an SVM to my data, comparing MATLAB and Python. I expected to get similar results, but I don't. I've tried to find similar problems to mine, but couldn't find any (or I missed them).
I have one training set and one test set for a binary classification problem.
Firstly, I did an implementation in MATLAB using the built-in functions svmtrain and svmclassify.
load('sample.mat')
y_test = y_test';
y_train = y_train';
%% Training and Classification
svmStruct = svmtrain(X_train, y_train, 'boxconstraint', 0.1);
species = svmclassify(svmStruct, X_test);
%% Compute Accuracy
count = 0;
for i = 1:length(X_test)
    if species(i) == y_test(i)
        count = count + 1;
    end
end
count/length(X_test)*100
%% Plotting
scatter(time, X_test, [], species)
hold on
y_test(y_test==1) = 0.03;
y_test(y_test==0) = -0.03;
plot(time, y_test, 'g')
Here is the result on the test set (class A classified in red, class B classified in blue; the green line is the actual label of class A).
Secondly, I did the same thing in Python using the scikit-learn package.
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
from sklearn import svm
""" Load Data"""
LoadData = sio.loadmat('sample.mat')
X_test = LoadData['X_test'].reshape(-1, 1)
X_train = LoadData['X_train'].reshape(-1, 1)
y_test = LoadData['y_test'].T
y_train = LoadData['y_train'].T
time = LoadData['time']
""" Train Classifier """
clf = svm.LinearSVC(C=1.0)
clf.fit(X_train,y_train)
predict_X_test = clf.predict(X_test) # Predict test set
acc = clf.score(X_test, y_test) # Accuracy
""" Plotting """
plt.figure(1, figsize=(15, 8))
plt.scatter(time, X_test, c = predict_X_test)
y_test[y_test==1] = 0.03
y_test[y_test==0] = -0.03
plt.plot(time, y_test)
""" Plot Threshold """
w = clf.coef_
b = clf.intercept_
xx = np.linspace(300,610)
yy = -b/w*np.ones(xx.shape)
yy = yy.T
plt.plot(xx, yy, 'k-')
Here is the result in Python. The black line is its threshold; the red line is the actual label of class A.
With Python, it seems the threshold is too high; even though I tried to adjust the C parameter, it doesn't help at all.
Right now I have no idea what is wrong in my code. Any suggestion is appreciated.
EDIT
Here is the validation curve I made (in the link below). I think I did it correctly.
The blue line is the curve from Python; the red line is from MATLAB. The x axis is the C parameter at [0.001, 0.01, 0.1, 1, 10, 100] (I used log(x)); the y axis is accuracy.

Plotting cross validation error for different degrees of polynomials

I am new to Python and want to plot the CV error for each fold and for each polynomial degree. The code below calculates error values for the different polynomial degrees and for each fold. Kindly guide me in this regard.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import KFold

kf = KFold(len(dF), n_folds=5)
e_test = []
orders = [2, 3]
dims = [6, 10]
for i, order in enumerate(orders):
    dF = getDataByDegree(d, order)
    error = []
    wTemp = np.empty(dims[i])
    wTemp.fill(0.001)
    for train_index, test_index in kf:
        x_train, x_test = dF[train_index], data['l'][train_index]
        y_train, y_test = dF[test_index], data['l'][test_index]
        w, x_error = gradientDes(wTemp, x_train, x_test)
        y_error = errorfun(w, y_train, y_test)
        error.append(y_error[0])      # one test error per fold
    e_test.append(error)              # one list of fold errors per polynomial order

# plot one line per fold: test error as a function of the polynomial order
fig, ax = plt.subplots()
for f in range(len(e_test[0])):
    ax.plot(orders, [fold_errors[f] for fold_errors in e_test],
            lw=2, label='Test Error - Fold %s' % str(f + 1))
ax.legend()
plt.show()
You are looking for what sklearn calls a validation curve. The validation_curve function lets you explore a range of values of a model hyperparameter while doing the CV for you.
See this example if you want to plot the errors.
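A rough sketch of what that could look like for a polynomial-degree hyperparameter (the pipeline, degree range, and the feature matrix X and target y below are illustrative assumptions, not taken from your code):
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degrees = [1, 2, 3, 4, 5]
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, test_scores = validation_curve(
    model, X, y,
    param_name='polynomialfeatures__degree',
    param_range=degrees,
    cv=5, scoring='neg_mean_squared_error')

# scores are negative MSE with one column per CV fold; average over folds and flip the sign
plt.plot(degrees, -train_scores.mean(axis=1), label='train error')
plt.plot(degrees, -test_scores.mean(axis=1), label='CV error')
plt.xlabel('polynomial degree')
plt.ylabel('MSE')
plt.legend()
plt.show()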
