I am a beginner in machine learning and Python. I am trying to apply an SVM to my data and compare the results between MATLAB and Python. I expected to get similar results, but I don't. I've tried to find a similar problem to mine, but either there isn't one or I missed it.
I have one training set and one test set for a binary classification problem.
First, I did an implementation in MATLAB using the built-in functions svmtrain and svmclassify.
load('sample.mat')
y_test = y_test';
y_train = y_train';
%% Training and Classification
svmStruct = svmtrain(X_train, y_train, 'boxconstraint', 0.1);
species = svmclassify(svmStruct, X_test);
%% Compute Accuracy
count = 0;
for i = 1:length(X_test)
    if species(i) == y_test(i)
        count = count + 1;
    end
end
count/length(X_test)*100
%% Plotting
scatter(time, X_test, [], species)
hold on
y_test(y_test==1) = 0.03;
y_test(y_test==0) = -0.03;
plot(time, y_test, 'g')
Here is the result on the test set (class A is classified in red, class B in blue, and the green line is the actual label of class A).
Second, I did the same thing in Python using the scikit-learn package.
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
from sklearn import svm
""" Load Data"""
LoadData = sio.loadmat('sample.mat')
X_test = LoadData['X_test'].reshape(-1, 1)
X_train = LoadData['X_train'].reshape(-1, 1)
y_test = LoadData['y_test'].T
y_train = LoadData['y_train'].T
time = LoadData['time']
""" Train Classifier """
clf = svm.LinearSVC(C=1.0)
clf.fit(X_train,y_train)
predict_X_test = clf.predict(X_test) # Predict test set
acc = clf.score(X_test, y_test) # Accuracy
""" Plotting """
plt.figure(1, figsize=(15, 8))
plt.scatter(time, X_test, c = predict_X_test)
y_test[y_test==1] = 0.03
y_test[y_test==0] = -0.03
plt.plot(time, y_test)
""" Plot Threshold """
w = clf.coef_
b = clf.intercept_
xx = np.linspace(300,610)
yy = -b/w*np.ones(xx.shape)
yy = yy.T
plt.plot(xx, yy, 'k-')
Here is the result in Python. The black line is the decision threshold and the red line is the actual label of class A.
Using Python, the threshold seems to be too high. I tried adjusting the C parameter, but it doesn't help at all.
Right now I have no idea what is wrong in my code. Any suggestion is appreciated.
EDIT
Here I did the validation curve shown in the link below. I think I did it correctly.
The blue line is the curve from Python and the red line is from MATLAB. The x axis is the C parameter over [0.001 0.01 0.1 1 10 100] (I used log(x)); the y axis is accuracy.
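For reference, here is a minimal sketch of how such a validation curve can be produced with scikit-learn's validation_curve helper; it assumes the X_train and y_train arrays loaded above and the same C grid.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import validation_curve

# Sweep C over the same grid and collect cross-validated accuracy.
C_range = [0.001, 0.01, 0.1, 1, 10, 100]
train_scores, valid_scores = validation_curve(
    LinearSVC(), X_train, y_train.ravel(),
    param_name="C", param_range=C_range,
    cv=5, scoring="accuracy")

# Mean validation accuracy per value of C.
print(np.mean(valid_scores, axis=1))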
Related
I created a very simple function to test XGBoost.
x is an array containing 1000 values evenly spaced between 0 and 7*np.pi.
y is simply 1 + 0.5*np.sin(x).
I split the dataset into 800 training rows and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB metrics
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting the testing data to have a good (low) RMSE, but it did not happen.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
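A minimal sketch of what that could look like, assuming a LightGBM release recent enough to support the linear_tree parameter and reusing the arrays built above:

from lightgbm import LGBMRegressor

# linear_tree=True fits a linear model in each leaf, so the ensemble can
# continue a trend instead of predicting a constant value per leaf.
lgbm_reg = LGBMRegressor(linear_tree=True, random_state=42)
lgbm_reg.fit(X_train, y_train)
y_lgbm = lgbm_reg.predict(X_test)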
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
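To see that effect, here is a minimal sketch of the same experiment with a deliberately lower-capacity model; the exact values of n_estimators and max_depth are illustrative, and X_train, y_train, X come from the code above.

from xgboost import XGBRegressor

# Fewer, shallower trees make the piecewise-constant predictions easier to see.
small_reg = XGBRegressor(n_estimators=10, max_depth=2, random_state=42)
small_reg.fit(X_train, y_train)
y_small = small_reg.predict(X)  # predict over the full range, as before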
I recently attended a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score
#Define the path for the file
path=r"C:\Users\H\Desktop\Files\Data.xlsx"
#Read the file into a dataframe ensuring to group by weeks
df=pd.read_excel(path, sheet_name = 0)
df=df.groupby(['Week']).sum()
df = df.reset_index()
#Define x and y
x=df['Week']
y=df['Payment Amount Total']
#Draw the scatter plot
plt.scatter(x, y)
plt.show()
#Now we draw the line of linear regression
#First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)
#We then create a function
def myfunc(x):
    # Below is y = mx + c
    return slope * x + intercept
#Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
#We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
#We print the value of r
print(r)
#We predict what the cost will be in week 23
print(myfunc(23))
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
a normal linear regression model
a train/test model
and compare the r values the two different models yield, as well as the predicted values they yield? Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported from the above example). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
#I display the training set:
plt.scatter(train_x, train_y)
plt.show()
#I display the testing set:
plt.scatter(test_x, test_y)
plt.show()
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
#Let's look at how well my training data fit a polynomial regression
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
#Now we want to test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
#Now we can use this model to predict new values:
#We predict what the total amount would be on the 23rd week:
print(mymodel(23))
You would be better off splitting into train and test sets using scikit-learn's helper:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Where X is your features dataframe and y is the column of your labels; test_size=0.2 stands for an 80% train / 20% test split.
BTW - the error you are describing could be because your dataframe has 80 rows or fewer, leaving x[80:] empty.
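A minimal sketch of how that could look with the weekly data from the question (the column names 'Week' and 'Payment Amount Total' and the polynomial degree are taken from the code above; the random_state is arbitrary):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X = df['Week']
y = df['Payment Amount Total']

# Hold out 20% of the weeks for testing.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the training weeks only, then evaluate on the held-out weeks.
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
print(r2_score(train_y, mymodel(train_x)))  # fit quality on training data
print(r2_score(test_y, mymodel(test_x)))    # generalization to unseen weeks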
I'm working on my first Linear Regression code using a Tech with Tim video (https://www.youtube.com/watch?v=45ryDIPHdGg) and have run into a snag. I'm using the UCI student data from here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
My initial model code ran fine. Then I iterated to find an optimal accuracy model, which was fine. Where it starts to go off the rails is that I tried to then inject those optimal coefficients into a new model, then run two predictions:
The very first model (pre-optimization loop)
The optimized model
Against the same x_test1 data set. To compare the two, I simply summed the squared difference between predicted and actual y values. Then I also recorded the final accuracy of both models.
I've done something wrong, because the accuracy of my new "optimized" model is the same as or lower than the very first model, and the difference values are very similar as well. I expected the optimized model to have much less error and higher accuracy.
Can someone help me to see the error? I suspect the error lies after the plot section of code. Thanks in advance, code below.
# Import libraries
import pandas as pd
import numpy as np
import sklearn
import pickle
import matplotlib.pyplot as plt
from sklearn import linear_model
from math import sqrt
from sklearn.linear_model import LinearRegression
from matplotlib import style
# from sklearn.utils import shuffle
# Read in Data
data = pd.read_csv("student-mat.csv", sep=";")
# Slice data to include only desired headings
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
# Define the attribute we are trying to predict; called "label".
# Others are "features" and used to predict label
predict = "G3"
# Create array of features and label
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
# Split data into training and testing data. 90% used for training, 10% testing
# Test size 0.1 = 10% of array size
x_train1, x_test1, y_train1, y_test1 = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
# Create 1st linear model and fit
linear = linear_model.LinearRegression()
linear.fit(x_train1, y_train1)
# Compute accuracy of model
acc = linear.score(x_test1, y_test1)
# Iterate for a given number of times (max_iter) to find an optimal accuracy value and record best coefficients
loop_num = 1
max_iter = 1000
best_acc = acc
best_coef = linear.coef_
best_int = linear.intercept_
acc_counter = [acc]
print("\nInitial Accuracy: %4.3f" % acc)
while loop_num < max_iter + 1:
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear2 = linear_model.LinearRegression()
    linear2.fit(x_train, y_train)
    acc = linear2.score(x_test, y_test)
    acc_counter.append(acc)
    print("\nAccuracy of run " + str(loop_num) + " is: %4.3f" % acc)
    if acc > best_acc:
        print("\n\tBetter accuracy found.")
        best_acc = acc
        best_coef = linear2.coef_
        best_int = linear2.intercept_
        print("Co: \n", linear2.coef_)
        print("Intercept: \n", linear2.intercept_)
    else:
        print("\n\tFit Discarded.")
    loop_num += 1
print("\nBest Acccuracy: \n%4.3f" % best_acc)
print("\nBest Coefficients: \n", best_coef)
print("\nBest Intercept: \n", best_int)
# Plot Accuracy over time
x_scale = []
for x in range(max_iter + 1):
    x_scale.append(x)
plt.plot(x_scale, acc_counter, color='green', linestyle='dashed', linewidth=3, marker='o',
markerfacecolor='blue', markersize=5)
ymax = max(acc_counter)
ymin = min(acc_counter)
xpos = acc_counter.index(ymax)
xmax = x_scale[xpos]
annot_max_acc = str(ymax)
plt.annotate('Max Accuracy = ' + annot_max_acc[0:4], xy=(xmax, ymax), xycoords='data', xytext=(.8, .95),
textcoords='axes fraction',
arrowprops=dict(facecolor='black', shrink=0.05), horizontalalignment='right', verticalalignment='top')
plt.ylim(ymin, 1.0)
plt.xlabel('Run Number')
plt.ylabel('Accuracy')
plt.title('Prediction Accuracy over Time')
plt.show()
# Create model with best coefficients from above
new_model = linear_model.LinearRegression()
new_model.intercept_ = best_int
new_model.coef_ = best_coef
# Predict y values for 1st model (not best) then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp = []
predictions = linear.predict(x_test1)
for x in range(len(predictions)):
    print(predictions[x], x_test1[x], y_test1[x])
    diff = sqrt((predictions[x] - y_test1[x])**2)
    print("\tDifference is ", diff)
    comp.append(diff)
print("\n\n\nBREAK")
print(comp)
print("\nSum of errors is ", sum(comp))
# Predict y values of best model (with optimal coefficients from above) using same x_test1 values as 1st model
# then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp2 = []
predictions_new_model = new_model.predict(x_test1)
for x in range(len(predictions_new_model)):
    print(predictions_new_model[x], x_test1[x], y_test1[x])
    diff2 = sqrt((predictions_new_model[x] - y_test1[x])**2)
    print("\tDifference is ", diff2)
    comp2.append(diff2)
print("\n\n\nBREAK")
print(comp2)
print("\nSum of errors is ", sum(comp2))
print("\n\n\nFirst model fit difference: ", sum(comp))
print("\nSecond model fit difference ", sum(comp2))
print('\n1st model score: ',linear.score(x_train1, y_train1))
print('\nBest model score: ',new_model.score(x_train1, y_train1))
Looking at your code, I've just realized that you're using the same model (LinearRegression) and not changing any hyperparameters in any of the runs, so there's actually no real improvement; the difference comes from the fact that you split the data twice without giving it a random seed, so the slight variation comes from that. To improve the model you have to change the hyperparameters of the estimator. See more here: hyperparameter tuning
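Plain LinearRegression has essentially nothing to tune, so as an illustration of what hyperparameter tuning could look like here, a minimal sketch using Ridge (a regularized variant of linear regression) with GridSearchCV; the alpha grid is illustrative and X, y are the arrays built in the question:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Search over the regularization strength with 5-fold cross-validation.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(x_train, y_train)

print("Best alpha:", search.best_params_)
print("Test score:", search.score(x_test, y_test))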
For the same dataset (here Bupa) and the same parameters I get different accuracies.
What did I overlook?
R implementation:
library(e1071)
data_file = "bupa.data"
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7~ .,data=dataset[train,],
type="C-classification",
kernel="linear",
cost=1,
cross=10)
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get accuracy: 0.94
but when I do the following in Python (scikit-learn):
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import datasets
import pandas as pd
from sklearn import svm
f = open("data/bupa.data")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get accuracy 0.67
Please help me.
I came across this post having the same issue - wildly different accuracy between scikit-learn and the e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters for use in predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling approach needs to be taken on both training and test data. I only thought to check this after encountering and reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling parameters (mean and standard deviation of the columns of Bupa). You can use these to appropriately scale your data in Python (see below for an idea). Alternatively, you can scale the entire dataset together in Python and then do the test/train split; either way should now give you identical predictions.
def manual_scale(a, means, sds):
    a1 = a - means
    a1 = a1 / sds
    return a1
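A short usage sketch of that helper: here the scaling parameters are computed from the Python training split (rather than copied out of the R model), and the identical transformation is applied to both splits. X_train, X_test, y_train, y_test are the arrays from the question.

import numpy as np
from sklearn import svm

# Compute scaling parameters from the training data only...
means = X_train.mean(axis=0)
sds = X_train.std(axis=0, ddof=1)  # ddof=1 matches R's scale()

# ...and apply the same transformation to both splits.
X_train_s = manual_scale(X_train, means, sds)
X_test_s = manual_scale(X_test, means, sds)

clf = svm.SVC(kernel='linear', C=1).fit(X_train_s, y_train)
print(clf.score(X_test_s, y_test))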
When using Support Vector Regression in Python/sklearn and R/e1071, both the x and y variables need to be scaled/unscaled.
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.svm
import sklearn.preprocessing
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):
    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow=len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow=len(x_test))
    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
                     y=ry_train,
                     kernel='radial',
                     cost=cost,
                     epsilon=epsilon,
                     gamma=gamma,
                     scale=scale)
    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))
    return ry_pred
# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.title(title)
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    plt.xlabel('observed')
    plt.ylabel('predicted')
    plt.legend(loc=0)
    return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1)... in R vs. 1/N... in Python where N is the sample size). However, for typical sample sizes, this should be negligible.
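A tiny illustration of that difference, assuming NumPy (R's scale() divides by N-1, while sklearn's StandardScaler divides by N):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(x, ddof=1))  # ~1.291, R-style sample standard deviation (N-1)
print(np.std(x, ddof=0))  # ~1.118, population standard deviation used by StandardScaler (N)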
I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I simply used the StandardScaler instance obtained from scaling the training set. It is important to use it just for transforming, not for fitting and transforming (like above), i.e.:
X_test = sc_X.transform(X_test)
This allowed me to obtain substantial agreement between the R and scikit-learn results.
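An alternative sketch that achieves the same thing while making the step hard to forget is to wrap the scaler and the SVM in a scikit-learn Pipeline; the kernel and C value here are illustrative, and X_train, y_train, X_test, y_test are assumed to be the usual splits.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The pipeline fits the scaler on the training data only and automatically
# re-applies the same transformation whenever predict/score is called.
model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))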
I am trying some code to make a learning curve:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
estimator = LinearRegression()
estimator.fit(X_train, y_train)
y_predicted = estimator.predict(X_test)
fig = plt.figure()
plt.xlabel("Data")
plt.ylabel("MSE")
plt.ylim(-4, 14)
plt.scatter(X_train.ravel(), y_train, color = 'green')#<<<<<<<ERROR HERE
plt.plot(X_test.ravel(), y_predicted, color = 'blue')
plt.show()
Results in :
ValueError: x and y must be the same size
Printing the X_train and y_train shapes outputs:
(1317, 11)
(1317,)
How can I fix this ?
The problem is that you are trying to plot an 11-dimensional variable (x) against a 1-dimensional variable (y). You say you are trying to plot a learning curve. This implies you are training a model iteratively and showing the error after each iteration (or every 5 iterations, or whatever). But that is not what you are plotting; you are training the model fully, then trying to plot the inputs (or whatever ravel() does to them) against the predictions. This won't work. You need to rethink what you are trying to achieve here.
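For what it's worth, scikit-learn has a helper for the usual meaning of a learning curve (score as a function of training set size); a minimal sketch, assuming X and y are the arrays from the question:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Score the model on increasingly large fractions of the data.
sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), label="train MSE")
plt.plot(sizes, -valid_scores.mean(axis=1), label="validation MSE")
plt.xlabel("Training set size")
plt.ylabel("MSE")
plt.legend()
plt.show()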
As already mentioned, you are trying to plot the response variable against 11 features on a 2D grid, which clearly isn't going to work. None of my following suggestions are going to achieve what you are attempting, since your model isn't learning iteratively; instead you split, trained, and tested. However, if you merely want to plot each successive feature against your response, you could do something like the following (I used pandas to organize my data):
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame(np.random.normal(0, 1, (1317, 11)),
                    index=pd.date_range(
                        end=dt.datetime.utcnow(),
                        periods=1317, freq='D'))
features = ['feature_{}'.format(x) for x in
            range(len(data.columns))]
data.columns = features
data['result'] = data.mean(axis=1) + np.random.randn()
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
for feature in features:
    # one random colour per feature
    ax.scatter(data[feature], data['result'], c=[np.random.rand(3)])
Although I would probably just scatter your model's predictions (y_predicted) against y_test to visually validate the model.
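A minimal sketch of that validation plot, assuming the y_predicted and y_test arrays from the question:

import matplotlib.pyplot as plt

# Predicted vs. actual: points close to the diagonal indicate a good fit.
plt.scatter(y_test, y_predicted)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='gray')
plt.xlabel("actual")
plt.ylabel("predicted")
plt.show()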