I'm working on my first Linear Regression code using a Tech with Tim video (https://www.youtube.com/watch?v=45ryDIPHdGg) and have run into a snag. I'm using the UCI student data from here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
My initial model code ran fine, and so did the loop I wrote to search for a model with better accuracy. Where it starts to go off the rails is that I then tried to inject those best coefficients into a new model and run two sets of predictions against the same x_test1 data set:
The very first model (pre-optimization loop)
The optimized model
To compare the two, I simply summed the absolute differences between predicted and actual y values. Then I also recorded the final accuracy of both models.
I've done something wrong, because the accuracy of my new "optimized" model is the same as or lower than that of the very first model, and the summed differences are very similar as well. I expected the optimized model to have much less error and higher accuracy.
Can someone help me see the error? I suspect it lies after the plotting section of the code. Thanks in advance; code below.
# Import libraries
import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # explicit import so sklearn.model_selection.train_test_split below is guaranteed to resolve
import pickle
import matplotlib.pyplot as plt
from sklearn import linear_model
from math import sqrt
from sklearn.linear_model import LinearRegression
from matplotlib import style
# from sklearn.utils import shuffle
# Read in Data
data = pd.read_csv("student-mat.csv", sep=";")
# Slice data to include only desired headings
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
# Define the attribute we are trying to predict; called "label".
# Others are "features" and used to predict label
predict = "G3"
# Create array of features and label
X = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
# Split data into training and testing data. 90% used for training, 10% testing
# Test size 0.1 = 10% of array size
x_train1, x_test1, y_train1, y_test1 = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
# Create 1st linear model and fit
linear = linear_model.LinearRegression()
linear.fit(x_train1, y_train1)
# Compute accuracy of model
acc = linear.score(x_test1, y_test1)
# Iterate for a given number of times (max_iter) to find an optimal accuracy value and record best coefficients
loop_num = 1
max_iter = 1000
best_acc = acc
best_coef = linear.coef_
best_int = linear.intercept_
acc_counter = [acc]
print("\nInitial Accuracy: %4.3f" % acc)
while loop_num < max_iter + 1:
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear2 = linear_model.LinearRegression()
    linear2.fit(x_train, y_train)
    acc = linear2.score(x_test, y_test)
    acc_counter.append(acc)
    print("\nAccuracy of run " + str(loop_num) + " is: %4.3f" % acc)
    if acc > best_acc:
        print("\n\tBetter accuracy found.")
        best_acc = acc
        best_coef = linear2.coef_
        best_int = linear2.intercept_
        print("Co: \n", linear2.coef_)
        print("Intercept: \n", linear2.intercept_)
    else:
        print("\n\tFit Discarded.")
    loop_num += 1
print("\nBest Acccuracy: \n%4.3f" % best_acc)
print("\nBest Coefficients: \n", best_coef)
print("\nBest Intercept: \n", best_int)
# Plot Accuracy over time
x_scale = []
for x in range(max_iter + 1):
    x_scale.append(x)
plt.plot(x_scale, acc_counter, color='green', linestyle='dashed', linewidth=3, marker='o',
markerfacecolor='blue', markersize=5)
ymax = max(acc_counter)
ymin = min(acc_counter)
xpos = acc_counter.index(ymax)
xmax = x_scale[xpos]
annot_max_acc = str(ymax)
plt.annotate('Max Accuracy = ' + annot_max_acc[0:4], xy=(xmax, ymax), xycoords='data', xytext=(.8, .95),
textcoords='axes fraction',
arrowprops=dict(facecolor='black', shrink=0.05), horizontalalignment='right', verticalalignment='top')
plt.ylim(ymin, 1.0)
plt.xlabel('Run Number')
plt.ylabel('Accuracy')
plt.title('Prediction Accuracy over Time')
plt.show()
# Create model with best coefficients from above
new_model = linear_model.LinearRegression()
new_model.intercept_ = best_int
new_model.coef_ = best_coef
# Predict y values for 1st model (not best) then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp = []
predictions = linear.predict(x_test1)
for x in range(len(predictions)):
    print(predictions[x], x_test1[x], y_test1[x])
    diff = sqrt((predictions[x] - y_test1[x])**2)
    print("\tDifference is ", diff)
    comp.append(diff)
print("\n\n\nBREAK")
print(comp)
print("\nSum of errors is ", sum(comp))
# Predict y values of best model (with optimal coefficients from above) using same x_test1 values as 1st model
# then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp2 = []
predictions_new_model = new_model.predict(x_test1)
for x in range(len(predictions_new_model)):
    print(predictions_new_model[x], x_test1[x], y_test1[x])
    diff2 = sqrt((predictions_new_model[x] - y_test1[x])**2)
    print("\tDifference is ", diff2)
    comp2.append(diff2)
print("\n\n\nBREAK")
print(comp2)
print("\nSum of errors is ", sum(comp2))
print("\n\n\nFirst model fit difference: ", sum(comp))
print("\nSecond model fit difference ", sum(comp2))
print('\n1st model score: ',linear.score(x_train1, y_train1))
print('\nBest model score: ',new_model.score(x_train1, y_train1))
Looking at your code, I've just realized that you're using the same model (LinearRegression) and not changing any hyperparameters between runs, so there's actually no improvement to be found; the variation comes from the fact that you split the data each time without giving it a random seed, so the slight differences come from the different splits. To improve the model you have to change the hyperparameters of the estimator. See more here: hyperparameter tuning
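To make the comparison reproducible and to actually have something to tune, a minimal sketch (mine, not part of the original post) would fix the split with random_state and grid-search a regularized variant such as Ridge; Ridge and the alpha grid below are my own assumptions, since plain LinearRegression has no meaningful hyperparameters to search. X and y are the arrays already built in the question:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
# Fix the split so every candidate model is scored on the same test set.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Hypothetical grid over the regularization strength of Ridge regression.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(x_train, y_train)
print("Best alpha:", search.best_params_["alpha"])
print("Test R^2:", search.best_estimator_.score(x_test, y_test))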
Related
I created a very simple function to test XGBoost.
X is an array of 1000 values evenly spaced between 0 and 7*np.pi.
Y is simply "1 + 0.5*np.sin(x)".
I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True (test x values then fall inside the training range)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB metrics
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting the testing data to have a low RMSE as well, but it did not happen.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
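For example, a minimal sketch (my own, not from the answer) of that approach on the same data might look like the following; it assumes a LightGBM version (3.x or later) that supports linear_tree and reuses X_train, y_train and X from the question:
from lightgbm import LGBMRegressor
# linear_tree=True fits a linear model in each leaf, so predictions can
# continue a trend beyond the target range seen during training.
lgbm_linear = LGBMRegressor(linear_tree=True, random_state=42)
lgbm_linear.fit(X_train, y_train)
y_lgbm = lgbm_linear.predict(X)  # predictions over the full x range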
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (the defaults are n_estimators = 100 and max_depth = 6). If you decrease those parameter values, the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, it basically functions as a giant look-up table. When extrapolating towards +Inf, the "closest match" is at x = 17.5; when extrapolating towards -Inf, the "closest match" is at x = 0.0.
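As a rough illustration (my own sketch, reusing X_train, y_train and X from the question), shrinking those two parameters makes the step-wise, look-up-table behaviour easier to see:
from xgboost import XGBRegressor
# Far fewer, shallower trees: the fitted curve becomes visibly jagged,
# but predictions are still constant outside the training range.
small_xgb = XGBRegressor(n_estimators=10, max_depth=2, random_state=42)
small_xgb.fit(X_train, y_train)
y_small = small_xgb.predict(X)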
I recently attended a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score
#Define the path for the file
path=r"C:\Users\H\Desktop\Files\Data.xlsx"
#Read the file into a dataframe ensuring to group by weeks
df=pd.read_excel(path, sheet_name = 0)
df=df.groupby(['Week']).sum()
df = df.reset_index()
#Define x and y
x=df['Week']
y=df['Payment Amount Total']
#Draw the scatter plot
plt.scatter(x, y)
plt.show()
#Now we draw the line of linear regression
#First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)
#We then create a function
def myfunc(x):
    #Below is y = mx + c
    return slope * x + intercept
#Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
#We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
#We print the value of r
print(r)
#We predict what the cost will be in week 23
print(myfunc(23))
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
a normal linear regression model
a train/test model
and compare the r values the two different models yield, as well as their predicted values? Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported from the above example). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
#I display the training set:
plt.scatter(train_x, train_y)
plt.show()
#I display the testing set:
plt.scatter(test_x, test_y)
plt.show()
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
#Let's look at how well my training data fit in a polynomial regression?
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
#Now we want to test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
#Now we can use this model to predict new values:
#We predict what the total amount would be on the 23rd week:
print(mymodel(23))
You'd be better off splitting into train and test sets using the sklearn method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Where X is your features dataframe and y is the column of your labels. 0.2 stands for 80% train and 20% test.
BTW, the error you are describing could be because your dataframe has 80 rows or fewer, leaving x[80:] empty.
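To connect this back to your code, here is a minimal sketch (mine; the column names are taken from your question) of fitting the same degree-4 polynomial on the training portion only and scoring it on the held-out portion:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X = df['Week']                      # feature column
y = df['Payment Amount Total']      # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fit only on the training data.
mymodel = np.poly1d(np.polyfit(X_train, y_train, 4))
# Evaluate on data the model has never seen.
print(r2_score(y_test, mymodel(X_test)))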
I am new to statistical modelling, so please forgive me if I am mistaken about this.
I am currently working on a function in Python which will report the accuracy score of a logistic regression model on the test data set. The user will have the flexibility to supply model parameters/coefficients other than the ones generated by training the model (part of the requirement). I have functional code which updates the coefficients, but the accuracy/predictions on the test data set stay the same no matter how different the model parameters I supply are. My understanding is that the score on the test set should change if I change the model coefficients?
I am using the statsmodels library to make things easier for me and following this link. Can someone please help me understand what I am missing? Below is the code.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
data = pd.read_csv("E:\\Dev\\testing\\rawdata.txt", header=None,
names=['Exam1', 'Exam2', 'Admitted'])
X = data.copy() # our training data
y = X.Admitted.copy() # copy “y” column values out
X.drop(['Admitted'], axis=1, inplace=True) # then, drop y column
# manually add the intercept
X['intercept'] = 1.0 # so we don't need to use sm.add_constant every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = sm.Logit(y_train, X_train)
result = model.fit()
print("old parameters :\n" + str(list(result.params)))
#New parameters supplied
mdict = { 'Exam1':10000000.2234, 'Exam2':1.1233423, 'intercept':2313.423 }
result.params = mdict
print("New parameters: \n"+str(result.params))
def logitPredict(modelParams, X, threshold):
    probabilities = modelParams.predict(X)
    return [1 if x >= threshold else 0 for x in probabilities]
predictions = logitPredict(result, X_test, .5)
accuracy = np.mean(predictions == y_test)
#accuracy always remains same as train model
print ('accuracy = {0}%'.format(accuracy*100) )
#test sample
myExams = pd.DataFrame({'Exam1': [40.], 'Exam2': [78.], 'intercept': [1.]})
myExams
print ('Your probability = {0}%'.format(result.predict(myExams)[0]*100))
In Python sklearn ensemble library, I want to train my data using some boosting method (say Adaboost). As I would like to know the optimal number of estimators, I plan to do a cv with different number of estimators each time. However, it seems doing it the following way is redundant:
for n in [50, 100, 150, 200, 250, 300]:
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=n)
    cross_val_score(model, x, y, cv=5)
Because in AdaBoost, once I have trained the classifier with n_estimators=50, the first 50 classifiers and their weights don't change when I move on to n_estimators=100. I wonder if there is a way to start training directly with the 51st weak learner in this case.
It is possible to use inheritance to make a "hack" of AdaBoostClassifier that doesn't retrain estimators and is compatible with many cross-validation functions in sklearn (must be cross-validation that doesn't shuffle data).
If you look at the source code in sklearn.ensemble.weight_boosting.py, you can see that you can get away with not needing to retrain estimators if you properly wrap the behavior of AdaBoostClassifier.fit() and AdaBoostClassifier._boost().
The problem with cross validation functions is that they make clones of the original estimator using sklearn.base.clone(), and in turn the function sklearn.base.clone() makes deep copies of the estimator's parameters. The deep copy nature makes it impossible for the estimator to "remember" its estimators between different cross-validation runs (clone() copies the contents of a reference and not the reference itself). The only way to do so (at least the only way I can think of) is to use global state to keep track of old estimators between runs. The catch here is that you have to compute a hash of your X features, which could be expensive!
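To see why global state is needed, here is a quick sketch (mine, with placeholder X, y) showing that a clone keeps only constructor parameters and forgets fitted estimators:
from sklearn.base import clone
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=5).fit(X, y)   # X, y: any training data
ada_clone = clone(ada)                               # copies get_params() only
print(hasattr(ada, 'estimators_'))        # True: fitted state is present
print(hasattr(ada_clone, 'estimators_'))  # False: fitted state was not copied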
Anyways, here is the hack to AdaBoostClassifier itself:
'''
adaboost_hack.py
Make a "hack" of AdaBoostClassifier in sklearn.ensemble.weight_boosting.py
that doesn't need to retrain estimators and is compatible with many sklearn
cross validation functions.
'''
import copy
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.base import clone
# Used to hold important variables between runs of cross validation.
# Note that sklearn cross validation functions use sklearn.base.clone()
# to make copies of the estimator sent to it as a function. The function
# sklearn.base.clone() makes deep copies of parameters of an estimator, so
# the only way to provide a way to remember previous estimators between
# cross validation runs is to use a global variable.
#
# We will use hash values of the split of X[:, 0] as keys for remembering
# previous estimators of a cv fold. Note, you can NOT use cross validators
# that randomly shuffle the data before splitting. This will cause different
# hashes.
kfold_hash = {}
class WarmRestartAdaBoostClassifier(AdaBoostClassifier):
    '''
    Keep track of old estimators, estimator weights, the estimator errors, and
    the next to last sample weight seen.
    Note that AdaBoostClassifier._boost() does NOT boost the last seen sample
    weight. A simple fix to this is to drop the last estimator and retrain it.
    Wrap AdaBoostClassifier.fit() to decide whether to throw away estimators or add estimators
    depending on the current number of estimators vs the number of old estimators.
    Also look at the possibility of using the global kfold_hash to get old values if
    use_kfold_hash == True.
    Wrap AdaBoostClassifier._boost() with behavior to record the next to last sample weight.
    '''
    def __init__(self,
                 base_estimator=None,
                 n_estimators=50,
                 learning_rate=1.,
                 algorithm='SAMME.R',
                 random_state=None,
                 next_to_last_sample_weight = None,
                 old_estimators_ = [],
                 use_kfold_hash = False):
        AdaBoostClassifier.__init__(self, base_estimator, n_estimators, learning_rate,
                                    algorithm, random_state)
        self.next_to_last_sample_weight = next_to_last_sample_weight
        self._last_sample_weight = None
        self.old_estimators_ = old_estimators_
        self.use_kfold_hash = use_kfold_hash
    def _boost(self, iboost, X, y, sample_weight, random_state):
        '''
        Record the sample weight.
        Parameters and return behavior same as that of AdaBoostClassifier._boost() as
        seen in sklearn.ensemble.weight_boosting.py
        Parameters
        ----------
        iboost : int
            The index of the current boost iteration.
        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. COO, DOK, and LIL are converted to CSR.
        y : array-like of shape = [n_samples]
            The target values (class labels).
        sample_weight : array-like of shape = [n_samples]
            The current sample weights.
        random_state : RandomState
            The current random number generator
        Returns
        -------
        sample_weight : array-like of shape = [n_samples] or None
            The reweighted sample weights.
            If None then boosting has terminated early.
        estimator_weight : float
            The weight for the current boost.
            If None then boosting has terminated early.
        error : float
            The classification error for the current boost.
            If None then boosting has terminated early.
        '''
        fit_info = AdaBoostClassifier._boost(self, iboost, X, y, sample_weight, random_state)
        sample_weight, _, _ = fit_info
        self.next_to_last_sample_weight = self._last_sample_weight
        self._last_sample_weight = sample_weight
        return fit_info
    def fit(self, X, y):
        hash_X = None
        if self.use_kfold_hash:
            # Use a hash of X features in this kfold to access the global information
            # for this kfold.
            hash_X = hash(bytes(X[:, 0]))
            if hash_X in kfold_hash.keys():
                self.old_estimators_ = kfold_hash[hash_X]['old_estimators_']
                self.next_to_last_sample_weight = kfold_hash[hash_X]['next_to_last_sample_weight']
                self.estimator_weights_ = kfold_hash[hash_X]['estimator_weights_']
                self.estimator_errors_ = kfold_hash[hash_X]['estimator_errors_']
        # We haven't done any fits yet.
        if not self.old_estimators_:
            AdaBoostClassifier.fit(self, X, y)
            self.old_estimators_ = self.estimators_
        # The case that we throw away estimators.
        elif self.n_estimators < len(self.old_estimators_):
            self.estimators_ = self.old_estimators_[:self.n_estimators]
            self.estimator_weights_ = self.estimator_weights_[:self.n_estimators]
            self.estimator_errors_ = self.estimator_errors_[:self.n_estimators]
        # The case that we add new estimators.
        elif self.n_estimators > len(self.old_estimators_):
            n_more = self.n_estimators - len(self.old_estimators_)
            self.fit_more(X, y, n_more)
        # Record information in the global hash if necessary.
        if self.use_kfold_hash:
            kfold_hash[hash_X] = {'old_estimators_' : self.old_estimators_,
                                  'next_to_last_sample_weight' : self.next_to_last_sample_weight,
                                  'estimator_weights_' : self.estimator_weights_,
                                  'estimator_errors_' : self.estimator_errors_}
        return self
    def fit_more(self, X, y, n_more):
        '''
        Fits additional estimators.
        '''
        # Since AdaBoostClassifier._boost() doesn't boost the last sample weight, we retrain the last estimator with
        # its input sample weight.
        self.n_estimators = n_more + 1
        if self.old_estimators_ is None:
            raise Exception('Should have already fit estimators before calling fit_more()')
        self.old_estimators_ = self.old_estimators_[:-1]
        old_estimator_weights = self.estimator_weights_[:-1]
        old_estimator_errors = self.estimator_errors_[:-1]
        sample_weight = self.next_to_last_sample_weight
        AdaBoostClassifier.fit(self, X, y, sample_weight)
        self.old_estimators_.extend(self.estimators_)
        self.estimators_ = self.old_estimators_
        self.n_estimators = len(self.estimators_)
        self.estimator_weights_ = np.concatenate([old_estimator_weights, self.estimator_weights_])
        self.estimator_errors_ = np.concatenate([old_estimator_errors, self.estimator_errors_])
And here is an example that lets you compare timings/accuracies of the hack against the original AdaBoostClassifier. Note that testing the hack will take increasing time as we add estimators, but the training will not. I found the hack to run much faster than the original, but I'm not hashing a large number of X samples.
'''
example.py
Test the AdaBoost hack.
'''
import time # Used to get timing info.
import adaboost_hack
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier # We will use stumps for our classifiers.
from sklearn.ensemble import AdaBoostClassifier # Used to compare hack to original.
from sklearn.model_selection import (cross_val_score, KFold)
from sklearn.metrics import accuracy_score
my_random = np.random.RandomState(0) # For consistent results.
nSamples = 2000
# Make some sample data.
X = my_random.uniform(size = (nSamples, 2))
y = np.zeros(len(X), dtype = int)
# Decision boundary is the unit circle.
in_class = X[:, 0]**2 + X[:, 1]**2 > 1
y = np.zeros(len(X), dtype = int)
y[in_class] = 1
# Add some random error.
error_rate = 0.01
to_flip = my_random.choice(np.arange(len(y)), size = int(error_rate * len(y)), replace = False)
y[to_flip] = 1 - y[to_flip]
# Plot the data.
plt.scatter(X[:, 0], X[:, 1], c = y)
plt.title('Simulated Data')
plt.show()
# Make our hack solution. Initially do 2 estimators.
# Train the hack without testing. Should find nearly constant time per training session.
print('Training hack without testing.')
ada_boost_hack = adaboost_hack.WarmRestartAdaBoostClassifier(DecisionTreeClassifier(max_depth = 1,
random_state = my_random),
n_estimators = 1,
random_state = my_random)
nFit = 50
times = []
for i in range(nFit):
    times.append(time.time())
    ada_boost_hack.n_estimators += 1
    ada_boost_hack.fit(X, y)
def get_differences(times):
    times = np.array(times)
    return times[1:] - times[:-1]
times_per_train = {'hack no test' : get_differences(times)}
# Now look at running tests while training the hack. Should have small linear growth between
# in time per training session.
print('Training hack with testing.')
ada_boost_hack = adaboost_hack.WarmRestartAdaBoostClassifier(DecisionTreeClassifier(max_depth = 1,
random_state = my_random),
n_estimators = 1,
random_state = my_random)
times = []
scores = []
for i in range(nFit):
    times.append(time.time())
    ada_boost_hack.n_estimators += 1
    ada_boost_hack.fit(X, y)
    y_predict = ada_boost_hack.predict(X)
    new_score = accuracy_score(y, y_predict)
    scores.append(new_score)
plt.plot(scores)
plt.title('Training scores for hack')
plt.ylabel('Accuracy')
plt.show()
times_per_train['hack with test'] = get_differences(times)
print('Now training hack with cross validation')
ada_boost_hack = adaboost_hack.WarmRestartAdaBoostClassifier(DecisionTreeClassifier(max_depth = 1,
random_state = my_random),
n_estimators = 1,
random_state = my_random,
use_kfold_hash = True)
# Now try cross_val_score().
scores = []
times = []
# We use KFold to make sure the hashes of X features of each fold are
# the same between each run.
for i in range(1, nFit + 1):
    ada_boost_hack.set_params(n_estimators = i)
    new_scores = cross_val_score(ada_boost_hack, X, y, cv = KFold(3))
    scores.append(new_scores)
    times.append(time.time())
def plot_cv_scores(scores):
    scores = np.array(scores)
    plt.plot(scores.mean(axis = 1))
    plt.plot(scores.mean(axis = 1) + scores.std(axis = 1) * 2, color = 'red')
    plt.plot(scores.mean(axis = 1) - scores.std(axis = 1) * 2, color = 'red')
    plt.ylabel('Accuracy')
plot_cv_scores(scores)
plt.title('Cross validation scores for hack')
plt.show()
times_per_train['hack cross validation'] = get_differences(times)
# Double check that kfold_hash only has 3 keys since we used cv = 3.
print('adaboost_hack.keys() = ', adaboost_hack.kfold_hash.keys())
# Now get timings for original classifier.
print('Now doing cross validations of original')
ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1,
random_state = np.random.RandomState(0)),
n_estimators = 1,
random_state = np.random.RandomState(0))
times = []
scores = []
# We use KFold to make sure the hashes of X features of each fold are
# the same between each run.
for i in range(1, nFit + 1):
    ada_boost.set_params(n_estimators = i)
    new_scores = cross_val_score(ada_boost, X, y, cv = KFold(3))
    scores.append(new_scores)
    times.append(time.time())
plot_cv_scores(scores)
plt.title('Cross validation scores for original')
plt.show()
times_per_train['original cross validation'] = get_differences(times)
# Plot all of the timing data.
for key in times_per_train.keys():
    plt.plot(times_per_train[key])
plt.title('Time per training or cv score')
plt.ylabel('Time')
plt.xlabel('nth training or cv score')
plt.legend(times_per_train.keys())
plt.show()
You can fit all 300 estimators and then use AdaBoostClassifier.staged_predict() to track how the error rate depends on the number of estimators. However, you will have to do the cross-validation splits yourself; I don't think it is compatible with cross_val_score().
For example,
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier # We will use simple stumps for individual estimators in AdaBoost.
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
nSamples = {'train' : 2000, 'test' : 1000}
X = np.random.uniform(size = (nSamples['train'] + nSamples['test'], 2))
# Decision boundary is the unit circle.
in_class = X[:, 0]**2 + X[:, 1]**2 > 1
y = np.zeros(len(X), dtype = int)
y[in_class] = 1
# Add some random error.
error_rate = 0.01
to_flip = np.random.choice(np.arange(len(y)), size = int(error_rate * len(y)), replace = False)
y[to_flip] = 1 - y[to_flip]
# Split training and test.
X = {'train' : X[:nSamples['train']],
'test' : X[nSamples['train']:]}
y = {'train' : y[:nSamples['train']],
'test' : y[nSamples['train']:]}
# Make AdaBoost Classifier.
max_estimators = 50
ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1, # Just a stump.
random_state = np.random.RandomState(0)),
n_estimators = max_estimators,
random_state = np.random.RandomState(0))
# Fit all estimators.
ada_boost.fit(X['train'], y['train'])
# Get the test accuracy for each stage of prediction.
scores = {'train' : [], 'test' : []}
for y_predict_train, y_predict_test in zip(ada_boost.staged_predict(X['train']),
                                            ada_boost.staged_predict(X['test'])):
    scores['train'].append(accuracy_score(y['train'], y_predict_train))
    scores['test'].append(accuracy_score(y['test'], y_predict_test))
# Plot the results.
n_estimators = range(1, len(scores['train']) + 1)
for key in scores.keys():
    plt.plot(n_estimators, scores[key])
plt.title('Staged Scores')
plt.ylabel('Accuracy')
plt.xlabel('N Estimators')
plt.legend(scores.keys())
plt.show()
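If you do want cross-validated scores per number of estimators, a minimal sketch (mine, reusing the toy data above) of doing the splits yourself with KFold and staged_predict could look like this:
from sklearn.model_selection import KFold
X_all = np.concatenate([X['train'], X['test']])
y_all = np.concatenate([y['train'], y['test']])
cv_scores = np.zeros(max_estimators)
kf = KFold(n_splits = 3)
for train_idx, test_idx in kf.split(X_all):
    fold_model = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1),
                                    n_estimators = max_estimators)
    fold_model.fit(X_all[train_idx], y_all[train_idx])
    # staged_predict yields test predictions after 1, 2, ..., max_estimators stages.
    for i, y_pred in enumerate(fold_model.staged_predict(X_all[test_idx])):
        cv_scores[i] += accuracy_score(y_all[test_idx], y_pred) / kf.get_n_splits()
plt.plot(range(1, max_estimators + 1), cv_scores)
plt.title('Cross-Validated Staged Scores')
plt.xlabel('N Estimators')
plt.ylabel('Accuracy')
plt.show()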
I am running a Logistic Regression and would like to plot the Learning Curve of this to get a feel for the data. How can I do this ? Here is my code thus far :
from sklearn import metrics,preprocessing,cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import pandas as p
import numpy as np  # needed for np.genfromtxt / np.array below
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')
print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:,2])
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1]
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
C=1, fit_intercept=True, intercept_scaling=1.0,
class_weight=None, random_state=None)
X_all = traindata + testdata
lentrain = len(traindata)
print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))
print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
What I would like to create is something like this, so I can have a better understanding of what is going on :
Can anyone help me with this please?
Not quite as general as it should be, but it'll do the job with a little fiddling on your end.
from matplotlib import pyplot as plt
from sklearn import metrics
import numpy as np
def data_size_response(model,trX,teX,trY,teY,score_func,prob=True,n_subsets=20):
    train_errs,test_errs = [],[]
    subset_sizes = np.exp(np.linspace(3,np.log(trX.shape[0]),n_subsets)).astype(int)
    for m in subset_sizes:
        model.fit(trX[:m],trY[:m])
        if prob:
            train_err = score_func(trY[:m],model.predict_proba(trX[:m]))
            test_err = score_func(teY,model.predict_proba(teX))
        else:
            train_err = score_func(trY[:m],model.predict(trX[:m]))
            test_err = score_func(teY,model.predict(teX))
        print "training error: %.3f test error: %.3f subset size: %.3f" % (train_err,test_err,m)
        train_errs.append(train_err)
        test_errs.append(test_err)
    return subset_sizes,train_errs,test_errs
def plot_response(subset_sizes,train_errs,test_errs):
    plt.plot(subset_sizes,train_errs,lw=2)
    plt.plot(subset_sizes,test_errs,lw=2)
    plt.legend(['Training Error','Test Error'])
    plt.xscale('log')
    plt.xlabel('Dataset size')
    plt.ylabel('Error')
    plt.title('Model response to dataset size')
    plt.show()
model = # put your model here
score_func = # put your scoring function here
response = data_size_response(model,trX,teX,trY,teY,score_func,prob=True)
plot_response(*response)
The data_size_response function takes a model (in your case an instantiated LR model), a pre-split dataset (train/test X and Y arrays; you can use the train_test_split function in sklearn to generate these), and a scoring function as input; it iterates through your dataset, training on n exponentially spaced subsets, and returns the "learning curve". There is also a plotting function for visualizing this response.
I would have liked to use cross_val_score as in your example, but it would require modifying the sklearn source to get back training scores in addition to the test scores it already provides. The prob argument controls whether to use the model's predict_proba or predict method, which is necessary for certain model/scoring-function combinations, e.g. roc_auc_score.
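For example, a minimal sketch (mine; the split and the AUC wrapper are assumptions, not part of the answer) of wiring this up with your LogisticRegression and roc_auc_score might look like:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# roc_auc_score expects probabilities of the positive class, while
# predict_proba returns one column per class, hence this small wrapper.
def auc_score(y_true, y_prob):
    return roc_auc_score(y_true, y_prob[:, 1])
# X and y stand in for your tf-idf features and labels.
trX, teX, trY, teY = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(penalty='l2', C=1)
response = data_size_response(model, trX, teX, trY, teY, auc_score, prob=True)
plot_response(*response)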
Example plot on a subset of the MNIST dataset:
Let me know if you have any questions!