Sklearn.pipeline producing incorrect result - python

I am trying to construct a pipeline with a StandardScaler() and LogisticRegression(). I get different results when I code it with and without the pipeline. Here's my code without the pipeline:
clf_LR = linear_model.LogisticRegression()
scalar = StandardScaler()
X_train_std = scalar.fit_transform(X_train)
X_test_std = scalar.fit_transform(X_test)
clf_LR.fit(X_train_std, y_train)
print('Testing score without pipeline: ', clf_LR.score(X_test_std, y_test))
My code with pipeline:
pipe_LR = Pipeline([('scaler', StandardScaler()),
                    ('classifier', linear_model.LogisticRegression())])
pipe_LR.fit(X_train, y_train)
print('Testing score with pipeline: ', pipe_LR.score(X_test, y_test))
Here is my result:
Testing score with pipeline: 0.821917808219178
Testing score without pipeline: 0.8767123287671232
While trying to debug the problem, it seems the data is being standardized. But the result with pipeline matches the result of training the model on my original X_train data (without applying StandardScaler()).
clf_LR_orig = linear_model.LogisticRegression()
clf_LR_orig.fit(X_train, y_train)
print('Testing score without Standardization: ', clf_LR_orig.score(X_test, y_test))
Testing score without Standardization: 0.821917808219178
Is there something I am missing in the construction of the pipeline?
Thanks very much!

As szymon-bednorz commented, generally we don't call fit_transform on the test data; instead we use fit_transform(X_train) and transform(X_test). This works well when your training and test data come from the same distribution and X_train is larger than X_test.
Further, as you found while debugging, the fact that fitting through the pipeline gives the same accuracy as fitting logistic regression on the unscaled data hints that X_train and X_test may already be scaled, although I am not sure about this.
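For reference, here is a minimal sketch of the non-pipeline version with the scaler fitted on the training data only (it assumes the same X_train, X_test, y_train, y_test from the question):
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model

# Fit the scaler on the training data only, then reuse the learned
# mean and variance to transform the test data.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

clf_LR = linear_model.LogisticRegression()
clf_LR.fit(X_train_std, y_train)
print('Testing score without pipeline: ', clf_LR.score(X_test_std, y_test))
This should now match the pipeline result, because Pipeline.fit() calls fit_transform() on the scaler with the training data, while Pipeline.score() only calls transform() on the test data.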

Related

why r2_score is quite different between train_test_split and pipeline cross_val_score?

I wonder why the r2_score is so different between train_test_split and cross_val_score on the pipeline. I suspected it's because the model can see the unknown words through CountVectorizer() in the pipeline, but based on the concept of a Pipeline, CountVectorizer() should only be fitted on the training split inside cross_val_score, shouldn't it?
pipe = Pipeline([('Vect', CountVectorizer()), ('rf', RandomForestRegressor(random_state=1))])
X_train, X_test, y_train, y_test = train_test_split(df['X'], df['price'], shuffle=False, test_size=0.5)
reg = pipe.fit(X_train, y_train)
mypred = reg.predict(X_test)
r2_score(mypred, y_test)
# result is -0.2
cross_val_score(pipe, df['X'], df['price'], cv=2)
# result is about 0.3
r2_score(mypred, y_test)
is wrong.
You need to provide the true values as the first argument and the predicted values as the second. Correct it to:
r2_score(y_test, mypred)
and then check the results.
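As a small illustration (with made-up numbers), r2_score is not symmetric in its arguments, so swapping them changes the result:
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(r2_score(y_true, y_pred))  # correct order: true values first, predictions second
print(r2_score(y_pred, y_true))  # swapped arguments generally give a different value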

DictVectorizer learns more features for the training set

I have the following code which works as expected:
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])
clf.fit(X[:size], y[:size])
score = clf.score(X_test, y_test)
I wanted to do the same logic without using Pipeline:
v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])
Xdv_test = v.fit_transform(X_test)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv[:size], y[:size])
clf.score(Xdv_test, y_test)
But I receive the following error:
ValueError: Number of features of the model must match the input. Model n_features is 8251 and input n_features is 14303
It seems that DictVectorizer learns more features for the test set than for the training set. I want to know how Pipeline handles this issue and how I can accomplish the same.
Don't call fit_transform again.
Do this:
Xdv_test = v.transform(X_test)
When you call fit() or fit_transform(), the DictVectorizer forgets the features learnt during the previous call (on the training data) and re-fits from scratch, hence the different number of features.
Pipeline handles this automatically: when you call clf.score(X_test, y_test) on the pipeline, it only calls transform() (not fit) on the test data before scoring.
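Putting that together, a sketch of the non-pipeline version (assuming the same X, X_test, y, y_test and size as in the question):
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])   # learn the feature mapping on the training data
Xdv_test = v.transform(X_test)    # reuse that mapping; unseen features are silently ignored

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv, y[:size])
print(clf.score(Xdv_test, y_test))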

Do I use the same Tfidf vocabulary in k-fold cross_validation

I am doing text classification based on the TF-IDF Vector Space Model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to rebuild the TF-IDF Vector Space Model in each cross-validation fold. Namely, do I need to rebuild the vocabulary and recalculate the IDF value of each term in the vocabulary in each fold?
Currently I'm doing the TF-IDF transformation with scikit-learn and training my classifier with an SVM. My method is as follows: first, I divide the samples at hand in a 3:1 ratio; 75 percent of them are used to fit the parameters of the TF-IDF Vector Space Model, namely the size of the vocabulary, the terms contained in it, and the IDF value of each term. Then I transform the remaining 25 percent with this TF-IDF vectorizer and use those vectors for 5-fold cross-validation (notably, I don't use the previous 75 percent of samples there).
My code is as follows:
# train, test split, the train data is just for TfidfVectorizer() fit
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
# vectorizer test data for 5-fold cross-validation
x_test = tfidf.transform(x_test)
scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)
My confusion is whether my way of doing the TF-IDF transformation and 5-fold cross-validation is correct, or whether it's necessary to rebuild the TF-IDF Vector Space Model on the training data of each fold and then transform both the train and test data of that fold, as follows:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)
The StratifiedKFold approach you adopted, where the TfidfVectorizer() is rebuilt inside each fold, is the right way; by doing so you make sure that features are generated only from the training dataset.
If you instead build the TfidfVectorizer() on the whole dataset, you are leaking the test dataset into the model even though you are not explicitly feeding it the test data. Parameters such as the size of the vocabulary and the IDF value of each term can differ greatly when the test documents are included.
A simpler way is to use a pipeline together with cross_validate:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)
Note: it is not useful to run cross_validate on the test data alone; it has to be done on the whole [train + validation] dataset.
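If you also want an unbiased estimate on truly held-out data, one possible sketch (using the data_x, data_y names from the question) is to cross-validate the pipeline on the training portion and score it once on a held-out test portion:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))

# Keep a held-out test set for the final evaluation only.
x_train, x_test, y_train, y_test = train_test_split(
    data_x, data_y, test_size=0.25, random_state=0, stratify=data_y)

# 5-fold cross-validation on the training portion; the vectorizer is
# re-fitted inside each fold, so no test information leaks in.
cv_scores = cross_validate(clf, x_train, y_train, scoring=['accuracy'], cv=5)
print(cv_scores['test_accuracy'])

# Fit on the full training portion and evaluate once on the held-out data.
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))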

Setting the n_estimators argument using **kwargs (Scikit Learn)

I am trying to follow this tutorial to learn machine-learning-based prediction, but I have two questions about it.
Question 1: How do I set n_estimators in the piece of code below? Otherwise it will always assume the default value.
from sklearn.cross_validation import KFold
def run_cv(X,y,clf_class,**kwargs):
    # Construct a kfolds object
    kf = KFold(len(y),n_folds=5,shuffle=True)
    y_pred = y.copy()
    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
It is being called as:
from sklearn.svm import SVC
print "%.3f" % accuracy(y, run_cv(X,y,SVC))
Question 2: How do I use the already trained model (e.g. the one obtained from the SVM) to predict more (test) data that I didn't use for training?
For your first question: in the above code you would call run_cv(X, y, clf_class, n_estimators=100) for a classifier class that actually accepts an n_estimators parameter (a random forest does; SVC does not). The **kwargs will pass this on to the classifier initializer in the step clf = clf_class(**kwargs).
For your second question, the cross validation in the code you've linked is just for model evaluation, i.e. comparing different types of models and hyperparameters, and determining the likely effectiveness of your model in production. Once you've decided on your model, you need to refit the model on the whole dataset:
clf.fit(X,y)
Then you can get predictions with clf.predict or clf.predict_proba.
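For the "already trained model" part specifically, a minimal sketch using joblib for persistence (the file name svc_model.joblib and the new data X_new are just placeholders):
from joblib import dump, load

# After refitting on the whole dataset:
clf.fit(X, y)
dump(clf, 'svc_model.joblib')                # persist the trained model to disk

# Later, possibly in another session:
clf_loaded = load('svc_model.joblib')
new_predictions = clf_loaded.predict(X_new)  # X_new: data not used for training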

how to properly use sklearn to predict the error of a fit

I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for them. This can easily be done with
predictions = model.predict(X_new)
My question is: what is the error associated with this prediction?
From my understanding I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
And basically my actual predictions for the new data should be random numbers drawn from a Gaussian distribution with mean equal to predictions and sigma^2 = model_mse. Do you agree with this, and do you know if there's a faster way to do it in sklearn?
You probably want to validate your model on data held out from your training set. I would suggest exploring the cross-validation submodule sklearn.cross_validation (renamed sklearn.model_selection in newer scikit-learn versions).
The most basic usage is:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
It depends on your training data:
If its distribution is a good representation of the "real world" and it is of sufficient size (see learning theory, e.g. PAC), then I would generally agree.
That said, if you are looking for a practical way to evaluate your model, why not use a test set, as Kris has suggested?
I usually use grid search for optimizing parameters:
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data[indices], y_data[indices], test_size=0.25)

# cross-validated grid search
params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

# print scores and best estimator
print 'best param: ', grid_search.best_params_
print 'best train score: ', grid_search.best_score_
print 'Test score: ', grid_search.best_estimator_.score(X_test, y_test)
The idea is to hide the test set from your learning algorithm (and from yourself): don't train on it and don't optimize parameters with it.
Finally, you should use the test set for performance evaluation (error) only; it should then provide an unbiased MSE.
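As a concrete sketch of that held-out evaluation for the linear regression in the question (X and y as defined there, using the current sklearn.model_selection import path):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hold out a test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# Estimate of the prediction error on unseen data.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(test_mse)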
