How to implement SMOTE in cross validation and GridSearchCV - python

I'm relatively new to Python. Can you help me turn my implementation of SMOTE into a proper pipeline? What I want is to apply the over- and under-sampling on the training set of every k-fold iteration, so that the model is trained on a balanced data set and evaluated on the imbalanced left-out fold. The problem is that when I do that, I cannot use the familiar sklearn interface for evaluation and grid search.
Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

df = pd.read_csv("Imbalanced_data.csv")  # Load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values
n_splits = 2
n_measures = 2  # Recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # Stratified so each fold keeps the class ratio
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))
for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    sm = SMOTE(ratio='auto', k_neighbors=5, n_jobs=-1)
    smote_enn = SMOTEENN(smote=sm)
    X_train_res, y_train_res = smote_enn.fit_sample(X_train, y_train)
    clf_rf.fit(X_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)                 # predict() takes only X
    scores[fold, 0] = recall_score(y_test, y_pred)  # rows indexed by fold, columns by measure
    scores[fold, 1] = roc_auc_score(y_test, y_pred) # auc() expects curve points; roc_auc_score works on labels

You need to look at the pipeline object. imbalanced-learn has a Pipeline that extends the scikit-learn Pipeline to support the fit_sample() and sample() methods in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods.
Have a look at this example here:
https://imbalanced-learn.org/stable/auto_examples/pipeline/plot_pipeline_classification.html
For your code, you would want to do this:
from imblearn.pipeline import make_pipeline, Pipeline

smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
pipeline = make_pipeline(smote_enn, clf_rf)
OR
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])
Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV or other cross-validation tools in scikit-learn as a regular estimator.
kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)
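For completeness, here is a hedged sketch of what param_dist could look like for the pipeline above, reusing kf, X and y from the earlier code; the hyperparameter names and value ranges are illustrative assumptions, not recommendations. Parameters of a step inside a Pipeline are addressed as <step_name>__<parameter>, here using the 'clf_rf' step name from the explicit Pipeline variant:

# Illustrative search space only; prefix each parameter with the pipeline step name.
param_dist = {
    'clf_rf__n_estimators': [25, 50, 100, 200],
    'clf_rf__max_depth': [None, 5, 10, 20],
    'clf_rf__min_samples_leaf': [1, 2, 5],
}

random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=20, scoring='recall', cv=kf)
random_search.fit(X, y)   # SMOTEENN is applied inside each training fold only
print(random_search.best_params_)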

This looks like it would fit the bill: http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that, upon calling fit, returns a balanced data set (presumably the one obtained from StratifiedKFold), but, upon calling predict (which is what will happen for the test data), calls into SMOTE.


Using TimeSeriesSplit within cross_val_score

I'm fitting a time series model and want to cross-validate it using the TimeSeriesSplit function. I believe the easiest way to apply this function is through the cv argument of cross_val_score.
The question is simple: is the way I am passing the cv argument correct? Should I pass split(scaled_train), split(X_train), or split(input_data)? Or should I cross-validate in another way?
This is the code I am writing:
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df[['next_count']]
        X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Scaling
        scaler = MinMaxScaler()
        scaled_train = scaler.fit_transform(X_train)
        scaled_test = scaler.transform(X_test)
        # Build model
        lr = LinearRegression()
        lr.fit(scaled_train, y_train.values.ravel())
        predictions = lr.predict(scaled_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metrics
        r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
TimeSeriesSplit is a cross-validation splitter that yields a growing window of sequential folds. Therefore, you can pass it as is to cv, or you can pass time_split.split(scaled_train), which amounts to the same thing: making splits on an array of the same size as your train data (which cross_val_score takes as the second positional parameter). It doesn't matter whether the TimeSeriesSplit gets the scaled or the original data, as long as cross_val_score gets the scaled data.
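To see the growing-window behaviour, you can print the indices each split produces; a minimal self-contained sketch on dummy data:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(8).reshape(-1, 1)   # eight sequential samples
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(data):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1]          test: [2 3]
# train: [0 1 2 3]      test: [4 5]
# train: [0 1 2 3 4 5]  test: [6 7]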
I made some minor simplifications in your code as well - scaling before the train_test_split, and making the output data a Series (so you don't need values.ravel):
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df['next_count']
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Build model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metrics
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split, scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1

Obtain errors of individual data points when using cross-validation (scikit-learn)

I am using cross-validation to evaluate my ML models but now I want to look into the distribution of the errors, i.e. I want to get the average error of specific data points whenever they are in the test set.
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

X = # data points
y = # output
lm = linear_model.LinearRegression()
kfold = KFold(n_splits=10)
scores = cross_val_score(lm, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Testing RMSE (lin reg): {:.3f}'.format(np.mean(rmse_scores)))
Is there an easy way to get the individual errors of each of the data points whenever they are in the test set (not training error) using cross-validation with scikit-learn?
Thank you!
If I understood your question correctly, this should be what you are looking for.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

model = linear_model.LinearRegression()  # or any other estimator
kf = KFold(n_splits=3)
error = []
for train_index, val_index in kf.split(X, y):
    Xtrain, X_val = X[train_index], X[val_index]
    ytrain, y_val = y[train_index], y[val_index]
    model.fit(Xtrain, ytrain)
    pred = model.predict(X_val)
    current_error = mean_squared_error(y_val, pred)  # error for this fold
    error.append(current_error)
print(np.mean(error))  # mean error after CV
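If you want the error of each individual data point (rather than the per-fold average), a hedged extension of the same loop is to store one squared error per sample, indexed by val_index; with a plain KFold each point lands in the validation set exactly once. This sketch reuses kf and model from above and assumes X and y are NumPy arrays:

import numpy as np

per_point_error = np.zeros(len(y))
for train_index, val_index in kf.split(X, y):
    model.fit(X[train_index], y[train_index])
    pred = model.predict(X[val_index])
    per_point_error[val_index] = (y[val_index] - pred) ** 2  # squared error of each held-out point
# per_point_error[i] is the test-set error of data point i
# (sklearn's cross_val_predict offers a similar shortcut for out-of-fold predictions)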

What is the difference between model.LGBMRegressor.fit(x_train, y_train) and lightgbm.train(train_data, valid_sets = test_data)?

I tried out two ways of implementing LightGBM. I expected them to return the same value, but they didn't.
I thought lgb.LGBMRegressor() and lgb.train(train_data, test_data) would return the same accuracy, but they didn't. So I wonder why?
Function to split the data:
def dataready(train, test, predictvar):
    included_features = train.columns
    y_test = test[predictvar].values
    y_train = train[predictvar].ravel()
    train = train.drop([predictvar], axis=1)
    test = test.drop([predictvar], axis=1)
    x_train = train.values
    x_test = test.values
    return x_train, y_train, x_test, y_test, train
This is how I split the data:
x_train, y_train, x_test, y_test, train2 = dataready(train, test, 'runtime.min')
train_data = lgb.Dataset(x_train, label=y_train)
test_data = lgb.Dataset(x_test, label=y_test)
Fit the models and predict:
lgb1 = LGBMRegressor()
lgb1.fit(x_train, y_train)
gbm = lgb.train(parameters, train_data, valid_sets=test_data, num_boost_round=5000, early_stopping_rounds=100)  # renamed so it does not shadow the lgb module
I expected the results to be roughly the same, but they are not. As far as I understand, one is a booster and the other is a regressor?
LGBMRegressor is the sklearn interface. The .fit(X, y) call is standard sklearn syntax for model training. It is an estimator class for you to use as part of sklearn's ecosystem (for running pipelines, parameter tuning, etc.).
lightgbm.train is the core training API of lightgbm itself.
XGBoost and many other popular ML training libraries make a similar distinction (the core API uses xgb.train(...), for example, while the sklearn API uses XGBClassifier or XGBRegressor).
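To illustrate, here is a minimal hedged sketch; the hyperparameter values are placeholders, and x_train, x_test and y_train come from the question's code. When both interfaces are given the same parameters and number of boosting rounds, their predictions should match closely:

import lightgbm as lgb

params = {'objective': 'regression', 'learning_rate': 0.1, 'num_leaves': 31}

# sklearn interface
sk_model = lgb.LGBMRegressor(n_estimators=100, **params)
sk_model.fit(x_train, y_train)

# core training API
train_set = lgb.Dataset(x_train, label=y_train)
core_model = lgb.train(params, train_set, num_boost_round=100)

# Same parameters, same number of rounds: near-identical predictions
print(sk_model.predict(x_test)[:5])
print(core_model.predict(x_test)[:5])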

ValueError: Input has n_features=10 while the model has been trained with n_features=4261

I am trying to use a trained BoW, tf-idf, and SVM model to do prediction:
def bagOfWords(files_data):
    count_vector = sklearn.feature_extraction.text.CountVectorizer()
    return count_vector.fit_transform(files_data)

files = sklearn.datasets.load_files(dir_path)
word_counts = util.bagOfWords(files.data)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
X = tf_transformer.transform(word_counts)
clf = sklearn.svm.LinearSVC()
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=test_size)
I can run the following:
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
But the following gets an error:
clf.fit(X_train, y_train)
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)
I think I am already using the original tf_transformer, and I don't know why I still get the error. Any help is greatly appreciated!
You're not preserving the CountVectorizer you originally fit the data with.
This bagOfWords call is fitting a separate CountVectorizer in its own scope.
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
You want to use the one you fit on your training set.
You are also training your transformers with the entire X, including X_test. You want to exclude your test set from any training, including transformations.
Try something like this.
files = sklearn.datasets.load_files(dir_path)

# Split in train/test
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(files.data, files.target)

# Fit and transform with X_train only
count_vector = sklearn.feature_extraction.text.CountVectorizer()
word_counts = count_vector.fit_transform(X_train)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(word_counts)

clf = sklearn.svm.LinearSVC()
clf.fit(X_train, y_train)

# Transform X_test with the already-fitted vectorizer and transformer
test_word_counts = count_vector.transform(X_test)
ready_to_be_predicted = tf_transformer.transform(test_word_counts)
predicted_test = clf.predict(ready_to_be_predicted)

# Test example
new_word_counts = count_vector.transform(["a place to listen to music it s making its way to the us"])
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)
Of course, it's much less complicated to combine these transformers into a Pipeline.
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
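As a hedged sketch of that approach (the step names are arbitrary, and files comes from load_files as above), a Pipeline fits all three steps on the training data only and reuses them for any new text:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', LinearSVC()),
])

X_train, X_test, y_train, y_test = train_test_split(files.data, files.target)
text_clf.fit(X_train, y_train)        # fits vectorizer, tf-idf and SVM on training data only
predicted = text_clf.predict(X_test)  # reuses the fitted steps
new_predicted = text_clf.predict(["a place to listen to music it s making its way to the us"])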

How to use k-fold cross-validation in scikit with a naive Bayes classifier and NLTK

I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do it?
Your options are to either set this up yourself or use something like NLTK-Trainer since NLTK doesn't directly support cross-validation for machine learning algorithms.
I'd probably recommend just using another module to do this for you, but if you really want to write your own code, you could do something like the following.
Supposing you want 10-fold, you would have to partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and do this for each combination of subsets (10).
Assuming your training set is in a list named training, a simple way to accomplish this would be:
num_folds = 10
subset_size = len(training) // num_folds  # integer division so slice indices are ints
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save accuracy
# find mean accuracy over all rounds
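As a concrete, hedged example of what that loop could look like with NLTK's naive Bayes, assuming training is a plain Python list of (features, label) pairs:

import nltk

num_folds = 10
subset_size = len(training) // num_folds
accuracies = []
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    classifier = nltk.NaiveBayesClassifier.train(training_this_round)
    accuracies.append(nltk.classify.util.accuracy(classifier, testing_this_round))
print(sum(accuracies) / num_folds)  # mean accuracy over all rounds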
Actually, there is no need for the long loop iteration provided in the most upvoted answer. Also, the choice of classifier is irrelevant (it can be any classifier).
Scikit provides cross_val_score, which does all the looping under the hood.
from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = <any classifier>
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))
I've used both libraries: NLTK for naive Bayes and sklearn for cross-validation, as follows:
import nltk
from sklearn.model_selection import KFold

training_set = nltk.classify.apply_features(extract_features, documents)
cv = KFold(n_splits=10, shuffle=False)
for traincv, testcv in cv.split(training_set):
    classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[-1] + 1])
    print('accuracy:', nltk.classify.util.accuracy(classifier, training_set[testcv[0]:testcv[-1] + 1]))
At the end, I calculated the average accuracy.
Modified the second answer:
cv = KFold(n_splits=10, shuffle=True, random_state=None)
Inspired by Jared's answer, here is a version using a generator:
def k_fold_generator(X, y, k_fold):
    subset_size = len(X) // k_fold  # integer division
    for k in range(k_fold):
        X_train = X[:k * subset_size] + X[(k + 1) * subset_size:]
        X_valid = X[k * subset_size:][:subset_size]
        y_train = y[:k * subset_size] + y[(k + 1) * subset_size:]
        y_valid = y[k * subset_size:][:subset_size]
        yield X_train, y_train, X_valid, y_valid
I am assuming that your data set X has N data points (= 4 in the example) and D features (= 2 in the example). The associated N labels are stored in y.
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 0, 1, 1]
k_fold = 2

for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    # Train using X_train and y_train
    # Evaluate using X_valid and y_valid
    pass
