How to make predictions from cross-validation scores? - python

I am working on building a prediction model. I have managed to get as far as computing the cross-validation scores. Now I have no idea how to continue. What function should I use to make predictions after cross-validation?
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X = data.iloc[:, 0:16]
Y = data.iloc[:, 16]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
models = [
    ('LR', LogisticRegression()),
    ('CART', DecisionTreeClassifier()),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC())
]
results, names = [], []
for name, model in models:
    seed = 32
    scoring = 'accuracy'
    # shuffle=True is required for random_state to have an effect
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross-validation is mostly used as a more robust validation scheme to check whether your model performs well. Once you are satisfied with your cross-validation score, you can either train a final model on the whole dataset, or use
sklearn.model_selection.cross_val_predict
which returns the cross-validated estimate for each input data point. You can check out the documentation for more information.
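For example, here is a minimal sketch of both options, reusing the X_train/Y_train/X_validation split from the question (the choice of LogisticRegression is just illustrative):
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# Option 1: out-of-fold predictions -- each sample is predicted by a model
# trained on the folds that do not contain it.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=32)
y_oof = model_selection.cross_val_predict(LogisticRegression(), X_train, Y_train, cv=kfold)

# Option 2: once the CV score looks good, refit on all the training data
# and predict on the held-out validation set.
final_model = LogisticRegression()
final_model.fit(X_train, Y_train)
predictions = final_model.predict(X_validation)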

Related

learning curve of multiclass classification task

I'm trying to do multiclass classification with multiple machine learning models, using this function that I have created:
# assumed imports for this snippet:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

def model_roc(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    pipeline1 = imbpipeline(steps=[['pca', PCA()],
                                   ['smote', SMOTE('not majority', random_state=11)],
                                   ['scaler', MinMaxScaler()],
                                   ['classifier', LogisticRegression(random_state=11, max_iter=1000)]])
    stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
    param_grid1 = {'classifier__penalty': ['l1', 'l2'],
                   'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    grid_search1 = GridSearchCV(estimator=pipeline1,
                                param_grid=param_grid1,
                                scoring=make_scorer(roc_auc_score, average='weighted', multi_class='ovo', needs_proba=True),
                                cv=stratified_kfold,
                                n_jobs=-1)
    print("#" * 100)
    print(pipeline1['classifier'])
    grid_search1.fit(X_train, y_train)
    cv_score1 = grid_search1.best_score_
    test_score1 = grid_search1.score(X_test, y_test)
    print('cv_score', cv_score1, 'test_score', test_score1)
    return
I have 2 questions:
Can I get multiple metrics from the same function (ROC AUC, accuracy, precision, and F1 score), given that this is imbalanced data with multiple classes?
I need to plot the learning curve, but I don't know how to do this from inside my function.
To plot a learning curve of the roc_auc_score:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

def plot_learning_curves(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    train_scores, test_scores = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_test_predict = model.predict(X_test)
        # score the training predictions against the same m-sample slice
        train_scores.append(roc_auc_score(y_train[:m], y_train_predict))
        test_scores.append(roc_auc_score(y_test, y_test_predict))
    # note: for multiclass targets, roc_auc_score needs class probabilities
    # (model.predict_proba) together with multi_class='ovr' or 'ovo'
    plt.plot(train_scores, "r-+", linewidth=2, label="train")
    plt.plot(test_scores, "b-", linewidth=2, label="test")
    plt.legend()
We iterate m from 1 to the length of the training set, slicing the first m records from the training data and training the model on that subset.
The model's roc_auc_score is calculated on each iteration and appended to a list, showing the progress of its roc_auc_score as the model trains on more and more instances.
(Modified from ageron, Hands-On Machine Learning, 2nd edition.)
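On the first question: yes. One option is sklearn.model_selection.cross_validate, which accepts several scorers at once. A minimal sketch, assuming the pipeline1 and X_train/y_train split from the question (the scorer names are illustrative):
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import make_scorer, roc_auc_score

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision_weighted',
    'f1': 'f1_weighted',
    'roc_auc': make_scorer(roc_auc_score, average='weighted', multi_class='ovo', needs_proba=True),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scores = cross_validate(pipeline1, X_train, y_train, cv=cv, scoring=scoring)
for name in scoring:
    print(name, scores['test_' + name].mean())
GridSearchCV also accepts such a dict through its scoring parameter if you set refit to one of the keys. For the second question, sklearn.model_selection.learning_curve computes train/test scores over increasing training-set sizes and is usually much faster than refitting in a Python loop.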

Where to use validation set in model training

I have split my data into 3 sets (train, test, and validation) as shown below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
I wanted to ask where I should put the validation set in this code:
#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))
I suggest you read up some more on why you would split your data into train, test, and validation sets.
In the code you show, you could evaluate on the validation data the same way you use your test data, but that alone doesn't really make full sense.
There is a lot to it; I think this can get you started. Link
In short, and very simplified: the general idea is that you use the results on the validation data to make adjustments to your model and improve its performance.
The test data you use only at the very end, for your final model evaluation, to make sure it actually performs well on unseen data.
(Worst case, if you only use two sets, you might adjust parameters until they work well for those two datasets but still not for any other.)
You can report the scores on the validation data like this:
#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_val = model.predict(X_val)
print("Accuracy on val is:",accuracy_score(y_val, y_pred_val))
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score for Val: ",precision_score(y_val, y_pred_val, average='macro'))
print("Recall Score for Val: ",recall_score(y_val, y_pred_val, average='macro'))
print("F1-Score for Val :",f1_score(y_val, y_pred_val, average='macro'))
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))

Show sorted results from loop

I'm testing different models (classifiers): I've created a list of (name, model) pairs and then loop through it to print the accuracy and cross-validation score for each of them. It works fine.
What I'd like to do is show them ordered by descending accuracy_score (metrics.accuracy_score(y_test, y_pred) in the code below). How do I do that easily?
Thanks a lot to anyone who's willing to help!
#create an array of models
models = []
models.append(("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)))
#models.append(("Logistic Regression", LogisticRegression()))
models.append(("Naive Bayes", GaussianNB()))
models.append(("SVM", SVC()))
models.append(("Dtree", DecisionTreeClassifier()))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Gradient Boosting", GradientBoostingClassifier()))
#measure the accuracy and show results per model
for name, model in models:
    # fit the model with x and y data
    model.fit(X_train, y_train)
    #Prediction of test set
    y_pred = model.predict(X_test)
    kfold = KFold(n_splits=4)#, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', metrics.accuracy_score(y_test, y_pred), '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')
Append your scores to a new list, and then sort that list using the .sort() method, like so:
#create an array of models
models = []
models.append(("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)))
#models.append(("Logistic Regression", LogisticRegression()))
models.append(("Naive Bayes", GaussianNB()))
models.append(("SVM", SVC()))
models.append(("Dtree", DecisionTreeClassifier()))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Gradient Boosting", GradientBoostingClassifier()))
results = []  # New list to store results
#measure the accuracy and show results per model
for name, model in models:
    # fit the model with x and y data
    model.fit(X_train, y_train)
    #Prediction of test set
    y_pred = model.predict(X_test)
    kfold = KFold(n_splits=4)#, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    results.append((name, metrics.accuracy_score(y_test, y_pred)))
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', metrics.accuracy_score(y_test, y_pred), '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')
results.sort(key=lambda tup: tup[1], reverse=True)  # sort in-place, descending
print(results)  # print results
Rather than doing everything in one big chunk of code inside the loop, I suggest identifying the different kinds of operations you're doing and separating them into their own functions:
run a model (fit, predict, compute scores);
sort the list;
print a model's results.
import operator  # itemgetter

#create an array of models
models = []
models.append(("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)))
#models.append(("Logistic Regression", LogisticRegression()))
models.append(("Naive Bayes", GaussianNB()))
models.append(("SVM", SVC()))
models.append(("Dtree", DecisionTreeClassifier()))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Gradient Boosting", GradientBoostingClassifier()))

def run_model(m):
    name, model = m
    # fit the model with x and y data
    model.fit(X_train, y_train)
    #Prediction of test set
    y_pred = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    kfold = KFold(n_splits=4)#, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    return (name, accuracy, cv_result)

def print_model(name, accuracy, cv_result):
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', accuracy, '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')

# sort by accuracy, descending, as the question asks
results = sorted(map(run_model, models), key=operator.itemgetter(1), reverse=True)
for name, accuracy, cv_result in results:
    print_model(name, accuracy, cv_result)
Disclaimer: Contrary to all best practices, I did not test this code before posting it, because the OP didn't provide example values for X_train, y_train, X_test, y_test, nor the relevant imports to make their code work.

Can I know train and test accuracy individually if I am using cross_val_score?

I used a random forest classifier to classify my dataset, and I want to use cross-validation; my problem is that I could not find a way to get the accuracy of the train and test splits individually, so is this possible?
Here is the code I used:
from numpy import mean, std
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score

X_test, X_rem, y_test, y_rem = train_test_split(X, Y, test_size=0.1)
X_valid, X_train, y_valid, y_train = train_test_split(X_rem, y_rem, train_size=0.80)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = RandomForestClassifier(n_estimators=400)
scoresranP = cross_val_score(model, X_rem, y_rem, cv=cv, n_jobs=-1)
print('Accuracy of Random Forest: %.3f (%.3f)' % (mean(scoresranP), std(scoresranP)))
If I am right, in this case "scoresranP" will give me the training accuracy; so can I get the test accuracy using the test split?
Or, if I am wrong, can anyone tell me whether I can use cross_val_score with three splits (train, valid, test)?
I really appreciate any help you can provide.
Random forest is not the obstacle here: RandomForestClassifier is a classifier, not a regression-only method. The reason you don't see a training accuracy is that cross_val_score only reports each held-out fold's score; if you want the training and test accuracy of every fold, use cross_validate with return_train_score=True.
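A minimal sketch of that, reusing the X_rem/y_rem split from the question:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=10, random_state=1, shuffle=True)
model = RandomForestClassifier(n_estimators=400)
scores = cross_validate(model, X_rem, y_rem, cv=cv, n_jobs=-1,
                        return_train_score=True)
print('Train accuracy: %.3f (%.3f)' % (np.mean(scores['train_score']), np.std(scores['train_score'])))
print('Test accuracy:  %.3f (%.3f)' % (np.mean(scores['test_score']), np.std(scores['test_score'])))
For a final, completely untouched estimate you can still fit the model on X_rem/y_rem afterwards and score it once on X_test/y_test.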

using brand as a feature in product categorization

I am working on classifying products into categories using scikit-learn. I have trained a logistic regression model, which has done well. I want to improve the performance of the classifier, and I have been wondering whether the brand of the product could help improve the classification. As you know, a classifier's pipeline has steps like stemming, tokenization, etc., so is there a way to add the brand as an extra feature to the pipeline, so the algorithm can use it to classify products?
The data I am using is a dataset of products (title, brand, description, etc.) with their categories as labels.
code:
import json
import string
import numpy as np
from nltk.corpus import stopwords
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

with open('file_path') as json_data:
    d = json.load(json_data)

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
    print("X_train:")
    print(len(X_train))
    print("X_test:")
    print(len(X_test))
    print("y_train:")
    print(len(y_train))
    print("y_test:")
    print(len(y_test))
    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier

target = np.asarray(d["target"])
data = d["data"]
categs = d["categs"]
trial = Pipeline([
    # stemming_tokenizer is defined elsewhere by the OP
    ('vectorizer', TfidfVectorizer(tokenizer=stemming_tokenizer,
                                   stop_words=stopwords.words('french') + list(string.punctuation))),
    ('classifier', linear_model.LogisticRegression(C=1e2, n_jobs=-1)),
])
print("start train...")
clf = train(trial, data, target)
print("finished train")
