How to assign cv_results_['params'] from GridSearchCV - python

I want to build a dataframe from cv_results_, but GridSearchCV gives a list back for params and I don't know how to assign mat to ind to make a dataframe.
Here is an example:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Split the data into training/testing sets
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
y_train = diabetes_y[:-20]
y_test = diabetes_y[-20:]
pipeline = Pipeline([
    ('kbest', SelectKBest(f_classif)),
    ('regressor', KNeighborsRegressor())
])
parameters = {'kbest__k': list(range(1, X_train.shape[1] + 1)),
              'regressor__n_neighbors': list(range(1, 21))}
grid = GridSearchCV(pipeline, parameters)
grid.fit(X_train, y_train)
mat = grid.cv_results_['mean_test_score']
ind = grid.cv_results_['params']
The output of params looks like a list of dicts, one per candidate, e.g. {'kbest__k': 1, 'regressor__n_neighbors': 1}, {'kbest__k': 1, 'regressor__n_neighbors': 2}, ...
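One way to line that list up with mean_test_score (a minimal sketch, reusing the fitted grid from above) is to let pandas expand the dicts into columns and attach the scores:
import pandas as pd
results = pd.DataFrame(grid.cv_results_['params'])              # one column per hyperparameter
results['mean_test_score'] = grid.cv_results_['mean_test_score']
# or simply take everything at once: results = pd.DataFrame(grid.cv_results_)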

Related

How to improve the knn model?

I built a knn model for classification. Unfortunately, my model has accuracy > 80%, and I would like to get a better result. Can I ask for some tips? Maybe I used too many predictors?
My data = https://www.openml.org/search?type=data&sort=runs&id=53&status=active
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
heart_disease = pd.read_csv('heart_disease.csv', sep=';', decimal=',')
y = heart_disease['heart_disease']
X = heart_disease.drop(["heart_disease"], axis=1)
correlation_matrix = heart_disease.corr()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
scaler = MinMaxScaler(feature_range=(-1,1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn_3 = KNeighborsClassifier(3, n_jobs = -1)
knn_3.fit(X_train, y_train)
y_train_pred = knn_3.predict(X_train)
labels = ['0', '1']
print('Training set')
print(pd.DataFrame(confusion_matrix(y_train, y_train_pred), index = labels, columns = labels))
print(accuracy_score(y_train, y_train_pred))
print(f1_score(y_train, y_train_pred))
y_test_pred = knn_3.predict(X_test)
print('Test set')
print(pd.DataFrame(confusion_matrix(y_test, y_test_pred), index = labels, columns = labels))
print(accuracy_score(y_test, y_test_pred))
print(f1_score(y_test, y_test_pred))
hyperparameters = {'n_neighbors' : range(1, 15), 'weights': ['uniform','distance']}
knn_best = GridSearchCV(KNeighborsClassifier(), hyperparameters, n_jobs = -1, error_score = 'raise')
knn_best.fit(X_train,y_train)
knn_best.best_params_
y_train_pred_best = knn_best.predict(X_train)
y_test_pred_best = knn_best.predict(X_test)
print('Training set')
print(pd.DataFrame(confusion_matrix(y_train, y_train_pred_best), index = labels, columns = labels))
print(accuracy_score(y_train, y_train_pred_best))
print(f1_score(y_train, y_train_pred_best))
print('Test set')
print(pd.DataFrame(confusion_matrix(y_test, y_test_pred_best), index = labels, columns = labels))
print(accuracy_score(y_test, y_test_pred_best))
print(f1_score(y_test, y_test_pred_best))
Just a small part of an answer: to find the best value for n_neighbors, loop over candidate values and record the test error for each.
import numpy as np
import matplotlib.pyplot as plt

errlist = []  # error list to append to
for i in range(1, 40):  # n_neighbors values 1 to 39
    knn_i = KNeighborsClassifier(n_neighbors=i)
    knn_i.fit(X_train, y_train)
    errlist.append(np.mean(knn_i.predict(X_test) != y_test))  # mean misclassification rate

Plot a line to see the best n_neighbors:
plt.plot(range(1, 40), errlist)
plt.show()
Feel free to change the numbers in the range.

Using TimeSeriesSplit within cross_val_score

I'm fitting a time series, so I'm trying to cross-validate using the TimeSeriesSplit function. I believe the easiest way to apply it is through the cross_val_score function, via the cv argument.
The question is simple: is the way I am passing the cv argument correct? Should I do split(scaled_train), or should I use split(X_train) or split(input_data)? Or should I cross-validate in another way?
This is the code I am writing:
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate Test Design
        input_data = df.drop('next_count', axis=1)
        output_data = df[['next_count']]
        X_train, X_test, y_train, y_test = train_test_split(
            input_data, output_data, test_size=sizes, random_state=0, shuffle=False)
        # scaling
        scaler = MinMaxScaler()
        scaled_train = scaler.fit_transform(X_train)
        scaled_test = scaler.transform(X_test)
        # Build Model
        lr = LinearRegression()
        lr.fit(scaled_train, y_train.values.ravel())
        predictions = lr.predict(scaled_test)
        # Cross Validation Definition
        time_split = TimeSeriesSplit(n_splits=10)
        # performance metrics
        r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(),
                             cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
The TimeSeriesSplit is simply an iterator that yields a growing window of sequential folds. Therefore, you can pass it as-is to cv, or you can pass time_split.split(scaled_train), which amounts to the same thing: making splits on an array of the same length as your training data (which cross_val_score takes as the second positional parameter). It doesn't matter whether the TimeSeriesSplit gets the scaled or the original data, as long as cross_val_score gets the scaled data.
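To see that growing window concretely, here is a tiny sketch (with a hypothetical 6-sample array) of the folds TimeSeriesSplit yields:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X_demo = np.arange(12).reshape(6, 2)   # 6 samples, 2 features
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    print(train_idx, test_idx)
# [0 1 2] [3]
# [0 1 2 3] [4]
# [0 1 2 3 4] [5]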
I made some minor simplifications in your code as well - scaling before the train_test_split, and making the output data a Series (so you don't need values.ravel):
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate Test Design
        input_data = df.drop('next_count', axis=1)
        output_data = df['next_count']
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(
            scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Build Model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)
        # Cross Validation Definition
        time_split = TimeSeriesSplit(n_splits=10)
        # performance metrics
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split,
                             scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1

Grid search not giving the best parameters

When running a grid search over the inverse-of-regularization-strength parameter and the number-of-nearest-neighbors parameter for logistic regression, linear SVM and the K nearest neighbors classifier, the best parameters obtained from the grid search are not really the best when I verify manually by training on the same training data set. Code below:
# Convert to a DataFrame.
import pandas as pd
from sklearn.datasets import fetch_openml
df = fetch_openml('credit-g', as_frame=True).frame
df.head(5)
df.dtypes
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 12))
st = fig.suptitle("univariate distributions and target distribution", fontsize=20)
# Using columns that we need for this plot
nfeatures = df[['duration', 'credit_amount' , 'age']]
target = df['class']
# creating 4x4 grid
grid = plt.GridSpec(4, 4, hspace=0.4, wspace=0.4)
# creating the normal plots in grid 1 , 2 ,3 and 4
p1 = fig.add_subplot(grid[:2,:2])
p2 = fig.add_subplot(grid[:2,2:])
p3 = fig.add_subplot(grid[2:,:2])
p4 = fig.add_subplot(grid[2:,2:])
p1.hist(nfeatures['duration'])
p2.hist(nfeatures['credit_amount'])
p3.hist(nfeatures['age'])
p4.hist(target)
p1.set_xlabel('duration')
p2.set_xlabel('credit_amount')
p3.set_xlabel('age')
p4.set_xlabel('class')
# customizing to look neat
st.set_y(0.95)
fig.subplots_adjust(top=0.92)
from sklearn.model_selection import train_test_split
columns = [column for column in df.columns if column != 'class']
X = df[columns]
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3 ,random_state=11)
#X_train , y_train , X_valid , y_valid = train_test_split(X,)
# basic preprocessing on train sets
# numeric_columns = ['duration','credit_amount' , 'installment_commitment' , 'residence_since' , 'age' ,'existing_credits' , 'num_dependents' ]
numeric_columns = df.select_dtypes(include=['float64']).columns
categorical_columns = [column for column in columns if column not in numeric_columns]
temp = X_train[categorical_columns]
X_train_ohe = pd.concat([pd.get_dummies(temp),X_train[numeric_columns]],axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(max_iter=1000)
cr = cross_val_score(lr,X_train_ohe,y_train)
print(cr)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
# define the data preparation for the categorical columns
t1 = [('cat', OneHotEncoder(), categorical_columns)]
col_transform = ColumnTransformer(transformers=t1)
# define the models
models = {'lr_model':LogisticRegression(max_iter=1000), 'lsvm_model':LinearSVC(max_iter=2500) , 'knn_model':KNeighborsClassifier()}
for name, model in models.items():
    # define the data preparation and modeling pipeline
    pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])
    # define the model cross-validation configuration
    #cv = KFold(n_splits=10, shuffle=True, random_state=1)
    # evaluate the pipeline using cross-validation
    score = cross_val_score(pipeline, X_train, y_train)
    print(name, score.mean())
# define the data preparation for the categorical columns and numeric columns
t2 = [('cat', OneHotEncoder(), categorical_columns), ('num', StandardScaler(), numeric_columns)]
col_transform = ColumnTransformer(transformers=t2)
# try with new column transformer
for name, model in models.items():
    # define the data preparation and modeling pipeline
    pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])
    # define the model cross-validation configuration
    #cv = KFold(n_splits=10, shuffle=True, random_state=1)
    # evaluate the pipeline using cross-validation
    score = cross_val_score(pipeline, X_train, y_train)
    print(name, score.mean())
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
f1_scorer = make_scorer(f1_score, pos_label="bad")
# 'prep__num__with_mean': [True, False],
# 'prep__num__with_std': [True, False],
param_grid = {
    'm__C': [0.1, 1.0, 0.01],
}
param_grid_knn = {
    'm__n_neighbors': [5, 10, 15],
}
for name, model in models.items():
    # define the data preparation and modeling pipeline
    pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])
    # define the model cross-validation configuration
    #cv = KFold(n_splits=10, shuffle=True, random_state=1)
    # grid-search the relevant hyperparameter for each model, scored with F1 on the "bad" class
    if name == 'knn_model':
        grid_clf = GridSearchCV(pipeline, param_grid_knn, cv=5, scoring=f1_scorer)
    else:
        grid_clf = GridSearchCV(pipeline, param_grid, cv=5, scoring=f1_scorer)
    grid_clf.fit(X_train, y_train)
    print(name, grid_clf.best_params_)
    print(name, grid_clf.best_estimator_.score(X_test, y_test))
lr_array = []
lr_c = [0.01, 0.1, 1]
for c in lr_c:
    pipeline = Pipeline(steps=[('prep', col_transform), ('m', LogisticRegression(max_iter=1000, C=c))])
    pipeline.fit(X_train, y_train)
    y_hat = pipeline.predict(X_train)
    lr_array.append(f1_score(y_train, y_hat, pos_label="bad"))
lsvm_array = []
lsvm_c = [0.01, 0.1, 1]
for c in lsvm_c:
    pipeline = Pipeline(steps=[('prep', col_transform), ('m', LinearSVC(dual=True, max_iter=2500, C=c))])
    pipeline.fit(X_train, y_train)
    y_hat = pipeline.predict(X_train)
    lsvm_array.append(f1_score(y_train, y_hat, pos_label="bad"))
knn_array = []
knn_n = [5, 10, 15]
for n in knn_n:
    pipeline = Pipeline(steps=[('prep', col_transform), ('m', KNeighborsClassifier(n_neighbors=n))])
    pipeline.fit(X_train, y_train)
    y_hat = pipeline.predict(X_train)
    knn_array.append(f1_score(y_train, y_hat, pos_label="bad"))
fig = plt.figure(figsize=(12, 12))
# creating 3x1 grid
grid = plt.GridSpec(3, 1, hspace=0.4, wspace=0.4)
# creating the normal plots in grid 1 , 2 ,3
p1 = fig.add_subplot(grid[0,:])
p2 = fig.add_subplot(grid[1,:])
p3 = fig.add_subplot(grid[2,:])
p1.scatter(lr_c,lr_array)
p2.scatter(lsvm_c,lsvm_array)
p3.scatter(knn_n,knn_array)
The trend changes when using different scores and evaluating on the test set instead of the train set, but the best parameters never seem to be the same for the grid search and the manual verification. What could be the reason for this? For example, if you run the above code, grid search tells you 10 is the best value for n_neighbors, but the graph at the end shows 5 does better. Is the comparison not being implemented correctly? You can check the runs with output at this link: https://github.com/binodmathews93/AppliedMachineLearningCourse/blob/master/Applied_Machine_Learning_Homework_2.ipynb
Hyperparameter tuning is performed on the validation (development) set, not on the training set.
Grid Search Cross-Validation is using the K-Fold strategy to build a validation set that is used only for validation, not for training.
You are manually performing training and validation on the same set which is an incorrect approach.
pipeline = Pipeline(steps=[('prep',col_transform), ('m', LogisticRegression(max_iter=1000, C=c))])
pipeline.fit(X_train,y_train) # <- here is the problem
y_hat = pipeline.predict(X_train)
lr_array.append(f1_score(y_train,y_hat,pos_label="bad"))
This will only lead to hyperparameter choices that boost performance on the training set, which is not what you want (you want a set of hyperparameters that leads to good performance on the test set, i.e. that generalizes well).
This is the reason why the K (in KNN) is lower when you do the manual testing: a lower K means less "regularization", and is therefore the optimal, although incorrect, choice from the perspective of the training set.
If you want to manually verify the results, you will need to build a validation set yourself (and not use it during training), or you will need to run a K-fold cross-validation procedure manually.
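For instance, here is a minimal sketch (reusing the Pipeline, col_transform and f1_scorer from your code) that scores each candidate C with cross-validation on the training data, which mirrors what GridSearchCV measures, instead of scoring predictions on the very data the model was trained on:
lr_cv_scores = []
for c in [0.01, 0.1, 1]:
    pipeline = Pipeline(steps=[('prep', col_transform),
                               ('m', LogisticRegression(max_iter=1000, C=c))])
    # 5-fold CV on the training set only; the model never sees the fold it is scored on
    lr_cv_scores.append(cross_val_score(pipeline, X_train, y_train,
                                        cv=5, scoring=f1_scorer).mean())
print(lr_cv_scores)  # these values should agree with the grid search much more closely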

Improve precision of my predictive technique in Python

I am using the following Python code to make output predictions depending on some values using decision trees based on entropy/gini index. My input data is contained in the file: https://drive.google.com/file/d/1C8GZ2wiqFUW3WuYxyc0G3axgkM1Uwsb6/view?usp=sharing
The first column "gold" in the file contains the output that I am trying to predict (either T or N). The remaining columns represents some 0 or 1 data that I can use to predict the first column. I am using a test set of 30% and a training set of 70%. I am getting the same precision/recall using either entropy or gini index. I am getting a precision of 0.80 for T and a recall of 0.54 for T. I would like to increase the precision of T and I am okay if the recall for T goes down as well, I am willing to accept this tradeoff. I do not care about the precision/recall of N predictions, I am just trying to improve the precision of T, that's all I care about. I guess increasing the precision means that we should abstain from making predictions in some situations that we are not certain about. How to do that?
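For example, I imagine something along these lines (just a rough sketch with a hypothetical threshold, reusing the clf_gini and X_test names from the code below), but I am not sure this is the right approach:
# predict T only when the tree's estimated probability for T clears a threshold
proba_t = clf_gini.predict_proba(X_test)[:, list(clf_gini.classes_).index('T')]
threshold = 0.8                      # raising this trades recall for precision
confident_t = proba_t >= threshold   # rows where a T prediction would be made; abstain elsewhere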
# Run this program on your local python
# interpreter, provided you have installed
# the required libraries.
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from io import StringIO  # sklearn.externals.six is no longer available
from IPython.display import Image
from sklearn.tree import export_graphviz
from sklearn import tree
import collections
import pydotplus
# Function importing the dataset
column_count = 0
def importdata():
    balance_data = pd.read_csv('data1extended.txt', sep=',')
    row_count, column_count = balance_data.shape
    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Number of columns ", column_count)
    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data, column_count
def columns(balance_data):
    row_count, column_count = balance_data.shape
    return column_count
# Function to split the dataset
def splitdataset(balance_data, column_count):
    # Separating the target variable
    X = balance_data.values[:, 1:column_count]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test
# Function to perform training with gini index.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini
# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred
# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))
# Univariate selection
def selection(column_count, data):
    # data = pd.read_csv("data1extended.txt")
    X = data.iloc[:, 1:column_count]  # independent columns
    y = data.iloc[:, 0]  # target column, i.e. the "gold" label
    # apply the SelectKBest class to extract the top 5 features
    bestfeatures = SelectKBest(score_func=chi2, k=5)
    fit = bestfeatures.fit(X, y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    df = pd.DataFrame(data, columns=X.columns)
    # concat the two dataframes for better visualization
    featureScores = pd.concat([dfcolumns, dfscores], axis=1)
    featureScores.columns = ['Specs', 'Score']  # naming the dataframe columns
    print(featureScores.nlargest(5, 'Score'))  # print the 5 best features
    return X, y, data, df
# Feature importance
def feature(X, y):
    model = ExtraTreesClassifier()
    model.fit(X, y)
    print(model.feature_importances_)  # use the built-in feature_importances_ of tree-based classifiers
    # plot graph of feature importances for better visualization
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(5).plot(kind='barh')
    plt.show()
# Correlation matrix
def correlation(data, column_count):
    corrmat = data.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(column_count, column_count))
    # plot heat map
    g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
def generate_decision_tree(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    data_feature_names = ['callersAtLeast1T','CalleesAtLeast1T','callersAllT','calleesAllT','CallersAtLeast1N','CalleesAtLeast1N','CallersAllN','CalleesAllN','childrenAtLeast1T','parentsAtLeast1T','childrenAtLeast1N','parentsAtLeast1N','childrenAllT','parentsAllT','childrenAllN','ParentsAllN','ParametersatLeast1T','FieldMethodsAtLeast1T','ReturnTypeAtLeast1T','ParametersAtLeast1N','FieldMethodsAtLeast1N','ReturnTypeN','ParametersAllT','FieldMethodsAllT','ParametersAllN','FieldMethodsAllN']
    # generate model
    model = clf.fit(X, y)
    # Create DOT data
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=data_feature_names,
                                    class_names=clf.classes_)
    # Draw graph
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Show graph
    Image(graph.create_png())
    # Create PDF
    graph.write_pdf("tree.pdf")
    # Create PNG
    graph.write_png("tree.png")
# Driver code
def main():
    # Building phase
    data, column_count = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data, column_count)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)
    # Operational phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)
    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)
    # COMMENTED OUT THE 4 FOLLOWING LINES DUE TO MEMORY ERROR
    #X, y, dataheaders, df = selection(column_count, data)
    #generate_decision_tree(X, y)
    #feature(X, y)
    #correlation(dataheaders, column_count)
# Calling main function
if __name__ == "__main__":
    main()
I would suggest using Pipeline to build a data pipeline and GridSearchCV to find the best possible hyper-parameters and classifier for the pipe.
A basic example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('kbest', SelectKBest(chi2, k=3000)),
                 ('clf', DecisionTreeClassifier())
                 ])
pipe_params = {'kbest__k': range(1, 10, 1),
               'kbest__score_func': [f_classif, chi2],
               'clf__max_depth': np.arange(1, 30),
               'clf__min_samples_leaf': [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]}
grid_search = GridSearchCV(pipe, pipe_params, n_jobs=-1,
                           scoring='accuracy', cv=10)
grid_search.fit(X_train, Y_train)
This will iterate over every hyper-parameter combination in pipe_params and choose the best estimator based on accuracy.
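After the fit, here is a short sketch (assuming a held-out X_test/y_test split like the one produced by your splitdataset function) of how the winning configuration could be inspected and evaluated:
from sklearn.metrics import classification_report
print(grid_search.best_params_)               # chosen k, score_func, max_depth, min_samples_leaf
best_model = grid_search.best_estimator_      # pipeline refit on the whole training set
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))  # per-class precision/recall for T and N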

How to get a list of wrong predictions on the validation set

I'm trying to build a text-classification model on a database of site reviews (3 classes).
I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF (TfidfTransformer) and built an MNB model.
Now, after training and evaluating the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)
#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
tfidf_x, y, test_size=0.3, random_state=101
)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
My objective is to get the original indices of the reviews that the model predicted wrongly.
I managed to get the result like this:
predictions = c.predict(preprocessed_df['review_text'])
df2= preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
df2[df2['business_category']!=df2['prediction']]
I'm sure there is a more elegant way...
It seems like there is another problem in your code: generally the TF-IDF vectorizer is fit on the training data only, and to get the test data into the same format we apply only the transform operation. This is primarily done to avoid data leakage. Please refer to TfidfVectorizer: should it be used on train only or train+test. I have modified your code to suit your need.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=101
)
vectorizer = TfidfVectorizer()  # fit the vocabulary and IDF weights on the training texts only
x_train_tf = vectorizer.fit_transform(x_train)
x_test_tf = vectorizer.transform(x_test)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)
incorrect_docs = x_test[predmnb != y_test]  # keep the reviews whose prediction differs from the true label
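Because x_test is still a pandas Series here, its index carries the original row labels, so (as a short sketch) the misclassified reviews can be pulled back out of the cleaned frame for LIME:
wrong_idx = x_test.index[predmnb != y_test]   # original DataFrame indices of the wrong predictions
wrong_reviews = cleaned_df.loc[wrong_idx]     # full rows for inspection / LIME
print(wrong_reviews[['review_text', 'business_category']].head())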
