GridSearchCV is searching for best recall for the wrong value - python

So I'm working on a diabetes dataset. I want the "best" recall I can get (classify as many diabetics as possible as diabetics).
The problem is that while my code is searching for the best recall, I feel like it's optimizing the score for the healthy people (marked as 0 in y_true). How can I change it so that my GridSearchCV will focus on the diabetic people (1 in y_true)?
My code:
param_grid_rf = {
    "random_state": [2115],
    'max_depth': np.arange(1, 4, 1),
    'max_leaf_nodes': np.arange(13, 17, 1),
    'min_samples_leaf': np.arange(2, 5),
    'n_estimators': np.arange(100, 301, 100)
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid_rf, cv=5, n_jobs=-1, scoring="recall")
grid_search.fit(X_train, y_train)
Creating a model from best_params_:
rfc = RandomForestClassifier(n_estimators=300, max_depth=3, max_leaf_nodes=13,
                             min_samples_leaf=2, random_state=2115)
rfc.fit(X_train, y_train)
print(classification_report(y_train, rfc.predict(X_train), target_names=["1", "0"]))
Output:
              precision    recall  f1-score   support
           1       0.79      0.94      0.86       400
           0       0.82      0.54      0.65       214
Confusion matrix:
          pred 0   pred 1
true 0      88       12
true 1      36       18
Basically the recall for the no-diabetes people is 0.94 and only 0.54 for the diabetics.
Is my code correct?
What can I do to make the recall higher for diabetics?
Thank you for your time :)
PS. The class balance in the dataset: "0" - 500, "1" - 268
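For reference, here is a minimal sketch (my addition, not code from the question) of how the scorer can be tied explicitly to the diabetic class via make_scorer. Note that for binary targets, scoring="recall" already computes recall for the positive label 1:
from sklearn.metrics import make_scorer, recall_score

# Recall computed only for label 1 (the diabetic class). In binary
# classification scoring="recall" defaults to pos_label=1 anyway, so
# this mainly makes the intent explicit.
diabetic_recall = make_scorer(recall_score, pos_label=1)

grid_search = GridSearchCV(RandomForestClassifier(), param_grid_rf,
                           cv=5, n_jobs=-1, scoring=diabetic_recall)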

Related

Why do predictions and scores return different results in classification using scikit-learn?

I wrote a very simple multiclass classifier based on the iris dataset. This is the code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
# Load the data
iris = load_iris()
X = iris.data
y = iris.target
# Use label_binarize to be multi-label like settings
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]
# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0
)
# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
)
# Train the model
classifier.fit(X_train, y_train)
My goal is to predict the values of the test set in 2 ways:
Using the classifier.predict() function to define y_pred.
Using classifier.decision_function() to get the scores, then picking the highest one for each instance to define y_pred_.
Here is how I did it:
# Get the scores for the Test set
y_score = classifier.decision_function(X_test)
# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = label_binarize(np.argmax(y_score, axis=1), [0,1,2])
However, when I compute the classification report I get slightly different results, while I would expect them to be the same, since the predictions are based on the scores obtained from the decision function, as can be seen in the documentation (line 789). Here are both reports:
print(classification_report(y_test, y_pred))
print(classification_report(y_test, y_pred_))
              precision    recall  f1-score   support
           0       0.54      0.62      0.58        21
           1       0.44      0.40      0.42        30
           2       0.36      0.50      0.42        24
   micro avg       0.44      0.49      0.47        75
   macro avg       0.45      0.51      0.47        75
weighted avg       0.45      0.49      0.46        75
 samples avg       0.39      0.49      0.42        75

              precision    recall  f1-score   support
           0       0.42      0.38      0.40        21
           1       0.52      0.47      0.49        30
           2       0.38      0.46      0.42        24
   micro avg       0.44      0.44      0.44        75
   macro avg       0.44      0.44      0.44        75
weighted avg       0.45      0.44      0.44        75
 samples avg       0.44      0.44      0.44        75
What am I doing wrong? Would you be able to suggest a smart and elegant solution so that both reports are identical?
For multilabel classification you should use
y_pred_ = np.where(classifier.decision_function(X_test) > 0, 1, 0)
to replicate the output of the predict() method as in this case the different classes are not mutually exclusive, i.e. a given sample can belong to multiple classes.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
# Load the data
iris = load_iris()
X = iris.data
y = label_binarize(iris.target, classes=[0, 1, 2])
# Split the data into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=0))
)
# Train the model
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = np.where(classifier.decision_function(X_test) > 0, 1, 0)
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.58 0.37 0.45 30
# 2 0.95 0.83 0.89 24
# micro avg 0.85 0.69 0.76 75
# macro avg 0.84 0.73 0.78 75
# weighted avg 0.82 0.69 0.74 75
# samples avg 0.66 0.69 0.67 75
print(classification_report(y_test, y_pred_))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.58 0.37 0.45 30
# 2 0.95 0.83 0.89 24
# micro avg 0.85 0.69 0.76 75
# macro avg 0.84 0.73 0.78 75
# weighted avg 0.82 0.69 0.74 75
# samples avg 0.66 0.69 0.67 75
For multiclass classification you can instead use
y_pred_ = np.argmax(classifier.decision_function(X_test), axis=1)
as in your code, since in this case the different classes are mutually exclusive, i.e. each sample is assigned to exactly one class.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
# Load the data
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=0))
)
# Train the model
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = np.argmax(classifier.decision_function(X_test), axis=1)
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.85 0.73 0.79 30
# 2 0.71 0.83 0.77 24
# accuracy 0.84 75
# macro avg 0.85 0.86 0.85 75
# weighted avg 0.85 0.84 0.84 75
print(classification_report(y_test, y_pred_))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.85 0.73 0.79 30
# 2 0.71 0.83 0.77 24
# accuracy 0.84 75
# macro avg 0.85 0.86 0.85 75
# weighted avg 0.85 0.84 0.84 75
OneVsRestClassifier assumes that you expect a multi-label result, i.e. there may be more than one positive label for a single input. The result is thus different from using argmax with decision_function.
Try
print(y_pred[0])
print(y_pred_[0])
Output:
[0 1 1]
[0 0 1]
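A quick way to quantify this difference (my addition, reusing y_pred from the question's multi-label setup): count the test rows for which predict() returns zero or more than one positive label, which argmax can never produce.
import numpy as np

# Rows whose one-vs-rest prediction is not exactly single-label;
# these are exactly the rows where y_pred and y_pred_ can disagree.
labels_per_row = y_pred.sum(axis=1)
print((labels_per_row != 1).sum(), "of", len(y_pred), "test rows are not single-label")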

Problems with all values output to 1 in evaluation metrics

x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
print(x_train.shape)
# (1413, 3)  <---- Result
print(x_val.shape)
# (472, 3)   <---- Result
print(x_test.shape)
# (471, 3)   <---- Result
I proceeded with the data split for machine learning and got the above results.
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_train, y_train)
print("train score : {}".format(dTree.score(x_train, y_train)))
# train score : 1.0  <---- Result
print("val score : {}".format(dTree.score(x_val, y_val)))
# val score : 1.0  <---- Result
We then used a decision tree to print out the train and validation scores, respectively, and the results were both 1.
predict_y = dTree.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, dTree.predict(x_test)))
print("test score : {}".format(dTree.score(x_test, y_test)))
              precision    recall  f1-score   support
           A       1.00      1.00      1.00       235
           B       1.00      1.00      1.00       236
    accuracy                           1.00       471
   macro avg       1.00      1.00      1.00       471
weighted avg       1.00      1.00      1.00       471
test score : 0.9978768577494692
Finally, classification_report also showed the above results. Are some of my data splits wrong? Or does a value of 1 mean all the data was perfectly classified? If I'm wrong, I want to hear the right solution.
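One thing worth ruling out before trusting such perfect numbers (a hypothetical check, not part of the question): exact duplicate rows landing in both the training and test splits, which would make the test scores look perfect on data that is not really unseen.
import numpy as np

# Count test rows that also appear verbatim in the training set;
# many matches would explain near-perfect scores on "unseen" data.
train_rows = {tuple(row) for row in np.asarray(x_train)}
n_seen = sum(tuple(row) in train_rows for row in np.asarray(x_test))
print(f"{n_seen} of {len(x_test)} test rows duplicate a training row")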

After Training a Naive Bayes Text Classification Algorithm, how to predict topic of a Single text file

I have trained and tested the Naive Bayes algorithm using text train and test data. Now I want to predict the topic of a single text file.
Here is my code:
# importing test, train data
import sklearn.datasets as skd

categories = ['business', 'entertainment', 'local', 'sports', 'world']
sinhala_train = skd.load_files('Cleant data\stemmed_filtered_sinhala-set1', categories=categories, encoding='utf-8')
sinhala_test = skd.load_files('Cleant data\stemmed_filtered_sinhala-set2', categories=categories, encoding='utf-8')

name_file = "adaderana_67571.txt"
A = open(name_file, encoding='utf-8')
new_file = A.read()

from sklearn.feature_extraction.text import CountVectorizer
count_vectorization = CountVectorizer()
train_data_tf = count_vectorization.fit_transform(sinhala_train.data)
train_data_tf.shape

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()
train_data_tfidf = tfidf_trans.fit_transform(train_data_tf)
train_data_tfidf.shape

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_data_tfidf, sinhala_train.target)

test_data_tf = count_vectorization.transform(sinhala_test.data)
# note: fit_transform refits the idf weights on the test set;
# transform alone would reuse the fit from the training data
test_data_tfidf = tfidf_trans.fit_transform(test_data_tf)
predicted = clf.predict(test_data_tfidf)

from sklearn import metrics
from sklearn.metrics import accuracy_score
print("Accuracy of the model:", accuracy_score(sinhala_test.target, predicted))
print(metrics.classification_report(sinhala_test.target, predicted, target_names=sinhala_test.target_names))
metrics.confusion_matrix(sinhala_test.target, predicted)
And this is my output,
Accuracy of the model: 0.864
               precision    recall  f1-score   support
     business       0.78      0.94      0.85       100
entertainment       0.95      0.86      0.90       100
        local       0.89      0.65      0.75       100
       sports       0.91      0.93      0.92       100
        world       0.83      0.94      0.88       100
    micro avg       0.86      0.86      0.86       500
    macro avg       0.87      0.86      0.86       500
 weighted avg       0.87      0.86      0.86       500

array([[94,  2,  4,  0,  0],
       [ 2, 86,  2,  4,  6],
       [19,  0, 65,  5, 11],
       [ 1,  3,  1, 93,  2],
       [ 5,  0,  1,  0, 94]], dtype=int64)
Now I want to predict the topic of the text file new_file.
Can someone help me write the code to predict the topic of this text file?
I solved my problem. This is the code I used to predict the topic.
docs_new1 = new_file  # the raw text of the single file read in the question
docs_new = [docs_new1]
X_new_counts = count_vectorization.transform(docs_new)
X_new_tfidf = tfidf_trans.transform(X_new_counts)
predicted_topic = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted_topic):
    topic = sinhala_train.target_names[category]
    print(topic)
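For what it's worth, the same flow can also be written as a single scikit-learn Pipeline, which keeps the vectorizer, transformer and classifier fitted together. This is a sketch of an alternative, not the original code:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Fit counts -> tf-idf -> Naive Bayes as one object.
text_clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
text_clf.fit(sinhala_train.data, sinhala_train.target)

# Predict the topic of the single document read into new_file.
predicted = text_clf.predict([new_file])
print(sinhala_train.target_names[predicted[0]])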

KNN does not find classes after balancing data

I have a strange problem: I have a model with 4 clusters, and the data is unbalanced at the following proportions: 75%, 15%, 7% and 3%. I split it into train and test with an 80/20 proportion, then I train a KNN with 5 neighbors, giving me an accuracy of 1.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
train_index, test_index = next(sss.split(X, y))
x_train, y_train = X[train_index], y[train_index]
x_test, y_test = X[test_index], y[test_index]

KNN_final = KNeighborsClassifier()
KNN_final.fit(x_train, y_train)
y_pred = KNN_final.predict(x_test)

print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test, y_pred))
print('Classification report: \n', metrics.classification_report(y_test, y_pred, digits=2))
Avg. accuracy for all classes: 1.0
Classification report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00       140
           1       1.00      1.00      1.00        60
           2       1.00      1.00      1.00       300
           3       1.00      1.00      1.00      1500
    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
Although it seems strange, I keep going, get new data and try to classify it based on this model, but it never finds the class with the smallest percentage; it always misclassifies it as the second smallest class.
So I try to balance the data using the imbalanced-learn library with the SMOTEENN algorithm:
Original dataset shape Counter({3: 7500, 2: 1500, 0: 700, 1: 300})
sme = SMOTEENN(sampling_strategy='all', random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 7500, 1: 7500, 2: 7500, 3: 7500})
Then I do the same thing, split it into train and test with the same 80/20 proportion and train a new KNN classifier with 5 neighbors. But the classification report seems even worse now:
Avg. accuracy for all classes: 1.0
Classification report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00      1500
           1       1.00      1.00      1.00       500
    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
I don't see what I'm doing wrong. Is there any step I need to take after resampling the data, other than split and shuffle, before training a new classifier? Why is my KNN not seeing 4 classes now?
Although a full investigation requires your data, which you do not provide, such behavior is (at least partially) consistent with the following scenario:
You have duplicates (possibly a lot) in your initial data
Due to these duplicates, some (most? all?) of your test data are actually not new/unseen, but copies of samples in your training data, which leads to an unreasonably high test accuracy of 1.0
When adding new data (no duplicates of your initial ones), the model unsurprisingly fails to fulfill the expectations created from such a high accuracy (1.0) in the test data.
Notice that the stratified split will not protect you from such a scenario; here is a demonstration with toy data, adapted from the documentation:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 0, 1, 0, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
train_index, test_index = next(sss.split(X, y))

X[train_index]
# result:
array([[3, 4],
       [1, 2],
       [3, 4]])

X[test_index]
# result:
array([[3, 4],
       [1, 2],
       [1, 2]])
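If duplicates are indeed the culprit, one simple guard (my suggestion, not part of the answer above) is to deduplicate the rows before splitting, so no test sample can be a copy of a training sample:
import numpy as np

# Keep one copy of each distinct row; idx maps back into y so the
# labels stay aligned with the deduplicated features.
X_unique, idx = np.unique(X, axis=0, return_index=True)
y_unique = y[idx]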

score error while creating model in python

I was using classification report to check the accuracy and also the confusion matrix
I made some modifications to the code and it seems to work now
import numpy as np
import sklearn as sk
import sklearn.metrics
from sklearn import svm
from sklearn.model_selection import train_test_split

x = np.array([17, 17.083333, 17.166667, 17.25, 17.333333, 17.416667])
x = x.reshape(6, 1)
y = [1, 0, 1, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

score = sk.metrics.accuracy_score(y_test, pred)
report = sk.metrics.classification_report(y_test, pred, target_names=['0', '1'])
confusionmatrix = sk.metrics.confusion_matrix(y_test, pred)
print("Accuracy_Score: " + str(score))
print("Classification_Report:\n" + report)
print("Confusion_Matrix:")
print(confusionmatrix)
Output:
Accuracy_Score: 0.5
Classification_Report:
              precision    recall  f1-score   support
           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1
 avg / total       0.25      0.50      0.33         2
Confusion_Matrix:
[[0 1]
[0 1]]
I changed the input x to a NumPy array and removed the extra values from the x.reshape call. Also, you had a typo in clf.predict(): you had written "Xtest", and it has to be "X_test".
Hope this helps
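As a side note (a general NumPy idiom, not part of the answer): reshape(-1, 1) produces the same column vector without hard-coding the sample count:
import numpy as np

# -1 lets NumPy infer the number of rows from the array's length.
x = np.array([17, 17.083333, 17.166667, 17.25, 17.333333, 17.416667]).reshape(-1, 1)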
