I have a strange problem. I have a model with 4 clusters, and the data is imbalanced with the following proportions: 75%, 15%, 7% and 3%. I split it into train and test with an 80/20 proportion, then train a KNN with 5 neighbors, which gives me an accuracy of 1.
from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# stratified 80/20 split, keeping the class proportions in both sets
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
train_index, test_index = next(sss.split(X, y))
x_train, y_train = X[train_index], y[train_index]
x_test, y_test = X[test_index], y[test_index]

# KNN with 5 neighbors (the default)
KNN_final = KNeighborsClassifier(n_neighbors=5)
KNN_final.fit(x_train, y_train)
y_pred = KNN_final.predict(x_test)

print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test, y_pred))
print('Classification report: \n', metrics.classification_report(y_test, y_pred, digits=2))
Avg. accuracy for all classes: 1.0
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       140
           1       1.00      1.00      1.00        60
           2       1.00      1.00      1.00       300
           3       1.00      1.00      1.00      1500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
Although this seems strange, I keep going, get new data and try to classify it with this model, but it never finds the class with the smallest percentage; it always misclassifies it as the second smallest class.
So I try to balance the data using the imbalanced-learn library with the SMOTEENN algorithm:
from collections import Counter
from imblearn.combine import SMOTEENN

print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({3: 7500, 2: 1500, 0: 700, 1: 300})
sme = SMOTEENN(sampling_strategy='all', random_state=42)  # SMOTE over-sampling + ENN cleaning
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 7500, 1: 7500, 2: 7500, 3: 7500})
Then I do the same thing: split it into train and test with the same 80/20 proportion and train a new KNN classifier with 5 neighbors. But the classification report seems even worse now:
Avg. accuracy for all classes: 1.0
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
I don't see what I'm doing wrong. Is there any step I need to perform after resampling the data, other than shuffling and splitting, before training a new classifier? Why is my KNN not seeing 4 classes now?
Although a full investigation requires your data, which you do not provide, such behavior is (at least partially) consistent with the following scenario:
You have duplicates (possibly a lot) in your initial data
Due to these duplicates, some (most? all?) of your test data are actually not new/unseen, but copies of samples in your training data, which leads to an unreasonably high test accuracy of 1.0
When adding new data (no duplicates of your initial ones), the model unsurprisingly fails to fulfill the expectations created from such a high accuracy (1.0) in the test data.
Notice that the stratified split will not protect you from such a scenario; here is a demonstration with toy data, adapted from the documentation:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
train_index, test_index = next(sss.split(X, y))

X[train_index]
# result:
array([[3, 4],
       [1, 2],
       [3, 4]])

X[test_index]
# result:
array([[3, 4],
       [1, 2],
       [1, 2]])
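A quick way to check whether this is what is happening with your own data (this check is not part of the original question; it assumes X is a 2-D NumPy array) is to count exact duplicate rows and see how many test samples are byte-for-byte copies of training samples:

import numpy as np

# number of exact duplicate rows in the full feature matrix
n_unique = np.unique(X, axis=0).shape[0]
print(X.shape[0] - n_unique, 'duplicate rows out of', X.shape[0])

# how many test rows are identical to some training row
train_rows = {row.tobytes() for row in X[train_index]}
leaked = sum(row.tobytes() in train_rows for row in X[test_index])
print(leaked, 'of', len(test_index), 'test rows also appear in the training set')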
I have a three-class problem and I'm able to report precision and recall for each class with the below code:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
which gives me the precision and recall nicely for each of the 3 classes in a table format.
My question is how can I now get sensitivity and specificity for each of the 3 classes? I looked at sklearn.metrics and I didn't find anything for reporting sensitivity and specificity.
If we check the help page for classification report:
Note that in binary classification, recall of the positive class is
also known as “sensitivity”; recall of the negative class is
“specificity”.
So we can convert the predictions into a binary (one-vs-rest) problem for every class, and then read the sensitivity and specificity for that class off the recall values returned by precision_recall_fscore_support.
Using an example:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Looks like:
              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
Using sklearn:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

res = []
for l in [0, 1, 2]:
    # one-vs-rest: treat class l as the positive class
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_true) == l,
                                                         np.array(y_pred) == l,
                                                         pos_label=True, average=None)
    # recall[1] is the positive-class recall (sensitivity),
    # recall[0] the negative-class recall (specificity)
    res.append([l, recall[1], recall[0]])
Put the results into a dataframe:
import pandas as pd

pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])

   class  sensitivity  specificity
0      0     1.000000         0.75
1      1     0.000000         0.75
2      2     0.666667         1.00
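A more compact alternative (not in the original answer; it assumes scikit-learn >= 0.21) is multilabel_confusion_matrix, which returns one 2x2 matrix per class in [[tn, fp], [fn, tp]] order, from which both metrics follow directly:

from sklearn.metrics import multilabel_confusion_matrix

# one 2x2 confusion matrix per class, laid out as [[tn, fp], [fn, tp]]
mcm = multilabel_confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
print(sensitivity, specificity)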
Classification report's output is a formatted string. This code snippet extracts the required values and stores them in a 2-D list.
Note: To understand the code better, add print statements to check the variable values.
y = classification_report(y_test, y_pred)  # the classification report's output is a string
lines = y.split('\n')                      # extract every line and store it in a list

res = []  # list to store the cleaned results
for i in range(len(lines)):
    line = lines[i].split(" ")              # values are separated by blanks; split at the blank spaces
    line = [j for j in line if j != '']     # keep only the non-empty values
    if len(line) != 0:
        # empty lines get added as empty lists; skip those
        res.append(line)
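The string parsing can also be avoided entirely: classification_report accepts output_dict=True (available since scikit-learn 0.20), which returns a nested dict that can be indexed directly or turned into a DataFrame:

import pandas as pd
from sklearn.metrics import classification_report

# per-class precision/recall/f1/support as a nested dict instead of a formatted string
report = classification_report(y_test, y_pred, output_dict=True)
df = pd.DataFrame(report).transpose()
print(df)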
I have some data that contains difficulty scores for tests plus some features. Example (the numbers are random, my real data has about 800 rows and 8 columns):
question  time_needed  media_existent  frequency_changed_answers  score
abc              3545               0                       1.25   0.79
dff              3574               0                       2.80   0.03
xyz              1123               0                       4.50   0.60
mno              7000               1                       3.77   1.00
pqr              4656               0                       1.00   0.99
stv              4367               0                       2.73   0.33
The score is between 0 and 1. The closer to 1, the easier the question. The frequency of changed answers is how many times the answers have been changed before submission (the student was undecided) divided by how many times the question was answered (some questions are more popular).
Just like in this example, I applied the 3 methods (Random Forest, permutations, SHAP) to figure out which features are the most important. All 3 of them consider this frequency the most important, then the time, then whether the test contains media.
For random forest:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

list_of_columns = ['time_needed', 'media_existent', 'frequency_changed_answers']
X = df_random_forest[list_of_columns]
target_column = 'score'
y = df_random_forest[target_column]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25, random_state=12)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
rf.feature_importances_
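The permutation importances mentioned above can be computed along these lines (a sketch, not my exact code; it assumes scikit-learn >= 0.22 for sklearn.inspection.permutation_importance):

from sklearn.inspection import permutation_importance

# shuffle each feature on the held-out set and measure the drop in the score (R² here)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=12)
for name, importance in zip(list_of_columns, perm.importances_mean):
    print(name, importance)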
But the following score (R² on the test set, since this is a regressor) is just 0.2932189613132453:
rf.score(X_test, y_test)
Also:
scores = cross_val_score(rf, X, y, cv=5)  # for a regressor this reports R², not accuracy
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
>>0.25 accuracy with a standard deviation of 0.05.
What could be the problem?
# x_test / y_test come from an earlier train/test split (not shown);
# this second split carves them into validation and test halves
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

print(x_train.shape)
# (1413, 3) <---- result
print(x_val.shape)
# (472, 3)  <---- result
print(x_test.shape)
# (471, 3)  <---- result
I split the data as shown above and got those shapes.
from sklearn.tree import DecisionTreeClassifier
dTree = DecisionTreeClassifier(max_depth=2,random_state=0).fit(x_train,y_train)
print("train score : {}".format(dTree.score(x_train, y_train)))
#train score : 1.0 <----Result
print("val score : {}".format(dTree.score(x_val, y_val)))
#val score : 1.0 <----Result
I then trained a decision tree and printed the train and validation scores; both came out as 1.
predict_y = dTree.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, dTree.predict(x_test)))
print("test score : {}".format(dTree.score(x_test, y_test)))
              precision    recall  f1-score   support

           A       1.00      1.00      1.00       235
           B       1.00      1.00      1.00       236

    accuracy                           1.00       471
   macro avg       1.00      1.00      1.00       471
weighted avg       1.00      1.00      1.00       471

test score : 0.9978768577494692
Finally, classification_report also showed the above results. Are some of my data splits wrong? Or does a score of 1 mean all the data was classified perfectly? If I'm doing something wrong, I'd like to hear the right approach.
I made up an Excel sheet of random numbers (3000 rows and 6 columns) and set it so that any row with column B >= 50, column C equal to 0 and column E equal to 1 gets a final 'y' value of 1; otherwise it gets 0. I ran this through the RandomForestClassifier code below, and it doesn't work: it either returns 0 for all new test data or doesn't take column B into account at all when predicting. How can I solve this?
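For reference, a dataset following the rule described above can be generated roughly like this (a sketch; the column names and value ranges are assumptions, not the actual spreadsheet):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 3000

# hypothetical columns A-E of random numbers, mimicking the described sheet
df = pd.DataFrame({
    'A': rng.integers(0, 100, n),
    'B': rng.integers(0, 100, n),
    'C': rng.integers(0, 2, n),
    'D': rng.integers(0, 100, n),
    'E': rng.integers(0, 2, n),
})

# label rule from the question: B >= 50, C == 0 and E == 1 -> y = 1, otherwise 0
df['y'] = ((df['B'] >= 50) & (df['C'] == 0) & (df['E'] == 1)).astype(int)
df.to_csv('new_created_data.csv', index=False)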
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pickle
data_crd = pd.read_csv(r'C:\Users\Rada1\.spyder-py3\new_created_data.csv')
#C:\Users\Rada1\.spyder-py3\new_created_data.csv
data_crd.head()
X = data_crd.iloc[:,1:5]
y = data_crd.iloc[:,5]
#print (X)
#print (y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier = RandomForestClassifier(n_estimators=500, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# persist the trained model
with open('model_wcd', 'wb') as f:
    pickle.dump(classifier, f)
I get 100% accuracy as my result, which already feels wrong. What do I need to adjust?
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       515
           1       1.00      1.00      1.00        85

    accuracy                           1.00       600
   macro avg       1.00      1.00      1.00       600
weighted avg       1.00      1.00      1.00       600

[[515   0]
 [  0  85]]
1.0
Hopefully it will work if you use stratify=y:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
and also use MinMaxScaler for the numerical features, reshaping them to (-1, 1):
# num_feature: a MinMaxScaler fitted beforehand on the training column (not shown here)
x_train_num = num_feature.transform(x_train[column_name].values.reshape(-1, 1))
x_test_num = num_feature.transform(x_test[column_name].values.reshape(-1, 1))
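For completeness, a minimal sketch of how num_feature could be created and fitted (column_name is a placeholder for whichever numerical column is being scaled):

from sklearn.preprocessing import MinMaxScaler

column_name = 'some_numeric_column'  # placeholder name, not from the original answer
num_feature = MinMaxScaler()

# fit on the training data only, then transform both splits
num_feature.fit(x_train[column_name].values.reshape(-1, 1))
x_train_num = num_feature.transform(x_train[column_name].values.reshape(-1, 1))
x_test_num = num_feature.transform(x_test[column_name].values.reshape(-1, 1))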
I was using the classification report to check the accuracy, and also the confusion matrix.
I made some modifications to the code and it seems to work now:
import numpy as np
import sklearn as sk
import sklearn.metrics  # so that sk.metrics is available
from sklearn import svm
from sklearn.model_selection import train_test_split

x = np.array([17, 17.083333, 17.166667, 17.25, 17.333333, 17.416667])
x = x.reshape(6, 1)
y = [1, 0, 1, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

score = sk.metrics.accuracy_score(y_test, pred)
report = sk.metrics.classification_report(y_test, pred, target_names=['0', '1'])
confusionmatrix = sk.metrics.confusion_matrix(y_test, pred)

print("Accuracy_Score: " + str(score))
print("Classification_Report:\n" + report)
print("Confusion_Matrix:")
print(confusionmatrix)
output:
Accuracy_Score: 0.5
Classification_Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

 avg / total       0.25      0.50      0.33         2

Confusion_Matrix:
[[0 1]
 [0 1]]
I changed the input "x" to a NumPy array and removed the extra values from x.reshape, and you also had a typo in clf.predict(): you had written "Xtest", but it has to be "X_test".
Hope this helps