Unusually high accuracy when running ML - Python

Hi, I am working with a difficult data set, in that there is low correlation between the inputs and the output, yet results are very good (99.9% accuracy on the test set). I'm sure I'm doing something wrong; I just don't know what.
The label is the 'unsafe' column, which is either 0 or 1 (it was originally 0 or 100, but I capped the maximum value; it made no difference to the result). I started with random forests and then ran k-nearest neighbors and got almost the same accuracy, 99.9%.
There are many more 0s than 1s: in the training set of roughly 80,000 rows there are only 169 1s (there is also a run of 1s at the end, but this is just how the original file was imported).
import os
import glob
import numpy as np
import pandas as pd
import sklearn
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_pickle('/Users/shellyganga/Downloads/ola.pickle')
maxVal = 1
df.unsafe = df['unsafe'].where(df['unsafe'] <= maxVal, maxVal)
print(df.head())
df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[-2], axis=1, inplace=True)
#setting features and labels
labels = np.array(df['unsafe'])
features= df.drop('unsafe', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)
from sklearn.model_selection import train_test_split
# 30% examples in test data
train, test, train_labels, test_labels = train_test_split(features, labels,
                                                           stratify=labels,
                                                           test_size=0.3,
                                                           random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train, train_labels)
print(np.mean(train_labels))
print(train_labels.shape)
print('accuracy on train: {:.5f}'.format(knn.score(train, train_labels)))
print('accuracy on test: {:.5f}'.format(knn.score(test, test_labels)))
output:
0.0023654350798950337
(81169,)
accuracy on train: 0.99763
accuracy on test: 0.99761

The fact that you have many more instances of 0 than 1 is an example of class imbalance. Here is a really cool stats.stackexchange question on the topic.
Basically, if only 169 out of your 80,000 labels are 1 and the rest are 0, then your model could naively predict the label 0 for every instance and still have a training-set accuracy (= fraction of correctly classified instances) of 99.78875%.
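To see this concretely, here is a minimal sketch using scikit-learn's DummyClassifier as a majority-class baseline (reusing the train/test arrays from your code above); it should score essentially the same "accuracy" as your KNN:
from sklearn.dummy import DummyClassifier
# baseline that always predicts the most frequent class (here: 0)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(train, train_labels)
print('baseline accuracy on test: {:.5f}'.format(baseline.score(test, test_labels)))
Any model that beats this baseline is actually learning something; merely matching it is a red flag.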
I suggest trying the F1 score, which is the harmonic mean of precision, AKA positive predictive value = TP/(TP + FP), and recall, AKA sensitivity = TP/(TP + FN): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
from sklearn.metrics import f1_score
print('F1 score on train: {:.5f}'.format(f1_score(train_labels, knn.predict(train))))
print('F1 score on test: {:.5f}'.format(f1_score(test_labels, knn.predict(test))))

Related

inconsistency between contamination set up and number of outlier prediction in Sklearn isolation Forest

I was inspired by this notebook, and I'm experimenting with the IsolationForest algorithm, using scikit-learn==0.22.2.post1, for anomaly detection on the SF version of the KDDCUP99 dataset, which includes 4 attributes. The data is fetched directly from sklearn and, after preprocessing (label-encoding the categorical feature), passed to the IF algorithm with the default setup.
The full code is as follows:
from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, roc_curve, roc_auc_score, f1_score, precision_recall_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import matplotlib.pyplot as plt
import datetime
%matplotlib inline
def byte_decoder(val):
    # decodes byte literals to strings
    return val.decode('utf-8')
#Load Dataset KDDCUP99 from sklearn
target = 'target'
sf = datasets.fetch_kddcup99(subset='SF', percent10=False) # you can use percent10=True for convenience sake
dfSF = pd.DataFrame(sf.data,
                    columns=["duration", "service", "src_bytes", "dst_bytes"])
assert len(dfSF) > 0, "SF dataset not loaded."
dfSF[target]=sf.target
anomaly_rateSF = 1.0 - len(dfSF.loc[dfSF[target]==b'normal.'])/len(dfSF)
"SF Anomaly Rate is:"+"{:.1%}".format(anomaly_rateSF)
#'SF Anomaly Rate is: 0.5%'
#Data Processing
toDecodeSF = ['service']
# label-encode fields of type string
# convert all abnormal target types to a single anomaly class
dfSF['binary_target'] = [1 if x==b'normal.' else -1 for x in dfSF[target]]
leSF = preprocessing.LabelEncoder()
for f in toDecodeSF:
    dfSF[f + " (encoded)"] = list(map(byte_decoder, dfSF[f]))
    dfSF[f + " (encoded)"] = leSF.fit_transform(dfSF[f])
for f in toDecodeSF:
    dfSF.drop(f, axis=1, inplace=True)
dfSF.drop(target, axis=1, inplace=True)
#check rate of Anomaly for setting contamination parameter in IF
dfSF["binary_target"].value_counts() / np.sum(dfSF["binary_target"].value_counts())
#data split
X_train_sf, X_test_sf, y_train_sf, y_test_sf = train_test_split(dfSF.drop('binary_target', axis=1),
                                                                dfSF['binary_target'],
                                                                test_size=0.33,
                                                                random_state=11,
                                                                stratify=dfSF['binary_target'])
#print(y_test_sf.value_counts())
#1 230899
#-1 1114
#Name: binary_target, dtype: int64
#train IF and predict the outliers/anomalies on the test set with 10% contamination:
clfIF = IsolationForest(max_samples="auto",
                        random_state=11,
                        contamination=0.1,
                        n_estimators=100,
                        n_jobs=-1)
clfIF.fit(X_train_sf, y_train_sf)
y_pred_test = clfIF.predict(X_test_sf)
#print(X_test_sf.shape)
#(232013, 4)
#print(np.unique(y_pred_test, return_counts=True))
#(array([-1, 1]), array([ 23248, 208765])) # instead of labelling 10% of 232013 (i.e. 23201) as outliers/anomalies, it flags 23248!
Based on the documentation, in the binary case we can extract true positives, etc. as follows:
tn, fp, fn, tp = confusion_matrix(y_test_sf, y_pred_test).ravel()
print("TN: ",tn,"FP: ", fp,"FN: " ,fn,"TP: ", tp)
#TN: 1089 FP: 25 FN: 22159 TP: 208740
Problems:
Problem 1: I'm wondering why IF predicts more than the 10% contamination already set when labelling outliers/anomalies on the test set: 23248 instead of 23201!
Problem 2: Normally TN + FP should equal the number of inliers/normals (230899) and FN + TP should equal 1114, as counted after the data split. It seems to be the other way around in my implementation, but I couldn't figure out why or how to debug it.
Problem 3: Based on the KDDCUP99 dataset documentation, its user guide, and my calculation in the implementation above, the anomaly rate is 0.5%, which means that if I set contamination=0.005, it should give me predictions matching that rate.
Probably I am missing something here, and any help will be highly appreciated.
The truth is that the contamination parameter simply controls the threshold the decision function uses to decide when a scored data point should be considered an outlier; it has no impact on the model itself. It could therefore make sense to use some statistical analysis to get a rough estimate of the contamination.
If you expect a certain number of outliers in your dataset, then you can use the raw scores to find a threshold that gives you that number, and set the contamination parameter retrospectively when applying the model to new data.
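For example, a rough sketch (assuming the ~0.5% anomaly rate computed above, and reusing clfIF and X_test_sf; the names expected_rate, threshold etc. are illustrative):
import numpy as np
# score_samples: higher means more normal, lower means more anomalous
scores = clfIF.score_samples(X_test_sf)
expected_rate = 0.005  # estimated anomaly rate
threshold = np.quantile(scores, expected_rate)
y_pred_custom = np.where(scores <= threshold, -1, 1)  # lowest-scoring fraction flagged as outliers
print(np.unique(y_pred_custom, return_counts=True))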

My accuracy is at 0.0 and I don't know why?

I am getting an accuracy of 0.0. I am using the boston housing dataset.
Here is my code:
import sklearn
from sklearn import datasets
from sklearn import svm, metrics
from sklearn import linear_model, preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
boston = datasets.load_boston()
x = boston.data
y = boston.target
train_data, test_data, train_label, test_label = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
model = KNeighborsClassifier()
lab_enc = preprocessing.LabelEncoder()
train_label_encoded = lab_enc.fit_transform(train_label)
test_label_encoded = lab_enc.fit_transform(test_label)
model.fit(train_data, train_label_encoded)
predicted = model.predict(test_data)
accuracy = model.score(test_data, test_label_encoded)
print(accuracy)
How can I increase the accuracy on this dataset?
The Boston dataset is for regression problems. Definition in the docs:
Load and return the boston house-prices dataset (regression).
So it does not make sense to use an ordinary label encoding, because the labels are samples from continuous data. For example, you encode 12.3 and 12.4 to completely different labels even though they are very close to each other, and the result is counted as wrong if the classifier predicts 12.4 when the real target is 12.3; this is not a binary right-or-wrong situation. In classification, a prediction is either correct or not; in regression, error is measured differently, for example as mean squared error.
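For comparison, here is a minimal sketch of the regression treatment with KNeighborsRegressor and mean squared error (note that load_boston has since been removed from scikit-learn, in version 1.2, so this mirrors the question's older setup):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
boston = datasets.load_boston()  # removed in scikit-learn >= 1.2
x, y = boston.data, boston.target
train_data, test_data, train_label, test_label = train_test_split(x, y, test_size=0.2)
model = KNeighborsRegressor()
model.fit(train_data, train_label)
print('MSE:', mean_squared_error(test_label, model.predict(test_data)))
print('R^2:', model.score(test_data, test_label))  # for regressors, score() returns R^2, not accuracy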
This part is not necessary, but I would like to give an example on the same dataset and source code. The simple idea of rounding the labels towards zero (truncating to the nearest integer towards zero) will give you some intuition:
5.0-5.9 -> 5
6.0-6.9 -> 6
...
50.0-50.9 -> 50
Let's change your code a little bit.
import numpy as np
def encode_func(labels):
    # truncate each continuous label to its integer part
    return np.array([int(l) for l in labels])
...
train_label_encoded = encode_func(train_label)
test_label_encoded = encode_func(test_label)
The output will be around 10%.

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit learn [1], but there the one vs all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one vs all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. to their original values)
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried the roc_curve function directly, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
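One common pattern here (a sketch reusing bdt_clf, X_test and y_test from above; not an accepted answer to this question) is to binarize only the true labels and score each class with the per-class probabilities from predict_proba, computing one ROC curve per class:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
y_score = bdt_clf.predict_proba(X_test)  # per-class probabilities, shape (n_samples, n_classes)
y_test_bin = label_binarize(y_test, classes=bdt_clf.classes_)
for i, cls in enumerate(bdt_clf.classes_):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    print(cls, auc(fpr, tpr))
The key point is that roc_curve needs continuous scores as its second argument, not hard class predictions, which is why binarising y_pred raises the multilabel-indicator error.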

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine number of unique categories, and number of cases for each category
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id', 'socioeconomic_status', 'brand', 'socioeconomic_desc', 'order_method',
               'first_order_channel', 'days_since_first_order', 'total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Train on 75% of the data, test on the remaining 25%
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using, but I am trying to predict the 'size_womenswear' column. There are 8 different sizes that I have encoded to predict, and I have moved this column to the end of the dataframe, so y is the dependent variable and x the independent variables (all the other columns).
I am using a Gaussian Naive Bayes classifier to try to classify the 8 different sizes, and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
Can't comment, just throwing out some ideas:
Maybe you need to deal with class imbalance, and try other models that fit the data better? Try the xgboost or lightgbm packages; given good data they usually perform pretty well in general, but it really depends on the data.
Also, with the way you split train and test, do the resulting train and test sets have a similar distribution for your y? That's very important.
Last thing: for classification models the performance measurement can be a bit tricky, so try some other measurement methods. Try F1 scores, or draw a confusion matrix and see what your predictions vs. y look like; perhaps your model is predicting everything as one or just a few classes. A sketch of those last two points is below.
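As a rough sketch of the stratified split and confusion-matrix ideas (reusing x and y from the question's code; stratify=y is my assumption of what "similar distribution" should mean here):
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
# stratify=y keeps the class distribution the same in train and test
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.25, random_state=0, stratify=y)
model = GaussianNB().fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
print(confusion_matrix(yTest, yPredicted))  # rows = true class, columns = predicted class
print(f1_score(yTest, yPredicted, average='macro'))  # per-class F1, averaged over the 8 sizes
If a single column of the confusion matrix soaks up most of the predictions, the model is collapsing to one or a few classes.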

Random forest algorithm not working for new datasets

I used the random forest algorithm in Python to train on my dataset1 and got an accuracy of 99%. But when I tried to predict values for the new dataset2, I got wrong values. I manually checked the results for the new dataset, and when I compared them with the prediction results, the accuracy was very low.
Below is my Code :
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=(20.0,10.0)
data = pd.read_csv('D:/Users/v477sjp/lpthw/Extract.csv', usecols=['CON_ID',
'CON_LEGACY_ID', 'CON_CREATE_TD',
'CON_CREATE_LT', 'BUL_CSYS_ID_ORIG', 'BUL_CSYS_ID_CORIG',
'BUL_CSYS_ID_DEST', 'BUL_CSYS_ID_CLEAR', 'TOP_ID', 'CON_DG_IN',
'PTP_ID', 'SMO_ID_1',
'SMO_ID_8', 'LOB_ID', 'PRG_ID', 'PSG_ID', 'SMP_ID', 'COU_ISO_ID_ORIG',
'COU_ISO_ID_DEST', 'CON_DELIV_DUE_DT', 'CON_DELIV_DUE_LT',
'CON_POSTPONED_DT', 'CON_DELIV_PLAN_DT', 'CON_INTL_IN', 'PCE_NR',
'CON_TC_PCE_QT', 'CON_TC_GRS_WT', 'CON_TC_VL', 'PCE_OC_LN',
'PCE_OC_WD','PCE_OC_HT', 'PCE_OC_VL', 'PCE_OC_WT', 'PCE_OA_LN',
'PCE_OA_WD','PCE_OA_HT', 'PCE_OA_VL', 'PCE_OA_WT', 'COS_EVENT_TD',
'COS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID)', 'BUL_CSYS_ID_OCC',
'PCE_NR.1', 'PCS_EVENT_TD', 'PCS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID).1', 'BUL_CSYS_ID_OCC.1',
'BUL_CSYS_ID_1', 'BUL_CSYS_ID_2', 'BUL_CSYS_ID_3',
'BUL_CSYS_ID_4', 'BUL_CSYS_ID_5', 'BUL_CSYS_ID_6', 'BUL_CSYS_ID_7',
'BUL_CSYS_ID_8', 'BUL_CSYS_ID_9', 'BUL_CSYS_ID_10', 'BUL_CSYS_ID_11',
'BUL_CSYS_ID_12', 'BUL_CSYS_ID_13', 'BUL_CSYS_ID_14',
'BUL_CSYS_ID_15',
'BUL_CSYS_ID_16', 'CON_TOT_SECT_NR', 'DELAY'] )
df = pd.DataFrame(data.values ,columns=data.columns)
for col_name in df.columns:
    if (df[col_name].dtype == 'object' and col_name != 'DELAY'):
        df[col_name] = df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes
target_attribute = df['DELAY']
input_attribute=df.loc[:,'CON_ID':'CON_TOT_SECT_NR']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(input_attribute,target_attribute, test_size=0.3)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, y_train);
predictions = rf.predict(X_test)
errors = abs(predictions - y_test)
print(errors)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'result.')
mape = 100 * (errors / y_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
data_new = pd.read_csv('D:/Users/v477sjp/lpthw/Extract-0401- TestCurrentData-Null.csv', usecols=['CON_ID','CON_LEGACY_ID','CON_CREATE_TD','CON_CREATE_LT','BUL_CSYS_ID_ORIG', 'BUL_CSYS_ID_CORIG','BUL_CSYS_ID_DEST','BUL_CSYS_ID_CLEAR','TOP_ID',
'CON_DG_N', 'PTP_ID', 'SMO_ID_1',
'SMO_ID_8', 'LOB_ID', 'PRG_ID', 'PSG_ID', 'SMP_ID', 'COU_ISO_ID_ORIG',
'COU_ISO_ID_DEST', 'CON_DELIV_DUE_DT', 'CON_DELIV_DUE_LT',
'CON_POSTPONED_DT', 'CON_DELIV_PLAN_DT', 'CON_INTL_IN', 'PCE_NR',
'CON_TC_PCE_QT', 'CON_TC_GRS_WT', 'CON_TC_VL','PCE_OC_LN','PCE_OC_WD',
'PCE_OC_HT', 'PCE_OC_VL', 'PCE_OC_WT', 'PCE_OA_LN', 'PCE_OA_WD',
'PCE_OA_HT', 'PCE_OA_VL', 'PCE_OA_WT', 'COS_EVENT_TD', 'COS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID)', 'BUL_CSYS_ID_OCC',
'PCE_NR.1','PCS_EVENT_TD', 'PCS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID).1', 'BUL_CSYS_ID_OCC.1',
'BUL_CSYS_ID_1', 'BUL_CSYS_ID_2', 'BUL_CSYS_ID_3',
'BUL_CSYS_ID_4', 'BUL_CSYS_ID_5', 'BUL_CSYS_ID_6', 'BUL_CSYS_ID_7',
'BUL_CSYS_ID_8', 'BUL_CSYS_ID_9', 'BUL_CSYS_ID_10', 'BUL_CSYS_ID_11',
'BUL_CSYS_ID_12', 'BUL_CSYS_ID_13', 'BUL_CSYS_ID_14','BUL_CSYS_ID_15',
'BUL_CSYS_ID_16', 'CON_TOT_SECT_NR', 'DELAY'] )
df_new = pd.DataFrame(data_new.values ,columns=data_new.columns)
for col_name in df_new.columns:
    if (df_new[col_name].dtype == 'object' and col_name != 'DELAY'):
        df_new[col_name] = df_new[col_name].astype('category')
        df_new[col_name] = df_new[col_name].cat.codes
X_test_new=df_new.loc[:,'CON_ID':'CON_TOT_SECT_NR']
y_pred_new = rf.predict(X_test_new)
df_new['Delay_1']=y_pred_new
df_new.to_csv('prediction_new.csv')
The prediction results are wrong for the new dataset and the accuracy is very low, even though accuracy on the original data is 99%. I should be getting negative values for the new dataset, but all the values I got are positive. Please help.
Seems like the algorithm is overfitted to the training dataset. Some options to try:
1) Use a larger dataset.
2) Decrease the features/columns or do some feature engineering.
3) Use regularization.
4) If the dataset is not too large, try decreasing the number of estimators in the random forest.
5) Play with other parameters like max_features, max_depth, min_samples_split, min_samples_leaf (see the sketch below).
Hope this helps.
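For example, point 5 might look something like this (a sketch with illustrative values, not tuned for this data; X_train, y_train, X_test, y_test as in the question):
from sklearn.ensemble import RandomForestRegressor
# shallower trees and larger leaves act as regularization for a random forest
rf = RandomForestRegressor(n_estimators=200,
                           max_depth=10,
                           min_samples_split=10,
                           min_samples_leaf=5,
                           max_features='sqrt',
                           random_state=42)
rf.fit(X_train, y_train)
print('train R^2:', rf.score(X_train, y_train))
print('test R^2:', rf.score(X_test, y_test))
# a large gap between train and test R^2 indicates overfitting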
