Accuracy is high every time but the resulting prediction is wrong - python

I am trying to predict the crop name by entering the temperature, soil humidity, pH and average rainfall.
The accuracy is always high, i.e. it ranges from 88% to 94% every time, but the final prediction is always wrong.
This is the code:
#importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#Reading the csv file
data=pd.read_csv('cpdata.csv')
#Creating dummy variable for target i.e label
label= pd.get_dummies(data.label).iloc[: , 1:]
data= pd.concat([data,label],axis=1)
data.drop('label', axis=1,inplace=True)
print('The data present in one row of the dataset is')
print(data.head(1))
train=data.iloc[:, 0:4].values
test=data.iloc[: ,4:].values
#Dividing the data into training and test set
X_train,X_test,y_train,y_test=train_test_split(train,test,test_size=0.3)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Importing Decision Tree classifier
from sklearn.tree import DecisionTreeRegressor
clf=DecisionTreeRegressor()
#Fitting the classifier into training set
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
from sklearn.metrics import accuracy_score
# Finding the accuracy of the model
a=accuracy_score(y_test,pred)
print("The accuracy of this model is: ", a*100)
ah=89.41
atemp=26.98
shum=28
pH=6.26
rain=58.54
l=[]
l.append(atemp)
l.append(ah)
l.append(pH)
l.append(rain)
predictcrop=[l]
# Putting the names of crop in a single list
crops=['rice','wheat','mungbean','Tea','millet','maize','lentil','jute','cofee','cotton','ground nut','peas','rubber','sugarcane','tobacco','kidney beans','moth beans','coconut','blackgram','adzuki beans','pigeon peas','chick peas','banana','grapes','apple','mango','muskmelon','orange','papaya','pomegranate','watermelon']
cr='rice'
#Predicting the crop
predictions = clf.predict(predictcrop)
count=0
for i in range(0,31):
    if(predictions[0][i]==1):
        c=crops[i]
        count=count+1
        break;
    i=i+1
if(count==0):
    print('The predicted crop is %s'%cr)
else:
    print('The predicted crop is %s'%c)
The output that I am getting is-
The accuracy of this model is: 90.43010752688173
The predicted crop is apple
Even though I enter the exact values for any other crop, I get apple or mango every time.
Kindly help.

Apply the scaler to your new data as well before predicting. I cannot test it without your data, but it should look something like:
datascaled = sc.transform(predictcrop)
predictions = clf.predict(datascaled)
To apply the same scaler to new data later, you need to save it:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
from joblib import dump, load
dump(sc, 'scaler.bin', compress=True)
and later:
sc=load('scaler.bin')
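A minimal sketch of the same idea, not the asker's actual setup: the data is synthetic, the three crop centres are invented for illustration, and a DecisionTreeClassifier on the raw label strings is swapped in for the regressor on dummy columns. Putting the scaler and the model in one pipeline means every new sample is scaled with the training statistics before it reaches the tree.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cluster centres: [temperature, humidity, pH, rainfall]
centres = {
    "rice": [25.0, 80.0, 6.5, 200.0],
    "wheat": [20.0, 60.0, 6.8, 80.0],
    "apple": [22.0, 92.0, 6.0, 110.0],
}

# Synthetic stand-in for cpdata.csv: 50 noisy rows per crop
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centres.values()])
y = np.repeat(list(centres.keys()), 50)

# The pipeline fits the scaler on the training data and reuses it in predict()
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X, y)

# New, unscaled sample in the same column order as the training data
new_sample = [[25.0, 80.0, 6.5, 200.0]]
predicted_crop = model.predict(new_sample)[0]
print("The predicted crop is", predicted_crop)
```

The fitted pipeline can be saved with dump and reloaded with load in the same way as the scaler alone, so the scaling step cannot be forgotten at prediction time.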

Related

sklearn model returns a mean absolute error of 0, why?

Toying around with sklearn and I wanted to predict TSLA Close prices for a few dates using the Open, High, Low prices and the Volume. I used a very basic model to predict the close and they were supposedly 100% accurate and I'm not sure why. The 0% error makes me feel as if I didn't set up my model correctly.
Code:
from os import X_OK
from numpy.lib.shape_base import apply_along_axis
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
tsla_data_path = "/Users/simon/Documents/PythonVS/ML/TSLA.csv"
tsla_data = pd.read_csv(tsla_data_path)
tsla_features = ['Open','High','Low','Volume']
y = tsla_data.Close
X = tsla_data[tsla_features]
# define model
tesla_model = DecisionTreeRegressor(random_state = 1)
# fit model
tesla_model.fit(X,y)
#print results
print('making predictions for the following five dates')
print(X.head())
print('________________________________________________')
print('the predictions are')
print(tesla_model.predict(X.head()))
print('________________________________________________')
print('the error is ')
print(mean_absolute_error(y.head(),tesla_model.predict(X.head())))
Output:
making predictions for the following five dates
Open High Low Volume
0 67.054001 67.099998 65.419998 39737000
1 66.223999 66.786003 65.713997 27778000
2 66.222000 66.251999 65.500000 12328000
3 65.879997 67.276001 65.737999 30372500
4 66.524002 67.582001 66.438004 32868500
________________________________________________
the predictions are
[65.783997 66.258003 65.987999 66.973999 67.239998]
________________________________________________
the error is
0.0
Data:
Date,Open,High,Low,Close,Adj_Close,Volume
2019-11-26,67.054001,67.099998,65.419998,65.783997,65.783997,39737000
2019-11-27,66.223999,66.786003,65.713997,66.258003,66.258003,27778000
2019-11-29,66.222000,66.251999,65.500000,65.987999,65.987999,12328000
2019-12-02,65.879997,67.276001,65.737999,66.973999,66.973999,30372500
2019-12-03,66.524002,67.582001,66.438004,67.239998,67.239998,32868500
You're making a mistake by measuring the performance of your model on the dataset used to train it.
If you want a proper evaluation of its performance, split your dataset in two: one part to train the model and the other to measure its performance. You can split it using sklearn.model_selection.train_test_split() as follows:
from sklearn.model_selection import train_test_split
tesla_model = DecisionTreeRegressor(random_state = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
tesla_model.fit(X_train, y_train)
mae = mean_absolute_error(y_test,tesla_model.predict(X_test))
Have a look at this Wikipedia page explaining the different datasets in ML.
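To see why the error on the training rows says so little, here is a small sketch on synthetic data (the feature ranges and the noise level are assumptions, not the TSLA file). A fully grown decision tree memorises the rows it was fitted on, so only the held-out error is informative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the price data: 4 features, noisy target
rng = np.random.default_rng(0)
X = rng.uniform(60.0, 70.0, size=(200, 4))
y = X.mean(axis=1) + rng.normal(0.0, 0.5, size=200)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

# Error on rows the tree has already seen versus rows it has not
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("MAE on training rows:", train_mae)
print("MAE on held-out rows:", test_mae)
```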

My accuracy is at 0.0 and I don't know why?

I am getting an accuracy of 0.0. I am using the boston housing dataset.
Here is my code:
import sklearn
from sklearn import datasets
from sklearn import svm, metrics
from sklearn import linear_model, preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
boston = datasets.load_boston()
x = boston.data
y = boston.target
train_data, test_data, train_label, test_label = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
model = KNeighborsClassifier()
lab_enc = preprocessing.LabelEncoder()
train_label_encoded = lab_enc.fit_transform(train_label)
test_label_encoded = lab_enc.fit_transform(test_label)
model.fit(train_data, train_label_encoded)
predicted = model.predict(test_data)
accuracy = model.score(test_data, test_label_encoded)
print(accuracy)
How can I increase the accuracy on this dataset?
The Boston dataset is for regression problems. Definition in the docs:
Load and return the boston house-prices dataset (regression).
So it does not make sense to use a label encoding, which treats the targets as discrete classes rather than samples of a continuous variable. For example, 12.3 and 12.4 are encoded as completely different labels even though they are very close to each other, and a prediction of 12.4 is scored as wrong when the real target is 12.3, although this is not a right-or-wrong situation. In classification a prediction is either correct or not, but in regression the error is measured differently, for example with mean squared error.
This part is not necessary, but I would like to give you an example using the same dataset and source code. The simple idea of rounding the labels towards zero (truncating to the integer part) will give you some intuition.
5.0-5.9 -> 5
6.0-6.9 -> 6
...
50.0-50.9 -> 50
Let's change your code a little bit.
import numpy as np
def encode_func(labels):
    return np.array([int(l) for l in labels])
...
train_label_encoded = encode_func(train_label)
test_label_encoded = encode_func(test_label)
The output will be around 10%.
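If the goal is to predict the continuous target itself, a regressor scored with a regression metric is the more natural fit. The following is only a sketch under stated assumptions: the one-feature data is synthetic (load_boston is no longer shipped with recent scikit-learn releases), and KNeighborsRegressor is swapped in for the classifier from the question.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic continuous target: roughly linear in one feature plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(500, 1))
y = 3.0 * x[:, 0] + rng.normal(0.0, 1.0, size=500)

train_data, test_data, train_target, test_target = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# No label encoding: the regressor works on the raw continuous targets
model = KNeighborsRegressor()
model.fit(train_data, train_target)
predicted = model.predict(test_data)

# Regression metrics measure how far off predictions are, not exact matches
mse = mean_squared_error(test_target, predicted)
r2 = r2_score(test_target, predicted)
print("mean squared error:", mse)
print("R^2:", r2)
```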

Need to print the column or row that contains my predicted label

I am using a Decision Tree algorithm to predict the label from a test file. However, I need to print the complete row, or a single cell, that contains that label. The code I am working on is shown below.
import numpy as np
import csv
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
path = "train_names.csv"
file=open(path)
reader = csv.reader(file)
data = np.asarray(list(reader))
#train data
names_train=data[1:,[0,1,2,3,4]]
label_train=data[1:,[5]]
#test data
names_test=data1[1:,[0,1,2,3,4]]
label_test=data1[1:,[5]]
decisionTreeClassifier = DecisionTreeClassifier()
decisionTreeClassifier.fit(names_train,label_train)
predictions = decisionTreeClassifier.predict(names_test)
print("Accuracy: ",accuracy_score(label_test,predictions))
for i in range(0,len(names_test)):
    print (predictions[i])
Did you mean to say that you want to print the output of your predictions? If so you can simply just call print(predictions) or you can use sklearn's confusion_matrix by importing this
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(label_test, predictions)
print(cm)
# your X-axis is the predicted while Y-axis is the truth
You can also print the model's score by using this
decisionTreeClassifier.score(names_test, label_test)
You just have to iterate over both the input values (names_test) and the predicted outputs (predictions). Try this!
predictions = decisionTreeClassifier.predict(names_test)
for x,y_pred in zip(names_test, predictions):
    print(x,y_pred)
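The pairing of rows and predictions can be tried end to end on a toy example. This sketch assumes a tiny hand-made dataset (the "yes"/"no" labels and the two feature columns are hypothetical) instead of train_names.csv.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data: the label depends only on the first column
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array(["no", "no", "yes", "yes"])
X_test = np.array([[1, 1], [0, 0]])

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

# Keep each complete input row next to the label predicted for it
rows_with_labels = [
    (row.tolist(), str(label)) for row, label in zip(X_test, predictions)
]
for row, label in rows_with_labels:
    print(row, "->", label)
```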

unusually high accuracy when running ML

Hi I am working with a difficult data set, in that there is low correlation between the input and output, yet results are very good (99.9% accuracy with the test set). I'm sure I'm doing something wrong, just don't know what.
The label is the 'unsafe' column, which is either 0 or 1 (it was originally 0 or 100, but I capped the maximum value; it made no difference to the result). I started with random forests and then ran k nearest neighbors and got almost the same accuracy, 99.9%. Screenshots of df are:
There are many more 0s than 1s (in the training set of 80,000 there are only 169 1s; there is also a run of 1s at the end, but this is just how the original file was imported).
import os
import glob
import numpy as np
import pandas as pd
import sklearn as sklearn
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_pickle('/Users/shellyganga/Downloads/ola.pickle')
maxVal = 1
df.unsafe = df['unsafe'].where(df['unsafe'] <= maxVal, maxVal)
print(df.head)
df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[-2], axis=1, inplace=True)
#setting features and labels
labels = np.array(df['unsafe'])
features= df.drop('unsafe', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)
from sklearn.model_selection import train_test_split
# 30% examples in test data
train, test, train_labels, test_labels = train_test_split(features, labels,
                                                          stratify = labels,
                                                          test_size = 0.3,
                                                          random_state = 0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train, train_labels)
print(np.mean(train_labels))
print(train_labels.shape)
print('accuracy on train: {:.5f}'.format(knn.score(train, train_labels)))
print('accuracy on test: {:.5f}'.format(knn.score(test, test_labels)))
output:
0.0023654350798950337
(81169,)
accuracy on train: 0.99763
accuracy on test: 0.99761
The fact that you have many more instances of 0 than 1 is an example of class imbalance. Here is a really cool stats.stackexchange question on the topic.
Basically, if only 169 out of your 80000 labels are 1 and the rest are 0, then your model could naively predict the label 0 for every instance, and still have a training-set accuracy (= fraction of correctly classified instances) of 99.78875%.
I suggest trying the F1 score, which is the harmonic mean of precision, AKA positive predictive value = TP/(TP + FP), and recall, AKA sensitivity = TP/(TP + FN): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
from sklearn.metrics import f1_score
# f1_score expects (true labels, predicted labels), so predict first
print('F1 score on train: {:.5f}'.format(f1_score(train_labels, knn.predict(train))))
print('F1 score on test: {:.5f}'.format(f1_score(test_labels, knn.predict(test))))
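The arithmetic can be checked with a short sketch. It assumes the 169-in-80000 split quoted above and a hypothetical "model" that always predicts 0; zero_division=0 is passed so that the undefined precision is reported as 0 without a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels with the imbalance quoted above: 169 positives out of 80000
n_total, n_positive = 80000, 169
y_true = np.zeros(n_total, dtype=int)
y_true[:n_positive] = 1

# A naive "model" that predicts the majority class for every instance
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print("accuracy: {:.7f}".format(accuracy))
print("F1 score: {:.7f}".format(f1))
```

The accuracy looks excellent while the F1 score shows that not a single positive instance was found.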

Multiclass Classification and probability prediction

import pandas as pd
import numpy
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
fi = "df.csv"
# Open the file for reading and read in data
file_handler = open(fi, "r")
data = pd.read_csv(file_handler, sep=",")
file_handler.close()
# split the data into training and test data
train, test = cross_validation.train_test_split(data,test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
train_features = train.ix[:,0:127]
train_label = train.iloc[:,127]
test_features = test.ix[:,0:127]
test_label = test.iloc[:,127]
naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)
print "test_data\n",test_data["p_malw"]
print "Accuracy:", naive_b.score(test_features,test_label)
I have written this code to accept input from a csv file with 128 columns, where 127 columns are features and the 128th column is the class label.
I want to predict the probability that the sample belongs to each class (there are 5 classes, 1-5), print it in the form of a matrix, and determine the class of the sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.
GaussianNB.predict_proba returns, for each sample, the probability of every class in the model. In your case it should return five columns, with as many rows as your test data. You can check which column corresponds to which class using naive_b.classes_. So it is not clear why you say this is not the desired output. Perhaps your problem comes from assigning the output of predict_proba, which is a two-dimensional array, to a single data frame column. Try:
pred_prob = naive_b.predict_proba(test_features)
instead of
test_data["p_malw"] = naive_b.predict_proba(test_features)
and verify its shape using pred_prob.shape. The second dimension should be 5.
If you want the predicted label for each sample you can use the predict method, followed by confusion matrix to see how many labels have been predicted correctly.
from sklearn.metrics import confusion_matrix
naive_b.fit(train_features, train_label)
pred_label = naive_b.predict(test_features)
confusion_m = confusion_matrix(test_label, pred_label)
confusion_m
Here is some useful reading.
sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba
sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
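A small sketch of how the probability matrix relates to the predicted labels. It assumes synthetic five-class data from make_blobs, shifted so the classes are 1-5, in place of df.csv; the variable names follow the answer above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for df.csv: 4 features, 5 classes labelled 1..5
X, y = make_blobs(n_samples=500, centers=5, n_features=4, random_state=0)
y = y + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

naive_b = GaussianNB().fit(X_train, y_train)

# One row per test sample, one column per class in naive_b.classes_
pred_prob = naive_b.predict_proba(X_test)

# The most probable column, mapped back to the class label it stands for
pred_label = naive_b.classes_[np.argmax(pred_prob, axis=1)]

print("shape of probability matrix:", pred_prob.shape)
print("column order:", naive_b.classes_)
print("first row of probabilities:", pred_prob[0])
```

Storing the result in its own variable, or in one data frame column per class, keeps the matrix intact.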
