I am using a KNN classifier on a data set and want the predicted probability for each outcome of the prediction, but I am not sure how to go about it. I have not found much on this topic. The code I am using is:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in data from a freely available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[Adult.loc[:, "race"] == "Amer-Indian-Eskimo", "race"] = "Other" #consolidating categorical data in the race column
Adult.loc[:,"race"].value_counts().plot(kind='bar') #plotting the consolidated categorical data in the race column
plt.title('race after consolidation')
plt.show()
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int) #One-hot encoding the categorical data in the race column
Adult.loc[:, "Black"] = (Adult.loc[:, "race"] == "Black").astype(int)
Adult.loc[:, "Asian-Pac-Islander"] = (Adult.loc[:, "race"] == "Asian-Pac-Islander").astype(int)
Adult.loc[:, "Other"] = (Adult.loc[:, "race"] == "Other").astype(int)
Adult.loc[:,"Other"] #Verifying One-hot encoding for Other column
Adult = Adult.drop("race", axis=1) #removing the obsolete column "race"
Minage = min(Adult.loc[:,"age"]) #MinMax normalizing the age column
Maxage = max(Adult.loc[:,"age"])
MinMaxage = (Adult.loc[:,"age"] - Minage)/(Maxage - Minage)
df2 = pd.DataFrame() #creating a dataframe to plot the normalized data
df2.loc[:,0] = Adult.loc[:, "White"] #filling the data frame
df2.loc[:,1] = NormZ1
df2.loc[:,1] = MinMaxage #assigning new columns for df2
df2.loc[:,2] = Adult.loc[:,"hoursperweek"]
df2.columns = ["White","MinMaxage","hoursperweek"] #labeling the columns for df2
df2.head() #checking new dataframe
X = np.array(df2.drop(["hoursperweek"], axis=1)) #choosing the label to predict and excluding it from the X array
y = np.array(df2["hoursperweek"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) #splitting the data into training and testing data
clf = neighbors.KNeighborsClassifier() #assigning K neighbors classifier
clf.fit(X_train, y_train) #fitting the data for X_train and y_train
accuracy = clf.score(X_test, y_test) #finding the accuracy of the prediction
print("accuracy rate with age MinMax Normilized")
print(accuracy)
print ('predictions for test set with age MinMax Normilized:') #showing results
print(clf.predict(X_test))
print ('actual class values with age MinMax Normilized:')
print(y_test)
I have loaded the actual and predicted outcomes into a new data frame and would like to add a third column with the predicted probability for each row, but I am not sure how to compute it in Python. Is there a way to get the predicted probabilities for each outcome? I would like to use them for a confusion matrix and an ROC curve.
In the code from the question, I removed the line
df2.loc[:,1] = NormZ1
(NormZ1 is never defined), re-ran the code, and called
print(clf.predict_proba(X_test))
This returned probabilities with shape (6513, 93): one row per test sample and one column per class.
You can call clf.predict_proba(X_test) to get the predicted probabilities.
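As a minimal sketch of how the output of predict_proba can be used, the example below fits a KNN classifier on synthetic binary data (an assumption; it stands in for the Adult columns) and reads off the probability of the predicted class for each row. The variable names are illustrative. Note that the ROC step assumes a binary target, which the 93-class hoursperweek label in the question is not.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data stands in for the real features and labels.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# One row per test sample, one column per class, ordered as in clf.classes_.
proba = clf.predict_proba(X_test)
print(clf.classes_)
print(proba.shape)

# Probability assigned to the class that was predicted for each row.
predicted = clf.predict(X_test)
predicted_proba = proba.max(axis=1)

# For a binary target, the positive-class column can feed an ROC curve or AUC.
auc = roc_auc_score(y_test, proba[:, 1])
print(auc)
```

The predicted_proba array has one value per test row, so it can be added directly as a third column next to the actual and predicted outcomes.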
I'm trying to implement a complement naive Bayes classifier using sklearn. My data has very imbalanced classes (30k samples of class 0 and 6k samples of class 1), and I'm trying to compensate for this using class weights.
Here is the shape of my dataset: (image not included)
I tried to use the compute_class_weight function to calculate the weights and then pass them to the fit function when training my model:
import numpy as np
import seaborn as sn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.naive_bayes import ComplementNB
#Import the csv data
data = pd.read_csv('output_pt900.csv')
#Create the header of the csv file
header = []
for x in range(0,2500):
header.append('pixel' + str(x))
header.append('status')
#Add the header to the csv data
data.columns = header
#Replace the b's and the f's in the status column by 0 and 1
data['status'] = data['status'].replace('b',0)
data['status'] = data['status'].replace('f',1)
print(data)
#Drop the NaN values
data = data.dropna()
#Separate the features variables and the status
y = data['status']
x = data.drop('status',axis=1)
#Split the original dataset into two other: train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)
all_together = y_train.to_numpy()
unique_classes = np.unique(all_together)
c_w = class_weight.compute_class_weight('balanced', unique_classes, all_together)
clf = ComplementNB()
clf.fit(x_train,y_train, c_w)
y_predict = clf.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
svm = sn.heatmap(cm, cmap='Blues', annot=True, fmt='g')
figure=svm.get_figure()
figure.savefig('confusion_matrix_cnb.png', dpi=400)
plt.show()
but I got this error:
ValueError: sample_weight.shape == (2,), expected (29752,)!
Does anyone know how to use class weights in sklearn models?
compute_class_weight returns an array whose length equals the number of unique classes, holding the weight to assign to instances of each class. So if there are 2 unique classes, c_w has length 2 and contains the weights for samples with label 0 and label 1.
When you call fit on your model, the sample_weight argument expects one weight per sample, which explains the error you received. To solve this, use the c_w returned by compute_class_weight to build an array of individual sample weights. Because your labels are 0 and 1, they can index c_w directly: [c_w[i] for i in all_together]. Your fit call would ultimately look something like:
clf.fit(x_train, y_train, sample_weight=[c_w[i] for i in all_together])
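The mapping from class weights to sample weights can be sketched as follows. The data here is a small synthetic stand-in (an assumption), and the dictionary lookup is one way to avoid relying on the labels being exactly 0 and 1. sklearn also provides compute_sample_weight, which builds the per-sample array in one call.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Toy imbalanced labels: 30 samples of class 0 and 6 of class 1.
y_train = np.array([0] * 30 + [1] * 6)
rng = np.random.default_rng(0)
x_train = rng.integers(0, 10, size=(y_train.size, 5))  # non-negative features

# Per-class weights, keyed by label so the lookup works for any label values.
classes = np.unique(y_train)
c_w = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
weight_by_class = dict(zip(classes, c_w))
sample_weight = np.array([weight_by_class[label] for label in y_train])

# compute_sample_weight produces the same per-sample array directly.
sample_weight_direct = compute_sample_weight(class_weight="balanced", y=y_train)

clf = ComplementNB()
clf.fit(x_train, y_train, sample_weight=sample_weight)
```

With the "balanced" setting each class weight is n_samples / (n_classes * class_count), so the rarer class receives the larger weight.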
I am working with a data set called Adult. I built a new data frame from a few of its columns, normalized a couple of them, and am trying to run KNN on the result. I get a ValueError: Unknown label type: 'continuous' error when trying to run
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
After researching the error online, it seems that I need to use a label encoder on my data after normalizing it, because it is now a float rather than an int, but I am having trouble using the label encoder. The code I am using is:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[Adult.loc[:, "race"] == "Amer-Indian-Eskimo", "race"] = "Other" #consolidating categorical data in the race column
Adult.loc[:,"race"].value_counts().plot(kind='bar') #plotting the consolidated categorical data in the race column
plt.title('race after consolidation')
plt.show()
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int) #One-hot encoding the categorical data in the race column
Adult.loc[:, "Black"] = (Adult.loc[:, "race"] == "Black").astype(int)
Adult.loc[:, "Asian-Pac-Islander"] = (Adult.loc[:, "race"] == "Asian-Pac-Islander").astype(int)
Adult.loc[:, "Other"] = (Adult.loc[:, "race"] == "Other").astype(int)
Adult.loc[:,"Other"] #Verifying One-hot encoding for Other column
Adult = Adult.drop("race", axis=1) #removing the obsolete column "race"
Minage = min(Adult.loc[:,"age"]) #MinMax normalizing the age column
Maxage = max(Adult.loc[:,"age"])
MinMaxage = (Adult.loc[:,"age"] - Minage)/(Maxage - Minage)
Minhours = min(Adult.loc[:,"hoursperweek"]) #MinMax normalizing the hoursperweek column
Maxhours = max(Adult.loc[:,"hoursperweek"])
MinMaxhours = (Adult.loc[:,"hoursperweek"] - Minhours)/(Maxhours - Minhours)
df2 = pd.DataFrame() #creating a dataframe to plot the normalized data
df2.loc[:,0] = Adult.loc[:, "White"] #filling the data frame
df2.loc[:,1] = MinMaxage
df2.loc[:,2] = MinMaxhours
df2.columns = ["White","MinMaxage","MinMaxhours"]
X = np.array(df2.drop(['MinMaxhours'], axis=1))
y = np.array(df2['MinMaxhours'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
clf.predict(X_test)
y_test
Could someone help me label encode the data so I can run KNN on it? I have looked at the sklearn site and various examples but am still having trouble applying it to my dataset. The error is raised when fitting the data with clf.fit(X_train, y_train).
It looks like you have a regression problem rather than a classification problem. You are trying to predict the MinMaxhours variable, which is a real number. To predict a real number, use the regression version of the nearest neighbors algorithm. The following code should give you a prediction.
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor()
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
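To illustrate why the classifier rejects the normalized target, the sketch below uses sklearn's type_of_target helper on synthetic data (an assumption; the arrays stand in for hoursperweek before and after MinMax scaling). Integer values are treated as class labels, while the scaled floats are treated as a continuous quantity, which is what a regressor expects.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)
X = rng.random((200, 2))

# Integer targets, similar in kind to the raw hoursperweek column.
y_hours = rng.integers(1, 100, size=200)
# MinMax scaling turns them into floats in [0, 1].
y_scaled = (y_hours - y_hours.min()) / (y_hours.max() - y_hours.min())

print(type_of_target(y_hours))   # integers are seen as class labels
print(type_of_target(y_scaled))  # non-integer floats are seen as continuous

# A regressor accepts the continuous target without any label encoding.
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X, y_scaled)
pred = reg.predict(X[:3])
print(pred)
```

If classification is what you actually want, one option is to leave the target unscaled and normalize only the features, since scaling the label has no benefit for KNN.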
I am trying to predict the crop name from the temperature, soil humidity, pH and average rainfall.
The accuracy is always high, ranging from 88% to 94% on every run, but the final prediction is always wrong.
This is the code:
#importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#Reading the csv file
data=pd.read_csv('cpdata.csv')
#Creating dummy variable for target i.e label
label= pd.get_dummies(data.label).iloc[: , 1:]
data= pd.concat([data,label],axis=1)
data.drop('label', axis=1,inplace=True)
print('The data present in one row of the dataset is')
print(data.head(1))
train=data.iloc[:, 0:4].values
test=data.iloc[: ,4:].values
#Dividing the data into training and test set
X_train,X_test,y_train,y_test=train_test_split(train,test,test_size=0.3)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Importing Decision Tree classifier
from sklearn.tree import DecisionTreeRegressor
clf=DecisionTreeRegressor()
#Fitting the classifier into training set
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
from sklearn.metrics import accuracy_score
# Finding the accuracy of the model
a=accuracy_score(y_test,pred)
print("The accuracy of this model is: ", a*100)
ah=89.41
atemp=26.98
shum=28
pH=6.26
rain=58.54
l=[]
l.append(atemp)
l.append(ah)
l.append(pH)
l.append(rain)
predictcrop=[l]
# Putting the names of crop in a single list
crops=['rice','wheat','mungbean','Tea','millet','maize','lentil','jute','cofee','cotton','ground nut','peas','rubber','sugarcane','tobacco','kidney beans','moth beans','coconut','blackgram','adzuki beans','pigeon peas','chick peas','banana','grapes','apple','mango','muskmelon','orange','papaya','pomegranate','watermelon']
cr='rice'
#Predicting the crop
predictions = clf.predict(predictcrop)
count=0
for i in range(0,31):
if(predictions[0][i]==1):
c=crops[i]
count=count+1
break;
i=i+1
if(count==0):
print('The predicted crop is %s'%cr)
else:
print('The predicted crop is %s'%c)
The output that I am getting is-
The accuracy of this model is: 90.43010752688173
The predicted crop is apple
Even though I enter the exact values for any other crop, I get apple or mango every time.
Kindly help.
Apply the scaler to your new data as well before predicting. I cannot test it without your data, but it should look something like:
datascaled = sc.transform(predictcrop)
predictions = clf.predict(datascaled)
To apply the same scaler to new data later, you need to save it:
from joblib import dump, load
dump(sc, 'scaler.bin', compress=True)
and later:
sc=load('scaler.bin')
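One way to avoid forgetting the scaling step is to bundle the scaler and the model in a single pipeline, so that predict applies the same transformation automatically. This is a minimal sketch on synthetic data; the crop names, the feature values, and the use of DecisionTreeClassifier (swapped in here because the target is a class label) are assumptions, not the original setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: two well separated groups of 4-feature samples.
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(10, 1, (50, 4))])
y = np.array(["crop_a"] * 50 + ["crop_b"] * 50)

# The pipeline fits the scaler on the training data and reuses it in predict.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X, y)

# New samples are passed in raw units; the pipeline scales them internally.
new_sample = [[10.0, 10.0, 10.0, 10.0]]
prediction = model.predict(new_sample)
print(prediction)
```

The fitted pipeline can be saved with joblib's dump and restored with load in the same way as the scaler, which keeps the scaler and the model together in one file.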
I used the random forest algorithm in Python to train on my dataset1 and got an accuracy of 99%. But when I use the model to predict values for the new dataset2, I get wrong values. I manually checked the results for the new dataset, and compared with them the accuracy of the predictions is very low.
Below is my Code :
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=(20.0,10.0)
data = pd.read_csv('D:/Users/v477sjp/lpthw/Extract.csv', usecols=['CON_ID',
'CON_LEGACY_ID', 'CON_CREATE_TD',
'CON_CREATE_LT', 'BUL_CSYS_ID_ORIG', 'BUL_CSYS_ID_CORIG',
'BUL_CSYS_ID_DEST', 'BUL_CSYS_ID_CLEAR', 'TOP_ID', 'CON_DG_IN',
'PTP_ID', 'SMO_ID_1',
'SMO_ID_8', 'LOB_ID', 'PRG_ID', 'PSG_ID', 'SMP_ID', 'COU_ISO_ID_ORIG',
'COU_ISO_ID_DEST', 'CON_DELIV_DUE_DT', 'CON_DELIV_DUE_LT',
'CON_POSTPONED_DT', 'CON_DELIV_PLAN_DT', 'CON_INTL_IN', 'PCE_NR',
'CON_TC_PCE_QT', 'CON_TC_GRS_WT', 'CON_TC_VL', 'PCE_OC_LN',
'PCE_OC_WD','PCE_OC_HT', 'PCE_OC_VL', 'PCE_OC_WT', 'PCE_OA_LN',
'PCE_OA_WD','PCE_OA_HT', 'PCE_OA_VL', 'PCE_OA_WT', 'COS_EVENT_TD',
'COS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID)', 'BUL_CSYS_ID_OCC',
'PCE_NR.1', 'PCS_EVENT_TD', 'PCS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID).1', 'BUL_CSYS_ID_OCC.1',
'BUL_CSYS_ID_1', 'BUL_CSYS_ID_2', 'BUL_CSYS_ID_3',
'BUL_CSYS_ID_4', 'BUL_CSYS_ID_5', 'BUL_CSYS_ID_6', 'BUL_CSYS_ID_7',
'BUL_CSYS_ID_8', 'BUL_CSYS_ID_9', 'BUL_CSYS_ID_10', 'BUL_CSYS_ID_11',
'BUL_CSYS_ID_12', 'BUL_CSYS_ID_13', 'BUL_CSYS_ID_14',
'BUL_CSYS_ID_15',
'BUL_CSYS_ID_16', 'CON_TOT_SECT_NR', 'DELAY'] )
df = pd.DataFrame(data.values ,columns=data.columns)
for col_name in df.columns:
if(df[col_name].dtype == 'object' and col_name != 'DELAY'):
df[col_name]= df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes
target_attribute = df['DELAY']
input_attribute=df.loc[:,'CON_ID':'CON_TOT_SECT_NR']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(input_attribute,target_attribute, test_size=0.3)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, y_train);
predictions = rf.predict(X_test)
errors = abs(predictions - y_test)
print(errors)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'result.')
mape = 100 * (errors / y_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
data_new = pd.read_csv('D:/Users/v477sjp/lpthw/Extract-0401- TestCurrentData-Null.csv', usecols=['CON_ID','CON_LEGACY_ID','CON_CREATE_TD','CON_CREATE_LT','BUL_CSYS_ID_ORIG', 'BUL_CSYS_ID_CORIG','BUL_CSYS_ID_DEST','BUL_CSYS_ID_CLEAR','TOP_ID',
'CON_DG_N', 'PTP_ID', 'SMO_ID_1',
'SMO_ID_8', 'LOB_ID', 'PRG_ID', 'PSG_ID', 'SMP_ID', 'COU_ISO_ID_ORIG',
'COU_ISO_ID_DEST', 'CON_DELIV_DUE_DT', 'CON_DELIV_DUE_LT',
'CON_POSTPONED_DT', 'CON_DELIV_PLAN_DT', 'CON_INTL_IN', 'PCE_NR',
'CON_TC_PCE_QT', 'CON_TC_GRS_WT', 'CON_TC_VL','PCE_OC_LN','PCE_OC_WD',
'PCE_OC_HT', 'PCE_OC_VL', 'PCE_OC_WT', 'PCE_OA_LN', 'PCE_OA_WD',
'PCE_OA_HT', 'PCE_OA_VL', 'PCE_OA_WT', 'COS_EVENT_TD', 'COS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID)', 'BUL_CSYS_ID_OCC',
'PCE_NR.1','PCS_EVENT_TD', 'PCS_EVENT_LT',
'((XSF_ID||XSS_ID)||XSG_ID).1', 'BUL_CSYS_ID_OCC.1',
'BUL_CSYS_ID_1', 'BUL_CSYS_ID_2', 'BUL_CSYS_ID_3',
'BUL_CSYS_ID_4', 'BUL_CSYS_ID_5', 'BUL_CSYS_ID_6', 'BUL_CSYS_ID_7',
'BUL_CSYS_ID_8', 'BUL_CSYS_ID_9', 'BUL_CSYS_ID_10', 'BUL_CSYS_ID_11',
'BUL_CSYS_ID_12', 'BUL_CSYS_ID_13', 'BUL_CSYS_ID_14','BUL_CSYS_ID_15',
'BUL_CSYS_ID_16', 'CON_TOT_SECT_NR', 'DELAY'] )
df_new = pd.DataFrame(data_new.values ,columns=data_new.columns)
for col_name in df_new.columns:
if(df_new[col_name].dtype == 'object' and col_name != 'DELAY' ):
df_new[col_name]= df_new[col_name].astype('category')
df_new[col_name] = df_new[col_name].cat.codes
X_test_new=df_new.loc[:,'CON_ID':'CON_TOT_SECT_NR']
y_pred_new = rf.predict(X_test_new)
df_new['Delay_1']=y_pred_new
df_new.to_csv('prediction_new.csv')
The predictions for the new dataset are wrong and their accuracy is very low, while the accuracy on the original data is 99%. I should be getting negative values for the new dataset, but all the values I get are positive. Please help.
It seems like the model is overfitted to the training dataset. Some options to try:
1) Use a larger dataset.
2) Reduce the number of features/columns, or do some feature engineering.
3) Use regularization.
4) If the dataset is not too large, try decreasing the number of estimators in the random forest.
5) Tune other parameters such as max_features, max_depth, min_samples_split and min_samples_leaf.
Hope this helps
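A quick way to check for overfitting, and to see the effect of the parameters listed above, is to compare the score on the training data with the score on held-out data. The sketch below does this on synthetic regression data; the data and the specific parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the real dataset.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical settings: library defaults versus a more constrained forest.
candidates = {
    "default": {},
    "constrained": {"max_depth": 5, "min_samples_leaf": 5, "max_features": 0.5},
}

results = {}
for name, params in candidates.items():
    rf = RandomForestRegressor(n_estimators=100, random_state=42, **params)
    rf.fit(X_train, y_train)
    # R^2 on the data the model has seen versus data it has not seen.
    results[name] = {
        "train_r2": rf.score(X_train, y_train),
        "test_r2": rf.score(X_test, y_test),
    }
    print(name, results[name])
```

A training score far above the held-out score suggests the model is memorizing the training set. Whether the constrained settings help depends on the data, so treat the values as a starting point for a search rather than a recommendation.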
I have a comma-separated CSV file with two numerical columns, inputs and outputs. They are related by a more or less linear function (see the data below). The sample I have is very small.
Below is the Python code I wrote using sklearn to predict values. Somehow it does not give me reasonable predictions. I am quite new to this, so please bear with me.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Data.
89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259
The problem is a misunderstanding of what your own code does.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
Up to here, you have loaded the dataframe and then separated X and y from the dataset.
labels represents the y values.
train1 represents the x values.
You wrote that you do not understand this line: train1 = data.drop(['kg'], axis=1)
Let me explain. The dataframe consists of the two columns 'kg' and 'cm'. This line removes the 'kg' column (axis=1 means column, axis=0 means row), so only 'cm' remains, which is your x.
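The effect of the axis argument can be seen on a tiny dataframe built from the first three rows of the data (the variable names below are illustrative):

```python
import pandas as pd

data = pd.DataFrame({"kg": [89, 86, 82.5], "cm": [155, 161, 168]})

# axis=1 drops a column by name: 'kg' is removed and 'cm' is kept.
features = data.drop(["kg"], axis=1)
# axis=0 drops a row by index label: the row labeled 0 is removed.
fewer_rows = data.drop([0], axis=0)

print(list(features.columns))
print(list(fewer_rows.index))
# drop returns a new dataframe; the original is left unchanged.
print(list(data.columns))
```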
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Now you train the model on x values that represent 'cm' and y values that represent 'kg'.
When you call predict(80), you are passing a 'cm' value of 80. Let me just plot 'cm' vs 'kg' for the training data.
An input of 80 lies further left than anything in your plot. As you can see, y increases as x decreases, that is, 'kg' goes up as 'cm' goes down. Hence the output is around 110, which is higher than any 'kg' value in the data. To predict 'cm' from 'kg' instead, swap the roles of the two columns:
from io import StringIO
input_data = StringIO("""89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259""")
import pandas as pd
data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1) #This is similar to selecting the kg column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.
I think the problem is the small data size. The code flow looks normal to me. I would suggest computing the p-value for the input-output relationship; it tells you whether the correlation found by your linear regression is significant (p-value < 0.05).
You can find the p-value using:
from scipy.stats import linregress
print(linregress(input, output))
scikit-learn's LinearRegression does not report a p-value, so to get one with scikit-learn you would probably need to compute it from the formula yourself. Good luck.
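As a sketch of the linregress suggestion, the snippet below runs it on the thirteen data points from the question, with kg as the input and cm as the output (that choice of roles is an assumption). The result object exposes the slope, intercept, correlation coefficient and p-value as attributes.

```python
from scipy.stats import linregress

# The thirteen (kg, cm) pairs listed in the question.
kg = [89, 86, 82.5, 79.25, 76.25, 73, 70, 66.66, 63.5, 60.25, 57, 54, 51]
cm = [155, 161, 168, 174, 182, 189, 198, 207, 218, 229, 241, 257, 259]

result = linregress(kg, cm)

# Fitted line: cm = slope * kg + intercept.
print(result.slope, result.intercept)
# Correlation coefficient and the p-value for the null hypothesis of zero slope.
print(result.rvalue, result.pvalue)

# Prediction for an input of 80 kg using the fitted line.
print(result.slope * 80 + result.intercept)
```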