Multiclass Classification and probability prediction - python

import pandas as pd
import numpy
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
fi = "df.csv"
# Read the data from the CSV file
data = pd.read_csv(fi, sep=",")
# split the data into training and test data
train, test = train_test_split(data, test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
train_features = train.iloc[:, 0:127]
train_label = train.iloc[:, 127]
test_features = test.iloc[:, 0:127]
test_label = test.iloc[:, 127]
naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)
print("test_data\n", test_data["p_malw"])
print("Accuracy:", naive_b.score(test_features, test_label))
I have written this code to accept input from a CSV file with 128 columns, where the first 127 columns are features and the 128th column is the class label.
I want to predict the probability that each sample belongs to each class (there are 5 classes, 1-5), print the result in the form of a matrix, and determine the class of each sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.

GaussianNB.predict_proba returns, for every sample, the probability of each class in the model. In your case, it should return a result with five columns and the same number of rows as your test data. You can check which column corresponds to which class using naive_b.classes_. So it is not clear why you say this is not the desired output. Perhaps your problem comes from assigning the output of predict_proba to a single data frame column. Try:
pred_prob = naive_b.predict_proba(test_features)
instead of
test_data["p_malw"] = naive_b.predict_proba(test_features)
and verify its shape using pred_prob.shape. The second dimension should be 5.
If you want the predicted label for each sample, you can use the predict method, followed by a confusion matrix to see how many labels have been predicted correctly.
from sklearn.metrics import confusion_matrix
naive_b.fit(train_features, train_label)
pred_label = naive_b.predict(test_features)
confusion_m = confusion_matrix(test_label, pred_label)
print(confusion_m)
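As a minimal sketch of the points above, the following builds the probability matrix and derives the predicted class from it. The data is synthetic (make_classification is used as a stand-in for df.csv, which is an assumption), and the names proba_df and pred_from_proba are illustrative rather than part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data: 5 classes labelled 1-5 (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, random_state=0)
y = y + 1  # shift labels from 0-4 to 1-5

# stratify=y keeps every class present in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=0, stratify=y)

naive_b = GaussianNB().fit(X_train, y_train)

# One row per test sample, one column per class, in the order of classes_.
pred_prob = naive_b.predict_proba(X_test)
proba_df = pd.DataFrame(pred_prob, columns=naive_b.classes_)

# The predicted class is the one with the highest probability in each row.
pred_from_proba = naive_b.classes_[np.argmax(pred_prob, axis=1)]

print(pred_prob.shape)
print(proba_df.head())
print(pred_from_proba[:10])
```

Labelling the columns with naive_b.classes_ avoids guessing which column belongs to which class, and the argmax over each row should agree with what naive_b.predict returns.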
Here is some useful reading.
sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba
sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Related

Returning a trained scikit learn (random forest) model from a function?

I am training a random forest model and have found that returning the trained model object from a function consistently results in different .predict behavior. Is this intended or not?
I think this is completely reproducible code. Input data is just 1000 rows of 6 columns of floats:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
def as_a_function():
    df = pd.read_csv() # read file
    lcscols = #just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    selcol = #y 'real' data
    train_df = df.sample(frac=testsize,random_state=42)
    test_df = df.drop(train_df.index) #test/train split
    rfmodel, fitvals_mid = RF_model(train_df,test_df,selcol, lcscols)
    tempdf = df.copy(deep=True) # new copy, not totally necessary but helpful in edge cases
    tempdf.dropna(inplace=True)
    selcolname = selcol + '_cal'
    mid_cal = pd.DataFrame(data=rfmodel.predict(tempdf[lcscols]),index=tempdf.index,columns=[selcolname])
    #new df just made from a .predict call
    # note that input order of columns matters, needs to be identical to training order??
def RF_model(train_df, test_df, ycol, xcols):
    rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
    rfmodel = rf.fit(train_df[xcols], train_df[ycol])
    y_pred_test = rfmodel.predict(test_df[xcols])
    #missing code to test predicted values of testing set
    return rfmodel
#################################
def inline():
    df = pd.read_csv() # read file
    lcscols = #just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    refcol = #'true' data
    X = df[lcscols].values
    y = df[[refcol]].values
    x_train,x_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
    ramp = rf.fit(x_train, y_train.flatten())
    y_pred_test = ramp.predict(x_test)
    #missing code to check prediction on test values
    tempdf = df.copy(deep=True)[lcscols]
    tempdf.dropna(axis=1,how='all',inplace=True)
    tempdf.dropna(axis=0,inplace=True)
    df_cal = pd.DataFrame(data=ramp.predict(tempdf),index=tempdf.index,columns=['name'])
    return df_cal
The problem is that rfmodel.predict(tempdf[lcscols]) produces different output than ramp.predict(tempdf).
I imagine the results will differ somewhat, since pd.DataFrame.sample will not produce exactly the same split as train_test_split, but the r^2 value is 0.98 when .predict is called on the trained model in the same function, compared to r^2 = 0.5 when .predict is called on a returned model object. That seems far too large a difference to attribute to a different split method?
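Try calling np.random.seed(42) before you call the method (make sure you have imported numpy as np first). Without a fixed seed, any step that draws from NumPy's global random state uses different random values on every run; once you set np.random.seed(42), every run uses the same random values. Note that RandomForestRegressor(random_state=42) and df.sample(random_state=42) in your code are already seeded, and predict on a fitted forest is deterministic, so the seed only affects steps that rely on the global random state.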
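One way to check whether returning the model from a function matters at all is to fit the same estimator on the same rows both ways and compare the predictions. The sketch below does that on synthetic data (make_regression with 1000 rows and 6 columns stands in for the CSV file, which is an assumption); fit_in_function is a hypothetical helper, and n_estimators is reduced to 50 only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def fit_in_function(X_train, y_train):
    """Hypothetical helper: fit a forest and return the fitted model."""
    rf = RandomForestRegressor(n_estimators=50, random_state=42)
    return rf.fit(X_train, y_train)


# Synthetic stand-in for the real data (assumption): 1000 rows, 6 float columns.
X, y = make_regression(n_samples=1000, n_features=6, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model fitted inside a function and returned to the caller.
returned_model = fit_in_function(X_train, y_train)

# Same estimator fitted inline on the same training rows.
inline_model = RandomForestRegressor(n_estimators=50, random_state=42)
inline_model.fit(X_train, y_train)

same = np.allclose(returned_model.predict(X), inline_model.predict(X))
print("Predictions identical:", same)
```

If the two sets of predictions match, a large gap in r^2 between two pipelines is more likely to come from differences in which rows and columns each one trains and predicts on than from returning the model object.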

Need to print the column or row that contains my predicted label

I am using a Decision Tree algorithm to predict the label from a test file. However, I need to print the complete row or the single cell that contains that label. The code I am working on is shown below.
import numpy as np
import csv
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
path = "train_names.csv"
file=open(path)
reader = csv.reader(file)
data = np.asarray(list(reader))
#train data
names_train=data[1:,[0,1,2,3,4]]
label_train=data[1:,[5]]
#test data (data1 is assumed to be loaded from the test file in the same way as data)
names_test=data1[1:,[0,1,2,3,4]]
label_test=data1[1:,[5]]
decisionTreeClassifier = DecisionTreeClassifier()
decisionTreeClassifier.fit(names_train,label_train)
predictions = decisionTreeClassifier.predict(names_test)
print("Accuracy: ",accuracy_score(label_test,predictions))
for i in range(0,len(names_test)):
    print (predictions[i])
Do you mean that you want to print the output of your predictions? If so, you can simply call print(predictions), or you can use sklearn's confusion_matrix by importing it:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(label_test, predictions)
print(cm)
# rows (Y-axis) are the true labels, columns (X-axis) are the predicted labels
You can also print the model's score like this:
print(decisionTreeClassifier.score(names_test, label_test))
You just have to iterate over both the input values (names_test) and the predicted outputs (predictions). Try this!
predictions = decisionTreeClassifier.predict(names_test)
for x,y_pred in zip(names_test, predictions):
    print(x,y_pred)
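If the goal is to keep each full row next to its predicted label, another option is to collect everything in a pandas DataFrame. This is only a sketch on synthetic data: the feature names f0 to f4 and the column names true_label and predicted_label are made up for illustration and are not from the original files.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train/test files (assumption): 5 feature columns.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical column names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

# One row per test sample: the input values, the true label, the prediction.
results = pd.DataFrame(X_test, columns=feature_names)
results["true_label"] = y_test
results["predicted_label"] = predictions

print(results.head())

# Only the rows that were predicted as class 1.
print(results[results["predicted_label"] == 1].head())
```

With the rows in a DataFrame, filtering by predicted label or writing the result to a file with results.to_csv becomes a one-liner.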

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine the number of unique categories and the number of cases for each category
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Training on 75% of the data, testing on 25%
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using, but I am trying to predict 'size_womenswear'. There are 8 different sizes, which I have label-encoded, and I have moved this column to the end of the dataframe, so y is the dependent variable and x holds the independent variables (all the other columns).
I am using a Gaussian Naive Bayes classifier to try to classify the 8 different sizes, testing on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
I can't comment, so I'm just throwing out some ideas.
Maybe you need to deal with class imbalance and try another model that fits the data better. Try the xgboost or lightgbm packages; given good data they usually perform well, but it really depends on the data.
Also, given the way you split train and test, do the resulting train and test sets have a similar distribution of your Y? That's very important.
Last thing: for classification models, measuring performance can be a bit tricky, so try some other metrics, such as F1 scores, or draw a confusion matrix and see how your predictions compare with Y. Perhaps your model is predicting everything as one or just a few classes.
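The checks suggested above can be sketched as follows. The data is synthetic and deliberately imbalanced (8 classes generated with make_classification, which is an assumption standing in for uc_data_train.csv); the sketch uses a stratified split, compares the class distribution of the two splits, and reports a confusion matrix and macro F1 alongside accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced 8-class data (assumption, not the real data set).
X, y = make_classification(
    n_samples=4000, n_features=12, n_informative=8, n_classes=8,
    n_clusters_per_class=1,
    weights=[0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05],
    random_state=0)

# stratify=y keeps the class proportions the same in train and test.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

train_props = np.bincount(y_train) / len(y_train)
test_props = np.bincount(y_test) / len(y_test)
print("Train class proportions:", np.round(train_props, 3))
print("Test class proportions: ", np.round(test_props, 3))

model = GaussianNB().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(cm)
print("Macro F1:", round(macro_f1, 3))
print(classification_report(y_test, y_pred, zero_division=0))
```

A confusion matrix whose mass sits in one or two columns would indicate that the model is collapsing most samples into a few classes, which accuracy alone does not reveal.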

How to predict after training data using naive bayes with python?

I have got a dataset which contains just two useful columns for training my model, first is news heading and the second is category of news.
So, I got the following training command running successfully using python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W',' ',s)
    s = re.sub(r'\W\s',' ',s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+',' ',s)
    return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is: how can I give the program a new set of data (e.g. just news headings) and have it predict the news category using sklearn in Python?
P.S. My training data is like:
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on accuracy, you can do the following:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
All the available metrics are listed in the sklearn.metrics documentation.
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To map these values back to the original labels, you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
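For headlines the model has never seen, the new text has to go through the same fitted vectorizer before calling predict. The sketch below shows this on a tiny made-up corpus: the headlines, the category codes, and the name new_titles are illustrative assumptions rather than the real newsCorpora.csv data. In the original code, new headlines would also need to pass through normalize_text first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Tiny made-up training set (assumption): 'b' = business, 't' = technology,
# 'e' = entertainment.
titles = [
    "stocks rally as markets close higher",
    "bank profits fall amid rate cuts",
    "new smartphone launches with faster chip",
    "software update fixes security bug",
    "film wins top award at festival",
    "singer announces world tour dates",
]
categories = ["b", "b", "t", "t", "e", "e"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(titles)

encoder = LabelEncoder()
y = encoder.fit_transform(categories)

nb = MultinomialNB().fit(x, y)

# New, unseen headlines: use transform (not fit_transform) so the columns
# match the vocabulary learned during training.
new_titles = ["markets fall as bank stocks slide",
              "new chip makes smartphone faster"]
x_new = vectorizer.transform(new_titles)

# Map the numeric predictions back to the original category labels.
predicted = encoder.inverse_transform(nb.predict(x_new))
for title, label in zip(new_titles, predicted):
    print(label, "<-", title)
```

Calling fit_transform on the new headlines would build a different vocabulary, and the resulting matrix would no longer line up with the features the classifier was trained on.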
You are very close; you just need two more lines of code. This link explains Naive Bayes using scikit-learn:
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below. Import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy:
print(accuracy_score(y_test, preds))

Time Series Classification

You can access the data set at this link: https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing
My task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter; I only want to know whether it's going up or down. So I define it as a classification problem.
Since this data set is time-series data, I ran into many problems. I have read articles about these problems, for example that I can't use k-fold cross-validation with time-series data because you can't ignore the order of the data.
My code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
lag1 = pd.read_csv('YOURLOCALFILEPATH', parse_dates=['Date']) # replace with your local file path
#Trend: True if the price went up from the previous row, otherwise False
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()
train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]
variable_to_use= ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]
#SVM Lag1
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
#Check prediction results
clf.predict(X_test)
First of all, is my method right here: first generating a column of True and False values? I am afraid the model can't make sense of this column if I simply feed it in. Should I first perform a regression and then compare the numeric results to generate a list of ups and downs?
The accuracy on the training set is very low, at 0.58. clf.predict(X_test) returns an array of all True values, and I don't know why.
I also don't know how the resulting accuracy is calculated. For example, I think my current accuracy only counts the number of True and False values while ignoring their order? Since this is time-series data, ignoring the order is not right and gives me no information about predicting price movement. Let's say I have 40 examples in the test set and I get 20 Trues; I would get 50% accuracy. But I guess the Trues are not in the right positions compared to the ground truth set. (Tell me if I am wrong.)
I am also considering using a Gradient Boosted Tree to do the classification. Would it be better?
Some preprocessing of this data would probably be helpful. Step one might go something like:
df = pd.read_csv('YOURLOCALFILEPATH',header=0)
#more code than your method but labels rows as 0 or 1 and easy to output to new file for later reference
df['Date'] = pd.to_datetime(df['date'], unit='d')
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] = np.where(df['XLF']>df['compare'], 1, 0) # 1 if XLF is higher than in the next row, else 0
df.drop('compare', axis=1, inplace=True)
Step two can use one of sklearn's built-in scalers, such as MinMaxScaler, to scale your feature inputs before feeding them into your model.
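A minimal sketch of that scaling step is shown below, assuming a chronological 50/50 split like the one in the question. The feature names are borrowed from the question's variable_to_use list, but the values are random placeholders, not the real data set. The scaler is fitted on the training rows only, so no information from the later (test) period leaks into the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Random placeholder data (assumption) indexed by date, oldest first.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame(
    {
        "rGDP": rng.normal(2.0, 0.5, n),
        "interest_rate": rng.normal(1.5, 0.3, n),
        "VXX": rng.normal(30.0, 5.0, n),
    },
    index=pd.date_range("2015-01-01", periods=n, freq="D"),
)
features = ["rGDP", "interest_rate", "VXX"]

# Chronological split: the first half trains, the second half tests.
train_size = int(len(df) * 0.5)
train, test = df.iloc[:train_size], df.iloc[train_size:]

# Fit the scaler on the training period only, then apply it to both parts.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(train[features])
X_test = scaler.transform(test[features])

print(X_train.min(axis=0), X_train.max(axis=0))
print(X_test.min(axis=0), X_test.max(axis=0))
```

Because the scaler only sees the training period, scaled test values can fall slightly outside the [0, 1] range; that is expected and preferable to fitting the scaler on the full series.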
