Interpret results Regression - python

I built an ordinal regression model (my first time doing regression, so be merciful) and now I need to evaluate it. What would be the best way? (I use the mord package for the ordinal regression.)
These are the tasks I am trying to complete:
3) Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.
4) Having the vectors from Question 3, perform
dimensionality reduction (either PCA or NMF). Can you conclude how
many components you can keep? Experiment with this parameter and
justify your final conclusion.
This is my code for it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import textblob
import nltk
from pandas import ExcelWriter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from textblob import Word
from collections import Counter
import seaborn as sns
import mord as m
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
%matplotlib inline
df = # import dataframe from link
# Clean up Rating: while cleaning by hand I saw values outside the [0,5] range that need correcting. A histogram would also have revealed this, but since I had already spotted it while going through the data, plotting felt like an unnecessary step.
df.loc[df.Rating > 5, 'Rating'] = np.NaN
df.loc[df.Rating < 1, 'Rating'] = np.NaN
# Convert weights to same measure (pounds). Most of the weights I inspected seem wrong...
for i in range(0, df.weight.size-1):
    cell = df.weight[i]
    while (cell == 0 and i < df.weight.size-1):
        i += 1
        cell = df.weight[i]
    if not(isinstance(cell, float)) and not(isinstance(cell, int)):
        number = ''.join([x for x in cell if (x.isdigit() or x=='.')])
        num = float(number)
        if bool(re.search('ounces', cell)):
            df.loc[i, 'weight'] = num * 0.0625 # Ounces to pounds conversion
        else:
            df.loc[i, 'weight'] = num # Introduce only number (without measure type)
df.loc[:, "Review"] = df["Title"] + str(' - ') + df["Text"]
df.drop('Title', axis=1, inplace=True)
df.drop('Text', axis=1, inplace=True)
df.columns = ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']
df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
df['Brand'] = df['Brand'].astype(str)
df['Review'] = df['Review'].astype(str)
df['Name'] = df['Name'].astype(str)
d = {'Brand':'first',
     'NumsHelpful':'mean',
     'Rating':'mean',
     'Weight(Pounds)':'first',
     'Review':'/'.join,
    }
df = df.groupby('Name').agg(d).reset_index()
df.Rating = df.Rating.round()
df.NumsHelpful = df.NumsHelpful.round()
df['Review2'] = df['Review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['Review2'] = df['Review2'].str.replace(r'[^\w\s]', '', regex=True)
stop = stopwords.words('english')
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[:20]
common = ['wine', 'mix', 'taste', 'drink', 'one', 'price', 'product', 'flavour', 'would', 'bitters', 'bottle', 'buy','really', 'make']
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in common))
freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[-10:]
freq = list(freq.index)
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
df['words'] = df.Review2.str.strip().str.split('[\W_]+')
df['Review2'] = df['words'].apply(lambda x: " ".join([Word(word).lemmatize('v') for word in x]))
df['Review2'].str.split(expand=True).stack().value_counts()
# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)
# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]
# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', axis=1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)
# Do PCA (dimensionality reduction)
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(x_train)
# Apply transform to both the training set and the test set.
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
# Make an instance of the Model
pca = PCA(.95)
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
regr.fit(x_train, y_train)
scores = cross_val_score(regr, x_train, y_train, cv=10, scoring='accuracy') # score on the PCA-transformed training data, not the raw bow matrix
What are your thoughts on the above code?
Any insight is greatly appreciated!
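For an ordinal target, plain accuracy treats a prediction of 4 for a true 5 the same as a prediction of 1. One possible way to evaluate the model is to also report how far off the predictions are. The sketch below is only an illustration: the function name `ordinal_report` and the ratings are made up, not results from the dataset above.

```python
import numpy as np

def ordinal_report(y_true, y_pred):
    """Return exact accuracy, within-one accuracy and mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    return {
        "accuracy": float(np.mean(abs_err == 0)),    # exact hits
        "within_one": float(np.mean(abs_err <= 1)),  # off by at most one rating step
        "mae": float(np.mean(abs_err)),              # average distance in rating steps
    }

# Hypothetical ratings on a 1-5 scale, for illustration only.
y_true = [5, 4, 4, 3, 1, 5, 2, 4]
y_pred = [5, 5, 4, 3, 3, 4, 2, 4]
report = ordinal_report(y_true, y_pred)
print(report)
```

With real data, `y_pred` would come from the fitted model's predictions on the held-out test set.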
EDIT:
This is a link to the dataset
This is a link to a google.doc containing the source code (Python)
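For task 4, a common way to decide how many components to keep is to look at the cumulative explained-variance ratio, which is also what `PCA(.95)` uses internally to pick the number of components. Below is a rough sketch with plain NumPy on synthetic data. The helper name `components_for_variance` and the data are assumptions for illustration, not part of the original code.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)                           # centre each column
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    explained = singular_values ** 2                  # proportional to component variances
    cumulative = np.cumsum(explained / explained.sum())
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, cumulative

# Synthetic data: 10 columns driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

k, cumulative = components_for_variance(X, threshold=0.95)
print("components kept:", k)
print("cumulative ratio:", np.round(cumulative[:4], 4))
```

Plotting `cumulative` against the component count and repeating the regression for a few values of `k` is one way to justify the final choice.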

Related

Create a dummy variable in a cycle (loop)

I'm working on a dataset with 22 columns and 129 rows.
I'm using a Support Vector Machine to predict my dependent variable.
To do this, I convert the variable into a dummy that takes the values 0 and 1:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
Now, my question is:
I want to generate this dummy in a loop, for example:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 12 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 5 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 8 else 0)
and so on. I want to try different cut-offs for my variable (<12, <5, <8) and have the SVM fitted and evaluated on each of them.
Full code:
import pandas as pd # pandas is used to load and manipulate data and for One-Hot Encoding
import numpy as np # data manipulation
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.model_selection import train_test_split # split data into training and testing sets
from sklearn import preprocessing # scale and center data
from sklearn.svm import SVC # this will make a support vector machine for classification
from sklearn.model_selection import GridSearchCV # this will do cross validation
from sklearn.metrics import confusion_matrix # this creates a confusion matrix
from sklearn.metrics import plot_confusion_matrix # draws a confusion matrix
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn import svm, datasets
datafile = (r'C:\Users\gpont\PycharmProjects\pythonProject2\data\Map\databaseCDP0.csv')
df = pd.read_csv(datafile, skiprows = 0, sep=';')
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
#Splitting data in two datasets
df_lowr = df[df['dummy_medianrat'] == 1]
df_higr = df[df['dummy_medianrat'] == 0]
df_downsample = pd.concat([df_lowr, df_higr])
len(df_downsample)
X = df_downsample.drop('dummy_medianrat', axis=1).copy()
X.head()
y = df_downsample['dummy_medianrat'].copy()
y.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
test_size=0.25)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.shape
X_test.shape
#Build A Preliminary Support Vector Machine
#We don't need to scale y_train because it is 0/1 (binary classification)
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(clf_svm, X_test_scaled, y_test,
                                 display_labels=["Did not default", "Defaulted"],
                                 cmap=plt.cm.Blues,
                                 normalize=normalize)
    disp.ax_.set_title(title)
    print(title)
    print(disp.confusion_matrix)
After creating several dummies with different thresholds, I want to generate two confusion matrices (normalized and not) for each dummy, in a loop.
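The loop described above could look roughly like the sketch below. It uses only NumPy, made-up data, and a trivial stand-in rule where the real script would fit and apply the SVC. The helpers `confusion_2x2` and `normalize_rows` are hypothetical names, not scikit-learn functions.

```python
import numpy as np

def confusion_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def normalize_rows(cm):
    """Divide each row by its total (the same idea as normalize='true')."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float), where=totals > 0)

# Hypothetical data: one feature that loosely tracks the rating.
median_rating = np.array([3, 4, 6, 7, 9, 11, 12, 13, 14, 15])
feature = np.array([0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.6, 0.9, 0.8, 1.0])

results = {}
for threshold in (5, 8, 12):
    y = (median_rating < threshold).astype(int)   # the dummy for this cut-off
    # Stand-in classifier: predict 1 when the feature is below its median.
    # In the real script this is where the SVC would be fit and used to predict.
    y_pred = (feature < np.median(feature)).astype(int)
    cm = confusion_2x2(y, y_pred)
    results[threshold] = (cm, normalize_rows(cm))
    print(f"dummy = 1 if median_rating < {threshold}")
    print(cm)
    print(results[threshold][1])
```

Keeping the results in a dictionary keyed by threshold makes it easy to compare the cut-offs afterwards.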

Why is my accuracy_score metric incorrect? scikit learn

I have somewhat working code, which is giving me trouble. I seem to get an almost random accuracy_score metric, whereas my printout of predicted values suggests otherwise. I was following this tutorial online and here is what I have written so far:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
adult_train = pd.read_csv(os.path.expanduser("~/Desktop/") + "adult_train_srt.csv", sep=',')
print(adult_train.head(100))
le = LabelEncoder()
adult_train['age'] = le.fit_transform(adult_train['age'])
adult_train['workclass'] = le.fit_transform(adult_train['workclass'].astype(str))
adult_train['education'] = le.fit_transform(adult_train['education'].astype(str))
adult_train['occupation'] = le.fit_transform(adult_train['occupation'].astype(str))
adult_train['race'] = le.fit_transform(adult_train['race'].astype(str))
adult_train['sex'] = le.fit_transform(adult_train['sex'].astype(str))
adult_train['hours_per_week'] = le.fit_transform(adult_train['hours_per_week'])
adult_train['native_country'] = le.fit_transform(adult_train['native_country'].astype(str))
adult_train['classs'] = le.fit_transform(adult_train['classs'].astype(str))
cols = [col for col in adult_train.columns if col not in ['classs']]
data = adult_train[cols]
target = adult_train['classs']
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size = 0.1) #, random_state = 42)
gnb = GaussianNB()
pred = gnb.fit(data_train, target_train).predict(data_test)
pred_gnb = gnb.predict(data_test)
print(pred_gnb)
print("Naive-Bayes accuracy: (TN + TP / ALL) ", accuracy_score(pred_gnb, target_test)) #normalize = True
print("""Confusion matrix:
TN - FP
FN - TP
Guessed:
0s +, 1s -
0s -, 1s +
""")
print(confusion_matrix(target_test, pred_gnb))
Prediction = pd.DataFrame({'Prediction':pred_gnb})
result = pd.concat([adult_train, Prediction], axis=1)
print(result.head(10))
I am at a loss, I have no way of understanding whether or not my dataframe concatenation is working or if the accuracy_score metric is solving something else, because I get outputs like so:
In this particular instance it says there are 7 true negatives (OK), 1 false positive (???), 2 false negatives (OK), and 0 true positives (???, but wasn't one guessed correctly?). The [classs] column is what the [Prediction] column is trying to predict.
result = pd.concat([adult_train, Prediction], axis=1)
Here the Prediction dataframe should not be concatenated with adult_train:
Prediction holds the predictions for the test set, data_test
pred_gnb = gnb.predict(data_test)
So I think you should concatenate data_test, target_test and the predictions instead. Give the predictions the same index as data_test, otherwise concat lines the rows up by index label and they end up misaligned:
result = pd.concat([data_test, target_test, pd.Series(pred_gnb, index=data_test.index, name='Prediction')], axis=1)
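One thing to watch for: `pd.concat(..., axis=1)` matches rows by index label, not by position, and `train_test_split` keeps the original (shuffled) index on the test rows. The small sketch below uses a made-up frame standing in for adult_train and made-up predictions to show the difference.

```python
import pandas as pd

# Hypothetical frame standing in for adult_train.
df = pd.DataFrame({"age": [25, 40, 31, 58, 22], "classs": [0, 1, 0, 1, 0]})

# Pretend rows 3 and 1 ended up in the test split (shuffled order, original index kept).
data_test = df.loc[[3, 1], ["age"]]
target_test = df.loc[[3, 1], "classs"]
pred = [1, 1]  # made-up predictions, in the same row order as data_test

# Misaligned: the new frame gets index 0..1, so concat pairs rows by the wrong labels.
misaligned = pd.concat([data_test, pd.DataFrame({"Prediction": pred})], axis=1)

# Aligned: reuse the test index so each prediction sits next to its own row.
aligned = pd.concat(
    [data_test, target_test, pd.Series(pred, index=data_test.index, name="Prediction")],
    axis=1,
)
print(misaligned)
print(aligned)
```

The misaligned frame gains extra rows filled with NaN, which is the kind of output that makes the accuracy look wrong even when `accuracy_score` itself is fine.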

How can I return accuracy rates for Top N predictions using sklearn's SGDClassifier?

I am trying to modify the results in this post (How to get Top 3 or Top N predictions using sklearn's SGDClassifier) to return the accuracy rate; however, I am getting an accuracy rate of zero and I can't figure out why. Any thoughts or edits would be much appreciated! Thank you.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model
arr=['dogs cats lions','apple pineapple orange','water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names()
Y = ['animals', 'fruits', 'elements','chemicals']
T=["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log')
clf.fit(X,Y)
x=clf.predict(test)
def top_n_accuracy(probs, test, n):
    best_n = np.argsort(probs, axis=1)[:,-n:]
    ts = np.argmax(test, axis=1)
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i,:]:
            successes += 1
    return float(successes)/ts.shape[0]
n=2
probs = clf.predict_proba(test)
top_n_accuracy(probs, test, n)
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model
arr=['dogs cats lions','apple pineapple orange','water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names()
Y = ['animals', 'fruits', 'elements','chemicals']
T=["eating apple roasted in fire and enjoying fresh air", "I love orange"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log')
clf.fit(X,Y)
x=clf.predict(test)
n=2
probs = clf.predict_proba(test)
topn = np.argsort(probs, axis = 1)[:,-n:]
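Here I introduce the ground-truth label vector. These must be numeric indices into clf.classes_ (which is sorted: animals=0, chemicals=1, elements=2, fruits=3), because that is the column order of predict_proba. I assumed the first test example belongs to elements and the second to fruits.
y_true = np.array([2, 3])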
This should then compute your accuracy
np.mean(np.array([1 if y_true[k] in topn[k] else 0 for k in range(len(topn))]))
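If you would rather not map the labels to indices by hand, a variant is to translate the top-n column indices back to labels through the classifier's `classes_` array and compare against the true labels directly. This is a sketch with made-up probabilities, not output from the model above. The `classes` array plays the role of `clf.classes_`.

```python
import numpy as np

def top_n_accuracy(probs, y_true, classes, n):
    """Share of rows whose true label is among the n most probable classes.

    probs   : (n_samples, n_classes) array, e.g. from predict_proba
    y_true  : true labels, in the same order as the rows of probs
    classes : label for each column of probs, e.g. clf.classes_
    """
    probs = np.asarray(probs)
    classes = np.asarray(classes)
    top_n_labels = classes[np.argsort(probs, axis=1)[:, -n:]]
    hits = (top_n_labels == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())

# Made-up probabilities for two test rows; columns follow `classes`.
classes = np.array(["animals", "chemicals", "elements", "fruits"])
probs = np.array([
    [0.10, 0.05, 0.45, 0.40],
    [0.30, 0.35, 0.25, 0.10],
])
y_true = ["elements", "animals"]

top1 = top_n_accuracy(probs, y_true, classes, n=1)
top2 = top_n_accuracy(probs, y_true, classes, n=2)
print(top1)  # 0.5
print(top2)  # 1.0
```

The comparison broadcasts each true label against its row of top-n labels, so no explicit Python loop is needed.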
I ended up figuring this one out, albeit a bit differently from the above...
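import numpy as np
import pandas as pd
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Set Data Location: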
data = 'top10000.csv'
# load the data
df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
df = df.dropna()
df = df[['CODE','DUTIES']] #select only these columns
#df = df.rename(index=float, columns={"CODE": "label", "DUTIES": "text"})
df = df.rename(columns={"CODE": "label", "DUTIES": "text"})
#Convert label to float so you don't need to encode for processing later on
df['label']=df['label'].str.replace('-', '',regex=True, case = False).str.strip()
df['label']=df['label'].str.replace('.', '', regex=False) # literal dot; as a regex, '.' would match any character
#df['label']=pd.to_numeric(df['label'])
df['label']=df['label'].str[1:].astype(int)
#df['label'].astype('float64', raise_on_error = True)
#split data into testing and training
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6)
#reset the index
valid_y = valid_y.reset_index(drop=True)
valid_x = valid_x.reset_index(drop=True)
# We will also copy the validation datasets to a dataframe to be able to merge later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)
# Extract features
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x)
X_test_counts = count_vect.transform(valid_x)
# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the top n labels on validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)
    #Identify the indexes of the top predictions
    top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]
    #then find the associated SOC code for each prediction
    top_class = classifier.classes_[top_n_predictions]
    #cast to a new dataframe
    top_class_df = pd.DataFrame(data=top_class)
    #merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_class_df, left_index=True, right_index=True)
    # Top 5 results conditions and choices
    top5_conditions = [
        (results.iloc[:,0] == results[0]),
        (results.iloc[:,0] == results[1]),
        (results.iloc[:,0] == results[2]),
        (results.iloc[:,0] == results[3]),
        (results.iloc[:,0] == results[4])]
    top5_choices = [1, 1, 1, 1, 1]
    # Fetch Top 1 Result
    top1_conditions = [(results.iloc[:,0] == results[4])]
    top1_choices = [1]
    # Create the success columns
    results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
    results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)
    #Print the QA
    print("Are Top 5 Results greater than Top 1 Result? (answer must be True): ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
    print("Are Top 1 Results equal from predict() and predict_proba()? (answer must be True): ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))
    print(" ")
    print("Details: ")
    print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
    #print("Top 5 Accuracy Rate (np.mean)= ", np.mean(np.array([1 if valid_y[k] in top_class[k] else 0 for k in range(len(top_class))])))
    print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = (predict)", metrics.accuracy_score(valid_y, predictions))
# Train and validate model from example data using the function defined above
TV_model(LogisticRegression(), X_train_counts, train_y, X_test_counts, valid_y_df, valid_x_df)
I'm sure it could be more computationally efficient, so any suggestion on how I could transform the accuracy rate calculation into a one liner like was suggested in the comments above would be much appreciated!

Why is my output dataframe shape not 1459 x 2 but 1460 x 2

Below is what I have done so far.
#importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
filepath = r"C:\Users...Kaggle data\house prediction iowa\house_predtrain (3).csv"
train = pd.read_csv(filepath)
print(train.shape)
filepath2 = r"C:\Users...Kaggle data\house prediction iowa\house_predtest (1).csv"
test = pd.read_csv (filepath2)
print(test.shape)
#first we replace all the NaNs with 0 in both the train and test data
train = train.fillna(0)
test = test.fillna(0) #error one
train.dtypes.value_counts()
#isolating all the object/categorical feature and converting them to numeric features
encode_cols = train.dtypes[train.dtypes == object]
encode_cols2 = test.dtypes[test.dtypes == object]
#print(encode_cols)
encode_cols = encode_cols.index.tolist()
encode_cols2 = encode_cols2.index.tolist()
print(encode_cols2)
# Do the one hot encoding
train_dummies = pd.get_dummies(train, columns=encode_cols)
test_dummies = pd.get_dummies(test, columns=encode_cols2)
#align your test and train data (error2)
train, test = train_dummies.align(test_dummies, join = 'left', axis = 1)
print(train.shape)
print(test.shape)
#Now working with Floats features
numericals_floats = train.dtypes == float
numericals = train.columns[numericals_floats]
print(numericals)
#we check for skewness in the float data
skew_limit = 0.35
skew_vals = train[numericals].skew()
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0:'Skewness'}))
skew_cols
#Visualising them above data before and after log transforming
%matplotlib inline
field = 'GarageYrBlt'
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10,5))
train[field].hist(ax=ax_before)
train[field].apply(np.log1p).hist(ax=ax_after)
ax_before.set (title = 'Before np.log1p', ylabel = 'frequency', xlabel = 'Value')
ax_after.set (title = 'After np.log1p', ylabel = 'frequency', xlabel = 'Value')
fig.suptitle('Field: "{}"'.format (field));
#note how applying log transformation on GarageYrBuilt does not do much
print(skew_cols.index.tolist()) #returns a list of the values
for i in skew_cols.index.tolist():
    if i == "SalePrice": #we do not want to transform the feature to be predicted
        continue
    train[i] = train[i].apply(np.log1p)
    test[i] = test[i].apply(np.log1p)
feature_cols = [x for x in train.columns if x != ('SalePrice')]
X_train = train[feature_cols]
y_train = train['SalePrice']
X_test = test[feature_cols]
y_test = train['SalePrice']
print(X_test.shape)
print(y_train.shape)
print(X_train.shape)
#now to the most fun part. Feature engineering is over!!!
#i am going to use linear regression, L1 regularization, L2 regularization and ElasticNet(blend of L1 and L2)
#first up, Linear Regression
alphas =[0.00005, 0.0005, 0.005, 0.05, 0.5, 0.1, 0.3, 1, 3, 5, 10, 25, 50, 100] #i choosed this
l1_ratios = np.linspace(0.1, 0.9, 9)
#LinearRegression
linearRegression = LinearRegression().fit(X_train, y_train)
prediction1 = linearRegression.predict(X_test)
LR_score = linearRegression.score(X_train, y_train)
print(LR_score)
#ridge
ridgeCV = RidgeCV(alphas=alphas).fit(X_train, y_train)
prediction2 = ridgeCV.predict(X_test)
R_score = ridgeCV.score(X_train, y_train)
print(R_score)
#lasso
lassoCV = LassoCV(alphas=alphas, max_iter=100).fit(X_train, y_train)
prediction3 = lassoCV.predict(X_test)
L_score = lassoCV.score(X_train, y_train)
print(L_score)
#elasticNetCV
elasticnetCV = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, max_iter=100).fit(X_train, y_train)
prediction4 = elasticnetCV.predict(X_test)
EN_score = elasticnetCV.score(X_train, y_train)
print(EN_score)
from sklearn.ensemble import RandomForestRegressor
randfr = RandomForestRegressor()
randfr = randfr.fit(X_train, y_train)
prediction5 = randfr.predict(X_test)
print(prediction5.shape)
RF_score = randfr.score(X_train, y_train)
print(RF_score)
#putting it all together
rmse_vals = [LR_score, R_score, L_score, EN_score, RF_score]
labels = ['Linear', 'Ridge', 'Lasso', 'ElasticNet', 'RandomForest']
rmse_df = pd.Series(rmse_vals, index=labels).to_frame()
rmse_df.rename(columns={0: 'SCORES'}, inplace=True)
rmse_df
KaggleHouse_submission_1 = pd.DataFrame({'Id': test.Id, 'SalePrice': prediction5})
print(KaggleHouse_submission_1.shape)
In the Kaggle house price prediction competition there is a train dataset and a test dataset; here is the link to the actual data. The output dataframe should be 1459 x 2, but mine is 1460 x 2 for some reason. I am not sure why this is happening. Any feedback is highly appreciated.
In the following line:
test = train.fillna(0)
you are overwriting the test variable with the train data, so test ends up with the 1460 rows of train instead of its own 1459.
scikit-learn is also sensitive to the number and ordering of columns, so if your train and test sets are misaligned you can run into a similar problem. You therefore need to make sure the test data is encoded the same way as the train data, using the following align command:
train, test = train_dummies.align(test_dummies, join='left', axis = 1)
see changes in my code above
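To make the alignment step concrete, here is a small sketch on made-up frames (the column names only imitate the Kaggle data). Note that `fill_value=0` is my addition and is not in the command above: without it, dummy columns that exist only in train come out as NaN in test.

```python
import pandas as pd

# Hypothetical miniature train/test frames with one categorical column.
train = pd.DataFrame({"Id": [1, 2, 3],
                      "Street": ["Pave", "Grvl", "Pave"],
                      "SalePrice": [200, 150, 210]})
test = pd.DataFrame({"Id": [4, 5], "Street": ["Pave", "Pave"]})

train_dummies = pd.get_dummies(train, columns=["Street"])
test_dummies = pd.get_dummies(test, columns=["Street"])

# join='left' keeps exactly the train columns, in the train order;
# columns that test lacks (here Street_Grvl and SalePrice) are filled with 0.
train_aligned, test_aligned = train_dummies.align(
    test_dummies, join="left", axis=1, fill_value=0
)

feature_cols = [c for c in train_aligned.columns if c != "SalePrice"]
X_train = train_aligned[feature_cols]
X_test = test_aligned[feature_cols]
print(X_train.shape, X_test.shape)
```

The number of rows in `X_test`, and therefore the number of predictions, stays equal to the number of rows in the test file, which is the property the submission needs.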

scikit very low accuracy on classifiers (Naive Bayes, DecisionTreeClassifier)

I am using this dataset Weath Based on age and the documentation states that the accuracy should be around 84%. Unfortunately, the accuracy of my program is at 25%
To process the data I did the following:
1. Loaded the .txt data file and converted it to a .csv
2. Removed data with missing values
3. Extracted the class values <=50K and >50K and converted them to 0 and 1 respectively
4. For each attribute and for each string value of that attribute I
mapped it to an integer value. Example att1{'cs':0, 'cs2':1},
att2{'usa':0, 'greece':1} ... and so on
5. Called naive bayes on the new integer data set
Python code:
import load_csv as load #my functions to do [1..5] of the list
import numpy as np
my_data = np.genfromtxt('out.csv', dtype = dt, delimiter = ',', skip_header = 1)
data = np.array(load.remove_missing_values(my_data)) #this funcion removes the missing data
features_train = np.array(load.remove_field_num(data, len(data[0]) - 1)) #this function extracts the data, e.g removes the class in the end of the data
label_train = np.array(load.create_labels(data))
features_train = np.array(load.convert_to_int(features_train))
my_data = np.genfromtxt('test.csv', dtype = dt, delimiter = ',', skip_header = 1)
data = np.array(load.remove_missing_values(my_data))
features_test = np.array(load.remove_field_num(data, len(data[0]) - 1))
label_test = np.array(load.create_labels(data)) #extracts the labels from the .csv data file
features_test = np.array(load.convert_to_int(features_test)) #converts the strings to ints(each unique string of an attribute is assigned a unique integer value
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.metrics import accuracy_score
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, label_train)
predict = clf.predict(features_test)
score = accuracy_score(predict, label_test) #Low accuracy score
load_csv module:
import numpy as np
attributes = { 'Private':0, 'Self-emp-not-inc':1, 'Self-emp-inc':2, 'Federal-gov':3, 'Local-gov':4, 'State-gov':5, 'Without-pay':6, 'Never-worked':7,
'Bachelors':0, 'Some-college':1, '11th':2, 'HS-grad':3, 'Prof-school':4, 'Assoc-acdm':5, 'Assoc-voc':6, '9th':7, '7th-8th':8, '12th':9, 'Masters':10, '1st-4th':11, '10th':12, 'Doctorate':13, '5th-6th':14, 'Preschool':15,
'Married-civ-spouse':0, 'Divorced':1, 'Never-married':2, 'Separated':3, 'Widowed':4, 'Married-spouse-absent':5, 'Married-AF-spouse':6,
'Tech-support':0, 'Craft-repair':1, 'Other-service':2, 'Sales':3, 'Exec-managerial':4, 'Prof-specialty':5, 'Handlers-cleaners':6, 'Machine-op-inspct':7, 'Adm-clerical':8,
'Farming-fishing':9, 'Transport-moving':10, 'Priv-house-serv':11, 'Protective-serv':12, 'Armed-Forces':13,
'Wife':0, 'Own-child':1, 'Husband':2, 'Not-in-family':3, 'Other-relative':4, 'Unmarried':5,
'White':0, 'Asian-Pac-Islander':1, 'Amer-Indian-Eskimo':2, 'Other':3, 'Black':4,
'Female':0, 'Male':1,
'United-States':0, 'Cambodia':1, 'England':2, 'Puerto-Rico':3, 'Canada':4, 'Germany':5, 'Outlying-US(Guam-USVI-etc)':6, 'India':7, 'Japan':8, 'Greece':9, 'South':10, 'China':11, 'Cuba':12, 'Iran':13, 'Honduras':14, 'Philippines':15, 'Italy':16, 'Poland':17, 'Jamaica':18, 'Vietnam':19, 'Mexico':20, 'Portugal':21, 'Ireland':22, 'France':23, 'Dominican-Republic':24, 'Laos':25, 'Ecuador':26, 'Taiwan':27, 'Haiti':28, 'Columbia':29, 'Hungary':30, 'Guatemala':31, 'Nicaragua':32, 'Scotland':33, 'Thailand':34, 'Yugoslavia':35, 'El-Salvador':36, 'Trinadad&Tobago':37, 'Peru':38, 'Hong':39, 'Holand-Netherlands':40
}
def remove_field_num(a, i): #function to strip values
    names = list(a.dtype.names)
    new_names = names[:i] + names[i + 1:]
    b = a[new_names]
    return b
def remove_missing_values(data):
    temp = []
    for i in range(len(data)):
        for j in range(len(data[i])):
            if data[i][j] == '?': #If a missing value '?' is encountered do not append the line to temp
                break
            if j == (len(data[i]) - 1) and len(data[i]) == 15:
                temp.append(data[i]) #Append the lines that do not contain '?'
    return temp
def create_labels(data):
    temp = []
    for i in range(len(data)): #Iterate through the data
        j = len(data[i]) - 1 #Extract the labels
        if data[i][j] == '<=50K':
            temp.append(0)
        else:
            temp.append(1)
    return temp
def convert_to_int(data):
    my_lst = []
    for i in range(len(data)):
        lst = []
        for j in range(len(data[i])):
            key = data[i][j]
            if j in (1, 3, 5, 6, 7, 8, 9, 13, 14):
                lst.append(int(attributes[key]))
            else:
                lst.append(int(key))
        my_lst.append(lst)
    temp = np.array(my_lst)
    return temp
I have tried to use both tree and NaiveBayes but the accuracy is very low. Any suggestions of what am I missing?
I guess the problem is in the preprocessing. It is better to encode the categorical variables as one-hot vectors (vectors of zeros with a single one marking the category's value) instead of raw integers, which impose an arbitrary order on the categories. scikit-learn's DictVectorizer can help you with that, and the pandas library makes the whole classification workflow much more convenient.
The following shows how easily you can achieve that with pandas, which works very well alongside scikit-learn. It achieves an accuracy of 81.6% on a test set that is 20% of the entire data.
from __future__ import division
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
# Read the data into a pandas dataframe
df = pd.read_csv('adult.data.csv')
# Columns names
cols = np.array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
'target'])
# numeric columns
numeric_cols = ['age', 'fnlwgt', 'education-num',
'capital-gain', 'capital-loss', 'hours-per-week']
# assign names to the columns in the dataframe
df.columns = cols
# replace the target variable to 0 and 1 for <50K and >50k
df1 = df.copy()
df1.loc[df1['target'] == ' <=50K', 'target'] = 0
df1.loc[df1['target'] == ' >50K', 'target'] = 1
# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(
df1.drop('target', axis=1), df1['target'], test_size=0.2)
# numeric attributes
x_num_train = X_train[numeric_cols].to_numpy()
x_num_test = X_test[numeric_cols].to_numpy()
# scale to <0,1>
max_train = np.amax(x_num_train, 0)
max_test = np.amax(x_num_test, 0) # not really needed
x_num_train = x_num_train / max_train
x_num_test = x_num_test / max_train # scale test by max_train
# labels or target attribute
y_train = y_train.astype(int)
y_test = y_test.astype(int)
# categorical attributes
cat_train = X_train.drop(numeric_cols, axis=1)
cat_test = X_test.drop(numeric_cols, axis=1)
cat_train.fillna('NA', inplace=True)
cat_test.fillna('NA', inplace=True)
x_cat_train = cat_train.T.to_dict().values()
x_cat_test = cat_test.T.to_dict().values()
# vectorize (encode as one hot)
vectorizer = DictVectorizer(sparse=False)
vec_x_cat_train = vectorizer.fit_transform(x_cat_train)
vec_x_cat_test = vectorizer.transform(x_cat_test)
# build the feature vector
x_train = np.hstack((x_num_train, vec_x_cat_train))
x_test = np.hstack((x_num_test, vec_x_cat_test))
clf = LogisticRegression().fit(x_train, y_train.values)
pred = clf.predict(x_test)
print(classification_report(y_test.values, pred, digits=4))
print(accuracy_score(y_test.values, pred))
clf = DecisionTreeClassifier().fit(x_train, y_train)
pred = clf.predict(x_test)  # reuse `pred` so the report below scores this model
print(classification_report(y_test.values, pred, digits=4))
print(accuracy_score(y_test.values, pred))
clf = GaussianNB().fit(x_train, y_train)
pred = clf.predict(x_test)
print(classification_report(y_test.values, pred, digits=4))
print(accuracy_score(y_test.values, pred))
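To show what the one-hot step does without any library, here is a tiny pure-Python sketch. The function `one_hot_encode` and the three rows are made up for illustration. DictVectorizer does the same kind of thing (with `key=value` column names) and additionally handles sparse output and categories that are missing at transform time.

```python
def one_hot_encode(rows, categorical_keys):
    """Turn a list of dicts into numeric rows; each categorical value becomes a 0/1 column."""
    # One output column per (key, value) pair seen in the data.
    columns = sorted({(k, row[k]) for row in rows for k in categorical_keys})
    numeric_keys = [k for k in rows[0] if k not in categorical_keys]
    header = numeric_keys + [f"{k}={v}" for k, v in columns]
    encoded = [
        [row[k] for k in numeric_keys] + [1 if row[k] == v else 0 for k, v in columns]
        for row in rows
    ]
    return header, encoded

# Three made-up rows in the style of the adult dataset.
rows = [
    {"age": 39, "workclass": "State-gov", "sex": "Male"},
    {"age": 50, "workclass": "Private", "sex": "Male"},
    {"age": 38, "workclass": "Private", "sex": "Female"},
]
header, encoded = one_hot_encode(rows, categorical_keys=["workclass", "sex"])
print(header)
for r in encoded:
    print(r)
```

Unlike the integer codes in the question, no category ends up "larger" than another here, so a tree or a Gaussian model is not misled by an ordering that does not exist.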
