How to Perform a Naieve Bayes Classification in Scikit from CSV

How to Perform a Naieve Bayes Classification in Scikit from CSV - python

I am looking to predict whether someone is a smoker from several columns of demographic data stored in a csv, as well as their smoker status.
The columns used are:
Gender, Age,Race, ServedInMilitary, CountryofBirth, EducationLevel MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary values. Could someone help me correct my code to take these factors into account when determining smoker status and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
target_names = np.array(['Positives','Negatives'])
# add columns to your data frame
data['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
data['Type'] = pd.Factor(targets, target_names)
data['Targets'] = targets
# define training and test sets
train = data[data['is_train']==True]
test = data[data['is_train']==False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = data.columns[0:7]
# call Gaussian Naive Bayesian class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
#Predict Output

There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To treat this, you will need to transform your categorical features into another set of binary features that will balance out this effect and handle it for you. This is called One Hot Encoding. Instead of 1/0 on your "SmokedBefore", it will become "is_1" and "is_0" with corresponding data. Like that, each column will have a 1 and a 0.
You can simply use the onehotencoder function provided in sklearn. Use the categorical_features argument to specify which columns have categorical features

Related

How do I filter outliers in test data based on z-scores of train data?

I have a train and test dataset. On the train dataset I detected and deleted outlier values, when their standard deviation is 5 times greater from the mean. If a z-score returned is larger than that, the value is quite unusual and therefore I delete it from the dataset.
import scipy.stats as stats
z_scores = train_df.apply(stats.zscore)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 5).all(axis=1)
train_df= train_df[filtered_entries]
Now I want to use the same z-scores based on the train set to remove values from the test set. (I don't want to get the z_scores from the test dataset itself!) Probably one idea is to store the mean and standard deviation of X from the train data and calculate the z-score for the test data based on them e.g.
(Xtest−μ)/σ
But I do not have any concrete ideas how to do so. Could someone give me some advice?

Choosing between imputation methods [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed last year.
Improve this question
I'm trying to evaluate 2 methods for imputation of data.
My dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
My target label is LotFrontage.
First I encoded all categorial features with OneHotEncoding and then I used the correlation matrix and filter anything above -0.3 or blow 0.3.
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood',
'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
corrmat = encoded_df.corr()
corrmat[(corrmat > 0.3) | (corrmat < -0.3)]
# filtering out based on corrmat output...
encoded_df = encoded_df[['SalePrice', 'MSSubClass', 'LotFrontage', 'LotArea',
'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM']]
Then I try two imputation methods:
use the mean value of LotFrontage (used this method because I saw low outlier ratio)
Tried to predict LotFrontage with DecisionTreeRegressor
# imputate LotFrontage with the mean value (we saw low outliers ratio so we gonna use this)
encoded_df1 = encoded_df.copy()
encoded_df1['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X1 = encoded_df1.drop('LotFrontage', axis=1)
y1 = encoded_df1['LotFrontage']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1)
classifier1 = DecisionTreeRegressor()
classifier1.fit(X1_train, y1_train)
y1_pred = classifier1.predict(X1_test)
print('score1: ', classifier1.score(X1_test, y1_test))
# imputate LotFrontage with by preditcing it using DecisionTreeRegressor
encoded_df2 = encoded_df.copy()
X2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1)
y2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()]['LotFrontage']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2)
classifier2 = DecisionTreeRegressor()
classifier2.fit(X2_train, y2_train)
y2_pred = classifier2.predict(encoded_df2[encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1))
imputated_encoded_df2 = encoded_df2[encoded_df2['LotFrontage'].isnull()].assign(LotFrontage=y2_pred)
X3 = imputated_encoded_df2.drop('LotFrontage', axis=1)
y3 = imputated_encoded_df2['LotFrontage']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3)
classifier2.fit(X3_train, y3_train)
y3_pred = classifier2.predict(X3_test)
print('score2: ', classifier2.score(X3_test, y3_test))
My questions are:
Is it correct of me first using fillna with the mean value and then splitting to train and test and checking the score? Because if I'm filling the values prior to fitting the model won't it fit the model on the imputated data and thus giving me biased result? Same for the second method
Anything else I'm doing wrong since I can't determine the best method for imputation since I get bad and random score for both methods

1.Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor the correlations between features.
It only works on the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
2.Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor the correlations between features.
It can introduce bias in the data.
Zero or Constant imputation — as the name suggests — it replaces the missing values with either zero or any constant value you specify
3.Imputation Using k-NN:
The k nearest neighbours is an algorithm that is used for simple classification. The algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful in making predictions about the missing values by finding the k’s closest neighbours to the observation with missing data and then imputing them based on the non-missing values in the neighbourhood.
How does it work?
It creates a basic mean impute then uses the resulting complete list to construct a KDTree. Then, it uses the resulting KDTree to compute nearest neighbours (NN). After it finds the k-NNs, it takes the weighted average of them.
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)
Since the outlier ratio is low we can use method 3. It will also have less impact on the correlation between the imputed target variable(i.e LotFrontage) and other features.
import sys
from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000) #Increase the recursion limit of the OS
# start the KNN training
train_df['LotFrontage']=fast_knn(train_df[['LotFrontage','1stFlrSF','MSSubClass']], k=30)
I've chosen the two features considering their correlation with the LotFrontage column.

How to get original value for binary encoding using category_encoder package

I have a dataset which includes over 100 countries in it. I want to include these in an XGBoost model to make a classification prediction. I know that One Hot Encoding is the go-to process for this, but I would rather do something that wont increase the dimensionality so much and will be resilient to new values, so I'm trying binary classification using the category_encoders package. http://contrib.scikit-learn.org/categorical-encoding/binary.html
Using this encoding helped my model out over using basic one-hot encoding, but how do I get back to the original labels after encoding?
I know about the inverse_transform method, but that functions on the whole data frame. I need a way where I can put in a binary, or integer value and get back the original value.
Here's some example data taken from: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
import numpy as np
import pandas as pd
import category_encoders as ce
# make some data
df = pd.DataFrame({
'color':["a", "c", "a", "a", "b", "b"],
'outcome':[1, 2, 3, 2, 2, 2]})
# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)
# instantiate an encoder - here we use Binary()
ce_binary = ce.BinaryEncoder(cols = ['color'])
# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)
I'd like to pass the values [0,0,1] or 1 into a function and get back a as a value.
The main reason for this is for looking at the feature importances of the model. I can get feature importances based on a column, but this will give me back a column id rather than the underlying value of a category that is the most important.

Please note that the article you reference suggests using the Binary Encoder for ordinal data only - that is, discrete data that has an order associated with it (small, medium, large), not nominal data (Red, White, Blue).
If you decide to use a Binary encoder, the order in which colors (or countries) are encoded will impact your performance. For example, assume red=001, white=010, and blue=011. When you apply an ML algorithm, it will see that red and blue have a feature in common (feature 3). This is probably not what you want.
In terms of applying the inverse transformation, you'll need to apply the inverse transformation to [0,0,1] in your example above, not "1". "1" is meaningless without context. You should be able to apply the inverse transformation to a single record (row) in your data, but not a single column. The inverse scaler will need to will operate on an object with the output dimension of the transformer.

Data Preprocessing Python

I have a DataFrame in Python and I need to preprocess my data. Which is the best method to preprocess data?, knowing that some variables have huge scale and others doesn't. Data hasn't huge deviance either. I tried with preprocessing.Scale function and it works, but I'm not sure at all if is the best method to proceed to the machine learning algorithms.

There are various techniques for data preprocessing, you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assessing missing values, by computing their percentage per column
Compute the variance and remove variables with near zero variance
Assess the inter variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
#reordering columns
variance = variance[["feature_names","variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file+"_variance.csv", sep="|", index=False)
missing_values_percentage = data.isnull().sum()/data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names","missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file+"_mssing_values.csv", sep="|", index=False)
correlation = data.corr()
correlation.to_csv(data_file+"_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands on tutorial.

always split your data to train and test split to prevent overfiting.
if some of your features has big scale and some doesnt you should standard the data.make sure to sandard the data only on the train set not to couse overfiting.
you also have to look for missing datas and replace or remove them.
if less than 0.5% of the data in a column is missing you can use 'dropna' otherwise you have to replace it with something(you can replace ut with zero,mean,the previous data...)
you also have to check outliers by using boxplot.
outliers are point that are significantly different from other data in the same group can also affects your prediction in machine learning.
its the best if we check the multicollinearity.
if some features have correlation we have multicollinearity can couse wrong prediction for our model.
for using your data some of the columns might be categorical with sholud be converted to numerical.

having OneHotEncoder to manage unseen values at transform step

I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
Suppose I use A at the .fit(A) step and B at some point as new data to .transform(B). If B contains unseen values in respect to A, doing so produces a feature out of bounds error. Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
ValueError: Feature out of bounds. Try setting n_values.
I understand I can change the feature bounds at .fit time. But if I am using A as training data, each time I got a new set B to predict, I would have to mess with my initial encoding.
Thanks.

Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.

This feature is added to OneHotEncoder now. You can do this by setting the parameter handle_unknown='ignore'.
For example:
from sklearn.preprocessing import OneHotEncoder
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
onehot = OneHotEncoder(handle_unknown='ignore')
A = onehot.fit_transform(A)
B = onehot.transform(B)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.