Choosing between imputation methods [closed]

I'm trying to evaluate 2 methods for imputation of data.
My dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
My target label is LotFrontage.
First I encoded all categorical features with one-hot encoding, then I used the correlation matrix and filtered out every feature whose correlation was between -0.3 and 0.3.
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood',
'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
corrmat = encoded_df.corr()
corrmat[(corrmat > 0.3) | (corrmat < -0.3)]
# filtering out based on corrmat output...
encoded_df = encoded_df[['SalePrice', 'MSSubClass', 'LotFrontage', 'LotArea',
'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM']]
Then I try two imputation methods:
Use the mean value of LotFrontage (I chose this method because I saw a low outlier ratio).
Predict LotFrontage with DecisionTreeRegressor.
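# impute LotFrontage with the mean value (the outlier ratio is low, so the mean is a reasonable choice)
encoded_df1 = encoded_df.copy()
# assign the result back instead of calling fillna(inplace=True) on a column selection
encoded_df1['LotFrontage'] = encoded_df1['LotFrontage'].fillna(encoded_df['LotFrontage'].mean())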
X1 = encoded_df1.drop('LotFrontage', axis=1)
y1 = encoded_df1['LotFrontage']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1)
classifier1 = DecisionTreeRegressor()
classifier1.fit(X1_train, y1_train)
y1_pred = classifier1.predict(X1_test)
print('score1: ', classifier1.score(X1_test, y1_test))
# impute LotFrontage by predicting it with DecisionTreeRegressor
encoded_df2 = encoded_df.copy()
X2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1)
y2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()]['LotFrontage']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2)
classifier2 = DecisionTreeRegressor()
classifier2.fit(X2_train, y2_train)
y2_pred = classifier2.predict(encoded_df2[encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1))
imputated_encoded_df2 = encoded_df2[encoded_df2['LotFrontage'].isnull()].assign(LotFrontage=y2_pred)
X3 = imputated_encoded_df2.drop('LotFrontage', axis=1)
y3 = imputated_encoded_df2['LotFrontage']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3)
classifier2.fit(X3_train, y3_train)
y3_pred = classifier2.predict(X3_test)
print('score2: ', classifier2.score(X3_test, y3_test))
My questions are:
Is it correct to first use fillna with the mean value and then split into train and test sets and check the score? If I fill the values before fitting the model, won't the model be fitted on the imputed data and thus give me a biased result? The same question applies to the second method.
Is there anything else I'm doing wrong? I can't determine the best imputation method because I get bad and random scores for both methods.

1. Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor in the correlations between features.
It only works on the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
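As a rough illustration of the mean/median strategy, here is a minimal sketch using scikit-learn's SimpleImputer. The toy LotFrontage values are made up for the example and are not taken from the Kaggle data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with gaps; the numbers are invented for illustration only.
df = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0, np.nan]})

# strategy can be "mean" or "median"; every column is handled independently.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(df[["LotFrontage"]])

print(imputer.statistics_)  # the per-column fill value learned from the non-missing rows
print(filled.ravel())       # the column with the gaps replaced by that value
```

Because the fill value is learned in fit, the same imputer can be fitted on a training split and then reused on a test split with transform.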
2. Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor the correlations between features.
It can introduce bias in the data.
Zero or Constant imputation — as the name suggests — it replaces the missing values with either zero or any constant value you specify
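Both strategies can be sketched with the same SimpleImputer class. The Alley values and the "Missing" placeholder below are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with gaps; the values are invented for illustration only.
df = pd.DataFrame({"Alley": ["Grvl", np.nan, "Pave", "Grvl", np.nan]}, dtype=object)

# Replace gaps with the most common category in the column.
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(df)

# Replace gaps with a fixed placeholder of your choosing.
constant = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(df)

print(most_frequent.ravel())
print(constant.ravel())
```

A constant placeholder keeps "was missing" visible as its own category, whereas the most-frequent strategy hides it inside the dominant class.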
3. Imputation Using k-NN:
The k nearest neighbours is an algorithm that is used for simple classification. The algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the gaps from the non-missing values in that neighbourhood.
How does it work?
It creates a basic mean impute then uses the resulting complete list to construct a KDTree. Then, it uses the resulting KDTree to compute nearest neighbours (NN). After it finds the k-NNs, it takes the weighted average of them.
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)
Since the outlier ratio is low we can use method 3. It will also have less impact on the correlation between the imputed target variable(i.e LotFrontage) and other features.
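import sys
import numpy as np
from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000)  # raise Python's recursion limit (an interpreter setting, not an OS one)
# run the k-NN imputation on the selected columns
knn_cols = ['LotFrontage', '1stFlrSF', 'MSSubClass']
imputed = np.asarray(fast_knn(train_df[knn_cols].values, k=30))
# fast_knn returns every column it was given; keep only the imputed LotFrontage (column 0)
train_df['LotFrontage'] = imputed[:, 0]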
I've chosen the two features considering their correlation with the LotFrontage column.
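On the leakage concern raised in the question: one way to avoid it is to learn the imputation from the training rows only and then apply it to the test rows. The sketch below swaps impyute's fast_knn for scikit-learn's KNNImputer (a different implementation of the same idea) and uses synthetic data, so the column values and the choice of n_neighbors=5 are assumptions, not results from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: LotFrontage loosely follows 1stFlrSF.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(800, 2000, n)
frontage = 0.05 * area + rng.normal(0, 3, n)
df = pd.DataFrame({"LotFrontage": frontage, "1stFlrSF": area})

# Knock out roughly 20% of LotFrontage to simulate missing values.
missing = rng.random(n) < 0.2
df.loc[missing, "LotFrontage"] = np.nan

# Split first, then impute, so the test rows never influence the imputer.
train, test = train_test_split(df, test_size=0.25, random_state=0)

imputer = KNNImputer(n_neighbors=5, weights="distance")
train_filled = imputer.fit_transform(train)  # neighbours come from training rows only
test_filled = imputer.transform(test)        # test gaps are filled from training neighbours
```

To compare imputation methods, it is usually more informative to hide some known LotFrontage values, impute them, and measure the error against the true values than to score a model on the imputed column itself.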

Related

How do I filter outliers in test data based on z-scores of train data?

I have a train and a test dataset. On the train dataset I detected and deleted outliers, defined as values that lie more than 5 standard deviations from the mean. If the absolute z-score of a value is larger than 5, the value is quite unusual and therefore I delete its row from the dataset.
import numpy as np
import scipy.stats as stats
z_scores = train_df.apply(stats.zscore)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 5).all(axis=1)
train_df= train_df[filtered_entries]
Now I want to use the same z-scores based on the train set to remove values from the test set. (I don't want to get the z_scores from the test dataset itself!) Probably one idea is to store the mean and standard deviation of X from the train data and calculate the z-score for the test data based on them e.g.
(Xtest−μ)/σ
But I do not have any concrete ideas how to do so. Could someone give me some advice?
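The idea described above (store the train mean and standard deviation, then standardise the test set with them) could look roughly like this. The column names and values are made up for the example; ddof=0 is used so that the standard deviation matches the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Toy train/test frames; in practice these are your own DataFrames.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(0, 1, size=(500, 2)), columns=["a", "b"])
test_df = pd.DataFrame({"a": [0.1, 8.0, -0.3], "b": [0.2, 0.0, -9.0]})

# Statistics come from the training data only.
mu = train_df.mean()
sigma = train_df.std(ddof=0)  # population std, as scipy.stats.zscore uses by default

# z-scores of the test rows relative to the training distribution.
test_z = (test_df - mu) / sigma

# Keep only the rows where every column is within 5 standard deviations.
keep = (test_z.abs() < 5).all(axis=1)
test_filtered = test_df[keep]
print(test_filtered)
```

pandas aligns mu and sigma with the test columns by name, so the test frame needs the same column names as the train frame.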

Increase feature importance

I am working on a classification problem. I have around 1000 features and the target variable has 2 classes. All 1000 features have values 1 or 0. I am trying to find the feature importances, but the values only range from 0.0 to 0.003. I am not sure whether such low values are meaningful.
Is there a way I can increase the feature importance?
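from sklearn.ensemble import RandomForestClassifier
# Variable importance
rf = RandomForestClassifier(min_samples_split=10, random_state=1)
rf.fit(X, Y)
print("Features sorted by their score:")
# pair each rounded importance with its column name and sort from most to least important
a = sorted(zip(map(lambda x: round(x, 3), rf.feature_importances_), X), reverse=True)
print(a)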
I would really appreciate any help! Thanks

Since you only have two target classes, you can perform an unequal-variance t-test (Welch's t-test), which has been useful to me for finding important features in a binary classification task when all other feature-ranking methods failed. You can implement this with the scipy.stats.ttest_ind function. It is a statistical test that checks whether the means of two samples differ. If the returned p-value is less than 0.05, the two groups can be assumed to come from different distributions. To implement it for each feature, follow these steps:
Extract all predictor values for class 1 and class 2 respectively.
Run ttest_ind on these two samples, specifying that their variances are not assumed to be equal (equal_var=False), and make sure it is a two-tailed t-test.
If the p-value is less than 0.05, this feature is important.
Alternatively, you can do this for all your features and use the p-value as the measure of feature importance: the lower the p-value, the higher the importance of the feature.
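A minimal sketch of these steps, run on all features at once. The data is synthetic (five binary features, of which only feature 0 is deliberately made informative), so the names and numbers are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic binary features and a binary target.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
X = rng.integers(0, 2, size=(n, 5)).astype(float)

# Make feature 0 informative: it copies the label about 90% of the time.
X[:, 0] = np.where(rng.random(n) < 0.9, y, 1 - y)

# Welch's t-test per feature (axis=0); the test is two-sided by default.
t_stat, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)

# Lower p-value means higher importance.
ranking = np.argsort(p_values)
print(ranking)
print(p_values)
```

With around 1000 features, a fixed 0.05 threshold will flag some features by chance alone, so ranking by p-value (or applying a multiple-testing correction) is probably the safer reading of the result.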
Cheers!

How to Perform a Naive Bayes Classification in Scikit from CSV

I am looking to predict whether someone is a smoker from several columns of demographic data stored in a csv, as well as their smoker status.
The columns used are:
Gender, Age, Race, ServedInMilitary, CountryofBirth, EducationLevel, MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary values. Could someone help me correct my code to take these factors into account when determining smoker status and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
target_names = np.array(['Positives','Negatives'])
# add columns to your data frame
data['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
data['Type'] = pd.Factor(targets, target_names)
data['Targets'] = targets
# define training and test sets
train = data[data['is_train']==True]
test = data[data['is_train']==False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = data.columns[0:7]
# call Gaussian Naive Bayesian class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
#Predict Output

There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To treat this, you will need to transform your categorical features into another set of binary features that will balance out this effect and handle it for you. This is called One Hot Encoding. Instead of 1/0 on your "SmokedBefore", it will become "is_1" and "is_0" with corresponding data. Like that, each column will have a 1 and a 0.
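You can simply use the OneHotEncoder class provided in sklearn. In older scikit-learn versions the categorical_features argument specified which columns were categorical; that argument was removed in version 0.22, and the column selection is now done by wrapping the encoder in a ColumnTransformer.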
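A small sketch of one-hot encoding only the categorical columns with current scikit-learn. The three columns and their values are invented for the example; which of your columns count as categorical is an assumption you have to make from the data dictionary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: Age is numerical, the other two columns are category codes.
data = pd.DataFrame({
    "Age": [23, 45, 31, 60],
    "SmokedBefore": [1, 0, 1, 0],
    "MaritalStatus": [1, 2, 3, 1],
})
categorical = ["SmokedBefore", "MaritalStatus"]

# One-hot encode the categorical columns and pass the remaining ones through unchanged.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = encoder.fit_transform(data)
print(encoded.shape)
```

The encoded columns come first (two for SmokedBefore, three for MaritalStatus), followed by the untouched Age column.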

Data Preprocessing Python

I have a DataFrame in Python and I need to preprocess my data. Which is the best method, given that some variables have a very large scale and others don't? The data doesn't have a large deviance either. I tried the preprocessing.scale function and it works, but I'm not at all sure whether it is the best way to prepare the data for machine learning algorithms.

There are various techniques for data preprocessing; you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assess missing values by computing their percentage per column
Compute the variance and remove variables with near-zero variance
Assess the inter-variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
import logging
import pandas as pd
data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
#reordering columns
variance = variance[["feature_names","variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file+"_variance.csv", sep="|", index=False)
missing_values_percentage = data.isnull().sum()/data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names","missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file+"_missing_values.csv", sep="|", index=False)
correlation = data.corr()
correlation.to_csv(data_file+"_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands on tutorial.

Always split your data into a train set and a test set so that you can detect overfitting.
If some of your features have a large scale and some don't, you should standardize the data. Make sure to fit the standardization on the train set only and then apply it to the test set, so that no information from the test set leaks into training.
You also have to look for missing data and replace or remove it.
If less than 0.5% of the data in a column is missing you can use 'dropna'; otherwise you have to replace it with something (you can replace it with zero, the mean, the previous value, ...).
You also have to check for outliers, for example with a boxplot.
Outliers are points that differ significantly from the other data in the same group, and they can affect your predictions.
It is best to check for multicollinearity as well.
If some features are correlated with each other, you have multicollinearity, which can lead to wrong predictions from your model.
Some of the columns might be categorical, in which case they should be converted to numerical values before you use the data.
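The "standardize using the train set only" advice can be sketched as follows with scikit-learn's StandardScaler. The two synthetic features (one on a large scale, one on a small scale) are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales, plus a binary target.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50_000, 15_000, 200), rng.normal(3, 1, 200)])
y = rng.integers(0, 2, 200)

# Split before scaling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_train)

# ...then apply the same transformation to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The scaled test set will not have exactly zero mean and unit variance, and that is expected: it is being measured against the training distribution.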

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Among other methods, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications" etc.) and continuous data (e.g. "Age", "Length of membership" etc.). I haven't used scikit much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

You have at least two options:
Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. When this answer was written scikit-learn had no utility to do this automatically, but it is not complicated to do yourself (newer versions provide KBinsDiscretizer with strategy='quantile'). Then fit a single multinomial NB on the categorical representation of your data.
Independently fit a gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform all the dataset by taking the class assignment probabilities (with predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)) and then refit a new model (e.g. a new gaussian NB) on the new features.
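The second option could be sketched like this. The data is synthetic, and for brevity the stacked probabilities are computed on the same rows the base models were trained on; in practice out-of-fold probabilities (for example from cross_val_predict) are probably the safer choice.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Synthetic data: two continuous features and three 0/1 categorical indicators.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)
X_cont = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
X_cat = rng.integers(0, 2, size=(n, 3))

# Fit one model per feature type.
gnb = GaussianNB().fit(X_cont, y)
mnb = MultinomialNB().fit(X_cat, y)

# Use the class probabilities of both models as new features.
multinomial_probas = mnb.predict_proba(X_cat)
gaussian_probas = gnb.predict_proba(X_cont)
stacked = np.hstack((multinomial_probas, gaussian_probas))

# Refit a new model on the stacked probabilities.
meta = GaussianNB().fit(stacked, y)
pred = meta.predict(stacked)
```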

Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.
https://github.com/remykarem/mixed-naive-bayes
The library is written such that the APIs are similar to scikit-learn's.
In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When constructing MixedNB, just specify categorical_features=[0,1], indicating that columns 0 and 1 follow a categorical distribution.
from mixed_naive_bayes import MixedNB
X = [[0, 0, 180.9, 75.0],
[1, 1, 165.2, 61.5],
[2, 1, 166.3, 60.3],
[1, 1, 173.0, 68.2],
[0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]
clf = MixedNB(categorical_features=[0,1])
clf.fit(X,y)
clf.predict(X)
Pip installable via pip install mixed-naive-bayes. More information on the usage in the README.md file. Pull requests are greatly appreciated :)

The simple answer: multiply the results! It's the same.
Naive Bayes is based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. That means you calculate the probability contributed by each feature without regard to the others, so the algorithm multiplies the probability from one feature by the probability from the next (and we ignore the denominator, since it is just a normalizer).
So the right answer is:
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.

@Yaron's approach needs an extra step (4. below):
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
AND
4. Divide 3. by the sum of the product of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence sum to 1.
Step 4. is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):
if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
    finals = t * p * self.priors
elif self.gaussian_features.size != 0:
    finals = t * self.priors
elif self.categorical_features.size != 0:
    finals = p * self.priors
normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
normalised = np.moveaxis(normalised, [0, 1], [1, 0])
return normalised
The probabilities of the Gaussian and Categorical models (t and p respectively) are multiplied together in line 269 (line 2 in the extract above) and then normalised as in step 4. in line 275 (the line starting with normalised = finals.T in the extract above).

For hybrid features, you can check this implementation.
The author has presented mathematical justification in his Quora answer, you might want to check.

You will need the following steps:
1. Calculate the probability from the categorical variables (using the predict_proba method from BernoulliNB)
2. Calculate the probability from the continuous variables (using the predict_proba method from GaussianNB)
3. Multiply 1. and 2. AND
4. Divide by the prior (either from BernoulliNB or from GaussianNB, since they are the same) AND THEN
5. Divide 4. by the sum (over the classes) of 4. This is the normalisation step.
It should be easy enough to see how you can add your own prior instead of using those learned from the data.
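Here is a minimal sketch of those five steps with scikit-learn. The data is synthetic and the variable names are made up. The division by the prior is there because each model's predict_proba already includes the class prior once, so the plain product would count it twice.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic data: three binary features and two continuous features.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)
X_bin = (rng.random((n, 3)) < np.where(y[:, None] == 1, 0.7, 0.3)).astype(int)
X_cont = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))

bnb = BernoulliNB().fit(X_bin, y)
gnb = GaussianNB().fit(X_cont, y)

p_bin = bnb.predict_proba(X_bin)    # step 1: posterior from the binary features
p_cont = gnb.predict_proba(X_cont)  # step 2: posterior from the continuous features
prior = gnb.class_prior_            # empirical class frequencies (same for both models)

unnormalised = p_bin * p_cont / prior  # steps 3 and 4: multiply, then divide by the prior
posterior = unnormalised / unnormalised.sum(axis=1, keepdims=True)  # step 5: normalise

pred = gnb.classes_[posterior.argmax(axis=1)]
```

To plug in your own prior, one option is to divide each model's probabilities by the learned prior first and then multiply the product by the prior you want before normalising.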
