I have the House Prices - Advanced Regression Techniques Data set. I need to do Lasso and Ridge Regularization on it. I saved the train data in the variable named house. Typed the following code:
house.info()
Got this output: enter image description here
There are columns in this data set which are numerical(int64 and float 64) but they actually are categorical(both ordinal and nominal).
I wanted to ask whether I can standardize these categorical variables or should I first convert all these variables into type "object" using house[col_name]=house[col_name].astype(str) and then do one- hot encoding on these variables and standardize the rest of the numerical columns?
When a column is cardinal it is possible to apply one-hot-encoding, in this way the categorical columns can be vectorized in a binary way for each category.
import pandas as pd
raw_df= pd.get_dummies(data=raw_df,
cardinal_features=['col1', 'col2', 'col3'],
prefix=['feature1_', 'feature2_', 'feature3_'])
Related
I have a dataset which includes over 100 countries in it. I want to include these in an XGBoost model to make a classification prediction. I know that One Hot Encoding is the go-to process for this, but I would rather do something that wont increase the dimensionality so much and will be resilient to new values, so I'm trying binary classification using the category_encoders package. http://contrib.scikit-learn.org/categorical-encoding/binary.html
Using this encoding helped my model out over using basic one-hot encoding, but how do I get back to the original labels after encoding?
I know about the inverse_transform method, but that functions on the whole data frame. I need a way where I can put in a binary, or integer value and get back the original value.
Here's some example data taken from: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
import numpy as np
import pandas as pd
import category_encoders as ce
# make some data
df = pd.DataFrame({
'color':["a", "c", "a", "a", "b", "b"],
'outcome':[1, 2, 3, 2, 2, 2]})
# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)
# instantiate an encoder - here we use Binary()
ce_binary = ce.BinaryEncoder(cols = ['color'])
# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)
I'd like to pass the values [0,0,1] or 1 into a function and get back a as a value.
The main reason for this is for looking at the feature importances of the model. I can get feature importances based on a column, but this will give me back a column id rather than the underlying value of a category that is the most important.
Please note that the article you reference suggests using the Binary Encoder for ordinal data only - that is, discrete data that has an order associated with it (small, medium, large), not nominal data (Red, White, Blue).
If you decide to use a Binary encoder, the order in which colors (or countries) are encoded will impact your performance. For example, assume red=001, white=010, and blue=011. When you apply an ML algorithm, it will see that red and blue have a feature in common (feature 3). This is probably not what you want.
In terms of applying the inverse transformation, you'll need to apply the inverse transformation to [0,0,1] in your example above, not "1". "1" is meaningless without context. You should be able to apply the inverse transformation to a single record (row) in your data, but not a single column. The inverse scaler will need to will operate on an object with the output dimension of the transformer.
I am looking to predict whether someone is a smoker from several columns of demographic data stored in a csv, as well as their smoker status.
The columns used are:
Gender, Age,Race, ServedInMilitary, CountryofBirth, EducationLevel MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary values. Could someone help me correct my code to take these factors into account when determining smoker status and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
target_names = np.array(['Positives','Negatives'])
# add columns to your data frame
data['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
data['Type'] = pd.Factor(targets, target_names)
data['Targets'] = targets
# define training and test sets
train = data[data['is_train']==True]
test = data[data['is_train']==False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = data.columns[0:7]
# call Gaussian Naive Bayesian class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
#Predict Output
There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To treat this, you will need to transform your categorical features into another set of binary features that will balance out this effect and handle it for you. This is called One Hot Encoding. Instead of 1/0 on your "SmokedBefore", it will become "is_1" and "is_0" with corresponding data. Like that, each column will have a 1 and a 0.
You can simply use the onehotencoder function provided in sklearn. Use the categorical_features argument to specify which columns have categorical features
I want to use Random Forest for feature selection based on Gini index. My dataset has mix of numeric (contiuous) and categorical(String) data. This is an example of the dataset
Var1 Var2
198 zcROj17IEC
336 DHeTmBftjz
252.3 crIgUHSK8h
252 ZSNrjIX0Db
I know trees works on discrete data (categorical) but does RandomForest in Sklearn require continuous numeric data to be discretized first or it can handle it?? For categorical string variables I used the following to encode the strings into numeric columns with zeros and ones
pandas.get_dummies(X['Var2'])
and it works but for the numeric I tried the following to discretize
pandas.qcut(X['Var1'], 2 , retbins=True)
but I keep getting an error of non-unique bins!
Do I need to discretize? How can I do it?
Random forest should support continuous variables no problem. See for example this sample.
Trees and Forest work worse when you make dummies from you categorical values.
You need just label you categorical features - that's all!
I have a DataFrame in Python and I need to preprocess my data. Which is the best method to preprocess data?, knowing that some variables have huge scale and others doesn't. Data hasn't huge deviance either. I tried with preprocessing.Scale function and it works, but I'm not sure at all if is the best method to proceed to the machine learning algorithms.
There are various techniques for data preprocessing, you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assessing missing values, by computing their percentage per column
Compute the variance and remove variables with near zero variance
Assess the inter variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
#reordering columns
variance = variance[["feature_names","variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file+"_variance.csv", sep="|", index=False)
missing_values_percentage = data.isnull().sum()/data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names","missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file+"_mssing_values.csv", sep="|", index=False)
correlation = data.corr()
correlation.to_csv(data_file+"_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands on tutorial.
always split your data to train and test split to prevent overfiting.
if some of your features has big scale and some doesnt you should standard the data.make sure to sandard the data only on the train set not to couse overfiting.
you also have to look for missing datas and replace or remove them.
if less than 0.5% of the data in a column is missing you can use 'dropna' otherwise you have to replace it with something(you can replace ut with zero,mean,the previous data...)
you also have to check outliers by using boxplot.
outliers are point that are significantly different from other data in the same group can also affects your prediction in machine learning.
its the best if we check the multicollinearity.
if some features have correlation we have multicollinearity can couse wrong prediction for our model.
for using your data some of the columns might be categorical with sholud be converted to numerical.
I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
Suppose I use A at the .fit(A) step and B at some point as new data to .transform(B). If B contains unseen values in respect to A, doing so produces a feature out of bounds error. Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
ValueError: Feature out of bounds. Try setting n_values.
I understand I can change the feature bounds at .fit time. But if I am using A as training data, each time I got a new set B to predict, I would have to mess with my initial encoding.
Thanks.
Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.
This feature is added to OneHotEncoder now. You can do this by setting the parameter handle_unknown='ignore'.
For example:
from sklearn.preprocessing import OneHotEncoder
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
onehot = OneHotEncoder(handle_unknown='ignore')
A = onehot.fit_transform(A)
B = onehot.transform(B)