I came across the post "Label encoding across multiple columns in scikit-learn", and one of the answers (https://stackoverflow.com/a/30267328/10058906) explained how each value in a given column is encoded as an integer from 0 to n-1, where n is the number of distinct values in the column.
This raised a question: when I encode red: 2, orange: 1 and green: 0, does it imply that green is closer to orange than to red, since 0 is closer to 1 than to 2? In reality that is not true. I earlier thought that green gets the value 0 because it occurs the most often, but this does not hold for the fruit column, where apple gets the value 0 even though orange occurs the most often.
I would like to summarize Label Encoder and One Hot Encoding:
It is true that Label Encoder simply gives an integer representation to a cell value. For the above dataset, label encoding the categorical values would therefore imply that green is closer to orange than to red, since 0 is closer to 1 than to 2, which is false.
On the other hand, One Hot Encoding creates a separate column for each categorical value, and a value of 0 or 1 marks the absence or presence of that value respectively. The built-in pandas function pd.get_dummies(dataframe) produces an equivalent output.
Hence, if the given dataset contains categorical values that are ordinal in nature, it is wise to use Label Encoding (making sure the assigned integers follow the actual order); but if the given data is nominal, one should go with One Hot Encoding.
https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/2
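To make the summary above concrete, here is a minimal sketch on a made-up colour column (the data and variable names are assumptions, not the original dataset). It also suggests why green ends up as 0: scikit-learn's LabelEncoder numbers the distinct labels in sorted order, which has nothing to do with how often each one occurs.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data for illustration only.
colors = pd.Series(["red", "orange", "green", "green", "red", "green"], name="color")

# Label encoding: one integer per distinct label, assigned in sorted label order.
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)
label_to_code = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# One-hot encoding: one 0/1 column per distinct label, so no order is implied.
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Each row of one_hot contains exactly one 1, so every pair of colours is equally far apart, which is the property wanted for nominal data.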
Suppose no predefined column types (nominal/interval) are stored, and some variables are encoded as 1, 2, 3... in place of the actual categories (e.g. good, better, bad...). Looked at automatically, they may be classified as interval variables, but they are actually encoded nominal variables.
Is there any way to identify such variables?
I thought of using cardinality, but choosing a threshold becomes an issue; please suggest some other solution.
I'm fine with a Python solution, but an idea for SAS would also be helpful :)
As a data analyst, it is your call whether to treat a categorical column as nominal or ordinal (depending on the data).
if nominal data --> use dummy variables (one-hot encoding).
if ordinal data --> use the map() function for label encoding, so that the integers follow the real order.
if nominal data and cardinality is high --> encode according to frequency count. Say a column has 30 different categories across 1000 rows: 3 categories have a high frequency count, so they stay as 3 separate categories; the other 27 have very low counts, so put all 27 into one single category. That leaves only 4 categories instead of 30.
Apart from object-type (string) columns, to identify categorical variables:
the frequency count plays a very important role for numeric columns.
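The frequency-based grouping described above can be sketched roughly as follows. The function names, the min_count and max_unique thresholds, and the generated data are all assumptions chosen for illustration; a sensible cut-off depends on the data, which is exactly the threshold problem raised in the question.

```python
import pandas as pd

def lump_rare_categories(series, min_count, other_label="other"):
    """Replace categories seen fewer than `min_count` times with one label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    return series.where(series.isin(frequent), other_label)

def low_cardinality_numeric_columns(df, max_unique):
    """List numeric columns with at most `max_unique` distinct values."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].nunique() <= max_unique]

# 1000 rows, 30 categories: 3 frequent ones (300 rows each) and 27 rare ones.
values = ["cat_0"] * 300 + ["cat_1"] * 300 + ["cat_2"] * 300
values += [f"cat_{3 + i % 27}" for i in range(100)]
column = pd.Series(values)

lumped = lump_rare_categories(column, min_count=100)
print(column.nunique(), "->", lumped.nunique())

# A numeric column holding encoded categories next to a continuous one.
frame = pd.DataFrame({
    "rating": [1, 2, 3, 1, 2, 3],
    "price": [9.5, 7.25, 3.1, 8.0, 6.4, 5.9],
})
candidates = low_cardinality_numeric_columns(frame, max_unique=3)
print(candidates)
```

The second helper only flags candidates for a human to review; a low number of distinct values hints that a numeric column may hold encoded categories, but it cannot prove it.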
I need to one hot encode categorical variables on my pandas data frame.
My dataset is really big with over 2000 productIDs to be one hot encoded.
I tried pd.get_dummies and it always crashes.
I have also tried scikit-learn's OneHotEncoder which also crashes! (it works fine with a smaller subset of dataframe)
What other methods are there? What is the most efficient way to one hot encode categorical variables for very big data set?
My data frame:
Month User ProductID
1 A ProdA
3 A ProdB
11 A ProdC
12 A ProdD
Required output:
Month User ProdA ProdB ProdC ProdD
1 A 1 0 0 0
3 A 0 1 0 0
11 A 0 0 1 0
12 A 0 0 0 1
My dataset is really big with over 2000 productIDs and millions of user rows.
This will result in a huge dataset. Presumably it's crashing because of memory.
Perhaps you should consider alternatives to full one-hot encoding.
One way is to create dummies of the top categories, and "other" for the rest.
tops = df.ProductID.value_counts().head(10).index
will give you the top product ids. You can then use
df.loc[~df.ProductID.isin(tops), 'ProductID'] = 'other'
and create dummies out of that.
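Put together, the top-categories approach might look like the sketch below. The toy frame and the cut-off of 2 are assumptions made so the result is easy to read; the answer above uses the top 10.

```python
import pandas as pd

# Toy frame; the real data would have thousands of product ids.
df = pd.DataFrame({
    "Month": [1, 3, 11, 12, 2, 5, 7],
    "User": ["A", "A", "A", "A", "B", "B", "B"],
    "ProductID": ["ProdA", "ProdB", "ProdA", "ProdB", "ProdA", "ProdC", "ProdD"],
})

top_n = 2  # assumed cut-off for this toy example
tops = df["ProductID"].value_counts().head(top_n).index

# Everything outside the top categories is collapsed into a single label.
df.loc[~df["ProductID"].isin(tops), "ProductID"] = "other"

# Only top_n + 1 indicator columns are created instead of one per product id.
dummies = pd.get_dummies(df, columns=["ProductID"], dtype=int)
print(dummies)
```

The number of indicator columns is now bounded by the cut-off rather than by the number of distinct product ids, which is what keeps memory use in check.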
If you have a response variable, you might alternatively use mean encoding.
For a feature with so many different possible values, one-hot encoding may not be the best option.
I suggest using Target Encoding (https://contrib.scikit-learn.org/categorical-encoding/). Unlike one-hot encoding, which will create k columns for k unique values of the feature, target encoding transforms the one feature into one column.
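As a rough illustration of the idea behind mean (target) encoding, the sketch below replaces each category with the mean of a response variable, using plain pandas. The target column and its values are hypothetical, and this hand-rolled version is not the API of the library linked above.

```python
import pandas as pd

df = pd.DataFrame({
    "ProductID": ["ProdA", "ProdA", "ProdB", "ProdB", "ProdB", "ProdC"],
    "target": [1, 0, 1, 1, 0, 1],  # hypothetical response variable
})

# Mean of the response for each category.
category_means = df.groupby("ProductID")["target"].mean()

# One numeric column replaces the categorical one, however many categories exist.
df["ProductID_encoded"] = df["ProductID"].map(category_means)
print(df)
```

Computing the means on the same rows that are later used for training lets information about the target leak into the feature, so in practice the means are usually estimated on training data only, often with some smoothing for rare categories.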
I have a dataset that contains both numeric and categorical data, like this:
subject_id hour_measure heart rate blood_pressure urine color
3 4 60
4 2 70 60 red
6 1 30 yellow
I tried various methods to handle missing data such as the following code
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].transform(f))
df = df.ffill()
but these techniques didn't give me the result I want. I tried to use hot deck imputation, since it is a suitable way to handle both numeric and categorical data; I already understand the concept of the technique.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. with a LUT, or by converting the colors to their corresponding RGB values).
Regarding the second part of your question: could you be more specific about what results you are expecting and what your current code produces?
The hot-deck method is defined in the literature as a method that replaces missing values with randomly selected values from the current dataset. So, I tried a hot-deck method to handle the missing data, with the following code:
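import random
import pandas as pd

def hotdeck_imputation(data):
    for c in data.columns:
        donors = data[c].dropna().tolist()  # observed values to sample from
        # pd.isna also works for text columns, where np.isnan raises a TypeError
        data[c] = [random.choice(donors) if pd.isna(v) and donors else v for v in data[c]]
    return data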
I hope it helps with your problem.
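For reference, a self-contained variant of the same idea is sketched below. It is an assumption-laden illustration rather than a definitive implementation: the function name, the seed parameter and the toy frame (loosely modelled on the columns in the question) are made up, and donors are drawn per column with NumPy's random generator so the result is reproducible.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df, seed=0):
    """Fill each missing cell with a randomly chosen observed value from its column."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # leave the caller's frame untouched
    for col in out.columns:
        missing = out[col].isna()
        donors = out.loc[~missing, col].to_numpy()
        if missing.any() and len(donors) > 0:
            out.loc[missing, col] = rng.choice(donors, size=int(missing.sum()))
    return out

# Toy frame with gaps in both numeric and text columns.
raw = pd.DataFrame({
    "heart_rate": [60.0, 70.0, np.nan],
    "blood_pressure": [np.nan, 60.0, 30.0],
    "urine_color": [None, "red", "yellow"],
})
imputed = hot_deck_impute(raw)
print(imputed)
```

Because each column is sampled independently, this keeps the marginal distribution of every column but ignores relationships between columns; hot-deck variants that pick donors from similar rows are a common refinement.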
I am cleaning a CSV file in Jupyter to do machine learning.
However, several columns have string values, like the column "description":
I know I need to use NLP to clean it, but could not find how to do it in Jupyter.
Could you advise me how to convert these values to numeric values?
Thank you
Common machine learning algorithms expect numerical input, so words (like images) have to be turned into numbers before a model can learn from them.
Converting text to numbers is usually called vectorization: strings are converted to feature vectors (numbers).
Word embedding is one such technique, in which each word is mapped to a vector of numbers.
Bag of words, word2vec and GloVe can be used to implement this.
It is generally advisable to drop fields that would not be significant for the model, so include description only if it is absolutely essential.
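As a small illustration of the bag-of-words option mentioned above, the sketch below uses scikit-learn's CountVectorizer on three made-up descriptions (the example strings are assumptions, not taken from the CSV file in the question).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the "description" column.
descriptions = ["red apple", "green apple", "red red wine"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(descriptions)  # sparse document-term matrix

# vocabulary_ maps each word to its column index; list the words in column order.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()
print(vocabulary)
print(dense)
```

Each row is one description and each column counts how often one word occurs in it, so the text column becomes a purely numerical matrix that common estimators accept.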
The problem you are describing is that of converting categorical data, usually in the form of strings or numerical IDs, to purely numerical data. I'm sure you are aware that using numerical IDs has a problem: it leads to the false interpretation that the data has some sort of order, like apple < orange < lime, when this is not the case.
It is common to use one-hot encoding to produce numerical indicator variables. After encoding one column, you have N columns, where N is the number of unique labels. Each column has a value of 1 when the categorical variable had the corresponding value and 0 otherwise. This is especially handy if there are few unique labels in one column. Both pandas and sklearn have functions for this, although they are not as feature-complete as one would hope.
The "description" column you have seems to be a bit trickier, because it actually includes language, not just categorical data. So that column would need to be parsed or handled in some other way. Although, the one-hot encoding scheme may very well be used for all the words in the description, producing a vector that has more 1's.
For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(['a', 'b', 'c', 'a', 'a', np.nan])
>>> pd.get_dummies(df, dtype=int)
   0_a  0_b  0_c
0    1    0    0
1    0    1    0
2    0    0    1
3    1    0    0
4    1    0    0
5    0    0    0
Additional processing would be needed to get the encoding word by word. This approach considers only the full values as variables.
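One possible way to get the word-by-word encoding is pandas' Series.str.get_dummies, which splits each string on a separator and builds one indicator column per distinct token. The sketch below uses made-up descriptions and assumes the words are separated by single spaces.

```python
import pandas as pd

# Made-up stand-ins for the "description" column.
descriptions = pd.Series(["red apple", "green apple", "red wine"], name="description")

# Split on spaces and create one indicator column per distinct word.
word_indicators = descriptions.str.get_dummies(sep=" ")
print(word_indicators)
```

Unlike the example above, each row can now contain several 1's, one for every word that appears in that description.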
I'm building a random forest in Python using scikit-learn, and I've applied "one hot" encoding to all of the categorical variables. Question: if I apply "one hot" to my DV, do I use all of its dummy columns as the DV, or should the DV be handled differently?
You need to encode every column whose values are not numeric. One-hot encoding is one option, for the DV as well as for other non-numeric columns, but other encodings work too. For example, a column of city names has to be converted to a numerical form. This is called data molding, and it can be done without one-hot encoding as well.
E.g.: there is a DV column for diabetes with the entries "yes" and "no". It can be encoded without one-hot encoding:
diabetes_map = {'yes': 1, 'no': 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)
It depends on the type of problem you have. For binary or multi-class problems, you do not need to one-hot encode the dependent variable in scikit-learn. One-hot encoding changes the shape of the output variable from one dimension to two. The result is called a label-indicator matrix, where each column denotes the presence or absence of that label.
For example, doing one-hot encoding of the following:
['high', 'medium', 'low', 'high', 'low', 'high', 'medium']
will return this:
high medium low
1 0 0
0 1 0
0 0 1
1 0 0
0 0 1
1 0 0
0 1 0
Not all classifiers in scikit-learn support this format (even though they support multi-class classification). Even for those that do, it will trigger multi-label classification (in which more than one label can be present at once), which is not what you want in a multi-class problem.
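A minimal sketch of the difference, assuming a single made-up numeric feature X: the random forest is fitted directly on the one-dimensional string labels, while one-hot encoding the same labels yields a two-dimensional indicator matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

y = ["high", "medium", "low", "high", "low", "high", "medium"]
# Hypothetical single feature, one row per label above.
X = np.array([[9.0], [5.0], [1.0], [8.5], [1.5], [9.5], [5.5]])

# The 1-D label list is passed as-is; no one-hot encoding of the DV is needed.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)  # 1-D array of the original string labels

# One-hot encoding the same labels gives a 2-D label-indicator matrix instead.
y_one_hot = pd.get_dummies(pd.Series(y), dtype=int)
print(np.shape(y), y_one_hot.shape)
```

Keeping the DV one-dimensional means the classifier treats the task as multi-class, with exactly one label predicted per row.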