I am cleaning a csv file on jupyter to do machine learning.
However, several columns have string values, like the column "description":
I know I need to use NLP to clean, but could not find how to do it on jupyter.
Could you advice me how to convert these values to numeric values?
Thank you
Numerical values are better for creating learning models than words or images.(Why? dimensionality reduction)
Common machine learning algorithms expect a numerical input.
The technique used to convert a word to a corresponding numerical value is called word embedding.
In word embedding, strings are converted to feature vectors(numbers).
Bag of words, word2vec, GloVe can be used for implementing this.
It is generally advisable to ignore those fields which wouldn't be significant for the model.So include description only if is absolutely essential.
The problem you are describing is that of converting categorical data, usually in the form of strings or numerical ID's to purely numerical data. I'm sure you are aware that using numerical ID's has a problem: it leads to the false interpretation that the data has some sort of order. Like apple < orange < lime, when this is not the case.
It is common to use one-hot encoding to produce numerical indicator variables. After encoding one column, you have N columns, where N is the amount of unique labels. The columns have a value of 1 when the corresponding categorical variable had that value and 0 otherwise. This is especially handy if there are few unique labels in one column. Both Pandas and sklearn have these sorts of functions available, albeit they are not as feature complete as one would hope.
The "description" column you have seems to be a bit trickier, because it actually includes language, not just categorical data. So that column would need to be parsed or handled in some other way. Although, the one-hot encoding scheme may very well be used for all the words in the description, producing a vector that has more 1's.
For example:
>>> import pandas as pd
>>> df = pd.DataFrame(['a', 'b', 'c', 'a', 'a', pd.np.nan])
>>> pd.get_dummies(df)
0_a 0_b 0_c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
5 0 0 0
Additional processing would be needed to get the encoding word by word. This approach considers only the full values as variables.
Related
I need to one hot encode categorical variables on my pandas data frame.
My dataset is really big with over 2000 productIDs to be one hot encoded.
I tried pd.get_dummies and it always crashes.
I have also tried scikit-learn's OneHotEncoder which also crashes! (it works fine with a smaller subset of dataframe)
What other methods are there? What is the most efficient way to one hot encode categorical variables for very big data set?
My data frame:
Month User ProductID
1 A ProdA
3 A ProdB
11 A ProdC
12 A ProdD
Required output:
Month User ProdA ProdB ProdC ProdD
1 A 1 0 0 0
3 A 0 1 0 0
11 A 0 0 1 0
12 A 0 0 0 1
My dataset is really big with over 2000 productIDs and million of user rows.
This will result in a huge dataset. Presumably it's crashing because of memory.
Perhaps you should consider alternatives to full one-hot encoding.
One way is to create dummies of the top categories, and "other" for the rest.
tops = df.ProductID.value_counts().head(10)
will give you the top product ids. You can then use
df.ProductID[~df.ProductID.isin(tops)] = 'other'
and create dummies out of that.
If you have a response variable, you might alternatively use mean encoding.
For a feature with so many different possible values, one-hot encoding may not be the best option.
I suggest using Target Encoding (https://contrib.scikit-learn.org/categorical-encoding/). Unlike one-hot encoding, which will create k columns for k unique values of the feature, target encoding transforms the one feature into one column.
I have twitter data that I want to cluster. It is text data and I learned that K means can not handle Non-Numerical data. I wanted to cluster data just on the basis of the tweets. The data looks like this.
I found this code that can converts the text into numerical data.
def handle_non_numerical_data(df):
columns = df.columns.values
for column in columns:
text_digit_vals = {}
def convert_to_int(val):
return text_digit_vals[val]
if df[column].dtype != np.int64 and df[column].dtype != np.float64:
column_contents = df[column].values.tolist()
unique_elements = set(column_contents)
x = 0
for unique in unique_elements:
if unique not in text_digit_vals:
text_digit_vals[unique] = x
x += 1
df[column] = list(map(convert_to_int, df[column]))
return df
df = handle_non_numerical_data(data)
print(df.head())
output
label tweet
0 9 24
1 5 11
2 17 45
3 14 138
4 18 112
I'm quite new to this and I don't think this is what I need to fit the data. What is a better way to handle Non-Numerical data (text) of this nature?
Edit: When running K means clustering Algorithm on raw text data I get this error.
ValueError: could not convert string to float
The most typical way of handling non-numerical data is to convert a single column into multiple binary columns. This is called "getting dummy variables" or a "one hot encoding" (among many other snobby terms).
There are other things you can do to translate the data to numbers, such as sentiment analysis (i.e. cetagorize each tweet into happy, sad, funny, angry, etc...), analyzing the tweets to determine if they are about a certain subject or not (i.e. Does this tweet talk about a virus?), the number of words in each tweet, the number of spaces per tweet, if it has good grammar or not, etc. As you can see, you are asking about a very broad subject.
When transforming data to binary columns, you get the number of unique values in your column and make that many new columns, each one of them filled with zeros and ones.
Let's focus on your first column:
import pandas as pd
df = pd.DataFrame({'account':['realdonaldtrump','naredramodi','pontifex','pmoindia','potus']})
account
0 realdonaldtrump
1 narendramodi
2 pontifex
3 pmoindia
4 potus
This is equivalent to:
pd.get_dummies(df, columns=['account'], prefix='account')
account_naredramodi account_pmoindia account_pontifex account_potus \
0 0 0 0 0
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
4 0 0 0 1
account_realdonaldtrump
0 1
1 0
2 0
3 0
4 0
This is one of many methods. You can check out this article about one hot encoding here.
NOTE: When you have many unique values, doing this will give you many columns and some algorithms will crash due to not having enough degrees of freedom (too many variables, not enough observations). Last, if you are running a regression, you will run into perfect multicollinearity if you do not drop one of the columns.
Going back to your example, if you want to turn all your columns into this kind of data, try:
pd.get_dummies(df)
However, I wouldn't do this for the tweet column because each tweet is its own unique value.
As k-means is a method of vector quantization, you should vectorize your textual data in one way or another.
See some examples of using k-means over text:
Word2Vec
tf-idf
I have a data-set that contain both numeric and categorical data like this
subject_id hour_measure heart rate blood_pressure urine color
3 4 60
4 2 70 60 red
6 1 30 yellow
I tried various methods to handle missing data such as the following code
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].transform(f))
df= df.fillna(method='ffill')
but these techniques didn't give me the result I want. I tried to use hot deck imputation I already understand the concept of the hot deck imputation technique, as it is a suitable way to handle both numeric and categorical data.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. a LUT, or convert the colors to corresponding RGB values.
Regarding the second part of your question : could you be more specific about what results you are expecting and what your current code produces?
The hot-deck method is defined in the literature as that method replaces missing values with randomly selected values from the current dataset on hand. So, I tried hot-deck methods to handle missing data such as the following code:
def hotdeck_imputation(data):
for c in (data.columns):
data.loc[:,c] = [random.choice(data[c].dropna()) if np.isnan(i) else i for i in data[c]]
return data
I hope it helps with your problem.
I have data of the form:
Feature 1 Feature 2 Feature 3 ---> Numerical Value
Problem is Feature 1 is like, String Values like Company Names, Feature 2 is also a string value like a Category and Feature 3 is just timestamp.
I want to train a model that given the features is able to predict the numerical value.
I know regression can be used for it.
But,
How do I convert the categorical features so that they can be used in regression?
How do I utilize the timestamp value for Prediction? Should I extract the month, the hour number (line from 0-23) and make them into more categorical values?
Thanks.
As we know machine learning algorithm are not capable to understand the text directly,so we need to convert these string values into one hot vector representation.
we use one hot encoder to perform “binarization” of the category and include it as a feature to train the model
So you can use pandas for this,
For example
import pandas as pd
df =pd.DataFrame({'A':["google","amazon","microsoft"]})
pd.get_dummies(df)
A_amazon A_google A_microsoft
0 1 0
1 0 0
0 0 1
After converting your variable into above format you can apply regression
Thanks
I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data
2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
Edit: Someone requested that I explain why 2) didn't work well. There were some tests cases where we still had continuous values in a column but there weren't many unique values in the column. The heuristic in 2) obviously failed in that case. There were also issues where we had a categorical column that had many, many unique values, e.g., passenger names in the Titanic data set. Same column type misclassification problem there.
Here are a couple of approaches:
Find the ratio of number of unique values to the total number of unique values. Something like the following
likely_cat = {}
for var in df.columns:
likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold
Check if the top n unique values account for more than a certain proportion of all values
top_n = 10
likely_cat = {}
for var in df.columns:
likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold
Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.
There's are many places where you could "steal" the definitions of formats that can be cast as "number". ##,#e-# would be one of such format, just to illustrate. Maybe you'll be able to find a library to do so.
I try to cast everything to numbers first and what is left, well, there's no other way left but to keep them as categorical.
You could define which datatypes count as numerics and then exclude the corresponding variables
If initial dataframe is df:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataframe = df.select_dtypes(exclude=numerics)
I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.
If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.
But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?
I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.
I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
% floats: percentage of values that are float
% int: percentage of values that are whole numbers
% string: percentage of values that are strings
% unique string: number of unique string values / total number
% unique integers: number of unique integer values / total number
mean numerical value (non numerical values considered 0 for this)
std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.
A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.
IMO the opposite strategy, identifying categoricals is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.
For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which might probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.
Countries and such things might also be identifiable...
Age groups (".-.") might also work.
I've been looking at this, thought it maybe useful to share what I have. This builds on #Rishabh Srivastava answer.
import pandas as pd
def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
"""Removes categorical features using a given method.
X: pd.DataFrame, dataframe to remove categorical features from."""
if method=='fraction_unique':
unique_fraction = X.apply(lambda col: len(pd.unique(col))/len(col))
reduced_X = X.loc[:, unique_fraction>min_fraction_unique]
if method=='named_columns':
non_cat_cols = [col not in cat_cols for col in X.columns]
reduced_X = X.loc[:, non_cat_cols]
return reduced_X
You can then call this function, giving a pandas df as X and you can either remove named categorical columns or you can choose to remove columns with a low number of unique values (specified by min_fraction_unique).