I am currently working on a big categorical dataset using both Pandas and sklearn. As I now want to do feature extraction, or at least be able to build a model, I need to encode the categorical features, since they are not handled by sklearn models.
Here is the issue: for one of the categories, I have more than 110,000 different values. I cannot possibly encode this, as there would be (and there is) a memory error.
Also, this feature cannot be removed from the dataset, so that option is out of the question.
So I have multiple ways to deal with that:
Use FeatureHasher from sklearn (as mentioned in several topics related to this). But while it is easy to encode with it, there is no mention of how to link the hashed feature back to the DataFrame and thus complete the feature extraction.
Use fuzzy matching to reduce the number of values in the feature and then use one-hot encoding. The issue here is that it will still create many dummy variables, and I will lose much of the usefulness of the features that get encoded.
So I have multiple questions: first, do you know how to use FeatureHasher to link a Pandas DataFrame to an sklearn model through hashing? If so, how am I supposed to do it?
Second, can you think of any other way to do it that might be easier, or that would work better for my problem?
Here are some screenshots of the dataset I have, so that you can understand the issue more fully.
Here is the output of the number/percentage of different values per feature.
Also, the biggest feature to encode, called 'commentaire' ('comment' in English), contains values that are sometimes long sentences and sometimes just short words.
As requested, here is the current code to test Feature Hashing:
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(input_type='string')
hashedCommentaire = fh.transform(dataFrame['commentaire'])
This raises a MemoryError, so I can reduce the number of features; let's set it to 100:
fh = FeatureHasher(input_type='string', n_features=100)
hashedCommentaire = fh.transform(dataFrame['commentaire'])
print(hashedCommentaire.toarray())
print(hashedCommentaire.shape)
It doesn't raise an error and outputs the following: feature hashing output
Can I then directly use the hashing result in my DataFrame? The issue is that we totally lose track of the values of 'commentaire'. If we then want to predict on new data, will the hashing output be consistent with the previous one? And how do we know the hashing is "sufficient"? Here I used 100 features, but there were more than 110,000 distinct values to begin with.
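For what it's worth, here is a minimal sketch of how the hashed output could be stitched back onto the rest of the DataFrame. The numerical column names are placeholders, not columns of the real dataset, and the token split is an assumption about how 'commentaire' should be fed to the hasher:

import scipy.sparse as sp
from sklearn.feature_extraction import FeatureHasher

# with input_type='string', each sample must be an iterable of strings,
# so split the sentences into tokens first (a raw string would be hashed character by character)
fh = FeatureHasher(input_type='string', n_features=100)
hashedCommentaire = fh.transform(dataFrame['commentaire'].astype(str).str.split())

# stack the hashed block next to the other, already numerical, columns
# ('num_col1' and 'num_col2' are placeholder names)
X = sp.hstack([hashedCommentaire, sp.csr_matrix(dataFrame[['num_col1', 'num_col2']].values)])

Because the hashing trick is stateless, applying the same FeatureHasher to new data sends identical tokens to the same columns, so predictions on new data stay consistent with training. Whether 100 features is "enough" is usually checked empirically: collisions increase as n_features shrinks, so model performance is the practical gauge.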
Thanks to another observation, I began to explore the use of tf-idf. Here we can see a sample of 'commentaire' values: 'commentaire' feature sample
As we can see, the different values are strings, sometimes whole sentences. While exploring this data, I noticed that some values are very close to each other; that's why I wanted to explore the idea of fuzzy matching at first, and now tf-idf.
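Since 'commentaire' is free text, a minimal tf-idf sketch could look like this; max_features and the final estimator are illustrative choices, not values from the original setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# cap the vocabulary so the resulting sparse matrix stays manageable
tfidf = TfidfVectorizer(max_features=5000)
X_text = tfidf.fit_transform(dataFrame['commentaire'].fillna(''))
print(X_text.shape)

# or bundle vectorizer and model so new data goes through the exact same vocabulary
model = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())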
Related
I was trying to get dummy values for my data when I noticed that some entries have '?' as their value.
As many rows in my data have these values, I simply cannot drop them.
In such case what should I replace them with?
Will just taking the mode of the category help?
Also, I tried to replace the ? values with the mode.
df1 = df1[df1.workclass == '?'].replace('?',"Private")
But I get an empty table now.
It depends on the dataset. Different methods apply to different features. Some may just require replacing with the mode. In other cases, ML algorithms and models such as Random Forest, KNN, etc. are used to predict the missing values. So it completely depends on the type of data you are handling. Explore the area of data exploration and missing-value imputation; maybe that can help you.
You will have to check your different variables manually and decide what to do with the missing values for each parameter.
For example, you can drop variables with more than 50% missing values unless they carry a very high weight of evidence.
Some variables can be substituted with central tendencies, or the missing values can be predicted.
Categoricals can be replaced by UNK (unknown), and so on.
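As a concrete illustration, here is a minimal sketch of the mode replacement. Note that the original attempt reassigns the filtered subset back to df1, which throws away every other row; and if even that subset came back empty, the stored values may contain padding such as ' ?', so stripping whitespace first can help:

import numpy as np

# strip whitespace, treat '?' as missing, then fill with the column mode
df1['workclass'] = df1['workclass'].str.strip().replace('?', np.nan)
df1['workclass'] = df1['workclass'].fillna(df1['workclass'].mode()[0])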
I have JSON data about some products, and I have already converted it into a flat table with pandas, so now I have a few columns of data. I selected some products manually and put them into one group. I have sorted them by name, for example, but it is more complicated than that; there are also some features and requirements which need to be checked.
So what I want is to create a script which will group my products in a similar way to the few groups I created manually based on my own judgment.
I'm totally new to machine learning, but I have read about this and watched some tutorials, and I haven't seen this type of case.
I saw that if I use a KNN classifier, for example, I have to give it as input every group that exists, and then it will assign a single product to one of those groups. In my case it must be more complicated, I guess, since I want the script to create those groups on its own, in a way similar to what I selected.
I was thinking about unsupervised machine learning, but that doesn't look like the solution because I have my own data which I want to provide; it seems like I need some kind of hybrid with supervised machine learning.
import pandas as pd
from pandas import json_normalize
from sklearn import preprocessing

data = pd.read_json('recent.json')['results']
data = json_normalize(data)
le = preprocessing.LabelEncoder()
product_name = le.fit_transform(data['name'])
Just some code to show what I have done.
I don't know if what I want makes sense. I have already attempted this problem the normal way, without machine learning, just with if statements and loops, but I wish I could also do it in a "smarter" way.
The code above shows nothing by itself. If you have data about products where each entry contains feature fields, you can cluster them with an unsupervised method such as hierarchical clustering.
I have to put in input every group that exists
No. Just define the metric between two entries, and the method builds the classes (or an entire dendrogram) from that, so you can select classes from the dendrogram as you want. If you look at each node there, it contains the common features of the items in that class, so it provides an auto-description for the class.
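A minimal sketch of that idea with hierarchical (agglomerative) clustering; the feature columns ('category', 'requirements') are placeholders for whatever product fields end up in the flattened table, the number of groups is an arbitrary choice, and recent scikit-learn versions accept string categories in OneHotEncoder directly:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical product features (placeholder columns)
enc = OneHotEncoder()
X = enc.fit_transform(data[['category', 'requirements']]).toarray()

# build the dendrogram and cut it into a chosen number of groups
Z = linkage(X, method='ward')
data['group'] = fcluster(Z, t=5, criterion='maxclust')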
I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
But since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this).
Edit:
My file info looks like this (about 500 observations, one for each station).
My file data looks like this (there are other variables not shown here; about 15 million observations, one for each trip).
And what I am looking to get is that when the station numbers match, the resulting data would look like this:
This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
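For completeness, the pandas.merge variant mentioned above could look roughly like this (the suffix handling is kept minimal for the sketch):

# left-join the station coordinates onto the big table
merged = data.merge(info[['station', 'latitude', 'longitude']],
                    on='station', how='left', suffixes=('', '_info'))

# as in the loop version, only overwrite the coordinates for the 2018 rows
mask = merged['year'] == '2018'
merged.loc[mask, 'latitude'] = merged.loc[mask, 'latitude_info'].fillna(merged.loc[mask, 'latitude'])
merged.loc[mask, 'longitude'] = merged.loc[mask, 'longitude_info'].fillna(merged.loc[mask, 'longitude'])
merged = merged.drop(columns=['latitude_info', 'longitude_info'])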
This is a very common and important issue when anyone starts dealing with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of your data, optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while still getting the exact same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, avoid for loops that you break out of. Whenever you don't know in advance how many iterations you will have to go through, prefer while or do...while loops.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is fine for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
The aim is to do everything in parallel. It relies on a concept named MapReduce.
The first distributed data storage framework was Hadoop (e.g. the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop, for example.
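As an illustration only, here is a minimal PySpark sketch of the same kind of join done in a distributed way (all paths and the Spark setup are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("station-join").getOrCreate()

# both paths are made up; the files could live on HDFS or locally
trips = spark.read.csv("hdfs:///data/trips.csv", header=True)
stations = spark.read.csv("hdfs:///data/stations.csv", header=True)

# distributed equivalent of joining the small station table onto the large trips table
joined = trips.join(stations.select("station", "latitude", "longitude"),
                    on="station", how="left")
joined.write.parquet("hdfs:///data/trips_with_coords")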
Again, this subject is a whole different world but might very well suit your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.
Is there a way to see how the categorical features are encoded when we allow h2o to automatically create categorical data by casting a column to enum type?
I am implementing holdout stacking where my underlying training data differs for each model. I have a common feature that I want to make sure is encoded the same way across both sets. The feature contains names (str). It is guaranteed that all names that appear in one data set will appear in the other.
The best way to see inside a model is to export the POJO and look at the Java source code. You should see how it processes the enums.
But, if I understand the rest of your question correctly, it should be fine. As long as the training data contains all possible values of a category, it will work as you expect. If a categorical value not seen in training is presented in production, it will be treated as an NA.
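For example, a minimal sketch from Python (assuming 'model' is an already-trained H2O model):

import h2o

# write the generated Java (POJO) source for the trained model to the current directory
h2o.download_pojo(model, path=".")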
OUTLINE:
I want to know how to make sure that new non-numerical input data is encoded to unique values.
DESCRIPTION:
Namely, once both the training and test phases are finished, a lot of new data that mixes numerical and non-numerical values is fed into my model.
ISSUE(1):
Thus, the first important matter is converting this new mixed data into a DataFrame in an all-numerical format.
TRIED METHOD:
How can we do that? I used LabelEncoder to transform every non-numerical feature into a float type; however, a scikit-learn maintainer told me that LabelEncoder is only meant for the labels (y) and that I should use OneHotEncoder to convert features.
(He answered me at https://github.com/scikit-learn/scikit-learn/issues/8674, while misunderstanding my meaning.)
Unfortunately, according to the official scikit-learn documentation (p. 1829), OneHotEncoder is only available for integer features.
I know there is an example titled 'Feature Union with Heterogeneous Data Sources', but it is not as convenient as LabelEncoder.
ISSUE(2):
The second thing that confuses me is how we can ensure that new non-numerical input data is transformed into a unique value that differs from the previously transformed training or test data. Namely, although we can use LabelEncoder to transform new input data into numerical data, there is a risk that a value transformed from the new input data might collide with a value produced from the previously transformed data.
Thanks for your help in advance
OK, I have been answered by the official developers. The following URL may be helpful for anybody who is confused about this issue.
https://github.com/amueller/introduction_to_ml_with_python/blob/master/04-representing-data-feature-engineering.ipynb
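For reference, here is a minimal sketch of one way to keep the encoding consistent between old and new data; recent scikit-learn versions let OneHotEncoder work on string features directly, and 'city' is a placeholder column name:

from sklearn.preprocessing import OneHotEncoder

# fit the encoder on the training data only ('city' is a made-up column)
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_df[['city']])

# new data is transformed with the same learned categories;
# categories never seen in training become all-zero rows instead of clashing with old codes
X_new = enc.transform(new_df[['city']])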