Encoded categorical features in h2o in python

Is there a way to see how the categorical features are encoded when we allow h2o to automatically create categorical data by casting a column to enum type?
I am implementing holdout stacking where my underlying training data differs for each model. I have a common feature that I want to make sure is encoded the same way across both sets. The feature contains names (str). It is guaranteed that all names that appear in one data set will appear in the other.

The best way to see inside a model is to export the POJO and look at the Java source code. You should see how it processes enums.
But, if I understand the rest of your question correctly, it should be fine. As long as the training data contains all possible values of a category, it will work as you expect. If a categorical value not seen in training is presented in production, it will be treated as an NA.
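For reference, here is a minimal sketch (the file name, column name and target are made up, not taken from the question) of how you could cast the shared column to enum, inspect the levels h2o assigned to it, and export the POJO to see the encoding in the generated Java:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical training file
train["name"] = train["name"].asfactor()      # cast the shared column to enum

# The category levels h2o assigned to the column
print(train["name"].levels())

model = H2OGradientBoostingEstimator()
model.train(x=["name"], y="target", training_frame=train)

# Download the POJO and read the generated Java to see how the enums are handled
h2o.download_pojo(model, path=".")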

Related

How to substitute for null values if it is a categorical variable?

I was trying to get dummy values for my data when I noticed that some values are '?'.
Since many rows in my data have these values, I simply cannot drop them.
In such a case, what should I replace them with?
Will just taking the mode of the category help?
Also, I tried to replace the '?' values with the mode:
df1 = df1[df1.workclass == '?'].replace('?',"Private")
But now I get an empty table.
It depends on the dataset. Different methods apply to different features. Some may just require replacing the missing values with the mode; in other cases, ML models such as Random Forest or KNN are used to predict them. So it completely depends on the type of data you are handling. Look into exploratory data analysis; maybe that can help you.
You will have to manually check your different variables and decide what to do with the missing values for each one.
For example, you can drop variables with more than 50% missing values unless they carry a very high weight of evidence.
Some variables can be substituted with a measure of central tendency, or the missing values can be predicted as well.
Categoricals can be replaced by UNK (unknown), and so on.
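As a minimal sketch of the mode replacement (the data here is made up): the original snippet first filters the frame down to only the '?' rows and then reassigns that subset to df1, which is why the result looks wrong. Replacing values inside the column and assigning back avoids that:

import pandas as pd

df1 = pd.DataFrame({"workclass": ["Private", "?", "State-gov", "?"]})  # toy data

# Replace '?' in one column with that column's mode (computed without the '?' rows)
mode_value = df1.loc[df1["workclass"] != "?", "workclass"].mode()[0]
df1["workclass"] = df1["workclass"].replace("?", mode_value)

# Alternative: keep '?' as its own "unknown" category instead of imputing
# df1["workclass"] = df1["workclass"].replace("?", "UNK")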

Creating new groups by pattern

I have JSON with data about some products and I have already converted it into a flat table with pandas, so now I have a few columns of data. I manually selected some products and put them into one group. I sorted them by name, for example, but it is more complicated than that; there are also some features and requirements which need to be checked.
What I want is to create a script which will group my products in a similar way to the few groups I created manually based on my own judgment.
I'm totally new to machine learning, but I have read about it and watched some tutorials; I just haven't seen this type of case.
I saw that if I use a KNN classifier, for example, I have to provide every group that exists as input and then it will assign each product to one of those groups. But my case must be more complicated, I guess, since I want the script to create those groups on its own, in a way similar to the ones I selected.
I was thinking about unsupervised machine learning, but that doesn't look like the solution because I have my own data which I want to provide; it seems like I need some kind of hybrid with supervised machine learning.
import pandas as pd
from pandas import json_normalize
from sklearn import preprocessing

data = pd.read_json('recent.json')['results']
data = json_normalize(data)
le = preprocessing.LabelEncoder()
product_name = le.fit_transform(data['name'])
Just some code to show what I have done.
I don't know if that makes sense what I want, I already made attempt to this problem in normal way without machine learning just by If and loop things, but I wish I could do that also in "smarter" way
The code above shows nothing. If you have data about products where each entry contains feature fields, you can cluster them with an unsupervised method (hierarchical clustering, given the dendrogram described below).
I have to put in input every group that exists
No. Just define a metric between two entries, and the method builds the classes, or an entire dendrogram, according to it, so you can select classes from the dendrogram as you want. If you look at each node there, it contains the common features of the items in its class, so it provides an auto-description for the class.
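A minimal sketch of that idea using hierarchical clustering from scipy (the feature columns here are assumptions, not taken from the original data):

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Assume `data` is the flattened product table; the column names are made up
features = pd.get_dummies(data[["name", "category", "requirement"]])

# Build the hierarchy from pairwise distances between products
Z = linkage(features.values, method="average", metric="euclidean")

# Cut the dendrogram into a chosen number of groups and compare them
# with the groups that were created manually
data["group"] = fcluster(Z, t=5, criterion="maxclust")
print(data.groupby("group")["name"].apply(list))

# dendrogram(Z) can also be plotted to pick the cut level visually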

Encoding huge categorical features

I am currently working on a big categorical dataset using both pandas and sklearn. As I now want to do feature extraction, or at least be able to build a model, I need to encode the categorical features, as they are not handled by sklearn models.
Here is the issue: for one of the categories, I have more than 110,000 different values. I cannot possibly one-hot encode this, as there would be (and there is) a memory error.
Also, this feature cannot be removed from the dataset, so that option is out of the question.
So I have multiple ways to deal with this:
Use FeatureHasher from sklearn (as mentioned in several related topics). But while it is easy to encode with it, there is no mention of how to link the hashed feature back to the DataFrame and thus complete the feature extraction.
Use fuzzy matching to reduce the number of values in the feature and then use one-hot encoding. The issue here is that it will still create many dummy variables, and I will lose the interpretability of the encoded feature.
So I have multiple questions: first, do you know how to use FeatureHasher to link a pandas DataFrame to an sklearn model through hashing? If so, how am I supposed to do it?
Second, can you think of any other way to do it that might be easier or would work better for my problem?
Here are some screenshots of the dataset, so that you understand the issue more fully.
Here is the output of the number/percentage of different values per feature.
Also, the biggest feature to encode, called 'commentaire' (commentary in English), contains strings that are sometimes long sentences and sometimes just short words.
As requested, here is the current code to test Feature Hashing:
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(input_type='string')
hashedCommentaire = fh.transform(dataFrame['commentaire'])
This raises a MemoryError, so I can reduce the number of features; let's set it to 100:
fh = FeatureHasher(input_type='string', n_features=100)
hashedCommentaire = fh.transform(dataFrame['commentaire'])
print(hashedCommentaire.toarray())
print(hashedCommentaire.shape)
It runs without an error and outputs the result shown in the 'feature hashing output' screenshot.
Can I then directly use the hashed result in my DataFrame? The issue is that we totally lose track of the values of 'commentaire'. If we then want to predict on new data, will the hashing output be consistent with the previous one? Also, how do we know the hashing is "sufficient"? Here I used 100 features, but there were more than 110,000 distinct values to begin with.
Thanks to another observation, I began to explore the use of tf-idf. A sample of 'commentaire' values is shown in the "'commentaire' feature sample" screenshot.
As we can see, the different values are strings, sometimes full sentences. The thing is, while exploring this data, I noticed that some values are very close to each other; that's why I wanted to explore the idea of fuzzy matching at first, and now tf-idf.
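For what it's worth, here is a rough sketch (column names other than 'commentaire' are made up) of how hashed or tf-idf text features are usually joined back to the rest of the data: keep them sparse and stack them with the remaining columns rather than writing them into the DataFrame itself:

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

text = dataFrame["commentaire"].fillna("")

# Option 1: a fixed-size hash of the text (stateless, so new data hashes
# to the same columns as long as n_features stays the same)
hashed = HashingVectorizer(n_features=2**10).transform(text)

# Option 2: tf-idf, which keeps a vocabulary; reuse the fitted vectorizer
# on new data so the mapping stays consistent
tfidf = TfidfVectorizer(max_features=5000)
text_features = tfidf.fit_transform(text)

# Stack the sparse text features with the remaining numeric columns
# ('note' is a hypothetical numeric column)
X = hstack([text_features, dataFrame[["note"]].values])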

How to convert new non-numerical input data into numerical data

OUTLINE:
I want to know how to keep newly input non-numerical data encoded as unique values!
DESCRIBE:
Namely, once both the training and testing periods are finished, a lot of new data that mixes numerical and non-numerical values is fed into my model.
ISSUE(1):
Thus, the first important matter is to convert this new mixed data into a DataFrame in an all-numerical format!
TRIED METHOD:
How can we do that? I used LabelEncoder to transform every non-numerical value into a float, but a scikit-learn maintainer replied that LabelEncoder is only meant for the label [y] and that I should use OneHotEncoder to convert features.
[He answered me at https://github.com/scikit-learn/scikit-learn/issues/8674, while misunderstanding my meaning.]
Unfortunately, OneHotEncoder is only available for integer features, as the official scikit-learn documentation says on p. 1829.
I know there is an example titled 'Feature Union with Heterogeneous Data Sources', but it is not as convenient as LabelEncoder.
ISSUE(2):
The second reason this issue confuses me concerns how we can ensure that newly input non-numerical data is transformed into a unique value that differs from the previously transformed training or test data. Namely, although we can use LabelEncoder to transform new input data into numerical data, there is a risk that a value transformed from the new data might collide with a value from the previously transformed data.
Thanks for your help in advance.
OK, the official maintainers have answered me. The following URL may be helpful for anybody confused about this issue:
https://github.com/amueller/introduction_to_ml_with_python/blob/master/04-representing-data-feature-engineering.ipynb
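In the same spirit as that notebook, here is a minimal sketch (with made-up data) of how pd.get_dummies can be kept consistent between the training data and newly arriving data, so a given category always maps to the same column:

import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "green"], "x": [1, 2, 3]})
new = pd.DataFrame({"color": ["green", "red"], "x": [4, 5]})

# Encode the training data and remember the resulting columns
train_encoded = pd.get_dummies(train, columns=["color"])
columns = train_encoded.columns

# Encode the new data the same way, then align it to the training columns:
# unseen categories are dropped and missing ones are filled with 0
new_encoded = pd.get_dummies(new, columns=["color"]).reindex(columns=columns, fill_value=0)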

HDF5 Links to Events in Dataset

I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, mpi, etc).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns -- the first column is a string and contains a label for the event ("REM1"), and the second/third column contains the start/end index respectively. The only reason I don't like this solution is because HDF5 datasets are pretty set in size -- if I decide later that a period of REM sleep was mis-identified and I need to add/remove that event, the dataset size would need to change (and deleting the dataset/recreating it with a new size is suboptimal). Compound this by the fact that I may have MANY events (imagine marking eyeblink events), this becomes more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a Region Reference: essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and numpy slicing syntax, so if you have a dataset called ds and the start and end indices of your REM period, you can do:
rem_ref = ds.regionref[start:end]
ds.attrs['REM1'] = rem_ref
ds[ds.attrs['REM1']] # Will be a 1-d set of values
You can store regionrefs pretty naturally — they can be attributes on a dataset, objects in a group, or you can create a regionref-type dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there. Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but you run into the same "datasets don't handle variable length very well" issue.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
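As a rough sketch of the group approach (the file, dataset and event names are made up, and this assumes region references can be stored as attributes on a group the same way they can on a dataset):

import h5py

with h5py.File("sleep_study.h5", "a") as f:
    eeg = f["EEG"]                            # existing time-series dataset
    events = f.require_group("REM_periods")   # group holding the event markers

    # One attribute per event; attributes can be added or deleted later
    # without resizing any dataset
    for label, (start, end) in {"REM1": (12000, 45000), "REM2": (80000, 96000)}.items():
        events.attrs[label] = eeg.regionref[start:end]

    # Reading back: dereference the region through the original dataset
    rem1 = eeg[events.attrs["REM1"]]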
