Background: I am working on a binary classification of health insurance claims. The data I am working with has approximately 1 million rows and a mix of numeric and categorical features (all of which are nominal/discrete). The issue I am facing is that several of my categorical features have high cardinality, with many values that are very uncommon or unique. I have plotted the 8 categorical features with the highest counts of unique factor levels below:
Alternative to Dummy Variables: I have been reading up on feature hashing and understand that this method is a fast and space-efficient alternative for vectorizing features, and is particularly suitable for categorical data with high cardinality. I plan to use scikit-learn's FeatureHasher to perform feature hashing on my categorical features with more than 100 unique levels (I will create dummy variables for the remaining categorical features with fewer than 100 unique levels). Before I implement this, I have a few questions about feature hashing and how it relates to model performance in machine learning:
What is the primary advantage of using feature hashing as opposed to dummying only the most frequently occurring factor levels? I assume there is less information loss with the feature hashing approach, but I need more clarification on what advantages hashing provides to machine learning algorithms when dealing with high cardinality.
I am interested in evaluating feature importance after evaluating a few separate classification models. Is there a way to evaluate hashed features in the context of how they relate to the original categorical levels? Is there a way to reverse hashes or does feature hashing inevitably lead to loss of model interpretability?
Sorry for the long post and questions. Any feedback/recommendations would be much appreciated!
Feature hashing can support new categories during inference that were not seen in training. With dummy encoding, you can only encode a fixed set of previously seen categories. If you encounter a category not seen in training, you're out of luck.
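For illustration, here is a minimal sketch of that point (the provider_id column and its values are made up). Because the hasher is stateless, there is no fitted vocabulary for an unseen category to fall outside of:

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of "column=value" strings (input_type="string").
train_rows = [["provider_id=A123"], ["provider_id=B456"], ["provider_id=C789"]]
new_rows = [["provider_id=Z999"]]  # a value never seen during training

hasher = FeatureHasher(n_features=256, input_type="string")
X_train = hasher.transform(train_rows)  # sparse matrix, shape (3, 256)
X_new = hasher.transform(new_rows)      # still shape (1, 256); no error,
                                        # unlike a fitted one-hot encoder
print(X_train.shape, X_new.shape)
```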
For feature importance, there are two canonical approaches.
a) Train/evaluate your model with and without each feature to see its effect. This can be computationally expensive.
b) Train/evaluate your model with the feature and also with that feature permuted among all samples.
With feature hashing, each original feature expands to multiple columns, so (b) will be tricky, and I haven't found any packages that do permutation importance on feature-hashed columns.
So, I think a) is probably your best bet, considering you only have 1 million rows.
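Here is a rough sketch of approach a), sometimes called drop-column importance. It assumes your data lives in pandas DataFrames, and the feature_groups mapping (original feature name to the hashed/dummy columns derived from it), the random forest, and the AUC metric are just placeholders to make the idea concrete:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def drop_column_importance(X_train, y_train, X_val, y_val, feature_groups):
    """feature_groups maps an original feature name to the list of columns
    derived from it (e.g. all hashed or dummy columns of that feature)."""
    base = RandomForestClassifier(n_estimators=200, random_state=0)
    base.fit(X_train, y_train)
    base_score = roc_auc_score(y_val, base.predict_proba(X_val)[:, 1])

    importances = {}
    for name, cols in feature_groups.items():
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_train.drop(columns=cols), y_train)
        score = roc_auc_score(
            y_val, model.predict_proba(X_val.drop(columns=cols))[:, 1]
        )
        importances[name] = base_score - score  # drop in AUC = importance
    return importances
```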
Also, you'll probably get better answers to ML questions like this on Cross Validated than on Stack Overflow.
Related
I'm evaluating two different unsupervised ML algorithms, Isolation Forest and an LSTM autoencoder, to identify anomalies in a large time-series dataset. This dataset consists mostly of categorical data such as IP addresses, cloud subscription IDs, tenant IDs, userAgents, and client application IDs.
While reading a tutorial on an implementation of a TensorFlow Decision Forests (TF-DF) model, I noticed it mentions that the model handles non-label categorical values natively and that
there is no need for preprocessing in the form of one-hot encoding, normalization or extra is_present feature.
Does anybody know how Tensorflow handles the categorical features behind the scenes (assuming they do some transformation into a numeric representation)?
Tl;dr: There is a natural way of using categorical features in decision trees/forests that requires no encoding. Tensorflow Decision Forests uses this and a number of standard transformations to handle categorical features.
Tensorflow Decision Forests (TF-DF) constructs decision tree / decision forest models. A single decision tree recursively splits the dataset along its features. Splits along categorical features can naturally be performed through so-called in-set conditions. For instance, a tree can express a condition like userAgents ∈ {"Mozilla/5.0", "InternetExplorer/10.0"}. Other types of conditions are also possible. Tensorflow Decision Forests (TF-DF) can construct in-set conditions if the dataset contains categorical features.
More specifically, Tensorflow Decision Forests uses the C++ library Yggdrasil Decision Forests (YDF) under the hood for any advanced computations. YDF offers three different algorithms for finding a good categorical split of the data. For example, the Random algorithm will just try out many possible splits at random and pick the best one.
For performance and quality reasons, YDF also preprocesses categorical features: If a categorical value is very rare, YDF may consider it “out-of-dictionary”, the threshold for “rare” being user-configurable. Furthermore, YDF maps the categorical features to integers by decreasing item frequency, with the mapping stored as part of the model. Note that this is purely an internal encoding; the algorithms are aware that a feature is categorical, hence typical issues with integer encodings do not apply.
Finally, Tensorflow Decision Forests (TF-DF) uses Keras, which expects classification tasks to have an integer label. Therefore, TF-DF users have to encode the label themselves or use the built-in pd_dataframe_to_tf_dataset.
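For example, a minimal sketch of feeding raw string categorical columns to TF-DF (the DataFrame and its column values are invented; only the label column is an integer):

```python
import pandas as pd
import tensorflow_decision_forests as tfdf

df = pd.DataFrame({
    "userAgent": ["Mozilla/5.0", "InternetExplorer/10.0", "Mozilla/5.0"],
    "tenantId": ["t-01", "t-02", "t-01"],
    "label": [0, 1, 0],  # already an integer label
})

# No one-hot encoding of the string columns; the conversion is handled here.
ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

model = tfdf.keras.RandomForestModel()
model.fit(ds)
```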
Note that this answer only applies to Tensorflow Decision Forests. Other parts of Tensorflow may need manual encoding.
There seem to be many techniques for reducing dimensionality (PCA, SVD, etc.) in order to escape the curse of dimensionality. But how do you know that your dataset actually suffers from high-dimensionality problems? Is there a best practice, like a visualization, or can one even use KNN to find out?
I have a dataset with 99 features and 1 continuous label (price) and 30 000 instances.
The curse of dimensionality refers to the relationship between feature dimensionality and data size: it has been pointed out that as the number of features grows, the amount of data needed to model the problem successfully grows exponentially.
The practical problem arises from that exponential growth in the required data, because you then have to think about how to handle it properly (the storage and computation power needed).
So we usually experiment to figure out the right dimensionality for the problem (for example with cross-validation) and select only those features. Also keep in mind that using lots of features comes with a high risk of overfitting.
You can use either feature selection or feature extraction for dimensionality reduction: LASSO can be used for feature selection, and PCA or LDA for feature extraction.
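For example, a hedged sketch of both routes with scikit-learn, on synthetic data shaped like the 99-feature / 30,000-row problem described above (the alpha and the 95% variance threshold are arbitrary choices):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 30,000 x 99 regression dataset with a price-like label.
X, y = make_regression(n_samples=30_000, n_features=99, noise=10.0, random_state=0)

# Feature selection: keep only features whose LASSO coefficient is non-zero.
selector = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=0.1)))
X_selected = selector.fit_transform(X, y)

# Feature extraction: project onto components explaining 95% of the variance.
extractor = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_extracted = extractor.fit_transform(X)

print(X_selected.shape, X_extracted.shape)
```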
I am taking the fastai Intro to Machine Learning course, and in Lesson 1 he uses a Random Forest on the Blue Book for Bulldozers dataset from Kaggle.
In what seemed like a curious move to me, the instructor did not use pd.get_dummies() or OneHotEncoder from scikit-learn to handle categorical data. Instead he called pd.Series.cat.codes on all categorical columns.
I noticed that when the fit() method was called, it computed much faster (about 1 minute) on the dataset using pd.Series.cat.codes, whereas the dataset with the dummy variables crashed a virtual server I had running with 60 GB of RAM.
The memory each DataFrame occupied was about the same: 54 MB. I'm curious why one DataFrame is so much more performant than the other?
Is it because with a single column of integers a Random Forest only considers the average of that column as its cut point, thus making it easier to compute? Or is it something else?
To understand this better, we need to look at how tree-based models work. In a tree-based algorithm, the data is split into bins based on a feature and its values. The splitting algorithm considers all possible splits and learns the optimal one (the one that minimizes the impurity of the resulting bins).
When a continuous numeric feature is considered for a split, there are many candidate values at which the tree can split.
Categorical features are at a disadvantage: they offer only a few splitting options, which results in very sparse decision trees. This gets worse for a category with just two levels.
Dummy variables are also created to prevent the model from learning false ordinality. Since tree-based models work by splitting, this is not an issue, so there is no need to create dummy variables.
pd.get_dummies will add k (or k-1 if drop_first=True) columns to your DataFrame. With a very large k, the random forest has more choices to make when sub-selecting features, which makes each tree longer to train.
You could use the max_features parameter to limit the number of features used when training each tree, but the scikit-learn implementation of the algorithm doesn't take into account that your dummy variables actually come from a single feature, meaning it could select only a subset of the dummies from your categorical variable.
This could lead to sub-par performance of your model. I'm guessing this is why fastai uses pd.Series.cat.codes.
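As a small illustration of the two encodings being compared (toy data, not the bulldozers set):

```python
import pandas as pd

df = pd.DataFrame({"state": ["CA", "NY", "CA", "TX"], "price": [10, 12, 9, 15]})

# Integer codes: a single column, one integer per category level.
df["state_code"] = df["state"].astype("category").cat.codes

# Dummy variables: k (or k-1) extra columns, mostly zeros.
dummies = pd.get_dummies(df["state"], prefix="state")

print(df[["state_code"]].shape)  # (4, 1)
print(dummies.shape)             # (4, 3)
```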
I am training a neural network which has 10 or so categorical inputs. After one-hot encoding these categorical inputs I end up feeding around 500 inputs into the network.
I would love to be able to ascertain the importance of each of my categorical inputs. Scikit-learn has numerous feature importance algorithms, but can any of these be applied to categorical data inputs? All of the examples use numerical inputs.
I could apply these methods to the one-hot encoded inputs, but how would I extract the meaning after applying to binarised inputs? How does one go about judging feature importance on categorical inputs?
Using feature selection algorithms on one-hot encoded features might be misleading because of the relations between the encoded features. For example, if you encode a feature with n values into n binary features and n-1 of them are among the selected features, the last one adds no information.
Since the number of your features is quite low (~10), feature selection won't help you much, since you'll probably only be able to drop a few of them without losing too much information.
You wrote that one-hot encoding turns the 10 features into 500, meaning that each feature has about 50 values. In this case you might be more interested in discretisation algorithms, which manipulate the values themselves. If there is an implied order on the values, you can use algorithms for continuous features. Another option is simply to omit rare values or values without a strong correlation to the concept.
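A quick sketch of that last suggestion, collapsing rare values into a single bucket before encoding (the column and the frequency threshold are just placeholders):

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "c", "c", "c", "d"])  # toy categorical column
counts = s.value_counts()
rare = counts[counts < 2].index                      # arbitrary rarity threshold
s_grouped = s.where(~s.isin(rare), other="other")    # keep common, bucket the rest
print(s_grouped.value_counts())
```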
If you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, the mutual information suggested by Igor Raush is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values, which in turn can lead to higher mutual information and a bias toward features with many values. A way to cope with this problem is to normalize by dividing the mutual information by the feature's entropy.
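A small sketch of that normalization with scikit-learn and scipy (toy feature and target; both functions use the natural logarithm, so the units match):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

feature = np.array(["a", "b", "b", "c", "c", "c"])
target = np.array([0, 1, 1, 0, 0, 1])

mi = mutual_info_score(feature, target)
_, counts = np.unique(feature, return_counts=True)
h_feature = entropy(counts)          # entropy of the feature's value distribution
normalized_mi = mi / h_feature       # penalizes high-cardinality features
print(mi, normalized_mi)
```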
Another set of feature selection algorithms that might help you are the wrappers. They delegate the learning to the classification algorithm itself and are therefore indifferent to the representation, as long as the classification algorithm can cope with it.
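For example, a minimal wrapper-style sketch using scikit-learn's SequentialFeatureSelector around the classifier itself (synthetic data and arbitrary parameters, just to make it runnable):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=5,
    direction="forward",
    cv=3,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```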
I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8 GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880
You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.
Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a Python dict.
Also build the 150 category frequency vectors by averaging the frequency vectors of all the labeled documents belonging to the same category. Then, for each new document to label, compute the cosine similarity between the document vector and each category vector and choose the most similar category as the label for your document.
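A hedged sketch of that centroid / cosine-similarity recipe with scikit-learn (sublinear_tf gives the 1 + log(tf) weighting; the three-document corpus is obviously a toy stand-in for your 300k documents):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["cheap flights to paris", "flight deals and hotels", "python list comprehension"]
labels = np.array(["travel", "travel", "programming"])

vectorizer = TfidfVectorizer(sublinear_tf=True)  # tf -> 1 + log(tf)
X = vectorizer.fit_transform(docs)

# One centroid per category: the mean TF-IDF vector of its labeled documents.
categories = sorted(set(labels))
centroids = np.vstack([
    np.asarray(X[labels == c].mean(axis=0)).ravel() for c in categories
])

# Label a new document by its most similar centroid.
new_doc = vectorizer.transform(["best flight prices"])
sims = cosine_similarity(new_doc, centroids)[0]
print(categories[int(sims.argmax())])  # expected: "travel"
```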
If this is not good enough, then you should try to train a logistic regression model with an L1 penalty, as explained in this example from scikit-learn (it is a wrapper for liblinear, as explained by ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1+IDF) vectors to get good performance (precision and recall). The scikit-learn library also offers a sklearn.metrics module with routines to compute those scores for a given model and dataset.
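A corresponding sketch of the L1-penalized logistic regression route (liblinear is the solver that supports the L1 penalty here; the tiny corpus and C value are again placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["cheap flights to paris", "flight deals and hotels", "python list comprehension"]
labels = ["travel", "travel", "programming"]

clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
clf.fit(docs, labels)
print(clf.predict(["flight deals to paris"]))  # expected: ['travel']
```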
For larger datasets, you should try Vowpal Wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (but it doesn't have easy-to-use Python wrappers, AFAIK).
How big (in number of words) are your documents? Memory consumption at 150K training docs should not be an issue.
Naive Bayes is a good choice, especially when you have many categories with only a few training examples or very noisy training data. But in general, linear support vector machines perform much better.
Is your problem multiclass (each document belongs to exactly one category) or multilabel (each document belongs to one or more categories)?
Accuracy is a poor choice for judging classifier performance. You should instead use precision vs. recall, the precision-recall breakeven point (PRBP), F1, or AUC, and look at the precision-recall curve, where recall (x) is plotted against precision (y) as a function of your confidence threshold (whether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs. all other training examples that don't belong to the current category), and you'll have to choose an optimal confidence threshold per category. If you want to combine those per-category measures into a global performance measure, you can either micro-average (sum all true positives, false positives, false negatives, and true negatives, then compute combined scores) or macro-average (compute scores per category, then average them over all categories).
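A tiny sketch of the micro vs. macro distinction with scikit-learn (the label arrays are made up):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

micro = precision_recall_fscore_support(y_true, y_pred, average="micro")
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
print("micro (pooled TP/FP/FN):", micro[:3])
print("macro (mean of per-class scores):", macro[:3])
```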
We have a corpus of tens of millions of documents, millions of training examples, and thousands of categories (multilabel). Since we face serious training-time problems (the number of documents that are new, updated, or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems, using one of the Python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.
"Is there a way to have a 'none of the above' option for the classifier just in case the document doesn't fit into any of the categories?"
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
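If it helps, here is a rough pandas sketch of assembling such a training set; the docs DataFrame (with "text"/"category" columns) and the helper name are hypothetical:

```python
import pandas as pd

def build_training_set(docs, chosen, per_class=2000, seed=0):
    """docs: hypothetical DataFrame with 'text' and 'category' columns.
    chosen: the real categories to train on this round."""
    kept = docs[docs["category"].isin(chosen)]
    others = docs[~docs["category"].isin(chosen)]
    # ~13-14 docs from each remaining category when there are ~146 of them
    per_cat = max(1, per_class // others["category"].nunique())
    pseudo = (
        others.groupby("category", group_keys=False)
        .apply(lambda g: g.sample(min(len(g), per_cat), random_state=seed))
        .assign(category="none_of_the_above")
    )
    return pd.concat([kept, pseudo], ignore_index=True)
```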
Still feels like a bit of a kludge, and you might be better off with a completely different approach: find a multi-dimensional document measure that separates your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something... ;-)