I'm evaluating two different unsupervised ML algorithms, Isolation Forest and an LSTM autoencoder, to identify anomalies in a large time-series dataset. The dataset consists mostly of categorical data such as IP addresses, cloud subscription IDs, tenant IDs, userAgents, and client application IDs.
While reading a tutorial on TensorFlow Decision Forests (TF-DF), I noticed it mentions that the model handles non-label categorical values natively, and
that there is no need for preprocessing in the form of one-hot encoding, normalization, or an extra is_present feature.
Does anybody know how TensorFlow handles categorical features behind the scenes (assuming it applies some transformation into a numeric representation)?
Tl;dr: There is a natural way of using categorical features in decision trees/forests that requires no encoding. TensorFlow Decision Forests uses this and a number of standard transformations to handle categorical features.
TensorFlow Decision Forests (TF-DF) constructs decision tree / decision forest models. A single decision tree recursively splits the dataset along its features. Splits along categorical features can naturally be performed through so-called in-set conditions. For instance, a tree can express a condition like userAgents ∈ {"Mozilla/5.0", "InternetExplorer/10.0"}. Other types of conditions are also possible. TF-DF can construct in-set conditions whenever the dataset contains categorical features.
More specifically, TensorFlow Decision Forests uses the C++ library Yggdrasil Decision Forests (YDF) under the hood for the advanced computations. YDF offers three different algorithms for finding a good categorical split of the data. For example, the Random algorithm simply tries out many possible splits at random and keeps the best one.
For performance and quality reasons, YDF also preprocesses categorical features: If a categorical value is very rare, YDF may consider it “out-of-dictionary”, the threshold for “rare” being user-configurable. Furthermore, YDF maps the categorical features to integers by decreasing item frequency, with the mapping stored as part of the model. Note that this is purely an internal encoding; the algorithms are aware that a feature is categorical, hence typical issues with integer encodings do not apply.
Finally, TensorFlow Decision Forests (TF-DF) uses Keras, which expects classification tasks to have an integer label. Therefore, TF-DF users have to encode the label themselves or use the built-in pd_dataframe_to_tf_dataset.
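For illustration, here is a minimal sketch (the toy DataFrame and column names are made up) of feeding raw string categoricals to TF-DF and letting the built-in helper encode the label:

```python
import pandas as pd
import tensorflow_decision_forests as tfdf

# Toy data: raw string categoricals, no one-hot encoding or normalization.
df = pd.DataFrame({
    "userAgent": ["Mozilla/5.0", "InternetExplorer/10.0", "Mozilla/5.0", "curl/7.68"],
    "tenantId":  ["t1", "t2", "t1", "t3"],
    "label":     ["normal", "anomaly", "normal", "anomaly"],
})

# pd_dataframe_to_tf_dataset integer-encodes the string label for us.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

# The categorical features are consumed as-is; TF-DF builds in-set splits on them.
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)
```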
Note that this answer only applies to TensorFlow Decision Forests. Other parts of TensorFlow may need manual encoding.
Related
In Python, can I get the 100 best features out of 200k by performing Linear Discriminant Analysis on data with 2 classes?
Although LDA is typically used for multi-class problems, it can also be applied to binary classification problems.
You can use LDA for dimensionality reduction, which aims to reduce the number of features. Feature selection, on the other hand, is the process of selecting a subset of the original features.
So LDA is a kind of feature extraction, not feature selection. This means LDA will create a new set of features rather than selecting the best of the existing ones.
In essence, the original features no longer exist: new features are constructed from the available data, and they are not directly comparable to the original data [1].
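To make the extraction-vs-selection point concrete, here is a small sketch with made-up data; with 2 classes, scikit-learn's LinearDiscriminantAnalysis can produce at most one new component, not a ranking of your 200k original features:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(100, 20)                 # 100 samples, 20 original features
y = rng.randint(0, 2, size=100)        # 2 classes

# n_components is capped at n_classes - 1 = 1 for a binary problem.
lda = LinearDiscriminantAnalysis(n_components=1)
X_new = lda.fit_transform(X, y)

print(X_new.shape)  # (100, 1): one constructed feature, not a subset of the original 20
```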
Check this link for further reading
[1] Linear Discriminant Analysis for Dimensionality Reduction in Python
When utilizing classifiers from scikit-learn for multi-class problems, is it necessary to encode the labels with one-hot encoding? For example, I have 3 classes and simply labeled them as 0, 1, and 2 when feeding this data into the different classifiers for training. As far as I can tell, it seems to be working normally. But is there any reason this kind of basic encoding is not recommended?
Some algorithms, like random forests, handle categorical values natively, and methods such as logistic regression, multilayer perceptron, and Gaussian naive Bayes also appear to handle them natively, if I'm not mistaken. Is that assessment correct? Which of scikit-learn's classifiers do not handle these inputs natively and are influenced by ordinality?
All scikit-learn classifiers handle multi-class problems automatically.
Internally, the labels will be converted appropriately: either a simple encoding to 0, 1, 2, etc. if the algorithm supports multi-class problems natively, or a one-hot encoding if the algorithm handles multi-class problems by reducing them to binary ones.
Please refer to the documentation to see this:
All scikit-learn classifiers are capable of multiclass classification,...
You can see that "logistic regression, multilayer perceptron, Gaussian naive Bayes, and random forest" are under the heading "Inherently multiclass".
Others, like SGDClassifier or LinearSVC, use a one-vs-rest approach to handle multi-class problems, but as I said above, that is handled internally by scikit-learn, so as a user you don't need to do anything and can pass multi-class labels (even as strings) in a single array y to all classification estimators.
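As a quick sanity check (toy data made up for illustration), you can pass string labels directly and let scikit-learn do the internal conversion:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = ["cat", "dog", "bird", "cat"]          # string labels, no manual encoding

# Inherently multiclass estimator:
LogisticRegression().fit(X, y)

# Multiclass handled internally via one-vs-rest:
print(LinearSVC().fit(X, y).predict([[0.5, 0.5]]))  # prediction comes back as a string label
```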
The only case where the user needs to explicitly convert labels to a one-hot (binary indicator) encoding is the multi-label problem, where more than one label can be predicted for a sample. But I think your question is not about that.
I am training a neural network which has 10 or so categorical inputs. After one-hot encoding these categorical inputs I end up feeding around 500 inputs into the network.
I would love to be able to ascertain the importance of each of my categorical inputs. Scikit-learn has numerous feature importance algorithms; however, can any of these be applied to categorical inputs? All of the examples use numerical inputs.
I could apply these methods to the one-hot encoded inputs, but how would I extract the meaning after applying them to the binarised inputs? How does one go about judging feature importance on categorical inputs?
Using feature selection algorithms on one-hot encoded features might be misleading because of the relations between the encoded features. For example, if you encode a feature of n values into n binary features and n-1 of them are selected, the last one is redundant (it is fully determined by the others).
Since the number of your original features is quite low (~10), feature selection will not help you much, since you'll probably only be able to drop a few of them without losing too much information.
You wrote that the one-hot encoding turns the 10 features into 500, meaning that each feature has about 50 values. In this case you might be more interested in discretisation algorithms that operate on the values themselves. If there is an implied order on the values, you can use algorithms for continuous features. Another option is simply to omit rare values, or values without a strong correlation to the concept.
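For example, a simple way to merge rare values (a hand-rolled pandas helper; the name and threshold are just for illustration):

```python
import pandas as pd

def collapse_rare_values(column: pd.Series, min_count: int = 10) -> pd.Series:
    """Replace values seen fewer than `min_count` times with a single 'other' bucket."""
    counts = column.value_counts()
    rare = counts[counts < min_count].index
    return column.where(~column.isin(rare), "other")

# e.g. df["userAgent"] = collapse_rare_values(df["userAgent"], min_count=20)
```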
If you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, mutual information, suggested by @Igor Raush, is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values. That in turn can lead to higher mutual information and a bias toward features with many values. A way to cope with this is to normalize by dividing the mutual information by the feature's entropy.
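A rough sketch of that normalization (the helper name is mine; it uses scikit-learn's mutual_info_score and scipy's entropy, both in nats):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def normalized_mutual_information(feature, target):
    """Mutual information between a categorical feature and the target,
    divided by the feature's entropy to reduce the bias toward
    features with many distinct values."""
    mi = mutual_info_score(target, feature)
    _, counts = np.unique(feature, return_counts=True)
    h = entropy(counts)          # entropy() normalizes the counts to probabilities
    return mi / h if h > 0 else 0.0
```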
Another set of feature selection algorithms that might help you are the wrappers. They delegate the learning to the classification algorithm itself, and are therefore indifferent to the representation as long as the classification algorithm can cope with it.
I have a question about ensemble feature selection.
My dataset consists of 1000 samples with about 30,000 features, and the samples are classified into label A or label B.
What I want to do is pick out some features that can classify the label efficiently.
I used three types of methods, a univariate method (Pearson's coefficient), lasso regression, and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python's scikit-learn for the feature selection.
Then I started thinking about an ensemble feature selection approach, because the number of features is so large. In this case, what is the way to build an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and running lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of the features from the three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case, I believe the best way to select features here is to select the ones that optimize your goal, which seems to be classification accuracy, hence the cross-validation proposal.
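A sketch of what that could look like with scikit-learn's RFECV (here X and y stand for your 1000 × 30000 data and A/B labels; the step size and scoring are illustrative choices):

```python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

svc = LinearSVC(C=1.0, max_iter=5000)

selector = RFECV(
    estimator=svc,
    step=0.1,                    # drop 10% of the remaining features each iteration
    cv=StratifiedKFold(5),
    scoring="accuracy",
)
selector.fit(X, y)

print(selector.n_features_)      # number of features kept
mask = selector.support_         # boolean mask over the original feature columns
```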
I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At its core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised, BoW data in scikit-learn?
Thanks!
I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depend on the type of vectors you have (frequency counts, tf-idf), and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
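Both work directly on a sparse CSR matrix; a minimal sketch (X is your bag-of-words or tf-idf matrix, and the number of components is just an example):

```python
from sklearn.decomposition import NMF, TruncatedSVD

svd = TruncatedSVD(n_components=300)   # LSA / LSI
X_lsa = svd.fit_transform(X)

nmf = NMF(n_components=300)            # requires non-negative input, which BoW counts are
X_nmf = nmf.fit_transform(X)
```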
An increasingly popular approach uses transformation by random projections (Random Indexing). You can do this in scikit-learn with the functions in random_projection.
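For example (again with X as the sparse BoW matrix and an illustrative target dimensionality):

```python
from sklearn.random_projection import SparseRandomProjection

rp = SparseRandomProjection(n_components=1000, random_state=0)
X_rp = rp.fit_transform(X)             # stays sparse and is very cheap to compute
```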
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics but rather latent dimensions), then LDA might be the better option. Beware, it is slow and it only takes pure frequency counts (no tf-idf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would make a scikit-learn pipeline with a vectorizer, a dimensionality reduction step, and a classifier, and carry out a massive parameter search. That way, you will see what gives you the best results with the label set you have.
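Something along these lines (a sketch assuming, for simplicity, one label per document; `documents` is a list of raw strings, `labels` the corresponding classes, and the grid values are illustrative):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("dim", TruncatedSVD()),
    ("clf", LinearSVC()),
])

param_grid = {
    "dim__n_components": [100, 300, 500],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(documents, labels)
print(search.best_params_, search.best_score_)
```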
You can use Latent Dirichlet Allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution of the document label given the distribution over the topics in your document. If you already have labels for your documents, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
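A minimal gensim sketch (here `texts` is assumed to be your corpus as a list of tokenized documents; the number of topics is illustrative):

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)                   # texts: [["the", "cat", ...], ...]
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = models.LdaModel(bow_corpus, num_topics=50, id2word=dictionary)

# Per-document topic distributions, usable as inputs for learning the label CPD.
doc_topics = [lda.get_document_topics(bow) for bow in bow_corpus]
```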
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that you start here (note: it still assumes a high level of mathematical maturity).
Several existing scikit-learn modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It finds a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
Cross decomposition includes methods like Partial Least Squares, which fit linear regression models for multidimensional targets via a projection through a lower-dimensional intermediate space. It is a lot like a single-hidden-layer neural net without the sigmoids.
These are linear regression methods, but you could apply a 0-1 encoding to your target signal
and use these models anyway.
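Roughly like this (a sketch; as far as I know the cross-decomposition estimators need dense input, so X is assumed dense here, and the number of components is illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import label_binarize

Y = label_binarize(y, classes=np.unique(y))   # one 0/1 column per class

pls = PLSRegression(n_components=100)
pls.fit(X, Y)
X_reduced = pls.transform(X)                  # projected scores in the intermediate space
```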
You could use an L1-regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping to get a more stable feature set.
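For instance, a sketch with an L1-penalized logistic regression wrapped in SelectFromModel (the C value is illustrative):

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# The L1 penalty drives many coefficients to exactly zero.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

selector = SelectFromModel(l1_clf)
X_selected = selector.fit_transform(X, y)     # keeps only columns with non-zero coefficients

print(X_selected.shape)
```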
Try ISOMAP. There's a super simple built-in function for it in scikits.learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The hidden-layer representation is by definition optimised to distinguish between the classes, since that is what is directly optimised when the weights are learned.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).
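A sketch of that idea with scikit-learn's MLPClassifier (assuming a dense X for the manual matrix product; the hidden-layer size is illustrative, and the softmax output is handled automatically for multi-class targets):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# One hidden tanh layer of 50 units acts as the learned low-dimensional representation.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation="tanh", max_iter=500)
mlp.fit(X, y)

# Recompute the hidden-layer activations by hand from the learned weights.
hidden = np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])
print(hidden.shape)   # (n_samples, 50)
```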