Handling Categorical Data with Many Values in sklearn - python

I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally I would one-hot-encode categorical features, but there are so many unique values here that doing so would radically increase the dimensionality of my data. Since I have 12,000 rows of data and only about 10 other features, I might be OK though.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.

org_id does not seem to be a feature that brings any information for the classification; you should drop this column and not pass it to the classifier.
To a classifier you only want to pass features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of an organization brings no valuable information in this context, therefore it should not be used.
Edit following OP's comment:
Before going further let's state something: given the number of samples (12,000) and the relative simplicity of the model, you can easily make multiple attempts with different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. That gives you a baseline score, i.e. a score you can compare your other feature combinations against.
I think it costs nothing to try one-hot-encoding org_id; whatever result you observe will add to your experience and knowledge of how a Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will vastly outnumber them, and the classification results may be strongly influenced by these features.
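A minimal sketch of that one-hot experiment with pandas (the org_id column name is from the question; everything else, including the second feature, is made up for illustration):

```python
import pandas as pd

# Toy stand-in for the real data: org_id plus one other feature.
df = pd.DataFrame({
    "org_id": [0.0, 1.0, 2.0, 1.0],
    "tenure_months": [12, 3, 7, 25],  # hypothetical extra feature
})

# Treat org_id as a category, not a float, before encoding.
df["org_id"] = df["org_id"].astype(int).astype("category")

# One Boolean column per organization.
encoded = pd.get_dummies(df, columns=["org_id"], prefix="is_org_id")
print(encoded.columns.tolist())
```

With 417 organizations this adds 417 columns; on 12,000 rows a Random Forest can usually still cope, but compare the result against the drop-the-column baseline.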
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations. For instance, if they are all US organizations, their state (~50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design these features in a smart way.
Also, I would find it interesting if, once you solve your problem, you came back here and wrote another answer to your own question, as I believe many people run into such problems when working with real data :)

Related

RandomForest regressor with all independent variables categorical

I am stuck in the process of building a model. Basically I have 10 parameters, all of which are categorical variables, and even the categories have a large number of unique values (one category has 1,335 unique values across 300,000 records). The y value to be predicted is a number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure if this is the maximum achievable or if I really need to change the algorithm itself. I am open to any kind of solution.
Having up to 1335 categories for a categorical dimension might cause a random forest regressor (or classifier) some headache depending on how categorical dimensions are handled internally, and things will also depend on the distribution frequencies of the categories. What library are you using for the random forest regression?
Have you tried converting the categorical dimensions into unique integer IDs and interpreting this representation as a real-valued dimension? In my experience this can raise the variable importance of many kinds of categorical dimensions. (At times the inherent/initial ordering of the categories provides useful grouping/partitioning information.)
You can even shuffle your dimensions a few times and use these as input dimensions. I'll try to explain with an example:
You have a categorical dimension x1 with categories [c11,c12,...,c1n]
We can easily map these categories to numerical values by saying x1 has a value of 1 if its category is c11, a value of 2 if its category is c12, and in general a value of i for category c1i.
Use this new non-categorical dimension as an input dimension for training (you will have to change your input to the regressor accordingly later on).
You can go further than this. Randomly shuffle the order of the categories of x1, for example to [c13,c19,c1n,c1i,...,c12]. Do the same thing as above and you have another new non-categorical input dimension. (Remember the shuffling order so you can apply the same mapping at regression time later on.)
I'm curious whether adding a few dimensions like this (anywhere between 1 and 100, or whatever number you choose) can improve your performance.
See how performance changes for different numbers of such dimensions. (But be aware that more such dimensions will cost you preprocessing time at regression.)
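The integer-encoding and shuffling steps above can be sketched as follows (pure Python, with made-up category names; a real pipeline would apply the same mappings to the test data):

```python
import random

# Hypothetical categories for one categorical dimension x1.
categories = ["c11", "c12", "c13", "c14", "c15"]

# Step 1: map each category to its index -> one numeric dimension.
base_order = {c: i + 1 for i, c in enumerate(categories)}

# Step 2: shuffle the category order to get another numeric dimension.
rng = random.Random(0)  # fixed seed so the mapping is reproducible
shuffled = categories[:]
rng.shuffle(shuffled)
shuffled_order = {c: i + 1 for i, c in enumerate(shuffled)}

# Encode a raw column of categories as two numeric input dimensions.
column = ["c13", "c11", "c15", "c13"]
dim_a = [base_order[c] for c in column]
dim_b = [shuffled_order[c] for c in column]
print(dim_a)  # [3, 1, 5, 3]
print(dim_b)
```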
Combining multiple categorical dimensions at once into such an encoding is also possible; consider it only for inspiration.
Another idea would be to check whether some form of linear classifier on the one-hot encodings of each individual category, across multiple categorical dimensions, might be able to improve things. (This can help you find useful orderings more quickly than the approach above.)
I am sure you need to do more processing on your data.
Having 1,335 unique values in one variable is bizarre.
Please share the data with me if it is public; I would like to take a look.

How to determine most impactful input variables in a dataset?

I have a neural network program that takes in input variables and output variables, and uses forecasted data to predict what the output variables should be. After running this program, I get an output vector. Let's say, for example, my input matrix is 100 rows by 10 columns and my output is a vector of 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and created a list of the highest correlation between each variable and output, but I'm wondering if there is a better way to go about this.
If what you want is model selection, it's not as simple as studying the correlation of your features to your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into that matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e. every combination of features. With 10 features that is already 2^10 = 1024 models, and the count doubles with every additional feature, so this quickly becomes computationally unfeasible.
Another way is to take a variable you think is a good predictor and train the model on that variable only. Compute the error on the training data. Then pick another variable at random, retrain the model and recompute the error. If the error drops, keep the variable; otherwise discard it. Keep going for all features.
A third approach is the opposite: start by training the model on all features and sequentially drop variables (a less naïve approach drops the variables you intuitively think have little explanatory power), computing the error on the training data each time to decide whether to keep the feature.
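The second (forward) strategy can be sketched generically. Here score_fn is a stand-in for "train the model on this feature subset and return the training error"; the toy scorer below is made up purely to make the sketch runnable:

```python
def forward_select(all_features, score_fn):
    """Greedy forward selection: keep a feature only if adding it
    lowers the error reported by score_fn(feature_subset)."""
    selected = []
    best_error = float("inf")
    for feature in all_features:
        candidate = selected + [feature]
        error = score_fn(candidate)
        if error < best_error:
            best_error = error
            selected = candidate
    return selected, best_error

# Toy stand-in scorer: pretend only f1 and f3 reduce the error.
useful = {"f1": 2.0, "f3": 1.5}
def toy_score(features):
    # Error starts at 10 and drops for each genuinely useful feature.
    return 10.0 - sum(useful.get(f, 0.0) for f in features)

chosen, err = forward_select(["f1", "f2", "f3", "f4"], toy_score)
print(chosen, err)  # ['f1', 'f3'] 6.5
```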
There are a million ways of going about this. I've shown three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).

Multi-Output Classification using scikit Decision Trees

Disclaimer: I'm new to the field of Machine Learning, and even though I have done my fair share of research during the past month I still lack deep understanding on this topic.
I have been playing around with the scikit library with the objective of learning how to predict new data based on historic information, and classify existing information.
I'm trying to solve 2 different problems which may be correlated:
Problem 1
Given a data set containing rows R1 ... RN with features F1 ... FN, and a target per each group of rows, determine which group row R(N+1) belongs to.
Now, the target value is not singular; it's a set of values. The best solution I have been able to come up with is to represent each set of values as a concatenation: this creates an artificial class and allows me to represent multiple values using only one attribute. Is there a better approach?
What I'm expecting is to be able to pass a totally new set of rows and be told the target values for each of them.
Problem 2
Given a data set containing rows R1 ... RN with features F1 ... FN, predict the values of R(N+1) based on the frequency of the features.
A few considerations here:
Most of the features are categorical in nature.
Some of the features are dates, so when doing the prediction the date should be in the future relative to the historic data.
The frequency analysis needs to be done per row, because certain sets of values may be invalid.
My question here is: Is there any process/ML algorithm, which given historic data would be able to predict a new set of values based on just the frequency of the parameters?
If you have any suggestions, please let me know.
Regarding Problem 1, if you expect the different components of the target value to be independent, you can approach the problem as building a classifier for every component. That is, if the features are F = (F_1, F_2, ..., F_N) and the targets Y = (Y_1, Y_2, ..., Y_N), create a classifier with features F and target Y_1, a second classifier with features F and target Y_2, etc.
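scikit-learn's MultiOutputClassifier implements exactly this one-classifier-per-component scheme (here wrapping a decision tree, as in the question's title); a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up data: 6 samples, 2 features, targets with 2 components.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
Y = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [0, 1], [1, 0]])

# One decision tree is fitted independently per target component.
clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X, Y)

print(clf.predict([[1, 0]]))  # each component predicted by its own tree
```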
Regarding Problem 2, if you are not dealing with a time series, IMO the best you can do is simply predict the most frequent value for each feature.
That said, I believe your question is a better fit for another Stack Exchange site, such as Cross Validated.

Text classification in Python based on large dict of string:string

I have a dataset that would be equivalent to a dict of 5 millions key-values, both strings.
Each key is unique but there are only a couple hundreds of different values.
Keys are not natural words but technical references. The values are "families", grouping similar technical references. Similar is meant in the sense of "having similar regex", "including similar characters", or some sort of pattern.
Example of key-values:
ADSF33344 : G1112
AWDX45603 : G1112
D99991111222 : X3334
E98881188393 : X3334
A30-00005-01 : B0007
B45-00234-07A : B0007
F50-01120-06 : B0007
The final goal is to feed an algorithm with a list of new references (never seen before) and the algorithm would return a suggested family for each reference, ideally together with a percentage of confidence, based on what it learned from the dataset.
The suggested family can only come from the existing families found in the dataset. No need to "invent" new family name.
I'm not familiar with machine learning so I don't really know where to start. I saw some solutions through Sklearn or TextBlob and I understand that I'm looking for a classifier algorithm but every tutorial is oriented toward analysis of large texts.
Somehow, I can't see how to map my problem onto those examples, although it seems to be a "simpler" problem than analysing newspaper articles in natural language...
Could you point me to sources or tutorials that could help me?
Make a training dataset, and train a classifier. Most classifiers work on the values of a set of features that you define yourself. (The kind of features depends on the classifier; in some cases they are numeric quantities, in other cases true/false, in others they can take several discrete values.) You provide the features and the classifier decides how important each feature is, and how to interpret their combinations.
By way of a tutorial you can look at chapter 6 of the NLTK book. The example task, the classification of names into male and female, is structurally very close to yours: Based on the form of short strings (names), classify them into categories (genders).
You will translate each part number into a dictionary of features. Since you don't show us the real data, nobody can give you concrete suggestions, but you should definitely make general-purpose features as in the book, and in addition make a feature out of every clue, strong or weak, that you are aware of. If supplier IDs differ in length, make a length feature. If the presence (or number or position) of hyphens is a clue, make that a feature. If some suppliers' parts use a lot of zeros, ditto. Then add features for anything else, e.g. "first three letters", that might be useful. Once you have a working system, experiment with different feature sets and different classifier engines and algorithms until you get acceptable performance.
To get good results with new data, don't forget to split up your training data into training, testing and evaluation subsets. You could use all this with any classifier, but the NLTK's Naive Bayes classifier is pretty quick to train so you could start with that. (Note that the features can be discrete values, e.g. first_letter can be the actual letter; you don't need to stick to boolean features.)
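A sketch of such a feature extractor for the part numbers in the question (every feature here is an illustrative guess, not a known clue about the real data); NLTK's NaiveBayesClassifier consumes exactly these (features, label) pairs via nltk.NaiveBayesClassifier.train(train_set):

```python
def reference_features(ref):
    """Turn a technical reference into a feature dict.
    Each feature is an illustrative guess, not a known clue."""
    return {
        "length": len(ref),
        "num_hyphens": ref.count("-"),
        "num_zeros": ref.count("0"),
        "first_three": ref[:3],
        "starts_with_letter": ref[:1].isalpha(),
        "all_alnum": ref.isalnum(),
    }

# Build (features, family) pairs in the shape NLTK's classifier expects.
dataset = {"ADSF33344": "G1112", "A30-00005-01": "B0007"}
train_set = [(reference_features(k), v) for k, v in dataset.items()]
print(train_set[0])
```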

Impute multiple missing values in a feature-vector

Edited post
This is a short and somewhat clarified version of the original post.
We've got a training dataset (some features are significantly correlated). The feature space has 20 dimensions (all continuous).
We need to train a nonparametric (most features form nonlinear subspaces and we can't assume a distribution for any of them) imputer (kNN or tree-based regression) using the training data.
We need to predict multiple missing values in query data (a query feature-vector can have up to 13 missing features, so the imputer should handle any combination of missing features) using the trained imputer. NOTE: the imputer should not be retrained/fitted in any way using the query data (as is done in all mainstream R packages I've found so far: Amelia, impute, mi and mice...). That is, the imputation should be based solely on the training data.
The purpose for all this is described below.
A small data sample is down below.
Original post (TL;DR)
Simply put, I've got some sophisticated data imputing to do. We've got a training dataset of ~100k 20D samples and a smaller testing dataset. Each feature/dimension is a continuous variable, but the scales are different. There are two distinct classes. Both datasets are very NA-inflated (NAs are not equally distributed across dimensions). I use sklearn.ensemble.ExtraTreesClassifier for classification and, although tree ensembles can handle missing data cases, there are three reasons to perform imputation
This way we get votes from all trees in a forest during classification of a query dataset (not just those that don't have a missing feature/features).
We don't lose data during training.
The scikit-learn implementations of tree ensembles (both ExtraTrees and RandomForest) do not handle missing values. But this point is not that important; if it weren't for the former two, I would've just used rpy2 + some nice R implementation.
Things are quite simple with the training dataset because I can apply a class-specific median imputation strategy to deal with missing values, and this approach has worked fine so far. Obviously this approach can't be applied to a query: we don't have the classes to begin with. Since we know that the classes will likely have significantly different shares in the query, we can't apply a class-indifferent approach, because that might introduce bias and reduce classification performance; therefore we need to impute missing values from a model.
Linear models are not an option for several reasons:
all features are correlated to some extent;
theoretically we can get all possible combinations of missing features in a sample feature-vector; even though our tool requires at least 7 non-missing features, we would end up with ~1e6 (about a million) possible models, which doesn't look very elegant if you ask me.
Tree-based regression models aren't good for the very same reason. So we ended up picking kNN (k nearest neighbours), ball tree or LSH with radius threshold to be more specific. This approach fits the task quite well, because dimensions (ergo distances) are correlated, hence we get nice performance in extremely NA-rich cases, but there are several drawbacks:
I haven't found a single implementation in Python (including impute, sklearn.preprocessing.Imputer, orange) that handles feature-vectors with different sets of missing values; that is, we want a single imputer for all possible combinations of missing features.
kNN uses pairwise point distances for prediction/imputation. As I've already mentioned, our variables have different scales, hence the feature space must be normalised prior to distance estimation. And we need to know the theoretical max/min values for each dimension to scale it properly. This is not so much a problem as a matter of architectural simplicity (a user will have to provide a vector of min/max values).
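The scaling concern in that last point can be handled by a small helper that takes the user-provided min/max vectors (a sketch; the function name and bounds are made up):

```python
import numpy as np

def scale_with_bounds(X, lo, hi):
    """Min-max scale each dimension to [0, 1] using user-supplied
    theoretical bounds, leaving NaNs (missing values) untouched."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    return (X - lo) / (hi - lo)

# Two samples, two dimensions on different scales; one missing value.
X = np.array([[0.5, 10.0], [np.nan, 30.0]])
scaled = scale_with_bounds(X, lo=[0.0, 0.0], hi=[1.0, 40.0])
print(scaled)
```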
So here is what I would like to hear from you:
Are there any classic ways to address the kNN-related issues given in the list above? I believe this must be a common case, yet I haven't found anything specific on the web.
Is there a better way to impute data in our case? What would you recommend? Please, provide implementations in Python (R and C/C++ are considered as well).
Data
Here is a small sample of the training data set. I reduced the number of features to make it more readable. The query data has identical structure, except for the obvious absence of category information.
v1 v2 v3 v4 v5 category
0.40524 0.71542 NA 0.81033 0.8209 1
0.78421 0.76378 0.84324 0.58814 0.9348 2
0.30055 NA 0.84324 NA 0.60003 1
0.34754 0.25277 0.18861 0.28937 0.41394 1
NA 0.71542 0.10333 0.41448 0.07377 1
0.40019 0.02634 0.20924 NA 0.85404 2
0.56404 0.5481 0.51284 0.39956 0.95957 2
0.07758 0.40959 0.33802 0.27802 0.35396 1
0.91219 0.89865 0.84324 0.81033 0.99243 1
0.91219 NA NA 0.81033 0.95988 2
0.5463 0.89865 0.84324 0.81033 NA 2
0.00963 0.06737 0.03719 0.08979 0.57746 2
0.59875 0.89865 0.84324 0.50834 0.98906 1
0.72092 NA 0.49118 0.58814 0.77973 2
0.06389 NA 0.22424 0.08979 0.7556 2
Based on the new update, I think I would recommend against kNN or tree-based algorithms here. Since imputation is the goal, and not a side effect of the method you're choosing, you need an algorithm that will learn to complete incomplete data.
To me this seems very well suited to a denoising autoencoder. If you're familiar with neural networks, it's the same basic principle: instead of training to predict labels, you train the model to predict the input data, with a notable twist.
The 'denoising' part refers to an intermediate step where you randomly set some percentage of the input data to 0 before attempting to predict it. This forces the algorithm to learn richer features and to complete the data when there are missing pieces. In your case I would recommend a low amount of dropout in training (since your data is already missing features) and no dropout at test time.
It would be difficult to write a helpful example without looking at your data first, but the basics of what an autoencoder does (as well as a complete code implementation) are covered here: http://deeplearning.net/tutorial/dA.html
This link uses a Python module called Theano, which I would HIGHLY recommend for the job. Its flexibility trumps every other module I've looked at for machine learning, and I've looked at a lot. It's not the easiest thing to learn, but if you're going to be doing a lot of this kind of stuff, I'd say it's worth the effort. If you don't want to go through all that, you can still implement a denoising autoencoder in Python without it.
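The masking-noise idea can be illustrated with a bare-bones one-hidden-layer autoencoder in plain numpy (toy data and all sizes made up; a real implementation would use a proper framework as discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 20 features in [0, 1], with some structure.
latent = rng.random((200, 4))
X = np.clip(latent @ rng.random((4, 20)), 0, 1)

n_in, n_hidden, lr = 20, 8, 0.5
W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(200):
    # Masking noise: zero out ~20% of inputs, then reconstruct the original.
    mask = rng.random(X.shape) > 0.2
    H = sigmoid((X * mask) @ W1 + b1)   # encode corrupted input
    R = sigmoid(H @ W2 + b2)            # decode / reconstruct
    losses.append(np.mean((R - X) ** 2))
    # Backprop for squared error with sigmoid activations.
    dR = (R - X) * R * (1 - R) / len(X)
    dH = (dR @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dR; b2 -= lr * dR.sum(axis=0)
    W1 -= lr * (X * mask).T @ dH; b1 -= lr * dH.sum(axis=0)

print(losses[0], losses[-1])  # reconstruction error should decrease
```

At prediction time you would feed a query vector with its missing entries set to 0 and read the imputed values off the reconstruction.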
