Multi-Output Classification using scikit Decision Trees

Multi-Output Classification using scikit Decision Trees - python

Disclaimer: I'm new to the field of Machine Learning, and even though I have done my fair share of research during the past month I still lack deep understanding on this topic.
I have been playing around with the scikit library with the objective of learning how to predict new data based on historic information, and classify existing information.
I'm trying to solve 2 different problems which may be correlated:
Problem 1
Given a data set containing rows R1 ... RN with features F1 ... FN, and a target per each group of rows, determine in which group does row R(N+1) belongs to.
Now, the target value is not singular, it's a set of values; The best solution I have been able to come up with is to represent those sets of values as a concatenation, this creates an artificial class and allows me to represent multiple values using only one attribute. Is there a better approach to this?
What I'm expecting is to be able to pass totally new set of rows, and being told which are the target values per each of them.
Problem 2
Given a data set containing rows R1 ... RN with features F1 ... FN, predict the values of R(N+1) based on the frequency of the features.
A few considerations here:
Most of the features are categorical in nature.
Some of the features are dates, so when doing the prediction the date should be in the future relative to the historic data.
The frequency analysis needs to be done per row, because certain sets of values may be invalid.
My question here is: Is there any process/ML algorithm, which given historic data would be able to predict a new set of values based on just the frequency of the parameters?
If you have any suggestions, please let me know.

Regarding Problem 1, if you expect the different components of the target value to be independent, you can approach the problem as building a classifier for every component. That is, if the features are F = (F_1, F_2, ..., F_N) and the targets Y = (Y_1, Y_2, ..., Y_N), create a classifier with features F and target Y_1, a second classifier with features F and target Y_2, etc.
Regarding Problem 2, if you are not dealing with a time series, IMO the best you can do is simply predict the most frequent value for each feature.
That said, I believe your question fits better another stack exchange like cross-validated.

Related

Longitudinal data analysis in python

I have data whose structure is as below (a fictitious example):
data
There are 3 predictor variables and 1 response variable.
We have data of 5 students, and each student have 3 observations for time 1,2,3. Thus total number of observations is 15.
But I don't have an idea how to analyze the effect of X1, X2, X3 on Y in this kind of longitudinal data.(I will use python)
Can anyone give me some idea?
Thank you.

Since you have longitudinal data and a continuous response you have a few different options:
Ignore the grouping structure. I would not recommend doing this, since you may be ignoring information.
Model your groups separately. This is usually not a good idea, and most certainly not in the case where the groups have low sample size.
Treat your grouping variable as a categorical predictor. This is again may not be ideal when the number of groups are high, even with recent boosting packages that handle categorical predictors with high cardinality well (e.g. CatBoost).
Use a mixed effects model.
If you want to proceed with bullet 4, I recommend taking a look at the Gaussian Process Boosting or GPBoost package first. However, there are other Python packages to consider: MERF and LMER in Statsmodels.

Handling Categorical Data with Many Values in sklearn

I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally, I would one-hot-encode categorical features, but there are so many values here so it would radically increase my data dimensionality. I have 12,000 rows of data, so I might be OK though, and only about 10 other features.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.

org_id does not seem to be a feature that brings any info for the classification, you should drop this value and not pass it into the classifier.
In a classifier you only want to pass features that are discriminative for the task that you are trying to perform: here the elements that can impact the retention or churn. The ID of a company does not bring any valuable information in this context therefore it should not be used.
Edit following OP's comment:
Before going further let's state something: with respect to the number of samples (12000) and the relative simplicity of the model, one can make multiple attempts to try different configurations of features easily.
So, As a baseline, I would do as I said before, drop this feature all together. Here is your baseline score i.e., a score you can compare your other combinations of features against.
I think it cost nothing to try to hot-encode org_id, whichever result you observe is going to add up to your experience and knowledge of how the Random Forest behaves in such cases. As you only have 10 more features, the Boolean features is_org_id_1, is_org_id_2, ... will be highly preponderant and the classification results may be highly influenced by these features.
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations. For instance, if they are only US organizations, their state which is ~50 features, or their number of users (which would be a single numerical feature), their years of existence (another single numerical feature). Let's note that these are only examples to illustrate the process of creating new features, only someone knowing the full problematic can design these features in a smart way.
Also, I would find interesting that, once you solve your problem, you come back here and write another answer to your question as I believe, many people run into such problems when working with real data :)

How to determine most impactful input variables in a dataset?

I have a neural network program that is designed to take in input variables and output variables, and use forecasted data to predict what the output variables should be based on the forecasted data. After running this program, I will have an output of an output vector. Lets say for example, my input matrix is 100 rows and 10 columns and my output matrix is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and created a list of the highest correlation between each variable and output, but I'm wondering if there is a better way to go about this.

If what you want to know is model selection, and it's not as simple as studiying the correlation of your features to your target. For an in-depth, well explained look at model selection, I'd recommend you read chapter 7 of The Elements Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There a number of ways to do this.
The naïve way is to estimate all possible models, so every combination of features. Since you have 10 features, it's computationally unfeasible.
Another way is to take a variable you think is a good predictor and train to model only on that variable. Compute the error on the training data. Take another variable at random, retrain the model and recompute the error on the training data. If it drops the error, keep the variable. Otherwise discard it. Keep going for all features.
A third approach is the opposite. Start with training the model on all features and sequentially drop variables (a less naïve approach would be to drop variables you intuitively think have little explanatory power), compute the error on training data and compare to know if you keep the feature or not.
There are million ways of going about this. I've exposed three of the simplest, but again, you can go really deeply into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).

Machine learning classification dataset setup

I am very sorry if this question violates SO's question guidelines but I am stuck and I cannot find anywhere else to ask this type of questions. Suppose I have a dataset containing three experimental data that were obtained in three different conditions (hot, cold, comfortable). The data is arranged in three columns in a pandas dataframe consisting of 4 columns (time, cold, comfortable and hot).
When I plot the data, I can visually see the separation of the three experiments, but I would like to do it automatically with machine learning.
The x-axis represents the time and the y-axis represents the magnitude of the data. I have read about different machine learning classification techniquesbut I do not understand how to set up my data so that I can 'feed' it into the classification algorithm. Namely, my questions are:
Is this programmatically feasible?
How can I set up (arrange my data) so that it can be easily fed into the classification algorithm? From what I read so far, it seems, for the algorithm to work, the data has to be in a certain order (see for example the iris dataset where the data is nicely labeled. How can I customize the algorithms to fit my needs?
NOTE: Ideally, I would like the program that, given a magnitude value, it would classify the value as hot, comfortable or cold. The time series is not much of relevance in my case

Of course this is feasible.
It's not entirely clear from the original post exactly what variables/features you have available for your model, but here is a bit of general guidance. All of these machine learning problems, from classification to regression, rely on the same core assumption that you are trying to predict some outcome based on a bunch of inputs. Usually this relationship is modeled like this: y ~ X1 + X2 + X3 ..., where y is your outcome ("dependent") variable, and X1, X2, etc. are features ("explanatory" variables). More simply, we can say that using our entire feature-set matrix X (i.e. the matrix containing all of our x-variables), we can predict some outcome variable y using a variety of ML techniques.
So in your case, you'd try to predict whether it's Cold, Comfortable, or Hot based on time. This is really more of a forecasting problem than it is a ML problem, since you have a time component that looks to be one of the most important (if not the only) features in your dataset. You may want to look at some simpler time-series forecasting methods (e.g. ARIMA) instead of ML algorithms, as some of the time-series ML approaches may not be well-suited for a beginner.
In any case, this should get you started, I think.

Impute multiple missing values in a feature-vector

Edited post
This is a short and somewhat clarified version of the original post.
We've got a training dataset (some features are significantly correlated). The feature space has 20 dimensions (all continuous).
We need to train a nonparametric (most features form nonlinear subspaces and we can't assume a distribution for any of them) imputer (kNN or tree-based regression) using the training data.
We need to predict multiple missing values in query data (a query feature-vector can have up to 13 missing features, so the imputer should handle any combination of missing features) using the trained imputer. NOTE the imputer should not be in any way retrained/fitted using the query data (like it is done in all mainstream R packages I've found so far: Amelia, impute, mi and mice...). That is the imputation should be based solely on the training data.
The purpose for all this is described below.
A small data sample is down below.
Original post (TL;DR)
Simply put, I've got some sophisticated data imputing to do. We've got a training dataset of ~100k 20D samples and a smaller testing dataset. Each feature/dimension is a continuous variable, but the scales are different. There are two distinct classes. Both datasets are very NA-inflated (NAs are not equally distributed across dimensions). I use sklearn.ensemble.ExtraTreesClassifier for classification and, although tree ensembles can handle missing data cases, there are three reasons to perform imputation
This way we get votes from all trees in a forest during classification of a query dataset (not just those that don't have a missing feature/features).
We don't loose data during training.
scikit implementation of tree ensembles (both ExtraTrees and RandomForest) do not handle missing values. But this point is not that much important. If it wasn't for the former two I would've just used rpy2 + some nice R implementation.
Things are quite simple with the training dataset because I can apply class-specific median imputation strategy to deal with missing values and this approach has been working fine so far. Obviously this approach can't be applied to a query - we don't have the classes to begin with. Since we know that the classes will likely have significantly different shares in the query we can't apply a class-indifferent approach because that might introduce bias and reduce classification performance, therefore we need to impute missing values from a model.
Linear models are not an option for several reasons:
all features are correlated to some extent;
theoretically we can get all possible combinations of missing features in a sample feature-vector, even though our tool requires at least 7 non-missing features we end up with ~1^E6 possible models, this doesn't look very elegant if you ask me.
Tree-based regression models aren't good for the very same reason. So we ended up picking kNN (k nearest neighbours), ball tree or LSH with radius threshold to be more specific. This approach fits the task quite well, because dimensions (ergo distances) are correlated, hence we get nice performance in extremely NA-rich cases, but there are several drawbacks:
I haven't found a single implementation in Python (including impute, sklearn.preprocessing.Imputer, orange) that handles feature-vectors with different sets of missing values, that is we want to have only one imputer for all possible combinations of missing features.
kNN uses pair-wise point distances for prediction/imputation. As I've already mentioned our variables have different scales, hence the feature space must be normalised prior to distance estimations. And we need to know theoretic max/min values for each dimension to scale it properly. This is not as much of a problem, as it is a matter architectural simplicity (a user will have to provide a vector of min/max values).
So here is what I would like to hear from you:
Are there any classic ways to address the kNN-related issues given in the list above? I believe this must be a common case, yet I haven't found anything specific on the web.
Is there a better way to impute data in our case? What would you recommend? Please, provide implementations in Python (R and C/C++ are considered as well).
Data
Here is a small sample of the training data set. I reduced the number of features to make it more readable. The query data has identical structure, except for the obvious absence of category information.
v1 v2 v3 v4 v5 category
0.40524 0.71542 NA 0.81033 0.8209 1
0.78421 0.76378 0.84324 0.58814 0.9348 2
0.30055 NA 0.84324 NA 0.60003 1
0.34754 0.25277 0.18861 0.28937 0.41394 1
NA 0.71542 0.10333 0.41448 0.07377 1
0.40019 0.02634 0.20924 NA 0.85404 2
0.56404 0.5481 0.51284 0.39956 0.95957 2
0.07758 0.40959 0.33802 0.27802 0.35396 1
0.91219 0.89865 0.84324 0.81033 0.99243 1
0.91219 NA NA 0.81033 0.95988 2
0.5463 0.89865 0.84324 0.81033 NA 2
0.00963 0.06737 0.03719 0.08979 0.57746 2
0.59875 0.89865 0.84324 0.50834 0.98906 1
0.72092 NA 0.49118 0.58814 0.77973 2
0.06389 NA 0.22424 0.08979 0.7556 2

Based on the new update I think I would recommend against kNN or tree-based algorithms here. Since imputation is the goal and not a consequence of the methods you're choosing you need an algorithm that will learn to complete incomplete data.
To me this seems very well suited to use a denoising autoencoder. If you're familiar with Neural Networks it's the same basic principle. Instead of training to predict labels you train the model to predict the input data with a notable twist.
The 'denoising' part refers to a intermediate step where you randomly set some percentage of the input data to 0 before attempting to predict it. This forces the algorithm to learn more rich features and how to complete the data when there are missing pieces. In your case I would recommend a low amount of drop out in training (since your data is already missing features) and no dropout in test.
It would be difficult to write a helpful example without looking at your data first, but the basics of what an autoencoder does (as well as a complete code implementation) are covered here: http://deeplearning.net/tutorial/dA.html
This link uses a python module called Theano which I would HIGHLY recommend for the job. The flexibility the module trumps every other module I've looked at for Machine Learning and I've looked at a lot. It's not the easiest thing to learn, but if you're going to be doing a lot of this kind of stuff I'd say it's worth the effort. If you don't want to go through all that then you can still implement a denoising autoencoder in Python without it.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.