I have a huge data set that has a mixture of both numerical and categorical variables. I have come across various feature selection techniques focused primarily on either numerical or categorical data alone, not on a mixture of them. Is there any feature selection technique that works on such a data set?
You are looking for the Boruta package, originally written in R but also available in Python. Boruta uses Random Forest to rank features, but you first have to handle all missing values in your features, otherwise Boruta throws an error. Look here for more information:
https://datascience.stackexchange.com/questions/31112/boruta-feature-selection-package
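A minimal sketch of how this might look with the Python port, BorutaPy (the file name, target column, and median-fill strategy are placeholders, not part of the original answer):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

df = pd.read_csv("data.csv")                            # hypothetical dataset
df = df.fillna(df.median(numeric_only=True))            # Boruta rejects missing values
X = pd.get_dummies(df.drop(columns="target")).values    # encode categoricals; BorutaPy wants arrays
y = df["target"].values

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(X, y)
print(selector.support_)                                # boolean mask of confirmed features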
I trained a model with the xgboost algorithm in Python and graphed the predicted values vs the actual ones on a scatterplot (see image).
As you can see, there are several outliers (I drew a circle around them) which greatly damage the model, and I would like to get rid of them.
Is there a way in Python to identify the exact values in a dataframe with multiple independent variables that generate these outliers? [image: predicted vs actual values]
There is something called anomaly/outlier detection; you should check that out.
Here is a link
There are several algorithms available in Python for multivariate anomaly detection in sklearn, such as DBSCAN, Isolation Forest and One-Class SVM, and Isolation Forest is generally deemed to detect anomalies/outliers well when the dataset has many attributes. However, before using anomaly/outlier detection algorithms you need to determine whether these values are actually outliers or whether they are natural behaviour for the dataset. If they are natural, then rather than removing the records you might have to normalize/bin them, apply other feature engineering techniques, or look at more complex algorithms to fit the data. What if the relationship between the target variable and the independent variables is non-linear?
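As a rough illustration, here is a minimal Isolation Forest sketch with scikit-learn; the file name, contamination rate, and column layout are assumptions, not taken from the question:

import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.read_csv("features.csv")                              # hypothetical feature matrix
iso = IsolationForest(contamination=0.01, random_state=0)    # assume ~1% of rows are outliers
labels = iso.fit_predict(X)                                  # -1 = outlier, 1 = inlier
print(X[labels == -1])                                       # inspect these rows before dropping anything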
I have tried to find basic answers to this question, but none on Stack Overflow seems to be a good fit.
I have a dataset with 40 columns and 55,000 rows. Only 8 out of these columns are numerical. The remaining 32 are categorical with string values in each.
Now I wish to do an exploratory data analysis for a predictive model, and I need to drop certain irrelevant columns that do not show high correlation with the target (the variable to predict). But since all of these 32 variables are categorical, what can I do to see their relevance to the target variable?
What I am thinking to try:
LabelEncode all 32 columns, then run dimensionality reduction via PCA, and then create a predictive model. (If I do this, how can I clean my data by removing the irrelevant columns that have low corr() with the target?)
One-hot encode all 32 columns and directly run a predictive model on it.
(If I do this, then the concept of cleaning the data is lost entirely, the number of columns will skyrocket, and the model will consider both relevant and irrelevant variables for its prediction!)
What should be the best practice in such a situation to make a predictive model in the end where you have many categorical columns?
You need to check the correlation. There are two scenarios I can think of (a quick sketch of both is given below):
If the target variable is continuous and the independent variable is categorical, you can go with the Kendall Tau correlation.
If both the target and the independent variable are categorical, you can go with Cramér's V correlation.
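Here is a quick sketch of both checks using pandas/scipy directly (the column names are placeholders; Kendall's tau assumes the categorical feature can be meaningfully label-encoded, and Cramér's V is computed by hand from a chi-square test):

import numpy as np
import pandas as pd
from scipy.stats import kendalltau, chi2_contingency

df = pd.read_csv("data.csv")                                        # hypothetical dataset

# continuous target vs. categorical feature: Kendall's tau on label-encoded categories
tau, p = kendalltau(df["cat_feature"].astype("category").cat.codes, df["target"])

# categorical target vs. categorical feature: Cramér's V from the contingency table
table = pd.crosstab(df["cat_feature"], df["cat_target"])
chi2, _, _, _ = chi2_contingency(table)
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(tau, cramers_v)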
There's a package in Python which can do all of this for you, and you can select only the columns that you need:
pip install ctrl4ai
from ctrl4ai import automl
automl.preprocess(dataframe, learning_type)
Use help(automl.preprocess) to learn more about the hyperparameters and to customise your preprocessing the way you want.
Also check automl.master_correlation, which computes correlations based on the approach I explained above.
You can check whether your categorical variables are suitable for a Spearman rank correlation, which ranks the categorical variables and calculates the correlation coefficient. However, be careful about collinearity between the categorical variables.
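For what it's worth, a small sketch of that check, assuming the categorical column is ordinal and the mapping to ranks is supplied by hand (the column names and ordering are illustrative, not from the question):

import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("data.csv")                                    # hypothetical dataset
ranks = df["severity"].map({"low": 1, "medium": 2, "high": 3})  # assumed ordering of the categories
rho, p = spearmanr(ranks, df["target"])
print(rho, p)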
I am working with a medical data set that contains many variables with discrete outputs, for example: type of anesthesia, infection site, diabetes y/n. To deal with this I have just been converting them into multiple columns of ones and zeros and then removing one column to make sure there is no direct correlation between them, but I was wondering if there is a more efficient way of doing this.
It depends on the purpose of the transformation. Converting categories to numerical labels may not make sense if the ordinal representation does not correspond to the logic of the categories. In this case, the "one-hot" encoding approach you have adopted is the best way to go, if (as I surmise from your post) the intention is to use the generated variables as the input to some sort of regression model. You can achieve what you are looking to do using pandas.get_dummies.
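For illustration, a minimal get_dummies sketch; drop_first=True drops one level per variable, which is the manual column removal described in the question (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    "anesthesia": ["general", "local", "general", "spinal"],
    "diabetes":   ["y", "n", "n", "y"],
})
encoded = pd.get_dummies(df, columns=["anesthesia", "diabetes"], drop_first=True)
print(encoded)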
I have some 2D data:
The data is labeled and shown in different colors. An unsupervised process will definitely not yield any correct prediction because the data is pretty mixed (although the colors seem to have regions of preference). I want to see if it is possible to measure how mixed points from different sets are.
For this I need to define a measure of how mixed they are (I think this should exist). It would also be nice to have these algorithms implemented. I am also looking for a simple predictive model that can be trained using the data shown. Thanks for your help. If possible, I'm looking for these implementations in Python.
Edited post
This is a short and somewhat clarified version of the original post.
We've got a training dataset (some features are significantly correlated). The feature space has 20 dimensions (all continuous).
We need to train a nonparametric (most features form nonlinear subspaces and we can't assume a distribution for any of them) imputer (kNN or tree-based regression) using the training data.
We need to predict multiple missing values in query data (a query feature-vector can have up to 13 missing features, so the imputer should handle any combination of missing features) using the trained imputer. NOTE: the imputer should not be retrained/fitted in any way using the query data (as is done in all the mainstream R packages I've found so far: Amelia, impute, mi and mice...). That is, the imputation should be based solely on the training data.
The purpose for all this is described below.
A small data sample is down below.
Original post (TL;DR)
Simply put, I've got some sophisticated data imputation to do. We've got a training dataset of ~100k 20D samples and a smaller testing dataset. Each feature/dimension is a continuous variable, but the scales are different. There are two distinct classes. Both datasets are very NA-inflated (the NAs are not equally distributed across dimensions). I use sklearn.ensemble.ExtraTreesClassifier for classification and, although tree ensembles can handle cases of missing data, there are three reasons to perform imputation:
This way we get votes from all trees in a forest during classification of a query dataset (not just those that don't have a missing feature/features).
We don't lose data during training.
The scikit-learn implementations of tree ensembles (both ExtraTrees and RandomForest) do not handle missing values. But this point is not that important; if it weren't for the former two reasons I would've just used rpy2 plus some nice R implementation.
Things are quite simple with the training dataset because I can apply a class-specific median imputation strategy to deal with missing values, and this approach has been working fine so far (a small sketch of it is given below). Obviously this approach can't be applied to the query data, since we don't have the classes to begin with. Because we know that the classes will likely have significantly different shares in the query data, we can't apply a class-indifferent approach, as that might introduce bias and reduce classification performance; therefore we need to impute missing values from a model.
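For concreteness, a small sketch of that class-specific median strategy (the file name is a placeholder; the column layout matches the sample data further down):

import pandas as pd

train = pd.read_csv("train.csv")                   # hypothetical training set with NAs
feature_cols = [c for c in train.columns if c != "category"]
train[feature_cols] = (
    train.groupby("category")[feature_cols]
         .transform(lambda col: col.fillna(col.median()))   # per-class median fill
)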
Linear models are not an option for several reasons:
all features are correlated to some extent;
theoretically we can get all possible combinations of missing features in a sample feature-vector, and even though our tool requires at least 7 non-missing features, we would end up with ~1E6 possible models, which doesn't look very elegant if you ask me.
Tree-based regression models aren't good for the very same reason. So we ended up picking kNN (k nearest neighbours), or more specifically a ball tree or LSH with a radius threshold. This approach fits the task quite well, because the dimensions (and hence the distances) are correlated, so we get nice performance in extremely NA-rich cases, but there are several drawbacks:
I haven't found a single implementation in Python (including impute, sklearn.preprocessing.Imputer, orange) that handles feature-vectors with different sets of missing values; that is, we want to have only one imputer for all possible combinations of missing features.
kNN uses pair-wise point distances for prediction/imputation. As I've already mentioned, our variables have different scales, so the feature space must be normalised prior to distance estimation, and we need to know the theoretical max/min values for each dimension to scale it properly (a tiny sketch is below). This is not so much a problem as a matter of architectural simplicity (a user will have to provide a vector of min/max values).
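A tiny sketch of that scaling step, with made-up bounds, just to make the point concrete:

import numpy as np

X = np.array([[0.4, 12.0], [0.7, np.nan]])   # toy data, NAs allowed
mins = np.array([0.0, 0.0])                  # user-provided theoretical minima
maxs = np.array([1.0, 20.0])                 # user-provided theoretical maxima
X_scaled = (X - mins) / (maxs - mins)        # NaNs simply propagate and can be masked later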
So here is what I would like to hear from you:
Are there any classic ways to address the kNN-related issues given in the list above? I believe this must be a common case, yet I haven't found anything specific on the web.
Is there a better way to impute data in our case? What would you recommend? Please, provide implementations in Python (R and C/C++ are considered as well).
Data
Here is a small sample of the training data set. I reduced the number of features to make it more readable. The query data has identical structure, except for the obvious absence of category information.
v1 v2 v3 v4 v5 category
0.40524 0.71542 NA 0.81033 0.8209 1
0.78421 0.76378 0.84324 0.58814 0.9348 2
0.30055 NA 0.84324 NA 0.60003 1
0.34754 0.25277 0.18861 0.28937 0.41394 1
NA 0.71542 0.10333 0.41448 0.07377 1
0.40019 0.02634 0.20924 NA 0.85404 2
0.56404 0.5481 0.51284 0.39956 0.95957 2
0.07758 0.40959 0.33802 0.27802 0.35396 1
0.91219 0.89865 0.84324 0.81033 0.99243 1
0.91219 NA NA 0.81033 0.95988 2
0.5463 0.89865 0.84324 0.81033 NA 2
0.00963 0.06737 0.03719 0.08979 0.57746 2
0.59875 0.89865 0.84324 0.50834 0.98906 1
0.72092 NA 0.49118 0.58814 0.77973 2
0.06389 NA 0.22424 0.08979 0.7556 2
Based on the new update, I think I would recommend against kNN or tree-based algorithms here. Since imputation is the goal and not a consequence of the methods you're choosing, you need an algorithm that will learn to complete incomplete data.
To me this seems very well suited to a denoising autoencoder. If you're familiar with neural networks, it's the same basic principle: instead of training to predict labels, you train the model to predict the input data, with a notable twist.
The 'denoising' part refers to an intermediate step where you randomly set some percentage of the input data to 0 before attempting to predict it. This forces the algorithm to learn richer features and how to complete the data when there are missing pieces. In your case I would recommend a low amount of dropout in training (since your data is already missing features) and no dropout at test time.
It would be difficult to write a helpful example without looking at your data first, but the basics of what an autoencoder does (as well as a complete code implementation) are covered here: http://deeplearning.net/tutorial/dA.html
This link uses a Python module called Theano, which I would HIGHLY recommend for the job. The flexibility of the module trumps every other module I've looked at for machine learning, and I've looked at a lot. It's not the easiest thing to learn, but if you're going to be doing a lot of this kind of stuff, I'd say it's worth the effort. If you don't want to go through all that, you can still implement a denoising autoencoder in Python without it.
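To make the idea concrete, here is a minimal sketch of a denoising autoencoder in Keras rather than Theano (purely illustrative: the layer sizes, corruption rate, and random data are placeholders), assuming the 20 features are scaled to [0, 1] and missing values have been filled with 0 beforehand:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 20
X_train = np.random.rand(1000, n_features)            # placeholder for the real training data

# "Denoising" step: randomly zero out ~10% of the inputs the model must reconstruct
mask = np.random.rand(*X_train.shape) > 0.10
X_corrupted = X_train * mask

autoencoder = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(16, activation="relu"),              # bottleneck
    layers.Dense(64, activation="relu"),
    layers.Dense(n_features, activation="sigmoid"),   # reconstruct all 20 features
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_corrupted, X_train, epochs=20, batch_size=32, verbose=0)

# Imputation: feed a query vector with missing entries set to 0 and read back
# the reconstructed values at the missing positions.
query = X_train[:1].copy()
query[0, 3] = 0.0                                     # pretend feature 3 is missing
reconstructed = autoencoder.predict(query, verbose=0)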