I have tried to find basic answers to this question, but none of the existing ones on Stack Overflow seems to be a good fit.
I have a dataset with 40 columns and 55,000 rows. Only 8 out of these columns are numerical. The remaining 32 are categorical with string values in each.
Now I wish to do an exploratory data analysis for a predictive model, and I need to drop certain irrelevant columns that do not show high correlation with the target (the variable to predict). But since all of these 32 variables are categorical, what can I do to assess their relevance to the target variable?
What I am thinking of trying:
1. Label-encode all 32 columns, then run dimensionality reduction via PCA, and then build a predictive model. (If I do this, how can I clean my data by removing the irrelevant columns that have low corr() with the target?)
2. One-hot encode all 32 columns and directly run a predictive model on it. (If I do this, the concept of cleaning the data is lost entirely, the number of columns will skyrocket, and the model will consider all relevant and irrelevant variables for its prediction!)
What is the best practice in such a situation for eventually building a predictive model when you have many categorical columns?
You have to check the correlation. There are two scenarios I can think of:
If the target variable is continuous and the independent variable is categorical, you can go with the Kendall Tau correlation.
If both the target and the independent variable are categorical, you can go with Cramér's V correlation.
There's a package in Python which can do all of this for you, and you can select only the columns that you need:
pip install ctrl4ai
from ctrl4ai import automl
automl.preprocess(dataframe, learning_type)
Use help(automl.preprocess) to understand more about the hyperparameters; you can customise your preprocessing in the way you want.
Please also check automl.master_correlation, which computes correlations based on the approach I explained above.
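If you would rather compute these measures directly instead of going through ctrl4ai, here is a rough, hedged sketch using pandas and scipy; the DataFrame and column names are placeholders, not anything from your data:

import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    # Cramér's V association between two categorical series (0 = none, 1 = perfect).
    confusion = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return float(np.sqrt((chi2 / n) / min(r - 1, k - 1)))

# Categorical predictor vs. categorical target:
# v = cramers_v(df["some_categorical_column"], df["categorical_target"])

# Categorical predictor vs. continuous target (Kendall Tau is rank-based, so the
# integer codes are only meaningful if the categories have some natural ordering):
# tau, p = stats.kendalltau(df["some_categorical_column"].astype("category").cat.codes,
#                           df["continuous_target"])

You can then keep only the columns whose association with the target is above a threshold you choose.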
You can have a look at whether your categorical variables are suitable for a Spearman rank correlation, which ranks the categorical variables and calculates the correlation coefficient. However, be careful about collinearity between the categorical variables.
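As a small, hedged illustration of this idea (the "size" and "price" columns are made up, and the approach only works here because the size categories have a natural order):

import pandas as pd
from scipy.stats import spearmanr

# Toy example; Spearman only makes sense if the category codes reflect a real ordering.
df = pd.DataFrame({
    "size": pd.Categorical(["S", "M", "L", "M", "L", "S"],
                           categories=["S", "M", "L"], ordered=True),
    "price": [10, 15, 22, 14, 25, 9],
})
rho, p_value = spearmanr(df["size"].cat.codes, df["price"])
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")

For purely nominal variables with no meaningful order, the coefficient depends on the arbitrary ranking, which is exactly the caveat above.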
I have data whose structure is as below (a fictitious example):
[example data table omitted]
There are 3 predictor variables and 1 response variable.
We have data on 5 students, and each student has 3 observations, for times 1, 2, 3. Thus the total number of observations is 15.
But I don't have an idea how to analyze the effect of X1, X2, X3 on Y in this kind of longitudinal data (I will use Python).
Can anyone give me some idea?
Thank you.
Since you have longitudinal data and a continuous response you have a few different options:
1. Ignore the grouping structure. I would not recommend doing this, since you may be throwing away information.
2. Model your groups separately. This is usually not a good idea, and most certainly not when the groups have low sample size.
3. Treat your grouping variable as a categorical predictor. This again may not be ideal when the number of groups is high, even with recent boosting packages that handle categorical predictors with high cardinality well (e.g. CatBoost).
4. Use a mixed effects model.
If you want to proceed with option 4, I recommend taking a look at the Gaussian Process Boosting (GPBoost) package first. However, there are other Python packages to consider: MERF and MixedLM in statsmodels.
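If you go the statsmodels route, a minimal random-intercept sketch could look like the following; the column names (student, X1, X2, X3, Y) mirror the fictitious example in the question, and the data below is random just to make the snippet runnable on its own:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_students, n_times = 5, 3
df = pd.DataFrame({
    "student": np.repeat(np.arange(n_students), n_times),  # grouping variable
    "X1": rng.normal(size=n_students * n_times),
    "X2": rng.normal(size=n_students * n_times),
    "X3": rng.normal(size=n_students * n_times),
})
df["Y"] = 2 * df["X1"] - df["X2"] + rng.normal(size=len(df))  # fake response

# Random intercept per student, fixed effects for X1, X2, X3.
model = smf.mixedlm("Y ~ X1 + X2 + X3", data=df, groups=df["student"])
result = model.fit()
print(result.summary())

With only 15 observations the fit may warn about convergence; with your real data the same call structure applies, you just swap in your own DataFrame.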
I am stuck in the process of building a model. Basically I have 10 parameters, all of which are categorical variables, and the categories have a large number of unique values (one category has 1,335 unique values across 300,000 records). The y value to be predicted is the number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure if this is the maximum achievable or whether I really need to change the algorithm itself. I am flexible with any kind of solution.
Having up to 1,335 categories in a categorical dimension might cause a random forest regressor (or classifier) some headaches, depending on how categorical dimensions are handled internally; things will also depend on the frequency distribution of the categories. What library are you using for the random forest regression?
Have you tried converting the categorical dimensions into unique integer IDs and interpreting this representation as a real-number dimension? In my experience this can raise the variable importance of many types of categorical dimensions. (At times the inherent/initial ordering of the categories can provide useful grouping/partitioning information.)
You can even shuffle your dimensions a few times and use these as input dimensions. I'll try to explain with an example:
You have a categorical dimension x1 with categories [c11,c12,...,c1n]
We can easily map these categories to numerical values: x1 takes the value 1 if its category is c11, the value 2 if its category is c12, and in general the value i for category c1i, etc.
Use this new non-categorical dimension as an input dimension for training (you will have to change your input to the regressor accordingly later on).
You can go further than this. Shuffle the order of the categories of x1 randomly so you get a new order, for example [c13,c19,c1n,c1i,...,c12]. Do the same thing as above and you have another new non-categorical input dimension (consider that you'll have to remember the shuffling order for the sake of regression later on).
I'm curious whether adding a few dimensions like this (anywhere between 1 and 100, or whatever number you choose) can improve your performance.
Please check how performance changes for different numbers of such dimensions (but be aware that more such dimensions will cost you preprocessing time during regression).
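Here is a minimal sketch of the shuffled integer-encoding idea described above; the DataFrame and the column name x1 are placeholders for your own data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy frame standing in for the real data; x1 is a high-cardinality categorical column.
df = pd.DataFrame({"x1": rng.choice([f"c1_{i}" for i in range(1, 21)], size=500)})

categories = df["x1"].unique()

# Step 1: map each category to an integer in its initial order.
base_map = {cat: i + 1 for i, cat in enumerate(categories)}
df["x1_enc_0"] = df["x1"].map(base_map)

# Step 2: add a few more columns, each using a different random ordering of the
# same categories (keep the mappings so you can encode new data the same way later).
n_shuffles = 5
mappings = {}
for k in range(1, n_shuffles + 1):
    shuffled = rng.permutation(categories)
    mappings[k] = {cat: i + 1 for i, cat in enumerate(shuffled)}
    df[f"x1_enc_{k}"] = df["x1"].map(mappings[k])

# Feed the x1_enc_* columns to the regressor instead of the raw strings.
print(df.head())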
One could also combine multiple categorical dimensions at once in this way; consider it only for inspiration.
Another idea would be to check whether some form of linear classifier on the one-hot encodings of each individual category, across multiple categorical dimensions, might be able to improve things (this can help you find useful orderings more quickly than the approach above).
I am sure you need to do more processing on your data.
Having 1,335 unique values in one variable is something bizarre.
Please, if the data is public, share it with me; I would like to take a look.
I have a huge data set which has a mixture of both numerical and categorical variables. I have come across various feature selection techniques that focus primarily on either numerical or categorical data alone, not on a mixture of them. Is there any feature selection technique which works on such a data set?
You are looking for the Boruta package, originally written in R but also available in Python. Boruta uses a random forest to rank features, but you first have to handle all missing values in your features, otherwise Boruta throws an error. Look here for more information:
https://datascience.stackexchange.com/questions/31112/boruta-feature-selection-package
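For reference, this is roughly how BorutaPy is usually wired up with a scikit-learn estimator (a hedged sketch: the data here is synthetic, and it assumes your categorical columns are already encoded to numbers and missing values handled, since Boruta expects a purely numeric, complete NumPy array):

import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for an already-encoded, imputed feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3 + X[:, 1] - 2 * X[:, 2] + rng.normal(size=200)

rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", random_state=1, verbose=0)
selector.fit(X, y)  # NumPy arrays, not DataFrames

print("Selected feature indices:", np.where(selector.support_)[0])
X_selected = selector.transform(X)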
So I have 20 different nominal categorical variables which are independent variables. Each of these variables has 2-10 categories. These independent variables are of string type and will be used to predict a dependent variable called price, which is a continuous variable.
What algorithm do I use to find the correlation of each variable and decide on the best variables?
Note: I have not built a machine learning model yet and am using Python.
I've tried the f_oneway ANOVA from sklearn, but it does not find the correlation; instead it only compares between the groups themselves. I have found correlations between the continuous independent variables and the dependent variable. Help is much appreciated.
I'm not sure about sklearn, but perhaps this information will bring you a step closer.
First of all, when we speak about categorical data, we do not speak about correlation, we speak about association.
Generally speaking, you need to use an ANOVA, chi-square, or something similar to gather information on the association between a categorical variable and a continuous variable.
With ANOVA, we can calculate the inter- and intra-group variance, and then compare them.
Look at this post, it will probably make more sense than me trying to explain:
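To make the ANOVA idea concrete, here is a rough sketch with made-up column names (neighbourhood as one nominal predictor, price as the continuous target): group the prices by the category levels and compare between-group versus within-group variance with scipy's f_oneway.

import pandas as pd
from scipy.stats import f_oneway

# Toy data: one nominal predictor and a continuous target.
df = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "B", "C", "C", "A", "C"],
    "price": [100, 110, 250, 240, 260, 150, 160, 105, 155],
})

# One array of prices per category level.
groups = [g["price"].to_numpy() for _, g in df.groupby("neighbourhood")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

A small p-value suggests the categorical variable is associated with price; repeating this for each of the 20 predictors gives one way to rank them before modelling.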
I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally, I would one-hot encode categorical features, but there are so many values here that it would radically increase my data's dimensionality. I have 12,000 rows of data and only about 10 other features, so I might be OK though.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any info for the classification, you should drop this value and not pass it into the classifier.
In a classifier you only want to pass features that are discriminative for the task that you are trying to perform: here the elements that can impact the retention or churn. The ID of a company does not bring any valuable information in this context therefore it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, one can easily make multiple attempts with different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. This gives you your baseline score, i.e. a score you can compare your other combinations of features against.
I think it costs nothing to try one-hot encoding org_id; whichever result you observe is going to add to your experience and knowledge of how the random forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will heavily outnumber them, and the classification results may be highly influenced by these features.
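A hedged sketch of that baseline-vs-one-hot comparison; the DataFrame, the org_id column and the synthetic data below are placeholders so the snippet runs on its own:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: ~10 other features plus a high-cardinality org_id column.
df = pd.DataFrame(rng.normal(size=(2000, 10)), columns=[f"f{i}" for i in range(10)])
df["org_id"] = rng.integers(0, 417, size=len(df)).astype(float)
y = rng.integers(0, 2, size=len(df))  # fake retention/churn labels

clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)

# Baseline: drop org_id entirely.
X_base = df.drop(columns=["org_id"])
base_score = cross_val_score(clf, X_base, y, cv=5).mean()

# Variant: one-hot encode org_id and see whether the score moves.
X_ohe = pd.get_dummies(df, columns=["org_id"])
ohe_score = cross_val_score(clf, X_ohe, y, cv=5).mean()

print(f"baseline (no org_id): {base_score:.3f} | one-hot org_id: {ohe_score:.3f}")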
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations. For instance, if they are only US organizations, their state (which is ~50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design these features in a smart way.
Also, I would find it interesting if, once you solve your problem, you came back here and wrote another answer to your question, as I believe many people run into such problems when working with real data :)