I am stuck in the process of building a model. Basically, I have 10 parameters, all of which are categorical variables, and some of the categories have a large number of unique values (one category has 1,335 unique values across 300,000 records). The y value to be predicted is a number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure if this is the maximum achievable or whether I really need to change the algorithm itself. I am open to any kind of solution.
Having up to 1335 categories for a categorical dimension might cause a random forest regressor (or classifier) some headache, depending on how categorical dimensions are handled internally; things will also depend on the frequency distribution of the categories. What library are you using for the random forest regression?
Have you tried converting the categorical dimensions into unique integer IDs and treating this representation as a real-valued dimension? In my experience, this can raise the variable importance of many kinds of categorical dimensions, since the inherent/initial ordering of the categories can sometimes provide useful grouping/partitioning information.
You can even shuffle the category order a few times and use each resulting coding as an additional input dimension. I'll try to explain with an example:
You have a categorical dimension x1 with categories [c11,c12,...,c1n]
We can easily map these categories to numerical values by saying that x1 has a value of 1 if its category is c11, a value of 2 if its category is c12, and in general a value of i for category c1i, etc.
Use this new non-categorical dimension as an input dimension for training (you will have to change your input to the regressor accordingly later on).
You can go further than this. Randomly shuffle the order of the categories of x1 so that you get, for example, [c13,c19,c1n,c1i,...,c12]. Do the same thing as above and you have another new non-categorical input dimension (note that you will have to remember each shuffling order for the sake of regression later on).
I'm curious whether adding a few dimensions like this (anywhere between 1 and 100, or whatever number you choose) can improve your performance.
Please, see how performance changes for different numbers of such dimensions. (But be aware that more such dimensions will cost you in preprocessing time at regression)
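A minimal sketch of this coding trick with pandas and numpy (the column name "x1" and the number of shuffled copies are made up for illustration; the mappings are returned so the same coding can be applied to future/test data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def add_integer_codings(df, col, n_shuffles=5):
    # Map the categories of `col` to integer IDs (1..n, as in the example above)
    # and append a few randomly shuffled re-codings as extra numeric columns.
    categories = df[col].astype("category").cat.categories
    mappings = {}
    base_map = {c: i for i, c in enumerate(categories, start=1)}
    df[f"{col}_code0"] = df[col].map(base_map)
    mappings[f"{col}_code0"] = base_map
    for k in range(1, n_shuffles + 1):
        shuffled = rng.permutation(np.asarray(categories))
        m = {c: i for i, c in enumerate(shuffled, start=1)}
        df[f"{col}_code{k}"] = df[col].map(m)
        mappings[f"{col}_code{k}"] = m  # keep for encoding new data later
    return df, mappings

# Hypothetical usage: df has a categorical column "x1"
# df, maps = add_integer_codings(df, "x1", n_shuffles=5)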
Another idea, offered only for inspiration since it would require combining multiple categorical dimensions at once: check whether some form of linear model on the one-hot encodings of the individual categories of multiple categorical dimensions might improve things (this can help you find useful orderings more quickly than the approach above).
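A rough sketch of that idea, assuming a numeric target column "days" and a single categorical column "x1" (both names are made up); since the target here is numeric I use a linear regressor, and a linear classifier such as LogisticRegression would be the analogue for a categorical target:

import pandas as pd
from sklearn.linear_model import Ridge

def category_ordering(df, col, target):
    # Fit a linear model on the one-hot encoding of one categorical column
    # and order its categories by the learned coefficients.
    X = pd.get_dummies(df[col], prefix=col)
    model = Ridge(alpha=1.0).fit(X, df[target])
    coef = pd.Series(model.coef_, index=X.columns).sort_values()
    # Categories sorted from lowest to highest associated target value;
    # this ordering can then be used for the integer coding described above.
    return [c.replace(f"{col}_", "", 1) for c in coef.index]

# Hypothetical usage:
# order = category_ordering(df, "x1", "days")
# df["x1_code"] = df["x1"].map({c: i for i, c in enumerate(order, start=1)})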
I am sure you need to do more processing on your data.
Having 1,335 unique values in one variable is quite unusual.
If the data is public, please share it with me; I would like to take a look.
I have tried to find basic answers to this question, but none on Stack Overflow seems to be a good fit.
I have a dataset with 40 columns and 55,000 rows. Only 8 out of these columns are numerical. The remaining 32 are categorical with string values in each.
Now I wish to do an exploratory data analysis for a predictive model, and I need to drop certain irrelevant columns that do not show a high correlation with the target (the variable to predict). But since all 32 of these variables are categorical, what can I do to assess their relevance to the target variable?
What I am thinking to try:
LabelEncoding all 32 columns, then running dimensionality reduction via PCA, and then building a predictive model. (If I do this, how can I clean my data by removing the irrelevant columns that have a low corr() with the target?)
One-hot encoding all 32 columns and directly running a predictive model on it. (If I do this, the concept of cleaning the data is lost entirely, the number of columns will skyrocket, and the model will consider all relevant and irrelevant variables in its prediction!)
What is the best practice in such a situation for ultimately building a predictive model when you have many categorical columns?
You have to check the correlation. There are two scenarios I can think of (a short code sketch of both follows):
If the target variable is continuous and the independent variable is categorical, you can go with the Kendall Tau correlation.
If both the target and the independent variable are categorical, you can go with Cramér's V correlation.
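As a rough illustration of both checks (a scipy-based sketch; the column names are placeholders, and note that Kendall's tau treats the integer codes as ordinal, which is only meaningful if that order is):

import numpy as np
import pandas as pd
from scipy import stats

def kendall_with_target(df, cat_col, target_col):
    # Continuous target vs. categorical feature: Kendall's tau on an
    # integer-coded version of the feature (assumes the coding order is meaningful).
    codes = df[cat_col].astype("category").cat.codes
    tau, p_value = stats.kendalltau(codes, df[target_col])
    return tau, p_value

def cramers_v(df, cat_col, target_col):
    # Categorical target vs. categorical feature: Cramér's V computed from
    # the chi-squared statistic of the contingency table.
    table = pd.crosstab(df[cat_col], df[target_col])
    chi2 = stats.chi2_contingency(table)[0]
    n = table.values.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))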
There's a package in Python which can do all of this for you, and you can select only the columns that you need:
pip install ctrl4ai
from ctrl4ai import automl
automl.preprocess(dataframe, learning_type)
Use help(automl.preprocess) to learn more about the hyperparameters, and you can customise your preprocessing in the way you want.
Please also check automl.master_correlation, which computes correlations based on the approach I explained above.
You can check whether your categorical variables are suitable for a Spearman rank correlation, which ranks the categorical variables and calculates a correlation coefficient. However, be careful about collinearity between the categorical variables.
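If some of your categories do have a natural order, a quick check might look like this (a sketch only, assuming your data is in a DataFrame df; the column names and the ordering "low" < "medium" < "high" are assumptions you would replace with your own):

import pandas as pd
from scipy import stats

# Spearman rank correlation between an ordinal categorical column and the target.
# Only meaningful if the imposed category order reflects a real ordering.
order = ["low", "medium", "high"]  # hypothetical ordering
codes = df["size_band"].astype(pd.CategoricalDtype(order, ordered=True)).cat.codes
rho, p_value = stats.spearmanr(codes, df["target"])
print(rho, p_value)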
I am trying to solve a machine learning task but have encountered some problems; any tips would be greatly appreciated. One of my questions is: how do you create a correlation matrix for two dataframes (the data for two labels) of different sizes, to see whether you can combine them into one?
Here is the full text of the task:
This dataset is composed of 1100 samples with 30 features each. The first column is the sample id. The second column in the dataset represents the label. There are 4 possible values for the labels. The remaining columns are numeric features.
Notice that the classes are unbalanced: some labels are more frequent than others. You need to decide whether to take this into account, and if so how.
Compare the performance of a Support-Vector Machine (implemented by sklearn.svm.LinearSVC) with that of a RandomForest (implemented by sklearn.ensemble.ExtraTreesClassifier). Try to optimize both algorithms' parameters and determine which one is best for this dataset. At the end of the analysis, you should have chosen an algorithm and its optimal set of parameters.
I have tried to make a correlation matrix for the rows with the less frequent labels, but I am not convinced it is reliable.
I made two new dataframes from the rows that have labels 1 and 2. There are 100-150 entries for each of those two labels, compared to about 400 for labels 0 and 3. I wanted to check whether there is a high correlation between the data labeled 1 and the data labeled 2, to see if I could combine them, but I don't know if this is the right approach. I made the dataframes the same size by appending zeros to the smaller one and then computed a correlation matrix for both datasets together. Is this a correct approach?
Your question and approach are not clear. Can you edit the question to include the problem statement and a few of the data samples you have been given?
If you want to visualize your data set, plot it in 2, 3, or 4 dimensions.
There are many plotting tools, such as 3D scatter plots, pair plots, histograms, and more; use them to better understand your data sets.
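For example, a quick look with seaborn (assuming the samples are in a pandas DataFrame df with a "label" column; the feature names are placeholders):

import matplotlib.pyplot as plt
import seaborn as sns

# Pair plot of a handful of features, coloured by label, to see how the
# four classes overlap (plotting all 30 features at once is unreadable).
cols = ["feature_1", "feature_2", "feature_3", "feature_4"]  # hypothetical names
sns.pairplot(df, vars=cols, hue="label", corner=True)
plt.show()

# Class balance at a glance.
sns.countplot(x="label", data=df)
plt.show()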
I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally, I would one-hot-encode categorical features, but there are so many values here that it would radically increase my data's dimensionality. I have 12,000 rows of data and only about 10 other features, so I might be OK.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any information for the classification; you should drop this column and not pass it to the classifier.
In a classifier you only want to pass features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of a company does not bring any valuable information in this context, and therefore it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, you can easily make multiple attempts and try different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. This gives you a baseline score, i.e., a score against which you can compare your other combinations of features.
It costs nothing to try one-hot-encoding org_id; whatever result you observe will add to your experience and knowledge of how a Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will vastly outnumber them, and the classification results may be heavily influenced by these features.
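A small sketch of that baseline-versus-one-hot comparison (cross-validated scores with and without an encoded org_id; the DataFrame df and the target column "churned" are placeholders, and the other features are assumed to be numeric already):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = df["churned"]  # hypothetical target column
X_base = df.drop(columns=["churned", "org_id"])

# Baseline: no org_id at all.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("baseline:", cross_val_score(clf, X_base, y, cv=5).mean())

# Variant: org_id one-hot encoded (adds ~417 Boolean columns).
X_ohe = X_base.join(pd.get_dummies(df["org_id"].astype(int), prefix="org"))
print("with org_id:", cross_val_score(clf, X_ohe, y, cv=5).mean())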
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations. For instance, if they are all US organizations, their state (~50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design these features in a smart way.
Also, I would find it interesting if, once you have solved your problem, you came back here and wrote another answer to your question, as I believe many people run into such problems when working with real data :)
I have a neural network program that is designed to take input variables and output variables and to use forecasted data to predict what the output variables should be. After running this program, I obtain an output vector. Let's say, for example, my input matrix has 100 rows and 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and made a list of the strongest correlations, but I'm wondering if there is a better way to go about this.
If what you want is model (feature) selection, it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into that matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e., every combination of features. With 10 features that is already 2^10 - 1 = 1023 models to train, and the count doubles with every additional feature, so this quickly becomes computationally unfeasible.
Another way is to take a variable you think is a good predictor and train the model only on that variable. Compute the error on the training data. Then take another variable at random, retrain the model, and recompute the error on the training data. If the error drops, keep the variable; otherwise discard it. Keep going for all features (a sketch of this approach is given below).
A third approach is the opposite: start by training the model on all features and sequentially drop variables (a less naïve approach would be to drop the variables you intuitively think have little explanatory power), compute the error on the training data, and compare to decide whether to keep each feature.
There are a million ways of going about this. I've presented three of the simplest, but again, you can go really deep into this subject and find all kinds of other approaches (which is why I highly recommend you read that chapter :) ).
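A minimal sketch of the second (forward) approach with a generic scikit-learn model; following the description above it uses the training error as the criterion, though a held-out set or cross-validation would be more reliable (all names are placeholders):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def forward_selection(X, y, model=None):
    # Greedy forward selection: add one feature at a time, refit,
    # and keep the feature only if the training error drops.
    model = model or LinearRegression()
    selected, best_err = [], np.inf
    for j in range(X.shape[1]):
        trial = selected + [j]
        model.fit(X[:, trial], y)
        err = mean_squared_error(y, model.predict(X[:, trial]))
        if err < best_err:
            selected, best_err = trial, err
    return selected

# Hypothetical usage with a 100 x 10 input matrix X and output vector y:
# kept_columns = forward_selection(X, y)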
I have a set of data with 50 features (c1, c2, c3 ...), with over 80k rows.
Each row contains normalised numerical values (ranging from 0 to 1). These are actually normalised dummy variables, and some rows have only a few non-zero features, 3-4 (0 is assigned when a feature has no value). Most rows have about 10-20 non-zero features.
I used KMeans to cluster the data, and it always results in one cluster with a large number of members. Upon analysis, I noticed that rows with fewer than 4 non-zero features tend to get clustered together, which is not what I want.
Is there any way to balance out the clusters?
It is not part of the k-means objective to produce balanced clusters. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the sum-of-squares, and putting these objects into one cluster seems to be beneficial.
What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data are just not well suited to k-means' use of means. Furthermore, you would probably need to weight the variables carefully, too.
Now, a hotfix that will likely improve your results (at least the perceived quality, because I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This will give more weight to the non-zero entries of rows that have only a few of them. You will probably like the results more, but they will be even harder to interpret.
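A minimal sketch of that hotfix with scikit-learn (assuming the data is in a numpy array X; the number of clusters is a placeholder):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Scale every row to unit Euclidean length; rows with few non-zero entries
# get those entries boosted, rows with many get them shrunk.
X_unit = normalize(X, norm="l2")

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)  # hypothetical k
labels = kmeans.fit_predict(X_unit)
print(np.bincount(labels))  # inspect how balanced the clusters are now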