Encoding the categorical columns makes the data too sparse - python

I have a couple of questions.
If a dataset has some categorical variables that need one-hot encoding, then sometimes, because there are so many categories within each column, one-hot encoding increases the total number of columns to 200 or more. How do I deal with this?
What is a good number of total columns that go into model training? Are 50 columns or 20 columns good, or is it totally dependent on the dataset?

Generally it's good to have a lot of examples, and this number of columns is not too much.
I recommend instead focusing on:
Deleting missing or irrelevant data, which may otherwise lead to wrong output after all the transformations.
Providing a significant number of records, which magnifies the learning effect.
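A minimal sketch of that workflow, assuming a pandas DataFrame; the column names (color, city, free_text_notes, label) are purely illustrative, not from the question:

    import pandas as pd

    # Hypothetical example data; replace with your own DataFrame.
    df = pd.DataFrame({
        "color": ["red", "blue", None, "green"],
        "city": ["Paris", "Lyon", "Paris", None],
        "free_text_notes": ["ok", "n/a", "", "fine"],  # assumed to be irrelevant for the model
        "label": [1, 0, 1, 0],
    })

    # 1) Drop columns that carry no signal and rows with missing values.
    df = df.drop(columns=["free_text_notes"]).dropna()

    # 2) One-hot encode the remaining categorical columns.
    encoded = pd.get_dummies(df, columns=["color", "city"])

    print(encoded.shape)  # see how many columns the encoding actually produces

If the column count still explodes, grouping rare categories into a single "other" bucket before calling get_dummies is a common way to keep it in check.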


Group similar data and assign number to each group Python pandas

I have a dataset with 25 columns and 1000+ rows. This dataset contains dummy information about interns. We want to make squads of these interns, say with 10 members in each squad.
Based on the interns' similarities we want to form squads and assign a squad number to each of them. The factors will be the columns we have in the dataset, such as timezone, the language they speak, which team they want to work in, etc.
These are the columns:
["Name","Squad_Num","Prefered_Lang","Interested_Grp","Age","City","Country","Region","Timezone",
"Occupation","Degree","Prev_Took_Courses","Intern_Experience","Product_Management","Digital_Marketing",
"Market_Research","Digital_Illustration","Product_Design","Prodcut_Developement","Growth_Marketing",
"Leading_Groups","Internship_News","Cohort_Product_Marketing","Cohort_Product_Design",
"Cohort_Product_Development","Cohort_Product_Growth","Hours_Per_Week"]
Here are a bunch of clustering algos for you to play around with.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Since this is unsupervised learning, you kind of have to fiddle around with different algorithms and see which one performs to your liking; there is no accuracy, precision, R^2, etc., to tell you how well the model is performing.
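As a rough sketch of what that fiddling might look like, assuming pandas and scikit-learn; reading from an assumed interns.csv and using only a subset of the question's columns for illustration:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    df = pd.read_csv("interns.csv")  # assumed file name for the interns dataset

    # Encode a few of the categorical columns and scale everything.
    features = ["Prefered_Lang", "Timezone", "Interested_Grp", "Hours_Per_Week"]
    X = StandardScaler().fit_transform(pd.get_dummies(df[features]))

    # 1000+ interns with ~10 per squad -> roughly len(df) // 10 clusters.
    kmeans = KMeans(n_clusters=len(df) // 10, n_init=10, random_state=0)
    df["Squad_Num"] = kmeans.fit_predict(X)

Note that KMeans will not give you squads of exactly 10 members; you would still need a balancing step (or a size-constrained clustering method) on top of the cluster labels.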

RandomForest Regressor with all independent variables as categorical

I am stuck in the process of building a model. Basically I have 10 parameters, all of which are categorical variables, and some of the categories have a large number of unique values (one category has 1335 unique values across 300,000 records). The y value to be predicted is the number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure if this is the maximum possible or whether I really need to change the algorithm itself. I am flexible with any kind of solution.
Having up to 1335 categories for a categorical dimension might cause a random forest regressor (or classifier) some headache depending on how categorical dimensions are handled internally, and things will also depend on the distribution frequencies of the categories. What library are you using for the random forest regression?
Have you tried converting the categorical dimensions into unique integer IDs and interpreting this representation as a real-number dimension? In my experience this can raise the variable importance of many kinds of categorical dimensions. (At times the inherent/initial ordering of the categories provides useful grouping/partitioning information.)
You can even shuffle your dimensions a few times and use these as input dimensions. I'll try to explain with an example:
You have a categorical dimension x1 with categories [c11,c12,...,c1n]
We can easily map these categories to numerical values by saying x1 has a value of 1 if its category is c11, a value of 2 if its category is c12, and in general a value of i for category c1i.
Use this new non-categorical dimension as an input dimension for training (you will have to change your input to the regressor accordingly later on).
You can go further than this. Shuffle the order (randomly) of your categories of x1 so you get a random order, for example [c13,c19,c1n,c1i,...,c12]. Do the same thing as above and you have another new non-categorical input dimension (Consider that you'll have to remember the shuffling order for the sake of regression later on).
I'm curious whether adding a few dimensions like this (anywhere between 1 and 100, or whatever number you choose) can improve your performance.
Please check how the performance changes for different numbers of such dimensions. (But be aware that more such dimensions will cost you preprocessing time at regression.)
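A minimal sketch of this idea, assuming pandas and NumPy; the DataFrame, the column name x1 and the number of shuffled copies are illustrative, not from the question:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    def integer_encode(series, n_shuffled=3):
        """One 'natural order' integer column plus n_shuffled randomly reordered copies."""
        as_cat = series.astype("category")
        categories = as_cat.cat.categories
        out = pd.DataFrame(index=series.index)
        out["x1_ord"] = as_cat.cat.codes + 1  # category c1i -> value i

        for k in range(n_shuffled):
            # Remember each mapping: you need it again to encode data at prediction time.
            mapping = dict(zip(categories, rng.permutation(len(categories)) + 1))
            out[f"x1_shuffled_{k}"] = series.map(mapping)
        return out

    df = pd.DataFrame({"x1": ["a", "b", "c", "a", "c"]})
    print(integer_encode(df["x1"]))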
Another idea, which would require combining multiple categorical dimensions at once, so take it only as inspiration: check whether some form of linear classifier on the one-hot encodings of the individual categories across multiple categorical dimensions might be able to improve things. (This can help you find useful orderings more quickly than the approach above.)
I am sure you need to do more processing on your data.
Having 1335 unique values in one variable is quite unusual.
Please, if the data is public, share it with me; I would like to take a look.

Question about handling of numeric categorical data with no hierarchy

I am working with a dataset containing 19 features. Seven of these are nominal category features, and all of them have high cardinality (some contain only 5-30 unique values, but in multiple cases hundreds or thousands of unique values are present). I am aware that for most machine learning algorithms, text-based categorical data must be encoded. However, if a feature is categorical but already numeric, should I encode it as well?
Probably not necessary, but an example might look like:
error code
23
404
6
....
1324
500
Not encoding this column at all would surely be better as far as dimensionality is concerned, but there are a finite number of error codes that can exist, and they have no hierarchy. My fear is that by not encoding, I am leaving an inherent hierarchy in place that is the default within Python or pandas, and am therefore creating bias within my dataset. I have a feeling that I must encode, but doing so for all seven of these features through one-hot encoding would take me from 19 features to over 14k. (Not that it's relevant to this question, but I am also researching hash-encoding, and I'm having a hard time wrapping my head around it.)
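Since hash-encoding is mentioned: a minimal sketch of feature hashing with scikit-learn's FeatureHasher, which caps the dimensionality no matter how many distinct codes exist; the error_code column and the choice of 16 output columns are illustrative assumptions:

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    # Hypothetical high-cardinality, non-ordinal categorical column.
    df = pd.DataFrame({"error_code": [23, 404, 6, 1324, 500]})

    # Hash each code into a fixed number of columns (here 16), independent of cardinality.
    hasher = FeatureHasher(n_features=16, input_type="string")
    hashed = hasher.transform(df["error_code"].astype(str).apply(lambda v: [f"error_code={v}"]))

    hashed_df = pd.DataFrame(hashed.toarray(), columns=[f"ec_hash_{i}" for i in range(16)])

The trade-off is that hash collisions make the columns hard to interpret, but it avoids both the false ordering of the raw codes and the blow-up to 14k one-hot columns.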
Bonus question if anyone wants to answer: If I'm examining error codes that pop up in a large number of machines, and I want to consider the year that the machine was manufactured, is that a numeric value or categorical? There are a finite number of values (first year company began producing machines to current year), so I'm guessing it IS categorical in this case?

Dealing with multiple categorical inputs and variable-sized groups as inputs to neural network

I'm working with data which consists of numerical and categorical features, where each input consists of a variable-sized group of the features.
For example: predict the price of a house using features about each room in the house, where each house can have a different number of rooms. The features could be size in meters, type (e.g. living room/bathroom/bedroom), color, floor...
Some of the categorical features have high cardinality, and I may be using many features.
I'd want to use the features from n rooms to predict the price for each house.
How would I structure my inputs/nn model to receive variable-sized groups of inputs?
I thought of using one-hot encoding, but then I'd end up with large input vectors and I'd lose the connections between the features for each room.
I also thought of using embeddings, but I'm not sure what the best way is to combine the features/samples to properly input all the data without losing any info about which features come from which samples etc.
As the article linked below suggests, you've got one of three routes to choose from.
Ordinal Encoding, which I think is not the right fit for your example.
One-Hot Encoding, which you've already ruled out, and rightly so.
Difference Encoding, which I think is somewhat suited here, since there are master bedrooms, minor ones, guest ones and children's ones. So, try that angle (see the sketch below).
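A minimal sketch of trying those routes side by side, assuming the category_encoders package and that "Difference Encoding" means backward difference coding; the room-level columns are illustrative:

    import pandas as pd
    import category_encoders as ce

    # Hypothetical room-level data; "type" is the categorical feature to encode.
    rooms = pd.DataFrame({
        "type": ["master bedroom", "guest bedroom", "bathroom", "living room"],
        "size_m2": [25, 14, 8, 30],
    })

    ordinal = ce.OrdinalEncoder(cols=["type"]).fit_transform(rooms)
    one_hot = ce.OneHotEncoder(cols=["type"], use_cat_names=True).fit_transform(rooms)
    difference = ce.BackwardDifferenceEncoder(cols=["type"]).fit_transform(rooms)

    print(difference.head())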
Link to the beautiful article
Happy coding :)

Handling Categorical Data with Many Values in sklearn

I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally, I would one-hot-encode categorical features, but there are so many values here that it would radically increase my data dimensionality. I have 12,000 rows of data and only about 10 other features, so I might be OK though.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any information for the classification; you should drop it and not pass it to the classifier.
In a classifier you only want to pass features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of a company does not bring any valuable information in this context, therefore it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, you can easily make multiple attempts with different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. That gives you your baseline score, i.e., a score you can compare your other combinations of features against.
I think it costs nothing to try to one-hot-encode org_id; whatever result you observe will add to your experience and knowledge of how the Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will be highly preponderant, and the classification results may be strongly influenced by them.
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations: for instance, if they are only US organizations, their state (~50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design them in a smart way.
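A minimal sketch of that baseline-versus-one-hot comparison, assuming pandas and scikit-learn; the file name, the "churned" target column and the model settings are illustrative assumptions:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("customers.csv")  # assumed: the 12,000-row customer table
    y = df["churned"]                  # assumed target column name

    rf = RandomForestClassifier(n_estimators=200, random_state=0)

    # Baseline: drop org_id entirely.
    X_base = pd.get_dummies(df.drop(columns=["churned", "org_id"]))
    base_score = cross_val_score(rf, X_base, y, cv=5).mean()

    # Variant: keep org_id, one-hot encoded as is_org_id_0, is_org_id_1, ...
    X_org = df.drop(columns=["churned"]).astype({"org_id": int})
    X_org = pd.get_dummies(X_org, columns=["org_id"], prefix="is_org_id")
    X_org = pd.get_dummies(X_org)  # also encode any remaining text columns, if present
    org_score = cross_val_score(rf, X_org, y, cv=5).mean()

    print(f"baseline: {base_score:.3f}   with org_id one-hot: {org_score:.3f}")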
Also, I would find it interesting if, once you solve your problem, you come back here and write another answer to your own question, as I believe many people run into such problems when working with real data :)
