I am working with a dataset containing 19 features. Seven of these are nominal categorical features, and all seven have high cardinality (some contain only 5-30 unique values, but in multiple cases hundreds or thousands of unique values are present). I am aware that for most machine learning algorithms, text-based categorical data must be encoded. But if a feature is categorical yet already numeric, should I encode it as well?
Probably not necessary, but an example might look like:
error code
23
404
6
....
1324
500
Not encoding this column at all would surely be better as far as dimensionality is concerned, but there is a finite number of error codes that can exist, and they have no hierarchy. My fear is that by not encoding, I am leaving in place the inherent ordering that Python or pandas applies to numbers by default, and am therefore creating bias within my dataset. I have a feeling that I must encode, but doing so for all seven of these features through one-hot encoding would take me from 19 features to over 14k. (Not that it's relevant to this question, but I am researching hash encoding as well, though I'm having a hard time wrapping my head around it.)
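For reference, here is a minimal sketch of the hash-encoding idea mentioned above, assuming scikit-learn's FeatureHasher (the bucket count of 16 is an arbitrary choice): it maps any number of distinct error codes into a fixed number of columns, at the cost of occasional collisions between unrelated codes.

```python
from sklearn.feature_extraction import FeatureHasher

error_codes = [23, 404, 6, 1324, 500]  # the illustrative values from above

# With input_type='string', each row is an iterable of string tokens;
# here every row carries a single token such as "error_code=404".
hasher = FeatureHasher(n_features=16, input_type='string')
X = hasher.transform([[f"error_code={code}"] for code in error_codes])

print(X.shape)  # (5, 16): 16 columns no matter how many distinct codes exist
```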
Bonus question if anyone wants to answer: if I'm examining error codes that pop up in a large number of machines, and I want to consider the year each machine was manufactured, is that a numeric value or a categorical one? There is a finite number of values (from the first year the company began producing machines to the current year), so I'm guessing it IS categorical in this case?
I have a couple of questions.

1. If a dataset has some categorical variables that need one-hot encoding, the encoding sometimes increases the total number of columns to 200 or more, because there are so many categories within each column. How do I deal with this?
2. What is a good number of total columns to go into model training? Are 50 columns or 20 columns good, or is it totally dependent on the dataset?
Generally it's good to have a lot of examples, and this number of columns is not too much.
I recommend instead focusing on:
- Deleting missing or irrelevant data that may lead to wrong output after all the transformations (see the sketch below).
- Providing a significant number of records to magnify the learning effect.
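A minimal pandas sketch of that cleanup step, assuming a hypothetical input file and column name:

```python
import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical file name
df = df.drop(columns=["free_text_notes"])  # drop a column judged irrelevant (hypothetical name)
df = df.dropna()                           # delete records with missing values
print(len(df), "records remain for training")
```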
[Image: a table of variable-length amino-acid sequences alongside an output label column]
I need to take the sequences as training data and the output column as the label, but first I have to apply one-hot encoding to the sequences. As you can see, the sequences vary in length. Please suggest how to apply one-hot encoding to all the amino acids so that each has a different integer value assigned.
No one else can determine the best way to bin your data set. That's a decision that can only be made by someone who has a good understanding of the objective and the dataset. ϕ(x), your feature vector, is always very specific to your data.
For example, if you had DNA you might have features for whether a certain codon is present, or bins for the quantity of adenine, etc. This is highly subjective, and even with a good understanding, tuning is a non-trivial task.
You have to be very careful, because if you generate the feature vector incorrectly you might create biases in your data, such as certain classes tending to have a certain length or quantity of particular amino acids, that are not truly representative of what you are classifying for. This could lead to deceptive testing and training error rates and produce incorrect conclusions.
Honestly, if you are at a university, I would recommend asking someone in a computer science department, or some other analog, to help contribute to your project. While it might seem tempting to use the pre-baked sklearn encoding, it is not a good solution for your case. It is very likely you will have outlier cases in terms of sequence length due to the limited quantity of data, and attempting to turn each character into its own feature will cause poor fitting performance.
As for actually reading your data into Python: it's a CSV, so you could parse it by hand with an open() and a split(','), or you could use one of the popular libraries for parsing CSV formats. YMMV.
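Both routes might look something like this, assuming a hypothetical file name:

```python
import csv

# By hand, with open() and split(','):
with open("sequences.csv") as f:            # hypothetical file name
    rows = [line.rstrip("\n").split(",") for line in f]

# Or with the standard-library csv module, which also handles quoted fields:
with open("sequences.csv", newline="") as f:
    rows = list(csv.reader(f))
```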
I am very inexperienced when it comes to machine learning, but I would like to learn, and in order to improve my skills I am currently trying to apply the things I have learned to one of my own research datasets.
I have a dataset with 77 rows and 308 columns. Every row corresponds to a sample. 305 of the 308 columns give information about concentrations; one column tells whether the sample belongs to group A, B, C, or D; one column tells whether it is an X or a Y sample; and one final column tells whether the output is successful or not. I would like to determine which concentrations significantly impact the output, taking into account the variation between the groups and sample types. I have tried multiple things (feature selection, classification, etc.), but so far I do not get the desired output.
My question is therefore whether people have suggestions/tips/ideas about how I could tackle this problem, taking into account that the dataset is relatively small and that only 15 out of the 77 samples have 'not successful' as their output.
Calculate the correlation of each feature with the output and sort the features by it. After sorting, take the top 10-15 features.
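A minimal pandas sketch of that ranking, assuming the data sits in a CSV and the output is a 0/1 column named successful (both names are assumptions, not from the original post):

```python
import pandas as pd

df = pd.read_csv("concentrations.csv")  # hypothetical file, one row per sample

# Correlate every numeric column with the binary target, then rank features
# by the absolute value of the correlation.
corr = df.corr(numeric_only=True)["successful"].drop("successful")
top_features = corr.abs().sort_values(ascending=False).head(15).index.tolist()
print(top_features)
```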
I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally I would one-hot encode categorical features, but there are so many values here that it would radically increase my data's dimensionality. I have 12,000 rows of data and only about 10 other features, though, so I might be OK.
The alternatives are to leave the column as float values, convert the floats to ints, or convert the floats to pandas' categorical objects.
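In code, those alternatives would look something like this (a sketch, assuming the data has been loaded into a pandas DataFrame from a hypothetical file):

```python
import pandas as pd

df = pd.read_csv("customers.csv")               # hypothetical file name

df["org_id"] = df["org_id"].astype(int)         # floats -> ints
df["org_id"] = df["org_id"].astype("category")  # ints -> pandas categorical
onehot = pd.get_dummies(df["org_id"], prefix="org_id")  # full one-hot expansion
print(onehot.shape)                             # up to 417 dummy columns
```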
Any tips are much appreciated.
org_id does not seem to be a feature that brings any information for the classification; you should drop this column and not pass it to the classifier.
In a classifier you only want to pass features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of a company does not bring any valuable information in this context, and therefore it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, one can easily make multiple attempts at different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. That gives you your baseline score, i.e., a score you can compare your other combinations of features against.
I think it costs nothing to try one-hot encoding org_id; whichever result you observe will add to your experience and knowledge of how the Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will be highly preponderant, and the classification results may be heavily influenced by them.
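A minimal sketch of that comparison, assuming a hypothetical file, a hypothetical binary target column named retained, and that the remaining features are already numeric:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("customers.csv")   # hypothetical file name
y = df["retained"]                  # hypothetical binary target column

# Baseline: drop org_id entirely.
X_base = df.drop(columns=["org_id", "retained"])
base = cross_val_score(RandomForestClassifier(random_state=0), X_base, y, cv=5).mean()

# Variant: one-hot encode org_id into is_org_id_* Boolean columns.
X_ohe = pd.get_dummies(df.drop(columns=["retained"]),
                       columns=["org_id"], prefix="is_org_id")
ohe = cross_val_score(RandomForestClassifier(random_state=0), X_ohe, y, cv=5).mean()

print(f"baseline: {base:.3f}  one-hot org_id: {ohe:.3f}")
```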
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations: for instance, if they are only US organizations, their state (roughly 50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design these features in a smart way.
Also, I would find it interesting if, once you have solved your problem, you came back here and wrote another answer to your question, as I believe many people run into such problems when working with real data :)
I am working with a medical dataset that contains many variables with discrete outputs, for example: type of anesthesia, infection site, diabetes y/n. To deal with this, I have just been converting them into multiple columns of ones and zeros and then removing one column to make sure there is no direct correlation between them, but I was wondering if there is a more efficient way of doing this.
It depends on the purpose of the transformation. Converting categories to numerical labels may not make sense if the ordinal representation does not correspond to the logic of the categories. In this case, the "one-hot" encoding approach you have adopted is the best way to go if (as I surmise from your post) the intention is to use the generated variables as input to some sort of regression model. You can achieve what you are looking for with pandas.get_dummies.
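For example, a minimal sketch with hypothetical column names drawn from the question; drop_first=True performs the manual "remove one column" step automatically:

```python
import pandas as pd

# Toy frame with hypothetical column names from the question.
df = pd.DataFrame({
    "anesthesia_type": ["general", "local", "regional"],
    "infection_site": ["knee", "hip", "knee"],
    "diabetes": ["y", "n", "y"],
})

# drop_first=True drops one dummy per variable, i.e. the manual
# "remove one column" step described in the question.
encoded = pd.get_dummies(df, columns=["anesthesia_type", "infection_site", "diabetes"],
                         drop_first=True)
print(encoded)
```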