I want to use PCA to reduce the features (columns) in a dataset, but one of the features is a text feature.
Our solution for this was to convert the text feature to numeric. How can we do this?
Or is there any other way to use PCA on text features?
For example, this dataframe:
For the text feature, you can build numeric vectors from the text.
A list of vectorizers in scikit-learn that work on text can be found here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text.
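For instance, a minimal sketch using TfidfVectorizer followed by PCA; the DataFrame and the column name "description" are made up for illustration:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data; in practice this would be your text column.
df = pd.DataFrame({"description": ["red sports car",
                                   "old garbage truck",
                                   "blue family car"]})

# Turn the text into numeric TF-IDF features (a sparse matrix).
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(df["description"])

# PCA needs a dense array; for large corpora, TruncatedSVD works
# directly on the sparse matrix instead.
reduced = PCA(n_components=2).fit_transform(text_features.toarray())
print(reduced.shape)  # (3, 2)
```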
PCA finds the axes of the data with the greatest variance. For this, all of the inputs must be numerical.
You could take the length of the text string, which would give you a number, but that is unlikely to provide any useful information. Ultimately, it is up to you to decide what you want from the data, and that will inform how to transform it. If your text field is categorical, one way is to create dummy variables that split the categorical variable into multiple binary variables. You can do this in pandas with the get_dummies function.
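If the text field really is categorical, a minimal sketch of the get_dummies route before PCA (the column names are invented; scaling the features before PCA is usually advisable):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy data with one categorical text column and one numeric column.
df = pd.DataFrame({
    "vehicle": ["car", "truck", "car", "trash"],
    "weight": [1200, 9000, 1400, 300],
})

# Split the categorical column into binary dummy columns.
numeric_df = pd.get_dummies(df, columns=["vehicle"])

# Now every column is numeric, so PCA can be applied.
reduced = PCA(n_components=2).fit_transform(numeric_df)
print(reduced.shape)  # (4, 2)
```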
In my opinion, a better question to ask is why you want to reduce your feature set in the first place, and whether the text is even relevant to your analysis.
I have a very large dataset in tfrecord format. There are three features in the dataset and I want to extract a stratified subset from it based on one feature (the label). I was wondering whether anyone had a go-to workflow for this. The first thing that popped into my mind is the following:
Parse the dataset elements one by one and associate element index with the label
Store this information to a new data structure (dataframe/np array) that would serve as a look-up table
Split the structure in a stratified manner with something like the sklearn stratified split to get the indices for each label
Create a new dataset based on the look-up table.
I would really like to avoid doing that since, as I mentioned, the dataset is very large and it would take a lot of time to parse it element by element. I would appreciate it if someone could suggest a built-in function or a more intuitive way of doing this. I couldn't find anything like this in the tf docs.
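For reference, a rough sketch of the element-by-element workflow described above, assuming the records contain a single integer feature named "label" (names and dtypes would need adjusting to the actual schema):

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

raw_ds = tf.data.TFRecordDataset("data.tfrecord")

def parse_label(serialized):
    # Parse only the label, not the (large) remaining features.
    features = {"label": tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(serialized, features)["label"]

# Steps 1-2: build the element-index -> label look-up table.
labels = np.array([l.numpy() for l in raw_ds.map(parse_label)])
indices = np.arange(len(labels))

# Step 3: stratified selection of indices.
subset_idx, _ = train_test_split(indices, train_size=0.1, stratify=labels)

# Step 4: build a new dataset containing only the selected elements.
keep = tf.constant(np.isin(indices, subset_idx))
subset_ds = (raw_ds.enumerate()
             .filter(lambda i, x: tf.gather(keep, i))
             .map(lambda i, x: x))
```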
My dataset contains columns describing abilities of certain characters, filled with True/False values. There are no empty values. My ultimate goal is to make groups of characters with similar abilities. And here's the question:
Should I change the True/False values to 1 and 0, or is there no need for that?
What clustering model should I use? Is KMeans okay for that?
How do I interpret the results (output)? Can I visualize them?
The thing is, I always see people perform clustering on numeric datasets that you can visualize, which looks much easier to do. With True/False values I just don't know how to approach it.
Thanks.
In general there is no need to change True/False to 0/1. This is only necessary if you want to apply a specific algorithm for clustering that cannot deal with boolean inputs, like K-means.
K-means is not the preferred option here. Like many clustering algorithms, K-means is based on computing distances and requires continuous features as input, so no boolean inputs. And although binary (0-1) input technically works, the distances it produces are not very meaningful (many points will end up at the same distance from each other). With 0-1 data only, I would not use clustering; instead, I would tabulate the data and see which cells (combinations of values) occur frequently. If you have a large dataset, you might use the Apriori algorithm to find the cells that occur frequently.
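To illustrate the tabulation idea, a small sketch assuming the abilities are boolean columns of a pandas DataFrame (the column names are invented; the apriori call uses the third-party mlxtend package):

```python
import pandas as pd

# Toy boolean ability data.
df = pd.DataFrame({
    "flies": [True, False, True, True],
    "swims": [False, False, True, True],
    "casts_spells": [True, True, False, True],
})

# Count how often each exact combination of abilities (each "cell") occurs.
print(df.value_counts())

# For larger data, mine frequent ability combinations instead:
# from mlxtend.frequent_patterns import apriori
# frequent = apriori(df, min_support=0.2, use_colnames=True)
```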
In general, a clustering algorithm returns a cluster number for each observation. In low dimensions, this number is frequently used to color the observations in a scatter plot. In your case of boolean values, however, I would just list the most frequently occurring cells.
LightGBM has support for categorical variables. I would like to know how it encodes them. It doesn't seem to be one-hot encoding, since the algorithm is pretty fast (I tried it with data that took a long time to one-hot encode).
https://github.com/Microsoft/LightGBM/issues/699#issue-243313657
The basic idea is to sort the histogram according to its accumulated values (sum_gradient / sum_hessian), then find the best split on the sorted histogram, just like for numerical features.
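To illustrate the usage side, a minimal sketch of handing a categorical column to LightGBM directly, without one-hot encoding (the data and column names are invented):

```python
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "green", "blue", "red", "green"]),
    "size": [1.0, 2.5, 3.1, 0.7, 2.2, 1.8],
    "y": [0, 1, 1, 0, 1, 0],
})

# Columns with a pandas "category" dtype are picked up automatically;
# listing them via categorical_feature makes it explicit.
train = lgb.Dataset(df[["color", "size"]], label=df["y"],
                    categorical_feature=["color"])
model = lgb.train({"objective": "binary", "min_data_in_leaf": 1},
                  train, num_boost_round=5)
```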
I am trying to apply a K-Means algorithm to data from my database. First of all, I am taking the data like this:
So my questions are: how can I convert the column with strings to numbers, like "trash" = 1, "car" = 2, "truck" = 3, and can I use all columns and values for clustering?
The best you can do is to use the LabelEncoder from the sklearn library.
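For example, a minimal sketch; the column name "vehicle" is made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"vehicle": ["trash", "car", "truck", "car"]})

# Map each distinct string to an integer code.
encoder = LabelEncoder()
df["vehicle_encoded"] = encoder.fit_transform(df["vehicle"])
print(df)
```

Note, though, that the integer codes are arbitrary; as the next answer points out, distances between them are not meaningful for K-means.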
KMeans doesn't need "magic numbers".
It needs proper continuous variables, where the mean is meaningful. It's not the proper algorithm for your data. Minimizing least squares of encoded "dictionary numbers" is not sound.
I am working with a medical dataset that contains many variables with discrete outputs, for example: type of anesthesia, infection site, diabetes (y/n). To deal with this, I have been converting them into multiple columns of ones and zeros and then removing one column to make sure there is no direct correlation between them, but I was wondering if there is a more efficient way of doing this.
It depends on the purpose of the transformation. Converting categories to numerical labels may not make sense if the ordinal representation does not correspond to the logic of the categories. In this case, the "one-hot" encoding approach you have adopted is the best way to go, if (as I surmise from your post) the intention is to use the generated variables as the input to some sort of regression model. You can achieve what you are looking to do using pandas.get_dummies.
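For example, a minimal sketch with invented column names; drop_first=True drops one dummy per variable, matching the "removing one" step you describe:

```python
import pandas as pd

# Toy data with two discrete variables.
df = pd.DataFrame({
    "anesthesia_type": ["general", "local", "regional", "general"],
    "diabetes": ["y", "n", "n", "y"],
})

# One-hot encode, dropping the first level of each variable to avoid
# perfectly correlated (redundant) columns.
encoded = pd.get_dummies(df, columns=["anesthesia_type", "diabetes"],
                         drop_first=True)
print(encoded)
```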