Python scikit-learn with a mixed-variable dataset (numerical, categorical, and embeddings)

I'm working on a machine learning project where I'm trying to predict the revenue of a movie.
My dataset contains mixed data types: numerical features (rating, number of votes, release year, ...), categorical features (genres, studio, whether the movie is for mature audiences, ...), and also embeddings, which are large feature vectors (poster embeddings and movie description embeddings).
My problem is with the last data type: how should I handle these embeddings?
I've done some preprocessing (cleaning, one-hot encoding, label encoding, ...), but the embeddings remain. Now I would like to do feature selection and model selection; say, for example, I want to apply a filter method. For a linear model I would use a correlation matrix, but I cannot compute one, since the variables img_embeddings and txt_embeddings are not numerical scalars but 1D vectors. The same problem arises if I want to use mutual information for non-linear models.
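One common workaround is to expand each embedding column into one scalar column per dimension, after which standard filter methods apply. A minimal sketch on made-up data (the column names `revenue` and `txt_embeddings` follow the question; the 4-dimensional embeddings and the `txt_emb_i` names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset: a numeric target plus an
# embedding column whose cells are 1-D vectors (here 4-dimensional).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.normal(size=50),
    "txt_embeddings": list(rng.normal(size=(50, 4))),
})

# Expand the vector column into one scalar column per dimension so that
# filter methods (correlation, mutual information) can be computed.
emb = pd.DataFrame(
    np.vstack(df["txt_embeddings"].to_numpy()),
    columns=[f"txt_emb_{i}" for i in range(4)],
    index=df.index,
)
expanded = pd.concat([df.drop(columns="txt_embeddings"), emb], axis=1)

# Per-dimension Pearson correlation with the target.
corr = expanded.corr()["revenue"].drop("revenue")
print(corr)
```

The same expanded frame can be fed to `sklearn.feature_selection.mutual_info_regression` for the non-linear case.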

Related

How does Tensorflow's Decision Forests handle categorical data?

I'm evaluating two different unsupervised ML algorithms, Isolation Forest and an LSTM autoencoder, to identify anomalies in a large time series dataset. This dataset consists mostly of categorical data such as IP addresses, cloud subscription IDs, tenant IDs, user agents, and client application IDs.
While reading a tutorial on an implementation of a TensorFlow Decision Forests (TF-DF) model, I saw that the model handles non-label categorical values natively and
there is no need for preprocessing in the form of one-hot encoding, normalization or extra is_present feature.
Does anybody know how Tensorflow handles the categorical features behind the scenes (assuming they do some transformation into a numeric representation)?
Tl;dr: There is a natural way of using categorical features in decision trees/forests that requires no encoding. Tensorflow Decision Forests uses this and a number of standard transformations to handle categorical features.
Tensorflow Decision Forests (TF-DF) constructs decision tree / decision forest models. A single decision tree recursively splits the dataset along its features. Splits along categorical features can naturally be performed through so-called in-set conditions. For instance, a tree can express a condition like userAgents ∈ {"Mozilla/5.0", "InternetExplorer/10.0"}. Other types of conditions are also possible. TF-DF can construct in-set conditions if the dataset contains categorical features.
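As a toy illustration (not TF-DF's actual implementation), an in-set condition is just a membership test on the raw categorical value, with no integer encoding involved:

```python
# An in-set condition routes an example left or right depending on
# whether its categorical value belongs to a learned set of values.
split_set = {"Mozilla/5.0", "InternetExplorer/10.0"}

def route(example):
    """Evaluate the in-set condition on one example."""
    return "left" if example["userAgents"] in split_set else "right"

print(route({"userAgents": "Mozilla/5.0"}))  # left
print(route({"userAgents": "curl/7.68"}))    # right
```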
More specifically, Tensorflow Decision Forests uses the C++ library Yggdrasil Decision Forests (YDF) under the hood for any advanced computations. YDF offers three different algorithms for finding a good categorical split of the data. For example, the Random algorithm will just try out many possible splits at random and pick the best one.
For performance and quality reasons, YDF also preprocesses categorical features: If a categorical value is very rare, YDF may consider it “out-of-dictionary”, the threshold for “rare” being user-configurable. Furthermore, YDF maps the categorical features to integers by decreasing item frequency, with the mapping stored as part of the model. Note that this is purely an internal encoding; the algorithms are aware that a feature is categorical, hence typical issues with integer encodings do not apply.
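The frequency-ordered integer mapping can be sketched in a few lines (this is a simplified illustration of the idea, not YDF's code; reserving index 0 for out-of-dictionary values is an assumption here):

```python
from collections import Counter

# Map categories to integers by decreasing frequency; the mapping is
# kept so it can be reapplied at prediction time.
values = ["uk", "us", "us", "fr", "us", "uk"]
by_freq = [cat for cat, _ in Counter(values).most_common()]
mapping = {cat: i for i, cat in enumerate(by_freq, start=1)}  # 0 left free for out-of-dictionary
print(mapping)  # {'us': 1, 'uk': 2, 'fr': 3}
```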
Finally, Tensorflow Decision Forests (TF-DF) uses Keras, which expects classification tasks to have an integer label. Therefore, TF-DF users have to encode the label themselves or use the built-in pd_dataframe_to_tf_dataset.
Note that this answer only applies to Tensorflow Decision Forests. Other parts of Tensorflow may need manual encoding.

Correlations of feature columns in TensorFlow

I've recently started exploring TensorFlow's feature columns.
If I understood the documentation right, feature columns are just a 'frame' for further transformations applied right before the data is fed to the model. So if I want to use them, I define some feature columns, create a DenseFeatures layer from them, and when I fit data to the model, all features go through that DenseFeatures layer, get transformed, and then enter the first Dense layer of my NN.
My question is: is it possible to somehow check the correlations of the transformed features with my target variable?
For example, I have a categorical feature that corresponds to the day of the week (Mon/Tue/.../Sun), which I could map to 1/2/.../7. Its correlation with my target will not be the same as the correlation of the categorical feature column (e.g. an indicator column): with the integer encoding the model doesn't understand that 7 is the maximum of the sequence, whereas with categories it becomes a one-hot encoded feature with precise boundaries.
Let me know if anything is unclear.
I'll be grateful for the help!
TensorFlow does not provide a feature_importance attribute the way scikit-learn does for XGBoost.
However, you can test the importance (or correlation with the target) of a feature in TensorFlow as follows:
1) Shuffle the values of the particular feature whose relationship with the target you want to test. That is, if your feature is fea1, the value at df['fea1'][0] becomes the value at df['fea1'][4], the value at df['fea1'][2] becomes the value at df['fea1'][3], and so on.
2) Fit the model to the modified training data and check the accuracy on the validation data.
3) If the accuracy drops drastically, the feature was highly informative for the target; if the accuracy barely changes, the feature isn't of great importance (large error increase = high importance).
You can repeat this for the other features in your training data.
This can take some time and effort.
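The shuffling procedure described above is essentially permutation feature importance. scikit-learn ships a close relative, `permutation_importance`, which shuffles a validation column without refitting the model; a sketch on synthetic data (the dataset here is made up, the API calls are standard sklearn):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: a few informative features plus noise columns.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each validation column in turn and measure the score drop;
# a large drop means the model relied heavily on that feature.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
print(result.importances_mean)
```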

Naive Bayes multinomial model

For a movie reviews dataset, I'm creating a multinomial naive Bayes model. In the training dataset there are reviews per genre. Instead of creating a generic model for the movie reviews dataset that ignores the genre feature, how do I train a model that also takes the genre into account, in addition to the tf-idf of the words that occur in the review? Do I need to create one model per genre, or can I incorporate genre into a single model?
Training Dataset Sample:
genre, review, classification
Romantic, The movie was really emotional and touched my heart!, Positive
Action, It was a thrilling movie, Positive
....
Test Data Set:
Genre, review
Action, The movie sucked bigtime. The action sequences didnt fit into the plot very well
From the documentation: "The multinomial distribution normally requires integer feature counts." Categorical variables provided as inputs, especially when encoded as integers, may not improve the predictive capacity of the model. As stated above, you may either consider using a neural network or dropping the genre column entirely. If, after fitting, the model shows sufficient predictive capability on the text features alone, it may not even be necessary to add a categorical variable as input.
The way I would try this task is by stacking the dummy-encoded categorical values with the text features and feeding the stacked array to an SGD model, along with the target labels. You would then run a grid search for the optimal hyperparameters.
Consider treating genre as a categorical variable, probably with dummy encoding (see pd.get_dummies(df['genre'])), and feeding that as well as the tf-idf scores into your model.
Also consider other model types besides Naive Bayes: a neural network can model more interaction between variables and may help capture differences between genres better. Scikit-learn also has an MLPClassifier implementation which is worth a look.
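The stacking approach suggested above can be sketched as follows (the tiny train frame mirrors the question's sample; the alpha grid is just an example):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

train = pd.DataFrame({
    "genre": ["Romantic", "Action", "Action", "Romantic"],
    "review": ["emotional and touching", "thrilling ride",
               "dull action scenes", "flat love story"],
    "label": ["Positive", "Positive", "Negative", "Negative"],
})

# Tf-idf on the review text, dummy columns for the genre.
vec = TfidfVectorizer()
X_text = vec.fit_transform(train["review"])
X_genre = pd.get_dummies(train["genre"], dtype=float)

# Stack both blocks into one (sparse) design matrix.
X = hstack([X_text, X_genre.to_numpy()])

# Linear SGD model; grid-search a hyperparameter.
grid = GridSearchCV(SGDClassifier(random_state=0),
                    {"alpha": [1e-4, 1e-3]}, cv=2)
grid.fit(X, train["label"])
print(grid.best_params_)
```

At prediction time the test reviews go through the already-fitted `vec.transform` and the same dummy columns before stacking.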

High cardinal Categorical features into numerics

In most academic examples, we convert categorical features using get_dummies or OneHotEncoder. Let's say I want to use Country as a feature, and the dataset has 100 unique countries. When we apply get_dummies to country, we get 100 columns, and the model is trained on those 100 country columns plus the other features.
Now say we have deployed this model to production and receive data containing only 10 countries. When we preprocess that data with get_dummies, the model fails to predict, with "Number of features the model was trained on does not match the features passed", because we are passing 10 country columns plus the other features.
I came across the article below, where a score is computed using the supervised ratio or weight of evidence. But how do we calculate the score at prediction time in production, i.e. which number should be assigned to each country?
https://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html
Can you please help me understand how to handle such scenarios?
There are two things you can do.
Apply OHE after combining your training set and test/validation set, not before.
Skip OHE and apply a different encoding plus StandardScaler, because "if a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."
I usually try the second option when a categorical feature has many unique values and could cause a mismatch with my test/validation set.
Feel free to correct me.

How to assign weights to a set of features Linear regression

I am building a machine learning program with a simple linear regression. In the dataset I have two types of features:
Low-dimension features: e.g. word count, length, etc.
High-dimension features: tf-idf features
I want to manually assign weights to the different sets of features.
I have been trying a pipeline with FeatureUnion (transformer_weights), but I find it difficult to do so when I want the tf-idf features built from a vocabulary created from both the train and test data.
Could you please advise whether there are any solutions other than using a pipeline?
Thanks a lot!
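For reference, manual weighting of the two feature sets with FeatureUnion's `transformer_weights` can be sketched like this (the corpus, target values, and the `simple_stats` helper are all made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy corpus and a numeric target standing in for the real data.
docs = ["short text", "a somewhat longer piece of text",
        "another short review", "one more document to fit on"]
y = np.array([1.0, 2.5, 1.2, 2.0])

def simple_stats(texts):
    # Low-dimension block: word count and character length per document.
    return np.array([[len(t.split()), len(t)] for t in texts])

# transformer_weights multiplies each block's output before stacking,
# which is how you hand-tune the relative influence of the feature sets.
union = FeatureUnion(
    [("stats", FunctionTransformer(simple_stats)),
     ("tfidf", TfidfVectorizer())],
    transformer_weights={"stats": 2.0, "tfidf": 0.5},
)

model = Pipeline([("features", union), ("reg", LinearRegression())])
model.fit(docs, y)
preds = model.predict(docs)
print(preds.shape)  # (4,)
```

On the vocabulary issue: `TfidfVectorizer` accepts a precomputed `vocabulary=` mapping, so you could build the vocabulary from the combined train and test text up front and pass it in, which sidesteps the pipeline refitting problem.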
