For a movie reviews dataset, I'm creating a multinomial Naive Bayes model. The training dataset contains reviews from several genres. So instead of creating a generic model for the movie reviews dataset that ignores the genre feature, how do I train a model that also takes the genre feature into account, in addition to the tf-idf scores of the words that occur in the review? Do I need to create one model per genre, or can I incorporate the genre into a single model?
Training Dataset Sample:
genre, review, classification
Romantic, The movie was really emotional and touched my heart!, Positive
Action, It was a thrilling movie, Positive
....
Test Data Set:
Genre, review
Action, The movie sucked bigtime. The action sequences didnt fit into the plot very well
From the documentation: "The multinomial distribution normally requires integer feature counts." Categorical variables provided as inputs, especially if they are encoded as integers, may not have a positive impact on the predictive capacity of the model. As stated above, you could either consider using a neural network or drop the genre column entirely. If, after fitting, the model shows sufficient predictive capability on the text features alone, it may not even be necessary to add the categorical variable as an input.
The way I would try this task is by stacking the dummy-encoded categorical values with the text features and feeding the stacked array to an SGD model, along with the target labels. You would then run a grid search for the optimal choice of hyperparameters.
Consider treating genre as a categorical variable, probably with dummy encoding (see pd.get_dummies(df['genre'])), and feeding that as well as the tf-idf scores into your model.
Also consider other model types besides Naive Bayes: a neural network involves more interaction between variables and may help capture differences between genres better. Scikit-learn also has an MLPClassifier implementation which is worth a look.
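As a rough sketch of the stacking idea described above (df is assumed to be a pandas DataFrame already loaded with the genre, review, and classification columns from the sample; the hyperparameter grid is only an example):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# df is assumed to be the training DataFrame with the columns shown above:
# genre, review, classification
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(df['review'])        # tf-idf features from the review text
X_genre = pd.get_dummies(df['genre']).values           # dummy-encoded genre columns
X = hstack([X_text, X_genre])                          # one combined (sparse) feature matrix

search = GridSearchCV(
    SGDClassifier(loss='log_loss'),                    # called 'log' in older scikit-learn versions
    param_grid={'alpha': [1e-4, 1e-3, 1e-2]},
    cv=5,
)
search.fit(X, df['classification'])
```

At prediction time the same vectorizer and the same dummy columns would be applied to the test reviews before calling search.predict.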
Related
I'm working on a machine learning project where I'm trying to predict the revenue of a movie.
My dataset contains mixed data types. There are numerical features (rating, number of votes, release year, ...), categorical features (genres, studio, whether the movie is for mature audiences, ...), but also embeddings that consist of large feature vectors (post embeddings and movie description embeddings).
My problem is with the last data type: I'm wondering how I should handle these embeddings.
I've done some pre-processing (cleaning, one-hot encoding, label encoding, ...), but I still have these embeddings. Basically, I would now like to do some feature selection and model selection, but suppose, for example, that I want to use a filter method. For a linear model I could use a correlation matrix, but I cannot compute it since the variables img_embeddings and txt_embeddings are not numerical scalars but 1D vectors. The same applies if I want to use mutual information for non-linear models.
I have a dataset which includes socioeconomic indicators for students nationwide, as well as their grades. More specifically, this dataset has 36 variables as predictors for about 30 million students, and the students' grades as the responses.
My goal is to be able to predict whether a student will fail out (i.e., be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of an incorrect classification of the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
In the case of an imbalanced dataset, using class weights is the most common approach. But with such a large dataset (30M training examples) for a binary classification problem, where one class represents 2% and the other 98%, I would say it is hard to prevent the model from being biased against the minority class using class weights alone, since it is not much different from reducing the training set size to make it balanced.
Here are some steps for evaluating model accuracy.
Split your dataset into train, evaluation, and test sets.
For the evaluation metric I suggest these alternatives:
a. Make sure at least 20% of the examples in both the evaluation and test sets belong to the first (minority) class.
b. Use precision and recall as the evaluation metrics for your model (rather than the F1 score).
c. Use Cohen's kappa score (coefficient) as the evaluation metric.
From my own perspective, I prefer option b.
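As a small sketch of options b and c with scikit-learn (y_true and y_pred are assumed to be the evaluation-set labels and predictions, and pos_label=1 assumes the minority "fail out" class is coded as 1):

```python
from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

# y_true and y_pred: labels and predictions on the evaluation set (assumed available).
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
kappa = cohen_kappa_score(y_true, y_pred)
print(f"precision={precision:.3f}  recall={recall:.3f}  kappa={kappa:.3f}")
```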
Since you are using TensorFlow, I assume you are familiar with deep learning, so use deep learning instead of classical machine learning; that gives you many additional alternatives. In any case, here are some options for both the machine learning and the deep learning approach.
For Machine Learning Algorithms
1. Decision tree algorithms (especially Random Forest).
2. If the features are essentially uncorrelated (correlations close to zero, e.g. 0.01), I would try Complement Naive Bayes for multinomial features, or Gaussian Naive Bayes with class weights for continuous features.
3. Try some nonparametric learning algorithms. You may not be able to fit this training set easily with Support Vector Machines (SVM) because the dataset is fairly large, but you could try.
4. Try unsupervised learning algorithms (this sometimes gives you a more generic model).
For Deep Learning Algorithms
1. Encoder-decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Train the model using 1D convolution layers.
4. Use class weights.
5. Use balanced batches of the training set, randomly chosen.
You have many other alternatives. From my own perspective, I would try hard to make it work with 1, 3, or 5.
For deep learning, the 5th approach (balanced batches) sometimes works very well, and I recommend trying it together with 1 and 3.
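Since the question mentions TensorFlow 2.0, here is a minimal sketch of the class-weight option (item 4 in the deep learning list) in Keras; the architecture and the 49:1 weighting are only illustrative starting points, not tuned values:

```python
import tensorflow as tf

# A deliberately simple binary classifier; the architecture is only illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(36,)),  # 36 predictors
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# Up-weight the 2% minority class (label 1) roughly in proportion to the imbalance.
class_weight = {0: 1.0, 1: 49.0}

# X_train, y_train, X_val, y_val are assumed to be your prepared arrays.
model.fit(X_train, y_train,
          epochs=10,
          batch_size=1024,
          class_weight=class_weight,
          validation_data=(X_val, y_val))
```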
I'm wondering whether the result on the test set can be used to optimize the model's weights. I'm trying to build a model, but the issue is that I don't have much data because the subjects are medical research patients. The number of patients in my case is limited (61), and I have 5 feature vectors per patient. What I tried is to create a deep learning model by excluding one subject and using the excluded subject as the test set. My problem is that there is large variability in the subject features, and my model fits the training set (60 subjects) well, but not the one excluded subject.
So I'm wondering whether the test set (in my case the excluded subject) could be used in some way to make the model converge so that it classifies the excluded subject better?
You should not use the test data of your dataset in your training process. If your training data is not enough, one approach used a lot these days (especially for medical images) is data augmentation. So I highly recommend you use this technique in your training process. "How to use Deep Learning when you have Limited Data" is a good tutorial about data augmentation.
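Since your data are feature vectors rather than images, image-style augmentation does not apply directly; one very simple sketch of augmentation for vector features is to add small Gaussian noise to copies of the training samples (the noise scale and the helper name are only illustrative and would need tuning):

```python
import numpy as np

def augment_with_noise(X, y, copies=5, noise_scale=0.01, seed=0):
    """Append noisy copies of each training sample (a crude augmentation for feature vectors)."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        # Noise scaled by each feature's standard deviation.
        noise = rng.normal(0.0, noise_scale, size=X.shape) * X.std(axis=0)
        X_parts.append(X + noise)
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)

# X_train (n_samples, n_features) and y_train are assumed to be your training arrays:
# X_aug, y_aug = augment_with_noise(X_train, y_train)
```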
No, you shouldn't use your test set for training; this is to prevent overfitting. If you follow cross-validation principles, you need to split your data into three sets: a training set which you use to train your model, a validation set to compare different values of your hyperparameters, and a test set to finally evaluate your model. If you use all your data for training, your model will obviously overfit.
One thing to remember: deep learning works well when you have a large and very rich dataset.
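A minimal sketch of the three-way split described above, using two calls to scikit-learn's train_test_split (the 60/20/20 proportions are just an example):

```python
from sklearn.model_selection import train_test_split

# X, y are assumed to be the full feature matrix and labels.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
# Result: 60% train, 20% validation, 20% test.
```

Since there are several feature vectors per patient, it would be safer to split by patient (for example with sklearn.model_selection.GroupShuffleSplit) so that the same patient never appears in more than one set.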
I am studying ensemble machine learning, and while reading some articles online I encountered 2 questions.
1.
In this article, it mentions
Instead, model 2 may have a better overall performance on all the data points, but it has worse performance on the very set of points where model 1 is better. The idea is to combine these two models where they perform the best. This is why creating out-of-sample predictions have a higher chance of capturing distinct regions where each model performs the best.
But I still don't get the point: why doesn't training on all of the training data avoid the problem?
2.
From this article, in the prediction section, it mentions
Simply, for a given input data point, all we need to do is to pass it through the M base-learners and get M number of predictions, and send those M predictions through the meta-learner as inputs
But in the training process, we use k-fold training data to train the M base-learners, so should I also train the M base-learners on all of the training data before using them for prediction?
Assume red and blue were the best models you could find.
One works better in region 1, the other in region 2.
Now you would also train a classifier to predict which model to use, i.e., you would try to learn the two regions.
Do the validation on the outside. You can overfit if you give the two inner models access to data that the meta model does not see.
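Scikit-learn's StackingClassifier handles this "validation on the outside" for you: the meta-model is fit on out-of-fold predictions of the base models. A minimal sketch (the base models chosen here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)  # toy data for illustration

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svc', SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # meta-model that learns when to trust which base model
    cv=5,                                  # meta-model is trained on out-of-fold base predictions
)
stack.fit(X, y)
```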
The idea in ensembles is that a group of weak predictors can outperform a strong predictor. So, if we train different models with different predictive results and use the majority rule as the final result of our ensemble, this result is better than just training one single model. Assume, for example, that the data consist of two distinct patterns, one linear and one quadratic. Then a single classifier can either overfit or produce inaccurate results.
You can read this tutorial to learn more about ensembles and bagging and boosting.
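For the majority-rule idea above, scikit-learn's VotingClassifier is a direct way to try it; a minimal sketch with three arbitrarily chosen model families:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)  # toy data for illustration

ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=0)),
                ('nb', GaussianNB())],
    voting='hard',  # majority rule over the predicted labels
)
ensemble.fit(X, y)
```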
1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We will hold that data for validation purpose, just like the way we do in K-fold
2) "so should I also train M base-learner based on all train data for the input to predict?" - If you give same data to all the learners then the output of all of them would be same and there is no use in creating them. So we will give a subset of data to each learner.
For question 1, I will show by contradiction why we train two separate models.
Suppose you train one model with all the data points. During training, whenever the model sees a data point belonging to the red class, it will try to fit itself so that it can classify red points with minimal error. The same is true for data points belonging to the blue class. Therefore, during training the model leans toward whichever points it is currently fitting (red or blue), and in the end it tries to fit itself so that it does not make too many mistakes on either class; the final model is an average model.
But if instead you train two models on the two different datasets, then each model is trained on its own specific dataset and does not have to care about data points belonging to the other class.
It becomes clearer with the following metaphor.
Suppose there are two people who specialize in two completely different jobs. When a job comes in, imagine telling them that both of them have to work on it and each must do 50% of it; think about what kind of result you will get. Now think about what the result would be if instead you told each person to work only on the job at which they are best.
For question 2, you have to split the training dataset into M subsets, and during training give those M subsets to the M base learners.
I am using Scikit-Learn to classify texts (in my case tweets) using LinearSVC. Is there a way to classify texts as unclassified when they are a poor fit with any of the categories defined in the training set? For example if I have categories for sport, politics and cinema and attempt to predict the classification on a tweet about computing it should remain unclassified.
In the supervised learning approach as it is, you cannot add an extra category.
Therefore, I would use a heuristic. Try to predict the probability of each category. Then, if the probabilities for all categories (or at least most of them) are approximately equal, you can say that the sample is "unknown".
For this approach, LinearSVC or another type of Support Vector Classifier is poorly suited, because it does not naturally give you probabilities. Another classifier (Logistic Regression, Naive Bayes, decision trees, random forests) would be better.
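A minimal sketch of this heuristic, assuming a probabilistic classifier such as Logistic Regression on top of tf-idf features (the 0.5 confidence threshold and the helper name predict_or_unknown are just for illustration and would need tuning on held-out data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# texts and labels are assumed to be the training tweets and their categories.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

def predict_or_unknown(model, new_texts, threshold=0.5):
    """Return the predicted category, or 'unknown' when no class is confident enough."""
    proba = model.predict_proba(new_texts)
    best = proba.argmax(axis=1)
    predicted = model.classes_[best]
    return np.where(proba.max(axis=1) < threshold, 'unknown', predicted)
```

If you want to keep LinearSVC, sklearn.calibration.CalibratedClassifierCV can wrap it to provide predict_proba.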