I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total there are around 130k features (after a feature selection conducted on the TF-IDF features), and the training set has around 120k observations.
Around 500 of these features are the non-TF-IDF features.
The issue is that, on the same test set etc, the accuracy of the Random Forest with
- only the non-TF-IDF features is 87%
- the TF-IDF and non-TF-IDF features together is 76%
This significant drop in accuracy raises some questions in my mind.
The relevant piece of my code for training the models is the following:
import os

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

drop_columns = ['labels', 'complete_text_1', 'complete_text_2']
# Split into predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Convert the general (non-TF-IDF) features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and tf-idf feature arrays
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count() - 1)
rf_classifier.fit(X_train_all, y_train)
Personally, I have not found any bug in my code (neither in the piece above nor elsewhere).
The hypothesis which I have formulated to explain this decrease in accuracy is the following.
The number of non-TF-IDF features is only 500 (out of the 130k features in total).
This makes it likely that the non-TF-IDF features are not picked that often at each split by the trees of the random forest (e.g. because of max_features etc).
So if the non-TF-IDF features do actually matter, this will create problems because they are not taken into account enough.
Related to this, when I check the feature importances of the random forest after training, I see that the importances of the non-TF-IDF features are very, very low (although I am not sure how reliable an indicator the feature importances are, especially with TF-IDF features included).
Can you offer a different explanation for the decrease in my classifier's accuracy?
In any case, what would you suggest doing?
Some other ideas for combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features.
The results of these two models would then be combined either by (weighted) voting or by meta-classification.
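For reference, a minimal sketch of that idea with the variables from the code above (the 0.7/0.3 weights are arbitrary placeholders, and the test-set matrices are assumed to be built the same way as the training ones):
# Hedged sketch: two separate forests, one per feature block, combined by weighted soft voting.
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier

X_train_tfidf = hstack([X_train_tf_idf_1, X_train_tf_idf_2])
rf_plain = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=-1)
rf_tfidf = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=-1)
rf_plain.fit(X_train, y_train)        # non-TF-IDF features only
rf_tfidf.fit(X_train_tfidf, y_train)  # TF-IDF features only
# Weighted soft vote at prediction time; X_test / X_test_tfidf are the corresponding test matrices.
proba = 0.7 * rf_plain.predict_proba(X_test) + 0.3 * rf_tfidf.predict_proba(X_test_tfidf)
y_pred = rf_plain.classes_[proba.argmax(axis=1)]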
Your view that 130K features is way too much for the Random Forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of possible next steps. Here are a few ideas off the top of my head.
If the number of datapoints is large enough, you may want to train some transformation for the TF-IDF features - e.g. you might want to train a low-dimensional embedding of these TF-IDF features into, say, a 64-dimensional space and then a small NN on top of that (even a linear model maybe). After you have the embeddings you could use them as a transform to generate 64 additional features for each example, replacing the TF-IDF features for RandomForest training. Or alternatively just replace the whole random forest with a NN of such an architecture that e.g. all the TF-IDFs are combined into a few neurons via fully-connected layers and later concatenated with the other features (pretty much the same as embeddings, but as a part of the NN).
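A rough sketch of the embedding idea, substituting a plain TruncatedSVD for the learned embedding (a simpler unsupervised stand-in; the 64 components are an arbitrary choice and the variable names follow the question's code):
# Hedged sketch: compress the TF-IDF block into 64 dense components and hand the
# compact representation plus the ~500 general features to the forest.
import numpy as np
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

svd = TruncatedSVD(n_components=64, random_state=0)
tfidf_block = hstack([X_train_tf_idf_1, X_train_tf_idf_2])       # from the question's code
tfidf_compact = svd.fit_transform(tfidf_block)                   # dense (n_samples, 64)
X_train_compact = np.hstack([X_train.toarray(), tfidf_compact])  # ~500 + 64 columns
rf = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=-1)
rf.fit(X_train_compact, y_train)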
If you don't have enough data to train a large NN, maybe you can try to train a GBDT ensemble instead of the random forest. It should probably do a much better job at picking the good features compared to a random forest, which is likely to be affected a lot by many noisy, useless features. Also, you can first train some crude version and then do a feature selection based on that (again, I would expect it to do a more reasonable job than the random forest).
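A rough sketch of that route, using LightGBM as one possible GBDT implementation (the parameters are placeholders, and the importance-based pruning here is a deliberately crude feature selection):
# Hedged sketch: train a GBDT on the full sparse matrix, then keep only the columns
# it actually used and retrain on that reduced set (crude importance-based selection).
import numpy as np
from lightgbm import LGBMClassifier

gbdt = LGBMClassifier(n_estimators=300, random_state=0)
gbdt.fit(X_train_all.tocsr(), y_train)             # X_train_all is the sparse matrix from the question
keep = np.where(gbdt.feature_importances_ > 0)[0]  # columns with non-zero split importance
gbdt_small = LGBMClassifier(n_estimators=300, random_state=0)
gbdt_small.fit(X_train_all.tocsc()[:, keep], y_train)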
My guess is that your hypothesis is partly correct.
When using the full dataset (the 130K-feature model), each split in a tree considers only a small fraction of the features, so few of the 500 non-TF-IDF features are candidates at any given split. So if the non-TF-IDF features are important, then each split misses out on a lot of useful information. The information that is ignored at one split will probably be used at a different split in the tree, but the result isn't as good as it would be if more of it were available at every split.
I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.
In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.
In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.
If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.
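For example, a minimal sketch with the question's variables (the concrete values are illustrative starting points, not tuned recommendations, and training will be considerably slower):
# Hedged sketch: a larger candidate pool per split and more trees, so that
# important-but-rare features are considered more often.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,       # more trees than the original 150
    max_features=0.05,      # ~6.5k of the 130k features per split instead of sqrt(130k) ≈ 360
    class_weight='balanced',
    random_state=0,
    n_jobs=-1,
)
rf.fit(X_train_all, y_train)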
I have a project that asks for binary classification of whether an employee will leave the company or not, based on about 52 features and 2000 rows of data. The data is somewhat balanced, with 1200 negative to 800 positive. I have done extensive EDA and data cleansing. I chose to try several different models from sklearn: Logistic Regression, SVM, and Random Forests. I am getting very poor and similar results from all of them. I only used 15 of the 52 features for this run, but the results are almost identical to when I used all 52 features. Of the 52 features, 6 were categorical, which I converted to dummies (between 3-6 categories per feature), and 3 were datetime, which I converted to days-since-epoch. There were no null values to fill.
This is the code and confusion matrix from my most recent run with a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(small_features, endreason, test_size=0.2, random_state=0)
RF = RandomForestClassifier(bootstrap=True, max_features='sqrt', random_state=0)
RF.fit(x_train, y_train)
rf_predictions = RF.predict(x_test)  # assign the predictions so they can be reused below
cm = confusion_matrix(y_test, rf_predictions)
plot_confusion_matrix(cm, classes=['Negative', 'Positive'], title='Confusion Matrix')  # custom plotting helper
What are steps I can do to help better fit this model?
The results you are showing definitely seem a bit discouraging for the methods you propose and the class balance you describe. However, from the description of the problem there definitely seems to be a lot of room for improvement.
When you are using train_test_split, make sure you pass stratify=endreason so that the class proportions are preserved when splitting the dataset. Moving on to helpful points to improve your model:
First of all, dimensionality reduction: since you are dealing with many features, some of them might be useless or even contaminate the classification problem you are trying to solve. It is very important to consider fitting different dimensionality reduction techniques to your data and feeding the reduced data to your model. Some common approaches that might be worth trying (see the sketch after this list):
PCA (Principal component analysis)
Low Variance & Correlation filter
Random Forests feature importance
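A minimal sketch of the first two ideas on this question's (dense, 52-feature) data; the threshold and component count are arbitrary placeholders:
# Hedged sketch: drop near-constant columns, project onto a few principal
# components, then fit the forest on the reduced data.
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

reduced_model = Pipeline([
    ('low_variance', VarianceThreshold(threshold=0.01)),
    ('pca', PCA(n_components=10)),
    ('rf', RandomForestClassifier(random_state=0)),
])
reduced_model.fit(x_train, y_train)
# For the third idea, a fitted RandomForestClassifier exposes feature_importances_ for ranking columns.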
Secondly, understanding the model: while Logistic Regression might prove to be an excellent baseline for a linear classifier, it might not necessarily be what you need for this task. Random Forests seem to be much better at capturing non-linear relationships, but need to be controlled and pruned to avoid overfitting and might require a lot of data. On the other hand, SVM is a very powerful method with non-linear kernels, but might prove inefficient when working with huge amounts of data. XGBoost and LightGBM are very powerful gradient boosting algorithms that have won multiple Kaggle competitions and work very well in almost every case; of course, some preprocessing is needed, as XGBoost is not prepared to work with categorical features (LightGBM is). My suggestion is to try these last two methods. From best to worst (in general scenarios) I would rank them:
LightGBM / XGBoost
RandomForest / SVM / Logistic Regression
Last but not least, hyperparameter tuning: regardless of the method you choose, there will always be some fine-tuning that needs to be done. Sklearn offers GridSearchCV, which comes in really handy. However, you would need to understand how your classifiers are behaving in order to know what you should be looking for. I will not go in-depth on this as it would be off-topic and not suited for SO, but you can definitely have a read here
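For example, a minimal grid search over the random forest from the question (the grid values and scoring choice are placeholders):
# Hedged sketch: grid search over a few random forest hyperparameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', 0.5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='f1_macro')
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)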
Currently I am building a classifier with heavily imbalanced data. I am using the imblearn pipeline to first do StandardScaling, then SMOTE, and then the classification, with GridSearchCV. This ensures that the upsampling is done during the cross-validation. Now I want to include feature_selection in my pipeline. How should I include this step in the pipeline?
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', RandomForestClassifier())
])
param_grid = {
    'classification__n_estimators': [10, 20, 50],
    'classification__max_depth': [2, 3, 5]
}
gridsearch_model = GridSearchCV(model, param_grid, cv=4, scoring=make_scorer(recall_score))
gridsearch_model.fit(X_train, y_train)
predictions = gridsearch_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
It does not necessarily make sense to include feature selection in a pipeline where your model is a random forest (RF). This is because the max_depth and max_features arguments of the RF model essentially control the number of features considered when building the individual trees (max_depth=n just says that each tree in your forest is grown to at most depth n, with each split chosen from a random subset of max_features features). Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
You can simply investigate your trained model for the top ranked features. When training an individual tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. So then you actually don't need to retrain the forest for different feature sets, because the feature importance (already computed in the sklearn model) tells you all the info you'd need.
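For instance, a small sketch of inspecting those importances once the grid search from the question has been fitted (printing column indices, since feature names may not be available):
# Hedged sketch: rank features by the forest's impurity-based importances.
import numpy as np

rf = gridsearch_model.best_estimator_.named_steps['classification']
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]
for idx in order[:20]:   # top 20 columns of X_train
    print(idx, importances[idx])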
P.S. I would not waste time grid searching n_estimators either, because more trees generally give better accuracy. More trees mean more computational cost, and after a certain number of trees the improvement becomes too small to matter, so that is the main thing to worry about; otherwise you will gain performance from a largish n_estimators, and you are not really at risk of overfitting from it either.
Do you mean feature selection from sklearn? https://scikit-learn.org/stable/modules/feature_selection.html
You can run it in the beginning. You will basically adjust your columns of X (X_train and X_test accordingly). It is important that you fit your feature selection only on the training data (as your test data should be unseen at that point in time).
How should I include this step into the pipeline?
so you should run it before your code.
There is no single "how", as if there were one concrete recipe; it depends on your goal.
If you want to check which set of features gives you the best performance (according to your metric, here recall), you could use sklearn's sklearn.feature_selection.RFE (Recursive Feature Elimination) or its cross-validation variant, sklearn.feature_selection.RFECV.
The first one fits your model with the whole set of features, measures their importance, and prunes the least impactful ones. This operation continues until the desired number of features is left. It is quite computationally intensive, though.
The second one starts with all features, removes step features at a time, and cross-validates the model for each resulting feature count. This continues until min_features_to_select is hit. It is VERY computationally intensive, way more than the first one.
As this operation is rather infeasible to use in connection with hyperparameter search, you should do it with a fixed set of defaults before GridSearchCV, or after you have found some suitable values with it. In the first case, the feature choice will not depend on the hyperparameters you've found, while in the second case the influence might be quite high. Both ways are correct but will probably yield different results and models.
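A minimal sketch of the "fixed defaults first" route, reusing the question's variables (the RFECV settings are placeholders):
# Hedged sketch: pick the feature subset once with default-ish hyperparameters,
# then run the SMOTE + RandomForest pipeline and GridSearchCV from above on the reduced data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=5, min_features_to_select=10, cv=4, scoring='recall')
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)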
You can read more about RFECV and RFE in this StackOverflow answer.
I'm doing dialect text classification and I'm using CountVectorizer with naive Bayes. The number of features is too large: I have collected 20k tweets across 4 dialects, with 5000 tweets per dialect, and the total number of features is 43K. I was thinking maybe that's why I am overfitting, because the accuracy drops a lot when I test on new data. So how can I limit the number of features to avoid overfitting the data?
You can set the parameter max_features to 5000, for instance; it might help with overfitting. You could also tinker with max_df (for instance, set it to 0.95).
This drop on test data is caused by the curse of dimensionality. You can use some dimensionality reduction method to reduce this effect. A possible choice is Latent Semantic Analysis, implemented in sklearn.
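A sketch combining both suggestions (the numbers are placeholders, and train_texts/train_labels stand in for your tweet texts and dialect labels):
# Hedged sketch: cap the vocabulary and drop very frequent terms for the Naive Bayes model,
# and, as an LSA alternative, compress the counts with TruncatedSVD (whose output can be
# negative, so it is paired here with a classifier that accepts real-valued features).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nb_model = Pipeline([
    ('vect', CountVectorizer(max_features=5000, max_df=0.95)),
    ('nb', MultinomialNB()),
])
lsa_model = Pipeline([
    ('vect', CountVectorizer(max_df=0.95)),
    ('svd', TruncatedSVD(n_components=300, random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])
# nb_model.fit(train_texts, train_labels); lsa_model.fit(train_texts, train_labels)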
I am working with a dataset of about 400,000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but extremely poor when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test sets at random and the dataset is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on the training set: 0.64
R^2 score when testing on the test set: approximately -10^23
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree with his answer that a neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need to somehow take care of your data; hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
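For example (the alpha values are placeholders that you would tune, e.g. by cross-validation):
# Hedged sketch: drop-in regularised replacements for LinearRegression.
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # also zeroes out unhelpful coefficients
print(ridge.score(X_test, y_test), lasso.score(X_test, y_test))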
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably more importantly than anything else, you should consider adapting your input data. The very first thing I'd try, assuming you are really trying to predict a price as your code implies, is to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit that single object that was sold for 1 million $ while ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, if necessary reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical value for each input (e.g. buyer name) will lead you straight to perfect overfitting.
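A minimal sketch of the log-target idea (np.log1p and np.expm1 form the invertible pair; Ridge is used here only as an example regressor):
# Hedged sketch: fit on log(1 + price) and transform predictions back for evaluation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

log_model = Ridge(alpha=1.0)
log_model.fit(X_train, np.log1p(y_train))
y_pred = np.expm1(log_model.predict(X_test))   # back to the original price scale
print(r2_score(y_test, y_pred))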
After all this (cleaning data, reducing the dimension via one of the methods above or just Lasso regression until you get to certainly fewer than 100 dimensions, possibly fewer than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them:
- Kernel ridge regression (http://scikit-learn.org/stable/modules/kernel_ridge.html) is the easiest to use out of the box (it also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features, and see how slow that is).
- SVM regression (http://scikit-learn.org/stable/modules/svm.html#regression) has many different flavours, but I think all but the linear one would be too slow.
- Sticking to linear things, SGD regression (http://scikit-learn.org/stable/modules/sgd.html#regression) is probably the fastest, and would be how I'd train a linear model on this many samples.
- Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly (http://scikit-learn.org/stable/modules/tree.html#regression, but that's an almost-certain overfit) or, better, using some ensemble technique: random forests (http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees) are the typical go-to algorithm, and gradient boosting (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting) sometimes works better.
- Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html, but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.); however, if you're not familiar with those, it is certainly not worth the trouble!
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate the features and to train a single predictor.)
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set with 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB), and to use a simple majority vote.
My question is:
How can I get performance metrics for the different classifiers and/or their voted predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ . The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10-fold dataset partitions first, and for each fold train all of the ensemble's members on that same fold, measure the accuracy, rinse and repeat with the other folds, and then compute the accuracy of the ensemble. So the key difference is not to train the individual algorithms with their own k-fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble never sees the test data) is to take the probabilities and/or labels output by your classifiers, and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
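For reference, a minimal sketch of that stacking idea using scikit-learn's StackingClassifier (available in newer scikit-learn versions; the base estimators are illustrative, and X/y stand for your feature matrix and labels):
# Hedged sketch: the base models' out-of-fold predictions are fed to a final classifier.
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=0)),
        ('et', ExtraTreesClassifier(n_estimators=200, random_state=0)),
        ('svm', SVC(kernel='rbf', probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10,
)
print(cross_val_score(stack, X, y, cv=10, scoring='roc_auc').mean())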
You can use a linear regression for stacking. For each of the 10 folds, you can split the data into:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack your predictions by fitting a linear regression - or a logistic regression - over the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k, and a_k is the weight of model k. You can also directly use the predicted outcomes if the models don't give you probabilities.
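A minimal sketch of that blending step (base_models, X_valid, y_valid and X_test are hypothetical names for the already-fitted classifiers and the validation/test splits described above):
# Hedged sketch: learn the blending weights a_k with a logistic regression on the
# validation-set probabilities of the already-fitted base models.
import numpy as np
from sklearn.linear_model import LogisticRegression

P_valid = np.column_stack([m.predict_proba(X_valid)[:, 1] for m in base_models])
blender = LogisticRegression()
blender.fit(P_valid, y_valid)   # learns one weight a_k per model, plus the intercept a_0
P_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
y_test_pred = blender.predict(P_test)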
If your models are all of the same type, you can optimise the parameters of the models and the weights at the same time.
If you have obvious differences, you can make different bins with different parameters for each. For example, one bin could be short sequences and the other long sequences. Or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix Prize, in the section on blending. In 2008 and 2009 they used more advanced techniques; it may also be interesting for you.