How to increase the presicion of text classification with the RBM?

How to increase the presicion of text classification with the RBM? - python

I am learning about text classification and I classify with my own corpus with linnear regression as follows:
from sklearn.linear_model.logistic import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=7)
classifier.fit(training_matrix, y_train)
prediction = classifier.predict(testing_matrix)
I would like to increase the classification report with a Restricted Boltzman Machine that scikit-learn provide, from the documentation I read that this could be use to increase the classification recall, f1-score, accuracy, etc. Could anybody help me to increase this is what I tried so far, thanks in advance:
vectorizer = TfidfVectorizer(max_df=0.5,
max_features=None,
ngram_range=(1, 1),
norm='l2',
use_idf=True)
X_train = vectorizer.fit_transform(X_train_r)
X_test = vectorizer.transform(X_test_r)
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
logistic = LogisticRegression()
rbm= BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
classifier.fit(X_train, y_train)

First, you have to understand the concepts here. RBM can be seen as a powerful clustering algorithm and clustering algorithms are unsupervised, i.e., they don't need labels.
Perhaps, the best way to use RBM in your problem is, first to train an RBM (which only needs data without labels) and then use the RBM weights to initialize a Neural network. To get a logistic regression in the output, you have to add an output layer with logistic reg. cost function to this neural net and train this neural network. This setting may result in performance improvement.

There are a couple of things that could be wrong.
1. You haven't properly calibrated the RBM
Look at the example on the scikit-learn site: http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html
In particular, these lines:
rbm.learning_rate = 0.06
rbm.n_iter = 20
# More components tend to give better prediction performance, but larger
# fitting time
rbm.n_components = 100
You don't set these anywhere. In the example, these are obtained through cross validation using a grid search. You should do the same and try to obtain (close to) optimal parameters for your own problem.
Additionally, you might want to try using cross validation to determine other parameters as well, such as the ngram range (using higher level ngrams as well usually helps, if you can afford the memory and execution time. For some problems, character level ngrams do better than word level) and logistic regression parameters.
2. You are just unlucky
There is nothing that says using an RBM in an intermediate step will definitely improve any performance measure. It can, but it's not a rule, it may very well do nothing or very little for your problem. You have to be prepared for this.
It's worth trying because it shouldn't take long to implement, but be prepare to have to look elsewhere.
Also look at the SGDClassifier and the PassiveAggressiveClassifier. These might improve performance.

Related

How to fix a very high false negative rate in fairly balanced binary classification?

I have a project that asks to do binary classification for whether an employee will leave the company or not, based on about 52 features and 2000 rows of data. The data is somewhat balanced with 1200 neg to 800 pos. I have done extensive EDA and data cleansing. I chose to try several different models from sklearn, Logarithmic Regression, SVM, and Random Forests. I am getting very poor and similar results from all of them. I only used 15 of the 52 features for this run, but the results are almost identical to when I used all 52 features. Of the 52 features, 6 were categorical that I converted to dummies (between 3-6 categories per feature) and 3 were datetime that I converted to days-since-epoch. There were no null values to fill.
This is the code and confusion matrix my most recent run with a random forest.
x_train, x_test, y_train, y_test = train_test_split(small_features, endreason, test_size=0.2, random_state=0)
RF = RandomForestClassifier(bootstrap = True,
max_features = 'sqrt',
random_state=0)
RF.fit(x_train, y_train)
RF.predict(x_test)
cm = confusion_matrix(y_test, rf_predictions)
plot_confusion_matrix(cm, classes = ['Negative', 'Positive'],
title = 'Confusion Matrix')
What are steps I can do to help better fit this model?

The results you are showing definitely seem a bit dis-encouraging for the methods your propose and balance of the data you describe. However, from the description of the problem there definitely seems to be a lot of room for improvement.
When you are using train_test_split make sure you pass stratify=endreason to make sure there's no issues regarding the labels when splitting the dataset. Moving on to helpful points to improve your model:
First of all, dimensionality reduction: Since you are dealing with many features, some of them might be useless or even contaminate the classification problem you are trying to solve. It is very important to considering fitting different dimension reduction techniques to your data and using this fitted data to feed your model. Some common approaches that might be worth trying:
PCA (Principal component analysis)
Low Variance & Correlation filter
Random Forests feature importance
Secondly understanding the model: While Logistic Regression might prove to be an excellent baseline for a linear classifier, it might not necessarily be what you need for this task. Random Forests seem to be much better when capturing non-linear relationships but needs to be controlled and pruned to avoid overfitting and might require a lot of data. On the other hand, SVM is a very powerful method with non-linear kernels but might prove inefficient when working with huge amounts of data. XGBoost and LightGBM are very powerful gradient boosting algorithms that have won multiple kaggle competitions and work very well in almost every case, of course there needs to be some preprocessing as XGBoost is not preparred to work with categorical features (LightGBM is). My suggestion is to try these last 2 methods. From worse to last (in general case scenarios) I would list:
LightGBM / XGBoost
RandomForest / SVM / Logistic Regression
Last but not least hyperparameter tunning: Regardless of the method you choose, there will always be some fine-tuning that needs to be done. Sklearn offers gridsearch which comes in really handy. However you would need to understand how your classifiers are behaving in order to know what should you be looking for. I will not go in-depth in this as it will be off-topic and not suited for SO but you can definitely have a read here

low training (~64%) and test accuracy (~14%) with 5 different models

Im struggling to find a learning algorithm that works for my dataset.
I am working with a typical regressor problem. There are 6 features in the dataset that I am concerned with. There are about 800 data points in my dataset. The features and the predicted values have high non-linear correlation so the features are not useless (as far as I understand). The predicted values have a bimodal distribution so I disregard linear model pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and xgb regressor. The training dataset returns accuracy and the test data returns 11%-14%. Both numbers scare me haha. I try tuning the parameters for the random forest but seems like nothing particularly make a drastic difference.
Function to tune the parameters
def hyperparatuning(model, train_features, train_labels, param_grid = {}):
grid_search = GridSearchCV(estimator = model, param_grid = param_grid, cv = 3, n_jobs = -1, verbose =2)
grid_search.fit(train_features, train_labels)
print(grid_search.best_params_)
return grid_search.best_estimator_`
Function to evaluate the model
def evaluate(model, test_features, test_labels):
predictions = model.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100*np.mean(errors/test_labels)
accuracy = 100 - mape
print('Model Perfomance')
print('Average Error: {:0.4f} degress. '.format(np.mean(errors)))
print('Accuracy = {:0.2f}%. '.format(accuracy))
I expect the output to be at least ya know acceptable but instead i got training data to be 64% and testing data to be 12-14%. It is a real horror to look at this numbers!

There are several issues with your question.
For starters, you are trying to use accuracy in what it seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, neither it is used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.

It is an overfitting problem. You are fitting the hypothesis very well on your training data.
Possible solutions to your problem:
You can try getting more training data(not features).
Try less complex model like decision trees since highly complex
models(like random forest,neural networks etc.) fit the hypothesis
well on the training data.
Cross-validation:It allows you to tune hyperparameters with only
your original training set. This allows you to keep your test set as
a truly unseen dataset for selecting your final model.
Regularization:The method will depend on the type of learner you’re
using. For example, you could prune a decision tree, use dropout on
a neural network, or add a penalty parameter to the cost function in
regression.
I would suggest you use pipeline function since it'll allow you to perform multiple models simultaneously.
An example of that:
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
'pca__n_components': [5, 20, 30, 40, 50, 64],
'logistic__alpha': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, iid=False, cv=5)
search.fit(X_train, X_test)

I would suggest improving by preprocessing the data in better forms. Try to manually remove the outliers, check the concept of cook's distance to see elements which have high influence in your model negatively. Also, you could scale the data in a different form than Standard scaling, use log scaling if elements in your data are too big, or too small. Or use feature transformations like DCT transform/ SVD transform etc.
Or to be simplest, you could create your own features with the existing data, for example, if you have yest closing price and todays opening price as 2 features in stock price prediction, you can create a new feature saying the difference in cost%, which could help a lot on your accuracy.
Do some linear regression analysis to know the Beta values, to have a better understanding which feature is contributing more to the target value. U can use feature_importances_ in random forests too for the same purpose and try to improve that feature as well as possible such that the model would understand better.
This is just a tip of ice-berg of what could be done. I hope this helps.

Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.

How does one identify if the ML model is overfitting the dataset or not?

I have been comparing different regression models from sklearn, On doing so I was confused with the model's score value that i got.
Below in the code you can see that i have used both Linear Regression and Ridge Regression but the difference in score values for the training and test data set vary by a lot.
using Linear Regression
from sklearn.linear_model import LinearRegression as lr
model = lr()
model.fit(X_train, y_train)
model.predict(X_test)
print("LINEAR REGRESSION")
print("Training Score", end = "\t")
print(model.score(X_train, y_train))
print("Test Score", end = "\t")
print(model.score(X_test, y_test))
------------------------------------------------------------
O/P
LINEAR REGRESSION
Training Score 0.7147120015665793
Test Score 0.4242120003778227
Using Ridge Regression
from sklearn.linear_model import Ridge as r
model = r(alpha = 20).fit(X_train, y_train)
model.predict(X_test)
print("RIDGE REGRESSION")
print("Training Score", end = "\t")
print(model.score(X_train, y_train))
print("Test Score", end = "\t")
print(model.score(X_test, y_test))
-----------------------------------------------------------
O/P
RIDGE REGRESSION
Training Score 0.4991610348613835
Test Score 0.32642156452579363
My question is, Does a smaller difference between the score values of the training and test dataset mean that my model is Generalized and fitting equally for both the test and Train Data ( not overfitting ) or does it mean something else.
If it does mean something else please do explain.
And How does the "alpha" value affect the ridge regression model?
I am a beginner so please do explain anything as simple as possible.
Thank You.

Maybe you can add a separate validation set to you model.fit or you set the validation_split paramter like in the keras docs of the fit method, I don't know if there is something like that in sklearn kit.
But in general the scores for the validation-set or test-st and the training set should be nearly equal, otherwise the model tends to overfitting.
Also there are a bunch of metrics you can use to evaluate your model. I would recommend the book Oreilly Deep Learning Page 39. There is a really good explanation.
Or have a look here and here.
Or look here chapter 5.2.
Feel free to ask other questions.

Expanding on Max's answer, overfitting is a modelling error when the trained model models the trained data too well. Now this usually occurs when the model is complex enough (high VC dimension) that it learns very intricate details and noises that it will negatively impact the final performance. VC Dimension Caltech Lecture on VC Overfitting A simple way to observe overfitting is to look at the difference between the training and test results.
Going back to your example, the score difference between test and training data for linear regression is 0.290. While the difference for ridge regression is 0.179. From this single experiment alone, it is difficult to judge whether or not a model is overfitting since usually in practice there will always be some difference. But here, we can say that ridge regression tends to less overfit for this dataset.
Now when making a decision on which model to pick, we have to also consider other factors beside overfitting itself. In this case Linear Regression tends to perform 10% higher on the test dataset as compared to Ridge Regression, so you have to take that into account as well. Perhaps conducting further experiment using different validation techniques and fine tune different hyper-parameters should be the next steps.

Machine learning procedure splitting the data into 3 sets

Reading documentation and procedures while using machine learning techniques for both classification and regression I came across with some topic which actually is new for me. It seems that a recommended procedure related to split the data before training and testing is to split it into three different sets training, validation and testing. Since this procedure makes sense to me I was wondering how should I proceed with this. Let's say we split the data into these three sets, since I came across with this reading sklearn approaches and tips
If we follow some interesting approaches like what I found in here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account let's say we want to build a classifier using LogisticRegression(any classifier actually). The procedure as far as I am concerned should be something like this, right?:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
What when one have to estimate accuracy of the model a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes. Validation set which was splitted before should be use for calculating accuracy or for validating somehow using a Kfold cv instead?. For instance,:
# Perform 10-fold cross validation
scores = cross_val_score(logreg, df, y, cv=10)
Any hint of the procedure with these three sets would be really appreciated. What I was thinking of was that validation set should be use with train but do not know really in which way.

Batch gradient descent with scikit learn (sklearn)

I'm playing with a one-vs-all Logistic Regression classifier using Scikit-Learn (sklearn). I have a large dataset that is too slow to run all at one go; also I would like to study the learning curve as the training proceeds.
I would like to use batch gradient descent to train my classifier in batches of, say, 500 samples. Is there some way of using sklearn to do this, or should I abandon sklearn and "roll my own"?
This is what I have so far:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# xs are subsets of my training data, ys are ground truth for same; I have more
# data available for further training and cross-validation:
xs.shape, ys.shape
# => ((500, 784), (500))
lr = OneVsRestClassifier(LogisticRegression())
lr.fit(xs, ys)
lr.predict(xs[0,:])
# => [ 1.]
ys[0]
# => 1.0
I.e. it correctly identifies a training sample (yes, I realize it would be better to evaluate it with new data -- this is just a quick smoke-test).
R.e. batch gradient descent: I haven't gotten as far as creating learning curves, but can one simply run fit repeatedly on subsequent subsets of the training data? Or is there some other function to train in batches? The documentation and Google are fairly silent on the matter. Thanks!

What you want is not batch gradient descent, but stochastic gradient descent; batch learning means learning on the entire training set in one go, while what you describe is properly called minibatch learning. That's implemented in sklearn.linear_model.SGDClassifier, which fits a logistic regression model if you give it the option loss="log".
With SGDClassifier, like with LogisticRegression, there's no need to wrap the estimator in a OneVsRestClassifier -- both do one-vs-all training out of the box.
# you'll have to set a few other options to get good estimates,
# in particular n_iterations, but this should get you going
lr = SGDClassifier(loss="log")
Then, to train on minibatches, use the partial_fit method instead of fit. The first time around, you have to feed it a list of classes because not all classes may be present in each minibatch:
import numpy as np
classes = np.unique(["ham", "spam", "eggs"])
for xs, ys in minibatches:
lr.partial_fit(xs, ys, classes=classes)
(Here, I'm passing classes for each minibatch, which isn't necessary but doesn't hurt either and makes the code shorter.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.