I have a question regarding the interpretability of machine learning algorithms.
I have a dataset looking like this:
[tabular data set]
I have trained a classification model (MLPClassifier from Scikit-Learn) and want to know which features have the biggest impact (the highest weight) on the decision.
My final goal is to find different solutions (combinations of feature values) which have a high probability (>90%) of being classified as 1.
Does somebody know a way to get these solutions?
Thanks in advance!
To obtain feature importances directly during a classification task, the simplest option is a tree-based classifier such as a random forest or a decision tree, both implemented in sklearn, which expose a feature_importances_ attribute:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)

# After the fit step, the importances are available as an attribute:
clf.feature_importances_
The feature importances tell you how much weight each feature carries in the forest's decisions; if your MLP classifier is trained properly, it should rely on the features in a broadly similar way, so the forest's ranking is a reasonable proxy.
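As a rough sketch (assuming X is a pandas DataFrame with named columns, as in your table), you can rank the forest's importances by feature name; and since MLPClassifier has no feature_importances_ attribute, permutation importance (sklearn.inspection.permutation_importance) gives a comparable ranking measured on the MLP itself:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Impurity-based importances from the random forest, highest first.
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)
for idx in np.argsort(clf.feature_importances_)[::-1]:
    print(X.columns[idx], clf.feature_importances_[idx])

# Permutation importance works on any fitted estimator, including the MLP.
mlp = MLPClassifier(random_state=0).fit(X, y)
result = permutation_importance(mlp, X, y, n_repeats=10, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(X.columns[idx], result.importances_mean[idx])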
I have a project that asks for binary classification of whether an employee will leave the company or not, based on about 52 features and 2000 rows of data. The data is somewhat balanced, with 1200 negative to 800 positive examples. I have done extensive EDA and data cleansing. I chose to try several different models from sklearn: Logistic Regression, SVM, and Random Forests. I am getting very poor and similar results from all of them. I only used 15 of the 52 features for this run, but the results are almost identical to when I used all 52 features. Of the 52 features, 6 were categorical, which I converted to dummies (between 3 and 6 categories per feature), and 3 were datetime, which I converted to days since epoch. There were no null values to fill.
This is the code and confusion matrix from my most recent run with a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    small_features, endreason, test_size=0.2, random_state=0)

RF = RandomForestClassifier(bootstrap=True,
                            max_features='sqrt',
                            random_state=0)
RF.fit(x_train, y_train)
rf_predictions = RF.predict(x_test)  # store the predictions before scoring them

cm = confusion_matrix(y_test, rf_predictions)
plot_confusion_matrix(cm, classes=['Negative', 'Positive'],
                      title='Confusion Matrix')  # plot_confusion_matrix is a local helper
What steps can I take to fit this model better?
The results you are showing definitely seem a bit discouraging for the methods you propose and the class balance you describe. However, from the description of the problem, there is definitely a lot of room for improvement.
When you use train_test_split, make sure you pass stratify=endreason so that the class proportions are preserved in both splits. Moving on to points that will help improve your model:
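For example (reusing the variable names from your snippet):

from sklearn.model_selection import train_test_split

# stratify keeps the 1200/800 class ratio in both the train and test splits.
x_train, x_test, y_train, y_test = train_test_split(
    small_features, endreason,
    test_size=0.2, random_state=0, stratify=endreason)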
First of all, dimensionality reduction: since you are dealing with many features, some of them might be useless or even contaminate the classification problem you are trying to solve. It is very important to consider fitting different dimensionality-reduction techniques to your data and feeding the reduced data to your model (see the sketch after this list). Some common approaches that might be worth trying:
PCA (Principal component analysis)
Low Variance & Correlation filter
Random Forests feature importance
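A rough sketch of the first two ideas, assuming X is your numeric feature matrix (the threshold and variance values below are illustrative, not tuned):

from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Drop near-constant features first.
X_filtered = VarianceThreshold(threshold=0.01).fit_transform(X)

# Keep enough principal components to explain 95% of the variance
# (scaling the features beforehand, e.g. with StandardScaler, usually helps PCA).
X_reduced = PCA(n_components=0.95).fit_transform(X_filtered)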
Secondly, understanding the model: while Logistic Regression is an excellent baseline for a linear classifier, it might not necessarily be what you need for this task. Random Forests are much better at capturing non-linear relationships, but they need to be controlled and pruned to avoid overfitting and might require a lot of data. SVM is a very powerful method with non-linear kernels but can become inefficient when working with huge amounts of data. XGBoost and LightGBM are very powerful gradient-boosting algorithms that have won multiple Kaggle competitions and work very well in almost every case; some preprocessing is needed, though, as XGBoost is not prepared to work with categorical features (LightGBM is). My suggestion is to try these last two methods. From best to worst (in general-case scenarios) I would rank:
LightGBM / XGBoost
RandomForest / SVM / Logistic Regression
Last but not least, hyperparameter tuning: regardless of the method you choose, there will always be some fine-tuning to do. Sklearn offers GridSearchCV, which comes in really handy. However, you need to understand how your classifiers behave in order to know what to look for. I will not go in depth on this, as it would be off-topic for SO, but you can definitely have a read here.
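A minimal GridSearchCV sketch for the random forest above (the grid values are only examples, not recommendations):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5, 10],
}
# f1_macro is label-agnostic, which is convenient for string class labels.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='f1_macro')
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)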
I have been comparing different regression models from sklearn, and in doing so I was confused by the model score values I got.
Below in the code you can see that I have used both Linear Regression and Ridge Regression, but the gap in score between the training and test data sets varies a lot between them.
Using Linear Regression
from sklearn.linear_model import LinearRegression as lr
model = lr()
model.fit(X_train, y_train)
model.predict(X_test)
print("LINEAR REGRESSION")
print("Training Score", end = "\t")
print(model.score(X_train, y_train))
print("Test Score", end = "\t")
print(model.score(X_test, y_test))
------------------------------------------------------------
O/P
LINEAR REGRESSION
Training Score 0.7147120015665793
Test Score 0.4242120003778227
Using Ridge Regression
from sklearn.linear_model import Ridge as r
model = r(alpha = 20).fit(X_train, y_train)
model.predict(X_test)
print("RIDGE REGRESSION")
print("Training Score", end = "\t")
print(model.score(X_train, y_train))
print("Test Score", end = "\t")
print(model.score(X_test, y_test))
-----------------------------------------------------------
O/P
RIDGE REGRESSION
Training Score 0.4991610348613835
Test Score 0.32642156452579363
My question is: does a smaller difference between the training and test scores mean that my model generalises well and fits the training and test data equally (i.e., is not overfitting), or does it mean something else?
If it does mean something else please do explain.
And how does the alpha value affect the ridge regression model?
I am a beginner so please do explain anything as simple as possible.
Thank You.
Maybe you can add a separate validation set to your model.fit, or set the validation_split parameter as in the Keras docs for the fit method; I don't know whether there is something like that in scikit-learn.
But in general the scores for the validation or test set and the training set should be nearly equal; otherwise the model tends to overfit.
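For reference, scikit-learn does have this: train_test_split gives you a hold-out set, and cross_val_score does k-fold validation. A minimal sketch, assuming X and y are already defined:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated R^2 scores for a ridge model.
scores = cross_val_score(Ridge(alpha=20), X, y, cv=5)
print(scores.mean(), scores.std())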
Also, there are a bunch of metrics you can use to evaluate your model. I would recommend the O'Reilly book Deep Learning, page 39; there is a really good explanation there.
Or have a look here and here.
Or look here chapter 5.2.
Feel free to ask other questions.
Expanding on Max's answer: overfitting is a modelling error where the trained model fits the training data too well. It usually occurs when the model is complex enough (high VC dimension) that it learns intricate details and noise that negatively impact the final performance (see: VC Dimension, Caltech Lecture on VC Overfitting). A simple way to observe overfitting is to look at the difference between the training and test results.
Going back to your example, the score difference between training and test data is 0.290 for linear regression and 0.173 for ridge regression. From this single experiment alone it is difficult to judge whether a model is overfitting, since in practice there will always be some difference; but here we can say that ridge regression overfits less on this dataset.
Now, when deciding which model to pick, we also have to consider other factors besides overfitting itself. In this case linear regression scores about 0.10 higher (R²) on the test dataset than ridge regression, so you have to take that into account as well. Conducting further experiments using different validation techniques and fine-tuning the hyper-parameters would be good next steps.
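As for the alpha question: alpha is the strength of the L2 penalty, so larger values shrink the coefficients more aggressively, which typically lowers the training score and may raise or lower the test score. A quick sweep (reusing your X_train/X_test split; the alpha values below are just examples) makes this visible:

from sklearn.linear_model import Ridge

# Larger alpha = stronger regularisation (more coefficient shrinkage).
for alpha in [0.01, 0.1, 1, 10, 20, 100]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha,
          round(model.score(X_train, y_train), 3),
          round(model.score(X_test, y_test), 3))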
I am learning about text classification, and I am classifying my own corpus with logistic regression as follows:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=7)
classifier.fit(training_matrix, y_train)
prediction = classifier.predict(testing_matrix)
I would like to improve the classification report (recall, f1-score, accuracy, etc.) with the Restricted Boltzmann Machine that scikit-learn provides; from the documentation I read that it can be used to improve these metrics. Could anybody help me? This is what I have tried so far, thanks in advance:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

vectorizer = TfidfVectorizer(max_df=0.5,
                             max_features=None,
                             ngram_range=(1, 1),
                             norm='l2',
                             use_idf=True)
X_train = vectorizer.fit_transform(X_train_r)
X_test = vectorizer.transform(X_test_r)

logistic = LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)

classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
classifier.fit(X_train, y_train)
First, you have to understand the concepts here. An RBM can be seen as a powerful unsupervised feature learner, much like a clustering algorithm; it is unsupervised, i.e., it doesn't need labels.
Perhaps the best way to use an RBM for your problem is to first train the RBM (which only needs data, not labels) and then use the RBM weights to initialise a neural network. To get logistic regression at the output, you add an output layer with a logistic-regression cost function to this neural net and train the whole network. This setting may result in a performance improvement.
There are a couple of things that could be wrong.
1. You haven't properly calibrated the RBM
Look at the example on the scikit-learn site: http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html
In particular, these lines:
rbm.learning_rate = 0.06
rbm.n_iter = 20
# More components tend to give better prediction performance, but larger
# fitting time
rbm.n_components = 100
You don't set these anywhere. In the example, they are obtained through cross-validation using a grid search. You should do the same and try to find (close to) optimal parameters for your own problem.
Additionally, you might want to use cross-validation to determine other parameters as well, such as the n-gram range (higher-order n-grams usually help as well, if you can afford the memory and execution time; for some problems, character-level n-grams do better than word-level ones) and the logistic regression parameters.
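A sketch of such a search over the RBM + logistic pipeline from your code (the grid values are illustrative, and expect this to be slow):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'rbm__learning_rate': [0.01, 0.06, 0.1],
    'rbm__n_iter': [10, 20],
    'rbm__n_components': [50, 100, 200],
    'logistic__C': [1.0, 10.0, 100.0],
}
# 'classifier' is the Pipeline defined in the question.
search = GridSearchCV(classifier, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)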
2. You are just unlucky
There is nothing that says using an RBM as an intermediate step will definitely improve any performance measure. It can, but it's not a rule; it may very well do nothing, or very little, for your problem. You have to be prepared for this.
It's worth trying because it shouldn't take long to implement, but be prepared to have to look elsewhere.
Also look at the SGDClassifier and the PassiveAggressiveClassifier. These might improve performance.
I have texts that are rated on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.
How can you perform binomial log regression to get the probability that test data is -100 or +100?
The closest I have got is the SGDClassifier( penalty='l2',alpha=1e-05, n_iter=10), but this doesn't provide the same results as SPSS when I use binomial log regression to predict the probability of -100 and +100. So I'm guessing this is not the right function?
SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine unless you call it with a different loss function; loss='log' gives you a probabilistic logistic regression.
See the documentation at:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
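A minimal sketch of that logistic-regression variant of SGDClassifier, assuming you already have a feature matrix X_train/X_test and binary labels y_train:

from sklearn.linear_model import SGDClassifier

# loss='log' makes this a logistic regression trained by SGD
# (newer scikit-learn versions spell this loss 'log_loss').
clf = SGDClassifier(loss='log', penalty='l2', alpha=1e-05)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one column of probabilities per class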
Alternatively, you could use sklearn.linear_model.LogisticRegression to classify your texts with a logistic regression.
It's not clear to me that you will get exactly the same results as you do with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.
Edited to add:
My suspicion is that the 99% accuracy you're getting with the SPSS logistic regression is training-set accuracy, while the 87% that you're seeing with the scikit-learn logistic regression is test-set accuracy. I found this question on the Data Science Stack Exchange where a different person is attempting an extremely similar problem and getting ~99% accuracy on training sets and 90% test-set accuracy.
https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features
My recommended path forward is as follows: try several different basic classifiers in scikit-learn, including standard logistic regression and a linear SVM, and also rerun the SPSS logistic regression several times with different train/test subsets of your data and compare the results. If you continue to see a large divergence across classifiers that can't be accounted for by ensuring similar train/test splits, then post the results you're seeing into your question and we can move forward from there.
Good luck!
If pos/neg, or the probability of pos, is really the only thing you need as output, then you can derive binary labels y as
y = score > 0
assuming you have the scores in a NumPy array score.
You can then feed this to a LogisticRegression instance, using the continuous score to derive relative weights for the samples:
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
sample_weight = np.abs(score)
sample_weight /= sample_weight.sum()
clf.fit(X, y, sample_weight=sample_weight)
This gives maximum weight to tweets with scores ±100, and a weight of zero to tweets that are labeled neutral, varying linearly between the two.
If the dataset is very large, then as @brentlance showed, you can use SGDClassifier, but you have to give it loss="log" if you want a logistic regression model; otherwise, you'll get a linear SVM.
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics: classifying protein sequences as neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor.)
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set with 10-fold CV), so my 'naive' approach was simply to use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB) and a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score (AUC), but I don't know how to "combine" the results and get it for a combination of classifiers. (That is, I already know the ROC curve for each classifier alone; I want to see the ROC curve or AUC on the training data for combinations of classifiers.)
(I currently filter the predictions using predict_proba with the Random Forests and ExtraTrees methods, then arbitrarily filter out results with a predicted score below 0.85. An additional layer of filtering is "how many classifiers agree on this protein's positive classification".)
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ )
(The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, follow the same approach as you would normally. However, you will want to fix the 10-fold partitions of the data set first; for each fold, train the whole ensemble on that fold's training portion, measure the accuracy on the held-out portion, rinse and repeat with the other folds, and then average to get the accuracy of the ensemble. The key difference is not to cross-validate the individual algorithms separately when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
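A sketch of scoring the combined vote with the same 10-fold CV as the individual models (the estimators below are illustrative; X, y are your feature matrix and labels). Soft voting averages the members' predicted probabilities, so ROC AUC is well defined for the combination:

from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=0)),
        ('et', ExtraTreesClassifier(n_estimators=200, random_state=0)),
        ('svm', SVC(kernel='rbf', probability=True, random_state=0)),
    ],
    voting='soft',  # average predict_proba across the members
)

# Compare each member against the combination on identical folds.
for name, model in ensemble.estimators + [('ensemble', ensemble)]:
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    print(name, scores.mean())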
An alternative approach (again taking care that the ensemble never sees the test data) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. For each of the 10 folds, you can split the data into:
8 folds for training
1 fold for validation
1 fold for testing
Optimise the hyper-parameters of each algorithm using the training and validation sets, then stack your predictions by fitting a linear regression (or a logistic regression) on the validation set. Your final model will be p = a_0 + a_1 p_1 + ... + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k and a_k is the weight of model k. You can also use the predicted labels directly if a model doesn't give you probabilities.
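A rough sketch of that blend, using cross_val_predict as a shortcut for the manual fold bookkeeping described above (the base models are illustrative; X, y assumed given):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

base_models = [RandomForestClassifier(n_estimators=200, random_state=0),
               ExtraTreesClassifier(n_estimators=200, random_state=0)]

# One column of out-of-fold P(positive) per base model.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=10, method='predict_proba')[:, 1]
    for m in base_models
])

# The meta-model learns the weights a_k of the blend.
meta_model = LogisticRegression().fit(meta_features, y)

Recent scikit-learn versions also ship a ready-made StackingClassifier that wraps this pattern.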
If your models are all of the same type, you can optimise the model parameters and the weights at the same time.
If you have obvious differences in the data, you can use different bins with different parameters for each: for example, one bin for short sequences and another for long sequences, or bins for different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, in particular the section on blending. In 2008 and 2009 they used more advanced techniques, which may also be interesting for you.