I am a novice when it comes to machine learning, but I am very interested in this topic. I have a few questions, so bear with me.
This is a time-series analysis.
I am using ElasticNet with GridSearchCV to find the best hyperparameters for my model. I went through a feature-selection step to reduce my features (using f_regression at the standard <0.05 significance level). I am not running any test for multicollinearity, because I assume that elastic net would pick an L1 ratio of 1 (or close to it) to get around this issue. The parameter grids are listed below:
import numpy as np

l1_space = np.linspace(0.30, 0.90, 30)    # candidate l1_ratio values
alpha_space = np.logspace(-4, 1, 30)      # candidate alpha (regularization strength) values
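For context, roughly how I am feeding these grids into GridSearchCV (a minimal sketch; X, y, the cross-validation scheme, and the scoring choice are placeholders rather than my exact setup):

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# X, y are the feature matrix and target after feature selection (placeholders)
param_grid = {'l1_ratio': l1_space, 'alpha': alpha_space}
search = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),          # ordered splits, since this is time-series data
    scoring='neg_root_mean_squared_error',
)
search.fit(X, y)
print(search.best_params_)                   # in my runs this keeps landing on the grid edges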
However, the best parameters I keep getting are an l1 ratio of 0.3 and an alpha of 0.0001, which I assume defeats the purpose of using ElasticNet. This gives me a really good adjusted R^2 and RMSE, yet when I change the parameter grids slightly, my metrics are horrible.
Either the model is bad (overfitting), or I just do not understand what is going on. I have read the documentation over and over again, but I still don't understand.
Thank you in advance!
I'm currently following a tutorial on explaining why my machine learning model made its predictions, using the "shap" Python package.
However, I'm not fully sure what is happening in the following line:
shap_values = explainer.shap_values(X=X_test[:1])
I understand that I am computing SHAP values for the first row of my test data, but what does that actually mean?
In the tutorial, nsamples and l1_reg are also passed into .shap_values, and I'm not sure what either of these parameters does. Could someone explain simply what they are used for? Everything online is a little too low-level for my understanding.
I have 1270 features and my model is an sklearn SVC model, if that helps in explaining the nsamples parameter.
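For reference, roughly how the explainer is set up in the tutorial (a minimal sketch; the background data, the nsamples value, and the l1_reg setting are my guesses, not the tutorial's exact code, and model/X_train/X_test are my variables):

import shap

# KernelExplainer approximates SHAP values by sampling feature coalitions;
# the background data provides the "feature missing" baseline (a small sample keeps it fast).
# Assumes the SVC was trained with probability=True so predict_proba is available.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model.predict_proba, background)

shap_values = explainer.shap_values(
    X=X_test[:1],                 # explain only the first test row
    nsamples=1000,                # number of perturbed samples drawn per explained row
    l1_reg='num_features(20)',    # keep only the ~20 most important features in the local regression
)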
I have some data (MFCC features for speaker recognition) from two different speakers: 60 vectors of 13 features for each person (120 in total), each with its label (0 or 1). I need to show the results on a confusion matrix, but the GaussianMixture model from sklearn is unstable. On each program run I get different scores (sometimes accuracy is 0.4, sometimes 0.7, ...). I don't know what I am doing wrong, because I built SVM and k-NN models in the same way and they work fine (stable accuracy around 0.9). Do you have any idea what I am doing wrong?
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score, confusion_matrix

gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train)   # X_train are MFCC vectors, y_train are labels
ygmm_pred_class = gmmclf.predict(X_test)

print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))
Short answer: you should simply not use a GMM for classification.
Long answer...
From the answer to a relevant thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. [...] GMM simply tries to fit mixture of Gaussians into your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work - but only for trivial problems, where classes are so well separated that even Naive Bayes would work, in general however it is simply invalid tool for the problem.
And a comment by the respondent himself (again, emphasis in the original):
As stated in the answer - GMM is not a classifier, so asking if you are using "GMM classifier" correctly is impossible to answer. Using GMM as a classifier is incorrect by definition, there is no "valid" way of using it in such a problem as it is not what this model is designed to do. What you could do is to build a proper generative model per class. In other words construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier. See github.com/scikit-learn/scikit-learn/pull/2468
(For what it may be worth, you may want to note that the respondent is a research scientist at DeepMind, and the very first person to be awarded the machine-learning gold badge here at SO.)
To elaborate further (and that's why I didn't simply flag the question as a duplicate):
It is true that in the scikit-learn documentation there is a post titled GMM classification:
Demonstration of Gaussian mixture models for classification.
which I guess did not exist back in 2017, when the above response was written. But, digging into the provided code, you will realize that the GMM models are actually used there in the way proposed by lejlot above; there is no statement in the form of classifier.fit(X_train, y_train) - all usage is in the form of classifier.fit(X_train), i.e. without using the actual labels.
This is exactly what we would expect from a clustering-like algorithm (which is indeed what GMM is), and not from a classifier. It is true again that scikit-learn offers an option for providing also the labels in the GMM fit method:
fit (self, X, y=None)
which you have actually used here (and again, probably did not exist back in 2017, as the above response implies), but, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share of practices that may look sensible from a purely programming perspective, but which make very little sense from a modeling perspective).
A final word: although fixing the random seed (as suggested in a comment) may appear to "work", trusting a "classifier" that gives a range of accuracies between 0.4 and 0.7 depending on the random seed is arguably not a good idea...
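To make the "one GMM per label" approach from the quote concrete, here is a minimal sketch of such a generative classifier (the n_components, covariance_type, and random_state choices are assumptions, and X_train, y_train, X_test are assumed to be NumPy arrays as in the question):

import numpy as np
from sklearn.mixture import GaussianMixture

# fit one GMM per class, on that class's training vectors only
models = {}
for label in np.unique(y_train):
    gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0)
    gmm.fit(X_train[y_train == label])
    models[label] = gmm

# classify each test vector by the class whose GMM assigns it the highest log-likelihood
labels = sorted(models)
log_likelihoods = np.column_stack([models[k].score_samples(X_test) for k in labels])
y_pred = np.array(labels)[np.argmax(log_likelihoods, axis=1)]

With equal class sizes, as in the question (60 vectors per speaker), the class priors cancel out; for imbalanced classes you would add the log-priors to the per-class log-likelihoods before taking the argmax.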
In sklearn, the component labels a GMM assigns do not mean anything in themselves, so each time you run a GMM the labels may come out in a different order. That may be one reason the results are not robust.
I am facing a binary classification problem, using several machine learning models in Python 3. I have noticed that some models perform better on a given class than others, and I would like to combine them to improve my accuracy and precision. I know a way to do this in regression problems, something like a weighted average of predictions, but I am not sure that makes sense in a classification problem, and you may well know a better way.
Here is the routine I use to identify which label is particularly difficult to predict:
"""
each value is in {0, 1}
ytrue : real values
ypred : predicted values
"""
def errorIdentifier(ytrue, ypred):
n = len(ytrue)
ytrue = list(ytrue)
ypred = list(ypred)
error = [0,0]
for i in range(n):
if ytrue[i] != ypred[i] :
error[ytrue[i]] += 1
return error
As you can guess, I need to call it for each model I am using.
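For example, roughly how I call it per model (the model names and test variables are placeholders):

# ytest, Xtest, and the fitted models (logreg, rf, svc) are placeholders
for name, model in [('logreg', logreg), ('rf', rf), ('svc', svc)]:
    ypred = model.predict(Xtest)
    print(name, errorIdentifier(ytest, ypred))   # [errors on class 0, errors on class 1]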
The issue that different models are better at predicting different classes is a classic machine learning problem. It arises because different algorithms are better at modelling different features, hence the inclination towards better accuracy on a certain class.
To overcome this, you can use a number of models and ensemble their results to get improved accuracy. This approach is known as ensemble learning.
There are a number of methods, such as bagging and boosting, and some well-known ensemble learning algorithms such as RandomForest. You'll have to research the various techniques to find the one that best suits your needs.
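As a starting point, a soft-voting ensemble of the kinds of models you already have could look roughly like this (a minimal sketch; the base estimators, their settings, the weights, and the variable names are assumptions):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# soft voting averages the predicted class probabilities, optionally weighted per model
ensemble = VotingClassifier(
    estimators=[
        ('logreg', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('svc', SVC(probability=True)),   # probability=True is required for soft voting
    ],
    voting='soft',
    weights=[1, 2, 1],                    # upweight the model that handles the harder class better
)
ensemble.fit(Xtrain, ytrain)
ypred = ensemble.predict(Xtest)

The weights give you the classification analogue of the weighted average you already use for regression, except that it is applied to predicted probabilities rather than raw predictions.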
I am currently running a linear regression on my time-series data set. However, depending on which Python module I use, I get completely different results.
First I used sklearn, and my model had an R^2 score of about 0.65. After that I tried statsmodels.api to get the summary of the regression, since sklearn doesn't provide one, and I got a completely different R^2 score of 0.96.
After this, I used the linear model of statsmodels.formula.api and got yet another result, this time closer to my first one (R^2 of 0.65).
I want to know why this happens. It seems like a mistake on my part, but I am pretty sure I am using the same data for all of the regressions (converting the data frame to np.arrays where necessary). Can such large differences arise purely from differences in how the modules are implemented?
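For reference, roughly how I am calling the three APIs (a minimal sketch; df and the column names are placeholders, and I may be omitting sm.add_constant in my statsmodels.api version, which I understand changes how the intercept and R^2 are handled):

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# df, 'y', and 'x1'/'x2' are placeholder names for my data frame and columns
X = df[['x1', 'x2']].values
y = df['y'].values

# sklearn: fits an intercept by default
print(LinearRegression().fit(X, y).score(X, y))

# statsmodels.api: no intercept unless a constant column is added explicitly
print(sm.OLS(y, sm.add_constant(X)).fit().rsquared)

# statsmodels.formula.api: the formula interface adds an intercept automatically
print(smf.ols('y ~ x1 + x2', data=df).fit().rsquared)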
Thank you for taking the time to read this.
Sorry about all the text, but I think the background of this project would help:
I've been working on a binary classification project. The original dataset consisted of about 28,000 instances of class 0 and 650 of class 1, so it was very highly imbalanced. I was given an under- and over-sampled dataset to work with that had 5,000 of each class (the class 1 instances were simply duplicated 9 times). After training models on this and getting sub-par results (an AUC of about 0.85, but it needed to be better), I started wondering whether these sampling techniques were actually a good idea, so I took out the original highly imbalanced dataset again. I plugged it straight into a default GradientBoostingClassifier, trained it on 80% of the data, and
I immediately got something like this:
Accuracy: 0.997367035282
AUC: 0.9998
Confusion matrix:
[[5562    7]
 [   8  120]]
Now, I know a high accuracy can be an artefact of imbalanced classes, but I did not expect an AUC like this or that kind of performance! So I am very confused and feel there must be an error in my technique somewhere... but I have no idea what it is. I've tried a couple of different classifiers too and gotten similarly ridiculous levels of performance. I didn't leave the class labels in the data array, and the training data is completely different from the testing data. Each observation has about 130 features too, so this isn't a simple classification task. It very much seems like something is wrong; surely the classifier cannot be this good. Could there be anything else I am overlooking? Are there any other common pitfalls people run into with unbalanced data?
I can provide the code, probability plots, example data points, etc. if they would be helpful, but I didn't want this to get too long for now. Thanks to anybody who can help!
Accuracy may not be the best performance metric in your case. Consider looking at precision, recall, and the F1 score instead, and do some debugging via learning curves, overfitting checks, etc.
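For example, a minimal sketch of pulling those per-class metrics out of sklearn (the names clf, X_test, and y_test are assumptions matching a typical fitted model and train/test split):

from sklearn.metrics import classification_report, roc_auc_score

# per-class precision, recall, and F1 - far more informative than plain accuracy
# when one class has ~28,000 instances and the other ~650
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))

# AUC computed from predicted probabilities of the positive class
y_score = clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))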