How to focus on label(s) that is(are) difficult to predict? - python

I am facing with binary classification problem. I am using some machine learning models and Python 3. I have noticed that some models perform better on a given class than the others. I would like to combine them to improve my accuracy and precision. I know a way to do so in regression problems, something like a weighted average of predictions. But I am not sure that it makes sense in classification problem. And, you must know a better way to do so.
Here is my algorithm that helps me to identify labels which are particularly difficult to predict :
"""
each value is in {0, 1}
ytrue : real values
ypred : predicted values
"""
def errorIdentifier(ytrue, ypred):
n = len(ytrue)
ytrue = list(ytrue)
ypred = list(ypred)
error = [0,0]
for i in range(n):
if ytrue[i] != ypred[i] :
error[ytrue[i]] += 1
return error
As you can guess, I need to call it for each model I am using.

The issue that different models are better are predicting different class is a classical machine learning problem. It arises due to the fact the various algorithms are better at modelling different features, hence the inclination towards better accuracy on a certain class.
To overcome this issue, we can make use of a number of models and ensemble the results to get an improved accuracy. This approach is known as ensemble learning.
There are a number of methods such as bagging, boosting etc. and some well known ensemble learning algorithms such as RandomForest. You'll have to research the various techniques to find the one that best suits your needs.

Related

What's a base model and why/when do we go for other ML algorithms

Let's assume we're dealing with continuous features and responses. We fit a linear regression model (let's say first order) and after CV we get a somehow good r^2 (let's say r^2=0.8).
Why do we go for other ML algorithms? I've read some research papers and they were trying different ML algorithms and taking the simple linear model as a base model for comparison. In these papers, the linear model outperformed other algorithms, what I have difficulty understanding is why do we go for other ML algorithms then? Why can't we just be satisfied with the linear model especially in the specific case where other algorithms perform poorly?
The other question is what do they gain from presenting the other algorithms in their research papers if these algorithms performed poorly?
The Best model for solving predictive problems such as continuous output is the regression model especially if you do it using a Neural network (polynomial or linear) with hyperparameter tuning based on the problem.
Using other ML algorithms such as Decision Tree or SVM or any other model where their main goal is classification but on the paper, they say it can do regression also in fact, they can't predict any new values.
but in the field of research people always try to find a better way to predict values other than regression, like in the classification world we start with Logistic regression -> decision tree and now we have SVM and ensembling models and DeepLearning.
I think the answer is because you never know.
especially in the specific case where other algorithms perform poorly?
You know they performed poorly because someone tried dose models. It's always worthy trying various models.

Identical accuracy in different ML Classification models

I used the "Stroke" data set from kaggle to compare the accuracy of the following different models of classification:
K-Nearest-Neighbor (KNN).
Decision Trees.
Adaboost.
Logistic Regression.
I did not implement the models myself, but used sklearn library's implementations.
After training the models I ran the test data and printed the level of accuracy of each of the models and these are the results:
As you can see, KNN, Adaboost, and Logistic Regression gave me the exact same accuracy.
My question is, does it make sense that there is not even a small difference between them or did I make a mistake somewhere along the way (Even though I only used sklearn's implementations?
In general achieving the same scores is unlikely, and the explanation is usually:
bug in actual reporting
bug in the data processing
score reported corresponds to a degenerate solution
And the last explanation is probably the case. Stroke dataset has 249 positive samples in 5000 datapoints, so if your model always says "no stroke" it will get roughly 95%. So my best guess is that all your models failed to learn anything and are just constantly outputting "0".
In general accuracy is not a right metric for highly imabalnced datasets. Consider balanced accuracy, f1, etc.

Advice for my plan - large dataset of students and grades, looking to classify bottom 2%

I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, this dataset has 36 variables with about 30 million students as predictors and then the students grades as the responses.
My goal is to be able to predict whether a student will fail out (ie. be in the bottom 2%ile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
In case of having imbalanced dataset, using weighted class is the most common approach to do so, but having such large dataset (30M training example) for binary classification problem representing 2% for the first class and 98% for the second one, I can say it's too hard to prevent model to be unbiased against first class using weighted class as it's not too much differ from reducing the training set size to be balanced.
Here some steps for the model accuracy evaluation.
split your dataset set to train, evalution and test sets.
For evaluation metric I suggest these alternatives.
a. Make sure to have at least +20%, representing the first class for both
evaluation and test sets.
b. Set evalution metric to be precision and recall for your model accuracy
(rather than using f1 score).
c. Set evalution metric to be Cohen's kapp score (coefficient).
From my own perspective, I prefer using b.
Since you are using tensorflow, I assume that you are familiar with deep learning. so use deep learning instead of machine learning, that's gives you the ability to have many additional alternatives, anyway, here some steps for both machine learning and deep learning approach.
For Machine Leaning Algorithms
Decision Trees Algorithms (especially Random Forest).
If my features has no correlation, correlation approach to zero (i.e. 0.01),
I am going to try Complement Naive Bayes classifiers for multinomial features
or Gaussian Naive Bayes using weighted class for continuous features.
Try some nonparametric learning algorithms. You may not able to fit this
training set using Support Vector Machines (SVM) easily because of you
have somehow large data set but you could try.
Try unsupervised learning algorithms
(this sometimes gives you more generic model)
For Deep Leaning Algorithms
Encoder and decoder architectures or simply generative adversarial
networks (GANs).
Siamese network.
Train model using 1D convolution Layers.
Use weighted class.
Balanced batches of the training set, randomly chosen.
You have many other alternatives, From my own perspective, I may try hard to get it with 1, 3 or 5.
For Deep learning 5th approach sometimes works very well and I recommend to try it with 1, 3.

How to know the factor by which a feature affects a model's prediction

I have trained my model on a data set and i used decision trees to train my model and it has 3 output classes - Yes,Done and No , and I got to know the feature that are most decisive in making a decision by checking feature importance of the classifier. I am using python and sklearn as my ML library. Now that I have found the feature that is most decisive I would like to know how that feature contributes, in the sense that if the relation is positive such that if the feature value increases the it leads to Yes and if it is negative It leads to No and so on and I would also want to know the magnitude for the same.
I would like to know if there a solution to this and also would to know a solution that is independent of the algorithm of choice, Please try to provide solutions that are not specific to decision tree but rather general solution for all the algorithms.
If there is some way that would tell me like:
for feature x1 the relation is 0.8*x1^2
for feature x2 the relation is -0.4*x2
just so that I would be able to analyse the output depends based on input feature x1 ,x2 and so on
Is it possible to find out the whether a high value for particular feature to a certain class, or a low value for the feature.
You can use Partial Dependency Plots (PDPs). scikit has a built-in PDP for their GBM - http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence which was created in Friedman's Greedy Function Approximation Paper http://statweb.stanford.edu/~jhf/ftp/trebst.pdf pp26-28.
If you used scikit-learn GBM, use their PDP function. If you used another estimator, you can create your own PDP which is a few lines of code. PDPs and this method is algorithm agnostic as you asked. It just will not scale.
Logic
Take your training data
For the feature you are examining, get all unique values or some quantiles to reduce the time
Take a unique value
For the feature you are examining, in all observations, replace with the value from (3)
Predict all training observations
Get the mean of all predictions
Plot the point (unique value, mean)
Repeat 3-7 taking the next unique value until no more values
You now have a 1-way PDP. When the feature increases (X-axis), what on average happens to the prediction (y-axis). What is the magnitude of the change.
Taking the analysis further, you can fit a smooth curve or splines to the PDP which may help understand the relationship. As #Maxim said, there is not a perfect rule so you are looking for the trend here, trying to understand a relationship. We tend to run this for the most important features and/or features you are curious about.
The above scikit-learn reference has more examples.
For a Decision Tree, you can use the algorithmic short-cut as described by Friedman and implemented by scikit-learn. You need to walk the tree so the code is tied to the package and algorithm, hence it does not answer your question and I will not describe it. But it is on that scikit-learn page I referenced and in the paper.
def pdp_data(clf, X, col_index):
X_copy = np.copy(X)
results = {}
results['x_values'] = np.sort(np.unique(X_copy[:, col_index]))
results['y_values'] = []
for value in results['x_values']:
X_copy[:, col_index] = value
y_predict = clf.predict_log_proba(X_copy)[:, 1]
results['y_values'].append(np.mean(y_predict))
return results
Edited to answer new part of question:
For the addition to your question, you are looking for a linear model with coefficients. If you must interpret the model with linear coefficients, build a linear model.
Sometimes how you need to interpret the model guides what type of model you build.
In general - no. Decision trees work differently that that. For example it could have a rule under the hood that if feature X > 100 OR X < 10 and Y = 'some value' than answer is Yes, if 50 < X < 70 - answer is No etc. In the instance of decision tree you may want to visualize its results and analyse the rules. With RF model it is not possible, as far as I know, since you have a lot of trees working under the hood, each has independent decision rules.

What are good metrics to evaluate the performance of a multi-class classifier?

I'm trying to run a classifier in a set of about 1000 objects, each with 6 floating point variables. I've used scikit-learn's cross validation features to generate an array of the predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers, and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, and whether there are other good metrics to evaluate the problem. Intuitively speaking, even with 25% accuracy, and given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?
If this table is a confusion table, I think that your classifier predicts in majority of the time the class E. I think that your class E is overrepresented in your dataset, accuracy is not a good metric if your classes have not the same number of instances,
Example, If you have 3 classes, A,B,C and in the test dataset the class A is over represented (90%) if your classifier predicts all time class A, you will have 90% of accuracy,
A good metric is to use log loss, logistic regression is a good algorithm that optimize this metric
see https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
An other solution, is to do oversampling of your small classes
First of all, I find it very difficult to look at confusion tables. Plotting it as an image would give a lot better intuitive understanding about what is going on.
It is advisory to have single number metric to optimize since it is easier and faster. When you find that your system doesn't perform as you expect it to, revise your selection of metric.
Accuracy is usually a good metric to use if you have same amount of examples in every class. Otherwise (which seems to be the case here) I'd advise to use F1 score which takes into account both precision and recall of your estimator.
EDIT: However it is up to you to decide if the ~25% accuracy, or whatever metric is "good enough". If you are classifying if robot should shoot a person you should probably revise your algorithm but if you are deciding if it is a pseudo-random or random data, 25% percent accuracy could be more than enough to prove the point.

Categories