Sorry for the weird title, I don't know how to better express my problem. I'm working with an insurance dataset to predict future claim costs for a given policy.
For anyone who has worked with insurance claim data, you know that claims are heavily zero-weighted (most policies have no claim at all). I've run into the issue before where regression on the entire dataset does not perform well, due to the skew of the data and the mix of a discrete mass at zero with a continuous distribution above it.
I've tried some Tweedie distributions in R to help with this disconnect, but I ended up going a different route.
I first decided to classify the data into "claim amount = 0" and "claim amount != 0" using a support vector classifier, sklearn.svm.SVC (98% training and 95% test accuracy). If a claim amount is predicted to be != 0, the record is fed into a regression model to predict the incurred claim amount. I went with ridge regression, sklearn.linear_model.Ridge, for this part and achieved a relatively good $R^2$ of 0.67 on the test set (real-world data, so I'm not expecting anything extraordinary).
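To make the setup concrete, here is a minimal sketch of the two-stage pipeline I have in mind (assuming X and y are an already-prepared feature matrix and the claim amounts; the names are placeholders):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.linear_model import Ridge

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Stage 1: classify zero vs. non-zero claims.
    clf = SVC()
    clf.fit(X_train, (y_train != 0).astype(int))

    # Stage 2: regress the claim amount, trained only on the non-zero claims.
    nonzero = y_train != 0
    reg = Ridge(alpha=1.0)
    reg.fit(X_train[nonzero], y_train[nonzero])

    # Composite prediction: 0 where the classifier says "no claim",
    # otherwise the ridge regression estimate.
    pred_nonzero = clf.predict(X_test).astype(bool)
    y_pred = np.where(pred_nonzero, reg.predict(X_test), 0.0)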
So my question is, what is the best way to evaluate this composite model, specifically in Python? Do you think the MSE would be a good metric? The only other model I can compare it against is a basic linear regression fit on the entire dataset, without the pre-classification step.
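Concretely, this is the kind of comparison I had in mind, continuing from the sketch above (just a rough check, not a final evaluation):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    # Baseline: a single linear regression fit on the full training set.
    baseline = LinearRegression().fit(X_train, y_train)
    y_pred_baseline = baseline.predict(X_test)

    print("composite MSE:", mean_squared_error(y_test, y_pred))
    print("baseline  MSE:", mean_squared_error(y_test, y_pred_baseline))

    # MAE may be easier to interpret on heavily zero-inflated claim amounts.
    print("composite MAE:", mean_absolute_error(y_test, y_pred))
    print("baseline  MAE:", mean_absolute_error(y_test, y_pred_baseline))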
Of course, feel free to suggest alternatives to this two-part classification-regression model.
EDIT: To clarify, I chose these specific models (over neural networks, for example) because of their ability to be translated into simple math for different applications.
I used the "Stroke" data set from kaggle to compare the accuracy of the following different models of classification:
K-Nearest-Neighbor (KNN).
Decision Trees.
Adaboost.
Logistic Regression.
I did not implement the models myself, but used the sklearn library's implementations.
After training the models, I ran the test data and printed the accuracy of each model; these are the results:
As you can see, KNN, Adaboost, and Logistic Regression gave me the exact same accuracy.
My question is, does it make sense that there is not even a small difference between them, or did I make a mistake somewhere along the way (even though I only used sklearn's implementations)?
In general, achieving exactly the same scores is unlikely, and the explanation is usually one of:
a bug in the actual reporting
a bug in the data processing
a score that corresponds to a degenerate solution
And the last explanation is probably the case here. The stroke dataset has 249 positive samples out of roughly 5,000 data points, so a model that always says "no stroke" will get roughly 95% accuracy. My best guess is that all of your models failed to learn anything and are just constantly outputting "0".
In general, accuracy is not the right metric for highly imbalanced datasets. Consider balanced accuracy, F1, etc.
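A quick way to check whether your models have collapsed to the majority class is to compare them against a trivial baseline and look at imbalance-aware metrics. A sketch (X_train, y_train, X_test, y_test and a fitted model are placeholders here):

    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 f1_score, confusion_matrix)

    # Trivial baseline that always predicts the majority class ("no stroke").
    dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

    for name, clf in [("dummy", dummy), ("model", model)]:
        y_pred = clf.predict(X_test)
        print(name,
              "accuracy:", accuracy_score(y_test, y_pred),
              "balanced accuracy:", balanced_accuracy_score(y_test, y_pred),
              "F1:", f1_score(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))

If your KNN, AdaBoost, and logistic regression models score no better than the dummy baseline on balanced accuracy or F1, they have most likely learned a degenerate solution.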
I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, the dataset has 36 predictor variables for about 30 million students, with the students' grades as the response.
My goal is to be able to predict whether a student will fail out (i.e., be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of misclassifying the minority class.
Can someone please confirm that this is the correct approach (or point me to a better one; I'm assuming there is)? Also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
With an imbalanced dataset, class weighting is the most common approach. However, with such a large dataset (30M training examples) and a 2% / 98% split for a binary classification problem, I would say it is hard to keep the model unbiased against the minority class using class weights alone, since in effect it is not very different from shrinking the training set down to a balanced subset (2% of 30M is roughly 600K minority examples).
Here are some steps for evaluating model accuracy.
Split your dataset into training, evaluation, and test sets.
For the evaluation metric, I suggest these alternatives:
a. Make sure the minority class makes up at least 20% of both the evaluation and test sets.
b. Use precision and recall as the evaluation metrics for your model (rather than the F1 score).
c. Use Cohen's kappa score (coefficient) as the evaluation metric.
From my own perspective, I prefer option b.
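As a sketch of options b and c with scikit-learn (y_true and y_pred stand for the labels and predictions on your evaluation set):

    from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

    # Option b: precision and recall for the minority ("fail out") class.
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))

    # Option c: Cohen's kappa, which corrects for agreement expected by chance.
    print("kappa:    ", cohen_kappa_score(y_true, y_pred))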
Since you are using TensorFlow, I assume you are familiar with deep learning. Using deep learning instead of classical machine learning gives you many additional alternatives; in any case, here are some options for both the machine learning and the deep learning approach.
For Machine Learning Algorithms
1. Decision tree algorithms (especially random forest).
2. If the features have little or no correlation (correlations close to zero, e.g. 0.01), I would try a Complement Naive Bayes classifier for multinomial features, or Gaussian Naive Bayes with class weighting for continuous features (a rough sketch follows after this list).
3. Try some nonparametric learning algorithms. You may not be able to fit this training set with Support Vector Machines (SVM) easily, because the dataset is fairly large, but you could try.
4. Try unsupervised learning algorithms (this sometimes gives you a more generic model).
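As a rough sketch of options 1 and 2 with scikit-learn (X_train, y_train and X_train_counts are placeholders; Complement Naive Bayes expects non-negative, count-like features):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import ComplementNB

    # Option 1: a random forest with class weights to counter the 2% / 98% split.
    rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", n_jobs=-1)
    rf.fit(X_train, y_train)

    # Option 2: Complement Naive Bayes, which was designed with imbalanced
    # data in mind; it needs non-negative (count-like) features.
    cnb = ComplementNB()
    cnb.fit(X_train_counts, y_train)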
For Deep Learning Algorithms
1. Encoder-decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Train the model using 1D convolution layers.
4. Use class weights.
5. Use balanced batches of the training set, randomly chosen (see the sketch at the end of this answer).
You have many other alternatives. From my own perspective, I would try hardest with options 1, 3, or 5.
For deep learning, the 5th approach sometimes works very well, and I recommend trying it together with options 1 and 3.
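Since the question mentions TensorFlow 2.0, here is one way the 5th option (randomly chosen balanced batches) could look. This is only a sketch; pos_ds and neg_ds are placeholders for tf.data datasets holding the minority and majority class examples (e.g. built from the .feather data):

    import tensorflow as tf

    # Shuffle and repeat each per-class dataset so it can be sampled indefinitely.
    pos_ds = pos_ds.shuffle(10_000).repeat()
    neg_ds = neg_ds.shuffle(10_000).repeat()

    # Draw from each class with equal probability, so every batch is roughly balanced.
    balanced_ds = tf.data.experimental.sample_from_datasets(
        [pos_ds, neg_ds], weights=[0.5, 0.5])
    balanced_ds = balanced_ds.batch(1024).prefetch(tf.data.experimental.AUTOTUNE)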
I am in 10th grade and I am looking to use a machine learning model on patient data to find a correlation between the time of week and patient adherence. I have separated the week into 21 time slots, three per day (1 is Monday morning, 2 is Monday afternoon, etc.). Adherence values are binary (0 means they did not take the medicine, 1 means they did). I will simulate training, validation, and test data for my model.

From my understanding, I can use a logistic regression model to output the probability of the patient missing their medication in a certain time slot, given past data for that slot. Logistic regression outputs a probability that can be turned into a binary prediction with a threshold, which fits my scenario: the two classes I am dealing with are "yes, they will take their medicine" and "no, they will not."

The major problem, at least to my understanding, is that this data will be non-linear. To make this clearer, here is a real-life example: if a patient has a yoga class on Sunday mornings (time slot 19) and tends to forget their medication at that time, then most of the values under time slot 19 would be 0s, while all the other time slots would have many more 1s. The goal is to create a model that, given past data, can recognize that the patient is very likely to miss their medication in the next time slot 19.

I believe that logistic regression must be used on data that has an inherently linear distribution, but I am not sure. I also understand that neural networks are well suited to non-linear distributions, but they require a lot of data to work properly, and ideally my model should perform decently with just a few weeks of data. Of course any model becomes more accurate with more data, but it seems to me that neural networks generally need thousands of examples to become decently accurate. Again, I could very well be wrong.
My question is really what model type would work here. I know that I will need some form of supervised classification, but can I use logistic regression to predict adherence given the time of week?
Really any general feedback on my project is greatly appreciated! Please keep in mind I am only 15, and so certain statements I made were possibly wrong and I will not be able to fully understand very complex replies.
I also have to complete this within the next two weeks, so please do not hesitate to respond as soon as you can! Thank you so much!
In my opinion, logistic regression won't be enough here if you use a single parameter as the input. When I imagine a decision boundary for this problem, I don't think it can be achieved by a single neuron (which is what a logistic regression is). It may need a few more neurons, or even a few layers of them, and you may need a lot of data for that.
It's true that you need a lot of data for applying neural networks.
It would be helpful if you could be more precise about your dataset and features. You could also try K-means clustering for your project. If your aim is to find out whether the patient took the medicine or not, then it can be done using logistic regression.
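As a small sketch of what logistic regression on the 21 time slots could look like with scikit-learn (slot and took_medicine are placeholder arrays for the simulated data):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    # slot: time-slot index (1..21) for each past observation.
    # took_medicine: 1 if the dose was taken, 0 if it was missed.
    encoder = OneHotEncoder(categories=[np.arange(1, 22)])
    X = encoder.fit_transform(slot.reshape(-1, 1))

    model = LogisticRegression().fit(X, took_medicine)

    # Probability of taking the medicine in slot 19 (the Sunday-morning example).
    x19 = encoder.transform(np.array([[19]]))
    print(model.predict_proba(x19)[0, 1])

With the slot one-hot encoded like this, the model effectively learns a separate adherence rate per slot, so the Sunday-morning yoga example does not by itself require a non-linear model.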
For a project I am working on, I am aiming to predict market trends and make long or short plays as a result, and I am looking to use a reinforcement learning algorithm for this. In a paper I read recently, however, the authors suggested a two-tiered system: an SVM classifier to determine the market trend, and three separate algorithms for positive, negative, and sideways trends. Each algorithm is then trained only on data from its own trend, so there is less variability.
My question is, would using three algorithms improve the accuracy of the result, or would one model (with the same amount of data in total) provide the same accuracy?
Apologies if this seems a very basic question, I am new to machine learning and am eager to learn. Cheers
Different models have different strengths and weaknesses. This is the entire idea behind using an ensemble model.
What you can do is train a random forest or an AdaBoost model.
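As a sketch with scikit-learn (X_train, y_train, X_test, y_test are placeholders):

    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

    # Two common ensemble choices; each combines many weak models,
    # in the same spirit as combining trend-specific models.
    rf = RandomForestClassifier(n_estimators=300).fit(X_train, y_train)
    ada = AdaBoostClassifier(n_estimators=300).fit(X_train, y_train)

    print("random forest:", rf.score(X_test, y_test))
    print("adaboost:     ", ada.score(X_test, y_test))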
I'm currently working on a project on estimating a signal using some classification algorithms, such as logistic regression and random forests, with scikit-learn.
I'm now using the confusion matrix to estimate the performance of the different algorithms, and I found a common problem for both of them. In all cases, although the accuracy seems relatively good (around 90%-93%), the total number of FN is quite high compared to TP (the true positive rate is below 3%). Does anyone have a clue why I'm having this kind of issue in my prediction problem? If possible, can you give me some hints on how to solve it?
Thanks in advance for any replies and help.
Updates:
The dataset is extremely imbalanced (8:1), with around 180K observations in total. I have already tested several re-sampling methods, such as OSS and SMOTE (+Tomek or +ENN), but none of them returns good results. In both cases, although the recall goes up from 2.5% to 20%, the precision decreases significantly (from 60% to 20%).
You probably have an imbalanced dataset, where one of your classes has many more examples than your other class.
One solution is to assign a higher cost to misclassifying the class with fewer examples.
This question in Cross Validated covers many approaches to your problem:
https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
EDIT:
Given that you are using scikit-learn, as a first approach you can set the parameter class_weight to "balanced" on your logistic regression.
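For example, a minimal sketch (X_train, y_train, X_test, y_test are placeholders):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # Weight each class inversely proportional to its frequency,
    # so errors on the rare class cost more during training.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)

    # Per-class precision/recall makes the FN problem visible directly.
    print(classification_report(y_test, clf.predict(X_test)))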