Use of SVM classifier and multiple algorithms to improve accuracy - python

For a project I am working on, I am aiming to predict market trends and make long or short plays as a result. I am looking to use a reinforcement learning algorithm for this. In a paper I read recently, however, the authors suggested using a two-tiered system: an SVM classifier to determine the market trend, and three separate algorithms for positive, negative or sideways market trends. That way each algorithm is trained only on data from the same trend, so there is less variability.
My question is, would using three algorithms improve the accuracy of the result, or would one model (with the same amount of data in total) provide the same accuracy?
Apologies if this seems a very basic question, I am new to machine learning and am eager to learn. Cheers

Different models have different strengths and weaknesses. This is the entire idea behind using an ensemble model.
What you can do is train an ensemble method such as a random forest or AdaBoost, which combines many individual learners into a single stronger model.
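What that might look like in scikit-learn, as a rough sketch (the data here is a random placeholder; in practice X would be your market-indicator features and y the trend labels):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Placeholder data: rows are time windows, columns are market indicators,
    # labels are trend classes (0 = down, 1 = sideways, 2 = up)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = rng.integers(0, 3, size=1000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    print("random forest accuracy:", rf.score(X_test, y_test))
    print("AdaBoost accuracy:", ada.score(X_test, y_test))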

Related

What's a base model and why/when do we go for other ML algorithms

Let's assume we're dealing with continuous features and responses. We fit a linear regression model (say, first order) and after cross-validation we get a reasonably good $R^2$ (say $R^2 = 0.8$).
Why do we go for other ML algorithms? I've read some research papers where the authors tried different ML algorithms and took the simple linear model as a base model for comparison. In these papers the linear model outperformed the other algorithms, and what I have difficulty understanding is why we go for other ML algorithms then. Why can't we just be satisfied with the linear model, especially in the specific case where the other algorithms perform poorly?
The other question is: what do they gain from presenting the other algorithms in their research papers if those algorithms performed poorly?
A regression model is a natural fit for predictive problems with a continuous output, especially if you do it with a neural network (polynomial or linear) and hyperparameter tuning suited to the problem.
Other ML algorithms such as decision trees or SVMs are mainly designed for classification; on paper they can do regression too, but in practice they often struggle to predict values outside the range seen in training.
Still, in research people always try to find better ways to predict values than plain regression, just as in the classification world we started with logistic regression, moved on to decision trees, and now have SVMs, ensemble models and deep learning.
I think the answer is: because you never know.
"especially in the specific case where other algorithms perform poorly?"
You only know they performed poorly because someone tried those models. It's always worth trying various models.
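For illustration, here is a minimal sketch on synthetic data (make_regression is just a stand-in for a real dataset) that compares a linear baseline against a couple of alternatives with cross-validation:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVR

    # Synthetic continuous features and response
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    models = {
        "linear baseline": LinearRegression(),
        "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "SVR": SVR(),
    }

    # The baseline only "wins" if nothing else beats its cross-validated R^2
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean R^2 = {scores.mean():.3f}")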

Advice for my plan - large dataset of students and grades, looking to classify bottom 2%

I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, the dataset covers about 30 million students, with 36 predictor variables per student and the students' grades as the response.
My goal is to be able to predict whether a student will fail out (i.e. be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
With an imbalanced dataset, class weighting is the most common approach. However, with such a large dataset (30M training examples) and a 2%/98% split between the two classes, class weighting alone may not be enough to keep the model from being biased against the minority class; in effect it is not very different from reducing the training set to a smaller, balanced one.
Here are some steps for evaluating model accuracy.
Split your dataset into training, validation and test sets.
For the evaluation metric, I suggest these alternatives:
a. Make sure the minority class makes up at least 20% of both the
validation and test sets.
b. Use precision and recall as the evaluation metrics for your model
(rather than the F1 score).
c. Use Cohen's kappa score (coefficient) as the evaluation metric.
From my own perspective, I prefer option b.
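For example, a small sketch with made-up labels and predictions, just to show the metric calls in scikit-learn (1 marks the minority, bottom-2% class):

    from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

    # Made-up labels and predictions for a handful of students
    y_true = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
    y_pred = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))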
Since you are using TensorFlow, I assume you are familiar with deep learning, so consider using deep learning instead of classical machine learning; that gives you many additional alternatives. Either way, here are some options for both the machine learning and the deep learning approach.
For Machine Learning Algorithms
1. Decision tree algorithms (especially random forest); a short sketch with class weighting follows this list.
2. If the features have little or no correlation (correlations approaching zero, e.g. 0.01), I would try Complement Naive Bayes classifiers for multinomial features, or Gaussian Naive Bayes with class weights for continuous features.
3. Try some nonparametric learning algorithms. You may not be able to fit a training set this large with Support Vector Machines (SVM) easily, but you could try.
4. Try unsupervised learning algorithms (this sometimes gives you a more generic model).
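For instance, a minimal sketch of option 1 with class weighting (the data is a random stand-in for the real 36-feature, 2%-positive dataset):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Random stand-in for the real data: 36 predictors, ~2% positive labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 36))
    y = (rng.random(10_000) < 0.02).astype(int)

    # class_weight="balanced" raises the cost of misclassifying the minority class
    rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=-1)
    rf.fit(X, y)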
For Deep Learning Algorithms
1. Encoder-decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Train a model using 1D convolution layers.
4. Use class weights (see the sketch after this list).
5. Use balanced batches of the training set, randomly sampled.
You have many other alternatives. From my own perspective, I would try hardest with options 1, 3 or 5.
For deep learning, the 5th approach sometimes works very well and I recommend trying it together with 1 and 3.
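As a small illustration of option 4 (class weights) in TensorFlow/Keras, a sketch on random stand-in data with an arbitrary tiny model, not a full solution:

    import numpy as np
    import tensorflow as tf

    # Random stand-in data: 36 features, roughly 2% positive class
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 36)).astype("float32")
    y = (rng.random(5000) < 0.02).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(36,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

    # Weight classes roughly by inverse frequency (98% / 2% split)
    model.fit(X, y, epochs=2, batch_size=256, class_weight={0: 1.0, 1: 49.0})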

What should be the ideal validation accuracy of an LSTM-based text generator?

I modelled an LSTM-based text generator using a dataset I have. The purpose of the model is to predict the ends of sentences. My training shows a validation accuracy of around 81%. While reading through a couple of articles, I found that, unlike in a classification problem, I should be worried more about the loss than the accuracy. Is this the case, and if so, what would be an ideal loss value? Right now my loss is around 1.5+.
There is no fixed minimum accuracy for any machine learning or deep learning problem. As many say: garbage in, garbage out.
Good-quality data together with a decent model will give you good accuracy.
Generally, accuracy benchmarks are set for the standard datasets openly available on the internet, like SQuAD, RACE, SWAG, GLUE and many more.
Usually, state-of-the-art models report their performance on these datasets, setting accuracy benchmarks specific to each dataset.
Coming to your problem: whether the model is performing well depends on accuracy and on the evaluation metric you are using, and in NLP calculating the loss is a bit tricky. In your case you are trying to predict the end of a sentence, which has no fixed length, because the same information can be expressed in multiple ways with a varying number of words.
Judging by its validation and test accuracy your model looks decent, but before pushing for higher accuracy you should also watch out for overfitting; the model should not be biased towards your data.
You can try different metrics to evaluate the model and compare the results yourself.
I hope this answers your question. Happy learning!
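One way to make the loss more interpretable is perplexity, which for a language model trained with the usual categorical cross-entropy (in nats) is just the exponential of the average per-token loss. A tiny sketch using the loss value quoted above (this is not a benchmark):

    import math

    # Average per-token cross-entropy (the validation loss reported during training)
    val_loss = 1.5

    # Perplexity: roughly how many tokens the model is "choosing between" on average
    perplexity = math.exp(val_loss)
    print(f"perplexity = {perplexity:.1f}")  # about 4.5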

Best metric for classification-regression?

Sorry for the weird title, I don't know how to better express my problem. I'm working with an insurance dataset to predict future claim costs for a given policy.
For anyone who has worked with insurance claim data, you know that claim amounts are heavily weighted toward zero (most policies have no claims). I've run into the issue before where regression on the entire dataset does not perform well, due to the skew of the data and the mixed continuous-discrete distribution.
I've tried some Tweedie distributions in R to help with this disconnect, but I ended up going a different route.
I first decided to classify the data into "claim amount = 0" and "claim amount != 0" using a support vector classifier, sklearn.svm.SVC (with 98% training and 95% test accuracy); if a claim amount is predicted to be != 0, it is fed into a regression model to predict the incurred claim amount. I went with ridge regression, sklearn.linear_model.Ridge, for this part, and achieved a relatively good $R^2$ of 0.67 on the test set (real-world data, so I'm not expecting anything extraordinary).
So my question is, what is the best way to evaluate this composite model, specifically in python? Do you think the MSE would be a good metric? The only other model I can compare it to is a basic linear regression (on the entire dataset, without the pre-classification).
Of course, feel free to suggest alternatives to this two-part classification-regression model.
EDIT: To clarify, I chose these specific models (over neural networks, for example) because of their ability to be translated into simple math for different applications.
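For concreteness, a minimal sketch of how the two stages could be stitched together and scored (svc, ridge, X_test and y_test are placeholders for the fitted models and the held-out policy data), so the composite model can be compared with plain linear regression on the same MSE/MAE scale:

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    def composite_predict(svc, ridge, X_test):
        """Predict 0 where the classifier says 'no claim', else use the regressor."""
        has_claim = svc.predict(X_test).astype(bool)
        y_pred = np.zeros(X_test.shape[0])
        y_pred[has_claim] = ridge.predict(X_test[has_claim])
        return y_pred

    # y_pred = composite_predict(svc, ridge, X_test)
    # print("MSE:", mean_squared_error(y_test, y_pred))
    # print("MAE:", mean_absolute_error(y_test, y_pred))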

Which ML model would be best suited for time series activity log data to predict customer retention? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I have customer activity data such as the number of logins, time spent on the site, devices registered and policies changed. The data is structured on a day-by-day basis, i.e. each row is a customer's activity on a particular day.
The ML model should be able to predict, based on this activity, whether the customer will be retained or not.
Ideally, the model should output a boolean value or the probability of retention.
Which ML models should I look into?
Any suggestions would be appreciated.
"Which ML model would be best suited for ..."
Unfortunately, the "no free lunch" theorem states that the answer will always be: "it depends".
Fortunately though, customer retention models are well researched (e.g. this paper) and usually formulated as a simple classification problem. Therefore you could try a few simple algorithms such as:
Regression analysis: logistic regression.
Decision tree: CART.
Bayes algorithm: naive Bayesian.
Support vector machine.
Instance-based learning: k-nearest neighbors.
Ensemble learning: AdaBoost, stochastic gradient boosting and random forest.
Artificial neural network: multi-layer perceptron.
Linear discriminant analysis.
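As a concrete starting point, a minimal sketch with logistic regression (the feature names and data below are made up; logistic regression simply stands in for any of the classifiers listed above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Made-up per-customer features aggregated from the daily activity log:
    # [avg_logins_per_day, avg_time_on_site, devices_registered, policies_changed]
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = rng.integers(0, 2, size=1000)  # 1 = retained, 0 = churned

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))
    # Probability of retention for the first test customer (the "% chance" output)
    print("retention probability:", clf.predict_proba(X_test[:1])[0, 1])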
Time Series Forecasting
Making predictions about the future is called extrapolation in the classical statistical handling of time series data.
More modern fields focus on the topic and refer to it as time series forecasting.
Forecasting involves taking models fit on historical data and using them to predict future observations.
Descriptive models can borrow from the future (i.e. to smooth or remove noise), but they only seek to best describe the data.
An important distinction in forecasting is that the future is completely unavailable and must only be estimated from what has already happened.
If your data shows trends or seasonality, you may want to smooth the data and use one of these algorithms:
1. Moving Average
2. Autoregression
3. ARIMA (Autoregressive Integrated Moving Average)
The ARIMA model is a combination of the moving average and autoregression approaches (with differencing for the "integrated" part).
I strongly recommend going through this great tutorial/blog about time series forecasting with an ARIMA model: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
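For a quick taste, a minimal sketch on a synthetic series using statsmodels (the order=(1, 1, 1) is just an illustrative choice, not a recommendation for your data):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic daily series with a slight upward drift plus noise
    rng = np.random.default_rng(0)
    y = pd.Series(np.cumsum(rng.normal(loc=0.1, scale=1.0, size=200)))

    # Fit ARIMA(p=1, d=1, q=1) and forecast the next 7 steps
    fitted = ARIMA(y, order=(1, 1, 1)).fit()
    print(fitted.forecast(steps=7))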
