How to compare time-series predictions between XGB and Random Forest - python

I have a time series forecasting assignment, and I used a random forest regressor and XGBoost to predict the future price.
To conclude the assignment, I would like to ask what kind of code I should use, or what I should do, to decide which model's predictions are better: XGBoost or random forest.
Any link or shared code would be much appreciated, because I have tried to Google it but still can't find a solution, and my deadline is near.

The Diebold-Mariano test is a statistical method for comparing forecasts: it tests whether two sets of predictions are equally accurate.
Diebold-Mariano Test Implementation from Kaggle
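For reference, here is a minimal sketch of the test under squared-error loss. The function name diebold_mariano and the horizon parameter h are my own choices, not taken from the Kaggle kernel above:

```python
import numpy as np
from scipy import stats

def diebold_mariano(actual, pred1, pred2, h=1):
    """DM test with squared-error loss; h is the forecast horizon.

    Returns (statistic, two-sided p-value). A negative statistic
    favours pred1, a positive one favours pred2.
    """
    actual, pred1, pred2 = map(np.asarray, (actual, pred1, pred2))
    d = (actual - pred1) ** 2 - (actual - pred2) ** 2  # loss differential
    n = len(d)
    d_mean = d.mean()

    def autocov(lag):
        return np.mean((d[lag:] - d_mean) * (d[: n - lag] - d_mean))

    # Long-run variance of the mean: autocovariances up to lag h-1
    var = (autocov(0) + 2 * sum(autocov(k) for k in range(1, h))) / n
    stat = d_mean / np.sqrt(var)
    return stat, 2 * stats.norm.sf(abs(stat))
```

Called as diebold_mariano(y_test, rf_pred, xgb_pred), a small p-value means the two forecasts differ significantly in accuracy, and the sign of the statistic tells you which model wins.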


How to interpret learning curves in machine learning?

These are the learning curves for each algorithm I used. I'm working on my report and I'm confused about how to interpret the curves.
I used multi-label classification algorithms.
The first one is the learning curve of binary relevance, where the classifier is KNeighborsClassifier.
The second one is the curve of a classifier chain using DecisionTreeClassifier,
and the last one is the curve of a label powerset using GaussianNB.
Which one is the best? The accuracy and the F1 scores all look good.
Learning curves are a tool for finding out which models benefit from more training data. In other words, they indicate whether a model will give better results as the dataset grows.
The best curve, in my opinion, is the one that reaches the best normalized score with the fewest training examples. It also has to converge quickly enough to a good score.
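For completeness, here is one rough way to produce such curves with scikit-learn's learning_curve utility; the synthetic multi-label dataset and the binary-relevance wrapper (OneVsRestClassifier) are placeholder assumptions, not your setup:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import learning_curve
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy multi-label data; replace with your own X and y
X, y = make_multilabel_classification(n_samples=1000, random_state=0)
clf = OneVsRestClassifier(KNeighborsClassifier())  # binary relevance

sizes, train_scores, val_scores = learning_curve(
    clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring="f1_micro"
)
plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("training examples")
plt.ylabel("micro-averaged F1")
plt.legend()
plt.show()
```

A curve whose cross-validation score is still rising at the largest training size suggests the model would benefit from more data; a flat gap between the two curves suggests it would not.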

How to apply Gaussian naive Bayes to predict traffic numbers in the future?

I have some historical data on traffic and would like to predict the future.
I take reference from http://www.nuriaoliver.com/bicing/IJCAI09_Bicing.pdf, which applies a Bayesian network to predict the change in the number of bikes. I built the Bayesian network and would like to predict the changes using the Bayesian approach.
I face several problems. I tried to use naive Bayes to predict the number, but it seems naive Bayes only allows the output to be one of several discrete classes. In my case, the changes cannot be grouped into discrete classes (unlike predicting whether a human is "male" or "female", where there are only 2 discrete outputs for the classifier).
May I know how I can apply the Bayesian approach in my case, and what Python packages could help me?
I would see this as a time series forecasting problem, not a classification problem. As you noted, you are not trying to label your data with a set of discrete classes. Given a series of observations x_1, x_2, ..., x_n, you are trying to predict x_(n+1), i.e., to forecast the next observation of the same variable in the series. Perhaps you could refer to this slide for a brief introduction to time series forecasting.
A quick-start guide for time series forecasting with Python can be found here: https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/
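As a minimal illustration of that framing (the toy series, the choice of three lags, and LinearRegression are all placeholder assumptions, not a recommendation):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy hourly bike-count series; replace with the real historical data
series = pd.Series([12, 15, 14, 18, 20, 19, 23, 25, 24, 28])

# Turn forecasting into supervised regression: predict x_(n+1)
# from the three previous observations (lag features)
df = pd.DataFrame({f"lag_{k}": series.shift(k) for k in (1, 2, 3)})
df["target"] = series
df = df.dropna()

model = LinearRegression().fit(df[["lag_1", "lag_2", "lag_3"]], df["target"])

# Features for the next step: the three most recent observations
next_row = pd.DataFrame(
    [[series.iloc[-1], series.iloc[-2], series.iloc[-3]]],
    columns=["lag_1", "lag_2", "lag_3"],
)
print(model.predict(next_row))  # forecast for the next observation
```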

Use of SVM classifier and multiple algorithms to improve accuracy

For a project I am working on, I am aiming to predict market trends and make long or short plays as a result. I am looking to use a reinforcement learning algorithm for this. In a paper I read recently, however, the authors suggested using a two-tiered system: an SVM classifier to determine the market trend, and three algorithms trained on positive, negative, and sideways market trends respectively. Each algorithm is therefore trained on data of the same trend, so there is less variability.
My question is: would using three algorithms improve the accuracy of the result, or would one model (with the same amount of data in total) provide the same accuracy?
Apologies if this seems a very basic question; I am new to machine learning and eager to learn. Cheers
Different models have different strengths and weaknesses. This is the entire idea behind using an ensemble model.
What you can do is train a random forest or AdaBoost model, both of which combine many weak learners into a single, stronger ensemble; a brief sketch follows below.
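As a rough illustration (the synthetic dataset is a stand-in for your trend-labelled market data, and the default hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for trend-labelled market data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Compare two ensemble methods by cross-validated accuracy
for model in (RandomForestClassifier(random_state=0),
              AdaBoostClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: {scores.mean():.3f} accuracy")
```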

sklearn calibrated classifier with random forest

Scikit-learn has very useful classifier wrappers called CalibratedClassifier and CalibratedClassifierCV, which try to make sure that the predict_proba function of a classifier really predicts a probability and not just an arbitrary (albeit perhaps well-ranked) number between zero and one.
However, when using random forests it is customary to use oob_decision_function_ to determine performance on the training data, but this is no longer available when using the calibrated models. The calibration should therefore work well for new data but not for the training data. How can we evaluate performance on the training data to determine, e.g., overfitting?
Apparently there really was no solution to this, and so I made a pull request to scikit-learn.
The problem is that the out-of-bag predictions are created during learning, so in CalibratedClassifierCV each of the sub-classifiers does have its own oob decision function; however, that decision function is calculated on a fold of the data. It is therefore necessary to store each oob prediction (keeping nan values for samples that are not in the fold), convert all the predictions using the calibration transformation, and then average the calibrated oob predictions to create an updated oob prediction.
As mentioned, I created a pull request at https://github.com/scikit-learn/scikit-learn/pull/11175. It will probably be a while before it is merged into the package, though, so if anyone really needs to use it then feel free to use my fork of scikit-learn at https://github.com/yishaishimoni/scikit-learn.
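Until something like that is merged, one coarse workaround (my own sketch, not the pull request's approach) is to compare the calibrated model's training score against a cross-validated score to gauge overfitting:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, random_state=0)

# Calibrated random forest; cv=3 means three fold-specific sub-forests
model = CalibratedClassifierCV(RandomForestClassifier(random_state=0), cv=3)
model.fit(X, y)

# A large gap between these two numbers suggests overfitting
print("train accuracy:", model.score(X, y))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```

This loses the computational elegance of the OOB estimate (the cross-validation refits the model), but it does not require patching scikit-learn.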

What is the best way to minimize the RMSE?

I am using LinearRegression() from sklearn to predict. I have created different features for X and am trying to understand how I can select the best features automatically. Let's say I have defined 50 different features for X and only one output y. Is there a way to select the best-performing features automatically instead of doing it manually?
Also, I can get the RMSE using the following command:
scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()
From here, how can I use this RMSE score? I mean, do I have to make multiple predictions? How am I going to use this RMSE? There must be a way to predict() with some optimization, but I couldn't find it.
sklearn doesn't have a stepwise algorithm, which would help in understanding the importance of features. However, it does provide recursive feature elimination, which is a greedy feature-elimination algorithm similar to sequential backward selection.
See the documentation here:
Recursive Feature Elimination
Note that it will not necessarily reduce your RMSE. You might try other techniques like Ridge and Lasso regression as well.
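A short sketch of the cross-validated variant, RFECV, which chooses the number of features automatically; the synthetic 50-feature dataset is a stand-in for your X:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 50 features, only 10 of which are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10, random_state=0)

selector = RFECV(LinearRegression(), cv=5, scoring="neg_mean_squared_error")
selector.fit(X, y)

print("features kept:", selector.n_features_)
print(selector.support_)  # boolean mask of the selected columns
```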
RMSE measures the average magnitude of the prediction error.
RMSE gives high weight to large errors, and lower values are always better. RMSE can be improved only if you have a decent model. For feature selection, you can use PCA, stepwise regression, or a basic correlation technique. If you see a lot of multicollinearity, go for Lasso or Ridge regression. Also, make sure you have a decent split of test and train data; if your testing data is bad, you will get poor results. Also, check the training-data R² and testing-data R² to make sure the model doesn't overfit.
It would be helpful if you added information on the number of observations in your test and train data and the R² value. Hope this helps.
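As a small illustration of the Lasso suggestion (synthetic data, with LassoCV picking the penalty strength automatically; both are placeholder choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)  # picks the penalty strength by CV
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))

# Cross-validated RMSE of the Lasso model, same recipe as in the question
rmse = np.sqrt(-cross_val_score(lasso, X, y, cv=5,
                                scoring="neg_mean_squared_error")).mean()
print("CV RMSE:", rmse)
```

The L1 penalty drives uninformative coefficients to exactly zero, so Lasso performs feature selection and regression in one step, which is useful when features are collinear.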
