Best information criterion for ARIMA model?

I use ARIMA model with python:
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(x, order=(p, d, q),
                enforce_stationarity=False,
                enforce_invertibility=False).fit(disp=False)
I want to compare several models with different parameters and choose the one with fewer regressors (that is, the simpler model).
Which information criterion should I use?
I have read about AIC and BIC.
I also read that BIC is better than AIC for choosing a simpler ARMA model, but which is best for ARIMA?
Maybe I should use another information criterion, such as HQIC?
I would be grateful for useful links.

I don't think there is a general answer to your question.
In R there is an auto.arima function which is written by Rob Hyndman: he uses AICc.
You can read all about it in his online book (chapter 8.7).
Note that classical information criteria (AIC, BIC, etc.) cannot be used to compare ARIMA models with different orders of differencing d or D, since the number of usable observations depends on d and D.
Here is a list of things to keep in mind when working with information criteria.
Ultimately, then, the final choice of model cannot (in my experience) be based on a single number. Rather, it should be supported by diagnostic plots as well as information criteria.
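To make the difference between the criteria concrete, here is a minimal sketch of the standard formulas. The log-likelihoods, parameter counts and sample size below are made-up numbers for illustration, not results from a real fit. As far as I know, statsmodels results objects already expose `aic`, `bic` and `hqic` attributes, so in practice you would read those rather than compute them by hand.

```python
import math

def information_criteria(log_likelihood, n_obs, n_params):
    """Return AIC, AICc, BIC and HQIC from a model's log-likelihood."""
    k, n = n_params, n_obs
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = k * math.log(n) - 2 * log_likelihood
    hqic = 2 * k * math.log(math.log(n)) - 2 * log_likelihood
    return {"aic": aic, "aicc": aicc, "bic": bic, "hqic": hqic}

# Hypothetical candidates fitted to the same series with the SAME d,
# so that their criteria are comparable. All numbers are invented.
n_obs = 120
candidates = {
    (1, 1, 0): {"log_likelihood": -152.3, "n_params": 2},
    (2, 1, 2): {"log_likelihood": -150.9, "n_params": 5},
}

scores = {
    order: information_criteria(c["log_likelihood"], n_obs, c["n_params"])
    for order, c in candidates.items()
}

# Lower is better for every criterion.
best_by_bic = min(scores, key=lambda order: scores[order]["bic"])
for order, s in scores.items():
    print(order, {name: round(value, 2) for name, value in s.items()})
```

The per-parameter penalty is 2 for AIC, 2·ln(ln n) for HQIC and ln(n) for BIC, so for any realistic sample size BIC punishes extra parameters hardest and tends to pick the simplest model, with HQIC in between.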

Related

<lifelines> Solving Cox Proportional Hazard after creating interaction variable with time

I am using lifelines package to do Cox Regression. After trying to fit the model, I checked the CPH assumptions for any possible violations and it returned some problematic variables, along with the suggested solutions.
One of the solutions that I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example there uses CoxTimeVaryingFitter which, unlike CoxPHFitter, does not provide a concordance score, which would help me gauge model performance. Additionally, CoxTimeVaryingFitter does not have a check_assumptions feature. Does this mean that by putting the data into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seemed like their solution is to create the interaction term directly (multiplying the problematic variable with the survival time) without changing the format to episodic format (as shown in the link). This way, I was hoping to just keep using CoxPHFitter due to its model scoring capability.
However, after doing this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption on the time-interaction variable is violated.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help in resolving this confusion is greatly appreciated.

Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you also know the real ranking, since you know the failure times (times-to-event). Now, quantify for how many pairs of individuals the predicted ordering agrees with the true ordering. In essence, you would be estimating the concordance.
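The pair counting described above could be sketched roughly as follows. This is a simplified, Harrell-style concordance written from scratch, not the lifelines implementation, and it assumes you have reduced each individual to a single risk score (higher means earlier predicted failure), which is itself a simplification when covariates vary over time. The data are invented.

```python
from itertools import combinations

def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs whose predicted order matches the observed one.

    A pair is comparable only if the subject with the shorter time had an
    observed event; if that subject was censored, the true order is unknown.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0  # higher risk failed first: correct order
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else float("nan")

# Invented example: durations, event indicators (1 = failure observed,
# 0 = censored) and model risk scores for four individuals.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [2.1, 0.4, 1.5, 0.3]

c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking. lifelines also ships a `concordance_index` helper in `lifelines.utils`; check its documentation for the direction it expects the scores to run in before using it.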
If you opt to use CoxPHFitter, I don't think it was meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify on your variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained here.

Selecting the best combination of variables for regression model based on reg score

Hello old faithful community,
This might be a tough one as I can barely find any material on this.
The Problem
I have a data set of crimes committed in NSW Australia by council, and have merged this with average house prices by council. I'm now looking to produce a linear regression to try and predict said house price by the crime in the neighbourhood. The issue is, I have 49 crimes, and only want the best ones (statistically speaking) to be used in my model.
I've run the regression score over all and some variables (using correlation), and had results from .23 - .38 but I want to perfect this to the best possible - if there is a way to do this of course.
I've thought about looping over every possible combination, but that would come to a couple of million combinations, according to Google.
So, my friends - how can I python this dataframe to get the best columns?

If I might add, you may want to take a look at the Python package mlxtend, http://rasbt.github.io/mlxtend.
It is a package that features several forward/backward stepwise regression algorithms, while still using the regressors/selectors of sklearn.

There is no gold standard for solving this problem, and you are right: trying every combination is computationally infeasible most of the time, especially with 49 variables, which give 2^49 (about 5.6 × 10^14) possible subsets. One method would be to implement forward or backward selection, adding or removing variables based on a user-specified p-value criterion (this is the statistically relevant criterion you mention). For Python implementations using statsmodels, check out these links:
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn/24447#24447
http://planspace.org/20150423-forward_selection_with_statsmodels/
Other approaches that are less 'statistically valid' would be to define a model evaluation metric (e.g., r squared, mean squared error, etc) and use a variable selection approach such as LASSO, random forest, genetic algorithm, etc to identify the set of variables that optimize the metric of choice. I find that in practice, ensembling these techniques in a voting-type scheme works the best as different techniques work better for certain types of data. Check out the links below from sklearn to see some options that you can code up pretty quickly with your data:
Overview of techniques: http://scikit-learn.org/stable/modules/feature_selection.html
A stepwise procedure: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Select best features based on model: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
If you are up for it, I would try a few techniques and see if the answers converge to the same set of features -- This will give you some insight into the relationships between your variables.
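As a rough illustration of the forward-selection idea, here is a minimal sketch using only NumPy. Note the swap: instead of a p-value criterion it adds, at each step, the column that most improves adjusted R-squared and stops when nothing improves it. The data are synthetic and the `crime_*` column names are placeholders, not the real NSW data.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def forward_select(X, y, names):
    """Greedily add the column that most improves adjusted R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:
            break  # no remaining column improves the fit
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected], best_score

# Synthetic data: only columns 1 and 4 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.5, size=200)
names = [f"crime_{i}" for i in range(6)]

chosen, score = forward_select(X, y, names)
print(chosen, round(score, 3))
```

With 49 columns this fits at most 49 + 48 + ... models rather than 2^49, which is why greedy selection is feasible. Adjusted R-squared is a lenient stopping rule and may let a noise column or two through; a stricter criterion such as BIC or a p-value threshold would select fewer variables.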

NLP: steps or approach to classify text?

I'm working on a project to classify restaurant reviews by sentiment (positive or negative). I also want to classify whether each comment belongs to a category such as food, service, or value for money. I am unable to connect the steps and methodologies I find on the internet. Can anyone provide a detailed method or set of steps to get to a solution?

How about using a bag-of-words model? It has been tried and tested for ages. It has some downsides compared to more modern methods, but you can still get decent results, and there is plenty of material on the internet to help you:
Normalize the documents into a form your pipeline can ingest.
Convert the documents to vectors and apply TF-IDF to down-weight irrelevant terms. Here is a good tutorial.
Split your documents: take a subset, label each document with its sentiment class and its comment type, and use that subset as training data. For sentiment, each document belongs to one of two classes.
Apply some type of dimensionality reduction to make your models more robust; a good discussion is here.
Train your models on the training data. You need at least two models, one for sentiment and one for comment type. Some algorithms work with binary classes only, so you might need more than two models for comment type (Food, Value, Service). This might be a good thing, because a comment can belong to more than one class (Food quality and Value, or Value and Service). Scikit-learn has a lot of good models; I also highly recommend the Orange toolbox, which is like a GUI for data science.
Validate your models using a validation set. If your accuracy is satisfactory (most classical methods like SVM should give you at least 90%), go ahead and use the model on incoming data.
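To show what the normalization and TF-IDF steps above actually compute, here is a small pure-Python sketch. The tokenizer is deliberately crude and the three reviews are invented. In a real project you would more likely use scikit-learn's `TfidfVectorizer`, which applies a smoothed variant of the same idea, and feed the resulting vectors to a classifier.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a crude normalizer)."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()

def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors

# Invented reviews for illustration.
reviews = [
    "The food was great",
    "The service was slow",
    "Great value for the food",
]
vocab, vectors = tfidf_vectors(reviews)

for word, weight in zip(vocab, vectors[1]):
    print(f"{word:8s} {weight:.3f}")
```

Words that occur in every document (here "the") get weight zero, while words specific to one review ("service", "slow") get the highest weight, which is exactly the down-weighting of uninformative terms the step describes.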

Clustering before regression - recommender system

I have a file called train.dat which has three fields - userID, movieID and rating.
I need to predict the rating in the test.dat file based on this.
I want to know how I can use scikit-learn's KMeans to group similar users, given that I have only one feature, the rating.
Does this even make sense to do? After the clustering step, I could do a regression step to get the ratings for each user-movie pair in test.dat
Edit: I have some extra files which contain the actors in each movie, the directors and also the genres that the movie falls into. I'm unsure how to use these to start with and I'm asking this question because I was wondering whether it's possible to get a simple model working with just rating and then enhance it with the other data. I read that this is called content based recommendation. I'm sorry, I should've written about the other data files as well.

scikit-learn is not a library for recommender systems, nor is k-means a typical tool for clustering such data. What you are trying to do involves graphs, and such data is usually analyzed either at the graph level or with matrix factorization techniques.
In particular, k-means only works in Euclidean spaces, and you do not have one here. What you can do is use DBSCAN (or any other clustering technique that accepts an arbitrary similarity, but this one is actually in scikit-learn) and define the similarity between two users by some measure of how much their tastes agree, for example:
sim(user1, user2) = # movies both users like / # movies at least one of them likes
which is known as the Jaccard coefficient for similarity between binary vectors. You have ratings, not just "likes", but I am giving the simplest possible example here; you can come up with dozens of other things to try. The point is that, for the simplest approach, all you have to do is define a notion of per-user similarity and apply a clustering method that accepts such a setting (like the DBSCAN mentioned above).
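A minimal sketch of that similarity, assuming a rating of 4 or more counts as "liked" (the threshold and the sample ratings are my own assumptions, not part of the question's data):

```python
def jaccard(liked_a, liked_b):
    """Movies both users like, divided by movies at least one of them likes."""
    union = liked_a | liked_b
    if not union:
        return 0.0  # neither user likes anything: treat as no similarity
    return len(liked_a & liked_b) / len(union)

# Invented (userID, movieID, rating) rows in the shape of train.dat.
ratings = [
    (1, 10, 5), (1, 11, 4), (1, 12, 2),
    (2, 10, 4), (2, 11, 5), (2, 13, 1),
    (3, 12, 5), (3, 13, 4),
]

LIKE_THRESHOLD = 4  # assumption: ratings >= 4 count as a "like"
liked = {}
for user, movie, rating in ratings:
    liked.setdefault(user, set())
    if rating >= LIKE_THRESHOLD:
        liked[user].add(movie)

users = sorted(liked)
# Clustering algorithms usually want distances, so use 1 - similarity.
distance = [[1.0 - jaccard(liked[a], liked[b]) for b in users] for a in users]

for user, row in zip(users, distance):
    print(user, row)
```

A matrix like `distance` can then be handed to a clustering method that accepts precomputed distances; scikit-learn's `DBSCAN` does so via `metric="precomputed"`. For a real data set you would want a sparse representation rather than a dense list of lists.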

Clustering users makes sense. But if your only feature is the rating, I don't think it could produce a useful model for prediction. Here are the assumptions behind that claim:
The quality of movies should follow a Gaussian distribution.
If we look at the rating distribution of a typical user, it should be roughly Gaussian.
I don't exclude the possibility that a few users only give ratings when they see a bad movie (thus all low ratings), and vice versa. But across a large population of users, this should be unusual behavior.
Thus I can imagine that after clustering, you get small groups of users in the two extreme cases, and most users are in the middle (because they share the Gaussian-like rating behavior). Using this model, you probably get good results for users in the two small (extreme) groups; however, for the majority of users, you cannot expect good predictions.

Predictions with ARIMA (python statsmodels)

I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog=exog, order=(1, 0, 1))
results_ARIMA = model.fit()
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to just be the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])
arima_predict = results_ARIMA.predict(start=training_cut_date, end='2017-01-01',
                                      dynamic=False, exog=exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?

I have a similar problem atm which I have not entirely figured out yet. It seems including multiple seasonal terms in python is still a bit tricky. R does seem to have this capacity, see here. So, one suggestion I can give you is to try this with the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not familiar with R yet).
Looking at your approach for modeling the seasonal patterns, taking the nth order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results. In such cases, model prediction might turn out fairly well. Conversely, if the differences are big, including them can easily distort prediction results. This could explain the variation you are seeing in your modeling results. Conceptually, then, what you'd want to do instead is represent the constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use Welch's method from SciPy's signal module. What this does is return an estimate of the spectral density of your time series, from which you can deduce signal strength at various frequencies in your time series.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
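Here is a minimal sketch of building such seasonal regressors with NumPy. It differs slightly from the description above: instead of reading amplitudes off the spectral density, it assumes the seasonal periods are already known (7 and 365.25 days are my assumptions for daily data) and generates sine/cosine pairs, leaving it to the regression to estimate amplitude and phase. The function name and the harmonic counts are hypothetical choices.

```python
import numpy as np

def fourier_terms(n_obs, period, n_harmonics, start=0):
    """Return an (n_obs, 2 * n_harmonics) array of sine/cosine regressors.

    `start` is the time index of the first row, so a forecast window can
    continue the same waves where the training window ended.
    """
    t = np.arange(start, start + n_obs)
    columns = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

# Assumed setup: two years of daily training data, 90 days to forecast.
n_train, n_test = 730, 90

exog_train = np.hstack([
    fourier_terms(n_train, period=7, n_harmonics=2),
    fourier_terms(n_train, period=365.25, n_harmonics=3),
])
exog_test = np.hstack([
    fourier_terms(n_test, period=7, n_harmonics=2, start=n_train),
    fourier_terms(n_test, period=365.25, n_harmonics=3, start=n_train),
])

print(exog_train.shape, exog_test.shape)
```

Because these columns depend only on the time index, they are known exactly for any future date, which is what an exogenous regressor used for forecasting needs to be. Whether a given ARIMA implementation handles them well as `exog` is a separate question that this sketch does not settle.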
This is about as far as I have gotten myself at this point. Right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends but hey, a guy can dream, right?). Edit: This blog post by Rob Hyndman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems Python is not very well suited to handling multiple seasonal terms right now; R might be a better solution (see reference);
Using difference scores to account for seasonal trends seems not to capture the constant variance associated with the recurrence of the season;
One way to do this in Python could be to use Fourier series representing seasonal trends (also see reference), which can be obtained using, among other ways, a spectral density estimate from Welch's method. How to use these as exogenous variables in an ARIMA to good effect is an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python
