When should one use time series analysis vs. non-time series analysis? - python

I am trying to predict churn, and my dependent variable is binary. The independent variables can be categorical, integer, or time series data. I am in the feature selection stage and would like to know whether I should run correlation on the time series data or not. If I use a wrapper method with an ML algorithm for this problem, should I use a model like ARIMA that is better suited for time series analysis, or a decision tree model?
I have tried using Spearman correlation but have not found any independent variables significantly correlated with the dependent variable.

You most likely should! Churn rate may be affected by macroeconomic factors that will show up in your autocorrelation function. I suggest paying a visit to statsmodels and making sure you understand ACF and PACF plots (both can be produced with statsmodels quite easily), together with ARIMA models, so you can do some fine-tuning. As for feature selection, you can try an overfitted neural network or a model with L1 regularization.
https://www.statsmodels.org/stable/index.html

Related

Multivariate time series distribution forecasting problem

I am working on the following time series multi-class classification problem:

- 42 possible classes that are dependent on each other; I want to know the probability of each class for up to 56 days ahead
- 1 year of daily data, so 365 observations
- the class probabilities have a strong weekly seasonality
- exogenous regressors that are strongly correlated with the output classes
I realise that I am trying to predict a lot of output classes with little data, but I am looking for a model (preferably with Python implementation) that is most suited for this use case.
Any recommendations on what model could work for this problem?
So far I have tried:

- a tree-based model, but it struggles with the high number of classes and does not capture the time series component well
- a VAR model, but the number of parameters to estimate becomes too high relative to the length of the series
- predicting each class probability independently, but that assumes the series are independent, which is not the case

Machine Learning Regression To Support Multi Variable Regression

I have a data set of ~150 samples, where each sample has 11 inputs and 3 outputs. I tried to build a full regression model that takes the 11 inputs and is trained to predict the 3 outputs. The issue is that with so few samples, training a full model is almost impossible. For this reason I am experimenting with simpler regression techniques such as linear regression in Python's sklearn. From what I can find, most regression models support either one input predicting one output (after regression is complete) or many inputs predicting one output.
Question: Are there any types of regression that support many inputs predicting many outputs? Or any regression types at all that may be better suited to my needs?
Thank you for any help!
Have you considered simply performing separate linear regressions for each dependent variable?
Also, you need to decide which inputs are theoretically significant (in terms of explaining the variation in the output) and then test for statistical significance to determine which ones should be retained in the model.
Additionally, test for multicollinearity to determine if you have variables which are statistically related and inadvertently influencing the output. Then, test in turn for serial correlation (if the data is a time series) and heteroscedasticity.
The "garbage in, garbage out" approach you are describing risks overfitting, since you don't seem to be screening the inputs themselves for their relevance in predicting the output.
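To illustrate the separate-regressions suggestion: sklearn's LinearRegression accepts a 2-D target natively, which is equivalent to fitting an independent least-squares regression per output. The data below is synthetic, shaped like the question's 150 samples, 11 inputs, and 3 outputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data shaped like the question: 150 samples, 11 inputs, 3 outputs.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 11))
true_coef = rng.normal(size=(11, 3))
Y = X @ true_coef + rng.normal(scale=0.1, size=(150, 3))

# Fitting a 2-D target fits all 3 outputs at once,
# equivalent to one least-squares regression per output column.
model = LinearRegression().fit(X, Y)
pred = model.predict(X[:5])
print(pred.shape)  # (5, 3)
```

If the outputs share structure, multi-task variants such as sklearn's MultiTaskLasso can also be worth trying, since they select the same features across all outputs.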

How to apply Gaussian naive bayes to predict traffic number in the future?

I have got some historical data on traffic and would like to predict the future.
I am referring to http://www.nuriaoliver.com/bicing/IJCAI09_Bicing.pdf, which applied a Bayesian network to predict the change in the number of bikes. I have built the Bayesian network and would like to predict the changes using a Bayesian approach.
I have run into several questions. I tried to use naive Bayes to predict the number, but naive Bayes seems to allow only a few discrete classes as output. In my case, the changes cannot be grouped into discrete classes (unlike predicting whether a human is "male" or "female", where there are only 2 discrete outputs for the classifier).
How can I apply the Bayesian approach in my case, and what kind of Python packages could help me?
I would see this as a time series forecasting problem and not a classification problem. As you noted, you are not trying to label your data with a set of discrete classes. Given a series of observations x_1, x_2, ..., x_n, you are trying to predict x_(n+1), i.e. to forecast the next observation of the same variable in the series. Perhaps you could refer to this slide for a brief introduction to time series forecasting.
A quick start guide for time series forecasting with Python can be found here: https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/

What is the difference between a linear regression classifier and linear regression to extract the confidence interval?

I am a beginner in machine learning. I want to use time series linear regression to extract a confidence interval for my dataset. I don't need to use linear regression as a classifier. First, what is the difference between the two cases? Second, in Python, is there a different way to implement each?
The main difference is that a classifier computes a probability for a label, while a regression computes a quantitative output.
Generally, a classifier is used to compute the probability of a label, and a regression is used to compute a quantity. For instance, if you want to compute the price of a flat given some criteria, you will use a regression; if you want to compute a label (luxurious, modest, ...) for the same flat given the same criteria, you will use a classifier.
But using a regression to compute a threshold that separates observed labels is a common technique too. That is the case with linear SVM, which computes a boundary between labels, called the decision boundary. Warning: the main drawback of a linear model is precisely that it is linear: the boundary will necessarily be a straight line separating the labels. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception, because it actually computes a probability. Its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even with linear SVM it is nonsensical. You can use the decision function, but it is difficult to interpret in practice, or use the predicted probabilities, count the number of times the label is wrong, and compute an error ratio. There are plenty of such ratios available depending on your problem; frankly, it is the subject of a whole book in itself.
Anyway, if you are working with a time series, as far as I know your goal is to obtain a quantitative output, so you do not need a classifier, as you said. As for extracting the confidence interval, it depends entirely on the object you used to compute the fit in Python, i.e. on the attributes that object exposes, which in turn depends on the library. So it would be much better, in order to answer you, if you would indicate which libraries and objects you are using.

Distributed Lag Model in Python

I have quickly looked for a Distributed Lag Model in StatsModels but can't find one. The closest is the VAR model. Can I transform a VAR model into a Distributed Lag Model, and how? It would be great if other packages already implement a Distributed Lag Model; please let me know if so.
Thanks!
If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.
