Distributed Lag Model in Python - python

I have quickly looked for Distributed Lag Model in StatsModels but can't find one. The one that is similar is VAR model. Can I transform VAR model to Distributed Lag Model and how? It will be great if there are already other packages which have Distributed Lag Model. Please let me know if so.
Thanks!

If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.

Related

Improving prediction accuracy in Bayesian Causal Network

I would like to determine the causes of an unexpected outcome (or anamoly) in a thermodynamic process. I have continuous data of the associated variables and trying to make use of 'Bayesian Network (BN)' for the determination of causality relationships. For this purpose, I used a library called 'Causalnex' in Python.
I have followed the tutorial section of this library to build the DAG,BN model and everything works fine upto the step of predictions. The prediction results of minority/less majority classes have an accuracy of around 60-70% (80-90% with SMOTE/SMOTETomek and a particular random state) whereas a stable accuracy of more than 90% is expected. I have implemented following data-preprocessing steps.
Ensuring no missing/NaN values
Discretization (only it is supported by the library)
SMOTE/SMOTETomek for data balancing
Various train/test size combinations
I am struggling to figure out the ways to optimize the model. I could not find any supportive material in Internet for the same.
Are there any Guidelines or 'Best practices' of data pre-processing techniques and dataset requirements that particulary work for this library/ BN model? Could you please suggest any troubleshooting methods to identify the causes of low accuracy/metrics? Perhaps a misunderstood node-node causal relationship in DAG causes mediocre accuracy?
Any ideas/literature/other suitable library regarding this would be of great help!
A few tips that can help:
Changing/Tuning the Structure learning.
Trying different thresholds. When doing from_pandas, you can experiment with different w-threshold values (and the beta term (if you are using from_pandas_lasso)).
This will change the density of the network. A more dense structure implies a BN with more parameters. If the structure is more dense, you have more parameters and your model may perform better. If it is too dense, though, you may not have enough data to train it and may overfit.
Center the Data. Empirically, it seems that NOTEARS (the algorithm behind from_pandas) works best if the data is centered. So, subtracting the mean of the see this may be a good idea.
Ensure causality. NOTEARS does not ensure causality. So we need "experts" to judge the output and make the necessary modifications. If you see edges that don't make causal sense, you can either remove them or add them as tabu_edges and train your network again.
Experiment with discretisation. The performance can be very sensitive to how you discretise the data. Experimenting with various types of discretisation can help. You can use:
Methods available in Causalnex (uniform, for example)
fixed discretisations based on what thresholds make sense for your data
MDLP is a supervised way to discretise data. You can apply MDLP for each node having as "target" one of its children. There are 2 main packages for MDLP in pypy: mdlp and mdlp-discretization

When should one use time series analysis vs. non-time series analysis?

I am trying to predict churn and for this my dependent variable is a binary variable. The independent variables can be categorical, integer or timeseries data. I am in the feature selection mode and will like to know if I am running correlation, should I run correlation on time series data or not. If I do use a wrapper method and use a ML algorithm for such a problem, do I use models like ARIMA that are more suited for time series analysis or a decision tree model?
I have tried using Spearman correlation but am not finding any significant correlated dependent variables
You most likely should! Since churn rate may be affected by macroeconomical issues that will show in your autocorrelation function. I suggest paying a visit to statsmodel and making sure you understand ACF plots and PACF plots (that can be done with statsmodel quite easily) together with ARIMA models so you can do some fine tuning. As for the feature selection, you can try using an overfitted neural network or model with L1 regularization.
https://www.statsmodels.org/stable/index.html

Elastic net regression or lasso regression with weighted samples (sklearn)

Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
pain keeping all the basic sklearn'APIs and supporting all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could build your own optimizer quite easily based on scipy.optimize, supporting this kind of sample-weights (which will be a bit slower, but robust and precise)!

Does sklearn have group lasso?

I'm interested in using group lasso for a problem I have. Here is a link to the algorithm. I know R has a slick implementation, but am curious to see if python has something similar.
I think sklearn.linear_model.MultiTaskLasso might be kind of similar, but am not sure. Can anyone shed some light on this?
Whether or not to implement the Group Lasso in sklearn is discussed in this issue in the sklearn repo, where the conclusion so far is that it is too much of a niche model to justify the maintenance it would need if included in master.
Therefore, I have implemented a GroupLasso class which passes sklearn's check_estimator() in my python/cython package celer, which acts as a dropin replacement for sklearn's Lasso, MultitaskLasso, sparse Logistic regression with faster solvers.
The solver uses coordinate descent, working set methods and extrapolation, which should allow it to scale to problems with millions of features.
It supports sparse and dense data, along with centering and normalization (centering sparse data is not trivial as it breaks the sparsity of the design matrix), and comes with a GroupLassoCV class to perform cross-validation.
In celer's documentation, there is an example showing how to use it.
I've also looked into this, as far as I know scikit-learn does not provide this implementation.
The MultiTaskLasso does something else. From the documentation:
"The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks."
In other words, the MultiTaskLasso is an implementation of the Lasso which is able to predict multiple targets at the same time (hence y is a 2D array). Another way this problem is known is 'multi-output regression' or 'multi-target regression'. If the tasks are related, such methods can improve methods which try to model every task or target separately.

Quickest linear regression implementation in python

I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.
In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
What is the quickest implementation of OLS would be for larger datasets? The Statsmodel OLS implementation seems to be using numpy to invert matrices. Would a gradient descent based method be quicker? Does scikit-learn have an especially quick implementation?
Or maybe an mcmc based approach using pymc is quickest...
Update 1: Seems that the scikit learn implementation of LinearRegression is a wrapper for the scipy implementation.
Update 2: Scipy OLS via scikit learn LinearRegression is twice as fast as statsmodels OLS in my very limited tests...
The scikit-learn SGDRegressor class is (iirc) the fastest, but would probably be more difficult to tune than a simple LinearRegression.
I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data - if you have many gigs but they are all samples from the same distibution, you can train/tune your model on a few thousand samples (dependent on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.
Once you find a few candidate models, then you can try those on the whole dataset.
Stepwise methods are not a good way to perform model selection, as they are entirely ad hoc, and depend highly on which direction you run the stepwise procedure. Its far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is to use reversible-jump MCMC, which fits models over the entire models space, and not just the parameter space of a particular model.
PyMC does not implement rjMCMC itself, but it can be implemented. Note also that PyMC 3 makes it really easy to fit regression models using its new glm submodule.

Categories