When I estimate the Breusch-Pagan test step by step, the result doesn't match the BP value in statsmodels:
Here is a link to the notebook (in Spanish).
I have tested this in Gretl and my 'manual' estimation is correct, but I want to know why there is a difference.
Check the Koenker version in Gretl.
I don't find a Gretl reference right now, but according to the unit tests, the version in statsmodels is equal to the Koenker version of the Breusch-Pagan test.
I don't see an option for the original Breusch-Pagan test, but that one is not robust to non-normality (assumption on 4th moment, IIRC).
In general, many of the Lagrange multiplier specification tests have several versions that are asymptotically equivalent but differ in small samples; for example, statsmodels reports both the LM test and the F-test version. Additionally, they differ in which extra assumptions are required for their validity.
For example, statsmodels does not yet have a heteroscedasticity test that is robust to autocorrelation. There are again several versions, and I have not seen them available as a standard option in any package yet.
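To make the difference concrete, here is a small sketch (on simulated data, for illustration only) that computes both versions by hand next to statsmodels' het_breuschpagan. The Koenker/studentized statistic is n*R-squared from the auxiliary regression of the squared residuals; the original BP statistic is half the explained sum of squares after scaling the squared residuals by their mean:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# simulated heteroscedastic data, illustrative only
rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=(200, 2)))
y = x @ [1.0, 0.5, -0.5] + rng.normal(size=200) * (1 + np.abs(x[:, 1]))
res = sm.OLS(y, x).fit()

e2 = res.resid ** 2
aux = sm.OLS(e2, x).fit()                                # auxiliary regression

lm_koenker = len(e2) * aux.rsquared                      # Koenker / studentized
lm_original = 0.5 * sm.OLS(e2 / e2.mean(), x).fit().ess  # original BP, needs normality

print(lm_koenker, het_breuschpagan(res.resid, x)[0])     # these two should agree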
Related
Looking for a package that implements the multivariate version of statsmodels.distributions.ECDF
If one doesn't exist I will implement it for inclusion in statsmodels (if accepted), but don't want to reinvent the wheel.
I see this: https://gitlab.com/stochastic-control/StOpt
It has pybind11 bindings, but not sure if there is a wheel in pip already making this available.
The upcoming release of statsmodels 0.13 includes basic support for copulas.
Empirical and non-parametric copulas and multivariate distributions mainly have experimental code that is currently not public and not sufficiently tested and verified.
For example, _ecdf_mv (*) is currently a multivariate rankdata function. It needs to be divided by the number of observations, or converted to plotting positions, to make it into an ECDF. Because copulas require continuous uniform margins, ties are randomly or arbitrarily broken.
(*)
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/distributions/tools.py#L424
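For illustration, here is a minimal sketch of that idea outside of statsmodels, using scipy's rankdata per column and dividing by the number of observations; the function name and tie handling below are my own, not statsmodels' internals (scipy's default 'average' method is used instead of random tie breaking):

import numpy as np
from scipy.stats import rankdata

def ecdf_mv(data):
    # rank each column separately, then scale ranks to (0, 1]
    data = np.asarray(data)
    ranks = np.apply_along_axis(rankdata, 0, data)
    return ranks / data.shape[0]

# pseudo-observations with approximately uniform margins, e.g. as copula input
u = ecdf_mv(np.random.default_rng(0).normal(size=(100, 2)))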
I want to implement adjoint sensitivity analysis in Python, in order to determine the gradient of my objective function with respect to some parameters. Specifically, the objective function depends on the solution of a differential equation, which in turn depends on said parameters, which I am looking to optimize.
To perform this there are numerous good packages in Julia (see here), as well as CVODES from SUNDIALS; however, the latter, which apparently does have a Python wrapper, does not include sensitivity analysis capabilities according to this link. Furthermore, I have looked into SALib for sensitivity analysis, but as far as I understand it covers a different kind of 'sensitivity analysis' (global, variance-based methods), and therefore adjoint or even forward sensitivity analysis is not included (correct me if I am wrong on this one).
Thus my question is: does a version of CVODES exist in Python with sensitivity analysis capabilities, or is there any other package one can use to perform adjoint sensitivity analysis?
You can easily call Julia code / packages from Python with pyjulia.
https://github.com/JuliaPy/pyjulia
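A minimal sketch of what that looks like, assuming Julia and the pyjulia package are installed:

from julia import Main

# define a Julia function and call it from Python
Main.eval('f(x) = 2x + 1')
print(Main.f(3))   # -> 7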
You can try Assimulo, which is a Python wrapper of the SUNDIALS suite. I've been using it for some years now and it works pretty robustly. So far, I have performed forward sensitivity analysis on ODE systems with a moderate number of states/parameters using CVODES (fewer than 20 states, fewer than 10 parameters). It works pretty well in terms of robustness (it can handle stiff problems, and also supports a variety of linear solvers for sparse problems) and speed, and it also supports DAEs through IDAS. A minimal sketch of the setup is shown below.
I installed Assimulo using conda, which deals with the whole dependency tree (including a recent version of SUNDIALS). Finally, I'm not aware whether adjoint sensitivity analysis can be performed using Assimulo. If you find something, let us all know.
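For what it's worth, here is a minimal sketch of the forward-sensitivity setup I mean, following the pattern in the Assimulo documentation; the toy ODE and parameter values are illustrative:

import numpy as np
from assimulo.problem import Explicit_Problem
from assimulo.solvers import CVode

def rhs(t, y, p):
    # simple decay dy/dt = -p0 * y; p holds the parameters of interest
    return np.array([-p[0] * y[0]])

prob = Explicit_Problem(rhs, y0=[1.0], p0=[0.5])
sim = CVode(prob)                # supplying p0 switches on forward sensitivities
sim.report_continuously = True   # store the sensitivity trajectories

t, y = sim.simulate(5.0)
print(sim.p_sol[0][-1])          # sensitivity dy/dp at the final time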
I'm not sure if stackoverflow is the best forum for this, but anyway...
Scipy implements ANOVA using stats.f_oneway, which assumes equal variances. It says in the docs that if the variances are unequal, one could consider the Kruskal-Wallis test instead.
However, what I want is Welch's ANOVA. Scipy has a Welch t-test, but of course this doesn't work if I have more than two groups.
What I find interesting is that scipy used to have stats.oneway which allowed for an equal variance setting. However, it has been deprecated.
Is there an easy way to implement Welch's ANOVA in Python?
I just needed the same thing. I had to port the code from an R package, and I also requested that scipy.stats add this feature. Here are the ~10 lines of code for the implementation:
https://github.com/scipy/scipy/issues/11122
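For reference, here is a standalone sketch of Welch's ANOVA from the standard formulas (precision weights n_i/s_i^2 and the Welch denominator degrees of freedom); it is not necessarily line-for-line identical to the R port behind the link:

import numpy as np
from scipy import stats

def welch_anova(*groups):
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = np.array([len(g) for g in groups])
    m = np.array([g.mean() for g in groups])
    v = np.array([g.var(ddof=1) for g in groups])
    w = n / v                                     # precision weights
    mw = (w * m).sum() / w.sum()                  # weighted grand mean
    tmp = ((1 - w / w.sum()) ** 2 / (n - 1)).sum() / (k ** 2 - 1)
    f_stat = ((w * (m - mw) ** 2).sum() / (k - 1)) / (1 + 2 * (k - 2) * tmp)
    df1, df2 = k - 1, 1.0 / (3 * tmp)
    return f_stat, stats.f.sf(f_stat, df1, df2)   # statistic and p-value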
The pingouin package has a Welch's ANOVA implemented. You can find the documentation for it at https://pingouin-stats.org/generated/pingouin.welch_anova.html.
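Usage is a one-liner; the column names below are illustrative:

import pingouin as pg

# dv = dependent variable column, between = grouping column
aov = pg.welch_anova(dv='score', between='group', data=df)
print(aov)   # DataFrame with the F statistic, degrees of freedom and p-value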
I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. From what I've read, RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the following way:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# item is a categorical variable
modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())
the summary results show a z statistic, and the coefficient significance tests appear to be based on this rather than on a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases, each item has 40 or more observations; some items have 21-25 observations. Is there reason to believe that RLM is not effective in a small-sample environment? The line it produces must be the best-fit line after reweighting outliers, but is the z-test effective for samples of this size? That is, is there reason to believe the confidence intervals produced by smf.rlm() do not achieve 95% coverage? I know this can potentially be an issue for t-tests.
Thanks!
I have mostly only a general answer, I never read any small sample Monte Carlo studies for M-estimators.
To 1.
In many models, like M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except for maybe a few special cases. Asymptotic theory provides conditions under which the estimator is asymptotically normally distributed. Given this, statsmodels defaults to using the normal distribution for all models outside of the linear regression model (OLS) and similar, and to the chisquare instead of the F distribution for Wald tests of joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release, and in the current development version of statsmodels, users can choose to use the t and F distributions for the results instead of the normal and chisquare distributions. The defaults stay the same as they are now.
There are other cases where it is not clear whether the t distribution should be used at all, and, if so, with which small-sample degrees of freedom. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then in small samples the tests could have the wrong size and the confidence intervals the wrong coverage. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity- or correlation-robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
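You can check this yourself by recomputing a p-value from a reported z statistic under a t distribution with some plausible residual degrees of freedom; the numbers below are illustrative, not taken from your output:

from scipy import stats

z = 4.0               # a coefficient z statistic from results.summary()
df_resid = 21 - 4     # illustrative: small group size minus parameters

print(2 * stats.norm.sf(z))         # p-value under the normal distribution
print(2 * stats.t.sf(z, df_resid))  # p-value under the t; still very small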
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness to the assumptions about the outliers and on the choice of robust estimator. Another cross-check: does OLS with dropped outliers (observations with small weights in the RLM estimate) give a similar answer?
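A minimal sketch of that cross-check, reusing the setup from the question; the 0.5 weight cutoff is arbitrary:

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
rlm_res = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()

keep = rlm_res.weights > 0.5    # drop heavily down-weighted observations
ols_res = smf.ols(modelspec, data=dataset[keep]).fit()
print(ols_res.summary())        # compare the coefficients with rlm_res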
Finally as general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?
I can't find any reference to functionality for performing the Johansen cointegration test in any Python module dealing with statistics and time series analysis (pandas and statsmodels). Does anybody know if there's some code around that can perform such a test for cointegration among time series?
This is now implemented in Python's statsmodels:
from statsmodels.tsa.vector_ar.vecm import coint_johansen
x = getx() # dataframe of n series for cointegration analysis
jres = coint_johansen(x, det_order=0, k_ar_diff=1)
For a full description of inputs/results, see the documentation.
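To get the actual decision on the cointegration rank, compare the trace statistics with their critical values. The attribute names below are from the current statsmodels result object; counting rejections at the 5% level (the second column) is a common shortcut for the sequential procedure:

import numpy as np

trace_stat = jres.lr1    # trace test statistics, one per rank r = 0, 1, ...
crit_vals = jres.cvt     # critical values; columns are the 90%, 95%, 99% levels
rank = int(np.sum(trace_stat > crit_vals[:, 1]))   # rejections at 5%
print('estimated cointegration rank:', rank)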
statsmodels doesn't have a Johansen cointegration test. And, I have never seen it in any other python package either.
statsmodels has VAR and structural VAR, but no VECM (vector error correction models) yet.
update:
As Wes mentioned, there is now a pull request for Johansen's cointegration test in statsmodels. I translated the Matlab version from LeSage's spatial econometrics toolbox and wrote a set of tests to verify that we get the same results.
It should be available in the next release of statsmodels.
update 2:
The test for cointegration coint_johansen was included in statsmodels 0.9.0 together with the vector error correction models VECM.
(see also 3rd answer)
See http://github.com/statsmodels/statsmodels/pull/453
Check this: https://searchcode.com/codesearch/view/88477497/
It provides a library where you can find the Johansen cointegration test.