I'm working in the Python stack (scipy/numpy/pandas) and I need to do a linear fit on a list of (x, y) points that have added noise from some distribution conditioned on x and other global properties. Are there any specific methods available to measure and visualize the level of heteroscedasticity in my data?
Some of the tests listed on the Wikipedia page for Heteroscedasticity can be found in the scipy.stats package. Under the circumstances, the statsmodels package (which is built on top of scipy) may be a better bet. There is an entire module dedicated to Heteroscedasticity tests.
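As a quick illustration, here is a minimal sketch using statsmodels' Breusch-Pagan test on simulated data whose noise scale grows with x (the data, names, and noise model below are placeholders, not part of the original question):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated placeholder data: noise standard deviation grows with x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5 * (1 + x))

X = sm.add_constant(x)                    # design matrix with intercept
fit = sm.OLS(y, X).fit()                  # ordinary least squares fit

# Breusch-Pagan test: regresses squared residuals on the exogenous variables;
# a small p-value is evidence of heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(lm_pvalue)

For visualization, a simple plot of the residuals fit.resid against x (looking for a fan shape) is the usual quick diagnostic.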
Related
Looking for a package that implements the multivariate version of statsmodels.distributions.ECDF
If one doesn't exist, I will implement it for inclusion in statsmodels (if accepted), but I don't want to reinvent the wheel.
I see this: https://gitlab.com/stochastic-control/StOpt
It has pybind11 bindings, but I'm not sure whether a wheel is already available on PyPI.
The upcoming release of statsmodels 0.13 includes basic support for copulas.
Empirical and non-parametric copulas and multivariate distributions have mainly experimental code that is currently not public and not sufficiently tested and verified.
For example, _ecdf_mv (*) is currently a multivariate rankdata function. It needs to be divided by the number of observations, or converted to plotting positions, to turn it into an ECDF. Because copulas require continuous uniform margins, ties are broken randomly or arbitrarily.
(*)
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/distributions/tools.py#L424
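To make the idea concrete, here is a small sketch of turning column-wise ranks into plotting positions in (0, 1); this is only an illustration of the description above, not the statsmodels implementation:

import numpy as np
from scipy.stats import rankdata

def mv_plotting_positions(data):
    # data: (nobs, k) array; ranks are computed per margin and scaled by
    # (nobs + 1) to give values strictly inside (0, 1).
    # method='ordinal' breaks ties arbitrarily, which keeps the margins
    # continuous as copulas require.
    data = np.asarray(data)
    nobs = data.shape[0]
    ranks = np.column_stack([rankdata(col, method='ordinal') for col in data.T])
    return ranks / (nobs + 1.0)

u = mv_plotting_positions(np.random.default_rng(0).normal(size=(100, 2)))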
As indicated in the title, I'm trying to run a regression in Python where the standard errors are clustered as well as robust to heteroskedasticity and autocorrelation (HAC). I'm working within statsmodels (sm), but am obviously open to using other libraries (e.g. linearmodels).
To cluster e.g. by id, the code would be
sm.OLS.from_formula(formula='y ~ x', data=df).fit(cov_type='cluster', cov_kwds={'groups': df['id']}, use_t=True)
For HAC standard errors, the code would be
sm.OLS.from_formula(formula='y ~ x', data=df).fit(cov_type='HAC', cov_kwds={'maxlags': max_lags}, use_t=True)
Given that cov_type can't be both cluster and HAC at the same time, it doesn't seem feasible to do both in statsmodels. Is that right, and/or is there any other way to have both?
There are two panel HAC cov_types, hac-groupsum and hac-panel. I only know their use for panel data, but they should also work with clustered data. As far as I remember, there was some literature suggesting that they are not very good with highly imbalanced data (e.g. comparing population data of US states, which differ widely in size).
https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.get_robustcov_results.html
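As a sketch of what these look like in use (df, 'id', 'time' and max_lags are placeholders; the keyword names follow the get_robustcov_results documentation linked above, and as far as I recall hac-panel expects the observations to be stacked/sorted by group):

res_panel = sm.OLS.from_formula('y ~ x', data=df).fit(
    cov_type='hac-panel',
    cov_kwds={'groups': df['id'], 'maxlags': max_lags},
    use_t=True)

# Driscoll-Kraay style (hac-groupsum) uses a time index instead of a group index
res_dk = sm.OLS.from_formula('y ~ x', data=df).fit(
    cov_type='hac-groupsum',
    cov_kwds={'time': df['time'], 'maxlags': max_lags},
    use_t=True)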
The main reference for implementing that was the article by Petersen, e.g.
https://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/standarderror.html
Some example comparisons to Petersen's results are in the unit tests.
Statsmodels also has cluster-robust standard errors when we have two-way clusters.
The stochastic behavior of these covariance matrices depends on whether the number of clusters, the number of time periods, or both become large in large samples.
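A hedged sketch of the two-way case, again with placeholder column names 'firm' and 'year'; the group labels are passed as a two-column array (integer codes, e.g. from pd.factorize, are the safe choice):

import numpy as np
import pandas as pd

groups = np.column_stack([pd.factorize(df['firm'])[0],
                          pd.factorize(df['year'])[0]])
res_2way = sm.OLS.from_formula('y ~ x', data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': groups},
    use_t=True)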
I'm not sure if stackoverflow is the best forum for this, but anyway...
Scipy implements ANOVA using stats.f_oneway, which assumes equal variances. It says in the docs that if the variances are unequal, one could consider the Kruskal-Wallis test instead.
However, what I want is Welch's ANOVA. Scipy has a Welch t-test, but of course this doesn't work if I have more than two groups.
What I find interesting is that scipy used to have stats.oneway which allowed for an equal variance setting. However, it has been deprecated.
Is there an easy way to implement Welch's ANOVA in Python?
I just needed the same thing. I had to port the code from an R package, and I have also requested that scipy.stats add this feature. Here are the ~10 lines of code for the implementation:
https://github.com/scipy/scipy/issues/11122
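For reference, here is a self-contained sketch that follows the standard Welch (1951) formulas; it is not the exact code from the issue or the R package, just a reimplementation of the same test:

import numpy as np
from scipy import stats

def welch_anova(*groups):
    # Welch's one-way ANOVA for groups with unequal variances;
    # returns the F statistic and the p-value.
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                          # precision weights
    grand_mean = np.sum(w * means) / np.sum(w)

    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df2 = (k ** 2 - 1) / (3 * lam)             # denominator degrees of freedom
    p_value = stats.f.sf(f_stat, k - 1, df2)
    return f_stat, p_value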
The pingouin package has a Welch's ANOVA implemented. You can find the documentation for it at https://pingouin-stats.org/generated/pingouin.welch_anova.html.
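Usage is a one-liner; in this sketch df is a long-format DataFrame and 'score'/'group' are placeholder column names:

import pingouin as pg

aov = pg.welch_anova(data=df, dv='score', between='group')
print(aov)   # one-row DataFrame with F, degrees of freedom and the p-value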
I am new to xarray and I need this functionality often to analyze the output of a general circulation model. I can do this with numpy, but I wonder whether shortcut functions for weighted means and integrals along coordinates are already implemented in xarray. If not, is there a plan to include them in a future release, or should they belong to packages built on top of xarray?
For simple use cases, weighted averaging and integrals have been discussed for xarray, but not implemented yet. Help would be appreciated! (Please reach out on GitHub to discuss details.)
For your specific needs, xgcm might be worth a look. It includes utilities for doing these sorts of transformations on native GCM grids.
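In the meantime, both operations can be written with plain xarray reductions; this is just a sketch with a made-up latitude grid and cosine-latitude weights standing in for whatever metric your model output provides:

import numpy as np
import xarray as xr
from scipy.integrate import trapezoid

lat = xr.DataArray(np.linspace(-89.5, 89.5, 180), dims='lat', name='lat')
da = xr.DataArray(np.random.rand(180), coords={'lat': lat}, dims='lat')
weights = np.cos(np.deg2rad(lat))

# Weighted mean along 'lat'
weighted_mean = (da * weights).sum('lat') / weights.sum('lat')

# Integral along the coordinate (trapezoidal rule)
integral = trapezoid(da.values, x=lat.values)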
I have a probability density function of an unknown distribution which is given as a set of tuples (x, f(x)), where x=numpy.arange(0,1,size) and f(x) is the corresponding probability.
What is the best way to identify the corresponding distribution? So far my idea is to draw a large amount of samples based on the pdf (by writing the code myself from scratch) and then use the obtained data to fit all of the distributions implemented in scipy.stats, then take the best fit.
Is there a better way to solve this problem? For example, is there some kind of utility in scipy.stats that I'm missing that would help me solve this problem?
In a fundamental sense, it's not really possible to identify a distribution based on empirical samples - see here for a discussion.
It's possible to do something more limited, which is to reject/accept the hypothesis that the data comes from one of a finite set of (parametric) distributions, based on a somewhat arbitrary criterion.
Given the finite set of distributions, for each distribution, you could perhaps realistically do the following:
Fit the distribution's parameters to the data. For example, scipy.stats.beta.fit will find the best-fitting parameters of the Beta distribution (all continuous scipy distributions have this method).
Reject/accept the hypothesis that the data was generated by this distribution. There is more than one way of doing this. A particularly simple way is to use the distribution's rvs() method to generate another sample, then use ks_2samp to run a two-sample Kolmogorov-Smirnov test (a short code sketch of both steps follows below).
Note that some specific distributions might have better, ad-hoc algorithms for testing whether a member of the distribution's family generated the data. The Normal distribution, as usual, has many in particular (see normality tests).
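A minimal sketch of both steps, with a made-up Beta sample standing in for the data drawn from your pdf:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=1000)        # placeholder for samples drawn from your pdf

# Step 1: fit the candidate distribution's parameters to the data
params = stats.beta.fit(sample)

# Step 2: draw a synthetic sample from the fitted distribution and compare it
# to the data with a two-sample Kolmogorov-Smirnov test
synthetic = stats.beta.rvs(*params, size=len(sample), random_state=rng)
ks_stat, p_value = stats.ks_2samp(sample, synthetic)
print(ks_stat, p_value)                        # a large p-value means no evidence against the fit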