I'm not sure if Stack Overflow is the best forum for this, but anyway...
Scipy implements ANOVA using stats.f_oneway, which assumes equal variances. It says in the docs that if the variances are unequal, one could consider the Kruskal-Wallis test instead.
However, what I want is Welch's ANOVA. Scipy has a Welch t-test, but of course this doesn't work if I have more than two groups.
What I find interesting is that scipy used to have stats.oneway which allowed for an equal variance setting. However, it has been deprecated.
Is there an easy way to implement Welch's ANOVA in Python?
I just needed the same thing. I had to port the code from the R package, and I also requested that scipy.stats add this feature. Here are the ~10 lines of code for the implementation:
https://github.com/scipy/scipy/issues/11122
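For reference, here is a minimal sketch that follows the usual Welch (1951) formulas; it is not the exact code from the issue above, and the function name and example groups are just illustrative.

import numpy as np
from scipy import stats

def welch_anova(*groups):
    # Welch's one-way ANOVA for k groups with unequal variances
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                               # per-group weights
    w_sum = w.sum()
    grand_mean = (w * means).sum() / w_sum
    num = (w * (means - grand_mean) ** 2).sum() / (k - 1)
    tmp = (((1 - w / w_sum) ** 2) / (n - 1)).sum() / (k ** 2 - 1)
    f_stat = num / (1 + 2 * (k - 2) * tmp)
    df1, df2 = k - 1, 1 / (3 * tmp)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, p_value

# example with three groups of unequal variance
rng = np.random.default_rng(0)
print(welch_anova(rng.normal(0.0, 1.0, 30),
                  rng.normal(0.5, 2.0, 25),
                  rng.normal(0.3, 0.5, 40)))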
The pingouin package has Welch's ANOVA implemented. You can find the documentation for it at https://pingouin-stats.org/generated/pingouin.welch_anova.html.
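A minimal usage sketch, assuming a long-format DataFrame; the column names 'score' and 'group' are just placeholders:

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'score': np.concatenate([rng.normal(0, 1, 30), rng.normal(0.5, 2, 30)]),
    'group': ['a'] * 30 + ['b'] * 30,
})
# returns a one-row DataFrame with the F statistic, degrees of freedom and p-value
print(pg.welch_anova(data=df, dv='score', between='group'))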
I have two sets of about 15000 points describing various observations of a seemingly random variable according to two possible states, S1 or S2. These two sets are stored in numpy arrays.
Let's assume the distributions are not "classical" and overlap.
Now I have a new observation, and I would like to know if it is more likely to be related to state S1 or S2.
I know how to use numpy.quantile to get any quantile value for my two distributions, but now I would like to use it in reverse, so that I can directly get the corresponding quantile of a given value in each distribution.
I see how I could do this by brute force or a bisection search, but I feel there must be a more efficient, "pythonic" way.
Is there a numpy function to do so?
Thanks
OK, silly question, obvious answer.
There is indeed something available in Python, not in numpy but in scipy:
from scipy import stats

# percentage of points in each set that fall at or below the new observation
print(stats.percentileofscore(array_distribution_S1, observation))
print(stats.percentileofscore(array_distribution_S2, observation))
This makes it possible to tell which distribution the new observation most likely belongs to.
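If you do want to stay in plain numpy, the empirical CDF (the "reverse quantile") can also be read off a sorted copy of each array with searchsorted; this is only a sketch reusing the variable names from the question:

import numpy as np

def empirical_cdf(sorted_values, x):
    # fraction of observations <= x; sorted_values must already be sorted
    return np.searchsorted(sorted_values, x, side='right') / sorted_values.size

s1_sorted = np.sort(array_distribution_S1)
s2_sorted = np.sort(array_distribution_S2)
print(empirical_cdf(s1_sorted, observation))
print(empirical_cdf(s2_sorted, observation))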
When I estimate the Breusch-Pagan test step by step, the result doesn't match the BP value from statsmodels:
Here is a link to the notebook (in Spanish).
I have tested it in Gretl and my 'manual' estimation is correct, but I want to know why there is a difference.
Check the Koenker version in Gretl.
I don't find a Gretl reference right now, but according to the unit tests, the version in statsmodels is equal to the Koenker version of the Breusch-Pagan test.
I don't see an option for the original Breusch-Pagan test, but that one is not robust to non-normality (assumption on 4th moment, IIRC).
In general, many of the Lagrange multiplier specification tests have several versions that are asymptotically equivalent but differ in small samples; for example, statsmodels reports both the LM test and the F-test version. Additionally, they differ in which extra assumptions are required for their validity.
For example, statsmodels does not yet have a heteroscedasticity test that is robust to autocorrelation. There are again several versions, and I have not seen them available as a standard option in any package yet.
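For reference, a minimal sketch on synthetic data showing both the LM and the F-test versions reported by statsmodels' het_breuschpagan (the data-generating choices below are arbitrary):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
X = sm.add_constant(x)
# heteroscedastic errors: the error variance grows with the first regressor
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=np.exp(x[:, 0]))

res = sm.OLS(y, X).fit()
lm, lm_pval, fval, f_pval = het_breuschpagan(res.resid, res.model.exog)
print('LM version:', lm, lm_pval)
print('F version: ', fval, f_pval)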
I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a PPCA algorithm, which gives the same result as PCA, but in some implementations can deal with missing data more robustly.
I have found two libraries:
Package PPCA on PyPI, which is called PCA-magic on github
Package PyPPCA, having the same name on PyPI and github
Since the packages see little maintenance, you might want to implement PPCA yourself instead. Both packages build on the theory presented in the widely cited (and well written!) paper by Tipping and Bishop (1999). It is available on Tipping's home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but they have not chosen to implement it in such a way that it handles missing values.
EDIT: both of the libraries above had issues, so I could not use them directly myself. I forked PyPPCA and fixed the bugs. It is available on GitHub.
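For orientation, here is a minimal sketch of the closed-form complete-data PPCA solution from Tipping and Bishop (1999); the missing-data case needs the EM algorithm from the same paper (or one of the packages above), so treat this only as a starting point:

import numpy as np

def ppca_closed_form(X, q):
    # X: (n_samples, n_features) with no missing values, q: latent dimension
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()                   # noise = mean of discarded eigenvalues
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, mu, sigma2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
W, mu, sigma2 = ppca_closed_form(X, q=2)
print(W.shape, sigma2)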
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this class you can automatically replace the missing values with the mean, median, or most frequent value. Which of these options is best is hard to tell; it depends on many factors, such as what the data looks like.
By the way, you can also use PCA using the same library with:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many other statistical functions and machine learning techniques.
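A minimal sketch of the impute-then-PCA pipeline described above; the tiny toy matrix and the choice of mean imputation are just for illustration:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 1.0],
              [4.0, 5.0, 6.0],
              [np.nan, 4.0, 2.0]])

imputer = SimpleImputer(strategy='mean')   # or 'median' / 'most_frequent'
X_filled = imputer.fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_filled)
print(pca.explained_variance_ratio_)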
Following from this question, is there a way to use any method other than MLE (maximum-likelihood estimation) for fitting a continuous distribution in scipy? I think that my data may be resulting in the MLE method diverging, so I want to try using the method of moments instead, but I can't find out how to do it in scipy. Specifically, I'm expecting to find something like
scipy.stats.genextreme.fit(data, method=method_of_moments)
Does anyone know if this is possible, and if so how to do it?
A few things to mention:
1) scipy does not have support for GMM. There is some support for GMM via statsmodels (http://statsmodels.sourceforge.net/stable/gmm.html), you can also access many R routines via Rpy2 (and R is bound to have every flavour of GMM ever invented): http://rpy.sourceforge.net/rpy2/doc-2.1/html/index.html
2) Regarding stability of convergence, if this is the issue, then probably your problem is not with the objective being maximised (e.g. likelihood, as opposed to a generalised moment) but with the optimiser. Gradient optimisers can be really fussy (or rather, the problems we give them are not really suited for gradient optimisation, leading to poor convergence).
If statsmodels and Rpy do not give you the routine you need, it is perhaps a good idea to write out your moment computation verbosely and see how you can optimise it yourself - perhaps a custom-made little tool would work well for you?
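To illustrate the "custom-made little tool" idea, here is a minimal sketch that matches the first three sample moments of the data to the theoretical moments of scipy.stats.genextreme with a generic optimiser; the choice of moments, the penalty, and the starting values are all assumptions on my part:

import numpy as np
from scipy import stats, optimize

def mom_fit_gev(data):
    sample = np.array([np.mean(data), np.var(data), stats.skew(data)])

    def moment_gap(params):
        c, loc, scale = params
        m, v, s = stats.genextreme.stats(c, loc=loc, scale=scale, moments='mvs')
        theo = np.array([m, v, s], dtype=float)
        if scale <= 0 or not np.all(np.isfinite(theo)):
            return 1e12                      # penalise invalid parameter regions
        return np.sum((theo - sample) ** 2)

    x0 = [0.1, np.mean(data), np.std(data)]  # rough starting guess
    res = optimize.minimize(moment_gap, x0, method='Nelder-Mead')
    return res.x                             # (c, loc, scale)

data = stats.genextreme.rvs(0.2, loc=3.0, scale=1.5, size=2000, random_state=0)
print(mom_fit_gev(data))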
Some coworkers who have been struggling with Stata 11 are asking for my help to try to automate their laborious work. They mainly use 3 commands in Stata:
tsset (sets a time series analysis)
as in: tsset year_column, yearly
varsoc (Obtain lag-order selection statistics for VARs)
as in: varsoc column_a column_b
vec (vector error-correction model)
as in: vec column_a column_b, trend(con) lags(1) noetable
Does anyone know of a scientific library that I can use from Python for this same functionality?
I believe both scikits.timeseries and econpy / pytrix implement vector autoregression methods, but I haven't put either through their paces.
scikits.timeseries is mainly for data handling; it has only limited statistical and econometric analysis and no vector autoregression. pytrix has some econometrics functions, but also no VAR.
(At least last time I looked.)
scikits.statsmodels and pandas both have VAR; pandas also does the data handling for time series. I haven't seen any vector error correction models in Python yet, but scikits.statsmodels is getting close.
http://groups.google.ca/group/pystatsmodels?hl=en&pli=1
Check out scikits.statsmodels.tsa.api.VAR (you may need to get the latest development version--use Google) and check out the documentation for it:
http://statsmodels.sourceforge.net/devel/vector_ar.html#var
These models integrate with pandas as well. I'll be working in the coming months to improve the integration of pandas with the rest of statsmodels.
Vector Error Correction Models have not been implemented yet but are on the TODO list!
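For a rough analogue of tsset plus varsoc with the modern statsmodels import path, something like the sketch below should work; the yearly index and the column names are only illustrative:

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
idx = pd.period_range('1980', periods=40, freq='Y')        # roughly "tsset year, yearly"
data = pd.DataFrame({'column_a': rng.normal(size=40).cumsum(),
                     'column_b': rng.normal(size=40).cumsum()}, index=idx)

model = VAR(data)
print(model.select_order(maxlags=4).summary())             # roughly "varsoc column_a column_b"
results = model.fit(maxlags=4, ic='aic')
print(results.summary())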
Use Rpy2 and call the R vars package.
I have absolutely no clue what any of those do, but have a look at NumPy and SciPy. Maybe also Sage or SymPy.