I have a dataset which I need to fit to a GEV distribution. The data is one-dimensional and is stored in a numpy array. Currently, I am using scipy.stats.genextreme.fit(data), which works OK but gives totally inaccurate results (obvious by plotting the pdf). After some investigation it turns out that my data does not fit well in log space, which scipy uses in its MLE fitting algorithm, so I need to try something like GMM instead, which is only available in statsmodels. The problem is that I can't find anything which looks like scipy's fit function. All the examples I've found seem to deal with far more complicated data than I have. Also, statsmodels requires endog and exog parameters for everything, and I have no idea what these are.
This should be really simple, so I'm sure I'm missing something obvious. Has anyone used statsmodels in this way, and if so, any pointers as to how to do it?
I'm guessing you want a Gaussian Mixture Model (GMM) and not the Generalized Method of Moments (GMM). The former is available in scikit-learn. The latter has code in statsmodels, but it's a work in progress.
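If a Gaussian mixture really is what you want, a minimal sketch with scikit-learn's GaussianMixture might look like the following (the sample data here is made up; scikit-learn expects the 1-D array reshaped into a column):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical 1-D sample; replace with your own array.
    data = np.random.default_rng(0).normal(size=500)

    # scikit-learn expects a 2-D array of shape (n_samples, n_features).
    gmm = GaussianMixture(n_components=2).fit(data.reshape(-1, 1))

    # Evaluate the fitted density on a grid; score_samples returns log-density.
    grid = np.linspace(data.min(), data.max(), 200).reshape(-1, 1)
    pdf = np.exp(gmm.score_samples(grid))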
EDIT Actually it's not clear to me that you want GMM. Maybe you just want a kernel density estimator (KDE). This is available in statsmodels, with an example in the documentation.
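For the KDE route, a minimal sketch with statsmodels' KDEUnivariate (again assuming a 1-D numpy array; the sample below is made up) could be:

    import numpy as np
    import statsmodels.api as sm

    data = np.random.default_rng(0).normal(size=500)  # hypothetical 1-D sample

    kde = sm.nonparametric.KDEUnivariate(data)
    kde.fit()  # Gaussian kernel by default

    # kde.support is the evaluation grid, kde.density the estimated pdf values.
    grid, pdf = kde.support, kde.density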
Hmm, if you do want to use (Generalized) Method of Moments to fit some kind of probability-weighted GEV, then you need to specify the moment conditions, but I don't have a ready example of how to specify them for (G)MM in statsmodels. You might be better off asking on the mailing list.
I've been running some large logistic regression models in SAS, which take 4+ hours to converge. Recently, however, I acquired access to a Hadoop cluster and can use Python to fit the same models much faster (more like 10-15 minutes).
Problematically, I have some complete/quasi-complete separation of data points in my data, which results in a failure to converge; I was using the FIRTH option in SAS to produce robust parameter estimates despite that, but there seems to be no equivalent option in Python, either in sklearn or statsmodels (I'm mostly using the latter).
Is there another way to get around this problem in Python?
AFAIK, there is no Firth penalization available in Python. Statsmodels has an open issue but nobody is working on it at the moment.
As an alternative, it would be possible to use a different kind of penalization, e.g. the ones available in sklearn or maybe statsmodels.
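For example, a minimal sketch with scikit-learn's ridge-penalized logistic regression (this is not the Firth correction, just a readily available penalty that keeps the coefficients finite under separation; the toy data here is made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical, perfectly separated toy data: x < 0 -> class 0, x >= 0 -> class 1.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))
    y = (X[:, 0] >= 0).astype(int)

    # Unpenalized MLE diverges on separated data; the L2 penalty (strength
    # controlled by C, smaller = stronger) keeps the estimates finite.
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
    print(model.coef_, model.intercept_)

In statsmodels, Logit(...).fit_regularized() offers an L1-penalized fit, which again is not Firth but may be enough to stabilize the estimates.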
The other option is to change the observed response variable. Firth can be implemented by augmenting the dataset. However, I don't know of any recipe or prototype for this in Python.
https://github.com/statsmodels/statsmodels/issues/3561
Statsmodels has ongoing work on penalization, but currently the emphasis is on feature/variable selection (elastic net, SCAD) and quadratic penalization for generalized additive models (GAM), especially for splines.
Firth uses data dependent penalization which does not fit the generic penalization framework where the penalization structure is a data independent "prior".
Conditional likelihood is another way to work around perfect separation. This is in a Statsmodels PR that is basically ready to use:
https://github.com/statsmodels/statsmodels/pull/5304
I want to build a model that describes a curve that fits the data shown in the scatterplot. I thought it would be straightforward using sklearn. But the choice and application of the different methods gets rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As to models, I would suggest natural cubic splines for this data. They are simple and involve cutting the range of the data into windows, fitting each window with a cubic polynomial (a special case of which is a line).
You might also consider smoothing splines and local regression.
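As a rough illustration, a smoothing spline can be fit in a few lines with scipy. The data below is made up to mimic a "linear then curving" pattern, and the smoothing parameter s is something you would tune:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # Hypothetical data: roughly linear, then bending upward near the end.
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 200))
    y = 0.5 * x + 0.05 * np.maximum(x - 8, 0) ** 3 + rng.normal(scale=0.2, size=x.size)

    # Smoothing spline; larger s gives a smoother (less wiggly) curve.
    spline = UnivariateSpline(x, y, s=len(x) * 0.05)

    grid = np.linspace(x.min(), x.max(), 300)
    fitted = spline(grid)  # smooth curve to plot over the scatter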
For information on these, see the free online textbook, An Introduction to Statistical Learning.
I have two questions.
1) I have an array like [1,2,3,4,5,5,3,1], and I don't know which distribution it follows. Can I use scipy.stats to calculate the pmf and cdf automatically?
2) Is scipy.stats just a library of distributions? If I want to analyze data, do I have to find a distribution or define one myself, and then calculate things like the pmf manually? Am I understanding correctly?
Well, scipy.stats is not a library for telling you the distribution of data and calculating the pmf and cdf automatically. It's a library for easing your tasks while estimating the probability distribution. You have to explore your data and find which distribution fits your data with the least error, which is the ultimate task, and scipy.stats helps you achieve this... you don't have to reinvent the wheel, as they say, by writing all the mathematical functions again and again.
Well, to answer your question in the comment: let's suppose you have a dataset. To get an insight and a starting point for your analysis, plot the data in a histogram (which also shows the distribution of the data); then you can plot different distributions on the same plot using scipy.stats to get a feel for the best fit.
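For instance, a minimal sketch (with a made-up sample) of overlaying a few candidate fits on a normalized histogram:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Hypothetical sample; replace with your own data.
    data = stats.gamma.rvs(a=2.0, scale=1.5, size=1000, random_state=0)

    # Histogram normalized to integrate to 1, so it is comparable to a pdf.
    plt.hist(data, bins=40, density=True, alpha=0.4, label="data")

    # Overlay a few candidate distributions fitted by MLE to eyeball the best fit.
    grid = np.linspace(data.min(), data.max(), 200)
    for dist in (stats.norm, stats.gamma, stats.lognorm):
        params = dist.fit(data)
        plt.plot(grid, dist.pdf(grid, *params), label=dist.name)

    plt.legend()
    plt.show()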
Check out this answer, it might help ya...
https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best
I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a PPCA algorithm, which gives the same result as PCA, but in some implementations can deal with missing data more robustly.
I have found two libraries. You have
Package PPCA on PyPI, which is called PCA-magic on github
Package PyPPCA, which has the same name on PyPI and github
Since the packages receive little maintenance, you might want to implement it yourself instead. The code above builds on the theory presented in the well-cited (and well written!) paper by Tipping and Bishop 1999. It is available on Tipping's home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on Tipping and Bishop (1999), but they have not chosen to implement it in such a way that it handles missing values.
EDIT: both of the libraries above had issues, so I could not use them directly myself. I forked PyPPCA and fixed the bugs. It is available on github.
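If you end up rolling your own, here is a minimal sketch of an EM-style iterative PCA that alternates between reconstructing the NaN entries from a low-rank SVD fit and refitting the components. It is a simplified stand-in for the full probabilistic model in Tipping and Bishop (1999), not a substitute for reading the paper:

    import numpy as np

    def iterative_pca_impute(X, n_components=2, n_iter=100, tol=1e-6):
        """EM-style iterative PCA for data with NaNs: a simplified stand-in
        for full PPCA (Tipping & Bishop, 1999), not the probabilistic model."""
        X = np.asarray(X, dtype=float)
        missing = np.isnan(X)

        # Initial guess: fill missing entries with the column means.
        X_filled = np.where(missing, np.nanmean(X, axis=0), X)

        for _ in range(n_iter):
            mean = X_filled.mean(axis=0)
            U, s, Vt = np.linalg.svd(X_filled - mean, full_matrices=False)
            recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean

            # Only the missing entries are updated from the reconstruction.
            new_filled = np.where(missing, recon, X)
            if np.linalg.norm(new_filled - X_filled) < tol:
                break
            X_filled = new_filled

        components = Vt[:n_components]              # principal axes
        scores = (X_filled - mean) @ components.T   # projected data
        return X_filled, components, scores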
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this class you can automatically replace the missing values with the mean, median, or most frequent value. Which of these options is best is hard to tell; it depends on many factors, such as what the data looks like.
By the way, you can also run PCA using the same library with:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many other statistical functions and machine learning techniques.
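A minimal sketch chaining the two (with made-up data containing NaNs):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.decomposition import PCA

    # Hypothetical data with missing entries marked as NaN.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    X[rng.random(X.shape) < 0.1] = np.nan

    # Impute (here with column means), then project onto the first two components.
    pipe = make_pipeline(SimpleImputer(strategy="mean"), PCA(n_components=2))
    scores = pipe.fit_transform(X)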
Following from this question, is there a way to use any method other than MLE (maximum-likelihood estimation) for fitting a continuous distribution in scipy? I think that my data may be resulting in the MLE method diverging, so I want to try using the method of moments instead, but I can't find out how to do it in scipy. Specifically, I'm expecting to find something like
scipy.stats.genextreme.fit(data, method=method_of_moments)
Does anyone know if this is possible, and if so how to do it?
A few things to mention:
1) scipy does not have support for GMM. There is some support for GMM via statsmodels (http://statsmodels.sourceforge.net/stable/gmm.html); you can also access many R routines via Rpy2 (and R is bound to have every flavour of GMM ever invented): http://rpy.sourceforge.net/rpy2/doc-2.1/html/index.html
2) Regarding stability of convergence, if this is the issue, then probably your problem is not with the objective being maximised (e.g. likelihood, as opposed to a generalised moment) but with the optimiser. Gradient optimisers can be really fussy (or rather, the problems we give them are not really suited for gradient optimisation, leading to poor convergence).
If statsmodels and Rpy do not give you the routine you need, it is perhaps a good idea to write out your moment computation explicitly and see how you can maximise it yourself - perhaps a custom-made little tool would work well for you?
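For what it's worth, here is one way such a hand-rolled tool could look: a sketch that matches the sample mean, variance and skewness to the theoretical GEV moments (computed via scipy.stats.genextreme.stats) with a derivative-free optimizer. The data and starting values below are made up. (Recent scipy versions also accept method='MM' directly in fit, if upgrading is an option.)

    import numpy as np
    from scipy import stats, optimize

    # Hypothetical data; replace with your own 1-D array.
    data = stats.genextreme.rvs(c=-0.2, loc=10.0, scale=2.0, size=500, random_state=0)

    # Sample moments to match: mean, variance, skewness.
    sample_moments = np.array([data.mean(), data.var(), stats.skew(data)])

    def moment_gap(params):
        """Squared distance between theoretical GEV moments and sample moments."""
        c, loc, scale = params
        if scale <= 0:
            return np.inf
        theo = np.array(stats.genextreme.stats(c, loc=loc, scale=scale, moments="mvs"))
        if not np.all(np.isfinite(theo)):  # higher moments do not exist for some shapes
            return np.inf
        return np.sum((theo - sample_moments) ** 2)

    # Start from the MLE fit (or any rough guess) and refine by matching moments.
    x0 = stats.genextreme.fit(data)
    res = optimize.minimize(moment_gap, x0, method="Nelder-Mead")
    c_mm, loc_mm, scale_mm = res.x
    print("method-of-moments estimates:", c_mm, loc_mm, scale_mm)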