Following on from this question, is there a way to use a method other than MLE (maximum-likelihood estimation) to fit a continuous distribution in scipy? I think my data may be causing the MLE fit to diverge, so I want to try the method of moments instead, but I can't find out how to do it in scipy. Specifically, I'm expecting to find something like
scipy.stats.genextreme.fit(data, method=method_of_moments)
Does anyone know if this is possible, and if so how to do it?
A few things to mention:
1) scipy does not have support for GMM (the generalized method of moments). There is some support for GMM in statsmodels (http://statsmodels.sourceforge.net/stable/gmm.html), and you can also access many R routines via Rpy2 (R is bound to have every flavour of GMM ever invented): http://rpy.sourceforge.net/rpy2/doc-2.1/html/index.html
2) Regarding stability of convergence: if that is the issue, then the problem is probably not with the objective being maximised (e.g. the likelihood, as opposed to a generalised moment) but with the optimiser. Gradient optimisers can be really fussy (or rather, the problems we give them are not really suited to gradient optimisation, leading to poor convergence).
If statsmodels and Rpy2 do not give you the routine you need, it is perhaps a good idea to write your moment computations out explicitly and see how you can solve them yourself - perhaps a custom-made little tool would work well for you?
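For illustration, a hand-rolled method-of-moments fit for a GEV might look something like this (a minimal sketch assuming scipy; the moment equations are solved numerically with scipy.optimize.root, and note that the skewness of genextreme only exists for part of the shape parameter's range, so the starting point matters):

import numpy as np
from scipy import stats, optimize

def mom_fit_gev(data):
    # Match the first three sample moments to the theoretical ones.
    sample = np.array([np.mean(data), np.var(data), stats.skew(data)])

    def moment_gap(params):
        c, loc, scale = params
        mean, var, skew = stats.genextreme.stats(c, loc=loc, scale=scale,
                                                 moments='mvs')
        return np.array([mean, var, skew], dtype=float) - sample

    # Crude starting point: location/scale from the sample moments.
    x0 = [0.1, np.mean(data), np.std(data)]
    sol = optimize.root(moment_gap, x0)
    return sol.x  # (shape c, loc, scale)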
Related
Is there anywhere a brief description of each of the algorithms available via the method parameter of the minimize function in the lmfit package? Neither there nor in the SciPy documentation is there an explanation of the details of each algorithm. Right now I know I can choose between them, but I don't know which one to choose...
My current problem
I am using lmfit in Python to minimize a function. I want to minimize the function within a finite and predefined range where the function has the following characteristics:
It is almost zero everywhere, to the point of being numerically indistinguishable from zero almost everywhere.
It has a very, very sharp peak at some point.
The peak can be anywhere within the region.
This makes many minimization algorithms fail. Right now I am using a combination of the brute-force method (method="brute") to find a point close to the peak, then feeding that value to the Nelder-Mead algorithm (method="nelder") to perform the final minimization; a sketch of this two-stage setup is below. It works about 50% of the time, and the other 50% of the time it fails to find the minimum. I wonder if there are better algorithms for cases like this one...
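In case it helps, this is roughly what my current two-stage setup looks like (a minimal sketch; the needle-like objective and its parameters are stand-ins for my real function):

import numpy as np
import lmfit

def objective(params):
    # Hypothetical needle-like function: ~0 everywhere except a very
    # narrow dip at x = 3.21 (placeholder for the real objective).
    x = params['x'].value
    return -np.exp(-((x - 3.21) / 1e-3) ** 2)

params = lmfit.Parameters()
params.add('x', value=0.0, min=-10.0, max=10.0, brute_step=1e-3)

# Stage 1: coarse global grid scan to land near the dip.
coarse = lmfit.minimize(objective, params, method='brute')
# Stage 2: local simplex refinement starting from the brute candidate.
fine = lmfit.minimize(objective, coarse.params, method='nelder')
print(fine.params['x'].value)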
I think it is a fair point that docs for lmfit (such as https://lmfit.github.io/lmfit-py/fitting.html#fit-methods-table) and scipy.optimize (such as https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#optimization-scipy-optimize) do not give detailed mathematical descriptions of the algorithms.
Then again, most of the docs for scipy, numpy, and related libraries describe how to use the methods, but do not describe in much mathematical detail how the algorithms work.
In fairness, the different optimization algorithms share many features and the differences between them can get pretty technical. All of these methods try to minimize some metric (often called "cost" or "residual") by changing the values of parameters for the supplied function.
It would take a textbook (or at least a Wikipedia page) to establish the concepts and mathematical terms used by these methods, and then a paper (or at least a Wikipedia page) to describe how each method differs from the others. So I think the basic answer is to look up the different methods.
I'm trying to perform a GP regression with linear operators as described in, for example, this paper by Särkkä: https://users.aalto.fi/~ssarkka/pub/spde.pdf In this example we can see from equation (8) that I need a different kernel function for each of the four covariance blocks (of training and test data) in the complete covariance matrix.
This is definitely possible and valid, but I would like to include this in a kernel definition in (preferably) GPflow, or GPyTorch, GPy, or the like.
However, in the documentation for kernel design in GPflow, the only possibility is to define a covariance function that acts on all covariance blocks. In principle, the method above should be straightforward to add myself (the kernel function expressions can be derived analytically), but I don't see any way of incorporating the 'heterogeneous' kernel functions into the regression or kernel classes. I tried to consult other packages such as GPyTorch and GPy, but again, the kernel design does not seem to allow this.
Maybe I'm missing something here, or maybe I'm not familiar enough with the underlying implementation to assess this, but if someone has done this before or sees the (what should be reasonably straightforward?) implementation possibility, I would be happy to find out.
Thank you very much in advance for your answer!
Kind regards
This should be reasonably straightforward, though it requires building a custom kernel. Basically, you need a kernel that knows, for each input, what the linear operator for the corresponding output is (whether this is a function observation/identity operator, an integral observation, a derivative observation, etc.). You can achieve this by including an extra column in your input matrix X, similar to how it's done for the gpflow.kernels.Coregion kernel (see this notebook). You would then need to define a new kernel with K and K_diag methods that, for each linear operator type, find the corresponding rows in the input matrix and pass them to the appropriate covariance function (using tf.dynamic_partition and tf.dynamic_stitch; this is used in a very similar way in GPflow's SwitchedLikelihood class).
The full implementation would probably take half a day or so, which is beyond what I can do here, but I hope this is a useful starting pointer, and you're very welcome to join the GPflow slack (invite link in the GPflow README) and discuss it in more detail there!
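To show the shape of the idea without the full implementation, here is a very rough skeleton (a sketch only, using plain masking instead of the more efficient dynamic_partition/dynamic_stitch machinery; k_ff, k_fg, k_gg are hypothetical callables with a kernel-like signature, and note that the cross-covariance k_fg is generally not a valid standalone kernel):

import tensorflow as tf
import gpflow

class OperatorKernel(gpflow.kernels.Kernel):
    """Sketch: the last column of X tags each row's operator type (0 or 1)."""
    def __init__(self, k_ff, k_fg, k_gg):
        super().__init__()
        self.k_ff, self.k_fg, self.k_gg = k_ff, k_fg, k_gg

    def K(self, X, X2=None):
        if X2 is None:
            X2 = X
        x, tag = X[:, :-1], X[:, -1]
        x2, tag2 = X2[:, :-1], X2[:, -1]
        # 0/1 masks selecting the rows/columns of each covariance block.
        m = tf.cast(tag[:, None] == 0.0, X.dtype)
        m2 = tf.cast(tag2[None, :] == 0.0, X.dtype)
        return (m * m2 * self.k_ff(x, x2)
                + m * (1.0 - m2) * self.k_fg(x, x2)
                + (1.0 - m) * m2 * tf.linalg.matrix_transpose(self.k_fg(x2, x))
                + (1.0 - m) * (1.0 - m2) * self.k_gg(x, x2))

    def K_diag(self, X):
        x, tag = X[:, :-1], X[:, -1]
        is_f = tf.cast(tag == 0.0, X.dtype)
        return (is_f * self.k_ff(x, full_cov=False)
                + (1.0 - is_f) * self.k_gg(x, full_cov=False))

This evaluates every sub-kernel on the full input and masks, which is wasteful; the partition/stitch route avoids that, which is why I'd recommend it for the real thing.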
After taking a couple of advanced statistics courses, I decided to code some functions/classes to automate estimating parameters for different distributions via MLE. In Matlab, the below is something I easily coded once:
function [ params, max, confidence_interval ] = newroutine( fun, data, guesses )
  lh = @(x,data) -sum(log(fun(x,data))); % Negative log-likelihood from user-defined pdf `fun`.
  options = optimset('Display', 'off', 'MaxIter', 1000000, 'TolX', 10^-20, 'TolFun', 10^-20);
  [theta, max1] = fminunc(@(x) lh(x,data), guesses, options); % Unconstrained minimization.
  params = theta
  max = max1
end
where I just have to correctly specify the underlying pdf as fun, and with more code I can calculate p-values, confidence intervals, etc.
With Python, however, all the sources I've found on MLE automation (for ex., here and here) insist that the easiest way to do this is to delve into OOP using a subclass of statsmodels' GenericLikelihoodModel, which seems way too complicated for me. My reasoning is that, since the log-likelihood can be built automatically from the pdf (at least for the vast majority of distributions), and scipy.stats."random_dist".fit() already easily returns MLE estimates, it seems ridiculous to have to write ~30 lines of class code each time you have a new distribution to fit.
I realize that doing it the way the two links suggest allows you to automatically tap into statsmodels' functions, but it honestly does not seem simpler than tapping into scipy oneself and writing much simpler functions.
Am I missing an easier way to perform basic MLE, or is there a real good reason for the way statsmodels does this?
I wrote the first post outlining the various methods, and I think it is fair to say that while I recommend the statsmodels approach, I did so to leverage the postestimation tools it provides and to get standard errors every time a model is estimated.
When using minimize, the Python equivalent of fminunc (as you outline in your example), I am often forced to use "Nelder-Mead" or some other gradient-free method to get convergence. Since I need standard errors for statistical inference, this entails an additional step using numdifftools to recover the Hessian; a sketch of that route is below. So in the end, the method you propose has its complications too (for my work). If all you care about is the maximum likelihood estimate and not inference, then the approach you outline is probably best, and you are correct that you don't need the machinery of statsmodels.
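For reference, the minimize-plus-numdifftools route looks roughly like this (a sketch; the normal pdf is just a placeholder for any scipy.stats pdf, and in practice you would reparameterise so the scale stays positive):

import numpy as np
import numdifftools as nd
from scipy import stats, optimize

def fit_mle(pdf, data, guesses):
    # Negative log-likelihood built directly from the pdf.
    nll = lambda theta: -np.sum(np.log(pdf(data, *theta)))
    res = optimize.minimize(nll, guesses, method='Nelder-Mead')
    # Standard errors from the inverse Hessian of the negative
    # log-likelihood at the optimum.
    hess = nd.Hessian(nll)(res.x)
    se = np.sqrt(np.diag(np.linalg.inv(hess)))
    return res.x, se

data = stats.norm.rvs(loc=2.0, scale=1.5, size=500, random_state=0)
params, se = fit_mle(stats.norm.pdf, data, guesses=[0.0, 1.0])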
FYI: in a later post, I use your approach combined with autograd for significant speedups of big maximum likelihood models. I haven't successfully gotten this to work with statsmodels.
I am trying to find an equivalent of Mathematica's NMaximize optimization command in Python. I tried googling, but it did not help much.
The Mathematica docs describe the methods usable within NMaximize as follows: Possible settings for the Method option include "NelderMead", "DifferentialEvolution", "SimulatedAnnealing", and "RandomSearch".
Have a look at scipy's optimize, which also supports:
Nelder-Mead (scipy.optimize.minimize with method='Nelder-Mead')
Differential Evolution (scipy.optimize.differential_evolution)
and much more...
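For example, a rough equivalent of NMaximize (scipy only minimizes, so you maximise by minimising the negative; the objective here is a toy placeholder):

import numpy as np
from scipy import optimize

f = lambda x: np.sin(x[0]) * np.exp(-x[0]**2 / 10.0)  # toy objective

# Global search over box bounds, maximising f by minimising -f.
result = optimize.differential_evolution(lambda x: -f(x),
                                         bounds=[(-5.0, 5.0)])
print(result.x, -result.fun)  # argmax and the maximum value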
It is very important to find the correct tool for your optimization problem! This is at least dependent on:
Discrete variables?
Smooth optimization function?
Linear, Conic, Non-convex optimization problem?
and again: much more...
Compared to Mathematica's approach, you will have to choose the method a priori within scipy (to some extent).
I have a dataset which I need to fit to a GEV distribution. The data is one-dimensional and stored in a numpy array. Currently I am using scipy.stats.genextreme.fit(data), which works OK but gives totally inaccurate results (obvious from plotting the pdf). After some investigation it turns out that my data does not fit well in log space, which scipy uses in its MLE fitting algorithm, so I need to try something like GMM instead, which is only available in statsmodels. The problem is that I can't find anything that looks like scipy's fit function. All the examples I've found seem to deal with far more complicated data than I have. Also, statsmodels requires endog and exog parameters for everything, and I have no idea what these are.
This should be really simple, so I'm sure I'm missing something obvious. Has anyone used statsmodels in this way, and if so, any pointers as to how to do it?
I'm guessing you want Gaussian Mixture Model (GMM) and not Generalized Method of Moments (GMM). The former GMM is available in scikit-learn here. The latter has code in statsmodels, but it's a work in progress.
EDIT: Actually, it's not clear to me that you want GMM. Maybe you just want a kernel density estimator (KDE). This is available in statsmodels here, with an example.
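For example, a minimal sketch (assuming data is the 1-D numpy array from your question):

import statsmodels.api as sm

kde = sm.nonparametric.KDEUnivariate(data)
kde.fit()  # default: Gaussian kernel, rule-of-thumb bandwidth
# kde.support and kde.density then give the estimated pdf on a grid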
Hmm, if you do want to use the (Generalized) Method of Moments to fit some kind of probability-weighted GEV, then you need to specify the moment conditions, but I don't have a ready example of how to specify moment conditions for (G)MM in statsmodels. You might be better off asking on the mailing list.