I'm trying to fit some data to a mixed model using an expectation maximization approach. In Matlab, the code is as follows
% mixture model's PDF
mixtureModel = ...
@(x,pguess,kappa) pguess/180 + (1-pguess)*exp(kappa*cos(2*x/180*pi))/(180*besseli(0,kappa));
% Set up parameters for the MLE function
options = statset('mlecustom');
options.MaxIter = 20000;
options.MaxFunEvals = 20000;
% fit the model using maximum likelihood estimate
params = mle(data, 'pdf', mixtureModel, 'start', [.1 1/10], ...
'lowerbound', [0 1/50], 'upperbound', [1 50], ...
'options', options);
The data parameter is a 1-D vector of floats.
I'm wondering how the equivalent computation can be achieved in Python. I looked into scipy.optimize.minimize, but this doesn't seem to be a drop-in replacement for Matlab's mle.
I'm a bit lost and overwhelmed, can somebody point me in the right direction (ideally with some example code?)
Thanks very much in advance!
Edit: In the meantime I've found this, but I'm still rather lost as (1) it seems primarily focused on Gaussian mixture models (which mine is not) and (2) my mathematical skills are severely lacking. That said, I'll happily accept an answer that elucidates how this notebook relates to my specific problem!
This is a mixture model (not a mixed model) of a uniform and a von Mises distribution, whose parameters you are trying to infer by direct maximum likelihood estimation (not EM, although that may be more appropriate). You can find theses written on this exact problem if you search online.
SciPy doesn't have anything that is as obvious a choice as MATLAB's fmincon (which mle uses by default in your code), but you can look for SciPy optimization methods that allow bounds on the parameters. The SciPy interface also differs from MATLAB's mle: you pass the data through the args argument of the SciPy minimization functions, while the pguess and kappa parameters are represented by a single parameter array of length 2.
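As an illustration of that setup, here is a minimal sketch using scipy.optimize.minimize, which minimizes the negative log-likelihood directly. The variable data is assumed to be the same 1-D vector of floats as in the question, and L-BFGS-B is just one of the SciPy methods that supports bounds:
import numpy as np
from scipy.optimize import minimize
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def mixture_pdf(x, pguess, kappa):
    # uniform guessing component plus a von Mises-like component,
    # mirroring the Matlab anonymous function above
    return (pguess / 180
            + (1 - pguess) * np.exp(kappa * np.cos(2 * x / 180 * np.pi))
              / (180 * i0(kappa)))

def neg_log_likelihood(params, x):
    pguess, kappa = params
    return -np.sum(np.log(mixture_pdf(x, pguess, kappa)))

data = np.asarray(data, dtype=float)   # data is assumed to come from the question

result = minimize(neg_log_likelihood,
                  x0=[0.1, 0.1],                # same starting values as 'start'
                  args=(data,),                 # the data goes through args, not the parameter vector
                  bounds=[(0, 1), (0.02, 50)],  # same bounds as 'lowerbound'/'upperbound'
                  method='L-BFGS-B')
pguess_hat, kappa_hat = result.x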
I believe the scikit-learn toolkit has what you need:
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html.
Gaussian Mixture Model
Representation of a Gaussian mixture model probability distribution. This class allows for easy evaluation of, sampling from, and maximum-likelihood estimation of the parameters of a GMM distribution.
Initializes parameters such that every mixture component has zero mean and identity covariance.
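For what it's worth, here is a minimal sketch of fitting a Gaussian mixture with scikit-learn. Note that in current scikit-learn the class is sklearn.mixture.GaussianMixture rather than GMM, that the number of components (2 here) is an arbitrary choice for illustration, and that this only applies if your mixture components really are Gaussian:
import numpy as np
from sklearn.mixture import GaussianMixture

# data is assumed to be the 1-D array from the question; scikit-learn expects a 2-D array
X = np.asarray(data).reshape(-1, 1)
gmm = GaussianMixture(n_components=2)
gmm.fit(X)
print(gmm.weights_, gmm.means_, gmm.covariances_)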
Related
In the scipy documentation, the term 'frozen pdf' (among others) is sometimes mentioned, but I don't know what it means. Is it a statistical concept or scipy terminology?
I agree that the docs are somewhat unclear on the issue. It seems that a frozen distribution simply fixes the distribution's parameters for the programmer's convenience. I am not aware of the term "frozen distribution" being used outside of SciPy.
SciPy's frozen distribution is perhaps best described here:
Passing the loc and scale keywords time and again can become quite
bothersome. The concept of freezing a RV is used to solve such
problems.
from scipy.stats import gamma
rv = gamma(1, scale=2.)
By using rv we no longer have to include the scale or the shape
parameters anymore. Thus, distributions can be used in one of two
ways, either by passing all distribution parameters to each method
call (such as we did earlier) or by freezing the parameters for the
instance of the distribution. Let us check this:
rv.mean(), rv.std()
(2.0, 2.0)
This is, indeed, what we should get.
In the scipy tutorial page, we see the following line:
(We explain the meaning of a frozen distribution below).
The only mention of frozen distribution after that point is the following:
The main additional methods of the not frozen distribution are related
to the estimation of distribution parameters:
fit: maximum likelihood estimation of distribution parameters, including location and scale
fit_loc_scale: estimation of location and scale when shape parameters are given
nnlf: negative log likelihood function
expect: calculate the expectation of a function against the pdf or pmf
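To make the distinction concrete, here is a small sketch (using the gamma distribution purely as an example) contrasting the non-frozen and frozen ways of working with a scipy distribution:
from scipy import stats

data = stats.gamma.rvs(1, scale=2.0, size=1000)    # sample data for illustration

# non-frozen: parameters are passed on every call, and fit() is available
shape, loc, scale = stats.gamma.fit(data)          # maximum likelihood estimates
stats.gamma.pdf(1.0, shape, loc=loc, scale=scale)  # parameters repeated on each call

# frozen: the parameters are fixed once and the methods no longer need them
rv = stats.gamma(shape, loc=loc, scale=scale)
rv.pdf(1.0)
rv.mean(), rv.std()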
Given a set of 2D points, I would like to fit the optimal spline to this data with a given number of internal knots.
I have seen that we can use scipy's LSQUnivariateSpline to specify the number and positions of the knots, but it does not allow us to specify only the number of knots and have their positions chosen automatically.
From the UnivariateSpline documentation, it seems implied that they have a method for fitting the spline with a given number of knots, as the documentation for the smoothing factor s states (emphasis mine):
Positive smoothing factor used to choose the number of knots. Number
of knots will be increased until the smoothing condition is satisfied...
So while I could go about this backwards and search over smoothing factors until one yields a spline with the desired number of knots, this seems a rather wasteful way to approach it computationally: two extra search steps are added only to cancel each other out and recover a result that the library had effectively computed along the way.
I've searched around but haven't found a function to access this spline interpolation with a given number of knots directly. I'm not sure if I've missed something simple, or if it's hidden deeper down somewhere and/or not available in the API.
Note: a scipy solution is not required, any python libraries or handcrafted python code is fine (I am using scipy here just because that's where all of my searches about spline interpolation in python have landed me).
Unfortunately, it looks like the UnivariateSpline constructor passes off the computational work to the function dfitpack.curf0, which is implemented in Fortran.
Therefore, although the documentation indicates that the smoothing requirement is met by adjusting the number of knots, there is no way to directly access the function which fits a spline given a number of knots from the python API.
In light of this, it looks like one may need to look to another library or write the algorithm oneself, if avoiding the roundabout double search method is desired. However, in many cases, it may be acceptable to simply run a binary search for the desired number of knots by adjusting the smoothing parameter.
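For reference, a rough sketch of that search, bisecting on the smoothing factor s. The bracket for s, the knot count reported by get_knots(), and the iteration limit are all heuristic choices, so this is not guaranteed to hit the target exactly:
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_with_n_knots(x, y, target_knots, max_iter=200):
    # bisect on s until get_knots() reports target_knots knots;
    # get_knots() returns the distinct knots, including the two boundary knots
    lo, hi = 0.0, len(x) * np.var(y)   # s=0 interpolates (many knots); large s smooths (few knots)
    for _ in range(max_iter):
        s = 0.5 * (lo + hi)
        spl = UnivariateSpline(x, y, s=s)
        n = len(spl.get_knots())
        if n > target_knots:        # too many knots -> smooth more
            lo = s
        elif n < target_knots:      # too few knots -> smooth less
            hi = s
        else:
            break
    return spl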
SciPy does not have smoothing splines with a fixed number of knots. You either provide your knots, or let FITPACK select them via the smoothing-condition knob.
So, I have the following data I've plotted in Python.
The data is input for a forcing term in a system of differential equations I am working with. Thus, I need to fit a continuous function to this data so I will not have to deal with stability issues that could come with discontinuities of a step-wise function. Unfortunately, it's a pretty large data set.
I am trying to end up with a fitted function that is feasible, and not too tedious, to translate into Stan, the language I am coding the differential equations in, so I would prefer something in piecewise polynomial form with at most a few pieces that I can code by hand.
I started off with polyfit from numpy, which was not very good. Using UnivariateSpline from scipy gave me a decent fit, but it did not give me something that looked tractable to translate into Stan. Hence, I am looking for suggestions for other fits I could try that would return functions that are more easily translated into other languages. Looking at the shape of my data, is there a periodic spline fit that could be useful?
The UnivariateSpline object has get_knots and get_coeffs methods. They give you the knots and coefficients of the fit in the b-spline basis.
An alternative, equivalent, way is to use splrep for fitting (and splev for evaluations).
To convert to a piecewise polynomial representation, use PPoly.from_spline (check the docs for the latter for the exact format)
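A minimal sketch of that route (x and y are assumed to be the data arrays from the question, and the smoothing factor here is arbitrary):
import numpy as np
from scipy.interpolate import UnivariateSpline, splrep, PPoly

spl = UnivariateSpline(x, y, s=len(x))        # smoothing factor chosen only for illustration
knots, coeffs = spl.get_knots(), spl.get_coeffs()

# equivalent fit via splrep, then conversion to piecewise polynomials
tck = splrep(x, y, s=len(x))
pp = PPoly.from_spline(tck)
# pp.x holds the breakpoints and pp.c the per-piece polynomial coefficients,
# which can be transcribed into Stan by hand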
If what you want is a Fourier space representation, you can use leastsq or least_squares. It'd be essential to provide sensible starting values for NLSQ fit parameters. At least I'd start from e.g. max-to-max distance estimate for the period and max-to-min estimate for the amplitude.
As always with non-linear fitting, YMMV, however.
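If a single sinusoid turns out to be enough, a minimal sketch under that assumption could look like the following (the model form, the starting values, and the names x and y are all illustrative):
import numpy as np
from scipy.optimize import least_squares

def residuals(params, x, y):
    amplitude, period, phase, offset = params
    return amplitude * np.sin(2 * np.pi * x / period + phase) + offset - y

# rough starting values: amplitude from the data range, period from a max-to-max guess
p0 = [(np.max(y) - np.min(y)) / 2, np.ptp(x) / 3, 0.0, np.mean(y)]
fit = least_squares(residuals, p0, args=(x, y))
amplitude, period, phase, offset = fit.x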
From the direction field, it seems that a fit involving the sum or composition of multiple sinusoidal functions might work.
Ex: sin(cos(2x)), sin(x)+2cos(x), etc.
I would use Wolfram Alpha, Mathematica, or Matlab to create direction fields.
I've been digging into the API of statsmodels.regression.linear_model.RegressionResults and have found how to retrieve different flavors of heteroskedasticity corrected standard errors (via properties like HC0_se, etc.) However, I can't quite figure out how to get the t-tests on the coefficients to use these corrected standard errors. Is there a way to do this in the API, or do I have to do it manually? If the latter, can you suggest any guidance on how to do this with statsmodels results?
The fit method of the linear models, discrete models, and GLM takes cov_type and cov_kwds arguments for specifying robust covariance matrices. The chosen covariance matrix is attached to the results instance and used for all inference and statistics reported in the summary table.
Unfortunately, the documentation doesn't really show this yet in an appropriate way.
The auxiliary method that actually selects the sandwiches based on the options shows the options and required arguments:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.fit.html
For example, estimating an OLS model and using HC3 covariance matrices can be done with
model_ols = OLS(...)
result = model_ols.fit(cov_type='HC3')
result.bse
result.t_test(....)
Some sandwiches require additional arguments. For example, cluster-robust standard errors can be selected in the following way, assuming mygroups is an array that contains the group labels:
results = OLS(...).fit(cov_type='cluster', cov_kwds={'groups': mygroups})
results.bse
...
Some robust covariance matrices make additional assumptions about the data without checking them. For example, heteroscedasticity- and autocorrelation-robust (HAC, or Newey-West) standard errors assume a sequential time series structure. Some panel-data robust standard errors also assume that the time series are stacked by individual.
A separate option use_t is available to specify whether the t and F or the normal and chisquare distributions should be used by default for Wald tests and confidence intervals.
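As a hypothetical illustration combining the two options (the design matrix X, response y, lag length, and constraint string are all placeholders):
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X))
# Newey-West (HAC) standard errors with 4 lags, with t/F-based inference requested
results = model.fit(cov_type='HAC', cov_kwds={'maxlags': 4}, use_t=True)
print(results.summary())          # summary table uses the robust covariance
print(results.t_test('x1 = 0'))   # placeholder constraint on the first regressor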
I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. I've read that RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the following way:
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm
modelspec = 'cost ~ np.log(units) + np.log(units):item + item'  # item is a categorical variable
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())
the summary output shows a z statistic, and the coefficient significance tests appear to be based on it rather than on a t statistic. However, the following R tutorial (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases, each item has 40 or more observations, though some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample setting? The line it produces must be the best-fit line after reweighting outliers, but is the z-test effective for samples of this size? That is, is there reason to believe the confidence intervals produced by smf.rlm() do not give 95% coverage? (I know this can potentially be an issue for t-tests.)
Thanks!
I have mostly only a general answer; I have never read any small-sample Monte Carlo studies for M-estimators.
To 1.
In many models, like M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except perhaps for a few special cases. Asymptotic results give conditions under which the estimator is asymptotically normally distributed. Given this, statsmodels defaults to the normal distribution for all models outside the linear regression model (OLS) and similar, and to the chisquare instead of the F distribution for Wald tests of joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release, and in the current development version, of statsmodels users can choose to use the t and F distribution for the results, instead of the normal and chisquare distribution. The defaults stay the same as they are now.
There are other cases where it is not clear whether the t-distribution should be used, and, if so, with which small-sample degrees of freedom. In many cases, statsmodels tries to follow the lead of Stata, for example with cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity or correlation robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness of the assumptions about the outliers and on the choice of robust estimator. Another cross-check: does OLS with the outliers dropped (the observations that receive small weights in the RLM estimate) give a similar answer?
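A rough sketch of that cross-check, reusing the model from the question (the 0.5 weight cutoff is arbitrary and only for illustration):
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
rlm_results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()

# keep the observations that the robust fit did not heavily down-weight
inliers = dataset[rlm_results.weights > 0.5]
ols_results = smf.ols(modelspec, data=inliers).fit()

# compare the coefficient estimates from the two fits
print(rlm_results.params)
print(ols_results.params)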
Finally, as a general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?