Given a set of 2D points, I would like to fit the optimal spline to this data with a given number of internal knots.
I have seen that scipy's LSQUnivariateSpline lets us specify the number and positions of knots; however, it does not allow us to specify only the number of knots.
From the UnivariateSpline documentation, it seems implied that they have a method for fitting the spline with a given number of knots, as the documentation for the smoothing factor s states (emphasis mine):
Positive smoothing factor used to choose the number of knots. Number
of knots will be increased until the smoothing condition is satisfied...
So while I could work backwards and search through smoothing factors until one yields a spline with the desired number of knots, this seems computationally wasteful: two extra search steps cancel each other out just to recover a result that was already computed directly at the start.
I've searched around but haven't found a function to access this spline interpolation with a given number of knots directly. I'm not sure if I've missed something simple, or if it's hidden deeper down somewhere and/or not available in the API.
Note: a scipy solution is not required, any python libraries or handcrafted python code is fine (I am using scipy here just because that's where all of my searches about spline interpolation in python have landed me).
Unfortunately, it looks like the UnivariateSpline constructor passes off the computational work to the function dfitpack.curf0, which is implemented in Fortran.
Therefore, although the documentation indicates that the smoothing requirement is met by adjusting the number of knots, there is no way to directly access, from the Python API, the routine that fits a spline for a given number of knots.
In light of this, it looks like one may need to look to another library or write the algorithm oneself, if avoiding the roundabout double search method is desired. However, in many cases, it may be acceptable to simply run a binary search for the desired number of knots by adjusting the smoothing parameter.
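If the roundabout approach is acceptable, here is a minimal sketch of that bisection (assuming synthetic data, and assuming the knot count shrinks monotonically as s grows, which FITPACK does not strictly guarantee):

import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_with_n_knots(x, y, target_knots, s_lo=0.0, s_hi=None, max_iter=50):
    # Bisect the smoothing factor s until the fit has the target number
    # of knots; assumes the knot count shrinks as s increases.
    if s_hi is None:
        s_hi = len(x) * np.var(y)  # generous upper bound on s
    spl = UnivariateSpline(x, y, s=s_hi)
    for _ in range(max_iter):
        s_mid = 0.5 * (s_lo + s_hi)
        spl = UnivariateSpline(x, y, s=s_mid)
        n = len(spl.get_knots())
        if n == target_knots:
            break
        if n > target_knots:
            s_lo = s_mid   # too many knots -> smooth more
        else:
            s_hi = s_mid   # too few knots -> smooth less
    return spl

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.1 * np.random.randn(200)
spl = spline_with_n_knots(x, y, target_knots=8)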
Scipy does not have smoothing splines with a fixed number of knots. You either provide your knots, or let FITPACK select them via the smoothing-condition knob.
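If only the number of knots matters and not their exact placement, a common workaround is to choose the interior knots yourself, e.g. at quantiles of x, and hand them to LSQUnivariateSpline. A sketch, with the quantile placement being just a heuristic:

import numpy as np
from scipy.interpolate import LSQUnivariateSpline

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.1 * np.random.randn(200)

n_interior = 8
# Interior knots at evenly spaced quantiles of x (heuristic placement;
# with enough data this satisfies the Schoenberg-Whitney conditions)
t = np.quantile(x, np.linspace(0, 1, n_interior + 2)[1:-1])
spl = LSQUnivariateSpline(x, y, t)  # cubic (k=3) by default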
Related
I want to find out the minimum number of samples needed to more or less correctly fit a probability distribution (in my case the generalized extreme value distribution from scipy.stats).
To evaluate the fitted distribution, I want to compute the KL divergence between the original pdf and the fitted one.
Unfortunately, all implementations I found (e.g. scipy.stats.entropy) only take discrete arrays as input. So obviously I thought of approximating the pdf by a discrete array, but I just can't figure out how.
Does anyone have experience with something similar? I would be thankful for hints relating directly to my question, but also for better alternatives to determine a distance between two functions in python, if there are any.
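One way to carry out the discretization idea is to evaluate both pdfs on a common grid and pass the values to scipy.stats.entropy, which normalizes its inputs so the constant grid spacing cancels. A sketch using scipy.stats.genextreme (the sample size and grid range are arbitrary choices):

import numpy as np
from scipy import stats

true_dist = stats.genextreme(c=-0.1, loc=0.0, scale=1.0)
sample = true_dist.rvs(size=500, random_state=0)

# MLE fit of a GEV to the sample
c_hat, loc_hat, scale_hat = stats.genextreme.fit(sample)
fitted = stats.genextreme(c_hat, loc=loc_hat, scale=scale_hat)

# Evaluate both pdfs on a common grid; entropy() normalizes its inputs,
# so the constant grid spacing cancels out of the KL estimate
grid = np.linspace(sample.min() - 1.0, sample.max() + 1.0, 2000)
kl = stats.entropy(true_dist.pdf(grid), fitted.pdf(grid))
print(kl)

# Caveat: where the fitted pdf is zero but the true pdf is not
# (mismatched supports), the KL divergence is infinite.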
I'm trying to make a model for a very simple data set using spline regression, but so far I couldn't find any Python implementation that lets me choose the knot positions. The picture below shows where I want to put my knot; I want my function to consist of only two linear regressions and nothing more.
So far I've tried pyearth and scipy splines, but I couldn't find a parameter in either of them for setting knot positions, and even when I tweak the other parameters I can't get a result that satisfies me.
patsy.dmatrix and scipy.interpolate.splrep both let you specify knot positions.
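For instance, with scipy, LSQUnivariateSpline with degree k=1 and a single interior knot gives exactly two joined line segments. A sketch, where the knot position 5.0 stands in for wherever you want the break:

import numpy as np
from scipy.interpolate import LSQUnivariateSpline

x = np.linspace(0, 10, 100)
y = np.where(x < 5, x, 10 - x) + 0.2 * np.random.randn(100)

knots = [5.0]  # the chosen break point
fit = LSQUnivariateSpline(x, y, knots, k=1)  # k=1 -> two line segments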
I'm stuck trying to find existing functions in scipy (or sympy) for the following task:
Suppose we are given the following function:
f(A,B,C) = k1 - A*sin(B*k2 - C)
For each of the axes A, B, C of the space we have a specific interval, [a_lb, a_ub], [b_lb, b_ub], [c_lb, c_ub], and for the function value an interval [d_lb, d_ub].
Which scipy functions can be used to compute whether the region enclosed by these boundaries is intersected by the given function? I thought of, e.g., computing the Hessian matrix.
Thanks for any hints.
If I understand correctly, what you are looking for is whether f(A,B,C), restricted to the domain [a_l,a_u]x[b_l,b_u]x[c_l,c_u], takes a value within [d_l,d_u]. You can try using scipy.optimize.minimize for this.
If you run scipy.optimize.minimize on f with the bounds [a_l,a_u]x[b_l,b_u]x[c_l,c_u], you should get the minimal value of f in the domain. Similarly, minimizing -f will give you the maximal value of f in the domain. f intersects the given boundary if and only if the interval [fmin, fmax] intersects the interval [d_l,d_u].
Note that scipy.optimize.minimize is a non-linear optimization and therefore requires an initial guess. The middle point of the domain box is a natural choice, but since the non-linear optimization may encounter a local minimum (or not converge), you may want to try several other initial guesses as well. scipy.optimize.minimize has many (optional) parameters so I recommend you read its documentation and play with them to fine-tune your usage to your needs.
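A sketch of that recipe for f(A,B,C) = k1 - A*sin(B*k2 - C), with made-up values for k1, k2, and all of the bounds:

import numpy as np
from scipy.optimize import minimize

k1, k2 = 1.0, 2.0  # example constants

def f(v):
    A, B, C = v
    return k1 - A * np.sin(B * k2 - C)

bounds = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]  # [a_l,a_u]x[b_l,b_u]x[c_l,c_u]
d_l, d_u = 0.5, 1.5                            # target interval for f
x0 = np.array([0.5, 0.5, 0.5])                 # center of the box

fmin = minimize(f, x0, bounds=bounds).fun
fmax = -minimize(lambda v: -f(v), x0, bounds=bounds).fun

# f reaches into [d_l, d_u] iff the two intervals overlap
print(fmin <= d_u and fmax >= d_l)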
So, I have the following data I've plotted in Python.
The data is input for a forcing term in a system of differential equations I am working with. Thus, I need to fit a continuous function to this data so I will not have to deal with stability issues that could come with discontinuities of a step-wise function. Unfortunately, it's a pretty large data set.
I am trying to end up with a fitted function that is feasible and not too tedious to translate into Stan, the language I am coding the differential equations in, so I would prefer something in piecewise-polynomial form with at most a few pieces that I can code manually.
I started off with polyfit from numpy, which was not very good. Using UnivariateSpline from scipy gave me a decent fit, but it did not give me something that looked tractable for translation into Stan. Hence, I am looking for suggestions for other fits I could try that would return functions more easily translatable into other languages. Looking at the shape of my data, is there a periodic spline fit that could be useful?
The UnivariateSpline object has get_knots and get_coeffs methods. They give you the knots and coefficients of the fit in the b-spline basis.
An alternative, equivalent way is to use splrep for fitting (and splev for evaluation).
To convert to a piecewise-polynomial representation, use PPoly.from_spline (check its docs for the exact format).
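A sketch of that conversion: fit in B-spline form with splrep, then inspect the breakpoints and per-interval coefficients of the resulting PPoly:

import numpy as np
from scipy.interpolate import splrep, PPoly

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.1 * np.random.randn(200)

tck = splrep(x, y, s=len(x))   # smoothing fit, B-spline form
pp = PPoly.from_spline(tck)

# pp.x holds the breakpoints; pp.c[:, i] holds the coefficients of the
# polynomial on [pp.x[i], pp.x[i+1]], highest power first, expressed in
# the local variable (x - pp.x[i])
print(pp.x)
print(pp.c.shape)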
If what you want is a Fourier-space representation, you can use leastsq or least_squares. It's essential to provide sensible starting values for the NLSQ fit parameters; I'd start from, e.g., a max-to-max distance estimate for the period and a max-to-min estimate for the amplitude.
As always with non-linear fitting, YMMV, however.
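For what it's worth, a sketch of such a fit for a single sinusoid A*sin(2*pi*x/T + phi) + c, with heuristic starting values in the spirit described above (here taken from the data spread and record length rather than literal max-to-max distances):

import numpy as np
from scipy.optimize import least_squares

x = np.linspace(0, 20, 300)
y = 2.0 * np.sin(2 * np.pi * x / 7.0 + 0.5) + 0.3 * np.random.randn(300)

def residuals(p, x, y):
    A, T, phi, c = p
    return A * np.sin(2 * np.pi * x / T + phi) + c - y

# Heuristic starting values: amplitude from the data spread, period
# guessed from the record length, zero phase, mean offset
p0 = [0.5 * (y.max() - y.min()), (x.max() - x.min()) / 3.0, 0.0, y.mean()]
fit = least_squares(residuals, p0, args=(x, y))
print(fit.x)  # A, T, phi, c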
From the direction field, it seems that a fit involving a sum or composition of multiple sinusoidal functions might be it.
Ex: sin(cos(2x)), sin(x)+2cos(x), etc.
I would use Wolfram Alpha, Mathematica, or Matlab to create direction fields.
I have a list of many floating-point numbers, representing the duration of an operation performed many times.
For each type of operation, the numbers follow a different trend.
I'm aware of the many random generators provided in Python modules like numpy.random: for example binomial, exponential, normal, Weibull, and so on.
I'd like to know if there's a way to find, for each list of numbers, the random generator that best fits it, i.e. the generator (with its parameters) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of durations for each operation, so that I can simulate it over n years without finding by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers, and I'm trying to find the probability distribution that best fits it. The only problem I see is that each probability distribution has input parameters that affect the result, so I'll have to figure out how to choose these parameters automatically to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
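In scipy, the continuous distributions already do maximum likelihood estimation via their fit method, and the sorted-sample comparison behind a Q-Q plot takes only a few lines. A sketch with an exponential candidate:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.sort(stats.expon.rvs(scale=2.0, size=300, random_state=0))

loc, scale = stats.expon.fit(data)  # MLE for the exponential

# Q-Q plot: sorted data against quantiles of the fitted distribution
probs = (np.arange(1, len(data) + 1) - 0.5) / len(data)
theoretical = stats.expon.ppf(probs, loc=loc, scale=scale)
plt.plot(theoretical, data, '.')
plt.plot(theoretical, theoretical)  # y = x reference line
plt.show()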
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: adding a reference to a numpy least-squares fitting example.
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, or the gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the maximum likelihood procedure. It is not a least-squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the negative Hessian (the curvature matrix) at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
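A sketch of that comparison in Python, leaning on the fact that scipy.stats distributions fit by maximum likelihood and expose logpdf (the candidate set is just an example):

import numpy as np
from scipy import stats

data = stats.gamma.rvs(a=2.0, scale=1.5, size=500, random_state=0)

candidates = {"exponential": stats.expon,
              "gamma": stats.gamma,
              "weibull": stats.weibull_min}

for name, dist in candidates.items():
    params = dist.fit(data)  # maximum likelihood estimates
    loglik = np.sum(dist.logpdf(data, *params))
    print(name, loglik)

# Prefer the candidate with the largest log likelihood; if the models
# have different numbers of parameters, penalize with AIC = 2k - 2*loglik.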
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is also easy to work out by hand what the maximum likelihood method requires the parameter to be:
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the estimated parameter for the exponential distribution solves sum_i (1/la - x_i) = 0, which gives la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
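For instance, the same closed-form estimate coded in Python:

import numpy as np

data = np.random.default_rng(0).exponential(scale=1.0, size=10)
la_hat = 1.0 / np.mean(data)  # MLE for the exponential rate
print(la_hat)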