Choose knots in Python regression splines

Choose knots in Python regression splines - python

I'm trying to make a model for a very simple data set using spline regression but so far I couldn't find any Python implementation that lets me choose knots position. The picture below shows where I want to put my knot, I want my function to consist only of 2 linear regressions and nothing more.
So far I've tried pyearth and scipy splines but I couldn't find in any of them parameter responsible for setting knots position and even when I tweak other parameters I can't get result that would satisfy me.

patsy.dmatrix and scipy.interpolate.splrep both have knot selection features

Related

Fit spline with given number of knots, but not knot positions

Given a set of 2D points, I would like to fit the optimal spline to this data with a given number of internal knots.
I have seen that we can use scipy's LSQUnivariateSpline to specify the number and position of knots, however it does not allow us to only specify the number of knots.
From the UnivariateSpline documentation, it seems implied that they have a method for fitting the spline with a given number of knots, as the documentation for the smoothing factor s states (emphasis mine):
Positive smoothing factor used to choose the number of knots. Number
of knots will be increased until the smoothing condition is satisfied...
So while I could go about this in a kind of backwards way and search through smoothing factors until it yields a spline with the desired number of knots, this seems to be a rather ridiculous way to approach this from a computational efficiency standpoint. Two extra search steps are happening just to cancel each other out and obtain a result that was already computed directly at the start.
I've searched around but haven't found a function to access this spline interpolation with a given number of knots directly. I'm not sure if I've missed something simple, or if it's hidden deeper down somewhere and/or not available in the API.
Note: a scipy solution is not required, any python libraries or handcrafted python code is fine (I am using scipy here just because that's where all of my searches about spline interpolation in python have landed me).

Unfortunately, it looks like the UnivariateSpline constructor passes off the computational work to the function dfitpack.curf0, which is implemented in Fortran.
Therefore, although the documentation indicates that the smoothing requirement is met by adjusting the number of knots, there is no way to directly access the function which fits a spline given a number of knots from the python API.
In light of this, it looks like one may need to look to another library or write the algorithm oneself, if avoiding the roundabout double search method is desired. However, in many cases, it may be acceptable to simply run a binary search for the desired number of knots by adjusting the smoothing parameter.

Scipy does not have smoothing splines with a fixed number of knots. You either provide your knots, or let FITPACK select it via the smoothing condition knob.

Optimization find polynomial coefficients using constraints

I am looking for an optimisations tool for python like pyswarm which tries to find polynomial coefficients by trying different coefficients values. To guide the search I want to add additional constraints eg. the values of the polynomial needs to be between limits or can have a maximum gradient of a certain value. In case that the optimizer chooses coefficients which don't satisfy these constraints, new coefficients need to be chosen instantly instead of using the created polynomial.
Background information:
The polynomial generated by the optimizer will then be used by a different program generating a data set. The generated data will be compared to an existing data set. The goal is to match the data as closely as possible.
Thanks

Adjusted Boxplot in Python

For my thesis, I am trying to identify outliers in my data set. The data set is constructed of 160000 times of one variable from a real process environment. In this environment however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with I little help of literature instead of only "expert opinion".
Now I've read about the IQR method of seeing whether possible outliers lie when dealing with a symmetric distribution like the normal distribution. However, my data set is right skewed and by distribution fitting, inverse gamma and lognormal where the best fit.
So, during my search for methods for non-symmetric distributions, I found this topic on crossvalidated where user603's answer is interesting in particular: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset and that R and Matlab have functions for this
(There is an 𝚁R implementation of this
(𝚛𝚘𝚋𝚞𝚜𝚝𝚋𝚊𝚜𝚎::𝚊𝚍𝚓𝚋𝚘𝚡()robustbase::adjbox()) as well as
a matlab one (in a library called 𝚕𝚒𝚋𝚛𝚊libra)
I was wondering if there is such a function in Python. Or is there a way to calculate the medcouple (see paper in user603's answer) with python?
I really would like to see what comes out the adjusted boxplot for my data..

In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of the skewness used in the Adjusted Boxplot.
enter link description here
With this variable you can calculate the interval beyond which outliers are defined.

Most efficient method of returning coefficients for a fit in Python for use in another languages?

So, I have the following data I've plotted in Python.
The data is input for a forcing term in a system of differential equations I am working with. Thus, I need to fit a continuous function to this data so I will not have to deal with stability issues that could come with discontinuities of a step-wise function. Unfortunately, it's a pretty large data set.
I am trying to end up with a fitted function that is possible and not too tedious to translate into Stan, the language that I am coding the differential equations in, so was preferring something in piece-wise polynomial form with a maximum of just a few pieces that I can manually code.
I started off with polyfit from numpy, which was not very good. Using UnivariateSpline from scipy gave me a decent fit, but it did not give me something that looked tractable for translation into Stan. Hence, I was looking for suggestions into other fits I could try that would return functions that are more easily translatable into other languages? Looking at the shape of my data, is there a periodic spline fit that could be useful?

The UnivariateSpline object has get_knots and get_coeffs methods. They give you the knots and coefficients of the fit in the b-spline basis.
An alternative, equivalent, way is to use splrep for fitting (and splev for evaluations).
To convert to a piecewise polynomial representation, use PPoly.from_spline (check the docs for the latter for the exact format)
If what you want is a Fourier space representation, you can use leastsq or least_squares. It'd be essential to provide sensible starting values for NLSQ fit parameters. At least I'd start from e.g. max-to-max distance estimate for the period and max-to-min estimate for the amplitude.
As always with non-linear fitting, YMMV, however.

From the direction field, it seems that a fit involving the sum of or composition of multiple sinusoidal functions might be it.
Ex: sin(cos(2x)), sin(x)+2cos(x), etc.
I would use Wolfram Alpha, Mathematica, or Matlab to create direction fields.

Classifying a Distribution of Points for Object Identification

I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.

I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do the principle component analysis and match by the first eigenvector.

You should just fit the distributions to the data, determine the chi^2 deviation for each one, look at F-Test. See for instance these notes on model fitting etc

You might want to consider also non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data set) in order to compare the statistics or distances of the estimated distributions. In Python stats.kde is an implementation in SciPy.Stats.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.