Python Least-Squares Natural Splines

I am trying to find a numerical package that will fit a natural spline by minimizing weighted least squares.
There is a function in scipy which does what I want for splines without the natural boundary constraint.
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate

x = np.arange(0, 5, 1.0/6)
xs = np.arange(0, 5, 1.0/500)
y = np.sin(x + 1) + 0.2*np.random.rand(len(x)) - 0.1

# least-squares cubic spline fit with user-chosen interior knots
knots = np.array([1, 2, 3, 4])
tck = interpolate.splrep(x, y, s=0, k=3, t=knots, task=-1)
ys = interpolate.splev(xs, tck, der=0)

plt.figure()
plt.plot(xs, ys, x, y, 'x')

The spline.py file inside of this tar file from this page does a natural spline fit by default. There is also some code on this page that claims to do mostly what you want. The pyD3D package also has a natural spline function in its pyDataUtils module. This last one looks the most promising to me. However, it doesn't appear to have the option of setting your own knots. Maybe if you look at the source you can find a way to rectify that.
Also, I found this message on the Scipy mailing list which says that, according to the writer of the message, using s=0.0 (as in your code above) makes splines fitted with the above procedure natural. I also found the splmake function, which has an option for a natural spline fit, but upon looking at the source I found that it isn't implemented yet.
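As a side note not covered by the links above: recent SciPy releases expose scipy.interpolate.CubicSpline, which accepts bc_type='natural'. It interpolates through every point rather than performing a weighted least-squares fit with chosen knots, so it only partially matches the requirement, but here is a minimal sketch in case interpolation is enough:
import numpy as np
from scipy.interpolate import CubicSpline

x = np.arange(0, 5, 1.0/6)
y = np.sin(x + 1) + 0.2*np.random.rand(len(x)) - 0.1

# natural boundary conditions: second derivative forced to zero at both ends
cs = CubicSpline(x, y, bc_type='natural')

xs = np.arange(0, 5, 1.0/500)
ys = cs(xs)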

Related

How can I generate a CDF using Kernel Density Estimation in Python?

There are a few methods I have come across that can do kernel density estimation which will provide a PDF for a sample of data:
KDEpy
sklearn.neighbors.KernelDensity
scipy.stats.gaussian_kde
Using any of the above I can generate a PDF, but I want to know how to get the CDF for the PDF I am generating. Mathematically I know you can integrate the PDF to get the CDF; the issue is that these methods only supply x and y points, not a function to integrate.
I'm wondering how I could transform the data into a CDF plot, or alternatively find a PDF function for the data that I can then integrate to get the CDF, or use an alternative method whose output is a CDF instead of a PDF.
MCVE
Let's create some dummy data to shoulder the discussion:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(123)
data = stats.norm(loc=0, scale=1).rvs(10**4)
Here is the baseline idea with the scipy.stats package.
Gaussian KDE
We can estimate KDE using dedicated tools such as gaussian_kde:
kde = stats.gaussian_kde(data)
This exposes a PDF that can be evaluated at any x, but it is missing the CDF.
Checking samples with the Kolmogorov-Smirnov test, we cannot reject the null hypothesis (that the two distributions are identical) at the 10% threshold:
stats.ks_2samp(data, kde.resample(100).squeeze())
# KstestResult(statistic=0.0969, pvalue=0.29163373800871994)
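Strictly speaking, gaussian_kde does already expose one route to the CDF: its integrate_box_1d(low, high) method integrates the estimated density between two bounds, so the CDF at a single point is the integral from minus infinity. A quick sketch (one call per evaluation point; the point 0.5 is illustrative):
x0 = 0.5
cdf_at_x0 = kde.integrate_box_1d(-np.inf, x0)  # P(X <= x0) under the KDE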
Continuous Variable
The scipy.stats package also exposes a generic class rv_continous to inherit from. As stated in documentation:
New random variables can be defined by subclassing the rv_continuous
class and re-defining at least the _pdf or the _cdf method (normalized
to location 0 and scale 1).
So we can use this purpose-built mechanism to fill the gap. Without any performance considerations, it boils down to:
class KDEDist(stats.rv_continuous):

    def __init__(self, kde, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._kde = kde  # keep a reference to the fitted gaussian_kde

    def _pdf(self, x):
        return self._kde.pdf(x)
Then we create the underlying object with our experimental KDE.
X = KDEDist(kde)
stats.ks_2samp(data, X.rvs(size=100)) # This call is kind of intensive
# KstestResult(statistic=0.0625, pvalue=0.8113077271721811)
Now we can naturally - at least in terms of the API - evaluate the PDF and the CDF as well:
x = np.linspace(data.min(), data.max(), 200)  # evaluation grid (missing from the original snippet)
fig, axe = plt.subplots()
axe.hist(data, density=True)
axe.plot(x, X.pdf(x))
axe.plot(x, X.cdf(x))
(The resulting figure shows the histogram with the estimated PDF and CDF curves overlaid.)
Performance considerations
Notice that this methodology answers your question, but it is not performant. KDE computations are expensive, mainly because the kernel spans the whole data space (a Gaussian only reaches zero at infinity). Therefore, without a cut-off feature, each evaluation is based on all observations in the dataset.
Changing the window function can drastically improve performance. E.g., a triangular window has a fixed span and reduces the computation with respect to the dataset extent and size.
Implementation considerations
Reading the docs, it seems rv_continuous is primarily designed for implementing new continuous variables with an analytical definition.
In any case, the class provides automatic resolution/integration for the other statistics when the underlying methods are not implemented (overridden).
When choosing this methodology, it is up to you to implement any missing logic if you wish to make it more performant and numerically robust.
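As one example of such missing logic, here is a sketch of a _cdf override that delegates to the KDE's own integrator (integrate_box_1d) instead of letting rv_continuous integrate _pdf numerically. The subclass name is mine, and this assumes one-dimensional data:
class KDEDistWithCDF(KDEDist):

    def _cdf(self, x):
        # one integrate_box_1d call per point; still touches every observation,
        # but avoids the generic quadrature performed by rv_continuous
        return np.array([self._kde.integrate_box_1d(-np.inf, v)
                         for v in np.atleast_1d(x)])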
Histogram instead of KDE
If you can relax the KDE requirement and are satisfied with a histogram distribution, then you can rely on rv_histogram, which essentially does the same thing based on the binned distribution:
hist = np.histogram(data, bins=100)
hist_dist = stats.rv_histogram(hist)
stats.ks_2samp(data, hist_dist.rvs(size=100))
# KstestResult(statistic=0.0577, pvalue=0.8778871545532821)
KDE Histogram
Provided it is theoretically acceptable, we can mix both strategies by creating the expected histogram from the KDE:
hist = np.histogram(data, bins=1000)
hist_kde = kde.pdf(hist[1][:-1] + np.diff(hist[1])/2)  # evaluate the KDE at the bin centers
hist_dist_kde = stats.rv_histogram([hist_kde, hist[1]])
stats.ks_2samp(data, hist_dist_kde.rvs(size=100))
# KstestResult(statistic=0.1067, pvalue=0.19541766226890545)
The CDF is then relatively smooth with respect to the KDE (it is still a histogram), and the continuous-variable object is as performant as rv_histogram can be.

dtreeviz replace plot, regression too many points

How can I replace the node-plots from dtreeviz by a custom plot function from me?
Alternatively: I want to replace the dtreeviz plots with a 2D histogram: y-axis = y-values, x-axis = values of the split feature, with a grid over the plot where each grid cell is colored by the number of samples it contains. (If that is already implemented in some package, that would also be great.) In matplotlib the plotting function for this is called hist2d().
I use sklearn to learn a regression decision tree and visualize the results with dtreeviz.
MWE: (see https://github.com/parrt/dtreeviz#regression-decision-tree)
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *
regr = tree.DecisionTreeRegressor(max_depth=2)
boston = load_boston()
regr.fit(boston.data, boston.target)
viz = dtreeviz(regr,
               boston.data,
               boston.target,
               target_name='price',
               feature_names=boston.feature_names)
viz.view()
Now I have millions of samples in my problem, and the resulting .svg is extremely slow (read 'impossible') to display. I could only use that visualization after downsampling.
Example 2d histogram:
(From https://matplotlib.org/gallery/scales/power_norm.html#sphx-glr-gallery-scales-power-norm-py)
Sorry, but you would have to alter the software as it was not designed to have plug-and-play node figures. It was extremely difficult to convince all of the tools in the chain to work together, even without allowing such flexibility.
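That said, if the goal is only the per-node 2D histogram rather than the full dtreeviz diagram, you can build it yourself from the fitted tree with matplotlib's hist2d. A rough sketch; the node id, bin count, and fetch_california_housing (load_boston has been removed from recent scikit-learn versions) are illustrative choices, not part of the question:
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data.data, data.target
regr = tree.DecisionTreeRegressor(max_depth=2).fit(X, y)

node_id = 0                                         # root node, purely illustrative
feature = regr.tree_.feature[node_id]               # feature used for the split at this node
node_indicator = regr.decision_path(X)              # sparse (n_samples, n_nodes) matrix
mask = node_indicator[:, node_id].toarray().ravel().astype(bool)  # samples reaching the node

fig, ax = plt.subplots()
ax.hist2d(X[mask, feature], y[mask], bins=50)       # cell color = number of samples
ax.axvline(regr.tree_.threshold[node_id], color='w')
ax.set_xlabel(data.feature_names[feature])
ax.set_ylabel('target')
plt.show()
Because only binned counts are drawn, this stays responsive even with millions of samples, unlike a scatter-heavy svg.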

Can't fit Poisson to histogram in Python

I've looked at a bunch of examples on here and tried using snippets of other codes, but they're not working for me. I have 4 data sets, but I'll include just one here. My professor told me that the data appeared to be Poisson distributed, so I am trying to fit a Poisson to a histogram of the data. Here is my code:
######## Poisson fit ########
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.special import factorial
data = data59[4]
# normed= has been removed from matplotlib; density=True normalizes the histogram
entries, bin_edges, patches = plt.hist(data, bins=60, range=[1, 10], density=True)
bin_middles = 0.5*(bin_edges[1:] + bin_edges[:-1])

def poisson(k, lamb):
    return np.exp(-lamb)*(lamb**k)/factorial(k)

popt, pcov = curve_fit(poisson, bin_middles, entries)
x = np.linspace(1, 10, 100)
plt.plot(x, poisson(x, *popt))
plt.show()
I tried plotting other distributions on top of the histogram, like normal and Rayleigh, using scipy.stats instead of curve_fit. Those only sort of worked because they have a scale parameter, which scipy.stats.poisson doesn't. The resulting distribution comes out looking exactly the same as the curve_fit one. I'm not sure how to resolve this issue. Perhaps the data is not even Poisson distributed!
Thanks for helping!!
Update: The data is IceCube data from the TXS 0506+056 blazar. I used SkyDrive to get a URL for the file. I hope it works. The first column is the modified Julian day and the last column is the log of the energy proxy. I am using the last column. I have null and alternative hypotheses surrounding this data and am using maximum likelihood estimation (from a certain distribution, Poisson in my first case) to analyze the data.
Also, here is where I got the data: https://icecube.wisc.edu/science/data/TXS0506_point_source
The data presented in your histogram does not have a Poisson distribution. The Poisson is a counting distribution (what's the probability of getting 0, 1, 2, ... observations per unit of time or space), and its support is the non-negative integers. Your histogram clearly shows that you have fractional values, since there are spikes of different non-zero heights at non-integer locations.
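If you still want a maximum-likelihood fit, the continuous distributions in scipy.stats provide one through their .fit() method. A minimal sketch; the gamma distribution and the synthetic data below are purely illustrative, not a claim about the IceCube energy proxies:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = stats.gamma(a=3, scale=1.5).rvs(size=5000, random_state=0)  # stand-in data

shape, loc, scale = stats.gamma.fit(data)   # maximum-likelihood estimates
x = np.linspace(data.min(), data.max(), 200)

plt.hist(data, bins=60, density=True, alpha=0.5)
plt.plot(x, stats.gamma.pdf(x, shape, loc=loc, scale=scale))
plt.show()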

Most accurate way to interpolate data and find the peak?

The data I have is always on a second degree polynomial (quadratic function). I want to find the peak of the interpolated function as accurately as possible.
So far I've been using interp1d and then extracting the peak value using linspace and a simple for loop. Although you can use a large number of newly generated samples in linspace, you can still be more precise by using the derivative of the fitted polynomial. I haven't found a way to do that with interp1d.
Now the only function I've found that returns the fitted polynomial coefficients is polyfit, but this fitted function is quite inaccurate (most of the time the function doesn't even go through the data points).
I've tried using UnivariateSpline and the fitted function seems to be quite accurate and it's very simple to get the derivative spline and its root.
Other polynomial-fitting functions (BarycentricInterpolator, KroghInterpolator, ...) state that they do not compute polynomial coefficients, for reasons of numerical stability.
How accurate is UnivariateSpline and its derivatives, or are there any better options out there?
If all you need is to find the min/max of a second degree polynomial why not do this:
import matplotlib.pyplot as plt
from scipy.interpolate import KroghInterpolator
import numpy as np
x = range(-20, 20)
y = []
for i in x:
    y.append((i**2) + 25)
# keep only every 5th sample to mimic sparse data
x = x[1::5]
y = y[1::5]
f = KroghInterpolator(x, y)
xfine = np.arange(min(x), max(x), .5)
yfine = f(xfine)
val_interp = min(yfine)
print(val_interp)
plt.scatter(x,y)
plt.plot(xfine, yfine)
plt.show()
In the end I went with polyfit. Although the fitted function didn't go exactly through the data points, the end result was still good. From the returned coefficients I got the desired x and y coordinates of the peak.
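For reference, a minimal sketch of that polyfit approach on made-up data: with the quadratic coefficients, the peak follows directly from setting the derivative to zero, so no dense linspace sampling is needed:
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = -2.0*(x - 1.7)**2 + 5.0            # made-up quadratic with its peak at x = 1.7

a, b, c = np.polyfit(x, y, 2)          # coefficients of a*x**2 + b*x + c
x_peak = -b / (2*a)                    # root of the derivative 2*a*x + b
y_peak = np.polyval([a, b, c], x_peak)
print(x_peak, y_peak)                  # ~1.7, ~5.0 up to floating-point error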

Multidimensional/multivariate dynamic time warping (DTW) library/code in Python

I am working on a time series data. The data available is multi-variate. So for every instance of time there are three data points available.
Format:
| X | Y | Z |
So one time series data in above format would be generated real time. I am trying to find a good match of this real time generated time series within another time series base data, which is already stored (which is much larger in size and was collected at a different frequency). If I apply standard DTW to each of the series (X,Y,Z) individually they might end up getting a match at different points within the base database, which is unfavorable. So I need to find a point in base database where all three components (X,Y,Z) match well and at the same point.
I have researched the matter and found that multidimensional DTW is a perfect solution to such a problem. In R the dtw package does include multidimensional DTW, but I have to implement it in Python. The R-Python bridging package rpy2 can probably be of help here, but I have no experience in R. I have looked through the available DTW packages in Python, like mlpy and dtw, but they are of no help. Can anyone suggest a package in Python to do the same, or share code for multidimensional DTW using rpy2?
Thanks in advance!
Thanks @lgautier, I dug deeper and found an implementation of multivariate DTW using rpy2 in Python. Just passing the template and query as 2D matrices (matrices as in R) allows the rpy2 dtw package to do a multivariate DTW. Also, if you have R installed, loading the R dtw library and running "?dtw" gives access to the library's documentation and the different functionalities it provides.
For future reference to other users with similar questions:
Official documentation of R dtw package: https://cran.r-project.org/web/packages/dtw/dtw.pdf
Sample code passing two 2D matrices for multivariate DTW; the open_begin and open_end arguments enable subsequence matching:
import numpy as np
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
import rpy2.robjects as robj
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
template = np.array([[1,2,3,4,5],[1,2,3,4,5]]).transpose()
rt,ct = template.shape
query = np.array([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]]).transpose()
rq,cq = query.shape
#converting numpy matrices to R matrices
templateR=R.matrix(template,nrow=rt,ncol=ct)
queryR=R.matrix(query,nrow=rq,ncol=cq)
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(templateR,queryR,keep=True, step_pattern=R.rabinerJuangStepPattern(4,"c"),open_begin=True,open_end=True)
dist = alignment.rx('distance')[0][0]
print(dist)
It seems like tslearn's dtw_path() is exactly what you are looking for. To quote the docs:
Compute Dynamic Time Warping (DTW) similarity measure between (possibly multidimensional) time series and return both the path and the similarity.
[...]
It is not required that both time series share the same size, but they must be the same dimension. [...]
The implementation they provide follows:
H. Sakoe, S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26(1), pp. 43–49, 1978.
I think it is a good idea to try out a method in whatever implementation is already available before considering whether it is worth working on a reimplementation.
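A minimal usage sketch (the random series below are placeholders; each series is an array of shape (n_timestamps, n_features), and the two lengths may differ):
import numpy as np
from tslearn.metrics import dtw_path

s1 = np.random.default_rng(0).normal(size=(20, 3))   # e.g. an (X, Y, Z) stream
s2 = np.random.default_rng(1).normal(size=(35, 3))

path, dist = dtw_path(s1, s2)   # path: list of (i, j) index pairs, dist: DTW score
print(dist)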
Did you try the following?
from rpy2.robjects.packages import importr
# You'll obviously need the R package "dtw" installed with your R
dtw = importr("dtw")
# all functions and objects in the R package "dtw" are now available
# with `dtw.<function or object>`
I happened upon this post and thought I would provide some updated information in case anyone else is trying to find a way to do multivariate DTW in Python. The DTAIDistance package has an option to perform multivariate DTW.
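If I remember the DTAIDistance API correctly, the multivariate case lives in its dtw_ndim module; a rough sketch to be checked against the package documentation, with placeholder series of shape (n_timestamps, n_dims):
import numpy as np
from dtaidistance import dtw_ndim

s1 = np.random.default_rng(0).normal(size=(20, 3))
s2 = np.random.default_rng(1).normal(size=(35, 3))

print(dtw_ndim.distance(s1, s2))   # dependent multivariate DTW distance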
