scipy -- how to integrate a linearly interpolated function? - python

I have a function which is an interpolation of a relatively large set of data. I use linear interpolation (interp1d), so the interpolant has a lot of non-smooth sharp points. The quad function from scipy gives warnings because of those sharp points. How can I do the integration without the warnings?
Thank you!
Thanks for all the answers. Here I summarize the solutions in case others run into the same problem:
Just like what @Stelios did, use points to avoid warnings and to get a more accurate result.
In practice the number of points is usually larger than the default limit (limit=50) of quad, so I call quad(f_interp, a, b, limit=2*p.shape[0], points=p) to avoid all those warnings.
If a and b are not the start and end points of the data set x, the breakpoints p can be chosen by p = x[(x >= a) & (x <= b)] (a plain Python and does not work element-wise on arrays).
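A minimal sketch of that recipe (the data set here is my own illustrative choice, not the one from the question):
import numpy as np
from scipy.integrate import quad
from scipy.interpolate import interp1d

# illustrative data set; x must be sorted for interp1d
x = np.linspace(0.0, 10.0, 200)
y = np.abs(np.sin(x))
f_interp = interp1d(x, y)

a, b = 1.3, 8.7                      # integration limits inside the data range
p = x[(x >= a) & (x <= b)]           # breakpoints of the interpolant in [a, b]

# raise the subdivision limit so quad can split at every breakpoint
result, error = quad(f_interp, a, b, limit=2*p.shape[0], points=p)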

quad accepts an optional argument, called points. According to the documentation:
points : (sequence of floats,ints), optional
A sequence of break points in the bounded integration interval where
local difficulties of the integrand may occur (e.g., singularities,
discontinuities). The sequence does not have to be sorted.
In your case, the "difficult" points are exactly the x-coordinates of the data points. Here is an example:
import numpy as np
from scipy.integrate import quad
np.random.seed(123)
# generate random data set
x = np.arange(0,10)
y = np.random.rand(10)
# construct a linear interpolation function of the data set
f_interp = lambda xx: np.interp(xx, x, y)
Here is a plot of the data points and f_interp:
Now calling quad as
quad(f_interp,0,9)
returns a series of warnings along with
(4.89770017785734, 1.3762838395159349e-05)
If you provide the points argument, i.e.,
quad(f_interp,0,9, points = x)
it issues no warnings and the result is
(4.8977001778573435, 5.437539505167948e-14)
which also comes with a much smaller error estimate, i.e. a much more accurate result than the previous call.

Instead of interp1d, you could use scipy.interpolate.InterpolatedUnivariateSpline. That interpolator has the method integral(a, b) that computes the definite integral.
Here's an example:
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
# Create some test data.
x = np.linspace(0, np.pi, 21)
np.random.seed(12345)
y = np.sin(1.5*x) + np.random.laplace(scale=0.35, size=len(x))**3
# Create the interpolator. Use k=1 for linear interpolation.
finterp = InterpolatedUnivariateSpline(x, y, k=1)
# Create a finer mesh of points on which to compute the integral.
xx = np.linspace(x[0], x[-1], 5*len(x))
# Use the interpolator to compute the integral from 0 to t for each
# t in xx.
qq = [finterp.integral(0, t) for t in xx]
# Plot stuff
p = plt.plot(x, y, '.', label='data')
plt.plot(x, y, '-', color=p[0].get_color(), label='linear interpolation')
plt.plot(xx, qq, label='integral of linear interpolation')
plt.grid()
plt.legend(framealpha=1, shadow=True)
plt.show()
The plot:
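As a quick cross-check (my addition, not part of the original answer), the spline's own integral should agree with quad from the first answer once quad is told where the breakpoints are:
from scipy.integrate import quad

# both values should agree to near machine precision
print(finterp.integral(x[0], x[-1]))
print(quad(finterp, x[0], x[-1], points=x)[0])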

Related

How to obtain coordinates of maximum density

How can I get the coordinates of the point in space with the greatest density?
I have this code to generate random points and analyze the density around them.
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N)

x_data = random_data(50)
y_data = random_data(50)
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]), bw_method=0.05)
plt.plot(x_data, y_data, 'ro')
df = pd.DataFrame({"x": x_data, "y": y_data})
p = sns.jointplot(data=df, x='x', y='y', kind='kde')
plt.show()
Thank you for help. :)
For starters, let me state the obvious by saying that sns.jointplot computes the kernel density on its own, so your kernel variable is as yet unused.
Here's what sns.jointplot generated for me with a random sample:
There's a nice maximum at around (7, 5.4).
Here's what your kernel corresponds to:
x,y = np.mgrid[:10:100j, :10:100j] # 100 x 100 grid for plotting
z = kernel.pdf(np.array([x.ravel(),y.ravel()])).reshape(x.shape)
fig,ax = plt.subplots()
ax.contourf(x, y, z, levels=10)
ax.axis('scaled')
This will clearly not do: the density contains peaks centered around your input points; you will never get an estimate similar to what sns.jointplot gave you.
We can easily fix this: you just have to drop the custom bw_method argument in the call to gaussian_kde:
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]))
x,y = np.mgrid[:10:100j, :10:100j] # 100 x 100 grid for plotting
z = kernel.pdf(np.array([x.ravel(),y.ravel()])).reshape(x.shape)
fig,ax = plt.subplots()
ax.contourf(x, y, z, levels=10)
ax.axis('scaled')
This looks just the way you want it:
Now you know that kernel.pdf is a bivariate function whose maximum you're looking for.
To find the maximum you should probably use something from scipy.optimize, for instance scipy.optimize.minimize (the trick is to look at the negative of your function, which turns maxima into minima).
Since your function will probably have a few local maxima, finding the global maximum reliably is not trivial. I would either use the aforementioned minimize after first evaluating the function on a sparse mesh over the relevant domain to find a good candidate for the maximum, or use a heavy-weight solver such as differential_evolution, a stochastic solver that is supposed to be good at finding the true global minimum of a function.
Root finding and minimization is always fickle business, so you will have to play around with your real data and available methods to find a reliable workflow that gives you your maximum.
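For example, here is a hedged sketch of the coarse-grid-then-polish approach, reusing the kernel fitted above (the grid resolution and helper names are my choices):
import numpy as np
from scipy import optimize

def neg_kde(p):
    # negative of the KDE: maxima become minima
    return -kernel.pdf(p)[0]

# coarse 30x30 grid over the domain to find a good starting candidate
xg, yg = np.mgrid[0:10:30j, 0:10:30j]
grid = np.vstack([xg.ravel(), yg.ravel()])
best_start = grid[:, kernel.pdf(grid).argmax()]

# polish the best grid candidate with a local minimizer
res = optimize.minimize(neg_kde, best_start)
print("estimated maximum-density point:", res.x)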

Invert interpolation to give the variable associated with a desired interpolation function value

I am trying to invert an interpolated function using scipy's interpolate function. Let's say I create an interpolated function,
import scipy.interpolate as interpolate
interpolatedfunction = interpolate.interp1d(xvariable, data, kind='cubic')
Is there some function that can find x when I specify a:
interpolatedfunction(x) == a
In other words, "I want my interpolated function to equal a; what is the value of xvariable such that my function is equal to a?"
I appreciate I can do this with some numerical scheme, but is there a more straightforward method? What if the interpolated function is multivalued in xvariable?
There are dedicated methods for finding roots of cubic splines. The simplest to use is the .roots() method of InterpolatedUnivariateSpline object:
spl = InterpolatedUnivariateSpline(x, y)
roots = spl.roots()
This finds all of the roots instead of just one, as generic solvers (fsolve, brentq, newton, bisect, etc) do.
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
x = np.arange(20)
y = np.cos(np.arange(20))
spl = InterpolatedUnivariateSpline(x, y)
print(spl.roots())
outputs array([ 1.56669456, 4.71145244, 7.85321627, 10.99554642, 14.13792756, 17.28271674])
However, you want to equate the spline to some arbitrary number a, rather than 0. One option is to rebuild the spline (you can't just subtract a from it):
solutions = InterpolatedUnivariateSpline(x, y - a).roots()
Note that none of this will work with the function returned by interp1d; it does not have a roots method. For that function, using generic methods like fsolve is an option, but you will only get one root at a time from it. In any case, why use interp1d for cubic splines when there are more powerful ways to do the same kind of interpolation?
Non-object-oriented way
Instead of rebuilding the spline after subtracting a from data, one can directly subtract a from spline coefficients. This requires us to drop down to non-object-oriented interpolation methods. Specifically, sproot takes in a tck tuple prepared by splrep, as follows:
from scipy.interpolate import splrep, sproot
tck = splrep(x, y, k=3, s=0)
# tck is a (knots, coefficients, degree) tuple; shift the coefficients by a
tck_mod = (tck[0], tck[1] - a, tck[2])
solutions = sproot(tck_mod)
I'm not sure if messing with tck is worth the gain here, as it's possible that the bulk of computation time will be in root-finding anyway. But it's good to have alternatives.
After creating an interpolated function interp_fn, you can find the value of x where interp_fn(x) == a by finding the roots of the function
interp_fn2 = lambda x: interp_fn(x) - a
There are a number of options to find the roots in scipy.optimize. For instance, to use Newton's method with the initial value starting at 10:
from scipy import optimize
optimize.newton(interp_fn2, 10)
Actual example
Create an interpolated function and then find the roots where fn(x) == 5
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate, optimize
x = np.arange(10)
y = 1 + 6*x - x**2
y2 = 5*np.ones_like(x)
plt.scatter(x,y)
plt.plot(x,y)
plt.plot(x,y2,'k-')
plt.show()
# create the interpolated function, and then the offset
# function used to find the roots
interp_fn = interpolate.interp1d(x, y, 'quadratic')
interp_fn2 = lambda x: interp_fn(x)-5
# to find the roots, we need to supply a starting value
# because there is more than one root in our range, we need
# to supply multiple starting values. They should be
# fairly close to the actual roots
root1, root2 = optimize.newton(interp_fn2, 1), optimize.newton(interp_fn2, 5)
root1, root2
# returns:
(0.76393202250021064, 5.2360679774997898)
If your data are monotonic you might also try the following:
inversefunction = interpolate.interp1d(data, xvariable, kind='cubic')
Mentioning another option because I found this page in a Google search and this option works for my simple use case. Hopefully it'll be of use to someone.
If the function you're interpolating is very simple and always has a 1:1 relationship between y and x, then you can simply swap x and y when you pass them into interp1d, and call the resulting interpolation function in that direction.
Adapting code from https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
x = np.arange(0, 10)
y = np.exp(-x/3.0)
f = interpolate.interp1d(x, y)
xnew = np.arange(0, 9, 0.1)
ynew = f(xnew)
plt.plot(x, y, 'o', xnew, ynew, '-')
plt.show()
Once x and y have been swapped, calling the swapped interpolation function with a gives the x value at which the original function equals a.
f = interpolate.interp1d(y, x)
xnew = np.arange(np.exp(-9/3), np.exp(0), 0.01)
ynew = f(xnew)
plt.plot(y, x, 'o', xnew, ynew, '-')
plt.title("Inverted")
plt.show()
Of course, if the function ever has multiple x values for a given y value (like sine or a parabola) then this will not work because it will no longer be a 1:1 function from x to y, and the above answers are necessary. This is just a simplification in a limited use case.

Discrete fourier transformation from a list of x-y points

What I'm trying to do is, from a list of x-y points that has a periodic pattern, calculate the period. With my limited mathematics knowledge I know that Fourier Transformation can do this sort of thing.
I'm writing Python code.
I found a related answer here, but it uses an evenly-distributed x axis, i.e. dt is fixed, which isn't the case for me. Since I don't really understand the math behind it, I'm not sure if it would work properly in my code.
My question is, does it work? Or, is there some method in numpy that already does my work? Or, how can I do it?
EDIT: All values are Pythonic float (i.e. double-precision)
For samples that are not evenly spaced, you can use scipy.signal.lombscargle to compute the Lomb-Scargle periodogram. Here's an example, with a signal whose dominant frequency is 2.5 rad/s.
from __future__ import division
import numpy as np
from scipy.signal import lombscargle
import matplotlib.pyplot as plt
np.random.seed(12345)
n = 100
x = np.sort(10*np.random.rand(n))
# Dominant periodic signal
y = np.sin(2.5*x)
# Add some smaller periodic components
y += 0.15*np.cos(0.75*x) + 0.2*np.sin(4*x+.1)
# Add some noise
y += 0.2*np.random.randn(x.size)
plt.figure(1)
plt.plot(x, y, 'b')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
dxmin = np.diff(x).min()
duration = x.ptp()
freqs = np.linspace(1/duration, n/duration, 5*n)
periodogram = lombscargle(x, y, freqs)
kmax = periodogram.argmax()
print("%8.3f" % (freqs[kmax],))
plt.figure(2)
plt.plot(freqs, np.sqrt(4*periodogram/(5*n)))
plt.xlabel('Frequency (rad/s)')
plt.grid()
plt.axvline(freqs[kmax], color='r', alpha=0.25)
plt.show()
The script prints 2.497 and generates the following plots:
As a starting point (a rough sketch of these steps follows below):
(I assume all coordinates are positive integers; otherwise, map them to a reasonable range like 0..4095)
find the maximum coordinates xMax, yMax in the list
make a 2D array with dimensions yMax, xMax
fill it with zeros
walk through your list, setting the array elements corresponding to each coordinate to 1
take the 2D Fourier transform
look for peculiarities (peaks) in the FT result
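A rough sketch of those steps (the point list and the peak read-out are purely illustrative; real data will need a sensible rasterization resolution):
import numpy as np

# illustrative (x, y) points only; replace with your own list
points = [(3, 5), (7, 2), (11, 9), (15, 5), (19, 2)]

x_max = max(px for px, py in points)
y_max = max(py for px, py in points)

# rasterize: 1 where a point exists, 0 elsewhere
img = np.zeros((y_max + 1, x_max + 1))
for px, py in points:
    img[py, px] = 1

# 2D FFT of the rasterized image
spectrum = np.abs(np.fft.fft2(img))
spectrum[0, 0] = 0.0                 # suppress the DC component
peak = np.unravel_index(spectrum.argmax(), spectrum.shape)
print("strongest frequency bin (row, col):", peak)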
This page from the SciPy docs covers the basics of how the Discrete Fourier Transform works:
http://docs.scipy.org/doc/numpy-1.10.0/reference/routines.fft.html
It also documents the API for computing DFTs. For your case, you should look at how to use fft2.

Use Histogram data to generate random samples in scipy

Suppose I have a process where I push a button, and after a certain amount of time (from 1 to 30 minutes), an event occurs. I then run a very large number of trials, and record how long it takes the event to occur for each trial. This raw data is then reduced to a set of 30 data points where the x value is the number of minutes it took for the event to occur, and the y value is the percentage of trials which fell into that bucket. I do not have access to the original data.
How can I use this set of 30 points to identify an appropriate probability distribution which I can then use to generate representative random samples?
I feel like scipy.stats has all the tools I need built in, but for the life of me I can't figure out how to go about it. Any tips?
If you don't have any prior information about the underlying function that produced the data, I suggest using numpy.polyfit, which fits a polynomial of a given degree.
import matplotlib.pyplot as plt
import numpy as np
y = np.array([ 0.005995184, ...]) # your array
x = np.arange(len(y))
f = np.poly1d(np.polyfit(x, y, 10))
x_new = np.linspace(x[0], x[-1], 30)
y_new = f(x_new)
plt.plot(x,y,'o', x_new, y_new)
plt.xlim([x[0]-1, x[-1] + 1 ])
plt.show()
Here is an example for degree = 10.
To estimate the value at an arbitrary point, you simply evaluate the fitted polynomial:
f(13.5)
which in this case gives:
0.0206996531272
You can also use the histogram, i.e. the piecewise uniform distribution, directly; then you get random numbers that exactly follow that distribution instead of an approximation.
The inverse CDF, ppf, is piecewise linear, and linear interpolation can be used to transform uniform random numbers appropriately.
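A minimal sketch of that idea (probs is assumed to hold the 30 bucket percentages from the question; only the first few values are shown here):
import numpy as np

# probs[i] = fraction of trials that took i+1 minutes
probs = np.array([0.005995184, 0.012209876, 0.028232119])  # ... all 30 values
probs = probs / probs.sum()               # renormalize

# bucket i+1 spans [i+0.5, i+1.5); the CDF rises linearly across each bucket
edges = np.arange(len(probs) + 1) + 0.5
cdf = np.concatenate([[0.0], np.cumsum(probs)])

# transform uniform samples through the piecewise-linear inverse CDF
u = np.random.rand(10000)
samples = np.interp(u, cdf, edges)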
I was able to come up with a solution, but it doesn't feel like a very elegant one. Basically, take the percentage value (y value) for each x value, multiply it by some large number (say, 10,000), then add that many copies of x to an array. Continue through all values of x, ending up with a single giant array. This array can then be fed into the .fit() methods of the scipy.stats.rv_continuous subclasses. I'll leave the question open for now as I feel like there must be a better way.
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np
xRange = 30
x = np.arange(0, xRange + 1)
data = [
    0.005995184, 0.012209876, 0.028232119, 0.04711878, 0.087894128,
    0.116652421, 0.115370764, 0.12774159, 0.109731418, 0.079767439,
    0.068016186, 0.045287033, 0.033403796, 0.029145134, 0.018925806,
    0.013340493, 0.010087069, 0.007998098, 0.00984276, 0.004906083,
    0.004720561, 0.003186032, 0.003028522, 0.002942859, 0.002780096,
    0.002450613, 0.002733441, 0.002217294, 0.002072314, 0.002063246]
# expand the percentages into a giant sample array
y = []
for i in range(len(data)):
    for j in range(int(data[i] * 10000)):
        y = np.append(y, i + 1)
# creating the histogram
plt.figure(num=1, figsize=(22, 12))
h = plt.hist(y, bins=x, density=True)
dist_names = ['burr', 'f', 'rayleigh']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    plt.plot(pdf_fitted, label=dist_name, lw=4)
plt.xlim(0, xRange)
plt.legend(loc='upper right')
plt.show()

Spline representation with scipy.interpolate: Poor interpolation for low-amplitude, rapidly oscillating functions

I need to (numerically) calculate the first and second derivative of a function, for which I've attempted to use both splrep and UnivariateSpline to create splines for the purpose of interpolating the function to take the derivatives.
However, there seems to be an inherent problem in the spline representation itself for functions whose magnitude is of order 10^-1 or lower and that oscillate (rapidly).
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin, pi
k = linspace(0, 6.*pi, num=10000)  # interval (0, 6*pi) in 10'000 steps
y = []
A = 1.e0  # Amplitude of sine function
for i in range(len(k)):
    y.append(A*sin(k[i]))
tck = interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
M = tck(k)
Below are the results for M for A = 1.e0 and A = 1.e-2
http://i.imgur.com/uEIxq.png Amplitude = 1
http://i.imgur.com/zFfK0.png Amplitude = 1/100
Clearly the interpolated function created by the splines is totally incorrect! The 2nd graph does not even oscillate at the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
Cheers,
Rory
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Never mind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...).
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try putting s=0 in instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100)  # interval (0, 6*pi) in 100 steps
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
    yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
    ax.plot(x, yinterp, label='Interpolated')
    ax.plot(x, y, 'bo', label='Original')
    ax.legend()
    ax.set_title(title + ' Smoothing')
plt.show()
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. # of data points * variance. Taking for example s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
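Applied to the sine example above, a short sketch of that rule of thumb (the 0.05 factor is just this answer's illustrative choice):
import numpy as np
from scipy import interpolate

x = np.linspace(0, 6*np.pi, 100)
A = 1.e-4                             # small amplitude, as in the question
y = A * np.sin(x)

# scale-aware smoothing factor: proportional to (# of points) * variance
s = 0.05 * len(y) * np.var(y)
spline = interpolate.UnivariateSpline(x, y, s=s)

# the relative fit error is now independent of the amplitude A
print(np.max(np.abs(spline(x) - y)) / A)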
