I have the following graph that I want to digitize into a high-quality, publication-grade figure using Python and Matplotlib:
I used a digitizer program to grab a few samples from one of the 3 data sets:
x_data = np.array([
1,
1.2371,
1.6809,
2.89151,
5.13304,
9.23238,
])
y_data = np.array([
0.0688824,
0.0490012,
0.0332843,
0.0235889,
0.0222304,
0.0245952,
])
I have already tried three different methods of fitting a curve through these data points. The first was to draw a spline through the points using spline from scipy.interpolate.
This results in (with the actual data points drawn as blue markers):
This is obviously no good.
My second attempt was to fit a curve using a series of polynomials of different orders with curve_fit from scipy.optimize. Even up to a fourth-order polynomial the answer is useless (the lower-order ones were even worse):
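Roughly, the fourth-order attempt looked something like this (a minimal sketch, not the exact code I ran):

import numpy as np
from scipy.optimize import curve_fit

def poly4(x, a, b, c, d, e):
    return a*x**4 + b*x**3 + c*x**2 + d*x + e

popt, _ = curve_fit(poly4, x_data, y_data)
x_smooth = np.linspace(x_data.min(), x_data.max(), 500)
y_fit = poly4(x_smooth, *popt)  # plotting this still wiggles badly through the six uneven samples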
Finally, I used interp1d from scipy.interpolate to interpolate between the data points. Linear interpolation obviously yields the expected result, but the lines are straight and the whole purpose of this exercise is to get a nice smooth curve:
If I then use cubic interpolation I get a rubbish result; quadratic interpolation, however, yields a slightly better one:
But it's not quite there yet, and I don't think interp1d can do higher order interpolation.
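For reference, the interp1d attempts were along these lines (again a sketch; the kinds are assumed to be 'linear', 'quadratic' and 'cubic'):

import numpy as np
from scipy.interpolate import interp1d

x_smooth = np.linspace(x_data.min(), x_data.max(), 500)
for kind in ('linear', 'quadratic', 'cubic'):
    f = interp1d(x_data, y_data, kind=kind)
    y_smooth = f(x_smooth)  # plotted against the markers to compare smoothness and overshoot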
Is there anyone out there who has a good method of doing this? Maybe I would be better off trying to do it in IPE or something?
Thank you!
A standard cubic spline is not very good at producing reasonable-looking interpolations between data points that are very unevenly spaced. Fortunately, there are plenty of other interpolation algorithms, and SciPy provides a number of them. Here are a few applied to your data:
import numpy as np
from scipy.interpolate import UnivariateSpline, Akima1DInterpolator, PchipInterpolator
import matplotlib.pyplot as plt

x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])

x_data_smooth = np.linspace(min(x_data), max(x_data), 1000)

fig, ax = plt.subplots(1, 1)

# Quadratic interpolating spline (s=0 means interpolate, not smooth)
spl = UnivariateSpline(x_data, y_data, s=0, k=2)
y_data_smooth = spl(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'b')

# Akima interpolation: piecewise cubic, resistant to overshoot
bi = Akima1DInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'g')

# PCHIP: piecewise cubic Hermite, keeps the fit monotone between data points
bi = PchipInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'k')

ax.scatter(x_data, y_data)
plt.show()
I suggest looking through these, and a few others, to find one that matches what you think looks right. You may also want to sample a few more points: the PCHIP algorithm, for example, keeps the fit monotonic between data points, so digitizing your minimum point would be useful (and probably a good idea regardless of the algorithm you use).
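As a quick check (a small sketch reusing the variables from the code above, where y_data_smooth holds the PCHIP values), you can see where the fitted curve bottoms out on the dense grid; that tells you roughly where an extra digitized sample would help most:

i_min = np.argmin(y_data_smooth)  # index of the fitted minimum on the dense grid
print(x_data_smooth[i_min], y_data_smooth[i_min])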
I am trying to interpolate a cumulated distribution of, e.g., i) number of people against ii) number of owned cars, showing that, for example, the top 20% of people own much more than 20% of all cars (of course 100% of people own 100% of cars). I also know that there are, say, 100mn people and 200mn cars.
Now coming to my code:
#import libraries (more than required here)
import pandas as pd
from scipy import interpolate
from scipy.interpolate import interp1d
from sympy import symbols, solve, Eq
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
curve=pd.read_excel('inputs.xlsx',sheet_name='inputdata')
Input data: Curveplot (cumulated people (x) on the left // cumulated cars (y) on the right)
#Input data in list form (I am not sure how to interpolate from a list for the moment)
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
# points holds the two data columns read from the Excel sheet above
# (the same values as the two lists)
x, y = points[:,0], points[:,1]
interpolation = interp1d(x, y, kind = 'cubic')
number_of_people_mn= 100000000
oneperson = 1 / number_of_people_mn
dataset = pd.DataFrame(range(number_of_people_mn + 1))
dataset.columns = ["nr_of_one_person"]
dataset.drop(dataset.index[:1], inplace=True)
#calculating the position of every single person on the cumulated x-axis (between 0 and 1)
dataset["cumulatedpeople"] = dataset["nr_of_one_person"] / number_of_people_mn
#finding the "cumulatedcars" to the "cumulatedpeople" via interpolation (between 0 and 1)
dataset["cumulatedcars"] = interpolation(dataset["cumulatedpeople"])
plt.plot(dataset["cumulatedpeople"], dataset["cumulatedcars"])
plt.legend(['Cubic interpolation'], loc = 'best')
plt.xlabel('Cumulated people')
plt.ylabel('Cumulated cars')
plt.title("People-to-car cumulated curve")
plt.show()
However, when looking at the actual plot, I get the following result, which is wrong: Cubic interpolation
In fact, the curve should look almost like the one from a linear interpolation of the exact same input data; however, that is not accurate enough for my purposes: Linear interpolation
Is there any relevant step I am missing, or what would be the best way to get an accurate interpolation from these inputs that looks almost like the linear one?
Short answer: your code is doing the right thing, but the data is unsuitable for cubic interpolation.
Let me explain. Here is your code, which I simplified for clarity:
import numpy as np
from scipy.interpolate import interp1d
from matplotlib import pyplot as plt

cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars = [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]

interpolation = interp1d(cumulatedpeople, cumulatedcars, kind='cubic')

number_of_people_mn = 100  # reduced from 100000000 to keep the example fast
cumppl = np.arange(number_of_people_mn + 1) / number_of_people_mn
cumcars = interpolation(cumppl)

plt.plot(cumppl, cumcars)                      # interpolated curve
plt.plot(cumulatedpeople, cumulatedcars, 'o')  # original input data
plt.show()
Note the last couple of lines -- I am plotting, on the same graph, both the interpolated results and the input data. Here is the result:
The orange dots are the original data and the blue line is the cubic interpolation. The interpolator passes through all of the points, so technically it is doing the right thing.
Clearly, though, it is not doing what you would want.
The reason for this strange behavior is mostly at the right end, where you have a few x-points that are very close together -- the interpolator produces massive wiggles trying to fit very closely spaced points.
If I remove the two right-most points from the interpolator:
interpolation = interp1d(cumulatedpeople[:-2], cumulatedcars[:-2], kind = 'cubic')
it looks a bit more reasonable:
But one would still argue that linear interpolation is better. The wiggles at the left end appear now because the gaps between the initial x-points are too large.
The moral here is that cubic interpolation should really only be used when the gaps between the x points are roughly the same size.
Your best bet here, I think, is to use something like curve_fit.
A related discussion can be found here.
Specifically, monotone interpolation as explained here yields good results on your data. Copying the relevant bits, you would replace the interpolator with
from scipy.interpolate import pchip
interpolation = pchip(cumulatedpeople, cumulatedcars)
and get a decent-looking fit:
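For completeness, a minimal sketch (my own, reusing the data above) that plots the monotone pchip fit next to plain linear interpolation, so you can confirm the two look similar:

import numpy as np
from scipy.interpolate import pchip, interp1d
from matplotlib import pyplot as plt

cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars = [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]

cumppl = np.linspace(0, 1, 101)
monotone = pchip(cumulatedpeople, cumulatedcars)     # monotone piecewise cubic
linear = interp1d(cumulatedpeople, cumulatedcars)    # kind='linear' is the default

plt.plot(cumppl, monotone(cumppl), label='pchip (monotone)')
plt.plot(cumppl, linear(cumppl), '--', label='linear')
plt.plot(cumulatedpeople, cumulatedcars, 'o', label='input data')
plt.legend()
plt.show()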
I am new to Python and a bit confused about interpolation and least-squares fitting of two ndarrays.
I have two ndarrays:
My final goal is to do a least-squares fit of the modelled spectrum (blue curve) to the observed spectrum (orange curve).
The blue-curve ndarray has the following parameters:
The orange-curve ndarray has the following parameters:
As a first and easiest step I wanted to plot the residuals (the difference) between those two ndarrays, but the problem is that they have different sizes: 391 and 256 points respectively. I've tried to use the numpy.reshape and ndarray.reshape functions, but they lead to errors.
Probably the proper solution is to start by interpolating the blue curve onto the less dense grid of the orange curve. I've tried to use the numpy.interp function, but it also leads to errors.
Something along the lines of the following:
import numpy as np
import matplotlib.pyplot as plt
n_denser = 33
n_coarser = 7
x_denser = np.linspace(0,1,n_denser)
y_denser = np.power(x_denser, 2) + np.random.randn(n_denser)/10.
x_coarser = np.linspace(0,1,n_coarser)
y_coarser = np.power(x_coarser, 2) + np.random.randn(n_coarser)/10. + 0.5
y_dense_interp = np.interp(x_coarser, x_denser, y_denser)
plt.plot(x_denser, y_denser, 'b+-')
plt.plot(x_coarser, y_coarser, 'ro:')
plt.plot(x_coarser, y_dense_interp, 'go')
plt.legend(['dense data', 'coarse data', 'interp data'])
plt.show()
Which returns something like:
Your confusion seems to stem from mixing up the methods you mention. Least squares is not an interpolation method; rather, it is a minimization-based curve-fitting method. One key difference is that with interpolation the curve always passes through the original data points. With least squares this can happen, but it is not generally the case.
Cubic-spline interpolation will give you 'nice' plots if you need to pass through the original data points.
If you want to use least-squares, you need to know what degree polynomial you want to fit. The most common is linear (first order).
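To make the distinction concrete, here is a small sketch with made-up data (your arrays aren't shown) contrasting a cubic-spline interpolant, which passes through the points, with a first-order least-squares fit, which generally does not:

import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 1, 7)
y = np.power(x, 2) + np.random.randn(7) / 10.   # noisy sample points

spline = CubicSpline(x, y)          # interpolation: spline(x) reproduces y exactly
coeffs = np.polyfit(x, y, deg=1)    # least squares: best-fitting straight line
line = np.polyval(coeffs, x)        # generally does not reproduce y

print(np.allclose(spline(x), y))    # True
print(np.allclose(line, y))         # almost certainly False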
How could I get the coordinates of the point in space with the greatest density?
I have this code to generate random points and run a density analysis on them:
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N)

x_data = random_data(50)
y_data = random_data(50)

# Kernel density estimate of the (x, y) samples
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]), bw_method=0.05)

plt.plot(x_data, y_data, 'ro')

df = pd.DataFrame({"x": x_data, "y": y_data})
p = sns.jointplot(data=df, x='x', y='y', kind='kde')
plt.show()
Thank you for your help. :)
For starters, let me state the obvious by saying that sns.jointplot computes the kernel density on its own, so your kernel variable is as of yet unused.
Here's what sns.jointplot generated for me with a random sample:
There's a nice maximum at around (7, 5.4).
Here's what your kernel corresponds to:
x,y = np.mgrid[:10:100j, :10:100j] # 100 x 100 grid for plotting
z = kernel.pdf(np.array([x.ravel(),y.ravel()])).reshape(x.shape)
fig,ax = plt.subplots()
ax.contourf(x, y, z, levels=10)
ax.axis('scaled')
This will clearly not do: the density contains peaks centered around your input points; you will never be able to get an estimate similar to the one sns.jointplot gave you.
We can easily fix this: you just have to drop the custom bw_method argument in the call to gaussian_kde:
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]))
x,y = np.mgrid[:10:100j, :10:100j] # 100 x 100 grid for plotting
z = kernel.pdf(np.array([x.ravel(),y.ravel()])).reshape(x.shape)
fig,ax = plt.subplots()
ax.contourf(x, y, z, levels=10)
ax.axis('scaled')
This looks just the way you want it:
Now you know that this kernel.pdf is a bivariate function for which you're looking for the maximum.
And to find the maximum you should probably use something from scipy.optimize, for instance scipy.optimize.minimize (the trick is to look at the negative of your function, which turns maxima into minima).
Since your function will probably have a few local maxima, finding the global maximum reliably is not trivial. I would either use the aforementioned minimize, but first evaluate the function on a sparse mesh over the relevant domain to find the best candidate for the maximum, or use a heavy-weight solver such as differential_evolution, a stochastic solver that is supposed to be good at finding the true global minimum of a function.
Root finding and minimization is always fickle business, so you will have to play around with your real data and available methods to find a reliable workflow that gives you your maximum.
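A minimal sketch of that workflow, reusing the kernel from above (the mesh size and starting point are just illustrative choices):

import numpy as np
from scipy import optimize

def neg_density(p):
    return -kernel.pdf(p)[0]        # negate the KDE so a maximum becomes a minimum

xg, yg = np.mgrid[:10:30j, :10:30j]             # sparse 30 x 30 mesh over the domain
grid = np.vstack([xg.ravel(), yg.ravel()])
best = grid[:, np.argmax(kernel.pdf(grid))]     # best candidate on the mesh

res = optimize.minimize(neg_density, best)      # polish the candidate locally
print(res.x)                                    # coordinates of the density peak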
I am getting some errors when interpolating with RBF. Here is an example in 1D. I think that it has to do with how close my y values are to each other. Is there any fix for this?
import numpy as np
from scipy.interpolate import Rbf, interp1d
import matplotlib.pyplot as plt
x = np.array([0.77639752, 0.8136646, 0.85093168, 0.88819876, 0.92546584, 0.96273292, 1.])
y = np.array([0.97119742, 0.98089758, 0.98937066, 0.99540737, 0.99917735, 1., 0.99779049])
xi = np.linspace(min(x),max(x),1000)
fig = plt.figure(1)
plt.plot(x,y,'ko', label='Raw Data')
#RBF
rbfi = Rbf(x,y, function='linear')
plt.plot(xi,rbfi(xi), label='RBF (linear)')
rbfi = Rbf(x,y, function='cubic')
plt.plot(xi,rbfi(xi), label='RBF (cubic)')
#1D
f = interp1d(x,y, kind='cubic')
plt.plot(xi,f(xi), label='Interp1D (cubic)')
plt.plot(x,y,'ko', label=None)
plt.grid()
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.tight_layout()
plt.savefig('RBFTest.png')
Indeed, when implemented properly, RBF interpolation using the polyharmonic spline r^3 in 1D coincides with the natural cubic spline and is the "smoothest" interpolant.
Unfortunately, scipy.interpolate.Rbf, despite the name, does not appear to be a correct implementation of the RBF methods known from approximation theory. The error is around the line
self.nodes = linalg.solve(self.A, self.di)
They forgot the (linear) polynomial term in the construction of the polyharmonic RBF! The interpolation system should also include a linear polynomial term, together with the corresponding orthogonality constraints on the RBF weights.
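For illustration only, here is a minimal sketch of that augmented system for the 1D r^3 kernel with a linear polynomial tail (this is not how scipy.interpolate.Rbf is implemented):

import numpy as np

def polyharmonic_interp(x, y, xi):
    # Solve [[A, P], [P.T, 0]] [w; c] = [y; 0] with A_ij = |x_i - x_j|**3 and P_i = [1, x_i]
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xi = np.asarray(xi, dtype=float)
    n = len(x)
    A = np.abs(x[:, None] - x[None, :]) ** 3            # RBF block
    P = np.column_stack([np.ones(n), x])                # linear polynomial block
    M = np.block([[A, P], [P.T, np.zeros((2, 2))]])     # augmented system
    coef = np.linalg.solve(M, np.concatenate([y, np.zeros(2)]))
    w, c = coef[:n], coef[n:]
    B = np.abs(xi[:, None] - x[None, :]) ** 3
    return B @ w + c[0] + c[1] * xi

# e.g., with the x, y arrays from the question:
# yi = polyharmonic_interp(x, y, np.linspace(x.min(), x.max(), 1000))

If the statement above is right, this should behave like the natural cubic spline rather than producing the wild oscillations you saw.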
Now, one shouldn't trust interp1d blindly either. The discussion in "What algorithm is used in the interp1d function in scipy.interpolate?" suggests that it may not be using the natural cubic spline but a different end condition. There is no mention of this in the help page: one needs to go into the Python source, and I'm afraid of what we will find there.
Is there a fix for this?
If it's serious work, make your own implementation of the RBF interpolation algorithm. Or, if you want to try a different implementation in Python, there is apparently one from the University of Michigan: https://rbf.readthedocs.io. If you do, could you post your findings here? If not, you've already done a good service by demonstrating an important SciPy error -- thank you!
I need to (numerically) calculate the first and second derivatives of a function. To do this I've attempted to use both splrep and UnivariateSpline to create a spline representation that I can interpolate and differentiate.
However, it seems that there's an inherent problem in the spline representation itself for functions whose magnitude is of order 10^-1 or lower and that are (rapidly) oscillating.
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin, pi

k = linspace(0, 6.*pi, num=10000)  # interval (0, 6*pi) in 10'000 steps
y = []
A = 1.e0  # amplitude of the sine function
for i in range(len(k)):
    y.append(A*sin(k[i]))

tck = interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
M = tck(k)
Below are the results for M for A = 1.e0 and A = 1.e-2
http://i.imgur.com/uEIxq.png Amplitude = 1
http://i.imgur.com/zFfK0.png Amplitude = 1/100
Clearly the interpolated function created by the splines is totally incorrect! The second graph does not even oscillate at the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
Cheers,
Rory
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Never mind the above bit about aliasing; it doesn't apply in this case (though I still have no idea what x is in your example...).
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try using s=0 instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate

x = np.linspace(0, 6.*np.pi, num=100)  # interval (0, 6*pi) in 100 steps
A = 1.e-4  # amplitude of the sine function
y = A*np.sin(x)

fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
    yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
    ax.plot(x, yinterp, label='Interpolated')
    ax.plot(x, y, 'bo', label='Original')
    ax.legend()
    ax.set_title(title + ' Smoothing')
plt.show()
The reason you only clearly see the effects of smoothing at a low amplitude is the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
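A quick numerical check of that (a sketch assuming x and y are recomputed with A = 1.0 as described):

smoothed = interpolate.UnivariateSpline(x, y, s=2)(x)
exact = interpolate.UnivariateSpline(x, y, s=0)(x)
print(np.max(np.abs(smoothed - y)))  # nonzero: smoothing does not reproduce the data
print(np.max(np.abs(exact - y)))     # essentially zero: s=0 interpolates exactly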
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. the number of data points times the variance. Taking, for example, s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or on the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
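A minimal sketch of that rule of thumb, reusing x and y from the example above:

m = len(y)
s_scaled = 0.05 * m * np.var(y)                       # heuristic, independent of data scaling
spl = interpolate.UnivariateSpline(x, y, s=s_scaled)
# If you can estimate the noise standard deviation std, the docs suggest s roughly in
# [(m - np.sqrt(2*m)) * std**2, (m + np.sqrt(2*m)) * std**2]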