Python Curve Smoothing using Savitzky_Golay - issue

The input data is a CSV file (link). My Python code is pasted below. The curve smoothing does not seem to be working: when I plot the smoothed curve on top of the original data, the two overlap exactly. Could someone please help me resolve the issue? The code uses the Savitzky-Golay algorithm. It extracts the x- and y-axis data from a CSV file and converts it to the data type required by the savitzky_golay function call.
import numpy as np
import csv
from math import factorial
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
#from scipy.interpolate import spline
#import openpyxl
#import pandas as pd
#from scipy.interpolate import interp1d
def savitzky_golay(y, window_size, order, deriv=0, rate=1):
    r"""Smooth (and optionally differentiate) data with a Savitzky-Golay filter.
The Savitzky-Golay filter removes high frequency noise from data.
It has the advantage of preserving the original shape and
features of the signal better than other types of filtering
approaches, such as moving averages techniques.
Parameters
----------
y : array_like, shape (N,)
the values of the time history of the signal.
window_size : int
the length of the window. Must be an odd integer number.
order : int
the order of the polynomial used in the filtering.
Must be less than `window_size` - 1.
deriv: int
the order of the derivative to compute (default = 0 means only smoothing)
Returns
-------
ys : ndarray, shape (N)
the smoothed signal (or its n-th derivative).
Notes
-----
The Savitzky-Golay is a type of low-pass filter, particularly
suited for smoothing noisy data. The main idea behind this
approach is to make, for each point, a least-squares fit with a
polynomial of high order over an odd-sized window centered at
the point.
Examples
--------
t = np.linspace(-4, 4, 500)
y = np.exp( -t**2 ) + np.random.normal(0, 0.05, t.shape)
ysg = savitzky_golay(y, window_size=31, order=4)
import matplotlib.pyplot as plt
plt.plot(t, y, label='Noisy signal')
plt.plot(t, np.exp(-t**2), 'k', lw=1.5, label='Original signal')
plt.plot(t, ysg, 'r', label='Filtered signal')
plt.legend()
plt.show()
References
----------
.. [1] A. Savitzky, M. J. E. Golay, Smoothing and Differentiation of
Data by Simplified Least Squares Procedures. Analytical
Chemistry, 1964, 36 (8), pp 1627-1639.
.. [2] Numerical Recipes 3rd Edition: The Art of Scientific Computing
W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery
Cambridge University Press ISBN-13: 9780521880688
"""
    try:
        window_size = abs(int(window_size))  # np.int was removed from NumPy; use the builtin
        order = abs(int(order))
    except ValueError:  # Python 3 syntax (was "except ValueError, msg:")
        raise ValueError("window_size and order have to be of type int")
    if window_size % 2 != 1 or window_size < 1:
        raise TypeError("window_size size must be a positive odd number")
    if window_size < order + 2:
        raise TypeError("window_size is too small for the polynomials order")
    order_range = range(order+1)
    half_window = (window_size -1) // 2
    # precompute coefficients
    b = np.array([[k**i for i in order_range] for k in range(-half_window, half_window+1)])
    m = np.linalg.pinv(b)[deriv] * rate**deriv * factorial(deriv)
    # pad the signal at the extremes with
    # values taken from the signal itself
    firstvals = y[0] - np.abs( y[1:half_window+1][::-1] - y[0] )
    lastvals = y[-1] + np.abs(y[-half_window-1:-1][::-1] - y[-1])
    y = np.concatenate((firstvals, y, lastvals))
    return np.convolve( m[::-1], y, mode='valid')
CurveName_1 = "Actual"
ind1, ind2 = 0, 0
check = 0
# Raw string so the backslashes in the Windows path are not read as escape sequences
csv_path = r'C:\Users\XYZ\Documents\FileTransfers\Vicky.csv'
# Find the column index of the curve name
for line in open(csv_path):
    csv_row = line.split(",")
    csv_row = [i.strip().replace('"', '') for i in csv_row]
    if CurveName_1 in csv_row:
        ind1 = csv_row.index(CurveName_1)
        check += 1
    if check > 1:
        break
x = []
y = []
with open(csv_path, "r") as file:
    reader = csv.reader(file)
    for idx, line in enumerate(reader):
        if idx > 3:  # rows 0-3 are skipped
            x.append(float(line[ind1]))
            y.append(float(line[0]))
            #t=line[ind1],line[ind2]
            #print t
print(len(x))
print(len(y))
xm = np.array(x)
ym = np.array(y)
#ym = np.array(ym)
yhat = savitzky_golay(ym, 51, 3) # window size 51, polynomial order 3
plt.grid(True)
plt.minorticks_on()
# Customize the major grid
plt.grid(which='major', linestyle='-', linewidth='0.5', color='red')
# Customize the minor grid
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
axes = plt.subplot(111)
axes.xaxis.set_minor_formatter(FormatStrFormatter("%.2f"))
axes.yaxis.set_minor_formatter(FormatStrFormatter("%.2f"))
plt.plot(yhat, xm)
plt.plot(ym,xm, color='red')
plt.show()

Maybe a little late, and maybe not the exact answer to your question, but for a very similar application I use pandas.read_excel to import the data and scipy.signal.savgol_filter for filtering: the less I implement, the more chances it has of working properly...
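As a rough illustration of the suggestion above, here is a minimal sketch using scipy.signal.savgol_filter with the same window size (51) and polynomial order (3) as in the question. The signal is synthetic, because the CSV file is not available here; the names noisy, clean and smoothed are only illustrative.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic noisy signal (a stand-in for the CSV column, which is not available here)
rng = np.random.default_rng(0)
t = np.linspace(-4, 4, 500)
clean = np.exp(-t**2)
noisy = clean + rng.normal(0, 0.05, t.shape)

# Same settings as in the question: window of 51 samples, cubic polynomial
smoothed = savgol_filter(noisy, window_length=51, polyorder=3)

# RMS error against the clean signal, before and after smoothing
err_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
err_smoothed = np.sqrt(np.mean((smoothed - clean) ** 2))
print(err_noisy, err_smoothed)
```

If the smoothed and raw curves still overlap with a call like this, it is worth checking whether the column being filtered is already smooth at the scale of the chosen window.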

Related

scipy curve_fit returns initial parameter estimates

I am trying to use the scipy curve_fit function to fit a Gaussian function to my data in order to estimate a theoretical power spectral density. While doing so, curve_fit always returns the initial parameters (p0=[1,1,1]), which tells me that the fit did not work.
I don't know where the issue is. I am using Python 3.9 (Spyder 5.1.5) from the Anaconda distribution on Windows 11.
Here is a WeTransfer link to the data file:
https://wetransfer.com/downloads/6097ebe81ee0c29ee95a497128c1c2e420220704110130/86bf2d
My code is below. Can someone tell me what the issue is and how I can solve it?
In the plot, the blue curve is my experimental PSD and the orange one is the result of the fit.
import numpy as np
import math
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.constants as cst
File = np.loadtxt('test5.dat')
X = File[:, 1]
Y = File[:, 2]
f_sample = 50000
time=[]
for i in range(1,len(X)+1):
    t=i*(1/f_sample)
    time= np.append(time,t)
N = X.shape[0] # number of observation
N1=int(N/2)
delta_t = time[2] - time[1]
T_mes = N * delta_t
freq = np.arange(1/T_mes, (N+1)/T_mes, 1/T_mes)
freq=freq[0:N1]
fNyq = f_sample/2 # Nyquist frequency
nb = 350
freq_block = []
# discrete Fourier transform
X_ft = delta_t*np.fft.fft(X, n=N)
X_ft=X_ft[0:N1]
plt.figure()
plt.plot(time, X)
plt.xlabel('t [s]')
plt.ylabel('x [micro m]')
# Experimental power spectrum on both raw and blocked data
PSD_X_exp = (np.abs(X_ft)**2/T_mes)
PSD_X_exp_b = []
STD_PSD_X_exp_b = []
for i in range(0, N1+2, nb):
    freq_b = np.array(freq[i:i+nb]) # i-nb:i
    psd_b = np.array(PSD_X_exp[i:i+nb])
    freq_block = np.append(freq_block, (1/nb)*np.sum(freq_b))
    PSD_X_exp_b = np.append(PSD_X_exp_b, (1/nb)*np.sum(psd_b))
STD_PSD_X_exp_b = np.append(STD_PSD_X_exp_b, PSD_X_exp_b/np.sqrt(nb))
plt.figure()
plt.loglog(freq, PSD_X_exp)
plt.legend(['Raw Experimental PSD'])
plt.xlabel('f [Hz]')
plt.ylabel('PSD')
plt.figure()
plt.loglog(freq_block, PSD_X_exp_b)
plt.legend(['Experimental PSD after blocking'])
plt.xlabel('f [Hz]')
plt.ylabel('PSD')
kB = cst.k # Boltzmann constant [m^2kg/s^2K]
T = 273.15 + 25 # Temperature [K]
r = (2.8 / 2) * 1e-6 # Particle radius [m]
v = 0.00002414 * 10 ** (247.8 / (-140 + T)) # Water viscosity [Pa*s]
gamma = np.pi * 6 * r * v # [m*Pa*s]
Do = kB*T/gamma # expected value for D
f3db_o = 50000 # expected value for f3db
fc_o = 300 # expected value for fc
n = np.arange(-10,11)
def theo_spectrum_lorentzian_filter(x, D_, fc_, f3db_):
    PSD_theo=[]
    for i in range(0,len(x)):
        # print(i)
        psd_theo=np.sum((((D_*Do)/2*math.pi**2)/((fc_*fc_o)**2+(x[i]+n*f_sample)
                ** 2))*(1/(1+((x[i]+n*f_sample)/(f3db_*f3db_o))**2)))
        PSD_theo= np.append(PSD_theo,psd_theo)
    return PSD_theo
popt, pcov = curve_fit(theo_spectrum_lorentzian_filter, freq_block, PSD_X_exp_b, p0=[1, 1, 1], sigma=STD_PSD_X_exp_b, absolute_sigma=True, check_finite=True,bounds=(0.1, 10), method='trf', jac=None)
D_, fc_, f3db_ = popt
D1 = D_*Do
fc1 = fc_*fc_o
f3db1 = f3db_*f3db_o
print('Diffusion constant D = ', D1, ' Corner frequency fc= ',fc1, 'f3db(diode,eff)= ', f3db1)

I believe I've successfully fitted your data. Here's the approach I took.
First, I plotted your model (with popt=[1, 1, 1]) and the data you had. I noticed your data was significantly lower than the model. Then I started fiddling with the parameters. I wanted to push the model upwards. I did that by multiplying popt[0] by increasingly large values. I ended up with 1E13 as a ballpark value. Note that I have no idea if this is physically possible for your model. Then I jury-rigged your fitting function to multiply D_ by 1E13 and ran your code. I got this fit:
So I believe it's a problem of 1) inappropriate starting values and 2) inappropriate bounds. In your position, I would revise this model, check if there's any problems with units and so on.
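To illustrate the point about starting values and bounds, here is a minimal sketch on synthetic data. It is not the model from the question: lorentzian, A_REF and FC_REF are hypothetical names, and the reference scales are chosen so that the true dimensionless parameters are of order one and lie inside the bounds. Under those assumptions, curve_fit moves away from p0 and recovers the parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical reference scales (the question uses Do, fc_o and f3db_o for this purpose)
A_REF = 1e4
FC_REF = 100.0

def lorentzian(f, a, c):
    """Lorentzian PSD written in terms of dimensionless parameters a and c."""
    return (a * A_REF) / ((c * FC_REF) ** 2 + f ** 2)

# Synthetic "measurement" with 2 % multiplicative noise
rng = np.random.default_rng(0)
f = np.logspace(0, 4, 200)
true_a, true_c = 5.0, 3.0
y = lorentzian(f, true_a, true_c) * (1 + 0.02 * rng.standard_normal(f.size))
sigma = 0.02 * y

# p0 = [1, 1] works here because the true values are within the bounds
# and within roughly an order of magnitude of the starting point
popt, pcov = curve_fit(lorentzian, f, y, p0=[1, 1], sigma=sigma,
                       absolute_sigma=True, bounds=(0.01, 100))
print(popt)
```

If the reference scales were off by many orders of magnitude (as the factor of 1E13 above suggests for the original model), the best-fit values would fall outside bounds such as (0.1, 10) and the optimizer could not reach them.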
Here's what I used to try to fit your model:
plt.figure()
plt.loglog(freq_block[:170], PSD_X_exp_b[:170], label='Exp')
plt.loglog(freq_block[:170],
theo_spectrum_lorentzian_filter(
freq_block[:170],
1E13*popt[0], popt[1], popt[2]),
label='model'
)
plt.xlabel('f [Hz]')
plt.ylabel('PSD')
plt.legend()
I limited the data to point 170 because there were some weird backwards values that made me uncomfortable. I would recheck them if I were you.
Here's the model code I used. I didn't change the curve_fit call (except to limit x to :170).
def theo_spectrum_lorentzian_filter(x, D_, fc_, f3db_):
    PSD_theo=[]
    D_ = 1E13*D_ # I only changed here
    for i in range(0,len(x)):
        psd_theo=np.sum((((D_*Do)/2*math.pi**2)/((fc_*fc_o)**2+(x[i]+n*f_sample)
                ** 2))*(1/(1+((x[i]+n*f_sample)/(f3db_*f3db_o))**2)))
        PSD_theo= np.append(PSD_theo,psd_theo)
    return PSD_theo

Scaling x-axis after IFFT-FFT

See the edit below for details.
I have a dataset on which I need to perform an IFFT, cut out the valuable part of it (by multiplying with a Gaussian curve), then FFT back.
The data starts in the angular frequency domain, so an IFFT leads to the time domain. FFT-ing back should then lead to angular frequency again, but I can't find a way to get back the original domain. Of course, it's easy for the y-values:
yf = np.fft.ifft(y)
# cut out the valuable part here
np.fft.fft(yf)
For the x-value transforms I'm using np.fft.fftfreq the following way:
# x is in ang. frequency domain, that's the reason for the 2*np.pi division
t = np.fft.fftfreq(len(x), d=(x[1]-x[0])/(2*np.pi))
However doing
x = np.fft.fftfreq(len(t), d=2*np.pi*(t[1]-t[0]))
does not give me back the original x values at all. Is there something I'm misunderstanding?
The question can be asked generalized, for example:
import numpy as np
x = np.arange(100)
xx = np.fft.fftfreq(len(x), d = x[1]-x[0])
# how to get back the original x from xx? Is it even possible?
I've tried using a temporary variable in which I store the original x values, but that's not very elegant. I'm looking for some kind of inverse of fftfreq, and in general the best possible solution to this problem.
Thank you.
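For the generalized question, one observation may help: np.fft.fftfreq(n, d) depends only on the number of samples n and the spacing d, and returns a grid with spacing 1/(n*d). The spacing of the original grid can therefore be recovered from the output, but the offset x[0] cannot, because it is never passed in. A minimal sketch under that observation (x_recovered and dx_recovered are illustrative names):

```python
import numpy as np

# Evenly spaced grid with a nonzero offset
x = np.arange(100) * 0.25 + 3.0
n = len(x)
dx = x[1] - x[0]

# Conjugate grid; its spacing is 1 / (n * dx)
xx = np.fft.fftfreq(n, d=dx)

# Recover the original spacing from the conjugate grid
dxx = xx[1] - xx[0]
dx_recovered = 1.0 / (n * dxx)

# The offset x[0] is not encoded in xx, so it has to be kept separately
x_recovered = x[0] + dx_recovered * np.arange(n)
```

So storing at least the first value of the original axis (or the whole axis) seems unavoidable; the same reasoning applies with the extra 2*np.pi factors for angular frequency.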
EDIT:
I will provide the code at the end.
I have a dataset which has angular frequency on the x axis and intensity on the y axis. I want to perform an IFFT to change to the time domain. Unfortunately the x values are not evenly spaced, so a (linear) interpolation is needed before the IFFT. In the time domain the transform then looks like this:
The next step is to cut out one of the symmetrical spikes with a Gaussian curve, then FFT back to the angular frequency domain (the same one where we started). My problem is that after I transform the x-axis for the IFFT (which I think is correct), I can't get back into the original angular frequency domain. Here is the code, which also includes the generator for the dataset.
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy.interpolate import interp1d
C_LIGHT = 299.792
# for easier case, this is zero, so it can be ignored.
def _disp(x, GD=0, GDD=0, TOD=0, FOD=0, QOD=0):
    return x*GD+(GDD/2)*x**2+(TOD/6)*x**3+(FOD/24)*x**4+(QOD/120)*x**5
# the generator to make sample datasets
def generator(start, stop, center, delay, GD=0, GDD=0, TOD=0, FOD=0, QOD=0, resolution=0.1, pulse_duration=15, chirp=0):
    window = (np.sqrt(1+chirp**2)*8*np.log(2))/(pulse_duration**2)
    lamend = (2*np.pi*C_LIGHT)/start
    lamstart = (2*np.pi*C_LIGHT)/stop
    lam = np.arange(lamstart, lamend+resolution, resolution)
    omega = (2*np.pi*C_LIGHT)/lam
    relom = omega-center
    i_r = np.exp(-(relom)**2/(window))
    i_s = np.exp(-(relom)**2/(window))
    i = i_r + i_s + 2*np.sqrt(i_r*i_s)*np.cos(_disp(relom, GD=GD, GDD=GDD, TOD=TOD, FOD=FOD, QOD=QOD)+delay*omega)
    # since the _disp polynomial is set to be zero, it's just cos(delay*omega)
    return omega, i
def interpol(x,y):
    ''' Simple linear interpolation '''
    xs = np.linspace(x[0], x[-1], len(x))
    intp = interp1d(x, y, kind='linear', fill_value = 'extrapolate')
    ys = intp(xs)
    return xs, ys
def ifft_method(initSpectrumX, initSpectrumY, interpolate=True):
    if len(initSpectrumY) > 0 and len(initSpectrumX) > 0:
        Ydata = initSpectrumY
        Xdata = initSpectrumX
    else:
        raise ValueError
    N = len(Xdata)
    if interpolate:
        Xdata, Ydata = interpol(Xdata, Ydata)
        # the (2*np.pi) division is because we have angular frequency, not frequency
        xf = np.fft.fftfreq(N, d=(Xdata[1]-Xdata[0])/(2*np.pi)) * N * Xdata[-1]/(N-1)
        yf = np.fft.ifft(Ydata)
    else:
        pass # some irrelevant code there
    return xf, yf
def fft_method(initSpectrumX ,initSpectrumY):
    if len(initSpectrumY) > 0 and len(initSpectrumX) > 0:
        Ydata = initSpectrumY
        Xdata = initSpectrumX
    else:
        raise ValueError
    yf = np.fft.fft(Ydata)
    xf = np.fft.fftfreq(len(Xdata), d=(Xdata[1]-Xdata[0])*2*np.pi)
    # the problem is here, where I transform the x values.
    xf = np.fft.ifftshift(xf)
    return xf, yf
# the generated data
x, y = generator(1, 3, 2, delay = 1500, resolution = 0.1)
# plt.plot(x,y)
xx, yy = ifft_method(x,y)
#if the x values are correctly scaled, the two symmetrical spikes should appear exactly at delay value
# plt.plot(xx, np.abs(yy))
#do the cutting there, which is also irrelevant now
# the problem is there, in fft_method. The x values are not the same as before transforms.
xxx, yyy = fft_method(xx, yy)
plt.plot(xxx, np.abs(yyy))
#and it should look like this:
#xs = np.linspace(x[0], x[-1], len(x))
#plt.plot(xs, np.abs(yyy))
plt.grid()
plt.show()

Create BSpline from knots and coefficients

How can a spline be created if only the points and the coefficients are known? I'm using scipy.interpolate.BSpline here, but am open to other standard packages as well. So basically I want to be able to give someone just those short arrays of coefficients for them to be able to recreate the fit to the data. See the failed red-dashed curve below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import BSpline, LSQUnivariateSpline
x = np.linspace(0, 10, 50) # x-data
y = np.exp(-(x-5)**2/4) # y-data
# define the knot positions
t = [1, 2, 4, 5, 6, 8, 9]
# get spline fit
s1 = LSQUnivariateSpline(x, y, t)
x2 = np.linspace(0, 10, 200) # new x-grid
y2 = s1(x2) # evaluate spline on that new grid
# FAILED: try to construct BSpline using the knots and coefficients
k = s1.get_knots()
c = s1.get_coeffs()
s2 = BSpline(t,c,2)
# plotting
plt.plot(x, y, label='original')
plt.plot(t, s1(t),'o', label='knots')
plt.plot(x2, y2, '--', label='spline 1')
plt.plot(x2, s2(x2), 'r:', label='spline 2')
plt.legend()

The fine print under get_knots says:
Internally, the knot vector contains 2*k additional boundary knots.
That means, to get a usable knot array from get_knots, one should add k copies of the left boundary knot at the beginning of array, and k copies of the right boundary knot at the end. Here k is the degree of the spline, which is usually 3 (you asked for LSQUnivariateSpline of default degree, so that's 3). So:
kn = s1.get_knots()
kn = 3*[kn[0]] + list(kn) + 3*[kn[-1]]
c = s1.get_coeffs()
s2 = BSpline(kn, c, 3) # not "2" as in your sample; we are working with a cubic spline
Now, the spline s2 is the same as s1:
Equivalently, kn = 4*[x[0]] + t + 4*[x[-1]] would work: your t list contains only interior knots, so x[0] and x[-1] are added, and then each repeated k times more.
The mathematical reason for the repetition is that B-splines need some room to get built, due to their inductive definition which requires (k-1)-degree splines to exist around every interval in which we define the kth degree spline.
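The recipe above can be checked numerically. The following sketch reuses the data and knots from the question, pads the knot vector from get_knots with k extra copies of each boundary knot, and compares the rebuilt BSpline with the original fit; max_diff is just an illustrative name for the largest deviation.

```python
import numpy as np
from scipy.interpolate import BSpline, LSQUnivariateSpline

# Data and interior knots from the question
x = np.linspace(0, 10, 50)
y = np.exp(-(x - 5) ** 2 / 4)
t = [1, 2, 4, 5, 6, 8, 9]

s1 = LSQUnivariateSpline(x, y, t)  # default degree k=3
k = 3

# get_knots() returns the boundary knots once each plus the interior knots;
# BSpline needs each boundary knot k more times
kn = s1.get_knots()
kn = np.concatenate(([kn[0]] * k, kn, [kn[-1]] * k))
c = s1.get_coeffs()
s2 = BSpline(kn, c, k)

# Compare the two splines on a fine grid
x2 = np.linspace(0, 10, 200)
max_diff = np.max(np.abs(s1(x2) - s2(x2)))
print(max_diff)
```

To hand the fit to someone else, the arrays kn and c together with the degree k are all that BSpline needs.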

Here is a slightly more compact way of doing it if you don't care too much about the details of the knot positions. The tk tuple (knots, coefficients and degree) is what you are looking for. Once tk is in hand, the spline can be reproduced using the y=splev(x,tk,der=0) line.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import splrep, splev
### Input data
x_arr = np.linspace(0, 10, 50) # x-data
y_arr = np.exp(-(x_arr-5)**2/4) # y-data
### Order of spline
order = 3
### Make the spline
tk = splrep(x_arr, y_arr, k=order) # Returns the knots, coefficients and degree
### Evaluate the spline using the knots and coefficients on the domain x
x = np.linspace(0, 10, 1000) # new x-grid
y = splev(x, tk, der=0)
### Plot
f,ax=plt.subplots()
ax.scatter(x_arr, y_arr, label='original')
ax.plot(x,y,label='Spline')
ax.legend(fontsize=15)
plt.show()

Python natural smoothing splines

I am trying to find a python package that would give an option to fit natural smoothing splines with user selectable smoothing factor. Is there an implementation for that? If not, how would you use what is available to implement it yourself?
By natural spline I mean that there should be a condition that the second derivative of the fitted function at the endpoints is zero (linear).
By smoothing spline I mean that the spline should not be 'interpolating' (passing through all the datapoints). I would like to decide the correct smoothing factor lambda (see the Wikipedia page for smoothing splines) myself.
What I have found
scipy.interpolate.CubicSpline [link]: Does natural (cubic) spline fitting. Does interpolation, and there is no way to smooth the data.
scipy.interpolate.UnivariateSpline [link]: Does spline fitting with user selectable smoothing factor. However, there is no option to make the splines natural.
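The two findings above can be seen side by side in a small sketch on synthetic data (the noisy sine and the value s=0.5 are arbitrary choices for illustration). CubicSpline with bc_type='natural' has zero second derivative at both ends but passes through every point, while UnivariateSpline with s > 0 smooths but offers no natural boundary condition.

```python
import numpy as np
from scipy.interpolate import CubicSpline, UnivariateSpline

# Synthetic noisy data, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

# Natural cubic spline: interpolates every point,
# second derivative is zero at both endpoints
natural = CubicSpline(x, y, bc_type='natural')
end_curvature = natural(x[[0, -1]], 2)

# Smoothing spline: s > 0 lets the curve miss the data points,
# but there is no option for natural boundary conditions
smooth = UnivariateSpline(x, y, k=3, s=0.5)
residual = np.sum((smooth(x) - y) ** 2)
print(end_curvature, residual)
```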

After hours of investigation, I did not find any pip installable packages which could fit a natural cubic spline with user-controllable smoothness. However, after deciding to write one myself, while reading about the topic I stumbled upon a blog post by github user madrury. He has written python code capable of producing natural cubic spline models.
The model code is available here (NaturalCubicSpline) with a BSD-licence. He has also written some examples in an IPython notebook.
But since this is the Internet and links tend to die, I will copy the relevant parts of the source code here + a helper function (get_natural_cubic_spline_model) written by me, and show an example of how to use it. The smoothness of the fit can be controlled by using different number of knots. The position of the knots can be also specified by the user.
Example
from matplotlib import pyplot as plt
import numpy as np
def func(x):
    return 1/(1+25*x**2)
# make example data
x = np.linspace(-1,1,300)
y = func(x) + np.random.normal(0, 0.2, len(x))
# The number of knots can be used to control the amount of smoothness
model_6 = get_natural_cubic_spline_model(x, y, minval=min(x), maxval=max(x), n_knots=6)
model_15 = get_natural_cubic_spline_model(x, y, minval=min(x), maxval=max(x), n_knots=15)
y_est_6 = model_6.predict(x)
y_est_15 = model_15.predict(x)
plt.plot(x, y, ls='', marker='.', label='originals')
plt.plot(x, y_est_6, marker='.', label='n_knots = 6')
plt.plot(x, y_est_15, marker='.', label='n_knots = 15')
plt.legend(); plt.show()
The source code for get_natural_cubic_spline_model
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
def get_natural_cubic_spline_model(x, y, minval=None, maxval=None, n_knots=None, knots=None):
    """
Get a natural cubic spline model for the data.
For the knots, give (a) `knots` (as an array) or (b) minval, maxval and n_knots.
If the knots are not directly specified, the resulting knots are equally
spaced within the *interior* of (max, min). That is, the endpoints are
*not* included as knots.
Parameters
----------
x: np.array of float
The input data
y: np.array of float
The output data
minval: float
Minimum of interval containing the knots.
maxval: float
Maximum of the interval containing the knots.
n_knots: positive integer
The number of knots to create.
knots: array or list of floats
The knots.
Returns
--------
model: a model object
The returned model will have following method:
- predict(x):
x is a numpy array. This will return the predicted y-values.
"""
    if knots is not None:  # "if knots:" is ambiguous for a numpy array
        spline = NaturalCubicSpline(knots=knots)
    else:
        spline = NaturalCubicSpline(max=maxval, min=minval, n_knots=n_knots)
    p = Pipeline([
        ('nat_cubic', spline),
        ('regression', LinearRegression(fit_intercept=True))
    ])
    p.fit(x, y)
    return p
class AbstractSpline(BaseEstimator, TransformerMixin):
    """Base class for all spline basis expansions."""
    def __init__(self, max=None, min=None, n_knots=None, n_params=None, knots=None):
        if knots is None:
            if not n_knots:
                n_knots = self._compute_n_knots(n_params)
            knots = np.linspace(min, max, num=(n_knots + 2))[1:-1]
            max, min = np.max(knots), np.min(knots)
        self.knots = np.asarray(knots)
    @property
    def n_knots(self):
        return len(self.knots)
    def fit(self, *args, **kwargs):
        return self
class NaturalCubicSpline(AbstractSpline):
    """Apply a natural cubic basis expansion to an array.
The features created with this basis expansion can be used to fit a
piecewise cubic function under the constraint that the fitted curve is
linear *outside* the range of the knots. The fitted curve is continuously
differentiable to the second order at all of the knots.
This transformer can be created in two ways:
- By specifying the maximum, minimum, and number of knots.
- By specifying the cutpoints directly.
If the knots are not directly specified, the resulting knots are equally
spaced within the *interior* of (max, min). That is, the endpoints are
*not* included as knots.
Parameters
----------
min: float
Minimum of interval containing the knots.
max: float
Maximum of the interval containing the knots.
n_knots: positive integer
The number of knots to create.
knots: array or list of floats
The knots.
"""
    def _compute_n_knots(self, n_params):
        return n_params
    @property
    def n_params(self):
        return self.n_knots - 1
    def transform(self, X, **transform_params):
        X_spl = self._transform_array(X)
        if isinstance(X, pd.Series):
            col_names = self._make_names(X)
            X_spl = pd.DataFrame(X_spl, columns=col_names, index=X.index)
        return X_spl
    def _make_names(self, X):
        first_name = "{}_spline_linear".format(X.name)
        rest_names = ["{}_spline_{}".format(X.name, idx)
                      for idx in range(self.n_knots - 2)]
        return [first_name] + rest_names
    def _transform_array(self, X, **transform_params):
        X = X.squeeze()
        try:
            X_spl = np.zeros((X.shape[0], self.n_knots - 1))
        except IndexError: # For arrays with only one element
            X_spl = np.zeros((1, self.n_knots - 1))
        X_spl[:, 0] = X.squeeze()
        def d(knot_idx, x):
            def ppart(t): return np.maximum(0, t)
            def cube(t): return t*t*t
            numerator = (cube(ppart(x - self.knots[knot_idx]))
                         - cube(ppart(x - self.knots[self.n_knots - 1])))
            denominator = self.knots[self.n_knots - 1] - self.knots[knot_idx]
            return numerator / denominator
        for i in range(0, self.n_knots - 2):
            X_spl[:, i+1] = (d(i, X) - d(self.n_knots - 2, X)).squeeze()
        return X_spl

You could use this numpy/scipy implementation of natural cubic smoothing spline for univariate/multivariate data smoothing. Smoothing parameter should be in range [0.0, 1.0]. If we use smoothing parameter equal to 1.0 we get natural cubic spline interpolant without data smoothing. Also the implementation supports vectorization for univariate data.
Univariate example:
import numpy as np
import matplotlib.pyplot as plt
import csaps
np.random.seed(1234)
x = np.linspace(-5., 5., 25)
y = np.exp(-(x/2.5)**2) + (np.random.rand(25) - 0.2) * 0.3
sp = csaps.UnivariateCubicSmoothingSpline(x, y, smooth=0.85)
xs = np.linspace(x[0], x[-1], 150)
ys = sp(xs)
plt.plot(x, y, 'o', xs, ys, '-')
plt.show()
Bivariate example:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import csaps
xdata = [np.linspace(-3, 3, 61), np.linspace(-3.5, 3.5, 51)]
i, j = np.meshgrid(*xdata, indexing='ij')
ydata = (3 * (1 - j)**2. * np.exp(-(j**2) - (i + 1)**2)
- 10 * (j / 5 - j**3 - i**5) * np.exp(-j**2 - i**2)
- 1 / 3 * np.exp(-(j + 1)**2 - i**2))
np.random.seed(12345)
noisy = ydata + (np.random.randn(*ydata.shape) * 0.75)
sp = csaps.MultivariateCubicSmoothingSpline(xdata, noisy, smooth=0.988)
ysmth = sp(xdata)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(j, i, noisy, linewidths=0.5, color='r')
ax.scatter(j, i, noisy, s=5, c='r')
ax.plot_surface(j, i, ysmth, linewidth=0, alpha=1.0)
plt.show()

The Python package patsy has functions for generating spline bases, including a natural cubic spline basis; these are described in its documentation.
Any library can then be used for fitting a model, e.g. scikit-learn or statsmodels.
The df parameter of cr() can be used to control the smoothness.
Note that too low a df can result in underfitting (see below).
A simple example using scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
from patsy import cr
import matplotlib.pyplot as plt
n_obs = 600
np.random.seed(0)
x = np.linspace(-3, 3, n_obs)
y = 1 / (x ** 2 + 1) * np.cos(np.pi * x) + np.random.normal(0, 0.2, size=n_obs)
def plot_smoothed(df=5):
    # Generate spline basis with different degrees of freedom
    x_basis = cr(x, df=df, constraints="center")
    # Fit model to the data
    model = LinearRegression().fit(x_basis, y)
    # Get estimates
    y_hat = model.predict(x_basis)
    plt.plot(x, y_hat, label=f"df={df}")
plt.scatter(x, y, s=4, color="tab:blue")
for df in (5, 7, 10, 25):
    plot_smoothed(df)
plt.legend()
plt.title(f"Natural cubic spline with varying degrees of freedom")
plt.show()

For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: a Python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the possibility to compute intervals.
Here is an example using SplineSmoother with the natural cubic type:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import *
def func(x):
    return 1/(1+25*x**2)
# make example data
x = np.linspace(-1,1,300)
y = func(x) + np.random.normal(0, 0.2, len(x))
# operate smoothing
smoother = SplineSmoother(n_knots=10, spline_type='natural_cubic_spline')
smoother.smooth(y)
# generate intervals
low, up = smoother.get_intervals('prediction_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
Note that tsmoothie can also smooth multiple time series in a vectorized way.

The programming language R offers a very good implementation of natural cubic smoothing splines. You can use R functions in Python with rpy2:
import rpy2.robjects as robjects
r_y = robjects.FloatVector(y_train)
r_x = robjects.FloatVector(x_train)
r_smooth_spline = robjects.r['smooth.spline'] # extract R function
# run smoothing function
spline1 = r_smooth_spline(x=r_x, y=r_y, spar=0.7)
ySpline=np.array(robjects.r['predict'](spline1,robjects.FloatVector(x_smooth)).rx2('y'))
plt.plot(x_smooth,ySpline)
If you want to directly set lambda: spline1 = r_smooth_spline(x=r_x, y=r_y, lambda=42) doesn't work, because lambda has already another meaning in Python, but there is a solution: How to use the lambda argument of smooth.spline in RPy WITHOUT Python interprating it as lambda.
To get the code running, you first need to define the data x_train and y_train. You can then define x_smooth=np.array(np.linspace(-3,5,1920)) if you want to plot between -3 and 5 at Full HD resolution.
Note that this code is not fully compatible with Jupyter-notebooks for the latest versions of rpy2. You can fix this by using !pip install -Iv rpy2==3.4.2 as described in NotImplementedError: Conversion 'rpy2py' not defined for objects of type '<class 'rpy2.rinterface.SexpClosure'>' only after I run the code twice

Fitting a Weibull distribution using Scipy

I am trying to recreate maximum likelihood distribution fitting, I can already do this in Matlab and R, but now I want to use scipy. In particular, I would like to estimate the Weibull distribution parameters for my data set.
I have tried this:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
def weib(x,n,a):
    return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)
data = np.loadtxt("stack_data.csv")
(loc, scale) = s.exponweib.fit_loc_scale(data, 1, 1)
print(loc, scale)
x = np.linspace(data.min(), data.max(), 1000)
plt.plot(x, weib(x, loc, scale))
plt.hist(data, int(data.max()), density=True)  # the bin count must be an integer
plt.show()
And get this:
(2.5827280639441961, 3.4955032285727947)
And a distribution that looks like this:
I have been using the exponweib after reading this http://www.johndcook.com/distributions_scipy.html. I have also tried the other Weibull functions in scipy (just in case!).
In Matlab (using the Distribution Fitting Tool - see screenshot) and in R (using both the MASS library function fitdistr and the GAMLSS package) I get a (loc) and b (scale) parameters more like 1.58463497 5.93030013. I believe all three methods use the maximum likelihood method for distribution fitting.
I have posted my data here if you would like to have a go! And for completeness I am using Python 2.7.5, Scipy 0.12.0, R 2.15.2 and Matlab 2012b.
Why am I getting a different result!?

My guess is that you want to estimate the shape parameter and the scale of the Weibull distribution while keeping the location fixed. Fixing loc assumes that the values of your data and of the distribution are positive with lower bound at zero.
floc=0 keeps the location fixed at zero, f0=1 keeps the first shape parameter of the exponential weibull fixed at one.
>>> stats.exponweib.fit(data, floc=0, f0=1)
[1, 1.8553346917584836, 0, 6.8820748596850905]
>>> stats.weibull_min.fit(data, floc=0)
[1.8553346917584836, 0, 6.8820748596850549]
The fit looks OK compared to the histogram, but not very good. The parameter estimates are a bit higher than the ones you mention from R and Matlab.
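Since the data file is not available here, the following sketch runs the same two calls on a synthetic sample drawn with known shape and scale (values chosen close to the estimates above). It only illustrates that fixing the location at zero gives a two-parameter Weibull fit, and that exponweib with f0=1 should agree with weibull_min; true_shape and true_scale are illustrative names.

```python
import numpy as np
from scipy import stats

# Synthetic sample with known parameters (stand-in for stack_data.csv)
true_shape, true_scale = 1.85, 6.88
data = stats.weibull_min.rvs(true_shape, loc=0, scale=true_scale,
                             size=5000, random_state=0)

# Two-parameter Weibull MLE: location fixed at zero
shape, loc, scale = stats.weibull_min.fit(data, floc=0)

# Same model through exponweib, with its first shape parameter fixed at one
a, c, loc2, scale2 = stats.exponweib.fit(data, floc=0, f0=1)

print(shape, scale)
print(c, scale2)
```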
Update
The closest I can get to the plot that is now available is with unrestricted fit, but using starting values. The plot is still less peaked. Note values in fit that don't have an f in front are used as starting values.
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> plt.plot(data, stats.exponweib.pdf(data, *stats.exponweib.fit(data, 1, 1, scale=2, loc=0)))
>>> _ = plt.hist(data, bins=np.linspace(0, 16, 33), density=True, alpha=0.5);
>>> plt.show()

It is easy to verify which result is the true MLE, just need a simple function to calculate log likelihood:
>>> def wb2LL(p, x): #log-likelihood
...     return sum(log(stats.weibull_min.pdf(x, p[1], 0., p[0])))
>>> adata=loadtxt('/home/user/stack_data.csv')
>>> wb2LL(array([6.8820748596850905, 1.8553346917584836]), adata)
-8290.1227946678173
>>> wb2LL(array([5.93030013, 1.57463497]), adata)
-8410.3327470347667
The result from the fit method of exponweib and from R's fitdistr (@Warren) is better: it has the higher log-likelihood, so it is more likely to be the true MLE. It is not surprising that the result from GAMLSS is different. It is a completely different statistical model: a Generalized Additive Model.
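The same comparison can be sketched without the pylab namespace. The data below is synthetic (the original file is not available here), so the numbers will differ from those quoted above; weibull_loglik is a hypothetical helper name. The point is only that the fitted parameters should score a higher log-likelihood than any other candidate.

```python
import numpy as np
from scipy import stats

def weibull_loglik(shape, scale, x):
    """Log-likelihood of a two-parameter Weibull (location fixed at zero)."""
    return np.sum(stats.weibull_min.logpdf(x, shape, loc=0, scale=scale))

# Synthetic stand-in for the data set
data = stats.weibull_min.rvs(1.85, loc=0, scale=6.88, size=5000, random_state=1)

# Maximum likelihood estimate with the location fixed at zero
shape_mle, _, scale_mle = stats.weibull_min.fit(data, floc=0)
ll_mle = weibull_loglik(shape_mle, scale_mle, data)

# Other (shape, scale) candidates should not score higher than the MLE
candidates = [(1.58, 5.93),
              (shape_mle * 1.05, scale_mle),
              (shape_mle, scale_mle * 0.95)]
ll_others = [weibull_loglik(k, lam, data) for k, lam in candidates]
print(ll_mle, ll_others)
```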
Still not convinced? We can draw a 2D confidence limit plot around the MLE (see Meeker and Escobar's book for details).
Again this verifies that array([6.8820748596850905, 1.8553346917584836]) is the right answer, as the log-likelihood there is higher than at any other point in the parameter space. Note:
>>> log(array([6.8820748596850905, 1.8553346917584836]))
array([ 1.92892018, 0.61806511])
BTW1, the MLE fit may not appear to fit the histogram tightly. An easy way to think about the MLE is that it is the parameter estimate under which the observed data are most probable. It doesn't need to fit the histogram well visually; that would be an estimate that minimizes the mean squared error instead.
BTW2, your data appears to be leptokurtic and left-skewed, which means the Weibull distribution may not fit your data well. Try, e.g., Gompertz-Logistic, which improves the log-likelihood by about another 100.
Cheers!

I know it's an old post, but I just faced a similar problem and this thread helped me solve it. Thought my solution might be helpful for others like me:
# Fit Weibull function, some explanation below
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# data: 1-D array of observations (e.g. wind speeds), loaded beforehand
params = stats.exponweib.fit(data, floc=0, f0=1)
shape = params[1]
scale = params[3]
print('shape:', shape)
print('scale:', scale)
#### Plotting
# Histogram first
values, bins, hist = plt.hist(data, bins=51, range=(0, 25), density=True)
center = (bins[:-1] + bins[1:]) / 2.
# Using all params and the stats function
plt.plot(center, stats.exponweib.pdf(center, *params), lw=4, label='scipy')
# Using my own Weibull function as a check
def weibull(u, shape, scale):
    '''Weibull distribution for wind speed u with shape parameter k and scale parameter A'''
    return (shape / scale) * (u / scale)**(shape-1) * np.exp(-(u/scale)**shape)
plt.plot(center, weibull(center, shape, scale), label='Wind analysis', lw=2)
plt.legend()
Some extra info that helped me understand:
SciPy's exponweib takes four parameters: the two shape parameters (a, c), plus loc and scale.
You want to fix the loc and the first shape parameter (a), this is done with floc=0,f0=1. Fitting will then give you params c and scale, where c corresponds to the shape parameter of the two-parameter Weibull distribution (often used in wind data analysis) and scale corresponds to its scale factor.
From docs:
exponweib.pdf(x, a, c) =
a * c * (1-exp(-x**c))**(a-1) * exp(-x**c)*x**(c-1)
If a is 1, then
exponweib.pdf(x, a, c) =
c * (1-exp(-x**c))**(0) * exp(-x**c)*x**(c-1)
= c * (1) * exp(-x**c)*x**(c-1)
= c * x **(c-1) * exp(-x**c)
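From this, the relation to the 'wind analysis' Weibull function should be clearer: the formula above is the standardized form (loc=0, scale=1), and replacing x with u/scale and dividing by scale gives (c/scale) * (u/scale)**(c-1) * exp(-(u/scale)**c).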
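The reduction can also be checked numerically. This is a small sketch with arbitrary illustrative values for shape and scale (1.8 and 7.0 are assumptions, not fitted results); it compares exponweib with a=1, weibull_min, and the hand-written formula on the same grid.

```python
import numpy as np
from scipy import stats

def weibull(u, shape, scale):
    """Two-parameter Weibull pdf as used in wind analysis."""
    return (shape / scale) * (u / scale)**(shape - 1) * np.exp(-(u / scale)**shape)

u = np.linspace(0.1, 25, 50)
shape, scale = 1.8, 7.0  # arbitrary illustrative values

pdf_exponweib = stats.exponweib.pdf(u, 1, shape, loc=0, scale=scale)  # a = 1
pdf_weibull_min = stats.weibull_min.pdf(u, shape, loc=0, scale=scale)
pdf_manual = weibull(u, shape, scale)

# Largest absolute differences between the three versions.
print(np.max(np.abs(pdf_exponweib - pdf_manual)))
print(np.max(np.abs(pdf_weibull_min - pdf_manual)))
```

The differences should be at the level of floating-point rounding, which confirms that c plays the role of the two-parameter Weibull shape.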

I was curious about your question. Although this is not really an answer, the following compares the Matlab result with your result and with the result using leastsq, which showed the best correlation with the given data:
The code is as follows:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as mtrand
from scipy.integrate import quad
from scipy.optimize import leastsq
## two-parameter Weibull pdf: n is the scale, a is the shape
def weib(x,n,a):
    return (a / n) * (x / n)**(a-1) * np.exp(-(x/n)**a)
def residuals(p,x,y):
    # penalize parameter sets whose pdf does not integrate to ~1 over [0, 16]
    integral = quad( weib, 0, 16, args=(p[0],p[1]) )[0]
    penalization = abs(1.-integral)*100000
    return y - weib(x, p[0],p[1]) + penalization
#
data = np.loadtxt("stack_data.csv")
x = np.linspace(data.min(), data.max(), 100)
n, bins, patches = plt.hist(data,bins=x, density=True)
binsm = (bins[1:]+bins[:-1])/2
popt, pcov = leastsq(func=residuals, x0=(1.,1.), args=(binsm,n))
loc, scale = 1.58463497, 5.93030013
plt.plot(binsm,n)
plt.plot(x, weib(x, loc, scale),
label='weib matlab, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
loc, scale = s.exponweib.fit_loc_scale(data, 1, 1)
plt.plot(x, weib(x, loc, scale),
label='weib stack, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
plt.plot(x, weib(x,*popt),
label='weib leastsq, loc=%1.3f, scale=%1.3f' % tuple(popt), lw=4.)
plt.legend(loc='upper right')
plt.show()

I had the same problem, but found that setting loc=0 in exponweib.fit primed the pump for the optimization. That was all that was needed from @user333700's answer. I couldn't load your data, since your data link points to an image, not data. So I ran a test on my data instead:
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
N=30
# x is a 1-D array of observations (the author's own data, not shown here)
counts, bins = np.histogram(x, bins=N)
bin_width = bins[1]-bins[0]
total_count = float(sum(counts))
f, ax = plt.subplots(1, 1)
f.suptitle(query_uri)  # query_uri is defined elsewhere in the author's script
ax.bar(bins[:-1]+bin_width/2., counts, align='center', width=.85*bin_width)
ax.grid(True)
def fit_pdf(x, name='lognorm', color='r'):
    dist = getattr(ss, name) # params = shape, loc, scale
    # dist = ss.gamma # 3 params
    params = dist.fit(x, loc=0) # 1-day lag minimum for shipping
    y = dist.pdf(bins, *params)*total_count*bin_width
    sqerror_sum = np.log(sum(ci*(yi - ci)**2. for (ci, yi) in zip(counts, y)))
    ax.plot(bins, y, color, lw=3, alpha=0.6, label='%s err=%3.2f' % (name, sqerror_sum))
    return y
colors = ['r-', 'g-', 'r:', 'g:']
for name, color in zip(['exponweib', 't', 'gamma'], colors): # 'lognorm', 'erlang', 'chi2', 'weibull_min',
    y = fit_pdf(x, name=name, color=color)
ax.legend(loc='best', frameon=False)
plt.show()

There have been a few answers to this already, here and in other places, like in "Weibull distribution and the data in the same figure (with numpy and scipy)".
It still took me a while to come up with a clean toy example, so I thought it would be useful to post.
from scipy import stats
import matplotlib.pyplot as plt
#input for pseudo data
N = 10000
Kappa_in = 1.8
Lambda_in = 10
a_in = 1
loc_in = 0
#Generate data from given input
data = stats.exponweib.rvs(a=a_in,c=Kappa_in, loc=loc_in, scale=Lambda_in, size = N)
#The a and loc are fixed in the fit since it is standard to assume they are known
a_out, Kappa_out, loc_out, Lambda_out = stats.exponweib.fit(data, f0=a_in,floc=loc_in)
#Plot
bins = range(51)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(bins, stats.exponweib.pdf(bins, a=a_out,c=Kappa_out,loc=loc_out,scale = Lambda_out))
ax.hist(data, bins = bins , density=True, alpha=0.5)
ax.annotate("Shape: $k = %.2f$ \n Scale: $\\lambda = %.2f$"%(Kappa_out,Lambda_out), xy=(0.7, 0.85), xycoords=ax.transAxes)
plt.show()

In the meantime, there is a really good package out there: reliability. Here is the documentation: reliability @ readthedocs.
Your code simply becomes:
from reliability.Fitters import Fit_Weibull_2P
...
wb = Fit_Weibull_2P(failures=data)
plt.show()
Saves a lot of headaches and makes beautiful plots, too.

The order of loc and scale is mixed up in the code of the leastsq comparison answer. The calls should read:
plt.plot(x, weib(x, scale, loc))
The scale parameter should come first.

Python Curve Smoothing using Savitzky_Golay - issue - python

Maybe a little late, and maybe not the exact answer to your question, but for a very similar application I use pandas.read_excel to import the data and scipy.signal.savgol_filter for filtering: the less I implement, the more chances it has of working properly...
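A minimal sketch of that approach, assuming synthetic data in place of the spreadsheet (the pandas.read_excel step is left out because the file is not available here). The test signal mirrors the docstring example from the question, and the window length and polynomial order are illustrative choices, not recommended settings.

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy test signal, same form as the docstring example in the question.
rng = np.random.default_rng(0)
t = np.linspace(-4, 4, 500)
clean = np.exp(-t**2)
y = clean + rng.normal(0, 0.05, t.shape)

# An odd window_length larger than polyorder; both values are assumptions.
y_smooth = savgol_filter(y, window_length=31, polyorder=4)

rms_noisy = np.sqrt(np.mean((y - clean)**2))
rms_smooth = np.sqrt(np.mean((y_smooth - clean)**2))
print("RMS error of noisy signal:   ", rms_noisy)
print("RMS error of smoothed signal:", rms_smooth)
```

If the smoothed curve still overlaps the raw data exactly, the window is probably too short relative to the polynomial order, so widening the window is the first thing to try.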
