Let's assume we have a dataset that is approximately given by
import numpy as np
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
So the data varies by about 20%. My first idea was to use the UnivariateSpline function of scipy, but it does not handle the small noise well. In frequency space the background is much smaller than the signal, so fitting a spline only to the part below a cutoff might be an idea, but that would involve a forward and an inverse Fourier transform, which might result in bad behaviour.
Another way would be a moving average, but this would also need the right choice of the delay.
Any hints, books, or links on how to tackle this problem?
I prefer a Savitzky-Golay filter. It uses least squares to regress a small window of your data onto a polynomial, then uses the polynomial to estimate the point in the center of the window. Finally the window is shifted forward by one data point and the process repeats. This continues until every point has been optimally adjusted relative to its neighbors. It works great even with noisy samples from non-periodic and non-linear sources.
Here is a thorough cookbook example. See my code below to get an idea of how easy it is to use. Note: I left out the code for defining the savitzky_golay() function because you can literally copy/paste it from the cookbook example I linked above.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
yhat = savitzky_golay(y, 51, 3) # window size 51, polynomial order 3
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
UPDATE: It has come to my attention that the cookbook example I linked to has been taken down. Fortunately, the Savitzky-Golay filter has been incorporated into the SciPy library, as pointed out by #dodohjk (thanks #bicarlsen for the updated link).
To adapt the above code by using SciPy source, type:
from scipy.signal import savgol_filter
yhat = savgol_filter(y, 51, 3) # window size 51, polynomial order 3
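To make the windowed least-squares description above concrete, here is a minimal sketch that fits a polynomial to each window with np.polyfit and evaluates it at the window center. The name savgol_by_hand is made up for illustration, the edge samples are simply left unsmoothed, and the loop is far slower than savgol_filter; it is only meant to show the idea.

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_by_hand(y, window, order):
    """Fit a polynomial to each window and keep its value at the center."""
    half = window // 2
    out = np.array(y, dtype=float)      # edges stay unsmoothed in this sketch
    t = np.arange(-half, half + 1)      # window coordinates, center at 0
    for i in range(half, len(y) - half):
        coeffs = np.polyfit(t, y[i - half:i + half + 1], order)
        out[i] = np.polyval(coeffs, 0.0)
    return out

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.random(100) * 0.2

manual = savgol_by_hand(y, 11, 3)
reference = savgol_filter(y, 11, 3)
```

Away from the edges, manual should agree with reference up to floating-point error; only the edge handling differs.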
EDIT: look at this answer. Using np.cumsum is much faster than np.convolve
A quick and dirty way to smooth data I use, based on a moving average box (by convolution):
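import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.8

def smooth(y, box_pts):
    # moving average: convolve with a normalized box of box_pts samples
    box = np.ones(box_pts)/box_pts
    y_smooth = np.convolve(y, box, mode='same')
    return y_smooth

plt.plot(x, y,'o')
plt.plot(x, smooth(y,3), 'r-', lw=2)
plt.plot(x, smooth(y,19), 'g-', lw=2)
plt.show()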
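The np.cumsum variant mentioned in the EDIT note could look roughly like this. It is a sketch, not the code from the linked answer; moving_average_cumsum is a hypothetical name, and unlike mode='same' it returns only the len(y) - box_pts + 1 windows that lie fully inside the data.

```python
import numpy as np

def moving_average_cumsum(y, box_pts):
    # prepend a zero so that c[k] is the sum of the first k samples
    c = np.cumsum(np.insert(np.asarray(y, dtype=float), 0, 0.0))
    # each window sum is the difference of two running sums
    return (c[box_pts:] - c[:-box_pts]) / box_pts

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.random(100) * 0.8

smoothed = moving_average_cumsum(y, 7)
reference = np.convolve(y, np.ones(7) / 7, mode='valid')
```

Both arrays hold the same values; the cumsum version does a constant amount of work per output sample regardless of the window size.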
If you are interested in a "smooth" version of a signal that is periodic (like your example), then an FFT is the right way to go. Take the Fourier transform and zero out the low-contributing frequencies:
import numpy as np
import scipy.fftpack
N = 100
x = np.linspace(0,2*np.pi,N)
y = np.sin(x) + np.random.random(N) * 0.2
w = scipy.fftpack.rfft(y)
f = scipy.fftpack.rfftfreq(N, x[1]-x[0])
spectrum = w**2
cutoff_idx = spectrum < (spectrum.max()/5)
w2 = w.copy()
w2[cutoff_idx] = 0
y2 = scipy.fftpack.irfft(w2)
Even if your signal is not completely periodic, this will do a great job of subtracting out white noise. There are many types of filters to use (high-pass, low-pass, etc.); the appropriate one depends on what you are looking for.
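As an illustration of the low-pass case, the sketch below zeroes every bin above a chosen cutoff frequency instead of thresholding on the power spectrum. It uses np.fft rather than scipy.fftpack, and the test signal, the cutoff value and the name lowpass_fft are assumptions made for the example.

```python
import numpy as np

def lowpass_fft(y, dt, cutoff):
    """Zero all frequency bins above `cutoff` (cycles per unit of dt)."""
    spectrum = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), d=dt)
    spectrum[freqs > cutoff] = 0
    return np.fft.irfft(spectrum, n=len(y))

n = 256
t = np.arange(n) / n                      # one unit of time, endpoint excluded
slow = np.sin(2 * np.pi * 3 * t)          # 3 cycles: the part we want to keep
fast = 0.5 * np.sin(2 * np.pi * 40 * t)   # 40 cycles: stands in for the noise
filtered = lowpass_fft(slow + fast, dt=1 / n, cutoff=10)
```

Because both components sit exactly on FFT bins here, filtered reproduces slow almost exactly. Real data is rarely that clean, so the cutoff usually has to be chosen by looking at the spectrum.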
Fitting a moving average to your data would smooth out the noise; see this answer for how to do that.
If you'd like to use LOWESS to fit your data (it's similar to a moving average but more sophisticated), you can do that using the statsmodels library:
import numpy as np
import pylab as plt
import statsmodels.api as sm
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
lowess = sm.nonparametric.lowess(y, x, frac=0.1)
plt.plot(x, y, '+')
plt.plot(lowess[:, 0], lowess[:, 1])
plt.show()
Finally, if you know the functional form of your signal, you could fit a curve to your data, which would probably be the best thing to do.
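For the example data, such a fit could be sketched with scipy.optimize.curve_fit. The sinusoidal model, its parameter names and the initial guess p0 are assumptions chosen for this example, not part of the original answer.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, amplitude, frequency, phase, offset):
    return amplitude * np.sin(frequency * x + phase) + offset

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.random(100) * 0.2

# p0 is a rough initial guess; sinusoidal fits can land in a wrong local
# minimum when the starting frequency is far off
params, covariance = curve_fit(model, x, y, p0=[1.0, 1.0, 0.0, 0.0])
y_fit = model(x, *params)
```

The fitted offset should come out near 0.1, because np.random.random style noise is uniform on [0, 1) and therefore not centered on zero.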
This Question is already thoroughly answered, so I think a runtime analysis of the proposed methods would be of interest (It was for me, anyway). I will also look at the behavior of the methods at the center and the edges of the noisy dataset.
TL;DR
method | python list, runtime in s | numpy array, runtime in s
--------------------|--------------|------------
kernel regression | 23.93405 | 22.75967
lowess | 0.61351 | 0.61524
naive average | 0.02485 | 0.02326
others* | 0.00150 | 0.00150
fft | 0.00021 | 0.00021
numpy convolve | 0.00017 | 0.00015
*savgol with different fit functions and some numpy methods
Kernel regression scales badly; lowess is a bit faster, and both produce smooth curves. Savgol is a middle ground on speed and can produce both jumpy and smooth outputs, depending on the degree of the polynomial. FFT is extremely fast, but only works on periodic data.
Moving average methods with numpy are faster but obviously produce a graph with steps in it.
Setup
I generated 1000 data points in the shape of a sin curve:
size = 1000
x = np.linspace(0, 4 * np.pi, size)
y = np.sin(x) + np.random.random(size) * 0.2
data = {"x": x, "y": y}
I pass these into a function to measure the runtime and plot the resulting fit:
from time import time
import matplotlib.pyplot as plt

def test_func(f, label):  # f: function handle to one of the smoothing methods
    start = time()
    for i in range(5):
        arr = f(data["y"], 20)
    print(f"{label:26s} - time: {time() - start:8.5f} ")
    plt.plot(data["x"], arr, "-", label=label)
I tested many different smoothing functions. arr is the array of y values to be smoothed and span is the smoothing parameter. The lower span is, the closer the fit follows the original data; the higher it is, the smoother the resulting curve.
import numpy as np
import statsmodels.api as sm
from scipy import fftpack
from scipy.signal import savgol_filter
from statsmodels.nonparametric.kernel_regression import KernelReg

def smooth_data_convolve_my_average(arr, span):
    re = np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")
    # The "my_average" part: shrinks the averaging window on the side that
    # reaches beyond the data, keeps the other side the same size as given
    # by "span"
    re[0] = np.average(arr[:span])
    for i in range(1, span + 1):
        re[i] = np.average(arr[:i + span])
        re[-i] = np.average(arr[-i - span:])
    return re

def smooth_data_np_average(arr, span):  # my original, naive approach
    return [np.average(arr[val - span:val + span + 1]) for val in range(len(arr))]

def smooth_data_np_convolve(arr, span):
    return np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")

def smooth_data_np_cumsum_my_average(arr, span):
    cumsum_vec = np.cumsum(arr)
    moving_average = (cumsum_vec[2 * span:] - cumsum_vec[:-2 * span]) / (2 * span)
    # The "my_average" part again. Slightly different to before, because the
    # moving average from cumsum is shorter than the input and needs to be padded
    front, back = [np.average(arr[:span])], []
    for i in range(1, span):
        front.append(np.average(arr[:i + span]))
        back.insert(0, np.average(arr[-i - span:]))
    back.insert(0, np.average(arr[-2 * span:]))
    return np.concatenate((front, moving_average, back))

def smooth_data_lowess(arr, span):
    x = np.linspace(0, 1, len(arr))
    return sm.nonparametric.lowess(arr, x, frac=(5*span / len(arr)), return_sorted=False)

def smooth_data_kernel_regression(arr, span):
    # "span" smoothing parameter is ignored. If you know how to
    # incorporate that with kernel regression, please comment below.
    kr = KernelReg(arr, np.linspace(0, 1, len(arr)), 'c')
    return kr.fit()[0]

def smooth_data_savgol_0(arr, span):
    return savgol_filter(arr, span * 2 + 1, 0)

def smooth_data_savgol_1(arr, span):
    return savgol_filter(arr, span * 2 + 1, 1)

def smooth_data_savgol_2(arr, span):
    return savgol_filter(arr, span * 2 + 1, 2)

def smooth_data_fft(arr, span):  # the scaling of "span" is open to suggestions
    w = fftpack.rfft(arr)
    spectrum = w ** 2
    cutoff_idx = spectrum < (spectrum.max() * (1 - np.exp(-span / 2000)))
    w[cutoff_idx] = 0
    return fftpack.irfft(w)
Results
Speed
Runtime over 1000 elements, tested on a python list as well as a numpy array to hold the values.
method | python list | numpy array
--------------------|-------------|------------
kernel regression | 23.93405 s | 22.75967 s
lowess | 0.61351 s | 0.61524 s
numpy average | 0.02485 s | 0.02326 s
savgol 2 | 0.00186 s | 0.00196 s
savgol 1 | 0.00157 s | 0.00161 s
savgol 0 | 0.00155 s | 0.00151 s
numpy convolve + me | 0.00121 s | 0.00115 s
numpy cumsum + me | 0.00114 s | 0.00105 s
fft | 0.00021 s | 0.00021 s
numpy convolve | 0.00017 s | 0.00015 s
Kernel regression in particular is very slow to compute over 1k elements, and lowess also fails when the dataset becomes much larger. numpy convolve and fft are especially fast. I did not investigate how the runtime scales (the O(n) behavior) with increasing or decreasing sample size.
Edge behavior
I'll separate this part into two, to keep the images understandable.
Numpy based methods + savgol 0:
These methods calculate an average of the data; the graph is not smoothed. They all (with the exception of numpy.cumsum) result in the same graph when the window that is used to calculate the average does not touch the edge of the data. The discrepancy to numpy.cumsum is most likely due to an off-by-one error in the window size.
There are different edge behaviours when the method has to work with less data:
savgol 0: continues with a constant to the edge of the data (savgol 1 and savgol 2 end with a line and parabola respectively)
numpy average: stops when the window reaches the left side of the data and fills those places in the array with NaN; on the right side it behaves the same as the my_average method
numpy convolve: follows the data pretty accurately. I suspect the window size is reduced symmetrically when one side of the window reaches the edge of the data
my_average/me: my own method that I implemented, because I was not satisfied with the other ones. Simply shrinks the part of the window that is reaching beyond the data to the edge of the data, but keeps the window to the other side the original size given with span
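On the numpy convolve point: as far as I can tell, np.convolve with mode="same" does not shrink the window but treats samples outside the array as zeros, so the edge averages are pulled toward zero. One possible correction, sketched below under that assumption, divides each window sum by the number of real samples under the window. The name smooth_data_convolve_normalized is hypothetical and is not one of the benchmarked methods.

```python
import numpy as np

def smooth_data_convolve_normalized(arr, span):
    arr = np.asarray(arr, dtype=float)
    box = np.ones(2 * span + 1)
    sums = np.convolve(arr, box, mode="same")
    # how many real samples each window covers (fewer near the edges)
    counts = np.convolve(np.ones_like(arr), box, mode="same")
    return sums / counts

flat = np.ones(50)
plain = np.convolve(flat, np.ones(11) / 11, mode="same")
fixed = smooth_data_convolve_normalized(flat, 5)
```

For a constant input, plain drops to 6/11 at the first sample because only 6 of the 11 window slots hold data, while fixed stays at 1 everywhere. This shrinks the window symmetrically in effect, which differs from my_average, where only the side reaching beyond the data shrinks.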
Complicated methods:
These methods all end with a nice fit to the data. savgol 1 ends with a line, savgol 2 with a parabola.
Curve behaviour
To showcase the behaviour of the different methods in the middle of the data.
The different savgol and average filters produce a rough line, lowess, fft and kernel regression produce a smooth fit. lowess appears to cut corners when the data changes.
Motivation
I have a Raspberry Pi logging data for fun and the visualization proved to be a small challenge. All data points, except RAM usage and ethernet traffic are only recorded in discrete steps and/or inherently noisy. For example the temperature sensor only outputs whole degrees, but differs by up to two degrees between consecutive measurements. No useful information can be gained from such a scatter plot. To visualize the data I therefore needed some method that is not too computationally expensive and produced a moving average. I also wanted nice behavior at the edges of the data, as this especially impacts the latest info when looking at live data. I settled on the numpy convolve method with my_average to improve the edge behavior.
Another option is to use KernelReg in statsmodels:
from statsmodels.nonparametric.kernel_regression import KernelReg
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
# The third parameter specifies the type of the variable x;
# 'c' stands for continuous
kr = KernelReg(y,x,'c')
plt.plot(x, y, '+')
y_pred, y_std = kr.fit(x)
plt.plot(x, y_pred)
plt.show()
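To give a rough idea of what kernel regression computes, here is a hand-written Nadaraya-Watson (local constant) estimator with a Gaussian kernel. This is a simplified sketch and not what KernelReg does internally: as far as I know, KernelReg defaults to a local linear estimator and picks its bandwidth by cross-validation, whereas the bandwidth of 0.2 and the name nadaraya_watson below are assumptions for the example.

```python
import numpy as np

def nadaraya_watson(x, y, bandwidth):
    # scaled distances between every evaluation point and every sample
    u = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * u ** 2)          # Gaussian kernel
    # weighted average of y around each evaluation point
    return (weights @ y) / weights.sum(axis=1)

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.random(100) * 0.2
y_pred = nadaraya_watson(x, y, bandwidth=0.2)

truth = np.sin(x) + 0.1                      # uniform noise has mean 0.1
err_raw = np.sqrt(np.mean((y - truth)[10:-10] ** 2))
err_smooth = np.sqrt(np.mean((y_pred - truth)[10:-10] ** 2))
```

The explicit weight matrix also shows why the method scales badly: every output point looks at every input point, so the cost grows quadratically with the number of samples.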
A clear definition of smoothing of a 1D signal from SciPy Cookbook shows you how it works.
Shortcut:
import numpy

def smooth(x,window_len=11,window='hanning'):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.

    input:
        x: the input signal
        window_len: the dimension of the smoothing window; should be an odd integer
        window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'
            flat window will produce a moving average smoothing.

    output:
        the smoothed signal

    example:

    t=linspace(-2,2,41)
    x=sin(t)+randn(len(t))*0.1
    y=smooth(x)

    see also:

    numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve
    scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead of a string
    NOTE: length(output) != length(input), to correct this for an odd window_len: return y[(window_len//2):-(window_len//2)] instead of just y.
    """
    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")
    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window_len<3:
        return x
    if window not in ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']:
        raise ValueError("Window is one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'")

    s=numpy.r_[x[window_len-1:0:-1],x,x[-2:-window_len-1:-1]]
    #print(len(s))
    if window == 'flat': #moving average
        w=numpy.ones(window_len,'d')
    else:
        w=getattr(numpy, window)(window_len)

    y=numpy.convolve(w/w.sum(),s,mode='valid')
    return y
from numpy import *
from pylab import *

def smooth_demo():
    t=linspace(-4,4,100)
    x=sin(t)
    xn=x+randn(len(t))*0.1
    y=smooth(x)

    ws=31

    subplot(211)
    plot(ones(ws))

    windows=['flat', 'hanning', 'hamming', 'bartlett', 'blackman']

    # hold(True) was removed from matplotlib; successive plot calls already overlay
    for w in windows[1:]:
        eval('plot('+w+'(ws) )')

    axis([0,30,0,1.1])
    legend(windows)
    title("The smoothing windows")
    subplot(212)
    plot(x)
    plot(xn)
    for w in windows:
        plot(smooth(xn,10,w))
    l=['original signal', 'signal with noise']
    l.extend(windows)
    legend(l)
    title("Smoothing a noisy signal")
    show()

if __name__=='__main__':
    smooth_demo()
For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: A python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the ability to compute intervals.
Here I use a ConvolutionSmoother, but you can also try the others.
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import *
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
# operate smoothing
smoother = ConvolutionSmoother(window_len=5, window_type='ones')
smoother.smooth(y)
# generate intervals
low, up = smoother.get_intervals('sigma_interval', n_sigma=2)
# plot the smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
I also point out that tsmoothie can smooth multiple time series in a vectorized way.
There is a simple function in scipy.ndimage that also works well for me:
from scipy.ndimage import uniform_filter1d
y_smooth = uniform_filter1d(y,size=15)
Using a moving average, a quick way (that also works for non-bijective functions) is
import numpy as np
import pandas as pd

def smoothen(x, winsize=5):
    return np.array(pd.Series(x).rolling(winsize).mean())[winsize-1:]
This code is based on https://towardsdatascience.com/data-smoothing-for-data-science-visualization-the-goldilocks-trio-part-1-867765050615. More advanced solutions are also discussed there.
You could also use this:
def smooth(scalars, weight = 0.8): # Weight between 0 and 1
    return [scalars[i] * weight + (1 - weight) * scalars[i+1] for i in range(len(scalars)) if i < len(scalars)-1]
If you are plotting a time series with matplotlib, a rolling median is a simple way to smooth the graph:
smotDeriv = timeseries.rolling(window=20, min_periods=5, center=True).median()
where timeseries is your data (a pandas Series). Increase window for more smoothing.
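A self-contained version of the rolling median could look like the sketch below. The test signal and the three injected spikes are made up for the example; only the rolling(...).median() call comes from the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
values = np.sin(t) + rng.normal(scale=0.05, size=t.size)
values[[40, 90, 150]] += 5.0          # isolated spikes (hypothetical outliers)

timeseries = pd.Series(values)
# center=True aligns each window on its middle sample; min_periods=5 lets the
# shortened windows at the edges still produce a value instead of NaN
smoothed = timeseries.rolling(window=20, min_periods=5, center=True).median()
```

A median ignores isolated outliers entirely, whereas a rolling mean over the same window would still be shifted by each spike.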
I have thousands of signals, some noisy and some with periodic features. Auto-correlation seems like a natural check:
import numpy as np
def autocorr(x):
    n = x.size
    norm = x - np.mean(x)
    result = np.correlate(norm, norm, mode="same")
    acorr = result[n // 2 + 1 :] / (x.var() * np.arange(n - 1, n // 2, -1))
    return acorr
Below is a plot showing a set of signals, left column being the raw signals, middle being the autocorrelation and far right being the FFT. An additional point about the data - for each batch of signals (~20 per batch, one batch plotted here), the ones that are periodic are highly correlated to one another as well (i.e. same frequency), opening up e.g. clustering/classification options too, though that's probably better done on the smoother autocorrelations.
Although it is quite clear which are periodic from the autocorrelation, it still recasts it as another problem, determining which of those are periodic. Any good ideas on distinguishing say rows 1, 7, 13 from the others? I thought about doing peak detection on the autocorrelation function but I'm sure there are cleaner ways.
I have scaled the alpha of each line to be the inverse of the Durbin-Watson statistic of the autocorrelation result:
from statsmodels.regression.linear_model import OLS
import numpy as np
from statsmodels.stats.stattools import durbin_watson
def dw(data):
    ols_res = OLS(data, np.ones(len(data))).fit()
    dwout = durbin_watson(ols_res.resid)
    return dwout
Though this isn't entirely clear as some of the noisy signals are still highlighted and periodic signals (e.g. 1) slip through.
This looks as such:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(18, 3, sharex=True, constrained_layout=True, figsize=(16, 12))
for item in np.arange(tracks.shape[1]):
    sig = tracks[:, item]
    acc = autocorr(sig)
    dout = dw(acc) # but scaled elsewhere between 0.1 to 1.
    ff = np.abs(np.fft.fft(acc))
    ax[item, 0].plot(sig, alpha=dout, color="k")
    ax[item, 1].plot(acc, alpha=dout, color="k")
    ax[item, 2].plot(ff, alpha=dout, color="k")
plt.show()
Traces CSV here: https://www.dropbox.com/s/atr9k56vwkq5mq7/traces.csv?dl=0
Note: these are just some examples of thousands - some are higher/lower wavelength than these few examples that have them.
Update: I thought given my autocorrelation is maximally 1, I could just do peak detection with a threshold of e.g. 0.8 and see if there is more than 1 peak detected (with some minimum distance between them) e.g.
peaks = peakutils.indexes(acc, thres=0.8, min_dist=10, thres_abs=True)
Seems unclean though.
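The peak-counting idea from the update can be sketched without peakutils by using scipy.signal.find_peaks, which is a swapped-in substitute rather than the call shown above. The helper name looks_periodic, the synthetic test signals and the reuse of 0.8 and 10 as height and distance are assumptions for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

def autocorr(x):
    n = x.size
    norm = x - np.mean(x)
    result = np.correlate(norm, norm, mode="same")
    return result[n // 2 + 1:] / (x.var() * np.arange(n - 1, n // 2, -1))

def looks_periodic(x, height=0.8, distance=10):
    # count autocorrelation peaks above `height`, at least `distance` lags apart
    peaks, _ = find_peaks(autocorr(x), height=height, distance=distance)
    return len(peaks) > 1

rng = np.random.default_rng(0)
t = np.arange(1000)
periodic = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)
noise = rng.normal(size=t.size)

periodic_flag = looks_periodic(periodic)
noise_flag = looks_periodic(noise)
```

On these two signals the sinusoid is flagged and the white noise is not. Real traces with weaker periodicity would probably need a lower height or a prominence criterion.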
I am interested in integrating in Fourier space after using scipy to take an fft of some data. I have been following along with this stack exchange post numerical integration in Fourier space with numpy.fft but it does not properly integrate a few test cases I have been working with. I have added a few lines to address this issue but still am not recovering the correct integrals. Below is the code I have been using to integrate my test cases. At the top of the code are the 3 test cases I have been using.
import numpy as np
import scipy.special as sp
from scipy.fft import fft, ifft, fftfreq
import matplotlib.pyplot as plt
#set number of points in array
Ns = 2**16
#create array in space
x = np.linspace(-np.pi, np.pi, Ns)
#test case 1 from stack exchange post
# y = np.exp(-x**2) # function f(x)
# ys = np.exp(-x**2) * (-2 *x) # derivative f'(x)
#test case 2
# y = np.exp(-x**2) * x - 1/2 *np.sqrt(np.pi)*sp.erf(x)
# ys = np.exp(-x**2) * -2*x**2
#test case 3
y = np.sin(x**2) + (1/4)*np.exp(x)
ys = 1/4*(np.exp(x) + 8*x*np.cos(x**2))
#find spacing in space array
ss = x[1]-x[0]
#define fft integration function
def fft_int(N,s,dydt):
    #create frequency array
    f = fftfreq(N,s)
    # integration step ignoring divide by 0 errors
    Fys = fft(dydt)
    with np.errstate(divide="ignore", invalid="ignore"):
        modFys = Fys / (2*np.pi*1j*f)
    #set DC term to 0, was a nan since we divided by 0
    modFys[0] = 0
    #take inverse fft and subtract by integration constant
    fourier = ifft(modFys)
    fourier = fourier-fourier[0]
    #tilt correction if function doesn't approach 0 at its ends
    tilt = np.sum(dydt)*s*(np.arange(0,N)/(N-1) - 1/2)
    fourier = fourier + tilt
    return fourier
Test case 1 was from the stack exchange post from above. If you copy paste the code from the top answer and plot you'll get something like this:
with the solid blue line being the fft integration method and the dashed orange as the analytic solution. I account for this offset with the following line of code:
fourier = fourier-fourier[0]
since I don't believe the code was setting the constant of integration.
Next for test case 2 I get a plot like this:
again with the solid blue line being the fft integration method and the dashed orange as the analytic solution. I account for this tilt in the solution using the following lines of code
tilt = np.sum(dydt)*s*(np.arange(0,N)/(N-1) - 1/2)
fourier = fourier + tilt
Finally we arrive at test case 3. Which results in the following plot:
again with the solid blue line being the fft integration method and the dashed orange as the analytic solution. This is where I'm stuck, this offset has appeared again and I'm not sure why.
TLDR: How do I correctly integrate a function in fourier space using scipy.fft?
The tilt component makes no sense. It fixes one function, but it's not a generic solution of the problem.
The problem is that the FFT induces periodicity in the signal, meaning you compute the integral of a different function. Multiplying the FFT of the signal by 1/(2*np.pi*1j*f) is equivalent to a circular convolution of the signal with ifft(1/(2*np.pi*1j*f)). "Circular" is the key here. This is just a boundary problem.
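The periodicity point can be checked with a small experiment: the same division by 2*pi*1j*f integrates a derivative exactly when the integral is periodic over the sampled interval, and fails when it is not. The function name fft_integrate and the two test signals are chosen for this sketch and do not appear in the question.

```python
import numpy as np
from scipy.fft import fft, ifft, fftfreq

def fft_integrate(dydt, s):
    f = fftfreq(dydt.shape[0], s)
    F = fft(dydt)
    with np.errstate(divide="ignore", invalid="ignore"):
        F = F / (2 * np.pi * 1j * f)
    F[0] = 0  # drop the DC term, so the result has zero mean
    return np.real(ifft(F))

N = 1024
x = np.linspace(0, 2 * np.pi, N, endpoint=False)  # exactly one period
s = x[1] - x[0]

periodic = fft_integrate(np.cos(x), s)   # integral sin(x) is periodic
ramp = fft_integrate(np.ones(N), s)      # integral x is not periodic

err_periodic = np.max(np.abs(periodic - np.sin(x)))
err_ramp = np.max(np.abs(ramp - (x - np.mean(x))))
```

err_periodic is at the level of rounding error, while err_ramp is about pi: a constant derivative has only a DC component, so after zeroing it the method returns a flat line instead of a ramp.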
Padding the function with zeros is one way to attempt to fix this:
import numpy as np
import scipy.special as sp
from scipy.fft import fft, ifft, fftfreq
import matplotlib.pyplot as plt
def fft_int(s, dydt, N=0):
    dydt_padded = np.pad(dydt, (0, N))
    f = fftfreq(dydt_padded.shape[0], s)
    F = fft(dydt_padded)
    with np.errstate(divide="ignore", invalid="ignore"):
        F = F / (2*np.pi*1j*f)
    F[0] = 0
    y_padded = np.real(ifft(F))
    y = y_padded[0:dydt.shape[0]]
    return y - np.mean(y)
N = 2**16
x = np.linspace(-np.pi, np.pi, N)
s = x[1] - x[0]
# Test case 3
y = np.sin(x**2) + (1/4)*np.exp(x)
dy = 1/4*(np.exp(x) + 8*x*np.cos(x**2))
plt.plot(y - np.mean(y))
plt.plot(fft_int(s, dy))
plt.plot(fft_int(s, dy, N))
plt.plot(fft_int(s, dy, 10*N))
plt.show()
(Blue is expected output, computed solution without padding is orange, and with increasing amount of padding, green and red.)
Here I've solved the "offset" problem by plotting all functions with their mean removed. Setting the DC component to 0 is equal to subtracting the mean. But after cropping off the padding the mean changes, so fft_int subtracts the mean again after cropping.
Anyway, note how we get an increasingly better approximation as the padding increases. To get the exact result, one would need an infinite amount of padding, which of course is unrealistic.
Test case #1 doesn't need padding because the function reaches zero at the edges of the sampled domain. We can impose such a behavior on the other cases too. In discrete Fourier analysis this is called windowing. This would look something like this:
def fft_int(s, dydt):
    dydt_windowed = dydt * np.hanning(dydt.shape[0])
    f = fftfreq(dydt.shape[0], s)
    F = fft(dydt_windowed)
    with np.errstate(divide="ignore", invalid="ignore"):
        F = F / (2*np.pi*1j*f)
    F[0] = 0
    y = np.real(ifft(F))
    return y
However, here we get correct integration results only in the middle of the domain, with increasingly suppressed values towards the ends. So this is not a practical solution either.
My conclusion is that no, this is not possible to do. It is much easier to compute the integral with np.cumsum:
yp = np.cumsum(dy) * s
plt.plot(y - np.mean(y))
plt.plot(yp - np.mean(yp))
plt.show()
(not showing output: the two plots overlap perfectly.)
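If the rectangle rule behind np.cumsum is not accurate enough, a cumulative trapezoidal rule is a small step up. The sketch below writes it by hand in NumPy; cumulative_trapezoid_np is a hypothetical helper name, and recent SciPy versions ship a similar function as scipy.integrate.cumulative_trapezoid. It compares both rules on test case 3, using a coarser grid than the question so that the difference is visible.

```python
import numpy as np

def cumulative_trapezoid_np(dy, s):
    # running trapezoidal integral; the first value is 0 by construction
    steps = 0.5 * (dy[1:] + dy[:-1]) * s
    return np.concatenate(([0.0], np.cumsum(steps)))

N = 2**12
x = np.linspace(-np.pi, np.pi, N)
s = x[1] - x[0]
y = np.sin(x**2) + (1/4) * np.exp(x)            # test case 3
dy = 1/4 * (np.exp(x) + 8 * x * np.cos(x**2))   # its derivative

yp_rect = np.cumsum(dy) * s
yp_trap = cumulative_trapezoid_np(dy, s)

# compare both against the analytic integral, anchored at the first sample
err_rect = np.max(np.abs((yp_rect - yp_rect[0]) - (y - y[0])))
err_trap = np.max(np.abs(yp_trap - (y - y[0])))
```

Both results are only defined up to the constant of integration, which is why the comparison anchors everything at the first sample instead of subtracting the mean.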
I can compute the autocorrelation using numpy's built in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, and compute the correlation on each resulting window, then average them all together to compute cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to compute partition and loop through the data myself?
UPDATE
This is motivated by #kazemakase answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that #kazemakase is correct with the fact that the AC function naturally averages out the noise. However the averaging of the AC has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than O(nlogn) that I would expect if the correlation was calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = arma_generate_sample(ar, ma, 10000)

def calc_rxx(x):
    x = x-x.mean()
    N = len(x)
    Rxx = np.correlate(x,x,mode="same")[N//2::]/N
    #Rxx = np.correlate(x,x,mode="same")[N//2::]/np.arange(N,N//2,-1)
    return Rxx/x.var()

def avg_rxx(x,nperseg=1024):
    rxx_windows = []
    Nw = int(np.floor(len(x)/nperseg))
    print(Nw)
    first = True
    for i in range(Nw-1):
        xw = x[i*nperseg:nperseg*(i+1)]
        y = calc_rxx(xw)
        if i%1 == 0:
            if first:
                plt.semilogx(y,"k",alpha=0.2,label="Short AC")
                first = False
            else:
                plt.semilogx(y,"k",alpha=0.2)
        rxx_windows.append(y)
    print(np.shape(rxx_windows))
    return np.mean(rxx_windows,axis=0)

plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
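On the scaling remark above: an O(n log n) autocorrelation can be written directly with the FFT by zero-padding the signal so that the circular correlation does not wrap around. This is a sketch with a hypothetical function name, autocorr_fft; scipy.signal.correlate with method='fft' should give the same values.

```python
import numpy as np

def autocorr_fft(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    nfft = 2 * n  # zero-pad so the circular correlation does not wrap around
    spectrum = np.fft.rfft(x, nfft)
    # inverse transform of the power spectrum is the autocorrelation
    acf = np.fft.irfft(spectrum * np.conj(spectrum), nfft)[:n]
    return acf  # unnormalized, lags 0 .. n-1

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
fast = autocorr_fft(x)

xc = x - x.mean()
slow = np.correlate(xc, xc, mode="full")[len(x) - 1:]
```

fast and slow agree up to rounding error, so the FFT version can replace np.correlate inside calc_rxx-style helpers when the full-length autocorrelation becomes too slow.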
TL;DR: To decrease noise in the autocorrelation function, increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation is defined as

$$R_{xx}(l) = \sum_{i=0}^{n-1} x_i \, x_{i+l}$$

Let's say we partition the data into two windows. Their autocorrelations become

$$R_{xx1}(l) = \sum_{i=0}^{n/2-1} x_i \, x_{i+l}, \qquad R_{xx2}(l) = \sum_{i=n/2}^{n-1} x_i \, x_{i+l}$$

Note how they differ only in the limits of the summations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back to the original autocorrelation! So we did not gain anything.
The conclusion is, there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
to keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2. I.e. The mean of the normalized partial autocorrelation is equal to the complete normalized autocorrelation.
to keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partialized autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
Code heretic! I don't believe your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
    ry = np.correlate(y, y, mode='same') / m # partial ACF
    rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here is the good news: autocorrelation is already a noise reduction technique! Well, in some way at least: one application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
    x = np.random.randn(n)
    rx = np.correlate(x, x, mode='same') / n # ACF
    l1 = np.arange(-n//2, n//2) # Lags
    plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()
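To put a number on the plot above, one can measure the spread of the ACF at nonzero lags for several signal lengths. The sketch below assumes Gaussian white noise; the helper name acf_noise_level and the choice of 20 lags are arbitrary. For this kind of noise the spread should shrink roughly like 1/sqrt(n).

```python
import numpy as np

def acf_noise_level(n, max_lag=20, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # ACF at lags 1..max_lag, normalized by n like in the snippet above
    acf = np.array([np.dot(x[:-k], x[k:]) / n for k in range(1, max_lag + 1)])
    return acf.std()

levels = {n: acf_noise_level(n) for n in [2**6, 2**10, 2**14]}
for n, level in levels.items():
    print(n, level, 1 / np.sqrt(n))  # measured spread next to 1/sqrt(n)
```

The printed spreads fluctuate from seed to seed, but they drop steadily as n grows.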
I want to do unit testing of simulation models. For that, I run a simulation once and store the results (a time series) as a reference in a csv file (see an example here). When I change my model, I run the simulation again, store the new results as a csv file as well, and then compare the two.
The results are usually not 100% identical, an example plot is shown below:
The reference results are plotted in black and the new results are plotted in green.
The difference of the two is plotted in the second plot, in blue.
As can be seen, at a step the difference can become arbitrarily high, while everywhere else the difference is almost zero.
Therefore, I would prefer to use a different algorithm for comparison than just subtracting the two, but I can only describe my idea graphically:
When plotting the reference line twice, first in a light color with a high line width and then again in a dark color and a small line width, then it will look like it has a pink tube around the centerline.
Note that during a step that tube will not only be in the direction of the ordinate axis, but also in the direction of the abscissa.
When doing my comparison, I want to know whether the green line stays within the pink tube.
Now comes my question: I do not want to compare the two time series using a graph, but using a python script. There must be something like this already, but I cannot find it because I am missing the right vocabulary, I believe. Any ideas? Is something like that in numpy, scipy, or similar? Or would I have to write the comparison myself?
Additional question: When the script says the two series are not sufficiently similar, I would like to plot it as described above (using matplotlib), but the line width has to be defined somehow in other units than what I usually use to define line width.
I would assume here that your problem can be simplified by assuming that your function has to be close to another function (e.g. the center of the tube) with the very same support points and then a certain number of discontinuities are allowed.
Then I would implement a discretization of the comparison that differs from the typical one used for the L^2 norm (see for example some reference here).
Basically, in the continuous case, the L^2 norm relaxes the constraint that the two functions be close everywhere and allows them to differ at a finite number of points, called singularities.
This works because there are an infinite number of points where to calculate the integral, and a finite number of points will not make a difference there.
However, since there are no continuous functions here, but only their discretization, the naive approach will not work, because any singularity will contribute potentially significantly to the final integral value.
Therefore, what you could do is to perform a point by point check whether the two functions are close (within some tolerance) and allow at most num_exceptions points to be off.
import numpy as np
def is_close_except(arr1, arr2, num_exceptions=0.01, **kwargs):
    # if float, calculate as percentage of number of points
    if isinstance(num_exceptions, float):
        num_exceptions = int(len(arr1) * num_exceptions)
    num = len(arr1) - np.sum(np.isclose(arr1, arr2, **kwargs))
    return num <= num_exceptions
By contrast the standard L^2 norm discretization would lead to something like this integrated (and normalized) metric:
import numpy as np
def is_close_l2(arr1, arr2, **kwargs):
    norm1 = np.sum(arr1 ** 2)
    norm2 = np.sum(arr2 ** 2)
    norm = np.sum((arr1 - arr2) ** 2)
    return np.isclose(2 * norm / (norm1 + norm2), 0.0, **kwargs)
This, however, will fail for arbitrarily large peaks, unless you set such a large tolerance that basically anything counts as "being close".
Note that kwargs is there in case you want to pass additional tolerance constraints, or other options, to np.isclose().
As a test, you could run:
import numpy as np
import numpy.random
np.random.seed(0)
num = 1000
snr = 100
n_peaks = 5
x = np.linspace(-10, 10, num)
# generate ground truth
y = np.sin(x)
# distributed noise
y2 = y + np.random.random(num) / snr
# distributed noise + peaks
y3 = y + np.random.random(num) / snr
peak_positions = [np.random.randint(num) for _ in range(n_peaks)]
for i in peak_positions:
    y3[i] += np.random.random() * snr
# for distributed noise, both work with a 1/snr tolerance
is_close_l2(y, y2, atol=1/snr)
# output: True
is_close_except(y, y2, atol=1/snr)
# output: True
# for peak noise, since n_peaks < num_exceptions, this works
is_close_except(y, y3, atol=1/snr)
# output: True
# and if you allow 0 exceptions, then it fails, as expected
is_close_except(y, y3, num_exceptions=0, atol=1/snr)
# output: False
# for peak noise, this fails because the contribution from the peaks
# in the integral is much larger than the contribution from the rest
is_close_l2(y, y3, atol=1/snr)
# output: False
There are other approaches to this problem involving higher mathematics (e.g. Fourier or Wavelet transforms), but I would stick to the simplest.
EDIT (updated):
However, the working assumption may not hold, or you may not like it, for example because the two functions have different sampling or are described by non-injective relations.
In that case, you can follow the center of the tube using (x, y) data, calculate the Euclidean distance from the target (the tube center), and check that this distance is point-wise smaller than the maximum allowed (the tube size):
import numpy as np
# assume it is something with shape (N, 2) meaning (x, y)
target = ...
# assume it is something with shape (M, 2) meaning again (x, y)
trajectory = ...
# calculate the minimum distance between each point
# of the trajectory and the target
def is_close_trajectory(trajectory, target, max_dist):
    dist = np.zeros(trajectory.shape[0])
    for i in range(len(dist)):
        dist[i] = np.min(np.sqrt(
            (target[:, 0] - trajectory[i, 0]) ** 2 +
            (target[:, 1] - trajectory[i, 1]) ** 2))
    return np.all(dist < max_dist)
# same as above but faster and more memory-hungry
def is_close_trajectory2(trajectory, target, max_dist):
    # pairwise distances, shape (N, M): target on axis 0, trajectory on axis 1
    dist = np.min(np.sqrt(
        (target[:, np.newaxis, 0] - trajectory[np.newaxis, :, 0]) ** 2 +
        (target[:, np.newaxis, 1] - trajectory[np.newaxis, :, 1]) ** 2),
        axis=0)  # minimum over the target points, one value per trajectory point
    return np.all(dist < max_dist)
The price of this flexibility is that this will be a significantly slower or memory-hungry function.
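If the speed or memory cost becomes a problem, a spatial index is one possible way around it. The following is only a sketch, not part of the approach above: it swaps the brute-force distance computation for a nearest-neighbour query with scipy.spatial.cKDTree. The function name is_close_trajectory_tree and the example data are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def is_close_trajectory_tree(trajectory, target, max_dist):
    """Check that every trajectory point lies within max_dist of some target point."""
    tree = cKDTree(target)             # spatial index built on the tube center
    dist, _ = tree.query(trajectory)   # distance to the nearest target point
    return bool(np.all(dist < max_dist))

# hypothetical example data: a sine as target, a slightly offset copy
# sampled on a different grid as trajectory
x_t = np.linspace(0, 2 * np.pi, 200)
target = np.column_stack([x_t, np.sin(x_t)])
x_s = np.linspace(0, 2 * np.pi, 57)
trajectory = np.column_stack([x_s, np.sin(x_s) + 0.01])

result_near = is_close_trajectory_tree(trajectory, target, max_dist=0.05)
# the same trajectory moved up by 1.0 leaves the tube
result_far = is_close_trajectory_tree(trajectory + [0.0, 1.0], target, max_dist=0.05)
```

Like the functions above, this measures the distance to the nearest target sample, not to the interpolated curve, so the target should be sampled finely compared to the tube size.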
Assuming you have your list of results in the form we discussed in the comments already loaded:
from random import randint
import numpy
l1 = [(i,randint(0,99)) for i in range(10)]
l2 = [(i,randint(0,99)) for i in range(10)]
# I generate some random lists e.g:
# [(0, 46), (1, 33), (2, 85), (3, 63), (4, 63), (5, 76), (6, 85), (7, 83), (8, 25), (9, 72)]
# where the first element is the time and the second a value
print(l1)
# Then I just evaluate for each time step the difference between the values
differences = [abs(x[0][1]-x[1][1]) for x in zip(l1,l2)]
print(differences)
# And I can just print the maximum difference and its index:
print(max(differences))
print(differences.index(max(differences)))
With this data, if you define your "tube" to be, for example, 10 wide, you can just check whether the maximum value you find is greater than your threshold to decide whether the functions are similar enough or not.
You will have to remove outliers from your dataset first if you need to skip a random spike.
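One way the outlier step could look is sketched below, under some assumptions of mine: both series are sampled at the same time steps, and a "spike" is any difference that lies more than n_sigmas robust standard deviations (estimated from the median absolute deviation) away from the median difference. The function name, the threshold of 5 and the test data are all hypothetical.

```python
import numpy as np

def max_difference_ignoring_spikes(values1, values2, n_sigmas=5.0):
    """Largest |values1 - values2| after dropping isolated spikes."""
    diff = np.abs(np.asarray(values1, dtype=float) - np.asarray(values2, dtype=float))
    median = np.median(diff)
    mad = np.median(np.abs(diff - median))   # median absolute deviation
    robust_sigma = 1.4826 * mad              # MAD scaled to a Gaussian std dev
    keep = np.abs(diff - median) <= n_sigmas * robust_sigma
    return diff[keep].max()

# made-up data: two nearly equal series, one of them with a single spike
t = np.linspace(0, 2 * np.pi, 200)
series1 = np.sin(t)
series2 = np.sin(t) + 0.02 * np.cos(3 * t)
series2[10] += 5.0

raw_max = np.max(np.abs(series1 - series2))                   # dominated by the spike
clean_max = max_difference_ignoring_spikes(series1, series2)  # spike is skipped
tube = 0.05
similar = clean_max <= tube
```

Whether a spike should be ignored at all depends on the application, so the threshold is worth tuning against real data.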
You could also try the following:
from tslearn.metrics import dtw
print(dtw(arr1, arr2) * 100 / len(arr1))  # normalised by the array length
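For readers who have not met dynamic time warping (DTW), here is a minimal textbook sketch in plain numpy of what such a distance measures. It is not tslearn's implementation, and tslearn's dtw may use a different local cost and normalisation; dtw_distance is a name I made up.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |a_i - b_j| cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j],      # only a advances
                                    cost[i, j - 1],      # only b advances
                                    cost[i - 1, j - 1])  # both advance
    return cost[n, m]

t = np.linspace(0, 2 * np.pi, 50)
same = dtw_distance(np.sin(t), np.sin(t))
shifted = dtw_distance(np.sin(t), np.sin(t - 0.2))
# point-by-point comparison of the same two series, without warping
pointwise = np.sum(np.abs(np.sin(t) - np.sin(t - 0.2)))
```

Because the samples may be matched out of step, a small time shift is penalised no more than in a point-by-point comparison. That makes DTW tolerant along the time axis, but it still does not ignore large spikes.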
Bit late to the game, but I encountered the same conundrum recently and this seems to be the only question on the site discussing this particular problem.
A basic solution is to use time and amplitude tolerance values to create a 'bounding box' style envelope (similar to your pink tube) around the data.
I'm sure there are more elegant ways to do this, but a very crudely coded brute force example would be something like the following using pandas:
import pandas as pd
data = pd.DataFrame()
data['benchmark'] = [0.1, 0.2, 0.3] # or whatever you pull from your expected value data set
data['under_test'] = [0.2, 0.3, 0.1] # or whatever you pull from your simulation results data set
sample_rate = 20 # or whatever the data sample rate is
# shift tolerance adjusted to time series sample rate; cast to an integer
# so we can use standard series shift functions and whatnot
st = int(round(0.05 * sample_rate))
at = 0.05 # amplitude tolerance
bounding = pd.DataFrame()
# if we didn't care about time shifts, the following two would be sufficient
# (i.e. if the data didn't have severe discontinuities between samples)
bounding['top'] = data['benchmark'] + at
bounding['bottom'] = data['benchmark'] - at
# if you want to be able to tolerate large discontinuities
# the bounds can be widened along the time axis to accommodate for large jumps
bounding['bottomleft'] = data['benchmark'].shift(-st) - at
bounding['topleft'] = data['benchmark'].shift(-st) + at
bounding['topright'] = data['benchmark'].shift(st) + at
bounding['bottomright'] = data['benchmark'].shift(st) - at
# minimums and maximums give us a rough (but hopefully good enough) envelope
# these can be plotted as a parametric replacement of the 'pink tube' of line width
data['min'] = bounding.min(axis=1)
data['max'] = bounding.max(axis=1)
# see if the test data falls inside the envelope
data['pass/fail'] = data['under_test'].between(data['min'], data['max'])
# You now have a machine-readable column of booleans
# indicating which data points are outside the envelope
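The same envelope idea can also be written in plain numpy. The sketch below is a variation rather than a translation of the pandas code: it covers every integer shift between -shift_tol and +shift_tol (not only the two extremes) and clips at the array edges instead of producing NaN. The function name and the sample values are hypothetical.

```python
import numpy as np

def envelope(benchmark, shift_tol, amp_tol):
    """Lower/upper bounds covering every shift in [-shift_tol, shift_tol] samples."""
    benchmark = np.asarray(benchmark, dtype=float)
    n = len(benchmark)
    lower = np.full(n, np.inf)
    upper = np.full(n, -np.inf)
    for s in range(-shift_tol, shift_tol + 1):
        # benchmark samples seen at shift s, clipped at the array edges
        idx = np.clip(np.arange(n) + s, 0, n - 1)
        lower = np.minimum(lower, benchmark[idx] - amp_tol)
        upper = np.maximum(upper, benchmark[idx] + amp_tol)
    return lower, upper

benchmark = np.array([0.1, 0.2, 0.3, 0.3, 0.2])
under_test = np.array([0.12, 0.26, 0.31, 0.5, 0.2])
lower, upper = envelope(benchmark, shift_tol=1, amp_tol=0.05)
inside = (under_test >= lower) & (under_test <= upper)
# inside -> [True, True, True, False, True]
```

The lower and upper arrays are in data units, so they can be handed to matplotlib's fill_between to draw the tube without having to express a line width in points.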
I need code to do 2D Kernel Density Estimation (KDE), and I've found the SciPy implementation is too slow. So, I've written an FFT based implementation, but several things confuse me. (The FFT implementation also enforces periodic boundary conditions, which is what I want.)
The implementation is based on creating a simple histogram from the samples and then convolving this with a gaussian. Here's code to do this and compare it with the SciPy result.
from numpy import *
from scipy.stats import *
from numpy.fft import *
from matplotlib.pyplot import *
from time import perf_counter as clock # time.clock was removed in Python 3.8
ion()
#PARAMETERS
N = 512 #number of histogram bins; want 2^n for maximum FFT speed?
nSamp = 1000 #number of samples if using the random variable
h = 0.1 #width of gaussian
wh = 1.0 #width and height of square domain
#VARIABLES FROM PARAMETERS
rv = uniform(loc=-wh,scale=2*wh) #random variable that can generate samples
xyBnds = linspace(-1.0, 1.0, N+1) #boundaries of histogram bins
xy = (xyBnds[1:] + xyBnds[:-1])/2 #centers of histogram bins
xx, yy = meshgrid(xy,xy)
#DEFINE SAMPLES, TWO OPTIONS
#samples = rv.rvs(size=(nSamp,2))
samples = array([[0.5,0.5],[0.2,0.5],[0.2,0.2]])
#DEFINITIONS FOR FFT IMPLEMENTATION
ker = exp(-(xx**2 + yy**2)/2/h**2)/h/sqrt(2*pi) #Gaussian kernel
fKer = fft2(ker) #DFT of kernel
#FFT IMPLEMENTATION
stime = clock()
#generate normalized histogram. Not sure why .T is needed:
hst = histogram2d(samples[:,0], samples[:,1], bins=xyBnds)[0].T / (xy[-1] - xy[0])**2
#convolve histogram with kernel. Not sure why fftshift is needed:
KDE1 = fftshift(ifft2(fft2(hst)*fKer))/N
etime = clock()
print("FFT method time:", etime - stime)
#DEFINITIONS FOR NON-FFT IMPLEMENTATION FROM SCIPY
#points to sample the KDE at, in a form gaussian_kde likes:
grid_coords = append(xx.reshape(-1,1),yy.reshape(-1,1),axis=1)
#NON-FFT IMPLEMENTATION FROM SCIPY
stime = clock()
KDEfn = gaussian_kde(samples.T, bw_method=h)
KDE2 = KDEfn(grid_coords.T).reshape((N,N))
etime = clock()
print("SciPy time:", etime - stime)
#PLOT FFT IMPLEMENTATION RESULTS
fig = figure()
ax = fig.add_subplot(111, aspect='equal')
c = contour(xy, xy, KDE1.real)
clabel(c)
title("FFT Implementation Results")
#PLOT SCIPY IMPLEMENTATION RESULTS
fig = figure()
ax = fig.add_subplot(111, aspect='equal')
c = contour(xy, xy, KDE2)
clabel(c)
title("SciPy Implementation Results")
There are two sets of samples above. The 1000 random points are for benchmarking and are commented out; the three points are for debugging.
The resulting plots for the latter case are at the end of this post.
Here are my questions:
Can I avoid the .T for the histogram and the fftshift for KDE1? I'm not sure why they're needed, but the gaussians show up in the wrong places without them.
How is the scalar bandwidth defined for SciPy? The gaussians have much different widths in the two implementations.
Along the same lines, why are the gaussians in the SciPy implementation not radially symmetric even though I gave gaussian_kde a scalar bandwidth?
How could I implement the other bandwidth methods available in SciPy for the FFT code?
(Let me note that the FFT code is ~390x faster than the SciPy code in the 1000 random points case.)
The differences you're seeing are due to the bandwidth and scaling factors, as you've already noticed.
By default, gaussian_kde chooses the bandwidth using Scott's rule. Dig into the code if you're curious about the details. The code snippets below are from something I wrote quite a while ago to do something similar to what you're doing. (If I remember right, there's an obvious error in that particular version and it really shouldn't use scipy.signal for the convolution, but the bandwidth estimation and normalization are correct.)
# Calculate the covariance matrix (in pixel coords)
cov = np.cov(xyi)
# Scaling factor for bandwidth
scotts_factor = np.power(n, -1.0 / 6) # For 2D
#---- Make the gaussian kernel -------------------------------------------
# First, determine how big the gridded kernel needs to be (2 stdev radius)
# (do we need to convolve with a 5x5 array or a 100x100 array?)
std_devs = np.diag(np.sqrt(cov))
kern_nx, kern_ny = np.round(scotts_factor * 2 * np.pi * std_devs)
# Determine the bandwidth to use for the gaussian kernel
inv_cov = np.linalg.inv(cov * scotts_factor**2)
After the convolution, the grid is then normalized:
# Normalization factor to divide result by so that units are in the same
# units as scipy.stats.kde.gaussian_kde's output. (Sums to 1 over infinity)
norm_factor = 2 * np.pi * cov * scotts_factor**2
norm_factor = np.linalg.det(norm_factor)
norm_factor = n * dx * dy * np.sqrt(norm_factor)
# Normalize the result
grid /= norm_factor
Hopefully that helps clarify things a touch.
As for your other questions:
Can I avoid the .T for the histogram and the fftshift for KDE1? I'm
not sure why they're needed, but the gaussians show up in the wrong
places without them.
I could be misreading your code, but I think you just have the transpose because you're going from point coordinates to index coordinates (i.e. from <x, y> to <y, x>).
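A tiny check of that index order, as a sketch: np.histogram2d puts x along the first array axis, while np.meshgrid with its default indexing='xy' (and contour plotting) puts y along the first axis. The 4-bin grid and the single sample are made up for the demonstration.

```python
import numpy as np

edges = np.linspace(-1.0, 1.0, 5)           # 4 bins per axis
centers = (edges[1:] + edges[:-1]) / 2
xx, yy = np.meshgrid(centers, centers)      # default indexing='xy'

# one sample with x in the last bin and y in the first bin
hist = np.histogram2d([0.9], [-0.9], bins=edges)[0]

# where the sample ends up: (x index, y index) in the raw histogram
ix_iy = tuple(int(i) for i in np.argwhere(hist == 1)[0])
# after the transpose: (y index, x index), which is the meshgrid layout
iy_ix = tuple(int(i) for i in np.argwhere(hist.T == 1)[0])

x_at_sample = xx[iy_ix]   # bin center of the sample's x, read from the meshgrid
y_at_sample = yy[iy_ix]   # bin center of the sample's y
```

So the .T only lines the histogram up with the [row = y, column = x] layout that xx, yy and the contour call use.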
Along the same lines, why are the gaussians in the SciPy
implementation not radially symmetric even though I gave gaussian_kde
a scalar bandwidth?
This is because scipy uses the full covariance matrix of the input x, y points to determine the gaussian kernel. Your formula assumes that x and y aren't correlated. gaussian_kde tests for and uses the correlation between x and y in the result.
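This can be checked directly on the three debug points from the question. The sketch below relies on the covariance attribute of gaussian_kde and compares it with the sample covariance scaled by the square of the scalar bandwidth.

```python
import numpy as np
from scipy.stats import gaussian_kde

samples = np.array([[0.5, 0.5], [0.2, 0.5], [0.2, 0.2]])  # the three debug points
h = 0.1

kde = gaussian_kde(samples.T, bw_method=h)

data_cov = np.cov(samples.T)     # 2x2 sample covariance of the points
kernel_cov = kde.covariance      # covariance used for each Gaussian kernel

# a scalar bw_method scales the data covariance; it is not an absolute width
scaled_matches = np.allclose(kernel_cov, data_cov * h**2)
# a non-zero off-diagonal term means tilted, non-circular kernels
off_diagonal = kernel_cov[0, 1]
```

This would also explain the different widths: the scalar multiplies the spread of the data, whereas h in the FFT code is the kernel's standard deviation in grid units.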
How could I implement the other bandwidth methods available in SciPy
for the FFT code?
I'll leave that one for you to figure out. :) It's not too hard, though. Basically, instead of scotts_factor, you'd change the formula and have some other scalar factor. Everything else is the same.
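As a starting point, here is a sketch of the two named rules as I understand SciPy's definitions, with n the number of points and d the number of dimensions. The helper names are my own, and the random points are only there to have something to compute a covariance from.

```python
import numpy as np

def scott_factor(n, d):
    """Scott's rule: n ** (-1 / (d + 4))."""
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d):
    """Silverman's rule: (n * (d + 2) / 4) ** (-1 / (d + 4))."""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

def kernel_covariance(points, factor):
    """Kernel covariance for points of shape (d, n), as gaussian_kde expects."""
    return np.cov(points) * factor ** 2

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(2, 1000))   # made-up sample set
d, n = points.shape

cov_scott = kernel_covariance(points, scott_factor(n, d))
cov_silverman = kernel_covariance(points, silverman_factor(n, d))
```

Note that for d = 2 the two formulas give the same factor, so in the 2D case switching between them changes nothing; they only differ in other dimensions.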