I'm working on a piece of software which needs to measure the wiggliness of a set of data. Here's a sample of the input I would receive, merged with the lightness plot of each vertical pixel strip:
It is easy to see that the left margin is really wiggly (i.e. has a ton of minima/maxima), and I want to generate a set of critical points of the image. I've applied a Gaussian smoothing function to the data about 10 times, but it seems to have been pretty wiggly to begin with.
Any ideas?
Here's my original code, but it does not produce very nice results (for the wiggliness):
def local_maximum(list, center, delta):
    maximum = [0, 0]
    for i in range(delta):
        if list[center + i] > maximum[1]: maximum = [center + i, list[center + i]]
        if list[center - i] > maximum[1]: maximum = [center - i, list[center - i]]
    return maximum

def count_maxima(list, start, end, delta, threshold = 10):
    count = 0
    for i in range(start + delta, end - delta):
        if abs(list[i] - local_maximum(list, i, delta)[1]) < threshold: count += 1
    return count

def wiggliness(list, start, end, delta, threshold = 10):
    return float(abs(start - end) * delta) / float(count_maxima(list, start, end, delta, threshold))
Take a look at lowpass/highpass/notch/bandpass filters, Fourier transforms, or wavelets. The basic idea is that there are lots of different ways to figure out the frequency content of a signal quantized over different time periods.
If we can figure out what wiggliness is, that would help. I would say the leftmost margin is wiggly because it has more high-frequency content, which you could visualize using a Fourier transform.
If you take a highpass filter of that red signal, you'll get just the high-frequency content, and then you can measure the amplitudes and apply thresholds to determine wiggliness. But I guess wiggliness just needs more formalism behind it.
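For instance, here is a minimal sketch of that highpass idea (not code from this answer: the Butterworth filter, the cutoff of 0.2 of the Nyquist frequency, and the mean-absolute-amplitude measure are all assumptions you would tune for your data):
import numpy as np
from scipy import signal

def high_freq_content(x, cutoff=0.2, order=4):
    # Butterworth highpass filter; cutoff is a fraction of the Nyquist frequency
    b, a = signal.butter(order, cutoff, btype='highpass')
    high = signal.filtfilt(b, a, x)     # zero-phase filtering of the strip's lightness signal
    return np.mean(np.abs(high))        # average amplitude of the high-frequency part
A strip with a larger high_freq_content value would then count as more wiggly.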
For things like these, numpy makes things much easier, as it provides useful functions for manipulating vector data, e.g. adding a scalar to each element, calculating the average value etc.
For example, you might try the zero crossing rate of either the original data (wiggliness1 below) or of its first difference (wiggliness2 below), depending on what exactly wiggliness is supposed to mean; if global trends are to be ignored, you should probably use the difference data. For x you would pass the slice or window of interest from the original data, giving a sort of measure of local wiggliness.
If you use the original data, after removing the bias you might also want to set all values smaller than some threshold to 0 to ignore low-amplitude wiggles.
import numpy as np

def wiggliness1(x):
    # remove bias:
    x = x - np.average(x)
    # calculate zero crossing rate (each crossing flips the sign, i.e. changes it by 2):
    return np.sum(np.abs(np.diff(np.sign(x)))) / 2

def wiggliness2(x):
    # calculate zero crossing rate of the first difference (i.e. count local extrema):
    return np.sum(np.abs(np.sign(np.diff(np.sign(np.diff(x))))))
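For example, on synthetic data (the test signals here are made up, just to illustrate the call):
t = np.linspace(0, 10, 500)
smooth = np.sin(t)
noisy = np.sin(t) + 0.3 * np.random.randn(500)
print(wiggliness1(smooth), wiggliness1(noisy))  # the noisy signal crosses zero more often
print(wiggliness2(smooth), wiggliness2(noisy))  # and has many more local extrema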
I have noisy data at roughly 1 minute intervals across a day.
Here is a simple version:
How can I identify the start and end index values of the less noisy and lower valued period marked in yellow?
Here is the test data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
arr = np.array([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
plt.plot(arr)
plt.show()
You could try to detect less noisy points by measuring the variance of the values in their neighborhood.
For example, for each point you can look at the last N values before it and calculate their standard deviation, then flag the point if the std is lower than some threshold.
The following code applies this procedure using the rolling method of a pandas series.
std_thresh = 1
window_len = 5
s = pd.Series([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
# Create a boolean mask which marks the less noisy points
marked = s.rolling(window=window_len).std() < std_thresh
# Whenever a new point is marked, mark also the other points of the window (see discussion below)
for i in range(window_len + 1, len(marked)):
    if marked[i] and ~marked[i-1]:
        marked[i - (window_len-1) : i] = True
plt.plot(s)
plt.scatter(s[marked].index, s[marked], c='orange')
You can try to change the values of window_len (the length of the window where you calculate the std) and std_thresh (points whose window has std less than it are flagged) and tune them according to your needs.
Note that rolling considers a window which ends at each point, so whenever you encounter a segment of less noisy points, the first window_len-1 of them will not be marked. This is why I included the for loop in the code after defining marked.
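A possible alternative sketch (not part of this answer): a centered rolling window avoids that post-processing loop, at the cost of leaving about window_len//2 unmarked points at each end of the series.
marked_centered = s.rolling(window=window_len, center=True).std() < std_thresh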
For a given point, we can decide to keep/mask it based on certain criteria:
Are its neighbors within some delta?
Is it within some threshold of the minimum?
Is it in a contiguous block?
Note: Since you tagged and imported pandas, I'll use pandas for convenience, but the same ideas can be implemented with pure numpy/matplotlib.
If all lower periods are around the same level
Then a simple approach is to use a neighbor delta with minimum threshold (though be careful of outliers in the real data):
s = pd.Series(np.hstack([arr, arr]))
delta = 2
threshold = s.std()
# check if each point's neighbors are within `delta`
mask_delta = s.diff().abs().le(delta) & s.diff(-1).abs().le(delta)
# check if each point is within `threshold` of the minimum
mask_threshold = s < s.min() + threshold
s.plot(label='raw')
s.where(mask_threshold & mask_delta).plot(marker='*', label='delta & threshold')
If the lower periods are at different levels
Then a global minimum threshold won't work since some periods will be too high. In this case try a neighbor delta with contiguous blocks:
# shift the second period by 5
s = pd.Series(np.hstack([arr, arr + 5]))
delta = 2
blocksize = 10
# check if each point's neighbors are within `delta`
mask_delta = s.diff().abs().le(delta) & s.diff(-1).abs().le(delta)
# check if each point is in a contiguous block of at least `blocksize`
masked = s.where(mask_delta)
groups = masked.isnull().cumsum()
blocksizes = masked.groupby(groups).transform('count').mask(masked.isnull())
mask_contiguous = blocksizes >= blocksize
s.plot(label='raw')
s.where(mask_contiguous).plot(marker='*', label='delta & contiguous')
Well if you just want that 'area', you need some way of finding points within certain bounds. How can we do that? Well, we should probably start by finding the minimum of the array and then finding other values in that same array that fall within the specified deviation:
def lows(arr, dev=0):
    lim = min(arr) + dev
    pts = []
    for i, e in enumerate(arr):
        if e <= lim:
            pts.append((i, e))
    return pts
The above function returns a list of points that fall within the specified bounds. The lower bound is the minimum of the input array and the upper bound is the minimum value plus the deviation you will supply. For example, if you want all points within 1 of the lowest value:
plt.plot(arr)
for pt in lows(arr, 1):
    circle = plt.Circle(pt, 0.2, color='g')
    plt.gca().add_patch(circle)
plt.show()
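To turn those points into the start and end indices of the low period (assuming they form a single, roughly contiguous block, as in your sample data), you could simply take the first and last returned index:
pts = lows(arr, 1)
start_idx, end_idx = pts[0][0], pts[-1][0]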
I am looking for an efficient way to detect plateaus in otherwise very noisy data. The plateaus are always relatively broad. A simple example of what this data could look like:
test=np.random.uniform(0.9,1,100)
test[10:20]=0
plt.plot(test)
Note that there can be multiple plateaus (which should all be detected) which can have different values.
I've tried using scipy.signal.argrelextrema, but it doesn't seem to be doing what I want it to:
peaks=argrelextrema(test,np.less,order=25)
plt.vlines(peaks,ymin=0, ymax=1)
I don't need the exact interval of the plateau; a rough range estimate would be enough, as long as that estimate is at least as big as the actual plateau range. It should be relatively efficient, however.
There is a method scipy.signal.find_peaks that you can try; here is an example:
import numpy
from scipy.signal import find_peaks
test = numpy.random.uniform(0.9, 1.0, 100)
test[10 : 20] = 0
peaks, peak_plateaus = find_peaks(- test, plateau_size = 1)
Although find_peaks only finds peaks, it can be used to find valleys if the array is negated. Then you do the following:
for i in range(len(peak_plateaus['plateau_sizes'])):
    if peak_plateaus['plateau_sizes'][i] > 1:
        print('a plateau of size %d is found' % peak_plateaus['plateau_sizes'][i])
        print('its left index is %d and right index is %d' % (peak_plateaus['left_edges'][i], peak_plateaus['right_edges'][i]))
it will print
a plateau of size 10 is found
its left index is 10 and right index is 19
This is really just a "dumb" machine learning task. You'll want to code a custom function to screen for them. There are two key characteristics of a plateau:
They're consecutive occurrences of the same value (or very nearly so).
The first and last points deviate strongly from a forward and backward moving average, respectively. (Try quantifying this based on the standard deviation if you expect additive noise; for geometric noise you'll have to take the magnitude of your signal into account too.)
A simple loop should then be sufficient to calculate a forward moving average, stdev of points in that forward moving average, reverse moving average, and stdev of points in that reverse moving average.
Read until you find a point well outside the regular noise (compare to variance). Start buffering those indices into a list.
Keep reading and buffering indices into that list while they have the same value (or nearly the same, if your plateaus can be a little rough; you'll want to use some tolerance plus the standard deviation of your plateaus, or just some tolerance if you expect them all to behave similarly).
If the variance of the points in your buffer gets too high, it's not a plateau, too rough; throw it out and start scanning again from your current position.
If the last value was very different from the previous (on the order of the change that triggered your code to start buffering indices) and in the opposite direction of the original impulse, cap your buffer here; you've got a plateau there.
Now do whatever you want with the points at those indices. Delete them, replace them with a linear interpolation between the two boundary points, whatever.
I could generate some noise and give you some sample code, but this is really something you're going to have to adapt to your application. (For example, there's a shortcoming in this method that a plateau which captures a point on the middle of the "cliff edge" may leave that point when it removes the rest of the plateau. If that's something you're worried about, you'll have to do a little more exploring after you ID the plateau.) You should be able to do this in a single pass over the data, but it might be wise to get some statistics on the whole set first to intelligently tweak your thresholds.
If you have an exact definition of what constitutes a plateau, you can make this a lot less hand-wavey and ML-looking, but so long as you're trying to identify fuzzy pattern, you're gonna have to take a statistics-based approach.
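As a rough illustration of the buffering procedure described above, here is a minimal sketch (the function name, parameters and defaults are my assumptions, and it assumes additive noise; adapt it to your application):
import numpy as np

def find_plateau_runs(x, window=10, k=3.0, tol=None):
    x = np.asarray(x, dtype=float)
    tol = np.std(x) if tol is None else tol    # how flat a plateau must stay
    runs, buf = [], []
    for i in range(window, len(x)):
        mean = x[i - window:i].mean()          # forward moving average
        std = x[i - window:i].std()            # local noise estimate
        if not buf:
            # start buffering when a point jumps well outside the local noise
            if abs(x[i] - mean) > k * max(std, 1e-12):
                buf = [i]
        elif abs(x[i] - x[buf[0]]) <= tol:
            # keep buffering while values stay near the plateau level
            buf.append(i)
        else:
            # plateau ends; keep it only if the buffered points stayed flat enough
            if len(buf) > 1 and np.std(x[buf]) <= tol:
                runs.append((buf[0], buf[-1]))
            buf = []
    if len(buf) > 1 and np.std(x[buf]) <= tol:
        runs.append((buf[0], buf[-1]))
    return runs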
I had a similar problem, and found a simple heuristic solution shared below. I find plateaus as ranges of constant gradient of the signal. You could change the code to also check that the gradient is (close to) 0.
I apply a moving average (uniform_filter1d) to filter out noise. Also, I calculate the first and second derivatives of the signal numerically, so I'm not sure it meets the efficiency requirement. But it worked perfectly for my signal and might be a good starting point for others.
def find_plateaus(F, min_length=200, tolerance=0.75, smoothing=25):
    '''
    Finds plateaus of signal using second derivative of F.

    Parameters
    ----------
    F : Signal.
    min_length: Minimum length of plateau.
    tolerance: Number between 0 and 1 indicating how tolerant
        the requirement of constant slope of the plateau is.
    smoothing: Size of uniform filter 1D applied to F and its derivatives.

    Returns
    -------
    plateaus: array of plateau left and right edge pairs
    dF: (smoothed) first derivative of F
    d2F: (smoothed) second derivative of F
    '''
    import numpy as np
    from scipy.ndimage import uniform_filter1d  # scipy.ndimage.filters in older SciPy

    # Calculate smoothed gradients
    smoothF = uniform_filter1d(F, size=smoothing)
    dF = uniform_filter1d(np.gradient(smoothF), size=smoothing)
    d2F = uniform_filter1d(np.gradient(dF), size=smoothing)

    def zero_runs(x):
        '''
        Helper function for finding sequences of 0s in a signal
        https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array/24892274#24892274
        '''
        iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
        absdiff = np.abs(np.diff(iszero))
        ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
        return ranges

    # Find ranges where the second derivative is zero
    # (values with absolute value under eps are treated as zero)
    eps = np.quantile(abs(d2F), tolerance)
    smalld2F = (abs(d2F) <= eps)

    # Find runs in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
    p = zero_runs(np.diff(smalld2F))

    # np.diff(p) gives the length of each range found;
    # only accept plateaus longer than min_length
    plateaus = p[(np.diff(p) > min_length).flatten()]

    return (plateaus, dF, d2F)
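A possible way to call it on the test array from the question (the parameter values here are just assumptions scaled down to a 100-sample signal; the ranges reported depend on the random noise and on tolerance, and you may need to tighten tolerance to suppress spurious plateaus in the noisy baseline):
import numpy as np
import matplotlib.pyplot as plt

test = np.random.uniform(0.9, 1, 100)
test[10:20] = 0

plateaus, dF, d2F = find_plateaus(test, min_length=5, tolerance=0.75, smoothing=5)
print(plateaus)

plt.plot(test)
for left, right in plateaus:
    plt.axvspan(left, right, color='orange', alpha=0.3)  # highlight each detected range
plt.show()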
I'd like to plot (using matplotlib.pyplot) a probability density function (pdf), but hide its left and/or right tail wherever it is fairly close to zero.
E.g., for the normal distribution, the regions more than a few standard deviations away from the mean value.
The pdf is stored in two arrays samplingPts and functionVals,
containing the equidistant sampling point coordinates and the values of the function, respectively.
Both arrays are of type numpy.ndarray and have identical length.
Until now I've used a quick and dirty hack that just cuts down the arrays:
# Define shortened arrays by dropping indices whose
# corresponding value.__abs__() is below a given threshold
threshold = 0.005
samplingPts_shortened = samplingPts[scipy.absolute(functionVals) > threshold]
functionVals_shortened = functionVals[scipy.absolute(functionVals) > threshold]
Very dirty indeed; it cannot be the final solution, because the pdf may have two or more humps and be close to zero in between, in which case those in-between sampling points would be eliminated as well. But they should remain and be present in the plot.
In addition, it is not at all memory saving.
So my question is: how do I implement sound code which, given the two arrays above representing the function, trims these arrays at both ends up to where the function values begin to notably emerge from zero?
Why not look from the beginning of samplingPts for where functionVals rises above the threshold, and cut it off there? Then look backwards from the end of samplingPts for where functionVals rises above the threshold, and cut it off there too.
Something like:
for i in range(len(samplingPts)):
    if scipy.absolute(functionVals[i]) > threshold:
        break
samplingPts = samplingPts[i:]
functionVals = functionVals[i:]

for i in range(len(samplingPts)-1, 0, -1):
    if scipy.absolute(functionVals[i]) > threshold:
        break
samplingPts = samplingPts[:i+1]
functionVals = functionVals[:i+1]
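Equivalently, a vectorized sketch of the same trimming in numpy (this is an alternative, not part of the loop-based answer above; np.argmax returns the index of the first True in a boolean mask):
import numpy as np

mask = np.abs(functionVals) > threshold
if mask.any():
    first = np.argmax(mask)                        # first index above the threshold
    last = len(mask) - 1 - np.argmax(mask[::-1])   # last index above the threshold
    samplingPts = samplingPts[first:last + 1]
    functionVals = functionVals[first:last + 1]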
I have data which consists of the radial distance to the ground, sampled evenly every d_theta. I would like to do Gaussian smoothing on it, but make the size of the smoothing window a constant in x, rather than a constant number of points. What is a good way to do this?
I made a function to do it, but it is slow and I haven't even put in the parts that will calculate the edges yet.
If it helps to do it faster, I guess you can assume the floor is flat and use that to calculate how many points to sample, rather than using the actual x-values.
Here is what I have attempted so far:
import numpy as np
from scipy.signal.windows import gaussian  # assuming scipy's Gaussian window (scipy.signal.gaussian in older versions)

bs = [gaussian(2*n-1, n/2) for n in range(1, 500)]  # bring the computation of the
bs = [b/b.sum() for b in bs]                        # gaussian outside to speed it up

def uneven_gauss_smoothing(xvals, yvals, sigma):
    newy = []
    for i, xval in enumerate(xvals):
        # find how big the window should be to have the chosen sigma
        # (or .5*sigma, whatever):
        wheres = np.where(xvals > xval + sigma)[0]
        iright = wheres[0] - i if len(wheres) else 100
        if i - iright < 0:
            newy.append(0)  # not implemented yet
            continue
        if i + iright >= len(xvals):
            newy.append(0)  # not implemented
            continue
        else:
            # weighted average with gaussian curve:
            newy.append((yvals[i-iright:i+iright+1]*bs[iright]).sum())
    return np.array(newy)
Sorry it's a bit of a mess; it was so incredibly frustrating to debug that I just ended up using the first solution that came to mind (usually one which was difficult to read) for some of the problems that popped up. But it does work in its limited way.
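One possible alternative sketch (not from the question): resample the data onto a uniform x grid, smooth there with a fixed-width Gaussian, and interpolate back, so the kernel width is constant in x. The function name and the choice of grid spacing are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def uniform_grid_gauss_smooth(xvals, yvals, sigma_x, dx=None):
    dx = np.min(np.diff(xvals)) if dx is None else dx
    xu = np.arange(xvals[0], xvals[-1], dx)          # uniform grid in x
    yu = np.interp(xu, xvals, yvals)                 # resample the data onto it
    ys = gaussian_filter1d(yu, sigma=sigma_x / dx)   # sigma expressed in samples
    return np.interp(xvals, xu, ys)                  # back to the original x values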
Operators used to examine the spectrum by eye: knowing the location and width of each peak, they could judge which piece the spectrum belongs to. In the new way, the image is captured by a camera onto a screen, and the width of each band must be computed programmatically.
Old system: spectroscope -> human eye
New system: spectroscope -> camera -> program
What is a good method to compute the width of each band, given their approximate X-axis positions, considering that this task used to be performed perfectly by eye and must now be performed by a program?
Sorry if I am short of details, but they are scarce.
Program listing that generated the previous graph; I hope it is relevant:
import Image
from scipy import *
from scipy.optimize import leastsq
# Load the picture with PIL, process if needed
pic = asarray(Image.open("spectrum.jpg"))
# Average the pixel values along vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)
# Set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
#print projection
# Fit function, two gaussians, adjust as needed
def fitfunc(p,x):
    return p[0]*exp(-(x-p[1])**2/(2.0*p[2]**2)) + \
           p[3]*exp(-(x-p[4])**2/(2.0*p[5]**2))
errfunc = lambda p, x, y: fitfunc(p,x)-y
# Use scipy to fit, p0 is the initial guess
p0 = array([0,20,1,0,75,10])
X = xrange(len(projection))
p1, success = leastsq(errfunc, p0, args=(X,projection))
Y = fitfunc(p1,X)
# Output the result
print "Mean values at: ", p1[1], p1[4]
# Plot the result
from pylab import *
#subplot(211)
#imshow(pic)
#subplot(223)
#plot(projection)
#subplot(224)
#plot(X,Y,'r',lw=5)
#show()
subplot(311)
imshow(pic)
subplot(312)
plot(projection)
subplot(313)
plot(X,Y,'r',lw=5)
show()
Given an approximate starting point, you could use a simple algorithm that finds the local maximum closest to this point. Your fitting code may be doing that already (I wasn't sure whether you were using it successfully or not).
Here's some code that demonstrates simple peak finding from a user-given starting point:
#!/usr/bin/env python
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
# Sample data with two peaks: small one at t=0.4, large one at t=0.8
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts-0.4)/0.1)**2) + 2*np.exp(-((ts-0.8)/0.1)**2)
# Say we have an approximate starting point of 0.35
start_point = 0.35
# Nearest index in "ts" to this starting point is...
start_index = np.argmin(np.abs(ts - start_point))
# Find the local maxima in our data by looking for a sign change in
# the first difference
# From http://stackoverflow.com/a/9667121/188535
maxes = (np.diff(np.sign(np.diff(xs))) < 0).nonzero()[0] + 1
# Find which of these peaks is closest to our starting point
index_of_peak = maxes[np.argmin(np.abs(maxes - start_index))]
print "Peak centre at: %.3f" % ts[index_of_peak]
# Quick plot showing the results: blue line is data, green dot is
# starting point, red dot is peak location
plt.plot(ts, xs, '-b')
plt.plot(ts[start_index], xs[start_index], 'og')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.show()
This method will only work if the ascent up the peak is perfectly smooth from your starting point. If it needs to be more resilient to noise, PyDSTool seems like it might help (I have not used it myself). This SciPy post details how to use it for detecting 1D peaks in a noisy data set.
So assume at this point you've found the centre of the peak. Now for the width: there are several methods you could use, but the easiest is probably the "full width at half maximum" (FWHM). Again, this is simple and therefore fragile. It will break for close double-peaks, or for noisy data.
The FWHM is exactly what its name suggests: you find the width of the peak where it's halfway to the maximum. Here's some code that does that (it just continues on from above):
# FWHM...
half_max = xs[index_of_peak]/2
# This finds where in the data we cross over the halfway point to our peak. Note
# that this is global, so we need an extra step to refine these results to find
# the closest crossovers to our peak.
# Same sign-change-in-first-diff technique as above
hm_left_indices = (np.diff(np.sign(np.diff(np.abs(xs[:index_of_peak] - half_max)))) > 0).nonzero()[0] + 1
# Add "index_of_peak" to result because we cut off the left side of the data!
hm_right_indices = (np.diff(np.sign(np.diff(np.abs(xs[index_of_peak:] - half_max)))) > 0).nonzero()[0] + 1 + index_of_peak
# Find closest half-max index to peak
hm_left_index = hm_left_indices[np.argmin(np.abs(hm_left_indices - index_of_peak))]
hm_right_index = hm_right_indices[np.argmin(np.abs(hm_right_indices - index_of_peak))]
# And the width is...
fwhm = ts[hm_right_index] - ts[hm_left_index]
print "Width: %.3f" % fwhm
# Plot to illustrate FWHM: blue line is data, red circle is peak, red line
# shows FWHM
plt.plot(ts, xs, '-b')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.plot(
    [ts[hm_left_index], ts[hm_right_index]],
    [xs[hm_left_index], xs[hm_right_index]], '-r')
plt.show()
It doesn't have to be the full width at half maximum — as one commenter points out, you can try to figure out where your operators' normal threshold for peak detection is, and turn that into an algorithm for this step of the process.
A more robust way might be to fit a Gaussian curve (or your own model) to a subset of the data centred around the peak (say, from a local minimum on one side to a local minimum on the other) and use one of the parameters of that curve (e.g. sigma) to calculate the width.
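A minimal sketch of that Gaussian-fit idea, continuing from the ts, xs, index_of_peak and half-max indices above (using scipy.optimize.curve_fit and a window a few samples beyond the half-max points instead of the neighbouring minima; both choices are assumptions, not part of this answer's code):
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, mu, sigma):
    return a * np.exp(-(x - mu)**2 / (2.0 * sigma**2))

# Fit only a window around the peak
lo = max(hm_left_index - 5, 0)
hi = min(hm_right_index + 5, len(ts) - 1)
p0 = [xs[index_of_peak], ts[index_of_peak], ts[hm_right_index] - ts[hm_left_index]]
popt, _ = curve_fit(gauss, ts[lo:hi + 1], xs[lo:hi + 1], p0=p0)
sigma = abs(popt[2])
print("Gaussian sigma: %.3f, FWHM from the fit: %.3f" % (sigma, 2.3548 * sigma))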
I realise this is a lot of code, but I've deliberately avoided factoring out the index-finding functions to "show my working" a bit more, and of course the plotting functions are there just to demonstrate.
Hopefully this gives you at least a good starting point to come up with something more suitable to your particular set.
Late to the party, but for anyone coming across this question in the future...
Eye movement data looks very similar to this; I'd base an approach off that used by Nystrom + Holmqvist, 2010.
Smooth the data using a Savitzky-Golay filter (scipy.signal.savgol_filter in scipy v0.14+) to get rid of some of the low-level noise while keeping the large peaks intact. The authors recommend using an order of 2 and a window size of about twice the width of the smallest peak you want to be able to detect.
You can find where the bands are by arbitrarily removing all values above a certain y value (set them to numpy.nan). Then take the (nan)mean and (nan)standard deviation of the remainder, and remove all values greater than the mean + [parameter]*std (I think they use 6 in the paper). Iterate until you're not removing any data points, though depending on your data, certain values of [parameter] may not stabilise.
Then use numpy.isnan() to find events vs non-events, and numpy.diff() to find the start and end of each event (values of -1 and 1 respectively). To get even more accurate start and end points, you can scan along the data backward from each start and forward from each end to find the nearest local minimum with a value smaller than mean + [another parameter]*std (I think they use 3 in the paper). Then you just need to count the data points between each start and end.
This won't work for that double peak; you'd have to do some extrapolation for that.
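A rough sketch of the iterative mean + k*std thresholding step described above (k plays the role of the [parameter] in the text; the function name, NaN handling details and return format are my assumptions):
import numpy as np

def band_events(signal, k=6.0, max_iter=100):
    vals = np.asarray(signal, dtype=float).copy()
    for _ in range(max_iter):
        mean, std = np.nanmean(vals), np.nanstd(vals)
        with np.errstate(invalid='ignore'):       # NaN comparisons are intentional
            above = vals > mean + k * std
        if not above.any():                        # converged: nothing left to remove
            break
        vals[above] = np.nan                       # mark band samples as events
    events = np.isnan(vals)
    edges = np.diff(events.astype(int))
    starts = np.where(edges == 1)[0] + 1           # first sample of each event
    ends = np.where(edges == -1)[0]                # last sample of each event
    return events, starts, ends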
The best method might be to statistically compare a bunch of methods with human results.
You would take a large variety of data and a large variety of measurement estimates (widths at various thresholds, area above various thresholds, different threshold selection methods, 2nd moments, polynomial curve fits of various degrees, pattern matching, etc.) and compare these estimates to human measurements of the same data set. Pick the estimate method that correlates best with expert human results. Or maybe pick several methods, the best one for each of various peak heights, for various separations from other peaks, etc.
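For instance, the selection step could be as simple as the following sketch (the arrays here are made-up placeholders for your expert measurements and each method's estimates):
import numpy as np

human = np.array([4.2, 3.9, 5.1, 4.8])                  # expert widths for some test spectra
estimates = {
    'fwhm': np.array([4.0, 3.7, 5.3, 4.6]),
    'gauss_sigma_fit': np.array([4.3, 4.0, 5.0, 4.9]),
}
# Correlate each method's widths with the human measurements and pick the best
scores = {name: np.corrcoef(vals, human)[0, 1] for name, vals in estimates.items()}
best = max(scores, key=scores.get)
print('best method:', best, 'correlation:', scores[best])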