Find Start and End point of a Curve in Python

I have a signal with 5 curves and my goal is to find the start, peak, and end point of each curve.
To find the peaks, I am using the find_peaks function from scipy.signal. The following screenshot shows the signal and the 5 peaks:
In order to find the start point of each of these curves, I have written a function that checks for the percentage difference between every pair of signal values. If the difference is less than or equal to 1%, then I mark it as the start point.
def find_start(peak, leftLimit, signal):
    # Go in left direction from the peak
    # Start at the current curve's peak, and go until the previous curve's peak (or until zero in case of very first curve)
    for i in range(peak, leftLimit, -1):
        diff = signal[i] - signal[i-1]
        perc_change = float(diff) / signal[i] * 100
        perc_change = round(abs(perc_change), 3)
        if perc_change <= 1.0:
            return i
    return -1
I have a similar function to find the end point of each curve.
Now, although these two functions work in general, they fail if the signal is quite distorted and is not constant (straight horizontal line) outside the 5 curves. The following screenshot shows one such case:
Is there a better way (or any Python Library function) that can help locate the start and end points for each curve?

I would try folding (correlating) your signal with a defined pulse (rectangular or triangular) and shifting this pulse across your signal.
It is basically described here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html
In the resulting correlation you could then apply simple thresholding to get the indices where it rises above / drops below the threshold.
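For illustration, here is a minimal sketch of that idea; the pulse width and the threshold fraction are assumptions you would tune for your own signal:
import numpy as np
from scipy.signal import correlate

def curve_edges(signal, pulse_width=20, threshold_frac=0.5):
    pulse = np.ones(pulse_width)                   # rectangular pulse
    corr = correlate(signal, pulse, mode='same')   # slide the pulse across the signal
    above = corr > threshold_frac * corr.max()     # simple thresholding of the correlation
    d = np.diff(above.astype(int))
    starts = np.flatnonzero(d == 1) + 1            # rises above the threshold
    ends = np.flatnonzero(d == -1) + 1             # drops below the threshold
    return starts, ends
The starts/ends pairs then bracket each curve, and the peak found by find_peaks should fall between them.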

Related

Wrong inflection points

I want to come up with a plot that shows the inflection points of a curve as follows:
I have a somewhat similar curve, and I want to somehow compute the inflection points using Python. My curve looks as follows:
I am using the following code to compute the inflection points:
def find_inflection_points(df, n=1):
    raw = df['consumption'].to_numpy()
    infls = []
    dx = 0
    for i, x in enumerate(np.diff(raw, n)):
        if x >= dx and i > 0:
            infls.append(i*n)
        dx = x
    # plot results
    plt.plot(raw, label='Input Data')
    for i, infl in enumerate(infls, 1):
        plt.axvline(x=infl, color='k', label=f'Inflection Point {i}')
    plt.legend(bbox_to_anchor=(1.55, 1.0))
    return infls
However, I am getting the following plot:
I would expect fewer inflection points. Any idea of what I should change, or any other proposal for implementation?
EDIT:
The data is the following:
raw = np.array([52.33,
50.154444444444444,
48.69222222222223,
46.49111111111111,
44.01444444444444,
43.30555555555556,
43.034444444444446,
40.62888888888889,
40.38111111111111,
39.07666666666667,
38.339999999999996,
36.41444444444445,
36.37888888888889,
36.17111111111111,
35.666666666666664,
33.827777777777776,
29.35222222222222,
28.60888888888889,
24.43,
22.078888888888887,
21.756666666666664,
20.345555555555556,
19.874444444444446,
19.763333333333335])
So, strictly speaking, an inflection point is indeed a change of sign of the curvature. For a three times differentiable function, that is a point where the second derivative changes sign (the second derivative is 0, and the third derivative is not).
In your case, since the data are very discrete (only 24 points, one per hour of the day, I surmise), it is quite tricky to talk about second and third derivatives. But if we give it a try, we can see that the points you are interested in are not inflection points. When the second derivative is 0, the first derivative is locally constant, which means the slope is constant. The points you seem to be interested in are, on the contrary, points where there is a change of slope: the opposite of an inflection point!
They are, though, inflection points of the derivative: they appear to be local extrema of the second derivative, so, at least if we had a curve continuous enough to dare speak of a third derivative, one could say that the extrema of the second derivative are the zeros of the third derivative. And the third derivative is the second derivative of the first derivative. So you could say that you are interested in the inflection points of the first derivative, that is, of the slope.
See code below:
import numpy as np
from matplotlib import pyplot as plt

raw = np.array([52.33, 50.154444444444444, 48.69222222222223, 46.49111111111111, 44.01444444444444, 43.30555555555556, 43.034444444444446, 40.62888888888889, 40.38111111111111, 39.07666666666667, 38.339999999999996, 36.41444444444445, 36.37888888888889, 36.17111111111111, 35.666666666666664, 33.827777777777776, 29.35222222222222, 28.60888888888889, 24.43, 22.078888888888887, 21.756666666666664, 20.345555555555556, 19.874444444444446, 19.763333333333335])
x = np.arange(24)

# Second derivative with the discrete scheme y[-dt] - 2*y[0] + y[dt]
rawb = np.roll(raw, 1)
rawa = np.roll(raw, -1)
der2 = rawb + rawa - 2*raw
der2[0] = der2[-1] = np.nan
der2abs = np.abs(der2)

# Barycentric estimate of where der2 crosses 0 between index i and i+1
offs = der2abs / (der2abs + np.roll(der2abs, -1))
yoffs = raw*(1 - offs) + rawa*offs
chsign = (der2 * np.roll(der2, -1)) < 0

# Local extrema of der2, kept only where |der2| exceeds a minimum value s
s = 1.5
mxder2 = (
    (((der2 > np.roll(der2, -1)) & (der2 > np.roll(der2, 1))) |
     ((der2 < np.roll(der2, -1)) & (der2 < np.roll(der2, 1))))
    & (der2abs > s)
)

fig, ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(raw)
ax2.plot([0]*24)
ax2.plot(der2)
ax.scatter(x[mxder2], raw[mxder2], 80, c='r', marker='*')
ax.scatter((x + offs)[chsign], yoffs[chsign], 30, c='g', marker='o')
plt.show()
It shows
The blue line is your data.
The orange line is the second derivative.
The green dots are the points where the second derivative is 0.
The red stars are the points where the second derivative is an extremum (with a minimum absolute value: we don't count local extrema in areas where the second derivative is almost flatlining at 0).
From what you have shown, you seem more interested in the red stars!
The green dots are not just too numerous; even filtering them (by some unknown criterion) would not do: they are all quite boring!
What makes the situation, and the vocabulary, ambiguous is the fact that we are talking about discrete points in reality. Inflection points are points where the second derivative is 0, that is, where the first derivative is an extremum. And you need, on the contrary, points where the second derivative is extreme. On such a discrete set of data, a point can be both, though. And maybe that was the case in your paper: points with a sharp change of slope are points where the second derivative is extremely positive but is surrounded by two extremely negative second derivatives (or the opposite).
But, my point is, you seem more interested in the red stars.
As for how I compute that:
der2 is the second derivative, using discrete scheme y[-dt]-2y[0]+y[dt]
der2abs is its absolute value.
offs is a barycenter weighted by successive values of der2abs. Where there is a change of sign of the 2nd derivative between index i and i+1, this gives an estimate of the exact position of the 0: offs is 0 if the 0 is at index i, 1 if it is at index i+1, 0.5 if it is exactly between i and i+1, etc. offs makes no sense where there is no change of sign (and we won't use those values).
yoffs is the raw value interpolated with the same barycenter. So yoffs is raw[i], raw[i+1], or raw[i+0.5] in the three previous cases (if raw[i+0.5] were a legal thing). Like offs, it makes sense only where there is a change of sign of der2.
chsign is precisely what says where those changes of sign occur.
So we just have to plot yoffs[chsign] vs (x+offs)[chsign] to mark the places where the second derivative is 0.
The red stars are easier to compute:
we just find all the points whose second derivative is either bigger or smaller than both of its 2 neighbors, and filter those with a minimum-value condition (|secondDerivative| must be at least 1.5).

Calculating area within a curve

I am measuring the current that passes through a sample as I vary the voltage across it. The result is a current-voltage plot like this:
https://content.sciendo.com/view/journals/joeb/9/1/graphic/j_joeb-2018-0023_fig_004.jpg
I want to calculate the area within the curve of the first half period (the part of the curve in the first quadrant). I am not sure what the best way to do this in Python is. I have tried to write some code that finds pairs of points with the same x-coordinate, subtracts the bottom y-value from the top y-value, and iterates over all the points in the first quadrant.
def LobeAreaByPeriod(all_periods_I):
    print('Starting Lobe area by period')
    LobeArea = []
    for period in all_periods_I:
        halfperiod = round(len(period)/2)
        duration = round(halfperiod/2)
        area = 0
        for i in range(duration):
            area += (period[i] - period[halfperiod - i])
        LobeArea.append(area)
    return LobeArea
I am not certain the pairs will be located directly above each other this way, and I find it difficult to check if the answer is really correct. Any tips on how to do this?
To calculate the area within a curve you're going to have to do some sort of approximation, and the method you choose will vary with the precision you need. I like the trapezoidal rule; Numpy has an implementation that is quite nice:
np.trapz
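As a minimal illustration (the voltage and current arrays here are made-up placeholders, not your data), the loop area can be taken as the trapezoidal integral of the difference between the upper and lower branches over the same voltage range:
import numpy as np

voltage = np.linspace(0.0, 1.0, 100)           # x-values of one branch (made up)
current_up = np.sin(np.pi * voltage)           # upper branch of the loop (made up)
current_down = 0.5 * np.sin(np.pi * voltage)   # lower branch of the loop (made up)

# Area enclosed between the two branches, integrated over voltage
lobe_area = np.trapz(current_up - current_down, voltage)
print(lobe_area)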

Find plateau in Numpy array

I am looking for an efficient way to detect plateaus in otherwise very noisy data. The plateaus are always relatively broad. A simple example of what this data could look like:
test=np.random.uniform(0.9,1,100)
test[10:20]=0
plt.plot(test)
Note that there can be multiple plateaus (which should all be detected) which can have different values.
I've tried using scipy.signal.argrelextrema, but it doesn't seem to be doing what I want it to:
peaks=argrelextrema(test,np.less,order=25)
plt.vlines(peaks,ymin=0, ymax=1)
I don't need the exact interval of the plateau; a rough range estimate would be enough, as long as that estimate is bigger than or equal to the actual plateau range. It should be relatively efficient, however.
There is a method scipy.signal.find_peaks that you can try; here is an example:
import numpy
from scipy.signal import find_peaks
test = numpy.random.uniform(0.9, 1.0, 100)
test[10 : 20] = 0
peaks, peak_plateaus = find_peaks(- test, plateau_size = 1)
Although find_peaks only finds peaks, it can be used to find valleys if the array is negated. Then you do the following:
for i in range(len(peak_plateaus['plateau_sizes'])):
    if peak_plateaus['plateau_sizes'][i] > 1:
        print('a plateau of size %d is found' % peak_plateaus['plateau_sizes'][i])
        print('its left index is %d and right index is %d' % (peak_plateaus['left_edges'][i], peak_plateaus['right_edges'][i]))
it will print
a plateau of size 10 is found
its left index is 10 and right index is 19
This is really just a "dumb" machine learning task. You'll want to code a custom function to screen for them. A plateau has two key characteristics:
They're consecutive occurrences of the same value (or very nearly so).
The first and last points deviate strongly from a forward and backward moving average, respectively. (Try quantifying this based on the standard deviation if you expect additive noise; for geometric noise you'll have to take the magnitude of your signal into account too.)
A simple loop should then be sufficient to calculate a forward moving average, stdev of points in that forward moving average, reverse moving average, and stdev of points in that reverse moving average.
Read until you find a point well outside the regular noise (compare to variance). Start buffering those indices into a list.
Keep reading and buffering indices into that list while they have the same value (or nearly the same, if your plateaus can be a little rough; you'll want to use some tolerance plus the standard deviation of your plateaus, or just some tolerance if you expect them all to behave similarly).
If the variance of the points in your buffer gets too high, it's not a plateau, too rough; throw it out and start scanning again from your current position.
If the last value was very different from the previous (on the order of the change that triggered your code to start buffering indices) and in the opposite direction of the original impulse, cap your buffer here; you've got a plateau there.
Now do whatever you want with the points at those indices. Delete them, replace them with a linear interpolation between the two boundary points, whatever.
I could generate some noise and give you some sample code, but this is really something you're going to have to adapt to your application. (For example, there's a shortcoming in this method that a plateau which captures a point on the middle of the "cliff edge" may leave that point when it removes the rest of the plateau. If that's something you're worried about, you'll have to do a little more exploring after you ID the plateau.) You should be able to do this in a single pass over the data, but it might be wise to get some statistics on the whole set first to intelligently tweak your thresholds.
If you have an exact definition of what constitutes a plateau, you can make this a lot less hand-wavy and ML-looking, but as long as you're trying to identify a fuzzy pattern, you're going to have to take a statistics-based approach.
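As a very rough sketch of the first characteristic only (runs of nearly equal values), not the full buffering-and-variance logic described above, something like this could be a starting point; the window, tolerance and minimum run length are all assumptions to tune against your noise level:
import numpy as np

def flat_runs(x, window=5, tol=0.02, min_windows=5):
    # local peak-to-peak spread as a cheap stand-in for the variance check
    spread = np.array([np.ptp(x[i:i + window]) for i in range(len(x) - window + 1)])
    flat = (spread < tol).astype(int)
    # group consecutive "flat" windows into (start, one-past-end) pairs
    edges = np.flatnonzero(np.diff(np.concatenate(([0], flat, [0]))))
    runs = edges.reshape(-1, 2)
    # report the approximate data-index range each run covers
    return [(s, e + window - 2) for s, e in runs if (e - s) >= min_windows]
On the noisy example from the question (test[10:20] = 0), flat_runs(test) should flag roughly the range (10, 19).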
I had a similar problem, and found a simple heuristic solution, shared below. I find plateaus as ranges of constant gradient of the signal. You could change the code to also check that the gradient is (close to) 0.
I apply a moving average (uniform_filter1d) to filter out noise. Also, I calculate the first and second derivative of the signal numerically, so I'm not sure it meets the requirement of efficiency. But it worked perfectly for my signal and might be a good starting point for others.
def find_plateaus(F, min_length=200, tolerance=0.75, smoothing=25):
    '''
    Finds plateaus of signal using second derivative of F.

    Parameters
    ----------
    F : Signal.
    min_length: Minimum length of plateau.
    tolerance: Number between 0 and 1 indicating how tolerant
        the requirement of constant slope of the plateau is.
    smoothing: Size of uniform filter 1D applied to F and its derivatives.

    Returns
    -------
    plateaus: array of plateau left and right edges pairs
    dF: (smoothed) derivative of F
    d2F: (smoothed) Second Derivative of F
    '''
    import numpy as np
    from scipy.ndimage import uniform_filter1d

    # calculate smooth gradients
    smoothF = uniform_filter1d(F, size=smoothing)
    dF = uniform_filter1d(np.gradient(smoothF), size=smoothing)
    d2F = uniform_filter1d(np.gradient(dF), size=smoothing)

    def zero_runs(x):
        '''
        Helper function for finding sequences of 0s in a signal
        https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array/24892274#24892274
        '''
        iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
        absdiff = np.abs(np.diff(iszero))
        ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
        return ranges

    # Find ranges where second derivative is zero
    # Values under eps are assumed to be zero.
    eps = np.quantile(abs(d2F), tolerance)
    smalld2F = (abs(d2F) <= eps)

    # Find repetitions in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
    p = zero_runs(np.diff(smalld2F))

    # np.diff(p) gives the length of each range found.
    # only accept plateaus of min_length
    plateaus = p[(np.diff(p) > min_length).flatten()]

    return (plateaus, dF, d2F)
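For reference, a call on a signal like the one in the question might look like this (the parameters are scaled down here because the defaults above assume a much longer signal):
import numpy as np

test = np.random.uniform(0.9, 1.0, 1000)
test[100:300] = 0
plateaus, dF, d2F = find_plateaus(test, min_length=50, tolerance=0.75, smoothing=15)
print(plateaus)  # inspect the returned ranges; the zeroed-out stretch should be among them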

Selecting best range of values from histogram curve

Scenario:
I am trying to track two differently colored objects. At the beginning, the user is prompted to hold the first colored object (say, a red one) at a particular position in front of the camera (marked on screen by a rectangle) and press any key; my program then takes that portion of the frame (the ROI) and analyzes its color to find what color to track. The same goes for the second object. Then, as usual, I use the cv.inRange function in the HSV color plane and track the object.
What is done:
I took the ROI of the object to be tracked, converted it to HSV, and checked the Hue histogram. I got two cases, as below:
(Here there is only one major central peak. But in some cases I get two such peaks: a bigger peak with some pixel cluster around it, and a second peak, smaller than the first but of significant size, with a small cluster around it as well. I don't have a sample image of it now, but it looks roughly like the sketch below, created in Paint.)
Question:
How can I get the best range of hue values from these histograms?
By best range I mean a range that, say, around 80-90% of the pixels in the ROI fall into.
Or is there any better method than this to track differently colored objects?
If I understand right, the only thing you need here is to find a maximum in a graph, where the maximum is not necessarily the highest peak but the area with the largest density.
Here's a very simple, not too scientific, but fast O(n) approach. Run the histogram through a low-pass filter, e.g. a moving average. The length of your average could be, say, 20. In that case the 10th value of your new modified histogram would be:
mh10 = (h1 + h2 + ... + h20) / 20
where h1, h2... are values from your histogram. The next value:
mh11 = (h2 + h3 + ... + h21) / 20
which can be calculated much more easily using the previously calculated mh10, by dropping its first component and adding a new one at the end:
mh11 = mh10 - h1/20 + h21/20
Your only problem is how you handle numbers at the edges of your histogram. You could shrink your moving average's length to the length available, or you could pad with values before and after what you already have. Either way, you couldn't handle peaks right at the edge.
And finally, when you have this modified histogram, just take the maximum. This works because every value in your histogram now contains not only itself but its neighbors as well.
A more sophisticated approach is to weight your average, for example with a Gaussian curve. But that's not linear any more; it would be O(k*n), where k is the length of your average, which is also the length of the Gaussian.
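A minimal numpy version of the moving-average idea might look like this; hist is assumed to be the hue histogram of your ROI (e.g. 180 hue bins):
import numpy as np

def densest_hue(hist, window=20):
    kernel = np.ones(window) / window
    smoothed = np.convolve(hist, kernel, mode='same')  # low-pass filter: moving average
    return int(np.argmax(smoothed))                    # centre of the densest hue region
From that maximum you could then grow the range outward bin by bin until it covers, say, 80-90% of the ROI pixels.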

These spectrum bands used to be judged by eye, how to do it programmatically?

Operators used to examine the spectrum, knowing the location and width of each peak, and judge which piece the spectrum belongs to. In the new way, the image is captured by a camera onto a screen, and the width of each band must be computed programmatically.
Old system: spectroscope -> human eye
New system: spectroscope -> camera -> program
What is a good method to compute the width of each band, given their approximate X-axis positions, considering that this task used to be performed perfectly by eye and must now be performed by a program?
Sorry if I am short of details, but they are scarce.
Program listing that generated the previous graph; I hope it is relevant:
from PIL import Image
from numpy import *
from scipy.optimize import leastsq

# Load the picture with PIL, process if needed
pic = asarray(Image.open("spectrum.jpg"))

# Average the pixel values along vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)

# Set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
# print(projection)

# Fit function, two gaussians, adjust as needed
def fitfunc(p, x):
    return p[0]*exp(-(x-p[1])**2/(2.0*p[2]**2)) + \
           p[3]*exp(-(x-p[4])**2/(2.0*p[5]**2))

errfunc = lambda p, x, y: fitfunc(p, x) - y

# Use scipy to fit, p0 is initial guess
p0 = array([0, 20, 1, 0, 75, 10])
X = arange(len(projection))
p1, success = leastsq(errfunc, p0, args=(X, projection))
Y = fitfunc(p1, X)

# Output the result
print("Mean values at: ", p1[1], p1[4])

# Plot the result
from pylab import *
#subplot(211)
#imshow(pic)
#subplot(223)
#plot(projection)
#subplot(224)
#plot(X, Y, 'r', lw=5)
#show()
subplot(311)
imshow(pic)
subplot(312)
plot(projection)
subplot(313)
plot(X, Y, 'r', lw=5)
show()
Given an approximate starting point, you could use a simple algorithm that finds the local maximum closest to this point. Your fitting code may be doing that already (I wasn't sure whether you were using it successfully or not).
Here's some code that demonstrates simple peak finding from a user-given starting point:
#!/usr/bin/env python
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
# Sample data with two peaks: small one at t=0.4, large one at t=0.8
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts-0.4)/0.1)**2) + 2*np.exp(-((ts-0.8)/0.1)**2)
# Say we have an approximate starting point of 0.35
start_point = 0.35
# Nearest index in "ts" to this starting point is...
start_index = np.argmin(np.abs(ts - start_point))
# Find the local maxima in our data by looking for a sign change in
# the first difference
# From http://stackoverflow.com/a/9667121/188535
maxes = (np.diff(np.sign(np.diff(xs))) < 0).nonzero()[0] + 1
# Find which of these peaks is closest to our starting point
index_of_peak = maxes[np.argmin(np.abs(maxes - start_index))]
print("Peak centre at: %.3f" % ts[index_of_peak])
# Quick plot showing the results: blue line is data, green dot is
# starting point, red dot is peak location
plt.plot(ts, xs, '-b')
plt.plot(ts[start_index], xs[start_index], 'og')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.show()
This method will only work if the ascent up the peak is perfectly smooth from your starting point. If it needs to be more resilient to noise, then (though I have not used it) PyDSTool seems like it might help; this SciPy post details how to use it for detecting 1D peaks in a noisy data set.
So assume at this point you've found the centre of the peak. Now for the width: there are several methods you could use, but the easiest is probably the "full width at half maximum" (FWHM). Again, this is simple and therefore fragile. It will break for close double-peaks, or for noisy data.
The FWHM is exactly what its name suggests: you find the width of the peak where it's halfway to the maximum. Here's some code that does that (it just continues on from above):
# FWHM...
half_max = xs[index_of_peak]/2
# This finds where in the data we cross over the halfway point to our peak. Note
# that this is global, so we need an extra step to refine these results to find
# the closest crossovers to our peak.
# Same sign-change-in-first-diff technique as above
hm_left_indices = (np.diff(np.sign(np.diff(np.abs(xs[:index_of_peak] - half_max)))) > 0).nonzero()[0] + 1
# Add "index_of_peak" to result because we cut off the left side of the data!
hm_right_indices = (np.diff(np.sign(np.diff(np.abs(xs[index_of_peak:] - half_max)))) > 0).nonzero()[0] + 1 + index_of_peak
# Find closest half-max index to peak
hm_left_index = hm_left_indices[np.argmin(np.abs(hm_left_indices - index_of_peak))]
hm_right_index = hm_right_indices[np.argmin(np.abs(hm_right_indices - index_of_peak))]
# And the width is...
fwhm = ts[hm_right_index] - ts[hm_left_index]
print("Width: %.3f" % fwhm)
# Plot to illustrate FWHM: blue line is data, red circle is peak, red line
# shows FWHM
plt.plot(ts, xs, '-b')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.plot(
    [ts[hm_left_index], ts[hm_right_index]],
    [xs[hm_left_index], xs[hm_right_index]], '-r')
plt.show()
It doesn't have to be the full width at half maximum — as one commenter points out, you can try to figure out where your operators' normal threshold for peak detection is, and turn that into an algorithm for this step of the process.
A more robust way might be to fit a Gaussian curve (or your own model) to a subset of the data centred around the peak, say from a local minimum on one side to a local minimum on the other, and use one of the parameters of that curve (e.g. sigma) to calculate the width.
I realise this is a lot of code, but I've deliberately avoided factoring out the index-finding functions to "show my working" a bit more, and of course the plotting functions are there just to demonstrate.
Hopefully this gives you at least a good starting point to come up with something more suitable for your particular data set.
Late to the party, but for anyone coming across this question in the future...
Eye movement data looks very similar to this; I'd base an approach on the one used by Nystrom + Holmqvist, 2010.
Smooth the data using a Savitzky-Golay filter (scipy.signal.savgol_filter in scipy v0.14+) to get rid of some of the low-level noise while keeping the large peaks intact; the authors recommend using an order of 2 and a window size of about twice the width of the smallest peak you want to be able to detect.
You can find where the bands are by arbitrarily removing all values above a certain y value (set them to numpy.nan). Then take the (nan)mean and (nan)standard deviation of the remainder, and remove all values greater than the mean + [parameter]*std (I think they use 6 in the paper). Iterate until you're not removing any data points, though depending on your data certain values of [parameter] may not stabilise.
Then use numpy.isnan() to find events vs non-events, and numpy.diff() to find the start and end of each event (values of -1 and 1 respectively). To get even more accurate start and end points, you can scan along the data backward from each start and forward from each end to find the nearest local minimum with a value smaller than mean + [another parameter]*std (I think they use 3 in the paper). Then you just need to count the data points between each start and end.
This won't work for that double peak; you'd have to do some extrapolation for that.
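A rough sketch of that smoothing-plus-iterative-threshold procedure might look like the following; the window length, polynomial order and the 6*std multiplier come from the description above and should be treated as starting points, and the backward/forward refinement scan with the smaller 3*std threshold is left out:
import numpy as np
from scipy.signal import savgol_filter

def find_bands(projection, window=31, k=6.0):
    # Smooth while keeping the large peaks intact (order 2, window ~ 2x smallest peak width)
    smoothed = savgol_filter(projection, window_length=window, polyorder=2)
    work = smoothed.astype(float).copy()
    # Iteratively strip out everything that sticks up above the baseline
    while True:
        cutoff = np.nanmean(work) + k * np.nanstd(work)
        above = work > cutoff
        if not above.any():
            break
        work[above] = np.nan
    # The NaNs now mark the bands; diff of the mask gives each band's start and end
    events = np.isnan(work).astype(int)
    d = np.diff(events)
    starts = np.flatnonzero(d == 1) + 1
    ends = np.flatnonzero(d == -1) + 1
    return list(zip(starts, ends))
Each (start, end) pair then gives the band width in samples, which you can convert to pixels or wavelength as needed.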
The best method might be to statistically compare a bunch of methods with human results.
You would take a large variety of data and a large variety of measurement estimates (widths at various thresholds, areas above various thresholds, different threshold-selection methods, 2nd moments, polynomial curve fits of various degrees, pattern matching, etc.) and compare these estimates to human measurements of the same data set. Pick the estimation method that correlates best with expert human results. Or maybe pick several methods: the best one for each of various heights, for various separations from other peaks, and so on.
