I want to fit a 2D shape in an image. In the past, I have successfully done this using lmfit in Python and wrapping the 2D function/data to 1D. On that occasion, the 2D model was a smooth function (a ring with a gaussian profile). Now I am trying to do the same but with a "non-smooth function" and it is not working as expected.
This is what I am trying to do (guessed and fitted are the same):
I have shifted the guessed parameters on purpose so I can easily see if the fit moves as expected, and nothing happens.
I have noticed that if I use a 2D Gaussian (a smooth function) instead of the Swiss flag, this works fine (see MWE below):
So I guess the problem is related to the fact that the Swiss flag function is not smooth. I tried to make it smooth by applying a Gaussian filter (blur), but it still did not work, even though the Swiss flag plot became very blurred.
After some time I came to think that maybe the step size that lmfit (or whatever is working in the background) uses is too small to produce any change in the Swiss flag. I would like to try increasing the step size to 1, but I don't know exactly how to do that.
This is my MWE (sorry, it is still quite long):
import numpy as np
import myplotlib as mpl # https://github.com/SengerM/myplotlib
import lmfit
def draw_swiss_flag(fig, center, side, **kwargs):
    fig.plot(
        np.array(2*[side] + 2*[side/2] + 2*[-side/2] + 2*[-side] + 2*[-side/2] + 2*[side/2] + 2*[side]) + center[0],
        np.array([0] + 2*[side/2] + 2*[side] + 2*[side/2] + 2*[-side/2] + 2*[-side] + 2*[-side/2] + [0]) + center[1],
        **kwargs,
    )

def swiss_flag(x, y, center: tuple, side: float):
    # x, y numpy arrays.
    if x.shape != y.shape:
        raise ValueError(f'<x> and <y> must have the same shape!')
    flag = np.zeros(x.shape)
    flag[(center[0]-side/2<x)&(x<center[0]+side/2)&(center[1]-side<y)&(y<center[1]+side)] = 1
    flag[(center[1]-side/2<y)&(y<center[1]+side/2)&(center[0]-side<x)&(x<center[0]+side)] = 1
    return flag

def gaussian_2d(x, y, center, side):
    return np.exp(-(x-center[0])**2/side**2-(y-center[1])**2/side**2)

def wrapper_for_lmfit(x, x_pixels, y_pixels, function_2D_to_wrap, *params):
    pixel_number = x # This is the pixel number in the data array
    # x_pixels and y_pixels are the number of pixels that the image has. This is needed to make the mapping.
    if (pixel_number > x_pixels*y_pixels - 1).any():
        raise ValueError('pixel_number (x) > x_pixels*y_pixels - 1')
    x = np.array([int(p%x_pixels) for p in pixel_number])
    y = np.array([int(p/x_pixels) for p in pixel_number])
    return function_2D_to_wrap(x, y, *params)
data = np.genfromtxt('data.txt') # Read data
data -= data.min().min()
data = data/data.max().max()
guessed_center = (data.sum(axis=0).argmax()+11, data.sum(axis=1).argmax()+11) # I am adding 11 on purpose.
guessed_side = 19
model = lmfit.Model(lambda x, xc, yc, side: wrapper_for_lmfit(x, data.shape[1], data.shape[0], swiss_flag, (xc,yc), side))
params = model.make_params()
params['xc'].set(value = guessed_center[0], min = 0, max = data.shape[1])
params['yc'].set(value = guessed_center[1], min = 0, max = data.shape[0])
params['side'].set(value = guessed_side, min = 0)
fit_results = model.fit(data.ravel(), params, x = [i for i in range(len(data.ravel()))])
mpl.manager.set_plotting_package('matplotlib')
fit_plot = mpl.manager.new(
    title = 'Data vs fit',
    aspect = 'equal',
)
fit_plot.colormap(data)
draw_swiss_flag(fit_plot, guessed_center, guessed_side, label = 'Guessed')
draw_swiss_flag(fit_plot, (fit_results.params['xc'],fit_results.params['yc']), fit_results.params['side'], label = 'Fitted')

swiss_flag_plot = mpl.manager.new(
    title = 'Swiss flag plot',
    aspect = 'equal',
)
xx, yy = np.meshgrid(np.array([i for i in range(data.shape[1])]), np.array([i for i in range(data.shape[0])]))
swiss_flag_plot.colormap(
    z = swiss_flag(xx, yy, center = (fit_results.params['xc'],fit_results.params['yc']), side = fit_results.params['side']),
)
mpl.manager.show()
and this is the content of data.txt.
It seems your code is all fine. The issue is, as you already guessed, that the algorithm used by lmfit is not dealing well with non-smooth data.
By default lmfit uses a least-squares method. Let's change it to the 'differential_evolution' method instead.
params['side'].set(value=guessed_side, min=0, max=len(data))
fit_results = model.fit(data.ravel(), params,
                        x=[i for i in range(len(data.ravel()))],
                        method='differential_evolution'
                        )
Note that I needed to add some finite value for the max value to prevent a "differential_evolution requires finite bound for all varying parameters" message.
After switching to the evolutionary algorithm, the fit now looks like this:
All the fitting algorithms in lmfit (and in scipy.optimize, for that matter), including the "global optimizers", really work on continuous variables (double precision). When trying to find the optimal parameter values, most of the algorithms will make a very small step (at the ~1.e-7 level) in a parameter's value to determine the derivative, which is then used to make the next guess of the optimal values.
The problem you're seeing is that your model function uses the parameter value as a discrete value - as the index of an array, via int(). If a small change is made to the parameter value, no change in the result will be detected - the algorithm will decide that the fit result does not depend on small changes to that value.
The so-called "global solvers" like differential evolution, basin-hopping, shgo, take the view that the derivative approach can lead to "false minima" and so will "spray parameter space" with lots of candidate values and then use different strategies to refine the best of those results to find the optimal values. Generally speaking, these are much slower to run (OTOH runtime is cheap!) and very good for problems where there may be multiple "minima" and you really want to find the best of these, or where getting a decent guess of starting values is very hard.
For your problem, it is pretty clear that you can guess starting values (the center pixels must be on the image, say, so maybe guess "the middle"), and it seems likely from the image that there are not a lot of false minima that might be found. That means that the expense of a global solver might not be needed.
Another approach would be to allow your shaped object to be centered at any continuous position in the image, not only at integer pixels. Of course, you do have to map that to the discrete image, but it doesn't need to be fully on/off. Using sigmoidal functions like scipy.special.erf() and erfc() will still give you a transition from "on" to "off", but with a small, finite width that bleeds into adjacent pixels. And that would be enough to allow a fit to find a continuous (and so, sub-pixel!) value for the center position. In 1-d, that might look like:
from scipy.special import erf
def smoothed_window(x, edge1, edge2, width):
    return (erf((x-edge1)/width) + erf((edge2-x)/width))/2.0
For integer x values, a width of 0.5 (that is, half a pixel) will almost certainly allow a fit to find sub-integer values for edge1 and edge2. (Aside: either force the width parameter to be fixed or force it to be positive, either in the code or at the Parameter level.)
I have not tried to extend that to your more complicated "swiss flag" function, but it should be possible and also work for fitting center values.
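For what it's worth, here is a rough sketch of how that erf-based window might extend to the cross shape, built as the union of a soft-edged horizontal bar and a soft-edged vertical bar. The helper names and the width=0.5 default are purely illustrative assumptions, not part of the original code:

import numpy as np
from scipy.special import erf

def smoothed_rectangle(x, y, xc, yc, half_width, half_height, width=0.5):
    # Soft-edged rectangle centred at (xc, yc): product of two 1-d erf windows.
    wx = (erf((x - (xc - half_width))/width) + erf(((xc + half_width) - x)/width))/2.0
    wy = (erf((y - (yc - half_height))/width) + erf(((yc + half_height) - y)/width))/2.0
    return wx*wy

def smoothed_swiss_flag(x, y, center, side, width=0.5):
    # Union (pointwise max) of the horizontal and vertical bars of the cross,
    # matching the two rectangles set in the original swiss_flag() function.
    horizontal = smoothed_rectangle(x, y, center[0], center[1], side, side/2, width)
    vertical = smoothed_rectangle(x, y, center[0], center[1], side/2, side, width)
    return np.maximum(horizontal, vertical)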
I have experimental data of the form (X,Y) and a theoretical model of the form (x(t;*params),y(t;*params)) where t is a physical (but unobservable) variable, and *params are the parameters that I want to determine. t is a continuous variable, and there is a 1:1 relationship between x and t and between y and t in the model.
In a perfect world, I would know the value of T (the real-world value of the parameter) and would be able to do an extremely basic least-squares fit to find the values of *params. (Note that I am not trying to "connect" the values of x and y in my plot, like in 31243002 or 31464345.) I cannot guarantee that in my real data, the latent value T is monotonic, as my data is collected across multiple cycles.
I'm not very experienced at doing curve fitting by hand, and I haven't found a basic scipy function that handles this directly, so I've had to use extremely crude methods. My basic approach involves:
Choose some value of *params and apply it to the model
Take an array of t values and put it into the model to create an array of model(*params) = (x(*params),y(*params))
Interpolate X (the data values) into model to get Y_predicted
Run a least-squares (or other) comparison between Y and Y_predicted
Do it again for a new set of *params
Eventually, choose the best values for *params
There are several obvious problems with this approach.
1) I'm not experienced enough with coding to develop a very good "do it again" strategy, other than "try everything in the solution space," or maybe "try everything in a coarse grid" and then "try everything again in a slightly finer grid in the hotspots of the coarse grid." I tried MCMC methods, but I never found any optimum values, largely because of problem 2.
2) Steps 2-4 are super inefficient in their own right.
I've tried something like the following (it resembles pseudo-code; the actual functions are made up). There are many minor quibbles that could be made about broadcasting over A and B, but those are less significant than the problem of needing to interpolate at every single step.
People I know have recommended using some sort of Expectation Maximization algorithm, but I don't know enough about that to code one up from scratch. I'm really hoping there's some awesome scipy (or otherwise open-source) algorithm I haven't been able to find that covers my whole problem, but at this point I am not hopeful.
import numpy as np
from scipy import interpolate

X_data = ...  # experimental data (not shown)
Y_data = ...

def x(t, A, B):
    return A**t + B**t

def y(t, A, B):
    return A*t + B

def interp(A, B):
    ts = np.arange(-10, 10, 0.1)
    xs = x(ts, A, B)
    ys = y(ts, A, B)
    f = interpolate.interp1d(xs, ys)
    return f

N = 101
lsqs = np.zeros((N**2, 3))

count = 0
for i in range(0, N):
    A = 0.1*i           # checks A between 0 and 10
    for j in range(0, N):
        B = 10 + 0.1*j  # checks B between 10 and 20
        f = interp(A, B)
        y_fit = f(X_data)
        squares = np.sum((y_fit - Y_data)**2)
        lsqs[count] = (A, B, squares)  # puts the values in place for comparison later
        count += 1                     # allows us to move to the next cell

i = np.argmin(lsqs[:, 2])
A_optimal = lsqs[i][0]
B_optimal = lsqs[i][1]
If I understand the question correctly, the params are constants which are the same in every sample, but t varies from sample to sample. So, for example, maybe you have a whole bunch of points which you believe have been sampled from a circle
x = a+r cos(t)
y = b+r sin(t)
at different values of t.
In this case, what I would do is eliminate the variable t to get a relation between x and y -- in this case, (x-a)^2+(y-b)^2 = r^2. If your data fit the model perfectly, you would have (x-a)^2+(y-b)^2 = r^2 at each of your data points. With some error, you could still find (a,b,r) to minimize
sum_i ((x_i-a)^2 + (y_i-b)^2 - r^2)^2.
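As a rough illustration (with made-up circle data, not data from the question), minimizing that quantity with scipy.optimize.least_squares might look like:

import numpy as np
from scipy.optimize import least_squares

# Hypothetical noisy samples from a circle with a=1.0, b=2.0, r=3.0
rng = np.random.default_rng(0)
t = rng.uniform(0, 2*np.pi, 200)
X = 1.0 + 3.0*np.cos(t) + 0.05*rng.normal(size=t.size)
Y = 2.0 + 3.0*np.sin(t) + 0.05*rng.normal(size=t.size)

def residuals(params, x, y):
    a, b, r = params
    # Implicit circle equation: (x-a)^2 + (y-b)^2 - r^2 should be ~0 at every point
    return (x - a)**2 + (y - b)**2 - r**2

fit = least_squares(residuals, x0=[0.0, 0.0, 1.0], args=(X, Y))
a_fit, b_fit, r_fit = fit.x
print(a_fit, b_fit, r_fit)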
Mathematica's Eliminate command can automate the procedure of eliminating t in some cases.
PS You might do better at stats.stackexchange, math.stackexchange or mathoverflow.net . I know the last one has a scary reputation, but we don't bite, really!
As per advice given to me in this answer, I have implemented a Runge-Kutta integrator in my gravity simulator.
However, after I simulate one year of the solar system, the positions are still off by about 110 000 kilometers, which isn't acceptable.
My initial data was provided by NASA's HORIZONS system. Through it, I obtained position and velocity vectors of the planets, Pluto, the Moon, Deimos and Phobos at a specific point in time.
These vectors were 3D, however, some people told me that I could ignore the third dimension as the planets align themselves in a plane around the Sun, and so I did. I merely copied the x-y coordinates into my files.
This is the code of my improved update method:
"""
Measurement units:
[time] = s
[distance] = m
[mass] = kg
[velocity] = ms^-1
[acceleration] = ms^-2
"""
class Uni:
def Fg(self, b1, b2):
"""Returns the gravitational force acting between two bodies as a Vector2."""
a = abs(b1.position.x - b2.position.x) #Distance on the x axis
b = abs(b1.position.y - b2.position.y) #Distance on the y axis
r = math.sqrt(a*a + b*b)
fg = (self.G * b1.m * b2.m) / pow(r, 2)
return Vector2(a/r * fg, b/r * fg)
#After this is ran, all bodies have the correct accelerations:
def updateAccel(self):
#For every combination of two bodies (b1 and b2) out of all bodies:
for b1, b2 in combinations(self.bodies.values(), 2):
fg = self.Fg(b1, b2) #Calculate the gravitational force between them
#Add this force to the current force vector of the body:
if b1.position.x > b2.position.x:
b1.force.x -= fg.x
b2.force.x += fg.x
else:
b1.force.x += fg.x
b2.force.x -= fg.x
if b1.position.y > b2.position.y:
b1.force.y -= fg.y
b2.force.y += fg.y
else:
b1.force.y += fg.y
b2.force.y -= fg.y
#For body (b) in all bodies (self.bodies.itervalues()):
for b in self.bodies.itervalues():
b.acceleration.x = b.force.x/b.m
b.acceleration.y = b.force.y/b.m
b.force.null() #Reset the force as it's not needed anymore.
def RK4(self, dt, stage):
#For body (b) in all bodies (self.bodies.itervalues()):
for b in self.bodies.itervalues():
rd = b.rk4data #rk4data is an object where the integrator stores its intermediate data
if stage == 1:
rd.px[0] = b.position.x
rd.py[0] = b.position.y
rd.vx[0] = b.velocity.x
rd.vy[0] = b.velocity.y
rd.ax[0] = b.acceleration.x
rd.ay[0] = b.acceleration.y
if stage == 2:
rd.px[1] = rd.px[0] + 0.5*rd.vx[0]*dt
rd.py[1] = rd.py[0] + 0.5*rd.vy[0]*dt
rd.vx[1] = rd.vx[0] + 0.5*rd.ax[0]*dt
rd.vy[1] = rd.vy[0] + 0.5*rd.ay[0]*dt
rd.ax[1] = b.acceleration.x
rd.ay[1] = b.acceleration.y
if stage == 3:
rd.px[2] = rd.px[0] + 0.5*rd.vx[1]*dt
rd.py[2] = rd.py[0] + 0.5*rd.vy[1]*dt
rd.vx[2] = rd.vx[0] + 0.5*rd.ax[1]*dt
rd.vy[2] = rd.vy[0] + 0.5*rd.ay[1]*dt
rd.ax[2] = b.acceleration.x
rd.ay[2] = b.acceleration.y
if stage == 4:
rd.px[3] = rd.px[0] + rd.vx[2]*dt
rd.py[3] = rd.py[0] + rd.vy[2]*dt
rd.vx[3] = rd.vx[0] + rd.ax[2]*dt
rd.vy[3] = rd.vy[0] + rd.ay[2]*dt
rd.ax[3] = b.acceleration.x
rd.ay[3] = b.acceleration.y
b.position.x = rd.px[stage-1]
b.position.y = rd.py[stage-1]
def update (self, dt):
"""Pushes the uni 'dt' seconds forward in time."""
#Repeat four times:
for i in range(1, 5, 1):
self.updateAccel() #Calculate the current acceleration of all bodies
self.RK4(dt, i) #ith Runge-Kutta step
#Set the results of the Runge-Kutta algorithm to the bodies:
for b in self.bodies.itervalues():
rd = b.rk4data
b.position.x = b.rk4data.px[0] + (dt/6.0)*(rd.vx[0] + 2*rd.vx[1] + 2*rd.vx[2] + rd.vx[3]) #original_x + delta_x
b.position.y = b.rk4data.py[0] + (dt/6.0)*(rd.vy[0] + 2*rd.vy[1] + 2*rd.vy[2] + rd.vy[3])
b.velocity.x = b.rk4data.vx[0] + (dt/6.0)*(rd.ax[0] + 2*rd.ax[1] + 2*rd.ax[2] + rd.ax[3])
b.velocity.y = b.rk4data.vy[0] + (dt/6.0)*(rd.ay[0] + 2*rd.ay[1] + 2*rd.ay[2] + rd.ay[3])
self.time += dt #Internal time variable
The algorithm is as follows:
1. Update the accelerations of all bodies in the system
2. RK4 (first step)
3. Go to 1
4. RK4 (second step)
5. Go to 1
6. RK4 (third step)
7. Go to 1
8. RK4 (fourth step)
Did I mess something up with my RK4 implementation? Or did I just start with corrupted data (too few important bodies and ignoring the 3rd dimension)?
How can this be fixed?
Explanation of my data etc...
All of my coordinates are relative to the Sun (i.e. the Sun is at (0, 0)).
./my_simulator 1yr
Earth position: (-1.47589927462e+11, 18668756050.4)
HORIZONS (NASA):
Earth position: (-1.474760457316177E+11, 1.900200786726017E+10)
I got the 110 000 km error by subtracting the Earth's x coordinate given by NASA from the one predicted by my simulator.
relative error = (my_x_coordinate - nasa_x_coordinate) / nasa_x_coordinate * 100
= (-1.47589927462e+11 + 1.474760457316177E+11) / -1.474760457316177E+11 * 100
= 0.077%
The relative error seems minuscule, but that's simply because Earth is really far away from the Sun both in my simulation and in NASA's data. The distance is still huge and renders my simulator useless.
110 000 km absolute error means what relative error?
I got the 110 000 km value by subtracting my predicted Earth's x coordinate from NASA's Earth x coordinate.
I'm not sure what you're calculating here or what you mean by "NASA's Earth x coordinate". That's a distance from what origin, in what coordinate system, at what time? (As far as I know, the earth moves in orbit around the sun, so its x-coordinate w.r.t. a coordinate system centered at the sun is changing all the time.)
In any case, you calculated an absolute error of 110,000 km by subtracting your calculated value from "NASA's Earth x coordinate". You seem to think this is a bad answer. What's your expectation? To hit it spot on? To be within a meter? One km? What's acceptable to you and why?
You get a relative error by dividing your error difference by "NASA's Earth x coordinate". Think of it as a percentage. What value do you get? If it's 1% or less, congratulate yourself. That would be quite good.
You should know that floating point numbers aren't exact on computers. (You can't represent 0.1 exactly in binary any more than you can represent 1/3 exactly in decimal.) There are going to be errors. Your job as a simulator is to understand the errors and minimize them as best you can.
You could have a stepsize problem. Try reducing your time step size by half and see if you do better. If you do, it says that your results have not converged. Reduce by half again until you achieve acceptable error.
Your equations might be poorly conditioned. Small initial errors will be amplified over time if that's true.
I'd suggest that you non-dimensionalize your equations and calculate the stability limit step size. Your intuition about what a "small enough" step size should be might surprise you.
I'd also read a bit more about the many body problem. It's subtle.
You might also try a numerical integration library instead of your integration scheme. You'll program your equations and give them to an industrial strength integrator. It could give some insight into whether or not it's your implementation or the physics that causes the problem.
Personally, I don't like your implementation. It'd be a better solution if you'd done it with mathematical vectors in mind. The "if" test for the relative positions leaves me cold. Vector mechanics would make the signs come out naturally.
UPDATE:
OK, your relative errors are pretty small.
Of course the absolute error does matter - depending on your requirements. If you're landing a vehicle on a planet you don't want to be off by that much.
So you need to stop making assumptions about what constitutes too small a step size and do what you must to drive the errors to an acceptable level.
Are all the quantities in your calculation 64-bit IEEE floating point numbers? If not, you'll never get there.
A 64 bit floating point number has about 16 digits of accuracy. If you need more than that, you'll have to use an infinite precision object like Java's BigDecimal or - wait for it - rescale your equations to use a length unit other than kilometers. If you scale all your distances by something meaningful for your problem (e.g., the diameter of the earth or the length of the major/minor axis of the earth's orbit) you might do better.
To do an RK4 integration of the solar system you need very good precision or your solution will diverge quickly. Assuming you have implemented everything correctly, you may be seeing the drawbacks of RK for this sort of simulation.
To verify whether this is the case: try a different integration algorithm. I found that with Verlet integration a solar system simulation is much less chaotic. Verlet is also much simpler to implement than RK4.
The reason Verlet (and derived methods) are often better than RK4 for long-term prediction (like full orbits) is that they are symplectic, which in practice means they keep the system's energy bounded over long integrations in a way RK4 does not. Thus Verlet will give better behavior even after it diverges (a realistic simulation, but with an error in it), whereas RK will give non-physical behavior once it diverges.
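For illustration, a minimal velocity-Verlet step might look like the sketch below. It is not tied to the Vector2/Body classes in the question and assumes a hypothetical acc_func(positions) helper that returns the gravitational accelerations:

import numpy as np

def velocity_verlet_step(pos, vel, acc_func, dt):
    """Advance positions/velocities by one step; pos and vel are (N, 2) arrays,
    acc_func(pos) returns the (N, 2) accelerations for that configuration."""
    acc = acc_func(pos)
    pos_new = pos + vel*dt + 0.5*acc*dt**2
    acc_new = acc_func(pos_new)
    vel_new = vel + 0.5*(acc + acc_new)*dt
    return pos_new, vel_new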
Also: make sure you are using floating point as well as you can. Don't use distances in meters in the solar system, since the precision of floating point numbers is much better in the 0..1 interval. Using AU or some other normalized scale is much better than using meters. As suggested in the other answer, use an epoch for the time to avoid accumulating errors when adding time steps.
Such simulations are notoriously unreliable. Rounding errors accumulate and introduce instability. Increasing precision doesn't help much; the problem is that you (have to) use a finite step size and nature uses a zero step size.
You can reduce the problem by reducing the step size, so it takes longer for the errors to become apparent. If you are not doing this in real time, you can use a dynamic step size which reduces if two or more bodies are very close.
One thing I do with these kinds of simulations is "re-normalise" after each step to make the total energy the same. The sum of gravitational plus kinetic energy for the system as a whole should be a constant (conservation of energy). Work out the total energy after each step, and then scale all the object speeds by a constant amount to keep the total energy a constant. This at least keeps the output looking more plausible. Without this scaling, a tiny amount of energy is either added to or removed from the system after each step, and orbits tend to blow up to infinity or spiral into the sun.
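A sketch of that rescaling step, assuming you already have the total potential energy and the per-body masses and velocities as plain numpy arrays (names are illustrative):

import numpy as np

def renormalize_speeds(velocities, masses, potential_energy, target_total_energy):
    """Scale all velocities by a single factor so kinetic + potential matches the target energy."""
    kinetic = 0.5*np.sum(masses*np.sum(velocities**2, axis=1))
    wanted_kinetic = target_total_energy - potential_energy
    if wanted_kinetic <= 0 or kinetic == 0:
        return velocities  # cannot rescale sensibly in this case
    return velocities*np.sqrt(wanted_kinetic/kinetic)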
Some very simple changes that will improve things (proper usage of floating point values):
Change the unit system to use as many mantissa bits as possible. Using meters, you're doing it wrong... Use AU, as suggested above. Even better, scale things so that the solar system fits in a 1x1x1 box.
As already said in another post: compute your time as time = epoch_count * time_step, not by repeatedly adding! This way, you avoid accumulating errors.
When doing a summation of several values, use a high-precision summation algorithm, like Kahan summation. In Python, math.fsum does it for you (see the small example below).
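A quick demonstration of the difference; the numbers are only chosen to make the round-off visible:

import math

values = [1e16, 1.0, -1e16] * 1000
print(sum(values))        # naive left-to-right sum: the 1.0 terms are lost to round-off
print(math.fsum(values))  # compensated summation recovers 1000.0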
Shouldn't the force decomposition be
th = math.atan2(b, a)
return Vector2(math.cos(th) * fg, math.sin(th) * fg)
(see http://www.physicsclassroom.com/class/vectors/Lesson-3/Resolution-of-Forces or https://fiftyexamples.readthedocs.org/en/latest/gravity.html#solution)
BTW: you take the square root to calculate the distance, but you actually need the squared distance...
Why not simplify
r = math.sqrt(a*a + b*b)
fg = (self.G * b1.m * b2.m) / pow(r, 2)
to
r2 = a*a + b*b
fg = (self.G * b1.m * b2.m) / r2
I am not sure about python, but in some cases you get more precise calculations for intermediate results (Intel CPUs support 80 bit floats, but when assigning to variables, they get truncated to 64 bit):
Floating point computation changes if stored in intermediate "double" variable
It is not quite clear in what frame of reference you have your planet coordinates and velocities. If it is in heliocentric frame (the frame of reference is tied to the Sun), then you have two options: (i) perform calculations in non-inertial frame of reference (the Sun is not motionless), (ii) convert positions and velocities into the inertial barycentric frame of reference. If your coordinates and velocities are in barycentric frame of reference, you must have coordinates and velocities for the Sun as well.
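A sketch of that conversion to the barycentric frame, assuming plain (N, 2) numpy arrays for positions and velocities (including a row for the Sun) rather than the question's Vector2 objects:

import numpy as np

def to_barycentric(positions, velocities, masses):
    """Shift heliocentric positions/velocities into the (inertial) barycentric frame."""
    total_mass = masses.sum()
    barycentre_pos = (masses[:, None]*positions).sum(axis=0)/total_mass
    barycentre_vel = (masses[:, None]*velocities).sum(axis=0)/total_mass
    return positions - barycentre_pos, velocities - barycentre_vel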
Operators used to examine the spectrum by eye, knowing the location and width of each peak, and judge which piece the spectrum belongs to. In the new setup, the image is captured by a camera to a screen, and the width of each band must be computed programmatically.
Old system: spectroscope -> human eye
New system: spectroscope -> camera -> program
What is a good method to compute the width of each band, given their approximate X-axis positions; given that this task used to be performed perfectly by eye, and must now be performed by program?
Sorry if I am short of details, but they are scarce.
Program listing that generated the previous graph; I hope it is relevant:
import Image
from scipy import *
from scipy.optimize import leastsq
# Load the picture with PIL, process if needed
pic = asarray(Image.open("spectrum.jpg"))
# Average the pixel values along vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)
# Set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
#print projection
# Fit function, two gaussians, adjust as needed
def fitfunc(p,x):
    return p[0]*exp(-(x-p[1])**2/(2.0*p[2]**2)) + \
        p[3]*exp(-(x-p[4])**2/(2.0*p[5]**2))
errfunc = lambda p, x, y: fitfunc(p,x)-y
# Use scipy to fit, p0 is inital guess
p0 = array([0,20,1,0,75,10])
X = arange(len(projection))
p1, success = leastsq(errfunc, p0, args=(X,projection))
Y = fitfunc(p1,X)
# Output the result
print "Mean values at: ", p1[1], p1[4]
# Plot the result
from pylab import *
#subplot(211)
#imshow(pic)
#subplot(223)
#plot(projection)
#subplot(224)
#plot(X,Y,'r',lw=5)
#show()
subplot(311)
imshow(pic)
subplot(312)
plot(projection)
subplot(313)
plot(X,Y,'r',lw=5)
show()
Given an approximate starting point, you could use a simple algorithm that finds the local maximum closest to this point. Your fitting code may be doing that already (I wasn't sure whether you were using it successfully or not).
Here's some code that demonstrates simple peak finding from a user-given starting point:
#!/usr/bin/env python
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
# Sample data with two peaks: small one at t=0.4, large one at t=0.8
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts-0.4)/0.1)**2) + 2*np.exp(-((ts-0.8)/0.1)**2)
# Say we have an approximate starting point of 0.35
start_point = 0.35
# Nearest index in "ts" to this starting point is...
start_index = np.argmin(np.abs(ts - start_point))
# Find the local maxima in our data by looking for a sign change in
# the first difference
# From http://stackoverflow.com/a/9667121/188535
maxes = (np.diff(np.sign(np.diff(xs))) < 0).nonzero()[0] + 1
# Find which of these peaks is closest to our starting point
index_of_peak = maxes[np.argmin(np.abs(maxes - start_index))]
print "Peak centre at: %.3f" % ts[index_of_peak]
# Quick plot showing the results: blue line is data, green dot is
# starting point, red dot is peak location
plt.plot(ts, xs, '-b')
plt.plot(ts[start_index], xs[start_index], 'og')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.show()
This method will only work if the ascent up the peak is perfectly smooth from your starting point. If you need this to be more resilient to noise, PyDSTool seems like it might help (I have not used it myself). This SciPy post details how to use it for detecting 1D peaks in a noisy data set.
So assume at this point you've found the centre of the peak. Now for the width: there are several methods you could use, but the easiest is probably the "full width at half maximum" (FWHM). Again, this is simple and therefore fragile. It will break for close double-peaks, or for noisy data.
The FWHM is exactly what its name suggests: you find the width of the peak where it's halfway up to the maximum. Here's some code that does that (it just continues on from above):
# FWHM...
half_max = xs[index_of_peak]/2
# This finds where in the data we cross over the halfway point to our peak. Note
# that this is global, so we need an extra step to refine these results to find
# the closest crossovers to our peak.
# Same sign-change-in-first-diff technique as above
hm_left_indices = (np.diff(np.sign(np.diff(np.abs(xs[:index_of_peak] - half_max)))) > 0).nonzero()[0] + 1
# Add "index_of_peak" to result because we cut off the left side of the data!
hm_right_indices = (np.diff(np.sign(np.diff(np.abs(xs[index_of_peak:] - half_max)))) > 0).nonzero()[0] + 1 + index_of_peak
# Find closest half-max index to peak
hm_left_index = hm_left_indices[np.argmin(np.abs(hm_left_indices - index_of_peak))]
hm_right_index = hm_right_indices[np.argmin(np.abs(hm_right_indices - index_of_peak))]
# And the width is...
fwhm = ts[hm_right_index] - ts[hm_left_index]
print "Width: %.3f" % fwhm
# Plot to illustrate FWHM: blue line is data, red circle is peak, red line
# shows FWHM
plt.plot(ts, xs, '-b')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.plot(
[ts[hm_left_index], ts[hm_right_index]],
[xs[hm_left_index], xs[hm_right_index]], '-r')
plt.show()
It doesn't have to be the full width at half maximum — as one commenter points out, you can try to figure out where your operators' normal threshold for peak detection is, and turn that into an algorithm for this step of the process.
A more robust way might be to fit a Gaussian curve (or your own model) to a subset of the data centred around the peak — say, from a local minimum on one side to a local minimum on the other — and use one of the parameters of that curve (e.g. sigma) to calculate the width.
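Continuing with the sample ts/xs data and index_of_peak from the snippets above, a rough sketch of that curve-fit approach could look like this; the +/- 15 sample window is an arbitrary assumption you would adjust to your data:

import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amplitude, centre, sigma):
    return amplitude*np.exp(-(x - centre)**2/(2.0*sigma**2))

# Fit only a window of data around the peak found earlier
lo, hi = index_of_peak - 15, index_of_peak + 15
popt, pcov = curve_fit(gaussian, ts[lo:hi], xs[lo:hi],
                       p0=[xs[index_of_peak], ts[index_of_peak], 0.05])
sigma = abs(popt[2])
print("Gaussian sigma: %.3f, FWHM: %.3f" % (sigma, 2.355*sigma))  # FWHM of a Gaussian is ~2.355*sigma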
I realise this is a lot of code, but I've deliberately avoided factoring out the index-finding functions to "show my working" a bit more, and of course the plotting functions are there just to demonstrate.
Hopefully this gives you at least a good starting point to come up with something more suitable to your particular set.
Late to the party, but for anyone coming across this question in the future...
Eye-movement data looks very similar to this; I'd base an approach off the one used by Nystrom + Holmqvist, 2010:

Smooth the data using a Savitzky-Golay filter (scipy.signal.savgol_filter in scipy v0.14+) to get rid of some of the low-level noise while keeping the large peaks intact - the authors recommend an order of 2 and a window size of about twice the width of the smallest peak you want to be able to detect.

You can find where the bands are by arbitrarily removing all values above a certain y value (setting them to numpy.nan). Then take the (nan)mean and (nan)standard deviation of the remainder, and remove all values greater than the mean + [parameter]*std (I think they use 6 in the paper). Iterate until you're not removing any data points - but depending on your data, certain values of [parameter] may not stabilise.

Then use numpy.isnan() to find events vs non-events, and numpy.diff() to find the start and end of each event (values of -1 and 1 respectively). To get even more accurate start and end points, you can scan along the data backwards from each start and forwards from each end to find the nearest local minimum with a value smaller than mean + [another parameter]*std (I think they use 3 in the paper). Then you just need to count the data points between each start and end.
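A very rough sketch of the smoothing and iterative-threshold part described above; the window length, polynomial order and the k=6 multiplier follow the description, everything else (function name, stopping rule) is an assumption you would tune:

import numpy as np
from scipy.signal import savgol_filter

def find_band_threshold(profile, k=6.0, window_length=11, polyorder=2, max_iter=100):
    """Smooth the profile, then iteratively recompute a threshold of mean + k*std
    over the samples that are not currently flagged as belonging to a band."""
    smoothed = savgol_filter(profile, window_length, polyorder)
    kept = smoothed.astype(float)
    threshold = np.nanmean(kept) + k*np.nanstd(kept)
    for _ in range(max_iter):
        above = kept > threshold
        if not above.any():
            break  # nothing more to remove: the threshold has stabilised
        kept[above] = np.nan
        threshold = np.nanmean(kept) + k*np.nanstd(kept)
    band_mask = np.isnan(kept)  # True where samples were flagged as part of a band
    return threshold, band_mask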
This won't work for that double peak; you'd have to do some extrapolation for that.
The best method might be to statistically compare a bunch of methods with human results.
You would take a large variety data and a large variety of measurement estimates (widths at various thresholds, area above various thresholds, different threshold selection methods, 2nd moments, polynomial curve fits of various degrees, pattern matching, and etc.) and compare these estimates to human measurements of the same data set. Pick the estimate method that correlates best with expert human results. Or maybe pick several methods, the best one for each of various heights, for various separations from other peaks, and etc.
I'm working on a piece of software which needs to quantify the wiggliness of a set of data. Here's a sample of the input I would receive, merged with the lightness plot of each vertical pixel strip:
It is easy to see that the left margin is really wiggly (i.e. has a ton of minima/maxima), and I want to generate a set of critical points of the image. I've applied a Gaussian smoothing function to the data ~10 times, but it still seems to be pretty wiggly.
Any ideas?
Here's my original code, but it does not produce very nice results (for the wiggliness):
def local_maximum(list, center, delta):
    maximum = [0, 0]
    for i in range(delta):
        if list[center + i] > maximum[1]: maximum = [center + i, list[center + i]]
        if list[center - i] > maximum[1]: maximum = [center - i, list[center - i]]
    return maximum

def count_maxima(list, start, end, delta, threshold = 10):
    count = 0
    for i in range(start + delta, end - delta):
        if abs(list[i] - local_maximum(list, i, delta)[1]) < threshold: count += 1
    return count

def wiggliness(list, start, end, delta, threshold = 10):
    return float(abs(start - end) * delta) / float(count_maxima(list, start, end, delta, threshold))
Take a look at lowpass/highpass/notch/bandpass filters, fourier transforms, or wavelets. The basic idea is there's lots of different ways to figure out the frequency content of a signal quantized over different time-periods.
If we can figure out what wiggliness is, that would help. I would say the leftmost margin is wiggly because it has more high-frequency content, which you could visualize by using a Fourier transform.
If you take a highpass filter of that red signal, you'll get just the high frequency content, and then you can measure the amplitudes and do thresholds to determine wiggliness. But I guess wiggliness just needs more formalism behind it.
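As a hypothetical sketch of that idea, the fraction of signal energy that survives a high-pass filter could serve as a wiggliness score; the function name and the cutoff of 0.2 times the Nyquist frequency are arbitrary assumptions:

import numpy as np
from scipy import signal

def high_frequency_fraction(x, cutoff=0.2):
    """Fraction of the (de-meaned) signal energy left after a Butterworth high-pass
    filter; cutoff is in units of the Nyquist frequency (0..1)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    b, a = signal.butter(4, cutoff, btype='highpass')
    high = signal.filtfilt(b, a, x)
    return np.sum(high**2)/np.sum(x**2)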
For things like these, numpy makes things much easier, as it provides useful functions for manipulating vector data, e.g. adding a scalar to each element, calculating the average value etc.
For example, you might try the zero crossing rate of either the original data (wiggliness1 below) or of the first difference (wiggliness2), depending on what wiggliness is supposed to be, exactly - if global trends are to be ignored, you should probably use the difference data. For x you would take the slice or window of interest from the original data, getting a sort of measure of local wiggliness.
If you use the original data, after removing the bias you might also want to set all values smaller than some threshold to 0 to ignore low-amplitude wiggles.
import numpy as np

def wiggliness1(x):
    # remove bias:
    x = x - np.average(x)
    # calculate zero crossing rate:
    return np.sum(np.abs(np.sign(np.diff(x))))

def wiggliness2(x):
    # calculate zero crossing rate of the first difference:
    return np.sum(np.abs(np.sign(np.diff(np.sign(np.diff(x))))))
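And a small, hypothetical usage example (the profile below is synthetic) applying wiggliness2 over sliding windows, as suggested above:

import numpy as np

# Synthetic 1-D lightness profile: high-frequency on the left, smooth on the right
t = np.linspace(0, 1, 1000)
profile = np.where(t < 0.3, np.sin(80*np.pi*t), np.sin(4*np.pi*t)) + 0.05*np.random.randn(1000)

window = 100
for start in range(0, len(profile) - window + 1, window):
    score = wiggliness2(profile[start:start + window])
    print(start, score)  # higher score = more direction changes = wigglier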