I need help writing code that would allow me to baseline each peak in a set of peaks (enthalpy vs. time isothermal titration calorimetry data).
The data is created by the ITC instrument in this fashion (where '##' signifies the start of a peak and the data are listed below as time [seconds], enthalpy [ucal/s], and temperature [deg. C but unnecessary as it is usually held constant]):
There are well over 100 points per peak (I've shortened it above), and I'd like to incorporate a linear equation to zero each enthalpy value in each peak so I may integrate each peak to produce a binding plot. I'd welcome any help/advice; thank you!
I was able to do it. Thank you to those who replied! I will leave this here for anyone who may need to baseline peaks with a linear fit in the future (assuming that the first point and the last 40 points of each peak are enough to give a decent fit line, as they are in ITC):
#defining function to calculate baseline of peaks in x vs y graph
def calc_baseline(x,y):
for n in range(len(y)):
return zeroed_y
#defining function to zero baselines of peaks in x vs y graph, assuming number_injections is a known integer
def zero_baseline(number_injections,y,zeroed_y):
for i in range(0,number_injections+1):
return zeroed_y_lists
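For readers who want a self-contained starting point, here is a minimal sketch of the same idea for a single peak: fit a straight line through the first point and the last 40 points, subtract it, then integrate. The function names, the `n_tail` parameter and the synthetic peak are assumptions made for illustration; they are not the original poster's code.

```python
import numpy as np

def subtract_linear_baseline(t, y, n_tail=40):
    """Fit a line through the first point and the last n_tail points of one
    peak, then subtract that line so the baseline sits at zero."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Baseline samples: the first point plus the tail, where the peak has relaxed
    base_t = np.concatenate(([t[0]], t[-n_tail:]))
    base_y = np.concatenate(([y[0]], y[-n_tail:]))
    slope, intercept = np.polyfit(base_t, base_y, 1)
    return y - (slope * t + intercept)

def integrate_peak(t, zeroed_y):
    """Trapezoidal area under one baseline-corrected peak."""
    t = np.asarray(t, dtype=float)
    zeroed_y = np.asarray(zeroed_y, dtype=float)
    return float(np.sum(0.5 * (zeroed_y[1:] + zeroed_y[:-1]) * np.diff(t)))

# Synthetic stand-in for one injection: drifting baseline plus a decaying peak
t = np.arange(0.0, 200.0, 1.0)
baseline = 0.5 + 0.002 * t
peak = np.where(t >= 1, -3.0 * np.exp(-(t - 1) / 10.0), 0.0)
y = baseline + peak

zeroed = subtract_linear_baseline(t, y)
area = integrate_peak(t, zeroed)
print("Integrated heat for this peak:", area)
```

Applying the two functions to each peak in turn (one call per '##' block) would give one integrated value per injection for the binding plot.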
I suspect that there's something I'm missing in my understanding of the Fourier Transform, so I'm looking for some correction (if that's the case). How should I gather peak information from the first plot below?
The dataset is hourly data for 911 calls over the past 17 years (for a particular city).
I've removed the trend from my data, and am now removing the seasonality. When I run the Fourier transform, I get the following plot:
I believe the dataset does have some seasonality to it (looking at weekly data, I have this pattern):
How do I pick out the values of the peaks in the first plot? Presumably for all of the "peaks" under, say 5000 in the first plot, I may ignore the inclusion of that seasonality in my final model, but only at a loss of accuracy, correct?
Here's the bit of code I'm working with, currently:
import numpy as np
import matplotlib.pyplot as plt
from scipy import fftpack
fft = fftpack.fft(calls_grouped_hour.detrended_residuals - calls_grouped_hour.detrended_residuals.mean())
plt.plot(1./(17*365)*np.arange(len(fft)), np.abs(fft))
plt.xlim([-.1, 23/2]);
After Mark Snider's initial answer, I have the following plot:
Adding code attempt to get peak values from fft:
Do I need to convert the values back using ifft first?
fft_x_y = np.stack((fft.real, fft.imag), -1)
peaks = []
for x, y in np.abs(fft_x_y):
if (y >= 0):
peaks = np.unique(peaks)
print('Length: ', len(peaks))
print('Peak values: ', '\n', np.sort(peaks))
threshold = 5000
fft[np.abs(fft)<threshold] = 0
This'll give you an fft that ignores everything except the peaks. And no, I wouldn't imagine that the "noise" represents actual seasonality. The peak at fft[0] doesn't represent seasonality, either - it's a multiple of the mean of the data, so if you plan on subtracting the ifft of the peaks I wouldn't include fft[0] either unless you want your data to be centered.
If you want just the peak values and not the full fft that you can invert, you can just do this:
peaks = [np.abs(value) for value in fft if np.abs(value)>threshold]
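To see which frequencies the surviving peaks correspond to, the amplitudes can be paired with a frequency axis; no inverse transform is needed for that. The sketch below uses numpy.fft.rfft and rfftfreq (a swap for fftpack that keeps only the non-negative frequencies) on a synthetic hourly series. The series, the relative threshold and all variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic hourly series standing in for the call data (an assumption):
# a daily cycle, a weaker weekly cycle, and some noise.
rng = np.random.default_rng(0)
n_hours = 24 * 7 * 52
t = np.arange(n_hours)
signal = (10 * np.sin(2 * np.pi * t / 24)
          + 4 * np.sin(2 * np.pi * t / (24 * 7))
          + rng.normal(0, 1, n_hours))

# Real-input FFT: only non-negative frequencies, so each peak appears once
spectrum = np.fft.rfft(signal - signal.mean())
freqs = np.fft.rfftfreq(n_hours, d=1.0)  # cycles per hour
amplitude = np.abs(spectrum)

# Hypothetical cut-off: keep bins above a quarter of the largest amplitude
threshold = 0.25 * amplitude.max()
peak_idx = np.flatnonzero(amplitude > threshold)
peak_periods = 1.0 / freqs[peak_idx]  # hours per cycle

for period, amp in zip(peak_periods, amplitude[peak_idx]):
    print("period = %.1f hours, amplitude = %.0f" % (period, amp))
```

Reading the peaks as periods (24 hours, 168 hours) tends to be easier to interpret than raw bin indices when deciding which seasonal terms to keep.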
I am analyzing a time-series dataset that I am pretty sure can be broken down using an FFT. I want to develop a model to estimate the data using a sum of sines and cosines, but I am having trouble with the syntax to find the frequencies in Python.
Here is a graph of the data
And here's a link to the original data: https://drive.google.com/open?id=1mqZtQ-txdd_AFbKGBlbSL6903CK-_kXl
Most of the examples I have seen have multiple samples per second/time period, however the data in this set represent by-minute observations of some metric. Because of this, I've had trouble translating the answers online to this problem
Here's my naive first approach
import numpy as np
import matplotlib.pyplot as plt
from scipy import fftpack
X = fftpack.fft(data)
freqs = fftpack.fftfreq(len(data))
plt.plot(freqs, np.abs(X))
Instead of peaking at the major frequencies, my plot only has one peak at 0.
The FFT you posted has been shifted so that 0 is at the center. Data to the left of the center represents negative frequencies and to the right represents positive frequencies. If you zoom in and look more closely, I think you will see that there are two peaks close to the center that you are interpreting as a single peak at 0. Just looking at the positive side, the location of this peak will tell you which frequency is contributing significant signal power.
Like you said, your x-axis is probably incorrect. scipy.fftpack.fftfreq needs to know the time between samples of your time-domain signal (the d argument; pass it in seconds to get an x-axis in Hz) to correctly determine the bandwidth and create the frequency array. This should do it:
dt = 60 # 60 seconds between samples
freqs = fftpack.fftfreq(len(data),dt)
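As a quick check of what the sample spacing does, here is a small sketch with made-up by-minute data containing a single daily cycle. It uses numpy.fft instead of fftpack, and the signal, its length and the variable names are assumptions. With d set to 60 seconds the frequency axis is in Hz, so the dominant bin converts directly to a period.

```python
import numpy as np

dt = 60.0                 # seconds between samples (by-minute data)
n = 7 * 24 * 60           # one week of minutes (made-up length)
t = np.arange(n) * dt
period = 24 * 3600.0      # assumed daily cycle, in seconds
data = 5.0 + np.sin(2 * np.pi * t / period)

# Subtract the mean so the zero-frequency bin does not dominate
X = np.fft.rfft(data - data.mean())
freqs = np.fft.rfftfreq(n, d=dt)  # Hz, because dt is in seconds

dominant_hz = freqs[np.argmax(np.abs(X))]
dominant_period_hours = 1.0 / dominant_hz / 3600.0
print("Dominant period: %.2f hours" % dominant_period_hours)
```

Slow cycles sampled once a minute sit very close to 0 Hz on the full axis, which is why they can look like a single spike at zero until you zoom in or convert to periods.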
I have a few weeks of data with the units sold given:
xs[weeks] = [1,2,3,4]
ys['Units Sold'] = [1043,6582,5452,7571]
From the given series we can see that, although there is a drop from week 2 to week 3, the overall trend is increasing. How can I detect the trend in a small time-series dataset?
Is finding the slope of a fitted line the best way? And how do I calculate the slope angle of a line in Python?
I have run into the same issue that you are facing. I couldn't find a function written specifically for detecting a trend, but I found a really helpful one: numpy.polyfit():
numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
See the official NumPy documentation for details.
You can use the function like this
def trenddetector(list_of_index, array_of_data, order=1):
    result = np.polyfit(list_of_index, list(array_of_data), order)
    slope = result[-2]
    return float(slope)
This function returns a float value that indicates the trend of your data and also you can analyze it by something like this.
For example,
if the slope is a +ve value --> increasing trend
if the slope is a -ve value --> decreasing trend
if the slope is a zero value --> No trend
Play with this function and find out the correct threshold as per your problem and give it as a condition.
Example Code for your Solution
import numpy as np
def trendline(index, data, order=1):
    coeffs = np.polyfit(index, list(data), order)
    slope = coeffs[-2]
    return float(slope)
As per this output, the result is much greater than zero, so it shows your data is increasing steadily.
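For reference, here is a small worked sketch applying a first-order numpy.polyfit to the four data points from the question, and also converting the slope to an angle (which the question asked about). The variable names are illustrative, and the angle depends entirely on the units of the two axes, so treat it as a rough indicator only.

```python
import math
import numpy as np

weeks = [1, 2, 3, 4]
units_sold = [1043, 6582, 5452, 7571]

# First-order fit: polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(weeks, units_sold, 1)

# Angle of the fitted line relative to the x-axis, in degrees
angle_deg = math.degrees(math.atan(slope))

if slope > 0:
    trend = "increasing"
elif slope < 0:
    trend = "decreasing"
else:
    trend = "no trend"

print("slope = %.1f units/week, angle = %.2f deg, trend = %s"
      % (slope, angle_deg, trend))
```

The least-squares slope here works out to 1845.4 units per week, which is positive despite the dip in week 3.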
One approach could be to use a moving average (there are lots of variations of this; you may see EMA or SMA thrown around), which looks at the current time-step and the n previous steps, averages these, and uses the result as a sort of 'smoothed' value. This will give you a better indication of the way the data is actually moving, as one small decrease isn't going to have a dramatic impact on the gradient of the line.
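A simple moving average can be sketched with numpy.convolve as below. The helper name and the window of 2 are assumptions chosen for this tiny dataset; with real data you would normally use a longer window.

```python
import numpy as np

def simple_moving_average(values, window):
    """Average each run of `window` consecutive values (no padding)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

units_sold = [1043, 6582, 5452, 7571]
sma = simple_moving_average(units_sold, window=2)
print(sma)

# The smoothed series rises at every step even though the raw data dips once
rising = bool(np.all(np.diff(sma) > 0))
print("Smoothed series always rising:", rising)
```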
Depending on the domain of your problem, it may also be worth checking out some statistics used in the financial sector, such as DMI (Directional Movement Indicator) or MACD.
Hope this helps
I am new to Python.
I want to apply a Fourier transform to an array of discrete points (time, acceleration) and plot the result.
I copied and pasted some sample FFT code and modified it accordingly.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
# Load the .txt file in
myData = np.loadtxt('twenty_z_up.txt')
# Extract the time column
time = np.copy(myData[:,0])
# Extract the acceleration column
zAcc = np.copy(myData[:,3])
t = np.arange(10080)
sp = np.fft.fft(zAcc)
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, sp.real)
myData is a rectangular matrix with 10080 rows and 10 columns.
Thus, zAcc is column 3 extracted from the matrix.
In the plot drawn by Spyder, most of the harmonics concentrated around 0.
They are all extremely small.
But my data are actually the accelerations of the phone carried by a walking person (including the gravity). So I expect the most significant harmonic happens around 2Hz.
Why is the graph nonsense?
Thanks in advance!
==============UPDATES: My Graphs======================
The first time domain one:
x-axis is in millisecond.
y-axis is in m/s^2, due to earth gravity, it has a DC offset of ~10.
You do get two spikes at (approximately) 2 Hz. Your sampling period is around 2.8 ms (as best as I can infer from your first plot), giving +/-2 Hz the normalized frequency of +/-0.056, which is about where your spikes are. By default, fft.fftfreq returns the normalized frequency in cycles per sample (it assumes a sample spacing of d=1). You can set the d argument to the sampling period, and you'll get a vector containing the actual frequencies.
Your huge spike in the middle is obviously the DC offset (which you can trivially remove by subtracting the mean).
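The effect of the DC offset can be sketched with a synthetic signal: a constant of about 9.81 (gravity) plus a 2 Hz oscillation. The 50 Hz sampling rate, the amplitudes and the variable names are assumptions for illustration. Before subtracting the mean the largest bin is at 0 Hz; afterwards it is at 2 Hz.

```python
import numpy as np

fs = 50.0                      # assumed sampling rate in Hz
n = 1000
t = np.arange(n) / fs
z_acc = 9.81 + 0.5 * np.sin(2 * np.pi * 2.0 * t)  # gravity + 2 Hz motion

# Passing d (the sampling period) makes the frequency axis come out in Hz
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

with_dc = np.abs(np.fft.rfft(z_acc))
without_dc = np.abs(np.fft.rfft(z_acc - z_acc.mean()))

peak_with_dc = freqs[np.argmax(with_dc)]
peak_without_dc = freqs[np.argmax(without_dc)]
print("Largest bin with DC offset:   ", peak_with_dc, "Hz")
print("Largest bin after mean removal:", peak_without_dc, "Hz")
```

Plotting the magnitude (np.abs) rather than only the real part, as done here, also avoids hiding energy that lands in the imaginary part of the transform.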
As others said, we need to see the data, so please post it somewhere. Just to check, first try fixing the timestep size in fftfreq, then plot this synthetic signal, and then plot your signal to see how they compare:
import numpy as np
import matplotlib.pyplot as plt
timestep = 1. / 50.  # Assume sampling at 50 Hz. Change this accordingly.
N = 10080  # the number of samples
t = np.arange(N) * timestep  # time axis, needed only to generate xAcc_synthetic
f0 = 2.  # put the peak at a frequency of 2 Hz
# generate a synthetic signal at 2 Hz and add some noise to it
xAcc_synthetic = np.sin((2 * np.pi) * f0 * t) + np.random.rand(N) * 0.2
sp_synthetic = np.fft.fft(xAcc_synthetic)
freq = np.fft.fftfreq(t.size, d=timestep)
# simple check: the highest frequency should be the Nyquist frequency
print(np.isclose(max(abs(freq)), (1 / timestep) / 2.))
plt.plot(freq, abs(sp_synthetic))
plt.show()
Now, at the x axis equal to 2 you actually have a physical frequency of 2Hz, and you may spot the more pronounced peak you are looking for. Moreover, you may want to have a look also at yAcc and zAcc.
Operators used to examine the spectrum by eye and, knowing the location and width of each peak, judge which piece the spectrum belonged to. In the new system, the image is captured by a camera and shown on a screen, and the width of each band must be computed programmatically.
Old system: spectroscope -> human eye
New system: spectroscope -> camera -> program
What is a good method to compute the width of each band, given their approximate X-axis positions; given that this task used to be performed perfectly by eye, and must now be performed by program?
Sorry if I am short of details, but they are scarce.
Program listing that generated the previous graph; I hope it is relevant:
import numpy as np
from PIL import Image
from scipy.optimize import leastsq
# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("spectrum.jpg"))
# Average the pixel values over the colour channels
pic_avg = pic.mean(axis=2)
# Sum along the vertical axis to get a 1-D intensity profile
projection = pic_avg.sum(axis=0)
# Set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
#print(projection)
# Fit function, two Gaussians, adjust as needed
def fitfunc(p, x):
    return (p[0] * np.exp(-(x - p[1])**2 / (2.0 * p[2]**2)) +
            p[3] * np.exp(-(x - p[4])**2 / (2.0 * p[5]**2)))
errfunc = lambda p, x, y: fitfunc(p, x) - y
# Use scipy to fit, p0 is the initial guess
p0 = np.array([0, 20, 1, 0, 75, 10], dtype=float)
X = np.arange(len(projection))
p1, success = leastsq(errfunc, p0, args=(X, projection))
Y = fitfunc(p1, X)
# Output the result
print("Mean values at: ", p1[1], p1[4])
# Plot the result
from pylab import *
Given an approximate starting point, you could use a simple algorithm that finds the local maximum closest to this point. Your fitting code may be doing that already (I wasn't sure whether you were using it successfully or not).
Here's some code that demonstrates simple peak finding from a user-given starting point:
#!/usr/bin/env python
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
# Sample data with two peaks: small one at t=0.4, large one at t=0.8
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts-0.4)/0.1)**2) + 2*np.exp(-((ts-0.8)/0.1)**2)
# Say we have an approximate starting point of 0.35
start_point = 0.35
# Nearest index in "ts" to this starting point is...
start_index = np.argmin(np.abs(ts - start_point))
# Find the local maxima in our data by looking for a sign change in
# the first difference
# From http://stackoverflow.com/a/9667121/188535
maxes = (np.diff(np.sign(np.diff(xs))) < 0).nonzero()[0] + 1
# Find which of these peaks is closest to our starting point
index_of_peak = maxes[np.argmin(np.abs(maxes - start_index))]
print("Peak centre at: %.3f" % ts[index_of_peak])
# Quick plot showing the results: blue line is data, green dot is
# starting point, red dot is peak location
plt.plot(ts, xs, '-b')
plt.plot(ts[start_index], xs[start_index], 'og')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
This method will only work if the ascent up the peak is perfectly smooth from your starting point. If this needs to be more resilient to noise, I have not used it, but PyDSTool seems like it might help. This SciPy post details how to use it for detecting 1D peaks in a noisy data set.
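As one possible alternative for noisy data (a different tool from the PyDSTool suggestion above), scipy.signal.find_peaks can discard small wiggles by requiring a minimum prominence. The sketch below adds noise to the same two-peak sample data; the noise level, the prominence value of 0.3 and the variable names are assumptions that would need tuning for real spectra.

```python
import numpy as np
from scipy.signal import find_peaks

# Same two-peak sample data as above, with some added noise
rng = np.random.default_rng(1)
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts - 0.4) / 0.1) ** 2) + 2 * np.exp(-((ts - 0.8) / 0.1) ** 2)
noisy = xs + rng.normal(0, 0.02, ts.size)

# Keep only maxima that stand out from their surroundings by at least 0.3;
# this ignores the many tiny local maxima created by the noise
peaks, properties = find_peaks(noisy, prominence=0.3)

# Pick the detected peak closest to the user-given starting point
start_point = 0.35
nearest_peak = peaks[np.argmin(np.abs(ts[peaks] - start_point))]
print("Peaks found at t =", ts[peaks])
print("Peak nearest to %.2f is at t = %.2f" % (start_point, ts[nearest_peak]))
```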
So assume at this point you've found the centre of the peak. Now for the width: there are several methods you could use, but the easiest is probably the "full width at half maximum" (FWHM). Again, this is simple and therefore fragile. It will break for close double-peaks, or for noisy data.
The FWHM is exactly what its name suggests: you find the width of the peak where it's halfway to the maximum. Here's some code that does that (it just continues on from above):
# FWHM...
half_max = xs[index_of_peak]/2
# This finds where in the data we cross over the halfway point to our peak. Note
# that this is global, so we need an extra step to refine these results to find
# the closest crossovers to our peak.
# Same sign-change-in-first-diff technique as above
hm_left_indices = (np.diff(np.sign(np.diff(np.abs(xs[:index_of_peak] - half_max)))) > 0).nonzero()[0] + 1
# Add "index_of_peak" to result because we cut off the left side of the data!
hm_right_indices = (np.diff(np.sign(np.diff(np.abs(xs[index_of_peak:] - half_max)))) > 0).nonzero()[0] + 1 + index_of_peak
# Find closest half-max index to peak
hm_left_index = hm_left_indices[np.argmin(np.abs(hm_left_indices - index_of_peak))]
hm_right_index = hm_right_indices[np.argmin(np.abs(hm_right_indices - index_of_peak))]
# And the width is...
fwhm = ts[hm_right_index] - ts[hm_left_index]
print("Width: %.3f" % fwhm)
# Plot to illustrate FWHM: blue line is data, red circle is peak, red line
# shows FWHM
plt.plot(ts, xs, '-b')
plt.plot(ts[index_of_peak], xs[index_of_peak], 'or')
plt.plot(
    [ts[hm_left_index], ts[hm_right_index]],
    [xs[hm_left_index], xs[hm_right_index]], '-r')
It doesn't have to be the full width at half maximum — as one commenter points out, you can try to figure out where your operators' normal threshold for peak detection is, and turn that into an algorithm for this step of the process.
A more robust way might be to fit a Gaussian curve (or your own model) to a subset of the data centred around the peak (say, from a local minimum on one side to a local minimum on the other) and use one of the parameters of that curve (e.g. sigma) to calculate the width.
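A minimal sketch of the Gaussian-fit approach, using scipy.optimize.curve_fit on the same sample data: the window is restricted to the large peak, and the fitted sigma is converted to a width using the standard relation FWHM = 2*sqrt(2*ln 2)*sigma. The window boundary of 0.65, the initial guesses and the function name are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amplitude, centre, sigma):
    """Single Gaussian model used for the fit."""
    return amplitude * np.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))

# Same two-peak sample data as above
ts = np.arange(0, 1, 0.01)
xs = np.exp(-((ts - 0.4) / 0.1) ** 2) + 2 * np.exp(-((ts - 0.8) / 0.1) ** 2)

# Fit only the region around the large peak (right of the valley)
window = ts >= 0.65
t_win, x_win = ts[window], xs[window]

# Initial guess: height and position of the tallest sample, rough sigma
p0 = [x_win.max(), t_win[np.argmax(x_win)], 0.05]
params, _ = curve_fit(gaussian, t_win, x_win, p0=p0)
amplitude, centre, sigma = params

# Convert the fitted sigma into a full width at half maximum
fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0)) * abs(sigma)
print("Centre: %.3f, FWHM: %.3f" % (centre, fwhm))
```

Because the fit uses every sample in the window, the width is not limited to the grid spacing in the way the index-based half-maximum search is.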
I realise this is a lot of code, but I've deliberately avoided factoring out the index-finding functions to "show my working" a bit more, and of course the plotting functions are there just to demonstrate.
Hopefully this gives you at least a good starting point to come up with something more suitable to your particular set.
Late to the party, but for anyone coming across this question in the future...
Eye movement data looks very similar to this; I'd base an approach on the one used by Nyström & Holmqvist, 2010. Smooth the data using a Savitzky-Golay filter (scipy.signal.savgol_filter in scipy v0.14+) to get rid of some of the low-level noise while keeping the large peaks intact. The authors recommend using an order of 2 and a window size of about twice the width of the smallest peak you want to be able to detect. You can find where the bands are by arbitrarily removing all values above a certain y value (set them to numpy.nan). Then take the (nan)mean and (nan)standard deviation of the remainder, and remove all values greater than the mean + [parameter]*std (I think they use 6 in the paper). Iterate until you're not removing any data points, but note that, depending on your data, certain values of [parameter] may not stabilise. Then use numpy.isnan() to find events vs non-events, and numpy.diff() to find the start and end of each event (values of 1 and -1 respectively, if the isnan mask is cast to integers first). To get even more accurate start and end points, you can scan along the data backward from each start and forward from each end to find the nearest local minimum which has a value smaller than mean + [another parameter]*std (I think they use 3 in the paper). Then you just need to count the data points between each start and end.
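The core of that procedure (smoothing, iterative thresholding, then locating event edges) might be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the function name, the default parameters, the initial cut of 1.0 and the synthetic data are all made up, the refinement step that scans out to local minima is omitted, and events touching the ends of the array are not handled.

```python
import numpy as np
from scipy.signal import savgol_filter

def detect_events(y, initial_cut, window_length=11, polyorder=2, n_std=6.0):
    """Return (starts, ends) index arrays of events; ends are exclusive."""
    baseline = savgol_filter(y, window_length, polyorder).astype(float)
    # Arbitrary first pass: blank out everything above a fixed value
    baseline[baseline > initial_cut] = np.nan
    while True:
        threshold = np.nanmean(baseline) + n_std * np.nanstd(baseline)
        with np.errstate(invalid="ignore"):
            above = baseline > threshold  # NaN entries compare as False
        if not above.any():
            break  # nothing left to remove, so the threshold has settled
        baseline[above] = np.nan
    # Events are the blanked samples; cast to int so diff gives +1 / -1
    edges = np.diff(np.isnan(baseline).astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    return starts, ends

# Synthetic trace: low noise plus two bands of different widths
rng = np.random.default_rng(2)
x = np.arange(500)
y = rng.normal(0, 0.05, x.size)
y += 5 * np.exp(-((x - 150) / 10.0) ** 2) + 3 * np.exp(-((x - 350) / 15.0) ** 2)

starts, ends = detect_events(y, initial_cut=1.0)
widths = ends - starts  # number of data points in each event
print("Starts:", starts, "Ends:", ends, "Widths:", widths)
```

Without the initial cut, the tall peaks inflate the standard deviation so much that a 6-sigma threshold can sit above every sample, which is why the arbitrary first pass matters.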
This won't work for that double peak; you'd have to do some extrapolation for that.
The best method might be to statistically compare a bunch of methods with human results.
You would take a large variety of data and a large variety of measurement estimates (widths at various thresholds, area above various thresholds, different threshold selection methods, 2nd moments, polynomial curve fits of various degrees, pattern matching, etc.) and compare these estimates to human measurements of the same data set. Pick the estimation method that correlates best with expert human results. Or maybe pick several methods: the best one for each of various heights, for various separations from other peaks, etc.