Improving frequency-time normalization/Hilbert transform runtimes - Python

So this is a bit of a nitty gritty question...
I have a time-series signal with a non-uniform response spectrum that I need to whiten. I do this whitening using a frequency-time normalization method, where I incrementally filter my signal between two frequency endpoints using a constant narrow frequency band (~1/4 the lowest frequency end-member). I then find the envelope that characterizes each of these narrow bands and normalize that frequency component, and finally rebuild my signal from the normalized bands... all done in Python (sorry, it has to be a Python solution)...
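Schematically, the per-signal loop looks something like this (a minimal sketch only; the function name, filter design and band-stepping are illustrative assumptions, not the exact code in the repository linked below):

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def whiten_ftn(x, fs, f_min, f_max, band_width):
    # Whiten x by normalizing each narrow band by its Hilbert envelope.
    # Bands are contiguous here for simplicity; the real routine may overlap them.
    out = np.zeros_like(x, dtype=float)
    f_lo = f_min
    while f_lo < f_max:
        f_hi = min(f_lo + band_width, f_max)
        sos = butter(4, [f_lo, f_hi], btype='bandpass', fs=fs, output='sos')
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))           # envelope of this narrow band
        out += band / np.maximum(env, 1e-12)  # normalize, guarding against zeros
        f_lo = f_hi
    return out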
Here is the raw data:
and here is its spectrum:
and here is the spectrum of the whitened data:
The problem is that I have to do this for maybe ~500,000 signals like this, and it takes a while (~a minute each), with almost the entire runtime spent doing the actual (multiple) Hilbert transforms.
I already have this running on a small cluster, so I don't want to parallelize the loop that the Hilbert transform is in.
I'm looking for alternative envelope routines/functions (non-Hilbert), or for a way to calculate the entire narrowband response without looping.
The other option is to make the frequency bands adaptive to the center frequency over which it's filtering, so they get progressively larger as the routine marches up in frequency; that would just decrease the number of passes through the loop.
Any and all suggestions welcome!!!
example code/dataset:
https://github.com/ashtonflinders/FTN_Example

Here is a faster method to calculate the envelope from the local maxima:
import numpy as np

def calc_envelope(x, ind):
    # Approximate the envelope at indices ind by interpolating the local maxima of |x|.
    x_abs = np.abs(x)
    # indices where |x| has a local maximum (sign change in the first difference)
    loc = np.where(np.diff(np.sign(np.diff(x_abs))) < 0)[0] + 1
    peak = x_abs[loc]
    envelope = np.interp(ind, loc, peak)
    return envelope
Here is an example output:
It's about 6x faster than hilbert. To speed it up even more, you could write a Cython function that finds the next local maximum and normalizes up to that point iteratively.
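A rough way to check the speedup on your own data might be the following (the synthetic test signal and timing harness here are just illustrative assumptions):

import numpy as np
from scipy.signal import hilbert
from timeit import timeit

# hypothetical test signal: a narrowband burst with a Gaussian envelope
t = np.arange(200_000) / 1000.0
x = np.sin(2 * np.pi * 2.0 * t) * np.exp(-((t - 100) / 30) ** 2)
ind = np.arange(len(x))

env_peaks = calc_envelope(x, ind)   # local-max interpolation
env_hilb = np.abs(hilbert(x))       # analytic-signal envelope, for comparison

print(timeit(lambda: calc_envelope(x, ind), number=10))
print(timeit(lambda: np.abs(hilbert(x)), number=10))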

Related

Waveform frequency resolution for FFT – any way to increase it?

This article https://www.bitweenie.com/listings/fft-zero-padding/ gives a simple relation between the time length of the input data to the FFT and the minimum distance between two frequencies that can be distinguished in the FFT. The article calls this the Waveform frequency resolution.
In other words: if two input frequencies are closer together than 1/(time length of the input data), they will show up as only one peak in the FFT plot.
My question is: is there a way to increase this Waveform frequency resolution? I am finding it difficult to work with rather short data-series due to this limitation.
As an example, if I use a combination of sine series with periods 9.5, 10, and 11 over 240 datapoints I cannot distinguish between the different frequencies.
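A quick, self-contained way to reproduce this (my own sketch, not part of the original setup):

import numpy as np

N = 240
n = np.arange(N)
# three sinusoids with periods of 9.5, 10 and 11 samples
x = (np.sin(2 * np.pi * n / 9.5)
     + np.sin(2 * np.pi * n / 10.0)
     + np.sin(2 * np.pi * n / 11.0))

X = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(N)   # in cycles/sample
# bin spacing is 1/N ~ 0.0042 cycles/sample, while the tones sit at ~0.105,
# 0.100 and ~0.091 cycles/sample -- only about one to two bins apart, so
# their peaks smear together in the 240-point spectrum.
print(freqs[np.argmax(X)])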
To have good frequency resolution you need a long time series.
This is a fundamental limitation, known as the uncertainty principle. It cannot be overcome within Fourier analysis (Fourier transform, DFT, short-time Fourier transform and so on).
Also note that zero padding will not overcome this issue.
It gives more points in the frequency domain, in the sense that the same spectral information is sampled more densely, but it will not make the peaks sharper or better separated.
The only way to overcome the uncertainty principle is to make further assumptions on the data.
If for example you know that there is only a single frequency component, it is possible to determine its frequency more accurately than the uncertainty principle predicts.
You can also use transforms such as the Wigner-Ville transform. It is not bound by the uncertainty principle, but it generates "cross-terms", i.e. artifact frequency components. However, when you only have a few frequency components this might be acceptable; it depends on the use case.

Optimising parameters for finding peaks in 1D array

I need to optimise a method for finding the number of data peaks in a 1D array. The data is a time-series of the amplitude of a wav file.
I have the code implemented already:
from scipy.io.wavfile import read
from scipy.signal import find_peaks
_, amplitudes = read('audio1.wav')
indexes, _ = find_peaks(amplitudes, height=80)
print(f'Number of peaks: {len(indexes)}')
When plotted, the data looks like this:
General scale
The 'peaks' that I am interested in are clear to the human eye - there are 23 in this particular dataset.
However, because the array is so large, the data is extremely variant within the peaks that are clear at a general scale (hence the many hundreds of peaks labelled with blue crosses):
Zoomed in view of one peak
Peak-finding questions have been asked many times before (I've been through a lot of them!) - but I can't find any help or explanation of optimising the parameters for finding only the peaks I want. I know a little about Python, but am blind when it comes to mathematical analysis!
Analysing by width seems useless because, as the second image shows, the peaks that are clear at a large scale are actually interspersed with 'silent' ranges. Distance is not helpful because I do not know how close the peaks will be in other wav files. Prominence has been suggested as the best method, but I could not get the results I needed; likewise threshold. I have also tried decimating the signal, smoothing it with a Savitzky-Golay filter, and various combinations of parameters and values, all with inaccurate results.
Height alone has been useful because I can see from the charts that peaks always reach above 80.
This is a common task in audio processing, and there are several approaches which depend entirely on your data.
However, there are implementations out there which are used for finding peaks in novelty functions (e.g., the output from a beat tracker). Try these ones:
https://madmom.readthedocs.io/en/latest/modules/features/onsets.html#madmom.features.onsets.peak_picking
https://librosa.github.io/librosa/generated/librosa.util.peak_pick.html#librosa.util.peak_pick
Basically they implement the same method but there might be differences in the details.
Furthermore, you could check whether you really need to work at this high a sampling frequency. Try downsampling the signal or using a moving average filter, for example along the lines sketched below.
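For example, something like this (the window length, height and distance values are placeholders to tune, not values derived from your data; a mono file is assumed, as in the question):

import numpy as np
from scipy.io.wavfile import read
from scipy.signal import find_peaks

_, amplitudes = read('audio1.wav')
x = np.abs(amplitudes.astype(float))

# crude envelope: moving average over ~50 ms (window length is a guess;
# tune it to the time scale of the peaks you care about)
win = 2205   # 50 ms at 44.1 kHz -- an assumption
envelope = np.convolve(x, np.ones(win) / win, mode='same')

# peaks of the smoothed envelope; height/distance are placeholders
indexes, _ = find_peaks(envelope, height=80, distance=win)
print(f'Number of peaks: {len(indexes)}')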
Look at 0d persistent homology to find a good strategy, where the parameter you can optimize for is peak persistence. A nice blog post here explains the basics.
But in short the idea is to imagine your graph being filled by water, and then slowly draining the water. Every time a piece of the graph comes above water a new island is born. When two islands are next to each other they merge, which causes the younger island (with the lower peak) to die.
Then each data point has a birth time and a death time. The most significant peaks are those with the longest persistence, which is death - birth.
If the water level drops at a continuous rate, the persistence is defined in terms of peak height. Another possibility is dropping the water instantaneously from point to point as time goes from step t to step t+1, in which case the persistence is defined as peak width in terms of signal samples.
For your data it seems that using the original definition in terms of peak height (> 70) finds all the peaks you are interested in, albeit possibly too many of them, clustered together. You can limit this by choosing the first peak in each cluster, or the highest peak in each cluster, or by doing both and keeping only peaks that have both a large height persistence and a large width persistence.
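If you want to try this without an external library, here is a minimal union-find sketch of the water-level idea (my own reconstruction, not code from the linked blog post; the function name and tie-breaking are assumptions):

import numpy as np

def persistent_peaks(signal):
    # 0-d persistence "water level" sketch: visit samples from highest to lowest;
    # each local maximum is born as an island and dies when it merges into a
    # taller island. Persistence = peak height - height at the merge.
    n = len(signal)
    parent = np.full(n, -1)   # -1 means "still under water"
    birth = {}                # component root -> index of its highest peak
    pers = {}                 # peak index -> persistence (set at death)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in sorted(range(n), key=lambda k: signal[k], reverse=True):
        parent[i] = i
        birth[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # the island with the lower peak dies at the current level
                    lo, hi = (ri, rj) if signal[birth[ri]] < signal[birth[rj]] else (rj, ri)
                    pers[birth[lo]] = signal[birth[lo]] - signal[i]
                    parent[lo] = hi
    pers[birth[find(0)]] = np.inf   # the global maximum never dies
    return sorted(((k, v) for k, v in pers.items() if v > 0), key=lambda kv: -kv[1])

Running persistent_peaks on your (possibly smoothed) signal and keeping the entries with the largest persistence, e.g. the top 23 or everything above a persistence threshold, should pick out the peaks that are obvious at the large scale.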

How to find the bin number of the frequency of interest from an fft signal in python?

I am working on a task that involves taking two signals from a phase Doppler anemometry system and calculating the phase shift and the frequency, which will in turn give the velocity and diameter of the droplet. Before getting to the actual task, I am taking two sine signals from a function generator, introducing a phase shift, and then calculating the phase and frequency in Python with an FFT to verify that both match. The frequency I recover is the same as what I set on the function generator, so that part is solved. I am now stuck at the point where I need to find the bin number that my frequency falls in, so that I can calculate the exact phase shift.
Also, I would like to know how to find the number of bins used in the FFT.
My signal is 40MHz, my sampling frequency is 125MHz.
Thank you!
It might be slight overkill, but you can use numpy.where to find the index of a specific value in an array:
>>> import numpy as np
>>> np.where(np.linspace(1,10,10)==4)
(array([3]),)
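That said, for this particular problem you don't need to search at all: the FFT of an N-sample record has N bins (N//2 + 1 for a real-input rfft), spaced fs/N apart, so the bin index of a known frequency can be computed directly. A minimal sketch (N here is an assumed FFT length; use the length of the array you actually transform):

import numpy as np

fs = 125e6   # sampling frequency: 125 MHz
f0 = 40e6    # signal frequency: 40 MHz
N = 8192     # FFT length -- an assumption for illustration

k = int(round(f0 * N / fs))              # bin index closest to f0
print(k, np.fft.fftfreq(N, d=1/fs)[k])   # e.g. 2621 and ~39.99 MHz for N = 8192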

How can I discriminate the falling edge of a signal with Python?

I am working on discriminating some signals for the calculation of the free-period oscillation and damping ratio of a spring-mass system (seismometer). I am using Python as the main processing program. What I need to do is import this signal, parse the signal and find the falling edge, then return a list of the peaks from the tops of each oscillation as a list so that I can calculate the damping ratio. The calculation of the free-period is fairly straightforward once I've determined the location of the oscillation within the dataset.
Where I am mostly hung up are how to parse through the list, identify the falling edge, and then capture each of the Z0..Zn elements. The oscillation frequency can be calculated using an FFT fairly easily, once I know where that falling edge is, but if I process the whole file, a lot of energy from the energizing of the system before the release can sometimes force the algorithm to kick out an ultra-low frequency that represents the near-DC offset, rather than the actual oscillation. (It's especially a problem at higher damping ratios where there might be only four or five measurable rebounds).
Has anyone got some ideas on how I can go about this? Right now, the code below uses arbitrarily assigned values for the signal in the screenshot; however, I need the code to calculate those values. Also, I haven't yet determined how to create my list of peaks for the calculation of the damping ratio h. Your help in getting some ideas together for solving this would be very welcome. Because I have such a small Stack Overflow reputation, I have included my signal as a sample screenshot at the following link:
(Boy, I hope this works!)
Tycho's sample signal -->
https://github.com/tychoaussie/Sigcal_v1/blob/066faca7c3691af3f894310ffcf3bbb72d730601/Freeperiod_Damping%20Ratio.jpg
##########################################################
import os, sys, csv
from scipy import signal
from scipy.integrate import simps
import pylab as plt
import numpy as np
import scipy as sp
#
# code goes here that imports the data from the csv file into lists.
# One of those lists is called laser, a time-history list in counts from an analog-digital-converter
# sample rate is 130.28 samples / second.
#
#
# Find the period of the observed signal
#
delta = 0.00767 # Calculated elsewhere in code, represents 130.28 samples/sec
# laser is a list with about 20,000 elements representing time-history data from a laser position sensor
# The release of the mass and system response occurs starting at sample number 2400 in this particular instance.
sense = signal.detrend(laser[2400:(2400+8192)]) # Remove the linear trend (and mean) from the signal
N = len(sense)
W = np.fft.fft(sense)
freq = np.fft.fftfreq(len(sense),delta) # len(sense) is the number of samples; delta is the sample interval in seconds
#
# Take the sample with the largest amplitude as our center frequency.
# This only works if the signal is heavily sinusoidal and stationary
# in nature, like our calibration data.
#
idx = np.where(abs(W)==max(np.abs(W)))[0][-1]
Frequency = abs(freq[idx]) # Frequency in Hz
period = 1/(Frequency*delta) # represents the number of samples for one cycle of the test signal.
#
# create an axis representing time.
#
dt = [] # Create an x axis that represents elapsed time in seconds. delta = seconds per sample, i represents sample count
for i in range(0,len(sense)):
    dt.append(i*delta)
#
# At this point, we know the frequency interval, the delta, and we have the arrays
# for signal and laser. We can now discriminate out the peaks of each rebound and use them to process the damping ratio of either the 'undamped' system or the damping ratio of the 'electrically damped' system.
#
print('Frequency calculated to', Frequency, 'Hz.')
Here's a somewhat unconventional idea, which I think might be fairly robust and doesn't require a lot of heuristics and guesswork. The data you have is really high quality and fits a known curve, so that helps a lot here. Here I assume the "good part" of your curve has the form:
V = a * exp(-γ * t) * cos(2 * π * f * t + φ) + V0 # [Eq1]
V: voltage
t: time
γ: damping constant
f: frequency
a: starting amplitude
φ: starting phase
V0: DC offset
Outline of the algorithm
Getting rid of the offset
Firstly, calculate the derivative numerically. Since the data quality is quite high, the noise shouldn't affect things too much.
The voltage derivative V_deriv has the same form as the original data: same frequency and damping constant, but with a different phase ψ and amplitude b,
V_deriv = b * exp(-γ * t) * cos(2 * π * f * t + ψ) # [Eq2]
The nice thing is that this will automatically get rid of your DC offset.
Alternative: this step isn't strictly needed. The offset is a relatively minor complication to the curve fitting, since you can always provide a good guess for the offset by averaging the entire curve. The trade-off is that you get a better signal-to-noise ratio if you don't use the derivative.
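A one-line sketch of the derivative step (assuming V and t are already 1-D NumPy arrays holding the voltage samples and their timestamps):

import numpy as np

# numerical derivative of the voltage; the constant offset V0 drops out
V_deriv = np.gradient(V, t)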
Preliminary guesswork
Now consider the curve of the derivative (or the original curve if you skipped the last step). Start with the data point on the very far right, and then follow the curve leftward as many oscillations as you can until you reach an oscillation whose amplitude is greater than some amplitude threshold A. You want to find a section of the curve containing one oscillation with a good signal-to-noise ratio.
How you determine the amplitude threshold is difficult to say. It depends on how good your sensors are. I suggest leaving this as a parameter for tweaking later.
Curve-fitting
Now that you have captured one oscillation, you can do lots of easy things: estimate the frequency f and damping constant γ. Using these initial estimates, you can perform a nonlinear curve fit of all the data to the right of this oscillation (including the oscillation itself).
The function that you fit is [Eq2] if you're using the derivative, or [Eq1] if you use the original curve (but they have the same form anyway, so it's rather a moot point).
Why do you need these initial estimates? For a nonlinear curve fit to a decaying wave, it's critical that you give a good initial guess for the parameters, especially the frequency and damping constant. The other parameters are comparatively less important (at least that's my experience), but here's how you can get the others just in case:
The phase is 0 if you start from a maximum, or π if you start at a minimum.
You can guess the starting amplitude too, though the fitting algorithm will typically do fine even if you just set it to 1.
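For concreteness, here is what the fit itself might look like with SciPy (a sketch under assumptions: t_fit and V_fit stand for the selected segment and are synthetic stand-ins here, and f0/gamma0 are the rough estimates described above):

import numpy as np
from scipy.optimize import curve_fit

def damped_cos(t, a, gamma, f, phi, V0):
    # [Eq1]: exponentially decaying cosine plus a DC offset
    return a * np.exp(-gamma * t) * np.cos(2 * np.pi * f * t + phi) + V0

# t_fit, V_fit: the segment of the record to the right of the chosen oscillation.
# Synthetic here; in practice slice them out of your laser record.
t_fit = np.arange(0, 30, 0.00767)
V_fit = damped_cos(t_fit, 2.0, 0.15, 0.8, 0.3, 5.0) + 0.01 * np.random.randn(len(t_fit))

f0, gamma0 = 0.78, 0.1               # rough estimates from one oscillation
p0 = [1.0, gamma0, f0, 0.0, np.mean(V_fit)]
popt, pcov = curve_fit(damped_cos, t_fit, V_fit, p0=p0)
a_fit, gamma_fit, f_fit, phi_fit, V0_fit = popt
perr = np.sqrt(np.diag(pcov))        # 1-sigma parameter uncertainties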
Finding the edge
The curve fit should be really good at this point; you can tell because the discrepancy between your curve and the fit is very low. Now the trick is to attempt to increase the domain of the curve to the left.
If you stay within the sinusoidal region, the curve fit should remain quite good (how you judge the "goodness" requires some experimenting). As soon as you hit the flat region of the curve, however, the errors will start to increase dramatically and the parameters will start to deviate. This can be used to determine where the "good" data ends.
You don't have to do this point by point though – that's quite inefficient. A binary search should work quite well here (possibly the "one-sided" variation of it).
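A sketch of that search, reusing damped_cos and p0 from the previous snippet (the synthetic record, the residual threshold and the fit_rms helper are assumptions to adapt):

import numpy as np
from scipy.optimize import curve_fit

# synthetic full record: a flat "energized" section followed by the decay
t = np.arange(0, 60, 0.00767)
edge_true = len(t) // 2
V = np.full(len(t), 7.0)
V[edge_true:] = damped_cos(t[edge_true:] - t[edge_true], 2.0, 0.15, 0.8, 0.3, 5.0)
V += 0.01 * np.random.randn(len(t))

def fit_rms(start):
    # RMS residual of the damped-cosine fit applied to V[start:]
    seg_t, seg_V = t[start:] - t[start], V[start:]
    try:
        popt, _ = curve_fit(damped_cos, seg_t, seg_V, p0=p0, maxfev=5000)
    except RuntimeError:       # no convergence: treat as a bad fit
        return np.inf
    return np.sqrt(np.mean((damped_cos(seg_t, *popt) - seg_V) ** 2))

good = int(0.75 * len(V))      # a start index known to lie in the decaying part
tol = 3 * fit_rms(good)        # residual threshold -- an assumption to tune
lo, hi = 0, good
while hi - lo > 1:             # one-sided binary search for the falling edge
    mid = (lo + hi) // 2
    if fit_rms(mid) < tol:
        hi = mid               # still fits well: push the window further left
    else:
        lo = mid               # fit broke down: the edge is to the right
print('estimated edge sample:', hi, 'true edge:', edge_true)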
Summary
You don't have to follow this exact procedure, but the basic gist is that you can perform some analysis on a small part of the data starting on the very right, and then gradually increase the time domain until you reach a point where you find that the errors are getting larger rather than smaller as they ought to be.
Furthermore, you can also combine different heuristics to see if they agree with each other. If they don't, then it's probably a rather sketchy data point that requires some manual intervention.
Note that one particular advantage of the algorithm I sketched above is that you will get the results you want (damping constant and frequency) as part of the process, along with uncertainty estimates.
I neglected to mention most of the gory mathematical and algorithmic details so as to provide a general overview, but if needed I can provide more detail.

Using fourier analysis for time series prediction

For data that is known to have seasonal or daily patterns, I'd like to use Fourier analysis to make predictions. After running an FFT on the time series data, I obtain coefficients. How can I use these coefficients for prediction?
I believe the FFT assumes all the data it receives constitutes one period. If I simply regenerate the data using the IFFT, am I also regenerating the continuation of my function, i.e. can I use those values as future values?
Simply put: I run an FFT for t = 0, 1, 2, ..., 10; then, using the IFFT on the coefficients, can I use the regenerated time series for t = 11, 12, ..., 20?
I'm aware that this question may no longer be relevant to you, but for others who are looking for answers I wrote a very simple example of Fourier extrapolation in Python: https://gist.github.com/tartakynov/83f3cd8f44208a1856ce
Before you run the script make sure that you have all dependencies installed (numpy, matplotlib). Feel free to experiment with it.
P.S. A Locally Stationary Wavelet model may be better than Fourier extrapolation. LSW is commonly used in predicting time series. The main disadvantage of Fourier extrapolation is that it just repeats your series with period N, where N is the length of your time series.
It sounds like you want a combination of extrapolation and denoising.
You say you want to repeat the observed data over multiple periods. Well, then just repeat the observed data. No need for Fourier analysis.
But you also want to find "patterns". I assume that means finding the dominant frequency components in the observed data. Then yes, take the Fourier transform, preserve the largest coefficients, and eliminate the rest.
import numpy as np

X = np.fft.fft(x)
Y = np.zeros(len(X), dtype=complex)
Y[important_frequencies] = X[important_frequencies]  # important_frequencies = index array of the dominant bins
As for periodic repetition: let z = [x, x], i.e., two periods of the signal x. Then Z[2k] = 2·X[k] for all k in {0, 1, ..., N-1}, and zero otherwise, so the repeated signal can be built in the frequency domain by zero-stuffing:
Z = np.zeros(2 * len(X), dtype=complex)
Z[::2] = 2 * X
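Putting the two pieces together into something runnable (the synthetic signal and the choice of k strongest components are illustrative only):

import numpy as np

# x: observed samples of one "period" of the seasonal pattern (assumption)
x = np.sin(2 * np.pi * np.arange(240) / 24) + 0.3 * np.random.randn(240)

X = np.fft.fft(x)
k = 5                                   # keep the k strongest components
important = np.argsort(np.abs(X))[-k:]  # indices of the largest coefficients
Y = np.zeros_like(X)
Y[important] = X[important]
x_denoised = np.fft.ifft(Y).real

# periodic "prediction": the denoised pattern simply repeats with period len(x)
x_future = np.tile(x_denoised, 2)       # two periods: observed + forecast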
When you run an FFT on time series data, you transform it into the frequency domain. The coefficients multiply the terms in the series (sines and cosines or complex exponentials), each with a different frequency.
Extrapolation is always a dangerous thing, but you're welcome to try it. You're using past information to predict the future when you do this: "Predict tomorrow's weather by looking at today." Just be aware of the risks.
I'd recommend reading "Black Swan".
You can use the library that @tartakynov posted and, to avoid repeating exactly the same time series in the forecast (overfitting), you can add a new parameter to the function, called n_param, and fix a lower bound h on the amplitudes of the frequencies.
def fourierExtrapolation(x, n_predict, n_param):
Usually you will find that, in a signal, some frequencies have significantly higher amplitudes than others, so if you select those frequencies you can isolate the periodic nature of the signal. You can add these two lines, which are determined by the chosen number n_param:
h = np.sort(np.abs(x_freqdom))[-n_param]
x_freqdom = [x_freqdom[i] if np.absolute(x_freqdom[i]) >= h else 0 for i in range(len(x_freqdom))]
Just by adding this you will be able to produce a nice, smooth forecast.
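For completeness, here is a self-contained version of that idea (my reconstruction of the approach, not the exact code from the gist):

import numpy as np

def fourier_extrapolation(x, n_predict, n_param):
    # Extrapolate x by n_predict samples, keeping only the n_param strongest
    # frequency components (a reconstruction of the idea, not the exact gist code).
    n = len(x)
    t = np.arange(n)
    p = np.polyfit(t, x, 1)                  # remove a linear trend first
    x_notrend = x - (p[0] * t + p[1])
    x_freqdom = np.fft.fft(x_notrend)
    freqs = np.fft.fftfreq(n)

    # zero out everything weaker than the n_param-th largest amplitude
    h = np.sort(np.abs(x_freqdom))[-n_param]
    x_freqdom = np.where(np.abs(x_freqdom) >= h, x_freqdom, 0)

    # rebuild the signal, component by component, over the extended time axis
    t_ext = np.arange(n + n_predict)
    restored = np.zeros(len(t_ext))
    for k in range(n):
        if x_freqdom[k] == 0:
            continue
        amp = np.abs(x_freqdom[k]) / n
        phase = np.angle(x_freqdom[k])
        restored += amp * np.cos(2 * np.pi * freqs[k] * t_ext + phase)
    return restored + p[0] * t_ext + p[1]    # add the trend back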
Another useful article about FFT forecasting:
forecast FFT in R
