Rolling Window, Dominant Frequency Numpy Accelerometer Data - python

Want to calculate dominant frequency, secondary dominant frequency from X,Y,Z accelerometer data stored in flat CSV files (million rows+) e.g.
data
I'm trying to use scipy although aware of numpy - either would do. I've converted my X, Y, Z to SMV format (single magnitude vector) and want to apply the fourier transform to this, and then get the frequencies using fftfreq - the bit that defeats me is the n and timestep. I have my sample rates, the hertz, and the size of rolling window I want to look at (10 rows of data) but not quite sure how to apply this to script below:
#The three-dimension data collected (X,Y,Z) were transformed into a
#single-dimensional Signal Magnitude Vector SMV (aka The Resultant)
#SMV = x2 + Y2 + Z2
X2 = X['X']*X['X']
Y2 = X['Y']*X['Y']
Z2 = X['Z']*X['Z']
#print X['X'].head(2) #Confirmed worked
#print X2.head(2) #Confirmed worked
combine = [X2,Y2,Z2, Y]
parent = pd.concat(combine, axis=1)
parent['ADD'] = parent.sum(axis=1) #Sum X2,Y2,Z2
sqr = np.sqrt(parent['ADD']) #Square Root of Sum Above
sqr.name = 'SMV'
combine2 = [sqr, Y] #Reduce Dataset to SMV and Class
parent2 = pd.concat(combine2, axis=1)
print parent2.head(4)
"************************* Begin Fourier ****************************"
from scipy import fftpack
X = fftpack.fft(sqr)
f_s = 80 #80 Hertz
samp = 1024 #samples per segment divided by 12.8 secs signal length
n = X.size
timestep = 10
freqs = fftpack.fftfreq(n, d=timestep)

Firstly you need to load your data into a numpy array (sorry i didn't quite follow your approach):
def load_data():
csvlist = []
times = []
with open('freq.csv') as f:
csvfile = csv.reader(f, delimiter=',')
for i, row in enumerate(csvfile):
timestamp = datetime.datetime.strptime(row[0],"%Y-%m-%d %H:%M:%S.%f")
times.append(timestamp)
csvlist.append(row[1:])
timestep = times[1]-times[0]
csvarr = numpy.array(csvlist, dtype=numpy.float32)
return timestep, csvarr
The may well be a better way to do this?
Then you need to calculate the magnitudes:
rms = numpy.sqrt(numpy.sum(data**2, axis=1))
And then the fourier analysis:
def fourier(timestep, data):
N = len(data)//2
freq = fftpack.fftfreq(len(data), d=timestep)[:N]
fft = fftpack.fft(data)[:N]
amp = numpy.abs(fft)/N
order = numpy.argsort(amp)[::-1]
return freq[order]
the return from this is a list of frequencies in decreasing order of importance.

Related

3D Fourier transformation of a gaussian function in python

I'm trying to get the 3D Fourier Transform of the gaussian function e^(-r^(2)/2) in python using the numpy.fft library.
I've attempted using different ffts from the library with different inputs, shifting the results with np.fft.fftshift, trying to find a multiplicative factor and many other things, the last thing I tried was using the 1D fft function, and then cubing the result, here's the corresponding source code:
import numpy as np
R = float(10)
N = float(100)
y= np.dtype(np.float64)
dr = R/N
def F(x):
return np.exp(-((x*dr)**2)/2)
Frange = np.arange(1,int(N)+1)
y = np.zeros((int(N)))
i = 0
while i<int(N):
y[i] = F(Frange[i])
i += 1
y = y/3
y_fft = np.fft.fftshift(np.abs(np.fft.fft(y)))**3
print (y_fft)
The first values I get:
4.62e-03, 4.63e-03, 4.65e-03, 4.69e-03, 4.74e-03
According to Lado, Fred. (1971) Numerical Fourier transforms in one, two, and three dimensions for liquid state calculations, the analytic solution to the problem is: (2pi )^(3/2)*e^(-k^(2)/2)
And the first values of the analytic solution with the same values of R and N are:
14.99, 12.92, 10.10, 7.15, 4.58
I also created a DFT program using a formula provided in the previous article which gives the expected results, but I haven't been able to replicate the analytic results in any of my attempts using the NumPy or SciPy fft libraries.
Here's my program for the analytic and DFT results:
import math
import numpy as np
def F(r):
x=math.exp((-1/2)*(r**2))
return x
def FT(r):
x=((2*math.pi)**(3/2))*(math.exp((-1/2)*(r**2)))
return x
R = float(10)
N = int(100)
ft = np.zeros(N)
fta = np.zeros(N)
dr = R/N
dk = math.pi/R
print ("\tk \t\t\t Discrete \t\t\t Analytic")
for j in range (1, N):
kj = j*dk
#Discrete Transform
sum = 0
for i in range(1, N):
ri = i*dr
sum = sum + (dr*ri*(F(ri))*(math.sin(kj*ri)))
ft[j] = ((4*math.pi)/kj)*sum
#Analytic Transform
fta[j] = FT(kj)
#Print results
print(kj, f" \t\t{ft[j]:.10E} \t\t{fta[j]:.10E}")
And these are the first few results:
k Discrete Analytic
0.3141592653589793 1.4991263193E+01 1.4991263193E+01
0.6283185307179586 1.2928362116E+01 1.2928362116E+01
0.9424777960769379 1.0101494686E+01 1.0101494686E+01
1.2566370614359172 7.1509645344E+00 7.1509645344E+00
1.5707963267948966 4.5864901093E+00 4.5864901093E+00

Kalman filter in python 2-D

As shown in this picture, my predicted points are following the GPS track, which has noisy points and that is not desired. Instead I want my filter to predict points that follow the road instead of the green area.
I tried to implement Kalman filter on noisy GPS data to remove the jumping points or predicting missing data if GPS signal is lost. Data contains latitude and longitude. After adjusting the parameters I can see that my predicted values are very much the same as the measurements I have, which is not fulfilling the actual problem I am trying to solve. I am still at the learning
stage, so I am not sure if the parameter selection is not right or the problem lies within my Python code. I'm using QGIS for visualization of Actual and Prediction values to compare them with my real GPS data.
Here is my code:
....python...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Users/mun/Desktop/Research/Ny mappe/GPS_track.csv')
df.head(1000)
lat = np.array([df.latitude])
print(lat)
long = np.array([df.longitude])
print(long)
print(len(long[0]))
for i in range(len(long)):
print(long[i][0])
for i in range(len(lat[0])):
print(lat[0][i])
print(len(lat[0]))
print(len(long[0]))
#length of the arrays. the arrays should always have the same length
lng=len(lat[0])
print(lng)
for index in range(lng):
print(lat[0][index])
print(long[0][index])
for index in range (lng):
np.array((lat[0][index], long[0][index]))
coord1 = [list(i) for i in zip (lat[0],long[0])]
print(coord1)
from pylab import *
from numpy import *
import matplotlib.pyplot as plt
class Kalman:
def __init__(self, ndim):
self.ndim = ndim
self.Sigma_x = eye(ndim)*1e-4 # Process noise (Q)
self.A = eye(ndim) # Transition matrix which
predict state for next time step (A)
self.H = eye(ndim) # Observation matrix (H)
self.mu_hat = 0 # State vector (X)
self.cov = eye(ndim)*0.01 # Process Covariance (P)
self.R = .001 # Sensor noise covariance matrix /
measurement error (R)
def update(self, obs):
# Make prediction
self.mu_hat_est = dot(self.A,self.mu_hat)
self.cov_est = dot(self.A,dot(self.cov,transpose(self.A))) +
self.Sigma_x
# Update estimate
self.error_mu = obs - dot(self.H,self.mu_hat_est)
self.error_cov = dot(self.H,dot(self.cov,transpose(self.H))) +
self.R
self.K =
dot(dot(self.cov_est,transpose(self.H)),linalg.inv(self.error_cov))
self.mu_hat = self.mu_hat_est + dot(self.K,self.error_mu)
if ndim>1:
self.cov = dot((eye(self.ndim) -
dot(self.K,self.H)),self.cov_est)
else:
self.cov = (1-self.K)*self.cov_est
if __name__ == "__main__":
#print "***** 1d ***********"
ndim = 1
nsteps = 3
k = Kalman(ndim)
mu_init=array([54.907134])
cov_init=0.001*ones((ndim))
obs = random.normal(mu_init,cov_init,(ndim, nsteps))
for t in range(ndim,nsteps):
k.update(obs[:,t])
print ("Actual: ", obs[:, t], "Prediction: ", k.mu_hat_est)
coord_output=[]
for coordinate in coord1:
temp_list=[]
ndim = 2
nsteps = 100
k = Kalman(ndim)
mu_init=np.array(coordinate)
cov_init=0.0001*ones((ndim))
obs = zeros((ndim, nsteps))
for t in range(nsteps):
obs[:, t] = random.normal(mu_init,cov_init)
for t in range(ndim,nsteps):
k.update(obs[:,t])
print ("Actual: ", obs[:, t], "Prediction: ", k.mu_hat_est[0])
temp_list.append(obs[:, t])
temp_list.append(k.mu_hat_est[0])
print("temp list")
print(temp_list)
coord_output.append(temp_list)
for coord_pair in coord_output:
print(coord_pair[0])
print(coord_pair[1])
print("--------")
print(line_actual)
print(coord_output)
df2= pd.DataFrame(coord_output)
print(df2)
Actual = df2[0]
Prediction = df2[1]
print (Actual)
print(Prediction)
Actual_df = pd.DataFrame(Actual)
Prediction_df = pd.DataFrame(Prediction)
print(Actual_df)
print(Prediction_df)
Actual_coord = pd.DataFrame(Actual_df[0].to_list(), columns = ['latitude',
'longitude'])
Actual_coord.to_csv('C:/Users/mun/Desktop/Research/Ny
mappe/Actual_noise.csv')
Prediction_coord = pd.DataFrame(Prediction_df[1].to_list(), columns =
['latitude', 'longitude'])
Prediction_coord.to_csv('C:/Users/mun/Desktop/Research/Ny
mappe/Prediction_noise.csv')
print (Actual_coord)
print (Prediction_coord)
Actual_coord.plot(kind='scatter',x='longitude',y='latitude',color='red')
plt.show()
Prediction_coord.plot(kind='scatter',x='longitude',y='latitude',
color='green')
plt.show()

Speeding up normal distribution probability mass allocation

We have N users with P avg. points per user, where each point is a single value between 0 and 1. We need to distribute the mass of each point using a normal distribution with a known density of 0.05 as the points have some uncertainty. Additionally, we need to wrap the mass around 0 and 1 such that e.g. a point at 0.95 will also allocate mass around 0. I've provided a working example below, which bins the normal distribution into D=50 bins. The example uses the Python typing module, but you can ignore that if you'd like.
from typing import List, Any
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
D = 50
BINS: List[float] = np.linspace(0, 1, D + 1).tolist()
def probability_mass(distribution: Any, x0: float, x1: float) -> float:
"""
Computes the area under the distribution, wrapping at 1.
The wrapping is done by adding the PDF at +- 1.
"""
assert x1 > x0
return (
(distribution.cdf(x1) - distribution.cdf(x0))
+ (distribution.cdf(x1 + 1) - distribution.cdf(x0 + 1))
+ (distribution.cdf(x1 - 1) - distribution.cdf(x0 - 1))
)
def point_density(x: float) -> List[float]:
distribution: Any = scipy.stats.norm(loc=x, scale=0.05)
density: List[float] = []
for i in range(D):
density.append(probability_mass(distribution, BINS[i], BINS[i + 1]))
return density
def user_density(points: List[float]) -> Any:
# Find the density of each point
density: Any = np.array([point_density(p) for p in points])
# Combine points and normalize
combined = density.sum(axis=0)
return combined / combined.sum()
if __name__ == "__main__":
# Example for one user
data: List[float] = [.05, .3, .5, .5]
density = user_density(data)
# Example for multiple users (N = 2)
print([user_density(x) for x in [[.3, .5], [.7, .7, .7, .9]]])
### NB: THE REMAINING CODE IS FOR ILLUSTRATION ONLY!
### NB: THE IMPORTANT THING IS TO COMPUTE THE DENSITY FAST!
middle: List[float] = []
for i in range(D):
middle.append((BINS[i] + BINS[i + 1]) / 2)
plt.bar(x=middle, height=density, width=1.0 / D + 0.001)
plt.xlim(0, 1)
plt.xlabel("x")
plt.ylabel("Density")
plt.show()
In this example N=1, D=50, P=4. However, we want to scale this approach to N=10000 and P=100 while being as fast as possible. It's unclear to me how we'd vectorize this approach. How do we best speed up this?
EDIT
The faster solution can have slightly different results. For instance, it could approximate the normal distribution instead of using the precise normal distribution.
EDIT2
We only care about computing density using the user_density() function. The plot is only to help explain the approach. We do not care about the plot itself :)
EDIT3
Note that P is the avg. points per user. Some users may have more and some may have less. If it helps, you can assume that we can throw away points such that all users have a max of 2 * P points. It's fine to ignore this part while benchmarking as long as the solution can handle a flexible # of points per user.
You could get below 50ms for largest case (N=10000, AVG[P]=100, D=50) by using using FFT and creating data in numpy friendly format. Otherwise it will be closer to 300 msec.
The idea is to convolve a single normal distribution centered at 0 with a series Dirac deltas.
See image below:
Using circular convolution solves two issues.
naturally deals with wrapping at the edges
can be efficiently computed with FFT and Convolution Theorem
First one must create a distribution to be copied. Function mk_bell() created a histogram of a normal distribution of stddev 0.05 centered at 0.
The distribution wraps around 1. One could use arbitrary distribution here. The spectrum of the distribution is computed are used for fast convolution.
Next a comb-like function is created. The peaks are placed at indices corresponding to peaks in user density. E.g.
peaks_location = [0.1, 0.3, 0.7]
D = 10
maps to
peak_index = (D * peak_location).astype(int) = [1, 3, 7]
dist = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0] # ones at [1, 3, 7]
You can quickly create a composition of Diract Deltas by computing indices of the bins for each peak location with help of np.bincount() function.
In order to speed things even more one can compute comb-functions for user-peaks in parallel.
Array dist is 2D-array of shape NxD. It can be linearized to 1D array of shape (N*D). After this change element on position [user_id, peak_index] will be accessible from index user_id*D + peak_index.
With numpy-friendly input format (described below) this operation is easily vectorized.
The convolution theorem says that spectrum of convolution of two signals is equal to product of spectrums of each signal.
The spectrum is compute with numpy.fft.rfft which is a variant of Fast Fourier Transfrom dedicated to real-only signals (no imaginary part).
Numpy allows to compute FFT of each row of the larger matrix with one command.
Next, the spectrum of convolution is computed by simple multiplication and use of broadcasting.
Next, the spectrum is computed back to "time" domain by Inverse Fourier Transform implemented in numpy.fft.irfft.
To use the full speed of numpy one should avoid variable size data structure and keep to fixed size arrays. I propose to represent input data as three arrays.
uids the identifier for user, integer 0..N-1
peaks, the location of the peak
mass, the mass of the peek, currently it is 1/numer-of-peaks-for-user
This representation of data allows quick vectorized processing.
Eg:
user_data = [[0.1, 0.3], [0.5]]
maps to:
uids = [0, 0, 1] # 2 points for user_data[0], one from user_data[1]
peaks = [0.1, 0.3, 0.5] # serialized user_data
mass = [0.5, 0.5, 1] # scaling factors for each peak, 0.5 means 2 peaks for user 0
The code:
import numpy as np
import matplotlib.pyplot as plt
import time
def mk_bell(D, SIGMA):
# computes normal distribution wrapped and centered at zero
x = np.linspace(0, 1, D, endpoint=False);
x = (x + 0.5) % 1 - 0.5
bell = np.exp(-0.5*np.square(x / SIGMA))
return bell / bell.sum()
def user_densities_by_fft(uids, peaks, mass, D, N=None):
bell = mk_bell(D, 0.05).astype('f4')
sbell = np.fft.rfft(bell)
if N is None:
N = uids.max() + 1
# ensure that peaks are in [0..1) internal
peaks = peaks - np.floor(peaks)
# convert peak location from 0-1 to the indices
pidx = (D * (peaks + uids)).astype('i4')
dist = np.bincount(pidx, mass, N * D).reshape(N, D)
# process all users at once with Convolution Theorem
sdist = np.fft.rfft(dist)
sdist *= sbell
res = np.fft.irfft(sdist)
return res
def generate_data(N, Pmean):
# generateor for large data
data = []
for n in range(N):
# select P uniformly from 1..2*Pmean
P = np.random.randint(2 * Pmean) + 1
# select peak locations
chunk = np.random.uniform(size=P)
data.append(chunk.tolist())
return data
def make_data_numpy_friendly(data):
uids = []
chunks = []
mass = []
for uid, peaks in enumerate(data):
uids.append(np.full(len(peaks), uid))
mass.append(np.full(len(peaks), 1 / len(peaks)))
chunks.append(peaks)
return np.hstack(uids), np.hstack(chunks), np.hstack(mass)
D = 50
# demo for simple multi-distribution
data, N = [[0, .5], [.7, .7, .7, .9], [0.05, 0.3, 0.5, 0.5]], None
uids, peaks, mass = make_data_numpy_friendly(data)
dist = user_densities_by_fft(uids, peaks, mass, D, N)
plt.plot(dist.T)
plt.show()
# the actual measurement
N = 10000
P = 100
data = generate_data(N, P)
tic = time.time()
uids, peaks, mass = make_data_numpy_friendly(data)
toc = time.time()
print(f"make_data_numpy_friendly: {toc - tic}")
tic = time.time()
dist = user_densities_by_fft(uids, peaks, mass, D, N)
toc = time.time()
print(f"user_densities_by_fft: {toc - tic}")
The results on my 4-core Haswell machine are:
make_data_numpy_friendly: 0.2733159065246582
user_densities_by_fft: 0.04064297676086426
It took 40ms to process the data. Notice that processing data to numpy friendly format takes 6 times more time than the actual computation of distributions.
Python is really slow when it comes to looping.
Therefore I strongly recommend to generate input data directly in numpy-friendly way in the first place.
There are some issues to be fixed:
precision, can be improved by using larger D and downsampling
accuracy of peak location could be further improved by widening the spikes.
performance, scipy.fft offers move variants of FFT implementation that may be faster
This would be my vectorized approach:
data = np.array([0.05, 0.3, 0.5, 0.5])
np.random.seed(31415)
# random noise
randoms = np.random.normal(0,1,(len(data), int(1e5))) * 0.05
# samples with noise
samples = data[:,None] + randoms
# wrap [0,1]
samples = (samples % 1).ravel()
# histogram
hist, bins, patches = plt.hist(samples, bins=BINS, density=True)
Output:
I was able to reduce the time from about 4 seconds per sample of 100 datapoints to about 1 ms per sample.
It looks to me like you're spending quite a lot of time simulating a very large number of normal distributions. Since you're dealing with a very large sample size anyway, you may as well just use standard normal distribution values, because it'll all just average out anyway.
I recreated your approach (BaseMethod class), then created an optimized class (OptimizedMethod class), and evaluated them using a timeit decorator. The primary difference in my approach is the following line:
# Generate a standardized set of values to add to each sample to simulate normal distribution
self.norm_vals = np.array([norm.ppf(x / norm_val_n) * 0.05 for x in range(1, norm_val_n, 1)])
This creates a generic set of datapoints based on an inverse normal cumulative distribution function that we can add to each datapoint to simulate a normal distribution around that point. Then we just reshape the data into user samples and run np.histogram on the samples.
import numpy as np
import scipy.stats
from scipy.stats import norm
import time
# timeit decorator for evaluating performance
def timeit(method):
def timed(*args, **kw):
ts = time.time()
result = method(*args, **kw)
te = time.time()
print('%r %2.2f ms' % (method.__name__, (te - ts) * 1000 ))
return result
return timed
# Define Variables
N = 10000
D = 50
P = 100
# Generate sample data
np.random.seed(0)
data = np.random.rand(N, P)
# Run OP's method for comparison
class BaseMethod:
def __init__(self, d=50):
self.d = d
self.bins = np.linspace(0, 1, d + 1).tolist()
def probability_mass(self, distribution, x0, x1):
"""
Computes the area under the distribution, wrapping at 1.
The wrapping is done by adding the PDF at +- 1.
"""
assert x1 > x0
return (
(distribution.cdf(x1) - distribution.cdf(x0))
+ (distribution.cdf(x1 + 1) - distribution.cdf(x0 + 1))
+ (distribution.cdf(x1 - 1) - distribution.cdf(x0 - 1))
)
def point_density(self, x):
distribution = scipy.stats.norm(loc=x, scale=0.05)
density = []
for i in range(self.d):
density.append(self.probability_mass(distribution, self.bins[i], self.bins[i + 1]))
return density
#timeit
def base_user_density(self, data):
n = data.shape[0]
density = np.empty((n, self.d))
for i in range(data.shape[0]):
# Find the density of each point
row_density = np.array([self.point_density(p) for p in data[i]])
# Combine points and normalize
combined = row_density.sum(axis=0)
density[i, :] = combined / combined.sum()
return density
base = BaseMethod(d=D)
# Only running base method on first 2 rows of data because it's slow
density = base.base_user_density(data[:2])
print(density[:2, :5])
class OptimizedMethod:
def __init__(self, d=50, norm_val_n=50):
self.d = d
self.norm_val_n = norm_val_n
self.bins = np.linspace(0, 1, d + 1).tolist()
# Generate a standardized set of values to add to each sample to simulate normal distribution
self.norm_vals = np.array([norm.ppf(x / norm_val_n) * 0.05 for x in range(1, norm_val_n, 1)])
#timeit
def optimized_user_density(self, data):
samples = np.empty((data.shape[0], data.shape[1], self.norm_val_n - 1))
# transform datapoints to normal distributions around datapoint
for i in range(self.norm_vals.shape[0]):
samples[:, :, i] = data + self.norm_vals[i]
samples = samples.reshape(samples.shape[0], -1)
#wrap around [0, 1]
samples = samples % 1
#loop over samples for density
density = np.empty((data.shape[0], self.d))
for i in range(samples.shape[0]):
hist, bins = np.histogram(samples[i], bins=self.bins)
density[i, :] = hist / hist.sum()
return density
om = OptimizedMethod()
#Run optimized method on first 2 rows for apples to apples comparison
density = om.optimized_user_density(data[:2])
#Run optimized method on full data
density = om.optimized_user_density(data)
print(density[:2, :5])
Running on my system, the original method took about 8.4 seconds to run on 2 rows of data, while the optimized method took 1 millisecond to run on 2 rows of data and completed 10,000 rows in 4.7 seconds. I printed the first five values of the first 2 samples for each method.
'base_user_density' 8415.03 ms
[[0.02176227 0.02278653 0.02422535 0.02597123 0.02745976]
[0.0175103 0.01638513 0.01524853 0.01432158 0.01391156]]
'optimized_user_density' 1.09 ms
'optimized_user_density' 4755.49 ms
[[0.02142857 0.02244898 0.02530612 0.02612245 0.0277551 ]
[0.01673469 0.01653061 0.01510204 0.01428571 0.01326531]]

Find time shift of two signals using cross correlation

I have two signals which are related to each other and have been captured by two different measurement devices simultaneously.
Since the two measurements are not time synchronized there is a small time delay between them which I want to calculate. Additionally, I need to know which signal is the leading one.
The following can be assumed:
no or only very less noise present
speed of the algorithm is not an issue, only accuracy and robustness
signals are captured with an high sampling rate (>10 kHz) for several seconds
expected time delay is < 0.5s
I though of using-cross correlation for that purpose.
Any suggestions how to implement that in Python are very appreciated.
Please let me know if I should provide more information in order to find the most suitable algorithmn.
A popular approach: timeshift is the lag corresponding to the maximum cross-correlation coefficient. Here is how it works with an example:
import matplotlib.pyplot as plt
from scipy import signal
import numpy as np
def lag_finder(y1, y2, sr):
n = len(y1)
corr = signal.correlate(y2, y1, mode='same') / np.sqrt(signal.correlate(y1, y1, mode='same')[int(n/2)] * signal.correlate(y2, y2, mode='same')[int(n/2)])
delay_arr = np.linspace(-0.5*n/sr, 0.5*n/sr, n)
delay = delay_arr[np.argmax(corr)]
print('y2 is ' + str(delay) + ' behind y1')
plt.figure()
plt.plot(delay_arr, corr)
plt.title('Lag: ' + str(np.round(delay, 3)) + ' s')
plt.xlabel('Lag')
plt.ylabel('Correlation coeff')
plt.show()
# Sine sample with some noise and copy to y1 and y2 with a 1-second lag
sr = 1024
y = np.linspace(0, 2*np.pi, sr)
y = np.tile(np.sin(y), 5)
y += np.random.normal(0, 5, y.shape)
y1 = y[sr:4*sr]
y2 = y[:3*sr]
lag_finder(y1, y2, sr)
In the case of noisy signals, it is common to apply band-pass filters first. In the case of harmonic noise, they can be removed by identifying and removing frequency spikes present in the frequency spectrum.
Numpy has function correlate which suits your needs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html
To complement Reveille's answer above (I reproduce his algorithm), I would like to point out some ideas for preprocessing the input signals.
Since there seems to be no fit-for-all (duration in periods, resolution, offset, noise, signal type, ...) you may play with it.
In my example the application of a window function improves the detected phase shift (within resolution of the discretization).
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
r2d = 180.0/np.pi # conversion factor RAD-to-DEG
delta_phi_true = 50.0/r2d
def detect_phase_shift(t, x, y):
'''detect phase shift between two signals from cross correlation maximum'''
N = len(t)
L = t[-1] - t[0]
cc = signal.correlate(x, y, mode="same")
i_max = np.argmax(cc)
phi_shift = np.linspace(-0.5*L, 0.5*L , N)
delta_phi = phi_shift[i_max]
print("true delta phi = {} DEG".format(delta_phi_true*r2d))
print("detected delta phi = {} DEG".format(delta_phi*r2d))
print("error = {} DEG resolution for comparison dphi = {} DEG".format((delta_phi-delta_phi_true)*r2d, dphi*r2d))
print("ratio = {}".format(delta_phi/delta_phi_true))
return delta_phi
L = np.pi*10+2 # interval length [RAD], for generality not multiple period
N = 1001 # interval division, odd number is better (center is integer)
noise_intensity = 0.0
X = 0.5 # amplitude of first signal..
Y = 2.0 # ..and second signal
phi = np.linspace(0, L, N)
dphi = phi[1] - phi[0]
'''generate signals'''
nx = noise_intensity*np.random.randn(N)*np.sqrt(dphi)
ny = noise_intensity*np.random.randn(N)*np.sqrt(dphi)
x_raw = X*np.sin(phi) + nx
y_raw = Y*np.sin(phi+delta_phi_true) + ny
'''preprocessing signals'''
x = x_raw.copy()
y = y_raw.copy()
window = signal.windows.hann(N) # Hanning window
#x -= np.mean(x) # zero mean
#y -= np.mean(y) # zero mean
#x /= np.std(x) # scale
#y /= np.std(y) # scale
x *= window # reduce effect of finite length
y *= window # reduce effect of finite length
print(" -- using raw data -- ")
delta_phi_raw = detect_phase_shift(phi, x_raw, y_raw)
print(" -- using preprocessed data -- ")
delta_phi_preprocessed = detect_phase_shift(phi, x, y)
Without noise (to be deterministic) the output is
-- using raw data --
true delta phi = 50.0 DEG
detected delta phi = 47.864788975654 DEG
...
-- using preprocessed data --
true delta phi = 50.0 DEG
detected delta phi = 49.77938053468019 DEG
...
Numpy has a useful function, called correlation_lags for this, which uses the underlying correlate function mentioned by other answers to find the time lag. The example displayed at the bottom of that page is useful:
from scipy import signal
from numpy.random import default_rng
rng = default_rng()
x = rng.standard_normal(1000)
y = np.concatenate([rng.standard_normal(100), x])
correlation = signal.correlate(x, y, mode="full")
lags = signal.correlation_lags(x.size, y.size, mode="full")
lag = lags[np.argmax(correlation)]
Then lag would be -100

how to extract frequency associated with fft values in python

I used fft function in numpy which resulted in a complex array. How to get the exact frequency values?
np.fft.fftfreq tells you the frequencies associated with the coefficients:
import numpy as np
x = np.array([1,2,1,0,1,2,1,0])
w = np.fft.fft(x)
freqs = np.fft.fftfreq(len(x))
for coef,freq in zip(w,freqs):
if coef:
print('{c:>6} * exp(2 pi i t * {f})'.format(c=coef,f=freq))
# (8+0j) * exp(2 pi i t * 0.0)
# -4j * exp(2 pi i t * 0.25)
# 4j * exp(2 pi i t * -0.25)
The OP asks how to find the frequency in Hertz.
I believe the formula is frequency (Hz) = abs(fft_freq * frame_rate).
Here is some code that demonstrates that.
First, we make a wave file at 440 Hz:
import math
import wave
import struct
if __name__ == '__main__':
# http://stackoverflow.com/questions/3637350/how-to-write-stereo-wav-files-in-python
# http://www.sonicspot.com/guide/wavefiles.html
freq = 440.0
data_size = 40000
fname = "test.wav"
frate = 11025.0
amp = 64000.0
nchannels = 1
sampwidth = 2
framerate = int(frate)
nframes = data_size
comptype = "NONE"
compname = "not compressed"
data = [math.sin(2 * math.pi * freq * (x / frate))
for x in range(data_size)]
wav_file = wave.open(fname, 'w')
wav_file.setparams(
(nchannels, sampwidth, framerate, nframes, comptype, compname))
for v in data:
wav_file.writeframes(struct.pack('h', int(v * amp / 2)))
wav_file.close()
This creates the file test.wav.
Now we read in the data, FFT it, find the coefficient with maximum power,
and find the corresponding fft frequency, and then convert to Hertz:
import wave
import struct
import numpy as np
if __name__ == '__main__':
data_size = 40000
fname = "test.wav"
frate = 11025.0
wav_file = wave.open(fname, 'r')
data = wav_file.readframes(data_size)
wav_file.close()
data = struct.unpack('{n}h'.format(n=data_size), data)
data = np.array(data)
w = np.fft.fft(data)
freqs = np.fft.fftfreq(len(w))
print(freqs.min(), freqs.max())
# (-0.5, 0.499975)
# Find the peak in the coefficients
idx = np.argmax(np.abs(w))
freq = freqs[idx]
freq_in_hertz = abs(freq * frate)
print(freq_in_hertz)
# 439.8975
Here we deal with the Numpy implementation of the fft.
Frequencies associated with DFT values (in python)
By fft, Fast Fourier Transform, we understand a member of a large family of algorithms that enable the fast computation of the DFT, Discrete Fourier Transform, of an equisampled signal.
A DFT converts an ordered sequence of N complex numbers to an ordered sequence of N complex numbers, with the understanding that both sequences are periodic with period N.
In many cases, you think of
a signal x, defined in the time domain, of length N, sampled at a constant interval dt,¹
its DFT X (here specifically X = np.fft.fft(x)), whose elements are sampled on the frequency axis with a sample rate dω.
Some definition
the period (aka duration²) of the signal x, sampled at dt with N samples is is
T = dt*N
the fundamental frequencies (in Hz and in rad/s) of X, your DFT are
df = 1/T
dω = 2*pi/T # =df*2*pi
the top frequency is the Nyquist frequency
ny = dω*N/2
(NB: the Nyquist frequency is not dω*N)³
The frequencies associated with a particular element in the DFT
The frequencies corresponding to the elements in X = np.fft.fft(x) for a given index 0<=n<N can be computed as follows:
def rad_on_s(n, N, dω):
return dω*n if n<N/2 else dω*(n-N)
or in a single sweep
ω = np.array([dω*n if n<N/2 else dω*(n-N) for n in range(N)])
if you prefer to consider frequencies in Hz, s/ω/f/
f = np.array([df*n if n<N/2 else df*(n-N) for n in range(N)])
Using those frequencies
If you want to modify the original signal x -> y applying an operator in the frequency domain in the form of a function of frequency only, the way to go is computing the ω's and
Y = X*f(ω)
y = ifft(Y)
Introducing np.fft.fftfreq
Of course numpy has a convenience function np.fft.fftfreq that returns dimensionless frequencies rather than dimensional ones but it's as easy as
f = np.fft.fftfreq(N)*N*df
ω = np.fft.fftfreq(N)*N*dω
Because df = 1/T and T = N/sps (sps being the number of samples per second) one can also write
f = np.fft.fftfreq(N)*sps
Notes
Dual to the sampling interval dt there is the sampling rate, sr, or how many samples are taken during a unit of time; of course dt=1/sr and sr=1/dt.
Speaking of a duration, even if it is rather common, hides the
fundamental idea of periodicity.
The concept of Nyquist frequency is clearly exposed in any textbook dealing with the analysis of time signals, and also in the linked Wikipedia article. Does it suffice to say that information cannot be created?
The frequency is just the index of the array. At index n, the frequency is 2πn / the array's length (radians per unit). Consider:
>>> numpy.fft.fft([1,2,1,0,1,2,1,0])
array([ 8.+0.j, 0.+0.j, 0.-4.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+4.j,
0.+0.j])
the result has nonzero values at indices 0, 2 and 6. There are 8 elements. This means
2πit/8 × 0 2πit/8 × 2 2πit/8 × 6
8 e - 4i e + 4i e
y ~ ———————————————————————————————————————————————
8

Categories