Time aligning signals of different sampling rates with missing values - python

I am working with signals from two different sensors with different sampling rates, one at 10 Hz and one at 1 Hz. I want to time-align these two signals, as their timing differs slightly (on the order of seconds). Also, there were chunks of values missing at random intervals from the sensor with the 1 Hz sampling rate.
I am purely from a CS background and have never worked on DSP before. I would highly appreciate it if you could point me in the right direction.

You are trying to estimate an unknown function using samples taken at a 1 Hz rate.
The simplest approach is just to use the previous sample as the estimated value.
First-order linear interpolation draws a line through the two previous known values and uses the points on that line as the estimated values. Whenever you get a new sample, you replace the second value with the first, and the first with the new value. Keep track of the sample times as well so that you can track properly across missing values.
If your samples are (t1, v1) and (t2, v2) for times and values, and the current time is t, then the estimated value will be
e = v1 + ((v2 - v1) * (t - t1) / ( t2 - t1))
Note that for t = t1 this evaluates to v1, and for t = t2 this evaluates to v2.
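As an illustration of this rule, here is a minimal sketch using NumPy; the array names (t_10hz, t_1hz, v_1hz) are hypothetical stand-ins for the two sensors, and np.interp applies exactly this first-order linear formula between whichever 1 Hz samples bracket each 10 Hz timestamp, so gaps in the 1 Hz stream are simply interpolated across:
import numpy as np

# Hypothetical example data: a 10 Hz time base and an irregular 1 Hz sensor
# with a chunk of samples missing (t = 3 s and t = 4 s are absent).
t_10hz = np.arange(0.0, 6.0, 0.1)
t_1hz = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
v_1hz = np.array([1.0, 1.2, 0.9, 1.5, 1.4])

# For each t, this evaluates e = v1 + (v2 - v1) * (t - t1) / (t2 - t1)
# using the two known samples that bracket t.
v_aligned = np.interp(t_10hz, t_1hz, v_1hz)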

Related

Most efficient and/or simplest method of interpolating multiple numpy timeseries arrays into one array?

I have three numpy arrays where one column is the time stamp (Unix time to the millisecond as an integer) and the other column is a reading from a sensor (integer). Each of these three arrays occurs simultaneously in time (i.e., the span of the time column is roughly the same); however, they are sampled at different frequencies (one at 500 Hz, the others at 125 Hz). The final array should be (n,4) with columns [time, array1, array2, array3].
500.0 Hz Example (only the head, these are multiple minutes long)
array([[1463505325032, 196],
[1463505325034, 197],
[1463505325036, 197],
[1463505325038, 195]])
125.0 Hz Example (only the head, these are multiple minutes long)
array([[1463505287912, -5796],
[1463505287920, -5858],
[1463505287928, -5920],
[1463505287936, -5968]])
Currently, my initial plan has been as follows but performance isn't amazing:
Find the earliest start time (b/c of different frequencies and system issues, they do not exactly all start at the same millisecond)
Create a new array with a time column that starts at the earliest time and runs as long as the longest of the three arrays. Fill the time column to the desired common frequency using np.linspace/np.arange
Loop over the three arrays, using np.interp or similar to convert to common frequency, and then stack the output onto the common numpy array created above
I have tens of thousands of these intervals and they can be multiple days long, so hoping for something that is reasonably quick and memory efficient. Thank you!
You'll have to interpolate the 125 Hz signal to get 500 Hz. It depends on what quality of interpolation you need. For linear interpolation, scipy.interpolate.interp1d in linear-interpolation mode is a bit slow, O(log n) for n data points, because it does a bisection search for every evaluation. The calculation time explodes if you ask it to do a smooth interpolation on a large dataset, because that involves solving a system of 3n equations with 3n unknowns.
If your sampling rates have an integer ratio (1:4 in your example), you can do linear interpolation more efficiently like this:
# interpolate a125 to a500
n = len(a125)
a500 = np.zeros((n-1)*4+1)
a500[0::4] = a125
a500[1::4] = 0.75*a125[:-1] + 0.25*a125[1:]
a500[2::4] = 0.5*a125[:-1] + 0.5*a125[1:]
a500[3::4] = 0.25*a125[:-1] + 0.75*a125[1:]
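If a matching 500 Hz time column is also needed and the 125 Hz timestamps are uniformly spaced (8 ms apart, i.e. 2 ms at 500 Hz), a small sketch, assuming t125 holds the millisecond timestamps of the 125 Hz array:
# Hypothetical t125: the millisecond timestamps of the 125 Hz array.
t500 = t125[0] + 2.0 * np.arange((n - 1) * 4 + 1)   # 2 ms steps at 500 Hz
a500_with_time = np.column_stack([t500, a500])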
If you need smooth interpolation, use scipy.signal.resample. Since this is a Fourier method, it requires careful handling of the end points of your time series; you need to pad the series with data that makes a gradual transition from the end point back to the start point:
from scipy.signal import resample
m = n//8
padding = np.linspace(a125[-1], a125[0], m)
a125_pad = np.concatenate([a125, padding])
a500b = resample(a125_pad, (n+m)*4)[:4*n]
Depending on the nature of your data, it might be better to have a continuous derivative at the end points.
Note that the FFT that is used for the resampling likes to have an array size that is a product of small prime numbers (2, 3, 5, 7). Choose the padding size (m) wisely.
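If the sampling rates were not an exact integer ratio, or the streams started at different times as described in the question, plain linear interpolation onto a shared time base (essentially the plan outlined in the question) is usually a reasonable baseline. A minimal sketch, where array1, array2, array3 stand in for the three (n, 2) [time_ms, value] arrays:
import numpy as np

def align_on_common_base(arrays, fs_out=500.0):
    # Linearly interpolate several (n, 2) [time_ms, value] arrays onto one time base.
    t_start = min(arr[0, 0] for arr in arrays)     # earliest start time
    t_end = max(arr[-1, 0] for arr in arrays)      # latest end time
    step_ms = 1000.0 / fs_out                      # output sample spacing in ms
    t_common = np.arange(t_start, t_end + step_ms, step_ms)
    cols = [t_common]
    for arr in arrays:
        cols.append(np.interp(t_common, arr[:, 0], arr[:, 1]))
    return np.column_stack(cols)                   # columns: [time, s1, s2, ...]

# merged = align_on_common_base([array1, array2, array3])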

Generating large simulations and inserting the same array multiple times into another array at different locations

I am working to generate a Monte Carlo simulation for oil wells. The end goal is to have all the wells with a smoothed probabilistic production curve. I have optimized what I can, but each of the 3 apply statements I am listing takes a very long time (hours) when I use my full dataset and the number of simulations I want. The code I included has 10 iterations. If you crank it up to 10,000, which is the goal, it really starts to drag.
I have generated a pandas DataFrame that has all the future wells I want to model, with a probability of each well being chosen next to be drilled.
I then created a DataFrame where I grouped everything into the categories I want to use to figure out the order in which the model will choose the wells. So my "timing" DataFrame contains my categories, an array of the index of every well in each category, and an array of the wells' probabilities.
This all is done in a few seconds. The next part works, but gets very slow.
Next I use a numpy generator choice with percentages to randomly generate the order of the wells for i simulations. As other posts have noted, @njit does not work with the probability array. The result is that one dimension of the array is the order in which the wells will be chosen by each category, and the other dimension is each simulation. There are about 150 categories, and 10,000s of wells in each category. I am hoping to run 10,000 simulations.
a is an array of indexes of wells that can be chosen
size is the length of that array
per is the probability that each well will get chosen
Next I link my timing DataFrame to my DataFrame with all of the wells in it. This attaches the previous array to the wells array. Then I search this array for the well index to figure out, for each simulation, when that specific well is going to get run. This generates a 1d array with the order in which that well is going to be drilled in each simulation.
This function gets called on 100,000s of wells and as I increase the number of simulations it really slows down.
order is an array of the order each well is drilled per simulation
index is the index of that well
The final difficulty I am having is averaging out the production curve for the wells. I have how much oil will be produced by each well per month. I need to insert that curve into the array at each point when the well is drilled, then average all of those values together to get the average production of the well given all the simulations.
I have also tried creating an np.zeros array and then using the np.insert function, but I could not figure out how to insert an array multiple times without a loop, and generating the initial array of 0's took longer than the current method I had. (I overcame inserting the array multiple times by converting everything to a string, inserting the type curve as a string, and then converting back to an array of numbers, but this did not seem efficient.) I need the number of leading 0's to match the month in which the well is drilled.
order is the time in months that each well will get drilled
curve is the production curve passed as a list
m is the highest value of the months that the well is drilled in all simulations
import numpy as np
from numba import njit
import datetime
import math

def TimingGenerator(a, size, p):
    i = 10
    g = np.random.Generator(np.random.PCG64())
    order = np.concatenate([g.choice(a=a, size=size, replace=False, p=p) for z in range(i)]).reshape(i, size)
    return order

#njit
def OrderGenerator(order, index):
    result = np.where(order == index)[1]
    return result

def CurveAverager(order, curve, m):
    matrix = np.array([[0] * math.ceil(i) + curve + [0] * int((m - math.ceil(i))) for i in order])
    result = np.mean(matrix, axis=0)
    return result

begin_time = datetime.datetime.now()
size = 8000
g = np.random.Generator(np.random.PCG64())
a = g.choice(20_000, size=size, replace=False)
p = np.random.randint(1, 100, size=size)
p = p/np.sum(p)

for i in range(150):
    q = TimingGenerator(a, size, p)
print(datetime.datetime.now() - begin_time)

index = np.amin(q)
for i in range(100000):
    order = OrderGenerator(q, index)
print(datetime.datetime.now() - begin_time)

order = order / 15
curve = list(range(600, 0, -1))
for i in range(20000):
    avgcurve = CurveAverager(order, curve, size)
print(datetime.datetime.now() - begin_time)
Thanks for any help you can offer. I am willing to greatly alter my code if you can think of anything to help speed it up. Not sure if there is a better way to apply probabilities and smooth out the production curve which is really the end goal.
Cheers.
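For the curve-averaging step specifically, one possible alternative to building the per-simulation matrix of padded lists is to accumulate each shifted curve into a single buffer with np.add.at and divide at the end. A sketch (not the poster's code), using the same meanings of order, curve and m as above:
import numpy as np

def average_curve(order, curve, m):
    # order: drill month per simulation (possibly fractional), curve: monthly
    # production values, m: highest drill month across all simulations.
    months = np.ceil(np.asarray(order)).astype(int)
    curve = np.asarray(curve, dtype=float)
    total = np.zeros(m + len(curve))
    # Target positions of each curve sample, one row per simulation.
    idx = months[:, None] + np.arange(len(curve))[None, :]
    np.add.at(total, idx, curve)      # accumulate without materialising all the rows
    return total / len(months)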

power spectral density-scipy.signal

While trying to compute the power spectral density with an acquisition rate of 300000 Hz using ... signal.periodogram(x, fs, nfft=4096), I get the graph up to 150000 Hz and not up to 300000 Hz. Why is this only up to half the value? What is the meaning of the sampling rate here?
In the example given in the scipy documentation, the sampling rate is 10000 Hz but we see the plot only up to 5000 Hz.
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.signal.periodogram.html
The spectrum of a real-valued signal is always symmetric with respect to the Nyquist frequency (half of the sampling rate). As a result, there is often no need to store or plot the redundant symmetric portion of the spectrum.
If you still want to see the whole spectrum, you can set the return_onesided argument to False as follows:
f, Pxx_den = signal.periodogram(x, fs, return_onesided=False)
The resulting plot of the same example provided in the scipy.signal.periodogram documentation would then cover a 10000 Hz frequency range, as would be expected:
If you check the length of f in the example:
>>> len(f)
50001
That is 50001 points, not 50000. This is because scipy.signal.periodogram calls scipy.signal.welch with the parameter nperseg=x.shape[-1] by default. This is the correct input for scipy.signal.welch. However, if you dig into the source and look at lines 328-329 (as of now), you'll see the reason why the size of the output is 50001.
if nfft % 2 == 0:  # even
    outshape[-1] = nfft // 2 + 1
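To see both behaviours side by side, a short check (a sketch using random data in place of the documentation's example signal):
import numpy as np
from scipy import signal

fs = 10_000                    # sampling rate from the scipy example
x = np.random.randn(100_000)   # placeholder signal (the docs use a noisy sine)

f1, Pxx1 = signal.periodogram(x, fs)                          # one-sided (default)
print(len(f1), f1[-1])         # 50001 points, top frequency fs/2 = 5000.0 Hz

f2, Pxx2 = signal.periodogram(x, fs, return_onesided=False)   # two-sided
print(len(f2))                 # 100000 points covering the full fs range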

Calculating thd in python

I'm trying to calculate the total harmonic distortion values of the AC voltage supplied. I am sampling voltage data using an Arduino at over an 8 kHz rate and storing those data in a text file. Then I'm trying to calculate THD using the following code snippet written in Python:
import numpy as np
import scipy.fftpack
from scipy.fftpack import fft
from numpy import genfromtxt
sampled_data = genfromtxt('/../file.txt',delimiter=',')
abs_yf=np.abs(fft(sampled_data))
#As far as I know, THD = sqrt(sum of squared magnitudes of
#harmonics + noise) / fundamental value (is it correct?). So I'm
#just summing up the squares of all frequency data obtained from the FFT,
#taking sqrt() of that and dividing by the fundamental frequency value.
def thd(abs_data):
    sq_sum = 0.0
    for r in range(len(abs_data)):
        sq_sum = sq_sum + (abs_data[r])**2
    sq_harmonics = sq_sum - (max(abs_data))**2.0
    thd = 100*sq_harmonics**0.5 / max(abs_data)
    return thd
print "Total Harmonic Distortion(in percent):"
print thd(abs_yf)
The problem is, the obtained THD values vary between 5% and 25% in my case (in reality it's actually not more than 5%). What am I doing wrong? Is there any other way to find the THD?
Though this has long been quiet, for anyone encountering this post like me: there are a couple of problems with the OP's method.
1) The magnitudes returned by the FFT include the magnitude of the 0 Hz (DC) bin, so the assumption that max(abs_data) is the magnitude corresponding to the fundamental frequency is not correct if there is any DC bias in the signal. This is a problem in the line
thd = 100*sq_harmonics**0.5 / max(abs_data)
The amplitude associated with the 0 Hz bin can just be ignored as a quick solution.
2) The second half of abs_data should be thrown out; it is a "mirrored" reflection of the first half. This is due to the nature of the Fourier transform.
Both these issues can be addressed by changing the input to the function, i.e. by replacing
print thd(abs_yf)
with
print( thd(abs_yf[1:int(len(abs_yf)/2) ]) )
where we have changed the input so that it does not include the first element or the last N/2 elements.
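Putting both fixes together, a compact version of the same calculation might look like the sketch below; it is not the original poster's code and it still assumes the fundamental is the largest non-DC bin:
import numpy as np

def thd_percent(samples):
    # THD (%) = sqrt(sum of squared harmonic/noise magnitudes) / fundamental magnitude
    spectrum = np.abs(np.fft.rfft(samples))[1:]   # one-sided spectrum, DC bin dropped
    fundamental = spectrum.max()
    harmonics_sq = np.sum(spectrum**2) - fundamental**2
    return 100.0 * np.sqrt(harmonics_sq) / fundamental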
The result is still not ideal because the window needs to span exactly an integer number of cycles, as the previous answers noted above. Testing with a pure sine with an offset and adjusting the window demonstrates that the method works fairly well but fails badly if there are significant windowing errors.
import numpy as np
from numpy import pi
from scipy.fftpack import fft

t0 = 0
tf = 0.02  # integer number of cycles
dt = 1e-4
offset = 0.5
N = int((tf-t0)/dt)
time = np.linspace(0.0, tf, N)
commandSigFreq = 100
Amplitude = 2
waveOfSin = Amplitude*np.sin(2.0*pi*commandSigFreq*time) + offset
abs_yf = np.abs(fft(waveOfSin))
#print("freq is " + str(scipy.fftpack.fftfreq(sampled_data, dt)))
#As far as I know, THD = sqrt(sum of squared magnitudes of
#harmonics + noise) / fundamental value (is it correct?). So I'm
#just summing up the squares of all frequency data obtained from the FFT,
#taking sqrt() of that and dividing by the fundamental frequency value.
def thd(abs_data):
    sq_sum = 0.0
    for r in range(len(abs_data)):
        sq_sum = sq_sum + (abs_data[r])**2
    sq_harmonics = sq_sum - (max(abs_data))**2.0
    thd = 100*sq_harmonics**0.5 / max(abs_data)
    return thd
print("Total Harmonic Distortion(in percent):")
print(thd(abs_yf[1:int(len(abs_yf)/2) ]))
It is quite likely that you add additional distortion through the measurement process itself.
If you compare an Arduino ADC with a high-end measurement device, the Arduino's values will very likely be much worse. At the very least you need a very stable and jitter-free clock.
Furthermore, the output of the data (I guess via UART) might interfere with the timing of the ADC measurement.

Integration in Fourier or time domain

I'm struggling to understand a problem with numerical integration of a signal. Basically, I have a signal which I would like to integrate, i.e. compute its antiderivative as a function of time (integration of a pick-up coil signal to get the magnetic field). I've tried two different methods which in principle should be consistent, but they are not. The code I'm using is the following. Beware that the signal y in the code has previously been high-pass filtered using Butterworth filtering (similar to what is done here: http://wiki.scipy.org/Cookbook/ButterworthBandpass). The signal and time base can be downloaded here (https://www.dropbox.com/s/fi5z38sae6j5410/trial.npz?dl=0).
import numpy as np
import scipy as sp
from scipy import integrate
from scipy import fftpack

data = np.load('trial.npz')
y = data['arr_1']  # this is the signal
t = data['arr_0']
# integration in the Fourier domain using fftpack
bI = sp.fftpack.diff(y-y.mean(), order=-1)
# integration in the time domain using the trapezoidal rule
bI2 = sp.integrate.cumtrapz(y-y.mean(), x=t)
Now the two signals (apart from a possibly different linear trend, which can be removed) are different; or rather, dynamically they are quite similar, with the same oscillation periods, but there is a factor of approximately 30 between the two signals, in the sense that bI2 is roughly 30 times lower than bI. By the way, I've subtracted the mean from both signals to be sure that they are zero-mean, and performing the integration in IDL (both with the equivalent of cumtrapz and in the Fourier domain) gives values compatible with bI2. Any clue is really welcome.
It's difficult to know what scipy.fftpack.diff() is doing under the bonnet.
To try and solve your problem, I have dug up an old frequency domain integration function that I wrote a while ago. It's worth pointing out that in practice, one generally wants a bit more control over some of the parameters than scipy.fftpack.diff() gives you. For example, the f_lo and f_hi parameters of my intf() function allow you to band-limit the input to exclude very low or very high frequencies which may be noisy. Noisy low frequencies in particular can 'blow up' during integration and overwhelm the signal. You may also want to use a window at the start and end of the time series to stop spectral leakage.
I have calculated bI2 and also a result, bI3, integrated once with intf() using the following code (I assumed an average sampling rate for simplicity):
import numpy as np
import scipy as sp
from scipy import integrate
import intf

data = np.load(path)
y = data['arr_1']
t = data['arr_0']
bI2 = sp.integrate.cumtrapz(y-y.mean(), x=t)
bI3 = intf.intf(y-y.mean(), fs=500458, f_lo=1, winlen=1e-2, times=1)
I plotted bI2 and bI3:
The two time series are of the same order of magnitude, and broadly the same shape, notwithstanding the piecewise linear trend apparent in bI2. I know this doesn't explain what's going on in the scipy function, but at least this shows it's not a problem with the frequency domain method.
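As an aside, the average sampling rate assumed in the intf() call above can be estimated directly from the time stamps (a small sketch, assuming t is the time array loaded earlier):
import numpy as np

fs_est = 1.0 / np.mean(np.diff(t))   # average sampling rate (Hz) from the time base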
The code for intf is pasted in full below.
import numpy as np

# EPS is used below to avoid division by zero; it is not defined in the original
# post, so a machine-epsilon value is assumed here.
EPS = np.finfo(float).eps

def intf(a, fs, f_lo=0.0, f_hi=1.0e12, times=1, winlen=1, unwin=False):
    """
    Numerically integrate a time series in the frequency domain.

    This function integrates a time series in the frequency domain using
    'Omega Arithmetic', over a defined frequency band.

    Parameters
    ----------
    a : array_like
        Input time series.
    fs : int
        Sampling rate (Hz) of the input time series.
    f_lo : float, optional
        Lower frequency bound over which integration takes place.
        Defaults to 0 Hz.
    f_hi : float, optional
        Upper frequency bound over which integration takes place.
        Defaults to the Nyquist frequency ( = fs / 2).
    times : int, optional
        Number of times to integrate input time series a. Can be either
        0, 1 or 2. If 0 is used, function effectively applies a 'brick wall'
        frequency domain filter to a.
        Defaults to 1.
    winlen : int, optional
        Number of seconds at the beginning and end of a file to apply half a
        Hanning window to. Limited to half the record length.
        Defaults to 1 second.
    unwin : Boolean, optional
        Whether or not to remove the window applied to the input time series
        from the output time series.

    Returns
    -------
    out : complex ndarray
        The zero-, single- or double-integrated acceleration time series.

    Versions
    ----------
    1.1 First development version.
        Uses rfft to avoid complex return values.
        Checks for even length time series; if not, end-pad with single zero.
    1.2 Zero-means time series to avoid spurious errors when applying Hanning
        window.
    """
    a = a - a.mean()                        # Convert time series to zero-mean
    if np.mod(a.size, 2) != 0:              # Check for even length time series
        odd = True
        a = np.append(a, 0)                 # If not, append zero to array
    else:
        odd = False
    f_hi = min(fs/2, f_hi)                  # Upper frequency limited to Nyquist
    winlen = min(a.size/2, winlen)          # Limit window to half record length
    ni = a.size                             # No. of points in data (int)
    nf = float(ni)                          # No. of points in data (float)
    fs = float(fs)                          # Sampling rate (Hz)
    df = fs/nf                              # Frequency increment in FFT
    stf_i = int(f_lo/df)                    # Index of lower frequency bound
    enf_i = int(f_hi/df)                    # Index of upper frequency bound
    window = np.ones(ni)                    # Create window function
    es = int(winlen*fs)                     # No. of samples to window from ends
    edge_win = np.hanning(es)               # Hanning window edge
    window[:es//2] = edge_win[:es//2]
    window[-es//2:] = edge_win[-es//2:]
    a_w = a*window
    FFTspec_a = np.fft.rfft(a_w)            # Calculate complex FFT of input
    FFTfreq = np.fft.fftfreq(ni, d=1/fs)[:ni//2+1]
    w = (2*np.pi*FFTfreq)                   # Omega
    iw = (0+1j)*w                           # i*Omega
    mask = np.zeros(ni//2+1)                # Half-length mask for +ve freqs
    mask[stf_i:enf_i] = 1.0                 # Mask = 1 for desired +ve freqs
    if times == 2:                          # Double integration
        FFTspec = -FFTspec_a*w / (w+EPS)**3
    elif times == 1:                        # Single integration
        FFTspec = FFTspec_a*iw / (iw+EPS)**2
    elif times == 0:                        # No integration
        FFTspec = FFTspec_a
    else:
        print('Error')
    FFTspec *= mask                         # Select frequencies to use
    out_w = np.fft.irfft(FFTspec)           # Return to time domain
    if unwin == True:
        out = out_w*window/(window+EPS)**2  # Remove window from time series
    else:
        out = out_w
    if odd == True:                         # Check for even length time series
        return out[:-1]                     # If not, remove last entry
    else:
        return out
