Head-related impulse response for binaural audio - Python

I am working with audio digital signal processing and binaural audio processing.
I am still learning the basics.
Right now, the idea is to do deconvolution and get an impulse response.
Please see the attached screenshot
Detailed description of what is happening:
Here, an exponential sweep signal is played back through a loudspeaker and the playback is recorded with a microphone. The recorded signal is extended by zero padding (probably to double the original length), and the original exponential sweep signal is extended as well. FFTs are taken of both (the extended recording and the extended original), the two FFTs are divided, and we get the room transfer function. Finally, an inverse FFT is taken and some windowing is performed to get the impulse response.
My question:
I am having difficulty implementing this diagram in Python. How would you divide two FFTs? Is it possible? I can probably do all the steps like zero padding and FFTs, but I guess I am not going about it the correct way. I also do not understand the windowing and "discard second half" step.
Can anyone show me how I would implement this in Python with a sweep signal? Just a small example with a few plots would also help to get an idea. Please help.
Source of this image: http://www.four-audio.com/data/MF/aes-swp-english.pdf
Thanks in advance,
Sanket Jain

Yes, dividing two FFT spectra is possible and actually quite easy to implement in Python (but with some caveats).
Simply said: as convolution of two time signals corresponds to multiplication of their spectra, deconvolution can conversely be realized by dividing the spectra.
Here is an example for a simple deconvolution with numpy:
(x is your excitation sweep signal and y is the recorded sweep signal, from which you want to obtain the impulse response.)
import numpy as np
from numpy.fft import rfft, irfft
# define length of FFT (zero padding): at least double length of input
input_length = np.size(x)
n = np.ceil(np.log2(input_length)) + 1
N_fft = int(pow(2, n))
# transform
# real fft: N real input -> N/2+1 complex output (single sided spectrum)
# real ifft: N/2+1 complex input -> N real output
X_f = rfft(x, N_fft)
Y_f = rfft(y, N_fft)
# deconvolve
H = Y_f / X_f
# backward transform
h = irfft(H, N_fft)
# truncate to original length
h = h[:input_length]
This simple solution is a practical one but can (and should) be improved. A problem is that you will get a boost of the noise floor at those frequencies where X_f has a low amplitude. For example, if your exponential sine sweep starts at 100 Hz, the frequency bins below that frequency give a division by (almost) zero. One simple possible solution is to first invert X_f, apply a bandlimiting filter (highpass + lowpass) to remove the "boost areas", and then multiply it with Y_f:
# deconvolve
Xinv_f = 1 / X_f
Xinv_f = Xinv_f * bandlimit_filter
H = Y_f * Xinv_f
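(bandlimit_filter above is not defined anywhere - one possible way to build it is as a simple mask over the rfft bins. This is only a sketch: the sample rate and band edges are assumptions you would replace with your own values, and in practice a smooth fade at the edges is nicer than a hard brick wall.)
import numpy as np
from numpy.fft import rfftfreq
fs = 48000                         # assumed sample rate of x and y
f_low, f_high = 100.0, 16000.0     # assumed sweep band (example values)
freqs = rfftfreq(N_fft, d=1.0/fs)  # frequencies of the single-sided spectrum bins
# 1.0 inside the sweep band, 0.0 outside (add a small epsilon to X_f before inverting if it has exact zeros)
bandlimit_filter = ((freqs >= f_low) & (freqs <= f_high)).astype(float)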
Regarding the distortion:
A nice property of the exponential sine sweep is that harmonic distortion produced during the measurement (e.g. by nonlinearities in the loudspeaker) will show up as smaller "side" responses before the "main" response after deconvolution (see this for more details). These side responses are the distortion products and can simply be removed with a time window. If there is no delay of the "main" response (it starts at t=0), those side responses will appear at the end of the whole iFFT, so you remove them by windowing out the second half.
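In code, continuing the snippet from above (only a sketch - the fade length is an arbitrary choice, and h and N_fft are the variables defined earlier):
# start again from the full-length inverse transform (before the truncation step)
h_full = irfft(H, N_fft)
h_lin = h_full[:N_fft // 2].copy()   # keep the first half: the linear impulse response
# optional: a short fade-out instead of a hard cut, to avoid an abrupt edge
fade = 256                           # assumed fade length in samples
h_lin[-fade:] *= np.hanning(2 * fade)[fade:]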
I cannot guarantee that this is 100% correct from a signal-theory point of view, but I think it shows the point and it works ;)

This is a little over my head, but maybe the following bits of advice can help.
First, I came across a very helpful collection of sample code presented in Steve Smith's book The Scientist and Engineer's Guide to Digital Signal Processing. This includes a range of operations, from the basics of convolution to the FFT algorithm itself. The sample code is in BASIC, not Python, but the BASIC is perfectly readable and should be easy to translate.
I'm not entirely sure about the specific calculation you describe, but many operations in this realm (when dealing with multiple signals) turn out to simply employ addition or subtraction of constituent elements. To get an authoritative answer, I think you will have better luck at the Signal Processing Stack Exchange or at one of the forums at DSP Related.
If you do get an answer elsewhere, it might be good to either recap it here or delete this question entirely to reduce clutter.


Using A-Weighting on time signal

I'm trying to solve this for a couple of weeks now, but it seems like I'm not able to wrap my head around it. The task is pretty simple: I'm getting a signal in volts from a microphone, and in the end I want to know how loud it is out there in dB(A).
There are so many problems I don't even know where to start. Let's begin with my idea.
I convert the voltage signal into a signal in pascal [Pa].
I use an FFT on that signal so I know which frequencies I'm dealing with.
Then somehow I should apply the A-weighting to that, but since I'm handling my values in [Pa] I can't just multiply or add my A-weighting.
I do an iFFT to get back to my time signal.
I go from Pa to dB.
I calculate the RMS and I'm done. (Hopefully)
The main problem is the A-weighting. I really don't get how I can implement it on a live signal. And since the FFT gives complex values, I'm also a little confused by that.
Maybe you get the idea/problem/workflow and can help me get at least a little bit closer to the goal.
A little disclaimer: I am 100% new to the world of acoustics, so please explain it like you would explain it to a little child :D - and I'm programming in Python.
Thanks in advance for your time!
To give you a short answer: this task can be done in only a few steps, utilizing the waveform_analysis package and Parseval's theorem.
The simplest implementation I can come up with is:
Time-domain A-weighting filtering of the signal, using this library:
import waveform_analysis
weighted_signal = waveform_analysis.A_weight(signal, fs)
Take the RMS of the signal (utilizing the fact that the power in the time domain equals the power in the frequency domain - Parseval's theorem):
import numpy as np
rms_value = np.sqrt(np.mean(np.abs(weighted_signal)**2))
Convert this amplitude to dB -
result = 20 * np.log10(rms_value)
This gives you the results in dB(A)FS, if you run these three snippets together.
To get the dB(A)Pa value, you need to know what 0 dBPa corresponds to in dBFS. This is usually done by having a calibrated source such as https://www.grasacoustics.com/products/calibration-equipment/product/756-42ag
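As a rough sketch of that calibration step (assuming cal_signal is a recording of the 1 Pa / 94 dB SPL calibrator made with exactly the same gain settings - the variable names here are just placeholders):
cal_rms = np.sqrt(np.mean(np.abs(cal_signal)**2))
cal_dbfs = 20 * np.log10(cal_rms)        # digital level corresponding to 1 Pa (0 dBPa)
result_dba_pa = result - cal_dbfs        # dB(A) re 1 Pa
result_dba_spl = result_dba_pa + 94.0    # dB(A) SPL, since 1 Pa corresponds to 94 dB SPL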
One flaw of this implementation is that the time signal is not pre-windowed. This is, on the other hand, not an issue for sufficiently long signals.
Hello Christian Fichtelberg and welcome to Stack Overflow. I believe your question could be answered more easily on DSP StackExchange, but I will try to provide a quick and dirty answer.
To avoid taking the signal to the frequency domain and doing the multiplication there (remember that convolution in the time domain - where your signal "resides" - is equivalent to multiplication in the frequency domain; if unfamiliar with this, please have a quick look at Wikipedia's convolution page), you can implement the A-weighting filter in the time domain and perform some kind of convolution there.
I won't go into the details of the possible pros and cons of each method (time-domain convolution vs frequency domain multiplication). You could have a search on DSP SE or look into some textbook on DSP (such as Oppenheim's Digital Signal Processing, or an equivalent book by Proakis and Manolakis).
According to IEC 61672-1:2013 the digital filter should be "translated" from the analogue filter (a good way to do so is to use the bilinear transform). The proposed filter is a quite "simple" IIR (Infinite Impulse Response) filter.
I will skip the implementation here as it has been provided by others. Please find a MATLAB implementation, a Python implementation (most probably what you are seeking for your application), and a quite "high-level" answer on DSP SE with some links, as well as information on designing filters for arbitrary sample rates, also on DSP SE.
Finally, I would like to mention that if you manage to create a ("smooth enough") polynomial approximation to the curve of the A-weighting filter, you could possibly perform a frequency-domain multiplication of the signal's spectrum and the polynomial to change only the magnitude of the spectrum, and then perform an iFFT to go back to the time domain. This should most probably provide an approximation to the A-weighted signal. Please note that this is NOT the correct way to do filtering, so treat it with caution (if you decide to try it at all) and only as a quick solution to perform some checks.
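To make that last idea a bit more concrete, here is a rough sketch of such magnitude-only spectral weighting (with the caveats above - this is not proper filtering). Instead of a polynomial fit it evaluates the standard A-weighting magnitude formula from IEC 61672 directly:
import numpy as np

def a_weighting_magnitude(f):
    # linear (not dB) A-weighting magnitude response for frequencies f in Hz
    f = np.asarray(f, dtype=float)
    ra = (12194.0**2 * f**4) / ((f**2 + 20.6**2)
                                * np.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
                                * (f**2 + 12194.0**2))
    return ra * 10 ** (2.0 / 20.0)   # +2.0 dB normalisation so that A(1 kHz) = 0 dB

def a_weight_spectrally(x, fs):
    # magnitude-only A-weighting: scale the rfft bins and transform back
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.fft.irfft(X * a_weighting_magnitude(freqs), n=len(x))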

Why divide the output of NumPy FFT by N?

In many tutorials/blogs I've seen the output of np.fft.fft(signal) divided by the number of sample points N.
I understand that in some implementations the transform is scaled/normalized by some factor, such as multiplying by N. However, I just read the docs, and by default the output of fft.fft() is unscaled. Yet I still see the output divided by N everywhere.
Why is this?
I have noticed that by scaling the output by 1/N I get back the correct amplitudes of the contributing wave signals. So obviously it is necessary, but I'd like to understand what the pure output is as compared to the scaled output.
For the DFT to be reversible (x == IDFT(DFT(x))), you need to divide by N somewhere. In signal processing this normalization is typically done in the inverse transform. For example Wikipedia shows it this way.
In other fields it is more often done in the forward transform. In physics I have seen half the normalization (1/sqrt(N)) applied to each transform, making them symmetric.
When the forward transform normalizes, then the values it returns are independent of the signal length (for example the zero frequency is the mean of all signal values). This is therefore the more useful variant when studying signal power.
With the variant where the normalization is applied in the inverse transform (as is common in signal processing software, for example np.fft.fft() and MATLAB's fft), computing a convolution by multiplication in the frequency domain is easiest: one can directly write g = IDFT(DFT(f)*DFT(h)). If the normalization is applied elsewhere, it must be partly undone to obtain a correctly scaled result.
Other software, for example the FFTW library, does not normalize the transform at all, leaving that up to the user. This avoids unnecessary multiplications if the user wants a different normalization variant than what the library chooses.
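A quick self-contained check of the NumPy convention (the forward transform is unscaled, the 1/N lives in the inverse):
import numpy as np
N = 1000
t = np.arange(N) / 1000.0                 # 1 kHz sample rate, 1 s of data
x = 3.0 * np.cos(2 * np.pi * 50 * t)      # 50 Hz cosine with amplitude 3
X = np.fft.fft(x)                         # unscaled by default
print(np.abs(X[50]) / N * 2)              # ~3.0 once you divide by N (and double for the mirrored bin)
print(np.allclose(np.fft.ifft(X), x))     # True: the 1/N is applied in ifft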
Based on Parseval's theorem,
sum_{n=0}^{N-1} |x[n]|^2 = (1/N) * sum_{k=0}^{N-1} |X[k]|^2
This expresses that the energies in the time and frequency domains are the same. It means the magnitude |X[k]| of each frequency bin k is contributed by N samples. In order to find out the average contribution by each sample, the magnitude is normalized as |X[k]|/N, which leads to
(1/N) * sum_{n=0}^{N-1} |x[n]|^2 = sum_{k=0}^{N-1} (|X[k]|/N)^2
where the LHS is the power of the signal.
However, such a scale normally doesn't matter unless you care about the unit of the magnitude, like in the case of the Sound Pressure Level (SPL) spectrum.
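A short numerical check of this relation (sketch):
import numpy as np
x = np.random.randn(1024)
X = np.fft.fft(x)
print(np.allclose(np.mean(np.abs(x)**2),           # power in the time domain
                  np.sum(np.abs(X / len(x))**2)))  # sum of |X[k]/N|^2 -> True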

Can aubio be used to detect rhythm-only segments?

Does aubio have a way to detect sections of a piece of audio that lack tonal elements -- rhythm only? I tested a piece of music that has 16 seconds of rhythm at the start, but all the aubiopitch and aubionotes algorithms seemed to detect tonality during the rhythmic section. Could it be tuned somehow to distinguish tonal from non-tonal onsets? Or is there a related library that can do this?
Been busy the past couple of days - but started looking into this today...
It'll take a while to perfect I guess but I thought I'd give you a few thoughts and some code I've started working on to attack this!
Firstly, pseudo code's a good way to design an initial method.
1/ use import matplotlib.pyplot as plt to spectrum analyse the audio, and plot various fft and audio signals.
2/ import numpy as np for basic array-like structure handling.
(I know this is more than pseudo code, but hey :-)
3/ plt.specgram creates spectral maps of your audio. Apart from the image it creates (which can be used to start to manually deconstruct your audio file), it returns 4 structures.
eg
ffts,freqs,times,img = plt.specgram(signal,Fs=44100)
ffts is a 2-dimensional array where the columns are the FFTs (Fast Fourier Transforms) of the time sections (rows).
The plain vanilla specgram analyses time sections of 256 samples long, stepping 128 samples forwards each time.
This gives a very low resolution frequency array at a pretty fast rate.
As musical notes merge into a single sound when played at roughly 10 Hz or faster, I decided to use the specgram options to divide the audio into sections 4096 samples long (circa 10 per second), stepping forwards every 2048 samples (ie 20 times a second).
This gives a decent frequency resolution, and the time sections, being a 20th of a second apart, are faster than people can perceive individual notes.
This means calling the specgram as follows:
plt.specgram(signal,Fs=44100,NFFT=4096,noverlap=2048,mode='magnitude')
(Note the mode - this seems to give me amplitudes between 0 and 0.1. I have a problem with fft not giving me amplitudes of the same scale as the audio signal (you may have seen the question I posted), but here we are...)
4/ Next I decided to get rid of noise in the ffts returned. This means we can concentrate on freqs of a decent amplitude, and zero out the noise which is always present in ffts (in my experience).
Here is (are) my function(s):
def gate(signal, minAmplitude):
    return np.array([int((((a - minAmplitude) + abs(a - minAmplitude)) / 2) > 0) * a for a in signal])
Looks a bit crazy - and I'm sure a proper mathematician could come up with something more efficient - but this is the best I could invent. It zeros any frequencies of amplitude less than minAmplitude.
Here is the relevant code to call it on the ffts returned by plt.specgram. My actual function is more involved, as it is part of a class and references other functions, but this should be enough:
def fft_noise_gate(ffts, minAmplitude=0.001, check=True):
    '''
    zero the amplitudes of frequencies
    with amplitudes below minAmplitude
    across all ffts
    check - plot middle fft just because!
    '''
    nffts = ffts.shape[1]
    gated_ffts = []
    for f in range(nffts):
        fft = ffts[..., f]
        # Anyone got a more efficient noise gate formula? Best I could think up!
        fft_gated = gate(fft, minAmplitude)
        gated_ffts.append(fft_gated)
    ffts = np.array(gated_ffts)
    if check:
        # plot middle fft just to see!
        plt.plot(ffts[int(nffts / 2)])
        plt.show(block=False)
    return ffts
This should give you a start. I'm still working on it and will get back to you when I've got further - but if you have any ideas, please share them.
Any way my strategy from here is to:
1/ find the peaks (ie the start of any sounds), then
2/ Look for ranges of frequencies which rise and fall in unison (ie make up a sound).
And
3/ Differentiate them into individual instruments (sound sources, more specifically), and plot the times and amplitudes thereof to create your analysis (score). A very rough sketch of step 1 is below.
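Here's that sketch of step 1, using scipy.signal.find_peaks on the per-slice energy of the gated ffts (the threshold and distance values are pure guesses to tune, and it assumes the time-major orientation returned by fft_noise_gate above):
from scipy.signal import find_peaks

def find_sound_onsets(gated_ffts, hop_seconds=2048 / 44100.0):
    # rough onset estimate: peaks in the frame-to-frame increase of total energy
    energy = np.sum(np.abs(gated_ffts), axis=1)   # one energy value per time slice
    rise = np.diff(energy, prepend=energy[0])     # how much each slice grows
    peaks, _ = find_peaks(rise, height=np.max(rise) * 0.1, distance=3)
    return peaks * hop_seconds                    # onset times in seconds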
Hope you're having fun with it - I know I am.
As I said any thoughts...
Regards
Tony
Use a spectrum analyser to detect sections with high amplitude. If you program, you could take each section and make an average of the frequencies (and amplitudes) present to give you an idea of the instrument(s) involved in creating that amplitude peak; a rough sketch of that averaging follows below.
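For example, once you have spectrogram slices (one spectrum per time section), the averaging could look something like this sketch (the names and the "top 5 bins" choice are just placeholders):
import numpy as np

def describe_section(ffts, freqs, start_slice, end_slice):
    # average spectrum of one high-amplitude section and its strongest frequencies
    section = np.abs(ffts[start_slice:end_slice])   # assumes one spectrum per row (time-major)
    avg_spectrum = section.mean(axis=0)
    top = np.argsort(avg_spectrum)[-5:][::-1]       # five strongest bins
    return list(zip(freqs[top], avg_spectrum[top]))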
Hope that helps - if you're using Python I could give you some more pointers on how to program this!?
Regards
Tony

Why should I discard half of what a FFT returns?

Looking at this answer:
Python Scipy FFT wav files
The technical part is obvious and working, but I have three theoretical questions (the code mentioned is below):
1) Why do I have to normalize (b=...) the frames? What would happen if I used the raw data?
2) Why should I only use half of the FFT result (d=...)?
3) Why should I abs(c) the FFT result?
Perhaps I'm missing something due to inadequate understanding of WAV format or FFT, but while this code works just fine, I'd be glad to understand why it works and how to make the best use of it.
Edit: in response to the comment by @Trilarion:
I'm trying to write a simple, not 100% accurate but more like a proof-of-concept speaker diarisation in Python. That means taking a wav file (right now I am using this one for my tests) and saying, for each second (or any other resolution), whether the speaker is person #1 or person #2. I know in advance that these are 2 persons, and I am not trying to link them to any known voice signatures, just to separate them. Right now I take each second, FFT it (and thus get a list of frequencies), and cluster them using KMeans with the number of clusters between 2 and 4 (A, B [,Silence [,A+B]]).
I'm still new to analyzing wav files and audio in general.
import matplotlib.pyplot as plt
import scipy.fftpack as sfft
from scipy.io import wavfile # get the api
fs, data = wavfile.read('test.wav') # load the data
a = data.T[0] # this is a two channel soundtrack, I get the first track
b = [(ele / 2**8.) * 2 - 1 for ele in a] # this is 8-bit track, b is now normalized on [-1,1)
c = sfft.fft(b) # create a list of complex numbers
d = len(c) // 2 # you only need half of the fft list
plt.plot(abs(c[:(d - 1)]), 'r')
plt.show()
To address these in order:
1) You don't need to normalize, but without normalization the values stay close to the raw structure of the digitized waveform, and the numbers are unintuitive. For example, how loud is a value of 67? It's easier to normalize to the range -1 to 1 to interpret the values. (But if you wanted to implement a filter, for example, where you did an FFT, modified the FFT values, then followed with an IFFT, normalizing would be an unnecessary hassle.)
2) and 3) are similar in that they both have to do with the math living primarily in the complex numbers space. That is, FFTs take a waveform of complex numbers (eg, [.5+.1j, .4+.7j, .4+.6j, ...]) to another sequence of complex numbers.
So in detail:
2) It turns out that if the input waveform is real instead of complex, then the FFT has a symmetry about 0, so only the values that have a frequency >=0 are uniquely interesting.
3) The values output by the FFT are complex, so they have a Re and Im part, but this can also be expressed as a magnitude and phase. For audio signals, it's usually the magnitude that's the most interesting, because this is primarily what we hear. Therefore people often use abs (which is the magnitude), but the phase can be important for different problems as well.
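A tiny experiment shows both points at once - for real input the spectrum is conjugate-symmetric, so the second half carries no new information, and abs() extracts the magnitude:
import numpy as np
x = np.random.randn(16)                            # a real-valued signal
X = np.fft.fft(x)
print(np.allclose(X[1:], np.conj(X[1:][::-1])))    # True: X[k] == conj(X[N-k])
print(np.abs(X[:len(x) // 2 + 1]))                 # the one-sided magnitude spectrum is enough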
That depends on what you're trying to do. It looks like you're only looking to plot the spectral density and then it's OK to do so.
In general the coefficient in the DFT is depending on the phase for each frequency so if you want to keep phase information you have to keep the argument of the complex numbers.
The symmetry you see is only guaranteed if the input is a real-valued sequence (IIRC). It's related to the mirroring distortion you get if you have frequencies above the Nyquist frequency (half the sampling frequency): the original frequency shows up in the DFT, but so does the mirrored frequency.
If you're going to inverse DFT you should keep the full data and also keep the arguments of the DFT-coefficients.

How can I discriminate the falling edge of a signal with Python?

I am working on discriminating some signals for the calculation of the free-period oscillation and damping ratio of a spring-mass system (seismometer). I am using Python as the main processing program. What I need to do is import this signal, parse the signal and find the falling edge, then return a list of the peaks from the tops of each oscillation as a list so that I can calculate the damping ratio. The calculation of the free-period is fairly straightforward once I've determined the location of the oscillation within the dataset.
Where I am mostly hung up is how to parse through the list, identify the falling edge, and then capture each of the Z0..Zn elements. The oscillation frequency can be calculated using an FFT fairly easily once I know where that falling edge is, but if I process the whole file, a lot of energy from the energizing of the system before the release can sometimes force the algorithm to kick out an ultra-low frequency that represents the near-DC offset rather than the actual oscillation. (It's especially a problem at higher damping ratios where there might be only four or five measurable rebounds.)
Has anyone got some ideas on how I can go about this? Right now, the code below uses the arbitrarily assigned values for the signal in the screenshot. However I need to have code calculate those values. Also, I haven't yet determined how to create my list of peaks for the calculation of the damping ratio h. Your help in getting some ideas together for solving this would be very welcome. Due to the fact I have such a small Stackoverflow reputation, I have included my signal in a sample screenshot at the following link:
(Boy, I hope this works!)
Tycho's sample signal -->
https://github.com/tychoaussie/Sigcal_v1/blob/066faca7c3691af3f894310ffcf3bbb72d730601/Freeperiod_Damping%20Ratio.jpg
##########################################################
import os, sys, csv
from scipy import signal
from scipy.integrate import simps
import pylab as plt
import numpy as np
import scipy as sp
#
# code goes here that imports the data from the csv file into lists.
# One of those lists is called laser, a time-history list in counts from an analog-digital-converter
# sample rate is 130.28 samples / second.
#
#
# Find the period of the observed signal
#
delta = 0.00767 # Calculated elsewhere in code, represents 130.28 samples/sec
# laser is a list with about 20,000 elements representing time-history data from a laser position sensor
# The release of the mass and system response occurs starting at sample number 2400 in this particular instance.
sense = signal.detrend(laser[2400:(2400+8192)]) # Remove the mean of the signal
N = len(sense)
W = np.fft.fft(sense)
freq = np.fft.fftfreq(len(sense),delta) # First value represents the number of samples and delta is the sample rate
#
# Take the sample with the largest amplitude as our center frequency.
# This only works if the signal is heavily sinusoidal and stationary
# in nature, like our calibration data.
#
idx = np.where(abs(W)==max(np.abs(W)))[0][-1]
Frequency = abs(freq[idx]) # Frequency in Hz
period = 1/(Frequency*delta) # represents the number of samples for one cycle of the test signal.
#
# create an axis representing time.
#
dt = [] # Create an x axis that represents elapsed time in seconds. delta = seconds per sample, i represents sample count
for i in range(0, len(sense)):  # assuming 'sense' (the detrended laser data) was meant here
    dt.append(i * delta)
#
# At this point, we know the frequency interval, the delta, and we have the arrays
# for signal and laser. We can now discriminate out the peaks of each rebound and use them to process the damping ratio of either the 'undamped' system or the damping ratio of the 'electrically damped' system.
#
print('Frequency calculated to', Frequency, 'Hz.')
Here's a somewhat unconventional idea, which I think might be fairly robust and doesn't require a lot of heuristics and guesswork. The data you have is really high quality and fits a known curve, so that helps a lot here. Here I assume the "good part" of your curve has the form:
V = a * exp(-γ * t) * cos(2 * π * f * t + φ) + V0 # [Eq1]
V: voltage
t: time
γ: damping constant
f: frequency
a: starting amplitude
φ: starting phase
V0: DC offset
Outline of the algorithm
Getting rid of the offset
Firstly, calculate the derivative numerically. Since the data quality is quite high, the noise shouldn't affect things too much.
The voltage derivative V_deriv has the same form as the original data: same frequency and damping constant, but with a different phase ψ and amplitude b,
V_deriv = b * exp(-γ * t) * cos(2 * π * f * t + ψ) # [Eq2]
The nice thing is that this will automatically get rid of your DC offset.
Alternative: This step isn't completely needed – the offset is a relatively minor complication to the curve fitting, since you can always provide a good guess for an offset by averaging the entire curve. Trade-off is you get a better signal-to-noise if you don't use the derivative.
Preliminary guesswork
Now consider the curve of the derivative (or the original curve if you skipped the last step). Start with the data point on the very far right, and then follow the curve leftward as many oscillations as you can until you reach an oscillation whose amplitude is greater than some amplitude threshold A. You want to find a section of the curve containing one oscillation with a good signal-to-noise ratio.
How you determine the amplitude threshold is difficult to say. It depends on how good your sensors are. I suggest leaving this as a parameter for tweaking later.
Curve-fitting
Now that you have captured one oscillation, you can do lots of easy things: estimate the frequency f and damping constant γ. Using these initial estimates, you can perform a nonlinear curve fit of all the data to the right of this oscillation (including the oscillation itself).
The function that you fit is [Eq2] if you're using the derivative, or [Eq1] if you use the original curve (but they are the same function anyway, so it's rather a moot point).
Why do you need these initial estimates? For a nonlinear curve fit to a decaying wave, it's critical that you give a good initial guess for the parameters, especially the frequency and damping constant. The other parameters are comparatively less important (at least that's my experience), but here's how you can get the others just in case:
The phase is 0 if you start from a maximum, or π if you start at a minimum.
You can guess the starting amplitude too, though the fitting algorithm will typically do fine even if you just set it to 1.
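Putting those pieces together, a minimal curve-fitting sketch with scipy.optimize.curve_fit might look like this (t_fit, v_fit, f0 and gamma0 are placeholders for the data segment and the initial estimates described above):
import numpy as np
from scipy.optimize import curve_fit

def damped_cos(t, a, gamma, f, phi, v0):
    # [Eq1]: exponentially decaying cosine plus a DC offset
    return a * np.exp(-gamma * t) * np.cos(2 * np.pi * f * t + phi) + v0

p0 = [v_fit.max() - v_fit.mean(), gamma0, f0, 0.0, v_fit.mean()]   # initial guesses
popt, pcov = curve_fit(damped_cos, t_fit, v_fit, p0=p0)
a, gamma, f, phi, v0 = popt
perr = np.sqrt(np.diag(pcov))   # rough 1-sigma uncertainties for each parameter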
Finding the edge
The curve fit should be really good at this point - you can tell because the discrepancy between your curve and the fit is very low. Now the trick is to attempt to increase the domain of the curve to the left.
If you stay within the sinusoidal region, the curve fit should remain quite good (how you judge the "goodness" requires some experimenting). As soon as you hit the flat region of the curve, however, the errors will start to increase dramatically and the parameters will start to deviate. This can be used to determine where the "good" data ends.
You don't have to do this point by point though – that's quite inefficient. A binary search should work quite well here (possibly the "one-sided" variation of it).
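A sketch of that one-sided binary search (here "good" is just an RMS-residual threshold tol that you would have to tune; t and v are the full time and voltage arrays, and damped_cos, p0 and the curve_fit import come from the fitting sketch above):
def fit_rms(i_start, t, v, p0):
    # RMS residual of a fit that starts at sample i_start and runs to the end
    popt, _ = curve_fit(damped_cos, t[i_start:], v[i_start:], p0=p0, maxfev=10000)
    return np.sqrt(np.mean((v[i_start:] - damped_cos(t[i_start:], *popt)) ** 2))

def find_left_edge(t, v, p0, i_known_good, tol):
    # earliest start index whose fit residual stays below tol
    lo, hi = 0, i_known_good          # the fit is known to be good from i_known_good onward
    while lo < hi:
        mid = (lo + hi) // 2
        if fit_rms(mid, t, v, p0) < tol:
            hi = mid                  # still good: try extending further left
        else:
            lo = mid + 1              # bad: the edge must be to the right of mid
    return lo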
Summary
You don't have to follow this exact procedure, but the basic gist is that you can perform some analysis on a small part of the data starting on the very right, and then gradually increase the time domain until you reach a point where you find that the errors are getting larger rather than smaller as they ought to be.
Furthermore, you can also combine different heuristics to see if they agree with each other. If they don't, then it's probably a rather sketchy data point that requires some manual intervention.
Note that one particular advantage of the algorithm I sketched above is that you will get the results you want (damping constant and frequency) as part of the process, along with uncertainty estimates.
I neglected to mention most of the gory mathematical and algorithmic details so as to provide a general overview, but if needed I can provide more detail.
