I am converting a Python program to Node.js. The program follows these steps:
Microphone listens with callbacks
Callbacks do a Librosa "log_mel_S" extraction
The "log_mel_S" is run through an AI model for inference
Sound is labeled
I have managed to translate all of the steps and their related code from Python to Node.js, except for the Librosa extraction.
This would be an example for the audio shape and type required:
audio_sample = numpy.zeros(shape=(1024, 100), dtype=numpy.float32)
And this is the Librosa piece I need help translating:
S = numpy.abs(librosa.stft(y=audio_sample, n_fft=1024, hop_length=500)) ** 2
mel_S = numpy.dot(librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64), S).T
log_mel_S = librosa.power_to_db(mel_S, ref=1.0, amin=1e-10, top_db=None)
I found the package Meyda, and it looks like it could be a good substitute, but I am not sure how to approach this: it is unclear to me what exactly Librosa is extracting, so I cannot map it onto terms like Amplitude Spectrum, Power Spectrum, etc.
Please help me understand and translate this action.
TL;DR
The Amplitude Spectrum is basically the magnitude (absolute value) of the FFT of the signal, and the Power Spectrum is the squared Amplitude Spectrum, which is also sometimes referred to as energy.
Here is an example from the Meyda site that calculates the Amplitude Spectrum: https://github.com/catalli/audiotrainer-server/blob/df41322906c88cd6f899e8f9b9661ebb949f72e1/index.js#L17
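In numpy terms (the notation of the snippet being ported), the distinction is simply:

import numpy as np

x = np.random.randn(1024).astype(np.float32)  # any real-valued audio frame
spectrum = np.fft.rfft(x)                 # complex FFT bins
amplitude_spectrum = np.abs(spectrum)     # "Amplitude Spectrum": |FFT|
power_spectrum = amplitude_spectrum ** 2  # "Power Spectrum": |FFT|^2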
Long answer:
Now, let's look at your code sample line by line, figure out what it is doing, and see how to implement it in JavaScript.
S = numpy.abs(librosa.stft(y=audio_sample, n_fft=1024, hop_length=500)) ** 2
This computes the squared magnitude of a 1024-bin FFT of audio_sample y, which is the Power Spectrum (an Amplitude Spectrum squared). Please note that the abs of a complex number is its vector length: sqrt(real_part^2 + imag_part^2).
mel_S = numpy.dot(librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64), S).T
This applies a mel filter bank: a matrix product between 64 predefined triangular filters and the squared FFT, yielding a mel spectrogram. (It is the same intermediate used when computing MFCCs, but without the final DCT step.)
log_mel_S = librosa.power_to_db(mel_S, ref=1.0, amin=1e-10, top_db=None)
This last one converts the result to decibel (dB) units: 10 * log10(S / ref), with values floored at amin = 1e-10.
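Putting the three lines together, here is a minimal numpy-only sketch of what they compute, useful as a porting reference. It assumes librosa's defaults (centered frames and a Hann window; np.hanning is the symmetric variant, so the result is close to, but not bit-exact with, librosa's). A practical shortcut for the Node.js port: compute the mel filter bank once in Python and export it as JSON, since it is a fixed matrix for a given sr/n_fft/n_mels.

import numpy as np
import librosa

sr, n_fft, hop_length, n_mels = 44100, 1024, 500, 64
y = np.random.randn(sr).astype(np.float32)  # stand-in for one second of mic audio

# 1) STFT power spectrum: pad, frame, window, FFT, squared magnitude
y_pad = np.pad(y, n_fft // 2, mode="reflect")  # librosa's center=True padding
n_frames = 1 + (len(y_pad) - n_fft) // hop_length
idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
S = np.abs(np.fft.rfft(y_pad[idx] * np.hanning(n_fft), axis=1)) ** 2  # (n_frames, 513)

# 2) Mel projection: 64 triangular filters applied to each power spectrum
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)       # (64, 513)
mel_S = S @ mel_fb.T                                                  # (n_frames, 64)

# 3) Power to dB: 10*log10 with a floor at amin=1e-10 (ref=1.0, top_db=None)
log_mel_S = 10.0 * np.log10(np.maximum(mel_S, 1e-10))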
I will extend this answer with a JS code sample later; submitting it now because I think it is already helpful as it is.
Related
I am using this algorithm to detect the pitch of this audio file. As you can hear, it is an E2 note played on a guitar, with a bit of noise in the background.
I generated this spectrogram using STFT:
And I am using the algorithm linked above like this:
import sys
y, sr = librosa.load(filename, sr=40000)
pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
np.set_printoptions(threshold=sys.maxsize)
print(pitches[np.nonzero(pitches)])
As a result, I am getting pretty much every possible frequency between my fmin and fmax. What do I have to do with the output of the piptrack method to discover the fundamental frequency of a time frame?
UPDATE
I am still not sure what those 2D arrays represent, though. Let's say I want to find out how strong 82 Hz is in frame 5. I could do that using the STFT function, which simply returns a 2D matrix (the one used to plot the spectrogram).
However, piptrack does something additional which could be useful, and I don't really understand what. pitches[f, t] contains the instantaneous frequency at bin f, time t. Does that mean that, if I want to find the maximum frequency at time frame t, I have to:
1. Go to the magnitudes[:, t] array and find the bin with the maximum magnitude.
2. Assign that bin to a variable f.
3. Look up pitches[f][t] to find the frequency that belongs to that bin?
Pitch detection is a tricky topic and is often counter-intuitive. I'm not wild about the way the source code is documented for this particular function -- it almost seems like the developer is confusing a 'harmonic' with a 'pitch'.
When a single note (a 'pitch') is made on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies, called harmonics. Typical pitch-tracking techniques include searching the results of a FFT for magnitudes in certain bins that correspond to the expected frequencies of harmonics. For instance, if we press the Middle C key on the piano, the fundamental frequency of the composite is 261.6 Hz, the 2nd harmonic is 523 Hz, the 3rd harmonic is 785 Hz, the 4th harmonic is 1046 Hz, etc. The higher harmonics are integer multiples of the fundamental frequency, 261.6 Hz (e.g. 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046). However, musical pitches are spaced logarithmically, while the FFT uses linear spacing, so at the lower frequencies the FFT's fixed bin width is often not fine enough to resolve neighboring notes.
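To make the resolution point concrete, here is a small illustrative check (the 44.1 kHz rate and 4096-point FFT are example numbers, not taken from the question):

import numpy as np

f0 = 261.6                          # Middle C fundamental (Hz)
harmonics = f0 * np.arange(1, 5)    # 261.6, 523.2, 784.8, 1046.4 Hz: linearly spaced

sr, n_fft = 44100, 4096
bin_width = sr / n_fft              # ~10.8 Hz per bin, the same at every frequency

# Pitches, by contrast, are spaced logarithmically: a semitone is a factor of 2**(1/12).
semitone_at_e2 = 82.4 * (2 ** (1 / 12) - 1)    # ~4.9 Hz, narrower than one bin
semitone_at_1k = 1000.0 * (2 ** (1 / 12) - 1)  # ~59.5 Hz, many bins wide
print(bin_width, semitone_at_e2, semitone_at_1k)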
For that reason when I wrote a pitch detecting application (PitchScope Player), I chose to create a logarithmically spaced DFT, rather than a FFT, so I could focus on the precise frequencies of interest for music ( see the attached diagram of my custom DFT from 3 seconds of a guitar solo ). If you are serious about pursuing pitch detection, you should consider doing more reading into the topic, looking at other sample code (mine is linked below), and consider writing your own functions to measure frequency.
https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection
https://github.com/CreativeDetectors/PitchScope_Player
Turns out the way to pick the pitch at a certain frame t is simple:
def detect_pitch(y, sr, t):
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    index = magnitudes[:, t].argmax()
    pitch = pitches[index, t]
    return pitch
First get the bin of the strongest magnitude by looking at the magnitudes array, then read the pitch at pitches[index, t].
To find the pitch of the whole audio segment:
def detect_pitch(y, sr):
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    # get the index of the maximum magnitude in each time slice
    max_indexes = np.argmax(magnitudes, axis=0)
    # get the pitch at that index for each time slice
    pitches = pitches[max_indexes, np.arange(magnitudes.shape[1])]
    return pitches
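A quick usage sketch for the function above (the filename is hypothetical; frames where piptrack found no pitch come back as 0 Hz and should be skipped before summarizing):

import numpy as np
import librosa

y, sr = librosa.load("guitar_e2.wav", sr=40000)  # hypothetical file from the question
track = detect_pitch(y, sr)
voiced = track[track > 0]                    # drop unvoiced frames
print(np.median(voiced))                     # should land near 82.4 Hz for an E2
print(librosa.hz_to_note(np.median(voiced)))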
I am attempting to calculate the MTF from a test target. I can calculate the spread function easily enough, but the FFT results do not quite make sense to me. To summarize, the values seem to alternate in sign, giving me a reflection of what I would expect. To test, I used a simple square pulse and numpy:
from numpy import fft

data = []
for x in range(0, 20):
    data.append(0)
data[9] = 10
data[10] = 10
data[11] = 10
dataFFT = fft.fft(data)
The results look correct, with the exception of the sign... I am seeing the following for the first 4 values as an example:
30.00000000 +0.00000000e+00j
-29.02113033 +7.10542736e-15j
26.18033989 -1.24344979e-14j
-21.75570505 +1.24344979e-14j
So my question is: why positive -> negative -> positive -> negative in the real plane? This is not what I would expect... If I plot it, it almost appears that the correct function is mirrored around the x axis.
[Figures: the expected magnitude plot vs. the alternating-sign result actually obtained.]
Your pulse is symmetric and positioned in the center of your FFT window (around N/2). Symmetric real data corresponds to only the cosine or "real" components of an FFT result. Note that the cosine function alternates between being -1 and 1 at the center of the FFT window, depending on the frequency bin index (representing cosine periods per FFT width). So the correlation of these FFT basis functions with a positive going pulse will also alternate as long as the pulse is narrower than half the cosine period.
If you want the largest FFT coefficients to be mostly positive, try centering your narrow rectangular pulse around time 0 (or circularly, time N), where the cosine function is always 1 for any frequency.
It works if you shift the data around 0 instead of half your array, with:
dataFFT = fft.fft(fft.fftshift(data))
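A quick self-contained check of that suggestion; with the pulse circularly centered at time 0, the first few coefficients come out positive:

import numpy as np
from numpy import fft

data = [0] * 20
data[9] = data[10] = data[11] = 10

shifted = fft.fftshift(data)           # pulse now centered at index 0 (circularly)
dataFFT = fft.fft(shifted)
print(np.round(dataFFT[:4].real, 2))   # [30.   29.02 26.18 21.76]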
This isn't all that unexpected. If you want to check against conventional plots, make sure you convert that info to magnitude and phase before coming to any conclusions.
I did a quick check using your code, with numpy.abs for magnitude and numpy.angle for phase. It sure looks like a sinc() function to me, which is what would be expected if the time domain is a square pulse. If you do this, you'll find a pretty wide sinc, as would be expected for a short-duration pulse on so few samples.
You forgot to specify whether your data is real or complex.
Not everyone codes in Python/numpy (including me), and if you do not know this, then you are probably handling the data going to/from the FFT the wrong way.
The FFT input can be in either the real or the complex domain.
The FFT output is in the complex domain.
So check the docs for your FFT implementation, specify it, and repair your data handling accordingly. Complex-domain data usually has the first value as Re and the second as Im, but that depends on the FFT implementation/configuration.
Here is an example of an impulse response through an FFT:
The first trace is the input real-domain signal (Im = 0), a single pulse of finite nonzero width; the second is the Re part of the FFT output; the third is the Im part of the FFT output. If you zoom in a bit, you will see the amplitude range of the y axis of each signal (on the left).
Do not forget that different FFT implementations can have different normalization constants, which will change the amplitude of the signal. If you want magnitude and phase, convert it like this:
mag=sqrt(Re*Re+Im*Im); // magnitude
ang=atanxy(Re,Im); // phase angle
atanxy(dx,dy) is the 4-quadrant arctangent, also called atan2, but be careful to pass the operands in the order your atanxy/atan2 implementation expects. You can also use my C++ atanxy implementation.
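In numpy, the equivalents are (np.arctan2 takes (y, x), i.e. (Im, Re)):

import numpy as np

X = np.fft.fft([0., 10., 10., 10., 0., 0., 0., 0.])  # any complex FFT output
Re, Im = X.real, X.imag
mag = np.hypot(Re, Im)      # sqrt(Re*Re + Im*Im), same as np.abs(X)
ang = np.arctan2(Im, Re)    # 4-quadrant arctan, same as np.angle(X)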
[Notes]
If your input signal is in the real domain, then the FFT output is conjugate-symmetric. The Re part will mirror like:
{ a0,a1,a2,a3,...,a(n-1),a(n-1)...,a3,a2,a1,a0 }
and the Im part mirrors the same way but with opposite sign, exactly like in the image above. On the left are the low frequencies and in the middle is the top frequency. If your input signal is in the complex domain, then the output can be anything.
I'm working on a problem where I would like to extract and compare the time-domain amplitudes of two different signals at each frequency. The signals are real-world, so they have noise and multiple frequencies, which is why I'm trying to work in the FFT domain.
I wrote a function to take the FFT of a dataset and return the amplitudes. This seems to work fine for a simulated pure sine wave, but when run on actual datasets, the amplitudes are always attenuated by some amount.
import numpy as np
import scipy.fftpack

def amplitudePowerSpectrum(time, data):
    avgdt = np.mean(np.diff(time))   # average sample spacing
    sampFreq = 1.0 / avgdt
    nyquistFreq = sampFreq / 2.0
    FFTData = np.abs(scipy.fftpack.fft(data))
    ## Only care about positive frequencies
    FFTData = FFTData[0:len(FFTData) // 2]
    ## This is how we get the power spectrum in terms of time-domain amplitudes
    amplitudeSpectrum = FFTData / len(FFTData)
    freqsData = scipy.fftpack.fftfreq(data.size, avgdt)
    freq = freqsData[0:len(freqsData) // 2]
    return (freq, amplitudeSpectrum, (sampFreq, nyquistFreq))
Here is a plot of a raw dataset, followed by a plot of the computed amplitude spectrum. As you can see, there are two distinct frequencies, with other noise on top.
I'd expect the amplitudes in figure 2 to match the time-domain amplitudes in figure 1, but they are attenuated by a pretty large factor. The end goal is a scale factor between the input (blue) and output (red) signals at each frequency.
First: is obtaining time-domain amplitudes accurately even possible in the Fourier domain on real datasets? If so, what am I missing? I'm working with the Python numpy and scipy packages.
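One classic cause of exactly this symptom, worth checking, is spectral leakage: when a signal's frequency falls between FFT bin centers, its energy is spread across neighboring bins and the peak bin under-reads the amplitude (up to roughly 36% low for a rectangular window). A minimal demonstration with made-up numbers:

import numpy as np

N, fs = 1024, 1000.0
t = np.arange(N) / fs
for f in (125.0, 125.4882):  # exactly on a bin vs. half a bin off (bin width fs/N)
    x = np.sin(2 * np.pi * f * t)          # true amplitude = 1.0
    amp = 2 * np.abs(np.fft.rfft(x)) / N   # one-sided amplitude normalization
    print(f, amp.max())      # ~1.0 on-bin; ~0.64 half a bin off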
Audio processing is pretty new to me. I am currently using Python numpy for processing wave files. After calculating the FFT matrix, I am getting noisy power values for non-existent frequencies. I am interested in visualizing the data, and accuracy is not a high priority. Is there a safe way to calculate a clipping threshold to remove these values, or should I use all the FFT matrices for each sample set and average them?
Edit:
from numpy import *
import wave
import struct
from pylab import ion, plot, draw, show

fp = wave.open("500-200f.wav", "rb")
sample_rate = fp.getframerate()
total_num_samps = fp.getnframes()
fft_length = 2048
num_fft = (total_num_samps // fft_length) - 2
temp = zeros((num_fft, fft_length), float)
for i in range(num_fft):
    tempb = fp.readframes(fft_length)
    data = struct.unpack("%dH" % fft_length, tempb)
    temp[i, :] = array(data, short)
pts = fft_length // 2 + 1
data = (abs(fft.rfft(temp, fft_length)) / pts)[:pts]
x_axis = arange(pts) * sample_rate * .5 / pts
spec_range = pts
plot(x_axis, data[0])
show()
Here is the plot on a non-logarithmic scale, for a synthetic wave file containing a 500 Hz (fading out) + 200 Hz sine wave, created using GoldWave.
Simulated waveforms shouldn't show FFTs like your figure, so something is very wrong, and probably not with the FFT, but with the input waveform. The main problem in your plot is not the ripples, but the harmonics around 1000 Hz, and the subharmonic at 500 Hz. A simulated waveform shouldn't show any of this (for example, see my plot below).
First, you probably want to just try plotting out the raw waveform; this will likely point to an obvious problem. Also, it seems odd to have a wave file unpack to unsigned shorts, i.e. "H", and especially, after this, not to have a large zero-frequency component.
I was able to get a pretty close duplicate of your FFT by applying clipping to the waveform, as was suggested by both the subharmonic and the higher harmonics (and Trevor). You could be introducing clipping either in the simulation or in the unpacking. Either way, I bypassed this by creating the waveforms in numpy to start with.
Here's what the proper FFT should look like (i.e. basically perfect, except for the broadening of the peaks due to the windowing)
Here's one from a waveform that's been clipped (and is very similar to your FFT, from the subharmonic to the precise pattern of the three higher harmonics around 1000 Hz)
Here's the code I used to generate these:
from numpy import *
from pylab import ion, plot, draw, show, xlabel, ylabel, figure

sample_rate = 20000.
times = arange(0, 10., 1./sample_rate)
wfm0 = sin(2*pi*200.*times)
wfm1 = sin(2*pi*500.*times) * (10.-times)/10.   # 500 Hz component fading out
wfm = wfm0 + wfm1
# int test
#wfm *= 2**8
#wfm = wfm.astype(int16)
#wfm = wfm.astype(float)
# abs test
#wfm = abs(wfm)
# clip test
#wfm = clip(wfm, -1.2, 1.2)
fft_length = 5 * 2048
total_num_samps = len(times)
num_fft = (total_num_samps // fft_length) - 2
temp = zeros((num_fft, fft_length), float)
for i in range(num_fft):
    temp[i, :] = wfm[i*fft_length:(i+1)*fft_length]
pts = fft_length // 2 + 1
data = (abs(fft.rfft(temp, fft_length)) / pts)[:pts]
x_axis = arange(pts) * sample_rate * .5 / pts
plot(x_axis, data[2], linewidth=3)
xlabel("freq (Hz)")
ylabel("abs(FFT)")
show()
FFTs, because they are windowed and sampled, cause aliasing and sampling in the frequency domain as well. Filtering in the time domain is just multiplication in the frequency domain, so you may want to simply apply a filter by multiplying each frequency by the value of your filter's transfer function, for example 1 in the passband and 0 everywhere else. The unexpected values are probably caused by aliasing, where higher frequencies are being folded down onto the ones you are seeing. The original signal needs to be band-limited to half your sampling rate or you will get aliasing. Of more concern is aliasing that distorts the region of interest, because then you cannot tell whether a frequency in that band is the expected one.
The other thing to keep in mind is that when you grab a chunk of data from a wave file, you are mathematically multiplying it by a rectangular window. This convolves a sin(x)/x shape with the frequency response; to minimize this, you can multiply the original windowed signal with something like a Hanning window.
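For example, applying a Hanning window to one analysis frame before the FFT looks like this (a self-contained sketch; the 20 kHz rate and 200 Hz tone are placeholder numbers):

import numpy as np

fft_length = 2048
t = np.arange(fft_length) / 20000.0
frame = np.sin(2 * np.pi * 200.0 * t)    # one analysis frame
window = np.hanning(fft_length)          # tapers the frame edges to zero
spectrum = np.abs(np.fft.rfft(frame * window))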
It's worth mentioning for a 1D FFT that the first element (index [0]) contains the DC (zero-frequency) term, the elements [1:N/2] contain the positive frequencies and the elements [N/2+1:N-1] contain the negative frequencies. Since you didn't provide a code sample or additional information about the output of your FFT, I can't rule out the possibility that the "noisy power values at non-existent frequencies" aren't just the negative frequencies of your spectrum.
EDIT: Here is an example of a radix-2 FFT implemented in pure Python, with a simple test routine that finds the FFT of a rectangular pulse, [1.,1.,1.,1.,0.,0.,0.,0.]. You can run the example on codepad and see that the FFT of that sequence is:
[0j, Negative frequencies
(1+0.414213562373j), ^
0j, |
(1+2.41421356237j), |
(4+0j), <= DC term
(1-2.41421356237j), |
0j, v
(1-0.414213562373j)] Positive frequencies
Note that the code prints out the Fourier coefficients in order of ascending frequency, i.e. from the highest negative frequency, up through DC, and on to the highest positive frequency.
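The same values can be checked with numpy; np.fft.fft returns DC first, so fftshift is what produces the ascending-frequency ordering shown above:

import numpy as np

x = [1., 1., 1., 1., 0., 0., 0., 0.]
X = np.fft.fft(x)   # natural order: DC, positive, then negative frequencies
print(np.round(np.fft.fftshift(X), 3))
# ascending order: [0, 1+0.414j, 0, 1+2.414j, 4, 1-2.414j, 0, 1-0.414j]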
I don't know enough from your question to actually answer anything specific.
But here are a couple of things to try from my own experience writing FFTs:
Make sure you are following the Nyquist rule.
If you are viewing the linear output of the FFT, you will have trouble seeing your own signal and think everything is broken. Make sure you are looking at the dB of your FFT magnitude (i.e. "plot(10*log10(abs(fft(x))))").
Create a unit test for your FFT() function by feeding it generated data like a pure tone, then feed the same generated data to Matlab's FFT(). Take the absolute difference between the two output data series and make sure the maximum absolute difference is something like 10^-6, i.e. the only difference is caused by small floating-point errors (see the sketch after this list).
Make sure you are windowing your data.
If all three of those things check out, then your FFT is fine, and your input data is probably the issue.
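A minimal numpy version of that unit-test idea, using a known closed form instead of Matlab (a pure cosine at an exact bin k shows up as N/2 in bins k and N-k):

import numpy as np

N, k = 1024, 37   # arbitrary transform size and test-tone bin
x = np.cos(2 * np.pi * k * np.arange(N) / N)

expected = np.zeros(N, dtype=complex)
expected[k] = expected[N - k] = N / 2.0
assert np.max(np.abs(np.fft.fft(x) - expected)) < 1e-6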
Check the input data to see if there is clipping http://www.users.globalnet.co.uk/~bunce/clip.gif
Time-domain clipping shows up as mirror images of the signal in the frequency domain, at specific regular intervals, with less amplitude.