How to replicate 128 Kaldi-compatible Mel-frequency bands? - python

Hi, I am trying to replicate the following pre-processing pipeline from the paper 'Masked Autoencoders that Listen' (https://doi.org/10.48550/arXiv.2207.06405):
we transform raw waveform (pre-processed as mono channel under 16,000 sampling rate) into 128 Kaldi [54]-compatible Mel-frequency bands with a 25ms Hanning window that shifts every 10 ms. For a 10-second recording in AudioSet, the resulting spectrogram is of 1×1024×128 dimension.
Starting from a 10 s audio clip with a sampling rate of 16 kHz, I am trying the following function from torchaudio:
fbank = torchaudio.compliance.kaldi.fbank(signal, htk_compat=True, sample_frequency=16000, use_energy=False, window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
However, I am not able to obtain the expected spectrogram: the expected shape is (1x1024x128), but I am obtaining (1x998x128).
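For what it's worth, 998 is exactly the number of frames a 25 ms window with a 10 ms shift produces on 10 s of 16 kHz audio: 1 + floor((160000 - 400) / 160) = 998. Pipelines that require a fixed 1024-frame input typically zero-pad (or truncate) the fbank along the time axis afterwards; below is a minimal sketch of that step, assuming a target length of 1024 and an audio file name of my choosing:

import torch
import torchaudio

signal, sr = torchaudio.load("audio.wav")   # assumed: 10 s mono clip at 16 kHz
fbank = torchaudio.compliance.kaldi.fbank(
    signal, htk_compat=True, sample_frequency=16000, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0,
    frame_shift=10, frame_length=25)

target_length = 1024
n_frames = fbank.shape[0]                   # 998 for a 10 s clip
pad = target_length - n_frames
if pad > 0:
    fbank = torch.nn.functional.pad(fbank, (0, 0, 0, pad))   # zero-pad the time axis
elif pad < 0:
    fbank = fbank[:target_length, :]                         # truncate the time axis

fbank = fbank.unsqueeze(0)                  # -> (1, 1024, 128)
print(fbank.shape)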

Related

Librosa (Python) to Meyda (Node.js) conversion

I am converting a Python program to Node.js. The program follows these steps:
Microphone listens with callbacks
Callbacks do a Librosa "log_mel_S" extraction
The "log_mel_S" is inferenced by an AI model
Sound is labeled
I have managed to translate all of the steps and their related parts from Python to Node.js, except for the Librosa extraction.
This would be an example for the audio shape and type required:
audio_sample = numpy.zeros(shape=(1024, 100), dtype=numpy.float32)
And this is the Librosa piece I need help translating:
S = numpy.abs(librosa.stft(y=audio_sample, n_fft=1024, hop_length=500)) ** 2
mel_S = numpy.dot(librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64), S).T
log_mel_S = librosa.power_to_db(mel_S, ref=1.0, amin=1e-10, top_db=None)
I found the package Meyda, and it looks like it could be a good substitute, but I am not sure how I should approach this. It is unclear to me what exactly is being extracted by Librosa, so I cannot map it to terms like Amplitude Spectrum, Power Spectrum, etc.
Please help me understand and translate this action.
TL;DR
The Amplitude Spectrum is basically the magnitude of the FFT of the signal, and the Power Spectrum is the squared Amplitude Spectrum, which is also sometimes referred to as energy.
Here is one example that calculates the Amplitude Spectrum with Meyda: https://github.com/catalli/audiotrainer-server/blob/df41322906c88cd6f899e8f9b9661ebb949f72e1/index.js#L17
Long answer:
Now, let's look at your code sample line by line and figure out what it is doing and how to implement it in JavaScript.
S = numpy.abs(librosa.stft(y=audio_sample, n_fft=1024, hop_length=500)) ** 2
This computes the squared magnitude of a 1024-bin FFT of audio_sample, which is the Power Spectrum (i.e., the Amplitude Spectrum squared). Please note that the abs of a complex number is its vector length: sqrt(real_part^2 + imag_part^2).
mel_S = numpy.dot(librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64), S).T
This is a mel spectrogram calculation (not yet an MFCC, which would additionally require a DCT step): a product of a predefined mel filter bank and the squared FFT.
log_mel_S = librosa.power_to_db(mel_S, ref=1.0, amin=1e-10, top_db=None)
This last one converts the result to decibel (dB) units: 10 * log10(S / ref).
I will extend this answer with a JS code sample later; I am submitting it now because I think it will already be helpful as it is.
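In the meantime, here is a NumPy/librosa sketch of what those three lines compute on a plain 1-D signal (the (1024, 100) array in the question would be treated as multi-channel input by librosa; a mono signal keeps the shapes simple here). Whatever Meyda feature you end up using has to reproduce the same three stages: power spectrum, mel filter bank, dB conversion.

import numpy as np
import librosa

sr = 44100
y = np.random.randn(sr).astype(np.float32)   # 1 s of noise as a stand-in signal

# 1) Power spectrum: |STFT|^2, shape (1 + n_fft/2, n_frames) = (513, n_frames)
S = np.abs(librosa.stft(y=y, n_fft=1024, hop_length=500)) ** 2

# 2) Mel spectrogram: project the 513 FFT bins onto 64 mel bands, then transpose
mel_basis = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=64)   # (64, 513)
mel_S = np.dot(mel_basis, S).T                                  # (n_frames, 64)

# 3) Convert power to decibels: 10 * log10(mel_S / ref)
log_mel_S = librosa.power_to_db(mel_S, ref=1.0, amin=1e-10, top_db=None)
print(log_mel_S.shape)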

What is the second number in the MFCCs array?

When I extract MFCCs from an audio file, the output shape is (13, 22). What does the second number represent? Is it time frames? I use librosa.
The code I use is:
mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13, hop_length=256)
mfccs
print(mfccs.shape)
And the output is (13, 22).
Yes, it is the number of time frames, and it mainly depends on how many samples you provide via y and what hop_length you choose.
Example
Say you have 10s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like a STFT, melspectrogram, or MFCC, so-called feature frames are computed.
The question is, how many (feature) frames do you get for your 10s of audio?
The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1d audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment and moves on to the next segment. These segments are usually overlapping. The distance between two such segments is hop_length and it is specified in number of samples. It may be identical to n_fft, but often times hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:
number_of_samples / hop_length = number_of_frames
So for our examples, this would be:
220500 / 256 = 861.3
So we get about 861 frames.
Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:
frame_rate = sample_rate / hop_length = 86.13
To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).
frames = frame_rate * audio_in_seconds
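A quick way to sanity-check this is to run the computation on a synthetic signal of known length; here is a minimal sketch (10 s of silence at librosa's default sample rate, hop_length as in the question):

import numpy as np
import librosa

sr = 22050                       # librosa's default sample rate
hop_length = 256
y = np.zeros(10 * sr, dtype=np.float32)   # 10 s of silence, just to check shapes

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
print(mfccs.shape)               # (13, 862): ~861 frames plus one extra from centering/padding
print(sr / hop_length)           # ~86.13 feature frames per second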

Understanding the shape of spectrograms and n_mels

I am going through these two librosa docs: melspectrogram and stft.
I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:
(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
dur = librosa.get_duration(y=waveform)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)
print(sample_rate)
print(dur)
Output:
torch.Size([128, 150])
22050
3.48
What I get are the following points:
Sample rate means you get N samples each second, in this case 22,050 samples per second.
The window length is the segment of audio over which each FFT is calculated.
STFT is the calculation of the FFT over small windows of time in the audio.
The shape of the output is (n_mels, t). t = duration/window_of_fft.
I am trying to understand or calculate:
What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.
This means that in each window 2048 samples are taken, which corresponds to 1/22050 * 2048 = 93 ms. So the FFT is being calculated for every 93 ms of the audio?
So, does this mean that the window size and window function are used to filter the signal in this frame?
In the example above, I understand I am getting 128 Mel bands, but what exactly does that mean?
And what is hop_length? Reading the docs, I understand that it is how far the window shifts from one FFT window to the next, right? If this value is 512 and n_fft is also 512, what does that mean? Does this mean that it will take a window of 23 ms, calculate the FFT for this window, and then jump straight to the next 23 ms?
How can I specify that I want to overlap from one FFT window to another?
Please help, I have watched many videos of calculating spectrograms but I just can't seem to see it in real life.
The essential parameter to understanding the output dimensions of spectrograms is not necessarily the length of the used FFT (n_fft), but the distance between consecutive FFTs, i.e., the hop_length.
When computing an STFT, you compute the FFT for a number of short segments. These segments have the length n_fft. Usually these segments overlap (in order to avoid information loss), so the distance between two segments is often not n_fft, but something like n_fft/2. The name for this distance is hop_length. It is also defined in samples.
So when you have 1000 audio samples, and the hop_length is 100, you get 10 feature frames (note that, if n_fft is greater than hop_length, you may need to pad).
In your example, you are using the default hop_length of 512. So for audio sampled at 22050 Hz, you get a feature frame rate of
frame_rate = sample_rate/hop_length = 22050 Hz/512 = 43 Hz
Again, padding may change this a little.
So for 10s of audio at 22050 Hz, you get a spectrogram array with the dimensions (128, 430), where 128 is the number of Mel bins and 430 the number of frames (each frame being one Mel spectrum).
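To make the arithmetic concrete, here is a small sketch with the shapes spelled out (10 s of audio at 22050 Hz, librosa's default hop_length of 512, 128 mel bins; the noise signal is just a placeholder):

import numpy as np
import librosa

sr = 22050
hop_length = 512                          # librosa's default
y = np.random.randn(10 * sr).astype(np.float32)   # 10 s of noise

S = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length, n_mels=128)
print(S.shape)                            # (128, 431): ~430 frames plus one from centering/padding
print(sr / hop_length)                    # ~43.07 feature frames per second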

Create custom convolution layer and compare two keras layers

I am currently creating a network in keras to perform harmonic/percussive source separation on an audio spectrogram using a median filtering technique (http://dafx10.iem.at/papers/DerryFitzGerald_DAFx10_P15.pdf).
Given an input magnitude spectrogram S, and denoting the i-th time frame as S_i and the h-th frequency slice as S_h, a percussion-enhanced spectrogram frame P_i can be generated by performing median filtering on S_i: P_i = M{S_i, l_perc}, where M denotes median filtering and l_perc is the filter length. The individual percussion-enhanced frames P_i are then combined to yield a percussion-enhanced spectrogram P. Similarly, a harmonic-enhanced spectrogram frequency slice H_h can be obtained by median filtering the frequency slice S_h: H_h = M{S_h, l_harm}.
Once you have P and H, you can decide whether each bin S_h,i belongs to the harmonic or the percussive source: if H_h,i > P_h,i, then S_h,i goes to the harmonic spectrogram and takes the value 0 in the percussive spectrogram, and vice versa.
In my network, given the input spectrogram and for a specific time frame S_i, I need to compute the medians horizontally for each frequency h. This can easily be done with a lambda layer and TensorFlow:
layer_H = Lambda(lambda x:tf.contrib.distributions.percentile(x[0], 50, axis=0))(layer)
Here, the length of the harmonic median filter l_harm is the horizontal length of the input spectrogram. The output is a vector whose size is equal to the number of frequencies (in my case, 88).
The next step is where I am stuck right now: I need to compute the medians vertically for the current time frame S_i, given the length of the percussive median filter l_perc, and knowing that I want the resulting vector to be the same size as the input, so I have to be careful at each end of the input (the size of the filter will be between l_perc and l_perc/2 depending on where we are). This looks like some sort of convolution, for lack of a better word.
Once I have the two resulting vectors H_i and P_i, I want to compare them and assign each value of the original frame S_i to either a percussive layer (Lp) or a harmonic layer (Lh). So, I have three different inputs, H_i, P_i and S_i, and I want to end up with Lp and Lh by comparing H_i and P_i, and continue building my network from there. If H_i,j > P_i,j, then Lp_i,j = 0 and Lh_i,j = S_i,j.
To sum up, I am stuck on two different problems:
How do I compute the vertical medians?
How do I implement in the network the operations that will allow me to go from H_i, P_i and S_i to Lp and Lh?
Thank you very much in advance!
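For reference, here is a NumPy/SciPy sketch (outside of Keras) of the median filtering and masking described in the question, just to make the operations concrete; the filter lengths are placeholders, and scipy.ndimage.median_filter's default reflect-padding is one way to sidestep the variable filter length at the spectrogram edges:

import numpy as np
from scipy.ndimage import median_filter

def hpss_layers(S, l_harm=17, l_perc=17):
    # S is a magnitude spectrogram of shape (n_freq, n_time).
    # Harmonic enhancement: median over l_harm time frames for each frequency slice.
    H = median_filter(S, size=(1, l_harm))
    # Percussive enhancement: median over l_perc frequency bins for each time frame.
    P = median_filter(S, size=(l_perc, 1))
    harmonic_mask = H >= P                 # ties assigned to the harmonic layer here
    Lh = np.where(harmonic_mask, S, 0.0)   # harmonic layer
    Lp = np.where(harmonic_mask, 0.0, S)   # percussive layer
    return Lh, Lp

S = np.abs(np.random.randn(88, 200)).astype(np.float32)   # 88 frequency bins, as in the question
Lh, Lp = hpss_layers(S)

Inside a Keras/TensorFlow model, the comparison step could be expressed with the same element-wise operations (a boolean mask applied to S), e.g. inside a Lambda layer.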

"Constant" audio waveform generation using python?

I'm looking for something to generate a "constant" audio waveform, which looks like this: [image: a constant waveform]
I have a set of analog datas:
12766:149
12786:0
13339:149
13359:0
13721:57
13741:0
15249:255
15269:0
15822:87
where the format is time_in_ms:amplitude. I want to output them through a headphone jack.
I think the "constant waveform" shown above can be considered as a combination of multiple square waves of very low frequency with different amplitudes, each lasting 20 ms.
Is that possible? What is out there that I can use to do this?
I guess you can simply write raw PCM audio frames into a .wav file using the wave module (https://docs.python.org/3/library/wave.html). If your amplitude ranges from 0 to 255 inclusive, it's probably easiest to create a wav file with a sampwidth of 1 (byte) and just write the amplitudes as bytes into the sample frames.
with wave.open("test.wav", "w") as w:
    w.setnchannels(1)
    w.setsampwidth(1)
    w.setframerate(4000)                        # 4000 samples/sec
    w.writeframes(bytearray([100] * 4000))      # 4000 samples of amplitude 100
The above creates a small file 'test.wav' that is a mono 8-bit audio waveform of constant amplitude 100. Change the code accordingly to write the amplitude values from your input file, and adjust the sample rate as required.
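Building on that, here is a hedged sketch of how the time_in_ms:amplitude pairs from the question could be turned into such a file, assuming each amplitude is held until the next timestamp (the file name and 4000 Hz sample rate are arbitrary choices, and the leading silence before the first timestamp is skipped):

import wave

# (time_in_ms, amplitude) pairs from the question
events = [(12766, 149), (12786, 0), (13339, 149), (13359, 0), (13721, 57),
          (13741, 0), (15249, 255), (15269, 0), (15822, 87)]

sample_rate = 4000
frames = bytearray()

# Hold each amplitude from its timestamp until the next one.
# The final event has no end time, so it is not written here.
for (t_ms, amp), (next_ms, _) in zip(events, events[1:]):
    n_samples = (next_ms - t_ms) * sample_rate // 1000
    frames.extend([amp] * n_samples)

with wave.open("events.wav", "w") as w:
    w.setnchannels(1)               # mono
    w.setsampwidth(1)               # 8-bit unsigned samples
    w.setframerate(sample_rate)
    w.writeframes(bytes(frames))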
