Understanding the shape of spectrograms and n_mels - python

I am going through these two librosa docs: melspectrogram and stft.
I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:
(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
dur = librosa.get_duration(y=waveform, sr=sample_rate)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)
print(sample_rate)
print(dur)
Output:
torch.Size([128, 150])
22050
3.48
What I understand so far:
Sample rate means you get N samples each second, in this case 22050 samples per second.
The window length is the stretch of audio over which each FFT is calculated.
STFT is the calculation of the FFT over small, successive windows of the audio.
The shape of the output is (n_mels, t), where t = duration / window_of_fft.
I am trying to understand or calculate:
What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.
This means that in each window 2048 samples are taken, i.e. 2048 * 1/22050 ≈ 93 ms. So the FFT is being calculated for every 93 ms of the audio?
So this means the window size and window function are used to filter the signal within that frame?
In the example above, I understand I am getting 128 Mel bands, but what exactly does that mean?
And what is hop_length? Reading the docs, I understand it is how far the window is shifted from one FFT window to the next, right? If this value is 512 and n_fft is also 512, what does that mean? Does this mean that it will take a window of 23 ms, calculate the FFT for this window, and skip the next 23 ms?
How can I specify that I want to overlap from one FFT window to another?
Please help, I have watched many videos of calculating spectrograms but I just can't seem to see it in real life.

The essential parameter to understanding the output dimensions of spectrograms is not necessarily the length of the used FFT (n_fft), but the distance between consecutive FFTs, i.e., the hop_length.
When computing an STFT, you compute the FFT for a number of short segments. These segments have the length n_fft. Usually these segments overlap (in order to avoid information loss), so the distance between two segments is often not n_fft, but something like n_fft/2. The name for this distance is hop_length. It is also defined in samples.
So when you have 1000 audio samples, and the hop_length is 100, you get 10 feature frames (note that, if n_fft is greater than hop_length, you may need to pad).
In your example, you are using the default hop_length of 512. So for audio sampled at 22050 Hz, you get a feature frame rate of
frame_rate = sample_rate/hop_length = 22050 Hz/512 = 43 Hz
Again, padding may change this a little.
So for 10s of audio at 22050 Hz, you get a spectrogram array with the dimensions (128, 430), where 128 is the number of Mel bins and 430 the number of features (in this case, Mel spectra).
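To make this concrete, here is a minimal sketch (assuming librosa is installed; 'audio_file.wav' is just a placeholder name) that checks the frame count against the hop_length reasoning above:
import librosa
waveform, sample_rate = librosa.load('audio_file.wav')   # resampled to 22050 Hz by default
hop_length = 512
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                             n_fft=2048, hop_length=hop_length, n_mels=128)
# With librosa's default centering/padding, the number of frames is:
expected_frames = 1 + len(waveform) // hop_length
print(spectrogram.shape)      # (128, expected_frames)
print(expected_frames)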

Related

How to replicate 128 Kaldi-compatible Mel-frequency bands?

Hi I am trying to replicate the following pre-processing pipeline from the paper 'Masked Autoencoders that Listen' (https://doi.org/10.48550/arXiv.2207.06405):
we transform raw waveform (pre-processed as mono channel under 16,000 sampling rate) into 128 Kaldi [54]-compatible Mel-frequency bands with a 25ms Hanning window that shifts every 10 ms. For a 10-second recording in AudioSet, the resulting spectrogram is of 1×1024×128 dimension.
Starting from 10 s of audio with a sampling rate of 16 kHz, I am trying the following function from torchaudio:
fbank = torchaudio.compliance.kaldi.fbank(signal, htk_compat=True, sample_frequency=16000, use_energy=False, window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
However, I am not obtaining the expected spectrogram: the expected shape is (1, 1024, 128), but instead I am obtaining (1, 998, 128).
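A hedged sketch of where the 998 may come from (assuming Kaldi-style framing that keeps only fully valid windows, i.e. snip_edges behaviour; the idea that the paper pads up to 1024 is my assumption, not stated in the quoted passage):
sample_rate = 16000
num_samples = 10 * sample_rate            # 10 s of audio
frame_length = int(0.025 * sample_rate)   # 25 ms window -> 400 samples
frame_shift = int(0.010 * sample_rate)    # 10 ms shift  -> 160 samples
num_frames = 1 + (num_samples - frame_length) // frame_shift
print(num_frames)                         # 998
print(1024 - num_frames)                  # 26 frames of padding would reach the paper's 1024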

Usage of signal.welch

I want to use python signal.welch. The usage of signal.welch is as follows,
f, Pxx_den = signal.welch(x, fs, nperseg=1024)
In my case, x is, for example, a gyroscope signal (1 x 1024 samples, about 10 s of data) and fs = 100 Hz. How can I decide on nperseg? I want to know how to select nperseg when the number of input samples is 1024 (about 10 s).
scipy.signal.welch estimates the power spectral density by dividing the data into segments and averaging periodograms computed on each segment. The nperseg arg is the segment length and (by default) also determines the FFT size.
On the one hand, making nperseg smaller allows the input to divide into more segments, good for more averaging to get a more reliable estimate. On the other hand, making nperseg larger improves the frequency resolution of the result. In any case, nperseg should be smaller than the input size in order to get multiple segments.
The default segment length is 256 samples, which seems like a reasonable starting point for a 1024-sample input.
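As a rough illustration (a sketch assuming a synthetic 1024-sample signal at fs = 100 Hz), you can see the trade-off by varying nperseg:
import numpy as np
from scipy import signal
fs = 100.0
t = np.arange(1024) / fs                                        # about 10 s of data
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)   # 5 Hz tone plus noise
for nperseg in (128, 256, 512):
    f, Pxx = signal.welch(x, fs, nperseg=nperseg)
    # smaller nperseg -> more segments averaged; larger nperseg -> finer frequency spacing
    print(nperseg, len(f), round(f[1] - f[0], 3))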

What is the second number in the MFCCs array?

When I extract MFCCs from an audio file, the output is (13, 22). What does the second number represent? Is it time frames? I use librosa.
The code I use is:
mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13, hop_length=256)
mfccs
print(mfccs.shape)
And the output is (13, 22).
Yes, it is time frames and mainly depends on how many samples you provide via y and what hop_length you choose.
Example
Say you have 10s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like a STFT, melspectrogram, or MFCC, so-called feature frames are computed.
The question is, how many (feature) frames do you get for your 10s of audio?
The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1D audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment, and moves on to the next segment. These segments are usually overlapping. The distance between two such segments is hop_length, and it is specified in number of samples. It may be identical to n_fft, but oftentimes hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:
number_of_samples / hop_length = number_of_frames
So for our examples, this would be:
220500 / 256 = 861.3
So we get about 861 frames.
Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:
frame_rate = sample_rate / hop_length = 86.13
To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).
frames = frame_rate * audio_in_seconds
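A minimal sketch (assuming librosa and some audio file; 'audio.wav' is just a placeholder) that compares this estimate with the actual MFCC shape:
import librosa
y, sr = librosa.load('audio.wav')         # resampled to 22050 Hz by default
hop_length = 256
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
frame_rate = sr / hop_length              # ~86.13 frames per second
estimated_frames = frame_rate * (len(y) / sr)
print(mfccs.shape)                        # (13, number_of_frames)
print(round(estimated_frames))            # close to mfccs.shape[1], padding aside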

How can we use scipy.signal.resample to downsample a speech signal from 44100 to 8000 Hz?

fs, s = wav.read('wave.wav')
This signal has a 44100 Hz sampling frequency; I want to downsample it to 8 kHz using
scipy.signal.resample(s, s.size/5.5125), but the second argument can't be a float, so how can we use this function for resampling the speech signal?
How can we use scipy.signal.resample to downsample the speech signal from 44100 to 8000 Hz in Python?
Okay then, another solution, this one actually with scipy, just what you asked for.
This is the doc string of scipy.signal.resample():
"""
Resample `x` to `num` samples using Fourier method along the given axis.
The resampled signal starts at the same value as `x` but is sampled
with a spacing of ``len(x) / num * (spacing of x)``. Because a
Fourier method is used, the signal is assumed to be periodic.
Parameters
----------
x : array_like
The data to be resampled.
num : int
The number of samples in the resampled signal.
t : array_like, optional
If `t` is given, it is assumed to be the sample positions
associated with the signal data in `x`.
axis : int, optional
The axis of `x` that is resampled. Default is 0.
window : array_like, callable, string, float, or tuple, optional
Specifies the window applied to the signal in the Fourier
domain. See below for details.
Returns
-------
resampled_x or (resampled_x, resampled_t)
Either the resampled array, or, if `t` was given, a tuple
containing the resampled array and the corresponding resampled
positions.
Notes
-----
The argument `window` controls a Fourier-domain window that tapers
the Fourier spectrum before zero-padding to alleviate ringing in
the resampled values for sampled signals you didn't intend to be
interpreted as band-limited.
If `window` is a function, then it is called with a vector of inputs
indicating the frequency bins (i.e. fftfreq(x.shape[axis]) ).
If `window` is an array of the same length as `x.shape[axis]` it is
assumed to be the window to be applied directly in the Fourier
domain (with dc and low-frequency first).
For any other type of `window`, the function `scipy.signal.get_window`
is called to generate the window.
The first sample of the returned vector is the same as the first
sample of the input vector. The spacing between samples is changed
from dx to:
dx * len(x) / num
If `t` is not None, then it represents the old sample positions,
and the new sample positions will be returned as well as the new
samples.
"""
As you should know, 8000 Hz means that one second of your signal contains 8000 samples, and for 44100 Hz, it means that one second contains 44100 samples.
Then just calculate how many samples you need for 8000 Hz and use that number as the second argument to scipy.signal.resample().
You may use the method that Nathan Whitehead used in a resample function that I copied into another answer (with scaling),
or go through time, i.e.
secs = len(X)/44100.0          # Number of seconds in signal X
samps = int(secs*8000)         # Number of samples to downsample to (must be an int)
Y = scipy.signal.resample(X, samps)
This I picked from the SWMixer module written by Nathan Whitehead:
import numpy

def resample(smp, scale=1.0):
    """Resample a sound to be a different length.

    Sample must be mono. May take some time for longer sounds
    sampled at 44100 Hz.

    Keyword arguments:
    scale - scale factor for length of sound (2.0 means double length)
    """
    # f*ing cool, numpy can do this with one command
    # calculate new length of sample
    n = int(round(len(smp) * scale))
    # use linear interpolation
    # endpoint keyword means that linspace doesn't go all the way to 1.0
    # If it did, there are some off-by-one errors
    # e.g. scale=2.0, [1,2,3] should go to [1,1.5,2,2.5,3,3]
    # but with endpoint=True, we get [1,1.4,1.8,2.2,2.6,3]
    # Both are OK, but since resampling will often involve
    # exact ratios (i.e. for 44100 to 22050 or vice versa)
    # using endpoint=False gets less noise in the resampled sound
    return numpy.interp(
        numpy.linspace(0.0, 1.0, n, endpoint=False),         # where to interpolate
        numpy.linspace(0.0, 1.0, len(smp), endpoint=False),  # known positions
        smp,                                                 # known data points
    )
So, if you are using scipy, that means you have numpy too. If scipy is not a "must", use this; it works perfectly.
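For the question's specific case, a usage sketch (assuming smp already holds a mono signal read at 44100 Hz) might look like:
downsampled = resample(smp, scale=8000.0 / 44100.0)   # 44100 Hz -> 8000 Hz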
Use this code if you want to resample all wav files in a folder (it assumes df is a DataFrame with a filepath column listing the files):
import numpy as np
from tqdm.notebook import tqdm
from scipy.io import wavfile
from scipy.signal import resample

def scip_rs(file):
    try:
        sr, wv = wavfile.read(file)
        if sr == 16000:
            pass
        else:
            sec = len(wv) / sr
            nsmp = int(sec * 16000) + 1
            rwv = resample(wv, nsmp)
            # scale 16-bit integer range down to floats and overwrite at 16 kHz
            wavfile.write(file, 16000, rwv / (2**15 - 1))
    except Exception as e:
        print('Error in file:', file, e)

for file in tqdm(np.array(df.filepath.tolist())):
    scip_rs(file)

How to implement/perform DFT on a segment in python?

I am trying to write a simple program in python that will calculate and display DFT output of 1 segment.
My signal is 3 seconds long, I want to calculate DFT for every 10ms long segment. Sampling rate is 44100. So one segment is 441 samples long.
Since I am still in the testing phase and the original program is much larger (speech recognition), here is an isolated part for testing purposes that unfortunately behaves oddly. Either that, or it's my lack of knowledge on the subject.
I read somewhere that DFT input should be rounded to a power of 2, so I arranged my array to 512 instead of 441. Is this true?
If I am sampling at a rate of 44100, at most I can reach a frequency of 22050 Hz, and for a sample of length 512 (~441), at least 100 Hz?
If 2. is true, then I can have all frequencies between 100 Hz and 22050 Hz in that 10 ms segment, but the length of the segment is only 512 (441) samples, and the output of fft returns an array of 256 (220) values; they cannot contain all 21950 frequencies, can they?
My first guess is that the values in the output of fft should be multiplied by 100, since 10 ms is a 100th of a second. Is this good reasoning?
The following program for two given frequencies, 1000 and 2000, returns two spikes on the graph at positions 24 and 48 in the output array, and at ~2071 and ~4156 on the graph. Since the ratio of the numbers is okay (2000:1000 = 48:24), I wonder if I should ignore some starting part of the fft output?
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft   # fft is used below but was not imported
t = np.arange(0, 1, 1/512.0) # We create 512 long array
# We calculate here two sinusoids together at 1000hz and 2000hz
y = np.sin(2*np.pi*1000*t) + np.sin(2*np.pi*2000*t)
n = len(y)
k = np.arange(n)
# Problematic part is around here, I am not quite sure what
# should be on the horizontal line
T = n/44100.0
frq = k/T
frq = frq[range(n // 2)]
Y = fft(y)
Y = Y[range(n // 2)]
# Convert from complex numbers to magnitudes
iY = []
for f in Y:
    iY.append(np.sqrt(f.imag * f.imag + f.real * f.real))
plt.plot(frq, iY, 'r')
plt.xlabel('freq (HZ)')
plt.show()
I read somewhere that the DFT input should be rounded to power of 2 so I arranged my array to 512 instead 441. Is this true?
The DFT is defined for all sizes. However, implementations of the DFT such as the FFT are generally much more efficient for sizes which can be factored in small primes. Some library implementations have limitations and do not support sizes other than powers of 2, but that isn't the case with numpy.
If I am sampling at a rate of 44100, at most I can reach frequency of 22050Hz and for sample of length 512(~441) at least 100Hz?
The highest frequency for even sized DFT will be 44100/2 = 22050Hz as you've correctly pointed out. Note that for odd sized DFT the highest frequency bin will correspond to a frequency slightly less than the Nyquist frequency. As for the minimum frequency, it will always be 0Hz. The next non-zero frequency will be 44100.0/N where N is the DFT length in samples (which gives 100Hz if you are using a DFT length of 441 samples and ~86Hz with a DFT length of 512 samples).
If 2) is true, then I can have all frequencies between 100Hz and 22050Hz in that 10ms segments, but the length of segment is 512(441) samples only, output of fft returns array of 256(220) values, they cannot contain all 21950 frequencies in there, can they?
First, there aren't 21950 frequencies between 100 Hz and 22050 Hz, since frequencies are continuous and not limited to integer values. That said, you are correct in your realization that the output of the DFT will be limited to a much smaller set of frequencies. More specifically, the DFT represents the frequency spectrum at discrete frequency steps: 0, 44100/N, 2*44100/N, ...
My first guess is that the values in output of FFT should be multiplied by 100, since 10ms is 100th of a second. Is this good reasoning?
There is no need to multiply the FFT output by 100. But if you meant multiples of 100Hz with a DFT of length 441 and a sampling rate of 44100Hz, then your guess would be correct.
The following program for two given frequencies 1000 and 2000 returns two spikes on graph at positions 24 and 48 in the output array and ~2071 and ~4156 on the graph. Since ratio of numbers is okay (2000:1000 = 48:24) I wonder if I should ignore some starting part of the fft output?
Here the problem is more significant. As you declare the array
t = np.arange(0, 1, 1/512.0) # We create 512 long array
you are in fact representing a signal with a sampling rate of 512Hz instead of 44100Hz. As a result the tones you are generating are severely aliased (to 24Hz and 48Hz respectively). This is further compounded by the fact that you then use a sampling rate of 44100Hz for the frequency axis conversion. This is why the peaks are not appearing at the expected 1000Hz and 2000Hz frequencies.
To represent 512 samples of a signal sampled at a rate of 44100Hz, you should instead use
t = np.arange(512) / 44100.0  # 512 samples spaced 1/44100 s apart
at which point the formula you used for the frequency axis would be correct (since it is based on the same 44100 Hz sampling rate). You should then be able to see peaks near the expected 1000 Hz and 2000 Hz (the closest frequency bins of the peaks being at ~1033 Hz and ~1981 Hz).
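For illustration, a corrected sketch along these lines (assuming numpy and matplotlib) would be:
import numpy as np
import matplotlib.pyplot as plt
fs = 44100.0
n = 512
t = np.arange(n) / fs                                  # 512 samples at a true 44100 Hz rate
y = np.sin(2*np.pi*1000*t) + np.sin(2*np.pi*2000*t)    # 1 kHz + 2 kHz tones
Y = np.fft.fft(y)
frq = np.arange(n) * fs / n                            # bin centers in Hz
half = n // 2
plt.plot(frq[:half], np.abs(Y[:half]), 'r')            # peaks near ~1033 Hz and ~1981 Hz
plt.xlabel('freq (Hz)')
plt.show()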
1) I read somewhere that DFT input should be rounded to power of 2 so
I aranged my array to 512 instead 441. Is this true?
Yes, FFT implementations are typically fastest when the length is a power of two, so just pad the input with zeros to reach 512.
2) If I am sampling at a rate of 44100, at most I can reach frequency
of 22050hz and for sample of length 512(~441) at least 100hz ?
Yes, the highest frequency you can get is half the sampling rate; it's called the Nyquist frequency.
No, the lowest frequency bin you get (the first bin of the DFT) is called the DC component and marks the average of the signal. The next lowest frequency bin in your case is 22050 / 256 ≈ 86 Hz, and then 172 Hz, 258 Hz, and so on up to 22050 Hz.
You can get these frequencies with the numpy.fft.fftfreq() function.
3) If 2) is true, then I can have all frequencies between 100hz and
22050hz in that 10ms segments, but the length of segment is 512(441)
samples only, output of fft returns array of 256(220) values, they
cannot contain all 21950 frequencies in there, can they?
The DFT doesn't lose the original signal's data, but its frequency resolution is coarse when the DFT size is small. You may zero-pad the input to make the DFT size larger, such as 1024 or 2048.
The DFT bin refers to a frequency range centered at each of the N output points. The width of the bin is sample rate/N, and it extends from: center frequency - (sample rate/N)/2 to center frequency + (sample rate/N)/2. In other words, half of the bin extends below each of the N output points, and half above it.
4) My first guess is that the values in output of fft should be
multiplied by 100, since 10ms is 100th of a second. Is this good
reasoning?
No, the values should not be multiplied if you want to preserve the magnitudes.
The following program for two given frequencies 1000 and 2000 returns
two spikes on graph at positions 24 and 48 in the output array and
~2071 and ~4156 on the graph. Since ratio of numbers is okay
(2000:1000 = 48:24) I wonder if I should ignore some starting part of
the fft output?
The DFT result is mirrored for real input. In other words, your frequencies will be like this:
n    0    1    2    3    4    ...  255    256    257    ...  510   511
Hz   DC   86   172  258  344  ...  21964  22050  21964  ...  172   86
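A small sketch (assuming numpy) showing these bin frequencies and the mirroring with numpy.fft.fftfreq:
import numpy as np
n, fs = 512, 44100.0
freqs = np.fft.fftfreq(n, d=1.0/fs)   # bin frequencies; the negative half is the mirror
print(freqs[0], freqs[1], freqs[2])   # 0.0 (DC), ~86.1, ~172.3
print(freqs[n // 2 - 1])              # ~21964, just below the Nyquist frequency
print(freqs[-1], freqs[-2])           # ~-86.1, ~-172.3: the mirrored half for real input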
