What is the second number in the MFCCs array? - python

When I extract MFCCs from an audio the ouput is (13, 22). What does the number represent? Is it time frames ? I use librosa.
The code is use is:
mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13, hop_length=256)
mfccs
print(mfccs.shape)
And the ouput is (13,22).

Yes, it is time frames and mainly depends on how many samples you provide via y and what hop_length you choose.
Example
Say you have 10s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like a STFT, melspectrogram, or MFCC, so-called feature frames are computed.
The question is, how many (feature) frames do you get for your 10s of audio?
The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1d audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment and moves on to the next segment. These segments are usually overlapping. The distance between two such segments is hop_length and it is specified in number of samples. It may be identical to n_fft, but often times hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:
number_of_samples / hop_length = number_of_frames
So for our examples, this would be:
220500 / 256 = 861.3
So we get about 861 frames.
Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:
frame_rate = sample_rate / hop_length = 86.13
To get the number of frames for your input simply multiple frame_rate with the length of your audio and you're set (ignoring padding).
frames = frame_rate * audio_in_seconds

Related

How to replicate a 128 Kaldi-Compatible Mel-frequency bands?

Hi I am trying to replicate the following pre-processing pipeline from the paper 'Masked Autoencoders that Listen' (https://doi.org/10.48550/arXiv.2207.06405):
we transform raw waveform (pre-processed as mono channel under 16,000 sampling rate) into 128 Kaldi [54]-compatible Mel-frequency bands with a 25ms Hanning window that shifts every 10 ms. For a 10-second recording in AudioSet, the resulting spectrogram is of 1×1024×128 dimension.
Starting from a 10s audio, with a sampling rate of 16Khz, I am trying with the following function from torchaudio:
fbank = torchaudio.compliance.kaldi.fbank(signal, htk_compat=True, sample_frequency=16000, use_energy=False, window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
However I am not reaching to obtain the expected spectrogram, the expected shape is (1x1024x128) instead I am obtaining (1x998x128)

Usage of signal.welch

I want to use python signal.welch. The usage of signal.welch is as follows,
f, Pxx_den = signal.welch(x, fs, nperseg=1024)
In my case, x is for example gyroscopy signal ( 1 x 1024 samples (for about 10 sec data)), fs = 100 Hz. In my case, how can I decide nperseg? I want to know how I can select nperseg when the number of samples of the input is 1024 (about 10 sec).
scipy.signal.welch estimates the power spectral density by dividing the data into segments and averaging periodograms computed on each segment. The nperseg arg is the segment length and (by default) also determines the FFT size.
On the one hand, making nperseg smaller allows the input to divide into more segments, good for more averaging to get a more reliable estimate. On the other hand, making nperseg larger improves the frequency resolution of the result. In any case, nperseg should be smaller than the input size in order to get multiple segments.
The default segment length is 256 samples, which seems like a reasonable starting point for a 1024-sample input.

Understanding the shape of spectrograms and n_mels

I am going through these two librosa docs: melspectrogram and stft.
I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:
(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
dur = librosa.get_duration(waveform)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)
print(sample_rate)
print(dur)
Output:
torch.Size([128, 150])
22050
3.48
What I get are the following points:
Sample rate is that you get N samples each second, in this case 22050 samples each second.
The window length is the FFT calculated for that period of length of the audio.
STFT is calculation os FFT in small windows of time of audio.
The shape of the output is (n_mels, t). t = duration/window_of_fft.
I am trying to understand or calculate:
What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value,
n_fft=2048 samples, corresponds to a physical duration of 93
milliseconds at a sample rate of 22050 Hz, i.e. the default sample
rate in librosa.
This means that in each window 2048 samples are taken which means that --> 1/22050 * 2048 = 93[ms]. FFT is being calculated for every 93[ms] of the audio?
So, this means that the window size and window is for filtering the signal in this frame?
In the example above, I understand I am getting 128 number of Mel spectrograms but what exactly does that mean?
And what is hop_length? Reading the docs, I understand that it is how to shift the window from one fft window to the next right? If this value is 512 and n_fft = also 512, what does that mean? Does this mean that it will take a window of 23[ms], calculate FFT for this window and skip the next 23[ms]?
How can I specify that I want to overlap from one FFT window to another?
Please help, I have watched many videos of calculating spectrograms but I just can't seem to see it in real life.
The essential parameter to understanding the output dimensions of spectrograms is not necessarily the length of the used FFT (n_fft), but the distance between consecutive FFTs, i.e., the hop_length.
When computing an STFT, you compute the FFT for a number of short segments. These segments have the length n_fft. Usually these segments overlap (in order to avoid information loss), so the distance between two segments is often not n_fft, but something like n_fft/2. The name for this distance is hop_length. It is also defined in samples.
So when you have 1000 audio samples, and the hop_length is 100, you get 10 features frames (note that, if n_fft is greater than hop_length, you may need to pad).
In your example, you are using the default hop_length of 512. So for audio sampled at 22050 Hz, you get a feature frame rate of
frame_rate = sample_rate/hop_length = 22050 Hz/512 = 43 Hz
Again, padding may change this a little.
So for 10s of audio at 22050 Hz, you get a spectrogram array with the dimensions (128, 430), where 128 is the number of Mel bins and 430 the number of features (in this case, Mel spectra).

How can I get the pitch of an audio using Librosa? (NOT 2D array, but similar with crepe) [duplicate]

I am using this algorithm to detect the pitch of
this audio file. As you can hear, it is an E2 note played on a guitar with a bit of noise in the background.
I generated this spectrogram using STFT:
And I am using the algorithm linked above like this:
y, sr = librosa.load(filename, sr=40000)
pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
np.set_printoptions(threshold=np.nan)
print pitches[np.nonzero(pitches)]
As a result, I am getting pretty much every possible frequency between my fmin and fmax. What do I have to do with the output of the piptrack method to discover the fundamental frequency of a time frame?
UPDATE
I am still not sure what those 2D array represents, though. Let's say I want to find out how strong is 82Hz in frame 5. I could do that using the STFT function which simply returns a 2D matrix (which was used to plot the spectrogram).
However, piptrack does something additional which could be useful and I don't really understand what. pitches[f, t] contains instantaneous frequency at bin f, time t. Does that mean that, if I want to find the maximum frequency at time frame t, I have to:
Go to the magnitudes[][t] array, find the bin with the maximum
magnitude.
Assign the bin to a variable f.
Find pitches[b][t] to find the frequency that belongs to that bin?
Pitch detection is a tricky topic and is often counter-intuitive. I'm not wild about the way the source code is documented for this particular function -- it almost seems like the developer is confusing a 'harmonic' with a 'pitch'.
When a single note (a 'pitch') is made on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies, called harmonics. Typical pitch tracking techniques include searching the results of a FFT for magnitudes in certain bins that correspond to the expected frequencies of harmonics. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ). However, the frequencies where harmonics are located are logarithmically spaced, but the FFT uses a linear spacing. Often the vertical spacing for FFTs are not resolved enough at the lower frequencies.
For that reason when I wrote a pitch detecting application (PitchScope Player), I chose to create a logarithmically spaced DFT, rather than a FFT, so I could focus on the precise frequencies of interest for music ( see the attached diagram of my custom DFT from 3 seconds of a guitar solo ). If you are serious about pursuing pitch detection, you should consider doing more reading into the topic, looking at other sample code (mine is linked below), and consider writing your own functions to measure frequency.
https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection
https://github.com/CreativeDetectors/PitchScope_Player
Turns out the way to pick the pitch at a certain frame t is simple:
def detect_pitch(y, sr, t):
index = magnitudes[:, t].argmax()
pitch = pitches[index, t]
return pitch
First getting the bin of the strongest frequency by looking at the magnitudes array, and then finding the pitch at pitches[index, t].
To find the pitch of the whole audio segment:
def detect_pitch(y, sr):
pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
# get indexes of the maximum value in each time slice
max_indexes = np.argmax(magnitudes, axis=0)
# get the pitches of the max indexes per time slice
pitches = pitches[max_indexes, range(magnitudes.shape[1])]
return pitches

How to implement/perform DFT on a segment in python?

I am trying to write a simple program in python that will calculate and display DFT output of 1 segment.
My signal is 3 seconds long, I want to calculate DFT for every 10ms long segment. Sampling rate is 44100. So one segment is 441 samples long.
Since I am in the phase of testing this and original program is much larger(speech recognition) here is an isolated part for testing purposes that unfortunately behaves odd. Either that or my lack of knowledge on the subject.
I read somewhere that DFT input should be rounded to power of 2 so I arranged my array to 512 instead 441. Is this true?
If I am sampling at a rate of 44100, at most I can reach frequency of 22050Hz and for sample of length 512(~441) at least 100Hz ?
If 2. is true, then I can have all frequencies between 100hz and 22050hz in that 10ms segments, but the length of segment is 512(441) samples only, output of fft returns array of 256(220) values, they cannot contain all 21950 frequencies in there, can they?
My first guess is that the values in output of fft should be multiplied by 100, since 10ms is 100th of a second. Is this good reasoning?
The following program for two given frequencies 1000 and 2000 returns two spikes on graph at positions 24 and 48 in the output array and ~2071 and ~4156 on the graph. Since ratio of numbers is okay (2000:1000 = 48:24) I wonder if I should ignore some starting part of the fft output?
import matplotlib.pyplot as plt
import numpy as np
t = np.arange(0, 1, 1/512.0) # We create 512 long array
# We calculate here two sinusoids together at 1000hz and 2000hz
y = np.sin(2*np.pi*1000*t) + np.sin(2*np.pi*2000*t)
n = len(y)
k = np.arange(n)
# Problematic part is around here, I am not quite sure what
# should be on the horizontal line
T = n/44100.0
frq = k/T
frq = frq[range(n/2)]
Y = fft(y)
Y = Y[range(n/2)]
# Convert from complex numbers to magnitudes
iY = []
for f in Y:
iY.append(np.sqrt(f.imag * f.imag + f.real * f.real))
plt.plot(frq, iY, 'r')
plt.xlabel('freq (HZ)')
plt.show()
I read somewhere that the DFT input should be rounded to power of 2 so I arranged my array to 512 instead 441. Is this true?
The DFT is defined for all sizes. However, implementations of the DFT such as the FFT are generally much more efficient for sizes which can be factored in small primes. Some library implementations have limitations and do not support sizes other than powers of 2, but that isn't the case with numpy.
If I am sampling at a rate of 44100, at most I can reach frequency of 22050Hz and for sample of length 512(~441) at least 100Hz?
The highest frequency for even sized DFT will be 44100/2 = 22050Hz as you've correctly pointed out. Note that for odd sized DFT the highest frequency bin will correspond to a frequency slightly less than the Nyquist frequency. As for the minimum frequency, it will always be 0Hz. The next non-zero frequency will be 44100.0/N where N is the DFT length in samples (which gives 100Hz if you are using a DFT length of 441 samples and ~86Hz with a DFT length of 512 samples).
If 2) is true, then I can have all frequencies between 100Hz and 22050Hz in that 10ms segments, but the length of segment is 512(441) samples only, output of fft returns array of 256(220) values, they cannot contain all 21950 frequencies in there, can they?
First there aren't 21950 frequencies between 100Hz and 22050Hz since frequencies are continuous and not limited to integer frequencies. That said, you are correct in your realization that the output of the DFT will be limited to a much smaller set of frequencies. More specifically the DFT represents the frequency spectrum at discrete frequency step: 0, 44100/N, 2*44100/N, ...
My first guess is that the values in output of FFT should be multiplied by 100, since 10ms is 100th of a second. Is this good reasoning?
There is no need to multiply the FFT output by 100. But if you meant multiples of 100Hz with a DFT of length 441 and a sampling rate of 44100Hz, then your guess would be correct.
The following program for two given frequencies 1000 and 2000 returns two spikes on graph at positions 24 and 48 in the output array and ~2071 and ~4156 on the graph. Since ratio of numbers is okay (2000:1000 = 48:24) I wonder if I should ignore some starting part of the fft output?
Here the problem is more significant. As you declare the array
t = np.arange(0, 1, 1/512.0) # We create 512 long array
you are in fact representing a signal with a sampling rate of 512Hz instead of 44100Hz. As a result the tones you are generating are severely aliased (to 24Hz and 48Hz respectively). This is further compounded by the fact that you then use a sampling rate of 44100Hz for the frequency axis conversion. This is why the peaks are not appearing at the expected 1000Hz and 2000Hz frequencies.
To represent 512 samples of a signal sampled at a rate of 44100Hz, you should instead use
t = np.arange(0, 511.0/44100, 1/44100.0)
at which point the formula you used for the frequency axis would be correct (since it is based of the same 44100Hz sampling rate). You should then be able to see peaks near the expected 1000Hz and 2000Hz (the closest frequency bins of the peaks being at ~1033Hz and 1981Hz).
1) I read somewhere that DFT input should be rounded to power of 2 so
I aranged my array to 512 instead 441. Is this true?
Yes, DFT length should be a power of two. Just pad the input with zero to match 512.
2) If I am sampling at a rate of 44100, at most I can reach frequency
of 22050hz and for sample of length 512(~441) at least 100hz ?
Yes, the highest frequency you can get is half the the sampling rate, It's called the Nyquist frequency.
No, the lowest frequency bin you get (the first bin of the DFT) is called the DC component and marks the average of the signal. The next lowest frequency bin in your case is 22050 / 256 = 86Hz, and then 172Hz, 258Hz, and so on until 22050Hz.
You can get this freqs with the numpy.fftfreq() function.
3) If 2) is true, then I can have all frequencies between 100hz and
22050hz in that 10ms segments, but the length of segment is 512(441)
samples only, output of fft returns array of 256(220) values, they
cannot contain all 21950 frequencies in there, can they?
DFT doesn't lose the original signal's data, but it lacks accuracy when the DFT size is small. You may zero-pad it to make the DFT size larger, such as 1024 or 2048.
The DFT bin refers to a frequency range centered at each of the N output
points. The width of the bin is sample rate/2,
and it extends from: center frequency -(sample rate/N)/2 to center
frequency +(sample rate/N)/2. In other words, half of the bin extends
below each of the N output points, and half above it.
4) My first guess is that the values in output of fft should be
multiplied by 100, since 10ms is 100th of a second. Is this good
reasoning?
No, The value should not be multiplied if you want to preserve the magnitude.
The following program for two given frequencies 1000 and 2000 returns
two spikes on graph at positions 24 and 48 in the output array and
~2071 and ~4156 on the graph. Since ratio of numbers is okay
(2000:1000 = 48:24) I wonder if I should ignore some starting part of
the fft output?
The DFT result is mirrored in real input. In other words, your frequencies will be like this:
n 0 1 2 3 4 ... 255 256 257 ... 511 512
Hz DC 86 172 258 344 ... 21964 22050 21964 ... 86 0

Categories