Librosa : MFCC feature calculation

Given an audio file of 22 minutes (1320 seconds), Librosa extracts MFCC features with
data = librosa.feature.mfcc(y=None, sr=22050, S=None, n_mfcc=20, **kwargs)
data.shape
(20,56829)
It returns a numpy array of 20 MFCC features for 56829 frames.
My question is how it calculated 56829. Is there a formula that gives this number of frames? And what is the window size for each frame?

You can specify the hop length:
mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, n_mfcc=13)
librosa uses centered frames, so the kth frame is centered around sample k * hop_length.
The default hop length is 512 samples, which matches your data: (1320 * 22050) / 56829 ≈ 512.16. The default analysis window (n_fft) is 2048 samples, about 93 ms at 22050 Hz.
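As a rough check of that arithmetic, the sketch below computes the frame count one would expect from centered framing, 1 + len(y) // hop_length. The helper name expected_frames is hypothetical (it is not part of librosa), and the rule assumes librosa's default center=True behaviour.

```python
def expected_frames(n_samples, hop_length=512):
    """Frame count under centered framing: one frame per hop, plus the frame at sample 0."""
    return 1 + n_samples // hop_length

sr = 22050

# A file of exactly 1320 s would give slightly more frames than the 56829 reported.
print(expected_frames(1320 * sr))

# Working backwards from the reported frame count gives the shortest matching duration.
n_samples = (56829 - 1) * 512
print(expected_frames(n_samples), n_samples / sr)
```

The small gap between 512 and 512.16 suggests the file is slightly shorter than 1320 seconds (roughly 1319.5 seconds).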

Related

Function to normalize input audio file of 44.1 kHz sampling rate and 16 bit resolution

I want to apply the following to an input audio file.
Input are mono audio files with 16 bit resolution at 44.1 kHz sampling rate. The audio is normalized and padded with 0.25 seconds of silence, to avoid onsets occurring immediately at the beginning of the audio file. First a logarithmic power spectrogram is calculated using a 2048 samples window size and a resulting frame rate of 100Hz. The frequency axis is then transformed to a logarithmic scale using twelve triangular filters per octave for a frequency range from 20 to 20,000 Hz. This results in a total number of 84 frequency bins.
But I was not able to implement this. I should be able to load a wav file and convert it to the expected output.

import numpy as np
import scipy.io.wavfile as wav
import matplotlib.pyplot as plt
#read wav file
rate, data = wav.read('C:/...')
#convert to mono
if data.ndim > 1:
    data = data.sum(axis=1) / 2
#normalize
data = data / np.max(np.abs(data))
#pad with 0.25 seconds of silence
data = np.append(data, np.zeros(int(0.25 * rate)))
#calculate spectrogram
spec = np.abs(np.fft.rfft(data, n=2048))**2
#convert to log scale (dB)
spec = 10 * np.log10(spec)
filters = np.zeros((12, 84))
#convert to chroma
chroma = np.dot(spec, np.transpose(filters))
#convert to onset strength
onset = np.diff(chroma, axis=0)
#plot
plt.figure(figsize=(12, 4))
plt.imshow(chroma.T, aspect='auto', origin='lower', interpolation='none')
plt.show()
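The processing chain described in the quoted paragraph could be sketched roughly as follows, using only NumPy. This is a sketch under several assumptions, not the reference implementation: the function name log_filtered_spectrogram, the Hann window, the log10(1 + x) compression and the way duplicate filter bins are dropped are all choices made here for illustration. It therefore does not try to reproduce exactly 84 bins; the count depends on how the bin mapping handles duplicates.

```python
import numpy as np

def log_filtered_spectrogram(data, rate, frame_size=2048, fps=100,
                             bands_per_octave=12, fmin=20.0, fmax=20000.0,
                             pad_seconds=0.25):
    """Sketch: normalise, pad, frame, power spectrum, triangular log-frequency filters."""
    data = np.asarray(data, dtype=float)
    if data.ndim > 1:                        # mix multi-channel audio down to mono
        data = data.mean(axis=1)
    peak = np.max(np.abs(data))
    if peak > 0:                             # normalise to the range [-1, 1]
        data = data / peak
    # Prepend silence so that no onset sits at the very start of the signal.
    data = np.concatenate([np.zeros(int(pad_seconds * rate)), data])

    hop = int(round(rate / fps))             # 441 samples at 44.1 kHz gives 100 frames/s
    n_frames = 1 + (len(data) - frame_size) // hop
    window = np.hanning(frame_size)
    frames = np.stack([data[i * hop:i * hop + frame_size] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, frame_size // 2 + 1)

    # Centre frequencies spaced bands_per_octave per octave between fmin and fmax.
    n_steps = int(np.floor(bands_per_octave * np.log2(fmax / fmin)))
    centres = fmin * 2.0 ** (np.arange(n_steps + 1) / bands_per_octave)
    # Map centres to FFT bins; np.unique drops the low-frequency duplicates.
    bins = np.unique(np.round(centres * frame_size / rate).astype(int))
    bins = bins[(bins > 0) & (bins < power.shape[1])]

    # One triangular filter per interior bin: rises from the previous bin, falls to the next.
    filterbank = np.zeros((power.shape[1], len(bins) - 2))
    for k in range(len(bins) - 2):
        lo, mid, hi = bins[k], bins[k + 1], bins[k + 2]
        filterbank[lo:mid + 1, k] = np.linspace(0.0, 1.0, mid - lo + 1)
        filterbank[mid:hi + 1, k] = np.linspace(1.0, 0.0, hi - mid + 1)

    return np.log10(1.0 + power @ filterbank)  # log(1 + x) avoids log(0) on silence

# Synthetic one-second 440 Hz tone instead of a real wav file.
rate = 44100
t = np.arange(rate) / rate
spec = log_filtered_spectrogram(np.sin(2 * np.pi * 440.0 * t), rate)
print(spec.shape)  # (frames, filter bands)
```

With a real file, the rate and data returned by wav.read could be passed in directly, and spec.T plotted with imshow as in the code above.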

Transforming Pyplot x axis

I am trying to plot an audio sample's amplitude across the time domain. I've used scipy.io.wavfile to read the audio sample and determine sampling rate:
# read .wav file
data = read("/Users/user/Desktop/voice.wav")
# determine sample rate of .wav file
# print(data[0]) # 48000 samples per second
# 48000 (samples per second) * 4 (4 second sample) = 192000 samples overall
# store the data read from .wav file
audio = data[1]
# plot the data
plt.plot(audio[0 : 192000]) # see code above for how this value was determined
This creates a plot displaying amplitude on y axis and sample number on the x axis. How can I transform the x axis to instead show seconds?
I tried using plt.xticks but I don't think this is the correct use case based upon the error I received:
# label title, axis, show the plot
seconds = range(0,4)
plt.xticks(range(0,192000), seconds)
plt.ylabel("Amplitude")
plt.xlabel("Time")
plt.show()
ValueError: The number of FixedLocator locations (192000), usually from a call to set_ticks, does not match the number of ticklabels (4).

You need to pass a time vector t as the first argument to the plotting command. You can generate it on the fly with NumPy, so that it is garbage collected (sooner or later) after the command has executed:
from numpy import linspace
plt.plot(linspace(0, 4, 192000, endpoint=False), audio[0 : 192000])
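If the clip length is not known in advance, the same time vector can be derived from the sample rate instead of hard-coding 4 seconds and 192000 samples. A minimal sketch, assuming rate and audio hold what scipy.io.wavfile.read returned (a zero array stands in for the real samples here):

```python
import numpy as np

rate = 48000                      # samples per second, as reported in the question
audio = np.zeros(4 * rate)        # placeholder for data[1]: four seconds of samples

# Time stamp of every sample in seconds: sample index divided by the sample rate.
t = np.arange(len(audio)) / rate

# plt.plot(t, audio) would then put seconds on the x axis.
print(t[0], t[-1], len(t))
```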

Get timing information from MFCC generated with librosa.feature.mfcc

I am extracting MFCCs from an audio file using Librosa's function (librosa.feature.mfcc) and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for each of the 1292 windows that cover the audio file (30 seconds).
What is missing is timing information for each window: for example I want to know what the MFCC looks like at time 5000ms, then at 5200ms etc.
Do I have to manually calculate the time? Is there a way to automatically get the exact time for each window?

The "timing information" is not directly available, as it depends on the sampling rate. To provide such information, librosa would have to create its own classes. That would pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns a numpy.ndarray, meaning you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
import librosa
import numpy as np
filename = librosa.util.example_audio_file()  # older librosa releases only; newer ones offer librosa.example(...) instead
y, sr = librosa.load(filename)
hop_length = 512 # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, n_mfcc=13, sr=sr, hop_length=hop_length)
audio_length = len(y) / sr # in seconds
step = hop_length / sr # in seconds
intervals_s = np.arange(start=0, stop=audio_length, step=step)
print(f'MFCC shape: {mfcc.shape}')
print(f'intervals_s shape: {intervals_s.shape}')
print(f'First 5 intervals: {intervals_s[:5]} second')
Note that mfcc has as many frames as intervals_s has entries, which is a sanity check that we did not make a mistake in our calculation.
MFCC shape: (13, 2647)
intervals_s shape: (2647,)
First 5 intervals: [0. 0.02321995 0.04643991 0.06965986 0.09287982] second
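To look up the MFCC column for a specific time such as 5000 ms, the relation above can be inverted. The two helpers below are hypothetical names written for this illustration in plain NumPy; librosa itself ships librosa.frames_to_time and librosa.time_to_frames for the same purpose. The sketch assumes the default sr of 22050 and hop_length of 512, and floors to the frame that starts at or before the requested time.

```python
import numpy as np

def frame_to_time(frame, sr=22050, hop_length=512):
    """Time in seconds at which a frame index is centered."""
    return np.asarray(frame) * hop_length / sr

def time_to_frame(seconds, sr=22050, hop_length=512):
    """Index of the last frame centered at or before the given time."""
    return np.floor(np.asarray(seconds) * sr / hop_length).astype(int)

for ms in (5000, 5200):
    k = time_to_frame(ms / 1000)
    print(ms, k, frame_to_time(k))
```

mfcc[:, k] is then the coefficient vector for that point in time.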

Python implementation of MFCC algorithm

I have a database that contains streaming videos. I want to calculate LBP features from the images and MFCCs from the audio, and for every frame in the video I have an annotation. The annotation is aligned with the video frames and the time of the video. Thus, I want to map the times I have from the annotation to the result of the mfcc. I know that the sample_rate = 44100
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
audio_file = "sample.wav"
(rate,sig) = wav.read(audio_file)
mfcc_feat = mfcc(sig,rate)
print(len(sig)) # 2130912
print(len(mfcc_feat)) # 4831
Firstly, why is the length of the mfcc result 4831, and how do I map that to the annotation that I have in seconds? The total duration of the video is 48 seconds. The annotation of the video is 0 everywhere except in the 19-29 second window, where it is 1. How can I locate the samples within that window (19-29) in the results of the mfcc?

Run
mfcc_feat.shape
You should get (4831,13). 13 is your MFCC length (default numcep is 13). 4831 is the number of windows. The default winstep is 10 msec, and this matches your sound file duration (4831 * 0.01 s ≈ 48.3 s). To get the windows corresponding to 19-29 sec, just slice
mfcc_feat[1900:2900,:]
Remember that you cannot listen to the MFCC. Each row just represents a slice of audio of 0.025 sec (default value of the winlen parameter).
If you want to get to the audio itself, it is
sig[int(time_beg_in_sec*rate):int(time_end_in_sec*rate)]
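The numbers above can be reproduced with a little arithmetic. This sketch assumes the framing rule 1 + ceil((n_samples - frame_len) / frame_step), which is how python_speech_features appears to count windows; num_windows and window_slice are hypothetical helper names, not functions of that library.

```python
import math

def num_windows(n_samples, rate, winlen=0.025, winstep=0.01):
    """Window count for a signal, assuming lengths are rounded half up to whole samples."""
    frame_len = math.floor(winlen * rate + 0.5)
    frame_step = math.floor(winstep * rate + 0.5)
    if n_samples <= frame_len:
        return 1
    return 1 + math.ceil((n_samples - frame_len) / frame_step)

def window_slice(start_s, end_s, winstep=0.01):
    """Slice of MFCC rows covering the annotation interval given in seconds."""
    return slice(round(start_s / winstep), round(end_s / winstep))

print(num_windows(2130912, 44100))
print(window_slice(19, 29))
```

Indexing mfcc_feat with window_slice(19, 29) selects the same rows as mfcc_feat[1900:2900, :].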

converting stft to chroma and plotting the result

I am trying to convert stft of a wav file into chromagram.
Here's my code :-
def stft(x, fs, framesize, hopsize):
    frame = int(framesize * fs)  # frame length in samples
    hop = int(hopsize * fs)      # hop length in samples
    w = np.hamming(frame)
    X = np.array([np.fft.fft(w * x[i:i+frame])
                  for i in range(0, len(x) - frame, hop)])
    return X
Here's the code for chromagram :-
def chromagram(x,fs,framesize,hopsize):
    X = stft(x,fs,framesize,hopsize)
    chroma = np.fmod(np.round(np.log2(X / 440) * 12), 12)
    return chroma
When I calculate fft I get an array with complex values so I have to cast the result to float before calculating chroma. Am I doing anything wrong here?
Also, how do I plot the result?

I don't think it works that way. X holds the complex-valued STFT; you can get its magnitude values with np.abs(X). Did you want to apply this formula? It converts frequencies to musical notes, but X contains spectral values, not frequencies. You can get the frequency corresponding to each FFT bin with np.fft.fftfreq(frame, 1.0/fs), where frame is the frame length in samples.
If you don't want to use the Bregman Audio-Visual Information Toolbox for chroma features and would rather implement them on your own, you could port the Matlab Chroma Toolbox. I think they use filterbanks instead of the FFT. Further down on that page you will find references where chroma features are explained in detail.
Anyway, once you have chroma features, you can plot them like any 2-dimensional array with imshow.
from matplotlib import pyplot as plt
import numpy as np
X = np.random.random((30, 30))
plt.imshow(X)
plt.show()
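To make the point about frequencies concrete, here is a very simple chroma sketch: it takes the STFT magnitude, assigns each FFT bin to a pitch class using the formula from the question applied to the bin frequencies, and sums the magnitudes per class. This is only an illustration under those assumptions (the names stft_magnitude and simple_chromagram are made up here); the toolboxes mentioned above use more careful filterbank designs.

```python
import numpy as np

def stft_magnitude(x, fs, framesize, hopsize):
    """Magnitude STFT; framesize and hopsize are given in seconds."""
    frame = int(framesize * fs)
    hop = int(hopsize * fs)
    w = np.hamming(frame)
    starts = range(0, len(x) - frame + 1, hop)
    return np.array([np.abs(np.fft.rfft(w * x[i:i + frame])) for i in starts])

def simple_chromagram(x, fs, framesize=0.1, hopsize=0.05):
    mag = stft_magnitude(x, fs, framesize, hopsize)
    freqs = np.fft.rfftfreq(int(framesize * fs), 1.0 / fs)
    valid = freqs > 0                       # skip the DC bin, log2(0) is undefined
    # Pitch class of each bin relative to A (440 Hz): 0 = A, 1 = A#, ...
    pitch_class = np.mod(np.round(12 * np.log2(freqs[valid] / 440.0)), 12).astype(int)
    chroma = np.zeros((mag.shape[0], 12))
    for pc in range(12):
        chroma[:, pc] = mag[:, valid][:, pitch_class == pc].sum(axis=1)
    return chroma

# One second of a 440 Hz sine: the energy should land in pitch class 0 (A).
fs = 22050
t = np.arange(fs) / fs
chroma = simple_chromagram(np.sin(2 * np.pi * 440.0 * t), fs)
print(chroma.shape, chroma.mean(axis=0).argmax())
```

The resulting array can be shown with plt.imshow(chroma.T, aspect='auto', origin='lower').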
