When loading a file via:
librosa_audio, librosa_sample_rate = librosa.load(filename)
the resulting audio array has the following min-max range:
Librosa audio file min~max range: -1.2105224 to 1.2942806
The file that I am working on was obtained from https://www.boomlibrary.com/ and had a bit depth of 24. I downsampled it to 16 bits and also upsampled it to 32 bits to work with librosa. Both of these files produced the same min-max range after going through librosa.
Why does this happen?
Is there a way to pass the wav file to librosa such that the data will fall within [-1, 1]?
Here is a link to the files:
https://drive.google.com/drive/folders/12a0ii5i0ugyvdMMRX4MPfWMSN0arD0bn?usp=sharing
The behaviour you are observing stems directly from the resampling to 22050 Hz that librosa.load performs by default:
librosa.core.load(path, sr=22050)
The resampling process always affects the audio, hence you see values that are not normalized. You have to normalize the result yourself.
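For instance, a minimal peak-normalization sketch (assuming you simply want to rescale the loaded signal into [-1, 1]):
import librosa
import numpy as np

y, sr = librosa.load(filename)  # resampled to 22050 Hz by default
peak = np.max(np.abs(y))        # largest absolute amplitude
if peak > 0:
    y = y / peak                # now guaranteed to lie within [-1, 1]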
More likely, you wanted to read the audio with its native sampling rate, in which case you should pass None as sr, like this:
librosa.core.load(path, sr=None)
Example based on the audio sample you have provided:
In [4]: y, sr = librosa.load('201-AWCKARAK47Close0116BIT.wav', sr=None)
In [5]: y.max()
Out[5]: 0.9773865
In [6]: y.min()
Out[6]: -0.8358917
Related
I have downloaded the Kaggle Speech Accent Archive to learn how to handle audio data. I'm comparing three ways of reading MP3s in this dataset. The first uses TensorFlow's AudioIOTensor, the second uses Librosa, and the third uses PyDub. I let each of them read the same MP3 file; however, all three get different results for the same file.
I used this code:
import librosa
import numpy as np
import os
import pathlib
import pyaudio
from pydub import AudioSegment as pydub_AudioSegment
from pydub.utils import mediainfo as pydub_mediainfo
import tensorflow as tf
import tensorflow_io as tfio
DATA_DIR = <Path to data>
data_path = pathlib.Path(DATA_DIR)
mp3Files = [x for x in data_path.iterdir() if '.mp3' in x.name]
def load_audios(file_list):
    dataset = []
    for curr_file in file_list:
        tf2 = tfio.audio.AudioIOTensor(curr_file.as_posix())
        librsa, librsa_sr = librosa.load(curr_file.as_posix())
        pdub = pydub_AudioSegment.from_file(curr_file.as_posix(), 'mp3')
        dataset.append([tf2, librsa, librsa_sr, pdub, curr_file.name])
    return dataset
audios = load_audios(mp3Files[0:1]) # Reads 'afrikaans1.mp3' from
# Kaggle's Speech Accent Archive
tf2 = audios[0][0]
libr = audios[0][1]
libr_sr = audios[0][2]
pdub = audios[0][3]
But when I start comparing how these three modules read the same MP3 file, I see this behavior for TensorFlow's AudioIOTensor:
>> tf2_arr = tf.squeeze(tf2.to_tensor(),-1).numpy()
>> tf2_arr, tf2, tf2_arr.shape # Gives raw data, sampling rate & shape
(array([ 0.00905748, 0.01102116, 0.00883307, ..., -0.00131128,
-0.00134344, -0.00090137], dtype=float32),
<AudioIOTensor: shape=[916057 1], dtype=<dtype: 'float32'>, rate=44100>,
(916057,))
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
This behavior for Librosa:
>> libr, libr_sr, libr.shape # Gives raw data, sampling rate & shape
(array([ 0.00711342, 0.01064209, 0.00806945, ..., -0.00168153,
-0.00148052, 0. ], dtype=float32),
22050,
(458029,))
And for PyDub, I see this:
>> pdub_data = np.array(pdub.get_array_of_samples())
>> pdub_data, pdub.frame_rate, pdub_data.shape # Gives raw data, sampling rate
# & shape
(array([297, 361, 289, ..., -43, -44, -30], dtype=int16), 44100, (916057,))
Although all the raw values disagreed with each other, the first reassuring thing I noticed is that the AudioIOTensor and PyDub results had the same sampling frequency (44100) and the same shape ((916057,)). Yet Librosa's result had a sampling frequency (22050) and shape ((458029,)) that were half those of the other two techniques.
Next, I looked to see where the max and min of each array were. I found this:
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
>> np.argmax(pdub_data), np.argmin(pdub_data)
(113149, 106715)
>> np.argmax(libr)*2, np.argmin(libr)*2
(113150, 106714)
So, allowing for the fact that Librosa has half the sampling rate of the other two libraries, all three libraries agree on where the maxes and mins are.
Lastly, I decided to see if Tensorflow's AudioIOTensor and PyDub's result were separated by a constant multiplicative factor by taking the average of the ratio of the maxes and mins:
>> pdub_data[113149]/tf2_arr[113149], pdub_data[106715]/tf2_arr[106715]
(32768.027, 32768.184)
>> test = tf2_arr * 32768.105
>> diff = test-pdub_data
>> np.max(diff), np.min(diff)
(0.578125, -0.5917969)
Since pdub_data had values ranging from -22269 to 23864 (i.e. I checked np.max(pdub_data) and np.min(pdub_data)), I was willing to assume that if the differences were bounded by +/- 0.6, they were due to rounding and similar effects. I was willing to assume that the same would hold for Librosa, but now I'm left wondering: why?
I would've thought that reading an mp3 file wouldn't leave room for interpretation. Raw data was stored using whatever rules mp3 uses and should be recovered when the file is read.
Why do these three libraries differ in the raw numbers they return, and in one case in the sampling rate corresponding to the returned data? How can I get one or all of them to return the raw data stored in the mp3 format? Should I attach any significance to the fact that the ratio between the pdub_data values and the tf2_arr values is 32768 (i.e. 2^15)?
=====================================================================
Later Thoughts: I'm wondering if part of the reason for the differences between these libraries lies in the variable types they use. Librosa uses float32 and PyDub uses int16. So it might make sense that PyDub sees twice as many numbers as Librosa, which gives it twice the sampling rate. Similarly, AudioIOTensor differs from PyDub by a factor of 2^15. If one prepends 15 bits to a 16-bit int, with one more to handle the sign, one could conceivably get a 32-bit float. But both of these cases seem to imply that one set of numbers will be, in some sense, 'wrong'.....
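To tie the numbers together, here is a minimal sketch resting on two assumptions the observations above support: int16 full scale is 2^15 = 32768, and Librosa's halved length comes purely from its default resampling (pass sr=None to avoid it):
import librosa
import numpy as np
from pydub import AudioSegment

path = 'afrikaans1.mp3'

# Load at the native rate so the length matches the other two readers.
y, sr = librosa.load(path, sr=None)  # float32 in [-1, 1], sr == 44100

# Put PyDub's int16 samples on the same float scale.
seg = AudioSegment.from_file(path, 'mp3')
pdub = np.array(seg.get_array_of_samples()).astype(np.float32) / 32768.0

# Remaining differences should be on the order of decoder rounding;
# different mp3 decoders may still disagree slightly.
n = min(len(y), len(pdub))
print(np.max(np.abs(y[:n] - pdub[:n])))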
I have a wav file which I resampled to 16 kHz with Audacity.
Now I am trying to load the file in Python in two different ways.
import tensorflow as tf
import librosa
f = "path/to/wav/file/xxxx.wav"
raw = tf.io.read_file(f)
audio, sr = tf.audio.decode_wav(raw, desired_channels=1)
print("Sample Rate TF: ",sr.numpy())
y, sr2 = librosa.load(f)
print("Sample Rate librosa: ",sr2)
#Sample Rate TF: 16000
#Sample Rate librosa: 22050
Why is the sample rate so different for the same file?
Which library can I trust more?
This is not a question of "trust". Both functions do what they are supposed to do. The TF version apparently does not resample the audio. Librosa, by default, resamples to 22,050 Hz (for whatever reason). Please read the docs. You can avoid this by calling
y, sr2 = librosa.load(f, sr=None)
In general, the sr argument provides the sampling rate to resample to; by passing None, you prevent resampling.
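If you actually want a specific rate, it is clearer to ask for it explicitly; a small sketch (16000 Hz is just an illustrative target, and the orig_sr/target_sr keyword names follow recent librosa versions):
# decode and resample in one step
y16, sr16 = librosa.load(f, sr=16000)

# or decode at the native rate, then resample explicitly
y, sr = librosa.load(f, sr=None)
y16 = librosa.resample(y, orig_sr=sr, target_sr=16000)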
I am extracting MFCCs from an audio file using Librosa's librosa.feature.mfcc function, and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for the entire length of the audio file, which is 1292 windows (over 30 seconds).
What is missing is the timing information for each window: for example, I want to know what the MFCC looks like at 5000 ms, then at 5200 ms, etc.
Do I have to manually calculate the time? Is there a way to automatically get the exact time for each window?
The "timing information" is not directly available, as it depends on sampling rate. In order to provide such information, librosa would have create its own classes. This would rather pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns you numpy.ndarray, meaning you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
import librosa
import numpy as np
filename = librosa.util.example_audio_file()
y, sr = librosa.load(filename)
hop_length = 512 # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, n_mfcc=13, sr=sr, hop_length=hop_length)
audio_length = len(y) / sr # in seconds
step = hop_length / sr # in seconds
intervals_s = np.arange(start=0, stop=audio_length, step=step)
print(f'MFCC shape: {mfcc.shape}')
print(f'intervals_s shape: {intervals_s.shape}')
print(f'First 5 intervals: {intervals_s[:5]} second')
Note that the frame count of mfcc and the length of intervals_s are the same - a sanity check that we did not make a mistake in our calculation.
MFCC shape: (13, 2647)
intervals_s shape: (2647,)
First 5 intervals: [0. 0.02321995 0.04643991 0.06965986 0.09287982] second
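As a cross-check, librosa also provides librosa.frames_to_time, which performs the same frame-index-to-seconds conversion:
frame_idx = np.arange(mfcc.shape[1])  # 0, 1, ..., 2646
times = librosa.frames_to_time(frame_idx, sr=sr, hop_length=hop_length)
print(times[:5])  # should match intervals_s[:5]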
I have a database which contains streamed videos. I want to calculate LBP features from the images and MFCCs from the audio, and for every frame in the video I have some annotation. The annotation is aligned with the video frames and the time of the video. Thus, I want to map the time I have from the annotation to the MFCC results. I know that the sample_rate = 44100.
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
audio_file = "sample.wav"
(rate,sig) = wav.read(audio_file)
mfcc_feat = mfcc(sig,rate)
print(len(sig))        # 2130912
print(len(mfcc_feat))  # 4831
Firstly, why is the length of the MFCC result 4831, and how do I map that to the annotation I have in seconds? The total duration of the video is 48 seconds, and the annotation is 0 everywhere except the 19-29 sec window, where it is 1. How can I locate the samples within the window (19-29 sec) in the MFCC results?
Run
mfcc_feat.shape
You should get (4831, 13). 13 is your MFCC length (the default numcep is 13). 4831 is the number of windows. The default winstep is 10 msec, and this matches your sound file's duration: 2130912 samples / 44100 Hz ≈ 48.3 s, i.e. about 4831 windows of 10 msec each. To get to the windows corresponding to 19-29 sec, just slice
mfcc_feat[1900:2900,:]
Remember that you cannot listen to the MFCC. Each frame just represents a 0.025 sec slice of audio (the default value of the winlen parameter).
If you want to get to the audio itself, it is
sig[time_beg_in_sec*rate:time_end_in_sec*rate]
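Putting both slices together, a small sketch for an arbitrary annotated window (assuming the default winstep of 0.01 s):
winstep = 0.01            # seconds per MFCC frame (the default)
start_s, end_s = 19, 29   # annotated window in seconds

mfcc_window = mfcc_feat[int(start_s / winstep):int(end_s / winstep), :]
sig_window = sig[int(start_s * rate):int(end_s * rate)]  # the audible audio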
I have to read the data from just one channel in a stereo wave file in Python.
For this I tried scipy.io:
import scipy.io.wavfile as wf
import numpy
def read(path):
    data = wf.read(path)
    for frame in data[1]:
        data = numpy.append(data, frame[0])
    return data
But this code is very slow, especially if I have to work with longer files.
So does anybody know a faster way to do this? I thought about the standard wave module by using wave.readframes(), but how are the frames stored there?
scipy.io.wavfile.read returns the tuple (rate, data). If the file is stereo, data is a numpy array with shape (nsamples, 2). To get a specific channel, use a slice of data. For example,
rate, data = wavfile.read(path)
# data0 is the data from channel 0.
data0 = data[:, 0]
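One caveat: if a file happens to be mono, data is one-dimensional and the slice above raises an IndexError; a small defensive sketch:
rate, data = wavfile.read(path)
if data.ndim == 1:   # mono: the array is already a single channel
    channel0 = data
else:                # stereo or more: column 0 is the first channel
    channel0 = data[:, 0]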
The wave module returns the frames as a string of bytes, which can be converted to numbers with the struct module. For instance:
import struct
import wave

def oneChannel(fname, chanIdx):
    """ list with specified channel's data from multichannel wave with 16-bit data """
    f = wave.open(fname, 'rb')
    chans = f.getnchannels()
    samps = f.getnframes()
    sampwidth = f.getsampwidth()
    assert sampwidth == 2
    s = f.readframes(samps)  # read all the samples from the file into a byte string
    f.close()
    unpstr = '<{0}h'.format(samps*chans)  # little-endian 16-bit samples
    x = list(struct.unpack(unpstr, s))  # convert the byte string into a list of ints
    return x[chanIdx::chans]  # return the desired channel
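For example, to pull both channels of a 16-bit stereo file (the filename is just a placeholder):
left = oneChannel('stereo.wav', 0)   # channel index 0
right = oneChannel('stereo.wav', 1)  # channel index 1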
If your WAV file has some other sample size, you can use the (uglier) function in another answer I wrote here.
I've never used scipy's wavfile function so I can't compare speed, but the wave and struct approach I use here has always worked for me.
Another option, if a mono mixdown is acceptable rather than a specific channel, is to average the channels:
rate, audio = wavfile.read(path)
audio = np.mean(audio, axis=1)