I have downloaded the Kaggle Speech Accent Archive to learn how to handle audio data. I'm comparing three ways of reading the mp3s in this dataset: the first uses TensorFlow's AudioIOTensor, the second uses Librosa, and the third uses PyDub. I let each of them read the same mp3 file; however, all three get different results on the same file.
I used this code:
import librosa
import numpy as np
import os
import pathlib
import pyaudio
from pydub import AudioSegment as pydub_AudioSegment
from pydub.utils import mediainfo as pydub_mediainfo
import tensorflow as tf
import tensorflow_io as tfio
DATA_DIR = <Path to data>
data_path = pathlib.Path(DATA_DIR)
mp3Files = [x for x in data_path.iterdir() if '.mp3' in x.name]
def load_audios(file_list):
    dataset = []
    for curr_file in file_list:
        tf2 = tfio.audio.AudioIOTensor(curr_file.as_posix())
        librsa, librsa_sr = librosa.load(curr_file.as_posix())
        pdub = pydub_AudioSegment.from_file(curr_file.as_posix(), 'mp3')
        dataset.append([tf2, librsa, librsa_sr, pdub, curr_file.name])
    return dataset
audios = load_audios(mp3Files[0:1]) # Reads 'afrikaans1.mp3' from
# Kaggle's Speech Accent Archive
tf2 = audios[0][0]
libr = audios[0][1]
libr_sr = audios[0][2]
pdub = audios[0][3]
But when I start comparing how these three modules read the same mp3 file, I see this behavior for TensorFlow's AudioIOTensor:
>> tf2_arr = tf.squeeze(tf2.to_tensor(),-1).numpy()
>> tf2_arr, tf2, tf2_arr.shape # Gives raw data, sampling rate & shape
(array([ 0.00905748, 0.01102116, 0.00883307, ..., -0.00131128,
-0.00134344, -0.00090137], dtype=float32),
<AudioIOTensor: shape=[916057 1], dtype=<dtype: 'float32'>, rate=44100>,
(916057,))
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
This behavior for Librosa:
>> libr, libr_sr, libr.shape # Gives raw data, sampling rate & shape
(array([ 0.00711342, 0.01064209, 0.00806945, ..., -0.00168153,
-0.00148052, 0. ], dtype=float32),
22050,
(458029,))
And for PyDub, I see this:
>> pdub_data = np.array(pdub.get_array_of_samples())
>> pdub_data, pdub.frame_rate, pdub_data.shape # Gives raw data, sampling rate
# & shape
(array([297, 361, 289, ..., -43, -44, -30], dtype=int16), 44100, (916057,))
Although the raw values all disagreed with each other, the first reassuring thing I noticed is that the AudioIOTensor and PyDub results had the same sampling frequency (44100) and the same shape ((916057,)). Yet Librosa's result had a sampling frequency (22050) and shape ((458029,)) that were half those of the other two techniques.
Next, I looked to see where the maxes and mins of each array were. I found this:
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
>> np.argmax(pdub_data), np.argmin(pdub_data)
(113149, 106715)
>> np.argmax(libr)*2, np.argmin(libr)*2
(113150, 106714)
So, allowing for the fact that Librosa has half the sampling rate of the other two libraries, all three libraries agree on where the maxes and mins are.
Lastly, I decided to see whether TensorFlow's AudioIOTensor result and PyDub's result were separated by a constant multiplicative factor, by taking the average of the ratios at the max and min positions:
>> pdub_data[113149]/tf2_arr[113149], pdub_data[106715]/tf2_arr[106715]
(32768.027, 32768.184)
>> test = tf2_arr * 32768.105
>> diff = test-pdub_data
>> np.max(diff), np.min(diff)
(0.578125, -0.5917969)
Since pdub_data had values ranging from 23864 to -22269 (i.e. I checked np.max(pdub_data) and np.min(pdub_data)), I was willing to assume that differences bounded by +/- 0.6 were due to rounding and similar effects. I was willing to assume that the same would hold for Librosa, but now I'm left wondering why.
I would've thought that reading an mp3 file wouldn't leave room for interpretation. Raw data was stored using whatever rules mp3 uses and should be recovered when the file is read.
Why do these 3 libraries differ in the raw numbers they return, and in one case differ in the sampling rate corresponding to the returned data? How can I get one or all of them to return the raw data stored in the mp3 format? Should I attach any significance to the fact that the ratio between the pdub_data values and the tf2_arr values is 32768 (i.e. 2^15)?
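For reference, here is a minimal sketch of the kind of check I mean, reusing the tf2_arr and pdub_data arrays from above (the expected tolerance is just my guess, based on the +/- 0.6 bound in int16 units):
import numpy as np
# If PyDub's int16 samples are simply the float32 samples scaled by 2**15,
# the rescaled arrays should agree to within roughly one int16 step.
pdub_as_float = pdub_data.astype(np.float32) / 2**15
print(np.max(np.abs(pdub_as_float - tf2_arr)))  # expect something on the order of 1/32768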
=====================================================================
Later Thoughts: I'm wondering if part of the reason for the differences between these libraries lies in the data types they use. Librosa uses float32 and PyDub uses int16. So it might make sense that PyDub sees twice as many numbers as Librosa, which gives it twice the sampling rate. Similarly, AudioIOTensor differs from PyDub by a factor of 2^15. If one prepends 15 bits to a 16-bit int, with one more to handle the sign, one could conceivably get a 32-bit float. But both of these cases seem to imply that one set of numbers will be, in some sense, 'wrong'...
Related
I want to test the following dataset: https://www.tensorflow.org/datasets/catalog/speech_commands
When I load and play the audio I just get what sounds like random noise.
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import IPython.display as ipd
ds, ds_info = tfds.load('speech_commands', shuffle_files=False, with_info=True)
ds_info
tfds.core.DatasetInfo(
name='speech_commands',
full_name='speech_commands/0.0.2',
description="""
An audio dataset of spoken words designed to help train and evaluate keyword
spotting systems. Its primary goal is to provide a way to build and test small
models that detect when a single word is spoken, from a set of ten target words,
with as few false positives as possible from background noise or unrelated
speech. Note that in the train and validation set, the label "unknown" is much
more prevalent than the labels of the target words or background noise.
One difference from the release version is the handling of silent segments.
While in the test set the silence segments are regular 1 second files, in the
training they are provided as long segments under "background_noise" folder.
Here we split these background noise into 1 second clips, and also keep one of
the files for the validation set.
""",
homepage='https://arxiv.org/abs/1804.03209',
data_path='C:\\Users\\abc\\tensorflow_datasets\\speech_commands\\0.0.2',
download_size=2.37 GiB,
dataset_size=9.07 GiB,
features=FeaturesDict({
'audio': Audio(shape=(None,), dtype=tf.int64),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=12),
}),
supervised_keys=('audio', 'label'),
splits={
'test': <SplitInfo num_examples=4890, num_shards=4>,
'train': <SplitInfo num_examples=106497, num_shards=128>,
'validation': <SplitInfo num_examples=121, num_shards=1>,
},
citation="""@article{speechcommandsv2,
author = {{Warden}, P.},
title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1804.03209},
primaryClass = "cs.CL",
keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
year = 2018,
month = apr,
url = {https://arxiv.org/abs/1804.03209},
}""",
)
The audio files are arrays of type int64 with a sample rate of 16000. I couldn't find any information on how to play the files within this dataset. From other datasets I was able to play the WAV sounds. One difference is that the other datasets used float arrays while this one uses int arrays. Maybe I'm missing a conversion step?
ds_list = list(ds['validation'])
idx = -1
audio, label = ds_list[idx]['audio'], ds_list[idx]['label']
ipd.Audio(audio, rate=16_000)
I obviously tried multiple indices within the dataset but I always just get noise. One audio entry looks something like this:
tf.Tensor([ -112 1285 -2002 ... -140 1000 -595], shape=(16000,), dtype=int64)
Ty :)
As per the source code [1], the description states that:
Its primary goal is to provide a way to build and test small
models that detect when a single word is spoken, from a set of ten target words,
with as few false positives as possible from background noise or unrelated
speech.
At first, I got the same noisy playback that you have shown. Then, I modified my code based on [2] to produce a cleaner voice.
I use the following code to convert the tensor to wav format.
import numpy as np
import scipy.io.wavfile as wavfile
import tensorflow as tf
import tensorflow_datasets as tfds
# load speech commands dataset
ds = tfds.load('speech_commands', split=['train', 'validation', 'test'],
               shuffle_files=True)
# convert from tfds format to list
ds_train = list(ds[0])
ds_val = list(ds[1])
# convert from tensor int64 to numpy float32 and scale to roughly [-1, 1]
sc1 = ds_train[0]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max
sv1 = ds_val[0]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max
# save as wav
wavfile.write('sc_train_1.wav', 16000, sc1)
wavfile.write('sc_val_1.wav', 16000, sv1)
The trick is to convert int64 to float32 and divide by the max value of np.int16: .astype(np.float32) / np.iinfo(np.int16).max.
Now I can hear a cleaner voice than with the raw int64 data.
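The same rescaled array should also play cleanly straight from the notebook; a minimal sketch, assuming the sc1 array from above:
import IPython.display as ipd
# play the rescaled float32 audio (values roughly in [-1, 1]) at 16 kHz
ipd.Audio(sc1, rate=16000)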
[1] https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/audio/speech_commands.py
[2] https://github.com/google-research/google-research/blob/master/non_semantic_speech_benchmark/train_and_eval_sklearn_small_tfds_dataset.ipynb
I am extracting MFCCs from an audio file using Librosa's function (librosa.feature.mfcc) and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for each of the 1292 windows covering the entire length of the 30-second audio file.
What is missing is timing information for each window: for example I want to know what the MFCC looks like at time 5000ms, then at 5200ms etc.
Do I have to manually calculate the time? Is there a way to automatically get the exact time for each window?
The "timing information" is not directly available, as it depends on sampling rate. In order to provide such information, librosa would have create its own classes. This would rather pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns you numpy.ndarray, meaning you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
import librosa
import numpy as np
filename = librosa.util.example_audio_file()
y, sr = librosa.load(filename)
hop_length = 512 # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, n_mfcc=13, sr=sr, hop_length=hop_length)
audio_length = len(y) / sr # in seconds
step = hop_length / sr # in seconds
intervals_s = np.arange(start=0, stop=audio_length, step=step)
print(f'MFCC shape: {mfcc.shape}')
print(f'intervals_s shape: {intervals_s.shape}')
print(f'First 5 intervals: {intervals_s[:5]} second')
Note that the number of MFCC frames and the length of intervals_s are the same - a sanity check that we did not make a mistake in our calculation.
MFCC shape: (13, 2647)
intervals_s shape: (2647,)
First 5 intervals: [0. 0.02321995 0.04643991 0.06965986 0.09287982] second
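As far as I know, librosa also ships a helper that maps frame indices to times directly, which avoids the manual arange; a minimal sketch, assuming the mfcc, sr and hop_length variables from the example above:
import numpy as np
import librosa
# frame index -> time in seconds, using the same hop_length as the MFCC call
frame_times = librosa.frames_to_time(np.arange(mfcc.shape[1]), sr=sr, hop_length=hop_length)
print(frame_times[:5])  # should match the first few intervals_s values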
When passing a file via:
librosa_audio, librosa_sample_rate = librosa.load(filename)
The output produces audio data such that:
Librosa audio file min~max range: -1.2105224 to 1.2942806
The file that I am working on was obtained from https://www.boomlibrary.com/ and had a bit depth of 24. I downsampled it to 16 bits and also upsampled it to 32 bits to work with librosa. Both of these files produced the same min~max range after going through librosa.
Why does this happen?
Is there a way to pass the wav file to Librosa such that the data will fall within [-1, 1]?
Here is a link to the files:
https://drive.google.com/drive/folders/12a0ii5i0ugyvdMMRX4MPfWMSN0arD0bn?usp=sharing
The behaviour you are observing stems directly from the resampling to 22050 Hz that librosa.load does by default:
librosa.core.load(path, sr=22050)
The resampling process always affects the audio, hence you see values that are no longer normalized. You have to normalize them yourself.
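For example, a minimal peak-normalization sketch (my own illustration; y is the array returned by librosa.load):
import numpy as np
# rescale so the loudest sample has absolute value 1.0
y = y / np.max(np.abs(y))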
More likely, you wanted to read the audio with its native sampling rate, in which case you should pass sr=None like this:
librosa.core.load(path, sr=None)
Example based on the audio sample you have provided:
In [4]: y, sr = librosa.load('201-AWCKARAK47Close0116BIT.wav', sr=None)
In [5]: y.max()
Out[5]: 0.9773865
In [6]: y.min()
Out[6]: -0.8358917
I have recently been doing my homework on MFCC, and I can't figure out some differences between these libraries.
The 3 libraries I use are:
python_speech_features
SpeechPy
LibROSA
samplerate = 16000
NFFT = 512
NCEPT = 13
1st Part: Mel filter bank
temp1_fb = pyspeech.get_filterbanks(nfilt=NFILT, nfft=NFFT, samplerate=sample1)
# speechpy does not divide by 2 and add 1 when initializing
temp2_fb = speechpy.feature.filterbanks(num_filter=NFILT, fftpoints=NFFT, sampling_freq=sample1)
temp3_fb = librosa.filters.mel(sr=sample1, n_fft=NFFT, n_mels=NFILT)
# rescale librosa's normalized filters so each one peaks at 1
temp3_fb /= np.max(temp3_fb, axis=-1)[:, None]
[pic1: plots of the three Mel filter banks]
Only the filter bank from speechpy comes out with shape (, 512); the others are all (, 257). The librosa figure looks slightly deformed.
2nd Part: MFCC
# pyspeech without lifter, using a Hamming window
temp1_mfcc = pyspeech.mfcc(speaker1, samplerate=sample1, winlen=0.025, winstep=0.01, numcep=NCEPT, nfilt=NFILT, nfft=NFFT,
                           preemph=0.97, ceplifter=0, winfunc=np.hamming, appendEnergy=False)
# speechpy needs a pre-emphasized signal and uses a fixed rectangular window; its Mel filter bank is not the same
temp2_mfcc = speechpy.feature.mfcc(emphasized_speaker1, sampling_frequency=sample1, frame_length=0.025, frame_stride=0.01,
                                   num_cepstral=NCEPT, num_filters=NFILT, fft_length=NFFT)
# librosa needs a pre-emphasized signal and uses log energy; its STFT uses a Hanning window, but its framing is not the same
temp3_energy = librosa.feature.melspectrogram(emphasized_speaker1, sr=sample1, S=temp3_pow.T, n_fft=NFFT,
                                              hop_length=frame_step, n_mels=NFILT).T
temp3_energy = np.log(temp3_energy)
temp3_mfcc = librosa.feature.mfcc(emphasized_speaker1, sr=sample1, S=temp3_energy.T, n_mfcc=13, dct_type=2, n_fft=NFFT,
                                  hop_length=frame_step).T
[pic2: MFCC comparison plots]
I've tried my best to keep the conditions fair. The speechpy figure comes out darker.
3rd Part: Delta coefficient
temp1 = pyspeech.delta(mfcc_speaker1, 2)
temp2 = speechpy.processing.derivative_extraction(mfcc_speaker1.T, 1).T
# librosa along the frame axis
temp3 = librosa.feature.delta(mfcc_speaker1, width=5, axis=0, order=1)
[pic3: delta coefficient comparison plots]
I can't directly pass the MFCCs as an argument in speechpy, or the result looks very strange, and these parameters don't act the way I expected.
I'm wondering what factors cause these differences. Is it just the things I mentioned above, or did I make some mistakes? I'd appreciate details, thanks.
There are many MFCC implementations and they often differ in small details - the window function shape, the mel filterbank calculation, and even the DCT can differ. It is hard to find a fully compatible library. In the long term it should not matter, as long as you use the same implementation everywhere; the differences do not affect results.
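If you want to see the differences concretely, one rough way is to run two of the implementations side by side on the same toy signal; a minimal sketch (the parameter choices are mine, and pyspeech stands for python_speech_features as in the code above):
import numpy as np
import librosa
import python_speech_features as pyspeech
# one second of noise at 16 kHz as a toy signal
sig = np.random.randn(16000).astype(np.float32)
m1 = pyspeech.mfcc(sig, samplerate=16000, numcep=13, nfilt=26, nfft=512)            # shape (frames, 13)
m2 = librosa.feature.mfcc(y=sig, sr=16000, n_mfcc=13, n_fft=512, hop_length=160).T  # shape (frames, 13)
print(m1.shape, m2.shape)  # frame counts differ slightly because the framing differs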
I'm using matplotlib's magnitude_spectrum to compare the tonal characteristics of guitar strings. magnitude_spectrum labels the y axis with units of "Magnitude (energy)". I use two different 'processes' to compare the FFT. Process 2 (for lack of a better description) is much easier to interpret - code & graphs below.
My questions are:
In terms of units, what does "Magnitude (energy)" mean and how does it relate to dB?
Using #Process 2 (see code & graphs below), what type of units am I looking at, dB?
If #Process 2 is not dB, then what is the best way to scale it to dB?
My code below (simplified) shows an example of what I'm talking about/looking at.
import numpy as np
from scipy.io.wavfile import read
from pylab import plot, psd, magnitude_spectrum
import matplotlib.pyplot as plt
#Hello Signal!!!
(fs, x) = read(r'C:\Desktop\Spectral Work\EB_AB_1_2.wav')
#Remove silence out of beginning of signal with threshold of 1000
def indices(a, func):
    #This allows using a lambda function as the equivalent of find() in matlab
    return [i for (i, val) in enumerate(a) if func(val)]
#Make the signal smaller so it uses less resources
x_tiny = x[0:100000]
#threshold is 1000, 0 is calling the first index greater than 1000
thresh = indices(x_tiny, lambda y: y > 1000)[1]
# backs signal up 20 bins, so as to not ignore the initial pluck sound...
thresh_start = thresh - 20
#starts at thresh_start, ends at end of signal (-1 is just a referencing thing)
analysis_signal = x[thresh_start-1:]
#Split signal so it is 1 second long
one_sec = 1 * fs
onesec = x[thresh_start-1:one_sec+thresh_start-1]
#process 1
(spectrum, freqs, _) = magnitude_spectrum(onesec, Fs=fs)
#process 2
spectrum1 = spectrum / len(spectrum)
I don't know how to bulk process multiple .wav files, so I run this code separately on a whole bunch of different .wav files and put the results into Excel to compare. But for the sake of not looking at ugly graphs, I graphed it in Python. Here's what #process 1 and #process 2 look like when graphed:
[Process 1 plot]
[Process 2 plot]
Magnitude is just the absolute value of the frequency spectrum. As you have labelled it in Process 1, "energy" is a good way to think about it.
Both Process 1 and Process 2 are in the same units. The only difference is that the values in Process 2 have been divided by the total length of the array (a scalar, hence no change of units). Normally this happens as part of the FFT, but sometimes it does not (e.g. numpy.fft doesn't include the division by length).
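For comparison, a rough sketch of what that manual normalization looks like with numpy's FFT (my own illustration, not part of your code):
import numpy as np
# one-sided magnitude spectrum of a real signal, normalized by the signal length
sig = np.asarray(onesec, dtype=float)      # reusing the 1-second slice from your code
mag = np.abs(np.fft.rfft(sig)) / len(sig)  # np.fft.rfft does not divide by N itself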
The easiest way to scale it to dB is:
(spectrum, freqs, _) = magnitude_spectrum(onesec, Fs=fs, scale='dB')
If you wanted to do this yourself then you would need to do something like:
spectrum2 = 20*numpy.log10(spectrum)
It is worth noting that I'm not sure whether you should be applying the /len(spectrum) or not. I would suggest using scale='dB'!
To convert to dB, take the log of any non-zero spectrum magnitudes and scale (scale to match a calibrated mic and sound source if available, or use an arbitrary scale to make the levels look familiar otherwise), before plotting.
For zero magnitude values, perhaps just replace or clamp the log with whatever value you want to sit at the bottom of your log plot (certainly not negative infinity).
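A minimal sketch of that clamping idea (the floor value here is an arbitrary choice of mine):
import numpy as np
floor_db = -120.0  # arbitrary bottom of the log plot
with np.errstate(divide='ignore'):
    spectrum_db = 20 * np.log10(spectrum)        # -inf wherever the magnitude is zero
spectrum_db = np.maximum(spectrum_db, floor_db)  # clamp zeros/tiny values to the floor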