My main goal is to feed MFCC features to an ANN.
However, I am stuck at the data preprocessing step, and my question has two parts.
BACKGROUND :
I have an audio.
I have a txt file that has the annotations and timestamps, like this:
0.0 2.5 Music
2.5 6.05 silence
6.05 8.34 notmusic
8.34 12.0 silence
12.0 15.5 music
I know that for a single audio file I can calculate the MFCCs using librosa like this:
import librosa
y, sr = librosa.load('abcd.wav')
mfcc = librosa.feature.mfcc(y=y, sr=sr)
Part 1: I'm unable to wrap my head around how to calculate the MFCCs based on the segments from the annotations.
Part 2: How best to store these MFCCs for passing them to a Keras DNN? That is, should all the MFCCs calculated per audio segment be saved to a single list/dictionary, or is it better to save them to separate dictionaries so that all MFCCs belonging to one label are in one place?
I'm new to audio processing and Python, so I'm open to recommendations regarding best practices.
More than happy to provide additional details.
Thanks.
Part 1: MFCC to tag conversion
It's not obvious from the librosa documentation, but I believe the MFCCs are being calculated at roughly a 23 ms frame rate. With your code above, mfcc.shape will return (20, x), where 20 is the number of features and x is the number of frames. The default hop_length for mfcc is 512 samples, which means each MFCC frame spans about 23 ms (512/sr).
Using this you can compute which frame goes with which tag in your text file. For example, the tag Music goes from 0.0 to 2.5 seconds, so that covers MFCC frames 0 to 2.5*sr/512 ≈ 108. The boundaries will not fall exactly on frame edges, so you need to round the values.
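For example, a rough sketch of that mapping (the annotation file name annotations.txt is a placeholder, and hop_length is assumed to stay at librosa's default of 512):

import librosa

y, sr = librosa.load('abcd.wav')           # librosa resamples to sr=22050 by default
mfcc = librosa.feature.mfcc(y=y, sr=sr)    # shape (20, n_frames)

hop_length = 512                           # librosa's default hop for mfcc
segments = []
with open('annotations.txt') as f:         # placeholder name for the annotation file
    for line in f:
        start, end, label = line.split()
        # convert seconds to frame indices, rounding to the nearest frame
        start_frame = int(round(float(start) * sr / hop_length))
        end_frame = int(round(float(end) * sr / hop_length))
        segments.append((mfcc[:, start_frame:end_frame], label))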
Part 2A: DNN Data Format
For the input (MFCC data) you'll need to figure out what the input looks like. You'll have 20 features, but do you want to feed a single frame to your net, or are you going to submit a time series? Your MFCC data is already a numpy array, but it's formatted as (feature, sample). You probably want to swap those axes for input to Keras; a transpose (mfcc.T) does that, whereas numpy.reshape would scramble the values.
For the output, you need to assign a numeric value to each tag in your text file. Typically you would store the tag-to-integer mapping in a dictionary. This will then be used to create your training output for the network. There should be one output integer for each input sample.
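Continuing the hypothetical segments list from the sketch above, transposing the data and building the tag-to-integer mapping might look roughly like this (the label names and variable names are assumptions):

import numpy as np

# tag-to-integer dictionary, based on the labels in the annotation file above
label_to_int = {'music': 0, 'silence': 1, 'notmusic': 2}

X, y_out = [], []
for seg_mfcc, label in segments:           # segments from the previous sketch
    frames = seg_mfcc.T                    # transpose to (n_frames, 20) for Keras
    X.append(frames)
    y_out.extend([label_to_int[label.lower()]] * frames.shape[0])  # one label per frame

X = np.vstack(X)                           # (total_frames, 20)
y_out = np.array(y_out)                    # (total_frames,)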
Part 2B: Saving the Data
The simplest way to do this is to use pickle to save the data and then reload it later. I like to use a class to encapsulate the input, output and dictionary data, but you can choose whatever works for you.
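For instance, a minimal pickle sketch reusing the hypothetical X, y_out and label_to_int from above (the file name is arbitrary):

import pickle

# save the inputs, outputs and label dictionary together
with open('mfcc_dataset.pkl', 'wb') as f:
    pickle.dump({'X': X, 'y': y_out, 'labels': label_to_int}, f)

# ...and reload them later for training
with open('mfcc_dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)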
Related
I have split audio files consisting of all the English letters (A, B, C, D, etc.) into separate chunks of .wav files. I want to sort each letter into a group. For example, I want all the audio files of the letter A grouped in one folder, so that I end up with 26 folders, each containing the different recordings of the same letter.
I have searched for this and found some work done on k-means clustering, but I could not achieve my requirement.
First of all, you need to convert the sounds into a representation suitable for further processing, i.e. feature vectors to which you can apply classification or clustering algorithms.
For audio, a typical choice is spectrum-based features. To process the sounds, librosa can be very helpful.
Since the sounds have different durations and you probably want a fixed-size feature vector for each recording, you need a way to build a single feature vector on top of a series of frames. Here, different methods can be used, depending on your amount of data and the availability of labels. Assuming you have a limited number of recordings and no labels, you can start with simply stacking several vectors together. Averaging is another possibility, but it destroys the temporal information (which may be OK in this case); a sketch of it is shown below. Training some kind of RNN to learn a representation as its hidden state is the most powerful method.
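As an illustration, one possible sketch of the averaging idea with librosa (pooling the mean and standard deviation of each coefficient is just one choice, and the file path is made up):

import librosa
import numpy as np

def fixed_size_features(path, n_mfcc=20):
    # one fixed-length vector per recording: mean and std of each MFCC over time
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    # averaging over the time axis discards temporal order but gives a fixed size
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# e.g. one 40-dimensional vector per letter recording
# vec = fixed_size_features('letters/A_01.wav')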
Take a look at this related answer: How to classify continuous audio
I'm currently trying to classify emotions (7 classes) based on audio files. The first thing I did was to extract the features using the mfcc function in the python_speech_features library (https://python-speech-features.readthedocs.io/en/latest/#functions-provided-in-python-speech-features-module).
In the documentation, it says that each row contains one feature vector. The problem is that each audio file returns a different number of rows (frames), as the audio lengths differ. For example, for audio_1 the shape of the output is (155, 13), while for audio_2 it is (258, 13). Any advice on how to make them the same shape? I am currently using PCA to force the data to have the same dimensionality; is this a correct approach?
This is how I extract the features:
# assumed imports: from scipy.io import wavfile; import numpy as np; import python_speech_features as features
sample_rate, data = wavfile.read(path)
mfccExtract = features.mfcc(data, sample_rate, winfunc=np.hamming)
If you want each audio sample to be the same length, there are four different approaches available:
Zero Padding
N Modulo Reduction
Interpolation
Dynamic Time Warping
You can apply any of these approaches to each audio sample; they are all described in academic papers. A sketch of the zero-padding option follows below.
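For instance, a rough sketch of zero padding, assuming (n_frames, 13) MFCC arrays as in the question (the target length of 258 is just the larger of the two shapes you mention):

import numpy as np

def zero_pad(mfcc_frames, target_len):
    # pad (or truncate) an (n_frames, 13) MFCC array to exactly target_len frames
    n_frames, n_feat = mfcc_frames.shape
    if n_frames >= target_len:
        return mfcc_frames[:target_len]
    padding = np.zeros((target_len - n_frames, n_feat))
    return np.vstack([mfcc_frames, padding])

# e.g. pad every file up to the longest one in the dataset
# padded = zero_pad(mfccExtract, 258)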
I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.
While implementing the network, I was thinking the input should be the segmented characters of the word/phrase, as the output pronunciation depends only on the characters that make up the word, unlike English where we have silent words and part of speech to consider. However, I do not know how I should train the output.
Since my dataset is a collection of words/phrases and the corresponding MP3 files, I thought of converting all the audio files to WAV using pydub.
from pydub import AudioSegment
sound = AudioSegment.from_mp3("audio/file1.mp3")
sound.export("wav/file1.wav", format="wav")
Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.
import numpy as np
import wave
f = wave.open('wav/kn3.wav', 'rb')
frames = f.readframes(-1)
# Array of integers in the range [0, 255] (frombuffer replaces the deprecated fromstring)
data = np.frombuffer(frames, dtype='uint8')
# Normalized bytes of the wav, scaled to [0, 1]
arr = data / 255.0
How Should I train this?
From here, I am not sure how to train this with the input text. I would need a variable number of input and output neurons in the first and last layers, as the number of characters (first layer) and the number of bytes of the corresponding wave (last layer) change for every input.
Since RNNs deal with such variable-length data, I thought they would come in handy here.
Correct me if I am wrong, but the output of a neural network is actually a set of probability values between 0 and 1. However, we are not dealing with a classification problem here. The audio can be anything, right? In my case, the "output" should be a vector of bytes corresponding to the WAV file. So there will be around 40,000 of these, with values between 0 and 255 (without the normalization step), for every word. How do I train this speech data? Any suggestions are appreciated.
EDIT 1 : In response to Aaron's comment
From what I understand, phonemes are the basic sounds of a language. So why do I need a neural network to map phoneme labels to speech? Can't I just say, "whenever you see this alphabet, pronounce it like this"? After all, this language, Kannada, is phonetic: there are no silent words, and all words are pronounced the same way they are spelled. How would a neural network help here, then?
On input of a new text, I just need to break it down into its constituent alphabets (which are also the phonemes) and retrieve their files (converted from WAV to raw byte data). Then merge the bytes together and convert them into a WAV file.
Is this too simplistic? Am I missing something here? What would be the point of a neural network for this particular language (Kannada)?
It is not trivial and requires a special architecture. You can read descriptions of it in publications from DeepMind and Baidu.
You might also want to study an existing implementation of WaveNet training.
Overall, pure end-to-end speech synthesis is still not working. If you are serious about text-to-speech, it is better to study conventional systems like Merlin.
I was trying to use the TensorFlow tf.image_summary op, but it wasn't clear to me how to use it. In the TensorBoard readme file they have the following sentence, which confuses me:
The dashboard is set up so that each row corresponds to a different tag, and each column corresponds to a run.
I don't understand the sentence and thus, I am having a hard time figuring out what the columns and rows mean for TensorBoard image visualization. What exactly is a "tag" and what exactly is a "run"? How do I get multiple "tags" and multiple "runs" to display? Why would I want multiple "tags" and "runs" to display?
Does someone have a very simple but non-trivial example of how to use this?
Ideally, I want to compare how my model performs with respect to PCA, so it would be nice to compare my reconstructions to the PCA reconstruction at each step. I'm not sure if this is a good idea, but I also want to see what the activation images look like and what the templates look like.
Currently I have a very simple script with the following lines:
with tf.name_scope('input_reshape'):
    x_image = tf.to_float(x, name='ToFloat')
    image_shaped_input = tf.reshape(x_image, [-1, 28, 28, 1])
    tf.image_summary('input', image_shaped_input, 10)
Currently I have managed to discover that the rows are of length 10, so I assume it's showing me 10 images that have something to do with the current run/batch.
However, if possible, I'd like to see reconstructions, filters (currently I am using fully connected layers to keep things simple, but eventually it would be nice to see a conv net example), activation units (with any number of units that I choose), etc.
TensorFlow was officially released (r1.0) after this question was posed, and the functions and documentation accompanying TensorBoard have been simplified.
tf.summary.image is now the Op for writing images represented by a 4D Tensor to the summary file; here is the documentation.
To answer your questions about rows and columns: each call to tf.summary.image generates a new tag, i.e. a new row of image summaries, with the number of images in that row dictated by the value passed as max_outputs (10 in your example).
As to why one might want to view more than one column of data: if the first dimension of the 4D Tensor is greater than 1 (i.e. batch size > 1), it is helpful to see more than one column in TensorBoard to get a better sense of the entire batch of images.
Finally, having multiple tags is helpful when wanting to view two different collections of images, such as input images and reconstructed images if you were building an autoencoder architecture.
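As a rough sketch against the TensorFlow 1.x API (the placeholders, shapes and log directory below are assumptions, not your actual graph), two tags plus separate log directories give you two rows and one column per run:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])      # flattened 28x28 input images
recon = tf.placeholder(tf.float32, [None, 784])  # hypothetical reconstruction tensor

# two different tags -> two rows in TensorBoard's image dashboard
tf.summary.image('input', tf.reshape(x, [-1, 28, 28, 1]), max_outputs=10)
tf.summary.image('reconstruction', tf.reshape(recon, [-1, 28, 28, 1]), max_outputs=10)

merged = tf.summary.merge_all()
# each log directory you point TensorBoard at shows up as a separate run (column)
writer = tf.summary.FileWriter('logs/run_1', tf.get_default_graph())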
I am working on detecting regions of a specific tree in an aerial image, and my approach uses texture detection. I have 4 descriptors/features and I want to use FANN to create a machine learning environment that would properly detect the regions.
My question is,
is the input format for pyfann always as described in https://stackoverflow.com/a/25703709/5722784 ?
What if I would like to have 4 input neurons and one output neuron, where each input neuron receives a list (not a single integer)? Can FANN support that? If so, what format do I have to follow for the input data?
Thank you so much for significant responses :)
Each input neuron can only take a single input - this is the case for all neural networks, irrespective of the library you use. I would suggest using each element in each of your lists as an input to the neural network, e.g. inputs 1-5 are your first list and then 6-10 are your second list. If you have variable length lists you likely have a problem, though.
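For illustration, a small sketch of that flattening step (the descriptor values and lengths are made up); the flattened vector is what you would then hand to FANN or any other library:

import numpy as np

# four descriptor lists for one training example (hypothetical values)
descriptors = [
    [0.1, 0.4, 0.3],
    [0.7, 0.2, 0.9],
    [0.5, 0.5, 0.1],
    [0.0, 0.8, 0.6],
]

# flatten them into one input vector: 4 lists of length 3 -> 12 input neurons
input_vector = np.concatenate(descriptors)
print(input_vector.shape)   # (12,)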