I'm currently trying to classify emotions (7 classes) based on audio files. The first thing I did was to extract the features using the mfcc function in the python_speech_features library (https://python-speech-features.readthedocs.io/en/latest/#functions-provided-in-python-speech-features-module).
In the documentation, it says that each row contains one feature vector. The problem is that each audio file returns a different number of rows (features) as the audio length is different. For example, for audio_1 the shape of the output is (155,13), for audio_2 the output's shape is (258,13). Any advice about how to make them the same shape? I am currently using PCA to force the data to have the same dimensionality, is this a correct approach?
This is how I extract the features:
sample_rate, data = wavfile.read(path)
mfccExtract = features.mfcc(data, sample_rate, winfunc=np.hamming)
If you want each audio sample to be the same length, there are 4 different approaches that are available:
Zero Padding
N Modulo Reduction
Interpolation
Dynamic Time Wrapping
You can use any of them approach to each audio sample. These approaches are available in academic papers.
Related
I have split audio files consisting of all the English letters (A, B, C, D, etc.) into separate chunks of audio .wav files. I want to sort each letter into a group. For example, I want all the audio files of letter A grouped in one folder. So then I will have 26 folders consists of different sounds of the same letters.
I have searched for this, and I found some work done on K-mean clustering, but I could not achieve my requirement.
First of all, you need to convert the sounds into representation suitable for further processing, so some feature vectors for which you can apply classification or clustering algorithms.
For audio, typical choice are features based on spectrum. To process sounds, librosa can be very helpful.
Since sounds have different duration and you probably want a fixed-size feature vector for each recording, you need a way to build a single feature vector on top of series of data. Here, different methods can be used, depending on your data amount and availability of labels. Assuming you have limited amount of recordings and no labels, you can start with simply stacking several vectors together. Averaging is another possibility, but it destroys the temporal information (which can be ok in this case). Training some kind of RNN to learn representation as hidden state is the most powerful method.
Take a look on this related answer: How to classify continuous audio
Given a matrix representing a video, where the rows represent video frames and the columns represent frame features, are there any techniques I can use to create a single feature vector from this without losing the spatio-temporal information from the matrix? i.e. 40 frame vectors with 36 features each converted to 1 vector with 36 features representing the entire 40 frames.
I've already tried taking the mean for each feature but I'm wondering about other techniques.
I have no experience on video data. But I had a sense that your problem is the same/or very similar to text summarization problem.
In principle, what you want is to "summarize" all your different time frames into just only one frame. It is very much like you have many sentences in one article but you just want to have one sentence which can summarize all the sentences in one article.
In general there are two methods:
1. Extract methods: :Evaluate the importance of each frame and put weights on different frames. Then simply use linear combination to have a summarize vector. In extreme case, you can just use the dominant frame as a representative frame if its weight is much bigger than the others. This method is very similar to TextRank or TextTeasing algorithm in NLP.
2. Abstract methods: : In case of NLP, you need some knowledge of linguist and sentence semantic structures in order to develop this approach. In your case, I believe you need to do the same thing for your video data.
My main goal is in feeding mfcc features to an ANN.
However I am stuck at the data pre processing step and my question has two parts.
BACKGROUND :
I have an audio.
I have a txt file that has the annotation and time stamp like this:
0.0 2.5 Music
2.5 6.05 silence
6.05 8.34 notmusic
8.34 12.0 silence
12.0 15.5 music
I know for a single audio file, I can calculate the mfcc using librosa like this :
import librosa
y, sr = librosa.load('abcd.wav')
mfcc=librosa.feature.mfcc(y=y, sr=sr)
Part 1: I'm unable to wrap my head around two things :
how to calculate mfcc based on the segments from the annotations.
Part2: How to best store these mfcc's for passing them to keras DNN. i.e should all mfcc's calculated per audio segment be saved to a single list/dictionary. or is it better to save them to different dictionaries so that all mfcc's belonging to one label are at one place.
I'm new to audio processing and python so, i'm open to recommendations regarding best practices.
More than happy to provide additional details.
Thanks.
Part 1: MFCC to tag conversion
It's not obvious from the librosa documentation but I believe the mfcc's are being calculated at about a 23mS frame rate. With your code above mfcc.shape will return (20, x) where 20 is the number of features and the x corresponds to x number of frames. The default hop_rate for mfcc is 512 samples which means each mfcc sample spans about 23mS (512/sr).
Using this you can compute which frame goes with which tag in your text file. For example, the tag Music goes from 0.0 to 2.5 seconds so that will be mfcc frame 0 to 2.5*sr/512 ~= 108. They will not come out exactly equal so you need to round the values.
Part 2A: DNN Data Format
For the input (mfcc data) you'll need to figure out what the input looks like. You'll have 20 features but do you want to input a single frame to your net or are you going to submit a time series. You're mfcc data is already a numpy array, however it's formatted as (feature, sample). You probably want to reverse that for input to Keras. You can use numpy.reshape to do that.
For the output, you need assign a numeric value to each tag in your text file. Typically you would store the the tag to integer in a dictionary. This will then be used to create your training output for the network. There should be one output integer for each input sample.
Part 2B: Saving the Data
The simplest way to do this is to use pickle to save and the reload it later. I like to use a class to encapsulate the input, output and dictionary data but you can choose whatever works for you.
I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.
While implementing the Network, I was thinking the input should be the segmented characters of the word/phrase as the output pronunciation only depends on the characters that make up the word, unlike English where we have slient words and Part of Speech to consider. However, I do not know how I should train the output.
Since my Dataset is a collection of words/phrases and the corrusponding MP3 files, I thought of converting these files to WAV using pydub for all audio files.
from pydub import AudioSegment
sound = AudioSegment.from_mp3("audio/file1.mp3")
sound.export("wav/file1.wav", format="wav")
Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.
import numpy as np
import wave
f = wave.open('wav/kn3.wav', 'rb')
frames = f.readframes(-1)
#Array of integers of range [0,255]
data = np.fromstring(frames, dtype='uint8')
#Normalized bytes of wav
arr = np.array(data)/255
How Should I train this?
From here, I am not sure how to train this with the input text. From this, I would need a variable number of input and output neurons in the First and Last layers as the number of characters (1st layer) and the bytes of the corresponding wave (Last layer) change for every input.
Since RNNs deal with such variable data, I thought it would come in handy here.
Correct me if I am wrong, but the output of Neural Networks are actually probability values between 0 and 1. However, we are not dealing with a classification problem. The audio can be anything, right? In my case, the "output" should be a vector of bytes corrusponding to the WAV file. So there will be around 40,000 of these with values between 0 and 255 (without the normalization step) for every word. How do I train this speech data? Any suggestions are appreciated.
EDIT 1 : In response to Aaron's comment
From what I understand, Phonemes are the basic sounds of the language. So, why do I need a neural network to map phoneme labels with speech? Can't I just say, "whenever you see this alphabet, pronounce it like this". After all, this language, Kannada, is phonetic: There are no silent words. All words are pronounced the same way they are spelled. How would a Neural Network help here then?
On input of a new text, I just need to break it down to the corresponding alphabets (which are also the phonemes) and retrieve it's file (converted from WAV to raw byte data). Now, merge the bytes together and convert it to a wav file.
Is this this too simplistic? Am I missing something here? What would be the point of a Neural Network for this particular language (Kannada) ?
It is not trivial and requires special architecture. You can read the description of it in a publications of DeepMind and Baidu.
You might also want to study existing implementation of wavenet training.
Overall, pure end-to-end speech synthesis is still not working. If you are serious about text-to-speech it is better to study conventional systems like merlin.
Anybody able to supply links, advice, or other forms of help to the following?
Objective - use python to classify 10-second audio samples so that I afterwards can speak into a microphone and have python pick out and play snippets (faded together) of closest matches from db.
My objective is not to have the closest match and I don't care what the source of the audio samples is. So the result is probably of no use other than speaking in noise (fun).
I would like the python app to be able to find a specific match of FFT for example within the 10 second samples in the db. I guess the real-time sampling of the microphone will have a 100 millisecond buffersample.
Any ideas? FFT? What db? Other?
In order to do this, you need three things:
Segmentation (decide how to make your audio samples)
Feature Extraction (decide what audio feature (e.g. FFT) you care about)
Distance Metric (decide what the "closest" sample is)
Segmentation: you currently describe using 10-second samples. I think you might have better results with shorter segments (closer to 100-1000ms) in order to get something that fits the changes in the voice better.
Feature Extraction: you mention using FFT. The zero crossing rate is surprisingly ok considering how simple it is. If you want to get more fancy, using MFCCs or spectral centroid is probably the way to go.
Distance Metric: most people use the euclidean distance, but there are also fancier ones like the manhattan distance, cosine distance, and earth-movers distance.
For a database, if you have a small enough set of samples, you might try just loading everything up into a kdtree so that you can do fast distance calculations, and just hold it in memory.
Good luck! It sounds like a fun project.
Try searching for algorithms on "music fingerprinting".
You could try some typical short-term feature extraction (e.g. energy, zero crossing rate, MFCCs, spectral features, chroma, etc) and then model your segment through a vector of feature statistics. Then you could use a simple distance-based classifier (e.g. kNN) in order to retrieve the "closest" training samples from a manually laballed set, given an unknown "query".
Check out my lib on several Python Audio Analysis functionalities: pyAudioAnalysis