I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.
While implementing the network, I was thinking the input should be the segmented characters of the word/phrase, since the pronunciation depends only on the characters that make up the word, unlike English where we have silent letters and parts of speech to consider. However, I do not know how I should train the output.
Since my dataset is a collection of words/phrases and the corresponding MP3 files, I thought of converting all the audio files to WAV using pydub.
from pydub import AudioSegment
sound = AudioSegment.from_mp3("audio/file1.mp3")
sound.export("wav/file1.wav", format="wav")
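If you need to do this for every file in the dataset, a minimal batch-conversion sketch (assuming the audio/ and wav/ directory layout from the snippet above) could look like this:
import os
from pydub import AudioSegment

# Convert every MP3 in audio/ to a WAV with the same base name in wav/
for name in os.listdir("audio"):
    if name.endswith(".mp3"):
        sound = AudioSegment.from_mp3(os.path.join("audio", name))
        out_name = os.path.splitext(name)[0] + ".wav"
        sound.export(os.path.join("wav", out_name), format="wav")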
Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.
import numpy as np
import wave
f = wave.open('wav/kn3.wav', 'rb')
frames = f.readframes(-1)
# Array of unsigned 8-bit integers in the range [0, 255]
# (np.fromstring is deprecated, so np.frombuffer is used instead)
data = np.frombuffer(frames, dtype='uint8')
# Normalized samples in [0, 1]
arr = data / 255
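Note that this assumes 8-bit unsigned samples; many WAV files are 16-bit PCM instead. A sketch for that case (normalizing to roughly [-1, 1]) might look like:
import numpy as np
import wave

with wave.open('wav/kn3.wav', 'rb') as f:
    frames = f.readframes(f.getnframes())

# 16-bit samples are signed integers in [-32768, 32767]
data16 = np.frombuffer(frames, dtype=np.int16)
arr = data16.astype(np.float32) / 32768.0  # roughly [-1, 1]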
How should I train this?
From here, I am not sure how to train this against the input text. It seems I would need a variable number of neurons in the first and last layers, since the number of characters (first layer) and the number of bytes in the corresponding wave (last layer) change for every input.
Since RNNs deal with such variable-length data, I thought they would come in handy here.
Correct me if I am wrong, but the outputs of neural networks are usually probability values between 0 and 1. However, we are not dealing with a classification problem. The audio can be anything, right? In my case, the "output" should be a vector of bytes corresponding to the WAV file. So there will be around 40,000 of these, with values between 0 and 255 (without the normalization step), for every word. How do I train on this speech data? Any suggestions are appreciated.
EDIT 1: In response to Aaron's comment
From what I understand, phonemes are the basic sounds of the language. So why do I need a neural network to map phoneme labels to speech? Can't I just say, "whenever you see this letter, pronounce it like this"? After all, this language, Kannada, is phonetic: there are no silent letters, and all words are pronounced the same way they are spelled. How would a neural network help here, then?
On input of a new text, I just need to break it down into the corresponding letters (which are also the phonemes) and retrieve each letter's audio file (converted from WAV to raw byte data), then merge the bytes together and convert them back to a WAV file.
Is this too simplistic? Am I missing something here? What would be the point of a neural network for this particular language (Kannada)?
It is not trivial and requires a special architecture. You can read descriptions of such architectures in publications from DeepMind and Baidu.
You might also want to study existing implementations of WaveNet training.
Overall, pure end-to-end speech synthesis still does not work well. If you are serious about text-to-speech, it is better to study conventional systems like Merlin.
Related
I have split audio files containing all the English letters (A, B, C, D, etc.) into separate .wav chunks. I want to sort each letter into a group; for example, I want all the audio files for the letter A grouped in one folder. I will then have 26 folders, each consisting of different recordings of the same letter.
I have searched for this and found some work on K-means clustering, but I could not get it to meet my requirement.
First of all, you need to convert the sounds into a representation suitable for further processing, i.e. feature vectors to which you can apply classification or clustering algorithms.
For audio, the typical choice is features based on the spectrum. To process the sounds, librosa can be very helpful.
Since the sounds have different durations and you probably want a fixed-size feature vector for each recording, you need a way to build a single feature vector on top of a series of frames. Here, different methods can be used, depending on the amount of data and the availability of labels. Assuming you have a limited number of recordings and no labels, you can start by simply stacking several vectors together. Averaging is another possibility, but it destroys the temporal information (which can be acceptable in this case). Training some kind of RNN to learn a representation as its hidden state is the most powerful method.
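As a rough sketch of the averaging approach (assuming librosa is installed and you have a list of WAV paths), each recording could be reduced to a single fixed-size vector like this:
import numpy as np
import librosa

def fixed_length_features(path, n_mfcc=20):
    # Load a recording and reduce it to one fixed-size vector by
    # averaging the MFCC frames over time (temporal order is discarded).
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                # shape (n_mfcc,)

# Feature matrix for clustering, e.g. with scikit-learn's KMeans:
# X = np.vstack([fixed_length_features(p) for p in wav_paths])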
Take a look at this related answer: How to classify continuous audio
I'm currently trying to classify emotions (7 classes) based on audio files. The first thing I did was to extract the features using the mfcc function in the python_speech_features library (https://python-speech-features.readthedocs.io/en/latest/#functions-provided-in-python-speech-features-module).
In the documentation, it says that each row contains one feature vector. The problem is that each audio file returns a different number of rows (frames) because the audio lengths differ. For example, for audio_1 the shape of the output is (155, 13), while for audio_2 the output's shape is (258, 13). Any advice on how to make them the same shape? I am currently using PCA to force the data to have the same dimensionality; is this a correct approach?
This is how I extract the features:
import numpy as np
from scipy.io import wavfile
import python_speech_features as features

sample_rate, data = wavfile.read(path)
mfccExtract = features.mfcc(data, sample_rate, winfunc=np.hamming)
If you want each audio sample to be the same length, there are 4 different approaches available:
Zero Padding
N Modulo Reduction
Interpolation
Dynamic Time Warping
You can apply any of these approaches to each audio sample; they are described in academic papers. A zero-padding sketch is shown below.
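For illustration, a minimal zero-padding sketch (assuming you have collected the (n_frames, 13) MFCC matrices from the question in a Python list) might look like:
import numpy as np

def pad_features(feature_list, max_len=None):
    # Zero-pad each (n_frames, 13) MFCC matrix along the time axis
    # so that every sample has the same number of rows.
    if max_len is None:
        max_len = max(f.shape[0] for f in feature_list)
    padded = [np.pad(f, ((0, max_len - f.shape[0]), (0, 0)), mode='constant')
              for f in feature_list]
    return np.stack(padded)  # shape (n_samples, max_len, 13)

# X = pad_features([mfcc_audio_1, mfcc_audio_2, ...])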
My question is about how to create a labeled image dataset for machine learning.
I have always worked with readily available datasets, so I am facing difficulties with how to label an image dataset (like we do in cat vs. dog classification).
I have to do labeling as well as image segmentation. After searching on the internet, I found some manual labeling tools such as LabelMe and Labelbox. LabelMe is good, but it returns its output in the form of XML files.
Now again my concern is how to feed XML files into the neural network. I am not at all good at image processing tasks, so I need an alternative suggestion.
Edit: I have scanned copies of degree certificates and normal documents, and I have to make a classifier that classifies degree certificates as 1 and non-degree certificates as 0. So my labels would look like:
Degree_certificate -> y(1)
Non_degree_cert -> y(0)
You don't feed XML files to the neural network. You process them with an XML parser, and use that to extract the label. See the question How do I parse XML in Python? for advice on how this works.
Image data sets can come in a variety of starting states. Sometimes, for instance, images are in folders which represent their class. If you like to work with this approach, then rather than read the XML file directly every time you train, use it to create a data set in the form that you like or are used to. The reason you find many nice ready-prepared data sets online is because other people have done exactly this. It is worth doing, as you don't then need to repeat all the transformations from raw data just to start training a model.
For example, collect your XML data from LabelMe, then use a short script to read each XML file, extract the label you entered previously using ElementTree, and copy the image to the correct folder. You will end up with a dataset consisting of two folders with positive and negative matching images, ready to process with your favourite CNN image-processing package.
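As a rough sketch (the exact tag names depend on your LabelMe export, so treat the paths below as assumptions), such a script might look like this:
import os
import shutil
import xml.etree.ElementTree as ET

def sort_image(xml_path, image_dir, out_root):
    # Assumes the label lives in annotation/object/name and the image
    # file name in annotation/filename; adjust to your actual XML layout.
    root = ET.parse(xml_path).getroot()
    filename = root.findtext('filename')
    label = root.findtext('./object/name')  # e.g. 'degree' or 'non_degree'
    target_dir = os.path.join(out_root, label)
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(os.path.join(image_dir, filename), target_dir)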
My main goal is to feed MFCC features to an ANN.
However, I am stuck at the data pre-processing step, and my question has two parts.
BACKGROUND :
I have an audio file.
I have a txt file that has the annotation and time stamp like this:
0.0 2.5 Music
2.5 6.05 silence
6.05 8.34 notmusic
8.34 12.0 silence
12.0 15.5 music
I know that for a single audio file I can calculate the MFCCs using librosa like this:
import librosa
y, sr = librosa.load('abcd.wav')
mfcc=librosa.feature.mfcc(y=y, sr=sr)
Part 1: I'm unable to wrap my head around how to calculate the MFCCs based on the segments from the annotations.
Part 2: How best to store these MFCCs for passing them to a Keras DNN? That is, should all MFCCs calculated per audio segment be saved to a single list/dictionary, or is it better to save them to different dictionaries so that all MFCCs belonging to one label are in one place?
I'm new to audio processing and Python, so I'm open to recommendations regarding best practices.
More than happy to provide additional details.
Thanks.
Part 1: MFCC to tag conversion
It's not obvious from the librosa documentation, but I believe the MFCCs are being calculated at about a 23 ms frame rate. With your code above, mfcc.shape will return (20, x), where 20 is the number of features and x is the number of frames. The default hop_length for mfcc is 512 samples, which means each MFCC frame spans about 23 ms (512/sr).
Using this you can compute which frame goes with which tag in your text file. For example, the tag Music goes from 0.0 to 2.5 seconds, so that will be MFCC frames 0 to 2.5*sr/512 ~= 108. The boundaries will not fall exactly on frames, so you need to round the values.
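As a small illustration (assuming librosa's defaults of sr = 22050 and hop_length = 512, which are assumptions about your setup), the timestamp-to-frame conversion could be done like this:
sr = 22050        # librosa's default sample rate
hop_length = 512  # librosa's default hop length for mfcc

def time_to_frame(t):
    # Convert a timestamp in seconds to the nearest MFCC frame index.
    return int(round(t * sr / hop_length))

# e.g. the 'Music' segment from 0.0 to 2.5 s:
start, end = time_to_frame(0.0), time_to_frame(2.5)  # roughly 0 to 108
music_mfcc = mfcc[:, start:end]                       # frames for that tag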
Part 2A: DNN Data Format
For the input (MFCC data) you'll need to figure out what the input looks like. You'll have 20 features, but do you want to feed a single frame to your net, or are you going to submit a time series? Your MFCC data is already a numpy array; however, it's formatted as (feature, sample). You probably want to transpose that for input to Keras, e.g. with mfcc.T.
For the output, you need to assign a numeric value to each tag in your text file. Typically you would store the tag-to-integer mapping in a dictionary. This will then be used to create the training output for the network. There should be one output integer for each input sample.
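A minimal sketch of both steps, assuming a hypothetical segments list of (start_frame, end_frame, tag) tuples built from your annotation file:
import numpy as np

# Rows become frames, columns become the 20 features, as Keras expects.
X = mfcc.T  # shape (n_frames, 20)

# Map each tag to an integer and build one label per frame.
label_to_int = {'music': 0, 'silence': 1, 'notmusic': 2}
y = np.zeros(X.shape[0], dtype=int)
for start, end, tag in segments:
    y[start:end] = label_to_int[tag]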
Part 2B: Saving the Data
The simplest way to do this is to use pickle to save the data and then reload it later. I like to use a class to encapsulate the input, output and dictionary data, but you can choose whatever works for you.
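For example, a bare-bones version using a plain dictionary (the file name and keys are just illustrative):
import pickle

# Save the features, labels and tag dictionary together...
with open('dataset.pkl', 'wb') as f:
    pickle.dump({'X': X, 'y': y, 'labels': label_to_int}, f)

# ...and reload them later for training.
with open('dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)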
I am working on detecting regions of a specific tree in an aerial image, and my approach uses texture detection. I have 4 descriptors/features, and I want to use FANN to create a machine learning setup that would properly detect the regions.
My question is,
Is the format for pyfann's input data always as described in https://stackoverflow.com/a/25703709/5722784 ?
What if I would like to have 4 input neurons and one output neuron, where for each input neuron I have a list (not a single integer) that I would like to feed in? Can FANN support this? If so, what format do I have to follow when building the input data?
Thank you so much for significant responses :)
Each input neuron can only take a single input - this is the case for all neural networks, irrespective of the library you use. I would suggest using each element in each of your lists as an input to the neural network, e.g. inputs 1-5 are your first list and then 6-10 are your second list. If you have variable length lists you likely have a problem, though.
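As an illustration (the list names are hypothetical, and each list is assumed to have a fixed length of 5), flattening the four descriptor lists into a single input vector could look like this:
import numpy as np

descriptor_lists = [list_1, list_2, list_3, list_4]  # your four feature lists
input_vector = np.concatenate([np.asarray(d, dtype=float)
                               for d in descriptor_lists])
# The network then has 20 input neurons (one per element) instead of 4.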