I am working on detecting regions of a specific tree species in an aerial image, and my approach uses texture detection. I have 4 descriptors/features and I want to use FANN to create a machine learning environment that would properly detect the regions.
My question is,
is the input format for pyfann always as described in https://stackoverflow.com/a/25703709/5722784 ?
What if I would like to have 4 input neurons and one output neuron, where for each input neuron I have a list (not a single integer) that I would like to feed into it? Can FANN provide this? If so, what format do I have to follow when building the input data?
Thank you so much for any helpful responses :)
Each input neuron can only take a single input - this is the case for all neural networks, irrespective of the library you use. I would suggest using each element of each of your lists as an input to the neural network, e.g. if each list has five elements, inputs 1-5 take your first list and inputs 6-10 take your second list. If you have variable-length lists you likely have a problem, though.
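For concreteness, here is a minimal sketch in plain Python (with hypothetical descriptor values and file names) of flattening four descriptor lists into one input vector per sample and writing them in FANN's plain-text training format; check the exact pyfann calls against your FANN version before relying on them.

```python
# FANN's plain-text training format, as I understand it:
#   <num_pairs> <num_inputs> <num_outputs>
#   <input values ...>
#   <output values ...>
# Flattening the four descriptor lists gives each value its own input neuron.

samples = [
    # (four hypothetical descriptor lists, expected output) per training sample
    ([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]], 1.0),
    ([[0.9, 0.8], [0.7, 0.6], [0.5, 0.4], [0.3, 0.2]], 0.0),
]

num_inputs = sum(len(d) for d in samples[0][0])   # 8 input neurons here
num_outputs = 1

with open("tree_texture.data", "w") as f:
    f.write(f"{len(samples)} {num_inputs} {num_outputs}\n")
    for descriptors, label in samples:
        flat = [v for d in descriptors for v in d]    # 4 lists -> one vector
        f.write(" ".join(str(v) for v in flat) + "\n")
        f.write(f"{label}\n")

# Training could then follow the pyfann examples, roughly:
# from pyfann import libfann
# ann = libfann.neural_net()
# ann.create_standard_array((num_inputs, 8, num_outputs))
# ann.train_on_file("tree_texture.data", 1000, 100, 0.001)
```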
In school we have to listen to intervals and chords and determine their names. I'm really into neural networks. That's why I want to create a neural network with Python which listens to the audio and gives me the name as an output. I've learned once that for music I need an LSTM. Do I also need an LSTM for this purpose, and how/where should I start? Can anybody teach me how to achieve my goal?
First of all you need to define exactly the task you would like to solve: do you want to classify a whole piece of music/track, or do you want to classify segments of the piece/track? This will influence which architecture you need to use. I will briefly present an approach for each of those tasks.
Classifying a track: Recordings of music are time series, and for each of your recordings you need to have a label. Your first intuition of using LSTMs (or RNNs in general) is a good one. Just use your recording, transformed into a vector, as the input sequence for your LSTM network and let it output probabilities for each class. As already indicated by a comment, working in frequency space can be beneficial. However, just using the Fourier transformation of the whole track will most likely lose important information, since the temporal frequency information is lost. Rather, use the Short-time Fourier Transformation (STFT) or Mel-frequency cepstrum coefficients (MFCC; here is a Python library to calculate them: libROSA). Very oversimplified, those methods will transform your time series into some kind of 'image', a two-dimensional frequency spectrum, and for image classification tasks Convolutional Neural Networks (CNNs) are the way to go.
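A rough sketch of that pipeline (file names, clip length and class count are assumptions for illustration): each recording is turned into an MFCC "image" with libROSA and the whole clip is classified with a small CNN in Keras.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

def clip_to_mfcc(path, sr=22050, n_mfcc=20, duration=4.0):
    y, _ = librosa.load(path, sr=sr, duration=duration)
    target = int(sr * duration)
    y = np.pad(y, (0, max(0, target - len(y))))[:target]    # pad/trim to 4 s
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc[..., np.newaxis]                             # add channel axis

n_classes = 24  # e.g. 12 roots x major/minor -- an assumption
model = models.Sequential([
    layers.Input(shape=(20, None, 1)),         # any number of time-frames
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# X = np.stack([clip_to_mfcc(p) for p in paths]); model.fit(X, labels, ...)
```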
Classifying segments: If you want to classify segments of your track, you need a label for each time-frame in your song. Let's say your song is 3 minutes long and you have a sampling frequency of 60 Hz; your vector representation of the song will then have 3*60*60 = 10800 time-frames, and for each of these entries you need to provide a class label (chord or whatever). Again you can use LSTMs: use your vector as the input sequence, let your network produce an output sequence of the same length as your song, and compare it to the class labels. You could also use the previously mentioned STFT or MFC coefficients as inputs and take advantage of the frequency information; you will then have a spectrum for each time-frame as input.
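A minimal frame-wise sketch of this (Keras assumed; the 20 features per frame and 24 chord classes are placeholders): an LSTM that returns one prediction per time-frame, trained against a label sequence of the same length.

```python
import numpy as np
from tensorflow.keras import layers, models

n_features, n_classes = 20, 24
model = models.Sequential([
    layers.Input(shape=(None, n_features)),    # any number of time-frames
    layers.LSTM(64, return_sequences=True),    # one hidden state per frame
    layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# X: (n_songs, n_frames, n_features) inputs, y: (n_songs, n_frames) chord ids,
# e.g. n_frames = 10800 for the 3-minute / 60 Hz example above.
X = np.random.rand(4, 10800, n_features).astype("float32")
y = np.random.randint(0, n_classes, size=(4, 10800))
model.fit(X, y, epochs=1, batch_size=2)
```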
I hope these broad ideas bring you one step closer to solving your task. For implementation details I'd like to point you to the Keras documentation and to the countless tutorials on the internet.
Disclaimer:
My knowledge of music theory is rather limited, so please take my answer with a grain of salt and feel free to correct me or ask for clarification. Have fun
I have an image that consists of only black and white pixels. 500x233px. I want to use this image as an input, or maybe all the pixels in the image as individual inputs, and then receive 3 floating point values as output using machine learning.
I have spent all day on this and have come up with nothing. The best I can find is some image classification libraries, but I am not trying to classify an image. I’m trying to get 3 values that range from -180.0 to 180.0.
I just want to know where to start. Tensorflow seems like it could probably do what I want, but I have no idea where to start with it.
I think my main issue is that I don’t have one output for each input. I’ve been trying to use each pixel’s value (0 or 1) as an input, but my output doesn’t apply to each pixel, only to the image as a whole. I’ve tried creating a string of each pixel’s value and using that as one input, but that didn’t seem to work either.
Should I be using neural networks? Or genetic algorithms? Or something else? Or would I be better off with only receiving one of the three outputs I need, and just training three different models for each output? Even then, I’m not sure how to get a floating point value out of these things. Maybe machine learning isn’t even the correct approach.
Any help is greatly appreciated!
I want to run a simple MLP Classifier (scikit-learn) with the following set of data.
The data set consists of 100 files containing sound signals. Each file has two columns (two signals), and the number of rows (the length of the signals) varies from file to file, ranging between 70 and 80 values. So the dimensions of a file range from 70 x 2 to 80 x 2. Each file represents one complete record.
The problem I am facing is how to train a simple MLP with variable-length data, with the training and testing sets containing 75 and 25 files respectively.
One solution is to concatenate all files into one file, i.e. 7500 x 2, and train the MLP. But in that case the important per-record structure of the signals is lost.
Three approaches in order of usefulness. Approach 1 is strongly recommended.
1st Approach - LSTM/GRU
Don't use a simple MLP. The type of data you're dealing with is sequential data. Recurrent networks (LSTM/GRU) were created for this purpose. They are capable of processing variable-length sequences.
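A minimal sketch of this approach (Keras assumed; the class count and the random records/labels are placeholders for your 100 files), feeding each 70-80 x 2 record to an LSTM one at a time so no padding is needed:

```python
import numpy as np
from tensorflow.keras import layers, models

n_classes = 2
model = models.Sequential([
    layers.Input(shape=(None, 2)),              # None = variable sequence length
    layers.LSTM(32),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

records = [np.random.rand(np.random.randint(70, 81), 2) for _ in range(100)]
labels = np.random.randint(0, n_classes, size=100)
for rec, lab in zip(records, labels):           # batch size of one record
    model.train_on_batch(rec[np.newaxis], np.array([lab]))
```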
2nd Approach - Embeddings
Find a function that can transform your data into a fixed-length representation, called an embedding. An example of a network producing time-series embeddings is TimeNet. However, that essentially brings us back to the first approach.
3rd Approach - Padding
If you can find a reasonable upper bound for the sequence length, you can pad shorter series to the length of the longest one (pad with 0 at the beginning/end of the series, or interpolate/forecast the remaining values), or cut longer series to the length of the shortest one. Obviously you will either introduce noise or lose information, respectively.
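Here is a minimal sketch of the padding route with scikit-learn's MLPClassifier (the random records and labels stand in for your 100 files and their class labels):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

records = [np.random.rand(np.random.randint(70, 81), 2) for _ in range(100)]
labels = np.random.randint(0, 2, size=100)         # one class label per file

max_len = max(len(r) for r in records)             # 80 here
X = np.zeros((len(records), max_len, 2))
for i, rec in enumerate(records):
    X[i, :len(rec)] = rec                          # zero-pad at the end
X = X.reshape(len(records), -1)                    # flatten to (100, 160)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```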
This is a very old question; however, it is very related to my recent research topic. Aechlys provides alternatives to solve your problem, which is great. Let me explain it a bit more clearly. Neural networks can be divided into two sorts according to their input length: one takes fixed-size inputs and the other takes varying-size inputs.
For fixed-size inputs, the most common example is the MLP. Traditionally, it assumes the input features have no natural order; in other words, the order of your input features does not matter. For instance, you might use age, sex, and education to predict a person's salary. These features can be placed at any positions of your MLP's input.
For varying-size inputs, model architectures include RNNs, LSTMs, and Transformers. They are specifically designed for sequential data like text and time series. These sorts of data have a natural order among their data points, and such models can deal well with varying-size inputs.
To summarize, an MLP is likely the wrong model for your signals. The better choice is to adopt an RNN/Transformer.
I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.
While implementing the network, I was thinking the input should be the segmented characters of the word/phrase, as the pronunciation depends only on the characters that make up the word, unlike English where we have silent letters and part of speech to consider. However, I do not know how I should train the output.
Since my dataset is a collection of words/phrases and the corresponding MP3 files, I thought of converting all the audio files to WAV using pydub.
from pydub import AudioSegment
sound = AudioSegment.from_mp3("audio/file1.mp3")
sound.export("wav/file1.wav", format="wav")
Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.
import numpy as np
import wave
f = wave.open('wav/kn3.wav', 'rb')
frames = f.readframes(-1)
# Array of integers in the range [0, 255]
# (assumes an 8-bit WAV; a 16-bit file would need dtype='int16')
data = np.frombuffer(frames, dtype='uint8')
# Normalized bytes of the wav
arr = data / 255.0
How should I train this?
From here, I am not sure how to train this with the input text. I would need a variable number of input and output neurons in the first and last layers, as the number of characters (first layer) and the number of bytes of the corresponding wave (last layer) change for every input.
Since RNNs deal with such variable data, I thought they would come in handy here.
Correct me if I am wrong, but the outputs of neural networks are actually probability values between 0 and 1. However, we are not dealing with a classification problem here. The audio can be anything, right? In my case, the "output" should be a vector of bytes corresponding to the WAV file. So there will be around 40,000 of these, with values between 0 and 255 (without the normalization step), for every word. How do I train on this speech data? Any suggestions are appreciated.
EDIT 1: In response to Aaron's comment
From what I understand, phonemes are the basic sounds of the language. So why do I need a neural network to map phoneme labels to speech? Can't I just say, "whenever you see this character, pronounce it like this"? After all, this language, Kannada, is phonetic: there are no silent letters. All words are pronounced the same way they are spelled. How would a neural network help here then?
On input of new text, I just need to break it down into the corresponding characters (which are also the phonemes) and retrieve each one's file (converted from WAV to raw byte data). Then merge the bytes together and convert them to a WAV file.
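For example, something like this rough sketch is what I have in mind (the character-to-file mapping and the clip paths are made up for illustration):

```python
from pydub import AudioSegment

clip_for_char = {            # hypothetical mapping: character -> recorded WAV
    "ಕ": "wav/ka.wav",
    "ನ": "wav/na.wav",
}

def synthesize(word, out_path="out.wav"):
    # Look up one pre-recorded clip per character and concatenate them in order
    pieces = [AudioSegment.from_wav(clip_for_char[ch]) for ch in word]
    audio = sum(pieces[1:], pieces[0])
    audio.export(out_path, format="wav")

synthesize("ಕನ")
```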
Is this too simplistic? Am I missing something here? What would be the point of a neural network for this particular language (Kannada)?
It is not trivial and requires a special architecture. You can read a description of it in publications from DeepMind and Baidu.
You might also want to study existing implementations of WaveNet training.
Overall, pure end-to-end speech synthesis is still not working reliably. If you are serious about text-to-speech, it is better to study conventional systems like Merlin.
I just started working on an artificial life simulation (again... I lost the other one) in Python and Pygame using Pybrain, and I'm planning how this is going to work. So far I have an environment with some "food pellets". A food pellet is added every minute. I haven't made my agents (aka "Creatures") yet, but I know I want them to have simple feed-forward neural networks with some inputs, and the outputs will be their movement. I want the inputs to show what's in front of them, sort of like they are seeing the simulated world in front of them. How should I go about this? I either want them to actually "see" the colors in their line of vision, or just input the nearest object into their NN. Which one would be best, and how would I implement them?
Having a full field of vision is technically possible in a neural network, but requires a LOT of inputs and massive processing; not a direction you should expect to be able to evolve in any kind of meaningful way.
A neural network deals with values and thresholds. I'd recommend using two inputs associated with the nearest individual - one of them has a value for the distance (of the nearest) and the other its angle (with zero being directly ahead, less than zero being on the left, and greater than zero being on the right).
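For example, a minimal sketch (assuming each creature has an (x, y) position and a heading in radians, and pellets are (x, y) tuples; names are placeholders) of turning the nearest pellet into those two inputs:

```python
import math

def nearest_pellet_inputs(creature_pos, heading, pellets, max_dist=500.0):
    px, py = creature_pos
    nearest = min(pellets, key=lambda p: math.hypot(p[0] - px, p[1] - py))
    dx, dy = nearest[0] - px, nearest[1] - py
    dist = math.hypot(dx, dy)
    # Signed angle to the pellet relative to the heading: 0 = straight ahead;
    # which sign means left vs. right depends on your coordinate convention
    # (Pygame's y axis points down).
    angle = math.atan2(dy, dx) - heading
    angle = (angle + math.pi) % (2 * math.pi) - math.pi    # wrap to [-pi, pi]
    return dist / max_dist, angle / math.pi                # both roughly in [-1, 1]

# Example: creature at the origin facing along +x, two pellets in view
print(nearest_pellet_inputs((0, 0), 0.0, [(10, 10), (100, -50)]))
```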
Make sure that these values are easy to process into outputs. For example, if one output goes to a rotation actuator, make sure that the input values and output values are on the same scale. Then it will be easy to turn either toward or away from a particular individual.
If you want them to be able to see multiple individuals, simply include multiple pairs of inputs. I was going to suggest putting them in distance order, but it might be easier for them if, once an organism sees something, it always comes in on the same inputs until it is no longer tracked.