Cluster identification with NN - python

I have a dataframe containing the coordinates of millions of particles which I want to use to train a neural network. These particles form individual clusters which are already identified and labeled, meaning that every particle is already assigned to its correct cluster (this assignment is done by a density estimation, but that is not relevant for my purpose).
The challenge is now to build a network which does this clustering after learning from this huge dataset. There are also a few more features in the dataframe, like cluster size, number of particles per cluster, etc.
Since this is not a classification problem but rather a cluster-identification challenge, what kind of neural network should I use? I also have problems building this network: take, for example, a CNN which classifies whether there is a dog or a cat in a picture. The output is obviously binary, so the last layer just consists of two outputs which represent the probabilities of being 1 or 0. But how can I implement the last layer when I want to identify clusters?
During my research I heard about self-organizing maps. Would these networks do the job?
Thank you!

If you want to treat clustering as a classification problem, then you can try to train the network to predict whether two points belong to the same cluster or to different clusters.
This does not ultimately solve your problem, though: to cluster the data, this labeling needs to be transitive (which it likely will not be), and you have to label n² pairs, which is expensive.
Furthermore, because your clustering is density-based, your network may need to know about further data points to judge which ones should be connected...
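As a sketch of this pairwise approach, one could feed a small network two concatenated coordinate vectors and have it predict whether they share a cluster label. The synthetic two-cluster data, the layer sizes, and the number of sampled pairs below are all arbitrary assumptions, not something from the question:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Two well-separated synthetic 3-D clusters standing in for the labeled particles.
a = rng.normal(loc=0.0, scale=0.5, size=(200, 3))
b = rng.normal(loc=5.0, scale=0.5, size=(200, 3))
points = np.vstack([a, b])
labels = np.array([0] * 200 + [1] * 200)

# Build training pairs: input is two concatenated coordinate vectors,
# target is 1 if both points carry the same cluster label.
idx = rng.integers(0, len(points), size=(5000, 2))
X = np.hstack([points[idx[:, 0]], points[idx[:, 1]]])
y = (labels[idx[:, 0]] == labels[idx[:, 1]]).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
clf.fit(X, y)
```

Note the quadratic blow-up mentioned above: even 400 points already yield 160,000 possible pairs, which is why this labeling scheme gets expensive fast.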

These particles build individual clusters which are already identified
and labeled; meaning that every particle is already assigned to its
correct cluster (this assignment is done by a density estimation but
for my purpose not that relevant).
the challenge is now to build a network which does this clustering
after learning from the huge data.
Sounds pretty much like a classification problem to me. Images themselves can build clusters in their image space (e.g. a vector space of dimension width * height * RGB).
since this is not a classification problem but more a identification
of clusters-challenge what kind of neural network should i use?
You have coordinate data and you have labels. Start with a simple fully connected single- or multi-layer perceptron (i.e. a vanilla NN) with as many outputs as there are clusters and a softmax activation function.
There are tons of blogs and tutorials for deep learning libraries like Keras out there on the internet.
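A minimal sketch of that setup in Keras — the cluster count, layer widths, and dummy data are assumptions to be replaced with the coordinates and labels from the actual dataframe:

```python
import numpy as np
from tensorflow import keras

n_clusters = 5   # assumed number of labeled clusters
n_features = 3   # x, y, z coordinates per particle

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    # one output per cluster; softmax turns them into a probability distribution
    keras.layers.Dense(n_clusters, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# dummy data standing in for the real particle dataframe
X = np.random.rand(1000, n_features)
y = np.random.randint(0, n_clusters, size=1000)
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```

The predicted cluster for a particle is then simply the argmax over the softmax outputs.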

Related

Weight prediction using NNs

I'm relatively new to the topic of machine learning, so naturally I have a couple of issues that I hope you can help me with or point me in the right direction on. In a previous project, we collected data of people walking normally and also with a stone in their shoe. We measured acceleration and also recorded gyroscope data. Based on this data I built a neural network that can classify the signals into normal or impaired walking, so two possible outputs.
Now my idea is this: using the same data, I want to build a network that can predict the weights of the participants (these were also recorded).
Based on this my three questions:
- What kind of network structure is most suitable for such a task? (Dense, CNN, LSTM,…)
- Before, the network basically had two options to answer with (normal or impaired walking), but now I have a continuous range of answers… How can this be approached?
- How can I make sure the network initializes with a sensible prediction?
I hope all the questions make sense. Any help will be much appreciated!
You can use whichever NN architecture you prefer:
If you work with sequences, use 1D convolutions or RNNs.
As you are dealing with a regression problem, you need a single output neuron with no (i.e. linear) activation function.
Take a look here to learn how to solve a regression problem with RNNs.
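As a hedged sketch of that recipe, here is a 1D-convolutional regressor in Keras with a single linear output neuron. The window length, the six sensor channels (accelerometer + gyroscope), and the weight range are made-up assumptions, not details from the question:

```python
import numpy as np
from tensorflow import keras

# Assumed input: windows of 100 timesteps x 6 channels (3 accel + 3 gyro axes).
model = keras.Sequential([
    keras.layers.Input(shape=(100, 6)),
    keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),  # single output neuron, linear activation: regression
])
# mean squared error is the standard regression loss
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# dummy data; real inputs would be the recorded sensor windows
X = np.random.rand(64, 100, 6)
y = np.random.uniform(50, 100, size=64)  # hypothetical weights in kg
model.fit(X, y, epochs=1, verbose=0)
```

Swapping `Conv1D` for an `LSTM` layer keeps the rest of the setup unchanged, which makes it easy to compare the architectures mentioned in the question.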

Neural network architecture recommendation

It's my first time working with neural networks, and I have been given the task of predicting some values of a dataset; I could make good use of some help deciding which is the smartest architecture for the task. I'm working with Keras using TensorFlow as backend.
I won't go into details, but basically I have performed lots of CFD simulations on similar but slightly different geometries to obtain a stress value on the surface of the geometries. All the geometries have the same connectivity and number of nodes, and I have the stress value for each of those nodes.
Simply put, I have an input matrix of [2500, 3, 300], where 2500 is the number of nodes in each geometry, 3 represents the x, y, z coordinates in space of each node on the mesh, and 300 is the total number of geometries I have. For the stress I have an output matrix of [2500, 300], where 2500 is the stress value for each node and 300 once again corresponds to the number of instances. I would like to train some kind of neural network so I can predict the stress values given the geometry.
I have been basing my architecture on the following paper, but I cannot make use of the part in which the convolutional networks are employed: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5805990/
The simplest approach I can think of is a fully connected network, but with my scarce knowledge of the subject I struggle to figure out the layer architecture that relates the 3D matrix of the geometry to the 2D output stress matrix.
Any suggestion is more than welcome. Thanks for your help!
Since I have been working on stress-value prediction using DL, I would recommend CNN models: their filters can learn the correlations between the parameters you are dealing with. RNNs and their improved variants like LSTM and GRU also perform well, provided you have sufficient data. Unfortunately I can't point you to a paper, because this issue is still under study.
Another point worth noting is that how you reshape your data before feeding it to NN models matters, especially when you are dealing with time-series data.
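On the reshaping point, a simple fully connected sketch: put the instance axis first (samples-first is what Keras expects, so the question's [2500, 3, 300] array would become [300, 2500, 3] via `np.moveaxis`), flatten each geometry, and map it to one stress value per node. The layer widths and the random data are arbitrary assumptions:

```python
import numpy as np
from tensorflow import keras

n_nodes = 2500  # nodes per geometry, from the question

# Stand-in data in samples-first layout: 300 geometries of 2500 nodes x (x, y, z).
X = np.random.rand(300, n_nodes, 3).astype("float32")
y = np.random.rand(300, n_nodes).astype("float32")  # one stress value per node

model = keras.Sequential([
    keras.layers.Input(shape=(n_nodes, 3)),
    keras.layers.Flatten(),              # (2500, 3) -> 7500 input features
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(n_nodes),         # linear output: 2500 stress values
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=1, batch_size=16, verbose=0)
```

With only 300 instances and millions of weights such a network will overfit easily, so strong regularization or the paper's convolutional weight sharing would likely be needed in practice.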

Convolutional Autoencoder feature learning

I am training a convolutional autoencoder on my own dataset. After training, the network is able to reconstruct the test images from the dataset quite well.
I am now taking the intermediate representation (1648-dim) from the encoder network and trying to cluster the feature vectors into 17 (known upfront) different classes using GMM soft clustering. However, the clusters are really bad and it is not able to group the images into their respective categories.
I am using the sklearn.mixture.GaussianMixture package for clustering, with a regularization of 0.01 and 'full' covariance_type.
My question: why do you think the reconstruction is very decent but the clustering is quite bad? Does it mean the intermediate features learned by the network are not adequate?
Let's invert the question: why do you think it should have any meaning? You are using clustering, which is just an arbitrary method of splitting data into groups, yet you expect it to discover classes. Why would it? There is literally nothing forcing the model to do so, and it is probably modeling completely different things (like patches of images, textures, etc.). In general, you should never expect clustering to solve the problem of some arbitrary labeling; this is not what clustering is for. To give you more perspective here: you have images, which come from, say, 10 categories (like cats, dogs, etc.), and you ask:
why does clustering in the feature space not recover classes?
Note that equally valid questions would be:
why does clustering in the feature space not divide images into "reddish", "greenish" and "bluish"?
why does clustering in the feature space not divide images by the size of the object in the image?
why does clustering in the feature space not divide images by the country they are from?
There are exponentially many labelings that could be assigned to each dataset, and nothing in your training uses any labels (autoencoding is unsupervised, clustering is unsupervised), so expecting that the result will magically guess which of these many labelings you have in mind is simply a wild guess, and the fact that it does not means nothing. It is neither good nor bad. (Let's also ignore at this point how well a GMM can do in a ~1700-dimensional space.)
If you want a model to perform some task, you have to give it a chance: train it to solve that task. If you want to see whether the learned features are enough to recover categories, then train a classifier on them.
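A quick way to run that check with scikit-learn. The features below are random stand-ins; on the real 1648-dim encoder features, held-out accuracy well above the chance level of 1/17 (≈6%) would show the features do carry class information even though unsupervised GMM clustering failed to surface it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for the 1648-dim encoder features and the 17 known labels.
features = rng.normal(size=(500, 1648))
labels = rng.integers(0, 17, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)

# A linear classifier is a deliberately weak probe: if even it recovers the
# classes from the features, the representation is clearly adequate.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
score = clf.score(X_te, y_te)
```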

How to classify continuous audio

I have an audio dataset and each file has a different length. There are some events in these audios that I want to train on and test for, but the events are placed randomly; together with the varying lengths, this makes it really hard to build a machine learning system on this dataset. I thought about fixing a default length and building a multilayer NN, however the lengths of the events are also different. Then I thought about using a CNN, like it is used to recognise patterns or multiple humans in an image. The problem with that one is that I really struggle to understand the audio files.
So, my question: can anyone give me some tips on building a machine learning system that classifies different types of defined events by training itself on a dataset that contains these events placed randomly (one file contains more than one event, and they differ from each other), where each file has a different length?
I would really appreciate any help.
First, you need to annotate your events in the sound streams, i.e. specify bounds and labels for them.
Then, convert your sounds into sequences of feature vectors using signal framing. Typical choices are MFCCs or log-mel filterbank features (the latter corresponds to a spectrogram of the sound). Having done this, you will have converted your sounds into sequences of fixed-size feature vectors that can be fed into a classifier. See this for a better explanation.
Since typical sounds last longer than one analysis frame, you probably need to stack several contiguous feature vectors using a sliding window and use these stacked frames as input to your NN.
Now you have a) input data and b) annotations for each window of analysis. So, you can try to train a DNN, a CNN or an RNN to predict a sound class for each window. This task is known as spotting. I suggest you read Sainath, T. N., & Parada, C. (2015). Convolutional Neural Networks for Small-footprint Keyword Spotting. In Proceedings INTERSPEECH (pp. 1478–1482), and follow its references for more details.
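The frame-stacking step might look like this in plain NumPy. The frame count, the 13 MFCC coefficients, and the context size of 5 are illustrative assumptions (the feature extraction itself would come from an audio library such as librosa):

```python
import numpy as np

# Pretend feature matrix: 200 analysis frames x 13 MFCC coefficients.
feats = np.random.rand(200, 13)

def stack_frames(feats, context=5):
    """Stack each frame with `context` neighbours on both sides, giving
    the classifier a wider temporal window per input vector."""
    # pad edges by repeating the first/last frame so every frame gets a window
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + 2 * context + 1].ravel()
               for i in range(len(feats))]
    return np.array(windows)

X = stack_frames(feats, context=5)
# X.shape == (200, 143): 11 stacked frames x 13 coefficients each
```

Each row of `X` then pairs with the annotation of its center frame, giving the per-window (input, label) examples described above.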
You can use a recurrent neural network (RNN).
https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html
The input data is a sequence and you can put a label in every sample of the time series.
For example, an LSTM (a kind of RNN) is available in libraries like TensorFlow.

Alternative to support vector machine classifier in python?

I have to compare 155 image feature vectors. Every feature vector has 5 features.
My images are divided into 10 classes.
Unfortunately I need at least 100 images per class to use a support vector machine. Is there any alternative?
15 samples per class is very low for any machine learning model. Rather than wasting time trying many model classes and parameters you should collect and label new examples by hand. It will be much more fruitful. If you have a bunch of unlabeled pictures, you can use services such as https://www.mturk.com/ .
Check out PyBrain: http://pybrain.org. You could also try a neural net; I've heard they need less data to train than SVMs, but are less accurate.
If the images that belong to the same class are the results of transformations of some starting image, you can increase your training set size by applying transformations to your labeled examples.
For example, if you are doing character recognition, affine or elastic transformations can be used. P. Simard describes this in more detail in Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In the paper he uses neural networks, but the same applies to SVMs.
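A small sketch of such augmentation with SciPy. The image size, rotation range, and shift range are arbitrary choices; for the elastic distortions one would follow Simard's recipe instead:

```python
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # stand-in for one labeled training image

augmented = []
for _ in range(5):
    # random small rotation, keeping the original array shape
    out = rotate(image, angle=rng.uniform(-15, 15), reshape=False, mode="nearest")
    # random sub-pixel translation in both axes
    out = shift(out, shift=rng.uniform(-2, 2, size=2), mode="nearest")
    augmented.append(out)

augmented = np.stack(augmented)  # five new samples derived from one image
```

Each augmented copy keeps the label of its source image, so 15 samples per class can be multiplied severalfold before training the SVM or NN.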
