Convolutional Autoencoder feature learning - python

I am training a convolutional autoencoder on my own dataset. After training, the network is able to reconstruct the test images from the dataset quite well.
I am now taking the intermediate representation (1648-dimensional) from the encoder network and trying to cluster the feature vectors into 17 (known upfront) classes using GMM soft clustering. However, the clusters are really bad, and it fails to group the images into their respective categories.
I am using the sklearn.mixture.GaussianMixture class for clustering, with a regularization of 0.01 and covariance_type='full'.
My question: why is the reconstruction quite decent while the clustering is quite bad? Does this mean the intermediate features learned by the network are not adequate?
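For reference, the clustering step looks roughly like this (a sketch; `features` stands for the (n_samples, 1648) encoder output, and I am taking the 0.01 regularization to be the reg_covar parameter):

```python
from sklearn.mixture import GaussianMixture

# features: (n_samples, 1648) array taken from the encoder (hypothetical name)
gmm = GaussianMixture(n_components=17, covariance_type='full', reg_covar=0.01)
gmm.fit(features)
cluster_ids = gmm.predict(features)        # hard assignments
posteriors = gmm.predict_proba(features)   # soft, per-component probabilities
```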

Let's reverse the question: why do you think it should have any meaning? You are using clustering, which is just an arbitrary way of splitting data into groups, yet you expect it to discover your classes. Why would it? There is literally nothing forcing the model to do so, and it is probably modeling completely different things (like patches of images, textures, etc.). In general, you should never expect clustering to recover some arbitrary labeling; this is not what clustering is for. To give you more perspective: you have images, which come from, say, 10 categories (like cats, dogs, etc.), and you ask:
why clustering in the feature space does not recover classes?
Note that equally valid questions would be:
why clustering in the feature space does not divide images into "reddish", "greenish" and "blueish"?
why clustering in the feature space does not divide images by the size of the object in the image?
why clustering in the feature space does not divide images by the country they are from?
There are exponentially many labelings that could be assigned to any dataset, and nothing in your training uses any labels (autoencoding is unsupervised, clustering is unsupervised), so expecting the result to magically guess which of these many labelings you have in mind is simply a wild guess, and the fact that it does not means nothing. It is neither good nor bad. (Let's also ignore at this point how well a GMM can do in a ~1700-dimensional space.)
If you want a model to perform some task, you have to give it a chance: train it to solve that task. If you want to see whether the learned features are enough to recover the categories, then train a classifier on them.
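For instance, a minimal sketch of that check (hypothetical names: `features` are the encoder outputs, `labels` the 17 known classes):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# If a simple linear classifier on the frozen features reaches good accuracy,
# the representation does contain the class information; clustering just doesn't use it.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, features, labels, cv=5)
print("cross-validated accuracy:", scores.mean())
```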

Is it possible to use Keras to classify images between one option or another?

To explain the title better, I am looking to classify pictures between two classes. For example, let's say that 0 is white and 1 is black. I train and validate the system with pictures that are gray, some lighter than others. In other words, none of the training/validation (t/v) pictures are 0 and none are 1. The t/v pictures range between 0 and 1 depending on how dark the gray is.
Of course, this is just a hypothetical situation, but I want to apply a similar scenario for my work. All of the information I have found online is based on a binary classification (either 1 or 0), rather than a spectrum classification (between 1 and 0).
I assume that this is possible, but I have no idea where to start. I do, however, have binary classification code written that achieves good accuracy.
Based on your example, maybe a classification approach is not the best one. I think what you have is a regression problem, as you want your output to be a continuous value in some range that is meaningful in itself (higher or lower values have a proper meaning).
Regression tasks usually have an output with linear activation, and they expect to have a continuous value as the ground truth.
I think you could start by taking a look at this tutorial.
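For instance, a minimal Keras sketch of such a regression model (assuming the images are flattened into vectors of length `input_dim` and the targets are continuous values in [0, 1]; the names are placeholders):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(input_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="linear"),   # continuous output, e.g. "how dark" in [0, 1]
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(x_train, y_train, epochs=20, validation_split=0.2)
```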
Hope this helps!
If I understand you correctly, it's definitely possible.
The creator of Keras, François Chollet, wrote Deep Learning with Python which is worth reading. In it he describes how you could accomplish what you would like.
I have worked through examples in his book and shared the code: whyboris/ml-with-python-and-keras
There are many approaches, but a fast one is to use a pre-trained model that can recognize a wide variety of images (for example, classify 1,000 different categories). You will use it "headless" (without the last classification layer that takes the vectors and decides which of the 1,000 categories it falls most into). And you will train just the "last step" in the model (freezing all the previous layers) while training your binary classifier.
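A rough sketch of that idea in Keras, using MobileNetV2 purely as an example of a pre-trained base (the input size and the rest are placeholders):

```python
from tensorflow import keras

base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                              # freeze the pre-trained layers

model = keras.Sequential([
    base,
    keras.layers.Dense(1, activation="sigmoid"),    # the new "last step": a binary classifier
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5, validation_split=0.2)
```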
Alternatively you could train your own classifier from scratch. Specifically glance at my example (based off the book) cat-dog-classifier which trains its own binary classifier.

Cluster identification with NN

I have a dataframe containing the coordinates of millions of particles, which I want to use to train a neural network. These particles form individual clusters that are already identified and labeled, meaning that every particle is already assigned to its correct cluster (this assignment is done by a density estimation, but that is not relevant for my purpose).
The challenge is now to build a network that does this clustering after learning from this huge dataset. There are also a few more features in the dataframe, like cluster size, number of particles in a cluster, etc.
Since this is not a classification problem but more of a cluster-identification challenge, what kind of neural network should I use? I also have trouble designing this network: for example, in a CNN that classifies whether there is a dog or a cat in a picture, the output is binary, so the last layer consists of just two outputs representing the probabilities of the two classes. But how can I implement the last layer when I want to identify clusters?
During my research I heard about self-organizing maps. Would these networks do the job?
Thank you.
If you want to treat clustering as a classification problem, then you can try to train the network to predict whether two points belong to the same cluster or to different clusters.
This does not ultimately solve your problems, though - to cluster the data, this labeling needs to be transitive (which it likely will not be) and you have to label n² pairs, which is expensive.
Furthermore, because your clustering is density-based, your network may need to know about further data points to judge which ones should be connected...
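A rough sketch of this pairwise formulation (hypothetical names: X holds the coordinates, y the known cluster ids; only a subsample is paired to keep the n² cost manageable):

```python
import numpy as np
from itertools import combinations
from tensorflow import keras

idx = np.random.choice(len(X), size=1000, replace=False)   # subsample: all pairs is too expensive
pairs, same = [], []
for i, j in combinations(idx, 2):
    pairs.append(np.concatenate([X[i], X[j]]))
    same.append(int(y[i] == y[j]))                          # 1 if both points share a cluster
pairs, same = np.array(pairs), np.array(same)

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(pairs.shape[1],)),
    keras.layers.Dense(1, activation="sigmoid"),             # P(pair belongs to the same cluster)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(pairs, same, epochs=10, batch_size=256)
```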
These particles form individual clusters that are already identified and labeled, meaning that every particle is already assigned to its correct cluster (this assignment is done by a density estimation, but that is not relevant for my purpose).
The challenge is now to build a network that does this clustering after learning from this huge dataset.
Sounds pretty much like a classification problem to me. Images themselves can form clusters in their image space (e.g. a vector space of dimension width * height * RGB).
Since this is not a classification problem but more of a cluster-identification challenge, what kind of neural network should I use?
You have coordinate data and you have labels. Start with a simple fully connected single- or multi-layer perceptron, i.e. a vanilla NN, with as many outputs as there are clusters and a softmax activation function.
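For example, something along these lines (a sketch; X is the (n_particles, n_features) matrix and y holds integer cluster labels 0..k-1):

```python
from tensorflow import keras

n_clusters = int(y.max()) + 1
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(n_clusters, activation="softmax"),   # one output per cluster
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=1024, validation_split=0.2)
```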
There are tons of blogs and tutorials for deep learning libraries like Keras out there on the internet.

How to explain clustering results?

Say I have a high-dimensional dataset that I assume to be well separable by some kind of clustering algorithm, and I run the algorithm and end up with my clusters.
Is there any sort of way (preferably not "hacky" or some kind of heuristic) to explain "what features and thresholds were important in making members of cluster A (for example) part of cluster A"?
I have tried looking at cluster centroids but this gets tedious with a high dimensional dataset.
I have also tried fitting a decision tree to my clusters and then looking at the tree to determine which decision path most of the members of a given cluster follow. I have also tried fitting an SVM to my clusters and then using LIME on the closest samples to the centroids in order to get an idea of what features were important in classifying near the centroids.
However, both of these latter two approaches require the use of supervised learning in an unsupervised setting and feel "hacky" to me; I'd like something more grounded.
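(For concreteness, the decision-tree variant I mean looks roughly like this; X, feature_names and the KMeans step are placeholders:)

```python
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)      # any clustering would do
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, labels)    # shallow, readable surrogate tree
print(export_text(surrogate, feature_names=list(feature_names)))  # split rules leading to each cluster
```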
Have you tried using PCA or some other dimensionality reduction technique and checking whether the clusters still hold? Sometimes relationships still exist in lower dimensions (caveat: it doesn't always help one's understanding of the data). There is a cool article about visualizing MNIST data: http://colah.github.io/posts/2014-10-Visualizing-MNIST/. I hope this helps a bit.
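Something along these lines (a sketch; X and labels are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(X)         # project onto the first two principal components
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)  # color by cluster assignment
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```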
Do not treat the clustering algorithm as a black box.
Yes, k-means uses centroids. But most algorithms for high-dimensional data don't (and don't use k-means!). Instead, they will often select some features, projections, subspaces, manifolds, etc. So look at what information the actual clustering algorithm provides!

Clustering algorithm performance check on unplottable data

I am using the k-means clustering algorithm from the scikit-learn library, and the dimension of my data is 169, which is why I am unable to visualize the clustering result.
Is there any way to measure the performance of the algorithm?
Secondly, I have labels for the data and I want to test the learned model on the test dataset, but I am not sure whether the labels the k-means algorithm assigned to the clusters coincide with the labels I have.
There are ways of visualizing high-dimensional data. You can sample some dimensions, use PCA components, MDS, t-SNE, parallel coordinates, and many more.
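For example, a quick t-SNE projection of the 169-dimensional data colored by cluster (a sketch; X and pred are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=pred, s=5)   # pred: the k-means cluster labels
plt.show()
```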
If you even just read the Wikipedia article on clustering, there is a section on evaluation, including supervised as well as unsupervised evaluation. But the results of such evaluation can be very misleading...
Bear in mind that if you have labeled data, supervised methods should always outperform unsupervised methods that do not have the labels: the latter don't know what to look for, and there is little reason to believe that a clustering happens to align with your particular labels. In particular, on most data there will be many reasonable clusterings that capture different aspects of the data.
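If you do want numbers, here is a minimal sketch (hypothetical names: X is the data, y_true the labels you already have, k the number of clusters); the adjusted Rand index also sidesteps the label-matching issue, because it is invariant to how the clusters are numbered:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

pred = KMeans(n_clusters=k, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, pred))              # internal measure, no labels needed
print("adjusted Rand:", adjusted_rand_score(y_true, pred))   # external measure, permutation-invariant
```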

How to calculate the probability (confidence) of an SVM classification for a small dataset?

Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class. I used the predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
I found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.
The author of the question verified this by multiplying (i.e. duplicating) the dataset.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will this have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like the distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings; it is currently unclear whether that applies to your project. They build hyperplanes on a handful of (or even single) supporting instances, and are often outperformed by neural nets in situations with large training sets. A priori, they might not be your worst choice.
Oversampling your data will do little for an approach using an SVM. SVMs are based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the training set as the test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique). You basically generate synthetic instances based on the ones you have. In theory this will provide you with new instances that are not exact copies of the ones you have, and might thus fall a little outside the normal classification. Note: by definition, all these synthetic examples will lie in between the original examples in your sample space. This does not mean that they will lie in between them in your projected SVM space, so the model may learn effects that aren't really there.
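A minimal SMOTE sketch (X and y are placeholders; with only 3-10 samples per class, k_neighbors has to stay below the smallest class size):

```python
from imblearn.over_sampling import SMOTE

X_res, y_res = SMOTE(k_neighbors=2, random_state=0).fit_resample(X, y)
```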
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
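A minimal sketch of that (placeholder names for the splits):

```python
from sklearn.svm import SVC

clf = SVC(kernel="rbf").fit(X_train, y_train)
scores = clf.decision_function(X_test)   # signed distance to the separating hyperplane(s)
# Binary case: one value per sample; larger magnitude roughly means a more confident prediction.
# Multi-class case: one column per class with the default decision_function_shape="ovr".
```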
