I am running a model to detect a few interesting features in an image. I have a set of images measuring 600x200 px. These images have features such as rock fragments that I would like to identify. Imagine a (4x12) grid overlaid on the image; using an annotator tool I can manually produce annotations such as ((4,9), (3,10), (3,11), (3,12)) to identify the interesting cells in the image. I can build a CNN model with Keras that takes a grayscale image as input, but how should I encode the output? One way that seems intuitive to me is to treat it as a sparse matrix of shape (12,4,1) in which only the interesting cells are 1 while all others are 0.
Is there a better way to encode the outputs?
What should the activation function on the last layer be? I am using ReLU for the hidden layers.
What should the loss function be? Will mean_squared_error work?
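To make the proposed encoding concrete, here is a minimal sketch (assuming 1-based (row, column) annotations on the 4-row by 12-column grid; I build a (4, 12, 1) array, the transpose of the (12, 4, 1) shape above, so the index order is an assumption):

```python
import numpy as np

def annotations_to_mask(cells, rows=4, cols=12):
    """Convert 1-based (row, col) cell annotations into a binary target mask."""
    mask = np.zeros((rows, cols, 1), dtype=np.float32)
    for r, c in cells:
        mask[r - 1, c - 1, 0] = 1.0
    return mask

# The annotation from the question above:
y = annotations_to_mask([(4, 9), (3, 10), (3, 11), (3, 12)])
```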
Your problem is really similar to both detection and segmentation problems (you can read about them here, for example). The approach you proposed is reasonable because in both detection and segmentation tasks, computing the sort of feature map you describe is a usual part of the training pipeline. However, there are several problems you might come across:
memory issues: you need to either work with sparse tensors or use generators in order to keep memory usage under control,
loss and activation: loss and activation functions for segmentation are currently not supported by the Keras API, so you need to implement them on your own. Here and here you can find examples of how to tackle this problem.
In case of detection only (not classification of these points) I would advise you to use sigmoid and binary_crossentropy; in case of classification, softmax and categorical_crossentropy.
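To make that concrete, here is a rough sketch of a Keras model for the detection-only case (the layer sizes are placeholders, not a tuned architecture, and I am assuming a 200x600 grayscale input with the 4x12 grid as output):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Grayscale input (height 200, width 600); output: one detection score per grid cell.
model = keras.Sequential([
    layers.Input(shape=(200, 600, 1)),
    layers.Conv2D(16, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation='relu', padding='same'),
    layers.Flatten(),
    layers.Dense(4 * 12, activation='sigmoid'),  # one independent probability per cell
    layers.Reshape((4, 12, 1)),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```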
Of course, there are other ways to tackle this problem. One could solve it as a regression where you predict the pixels where there is something interesting, but dealing with inputs and outputs of varying size in Keras is rather cumbersome.
I am working on a classification task which uses byte sequences as samples. A byte sequence can be normalized as input to a neural network by applying x/255 to each byte x. In this way, I trained a simple MLP and its accuracy is about 80%. Then I trained an autoencoder with an 'mse' loss on the whole dataset to see whether it works well for the task. I froze the weights of the encoder's layers and added a softmax dense layer on top for classification. I retrained the new model (training only the last layer) and, to my surprise, the result was much worse than the MLP: merely 60% accuracy.
Can't the autoencoder learn good features from all the data? Why is the result so bad?
Possible actions to take:
Check the error of the autoencoder: can it really reconstruct its input? (A sketch of this check follows the list.)
Visualize the autoencoder results (dimensionality reduction), is the variance explained with fewer dimensions?
Making the model more complex does not necessarily make it outperform simpler ones. Did you plot the validation MSE versus epoch? Is there a global minimum after a number of steps?
Do you have enough epochs?
What is the number of units in your autoencoder? It may be too few (underfitting) or too many (overfitting), depending on the behavior of your data and its volume.
Did you make any comparison with other dimensionality reduction methods like PCA or NMF?
Last but not least: is an autoencoder really the best way to engineer your features for this task?
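For the first two checks, something like the following sketch could work (it assumes a trained Keras autoencoder called autoencoder, its first half wrapped as encoder, data in X and labels in y; all of these names are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1. Reconstruction error: can the autoencoder actually predict its input?
X_rec = autoencoder.predict(X)
print("mean reconstruction MSE:", np.mean((X - X_rec) ** 2))

# 2. Baseline: how much variance does PCA explain with the same dimensionality?
codes = encoder.predict(X)  # bottleneck activations
pca = PCA(n_components=codes.shape[1]).fit(X)
print("variance explained by PCA:", pca.explained_variance_ratio_.sum())

# 3. Visualize the first two code dimensions, colored by class label.
plt.scatter(codes[:, 0], codes[:, 1], c=y, s=4)
plt.show()
```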
"Why the result is so bad?" This is not actually a surprise. You've trained one model to be good at compressing the information. The transformations it learns at each layer do not need to be good for any other type of task at all. In fact, it could be throwing away a lot of information that is perfectly helpful for whatever auxiliary classification task you have, but which is not needed for a task purely of compressing and reconstructing the sequence.
Instead of approaching this by training a separate autoencoder, you might have better luck just adding sparsity penalty terms on the MLP layers to the loss function, or using some other type of regularization such as dropout. Finally, you could consider more advanced network architectures, like ResNet / ODE layers or Inception layers, modified for a 1D sequence.
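For instance, the sparsity-plus-dropout route might look like this in Keras (a sketch; seq_len and num_classes are placeholders, and the layer sizes are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# MLP over normalized byte sequences with an L1 activity penalty (sparsity)
# on the hidden layer and dropout before the classifier.
model = keras.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Dense(256, activation='relu',
                 activity_regularizer=regularizers.l1(1e-5)),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```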
I'm implementing a UNet for binary segmentation using Sigmoid and BCELoss. The problem is that after several iterations the network predicts very small values for every pixel, while for some regions it should predict values close to one (the ground-truth mask region). Does this give any intuition about the wrong behavior?
Besides, there exists NLLLoss2d, which is used for pixel-wise loss. Currently I'm simply ignoring this and using MSELoss() directly. Should I use NLLLoss2d with a Sigmoid activation layer?
Thanks
It seems to me that your Sigmoids are saturating the activation maps. Either the images are not properly normalised or some batch normalisation layers are missing. If you have an implementation that works with other images, check the image loader and make sure it does not saturate the pixel values. This usually happens with 16-bit channels. Can you share some of the input images?
PS Sorry for commenting in the answer. This is a new account and I am not allowed to comment yet.
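A quick sanity check for the 16-bit issue (a sketch; load_image stands in for whatever loader you actually use):

```python
import numpy as np

img = load_image(path)  # placeholder for your actual image loader
print(img.dtype, img.min(), img.max())

# If the loader returns 16-bit data, dividing by 255 leaves values far
# outside [0, 1] and the Sigmoids saturate; scale by the true maximum instead.
img = img.astype(np.float32) / np.iinfo(np.uint16).max
```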
You might want to use torch.nn.BCEWithLogitsLoss(), replacing the Sigmoid and the BCELoss function.
An excerpt from the docs tells you why it's always better to use this loss function implementation.
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
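Concretely, the change is just to remove the final Sigmoid from the model and feed the raw logits to the loss (model, images, and masks below are placeholders):

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = model(images)           # the model now ends WITHOUT a Sigmoid layer
loss = criterion(logits, masks)  # masks: the {0, 1} ground-truth maps

# At evaluation/inference time, apply the sigmoid explicitly:
probs = torch.sigmoid(logits)
```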
So I've been doing a lot of research regarding the visualization of CNNs, and I can't seem to find a solution to what I'm trying to do, or at least one that fits my understanding of the methodologies employed. A lot of it is pretty new and cutting edge, so I could just not be properly grasping the concepts.
Basically, I want to take a learned kernel/feature as trained by a CNN and manufacture an "optimized" picture such that when the kernel is convolved with that picture, we get the highest possible convolutional sum.
If I'm not mistaken, this should exaggerate the features of that kernel on the image level rather than at the filter/kernel level, which seems to be what most have done in terms of visualizing these filters.
In case what I'm asking is not clear, here's an example (probably a bad one, but it'll get the point across).
Assume we are using MNIST and I've created a CNN like so:
5x5 Conv with 10 kernels/feature maps
ReLU
2x2 MaxPool with stride 2
Dense + Softmax
Let's say I've fully trained my model and now want to look at one of the 10 5x5 kernels it produced and get a better idea of what it's looking for. I want to manufacture a new 28x28 picture such that when convolved with this 5x5 kernel, the sum of the 28x28 convolution is maximized.
Are there techniques that already do something like this? I feel like everything I see involves either "unwinding" or "reversing" the neural net (https://arxiv.org/pdf/1311.2901.pdf), viewing the feature maps as pictures pass through (http://kvfrans.com/visualizing-features-from-a-convolutional-neural-network/), or just looking at the kernels themselves (https://www.youtube.com/watch?v=AgkfIQ4IGaM).
Is it even something useful to look at? I feel like this is the closest thing I've seen to what I'm requesting. https://arxiv.org/pdf/1312.6034.pdf
Any insight would be a huge help, thanks!
This is called activation maximization, and Keras even has an example of it available here. Note that the code in that post might be outdated for current Keras versions, but an updated version is available in the examples folder of the Keras repository.
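In case that example moves again, the core of activation maximization is only a few lines of gradient ascent on the input. A sketch in TensorFlow 2 (model, layer_name, and filter_index are placeholders):

```python
import tensorflow as tf

def maximize_activation(model, layer_name, filter_index,
                        input_shape=(28, 28, 1), steps=100, lr=1.0):
    """Gradient ascent on the input image to maximize one filter's mean activation."""
    feature_extractor = tf.keras.Model(
        model.inputs, model.get_layer(layer_name).output)
    img = tf.Variable(tf.random.uniform((1,) + input_shape))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(img)
            loss = tf.reduce_mean(activation[..., filter_index])
        grads = tape.gradient(loss, img)
        img.assign_add(lr * tf.math.l2_normalize(grads))  # normalized ascent step
    return img.numpy()[0]
```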
Is there a way to learn unsupervised features from a set of images, similar to word2vec or doc2vec, where a neural network is learned and, given a new document, we get its features?
I'm expecting something similar to this example, which shows that a learned nn-model can be loaded and used to predict features for new images.
Any simple example of how to implement a CNN over images and get their features back would help!
Suppose, in this example, I want to get the CNN features for all of X_train and X_test ... is there any way to do that?
Also, if we can get the per-layer activations for each image, we can stack them and use them as features. Is there a way to do that?
Using those features for an unsupervised task would be easier if we can treat them as vectors.
If I understood your question correctly, this task is quite common in the deep learning field. In the case of images, what I consider best is a convolutional autoencoder. You can read about this architecture, for example, here:
http://people.idsia.ch/~ciresan/data/icann2011.pdf
A previous version of Keras supported this architecture as one of its core layers, though from version 1.0 I noticed that it disappeared from the documentation. But it's still quite easy to build from scratch :)
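For example, a minimal convolutional autoencoder in the current API could look like this (filter counts are arbitrary, and a 28x28 grayscale input is assumed):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
# Encoder: two conv + pooling stages down to a small spatial code.
x = layers.Conv2D(16, 3, activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D(2, padding='same')(x)
x = layers.Conv2D(8, 3, activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D(2, padding='same')(x)
# Decoder: mirror the encoder with upsampling.
x = layers.Conv2D(8, 3, activation='relu', padding='same')(encoded)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, activation='relu', padding='same')(x)
x = layers.UpSampling2D(2)(x)
decoded = layers.Conv2D(1, 3, activation='sigmoid', padding='same')(x)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
```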
In non-image cases there are also other approaches, e.g. Restricted Boltzmann Machines.
UPDATE:
When it comes to which activations are best for obtaining new features from a neural network, in my personal experience it depends on the size of the net you use. If the last layer of your network is wide (has a lot of nodes), it might be sufficient to take only that layer (considering previous layers as well adds many parameters and may harm learning performance). But if (as with some MNIST networks) your last layer is not sufficient for the task, you may try using the previous layers' activations, or even the whole network's activity. To be honest, I'm not expecting much improvement in that case, but you may try. I think you should use both approaches: start by taking only the last layer's activations, and then check the behaviour of your new classifier when you add activations from previous layers.
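Extracting such activations in Keras is straightforward (a sketch; model, X_train, and X_test are placeholders for your trained model and data, and the layer index is up to you):

```python
from tensorflow import keras

# A model that outputs the activations of a chosen layer,
# e.g. the one just before the classifier.
feature_extractor = keras.Model(
    inputs=model.inputs,
    outputs=model.layers[-2].output)

train_features = feature_extractor.predict(X_train)
test_features = feature_extractor.predict(X_test)
```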
What I would also strongly advise is getting some insight into what sort of features the network is learning, using T-SNE embeddings of its activations. In many cases I have found it useful, e.g. for checking whether the size of a layer is sufficient. Using T-SNE you can check whether the features obtained from the last layer are good discriminators of your classes. It may also give you good insights about your data and what neural networks are really learning (alongside amazing visualizations :) )
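The T-SNE check itself is a few lines with scikit-learn (using train_features from the sketch above; y_train is a placeholder for the class labels):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed the activations into 2D and color by class label.
embedded = TSNE(n_components=2).fit_transform(train_features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=y_train, s=4)
plt.show()
```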
I'm running the default classify_image code on the ImageNet model. Is there any way to visualize the features that it has extracted? If I use 'pool_3:0', that gives me the feature vector. Is there any way to overlay this on top of my image to see which features it has picked as important?
Ross Girshick described one way to visualize what a pooling layer has learned: https://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Essentially, instead of visualizing features, you find a few images that a neuron fires most strongly on. You repeat that for a few (or all) neurons from your feature vector. The algorithm needs lots of images to choose from, of course, e.g. the test set.
I wrote an implementation of this idea for the cifar10 model in TensorFlow today, which I want to share (it uses OpenCV): https://gist.github.com/kukuruza/bb640cebefcc550f357c
You can use it if you manage to provide the images tensor for reading images in batches, and the pool_3:0 tensor.
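The core of the idea is framework-agnostic: once you have a matrix of feature vectors (one row per image, e.g. the pool_3:0 activations), finding the top images per neuron is a few lines of NumPy (features and image_paths below are placeholders):

```python
import numpy as np

# features: (num_images, num_neurons) activations, one row per image.
# image_paths: list of the corresponding image files.
top_k = 9
for neuron in range(features.shape[1]):
    best = np.argsort(features[:, neuron])[::-1][:top_k]
    print("neuron", neuron, "fires most on:", [image_paths[i] for i in best])
```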