How to visualize features classified by tensorflow? - python

I'm running the default classify_image code for the ImageNet model. Is there any way to visualize the features it has extracted? If I use 'pool_3:0', that gives me the feature vector. Is there any way to overlay this on top of my image to see which features it has picked as important?

Ross Girshick described one way to visualize what a pooling layer has learned: https://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Essentially, instead of visualizing the features directly, you find the few images that make a given neuron fire most strongly. You repeat that for some or all of the neurons in your feature vector. The algorithm of course needs a large pool of images to choose from, e.g. the test set.
I wrote an implementation of this idea for the CIFAR-10 model in TensorFlow today, which I want to share (it uses OpenCV): https://gist.github.com/kukuruza/bb640cebefcc550f357c
You could reuse it if you can provide an images tensor that reads your images in batches, together with the pool_3:0 tensor.
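As a rough illustration of the idea (separate from the gist above), here is a minimal sketch that ranks a list of image files by the activation of one unit of pool_3:0 and keeps the top few. It assumes the TF 1.x classify_image graph has already been loaded into the default graph; the tensor names are the ones classify_image.py uses, and the helper itself is just an assumption about your setup.

import numpy as np
import tensorflow as tf  # TF 1.x graph/session API, as used by classify_image.py

def top_images_for_unit(image_paths, unit_index, top_k=9):
    # Rank image files by how strongly they activate one unit of 'pool_3:0'.
    with tf.Session() as sess:
        pool3 = sess.graph.get_tensor_by_name('pool_3:0')  # shape (1, 1, 1, 2048)
        scores = []
        for path in image_paths:
            image_data = tf.gfile.FastGFile(path, 'rb').read()
            feats = sess.run(pool3, {'DecodeJpeg/contents:0': image_data})
            scores.append(float(np.squeeze(feats)[unit_index]))
        best = np.argsort(scores)[::-1][:top_k]
        return [image_paths[i] for i in best]

Repeating this for several unit_index values and displaying the returned images side by side gives the kind of visualization described in the R-CNN paper.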

Related

Using a siamese model to obtain an embedding by cutting off one half?

I used Keras to build a Siamese network using the coding format of one of the questions posted (please see code sample here). To explain this briefly, I built a Siamese network using the pretrained EfficientNet so that each copy of the network produces a dense layer output, and these are then combined into an L1-similarity output.
However, during prediction time, I only want to obtain the dense output of one of the layers (as an embedding). I plan on using a variety of unsupervised learning methods (including KNN) on these outputs.
During prediction, how can I ask Keras to run only one copy of my network graph on a single input? Can I extract only a part of the network graph? I don't want to always have to generate pairs of images, or pay the cost of running two images when I only need one output.
Let me just make sure that I understand your question and context. You are using a Siamese network (efficient net) and you want to generate embeddings for your input images.
From the image below, you only want to save the image encodings for one of the ConvNets?
If that is the case, I don't really see the point of building a Siamese network at all. Just go for a single ConvNet (using EfficientNet), because if you use the Siamese network model, it will always require you to provide image pairs.
If you go for only a single ConvNet model, and you identify the layer which you want to use to get the embeddings, then you can use the tf.keras.backend.function like this:
get_layer_output = tf.keras.backend.function([fine_tuned_model.layers[0].input], [fine_tuned_model.layers[-2].output])
Then, at prediction time, you can call it like this:
features = get_layer_output([x])[0]
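If you do keep the trained Siamese model, another option is to pull the shared branch out and use it directly on single images. This is only a sketch, assuming the two branches share one encoder sub-model; the layer name 'encoder' is hypothetical, so check siamese_model.summary() for the actual name in your model.

import tensorflow as tf

# Assumption: the Siamese network was built around one shared encoder sub-model
# (EfficientNet base + Dense embedding layer) that is applied to both inputs.
encoder = siamese_model.get_layer('encoder')   # hypothetical layer name

# The shared sub-model is itself a Keras model, so it accepts single inputs:
embeddings = encoder.predict(x)                # x: (n, H, W, 3) preprocessed images

The resulting embeddings can then be fed to KNN or any other unsupervised method without ever constructing image pairs.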

Feature Extraction Using Representation Learning

I'm new to machine learning, and I've been given a task where I'm asked to extract features from a data set with continuous data using representation learning (for example a stacked autoencoder).
Then I'm to combine these extracted features with the original features of the dataset and then use a feature selection technique to determine my final set of features that goes into my prediction model.
Could anyone point me to some resources or demos or sample code of how I could get started on this? I'm very confused on where to begin on this and would love some advice!
Okay, say you have an input of 1000 instances and 30 features. What I would do, based on what you told us, is:
Train an autoencoder: a neural network that compresses the input and then decompresses it, using your original input as the target. The compressed representation lives in the latent space and encapsulates information about the input that is not directly accessible to humans. You will find such networks in TensorFlow or PyTorch; TensorFlow is easier and more straightforward, so it may be the better choice for you. I will point you to this variational autoencoder example (https://keras.io/examples/generative/vae/), which may do the job for you. It uses Conv2D layers, so it performs really well on image data, but you can play around with the architecture. I cannot tell you more because you did not provide more information about your dataset. However, the important thing is the following:
After your autoencoder is trained properly (and you need to make sure of that: it should adequately reconstruct the input), extract the aforementioned latent vectors (you will find more in the link). Say that gives you 16 numbers per example, though you can play with that size; these 16 numbers were built to preserve information about your input. You said you wanted to combine them with your original input, so you can simply concatenate them and end up with 46 input features. The feature selection part is then about selecting the input features that are most useful for your model. That part is less interesting; you can find more here (https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e), and one way to select features is to train many models with different feature subsets. Remember, techniques such as PCA are for feature extraction, not selection. I cannot provide a demo that does the whole thing, but there are sources that can help. Remember: your autoencoder should return 16 numbers for each training example, and it is trained only on your training data, with the training data itself as the target.
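To make this concrete, here is a minimal sketch of a plain (non-variational) autoencoder in Keras for tabular data with 30 features and a 16-dimensional latent space. The layer sizes, number of epochs, and the random placeholder data are illustrative assumptions, not a tuned recipe.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

X_train = np.random.rand(1000, 30).astype('float32')  # placeholder for your scaled data
n_features, latent_dim = 30, 16

# Encoder: compress the 30 inputs down to a 16-dimensional latent vector.
inputs = tf.keras.Input(shape=(n_features,))
h = layers.Dense(24, activation='relu')(inputs)
latent = layers.Dense(latent_dim, activation='relu', name='latent')(h)

# Decoder: reconstruct the original 30 features from the latent vector.
h = layers.Dense(24, activation='relu')(latent)
outputs = layers.Dense(n_features, activation='linear')(h)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, validation_split=0.1)

# Extract the 16 latent features and concatenate them with the original 30 features.
encoder = tf.keras.Model(inputs, latent)
X_latent = encoder.predict(X_train)                       # (1000, 16)
X_combined = np.concatenate([X_train, X_latent], axis=1)  # (1000, 46)

Feature selection (e.g. training models on different subsets of the 46 columns of X_combined) then happens on this combined matrix.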

Implementation of ResNet50 SSD on keras

I am using this Single Shot Detector (SSD) implementation in Keras, which uses VGG16 (as in the original version of SSD). I am interested in replacing the original VGG16 backbone with ResNet50. Although this has been done before, I could not find any source for a Keras SSD-ResNet implementation.
Anyway, does anyone know where I can find this architecture? Basically, my question is at which feature extraction layers to attach the decision layers. Judging from the similarities between the two networks, and for a typical image size of 300x300 (for Pascal VOC, for example), I would guess the decision layers should go after each group of ResBlocks (i.e. before the feature map is downscaled). I am not sure the layer names are universal across ResNet implementations, but anyway:
After Add3 with size 38x38x256
After Add7 with size 19x19x512
After Add13 with size 10x10x1024
After Add16 with size 5x5x2048
So I would get 4 decision layers, compared with 6 in the original SSD VGG version.
Should I add more feature extraction layers to further decrease the feature map size and extract features from those as well? Is this considered the typical architecture? If anyone has knowledge about this, please share it.
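Not an answer to which stages are best, but for reference, here is a minimal sketch of how one might expose intermediate ResNet50 feature maps in Keras to attach SSD heads. The layer names below are the ones used by tf.keras.applications.ResNet50 in recent TensorFlow versions, and the choice of stages simply mirrors the guess above.

import tensorflow as tf

# ResNet50 backbone without the classification head, 300x300 input as for SSD300.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights='imagenet', input_shape=(300, 300, 3))

# Outputs of the last Add+ReLU of each residual stage (tf.keras.applications naming).
# With a 300x300 input these are roughly 38x38, 19x19 and 10x10 feature maps.
feature_layers = ['conv3_block4_out', 'conv4_block6_out', 'conv5_block3_out']
feature_maps = [backbone.get_layer(name).output for name in feature_layers]

feature_extractor = tf.keras.Model(inputs=backbone.input, outputs=feature_maps)

# SSD classification/regression heads (and any extra downscaling conv blocks)
# would then be attached to each entry of feature_maps.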

Using MFCC's for voice recognition

I'm currently using the Fourier transform in conjunction with Keras for voice recognition (speaker identification). I have heard MFCCs are a better option for voice recognition, but I am not sure how to use them.
I am using librosa in python (3) to extract 20 MFCC features. My question is: which MFCC features should I use for speaker identification?
In addition to this I am unsure on how to implement these features. What I would do is to get the necessary features and make one long vector input for a neural network. However, it is also possible to display colors, so could image recognition also be possible, or is this more aimed at speech, and not speaker recognition?
In short, I am not sure where to begin, as I am not very experienced with image recognition.
Thanks in advance!!
My question is: which MFCC features should I use for speaker identification?
I would say use all of them. Technically, the MFCC features are outputs of different filter banks, and it is hard to say a priori which of them will be useful.
In addition to this I am unsure on how to implement these features. What I would do is to get the necessary features and make one long vector input for a neural network.
Actually, when you extract MFCCs for N samples you get an array of shape N x T x 20, where T is the number of frames in the audio signal after MFCC processing. I would suggest sequence classification with an LSTM; this will give better results.
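A minimal sketch of that suggestion, assuming librosa for feature extraction and a fixed number of frames per utterance (padding/truncating is an assumption made here to keep batching simple; the layer sizes are placeholders):

import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers

N_MFCC = 20

def mfcc_sequence(path, sr=16000, max_frames=200):
    # Load one utterance and return a (max_frames, N_MFCC) MFCC sequence.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (T, 20)
    out = np.zeros((max_frames, N_MFCC), dtype=np.float32)
    out[:min(len(mfcc), max_frames)] = mfcc[:max_frames]      # pad or truncate
    return out

n_speakers = 10  # placeholder: number of speakers in your dataset

model = tf.keras.Sequential([
    layers.Input(shape=(200, N_MFCC)),
    layers.LSTM(64),                                # summarizes the frame sequence
    layers.Dense(n_speakers, activation='softmax')  # one class per speaker
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X, y, ...) where X has shape (N, 200, 20) and y holds speaker IDs.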
In addition to this I am unsure on how to implement these features. What I would do is to get the necessary features and make one long vector input for a neural network.
For each sample you get a 2D matrix of MFCCs of shape T x no_mfccs (in your case no_mfccs = 20), so the whole dataset is N x T x no_mfccs. To turn each sample into one single vector, various researchers take statistics such as the mean, variance, IQR, etc. across the time frames to reduce the feature dimension. Some also model the frames with multivariate regression, and some fit a Gaussian mixture model; it depends on the next stage. In your case, you can use statistics to convert each sample into a single vector,
or, as Parthosarathi suggested, you can use an LSTM to preserve the sequential information across time frames (as in the sketch above).
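A minimal sketch of the statistics option, again assuming librosa; the particular statistics (mean, standard deviation and IQR per coefficient) are just one common choice:

import numpy as np
import librosa

def mfcc_stats_vector(path, sr=16000, n_mfcc=20):
    # Collapse a variable-length MFCC sequence into one fixed-length vector.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (20, T)
    iqr = np.percentile(mfcc, 75, axis=1) - np.percentile(mfcc, 25, axis=1)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1), iqr])  # (60,)

# The resulting 60-dimensional vectors can be fed to a dense network or
# any classical classifier for speaker identification.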
However, it is also possible to display colors, so could image recognition also be possible, or is this more aimed at speech, and not speaker recognition?
I would not recommend using the spectrogram (as an image) as a feature vector for a neural network, because visual images and spectrograms do not encode visual objects and sound events in the same manner.
When you feed an image to a neural network, it assumes that the features (pixel values) of the image carry the same meaning regardless of their location. In the case of a spectrogram, however, the location of a feature matters a lot.
For example, moving the frequencies of a male voice upwards could change its meaning from that of a man to that of a child. Therefore, the spatial invariance that a 2D CNN provides might not work as well for this form of data.
To learn more about this, refer to: What’s wrong with CNNs and spectrograms for audio processing?
You can use MFCCs with dense layers / multilayer perceptron, but probably a Convolutional Neural Network on the mel-spectrogram will perform better, assuming that you have enough training data.
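A minimal sketch of that suggestion, assuming librosa log mel-spectrograms and a small CNN; the spectrogram size, padding scheme and architecture are illustrative assumptions:

import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers

def log_mel_spectrogram(path, sr=16000, n_mels=64, max_frames=200):
    # Return a fixed-size (n_mels, max_frames, 1) log mel-spectrogram.
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                        # (n_mels, T)
    out = np.zeros((n_mels, max_frames), dtype=np.float32)
    out[:, :min(log_mel.shape[1], max_frames)] = log_mel[:, :max_frames]
    return out[..., np.newaxis]

n_speakers = 10  # placeholder

model = tf.keras.Sequential([
    layers.Input(shape=(64, 200, 1)),
    layers.Conv2D(16, 3, activation='relu'), layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'), layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(n_speakers, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])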

What should the output layer of my CNN look like?

I am running a model to detect a few interesting features in an image. I have a set of images measuring 600x200 px. These images have features such as rock fragments that I would like to identify. Imagine a 4x12 grid overlaid on the image; I can produce annotations manually using an annotation tool, such as ((4,9), (3,10), (3,11), (3,12)), to identify the interesting cells in the image. I can build a CNN model with Keras that takes the grayscale image as input, but how should I encode the output? One way that seems intuitive to me is to treat it as a sparse matrix of shape (12, 4, 1) in which only the interesting cells are 1 and all others are 0.
Is there a better way to encode the outputs?
What should the activation function on the last layer be? I am using ReLU for the hidden layers.
What should the loss function be? Will mean_squared_error work?
Your problem is really similar to both detection and segmentation problems (you can read about them, e.g., here). The approach you proposed is reasonable, because in both detection and segmentation tasks computing a feature map like the one you proposed is a usual part of the training pipeline. However, there are several problems you might come across:
memory issues: you either need to work with sparse tensors or use generators to avoid memory problems,
loss and activation: a loss and activation for segmentation are not supported out of the box by the Keras API, so you need to implement them on your own. Here and here you can find examples of how to tackle this problem.
In the case of detection only (not classification of these points), I would advise you to use sigmoid and binary_crossentropy (see the sketch below). In the case of classification, use softmax and categorical_crossentropy.
Of course, there are other ways to tackle this problem. One could solve it as a regression where you predict the locations of the interesting points, but dealing with varying-length inputs and outputs in Keras is rather cumbersome.
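A minimal sketch of the sigmoid + binary_crossentropy setup for the proposed (12, 4, 1) grid output, assuming the 600x200 images are loaded as 200x600 (height x width) grayscale arrays; the convolutional body is only a placeholder:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(200, 600, 1)),
    layers.Conv2D(16, 3, activation='relu'), layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'), layers.GlobalAveragePooling2D(),
    layers.Dense(12 * 4, activation='sigmoid'),   # one probability per grid cell
    layers.Reshape((12, 4, 1)),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Targets: binary masks of shape (n, 12, 4, 1) with 1 for interesting cells, 0 elsewhere.
# model.fit(images, masks, epochs=..., batch_size=...)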
