Using MLP for Feature Extraction and Dimension Reduction - python

I'm trying to build a model that uses an MLP for feature extraction and dimension reduction. The model should transform the data from 204 dimensions to 80 dimensions. The proposed model is as follows:
A 512-unit dense layer taking the original data (204 dimensions) as input
A 256-unit dense layer taking the 512-dimensional output as input
An 80-unit dense layer taking the 256-dimensional output as input
The proposed number of training epochs is 1, and the output of the MLP is used as the input to further models (such as LR, SVM, etc.).
My question is: when training the MLP, what loss function should I use? Is MSE loss OK, or should I use a different loss function? Thanks!

What would you be training this MLP on? (That is, what would the target "Y" be for those 80-dimensional features?)
MLPs learn features at the same time as they learn the model for the task at hand. For example, if you wanted an MLP that does linear regression and learns a set of 80-dimensional features, you could create something like this:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.models.Sequential()
# learns an 80-dimensional feature representation (input_dim=512 here; for your data it would be 204)
model.add(layers.Dense(80, input_dim=512, activation=MY_ACTIVATION))
# regression output that the features are trained to predict
model.add(layers.Dense(1))
model.compile(loss="mean_squared_error")
In the last layer, the network will learn to find the "best" weights and biases to capture Y as a function of the 80 features extracted. These features are in turn a function of X - a function the network learns by adjusting for how well these features are able to capture Y (this is backpropagation).
So creating an MLP just to learn features doesn't make sense without a problem statement for what these features are supposed to do.
As such I would recommend using something like Principal Component Analysis or Singular Value Decomposition. These project the data onto the k-dimensional space that captures the most variance (information) in the data.
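If all you need is an unsupervised 204-to-80 reduction, here is a minimal PCA sketch with scikit-learn; the random data matrix is just a stand-in for your 204-dimensional samples:

import numpy as np
from sklearn.decomposition import PCA

# X: stand-in for your data matrix of shape (n_samples, 204)
X = np.random.rand(1000, 204)

pca = PCA(n_components=80)        # keep the 80 directions of highest variance
X_reduced = pca.fit_transform(X)  # shape (n_samples, 80)
print(X_reduced.shape)

X_reduced can then be fed to LR, SVM, etc., and pca.explained_variance_ratio_ shows how much variance the 80 components retain.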

Related

How do I predict on multiple samples on an LSTM model?

In Keras, if I want to predict with my LSTM model on multiple instances based on new data independent of the training data, does the input array need to have the same number of time steps used in training? And, if so, can I expect the shape of the input array for model.predict to be the same as the training data? (i.e. [number of samples to be predicted on, their time steps, their features])
Thank you :)
You need to distinguish between the 'sample' or 'batch' axis and the time steps and features dimensions.
The number of samples is variable - you can train (fit) your model on thousands of samples, and make a prediction for a single sample.
The time steps and feature dimensions have to be identical for fit and predict; that is because the weights etc. have fixed dimensions determined by the input layer.
In that respect, an LSTM is not much different from a DNN.
There are cases (e.g. one-to-many models) in which the application is different, but the formal design (i.e. input shape, output shape) is the same.
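As a rough Keras sketch (the layer sizes and shapes here are assumptions, not taken from the question), the number of samples can differ between fit and predict as long as the time steps and features match:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 10, 3  # hypothetical values

model = keras.Sequential([
    layers.LSTM(32, input_shape=(TIMESTEPS, FEATURES)),
    layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")

# Train on many samples...
X_train = np.random.rand(1000, TIMESTEPS, FEATURES)
y_train = np.random.rand(1000, 1)
model.fit(X_train, y_train, epochs=1, verbose=0)

# ...then predict on any number of new samples, even a single one,
# as long as timesteps and features are unchanged.
X_new = np.random.rand(5, TIMESTEPS, FEATURES)
preds = model.predict(X_new)   # shape (5, 1)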

Handwritten Signature Verification

I am working on a signature verification project. I am using the ICDAR 2011 Signature Dataset. Currently, I pair the encoding of an original image with the encoding of a forgery to get a training sample (labelled 0). The encodings are obtained from a pre-trained VGG-16 convolutional neural network (with the fully connected layers removed). I have then replaced the fully connected part with the following architecture:
Input size: 50177
1st hidden layer: 1000 units (activation: "sigmoid", Dropout: 0.5)
2nd hidden layer: 500 units (activation: "sigmoid", Dropout: 0.2)
Output layer: 1 unit (activation: "sigmoid")
The issue is that although the training set accuracy increases, the validation accuracy fluctuates randomly. It performs very badly on the test set.
I have tried different architectures but nothing seems to work.
So is there any other way to prepare the data, or should I continue trying different architectures?
I don't think that using a VGG16 model for feature extraction is the right way to go for your task. You are using a model that was trained on relatively complex RGB images and then trying to use it on a dataset that basically consists of grayscale images of edges (signatures). And you are using the last embedding layer, which contains the most complex and specialized representation of the ImageNet dataset (the original training dataset for the VGG model).
The features you get have no real meaning, and that is probably why the training accuracy and validation accuracy are not correlated at all when you try to fine-tune the model.
My suggestion is to either use an earlier layer of VGG16 for feature extraction (somewhere around layer no. 5-6), as sketched below, or better yet, use a simpler model that was trained on a more similar dataset, such as the MNIST dataset.
The MNIST dataset consists of handwritten digits, so it is considerably more similar to your task, and any model trained on it will act as a much better feature extractor.
You can pick any model from the following list of benchmark results on MNIST and use it as a feature extractor:
MNIST Benchmark Results
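To illustrate the earlier-layer idea, here is a minimal Keras sketch that extracts features from an intermediate VGG16 layer rather than the final one; the layer name "block3_pool" and the 224x224 input size are illustrative assumptions, not specific recommendations:

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Cut the network at an earlier layer; "block3_pool" is just an example choice.
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer("block3_pool").output)

# images: stand-in batch of preprocessed signature images
images = np.random.rand(4, 224, 224, 3).astype("float32")
features = feature_extractor.predict(images)
features = features.reshape(len(features), -1)  # one flat feature vector per image
print(features.shape)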

How to train a L2-SVM classifier on top of a flattened vector of representations as per DCGAN paper

In the original DCGAN paper, the GAN is partly evaluated by being used as a feature extractor to classify CIFAR-10, after having been trained on Imagenet.
From the paper:
To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator's convolutional features from all layers, maxpooling each layers representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM classifier is trained on top of them.
I have tried to replicate this using PyTorch to train the official PyTorch DCGAN and then use scikit-learn to classify using their linear SVC, but find the wording of the paper confusing and am not sure where to go from here. I've been able to maxpool each layer and then concatenate them, but am stumped on how to proceed with the classification of CIFAR-10.
In e.g. sklearn, you use model.fit(x,y) to fit the model according to the given training data, and then model.predict([X]) to predict the class labels for the samples in X. In model.fit(x,y), x is the (2D) features (e.g. images) and y is the labels. But it feels like they’re saying in the above quote make this 28672 dimensional vector the x. But that’s a 1D vector, and they use it to classify CIFAR-10, which has 50k images, and 50000 > 28672. Am I missing something obvious?
Do I use e.g. model.fit with x being the CIFAR-10 images (using e.g. torchvision.datasets.cifar10) (although how to make 50k Tensors of RGB images a 2D array is another story) and y being their labels, and then somehow predict using the 28672 dimensional vector?
Apologies if this is super obvious; unfortunately that’s all they say about it in the paper, and no one seems to have reproduced it (at least on GitHub etc.). Any help would be greatly appreciated!
The DCGAN discriminator gives you a 28672-dimensional vector for each image, so the shape of the extracted feature matrix would be (50000, 28672) for the complete CIFAR-10 training set.
You take this matrix as the x input for your sklearn SVM, which, as you mentioned, expects 2D data.
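As a rough sketch of that step: the extract_features function below is a placeholder standing in for your existing max-pool-and-concatenate code, which should return one 28672-dimensional vector per image.

import numpy as np
import torchvision
import torchvision.transforms as T
from sklearn.svm import LinearSVC

transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

def extract_features(img):
    # Placeholder: replace with your discriminator's max-pooled, flattened
    # and concatenated features (a 28672-dimensional vector per image).
    return np.random.rand(28672).astype(np.float32)

X_train = np.stack([extract_features(img) for img, _ in train_set])  # (50000, 28672)
y_train = np.array([label for _, label in train_set])                # (50000,)
X_test = np.stack([extract_features(img) for img, _ in test_set])
y_test = np.array([label for _, label in test_set])

clf = LinearSVC(C=1.0)       # regularized linear L2-SVM
clf.fit(X_train, y_train)    # x is the 2D feature matrix, y the class labels
print(clf.score(X_test, y_test))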

PyCaffe output layer for testing binary classification model

I fine-tuned VGG-16 for binary classification, using a SigmoidWithLoss layer as the loss function.
To test the model, I coded a python file in which I loaded the model with an image and got the output using :
out = net.forward()
My doubt is: should I take the output from the Sigmoid layer or the SigmoidWithLoss layer?
And what is the difference between the two layers?
My output should be the probability of the input image being class 1.
For making predictions on test set, you can create a separate deploy prototxt by modifying the original prototxt.
Following are the steps for the same
Remove the data layer that was used for training, since at prediction time we no longer provide labels for our data.
Remove any layer that is dependent upon data labels.
Set the network up to accept data.
Have the network output the result.
You can read more about this here: deploy prototxt
Alternatively, you can add
include {
  phase: TRAIN
}
to your SigmoidWithLoss layer so that it is not used when testing the network. To make predictions, simply check the output of the Sigmoid layer.
The SigmoidWithLoss layer outputs a single number per batch representing the loss w.r.t. the ground truth labels.
On the other hand, the Sigmoid layer outputs a probability value for each input in the batch. This output does not require the ground truth labels to be computed.
If you are looking for the probability per input, you should be looking at the output of the Sigmoid layer.
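A small sketch of the prediction step in PyCaffe; the file names and the blob name "prob" for the Sigmoid layer's top are assumptions, so substitute whatever your deploy prototxt defines:

import caffe
import numpy as np

# Placeholder paths for your deploy prototxt and fine-tuned weights.
net = caffe.Net("deploy.prototxt", "vgg16_finetuned.caffemodel", caffe.TEST)

# image preprocessed to match the network's expected input shape
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
net.blobs["data"].data[...] = image

out = net.forward()
# "prob" is assumed to be the top blob of the Sigmoid layer in deploy.prototxt.
p_class1 = out["prob"][0]   # probability of the input being class 1
print(p_class1)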

Multiple real inputs and multiple real outputs in a neural network

How can I train a perceptron where there are multiple input and output nodes and both are real-valued?
I'm doing this because I want to train a neural network to predict the MFCCs given some data points (from the signal).
Here is some example data: http://pastebin.com/dtHGUeax
I won't paste the data here because the file is "big".
I am using nolearn at the moment, because later I will add more layers for deep learning.
import lasagne
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net = NeuralNet(
    layers=[('input', layers.InputLayer),
            ('output', layers.DenseLayer),
            ],
    # Layer parameters
    input_shape=(None, 256),
    output_nonlinearity=lasagne.nonlinearities.softmax,
    output_num_units=13,
    # Optimization
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    regression=True,
    max_epochs=500,
    verbose=1,
)
The error rate I get with this approach is very high.
MFCC extraction from the power spectrum is a non-linear operation; you cannot reproduce it with a single layer. If you want to reproduce it with multiple layers, you need to consider the MFCC algorithm itself.
MFCC extraction might be represented with the following neural network:
Layer 1: a dense matrix of size 256x40 with a logarithm nonlinearity
Layer 2: a dense matrix of size 40x13 without a nonlinearity (the same as a linear or identity nonlinearity in Lasagne)
If you reproduce this network with nolearn, it will be able to learn the mapping properly; however, a logarithm nonlinearity is not implemented in Lasagne, so you will have to implement it yourself. Another solution would be to replace the logarithm nonlinearity with tanh or with a couple of standard non-linear layers.
So to reproduce MFCC you need either 2 layers with a logarithm nonlinearity, or 3-4 layers with, say, a softmax nonlinearity, and the last layer must be configured with a linear nonlinearity.
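A minimal nolearn sketch of that structure, substituting tanh for the missing logarithm nonlinearity as suggested above; treat this as an illustration of the 256 -> 40 -> 13 shape, not a tuned configuration:

import lasagne
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net = NeuralNet(
    layers=[('input', layers.InputLayer),
            ('hidden', layers.DenseLayer),
            ('output', layers.DenseLayer),
            ],
    # 256 spectrum bins in, 40 "filterbank-like" hidden units,
    # 13 MFCC outputs with a linear output nonlinearity.
    input_shape=(None, 256),
    hidden_num_units=40,
    hidden_nonlinearity=lasagne.nonlinearities.tanh,   # stand-in for the log nonlinearity
    output_num_units=13,
    output_nonlinearity=lasagne.nonlinearities.linear,
    # Optimization
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    regression=True,
    max_epochs=500,
    verbose=1,
)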
