I have individually trained the same neural network architecture on a large number of different datasets (on the order of hundreds) to learn a unique non-linear function for each, i.e. I have learned a set of weights that describes the function for each dataset.
Now I want to use these sets of weights as a pre-trained layer in another optimization problem. I know how to load a single saved model and use it as a layer. However, what I will be doing is a group-wise optimization across the hundreds of different datasets, where I have pre-trained weights for each (from above).
So the setup is a batch of X datasets, each with N data points in D dimensions, i.e. the input data has shape [X, N, D]. A series of layers acts on all of this data; then, when it gets to the "pre-trained" layer, I wish to use different pre-trained weights for each dataset, i.e. [0,:,:] uses the weights learned from dataset 0 above, [1,:,:] uses the weights learned from dataset 1, and so on.
I then need to combine all the outputs, as the loss function for this group-wise optimization is based on the variance across all datasets. So I don't believe I can trivially evaluate one set, calculate the loss, change weights, rinse and repeat, and sum up at the end.
I doubt it is feasible to have massive duplicate branches, with X copies of the pre-trained NN layers, as the pre-trained NN architecture is already quite complex.
Is it possible to use a split layer and then a for-loop type approach, in which I change the weights and pass the correct portion of the data through, then merge all the outputs? Or is there a better way of tackling this?
Any help much appreciated.
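To make the setup concrete, here is a minimal NumPy sketch, not a Keras solution. It assumes the "pre-trained" layer can be reduced to one dense layer per dataset, and the sizes and the names data, weights and biases are made up for illustration. It stacks the per-dataset weights along the dataset axis so that slice x of the data only ever meets weight set x, then computes a variance-based loss across datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
X, N, D, H = 4, 5, 3, 2  # datasets, points per dataset, input dims, output dims

data = rng.normal(size=(X, N, D))
# Hypothetical stand-in for the pre-trained weights:
# one (D, H) matrix and one bias row per dataset.
weights = rng.normal(size=(X, D, H))
biases = rng.normal(size=(X, 1, H))

# Slice x of the data is multiplied by weight matrix x; nothing is shared.
out = np.tanh(np.einsum("xnd,xdh->xnh", data, weights) + biases)

# Group-wise loss: variance across the dataset axis, averaged over the rest.
loss = out.var(axis=0).mean()

# The same result from an explicit loop over datasets, for comparison.
looped = np.stack([np.tanh(data[i] @ weights[i] + biases[i]) for i in range(X)])
```

The batched form avoids both the X duplicate branches and the explicit loop, at the cost of holding all weight sets in one stacked array.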
In Keras, if I want to predict with my LSTM model on multiple instances that are new and independent of the training data, does the input array need to include the number of time steps used in training? And, if so, can I expect the shape of the input array for model.predict to be the same as for the training data (i.e. [number of samples to predict on, their time steps, their features])?
Thank you :)
You need to distinguish between the 'sample' or 'batch' axis and the time-step and feature dimensions.
The number of samples is variable: you can train (fit) your model on thousands of samples and make a prediction for a single sample.
The time-step and feature dimensions generally have to be identical for fit and predict. The feature dimension must always match, because the input weights are sized by it. The number of time steps must match whenever the model was built with a fixed sequence length, which is the case when you specify the full input shape.
In that respect, an LSTM is not much different from a DNN.
There are cases (e.g. one-to-many models) in which the application is different, but the formal design (i.e. input shape, output shape) is the same.
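As a rough illustration of the sample axis being free, here is a sketch in plain NumPy. It assumes a simple tanh recurrent layer rather than a real Keras LSTM, and the function and variable names are made up for this example. The same weights handle 1000 samples or a single sample, as long as the feature dimension matches.

```python
import numpy as np

def simple_rnn_forward(x, w_in, w_rec, b):
    """Run a plain tanh RNN over x of shape (samples, timesteps, features)."""
    samples, timesteps, _ = x.shape
    h = np.zeros((samples, w_rec.shape[0]))
    for t in range(timesteps):
        # The weights are sized by features and units, never by sample count.
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec + b)
    return h

rng = np.random.default_rng(0)
features, units = 3, 4
w_in = rng.normal(size=(features, units))
w_rec = rng.normal(size=(units, units))
b = np.zeros(units)

train_like = rng.normal(size=(1000, 10, features))  # many samples
single = rng.normal(size=(1, 10, features))         # one new sample

out_train = simple_rnn_forward(train_like, w_in, w_rec, b)
out_single = simple_rnn_forward(single, w_in, w_rec, b)
```

Note that a single sample still needs all three axes, so its shape is (1, timesteps, features) rather than (timesteps, features).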
I am working with features extracted from pre-trained VGG16 and VGG19 models. The features have been extracted from the second fully connected layer (FC2) of these networks.
The resulting feature matrix (of dimensions (8000, 4096)) has values in the range [0, 45]. As a result, when I use this feature matrix in gradient-based optimization algorithms, the values of the loss function, gradient, norms, etc. become very large.
To get rid of such large values, I applied MinMax normalization to this feature matrix, and since then the values have been manageable. The optimization algorithm is also behaving properly. Is my strategy OK, i.e. is it fair to normalize features that have been extracted from a pre-trained model for further processing?
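For reference, a minimal sketch of the kind of min-max scaling described above. It assumes per-column scaling (the question does not say whether the scaling was per feature or global), and the small random matrix is only a stand-in for the real (8000, 4096) FC2 features. The helper names are hypothetical.

```python
import numpy as np

def fit_min_max(features):
    """Return per-column minimum and range; constant columns get range 1."""
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return col_min, col_range

def apply_min_max(features, col_min, col_range):
    """Scale features using statistics computed elsewhere."""
    return (features - col_min) / col_range

rng = np.random.default_rng(0)
# Hypothetical stand-in for the extracted features, kept small here.
train = rng.uniform(0.0, 45.0, size=(200, 16))
train[:, 0] = 0.0  # a unit that never activates
new = rng.uniform(0.0, 45.0, size=(10, 16))

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
# Reuse the training statistics for any later data.
new_scaled = apply_min_max(new, col_min, col_range)
```

Reusing the training minimum and range for new data keeps both sets on the same scale, although new values can then fall slightly outside [0, 1].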
From experience, as long as you are aware that your results come from normalized values, it is okay. If normalization helps keep gradients, norms, etc. in a manageable range, then I am for it.
What I would be cautious about, though, is any further analysis on those feature matrices, since they are normalized and no longer the true values. If you were just to study the distributions and such, you should be fine, but I am not sure what your next step is, or whether this could be harmful for it.
Can you share more details around "further analysis"?
I am working on a model that uses several different collections of features. There is one NN for each set of features, but they all have the same structure. The NNs are built as follows:
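from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

results = []
sources = [input1, input2, ...]  # one Input tensor per feature set
for src in sources:
    # each feature set gets its own, independent Dense layer
    result = Dense(25)(src)
    results.append(result)
model = Model(inputs=sources, outputs=results)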
I do have the model working such that it will compile and train.
My question is, since each component is separate, will the individual dense layers train using the loss from their corresponding y array? Or are all of the NNs trained using the combined loss?
I am hoping to keep all the NNs together like this if possible as they will always be run together.
All of the variables in each NN are independent of one another. Keras sums the per-output losses into a single combined loss, but because no weights are shared, each Dense layer only receives gradients from the loss on its own output, so the networks effectively train independently. What you will have to be careful of is that you will have N output tensors, where N is the number of neural networks (i.e. the length of the results list).
You will have to make sure that your labels have the same structure as the output of the model, either manually or by creating a custom loss function to handle this. One way to do this is to merge the outputs using tf.stack or tf.concat and then arrange (or, if they are shared, duplicate) the labels to match.
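The independence claim can be checked numerically. The sketch below is an illustration in NumPy, not Keras code: it assumes two linear heads with mean squared error, uses made-up names and sizes, and estimates gradients by finite differences. Changing the labels of the second head leaves the gradient of the summed loss with respect to the first head's weights unchanged.

```python
import numpy as np

def head_loss(w, x, y):
    """Mean squared error of one linear head."""
    return float(np.mean((x @ w - y) ** 2))

def total_loss(ws, xs, ys):
    """Combined loss: the sum of the per-head losses."""
    return sum(head_loss(w, x, y) for w, x, y in zip(ws, xs, ys))

def grad_wrt_head(ws, xs, ys, head, eps=1e-6):
    """Central finite-difference gradient of the total loss for one head."""
    grad = np.zeros_like(ws[head])
    for idx in np.ndindex(*ws[head].shape):
        plus = [w.copy() for w in ws]
        minus = [w.copy() for w in ws]
        plus[head][idx] += eps
        minus[head][idx] -= eps
        diff = total_loss(plus, xs, ys) - total_loss(minus, xs, ys)
        grad[idx] = diff / (2 * eps)
    return grad

rng = np.random.default_rng(0)
xs = [rng.normal(size=(8, 3)) for _ in range(2)]
ys = [rng.normal(size=(8, 2)) for _ in range(2)]
ws = [rng.normal(size=(3, 2)) for _ in range(2)]

g_before = grad_wrt_head(ws, xs, ys, head=0)
ys_changed = [ys[0], ys[1] + 5.0]  # perturb only the second head's labels
g_after = grad_wrt_head(ws, xs, ys_changed, head=0)
```

This only holds while the heads share no weights; any shared layer upstream would receive gradients from every output.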
I am working on a classification task which uses byte sequences as samples. A byte sequence can be normalized as input to a neural network by applying x/255 to each byte x. In this way, I trained a simple MLP, and its accuracy is about 80%. Then I trained an autoencoder using 'mse' loss on the whole data to see if it works well for the task. I froze the weights of the encoder's layers and added a softmax dense layer on top for classification. I retrained the new model (training only the last layer) and, to my surprise, the result was much worse than the MLP: merely 60% accuracy.
Can't the autoencoder learn good features from all the data? Why the result is so bad?
Possible actions to take:
Check the error of the autoencoder: can it really reconstruct its own input?
Visualize the autoencoder results (dimensionality reduction): is the variance explained with fewer dimensions?
A more complex model does not necessarily outperform a simpler one. Did you plot the validation MSE versus epoch? Does it reach a minimum after a number of steps?
Do you have enough epochs?
How many units do you have in your autoencoder? It may be too few or too many, depending on the behavior of your data and its volume.
Did you make any comparison with other dimensionality reduction methods like PCA or NMF?
Last but not least, is an autoencoder the best way to engineer your features for this task?
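As a sketch of the PCA comparison suggested above, the following NumPy snippet computes a PCA baseline via SVD and reports the reconstruction MSE and the fraction of variance explained. The random data, the sizes and the function name are assumptions for illustration; real byte sequences would replace x.

```python
import numpy as np

def pca_reconstruction(x, k):
    """Project x onto its top-k principal components and reconstruct it."""
    mean = x.mean(axis=0)
    xc = x - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(xc, full_matrices=False)
    codes = xc @ vt[:k].T
    recon = codes @ vt[:k] + mean
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    mse = float(np.mean((x - recon) ** 2))
    return codes, mse, explained

rng = np.random.default_rng(0)
# Hypothetical stand-in for byte sequences: 300 samples of 32 bytes, x/255.
x = rng.integers(0, 256, size=(300, 32)) / 255.0

codes, mse, explained = pca_reconstruction(x, k=8)
_, full_mse, full_explained = pca_reconstruction(x, k=32)
```

If the autoencoder's reconstruction MSE at the same code size is not clearly better than this baseline, that is a hint the encoder has not learned much beyond a linear projection.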
"Why the result is so bad?" This is not actually a surprise. You've trained one model to be good at compressing the information. The transformations it learns at each layer do not need to be good for any other type of task at all. In fact, it could be throwing away a lot of information that is perfectly helpful for whatever auxiliary classification task you have, but which is not needed for a task purely of compressing and reconstructing the sequence.
Instead of training a separate autoencoder, you might have better luck with adding sparsity penalty terms on the MLP layers to the loss function, or using some other type of regularization such as dropout. Finally, you could consider more advanced network architectures, like ResNet / ODE layers or Inception layers, modified for a 1D sequence.
Say we train a multilayer NN in tensorflow for a regression task (i.e. multi input and multi output case). Then we have new instances and we apply the trained model and of course we get the corresponding outputs. Is there a way to backpropagate the outputs and reconstruct the inputs in tensorflow in an easy/efficient manner? What I am thinking is to then use the difference of the original and the reconstructed inputs of the new instances as a QC measure i.e. if the reconstructed inputs are not close enough to the originals then we have a problem etc. I hope I am making myself clear.
No, unfortunately you cannot take a trained model and recover the corresponding input from its output. The reason is that, in general, there are infinitely many possible inputs for each output.
Furthermore, backpropagation does not pass an output backwards through the network. It is a way of determining to what extent each parameter in the model contributes to the loss function. This will not give you the inputs to the hidden layers, only the extent to which the weights affected the result.
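A tiny example of why the input cannot be recovered, under the simplifying assumption of a single linear layer with made-up weights: any layer that maps 3 inputs to 2 outputs sends a whole line of different inputs to the same output.

```python
import numpy as np

# Hypothetical weights of a layer mapping 3 inputs to 2 outputs.
w = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def layer(x):
    """Forward pass of the linear layer."""
    return x @ w

x1 = np.array([1.0, 2.0, 3.0])
# Moving along this direction does not change the output: null_dir @ w == 0.
null_dir = np.array([1.0, 1.0, -1.0])
x2 = x1 + 2.5 * null_dir

y1 = layer(x1)
y2 = layer(x2)  # same output as y1 although x2 differs from x1
```

The factor 2.5 is arbitrary; every multiple of null_dir gives yet another input with the same output.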