I am implementing a feed-forward neural network for a specific clustering problem.
I'm not sure if it is possible or even makes sense, but the network consists of multiple layers followed by a clustering layer (say, k-means) used to calculate the clustering loss.
The NN layers act as a feature extractor, while the last layer is only used to calculate the loss (for example, by calculating the similarity score among different data points).
Actually, this network architecture is part of a bigger auto-encoder similar to what is discussed in this paper.
The question here is can I define a custom loss function in Tensorflow/Keras that receives the output of NN and compute the clustering loss? And how?
Related
I am taking intro to ML on Coursera offered by Duke, which I recommend if you are interested in ML. The instructors of this course explained that "We typically include nonlinearities between layers of a neural network.There's a number of reasons to do so.For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." I am curious that, if I apply ReLU, aren't we losing information since ReLU is transforming every negative value to 0? Then how is this transformation more expressive than that without ReLU?
In Multilayer Perceptron, I tried to run MLP on MNIST dataset without a ReLU transformation, and it seems that the performance didn't change much (92% with ReLU and 90% without ReLU). But still, I am curious why this tranformation gives us more information rather than lose information.
the first point is that without nonlinearities, such as the ReLU function, in a neural network, the network is limited to performing linear combinations of the input. In other words, the network can only learn linear relationships between the input and output. This means that the network can't approximate complex functions that are not linear, such as polynomials or non-linear equations.
Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.
ReLU function increases the complexity of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. The ReLU function is defined as f(x) = max(0, x), which sets all negative values to zero. By setting all negative values to zero, the ReLU function creates multiple linear regions in the network, which allows the network to represent more complex functions.
For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation function. The first layer can only perform a linear transformation on the input, while the second layer can perform a non-linear transformation. By having a non-linear function in the second layer, the network can learn more complex representations of the data.
In the case of your experiment, it's normal that the performance did not change much when you removed the ReLU function, because the dataset and the problem you were trying to solve might not be complex enough to require a ReLU function. In other words, a linear model might be sufficient for that problem, but for more complex problems, ReLU can be a critical component to achieve good performance.
It's also important to note that ReLU is not the only function to introduce non-linearity and other non-linear activation functions such as sigmoid and tanh could be used as well. The choice of activation function depends on the problem and dataset you are working with.
Neural networks are inspired by the structure of brain. Neurons in the brain transmit information between different areas of the brain by using electrical impulses and chemical signals. Some signals are strong and some are not. Neurons with weak signals are not activated.
Neural networks work in the same fashion. Some input features have weak and some have strong signals. These depend on the features. If they are weak, the related neurons aren't activated and don't transmit the information forward. We know that some features or inputs aren't crucial players in contributing to the label. For the same reason, we don't bother with feature engineering in neural networks. The model takes care of it. Thus, activation functions help here and tell the model which neurons and how much information they should transmit.
I would try to implement a Siamese Neural Network thas has as output not only the similarity metric, but plus also able to classify the labels of each pair of input. The input are semantic audio embeddings.
Actually I have two problems:
1: In a siamese neural network the labels are the "labels of the pair"? Is there the possibility to preserve the label of the single input too? I mean is there the possibility to compute a loss function that combine the loss of the classifier+the similarity metric?
2: Do you think that I should divide the problems? I mean two networks, one is the siamese and then get the output embedding of the siamese and feed the Feed Forward network with the siamese output?(saving the similarity metric and using in the loss function of the second neural network?)
Hope I explain well the problem, and hope that someone has the solution.
Mike
For the first question, you can have a look at https://www.tensorflow.org/guide/keras/train_and_evaluate#custom_losses
There they give an example for how to write a custom loss function.
I'm interested in fitting a linear mixed model using the variational inference capabilities of tensorflow probabilities and keras. However, I cannot find a straight-forward answer on how to implement such analysis. Using the regression example in TF probabilities (see Case 3 here), I am able to grasp how to fit these models if we have only random variables in the model (the example is regression using a single feature). Following the radon example here, we have two features: floor (fixed) and county (random). My understanding is the latter should only be passed to the denseVariational layers, while the former can be passed to a regular dense layer. So I guess I would have to jointly train two networks, one for fixed and one for the random features and some how merge their outputs.
So my questions are:
(1) If these are fit jointly can the same loss function be applied to both? I often see mean square error used, while in VI negative log likelihood is used (I think this equivalent to maximizing evidence of lower bound).
(2) Does the input need to be split before-hand and fed as input to two networks?
Is there a way to give weights to features before training a neural network model? The meaning of weight here would be similar to that in linear regression - a representation of how much to trust each feature. This is unrelated to node weights or feature importance estimation after training.
Something similar to this publication.
I am working on a classification task which uses byte sequences as samples. A byte sequence can be normalized as input to neural networks by applying x/255 to each byte x. In this way, I trained a simple MLP and the accuracy is about 80%. Then I trained an autoencoder using 'mse' loss on the whole data to see if it works well for the task. I freezed the weights of the encoder's layers and add a softmax dense layer to it for classification. I retrained the new model (only trained the last layer) and to my surprise, the result was much worse than the MLP, merely 60% accuracy.
Can't the autoencoder learn good features from all the data? Why the result is so bad?
Possible actions to take:
Check the error of autoencoder, could it really predict itself?
Visualize the autoencoder results (dimensionality reduction), is the variance explained with fewer dimensions?
Making model more complex does not necessarily outperform simpler ones, did you plot the validation mse versus epoch? Is there a global minimum after a number of steps?
Do you have enough epochs?
What is the number of units you have in your autoencoder? It may be too less (or too much, in case of underfitting) depending on the behavior of your data and its volume.
Did you make any comparison with other dimensionality reduction methods like PCA, NMF?
Last but not least, is it the best way to engineer your features with autoencoder for this task?
"Why the result is so bad?" This is not actually a surprise. You've trained one model to be good at compressing the information. The transformations it learns at each layer do not need to be good for any other type of task at all. In fact, it could be throwing away a lot of information that is perfectly helpful for whatever auxiliary classification task you have, but which is not needed for a task purely of compressing and reconstructing the sequence.
Instead of approaching it by training a separate autoencoder, you might have better luck with just adding sparsity penalty terms from the MLP layers into the loss function, or use some other types of regularization like dropout. Finally you could consider more advanced network architectures, like ResNet / ODE layers or Inception layers, modified for a 1D sequence.