I was trying to use average ensembling on a group of models I trained earlier (for each pre-trained model I create a new model inside the ensemble and then load the trained weights onto it; I know this is inefficient, but I'm just learning about ensembling, so it doesn't really matter). When loading the models in the ensemble code, I mistakenly changed some of the networks' parameters, such as using ReLU instead of the LeakyReLU I used in training, and a different value for an l2 regularizer in the dense layer of one of the models. This, however, gave me a better testing accuracy for the ensemble. Can you please explain whether/how this is incorrect, and if it is normal, can I use this method to further enhance the accuracy of the ensemble?
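Roughly, my setup looks like the sketch below (build_model, the weight-file names, and x_test are simplified placeholders for my actual code):

import numpy as np

# rebuild each architecture, load its trained weights, then average the predictions
members = []
for weights_path in ['model_1.h5', 'model_2.h5', 'model_3.h5']:
    model = build_model()             # placeholder: should use the SAME hyper-parameters as in training
    model.load_weights(weights_path)
    members.append(model)

# average ensembling: mean of the member probabilities, then argmax
avg_probs = np.mean([m.predict(x_test) for m in members], axis=0)
ensemble_pred = avg_probs.argmax(axis=1)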
I believe it is NOT correct to change a model's parameters after training it. By parameters I mean the trainable parameters, such as the weights in a Dense layer, not hyper-parameters like the learning rate.
What is training?
Training is essentially a loop that keeps updating the parameters. It updates them in a way that it believes will reduce the loss. It is like moving your point around a high-dimensional space toward a location where the loss function gives a small value.
A smaller loss generally means a higher accuracy.
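For example, a toy gradient-descent loop on the one-parameter loss (w - 3)^2 shows what "moving toward a smaller loss" means; the numbers are made up purely for illustration:

# toy gradient descent: each step moves w in the direction that reduces the loss
w, learning_rate = 0.0, 0.1
for step in range(100):
    grad = 2 * (w - 3)           # derivative of the loss at the current w
    w -= learning_rate * grad    # step "downhill" in the loss landscape
print(w)                         # close to 3, the point of minimum loss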
Changing Weights
So now, changing your parameter values, by mistake or on purpose, is like moving that point somewhere else, BUT you have no logical reason to believe such a move will give you a smaller loss. You are just wandering randomly around that space, and in your case you were simply lucky to land on a point that happened to give a smaller loss, i.e. a better testing accuracy. It is purely luck.
Changing activation function
Also, altering the activation function from LeakyReLU to ReLU is similar: you randomly alter the shape of that space. Even though you are still at the same point, the landscape has changed, and you have no logical reason to believe that, standing at the same point on the new landscape, you will have a smaller loss.
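As a tiny numeric illustration (made-up inputs, LeakyReLU slope of 0.01), the same pre-activation values map to different outputs under the two activations, so the trained weights no longer compute the function they were optimized for:

import numpy as np

x = np.array([-2.0, -0.5, 1.0])
relu_out = np.maximum(x, 0.0)                  # [ 0.  ,  0.   , 1. ]
leaky_relu_out = np.where(x > 0, x, 0.01 * x)  # [-0.02, -0.005, 1. ]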
When you change the model manually, you need to retrain.
Although you changed the networks' parameters when loading the models, it is not incorrect to alter the hyper-parameters of your ensemble's underlying models. In some cases, the models used in an ensemble method require individual tuning, which can, as you mentioned, give "you a better testing accuracy for the ensemble model."
To answer your second question: yes, you can use this method to further enhance the accuracy of the ensemble. You can also use Bayesian optimization, GridSearch, and RandomSearch if you prefer more automated means of tuning your hyper-parameters.
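For instance, a rough grid-search sketch over the hyper-parameters you mentioned (create_model, the parameter names, and the training variables are hypothetical placeholders; create_model should build and compile a Keras model from the keyword arguments named in param_grid):

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

clf = KerasClassifier(build_fn=create_model, epochs=20, batch_size=64, verbose=0)

param_grid = {
    'activation': ['relu', 'leaky_relu'],   # how create_model interprets these is up to you
    'l2_reg': [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(clf, param_grid, cv=3)
search.fit(X_train, y_train)                 # X_train / y_train: your training data
print(search.best_params_, search.best_score_)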
I'm designing a classification model.
I have a problem: many of the categories have similar features.
I think the best option would be to regenerate the category hierarchy, but the categories are fixed.
So I focused on top-3 accuracy instead of top-1 accuracy.
I want to define a loss function for top-3 accuracy.
I don't care where the answer lands within positions 1 - 3.
Is there any good loss function for that, or how can I define one?
You can use keras.metrics.top_k_categorical_accuracy for calculating accuracy, but that is an accuracy metric. I don't think there is any built-in top-k loss function in TensorFlow or Keras as of now. A loss function has to be differentiable to work with gradient-based learning methods, while top_k is not a differentiable function, just like the accuracy metric. Hence it can be used as an accuracy metric but not as a learning objective. So you won't find any built-in method for this; however, there are research papers aiming to solve this problem. You might want to have a look at Learning with Average Top-k Loss and Smooth Loss Functions for Deep Top-k Classification.
You can use either of the following (see the sketch after this list for one way to pass k=3 to model.compile):
top_k_categorical_accuracy
keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=3)
sparse_top_k_categorical_accuracy
keras.metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=3)
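For example, one common way to plug k=3 into the metrics list is to wrap the built-in function first (a sketch; model stands for whatever classifier you are compiling):

from functools import partial
from keras import metrics

top3_acc = partial(metrics.top_k_categorical_accuracy, k=3)
top3_acc.__name__ = 'top3_acc'   # the name shown in the training log

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy', top3_acc])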
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)
Epoch 1/50
27455/27455 [==============================] - 3s 101us/step - loss: 2.9622 - acc: 0.5374
I know I'm compiling my model in the first line and fitting it in the second, and I know what the optimizer is. I'm interested in the meaning of metrics=['accuracy'] and what the acc: XXX in the training output exactly means.
Also, I'm getting acc: 1.000 (100%) when I train my model, but when I test it I get 80% accuracy. Is my model overfitting?
Ok, let's begin from the top,
First, metrics = ['accuracy']: the model can be evaluated on multiple metrics, and accuracy is one of them. Others include binary_accuracy, categorical_accuracy, sparse_categorical_accuracy, top_k_categorical_accuracy, and sparse_top_k_categorical_accuracy; these are only the built-in ones, and you can even create custom metrics. To understand metrics in more detail, you need a clear understanding of loss in a neural network. You may know that a loss function must be differentiable in order to do back-propagation, but this is not necessary for metrics: metrics are used purely for model evaluation and can therefore even be functions that are not differentiable. As mentioned in the Keras documentation itself:
A metric function is similar to a loss function, except that the results from evaluating a metric are not used when training the model. You may use any of the loss functions as a metric function.
On your own, you can define a custom metric that is not differentiable but captures the objective you actually need from your model.
TLDR; Metrics are just loss functions not used in back propagation but used for model evaluation.
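As a small illustration (the metric name and the 0.9 threshold below are made up), a custom metric is just a function of y_true and y_pred that returns a tensor, and it does not need to be differentiable:

import keras.backend as K

def confident_accuracy(y_true, y_pred):
    # fraction of samples that are correctly classified AND predicted with probability > 0.9
    correct = K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
    confident = K.cast(K.greater(K.max(y_pred, axis=-1), 0.9), K.floatx())
    return K.mean(correct * confident)

# model.compile(loss='categorical_crossentropy', optimizer='adam',
#               metrics=['accuracy', confident_accuracy])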
Now,
acc: xxx might just mean that the model has not yet finished even one minibatch, so it cannot give an accuracy score yet. I have not paid much attention to it, but it usually stays there only for a few seconds, so this is speculation on my part.
Finally, a 20% drop in model performance outside of training: yes, this can be a case of overfitting, but no one can know for sure without looking at your dataset. Most probably it is overfitting, though, and you may need to look at the data it performs badly on to find the cause.
If something is unclear or doesn't make sense, feel free to comment.
Having 100% accuracy on the train dataset while having 80% accuracy on the test dataset doesn't mean that your model overfits. Moreover, it almost surely doesn't overfit if your model has many more effective parameters than the number of training samples [2], [5] (insanely large model example [1]). This contradicts conventional statistical learning theory, but these are empirical results.
For models with more parameters than training samples, it's better to continue optimizing the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and even if the validation loss increases [3]. This may hold regardless of batch size [4].
Clarifications (edit)
The "models" I was referring to are neural networks with two or more hidden layers (could be also convolutional layers prior to dense layers).
[1] is cited to show a clear contradiction to classical statistical learning theory, which says that large models may overfit without some form of regularization.
I would invite anyone who disagrees with "almost surely doesn't overfit" to provide a reproducible example where models, say for MNIST/CIFAR etc. with a few hundred thousand parameters, do overfit (in the sense of a test-error curve that increases with iterations).
[1] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017.
[2] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
[3] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
[4] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Starting off with the first part of your question -
Keras defines a Metric as "a function that is used to judge the performance of your model". In this case you are using accuracy as the function to judge how good your model is. (This is the norm)
For the second part of your question - acc is the accuracy of your model at that Epoch. This can, and will change depending on which metrics were defined in the model.
Finally, it is possible that you have ended up with an overfit model, given what you have told us, but there are simple solutions for that.
So the meaning of metrics=['accuracy'] actually depends on which loss function you use. You can see how Keras handles this from line 375 and down. Since you are using categorical_crossentropy, your case follows the logic in the elif (line 386). Hence your metric function is set to
metric_fn = metrics_module.sparse_categorical_accuracy
See this post for a description of the logic behind sparse_categorical_accuracy; it should clear up the meaning of "accuracy" in your case. It basically just counts how many of your predictions (the class with maximum probability) were the same as the true class.
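In numpy terms it is roughly the following (a sketch, assuming y_true holds integer class ids and y_pred holds per-class probabilities):

import numpy as np

def sparse_categorical_accuracy_np(y_true, y_pred):
    # predicted class = index of the largest probability; compare with the true class id
    return np.mean(y_pred.argmax(axis=-1) == y_true.reshape(-1))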
The train vs validation accuracy can show signs of over-fitting. To test this, plot the train accuracy and validation accuracy against each other and see at what point the validation accuracy starts to decrease. Follow this for a good description of how to plot accuracy, loss, etc., to test for over-fitting.
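A rough sketch of such a plot, assuming you keep the History object returned by fit() and hold out some validation data (the key names may be 'accuracy'/'val_accuracy' in newer Keras versions):

import matplotlib.pyplot as plt

history = classifier.fit(X_train, y_train, validation_split=0.2,
                         epochs=50, batch_size=100)

plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()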
I am performing a Multinomial Logistic Regression on variables in the NHTS 2017 dataset. According to the docs, sklearn.linear_model.LogisticRegression uses cross-entropy loss (log loss) as the loss function to optimize the model. However, as I add new features and fit the model, the loss does not seem to be monotonically decreasing. Specifically, if I fit household driver count to vehicle ownership (driver count is the single most predictive variable for vehicle ownership), I get less loss than if I indiscriminately fit all of the variables.
Possibly this is because sklearn.metrics.log_loss computes something different from the actual loss function that LogisticRegression optimizes. Possibly the problem has become so non-convex that it finds a poor solution. Can anybody help explain why my loss would increase as I add features?
There could be multiple reasons but my guess is the following:
penalty - by default, logistic regression is trained with an l2 penalty to prevent overfitting. In this case, the objective is the cross-entropy loss plus the l2 norm of the weights. As a result, more features will not necessarily guarantee that the cross-entropy itself decreases.
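A rough way to check this (a sketch; clf, X, y stand for your fitted model and data, and it assumes the default l2 penalty) is to compare plain log_loss with an approximation of the penalized quantity sklearn actually minimizes:

import numpy as np
from sklearn.metrics import log_loss

def penalized_objective(clf, X, y):
    # approximate objective of l2-penalized LogisticRegression:
    # 0.5 * ||w||^2 + C * (cross-entropy summed over the training samples)
    ce = log_loss(y, clf.predict_proba(X), normalize=False)  # summed, not averaged
    l2 = 0.5 * np.sum(clf.coef_ ** 2)
    return l2 + clf.C * ce

# penalized_objective(clf, X, y) can keep decreasing with added features even
# while log_loss(y, clf.predict_proba(X)) alone goes up.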
Btw, it seems like your goal is to get the highest score (lowest loss) on the training set. I'm not going to dispute that, but maybe look into test/validation sets.
I have been playing with Lasagne for a while now for a binary classification problem using a Convolutional Neural Network. However, although I get okay(ish) results for training and validation loss, my validation and test accuracy is always constant (the network always predicts the same class).
I have come across this, someone who had the same problem as me with Lasagne. Their solution was to set regression=True, as they are using nolearn on top of Lasagne.
Does anyone know how to set this same variable within Lasagne (as I do not want to use Nolearn)? Further to this, does anyone have an explanation as to why this needs to happen?
Looking at the code of the NeuralNet class from nolearn, it looks like the parameter regression is used in various places, but most of the time it affects how the output value and loss are computed.
In case of regression=False (the default), the network outputs the class with the maximum probability, and computes the loss with the categorical crossentropy.
On the other hand, in case of regression=True, the network outputs the probabilities of each class, and computes the loss with the squared error on the output vector.
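A rough Lasagne-level sketch of those two objectives (network, target_var, and one_hot_targets are placeholder names for your own symbolic variables):

import lasagne

# symbolic predictions of the network (class probabilities from a softmax output layer)
prediction = lasagne.layers.get_output(network)

# regression=False behaviour: categorical cross-entropy against the labels
ce_loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()

# regression=True behaviour: squared error between the probabilities and one-hot targets
mse_loss = lasagne.objectives.squared_error(prediction, one_hot_targets).mean()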
I am not an expert in deep learning and CNNs, but the reason this may have worked is that, in the case of regression=False, if the error gradient is small, applying small changes to the network parameters may not change the predicted class or the associated loss, which may lead the algorithm to "think" it has converged. But if you instead look at the class probabilities, small parameter changes will affect the probabilities and the resulting mean squared error, so the network will continue down this path, which may eventually change the predictions.
This is just a guess, it is hard to tell without seeing the code and the dataset.