I saw code like this
self.feature = model_func()
if loss_type == 'softmax':
    self.classifier = nn.Linear(self.feature.final_feat_dim, num_class)
    self.classifier.bias.data.fill_(0)
elif loss_type == 'dist':  # Baseline++
    self.classifier = backbone.distLinear(self.feature.final_feat_dim, num_class)
where model_func is a ConvNet 4/6 or ResNet 10/18/34/101
What is the classifier here?
I know that in neural networks we have parameters that we learn, buffers that store values which get updated during training, and activations that are the outputs after each layer.
Is a feature the same as an activation? What is a classifier, and where do the features end and the classifier begin in a neural network? Is the result of a classifier also an activation?
I find the question a little messy, but I'll do my best based on what I understand you're asking.
What is the classifier here?
In general, you can think of the whole trained model as the classifier: it is what will, after being trained, classify new data. In the snippet, though, self.classifier refers specifically to the final layer (a plain nn.Linear for 'softmax', a distLinear layer for 'dist') that maps the extracted features to class scores.
Is a feature the same as an activation?
I don't know what kind of feature you have in mind. In a data science context, a feature is understood to be one of the variables of the data you have. For example, in a dataset about houses, you may have features such as latitude, longitude, whether it has a pool, how many bedrooms it has, and so on. In the snippet, self.feature is the backbone network, and the vector it outputs (the activations of its last layer) is the feature representation that the classifier consumes.
Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. [1]
I'm not sure I'm truly understanding what you're asking.
Is the result of a classifier also an activation?
The result of a classifier is the label, the class to which each data point belongs. Activation functions are used by neural networks in the process of classifying.
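To make the split concrete, here is a minimal PyTorch sketch of a feature extractor followed by a classifier head; the layer sizes and variable names are invented for illustration and are not taken from the repository in the question:
import torch
import torch.nn as nn

backbone = nn.Sequential(          # plays the role of self.feature
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                  # output: one feature vector per image
)
classifier = nn.Linear(16, 10)     # plays the role of self.classifier

x = torch.randn(8, 3, 32, 32)      # a batch of 8 RGB images
features = backbone(x)             # the activations of the backbone's last layer
logits = classifier(features)      # one score per class
Everything up to and including the Flatten produces the "features"; the final linear layer is the "classifier", and its output (the class scores) is just another activation that a softmax/cross-entropy loss then consumes.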
Hope this helps!
[1] https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
I'm trying to do text classification with scikit-learn.
I have text that does not classify well. I think I can improve the predictions by adding data I can deduce in the form of an array of integers.
For example, sample 1 would come with [3, 1, 5, 2] and sample 2 would come with [2, 1, 4, 2]. This would also be true of the test data.
The idea is that the classifier could use both the text and the numbers to classify the data.
I've read the documentation for scikit-learn and I can't find how to do it. It must be possible, because all that is classified internally is vectors of numbers, so adding another vector of numbers should not be that much of a problem, but I can't figure out how. partial_fit adds more samples; it does not add more information about the existing samples. Is there a way to do what I'm trying to do? I tried to combine GaussianNB with SGDClassifier, but it turns out I don't know how to do that. (Was it a bad idea?)
What should I do?
I think you could add these new features as extra dimensions (columns) of your training data. You need to modify the training data by appending your new features before calling SGD.
A simple/naive way would be:
For example, if my training data with two samples were
X = [ [1,2,3], [8,9,0] ]
And my new features for each sample were
new_feature_X = [ [11,22,33] , [77,88,0] ]
My new training data would be:
X_new = [[1,2,3,11,22,33] , [8,9,0,77,88,0]]
Then you call SGD.fit(X_new, labels)
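In scikit-learn terms, one way to do this is to vectorize the text and then stack the extra integer columns onto the resulting matrix with scipy.sparse.hstack; a minimal sketch (the data here is made up):
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

texts = ["first sample text", "second sample text"]
extra = np.array([[3, 1, 5, 2],
                  [2, 1, 4, 2]])               # the deduced integer features
labels = [0, 1]

vec = TfidfVectorizer()
X_text = vec.fit_transform(texts)              # sparse matrix of text features
X_new = hstack([X_text, csr_matrix(extra)])    # append the numeric columns

clf = SGDClassifier()
clf.fit(X_new, labels)
In practice you would usually scale the numeric columns so they don't dominate (or get dominated by) the tf-idf features.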
As far as my SGD knowledge goes, I don't think there is any other way to combine the two feature sets.
The idea is that the classifier could use both the text and the numbers to classify the data.
I find a neural network to be much more suitable for this. You could use two input layers, one for the text vectors and one for the numbers, and feed them together into a network to get the output.
I tried to combine GaussianNB with SGDClassifier, but it turns out I don't know how to do that. (Was it a bad idea?)
SGD means stochastic gradient descent. Is it possible to find the gradient of Naive Bayes? What's the corresponding cost function?
What should I do?
Ensemble. Train two separate classifiers, one on your text data and another on your new handcrafted features, and then take the average of their prediction probabilities (a small sketch follows after this list). You could also train multiple classifiers and take their votes. This tutorial is great for that.
Try out an MLP Classifier. I used it a while ago and found it works pretty well with text.
Neural networks. It's pretty easy with Keras.
Read the research literature. There is a pretty good chance academia has done some work on your kind of dataset. Try to read some of it. Google Scholar and Semantic Scholar are great places to find published research.
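For the ensemble option, here is a minimal sketch of soft voting over two classifiers, one for the text and one for the hand-made features (the data and classifier choices below are just placeholders):
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_text_feats = rng.rand(20, 50)     # stand-in for your vectorized text
X_num_feats = rng.rand(20, 4)       # stand-in for your integer features
y = rng.randint(0, 2, 20)

clf_text = LogisticRegression().fit(X_text_feats, y)
clf_num = LogisticRegression().fit(X_num_feats, y)

# soft voting: average the class probabilities and take the argmax
avg_proba = (clf_text.predict_proba(X_text_feats) +
             clf_num.predict_proba(X_num_feats)) / 2.0
predictions = avg_proba.argmax(axis=1)
And for the neural-network option, the Keras sketch below feeds both inputs into one model: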
from keras.layers import Input, Dense,Concatenate
from keras.models import Model
# This returns a tensor
text_input_vec = Input(shape=(784,))
new_numeric_feature = Input(shape=(4,))
# feed your text to a dense layer
dense1 = Dense(64, activation='relu')(text_input_vec)
# feed your numeric feature to another dense layer
dense2 = Dense(64, activation='relu')(new_numeric_feature)
# concatenate/combine the output of both
concat = Concatenate(axis=-1)([dense1,dense2])
# use the above to predict the label of your text. Layer below
# assumes you have 2 classes
predictions = Dense(2, activation='softmax')(concat)
model = Model(inputs=[text_input_vec,new_numeric_feature], outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
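Training then takes both inputs together; something like the following, where the arrays are placeholders for your actual vectorized text, numeric features and one-hot labels:
import numpy as np
from keras.utils import to_categorical

text_X = np.random.rand(100, 784)      # placeholder text vectors
numeric_X = np.random.rand(100, 4)     # placeholder numeric features
y = to_categorical(np.random.randint(0, 2, 100), num_classes=2)

model.fit([text_X, numeric_X], y, epochs=10, batch_size=32)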
I have several trained neural networks whose predictions I am trying to average using EnsembleVoteClassifier from mlxtend.classifier. The problem is that my neural networks don't share the same input (I applied different feature reduction and feature selection algorithms and stored the results in separate variables, so I have something like X_test_algo1, X_test_algo2, X_test_algo3 and Y_test).
I am trying to average the weights, but as I said, I don't have the same X, and I didn't find any example in the documentation. How can I average the predictions of my three models model1, model2 and model3?
eclf = EnsembleVoteClassifier(clfs=[model1, model2, model3], weights=[1,1,1], refit=False)
names = ['NN1', 'NN2', 'NN3', 'Ensemble']
eclf.fit(X_train_algo1, Ytrain) #????
If it's not possible, that is okay. I am only interested in how to calculate the formulas for Hard Voting, Soft Voting and Weighted Voting, or in another library that is more flexible; the explicit formulas would be helpful too.
Why would you need a library to do that?
Simply pass the same examples through all your neural networks and get the predictions (either logits or probabilities or labels).
Hard voting: choose the label predicted most often by the classifiers.
Soft voting: average the probabilities predicted by the classifiers and choose the label with the highest average.
Weighted voting: either of the above can be weighted. Just assign a weight to each classifier and multiply its predictions by it. Weights are usually normalized to the (0, 1] range.
In principle you could also sum the logits and choose the label with the highest sum.
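For the concrete setup in the question, where each model has its own pre-processed test matrix, the three schemes can be written out directly. This sketch assumes each network ends in a softmax so that predict returns class probabilities (for sklearn-style models use predict_proba instead):
import numpy as np

# each model predicts probabilities on its own pre-processed test set
probas = [model1.predict(X_test_algo1),
          model2.predict(X_test_algo2),
          model3.predict(X_test_algo3)]   # each has shape (n_samples, n_classes)

# soft voting: average the probabilities, take the most probable class
soft = np.mean(probas, axis=0).argmax(axis=1)

# hard voting: each model casts its own argmax vote, take the majority
votes = np.stack([p.argmax(axis=1) for p in probas])            # (3, n_samples)
hard = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# weighted (soft) voting: weight each model's probabilities before averaging
w = np.array([0.5, 0.3, 0.2])
weighted = np.tensordot(w, np.stack(probas), axes=1).argmax(axis=1)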
Oh, and weight averaging is a different technique: it requires the models to have the same architecture and is usually done for the same initialization but at different training timesteps. You can read about it in this blog post.
I am working on a classification task which uses byte sequences as samples. A byte sequence can be normalized as input to a neural network by applying x/255 to each byte x. In this way, I trained a simple MLP and the accuracy is about 80%. Then I trained an autoencoder with an 'mse' loss on the whole data to see if it learns features that work well for the task. I froze the weights of the encoder's layers and added a softmax dense layer to it for classification. I retrained the new model (training only the last layer) and, to my surprise, the result was much worse than the MLP, merely 60% accuracy.
Can't the autoencoder learn good features from all the data? Why is the result so bad?
Possible actions to take:
Check the reconstruction error of the autoencoder: can it actually reproduce its input? (See the short sketch after this list.)
Visualize the autoencoder results (dimensionality reduction): is the variance explained with fewer dimensions?
Making the model more complex does not necessarily outperform a simpler one. Did you plot the validation MSE versus epoch? Is there a global minimum after a number of steps?
Do you have enough epochs?
How many units does your autoencoder have? The bottleneck may be too small (which underfits) or too large, depending on the behavior of your data and its volume.
Did you make any comparison with other dimensionality reduction methods like PCA or NMF?
Last but not least, is an autoencoder really the best way to engineer features for this task?
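For the first and the PCA points, a minimal sketch of what the check could look like, assuming a trained Keras model called autoencoder, the normalized data in X, and a bottleneck of bottleneck_dim units (all three names are assumptions, not from the question):
import numpy as np
from sklearn.decomposition import PCA

# reconstruction error of the autoencoder on the data
recon = autoencoder.predict(X)
ae_mse = np.mean((X - recon) ** 2)

# how much variance a plain PCA with the same bottleneck size would keep
pca = PCA(n_components=bottleneck_dim).fit(X)
print(ae_mse, pca.explained_variance_ratio_.sum())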
"Why the result is so bad?" This is not actually a surprise. You've trained one model to be good at compressing the information. The transformations it learns at each layer do not need to be good for any other type of task at all. In fact, it could be throwing away a lot of information that is perfectly helpful for whatever auxiliary classification task you have, but which is not needed for a task purely of compressing and reconstructing the sequence.
Instead of approaching it by training a separate autoencoder, you might have better luck with just adding sparsity penalty terms from the MLP layers into the loss function, or use some other types of regularization like dropout. Finally you could consider more advanced network architectures, like ResNet / ODE layers or Inception layers, modified for a 1D sequence.
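A minimal Keras sketch of that suggestion, with made-up sizes (say 1000-byte sequences and 10 classes), an L1 activity penalty for sparsity and dropout for regularization:
from keras import regularizers
from keras.layers import Dense, Dropout
from keras.models import Sequential

model = Sequential([
    Dense(256, activation='relu', input_shape=(1000,),
          activity_regularizer=regularizers.l1(1e-5)),  # sparsity penalty on the activations
    Dropout(0.5),                                       # dropout regularization
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])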
Say we train a multilayer NN in TensorFlow for a regression task (i.e. a multi-input, multi-output case). Then we have new instances, we apply the trained model, and of course we get the corresponding outputs. Is there a way to backpropagate the outputs and reconstruct the inputs in TensorFlow in an easy/efficient manner? What I am thinking is to then use the difference between the original and the reconstructed inputs of the new instances as a QC measure, i.e. if the reconstructed inputs are not close enough to the originals then we have a problem, etc. I hope I am making myself clear.
No, unfortunately you cannot take a trained model and recover the corresponding input. The reason is that there are infinitely many possible inputs for each output.
Furthermore, backpropagation is not passing an output backwards through the network. It is a procedure for determining to what extent each parameter in the model contributes to the loss function. It will not give you the inputs to the hidden layers, only the extent to which the weights affected the decision.
Currently I'm using VGG16 + Keras + Theano with the transfer learning methodology to recognize plant classes. It works just fine and gives me good accuracy. But the next problem I'm trying to solve is to find a way of identifying whether an input image contains a plant at all. I don't want to have another classifier that does this, because it's not really efficient.
So I did some searching and found that we can get the activations from the last model layer (before the activation layer) and analyze them.
from keras import backend as K

model = util.load_model()  # VGG16 model
model.load_weights(path_to_weights)

def get_activations(m, layer, X_batch):
    x = [m.layers[0].input, K.learning_phase()]
    y = [m.get_layer(layer).output]
    get_activations = K.function(x, y)
    activations = get_activations([X_batch, 0])
    # trying to get some features from the activations
    # to understand how we can identify whether an image is relevant
    for l in activations[0]:
        not_nulls = [v for v in l if v > 0]
        # percentage of activated neurons
        c1 = float(len(not_nulls)) / len(l)
        n_activated = len(not_nulls)
        print('c1:{}, n_activated:{}'.format(c1, n_activated))
    return activations

get_activations(model, 'the_latest_layer_name', inputs)
From the above code I've noticed that when we have a very irrelevant image, the number of activated neurons is bigger than for images that contain plants:
For images that were used for model training: 19%-23% of neurons activated
For images containing unknown plant species: 20%-26%
For irrelevant images: 24%-28%
It's not really a good feature for telling whether an image is relevant, since the percentage ranges overlap.
So, is there a good way to resolve this issue?
Thanks to Feras's idea in the comment above. After some trials, I've come up with a solution that solves this problem with accuracy up to 99.99%.
The steps are:
Train your model on a dataset;
Store activations (see the method above for how to get them) by running relevant and non-relevant images through the model trained in the previous step. You should take the activations from the penultimate layer. For VGG16 it's the last of the two Dense(4096) layers, for InceptionV3 an extra penultimate Dense(1024) layer, for ResNet50 an extra penultimate Dense(2048) layer.
Solve a binary classification problem using the stored activations. I've tried a simple flat NN and Logistic Regression. Both were good in accuracy (the flat NN was a bit more accurate), but I chose Logistic Regression as it's simpler, faster and consumes less memory and CPU/GPU.
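A minimal sketch of the last step, assuming the stored activations have already been collected into arrays; act_relevant, act_irrelevant and new_activations are illustrative names, not from the code above:
import numpy as np
from sklearn.linear_model import LogisticRegression

# penultimate-layer activations for plant and non-plant images
X = np.vstack([act_relevant, act_irrelevant])
y = np.concatenate([np.ones(len(act_relevant), dtype=int),
                    np.zeros(len(act_irrelevant), dtype=int)])

relevance_clf = LogisticRegression().fit(X, y)
is_plant = relevance_clf.predict(new_activations)   # 1 = relevant, 0 = not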
This process should be repeated each time your model is retrained, because the final CNN weights are different each time, so what worked previously will not work for the new weights.
So as a result we have another small model for solving this problem.