I have trained an NLP model with an RNN in Keras to classify tweets, using word embeddings (Stanford GloVe) as the feature representation. I would like to apply this trained model to newly extracted tweets. However, this error appears when I try to apply the model to the new data:
ValueError: Error when checking input: expected input_1 to have shape (22,) but got array with shape (51,)
Then I realised that the trained model expects an input vector of length 22 (the maximum tweet length in the training set), whereas the new dataset produces input vectors of length 51 (the maximum tweet length among the new tweets).
In an attempt to tackle this, I increased the input length to 51 when training the model so that both would match. A new error popped up:
InvalidArgumentError: indices[29,45] = 5870 is not in [0, 2489)
So I decided to apply the model back to the training dataset, to see whether it worked at all with the original parameters and model. It did not, and a very similar error appeared:
InvalidArgumentError: indices[23,11] = 2489 is not in [0, 2489)
In this case, how can I export an end-to-end NLP RNN classification model and apply it to new, unseen data? (FYI: I was able to do this successfully for Logistic Regression with TF-IDF features. There just seem to be a lot of issues with the word embeddings.)
===========
UPDATE:
I was able to solve this issue by pickling not only the model, but also variables such as the max_len, texttotensor_instance and tokenizer. When applying the model to new data, I will have to use the same variables generated from the training data (instead of redefining them with the new data).
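The fix described in the update can be sketched roughly as follows. This is a plain-Python stand-in, not the actual Keras pipeline: `build_vocab`, `encode`, and the `<pad>`/`<unk>` ids are hypothetical helpers invented for illustration, and `pickle.dumps` stands in for writing the objects to disk next to the model.

```python
import pickle

def build_vocab(texts):
    """Map each training word to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(texts, vocab, max_len):
    """Turn texts into fixed-length id lists using the TRAINING vocab and length."""
    rows = []
    for text in texts:
        ids = [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]
        ids = ids[:max_len]                           # truncate long tweets
        rows.append(ids + [0] * (max_len - len(ids)))  # pad short tweets
    return rows

# Training time: derive vocab and max_len from the training tweets only.
train_texts = ["good morning twitter", "this is a really bad day"]
vocab = build_vocab(train_texts)
max_len = max(len(t.split()) for t in train_texts)
blob = pickle.dumps({"vocab": vocab, "max_len": max_len})

# Inference time: reload the saved objects instead of refitting on new data.
state = pickle.loads(blob)
new_texts = ["what a really good but very long and unseen tweet this is"]
encoded = encode(new_texts, state["vocab"], state["max_len"])
print(encoded)  # [[1, 7, 8, 2, 1, 1]]
```

Because the new tweet is encoded with the saved vocabulary and length, every row has the length the model was trained on and no id can fall outside the embedding table.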
Your error occurs because your data contains word indices that exceed the vocabulary size declared in the Embedding layer (its input_dim).
It seems that the input_dim parameter of your Embedding layer is set to 2489, while words in your dataset are tokenized and mapped to higher values (e.g. 5870).
Also, don't forget to add one to the maximum word index when you set this in the Embedding layer (input_dim=max_number_of_words+1). If you want to know why, see this question: Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?
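To check this before training, one option is to scan the padded index matrix for the largest index. The sketch below is plain Python, and `required_input_dim` and `find_out_of_range` are hypothetical helper names, not Keras functions.

```python
def required_input_dim(sequences):
    """Smallest Embedding input_dim that covers every index in the data."""
    return max(idx for row in sequences for idx in row) + 1

def find_out_of_range(sequences, input_dim):
    """Return (row, column, index) for every index outside [0, input_dim)."""
    return [(r, c, idx)
            for r, row in enumerate(sequences)
            for c, idx in enumerate(row)
            if not 0 <= idx < input_dim]

# Toy padded sequences; 2489 mimics the index from the error message.
padded = [[0, 0, 12, 7], [3, 2489, 5, 1]]
print(required_input_dim(padded))       # 2490
print(find_out_of_range(padded, 2489))  # [(1, 1, 2489)]
```

The reported positions have the same (row, column) form as the `indices[23,11]` part of the InvalidArgumentError.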
Related
I just trained a BERT model on a dataset composed of products and labels (departments) for an e-commerce website. It's a multiclass problem. I used BertForSequenceClassification to predict the department for each product. I split the data into train and evaluation sets, used a PyTorch DataLoader, and got a good score with no overfitting.
Now I want to try it on a new dataset to check how it works on unseen data, but I can't manage to load the model and apply it to the new dataset. I get the following error:
RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for classifier.weight: copying a param with shape torch.Size([59, 1024]) from checkpoint, the shape in current model is torch.Size([105, 1024]).
size mismatch for classifier.bias: copying a param with shape torch.Size([59]) from checkpoint, the shape in current model is torch.Size([105]).
I see that the problem is probably a mismatch in the number of labels between the two datasets. I searched a bit and found a recommendation to pass ignore_mismatched_sizes=True as an argument to from_pretrained, but I keep getting the same error.
Here is part of my code when trying to predict on unseen data:
import torch
from transformers import BertForSequenceClassification
# Select the hardware right before the actual usage
device = torch.device('cuda') # or 'cpu'
model = BertForSequenceClassification.from_pretrained("neuralmind/bert-large-portuguese-cased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False,
                                                      ignore_mismatched_sizes=True)
model.to(device)
model.load_state_dict(torch.load('finetuned_BERT_epoch_2_full-Copy1.model', map_location=torch.device('cuda')))
_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)
Could someone help me deal with this? I don't know what else to try.
I'd be very grateful for any help!
Your new dataset has 105 classes, while your model was trained for 59 classes. As you have already mentioned, you can use ignore_mismatched_sizes to load your model. This parameter will load the embedding and encoding layers of your model, but will randomly initialize the classification head:
model = BertForSequenceClassification.from_pretrained("finetuned_BERT_epoch_2_full-Copy1.model",
                                                      num_labels=105,
                                                      output_attentions=False,
                                                      output_hidden_states=False,
                                                      ignore_mismatched_sizes=True)
In case you want to keep the classification layer of the 59 labels and add 46 labels, you can refer to this answer. Please also note the comments of this answer, because this approach does not provide any meaningful results due to the random initialization for the new labels.
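Conceptually, loading with mismatched sizes keeps every checkpoint tensor whose shape agrees with the new model and leaves the rest at their fresh initialization. The sketch below illustrates that idea on plain dictionaries of shapes. It is not the transformers implementation: `split_by_shape` is a hypothetical helper, and `encoder.layer.0.weight` is a made-up parameter name standing in for the encoder weights.

```python
def split_by_shape(checkpoint_shapes, model_shapes):
    """Separate parameters that can be reused from those that must be re-initialized."""
    reusable, reinitialized = [], []
    for name, shape in model_shapes.items():
        if checkpoint_shapes.get(name) == shape:
            reusable.append(name)
        else:
            reinitialized.append(name)
    return reusable, reinitialized

# Shapes taken from the error message (59 labels in the checkpoint, 105 in the new model).
checkpoint_shapes = {
    "encoder.layer.0.weight": (1024, 1024),  # hypothetical encoder parameter
    "classifier.weight": (59, 1024),
    "classifier.bias": (59,),
}
model_shapes = {
    "encoder.layer.0.weight": (1024, 1024),
    "classifier.weight": (105, 1024),
    "classifier.bias": (105,),
}

reusable, reinitialized = split_by_shape(checkpoint_shapes, model_shapes)
print(reusable)       # ['encoder.layer.0.weight']
print(reinitialized)  # ['classifier.weight', 'classifier.bias']
```

Anything in the second list starts from random weights, which is why the classification head needs further fine-tuning before its predictions mean anything.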
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
lookup_fruit_name[fruit_prediction[0]]
I am not able to run the two lines above. I get this error:
ValueError: query data dimension must match training data dimension
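Checking the inputs of the knn classifier would be a good way to analyse what is wrong with the dimensions passed to knn.predict.
knn.predict expects a 2-D input (one row per sample), so the nested list itself is fine. What must match is the number of values in each row: the error means the classifier was fitted on a different number of features than the three you are passing here.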
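The dimension check suggested above can be sketched as follows. The example uses a tiny hand-written 1-nearest-neighbour function as a stand-in for the scikit-learn classifier; `nearest_label`, the four-feature training rows, and the labels are all made up for illustration.

```python
import math

def nearest_label(train_X, train_y, query):
    """Return the label of the training row closest to `query`."""
    n_features = len(train_X[0])
    if len(query) != n_features:
        raise ValueError(
            f"query has {len(query)} features, training data has {n_features}")
    distances = [math.dist(row, query) for row in train_X]
    return train_y[distances.index(min(distances))]

# Hypothetical training data with FOUR features per fruit.
train_X = [[192, 8.4, 7.3, 0.55], [20, 4.3, 5.5, 0.80]]
train_y = ["apple", "mandarin"]

try:
    nearest_label(train_X, train_y, [20, 4.3, 5.5])  # only three features
except ValueError as err:
    print(err)  # query has 3 features, training data has 4

print(nearest_label(train_X, train_y, [20, 4.3, 5.5, 0.80]))  # mandarin
```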
I was reading the BERT paper and was not clear regarding the inputs to the transformer encoder and decoder.
For learning masked language model (Cloze task), the paper says that 15% of the tokens are masked and the network is trained to predict the masked tokens. Since this is the case, what are the inputs to the transformer encoder and decoder?
Is the input to the transformer encoder this input representation (see image above)? If so, what is the decoder input?
Further, how is the output loss computed? Is it a softmax over only the masked locations? And is the same linear layer used for all masked tokens?
Ah, but you see, BERT does not include a Transformer decoder.
It is only the encoder part, with a classifier added on top.
For masked word prediction, the classifier acts as a decoder of sorts, trying to reconstruct the true identities of the masked words.
Non-masked tokens are not included in the classification task and do not affect the loss.
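A minimal sketch of such a loss, assuming one shared classifier has already produced a logit vector per position: the cross-entropy is averaged over masked positions only. The function names and the toy three-word vocabulary are illustrative, not taken from the BERT code.

```python
import math

def softmax_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

def masked_lm_loss(logits_per_position, targets, is_masked):
    """Average cross-entropy over the masked positions only."""
    losses = [softmax_nll(logits, target)
              for logits, target, masked
              in zip(logits_per_position, targets, is_masked)
              if masked]
    return sum(losses) / len(losses)

# Three positions, vocabulary of three words; only positions 1 and 2 are masked.
logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 0.1], [-0.5, 1.5, 0.0]]
targets = [0, 2, 1]
is_masked = [False, True, True]
loss = masked_lm_loss(logits, targets, is_masked)

# Changing the logits of the non-masked position leaves the loss untouched.
altered = [[-9.0, 9.0, 0.0]] + logits[1:]
print(masked_lm_loss(altered, targets, is_masked) == loss)  # True
```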
BERT is also trained to predict whether the second sentence of a pair really follows the first one or not.
I do not remember how the two losses are weighted.
I hope this draws a clearer picture.
I'd like to apply an LSTM to my speech emotion dataset (a dataset of numeric features with one column of targets).
I've already done the train/test split. Do I need to apply some other transformation to the dataset before feeding it to the model?
I ask this question because when I compile and fit the model I've got one error in the last dense layer.
Error when checking model target: expected activation_2 to have shape (8,) but got array with shape (1,).
Thanks.
During my internship I learned how to fix this error and where to look.
Here is what you have to check.
Unexpected input shape error
If the layer reported in the error is the first one, the cause is the input data: the training data must have the same shape as the input shape declared when you created the model.
If it is the last layer that fails, then the labels are badly encoded.
Either you used a sigmoid but the labels are not binary, or you used a softmax but the labels are not in one-hot format (for example, [0,1,0] with 3 classes means the element belongs to class 2). So either the labels are badly encoded, or you chose the wrong activation function (sigmoid / softmax) for your output layer.
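For the error in the question, where the output layer expects shape (8,) but each label has shape (1,), the one-hot encoding can be sketched as below. This is a plain-Python illustration with a hypothetical `one_hot` helper and zero-based class ids. In Keras, `keras.utils.to_categorical` performs this conversion, and using the `sparse_categorical_crossentropy` loss instead lets you keep integer labels.

```python
def one_hot(labels, num_classes):
    """Convert integer class ids (0-based) into one-hot rows."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows

labels = [1, 7, 0]            # one integer per sample, i.e. shape (1,) each
encoded = one_hot(labels, 8)  # eight emotion classes, i.e. shape (8,) each
print(encoded[0])       # [0, 1, 0, 0, 0, 0, 0, 0]
print(len(encoded[1]))  # 8
```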
Hope this helps.
Traditionally, it seems that RNNs use logits to predict the next time step in the sequence. In my case I need the RNN to output a word2vec vector prediction (depth 50). This means that the cost function has to be based on 2 vectors: Y, the actual vector of the next word in the series, and Y_hat, the network's prediction.
I've tried using a cosine distance cost function, but the network does not seem to learn (I've let it run for over 10 hours on an AWS P3 and the cost is always around 0.7).
Is such a model possible at all? If so, what cost function should be used?
Cosine distance in TF:
cosine_distance = tf.losses.cosine_distance(tf.nn.l2_normalize(outputs, 2), tf.nn.l2_normalize(targets, 2), axis=2)
Update:
I am trying to predict a word2vec so during sampling I could pick next word based on the closest neighbors of the predicted vector.
What is the reason that you want to predict a word embedding? Where are you getting the "ground truth" word embeddings from? For word2vec models you typically will re-use the trained word-embeddings in future models. If you trained a word2vec model with an embedding size of 50, then you would have 50-d embeddings that you could save and use in future models. If you just want to re-create an existing ground truth word2vec model, then you could just use those values. Typical word2vec would be having regular softmax outputs via continuous-bag-of-words or skip-gram and then saving the resulting word embeddings.
If you really do have a reason for trying to build a model that tries to match word2vec, then, looking at your loss function, here are a few suggestions. I do not believe that you should be normalizing your outputs or your targets; you probably want those to remain unaffected (the targets are no longer the "ground truth" targets if you have normalized them). Also, it appears you are using dim=0, which has now been deprecated and replaced with axis. Did you try different values for dim? This should represent the dimension along which to compute the cosine distance, and I think that the 0th dimension would be the wrong one (as this is likely the batch dimension). I would try axis=-1 (last dimension) or axis=1 and see if you observe any difference.
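To make the axis point concrete, here is a plain-Python sketch that computes the cosine distance along the last (embedding) dimension and averages it over batch and time. It assumes a [batch][time][embedding] layout, which is a guess at the asker's tensors. The function names are illustrative, and this is not the TensorFlow implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); depends on direction only, not length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(outputs, targets):
    """Compare embedding vectors pairwise (last axis), then average over batch and time."""
    distances = [cosine_distance(out_vec, tgt_vec)
                 for out_seq, tgt_seq in zip(outputs, targets)
                 for out_vec, tgt_vec in zip(out_seq, tgt_seq)]
    return sum(distances) / len(distances)

# One sequence of two time steps with 2-d "embeddings".
targets = [[[1.0, 0.0], [0.0, 2.0]]]
aligned = [[[3.0, 0.0], [0.0, 0.5]]]      # same directions, different lengths
orthogonal = [[[0.0, 1.0], [4.0, 0.0]]]   # perpendicular to the targets

print(mean_cosine_distance(aligned, targets))     # 0.0
print(mean_cosine_distance(orthogonal, targets))  # 1.0
```

If the reduction ran over the batch or time axis instead, vectors belonging to different words would be mixed together and the loss would carry little useful signal.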
Separately, what is your optimizer/learning rate? If the learning rate is too small then you may not actually be able to move enough in the right direction.