I implemented a POS tagger using an RNN. There are 3 features; if the current word is W_i:
Feature 1: W_i-2, W_i-1, W_i, W_i+1, W_i+2
Feature 2: the 2-character suffix of each word in Feature 1
Feature 3: [whether W_i is all uppercase, whether W_i is all lowercase, whether the first character of W_i is uppercase]
In my model I have two RNNs, one for Feature 1 and one for Feature 2; the outputs of the two RNNs and Feature 3 are concatenated and followed by a softmax. The RNN for Feature 1 is bidirectional.
I tried my model on the Penn Treebank, but the accuracy is very low (<50% on both training and evaluation). Does anyone know of an open-source POS tagger using an RNN (with word-based features) in Python that I can compare against my model? That way I can tell whether there is a bug in my code or whether this model simply doesn't work.
Thanks,
There is one that is implemented using a bi-directional LSTM and CRF. It can be found here.
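For a quick sanity check, a minimal word-level BiLSTM tagger in Keras might look like the sketch below (vocab_size, n_tags, max_len and the padded X/y arrays are placeholders, not taken from your setup). Simple recurrent taggers of this kind are usually reported well above 90% on the Penn Treebank, so accuracy below 50% even on the training set most likely points to a data-preparation or label-alignment bug rather than the architecture.

# Minimal word-level BiLSTM tagger sketch (all sizes are assumptions)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

vocab_size = 20000   # assumed vocabulary size (index 0 reserved for padding)
n_tags = 45          # Penn Treebank tag set size
max_len = 50         # assumed maximum sentence length after padding

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_len, mask_zero=True))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(TimeDistributed(Dense(n_tags, activation='softmax')))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# X: (n_sentences, max_len) int word indices; y: (n_sentences, max_len) int tag indices
# model.fit(X, y, batch_size=32, epochs=5, validation_split=0.1)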
I am working on a classification task with 3 labels (0, 1, 2 = neg, pos, neu). The data are sentences. To produce vectors/embeddings of the sentences, I use a BERT encoder to get an embedding for each sentence, and then I use a simple kNN to make predictions.
Each sentence has a label and some other numerical values related to the classification.
For example, my data look like this:
Sentence    embeddings_BERT    level    sub-level    label
je mange    [0.21, 0.56]       2        2.1          pos
il hait     [0.25, 0.39]       3        3.1          neg
...
As you can see, each sentence has these extra categories, but they are not the final label; they are indices that helped a human annotator figure out the label. I want my model to take those two values into consideration when predicting the label. Should I concatenate them with the embeddings generated by the BERT encoder, or is there another way?
There is no single perfect way to tackle this problem, but a simple solution is to concatenate the BERT embeddings with the hand-crafted features. The BERT sentence embeddings will be of dimension 768 (if you used BERT base) and can be treated as features of the sentence itself. The additional features can be concatenated onto them to form a higher-dimensional vector. If a feature is categorical, it is best to convert it to a one-hot vector before concatenating; for example, if you want to use level from your example as an input feature, convert it into a one-hot vector and then concatenate it with the BERT embedding. Be aware, though, that in some cases the hand-crafted features can dominate and bias the classifier, while in other cases they may have no influence at all; it all depends on the data that you have.
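For illustration, a rough sketch of that concatenation (the DataFrame df, the column names and the kNN classifier are assumptions mirroring your example, not a prescribed pipeline):

# Sketch: concatenate BERT sentence embeddings with a one-hot encoded
# categorical feature, then fit a simple classifier on the result.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

X_bert = np.array(df['embeddings_BERT'].tolist())                    # (n_samples, embedding_dim); 768 for BERT base
level_onehot = OneHotEncoder().fit_transform(df[['level']]).toarray()  # categorical 'level' as one-hot columns

X = np.hstack([X_bert, level_onehot])                                # higher-dimensional feature vectors
y = df['label']

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)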
I have a movie review dataset with two columns: Review (sentences) and Sentiment (1 or 0).
I want to create a classification model using word2vec for the embeddings and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show me similar words, like this:
model = gensim.models.Word2Vec(cleaned_dataset, min_count=2, size=100, window=5)
words = model.wv.vocab                   # the learned vocabulary
similar = model.wv.most_similar("bad")   # words most similar to "bad"
I already have my dependent variable (y), which is my 'Sentiment' column; all I need is the independent variable (X) that I can pass on to my CNN model.
Before using word2vec I used the Bag of Words (BOW) model, which generated a sparse matrix that was my independent variable (X). How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.
To get the word vector, you have to do this:
model['word_that_you_want']
You may also want to handle the KeyError that arises if the given word is not in your model's vocabulary. You might also want to read about embedding layers, which are usually used as the first layer of a neural network (at least in NLP) and are basically a lookup from a word to its corresponding word vector.
To get the word vectors for an entire sentence, you first need to initialize a numpy array of zeros with the dimensions you want.
You might need other variables, such as the length of the longest sentence, so that you can pad all sentences to that length. The documentation of Keras' pad_sequences method is here.
A simple starting point is to allocate the embedding matrix:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over the indices of embedding_matrix and fill it in wherever you find a word vector in your model.
I use this resource which has a lot of examples and I have referenced some of the code there (which I have also used myself sometimes):
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    try:
        embedding_vector = model[word]   # your w2v model; raises KeyError for unknown words
        embedding_matrix[index] = embedding_vector
    except KeyError:
        pass                             # row stays all-zeros for out-of-vocabulary words
And in your model (I'm assuming Tensorflow with Keras)
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
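To tie it together, here is a rough sketch of how the padded index sequences become your X and how the frozen embedding layer feeds a small CNN (word_tokenizer, cleaned_dataset, length_long_sentence and embedding_matrix are assumed to come from the snippets above; the CNN shape is just one reasonable choice):

# Sketch: turn the reviews into padded index sequences (this is your X)
# and feed them through the frozen word2vec Embedding layer into a CNN.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

sequences = word_tokenizer.texts_to_sequences(cleaned_dataset)   # list of word-index lists
X = pad_sequences(sequences, maxlen=length_long_sentence)        # shape: (n_reviews, length_long_sentence)

cnn = Sequential()
cnn.add(Embedding(vocab_length, 100, weights=[embedding_matrix],
                  input_length=length_long_sentence, trainable=False))
cnn.add(Conv1D(128, 5, activation='relu'))
cnn.add(GlobalMaxPooling1D())
cnn.add(Dense(1, activation='sigmoid'))                          # Sentiment is 0/1
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# cnn.fit(X, y, epochs=5, batch_size=64, validation_split=0.1)   # y = your Sentiment column as a 0/1 array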
I hope this helps.
Word2Vec doesn't inherently create vectors for a text (a set of words), just for individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
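As a sketch of how that might look over a whole corpus (tokenized_reviews and the fallback for empty reviews are assumptions, not part of gensim's API):

# Sketch: one averaged vector per review, skipping out-of-vocabulary words.
import numpy as np

def avg_vector(words, model, size=100):
    in_vocab = [w for w in words if w in model.wv]   # filter to avoid KeyError
    if not in_vocab:
        return np.zeros(size)                        # fallback for reviews with no known words
    return model.wv[in_vocab].mean(axis=0)

X = np.vstack([avg_vector(review, model) for review in tokenized_reviews])
# X now has shape (n_reviews, 100) and can be fed to your downstream classifier.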
I am trying to build two neural networks for classification, one binary and the second multi-class. I am trying to use torch.nn.CrossEntropyLoss() as the loss function, but when I try to train my first neural network I get the following error:
multi-target not supported at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THNN/generic/ClassNLLCriterion.c:22
From my analysis, I found that my dataset has two issues that may be causing the error.
My dataset is one-hot encoded. I used one-hot encoding to pre-process my dataset. The first target variable, Y_binary, has shape torch.Size([125973, 1]) and is full of 0s and 1s indicating the classes 'No' and 'Yes'.
My data may have the wrong dimensions. I found that I can't use a simple vector with the cross-entropy loss function. Some people used the following code to reshape their target vector before feeding it to the loss function:
out = out.permute(0, 2, 3, 1).contiguous().view(-1, class_number)
But I didn't really understand the reasoning behind this code. It seems to me that I need to keep track of the following variables: Class_Number, Batch_size, Dimension_Output. For my code, here are the dimensions:
X_train.shape: (125973, 122)
Y_train2.shape: (125973, 1)
batch_size = 64
K = len(set(Y_train2)) # Binary classification For multi class classification use K = len(set(Y_train5))
Should the target value be one-hot encoded? If not, how can I feed a nominal feature to the loss function?
If I need to reshape the output, can you help me do this for my code?
I am trying to use this loss function for both my neural networks.
Thank you in advance,
The error comes from how torch.nn.CrossEntropyLoss() expects its targets: it is meant for predicting 1 class out of N classes and wants a 1-D tensor of class indices, not a one-hot (multi-target) tensor. If you want to keep one-hot targets, or do multi-label classification, use torch.nn.BCEWithLogitsLoss() instead, which combines a Sigmoid layer and the BCELoss in one single class.
In the multi-class case, if you use Sigmoid + BCELoss, the target needs to be one-hot encoded, i.e. something like this per sample: [0 1 0 0 0 1 0 0 1 0], where the 1s are at the locations of the classes present.
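A short sketch of both target formats (tensor names and shapes are made up to mirror your batch size, with K = 2):

# Sketch: the two common setups for loss/target format in PyTorch.
import torch
import torch.nn as nn

logits = torch.randn(64, 2)                      # raw model outputs: (batch_size, K)

# Option 1: CrossEntropyLoss expects integer class indices of shape (batch_size,)
targets_idx = torch.randint(0, 2, (64, 1))       # e.g. your Y of shape (batch, 1)
ce_loss = nn.CrossEntropyLoss()(logits, targets_idx.squeeze(1).long())

# Option 2: BCEWithLogitsLoss expects float one-hot / multi-label targets of shape (batch_size, K)
targets_onehot = nn.functional.one_hot(targets_idx.squeeze(1), num_classes=2).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, targets_onehot)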
I am trying to implement the type of character level embeddings described in this paper in Keras. The character embeddings are calculated using a bidirectional LSTM.
To recreate this, I've first created a matrix containing, for each word, the indices of the characters making up the word:
char2ind = {char: index for index, char in enumerate(chars)}
max_word_len = max([len(word) for sentence in sentences for word in sentence])
X_char = []
for sentence in X:
    for word in sentence:
        word_chars = []
        for character in word:
            word_chars.append(char2ind[character])
        X_char.append(word_chars)
X_char = sequence.pad_sequences(X_char, maxlen=max_word_len)
I then define a BiLSTM model with an embedding layer for the word-character matrix. I assume the input_dimension will have to be equal to the number of characters. I want a size of 64 for my character embeddings, so I set the hidden size of the BiLSTM to 32:
hidden_size = 32   # 32 units per direction -> 64-dimensional output after concatenation
char_lstm = Sequential()
char_lstm.add(Embedding(len(char2ind) + 1, 64))
char_lstm.add(Bidirectional(LSTM(hidden_size, return_sequences=True)))
And this is where I get confused. How can I retrieve the embeddings from the model? I'm guessing I would have to compile the model and fit it then retrieve the weights to get the embeddings, but what parameters should I use to fit it ?
Additional details:
This is for an NER task, so the dataset technically could be anything in the word - label format, although I am specifically working with the WikiGold CoNLL corpus available here: https://github.com/pritishuplavikar/Resume-NER/blob/master/wikigold.conll.txt
The expected output from the network are the labels (I-MISC, O, I-PER...)
I expect the dataset to be large enough to train character embeddings directly from it. All words are coded as the indices of their constituent characters; the alphabet size is roughly 200 characters. The words are padded / cut to 20 characters. There are around 30 000 different words in the dataset.
I hope to be able to learn an embedding for each character based on the information from the different words. Then, as in the paper, I would concatenate the character embeddings with the word's GloVe embedding before feeding it into a Bi-LSTM network with a final CRF layer.
I would also like to be able to save the embeddings so I can reuse them for other similar NLP tasks.
Generally speaking, Keras' approach to building models (even seemingly complex ones) is dead simple. For example, the kind of model you want to build would simply look like this (note this is for a binary classification problem):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(max_features, out_dims, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
This is no different from a plain vanilla NN, except that it has Embedding and Bidirectional layers in place of Dense layers. This is one of the things that makes Keras amazing.
Usually it's helpful to look for a working example (Keras has loads) that does more or less the same thing you are trying to do. In this case you could first look at this model and then "reverse engineer" its workings to answer your questions. Usually things come down to formatting the data in the right way, and a working example helps a lot because you can carefully investigate the data format it is using.
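To answer the "how do I retrieve the embeddings" part directly: once the model (with some task head on top of your char_lstm layers, e.g. a softmax over the NER labels) has been compiled and fitted, the character embeddings are simply the weights of the Embedding layer. A minimal sketch, assuming the Embedding layer is the first layer and the file name is arbitrary:

# Sketch: after training, pull the learned character embeddings out of the
# Embedding layer and save them for reuse in other tasks.
import numpy as np

char_embeddings = char_lstm.layers[0].get_weights()[0]   # shape: (len(char2ind) + 1, 64)
np.save('char_embeddings.npy', char_embeddings)

# Later, in another model:
# char_embeddings = np.load('char_embeddings.npy')
# Embedding(char_embeddings.shape[0], char_embeddings.shape[1],
#           weights=[char_embeddings], trainable=False)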
I would like to train a model to generate text, similar to this blog post.
As far as I understand it, this model uses the following architecture:
[Sequence of Word Indices] -> [Embedding] -> [LSTM] -> [1 Hot Encoded "next word"]
Basically, the author models the process as a classification problem, where the output layer has as many dimensions as there are words in the corpus.
I would like to model the process as a regression problem instead, by re-using the learned embeddings and then minimising the distance between the predicted and the real embedding.
Basically:
[Sequence of Word Indices] -> [Embedding] -> [LSTM] -> [Embedding-Vector of the "next word"]
My problem is: since the model is learning the embeddings on the fly, how can I feed the output in the same way I feed the input (as word indices), and then just tell the model "but before you use the output, replace it by its embedding vector"?
Thank you very much for all help :-)
In the training phase:
You can use two inputs (one for the target, one for the input; there is an offset of 1 between the two sequences) and reuse the embedding layer.
If your input sentence is [1, 2, 3, 4], you can generate two sequences from it: in = [1, 2, 3], out = [2, 3, 4]. Then you can use Keras' functional API to reuse the embedding layer:
emb1 = Embedding(in)
emb2 = Embedding(out)
predict_emb = LSTM(emb1)
loss = mean_squared_error(emb2, predict_emb)
Note it's not Keras code, just pseudo code.
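In actual Keras (functional API) that pseudo code might look roughly like the sketch below. The all-zero dummy targets with an MSE loss are just one way of expressing "minimise the distance between predicted and target embeddings"; all sizes and variable names are made up, and if you want to reuse embeddings learned elsewhere you could load them into shared_emb and set trainable=False.

# Sketch: share one Embedding layer between input and target ids, predict the
# next word's embedding, and train by driving the difference to zero.
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Subtract
from tensorflow.keras.models import Model

seq_len, vocab_size, emb_dim = 20, 10000, 100        # made-up sizes

shared_emb = Embedding(vocab_size, emb_dim)          # reused for both input and target ids
in_ids = Input(shape=(seq_len,), dtype='int32')      # e.g. [1, 2, 3]
out_ids = Input(shape=(seq_len,), dtype='int32')     # e.g. [2, 3, 4]

pred_emb = LSTM(emb_dim, return_sequences=True)(shared_emb(in_ids))
target_emb = shared_emb(out_ids)

# Output the difference and fit it against all-zero dummy targets with an MSE
# loss; this minimises the distance between predicted and target embeddings.
diff = Subtract()([pred_emb, target_emb])
model = Model([in_ids, out_ids], diff)
model.compile(optimizer='adam', loss='mse')
# model.fit([X_in, X_out], np.zeros((n_samples, seq_len, emb_dim)), ...)

# For generation you also want a model that maps ids -> predicted embeddings:
infer_model = Model(in_ids, pred_emb)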
In the testing phase:
Typically, you'll need to write your own decode function. First, you choose a word (or a few words) to start from. Then you feed this word (or short word sequence) to the network to predict the next word's embedding. At this step you can define your own sampling function: you may want to choose the word whose embedding is nearest to the predicted one as the next word, or you may want to sample the next word from a distribution in which words whose embeddings are nearer to the predicted embedding have a larger probability of being chosen. Once you choose the next word, you feed it to the network, predict the one after that, and so forth.
So you need to generate one word (put another way, one embedding) at a time, rather than inputting a whole sequence to the network.
If the above statements are too abstract for you, here's a good example: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py
Line 85 is the introduction part, which randomly chooses a small piece of text from the corpus to work on. From line 90 on there's a loop in which each step samples a character (this is a char-RNN, so each timestep inputs a char; in your case it should be a word, not a char): line 95 predicts the next char's distribution and line 96 samples from it. Hope this is clear enough.
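Adapted to the word/embedding setting, a greedy decode loop might look roughly like this sketch (it assumes the infer_model and shared_emb from the training sketch above, and picking the nearest embedding is only one of the sampling strategies mentioned):

# Sketch of a greedy word-level decode loop; all names are assumptions.
import numpy as np

def decode(infer_model, embedding_matrix, seed_ids, n_words, seq_len):
    """embedding_matrix = shared_emb.get_weights()[0], shape (vocab_size, emb_dim)."""
    generated = list(seed_ids)                                # seed of at least seq_len word ids
    for _ in range(n_words):
        window = np.array(generated[-seq_len:])[None, :]      # last seq_len ids -> (1, seq_len)
        pred = infer_model.predict(window, verbose=0)[0, -1]  # predicted embedding of the next word
        dists = np.linalg.norm(embedding_matrix - pred, axis=1)
        generated.append(int(dists.argmin()))                 # nearest-embedding word (greedy choice)
    return generated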