Use word2vec in tokenized sentences - python

I am trying to create an emotion recognition model using an SVM. I have a big dataset of sentences, each one labeled with an emotion. After text pre-processing, I have a pandas data frame containing the tokenized sentences, as shown in [1].
My objective is to turn all these tokenized sentences into word embeddings so that I can train models such as an SVM. The problem is how to use this data frame as input to word2vec or any other word embedding model.

You need one vector per input instance if you want to use SVM. This means that you need to get embeddings for the words and do some operation, typically pooling, that will shrink the sequence of word embeddings into a single vector.
The most frequently used methods are mean-pooling and max-pooling, simply taking the average or the maximum of the embeddings.
Assuming your pandas data frame is in the variable data and you have the word embeddings in a dictionary embedding_table with string keys and NumPy array values, you can do something like this (mean pooling), provided that at least one word in each sentence is covered by the word embeddings:
import numpy as np
def embed(word_sequence):
    # Collect the vectors of the words covered by the embedding table
    embeddings = []
    for word in word_sequence:
        if word in embedding_table:
            embeddings.append(embedding_table[word])
    # Average over the word axis to get one vector per sentence
    return np.mean(embeddings, axis=0)
data["vector"] = data.Utterance.map(embed)
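As a rough sketch of both pooling variants mentioned above, the following uses a tiny made-up embedding table (the words, values and the 3-dimensional size are assumptions for illustration only). It also returns a zero vector when no word is covered, which is one possible fallback and not something the answer prescribes.

```python
import numpy as np

# Toy embedding table with made-up values; real vectors would come
# from word2vec, GloVe or a similar model.
embedding_table = {
    "happy": np.array([1.0, 0.0, 2.0]),
    "sad": np.array([-1.0, 4.0, 0.0]),
}
EMB_DIM = 3

def pool(word_sequence, mode="mean"):
    """Pool the embeddings of the covered words into a single vector."""
    vectors = [embedding_table[w] for w in word_sequence if w in embedding_table]
    if not vectors:
        # No word is covered: fall back to a zero vector instead of NaN.
        return np.zeros(EMB_DIM)
    stacked = np.stack(vectors)
    return stacked.mean(axis=0) if mode == "mean" else stacked.max(axis=0)

mean_vec = pool(["happy", "unknown", "sad"])
max_vec = pool(["happy", "unknown", "sad"], mode="max")
empty_vec = pool(["unknown"])
```

The resulting column of fixed-size vectors can then be stacked into a matrix and passed to an SVM implementation.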

Related

How to get cosine similarity of word embedding from BERT model

I am interested in how to get the similarity of word embeddings in different sentences from a BERT model (in other words, the same word can have a different meaning in a different context).
For example:
sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'
I want to get the value of cos(New York, New York) between sent1 and sent2: the phrase 'New York' is the same, but it appears in different sentences. I got some intuition from https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2
But I still do not know which layer's embedding I need to extract and how to calculate the cosine similarity for my example above.
Thanks in advance for any suggestions!
Okay let's do this.
First you need to understand that bert-base exposes 13 sets of hidden states: the output of the embedding layer plus the outputs of its 12 transformer layers. The first one is basically just the input embedding, before any contextualization. You can use it, but you probably don't want to, since that's essentially a static embedding and you're after a dynamic (contextual) embedding. For simplicity I'm going to use only the last hidden layer of BERT.
Here you're using two words: "New" and "York". You could treat this as one during preprocessing and combine it into "New-York" or something if you really wanted. In this case I'm going to treat it as two separate words and average the embedding that BERT produces.
This can be described in a few steps:
Tokenize the inputs
Determine where the tokenizer has word_ids for New and York (suuuuper important)
Pass through BERT
Average
Cosine similarity
First, what you need to import:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
Now we can create our tokenizer and our model:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()
Make sure to use the model in evaluation mode unless you're trying to fine tune!
Next we need to tokenize (step 1):
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
Step 2. Determine which token positions correspond to the words we care about
# These are the word indices of "New" and "York" in sent1 and sent2
sent1_idxs = [4, 5]
sent2_idxs = [0, 1]
tok1_ids = [np.where(np.array(tok1.word_ids()) == idx) for idx in sent1_idxs]
tok2_ids = [np.where(np.array(tok2.word_ids()) == idx) for idx in sent2_idxs]
The above code checks where the word_ids() produced by the tokenizer overlap the word indices from the original sentence. This is necessary because the tokenizer splits rare words. So if you have something like "aardvark", when you tokenize it and look at it you actually get this:
In [90]: tokenizer.convert_ids_to_tokens( tokenizer('aardvark').input_ids)
Out[90]: ['[CLS]', 'a', '##ard', '##var', '##k', '[SEP]']
In [91]: tokenizer('aardvark').word_ids()
Out[91]: [None, 0, 0, 0, 0, None]
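To make the mapping concrete, here is a small sketch that takes a word_ids() list (copied from the 'aardvark' session above) and returns every token position that belongs to one word. The helper name token_positions is made up for this illustration; it needs no tokenizer to run.

```python
def token_positions(word_ids, word_index):
    """Return every token position that belongs to the given word index."""
    return [pos for pos, wid in enumerate(word_ids) if wid == word_index]

# word_ids() output for 'aardvark', copied from the session above.
# None marks the special tokens [CLS] and [SEP].
aardvark_word_ids = [None, 0, 0, 0, 0, None]
positions = token_positions(aardvark_word_ids, 0)
```

Note that the selection in step 3 keeps only the first sub-token of each word; averaging over all positions returned by a helper like this one would be an alternative when a word is split into several pieces.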
Step 3. Pass through BERT
Now we grab the embeddings that BERT produces across the token ids that we've produced:
with torch.no_grad():
    out1 = model(**tok1)
    out2 = model(**tok2)
# Only grab the last hidden state
states1 = out1.hidden_states[-1].squeeze()
states2 = out2.hidden_states[-1].squeeze()
# Select the tokens that we're after corresponding to "New" and "York"
embs1 = states1[[tup[0][0] for tup in tok1_ids]]
embs2 = states2[[tup[0][0] for tup in tok2_ids]]
Now you will have two embeddings. Each has shape (2, 768). The first dimension is 2 because we're looking at two words: "New" and "York". The second dimension is the embedding size of BERT.
Step 4. Average
Okay, so this isn't necessarily what you want to do but it's going to depend on how you treat these embeddings. What we have is two (2, 768) shaped embeddings. You can either compare New to New and York to York or you can combine New York into an average. I'll just do that but you can easily do the other one if it works better for your task.
avg1 = embs1.mean(axis=0)
avg2 = embs2.mean(axis=0)
Step 5. Cosine sim
Cosine similarity is pretty easy using torch:
torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1))
# tensor([0.6440])
This is good! They point in the same direction. They're not exactly 1 but that can be improved in several ways.
You can fine tune on a training set
You can experiment with averaging different layers rather than just the last hidden layer like I did
You can try to be creative in combining New and York. I took the average but maybe there's a better way for your exact needs.
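As a sketch of the other option mentioned in step 4 (comparing New to New and York to York instead of averaging), the row-wise cosine similarity can be written out by hand. The arrays below are toy stand-ins for embs1 and embs2 with 3 dimensions instead of 768, and the values are made up.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (n_words, dim) arrays."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Toy stand-ins for embs1 and embs2 (2 words, 3 dimensions).
embs1 = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
embs2 = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
per_word = rowwise_cosine(embs1, embs2)
```

With real BERT outputs you would first convert the tensors to NumPy, or do the same computation with torch operations.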

Spacy - number of lemma

I'm using spaCy to replace every word in a sentence with a number/code; afterwards I use the resulting vector as the input of a recurrent neural network.
import spacy
str="basing based base"
sp = spacy.load('en_core_web_sm')
sentence=sp(str)
for w in sentence:
    print(w.text, w.lemma)
In the first layer of the neural network in Keras, the Embedding layer, I have to know the maximum number of words in the lookup table. Does anyone know this number?
Thank you
The lemma indices are in fact hashes, so there is not a continuous row of indices from 0 to the number of vocabulary entries. Even sp.vocab.strings["randomnonwordstring#"] gives you an integer.
For the entry "base", the ID is 4715552063986449646 in sp.vocab (note that it is a shared vocab for both forms and lemmas). You would never fit that many embeddings in memory.
The correct solution is creating a dictionary transforming words into indices based on what you have in your training data.
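A minimal sketch of such a dictionary, assuming the training data is already available as lists of lemma strings (for example collected from w.lemma_, the string form of the lemma). The reserved `<pad>` and `<unk>` entries and the helper names are choices made for this illustration.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=1):
    """Map each token seen in training to a small consecutive index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved indices (an assumption)
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace tokens by indices; unseen tokens map to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence]

train = [["base", "jump"], ["base", "camp"]]
vocab = build_vocab(train)
ids = encode(["base", "camp", "tent"], vocab)
vocab_size = len(vocab)
```

Here vocab_size is the number you would pass as input_dim to the Keras Embedding layer.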

How to add new embeddings for unknown words in Tensorflow (training & pre-set for testing)

I am curious as to how I can add a normal-randomized 300 dimension vector (elements' type = tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases, I realize I encounter unknown words, and I want to create a normal-randomized word vector for this new found unknown word.
The problem is that with my current set up, I use tf.contrib.lookup.index_table_from_tensor to convert from words to integers based on the known vocabulary. This function can create new tokens and hash them for some predefined number of out of vocabulary words, but my embed will not contain an embedding for this new unknown hash value. I am uncertain if I can simply append a randomized embedding to the end of the embed list.
I would also like to do this in an efficient way, so a pre-built TensorFlow function or a method built from TensorFlow functions would probably be the most efficient. I define pre-known special tokens such as an end-of-sentence token and a default unknown token as the empty string ("" at index 0), but this is limited in its power to learn for various different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.
I would like to be able to add new random 300d vectors for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that is possibly encountered during testing. What is the most efficient way of doing this?
def embed_tensor(string_tensor, trainable=True):
    """
    Convert List of strings into list of indices then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    vocab, embed = load_pretrained_glove()
    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=0)
    string_tensor = vocab_lookup.lookup(string_tensor)
    # define the word embedding
    embedding_init = tf.Variable(tf.constant(np.asarray(embed),
                                             dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")
    # return the word embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)
The code example below adapts your embed_tensor function such that words are embedded as follows:
For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np
EMB_DIM = 300
def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)
def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]
def embed_tensor(string_tensor, trainable=True):
    """
    Convert List of strings into list of indices then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    pretrained_vocab, pretrained_embs = load_pretrained_glove()
    train_vocab = get_train_vocab()
    only_in_train = list(set(train_vocab) - set(pretrained_vocab))
    vocab = pretrained_vocab + only_in_train
    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=len(vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)
    # define the word embedding
    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    train_embeddings = tf.get_variable(
        name="embs_only_in_train",
        shape=[len(only_in_train), EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)
    embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)
    return tf.nn.embedding_lookup(embeddings, string_tensor)
FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and make the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
I have never tried it, but here is a possible way using the same machinery as your code. I will think about it more later.
The index_table_from_tensor method accepts a num_oov_buckets parameter that hashes all your OOV words into a predefined number of buckets.
If you set this parameter to a large enough value, you will see your data spread among these buckets (each bucket has an ID greater than the ID of the last in-vocabulary word).
So, if
(at each lookup) you set (i.e. assign) the last rows (those corresponding to the buckets) of your embedding_init Variable to random values, and
you make num_oov_buckets large enough that collisions are minimized,
you can obtain a behavior that is (an approximation of) what you are asking for in a very efficient way.
The random behavior can be justified by a theory similar to that of hash tables: if the number of buckets is large enough, the hashing of the strings will assign each OOV word to a different bucket with high probability (i.e. minimizing collisions in the same bucket). Since you are assigning a different random vector to each bucket, you obtain an (almost) distinct mapping for each OOV word.
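The bucket idea can be illustrated in plain Python. This is only a sketch of the principle: it uses zlib.crc32 as a stand-in hash, which is an assumption and not the hash function TensorFlow uses internally, and the lookup helper is hypothetical.

```python
import zlib

vocab = ["a", "cat", "sat", "on", "the", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
NUM_OOV_BUCKETS = 1000  # larger values make collisions less likely

def lookup(word):
    """In-vocabulary words keep their index; OOV words hash into a bucket."""
    if word in word_to_id:
        return word_to_id[word]
    # Stand-in hash; bucket IDs start right after the last vocabulary ID.
    bucket = zlib.crc32(word.encode("utf-8")) % NUM_OOV_BUCKETS
    return len(vocab) + bucket

cat_id = lookup("cat")
dog_id = lookup("dog")
```

The embedding matrix then needs len(vocab) + NUM_OOV_BUCKETS rows, with the bucket rows initialized randomly.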
An idea I had for this was to add the new words to the pre-trained embedding by adding a new dimension for each new word (basically maintaining their one-hot nature).
Assuming the number of new words is small but they're important, you could for instance increase the dimensions of your embedded results from 300 to 300 + (number of new words), where each new word would get all zeros except for a 1 in its own dimension.
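A small sketch of this idea, with a 4-dimensional toy embedding instead of 300 and made-up word lists. How words that are neither pretrained nor in the new-word list should be handled is not specified above; here they simply get all zeros, which is an assumption.

```python
import numpy as np

EMB_DIM = 4  # 300 in the question; kept small for illustration
pretrained = {"cat": np.ones(EMB_DIM)}  # toy pretrained table
new_words = ["dog", "axolotl"]          # hypothetical new words

def extended_vector(word):
    """Return a vector of size EMB_DIM + len(new_words)."""
    base = pretrained.get(word, np.zeros(EMB_DIM))
    extra = np.zeros(len(new_words))
    if word not in pretrained and word in new_words:
        # One-hot marker in the dimension reserved for this new word.
        extra[new_words.index(word)] = 1.0
    return np.concatenate([base, extra])

cat_vec = extended_vector("cat")
dog_vec = extended_vector("dog")
```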

Best match for input query from a set of documents

I have 8 documents and I ran TF-IDF on them to get an array. I don't understand how to find out which document is the best match for a given input query.
all_documents = [doc1, doc2, ...., doc7]
sklearn_tfidf = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents).toarray()
Transform the input to tf-idf format using TfidfVectorizer. You can then use a distance metric (cosine, euclidean, manhattan, ...) to calculate the document that is closest to your input.
Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The sklearn_tfidf object that you created has an attribute vocabulary_ that contains all words that are used in the vectors. Your input query should be reduced to only containing those words.
Example
Document1: dogs are cute
Document2: cats are awful
Leads to a vocabulary of [dogs, cats, are, cute, awful]. Words in a query other than these 5 cannot be used. For example, if your query is cute animals, the word animals carries no meaning, because it cannot be found in any of the documents. The query thus reduces to the following vector: [0,0,0,1,0], since cute is the only word that can be found in the documents.
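The example above can be reproduced in a few lines of plain Python. This sketch uses raw term counts instead of tf-idf weights to keep it short, so the scores are only illustrative; with the fitted vectorizer from the question you would call sklearn_tfidf.transform([query]) to get the query vector.

```python
import math

documents = ["dogs are cute", "cats are awful"]
vocabulary = ["dogs", "cats", "are", "cute", "awful"]

def to_vector(text):
    """Count vector over the fixed vocabulary; other words are dropped."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = to_vector("cute animals")
scores = [cosine(query_vec, to_vector(doc)) for doc in documents]
best = max(range(len(documents)), key=scores.__getitem__)
```

Here best is the index of the document with the highest cosine similarity to the query.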

Python - tf-idf predict a new document similarity

Inspired by this answer, I'm trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector and not a new query
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea.
How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
This problem can be partially addressed by combining the vector space model (which is tf-idf and cosine similarity) with the boolean model. These are concepts from information retrieval, and they are used (and nicely explained) in Elasticsearch, a pretty good search engine.
The idea is simple: you store your documents as an inverted index. This is comparable to the index at the end of a book, where each word holds a reference to the pages (documents) it is mentioned in.
Instead of calculating the similarity against all documents, you only calculate it for documents that have at least one word (or some specified threshold of words) in common with the query. This can be done simply by looping over the words in the queried document, finding the documents that also contain each word using the inverted index, and calculating the similarity for those.
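A minimal sketch of the candidate-selection step with an inverted index, using three made-up documents. The function name candidates and the min_shared threshold parameter are assumptions for illustration; the similarity computation itself is left out.

```python
from collections import defaultdict

docs = {
    0: "human computer interaction",
    1: "graph minors survey",
    2: "computer system response time",
}

# Build the inverted index: term -> set of document IDs containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in set(text.split()):
        inverted_index[term].add(doc_id)

def candidates(query, min_shared=1):
    """Return IDs of documents sharing at least min_shared terms with the query."""
    hits = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in inverted_index.get(term, ()):
            hits[doc_id] += 1
    return sorted(d for d, n in hits.items() if n >= min_shared)

loose = candidates("computer time")
strict = candidates("computer time", min_shared=2)
```

Only the documents returned here need to be scored with cosine similarity.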
You should take a look at gensim. Example starting code looks like this:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
corpus = [dictionary.doc2bow(line.lower().split()) for line in open('corpus.txt')]
tfidf = models.TfidfModel(corpus)
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
At prediction time you first get the vector for the new doc:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]
Then get the similarities (sorted by most similar):
sims = index[vec_tfidf] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
This does a linear scan like you wanted to do but they have a more optimized implementation. If the speed is not enough then you can look into approximate similarity search (Annoy, Falconn, NMSLIB).
For huge data sets, there is a solution called text clustering by concept. Search engines use this technique.
In the first step, you cluster your documents into groups (e.g. 50 clusters); each cluster has a representative document (one that contains words carrying useful information about its cluster).
In the second step, to calculate the cosine similarity between the new document and your data set, you loop through all the representatives (50 of them) and find the nearest ones (e.g. 2 representatives).
In the final step, you loop through all documents in the selected clusters and find the one with the highest cosine similarity.
With this technique, you can reduce the number of loops and improve performance.
You can read about more techniques in some chapters of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
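The three steps can be sketched with NumPy as follows. The document vectors and cluster assignments are made up, and the cluster centroid is used as the representative, which is one possible choice and an assumption here; in practice the assignments would come from a clustering algorithm.

```python
import numpy as np

def normalize(m):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Toy document vectors and hypothetical cluster assignments.
doc_vectors = normalize(np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]))
labels = np.array([0, 0, 1, 1])
centroids = normalize(
    np.stack([doc_vectors[labels == c].mean(axis=0) for c in range(2)])
)

def search(query, top_clusters=1):
    """Return the index of the most similar document among the nearest clusters."""
    q = normalize(np.asarray(query, dtype=float))
    # Step 2: pick the clusters whose representatives are closest to the query.
    nearest = np.argsort(centroids @ q)[::-1][:top_clusters]
    # Step 3: compare only against documents in those clusters.
    candidate_ids = np.flatnonzero(np.isin(labels, nearest))
    sims = doc_vectors[candidate_ids] @ q
    return int(candidate_ids[np.argmax(sims)])

first = search([1.0, 0.0])
second = search([0.0, 1.0])
```

The result is approximate: a relevant document in a cluster that was not selected will be missed, which is the price of scanning fewer documents.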
