I am trying to extract some word vectors from a transformer-based model. The steps are:
Run the text through the pipeline using nlp().
Break text into sentences (to be separately classified into binary categories).
Save sentence vectors to a list (to be pickled and used as model inputs once classification has occurred).
The actual model runs quickly enough, but (unlike with the skip-gram vectors) the simple act of extracting the vectors from the spacy.tokens.span.Span object and saving them to a list or array is very slow.
Here is a reproducible example:
import spacy
import requests
import numpy as np
nlp = spacy.load('en_stsb_roberta_large') # from https://github.com/MartinoMensio/spacy-sentence-bert
# Get some random text
def get_text(url="https://raw.githubusercontent.com/erikthiem/pride-and-prejudice/master/starwars4.txt",
             num_chars=10_000):
    response = requests.get(url)
    text = response.content.decode().replace("\n", "")
    text = text[0:num_chars]
    return text
Running the actual nlp() command takes very little time:
# This takes 0.2 seconds to run
text = get_text()
doc = nlp(text)
sents = list(doc.sents) # doc.sents is a generator
However, if I want to save the vectors from the first sentence (36 words) to a list, it is extremely slow:
# Get vectors - takes 5 seconds
example_sent = sents[0]
sent_vectors = [word.vector for word in example_sent]
Obviously this does not scale well.
I do not understand why it is so slow: running type(example_sent[0].vector) returns numpy.ndarray, not a generator, so the vectors appear to have already been computed.
Why does moving the vectors from a span to a list take so long? The arrays have dimension (1024,) per word, rather than (300,) for the skip-gram vectors, which copy almost instantly; that is the only difference as far as I can tell.
I have also tried writing into a pre-allocated empty array of the required dimensions (rather than a list), but that makes no difference.
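For reference, a minimal timing sketch (assuming example_sent from the snippet above; time.perf_counter is only used for measurement) separates the cost of a single .vector access from the cost of stacking the whole sentence:
import time

start = time.perf_counter()
_ = example_sent[0].vector  # a single .vector access
print("single token:", time.perf_counter() - start)

start = time.perf_counter()
sent_matrix = np.vstack([word.vector for word in example_sent])  # all tokens stacked into an (n_words, 1024) array
print("whole sentence:", time.perf_counter() - start)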
Any ideas?
Related
I am trying to achieve something similar to the product-similarity calculation used in this example: how-to-build-recommendation-system-word2vec-python/
I have a dictionary where the key is the item_id and the value is the product associated with it. For example: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)
The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when calling the similarity function on the model using the key present in the dictionary.
# train word2vec model
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
def similar_products(v, n=6):  # similarity function
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn=n+1)[1:]
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
I am calling the function using:
similar_products(model['100018'])
Note: I was able to run the example code with a very similar input data structure, which was also a dictionary. Can someone tell me what I am missing here?
If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.
If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.
You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of words. You should enable logging during training, and watch the output to confirm the expected amount of word-discovery and training is happening. You can also look at the exact list-of-words known-to-the-model (in model.wv.key_to_index) to make sure it has all the words you expect.
One common gotcha is that, by default, the Word2Vec class uses min_count=5 for the best operation of the word2vec algorithm. (Word2vec only works well with multiple varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector, and may even make other surrounding words' vectors worse. So the usual best practice is to discard very-rare words.)
Is the (pseudo-)word '100018' in your corpus less than 5 times? If so, the model will ignore it as a word too-rare to get a good vector, or have any positive influence on other word-vectors.
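A quick check along these lines (a sketch, reusing the model and purchases_train names from your code, and assuming a gensim 4.x model so the vocabulary lives in model.wv.key_to_index) shows whether the key survived the min_count cutoff and how often it actually appears:
# is the key in the trained vocabulary at all?
print('100018' in model.wv.key_to_index)

# how many times does it appear in the training corpus?
# (each item of purchases_train is assumed to be a list of product-id "words")
print(sum(basket.count('100018') for basket in purchases_train))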
Separately, the site you're using example code from may not be a quality source of example code. It has changed a bunch of default values for no good reason, such as changing the alpha and min_alpha values to peculiar non-standard values with no comment as to why. This is usually a signal that someone who doesn't know what they're doing is copying the odd choices of someone else who didn't know what they were doing.
I am using gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (I have explained what my problem is in the code):
cores = multiprocessing.cpu_count()

# creating a list of tagged documents
training_docs = []

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
    # 'doc' is in unicode format and I have already preprocessed it
    training_docs.append(TaggedDocument(doc.split(), str(index+1)))

# at this point, I have 53 strings in my 'training_docs' list

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)

# when I print the vectors, I only have 10 vectors, while I should have 53 vectors for the 53 documents in my training_docs list.
print(len(model.docvecs))
# output: 10
I am just wondering whether I am making a mistake or whether there is another parameter I should set?
UPDATE: I was playing with the tags parameter in TaggedDocument, and when I changed it to a mixture of text and numbers, like Doc1, Doc2, ..., I see a different count of generated vectors, but I still do not get the expected number of feature vectors.
Look at the actual tags it has discovered in your corpus:
print(model.docvecs.offset2doctag)
Do you see a pattern?
The tags property of each document should be a list of tags, not a single tag. If you supply a simple string-of-an-integer, it will see it as a list-of-digits, and thus only learn the tags '0', '1', ..., '9'.
You could replace str(index+1) with [str(index+1)] and get the behavior you were expecting.
But since your document IDs are just ascending integers, you can also just use plain Python ints as your doctags. This will save some memory by avoiding the creation of a lookup dict from string-tag to array-slot (int). To do this, replace str(index+1) with [index]. (This starts the doc-IDs from 0, which is a teensy bit more Pythonic, and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)
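Putting it together, a minimal sketch of the corrected tagging (keeping the same parameters and gensim API as your snippet; only the tags argument changes) would be:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import multiprocessing

cores = multiprocessing.cpu_count()

# tags is now a *list* containing one plain-int doc-ID per document
training_docs = [TaggedDocument(doc.split(), [index])
                 for index, doc in enumerate(all_docs)]

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)
print(len(model.docvecs))  # should now report 53, one vector per document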
I am curious as to how I can add a normal-randomized 300 dimension vector (elements' type = tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases, I realize I encounter unknown words, and I want to create a normal-randomized word vector for this new found unknown word.
The problem is that with my current set up, I use tf.contrib.lookup.index_table_from_tensor to convert from words to integers based on the known vocabulary. This function can create new tokens and hash them for some predefined number of out of vocabulary words, but my embed will not contain an embedding for this new unknown hash value. I am uncertain if I can simply append a randomized embedding to the end of the embed list.
I also would like to do this in an efficient way, so a pre-built TensorFlow function or a method involving TensorFlow functions would probably be most efficient. I define pre-known special tokens such as an end-of-sentence token and a default unknown as the empty string ("" at index 0), but this is limited in its power to learn for various different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.
I would like to be able to add new random 300d vectors for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that are possibly encountered during testing. What is the most efficient way of doing this?
def embed_tensor(string_tensor, trainable=True):
    """
    Convert List of strings into list of indices then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    vocab, embed = load_pretrained_glove()

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    embedding_init = tf.Variable(tf.constant(np.asarray(embed), dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)
The code example below adapts your embed_tensor function such that words are embedded as follows:
For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np

EMB_DIM = 300


def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)


def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]


def embed_tensor(string_tensor, trainable=True):
    """
    Convert List of strings into list of indices then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    pretrained_vocab, pretrained_embs = load_pretrained_glove()
    train_vocab = get_train_vocab()
    only_in_train = list(set(train_vocab) - set(pretrained_vocab))
    vocab = pretrained_vocab + only_in_train

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=len(vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    train_embeddings = tf.get_variable(
        name="embs_only_in_train",
        shape=[len(only_in_train), EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)

    embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

    return tf.nn.embedding_lookup(embeddings, string_tensor)
FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and make the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
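A rough sketch of that preprocessing step (purely illustrative: train_sentences and MIN_FREQ are made-up names, and "<unk>" is appended to the training vocabulary so it gets its own trainable row):
from collections import Counter

MIN_FREQ = 2  # hypothetical cutoff for "low frequency"
counts = Counter(w for sent in train_sentences for w in sent)

def map_rare_to_unk(tokens):
    # replace rare training words with a shared "<unk>" token
    return [w if counts[w] >= MIN_FREQ else "<unk>" for w in tokens]

train_vocab = [w for w, c in counts.items() if c >= MIN_FREQ] + ["<unk>"]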
I have never tried this, but I can suggest a possible approach that uses the same machinery as your code; I may think about it further later.
The index_table_from_tensor method accepts a num_oov_buckets parameter that shuffles all your oov words into a predefined number of buckets.
If you set this parameter to a 'large enough' value, you will see your data spread among these buckets (each bucket has an ID greater than the ID of the last in-vocabulary word).
So,
if (at each lookup) you set (i.e. assign) the last rows (those corresponding to the buckets) of your embedding_init Variable to random values, and
if you make num_oov_buckets large enough that collisions are minimized,
then you can obtain behavior that is (an approximation of) what you are asking for, in a very efficient way.
The random behavior can be justified by reasoning similar to that for hash tables: if the number of buckets is large enough, the string hashing will assign each oov word to a different bucket with high probability (i.e. minimizing collisions into the same bucket). Since you are assigning a different random value to each bucket, you obtain an (almost) distinct mapping for each oov word.
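A rough sketch of the bucket idea on top of your embed_tensor (NUM_OOV_BUCKETS and EMB_DIM are made-up constants; here the bucket rows are fixed random, non-trainable vectors rather than being re-randomized at each lookup):
NUM_OOV_BUCKETS = 1000  # "large enough" to keep collisions rare
EMB_DIM = 300           # dimensionality of the GloVe vectors in the question

vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    num_oov_buckets=NUM_OOV_BUCKETS)
ids = vocab_lookup.lookup(string_tensor)  # oov words hash to ids >= len(vocab)

# one random row per bucket, appended after the pretrained rows
oov_embeddings = tf.get_variable(
    name="oov_bucket_embeddings",
    shape=[NUM_OOV_BUCKETS, EMB_DIM],
    initializer=tf.random_uniform_initializer(-0.04, 0.04),
    trainable=False)
embeddings = tf.concat([embedding_init, oov_embeddings], axis=0)
embedded = tf.nn.embedding_lookup(embeddings, ids)  # would replace the return value of embed_tensor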
An idea I had for this was to capture the new words in the pre-trained embedding by adding a new dimension for each new word (basically preserving their one-hot nature).
Assuming the number of new words is small but they're important, you could for instance increase the dimensions of your embedded results from 300 to 300 + number of new words, where each new word would get all zeros except a 1 in its own dimension.
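As a toy sketch of that idea (new_words is a made-up list; vocab and embed are the pretrained GloVe vocabulary and vectors from the question), the extended matrix could be built like this:
new_words = ["dog", "droid"]  # hypothetical words missing from the pretrained vocab
n_new = len(new_words)
EMB_DIM = 300

# pretrained vectors padded with zeros in the new dimensions,
# plus one extra row per new word with a 1 in its own dimension
extended = np.zeros((len(vocab) + n_new, EMB_DIM + n_new), dtype=np.float32)
extended[:len(vocab), :EMB_DIM] = embed
for i, _ in enumerate(new_words):
    extended[len(vocab) + i, EMB_DIM + i] = 1.0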
I'm having issues with presenting my data in a form which sklearn will accept
My raw data is a few hundred strings, each classified into one of 5 classes. I have a list of the strings I'd like to classify and a parallel list of their respective classes. I'm using GaussianNB().
Example Data:
For such a large, successful business, I really feel like they need to be
either choosier in their employee selection or teach their employees to
better serve their customers.|||Class:4
which represents a given "feature" and its classification.
Naturally, the strings themselves have to be converted to vectors before they can be used in the classifier; I've attempted to use DictVectorizer to perform this task:
dictionaryTraining = convertListToSentence(data)
vec = DictVectorizer()
print(dictionaryTraining)
vec.fit_transform(dictionaryTraining)
However, in order to do it, I have to attach the actual classification of the data to the dictionary, otherwise I get the error 'str' object has no attribute 'items'. I understand this is because .fit_transform requires features and indices, but I don't fully understand the purpose of the indices:
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
My question is, how can I take a list of strings and a list of numbers representing their classifications, and provide these to a GaussianNB() classifier such that I can present it with a similar string in the future and it will estimate the string's class?
Since your input data are raw text and not dictionaries like {"word": number_of_occurrences, ...}, I believe you should go with a CountVectorizer, which will split your input text on whitespace and transform it into the input vectors you need.
A simple example of such a transformation would be:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

x = CountVectorizer().fit_transform(corpus)
print(x.todense())  # x holds your features. Here I am only visualizing it
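From there, a hedged continuation for the GaussianNB part of your question (labels is a made-up list of class labels, parallel to corpus) might look like:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

vec = CountVectorizer()
X = vec.fit_transform(corpus).toarray()  # GaussianNB expects a dense array
labels = [4, 2, 1, 4]  # hypothetical class labels, one per document
clf = GaussianNB().fit(X, labels)

# classify a new string using the same fitted vectorizer
print(clf.predict(vec.transform(["Is this the first document?"]).toarray()))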
I have a set of files and a query document. My purpose is to return the most similar documents by comparing each document with the query document. To use cosine similarity, I first have to map the document strings to vectors. I have also already created a tf-idf function that calculates a score for each document.
To get the index of the strings, I have a function like this:
def getvectorKeywordIndex(self, documentList):
    """ create the keyword associated to the position of the elements within the document vectors """
    # Mapped documents into a single word string
    vocabularyString = " ".join(documentList)
    vocabularylist = vocabularyString.split(' ')
    vocabularylist = list(set(vocabularylist))
    print 'vocabularylist', vocabularylist
    vectorIndex = {}
    offset = 0
    # Associate a position with the keywords which maps to the dimension on the vector used to represent this word
    for word in vocabularylist:
        vectorIndex[word] = offset
        offset += 1
    print vectorIndex
    return vectorIndex, vocabularylist  # (keyword: position), vocabularylist
and for cosine similarity my function is this:
def cosine_distance(self, index, queryDoc):
    vector1 = self.makeVector(index)
    vector2 = self.makeVector(queryDoc)
    return numpy.dot(vector1, vector2) / (math.sqrt(numpy.dot(vector1, vector1)) * math.sqrt(numpy.dot(vector2, vector2)))
TF-IDF is:
def tfidf(self, term, key):
    return (self.tf(term, key) * self.idf(term))
My problem is how I can create makeVector using the index and vocabulary list, and also use tf-idf inside that function.
Any answer is welcome.
You should pass vectorIndex to makeVector as well and use it to look up the indices for terms in documents and queries. Ignore terms that do not appear in vectorIndex.
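For instance, a makeVector sketch along those lines (assuming each document or query is passed in as a whitespace-separated string, that vectorIndex is passed in as suggested above, and that the document string itself serves as the key argument for your tfidf helper) might be:
def makeVector(self, wordString, vectorIndex):
    # tf-idf weighted vector for one document/query string;
    # terms not present in vectorIndex are simply skipped
    vector = numpy.zeros(len(vectorIndex))
    for word in wordString.split(' '):
        if word in vectorIndex:
            vector[vectorIndex[word]] = self.tfidf(word, wordString)
    return vector
The calls inside cosine_distance would then need to pass vectorIndex as well.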
Mind you, when dealing with documents you should really be using scipy.sparse matrices instead of Numpy arrays, or you'll quickly run out of memory.
(Alternatively, consider using the Vectorizer in scikit-learn which handles all of this for you, uses scipy.sparse matrices and computes tf-idf values. Disclaimer: I wrote parts of that class.)
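A quick sketch of that scikit-learn route, using the modern TfidfVectorizer (documents and query are placeholders here):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat", "a dog barked at the mailman"]  # placeholder corpus
query = "cat on a mat"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(documents)        # scipy.sparse tf-idf matrix
query_vec = vec.transform([query])
print(cosine_similarity(query_vec, doc_matrix))  # similarity of the query to each document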