I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I could import my own word2vec embeddings into this model, rather than using the original 'dense representation' mentioned in the tutorial.
From my point of view, it looks like TensorFlow uses a one-hot representation for the seq2seq model. Firstly, for the function tf.nn.seq2seq.embedding_attention_seq2seq the encoder's input is a tokenized symbol, e.g. 'a' would be '4' and 'dog' would be '15715' etc., and it requires num_encoder_symbols. So I think it makes me provide the position of the word and the total number of words, and then the function can represent the word in a one-hot representation. I am still learning the source code, but it is hard to understand.
Could anyone give me an idea on the above problem?
The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:
EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
Knowing this, you can just modify this variable. I mean: get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the read vectors in the way illustrated by the snippet below (it's just a snippet, you'll have to change it to make it work, but I hope it shows the idea).
vectors_variable = [v for v in tf.trainable_variables()
                    if EMBEDDING_KEY in v.name]
if len(vectors_variable) != 1:
    print("Word vector variable not found or too many.")
    sys.exit(1)
vectors_variable = vectors_variable[0]
vectors = vectors_variable.eval()
print("Setting word vectors from %s" % FLAGS.word_vector_file)
with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
    # Lines have format: dog 0.045123 -0.61323 0.413667 ...
    for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
            # Remaining parts are components of the vector.
            # list(...) so this also works in Python 3, where map returns an iterator.
            word_vector = np.array(list(map(float, line_parts[1:])))
            if len(word_vector) != vec_size:
                print("Warn: Word '%s', expecting vector size %d, found %d"
                      % (word, vec_size, len(word_vector)))
            else:
                vectors[model.vocab[word]] = word_vector
# Assign the modified vectors to the vectors_variable in the graph.
session.run([vectors_variable.initializer],
            {vectors_variable.initializer.inputs[1]: vectors})
I guess with the scope style, which Matthew mentioned, you can get the variable like this:
with tf.variable_scope("embedding_attention_seq2seq"):
    with tf.variable_scope("RNN"):
        with tf.variable_scope("EmbeddingWrapper", reuse=True):
            embedding = vs.get_variable("embedding", [shape], trainable=...)
Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:
"embedding_attention_seq2seq/embedding_attention_decoder/embedding"
Thanks for your answer, Lukasz!
I was wondering, what exactly does model.vocab[word] stand for in the code snippet? Just the position of the word in the vocabulary?
In that case, wouldn't it be faster to iterate through the vocabulary and inject w2v vectors for the words that exist in the w2v model?
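Something along these lines (a rough sketch only, assuming w2v is a dict-like word-to-vector lookup such as loaded gensim vectors, and reusing vectors and model.vocab from the answer above):
# Hypothetical sketch: iterate over the vocabulary instead of the vector file.
for word, idx in model.vocab.items():
    if word in w2v:
        vectors[idx] = w2v[word]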
I am following the tutorial here: https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/#h2_5 to create a language model. I am following the part about the N-gram language model.
This is the completed code:
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

input = input("Hi there! Please enter an incomplete sentence and I can help you"
              " finish it!\n").lower().split()
print(model[tuple(input)])
To get output from the model, the website does this: print(dict(model["the", "price"])) but I want to generate output from a user inputted sentence. When I write print(model[tuple(input)]), it gives me an empty defaultdict.
Disregard this (keeping it for history):
How do I give it the list I create from the input? model is a
dictionary and I've read that using a list as a key isn't a good idea,
but that's exactly what they're doing? And I'm assuming mine doesn't
work because I'm passing a list? Would I have to iterate through the
words to get results?
As a side note, is this model considering the sentence as a whole to
predict the next word, or just the last word?
I had to give the model just the last two words from the list, not the entire thing (even if the input is only two words). Like so:
model[tuple(input[-2:])]
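To actually pick a predicted next word rather than printing the whole distribution, something like this works (a small sketch built on the model and input variables from the snippet above):
# Look up the distribution for the last two words and take the most
# probable continuation, if that bigram was seen in training.
next_word_probs = model[tuple(input[-2:])]
if next_word_probs:
    best = max(next_word_probs, key=next_word_probs.get)
    print(best, next_word_probs[best])
else:
    print("No prediction for this context.")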
With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus, size=100, min_count=5)
This outputs a model from gensim, but I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
(The output is a model of embeddings, too big to add here.)
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (the size parameter), or the minimum number of times a word should appear in the corpus (min_count), in SciSpacy (e.g. I looked here and here), but I'm not sure if I'm missing them?
A possible way to achieve your goal would be to:
parse your documents via nlp.pipe
collect all the words and pairwise similarities
process similarities to get the desired results
Let's prepare some data:
import spacy
import numpy as np  # used below for the similarity matrix

nlp = spacy.load("en_core_web_md", disable=['ner', 'tagger', 'parser'])
Then, to get a vector, like in model.wv['car'] one would do:
nlp("car").vector
To get most similar words like model.most_similar(positive=['car']) let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence about train",
          "And this is a sentence about a bike"]
docs = nlp.pipe(corpus)
tokens = []
tokens_orth = []
for doc in docs:
    for tok in doc:
        if tok.orth_ not in tokens_orth:
            tokens.append(tok)
            tokens_orth.append(tok.orth_)
sims = np.zeros((len(tokens), len(tokens)))
for i, tok in enumerate(tokens):
    sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth=tokens_orth, sims=sims, top=3):
    tokens_orth = np.array(tokens_orth)
    id_word = np.where(tokens_orth == word)[0][0]
    sim = sims[id_word]
    id_ms = np.argsort(sim)[:-top-1:-1]
    return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked about specifying the dimension and frequency. The embedding length is fixed at the time the model is initialized, so it can't be changed after that. You can start from a blank model if you wish, and feed it embeddings you're comfortable with. As for the frequency, it's doable, by counting all the words and throwing away anything below the desired threshold. But again, the underlying embeddings will come from the unfiltered text. SpaCy is different from Gensim in that it uses readily available embeddings, whereas Gensim trains them.
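For the frequency part, a rough sketch of that counting-and-filtering step (built on the corpus and nlp objects above; min_count here is just an assumed threshold, not a spaCy parameter):
from collections import Counter

min_count = 5
counts = Counter(tok.orth_ for doc in nlp.pipe(corpus) for tok in doc)
frequent_words = {w for w, c in counts.items() if c >= min_count}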
I have a homegrown embedding model I have built. I am trying to load my word vectors into spaCy (using the init-model CLI), and therefore need to reformat my vectors output as a word2vec table, which from my understanding is: the first line is the shape of the vectors, and the following lines are "word" <word_embedding>.
My question (and maybe it is a stupid one): is there a way that I can write the word as a string (with parentheses) and the raw vector? My current file is "word <word_vector>", so when the file is parsed the word vector is a string, which is not desirable.
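For reference, here is a minimal sketch of the plain-text format I described above (vocab and vectors are placeholders for my own model's data):
import numpy as np

vocab = ["dog", "cat"]                     # placeholder vocabulary
vectors = np.random.rand(len(vocab), 100)  # placeholder embedding matrix

with open("my_vectors.txt", "w", encoding="utf8") as out:
    # First line: number of rows and vector dimensionality.
    out.write(f"{vectors.shape[0]} {vectors.shape[1]}\n")
    # Then one "word v1 v2 ..." line per entry.
    for word, vec in zip(vocab, vectors):
        out.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")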
The code I am trying to conform to is (spacy init-model):
def read_vectors(vectors_loc, truncate_vectors=0):
    f = open_file(vectors_loc)
    shape = tuple(int(size) for size in next(f).split())
    if truncate_vectors >= 1:
        shape = (truncate_vectors, shape[1])
    vectors_data = numpy.zeros(shape=shape, dtype="f")
    vectors_keys = []
    for i, line in enumerate(tqdm(f)):
        line = line.rstrip()
        pieces = line.rsplit(" ", vectors_data.shape[1])
        word = pieces.pop(0)
        if len(pieces) != vectors_data.shape[1]:  # <- pieces is a string!
            msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1)
        vectors_data[i] = numpy.asarray(pieces, dtype="f")  # <- will literally create an array of length 1, dtype=object
        vectors_keys.append(word)
        if i == truncate_vectors - 1:
            break
    return vectors_data, vectors_keys
I know I could pretty easily start hacking up the init-model code if need be, but I would really rather not.
Thanks in advance.
End of the day and a bit of a doofus moment.
If anyone else has trouble passing vectors to spaCy... you can actually pass the vectors as a numpy .npz file, insofar as you also pass the --json-loc file (since the JSON lines file has the id idx into the word vectors).
A bit weird, as I have always serialized my numpy arrays with a .npy extension, but nonetheless.
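For anyone hitting the same serialization confusion, the .npz saving itself is just (a trivial sketch with a placeholder matrix and filename):
import numpy as np

vectors = np.random.rand(10000, 100).astype("float32")  # your real matrix here
np.savez("vectors.npz", vectors)  # writes an .npz archive, vs. np.save(...) which writes .npy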
Hope this helps someone!
So I have built a model with the sklearn Naive Bayes classifier.
I need to know how to predict a sentence taken from user input.
When I just hardcode the sentence it works fine, like this:
new_sentence = ['its so broken']
new_testdata_tfidf = tfidf.transform(new_sentence)
# transform it to a matrix to see the TF-IDF score on the training data
fit_feature_selection = selection.transform(new_testdata_tfidf)
# transform the new data to see if the feature is removed or not, because after tfidf I use chi2 feature selection
predicted = classifier.predict(fit_feature_selection)
# then predict it. The classification comes out as class -1, which is the correct answer
I need to type the text data by hand as input, so I use this:
new_sentence = input[('')]
# I input the same sentence: 'its so broken'
new_testdata_tfidf = tfidf.transform(new_sentence)
# transform it to a matrix to see the TF-IDF score on the training data
fit_feature_selection = selection.transform(new_testdata_tfidf)
# transform the new data to see if the feature is removed or not, because after tfidf I use chi2 feature selection
predicted = classifier.predict(fit_feature_selection)
But it gives me this output:
File "C:\Users\Myfile\OneDrive\Desktop\model.py", line 170, in <module>
new_testdata_tfidf= tfidf.transform(new_sentence)
File "E:\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1898, in transform
X = super().transform(raw_documents)
File "E:\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1265, in transform
"Iterable over raw text documents expected, "
ValueError: Iterable over raw text documents expected, string object received.
How do I resolve this?
Any help is really appreciated.
Have you tried passing the new sentence as a list? i.e.
new_testdata_tfidf= tfidf.transform([new_sentence])
In the first instance you are passing a list with one string element, and in the other you are simply passing a string.
If you are trying to pass the list of strings with new_sentence = input[('')] in your code, then you might want to replace it with
new_sentence = [input()]
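Putting both suggestions together, the input path would look roughly like this (a sketch, assuming the fitted tfidf, selection, and classifier objects from the question):
new_sentence = [input("Enter a sentence: ")]  # a list with one string element
new_testdata_tfidf = tfidf.transform(new_sentence)
fit_feature_selection = selection.transform(new_testdata_tfidf)
predicted = classifier.predict(fit_feature_selection)
print(predicted)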
Hope this helps.
I am writing a Question Answering system using pre-trained BERT with a linear layer and a softmax layer on top. When following the templates available on the net, the labels of one example usually consist of only one answer_start_index and one answer_end_index. For example, from Huggingface, when instantiating a SQUADFeatures object:
```
self.input_ids = input_ids
self.attention_mask = attention_mask
self.token_type_ids = token_type_ids
self.cls_index = cls_index
self.p_mask = p_mask
self.example_index = example_index
self.unique_id = unique_id
self.paragraph_len = paragraph_len
self.token_is_max_context = token_is_max_context
self.tokens = tokens
self.token_to_orig_map = token_to_orig_map
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
self.qas_id = qas_id
```
However, in my own dataset I have examples where the answer word is found at several locations in the context, i.e. there may be several correct spans constituting the answer.
My problem is that I don't know how to handle such examples. In the templates available on the net, labels are usually in a list, say:
[start_example1, start_example2, start_example3]
[end_example1, end_example2, end_example3]
In my case this may look like:
[start_example1, [start_example2_1, start_example2_2], start_example3]
and the same for the end indices, of course.
In other words, I do not have a list containing one label per example, but a list containing either single labels or a list of "labels" for an example, i.e. a list consisting of lists.
When following other templates the next step in the process is:
```
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
token_type_ids = torch.cat(token_type_ids, dim=0)
span_starts = torch(span_starts) #Something like this
span_ends = torch(span_ends) #Something like this
```
However, this of course (?) raises an error, as my span_start and span_end lists do not contain only single items but sometimes a list within the list.
Anyone have an idea on how I can tackle this problem? Should I only use examples where there's only one span constituting the answer present in the context?
If I work around the torch-error, will the backpropagation / evaluation/ computation of loss still work?
Thank You! /B
Have you checked this code?
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
assert answer == "a nice puppet"
I am not sure if this is the best way, but you may want to use topk instead of argmax, and check whether the top candidates correspond to the correct answers.
t = torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
_, indices = t.topk(4)
print(indices)  # tensor([9, 8, 7, 6])
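Applied to the QA example above, that would look something like this (a sketch only; k=3 and the span-matching loop are illustrative choices, not part of the original code):
k = 3
start_candidates = torch.topk(start_scores, k).indices[0].tolist()
end_candidates = torch.topk(end_scores, k).indices[0].tolist()
for s in start_candidates:
    for e in end_candidates:
        if s <= e:
            candidate = ' '.join(all_tokens[s:e + 1])
            print(candidate)  # compare against each of your gold answer spans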