I am following the tutorial here: https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/#h2_5 to create a language model, specifically the section on the N-gram language model.
This is the completed code:
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

input = input("Hi there! Please enter an incomplete sentence and I can help you"
              " finish it!\n").lower().split()
print(model[tuple(input)])
To get output from the model, the website does this: print(dict(model["the", "price"])), but I want to generate output from a sentence entered by the user. When I write print(model[tuple(input)]), it gives me an empty defaultdict.
Disregard this (keeping for history):
How do I give it the list I create from the input? model is a dictionary, and I've read that using a list as a key isn't a good idea, but that's exactly what they're doing? And I'm assuming mine doesn't work because I'm passing it a list? Would I have to iterate through the words to get results?
As a side note, is this model considering the sentence as a whole to predict the next word, or just the last word?
I had to give the model only the last two words from the list, not the entire thing, since the model's keys are two-word tuples (this works even when the input is just two words). Like so:
model[tuple(input[-2:])]
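For completeness, here is a minimal sketch (using the variable names from the code above) of how that lookup can be turned into a next-word suggestion:

context = tuple(input[-2:])
# sort candidate next words by probability, highest first
next_words = sorted(model[context].items(), key=lambda kv: kv[1], reverse=True)
print(next_words[:5])  # the five most likely continuations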
With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus, size=100, min_count=5)
The output from gensim is a model of embeddings (too big to add here), but I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (size parameter), or the minimum number of times the word should be in the corpus (min_count) in SciSpacy (e.g. I looked here and here), but I'm not sure if I'm missing them?
A possible way to achieve your goal would be to:
parse your documents via nlp.pipe
collect all the words and pairwise similarities
process similarities to get the desired results
Let's prepare some data:
import spacy
import numpy as np  # np is used below for the similarity matrix

nlp = spacy.load("en_core_web_md", disable=['ner', 'tagger', 'parser'])
Then, to get a vector, like in model.wv['car'] one would do:
nlp("car").vector
To get the most similar words, as with model.most_similar(positive=['car']), let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence aboout train"
, "And this is a sentence about a bike"]
docs = nlp.pipe(corpus)
tokens = []
tokens_orth = []
for doc in docs:
for tok in doc:
if tok.orth_ not in tokens_orth:
tokens.append(tok)
tokens_orth.append(tok.orth_)
sims = np.zeros((len(tokens),len(tokens)))
for i, tok in enumerate(tokens):
sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth=tokens_orth, sims=sims, top=3):
    tokens_orth = np.array(tokens_orth)
    id_word = np.where(tokens_orth == word)[0][0]
    sim = sims[id_word]
    id_ms = np.argsort(sim)[:-top-1:-1]  # indices of the `top` largest similarities
    return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked for specification of dimension and frequency. Embedding length is fixed at the time the model is initialized, so it can't be changed after that. You can start from a blank model if you wish, and feed it embeddings you're comfortable with. As for the frequency, it's doable, by counting all the words and throwing away anything that is below the desired threshold. But again, the underlying embeddings will still come from the unfiltered text. SpaCy is different from Gensim in that it uses readily available embeddings, whereas Gensim trains them.
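A rough sketch of that frequency filtering (min_count here simply mirrors Gensim's parameter name; it filters the token list built earlier, while the underlying spaCy vectors of course remain trained on unfiltered text):

from collections import Counter

min_count = 5  # analogue of Gensim's min_count
counts = Counter(tok.orth_ for doc in nlp.pipe(corpus) for tok in doc)
frequent = {orth for orth, count in counts.items() if count >= min_count}
filtered_tokens = [tok for tok in tokens if tok.orth_ in frequent]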
Is there a way in Python to map the documents belonging to a certain topic? For example, a list of documents that are primarily "Topic 0". I know there are ways to list the topics for each document, but how do I do it the other way around?
Edit:
I am using the following script for LDA:
import os
import gensim
import textract
from gensim import corpora
from nltk.corpus import stopwords
# my_path, files, tokenizer and p_stemmer are assumed to be defined earlier
# (e.g. an NLTK RegexpTokenizer and PorterStemmer)

doc_set = []
for file in files:
    newpath = os.path.join(my_path, file)
    newpath1 = textract.process(newpath)
    newpath2 = newpath1.decode("utf-8")
    doc_set.append(newpath2)

texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in stopwords.words()]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary, passes=1)
You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.
But you want the reverse: a list of documents, for a topic.
Essentially, you'll want to build the reverse-mapping yourself.
Fortunately Python's native dicts & idioms for working with mappings make this pretty simple - just a few lines of code - as long as you're working with data that fully fits in memory.
Very roughly the approach would be:
create a new structure (dict or list) for mapping topics to lists-of-documents
iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
finally, look up (& perhaps sort) those lists-of-docs, for each topic of interest
If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse-mapping you'd need.
Update for your code update:
OK, if your model is in ldamodel and your BOW-formatted docs are in corpus, you'd do something like:
# setup: get the model's topics in their native ordering...
all_topics = ldamodel.print_topics()
# ...then create an empty list per topic to collect the docs:
docs_per_topic = [[] for _ in all_topics]

# now, for every doc...
for doc_id, doc_bow in enumerate(corpus):
    # ...get its topics...
    doc_topics = ldamodel.get_document_topics(doc_bow)
    # ...& for each of its topics...
    for topic_id, score in doc_topics:
        # ...add the doc_id & its score to the topic's doc list
        docs_per_topic[topic_id].append((doc_id, score))
After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):
print(docs_per_topic[0])
If you're interested in the top docs per topic, you can further sort each list's pairs by their score:
for doc_list in docs_per_topic:
    doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)
Then, you could get the top-10 docs for topic 0 like:
print(docs_per_topic[0][:10])
Note that this does everything using all-in-memory lists, which might become impractical for very large corpora. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
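One simple, illustrative variant of that: stream over the corpus and append each document's scores straight to a per-topic file, so the full mapping never has to sit in memory:

for doc_id, doc_bow in enumerate(corpus):
    for topic_id, score in ldamodel.get_document_topics(doc_bow):
        # one "doc_id <tab> score" line per matching topic file
        with open("topic_%d_docs.txt" % topic_id, "a") as f:
            f.write("%d\t%f\n" % (doc_id, score))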
Given a list of document words, e.g. [['cow','boy','hat','mat'],['village','boy','water','cow'], ...], gensim can be used to get bi-grams as follows:
bigrams = gensim.models.Phrases(data_words, min_count=1, threshold=1)
bigram_model = gensim.models.phrases.Phraser(bigrams)
I was wondering as to how to get the score of each bi-gram detected in the bigram_model?
It turns out that it is as simple as using:
bigram_model.phrasegrams
that yields something like below:
{(b'cow', b'boy'): 23.3228613654742079,
(b'village', b'water'): 1.3228613654742079}
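To iterate over all detected bigrams and their scores, something like the following sketch should work (note that the exact key/value layout of phrasegrams differs between Gensim versions; older versions use byte-string tuples as shown above, Gensim 4.x uses delimiter-joined strings):

for bigram, value in bigram_model.phrasegrams.items():
    print(bigram, value)  # value is the score shown above (its format varies by version)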
I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I could import my own word2vec into this model, rather than using its original 'dense representation' mentioned in the tutorial.
From my point of view, it looks like TensorFlow is using a one-hot representation for the seq2seq model. Firstly, the function tf.nn.seq2seq.embedding_attention_seq2seq takes the encoder's input as tokenized symbols, e.g. 'a' would be '4' and 'dog' would be '15715', etc., and it requires num_encoder_symbols. So I think it makes me provide the position of the word and the total number of words, and then the function can represent the word as a one-hot vector. I am still learning the source code, but it is hard to understand.
Could anyone give me an idea on the above problem?
The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:
EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
Knowing this, you can just modify this variable. I mean -- get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab you can then assign the read vectors in a way illustrated by the snippet below (it's just a snippet, you'll have to change it to make it work, but I hope it shows the idea).
vectors_variable = [v for v in tf.trainable_variables()
                    if EMBEDDING_KEY in v.name]
if len(vectors_variable) != 1:
    print("Word vector variable not found or too many.")
    sys.exit(1)
vectors_variable = vectors_variable[0]
vectors = vectors_variable.eval()
print("Setting word vectors from %s" % FLAGS.word_vector_file)
with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
    # Lines have format: dog 0.045123 -0.61323 0.413667 ...
    for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
            # Remaining parts are components of the vector.
            word_vector = np.array(list(map(float, line_parts[1:])))
            if len(word_vector) != vec_size:
                print("Warn: Word '%s', expecting vector size %d, found %d"
                      % (word, vec_size, len(word_vector)))
            else:
                vectors[model.vocab[word]] = word_vector
# Assign the modified vectors to the vectors_variable in the graph.
session.run([vectors_variable.initializer],
            {vectors_variable.initializer.inputs[1]: vectors})
I guess with the scope style, which Matthew mentioned, you can get the variable:

with tf.variable_scope("embedding_attention_seq2seq"):
    with tf.variable_scope("RNN"):
        with tf.variable_scope("EmbeddingWrapper", reuse=True):
            embedding = vs.get_variable("embedding", [shape], [trainable=])

Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:
"embedding_attention_seq2seq/embedding_attention_decoder/embedding"
Thanks for your answer, Lukasz!
I was wondering, what exactly does model.vocab[word] stand for in the code snippet? Just the position of the word in the vocabulary?
In that case, wouldn't it be faster to iterate through the vocabulary and inject the w2v vectors for the words that exist in the w2v model?
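For reference, a rough sketch of that alternative direction, assuming model.vocab maps each word to its row index (as in the snippet above) and w2v is a loaded Gensim Word2Vec/KeyedVectors object (both names are illustrative):

for word, row in model.vocab.items():
    if word in w2v:
        # overwrite only the rows for words present in the word2vec model
        vectors[row] = w2v[word]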
For a project I would like to measure the amount of ‘human centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.
I came up with the following (simple) piece of code:
word = 'girlfriend'
word_synsets = wn.synsets(word)[0]
hypernyms = word_synsets.hypernym_paths()[0]
for element in hypernyms:
    print element
Results in:
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')
My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:
count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print 'found person hypernym'
        count_humancenteredness += 1
I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the human-centeredness count) when a word does indeed belong to the ‘person’ or ‘human’ synset?
Secondly, is this an efficient approach? I assume that iterating over several texts and over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.
Thanks for your help!
wrt the error message
hypernyms = word_synsets.hypernym_paths() returns a list of lists of Synsets.
Hence
if element == 'person':
tries to compare a Synset object against a string. That kind of comparison is not supported by Synset.
Try something like
target_synsets = wn.synsets('person')
if element in target_synsets:
    ...
or
if u'person' in element.lemma_names():
    ...
instead.
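Plugging the second option into the original loop would look like this (a sketch; it counts a hypernym as human-centered as soon as 'person' appears among its lemma names):

count_humancenteredness = 0
for element in hypernyms:
    if u'person' in element.lemma_names():
        count_humancenteredness += 1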
wrt efficiency
Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.
To speed up the lookup, you can pre-compile a list of "person related" words in advance by making use of the transitive closure over the hyponyms as explained here.
Something like
person = wn.synset('person.n.01')
person_words = set(w for s in person.closure(lambda s: s.hyponyms()) for w in s.lemma_names())
should do the trick. This will return a set of ~ 10,000 words, which is not too much to store in main memory.
A simple version of the word counter then becomes something along the lines of

from collections import Counter

word_count = Counter()
for word in (w.lower() for w in words if w in person_words):
    word_count[word] += 1
You might also need to pre-process the input words using stemming or other morphological reductions before passing them on to WordNet, though.
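As a small sketch of that pre-processing step, WordNet's own morphy lemmatizer can be used to reduce each input word before the membership check (the normalize helper is just illustrative):

from nltk.corpus import wordnet as wn

def normalize(word):
    # morphy returns None when it cannot find a base form
    lemma = wn.morphy(word.lower())
    return lemma if lemma is not None else word.lower()

word_count = Counter(normalize(w) for w in words if normalize(w) in person_words)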
To get all the hyponyms of a synset, you can use the following function (tested with NLTK 3.0.3, dhke's closure trick doesn't work on this version):
def get_hyponyms(synset):
    hyponyms = set()
    for hyponym in synset.hyponyms():
        hyponyms |= set(get_hyponyms(hyponym))
    return hyponyms | set(synset.hyponyms())
Example:
from nltk.corpus import wordnet
food = wordnet.synset('food.n.01')
print(len(get_hyponyms(food))) # returns 1526
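Tying this back to the original question, the same helper can be pointed at the 'person' synset (the synset names below match the hypernym path printed earlier):

person = wordnet.synset('person.n.01')
person_hyponyms = get_hyponyms(person)
print(wordnet.synset('girlfriend.n.01') in person_hyponyms)  # True, via friend.n.01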