I use the gensim library to create a word2vec model. It provides the method predict_output_word(), which I understand as follows:
For example, I have a model that is trained with the sentence: "Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy."
and then I use
model.predict_output_word(context_words_list=['Anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of', 'from', 'a', 'single', 'particular', 'world', 'view', 'instead', 'fluxing'], topn=10).
In this situation, could I predict the omitted word 'doctrine'?
Is this the right way to use the function? Please explain it in detail.
Have you seen the documentation of predict_output_word? It says:
Report the probability distribution of the center word given the
context words as input to the trained model.
To answer your specific question about the word 'doctrine': it depends on whether, for the context words you listed, 'doctrine' is among the 10 most probable center words. For that to happen, it must occur relatively frequently in the corpus you use to train the model. Since 'doctrine' does not seem to be a very common word, there is a good chance that other words will have a higher probability of appearing in that context. So if you rely only on the returned probabilities of words given the context, you may well fail to predict 'doctrine' in this case.
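For reference, here is a minimal sketch of how the call fits together, using a toy model trained on just the one example sentence (so the returned probabilities are not meaningful; note also that predict_output_word requires a model trained with negative sampling, which is gensim's default):

from gensim.models import Word2Vec

sentence = ("Anarchism does not offer a fixed body of doctrine from a single "
            "particular world view instead fluxing and flowing as a philosophy")
corpus = [sentence.split()]  # one tokenized sentence as toy training data

# min_count=1 keeps every token of this tiny corpus in the vocabulary.
model = Word2Vec(corpus, min_count=1, window=5)

context = ['Anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of',
           'from', 'a', 'single', 'particular', 'world', 'view', 'instead',
           'fluxing']
# Returns up to topn (word, probability) pairs for the predicted center word.
print(model.predict_output_word(context, topn=10))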
I am totally new to Word2Vec. I want to find the cosine similarity between word pairs in my data. My code is as follows:
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec(corpus_file="corpus.txt", sg=0, window=7, size=100, min_count=10, iter=4)
vocabulary = list(model.wv.vocab)
data=pd.read_csv("experiment.csv")
cos_similarity = model.wv.similarity(data['word 1'], data['word 2'])
The problem is that some words in the "word 1" and "word 2" columns of my "experiment.csv" file are not present in the corpus file ("corpus.txt"), so this error is returned:
"word 'setosa' not in vocabulary"
What should I do to handle words that are not present in my input corpus? I want to assign the zero vector to words in my experiment that are not present in the input corpus, but I am stuck on how to do it.
Any ideas?
It's really easy to give unknown words the origin (all 'zero') vector:
import numpy as np

word = data['word 1']
if word in model.wv:
    vec = model.wv[word]
else:
    vec = np.zeros(100)  # match the model's vector dimensionality
But, this is unlikely what you want. The zero vector can't be cosine-similarity compared to other vectors.
It's often better to simply ignore unknown words. If they were so rare that your training data didn't have enough of them to create a vector, they can't contribute much to other analyses.
If they're still important, the best approach is to get more data, with realistic usage contexts, so they get meaningful vectors.
Another alternative is to use an algorithm, such as the word2vec variant FastText, which can always synthesize a guess-vector for any word that was out-of-vocabulary (OOV) with respect to the training data. It does this by learning word-vectors for word-fragments (character n-grams), then assembling a vector for a new unknown word from those fragments. It's often better than random, because unknown words are often typos or variants of known words with which they share a lot of segments. But it's still not great, and for really odd strings, it essentially returns a random vector.
Another tactic I've seen used, but wouldn't personally recommend, is to replace a lot of the words that would otherwise be ignored – such as those with fewer than min_count occurrences – with a single plug token, like say '<OOV>'. Then that synthetic token becomes a quite-common word, but gets an almost entirely meaningless vector: a random, low-magnitude one. (The prevalence of this fake word & noise-vector in training will tend to make other surrounding words' vectors worse or slower-to-train, compared to simply eliding the low-frequency words.) But then, when dealing with later unknown words, you can use this same '<OOV>' pseudoword's vector as a not-too-harmful stand-in.
But again: it's almost always better to do some combination of – (a) getting more data; (b) ignoring rare words; (c) using an algorithm like FastText which can synthesize better-than-nothing vectors – than to collapse all unknown words to a single nonsense vector.
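For completeness, a rough sketch of option (c) with gensim's FastText (parameter names follow the older gensim API used elsewhere in this thread; gensim 4+ renames size and iter to vector_size and epochs, and the toy sentences are placeholders for a real corpus; depending on the gensim version, very short or odd strings may still raise a KeyError):

from gensim.models import FastText

# Toy sentences standing in for a real training corpus.
sentences = [["the", "dog", "sat", "on", "the", "table"],
             ["the", "cat", "sat", "on", "the", "mat"]]

ft_model = FastText(sentences, sg=0, window=7, size=100, min_count=1, iter=4)

# 'setosa' never appeared in training, but FastText still assembles a
# guess-vector for it from its character n-grams.
vec = ft_model.wv['setosa']
print(ft_model.wv.similarity('setosa', 'dog'))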
I have a list of paragraphs, and I would like to check whether the words in them are valid English words or not. Sometimes, due to some external issues, I might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide some level of accuracy, but both of these have a few drawbacks. I wonder if there exists another library or procedure that can provide me with what I am looking for with the highest accuracy possible.
It depends greatly on what you mean by valid English words. Is ECG, Thor, or Loki a valid English word? If your definition of valid words is different, you might need to create your own language model.
Anyway, besides the obvious use of pyEnchant or nltk, I would suggest the fasttext library. It has multiple pre-built word vector models, and you can check your paragraphs for rare or out-of-vocabulary words. What you essentially want to check is that the embedding for a specific "non-valid" word is close to only a small number of (or zero) other words.
You can use fasttext directly from Python:
pip install fasttext
or you can use the gensim library (which also provides some additional algorithms, such as Word2Vec, which can be useful for your case as well):
pip install --upgrade gensim
Or for conda
conda install -c conda-forge gensim
Assuming you use gensim and a pre-trained fasttext model:
from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath

cap_path = datapath("fasttext-model.bin")  # path to your pre-trained .bin model
fb_model = load_facebook_model(cap_path)
Now you can perform several tasks to achieve your goal:
1. Check out-of-vocabulary
'mybizarreword' in fb_model.wv.vocab
2. Check similarity
fb_model.wv.most_similar("man")
For rare words you will get low scores, and by setting a threshold you can decide which words are not 'valid'.
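A rough sketch of turning that into a check (looks_valid is a hypothetical helper, not part of gensim, and the 0.5 threshold and topn value are arbitrary choices that would need tuning on your data; fb_model is the model loaded above):

def looks_valid(word, model, threshold=0.5, topn=5):
    # In-vocabulary words pass immediately.
    if word in model.wv.vocab:
        return True
    # For out-of-vocabulary words, look at the similarity scores of the
    # nearest neighbours; a word whose best neighbour is still far away
    # is likely noise rather than a real English word.
    neighbours = model.wv.most_similar(word, topn=topn)
    return max(score for _, score in neighbours) >= threshold

print(looks_valid('mybizarreword', fb_model))
print(looks_valid('table', fb_model))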
Linux and Mac OS X ship with a word list which you can use directly; otherwise, you can download a list of English words.
You can use it as follows:
d = {}
fname = "/usr/share/dict/words"
with open(fname) as f:
    content = f.readlines()
for w in content:
    d[w.strip()] = True

p = """I have a list of paragraphs, I would like to check if these words are valid English words or not. Sometimes, due to some external issues, i might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide accuracy of some level but both of these have few drawbacks. I wonder if there exists another library or procedure that can provide me with what I am looking for with at-most accuracy possible."""

lw = []
for w in p.split():
    if len(w) < 4:
        continue
    if d.get(w, False):
        lw.append(w)

print(len(lw))
print(lw)
#43
#['have', 'list', 'would', 'like', 'check', 'these', 'words', 'valid', 'English', 'words', 'some', 'external', 'might', 'valid', 'English', 'words', 'these', 'aware', 'libraries', 'like', 'which', 'have', 'dictionaries', 'provide', 'accuracy', 'some', 'level', 'both', 'these', 'have', 'wonder', 'there', 'exists', 'another', 'library', 'procedure', 'that', 'provide', 'with', 'what', 'looking', 'with', 'accuracy']
I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".
The problem is I can't find a way to do both. I only have these two partial options:
I can feed the pipeline with the original text: doc = nlp(text). Then, the NER will recognize most people names but the lemmas of words starting with a capital won't be correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.
I can feed the pipeline with the lower case text: doc = nlp(text.lower()). Then my previous example would correctly display ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people names wouldn't be recognized as entities by the NER, as I guess a starting capital is a useful indicator for finding entities.
My idea would be to perform the standard pipeline (tagger, parser, NER), then lowercase, and then lemmatize only at the end.
However, lemmatization doesn't seem to have its own pipeline component, and the documentation doesn't explain how and where it is performed. This answer seems to imply that lemmatization is performed independently of any pipeline component, and possibly at different stages of it.
So my question is: how can I choose when the lemmatization is performed and which input to give it?
If you can, use the most recent version of spacy instead. The French lemmatizer has been improved a lot in 2.1.
If you have to use 2.0, consider using an alternate lemmatizer like this one: https://spacy.io/universe/project/spacy-lefff
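If neither of those is possible, one workaround along the lines of the question's own idea is a two-pass approach: run the pipeline on the original text for NER and on the lowercased text for lemmas, then combine the results token by token. This is only a sketch and assumes both passes produce the same tokenization, which spaCy does not guarantee:

import spacy

nlp = spacy.load("fr_core_news_sm")

def lemmatize_with_ner(text):
    doc_orig = nlp(text)           # NER works best with the original casing
    doc_lower = nlp(text.lower())  # lemmas are more reliable on lowercased text
    if len(doc_orig) != len(doc_lower):
        # Tokenization diverged between the two passes; fall back to the
        # original-case lemmas rather than misalign tokens.
        return ["~PER~" if t.ent_type_ == "PER" else t.lemma_ for t in doc_orig]
    return ["~PER~" if orig.ent_type_ == "PER" else low.lemma_
            for orig, low in zip(doc_orig, doc_lower)]

print(lemmatize_with_ner("Pierre aime les chiens"))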
Having a trained word2vec model, is there a way to check which word in its vocabulary is the most "related" to a whole sentence?
I was looking for something similar to
model.wv.most_similar("the dog is on the table")
which could result in ["dog","table"]
The most_similar() method can take multiple words as input, ideally as the named parameter positive. (That's as in, "positive examples", to be contrasted with "negative examples" which can also be provided via the negative parameter, and are useful when asking most_similar() to solve analogy-problems.)
When it receives multiple words, it returns results that are closest to the average of all words provided. That might be somewhat related to a whole sentence, but such an average-of-all-word-vectors is a fairly weak way of summarizing a sentence.
The multiple words should be provided as a list of strings, not a raw string of space-delimited words. So, for example:
sims = model.wv.most_similar(positive=['the', 'dog', 'is', 'on', 'the', 'table'])
"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question, because there isn't a gold set of input/output mappings which everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to very precisely and completely unambiguously describe exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence with their POS tags, you can answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence?', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe finding the rarest noun and the rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
Ways to expand this include trying to identify larger phrases using n-gram idf's, building a full parse tree of the sentence (using, for example, the Stanford parser) and identifying some pattern within these trees that helps you figure out which parts of the tree the important content tends to sit in, etc.
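As a minimal sketch of the rarest-noun/rarest-verb heuristic (the toy corpus and the log-based idf helper are placeholders; a real idf dictionary would be built from your own data):

import math
from collections import Counter

# Toy corpus standing in for your real data.
corpus = [
    "first thing we do is talk about the first thing",
    "lets do it and lets go",
    "we do not kill time",
]
n_docs = len(corpus)
doc_freq = Counter(w for doc in corpus for w in set(doc.lower().split()))

def idf(word):
    # Words never seen in the corpus get the highest (rarest) score.
    return math.log(n_docs / (1 + doc_freq.get(word.lower(), 0)))

tagged = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
          ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

nouns = [w for w, t in tagged if t.startswith("NN")]
verbs = [w for w, t in tagged if t.startswith("VB")]
print(max(nouns, key=idf), max(verbs, key=idf))  # rarest noun and rarest verb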
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
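For example, applied to the tagged tokens from the question, this keeps only the two tokens that are absent from their tag's stop list:

pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(filter_stop_words(pos_list))
# [['kill', 'VB'], ['lawyers', 'NNS']]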
In your case, you can simply use the Rake package for Python (thanks to Fabian) to get what you need:
>>> import RAKE
>>> path = # your path to a stop words file
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
the path can be for example this file.
But in general, you are better off using the NLTK package for NLP tasks.
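A rough sketch of that route with NLTK (the stop word filtering here is only an illustration, not a tuned solution; the nltk.download calls are one-time setup):

import nltk
from nltk.corpus import stopwords

# One-time downloads for the tokenizer, tagger, and stop word list.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

sentence = "First thing we do, let's kill all the lawyers."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

stop = set(stopwords.words('english'))
keywords = [word for word, tag in tagged
            if tag.startswith(('NN', 'VB')) and word.lower() not in stop]
print(keywords)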