Word2Vec vocab results in just letters and symbols - python

I'm new to Word2Vec and I am trying to cluster words based on their similarity. To start, I use nltk to split the text into sentences and then use the resulting list of sentences as the input to Word2Vec. However, when I print the vocab, it is just a bunch of letters, numbers and symbols rather than words. To be specific, an example of one of the entries is "< gensim.models.keyedvectors.Vocab object at 0x00000238145AB438>, 'L':"
# imports needed and logging
import gensim
from gensim.models import word2vec
import logging
import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
with open('C:\\Users\\Freddy\\Desktop\\Thesis\\Descriptions.txt','r') as f_open:
    text = f_open.read()
arr = []
sentences = nltk.sent_tokenize(text) # this gives a list of sentences
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)
model = word2vec.Word2Vec(sentences, size = 300)
print(model.wv.vocab)

As the tutorial and the documentation for the Word2Vec class suggest, the constructor requires a list of lists of words as its first parameter (or, more generally, an iterable of iterables of words):
sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger
corpora,...
I believe that before feeding the sentences into Word2Vec you need to apply word_tokenize to each of them, changing the crucial line to:
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
TL;DR
You get letters as your "words" because Word2Vec treats each sentence string as an iterable of words. Iterating over a string yields a sequence of single characters, and those characters become the basis for the model's learning (instead of the intended words).
As the ancient saying goes: trash in - trash out.
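For completeness, a minimal corrected version of the script (keeping the asker's file path and the gensim 3.x API used in the question) might look like this:

import logging
import nltk
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

with open('C:\\Users\\Freddy\\Desktop\\Thesis\\Descriptions.txt', 'r') as f_open:
    text = f_open.read()

# Sentence-tokenize first, then word-tokenize each sentence,
# so Word2Vec receives a list of lists of tokens.
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]

model = word2vec.Word2Vec(sentences, size=300)  # 'vector_size=300' in gensim >= 4.0
print(list(model.wv.vocab)[:20])                # actual words; use model.wv.key_to_index in gensim >= 4.0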

Related

Using BERT to extract most similar words instead of word2vec for labeling functions

I am fairly new to BERT, and I wanted to test both approaches, word2vec and BERT, for extracting the words most similar to a given word, to pattern-match in my labeling functions.
I am currently using Snorkel; one labeling function looks like this:
@labeling_function()
def lf_find_good_synonyms(x):
    good_synonyms = word_vectors.most_similar("good", topn=25)
    good_list = syn_list(good_synonyms)
    return POSITIVE if any(word in x.stemmed for word in good_list) else ABSTAIN
This function basically looks for the word "good" or any of its similar words in a sentence (the sentences are stemmed, as are the words, since the helper syn_list returns the stem of each similar word); if found, the function simply labels the sentence as POSITIVE.
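For reference, syn_list itself is not reproduced here; a hypothetical version consistent with the description above (it just stems each word returned by most_similar) could look like:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def syn_list(similar_words):
    # most_similar() returns (word, score) pairs; keep only the stemmed words.
    return [stemmer.stem(word) for word, score in similar_words]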
The issue here is that my word vectors are based on word2vec, which is an older approach. I was wondering if I could use BERT instead, and whether it would improve the performance much, since labeling functions are allowed to be lousy.

KeyError: "word not in vocabulary" in word2vec

I wanted to convert some Japanese words to vectors so that I can train a model for prediction. For that, I downloaded pretrained models from here.
import gensim
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from gensim import models
from janome.tokenizer import Tokenizer
w2v_model = models.KeyedVectors.load_word2vec_format(w2v_models_path)
t = Tokenizer()
# I am testing for some random string
sentence = "社名公開求人住宅手当・家賃補助制度がある企業在宅勤務・リモートワーク可能な求人テレワークコロナに負けるな!積極採用中の企業特集リモートワーク可能なWebデザイナー求人"
tokens = [x.surface for x in t.tokenize(sentence)]
vectors = [w2v_model[v] for v in tokens]
In the last line, I am getting KeyError: "word 'テレワークコロナ' not in vocabulary"
Is there anything wrong here?
If you get a "not in vocabulary" error, you can trust that the token (word/key) that you've requested isn't in that KeyedVectors model.
You can see the full list of words known to your model (in the order they are stored) with w2v_model.index_to_key. (Or, as a quick sanity check, peek at a small range in the middle with Python slicing, e.g. w2v_model.index_to_key[500:520].)
Are you sure 'テレワークコロナ' (and any other string giving the same error) is a legitimate, common Japanese word? Might the tokenizer be failing in some way? Are most of the words the tokenizer returns in the model?
It looks like the site you've linked has just copied those sets-of-word-vectors from Facebook's FastText work (https://fasttext.cc/docs/en/crawl-vectors.html). And, you're just using the plain text "word2vec_format" lists of vectors, so you only have the exact words in that file, and not the full FastText model - which also models word-fragments, and can thus 'guess' vectors for unknown words. (These guesses aren't very good – like working out a word's possible meaning from word-roots – but are usually better than nothing.)
I don't know how well that approach works for Japanese, but you could try it: grab the .bin (rather than the text) file and load it using Gensim's FastText support – specifically the load_facebook_vectors() method. You'll then get a special kind of KeyedVectors (FastTextKeyedVectors) that will give you such guesses for unknown words, which might help for your purposes (or not).
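A minimal sketch of that route (the cc.ja.300.bin filename is a placeholder for whichever FastText binary you download):

from gensim.models.fasttext import load_facebook_vectors
from janome.tokenizer import Tokenizer

# Placeholder path to the downloaded FastText .bin model.
ft_vectors = load_facebook_vectors('cc.ja.300.bin')

t = Tokenizer()
sentence = "リモートワーク可能なWebデザイナー求人"
tokens = [x.surface for x in t.tokenize(sentence)]

# FastTextKeyedVectors can synthesize vectors for out-of-vocabulary tokens
# from character n-grams, so unknown tokens no longer raise KeyError.
vectors = [ft_vectors[token] for token in tokens]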

Statistical sentence suggestion model like spell checking

There are already spell-checking models available which help us find suggested correct spellings based on a corpus of trained correct spellings. Can the granularity be increased from the letter to the word, so that we can have phrase suggestions as well: if an incorrect phrase is entered, then it should suggest the nearest correct phrase from a corpus of correct phrases (trained, of course, from a list of valid phrases)?
Are there any Python libraries which already provide this functionality, or how should I proceed, given an existing large gold-standard phrase corpus, to get statistically relevant suggestions?
Note: this is different from a spell checker, as the alphabet in a spell checker is finite, whereas in a phrase corrector the 'alphabet' is itself a word and hence theoretically infinite; but we can limit the number of words using a phrase bank.
What you want to build is an N-gram model, which consists in computing the probability for each word to follow a sequence of n words.
You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider 2-gram (Markov model):
What is the probability for "kitten" to follow "cute"?
...or 3-gram:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously, training the model with (n+1)-grams is costlier than with n-grams.
Instead of considering words, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
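For instance, nltk.pos_tag gives you exactly those (word, pos) pairs (a small illustration added here, not part of the original answer):

import nltk

# pos_tag turns a token list into (word, tag) pairs, e.g. ('cat', 'NN')
pairs = nltk.pos_tag(nltk.word_tokenize("The cat is cute."))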
You can also try to consider the lemmas instead of the words.
Here are some interesting lectures about N-gram modelling:
Introduction to N-grams
Estimating N-gram Probabilities
Here is a simple, short (and unoptimized) example of code for a 2-gram model:
from collections import defaultdict
import math
import nltk

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

# Count how often each token is followed by each next token.
for sentence in nltk.sent_tokenize(corpus):
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Convert the counts into log10 conditional probabilities.
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}
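With the toy corpus above you can then look up the (log base 10) probability of one token following another, for example:

# log10 P("cute" | "is") = log10(1) - log10(2) ≈ -0.301, i.e. a probability of 0.5
print(ngram["is"]["cute"])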

How to find polysemy words from input query?

I am working on a polysemy disambiguation project, and for that I am trying to find the polysemous words in an input query. The way I am doing it is:
#! /usr/bin/python
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

stop = stopwords.words('english')
print "enter input query"
string = raw_input()
str1 = [i for i in string.split() if i not in stop]
a = list()
for w in str1:
    if len(wn.synsets(w)) > 1:
        a.append(w)
Here, the list a will contain the polysemous words.
But using this method, almost all words get flagged as polysemous.
For example, if my input query is "milk is white in colour", then it stores ('milk', 'white', 'colour') as polysemous words.
WordNet is known to be very fine-grained, and it sometimes makes distinctions between very subtly different senses that you and I might think are the same. There have been attempts to make WordNet coarser; search for work on automatically building a coarse-grained WordNet. I am not sure if the results of that work are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy: if the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then treat the word as not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible. Start your research here and here.
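As a rough sketch of that alternative definition: NLTK's WordNet interface exposes SemCor-derived sense counts through Lemma.count(), so, assuming those counts are an acceptable stand-in for a "large corpus", one could do something like this (the 0.8 threshold and the fallbacks are taken from the suggestion above, not an established standard):

from nltk.corpus import wordnet as wn

def is_polysemous(word, threshold=0.8):
    # Sum the SemCor-derived usage counts for each sense of the word.
    counts = [sum(l.count() for l in s.lemmas() if l.name().lower() == word.lower())
              for s in wn.synsets(word)]
    if len(counts) < 2:
        return False                        # zero or one sense: not polysemous
    total = sum(counts)
    if total == 0:
        return True                         # no frequency data: fall back to the sense count
    return max(counts) / total < threshold  # polysemous only if no single sense dominates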

Identifying important words and phrases in text

I have text stored in a python string.
What I Want
To identify key words in that text.
To identify N-grams in that text (ideally more than just bi- and tri-grams).
Keep in mind...
The text might be small (e.g. tweet sized)
The text might be medium-sized (e.g. news article sized)
The text might be large (e.g. book or chapter sized)
What I Have
I'm already using nltk to break the corpus into tokens and remove stopwords:
# split across any non-word character
tokenizer = nltk.tokenize.RegexpTokenizer(r"[^\w']+", gaps=True)
# tokenize
tokens = tokenizer.tokenize(text)
# remove stopwords
tokens = [w for w in tokens if not w in nltk.corpus.stopwords.words('english')]
I'm aware of the BigramCollocationFinder and TrigramCollocationFinder, which do exactly what I'm looking for in those two cases.
The Question
I need advice on n-grams of higher order, on improving the kinds of results that come from BCF and TCF, and on the best way to identify the most unique individual key words.
Many thanks!
As for the best way to identify the most unique individual key words, tf-idf is the standard measure. So you will somehow have to integrate a search engine (or build a simple custom inverted index that is dynamic and holds term frequencies and document frequencies) in order to compute tf-idf efficiently and on the fly.
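As one possible shortcut (an assumption on my part; the answer itself suggests a search engine or a hand-rolled inverted index), scikit-learn's TfidfVectorizer can compute the same scores if you can split your material into several documents:

from sklearn.feature_extraction.text import TfidfVectorizer

# 'docs' is assumed to be a list of texts; for a single large text, split it into
# chapters or paragraphs so that document frequency is meaningful.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# The highest-scoring terms of the first document are its candidate key words.
terms = vectorizer.get_feature_names_out()
scores = tfidf[0].toarray().ravel()
keywords = [terms[i] for i in scores.argsort()[::-1][:10]]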
As for your N-grams, why don't you build a custom parser using a sliding-window approach (with a window of length N) that identifies, say, the most frequent of them? (Just keep every N-gram as a key in a dictionary, with the value being either its frequency or a score based on the tf-idf of its individual terms.)
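A minimal sketch of that sliding-window counting (frequencies only; the tf-idf-based scoring is left out), assuming tokens is the stopword-filtered list from the question above:

from collections import Counter

def count_ngrams(tokens, n):
    # Slide a window of length n over the token list and count each n-gram.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# e.g. the ten most frequent 4-grams
top_fourgrams = count_ngrams(tokens, 4).most_common(10)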
