I used the following code to use GloVe vectors for word embeddings:
from gensim.scripts.glove2word2vec import glove2word2vec #line1
glove_input_file = 'glove.840B.300d.txt' #line2
word2vec_output_file = 'glove.word2vec' #line3
glove2word2vec(glove_input_file, word2vec_output_file) #line4
from gensim.models import KeyedVectors #line5
glove_w2vec = KeyedVectors.load_word2vec_format('glove.word2vec', binary=False) #line6
I understand this chunk of code is for using GloVe pretrained vectors for word embeddings, but I am not sure what is happening in each line. Why convert GloVe to word2vec format? What does KeyedVectors.load_word2vec_format do exactly?
The GloVe algorithm and word2vec both create word vectors: one vector per word.
But the formats for storing those vectors are slightly different. The gensim glove2word2vec() function will let you convert a file in GloVe format to the format used by the original Google word2vec.c code.
https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Meanwhile, the gensim KeyedVectors.load_word2vec_format() method can load vectors in that word2vec.c format, into an instance of KeyedVectors (or one of its same-interface subclasses), for easy lookup and other common word-vector operations.
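For example, once the conversion has run, loading and querying the vectors looks roughly like this (a minimal sketch reusing the file name from the question; the lookup word is just illustrative):
from gensim.models import KeyedVectors

# Load the converted file; binary=False because it is plain text.
glove_w2vec = KeyedVectors.load_word2vec_format('glove.word2vec', binary=False)

# Typical KeyedVectors operations once the vectors are loaded:
vec = glove_w2vec['king']                              # 300-dimensional numpy array
neighbours = glove_w2vec.most_similar('king', topn=5)  # nearest words by cosine similarity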
I have run a word2vec model on my data, list_of_sentence:
from gensim.models import Word2Vec
w2v_model = Word2Vec(list_of_sentence, min_count=5, workers=4)
print(type(w2v_model))
<class 'gensim.models.word2vec.Word2Vec'>
I would like to know the dimensionality of w2v_model vectors. How can I check it?
The vector dimensionality is included as an argument in Word2Vec:
In gensim versions up to 3.8.3, the argument was called size (docs)
In the latest gensim versions (4.0 onwards), the relevant argument is renamed to vector_size (docs)
In both cases, the argument has a default value of 100; this means that if you do not specify it explicitly (as is the case here), the dimensionality will be 100.
Here is a reproducible example using gensim 3.6:
import gensim
gensim.__version__
# 3.6.0
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, window=5, min_count=1, workers=4) # do not specify size, leave the default 100
wv = model.wv['computer'] # get numpy vector of a word in the corpus
wv.shape # verify the dimension of a single vector is 100
# (100,)
If you want to change this dimensionality to, say, 256, you should call Word2Vec with the argument size=256 (for gensim versions up to 3.8.3) or vector_size=256 (for gensim versions 4.0 or later).
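If you just want to inspect the dimensionality of an already trained model, you can also read it off the model directly; a minimal sketch using the w2v_model from the question, assuming a gensim version where the attribute is named vector_size:
# Query the dimensionality directly (attribute name may differ in very old versions).
print(w2v_model.wv.vector_size)  # 100 unless you changed the default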
I want to use the TensorFlow Dataset API to initialize my dataset using TensorFlow Hub. I want to use the dataset.map function to convert my text data into embeddings. My TensorFlow version is 1.14.
Since I used the ELMo v2 module, which converts an array of sentences into their word embeddings, I used the following code:
import tensorflow as tf
import tensorflow_hub as hub
...
sentences_array = load_sentences()
#Sentence_array=["I love Python", "python is a good PL"]
def parse(sentences):
    elmo = hub.Module("./ELMO")
    embeddings = elmo([sentences], signature="default", as_dict=True)["word_emb"]
    return embeddings

dataset = tf.data.TextLineDataset(sentences_array)
dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=parse,
                                                           batch_size=batch_size))
I want the embeddings of the text array shaped like [batch_size, max_words_in_batch, embedding_size], but I got the following error message:
"NotImplementedError: Using TF-Hub module within a TensorFlow defined function is currently not supported."
How can I get the expected results?
Unfortunately, this is not supported in TensorFlow 1.x.
It is, however, supported in TensorFlow 2.0, so if you can upgrade to TensorFlow 2 and choose from the available text embedding modules for TF 2 (current list here), then you can use this in your dataset pipeline. Something like this:
embedder = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")

def parse(sentences):
    embeddings = embedder([sentences])
    return embeddings

dataset = tf.data.TextLineDataset("text.txt")
dataset = dataset.map(parse)
If you are tied to 1.x or tied to ELMo (which I don't think is yet available in the new format), then the only option I can see for embedding in the preprocessing stage is to first run your dataset through a simple embedding model and save the results, then use the embedded vectors for the downstream task separately (a rough sketch follows below). I appreciate this is less than ideal.
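Assuming the local "./ELMO" module from the question and a plain Python list of sentences, the pre-computation step in TF 1.x might look like this (the file name elmo_embeddings.npy is just illustrative):
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

sentences = ["I love Python", "python is a good PL"]

# Build the graph outside of tf.data: embed all sentences in one session run.
elmo = hub.Module("./ELMO")
embeddings_op = elmo(sentences, signature="default", as_dict=True)["word_emb"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embedded = sess.run(embeddings_op)  # shape: [batch, max_words, embedding_size]

np.save("elmo_embeddings.npy", embedded)
# Later, build the input pipeline from the saved array:
# dataset = tf.data.Dataset.from_tensor_slices(np.load("elmo_embeddings.npy"))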
I am using Gensim to load my fasttext .vec file as follows.
m = load_word2vec_format(filename, binary=False)
However, I am confused about whether I need to load the .bin file to perform commands like m.most_similar("dog"), m.wv.syn0, m.wv.vocab.keys(), etc. If so, how do I do it?
Or is the .bin file not needed for this cosine similarity matching?
Please help me!
The following can be used:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/the_model.vec')  # path to your .vec file
model.most_similar("summer")
model.similarity("summer", "winter")
There are many options for using the model now.
The gensim library has evolved, so some code fragments have become deprecated. This is a working solution:
import gensim.models.wrappers.fasttext
model = gensim.models.wrappers.fasttext.FastTextKeyedVectors.load_word2vec_format(Source + '.vec', binary=False, encoding='utf8')
word_vectors = model.wv
# -- this saves space, if you plan to use only, but not to train, the model:
del model
# -- do your work:
word_vectors.most_similar("etc")
If you want to be able to retrain the gensim model later with additional data, you should save the whole model like this: model.save("fasttext.model").
If you save just the word vectors with model.wv.save_word2vec_format(Path("vectors.txt")), you will still be able to perform any of the functions that vectors provide, like similarity, but you will not be able to retrain the model with more data.
Note that if you are saving the whole model, you should pass a file name as a string instead of wrapping it in get_tmpfile, as suggested in the documentation here.
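A minimal sketch of the vectors-only route, assuming the word_vectors object from the snippet above; the file name vectors.txt is just a placeholder:
from gensim.models import KeyedVectors

# Save only the vectors in word2vec text format ...
word_vectors.save_word2vec_format("vectors.txt", binary=False)

# ... and reload them later: similarity queries still work, further training does not.
reloaded = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)
print(reloaded.most_similar("summer")[:3])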
Maybe I am late in answering this, but you can find your answer in the documentation: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
Example use cases
This library has two main use cases: word representation learning and text classification. These were described in the two papers 1 and 2.
Word representation learning
In order to learn word vectors, as described in 1, do:
$ ./fasttext skipgram -input data.txt -output model
where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.
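To connect this back to gensim: if you need the subword information or want to continue training, recent gensim versions (3.8+) can load the .bin file directly; a hedged sketch, with model.bin as a placeholder path:
from gensim.models.fasttext import load_facebook_model

# The .bin carries the full model (including character n-grams), so it can handle
# out-of-vocabulary words and further training; the .vec file alone is enough for
# plain similarity lookups via load_word2vec_format.
ft_model = load_facebook_model("model.bin")
print(ft_model.wv.most_similar("dog")[:3])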
I was working on two-class text classification. Usually I create pickle files of the trained model and load those pickles later to eliminate retraining.
When I had 12,000 reviews plus more than 50,000 tweets for each class, the trained model size grew to 1.4 GB.
Storing such a large model as a pickle and loading it back is really neither feasible nor advisable.
Is there any better alternative for this scenario?
Here is some sample code. I tried multiple ways of pickling; here I have used the dill package:
def train(self):
    global pos, neg, totals
    retrain = False
    # Load counts if they already exist.
    if not retrain and os.path.isfile(CDATA_FILE):
        # pos, neg, totals = cPickle.load(open(CDATA_FILE))
        pos, neg, totals = dill.load(open(CDATA_FILE, 'r'))
        return
    # Count words in the negative ("unsuspected") documents.
    for file in os.listdir("./unsuspected/"):
        for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    # Count words in the positive ("suspected") documents.
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./suspected/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    self.prune_features()
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    countdata = (pos, neg, totals)
    dill.dump(countdata, open(CDATA_FILE, 'w'))
UPDATE: The reason behind the large pickle is that the classification data is very large, and I have considered 1-4 grams for feature selection. The classification dataset itself is around 300 MB, so the multigram approach to feature selection creates a large training model.
Pickle is a very heavy format: it stores all the details of the objects.
It would be much better to store your data in an efficient format like HDF5.
If you are not familiar with HDF5, you can look into storing your data in simple flat text files. You can use CSV or JSON, depending on your data structure. You'll find that either is more efficient than pickle.
You can also look at gzip to create and load compressed archives, as sketched below.
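A minimal sketch of the gzip route, assuming the (pos, neg, totals) counts from the question; the file name is illustrative:
import gzip
import pickle

# Write the counts through a gzip stream instead of a plain file.
with gzip.open("counts.pkl.gz", "wb") as f:
    pickle.dump((pos, neg, totals), f, protocol=pickle.HIGHEST_PROTOCOL)

# Read them back the same way.
with gzip.open("counts.pkl.gz", "rb") as f:
    pos, neg, totals = pickle.load(f)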
The problem and solution are explained here. In short, the problem is that when doing featurization, e.g. using CountVectorizer, although you might ask for a small number of features, e.g. max_features=1000, the transformer still keeps a copy of all possible features for debugging purposes under the hood.
For instance, the CountVectorizer has the following attribute:
stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
and this causes the model size to become too large. To solve this issue, you can set stop_words_ to None before pickling your model (taken from the example in the link above; please check that link for details):
import pickle
model_name = 'clickbait-model-sm.pkl'
cfr_pipeline.named_steps.vectorizer.stop_words_ = None
pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)
I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, each with 150k features. I've tried to implement the code from the following link:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
The problem is (as I understand it) that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict as labels):
Traceback (most recent call last):
File "skitest.py", line 96, in <module>
classif.train(add_label(matr, labels))
File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
for f in fs.iterkeys():
File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
return _cs_matrix.__getattr__(self, attr)
File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
raise AttributeError, attr + " not found"
AttributeError: iterkeys not found
My question is, is there a way to either avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use in this case efficiently instead of dok_matrix? Or am I missing something obvious?
Thanks for anyone's time. :)
EDIT, 6th sep:
Found the iterkeys, so at least the code runs. It's still too slow: it has taken several hours with a dataset of size 32k and still hasn't finished. Here's what I have at the moment:
matr = dok_matrix((6000000, 150000), dtype=float32)
labels = OrderedDict()
#collect the data into the matrix
pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets - foldsize)]
classif.train(add_label(matr[:(lentweets-foldsize),0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets-foldsize)]
data = np.array(classif.batch_classify(readrow))
The problem might be that each row that is taken doesn't exploit the sparseness of the vector but goes through each of the 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets. (*)
If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)
You now have a training set X in CSR sparse matrix format, which you can feed to a Naive Bayes classifier if you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
If it turns out this doesn't work because the set of documents is too large (unlikely since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
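A hedged sketch of that out-of-core pattern (not taken verbatim from the linked example): batch_iterator() is a hypothetical generator yielding (texts, labels) minibatches, and the HashingVectorizer parameter names follow recent scikit-learn (older releases used non_negative=True instead of alternate_sign=False):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hash texts into a fixed-width sparse matrix; non-negative values are required by MultinomialNB.
vect = HashingVectorizer(n_features=2**18, alternate_sign=False)
nb = MultinomialNB()
all_classes = [0, 1]  # assumed two-class problem

for texts, labels in batch_iterator():  # hypothetical minibatch source
    X_batch = vect.transform(texts)     # hashing needs no fitting step
    nb.partial_fit(X_batch, labels, classes=all_classes)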
(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.