FastText in Gensim - python

I am using Gensim to load my fastText .vec file as follows:
m = load_word2vec_format(filename, binary=False)
However, I am confused about whether I need to load the .bin file to run commands like m.most_similar("dog"), m.wv.syn0, m.wv.vocab.keys(), etc. If so, how do I do it?
Or is the .bin file not needed for this kind of cosine-similarity matching?
Please help me!

The following can be used:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/model.vec')  # path to the .vec file
model.most_similar("summer")
model.similarity("summer", "winter")
There are many ways to use the model from here.
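Note that the .vec file only contains full-word vectors, so out-of-vocabulary lookups are not possible from it. If you also want the sub-word (character n-gram) information, you need the .bin file. A minimal sketch, assuming a recent gensim (3.8 or newer) where load_facebook_vectors is available:
from gensim.models.fasttext import load_facebook_vectors
wv = load_facebook_vectors('path/to/model.bin')  # loads the full binary model, including n-grams
wv.most_similar("dog")   # same similarity queries as with the .vec file
wv["dogggg"]             # out-of-vocabulary words can be composed from n-grams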

The gensim library has evolved, so some code fragments have been deprecated. This is an actual working solution:
import gensim.models.wrappers.fasttext
model = gensim.models.wrappers.fasttext.FastTextKeyedVectors.load_word2vec_format(Source + '.vec', binary=False, encoding='utf8')
word_vectors = model.wv
# this saves space if you plan only to use the model, not to train it:
del model
# do your work:
word_vectors.most_similar("etc")

If you want to be able to retrain the gensim model later with additional data, you should save the whole model, e.g. model.save("fasttext.model").
If you save just the word vectors with model.wv.save_word2vec_format(Path("vectors.txt")), you will still be able to perform any of the operations that plain vectors support, such as similarity lookups, but you will not be able to retrain the model with more data.
Note that if you are saving the whole model, you should pass the file name as a plain string instead of wrapping it in get_tmpfile, as suggested in the documentation here.
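A minimal sketch of the two options, assuming model is an already trained gensim FastText model (the file names and the extra data are just examples):
from gensim.models import FastText
# Option 1: save the whole model; it can be reloaded and trained further later.
model.save("fasttext.model")
restored = FastText.load("fasttext.model")
# restored.train(more_sentences, total_examples=len(more_sentences), epochs=restored.epochs)  # hypothetical extra data
# Option 2: save only the word vectors; smaller, but no further training is possible.
model.wv.save_word2vec_format("vectors.txt")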

Maybe I am late in answering this, but you can find your answer in the documentation: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
Example use cases
This library has two main use cases: word representation learning and text classification. These were described in the two papers 1 and 2.
Word representation learning
In order to learn word vectors, as described in 1, do:
$ ./fasttext skipgram -input data.txt -output model
where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.

Related

EfficientDet mAP evaluation on custom dataset

I'm trying to run mAP_evaluation.py to get a mAP evaluation on my own dataset:
https://github.com/Tessellate-Imaging/Monk_Object_Detection/tree/master/4_efficientdet/lib
but I think the whole Python file is written for the COCO dataset only, and if I use the function evaluate_coco() I don't know how to adapt my dataset to match what the function expects. Please help.
P.S.: I have already trained and exported the EfficientDet model (.pth file) and run predictions on test images/videos; I just don't know how to evaluate.
You can fix the issue like this. The CocoDataset constructor has the signature:
def __init__(self, root_dir, img_dir='images', set_dir='train2017', transform=None):
so in mAP_evaluation.py I set it up for my own data as follows:
dataset_val = CocoDataset("/content/Monk_Object_Detection/4_efficientdet/lib/data/pothole",
                          img_dir='images', set_dir='val2017',
                          transform=transforms.Compose([Normalizer(), Resizer()]))
evaluate_coco(dataset_val, efficientdet)

Understanding usage of glove vectors

I used the following code to use GloVe vectors for word embeddings:
from gensim.scripts.glove2word2vec import glove2word2vec #line1
glove_input_file = 'glove.840B.300d.txt' #line2
word2vec_output_file = 'glove.word2vec' #line3
glove2word2vec(glove_input_file, word2vec_output_file) #line4
from gensim.models import KeyedVectors #line5
glove_w2vec = KeyedVectors.load_word2vec_format('glove.word2vec', binary=False) #line6
I understand that this chunk of code is for using GloVe pretrained vectors for word embeddings, but I am not sure what is happening in each line. Why convert GloVe to word2vec format? What does KeyedVectors.load_word2vec_format do exactly?
Both the GloVe algorithm and word2vec create word-vectors: a vector per word.
But the formats for storing those vectors are slightly different. The gensim glove2word2vec() function converts a file in GloVe format to the format used by the original Google word2vec.c code.
https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Meanwhile, the gensim KeyedVectors.load_word2vec_format() method can load vectors in that word2vec.c format, into an instance of KeyedVectors (or one of its same-interface subclasses), for easy lookup and other common word-vector operations.
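For illustration, a minimal sketch of what you can do once load_word2vec_format has returned a KeyedVectors instance (using the file name from the question above):
from gensim.models import KeyedVectors
glove_w2vec = KeyedVectors.load_word2vec_format('glove.word2vec', binary=False)
vec = glove_w2vec['summer']                        # the 300-d vector for a word
print(glove_w2vec.most_similar('summer', topn=3))  # nearest neighbours by cosine similarity
print(glove_w2vec.similarity('summer', 'winter'))  # cosine similarity between two words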

Training classifier with large data

I was trying two-class text classification. Usually I create pickle files of the trained model and load them in the training phase to avoid retraining.
With 12,000 reviews plus more than 50,000 tweets for each class, the trained model grows to 1.4 GB.
Storing such a large model in a pickle and loading it back is really not feasible or advisable.
Is there any better alternative for this scenario?
Here is sample code. I tried multiple ways of pickling; here I have used the dill package:
def train(self):
    global pos, neg, totals
    retrain = False
    # Load counts if they already exist.
    if not retrain and os.path.isfile(CDATA_FILE):
        # pos, neg, totals = cPickle.load(open(CDATA_FILE))
        pos, neg, totals = dill.load(open(CDATA_FILE, 'r'))
        return
    for file in os.listdir("./unsuspected/"):
        for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./suspected/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    self.prune_features()
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    countdata = (pos, neg, totals)
    dill.dump(countdata, open(CDATA_FILE, 'w'))
UPDATE: The reason behind the large pickle is that the classification data is very large, and I consider 1-4 grams for feature selection. The classification dataset itself is around 300 MB, so the multigram approach to feature selection creates a large training model.
Pickle is a very heavy format: it stores all the details of the objects.
It would be much better to store your data in an efficient format like HDF5.
If you are not familiar with HDF5, you can look into storing your data in simple flat text files. You can use CSV or JSON, depending on your data structure. You'll find that either is more efficient than pickle.
You can also look at gzip to create and load compressed archives.
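For example, a minimal sketch of the flat-file idea (Python 3 shown), assuming pos and neg are plain count dictionaries and totals is a list, using gzip-compressed JSON instead of a pickle (the file name is just an example):
import gzip
import json
# write the count data as gzip-compressed JSON instead of pickling it
with gzip.open("counts.json.gz", "wt", encoding="utf-8") as f:
    json.dump({"pos": pos, "neg": neg, "totals": totals}, f)
# ...and read it back later
with gzip.open("counts.json.gz", "rt", encoding="utf-8") as f:
    data = json.load(f)
pos, neg, totals = data["pos"], data["neg"], data["totals"]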
The problem and solution are explained here. In short, the problem is that when doing featurization, e.g. using CountVectorizer, even though you might ask for a small number of features (e.g. max_features=1000), the transformer still keeps a copy of all possible features under the hood, for debugging purposes.
For instance, CountVectorizer has the following attribute:
stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
and this causes the model size to become too large. To solve the issue, you can set stop_words_ to None before pickling your model (taken from the example in the link above; please check the link for details):
import pickle
model_name = 'clickbait-model-sm.pkl'
cfr_pipeline.named_steps.vectorizer.stop_words_ = None
pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)

Where does nlpnet get its metadata pickle file from?

I have installed nlpnet (http://nilc.icmc.usp.br/nlpnet/), but I can't locate the metadata_pos.pickle file it needs to run a part-of-speech tagger. This file does not appear to be on my machine and is not included in the current GitHub repository.
Any suggestions?
You need to download the nlpnet-data models (for PoS, SRL and dependency parsing). They are available at http://nilc.icmc.usp.br/nlpnet/models.html . The PoS tagging model file metadata_pos.pickle is included in http://nilc.icmc.usp.br/nlpnet/data/pos-pt.tgz
You need to download the models from this page: http://nilc.icmc.usp.br/nlpnet/models.html (either POS or SRL).
Decompress the file into some folder, say '/Users/Downloads', then point nlpnet to it in your code like this:
import nlpnet
nlpnet.set_data_dir('/Users/Downloads/pos-pt')
# Now you can start using it
tagger = nlpnet.POSTagger()
op = tagger.tag('texto em portugues')
To train the model, you'll need examples with one sentence per line, having tokens and tags concatenated by an underscore character:
This_DT is_VBZ an_DT example_NN
Using this command with your corpus, you'll generate the data needed to use the POS tagger (including metadata_pos.pickle):
nlpnet-train.py pos --gold /path/to/training-data.txt
If you want to use an already trained model, they have one here. It was trained and evaluated on the Mac-Morpho corpus, a Brazilian-Portuguese news corpus, so it probably won't work with other languages.

Using sparse matrices/online learning in Naive Bayes (Python, scikit)

I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, each with 150k features. I've tried to implement the code from the following link:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
The problem is (as I understand it) that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict for labels):
Traceback (most recent call last):
File "skitest.py", line 96, in <module>
classif.train(add_label(matr, labels))
File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
for f in fs.iterkeys():
File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
return _cs_matrix.__getattr__(self, attr)
File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
raise AttributeError, attr + " not found"
AttributeError: iterkeys not found
My question is: is there a way to either avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use efficiently in this case instead of dok_matrix? Or am I missing something obvious?
Thanks for anyone's time. :)
EDIT, 6th Sep:
Found the iterkeys, so at least the code runs. It's still too slow: it has taken several hours with a dataset of size 32k and still hasn't finished. Here's what I've got at the moment:
matr = dok_matrix((6000000, 150000), dtype=float32)
labels = OrderedDict()
#collect the data into the matrix
pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets-foldsize)]
classif.train(add_label(matr[:(lentweets-foldsize),0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets-foldsize)]
data = np.array(classif.batch_classify(readrow))
The problem might be that each row that is taken doesn't exploit the sparseness of the vector but instead goes through all 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets. (*)
If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)
You now have a training set X in the CSR sparse matrix format that you can feed to a Naive Bayes classifier, provided you also have a list of labels y (perhaps derived from the file names, if you encoded the class in them):
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
If it turns out this doesn't work because the set of documents is too large (unlikely since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
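A rough sketch of that out-of-core setup, assuming a reasonably recent scikit-learn and a hypothetical iter_minibatches() generator that yields batches of raw texts together with their labels:
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
# alternate_sign=False keeps the hashed counts non-negative, as MultinomialNB requires
vect = HashingVectorizer(n_features=2**18, alternate_sign=False)
nb = MultinomialNB()
all_classes = np.array([0, 1])  # hypothetical label set
for texts, y in iter_minibatches():  # hypothetical generator of (list_of_texts, labels)
    X = vect.transform(texts)        # hashing is stateless, so no separate fit pass is needed
    nb.partial_fit(X, y, classes=all_classes)  # classes must be given on the first call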
(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.
