I want to load a pre-trained embedding to initialize my own unsupervised FastText model and retrain it with my dataset.
The trained embedding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try:
FastText.load_fasttext_format('model.txt') I get: NotImplementedError: Supervised fastText models are not supported.
Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if it is possible, is it a bad idea?
I know there is a great difference between supervised and unsupervised models, but I really want to try using/converting this one and retraining it. I haven't found a trained unsupervised model to load for my case (it's a Portuguese dataset), and the best model I've found is that one.
If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)
However, Gensim's FastText doesn't support pre-loading a simple set of vectors for further training: for continued training, it needs a full FastText model, either from Facebook's binary format or from a prior Gensim FastText model saved via .save().
(That trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is mistakenly interpreting it as some other kind of binary FastText model it doesn't support.)
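For reference, a minimal sketch of the two kinds of full models that gensim's FastText can continue training from (the file paths and new_corpus are hypothetical placeholders):

from gensim.models import FastText
from gensim.models.fasttext import load_facebook_model

# A full model in Facebook's native binary format (path is hypothetical)
ft = load_facebook_model('cc.pt.300.bin')

# ...or a full model previously saved with gensim's own .save() (path is hypothetical)
ft = FastText.load('my_fasttext.model')

# Either kind retains the internal weights needed for continued training, e.g.:
# ft.build_vocab(new_corpus, update=True)
# ft.train(new_corpus, total_examples=ft.corpus_count, epochs=ft.epochs)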
Update after comment below:
Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.
You can create an FT model with a compatible/overlapping vocabulary, load the old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vector in the new model. (Note that the property that affects further training is actually ftModel.wv.vectors_vocab, the trained-up full-word vectors, not .vectors, which is composited from full-word & ngram vectors.)
But the tradeoffs of such an ad-hoc strategy are many. The ngrams would still start random. Taking some prior model's just-word vectors isn't quite the same as a FastText model's full-words-to-be-later-mixed-with-ngrams.
You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
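For illustration only, a rough sketch of that ad-hoc strategy, assuming gensim 4.x, a tokenized corpus my_corpus of your own texts, and prior vectors that load as plain KeyedVectors:

from gensim.models import FastText, KeyedVectors

prior = KeyedVectors.load_word2vec_format('model.txt')   # the plain prior word-vectors

ft = FastText(vector_size=prior.vector_size, min_count=5)
ft.build_vocab(corpus_iterable=my_corpus)

# Overwrite the randomly-initialized full-word vectors with the prior ones, where vocabularies overlap
for word in ft.wv.index_to_key:
    if word in prior:
        ft.wv.vectors_vocab[ft.wv.get_index(word)] = prior[word]

ft.train(corpus_iterable=my_corpus, total_examples=ft.corpus_count, epochs=ft.epochs)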
Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)
So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices and thus where you'd want to re-verify lots of assumptions, and rigorously evaluate if those extra steps/approximations are actually improving things.
I trained a machine-learning sentence-classification model that uses, among other features, vectors obtained from a pretrained fastText model (like these), which is 7 GB. I use the pretrained fastText Italian model, and I am using this word embedding only to get some semantic features to feed into the effective ML model.
I built a simple API based on fastText that, at prediction time, computes the vectors needed by the effective ML model. Under the hood, this API receives a string as input and calls get_sentence_vector. When the API starts, it loads the fastText model into memory.
How can I reduce the memory footprint of fastText, which is loaded into RAM?
Constraints:
My model works fine, training was time-consuming and expensive, so I wouldn't want to retrain it using smaller vectors
I need the fastText ability to handle out-of-vocabulary words, so I can't use just vectors but I need the full model
I should reduce the RAM usage, even at the expense of a reduction in speed.
At the moment, I'm starting to experiment with compress-fasttext...
Please share your suggestions and thoughts even if they do not represent full-fledged solutions.
There is no easy solution for my specific problem: if you are using a fastText embedding as a feature extractor and then want to use a compressed version of that embedding, you have to retrain the final classifier, since the produced vectors are somewhat different.
Anyway, I want to give a general answer about
fastText model reduction
Unsupervised models (=embeddings)
You are using pretrained embeddings provided by Facebook, or you trained your own embeddings in an unsupervised fashion (.bin format). Now you want to reduce model size/memory consumption.
Straight-forward solutions:
compress-fasttext library: compress fastText word embedding models by orders of magnitude, without significantly affecting their quality; there are also available several pretrained compressed models (other interesting compressed models here).
fastText native reduce_model: in this case, you are reducing the vector dimension (e.g. from 300 to 100), so you are explicitly losing expressiveness; under the hood, this method employs PCA (see the sketch after this list).
If you have training data and can perform retraining, you can use floret, a fastText fork by Explosion (the company behind spaCy) that uses a more compact representation for vectors.
If you are not interested in fastText's ability to represent out-of-vocabulary words (words not seen during training), you can use the .vec file (containing only vectors, not model weights) and select only a portion of the most common vectors (e.g. the first 200k words/vectors). If you need a way to convert .bin to .vec, read this answer.
Note: the gensim package fully supports fastText embeddings (unsupervised mode), so these operations can also be done through that library (more details in this answer).
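A hedged sketch of the reduce_model and .vec-truncation options above, assuming the official fasttext Python bindings and hypothetical file names (cc.it.300.bin / cc.it.300.vec):

import fasttext
import fasttext.util
from gensim.models import KeyedVectors

# Native dimensionality reduction (PCA under the hood): 300 -> 100 dimensions
model = fasttext.load_model('cc.it.300.bin')
fasttext.util.reduce_model(model, 100)
model.save_model('cc.it.100.bin')

# If out-of-vocabulary handling isn't needed: keep only the most common vectors from the .vec file
vectors = KeyedVectors.load_word2vec_format('cc.it.300.vec', limit=200000)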
Supervised models
You used fastText to train a classifier, producing a .bin model. Now you want to reduce classifier size/memory consumption.
The best solution is fastText native quantize: the model is retrained while applying weight quantization and feature selection. With the retrain parameter, you can decide whether to fine-tune the embeddings or not.
You can still use fastText reduce_model, but it leads to less expressive models and the size of the model is not heavily reduced.
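A hedged sketch of native quantization, assuming the official fasttext Python bindings and hypothetical file names:

import fasttext

# Train (or load) a supervised classifier
model = fasttext.train_supervised('train.txt')

# Retrain with weight quantization and feature selection; retrain=True also fine-tunes the embeddings
model.quantize(input='train.txt', qnorm=True, retrain=True, cutoff=100000)
model.save_model('model.ftz')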
I am currently working on a system that extracts certain features from 3D objects (voxel grids, to be precise), and I would like to compare those features to automatically learned features in terms of classification performance in a TensorFlow CNN with some other data, but that is not the point here, just background.
My idea was to take a dataset (ModelNet10), train a TensorFlow CNN to classify it, and then use what the network learned there on my dataset, not to classify, but to extract features.
So I want to throw away everything the CNN does, except for what it extracts from the objects.
Is there any way to get these features? And how do I do that? I certainly have no idea.
Yes, it is possible to use trained models exclusively for feature extraction. This is called transfer learning: you can either train your own model and then extract the features, or you can extract features from a pre-trained model and use them in your task, if your task is similar in nature to what the pre-trained model was trained for. You can find a lot of material online on these topics; the links below give details on how you can go about it (a short sketch follows the links):
https://keras.io/api/applications/
https://keras.io/guides/transfer_learning/
https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/
https://www.pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/
https://www.kaggle.com/angqx95/feature-extractor-fine-tuning-with-keras
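As a concrete illustration of the feature-extraction idea (the file name, layer choice, and voxel_batch variable are hypothetical; adapt them to your own trained model):

import tensorflow as tf

# Assume `trained_model` is the CNN you trained on ModelNet10
trained_model = tf.keras.models.load_model('modelnet10_cnn.h5')

# Build a new model that stops at an intermediate layer,
# e.g. the layer just before the classification head
feature_extractor = tf.keras.Model(
    inputs=trained_model.input,
    outputs=trained_model.layers[-2].output,
)

# `voxel_batch` would be a batch of your own voxel grids, shaped like the training inputs
features = feature_extractor.predict(voxel_batch)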
I am using fastText models in my Python library (via the official fasttext package). To run my unit tests, I need a model (a fasttext.FastText._FastText object) that is as light as possible, so that I can version it in my repo.
I have tried creating a fake text dataset "fake.txt" with 5 lines and a few words, and called:
import fasttext
import fasttext.util

model = fasttext.train_unsupervised("./fake.txt")
fasttext.util.reduce_model(model, 2)
model.save_model("fake_model.bin")
It basically works, but the model is 16 MB. That is kind of OK for a unit-test resource, but do you think I can go below this?
Note that FastText (& similar dense word-vector models) don't perform meaningfully when using toy-sized data or parameters. (All their useful/predictable/testable benefits depend on large, varied datasets & the subtle arrangements of many final vectors.)
But, if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that would make a FastText model larger without regard to the tiny training-set is the bucket parameter, with a default value of 2000000. It will allocate that many character-ngram (word-fragment) slots, even if all your actual words don't have that many ngrams.
Setting bucket to some far-smaller value, in initial model creation, should make your plug/stand-in file far smaller as well.
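For example, a small sketch assuming the official fasttext Python bindings (the exact values are only illustrative):

import fasttext

# Tiny, meaningless model for test fixtures: small dim and, crucially, a tiny bucket count
model = fasttext.train_unsupervised('./fake.txt', dim=10, bucket=1000, minCount=1, epoch=1)
model.save_model('fake_model.bin')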
I wonder how to deploy a doc2vec model in production to create word vectors as input features to a classifier. To be specific, let say, a doc2vec model is trained on a corpus as follows.
import multiprocessing
from gensim.models import doc2vec

cores = multiprocessing.cpu_count()

dataset['tagged_descriptions'] = dataset.apply(lambda x: doc2vec.TaggedDocument(
    words=x['text_columns'], tags=[str(x.ID)]), axis=1)
model = doc2vec.Doc2Vec(vector_size=100, min_count=1, epochs=150, workers=cores,
                        window=5, hs=0, negative=5, sample=1e-5, dm_concat=1)
corpus = dataset['tagged_descriptions'].tolist()
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
and then it is dumped into a pickle file. The resulting vectors are used to train a classifier such as a random forest to predict movie sentiment.
Now suppose that in production there is a document containing some totally new vocabulary words, i.e. words that were not present during the training of the doc2vec model. I wonder how to handle such a case.
As a side note, I am aware of Updating training documents for gensim Doc2Vec model and Gensim: how to retrain doc2vec model using previous word2vec model. However, I would appreciate more lights to be shed on this matter.
A Doc2Vec model will only be able to report trained-up vectors for documents that were present during training, and only be able to infer_vector() new doc-vectors for texts containing words that were present during training. (Unrecognized words passed to .infer_vector() will be ignored, similar to the way any words appearing fewer than min_count times are ignored during training.)
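For example (a minimal sketch, assuming model is your trained Doc2Vec and the tokens are hypothetical):

# Unknown words such as "unseenword123" are silently ignored during inference
new_doc_tokens = ["great", "movie", "unseenword123"]
vector = model.infer_vector(new_doc_tokens)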
If over time you acquire many new texts with new vocabulary words, and those words are important, you'll have to occasionally re-train the Doc2Vec model. And, after re-training, the doc-vectors from the re-trained model will generally not be comparable to doc-vectors from the original model – so downstream classifiers and other applications using the doc-vectors will need updating, as well.
Your own production/deployment requirements will drive how often this re-training should happen, and old models replaced with newer ones.
(While a Doc2Vec model can be fed new training data at any time, doing so incrementally as a sort of 'fine-tuning' introduces hairy issues of balance between old and new data. And, there's no official gensim support for expanding the existing vocabulary of a Doc2Vec model. So, the most robust course is to retrain from scratch using all available data.)
A few side notes on your example training code (an adjusted configuration sketch follows these notes):
it's rare for min_count=1 to be a good idea: rare words lack sufficient usage examples to be modeled well, and thus act as 'noise' that just slows/interferes with the patterns that can be learned from more-common words
dm_concat=1 is best considered an experimental/advanced mode, as it makes models significantly larger and slower to train, with unproven benefits.
much published work uses just 10-20 training epochs; smaller datasets or smaller docs will sometimes benefit from more, but 150 may be taking a lot of time with very little marginal benefit.
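Putting those notes together, a hedged sketch of an adjusted configuration (the values are illustrative starting points, not prescriptions, and cores is assumed to be defined as in your code):

model = doc2vec.Doc2Vec(
    vector_size=100,
    min_count=5,      # discard very-rare words instead of min_count=1
    epochs=20,        # in the range of much published work, instead of 150
    window=5,
    hs=0,
    negative=5,
    sample=1e-5,
    workers=cores,
)                     # default dm_concat=0 avoids the larger/slower concatenation mode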
Does anyone know which function should I use if I want to use the pre-trained doc2vec models in this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but is there a similar function to load pre-trained doc2vec models in gensim?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other filenames with extra extensions, and all those files must be kept together to re-load a fully-functional model. (You only need to specify the main save file; the auxiliary files will be discovered at their expected names alongside it in the same directory.)
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or later gensims
it's not completely clear what parameters were used to train those models (though I suppose if you succeed in loading them you could see them as properties in the model), and how much meta-optimization was used for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or apparent model-size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
Try this:
import gensim.models as g

# point to the downloaded pre-trained doc2vec model
model = "model_folder/doc2vec.bin"

# load model
m = g.Doc2Vec.load(model)