Reduce fastText memory usage for big models - python

I trained a machine learning sentence classification model that uses, among other features, also the vectors obtained from a pretrained fastText model (like these) which is 7Gb. I use the pretrained fastText Italian model: I am using this word embedding only to get some semantic features to feed into the effective ML model.
I built a simple API based on fastText that, at prediction time, computes the vectors needed by the effective ML model. Under the hood, this API receives a string as input and calls get_sentence_vector. When the API starts, it loads the fastText model into memory.
How can I reduce the memory footprint of fastText, which is loaded into RAM?
Constraints:
My model works fine, training was time-consuming and expensive, so I wouldn't want to retrain it using smaller vectors
I need the fastText ability to handle out-of-vocabulary words, so I can't use just vectors but I need the full model
I should reduce the RAM usage, even at the expense of a reduction in speed.
At the moment, I'm starting to experiment with compress-fasttext...
Please share your suggestions and thoughts even if they do not represent full-fledged solutions.

There is no easy solution for my specific problem: if you are using a fastText embedding as a feature extractor, and then you want to use a compressed version of this embedding, you have to retrain the final classifier, since produced vectors are somewhat different.
Anyway, I want to give a general answer for
fastText models reduction
Unsupervised models (=embeddings)
You are using pretrained embeddings provided by Facebook or you trained your embeddings in an unsupervised fashion. Format .bin. Now you want to reduce model size/memory consumption.
Straight-forward solutions:
compress-fasttext library: compress fastText word embedding models by orders of magnitude, without significantly affecting their quality; there are also available several pretrained compressed models (other interesting compressed models here).
fastText native reduce_model: in this case, you are reducing vector dimension (eg from 300 to 100), so you are explictly losing expressiveness; under the hood, this method employs PCA.
If you have training data and can perform retraining, you can use floret, a fastText fork by explosion (the company of Spacy), that uses a more compact representation for vectors.
If you are not interested in fastText ability to represent out-of-vocabulary words (words not seen during training), you can use .vec file (containing only vectors and not model weights) and select only a portion of the most common vectors (eg the first 200k words/vectors). If you need a way to convert .bin to .vec, read this answer.
Note: gensim package fully supports fastText embedding (unsupervised mode), so these operations can be done through this library (more details in this answer)
Supervised models
You used fastText to train a classifier, producing a .bin model. Now you want to reduce classifier size/memory consumption.
The best solution is fastText native quantize: the model is retrained applying weights quantization and feature selection. With the retrain parameter, you can decide whether to fine-tune the embeddings or not.
You can still use fastText reduce_model, but it leads to less expressive models and the size of the model is not heavily reduced.

Related

Why is gensim FastText model smaller in size than the native Fasttext model by Facebook?

It seems that the Gensim's implementation in FastText leads to a smaller model size than Facebook's native implementation. With a corpus of 1 million words, the fasttext native model is is 6GB, while the gensim fasttext model size is only 68MB.
Is there any information stored in Facebook's implementation not present in Gensim's implementation?
Please show which models generated this comparison, or what process was used. It probably has bugs/misunderstandings.
The size of a model is more influenced by the number of unique words (and character n-gram buckets) than the 'corpus' size.
The saved sizes of a Gensim-trained FastText model, or a native Facebook FastText-trained model, should be roughly in the same ballpark. Be sure to include all subsidiary raw numpy files (ending .npy, alongside the main save-file) created by Gensim's .save() - as all such files are required to re-.load() the model!
Similarly, if you were to load a Facebook FastText model into Gensim, then use Gensim's .save(), the total disk space taken in both alternate formats should be quite close.

word-embendding: Convert supervised model into unsupervised model

I'm wanna to load an pre-trained embendding to initalize my own unsupervise FastText model and retrain with my dataset.
The trained embendding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try:
FastText.load_fasttext_format('model.txt') I get: NotImplementedError: Supervised fastText models are not supported.
Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if possible, is it a bad idea?
I know that has an great difference between supervised and unsupervised models. But I really wanna try use/convert this and retrain it. I'm not finding a trained unsupervised model to load for my case (it's a portuguese dataset), and the best model I find is that
If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)
However, Gensim's FastText doesn't support preloading a simple set of vectors for further training - for continued training, it needs a full FastText model, either from Facebook's binary format, or a prior Gensim FastText model .save().
(That trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is momentarily mis-interpreting it as some other kind of binary FastText model it doesn't support.)
Update after comment below:
Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.
You can create an FT model with a compatible/overlapping vocabulary, load old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vectors in the new model. (Note that the property to affect further training is actually ftModel.wv.vectors_vocab trained-up full-word vectors, not the .vectors which is composited from full-words & ngrams,)
But the tradeoffs of such an ad-hoc strategy are many. The ngrams would still start random. Taking some prior model's just-word vectors isn't quite the same as a FastText model's full-words-to-be-later-mixed-with-ngrams.
You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)
So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices and thus where you'd want to re-verify lots of assumptions, and rigorously evaluate if those extra steps/approximations are actually improving things.

Using tensorflow classification for feature extraction

I am currently working on a system that extracts certain features out of 3D-objects (Voxelgrids to be precise), and i would like to compare those features to automatically made features when it comes to performance (classification) in a tensorflow cNN with some other data, but that is not the point here, just for background.
My idea now was, to take a dataset (modelnet10), train a tensorflow cNN to classify them, and then use what it learned there on my dataset - not to classify, but to extract features.
So i want to throw away everything the cnn does,except for what it takes from the objects.
Is there anyway to get these features? and how do i do that? i certainly have no idea.
Yes, it is possible to train models exclusively for feature extraction. This is called transfer learning where you can either train your own model and then extract the features or you can extract features from pre-trained models and then use it in your task if your task is similar in nature to that of what the pre-trained model was trained for. You can of course find a lot of material online for these topics. However, I am providing some links below which give details on how you can go about it:
https://keras.io/api/applications/
https://keras.io/guides/transfer_learning/
https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/
https://www.pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/
https://www.kaggle.com/angqx95/feature-extractor-fine-tuning-with-keras

How to use doc2vec model in production?

I wonder how to deploy a doc2vec model in production to create word vectors as input features to a classifier. To be specific, let say, a doc2vec model is trained on a corpus as follows.
dataset['tagged_descriptions'] = datasetf.apply(lambda x: doc2vec.TaggedDocument(
words=x['text_columns'], tags=[str(x.ID)]), axis=1)
model = doc2vec.Doc2Vec(vector_size=100, min_count=1, epochs=150, workers=cores,
window=5, hs=0, negative=5, sample=1e-5, dm_concat=1)
corpus = dataset['tagged_descriptions'].tolist()
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
and then it is dumped into a pickle file. The word vectors are used to train a classifier such as random forests to predict movies sentiment.
Now suppose that in production, there is a document entailing some totally new vocabularies. That being said, they were not among the ones present during the training of the doc2vec model. I wonder how to tackle such a case.
As a side note, I am aware of Updating training documents for gensim Doc2Vec model and Gensim: how to retrain doc2vec model using previous word2vec model. However, I would appreciate more lights to be shed on this matter.
A Doc2Vec model will only be able to report trained-up vectors for documents that were present during training, and only be able to infer_vector() new doc-vectors for texts containing words that were present during training. (Unrecognized words passed to .infer_vector() will be ignored, similar to the way any words appearing fewer than min_count times are ignored during training.)
If over time you acquire many new texts with new vocabulary words, and those words are important, you'll have to occasionally re-train the Doc2Vec model. And, after re-training, the doc-vectors from the re-trained model will generally not be comparable to doc-vectors from the original model – so downstream classifiers and other applications using the doc-vectors will need updating, as well.
Your own production/deployment requirements will drive how often this re-training should happen, and old models replaced with newer ones.
(While a Doc2Vec model can be fed new training data at any time, doing so incrementally as a sort of 'fine-tuning' introduces hairy issues of balance between old and new data. And, there's no official gensim support for expanding existing the vocabulary of a Doc2Vec model. So, the most robust course is to retrain from scratch using all available data.)
A few side notes on your example training code:
it's rare for min_count=1 to be a good idea: rare words often serve as 'noise', without sufficient usage examples to model well, and thus 'noise' that just slows/interferes with patterns that can be learned from more common-words
dm_concat=1 is best considered an experimental/advanced mode, as it makes models significantly larger and slower to train, with unproven benefits.
much published work uses just 10-20 training epochs; smaller datasets or smaller docs will sometimes benefit from more, but 150 may be taking a lot of time with very little marginal benefit.

Python: Train a word2vec model using vectors as input

I am using Python to train a word2vec model and get embeddings for each word in vocabulary. I used gensim to do this before, and I also notice that such model can be trained by tools like TensorFlow, Theano, and so on..
However, during these training processes, the inputs are just texts which are basically in string format, then the words will be mapped to index for training. In my case, I want to input arrays for training. These arrays, can be one-hot encoded vectors or other vectors after some designed manipulation.
So, is there existing tool which trains word2vec model by inputting vectors? If there is no such tools, any recommendation for me to learn so that I can write my own code?

Categories