Is it possible to fine-tune FastText models - python

I'm working on a project for text similarity using FastText, the basic example I have found to train a model is:
from gensim.models import FastText
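# note: these are gensim 3.x parameter names; in gensim 4.x, size= and iter= became vector_size= and epochs=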
model = FastText(tokens, size=100, window=3, min_count=1, iter=10, sorted_vocab=1)
As I understand it, since I'm specifying the vector and n-gram size, the model is being trained from scratch here, and if the dataset is small I wouldn't expect great results.
The other option I have found is to load the original Wikipedia model which is a huge file:
from gensim.models.wrappers import FastText
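# note: gensim.models.wrappers was removed in gensim 4.x; use gensim.models.fasttext.load_facebook_model instead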
model = FastText.load_fasttext_format('wiki.simple')
My question is, can I load the Wikipedia or any other model, and fine tune it with my dataset?

If you have a labelled dataset, then you should be able to fine-tune to it. This GitHub issue explains that you want to use the pretrainedVectors option. You would start with the Wikipedia pretrained vectors, then train on your dataset. It seems that gensim can do this, but according to this GH issue, there have been some bugs.
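For the gensim route, here is a minimal sketch, assuming gensim 4.x; wiki.simple.bin is the full binary model from the question, and the toy corpus stands in for your real tokenized data:
from gensim.models.fasttext import load_facebook_model

my_sentences = [['text', 'similarity', 'project'], ['fine', 'tuning', 'example']]

# Load the full Facebook binary model (.bin), which keeps the subword n-gram buckets
model = load_facebook_model('wiki.simple.bin')
# Add any new words from your corpus to the existing vocabulary
model.build_vocab(my_sentences, update=True)
# Continue training from the pretrained weights
model.train(my_sentences, total_examples=len(my_sentences), epochs=model.epochs)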

Related

Does a word2vec model exist in French?

Is there a pre-trained word2vec model for the French language? Ideally it would come with an API that lets me fine-tune it easily. I was thinking of gensim but can't find such a model for French.
You could try one of Facebook's published pre-trained FastText models: https://fasttext.cc/docs/en/crawl-vectors.html
Their text versions, with just whole-word vectors, can be loaded as read-only KeyedVectors instances in Gensim. Their full binary models can be loaded as a FastText model that (technically) supports additional training, but I've never seen a reliable writeup on how to do such fine-tuning in standard models.
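For example, a minimal sketch, assuming you have downloaded and unpacked the French files (cc.fr.300.vec and cc.fr.300.bin) from that page:
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_model

# Text format (.vec): whole-word vectors only, read-only
fr_vectors = KeyedVectors.load_word2vec_format('cc.fr.300.vec')
print(fr_vectors.most_similar('bonjour'))

# Full binary model (.bin): also covers out-of-vocabulary words via character n-grams
fr_model = load_facebook_model('cc.fr.300.bin')
print(fr_model.wv.most_similar('bonjour'))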
If you think you need to do fine-tuning, maybe you should just train your own model, which in its training set includes all the words/senses you need, from the start?
You can find a number of different models trained on different sets of French data with various parameters here: https://fauconnier.github.io/#data Note, however, that they were already produced in 2015.

Fine-tune a BERT model for context-specific embeddings

I'm trying to find information on how to train a BERT model, possibly from the Huggingface Transformers library, so that the embeddings it outputs are more closely related to the context of the text I'm using.
However, all the examples that I'm able to find are about fine-tuning the model for another task, such as classification.
Would anyone happen to have an example of fine-tuning BERT on masked tokens or next sentence prediction, one that outputs another raw BERT model fine-tuned to the context?
Thanks!
Here is an example from the Transformers library on fine-tuning a language model for masked token prediction.
The model used is one of the *ForMaskedLM family (e.g. BertForMaskedLM). The idea is to create a dataset using TextDataset, which tokenizes the text and breaks it into chunks. Then use a DataCollatorForLanguageModeling to randomly mask tokens in the chunks during training, and pass the model, the data and the collator to the Trainer to train and evaluate the results.
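A minimal sketch of that recipe, assuming a recent Transformers version (TextDataset is deprecated in newer releases but still works); the corpus file and output directory names are placeholders:
from transformers import (BertTokenizerFast, BertForMaskedLM, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Tokenize the raw text and break it into fixed-size chunks
dataset = TextDataset(tokenizer=tokenizer, file_path='my_corpus.txt', block_size=128)

# Randomly mask 15% of the tokens in each batch during training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert-finetuned', num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model('bert-finetuned')  # reload later as a plain BERT model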

Load a quantized fastText model (.ftz) and look up words

I have a pretrained embeddings file in .ftz format, which was quantized. I need it to look up words and find the nearest neighbours, but I can't find any toolkit that can do that: fastText can load the embeddings file yet is not able to look up the nearest neighbours, and Gensim can look up the nearest neighbours but is not able to load the model...
Or it's me not finding the right function?
Thank you!
FastText models come in two flavours:
- unsupervised models, which produce word embeddings and can find similar words; the native Facebook package does not support quantization for them.
- supervised models, which are used for text classification and can be quantized natively (see the sketch after this list), but generally do not produce meaningful word embeddings.
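For completeness, a minimal sketch of native quantization of a supervised model with Facebook's fasttext package; 'train.txt' is a placeholder and must be in fastText's __label__ training format:
import fasttext

# Train a supervised classifier, then quantize it into a compact model
model = fasttext.train_supervised(input='train.txt')
model.quantize(input='train.txt', retrain=True)
model.save_model('model.ftz')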
To compress unsupervised models, I have created a package compress-fasttext, which is a wrapper around Gensim that can reduce the size of unsupervised models by pruning and quantization. This post describes it in more detail.
With this package, you can look up similar words in small models as follows:
import compress_fasttext
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model.most_similar('Python'))
# [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897), ... ]
Of course, it works only if the model has been compressed using the same package. You can compress your own unsupervised model this way:
import compress_fasttext
from gensim.models.fasttext import load_facebook_model
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')

How to load a pre-trained Word2vec MODEL File and reuse it?

I want to use a pre-trained word2vec model, but I don't know how to load it in Python.
This file is a MODEL file (703 MB).
It can be downloaded here:
http://devmount.github.io/GermanWordEmbeddings/
Just for loading:
import gensim
# Load pre-trained Word2Vec model.
model = gensim.models.Word2Vec.load("modelName.model")
Now you can train the model as usual. Also, if you want to be able to save it and retrain it multiple times, here's what you should do:
model.train(...)  # insert proper parameters here
"""
If you don't plan to train the model any further, calling
init_sims will make the model much more memory-efficient
If `replace` is set, forget the original vectors and only keep the normalized
ones = saves lots of memory!
replace=True if you want to reuse the model
"""
model.init_sims(replace=True)
# save the model for later use
# for loading, call Word2Vec.load()
model.save("modelName.model")
Use KeyedVectors to load the pre-trained model:
from gensim.models import KeyedVectors

word2vec_path = 'path/GoogleNews-vectors-negative300.bin.gz'
w2v_model = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
I used the same model in my code and since I couldn't load it, I asked the author about it. His answer was that the model has to be loaded in binary format:
gensim.models.KeyedVectors.load_word2vec_format(w2v_path, binary=True)
This worked for me, and I think it should work for you, too.
I met the same issue, and I downloaded GoogleNews-vectors-negative300 from Kaggle. I saved and extracted the file on my desktop. Then I implemented this code in Python and it worked well:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(r'C:/Users/juana/descktop/archive/GoogleNews-vectors-negative300.bin', binary=True)

How to initialize a new word2vec model with pre-trained model weights?

I am using the Gensim library in Python for using and training word2vec models. Recently, I was looking at initializing my model weights with some pre-trained word2vec model, such as the GoogleNews pretrained model. I have been struggling with it for a couple of weeks. Now I have just found that gensim has a function that can help me initialize the weights of my model with pre-trained model weights. It is mentioned below:
reset_from(other_model)
Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.
I don't know whether this function can do the same thing or not. Please help!
You can now do incremental training with gensim. I would recommend loading the pretrained model and then doing an update.
from gensim.models import Word2Vec
model = Word2Vec.load('pretrained_model.emb')
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)
