Is there a pre-trained word2vec model for the French language? Ideally, it would come with an API that lets me fine-tune it easily. I was thinking of Gensim, but I can't find such a model in French.
You could try one of Facebook's published pre-trained FastText models: https://fasttext.cc/docs/en/crawl-vectors.html
Their text versions, with just whole-word vectors, can be loaded as read-only KeyedVectors instances in Gensim. Their full binary models can be loaded as a FastText model that (technically) supports additional training, but I've never seen a reliable writeup on how to do such fine-tuning well with standard tools.
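For example, loading the French vectors could look roughly like this (a minimal sketch; the cc.fr.300.* filenames are the downloads from the fasttext.cc page above, and load_facebook_model() is the current Gensim function, called load_fasttext_format() in older versions):
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_model
# text format: whole-word vectors only, read-only (no further training)
fr_wv = KeyedVectors.load_word2vec_format('cc.fr.300.vec', binary=False)
print(fr_wv.most_similar('roi', topn=3))
# full binary format: a FastText model that (technically) supports more training
fr_ft = load_facebook_model('cc.fr.300.bin')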
If you think you need to do fine-tuning, maybe you should just train your own model, which in its training set includes all the words/senses you need, from the start?
You can find a number of different models trained on different sets of French data, with various parameters, here: https://fauconnier.github.io/#data Note, however, that they date back to 2015.
It seems that Gensim's implementation of FastText leads to a smaller model size than Facebook's native implementation. With a corpus of 1 million words, the native fastText model is 6GB, while the Gensim FastText model is only 68MB.
Is there any information stored in Facebook's implementation not present in Gensim's implementation?
Please show which models generated this comparison, or what process was used. It probably has bugs/misunderstandings.
The size of a model is more influenced by the number of unique words (and character n-gram buckets) than the 'corpus' size.
The saved sizes of a Gensim-trained FastText model, or a native Facebook FastText-trained model, should be roughly in the same ballpark. Be sure to include all subsidiary raw numpy files (ending .npy, alongside the main save-file) created by Gensim's .save() - as all such files are required to re-.load() the model!
Similarly, if you were to load a Facebook FastText model into Gensim, then use Gensim's .save(), the total disk space taken in both alternate formats should be quite close.
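If you want to sanity-check such a comparison yourself, a rough sketch (filenames are placeholders) is to load the native model, re-save it with Gensim, and total up the main save file plus its companion .npy files:
import glob, os
from gensim.models.fasttext import load_facebook_model  # older Gensim versions used FastText.load_fasttext_format()
ft = load_facebook_model('model.bin')   # native Facebook-format model (placeholder filename)
ft.save('model_gensim')                 # may also write model_gensim.*.npy companion files
gensim_total = sum(os.path.getsize(f) for f in glob.glob('model_gensim*'))
print('Gensim save() total bytes:', gensim_total)
print('Facebook .bin bytes:', os.path.getsize('model.bin'))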
I want to load a pre-trained embedding to initialize my own unsupervised FastText model and retrain it with my dataset.
The trained embedding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try:
FastText.load_fasttext_format('model.txt') I get: NotImplementedError: Supervised fastText models are not supported.
Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if possible, is it a bad idea?
I know there's a big difference between supervised and unsupervised models, but I really want to try to use/convert this one and retrain it. I can't find a trained unsupervised model to load for my case (it's a Portuguese dataset), and that's the best model I've found.
If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)
However, Gensim's FastText doesn't support preloading a simple set of vectors for further training - for continued training, it needs a full FastText model, either from Facebook's binary format, or a prior Gensim FastText model .save().
(The fact that trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is misinterpreting it as some other kind of binary FastText model it doesn't support.)
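For completeness, the supported route for continued training starts from a full FastText model, such as Facebook's binary format. A rough sketch, assuming Gensim 4.x, an assumed Portuguese model file from fasttext.cc (cc.pt.300.bin), and my_sentences as a placeholder list of your own tokenized sentences:
from gensim.models.fasttext import load_facebook_model
ft = load_facebook_model('cc.pt.300.bin')                   # full model, not just word-vectors
ft.build_vocab(corpus_iterable=my_sentences, update=True)   # add your corpus's words to the existing vocab
ft.train(corpus_iterable=my_sentences, total_examples=len(my_sentences), epochs=ft.epochs)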
Update after comment below:
Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.
You can create an FT model with a compatible/overlapping vocabulary, load the old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vectors in the new model. (Note that the property that affects further training is actually ftModel.wv.vectors_vocab, the trained-up full-word vectors, not .vectors, which is composited from full words & n-grams.)
But the tradeoffs of such an ad-hoc strategy are many. The ngrams would still start random. Taking some prior model's just-word vectors isn't quite the same as a FastText model's full-words-to-be-later-mixed-with-ngrams.
You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)
So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices and thus where you'd want to re-verify lots of assumptions, and rigorously evaluate if those extra steps/approximations are actually improving things.
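If you do want to experiment with the copy-over idea above, a very rough sketch (assuming Gensim 4.x; my_corpus is a placeholder for your own tokenized sentences) would be:
from gensim.models import FastText, KeyedVectors
old_kv = KeyedVectors.load_word2vec_format('model.txt')   # the prior plain word-vectors
ft = FastText(vector_size=old_kv.vector_size, min_count=5)
ft.build_vocab(corpus_iterable=my_corpus)                 # new model's vocab, from your own corpus
# overwrite the randomly-initialized full-word vectors with the prior vectors where words overlap
for word in old_kv.index_to_key:
    if word in ft.wv.key_to_index:
        ft.wv.vectors_vocab[ft.wv.key_to_index[word]] = old_kv[word]
ft.train(corpus_iterable=my_corpus, total_examples=ft.corpus_count, epochs=ft.epochs)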
I want to train a spaCy custom NER model. Which is the best option?
The training data is ready (Doccano).
Option 1: use an existing pre-trained spaCy model and update it with the custom NER?
Option 2: create an empty model using spacy.blank() with the custom NER?
I just want to identify my custom entity in a text; the other entity types are not necessary... currently.
You want to leverage transfer learning as much as possible: this means you most likely want to use a pre-trained model (e.g. trained on Wikipedia data) and fine-tune it for your use case. This is because training a spacy.blank model from scratch will require lots of data, whereas fine-tuning a pre-trained model might require as few as a couple hundred labels.
However, pay attention to catastrophic forgetting, which is the fact that when fine-tuning on some of your new labels, the model might 'forget' some of the old labels because they are no longer present in the training set.
For example, let's say you are trying to label the entity DOCTOR on top of a pre-trained NER model that labels LOC, PERSON and ORG. You label 200 DOCTOR records and fine-tune your model with them. You might find that the model now predicts every PERSON as a DOCTOR.
That's all one can say without knowing more about your data. Please check out the spacy docs on training ner for more details.
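As a rough illustration of the fine-tuning route (spaCy v3 API; the en_core_web_sm pipeline, the DOCTOR label and the tiny TRAIN_DATA below are placeholders, and in practice you'd also mix in examples of the original labels to limit catastrophic forgetting):
import random
import spacy
from spacy.training import Example
nlp = spacy.load('en_core_web_sm')     # an existing pre-trained pipeline (placeholder)
ner = nlp.get_pipe('ner')
ner.add_label('DOCTOR')                # your new custom entity type
# Doccano export converted to (text, {"entities": [(start, end, label)]}) tuples
TRAIN_DATA = [("Dr. Smith examined the patient.", {"entities": [(0, 9, "DOCTOR")]})]
optimizer = nlp.resume_training()
with nlp.select_pipes(enable=['ner']):   # update only the NER component
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.35)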
I'm struggling with training a Doc2Vec model on a Wikipedia dump; I'm not experienced in setting up a server, and a local machine is out of the question due to the RAM the training requires. I couldn't find a pre-trained model, except outdated copies for Python 2.
I'm not aware of any publicly-available standard gensim Doc2Vec models trained on Wikipedia.
Does anyone know which function should I use if I want to use the pre-trained doc2vec models in this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but do we have a similar function to load pre-trained doc2vec models as well in gensim?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other files with extra extensions – and all those files must be kept together to re-load a fully-functional model. (You still specify only the main save file; the auxiliary files will be discovered by their expected names alongside it in the same directory.)
You may have other issues trying to use those pre-trained models. In particular:
as noted on the linked page, the author used a custom variant of Gensim that forked off about 2 years ago; the files might not load in standard Gensim, or in later Gensim versions
it's not completely clear what parameters were used to train those models (though I suppose if you succeed in loading them you could see them as properties in the model), and how much meta-optimization was used for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or the apparent model size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
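A minimal sketch of that do-it-yourself route with current Gensim (the one-document-per-line my_corpus.txt file and the parameters are placeholders to adapt to your own data):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
# one document per line, tagged by line number
with open('my_corpus.txt', encoding='utf-8') as f:
    corpus = [TaggedDocument(simple_preprocess(line), [i]) for i, line in enumerate(f)]
model = Doc2Vec(vector_size=300, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save('my_doc2vec.model')   # reload later with Doc2Vec.load('my_doc2vec.model')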
Try this:
import gensim.models as g
model="model_folder/doc2vec.bin" #point to downloaded pre-trained doc2vec model
#load model
m = g.Doc2Vec.load(model)