I am using Python to train a word2vec model and get embeddings for each word in the vocabulary. I have used gensim to do this before, and I have also noticed that such a model can be trained with tools like TensorFlow, Theano, and so on.
However, in these training processes the inputs are just texts, which are basically strings; the words are then mapped to indices for training. In my case, I want to input arrays for training. These arrays can be one-hot encoded vectors or other vectors produced by some designed manipulation.
So, is there an existing tool that trains a word2vec model from input vectors? If there is no such tool, what would you recommend I learn so that I can write my own code?
I am new to NLP and I am confused about embeddings.
Is it possible, if I already have trained GloVe or Word2Vec embeddings, to feed these into a Transformer? Or does the Transformer need raw data and do its own embedding?
(Language: Python, Keras)
If you train a new transformer, you can do whatever you want with the bottom layer.
Most likely you are asking about pretrained transformers, though. Pretrained transformers such as BERT have their own embeddings of the word pieces. In that case, you will probably get sufficient results just by using the outputs of the transformer.
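As a rough illustration, here is a minimal sketch of using the outputs of a pretrained transformer directly (assuming the Hugging Face transformers and torch packages are installed; the checkpoint name is just an example):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per word piece, shape (1, num_tokens, hidden_size)
token_embeddings = outputs.last_hidden_state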
Per https://en.wikipedia.org/wiki/BERT_(language_model)
BERT models are pre-trained from unlabeled data extracted from the BooksCorpus with 800M words and English Wikipedia with 2,500M words.
Whether to train your own model depends on your data.
For simple English text, the out-of-the-box model should work well.
If your data concentrates on a certain domain, e.g. job requisitions and job applications, then you can extend the model by training it on your corpus (aka transfer learning).
https://huggingface.co/docs/transformers/training
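For what it's worth, a minimal sketch of that kind of domain-adaptive training with the Hugging Face Trainer (the corpus file name my_corpus.txt is hypothetical; this continues masked-language-model pre-training rather than training a task-specific head):
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# hypothetical corpus: one document per line in a plain-text file
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()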
I trained a machine learning sentence classification model that uses, among other features, the vectors obtained from a pretrained fastText model (like these), which is 7 GB. I use the pretrained fastText Italian model, and I use this word embedding only to get some semantic features to feed into the effective ML model.
I built a simple API based on fastText that, at prediction time, computes the vectors needed by the effective ML model. Under the hood, this API receives a string as input and calls get_sentence_vector. When the API starts, it loads the fastText model into memory.
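For context, a minimal sketch of the kind of API call described above (the model file name is hypothetical):
import fasttext

# loaded once at startup and kept in RAM (this is the memory bottleneck)
ft_model = fasttext.load_model("cc.it.300.bin")

def sentence_features(text: str):
    # fastText averages the normalized word vectors of the sentence
    return ft_model.get_sentence_vector(text)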
How can I reduce the memory footprint of fastText, which is loaded into RAM?
Constraints:
My model works fine, training was time-consuming and expensive, so I wouldn't want to retrain it using smaller vectors
I need fastText's ability to handle out-of-vocabulary words, so I can't use just the vectors; I need the full model
I should reduce the RAM usage, even at the expense of a reduction in speed.
At the moment, I'm starting to experiment with compress-fasttext...
Please share your suggestions and thoughts even if they do not represent full-fledged solutions.
There is no easy solution for my specific problem: if you are using a fastText embedding as a feature extractor and then want to use a compressed version of this embedding, you have to retrain the final classifier, since the produced vectors are somewhat different.
Anyway, I want to give a general answer for
fastText model reduction
Unsupervised models (=embeddings)
You are using pretrained embeddings provided by Facebook, or you trained your own embeddings in an unsupervised fashion (format: .bin). Now you want to reduce the model size/memory consumption.
Straight-forward solutions:
compress-fasttext library: compresses fastText word embedding models by orders of magnitude without significantly affecting their quality; several pretrained compressed models are also available (other interesting compressed models here).
fastText native reduce_model: in this case, you are reducing the vector dimension (e.g. from 300 to 100), so you are explicitly losing expressiveness; under the hood, this method employs PCA (see the sketch below).
If you have training data and can perform retraining, you can use floret, a fastText fork by Explosion (the company behind spaCy), which uses a more compact representation for vectors.
If you are not interested in fastText's ability to represent out-of-vocabulary words (words not seen during training), you can use the .vec file (containing only vectors and not model weights) and select only a portion of the most common vectors (e.g. the first 200k words/vectors). If you need a way to convert .bin to .vec, read this answer.
Note: the gensim package fully supports fastText embeddings (unsupervised mode), so these operations can also be done through this library (more details in this answer)
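A minimal sketch of the native reduce_model route mentioned above (file names are hypothetical; reduce_model applies PCA to shrink the vector dimension):
import fasttext
import fasttext.util

ft = fasttext.load_model("cc.it.300.bin")
fasttext.util.reduce_model(ft, 100)   # 300 -> 100 dimensions via PCA
ft.save_model("cc.it.100.bin")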
Supervised models
You used fastText to train a classifier, producing a .bin model. Now you want to reduce the classifier size/memory consumption.
The best solution is fastText's native quantize: the model is retrained applying weight quantization and feature selection. With the retrain parameter, you can decide whether or not to fine-tune the embeddings (see the sketch below).
You can still use fastText's reduce_model, but it leads to less expressive models, and the size of the model is not heavily reduced.
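A minimal sketch of native quantization for a supervised classifier (file names are hypothetical):
import fasttext

model = fasttext.train_supervised(input="train.txt")
model.quantize(input="train.txt", retrain=True)  # weight quantization + feature selection
model.save_model("model.ftz")                    # .ftz is much smaller than .bin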
I have my own corpus of plain text. I want to train a BERT model in TensorFlow, similarly to gensim's word2vec, to get the embedding vectors for each word.
What I have found is that all the examples are related to downstream NLP tasks like classification. But I want to train a BERT model on my custom corpus, after which I can get the embedding vectors for a given word.
Any lead will be helpful.
If you have access to the required hardware, you can dig into NVIDIA's training scripts for BERT using TensorFlow. The repo is here. From the medium article:
BERT-large can be pre-trained in 3.3 days on four DGX-2H nodes (a total of 64 Volta GPUs).
If you don't have an enormous corpus, you will probably have better results fine-tuning an available model. If you would like to do so, you can look into huggingface's transformers.
I have a pre-trained word2vec model from gensim. Using gensim to find similarities between words works as expected, but I am having problems finding similarities between two different sentences. Using cosine similarity is not a good option for sentences and does not give good results. Soft cosine similarity in gensim gives slightly better results, but it still does not look good.
I found WMD similarity in gensim. This is a bit better than soft cosine and plain cosine.
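For reference, a minimal sketch of the WMD call in gensim (the downloadable model name is just an example; wmdistance also needs an optimal-transport backend such as POT installed):
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
distance = wv.wmdistance(
    "obama speaks to the media in illinois".split(),
    "the president greets the press in chicago".split(),
)
print(distance)  # lower distance = more similar sentences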
I am wondering if there are more options, like using deep learning (e.g. Keras and TensorFlow), to find sentence similarities from pre-trained word2vec. I know classification can be done using word embeddings, and that seems a somewhat better option, but then I would need to find training data and label it from scratch.
So, I am wondering if there is any other option which uses pre-trained word2vec in Keras to get sentence similarities. Is there a way? I am open to any suggestions and advice.
Before reimplementing the wheel, I'd suggest trying the doc2vec method from gensim; it works quite well and it's easy to use.
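A minimal gensim Doc2Vec sketch (the corpus and hyperparameters are just placeholders):
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["first example sentence", "another example sentence"]
corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
          for i, s in enumerate(sentences)]
model = Doc2Vec(corpus, vector_size=100, epochs=40, min_count=1)

# infer vectors for (possibly unseen) sentences and compare them with cosine similarity
v1 = model.infer_vector("first example sentence".split())
v2 = model.infer_vector("another example sentence".split())
similarity = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))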
To implement it in Keras reusing the embeddings you have computed with gensim:
1. Store the word embeddings in a file, one word per line with the corresponding embedding. Alternatively, you can do as #Paul suggested: skip step 2 and reuse the layer in step 3.
2. Load the word embeddings into a Keras Embedding layer. You can check out this Keras tutorial for more details (check how the embedding_layer variable is initialized); see also the sketch after these steps.
3. Then a sequence-to-sequence model can be used to compute an embedding of the text, in which an encoder embeds the string and a decoder converts the embedding back to a string. Here is a Keras tutorial that translates from English to French. You can use a similar process to transform your text into your text and pick the internal embedding for your similarity metric.
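A minimal sketch of step 2, filling a Keras Embedding layer with vectors from a saved gensim model (the file path word2vec.kv is hypothetical):
import numpy as np
import tensorflow as tf
from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec.kv")  # or Word2Vec.load(...).wv
vocab_size, dim = len(wv.index_to_key), wv.vector_size

embedding_matrix = np.zeros((vocab_size, dim))
for i, word in enumerate(wv.index_to_key):
    embedding_matrix[i] = wv[word]

embedding_layer = tf.keras.layers.Embedding(
    vocab_size, dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pretrained vectors frozen
)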
You can also have a look at how the paragraph-to-vector model works; you can implement it using Keras, loading the word embedding weights that you have computed.
With things like neural networks (NNs) in Keras it is very clear how to use word embeddings within the training of the NN; you can simply do something like
embeddings = ...
model = Sequential([Embedding(...),
                    layer1,
                    layer2, ...])
But I'm unsure how to do this with algorithms in sklearn such as SVMs, naive Bayes, and logistic regression. I understand that there is a Pipeline method, which works simply (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) like
pip = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', Classifier())])
pip.fit(X_train, y_train)
But how can I include loaded word embeddings in this pipeline? Or should it somehow be included outside the pipeline? I can't find much documentation online about how to do this.
Thanks.
You can use the FunctionTransformer class.
If your goal is to have a transformer that takes a matrix of indices and outputs a 3D tensor with word vectors, then this should suffice:
# this assumes you're using numpy ndarrays
from sklearn.preprocessing import FunctionTransformer

word_vecs_matrix = get_wv_matrix()  # pseudo-code: a (vocab_size, dim) array of word vectors

def transform(x):
    # x is a matrix of word indices; indexing yields a 3D tensor of word vectors
    return word_vecs_matrix[x]

transformer = FunctionTransformer(transform)
Be aware that, unlike in Keras, the word vectors will not be fine-tuned with some kind of gradient descent.
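Alternatively, if you just need fixed-length features for classic sklearn estimators, a common variant is to average the word vectors of each text inside the pipeline. A minimal sketch (the gensim downloader model name is just an example):
import numpy as np
import gensim.downloader as api
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

wv = api.load("glove-wiki-gigaword-50")

def mean_vectors(texts):
    # average the vectors of in-vocabulary words for each text
    return np.array([
        np.mean([wv[w] for w in t.lower().split() if w in wv]
                or [np.zeros(wv.vector_size)], axis=0)
        for t in texts
    ])

pipe = Pipeline([
    ("embed", FunctionTransformer(mean_vectors)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train)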
There is an easy way to get word embedding transformers with the Zeugma package.
It handles the downloading of the pre-trained embeddings and returns a "transformer interface" for the embeddings.
For example, if you want to use the average of the GloVe embeddings as sentence representations, you'd just have to write:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')
Here glove is an sklearn transformer that has the standard transform method, which takes a list of sentences as input and outputs a design matrix, just like TfidfTransformer. You can get the resulting embeddings with embeddings = glove.transform(['first sentence of the corpus', 'another sentence']), and embeddings would contain a 2 x N matrix, where N is the dimension of the chosen embedding. Note that you don't have to bother with downloading the embeddings, or loading them locally if you've already done so; Zeugma handles this transparently.
Hope this helps