I am building a seq2seq model using functions from TensorFlow's seq2seq.py, which provides a function like this:
embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
                      num_encoder_symbols, num_decoder_symbols,
                      embedding_size, output_projection=None,
                      feed_previous=False, dtype=dtypes.float32,
                      scope=None)
However, it seems that this function does not take pre-trained embeddings as input. Is there any way to feed pre-trained word embeddings into this function?
There is no parameter you can just hand them to. Read in your embeddings (make sure the vocabulary IDs match). Then, once you have initialized all variables, find the embedding tensor (iterate through tf.all_variables to find its name). Finally, use tf.assign to overwrite the randomly initialized embeddings there with your own.
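A minimal sketch of that assign-after-init pattern, using the legacy TF API that seq2seq.py belongs to; the embeddings file, the session setup, and the "embedding" substring used to locate the variable are all assumptions for illustration:
import numpy as np
import tensorflow as tf

pretrained = np.load('my_embeddings.npy')  # assumed shape: (num_encoder_symbols, embedding_size)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

# Locate the randomly initialized embedding variable by name.
embedding_var = [v for v in tf.all_variables() if 'embedding' in v.name][0]

# Overwrite it with the pre-trained embeddings.
sess.run(tf.assign(embedding_var, pretrained))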
I have a pre-trained word2vec model from gensim. Using gensim to find similarities between words works as expected, but I am having trouble finding similarities between two different sentences. Plain cosine similarity is not a good option for sentences and does not give good results. Soft cosine similarity in gensim gives slightly better results, but it still does not look good.
I found WMD similarity in gensim. This is a bit better than soft cosine and plain cosine.
I am wondering if there are more options, such as using deep learning with Keras and TensorFlow, to compute sentence similarities from pre-trained word2vec. I know classification can be done using word embeddings, and that seems like a somewhat better option, but then I would need to find training data and label it from scratch.
So, is there any other option that can use pre-trained word2vec in Keras to get sentence similarities? I am open to any suggestions and advice.
Before reinventing the wheel I'd suggest trying the doc2vec method from gensim; it works quite well and is easy to use.
To implement it in Keras, reusing the embeddings you have computed with gensim:
1. Store the word embeddings in a file, one word per line with the corresponding embedding. Alternatively you can do as #Paul suggested, skip step 2, and reuse the layer in step 3.
2. Load the word embeddings into a Keras Embedding layer (a minimal sketch follows at the end of this answer). You can check out this Keras tutorial for more details (look at how the embedding_layer variable is initialized).
3. Use a sequence-to-sequence model to compute the embedding of the text: an encoder embeds the string and a decoder converts the embedding back into a string. Here is a Keras tutorial that translates from English to French. You can use a similar process to transform your text into itself and pick the internal embedding for your similarity metric.
You can also have a look at how the paragraph-to-vector model works; you can implement it in Keras as well, loading the word embedding weights you have computed.
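A minimal sketch of steps 1-2, assuming a gensim Word2Vec model saved as 'word2vec.model' and a toy word_index mapping (both names are illustrative, not from the original answer):
import numpy as np
from gensim.models import Word2Vec
from keras.layers import Embedding

w2v = Word2Vec.load('word2vec.model')   # hypothetical path
word_index = {'hello': 1, 'world': 2}   # e.g. taken from a Keras Tokenizer (toy example)
embedding_dim = w2v.vector_size

# Build the embedding matrix; words missing from the model keep zero vectors.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in w2v.wv:
        embedding_matrix[i] = w2v.wv[word]

# Pre-trained, frozen Embedding layer (set trainable=True to fine-tune it).
embedding_layer = Embedding(len(word_index) + 1,
                            embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)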
I am solving a multi-class classification problem using Keras, but I suspect the accuracy is poor because of poor word embeddings for my data (domain-specific data).
Keras has its own Embedding layer, which is trained in a supervised way as part of the model.
So I have 2 questions regarding this:
Can I use word2vec embeddings in the Embedding layer of Keras, given that word2vec is a form of unsupervised/self-supervised learning?
If yes, can I then use transfer learning on a pre-trained word2vec model to add knowledge of my domain-specific features?
You can initialize the embedding layer with word2vec or any other pre-trained embeddings (maybe fastText?) by manually constructing the embedding matrix, i.e., just load all the numbers from the word2vec files and make an np.array of them. Then you create a constant initializer and pass it as an argument to your embedding layer's constructor.
If you don't want the embeddings to get updated during training, just set trainable to False on the layer object.
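A minimal sketch of that approach with tf.keras, assuming the embedding matrix has already been assembled from the word2vec file and saved as a NumPy array (the file name is illustrative):
import numpy as np
import tensorflow as tf

embedding_matrix = np.load('w2v_matrix.npy')   # assumed shape: (vocab_size, embedding_dim)
vocab_size, embedding_dim = embedding_matrix.shape

embedding_layer = tf.keras.layers.Embedding(
    vocab_size,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # flip to True if you want to fine-tune on your domain data
)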
With things like neural networks (NNs) in Keras it is very clear how to use word embeddings within the training of the NN; you can simply do something like
embeddings = ...
model = Sequential([Embedding(...),
                    layer1,
                    layer2, ...])
But I'm unsure how to do this with algorithms in sklearn such as SVMs, Naive Bayes, and logistic regression. I understand that there is a Pipeline method, which works simply (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) like
pip = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', Classifier())])
pip.fit(X_train, y_train)
But how can I include loaded word embeddings in this pipeline? Or should they somehow be included outside the pipeline? I can't find much documentation online about how to do this.
Thanks.
You can use the FunctionTransformer class.
If your goal is to have a transformer that takes a matrix of indexes and outputs a 3d tensor with word vectors, then this should suffice:
from sklearn.preprocessing import FunctionTransformer

# this assumes you're using numpy ndarrays
word_vecs_matrix = get_wv_matrix()  # pseudo-code

def transform(x):
    return word_vecs_matrix[x]  # index lookup: word IDs -> word vectors

transformer = FunctionTransformer(transform, validate=False)
Be aware that, unlike in Keras, the word vectors will not be fine-tuned by any kind of gradient descent.
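Since SVMs, Naive Bayes, and logistic regression expect a 2-D design matrix rather than a 3-D tensor, a common variation is to average the word vectors per document inside the same kind of FunctionTransformer. A minimal sketch, reusing the pseudo-code word_vecs_matrix from above and assuming the input is a padded matrix of word indices:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

word_vecs_matrix = get_wv_matrix()  # pseudo-code, as above; row 0 should be a zero vector for padding

def mean_embedding(x):
    # x: (n_samples, max_len) word indices -> (n_samples, dim) averaged vectors
    return word_vecs_matrix[x].mean(axis=1)

pip = Pipeline([
    ('embed', FunctionTransformer(mean_embedding, validate=False)),
    ('clf', LogisticRegression()),
])
# pip.fit(X_train, y_train)  where X_train is the padded index matrix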
There is an easy way to get word embedding transformers with the Zeugma package.
It handles the downloading of the pre-trained embeddings and returns a "Transformer interface" for the embeddings.
For example, if you want to use the average of the GloVe embeddings as sentence representations, you'd just have to write:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')
Here glove is a sklearn transformer that has the standard transform method, which takes a list of sentences as input and outputs a design matrix, just like TfidfTransformer. You can get the resulting embeddings with embeddings = glove.transform(['first sentence of the corpus', 'another sentence']), and embeddings would then be a 2 x N matrix, where N is the dimension of the chosen embedding. Note that you don't have to bother with downloading the embeddings, or loading them locally if you've already done so; Zeugma handles this transparently.
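Because it follows the sklearn transformer interface, you can also drop it straight into a Pipeline; here is a minimal sketch where the classifier choice and the toy data are my own assumptions:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from zeugma.embeddings import EmbeddingTransformer

pipeline = Pipeline([
    ('embedding', EmbeddingTransformer('glove')),  # sentence = average of its GloVe vectors
    ('clf', LogisticRegression()),
])

texts = ['a first example sentence', 'a rather different second sentence']
labels = [0, 1]
pipeline.fit(texts, labels)
print(pipeline.predict(['yet another sentence']))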
Hope this helps
I'm using an RNN on sequences of word embeddings to classify sentences. At first I was feeding pre-trained word embeddings and everything worked fine. I made the embeddings matrix a tf.placeholder with dimension (Vocab size, Embedding size) and fed some pre-trained embeddings from GloVe. I also use tf.nn.embedding_lookup to translate my inputs (which are sequences of word IDs) into sequences of embeddings.
Then I wanted to allow the model to train the embeddings as well, so I made the embedding matrix a tf.Variable instead of a placeholder. Now TensorFlow gives me this error -- apparently the AdamOptimizer can't handle the embedding lookup. Any idea what's up or how to fix this?
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node
Adam/update_embeddings/AssignSub was passed float from _recv_embeddings_0:0
incompatible with expected float_ref.
You cannot feed a value to a variable and optimize it at the same time. Instead, you must first run a tf.assign on that variable to initialize it to the fed value, and then run the optimizer. Or, more easily, you can just pass the GloVe vectors as the initializer of the variable and run tf.global_variables_initializer.
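A minimal sketch of the second option (TF 1.x style, to match the question); the glove_matrix file and the placeholder shape are assumptions for illustration:
import numpy as np
import tensorflow as tf

glove_matrix = np.load('glove_matrix.npy').astype(np.float32)  # assumed (vocab_size, embedding_size)

# Trainable variable initialized from the pre-trained vectors; no placeholder, no feeding.
embeddings = tf.Variable(glove_matrix, name='embeddings', trainable=True)

word_ids = tf.placeholder(tf.int32, shape=[None, None])  # sequences of word IDs
inputs = tf.nn.embedding_lookup(embeddings, word_ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # fills the variable with glove_matrix
    # build the RNN on `inputs` and train with AdamOptimizer as before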
I am using Python to train a word2vec model and get embeddings for each word in the vocabulary. I used gensim to do this before, and I have also noticed that such a model can be trained with tools like TensorFlow, Theano, and so on.
However, in these training processes the inputs are just texts, which are basically strings; the words are then mapped to indices for training. In my case, I want to input arrays for training. These arrays can be one-hot encoded vectors or other vectors produced by some designed manipulation.
So, is there an existing tool that trains a word2vec model by taking vectors as input? If there is no such tool, any recommendations on what I should learn so that I can write my own code?