How to get the dimensions of a word2vec vector? - python

I have trained a word2vec model on my data, list_of_sentence:
from gensim.models import Word2Vec
w2v_model = Word2Vec(list_of_sentence, min_count=5, workers=4)
print(type(w2v_model))
<class 'gensim.models.word2vec.Word2Vec'>
I would like to know the dimensionality of w2v_model vectors. How can I check it?

The vector dimensionality is included as an argument in Word2Vec:
In gensim versions up to 3.8.3, the argument was called size (docs)
In the latest gensim versions (4.0 onwards), the relevant argument is renamed to vector_size (docs)
In both cases, the argument has a default value of 100; this means that, if you do not specify it explicitly (as is the case here), the dimensionality will be 100.
Here is a reproducible example using gensim 3.6:
import gensim
gensim.__version__
# 3.6.0
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, window=5, min_count=1, workers=4) # do not specify size, leave the default 100
wv = model.wv['computer'] # get numpy vector of a word in the corpus
wv.shape # verify the dimension of a single vector is 100
# (100,)
If you want to change this dimensionality to, say, 256, you should call Word2Vec with the argument size=256 (for gensim versions up to 3.8.3) or vector_size=256 (for gensim versions 4.0 or later).
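For instance, a minimal sketch with gensim 4.x, reusing the same common_texts toy corpus as above:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
# gensim 4.x: the dimensionality is set via vector_size
model = Word2Vec(sentences=common_texts, vector_size=256, window=5, min_count=1, workers=4)
print(model.wv.vector_size)        # 256
print(model.wv['computer'].shape)  # (256,)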

Related

How to use the DeBERTa model by He et al. (2022) on Spyder?

I have recently successfully analyzed text-based data using sentence transformers based on the BERT model. Inspired by the book by Kulkarni et al. (2022), my code looked like this:
# Import SentenceTransformer
from sentence_transformers import SentenceTransformer
# use the paraphrase-MiniLM-L12-v2 pre-trained model
sbert_model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
# My text
x = 'The cat caught the mouse'
# get embeddings for each question
sentence_embeddings_BERT= sbert_model.encode(x)
I would like to do the same using the DeBERTa model but can't get it running. I managed to load the model, but how to apply it?
import transformers
from transformers import DebertaTokenizer, AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")
sentence_embeddings_deBERTa= model(x)
The last line does not run, error message is:
AttributeError: 'str' object has no attribute 'size'
Any experienced DeBERTa users out there?
Thanks
Pat
Welcome to SO ;)
When you call the encode() method, sentence-transformers tokenizes the input, converts it into the tensors a transformer model expects, and passes them through the model. When you use the transformers library directly, you have to do these steps yourself.
from transformers import DebertaTokenizer, DebertaModel
import torch
# downloading the models
tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
model = DebertaModel.from_pretrained("microsoft/deberta-base")
# tokenizing the input text and converting it into PyTorch tensors
inputs = tokenizer(["The cat caught the mouse", "This is the second sentence"], return_tensors="pt", padding=True)
# pass through the model
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Lastly, you need to decide which kind of output you want to work with: a plain transformer model returns token-level hidden states, not a single sentence vector.
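For example, if you want one vector per sentence (similar to what sentence-transformers' encode() returns), a common approach is to mean-pool the token embeddings over the non-padding positions. Here is a minimal sketch continuing from the inputs and outputs variables above; the pooling strategy is my own choice, not something the model prescribes:
import torch
# mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()     # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)    # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1e-9)                  # (batch, 1)
sentence_embeddings = summed / counts                     # one vector per sentence
print(sentence_embeddings.shape)                          # e.g. torch.Size([2, 768])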

Change input size of ONNX model

I need to change the input size of an ONNX model from [1024,2048,3] to [1,1024,2048,3].
For this, I've tried ONNX's update_inputs_outputs_dims utility:
import onnx
from onnx.tools import update_model_dims
model = onnx.load("./0818_pspnet_1.0_713_resnet_v1/pspnet_citysc.onnx")
updated_model = update_model_dims.update_inputs_outputs_dims(model, {"inputs:0":[1,1024,2048,3]}, {"predictions:0":[1, 1025, 2049, 1]})
onnx.save(updated_model, 'pspnet_citysc_upd.onnx')
However, this is the error I end up with.
ValueError: Unable to set dimension value to 1 for axis 0 of inputs:0. Contradicts existing dimension value 1024.
The ONNX model is exported from a Tensorflow frozen graph of PSPNet. If the above approach does not work, would I need to modify the frozen graph?
Any help is greatly appreciated.
You can use the make_dynamic_shape_fixed tool from onnxruntime:
python -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 model.onnx model.fixed.onnx
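After running the tool, you can quickly confirm that the input now has the shape you expect (a small check with the onnx package, assuming the model.fixed.onnx output name from the command above):
import onnx
model = onnx.load("model.fixed.onnx")
for inp in model.graph.input:
    dims = [d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)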

Understanding usage of glove vectors

I used the following code to use GloVe vectors for word embeddings:
from gensim.scripts.glove2word2vec import glove2word2vec #line1
glove_input_file = 'glove.840B.300d.txt' #line2
word2vec_output_file = 'glove.word2vec' #line3
glove2word2vec(glove_input_file, word2vec_output_file) #line4
from gensim.models import KeyedVectors #line5
glove_w2vec = KeyedVectors.load_word2vec_format('glove.word2vec', binary=False) #line6
I understand this chunk of code is for using GloVe pretrained vectors for your word embeddings. But I am not sure what is happening in each line. Why convert GloVe to word2vec format? What does KeyedVectors.load_word2vec_format do exactly?
The GloVe algorithm and word2vec both create word vectors, one vector per word.
But the formats for storing those vectors are slightly different. The gensim glove2word2vec() function will let you convert a file in GloVe format to the format used by the original Google word2vec.c code.
https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Meanwhile, the gensim KeyedVectors.load_word2vec_format() method can load vectors in that word2vec.c format, into an instance of KeyedVectors (or one of its same-interface subclasses), for easy lookup and other common word-vector operations.
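Once loaded, the KeyedVectors object supports the usual word-vector operations; for example, a small sketch reusing the glove_w2vec variable from the question:
vec = glove_w2vec['computer']   # numpy vector for one word
print(vec.shape)                # (300,) for glove.840B.300d
print(glove_w2vec.most_similar('computer', topn=5))
print(glove_w2vec.similarity('computer', 'laptop'))
As a side note, in gensim 4.0+ the conversion step should no longer be necessary: as far as I know, KeyedVectors.load_word2vec_format accepts a no_header=True argument that reads GloVe-format files directly.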

How to use tensorflow-hub module with tensorflow-dataset api

I want to use the TensorFlow Dataset API to build my dataset and TensorFlow Hub to compute embeddings. I want to use the dataset.map function to convert my text data into embeddings. My TensorFlow version is 1.14.
Since I used the ELMo v2 module, which converts a batch of sentences into their word embeddings, I used the following code:
import tensorflow as tf
import tensorflow_hub as hub
...
sentences_array = load_sentences()
#Sentence_array=["I love Python", "python is a good PL"]
def parse(sentences):
    elmo = hub.Module("./ELMO")
    embeddings = elmo([sentences], signature="default", as_dict=True)["word_emb"]
    return embeddings
dataset = tf.data.TextLineDataset(sentences_array)
dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=parse, batch_size=batch_size))
I want embeddings for the text array with shape [batch_size, max_words_in_batch, embedding_size], but I got this error message:
NotImplementedError: Using TF-Hub module within a TensorFlow defined function is currently not supported.
How can I get the expected results?
Unfortunately, this is not supported in TensorFlow 1.x.
It is, however, supported in TensorFlow 2.0, so if you can upgrade to TensorFlow 2 and choose one of the available text embedding modules for TF 2 (current list here), you can use it in your dataset pipeline. Something like this:
embedder = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")
def parse(sentences):
    embeddings = embedder([sentences])
    return embeddings
dataset = tf.data.TextLineDataset("text.txt")
dataset = dataset.map(parse)
If you are tied to 1.x, or tied to ELMo (which I don't think is available in the new format yet), then the only option I can see for embedding in the preprocessing stage is to first run your dataset through the embedding model, save the results, and then use the embedded vectors for the downstream task separately, as sketched below. (I appreciate this is less than ideal.)
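Here is a minimal sketch of that precompute-and-save approach for TF 1.x, reusing the local ./ELMO module and the "word_emb" output from the question (file names and the toy sentences are placeholders):
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
sentences = ["I love Python", "python is a good PL"]
elmo = hub.Module("./ELMO")
# "word_emb" gives per-word embeddings; other outputs (e.g. "elmo", "default") are also available
word_emb_op = elmo(sentences, signature="default", as_dict=True)["word_emb"]
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    word_emb = sess.run(word_emb_op)   # roughly [batch, max_words_in_batch, emb_dim]
np.save("elmo_word_emb.npy", word_emb)
# later, build the dataset from the precomputed vectors
dataset = tf.data.Dataset.from_tensor_slices(word_emb)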

FastText in Gensim

I am using gensim to load my fastText .vec file as follows:
m = KeyedVectors.load_word2vec_format(filename, binary=False)
However, I am confused about whether I need to load the .bin file to run commands like m.most_similar("dog"), m.wv.syn0, or m.wv.vocab.keys(). If so, how do I do it?
Or is the .bin file not needed for this cosine-similarity matching?
Please help me!
The following can be used:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("model.vec", binary=False)  # path to your .vec file
model.most_similar("summer")
model.similarity("summer", "winter")
There are many ways to use the model from here.
The gensim library has evolved, so some code fragments have become deprecated. This is a working solution:
import gensim.models.wrappers.fasttext
model = gensim.models.wrappers.fasttext.FastTextKeyedVectors.load_word2vec_format(Source + '.vec', binary=False, encoding='utf8')
word_vectors = model.wv
# -- this saves space if you only plan to use the vectors, not to train the model further:
del model
# -- do your work:
word_vectors.most_similar("etc")
If you want to be able to retrain the gensim model later with additional data, you should save the whole model like this: model.save("fasttext.model").
If you save just the word vectors with model.wv.save_word2vec_format(Path("vectors.txt")), you will still be able to perform any of the operations that word vectors provide, such as similarity, but you will not be able to retrain the model with more data.
Note that if you are saving the whole model, you should pass a file name as a string instead of wrapping it in get_tmpfile, as suggested in the documentation here.
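As a small sketch of that save-and-retrain round trip with gensim's FastText (the toy corpora below stand in for your own data):
from gensim.models import FastText
initial_sentences = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]
more_sentences = [["summer", "is", "warm"], ["winter", "is", "cold"]]
# train and save the full model (keeps everything needed for further training)
model = FastText(sentences=initial_sentences, min_count=1)
model.save("fasttext.model")
# later: reload and continue training on new data
model = FastText.load("fasttext.model")
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=len(more_sentences), epochs=model.epochs)
# the word vectors remain available for similarity queries
print(model.wv.most_similar("summer"))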
Maybe I am late in answering this, but you can find your answer in the documentation: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
Example use cases
This library has two main use cases: word representation learning and text classification. These were described in the two papers 1 and 2.
Word representation learning
In order to learn word vectors, as described in 1, do:
$ ./fasttext skipgram -input data.txt -output model
where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.
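To connect this back to the original question: the model.vec file can be loaded with KeyedVectors.load_word2vec_format as shown above, while the model.bin file can be loaded with gensim's native fastText loaders (a sketch assuming gensim 3.8 or later; these loaders also keep the subword n-grams, so out-of-vocabulary words still get vectors):
from gensim.models.fasttext import load_facebook_vectors, load_facebook_model
# vectors only: enough for similarity queries, including out-of-vocabulary words
wv = load_facebook_vectors("model.bin")
print(wv.most_similar("dog"))
# full model: needed if you want to continue training
full_model = load_facebook_model("model.bin")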
