Get BERT Embeddings for every Token in a Sentence - python

I have a dataframe in Python in which I have a column of textual data. I need to run a loop that takes each row of that text column and gets the BERT embedding for every token in that row. I then need to concatenate those vector embeddings and use the result for some downstream purpose.
e.g. "My name is Obama"
get the 768-dimensional vector embedding for 'My'
get the 768-dimensional vector embedding for 'name'
get the 768-dimensional vector embedding for 'is'
get the 768-dimensional vector embedding for 'Obama'
final output: a vector embedding of size 768*4 = 3072
(assume every row has the exact same number of words)

I believe you are trying to bring contextual embeddings for the individual words of a sentence into the picture, instead of fixed vectors like those from GloVe.
Your approach should be:
Tokenize your paragraphs into individual sentences (look at sentence tokenizers or SBD (sentence boundary detection) methods if applicable).
For each sentence that constitutes a paragraph, get the embeddings for its words.
Average those so that you get vectors of consistent shape across multiple paragraphs (in your case a dataframe cell, which is essentially a paragraph).
pip install sentence-transformers
once installed:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Our sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Then look at the embedding vectors and at aggregation techniques (e.g. averaging) for combining them.
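If you actually need one 768-dimensional vector per token, concatenated into a 768*4 vector as in the question (rather than a single sentence vector), a minimal sketch using the Hugging Face transformers library would look like the following. The model name and the concatenation step are assumptions for illustration, and note that BERT's WordPiece tokenizer may split a word into several sub-tokens, so the token count is not always equal to the word count.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

sentence = "My name is Obama"
inputs = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (1, num_tokens, 768), including [CLS] and [SEP]
token_vectors = outputs.last_hidden_state[0, 1:-1]   # one 768-d vector per sub-token
concatenated = token_vectors.reshape(-1)              # e.g. 4 tokens -> a 3072-d vector
print(token_vectors.shape, concatenated.shape)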

Related

Word2Vec returning vectors for individual character and not words

For the following list:
words= ['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA','unimodal','7','regarding','random','59','intimating','COMPETITION','prospects','2K15','gather','Mega','SENSOR','NCTT','NETWORKING','orgainsed','acts']
I try to:
from gensim.models import Word2Vec
vec_model= Word2Vec(words, min_count=1, size=30)
vec_model['gather']
Which returns:
KeyError: "word 'gather' not in vocabulary"
But
vec_model['g']
Does return a vector, so I believe I'm getting vectors for the characters found in the list instead of vectors for the words found in the list.
Word2Vec expects a list of lists as input, where the corpus (main list) is composed of individual documents. The individual documents are composed of individual words (tokens). Word2Vec iterates over all documents and all tokens. In your example you have passed a single list to Word2Vec, so Word2Vec interprets each word as an individual document and iterates over each character of each word, which is interpreted as a token. Therefore you have built a vocabulary of characters, not words. To build a vocabulary of words you can pass a nested list to Word2Vec as in the example below.
from gensim.models import Word2Vec

words = [['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA'],
         ['unimodal','7','regarding','random','59','intimating'],
         ['COMPETITION','prospects','2K15','gather','Mega'],
         ['SENSOR','NCTT','NETWORKING','orgainsed','acts']]

vec_model = Word2Vec(words, min_count=1, size=30)
vec_model['gather']
Output:
array([ 0.01106581, 0.00968017, -0.00090574, 0.01115612, -0.00766465,
-0.01648632, -0.01455364, 0.01107104, 0.00769841, 0.01037362,
0.01551551, -0.01188449, 0.01262331, 0.01608987, 0.01484082,
0.00528397, 0.01613582, 0.00437328, 0.00372362, 0.00480989,
-0.00299072, -0.00261444, 0.00282137, -0.01168992, -0.01402746,
-0.01165612, 0.00088562, 0.01581018, -0.00671618, -0.00698833],
dtype=float32)
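Note that in newer gensim versions (4.x) the size parameter was renamed to vector_size and word vectors are accessed through the model's wv attribute, so a sketch of the same lookup under gensim 4.x would be:
from gensim.models import Word2Vec

words = [['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA'],
         ['unimodal','7','regarding','random','59','intimating'],
         ['COMPETITION','prospects','2K15','gather','Mega'],
         ['SENSOR','NCTT','NETWORKING','orgainsed','acts']]

# gensim 4.x: `size` -> `vector_size`, vectors live on `model.wv`
vec_model = Word2Vec(words, min_count=1, vector_size=30)
print(vec_model.wv['gather'])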

How to understand question about text document similarity

My teacher rarely answers emails so forgive me for asking here. I'm having trouble understanding what he means about the highlighted part:
"Create a python program that will compute the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and it will compute a dictionary of words for the given corpus."
So is the input a list of strings like so: e = ["a", "a", "b", "f"]
Or is it just a list with a single string where I pull the individual words?
Full question:
Create a python program that will compute the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and it will compute a dictionary of words for the given corpus. Later, when a new document (i.e search document) is provided, your implementation should provide a list of documents that are similar to the given search document, in descending order of their similarity with the search document. For computing similarity between any two documents in our question, you can use the following distance measures (optionally you can also use any other measure as well).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between the two vectors, e.g. ||u−v||
Hint: A text document can be represented as a word vector against a given dictionary of words. So first compute the dictionary of words for the given text corpus, containing the unique words from the documents of the corpus. Then transform every text document of the corpus into vector form, i.e. create a word vector where 0 indicates the word is not in the document and 1 indicates that the word is present in the given document. In our question, a text document is just represented as a string, so the text corpus is nothing but a list of strings.
The way I understand it, you are given a list of documents and you will match a new document against them, for instance:
my_docs = ["this is the first doc", "this is the second doc", "and this is yet another doc"]
inpd = input('please enter a document: ')
my_vocab, v_dim = build_vocab(my_docs + [inpd]) # build vocabulary from doclist and new doc
my_vdocs = [build_dvec(doc, v_dim, my_vocab) for doc in my_docs] # vectorize documents
inpdv = build_dvec(inpd, v_dim, my_vocab) # vectorize input document
print('docs and score: ', [str(my_docs[itm[0]])+': '+str(itm[1]) for itm in dmtch(inpdv, my_vdocs)])
this would produce
please enter a document: yet another doc
docs and score: ['and this is yet another doc: 3.0', 'this is the first doc: 1.0', 'this is the second doc: 1.0']
The best match is the last document in your list as it contains the most words from the input document. I have omitted the code for constructing the vocabulary, transforming the documents to vectors and matching them (=assignment).
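For completeness, one hypothetical way the omitted helpers build_vocab, build_dvec and dmtch could be written (a sketch consistent with the output above, not the assignment's reference solution) is:
def build_vocab(docs):
    # collect the unique words across all documents and map each to a column index
    vocab = {}
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab, len(vocab)

def build_dvec(doc, v_dim, vocab):
    # binary word vector: 1.0 if the word occurs in the document, else 0.0
    vec = [0.0] * v_dim
    for word in doc.split():
        if word in vocab:
            vec[vocab[word]] = 1.0
    return vec

def dmtch(query_vec, doc_vecs):
    # score every document by its dot product with the query vector, best match first
    scores = [(i, sum(q * d for q, d in zip(query_vec, dv))) for i, dv in enumerate(doc_vecs)]
    return sorted(scores, key=lambda itm: itm[1], reverse=True)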

How to predict the probability label of each word in a sentence using a classifier

I am trying to predict the label of each word in review sentences. So far I have trained a classifier on training data and now I am using it on test data. Basically, what I want to achieve is this: from a full sentence the first word is removed and the label of the rest of the sentence is predicted; then the first word is put back, the second word is removed, and the prediction is made again, and so on. This procedure removes the i-th word and predicts the label of the rest of the sentence in every case.
I ran the following code, which can remove one word and then put it back, but the problem is that my classifier still predicts on the whole sentence, because I fit_transform the pre-processed list stems1 as a whole document. I thought I could use the loop variable i instead of stems1 in fit_transform, but that does not work.
from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(tokenizer=identity_tokenizer, use_idf=True, lowercase=False, max_features=401)
X_new = vec.fit_transform(stems1).toarray()  # stems1 is a pre-processed list containing the tokenized reviews

for i in stems1:
    for j in range(0, len(i)):
        a = i.pop(j)                              # remove the j-th word
        new_pred = m.predict_proba(X_new)[:, 0]
        i.insert(j, a)                            # put the word back
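A minimal sketch of the leave-one-word-out prediction the question describes, assuming vec (the fitted TfidfVectorizer) and m (the trained classifier) from above; transforming the shortened token list instead of the full corpus is the key change, and is an assumption about the intended pipeline:
for sent in stems1:
    for j in range(len(sent)):
        removed = sent.pop(j)                          # drop the j-th token
        X_partial = vec.transform([sent]).toarray()    # vectorize only the shortened sentence
        prob = m.predict_proba(X_partial)[:, 0]        # predicted probability without that token
        sent.insert(j, removed)                        # restore the token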

How to generate alignments for word-based translation models if the number of words differs between the two sentences

I am working on implementing IBM Model 1. I have a parallel corpus of some 2,000,000 sentences (English to Dutch). Also, the sentences of the two docs are already aligned. The aim is to translate a Dutch sentence into English and vice-versa.
The code I am using for generating the alignments is:
from itertools import product

A = pair_sent[0].split()  # To split English sentence
B = pair_sent[1].split()  # To split Dutch sentence
trips.append([zip(A, p) for p in product(B, repeat=len(A))])
Now, there are sentence pairs with an unequal number of words (like 10 in English and 14 in the Dutch translation). Our professor told us that we should use NULLs or drop a word, but I don't understand how to do that: where do I insert the NULL, and how do I choose which word to drop?
In the end, I need the pair of sentences to have an equal number of words.
The problem is not that the sentences have a different number of words. After all, the IBM model computes, for each word in a source sentence, a probability distribution over all words in the target sentence and does not care how many words the target sentence has. The problem is that there might be words that have no counterpart in the target sentence.
If you append a NULL word to the target sentence (no matter where, because IBM Model 1 does not consider reordering), you can also model the probability that a word has no counterpart in the target sentence.
The actual bilingual alignment is then done using a symmetrization heuristic over a pair of IBM models trained in both directions.
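A small sketch of where the NULL token goes in practice (the token name <NULL> and the enumeration below are illustrative assumptions; real IBM Model 1 training accumulates expected counts rather than enumerating every alignment, which explodes combinatorially):
from itertools import product

NULL = "<NULL>"

def alignment_candidates(english_sent, dutch_sent):
    # Append a NULL token to the target side so a source word can align to nothing;
    # its position does not matter because IBM Model 1 ignores word order.
    src = english_sent.split()
    tgt = dutch_sent.split() + [NULL]
    # every source word may align to any target word (including NULL)
    return [list(zip(src, choice)) for choice in product(tgt, repeat=len(src))]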

scikit-learn TfidfVectorizer ignoring certain words

I'm trying TfidfVectorizer on a sentence taken from the Wikipedia page about the history of Portugal. However, I noticed that the TfidfVec.fit_transform method is ignoring certain words. Here's the sentence I tried with:
sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."
TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])
cols = [words[idx] for idx in tfidf.indices]
matrix = tfidf.todense()
pd.DataFrame(matrix,columns = cols,index=["Tf-Idf"])
output of the dataframe:
Essentially, it is ignoring the words "Aroeira" and "Almonda".
But i don't want it to ignore those words so what should i do? I can't find anywhere on the documentation where they talk about this.
Another question: why is the word "the" repeated? Shouldn't the algorithm consider just one "the" and compute its tf-idf?
tfidf.indices are just the indexes of the feature names in the TfidfVectorizer.
Getting words by these indexes from the sentence is a mistake.
You should get the column names for your df from TfidfVec.get_feature_names().
The output gives two "the" columns because there are two "the" in the sentence; the entire sentence is encoded and you're getting a value for each of the indices. The reason the other two words are not appearing is that they are rare words; you can make them appear by reducing the threshold.
Refer to min_df and max_features:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
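A small sketch of the fix the first answer describes, taking the column names from the vectorizer itself instead of indexing into the raw sentence (get_feature_names_out is the newer scikit-learn name; older versions use get_feature_names):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."
TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])

cols = TfidfVec.get_feature_names_out()  # the vectorizer's own (lowercased) vocabulary
pd.DataFrame(tfidf.todense(), columns=cols, index=["Tf-Idf"])
# 'aroeira' and 'almonda' now show up, and 'the' appears only once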
