Extracting embedding values of NLP pretrained models from tokenized strings - python

I am using a huggingface pipeline to extract embeddings of words in a sentence. As far as I know, the sentence is first turned into a tokenized string, and the length of the tokenized string might not be equal to the number of words in the original sentence. I need to retrieve the word embedding of a particular word in a sentence.
For example, here is my code:
#https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6
from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re
model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model_pipeline = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)
def find_wordNo_sentence(word, sentence):
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)
    index = splitted_sen.index(word)
    for i, w in enumerate(splitted_sen):
        if word == w:
            return i
    print("not found")  # 0 based
def return_xlnet_embedding(word, sentence):
    word = re.sub(r'[^\w]', " ", word)
    word = " ".join(word.split())
    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())
    id_word = find_wordNo_sentence(word, sentence)
    try:
        data = model_pipeline(sentence)
        n_words = len(sentence.split(" "))
        #print(sentence_emb.shape)
        n_embs = len(data[0])
        print(n_embs, n_words)
        print(len(data[0]))
        if (n_words != n_embs):
            "There is extra tokenized word"
        results = data[0][id_word]
        return np.array(results)
    except:
        return "word not found"
return_xlnet_embedding('your', "what is your name?")
Then the output is:
what is your name
['what', 'is', 'your', 'name']
6 4
6
So the tokenized string that is fed to the pipeline is two tokens longer than my number of words.
How can I find which of these 6 vectors is/are the embedding of my word?

As you may know, huggingface tokenizers contain frequent subwords as well as complete words, so if you want to extract the embedding of a word, keep in mind that it may correspond to more than one vector! In addition, huggingface pipelines encode the input sentence as a first step, and this adds special tokens to the beginning and end of the actual sentence.
string = 'This is a test for clarification'
print(pipeline.tokenizer.tokenize(string))
print(pipeline.tokenizer.encode(string))
output:
['this', 'is', 'a', 'test', 'for', 'cl', '##ari', '##fication']
[101, 2023, 2003, 1037, 3231, 2005, 18856, 8486, 10803, 102]
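If you want to map a word in the original sentence back to its (possibly several) token vectors yourself, one option is to bypass the pipeline and use a fast tokenizer, whose word_ids() method tells you which word each token came from (special tokens map to None). A minimal sketch, assuming a fast tokenizer is available for the checkpoint and averaging sub-token vectors as one possible convention:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")  # fast tokenizer by default
model = AutoModel.from_pretrained("xlnet-base-cased")

sentence = "what is your name"
encoded = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state[0]   # shape: (n_tokens, hidden_size)

word_ids = encoded.word_ids()            # one entry per token, None for special tokens
target = sentence.split().index("your")  # index of the word we care about
vectors = [hidden[i] for i, w in enumerate(word_ids) if w == target]
word_embedding = torch.stack(vectors).mean(dim=0)  # average if the word was split into sub-tokens
print(word_embedding.shape)
Averaging the sub-token vectors is only one convention; taking the first sub-token vector is also common.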

Related

Stopwords coming up in most influential words

I am running some NLP code, trying to find the most influential (positively or negatively) words in a survey. My problem is that, while I successfully add some extra stopwords to the NLTK stopwords file, they keep coming up as influential words later on.
So, I have a dataframe, first column contains scores, second column contains comments.
I add extra stopwords:
stopwords = stopwords.words('english')
extra = ['Cat', 'Dog']
stopwords.extend(extra)
I check that they are added, using the len method before and after.
I create this function to remove punctuation and stopwords from my comments:
def text_process(comment):
    nopunc = [char for char in comment if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords]
I run the model (not going to include the whole code since it doesn't make a difference):
corpus = df['Comment']
y = df['Label']
vectorizer = CountVectorizer(analyzer=text_process)
x = vectorizer.fit_transform(corpus)
...
And then to get the most influential words:
feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), nb.coef_[0])}
for best_positive in sorted(
        feature_to_coef.items(),
        key=lambda x: x[1],
        reverse=True)[:20]:
    print(best_positive)
But, Cat and Dog are in the results.
What am I doing wrong, any ideas?
Thank you very much!
Looks like it is because you have the capitalized words 'Cat' and 'Dog'.
In your text_process function, you have if word.lower() not in stopwords, which only works if the stopwords are lower case.
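For example, a sketch of the fix implied above: lower-case the extra stopwords before extending the list, so they can match word.lower():
stopwords = stopwords.words('english')
extra = ['Cat', 'Dog']
stopwords.extend(word.lower() for word in extra)  # store 'cat', 'dog' so word.lower() can match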

How can I find and print unmatched/dissimilar words from the documents (dataset)?

I am trying to rewrite an algorithm that takes an input text file, compares it with different documents, and reports the similarities.
Now I want to print the unmatched words and write a new text file containing them.
In this code, "hello force" is the input and it is checked against raw_documents; for each document a rank between 0 and 1 is printed (the word "force" matches the second document, so the output gives that document a higher rank, but "hello" does not appear in any raw_document). What I want is to print the unmatched input words, i.e. report "hello" as not matched with any of the raw_documents.
import gensim
import nltk
from nltk.tokenize import word_tokenize
raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
s = 0
for i in corpus:
    s += len(i)
sims = gensim.similarities.Similarity('/usr/workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))
query_doc = [w.lower() for w in word_tokenize("hello force")]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
result = sims[query_doc_tf_idf]
print(result)
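One possible way to report the unmatched query words (a sketch, assuming the code above has run): doc2bow() silently drops tokens that are not in the dictionary, so you can compare the query tokens against dictionary.token2id:
# Hypothetical addition: tokens missing from the dictionary never match any document.
unmatched = [w for w in query_doc if w not in dictionary.token2id]
if unmatched:
    print("not matched:", unmatched)  # e.g. ['hello'] for the query "hello force"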

Code to create a reliable Language model from my own corpus

I have a corpus of sentences in a specific domain.
I am looking for an open-source code/package that I can give the data to, and that will generate a good, reliable language model (meaning: given a context, it knows the probability of each word).
Is there such a code/project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower()  # Lower case all characters in word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        return unigram_fdist[word] / total_words
    else:
        return -1  # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1  # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
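As a quick sanity check, this is how the two functions above might be called (a hedged usage example; the exact numbers depend on the preprocessing steps you applied, and it assumes the lowercasing step was run):
print(getUnigramProbability("the"))              # relative frequency of "the" in the corpus
print(getBigramProbability("<<START>>", "the"))  # probability a sentence starts with "the"
print(getBigramProbability("the", "walrus"))     # P("walrus" | "the")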
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
You might also try word_language_model from the PyTorch examples. There might just be an issue if you have a big corpus, since they load all the data into memory.

NLTK-based stemming and lemmatization

I am trying to preprocess a string with a lemmatizer and then remove punctuation and digits. I am using the code below. I am not getting any error, but the text is not preprocessed appropriately: only the stop words are removed, the lemmatizing does not work, and punctuation and digits remain.
from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)
The final output I get is:
This beautiful day16~ . I ; working exercise45.^^^45 text34 .
And expected output should look like:
This beautiful day I work exercise text
No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer; otherwise, those functions will treat your whole string as a single word (they expect individual words).
import re
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])
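For instance, a hypothetical usage reusing the tweet string from the question:
print(clean("This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."))
# roughly: 'beautiful day work exercise text'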
Alternatively, you can use a PorterStemmer, which does roughly the same thing as lemmatisation, but without using context.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
And, call the stemmer like this:
stemmer.stem(i)
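For illustration, a hypothetical side-by-side comparison (assuming the stemmer and lemmatizer defined above); the two can give different surface forms:
words = ["working", "exercises", "studies"]
print([stemmer.stem(w) for w in words])               # e.g. ['work', 'exercis', 'studi']
print([lemmatizer.lemmatize(w, 'v') for w in words])  # e.g. ['work', 'exercise', 'study']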
I think this is what you're looking for, but do this prior to calling the lemmatizer as the commenter noted.
>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> print(s)
This is a beautiful day I am working on an exercise text
To process a tweet properly you can use the following code:
import re
import string
import nltk

def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """Normalizes case and handles punctuation.
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    bcd = []
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text1 = text.lower()
    text1 = re.sub(pattern, "", text1)
    text1 = text1.replace("'s ", " ")
    text1 = text1.replace("'", "")
    text1 = text1.replace("—", " ")
    table = str.maketrans(string.punctuation, 32 * " ")
    text1 = text1.translate(table)
    geek = nltk.word_tokenize(text1)
    abc = nltk.pos_tag(geek)
    output = []
    for value in abc:
        value = list(value)
        if value[1][0] == "N":
            value[1] = 'n'
        elif value[1][0] == "V":
            value[1] = 'v'
        elif value[1][0] == "J":
            value[1] = 'a'
        elif value[1][0] == "R":
            value[1] = 'r'
        else:
            value[1] = 'n'
        output.append(value)
    abc = output
    for value in abc:
        bcd.append(lemmatizer.lemmatize(value[0], pos=value[1]))
    return bcd
Here I have used pos_tag and mapped the tags to WordNet POS codes (only N, V, J, R; everything else is treated as a noun). This will return a tokenized and lemmatized list of words.
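A hypothetical usage of the process() function defined above:
tokens = process("Cats are running around the garden! See http://example.com")
print(tokens)  # a list of lower-cased, lemmatized tokens with the URL and punctuation stripped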

Wordnet based semantic similarity measurements

Currently I'm working on a WordNet-based semantic similarity measurement project. As I know, these are the steps for computing semantic similarity between two sentences:
Each sentence is partitioned into a list of tokens.
Stemming words.
Part-of-speech disambiguation (or tagging).
Find the most appropriate sense for every word in a sentence (Word Sense Disambiguation).
Compute the similarity of the sentences based on the similarity of the pairs of words.
Now I'm at step 3, but I couldn't get the correct output. I'm not very familiar with Python, so I would appreciate your help.
This is my code.
import nltk
from nltk.corpus import stopwords
def get_tokens():
    test_sentence = open("D:/test/resources/AnswerEvaluation/Sample.txt", "r")
    try:
        for item in test_sentence:
            stop_words = set(stopwords.words('english'))
            token_words = nltk.word_tokenize(item)
            sentence_tokenization = [word for word in token_words if word not in stop_words]
            print(sentence_tokenization)
            return sentence_tokenization
    except Exception as e:
        print(str(e))

def get_stems():
    tokenized_sentence = get_tokens()
    for tokens in tokenized_sentence:
        sentence_stemming = nltk.PorterStemmer().stem(tokens)
        print(sentence_stemming)
        return sentence_stemming

def get_tags():
    stemmed_sentence = get_stems()
    tag_words = nltk.pos_tag(stemmed_sentence)
    print(tag_words)
    return tag_words

get_tags()
Sample.txt contains the sentences: "I was taking a ride in the car." and "I was riding in the car."
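Since nltk.pos_tag() expects a list of tokens rather than a single string, one way to restructure steps 1-3 is to keep each sentence as a token list all the way through. A hedged sketch, keeping the question's file path and step order:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def tag_sentence(sentence):
    tokens = [w for w in nltk.word_tokenize(sentence) if w not in stop_words]  # step 1: tokenize + drop stopwords
    stems = [stemmer.stem(t) for t in tokens]                                  # step 2: stem each token
    # step 3: tag the token list (tagging the original tokens before stemming usually gives better tags)
    return nltk.pos_tag(stems)

with open("D:/test/resources/AnswerEvaluation/Sample.txt", "r") as f:
    for line in f:
        print(tag_sentence(line))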
