How to understand byte pair encoding? - python

I have read a lot of tutorials about BPE, but I am still confused about how it works.
For example, one online tutorial describes the algorithm as follows:
Algorithm
1. Prepare a large enough training data set (i.e. corpus).
2. Define a desired subword vocabulary size.
3. Split each word into a sequence of characters, appending the suffix "</w>" to the end of the word, and keep the word frequency. So the basic unit at this stage is the character. For example, if the frequency of "low" is 5, we rephrase it as "l o w </w>": 5.
4. Generate a new subword from the highest-frequency pair of units.
5. Repeat step 4 until the subword vocabulary size defined in step 2 is reached, or the next highest-frequency pair occurs only once.
Taking "low: 5", "lower: 2", "newest: 6" and "widest: 3" as an example, the highest-frequency subword pair is "e" and "s", because we get 6 counts from "newest" and 3 counts from "widest". The new subword "es" is formed and becomes a candidate in the next iteration.
In the second iteration, the next highest-frequency subword pair is "es" (generated in the previous iteration) and "t", again because we get 6 counts from "newest" and 3 counts from "widest".
I do not understand why "low" is 5 and "lower" is 2.
Does this mean l, o, w, lo, ow + </w> = 6? And then "lower" equals two, but why is it not e, r, er, which would give three?

The numbers you are asking about are the frequencies of the words in the corpus. The word "low" was seen in the corpus 5 times and the word "lower" 2 times (they just assume this for the example).
In the first iteration we see that the character pair "es" is the most frequent one because it appears 6 times in the 6 occurrences of "newest" and 3 times in the 3 occurrences of the word "widest".
In the second iteration we have "es" as a unit in our vocabulary the same way we have single characters. Then we see that "est" is the most common character combination ("newest" and "widest").
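To make the merge step concrete, here is a minimal sketch (my own adaptation of the standard BPE procedure, not the tutorial's exact code); "</w>" marks the end of a word, and pair counts are weighted by word frequency:

import re
from collections import Counter

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

pairs = get_pair_counts(vocab)
best = max(pairs, key=pairs.get)
print(best, pairs[best])      # ('e', 's') 9  -> 6 from "newest" + 3 from "widest"
vocab = merge_pair(best, vocab)
print(vocab)                  # "es" is now a single unit, e.g. 'n e w es t </w>': 6

Running the merge repeatedly (step 5 above) grows the subword vocabulary one unit at a time until the target size is reached.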

Related

NLP- sentiment analysis using word vectors

I have code that does the following:
Generate word vectors using the Brown corpus from nltk
Maintain 2 lists, one with a few positive sentiment words (e.g. good, happy, nice) and the other with negative sentiment words (e.g. bad, sad, unhappy)
Define a statement whose sentiment we wish to obtain.
Perform preprocessing on this statement (tokenize, lowercase, remove special characters, remove stopwords, lemmatize words)
Generate word vectors for all these words and store them in a list
I have a test sentence of 7 words and I wish to determine its sentiment. First I define two lists:
good_words = ['good', 'excellent', 'happy']
bad_words = ['bad', 'terrible', 'sad']
Now I run a loop taking i words at a time, where i ranges from 1 to the sentence length. For a particular i, I have a few windows of words that span the test sentence. For each window, I take the average of the word vectors in the window and compute the Euclidean distance between this windowed vector and the words in the 2 lists. For example, with i = 3 and the test sentence "food looks fresh healthy", I will have 2 windows: "food looks fresh" and "looks fresh healthy". I take the mean of the vectors of the words in each window and compute the Euclidean distance to the good_words and bad_words. So corresponding to each word in both lists I will have 2 values (one per window). I then take the mean of these 2 values for each word in the lists, and whichever word has the least distance lies closest to the test sentence.
I wish to show that window size (i) = 3 or 4 gives the highest accuracy in determining the sentiment of the test sentence, but I am having difficulty achieving this. Any leads on how I can produce my results would be highly appreciated.
Thanks in advance.
import multiprocessing
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from nltk.corpus import brown
from nltk.tokenize import word_tokenize

# convert_lowercase, remove_specialchar, removestopwords, lemmatize and
# euclidian_distance are my own helper functions (defined elsewhere).

b = Word2Vec(brown.sents(), window=5, min_count=5, negative=15, size=50, iter=10,
             workers=multiprocessing.cpu_count())
pos_words = ['good', 'happy', 'nice', 'excellent', 'satisfied']
neg_words = ['bad', 'sad', 'unhappy', 'disgusted', 'afraid', 'fearful', 'angry']
pos_vec = [b[word] for word in pos_words]
neg_vec = [b[word] for word in neg_words]

test = "Sound quality on both end is excellent."
tokenized_word = word_tokenize(test)
lower_tokens = convert_lowercase(tokenized_word)
alpha_tokens = remove_specialchar(lower_tokens)
rem_tokens = removestopwords(alpha_tokens)
lemma_tokens = lemmatize(rem_tokens)
word_vec = [b[word] for word in lemma_tokens]

for i in range(0, len(lemma_tokens)):
    # mean vector of every window of i+1 consecutive words
    windowed_vec = []
    for j in range(0, len(lemma_tokens) - i):
        windowed_vec.append(np.mean([word_vec[j + k] for k in range(0, i + 1)], axis=0))
    # distance of every window to every positive/negative seed word
    gen_pos_arr = []
    gen_neg_arr = []
    for p in range(0, len(pos_vec)):
        gen_pos_arr.append([euclidian_distance(vec, pos_vec[p]) for vec in windowed_vec])
    for q in range(0, len(neg_vec)):
        gen_neg_arr.append([euclidian_distance(vec, neg_vec[q]) for vec in windowed_vec])
    # average over windows, per seed word
    gen_pos_arr_mean = []
    gen_pos_arr_mean.append([np.mean(x) for x in gen_pos_arr])
    gen_neg_arr_mean = []
    gen_neg_arr_mean.append([np.mean(x) for x in gen_neg_arr])
    min_value = np.min([np.min(gen_pos_arr_mean), np.min(gen_neg_arr_mean)])
    for v in gen_pos_arr_mean:
        print('min value:', min_value)
        if min_value in v:
            print('pos', v)
            plt.scatter(i, min_value, color='blue')
            plt.text(i, min_value, pos_words[gen_pos_arr_mean[0].index(min_value)])
        else:
            print('neg', v)
            plt.scatter(i, min_value, color='red')
            plt.text(i, min_value, neg_words[gen_neg_arr_mean[0].index(min_value)])

print(test)
plt.title('')
plt.xlabel('window size')
plt.ylabel('avg of distances of windows from sentiment words')
plt.show()
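As a side note, the per-window computation described above can be written more compactly. Here is a sketch only, reusing the question's word_vec / pos_vec / neg_vec and replacing the hand-rolled euclidian_distance with scipy's cdist:

import numpy as np
from scipy.spatial.distance import cdist

def window_scores(word_vec, ref_vec, window):
    """Mean distance from each reference vector to the windowed sentence vectors."""
    windows = [np.mean(word_vec[j:j + window], axis=0)
               for j in range(len(word_vec) - window + 1)]
    return cdist(np.array(ref_vec), np.array(windows)).mean(axis=1)

# Example usage with the arrays from the question:
# pos_scores = window_scores(word_vec, pos_vec, window=3)
# neg_scores = window_scores(word_vec, neg_vec, window=3)
# sentiment = 'pos' if pos_scores.min() < neg_scores.min() else 'neg'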

How to generate alignments for word-based translation models if number of words are different in both sentences

I am working on implementing IBM Model 1. I have a parallel corpus of some 2,000,000 sentences (English to Dutch). Also, the sentences of the two docs are already aligned. The aim is to translate a Dutch sentence into English and vice-versa.
The code I am using for generating the alignments is:
from itertools import product

A = pair_sent[0].split()  # split the English sentence
B = pair_sent[1].split()  # split the Dutch sentence
trips.append([zip(A, p) for p in product(B, repeat=len(A))])
Now, there are sentence pairs with an unequal number of words (like 10 in English and 14 in the Dutch translation). Our professor told us that we should use NULLs or drop a word, but I don't understand how to do that: where do I insert the NULL, and how do I choose which word to drop?
In the end, I need the pair of sentences to have an equal number of words.
The problem is not that the sentences have a different number of words. After all, the IBM model computes, for each word in the source sentence, a probability distribution over all words in the target sentence, and does not care how many words the target sentence has. The problem is that there might be words that do not have a counterpart in the target sentence.
If you append a NULL word to the target sentence (no matter where, because IBM Model 1 does not consider reordering), you can also model the probability that a word has no counterpart in the target sentence.
The actual bilingual alignment is then obtained with a symmetrization heuristic applied to a pair of IBM models trained in both directions.
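To make the NULL idea concrete, here is a small sketch (my own illustration with made-up names, not the professor's recipe): prepend a NULL token to the target side, and then every source word can align either to a real target word or to NULL:

from itertools import product

def alignments_with_null(src_sentence, tgt_sentence):
    """Enumerate every alignment of source words to target words,
    where 'NULL' means 'no counterpart in the target sentence'."""
    src = src_sentence.split()
    tgt = ['NULL'] + tgt_sentence.split()   # position 0 is the NULL word
    for choice in product(tgt, repeat=len(src)):
        yield list(zip(src, choice))

# 2 source words, 3 target words -> (3 + 1)^2 = 16 possible alignments
for alignment in alignments_with_null("the house", "het huis is"):
    print(alignment)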

Constant part of string

I've got a problem and don't know how to solve it.
For example, I have a dynamically growing file which contains lines split by '\n'.
Each line is a message (string) built from some pattern plus a value part which is specific to that line.
E.g.:
line 1: The temperature is 10 above zero
line 2: The temperature is 16 above zero
line 3: The temperature is 5 degree zero
So, as you see, the constant part (pattern) is
The temperature is zero
Value part:
For line 1 will be: 10 above
For line 2 will be: 16 above
For line 3 will be: 5 degree
Of course this is a very simple example.
In reality there are a great many lines and about ~50 patterns in one file.
The value part may be anything: a number, a word, punctuation, etc.
And my question is: how can I find all possible patterns in the data?
This sounds like a log message clustering problem.
Trivial solution: replace all numbers with the string NUMBER using a regex. You might need to exclude dates or IP addresses or something. That might be enough to give you a list of all patterns in your logs.
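For example, a rough sketch of the number-masking idea (the exact regex will depend on your data):

import re

lines = ["The temperature is 10 above zero",
         "The temperature is 16 above zero",
         "The temperature is 5 degree zero"]

patterns = {re.sub(r'\d+', 'NUMBER', line) for line in lines}
print(patterns)
# {'The temperature is NUMBER above zero', 'The temperature is NUMBER degree zero'}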
Alternately, you might be able to count the number of words (whitespace-delimited fields) in each message and group the messages that way. For example, maybe all messages with 7 words are in the same format. If two different messages have the same format you can also match on the first word or something.
If neither of the above work then things get much more complicated; clustering arbitrary log messages is a research problem. If you search for "event log clustering" on Google Scholar you should see a lot of approaches you can learn from.
If the number of words in a line is fixed, as in your example string, then you can use str.split():
str = '''
The temperature is 10 above zero
The temperature is 16 above zero
The temperature is 5 degree zero
'''
for line in str.split('\n'):
    if len(line.split()) >= 5:
        a, b = line.split()[3], line.split()[4]
        print(a, b)
Output:
10 above
16 above
5 degree
First, we would read the file line by line and add all sentences to a list.
In the example below, I am adding a few lines to a list.
This list has all the sentences:
lstSentences = ['The temperature is 10 above zero', 'The temperature is 16 above zero', 'The temperature is 5 degree above zero','Weather is ten degree below normal', 'Weather is five degree below normal' , 'Weather is two degree below normal']
Create a list to store all patterns
lstPatterns=[]
Initialize
intJ = len(lstSentences)-1
Compare one sentence against the one that follows it. If there are more than 2 matching words between two sentences, perhaps this is a pattern.
for inti, sentence in enumerate(lstSentences):
    if intJ != inti:
        lstMatch = [matching for matching in sentence.split()
                    if matching in lstSentences[inti + 1].split()]
        if len(lstMatch) > 2:  # we need more than 2 words matching between sentences
            if not ' '.join(lstMatch) in lstPatterns:  # if not in list, add
                lstPatterns.append(' '.join(lstMatch))
            lstMatch = []
print(lstPatterns)
I am assuming patterns come one after the other (i.e., 10 rows with one pattern and then, 10 rows with another pattern). If not, the above code needs to change

Count number of texts in which a word occurs

I am building word frequency and relative frequency lists for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of texts in which a word occurs. It strikes me that there are two ways to do this:
First, compile a word frequency dictionary (as below; I'm not using the NLTK FreqDist because this code actually runs more quickly, but if FreqDist has this functionality built in and I just didn't know it, I'll take it):
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq_dic = {}
for text in ftexts:
    words = tokenizer.tokenize(text)
    for word in words:
        # build the frequency dictionary
        try:
            freq_dic[word] += 1
        except KeyError:
            freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
for text in ftexts:
    while True:
        if keyword not in line:
            continue
        else:
            break
    count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use SciKit's term frequency matrix and then count rows in which a word occurs? Is that possible?
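Yes, that is possible. Here is a rough sketch (my own, assuming scikit-learn's CountVectorizer as the term-frequency matrix and reusing ftexts from the snippet above): the number of nonzero entries in each column is the number of texts the word occurs in.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(ftexts)     # texts x terms sparse count matrix
term_freq = X.sum(axis=0).A1             # total frequency of each word
doc_freq = (X > 0).sum(axis=0).A1        # number of texts containing each word
for word, col in vectorizer.vocabulary_.items():
    print(word, term_freq[col], doc_freq[col])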
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1
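Here is a minimal sketch of that report with collections.Counter (my own suggestion, using the same tokenizer as above on the three example sentences):

from collections import Counter
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

texts = ["I need to keep count of the children.",
         "If you want to know what the count is, just ask.",
         "There is nothing here but chickens, chickens, chickens."]

total_freq = Counter()   # overall frequency of each word
doc_freq = Counter()     # number of texts each word occurs in

for text in texts:
    words = [w.lower() for w in tokenizer.tokenize(text)]
    total_freq.update(words)
    doc_freq.update(set(words))   # count each word at most once per text

for word in ('count', 'chickens'):
    print(word, total_freq[word], doc_freq[word])
# count 2 2
# chickens 3 1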

Extracting collocates for a given word from a text corpus - Python

I am trying to find out how to extract the collocates of a specific word from a text. That is: which words form a statistically significant collocation with, e.g., the word "hobbit" across the entire text corpus? I am expecting a result similar to a list of words (collocates), or maybe tuples (my word + its collocate).
I know how to make bi- and tri-grams using nltk, and also how to select only the bi- or trigrams that contain my word of interest. I am using the following code (adapted from this StackOverflow question).
import nltk
from nltk.collocations import *
corpus = nltk.Text(text) # "text" is a list of tokens
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(corpus)
# Only trigrams that appear 3+ times
tri_finder.apply_freq_filter(3)
# Only the ones containing my word
my_filter = lambda *w: 'Hobbit' not in w
tri_finder.apply_ngram_filter(my_filter)
print(tri_finder.nbest(trigram_measures.likelihood_ratio, 20))
This works fine and gives me a list of trigrams (one element of which is my word), each with its log-likelihood value. But I don't really want to select words only from a list of trigrams. I would like to consider all possible n-gram combinations in a window of my choice (for example, all words within 3 positions to the left and 3 to the right of my word, which would mean a 7-gram), and then check which of those words co-occurs with my word of interest at a statistically relevant frequency. I would like to use the log-likelihood value for that.
My idea would be:
1) Calculate all n-gram combinations of different sizes containing my word (not necessarily using nltk, unless it allows calculating units larger than trigrams, but I haven't found that option),
2) Compute the log-likelihood value for each of the words composing my n-grams, and somehow compare it against the frequency of the n-gram they appear in (?). Here is where I get lost a bit... I am not experienced in this and I don't know how to think through this step.
Does anyone have suggestions on how I should do this?
And assuming I use the pool of trigrams provided by nltk for now: does anyone have ideas how to proceed from there to get a list of the most relevant words near my search word?
Thank you
Interesting problem ...
Related to 1), take a look at this thread for several nice ways to make n-grams, for example:
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print(grams)
The other way could be:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# doc is your corpus as an iterable of tokenized sentences
phrases = Phrases(doc, min_count=2)
bigram = Phraser(phrases)
phrases = Phrases(bigram[doc], min_count=2)
trigram = Phraser(phrases)
phrases = Phrases(trigram[doc], min_count=2)
quadgram = Phraser(phrases)
... (you could continue infinitely)
min_count sets the minimum frequency: words and phrases that occur fewer times than this in the corpus are ignored.
Related to 2), it is somewhat tricky to calculate the log-likelihood for more than two variables, since you have to account for all the permutations. Have a look at this thesis, in which the author proposes a solution (page 26 contains a good explanation).
However, in addition to the log-likelihood function, there is the PMI (Pointwise Mutual Information) metric, which scores the co-occurrence of a pair of words against their individual frequencies in the text. PMI is easy to understand and calculate, and you can compute it for each pair of words.
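If you stick with NLTK, one more option is a sketch under the assumption that scoring word pairs within a window (rather than full n-grams) is good enough: BigramCollocationFinder accepts a window_size, so co-occurrences are counted within a sliding window instead of only between adjacent words, and the pairs containing your word can then be ranked by log-likelihood or PMI.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = text                       # "text" is the token list from the question
bigram_measures = BigramAssocMeasures()

# window_size=7 ~ the target word plus up to 3 words on either side
finder = BigramCollocationFinder.from_words(tokens, window_size=7)
finder.apply_freq_filter(3)                                # pairs seen 3+ times
finder.apply_ngram_filter(lambda *w: 'Hobbit' not in w)    # keep pairs with my word

print(finder.nbest(bigram_measures.likelihood_ratio, 20))
print(finder.nbest(bigram_measures.pmi, 20))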
