Calculate PMI values using a given context window - python

Given the following basis:
basis = "Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay."
and the following words:
words = "word, text, bank, tree"
How can I calculate the PMI-values of each word in "words" compared to each word in "basis", where I can use a context window size 5 (that is two positions before and two after the target word)?
I know how to calculate the PMI, but I don't know how to handle the fact of the context window.
I calculate the 'normal' PMI-values as follows:
from math import log

def PMI(ContingencyTable):
    (a,b,c,d,N) = ContingencyTable
    # avoid log(0)
    a += 1
    b += 1
    c += 1
    d += 1
    N += 4
    R_1 = a + b
    C_1 = a + c
    return log(float(a)/(float(R_1)*float(C_1))*float(N),2)

I did a little searching on PMI; it looks like heavy-duty packages are out there, "windowing" included.
In PMI, the "mutual" seems to refer to the joint probability of two different words, so you need to firm up that idea with respect to the problem statement.
I took on the smaller problem of just generating the short windowed lists in your problem statement, mostly for my own exercise:
def wndw(wrd_l, m_l, pre, post):
    """
    returns a list of all lists of sequential words in input wrd_l
    that are within range -pre and +post of any word in wrd_l that matches
    a word in m_l
    wrd_l = list of words
    m_l = list of words to match on
    pre, post = ints giving range of indices to include in window size
    """
    wndw_l = list()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            wndw_l.append([wrd_l[i + k] for k in range(-pre, post + 1)
                           if 0 <= (i + k ) < len(wrd_l)])
    return wndw_l
basis = """Each word of the text is converted as follows: move any
consonant (or consonant cluster) that appears at the start
of the word to the end, then append ay."""
words = "word, text, bank, tree"
print(*wndw(basis.split(), [x.strip() for x in words.split(',')], 2, 2),
      sep="\n")
['Each', 'word', 'of', 'the']
['of', 'the', 'text', 'is', 'converted']
['of', 'the', 'word', 'to', 'the']
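Going one step further, the windowed lists can be turned into counts and then into PMI scores. The following is only a sketch under my own assumptions: every (center word, context word) pair inside the window counts as one observation, and the probabilities are estimated from those pair counts. The name window_pmi is hypothetical, and no smoothing is applied, so only pairs that actually co-occur get a score. In terms of the contingency table in the question, a is the pair count, R_1 and C_1 are the row and column totals, and N is the total number of pairs.

```python
from collections import Counter
from math import log

def window_pmi(tokens, targets, pre=2, post=2):
    """Return {(target, context): PMI in bits} for pairs seen inside the window.

    Assumption: each (center, context) pair within the window is one observation.
    """
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        for k in range(-pre, post + 1):
            j = i + k
            if k != 0 and 0 <= j < len(tokens):
                pair_counts[(w, tokens[j])] += 1
    total = sum(pair_counts.values())
    # Marginal counts: how often a word is a center (row) or a context (col)
    row, col = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        row[w] += n
        col[c] += n
    return {(w, c): log(n * total / (row[w] * col[c]), 2)
            for (w, c), n in pair_counts.items() if w in targets}

# Tiny example: "b" is surrounded by "a" on both sides
scores = window_pmi("a b a c".split(), {"b"}, pre=1, post=1)
print(scores)
```

With the text from the question, the call would look like window_pmi(basis.split(), {"word", "text", "bank", "tree"}); words such as bank and tree never occur in basis, so they get no score unless you add smoothing as in your PMI function.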

Related

How to encode (replace) parts of word from end to beginning for some N value (like abcabcab to cbacbaba for n=3)?

I would like to create a program for encoding and decoding words.
Specifically, the program should take a part of the word (n characters at a time) and reverse it.
This cycle runs until the whole word is encoded.
First I compute the number of parts of the word, which is the number of full groups of n characters plus a possible remainder
(for example, Language with n = 3 has 3 parts: two parts of 3 characters and one remainder of 2 characters). I call this number general.
Then, depending on general, I run a loop that takes a character n times and adds it to group (a group has n characters).
At the end of each pass, I add the group (in reverse order) to new_word and reset group.
The goal is, for example, to encode the word Language with n = 2 as aLgnaueg,
or Language with n = 3 as naL aug eg, and so on.
Another example is the word abcabcab with n = 3, which should become cba cba ba.
The output of my code is not right. The output for n = 3 is "naLaugeg".
Could I ask how to improve it? Is there a simpler Python function to do this?
My code is here:
n = 3
word = "Language"
new_word = ""
group = ""
divisions = (len(word)//n)
residue = (len(word)%n)
general = divisions + residue
for i in range(general):
    j=2
    for l in range(n):
        group += word[i+j]
        print(word[i+j], l)
        j=j-1
    for j in range((len(group)-1),-1,-1):
        new_word += group[j]
        print(word[j])
    group = ""
    print(group)
print(new_word)
import textwrap
n = 3
word = "Language"
chunks = textwrap.wrap(word, n)
reversed_chunks = [chunk[::-1] for chunk in chunks]
>>> print(' '.join(reversed_chunks))
naL aug eg
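If you would rather not depend on textwrap (which is meant for wrapping prose and treats whitespace specially), plain slicing does the same chunking. This is a small sketch; the function name encode is my own choice. Because each chunk is reversed in place, applying the function a second time with the same n decodes the word again.

```python
def encode(word, n):
    """Reverse each consecutive chunk of n characters (the last chunk may be shorter)."""
    return ''.join(word[i:i + n][::-1] for i in range(0, len(word), n))

print(encode("Language", 2))          # aLgnaueg
print(encode("abcabcab", 3))          # cbacbaba
print(encode(encode("Language", 3), 3))  # Language
```

Use ' '.join(...) instead of ''.join(...) if you want the groups separated by spaces.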

Time complexity of a sliding window question

I'm working on the following problem:
Given a string and a list of words, find all the starting indices of substrings in the given string that are a concatenation of all the given words exactly once without any overlapping of words. It is given that all words are of the same length. For example:
Input: String = "catfoxcat", Words = ["cat", "fox"]
Output: [0, 3]
Explanation: The two substrings containing both the words are "catfox" & "foxcat".
My solution is:
def find_word_concatenation(str, words):
    result_indices = []
    period = len(words[0])
    startIndex = 0
    wordCount = {}
    matched = 0
    for w in words:
        if w not in wordCount:
            wordCount[w] = 1
        else:
            wordCount[w] += 1
    for endIndex in range(0, len(str) - period + 1, period):
        rightWord = str[endIndex: endIndex + period]
        if rightWord in wordCount:
            wordCount[rightWord] -= 1
            if wordCount[rightWord] == 0:
                matched += 1
        while matched == len(wordCount):
            if endIndex + period - startIndex == len(words)*period:
                result_indices.append(startIndex)
            leftWord = str[startIndex: startIndex + period]
            if leftWord in wordCount:
                wordCount[leftWord] += 1
                if wordCount[leftWord] > 0:
                    matched -= 1
            startIndex += period
    return result_indices
Can anyone help me figure out its time complexity please?
We should start by drawing a distinction between the time complexity of your code vs what you might actually be looking for.
In your case, you have a set of nested loops (a for and a while). So, worst case, which is what Big O is based on, you would do each of those while loops n times. But you also have that outer loop which would also be done n times.
O(n) * O(n) = O(n^2)
Which is not very good. Now, while not really so bad with this example, imagine if you were looking for "what a piece of work is man" in all of the Library of Congress or even in the collected works of Shakespeare.
On the plus side, you can refactor your code and get it down quite a bit.
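When reasoning about or refactoring code like this, it can help to have a simple reference version to compare results against on small inputs. The sketch below is my own (the function name is hypothetical) and is deliberately naive: it tries every starting index and compares word counts, so its cost is roughly (number of start positions) x (number of words) x (word length).

```python
from collections import Counter

def find_word_concatenation_bruteforce(text, words):
    """Reference version: test every start index against the required word counts."""
    if not words:
        return []
    word_len = len(words[0])
    total_len = word_len * len(words)
    needed = Counter(words)
    result = []
    for start in range(len(text) - total_len + 1):
        # Cut the candidate substring into consecutive pieces of word_len characters
        pieces = Counter(text[start + k:start + k + word_len]
                         for k in range(0, total_len, word_len))
        if pieces == needed:
            result.append(start)
    return result

print(find_word_concatenation_bruteforce("catfoxcat", ["cat", "fox"]))  # [0, 3]
```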

Apply collocation from list of bigrams with NLTK in Python

I have to find and "apply" collocations in several sentences. The sentences are stored in a list of strings. Let's focus on only one sentence for now.
Here's an example:
sentence = 'I like to eat the ice cream in new york'
Here's what I want in the end:
sentence_final = 'I like to eat the ice_cream in new_york'
I'm using Python NLTK to find the collocations and I'm able to create a set containing all the possible collocations over all the sentences I have.
Here's an example of the set:
set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])
It's obviously bigger in reality.
I created the following function, which should return the new sentence, modified as described above:
import nltk

def apply_collocations(sentence, set_colloc):
    window_size = 2
    words = sentence.lower().split()
    list_bigrams = list(nltk.bigrams(words))
    set_bigrams=set(list_bigrams)
    intersect = set_bigrams.intersection(set_colloc)
    print(set_colloc)
    print(set_bigrams)
    # No collocation in this sentence
    if not intersect:
        return sentence
    # At least one collocation in this sentence
    else:
        set_words_iters = set()
        # Create set of words of the collocations
        for bigram in intersect:
            set_words_iters.add(bigram[0])
            set_words_iters.add(bigram[1])
        # Sentence beginning
        if list_bigrams[0][0] not in set_words_iters:
            new_sentence = list_bigrams[0][0]
            begin = 1
        else:
            new_sentence = list_bigrams[0][0] + '_' + list_bigrams[0][1]
            begin = 2
        for i in range(begin, len(list_bigrams)):
            print(new_sentence)
            if list_bigrams[i][1] in set_words_iters and list_bigrams[i] in intersect:
                new_sentence += ' ' + list_bigrams[i][0] + '_' + list_bigrams[i][1]
            elif list_bigrams[i][1] not in set_words_iters:
                new_sentence += ' ' + list_bigrams[i][1]
        return new_sentence
Two questions:
Is there a more optimized way to do this?
Since I'm a little inexperienced with NLTK, can someone tell me if there's a "direct way" to apply collocations to a certain text? I mean, once I have identified the bigrams which I consider collocations, is there some function (or fast method) to modify my sentences?
You can simply replace the string "x y" by "x_y" for each element in your collocations set:
def apply_collocations(sentence, set_colloc):
    res = sentence.lower()
    for b1,b2 in set_colloc:
        res = res.replace("%s %s" % (b1 ,b2), "%s_%s" % (b1 ,b2))
    return res
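One caveat with plain str.replace is that it also matches inside longer words, so "nice cream" would become "nice_cream" for the collocation ('ice', 'cream'). If that matters for your data, a variant with word boundaries is one way around it. This is only a sketch using the standard re module; the function name apply_collocations_re is my own.

```python
import re

def apply_collocations_re(sentence, set_colloc):
    """Join collocations with '_' only when both words match as whole words."""
    res = sentence.lower()
    for b1, b2 in set_colloc:
        # \b keeps 'ice cream' from matching inside 'nice cream'
        pattern = r"\b%s %s\b" % (re.escape(b1), re.escape(b2))
        res = re.sub(pattern, lambda m: b1 + "_" + b2, res)
    return res

set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])
print(apply_collocations_re('I like to eat the ice cream in new york', set_collocations))
# i like to eat the ice_cream in new_york
```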

I want to extract a certain number of words surrounding a given word in a long string (paragraph) in Python 2.7

I am trying to extract a selected number of words surrounding a given word. I will give an example to make it clear:
string = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."
1) The selected word is development and I need to get the 6 words surrounding it: [to, the, full, of, the, human]
2) But if the selected word is at the beginning or in second position, I still need to get 6 words, e.g.:
The selected word is shall, and I should get: [Education, be, directed, to, the, full]
I should use the 're' module. What I have managed to find so far is:
import re
def search(text,n):
    '''Searches for text, and retrieves n words either side of the text, which are returned separately'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n,'place',word*n), text).groups()
    return groups[:n],groups[n:]
but it helps me only with the first case. Can someone help me out with this? I would be really grateful. Thank you in advance!
This will extract all occurrences of the target word in your text, with context:
import re
text = ("Education shall be directed to the full development of the human personality "
        "and to the strengthening of respect for human rights and fundamental freedoms.")
def search(target, text, context=6):
    # It's easier to use re.findall to split the string,
    # as we get rid of the punctuation
    words = re.findall(r'\w+', text)
    matches = (i for (i,w) in enumerate(words) if w.lower() == target)
    for index in matches:
        if index < context //2:
            yield words[0:context+1]
        elif index > len(words) - context//2 - 1:
            yield words[-(context+1):]
        else:
            yield words[index - context//2:index + context//2 + 1]
print(list(search('the', text)))
# [['be', 'directed', 'to', 'the', 'full', 'development', 'of'],
#  ['full', 'development', 'of', 'the', 'human', 'personality', 'and'],
#  ['personality', 'and', 'to', 'the', 'strengthening', 'of', 'respect']]
print(list(search('shall', text)))
# [['Education', 'shall', 'be', 'directed', 'to', 'the', 'full']]
print(list(search('freedoms', text)))
# [['respect', 'for', 'human', 'rights', 'and', 'fundamental', 'freedoms']]
This is tricky, with potential for off-by-one errors, but I think this meets your spec. I have left out the removal of punctuation; it is probably best to remove it before sending the string for analysis. I assumed case was not important.
test_str = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."
def get_surrounding_words(search_word, s, n_words):
    words = s.lower().split(' ')
    try:
        i = words.index(search_word)
    except ValueError:
        return []
    # Word is near start
    if i < n_words//2:
        words.pop(i)
        return words[:n_words]
    # Word is near end
    elif i >= len(words) - n_words//2:
        words.pop(i)
        return words[-n_words:]
    # Word is in middle
    else:
        words.pop(i)
        return words[i-n_words//2:i+n_words//2]
def test(word):
    print('{}: {}'.format(word, get_surrounding_words(word, test_str, 6)))
test('notfound')
test('development')
test('shall')
test('education')
test('fundamental')
test('for')
test('freedoms')
import sys
args = sys.argv[1:]
if len(args) != 2:
    # sys.exit prints the message and stops; the os module has no exit() function
    sys.exit("Use with <string> <query>")
text = args[0]
query = args[1]
words = text.split()
op = []
left = 3
right = 3
try:
    index = words.index(query)
    if index <= left:
        start = 0
    else:
        start = index - left
    if start + left + right + 1 > len(words):
        start = len(words) - left - right - 1
        if start < 0:
            start = 0
    while len(op) < left + right and start < len(words):
        if start != index:
            op.append(words[start])
        start += 1
except ValueError:
    pass
print(op)
How does this work?
Find the word in the string.
See if we can take left+right words around that index.
Take left+right words and save them in op.
Print op.
A simple approach to your problem. First separates all the words and then selects words from left and right.
def custom_search(sentence, word, n):
    given_string = sentence
    given_word = word
    total_required = n
    word_list = given_string.strip().split(" ")
    length_of_words = len(word_list)
    output_list = []
    given_word_position = word_list.index(given_word)
    word_from_left = 0
    word_from_right = 0
    if given_word_position + 1 > total_required // 2:
        word_from_left = total_required // 2
        if given_word_position + 1 + (total_required // 2) <= length_of_words:
            word_from_right = total_required // 2
        else:
            word_from_right = length_of_words - (given_word_position + 1)
            remaining_words = (total_required // 2) - word_from_right
            word_from_left += remaining_words
    else:
        word_from_right = total_required // 2
        word_from_left = given_word_position
        if word_from_left + word_from_right < total_required:
            remaining_words = (total_required // 2) - word_from_left
            word_from_right += remaining_words
    required_words = []
    for i in range(given_word_position - word_from_left, word_from_right +
                   given_word_position + 1):
        if i != given_word_position:
            required_words.append(word_list[i])
    return required_words
sentence = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."
custom_search(sentence, "shall", 6)
>>['Education', 'be', 'directed', 'to', 'the', 'full']
custom_search(sentence, "development", 6)
>>['to', 'the', 'full', 'of', 'the', 'human']
I don't think regular expressions are necessary here. Assuming the text is well-constructed, just split it up into an array of words, and write a couple if-else statements to make sure it retrieves the necessary amount of surrounding words:
def search(text, word, n):
    # text is the string you are searching
    # word is the word you are looking for
    # n is the TOTAL number of words you want surrounding the word
    words = text.split(" ") # Create an array of words from the string
    position = words.index(word) # Find the position of the desired word
    distance_from_end = len(words) - position # How many words are after the word in the text
    if position < n // 2 + n % 2: # If there aren't enough words before...
        return words[:position], words[position + 1:n + 1]
    elif distance_from_end < n // 2 + n % 2: # If there aren't enough words after...
        return words[position - n + distance_from_end:position], words[position + 1:]
    else: # Otherwise, extract an equal number of words from both sides (take from the left if odd)
        return words[position - n // 2 - n % 2:position], words[position + 1:position + 1 + n//2]
string = "Education shall be directed to the full development of the human personality and to the \
strengthening of respect for human rights and fundamental freedoms."
print(search(string, "shall", 6))
# >> (['Education'], ['be', 'directed', 'to', 'the', 'full'])
print(search(string, "human", 5))
# >> (['development', 'of', 'the'], ['personality', 'and'])
In your example you didn't have the target word included in the output, so I kept it out as well. If you'd like the target word included simply combine the two arrays the function returns (join them at position).
Hope this helped!
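The edge cases handled by the if/else chains in the answers above can also be expressed by clamping the start of the window, which leaves fewer places for off-by-one errors. The following is only a sketch: the function name surrounding_words is my own, it uses re.findall to drop punctuation, matches case-insensitively, and only looks at the first occurrence of the target.

```python
import re

def surrounding_words(text, target, n=6):
    """Return n words around the first occurrence of target, excluding target itself."""
    words = re.findall(r"\w+", text)
    lowered = [w.lower() for w in words]
    if target.lower() not in lowered:
        return []
    i = lowered.index(target.lower())
    # Ideal start is n//2 words before the target; clamp it so the window
    # of n + 1 words stays inside the text at both ends.
    start = max(0, min(i - n // 2, len(words) - (n + 1)))
    return words[start:i] + words[i + 1:start + n + 1]

text = ("Education shall be directed to the full development of the human personality "
        "and to the strengthening of respect for human rights and fundamental freedoms.")
print(surrounding_words(text, "development"))
# ['to', 'the', 'full', 'of', 'the', 'human']
print(surrounding_words(text, "shall"))
# ['Education', 'be', 'directed', 'to', 'the', 'full']
```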

The Alphabet and Recursion

I'm almost done with my program, but I've made a subtle mistake. My program is supposed to take a word, and by changing one letter at a time, is eventually supposed to reach a target word, in the specified number of steps. I had been trying at first to look for similarities, for example: if the word was find, and the target word lose, here's how my program would output in 4 steps:
['find','fine','line','lone','lose']
Which is actually the output I wanted. But if you consider a tougher set of words, like Java and work, the output is supposed to be in 6 steps.
['java', 'lava', 'lave', 'wave', 'wove', 'wore', 'work']
So my mistake is that I didn't realize you could get to the target word, by using letters that don't exist in the target word or original word.
Here's my Original Code:
import string
def changeling(word,target,steps):
    alpha=string.ascii_lowercase
    x=word##word and target has been changed to keep the coding readable.
    z=target
    if steps==0 and word!= target:##if the target can't be reached, return nothing.
        return []
    if x==z:##if target has been reached.
        return [z]
    if len(word)!=len(target):##if the word and target word aren't the same length print error.
        print "error"
        return None
    i=1
    if lookup(z[0]+x[1:]) is True and z[0]+x[1:]!=x :##check every letter that could be from z, in variations of, and check if they're in the dictionary.
        word=z[0]+x[1:]
    while i!=len(x):
        if lookup(x[:i-1]+z[i-1]+x[i:]) and x[:i-1]+z[i-1]+x[i:]!=x:
            word=x[:i-1]+z[i-1]+x[i:]
        i+=1
    if lookup(x[:len(x)-1]+z[len(word)-1]) and x[:len(x)-1]+z[len(x)-1]!=x :##same applies here.
        word=x[:len(x)-1]+z[len(word)-1]
    y = changeling(word,target,steps-1)
    if y :
        return [x] + y##used to concatenate the first word to the final list, and if the list goes past the amount of steps.
    else:
        return None
Here's my current code:
import string
def changeling(word,target,steps):
    alpha=string.ascii_lowercase
    x=word##word and target has been changed to keep the coding readable.
    z=target
    if steps==0 and word!= target:##if the target can't be reached, return nothing.
        return []
    if x==z:##if target has been reached.
        return [z]
    holderlist=[]
    if len(word)!=len(target):##if the word and target word aren't the same length print error.
        print "error"
        return None
    i=1
    for items in alpha:
        i=1
        while i!=len(x):
            if lookup(x[:i-1]+items+x[i:]) is True and x[:i-1]+items+x[i:]!=x:
                word =x[:i-1]+items+x[i:]
                holderlist.append(word)
            i+=1
        if lookup(x[:len(x)-1]+items) is True and x[:len(x)-1]+items!=x:
            word=x[:len(x)-1]+items
            holderlist.append(word)
    y = changeling(word,target,steps-1)
    if y :
        return [x] + y##used to concatenate the first word to the final list, and if the/
        list goes past the amount of steps.
    else:
        return None
The differences between the two is that the first checks every variation of find with the letters from lose. Meaning: lind, fond, fisd, and fine. Then, if it finds a working word with the lookup function, it calls changeling on that newfound word.
As opposed to my new program, which checks every variation of find with every single letter in the alphabet.
I can't seem to get this code to work. I've tested it by simply printing what the results are of find:
for items in alpha:
    i=1
    while i!=len(x):
        print (x[:i-1]+items+x[i:])
        i+=1
    print (x[:len(x)-1]+items)
This gives:
aind
fand
fiad
fina
bind
fbnd
fibd
finb
cind
fcnd
ficd
finc
dind
fdnd
fidd
find
eind
fend
fied
fine
find
ffnd
fifd
finf
gind
fgnd
figd
fing
hind
fhnd
fihd
finh
iind
find
fiid
fini
jind
fjnd
fijd
finj
kind
fknd
fikd
fink
lind
flnd
fild
finl
mind
fmnd
fimd
finm
nind
fnnd
find
finn
oind
fond
fiod
fino
pind
fpnd
fipd
finp
qind
fqnd
fiqd
finq
rind
frnd
fird
finr
sind
fsnd
fisd
fins
tind
ftnd
fitd
fint
uind
fund
fiud
finu
vind
fvnd
fivd
finv
wind
fwnd
fiwd
finw
xind
fxnd
fixd
finx
yind
fynd
fiyd
finy
zind
fznd
fizd
finz
Which is perfect! Notice that each letter in the alphabet goes through my word at least once. Now, what my program does is use a helper function to determine if that word is in a dictionary that I've been given.
Consider this: unlike in my first program, I now receive multiple legal words, but when I do word=foundword I replace the previous word each time. That is why I'm trying holderlist.append(word).
I think my problem is that I need changeling to run through each word in holderlist, and I'm not sure how to do that. Although that's only speculation.
Any help would be appreciated,
Cheers.
I might be slightly confused about what you need, but by borrowing from this post I believe I have some code that should be helpful.
>>> alphabet = 'abcdefghijklmnopqrstuvwxyz'
>>> word = 'java'
>>> splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
>>> splits
[('', 'java'), ('j', 'ava'), ('ja', 'va'), ('jav', 'a'), ('java', '')]
>>> replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
>>> replaces
['aava', 'bava', 'cava', 'dava', 'eava', 'fava', 'gava', 'hava', 'iava', 'java', 'kava', 'lava', 'mava',
 'nava', 'oava', 'pava', 'qava', 'rava', 'sava', 'tava', 'uava', 'vava', 'wava', 'xava', 'yava', 'zava',
 'java', 'jbva', 'jcva', 'jdva', 'jeva', 'jfva', 'jgva', 'jhva', 'jiva', 'jjva', 'jkva', 'jlva', 'jmva',
 'jnva', 'jova', 'jpva', 'jqva', 'jrva', 'jsva', 'jtva', 'juva', 'jvva', 'jwva', 'jxva', 'jyva', 'jzva',
 'jaaa', 'jaba', 'jaca', 'jada', 'jaea', 'jafa', 'jaga', 'jaha', 'jaia', 'jaja', 'jaka', 'jala', 'jama',
 'jana', 'jaoa', 'japa', 'jaqa', 'jara', 'jasa', 'jata', 'jaua', 'java', 'jawa', 'jaxa', 'jaya', 'jaza',
 'java', 'javb', 'javc', 'javd', 'jave', 'javf', 'javg', 'javh', 'javi', 'javj', 'javk', 'javl', 'javm',
 'javn', 'javo', 'javp', 'javq', 'javr', 'javs', 'javt', 'javu', 'javv', 'javw', 'javx', 'javy', 'javz']
Once you have a list of all possible replaces, you can simply do
valid_words = [valid for valid in replaces if lookup(valid)]
Which should give you all words that can be formed by replacing 1 character in word. By placing this code in a separate method, you could take a word, obtain possible next words from that current word, and recurse over each of those words. For example:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def next_word(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    return [valid for valid in replaces if lookup(valid)]
Is this enough help? I think your code could really benefit by separating tasks into smaller chunks.
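To illustrate what "recurse over each of those words" could look like, here is a sketch that explores the candidate words level by level (a breadth-first search rather than plain recursion, which is my substitution) and therefore finds a shortest chain. The WORDS set and the lookup function below are hypothetical stand-ins for the dictionary helper you were given, and shortest_ladder is a name I made up.

```python
from collections import deque

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'
# Hypothetical stand-in for the real dictionary behind lookup()
WORDS = {'find', 'fine', 'line', 'lone', 'lose'}

def lookup(word):
    return word in WORDS

def next_words(word):
    """All dictionary words that differ from word in exactly one position."""
    return [word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET
            if c != word[i] and lookup(word[:i] + c + word[i + 1:])]

def shortest_ladder(start, target):
    """Breadth-first search; returns the list of words from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in next_words(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_ladder('find', 'lose'))
# ['find', 'fine', 'line', 'lone', 'lose']
```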
Fixed your code:
import string
def changeling(word, target, steps):
    alpha=string.ascii_lowercase
    x = word #word and target has been changed to keep the coding readable.
    z = target
    if steps == 0 and word != target: #if the target can't be reached, return nothing.
        return []
    if x == z: #if target has been reached.
        return [z]
    holderlist = []
    if len(word) != len(target): #if the word and target word aren't the same length print error.
        raise BaseException("Starting word and target word not the same length: %d and %d" % (len(word), len(target)))
    i = 1
    for items in alpha:
        i=1
        while i != len(x):
            if lookup(x[:i-1] + items + x[i:]) is True and x[:i-1] + items + x[i:] != x:
                word = x[:i-1] + items + x[i:]
                holderlist.append(word)
            i += 1
        if lookup(x[:len(x)-1] + items) is True and x[:len(x)-1] + items != x:
            word = x[:len(x)-1] + items
            holderlist.append(word)
    y = [changeling(pos_word, target, steps-1) for pos_word in holderlist]
    if y:
        return [x] + y #used to concatenate the first word to the final list, and if the list goes past the amount of steps.
    else:
        return None
Where len(word) and len(target) differ, it'd be better to raise an exception than to print something obscure that is non-fatal and comes without a stack trace.
Oh, and backslashes (\), not forward slashes (/), are used to continue lines. And they don't work on comments.
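If the goal is a single flat list such as ['java', 'lava', ..., 'work'], one option is to try each candidate in turn and return the first chain that reaches the target within the remaining steps. The following is a sketch of that idea, not a drop-in replacement: WORDS and lookup are hypothetical stand-ins for your dictionary helper, and I return None on failure and raise ValueError for mismatched lengths.

```python
import string

# Hypothetical stand-in for the real dictionary behind lookup()
WORDS = {'java', 'lava', 'lave', 'wave', 'wove', 'wore', 'work'}

def lookup(word):
    return word in WORDS

def changeling(word, target, steps):
    """Depth-limited search: return a chain from word to target using at most steps changes."""
    if len(word) != len(target):
        raise ValueError("Starting word and target word not the same length: %d and %d"
                         % (len(word), len(target)))
    if word == target:
        return [word]
    if steps == 0:
        return None
    for i in range(len(word)):
        for letter in string.ascii_lowercase:
            candidate = word[:i] + letter + word[i + 1:]
            if candidate != word and lookup(candidate):
                # Recurse on this candidate; keep the first chain that succeeds
                rest = changeling(candidate, target, steps - 1)
                if rest:
                    return [word] + rest
    return None

print(changeling('java', 'work', 6))
# ['java', 'lava', 'lave', 'wave', 'wove', 'wore', 'work']
```

Because the search may revisit words, it can be slow on a large dictionary; tracking already visited words would be a natural next step.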
