Replace Words on the Basis of Bigram Frequency, Python - python

I have a Series-type object to which I have to apply a function that uses bigrams to correct a word when it occurs with another one. I created a list of bigrams, sorted it by frequency (highest first), and called it fdist.
bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams) #computes freq of occurrence
fdist = freq.keys() # sorted according to freq
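Note that depending on the NLTK version, freq.keys() may not actually be sorted by frequency (in NLTK 3, FreqDist behaves like a collections.Counter). A safer way to get the bigrams in descending order of count is most_common(); a minimal sketch:
fdist = [pair for pair, count in freq.most_common()] # bigram tuples, highest count first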
Next, I created a function that accepts each line (a sentence, i.e. one element of a list) and uses the bigrams to decide whether to correct it further or not.
def bigram_corr(line): #function with input line (sentence)
    words = line.split() #split line into words
    for word1, word2 in zip(words[:-1], words[1:]): #generate 2 words at a time: words 1,2 followed by 2,3 then 3,4 and so on
        for i, j in fdist: #iterate over bigrams
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 1 or 2, replace word with highest occurring bigram
                word1 = i #replace
                return word1 #return word
The problem is that only a single word is returned for an entire sentence, e.g.
"Lts go twards the east" is replaced by "lets". It looks like the further iterations aren't working.
The for loop over word1, word2 works this way:
"Lts go" in 1st iteration, which will be eventually replaced by "lets" as lets occurs more frequently with "go"
"go towards" in 2nd iteration.
"towards the" in 3rd iteration.. and so on.
There is a minor error which I can't figure out, please help.

Sounds like you're doing word1 = i with the expectation that this will modify the contents of words. But this won't happen. If you want to modify words, you'll have to do so directly. Use enumerate to keep track of word1's index.
As 2rs2ts pointed out, you're returning early. If you want the inner loop to terminate once you find the first good replacement, break instead of returning. Then return at the end of the function.
def bigram_corr(line): #function with input line (sentence)
    words = line.split() #split line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i, j in fdist: #iterate over bigrams
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 1 or 2, replace word with highest occurring bigram
                words[idx] = i
                break
    return " ".join(words)

The return statement halts the function entirely. I think what you want is:
def bigram_corr(line):
    words = line.split()
    words_to_return = []
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words_to_return.append(i)
    return ' '.join(words_to_return)
This puts each of the words which you have processed into a list, then rejoins them with spaces and returns that entire string, since you said something about returning "the entire sentence."
I am not sure if the semantics of your code are correct, since I don't have the jf library or whatever it is that you're using and therefore I can't test this code, so this may or may not solve your problem entirely. But this will help.

Related

How to check generated strings against a text file

I'm trying to have the user input a string of characters with one asterisk. The asterisk indicates a character that can be subbed out for a vowel (a,e,i,o,u) in order to see what substitutions produce valid words.
Essentially, I want to take an input "l*g" and have it return "lag, leg, log, lug", because "lig" is not a valid English word. Below, invalid words are represented as "x".
I've gotten it to properly output each possible combination (e.g., including "lig"), but once I try to compare these words with the text file I'm referencing (for the list of valid words), it only returns 5 lines of x's. I'm guessing I'm improperly importing or reading the file?
Here's the link to the file I'm looking at so you can see the formatting:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip
Using the "en" file ~2.5MB
It's not in a dictionary layout i.e. no corresponding keys/values, just lines (maybe I could use the line number as the index, but I don't know how to do that). What can I change to check the test words to narrow down which are valid words based on the text file?
import os

with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.readlines()

inputted_word = input("Enter a word with ' * ' as the missing letter: ")

letters = []
for l in inputted_word:
    letters.append(l)

### find the index of the blank
asterisk = inputted_word.index('*') # also used a redundant int(), works fine

### sub in vowels
vowels = ['a','e','i','o','u']
list_of_new_words = []
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)
    list_of_new_words.append(new_word)

for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
There are probably more efficient ways to do this, but I'm brand new to this. The last two for loops could probably be combined but debugging it was tougher that way.
print(list_of_new_words)
gives
['lag', 'leg', 'lig', 'log', 'lug']
So far, so good.
But this:
for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
Here you print new_word, which is defined in the previous for loop:
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters) # <----
    list_of_new_words.append(new_word)
So after the loop, new_word still has the last value it was assigned: "lug" (if the script input was l*g).
You probably meant w instead?
for w in list_of_new_words:
    if w in words:
        print(w)
    else:
        print('x')
But it still prints 5 x's...
So that means that w in words is always False. How is that?
Looking at words:
print(words[0:10]) # the first 10 will suffice
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
All the words from the dictionary contain a newline character (\n) at the end. I guess you were not aware that this is what readlines does. So I recommend using:
words = f.read().splitlines()
instead.
With these 2 modifications (w and splitlines):
Enter a word with ' * ' as the missing letter: l*g
lag
leg
x
log
lug
🎉
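As a side note, membership tests against a list scan every element, so each lookup walks the whole word list. Converting it to a set makes each in check effectively constant time; a minimal tweak, using the same file path as above:
with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = set(f.read().splitlines()) # set membership is O(1) per lookup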

How to extract first letter of every nth word in a sentence?

I was trying to extract the first letter of every 5th word, and after doing a bit of research I was able to figure out how to obtain every 5th word. But how do I now extract the first letters of every 5th word and put them together to make a word out of them? This is my progress so far:
def extract(text):
    for word in text.split()[::5]:
        print(word)

extract("I like to jump on trees when I am bored")
As the comment pointed out, split it and then just access the first character:
def extract(text):
    for word in text.split(" "):
        print(word[0])
text.split(" ") returns an array and we are looping through that array. word is the current entry (string) in that array. Now, in python you can access the first character of a string in typical array notation. Therefore, word[0] returns the first character of that word, word[-1] would return the last character of that word.
I don't know how you solved the first part but cannot solve the second one, but anyway: strings in Python are simply sequences of characters, so to access the 1st character you use the 0th index. Applying that to your example, as the comment mentioned, you type word[0].
So you can print word[0], or collect the 1st characters in a list to do further operations on them (I believe that is what you want to do, not just print them!):
def extract(text):
    mychars = []
    for word in text.split()[::5]:
        mychars.append(word[0])
    print(mychars)

extract("I like to jump on trees when I am bored")
The below code might help you out. Just an example idea based on what you said.
#
# str text   : A string of words, such as a sentence.
# int split  : Split the string every nth word
# int maxLen : Max number of chars extracted from the beginning of each word
#
def extract(text, split, maxLen):
    newWord = ""
    # Every nth word
    for word in text.split()[::split]:
        if len(word) < maxLen:
            newWord += word[0:] # Entire word (if maxLen is small)
        else:
            newWord += word[:maxLen] # Beginning of word up to the nth letter
    return (None if newWord == "" else newWord)

text = "The quick brown fox jumps over the lazy dog."
result = extract(text, split=5, maxLen=2) # Use split=5, maxLen=1 to do what you said specifically
if result:
    print(result) # Expected output: "Thov"

Return first word in sentence? [duplicate]

This question already has answers here:
How to extract the first and final words from a string?
(7 answers)
Closed 5 years ago.
Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting words from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence-ending words that appear in the given string. There should be no duplicate entries in the returned list, and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return my list
but I don't know if I'm using the right approach. I need some help.
Several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split() # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
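For the example text from the question, this returns the expected starting word:
print(startwords("This is a sentence. The last sentence had four words."))
# ['The']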
mylist = astring.split()
if mylist.endswith('.')
That cannot work, one of the reasons being that mylist is a list and doesn't have endswith as a method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
match all words following a dot and optional spaces
result: ['The']
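A similar pattern collects the ending words, deduplicated as the assignment requires (a sketch; set ordering may vary):
print(list(set(re.findall(r"(\w+)\.", "This is a sentence. The last sentence had four words."))))
# ['sentence', 'words']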
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
This creates a set so there are no duplicates. Then a list comprehension goes over the list of sentences (split by "."); for each sentence it splits it into words, takes the last word as a one-element list via [-1:], and gets the [0] item of that list:
print(set([x.split()[-1:][0] for x in s.split(".") if len(x.split()) > 0]))
The if is in theory not needed, but I couldn't make it work without it.
This works as well:
print(set([x.split()[len(x.split())-1] for x in s.split(".") if len(x.split()) > 0]))
This is one way to do it ->
#!/usr/bin/env python
from sets import Set

sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = Set()

for word in sentence.split():
    if '.' in word:
        # check if the period (.) is at the end
        if '.' == word[len(word) - 1]:
            uniq_end_words.add(word.rstrip('.'))

print list(uniq_end_words)
Output (list of all the end words in a given sentence) ->
['words', 'sentence']
If your input string has a period in one of its words (let's say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']

'difficult' Determine proximity between 2 strings in python

I have 2 strings, loss of gene and aquaporin protein. I want to find whether these two exist in a line of my file within a proximity of 5 words.
Any ideas? I have searched extensively but cannot find anything.
Also, since these are multi-word strings, I cannot use abs(array.index) for the two (which was possible with single words).
Thanks
You could try the following approach:
First sanitise your text by converting it to lowercase, keeping only word characters and enforcing one space between each word.
Next, search for each of the phrases in the resulting text and note the starting index and the length of each phrase matched. Sort this index list.
Next, make sure that all of the phrases were present in the text by checking that all found indexes are not -1.
If all are found, count the number of words between the end of the first phrase and the start of the last phrase. To do this, take a text slice starting from the end of the first phrase to the start of the second phrase, and split it into words.
Script as follows:
import re

text = "The Aquaporin protein, sometimes 'may' exhibit a big LOSS of gene."
text = ' '.join(re.findall(r'\b(\w+)\b', text.lower()))
indexes = sorted((text.find(x), len(x)) for x in ['loss of gene', 'aquaporin protein'])

if all(i[0] != -1 for i in indexes) and len(text[indexes[0][0] + indexes[0][1] : indexes[-1][0]].split()) <= 5:
    print "matched"
To extend this to work on a file with a list of phrases, the following approach could be used:
import re

log = 'loss of gene'
phrases = ['aquaporin protein', 'another protein']

with open('input.txt') as f_input:
    for number, line in enumerate(f_input, start=1):
        # Sanitise the line
        text = ' '.join(re.findall(r'\b(\w+)\b', line.lower()))
        # Only process lines containing 'loss of gene'
        log_index = text.find(log)
        if log_index != -1:
            for phrase in phrases:
                phrase_index = text.find(phrase)
                if phrase_index != -1:
                    if log_index < phrase_index:
                        start, end = (log_index + len(log), phrase_index)
                    else:
                        start, end = (phrase_index + len(phrase), log_index)
                    if len(text[start:end].split()) <= 5:
                        print "line {} matched - {}".format(number, phrase)
                        break
This would give you the following kind of output:
line 1 matched - aquaporin protein
line 5 matched - another protein
Note, this will only spot one phrase pair per line.
I am not completely sure if this is what you want, but I'll give it a shot!
In Python, you can use "in" to check if a string is in another string. I am going to assume you already have a way to store a line from a file:
"loss of gene" in fileLine -> returns boolean (either True or False)
With this you can check if "loss of gene" and "aquaporin protein" are in your line from your file. Once you have confirmed that they are both there you can check their proximity by splitting the line of text into a list as so:
wordsList = fileLine.split()
If in your text file you have the string:
"The aquaporin protein sometimes may exhibit a loss of gene"
After splitting it becomes:
["The","aquaporin","protein","sometimes","may","exhibit","a","loss","of","gene"]
I'm not sure if that is a valid sentence but for the sake of example let's ignore it :P
Once you have the line of text split into a list of words and confirmed the words are in there, you can get their proximity with the index function that comes with lists in python!
wordsList.index("protein") -> returns index 2
After finding what index "protein" is at you can check what index "loss" is at, then subtract them to find out if they are within a 5 word proximity.
You can use the index function to discern if "loss of gene" comes before or after "aquaporin protein". If "loss of gene" comes first, index "gene" and "aquaporin" and subtract those indexes. If "aquaporin protein" comes first, index "protein" and "loss" and subtract those indexes.
You will have to do a bit more to ensure that you subtract indexes correctly if the words come in different orders, but this should cover the meat of the problem. Good luck Chahat!
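A minimal sketch of that index-based check, using a sublist search since each phrase spans several words (within_proximity and find_sublist are made-up names for illustration):
def within_proximity(line, phrase_a, phrase_b, max_words=5):
    words = line.lower().split()
    a_words, b_words = phrase_a.split(), phrase_b.split()

    def find_sublist(sub): # index of the first occurrence of sub inside words, or -1
        for i in range(len(words) - len(sub) + 1):
            if words[i:i + len(sub)] == sub:
                return i
        return -1

    ia, ib = find_sublist(a_words), find_sublist(b_words)
    if ia == -1 or ib == -1:
        return False
    # words between the end of the earlier phrase and the start of the later one
    if ia < ib:
        gap = ib - (ia + len(a_words))
    else:
        gap = ia - (ib + len(b_words))
    return gap <= max_words

print(within_proximity("The aquaporin protein sometimes may exhibit a loss of gene",
                       "loss of gene", "aquaporin protein")) # True (4 words apart)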

Optimize a find and match code in Python

I have a code which takes as input two files:
(1) a dictionary/lexicon
(2) a text file (one sentence per line)
The first part of my code reads the dictionary into tuples, so it outputs something like:
('mthy3lkw', 'weakBelief', 'U')
('mthy3lkm', 'firmBelief', 'B')
('mthy3lh', 'notBelief', 'A')
The second part of the code searches each sentence in the text file for the words in position 0 of those tuples and then prints out the sentence, the search word and its type.
So, given the sentence mthy3lkw ana mesh 3arif, the desired output is:
["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'], given that the first word is found in the dictionary.
The second part of my code - the matching part - is TOO slow. How do I make it faster?
Here is my code
findings = []
for sentence in data: # I open the sentences file with .readlines()
    for word in tuples: # similar to the ones mentioned above
        p1 = re.compile('\\b%s\\b' % word[0]) # get the first word in every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])
print findings
Build a dict lookup structure so you can find the correct one from your tuples quickly. Then you can restructure your loops so that instead of going through your whole dictionary for each sentence, trying to match every entry up, you instead go over each word in the sentence and look it up in the dictionary dict:
# Create a lookup structure for words
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b') # only need to create the regexp once
for sentence in data:
    for word in word_re.findall(sentence): # Check every word in the sentence
        if word in word_dictionary: # A match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])
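With the sample tuples and sentence from the question, running the loop above should produce exactly the desired row:
tuples = [('mthy3lkw', 'weakBelief', 'U'),
          ('mthy3lkm', 'firmBelief', 'B'),
          ('mthy3lh', 'notBelief', 'A')]
data = ["mthy3lkw ana mesh 3arif"]
# after the loop: findings == [['mthy3lkw ana mesh 3arif', 'mthy3lkw', 'weakBelief', 'U']]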
Convert your list of tuples into a trie, and use that for searching.
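If you want to experiment with that idea, here is a minimal dict-of-dicts trie sketch (build_trie and lookup are hypothetical helpers, and '$' is an arbitrary end-of-word marker):
def build_trie(tuples):
    trie = {}
    for entry in tuples:
        node = trie
        for ch in entry[0]: # entry[0] is the dictionary word
            node = node.setdefault(ch, {})
        node['$'] = entry # mark end of word and store the full tuple
    return trie

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get('$') # None unless a complete word ends here

For exact whole-word lookups a plain dict (as above) is usually just as fast; a trie mainly pays off for prefix matching.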
