Optimize a find and match code in Python

I have code which takes two files as input:
(1) a dictionary/lexicon
(2) a text file (one sentence per line)
The first part of my code reads the dictionary into tuples, so it outputs something like:
('mthy3lkw', 'weakBelief', 'U')
('mthy3lkm', 'firmBelief', 'B')
('mthy3lh', 'notBelief', 'A')
The second part of the code searches each sentence in the text file for the words in position 0 of those tuples and then prints out the sentence, the search word, and its type.
So given the sentence mthy3lkw ana mesh 3arif, the desired output is:
["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'] since the word mthy3lkw is found in the dictionary.
The second part of my code - the matching part - is TOO slow. How do I make it faster?
Here is my code:
findings = []
for sentence in data:  # I open the sentences file with .readlines()
    for word in tuples:  # similar to the ones mentioned above
        p1 = re.compile(r'\b%s\b' % word[0])  # get the first word in every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])
print(findings)

Build a dict lookup structure so you can find the correct tuple quickly. Then restructure your loops: instead of going through your whole dictionary for each sentence and trying to match every entry, go over each word in the sentence and look it up in the dict:
# Create a lookup structure for words
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b')  # only need to create the regexp once
for sentence in data:
    for word in word_re.findall(sentence):  # Check every word in the sentence
        if word in word_dictionary:  # A match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])

Convert your list of tuples into a trie, and use that for searching.
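A trie only pays off if you also need prefix matching, but a minimal sketch of a dict-based trie (illustrative code, not from the question) might look like this:
def build_trie(entries):
    # each node is a plain dict keyed by characters; '_entry' marks a complete word
    root = {}
    for entry in entries:
        node = root
        for ch in entry[0]:  # entry[0] is the lexicon word
            node = node.setdefault(ch, {})
        node['_entry'] = entry
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get('_entry')  # None if word is only a prefix of a lexicon entry

trie = build_trie([('mthy3lkw', 'weakBelief', 'U'), ('mthy3lkm', 'firmBelief', 'B')])
print(lookup(trie, 'mthy3lkw'))  # ('mthy3lkw', 'weakBelief', 'U')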

Related

How to get the next word from a string according to a list element in Python

I am new to Python and trying to solve this problem.
words = ['plus', 'Constantly', 'the']
string = "Plus, I Constantly adding new resources, guides, and the personality quizzes to help you travel beyond the Guidebook"
output: I adding Guidebook
Here I want to match each list element in the string and get the next word from the string to construct the new output.
I tried to do it by splitting the string into a list of words and checking if they are in the list. But 'Plus,' won't match because of the ',', and there are also two occurrences of 'the' but I only need the word after the last 'the'.
One way to do this is to use a regex to split the string into words (the pattern used is [\w]+). Then you can build a dictionary of consecutive word pairs, so that you can look up a word to retrieve the word that follows it.
import re
words = ['plus', 'Constantly', 'the']
string = "Plus, I Constantly adding new resources, guides, and the personality quizzes to help you travel beyond the Guidebook"
string_splits = re.findall(r'[\w]+',string)
pairs = {x:y for x,y in zip(map(lambda x: x.lower(), string_splits), string_splits[1:])}
print(' '.join(pairs.get(word.lower()) for word in words))
Edit to expand the dict comprehension:
pairs = {}
for x,y in zip(map(lambda x: x.lower(), string_splits), string_splits[1:]):
    pairs[x] = y
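Either way, pairs maps each lowercased word to the word that follows it; a later duplicate key (the second 'the') overwrites the earlier one, which is why the last 'the' wins. For the example string:
print(pairs['plus'])        # I
print(pairs['constantly'])  # adding
print(pairs['the'])         # Guidebook (the earlier 'the' -> 'personality' pair is overwritten)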

Duplicates within a sentence of a text file in Python

Hi, I want to write code that reads a text file and identifies the sentences in the file containing words that are duplicated within that sentence. I was thinking of putting each sentence of the file in a dictionary and finding which sentences have duplicates. Since I am new to Python, I need some help writing the code.
This is what I have so far:
def Sentences():
    def Strings():
        l = string.split('.')
        for x in range(len(l)):
            print('Sentence', x + 1, ': ', l[x])
        return
    text = open('Rand article.txt', 'r')
    string = text.read()
    Strings()
    return
The code above converts files to sentences.
Suppose you have a file where each line is a sentence, e.g. "sentences.txt":
I contain unique words.
This sentence repeats repeats a word.
The strategy could be to split the sentence into its constituent words, then use set to find the unique words in the sentence. If the resulting set is shorter than the list of all words, then you know that the sentence contains at least one duplicated word:
sentences_with_dups = []
with open("sentences.txt") as fh:
for sentence in fh:
words = sentence.split(" ")
if len(set(words)) != len(words):
sentences_with_dups.append(sentence)
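Run against the example file above, only the second line ends up in the result (note that the split is on a single space, so punctuation and the trailing newline stay attached to words):
print(sentences_with_dups)
# ['This sentence repeats repeats a word.\n']  -- only the line with the repeated word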

Return first word in sentence? [duplicate]

Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting words from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence-ending words that appear in the given string. There should be no duplicate entries in the returned list, and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return my list
but I don't know if I'm using the right approach. I need some help.
Several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split()  # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
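For example, calling it on the sample text from the question:
text = "This is a sentence. The last sentence had four words."
print(startwords(text))  # ['The']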
mylist = astring.split()
if mylist.endswith('.')
that cannot work, one of the reasons being that mylist is a list, and doesn't have endswith as a method.
Another answer fixed your approach, so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
This matches all words following a dot and optional spaces.
Result: ['The']
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
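On the same sample text this returns the ending words without their periods (order may vary because a set is used):
print(endwords("This is a sentence. The last sentence had four words."))
# ['sentence', 'words']  -- order not guaranteed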
This creates a set so there are no duplicates. It then loops over the list of sentences (split on "."); for each sentence it splits it into words, uses [-1:] to make a list containing only the last word, and takes item [0] of that list.
print(set([x.split()[-1:][0] for x in s.split(".") if len(x.split()) > 0]))
The if in theory is not needed, but I couldn't make it work without it.
This works as well:
print(set([x.split()[len(x.split()) - 1] for x in s.split(".") if len(x.split()) > 0]))
This is one way to do it ->
#!/usr/bin/env python
from sets import Set

sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = Set()

for word in sentence.split():
    if '.' in word:
        # check if period (.) is at the end
        if '.' == word[len(word) - 1]:
            uniq_end_words.add(word.rstrip('.'))

print list(uniq_end_words)
Output (list of all the end words in a given sentence) ->
['words', 'sentence']
If your input string has a period inside one of its words (let's say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']

Python: Searching text for Strings from one List and replacing them with Strings from another

I'm trying to create a program to do very simple encryption/compression.
Basically, the program requests input in the form of a sentence which is split into words, e.g.
this is the gamma
I then have two lists:
mylist1 = ("alpha","beta","gamma","delta")
mylist2 = ("1","2","3","4")
Each word is matched against a list which contains a number of words. If a word is present in the list, then it should be replaced with the corresponding number from the other list. The output should be:
this is the 3
Here is the code I have so far:
text = input("type your sentence \n")
words = text.split(" ")
mylist1 = ("alpha","beta","gamma","delta")
mylist2 = ("1","2","3","4")
#I was looking at zipping the lists together but wasn't sure I was on the right track
#ziplist = zip(mylist1, mylist2)
for word in words:
    if word in mylist1:
        text = text.replace(word,mylist2[])
print (text)
This question that I was looking into yesterday shows how to do this using a dictionary, but I encountered a problem when trying to convert the text file back from numbers to words again. (I swapped the keys and values around; I'm pretty certain I shouldn't do that.)
Any input would be fantastic.
You need to get the index of word in mylist1 and replace the occurrence of word in the variable text with the element at that same index in mylist2:
text = text.replace(word, mylist2[mylist1.index(word)])
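For the conversion back from numbers to words, a pair of dicts built with zip avoids swapping keys and values by hand. A minimal sketch using the lists from the question (the translate helper is just an illustrative name):
mylist1 = ("alpha", "beta", "gamma", "delta")
mylist2 = ("1", "2", "3", "4")

encode = dict(zip(mylist1, mylist2))  # word   -> number
decode = dict(zip(mylist2, mylist1))  # number -> word

def translate(sentence, mapping):
    # replace each word found in the mapping, leave the rest untouched
    return " ".join(mapping.get(word, word) for word in sentence.split(" "))

print(translate("this is the gamma", encode))  # this is the 3
print(translate("this is the 3", decode))      # this is the gamma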

Replace words on the basis of bigram frequency, Python

I have a Series-type object where I have to apply a function that uses bigrams to correct a word in case it occurs with another one. I created a bigrams list, sorted it according to frequency (highest comes first) and called it fdist.
bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams) #computes freq of occurrence
fdist = freq.keys() # sorted according to freq
Next, I created a function that accepts each line (or sentence, an element of a list) and uses the bigrams to decide whether to correct it further or not.
def bigram_corr(line): #function with input line (sentence)
    words = line.split() #split line into words
    for word1, word2 in zip(words[:-1], words[1:]): #generate 2 words at a time: words 1,2 followed by 2,3, 3,4 and so on
        for i,j in fdist: #iterate over bigrams
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 2 or 1, replace word with highest occurring bigram
                word1=i #replace
                return word1 #return word
The problem is that only a single word is returned for an entire sentence, e.g.:
"Lts go twards the east is" is replaced by "lets". It looks like further iterations aren't working.
The for loop over word1, word2 works this way:
"Lts go" in the 1st iteration, which will eventually be replaced by "lets", as "lets" occurs more frequently with "go",
"go towards" in the 2nd iteration,
"towards the" in the 3rd iteration, and so on.
There is a minor error which I can't figure out, please help.
Sounds like you're doing word1 = i with the expectation that this will modify the contents of words. But this won't happen. If you want to modify words, you'll have to do so directly. Use enumerate to keep track of word1's index.
As 2rs2ts pointed out, you're returning early. If you want the inner loop to terminate once you find the first good replacement, break instead of returning. Then return at the end of the function.
def bigram_corr(line): #function with input line (sentence)
    words = line.split() #split line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i,j in fdist: #iterate over bigrams
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 2 or 1, replace word with highest occurring bigram
                words[idx] = i
                break
    return " ".join(words)
The return statement halts the function entirely. I think what you want is:
def bigram_corr(line):
    words = line.split()
    words_to_return = []
    for word1, word2 in zip(words[:-1], words[1:]):
        for i,j in fdist:
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3):
                words_to_return.append(i)
    return ' '.join(words_to_return)
This puts each of the words which you have processed into a list, then rejoins them with spaces and returns that entire string, since you said something about returning "the entire sentence."
I am not sure if the semantics of your code are correct, since I don't have the jf library or whatever it is that you're using and therefore I can't test this code, so this may or may not solve your problem entirely. But this will help.
