Determine proximity between 2 strings in Python

I have 2 strings, loss of gene and aquaporin protein. I want to find whether these two occur in a line of my file within a proximity of 5 words.
Any ideas? I have searched extensively but cannot find anything.
Also, since these are multi-word strings, I cannot simply take abs() of the difference of their list indexes (which was possible with single words).
Thanks

You could try the following approach:
First sanitise your text by converting it to lowercase, keeping only word characters, and enforcing a single space between words.
Next, search for each of the phrases in the resulting text and keep a note of the starting index and the length of the phrase matched. Sort this index list.
Next make sure that all of the phrases were present in the text by making sure all found indexes are not -1.
If all are found count the number of words between the end of the first phrase, and the start of the last phrase. To do this take a text slice starting from the end of the first phrase to the start of the second phrase, and split it into words.
Script as follows:
import re

text = "The Aquaporin protein, sometimes 'may' exhibit a big LOSS of gene."
text = ' '.join(re.findall(r'\b(\w+)\b', text.lower()))
indexes = sorted((text.find(x), len(x)) for x in ['loss of gene', 'aquaporin protein'])

if all(i[0] != -1 for i in indexes) and len(text[indexes[0][0] + indexes[0][1] : indexes[-1][0]].split()) <= 5:
    print("matched")
To extend this to work on a file with a list of phrases, the following approach could be used:
import re

log = 'loss of gene'
phrases = ['aquaporin protein', 'another protein']

with open('input.txt') as f_input:
    for number, line in enumerate(f_input, start=1):
        # Sanitise the line
        text = ' '.join(re.findall(r'\b(\w+)\b', line.lower()))
        # Only process lines containing 'loss of gene'
        log_index = text.find(log)
        if log_index != -1:
            for phrase in phrases:
                phrase_index = text.find(phrase)
                if phrase_index != -1:
                    if log_index < phrase_index:
                        start, end = (log_index + len(log), phrase_index)
                    else:
                        start, end = (phrase_index + len(phrase), log_index)
                    if len(text[start:end].split()) <= 5:
                        print("line {} matched - {}".format(number, phrase))
                        break
This would give you the following kind of output:
line 1 matched - aquaporin protein
line 5 matched - another protein
Note that this will only spot one phrase pair per line; a variation that reports every matching phrase is sketched below.
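For completeness, here's a rough sketch of that variation. It keeps the same assumptions as the script above (a hypothetical input.txt, the same sanitising step) but collects every phrase within 5 words of 'loss of gene' instead of breaking at the first:
import re

log = 'loss of gene'
phrases = ['aquaporin protein', 'another protein']

with open('input.txt') as f_input:
    for number, line in enumerate(f_input, start=1):
        # Sanitise the line as before
        text = ' '.join(re.findall(r'\b(\w+)\b', line.lower()))
        log_index = text.find(log)
        if log_index == -1:
            continue
        # Collect every phrase within 5 words, rather than stopping at the first
        matches = []
        for phrase in phrases:
            phrase_index = text.find(phrase)
            if phrase_index != -1:
                if log_index < phrase_index:
                    start, end = log_index + len(log), phrase_index
                else:
                    start, end = phrase_index + len(phrase), log_index
                if len(text[start:end].split()) <= 5:
                    matches.append(phrase)
        if matches:
            print("line {} matched - {}".format(number, ', '.join(matches)))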

I am not completely sure if this is what you want, but I'll give it a shot!
In Python, you can use "in" to check if a string is in another string. I am going to assume you already have a way to store a line from a file:
"loss of gene" in fileLine -> returns boolean (either True or False)
With this you can check if "loss of gene" and "aquaporin protein" are in your line from your file. Once you have confirmed that they are both there, you can check their proximity by splitting the line of text into a list, like so:
wordsList = fileLine.split()
If in your text file you have the string:
"The aquaporin protein sometimes may exhibit a loss of gene"
After splitting it becomes:
["The","aquaporin","protein","sometimes","may","exhibit","a","loss","of","gene"]
I'm not sure if that is a valid sentence but for the sake of example let's ignore it :P
Once you have the line of text split into a list of words and confirmed the words are in there, you can get their proximity with the index function that comes with lists in Python!
wordsList.index("protein") -> returns index 2
After finding what index "protein" is at you can check what index "loss" is at, then subtract them to find out if they are within a 5 word proximity.
You can use the index function to discern if "loss of gene" comes before or after "aquaporin protein". If "loss of gene" comes first, index "gene" and "aquaporin" and subtract those indexes. If "aquaporin protein" comes first, index "protein" and "loss" and subtract those indexes.
You will have to do a bit more to ensure that you subtract indexes correctly if the words come in different orders, but this should cover the meat of the problem. Good luck Chahat!
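Putting that prose into code, a rough sketch (it assumes the line has already been lowercased and stripped of punctuation so that split() and index() see clean words; the function name is just illustrative):
def within_proximity(file_line, max_gap=5):
    # Check that both phrases are present first
    line = file_line.lower()
    if "loss of gene" not in line or "aquaporin protein" not in line:
        return False
    words = line.split()
    if words.index("gene") < words.index("aquaporin"):
        # 'loss of gene' comes first: count the words between 'gene' and 'aquaporin'
        gap = words.index("aquaporin") - words.index("gene") - 1
    else:
        # 'aquaporin protein' comes first: count the words between 'protein' and 'loss'
        gap = words.index("loss") - words.index("protein") - 1
    return gap <= max_gap

print(within_proximity("The aquaporin protein sometimes may exhibit a loss of gene"))  # True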

Related

Find the biggest sentence in a txt file using Python

I'm trying to find the biggest sentence in a text file, using the dot (.) to mark the beginning and end of sentences. The text file doesn't have special punctuation (like ? or !).
My code currently returns only the first letter of my text file, and I'm not sure why.
def recherche(source):
    "find the biggest sentence"
    fs = open(source, "r")
    while 1:
        txt = fs.readline()
        if txt == "":
            break
        else:
            grande_phrase = max(txt, key=len)
            print(grande_phrase)
    fs.close()

recherche("for92.txt")
Your current code reads each line, and finds the max of that line. Since a string is just a collection of characters, your expression max(txt, key=len) gives you the character in txt that has the maximum length. Since all characters have a length of 1, you just get the first character of the line.
You want to create a list of all sentences, and then use max on that list. There seems to be no guarantee that your input file will have one sentence per line. Since you use a period to define where a sentence ends, you're going to have to split the entire file at . to get your list of sentences. Keep in mind that this is not a foolproof strategy to split any text into sentences, since you risk splitting at other occurrences of ., such as a decimal point or an abbreviation.
def recherche(source):
    "find the biggest sentence"
    with open(source, "r") as fs:
        sentences = fs.read().split(".")
    grande_phrase = max(sentences, key=len)
    print(grande_phrase)
With an input file that looks like so:
It was the best of times. It was the worst of times. It was the age of wisdom. It was the age of foolishness. It was the epoch of belief. It was the epoch of incredulity. It was the season of light. It was the season of darkness. It was the spring of hope. It was the winter of despair.
we get the output:
It was the epoch of incredulity
Try it online. (Note: I replaced the file with an io.StringIO to make it work on tio.run.)
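The decimal-point pitfall mentioned above can be partly worked around with a regex lookahead; a rough sketch (abbreviations like "Dr." would still fool it):
import re

def sentences(text):
    # Split on '.' only when it is not immediately followed by a digit,
    # so decimal numbers such as 3.5 survive intact
    return [s.strip() for s in re.split(r'\.(?!\d)', text) if s.strip()]

text = "Prices rose 3.5 percent. It was the worst of times."
print(max(sentences(text), key=len))  # It was the worst of times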

How to find and manipulate words in sentences in python?

I am trying to identify words within sentences that are made up only of numbers. Once I find a word made up only of numbers, I have a certain manipulation I would like to apply to it. I am able to do this manipulation to a single string of numbers, but I am at a loss as to how to do it when such strings are randomly positioned across a sentence.
To do so to one string, I confirmed it was only numbers and iterated through its characters so that I skipped the first number, changed the rest to certain letter values and added a new character to the end. These specifics aren't necessarily what is important. I am trying to find a way of treating each random "word" of numbers in a sentence the same way. Is this possible?
I am not supposed to use any advanced functions. Only loops, enumerate, if chains, string functions etc. I feel like I am just overthinking something!
NUM_BRAILLE = "*"
digits = '1234567890'
decade = "abcdefghij"  # digits 1-9 then 0 map to letters a-j

def numstuff(s):
    if len(s) == 1 and s.isdigit():
        s = s + NUM_BRAILLE
    elif " " not in s and s.isdigit():
        start_s = s[:1]
        s = s[1:]
        for i in s:
            if i in digits:
                s = s.replace(i, decade[int(i) - 1])
        s = start_s + s + NUM_BRAILLE
    else:
        # if the sentence contains many " " (spaces), how do I find the "words"
        # of numbers and treat them using the method above?
        pass
You can do something like this to extract numeric values from a sentence and pass the values to your function.
sentence = "This is 234 some text 888 with few words in 33343 numeric"
words = sentence.split(" ")
values = [int(word) if word.isdigit() else 0 for word in words]
print(values)
Output:
[0, 0, 234, 0, 0, 888, 0, 0, 0, 0, 33343, 0]
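To actually apply the asker's per-word manipulation rather than just extracting the numbers, a sketch along these lines could work (process_sentence is a hypothetical helper, and the digit-to-letter string is assumed to be the full "abcdefghij"):
NUM_BRAILLE = "*"
DECADE = "abcdefghij"  # digits 1-9 then 0 map to letters a-j

def numstuff(word):
    # Keep the first digit, map the remaining digits to letters,
    # and append the number marker
    if word.isdigit():
        rest = "".join(DECADE[int(d) - 1] for d in word[1:])
        return word[0] + rest + NUM_BRAILLE
    return word

def process_sentence(sentence):
    # Treat each whitespace-separated word independently
    return " ".join(numstuff(word) for word in sentence.split())

print(process_sentence("This is 234 some text 888"))  # This is 2cd* some text 8hh*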

How do I find the position of a word in a .txt file in Python 3

I have a .txt filled with 4000 different words that are listed vertically.
I'm supposed to find the position of a certain word that I input, but I get really weird values for words that do exist in the file. This is what my code looks like so far.
The very first word on the list is 'the', so if I input 'the' into search_word, I get 0 when I'm supposed to get 1. Another example: if I input 'be', I get 4 when it's supposed to be ranked at 2.
I think the problem is that my code is only scanning each character in the list instead of scanning each word separately. I have no clue how to fix this!
You can use enumerate to generate ranking numbers instead, and the for-else construct to output the word and break the loop as soon as the word is found, or wait until the loop finishes without breaking to decide that there is no matching word found:
with lexicon_file as file:
    for i, w in enumerate(file, 1):
        if search_word == w.rstrip():
            print("According to our Lexicon, '%s' is ranked number %d in the most common words of contemporary American English" % (search_word, i))
            break
    else:
        print("According to our Lexicon, '%s' is NOT in the 4000 most common words of contemporary American English" % search_word)

Replace Words on the basis of Bigram Frequency,Python

I have a Series-type object to which I have to apply a function that uses bigrams to correct each word in case it occurs together with another one. I created a bigrams list, sorted it according to frequency (highest first), and called it fdist.
bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams) #computes freq of occurrence
fdist = freq.keys() # sorted according to freq
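One caveat: in recent NLTK versions, FreqDist.keys() is not sorted by frequency, so it is safer to build fdist from most_common(), which does return entries in descending frequency order:
# most_common() yields (bigram, count) pairs, highest frequency first
fdist = [bigram for bigram, count in freq.most_common()]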
Next, I created a function that accepts each line (a sentence, i.e. one element of a list) and uses the bigrams to decide whether to correct it further or not.
def bigram_corr(line):  # function with input line (sentence)
    words = line.split()  # split the line into words
    for word1, word2 in zip(words[:-1], words[1:]):  # take 2 words at a time: words 1,2 then 2,3 then 3,4 and so on
        for i, j in fdist:  # iterate over the bigrams
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):  # if the 2nd words match and the 1st word is at an edit distance of 1 or 2, replace it with the highest-occurring bigram's word
                word1 = i  # replace
                return word1  # return the word
The problem is that only a single word is returned for the entire sentence, e.g.:
"Lts go twards the east is" comes back as just "lets". It looks like the further iterations aren't working.
The for loop over word1, word2 works this way:
"Lts go" in the 1st iteration, where "Lts" should eventually be replaced by "lets", since "lets" occurs most frequently with "go".
"go towards" in the 2nd iteration.
"towards the" in the 3rd iteration, and so on.
There is a minor error which I can't figure out, please help.
Sounds like you're doing word1 = i with the expectation that this will modify the contents of words. But this won't happen. If you want to modify words, you'll have to do so directly. Use enumerate to keep track of word1's index.
As 2rs2ts pointed out, you're returning early. If you want the inner loop to terminate once you find the first good replacement, break instead of returning. Then return at the end of the function.
def bigram_corr(line):  # function with input line (sentence)
    words = line.split()  # split the line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i, j in fdist:  # iterate over the bigrams
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):  # same matching test as before
                words[idx] = i
                break
    return " ".join(words)
The return statement halts the function entirely. I think what you want is:
def bigram_corr(line):
    words = line.split()
    words_to_return = []
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words_to_return.append(i)
    return ' '.join(words_to_return)
This puts each of the words which you have processed into a list, then rejoins them with spaces and returns that entire string, since you said something about returning "the entire sentence."
I am not sure if the semantics of your code are correct, since I don't have the jf library or whatever it is that you're using and therefore I can't test this code, so this may or may not solve your problem entirely. But this will help.

Limit the number of sentences in a string

A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences (which may end with ., ?, or !)?
Ignoring considerations such as when a . constitutes the end of sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution, you change the 1 in {0,1} to one less than the number of sentences you want to extract.
The second solution isn't quite as robust in handling, e.g., empty strings, or strings with no punctuation. It could be made so, but the regex would be even more complex than it is already, and I would favour the slightly less efficient first solution over an unreadable mess.
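To make the sentence count an explicit parameter, the first approach generalises naturally; a small sketch (first_sentences is just an illustrative name):
import re

def first_sentences(phrase, n=2):
    # Split after each terminator; the first n pieces are the sentences we want
    parts = re.split(r'(?<=[.?!])\s+', phrase, maxsplit=n)
    return ' '.join(parts[:n])

print(first_sentences("One. Two? Three! Four.", 2))  # One. Two?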
I solved it like this: Separating sentences. A comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...
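For reference, NLTK exposes its sentence segmenter as sent_tokenize; a minimal sketch (it assumes the punkt model has been downloaded via nltk.download('punkt')):
import nltk

text = "Dr. Smith went home. It was late! Was it raining?"
sentences = nltk.sent_tokenize(text)
print(' '.join(sentences[:2]))  # the first two sentences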
Here's how you could do it:
text = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
sentences = text.split(".")
allSentences = []
for sentence in sentences:
    allSentences.extend(sentence.split("?"))
print(allSentences[0:2])
There are probably better ways, I look forward to seeing them.
Here is a step by step explanation of how to disassemble, choose the first two sentences, and reassemble it. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

# split off the first two sentences at the dot/question/exclamation marks,
# keeping each separator as its own list item
sentences = re.split('([.?!])', testline, maxsplit=2)
print("result of split: ", sentences)

# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print(firstTwo)

# put the first two sentences back together
finalLine = ''.join(firstTwo)
print(finalLine)
A generator alternative, using my utility function that yields pieces of the string up to any item in a search sequence:
from itertools import islice

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

def multis(search_sequence, text, start=0):
    """multisearch by given search sequence values from text, starting from position start,
    yielding tuples of the text before a found item and the found sequence item"""
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            if x:
                yield (x, ch)
            else:
                yield ch
            x = ''
        else:
            x += ch
    else:
        if x:
            yield x

# split off the first two sentences at the dot/question/exclamation marks
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the result of generation
print("result of split: ", two_sentences)
print('\n'.join(sentence.strip() + sep for sentence, sep in two_sentences))
