Printing text after searching for a string - python

I am following a tutorial to identify and print the words on either side of a particular word in a string.
f is the string "Mango grapes Lemon Ginger Pineapple":
def findFruit(f):
    global fruit
    found = [re.search(r'(.*?) (Lemon) (.*?)$', word) for word in f]
    for i in found:
        if i is not None:
            fruit = i.group(1)
            fruit = i.group(3)
grapes and Ginger will be output when I print fruit. However, I want the output to look like "grapes" # "Ginger" (note the quotation marks and the # sign).

You can use string formatting here with the str.format() method:
import re

def findFruit(f):
    # Capture the word immediately before and immediately after "Lemon".
    found = re.search(r'.*? (.*?) Lemon (.*?) .*?$', f)
    if found is not None:
        print('"{}" # "{}"'.format(found.group(1), found.group(2)))
Or, a lovely solution Kimvais posted in the comments:
        print('"{0}" # "{1}"'.format(*found.groups()))
I've made some edits. Firstly, a for-loop isn't needed here (nor is a list comprehension). You were iterating through each character of the string instead of each word, and even then you don't want to iterate through each word: a single search over the whole string is enough.
I also changed your regular expression (do note that I'm not that great at regex, so there is probably a better solution).
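As a rough sketch of the same idea for any keyword, the version below returns the formatted string instead of printing it. The helper name words_around and the use of \S+ for "a word" are my own assumptions, not part of the question or the tutorial.

```python
import re

def words_around(text, keyword):
    """Return '"before" # "after"' for the words next to keyword, or None."""
    # \S+ is used here as a simple stand-in for "one word".
    pattern = r'(\S+)\s+' + re.escape(keyword) + r'\s+(\S+)'
    found = re.search(pattern, text)
    if found is None:
        return None
    return '"{}" # "{}"'.format(*found.groups())

f = "Mango grapes Lemon Ginger Pineapple"
print(words_around(f, "Lemon"))  # "grapes" # "Ginger"
```

Note that this sketch returns None when the keyword is the first or last word, because it requires a word on both sides.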

Related

How to extract compound words from a multiline string in Python

I have a text string from which I want to extract specific words (fruits) that may appear in it; the respective words are stored in a set. (Actually, the set is very large, but I tried to simplify the code). I achieved extracting single fruit words with this simple code:
# The text string.
text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
# Remove noisy punctuation and lowercase the text string.
prep_text = text.replace(",", "")
prep_text = prep_text.replace(".", "")
prep_text = prep_text.replace(":", "")
prep_text = prep_text.lower()
# The word set.
fruits = {"apple", "banana", "orange", "blueberry",
          "pine apple"}
# Extracting single fruits.
extracted_fruits = []
for word in prep_text.split():
    if word in fruits:
        extracted_fruits.append(word)
print(extracted_fruits)
# Out: ['apple', 'apple', 'banana']
# Missing: 'pine apple', 'blueberry-orange'
# False: the second 'apple'
But if the text string contains a fruit compound separated by a space (here: "pine apple"), it is not extracted (or rather, just "apple" is extracted from it, even though I don't want this occurrence of "apple" because it's part of the compound). I know this is because I used split() on prep_text. The hyphenated combination "blueberry-orange" is not extracted either, and I want to get it as well. Other hyphenated words that don't include fruits should not be extracted, though.
If I could use a variable fruit for each item in the fruits set, I would solve it with f-strings like:
fruit = # How can I get a single fruit element and similarly all fruit elements from 'fruits'?
hyphenated_fruit = f"{fruit}-{fruit}"
for word in prep_text.split():
    if word == hyphenated_fruit:
        extracted_fruits.append(word)
I can't use the actual strings "blueberry" and "orange" as variables though, because other fruits could also appear hyphenated in a different text string. Moreover, I don't want to just add "blueberry-orange" to the set - I'm searching for a way without changing the set.
Is there a way to add "pine apple" and "blueberry-orange" as well to the extracted_fruits list?
I appreciate any help and tips. Thanks a lot in advance!
The quickest (and dirtiest?) approach might be to use this regex:
(banana|orange|(pine\s*)?apple|blueberry)(-\s*(banana|orange|(pine\s*)?apple|blueberry))?
It will match pineapple, pine apple, and pine followed by apple with any amount of whitespace in between (several spaces, newlines or tabs).
One disadvantage is, of course, that it matches things like orange-orange and there's repeated text in there. You can construct the regex programmatically to fix that, though.
It's just a start but may be good enough for your use case. You can grow it to add more capabilities for a bit, I think.
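Here is one way the regex could be built programmatically from the set, so the alternatives are not typed twice. This is only a sketch: the helper name build_fruit_pattern is made up, and unlike the hand-written regex above it takes "pine apple" literally from the set, so it needs at least one whitespace character between the two words and will not match "pineapple". It also still accepts things like orange-orange.

```python
import re

fruits = {"apple", "banana", "orange", "blueberry", "pine apple"}

def build_fruit_pattern(names):
    """Compile a regex matching any name, optionally hyphenated with another."""
    alternatives = []
    # Longest names first, so "pine apple" is tried before plain "apple".
    for name in sorted(names, key=len, reverse=True):
        # Allow any run of whitespace (spaces, tabs, newlines) inside a compound.
        alternatives.append(r"\s+".join(re.escape(part) for part in name.split()))
    single = "(?:" + "|".join(alternatives) + ")"
    return re.compile(r"\b" + single + "(?:-" + single + r")?\b")

text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""

pattern = build_fruit_pattern(fruits)
# Normalise inner whitespace so a compound split across lines reads "pine apple".
extracted_fruits = [re.sub(r"\s+", " ", m.group(0))
                    for m in pattern.finditer(text.lower())]
print(extracted_fruits)
# ['apple', 'pine apple', 'banana', 'blueberry-orange']
```

Because the pattern runs on the whole lowercased text rather than on split() words, no punctuation stripping is needed and the set itself stays unchanged.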

How to use backreferences as index to substitute via list?

I have a list
fruits = ['apple', 'banana', 'cherry']
I would like to replace all these elements by their index in the list. I know that I can go through the list and use the string replace method, like
text = "I like to eat apple, but banana are fine too."
for i, fruit in enumerate(fruits):
    text = text.replace(fruit, str(i))
How about using a regular expression? With \number we can backreference a match. But
import re
text = "I like to eat apple, but banana are fine too."
text = re.sub('apple|banana|cherry', fruits.index('\1'), text)
doesn't work. I get an error that \x01 is not in fruits. But \1 should refer to 'apple'.
I am interested in the most efficient way to do the replacement, but I would also like to understand regex better. How can I get the matched string from the backreference in a regex?
Thanks a lot.
Using a regex with a replacement function.
Example:
import re
text = "I like to eat apple, but banana are fine too."
fruits = ['apple', 'banana', 'cherry']
pattern = re.compile("|".join(fruits))
text = pattern.sub(lambda x: str(fruits.index(x.group())), text)
print(text)
Output:
I like to eat 0, but 1 are fine too.
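To add a bit of background on why the original attempt fails: in a normal (non-raw) string literal, '\1' is an octal escape for the character \x01, and fruits.index('\1') is evaluated by Python before re.sub is even called, so the regex engine never gets a chance to fill in a backreference. A replacement function receives the match object instead. The sketch below is a variation on the code above that assumes a dictionary lookup is preferable to list.index for long lists; the names index_of and replace_with_index are my own.

```python
import re

fruits = ['apple', 'banana', 'cherry']

# '\1' in a normal string literal is just the single character \x01.
assert '\1' == '\x01'

# Map each fruit to its index once, instead of calling list.index per match.
index_of = {fruit: str(i) for i, fruit in enumerate(fruits)}

# re.escape guards against list entries that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(fruit) for fruit in fruits))

def replace_with_index(match):
    # match.group(0) is the text that was actually matched.
    return index_of[match.group(0)]

text = "I like to eat apple, but banana are fine too."
result = pattern.sub(replace_with_index, text)
print(result)  # I like to eat 0, but 1 are fine too.
```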

Search a string for a word/sentence and print the following word

I have a string that has around 10 lines of text. What I am trying to do is find a sentence that has a specific word(s) in it, and display the word following.
Example String:
The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat
I want the script to search for 'The slow', then print the following word, so in this case, 'donkey'.
I have tried using the Find function, but that just prints the location of the word(s).
Example code:
sSearch = output.find("destination-pattern")
print(sSearch)
Any help would be greatly appreciated.
output = "The slow donkey brown fox"
patt = "The slow"
sSearch = output.find(patt)
print(output[sSearch+len(patt)+1:].split(' ')[0])
output:
donkey
You could work with regular expressions. Python has a built-in library called re.
Example usage:
import re

s = "The slow donkey some more text"
finder = "The slow"
idx_finder_end = s.find(finder) + len(finder)
# Match the whitespace-delimited word right after the search phrase.
next_word_match = re.match(r"\s\w*\s", s[idx_finder_end:])
next_word = next_word_match.group().strip()
# donkey
I would do it using regular expressions (the re module) in the following way:
import re
txt = '''The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat'''
words = re.findall(r'(?<=The slow) (\w*)',txt)
print(words) # prints ['donkey']
Note that words is now a list of words; if you are sure that there is exactly one word to be found, you could then do:
word = words[0]
print(word) # prints donkey
Explanation: I used a so-called lookbehind assertion in the first argument of re.findall, which means I am looking for something that comes right after The slow. \w* means any substring consisting of letters, digits and underscores (_). I enclosed it in a group (parentheses) so that re.findall returns only the word, without the leading space, which is not part of the word.
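The same idea can be wrapped in a small helper that takes the search phrase as a parameter. This is a sketch under my own assumptions: the name word_after is made up, the phrase is escaped with re.escape so it is treated literally, and only spaces or tabs are allowed between the phrase and the next word, so the result never comes from the following line.

```python
import re

def word_after(text, phrase):
    """Return the word following phrase on the same line, or None."""
    match = re.search(re.escape(phrase) + r"[ \t]+(\w+)", text)
    return match.group(1) if match else None

txt = """The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat"""

print(word_after(txt, "The slow"))   # donkey
print(word_after(txt, "brown fox"))  # None (nothing follows on that line)
```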
You can do it using regular expressions (search, unlike match, also finds the phrase when it is not at the very start of the string):
>>> import re
>>> r = re.compile(r'The slow\s+\b(\w+)\b')
>>> r.search('The slow donkey')[1]
'donkey'

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is to obtain the 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. After matching it with my regex, I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
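A short sketch of how that pattern might be used, with the sentence from the question. The variable names are mine. Keep in mind that the character class [a-z,] only covers letters and commas, so words containing digits, apostrophes or other punctuation will cut the context short, and re.findall only returns non-overlapping matches.

```python
import re

# Pattern from the answer above, compiled case-insensitively.
pattern = re.compile(
    r"((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})",
    re.IGNORECASE,
)

line = "Parking here is horrible, this shop sucks."

# findall returns one (before, after) tuple per match; split() drops the
# trailing whitespace captured by the groups.
contexts = [(before.split(), after.split())
            for before, after in pattern.findall(line)]
print(contexts)  # [(['Parking'], ['horrible,', 'this'])]
```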
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
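The group rules above can be turned into code roughly like this. It is a sketch, not the answerer's own code: the function name context_words is made up, and it only looks at the first occurrence of "here is".

```python
import re

# Four-group pattern from the answer above.
pattern = re.compile(r"(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)")

def context_words(line):
    """Return (words_before, words_after) for the first 'here is', or None."""
    match = pattern.search(line)
    if match is None:
        return None
    # Every group always takes part in the match (possibly as an empty
    # string), so strip() is safe; empty groups are then filtered out.
    g1, g2, g3, g4 = (group.strip() for group in match.groups())
    before = [word for word in (g1, g2) if word]
    after = [word for word in (g3, g4) if word]
    return before, after

print(context_words("Parking here is horrible, this shop sucks."))
# (['Parking'], ['horrible,', 'this'])
print(context_words("We think parking here is bad"))
# (['think', 'parking'], ['bad'])
```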
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re

line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        # Split off everything up to the next occurrence of the pattern.
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                # The previous occurrence itself may be among the two words before.
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

re: Match any word in a set repeating

Given a string of space-delimited words that may come in any order, how can I match only those words that belong to a given set of words? For example, say I have:
apple monkey banana dog
and I want to match apple and banana. How might I do that?
Here's what I've tried:
m = re.search("(?P<fruit>[apple|banana]*)", "apple monkey banana dog")
m.groupdict() --> {'fruit':'apple'}
But I want to match both apple and banana.
In (?P<fruit>[apple|banana]*), the [apple|banana]* part defines a character class, i.e. this token matches one a, one p, one l, one e, one |, one b or one n, and the * then says 'match this 0 or more times'. (You probably meant to use a +, anyway, which would mean 'match one or more times'.)
What you want is (apple|banana) which will match the string apple or the string banana.
Learn more: http://www.regular-expressions.info/reference.html
For your next question, to get all matches a regex makes against a string, not just the first, use re.findall: http://docs.python.org/2/library/re.html#re.findall
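A small sketch of the difference, using the string from the question. The \b word boundaries and the non-capturing group are my additions, so that re.findall returns whole words rather than group contents.

```python
import re

text = "apple monkey banana dog"

# A character class matches single characters, so this "word" made only of
# the characters a, p, l, e, b, n and | is accepted in full.
print(bool(re.fullmatch(r"[apple|banana]+", "pea|nab")))  # True

# Alternation inside a (non-capturing) group matches whole words instead.
matches = re.findall(r"\b(?:apple|banana)\b", text)
print(matches)  # ['apple', 'banana']

# The named group from the question, collected with finditer.
fruits = [m.group("fruit")
          for m in re.finditer(r"(?P<fruit>apple|banana)", text)]
print(fruits)  # ['apple', 'banana']
```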
If you want it to be able to repeat, you're going to fail on white space. Try this:
import re

words = ['apple', 'banana', 'orange']
reg_string = '(' + '|'.join(words) + ')'
# Optionally match a space (only if another listed word follows), then that word.
lookahead_string = r'(\s(?=' + '|'.join(words) + '))?' + reg_string + '?'
out_reg_string = reg_string + (len(words) - 1) * lookahead_string
matches = re.findall(out_reg_string, string_to_match)
where string_to_match is what you are looking for the pattern within. out_reg_string can be used to match something like:
"apple banana orange"
"apple orange"
"apple banana"
"banana apple"
or any of the cartesian product of your input list.
