How to use backreferences as index to substitute via list? - python

I have a list
fruits = ['apple', 'banana', 'cherry']
I would like to replace all of these elements with their index in the list. I know that I can loop over the list and use str.replace, like this:
text = "I like to eat apple, but banana are fine too."
for i, fruit in enumerate(fruits):
    text = text.replace(fruit, str(i))
How about using a regular expression? With \number we can backreference a match. But
import re
text = "I like to eat apple, but banana are fine too."
text = re.sub('apple|banana|cherry', fruits.index('\1'), text)
doesn't work. I get an error that \x01 is not in fruits, but \1 should refer to 'apple'.
I am interested in the most efficient way to do the replacement, but I would also like to understand regex better. How can I get the matched string from the backreference?
Thanks a lot.

Use a regex and pass a function as the replacement.
Example:
import re
text = "I like to eat apple, but banana are fine too."
fruits = ['apple', 'banana', 'cherry']
pattern = re.compile("|".join(fruits))
text = pattern.sub(lambda x: str(fruits.index(x.group())), text)
print(text)
Output:
I like to eat 0, but 1 are fine too.
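The original attempt fails for two reasons: in an ordinary string literal '\1' is the single character \x01, and fruits.index('\1') is evaluated before re.sub ever runs. A callable replacement avoids both, because it receives the match object for each hit. Below is a minimal sketch of the same idea with two assumed refinements that are not part of the answer above: a dict (the hypothetical name index_of) so each lookup avoids a list scan, and re.escape in case an item contains regex metacharacters.

```python
import re

fruits = ['apple', 'banana', 'cherry']

# Hypothetical helper name: map each fruit to its position once,
# so each replacement is a dict lookup rather than a list scan.
index_of = {fruit: str(i) for i, fruit in enumerate(fruits)}

# re.escape guards against items that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(fruit) for fruit in fruits))

def replace_with_index(text):
    # The callable receives a match object; group() is the matched text.
    return pattern.sub(lambda m: index_of[m.group()], text)

print(replace_with_index("I like to eat apple, but banana are fine too."))
```

Whether this is faster than the str.replace loop depends on the size of the list and the text, so it is worth timing on real data.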

Related

How to extract every string between two substring in a paragraph?

After web scraping, I get the following:
[<p>xxx<p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, <p>xxxxx</p>, <p>2.orange</p>, <p>aaa</p>, <p>xxxxx</p>,<p>3.banana</p>, <p>aaa</p>, <p>xxxxx</p>]
In the list, the "xxxx" entries are useless values. I can see that the result I want always sits between two substrings: Substring1 = "<p>1" / "<p>2" / "<p>3" and Substring2 = "</p>, <p>aaa".
Assume this pattern repeats hundreds of times. How do I get the result in Python? Many thanks!
My target result is :
apple
orange
banana
I have tried split and slicing with [sub1:sub2], but neither works.
From what I INFER from your question (assuming the words you're looking for follow a beacon of format <p>number. ), a regex would do the job:
import re
print(re.findall(r'<p>\d+\.([^<]+)', html_string))
# ['apple', 'orange', 'banana']
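A runnable version of that idea might look like the sketch below. The contents of html_string are an assumption (a shortened copy of the scraped list joined into one string); the answer itself does not define it.

```python
import re

# Assumed sample input: a shortened copy of the scraped list as one string.
html_string = ("<p>xxx</p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, "
               "<p>2.orange</p>, <p>aaa</p>, <p>3.banana</p>, <p>aaa</p>")

# <p>, one or more digits, a literal dot, then capture up to the next '<'.
names = re.findall(r'<p>\d+\.([^<]+)', html_string)
print(names)
```

Entries such as <p>aaa</p> are skipped because they have no digits and dot after the opening tag.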

How to extract compound words from a multiline string in Python

I have a text string from which I want to extract specific words (fruits) that may appear in it; the respective words are stored in a set. (Actually, the set is very large, but I tried to simplify the code). I achieved extracting single fruit words with this simple code:
# The text string.
text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
# Remove noisy punctuation and lowercase the text string.
prep_text = text.replace(",", "")
prep_text = prep_text.replace(".", "")
prep_text = prep_text.replace(":", "")
prep_text = prep_text.lower()
# The word set.
fruits = {"apple", "banana", "orange", "blueberry",
          "pine apple"}
# Extracting single fruits.
extracted_fruits = []
for word in prep_text.split():
    if word in fruits:
        extracted_fruits.append(word)
print(extracted_fruits)
# Out: ['apple', 'apple', 'banana']
# Missing: 'pine apple', 'blueberry-orange'
# False: the second 'apple'
But if the text string contains a fruit compound separated by a space (here: "pine apple"), it is not extracted. Instead, just "apple" is extracted from it, even though I don't want this occurrence of "apple" because it's part of the compound. I know this is because I used split() on prep_text. The hyphenated combination "blueberry-orange" is not extracted either, and I want to get it as well. Other hyphenated words that don't include fruits should not be extracted, though.
If I could use a variable fruit for each item in the fruits set, I would solve it with f-strings like:
fruit = # How can I get a single fruit element and similarly all fruit elements from 'fruits'?
hyphenated_fruit = f"{fruit}-{fruit}"
for word in prep_text.split():
    if word == hyphenated_fruit:
        extracted_fruits.append(word)
I can't use the actual strings "blueberry" and "orange" as variables though, because other fruits could also appear hyphenated in a different text string. Moreover, I don't want to just add "blueberry-orange" to the set - I'm searching for a way without changing the set.
Is there a way to add "pine apple" and "blueberry-orange" as well to the extracted_fruits list?
I appreciate any help and tips. Thanks a lot in advance!
The quickest (and dirtiest?) approach might be to use this regex:
(banana|orange|(pine\s*)?apple|blueberry)(-\s*(banana|orange|(pine\s*)?apple|blueberry))?
It will match pineapple, pine apple, and pine apple with any run of spaces, newlines or tabs between pine and apple.
One disadvantage is, of course, that it matches things like orange-orange and there's repeated text in there. You can construct the regex programmatically to fix that, though.
It's just a start but may be good enough for your use case. You can grow it to add more capabilities for a bit, I think.
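Constructing the regex programmatically could look like the sketch below. It builds the alternation from the set, so the set itself stays unchanged. The names build_fruit_pattern and fruit_pattern are hypothetical, and unlike the hand-written regex above it requires at least one whitespace character inside a compound (so "pineapple" as one word would not match unless it is in the set).

```python
import re

text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
fruits = {"apple", "banana", "orange", "blueberry", "pine apple"}

def build_fruit_pattern(names):
    # Longest names first, so a longer name is tried before any
    # shorter name that happens to be its prefix.
    ordered = sorted(names, key=lambda n: (-len(n), n))
    # Let any whitespace run stand in for the space inside a compound.
    parts = [r"\s+".join(re.escape(w) for w in n.split()) for n in ordered]
    single = "|".join(parts)
    # One fruit, optionally followed by hyphen-joined fruits.
    return re.compile(rf"\b(?:{single})(?:-(?:{single}))*\b")

fruit_pattern = build_fruit_pattern(fruits)
extracted_fruits = fruit_pattern.findall(text.lower())
print(extracted_fruits)
```

On the sample text this yields apple, pine apple, banana and blueberry-orange, without the stray second "apple". It still accepts combinations such as orange-orange; filtering those out would need a check on the matched parts after the fact.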

Regex for two words

I'm not able to create a regex to capture two separate words.
For example the pattern must contain the word (pizza)+ and (cheese|tomatoes)* like this:
I want eat a pizza with cheese
capture:
pizza, cheese
How can I do that?
Use re.findall. It will return all matched strings.
>>> import re
>>> re.findall(r'\b(pizza|cheese|tomatoes)\b', 'I want eat a pizza with cheese')
['pizza', 'cheese']

Printing text after searching for a string

I am following a tutorial to identify and print the words in between a particular string;
f is the string Mango grapes Lemon Ginger Pineapple
def findFruit(f):
    global fruit
    found = [re.search(r'(.*?) (Lemon) (.*?)$', word) for word in f]
    for i in found:
        if i is not None:
            fruit = i.group(1)
            fruit = i.group(3)
grapes and Ginger are output when I print fruit. However, I want the output to look like "grapes" # "Ginger" (note the quotation marks and the # sign).
You can use string formatting here with the use of the str.format() function:
import re

def findFruit(f):
    # Capture the word directly before and directly after "Lemon".
    found = re.search(r'.*? (.*?) Lemon (.*?) .*?$', f)
    if found is not None:
        print('"{}" # "{}"'.format(found.group(1), found.group(2)))
Or, a lovely solution Kimvais posted in the comments:
print('"{0}" # "{1}"'.format(*found.groups()))
I've made some edits. Firstly, a for-loop isn't needed here (nor is a list comprehension). You're iterating through each character of the string instead of each word, and even then you don't want to iterate through each word.
I also changed your regular expression (Do note that I'm not that great in regex, so there probably is a better solution).

re: Match any word in a set repeating

Given a set of space-delimited words that may come in any order, how can I match only those words that are in a given set of words? For example, say I have:
apple monkey banana dog and I want to match apple and banana. How might I do that?
Here's what I've tried:
m = re.search("(?P<fruit>[apple|banana]*)", "apple monkey banana dog")
m.groupdict() --> {'fruit':'apple'}
But I want to match both apple and banana.
In (?P<fruit>[apple|banana]*)
[apple|banana]* defines a character class, i.e. this token matches one a, one p, one l, one e, one |, one b or one n, and then says 'match this 0 or more times'. (You probably meant to use a +, anyway, which would mean 'match one or more times'.)
What you want is (apple|banana) which will match the string apple or the string banana.
Learn more: http://www.regular-expressions.info/reference.html
For your next question, to get all matches a regex makes against a string, not just the first, use http://docs.python.org/2/library/re.html#re.findall
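Putting the two points together, a minimal sketch (the \b word boundaries and the non-capturing group are additions not mentioned above, used so that findall returns whole words only):

```python
import re

words = "apple monkey banana dog"

# (?:...) groups the alternatives without capturing;
# \b keeps the pattern from matching inside longer words.
found = re.findall(r'\b(?:apple|banana)\b', words)
print(found)
```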
If you want it to be able to repeat, you're going to fail on white space. Try this:
import re

input = ['apple','banana','orange']  # note: this name shadows the built-in input()
reg_string = '(' + ('|').join(input) + ')'
# Raw string so that \s reaches the regex engine unchanged.
lookahead_string = r'(\s(?=' + ('|').join(input) + '))?' + reg_string + '?'
out_reg_string = reg_string + (len(input)-1)*lookahead_string
matches = re.findall(out_reg_string, string_to_match)
where string_to_match is what you are looking for the pattern within. out_reg_string can be used to match something like:
"apple banana orange"
"apple orange"
"apple banana"
"banana apple"
or any of the cartesian product of your input list.
