Regex for two words - python

I'm not able to create a regex to capture two separate words.
For example, the pattern must contain the word (pizza)+ and (cheese|tomatoes)*, like this:
I want eat a pizza with cheese
capture:
pizza, cheese
How can I do that?

Use re.findall. It will return all matched strings.
>>> import re
>>> re.findall(r'\b(pizza|cheese|tomatoes)\b', 'I want eat a pizza with cheese')
['pizza', 'cheese']
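If pizza really is mandatory while cheese and tomatoes are optional, one option is to check the findall result afterwards. This is only a minimal sketch; the helper name capture_toppings is made up for illustration and is not part of the answer above:

```python
import re

# Same alternation as above, compiled once for reuse.
WORDS = re.compile(r'\b(pizza|cheese|tomatoes)\b')

def capture_toppings(sentence):
    """Return the matched words, or an empty list when 'pizza' is absent."""
    found = WORDS.findall(sentence)
    return found if 'pizza' in found else []

print(capture_toppings('I want eat a pizza with cheese'))
print(capture_toppings('I want eat cheese'))
```

The second call returns an empty list because the sentence has no pizza in it, even though cheese matches.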

Related

Trying to find two words before and after a group of words with regex

sentence = "I love the grand mother bag i bought . I love my sister's ring "
import re
regex = re.search('(\w+){2}the grand mother bag(\w+){2}', sentence)
print(regex.groups())
I should have extracted: I love and I bought.
Any idea where I went wrong?
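Change your regex: (\w+){2} repeats a run of word characters twice with no whitespace in between, so it cannot match two space-separated words. The pattern also ignores the spaces around the phrase, so re.search finds nothing and returns None. Match the whitespace explicitly:
>>> re.search(r'(\w+\s+\w+)\s+the grand mother bag\s+(\w+\s+\w+)', sentence).groups()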
('I love', 'i bought')
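Since re.search returns None when nothing matches, calling .groups() directly can raise AttributeError. A small sketch of the same pattern with that case handled; the variable names phrase and result are assumptions for illustration:

```python
import re

sentence = "I love the grand mother bag i bought . I love my sister's ring "
phrase = "the grand mother bag"

# Two words, the escaped phrase, then two more words.
pattern = re.compile(r'(\w+\s+\w+)\s+' + re.escape(phrase) + r'\s+(\w+\s+\w+)')

match = pattern.search(sentence)
# Guard against None before asking for the groups.
result = match.groups() if match else None
print(result)
```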

How to use backreferences as index to substitute via list?

I have a list
fruits = ['apple', 'banana', 'cherry']
I would like to replace all these elements by their index in the list. I know that I can go through the list and use the string's replace method, like this:
text = "I like to eat apple, but banana are fine too."
for i, fruit in enumerate(fruits):
    text = text.replace(fruit, str(i))
How about using regular expression? With \number we can backreference to a match. But
import re
text = "I like to eat apple, but banana are fine too."
text = re.sub('apple|banana|cherry', fruits.index('\1'), text)
doesn't work. I get an error that \x01 is not in fruits, but \1 should refer to 'apple'.
I am interested in the most efficient way to do the replacement, but I would also like to understand regex better. How can I get the matched string from the backreference?
Thanks a lot.
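Using regex: fruits.index('\1') is evaluated before re.sub is ever called, and in a normal string literal '\1' is the character \x01, so no backreference is involved. Pass a function as the replacement instead; it is called with the match object for every match.
Example: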
import re
text = "I like to eat apple, but banana are fine too."
fruits = ['apple', 'banana', 'cherry']
pattern = re.compile("|".join(fruits))
text = pattern.sub(lambda x: str(fruits.index(x.group())), text)
print(text)
Output:
I like to eat 0, but 1 are fine too.
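On the efficiency part of the question: fruits.index scans the list on every match. A possible variant, sketched under the assumption that the list has no duplicates, builds a dictionary once and escapes each word with re.escape. The name index_of is hypothetical:

```python
import re

fruits = ['apple', 'banana', 'cherry']
text = "I like to eat apple, but banana are fine too."

# Map each fruit to its index once, instead of calling list.index per match.
index_of = {fruit: str(i) for i, fruit in enumerate(fruits)}

# re.escape keeps words with special characters from being read as regex syntax.
pattern = re.compile("|".join(re.escape(fruit) for fruit in fruits))

result = pattern.sub(lambda m: index_of[m.group()], text)
print(result)
```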

Regex: Grab first instance of word unless "forbidden" word in between

I need to match the first instance of either of two words ("ham" or "turkey") ONLY if that word follows the word "sandwich" AND the word "forbidden" isn't present between "sandwich" and ("ham" or "turkey").
reuben sandwich with ham and turkey ham sandwich with cheese
reuben sandwich with forbidden ham and turkey ham sandwich with cheese
The regex I'm using is sandwich(.*?)(?!.*forbidden)(ham)((.*?)(turkey))? but it still matches the 2nd sentence. How can I modify the regex to NOT match the 2nd sentence? Thanks in advance!
You may use
sandwich((?:(?!forbidden).)*?)(ham|turkey)
sandwich - matches a substring sandwich
((?:(?!forbidden).)*?) - Group 1 matching any char that is not starting a forbidden word, zero or more times, as few as possible
(ham|turkey) - either ham or turkey.
Here are some variations you may consider for your scenarios:
If you need to match ham and/or turkey after sandwich, use sandwich((?:(?!forbidden).)*?)(ham|turkey)(.*?)(ham|turkey)
If you need to match only ham and then turkey or turkey and then ham, you can add a negative lookahead in the above regex: sandwich((?:(?!forbidden).)*?)(ham|turkey)(.*?)(?!\2)(ham|turkey).
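To check the first pattern against the two sample sentences from the question, a short Python sketch follows. The variable names are assumptions for illustration:

```python
import re

# Each character between "sandwich" and the target word is consumed
# only if "forbidden" does not start at that position.
pattern = re.compile(r'sandwich((?:(?!forbidden).)*?)(ham|turkey)')

allowed = "reuben sandwich with ham and turkey ham sandwich with cheese"
blocked = "reuben sandwich with forbidden ham and turkey ham sandwich with cheese"

m_allowed = pattern.search(allowed)
m_blocked = pattern.search(blocked)

print(m_allowed.group(2) if m_allowed else None)
print(m_blocked)
```

The first sentence yields ham in group 2. The second yields no match: the first sandwich is blocked by forbidden, and nothing after the second sandwich contains ham or turkey.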

Regex uppercase words with condition

I'm new to regex and I can't figure out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. It will be only one per sentence. I have to ignore the words between [] as it will not be the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative lookarounds to ensure that the title is not within []
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b Matches word boundary.
(?<!\[) Negative lookbehind. Checks that the matched string is not preceded by [
[A-Z ]{2,} Matches 2 or more uppercase letters or spaces, which is why the results below include the surrounding spaces.
(?!\]) Negative lookahead. Ensures that the string is not followed by ]
Example
>>> import re
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
>>>
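If the surrounding spaces are unwanted, one possible variant keeps the space out of the character class and allows single spaces only between uppercase words. This is a sketch rather than the answer's own pattern, and it assumes a title consists of words of at least two uppercase letters:

```python
import re

string = """Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol"""

# One or more uppercase words separated by single spaces,
# not directly preceded by "[" and not directly followed by "]".
title_re = re.compile(r'(?<!\[)\b[A-Z]{2,}(?: [A-Z]{2,})*\b(?!\])')

titles = title_re.findall(string)
print(titles)
```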

Replace all the occurrences of specific words

Suppose that I have the following sentence:
bean likes to sell his beans
and I want to replace all occurrences of specific words with other words. For example, bean to robert and beans to cars.
I can't just use str.replace because in this case it'll change the beans to roberts.
>>> "bean likes to sell his beans".replace("bean","robert")
'robert likes to sell his roberts'
I need to change the whole words only, not the occurrences of the word in the other word. I think that I can achieve this by using regular expressions but don't know how to do it right.
If you use regex, you can specify word boundaries with \b:
import re
sentence = 'bean likes to sell his beans'
sentence = re.sub(r'\bbean\b', 'robert', sentence)
# 'robert likes to sell his beans'
Here 'beans' is not changed (to 'roberts') because there is no word boundary between the 'n' and the 's': \b matches the empty string, but only at the beginning or end of a word.
The second replacement for completeness:
sentence = re.sub(r'\bbeans\b', 'cars', sentence)
# 'robert likes to sell his cars'
If you replace each word one at a time, you might replace words several times (and not get what you want). To avoid this, you can use a function or lambda:
d = {'bean':'robert', 'beans':'cars'}
str_in = 'bean likes to sell his beans'
str_out = re.sub(r'\b(\w+)\b', lambda m:d.get(m.group(1), m.group(1)), str_in)
That way, once bean is replaced by robert, it won't be modified again (even if robert is also in your input list of words).
As suggested by georg, I edited this answer with dict.get(key, default_value).
Alternative solution (also suggested by georg):
str_out = re.sub(r'\b(%s)\b' % '|'.join(d.keys()), lambda m:d.get(m.group(1), m.group(1)), str_in)
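Put together as a runnable piece, the single-pass idea might look like the sketch below. The function name replace_words is hypothetical, and re.escape is added on the assumption that keys could contain regex metacharacters:

```python
import re

def replace_words(text, mapping):
    """Replace whole words in one pass, so a replacement is never replaced again."""
    # Escape each key so it is matched literally inside the alternation.
    alternation = '|'.join(re.escape(word) for word in mapping)
    pattern = re.compile(r'\b(%s)\b' % alternation)
    return pattern.sub(lambda m: mapping[m.group(1)], text)

d = {'bean': 'robert', 'beans': 'cars'}
str_out = replace_words('bean likes to sell his beans', d)
print(str_out)
```

Because every match is looked up once in the dictionary, a mapping such as bean to beans and beans to cars would still give each original word exactly one replacement.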
This is a quick and dirty way to do it, using a fold (reduce, which lives in functools in Python 3):
from functools import reduce
reduce(lambda x,y : re.sub('\\b('+y[0]+')\\b',y[1],x) ,[("bean","robert"),("beans","cars")],"bean likes to sell his beans")
"bean likes to sell his beans".replace("beans", "cars").replace("bean", "robert")
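This replaces all instances of "beans" with "cars" and then "bean" with "robert". The order matters: the longer word has to be replaced first, otherwise "beans" would become "roberts". It works because .replace() returns a new, modified copy of the original string, so the calls can be chained. In stages, it essentially works this way: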
>>> first_string = "bean likes to sell his beans"
>>> second_string = first_string.replace("beans", "cars")
>>> third_string = second_string.replace("bean", "robert")
>>> (first_string, second_string, third_string)
('bean likes to sell his beans', 'bean likes to sell his cars', 'robert likes to sell his cars')
