How to make regular expression match a required text? - python

I have a regular expression that I have written,
regex = "car\S*\s*(\w+\s+){1,2}\s*\S*wash"
This regex matches texts such as the following (i.e., one or two words between "car" and "wash"):
"car. was good ?wash"
"car wash"
"car will never wash"
But I want the above regex to also match these variations:
texts = [
"Car, not ... (?!) wash", # (i.e., this should match because there is only one word between car and wash, with any number of punctuation marks in between)
"Car never:)... $##! with wash", # (i.e., this should also match because there are only two words between car and wash, with more punctuation in between)
"Car, was never wash",
"Car...:) things, not wash"]
But the regex I have written fails on them. How can I modify it so that it matches all of the texts above?
import re
# Define the regular expression
regex = "car\S*\s*(\w+\s+){1,2}\s*\S*wash"
# Use the re.search() function to find a match
match = re.search(regex, "Car, not ... (?!) wash", flags=re.I)
# Check if a match was found
if match:
    print("Match found: ", match.group(0))
else:
    print("No match found")
In short, I have to match any text that starts with "car" and ends with "wash", but with conditions.
It can have only 1 to 2 words between car and wash. The regex I wrote takes care of that.
Along with those words, it can have any number of punctuation marks or spaces between them.

Based on my interpretation of your question plus additional comments, I am defining the following rules.
A word is a sequence of one or more non-whitespace characters, at least one of which must be a letter, digit or underscore (the latter as it is part of the character class \w).
Words are separated by at least one whitespace character, mixed with zero or more 'punctuation' characters (i.e. anything but letter/digit/underscore).
There is some ambiguity in my definition; punctuation between a letter and a space could be considered part of the word (rule #1) or part of the separation between words (rule #2).
But when counting words, that makes no difference.
From there, I can build two subpatterns.
\S*\w\S* - a word has at least one word character, and no whitespace
\W*\s\W* - a separator has at least one whitespace character, and no word character
Chaining the subpatterns:
\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b
Notice the word boundaries \b on either side, which prevent "scar" and "washing" from being mistaken for "car" and "wash".
This matches all of these texts:
car. was good ?wash # 2 words and punctuation between car and wash
car will never wash # 2 words
Car, not ... (?!) wash # 1 word
Car never:)... $##! with wash # 2 words
Car, was never wash # 2 words
Car...:) things, not wash # 2 words
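A quick way to check the chained pattern is to run it over the sample texts. The sketch below assumes case-insensitive matching (re.I), as in the question's own code; the variable names and the last two test strings are only illustrative additions.

```python
import re

# Chained pattern: "car", a separator, one or two words
# (each followed by a separator), then "wash".
pattern = re.compile(r"\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b", re.I)

texts = [
    "car. was good ?wash",
    "car will never wash",
    "Car, not ... (?!) wash",
    "Car never:)... $##! with wash",
    "Car, was never wash",
    "Car...:) things, not wash",
    "car wash",                  # illustrative: no word in between
    "car one two three wash",    # illustrative: three words in between
]

# Map each text to whether the pattern is found in it.
results = {text: bool(pattern.search(text)) for text in texts}
for text, matched in results.items():
    print(matched, "|", text)
```

The first six texts should report True; the last two should report False because they have zero and three words between car and wash.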
An alternate approach would be to first strip all punctuation from the string, and then match against \bcar\s+(\S+\s+){1,2}wash\b
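A minimal sketch of that alternate approach, assuming "punctuation" means any character that is neither a word character nor whitespace (the helper name matches_after_stripping is made up for this example):

```python
import re

def matches_after_stripping(text):
    """Strip punctuation first, then apply the simpler pattern."""
    # Assumption: punctuation = anything that is not \w and not \s.
    cleaned = re.sub(r"[^\w\s]", "", text)
    found = re.search(r"\bcar\s+(\S+\s+){1,2}wash\b", cleaned, flags=re.I)
    return found is not None

print(matches_after_stripping("Car, not ... (?!) wash"))
print(matches_after_stripping("Car never:)... $##! with wash"))
print(matches_after_stripping("car wash"))
```

Keep in mind that stripping changes the text being matched, so tokens joined only by punctuation (for example "a-b") become a single word after cleaning.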

Related

What is the regex to match the entire paragraph with a word condition? (paragraph may contain multiple periods/full stops)

String need to match : Lisa Ellis Analyst, MoffettNathanson LLC Q Hi. Good afternoon, guys, and welcome, Brian. I look forward to working with you.
Regex Tried : [^.]*Analyst[^.]*
Matched Output : Lisa Ellis Analyst, MoffettNathanson LLC Q Hi
As you can see above, it stops matching after the first full stop.
Could somebody tell me how should I match the entire paragraph so that it does not stop after the first period?
This regex will match the whole paragraph:
/^.*Analyst.*$/m
I think you just need to set the multiline flag.
I am assuming that paragraphs are delimited by one or more newline characters, that is, the sentences comprising a paragraph have no newline characters embedded. Then, in multiline mode the anchors ^ and $ match the start and end of a line respectively in addition to the start and end of the input string. You also want to ensure that the word you are looking for is on word boundaries, that is, separated on either side by non-word characters. In that way, if you are looking for Analyst, you will not match Analysts:
\bAnalyst\b
If you want to match either Analyst or Analysts, then make it explicit:
\bAnalysts?\b
If you want to match any longer word that begins with Analyst (use \w* instead of \w+ if Analyst itself should also match):
\bAnalyst\w+\b
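To see how these variants differ, here is a small sketch run against a made-up sample string (the word "Analystics" is invented purely for the test):

```python
import re

sample = "Analyst Analysts Analystics"

exact = re.findall(r"\bAnalyst\b", sample)         # the exact word only
optional_s = re.findall(r"\bAnalysts?\b", sample)  # optional trailing "s"
longer = re.findall(r"\bAnalyst\w+\b", sample)     # at least one extra word character
any_prefix = re.findall(r"\bAnalyst\w*\b", sample) # zero or more extra word characters

print(exact)
print(optional_s)
print(longer)
print(any_prefix)
```

Note that \w+ requires at least one character after Analyst, so it skips the bare word, while \w* accepts it.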
The complete regex:
(?m)^.*?\bAnalyst\b.*?$
(?m) Turn on multiline mode.
^ Matches the start of the string or the start of a line.
.*? Matches minimally 0 or more characters until:
\bAnalyst\b Matches Analyst on a word boundary (use \bAnalyst\w+\b for any word that begins with Analyst).
.*?$ Matches minimally 0 or more characters until the end of line or the end of string. You could use .*, greedy matching, because . will never match the newline character, so there is really no danger of matching beyond the end of the paragraph.
The code:
import re
text = """This is sentence 1 in paragraph 1. This is sentence 2 in paragraph 1.
This is sentence 1 in paragraph 2. This is sentence 2 in paragraph 2 with the word Analyst contained within.
"""
matches = re.findall(r'(?m)^.*?\bAnalyst\b.*?$', text)
print(matches)
Prints:
['This is sentence 1 in paragraph 2. This is sentence 2 in paragraph 2 with the word Analyst contained within.']

Regex (Python) - Match words with two or more distinct vowels

I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). Likewise, mississippi would fail, but albuquerque would pass.
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in a parenthesis, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.
You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.
Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches all words that contain at least one vowel. \w* matches 0+ word chars, so the first \w* grabs the whole chunk of letters, digits and underscores. Then backtracking begins: the regex engine steps back until it finds a location that is followed by a, e, i, o, or u. Once it finds that location, the characters given back are consumed again by the trailing \w*.
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced to with \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
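One Python-specific detail when trying this out: the pattern contains a capturing group, so re.findall returns the captured vowel for each match rather than the whole word. A small sketch using re.finditer instead (the extra words mississippi and albuquerque are taken from the question; the variable names are illustrative):

```python
import re

string = 'pool pound polio papa pick pair mississippi albuquerque'
pattern = re.compile(r'\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+')

# group(0) is the whole match, i.e. the full word.
words = [m.group(0) for m in pattern.finditer(string)]
print(words)

# findall returns only group 1 (a single vowel per match) here.
vowels_only = pattern.findall(string)
print(vowels_only)
```

The first list should be ['pound', 'polio', 'pair', 'albuquerque']; the second holds one vowel per matched word.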
Match words in a string that contain at least two distinct vowels in the least amount of characters (to my knowledge): \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parentheses so we can reference which vowel was matched later. Whatever is inside this first pair of parentheses is group 1.
\w*: matches 0 or more word characters.
(?!\1): a negative lookahead. It asserts that the next character is not the same as the vowel captured in group 1, which \1 refers to. For example, if group 1 matched a, the [aeiou] that follows the lookahead cannot match a.
\w*: matches 0 or more word characters.

Ignore part of match in `re.split`

Input is a two-sentence string:
s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
I'd like to .split s into sentences based on the logic that:
sentences end with one or more periods, exclamation marks, question marks, or period+quotation mark
and are then followed by 1+ whitespace characters and a capitalized alpha character.
Desired result:
['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
Also okay:
['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']
But I currently chop off the 0th element of each sentence because the uppercase character is captured:
import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
Notice the missing T. How can I tell .split to ignore certain elements of the compiled pattern?
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])
Try it here.
Although I would suggest the below to allow quotes to be followed by any of .!? to be a split condition
((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])
Try it here.
Explanation
The part common to both solutions:  +(?=[A-Z])
' +' #One or more spaces (the actual characters split on)
(?= #START positive lookahead: check that this follows, but do not consume it
[A-Z] #Any uppercase letter
) #END positive lookahead
The conditions for what comes before the space
For Solution1
( #GROUP START
(?<= #START Positive look behind, Make sure this comes before but do not consume
[.!?] #any one of these chars should come before the splitting space
) #END positive look behind
| #OR condition this is also the reason we had to put all this in GROUP
(?<= #START Positive look behind,
\.\" #the splitting space may be preceded by .", a case not covered by the previous set of . or ! or ?
) #END positive look behind
) #END GROUP
For Solution2
( #GROUP START
(?<=[.!?]) #Same as the previous look behind
| #OR condition
(?<=[.!?]\") #Only difference here is that we are allowing quote after any of . or ! or ?
) #GROUP END
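A note for using this with Python's re.split: when the pattern contains a capturing group, re.split also puts the captured text into the result list. The group here only wraps lookbehinds, so it captures an empty string. The sketch below therefore swaps in a non-capturing group (?:...) for Solution 2; that swap is my adjustment, not part of the pattern as posted.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Solution 2 with a non-capturing group, so split returns only the sentences.
splitter = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')
sentences = splitter.split(s)
print(sentences)

# Same pattern with the capturing group as posted, for comparison:
# the captured (empty) text shows up as an extra list element.
with_group = re.split(r'((?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])', s)
print(with_group)
```

The non-capturing version should give the two sentences from the desired result, with the leading T of the second sentence intact.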
It's easier to describe the sentence than trying to identify the delimiter. So instead of re.split try with re.findall:
re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
To preserve the next uppercase letter, the pattern uses a lookahead that is only a test and doesn't consume characters.
details:
( # capture group: re.findall returns only the capture group content if there is one
[^.?!\s] # the first character isn't a space or a punctuation character
.*? # a non-greedy quantifier
[.?!]* # eventual punctuation characters
)
\s* # zero or more white-spaces
(?![^A-Z]) # not followed by a character that isn't an uppercase letter
# (so what follows is either an uppercase letter or the end of the string)
Obviously, for more complicated cases with abbreviations, names, etc., you have to use tools like nltk or any other nlp tools trained with dictionaries.
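Here is a small runnable sketch of the findall approach on the string from the question, plus a made-up second string that shows the kind of case where it breaks down (a capitalized name in the middle of a sentence):

```python
import re

pattern = re.compile(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])')

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
sentences = pattern.findall(s)
print(sentences)

# Illustrative counter-example: the end punctuation is optional in the
# pattern, so any whitespace followed by an uppercase letter ends a "sentence".
with_name = pattern.findall('I met Bob. Fine.')
print(with_name)
```

The second call splits before "Bob", which is why dictionary-based tools are the better choice once names and abbreviations are involved.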

How are regex quantifiers applied?

I have the following regex:
res = re.finditer(r'(?:\w+[ \t,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)
for item in res:
    print(item.group())
When I use this regex with the following string:
"my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly."
I am getting the following results:
house is painted white, my car
the road, I drive my car
My question is about the quantifier {0,4}, which should apply to the whole group. The group collects words with the expression \w+ and some separator symbols with [ \t,]+. Does the quantifier apply only to the "words" defined by \w+? In the results I am getting 4 words plus spaces and a comma. It's unclear to me.
So, here's what's happening. You're using ?: to make a non-capturing group, which matches one "word" (\w+: one or more word characters) followed by [ \t,]+ (one or more spaces, tabs, or commas). {0,4} then repeats that whole group between 0 and 4 times. So the regex finds "my car" together with up to 4 words before it: each of the 4 words matches \w+, and the commas and spaces after each word get eaten by the character set you specified.
Broken apart more succinctly
(?: -- Start of non-capturing group
\w+ -- One word (one or more word characters)
[ \t,]+ -- One or more space, tab, or comma characters
) -- End of non-capturing group
{0,4} -- Match the preceding group 0-4 times
my car -- The literal text "my car"
As a result, this will match 0-4 words, each followed by spaces / commas / tabs, before the appearance of "my car".
This is working as written
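To make the repetitions visible, the sketch below re-runs the regex from the question and then breaks the part before "my car" into the individual group repetitions (the helper variables are illustrative only):

```python
import re

txt = ("my house is painted white, my car is red.\n"
       "A horse is galloping very fast in the road, I drive my car slowly.")

pattern = re.compile(r'(?:\w+[ \t,]+){0,4}my car', re.IGNORECASE | re.MULTILINE)

found = []
for item in pattern.finditer(txt):
    matched = item.group()
    prefix = matched[:-len('my car')]
    # Each repetition of the group is one word plus its trailing separators.
    repetitions = re.findall(r'\w+[ \t,]+', prefix)
    found.append((matched, repetitions))
    print(repr(matched), repetitions)
```

Each match should show exactly four repetitions, and the comma after "white" or "road" belongs to the separator part of its repetition, not to an extra word.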

Python regex with \w does not work

I want to have a regex to find a phrase and two words preceding it if there are two words.
For example I have the string (one sentence per line):
Chevy is my car and Rusty is my horse.
My car is very pretty my dog is red.
If I use the regex:
re.finditer(r'[\w+\b|^][\w+\b]my car',txt)
I do not get any match.
If I use the regex:
re.finditer(r'[\S+\s|^][\S+\s]my car',txt)
I am getting:
's my car' and '. My car' (I am ignoring case and using multi-line)
Why is the regex with \w+\b not finding anything? It should find two words and 'my car'
How can I get two complete words before 'my car' if there are two words. If there is only one word preceding my car, I should get it. If there are no words preceding it I should get only 'my car'. In my string example I should get: 'Chevy is my car' and 'My car' (no preceding words here)
In your r'[\w+\b|^][\w+\b]my car' regex, [\w+\b|^] matches 1 symbol that is either a word char, a +, a backspace, |, or ^, and [\w+\b] matches 1 symbol that is either a word char, a +, or a backspace.
The point is that inside a character class, quantifiers and a lot (but not all) special characters match literal symbols. E.g. [+] matches a plus symbol, [|^] matches either a | or ^. Since you want to match a sequence, you need to provide a sequence of subpatterns outside of a character class.
It seems as if you intended to use \b as a word boundary, however, \b inside a character class matches only a backspace character.
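A short sketch that illustrates both points in Python (the test strings are made up for the demonstration):

```python
import re

# Inside [...], "+", "|" and "^" (when not first) are literal characters,
# so each character of the test string matches on its own.
literals = re.findall(r'[\w+\b|^]', 'a+|^')
print(literals)

# Inside [...], \b is the backspace character (\x08), not a word boundary.
backspace_match = re.fullmatch(r'[\b]', '\x08') is not None
print(backspace_match)

# Outside a character class, \b is a zero-width word boundary.
boundary = re.findall(r'\bcar\b', 'my car, a scar')
print(boundary)
```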
To find two words and 'my car', you can use, for example
\S+\s+\S+\s+my car
See the regex demo (here, \S+ matches one or more non-whitespace symbols, and \s+ matches 1 or more whitespaces, and the 2 occurrences of these 2 consecutive subpatterns match these symbols as a sequence).
To make the sequences before my car optional, just use a {0,2} quantifier like this:
(?:\S+[ \t]+){0,2}my car
See this regex demo (to be used with the re.IGNORECASE flag). See Python demo:
import re
txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I))
Details:
(?:\S+[ \t]+){0,2} - 0 to 2 sequences of 1+ non-whitespaces followed with 1+ space or tab symbols (you may also replace [ \t] with [^\S\r\n] to match any horizontal whitespace, or with \s if you also plan to match linebreaks).
my car - a literal text my car.
