How are regex quantifiers applied? - python

I have the following regex:
import re
res = re.finditer(r'(?:\w+[ \t,]+){0,4}my car', txt, re.IGNORECASE | re.MULTILINE)
for item in res:
    print(item.group())
When I use this regex with the following string:
"my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly."
I am getting the following results:
house is painted white, my car
the road, I drive my car
My question is about the quantifier {0,4}, which should apply to the whole group. The group collects words with the expression \w+ and some separator symbols with the [ \t,]+. Does the quantifier apply only to the "words" defined by \w+? In the results I am getting 4 words plus spaces and a comma. It's unclear to me.

So, here's what's happening. You're using ?: to make a non-capturing group. Inside it, \w+ matches one word (one or more word characters) and [ \t,]+ matches one or more separators (a space, tab char, or comma). {0,4} applies to the whole group, so it repeats the "word followed by separators" unit between 0 and 4 times. The match therefore ends at "my car" and includes up to 4 words before it; the comma and spaces are included because they are consumed by the character set you specified.
Broken apart more succinctly
(?: -- Start of non-capturing group
\w+ -- Match one word (1 or more word characters)
[ \t,]+ -- Match 1 or more spaces, commas, or tab characters
) -- End of non-capturing group
{0,4} -- Repeat the preceding non-capturing group 0-4 times
my car -- Match the literal text "my car"
As a result this will match 0-4 words, each followed by its spaces / commas / tabs, immediately before the appearance of "my car".
The regex is working as written.
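A minimal sketch to confirm that the quantifier repeats the whole group, using the string and pattern from the question. The variable names (txt, matches, words_before) are only illustrative.

```python
import re

txt = (
    "my house is painted white, my car is red.\n"
    "A horse is galloping very fast in the road, I drive my car slowly."
)

# {0,4} repeats the whole group: one word plus its trailing separators.
pattern = re.compile(r"(?:\w+[ \t,]+){0,4}my car", re.IGNORECASE)

matches = [m.group() for m in pattern.finditer(txt)]
for text in matches:
    print(text)

# Count the words contributed by the repeated group
# (total words minus the two words of "my car").
words_before = [len(re.findall(r"\w+", m)) - 2 for m in matches]
print(words_before)
```

Each match carries four words before "my car", and the separators come along because they are part of the repeated unit.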

How to make regular expression match a required text?

I have a regular expression that I have written,
regex = r"car\S*\s*(\w+\s+){1,2}\s*\S*wash"
This regex matches texts such as the following (i.e., one or two words between "car" and "wash"):
"car. was good ?wash"
"car wash"
"car will never wash"
But I want the above regex to also match these variations:
texts = [
    "Car, not ... (?!) wash",  # should match: only one word between car and wash, plus any amount of punctuation
    "Car never:)... $##! with wash",  # should also match: only two words between car and wash, plus extra punctuation
    "Car, was never wash",
    "Car...:) things, not wash",
]
But the regex I have written fails on them. How can I modify it so that it matches all of the texts above?
import re
# Define the regular expression (raw string, so the backslashes reach the regex engine unchanged)
regex = r"car\S*\s*(\w+\s+){1,2}\s*\S*wash"
# Use the re.search() function to find a match
match = re.search(regex, "Car, not ... (?!) wash", flags=re.I)
# Check if a match was found
if match:
    print("Match found: ", match.group(0))
else:
    print("No match found")
In short, I have to match any text that starts with "car" and ends with "wash", under these conditions:
It can have only 1 to 2 words between car and wash. The regex I wrote takes care of that.
Along with those words, it can have any number of punctuation characters or spaces between them.
Based on my interpretation of your question plus additional comments, I am defining the following rules.
1. A word is a sequence of one or more non-whitespace characters, at least one of which must be a letter, digit or underscore (the latter as it is part of the character class \w).
2. Words are separated by at least one whitespace character, mixed with zero or more 'punctuation' characters (i.e. anything but letter/digit/underscore).
There is some ambiguity in my definition; punctuation between a letter and a space could be considered part of the word (rule #1) or part of the separation between words (rule #2).
But when counting words, that makes no difference.
From there, I can build two subpatterns.
\S*\w\S* - a word has at least one word character, and no whitespace
\W*\s\W* - a separator has at least one whitespace character, and no word character
Chaining the subpatterns:
\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b
Notice the word boundaries \b on either side, which prevent "scar" and "washing" from being mistaken for "car" and "wash".
This matches all of these texts:
car. was good ?wash # 2 words and punctuation between car and wash
car will never wash # 2 words
Car, not ... (?!) wash # 1 word
Car never:)... $##! with wash # 2 words
Car, was never wash # 2 words
Car...:) things, not wash # 2 words
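The pattern can be checked against the listed texts with a short script. This is only a sketch: the three counter-examples in should_not_match are hypothetical inputs added for illustration, not taken from the question.

```python
import re

# Word = \S*\w\S*, separator = \W*\s\W*, as defined in the answer.
pattern = re.compile(
    r"\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b",
    re.IGNORECASE,
)

should_match = [
    "car. was good ?wash",
    "car will never wash",
    "Car, not ... (?!) wash",
    "Car never:)... $##! with wash",
    "Car, was never wash",
    "Car...:) things, not wash",
]

# Hypothetical counter-examples (assumptions, not from the question).
should_not_match = [
    "car wash",                # zero words in between
    "car one two three wash",  # three words in between
    "scar will never wash",    # \b rejects "scar"
]

results = {
    text: pattern.search(text) is not None
    for text in should_match + should_not_match
}
for text, matched in results.items():
    print(f"{matched!s:5}  {text}")
```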
An alternate approach would be to first strip all punctuation from the string, and then match against \bcar\s+(\S+\s+){1,2}wash\b
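A rough sketch of that alternate approach, assuming "punctuation" means anything that is neither a word character nor whitespace. The helper name has_one_or_two_words is made up for this example.

```python
import re


def has_one_or_two_words(text):
    """Strip punctuation first, then apply the simpler pattern."""
    # Assumption: punctuation = every character that is not \w and not \s.
    stripped = re.sub(r"[^\w\s]", "", text)
    simple = r"\bcar\s+(\S+\s+){1,2}wash\b"
    return re.search(simple, stripped, flags=re.IGNORECASE) is not None


texts = [
    "car. was good ?wash",
    "Car, not ... (?!) wash",
    "Car never:)... $##! with wash",
    "Car, was never wash",
    "Car...:) things, not wash",
]
checks = [has_one_or_two_words(t) for t in texts]
print(checks)
print(has_one_or_two_words("car wash"))
```

The trade-off is that stripping changes the text, so the matched span no longer maps directly onto the original string.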

how to exclude words in regex using Negative Lookahead?

I am trying to exclude a word from a sentence, but if the excluded word does not appear, the regex should keep searching for characters until the exclude word is found.
For example, let's suppose I have a list like this:
S.no  Vehicle  Status
1     car      sold
2     car      not sold
3     car      sold
4     car      Repair
I want to match all the cars that don't have a status of sold (the status could be anything but sold), and I want to capture the status too (if it is not sold).
I tried this regex:
f"car(?!\s+sold)"
But how can I tell it to keep matching when it doesn't find "sold" in the negative lookahead (while still applying that filter)?
You can write the pattern like this:
pattern = r"\bcar\b(?!\s+sold\b).+"
Explanation
\bcar\b Match the word car
(?!\s+sold\b) Assert not 1+ whitespace chars followed by the word "sold" to the right
.+ Match 1+ chars
If there has to be a non whitespace char present after "car" and you don't want to cross newlines:
\bcar\b(?![^\S\n]+sold\b)[^\S\n]+\S.*
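Both patterns can be tried on the sample rows with a small sketch. The capturing group around the status in the second pattern is an addition made here for illustration (the answer's pattern has no group), and the rows string is a single-spaced rendering of the question's list.

```python
import re

rows = "1 car sold\n2 car not sold\n3 car sold\n4 car Repair"

# First pattern from the answer: reject "car" when the next word is "sold".
pattern = r"\bcar\b(?!\s+sold\b).+"
whole = re.findall(pattern, rows)
print(whole)

# Second pattern, with an added capturing group (an assumption for this
# sketch) so that findall returns only the status text.
strict = r"\bcar\b(?![^\S\n]+sold\b)[^\S\n]+(\S.*)"
statuses = re.findall(strict, rows)
print(statuses)
```

Because the lookahead only inspects the word right after "car", the row "car not sold" is still matched.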

RegEx for matching strings with spaces and words

I have the following string:
the quick brown fox abc(1)(x)
with the following regex:
(?i)(\s{1})(abc\(1\)\([x|y]\))
and the output is
abc(1)(x)
which is expected, however, I can't seem to:
use \W \w \d \D etc to extract more than 1 space
combine the quantifier to add more spaces.
I would like the following output:
the quick brown fox abc(1)(x)
From the primary lookup "abc(1)(x)", I would like up to 5 words on either side of it. My assumption is that spaces demarcate words.
Edit 1:
The 5 words on either side are unknown for future examples. The string may be:
cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog.
In this case, the desired output would be:
with a black hat is abc(1)(x) the quick brown fox jumps
Edit 2:
Edited the expected output in the first example and added "up to" 5 words.
(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}abc\(1\)\([xy]\)(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}
Note that I've changed \w+ to [0-9A-Za-z_]+ and \W+ to [^0-9A-Za-z_]+ because, depending on your locale / Unicode settings, \W and \w might not act the way you expect in Python.
Also note that I don't specifically look for spaces, just "non-word characters"; this probably handles edge cases such as quote characters a little better.
But regardless, this should get you most of the way there.
BTW: You're calling this "lookaround", but it really has nothing to do with the regex feature named "lookaround".
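A quick sketch of this pattern on the question's second example. The last two lines illustrate the \w caveat with the sample word "café", which is an assumption chosen for the demonstration.

```python
import re

pattern = re.compile(
    r"(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}"   # up to 5 words before
    r"abc\(1\)\([xy]\)"                        # the lookup itself
    r"(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}"   # up to 5 words after
)

text = ("cat with a black hat is abc(1)(x) "
        "the quick brown fox jumps over the lazy dog.")
window = pattern.search(text).group()
print(window)

# In Python 3, \w on a str pattern is Unicode-aware by default,
# while the explicit ASCII class is not.
unicode_word = re.fullmatch(r"\w+", "café") is not None
ascii_word = re.fullmatch(r"[0-9A-Za-z_]+", "café") is not None
print(unicode_word, ascii_word)
```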
If I understand your requirements correctly, you want to do something like this:
(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}
Breakdown:
(?: # Start of a non-capturing group.
\w+ # Any word character repeated one or more times (basically, a word).
[ ] # Matches a space character literally.
) # End of the non-capturing group.
{0,5} # Match the previous group between 0 and 5 times.
( # Start of the first capturing group.
abc\(1\) # Matches "abc(1)" literally.
\([xy]\) # Matches "(x)" or "(y)". You don't need "|" inside a character class.
) # End of the capturing group.
(?:[ ]\w+){0,5} # Same as the non-capturing group above but the space is before the word.
Notes:
To make the pattern case insensitive, you may start it with (?i) as you're doing already or use the re.IGNORECASE flag.
If you want to support words not separated by a space, you may replace [ ] with either \W+ (which means non-word characters) or with a character class which includes all the punctuation characters that you want to support (e.g., [.,;?! ]).
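The notes above can be illustrated with a short sketch. The punctuated sample string and the names spaced, loose and punctuated are assumptions introduced only for this example.

```python
import re

# Pattern from the answer: words separated by single spaces.
spaced = re.compile(
    r"(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}",
    re.IGNORECASE,
)
# Variant from the notes: any run of non-word characters as separator.
loose = re.compile(
    r"(?:\w+\W+){0,5}(abc\(1\)\([xy]\))(?:\W+\w+){0,5}",
    re.IGNORECASE,
)

text = ("cat with a black hat is abc(1)(x) "
        "the quick brown fox jumps over the lazy dog.")
m = spaced.search(text)
print(m.group(0))   # the lookup with up to 5 words on each side
print(m.group(1))   # only the lookup (first capturing group)

# Hypothetical input with punctuation between the words.
punctuated = "hat, is: ABC(1)(Y), the quick"
spaced_hit = spaced.search(punctuated).group(0)
loose_hit = loose.search(punctuated).group(0)
print(spaced_hit)
print(loose_hit)
```

With the space-only separator the surrounding words are dropped as soon as punctuation touches them, while the \W+ variant keeps them.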

Python regex with \w does not work

I want to have a regex to find a phrase and two words preceding it if there are two words.
For example I have the string (one sentence per line):
Chevy is my car and Rusty is my horse.
My car is very pretty my dog is red.
If I use the regex:
re.finditer(r'[\w+\b|^][\w+\b]my car',txt)
I do not get any match.
If I use the regex:
re.finditer(r'[\S+\s|^][\S+\s]my car',txt)
I am getting:
's my car' and '. My car' (I am ignoring case and using multi-line)
Why is the regex with \w+\b not finding anything? It should find two words and 'my car'.
How can I get two complete words before 'my car' if there are two words. If there is only one word preceding my car, I should get it. If there are no words preceding it I should get only 'my car'. In my string example I should get: 'Chevy is my car' and 'My car' (no preceding words here)
In your r'[\w+\b|^][\w+\b]my car' regex, [\w+\b|^] matches 1 symbol that is either a word char, a +, a backspace, a |, or a ^, and [\w+\b] matches 1 symbol that is either a word char, a +, or a backspace.
The point is that inside a character class, quantifiers and many (but not all) special characters match literal symbols. E.g. [+] matches a plus symbol, [|^] matches either a | or a ^. Since you want to match a sequence, you need to provide a sequence of subpatterns outside of a character class.
It seems you intended to use \b as a word boundary; however, \b inside a character class matches only a backspace character.
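A small sketch makes the character-class behaviour visible. The test characters and the sample sentence are assumptions picked for illustration.

```python
import re

# Inside [...], "+", "|", "^" (when not first) are literals and \b is a backspace.
char_class = re.compile(r"[\w+\b|^]")

literals = ["a", "+", "|", "^", "\x08"]   # "\x08" is the backspace character
single_char_hits = [char_class.fullmatch(ch) is not None for ch in literals]
print(single_char_hits)

# The class always consumes exactly one character, never a whole word.
two_chars = char_class.fullmatch("ab")
print(two_chars)

# Outside a character class, \b is a word boundary.
boundary_hits = re.findall(r"\bcar\b", "car scar cart car")
print(boundary_hits)
```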
To find two words and 'my car', you can use, for example
\S+\s+\S+\s+my car
Here, \S+ matches one or more non-whitespace symbols and \s+ matches 1 or more whitespaces; the 2 occurrences of these 2 consecutive subpatterns match the two words as a sequence.
To make the sequences before my car optional, just use a {0,2} quantifier like this:
(?:\S+[ \t]+){0,2}my car
Use it with the re.IGNORECASE flag, as in this Python demo:
import re
txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I))  # ['Chevy is my car', 'My car']
Details:
(?:\S+[ \t]+){0,2} - 0 to 2 sequences of 1+ non-whitespaces followed by 1+ space or tab symbols (you may also replace [ \t] with [^\S\r\n] to match any horizontal whitespace, or with \s if you also plan to match linebreaks).
my car - a literal text my car.

python regex match optional square brackets

I have the following strings:
1 "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
2 "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
3 '''GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND
ERIC JOHN WATSON CA CA51/03 26 May 2003'''
I am trying to find a regular expression which matches all of them. I don't know how to match optional square brackets around the date at the end of the string, e.g. [16 May 2014].
casename = re.compile(r'(^[A-Z][A-Za-z\'\(\) ]+\b[v|V]\b[A-Za-z\'\(\) ]+(.*?)[ \[ ]\d+ \w+ \d\d\d\d[\] ])', re.S)
The date regex at the end only matches cases with dates in square brackets, but not the ones without.
Thanks to everybody who answered. @Matt Clarkson, what I am trying to match is a judicial decision 'handle' in a much larger text. There is a large variation within those handles, but they all start at the beginning of a line, have 'v' for versus between the party names, and have a date at the end. Mostly the names of the parties are in capitals, but not exclusively. I am trying to have only one match per document and no false positives.
I got all of them to match using this (you'll need to add the case-insensitive flag):
(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)
Explanation:
( Begin capture group
^[a-z] Match a letter at the start of the string
[a-z\'&\(\) ]+ Match one or more of the characters in this set
\b Match a word boundary
v Match the character 'v' literally
\b Match a word boundary
[a-z&\'\(\) ]+ Match one or more of the characters in this set
(?: Begin non-capturing group
.*? Match any characters, as few as possible
) End non-capturing group
\[?\d+ \w+ \d{4}\]? Match a date (preceded by a space), optionally surrounded by brackets
) End capture group
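A sketch that runs this pattern over the three strings from the question. One assumption goes beyond the answer's wording: re.DOTALL is added (as in the asker's own re.compile call with re.S) so that .*? can cross the line break inside the third string.

```python
import re

handles = [
    "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
    "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
    "GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND\n"
    "ERIC JOHN WATSON CA CA51/03 26 May 2003",
]

source = r"(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)"

# Assumption: DOTALL is needed here only because the third handle
# contains a newline between the party names.
casename = re.compile(source, re.IGNORECASE | re.DOTALL)
matched = [casename.search(h) is not None for h in handles]
print(matched)

# Without DOTALL, "." stops at the newline and the third handle fails.
single_line = re.compile(source, re.IGNORECASE)
third_without_dotall = single_line.search(handles[2])
print(third_without_dotall)
```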
Square brackets can be made optional like this:
[\[]* makes the opening [ optional. Note that * allows zero or more brackets, whereas \[? allows at most one.
A few recommendations, if I may:
\d\d\d\d can also be written as \d{4}.
Inside [], the characters are already alternatives, so | is not necessary: write [vV] instead of [v|V].
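These points can be checked with a few one-line matches. This is an illustrative sketch; the sample inputs (including the doubled brackets) are assumptions.

```python
import re

# Optional single bracket on each side of the date.
date = re.compile(r"\[?\d+ \w+ \d{4}\]?")
bracketed = date.fullmatch("[16 May 2014]") is not None
bare = date.fullmatch("26 May 2003") is not None
print(bracketed, bare)

# "*" accepts any number of brackets, "?" at most one.
doubled = "[[16 May 2014]]"
star_accepts = re.fullmatch(r"[\[]*\d+ \w+ \d{4}[\]]*", doubled) is not None
question_accepts = date.fullmatch(doubled) is not None
print(star_accepts, question_accepts)

# Inside a character class, "|" is just another literal character.
pipe_in_class = re.fullmatch(r"[v|V]", "|") is not None
pipe_not_in_class = re.fullmatch(r"[vV]", "|") is not None
print(pipe_in_class, pipe_not_in_class)
```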
Using your regex and input strings, it looks like you will match only the 2nd line (if you get rid of the '^' at the beginning of the regex). I've added inline comments to each section of the regular expression you provided to make it clearer.
Can you indicate what you are trying to capture from each line? Do you want the entire string? Only the word immediately preceding the lone letter 'v'? Do you want the date captured separately?
Depending on the portions that you wish to capture, each section can be broken apart into their respective match groups: regex101.com example. This is a little looser than yours (capturing the entire section between quotation marks instead of only the single word immediately preceding the lone 'v'), and broken apart to help readability (each "group" on its own line).
This example also assumes the newline is intentional, and supports the newline component (warning: it COULD suck up more than you intend, depending on whether the date at the end gets matched or not).
