Python regex with \w does not work - python

I want to have a regex to find a phrase and two words preceding it if there are two words.
For example I have the string (one sentence per line):
Chevy is my car and Rusty is my horse.
My car is very pretty my dog is red.
If I use the regex:
re.finditer(r'[\w+\b|^][\w+\b]my car',txt)
I do not get any match.
If I use the regex:
re.finditer(r'[\S+\s|^][\S+\s]my car',txt)
I am getting:
's my car' and '. My car' (I am ignoring case and using multi-line)
Why is the regex with \w+\b not finding anything? It should find two words and 'my car'
How can I get two complete words before 'my car' if there are two words. If there is only one word preceding my car, I should get it. If there are no words preceding it I should get only 'my car'. In my string example I should get: 'Chevy is my car' and 'My car' (no preceding words here)

In your r'[\w+\b|^][\w+\b]my car' regex, [\w+\b|^] matches one character that is either a word char, a +, a backspace, a |, or a ^, and [\w+\b] matches one character that is either a word char, a +, or a backspace.
The point is that inside a character class, quantifiers and a lot (but not all) special characters match literal symbols. E.g. [+] matches a plus symbol, [|^] matches either a | or ^. Since you want to match a sequence, you need to provide a sequence of subpatterns outside of a character class.
It seems as if you intended to use \b as a word boundary, however, \b inside a character class matches only a backspace character.
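The behaviour described above can be checked by running the character class on its own. This is only a small sketch; the sample string is made up for illustration and is not from the question.

```python
import re

# Inside [...], "+" is a literal plus sign and "\b" is a backspace (\x08),
# so the class matches exactly one character per match.
sample = "ab+\x08 "  # hypothetical input: two letters, a plus, a backspace, a space
single_chars = re.findall(r'[\w+\b]', sample)
print(single_chars)

# Outside a class, \w+ means "one or more word chars" and \b is a word boundary.
words = re.findall(r'\b\w+\b', "my car")
print(words)
```

The first call returns one-character matches only, which is why the original pattern could never consume two whole words.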
To find two words and 'my car', you can use, for example
\S+\s+\S+\s+my car
See the regex demo (here, \S+ matches one or more non-whitespace symbols, and \s+ matches 1 or more whitespaces, and the 2 occurrences of these 2 consecutive subpatterns match these symbols as a sequence).
To make the sequences before my car optional, just use a {0,2} quantifier like this:
(?:\S+[ \t]+){0,2}my car
See this regex demo (to be used with the re.IGNORECASE flag). See Python demo:
import re
txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I))
Details:
(?:\S+[ \t]+){0,2} - 0 to 2 sequences of 1+ non-whitespaces followed with 1+ space or tab symbols (you may also replace it with [^\S\r\n] to match any horizontal space or \s if you also plan to match linebreaks).
my car - a literal text my car.
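To illustrate the note about [ \t] versus \s, here is a sketch that runs both variants on the sample text from the question. With \s+ the optional words may also come from the previous line.

```python
import re

txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'

# Horizontal whitespace only: preceding words must be on the same line.
same_line = re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I)

# Any whitespace: a line break also counts as a separator between words.
across_lines = re.findall(r'(?:\S+\s+){0,2}my car', txt, re.I)

print(same_line)
print(across_lines)
```

The second result picks up 'my horse.' from the first line, which is usually not what you want when each line holds one sentence.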

Related

How to make regular expression match a required text?

I have a regular expression that I have written,
regex = "car\S*\s*(\w+\s+){1,2}\s*\S*wash"
This regex matches texts such as the following (i.e., one or two words between "car" and "wash"),
"car. was good ?wash"
"car wash"
"car will never wash"
But I want the above regex to also match these variations,
texts = [
    "Car, not ... (?!) wash",  # should match: only one word between car and wash, plus any amount of punctuation
    "Car never:)... $##! with wash",  # should match: only two words between car and wash, plus any amount of punctuation
    "Car, was never wash",
    "Car...:) things, not wash"]
But the regex I have written fails on them. How can I modify it so that it matches all of the texts above, given the following code,
import re
# Define the regular expression
regex = "car\S*\s*(\w+\s+){1,2}\s*\S*wash"
# Use the re.search() function to find a match
match = re.search(regex, "Car, not ... (?!) wash", flags=re.I)
# Check if a match was found
if match:
    print("Match found: ", match.group(0))
else:
    print("No match found")
In short, I have to match any text that starts with "car" and ends with "wash", under these conditions:
It can have only 1 to 2 words in between car and wash. The regex I wrote takes care of that.
Along with those words, it can have any number of punctuation characters or spaces between them.
Based on my interpretation of your question plus additional comments, I am defining the following rules.
A word is a sequence of one or more non-whitespace characters, at least one of which must be a letter, digit or underscore (the latter as it is part of the character class \w).
Words are separated by at least one whitespace character, mixed with zero or more 'punctuation' characters (i.e. anything but letter/digit/underscore).
There is some ambiguity in my definition; punctuation between a letter and a space could be considered part of the word (rule #1) or part of the separation between words (rule #2).
But when counting words, that makes no difference.
From there, I can build two subpatterns.
\S*\w\S* - a word has at least one word character, and no whitespace
\W*\s\W* - a separator has at least one whitespace character, and no word character
Chaining the subpatterns:
\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b
Notice the word boundaries \b on either side, which prevent "scar" and "washing" from being mistaken for "car" and "wash".
This matches all of these texts:
car. was good ?wash # 2 words and punctuation between car and wash
car will never wash # 2 words
Car, not ... (?!) wash # 1 word
Car never:)... $##! with wash # 2 words
Car, was never wash # 2 words
Car...:) things, not wash # 2 words
An alternate approach would be to first strip all punctuation from the string, and then match against \bcar\s+(\S+\s+){1,2}wash\b
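As a sanity check, the chained pattern can be run against the texts from the question. This is a sketch only; the names CAR_WASH, should_match and should_not_match, and the three negative examples, are hypothetical additions for illustration.

```python
import re

# word      = \S*\w\S*   (no whitespace, at least one word char)
# separator = \W*\s\W*   (no word chars, at least one whitespace)
CAR_WASH = re.compile(r'\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b', re.I)

should_match = [
    "car. was good ?wash",
    "car will never wash",
    "Car, not ... (?!) wash",
    "Car never:)... $##! with wash",
    "Car, was never wash",
    "Car...:) things, not wash",
]
# Assumed counter-examples: zero words, three words, and "scar" instead of "car".
should_not_match = [
    "car wash",
    "car one two three wash",
    "scar will never wash",
]

matched = [bool(CAR_WASH.search(t)) for t in should_match]
rejected = [CAR_WASH.search(t) is None for t in should_not_match]
print(matched, rejected)
```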

RegEx for matching strings with spaces and words

I have the following string:
the quick brown fox abc(1)(x)
with the following regex:
(?i)(\s{1})(abc\(1\)\([x|y]\))
and the output is
abc(1)(x)
which is expected, however, I can't seem to:
use \W \w \d \D etc to extract more than 1 space
combine the quantifier to add more spaces.
I would like the following output:
the quick brown fox abc(1)(x)
From the primary lookup "abc(1)(x)" I would like up to 5 words on either side of the lookup. My assumption is that spaces demarcate a word.
Edit 1:
The 5 words on either side would be unknown for future examples. the string may be:
cat with a black hat is abc(1)(x) the quick brown fox jumps over the
lazy dog.
In this case, the desired output would be:
with a black hat is abc(1)(x) the quick brown fox jumps
Edit 2:
edited the expected output in the first example and added "up to" 5 words
(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}abc\(1\)\([xy]\)(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}
Note that I've changed \w+ to [0-9A-Za-z_]+ and \W+ to [^0-9A-Za-z_]+ because depending on your locale / Unicode settings \W and \w might not act the way you expect in Python.
Also note I don't specifically look for spaces, just "non-word characters" this probably handles edge cases a little better for quote characters etc.
But regardless this should get you most of the way there.
BTW: You're calling this "lookaround", but it really has nothing to do with "regex lookaround", the regex feature.
If I understand your requirements correctly, you want to do something like this:
(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}
BreakDown:
(?: # Start of a non-capturing group.
\w+ # Any word character repeated one or more times (basically, a word).
[ ] # Matches a space character literally.
) # End of the non-capturing group.
{0,5} # Match the previous group between 0 and 5 times.
( # Start of the first capturing group.
abc\(1\) # Matches "abc(1)" literally.
\([xy]\) # Matches "(x)" or "(y)". You don't need "|" inside a character class.
) # End of the capturing group.
(?:[ ]\w+){0,5} # Same as the non-capturing group above but the space is before the word.
Notes:
To make the pattern case insensitive, you may start it with (?i) as you're doing already or use the re.IGNORECASE flag.
If you want to support words not separated by a space, you may replace [ ] with either \W+ (which means non-word characters) or with a character class which includes all the punctuation characters that you want to support (e.g., [.,;?! ]).
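A short sketch of how the pattern above might be used from Python, run on the two strings from the question. The helper name context_window is hypothetical.

```python
import re

# Up to 5 space-separated words on each side of the lookup term.
CONTEXT = re.compile(r'(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}', re.I)

def context_window(text):
    """Return the lookup term with up to 5 words of context, or None."""
    m = CONTEXT.search(text)
    return m.group(0) if m else None

short_result = context_window('the quick brown fox abc(1)(x)')
long_result = context_window(
    'cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog.'
)
print(short_result)
print(long_result)
```

m.group(0) is the whole window, while m.group(1) would give only the lookup term itself.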

Ignore part of match in `re.split`

Input is a two-sentence string:
s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
I'd like to .split s into sentences based on the logic that:
sentences end with one or more periods, exclamation marks, question marks, or period+quotation mark
and are then followed by 1+ whitespace characters and a capitalized alpha character.
Desired result:
['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
Also okay:
['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']
But I currently chop off the 0th element of each sentence because the uppercase character is captured:
import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
Notice the missing T. How can I tell .split to ignore certain elements of the compiled pattern?
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])
Although I would suggest the below to allow quotes to be followed by any of .!? to be a split condition
((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])
Explanation
The common stuff in both +(?=[A-Z])
' +' #One or more spaces(The actual splitting chars used.)
(?= #START positive look ahead check if it followed by this, but do not consume
[A-Z] #Any capitalized alphabet
) #END positive look ahead
The conditions for what comes before the space
For Solution1
( #GROUP START
(?<= #START Positive look behind, Make sure this comes before but do not consume
[.!?] #any one of these chars should come before the splitting space
) #END positive look behind
| #OR condition this is also the reason we had to put all this in GROUP
(?<= #START Positive look behind,
\.\" #splitting space could be preceded by .", covering a condition that is not covered by the previous set of . or ! or ?
) #END positive look behind
) #END GROUP
For Solution2
( #GROUP START
(?<=[.!?]) #Same as the previous look behind
| #OR condition
(?<=[.!?]\") #Only difference here is that we are allowing quote after any of . or ! or ?
) #GROUP END
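One caveat when plugging these patterns into Python's re.split: the text of capturing groups is included in the returned list, so the outer ( ... ) group adds an empty string between the sentences. The sketch below uses a non-capturing (?: ... ) variant of Solution2, which is my adaptation rather than the pattern as written above.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Solution2 as written: the capturing group shows up as '' in the output.
with_group = re.split(r'((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])', s)

# Same logic with a non-capturing group: only the sentences are returned.
END_SENT = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])')
sentences = END_SENT.split(s)

print(with_group)
print(sentences)
```

Because the lookbehind and lookahead consume nothing, only the spaces are removed and the capital letter stays with the next sentence.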
It's easier to describe the sentence than trying to identify the delimiter. So instead of re.split try with re.findall:
re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
To preserve the next uppercase letter, the pattern uses a lookahead that is only a test and doesn't consume characters.
details:
( # capture group: re.findall returns only the capture group content if any
[^.?!\s] # the first character isn't a space or a punctuation character
.*? # a non-greedy quantifier
[.?!]* # eventual punctuation characters
)
\s* # zero or more white-spaces
(?![^A-Z]) # not followed by a character that isn't an uppercase letter
# (this includes an uppercase letter and the end of the string)
Obviously, for more complicated cases with abbreviations, names, etc., you have to use tools like nltk or any other nlp tools trained with dictionaries.
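For the sample string from the question, the re.findall approach can be tried as follows. This is a sketch under the question's assumptions (sentences start with an uppercase ASCII letter).

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Each match is one sentence; the negative lookahead only peeks at the next
# character, so the following uppercase letter is not consumed.
sentences = re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
print(sentences)
```

"fl. oz. but" is not split because the character after the spaces is lowercase, so the lookahead fails there and the lazy .*? keeps expanding.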

How are regex quantifiers applied?

I have the following regex:
res = re.finditer(r'(?:\w+[ \t,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)
for item in res:
    print(item.group())
When I use this regex with the following string:
"my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly."
I am getting the following results:
house is painted white, my car
the road, I drive my car
My question is about the quantifier {0,4} that should apply to the whole group. The group collects words with the expression \w+ and some separation symbols with the [ ]. Does the quantifier apply only to the "words" defined by \w+? In the results I am getting 4 words plus space and comma. It's unclear to me.
So, here's what's happening. (?: ... ) is a non-capturing group that matches one word (\w+) followed by one or more separator characters ([ \t,]+, i.e. a space, a tab, or a comma). {0,4} then repeats that whole group between 0 and 4 times. So the regex finds "my car" and includes up to 4 words before it; the comma and the spaces are consumed by the character class you specified, so they do not count as extra repetitions.
Broken apart more succinctly
(?: -- Start of non-capturing group
\w+ -- One or more word characters (one word)
[ \t,]+ -- One or more spaces, commas, or tab characters
) -- End of non-capturing group
{0,4} -- Match the previous group 0-4 times
my car -- Based off where you find the words "my car"
As a result this will match 0-4 words / spaces / commas / tabs before the appearance of "my car"
This is working as written
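To make the counting concrete, the following sketch reruns the regex from the question and counts the words in front of "my car" in each match. The word_counts helper line is an illustrative addition.

```python
import re

txt = ("my house is painted white, my car is red.\n"
       "A horse is galloping very fast in the road, I drive my car slowly.")

pattern = r'(?:\w+[ \t,]+){0,4}my car'
matches = [m.group() for m in re.finditer(pattern, txt, re.IGNORECASE | re.MULTILINE)]

# Count the words in each match, minus the two words of "my car" itself.
word_counts = [len(re.findall(r'\w+', m)) - 2 for m in matches]

print(matches)
print(word_counts)
```

Each repetition of the group is one word plus however many separators follow it, so "white, " is a single repetition even though it contains a comma and a space.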

Python Regex doesn't match . (dot) as a character

I have a regex that matches all three-character words in a string:
\b[^\s]{3}\b
When I use it with the string:
And the tiger attacked you.
this is the result:
regex = re.compile(r"\b[^\s]{3}\b")
regex.findall(string)
[u'And', u'the', u'you']
As you can see it matches you as a word of three characters, but I want the expression to take "you." with the "." as a 4 chars word.
I have the same problem with ",", ";", ":", etc.
I'm pretty new with regex but I guess it happens because those characters are treated like word boundaries.
Is there a way of doing this?
Thanks in advance,
EDIT
Thanks to the answers of #BrenBarn and #Kendall Frey I managed to get to the regex I was looking for:
(?<!\w)[^\s]{3}(?=$|\s)
If you want to make sure the word is preceded and followed by a space (and not a period like is happening in your case), then use lookaround.
(?<=\s)\w{3}(?=\s)
If you need it to match punctuation as part of words (such as 'in.') then \w won't be adequate, and you can use \S (anything but a space)
(?<=\s)\S{3}(?=\s)
As described in the documentation:
A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., restrict the match but not be included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
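The difference between \b and whitespace-based lookarounds can be seen side by side on the sentence from the question. This is a sketch; the variable names are arbitrary.

```python
import re

text = 'And the tiger attacked you.'

# \b sees a boundary between "u" and ".", so "you" counts as a 3-char word.
with_boundaries = re.findall(r'\b[^\s]{3}\b', text)

# The asker's final regex: "you." is 4 non-space chars, so it is skipped.
with_lookaround = re.findall(r'(?<!\w)[^\s]{3}(?=$|\s)', text)

# Requiring whitespace on both sides also drops the word at the string start.
strict_spaces = re.findall(r'(?<=\s)\S{3}(?=\s)', text)

print(with_boundaries, with_lookaround, strict_spaces)
```

Note the raw-string prefix: in a normal Python string literal, "\b" is a backspace character rather than a word boundary.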
This would be my approach. Also matches words that come right after punctuations.
import re
r = r'''
\b # word boundary
( # capturing parentheses
[^\s]{3} # anything but whitespace 3 times
\b # word boundary
(?=[^\.,;:]|$) # dont allow . or , or ; or : after word boundary but allow end of string
| # OR
[^\s]{2} # anything but whitespace 2 times
[\.,;:] # a . or , or ; or :
)
'''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'
print(re.findall(r, s, re.X))
output:
['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']
