Problem omitting optional word in python3 regex

I need a regex that captures 2 groups: a movie and the year. Optionally, there could be a 'from ' string between them.
My expected results are:
first_query="matrix 2013" => ('matrix', '2013')
second_query="matrix from 2013" => ('matrix', '2013')
third_query="matrix" => ('matrix', None)
I've done 2 simulations on https://regex101.com/ for python3:
I- r"(.+)(?:from ){0,1}([1-2]\d{3})"
Doesn't match first_query and third_query, also doesn't omit 'from' in group one, which is what I want to avoid.
II- r"(.+)(?:from ){1}([1-2]\d{3})"
Works with second_query, but does not match first_query and third_query.
Is it possible to match all three strings, omitting the 'from ' string from the first group?
Thanks in advance.

You may use
^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$
See the regex demo
Details
^ - start of a string
(.+?) - Group 1: any 1+ chars other than line break chars, as few as possible
(?:\s+(?:from\s+)?([12]\d{3}))? - an optional non-capturing group matching 1 or 0 occurrences of:
\s+ - 1+ whitespaces
(?:from\s+)? - an optional sequence of from substring followed with 1+ whitespaces
([12]\d{3}) - Group 2: 1 or 2 followed with 3 digits
$ - end of string.
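As a quick check, the pattern above can be run against the three queries from the question. This is only a minimal sketch; the names pattern, queries and results are illustrative and not part of the original answer.

```python
import re

# Pattern from the answer above: a lazy title, then an optional
# "<spaces>[from<spaces>]<year>" tail anchored at the end of the string.
pattern = re.compile(r"^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$")

queries = ["matrix 2013", "matrix from 2013", "matrix"]
results = [pattern.match(q).groups() for q in queries]

for query, groups in zip(queries, results):
    print(f"{query!r} => {groups}")
# 'matrix 2013' => ('matrix', '2013')
# 'matrix from 2013' => ('matrix', '2013')
# 'matrix' => ('matrix', None)
```

Because the year group sits inside an optional non-capturing group, it comes back as None when the query has no year, which is what the question asks for.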

This will output your patterns, but with one space too many in front of the number:
import re
pat = r"^(.+?)(?: from)? ?(\d+)?$"
text = """matrix 2013
matrix from 2013
matrix"""
for t in text.split("\n"):
    print(re.findall(pat, t))
Output:
[('matrix', '2013')]
[('matrix', '2013')]
[('matrix', '')]
Explanation:
^ start of string
(.+?) lazy anythings as few as possible
(?: from)? non-grouped optional ` from`
? optional space
(\d+)?$ optional digits till end of string
Demo: https://regex101.com/r/VD0SZb/1
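Note that re.findall reports a group that did not take part in the match as an empty string, whereas the question expects None. A minimal sketch, assuming the same pat as above, that compares re.findall with re.match(...).groups() (the names lines, found and matched are illustrative):

```python
import re

pat = r"^(.+?)(?: from)? ?(\d+)?$"
lines = ["matrix 2013", "matrix from 2013", "matrix"]

# findall substitutes '' for a group that did not participate in the match.
found = [re.findall(pat, t) for t in lines]

# match(...).groups() reports None for such a group instead.
matched = [re.match(pat, t).groups() for t in lines]

print(found[2])    # [('matrix', '')]
print(matched[2])  # ('matrix', None)
```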

import re
pattern = re.compile( r"""
^\s* # start of string (optional whitespace)
(?P<title>\S+) # one or more non-whitespace characters (title)
(?:\s+from)? # optionally, some space followed by the word 'from'
\s* # optional whitespace
(?P<year>[0-9]+)? # optional digit string (year)
\s*$ # end of string (optional whitespace)
""", re.VERBOSE )
for query in [ 'matrix 2013', 'matrix from 2013', 'matrix' ]:
    m = re.match( pattern, query )
    if m: print( m.groupdict() )
# Prints:
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': None}
Disclaimer: this regex does not contain the logic necessary to reject the first two matches on the grounds that The Matrix actually came out in 1999.

Related

Python Regex: Match a string not preceded by or followed by a word with digits in it

I would like to have a Regex in Python to replace a string not preceded by or followed by a word with digits in it.
i.e.
For the following sentence,
Today is 4th April. Her name is April. Tomorrow is April 5th.
I would like to match only the second April (the person's name) and replace it with 'PERSON', so the result should look like this:
Today is 4th April. Her name is PERSON. Tomorrow is April 5th.
I tried to use this regex:
(\w*(?<!\w*\d\w*\s)April(?!\s\w*\d\w*))
However, I've got an error saying:
error: look-behind requires fixed-width pattern
Any help is appreciated.

It can be done with the PyPI regex library, which supports variable-length lookbehind.
import regex
str = 'Today is 4th April. Her name is April. Tomorrow is April 5th.'
res = regex.sub(r'(?<!\d[a-z]* )April(?! [a-z]*\d)', 'PERSON', str)
print(res)
Output:
Today is 4th April. Her name is PERSON. Tomorrow is April 5th.
Explanation:
(?<!\d[a-z]* ) # negative lookbehind, make sure we haven't a digit followed by 0 or more letters and a space before
April # literally
(?! [a-z]*\d) # negative lookahead, make sure we haven't a space, 0 or more letters and a digit after
Update with re module:
import re
str = 'Today is 4th April. Her name is April. Tomorrow is April 5th.'
res = re.sub(r'(\b[a-z]+ )April(?! [a-z]*\d)', r'\g<1>PERSON', str)
print(res)

This is one regex you could use:
(?:^\s*|[^\w\s]+\s*|\b[^\d\s]+\s+)(April)\b(?!\s*\w*\d)
with the case-insensitive flag set. The target word is captured in capture group 1.
Demo
Python's regex engine performs the following operations:
(?: # begin non-cap grp
^ # match beginning of line
\s* # match 0+ whitespace characters
| # or
[^\w\s]+ # match 1+ chars other than word chars and whitespace
\s* # match 0+ whitespace chars
| # or
\b # match word break
[^\d\s]+ # match 1+ chars other than digits and whitespace
\s+ # match 1+ whitespace chars
) # end non-cap grp
(April) # match 'April' in capture group
\b # match word break
(?! # begin negative lookahead
\s* # match 0+ whitespace chars
\w* # match 0+ word chars
\d # match a digit
) # end negative lookahead
The approach I've taken was to specify what may precede "April" and what may not follow it. I could not specify what cannot precede "April" as that would require a variable-length negative lookbehind, which is not supported by Python's re module.
I've assumed that "April" may:
be at the beginning of the string, possibly followed by spaces;
be preceded by a character that is neither a word character nor a space, possibly followed by spaces; or
be preceded by a word containing no digits, possibly followed by spaces.
I've also assumed that "April" is followed by a word break which may not be followed by a word containing a digit, possibly preceded by spaces.
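Since the target word sits in capture group 1 while the match also consumes the text before it, a plain string replacement would swallow that preceding text. A minimal sketch of one way to apply the pattern with re.sub and a callback; the helper name replace_name is hypothetical and not part of the answer:

```python
import re

pattern = re.compile(
    r"(?:^\s*|[^\w\s]+\s*|\b[^\d\s]+\s+)(April)\b(?!\s*\w*\d)",
    re.IGNORECASE,
)

def replace_name(match):
    # Keep whatever was consumed before group 1 and swap only the captured word.
    prefix = match.group(0)[: match.start(1) - match.start(0)]
    return prefix + "PERSON"

text = "Today is 4th April. Her name is April. Tomorrow is April 5th."
result = pattern.sub(replace_name, text)
print(result)
# Today is 4th April. Her name is PERSON. Tomorrow is April 5th.
```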

regex with multiple conditional groups in lookahead that must also be captured in match

I wish to match 4 patterns in a string, 3 of which are optional.
strings can look as follows:
form1 = "N-e-1-[(5E)-5,6-e]-4 c"
form2 = "#3,4# N-e-1-[(5E)-5,6-e]-4 c <5,6,7>"
form3 = "#1,2,3# {N-e-1-[(5E)-5,6-e]-4 c} (#4,5# comments <6,7>) <8,9,10>"
and I want to match:
assert pattern.match(form1).groups() == (None, 'N-e-1-[(5E)-5,6-e]-4 c', None, None)
assert pattern.match(form2).groups() == ('3,4', 'N-e-1-[(5E)-5,6-e]-4 c', None, '5,6,7')
assert pattern.match(form3).groups() == ('1,2,3', 'N-e-1-[(5E)-5,6-e]-4 c', '#4,5# comments <6,7>', '8,9,10')
but I'm not quite getting there. This is what I have so far:
# match any digits, comma or space separated, enclosed by "#", at the start of the line
optional_first_part = r'^#?([,\d\s]+)?#?'
# match anything up to the start of an optional third or fourth part
second_part = r'(.*?)(?:<\d+|\(#|$)'
# match anything between "(#X" and "X>)", where X are integers
optional_third_part = r'\(?(#\d+.*\d+\>)?\)?'
# match any digits, comma or space separated, enclosed by "<" and ">", at the end of the line
optional_fourth_part = r'<?([,\d\s]+)?>?$'
# compile pattern
pattern = re.compile(r'{0}{1}{2}{3}'.format(optional_first_part, second_part,
optional_third_part, optional_fourth_part))
and what I now get:
pattern.match(form1).groups()
>>> (None, 'N - e - 1 - [(5E) - 5, 6 - e] - 4c', None, None)
pattern.match(form2).groups()
>>> ('3,4', ' N-e-1-[(5E)-5,6-e]-4 c ', None, ',6,7') # unwanted white spaces, losing start of the fourth part
pattern.match(form3).groups()
>>> ('1,2,3', ' {N-e-1-[(5E)-5,6-e]-4 c} (#4,5# comments <6,7>) ', None, '9,10') # completely horrible
Part of the issue is the lookahead: since I match "<\d+" there, the optional fourth part can no longer match it. Somehow I need to be able to capture it again in the fourth part.
In the last example I don't seem to be able to non-greedily match up to the occurrence of "(#\d+" in second_part, and thus the third part is not used.
Any suggestions?

You can use the following regular expression:
/^(?:#(\d+(?:,\d+)*)#)? *([^<]+?) *(?:\(([^()]*)\))? *(?:<(\d+(?:,\d+)*)>)?$/
demo
We can write it in free spacing mode to make it self-documenting:
/
^ # match beginning of line
(?: # begin non-capture group
\# # match '#' (escaped, because a bare # starts a comment in free-spacing mode)
( # begin capture group 1
\d+ # match 1+ digits
(?:,\d+)* # match a comma then 1+ digits in non-capture
# group, executed 0+ times (*)
) # end capture group #1
\# # match '#' (escaped, because a bare # starts a comment in free-spacing mode)
)? # end non-capture group and make it optional
\ * # match 0+ spaces
([^<]+?) # match any char other than '<' 1+ times (+), non-greedily
# in capture group 2 (not optional)
\ * # match 0+ spaces
(?: # begin non-capture group
\( # match '('
([^()]*) # match 0+ (*) chars other than '(' and
# ')' in capture group 3
\) # match ')'
)? # end non-capture group and make it optional
\ * # match 0+ spaces
(?: # begin non-capture group
< # match '<'
( # begin capture group 4
\d+ # match 1+ digits
(?:,\d+)* # match a comma then 1+ digits in non-
# capture group, 0+ times
) # end capture group 4
> # match '>'
)? # end non-capture group and make it optional
$ # match end of line
/x # free-spacing regex definition mode
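The answer writes the pattern as a /.../ literal. A minimal sketch of the same pattern in Python, assuming re.VERBOSE as the free-spacing equivalent; in that mode '#' must be escaped and a literal space is written as [ ]. Note that with this pattern group 2 keeps the surrounding braces in the third form, which differs slightly from the tuple the question asserts.

```python
import re

pattern = re.compile(r"""
    ^
    (?:\#(\d+(?:,\d+)*)\#)?   # optional leading number list, group 1
    [ ]*
    ([^<]+?)                  # main text, matched lazily, group 2
    [ ]*
    (?:\(([^()]*)\))?         # optional parenthesised comment, group 3
    [ ]*
    (?:<(\d+(?:,\d+)*)>)?     # optional trailing number list, group 4
    $
""", re.VERBOSE)

form1 = "N-e-1-[(5E)-5,6-e]-4 c"
form2 = "#3,4# N-e-1-[(5E)-5,6-e]-4 c <5,6,7>"
form3 = "#1,2,3# {N-e-1-[(5E)-5,6-e]-4 c} (#4,5# comments <6,7>) <8,9,10>"

groups1 = pattern.match(form1).groups()
groups2 = pattern.match(form2).groups()
groups3 = pattern.match(form3).groups()

for groups in (groups1, groups2, groups3):
    print(groups)
```

If the braces are unwanted, calling .strip("{}") on group 2 afterwards is one simple option.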

^(?:#(.*?)#)?\s*\{?(.*?)\}?\s*(?:\((#.*?)\))*\s*(?:<(\d.*?)>)*$
Demo here

Invalid pattern in look-behind

Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.

Ruby regex engine doesn't allow capturing groups in look behinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.

According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them treat it as an error, hence the difference you see between the re and Onigmo regex libraries.
Now, as for your regex, it works correctly in neither Ruby nor Python: the \b inside a character class in a Python and Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, if the char appears in the string a word char must appear immediately to the right of that non-word char. The word boundary must be moved to right after m before \.?.
Another flaw with the current approach is that lookbehinds are not the best to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespaces between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
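The point about \b inside a character class is easy to verify in Python. A small sketch (the variable names are illustrative):

```python
import re

# Inside a character class, \b means the backspace character (\x08).
matches_backspace = re.fullmatch(r"[\b]", "\x08") is not None

# "a b" contains word boundaries but no backspace character.
class_finds_boundary = re.search(r"[\b]", "a b") is not None
bare_finds_boundary = re.search(r"\b", "a b") is not None

print(matches_backspace, class_finds_boundary, bare_finds_boundary)
# True False True
```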
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."

For sure @mrzasa found the problem.
But...
Taking a guess at your intent to replace a non-time colon with ':\n',
it could be done like this, I guess. It does a little whitespace trimming as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
\b [01] [0-9]
)
(?<!
\b [2] [0-3]
)
( # (1 start)
[^\S\r\n]*
:
) # (1 end)
[^\S\r\n]*
(?!
[0-5] [0-9]
(?: [ap] \.? m \b \.? )?
)
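A minimal sketch of how this pattern might be applied from Python with the \1\n replacement mentioned above; the sample string and the name non_time_colon are made up for illustration:

```python
import re

non_time_colon = re.compile(
    r"(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*"
    r"(?![0-5][0-9](?:[ap]\.?m\b\.?)?)"
)

sample = "Note: meet at 10:30 pm"

# Group 1 keeps the colon; the horizontal whitespace after it is dropped
# and replaced with a newline. The colon inside 10:30 is left alone.
result = non_time_colon.sub(r"\1\n", sample)
print(repr(result))
# 'Note:\nmeet at 10:30 pm'
```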

Stripping the last occurrence of text inside braces from a string

I would like to know how to strip the last occurrence of () and its contents given a string.
The below code strips all the () in a string.
import re
bracketedString = '*AWL* (GREATER) MINDS LIMITED (CLOSED)'
nonBracketedString = re.sub(r"\s\(.*?\)", '', bracketedString)
print(nonBracketedString)
I would like the following output.
*AWL* (GREATER) MINDS LIMITED

You may remove a (...) substring with a leading whitespace at the end of the string only:
\s*\([^()]*\)$
See the regex demo.
Details
\s* - 0+ whitespace chars
\( - a (
[^()]* - 0+ chars other than ( and )
\) - a )
$ - end of string.
See the Python demo:
import re
bracketedString = '*AWL* (GREATER) MINDS LIMITED (CLOSED)'
nonBracketedString = re.sub(r"\s*\([^()]*\)$", '', bracketedString)
print(nonBracketedString) # => *AWL* (GREATER) MINDS LIMITED
With the PyPI regex module you may also remove nested parentheses at the end of the string:
import regex
s = "*AWL* (GREATER) MINDS LIMITED (CLOSED(Jan))"
res = regex.sub(r'\s*(\((?>[^()]+|(?1))*\))$', '', s)
print(res) # => *AWL* (GREATER) MINDS LIMITED
See the Python demo.
Details
\s* - 0+ whitespaces
(\((?>[^()]+|(?1))*\)) - Group 1:
\( - a (
(?>[^()]+|(?1))* - zero or more repetitions of 1+ chars other than ( and ) or the whole Group 1 pattern
\) - a )
$ - end of string.
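If installing the third-party regex module is not an option, the nested case can also be handled without recursion support. This is a different technique from the answer's recursive pattern: a minimal sketch that scans backwards with a depth counter. The function name strip_trailing_parens is hypothetical, and unlike the regex it also tolerates trailing whitespace after the closing parenthesis.

```python
def strip_trailing_parens(s: str) -> str:
    """Remove one balanced (...) group, possibly nested, from the end of s."""
    t = s.rstrip()
    if not t.endswith(")"):
        return s  # nothing to strip
    depth = 0
    # Walk backwards until the parenthesis that opens the trailing group.
    for i in range(len(t) - 1, -1, -1):
        if t[i] == ")":
            depth += 1
        elif t[i] == "(":
            depth -= 1
            if depth == 0:
                return t[:i].rstrip()
    return s  # unbalanced parentheses: leave the input untouched


nested = strip_trailing_parens("*AWL* (GREATER) MINDS LIMITED (CLOSED(Jan))")
print(nested)
# *AWL* (GREATER) MINDS LIMITED
```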

In case you want to replace last occurrence of brackets even if they are not at the end of the string:
*AWL* (GREATER) MINDS LIMITED (CLOSED) END
you can use a tempered greedy token:
>>> re.sub(r"\([^)]*\)(((?!\().)*)$", r'\1', '*AWL* (GREATER) MINDS LIMITED (CLOSED) END')
# => '*AWL* (GREATER) MINDS LIMITED  END' (the spaces on both sides of the removed brackets remain)
Demo
Explanation:
\([^)]*\) matches string in brackets
(((?!\().)*)$ ensures that there is no other opening bracket before the end of the string
(?!\() is a negative lookahead checking that there is no ( following
. matches next char (that cannot be ( because of the negative lookahead)
(((?!\().)*)$ the whole sequence is repeated until the end of the string $ and kept in a capturing group
we replace the match with the first capturing group (\1) that keeps the match after the brackets

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
There are multiple test names preceded by a '-' hyphen, which I extract with the regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with the regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?

In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen followed by 0+ whitespaces, and then captures 1+ digits, letters or underscores. When a pattern contains a capturing group, re.findall returns only the captured texts, so you only get the Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
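Following the advice above to check for a match before reading Group 1, the {n} pattern can be wrapped in a small helper. This is a sketch only; the function name nth_after_hyphen is hypothetical.

```python
import re

def nth_after_hyphen(s, n):
    """Return the n-th (1-based) word that follows a hyphen, or None."""
    # Doubled braces in the f-string produce the literal {n} quantifier.
    match = re.search(rf"^(?:.*?-\s*(\w+)){{{n}}}", s)
    return match.group(1) if match else None


s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4"
print(nth_after_hyphen(s, 1))  # AUSVERSION
print(nth_after_hyphen(s, 3))  # TESTAGAIN
print(nth_after_hyphen(s, 5))  # None (only four hyphens in the string)
```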
