Stripping the last occurrence of text inside braces from a string - python

I would like to know how to strip the last occurrence of () and its contents given a string.
The below code strips all the () in a string.
bracketedString = '*AWL* (GREATER) MINDS LIMITED (CLOSED)'
nonBracketedString = re.sub(r"\s\(.*?\)", '', bracketedString)
print(nonBracketedString)
I would like the following output.
*AWL* (GREATER) MINDS LIMITED

To remove a (...) substring, together with any whitespace before it, only when it sits at the end of the string, use:
\s*\([^()]*\)$
See the regex demo.
Details
\s* - 0+ whitespace chars
\( - a (
[^()]* - 0+ chars other than ( and )
\) - a )
$ - end of string.
See the Python demo:
import re
bracketedString = '*AWL* (GREATER) MINDS LIMITED (CLOSED)'
nonBracketedString = re.sub(r"\s*\([^()]*\)$", '', bracketedString)
print(nonBracketedString) # => *AWL* (GREATER) MINDS LIMITED
With the PyPI regex module, you can also remove nested parentheses at the end of the string:
import regex
s = "*AWL* (GREATER) MINDS LIMITED (CLOSED(Jan))"
res = regex.sub(r'\s*(\((?>[^()]+|(?1))*\))$', '', s)
print(res) # => *AWL* (GREATER) MINDS LIMITED
See the Python demo.
Details
\s* - 0+ whitespaces
(\((?>[^()]+|(?1))*\)) - Group 1:
\( - a (
(?>[^()]+|(?1))* - zero or more repetitions of 1+ chars other than ( and ) or the whole Group 1 pattern
\) - a )
$ - end of string.
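If installing the third-party regex module is not an option, the same trailing nested group can be removed with a plain backwards scan. This is a minimal sketch of a different technique (a depth counter, not recursion in the pattern); the function name strip_trailing_parens is made up for illustration.

```python
def strip_trailing_parens(text: str) -> str:
    """Remove one balanced (...) group, and the whitespace before it, from the end of text."""
    if not text.endswith(")"):
        return text  # nothing to remove
    depth = 0
    # Walk backwards until the parenthesis that opens the trailing group is found.
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                return text[:i].rstrip()
    return text  # unbalanced parentheses: leave the input untouched


print(strip_trailing_parens("*AWL* (GREATER) MINDS LIMITED (CLOSED(Jan))"))
# => *AWL* (GREATER) MINDS LIMITED
```

Like the pattern above, it only acts when the closing parenthesis is the very last character of the string.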

If you want to remove the last bracketed group even when it is not at the end of the string, as in
*AWL* (GREATER) MINDS LIMITED (CLOSED) END
you can use a tempered greedy token:
>>> re.sub(r"\([^)]*\)(((?!\().)*)$", r'\1', '*AWL* (GREATER) MINDS LIMITED (CLOSED) END')
# => '*AWL* (GREATER) MINDS LIMITED  END' (two spaces: the one before the bracket is kept)
Demo
Explanation:
\([^)]*\) matches a string in brackets
(((?!\().)*)$ ensures that there is no other opening bracket before the end of the string
(?!\() is a negative lookahead checking that no ( follows
. matches the next char (which cannot be ( because of the negative lookahead)
(((?!\().)*)$ the whole sequence is repeated until the end of the string $ and kept in a capturing group
we replace the match with the first capturing group (\1), which holds the text after the brackets
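Because the pattern above does not consume the whitespace in front of the bracket, that space survives next to the one after it. A possible variant, assuming the leading whitespace should go as well, puts \s* in front; the names LAST_PARENS and strip_last_parens are hypothetical.

```python
import re

# Same tempered-greedy idea; the leading \s* is an added assumption so that
# the whitespace before the last bracketed group is removed with it.
LAST_PARENS = re.compile(r"\s*\([^)]*\)((?:(?!\().)*)$")


def strip_last_parens(text: str) -> str:
    # Group 1 holds everything after the last bracketed group.
    return LAST_PARENS.sub(r"\1", text)


print(strip_last_parens("*AWL* (GREATER) MINDS LIMITED (CLOSED) END"))
# => *AWL* (GREATER) MINDS LIMITED END
print(strip_last_parens("*AWL* (GREATER) MINDS LIMITED (CLOSED)"))
# => *AWL* (GREATER) MINDS LIMITED
```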

Related

regex pattern to match whole word or word followed by another

I'm starting to learn regex in order to match words in pandas columns and replace them with other values.
df['col1']=df['col1'].str.replace(r'(?i)unlimi+\w*', 'Unlimited', regex=True)
This pattern matches different variations of the word Unlimited. But some values in the column contain not just one word, but two or more:
ex:
[Unlimited, Unlimited (on-net), Unlimited (on-off-net)]
I was wondering if there is a way to match all of the words in the previous example with a single regex line.
You can use
df['col1']=df['col1'].str.replace(r'(?i)unlimi\w*(?:\s*\([^()]*\))?', 'Unlimited', regex=True)
See the regex demo.
The (?i)unlimi\w*(?:\s*\([^()]*\))? regex matches
(?i) - the regex to the right is case insensitive
unlimi - a fixed string
\w* - zero or more word chars
(?:\s*\([^()]*\))? - an optional sequence of
\s* - zero or more whitespaces
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char.
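To try the pattern without a DataFrame, here is a small sketch that uses plain re.sub on a list instead of pandas. The sample values, including the misspelled one, are invented for illustration, and the case-insensitive flag is passed as an argument rather than inline.

```python
import re

UNLIMITED = re.compile(r"unlimi\w*(?:\s*\([^()]*\))?", re.IGNORECASE)

# Hypothetical column values, including a misspelling and a different case.
values = ["Unlimited", "Unlimited (on-net)", "unlimitted (on-off-net)", "UNLIMITED"]
normalized = [UNLIMITED.sub("Unlimited", v) for v in values]
print(normalized)
# => ['Unlimited', 'Unlimited', 'Unlimited', 'Unlimited']
```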

Invalid pattern in look-behind

Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.
The Ruby regex engine doesn't allow capturing groups in negative lookbehinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.
According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them treat it as an error, hence the difference you see between the re and Onigmo regex libraries.
Now, as for your regex, it works correctly in neither Ruby nor Python: a \b inside a character class matches a BACKSPACE (\x08) char in both Python and Ruby regex, not a word boundary. Moreover, when you put a word boundary after an optional non-word char, then, if that char appears in the string, a word char must follow it immediately. The word boundary must be moved to right after m, before \.?.
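Both points can be checked quickly in Python. This is only a sketch; the sample sentence is made up.

```python
import re

# Inside a character class, \b is the backspace character, not a word boundary.
backspace_match = re.fullmatch(r"[\b]", "\x08")
print(backspace_match is not None)                # => True
print(re.search(r"[0-1\b][0-9]", " 9") is None)   # => True

sample = "at 5 a.m. today"
# Word boundary after the optional dot: the final dot cannot be matched,
# because there is no boundary between "." and the following space.
boundary_last = re.search(r"[ap]\.?m\.?\b", sample).group()
# Word boundary right after m, optional dot afterwards.
boundary_after_m = re.search(r"[ap]\.?m\b\.?", sample).group()
print(boundary_last)      # => a.m
print(boundary_after_m)   # => a.m.
```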
Another flaw with the current approach is that lookbehinds are not the best to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespaces between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."
For sure, @mrzasa found the problem.
But...
Taking a guess at your intent, which is to replace a non-time colon with ':\n',
it could be done like this, I guess. It does a little whitespace trimming as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
     \b [01] [0-9]
)
(?<!
     \b [2] [0-3]
)
(                             # (1 start)
     [^\S\r\n]*
     :
)                             # (1 end)
[^\S\r\n]*
(?!
     [0-5] [0-9]
     (?: [ap] \.? m \b \.? )?
)
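In Python the readable layout can be used directly with re.VERBOSE. The sketch below passes the case-insensitive flag as an argument instead of the inline (?i); the name NON_TIME_COLON and the sample sentence are assumptions made for illustration.

```python
import re

NON_TIME_COLON = re.compile(
    r"""
    (?<! \b [01] [0-9] )      # not preceded by an hour 00-19
    (?<! \b [2] [0-3] )       # not preceded by an hour 20-23
    ( [^\S\r\n]* : )          # group 1: optional horizontal whitespace, then the colon
    [^\S\r\n]*                # horizontal whitespace after the colon (dropped)
    (?! [0-5] [0-9] (?: [ap] \.? m \b \.? )? )   # not followed by minutes
    """,
    re.IGNORECASE | re.VERBOSE,
)

result = NON_TIME_COLON.sub(r"\1\n", "Note: meet at 10:30pm")
print(repr(result))
# => 'Note:\nmeet at 10:30pm'
```

The colon inside 10:30pm is left alone because the first lookbehind sees the hour 10 in front of it.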

Problem omitting optional word in python3 regex

I need a regex that captures 2 groups: a movie and the year. Optionally, there could be a 'from ' string between them.
My expected results are:
first_query="matrix 2013" => ('matrix', '2013')
second_query="matrix from 2013" => ('matrix', '2013')
third_query="matrix" => ('matrix', None)
I've done 2 simulations on https://regex101.com/ for python3:
I- r"(.+)(?:from ){0,1}([1-2]\d{3})"
Doesn't match first_query and third_query, also doesn't omit 'from' in group one, which is what I want to avoid.
II- r"(.+)(?:from ){1}([1-2]\d{3})"
Works with second_query, but does not match first_query and third_query.
Is it possible to match all three strings, omitting the 'from ' string from the first group?
Thanks in advance.
You may use
^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$
See the regex demo
Details
^ - start of a string
(.+?) - Group 1: any 1+ chars other than line break chars, as few as possible
(?:\s+(?:from\s+)?([12]\d{3}))? - an optional non-capturing group matching 1 or 0 occurrences of:
\s+ - 1+ whitespaces
(?:from\s+)? - an optional sequence of from substring followed with 1+ whitespaces
([12]\d{3}) - Group 2: 1 or 2 followed with 3 digits
$ - end of string.
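A short Python check of this pattern against the three queries from the question might look as follows; the helper name parse_query is hypothetical.

```python
import re

QUERY = re.compile(r"^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$")


def parse_query(query):
    """Return (title, year); year is None when the query has no year."""
    m = QUERY.match(query)
    return m.groups() if m else None


for q in ("matrix 2013", "matrix from 2013", "matrix"):
    print(q, "=>", parse_query(q))
# matrix 2013 => ('matrix', '2013')
# matrix from 2013 => ('matrix', '2013')
# matrix => ('matrix', None)
```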
This will output your patterns, but has one space too many in front of the number:
import re
pat = r"^(.+?)(?: from)? ?(\d+)?$"
text = """matrix 2013
matrix from 2013
matrix"""
for t in text.split("\n"):
    print(re.findall(pat,t))
Output:
[('matrix', '2013')]
[('matrix', '2013')]
[('matrix', '')]
Explanation:
^ start of string
(.+?) lazy anythings as few as possible
(?: from)? non-grouped optional ` from`
? optional space
(\d+)?$ optional digits till end of string
Demo: https://regex101.com/r/VD0SZb/1
import re
pattern = re.compile( r"""
^\s* # start of string (optional whitespace)
(?P<title>\S+) # one or more non-whitespace characters (title)
(?:\s+from)? # optionally, some space followed by the word 'from'
\s* # optional whitespace
(?P<year>[0-9]+)? # optional digit string (year)
\s*$ # end of string (optional whitespace)
""", re.VERBOSE )
for query in [ 'matrix 2013', 'matrix from 2013', 'matrix' ]:
    m = re.match( pattern, query )
    if m: print( m.groupdict() )
# Prints:
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': None}
Disclaimer: this regex does not contain the logic necessary to reject the first two matches on the grounds that The Matrix actually came out in 1999.

Starts with anything but not space and ends with extensions like (.png, .jpg, .mp4, .avi, .flv)

I need to get all files with a media extension (.png, .jpg, .mp4, .avi, .flv) into a list using regex. What I have tried is below:
import re
st = '''
/mnt/data/Content:
ManifestFile.txt kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4 tmp_content
default_55a655f340908dce55d10a191b6a0140 price-tags_b3c756dda783ad0691163a900fb5fe15
/mnt/data/Content/default_55a655f340908dce55d10a191b6a0140:
LayoutFile_34450b33c8b44af409abb057ddedfdfe.txt blank_decommissioned.jpeg tmp_content
ManifestFile.txt blank_unregistered.png
/mnt/data/Content/default_55a655f340908dce55d10a191b6a0140/tmp_content:
/mnt/data/Content/kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4:
0001111084948-kompass-LARGE.avi 0076738703404-kompass-LARGE.png LayoutFile_7c1b3793e49204982e0e41923303c17b.txt
0001111087321-kompass-LARGE.jpg 0076738703419-kompass-LARGE.mp4 ManifestFile.txt
0001111087325-kompass-LARGE.png 0076738703420-kompass-LARGE.png tmp_content
/mnt/data/Content/kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4/tmp_content:
/mnt/data/Content/price-tags_b3c756dda783ad0691163a900fb5fe15:
0001111084948-consumer-large.png 0076738703404-consumer-large.png LayoutFile_a694b1e05d08705aaf4dd589ac61d493.txt
0001111087321-consumer-large.png 0076738703419-consumer-large.avi ManifestFile.txt
0001111087325-consumer-large.mp4 0076738703420-consumer-large.png tmp_content
/mnt/data/Content/price-tags_b3c756dda783ad0691163a900fb5fe15/tmp_content:
/mnt/data/Content/tmp_content:
'''
patt = '^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv)'
patt = '^.*$.png'
fList = re.findall(patt, st)
print(fList)
I know very little about regex; please help.
The ^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv) pattern matches the start of a string, then any 0+ chars other than line break chars as many as possible and then the extensions with any single char before them (an unescaped . matches any char but a line break char). So, this can't work for you since . matches too much here and ^ only yields a match at the start of the string.
The ^.*$.png pattern only matches the start of the string, any 0+ chars other than line break chars then the end of string and any char + png - this is a pattern that will never match any string.
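The effect of the unescaped dot and of the misplaced $ can be seen on a tiny input. This is a sketch; the sample strings are invented.

```python
import re

sample = "xpng a.png"
unescaped = re.findall(r".png", sample)     # . matches any char, so "xpng" also matches
escaped = re.findall(r"\.png", sample)      # \. matches only a literal dot
never = re.findall(r"^.*$.png", "a.png")    # nothing can follow the end of the string

print(unescaped)  # => ['xpng', '.png']
print(escaped)    # => ['.png']
print(never)      # => []
```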
Judging by your description you need
patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'
See the regex demo.
Details
\S+ - 1+ non-whitespace chars
\. - a literal dot
(?:png|jpe?g|gif|bmp|mp4|avi|flv) - a non-capturing group (i.e. re.findall returns the whole match rather than just this group's text) matching any of the listed extensions
\b - a word boundary (actually, it is optional, but it will make sure you match an extension above as a whole word).
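To see what the trailing \b changes, compare the pattern with and without it on a name whose extension merely starts with png. This is a sketch with a shortened extension list and made-up file names.

```python
import re

names = "photo.png notes.pngx clip.mp4"

without_boundary = re.findall(r"\S+\.(?:png|mp4)", names)
with_boundary = re.findall(r"\S+\.(?:png|mp4)\b", names)

print(without_boundary)  # => ['photo.png', 'notes.png', 'clip.mp4']
print(with_boundary)     # => ['photo.png', 'clip.mp4']
```

Without the boundary, notes.pngx yields a truncated notes.png; with it, that name is skipped entirely.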
See the Python demo:
import re
st = '<YOUR_STRING_HERE>'
patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'
fList = re.findall(patt, st)
for s in fList:
    print(s)
yielding
blank_decommissioned.jpeg
blank_unregistered.png
0001111084948-kompass-LARGE.avi
0076738703404-kompass-LARGE.png
0001111087321-kompass-LARGE.jpg
0076738703419-kompass-LARGE.mp4
0001111087325-kompass-LARGE.png
0076738703420-kompass-LARGE.png
0001111084948-consumer-large.png
0076738703404-consumer-large.png
0001111087321-consumer-large.png
0076738703419-consumer-large.avi
0001111087325-consumer-large.mp4
0076738703420-consumer-large.png
You can use the regex \S+\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)
\S+ matches one or more non-whitespace chars
\. matches a dot
(?: ... ) is a non-capturing group
png|jpg|gif|bmp|jpeg|mp4|avi|flv matches your defined extensions
Demo.
Try this:
patt = r'[^ \n]+?\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)'
[^ \n] is a negated character class, allowing no spaces or newlines.
The dot (.) is a special character and needs to be escaped with a backslash.
Try it online here.

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
There are multiple test names preceded by a '-' hyphen which I derive from regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result by adding a negative lookahead, with the regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen followed by 0+ whitespaces, and then captures 1+ digits, letters or underscores. When a pattern contains a capturing group, re.findall returns only the captured texts, so you get just the Group 1 values captured with (\w+).
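The difference the capturing group makes to re.findall can be seen on a shortened input. This is a sketch; the string is a cut-down version of the one in the question.

```python
import re

s = "Hours-AUSVERSION.2013-TEST"

whole_matches = re.findall(r"-\s*\w+", s)     # no group: whole matches, hyphen included
group_values = re.findall(r"-\s*(\w+)", s)    # one group: only the captured text

print(whole_matches)  # => ['-AUSVERSION', '-TEST']
print(group_values)   # => ['AUSVERSION', 'TEST']
```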
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
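The {n} variant can be wrapped in a small helper that also checks for a failed match, as recommended above. This is a sketch; the function name nth_after_hyphen is hypothetical and n is assumed to be 1 or greater.

```python
import re


def nth_after_hyphen(text, n):
    """Return the n-th (1-based) word that follows a hyphen, or None if there is none."""
    # {{{n}}} in the f-string renders as the quantifier {n}.
    m = re.search(rf"^(?:.*?-\s*(\w+)){{{n}}}", text)
    return m.group(1) if m else None


s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(nth_after_hyphen(s, 2))  # => TEST
print(nth_after_hyphen(s, 4))  # => YIFY
print(nth_after_hyphen(s, 5))  # => None
```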
