Invalid pattern in look-behind - python

Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.

Ruby regex engine doesn't allow capturing groups in look behinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.

Acc. to Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although it is common among regex engines, not all of them count it as an error, hence you see the difference in the re and Onigmo regex libraries.
Now, as for your regex, it is not working correctly nor in Ruby nor in Python: the \b inside a character class in a Python and Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, if the char appears in the string a word char must appear immediately to the right of that non-word char. The word boundary must be moved to right after m before \.?.
Another flaw with the current approach is that lookbehinds are not the best to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespaces between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?):
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."

For sure #mrzasa found the problem out.
But ..
Taking a guess at your intent to replace a non-time colon with a ':\n`
it could be done like this I guess. Does a little whitespace trim as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
\b [01] [0-9]
)
(?<!
\b [2] [0-3]
)
( # (1 start)
[^\S\r\n]*
:
) # (1 end)
[^\S\r\n]*
(?!
[0-5] [0-9]
(?: [ap] \.? m \b \.? )?
)

Related

Find something between parentheses

I got a string like that:
LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)
I want to look only for OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0), but the OR could be LD as well. _080T_SAF_OUT could be different being always alphanumeric with bottom slash sometimes. COIL(xxSF[4].Flt[120].0), must be always in the format COIL(xxSF["digits"].Flt["digits"]."digits")
I am trying to use the re library of Python 2.7.
m = re.search('\OR|\LD'+'\('+'.+'+'\)'+'+'\COIL+'\('+'\xxSF+'\['+'\d+'+'\].'+ Flt\['+'\d+'+'\]'+'\.'+'\d+', Text)
My Output:
OR(abc_TEST_X)LD(xxSF[16].Flt[0].22
OR
LD(TEST_X_dsfa)OR(WASS_READY)COIL(xxSF[16].Flt[11].10
The first one is the right one which I am getting I want to discard the second one and the third one.
I think that the problem is here:
'\('+'.+'+'\)'
Because of I just want to find something alphanumeric and possibly with symbols between the first pair of paréntesis, and I am not filtering this situation.
You should group alternations like (?:LD|OR), and to match any chars other than ( and ) you may use [^()]* rather than .+ (.+ matches any chars, as many as possible, hence it matches across parentheses).
Here is a Python demo:
import re
Text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
m = re.search(r'(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+', Text)
if m:
print(m.group()) # => OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
Pattern details
(?:OR|LD) - a non-capturing group matching OR or LD
\( - a ( char
[^()]* - a negated character class matching 0+ chars other than ( and )
\)COIL\(xxSF\[ - )COIL(xxSF[ substring
\d+ - 1+ digits
]\.Flt\[ - ].Flt[ substring
\d+]\.\d+ - 1+ digits, ]. substring and 1+ digits
See the regex demo.
TIP Add a \b before (?:OR|LD) to match them as whole words (not as part of NOR and NLD).
Thanks, I am capturing everything which I want. Just something else to filter. Take a look to some Outputs:
OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I only want to capture the last one "LD" or "OR" as follow:
OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);

Stripping the last occurrence of text inside braces from a string

I would like to know how to strip the last occurrence of () and its contents given a string.
The below code strips all the () in a string.
bracketedString = '*AWL* (GREATER) MINDS LIMITED (CLOSED)'
nonBracketedString = re.sub("\s\(.*?\)", '', bracketedString)
print(nonBracketedString1)
I would like the following output.
*AWL* (GREATER) MINDS LIMITED
You may remove a (...) substring with a leading whitespace at the end of the string only:
\s*\([^()]*\)$
See the regex demo.
Details
\s* - 0+ whitespace chars
\( - a (
[^()]* - 0+ chars other than ( and )
\) - a )
$ - end of string.
See the Python demo:
import re
bracketedString = '*AWL* (GREATER) MINDS LIMITED (CLOSED)'
nonBracketedString = re.sub(r"\s*\([^()]*\)$", '', bracketedString)
print(nonBracketedString) # => *AWL* (GREATER) MINDS LIMITED
With PyPi regex module you may also remove nested parentheses at the end of the string:
import regex
s = "*AWL* (GREATER) MINDS LIMITED (CLOSED(Jan))" # => *AWL* (GREATER) MINDS LIMITED
res = regex.sub(r'\s*(\((?>[^()]+|(?1))*\))$', '', s)
print(res)
See the Python demo.
Details
\s* - 0+ whitespaces
(\((?>[^()]+|(?1))*\)) - Group 1:
\( - a (
(?>[^()]+|(?1))* - zero or more repetitions of 1+ chars other than ( and ) or the whole Group 1 pattern
\) - a )
$ - end of string.
In case you want to replace last occurrence of brackets even if they are not at the end of the string:
*AWL* (GREATER) MINDS LIMITED (CLOSED) END
you can use tempered greedy token:
>>> re.sub(r"\([^)]*\)(((?!\().)*)$", r'\1', '*AWL* (GREATER) MINDS LIMITED (CLOSED) END')
# => '*AWL* (GREATER) MINDS LIMITED END'
Demo
Explanation:
\([^)]*\) matches string in brackets
(((?!\().)*)$ assures that there are no other opening bracket until the end of the string
(?!\() is negative lookeahead checking that there is no ( following
. matches next char (that cannot be ( because of the negative lookahead)
(((?!\().)*)$ the whole sequence is repeated until the end of the string $ and kept in a capturing group
we replace the match with the first capturing group (\1) that keeps the match after the brackets

extract string using regular expression

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release) #fix_release is None
I want to extract the string 'Ubuntu 16.04'
But, result is None.... How can I extract the correct sentence?
You confused the word boundary \b with white space, the former matches the boundary between a word character and a non word character and consumes zero character, you can simply use r'Ubuntu \d+\.\d+' for your case:
fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Click for Demo
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: In your regex, you are using \b for the spaces. \b returns 0 length matches between a word-character and a non-word character. You can use \s instead

Regex: exception to negative character class

Using Python with Matthew Barnett's regex module.
I have this string:
The well known *H*rry P*tter*.
I'm using this regex to process the asterisks to obtain <em>H*rry P*tter</em>:
REG = re.compile(r"""
(?<!\p{L}|\p{N}|\\)
\*
([^\*]*?) # I need this part to deal with nested patterns; I really can't omit it
\*
(?!\p{L}|\p{N})
""", re.VERBOSE)
PROBLEM
The problem is that this regex doesn't match this kind of strings unless I protect intraword asterisks first (I convert them to decimal entities), which is awfully expensive in documents with lots of asterisks.
QUESTION
Is it possible to tell the negative class to block at internal asterisks only if they are not surrounded by word characters?
I tried these patterns in vain:
([^(?:[^\p{L}|\p{N}]\*[^\p{L}|\p{N}])]*?)
([^(?<!\p{L}\p{N})\*(?!\p{L}\p{N})]*?)
I suggest a single regex replacement for the cases like you mentioned above:
re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'<em>\1</em>', s)
See the regex demo
Details:
\B\*\b - a * that is preceded with a non-word boundary and followed with a word boundary
([^*]*(?:\b\*\b[^*]*)*) - Group 1 capturing:
[^*]* - 0+ chars other than *
(?:\b\*\b[^*]*)* - zero or more sequences of:
\b\*\b - a * enclosed with word boundaries
[^*]* - 0+ chars other than *
\b\*\B - a * that is followed with a non-word boundary and preceded with a word boundary
More information on word boundaries and non-word boundaries:
Word boundaries at regular-expressions.info
Difference between \b and \B in regex
What are non-word boundary in regex (\B), compared to word-boundary?

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
\(?<=-)\w+\g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex \(?!.*-)(?<=-)\w+\g
Result:
YIFI (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern search for a hyphen, followed with 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).

Categories