Regex 'sre_constants.error: bad character range' in large regex pattern - python

The following is the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
This is my object:
>>> re101121=re.compile("""(?i)激[ _]{0,}活[ _]{0,}邮[ _]{0,}箱|(click|clicking)[ _]{1,}[here ]{0,1}to[ _]{1,}verify|stop[ _]{1,}mail[ _]{1,}.{1,16}[ _]{1,}here|(click|clicking|view|update)([ _-]{1,}|\\xc2\\xa0)(on|here|Validate)[^a-z0-9]{1}|(點|点)[ _]{0,}(擊|击)[ _]{0,}(這|这|以)[ _]{0,}(裡|里|下)|DHL[ _]{1,}international|DHL[ _]{1,}Customer[ _]{1,}Service|Online[ _]{1,}Banking|更[ _]{0,}新[ _]{0,}您[ _]{0,}的[ _]{0,}(帐|账)[ _]{0,}户|CONFIRM[ _]{1,}ACCOUNT[ _]{1,}NOW|avoid[ _]{1,}Account[ _]{1,}malfunction|confirm[ _]{1,}this[ _]{1,}request|verify your account IP|Continue to Account security|继[\\s-_]*续[\\s-_]*使[\\s-_]*用|崩[\\s-_]*溃[\\s-_]*信[\\s-_]*息|shipment[\\s]+confirmation|will be shutdown in [0-9]{0,} (hours|days)|DHL Account|保[ ]{0,}留[ ]{0,}密[ ]{0,}码|(Password|password|PASSWORD).*(expired|expiring)|login.*email.*password.*confirm|[0-9]{0,} messages were quarantined|由于.*错误(的)?(送货)?信息|confirm.*(same)? password|keep.*account secure|settings below|loss.*(email|messages)|simply login|quick verification now""")

After minimization, your error boils down to re.compile("""[\\s-_]"""). This is indeed a bad character range: \s is a character class rather than a single character, so it cannot be the start of a range such as \s-_. You probably meant the dash to be literal: re.compile(r"[\s\-_]") (always use raw strings for regexes, r"..."). Moving the dash to the end of the bracket group works too: r"[\s_-]".
In the future, try to binary search to find the minimal failing input: remove the right half of the regex. If it still fails, the problem must have been in the left half. Remove the right half of the remaining substring and repeat until you're down to a minimal failing case. This technique doesn't always work when the problem spans both halves, but it can't hurt to try.
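As a complement to manual bisection, and assuming Python 3.5 or newer (where re.error carries msg and pos attributes), the exception itself can point at the failing spot. This is a minimal sketch, not the asker's code; locate_regex_error and the shortened pattern are hypothetical names and data used for illustration only.

```python
import re

def locate_regex_error(pattern, context=10):
    """Return (message, position, nearby text) for a bad pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        # exc.pos is the index in the pattern where the parser reported the problem
        # (it may be None for some errors, so fall back to 0).
        pos = exc.pos if exc.pos is not None else 0
        start = max(0, pos - context)
        return exc.msg, pos, pattern[start:pos + context]
    return None

# Hypothetical shortened pattern containing the same mistake as the question.
print(locate_regex_error(r"shipment[\s]+confirmation|继[\s-_]*续"))
```

The returned snippet is a slice of the pattern around the reported position, which is usually enough to spot the offending bracket group without halving the pattern by hand.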
As mentioned in the comments, it's pretty odd to have such a massive regex as this, but I'll assume you know what you're doing.
As another aside, there are some antipatterns in this regex (pardon the pun) like {0,} which can be simplified to *.
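The points above can be checked quickly in an interpreter. The following is a small sketch under the assumption of Python 3; compiles is a hypothetical helper, and the sample string is made up for illustration.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"[\s-_]"))   # the failing class from the question
print(compiles(r"[\s\-_]"))  # dash escaped
print(compiles(r"[\s_-]"))   # dash moved to the end of the class

# {1,} and {0,} are equivalent to + and * respectively.
verbose = re.compile(r"DHL[ _]{1,}Customer[ _]{0,}Service")
concise = re.compile(r"DHL[ _]+Customer[ _]*Service")
sample = "DHL  Customer_Service"
print(bool(verbose.search(sample)), bool(concise.search(sample)))
```

In the same way, {0,1} can be written as ?, which keeps a long alternation like the one in the question easier to read.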

How to fix bad escape regex error (python re)

I've been messing around with re.sub() to see how I would change the format from Y-m-d to M/d/y. To perform the test, I defined the starting variable: current_date = "2012-05-26"
I want to convert that date to 05/26/2012.
I tried to achieve this without using DateTime but with regex. I used re.sub as below:
formatted_date = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}", current_date)
The first regex is to match the original format of Y-M-D and the second Regex is to try to convert it to the format that I want it to be. I got the following error:
Traceback (most recent call last):
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\sre_parse.py", line 1039, in parse_template
this = chr(ESCAPES[this][1])
KeyError: '\\d'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\ghub4\OneDrive\Desktop\test_sub.py", line 5, in <module>
formatted_date = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}", current_date)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\re.py", line 210, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\re.py", line 327, in _subx
template = _compile_repl(template, pattern)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\re.py", line 318, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\sre_parse.py", line 1042, in parse_template
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 9
Full Code:
import re
current_date = "2012-05-26"
formatted_date = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}", current_date)
print(formatted_date)
I've traced the error to, potentially, the second regex, but I'm unsure where position 9 is and how to fix the error. Another reason why I'm not sure how to fix it is the first error, which reports a KeyError raised by \\d. I'm sure that when the regex is interpreted somewhere in the code, \d is being taken as \\d instead, and I'm also not sure how to prevent that. I'm also fairly sure that the second regex may backfire on me, and I will be working on a solution to that after this question is posted. How would I be able to correct these errors?
The replacement string for a regex is not a regex in itself, rather it is a string which may contain references to groups captured by the original regex. In your case, you want to capture the year, month and day and then output them in the result string. You do that with () around the values you want to capture, and then refer to the groups by \1, \2, and \3 in the replacement string, with the numbers being assigned in order of the groups being captured. So for your code, you want:
formatted_date = re.sub(r"(\d{2,4})-(\d{1,2})-(\d{1,2})", r"\2/\3/\1", current_date)
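Try grouping your digits. (If your goal is to understand the error: position 9 is the first \d in your second string, the replacement. \d is not a valid escape in a replacement template, which only accepts group references and ordinary string escapes.)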
formatted_date = re.sub(r"(\d{2,4})-(\d{1,2})-(\d{1,2})",r"\2/\3/\1",current_date)
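A variant of the same fix uses named groups, which can be easier to read than \1, \2, \3 when the order changes. This is a minimal sketch; the group names year, month and day and the helper to_us_format are illustrative choices, not part of the original code, and the year is assumed to have four digits.

```python
import re

# Named groups capture each part of the Y-M-D date.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})")

def to_us_format(text):
    """Rewrite every Y-M-D date in text as M/D/Y."""
    # \g<name> in the replacement refers back to the named group.
    return DATE.sub(r"\g<month>/\g<day>/\g<year>", text)

print(to_us_format("2012-05-26"))  # 05/26/2012
```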

pyPEG2 parsing of newlines

I'm trying to use pyPEG2 to translate MoinMoin markup to Markdown, and I need to pay attention to newlines in certain cases. However, I can't even get my newline parsing tests to work. I'm new to pyPEG and my Python is rusty. Please bear with me.
Here's the code:
#!/usr/local/bin/python3
from pypeg2 import *
import re
class Newline(List):
    grammar = re.compile(r'\n')
parse("\n", Newline)
parse("""
""", Newline)
This results in:
Traceback (most recent call last):
File "./pyPegNewlineTest.py", line 7, in <module>
parse("\n", Newline)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pypeg2/__init__.py", line 667, in parse
t, r = parser.parse(text, thing)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pypeg2/__init__.py", line 794, in parse
raise r
File "<string>", line 2
^
SyntaxError: expecting match on \n
It's as if pypeg is inserting an empty line after the \n.
Trying other options such as
grammar = re.compile(r'\n', re.MULTILINE)
grammar = re.compile(r'\r\n|\r|\n', re.MULTILINE)
grammar = contiguous(re.compile(r'\r\n|\r|\n', re.MULTILINE))
and various combinations of those don't change the error message (although I don't think I tried all combinations). Changing Newline to subclass str instead of List doesn't change the error either.
Update
I have figured out that pypeg is stripping the newline before parsing it:
#!/usr/local/bin/python3
from pypeg2 import *
import re
class Newline(str):
    grammar = contiguous(re.compile(r'a'))
parse("\na", Newline)
parse("""
a""", Newline)
print("Success, of a sort.")
Running this results in:
Success, of a sort.
If I override the Newline's parse method I don't even see the newline. The first thing it gets is the "a". This is consistent with what I'm seeing elsewhere. pypeg strips all leading whitespace, even when you specify contiguous.
So, that's what's happening. Not sure what to do about it.
Yes, by default pypeg removes whitespace, including newlines.
This is easily configurable by setting the optional whitespace argument of the parse() function, e.g.:
parse("\na", Newline, whitespace=re.compile(r"[ \t\r]"))
With this setting, spaces, tabs and carriage returns are still skipped, but newlines (\n) are not.
With this example the parser now correctly finds the syntax error:
SyntaxError: expecting match on a
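To see why the choice of whitespace pattern matters, here is a toy model using only the standard re module. It is not pyPEG2's actual implementation; skip_leading and both pattern names are hypothetical, and the "skip everything" pattern merely stands in for a default that treats newlines as whitespace.

```python
import re

# Assumption: stands in for a default that skips all whitespace, newlines included.
SKIP_ALL = re.compile(r"\s+")
# Skips spaces, tabs and carriage returns, but leaves \n in the input.
KEEP_NEWLINES = re.compile(r"[ \t\r]+")

def skip_leading(text, whitespace):
    """Drop whatever the whitespace pattern matches at the start of text."""
    m = whitespace.match(text)
    return text[m.end():] if m else text

print(repr(skip_leading(" \t\na", SKIP_ALL)))       # 'a'
print(repr(skip_leading(" \t\na", KEEP_NEWLINES)))  # '\na'
```

With the second pattern the newline survives the skipping step, so a grammar rule that expects \n still has something to match.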

Having some issues with re.sub

In my program I'm parsing Japanese definitions, and I need to take a few things out. There are three kinds of delimiters whose contents I need to remove: 「text」, (text) and 《text》.
To take out things between 「」 I've been doing
sentence = re.sub('「[^)]*」','', sentence)
The problem with this is that, for some reason, if there are parentheses within 「」 it will not replace anything. Also, I've tried using the same code for the other two, like
sentence = re.sub('([^)]*)','', sentence)
sentence = re.sub('《[^)]*》','', sentence)
but it doesn't work for some reason. There isn't an error or anything, it just doesn't replace anything.
How can I make this work, or is there some better way of doing this?
EDIT:
I'm having a slight problem with another part of this though. Before I replace anything I check the length to make sure it's over a certain length.
parse = re.findall(r'「[^」]*」','', match.text)
if len(str(parse)) > 8:
    sentence = re.sub(r'「[^」]*」','', match.text)
This seems to be causing an error now:
Traceback (most recent call last):
File "C:/Users/Dominic/PycharmProjects/untitled9/main.py", line 48, in <module>
parse = re.findall(r'「[^」]*」','', match.text)
File "C:\Python34\lib\re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
File "C:\Python34\lib\re.py", line 275, in _compile
bypass_cache = flags & DEBUG
TypeError: unsupported operand type(s) for &: 'str' and 'int'
I sort of understand what's causing this, but I don't understand why it's not working just from that slight change. I know the re.sub part is fine; it's just the first two lines that are causing the problems.
You should read a tutorial on regular expressions so you understand what your regexps do.
The regexp '「[^)]*」' matches text between the corner brackets only if it contains no closing parenthesis. You need this:
sentence = re.sub(r'「[^」]*」','', sentence)
The second regexp has an additional problem: Parentheses have a special meaning (when they are not inside square brackets), so to match parentheses you need to write \( and \). So you need this:
'\([^)]*\)'
Finally: You should always use raw strings for your python regexps. It doesn't happen to make a difference in this case, but it very often does, and the bugs are maddening to spot. E.g., use:
r'\([^)]*\)'
sentence = re.sub(ur'「[^」]*」','', sentence)
You need to change the negated character class so that it stops at 」 instead of ).
Since you are dealing with non-ASCII text, use a unicode pattern (the ur prefix above is Python 2 syntax; in Python 3 a plain r'...' string is already unicode). If there is a ) between the brackets, your original pattern fails because you used
「[^)]*」
which instructs the regex to stop when it finds ).
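Putting the answers together, a minimal Python 3 sketch might look like the following. The helper names strip_bracketed and quoted_spans and the sample sentence are made up for illustration. The second helper also relates to the TypeError in the question's edit: re.findall takes (pattern, string, flags), so passing '' as an extra second argument pushes the text into the flags slot.

```python
import re

# One alternative per delimiter pair; each negated class stops at its own closer.
BRACKETED = re.compile(r"「[^」]*」|\([^)]*\)|《[^》]*》")
QUOTED = re.compile(r"「[^」]*」")

def strip_bracketed(sentence):
    """Remove 「...」, (...) and 《...》 spans, delimiters included."""
    return BRACKETED.sub("", sentence)

def quoted_spans(text):
    """Return every 「...」 span; a compiled pattern's findall takes just the string."""
    return QUOTED.findall(text)

print(strip_bracketed("犬「いぬ」(dog)《名詞》です"))
print(quoted_spans("「一」と「二」"))
```

Parentheses nested inside 「」 are handled because the 「[^」]*」 alternative only looks for the closing corner bracket.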

Python re "bogus escape error"

I've been messing around with the Python re module's .search method. cur is the input from a Tkinter entry widget. Whenever I enter a "\" into the entry widget, it throws this error. I'm not too sure what the error is or how to deal with it. Any insight would be much appreciated.
cur is a string
tup[0] is also a string
Snippet:
se = re.search(cur, tup[0], flags=re.IGNORECASE)
The error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python26\Lib\Tkinter.py", line 1410, in __call__
return self.func(*args)
File "C:\Python26\Suite\quidgets7.py", line 2874, in quick_links_results
self.quick_links_results_s()
File "C:\Python26\Suite\quidgets7.py", line 2893, in quick_links_results_s
se = re.search(cur, tup[0], flags=re.IGNORECASE)
File "C:\Python26\Lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "C:\Python26\Lib\re.py", line 245, in _compile
raise error, v # invalid expression
error: bogus escape (end of line)
"bogus escape (end of line)" means that your pattern ends with a backslash. This has nothing to do with Tkinter. You can duplicate the error pretty easily in an interactive shell:
>>> import re
>>> pattern="foobar\\"
>>> re.search(pattern, "foobar")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 241, in _compile
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
The solution? Make sure your pattern doesn't end with a single backslash.
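If the text typed into the widget is meant to be searched for literally rather than interpreted as a regex, one option is to pass it through re.escape first. This is a minimal sketch; find_literal is a hypothetical helper, and whether literal matching is what the application wants is an assumption.

```python
import re

def find_literal(needle, haystack):
    """Case-insensitive search for needle as plain text, not as a regex."""
    # re.escape backslash-escapes regex metacharacters, including a trailing backslash.
    return re.search(re.escape(needle), haystack, flags=re.IGNORECASE)

print(find_literal("\\", "a\\b") is not None)
print(find_literal("C:\\", "c:\\temp").group(0))
```

With this approach a lone backslash in the input is matched as a backslash instead of being parsed as the start of an escape sequence.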
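The same error also appears when the replacement text ends with a single backslash. The fix is to double the backslash, which is easiest to write with a raw string as the replacement text. The following won't work: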
re.sub('this', 'This \\', 'this is a text')
It will throw the error: bogus escape (end of line)
But the following will work just fine:
re.sub('this', r'This \\', 'this is a text')
Now, the question is how to do the same for a string generated at runtime, where you cannot type an r prefix. I prefer a simple method (Python 2 only, since it relies on the string-escape codec and the unicode type):
def raw_string(s):
    if isinstance(s, str):
        s = s.encode('string-escape')
    elif isinstance(s, unicode):
        s = s.encode('unicode-escape')
    return s
The above method handles only byte strings and unicode strings. Well, this has been working great for me to date :)
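On Python 3, where the string-escape codec is not available, two alternatives are a callable replacement (its return value is inserted as-is, with no escape processing) or doubling the backslashes before calling re.sub. The following is a sketch with hypothetical helper names, not the answerer's method.

```python
import re

def sub_literal(pattern, replacement, text):
    """Substitute replacement verbatim; a callable's result is not parsed for escapes."""
    return re.sub(pattern, lambda match: replacement, text)

def escape_replacement(replacement):
    """Double every backslash so re.sub treats each one as a literal backslash."""
    return replacement.replace("\\", "\\\\")

print(sub_literal("this", "This \\", "this is a text"))
print(re.sub("this", escape_replacement("This \\"), "this is a text"))
```

Both calls produce the same text, with a single backslash after "This".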
If you are trying to search for cur in tup[0], you should wrap the call in a try/except block to catch an invalid pattern:
try:
    se = re.search(cur, tup[0], flags=re.IGNORECASE)
except re.error as e:
    # print to stdout or any status widget in your gui
    print("Your search pattern is not valid.")
    # Some details for error:
    print(e)
    # Or some other code for default action.
The first parameter to re.search is the pattern to search for, so if cur ends with a backslash, the pattern ends in an invalid escape sequence. You've probably swapped your arguments around (I don't know what tup[0] is, but is it your pattern?), and it should be like this:
se = re.search(tup[0], cur, flags=re.IGNORECASE)
After all, you very rarely use user input as a pattern (unless you're building a regular expression search feature, in which case you might want to show the error instead).
HTH.
EDIT:
The error it is reporting is that there is an escape character right before the end of the pattern (which is what bogus escape (end of line) means); that is, your pattern ends with a backslash, which is not a valid pattern. The escape character (backslash) must be followed by another character, and it removes or adds special meaning to that character (I'm not sure exactly how Python does it; POSIX makes groups by adding an escape to parentheses, Perl removes the group effect by escaping them). That is, \* matches a literal asterisk, whereas * matches the preceding character 0 or more times.

Python's Regular Expression Source String Length

In Python Regular Expressions,
re.compile("x"*50000)
gives me OverflowError: regular expression code size limit exceeded
but the following one does not raise any error; it hits 100% CPU and took 1 minute on my PC
>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>
Is that normal?
Should I assume, ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?
Tested on Python 2.6, Win32
UPDATE 1:
It looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduced to .*?
So, how about this one?
re.compile(".*?x"*50000)
It does compile, and if that one can also be reduced to ".*?x", it should match the string "abcx" or "x" alone, but it does not match.
So, am I missing something?
UPDATE 2:
My point is not to find the maximum length of a regex source string; I'd like to know the reason why "x"*50000 is caught by the overflow check but ".*?x"*50000 is not.
It does not make sense to me, that's why.
Is something missing in the overflow check, is it working as intended, or is something really overflowing?
Any hints or opinions will be appreciated.
The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).
EDIT: Ok, I was wrong. It's not that smart. The reason why "x"*50000 fails, but ".*?x"*50000 doesn't is that there is a limit on size of one "code item". "x"*50000 will generate one long item and ".*?x"*50000 will generate many small items. If you could split the string literal somehow without changing the meaning of the regex, it would work, but I can't think of a way to do that.
You want to match 50000 "x"s, correct? If so, here is an alternative without regex:
if "x"*50000 in mystring:
    print("found")
If you want to match 50000 "x"s using a regex, you can use a repetition count:
>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>
On my system the repetition count can be at most 65535:
>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/re.py", line 188, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.6/re.py", line 241, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>
I don't know if there are tweaks in Python we can use to increase that limit though.
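Since these limits depend on the Python version and build, it may be easier to probe them on your own interpreter than to rely on the numbers above. This is a rough sketch; try_compile is a hypothetical helper, and no particular output is implied for any given version.

```python
import re

def try_compile(pattern):
    """Return None if the pattern compiles, else a short description of the failure."""
    try:
        re.compile(pattern)
    except (re.error, OverflowError, RuntimeError, MemoryError) as exc:
        return "%s: %s" % (type(exc).__name__, exc)
    return None

# Probe a few of the patterns discussed above; results vary by interpreter.
for label, pattern in [
    ('"x"*50000', "x" * 50000),
    ("x{65535}", "x{65535}"),
    ("x{65536}", "x{65536}"),
]:
    print(label, "->", try_compile(pattern) or "compiled")
```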
