Python's Regular Expression Source String Length

Python's Regular Expression Source String Length - python

In Python Regular Expressions,
re.compile("x"*50000)
gives me OverflowError: regular expression code size limit exceeded
but following one does not get any error, but it hits 100% CPU, and took 1 minute in my PC
>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>
Is that normal?
Should I assume, ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?
Tested on Python 2.6, Win32
UPDATE 1:
It Looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduce to .*?
So, how about this one?
re.compile(".*?x"*50000)
It does compile, and if that one also can reduce to ".*?x", it should match to string "abcx" or "x" alone, but it does not match.
So, Am I missing something?
UPDATE 2:
My Point is not to know max limit of regex source strings, I like to know some reasons/concepts of "x"*50000 caught by overflow handler, but not on ".*?x"*50000.
It does not make sense for me, thats why.
It is something missing on overflow checking or Its just fine or its really overflowing something?
Any Hints/Opinions will be appreciated.

The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).
EDIT: Ok, I was wrong. It's not that smart. The reason why "x"*50000 fails, but ".*?x"*50000 doesn't is that there is a limit on size of one "code item". "x"*50000 will generate one long item and ".*?x"*50000 will generate many small items. If you could split the string literal somehow without changing the meaning of the regex, it would work, but I can't think of a way to do that.

you want to match 50000 "x"s , correct??? if so, an alternative without regex
if "x"*50000 in mystring:
print "found"
if you want to match 50000 "x"s using regex, you can use range
>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>
on my system it will take in length of 65535 max
>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/re.py", line 188, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.6/re.py", line 241, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>
I don't know if there are tweaks in Python we can use to increase that limit though.

Related

Regex 'sre_constants.error: bad character range' in large regex pattern

The following is the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
This is my object:
>>> re101121=re.compile("""(?i)激[ _]{0,}活[ _]{0,}邮[ _]{0,}箱|(click|clicking)[ _]{1,}[here ]{0,1}to[ _]{1,}verify|stop[ _]{1,}mail[ _]{1,}.{1,16}[ _]{1,}here|(click|clicking|view|update)([ _-]{1,}|\\xc2\\xa0)(on|here|Validate)[^a-z0-9]{1}|(點|点)[ _]{0,}(擊|击)[ _]{0,}(這|这|以)[ _]{0,}(裡|里|下)|DHL[ _]{1,}international|DHL[ _]{1,}Customer[ _]{1,}Service|Online[ _]{1,}Banking|更[ _]{0,}新[ _]{0,}您[ _]{0,}的[ _]{0,}(帐|账)[ _]{0,}户|CONFIRM[ _]{1,}ACCOUNT[ _]{1,}NOW|avoid[ _]{1,}Account[ _]{1,}malfunction|confirm[ _]{1,}this[ _]{1,}request|verify your account IP|Continue to Account security|继[\\s-_]*续[\\s-_]*使[\\s-_]*用|崩[\\s-_]*溃[\\s-_]*信[\\s-_]*息|shipment[\\s]+confirmation|will be shutdown in [0-9]{0,} (hours|days)|DHL Account|保[ ]{0,}留[ ]{0,}密[ ]{0,}码|(Password|password|PASSWORD).*(expired|expiring)|login.*email.*password.*confirm|[0-9]{0,} messages were quarantined|由于.*错误(的)?(送货)?信息|confirm.*(same)? password|keep.*account secure|settings below|loss.*(email|messages)|simply login|quick verification now""")

After minimization, your error boils down to re.compile("""[\\s-_]"""). This is a bad character range indeed; you probably meant the dash to be literal re.compile(r"[\s\-_]") (always use raw strings for regex r"..."). Moving the dash to the end of the bracket group works too: r"[\s_-]".
In the future, try to binary search to find the minimal failing input: remove the right half of the regex. If it still fails, the problem must have been in the left half. Remove the right half of the remaining substring and repeat until you're down to a minimal failing case. This technique doesn't always work when the problem spans both halves, but it can't hurt to try.
As mentioned in the comments, it's pretty odd to have such a massive regex as this, but I'll assume you know what you're doing.
As another aside, there are some antipatterns in this regex (pardon the pun) like {0,} which can be simplified to *.

How to tell if a single line of python is syntactically valid?

It is very similar to this:
How to tell if a string contains valid Python code
The only difference being instead of the entire program being given altogether, I am interested in a single line of code at a time.
Formally, we say a line of python is "syntactically valid" if there exists any syntactically valid python program that uses that particular line.
For instance, I would like to identify these as syntactically valid lines:
for i in range(10):
x = 1
Because one can use these lines in some syntactically valid python programs.
I would like to identify these lines as syntactically invalid lines:
for j in range(10 in range(10(
x =++-+ 1+-
Because no syntactically correct python programs could ever use these lines
The check does not need to be too strict, it just need to be good enough to filter out obviously bogus statements (like the ones shown above). The line is given as a string, of course.

This uses codeop.compile_command to attempt to compile the code. This is the same logic that the code module does to determine whether to ask for another line or immediately fail with a syntax error.
import codeop
def is_valid_code(line):
try:
codeop.compile_command(line)
except SyntaxError:
return False
else:
return True
It can be used as follows:
>>> is_valid_code('for i in range(10):')
True
>>> is_valid_code('')
True
>>> is_valid_code('x = 1')
True
>>> is_valid_code('for j in range(10 in range(10(')
True
>>> is_valid_code('x = ++-+ 1+-')
False
I'm sure at this point, you're saying "what gives? for j in range(10 in range(10( was supposed to be invalid!" The problem with this line is that 10() is technically syntactically valid, at least according to the Python interpreter. In the REPL, you get this:
>>> 10()
Traceback (most recent call last):
File "<pyshell#22>", line 1, in <module>
10()
TypeError: 'int' object is not callable
Notice how this is a TypeError, not a SyntaxError. ast.parse says it is valid as well, and just treats it as a call with the function being an ast.Num.
These kinds of things can't easily be caught until they actually run.
If some kind of monster managed to modify the value of the cached 10 value (which would technically be possible), you might be able to do 10(). It's still allowed by the syntax.
What about the unbalanced parentheses? This fits the same bill as for i in range(10):. This line is invalid on its own, but may be the first line in a multi-line expression. For example, see the following:
>>> is_valid_code('if x ==')
False
>>> is_valid_code('if (x ==')
True
The second line is True because the expression could continue like this:
if (x ==
3):
print('x is 3!')
and the expression would be complete. In fact, codeop.compile_command distinguishes between these different situations by returning a code object if it's a valid self-contained line, None if the line is expected to continue for a full expression, and throwing a SyntaxError on an invalid line.
However, you can also get into a much more complicated problem than initially stated. For example, consider the line ). If it's the start of the module, or the previous line is {, then it's invalid. However, if the previous line is (1,2,, it's completely valid.
The solution given here will work if you only work forward, and append previous lines as context, which is what the code module does for an interactive session. Creating something that can always accurately identify whether a single line could possibly exist in a Python file without considering surrounding lines is going to be extremely difficult, as the Python grammar interacts with newlines in non-trivial ways. This answer responds with whether a given line could be at the beginning of a module and continue on to the next line without failing.
It would be better to identify what the purpose of recognizing single lines is and solve that problem in a different way than trying to solve this for every case.

I am just suggesting, not sure if going to work... But maybe something with exec and try-except?
code_line += "\n" + ("\t" if code_line[-1] == ":" else "") + "pass"
try:
exec code_line
except SyntaxError:
print "Oops! Wrong syntax..."
except:
print "Syntax all right"
else:
print "Syntax all right"
Simple lines should cause an appropriate answer

How to make the re.search() try a best attempt approach

This is the text file sb.txt
JOHN:ENGINEER:35?:
Now this the piece of code that tries to perform a regex search on the above line.
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:',line)
Now I get a proper output for biodata1.group(1), biodata1.group(2) and biodata1.group(3).
If however, I modify the file by removing ":" from the end
JOHN:ENGINEER:35?
and run the script again, I get the following error which makes sense since group(3) didn't match successfully
Traceback (most recent call last):
File "dictionary.py", line 26, in <module>
print('re.search(r([\w\W])+?:([\w\W])+?:([\w\W])+? '+biodata1.group(1)+' '+biodata1.group(2)+' '+biodata1.group(3)) # STMT1
AttributeError: 'NoneType' object has no attribute 'group'
But group(1) and group(2) should've still matched "N" "R" respectively. Is there anyway to avoid this error and attempt a best attempt approach to regex so it doesn't fail and at least prints biodata1.group(1) & biodata1.group(2).
I tried to edit the output statment by not having it print biodata1.group(3) though that didn't work

I think you misunderstand what has happened. Your entire regular expression has failed to match and therefore there is no match object.
Where it says AttributeError: 'NoneType' object has no attribute 'group' it's trying to tell you that biodata1 is None. None is the return you get from re.search when it fails to match.
To be clear, there's no way to get a "best match". What you're asking for is that re should make a decision as to what you really want. If you want groups to be optional, you need to make them optional.
Depending on what you actually want you can try the regexes:
r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?'
or
r'([\w\W])+?:([\w\W])+?:(([\w\W])+?:)?'
Which respectively make the last : and the entire last group optional.

You'll have to modify the regex to instruct it on what exactly is optional and what isn't. Python regexes don't have this concept of partial matches. One possibility is to change it to
biodata1 = re.search(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?',line)
Where you allow 1, 2 or 3 groups to match. In this case, any groups that don't match will return the empty string when you do match.group(X)

What a regex does is it matches exactly what you provided. There is no best try or anything like that.
If you want some part of your match to be optional you need to declare it using the ? operator. So in your case your regex would need to look like this:
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?',line)
Also +? (at least once, or not at all) is equal to * (at least zero times), so you could just do this:
biodata1 = re.search(r'([\w\W])*:([\w\W])*:([\w\W])*:?',line)

Having some issues with re.sub

In my program I'm parsing Japanese definitions, and I need to take a few things out. There are three things I need to take things out between. 「text」 (text) 《text》
To take out things between 「」 I've been doing sentence = re.sub('「[^)]*」','', sentence) The problem with this is, for some reason if there are parentheses within 「」 it will not replace anything. Also, I've tried using the same code for the other two things like sentence = re.sub('([^)]*)','', sentence)
sentence = re.sub('《[^)]*》','', sentence) but it doesn't work for some reason. There isn't an error or anything, it just doesn't replace anything.
How can I make this work, or is there some better way of doing this?
EDIT:
I'm having a slight problem with another part of this though. Before I replace anything I check the length to make sure it's over a certain length.
parse = re.findall(r'「[^」]*」','', match.text)
if len(str(parse)) > 8:
sentence = re.sub(r'「[^」]*」','', match.text)
This seems to be causing an error now:
Traceback (most recent call last):
File "C:/Users/Dominic/PycharmProjects/untitled9/main.py", line 48, in <module>
parse = re.findall(r'「[^」]*」','', match.text)
File "C:\Python34\lib\re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
File "C:\Python34\lib\re.py", line 275, in _compile
bypass_cache = flags & DEBUG
TypeError: unsupported operand type(s) for &: 'str' and 'int'
I sort of understand what's causing this, but I don't understand why It's not working just from that slight change. I know the re.sub part is fine, It's just the first two lines that are causing the problems.

You should read a tutorial on regular expressions so you understand what your regexps do.
The regexp '「[^)]*」' matches anything between the angles that is not a closing parenthesis. You need this:
sentence = re.sub(r'「[^」]*」','', sentence)
The second regexp has an additional problem: Parentheses have a special meaning (when they are not inside square brackets), so to match parentheses you need to write \( and \). So you need this:
'\([^)]*\)'
Finally: You should always use raw strings for your python regexps. It doesn't happen to make a difference in this case, but it very often does, and the bugs are maddening to spot. E.g., use:
r'\([^)]*\)'

sentence = re.sub(ur'「[^」]*」','', sentence)
^^
You need to change the negatiion based quantifer to stop at 」 instead of ).
You should use unicode flag if dealing with them.If there are ) within them then it will fail as you have used 「[^)]*」
^^
You have instructed regex to stop when it finds ).

Problem with Boolean Expression with a string value from a lIst

I have the following problem:
# line is a line from a file that contains ["baa","beee","0"]
line = TcsLine.split(",")
NumPFCs = eval(line[2])
if NumPFCs==0:
print line
I want to print all the lines from the file if the second position of the list has a value == 0.
I print the lines but after that the following happens:
Traceback (most recent call last):
['baaa', 'beee', '0', '\n']
BUT after I have the next ERROR
ilation.py", line 141, in ?
getZeroPFcs()
ilation.py", line 110, in getZeroPFcs
NumPFCs = eval(line[2])
File "<string>", line 0
Can you please help me?
thanks
What0s

Let me explain a little what you do here.
If you write:
NumPFCs = eval(line[2])
the order of evaluation is:
take the second character of the string line, i.e. a quote '"'
eval this quote as a python expression, which is an error.
If you write it instead as:
NumPFCs = eval(line)[2]
then the order of evaluation is:
eval the line, producing a python list
take the second element of that list, which is a one-character string: "0"
a string cannot be compared with a number; this is an error too.
In your terms, you want to do the following:
NumPFCs = eval(eval(line)[2])
or, slightly better, compare NumPFCs to a string:
if NumPFCs == "0":
but the ways this could go wrong are almost innumerable. You should forget about eval and try to use other methods: string splitting, regular expressions etc. Others have already provided some suggestions, and I'm sure more will follow.

Your question is kind of hard to read, but using eval there is definitely not a good idea. Either just do a direct string comparison:
line=TcsLine.split(",")
if line[2] == "0":
print line
or use int
line=TcsLine.split(",")
if int(line[2]) == 0:
print line
Either way, your bad data will fail you.
I'd also recomment reading PEP 8.

There are a few issues I see in your code segment:
you make an assumption that list always has at least 3 elements
eval will raise exception if containing string isn't valid python
you say you want second element, but you access the 3rd element.
This is a safer way to do this
line=TcsLine.split(",")
if len(line) >=3 and line[2].rfind("0") != -1:
print line

I'd recommend using a regular expression to capture all of the variants of how 0 can be specified: with double-quotes, without any quotes, with single quotes, with extra whitespace outside the quotes, with whitespace inside the quotes, how you want the square brackets handled, etc.

There are many ways of skinning a cat, as it were :)
Before we begin though, don't use eval on strings that are not yours so if the string has ever left your program; i.e. it has stayed in a file, sent over a network, someone can send in something nasty. And if someone can, you can be sure someone will.
And you might want to look over your data format. Putting strings like ["baa","beee","0", "\n"] in a file does not make much sense to me.
The first and simplest way would be to just strip away the stuff you don't need and to a string comparison. This would work as long as the '0'-string always looks the same and you're not really after the integer value 0, only the character pattern:
TcsLine = '["baa","beee","0"]'
line = TcsLine.strip('[]').split(",")
if line[2] == '"0"':
print line
The second way would be to similar to the first except that we cast the numeric string to an integer, yielding the integer value you were looking for (but printing 'line' without all the quotation marks):
TcsLine = '["baa","beee","0"]'
line = [e.strip('"') for e in TcsLine.strip('[]').split(",")]
NumPFCs = int(line[2])
if NumPFCs==0:
print line
Could it be that the string is actually a json array? Then I would probably go get simplejson to parse it properly if I were running Python<2.6 or just import json on Python>=2.6. Then cast the resulting '0'-string to an integer as in the previous example.
TcsLine = '["baa","beee","0"]'
#import json # for >= Python2.6
import simplejson as json # for <Python2.6
line = json.loads(TcsLine)
NumPFCs = int(line[2])
if NumPFCs==0:
print line

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python's Regular Expression Source String Length - python

Related

Regex 'sre_constants.error: bad character range' in large regex pattern

How to tell if a single line of python is syntactically valid?

How to make the re.search() try a best attempt approach

Having some issues with re.sub

Problem with Boolean Expression with a string value from a lIst

Categories

Resources