Non-halting regular expressions (python re module)

What can cause non-halting behavior in regular expression match() operation (with Python's re module)?
I'm currently racking my brains trying to work out a problem that has stumped me for hours. Why does the line below hang?
re.compile(r'.*?(now|with that|at this time|ready|stop|wrap( it|) up|at this point|happy|pleased|concludes|concluded|we will|like to)(,)*(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}( |\n)(take|for|to|open|entertain|answer|address)(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}( |\n|)(questions|Q *& *A).*?', re.DOTALL| re.IGNORECASE).match("I would now like to turn the presentation over to your host for today's call, Mr. Mitch Feiger, please proceed.")
In short, I'm using match(), the regular expression is r'.*?(now|with that|at this time|ready|stop|wrap( it|) up|at this point|happy|pleased|concludes|concluded|we will|like to)(,)*(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}( |\n)(take|for|to|open|entertain|answer|address)(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}( |\n|)(questions|Q *& *A).*?'
And the text is: "I would now like to turn the presentation over to your host for today's call, Mr. Mitch Feiger, please proceed."
I understand my regular expression is a bit of a mess; it's been built up over time to somewhat cheatily match paragraphs in which the speaker announces the start of a question session. My main confusion right now is trying to find what in there could be causing what I assume is a non-halting search.
It gets stuck on a lot of other pieces of text my program uses, but far from all of them (the program processes thousands of text files, each with ~100 of these text pieces it needs to do matching on), and I can't see any common factors. To be clear, this is not supposed to return a match, but this check does need to be done, and I can't understand why it hangs like it does.
More generally, what are the sorts of things that could cause a Python regular expression match to hang indefinitely? I'd love to have the information so I can work out the problem myself, but at this point, I'd take a cheap answer...

Perl-style regular expressions, which is what Python's re module implements, are no longer "regular" in the computer-science sense. Because they are matched by backtracking, they can suffer from catastrophic backtracking: https://swtch.com/~rsc/regexp/regexp1.html
This doesn't help you much with your problem. What would help you is:
break down your regexp into multiple small blocks
see how long each block takes to execute
start putting the blocks together to get closer to your original huge regexp
You might have to stop trying to do everything with one single regexp and instead use one or two, plus a bit of code to put the two parts together more efficiently.
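The block-by-block timing suggested above could be sketched roughly like this. The helper name time_search and the choice of blocks are only illustrative (the trigger list is abbreviated and the text is shortened), so treat it as a starting point rather than a diagnosis. One group that may be worth watching as you recombine things is the second word run, whose separator ( |\n|) can match the empty string; a block like that tends to look fast in isolation and only becomes slow once whatever follows it fails to match.

```python
import re
import time

def time_search(pattern, text, flags=re.DOTALL | re.IGNORECASE):
    """Return (elapsed_seconds, matched) for a single re.search call."""
    regex = re.compile(pattern, flags)
    start = time.perf_counter()
    matched = regex.search(text) is not None
    return time.perf_counter() - start, matched

# Shortened version of the text from the question.
text = "I would now like to turn the presentation over to your host"

# Building blocks lifted from the question's pattern (trigger list abbreviated).
blocks = {
    "trigger": r"(now|with that|at this time|like to)(,)*",
    "words, required separator": r"(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}",
    "verb": r"( |\n)(take|for|to|open|entertain|answer|address)",
    "words, optional separator": r"(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}",
    "ending": r"( |\n|)(questions|Q *& *A)",
}

timings = {name: time_search(p, text) for name, p in blocks.items()}
for name, (elapsed, matched) in timings.items():
    print(f"{name:28s} matched={matched!s:5s} {elapsed:.6f}s")
```

From here you would concatenate neighbouring blocks one at a time and re-run the timing, ideally on a short piece of text first, until the slow combination shows up.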

Related

Python IN vs grep

I have a script that iterates over file contents of hundreds of thousands of files to find specific matches. For convenience I am using a string in. What are the performance differences between the two? I'm looking for more of a conceptual understanding here.
list_of_file_contents = [...] # 1GB
key = 'd89fns;3ofll'
matches = []
for item in list_of_file_contents:
    if key in item:
        matches.append(key)
--vs--
grep -r 'd89fns;3ofll' my_files/

The biggest conceptual difference is that grep does regular expression matching. In python you'd need to explicitly write code using the re module. The search expression in your example doesn't exploit any of the richness of regular expressions, so the search behaves just like the plain string match in python, and should consume only a tiny bit more resources than fgrep would. The python script is really fgrep and hopefully operates on par with that.
If the files are encoded, say in UTF-16, depending on the version of the various programs, there could be a big difference in whether matches are found, and a little in how long it takes.
And that's assuming that the actual python code deals with input and output efficiently, i.e. list_of_file_contents isn't an actual list of the data, but for instance a list comprehension around fileinput; and that there is not a huge number of matches, or a different set of matches.

I suggest you try it out for yourself. Profiling Python code is really easy: https://stackoverflow.com/a/582337/970247. For a more conceptual take: regex is a powerful string-parsing engine full of features, whereas Python's "in" does just one thing in a really straightforward way. I would say the latter will be the more efficient, but again, trying it for yourself is the way to go.
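If you want to try it, something along these lines would do as a first measurement. The contents list here is a made-up stand-in for the real file data, and the function names are mine; the numbers will depend on your machine and your data, so none are shown.

```python
import re
import timeit

key = "d89fns;3ofll"

# Hypothetical stand-in for the real file contents: half contain the key.
contents = ["x" * 10_000 + key, "y" * 10_000] * 50

def with_in():
    """Plain substring test, as in the question."""
    return [item for item in contents if key in item]

# re.escape makes sure characters such as ';' are taken literally.
pattern = re.compile(re.escape(key))

def with_re():
    """The same search expressed through the re module."""
    return [item for item in contents if pattern.search(item)]

# Both approaches should find exactly the same items.
same_result = with_in() == with_re()

print("in :", timeit.timeit(with_in, number=20))
print("re :", timeit.timeit(with_re, number=20))
```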

Is there any streaming regular-expression module for Python?

I'm looking for a way to run a regex over a (long) iterable of "characters" in Python. (Python doesn't actually have characters, so it's actually an iterable of one-length strings. But same difference.)
The re module only allows searching over strings (or buffers), as far as I can tell.
I could implement it myself, but that seems a little silly.
Alternatively, I could convert the iterable to a string and run the regex over the string, but that gets (hideously) inefficient. (A worst-case example: re.search(".a", "".join('a' for a in range(10**8))) peaks at over 900M of RAM (private working set) on my (x64) machine, and takes ~12 seconds, even though it only needs to look at the first two characters in the iterable.)

As far as I understand, the example that joins a lot of 'a's is just an extremely simple example that shows the problem. In other words, constructing the content can (in general) consume more time and memory than the search itself.
The problem with the standard re module is that it uses the extended regular expression syntax, and it requires backtracking.
You may be interested in the classic implementation by Thompson (NFA) -- see http://swtch.com/~rsc/regexp/regexp1.html for the explanation and a comparison of its performance with the libraries that implement the extended syntax.
It seems that the re2 project could be useful for you. There should be a Python port -- see Is it possible to use re2 from Python? However, I do not know whether it supports streaming, or whether any streaming regular expression engine for Python exists.
To understand Thompson's idea, you can also try the on-line visualization of the Regular Expression to NFA.

If the number of elements in that list is truly on the order of 10**8 then you are probably better off doing a linear search if you only want to do it once. Otherwise, you have to create this huge string, which is really very inefficient. The other thing I can think of, if you need to do this more than once, is inserting the collection into a hashtable so the search is faster.
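Short of a true streaming engine, one workaround for the case in the question (the match sits near the start of the iterable) is to feed the re module growing chunks and stop at the first match that ends before the end of the buffer. This is only a sketch: search_incrementally and chunk_size are names I made up, the buffer is rescanned from the start on every round, and a greedy pattern could match more text if the whole input were available, so the result is not guaranteed to be identical to searching the fully joined string.

```python
import re
from itertools import islice

def search_incrementally(pattern, chars, chunk_size=4096):
    """Search an iterable of one-character strings without joining all of it first."""
    regex = re.compile(pattern)
    iterator = iter(chars)
    buffer = ""
    while True:
        chunk = "".join(islice(iterator, chunk_size))
        if not chunk:
            # Input exhausted: whatever the buffer gives is the final answer.
            return regex.search(buffer)
        buffer += chunk
        match = regex.search(buffer)
        # Accept a match only if it stops short of the buffer's end; a match
        # touching the end might still grow once more input arrives.
        if match and match.end() < len(buffer):
            return match

# Only the first chunk of the 10**8 items is ever consumed here.
m = search_incrementally(".a", ("a" for _ in range(10**8)), chunk_size=16)
print(m.group())
```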

RegEx anomalous behavior in a program

I have written the following regex to match a set of e-mails from HTML files. The e-mails can take various formats such as
alice # so.edu
alice at sm.so.edu
alice # sm.com
<a href="mailto:alice at bob dot com">
I generally use RegexPal to test my regular expressions before implementing them in a programming language. I observe strange behavior on the last e-mail example posted. RegexPal shows me a match for my regex, but when I use the same regex in a Python program it doesn't give me a hit. What could be the reason?
mail_regex = (?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:#|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*
(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))
The RegEx is a little bit complex to accommodate variety of other examples (email patterns found in the dataset). You can also run and inspect the Python program on CodePad - http://codepad.org/W2p6waBb
Edit
Just to give a perspective the same regex works on - http://pythonregex.com/

It looks like the specific issue here is that you need to use a raw string:
mail_re = r"(?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:#|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))"
Otherwise, for instance \b will be backspace instead of word boundary.
Also, you're using a JavaScript tester. Python has different syntax and behavior. To avoid surprises, it would be better to test with the Python-specific syntax.
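A small illustration of the raw-string point, using just the \bat\b fragment rather than the full pattern (the variable names are only for this example):

```python
import re

text = "alice at bob dot com"

plain = "\bat\b"   # Python turns each "\b" into a backspace character (\x08)
raw = r"\bat\b"    # backslashes survive, so the regex engine sees word boundaries

print(len(plain), len(raw))
print(re.search(plain, text))
print(re.search(raw, text))
```

The non-raw version is looking for backspace, "at", backspace, which is why it finds nothing in ordinary text.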

Is it safe to use user input for Python's regular expressions?

I would like to let my users use regular expressions for some features. I'm curious what the implications are of passing user input to re.compile(). I assume there is no way for a user to give me a string that could let them execute arbitrary code. The dangers I have thought of are:
1. The user could pass input that raises an exception.
2. The user could pass input that causes the regex engine to take a long time, or to use a lot of memory.
The solution to 1. is easy: catch exceptions. I'm not sure if there is a good solution to 2. Perhaps just limiting the length of the regex would work.
Is there anything else I need to worry about?
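For the first danger, catching the exception might look something like this. compile_user_pattern is just a name chosen for this sketch, and it only handles re.error, which is what the re module raises for a malformed pattern; it does nothing about the second danger.

```python
import re

def compile_user_pattern(pattern):
    """Return a compiled regex, or None if the user's pattern is malformed."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

print(compile_user_pattern("(") is None)      # unbalanced parenthesis is rejected
print(compile_user_pattern("a+") is not None)
```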

I have worked on a program that allows users to enter their own regex and you are right - they can (and do) enter regexes that take a long time to finish - sometimes longer than the lifetime of the universe. What is worse, while processing a regex Python holds the GIL, so it will hang not only the thread that is running the regex, but the entire program.
Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex r"(\S+)+x" on a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against "a"*21 and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate, I have not timed it).
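If you want to watch the growth for yourself, a loop like the following should show the time climbing steeply with each extra character. The range is kept deliberately small so it finishes quickly; time_failed_match is a name made up for this sketch, and the actual timings depend on your machine, so none are quoted here.

```python
import re
import time

pattern = re.compile(r"(\S+)+x")

def time_failed_match(n):
    """Time pattern.match against n 'a' characters (no 'x', so it has to fail)."""
    start = time.perf_counter()
    result = pattern.match("a" * n)
    return time.perf_counter() - start, result

for n in range(12, 19):
    elapsed, result = time_failed_match(n)
    print(f"n={n:2d}  match={result}  {elapsed:.4f}s")
```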
For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.
edit
To get round this we wrote a regex analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to get all of them.
Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but requires changing the Python C source and recompiling, so is not portable. We also submitted a patch to release the GIL when matching against python strings, but I don't think it was accepted into the core (python only holds the GIL because regex can be run against mutable buffers).

It's much simpler for casual users to give them a subset language. The shell's globbing rules in fnmatch, for example. The SQL LIKE condition rules are another example.
Translate the user's language into a proper regex for execution at runtime.
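With shell-style globs, the translation step is already in the standard library as fnmatch.translate. A minimal sketch follows; compile_user_glob is a name I picked for illustration, and a glob with many wildcards is still not guaranteed to be cheap to match, so this narrows the problem rather than eliminating it.

```python
import fnmatch
import re

def compile_user_glob(user_pattern):
    """Turn a shell-style glob typed by the user into a compiled regex."""
    return re.compile(fnmatch.translate(user_pattern))

regex = compile_user_glob("report_*.txt")
print(bool(regex.match("report_2020.txt")))
print(bool(regex.match("report_2020.csv")))
```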

Compiling the regular expression should be reasonably safe. Although what it compiles into is not strictly an NFA (backreferences mean it's not quite as clean) it should still be sort of straightforward to compile into.
Now as to performance characteristics, this is another problem entirely. Even a small regular expression can have exponential time characteristics because of backtracking. It might be better to define a certain subset of features and only support very limited expressions that you translate yourself.
If you really want to support general regular expressions you either have to trust your users (sometimes an option) or limit the amount of space and time used. I believe that space used is determined only by the length of the regular expression.
edit: As Dave notes, apparently the global interpreter lock is held during regex matching, which would make setting that timeout harder. If that is the case, your only option to set a timeout is to run the match in a separate process. While not exactly ideal it is doable. I completely forgot about multiprocessing. Point of interest is this section on sharing objects. If you really need the hard constraints, separate processes are the way to go here.
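A rough sketch of the separate-process idea is below. To keep it short it uses subprocess with a tiny child script instead of multiprocessing, which is a substitution on my part; search_with_timeout and the child script are made up for this example. Passing the pattern and text on the command line only works for modest sizes, so a real version would probably send them over stdin or a pipe.

```python
import subprocess
import sys

# Child script: reads the pattern and the text from argv, prints 1 or 0.
_CHILD = (
    "import re, sys\n"
    "print(1 if re.search(sys.argv[1], sys.argv[2]) else 0)\n"
)

def search_with_timeout(pattern, text, timeout=1.0):
    """Return True/False for a match, or None if the time limit was hit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return None
    if result.returncode != 0:
        # For example, the child could not compile the pattern.
        raise ValueError(result.stderr.strip())
    return result.stdout.strip() == "1"
```

Starting a fresh interpreter for every match is slow, so in practice you would want a long-lived worker process that is killed and restarted only when it overruns.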

It's not necessary to use compile() except when you need to reuse a lot of different regular expressions. The module already caches the last expressions.
The point 2 (at execution) could be a very difficult one if you allow the user to input any regular expression. You can make a complex regexp with few characters, like the famous (x+x+)+y one. I think it's a problem yet to be resolved in a general way.
A workaround could be to launch a separate thread and monitor it; if it exceeds the allowed time, kill the thread and return with an error.

I really don't think it is possible to execute code simply by passing it into an re.compile. The way I understand it, re.compile (or any regex system in any language) converts the regex string into a finite automaton (DFA or NFA), and despite the ominous name 'compile' it has nothing to do with the execution of any code.

You technically don't need to use re.compile() to perform a regular expression operation on a string. In fact, the compile method can often be slower if you're only executing the operation once since there's overhead associated with the initial compiling.
If you're worried about the word "compile" then avoid it altogether and simply pass the raw expression to match, search, etc. You may wind up improving the performance of your code slightly anyway.

Do you have a hard time keeping to 80 columns with Python? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I find myself breaking strings constantly just to get them on the next line. And of course when I go to change those strings (think logging messages), I have to reformat the breaks to keep them within the 80 columns.
How do most people deal with this?

I recommend trying to stay true to 80-column, but not at any cost. Sometimes, like for logging messages, it just makes more sense to keep 'em long than breaking up. But for most cases, like complex conditions or list comprehensions, breaking up is a good idea because it will help you divide the complex logic to more understandable parts. It's easier to understand:
print sum(n for n in xrange(1000000)
          if palindromic(n, bits_of_n) and palindromic(n, digits))
Than:
print sum(n for n in xrange(1000000) if palindromic(n, bits_of_n) and palindromic(n, digits))
It may look the same to you if you've just written it, but after a few days those long lines become hard to understand.
Finally, while PEP 8 dictates the column restriction, it also says:
A style guide is about consistency. Consistency with this style guide
is important. Consistency within a project is more important.
Consistency within one module or function is most important.
But most importantly: know when to be inconsistent -- sometimes the
style guide just doesn't apply. When in doubt, use your best judgment.
Look at other examples and decide what looks best. And don't hesitate
to ask!

"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines."
The important part is "foolish".
The 80-column limit, like other parts of PEP 8 is a pretty strong suggestion. But, there is a limit, beyond which it could be seen as foolish consistency.
I have the indentation guides and edge line turned on in Komodo. That way, I know when I've run over. The questions are "why?" and "is it worth fixing it?"
Here are our common situations.
logging messages. We try to make these easy to wrap. They look like this
logger.info( "unique thing %s %s %s",
    arg1, arg2, arg3 )
Django filter expressions. These can run on, but that's a good thing. We often knit several filters together in a row. But it doesn't have to be one line of code; multiple lines can make it more clear what's going on.
This is an example of functional-style programming, where a long expression is sensible. We avoid it, however.
Unit Test Expected Result Strings. These happen because we cut and paste to create the unit test code and don't spend a lot of time refactoring it. When it bugs us we pull the strings out into separate string variables and clean the self.assertXXX() lines up.
We generally don't have long lines of code because we don't use lambdas. We don't strive for fluent class design. We don't pass lots and lots of arguments (except in a few cases).
We rarely have a lot of functional-style long-winded expressions. When we do, we're not embarrassed to break them up and leave an intermediate result lying around. If we were functional purists, we might have gas with intermediate result variables, but we're not purists.

It doesn't matter what year it is or what output devices you use (to some extent). Your code should, if possible, be readable by humans. It is hard for humans to read long lines.
How long a line should be depends on its content. If it is a log message then its length matters less. If it is complex code, a long line won't help anyone comprehend it.

Temporary variables. They solve almost every problem I have with long lines. Very occasionally, I'll need to use some extra parens (like in a longer if-statement). I won't make any arguments for or against 80 character limitations since that seems irrelevant.
Specifically, for a log message; instead of:
self._log.info('Insert long message here')
Use:
msg = 'Insert long message here'
self._log.info(msg)
The cool thing about this is that it's going to be two lines no matter what, but by using good variable names, you also make it self-documenting. E.g., instead of:
obj.some_long_method_name(subtotal * (1 + tax_rate))
Use:
grand_total = subtotal * (1 + tax_rate)
obj.some_long_method_name(grand_total)
Most every long line I've seen is trying to do more than one thing and it's trivial to pull one of those things out into a temp variable. The primary exception is very long strings, but there's usually something you can do there too, since strings in code are often structured. Here's an example:
br = mechanize.Browser()
ua = '; '.join(('Mozilla/5.0 (Macintosh', 'U', 'Intel Mac OS X 10.4',
                'en-US', 'rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6'))
br.addheaders = [('User-agent', ua)]

This is a good rule to keep to a large part of the time, but don't pull your hair out over it. The most important thing is that stylistically your code looks readable and clean, and keeping your lines to reasonable length is part of that.
Sometimes it's nicer to let things run on for more than 80 columns, but most of the time I can write my code such that it's short and concise and fits in 80 or less. As some responders point out the limit of 80 is rather dated, but it's not bad to have such a limit and many people have terminals
Here are some of the things that I keep in mind when trying to restrict the length of my lines:
is this code that I expect other people to use? If so, what's the standard that those other people use for this type of code?
do those people have laptops, use giant fonts, or have other reasons for their screen real estate being limited?
does the code look better to me split up into multiple lines, or as one long line?
This is a stylistic question, but style is really important because you want people to read and understand your code.

I would suggest being willing to go beyond 80 columns. 80 columns is a holdover from when it was a hard limit based on various output devices.
Now, I wouldn't go hog wild...set a reasonable limit, but an arbitrary limit of 80 columns seems a bit overzealous.
EDIT: Other answers are also clarifying this: it matters what you're breaking. Strings can more often be "special cases" where you may want to bend the rules a bit, for the sake of clarity. If your code, on the other hand, is getting long, that's a good time to look at where it is logical to break it up.

80 character limits? What year is it?
Make your code readable. If a long line is readable, it's fine. If it's hard to read, split it.
For example, I tend to make long lines when there is a method call with lots of arguments, and the arguments are the normal arguments you'd expect. So, let's say I'm passing 10 variables around to a bunch of methods. If every method takes a transaction id, an order id, a user id, a credit card number, etc, and these are stored in appropriately named variables, then it's ok for the method call to appear on one line with all the variables one after another, because there are no surprises.
If, however, you are dealing with multiple transactions in one method, you need to ensure that the next programmer can see that THIS time you're using transId1, and THAT time transId2. In that case make sure it's clear. (Note: sometimes using long lines HELPS that too).
Just because a "style guide" says you should do something doesn't mean you have to do it. Some style guides are just plain wrong.

The 80 column count is one of the few places I disagree with the Python style guide. I'd recommend you take a look at the audience for your code. If everyone you're working with uses a modern IDE on a monitor with a reasonable resolution, it's probably not worth your time. My monitor is cheap and has a relatively weak resolution, but I can still fit 140 columns plus scroll bars, line markers, break markers, and a separate tree-view frame on the left.
However, you will probably end up following some kind of limit, even if it's not a fixed number. With the exception of messages and logging, long lines are hard to read. Lines that are broken up are also harder to read. Judge each situation on its own, and do what you think will make life easiest for the person coming after you.

Strings are special because they tend to be long, so break them when you need and don't worry about it.
When your actual code starts bumping the 80 column mark it's a sign that you might want to break up deeply nested code into smaller logical chunks.

I deal with it by not worrying about the length of my lines. I know that some of the lines I write are longer than 80 characters but most of them aren't.
I know that my position is not considered "pythonic" by many and I understand their points. Part of being an engineer is knowing the trade-offs for each decision and then making the decision that you think is the best.

Sticking to 80 columns is important not only for readability, but because many of us like to have narrow terminal windows so that, at the same time as we are coding, we can also see things like module documentation loaded in our web browser and an error message sitting in an xterm. Giving your whole screen to your IDE is a rather primitive, if not monotonous, way to use screen space.
Generally, if a line stretches to more than 80 columns it means that something is going wrong anyway: either you are trying to do too much on one line, or have allowed a section of your code to become too deeply indented. I rarely find myself hitting the right edge of the screen unless I am also failing to refactor what should be separate functions, name temporary results, and do other things like that which will make testing and debugging much easier in the end. Read Linus's Kernel Coding Style guide for good points on this topic, albeit from a C perspective:
http://www.kernel.org/doc/Documentation/CodingStyle
And always remember that long strings can either be broken into smaller pieces:
print ("When Python reads in source code"
       " with string constants written"
       " directly adjacent to one another"
       " without any operators between"
       " them, it considers them one"
       " single string constant.")
Or, if they are really long, they're generally best defined as a constant then used in your code under that abbreviated name:
STRING_MESSAGE = (
    "When Python reads in source code"
    " with string constants written directly adjacent to one"
    " another without any operators between them, it considers"
    " them one single string constant.")
...
print STRING_MESSAGE
...

Pick a style you like, apply a layer of common sense, and use it consistently.
PEP 8 is a style guide for libraries included as part of the Python standard library. It was never intended to be picked up as the style rules for all Python code. That said, there's no reason people shouldn't use it, but it's definitely not a set of hard rules. Like any style, there is no single correct way and the most important thing is consistency.

I do run into code that spills past 79 columns every now and then. I've either broken the lines up with '\' (although recently, I've read about using parentheses instead as a preferred alternative, so I'll give that a shot), or just let them be if they're no more than 15 or so past. And this is coming from someone who indents only 2, not 4 spaces (I know, shame on me :\ )! It isn't entirely science. It's also part style, and sometimes, keeping things on one line is just easier to manage or read. Other times, excessive side-to-side scrolling can be worse.
Much of the time it has to do with longer variable names. For variables beyond temp values and iterators, I don't want to reduce them to 1 to 5 letters. Those that are 7 to 15 characters long actually do provide context as to their uses and what classes they refer to.
When I need to print stuff out where parts of the output are dynamic, I'll replace those portions with function calls that cut down on the conditional statements and sheer content that would've been in that body of code.
