I have a script that iterates over the contents of hundreds of thousands of files to find specific matches. For convenience I am using Python's string in operator. What are the performance differences between the two approaches below? I'm looking for more of a conceptual understanding here.
list_of_file_contents = [...]  # 1GB
key = 'd89fns;3ofll'
matches = []
for item in list_of_file_contents:
    if key in item:
        matches.append(item)
--vs--
grep -r 'd89fns;3ofll' my_files/
The biggest conceptual difference is that grep does regular expression matching; in Python you'd need to explicitly write code using the re module to get that. The search expression in your example doesn't exploit any of the richness of regular expressions, so the search behaves just like a plain string match and should consume only a tiny bit more resources than fgrep would. Your Python script is effectively fgrep, and hopefully it operates on par with it.
If the files are encoded, say in UTF-16, depending on the version of the various programs, there could be a big difference in whether matches are found, and a little in how long it takes.
And that's assuming that the actual Python code deals with input and output efficiently, i.e. list_of_file_contents isn't an actual in-memory list of all the data but, for instance, a lazy iteration built around fileinput; and that there is not a huge number of matches to accumulate.
I suggest you try it out for yourself. Profiling Python code is really easy: https://stackoverflow.com/a/582337/970247. For a more conceptual answer: regex is a powerful string-parsing engine full of features; in contrast, Python's in does just one thing in a really straightforward way. I would say the latter will be the more efficient, but again, trying it for yourself is the way to go.
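A minimal timing sketch of that comparison, using timeit. The sample data here is made up just to have something to search; re.escape makes the key a purely literal pattern so both searches look for the same thing:

```python
import re
import timeit

# Hypothetical stand-in for the real file contents.
haystacks = ["some file contents " * 50 + suffix
             for suffix in ("", "d89fns;3ofll")] * 1000
key = "d89fns;3ofll"
pattern = re.compile(re.escape(key))  # literal pattern, no regex features used

t_in = timeit.timeit(lambda: [h for h in haystacks if key in h], number=10)
t_re = timeit.timeit(lambda: [h for h in haystacks if pattern.search(h)], number=10)
print(f"str in : {t_in:.4f}s")
print(f"re     : {t_re:.4f}s")
```

Both report the same matches; the absolute numbers will vary by machine, which is exactly why measuring on your own data is the way to go.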
What can cause non-halting behavior in regular expression match() operation (with Python's re module)?
I'm currently racking my brains trying to work out a problem that has stumped me for hours. Why does the line below hang?
re.compile(r'.*?(now|with that|at this time|ready|stop|wrap( it|) up|at this point|happy|pleased|concludes|concluded|we will|like to)(,)*(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}( |\n)(take|for|to|open|entertain|answer|address)(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}( |\n|)(questions|Q *& *A).*?', re.DOTALL| re.IGNORECASE).match("I would now like to turn the presentation over to your host for today's call, Mr. Mitch Feiger, please proceed.")
In short, I'm using match(), the regular expression is r'.*?(now|with that|at this time|ready|stop|wrap( it|) up|at this point|happy|pleased|concludes|concluded|we will|like to)(,)*(( |\n)[a-z]+(\'[a-z]+)*,*){0,20}( |\n)(take|for|to|open|entertain|answer|address)(( |\n|)[a-z]+(\'[a-z]+)*,*){0,20}( |\n|)(questions|Q *& *A).*?'
And the text is: "I would now like to turn the presentation over to your host for today's call, Mr. Mitch Feiger, please proceed."
I understand my regular expression is a bit of a mess; it's been built up over time to somewhat cheatily match paragraphs in which the speaker announces the start of a question session. My main confusion right now is trying to find what in there could be causing what I assume is a non-halting search.
It gets stuck on a lot of other pieces of text my program uses, but far from all of them (the program processes thousands of text files, each with ~100 of these text pieces it needs to do matching on), and I can't see any common factors. To be clear, this is not supposed to return a match, but this check does need to be done, and I can't understand why it hangs like it does.
More generally, what are the sorts of things that could cause a Python regular expression match to hang indefinitely? I'd love to have the information so I can work out the problem myself, but at this point, I'd take a cheap answer...
Perl-style regular expressions, which is what Python's re module implements, are no longer "regular" in the Computer Science sense. Because of this, they can suffer from catastrophic backtracking: https://swtch.com/~rsc/regexp/regexp1.html
This doesn't help you much with your problem. What would help you is:
break down your regexp in multiple small blocks
see how long each block takes to execute
start putting the blocks together to get closer to your original huge regexp
You might have to stop trying to do everything with one single regexp; you may end up using two regexps and a bit of code to put the two parts together more efficiently.
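A sketch of that block-by-block approach, using a few of the alternation groups from the question's pattern (shortened here for illustration) and timing each one in isolation:

```python
import re
import timeit

text = ("I would now like to turn the presentation over to your host "
        "for today's call, Mr. Mitch Feiger, please proceed.")

# Hypothetical breakdown: time each alternation block on its own first,
# then recombine blocks one at a time until the slowdown reappears.
blocks = [
    r'(now|with that|at this time|ready|like to)',
    r'(take|for|to|open|entertain|answer|address)',
    r'(questions|Q *& *A)',
]

for block in blocks:
    pat = re.compile(block, re.IGNORECASE)
    t = timeit.timeit(lambda: pat.search(text), number=1000)
    print(f"{block[:40]:40} {t:.4f}s")
```

Each block alone is fast; the hang only appears when nested quantified groups like (( |\n)[a-z]+...){0,20} are combined, which is the signature of catastrophic backtracking.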
I'm looking for a way to run a regex over a (long) iterable of "characters" in Python. (Python doesn't actually have characters, so it's actually an iterable of one-length strings. But same difference.)
The re module only allows searching over strings (or buffers), as far as I can tell.
I could implement it myself, but that seems a little silly.
Alternatively, I could convert the iterable to a string and run the regex over the string, but that gets (hideously) inefficient. (A worst-case example: re.search(".a", "".join('a' for a in range(10**8))) peaks at over 900M of RAM (private working set) on my (x64) machine, and takes ~12 seconds, even though it only needs to look at the first two characters in the iterable.)
As far as I understand, the example that joins a lot of 'a's is just an extremely simple example that shows the problem. In other words, constructing the content can (in general) be more time- and memory-consuming than the search itself.
The problem with the standard re module is that it implements the extended regular expression syntax, which requires backtracking.
You may be interested in the very classic Thompson (NFA) construction -- see http://swtch.com/~rsc/regexp/regexp1.html for the explanation and a performance comparison with libraries that implement the extended syntax.
It seems that the re2 project could be useful for you. There is a Python port -- see "Is it possible to use re2 from Python?" However, I do not know whether it supports streaming, or whether any streaming regular expression engine for Python exists.
For understanding Thompson's idea, you can also try the on-line visualization of Regular Expression to NFA conversion.
If the number of elements in that list is truly on the order of 10**8, then you are probably better off doing a linear search if you only need to do it once. Otherwise, you have to create this huge string, which is really very inefficient. The other thing I can think of, if you need to do this more than once, is inserting the collection into a hashtable and doing the lookups faster.
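One way to avoid materializing the whole string is to scan the iterable in chunks with an overlap, so matches that straddle a chunk boundary are still seen. This is only a sketch: it assumes any match is shorter than the chunk size, and it returns the first match rather than all of them:

```python
import re
from itertools import islice

def iter_search(pattern, chars, chunk_size=4096):
    """Search an iterable of 1-char strings without joining it all at once.

    Keeps a sliding buffer of at most 2 * chunk_size characters; the
    retained tail provides the overlap for boundary-spanning matches.
    Returns the first match text, or None if the iterable is exhausted.
    """
    pat = re.compile(pattern)
    it = iter(chars)
    buf = ""
    while True:
        chunk = "".join(islice(it, chunk_size))
        if not chunk:
            return None
        buf += chunk
        m = pat.search(buf)
        if m:
            return m.group(0)
        buf = buf[-chunk_size:]  # keep the tail as overlap
```

On the worst-case example from the question, iter_search(".a", ("a" for _ in range(10**8))) returns after reading only the first chunk, instead of joining 10**8 characters first.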
Is there any easy way to go about adding custom extensions to a
Regular Expression engine? (For Python in particular, but I would take
a general solution as well).
It might be easier to explain what I'm trying to build with an
example. Here is the use case that I have in mind:
I want users to be able to match strings that may contain arbitrary
ASCII characters. Regular Expressions are a good start, but aren't
quite enough for the type of data I have in mind. For instance, say I
have data that contains strings like this:
<STX>12.3,45.6<ETX>
where <STX> and <ETX> are the Start of Text/End of Text characters
0x02 and 0x03. To capture the two numbers, it would be very
convenient for the user to be able to specify any ASCII
character in their expression. Something like so:
\x02(\d\d\.\d),(\d\d\.\d)\x03
Where the "\x02" and "\x03" are matching the control characters and
the first and second match groups are the numbers. So, something like
regular expressions with just a few domain-specific add-ons.
How should I go about doing this? Is this even the right way to go?
I have to believe this sort of problem has been solved, but my initial
searches didn't turn up anything promising. Regular Expression have
the advantage of being well known, keeping the learning curve down.
A few notes:
I am not looking for a fixed parser for a particular protocol - it needs to be general and user configurable
I really don't want to write my own regex engine
Although it would be nice, I am not looking for "regex macros" where I create shortcuts for a handful of common expressions. (perhaps a follow-up question...)
Bonus: Have you heard of any academic work, i.e "Creating Domain Specific search languages"
EDIT: Thanks for the replies so far; I hadn't realized Python's re supported arbitrary ASCII chars. However, this is still not quite what I'm looking for. Here is another example that hopefully gives the breadth of what I want in the end:
Suppose I have data that contains strings like this:
$\x01\x02\x03\r\n
Where the bytes 0x01 0x02 0x03 pack two 12-bit integers (0x010 and 0x023). So how could I add syntax so the user could match it with a regex like this:
\$(\int12)(\int12)\x0d\x0a
Where the \int12's each pull out 12 bits. This would be handy if trying to search for packed data.
\x escapes are already supported by the Python regular expression parser:
>>> import re
>>> regex = re.compile(r'\x02(\d\d\.\d),(\d\d\.\d)\x03')
>>> regex.match('\x0212.3,45.6\x03')
<_sre.SRE_Match object at 0x7f551b0c9a48>
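For the more general "domain-specific add-ons" part of the question, one lightweight approach is a preprocessing pass: expand custom tokens into plain regex fragments before handing the result to re.compile. This sketch uses made-up macro names (\STX, \ETX, \num are not real re syntax); note that it only works for add-ons that are expressible as regex fragments, so a bit-level token like \int12 would still need post-processing of the matched bytes:

```python
import re

# Hypothetical macro table: custom tokens expanded into plain regex
# before compiling.  The token names here are invented for illustration.
MACROS = {
    r'\STX': r'\x02',         # Start of Text control character
    r'\ETX': r'\x03',         # End of Text control character
    r'\num': r'(\d\d\.\d)',   # two digits, a dot, one digit, captured
}

def compile_extended(pattern):
    """Expand each macro token, then compile the resulting plain regex."""
    for token, expansion in MACROS.items():
        pattern = pattern.replace(token, expansion)
    return re.compile(pattern)

regex = compile_extended(r'\STX\num,\num\ETX')
m = regex.match('\x0212.3,45.6\x03')
print(m.groups())  # ('12.3', '45.6')
```

Because the expansion happens on the pattern string, users still get the full familiar regex syntax alongside the domain-specific shortcuts.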
One of the practices I have gotten into in Python from the beginning is to reduce the number of variables I create, compared to the number I would create when trying to do the same thing in SAS or Fortran.
for example here is some code I wrote tonight:
def idMissingFilings(dEFilings, indexFilings):
    inBoth = set(indexFilings.keys()).intersection(dEFilings.keys())
    missingFromDE = []
    for each in inBoth:
        if len(dEFilings[each]) < len(indexFilings[each]):
            dEtemp = []
            for filing in dEFilings[each]:
                #dateText = filing.split("\\")[-1].split('-')[0]
                #year = dateText[0:5]
                #month = dateText[5:7]
                #day = dateText[7:]
                #dEtemp.append(year+"-"+month+"-"+day+"-"+filing[-2:])
                dEtemp.append(filing.split('\\')[-1].split('-')[0][1:5]+"-"+filing.split('\\')[-1].split('-')[0][5:7]+"-"+filing.split('\\')[-1].split('-')[0][7:]+"-"+filing[-2:])
            indexTemp = []
            for infiling in indexFilings[each]:
                indexTemp.append(infiling.split('|')[3]+"-"+infiling[-6:-4])
            tempMissing = set(indexTemp).difference(dEtemp)
            for infiling in indexFilings[each]:
                if infiling.split('|')[3]+"-"+infiling[-6:-4] in tempMissing:
                    missingFromDE.append(infiling)
    return missingFromDE
Now, in the dEtemp.append(...) line, I split one of the strings I am processing four times:
filing.split('\\')
Historically in Fortran or SAS if I were to attempt the same I would have 'sliced' my string once and assigned a variable to each part of the string that I was going to use in this expression.
I am constantly forcing myself to use expressions instead of first resolving to a value and then using the value. The only reason I do this is that I am learning by mimicking other people's code, but it has been in the back of my mind to ask this question: where can I find a cogent discussion of why one approach is better than the other?
The code compares a set of documents on a drive and a source list of those documents and checks to see whether all of those from the source are represented on the drive
Okay, the commented-out section is much easier to read, and that is how I decided to respond to nosklo's answer.
Yeah, it is not better to put everything in the expression. Please use variables.
Using variables is not only better because you will do the operation only once and save the value for multiple uses. The main reason is that code becomes more readable that way. If you name the variable right, it doubles as free implicit documentation!
Use more variables. Python is known for its readability; code that takes away that feature is considered not "Pythonic" (see https://docs.python-guide.org/writing/style/). Code that is more readable will be easier for others to understand, and easier for you to understand yourself later.
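As a concrete illustration, here is one possible refactor of the repeated-split line from the question: slice once, name the parts, then build the result from the names. The path format is hypothetical, inferred from the slicing in the original code:

```python
def format_filing(filing):
    """Turn a filing path into 'YYYY-MM-DD-NN' (format assumed from the question)."""
    date_text = filing.split('\\')[-1].split('-')[0]  # split ONCE, then reuse
    year = date_text[1:5]
    month = date_text[5:7]
    day = date_text[7:]
    suffix = filing[-2:]
    return f"{year}-{month}-{day}-{suffix}"

print(format_filing('path\\to\\X20240115-abc-01'))  # 2024-01-15-01
```

Beyond avoiding three redundant splits, the variable names document what each slice means, which the one-line expression could not.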
I would like to let my users use regular expressions for some features. I'm curious what the implications are of passing user input to re.compile(). I assume there is no way for a user to give me a string that could let them execute arbitrary code. The dangers I have thought of are:
The user could pass input that raises an exception.
The user could pass input that causes the regex engine to take a long time, or to use a lot of memory.
The solution to 1. is easy: catch exceptions. I'm not sure if there is a good solution to 2. Perhaps just limiting the length of the regex would work.
Is there anything else I need to worry about?
I have worked on a program that allows users to enter their own regex, and you are right - they can (and do) enter regex that can take a long time to finish - sometimes longer than the lifetime of the universe. What is worse, while processing a regex Python holds the GIL, so it will not only hang the thread that is running the regex, but the entire program.
Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex r"(\S+)+x" on a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against "a"*21 and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate, I have not timed it).
For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.
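The doubling can be seen directly with a small timing sketch, kept to short inputs so it finishes quickly:

```python
import re
import time

# Pathological pattern from the answer above: nested quantifiers force
# the engine to try roughly 2**N ways of splitting the input.
pattern = re.compile(r"(\S+)+x")

for n in (12, 14, 16):
    text = "a" * n  # contains no "x", so every attempt must fail
    start = time.perf_counter()
    assert pattern.match(text) is None
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.6f}s")
```

Each step of two extra characters roughly quadruples the time, consistent with 2**N backtracking; extrapolating to n=100 gives the astronomical figures quoted above.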
edit
To get round this we wrote a regex analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to get all of them.
Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but requires changing the Python C source and recompiling, so is not portable. We also submitted a patch to release the GIL when matching against python strings, but I don't think it was accepted into the core (python only holds the GIL because regex can be run against mutable buffers).
It's much simpler for casual users to give them a subset language. The shell's globbing rules in fnmatch, for example. The SQL LIKE condition rules are another example.
Translate the user's language into a proper regex for execution at runtime.
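The standard library can even do the translation for you in the glob case: fnmatch.translate converts a shell-style pattern into a regex with no nested quantifiers, so user input can't trigger catastrophic backtracking:

```python
import fnmatch
import re

# A glob the user might type; translate() produces a bounded, safe regex.
user_glob = "report-*.txt"
regex = re.compile(fnmatch.translate(user_glob))

print(bool(regex.match("report-2024.txt")))  # True
print(bool(regex.match("notes.txt")))        # False
```

The user gets a familiar, forgiving syntax, and you never hand their raw input to re.compile at all.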
Compiling the regular expression should be reasonably safe. Although what it compiles into is not strictly an NFA (backreferences mean it's not quite as clean), the compilation step itself should still be fairly straightforward.
Now as to performance characteristics, this is another problem entirely. Even a small regular expression can have exponential time characteristics because of backtracking. It might be better to define a certain subset of features and only support very limited expressions that you translate yourself.
If you really want to support general regular expressions you either have to trust your users (sometimes an option) or limit the amount of space and time used. I believe that space used is determined only by the length of the regular expression.
edit: As Dave notes, apparently the global interpreter lock is held during regex matching, which would make setting that timeout harder. If that is the case, your only option to set a timeout is to run the match in a separate process. While not exactly ideal it is doable. I completely forgot about multiprocessing. Point of interest is this section on sharing objects. If you really need the hard constraints, separate processes are the way to go here.
It's not necessary to use compile() except when you need to reuse a lot of different regular expressions. The module already caches recently used expressions.
The point 2 (at execution) could be a very difficult one if you allow the user to input any regular expression. You can make a complex regexp with few characters, like the famous (x+x+)+y one. I think it's a problem yet to be resolved in a general way.
A workaround could be launching the match in a separate process and monitoring it; if it exceeds the allowed time, kill the process and return an error. (A thread would not work here: Python threads cannot be forcibly killed, and the GIL is held during matching anyway.)
I really don't think it is possible to execute code simply by passing it into an re.compile. The way I understand it, re.compile (or any regex system in any language) converts the regex string into a finite automaton (DFA or NFA), and despite the ominous name 'compile' it has nothing to do with the execution of any code.
You technically don't need to use re.compile() to perform a regular expression operation on a string. In fact, the compile method can often be slower if you're only executing the operation once since there's overhead associated with the initial compiling.
If you're worried about the word "compile", then avoid it altogether and simply pass the raw expression to match, search, etc. You may wind up improving the performance of your code slightly anyway.
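Both styles side by side, as a minimal sketch; the module-level functions consult re's internal cache, so repeated calls with the same pattern string reuse the compiled object:

```python
import re

text = "order id: 12345"

# Module-level function: the pattern string is compiled once and cached.
m1 = re.search(r"\d+", text)

# Explicit compile: mainly pays off when the object is reused many times.
pat = re.compile(r"\d+")
m2 = pat.search(text)

print(m1.group(0), m2.group(0))  # 12345 12345
```

For a one-off search the two are effectively equivalent, which is why the answer above suggests just calling re.search directly.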