Text processing - Python vs Perl performance [closed]

Here are my Perl and Python scripts for some simple text processing of about 21 log files, each roughly 300 KB to 1 MB (maximum), with the set repeated 5 times (125 files in total, because the logs are repeated 5 times).
Python Code (modified to use compiled regular expressions with re.I)
#!/usr/bin/python
import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()
    mprev = exists_re.search(currline)
    if mprev:
        xlogtime = mprev.group(1)
    mcurr = location_re.search(currline)
    if mcurr:
        print fn, xlogtime, mcurr.group(1)
Perl Code
#!/usr/bin/perl
while (<>) {
    chomp;
    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }
    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}
On my PC, both scripts generate exactly the same result file of 10,790 lines. Here is the timing, done on Cygwin's Perl and Python implementations.
User#UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* > summarypy.log

real    0m8.185s
user    0m8.018s
sys     0m0.092s

User#UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* > summarypl.log

real    0m1.481s
user    0m1.294s
sys     0m0.124s
Originally, it took 10.2 seconds with Python and only 1.9 seconds with Perl for this simple text processing.
(UPDATE) After switching to precompiled regular expressions, Python now takes 8.2 seconds and Perl 1.5 seconds. Perl is still much faster.
Is there any way to improve the speed of Python here, or is it simply a given that Perl will be the speedy one for simple text processing?
By the way, this was not the only test I did for simple text processing... However I restructure the source code, Perl always wins by a large margin, and not once has Python performed better for a simple m/regex/ match-and-print job.
Please do not suggest using C, C++, Assembly, other flavours of Python, etc.
I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using its modules).
Boy, I wish I could use Python for all my tasks because of its readability, but I am not willing to give up speed.
So, please suggest how the code can be improved to get results comparable to Perl's.
UPDATE: 2012-10-18
As other users suggested, Perl has its place and Python has its.
So, for this question, one can safely conclude that for a simple regex match on each line of hundreds or thousands of text files, writing the results to a file (or printing them to the screen), Perl will always, always WIN in performance for this job. It's as simple as that.
Please note that when I say Perl wins in performance, only standard Perl and standard Python are compared, without resorting to obscure modules (obscure for a normal user like me) and without calling C, C++ or assembly libraries from Python or Perl. We don't have time to learn all those extra steps and installations for a simple text-matching job.
So, Perl rocks for text processing and regex.
Python has its place to rock elsewhere.
Update 2013-05-29: An excellent article that makes a similar comparison is here. Perl again wins for simple text matching; for more details, read the article.

This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.
One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')
And then in your loop:
mprev = exists_re.search(currline)
and
mcurr = location_re.search(currline)
That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re in a loop without compiling first is bad practice in Python.
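If you want to see what the precompilation is worth on your own machine, a minimal timing sketch like the following can help (the sample line and iteration count are made up; note that the re module also keeps a small internal cache of recently compiled patterns, so the difference is mostly the per-call lookup and argument handling):

import re
import timeit

line = "2012-10-18 01:02:03 INFO Such a record already exists in table FOO"  # made-up sample
pattern = r'^(.*?) INFO.*Such a record already exists'
compiled = re.compile(pattern, re.I)

# re.search() has to look the pattern up (and compile it on a cache miss) on each call;
# the precompiled object's .search() skips that step.
t_plain = timeit.timeit(lambda: re.search(pattern, line, re.I), number=100000)
t_compiled = timeit.timeit(lambda: compiled.search(line), number=100000)
print "re.search:        %.3f s" % t_plain
print "compiled .search: %.3f s" % t_compiled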

Hypothesis: Perl spends less time backtracking on lines that don't match, thanks to optimisations it has that Python doesn't.
What do you get by replacing
^(.*?) INFO.*Such a record already exists
with
^((?:(?! INFO).)*?) INFO.*Such a record already exists
or
^(?>(.*?) INFO).*Such a record already exists
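If you want to test the first variant from the Python side, here is a minimal sketch (the sample line is made up). Note that the atomic-group form (?>...) is Perl syntax that Python's re module does not accept, so only the lookahead version is shown:

import re

# Lookahead variant: the lazy .*? is never allowed to cross " INFO", which limits backtracking.
exists_re = re.compile(r'^((?:(?! INFO).)*?) INFO.*Such a record already exists', re.I)

sample = "2012-10-18 01:02:03 INFO Such a record already exists"  # made-up line
m = exists_re.search(sample)
if m:
    print m.group(1)    # everything before " INFO"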

Function calls are a bit expensive in terms of time in Python, and you have a function call to get the file name inside the loop:
fn = fileinput.filename()
Hoist this out of the per-line work and you should see some improvement in your Python timing. Strictly speaking, the filename is only invariant within a single file (fileinput moves on to the next file as it reads), so cache the name and refresh it only at each file boundary rather than on every line. Probably not enough to beat Perl, though.
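A sketch of that idea, with the per-file caveat folded in (whether the saving is measurable on your data is something to verify):

#!/usr/bin/python
import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

fn = None
for line in fileinput.input():
    if fileinput.isfirstline():      # refresh the cached name once per file
        fn = fileinput.filename()
    currline = line.rstrip()
    mprev = exists_re.search(currline)
    if mprev:
        xlogtime = mprev.group(1)
    mcurr = location_re.search(currline)
    if mcurr:
        print fn, xlogtime, mcurr.group(1)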

In general, all artificial benchmarks are evil. However, everything else being equal (the algorithmic approach), you can make relative improvements. Note that I don't use Perl, so I can't argue in its favor. That being said, with Python you can try using Pyrex or Cython to improve performance, or, if you are adventurous, convert the Python code into C++ via ShedSkin (which works for most of the core language and some, but not all, of the core modules).
Nevertheless, you can follow some of the tips posted here:
http://wiki.python.org/moin/PythonSpeed/PerformanceTips
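One tip from that page that applies directly to the script in the question is binding frequently used bound methods to local names, so the loop avoids repeated attribute lookups. A hedged sketch, not a measured result:

import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

exists_search = exists_re.search        # local names resolve faster than
location_search = location_re.search    # dotted attribute lookups in a tight loop
filename = fileinput.filename

for line in fileinput.input():
    currline = line.rstrip()
    mprev = exists_search(currline)
    if mprev:
        xlogtime = mprev.group(1)
    mcurr = location_search(currline)
    if mcurr:
        print filename(), xlogtime, mcurr.group(1)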

I expect Perl to be faster. Just out of curiosity, can you try the following?
#!/usr/bin/python
import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)
                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()
Update in reaction to "it is too complex".
Of course it looks more complex than the Perl version. Perl was built around regular expressions; you will hardly find an interpreted language that is faster at them. The Perl syntax...
while (<>) {
    ...
}
... also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out:
#!/usr/bin/python
import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:    # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)
            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)
Here the def input_files() could be placed elsewhere (say, in another module), or it can be reused. It is even possible to mimic Perl's while (<>) {...}, though not with the same syntax:
#!/usr/bin/python
import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f:    # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)
    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)
Then the last for loop can look as easy (in principle) as Perl's while (<>) {...}. Such readability enhancements are harder to achieve in Perl.
Anyway, it will not make the Python program faster; Perl will still be faster here. Perl is a file/text cruncher. But, in my opinion, Python is a better programming language for more general purposes.

Related

Python difflib with long diff blocks

I was using Python's difflib to create comprehensive differential logs between rather long files. Everything was running smoothly until I hit the problem of never-ending diffs. After digging around, it turned out that difflib cannot handle long sequences of semi-matching lines.
Here is a (somewhat minimal) example:
import sys
import random
import difflib

def make_file(fname, dlines):
    with open(fname, 'w') as f:
        f.write("This is a small file with a long sequence of different lines\n")
        f.write("Some of the starting lines could differ {}\n".format(random.random()))
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        for i in range(dlines):
            f.write("{}\t{}\t{}\t{}\n".format(i, i + random.random() / 100, i + random.random() / 10000, i + random.random() / 1000000))

make_file("a.txt", 125)
make_file("b.txt", 125)

with open("a.txt") as ff:
    fromlines = ff.readlines()
with open("b.txt") as tf:
    tolines = tf.readlines()

diff = difflib.ndiff(fromlines, tolines)
sys.stdout.writelines(diff)
Even for the 125 lines in the example, it took Python over 4 seconds to compute and print the diff, while GNU diff took literally a few milliseconds. And I'm facing inputs where the number of lines is roughly 100 times larger.
Is there a sensible solution to the issue? I had hoped to use difflib, as it produces rather nice HTML diffs, but I am open to suggestions. I need a portable solution that works on as many platforms as possible, although I am already considering porting GNU diff for the purpose :). Hacking into difflib is also possible, as long as I don't have to literally rewrite the whole library.
PS. The files might have variable-length prefixes, so splitting them into parts without aligning the diff context might not be the best idea.
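One workaround worth trying (a sketch, not a drop-in replacement for the HTML output): difflib.unified_diff skips the per-character intraline analysis that makes ndiff so slow on long runs of almost-matching lines, so it tends to finish quickly even where ndiff does not:

import sys
import difflib

with open("a.txt") as ff:
    fromlines = ff.readlines()
with open("b.txt") as tf:
    tolines = tf.readlines()

# unified_diff only does line-level matching, so it avoids ndiff's expensive
# "which characters inside these similar lines changed" step.
diff = difflib.unified_diff(fromlines, tolines, fromfile="a.txt", tofile="b.txt")
sys.stdout.writelines(diff)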

Python runs regex on variable but not on file with same content

I am writing a Python (2.7) script to parse some logs from a Java application using regex.
I used http://pythex.org/ to help test the patterns, and they work there just fine against a reduced log sample.
Once I do the same in my script, it works if I put part of the log in a variable, but it won't work if I point it at a file.
Here is the code
import re

regex_sql_java_error = "\[use.(.*?)\]\nThread:.{9}(GENERAL|LOADER).{17}(ERROR(.*?)\n)"

logfile = open('example_files/Log_file.txt', 'r')
data = logfile.read()
logfile.close()

filtered = re.finditer(regex_sql_java_error, data, re.DOTALL | re.MULTILINE)
if filtered:
    for item in filtered:
        print item.group(0)
The log file I am using is a measly 1 MB.
I can't imagine the pattern being the issue, but here's a sample of the log that matched just fine on pythex.org:
Thread: 5624 LOADER 08:26:37.078 INFO executeDdlStatements:
[use ADMINI;, SOME BROKEN SQL HERE;]
Thread: 5624 LOADER 08:26:37.086 ERROR 'executeDdlStatements' command failed with the error: Table 'ADMININTT' doesn't exist
RANDOM JAVA STUFF
Link to it on pythex http://goo.gl/mZSx4z
I've been bashing my head against this for a couple of days and have read a bunch of docs, but I can't figure out what I am doing wrong.
Hopefully it's something really stupid I'll be able to laugh about later.
Anyway, if anybody can point me in the right direction, I'd really appreciate it.
This was dumb and fast, and like I thought, I can laugh about it now.
The log files came from Windows, so the lines end in \r\n rather than \n; account for that everywhere (in the pattern or by normalizing the data) and be happy!
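A sketch of the two usual fixes, using the script from the question: either normalize the CRLF endings after reading, or make the pattern tolerate them. The original pattern only allows Unix \n endings, which is why it matched a pasted variable but not the file read from disk:

import re

logfile = open('example_files/Log_file.txt', 'r')
data = logfile.read()
logfile.close()

# Option 1: normalize the Windows line endings before matching.
data = data.replace('\r\n', '\n')

# Option 2 (instead of option 1): keep the data as-is and write \r?\n wherever
# the pattern currently has \n, e.g.
# regex_sql_java_error = r"\[use.(.*?)\]\r?\nThread:.{9}(GENERAL|LOADER).{17}(ERROR(.*?)\r?\n)"
regex_sql_java_error = r"\[use.(.*?)\]\nThread:.{9}(GENERAL|LOADER).{17}(ERROR(.*?)\n)"

for item in re.finditer(regex_sql_java_error, data, re.DOTALL | re.MULTILINE):
    print item.group(0)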

Python replacement of Rubys grep?

abc=123
dabc=123
abc=456
dabc=789
aabd=123
From the above file I need to find lines beginning with abc= (leading whitespace doesn't matter).
In Ruby I would put this in an array and do
matches = input.grep(/^\s*abc=.*/).map(&:strip)
I'm a total noob in Python; even calling me a fresh Python developer is too much.
Maybe there is a better "Python way" of doing this without even grepping?
The Python version I have available on the platform where I need to solve the problem is 2.6.
There is no way to use Ruby here.
with open("myfile.txt") as myfile:
matches = [line.rstrip() for line in myfile if line.lstrip().startswith("abc=")]
In Python you would typically use a list comprehension whose if clause does what you'd accomplish with Ruby's grep:
import sys, re

matches = [line.strip() for line in sys.stdin
           if re.match(r'^\s*abc=.*', line)]

python system call

I'm having a difficult time understanding how to get Python to call a system function...
import subprocess

the_file = 'logs/consolidated.log.gz'
webstuff = subprocess.Popen(['/usr/bin/zgrep', '/meatsauce/', the_file], stdout=subprocess.PIPE) % dpt_search
for line in webstuff.stdout:
    print line
I'm trying to get Python to build another file with my search string.
Thanks!
I recommend the PyMotW Subprocess page from Doug Hellmann, who (quoting) "reads the docs so you don't have to".
Apart from that:
f = file('sourcefile')
for line in f:
    if 'pattern' in line:
        # mind the , at the end,
        # since there's no stripping involved
        # and print adds a newline without it
        print line,
If you need to match regular expressions, then apart from the re module documentation in the Python Standard Library, also refer to the PyMotW Regular Expressions page.
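If the intent really was to shell out to zgrep and capture the hits into another file, a sketch of the corrected call could look like this (dpt_search and the output filename are assumptions based on the question): the search string goes into the argument list itself, not into a % applied to the Popen object:

import subprocess

dpt_search = 'meatsauce'                    # hypothetical search term
the_file = 'logs/consolidated.log.gz'

proc = subprocess.Popen(['/usr/bin/zgrep', dpt_search, the_file],
                        stdout=subprocess.PIPE)

with open('search_hits.log', 'w') as out:   # hypothetical output file
    for line in proc.stdout:
        out.write(line)
proc.wait()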

Replace part of string using python regular expression

I have the following lines (many, many):
...
gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567
..
What I'd like to do is find the 'particular' line (whatever number comes after the ':') and replace that number with '111222333'. How can I do that using Python regular expressions?
for line in input:
    key, val = line.split(':')
    if key == 'particular':
        val = '111222333'
I'm not sure a regex would be of any value in this specific case. My guess is it would be slower. That said, it can be done. Here's one way:
for line in input:
    line = re.sub('^particular : .*', 'particular : 111222333', line)
There are subtleties involved in this, and this is almost certainly not what you'd want in production code. You need to check all of the re module constants to make sure the regex is acting the way you expect, etc. You might be surprised at the flexibility you find in dealing with problems like this in Python if you try not to use re (of course, this isn't to say re isn't useful) ;-)
Are you sure you need a regular expression?
other_number = '111222333'
some_text, some_number = line.split(': ')
new_line = ': '.join([some_text, other_number])
#!/usr/bin/env python
import re

text = '''gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567'''

# only replace the number on the "particular" line
print(re.sub(r'(particular: )[0-9]+', r'\g<1>111222333', text))
input = """gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567"""
entries = re.split("\n+", input)
for entry in entries:
if entry.startswith("particular"):
entry = re.sub(r'[0-9]+', r'111222333', entry)
or with sed:
sed -e 's/^particular: [0-9].*$/particular: 111222333/g' file
An important point here is that if you have a lot of lines, you want to process them one by one. That is, instead of reading all the lines in replacing them, and writing them out again, you should read in a line at a time and write out a line at a time. (This would be inefficient if you were actually reading a line at a time from the disk; however, Python's IO is competent and will buffer the file for you.)
with open(...) as infile, open(...) as outfile:
    for line in infile:
        if line.startswith("particular"):
            outfile.write("particular: 111222333\n")
        else:
            outfile.write(line)
This will be speed- and memory-efficient.
Your sed example forces me to say neat!
python -c "import re, sys; print ''.join(re.sub(r'^(particular:) \d+', r'\1 111222333', l) for l in open(sys.argv[1]))" file
