Python difflib with long diff blocks

I was using Python's difflib to create comprehensive differential logs between rather long files. Everything was running smoothly until I encountered the problem of never-ending diffs. After digging around, it turned out that difflib cannot efficiently handle long sequences of semi-matching lines.
Here is a (somewhat minimal) example:
import sys
import random
import difflib

def make_file(fname, dlines):
    with open(fname, 'w') as f:
        f.write("This is a small file with a long sequence of different lines\n")
        f.write("Some of the starting lines could differ {}\n".format(random.random()))
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        for i in range(dlines):
            f.write("{}\t{}\t{}\t{}\n".format(i, i+random.random()/100, i+random.random()/10000, i+random.random()/1000000))

make_file("a.txt", 125)
make_file("b.txt", 125)

with open("a.txt") as ff:
    fromlines = ff.readlines()
with open("b.txt") as tf:
    tolines = tf.readlines()

diff = difflib.ndiff(fromlines, tolines)
sys.stdout.writelines(diff)
Even for the 125 lines in the example, it took Python over 4 seconds to compute and print the diff, while GNU diff took literally a few milliseconds. And I'm facing files where the number of lines is approximately 100 times larger.
Is there a sensible solution to this issue? I had hoped to use difflib, as it produces rather nice HTML diffs, but I am open to suggestions. I need a portable solution that works on as many platforms as possible, although I am already considering porting GNU diff for the purpose :). Hacking on difflib is also possible, as long as I don't have to rewrite the whole library.
PS. The files might have variable-length prefixes, so splitting them into parts without aligning diff context might not be the best idea.
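One thing worth trying that stays inside difflib is unified_diff, which skips ndiff's per-character intraline analysis and is usually much cheaper on long runs of similar lines. A minimal sketch reusing the a.txt/b.txt from the example above:
import difflib
import sys

# Sketch only: line-level unified diff, no intraline '?' markers.
with open("a.txt") as ff:
    fromlines = ff.readlines()
with open("b.txt") as tf:
    tolines = tf.readlines()

diff = difflib.unified_diff(fromlines, tolines, fromfile="a.txt", tofile="b.txt")
sys.stdout.writelines(diff)
It loses the intraline markers and the HTML niceties, so it may or may not fit the original goal.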

Related

high RAM usage by eval function in python [duplicate]

I read line 1 (not line 0) as a string from 2 files (the 1st ~30MB and the 2nd ~50MB), where line 0 just holds some information which I don't need at the moment. Line 1 is a string representation of an array that contains around 1.3E6 smaller arrays like ['I1000009', 'A', '4024', 'A'] as entries:
[[['I1000009', 'A', '4024', 'A'], ['I1000009', 'A', '6734', 'G'],...],[['H1000004', 'B', '4024', 'A'], ['L1000009', 'B', '6734', 'C'],...],[and so on],...]
Both files are filled the same way; that's why they are between 30 and 50MB in size. I read those files with my .py script to get access to the individual pieces of information that I need:
import sys

myID = sys.argv[1]
otherID = sys.argv[2]

samePath = '/home/srv/Dokumente/srv/'
FolderName = 'checkArrays/'
finishedFolder = samePath + 'finishedAnalysis/'
myNewFile = samePath + FolderName + myID[0] + '/' + myID + '.txt'
otherFile = samePath + FolderName + otherID[0] + '/' + otherID + '.txt'
nameFileOKarray = '_array_goodData.txt'

import csv
import os
import re        # for regular expressions
# Text 2 - Start
import operator  # for sorting the csv files
# Text 2 - End

whereIsMyArray = 1

text_file = open(finishedFolder + myID + nameFileOKarray, "r")
line = text_file.readlines()[whereIsMyArray:]
myGoodFile = eval(line[0])
text_file.close()

text_file = open(finishedFolder + otherID + nameFileOKarray, "r")
line = text_file.readlines()[whereIsMyArray:]
otherGoodFile = eval(line[0])
text_file.close()

print(str(myGoodFile[0][0][0]))
print(str(otherGoodFile[0][0][0]))
The problem I have is that when I start my .py script from the shell:
python3 checkarr_v1.py 44 39
the RAM usage on my 4GB Pi server climbs to the limit of RAM and swap and the process dies. I then tried to start the .py script on a 32GB RAM server and, look at that, it worked, but the RAM usage is really huge. See the pictures:
[Screenshot: (slack mode) overview of normal RAM and CPU usage]
[Screenshot: (start sequence) peak RAM usage of ~6GB and CPU at the highest point]
After that it goes up and down for about a minute: 1.2GB to 3.6GB, then to 1.7GB, then to 1GB, and after roughly a minute the script finishes and the right output is shown.
Can you help me understand whether there is a better way to solve this on a 4GB Raspberry Pi? Is there a better way to write the 2 files, since the brackets, quotes and commas also take up space in the files? Is there a better solution than the eval function for turning that string into an array? Sorry for all the questions, but I can't understand why ~80MB of files pushes the RAM up to around 6GB. It sounds like I'm doing something wrong. Best regards and thanks.
1.3E9 arrays is going to be lots and lots of bytes if you read that into your application, no matter what you do.
I don't know if your code does what you actually want to do, but you're only ever using the first data item. If that's what you want to do, then don't read the whole file, just read that first part.
But also: I would advise against using eval for deserializing data.
The built-in json module will give data in almost the same format (if you control the input format).
Still, in the end: If you want to hold that much data in your program, you're looking at many GB of memory usage.
If you just want to process it, I'd take a more iterative approach and do a little at the time rather than to swallow the whole files. Especially with limited resources.
Update: I see now that it's 1.3e6, not 1.3e9 entries. Big difference. :-) Then JSON data should be okay. On my machine, a list of 1.3M entries like ['RlFKUCUz', 'A', '4024', 'A'] takes about 250MB.
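A sketch of what the JSON route could look like (assuming the files can be regenerated, since the Python-style single quotes in the current files are not valid JSON; the helper names here are made up):
import json

def write_good_file(path, header, array):
    # Hypothetical helper: line 0 stays a plain header, line 1 becomes JSON.
    with open(path, "w") as f:
        f.write(header.rstrip("\n") + "\n")
        json.dump(array, f)
        f.write("\n")

def read_good_file(path):
    # Hypothetical helper: skip line 0, parse line 1 without eval.
    with open(path) as f:
        f.readline()
        return json.loads(f.readline())

# e.g. myGoodFile = read_good_file(finishedFolder + myID + nameFileOKarray)
Note that json.loads still builds the whole nested list in memory, so the iterative advice above applies either way.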

Python text file read/write optimization

I have been working on this file I/O and have made some progress reading through the site, and I am wondering what other ways this can be optimized. I am parsing a test input file of 10GB/30MM lines and writing the selected fields to an output file, which results in an approximately 1.4GB clean file. Initially, it took 40 minutes to run this process and I have reduced it to around 30 minutes. Does anyone have any other ideas to reduce this in Python? Long term I will be looking to write this in C++; I just have to learn the language first. Thanks in advance.
# fdir, the directory for the input/output files, is defined elsewhere.
wbun = []            # write buffer
wsize = 5000000      # assumed: flush roughly every 5MM lines (value not shown in the question)
fsplit = 12000000    # assumed: split roughly every 4GB of input (value not shown in the question)

with open(fdir+"input.txt", 'rb', 50*(1024*1024)) as r:
    w = open(fdir+"output0.txt", 'wb', 50*(1024*1024))
    for i, l in enumerate(r):
        if l[42:44] == '25':
            # takes fixed width line into csv line of only a few cols
            wbun.append(','.join([
                l[7:15],
                l[26:35],
                l[44:52],
                l[53:57],
                format(int(l[76:89])/100.0, '.02f'),
                l[89:90],
                format(int(l[90:103])/100.0, '.02f'),
                l[193:201],
                l[271:278]+'\n'
            ]))
        # write about every 5MM lines
        if len(wbun) == wsize:
            w.writelines(wbun)
            wbun = []
            print "i_count:", i
        # splits about every 4GB
        if (i+1) % fsplit == 0:
            w.close()
            w = open(fdir+"output%d.txt" % (i/fsplit+1), 'wb', 50*(1024*1024))
    w.writelines(wbun)
    w.close()
Try running it in PyPy (https://pypy.org); it will run without changes to your code, and probably faster.
Also, C++ might be overkill, especially if you don't know it yet. Consider learning Go or D instead.

Python - Checking concordance between two huge text files

So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100Gb+. Specifically, they are in the fastq format. This format is used for DNA sequencing data, and consists of records of four lines, something like this:
#REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
#REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '#'.
So, for QA purposes, I need to compare two such files. These files should have matching headers, so the first record in the other file should also have the header '#REC1', the next should be '#REC2' and so on. I want to make sure that this is the case, before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration and string comparison would take very long, but this QA step will be run numerous times, and I can't afford to wait that long. So I thought a better way would be to sample records from a few points in the files, for example every 10% of the records. If the order of the records is messed up, I would be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and then using Python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
    f.seek(start_point)
    # look for the next beginning of record, never mind how
But now the problem is more complex, since I don't know how to coordinate between the two files, because the byte location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th line in both files to make sure they are the same, without going over the whole files?
Would appreciate any ideas/hints. Maybe iterating in parallel? But how exactly?
Thanks!
Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
    BEGIN { \
        n = 0; \
    } \
    { \
        a[n % 4] = $0; \
        if ((n+1) % 4 == 0) { \
            print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
        } \
        n++; \
    }' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ Gb files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ Gb files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
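For completeness, a rough Python sketch of the same linearization step (assuming well-formed 4-line records; as noted, expect it to be noticeably slower than the awk version):
import sys

def linearize_fastq(in_stream, out_stream):
    # Collect 4 lines at a time and emit them as one tab-separated line.
    record = []
    for line in in_stream:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            out_stream.write("\t".join(record) + "\n")
            record = []

if __name__ == "__main__":
    # e.g. python linearize_fq.py < example.fq > example.fq.linear
    linearize_fastq(sys.stdin, sys.stdout)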
Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
    for l1, l2 in zip(f1, f2):
        if l1.startswith("#REC"):
            if l1 != l2:
                print("Difference at record", l1)
                break
    else:
        print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.
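A Python 2 sketch of the same loop with izip swapped in:
from itertools import izip

with open(file1) as f1, open(file2) as f2:
    for l1, l2 in izip(f1, f2):          # lazy, so both files are still streamed
        if l1.startswith("#REC") and l1 != l2:
            print "Difference at record", l1
            break
    else:
        print "No differences"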
Have you looked into using the rdiff command?
The upsides of rdiff are:
with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It has never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program.
The downsides of rdiff are:
it's not part of the standard Linux/UNIX distribution – you have to install the librsync package.
the delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The first one produces a special signature file. In the second step, a delta is created using another rdiff call (see the link below). While the 2-step process may seem annoying, it has the benefit of providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
import sys
import re

""" To find the differing records in two HUGE files. This is expected to
use minimal memory. """

def get_rec_num(fd):
    """ Look for the record number. If not found return -1. """
    while True:
        line = fd.readline()
        if len(line) == 0:
            break
        match = re.search(r'^#REC(\d+)', line)
        if match:
            num = int(match.group(1))
            return num
    return -1

f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')
hf1 = dict()
hf2 = dict()

while f1 or f2:
    if f1:
        r = get_rec_num(f1)
        if r < 0:
            f1.close()
            f1 = None
        else:
            # if r is found in f2 hash, no need to store in f1 hash
            if r not in hf2:
                hf1[r] = 1
            else:
                del hf2[r]
    if f2:
        r = get_rec_num(f2)
        if r < 0:
            f2.close()
            f2 = None
        else:
            # if r is found in f1 hash, no need to store in f2 hash
            if r not in hf1:
                hf2[r] = 1
            else:
                del hf1[r]

print('Records found only in f1:')
for r in hf1:
    print('{}, '.format(r))
print('Records found only in f2:')
for r in hf2:
    print('{}, '.format(r))
Both answers from @AlexReynolds and @TimPietzcker are excellent from my point of view, but I would like to put in my two cents. You might also want to speed up your hardware:
Replace your HDD with an SSD.
Take n SSDs and create a RAID 0. In a perfect world you will get an n-times speedup for your disk I/O.
Adjust the size of the chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to be executed faster than sixteen 1 MB reads. (This applies to a single SSD; for RAID 0 optimization one has to take a look at the RAID controller options and capabilities.)
The last option is especially relevant to NOR SSDs. Don't pursue minimal RAM utilization; read as much as is needed to keep your disk reads fast. For instance, parallel reads of single rows from two files can probably slow down reading - imagine an HDD where the two rows of the two files are always on the same side of the same magnetic disk(s).
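As a small illustration of the chunk-size point in Python (the 16 MB figure is just an assumed starting value to tune for your drive or RAID controller):
# Python 3 sketch: open in binary with a large buffer so each underlying read
# is one big sequential request, while the loop still sees one line at a time.
BUF = 16 * 1024 * 1024

with open("example_1.fq", "rb", buffering=BUF) as f1, \
     open("example_2.fq", "rb", buffering=BUF) as f2:
    for l1, l2 in zip(f1, f2):
        if l1.startswith(b"#REC") and l1 != l2:
            print("Mismatch near", l1.decode().strip())
            break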

Output every nth byte of stdin

What's the easiest efficient way to read from stdin and output every nth byte?
I'd like a command-line utility that works on OS X, and would prefer to avoid compiled languages.
This Python script is fairly slow (25s for a 3GB file when n=100000000):
#!/usr/bin/env python
import sys

n = int(sys.argv[1])

while True:
    chunk = sys.stdin.read(n)
    if not chunk:
        break
    sys.stdout.write(chunk[0])
Unfortunately we can't use sys.stdin.seek to avoid reading the entire file.
Edit: I'd like to optimize for the case when n is a significant fraction of the file size. For example, I often use this utility to sample 500 bytes at equally-spaced locations from a large file.
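For the regular-file case mentioned in the edit, a minimal sketch that seeks instead of streaming (the argument handling and the default of 500 samples are assumptions; stdin itself is not seekable, so this needs a path):
#!/usr/bin/env python
import os
import sys

path = sys.argv[1]
samples = int(sys.argv[2]) if len(sys.argv) > 2 else 500

size = os.path.getsize(path)
step = max(size // samples, 1)
with open(path, "rb") as f:
    picked = []
    for offset in range(0, size, step):
        f.seek(offset)          # jump to each equally spaced offset
        b = f.read(1)
        if b:
            picked.append(b)
sys.stdout.buffer.write(b"".join(picked))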
NOTE: The OP changed the example n from 100 to 100000000, which effectively renders my code slower than his. Normally I would just delete my answer since it is no longer better than the original example, but my answer got a vote, so I will just leave it as it is.
The only way that I can think of to make it faster is to read everything at once and use a slice:
#!/usr/bin/env python
import sys
n = int(sys.argv[1])
data = sys.stdin.read()
print(data[::n])
Although trying to fit a 3GB file into RAM might be a very bad idea.
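A middle ground (sketch only, not benchmarked) is to stream fixed-size blocks and slice each one, carrying the sampling phase across block boundaries so memory stays at roughly one block:
#!/usr/bin/env python
import sys

n = int(sys.argv[1])
BLOCK = 1 << 20          # assumed 1 MiB block size
offset = 0               # index of the next wanted byte within the next block

while True:
    block = sys.stdin.buffer.read(BLOCK)
    if not block:
        break
    sys.stdout.buffer.write(block[offset::n])
    offset = (offset - len(block)) % n   # where the next sample falls in the next block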

Text processing - Python vs Perl performance [closed]

Here are my Perl and Python scripts to do some simple text processing on about 21 log files, each about 300 KB to 1 MB (maximum), repeated 5 times (a total of 125 files, due to the log being repeated 5 times).
Python Code (code modified to use compiled re and using re.I)
#!/usr/bin/python
import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()
    mprev = exists_re.search(currline)
    if mprev:
        xlogtime = mprev.group(1)
    mcurr = location_re.search(currline)
    if mcurr:
        print fn, xlogtime, mcurr.group(1)
Perl Code
#!/usr/bin/perl

while (<>) {
    chomp;
    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }
    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}
And on my PC, both scripts generate exactly the same result file of 10,790 lines. Here is the timing, done with Cygwin's Perl and Python implementations.
User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* >
summarypy.log
real 0m8.185s
user 0m8.018s
sys 0m0.092s
User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* >
summarypl.log
real 0m1.481s
user 0m1.294s
sys 0m0.124s
Originally, it took 10.2 seconds using Python and only 1.9 seconds using Perl for this simple text processing.
(UPDATE) But after switching to the compiled re version of Python, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.
Is there a way to improve the speed of the Python version at all, or is it obvious that Perl will be the speedy one for simple text processing?
By the way, this was not the only test I did for simple text processing... And no matter how I write the source code, Perl always wins by a large margin. Not once did Python perform better for simple m/regex/ match-and-print work.
Please do not suggest using C, C++, Assembly, other flavours of Python, etc.
I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using its modules).
Boy, I wish I could use Python for all my tasks because of its readability, but giving up speed? I don't think so.
So, please suggest how the code can be improved to get comparable results with Perl.
UPDATE: 2012-10-18
As other users suggested, Perl has its place and Python has its.
So, for this question, one can safely conclude that for a simple regex match on each line of hundreds or thousands of text files and writing the results to a file (or printing to the screen), Perl will always, always WIN in performance for this job. It is as simple as that.
Please note that when I say Perl wins in performance... only standard Perl and Python are compared... not resorting to obscure modules (obscure for a normal user like me) and not calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all those extra steps and installations for a simple text matching job.
So, Perl rocks for text processing and regex.
Python has its place to rock in other places.
Update 2013-05-29: An excellent article that does a similar comparison is here. Perl again wins for simple text matching... And for more details, read the article.
This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.
One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')
And then in your loop:
mprev = exists_re.search(currline)
and
mcurr = location_re.search(currline)
That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re in a loop without compiling first is bad practice in Python.
Hypothesis: Perl spends less time backtracking in lines that don't match due to optimisations it has that Python doesn't.
What do you get by replacing
^(.*?) INFO.*Such a record already exists
with
^((?:(?! INFO).)*?) INFO.*Such a record already
or
^(?>(.*?) INFO).*Such a record already exists
Function calls are a bit expensive in terms of time in Python. And yet you have a loop invariant function call to get the file name inside the loop:
fn = fileinput.filename()
Move this line above the for loop and you should see some improvement to your Python timing. Probably not enough to beat out Perl though.
In general, all artificial benchmarks are evil. However, everything else being equal (the algorithmic approach), you can make improvements on a relative basis. It should be noted, though, that I don't use Perl, so I can't argue in its favor. That being said, with Python you can try using Pyrex or Cython to improve performance. Or, if you are adventurous, you can try converting the Python code into C++ via Shed Skin (which works for most of the core language, and some - but not all - of the core modules).
Nevertheless, you can follow some of the tips posted here:
http://wiki.python.org/moin/PythonSpeed/PerformanceTips
I expect Perl to be faster. Just out of curiosity, can you try the following?
#!/usr/bin/python
import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)
                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()
Update as reaction to "it is too complex".
Of course it looks more complex than the Perl version. Perl was built around regular expressions; you can hardly find an interpreted language that is faster at regular expressions. The Perl syntax...
while (<>) {
    ...
}
... also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out:
#!/usr/bin/python
import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:       # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)
            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)
Here the input_files() generator could be placed elsewhere (say, in another module), or it can be reused. It is possible to mimic even Perl's while (<>) {...} easily, though not in the same way syntactically:
#!/usr/bin/python
import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f:    # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)
    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)
Then the last for may look as easy (in principle) as the Perl's while (<>) {...}. Such readability enhancements are more difficult in Perl.
Anyway, it will not make the Python program faster. Perl will be faster again here. Perl is a file/text cruncher. But--in my opinion--Python is a better programming language for more general purposes.
