I'm writing a script to find all duplicate files in two different file trees. The script works fine except it's too slow to be practical on large numbers of files (>1000). Profiling my script with cProfile revealed that a single line in my code is responsible for almost all of the execution time.
The line is a call to os.system():
cmpout = os.system("cmp -s -n 10MiB %s %s" % (callA, callB));
This call is inside a for loop that runs about N times if I have N identical files. The average execution time is 0.53 seconds per call:
ncalls tottime percall cumtime percall filename:lineno(function)
563 301.540 0.536 301.540 0.536 {built-in method system}
This of course quickly adds up for more than a thousand files. I have tried speeding it up by replacing it with call from the subprocess module:
cmpout = call("cmp -s -n 10MiB %s %s" % (callA, callB), shell=True);
But this has a near identical execution time. I have also tried reducing the byte limit on the cmp command itself, but this only saves a very small amount of time.
Is there any way I can speed this up?
Full function I'm using:
def dirintersect(dirA, dirB):
    intersectionAB = []
    filesA = listfiles(dirA);
    filesB = listfiles(dirB);
    for (pathB, filenameB) in filesB:
        for (pathA, filenameA) in filesA:
            if filenameA == filenameB:
                callA = shlex.quote(os.path.join(pathA, filenameA));
                callB = shlex.quote(os.path.join(pathB, filenameB));
                cmpout = os.system("cmp -s -n 10MiB %s %s" % (callA, callB));
                #cmpout = call("cmp -s -n 10MiB %s %s" % (callA, callB), shell=True);
                if cmpout is 0:
                    intersectionAB.append((filenameB, pathB, pathA))
    return intersectionAB
Update: Thanks for all the feedback! I will try to address most of your comments and give some more information.
@larsmans: You're absolutely right that my nested for loop scales with n². I had already figured out myself that I could do the same with a dictionary or set and set operations. But even the overhead of this 'bad' algorithm is insignificant compared to the time it takes to run os.system. The actual if clause triggers approximately once for each filename (that is, I expect there to be only one duplicate of each filename). So os.system only gets run N times, not N² times, but even at this linear rate it isn't fast enough.
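Something along these lines is what I have in mind for the lookup (an untested sketch; listfiles is my helper from above, the rest is illustrative):

from collections import defaultdict

# Untested sketch: index dirA's files by filename once, then only compare
# files that actually share a name instead of scanning all of filesA each time.
def build_name_index(files):
    index = defaultdict(list)
    for path, filename in files:
        index[filename].append(path)
    return index

# indexA = build_name_index(listfiles(dirA))
# for pathB, filenameB in listfiles(dirB):
#     for pathA in indexA.get(filenameB, []):
#         ...run the comparison only on these candidate pairs...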
@larsmans and @Alex Reynolds: The reason I didn't choose a hashing solution like you suggest is that in the use case I envision, I compare a smaller directory tree with a larger one, and hashing all of the files in the larger tree would take a very long time (it could be all the files on an entire partition), while I would only need to do the actual comparison on a small fraction of the files.
@abarnert: The reason I use shell=True in the call command is simply because I started with os.system and then read that it was better to use subprocess.call, and this was the way to convert between the two. If there is a better way to run the cmp command, I'd like to know. The reason I quote the arguments is that I had issues with spaces in filenames when I just passed the os.path.join result into the command.
Thanks for your suggestion, I will change it to if cmpout == 0
@Gabe: I don't know how to time a bash command, but I believe it runs much faster than half a second when I just run the command.
I said the byte limit didn't matter much because when I changed it to only 10 KiB, the total execution time of my test run changed to 290 seconds instead of around 300 seconds. The reason I kept the limit is to prevent it from comparing really large files (such as 1 GiB video files).
Update 2:
I have followed @abarnert's suggestion and changed the call to:
cmpout = call(["cmp", '-s', '-n', '10MiB', callA, callB])
The execution time for my test scenario has now dropped from 300 seconds to 270 seconds. Not sufficient yet, but it's a start.
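One further idea I may try (a sketch of my own, not something from the answers below): do the bounded comparison in-process instead of spawning cmp at all, since the per-call process startup seems to be the real cost. The 10 MiB cap mirrors the -n option above; the chunk size is arbitrary:

def files_equal(path_a, path_b, limit=10 * 1024 * 1024, chunk_size=64 * 1024):
    # Compare at most `limit` bytes of two files without shelling out to cmp.
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        remaining = limit
        while remaining > 0:
            a = fa.read(min(chunk_size, remaining))
            b = fb.read(min(chunk_size, remaining))
            if a != b:
                return False
            if not a:           # both files ended at the same point
                return True
            remaining -= len(a)
    return True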
You're using the wrong algorithm to do this. Comparing all pairs of files takes Θ(n²) time for n files, while you can get the intersection of two directories in linear time by hashing the files:
from hashlib import sha512
import os
import os.path

def hash_file(fname):
    with open(fname, 'rb') as f:   # binary mode, since we are hashing arbitrary file contents
        return sha512(f.read()).hexdigest()

def listdir(d):
    return [os.path.join(d, fname) for fname in os.listdir(d)]

def dirintersect(d1, d2):
    files1 = {hash_file(fname): fname for fname in listdir(d1)}
    return [(files1[hash_file(fname)], fname) for fname in listdir(d2)
            if hash_file(fname) in files1]
This function loops over the first directory, storing filenames indexed by their SHA-512 hash, then filters the files in the second directory by the presence of files with the same hash in the index built from the first directory. A few obvious optimizations are left as an exercise for the reader :)
The function assumes the directories contain only regular files or symlinks to those, and it reads the files into memory in one go (but that's not too hard to fix).
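For example, the read-into-memory part is easy to avoid by hashing in fixed-size chunks; a sketch (the 1 MiB chunk size is an arbitrary choice):

from hashlib import sha512   # already imported above

def hash_file(fname, chunk_size=1 << 20):
    # Stream the file through the hash in 1 MiB chunks instead of reading it whole.
    h = sha512()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()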
(SHA-512 doesn't actually guarantee equality of files, so a full byte-by-byte comparison can be added as a backup measure, though you'll be hard-pressed to find two files with the same SHA-512.)
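If you want that backup measure, a minimal sketch is to confirm each hash match with the standard library's filecmp before reporting a duplicate:

import filecmp

def definitely_same(path1, path2):
    # Byte-by-byte comparison as a safety net after a hash match.
    return filecmp.cmp(path1, path2, shallow=False)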
Related
I'm automating some tedious shell tasks, mostly file conversions, in a kind of blunt force way with os.system calls (Python 2.7). For some bizarre reason, however, my running interpreter doesn't seem to be able to find the files that I just created.
Example code:
import os, time, glob
# call a node script to template a word document
os.system('node wordcv.js')
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# move to the directory that pdfwriter prints to
os.chdir('/users/shared/PDFwriter/pauliglot')
print glob.glob('*.pdf')
I expect to get a list of length 1 containing the resulting filename; instead I get an empty list.
The same occurs with
pdfs = [file for file in os.listdir('/users/shared/PDFwriter/pauliglot') if file.endswith(".pdf")]
print pdfs
I've checked by hand, and the expected files are actually where they're supposed to be.
Also, I was under the impression that os.system blocked, but just in case it doesn't, I also stuck a time.sleep(1) in there before looking for the files. (That's more than enough time for the other tasks to finish.) Still nothing.
Hmm. Help? Thanks!
You should add a wait after the call to launch. Launch will spawn the task in the background and return before the document has finished printing. You can either put in some arbitrary sleep statements or, if you know what the expected filename will be, poll for the file's existence.
import time
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# give word about 30 seconds to finish printing the document
time.sleep(30)
Alternative:
import time
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# wait for a maximum of 90 seconds
for x in xrange(0, 90):
    time.sleep(1)
    if os.path.exists('/path/to/expected/filename'):
        break
A reference for why the wait may need to be longer than 1 second is here.
So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100 GB+. Specifically, they are in the FASTQ format. This format is used for DNA sequencing data and consists of records of four lines each, something like this:
#REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
#REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '#'.
So, for QA purposes, I need to compare two such files. The files should have matching headers: if the first record in one file has the header '#REC1', so should the first record in the other; the next should be '#REC2', and so on. I want to make sure that this is the case before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration and string comparison would take very long, but this QA step will be run numerous times and I can't afford to wait that long. So I thought a better way would be to sample records at a few points in the files, for example every 10% of the records. If the order of the records is messed up, I'd be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and then using Python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
    f.seek(start_point)
    # look for the next beginning of record, never mind how
But now the problem is more complex, since I don't know how to coordinate between the two files, because the byte location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th line in both files to make sure they are the same, without going over the whole file?
Would appreciate any ideas/hints. Maybe iterating in parallel? But how exactly?
Thanks!
Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
    BEGIN { \
        n = 0; \
    } \
    { \
        a[n % 4] = $0; \
        if ((n+1) % 4 == 0) { \
            print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
        } \
        n++; \
    }' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ GB files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ GB files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
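If you want to keep the rest of your workflow in Python, one compromise is to let Python merely drive the awk/diff pipeline, so the heavy I/O still happens in the external tools. A sketch, assuming a Unix shell, Python 3.5+, and the linearize_fq.awk file from above:

import subprocess

# Sketch: run the linearize+diff pipeline through bash process substitution.
cmd = ("diff <(awk -f linearize_fq.awk example_1.fq) "
       "<(awk -f linearize_fq.awk example_2.fq)")
result = subprocess.run(["bash", "-c", cmd],
                        stdout=subprocess.PIPE, universal_newlines=True)
print("No differences" if result.returncode == 0 else result.stdout)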
Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
    for l1, l2 in zip(f1, f2):
        if l1.startswith("#REC"):
            if l1 != l2:
                print("Difference at record", l1)
                break
    else:
        print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.
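If the per-line startswith() check turns out to matter, one small variation (a sketch that assumes every record is exactly four lines) is to step straight to the header lines with itertools.islice:

from itertools import islice

with open(file1) as f1, open(file2) as f2:
    # Headers sit on every 4th line in a well-formed FASTQ file.
    for h1, h2 in zip(islice(f1, 0, None, 4), islice(f2, 0, None, 4)):
        if h1 != h2:
            print("Difference at record", h1)
            break
    else:
        print("No differences")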
Have you looked into using the rdiff command?
The upsides of rdiff are:
with the same 4.5 GB files, rdiff only ate about 66 MB of RAM and scaled very well. It has never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program
The downsides of rdiff are:
it's not part of a standard Linux/UNIX distribution – you have to install the librsync package.
delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The first one produces a special signature file. In the second step, a delta is created using another rdiff call (see the sketch below). While the 2-step process may seem annoying, it has the benefit of providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
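A minimal sketch of that two-step workflow, driven from Python (the file names are placeholders, and it assumes librsync's rdiff is on the PATH):

import subprocess

# Step 1: produce a signature of the "old" file.
subprocess.check_call(["rdiff", "signature", "old.fq", "old.sig"])
# Step 2: produce a delta of the "new" file against that signature.
subprocess.check_call(["rdiff", "delta", "old.sig", "new.fq", "new.delta"])

rdiff patch can then reconstruct the new file from the old one plus the delta, which is the combined diff-and-patch capability mentioned above.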
import sys
import re

""" Find the differing records in two HUGE files. This is expected to
use minimal memory. """

def get_rec_num(fd):
    """ Look for the next record number. If not found, return -1. """
    while True:
        line = fd.readline()
        if len(line) == 0: break
        match = re.search(r'^#REC(\d+)', line)
        if match:
            num = int(match.group(1))
            return num
    return -1

f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')

hf1 = dict()
hf2 = dict()

while f1 or f2:
    if f1:
        r = get_rec_num(f1)
        if r < 0:
            f1.close()
            f1 = None
        else:
            # if r is found in the f2 hash, no need to store it in the f1 hash
            if r not in hf2:
                hf1[r] = 1
            else:
                del hf2[r]
    if f2:
        r = get_rec_num(f2)
        if r < 0:
            f2.close()
            f2 = None
        else:
            # if r is found in the f1 hash, no need to store it in the f2 hash
            if r not in hf1:
                hf2[r] = 1
            else:
                del hf1[r]

print('Records found only in f1:')
for r in hf1:
    print('{}, '.format(r))

print('Records found only in f2:')
for r in hf2:
    print('{}, '.format(r))
Both answers from @Alex Reynolds and @TimPietzcker are excellent from my point of view, but I would like to put in my two cents. You also might want to speed up your hardware:
Replace your HDD with an SSD.
Take n SSDs and create a RAID 0. In a perfect world you will get an n-times speedup for your disk I/O.
Adjust the size of the chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to be executed faster than sixteen 1 MB reads. (This applies to a single SSD; for RAID 0 optimization one has to look at the RAID controller's options and capabilities.)
The last option is especially relevant to NOR SSDs. Don't pursue minimal RAM utilization; instead, read as much as is needed to keep your disk reads fast. For instance, parallel reads of single rows from two files can actually slow down reading - imagine an HDD where the two rows of the two files are always on the same side of the same magnetic platter(s).
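A sketch of the chunk-size point (the 16 MB figure is just the example from above; tune it for your hardware):

CHUNK_SIZE = 16 * 1024 * 1024  # one 16 MB read instead of sixteen 1 MB reads

def read_in_chunks(path, chunk_size=CHUNK_SIZE):
    # Yield large sequential chunks so the drive spends its time streaming, not seeking.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk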
I am calling a python function from console using the following command:
printf '%s\0' *.txt | xargs -0 python ./functionName.py
I have almost 10500 text files in the directory I want to process.
For every file processed, I print the file number and the total number of files:
cnt = 0
for f in sys.argv[1:]:
    cnt = cnt + 1
    print "Processing file ", cnt, " of : ", len(sys.argv[1:])
Using this, I see that len(sys.argv[1:]) is 5000, then it starts again for another 5000, and finally for the remaining 500 files.
Finally, for each text file I process, I want to write some key variables to a .csv file:
writer.writerow([var1, var2, var3, ... , varN])
The problem I have is that only the variables of the last 500 files are written.
I suspect it has to do with len(sys.argv[1:]) being 5000 although it should be 10500.
I know there is something wrong with the number of files, since it works for fewer files.
Is there some limit to 5000?
Can I fix this somehow?
This is actually one of the features of xargs: it splits large inputs into multiple invocations of the command it is supposed to call (see the xargs manual page). The default maximum number for arguments is 5000, so xargs calls your program 3 times: with 5000, 5000 and 500 file names as arguments. You can modify the xargs setting for the number of arguments per invocation using the -n option.
That said, I doubt that passing 10500 file names as command-line arguments is a very good idea. You should use Python's own facilities for scanning the file system in the way you want. In your case, it is a matter of using the glob module, for example like so:
import glob
for filename in glob.glob("*.txt"): ...
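A sketch of how your counter and CSV output could then look once Python collects the file list itself ('results.csv' and the per-file processing are placeholders for your own code):

import csv
import glob

filenames = sorted(glob.glob("*.txt"))
with open("results.csv", "w") as out:
    writer = csv.writer(out)
    for cnt, filename in enumerate(filenames, 1):
        print("Processing file %d of %d" % (cnt, len(filenames)))
        # ... process the file and compute var1, var2, ..., then:
        # writer.writerow([var1, var2, var3])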
I have a Python script that imports some log data into a StringIO object and then reads the data from that, line by line, and enters it into a DB table. The script takes considerably longer after some iterations. To explain: it takes ~1.6 seconds to run through 1500 logs, ~1m16s to run through 3500 logs, and then 20 seconds for 1100 logs!
My script is laid out as follows:
for dir in dirlist:
    file = StringIO.StringIO(...output from some system command to get logs...)
    for line in file:
        ctr += 1
        ...
        do some regex matches and replacements
        ...
        cursor.insert(..."insert query"...)
        if ctr >= 1000:
            conn.commit() # commit once every 1000 transactions
Use cProfile to profile your script and find out where the time is actually spent. It is not usually helpful to just guess where the time is spent without any details. Profiling will tell you whether the performance issue is with some regex matching stuff or the insert query.
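A minimal sketch of how that could look (process_logs is a stand-in name for whatever function wraps the loop shown above):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
process_logs()                      # stand-in for the loop from the question
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)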
I'm trying to count the occurrences of strings in text files.
The text files look like this, and each file is about 200MB.
String1 30
String2 100
String3 23
String1 5
.....
I want to save the counts into dict.
count = {}
for filename in os.listdir(path):
    if(filename.endswith("idx")):
        continue
    print filename
    f = open(os.path.join(path, filename))
    for line in f:
        (s, cnt) = line[:-1].split("\t")
        if(s not in count):
            try:
                count[s] = 0
            except MemoryError:
                print(len(count))
                exit()
        count[s] += int(cnt)
    f.close()
print(len(count))
I got a MemoryError at count[s] = 0, but I still have much more memory available on my machine.
How do I resolve this problem?
Thank you!
UPDATE:
I copied the actual code here.
My Python version is 2.4.3, the machine is running Linux and has about 48 GB of memory, but the script consumes less than 5 GB. The code stops at len(count) = 44739243.
UPDATE2:
The strings can be duplicated (they are not unique), so I want to add up all the counts for each string. The only operation I need is reading the count for each string. There are about 10M lines per file, and I have more than 30 files. I expect the total count to be less than 100 billion.
UPDATE3:
The OS is Linux 2.6.18.
CPython 2.4 can have problems with large memory allocations, even on x64:
$ python2.4 -c "'a' * (2**31-1)"
Traceback (most recent call last):
File "<string>", line 1, in ?
MemoryError
$ python2.5 -c "'a' * (2**31-1)"
$
Update to a recent Python interpreter (like CPython 2.7) to get around these issues, and make sure to install a 64-bit version of the interpreter.
If the strings are of nontrivial size (i.e. longer than the <10 bytes in your example), you may also want to simply store their hashes instead, or even use a probabilistic (but way more efficient) storage like a bloom filter. To store their hashes, replace the file handling loop with
import hashlib

# ...
for line in f:
    s, cnt = line[:-1].split("\t")
    idx = hashlib.md5(s).digest()
    count[idx] = count.get(idx, 0) + int(cnt)
# ...
I'm not really sure why this crash happens. What is the estimated average size of your strings? With 44 million strings, if they are somewhat lengthy, you should maybe consider hashing them, as already suggested. The downside is that you lose the option to list your unique keys; you can only check whether a string is in your data or not.
Concerning the memory limit already being hit at 5 GB, maybe it's related to your outdated Python version. If you have the option to update, get 2.7. Same syntax (plus some extras), no issues. Well, I don't even know whether the following code is still compatible with 2.4; you may have to kick out the with statement again, but at least this is how you would write it in 2.7.
The main difference from your version is that it runs garbage collection by hand. Additionally, you can raise the memory limit that Python uses. As you mentioned, it only uses a small fraction of the actual RAM, so in case there is some strange default setting prohibiting it from growing larger, try this:
MEMORY_MB_MAX = 30000

import gc
import os
import resource
from collections import defaultdict

resource.setrlimit(resource.RLIMIT_AS, (MEMORY_MB_MAX * 1048576L, -1L))

count = defaultdict(int)
for filename in os.listdir(path):
    if(filename.endswith("idx")):
        continue
    print filename
    with open(os.path.join(path, filename)) as f:
        for line in f:
            s, cnt = line[:-1].split("\t")
            count[s] += int(cnt)
    print(len(count))
    gc.collect()
Besides that, I don't get the meaning of your line s, cnt = line[:-1].split("\t"), especially the [:-1]. If the files look like you noted, then this would erase the last digits of your numbers. Is this on purpose?
If all you are trying to do is count the number of unique strings, you could hugely reduce your memory footprint by hashing each string:
(s, cnt) = line[:-1].split("\t")
s = hash(s)