I'm trying to count the occurrences of strings in text files.
The text files look like this, and each file is about 200MB.
String1 30
String2 100
String3 23
String1 5
.....
I want to save the counts into dict.
count = {}
for filename in os.listdir(path):
if(filename.endswith("idx")):
continue
print filename
f = open(os.path.join(path, filename))
for line in f:
(s, cnt) = line[:-1].split("\t")
if(s not in count):
try:
count[s] = 0
except MemoryError:
print(len(count))
exit()
count[s] += int(cnt)
f.close()
print(len(count))
I got memory error at count[s] = 0,
but I still have much more available memory in my computer.
How do I resolve this problem?
Thank you!
UPDATE:
I copied the actual code here.
My python version is 2.4.3, and the machine is running linux and has about 48G memory, but it only consumes less than 5G. the code stops at len(count)=44739243.
UPDATE2:
The strings can be duplicated (not unique string), so I want to add up all the counts for the strings. The operation I want is just reading the count for each string. There are about 10M lines per each file, and I have more than 30 files. I expect the count is less than 100 billion.
UPDATE3:
the OS is linux 2.6.18.
cPython 2.4 can have problems with large memory allocations, even on x64:
$ python2.4 -c "'a' * (2**31-1)"
Traceback (most recent call last):
File "<string>", line 1, in ?
MemoryError
$ python2.5 -c "'a' * (2**31-1)"
$
Update to a recent python interpreter (like cPython 2.7) to get around these issues, and make sure to install a 64-bit version of the interpreter.
If the strings are of nontrivial size (i.e. longer than the <10 bytes in your example), you may also want to simply store their hashes instead, or even use a probabilistic (but way more efficient) storage like a bloom filter. To store their hashes, replace the file handling loop with
import hashlib
# ...
for line in f:
s, cnt = line[:-1].split("\t")
idx = hashlib.md5(s).digest()
count[idx] = count.get(idx, 0) + int(cnt)
# ...
I'm not really sure why this crash happens. How long is the estimated average size of your strings? 44 million strings, if they are somewhat lengthy, you should maybe consider hashing them, as already suggested. The downside is, that you loose the option to list your unique keys, you can just check, if a string is in your data or not.
Concerning the memory limit already being hit at 5 GB, maybe it's related to your outdated python version. If you have the option to update, get 2.7. Same syntax (plus some extras), no issues. Well, I don't even know if the following code is still compatible with 2.4, maybe you have to kick out the with-statement again, at least this is how you would write it in 2.7.
The main difference to your version is to run garbage collection by hand. Additionally you can raise the memory limit, that python uses. As you mentioned, it only uses a small fraction of actual ram, so in case there is some strange default setting prohibiting it to grow larger, try this:
MEMORY_MB_MAX = 30000
import gc
import os
import resource
from collections import defaultdict
resource.setrlimit(resource.RLIMIT_AS, (MEMORY_MB_MAX * 1048576L, -1L))
count = defaultdict(int)
for filename in os.listdir(path):
if(filename.endswith("idx")):
continue
print filename
with open(os.path.join(path, filename)) as f:
for line in f:
s, cnt = line[:-1].split("\t")
count[s] += int(cnt)
print(len(count))
gc.collect()
Besides that, I don't get the meaning of your line s, cnt = line[:-1].split("\t"), especially the [:-1]. If the files look like you noted, then this would erase the last digits of your numbers. Is this on purpose?
If all you are trying to do is count the number of unique strings, you could hugely reduce your memory footprint by hashing each string:
(s, cnt) = line[:-1].split("\t")
s = hash(s)
Related
So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100Gb+. Specifically, they are in the fastq format. This format is used for DNA sequencing data, and consists of records of four lines, something like this:
#REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
#REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '#'.
So, for QA purposes, I need to compare two such files. These files should have matching headers, so the first record in the other file should also have the header '#REC1', the next should be '#REC2' and so on. I want to make sure that this is the case, before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration a string comparisson would take very long, but this QA step will be run numerous times, and I can't afford to wait that long. So I thought a better way would be to sample records from a few points in the files, for example every 10% of the records. If the order of the records is messed up, I'd be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and than using python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
f.seek(start_point)
# look for the next beginning of record, never mind how
But now the problem is more complex, since I don't know how to coordinate between the two files, since the bytes location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th lines in both files to make sure they are the same, without going over the whole file?
Would appreciate any ideas\hints. Maybe iterating in parallel? but how exactly?
Thanks!
Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
BEGIN { \
n = 0; \
} \
{ \
a[n % 4] = $0; \
if ((n+1) % 4 == 0) { \
print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
} \
n++; \
}' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ Gb files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ Gb files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
for l1, l2 in zip(f1,f2):
if l1.startswith("#REC"):
if l1 != l2:
print("Difference at record", l1)
break
else:
print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.
Have you looked into using the rdiff command.
The upsides of rdiff are:
with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program
The downsides of rdiff are:
it's not part of standard Linux/UNIX distribution – you have to
install the librsync package.
delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The
first one produces a special signature file. In the second step, a
delta is created using another rdiff call (all shown below). While
the 2-step process may seem annoying, it has the benefits of
providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
import sys
import re
""" To find of the difference record in two HUGE files. This is expected to
use of minimal memory. """
def get_rec_num(fd):
""" Look for the record number. If not found return -1"""
while True:
line = fd.readline()
if len(line) == 0: break
match = re.search('^#REC(\d+)', line)
if match:
num = int(match.group(1))
return(num)
return(-1)
f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')
hf1 = dict()
hf2 = dict()
while f1 or f2:
if f1:
r = get_rec_num(f1)
if r < 0:
f1.close()
f1 = None
else:
# if r is found in f2 hash, no need to store in f1 hash
if not r in hf2:
hf1[r] = 1
else:
del(hf2[r])
pass
pass
if f2:
r = get_rec_num(f2)
if r < 0:
f2.close()
f2 = None
else:
# if r is found in f1 hash, no need to store in f2 hash
if not r in hf1:
hf2[r] = 1
else:
del(hf1[r])
pass
pass
print('Records found only in f1:')
for r in hf1:
print('{}, '.format(r));
print('Records found only in f2:')
for r in hf2:
print('{}, '.format(r));
Both answers from #AlexReynolds and #TimPietzcker are excellent from my point of view, but I would like to put my two cents in. You also might want to speed up your hardware:
Raplace HDD with SSD
Take n SSD's and create a RAID 0. In the perfect world you will get n times speed up for your disk IO.
Adjust the size of chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to be executed faster than sixteen 1 MB reads. (this applies to a single SSD, for RAID 0 optimization one has to take a look at RAID controller options and capabilities).
The last option is especially relevant to NOR SSD's. Don't pursuit the minimal RAM utilization, but try to read as much as it needs to keep your disk reading fast. For instance, parallel reads of single rows from two files can probably speed down reading - imagine an HDD where two rows of the two files are always on the same side of the same magnetic disk(s).
I don't know how advanced this will be but, I am trying to generate a huge list of permutations and then parse them to print them in a specific way. THe problem I'm having is that when I use the following line:
x=([''.join(i) for i in itertools.product("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890",repeat=8)])
I run into the problem that my computer is either 32 bit or has only 4 gigs of RAM. So I am wondering if there is a way to stop this line halfway through, write all the ones generated to a txt file, then continue. The idea is that that way the variable is stored on the disk instead of in the RAM. I will then parse the txt file later to get the situation I am looking for. I'm on Windows 7 by the way.
Thanks!
First thing to note:
>>> len("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890") ** 8
218340105584896L
>>> format(_, ",")
'218,340,105,584,896'
That's how many products there are. Written out, each would take 8 bytes plus a newline character (or other delimiter).
>>> 62 ** 8 * 9
1965060950264064L
So if you buy about 2,000 1-terabyte disks, you'll have just enough disk space to store them all.
Still want to go down this path? ;-) Once you buy all that hardware:
import itertools
s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
with open("perms.txt", "w") as f:
for i in itertools.product(s, repeat=8):
f.write("".join(i) + "\n")
will do it using very little RAM.
Just break it down into its components and save as you go. Something like:
for i in itertools.product(string.ascii_letters + string.digits, repeat=8):
print(''.join(i))
I'm trying to run the following code but for some reason I get the following error: "TypeError: limit must be an integer".
Reading csv data file
import sys
import csv
maxInt = sys.maxsize
decrement = True
while decrement:
decrement = False
try:
**csv.field_size_limit(maxInt)**
except OverflowError:
maxInt = int(maxInt/10)
decrement = True
with open("Data.csv", 'rb') as textfile:
text = csv.reader(textfile, delimiter=" ", quotechar='|')
for line in text:
print ' '.join(line)
The error occurs in the starred line. I have only added the extra bit above the csv read statement as the file was too large to read normally. Alternatively, I could change the file to a text file from csv but I'm not sure whether this will corrupt the data further I can't actually see any of the data as the file is >2GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter too much, 2KB or 2GB, whatever.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to potentially debug are to run print sys.maxsize, and to just open up a python interpreter, import sys, csv and then run csv.field_size_limit(sys.maxsize). If you are getting some terribly small number or an exception, you may have a bad install of Python. Otherwise, try to take a simpler version of your file. Maybe the first line, or the first several lines and just 1 column. See if you can reproduce the smallest possible case and remove the variability of your system and the file size.
On Windows 7 64bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simlpy use csv.field_size_limit(2147483647) which of course cause issues on other platforms. In my case this was adquat to identify the broken value in the CSV, fix the export options in the other application and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question
I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
def readmyfile(filename, bytes=2, endian='>h'):
totalBytes = os.path.getsize(filename)
values = empty(totalBytes/bytes)
with open(filename, 'rb') as f:
for i in range(len(values)):
values[i] = struct.unpack(endian, f.read(bytes))[0]
return values
filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
I would directly read until EOF (it means checking for receiving an empty string), removing then the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store all that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time
values[i] = struct.unpack(endian,f.read(bytes))[0]
Why don't you read like, 1024 bytes at a time?
I have had the same kind of problem, although in my particular case I have had to convert a very strange binary format (500 MB) file with interlaced blocks of 166 elements that were 3-bytes signed integers; so I've had also the problem of converting from 24-bit to 32-bit signed integers that slow things down a little.
I've resolved it using NumPy's memmap (it's just a handy way of using Python's memmap) and struct.unpack on large chunk of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2) with f being a bigish file are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (with disc sectors usually several KB) off disc into memory because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the single byte conversions to a short and single calls to numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call struct.unpack_from to convert everything in a buffer vs one short at a time.
Consider:
#!/usr/bin/python
import random
import os
import struct
import numpy
import ctypes
def read_wopper(filename,bytes=2,endian='>h'):
buf_size=1024*2
buf=ctypes.create_string_buffer(buf_size)
new_buf=[]
with open(filename,'rb') as f:
while True:
st=f.read(buf_size)
l=len(st)
if l==0:
break
fmt=endian[0]+str(l/bytes)+endian[1]
new_buf+=(struct.unpack_from(fmt,st))
na=numpy.array(new_buf)
return na
fn='bigintfile'
def createmyfile(filename):
bytes=165924350
endian='>h'
f=open(filename,"wb")
count=0
try:
for int in range(0,bytes/2):
# The first 32,767 values are [0,1,2..0x7FFF]
# to allow testing the read values with new_buf[value<0x7FFF]
value=count if count<0x7FFF else random.randint(-32767,32767)
count+=1
f.write(struct.pack(endian,value&0x7FFF))
except IOError:
print "file error"
finally:
f.close()
if not os.path.exists(fn):
print "creating file, don't count this..."
createmyfile(fn)
else:
read_wopper(fn)
print "Done!"
I created a file of random shorts signed ints of 165,924,350 bytes (158.24 MB) which comports to 82,962,175 signed 2 byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, python 2.6.1 64 bit, 2.93 gHz Core i7, 8 GB ram. If you change buf_size=1024*2 in read_wopper to buf_size=2**16 the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottle neck, I think, is the single byte calls to unpack -- not your 2 byte reads from disc. You might want to make sure that your data files are not fragmented and if you are using OS X that your free disc space (and here) is not fragmented.
Edit I posted the full code to create then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create since the creation is one short at a time.
Let's say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Often times I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.
Suppose I want to go to Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid --- that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?
I do not want to traverse the file line by line.
I tried using mmap, but while poking around here on SO, I learned that that's probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn't really figure out how to simply move N lines away from my current position. If I can at least just "jump" to the Nth line then I can use .split() and grab the Kth integer that way.
The nuance here is that I don't just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.
Any ideas? I hope this is enough information.
Thanks!
Python's seek goes to a byte offset in a file, not to a line offset, simply because that's the way modern operating systems and their filesystems work -- the OS/FS just don't record or remember "line offsets" in any way whatsoever, and there's no way for Python (or any other language) to just magically guess them. Any operation purporting to "go to a line" will inevitably need to "walk through the file" (under the covers) to make the association between line numbers and byte offsets.
If you're OK with that and just want it hidden from your sight, then the solution is the standard library module linecache -- but performance won't be any better than that of code you could write yourself.
If you need to read from the same large file multiple times, a large optimization would be to run once on that large file a script that builds and saves to disk the line number - to - byte offset correspondence (technically an "index" auxiliary file); then, all your successive runs (until the large file changes) could very speedily use the index file to navigate with very high performance through the large file. Is this your use case...?
Edit: since apparently this may apply -- here's the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:
import array
import sys
BLOCKSIZE = 1024 * 1024
def reader(f):
blockstart = 0
while True:
block = f.read(BLOCKSIZE)
if not block: break
inblock = 0
while True:
nextnl = block.find(b'\n', inblock)
if nextnl < 0:
blockstart += len(block)
break
yield nextnl + blockstart
inblock = nextnl + 1
def doindex(fn):
with open(fn, 'rb') as f:
# result format: x[0] is tot # of lines,
# x[N] is byte offset of END of line N (1+)
result = array.array('L', [0])
result.extend(reader(f))
result[0] = len(result) - 1
return result
def main():
for fn in sys.argv[1:]:
index = doindex(fn)
with open(fn + '.indx', 'wb') as p:
print('File', fn, 'has', index[0], 'lines')
index.tofile(p)
main()
and then to use it, for example, the following useindex.py:
import array
import sys
def readline(n, f, findex):
f.seek(findex[n] + 1)
bytes = f.read(findex[n+1] - findex[n])
return bytes.decode('utf8')
def main():
fn = sys.argv[1]
with open(fn + '.indx', 'rb') as f:
findex = array.array('l')
findex.fromfile(f, 1)
findex.fromfile(f, findex[0])
findex[0] = -1
with open(fn, 'rb') as f:
for n in sys.argv[2:]:
print(n, repr(readline(int(n), f, findex)))
main()
Here's an example (on my slow laptop):
$ time py3 makeindex.py kjv10.txt
File kjv10.txt has 100117 lines
real 0m0.235s
user 0m0.184s
sys 0m0.035s
$ time py3 useindex.py kjv10.txt 12345 98765 33448
12345 '\r\n'
98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
33448 'the priest appointed officers over the house of the LORD.\r\n'
real 0m0.049s
user 0m0.028s
sys 0m0.020s
$
The sample file is a plain text file of King James' Bible:
$ wc kjv10.txt
100117 823156 4445260 kjv10.txt
100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three random-y lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index, a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken for the index should not inconvenience even a small machine -- even my tiny slow laptop has 2GB after all;-).
You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding -- works with any subset like ASCII too of course, eg that KJ text file is ASCII) and don't bother collapsing \r\n into a single character if that's what the file has as line terminator (it's pretty trivial to do that after reading each line if you want).
The problem is that since your lines are not of fixed length, you have to pay attention to line end markers to do your seeking, and that effectively becomes "traversing the file line by line". Thus, any viable approach is still going to be traversing the file, it's merely a matter of what can traverse it fastest.
Another solution, if the file is potentially going to change a lot, is to go full-way to a proper database. The database engine will create and maintain the indexes for you so you can do very fast searches/queries.
This may be an overkill though.