I have a "not so" large file (~2.2GB) which I am trying to read and process...
import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt", "w")

print "Reading file"
with open("final_edge_list.txt", "r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens) == 3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line
                error.write(line + "\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) + " ==> " + line + "\n"
            error.write(string)
            continue
Am I doing something wrong?
It's been about an hour and the code is still reading the file.
Memory usage is already at 20 GB.
Why is it taking so much time and memory?
To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap your above code in a make_graph() function (this is best practice anyway), and then wrap the call to this function with a KeyboardInterrupt exception handler which prints out the gc data to a file.
def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))

if __name__ == '__main__':
    main()
Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
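If dumping every live object is too much to read through, a variation of the same idea (a sketch, assuming Python 2.7+ for collections.Counter) is to summarize the live objects by type, which usually makes the leaking type obvious:

from collections import Counter
from gc import get_objects

def write_gc_summary(fname='gc.summary.log'):
    # Count live objects by type instead of dumping every object.
    counts = Counter(type(o).__name__ for o in get_objects())
    with open(fname, 'w') as f:
        for name, count in counts.most_common(30):
            f.write("%-30s %d\n" % (name, count))

A couple of samples taken a few minutes apart will show which type's count keeps climbing.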
There are a few things you can do:
Run your code on a subset of the data, measure the time required, and extrapolate to the full size of your data. That will give you an estimate of how long it will run.
counter = 0
with open("final_edge_list.txt", "r") as f:
    for line in f:
        counter += 1
        if counter == 200000:
            break
        try:
            ...
On 1M lines it runs in ~8 sec on my machine, so a 2.2 GB file with about 100M lines should take ~15 min. Once you exceed your available memory, though, it won't hold anymore.
Your graph seems to be symmetric:
graph[src][destination] = weight
graph[destination][src] = weight
In your graph processing code, exploit the symmetry of the graph and store each edge only once; that cuts memory usage in half.
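For example, a sketch that stores each undirected edge only once under a canonical (min, max) key and normalizes the key on lookup:

graph = {}

def add_edge(graph, src, dst, weight):
    # Store each undirected edge once, under a canonical key.
    key = (src, dst) if src <= dst else (dst, src)
    graph[key] = weight

def get_weight(graph, src, dst):
    # Normalize the key the same way when reading.
    key = (src, dst) if src <= dst else (dst, src)
    return graph.get(key)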
Run profilers on your code using a subset of the data and see what happens there. The simplest option is to run
python -m cProfile --sort cumulative yourprogram.py
There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
Python's numeric types use quite a lot of memory compared to other programming languages. On my setup it appears to be 24 bytes for each number:
>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24
Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
On top of that, some versions of the Python interpreter (including CPython 2.6) are known to keep so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory is not returned to the operating system until your process terminates. Also have a look at this question I posted when I first discovered the issue:
Python: garbage collection fails?
Suggestions to work around this include:
use a subprocess to do the memory-hungry computation, e.g., based on the multiprocessing module (a sketch follows this list)
use a library that implements the functionality in C, e.g., numpy, pandas
use another interpreter, e.g., PyPy
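A minimal sketch of the first suggestion, assuming a hypothetical build_graph() worker that writes its result to disk: because the parsing runs in a child process, all the memory it allocates (free lists included) is handed back to the OS when that child exits.

from multiprocessing import Process

def build_graph(in_path, out_path):
    # Memory-hungry parsing goes here; write the result to disk so the
    # large intermediate objects die with this worker process.
    pass

if __name__ == '__main__':
    p = Process(target=build_graph,
                args=('final_edge_list.txt', 'graph.out'))
    p.start()
    p.join()
    # Everything build_graph allocated has been returned to the OS here.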
You don't need graph to be a defaultdict(dict); use a plain dict instead: graph[src, destination] = weight and graph[destination, src] = weight will do. Or store only one of them.
To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed.
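For example, a sketch of loading the edge list straight into a sparse matrix (assuming integer node ids; np.loadtxt on a 2.2 GB file is itself slow, so in practice you would read it in chunks or with pandas.read_csv, but it shows the target structure):

import numpy as np
from scipy.sparse import coo_matrix

# Three tab-separated columns as flat arrays: far cheaper than a dict of
# dicts full of boxed Python numbers.
src, dst, weight = np.loadtxt('final_edge_list.txt', delimiter='\t',
                              unpack=True)
src = src.astype(np.int64)
dst = dst.astype(np.int64)

n = int(max(src.max(), dst.max())) + 1
# The graph is symmetric, so storing one direction is enough.
adj = coo_matrix((weight, (src, dst)), shape=(n, n)).tocsr()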
What do you plan to do with your nodes list afterwards?
Related
I have a large delimited file. I need to apply a function to each line in this file where each function call takes a while. So I have sharded the main file into subfiles like <shard-dir>/lines_<start>_<stop>.tsv and
am applying a function via pool.starmap to each file. Since I want to also maintain the results, I am writing the results as they come to a corresponding output file: <output-shard-dir>/lines_<start>_<stop>_results.tsv.
The function I am mapping looks something like:
# this is pseudo-code, but similar to what I am using
def process_shard_file(file):
    output_file = output_filename_from_shard_filename(file)
    with open(file, 'r') as fi, open(output_file, 'w') as fo:
        result = heavy_computation_function(fi.readline())
        fo.write(stringify(result))
the multiprocessing is then started via something like:
shard_files = [...]  # a lot of filenames
with Pool(processes=os.cpu_count()) as pool:
    sargs = [(fname,) for fname in shard_files]
    pool.starmap(process_shard_file, sargs)
When monitoring my computer's resources with htop I see that all cores are at full throttle, which is fine. However, the memory usage just keeps increasing until it hits swap, and then until swap is also full.
I do not understand why this happens, since several batches (n * cpu_cores) of files from process_shard_file complete successfully. So why isn't the memory stable, assuming that heavy_computation_function uses essentially the same amount of memory regardless of the file and that result is also roughly the same size?
Update
def process_shard_file(file):
    output_file = output_filename_from_shard_filename(file)
    with open(file, 'r') as fi, open(output_file, 'w') as fo:
        result = fi.readline()  # heavy_computation_function(fi.readline())
        fo.write(result)
The version above does not seem to cause the memory leak; the result from heavy_computation_function can be thought of as basically another line to be written to the output file.
So what does heavy_computation_function look like?
def heavy_computation_function(fileline):
    numba_input = convert_line_to_numba_input(fileline)
    result = cached_njitted_function(numba_input)
    return convert_to_more_friendly_format(result)
I know this is still fairly vague, but I am trying to see whether this is a general problem or not. I have also tried adding maxtasksperchild=1 to my Pool to try to prevent leakage, to no avail.
Your program works, but only for a little while before self-destructing due to resource leakage. One could choose to accept such un-diagnosed leakage as a fact of life that won't change today. The documentation points out that sometimes leaks happen, and offers a maxtasksperchild parameter to help deal with them: set it high enough that you benefit from amortizing each worker's startup cost over several tasks, but low enough to avoid swapping.
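A sketch of that knob applied to the pool setup from the question (the value 8 is only a starting point to tune against your memory budget):

import os
from multiprocessing import Pool

# Same pool as in the question, plus a worker-recycling limit: each worker
# process is replaced after handling 8 tasks, releasing whatever it leaked.
with Pool(processes=os.cpu_count(), maxtasksperchild=8) as pool:
    sargs = [(fname,) for fname in shard_files]
    pool.starmap(process_shard_file, sargs)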
Let us know how it works out for you.
I have the following code:
for k in pool:
    x = []
    y = []
    try:
        exec(pool[k])
    except Exception as e:
        ...
    do_something(x)
    do_something_else(y)
where pool[k] is Python code that will eventually append items to x and y (that's why I am using exec instead of eval).
I have already tried executing the same code with PyPy, but for this particular block I don't get much improvement; the line with exec is still my bottleneck.
That said, my question is:
Is there a faster alternative to exec?
If not, do you have any workaround to get some speed up in such a case?
--UPDATE--
To clarify, pool contains around one million keys, and each key is associated with a script (around 50 lines of code). The inputs for the scripts are defined before the for loop, and the outputs generated by a script are stored in x and y. So each script has lines stating x.append(something) and y.append(something). The rest of the program evaluates the results and scores each script. Therefore, I need to loop over each script, execute it, and process the results. The scripts are originally stored in different text files; pool is a dictionary obtained by parsing these files.
P.S.
Using the pre-compiled version of the code:
for k in pool.keys():
    pool[k] = compile(pool[k], '<string>', 'exec')
I got a 5x speed increase; not much, but it is something already. I am experimenting with other solutions...
If you really need to execute code in this manner, use compile() to prepare it.
That is, do not pass raw Python source to exec; compile your scripts into code objects beforehand and execute those instead.
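A minimal sketch of that pattern, reusing the names from your snippet; an explicit namespace dict per script also makes it easy to collect each script's x and y afterwards:

compiled = {}
for k in pool:
    compiled[k] = compile(pool[k], '<script %s>' % k, 'exec')

for k in compiled:
    # Give each script a fresh namespace; add whatever shared inputs the
    # scripts rely on to this dict as well.
    namespace = {'x': [], 'y': []}
    exec(compiled[k], namespace)
    do_something(namespace['x'])
    do_something_else(namespace['y'])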
But still, it would make more sense to write a function that performs what you need on an input argument, i.e. pool[k], and returns the results corresponding to x and y.
If you are reading your code out of files, you also have IO slowdowns to cope with, so it would be nice to have these files already compiled to *.pyc. In Python 2 you might also consider execfile().
An idea for using functions in a pool:
template = """\
def newfunc ():
%s
return result
"""
pool = [] # For iterating it will be faster if it is a list (just a bit)
# This compiles code as a real function and adds a pointer to a pool
def AddFunc (code):
code = "\n".join([" "+x for x in code.splitlines()])
exec template % code
pool.append(newfunc)
# Usage:
AddFunc("""\
a = 8.34**0.5
b = 8
c = 13
result = []
for x in range(10):
result.append(math.sin(a*b+c)/math.pi+x)""")
for f in pool:
x = f()
I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze; there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but no longer using any CPU).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry: if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])
res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
If I use simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
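For example, a sketch of the same call done with multiprocessing.Pool, reusing your analyse_repeats and the zipped argument list:

from multiprocessing import Pool

pool = Pool()   # one worker per core by default; pass processes=N to change
results = pool.map(analyse_repeats,
                   zip(todo_list, organism_ids, filenames))
pool.close()
pool.join()

for organism, store_records in results:
    # process the output, same as with ppmap
    pass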
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully helps you.
I'm working on a SPOJ problem, INTEST. The goal is to specify the number of test cases (n) and a divisor (k), then feed your program n numbers. The program will accept each number on a newline of stdin and after receiving the nth number, will tell you how many were divisible by k.
The only challenge in this problem is getting your code to be FAST because k can be anything up to 10^7 and n can be as high as 10^9.
I'm trying to write it in Python and have trouble speeding it up. Any ideas?
Edit 2: I finally got it to pass at 10.54 seconds. I used nearly all of your answers to get there, and thus it was hard to choose one as 'correct', but I believe the one I chose sums it up the best. Thanks to you all. Final passing code is below.
Edit: I included some of the suggested updates in the included code.
Extensions and third-party modules are not allowed. The code is also run by the SPOJ judge machine, so I do not have the option of changing interpreters.
import sys
import psyco
psyco.full()

def main():
    from sys import stdin, stdout
    first_in = stdin.readline()
    thing = first_in.split()
    n = int(thing[0])
    k = int(thing[1])
    total = 0

    list = stdin.readlines()
    for item in list:
        if int(item) % k == 0:
            total += 1

    stdout.write(str(total) + "\n")

if __name__ == "__main__":
    main()
[Edited to reflect new findings and passing code on spoj]
Generally, when using Python for spoj:
Don't use "raw_input", use sys.stdin.readlines(). That can make a difference for large input. Also, if possible (and it is, for this problem), read everything at once (sys.stdin. readlines()), instead of reading line by line ("for line in sys.stdin...").
Similarly, don't use "print", use sys.stdout.write() - and don't forget "\n". Of course, this is only relevant when printing multiple times.
As S.Mark suggested, use psyco. It's available for both python2.5 and python2.6, at spoj (test it, it's there, and easy to spot: solutions using psyco usually have a ~35Mb memory usage offset). It's really simple: just add, after "import sys": import psyco; psyco.full()
As Justin suggested, put your code (except psyco incantation) inside a function, and simply call it at the end of your code
Sometimes building a whole list (e.g., with a comprehension) and taking its len() can be faster than counting its components one by one.
Favour list comprehensions (and generator expressions, when possible) over "for" and "while" as well. For some constructs, map/reduce/filter may also speed up your code.
Using (some of) these guidelines, I've managed to pass INTEST. Still testing alternatives, though.
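Put together, a sketch applying these guidelines (essentially what the final passing code in the question does) looks like this:

import sys
import psyco
psyco.full()

def main():
    data = sys.stdin.readlines()          # read everything at once
    n, k = map(int, data[0].split())
    # Generator expression instead of an explicit counting loop.
    count = sum(1 for line in data[1:] if int(line) % k == 0)
    sys.stdout.write(str(count) + "\n")

if __name__ == "__main__":
    main()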
Hey, I got it to be within the time limit. I used the following:
Psyco with Python 2.5.
a simple loop with a variable to keep count in
my code was all in a main() function (except the psyco import) which I called.
The last one is what made the difference. I believe that it has to do with variable visibility, but I'm not completely sure. My time was 10.81 seconds. You might get it to be faster with a list comprehension.
Edit:
Using a list comprehension brought my time down to 8.23 seconds. Bringing the line from sys import stdin, stdout inside of the function shaved off a little too to bring my time down to 8.12 seconds.
Use psyco, it will JIT your code, very effective when there is big loop and calculations.
Edit: It looks like third-party modules are not allowed.
So you may try converting your loop to a list comprehension (or, as below, a generator expression inside sum()); it is supposed to run at C level, so it should be a little faster:
sum(1 if int(line) % k == 0 else 0 for line in sys.stdin)
Just recently Alex Martelli said that invoking code inside a function outperforms code run at module level (I can't find the post, though).
So, why don't you try:
import sys
import psyco
psyco.full()

def main():
    first_in = raw_input()
    thing = first_in.split()
    n = int(thing[0])
    k = int(thing[1])

    total = sum(1 if int(line) % k == 0 else 0 for line in sys.stdin)
    print total

if __name__ == "__main__":
    main()
IIRC the reason was that code inside a function can be optimized.
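A quick way to see that effect for yourself, a rough sketch using exec so the same loop can be timed both at module scope and inside a function (exact numbers will vary):

import time

flat = ("total = 0\n"
        "for i in range(3000000):\n"
        "    total += i\n")
wrapped = ("def f():\n"
           "    total = 0\n"
           "    for i in range(3000000):\n"
           "        total += i\n"
           "f()\n")

for label, src in (("module level", flat), ("inside a function", wrapped)):
    start = time.time()
    exec(src, {})   # run each snippet in a fresh global namespace
    print("%s: %.2f s" % (label, time.time() - start))

Inside a function the loop variables are locals, which are resolved with a fast array lookup rather than a dictionary lookup, so the wrapped version is noticeably faster.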
Using list comprehensions with psyco is counter productive.
This code:
count = 0
for l in sys.stdin:
    count += not int(l) % k
runs twice as fast as
count = sum(not int(l)%k for l in sys.stdin)
when using psyco.
For other readers, here is the INTEST problem statement. It's intended to be an I/O throughput test.
On my system, I was able to shave 15% off the execution time by replacing the loop with the following:
print sum(1 for line in sys.stdin if int(line) % k == 0)
I have a tab-separated data file with a little over 2 million lines and 19 columns.
You can find it, in US.zip: http://download.geonames.org/export/dump/.
I started to run the following, but with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient, so I'm posting that version below. Still, even with this small optimization, I'm using 30% of my memory on the process and have only done about 6.5% of the records. It looks like, at this pace, it will run out of memory as it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?
def run():
    from geonames.models import POI
    f = file('data/US.txt')
    for l in f:
        li = l.split('\t')
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()
EDIT, More details (the apparently important ones):
The memory consumption is going up as the script runs and saves more lines.
The .save() method is a customized Django model method (with a unique_slug snippet) that writes to a PostgreSQL/PostGIS db.
SOLVED: DEBUG database logging in Django eats memory.
Make sure that Django's DEBUG setting is set to False
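With DEBUG = True, Django records every executed SQL query in django.db.connection.queries, which grows without bound in a long import like this. The fix is a one-line settings change; if DEBUG has to stay on, django.db.reset_queries() clears that log, roughly like this sketch reusing the question's loop:

# settings.py
DEBUG = False

# ...or, if DEBUG must stay on, clear the query log as you go:
from django import db

for i, l in enumerate(f):
    # ... build and save the POI exactly as in the question ...
    if i % 10000 == 0:
        db.reset_queries()   # drop the per-query debug log Django keeps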
This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.
As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().
There's no reason to worry based on the data you've given us: is memory consumption going up as you read more and more lines? That would be cause for worry, but there's no indication it would happen with the code you've shown, assuming that p.save() saves the object to some database or file and not in memory, of course. There's nothing real to be gained by adding del statements, as the memory is recycled at each leg of the loop anyway.
This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).
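As a sketch of that batching idea (this assumes your Django version has bulk_create; note that bulk_create bypasses the custom save() with the unique_slug logic, which would then have to be handled separately):

from geonames.models import POI

BATCH = 100
batch = []

def make_poi(li):
    # Same field assignments as in the question, done via keyword arguments.
    return POI(geonameid=li[0], name=li[1], asciiname=li[2],
               alternatenames=li[3],
               point="POINT(%s %s)" % (li[5], li[4]),
               feature_class=li[6], feature_code=li[7],
               country_code=li[8], ccs2=li[9],
               admin1_code=li[10], admin2_code=li[11],
               admin3_code=li[12], admin4_code=li[13],
               population=li[14], elevation=li[15], gtopo30=li[16],
               timezone=li[17], modification_date=li[18])

with open('data/US.txt') as f:
    for l in f:
        li = l.split('\t')
        if len(li) < 19:
            continue
        batch.append(make_poi(li))
        if len(batch) >= BATCH:
            POI.objects.bulk_create(batch)   # assumes Django >= 1.4
            batch = []
    if batch:
        POI.objects.bulk_create(batch)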