Multiprocessing child task continues to leak more memory - python

I have a large delimited file. I need to apply a function to each line in this file where each function call takes a while. So I have sharded the main file into subfiles like <shard-dir>/lines_<start>_<stop>.tsv and
am applying a function via pool.starmap to each file. Since I want to also maintain the results, I am writing the results as they come to a corresponding output file: <output-shard-dir>/lines_<start>_<stop>_results.tsv.
The function I am mapping looks something like:
# this is pseudo-code, but similar to what I am using
def process_shard_file(file):
    output_file = output_filename_from_shard_filename(file)
    with open(file, 'r') as fi, open(output_file, 'w') as fo:
        result = heavy_computation_function(fi.readline())
        fo.write(stringify(result))
The multiprocessing is then started via something like:
import os
from multiprocessing import Pool

shard_files = [...]  # a lot of filenames
with Pool(processes=os.cpu_count()) as pool:
    sargs = [(fname,) for fname in shard_files]
    pool.starmap(process_shard_file, sargs)
When monitoring my computer's resources with htop I see that all cores are at full throttle, which is fine. However, I notice that the memory usage just keeps increasing and increasing until it hits swap... and then until swap is also full.
I do not understand why this is happening, since several files (n * cpu_cores) are completed successfully by process_shard_file. So why isn't the memory usage stable? Assume that heavy_computation_function uses essentially the same amount of memory regardless of the file, and that result is always roughly the same size.
Update
def process_shard_file(file):
    output_file = output_filename_from_shard_filename(file)
    with open(file, 'r') as fi, open(output_file, 'w') as fo:
        result = fi.readline()  # heavy_computation_function(fi.readline())
        fo.write(result)
The version above does not seem to cause this memory leak, where result from heavy_computation_function can be thought of as basically just another line to be written to the output file.
So what does heavy_computation_function look like?
def heavy_computation_function(fileline):
    numba_input = convert_line_to_numba_input(fileline)
    result = cached_njitted_function(numba_input)
    return convert_to_more_friendly_format(result)
I know this is still fairly vague, but I am trying to see whether this is a generalized problem or not. I have also tried adding maxtasksperchild=1 to my Pool to really try to prevent leakage, to no avail.

Your program works, but only for a little while before self-destructing due to resource leakage. One could choose to accept such undiagnosed leakage as a fact of life that won't change today. The documentation points out that sometimes Leaks Happen, and offers a maxtasksperchild parameter to help deal with it. Set it high enough that you benefit from amortizing the initial startup cost over several tasks, but low enough to avoid swapping. Let us know how it works out for you.
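For concreteness, here is a minimal sketch of that suggestion. The shard path pattern and the value 8 are just placeholders to tune, not recommendations from the docs:
import glob
import os
from multiprocessing import Pool

def process_shard_file(path):
    ...  # same per-shard work as in the question

if __name__ == "__main__":
    shard_files = sorted(glob.glob("shard-dir/*.tsv"))  # hypothetical shard location
    # maxtasksperchild makes the pool retire each worker after N tasks and start a
    # fresh one, so whatever memory a task leaks is returned to the OS when the
    # worker exits.
    with Pool(processes=os.cpu_count(), maxtasksperchild=8) as pool:
        pool.map(process_shard_file, shard_files)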

Related

How do I add a separate function for average calculation?

I am stuck on this problem. The code I have so far works, but my professor wants to see some changes. I need to add error handling, and I need a separate function for calculating the average, which I will call in main. Here is what I have so far...
import os

def process_file(filename):
    f = open(filename, 'r')
    lines = f.readlines()[1:]
    f.close()
    scores = []
    for line in lines:
        parsed = line.split(",")
        count = int(parsed[1])
        scores.append(count)
    calculate_result(scores)

def calculate_result(scores):
    print("High: ", max(scores))
    print("Low: ", min(scores))
    print("Average: ", sum(scores)/len(scores))

def main():
    filename = "scores.text"
    if os.path.isfile(filename):
        process_file(filename)
    else:
        print("File does not exist")
        return 0

main()
I guess there are 2 parts:
I need to add error handling
and
I need a separate function for calculating average which I will call in main
The second part I don't think you need help with. But error handling is kind of an art, so I can see where you might be stuck on that. Here are some suggestions to help get started.
The most common type of error handling involves dealing with input. Thinking more broadly, we could expand that to anything that crosses the boundary of the program's memory space. This includes not just user input, but also output; filesystem interaction; using network interfaces (or any communication device or hardware interface); starting/stopping or otherwise interacting with other programs; calling a library that does any of these things on our behalf; and many more....
So what parts of your program are interacting with "the outside" ? I can see a few:
in main() the program is making an assumption about the existence of a file. You are already checking to make sure this file exists, and returning 0 if it doesn't (you might want to change that to a non-zero value, since 0 is usually used to signal that no error occurred)
process_file() does this: f = open(filename,'r') but are you sure that will work? Are there conditions where this could fail?
What if the user that is running the program doesn't have permissions to read that file?
What if the file was deleted or changed between the time it was checked in main and the subsequent open call in process_file? This is a TOCTOU race condition, and it is something that every software developer needs to watch out for.
Probably the most obvious source of potential errors for this program is the content of the input file:
We're assuming the input is comma-separated. What if the user uses tabs or some other character?
While processing the lines, you've got: count = int(parsed[1]), but how do you know that parsed[1] can be cast to an int?
What will happen if the file exists, but is empty (hint: len(scores)==0)? Always look at these edge cases.
Finally, it looks like you are using if-then statements for error checking. That is fine, but another powerful tool for dealing with errors is the try-except statement. They are not mutually exclusive: sometimes it's easier to use an if statement, and sometimes catching an exception with try-except is better. Some of the errors you'll need to deal with are easier to handle using one approach over the other.
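To make those suggestions concrete, here is one possible sketch of the error handling. The names mirror your code; the messages and the particular exceptions caught are just examples, not the only correct answer:
def process_file(filename):
    scores = []
    try:
        with open(filename, 'r') as f:
            lines = f.readlines()[1:]   # skip the header line
    except OSError as e:
        # covers missing files, permission problems, and TOCTOU races
        print("Could not read", filename, ":", e)
        return
    for line in lines:
        parsed = line.split(",")
        try:
            scores.append(int(parsed[1]))
        except (IndexError, ValueError):
            # wrong delimiter, too few columns, or a non-numeric score
            print("Skipping malformed line:", line.rstrip())
    if not scores:
        # the file existed but held no usable data
        print("No scores found in", filename)
        return
    calculate_result(scores)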

Why do we need to open the file every time we want to append to it?

As in the thread How do you append to a file?, most answers are about opening a file and appending to it, for instance:
def FileSave(content):
    with open(filename, "a") as myfile:
        myfile.write(content)

FileSave("test1 \n")
FileSave("test2 \n")
Why don't we just pull myfile out and only write to it when FileSave is invoked?
global myfile
myfile = open(filename)

def FileSave(content):
    myfile.write(content)

FileSave("test1 \n")
FileSave("test2 \n")
Is the latter code better because it opens the file only once and writes to it multiple times?
Or is there no difference, because something inside Python guarantees the file is opened only once even though the open method is invoked multiple times?
There are a number of problems with your modified code that aren't really relevant to your question: you open the file in read-only mode, you never close the file, you have a global statement that does nothing…
Let's ignore all of those and just talk about the advantages and disadvantages of opening and closing a file over and over:
Wastes a bit of time. If you're really unlucky, the file could even just barely keep falling out of the disk cache and waste even more time.
Ensures that you're always appending to the end of the file, even if some other program is also appending to the same file. (This is pretty important for, e.g., syslog-type logs.)1
Ensures that you've flushed your writes to disk at some point, which reduces the chance of lost data if your program crashes or gets killed.
Ensures that you've flushed your writes to disk as soon as you write them. If you try to open and read the file elsewhere in the same program, or in a different program, or if the end user just opens it in Notepad, you won't be missing the last 1.73KB worth of lines because they're still in a buffer somewhere and won't be written until later.2
So, it's a tradeoff. Often, you want one of those guarantees, and the performance cost isn't a big deal. Sometimes, it is a big deal and the guarantees don't matter. Sometimes, you really need both, so you have to write something complicated where you manually buffer up bits and write-and-flush them all at once.
1. As the Python docs for open make clear, this will happen anyway on some Unix systems. But not on other Unix systems, and not on Windows.
2. Also, if you have multiple writers, they're all appending a line at a time, rather than appending whenever they happen to flush, which is again pretty important for logfiles.
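To make the middle ground concrete, here is a rough sketch of a simpler variant than the manual write-and-flush buffering mentioned above: keep the file open but flush after every write. You skip the repeated open/close, while still making each write visible promptly to other readers. The class and file name are made up for illustration:
class Appender:
    """Keeps one file handle open and flushes after every write."""
    def __init__(self, path):
        self._f = open(path, 'a')
    def save(self, content):
        self._f.write(content)
        self._f.flush()   # push Python's buffer to the OS so other readers see it
    def close(self):
        self._f.close()

log = Appender("myfile.txt")   # hypothetical file name
log.save("test1 \n")
log.save("test2 \n")
log.close()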
In general global should be avoided if possible.
The reason people use the with statement when dealing with files is that it explicitly controls the file's lifetime: once the with block is done, the file is closed.
You can avoid the with statement, but then you must remember to call myfile.close(), particularly if you're dealing with a lot of files.
One way that avoids the with block and also avoids using global is:
def filesave(f_obj, string):
    f_obj.write(string)

f = open(filename, 'a')
filesave(f, "test1\n")
filesave(f, "test2\n")
f.close()
However at this point you'd be better off getting rid of the function and just simply doing:
f = open(filename, 'a')
f.write("test1\n")
f.write("test2\n")
f.close()
At which point you could easily put it within a with block:
with open(filename, 'a') as f:
    f.write("test1\n")
    f.write("test2\n")
So yes. There's no hard reason to not do what you're doing. It's just not very Pythonic.
The latter code may be more efficient, but the former code is safer: it makes sure that the content each call to FileSave writes gets flushed to the filesystem, so that other processes can read the updated content, and by closing the file handle on each call (using open as a context manager), it gives other processes a chance to write to the file as well (specifically on Windows).
It really depends on the circumstances, but here are some thoughts:
A with block absolutely guarantees that the file will be closed once the block is exited. Python does not make any weird optimizations for appending to files.
In general, globals make your code less modular, and therefore harder to read and maintain. You would think that the original FileSave function is attempting to avoid globals, but it's using the global name filename, so you may as well use a global file altogether at that point, as it will save you some I/O overhead.
A better option would be to avoid globals altogether, or at least to use them properly. You really don't need a separate function to wrap file.write, but if it represents something more complex, here is a design suggestion:
def save(file, content):
    print(content, file=file)

def my_thing(filename):
    with open(filename, 'a') as f:
        # do some stuff
        save(f, 'test1')
        # do more stuff
        save(f, 'test2')

if __name__ == '__main__':
    my_thing('myfile.txt')
Notice that when you call the module as a script, a file name defined in the global scope will be passed in to the main routine. However, since the main routine does not reference global variables, you can A) read it easier because it's self contained, and B) test it without having to wonder how to feed it inputs without breaking everything else.
Also, by using print instead of file.write, you avoid having to append newlines manually.

Reading a large file in python

I have a "not so" large file (~2.2GB) which I am trying to read and process...
import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt", "w")
print "Reading file"
with open("final_edge_list.txt", "r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens) == 3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line
                error.write(line + "\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) + " ==> " + line + "\n"
            error.write(string)
            continue
Am I doing something wrong?
It's been about an hour since the code started reading the file (and it's still reading).
Memory usage is already at 20 GB.
Why is it taking so much time and memory?
To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap your above code in a make_graph() function (this is best practice anyway), and then wrap the call to this function with a KeyboardInterrupt exception handler which prints out the gc data to a file.
def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))

if __name__ == '__main__':
    main()
Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
There are a few things you can do:
Run your code on a subset of data. Measure time required. Extrapolate to the full size of your data. That will give you an estimate how long it will run.
counter = 0
with open("final_edge_list.txt", "r") as f:
    for line in f:
        counter += 1
        if counter == 200000:
            break
        try:
            ...
On 1M lines it runs in ~8 sec on my machine, so for a 2.2 GB file with about 100M lines it should run in roughly 15 min. Though, once you exceed your available memory, that estimate won't hold anymore.
Your graph seems symmetric
graph[src][destination] = weight
graph[destination][src] = weight
In your graph-processing code, use the symmetry of the graph to reduce memory usage by half.
Run profilers on your code using a subset of the data and see what happens there. The simplest would be to run
python -m cProfile --sort cumulative youprogram.py
There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
Python's numeric types use quite a lot of memory compared to other programming languages. On my setup it appears to be 24 bytes for each number:
>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24
Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
To add another thing, some versions of the Python interpreter (including CPython 2.6) are known for keeping so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory will not be returned to the operating system until your process terminates. Also have a look at this question I posted when I first discovered this issue:
Python: garbage collection fails?
Suggestions to work around this include:
use a subprocess to do the memory-hungry computation, e.g., based on the multiprocessing module (see the sketch after this list)
use a library that implements the functionality in C, e.g., numpy, pandas
use another interpreter, e.g., PyPy
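A rough sketch of the first workaround: do the parsing in worker processes so the interpreter's free-list memory is returned to the OS when each worker exits. build_partial_graph and the chunk file names are hypothetical stand-ins for however you split the input:
from multiprocessing import Pool

def build_partial_graph(path):
    edges = {}
    with open(path) as f:
        for line in f:
            tokens = line.rstrip().split("\t")
            if len(tokens) == 3:
                edges[(int(tokens[0]), int(tokens[1]))] = float(tokens[2])
    return edges  # note: the parent still has to hold whatever it keeps of the results

if __name__ == "__main__":
    chunk_files = ["edges_part_0.txt", "edges_part_1.txt"]  # hypothetical pre-split inputs
    # maxtasksperchild=1 means each chunk is parsed in a fresh process whose
    # memory is fully released when it exits.
    with Pool(processes=2, maxtasksperchild=1) as pool:
        partial_graphs = pool.map(build_partial_graph, chunk_files)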
You don't need graph to be a defaultdict(dict); use a plain dict instead: graph[src, destination] = weight and graph[destination, src] = weight will do. Or keep only one of them.
To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and might be compressed.
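A rough sketch of the scipy.sparse idea, assuming the node ids are (or have been remapped to) reasonably small contiguous integers:
from scipy.sparse import coo_matrix

rows, cols, weights = [], [], []
with open("final_edge_list.txt") as f:
    for line in f:
        tokens = line.rstrip().split("\t")
        if len(tokens) == 3:
            rows.append(int(tokens[0]))
            cols.append(int(tokens[1]))
            weights.append(float(tokens[2]))

n = max(max(rows), max(cols)) + 1
# CSR format allows cheap row slicing for later graph processing
graph = coo_matrix((weights, (rows, cols)), shape=(n, n)).tocsr()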
What do you plan to do with your nodes list afterwards?

readinto() replacement?

Copying a File using a straight-forward approach in Python is typically like this:
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
(This code snippet is from shutil.py, by the way).
Unfortunately, this has drawbacks in my special use-case (involving threading and very large buffers) [Italics part added later]. First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
To avoid this I'm using the file.readinto() method which, unfortunately, is documented as deprecated and "don't use":
import array

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buffer = array.array('c')
    buffer.fromstring('-' * length)
    while True:
        count = fsrc.readinto(buffer)
        if count == 0:
            break
        if count != len(buffer):
            fdst.write(buffer.tostring()[:count])
        else:
            buffer.tofile(fdst)
My solution works, but there are two drawbacks as well: First, readinto() is not to be used. It might go away (says the documentation). Second, with readinto() I cannot decide how many bytes I want to read into the buffer and with buffer.tofile() I cannot decide how many I want to write, hence the cumbersome special case for the last block (which also is unnecessarily expensive).
I've looked at array.array.fromfile(), but it cannot be used to read "all there is" (reads, then throws EOFError and doesn't hand out the number of processed items). Also it is no solution for the ending special-case problem.
Is there a proper way to do what I want to do? Maybe I'm just overlooking a simple buffer class or similar which does what I want.
This code snippet is from shutil.py
Which is a standard library module. Why not just use it?
First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
This is tiny compared to the effort required to actually grab a page of data from disk.
Normal Python code would not be in need of such tweaks as this. However, if you really need all that performance tweaking to read files from inside Python code (as in, you are rewriting some server code you already have working, for performance or memory usage), I'd rather call the OS directly using ctypes, thus having the copy performed at as low a level as I want.
It may even be possible that simply calling the "cp" executable as an external process is less of a hurdle in your case (and it would take full advantage of all OS- and filesystem-level optimizations for you).
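For what it's worth, on Python 3 the concern about readinto() largely goes away: binary file objects expose readinto() for preallocated buffers, and a memoryview slice handles the short final block without an extra copy. A sketch, assuming fsrc and fdst are opened in binary mode:
def copyfileobj_reuse_buffer(fsrc, fdst, length=16 * 1024):
    """Copy fsrc to fdst reusing one preallocated buffer."""
    buf = bytearray(length)
    view = memoryview(buf)
    while True:
        count = fsrc.readinto(buf)   # fills buf in place, returns bytes read (0 at EOF)
        if not count:
            break
        fdst.write(view[:count])     # write only the filled portion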

Save memory in Python. How to iterate over the lines and save them efficiently with a 2-million-line file?

I have a tab-separated data file with a little over 2 million lines and 19 columns.
You can find it, in US.zip: http://download.geonames.org/export/dump/.
I started to run the following but with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient so I'm posting that below. Still, with this small optimization, I'm using 30% of my memory on the process and have only done about 6.5% of the records. It looks like, at this pace, it will run out of memory like it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?
def run():
    from geonames.models import POI
    f = file('data/US.txt')
    for l in f:
        li = l.split('\t')
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()
EDIT, More details (the apparently important ones):
The memory consumption is going up as the script runs and saves more lines.
The .save() method is an adulterated Django model method with a unique_slug snippet that writes to a PostgreSQL/PostGIS db.
SOLVED: DEBUG database logging in Django eats memory.
Make sure that Django's DEBUG setting is set to False
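In other words, a minimal sketch of the setting involved: with DEBUG enabled, Django records every executed SQL query in django.db.connection.queries, which grows without bound during a long import.
# settings.py
DEBUG = False

# Or, if you need to keep DEBUG = True for some reason, drop the accumulated
# query log periodically inside the import loop:
from django.db import reset_queries
reset_queries()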
This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.
As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().
There's no reason to worry in the data you've given us: is memory consumption going UP as you read more and more lines? Now that would be cause for worry -- but there's no indication that this would happen in the code you've shown, assuming that p.save() saves the object to some database or file and not in memory, of course. There's nothing real to be gained by adding del statements, as the memory is getting recycled at each leg of the loop anyway.
This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).
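If the model does allow saving in bunches, a hedged sketch of that idea might look like the following. It assumes your Django version provides Model.objects.bulk_create and that POI accepts its fields as keyword arguments; adjust to whatever geonames.models actually offers. Note that bulk_create bypasses the model's save() method, so the unique_slug logic mentioned in the edit would need separate handling.
from geonames.models import POI

def run(path='data/US.txt', batch_size=100):
    batch = []
    with open(path) as f:
        for l in f:
            li = l.rstrip('\n').split('\t')
            try:
                batch.append(POI(
                    geonameid=li[0],
                    name=li[1],
                    asciiname=li[2],
                    point="POINT(%s %s)" % (li[5], li[4]),
                    # ... remaining columns bound the same way as in the original loop
                    modification_date=li[18],
                ))
            except IndexError:
                continue
            if len(batch) >= batch_size:
                POI.objects.bulk_create(batch)  # assumption: available in your Django version
                batch = []
    if batch:
        POI.objects.bulk_create(batch)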
