Related
I am trying to understand whether my way of using multiprocessing.Pool is efficient. The method that I would like to do in parallel is a script that reads a certain file, do calculation, and then saves the results to a different file. My code looks something like this:
from multiprocessing import Pool, TimeoutError
import deepdish.io as dd
def savefile(a,b,t,c,g,e,d):
print a
dd.save(str(a),{'b':b,'t':t,'c':c,'g':g,'e':e,'d':d})
def run_many_calcs():
num_processors = 6
print "Num processors - ",num_processors
pool = Pool(processes=num_processors) # start 4 worker processes
for a in ['a','b','c','d','e','f','g','t','y','e','r','w']:
pool.apply(savefile,args=(a,4,5,6,7,8,1))
How can I see that immediately after one process is finished in one of the processors it continues to the next file?
When considering performance of any program, you have to work out if the performance is bound by I/O (memory, disk, network, whatever) or Compute (core count, core speed, etc).
If I/O is the bottleneck, there's no point having multiple processes, a faster CPU, etc.
If the computation is taking up all the time, then it is worth investing in multiple processes, etc. "Computation time" is often diagnosed as being the problem, but on closer investigation turns out to be limited by the computer's memory bus speed, not the clock rate of the cores. In such circumstances adding multiple processes can make things worse...
Check
You can check yours by doing some performance profiling of your code (there's bound to be a whole load of profiling tools out there for Python).
My Guess
Most of the time these days it's I/O that's the bottleneck. If you don't want to profile your code, betting on a faster SSD is likely the best initial approach.
Unsolvable Computer Science Problem
The architectural features of modern CPUs (L1, L2, L3 cache, QPI, hyperthreads) are all symptoms of the underlying problem in computer design; cores are too quick for the I/O we can wrap around them.
For example, the time taken to transfer 1 byte from SDRAM to the core is exceedingly slow in comparison to the core speed. One just has to hope that the L3, L2 and L1 cache subsystems have correctly predicted the need for that 1 byte and have already fetched it ahead of time. If not, there's a big delay; that's where hyperthreading can help the overall performance of the computer's other processes (they can nip in and get some work done), but does absolutely nothing for the stalled program.
Data fetched from files or networks is very slow indeed.
File System Caching
In your case it sounds like you have 1 single input file; that will at least get cached in RAM by the OS (provided it's not too big).
You may be tempted to read it into memory yourself; I wouldn't bother. If it's large you would be allocating a large amount of memory to hold it, and if that's too big for the RAM in the machine the OS will swap some of that RAM out to the virtual memory page file anyway, and you're worse off than before. If it's small enough there's a good chance the OS will cache the whole thing for you anyway, saving you the bother.
Written files are also cached, up to a point. Ultimately there's nothing you can do if "total process time" is taken to mean that all the data is written to disk; you'd be having to wait for the disk to complete writing no matter what you did in memory and what the OS cached.
The OS's filesystem cache might give an initial impression that file writing has completed (the OS will get on with consolidating the data on the actual drive shortly), but successive runs of the same program will get blocked once that write cache is full.
If you do profile your code, be sure to run it for a long time (or repeatedly), to make sure that the measurements made by the profiler reveal the true underlying performance of the computer. If the results show that most of the time is in file.Read() or file.Write()...
I have a few related questions regarding memory usage in the following example.
If I run in the interpreter,
foo = ['bar' for _ in xrange(10000000)]
the real memory used on my machine goes up to 80.9mb. I then,
del foo
real memory goes down, but only to 30.4mb. The interpreter uses 4.4mb baseline so what is the advantage in not releasing 26mb of memory to the OS? Is it because Python is "planning ahead", thinking that you may use that much memory again?
Why does it release 50.5mb in particular - what is the amount that is released based on?
Is there a way to force Python to release all the memory that was used (if you know you won't be using that much memory again)?
NOTE
This question is different from How can I explicitly free memory in Python?
because this question primarily deals with the increase of memory usage from baseline even after the interpreter has freed objects via garbage collection (with use of gc.collect or not).
I'm guessing the question you really care about here is:
Is there a way to force Python to release all the memory that was used (if you know you won't be using that much memory again)?
No, there is not. But there is an easy workaround: child processes.
If you need 500MB of temporary storage for 5 minutes, but after that you need to run for another 2 hours and won't touch that much memory ever again, spawn a child process to do the memory-intensive work. When the child process goes away, the memory gets released.
This isn't completely trivial and free, but it's pretty easy and cheap, which is usually good enough for the trade to be worthwhile.
First, the easiest way to create a child process is with concurrent.futures (or, for 3.1 and earlier, the futures backport on PyPI):
with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
result = executor.submit(func, *args, **kwargs).result()
If you need a little more control, use the multiprocessing module.
The costs are:
Process startup is kind of slow on some platforms, notably Windows. We're talking milliseconds here, not minutes, and if you're spinning up one child to do 300 seconds' worth of work, you won't even notice it. But it's not free.
If the large amount of temporary memory you use really is large, doing this can cause your main program to get swapped out. Of course you're saving time in the long run, because that if that memory hung around forever it would have to lead to swapping at some point. But this can turn gradual slowness into very noticeable all-at-once (and early) delays in some use cases.
Sending large amounts of data between processes can be slow. Again, if you're talking about sending over 2K of arguments and getting back 64K of results, you won't even notice it, but if you're sending and receiving large amounts of data, you'll want to use some other mechanism (a file, mmapped or otherwise; the shared-memory APIs in multiprocessing; etc.).
Sending large amounts of data between processes means the data have to be pickleable (or, if you stick them in a file or shared memory, struct-able or ideally ctypes-able).
Memory allocated on the heap can be subject to high-water marks. This is complicated by Python's internal optimizations for allocating small objects (PyObject_Malloc) in 4 KiB pools, classed for allocation sizes at multiples of 8 bytes -- up to 256 bytes (512 bytes in 3.3). The pools themselves are in 256 KiB arenas, so if just one block in one pool is used, the entire 256 KiB arena will not be released. In Python 3.3 the small object allocator was switched to using anonymous memory maps instead of the heap, so it should perform better at releasing memory.
Additionally, the built-in types maintain freelists of previously allocated objects that may or may not use the small object allocator. The int type maintains a freelist with its own allocated memory, and clearing it requires calling PyInt_ClearFreeList(). This can be called indirectly by doing a full gc.collect.
Try it like this, and tell me what you get. Here's the link for psutil.Process.memory_info.
import os
import gc
import psutil
proc = psutil.Process(os.getpid())
gc.collect()
mem0 = proc.memory_info().rss
# create approx. 10**7 int objects and pointers
foo = ['abc' for x in range(10**7)]
mem1 = proc.memory_info().rss
# unreference, including x == 9999999
del foo, x
mem2 = proc.memory_info().rss
# collect() calls PyInt_ClearFreeList()
# or use ctypes: pythonapi.PyInt_ClearFreeList()
gc.collect()
mem3 = proc.memory_info().rss
pd = lambda x2, x1: 100.0 * (x2 - x1) / mem0
print "Allocation: %0.2f%%" % pd(mem1, mem0)
print "Unreference: %0.2f%%" % pd(mem2, mem1)
print "Collect: %0.2f%%" % pd(mem3, mem2)
print "Overall: %0.2f%%" % pd(mem3, mem0)
Output:
Allocation: 3034.36%
Unreference: -752.39%
Collect: -2279.74%
Overall: 2.23%
Edit:
I switched to measuring relative to the process VM size to eliminate the effects of other processes in the system.
The C runtime (e.g. glibc, msvcrt) shrinks the heap when contiguous free space at the top reaches a constant, dynamic, or configurable threshold. With glibc you can tune this with mallopt (M_TRIM_THRESHOLD). Given this, it isn't surprising if the heap shrinks by more -- even a lot more -- than the block that you free.
In 3.x range doesn't create a list, so the test above won't create 10 million int objects. Even if it did, the int type in 3.x is basically a 2.x long, which doesn't implement a freelist.
eryksun has answered question #1, and I've answered question #3 (the original #4), but now let's answer question #2:
Why does it release 50.5mb in particular - what is the amount that is released based on?
What it's based on is, ultimately, a whole series of coincidences inside Python and malloc that are very hard to predict.
First, depending on how you're measuring memory, you may only be measuring pages actually mapped into memory. In that case, any time a page gets swapped out by the pager, memory will show up as "freed", even though it hasn't been freed.
Or you may be measuring in-use pages, which may or may not count allocated-but-never-touched pages (on systems that optimistically over-allocate, like linux), pages that are allocated but tagged MADV_FREE, etc.
If you really are measuring allocated pages (which is actually not a very useful thing to do, but it seems to be what you're asking about), and pages have really been deallocated, two circumstances in which this can happen: Either you've used brk or equivalent to shrink the data segment (very rare nowadays), or you've used munmap or similar to release a mapped segment. (There's also theoretically a minor variant to the latter, in that there are ways to release part of a mapped segment—e.g., steal it with MAP_FIXED for a MADV_FREE segment that you immediately unmap.)
But most programs don't directly allocate things out of memory pages; they use a malloc-style allocator. When you call free, the allocator can only release pages to the OS if you just happen to be freeing the last live object in a mapping (or in the last N pages of the data segment). There's no way your application can reasonably predict this, or even detect that it happened in advance.
CPython makes this even more complicated—it has a custom 2-level object allocator on top of a custom memory allocator on top of malloc. (See the source comments for a more detailed explanation.) And on top of that, even at the C API level, much less Python, you don't even directly control when the top-level objects are deallocated.
So, when you release an object, how do you know whether it's going to release memory to the OS? Well, first you have to know that you've released the last reference (including any internal references you didn't know about), allowing the GC to deallocate it. (Unlike other implementations, at least CPython will deallocate an object as soon as it's allowed to.) This usually deallocates at least two things at the next level down (e.g., for a string, you're releasing the PyString object, and the string buffer).
If you do deallocate an object, to know whether this causes the next level down to deallocate a block of object storage, you have to know the internal state of the object allocator, as well as how it's implemented. (It obviously can't happen unless you're deallocating the last thing in the block, and even then, it may not happen.)
If you do deallocate a block of object storage, to know whether this causes a free call, you have to know the internal state of the PyMem allocator, as well as how it's implemented. (Again, you have to be deallocating the last in-use block within a malloced region, and even then, it may not happen.)
If you do free a malloced region, to know whether this causes an munmap or equivalent (or brk), you have to know the internal state of the malloc, as well as how it's implemented. And this one, unlike the others, is highly platform-specific. (And again, you generally have to be deallocating the last in-use malloc within an mmap segment, and even then, it may not happen.)
So, if you want to understand why it happened to release exactly 50.5mb, you're going to have to trace it from the bottom up. Why did malloc unmap 50.5mb worth of pages when you did those one or more free calls (for probably a bit more than 50.5mb)? You'd have to read your platform's malloc, and then walk the various tables and lists to see its current state. (On some platforms, it may even make use of system-level information, which is pretty much impossible to capture without making a snapshot of the system to inspect offline, but luckily this isn't usually a problem.) And then you have to do the same thing at the 3 levels above that.
So, the only useful answer to the question is "Because."
Unless you're doing resource-limited (e.g., embedded) development, you have no reason to care about these details.
And if you are doing resource-limited development, knowing these details is useless; you pretty much have to do an end-run around all those levels and specifically mmap the memory you need at the application level (possibly with one simple, well-understood, application-specific zone allocator in between).
First, you may want to install glances:
sudo apt-get install python-pip build-essential python-dev lm-sensors
sudo pip install psutil logutils bottle batinfo https://bitbucket.org/gleb_zhulik/py3sensors/get/tip.tar.gz zeroconf netifaces pymdstat influxdb elasticsearch potsdb statsd pystache docker-py pysnmp pika py-cpuinfo bernhard
sudo pip install glances
Then run it in the terminal!
glances
In your Python code, add at the begin of the file, the following:
import os
import gc # Garbage Collector
After using the "Big" variable (for example: myBigVar) for which, you would like to release memory, write in your python code the following:
del myBigVar
gc.collect()
In another terminal, run your python code and observe in the "glances" terminal, how the memory is managed in your system!
Good luck!
P.S. I assume you are working on a Debian or Ubuntu system
I'm running Linux Mint via VirtualBox, and the Python code I'm using contains an iteration over a large data set to produce plots. The first couple of times I tried to run it, part way through the process it stopped with a message simple saying "Killed".
A bit of research showed that this is most likely due to a lack of RAM for the process. When repeating the process and monitoring the system resource usage (using command top -s), and sure enough I can watch the ram usage going up at a fairly constant rate as the program runs. I'm giving the VirtualBox all the RAM my system can afford (just over 2Gb), but it doesn't seem to be enough for the iterations I'm doing. The code looks like this:
for file in os.listdir('folder/'):
calledfunction('folder/'+file, 'output/'+file)
The calledfunction produces a png image, so it takes about 50mb of RAM per iteration and I want to do it about 40 times.
So, my question is, can I use a function to prevent the build up of RAM usage, or clear the RAM after each iteration? I've seen people talking garbage collection but I'm not really sure how to use it, or where I can/should put it in my loop. Any protips?
Thanks
I have a few related questions regarding memory usage in the following example.
If I run in the interpreter,
foo = ['bar' for _ in xrange(10000000)]
the real memory used on my machine goes up to 80.9mb. I then,
del foo
real memory goes down, but only to 30.4mb. The interpreter uses 4.4mb baseline so what is the advantage in not releasing 26mb of memory to the OS? Is it because Python is "planning ahead", thinking that you may use that much memory again?
Why does it release 50.5mb in particular - what is the amount that is released based on?
Is there a way to force Python to release all the memory that was used (if you know you won't be using that much memory again)?
NOTE
This question is different from How can I explicitly free memory in Python?
because this question primarily deals with the increase of memory usage from baseline even after the interpreter has freed objects via garbage collection (with use of gc.collect or not).
I'm guessing the question you really care about here is:
Is there a way to force Python to release all the memory that was used (if you know you won't be using that much memory again)?
No, there is not. But there is an easy workaround: child processes.
If you need 500MB of temporary storage for 5 minutes, but after that you need to run for another 2 hours and won't touch that much memory ever again, spawn a child process to do the memory-intensive work. When the child process goes away, the memory gets released.
This isn't completely trivial and free, but it's pretty easy and cheap, which is usually good enough for the trade to be worthwhile.
First, the easiest way to create a child process is with concurrent.futures (or, for 3.1 and earlier, the futures backport on PyPI):
with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
result = executor.submit(func, *args, **kwargs).result()
If you need a little more control, use the multiprocessing module.
The costs are:
Process startup is kind of slow on some platforms, notably Windows. We're talking milliseconds here, not minutes, and if you're spinning up one child to do 300 seconds' worth of work, you won't even notice it. But it's not free.
If the large amount of temporary memory you use really is large, doing this can cause your main program to get swapped out. Of course you're saving time in the long run, because that if that memory hung around forever it would have to lead to swapping at some point. But this can turn gradual slowness into very noticeable all-at-once (and early) delays in some use cases.
Sending large amounts of data between processes can be slow. Again, if you're talking about sending over 2K of arguments and getting back 64K of results, you won't even notice it, but if you're sending and receiving large amounts of data, you'll want to use some other mechanism (a file, mmapped or otherwise; the shared-memory APIs in multiprocessing; etc.).
Sending large amounts of data between processes means the data have to be pickleable (or, if you stick them in a file or shared memory, struct-able or ideally ctypes-able).
Memory allocated on the heap can be subject to high-water marks. This is complicated by Python's internal optimizations for allocating small objects (PyObject_Malloc) in 4 KiB pools, classed for allocation sizes at multiples of 8 bytes -- up to 256 bytes (512 bytes in 3.3). The pools themselves are in 256 KiB arenas, so if just one block in one pool is used, the entire 256 KiB arena will not be released. In Python 3.3 the small object allocator was switched to using anonymous memory maps instead of the heap, so it should perform better at releasing memory.
Additionally, the built-in types maintain freelists of previously allocated objects that may or may not use the small object allocator. The int type maintains a freelist with its own allocated memory, and clearing it requires calling PyInt_ClearFreeList(). This can be called indirectly by doing a full gc.collect.
Try it like this, and tell me what you get. Here's the link for psutil.Process.memory_info.
import os
import gc
import psutil
proc = psutil.Process(os.getpid())
gc.collect()
mem0 = proc.memory_info().rss
# create approx. 10**7 int objects and pointers
foo = ['abc' for x in range(10**7)]
mem1 = proc.memory_info().rss
# unreference, including x == 9999999
del foo, x
mem2 = proc.memory_info().rss
# collect() calls PyInt_ClearFreeList()
# or use ctypes: pythonapi.PyInt_ClearFreeList()
gc.collect()
mem3 = proc.memory_info().rss
pd = lambda x2, x1: 100.0 * (x2 - x1) / mem0
print "Allocation: %0.2f%%" % pd(mem1, mem0)
print "Unreference: %0.2f%%" % pd(mem2, mem1)
print "Collect: %0.2f%%" % pd(mem3, mem2)
print "Overall: %0.2f%%" % pd(mem3, mem0)
Output:
Allocation: 3034.36%
Unreference: -752.39%
Collect: -2279.74%
Overall: 2.23%
Edit:
I switched to measuring relative to the process VM size to eliminate the effects of other processes in the system.
The C runtime (e.g. glibc, msvcrt) shrinks the heap when contiguous free space at the top reaches a constant, dynamic, or configurable threshold. With glibc you can tune this with mallopt (M_TRIM_THRESHOLD). Given this, it isn't surprising if the heap shrinks by more -- even a lot more -- than the block that you free.
In 3.x range doesn't create a list, so the test above won't create 10 million int objects. Even if it did, the int type in 3.x is basically a 2.x long, which doesn't implement a freelist.
eryksun has answered question #1, and I've answered question #3 (the original #4), but now let's answer question #2:
Why does it release 50.5mb in particular - what is the amount that is released based on?
What it's based on is, ultimately, a whole series of coincidences inside Python and malloc that are very hard to predict.
First, depending on how you're measuring memory, you may only be measuring pages actually mapped into memory. In that case, any time a page gets swapped out by the pager, memory will show up as "freed", even though it hasn't been freed.
Or you may be measuring in-use pages, which may or may not count allocated-but-never-touched pages (on systems that optimistically over-allocate, like linux), pages that are allocated but tagged MADV_FREE, etc.
If you really are measuring allocated pages (which is actually not a very useful thing to do, but it seems to be what you're asking about), and pages have really been deallocated, two circumstances in which this can happen: Either you've used brk or equivalent to shrink the data segment (very rare nowadays), or you've used munmap or similar to release a mapped segment. (There's also theoretically a minor variant to the latter, in that there are ways to release part of a mapped segment—e.g., steal it with MAP_FIXED for a MADV_FREE segment that you immediately unmap.)
But most programs don't directly allocate things out of memory pages; they use a malloc-style allocator. When you call free, the allocator can only release pages to the OS if you just happen to be freeing the last live object in a mapping (or in the last N pages of the data segment). There's no way your application can reasonably predict this, or even detect that it happened in advance.
CPython makes this even more complicated—it has a custom 2-level object allocator on top of a custom memory allocator on top of malloc. (See the source comments for a more detailed explanation.) And on top of that, even at the C API level, much less Python, you don't even directly control when the top-level objects are deallocated.
So, when you release an object, how do you know whether it's going to release memory to the OS? Well, first you have to know that you've released the last reference (including any internal references you didn't know about), allowing the GC to deallocate it. (Unlike other implementations, at least CPython will deallocate an object as soon as it's allowed to.) This usually deallocates at least two things at the next level down (e.g., for a string, you're releasing the PyString object, and the string buffer).
If you do deallocate an object, to know whether this causes the next level down to deallocate a block of object storage, you have to know the internal state of the object allocator, as well as how it's implemented. (It obviously can't happen unless you're deallocating the last thing in the block, and even then, it may not happen.)
If you do deallocate a block of object storage, to know whether this causes a free call, you have to know the internal state of the PyMem allocator, as well as how it's implemented. (Again, you have to be deallocating the last in-use block within a malloced region, and even then, it may not happen.)
If you do free a malloced region, to know whether this causes an munmap or equivalent (or brk), you have to know the internal state of the malloc, as well as how it's implemented. And this one, unlike the others, is highly platform-specific. (And again, you generally have to be deallocating the last in-use malloc within an mmap segment, and even then, it may not happen.)
So, if you want to understand why it happened to release exactly 50.5mb, you're going to have to trace it from the bottom up. Why did malloc unmap 50.5mb worth of pages when you did those one or more free calls (for probably a bit more than 50.5mb)? You'd have to read your platform's malloc, and then walk the various tables and lists to see its current state. (On some platforms, it may even make use of system-level information, which is pretty much impossible to capture without making a snapshot of the system to inspect offline, but luckily this isn't usually a problem.) And then you have to do the same thing at the 3 levels above that.
So, the only useful answer to the question is "Because."
Unless you're doing resource-limited (e.g., embedded) development, you have no reason to care about these details.
And if you are doing resource-limited development, knowing these details is useless; you pretty much have to do an end-run around all those levels and specifically mmap the memory you need at the application level (possibly with one simple, well-understood, application-specific zone allocator in between).
First, you may want to install glances:
sudo apt-get install python-pip build-essential python-dev lm-sensors
sudo pip install psutil logutils bottle batinfo https://bitbucket.org/gleb_zhulik/py3sensors/get/tip.tar.gz zeroconf netifaces pymdstat influxdb elasticsearch potsdb statsd pystache docker-py pysnmp pika py-cpuinfo bernhard
sudo pip install glances
Then run it in the terminal!
glances
In your Python code, add at the begin of the file, the following:
import os
import gc # Garbage Collector
After using the "Big" variable (for example: myBigVar) for which, you would like to release memory, write in your python code the following:
del myBigVar
gc.collect()
In another terminal, run your python code and observe in the "glances" terminal, how the memory is managed in your system!
Good luck!
P.S. I assume you are working on a Debian or Ubuntu system
I wrote a program to crunch some data about 2.2 million records. For each record, it will pass through a series of 20 computations that takes about total 0.01 second. To make it run faster, I use Python multiprocess, break my data generally into 6 blocks and run them in parallel with a master process distributing payloads to the processes and coordinating execution. BTW, as a result of the computations, the program will write out about 22 millions records to database.
I'm running this on a MacBookPro i7 2.2GHz with 8GB RAM running on Python 3.2.2. The data is on a local MySQL server.
The program starts well - running in an predictable manner, CPU on average is 60-70% utilized and my Macbook Pro just heats up like an oven. Then it slows down after running for about 5 hours with CPU utilization drops to average 20% on each core. Some observations I made then are:
- each Python process is consuming about 480 MB real RAM and about 850 MB virtual RAM. There are total 6 of these heavy processes
- The total virtual memory consumed by OSX (as shown by Activity Monitor) is about 300GB
I suspect the performance degradation is due to huge memory consumption and potentially high page swaps.
How can I better diagnose these symptoms?
Is there any problem with Python running with large in-memory objects for long period of time? Really I don't think running 6 hours is heavy for today's technology but heck I am only about half a year of Python experience, so... what do I know?!
Thanks!
I'm guessing the performance drop is because it is swapping stuff in and out of memory. I don't think the issue is how long the program is running - Python uses a garbage collector, so it doesn't have memory leaks.
Well, that's not entirely true. The garbage collector will make sure that it deletes anything that isn't reachable. (In other words, it will not delete something that you could conceivably access.) However, it's not smart enough to detect when a data structure won't be used for the rest of the program; you may need to clarify by setting all references to it to None that for it to work correctly.
Could you post the code involved?
Is this a program where you need a given record more than once? Is the order that you load records important for your program?
If the python process has only a few gigabytes of memory allocated, then why do you have 300 GB of used memory?