Why does my loop require more memory on each iteration?

Why does my loop require more memory on each iteration? - python

I am trying to reduce the memory requirements of my python 3 code. Right now each iteration of the for loop requires more memory than the last one.
I wrote a small piece of code that has the same behaviour as my project:
import numpy as np
from multiprocessing import Pool
from itertools import repeat
def simulation(steps, y): # the function that starts the parallel execution of f()
pool = Pool(processes=8, maxtasksperchild=int(steps/8))
results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
pool.close()
return results
def f(steps, y): # steps is used as a counter. My code doesn't need it.
a, b = np.random.random(2)
return y*a, y*b
def main():
steps = 2**20 # amount of times a random sample is taken
y = np.ones(5) # dummy variable to show that the next iteration of the code depends on the previous one
total_results = np.zeros((0,2))
for i in range(5):
results = simulation(steps, y[i-1])
y[i] = results[0][0]
total_results = np.vstack((total_results, results))
print(total_results, y)
if __name__ == "__main__":
main()
For each iteration of the for loop the threads in simulation() each have a memory usage equal to the total memory used by my code.
Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?
Ideally I would want my code to only copy the memory it requires to execute f() while I can save the results in memory.

Though the script does use quite a bit of memory even with the "smaller" example values, the answer to
Does Python clone my entire environment each time the parallel
processes are run, including the variables not required by f()? How
can I prevent this behaviour?
is that it does in a way clone the environment with forking a new process, but if copy-on-write semantics are available, no actual physical memory needs to be copied until it is written to. For example on this system
% uname -a
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
COW seems to be available and in use, but this may not be the case on other systems. On Windows this is strictly different as a new Python interpreter is executed from .exe instead of forking. Since you mention using htop, you're using some flavour of UNIX or UNIX like system, and you get COW semantics.
For each iteration of the for loop the processes in simulation() each
have a memory usage equal to the total memory used by my code.
The spawned processes will display almost identical values of RSS, but this can be misleading, because mostly they occupy the same actual physical memory mapped to multiple processes, if writes do not occur. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submitting happens over IPC and submitted data will be copied. In your example the IPC and 2**20 function calls also dominate the CPU usage. Replacing the mapping with a single vectorized multiplication in simulation took the script's runtime from around 150s to 0.66s on this machine.
We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:
import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil
def read_arr(arr, done, stop):
with done:
S = np.sum(arr)
print(S)
done.notify()
while not stop.is_set():
sleep(1)
def main():
# Create a large array
print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
input("Press Enter...")
A = np.random.random(2**28)
print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
input("Press Enter...")
done = Condition()
stop = Event()
p = Process(target=read_arr, args=(A, done, stop))
with done:
p.start()
done.wait()
print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
input("Press Enter...")
stop.set()
p.join()
if __name__ == '__main__':
main()
Output on this machine:
% python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...
Now if we replace the function read_arr with a function that modifies the array:
def mutate_arr(arr, done, stop):
with done:
arr[::4096] = 1
S = np.sum(arr)
print(S)
done.notify()
while not stop.is_set():
sleep(1)
the results are quite different:
Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...
The for-loop does indeed require more memory after each iteration, but that's obvious: it stacks the total_results from the mapping, so it has to allocate space for a new array to hold both the old results and the new and free the now unused array of old results.

Maybe you should know the difference between thread and process in Operating System. see this What is the difference between a process and a thread.
In the for loop, there are processes, not threads. Threads share the address space of the process that created it; processes have their own address space.
You can print the process id, type os.getpid().

Related

Faster Startup of Processes Python

I'm trying to run two functions in Python3 in parallel. They both take about 30ms, and unfortunately, after writing a testing script, I've found that the startup-time to get the processes running in the background takes over 100ms which is a pretty high overhead that I would like to avoid. Is anybody aware of a faster way to run functions concurrently in Python3 (having a lower overhead -- ideally in the ones or tens of milliseconds) where I can still get the results of their functions in the main process. Any guidance on this would be appreciated, and if there is any information that I can provide, please let me know.
For hardware information, I'm running this on a 2019 MacBook Pro with Python 3.10.9 with a 2GHz Quad-Core Intel Core i5.
I've provided the script that I've written below as well as the output that I typically get from it.
import multiprocessing as mp
import time
import numpy as np
def t(s):
return (time.perf_counter() - s) * 1000
def run0(s):
print(f"Time to reach run0: {t(s):.2f}ms")
time.sleep(0.03)
return np.ones((1,4))
def run1(s):
print(f"Time to reach run1: {t(s):.2f}ms")
time.sleep(0.03)
return np.zeros((1,5))
def main():
s = time.perf_counter()
with mp.Pool(processes=2) as p:
print(f"Time to init pool: {t(s):.2f}ms")
f0 = p.apply_async(run0, args=(time.perf_counter(),))
f1 = p.apply_async(run1, args=(time.perf_counter(),))
r0 = f0.get()
r1 = f1.get()
print(r0, r1)
print(f"Time to run end-to-end: {t(s):.2f}ms")
if __name__ == "__main__":
main()
Below is the output that I typically get from running the above script
Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms
Note: I'm looking to decrease the quantities on the 2nd and 3rd line by a factor of 10-20x smaller. I know that that is a lot, and if it is not possible, that is perfectly fine, but I was just wondering if anybody more knowledgable would know any methods. Thanks!

several points to consider:
"Time to init pool" is wrong. The child processes haven't finished starting, only the main process has initiated their startup. Once the workers have actually started, the speed of "Time to reach run" should drop to not include process startup. If you have a long lived pool of workers, you only pay startup cost once.
startup cost of the interpreter is often dominated by imports in this case you really only have numpy, and it is used by the target function, so you can't exactly get rid of it. Another that can be slow is the automatic import of site, but it makes other imports difficult to skip that one.
you're on MacOS, and can switch to using "fork" instead of "spawn" which should be much faster, but fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries)
example:
import multiprocessing as mp
import time
# import numpy as np
def run():
time.sleep(0.03)
return "whatever"
def main():
s = time.perf_counter()
with mp.Pool(processes=1) as p:
p.apply_async(run).get()
print(f"first job time: {(time.perf_counter() -s)*1000:.2f}ms")
#first job 166ms with numpy ; 85ms without ; 45ms on linux (wsl2 ubuntu 20.04) with fork
s = time.perf_counter()
p.apply_async(run).get()
print(f"after startup job time: {(time.perf_counter() -s)*1000:.2f}ms")
#second job about 30ms every time
if __name__ == "__main__":
main()

you can switch to python 3.11+ as it has a faster startup time (and faster everything), but as your application grows you will get even slower startup times compared to your toy example.
one option, is to run your application inside a linux docker image so you can use fork to avoid the spawn overhead, (though the COW overhead will still be visible)
the ultimate solution ? don't write your application in python (or any other language with a VM or a garbage collector), python multiprocessing isn't made for small fast tasks but for long running tasks, if you need that low startup time then write it in C or C++.
if you have to use python then you should reuse your workers to "absorb" this startup time in a much larger task time.

Why do I have idle workers when using Python multiprocessing pools?

I am breaking a very large text file up into smaller chunks, and performing further processing on the chunks. For this example, let text_chunks be a list of lists, each list containing a section of text. The elements of text_chunks range in length from ~50 to ~15000. The class ProcessedText exists elsewhere in the code and does a large amount of subsequent processing and data classification based on the text fed to it. The different text chunks are processed into ProcessedText instances in parallel using code like the following:
def do_things_to_text(a, b):
#pull out necessary things for ProcessedText initialization and return an instance
print('Processing {0}'.format(a))
return ProcessedText(a, b)
import multiprocessing as mp
#prepare inputs for starmap, pairing with list index so order can be reimposed later
pool_inputs = list(enumerate(text_chunks))
#parallel processing
pool = mp.Pool(processes=8)
results = pool.starmap_async(do_things_to_text, pool_inputs)
output = results.get()
The code executes successfully, but it seems that some of the worker processes created as part of the Pool randomly sit idle while the code runs. I track the memory usage, CPU usage, and status in top while the code executes.
At the beginning all 8 worker processes are engaged (status "R" in top and nonzero CPU usage), after ~20 entries from text_chunks are completed, the worker processes start to vary wildly. At times, as few as 1 worker process is running, and the others are in status "S" with zero CPU usage. I can also see from my printed output statements that do_things_to_text() is being called less frequently. So far I haven't been able to identify why the processes start to idle. There are plenty of entries left to process, so them sitting idle leads to time-inefficiency.
My questions are:
Why are these worker processes sitting idle?
Is there a better way to implement multiprocessing that will prevent this?
EDITED to ADD:
I have further characterized the problem. It is clear from the indexes I print out in do_things_to_text() that multiprocessing is dividing the total number of jobs into threads at every tenth index. So my console output shows Job 0, 10, 20, 30, 40, 50, 60, 70 being submitted at the same time (8 processes). And some of the Jobs complete faster than others, so you might see Job 22 completed before you see Job 1 completed.
Up until this first batch of threads is completed, all processes are active with nothing idle. However, when that batch is complete, and Job 80 starts, only one process is active, and the other 7 are idle. I have not confirmed, but I believe it stays like this until the 80-series is complete.

Here are some recommendations for better memory utilization:
I don't know how text_chunks is created but ultimately you end up with 8GB worth of strings in pool_inputs. Ideally, you would have a generator function, for example make_text_chunks, that yields the individual "text chunks" that formerly comprised the text_chunks iterable (if text_chunks is already such a generator expression, then you are all set). The idea is to not create all 8GB worth of data at once but only as the data is needed. With this strategy you can no longer use Pool method starmap_asynch; we will be using Pool.imap. This method, unlike startmap_asynch, will iteratively submit jobs in chunksize chunks and you can process the results as they become available (although that doesn't seem to be an issue).
def make_text_chunks():
# logic goes here to generate the next chunk
yield text_chunk
def do_things_to_text(t):
# t is now a tuple:
a, b = t
#pull out necessary things for ProcessedText initialization and return an instance
print('Processing {0}'.format(a))
return ProcessedText(a, b)
import multiprocessing as mp
# do not turn into a list!
pool_inputs = enumerate(make_text_chunks())
def compute_chunksize(n_jobs, poolsize):
"""
function to compute chunksize as is done by Pool module
"""
if n_jobs == 0:
return 0
chunksize, remainder = divmod(n_jobs, poolsize * 4)
if remainder:
chunksize += 1
return chunksize
#parallel processing
# number of jobs approximately
# don't know exactly without turning pool_inputs into a list, which would be self-defeating
N_JOBS = 300
POOLSIZE = 8
CHUNKSIZE = compute_chunksize(N_JOBS, POOLSIZE)
with mp.Pool(processes=POOLSIZE) as pool:
output = [result for result in pool.imap(do_things_to_text, pool_inputs, CHUNKSIZE)]

Python: how to parallelizing a simple loop with MPI

I need to rewrite a simple for loop with MPI cause each step is time consuming. Lets say I have a list including several np.array and I want to apply some computation on each array. For example:
def myFun(x):
return x+2 # simple example, the real one would be complicated
dat = [np.random.rand(3,2), np.random.rand(3,2),np.random.rand(3,2),np.random.rand(3,2)] # real data would be much larger
result = []
for item in dat:
result.append(myFun(item))
Instead of using the simple for loop above, I want to use MPI to run the 'for loop' part of the above code in parallel with 24 different nodes also I want the order of items in the result list follow the same with dat list.
Note The data is read from other file which can be treated 'fix' for each processor.
I haven't use mpi before, so this stucks me for a while.

For simplicity let us assume that the master process (the process with rank = 0) is the one that will read the entire file from disk into memory. This problem can be solved only knowing about the following MPI routines, Get_size(), Get_rank(), scatter, and gather.
The Get_size():
Returns the number of processes in the communicator. It will return
the same number to every process.
The Get_rank():
Determines the rank of the calling process in the communicator.
In MPI to each process is assigned a rank, that varies from 0 to N - 1, where N is the total number of processes running.
The scatter:
MPI_Scatter involves a designated root process sending data to all
processes in a communicator. The primary difference between MPI_Bcast
and MPI_Scatter is small but important. MPI_Bcast sends the same piece
of data to all processes while MPI_Scatter sends chunks of an array to
different processes.
and the gather:
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading
elements from one process to many processes, MPI_Gather takes elements
from many processes and gathers them to one single process.
Obviously, you should first follow a tutorial and read the MPI documentation to understand its parallel programming model, and its routines. Otherwise, you will find it very hard to understand how it all works. That being said your code could look like the following:
from mpi4py import MPI
def myFun(x):
return x+2 # simple example, the real one would be complicated
comm = MPI.COMM_WORLD
rank = comm.Get_rank() # get your process ID
data = # init the data
if rank == 0: # The master is the only process that reads the file
data = # something read from file
# Divide the data among processes
data = comm.scatter(data, root=0)
result = []
for item in data:
result.append(myFun(item))
# Send the results back to the master processes
newData = comm.gather(result,root=0)
In this way, each process will work (in parallel) in only a certain chunk of the data. After having finish their work, each process send back to the master process their data chunks (i.e., comm.gather(result,root=0)). This is just a toy example, now it is up to you to improved according to your testing environment and code.

You could either go the low-level MPI way as shown in the answer of #dreamcrash or you could go for a more Pythonic solution that uses an executor pool very similar to the one provided by the standard Python multiprocessing module.
First, you need to turn your code into a more functional-style one by noticing that you are actually doing a map operation, which applies myFun to each element of dat:
def myFun(x):
return x+2 # simple example, the real one would be complicated
dat = [
np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
] # real data would be much larger
result = map(myFun, dat)
map here runs sequentially in one Python interpreter process.
To run that map in parallel with the multiprocessing module, you only need to instantiate a Pool object and then call its map() method in place of the Python map() function:
from multiprocessing import Pool
def myFun(x):
return x+2 # simple example, the real one would be complicated
if __name__ == '__main__':
dat = [
np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
] # real data would be much larger
with Pool() as pool:
result = pool.map(myFun, dat)
Here, Pool() creates a new executor pool with as many interpreter processes as there are logical CPUs as seen by the OS. Calling the map() method of the pool runs the mapping in parallel by sending items to the different processes in the pool and waiting for completion. Since the worker processes import the Python script as a module, it is important to have the code that was previously at the top level moved under the if __name__ == '__main__': conditional so it doesn't run in the workers too.
Using multiprocessing.Pool() is very convenient because it requires only a slight change of the original code and the module handles for you all the work scheduling and the required data movement to and from the worker processes. The problem with multiprocessing is that it only works on a single host. Fortunately, mpi4py provides a similar interface through the mpi4py.futures.MPIPoolExecutor class:
from mpi4py.futures import MPIPoolExecutor
def myFun(x):
return x+2 # simple example, the real one would be complicated
if __name__ == '__main__':
dat = [
np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
] # real data would be much larger
with MPIPoolExecutor() as pool:
result = pool.map(myFun, dat)
Like with the Pool object from the multiprocessing module, the MPI pool executor handles for you all the work scheduling and data movement.
There are two ways to run the MPI program. The first one starts the script as an MPI singleton and then uses the MPI process control facility to spawn a child MPI job with all the pool workers:
mpiexec -n 1 python program.py
You also need to specify the MPI universe size (the total number of MPI ranks in both the main and all child jobs). The specific way of doing so differs between the implementations, so you need to consult your implementation's manual.
The second option is to launch directly the desired number of MPI ranks and have them execute the mpi4py.futures module itself with the script name as argument:
mpiexec -n 24 python -m mpi4py.futures program.py
Keep in mind that no mater which way you launch the script one MPI rank will be reserved for the controller and will not be running mapping tasks. You are aiming at running on 24 hosts, so you should be having plenty of CPU cores and can probably afford to have one reserved. Or you could instruct MPI to oversubscribe the first host with one more rank.
One thing to note with both multiprocessing.Pool and mpi4py.futures.MPIPoolExecutor is that the map() method guarantees the order of the items in the output array, but it doesn't guarantee the order in which the different items are evaluated. This shouldn't be a problem in most cases.
A word of advise. If your data is actually chunks read from a file, you may be tempted to do something like this:
if __name__ == '__main__':
data = read_chunks()
with MPIPoolExecutor() as p:
result = p.map(myFun, data)
Don't do that. Instead, if possible, e.g., if enabled by the presence of a shared (and hopefully parallel) filesytem, delegate the reading to the workers:
NUM_CHUNKS = 100
def myFun(chunk_num):
# You may need to pass the value of NUM_CHUNKS to read_chunk()
# for it to be able to seek to the right position in the file
data = read_chunk(NUM_CHUNKS, chunk_num)
return ...
if __name__ == '__main__':
chunk_nums = range(NUM_CHUNKS) # 100 chunks
with MPIPoolExecutor() as p:
result = p.map(myFun, chunk_nums)

multiprocessing.pool context and load balancing

I've encountered some unexpected behaviour of the python multiprocessing Pool class.
Here are my questions:
1) When does Pool creates its context, which is later used for serialization? The example below runs fine as long as the Pool object is created after the Container definition. If you swap the Pool initializations, serialization error occurs. In my production code I would like to initialize Pool way before defining the container class. Is it possible to refresh Pool "context" or to achieve this in another way.
2) Does Pool have its own load balancing mechanism and if so how does it work?
If I run a similar example on my i7 machine with the pool of 8 processes I get the following results:
- For a light evaluation function Pool favours using only one process for computation. It creates 8 processes as requested but for most of the time only one is used (I printed the pid from inside and also see this in htop).
- For a heavy evaluation function the behaviour is as expected. It uses all 8 processes equally.
3) When using Pool I always see 4 more processes that I requested (i.e. for Pool(processes=2) I see 6 new processes). What is their role?
I use Linux with Python 2.7.2
from multiprocessing import Pool
from datetime import datetime
POWER = 10
def eval_power(container):
for power in xrange(2, POWER):
container.val **= power
return container
#processes = Pool(processes=2)
class Container(object):
def __init__(self, value):
self.val = value
processes = Pool(processes=2)
if __name__ == "__main__":
cont = [Container(foo) for foo in xrange(20)]
then = datetime.now()
processes.map(eval_power, cont)
now = datetime.now()
print "Eval time:", now - then
EDIT - TO BAKURIU
1) I was afraid that that's the case.
2) I don't understand what the linux scheduler has to do with python assigning computations to processes. My situation can be ilustrated by the example below:
from multiprocessing import Pool
from os import getpid
from collections import Counter
def light_func(ind):
return getpid()
def heavy_func(ind):
for foo in xrange(1000000):
ind += foo
return getpid()
if __name__ == "__main__":
list_ = range(100)
pool = Pool(4)
l_func = pool.map(light_func, list_)
h_func = pool.map(heavy_func, list_)
print "light func:", Counter(l_func)
print "heavy func:", Counter(h_func)
On my i5 machine (4 threads) I get the following results:
light func: Counter({2967: 100})
heavy func: Counter({2969: 28, 2967: 28, 2968: 23, 2970: 21})
It seems that the situation is as I've described it. However I still don't understand why python does it this way. My guess would be that it tries to minimise communication expenses, but still the mechanism which it uses for load balancing is unknown. The documentation isn't very helpful either, the multiprocessing module is very poorly documented.
3) If I run the above code I get 4 more processes as described before. The screen comes from htop: http://i.stack.imgur.com/PldmM.png

The Pool object creates the subprocesses during the call to __init__ hence you must define Container before. By the way, I wouldn't include all the code in a single file but use a module to implement the Container and other utilities and write a small file that launches the main program.
The Pool does exactly what is described in the documentation. In particular it has no control over the scheduling of the processes hence what you see is what Linux's scheduler thinks it is right. For small computations they take so little time that the scheduler doesn't bother parallelizing them(this probably have better performances due to core affinity etc.)
Could you show this with an example and what you see in the task manager? I think they may be the processes that handle the queue inside the Pool, but I'm not sure. On my machine I can see only the main process plus the two subprocesses.
Update on point 2:
The Pool object simply puts the tasks into a queue, and the child processes get the arguments from this queue. If a process takes almost no time to execute an object, than Linux scheduler let the process execute more time(hence consuming more items from the queue). If the execution takes much time then this scheduler will change processes and thus the other child processes are also executed.
In your case a single process is consuming all items because the computation take so little time that before the other child processes are ready it has already finished all items.
As I said, Pool doesn't do anything about balancing the work of the subprocesses. It's simply a queue and a bunch of workers, the pool puts items in the queue and the processes get the items and compute the results. AFAIK the only thing that it does to control the queue is putting a certain number of tasks in a single item in the queue(see the documentation) but there is no guarantee about which process will grab which task. Everything else is left to the OS.
On my machine the results are less extreme. Two processes get about twice the number of calls than the other two for the light computation, while for the heavy one all have more or less the same number of items processed. Probably on different OSes and/or hardware we would obtain even different results.

Python: Multicore processing?

I've been reading about Python's multiprocessing module. I still don't think I have a very good understanding of what it can do.
Let's say I have a quadcore processor and I have a list with 1,000,000 integers and I want the sum of all the integers. I could simply do:
list_sum = sum(my_list)
But this only sends it to one core.
Is it possible, using the multiprocessing module, to divide the array up and have each core get the sum of it's part and return the value so the total sum may be computed?
Something like:
core1_sum = sum(my_list[0:500000]) #goes to core 1
core2_sum = sum(my_list[500001:1000000]) #goes to core 2
all_core_sum = core1_sum + core2_sum #core 3 does final computation
Any help would be appreciated.

Yes, it's possible to do this summation over several processes, very much like doing it with multiple threads:
from multiprocessing import Process, Queue
def do_sum(q,l):
q.put(sum(l))
def main():
my_list = range(1000000)
q = Queue()
p1 = Process(target=do_sum, args=(q,my_list[:500000]))
p2 = Process(target=do_sum, args=(q,my_list[500000:]))
p1.start()
p2.start()
r1 = q.get()
r2 = q.get()
print r1+r2
if __name__=='__main__':
main()
However, it is likely that doing it with multiple processes is likely slower than doing it in a single process, as copying the data forth and back is more expensive than summing them right away.

Welcome the world of concurrent programming.
What Python can (and can't) do depends on two things.
What the OS can (and can't) do. Most OS's allocate processes to cores. To use 4 cores, you need to break your problem into four processes. This is easier than it sounds. Sometimes.
What the underlying C libraries can (and can't) do. If the C libraries expose features of the OS AND the OS exposes features of the hardware, you're solid.
To break a problem into multiple processes -- especially in GNU/Linux -- is easy. Break it into a multi-step pipeline.
In the case of summing a million numbers, think of the following shell script. Assuming some hypothetical sum.py program that sums either a range of numbers or a list of numbers on stdin.
( sum.py 0 500000 & sum.py 50000 1000000 ) | sum.py
This would have 3 concurrent processes. Two are doing sums of a lot of numbers, the third is summing two numbers.
Since the GNU/Linux shells and the OS already handle some parts of concurrency for you, you can design simple (very, very simple) programs that read from stdin, write to stdout, and are designed to do small parts of a large job.
You can try to reduce the overheads by using subprocess to build the pipeline instead of allocating the job to the shell. You may find, however, that the shell builds pipelines very, very quickly. (It was written directly in C and makes direct OS API calls for you.)

Sure, for example:
from multiprocessing import Process, Queue
thelist = range(1000*1000)
def f(q, sublist):
q.put(sum(sublist))
def main():
start = 0
chunk = 500*1000
queue = Queue()
NP = 0
subprocesses = []
while start < len(thelist):
p = Process(target=f, args=(queue, thelist[start:start+chunk]))
NP += 1
print 'delegated %s:%s to subprocess %s' % (start, start+chunk, NP)
p.start()
start += chunk
subprocesses.append(p)
total = 0
for i in range(NP):
total += queue.get()
print "total is", total, '=', sum(thelist)
while subprocesses:
subprocesses.pop().join()
if __name__ == '__main__':
main()
results in:
$ python2.6 mup.py
delegated 0:500000 to subprocess 1
delegated 500000:1000000 to subprocess 2
total is 499999500000 = 499999500000
note that this granularity is too fine to be worth spawning processes for -- the overall summing task is small (which is why I can recompute the sum in main as a check;-) and too many data is being moved back and forth (in fact the subprocesses wouldn't need to get copies of the sublists they work on -- indices would suffice). So, it's a "toy example" where multiprocessing isn't really warranted. With different architectures (use a pool of subprocesses that receive multiple tasks to perform from a queue, minimize data movement back and forth, etc, etc) and on less granular tasks you could actually get benefits in terms of performance, however.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.