Python: Multicore processing?

Python: Multicore processing? - python

I've been reading about Python's multiprocessing module. I still don't think I have a very good understanding of what it can do.
Let's say I have a quadcore processor and I have a list with 1,000,000 integers and I want the sum of all the integers. I could simply do:
list_sum = sum(my_list)
But this only sends it to one core.
Is it possible, using the multiprocessing module, to divide the array up and have each core get the sum of it's part and return the value so the total sum may be computed?
Something like:
core1_sum = sum(my_list[0:500000]) #goes to core 1
core2_sum = sum(my_list[500001:1000000]) #goes to core 2
all_core_sum = core1_sum + core2_sum #core 3 does final computation
Any help would be appreciated.

Yes, it's possible to do this summation over several processes, very much like doing it with multiple threads:
from multiprocessing import Process, Queue
def do_sum(q,l):
q.put(sum(l))
def main():
my_list = range(1000000)
q = Queue()
p1 = Process(target=do_sum, args=(q,my_list[:500000]))
p2 = Process(target=do_sum, args=(q,my_list[500000:]))
p1.start()
p2.start()
r1 = q.get()
r2 = q.get()
print r1+r2
if __name__=='__main__':
main()
However, it is likely that doing it with multiple processes is likely slower than doing it in a single process, as copying the data forth and back is more expensive than summing them right away.

Welcome the world of concurrent programming.
What Python can (and can't) do depends on two things.
What the OS can (and can't) do. Most OS's allocate processes to cores. To use 4 cores, you need to break your problem into four processes. This is easier than it sounds. Sometimes.
What the underlying C libraries can (and can't) do. If the C libraries expose features of the OS AND the OS exposes features of the hardware, you're solid.
To break a problem into multiple processes -- especially in GNU/Linux -- is easy. Break it into a multi-step pipeline.
In the case of summing a million numbers, think of the following shell script. Assuming some hypothetical sum.py program that sums either a range of numbers or a list of numbers on stdin.
( sum.py 0 500000 & sum.py 50000 1000000 ) | sum.py
This would have 3 concurrent processes. Two are doing sums of a lot of numbers, the third is summing two numbers.
Since the GNU/Linux shells and the OS already handle some parts of concurrency for you, you can design simple (very, very simple) programs that read from stdin, write to stdout, and are designed to do small parts of a large job.
You can try to reduce the overheads by using subprocess to build the pipeline instead of allocating the job to the shell. You may find, however, that the shell builds pipelines very, very quickly. (It was written directly in C and makes direct OS API calls for you.)

Sure, for example:
from multiprocessing import Process, Queue
thelist = range(1000*1000)
def f(q, sublist):
q.put(sum(sublist))
def main():
start = 0
chunk = 500*1000
queue = Queue()
NP = 0
subprocesses = []
while start < len(thelist):
p = Process(target=f, args=(queue, thelist[start:start+chunk]))
NP += 1
print 'delegated %s:%s to subprocess %s' % (start, start+chunk, NP)
p.start()
start += chunk
subprocesses.append(p)
total = 0
for i in range(NP):
total += queue.get()
print "total is", total, '=', sum(thelist)
while subprocesses:
subprocesses.pop().join()
if __name__ == '__main__':
main()
results in:
$ python2.6 mup.py
delegated 0:500000 to subprocess 1
delegated 500000:1000000 to subprocess 2
total is 499999500000 = 499999500000
note that this granularity is too fine to be worth spawning processes for -- the overall summing task is small (which is why I can recompute the sum in main as a check;-) and too many data is being moved back and forth (in fact the subprocesses wouldn't need to get copies of the sublists they work on -- indices would suffice). So, it's a "toy example" where multiprocessing isn't really warranted. With different architectures (use a pool of subprocesses that receive multiple tasks to perform from a queue, minimize data movement back and forth, etc, etc) and on less granular tasks you could actually get benefits in terms of performance, however.

Related

Python: how to parallelizing a simple loop with MPI

I need to rewrite a simple for loop with MPI cause each step is time consuming. Lets say I have a list including several np.array and I want to apply some computation on each array. For example:
def myFun(x):
return x+2 # simple example, the real one would be complicated
dat = [np.random.rand(3,2), np.random.rand(3,2),np.random.rand(3,2),np.random.rand(3,2)] # real data would be much larger
result = []
for item in dat:
result.append(myFun(item))
Instead of using the simple for loop above, I want to use MPI to run the 'for loop' part of the above code in parallel with 24 different nodes also I want the order of items in the result list follow the same with dat list.
Note The data is read from other file which can be treated 'fix' for each processor.
I haven't use mpi before, so this stucks me for a while.

For simplicity let us assume that the master process (the process with rank = 0) is the one that will read the entire file from disk into memory. This problem can be solved only knowing about the following MPI routines, Get_size(), Get_rank(), scatter, and gather.
The Get_size():
Returns the number of processes in the communicator. It will return
the same number to every process.
The Get_rank():
Determines the rank of the calling process in the communicator.
In MPI to each process is assigned a rank, that varies from 0 to N - 1, where N is the total number of processes running.
The scatter:
MPI_Scatter involves a designated root process sending data to all
processes in a communicator. The primary difference between MPI_Bcast
and MPI_Scatter is small but important. MPI_Bcast sends the same piece
of data to all processes while MPI_Scatter sends chunks of an array to
different processes.
and the gather:
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading
elements from one process to many processes, MPI_Gather takes elements
from many processes and gathers them to one single process.
Obviously, you should first follow a tutorial and read the MPI documentation to understand its parallel programming model, and its routines. Otherwise, you will find it very hard to understand how it all works. That being said your code could look like the following:
from mpi4py import MPI
def myFun(x):
return x+2 # simple example, the real one would be complicated
comm = MPI.COMM_WORLD
rank = comm.Get_rank() # get your process ID
data = # init the data
if rank == 0: # The master is the only process that reads the file
data = # something read from file
# Divide the data among processes
data = comm.scatter(data, root=0)
result = []
for item in data:
result.append(myFun(item))
# Send the results back to the master processes
newData = comm.gather(result,root=0)
In this way, each process will work (in parallel) in only a certain chunk of the data. After having finish their work, each process send back to the master process their data chunks (i.e., comm.gather(result,root=0)). This is just a toy example, now it is up to you to improved according to your testing environment and code.

You could either go the low-level MPI way as shown in the answer of #dreamcrash or you could go for a more Pythonic solution that uses an executor pool very similar to the one provided by the standard Python multiprocessing module.
First, you need to turn your code into a more functional-style one by noticing that you are actually doing a map operation, which applies myFun to each element of dat:
def myFun(x):
return x+2 # simple example, the real one would be complicated
dat = [
np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
] # real data would be much larger
result = map(myFun, dat)
map here runs sequentially in one Python interpreter process.
To run that map in parallel with the multiprocessing module, you only need to instantiate a Pool object and then call its map() method in place of the Python map() function:
from multiprocessing import Pool
def myFun(x):
return x+2 # simple example, the real one would be complicated
if __name__ == '__main__':
dat = [
np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
] # real data would be much larger
with Pool() as pool:
result = pool.map(myFun, dat)
Here, Pool() creates a new executor pool with as many interpreter processes as there are logical CPUs as seen by the OS. Calling the map() method of the pool runs the mapping in parallel by sending items to the different processes in the pool and waiting for completion. Since the worker processes import the Python script as a module, it is important to have the code that was previously at the top level moved under the if __name__ == '__main__': conditional so it doesn't run in the workers too.
Using multiprocessing.Pool() is very convenient because it requires only a slight change of the original code and the module handles for you all the work scheduling and the required data movement to and from the worker processes. The problem with multiprocessing is that it only works on a single host. Fortunately, mpi4py provides a similar interface through the mpi4py.futures.MPIPoolExecutor class:
from mpi4py.futures import MPIPoolExecutor
def myFun(x):
return x+2 # simple example, the real one would be complicated
if __name__ == '__main__':
dat = [
np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
] # real data would be much larger
with MPIPoolExecutor() as pool:
result = pool.map(myFun, dat)
Like with the Pool object from the multiprocessing module, the MPI pool executor handles for you all the work scheduling and data movement.
There are two ways to run the MPI program. The first one starts the script as an MPI singleton and then uses the MPI process control facility to spawn a child MPI job with all the pool workers:
mpiexec -n 1 python program.py
You also need to specify the MPI universe size (the total number of MPI ranks in both the main and all child jobs). The specific way of doing so differs between the implementations, so you need to consult your implementation's manual.
The second option is to launch directly the desired number of MPI ranks and have them execute the mpi4py.futures module itself with the script name as argument:
mpiexec -n 24 python -m mpi4py.futures program.py
Keep in mind that no mater which way you launch the script one MPI rank will be reserved for the controller and will not be running mapping tasks. You are aiming at running on 24 hosts, so you should be having plenty of CPU cores and can probably afford to have one reserved. Or you could instruct MPI to oversubscribe the first host with one more rank.
One thing to note with both multiprocessing.Pool and mpi4py.futures.MPIPoolExecutor is that the map() method guarantees the order of the items in the output array, but it doesn't guarantee the order in which the different items are evaluated. This shouldn't be a problem in most cases.
A word of advise. If your data is actually chunks read from a file, you may be tempted to do something like this:
if __name__ == '__main__':
data = read_chunks()
with MPIPoolExecutor() as p:
result = p.map(myFun, data)
Don't do that. Instead, if possible, e.g., if enabled by the presence of a shared (and hopefully parallel) filesytem, delegate the reading to the workers:
NUM_CHUNKS = 100
def myFun(chunk_num):
# You may need to pass the value of NUM_CHUNKS to read_chunk()
# for it to be able to seek to the right position in the file
data = read_chunk(NUM_CHUNKS, chunk_num)
return ...
if __name__ == '__main__':
chunk_nums = range(NUM_CHUNKS) # 100 chunks
with MPIPoolExecutor() as p:
result = p.map(myFun, chunk_nums)

Multiprocessing time increases linearly with more cores

I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound, rather than CPU bound? (And if so - what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking it up and running 2 - 10 at once would have been 2 - 10 times faster. Or, potentially I implementing multi-process incorrectly?
## Worker function just included to give some context
def worker(sub_code):
layer = 'in_memory/lyr_{}'.format(sub_code)
arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
arcpy.env.extent = layer
union_name = 'in_memory/union_' + sub_code
arcpy.Union_analysis([fields],
union_name,
"NO_FID", "1 FEET")
#.......Some calculations using cursors
# Templating using Jinjah
context = {}
context['DATE'] = now.strftime("%B %d, %Y")
context['SUB_CD'] = sub_code
context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
# Etc
# Then write the report out using custom function
write_html('template.html', 'output_folder', context)
if __name__ == '__main__':
subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
NUM_CORES = 7
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
for chunk in chunk_list:
jobs = []
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()

There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.

You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)

I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()
Should instead be:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
p.start()
p.join()
Is there a specific reason you are going against the spec of using the multiprocessing library? You are not waiting until the thread terminates before spinning another process up, which is just going to create a whole bunch of processes that are not handled by the parent calling process.

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
return x
def foo():
p = multiprocessing.Pool()
mapper = p.imap_unordered
for x in xrange(1, 11):
res = list(mapper(f,bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small like xrange(1, 6). However, when I increase the range to xrange(1, 10). I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shutdowns the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: Its a simple ngram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random
def f(x):
return x
def ngrams(input_tmp, n):
input = input_tmp.split()
if n > len(input):
n = len(input)
output = []
for i in range(len(input)-n+1):
output.append(input[i:i+n])
return output
def foo():
p = multiprocessing.Pool()
mapper = p.imap_unordered
num = 100000000 #100
rand_list = random.sample(xrange(100000000), num)
rand_str = ' '.join(str(i) for i in rand_list)
for n in xrange(1, 100):
res = list(mapper(f, ngrams(rand_str, n)))
if __name__ == '__main__':
start = time.time()
foo()
print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g.,num = 100000000). Only 2 CPUs are used and rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.

First, ngrams itself takes a lot of time. While that's happening, it's obviously only one one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.

The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
arg = ngrams(rand_str, n)
res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.

Multiprocessing in python

I am writing a Python script (in Python 2.7) wherein I need to generate around 500,000 uniform random numbers within a range. I need to do this 4 times, perform some calculations on them and write out the 4 files.
At the moment I am doing: (this is just part of my for loop, not the entire code)
random_RA = []
for i in xrange(500000):
random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA
random_dec = []
for i in xrange(500000):
random_dec.append(np.random.uniform(min(data_dec_1),max(data_dec_1))) # FINAL RANDOM 'dec'
to generate the random numbers within the range. I am running Ubuntu 14.04 and when I run the program I also open my system manager to see how the 8 CPU's I have are working. I seem to notice that when the program is running, only 1 of the 8 CPU's seem to work at 100% efficiency. So the entire program takes me around 45 minutes to complete.
I noticed that it is possible to use all the CPU's to my advantage using the module Multiprocessing
I would like to know if this is enough in my example:
random_RA = []
for i in xrange(500000):
multiprocessing.Process()
random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA
i.e adding just the line multiprocessing.Process(), would that be enough?

If you use multiprocessing, you should avoid shared state (like your random_RA list) as much as possible.
Instead, try to use a Pool and its map method:
from multiprocessing import Pool, cpu_count
def generate_random_ra(x):
return np.random.uniform(6.061, 6.505)
def generate_random_dec(x):
return np.random.uniform(min(data_dec_1), max(data_dec_1))
pool = Pool(cpu_count())
random_RA = pool.map(generate_random_ra, xrange(500000))
random_dec = pool.map(generate_random_dec, xrange(500000))

To get you started:
import multiprocessing
import random
def worker(i):
random.uniform(1,100000)
print i,'done'
if __name__ == "__main__":
for i in range(4):
t = multiprocessing.Process(target = worker, args=(i,))
t.start()
print 'All the processes have been started.'
You must gate the t = multiprocess.Process(...) with __name__ == "__main__" as each worker calls this program (module) again to find out what it has to do. If the gating didn't happen it would spawn more processes ...
Just for completeness, generating 500000 random numbers is not going to take you 45 minutes so i assume there are some intensive calculations going on here: you may want to look at them closely.

Multiprocessing in Python while limiting the number of running processes

I'd like to run multiple instances of program.py simultaneously, while limiting the number of instances running at the same time (e.g. to the number of CPU cores available on my system). For example, if I have 10 cores and have to do 1000 runs of program.py in total, only 10 instances will be created and running at any given time.
I've tried using the multiprocessing module, multithreading, and using queues, but there's nothing that seemed to me to lend itself to an easy implementation. The biggest problem I have is finding a way to limit the number of processes running simultaneously. This is important because if I create 1000 processes at once, it becomes equivalent to a fork bomb. I don't need the results returned from the processes programmatically (they output to disk), and the processes all run independently of each other.
Can anyone please give me suggestions or an example of how I could implement this in python, or even bash? I'd post the code I've written so far using queues, but it doesn't work as intended and might already be down the wrong path.
Many thanks.

I know you mentioned that the Pool.map approach doesn't make much sense to you. The map is just an easy way to give it a source of work, and a callable to apply to each of the items. The func for the map could be any entry point to do the actual work on the given arg.
If that doesn't seem right for you, I have a pretty detailed answer over here about using a Producer-Consumer pattern: https://stackoverflow.com/a/11196615/496445
Essentially, you create a Queue, and start N number of workers. Then you either feed the queue from the main thread, or create a Producer process that feeds the queue. The workers just keep taking work from the queue and there will never be more concurrent work happening than the number of processes you have started.
You also have the option of putting a limit on the queue, so that it blocks the producer when there is already too much outstanding work, if you need to put constraints also on the speed and resources that the producer consumes.
The work function that gets called can do anything you want. This can be a wrapper around some system command, or it can import your python lib and run the main routine. There are specific process management systems out there which let you set up configs to run your arbitrary executables under limited resources, but this is just a basic python approach to doing it.
Snippets from that other answer of mine:
Basic Pool:
from multiprocessing import Pool
def do_work(val):
# could instantiate some other library class,
# call out to the file system,
# or do something simple right here.
return "FOO: %s" % val
pool = Pool(4)
work = get_work_args()
results = pool.map(do_work, work)
Using a process manager and producer
from multiprocessing import Process, Manager
import time
import itertools
def do_work(in_queue, out_list):
while True:
item = in_queue.get()
# exit signal
if item == None:
return
# fake work
time.sleep(.5)
result = item
out_list.append(result)
if __name__ == "__main__":
num_workers = 4
manager = Manager()
results = manager.list()
work = manager.Queue(num_workers)
# start for workers
pool = []
for i in xrange(num_workers):
p = Process(target=do_work, args=(work, results))
p.start()
pool.append(p)
# produce data
# this could also be started in a producer process
# instead of blocking
iters = itertools.chain(get_work_args(), (None,)*num_workers)
for item in iters:
work.put(item)
for p in pool:
p.join()
print results

You should use a process supervisor. One approach would be using the API provided by Circus to do that "programatically", the documentation site is now offline but I think its just a temporary problem, anyway, you can use the Circus to handle this. Another approach would be using the supervisord and setting the parameter numprocs of the process to the number of cores you have.
An example using Circus:
from circus import get_arbiter
arbiter = get_arbiter("myprogram", numprocesses=3)
try:
arbiter.start()
finally:
arbiter.stop()

Bash script rather than Python, but I use it often for simple parallel processing:
#!/usr/bin/env bash
waitForNProcs()
{
nprocs=$(pgrep -f $procName | wc -l)
while [ $nprocs -gt $MAXPROCS ]; do
sleep $SLEEPTIME
nprocs=$(pgrep -f $procName | wc -l)
done
}
SLEEPTIME=3
MAXPROCS=10
procName=myPython.py
for file in ./data/*.txt; do
waitForNProcs
./$procName $file &
done
Or for very simple cases, another option is xargs where P sets the number of procs
find ./data/ | grep txt | xargs -P10 -I SUB ./myPython.py SUB

While there are many answers about using multiprocessing.pool, there are not many code snippets on how to use multiprocessing.Process, which is indeed more beneficial when memory usage matters. starting 1000 processes will overload the CPU and kill the memory. If each process and its data pipelines are memory intensive, OS or Python itself will limit the number of parallel processes. I developed the below code to limit the simultaneous number of jobs submitted to the CPU in batches. The batch size can be scaled proportional to the number of CPU cores. In my windows PC, the number of jobs per batch can be efficient upto 4 times the CPU coures available.
import multiprocessing
def func_to_be_multiprocessed(q,data):
q.put(('s'))
q = multiprocessing.Queue()
worker = []
for p in range(number_of_jobs):
worker[p].append(multiprocessing.Process(target=func_to_be_multiprocessed, \
args=(q,data)...))
num_cores = multiprocessing.cpu_count()
Scaling_factor_batch_jobs = 3.0
num_jobs_per_batch = num_cores * Scaling_factor_batch_jobs
num_of_batches = number_of_jobs // num_jobs_per_batch
for i_batch in range(num_of_batches):
floor_job = i_batch * num_jobs_per_batch
ceil_job = floor_job + num_jobs_per_batch
for p in worker[floor_job : ceil_job]:
worker.start()
for p in worker[floor_job : ceil_job]:
worker.join()
for p in worker[ceil_job :]:
worker.start()
for p in worker[ceil_job :]:
worker.join()
for p in multiprocessing.active_children():
p.terminate()
result = []
for p in worker:
result.append(q.get())
The only problem is, if any of the job in any batch could not complete and leads to a hanging situation, rest of the batches of jobs will not be initiated. So, the function to be processed must have proper error handling routines.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.