From multiprocessing to distributed processing in python standard library


I am studying this code from GitHub about distributed processing. I would like to thank eliben for this nice post. I have read his explanations but there are some dark spots. As far as I understand, the code is for distributing tasks across multiple machines/clients. My questions are:
The most basic of my questions: where does the distribution of the work to different machines happen?
Why is there an if/else statement in the main function?
Let me start this question in a more general way. I thought that we usually start a Process on a specific chunk (an independent part of memory) rather than passing all the chunks at once, like this:
chunksize = int(math.ceil(len(HugeList) / float(nprocs)))
for i in range(nprocs):
    p = Process(
        target=myWorker,  # This is my worker
        args=(HugeList[chunksize * i:chunksize * (i + 1)],
              HUGEQ))
    processes.append(p)
    p.start()
In this simple case we have nprocs processes. Each process runs an instance of the function myWorker, which works on its specified chunk.
My question here is:
How many threads do we have for each process that works on each chunk?
Looking now at the GitHub code, I am trying to understand mp_factorizer. More specifically, in this function we do not have chunks but a huge queue (shared_job_q). This huge queue consists of sub-lists of at most 43 items. The queue is passed to factorizer_worker. There, via get, we obtain those sub-lists and pass them to the serial worker. I understand that we need this queue to share data between clients.
My questions here are:
Do we call an instance of the factorizer_worker function for each of the nprocs(=8) processes?
Which part of the data does each process work on? (Generally, we have 8 processes and 43 chunks.)
How many threads exist for each process?
Is the get function called from each process thread?
Thanks for your time.

The distribution to multiple machines only happens if you actually run the script on multiple machines. The first time you run the script (without the --client option), it starts the Manager server on a specific IP/port, which hosts the shared job/result queues. In addition to starting the Manager server, runserver will also act as a worker, by calling mp_factorizer. It is additionally responsible for collecting the results from the result queue and processing them. You could run this script by itself and get a complete result.
However, you can also distribute the factorization work to other machines, by running the script on other machines using the --client option. That will call runclient, which will connect to the existing Manager server you started with the initial run of the script. That means that the clients are accessing the same shared queues runserver is using, so they can all pull work from and put results to the same queues.
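The server/client arrangement described above can be sketched with multiprocessing.managers.BaseManager. This is a minimal sketch, not the code from the post: the names ServerManager, ClientManager, AUTHKEY, get_job_q and get_result_q are chosen here for illustration, and the "client" connects from the same process over localhost so that the example runs on a single machine. On a real second machine, only the client half would run, pointed at the server's IP and port.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

# The real queues live only on the server side.
job_q = queue.Queue()
result_q = queue.Queue()

class ServerManager(BaseManager):
    """Server-side manager: knows how to produce the shared queues."""

ServerManager.register("get_job_q", callable=lambda: job_q)
ServerManager.register("get_result_q", callable=lambda: result_q)

class ClientManager(BaseManager):
    """Client-side manager: only knows the names, not the objects."""

ClientManager.register("get_job_q")
ClientManager.register("get_result_q")

AUTHKEY = b"example-key"  # hypothetical shared secret

# Port 0 asks the OS for any free port; the server is hosted in a
# background thread here purely to keep the sketch in one process.
server = ServerManager(address=("127.0.0.1", 0), authkey=AUTHKEY).get_server()
threading.Thread(target=server.serve_forever, daemon=True).start()

job_q.put([2, 3, 4])  # the server publishes one chunk of work

# What a client would do (possibly on another machine).
client = ClientManager(address=server.address, authkey=AUTHKEY)
client.connect()
remote_jobs = client.get_job_q()        # proxy to the server's job_q
remote_results = client.get_result_q()  # proxy to the server's result_q
chunk = remote_jobs.get()
remote_results.put(sum(chunk))

# Back on the server: collect the result from the real queue.
answer = result_q.get(timeout=5)
```

The client never holds the queues themselves, only proxies; every put and get is forwarded to the server, which is why all workers see the same work list.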
The above should cover questions 1 and 2.
I'm not exactly sure what you're asking in question 3. I think you're wondering why we don't pass a chunk of the list to each worker explicitly (like in the example you included), rather than putting all the chunks into a queue. The answer is that the runserver method doesn't know how many workers there are actually going to be. It knows that it's going to start 8 workers itself. However, it doesn't want to split the HugeList into eight chunks and send them to the 8 processes it's creating, because it wants to support remote clients connecting to the Manager and doing work, too. So instead, it picks an arbitrary size for each chunk (43), divides the list into as many chunks of that size as it takes to consume the entire HugeList, and puts them in a Queue. Here's the code in runserver that does that:
chunksize = 43
for i in range(0, len(nums), chunksize):
    #print 'putting chunk %s:%s in job Q' % (i, i + chunksize)
    shared_job_q.put(nums[i:i + chunksize])  # Adds a chunk of up to 43 items to the shared queue.
That way, as many workers as you want can connect to the Manager server, grab a chunk from shared_job_q, process it, and return a result.
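To make the arithmetic concrete, here is a small sketch of the same chunking loop. It uses a plain queue.Queue as a stand-in for the shared manager queue, and the function name fill_job_queue and the 1000-item input are assumptions made for the example.

```python
import math
import queue

def fill_job_queue(nums, chunksize, job_q):
    """Put consecutive slices of at most `chunksize` items on the queue.

    Returns the number of chunks that were queued.
    """
    count = 0
    for i in range(0, len(nums), chunksize):
        job_q.put(nums[i:i + chunksize])
        count += 1
    return count

nums = list(range(1000))  # stand-in for the list of numbers to factorize
job_q = queue.Queue()     # stand-in for shared_job_q
n_chunks = fill_job_queue(nums, 43, job_q)

# Drain the queue just to inspect the chunk sizes.
sizes = []
while not job_q.empty():
    sizes.append(len(job_q.get()))
```

The number of chunks depends only on the input length and the chunk size (the ceiling of len(nums) / chunksize), not on how many workers exist, and only the last chunk can be shorter than 43.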
Do we call an instance of the factorizer_worker function for each of the nprocs(=8) processes?
Yes.
Which part of the data does each process work on? (Generally, we have 8 processes and 43 chunks.)
We don't have 43 chunks. We have X chunks, each of size 43 (the last one may be smaller). Each worker process just grabs chunks off the queue and processes them. Which chunks it gets is arbitrary and depends on how many workers there are and how fast each one is going.
How many threads exist for each process?
One. If you mean how many worker processes exist for each instance of the script, there are 8 in the server process, and 4 in each client process.
Is the get function called from each process thread?
Not sure what you mean by this.
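The "grab a chunk, process it, repeat" pattern can be sketched as follows. Threads and queue.Queue stand in for the worker processes and the shared queues only so that the sketch runs anywhere without extra setup; the worker function, the names w0..w7 and the use of sum as the "work" are all assumptions for illustration.

```python
import queue
import threading

def worker(job_q, result_q, name):
    """Keep pulling chunks until the job queue is empty."""
    while True:
        try:
            chunk = job_q.get_nowait()
        except queue.Empty:
            return  # no work left, so this worker stops
        # Record which worker handled the chunk, plus a toy result.
        result_q.put((name, sum(chunk)))

nums = list(range(200))
job_q = queue.Queue()
result_q = queue.Queue()
for i in range(0, len(nums), 43):
    job_q.put(nums[i:i + 43])

workers = [
    threading.Thread(target=worker, args=(job_q, result_q, "w%d" % i))
    for i in range(8)
]
for t in workers:
    t.start()
for t in workers:
    t.join()

results = []
while not result_q.empty():
    results.append(result_q.get())
```

Every chunk is handled exactly once, but which worker handles which chunk can differ from run to run, and with 8 workers and only 5 chunks some workers find the queue already empty and do nothing.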

Related

Understanding python multiprocessing pool map thread safety

This question had conflicting answers: Are Python multiprocessing Pool thread safe?
I am new to concurrency patterns and I am trying to run a project that takes in an array and distributes the work of the array onto multiple processes. The array is large.
inputs = range(100000)
with Pool(2) as pool:
    res = pool.map(some_func, inputs)
My understanding is that pool will distribute tasks to the processes. My questions are:
Is this map operation thread safe? Will two processes ever accidentally try to process the same value?
I superficially understand that tasks will be divided up into chunks and sent to processes. However, if some inputs take more time than others, will the work always be evenly distributed across my processes? Will I ever be in a scenario where one process is hanging but has a long queue of tasks to do while other processes are idle?
My understanding is that since I am just reading inputs in, I don't need to use any interprocess communication patterns like a server manager / shared memory. Is that right?
If I set up more processes than cores, will it basically operate like threads where the CPU is switching between tasks?
Thank you!

With the code provided, no item of inputs can be processed by more than one process (the only exception would be the same object appearing more than once in the iterable passed as argument). What you can tune is the overhead: every task sent to a worker costs some inter-process communication, so the fewer and larger the chunks, the lower that cost. When chunksize is omitted, Pool.map chooses a chunk size itself, based on the input length and the pool size (it is imap and imap_unordered that default to chunksize=1). You can also set it explicitly with the chunksize parameter:
inputs = range(100000)
n_proc = 2
chunksize = len(inputs) // n_proc
if len(inputs) % n_proc:
    chunksize += 1
with Pool(n_proc) as pool:
    res = pool.map(some_func, inputs, chunksize=chunksize)
This way, each process receives its share of inputs in a single chunk, which keeps the communication overhead to a minimum.
The work is always divided into chunks; chunksize only controls how big they are. Each chunk is sent to whichever process in the pool is available, so chunks are handed out as processes finish their previous one. There is no need for every process to take the same number of chunks. In your example, if some_func takes longer for larger values and chunksize = len(inputs)/2, the process that gets the first half of inputs (the smaller values) will finish first while the other takes much longer. In that case, a smaller chunk size is a better option, so that the work is distributed evenly.
This depends on what some_func does. If you do not need the result of some_func(n) to process some_func(m), you do not need to communicate between processes. If you are using map and need to communicate between processes, it is very likely that you are taking a bad approach to solving your problem.
If the number of worker processes exceeds os.cpu_count(), the CPU will switch between processes more often than it would with fewer processes. Don't forget that a (not amazingly old) computer runs many more processes than just your program. On Windows, concurrent.futures.ProcessPoolExecutor requires max_workers to be at most 61 (see its documentation).
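The trade-off between large and small chunks can be illustrated with a small simulation instead of real processes. This is only a model under stated assumptions: the cost of an item is taken to be its value, chunks are handed out in order to whichever worker becomes free first, and communication overhead is ignored. The function name finish_time is made up for the example.

```python
import heapq

def finish_time(costs, n_workers, chunksize):
    """Simulated time at which the last worker finishes.

    Chunks are consecutive slices of `costs`; each one goes to the
    worker that becomes free earliest.
    """
    chunks = [costs[i:i + chunksize] for i in range(0, len(costs), chunksize)]
    free_at = [0] * n_workers  # min-heap of times at which workers are free
    heapq.heapify(free_at)
    for chunk in chunks:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + sum(chunk))
    return max(free_at)

costs = list(range(1, 101))  # larger inputs take longer

two_big_chunks = finish_time(costs, n_workers=2, chunksize=50)    # 3775
many_small_chunks = finish_time(costs, n_workers=2, chunksize=5)
ideal = sum(costs) / 2                                            # 2525.0
```

With two chunks of 50, one worker gets all the expensive items and finishes at 3775 while the other sits idle from 1275 onwards. With chunks of 5 the finish time lands much closer to the ideal of 2525, at the price of more tasks being sent to the workers.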

Python multiprocessing pool number of jobs not correct

I wrote a python program to launch parallel processes (16) using pool, to process some files. At the beginning of the run, the number of processes is maintained at 16 until almost all files get processed. Then, for some reasons which I don't understand, when there're only a few files left, only one process runs at a time which makes processing time much longer than necessary. Could you help with this?

Force map() to use a chunksize of 1 instead of guessing the best value by itself, e.g.:
pool = Pool(16)
pool.map(func, iterable, 1)
This should (in theory) guarantee the best distribution of load among workers until the end of the input data.

Before Python starts executing the function you pass to map/map_async of Pool, it splits the work into chunks, and a worker always takes a whole chunk at a time.
For example, let's say that you have 8 files to process and you start a Pool with 4 workers.
If the chunks hold two files each (for instance with chunksize=2), two specific files are assigned to each worker before the file processing starts. This means that a worker that finishes its chunk earlier than the others will simply "have a break": with no unclaimed chunks left, it cannot start helping the others.
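For reference, the size map() picks when chunksize is omitted can be reproduced with a few lines. This is a sketch of what CPython's multiprocessing.pool appears to do internally, as far as I know; it is an implementation detail rather than documented API, so treat the exact formula as an assumption that may differ between versions. The function name default_map_chunksize is made up here.

```python
def default_map_chunksize(n_items, n_workers):
    """Approximate the chunk size Pool.map chooses on its own.

    Assumed rule: aim for about four chunks per worker, rounding up.
    """
    if n_items == 0:
        return 0
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

small_job = default_map_chunksize(8, 4)      # 8 files, 4 workers
large_job = default_map_chunksize(1000, 16)  # 1000 files, 16 workers
```

Under this rule a small job gets chunks of one item, while 1000 files on 16 workers are grouped into chunks of 16. A single slow chunk at the end then leaves one process grinding through up to 16 files while the others are idle, which matches the symptom in the question and is why passing a chunksize of 1 helps.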

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but is embarrassingly parallel. For this reason, I was using Pool.map to split it and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is some limitation that does not allow daemonic processes to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.

What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.

I normally use multiprocess daemons based on Twisted, with forking, together with a Gearman job queue.
Try to look at Gearman.

File downloading using python with threads

I'm creating a python script which accepts a path to a remote file and a number n of threads. The file's size will be divided by the number of threads, and when each thread completes I want it to append the fetched data to a local file.
How do I manage it so that the threads append to the local file in the order in which they were created, so that the bytes don't get scrambled?
Also, what if I'm to download several files simultaneously?

You could coordinate the work with locks etc., but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.
I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance, and load on the remote server, by experimenting); every worker thread waits at the same global Queue.Queue instance (queue.Queue in Python 3), call it workQ for example, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).
A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), offset from which it is requested to get data from it, number of bytes to get from it (note that this works just as well for one or multiple files to fetch).
The main thread pushes all work requests to the workQ (just workQ.put((url, offset, numbytes)) for each request) and waits for results to come to another Queue instance, call it resultQ (each result will also be a triple: identifier of the file, starting offset, string of bytes that are the results from that file at that offset).
As each working thread satisfies the request it's doing, it puts the results into resultQ and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.
There are several ways to terminate the operation: for example, a special work request may be asking the thread receiving it to terminate -- the main thread puts on workQ just as many of those as there are working threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, having the worker threads daemonic so they just go away when the main thread terminates, and so forth).
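The scheme above can be sketched as follows. To keep the example runnable without network access, the "remote file" is a bytes object in a dictionary and fetch_range is a hypothetical stand-in for a real HTTP range request; REMOTE, STOP, download and the example URL are likewise names invented for this sketch. A bytearray plays the role of the pre-sized local file.

```python
import queue
import threading

# Stand-in for remote storage so the sketch needs no network.
REMOTE = {"http://example.invalid/a.bin": bytes(range(256)) * 4}

def fetch_range(url, offset, numbytes):
    """Hypothetical fetcher; a real one would issue an HTTP Range request."""
    return REMOTE[url][offset:offset + numbytes]

STOP = None  # special work request that tells a worker to terminate

def worker(work_q, result_q):
    while True:
        request = work_q.get()
        if request is STOP:
            break
        url, offset, numbytes = request
        result_q.put((url, offset, fetch_range(url, offset, numbytes)))

def download(url, size, n_workers=4, piece=100):
    work_q, result_q = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(work_q, result_q))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    offsets = range(0, size, piece)
    for offset in offsets:
        work_q.put((url, offset, min(piece, size - offset)))
    for _ in threads:
        work_q.put(STOP)  # one terminator per worker, after the real work
    out = bytearray(size)
    for _ in offsets:  # exactly one result arrives per work request
        _, offset, data = result_q.get()
        out[offset:offset + len(data)] = data  # the "seek and write" step
    for t in threads:
        t.join()
    return bytes(out)

url = "http://example.invalid/a.bin"
data = download(url, len(REMOTE[url]))
```

Because every result carries its own offset, the order in which threads finish does not matter: the writer places each piece at the right spot, so the bytes cannot get scrambled. Handling several files at once only requires keeping one output buffer (or file) per URL.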

You need to fetch completely separate parts of the file on each thread. Calculate the chunk start and end positions based on the number of threads. The chunks must not overlap, obviously.
For example, if the target file is 3000 bytes long and you want to fetch it using three threads (byte offsets are zero-based):
Thread 1: fetches bytes 0 to 999
Thread 2: fetches bytes 1000 to 1999
Thread 3: fetches bytes 2000 to 2999
You would pre-allocate an empty file of the original size, and write back to the respective positions within the file.
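A minimal sketch of the range calculation, using zero-based byte offsets with inclusive end positions (the convention of the HTTP Range header, which has the form bytes=start-end). The function names byte_ranges and range_header are made up for the example.

```python
def byte_ranges(size, n_threads):
    """Split `size` bytes into at most `n_threads` non-overlapping ranges.

    Returns (start, end) pairs with zero-based, inclusive offsets.
    """
    if size <= 0 or n_threads <= 0:
        return []
    chunk = -(-size // n_threads)  # ceiling division
    return [(start, min(start + chunk, size) - 1)
            for start in range(0, size, chunk)]

def range_header(start, end):
    """Value for an HTTP Range request header covering one chunk."""
    return "bytes=%d-%d" % (start, end)

ranges = byte_ranges(3000, 3)
# [(0, 999), (1000, 1999), (2000, 2999)]
headers = [range_header(start, end) for start, end in ranges]
```

When the size does not divide evenly, the last range is simply shorter, and for very small files the function may return fewer ranges than threads. Each thread then writes its data at position start of the pre-allocated file.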

You can use a thread-safe counter, like this:
import threading

class Counter:
    counter = 0
    _lock = threading.Lock()

    @classmethod
    def inc(cls):
        # Incrementing is a read-modify-write, so guard it with a lock.
        with cls._lock:
            cls.counter += 1
            return cls.counter
Counter.inc() returns an incremented number that is unique across threads, which you can use to keep track of the current block of bytes.
That being said, there's usually no need to split a file download across several threads, because the download is much slower than writing to disk, so each block is written long before the next one has finished downloading.
The best and least resource hungry way is simply to have a download file descriptor linked directly to a file object on disk.

For "download several files simultaneously", I recommend this article: Practical threaded programming with Python. It provides an example of simultaneous downloads that combines threads with Queues; I think it's worth reading.

Multiprocessing in python with more then 2 levels

I want to write a program that spawns processes like this: process -> n processes -> n processes.
Can the second level spawn processes with multiprocessing? I am using the multiprocessing module of Python 2.6.
Thanks.

@vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question was vague.
To expand a little, you'd have your original program spawn its n processes, but they'd be slightly different from the original in that you'd want them (each, if I understand your question) to spawn n more processes. You could accomplish this either by having them run code similar to your original process, but which spawns new sets of processes that perform the task at hand without spawning any further, or by using the same code/entry point and just providing different arguments - something like
def main(level):
    if level == 0:
        do_work()
    else:
        for i in range(n):
            spawn_process_that_runs_main(level - 1)
and start it off with level == 2

You can structure your app as a series of process pools communicating via Queues at any nested depth. Though it can get hairy pretty quickly (probably due to the required context switching).
It's not Erlang though, that's for sure.
The docs on multiprocessing are extremely useful.
Here (a little too much to drop in a comment) is how some code I use to increase throughput in a program that updates my feeds works. One process polls for feeds that need to be fetched and stuffs its results into a queue. A process pool of 4 workers picks up those results and fetches the feeds; their results (if any) are then put into a queue for another process pool to parse, and the parsed output goes into a queue to be shoved back into the database. Done sequentially, this would be really slow, because some sites take their own sweet time to respond, so most of the time the process was waiting on data from the internet and would only use one core. Under this process-based model, it seems I'm actually waiting on the database the most, my NIC is saturated most of the time, and all 4 cores are actually doing something. Your mileage may vary.

Yes - but, you might run into an issue which would require the fix I committed to python trunk yesterday. See bug http://bugs.python.org/issue5313

Sure you can. Especially if you are using fork to spawn child processes, they work as perfectly normal processes (like the parent). Thread management is quite different, but you can also use "second level" sub-threading.
Take care not to over-complicate your program; for example, programs with two levels of threads are rarely needed.
