python: spawn threads as per requirements

I am creating a small application which will perform, say, 4 different time-consuming tasks, such that the output of the first is the input of the second, and so on.
At every task level, the output is appended to a list; the next task pops from that list, operates on the item, and appends its own output to its output list, and so on.
The way I thought I would get this done is by having multiple threads at each of those 4 task levels.
Coming to the question: is there any way I can have my application spawn threads at each task level depending on the number of items in its input queue?
Say the input list of the second task is empty at the beginning, so the number of threads is zero; if one item is there, a single thread is spawned, two for two, and so on. And of course there should be an upper limit on the number of threads, say 10, so that even if the length of the input list grows as high as 100, the number of operating threads stays at 10.
Please suggest the Pythonic way to go about achieving this.

You have successfully invented the Thread Pool. There is built-in support, and there are many libraries and examples that provide this for you, so use one or learn from their code.
from multiprocessing.pool import ThreadPool
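For example, here is a minimal sketch of the capped-pool idea from the question. The worker function and the task list are placeholders; the point is that the pool never runs more than 10 threads, no matter how long the input list gets:

from multiprocessing.pool import ThreadPool

def stage_two(item):
    # placeholder for the real time-consuming work of the second task
    return item * 2

pool = ThreadPool(processes=10)            # hard upper limit of 10 threads
second_outputs = pool.map(stage_two, range(100))
pool.close()
pool.join()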

Related

Python: Multiprocessing Recipe for Queue(s) with Many Consumers

Since I have been wasting a lot of time trying to join() workers in Python multiprocessing architectures that get their tasks from a multiprocessing.Queue which is simultaneously fed by a feeder function through put():
Can someone contribute a short but robust recipe for this kind of architecture?
Let's say up to 10 feeders and up to 100 workers. Queue items might be large.
I can only guess that in the past the queue was often busy and not responding anymore; but mostly, the task itself got done, only the jobs never joined, or whether they joined or not seemed to depend on arbitrary parameters.
So imagine the following workflow:
One or more feeder() jobs read from input (i.e. disk) and create tasks for the first line of workers. They put the tasks into a queue Q. When there are no more tasks to create, the feeders should join.
Many workers take the tasks out of Q and process them. Then they put the result into another queue R. When these workers fail to get more jobs from Q and are done with their last task, they should join. Consider that they might see an exception due to invalid input/tasks. They should still join.
Let's add another line of workers that always takes two results from R, merges them, and puts the result back into R. When there is only one result left in R, all workers of the second line should join.
Finally I can take the last remaining aggregated result out of R and be happy.
I think this setup should include enough aspects to allow for generalization of many different tasks.
To avoid blocking of the queue(s), I have already tried out the following ideas:
Use multiple queues, always putting to the smallest queue and getting from a random queue.
Do not block in get(); instead, try getting without waiting and, if that fails, sleep for a short time and retry. Keep track of the number of tries and, after a while, give up, assuming the queue is empty.
I can add some code, but maybe this would lead to bug fixing instead of sharing best practices.
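As a starting point, here is a minimal sketch of the sentinel ("poison pill") pattern that usually makes this kind of architecture join reliably. The worker body (task * 2), the worker count, and the item count are illustrative assumptions, and the feeder runs inline in the main process for brevity:

import multiprocessing as mp

SENTINEL = None  # assumption: None never appears as a real task

def feeder(task_queue, items):
    for item in items:
        task_queue.put(item)

def worker(task_queue, result_queue):
    while True:
        task = task_queue.get()
        if task is SENTINEL:
            break                        # poison pill: exit cleanly instead of blocking in get()
        try:
            result = task * 2            # placeholder for the real processing
        except Exception:
            continue                     # invalid input must not keep the worker from joining
        result_queue.put(result)

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()
    feeder(tasks, range(20))
    for _ in workers:
        tasks.put(SENTINEL)              # one poison pill per worker
    output = [results.get() for _ in range(20)]  # drain R before joining so no worker blocks on a full pipe
    for w in workers:
        w.join()                         # joins reliably: every worker eventually sees a sentinel

The merge stage from the workflow above can reuse the same pattern on R with its own sentinels.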

Python multiprocessing pool number of jobs not correct

I wrote a Python program to launch parallel processes (16) using a pool, to process some files. At the beginning of the run, the number of processes is maintained at 16 until almost all files get processed. Then, for some reason I don't understand, when there are only a few files left, only one process runs at a time, which makes the processing time much longer than necessary. Could you help with this?
Force map() to use a chunksize of 1 instead of letting it guess the best value by itself, e.g.:
from multiprocessing import Pool

pool = Pool(16)
pool.map(func, iterable, chunksize=1)
This should (in theory) guarantee the best distribution of load among workers until the end of the input data.
Before Python starts executing the function you pass to apply_async/map_async on a Pool, it assigns each worker a piece of the work.
For example, let's say that you have 8 files to process and you start a Pool with 4 workers.
Before the file processing starts, two specific files will be assigned to each worker. This means that if some worker finishes its share earlier than the others, it will simply "have a break" and will not start helping the others.
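Here is a small self-contained illustration of the fix (the sleep durations are made up to stand in for files of different sizes). With chunksize=1, files are handed out one at a time, so a worker that finishes early immediately picks up the next file instead of leaving a batch of slow files stuck with a single worker at the end of the run:

import time
from multiprocessing import Pool

def process_file(seconds):
    # stand-in for processing one file; the duration simulates the file's size
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    durations = [5] * 4 + [0.1] * 60     # a few slow files mixed with many fast ones
    with Pool(4) as pool:
        results = pool.map(process_file, durations, chunksize=1)
    print(results)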

How to create a pool of threads in Python where I can have a fixed number of threads running at any time?

I want to create a pool of threads for a certain task. I want to maintain a certain number of threads at any time.
How can I do that ? How can I check the number of alive threads at any time and create new threads if their count is less than the specified count ?
If you're using an existing thread pool implementation—concurrent.futures.ThreadPoolExecutor, multiprocessing.dummy.Pool, or anything you find on PyPI or elsewhere—this will all be taken care of automatically.
If you want to build something yourself manually, the question doesn't really make sense. You will need to keep track of your threads, start them manually, and know when they finish, so you always know the number of alive threads.
If you want sample code to look at, the source to concurrent.futures is very readable, and pretty short. As you can see from a quick look, it calls _adjust_thread_count every time you submit a new job, and that function just checks whether len(self._threads) < self.max_workers and creates a new thread if needed.
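For the ready-made route, here is a minimal sketch using concurrent.futures; the work function and the pool size of 10 are placeholders for whatever your task needs:

from concurrent.futures import ThreadPoolExecutor

def work(item):
    # placeholder for the real task
    return item * item

# At most 10 worker threads are ever alive; extra submissions
# simply wait in the executor's internal queue.
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(work, range(100)))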

using multiple threads in Python

I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default), as well as a function you want to run on each unit of work. Then you put every unit of work (in your case, a list of URLs) into a list and give it to the worker pool.
Your output will be a list of the return values of your worker function for every item of work in your original list. All the cool multiprocessing goodness will happen in the background. There are of course other ways of working with the worker pool as well, but this is my favourite one.
Happy multi-processing!
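A minimal sketch of that recommendation; urllib and the example URLs stand in for whatever download code you already have, and the pool size of 16 is arbitrary:

from multiprocessing import Pool
from urllib.request import urlopen

def get_url_data(url):
    # download one URL and return its body; stand-in for the real fetch logic
    with urlopen(url) as response:
        return response.read()

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % i for i in range(100)]
    with Pool(16) as pool:
        output = pool.map(get_url_data, urls)
    # output[i] is the body downloaded from urls[i]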
The best approach I can think of in your use case will be to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work and then go get some more work. This way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.
Create a thread pool that creates the number of threads you specify, fetches work from the WorkQueue, and assigns it to a thread. Each time a thread finishes and returns, check whether the work queue has more work and, if so, assign work to that thread again. You may also want to add a hook so that every time work is added to the work queue, it is handed to a free thread if one is available.
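Here is a minimal sketch of that thread-pool-plus-work-queue idea using only the standard library; fetch_url, the example URLs, and the choice of 10 threads are placeholders:

import threading
import queue
from urllib.request import urlopen

def fetch_url(url):
    # placeholder for the real download logic
    with urlopen(url) as response:
        return response.read()

def worker(work_queue, results, lock):
    while True:
        url = work_queue.get()
        if url is None:                   # sentinel: no more work for this thread
            break
        data = fetch_url(url)
        with lock:                        # the results list is shared between threads
            results.append((url, data))

urls = ["http://example.com/page%d" % i for i in range(50)]
work_queue = queue.Queue()
for url in urls:
    work_queue.put(url)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(work_queue, results, lock))
           for _ in range(10)]            # never more than 10 threads downloading
for t in threads:
    t.start()
for _ in threads:
    work_queue.put(None)                  # one sentinel per thread
for t in threads:
    t.join()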
The fastest and most efficient method of doing IO-bound tasks like this is an asynchronous event loop. The libcurl library can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches as fast as one.
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.
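For reference, here is a rough sketch of the pycurl "multi" pattern being referred to; the example URLs are placeholders, error handling is omitted, and this is not the simplified wrapper mentioned above:

from io import BytesIO
import pycurl

urls = ["http://example.com/page%d" % i for i in range(20)]

multi = pycurl.CurlMulti()
handles = []
for url in urls:
    buf = BytesIO()
    handle = pycurl.Curl()
    handle.setopt(pycurl.URL, url)
    handle.setopt(pycurl.WRITEDATA, buf)   # libcurl writes each response body here
    multi.add_handle(handle)
    handles.append((url, handle, buf))

# Drive every transfer concurrently on a single thread.
num_active = len(handles)
while num_active:
    while True:
        status, num_active = multi.perform()
        if status != pycurl.E_CALL_MULTI_PERFORM:
            break
    if num_active:
        multi.select(1.0)                  # wait until some socket is ready

results = {}
for url, handle, buf in handles:
    results[url] = buf.getvalue()
    multi.remove_handle(handle)
    handle.close()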

Multiprocessing in Python with more than 2 levels

I want to write a program where the processes spawn like this: process -> n processes -> n processes each.
Can the second level spawn processes with multiprocessing? I am using the multiprocessing module of Python 2.6.
Thanks
#vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question was vague.
To expand a little: you'd have your original program spawn its n processes, but they'd be slightly different from the original in that you'd want each of them (if I understand your question) to spawn n more processes. You could accomplish this either by having them run code similar to your original process, but spawning new sets of processes that perform the task at hand without further spawning, or by using the same code/entry point and just providing different arguments - something like
from multiprocessing import Process

def main(level):
    if level == 0:
        do_work()                # do_work and n are the original placeholders
    else:
        children = [Process(target=main, args=(level - 1,)) for i in range(n)]
        for p in children:
            p.start()
        for p in children:
            p.join()
and start it off with level == 2
You can structure your app as a series of process pools communicating via Queues at any nested depth, though it can get hairy pretty quickly (probably due to the required context switching).
It's not Erlang, though, that's for sure.
The docs on multiprocessing are extremely useful.
Here (a little too much to drop in a comment) is some code I use to increase throughput in a program that updates my feeds. I have one process polling for feeds that need to be fetched, which stuffs its results into a queue; a process pool of 4 workers picks up those results and fetches the feeds, and its results (if any) are then put into a queue for another process pool to parse and push into a queue to be shoved back into the database. Done sequentially, this process would be really slow, because some sites take their own sweet time to respond, so most of the time the process was waiting on data from the internet and would only use one core. Under this process-based model, I'm actually waiting on the database the most, it seems, and my NIC is saturated most of the time, as well as all 4 cores actually doing something. Your mileage may vary.
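The original code is not reproduced here; below is a rough reconstruction of the pipeline described, simplified by using Pool.map between the stages instead of explicit queues. The fetch and parse bodies, the feed list, and the pool sizes are assumptions:

from multiprocessing import Pool

def fetch(feed_url):
    # placeholder for downloading one feed
    return "raw data for " + feed_url

def parse(raw):
    # placeholder for parsing one fetched feed
    return raw.upper()

if __name__ == "__main__":
    feeds = ["http://example.com/feed%d" % i for i in range(20)]   # the poller's output
    with Pool(4) as fetch_pool, Pool(2) as parse_pool:
        raw_items = fetch_pool.map(fetch, feeds)      # 4 workers fetching concurrently
        parsed = parse_pool.map(parse, raw_items)     # a separate pool parses the results
    # 'parsed' would then be written back to the database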
Yes - but you might run into an issue which would require the fix I committed to Python trunk yesterday. See bug http://bugs.python.org/issue5313
Sure you can. Especially if you are using fork to spawn child processes, they work as perfectly normal processes (like the parent). Thread management is quite different, but you can also use "second level" sub-threading.
Take care not to over-complicate your program; for example, programs with two levels of threads are normally unnecessary.
