A multi-part/threaded downloader via python? - python

I've seen a few threaded downloaders online, and even a few multi-part downloaders (HTTP).
I haven't seen them together as a class/function.
If any of you have a class/function lying around, that I can just drop into any of my applications where I need to grab multiple files, I'd be much obliged.
If there is a library/framework (or a program's back-end) that does this, please direct me towards it.

Threadpool by Christopher Arndt may be what you're looking for. I've used this "easy to use object-oriented thread pool framework" for the exact purpose you describe and it works great. See the usage examples at the bottom on the linked page. And it really is easy to use: just define three functions (one of which is an optional exception handler in place of the default handler) and you are on your way.
from http://www.chrisarndt.de/projects/threadpool/:
Object-oriented, reusable design
Provides callback mechanism to process results as they are returned from the worker threads.
WorkRequest objects wrap the tasks assigned to the worker threads and allow for easy passing of arbitrary data to the callbacks.
The use of the Queue class solves most locking issues.
All worker threads are daemonic, so they exit when the main program exits, no need for joining.
Threads start running as soon as you create them. No need to start or stop them. You can increase or decrease the pool size at any time, superfluous threads will just exit when they finish their current task.
You don't need to keep a reference to a thread after you have assigned the last task to it. You just tell it: "don't come back looking for work, when you're done!"
Threads don't eat up cycles while waiting to be assigned a task, they just block when the task queue is empty (though they wake up every few seconds to check whether they are dismissed).
Also available at http://pypi.python.org/pypi/threadpool, easy_install, or as a subversion checkout (see project homepage).
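For anyone who wants to see the threaded and multi-part ideas combined without a third-party dependency, here is a minimal sketch. It uses the standard library's concurrent.futures instead of the threadpool module described above. The names split_ranges, download_parts and the fetch_range callable are hypothetical, made up for illustration; the sketch assumes the server supports byte-range requests and that the total size is known in advance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, parts):
    """Split [0, total_size) into up to `parts` contiguous (start, end) byte ranges.

    `end` is inclusive, matching the HTTP Range header convention.
    """
    chunk, extra = divmod(total_size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        size = chunk + (1 if i < extra else 0)
        if size == 0:
            continue  # more parts than bytes: skip empty ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def download_parts(fetch_range, total_size, parts=4):
    """Fetch each range in its own thread and join the pieces in order.

    `fetch_range(start, end)` is a caller-supplied (hypothetical) function
    that returns the bytes for that range.
    """
    ranges = split_ranges(total_size, parts)
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        # map() yields results in the order of `ranges`, not completion order
        pieces = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(pieces)
```

In real use, fetch_range would typically build a urllib.request.Request with a "Range" header of the form "bytes=start-end" and return the response body.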

Related

Will I run into trouble with python's Global Interpreter Lock?

I am aware that this question is rather high-level and may be vague. Please ask if you need any more details and I will try to edit.
I am using QuickFix with Python bindings to consume high-throughput market data from circa 30 markets simultaneously. Most of the computing work is done on separate CPUs via the multiprocessing module. These parallel processes are spawned by the main process on startup. If I wish to interact with the market in any way via QuickFix, I have to do this within the main process, thus any commands (to enter orders, for example) which come from the child processes must be piped (via an mp.Queue object we will call Q) to the main process before execution.
This raises the problem of monitoring Q, which must be done within the main process. I cannot use Q.get(), since this method blocks and my entire main process will hang until something shows up in Q. In order to decrease latency, I must check Q frequently, on the order of 50 times per second. I have been using the apscheduler to do this, but I keep getting Warning errors stating that the runtime was missed. These errors are a serious issue because they prevent me from easily viewing important information.
I have therefore refactored my application to use the code posted by MestreLion as an answer to this question. This is working for me because it starts a new thread from the main process, and it does not print error messages. However, I am worried that this will cause nasty problems down the road.
I am aware of the Global Interpreter Lock in python (this is why I used the multiprocessing module to begin with), but I don't really understand it. Owing to the high-frequency nature of my application, I do not know if the Q monitoring thread and the main process consuming lots of incoming messages will compete for resources and slow each other down.
My questions:
Am I likely to run into trouble in this scenario?
If not, can I add more monitoring threads using the present approach and still be okay? There are at least two other things I would like to monitor at high frequency.
Thanks.
@MestreLion's solution that you've linked creates 50 threads per second in your case.
All you need is a single thread to consume the queue without blocking the rest of the main process:
import threading

def consume(queue, sentinel=None):
    # queue.get blocks until an item arrives; the loop ends when the sentinel is received
    for item in iter(queue.get, sentinel):
        pass_to_quickfix(item)

threading.Thread(target=consume, args=[queue], daemon=True).start()
GIL may or may not matter for performance in this case. Measure it.
Without knowing your scenario, it's difficult to say anything specific. Your question suggests that the threads spend most of their time waiting in get, so the GIL is not a problem. Interprocess communication may cause problems much earlier. In that case you could switch to another protocol, such as TCP sockets. Then you can write the scheduler more efficiently with select instead of threads, since threads are also comparatively slow and resource-consuming. select is a system call that monitors many socket connections at once, so it scales very well with the number of connections and needs almost no CPU for monitoring.
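As a rough illustration of the select approach, here is a small sketch. It uses socket.socketpair only so that it runs locally without a server; in the scenario above the watched sockets would be real connections from the child processes, and the message contents are made up.

```python
import select
import socket

# Create three connected socket pairs; the "left" ends stand in for
# remote peers, the "right" ends are what the monitor watches.
pairs = [socket.socketpair() for _ in range(3)]
watched = [right for _, right in pairs]

# Only peers 0 and 2 send something.
pairs[0][0].sendall(b"order-a")
pairs[2][0].sendall(b"order-c")

# One call reports every watched socket that has data waiting.
# The last argument is a timeout in seconds.
readable, _, _ = select.select(watched, [], [], 1.0)
received = sorted(sock.recv(1024) for sock in readable)

for left, right in pairs:
    left.close()
    right.close()
```

A single thread can watch all connections this way, instead of dedicating one thread to each.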

When should I be using asyncio over regular threads, and why? Does it provide performance increases?

I have a pretty basic understanding of multithreading in Python and an even basic-er understanding of asyncio.
I'm currently writing a small Curses-based program (eventually going to be using a full GUI, but that's another story) that handles the UI and user IO in the main thread, and then has two other daemon threads (each with their own queue/worker-method-that-gets-things-from-a-queue):
a watcher thread that watches for time-based and conditional (e.g. posts to a message board, received messages, etc.) events to occur and then puts required tasks into...
the other (worker) daemon thread's queue which then completes them.
All three threads are continuously running concurrently, which leads me to some questions:
When the worker thread's queue (or, more generally, any thread's queue) is empty, should it be stopped until it has something to do again, or is it okay to leave it running continuously? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Thanks!
When the worker thread's queue (or, more generally, any thread's queue) is empty, should it be stopped until it has something to do again, or is it okay to leave it running continuously? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
You should just use a blocking call to queue.get(). That will leave the thread blocked on I/O, which means the GIL will be released, and no processing power (or at least a very minimal amount) will be used. Don't use non-blocking gets in a while loop, since that's going to require a lot more CPU wakeups.
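A minimal sketch of that pattern, assuming a single worker and a made-up task (doubling a number). The STOP sentinel is just one common way to shut the worker down and is not something from the question.

```python
import queue
import threading

tasks = queue.Queue()
done = []
STOP = object()  # sentinel telling the worker to exit


def worker():
    while True:
        item = tasks.get()  # blocks without polling until an item arrives
        if item is STOP:
            break
        done.append(item * 2)  # stand-in for the real work


t = threading.Thread(target=worker, daemon=True)
t.start()
for n in (1, 2, 3):
    tasks.put(n)
tasks.put(STOP)
t.join(timeout=5)
```

While the queue is empty the worker simply sleeps inside get(), so there is no need to stop and restart the thread.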
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
If all the watcher is doing is pulling things off a queue and immediately putting them into another queue, where they get consumed by a single worker, it sounds like it's unnecessary overhead - you may as well just consume them directly in the worker. It's not exactly clear to me if that's the case, though - is the watcher consuming from a queue, or just putting items into one? If it is consuming from a queue, who is putting stuff into it?
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Yes, this is affected by the GIL. Only one of your threads can run Python bytecode at a time, so you won't get true parallelism, except when threads are doing I/O (which releases the GIL). If your worker thread is doing CPU-bound activities, you should seriously consider running it in a separate process via multiprocessing, if possible.
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
It's hard to say, because I don't know exactly what "running continuously" means. What is it doing continuously? If it spends most of its time sleeping or blocking on a queue, it's fine - both of those things release the GIL. If it's constantly doing actual work, that will require the GIL, and therefore degrade the performance of the other threads in your app (assuming they're trying to do work at the same time). asyncio is designed for programs that are I/O-bound, and can therefore be run in a single thread, using asynchronous I/O. It sounds like your program may be a good fit for that depending on what your worker is doing.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Any program where you're mostly waiting for I/O is potentially a good fit for asyncio - but only if you can find a library that makes curses (or whatever other GUI library you eventually choose) play nicely with it. Most GUI frameworks come with their own event loop, which will conflict with asyncio's. You would need to use a library that can make the GUI's event loop play nicely with asyncio's event loop. You'd also need to make sure that you can find asyncio-compatible versions of any other synchronous-I/O based library your application uses (e.g. a database driver).
That said, you're not likely to see any kind of performance improvement by switching from your thread-based program to something asyncio-based. It'll likely perform about the same. Since you're only dealing with 3 threads, the overhead of context switching between them isn't very significant, so switching from that to a single-threaded, asynchronous I/O approach isn't going to make a very big difference. asyncio will help you avoid thread synchronization complexity (if that's an issue with your app - it's not clear that it is), and at least theoretically, would scale better if your app potentially needed lots of threads, but it doesn't seem like that's the case. I think for you, it's basically down to which style you prefer to code in (assuming you can find all the asyncio-compatible libraries you need).
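To give a feel for the asyncio style, here is a small sketch of the watcher/worker pair as two coroutines sharing an asyncio.Queue in a single thread. The event names and the upper-casing "work" are placeholders, and it uses asyncio.run, which needs Python 3.7 or later (newer than the 3.4 mentioned in the question).

```python
import asyncio


async def watcher(tasks, events):
    """Turn each observed event into a task for the worker."""
    for event in events:
        await asyncio.sleep(0)  # stand-in for waiting on a timer or the network
        await tasks.put(event)
    await tasks.put(None)  # None signals "no more work"


async def worker(tasks, results):
    while True:
        item = await tasks.get()  # suspends this coroutine; the loop runs others
        if item is None:
            break
        results.append(item.upper())  # stand-in for the real task


async def main(events):
    tasks = asyncio.Queue()
    results = []
    # Both coroutines run concurrently on one thread.
    await asyncio.gather(watcher(tasks, events), worker(tasks, results))
    return results


results = asyncio.run(main(["post", "message"]))
```

There are no locks here because only one coroutine runs at a time and control changes hands only at an await.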

Why does python's threadpool create daemonic threads and join them at the end?

I've been reading python's threadpool module's code.
It manipulates threads in this way: all worker threads are created as daemonic threads. It also has a dismiss mechanism that lets you safely stop a worker thread by setting an event; after all the jobs are done, the dismissed threads are joined in the main thread.
The python doc says that if worker threads are daemonic, they will quit when the main thread terminates. But that might be an ugly implementation; a better way would be to make them non-daemonic and stop them with an event.
Here is my question: Is it a good design to use both of the quit strategies? Is it better to set the threads non-daemonic and join them all before the main thread terminates?
In looking at this particular threadpool module, it appears to be designed to work either by allowing you to quit summarily, or waiting for the threads to complete. You would choose one or the other depending on how you want to handle requests currently in process:
If you don't care about whether threads die in the middle of processing requests, just let the program exit, and the daemon threads will be taken care of.
On the other hand, if you want to make sure a thread exits only between fully processing requests, either use dismissWorkers with do_join=True, or use dismissWorkers followed by joinAllDismissedWorkers.
That choice would vary depending on what you're processing and how. Note that the sample code that comes in the main routine does some of one and some of the other, which is probably not what you'd want to do in a real situation – the sample code is just designed to demonstrate capabilities.
You could argue that it's bad form to create daemon threads when you do care about how/when they exit, and it wouldn't be hard to fix the library so that daemon is an option for your worker threads when they are created, not a necessity. Currently, however, the module picks a default that favors ease of use over consistency.
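For comparison, this is roughly what the "non-daemonic plus event" strategy looks like with plain threading. It is a sketch with the standard library, not the threadpool module's own code, and the sleep stands in for handling one request.

```python
import threading
import time

stop = threading.Event()
processed = []


def worker():
    n = 0
    # The event is checked only between units of work, so no unit is cut short.
    while not stop.is_set():
        n += 1
        time.sleep(0.01)  # stand-in for handling one request
        processed.append(n)


t = threading.Thread(target=worker, daemon=False)  # explicitly non-daemonic
t.start()
time.sleep(0.05)
stop.set()  # ask the worker to finish after its current unit
t.join()    # wait until it has actually exited
```

With daemon=True and no join, the interpreter could exit while a request is half processed; with the event and join, every started request is completed before shutdown.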

using multiple threads in Python

I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you ready every unit of work (in your case this would be a list of URLs) in a list and give it to the worker pool.
Your output will be a list of the return values of your worker function for every item of work in your original list. All the cool multi-processing goodness will happen in the background. There are of course other ways of working with the worker pool as well, but this is my favourite one.
Happy multi-processing!
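A minimal sketch of the worker-pool approach described above. The names get_url_data and download_all are hypothetical, and the worker body is a stand-in (it returns the length of the URL string) so the sketch does not need network access.

```python
from multiprocessing import Pool


def get_url_data(url):
    """Hypothetical worker: replace the body with the real download."""
    return len(url)  # stand-in result so the sketch runs without a network


def download_all(urls, processes=None):
    """Run get_url_data over every URL in a pool of worker processes."""
    # processes=None lets Pool start one worker per CPU core
    with Pool(processes) as pool:
        # map() blocks until every item is done and preserves input order
        return pool.map(get_url_data, urls)
```

Call download_all(list_of_urls) from under an if __name__ == "__main__": guard; on platforms that start workers by spawning a fresh interpreter, the guard prevents each child from re-running the pool setup.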
The best approach I can think of in your use case will be to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work and then go get some more work. This way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.
Create a thread pool, which creates the number of threads you specify, fetches work from the WorkQueue and assigns it to a thread. Each time a thread finishes and returns, you check whether the work queue has more work and, if so, assign work to that thread again. You may also want to add a hook so that every time work is added to the work queue, it is assigned to a free thread if one is available.
The fastest and most efficient method of doing IO-bound tasks like this is an asynchronous event loop. libcurl can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches as fast as one.
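The work-queue idea can be sketched with the standard library as follows. The function name run_workers and the fetch parameter are hypothetical; fetch stands for whatever function downloads one URL. Here the workers pull from the queue themselves instead of having work assigned to them, which is a common simplification of the scheme described above.

```python
import queue
import threading


def run_workers(urls, fetch, num_threads=4):
    """Download every URL using a fixed number of worker threads."""
    work = queue.Queue()
    results = queue.Queue()
    for url in urls:
        work.put(url)
    for _ in range(num_threads):
        work.put(None)  # one stop marker per worker

    def worker():
        while True:
            url = work.get()  # blocks until work (or a stop marker) arrives
            if url is None:
                return
            results.put((url, fetch(url)))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have exited, so the results queue is complete.
    output = {}
    while not results.empty():
        url, data = results.get()
        output[url] = data
    return output
```

The num_threads argument gives the fine control over concurrency mentioned above. Note that this sketch does no error handling: if fetch raises, that worker thread dies and its remaining share of the work falls to the others.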
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but is embarrassingly parallel. For this reason, I was using Pool.map to split it and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is some limitation that does not allow daemonic processes to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I normally use multiprocess daemons based on Twisted with forking, together with Gearman job queues.
Try to look at Gearman.
