I'm trying to do something in Python 2.7, and I can't quite figure it out.
What I want is to carry out two sets of actions simultaneously, and in addition there is some need for the two threads to communicate with each other.
More specifically: I want to send a series of HTTP requests, and at the same time (in parallel) send a similar series of HTTP requests. This way I don't have to wait for a (potentially delayed) response, because the other series can just continue on.
The thing is, the number of requests per second cannot exceed a certain value; let's say one request per second. So I need to make sure that the combined request-frequency of the two parallel threads does not exceed this value.
Any help would be appreciated. Apologies if the solution is obvious, I'm still pretty new to Python.
Raymond Hettinger gave a really good keynote talk about the proper way to think about concurrency and multithreading here: https://www.youtube.com/watch?v=Bv25Dwe84g0&t=2
And his notes can be found here: https://dl.dropboxusercontent.com/u/3967849/pyru/_build/html/index.html
What I recommend, which is from the talk, is to use an atomic message queue to "talk" between the threads. However, this talk and Raymond's work are done in Python 3.5 or 3.6. The queue library (https://docs.python.org/3/library/queue.html) will help you significantly.
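For example, a minimal sketch of two threads "talking" through such a queue might look like this (the producer/consumer names and the messages are made up for illustration; in Python 2.7 the module is called Queue rather than queue):

import threading
import queue  # Python 2.7: import Queue as queue

messages = queue.Queue()

def producer():
    for i in range(5):
        messages.put("request %d" % i)
    messages.put(None)  # sentinel telling the consumer to stop

def consumer():
    while True:
        item = messages.get()
        if item is None:
            break
        print("handling", item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()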
A common way to enforce your rate-limiting requirement is to use a Token Bucket approach.
Specifically in Python, you'd have a queue shared between the threads, and a 3rd thread (perhaps the original initiating thread) which puts one token object into the queue per second. (That is, it's a simple loop: wait 1 second, put an object, repeat.)
The two worker threads each try to take an object from the queue, and for each object they take, they issue one request. Voila! The workers can't issue more requests, in total, than the number of tokens made available (which equals the number of seconds that have passed). Even if one thread is stuck on a long-running request, the other can just be the one to repeatedly obtain a token. It's generalizable to N threads: they're all just competing to get the next allow-one-request token from the shared queue.
If many threads are stuck on long-running requests, multiple tokens collect in the queue, allowing a burst of catch-up requests – but still only reaching the overall target average-number-of-requests over a longer period. (By adjusting the maximum size of the queue, or whether it is preloaded with a small surplus of tokens, the exact enforcement of the limit can be adjusted – for example, so that it converges to the correct limit within 10 seconds, or 30, or 3600, whatever.)
The shared queue can also be the mechanism used to cleanly tell the worker threads to quit. That is, instead of pushing into the queue the signalling object that means "do one request", an external control thread can push in an object meaning "finish and exit". Pushing in N such objects will cause the N worker threads to each get the command.
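A rough sketch of the whole arrangement, purely to illustrate the idea (the URLs, the TOKEN object and the worker function are placeholders, not anything from the question; in Python 2.7 the module is Queue rather than queue):

import threading
import time
import queue  # Python 2.7: import Queue as queue

TOKEN = object()                    # one token allows one request
tokens = queue.Queue(maxsize=5)     # the cap bounds how large a catch-up burst can get

def ticker(stop_event):
    # Simple loop: wait one second, put a token, repeat.
    while not stop_event.is_set():
        time.sleep(1)
        try:
            tokens.put_nowait(TOKEN)
        except queue.Full:
            pass                    # bucket already full; don't accumulate more

def worker(urls):
    for url in urls:
        tokens.get()                # blocks until a token is available
        print("requesting", url)    # placeholder for the real HTTP request

stop = threading.Event()
ticker_thread = threading.Thread(target=ticker, args=(stop,))
ticker_thread.daemon = True
ticker_thread.start()

workers = [threading.Thread(target=worker, args=(["http://example.com/a"],)),
           threading.Thread(target=worker, args=(["http://example.com/b"],))]
for w in workers:
    w.start()
for w in workers:
    w.join()
stop.set()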
Seems like you need a "semaphore". From the Python 2.7 docs:
A semaphore manages an internal counter which is decremented by each acquire() call and incremented by each release() call. The counter can never go below zero; when acquire() finds that it is zero, it blocks, waiting until some other thread calls release().
So this semaphore of yours is basically a counter of calls, shared by all the HTTP threads, that resets to the allowed rate every second. If it reaches 0, no thread can make any more requests until another thread releases the connection or a second passes and the counter is refilled.
You can set up your script with x HTTP request workers and one HTTP Call Rate Resetter worker:
the resetter destroys and regenerates the semaphore every second
each worker calls acquire() before every HTTP request is made.
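A rough, untested sketch of that setup (RATE, the URLs and the worker function are invented for illustration; the non-blocking acquire() keeps a worker from blocking forever on a semaphore object the resetter has already replaced):

import threading
import time

RATE = 1                        # allowed requests per second
sem = threading.Semaphore(RATE)

def resetter():
    global sem
    while True:
        time.sleep(1)
        sem = threading.Semaphore(RATE)   # destroy and regenerate the counter

def worker(urls):
    for url in urls:
        # Poll the *current* semaphore so a worker never waits on a stale one.
        while not sem.acquire(False):
            time.sleep(0.05)
        print("requesting", url)          # placeholder for the real HTTP request

reset_thread = threading.Thread(target=resetter)
reset_thread.daemon = True
reset_thread.start()

for chunk in (["http://example.com/a"], ["http://example.com/b"]):
    threading.Thread(target=worker, args=(chunk,)).start()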
If you are using Python 2.7 and threading you can find all the docs here:
https://docs.python.org/2/library/threading.html.
And a nice tutorial here:
https://pymotw.com/2/threading/
I am creating a multi-threaded program in which I want only one thread at a time to enter the critical section, where it creates a socket and sends some data, while all the others wait for that variable to clear.
I tried threading.Event, but later realized that set() notifies all the waiting threads, while I only wanted to notify one.
I tried locks (acquire and release). They suited my scenario well, but I learned that holding a lock for a long time is expensive; after acquiring the lock my thread performs many functions, so it ends up holding the lock for a long time.
Now I have tried threading.Condition. I just want to know whether acquiring and holding the condition for a long time is bad practice, since it also uses locks.
Can anyone suggest a better approach to my problem statement?
I would use an additional thread dedicated to sending. Use a Queue where the other threads put their Send-Data. The socket-thread gets items from the queue in a loop and sends them one after the other.
As long as the queue is empty, .get blocks and the send-thread sleeps.
The "producer" threads have no waiting time at all, they just put their data in the queue and continue.
There is no concern about possible deadlock conditions.
To stop the send-thread, put some special item (e.g. None) in the queue.
To enable returning of values, put a tuple (send_data, return_queue) in the send-queue. When a result is ready, return it by putting it in the return_queue.
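Here is a minimal sketch of that pattern, with invented identifiers and a string standing in for the real socket send (in Python 2.7 the module is Queue rather than queue):

import threading
import queue  # Python 2.7: import Queue as queue

send_queue = queue.Queue()

def send_thread():
    while True:
        item = send_queue.get()        # blocks while the queue is empty
        if item is None:               # sentinel: stop the sender
            break
        data, return_queue = item
        result = "sent:" + data        # placeholder for the real socket send
        return_queue.put(result)       # hand the result back to the producer

sender = threading.Thread(target=send_thread)
sender.start()

# A producer just enqueues its data together with a private return queue.
reply = queue.Queue()
send_queue.put(("hello", reply))
print(reply.get())                     # blocks until the sender has processed it

send_queue.put(None)                   # shut the send-thread down
sender.join()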
I've never used the async-await syntax but I do often need to make HTTP/S requests and parse responses while awaiting future responses. To accomplish this task, I currently use the ThreadPoolExecutor class, which executes the calls asynchronously anyway; effectively I'm achieving (I believe) the same result I would get by writing more lines of code to use async-await.
Operating under the assumption that my current implementations work asynchronously, I am wondering how the async-await implementation would differ from that of my original one which used Threads and a Queue to manage workers; it also used a Semaphore to limit workers.
That implementation was devised under the following conditions:
There may be any number of requests
The total number of active requests may not exceed 4
Only send next request when a response is received
The basic flow of the implementation was as follows:
Generate container of requests
Create a ListeningQueue
For each request create a Thread and pass the URL, ListeningQueue and Semaphore
Each Thread attempts to acquire the Semaphore (limited to 4 Threads)
Main Thread continues in a while loop, checking the ListeningQueue
When a Thread receives a response, place in ListeningQueue and release Semaphore
A waiting Thread acquires Semaphore (process repeats)
Main Thread processes responses until count equals number of requests
Because I need to limit the number of active Threads I use a Semaphore, and if I were to try this using async-await I would have to devise some logic in the Main Thread or in the async def that prevents a request from being sent if the limit has been reached. Apart from that constraint, I don't see where using async-await would be any more useful. Is it that it lowers overhead and the chance of race conditions by eliminating Threads? Is that the main benefit? If so, even though a ThreadPoolExecutor is making asynchronous calls, it is still using a pool of Threads, so does that make async-await the better option?
Operating under the assumption that my current implementations work asynchronously, I am wondering how the async-await implementation would differ from that of my original one which used Threads and a Queue to manage workers
It would not be hard to implement very similar logic using asyncio and async-await, which has its own version of semaphore that is used in much the same way. See answers to this question for examples of limiting the number of parallel requests with a fixed number of tasks or by using a semaphore.
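As a rough illustration (not your exact code), the "at most 4 active requests" rule could look something like this with asyncio.Semaphore, using aiohttp (mentioned below) in place of the thread pool; the URLs are placeholders:

import asyncio
import aiohttp

async def fetch(session, url, sem):
    async with sem:                            # at most 4 requests in flight
        async with session.get(url) as resp:
            return await resp.text()

async def main(urls):
    sem = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, sem) for url in urls]
        return await asyncio.gather(*tasks)

urls = ["https://example.com/%d" % i for i in range(10)]
results = asyncio.run(main(urls))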
As for advantages of asyncio over equivalent code using threads, there are several:
Everything runs in a single thread regardless of the number of active connections. Your program can scale to a large number of concurrent tasks without swamping the OS with an unreasonable number of threads or the downloads having to wait for a free slot in the thread pool before they even start.
As you pointed out, single-threaded execution is less susceptible to race conditions because the points where a task switch can occur are clearly marked with await, and everything in-between is effectively atomic. The advantage of this is less obvious in small threaded programs where the executor just hands tasks to threads in a fire-and-collect fashion, but as the logic grows more complex and the threads begin to share more state (e.g. due to caching or some synchronization logic), this becomes more pronounced.
async/await allows you to easily create additional independent tasks for things like monitoring, logging and cleanup. When using threads, those do not fit the executor model and require additional threads, always with a design smell that suggests threads are being abused. With asyncio, each task can be written as if it were running in its own thread, and use await to wait for something to happen (and yield control to others) - e.g. a timer-based monitoring task would consist of a loop that awaits asyncio.sleep(), but the logic could be arbitrarily complex. Despite the code looking sequential, each task is lightweight and carries no more weight to the OS than a small allocated object.
async/await supports reliable cancellation, which threads never did and likely never will. This is often overlooked, but in asyncio it is perfectly possible to cancel a running task, which causes it to wake up from await with an exception that terminates it. Cancellation makes it straightforward to implement timeouts, task groups, and other patterns that are impossible or a huge chore when using threads.
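For example, a tiny sketch of a timeout built on cancellation, where slow_fetch is just a stand-in for a real request:

import asyncio

async def slow_fetch():
    # Stand-in for a request that takes too long.
    await asyncio.sleep(10)
    return "data"

async def main():
    try:
        return await asyncio.wait_for(slow_fetch(), timeout=2)
    except asyncio.TimeoutError:
        return None    # the underlying task was cancelled cleanly at the await

print(asyncio.run(main()))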
On the flip side, the disadvantage of async/await is that all your code must be async. Among other things, it means that you cannot use libraries like requests, you have to switch to asyncio-aware alternatives like aiohttp.
I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside, and multiplexes them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request and one answer queue. I discarded this idea since it would either block the interface for the answer-time of the current worker (making it scale badly), or would require me to send around extra information.
Use one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue of being unable to send pipes via pipes, I ran into problems with the pipe closing unless both ends are sent over the connection.
Use one request queue. Each message contains a queue to return the answer to that request. This fails since I cannot send queues via queues, and the reduction trick doesn't work.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. There is no way for one process to access another process's memory directly (as there is with multithreading).
Your best shot is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But you have to know that multiprocessing will only work if Celery/RQ are able to "pack" the needed functions/classes and send them to the other process. Therefore you have to use top-level functions (defined at the top of the file, not belonging to any class).
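A quick illustration of that constraint, using the standard library's multiprocessing.Pool, which pickles work items in much the same way (the function name here is made up):

import multiprocessing

def handle_request(worker_id):            # top-level function, so it pickles
    return "worker %d handled one call" % worker_id

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    print(pool.map(handle_request, [0, 1]))
    pool.close()
    pool.join()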
You can always implement it yourself; Redis is very simple, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you collect every unit of work (in your case, the URLs) in a list and give it to the worker pool.
Your output will be a list of the return values of your worker function, one for every item of work in your original list. All the cool multiprocessing goodness will happen in the background. There are of course other ways of working with the worker pool as well, but this is my favourite one.
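A minimal sketch of that approach, assuming a get_url_data worker function and a placeholder URL list (adapt both to your own code):

import multiprocessing
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

def get_url_data(url):
    # Worker function: download one URL and return its content.
    return urlopen(url).read()

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % i for i in range(100)]
    pool = multiprocessing.Pool(processes=8)   # default is one per CPU core
    output = pool.map(get_url_data, urls)      # results, in the same order as urls
    pool.close()
    pool.join()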
Happy multi-processing!
The best approach I can think of in your use case will be to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work and then go get some more work. This way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.
Create a thread pool, which creates the number of threads you specify, fetches work from the WorkQueue, and assigns it to a thread. Each time a thread finishes and returns, check whether the work queue has more work and, if so, assign it to that thread again. You may also want to add a hook so that every time work is added to the work queue, it is assigned to a free thread if one is available.
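A rough sketch of that work-queue/thread-pool arrangement, with invented names and the standard library's urlopen standing in for your actual download code:

import threading
try:
    import queue                       # Python 3
    from urllib.request import urlopen
except ImportError:
    import Queue as queue              # Python 2
    from urllib2 import urlopen

NUM_THREADS = 8
work_queue = queue.Queue()
output_queue = queue.Queue()

def worker():
    while True:
        url = work_queue.get()
        if url is None:                          # sentinel: no more work
            break
        output_queue.put((url, urlopen(url).read()))

urls = ["http://example.com/page%d" % i for i in range(100)]
for url in urls:
    work_queue.put(url)
for _ in range(NUM_THREADS):
    work_queue.put(None)                         # one sentinel per thread

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

results = []
while not output_queue.empty():
    results.append(output_queue.get())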
The fastest and most efficient way of doing IO-bound tasks like this is an asynchronous event loop. libcurl can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches as fast as one.
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.
I'm currently learning Python (from a Java background), and I have a question about something I would have used threads for in Java.
My program will use workers to read from some web-service some data periodically. Each worker will call on the web-service at various times periodically.
From what I have read, it's preferable to use the multiprocessing module and set up the workers as independent processes that get on with their data-gathering tasks. On Java I would have done something conceptually similar, but using threads. While it appears I can use threads in Python, I'll lose out on multi-cpu utilisation.
Here's the guts of my question: The web-service is throttled, viz., the workers must not call on it more than x times per second. What is the best way for the workers to check on whether they may request data?
I'm confused as to whether this should be achieved using:
Pipes as a way to communicate to some other 'managing object', which monitors the total calls per second.
Something along the lines of mmap, to share some data/value between the processes that describes whether they may call the web-service.
A Manager() object that monitors the calls per seconds and informs workers if they have permission to make their calls.
Of course, I guess this may come down to how I keep track of the calls per second. I suppose one option would be for the workers to call a function on some other object, which makes the call to the web-service and records the current number of calls/sec. Another option would be for the function that calls the web-service to live within each worker, and for them to message a managing object every time they make a call to the web-service.
Thoughts welcome!
Delegate the retrieval to a separate process which queues the requests until it is their turn.
I think that you'll find that the multiprocessing module will provide you with some fairly familiar constructs.
You might find that multiprocessing.Queue is useful for connecting your worker threads back to a managing thread that could provide monitoring or throttling.
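As a rough sketch of that delegation idea (MAX_PER_SECOND, the query strings, and the gatekeeper/worker names are all invented for illustration), one gatekeeper process could own the web-service calls and pace them, while the workers only talk to its request queue:

import multiprocessing
import time

MAX_PER_SECOND = 2    # assumed throttle limit

def gatekeeper(requests, replies):
    # A single process owns the web-service; it spaces out the calls.
    interval = 1.0 / MAX_PER_SECOND
    while True:
        item = requests.get()
        if item is None:                               # sentinel: shut down
            break
        worker_id, query = item
        replies.put((worker_id, "result for " + query))  # placeholder for the real call
        time.sleep(interval)                           # enforce the rate limit

def worker(requests, worker_id):
    for i in range(3):
        requests.put((worker_id, "query-%d-%d" % (worker_id, i)))

if __name__ == "__main__":
    requests = multiprocessing.Queue()
    replies = multiprocessing.Queue()
    gk = multiprocessing.Process(target=gatekeeper, args=(requests, replies))
    gk.start()
    workers = [multiprocessing.Process(target=worker, args=(requests, i)) for i in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    requests.put(None)
    for _ in range(9):                                 # 3 workers x 3 queries each
        print(replies.get())
    gk.join()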
Not really an answer to your question, but an alternative approach to your problem: you could get rid of the synchronization issues by making the requests event-driven, e.g. using Python's asyncore module or Twisted. You wouldn't benefit from multiple CPUs/cores, but in the context of network communication that's usually negligible.