EDIT: I'm using Python 3.5.0, and so map() will return an iterator instead of a list, unlike Python 2.x
I have a list of units, and I am calling a REST API on each of them to return more data about them. I'm using map() to do this, but when I try to convert that map to a list, the program hangs and doesn't proceed (both when I run it and when I debug it):
data = list(map(lambda product: client.request(units_url + "/" + product), units))
At first I thought maybe it was an issue with calling the API so quickly, but when I iterate through the map manually (without converting it to a list) and print each item, it works just fine:
data = map(lambda product: client.request(units_url + "/" + product), units)
for item in data:
    print(item)  # <-- this works just fine for the entire map
Anyone know why I'm getting this behavior?
When you list-ify the map, every single request is dispatched serially: each one waits for completion and stores its result into the resulting list. If you're dispatching 1000 requests, each request must complete, in order, one by one, before the list is constructed and you see the first result; it's entirely synchronous.
You get results (almost) immediately in the direct map iteration case because it only makes one request at a time; instead of waiting for 1000 requests, it waits for 1, you process that result, then it waits for another, etc.
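You can see that laziness in isolation with the built-in map (nothing to do with the REST client; just a toy demonstration):

m = map(print, range(3))  # nothing runs yet; map() is lazy in Python 3
next(m)                   # prints 0: exactly one call happens per item you pull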
If the goal is to minimize latency, take a look at multiprocessing.Pool.imap (or the thread based version of the pool implemented in multiprocessing.dummy; threads can be ideal for parallel network I/O requests and won't require pickling data for IPC). With the Pool's map, imap, or imap_unordered methods (choose one based on your needs), the requests will be dispatched asynchronously, several at a time (depending on the number of workers you select). If you absolutely must have a list, Pool.map will usually construct it faster; if you can iterate directly and don't care about the ordering of results, Pool.imap_unordered will get you results as fast as the workers can get them, in whatever order they are satisfied in. Plain map without a Pool isn't getting you any magical performance benefits (a list comprehension would usually run faster actually), so use a Pool.
Simple example code for fastest results:
import multiprocessing.dummy as multiprocessing  # thread-based version of the library; fine for network I/O

with multiprocessing.Pool(8) as pool:  # pool of eight worker threads
    for item in pool.imap_unordered(lambda product: client.request(units_url + "/" + product), units):
        print(item)
If you really need to, you can use Pool.map and store to a real list, and assuming you have the bandwidth to run eight parallel requests (or however many workers you configure the pool for), that should (roughly) divide the time to complete the map by eight.
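For completeness, a minimal sketch of that Pool.map variant, reusing the client, units_url, and units names from the question:

import multiprocessing.dummy as multiprocessing  # thread-based Pool

with multiprocessing.Pool(8) as pool:
    # Blocks until all requests (eight at a time) have finished, then returns
    # a list in the same order as units.
    data = pool.map(lambda product: client.request(units_url + "/" + product), units)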
Better answer than I previously had: check out this link. Near the bottom of that answer it gives a great analysis of why you should really use a list comprehension here:
data = [ client.request(units_url + "/" + product) for product in units ]
Related
I'm trying to parallelize some Python code using processes and concurrent.futures. It looks like I can execute a function multiple times in parallel either by submitting calls and then calling Future.result() on the futures, or by using Executor.map().
I'm wondering if the latter is just a syntactic sugar for the former and if there's any difference performance-wise. It doesn't seem immediately clear from the documentation.
It will allow you to execute a function multiple times concurrently instead of true parallel execution.
Performance wise, I recently found that ProcessPoolExecutor.submit() and ProcessPoolExecutor.map() consumed the same amount of compute time to complete the same task. Note: .submit() returns a future object (let's call it f) and you need to use its f.result() method to see its result. On the other hand, .map() directly returns an iterator.
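If it helps, here is a minimal sketch (not the benchmark code from that answer; work() is just a stand-in function) showing the two styles side by side:

from concurrent.futures import ProcessPoolExecutor

def work(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # .submit(): one Future per call; you collect the results yourself
        futures = [executor.submit(work, x) for x in range(10)]
        via_submit = [f.result() for f in futures]

        # .map(): returns an iterator over the results, in input order
        via_map = list(executor.map(work, range(10)))

    assert via_submit == via_map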
When converting their results into an ordered list using sorted(), I found that the compute time of the entire .map() code can be faster than that of the entire .submit() code in certain scenarios.
When converting their results into an unordered list using list(), the compute times of the entire .submit() and .map() code are the same. Also, both of these performed faster than the versions using sorted().
You can read the details in my answer. There, I have also shared my code so you can see how it works. I hope it helps.
I have not used ThreadPoolExecutor so I can't comment in detail. However, I have read that they are implemented the same way as the ProcessPoolExecutor and they are more suited to be used for I/O bound tasks instead of CPU bound tasks. You do need to specify the max_workers argument, i.e. the max number of threads, whereas in the ProcessPoolExecutor max_workers is an optional argument which defaults to the number of CPUs returned by os.cpu_count().
I am just learning Python and don't have much experience with multithreading. I am trying to send some JSON via the Requests session.post method. This is called in the function at the bottom of the many for loops I need to run through for the dictionary.
Is there a way to let this run in parallel?
I also have to limit my number of threads, otherwise the post calls get blocked because they come too fast after each other. Help would be much appreciated.
def doWork(session, List, RefHashList):
    for itemRefHash in RefHashList:
        for equipment in res['Response']['data']['items']:
            if equipment['itemHash'] == itemRefHash:
                if equipment['characterIndex'] != 0:
                    SendJsonViaSession(session, getCharacterIdFromIndex(res, equipment['characterIndex']), itemRefHash, equipment['quantity'])
First, structuring your code differently might improve the speed without the added complexity of threading.
def doWork(session, res, RefHashList):
    for equipment in res['Response']['data']['items']:
        i = equipment['itemHash']
        k = equipment['characterIndex']
        if i in RefHashList and k != 0:
            SendJsonViaSession(session, getCharacterIdFromIndex(res, k), i, equipment['quantity'])
To start with, we will look up equipment['itemHash'] and equipment['characterIndex'] only once.
Instead of explicitly looping over RefHashList, you could use the in operator. This moves the membership loop out of Python bytecode and into the interpreter's C implementation, which is faster.
And instead of a nested if-conditional, you could use a single conditional using and.
Note: I have removed the unused parameter List, and replaced it with res. It is generally good practice to write functions that only act on parameters that they are given, not global variables.
Second, how much extra performance do you need? How much time is there on average between the SendJsonViaSession calls, and how small can this time become before calls get blocked? If the difference between those numbers is small, it is probably not worth it to implement a threaded sender.
Third, a design feature of the standard Python implementation (CPython's global interpreter lock) is that only one thread at a time can be executing Python bytecode. So it is not certain that threading will improve performance.
Edit:
There are several ways to run stuff in parallel in Python. There is multiprocessing.Pool which uses processes, and multiprocessing.dummy.ThreadPool which uses threads. And from Python 3.2 onwards there is concurrent.futures, which can use processes or threads.
The thing is, neither of them has rate limiting. So you could get blocked for making too many calls.
Every time you call SendJsonViaSession you'd have to save the current time somehow so that all processes or threads can use it. And before every call, you would have to read that time and wait if it is too close to the last call.
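A minimal sketch of that idea for the threaded case (MIN_INTERVAL and the send_rate_limited wrapper are made-up names; SendJsonViaSession is the function from the question):

import threading
import time

MIN_INTERVAL = 0.5            # seconds between calls; tune to what the API tolerates
_last_call = [0.0]            # shared, mutable timestamp of the last call
_rate_lock = threading.Lock()

def send_rate_limited(session, character_id, item_hash, quantity):
    # Serialize callers: wait until at least MIN_INTERVAL has passed since the last call.
    with _rate_lock:
        wait = MIN_INTERVAL - (time.monotonic() - _last_call[0])
        if wait > 0:
            time.sleep(wait)
        _last_call[0] = time.monotonic()
    SendJsonViaSession(session, character_id, item_hash, quantity)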
Edit2:
If a call to SendJsonViaSession only takes 0.3 seconds, you should be able to do 3 calls/second sequentially. But your code only does 1 call/second. This implies that the speed restriction is somewhere else. You'd have to profile your code to see where the problem lies.
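For a first look, the standard-library profiler is usually enough (assuming doWork and its arguments are visible at module level):

import cProfile

cProfile.run("doWork(session, res, RefHashList)", sort="cumulative")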
I have a program that is using pool.map() to get the values using ten parallel workers. I'm having trouble wrapping my head around how I am supposed to stitch the values back together to make use of them at the end.
What I have is structured like this:
initial_input = get_initial_values()
pool.map(function, initial_input)
pool.close()
pool.join()
# now how would I get the output?
send_ftp_of_output(output_data)
Would I write the function to a log file? If so, if there are (as a hypothetical) a million processes trying to write to the same file, would things overwrite each other?
pool.map(function,input)
returns a list.
You can get the output by doing:
output_data = pool.map(function,input)
pool.map simply runs the mapped function in parallel, but it still returns a single list. If the function you are mapping returns its result (rather than writing output somewhere else, which you shouldn't do), pool.map collects those return values into a list. This is the same as what map() would do, except that the calls are executed in parallel.
In regards to the log file: yes, having multiple threads write to the same place would interleave entries within the log file. You could have each thread lock the file before writing, which would ensure that an entry wouldn't get interrupted mid-write, but entries would still interleave chronologically amongst all the threads. Locking the log file each time would also significantly slow down logging due to the overhead involved.
You can also have, say, the thread number -- %(thread)d -- or some other identifying mark in the logging Formatter output that would help to differentiate, but it could still be hard to follow, especially for a bunch of threads.
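A minimal sketch of that kind of setup, assuming the workers are threads and each one logs through the standard logging module:

import logging

logging.basicConfig(
    filename='workers.log',
    level=logging.INFO,
    format='%(asctime)s [%(threadName)s/%(thread)d] %(message)s',
)

def function(item):                     # the function you hand to pool.map
    logging.info('processing %r', item)
    # ... do the real work here ...
    return item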
Not sure if this would work in your specific application, as the specifics in your app may preclude it, however, I would strongly recommend considering GNU Parallel (http://www.gnu.org/software/parallel/) to do the parallelized work. (You can use, say, subprocess.check_output to call into it).
The benefits of this are severalfold: chiefly, you can easily vary the number of parallel workers -- up to having parallel use one worker per core on the machine -- and it will pipeline the items accordingly. The other main benefit, and the one more specifically related to your question, is that it will stitch the output of all of these parallel workers together as if they had been invoked serially.
If your program wouldn't work well with, say, a single command line piped from a file and parallelized within the app, you could perhaps make your Python code single-worker, build a number of permutations of your Python command line (varying the target each time), pipe those commands to parallel, and then have it output the results.
I use GNU Parallel quite often in conjunction with Python, often to do things, like, say, 6 simultaneous Postgres queries using psql from a list of 50 items.
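As a rough sketch of that pattern (worker.py and the item list are made up for illustration; ::: is how parallel takes its argument list):

import subprocess

items = ['alpha', 'beta', 'gamma']
cmd = ['parallel', '-j', '6', 'python', 'worker.py', ':::'] + items
output = subprocess.check_output(cmd)  # stdout of all jobs, stitched together job by job
print(output.decode())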
Using Tritlo's suggestion, here is what worked for me:
from multiprocessing import Pool

NUM_IN_PARALLEL = 10  # ten parallel workers

def run_updates(input_data):
    # do something
    return {data}

if __name__ == '__main__':
    item = iTunes()
    item.fetch_itunes_pulldowns_to_do()
    initial_input_data = item.fetched_update_info
    pool = Pool(NUM_IN_PARALLEL)
    result = pool.map(run_updates, initial_input_data)
    pool.close()
    pool.join()
    print(result)
And this gives me a list of results
General Overview
I have a medium-size Django project.
I have a bunch of prefix trees in memory (as opposed to in the DB).
The nodes of these trees represent entities/objects that are subject to a timeout, i.e. I need to time out these nodes at various points in time.
Design:
Essentially, I needed a Timer construct that allows me to fire a resettable one-shot timer and give it a callback that can perform some operation on the entity creating the timer, which in this case is a node of the tree.
After looking through the various options, I couldn't find anything that I could natively use (like some django app). The Timer object in Python is not suitable for this since it won't scale/perform. Thus I decided to write my own timer based on:
A sorted list of time-delta objects that holds the time-horizon
A mechanism to trigger the "tick"
Implementation Choices:
Went with a wrapper around Bisect for the sorted delta list:
http://code.activestate.com/recipes/577197-sortedcollection/
Went with celery to provide the tick - A granularity of 1 minute, where the worker would trigger the timer_tick function provided by my Timer class.
The timer_tick essentially should go through the sorted list, decrementing the head node every tick. Then if any nodes have ticked down to 0, kick off the callback and remove those nodes from the sorted timer list.
Creating a timer involves instantiating a Timer object, which returns the id of the object. This id is stored in the DB and associated with an entry in the DB that represents the entity creating the timer.
Additional Data Structures:
In order to track the Timer instances (which get instantiated for each timer creation) I have a WeakRef Dictionary that maps the id to obj
So essentially, I have 2 data-structures in memory of my main Django project.
Problem Statement:
Since the celery worker needs to walk the timer list and also potentially modify the id2obj map, it looks like I need to find a way to share state between my celery worker and my main process.
Going through SO/Google, I found the following suggestions:
Manager
Shared Memory
Unfortunately, the bisect wrapper doesn't lend itself very well to pickling and/or state sharing. I tried the Manager approach by creating a dict and trying to embed the sorted list within that dict... it came out with an error (kind of expected, I guess, since the memory held by the sorted list is not shared, and embedding it within a "shared" memory object will not work).
Finally...Question:
Is there a way I can share my SortedCollection and WeakRef Dict with the worker thread?
Alternate solution:
How about keeping the worker thread simple: have it write to the DB on every tick, and then use a post-DB-write signal to get notified in the main process and do the processing of expired timers there. Of course, the con is that I lose parallelisation.
Let's start with some comments on your existing implementation:
Went with a wrapper around Bisect for the sorted delta list: http://code.activestate.com/recipes/577197-sortedcollection/
While this gives you O(1) pops (as long as you keep the list in reverse time order), it makes each insert O(N) (and likewise for less common operations like deleting arbitrary jobs if you have a "cancel" API). Since you're doing exactly as many inserts as pops, this means the whole thing is algorithmically no better than an unsorted list.
Replacing this with a heapq (that's exactly what they're for) gives you O(log N) inserts. (Note that Python's heapq doesn't have a peek, but that's because heap[0] is equivalent to heap.peek(0), so you don't need it.)
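A small sketch of what that looks like, assuming each entry is a (target_time, job_id) tuple:

import heapq
import time

heap = []
heapq.heappush(heap, (time.time() + 60, 'job-a'))  # O(log N) insert
heapq.heappush(heap, (time.time() + 5, 'job-b'))

now = time.time()
while heap and heap[0][0] <= now:   # heap[0] plays the role of peek()
    _, job_id = heapq.heappop(heap)
    print('expired:', job_id)       # or kick off your expiry callback here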
If you need to make other operations (cancel, iterate non-destructively, etc.) O(log N) as well, you want a search tree; look at blist and bintrees on PyPI for some good ones.
Went with celery to provide the tick - A granularity of 1 minute, where the worker would trigger the timer_tick function provided by my Timer class. The timer_tick essentially should go through the sorted list, decrementing the head node every tick. Then if any nodes have ticked down to 0, kick off the callback and remove those nodes from the sorted timer list.
It's much nicer to just keep the target times instead of the deltas. With target times, you just have to do this:
while q.peek().timestamp <= now():
    process(q.pop())
Again, that's O(1) rather than O(N), and it's a lot simpler, and it treats the elements on the queue as immutable, and it avoids any possible problems with iterations taking longer than your tick time (probably not a problem with 1-minute ticks…).
Now, on to your main question:
Is there a way I can share my SortedCollection
Yes. If you just want a priority heap of (timestamp, id) pairs, you can fit that into a multiprocessing.Array just as easily as a list, except for the need to keep track of length explicitly. Then you just need to synchronize every operation, and… that's it.
If you're only ticking once/minute, and you expect to be busy more often than not, you can just use a Lock to synchronize, and have the schedule-worker(s) tick itself.
But honestly, I'd drop the ticks completely and just use a Condition—it's more flexible, and conceptually simpler (even if it's a bit more code), and it means you're using 0% CPU when there's no work to be done and responding quickly and smoothly when you're under load. For example:
def schedule_job(timestamp, job):
    job_id = add_job_to_shared_dict(job)  # see below
    with scheduler_condition:
        scheduler_heap.push((timestamp, job_id))  # push the id so the worker can look the job up
        scheduler_condition.notify_all()

def scheduler_worker_run_once():
    with scheduler_condition:
        while True:
            top = scheduler_heap.peek()
            if top is not None:
                delay = top[0] - now()
                if delay <= 0:
                    break
                scheduler_condition.wait(delay)
            else:
                scheduler_condition.wait()
        top = scheduler_heap.pop()
    if top is not None:
        job = pop_job_from_shared_dict(top[1])
        process_job(job)
Anyway, that brings us to the weakdict full of jobs.
Since a weakdict is explicitly storing references to in-process objects, it doesn't make any sense to share it across processes. What you want to store are immutable objects that define what the jobs actually are, not the mutable jobs themselves. Then it's just a plain old dict.
But still, a plain old dict is not an easy thing to share across processes.
The easy way to do that is to use a dbm database (or a shelve wrapper around one) instead of an in-memory dict, synchronized with a Lock. But this means re-flushing and re-opening the database every time anyone wants to change it, which may be unacceptable.
Switching to, say, a sqlite3 database may seem like overkill, but it may be a whole lot simpler.
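For instance, a rough sketch of the sqlite3 version, with made-up table and function names, where each process opens its own connection to the same file:

import pickle
import sqlite3

DB_PATH = 'jobs.db'

def _connect():
    conn = sqlite3.connect(DB_PATH)
    conn.execute('CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY, payload BLOB)')
    return conn

def add_job(job):
    with _connect() as conn:  # the with block wraps this in a transaction
        cur = conn.execute('INSERT INTO jobs (payload) VALUES (?)', (pickle.dumps(job),))
        return cur.lastrowid

def pop_job(job_id):
    with _connect() as conn:
        row = conn.execute('SELECT payload FROM jobs WHERE id = ?', (job_id,)).fetchone()
        conn.execute('DELETE FROM jobs WHERE id = ?', (job_id,))
        return pickle.loads(row[0]) if row else None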
On the other hand… the only operations you actually have here are "map the next id to this job and return the id" and "pop and return the job specified by this id". Does that really need to be a dict? The keys are integers, and you control them. An Array, plus a single Value for the next key, and a Lock, and you're almost done. The problem is that you need some kind of scheme for key overflow. Instead of just next_id += 1, you have to roll over, and check for already-used slots:
with lock:
    next_id += 1
    if next_id == size:
        next_id = 0
    if arr[next_id] is None:
        arr[next_id] = job
        return next_id
Another option is to just store the dict in the main process, and use a Queue to make other processes query it.
Is it possible to "pipeline" consumption of a generator across multiple consumers?
For example, it's common to have code with this pattern:
def consumer1(iterator):
    for item in iterator:
        foo(item)

def consumer2(iterator):
    for item in iterator:
        bar(item)

myiter = list(big_generator())
v1 = consumer1(myiter)
v2 = consumer2(myiter)
In this case, multiple functions completely consume the same iterator, making it necessary to cache the iterator in a list. Since each consumer exhausts the iterator, itertools.tee is useless.
I see code like this a lot and I always wish I could get the consumers to consume one item at a time in order instead of caching the entire iterator. E.g.:
consumer1 consumes myiter[0]
consumer2 consumes myiter[0]
consumer1 consumes myiter[1]
consumer2 consumes myiter[1]
etc...
If I were to make up a syntax, it would look like this:
c1_retval, c2_retval = iforkjoin(big_generator(), (consumer1, consumer2))
You can get close with threads or multiprocessing and teed iterators, but threads consume at different speeds meaning that the value deque cached inside tee could get very large. The point here is not to exploit parallelism or to speed up tasks but to avoid caching large sections of the iterator.
It seems to me that this might be impossible without modifying the consumers because the flow of control is in the consumer. However, when a consumer actually consumes the iterator control passes into the iterator's next() method, so maybe it is possible to invert the flow of control somehow so that the iterator blocks the consumers one at a time until it can feed them all?
If this is possible, I'm not clever enough to see how. Any ideas?
With the limitation of not changing consumers' code (i.e. having a loop in them), you're left with only two options:
the approach you already include in your question: caching the generated items in memory, then iterating over them multiple times.
running each consumer in a thread, and implementing some kind of synchronized itertools.tee with a buffer of size 1, which blocks serving item i+1 until item i has been served to all consumers (a rough sketch of this follows).
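For what it's worth, here is a rough sketch of option 2. The names (iforkjoin, _SENTINEL) are made up; each consumer runs in its own thread and receives items through a one-slot queue, so at most one item per consumer is buffered at any time:

import queue
import threading

_SENTINEL = object()

def iforkjoin(iterable, consumers):
    # Feed every item of iterable to each consumer, one item at a time.
    qs = [queue.Queue(maxsize=1) for _ in consumers]  # one-slot buffer per consumer
    results = [None] * len(consumers)

    def run(idx, consumer, q):
        def items():
            while True:
                item = q.get()
                if item is _SENTINEL:
                    return
                yield item
        results[idx] = consumer(items())

    threads = [threading.Thread(target=run, args=(i, c, q))
               for i, (c, q) in enumerate(zip(consumers, qs))]
    for t in threads:
        t.start()
    for item in iterable:
        for q in qs:              # blocks until this consumer has taken the previous item
            q.put(item)
    for q in qs:
        q.put(_SENTINEL)
    for t in threads:
        t.join()
    return results

Used like the made-up syntax from the question: c1_retval, c2_retval = iforkjoin(big_generator(), (consumer1, consumer2)).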
There are no other options. You can't achieve all of the below, as they contradict each other:
having a generator
having a loop to consume all of it
then, (serially-)after the previous loop has finished, having another loop to consume all of it again
only keeping O(1) items in memory (or disk, etc.) while consuming them
not regenerating (i.e. not re-creating the generator)
The generated items must be stored somewhere if you want to reuse them.
If changing the consumers' code is acceptable, clearly #monkey's solution is the simplest and most straightforward.
Doesn't this work? Or do you require the entire iterator inside each consumer, so that handing them one item at a time like this won't work? If so, then I think you either have to create a copy, or else generate the list twice.
for item in big_generator():
    consumer1.handle_item(item)
    consumer2.handle_item(item)