Sorry for such a silly question, but the Python docs are confusing...
Link 1: Queue Implementation
http://docs.python.org/library/queue.html
It says that Queue has a class for a priority queue, but I could not find how to use it:
class Queue.PriorityQueue(maxsize=0)
Link 2: Heap Implementation
http://docs.python.org/library/heapq.html
Here they say that we can implement priority queues indirectly using heapq:
import itertools
from heapq import heappush, heappop

pq = []                         # list of entries arranged in a heap
entry_finder = {}               # mapping of tasks to entries
REMOVED = '<removed-task>'      # placeholder for a removed task
counter = itertools.count()     # unique sequence count

def add_task(task, priority=0):
    'Add a new task or update the priority of an existing task'
    if task in entry_finder:
        remove_task(task)
    count = next(counter)
    entry = [priority, count, task]
    entry_finder[task] = entry
    heappush(pq, entry)

def remove_task(task):
    'Mark an existing task as REMOVED. Raise KeyError if not found.'
    entry = entry_finder.pop(task)
    entry[-1] = REMOVED

def pop_task():
    'Remove and return the lowest priority task. Raise KeyError if empty.'
    while pq:
        priority, count, task = heappop(pq)
        if task is not REMOVED:
            del entry_finder[task]
            return task
    raise KeyError('pop from an empty priority queue')
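For reference, here is a quick usage sketch of those helpers (my example, not from the docs):

add_task('write code', priority=5)
add_task('release product', priority=7)
add_task('write spec', priority=1)
remove_task('write code')
print(pop_task())   # 'write spec'      -- lowest priority value first
print(pop_task())   # 'release product' -- 'write code' was marked REMOVED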
Which is the most efficient priority queue implementation in Python? And how do I implement it?
There is no such thing as a "most efficient priority queue implementation" in any language.
A priority queue is all about trade-offs. See http://en.wikipedia.org/wiki/Priority_queue
You should choose one of these two, based on how you plan to use it:
O(log(N)) insertion time and O(1) (findMin+deleteMin)* time, or
O(1) insertion time and O(log(N)) (findMin+deleteMin)* time
(* sidenote: the findMin time of most queues is almost always O(1), so
what I mostly mean here is that deleteMin can be O(1) quick if the
insertion time is O(log(N)) slow, or deleteMin must be O(log(N)) slow
if the insertion time is O(1) fast. Note that both may also be
unnecessarily slow, as with binary-tree-based priority queues.)
In the latter case, you can choose to implement a priority queue with a Fibonacci heap: http://en.wikipedia.org/wiki/Heap_(data_structure)#Comparison_of_theoretic_bounds_for_variants (as you can see, heapq, which is basically a binary heap, necessarily has O(log(N)) for both insertion and findMin+deleteMin)
If you are dealing with data that has special properties (such as bounded data), then you can achieve O(1) insertion and O(1) findMin+deleteMin time. You can only do this with certain kinds of data, because otherwise you could abuse your priority queue to violate the O(N log(N)) bound on sorting. vEB trees kind of fall under a similar category, since you have a maximum set size (the M in O(log(log(M))) refers not to the number of elements but to the maximum number of elements), and thus you cannot circumvent the theoretical O(N log(N)) general-purpose comparison-sorting bound.
To implement any queue in any language, all you need to do is define the insert(value) and extractMin() -> value operations. This generally involves a minimal wrapping of the underlying heap; see http://en.wikipedia.org/wiki/Fibonacci_heap to implement your own, or use an off-the-shelf library for a similar heap such as a Pairing Heap (a Google search turned up http://svn.python.org/projects/sandbox/trunk/collections/pairing_heap.py )
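For illustration, a minimal heapq-backed wrapper exposing just those two operations might look like this (my sketch; the class name MinPQ is made up):

import heapq

class MinPQ:
    """Minimal priority queue: insert(value) and extractMin() -> value."""
    def __init__(self):
        self._heap = []

    def insert(self, value):
        heapq.heappush(self._heap, value)   # O(log n)

    def extractMin(self):
        return heapq.heappop(self._heap)    # O(log n)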
If you only care about which of the two you referenced is more efficient (the heapq-based code from http://docs.python.org/library/heapq.html#priority-queue-implementation-notes which you included above, versus Queue.PriorityQueue), then:
There doesn't seem to be any easily findable discussion on the web as to what Queue.PriorityQueue actually does; you would have to dive into the source code, which is linked from the help documentation: http://hg.python.org/cpython/file/2.7/Lib/Queue.py
def _put(self, item, heappush=heapq.heappush):
    heappush(self.queue, item)

def _get(self, heappop=heapq.heappop):
    return heappop(self.queue)
As we can see, Queue.PriorityQueue also uses heapq as the underlying mechanism, so the two are equally good or bad, asymptotically speaking. Queue.PriorityQueue additionally supports concurrent access, so I would wager it carries slightly more constant-factor overhead. But because you know the underlying implementation (and asymptotic behavior) must be the same, the simplest way to compare them is to run both on the same large dataset.
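For example, a throwaway benchmark along those lines might look like this (my sketch, using the Python 3 module name queue; absolute numbers will vary by machine and version):

import heapq
import queue
import random
import timeit

data = [random.random() for _ in range(100000)]

def use_heapq():
    heap = []
    for x in data:
        heapq.heappush(heap, x)
    while heap:
        heapq.heappop(heap)

def use_priority_queue():
    q = queue.PriorityQueue()
    for x in data:
        q.put(x)
    while not q.empty():
        q.get()

print('heapq:         ', timeit.timeit(use_heapq, number=10))
print('PriorityQueue: ', timeit.timeit(use_priority_queue, number=10))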
(Do note that Queue.PriorityQueue does not seem to have a way to remove entries, while heapq does. However this is a double-edged sword: Good priority queue implementations might possibly allow you to delete elements in O(1) or O(log(N)) time, but if you use the remove_task function you mention, and let those zombie tasks accumulate in your queue because you aren't extracting them off the min, then you will see asymptotic slowdown which you wouldn't otherwise see. Of course, you couldn't do this with Queue.PriorityQueue in the first place, so no comparison can be made here.)
The version in the Queue module is implemented using the heapq module, so they have equal efficiency for the underlying heap operations.
That said, the Queue version is slower because it adds locks, encapsulation, and a nice object-oriented API.
The priority queue suggestions shown in the heapq docs are meant to show how to add additional capabilities to a priority queue (such as sort stability and the ability to change the priority of a previously enqueued task). If you don't need those capabilities, then the basic heappush and heappop functions will give you the fastest performance.
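In the basic case that is just (a minimal sketch):

import heapq

heap = []
heapq.heappush(heap, (2, 'code'))
heapq.heappush(heap, (1, 'spec'))
heapq.heappush(heap, (3, 'tests'))
print(heapq.heappop(heap))   # (1, 'spec') -- the smallest item comes out first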
Although this question has already been answered and accepted, here is a simple custom implementation of a priority queue, without using any module, to show how it works.
# class for Node with data and priority
class Node:
    def __init__(self, info, priority):
        self.info = info
        self.priority = priority

# class for Priority queue
class PriorityQueue:
    def __init__(self):
        self.queue = list()
        # if you want you can set a maximum size for the queue

    def insert(self, node):
        # if queue is empty
        if self.size() == 0:
            # add the new node
            self.queue.append(node)
        else:
            # traverse the queue to find the right place for new node
            for x in range(0, self.size()):
                # if the priority of new node is greater
                if node.priority >= self.queue[x].priority:
                    # if we have traversed the complete queue
                    if x == (self.size() - 1):
                        # add new node at the end
                        self.queue.insert(x + 1, node)
                    else:
                        continue
                else:
                    self.queue.insert(x, node)
                    return True

    def delete(self):
        # remove the first node from the queue
        return self.queue.pop(0)

    def show(self):
        for x in self.queue:
            print(str(x.info) + " - " + str(x.priority))

    def size(self):
        return len(self.queue)
Find the complete code and explanation here: https://www.studytonight.com/post/implementing-priority-queue-in-python (Updated URL)
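A quick demo of the class above (my sketch, not part of the original answer; the lowest priority value sits at the front):

pq = PriorityQueue()
pq.insert(Node("C", 3))
pq.insert(Node("A", 1))
pq.insert(Node("B", 2))
pq.show()           # A - 1, B - 2, C - 3
first = pq.delete()
print(first.info)   # A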
Related
I used the queue.Queue class for passing tasks from one thread to another. Later I needed to add priority, so I changed it to PriorityQueue, using the proposed PrioritizedItem (because the tasks are dicts and cannot be compared). Then, in rare situations, it started causing task mixups. It took me a while to realise/debug that same-priority items in the PriorityQueue do not keep their insertion order or, even worse from a debugging point of view, they usually do.
I guess FIFO is a sort of default assumption when talking about task queues; that is why Queue is not called FifoQueue, isn't it? So PriorityQueue should explicitly state that it is not FIFO for equal-priority items. Unfortunately, the Python docs do not warn us about this, and that lack of warning caused a headache for me, and probably for others too.
I have not found any ready-made solution, but I am pretty sure others may need a PriorityQueue that keeps the insertion order for equal-priority items. Hence this ticket...
Besides hoping that the Python docs will state this warning in an upcoming release, let me share how I solved the problem.
heapq (used by PriorityQueue) proposes inserting a sequence number into the compared part of the item, so that the effective priority is unambiguous and no two items can ever compare equal.
I also added a threading.Lock so that two items cannot end up with the same sequence number because of a thread race.
import dataclasses
import itertools
import threading
import typing

class _ThreadSafeCounter(object):
    def __init__(self, start=0):
        self.countergen = itertools.count(start)
        self.lock = threading.Lock()

    def __call__(self):
        with self.lock:
            return self.countergen.__next__()

# create a function that provides incremental sequence numbers
_getnextseqnum = _ThreadSafeCounter()

@dataclasses.dataclass(order=True)
class PriorityQueueItem:
    """Container for priority queue items

    The payload of the item is stored in the optional "data" (None by default), and
    can be of any type, even one that cannot be compared, e.g. dict.
    The queue priority is defined mainly by the optional "priority" argument (10 by
    default).
    If there are more items with the same "priority", their put-order is preserved,
    because of the automatically increasing sequence number, "_seqnum".

    Usage in the producer:
        pq.put(PriorityQueueItem("Best-effort-task", 100))
        pq.put(PriorityQueueItem(dict(b=2)))
        pq.put(PriorityQueueItem(priority=0))
        pq.put(PriorityQueueItem(dict(a=1)))
    The consumer is to get the tasks with pq.get().getdata(), and will actually receive
        None
        {'b': 2}
        {'a': 1}
        "Best-effort-task"
    """
    data: typing.Any = dataclasses.field(default=None, compare=False)
    priority: int = 10
    _seqnum: int = dataclasses.field(default_factory=_getnextseqnum, init=False)

    def getdata(self):
        """Get the payload of the item in the consumer thread"""
        return self.data
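A quick single-threaded demonstration (my sketch) with the standard queue.PriorityQueue:

import queue

pq = queue.PriorityQueue()
pq.put(PriorityQueueItem("Best-effort-task", 100))
pq.put(PriorityQueueItem(dict(b=2)))
pq.put(PriorityQueueItem(priority=0))
pq.put(PriorityQueueItem(dict(a=1)))

while not pq.empty():
    print(pq.get().getdata())
# None, {'b': 2}, {'a': 1}, 'Best-effort-task' -- equal priorities keep put-order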
I have designed a circular priority queue, but it took me a while because the code is full of special cases and its time complexity is rather high.
I implemented it using a list, but I need a more efficient circular priority queue implementation.
I'll illustrate my queue structure below; it may be helpful for someone looking for code to help understand circular priority queues.
class PriorityQueue:
    def __init__(self, n, key=None):
        if key is None:
            key = lambda x: x
        self.maxsize = n
        self.key = key
        self.arr = list(range(self.maxsize))
        self.rear = -1
        self.front = 0
        self.nelements = 0

    def isPQueueful(self):
        return self.nelements == self.maxsize

    def isPQueueempty(self):
        return self.nelements == 0

    def insert(self, item):
        if not self.isPQueueful():
            pos = self.rear + 1
            scope = range(self.rear - self.maxsize, self.front - self.maxsize - 1, -1)
            if self.rear == 0 and self.rear < self.front:
                scope = range(0, self.front - self.maxsize - 1, -1)
            for i in scope:
                if self.key(item) > self.key(self.arr[i]):
                    self.arr[i + 1] = self.arr[i]
                    pos = i
                else:
                    break
            self.rear += 1
            if self.rear == self.maxsize:
                self.rear = 0
            if pos == self.maxsize:
                pos = 0
            self.arr[pos] = item
            self.nelements += 1
        else:
            print("Priority Queue is full")

    def remove(self):
        revalue = None
        if not self.isPQueueempty():
            revalue = self.arr[self.front]
            self.front += 1
            if self.front == self.maxsize:
                self.front = 0
            self.nelements -= 1
        else:
            print("Priority Queue is empty")
        return revalue
I would really appreciate it if someone could say whether what I designed is suitable for use in production code. I suspect it is mostly not efficient.
If so, can you point me toward how to design an efficient circular priority queue?
So, think of the interface and implementation separately.
The interface to a circular priority queue will make you think that the structure is a circular queue. It has a "highest" priority head and the next one is slightly lower, and then you get to the end, and the next one is the head again.
The methods you write need to act that way.
But the implementation doesn't actually need to be any kind of queue, list, array or linear structure.
For the implementation, you are trying to maintain a set of nodes that are always sorted by priority. For that, it would be better to use some kind of balanced tree (for example a red-black tree).
You hide that detail below your interface -- when you get to the end, you just reset yourself to the beginning -- and your interface makes it look circular.
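A minimal sketch of that idea, using a plain sorted list as a stand-in for the balanced tree (the class and method names here are illustrative, not from any library):

import bisect

class CircularPriorityQueue:
    """Circular *view* over a priority-ordered collection.

    A plain sorted list is the backing store here; a balanced tree
    (e.g. red-black) would make insert O(log n) instead of O(n).
    """
    def __init__(self):
        self._items = []     # kept sorted by priority, ascending
        self._cursor = 0     # position of the "head" in the circular view

    def insert(self, priority, value):
        # bisect finds the slot in O(log n); the list insertion is O(n)
        bisect.insort(self._items, (priority, value))

    def next(self):
        # walk the items in priority order, wrapping around at the end
        if not self._items:
            raise IndexError("empty priority queue")
        item = self._items[self._cursor]
        self._cursor = (self._cursor + 1) % len(self._items)
        return item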
So, I have a class P, and I want two priority queues of objects of type P. However, I want to order one of them on P.x and the other on P.y. Now, queue.PriorityQueue.put() does not support a key function, so I resorted to doing the following:
class P:
    ...
    def __lt__(self, other):
        return self.y < other.y
    ...
However, this does not allow for sorting based on P.x. At the same time, I want to peek one of the queues (but not the other), and queue.PriorityQueue does not have a peek function. Therefore, I replaced one of the priority queues with a sorted list instead. I can't use the SortedContainers library, because this is for a homework assignment and I can't guarantee that the grading server has it installed, so I turned to using bisect.insort.
The only problem, however, is that bisect.insort also does not support key functions. Therefore, I had to write my own function binary_insert(lst, item, key) to accomplish this task, and then I call it with binary_insert(lst, item, key = lambda i: i.x). This feels like a hack, since I'm writing my own binary insertion function, and binary insertion is such a core computer science concept that this must have come up before.
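(For what it's worth, bisect.insort did grow a key parameter in Python 3.10. On older versions, a hand-rolled version might look like the sketch below; the binary_insert name and signature come from the question above, while the body is illustrative:)

def binary_insert(lst, item, key):
    """Insert item into lst, kept sorted by key(element).

    Uses O(log n) comparisons, though the list insertion itself
    is still O(n).
    """
    lo, hi = 0, len(lst)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(lst[mid]) < key(item):
            lo = mid + 1
        else:
            hi = mid
    lst.insert(lo, item)

# usage, as in the question:
# binary_insert(lst, item, key=lambda i: i.x)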
One way to do it would be to have the list store tuples of the form (x, p), and have the priority queue store tuples of the form (y, p). But, is there any other way to internalize these attributes into P itself? Otherwise, I will have to unpack a tuple every time I pop off an item, and this may cause my program to become littered with unused variables.
Perhaps you can subclass PriorityQueue to do the tuples for you, something like this (totally untested code):
from queue import PriorityQueue

class MyPriorityQueue(PriorityQueue):
    def _put(self, item):
        super()._put((item.x, item))

    def _get(self):
        return super()._get()[1]
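For example, exercised with a small stand-in for P (hypothetical, and with distinct x values, since ties would fall back to comparing the P objects themselves):

from dataclasses import dataclass

@dataclass
class P:
    x: int
    y: int = 0

q = MyPriorityQueue()
q.put(P(x=3))
q.put(P(x=1))
q.put(P(x=2))
print(q.get())   # P(x=1, y=0) -- ordered by x, no tuple unpacking at the call site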
I have an asyncio.PriorityQueue that I am using as the URL queue for a web crawler, with the lowest scored URLs being the first removed from the queue when I call url_queue.get(). When the queue reaches maxsize items, the default behavior is to block on calls to url_queue.put(), until a call to get() removes an item from the queue to make space.
What I would like is to never block, but instead push the item with the highest score off the queue (or at least an item with one of the highest scores) whenever I attempt to put() an item that has a lower score. Is there a way to automatically remove items from the bottom of the heap this way in asyncio.PriorityQueue? If not, is there an alternative priority queue / heap implementation that works with asyncio and would enable me to do this? Or some other data structure / technique that would give me some kind of non-blocking, prioritized queue with a maximum size?
Thanks!
Is there a way to automatically remove items from the bottom of the heap this way in asyncio.PriorityQueue?
Not by default, but it should be straightforward to inherit from asyncio.PriorityQueue and just implement the desired behavior. Unlike multi-threaded queue implementations, the asyncio queue runs in a single thread and therefore does not need to worry about synchronization issues.
A possible issue with performance is that PriorityQueue is not designed as a double-ended queue, so it uses a heap to store items. A heap is either min or max, but not both; Python's heapq module implements a min-heap, but you can easily simulate a max-heap by multiplying priorities by -1. In a min-heap one can access and pop the smallest item in logarithmic time, but not the largest one, and in a max-heap it's the other way around. To efficiently manipulate both the smallest and the largest item, you'll need to inherit from asyncio.Queue and use a different data structure to store items, such as a sorted list.
For example (untested):
import asyncio
import sortedcontainers

class DroppingPriorityQueue(asyncio.Queue):
    def _init(self, maxsize):
        # called by asyncio.Queue.__init__
        self._queue = sortedcontainers.SortedList()

    def _put(self, item):
        # called by asyncio.Queue.put_nowait
        self._queue.add(item)

    def _get(self):
        # called by asyncio.Queue.get_nowait
        # pop the first (most important) item off the queue
        return self._queue.pop(0)

    def __drop(self):
        # drop the last (least important) item from the queue
        self._queue.pop()
        # no consumer will get a chance to process this item, so
        # we must decrement the unfinished count ourselves
        self.task_done()

    def put_nowait(self, item):
        if self.full():
            self.__drop()
        super().put_nowait(item)

    async def put(self, item):
        # Queue.put blocks when full, so we must override it.
        # Since our put_nowait never raises QueueFull, we can just
        # call it directly.
        self.put_nowait(item)
The class implements two distinct concerns:
It overrides the _get, _put, and _init protected methods to use a SortedList as the underlying storage. Although undocumented, these methods are intended for building customized queues such as PriorityQueue and LifoQueue, and they have been in place for decades, first in the Queue module (queue in Python 3) and later in asyncio.
It overrides the put and put_nowait public methods to implement the drop-when-full semantics.
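A quick exercise of the drop-when-full semantics might look like this (my sketch; assumes sortedcontainers is installed and Python 3.7+ for asyncio.run):

async def main():
    q = DroppingPriorityQueue(maxsize=2)
    await q.put((3, 'low priority url'))
    await q.put((1, 'high priority url'))
    await q.put((2, 'medium priority url'))  # queue is full: drops (3, ...)
    print(await q.get())   # (1, 'high priority url')
    print(await q.get())   # (2, 'medium priority url')

asyncio.run(main())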
I have a large number of CPU-bound tasks that I want to run in parallel. Most of those tasks will return similar results, and I only need to store the unique results and count the non-unique ones.
Here's how it is currently designed: I use two managed dictionaries, one for results and another one for result counters. My tasks check those dictionaries, using unique result keys, for the results they have found, and either write into both dictionaries or only increase the counter for a non-unique result (if I have to write, I acquire the lock and check again to avoid inconsistency).
What I am concerned about: since Pool.map actually returns a result object, even though I do not save a reference to it, the results will still pile up in memory until they are garbage-collected. Even though that will be millions of plain Nones (since I process the results in a different manner and all my tasks just return None), I cannot rely on specific garbage-collector behavior, so the program might eventually run out of memory. I still want to keep the nice features of the pool but leave out this built-in result handling. Is my understanding correct, and is my concern valid? If so, are there any alternatives?
Also, now that I have laid it out on paper, it looks really clumsy :) Do you see a better way to design such a thing?
Thanks!
Question: I still want to keep the nice features of the pool
Remove the returning of results from multiprocessing.Pool:
Copy class MapResult from multiprocessing.pool and make it inherit from mp.pool.ApplyResult.
Add, replace, or comment out the following:
import multiprocessing as mp
from multiprocessing.pool import Pool

class MapResult(mp.pool.ApplyResult):
    def __init__(self, cache, chunksize, length, callback, error_callback):
        super().__init__(cache, callback, error_callback=error_callback)
        ...
        #self._value = [None] * length
        self._value = None
        ...

    def _set(self, i, success_result):
        ...
        if success:
            #self._value[i*self._chunksize:(i+1)*self._chunksize] = result
            ...
Create your own class myPool(Pool), inheriting from multiprocessing.Pool.
Copy def _map_async(...) from multiprocessing.Pool.
Add, replace, or comment out the following:
class myPool(Pool):
    def __init__(self, processes=1):
        super().__init__(processes=processes)

    def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
                   error_callback=None):
        ...
        #if self._state != RUN:
        if self._state != mp.pool.RUN:
            ...
        #return result
Tested with Python 3.4.2.
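For completeness, a lighter-weight alternative to patching Pool internals (my sketch, not part of the answer above): consume Pool.imap_unordered lazily instead of calling Pool.map, so results are yielded one at a time and never accumulate in a list:

import multiprocessing as mp

def work(arg):
    # ... do the CPU-bound task, record outcomes in shared state ...
    return None

if __name__ == '__main__':
    with mp.Pool() as pool:
        # imap_unordered yields results one at a time as workers finish;
        # discarding them here keeps memory flat even for millions of tasks
        for _ in pool.imap_unordered(work, range(1000000), chunksize=100):
            pass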