Python: delete element from heap

Python has the heapq module, which implements a heap data structure and supports some basic operations (push, pop).
How do I remove the i-th element from the heap in O(log n)? Is it even possible with heapq, or do I have to use another module?
Note, there is an example at the bottom of the documentation:
http://docs.python.org/library/heapq.html
which suggests a possible approach - this is not what I want. I want to remove the element, not merely mark it as removed.

You can remove the i-th element from a heap quite easily:
h[i] = h[-1]
h.pop()
heapq.heapify(h)
Just replace the element you want to remove with the last element, remove the last element, and then re-heapify the heap. This is O(n); if you want, you can do the same thing in O(log n), but you'll need to call a couple of the internal heapify functions, or better, as larsmans pointed out, just copy the source of _siftup/_siftdown out of heapq.py into your own code:
h[i] = h[-1]
h.pop()
if i < len(h):
    heapq._siftup(h, i)
    heapq._siftdown(h, 0, i)
Note that in each case you can't just do h[i] = h.pop() as that would fail if i references the last element. If you special case removing the last element then you could combine the overwrite and pop.
Note that depending on the typical size of your heap you might find that just calling heapify, while theoretically less efficient, could be faster than re-using _siftup/_siftdown: a little introspection will reveal that heapify is probably implemented in C, but the C implementations of the internal functions aren't exposed. If performance matters to you, consider doing some timing tests on typical data to see which is best. Unless you have really massive heaps, big-O may not be the most important factor.
Edit: someone tried to edit this answer to remove the call to _siftdown with a comment that:
"_siftdown is not needed. New h[i] is guaranteed to be the smallest of the old h[i]'s children, which is still larger than old h[i]'s parent (new h[i]'s parent). _siftdown will be a no-op. I have to edit since I don't have enough rep to add a comment yet."
What they've missed in this comment is that h[-1] might not be a child of h[i] at all. The new value inserted at h[i] could come from a completely different branch of the heap so it might need to be sifted in either direction.
Also, to the comment asking why not just use sort() to restore the heap: _siftup and _siftdown are both O(log n) operations, heapify is O(n), and sort() is O(n log n). It is quite possible that calling sort will be fast enough, but for large heaps it is an unnecessary overhead.
Edited to avoid the issue pointed out by @Seth Bruder. When i references the last element the _siftup() call would fail, but in that case popping an element off the end of the heap doesn't break the heap invariant.
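Putting the above together, a minimal sketch of a delete-by-index helper might look like the following (the name heap_delete is mine; it relies on heapq's private _siftup/_siftdown, which are implementation details and could change between Python versions):
import heapq

def heap_delete(h, i):
    """Remove and return h[i] in O(log n). Assumes 0 <= i < len(h)."""
    removed = h[i]
    last = h.pop()                # shrink the heap by one element
    if i < len(h):                # skip if we just removed the last element
        h[i] = last               # fill the hole with the old last element
        heapq._siftup(h, i)       # sift it down towards the leaves if needed...
        heapq._siftdown(h, 0, i)  # ...or up towards the root
    return removed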

(a) Consider why you don't want to lazy delete. It is the right solution in a lot of cases (see the sketch below).
(b) A heap is a list. You can delete an element by index, just like any other list, but then you will need to re-heapify it, because it will no longer satisfy the heap invariant.
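For completeness, the lazy-delete approach mentioned in (a) is spelled out in the heapq documentation's priority-queue recipe; a condensed sketch along the lines of that recipe (using its entry_finder/REMOVED naming) looks roughly like this:
import heapq
import itertools

pq = []                      # heap of [priority, count, task] entries
entry_finder = {}            # map task -> entry, so it can be found later
REMOVED = '<removed-task>'   # sentinel marking a deleted entry
counter = itertools.count()  # tie-breaker so tasks are never compared directly

def add_task(task, priority=0):
    """Add a new task or update the priority of an existing one."""
    if task in entry_finder:
        remove_task(task)
    entry = [priority, next(counter), task]
    entry_finder[task] = entry
    heapq.heappush(pq, entry)

def remove_task(task):
    """Mark an existing task as removed; O(1), no re-heapify needed."""
    entry = entry_finder.pop(task)
    entry[-1] = REMOVED

def pop_task():
    """Pop the lowest-priority task, skipping entries marked as removed."""
    while pq:
        priority, count, task = heapq.heappop(pq)
        if task is not REMOVED:
            del entry_finder[task]
            return task
    raise KeyError('pop from an empty priority queue')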

Converting recursive function to completely iterative function without using extra space

Is it possible to convert a recursive function like the one below to a completely iterative function?
def fact(n):
    if n <= 1:
        return
    for i in range(n):
        fact(n-1)
        doSomethingFunc()
It seems pretty easy to do given extra space like a stack or a queue, but I was wondering if we can do this in O(1) space complexity?
Note, we cannot do something like:
def fact(n):
    for i in range(factorial(n)):
        doSomethingFunc()
since it takes a non-constant amount of memory to store the result of factorial(n).
Well, generally speaking, no.
I mean, the space taken on the stack by recursive functions is not just an inconvenience of this programming style. It is memory needed for the computation.
So, sure, for a lot of algorithms, that space is unnecessary and could be spared. For a classical factorial, for example
def fact(n):
    if n <= 1:
        return 1
    else:
        return n * fact(n-1)
the stacking of all the n, n-1, n-2, ..., 1 arguments is not really necessary.
So, sure, you can find an implementation that gets rid of it. But that is an optimization (for example, in the specific case of tail recursion. But I am pretty sure you added that "doSomething" to make clear that you don't want to focus on that specific case).
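For instance, a sketch of the loop-based factorial (my own illustration, not part of the original answer): the running product replaces the stack of pending multiplications, so the extra space is O(1), ignoring the size of the number itself.
def fact_iterative(n):
    # the accumulator replaces the stack of deferred multiplications
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result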
You cannot assume in general that an algorithm that doesn't need all those values exists, recursive or iterative. Otherwise, that would be saying that every algorithm exists in an O(1) space complexity version.
Example: base representation of a positive integer
def baseRepr(num, base):
    if num >= base:
        s = baseRepr(num//base, base)
    else:
        s = ''
    return s + chr(48 + num % base)
Not claiming it is optimal, or even well written.
But the stacking of the arguments is needed. It is the way you implicitly store the digits, which you compute in reverse order.
An iterative function would also need some memory to store those digits, since you have to compute the last one first.
Well, I am pretty sure that for this simple example you could find a way to compute from left to right, for example using a log computation to know the number of digits in advance, or something. But that's not the point. Just imagine that there is no other known algorithm than the one computing digits from right to left. Then you need to store them: either implicitly on the stack using recursion, or explicitly in allocated memory. So again, memory used on the stack is not just an inconvenience of recursion. It is the way a recursive algorithm stores things that would otherwise be stored explicitly in an iterative algorithm.
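To make that concrete, here is a hypothetical iterative counterpart (the name base_repr_iterative is mine): the digits the recursive version kept implicitly on the call stack now live in an explicit list that is reversed at the end.
def base_repr_iterative(num, base):
    # digits are produced least-significant first, so store them explicitly...
    digits = []
    while True:
        digits.append(chr(48 + num % base))  # same digit encoding as above
        num //= base
        if num == 0:
            break
    # ...and reverse them: this list is the memory the call stack provided before
    return ''.join(reversed(digits))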
Note, we cannot do something like:
def fact(n):
    for i in range(factorial(n)):
        doSomethingFunc()
since it takes a non-constant amount of memory to store the result of factorial(n).
Yes.
I was wondering if we can do this in O(1) space complexity?
So, no.

Python heappush vs simple append - what is the difference?

From https://www.tutorialspoint.com/heap-queue-or-heapq-in-python:
heappush – This function adds an element to the heap without altering the current heap.
If the current heap is not altered, why don't we use the append() list method? Is the list with the new element heapified only when heappop() is called?
Am I misunderstanding "without altering the current heap"? Or something else?
This is not official reference documentation, so it contains whatever its author wanted to write.
If you consult the official Python Standard Library reference, you will find:
heapq.heappush(heap, item): Push the value item onto the heap, maintaining the heap invariant.
Here what happens is clear: the new item is added to the collection, and the internal structure is eventually adapted so that the binary tree respects the invariant: every parent node has a value less than or equal to any of its children.
After a second look at the tutorial, I think what is meant is that heappush adds the new element without altering the other elements in the heap, as opposed to heappop or heapreplace, which remove the current smallest item.
I believe the "without altering the current heap" means "maintaining the heap property that each node has a smaller key than its children". If you need the heap data structure, list.append() would not suffice. You may like to refer to https://www.cs.yale.edu/homes/aspnes/pinewiki/Heaps.html.
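A small demonstration of the difference (a sketch; the exact layout of the pushed heap reflects CPython's sifting and is shown only for illustration):
import heapq

h = [1, 3, 5, 7]          # already a valid heap
h.append(0)               # plain append: the 0 just lands at the end
print(h)                  # [1, 3, 5, 7, 0] - invariant broken, heappop would return 1

g = [1, 3, 5, 7]
heapq.heappush(g, 0)      # sifts the new item up to its proper position
print(g)                  # [0, 1, 5, 7, 3] - heappop now correctly returns 0

# appending many items and calling heapify once afterwards is also valid:
h = [1, 3, 5, 7]
h.extend([0, 2, 4])
heapq.heapify(h)          # O(n), restores the invariant in one pass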

Is it PROPER to use del or pop() for simply removing a value?

If I want to remove a value from a list in Python, is it proper to use del or can I just use pop() even though it returns the popped value?
I know the difference in how they can be used I just want to know specifically in terms of deleting from a list if I should be using del instead of pop() for everything.
list = [1, 2, 3]
del list[0]
vs.
list = [1, 2, 3]
list.pop(0)
If you do not require the value, del is more correct in that it's easier for other readers to infer your intent (i.e. you just want to discard the value instead of using it for something).
However for all purposes that I can think of, they will act identically in your defined scenario. pop() obviously returns the value, but you are discarding it immediately.
There may be extremely minor performance / garbage collection benefits to using del over pop() but I am merely hypothesizing and have nothing to back that up. I am unsure of del's implementation but I can imagine potential implementations that'd be much faster than pop() due to not needing to return the removed value. Again, this is extremely minor and meaningless in most circumstances.
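If you want numbers rather than hypotheses, a quick timeit comparison along these lines would settle it for your Python version (a sketch; the list size and repeat count are arbitrary):
import timeit

# delete the first element 1000 times from a fresh 100,000-element list
t_del = timeit.timeit("del lst[0]", setup="lst = list(range(100000))", number=1000)
# same thing via pop(0), discarding the returned value
t_pop = timeit.timeit("lst.pop(0)", setup="lst = list(range(100000))", number=1000)

print("del lst[0]:", t_del)
print("lst.pop(0):", t_pop)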

Why Python's list does not have shift/unshift methods?

I am wondering why the default list in Python does not have any shift, unshift methods. Maybe there is an obvious reason for it like the way the lists are ordered in memory.
So currently, I know that I can use append to add an item at the end of a list and pop to remove an element from the end. However, I can only use list concatenation to imitate the behavior of a missing shift or unshift method.
>>> a = [1,2,3,4,5]
>>> a = [0] + a # Unshift / Push
>>> a
[0, 1, 2, 3, 4, 5]
>>> a = a[1:] # Shift / UnPush
>>> a
[1, 2, 3, 4, 5]
Did I miss something?
Python lists were optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v), operations which change both the size and position of the underlying data representation. Actually, the "list" datatype in CPython works differently from what many other languages might call a list (e.g. a linked list) - it is implemented more like what other languages might call an array, though there are some differences there too.
You may be interested instead in collections.deque, which is a list-like container with fast appends and pops on either end.
Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.
It provides the missing methods you appear to be asking about under the names appendleft and popleft:
appendleft(x)
Add x to the left side of the deque.
popleft()
Remove and return an element from the left side of the deque.
If no elements are present, raises an IndexError.
Of course there are trade-offs: indexing or inserting/removing near the middle of the deque is slow. In fact deque.insert(index, object) wasn't even possible before Python 3.5; you would need to rotate, insert/pop, and rotate back. Deques also don't support slicing; for similar functionality you'll have to write something annoying with e.g. itertools.islice instead.
For further discussion of the advantages and disadvantages of deque vs list data structures, see How are deques in Python implemented, and when are they worse than lists?
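For illustration, a short sketch of those deque operations, including the rotate-based insert workaround for versions before 3.5 (deque_insert is a made-up helper name):
from collections import deque

d = deque([1, 2, 3, 4, 5])
d.appendleft(0)         # "unshift": O(1) at the left end
first = d.popleft()     # "shift": O(1) at the left end
print(first, d)         # 0 deque([1, 2, 3, 4, 5])

def deque_insert(dq, index, value):
    # insert at an arbitrary index on Python < 3.5: rotate, push, rotate back
    dq.rotate(-index)
    dq.appendleft(value)
    dq.rotate(index)

deque_insert(d, 2, 99)
print(d)                # deque([1, 2, 99, 3, 4, 5])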
In Python 3 we have the insert method on a list.
It takes the index at which to insert and the value you want to add:
arrayOrList = [1, 2, 3, 4, 5]
arrayOrList.insert(0, 0)
print(arrayOrList)

Python: Memory usage and optimization when modifying lists

The problem
My concern is the following: I am storing a relatively large dataset in a classical Python list, and in order to process the data I must iterate over the list several times, perform some operations on the elements, and often pop an item out of the list.
It seems that deleting one item from a Python list costs O(N), since Python has to copy all the items above the element at hand down one place. Furthermore, since the number of items to delete is approximately proportional to the number of elements in the list, this results in an O(N^2) algorithm.
I am hoping to find a solution that is cost effective (time- and memory-wise). I have studied what I could find on the internet and have summarized my different options below. Which one is the best candidate?
Keeping a local index:
while processingdata:
    index = 0
    while index < len(somelist):
        item = somelist[index]
        dosomestuff(item)
        if somecondition(item):
            del somelist[index]
        else:
            index += 1
This is the original solution I came up with. Not only is this not very elegant, but I am hoping there is a better way to do it that remains time and memory efficient.
Walking the list backwards:
while processingdata:
    for i in xrange(len(somelist) - 1, -1, -1):
        item = somelist[i]
        dosomestuff(item)
        if somecondition(somelist, i):
            somelist.pop(i)
This avoids incrementing an index variable but ultimately has the same cost as the original version. It also breaks the logic of dosomestuff(item) that wishes to process them in the same order as they appear in the original list.
Making a new list:
while processingdata:
    for i, item in enumerate(somelist):
        dosomestuff(item)
    newlist = []
    for item in somelist:
        if somecondition(item):
            newlist.append(item)
    somelist = newlist
    gc.collect()
This is a very naive strategy for eliminating elements from a list and requires lots of memory since an almost full copy of the list must be made.
Using list comprehensions:
while processingdata:
    for i, item in enumerate(somelist):
        dosomestuff(item)
    somelist[:] = [x for x in somelist if somecondition(x)]
This is very elegant, but under the covers it walks the whole list one more time and must copy most of the elements in it. My intuition is that this operation probably costs more than the original del statement, at least memory-wise. Keep in mind that somelist can be huge and that any solution that iterates through it only once per run will probably always win.
Using the filter function:
while processingdata:
    for i, item in enumerate(somelist):
        dosomestuff(item)
    somelist = filter(lambda x: not subtle_condition(x), somelist)
This also creates a new list occupying lots of RAM.
Using the itertools' filter function:
from itertools import ifilterfalse

while processingdata:
    for item in ifilterfalse(somecondition, somelist):
        dosomestuff(item)
This version of the filter call does not create a new list, but it will not call dosomestuff on every item, breaking the logic of the algorithm. I am including this example only for the purpose of creating an exhaustive list.
Moving items up the list while walking
while processingdata:
    index = 0
    for item in somelist:
        dosomestuff(item)
        if not somecondition(item):
            somelist[index] = item
            index += 1
    del somelist[index:]
This is a subtle method that seems cost effective. I think it will move each item (or the pointer to each item?) exactly once, resulting in an O(N) algorithm. Finally, I hope Python will be intelligent enough to resize the list at the end without allocating memory for a new copy of the list. Not sure though.
Abandoning Python lists:
class Doubly_Linked_List:
    def __init__(self):
        self.first = None
        self.last = None
        self.n = 0
    def __len__(self):
        return self.n
    def __iter__(self):
        return DLLIter(self)
    def iterator(self):
        return self.__iter__()
    def append(self, x):
        x = DLLElement(x)
        x.next = None
        if self.last is None:
            x.prev = None
            self.last = x
            self.first = x
            self.n = 1
        else:
            x.prev = self.last
            x.prev.next = x
            self.last = x
            self.n += 1

class DLLElement:
    def __init__(self, x):
        self.next = None
        self.data = x
        self.prev = None

class DLLIter:
    etc...
This type of object resembles a python list in a limited way. However, deletion of an element is guaranteed O(1). I would not like to go here since this would require massive amounts of code refactoring almost everywhere.
Without knowing the specifics of what you're doing with this list, it's hard to know exactly what would be best in this case. If your processing stage depends on the current index of the list element, this won't work, but if not, it appears you've left off the most Pythonic (and in many ways, easiest) approach: generators.
If all you're doing is iterating over each element, processing it in some way, then either including that element in the list or not, use a generator. Then you never need to store the entire iterable in memory.
def process_and_generate_data(source_iterable):
    for item in source_iterable:
        dosomestuff(item)
        if not somecondition(item):
            yield item
You would need to have a processing loop that dealt with persisting the processed iterable (writing it back to a file, or whatever), or if you have multiple processing stages you'd prefer to separate into different generators you could have your processing loop pass one generator to the next.
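A sketch of what that chaining might look like, with made-up stage and helper names (transform and the output path are placeholders standing in for your real processing):
def stage_one(items):
    for item in items:
        dosomestuff(item)
        if not somecondition(item):
            yield item

def stage_two(items):
    for item in items:
        # a second processing pass, still without materialising a full list
        yield transform(item)

def persist(items, path):
    with open(path, "w") as out:
        for item in items:
            out.write("%s\n" % item)

# the stages compose lazily; only one item is "in flight" at a time
persist(stage_two(stage_one(somelist)), "processed.txt")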
From your description it sounds like a deque ("deck") would be exactly what you are looking for:
http://docs.python.org/library/collections.html#deque-objects
"Iterate" across it by repeatedly calling pop() and then, if you want to keep the popped item in the deque, returning that item to the front with appendleft(item). To keep up with when you're done iterating and have seen everything in the deque, either put in a marker object like None that you watch for, or just ask for the deque's len() when you start a particular loop and use range() to pop() exactly that many items.
I believe you will find all of the operations you need are then O(1).
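Concretely, one pass of the scheme described above might be sketched like this (process and keep are placeholder functions):
from collections import deque

d = deque(somelist)
for _ in range(len(d)):     # pop exactly as many items as are present right now
    item = d.pop()          # take an item off the right end
    process(item)
    if keep(item):
        d.appendleft(item)  # survivors go back onto the front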
Python stores only references to objects in the list - not the elements themselves. If you grow a list item by item, the list (that is, the list of references to the objects) will grow one by one, eventually reaching the end of the excess memory that Python preallocated at the end of the list (of references!). It then copies the list (of references!) into a new larger place while your list elements stay at their old location. As your code visits all the elements in the old list anyway, copying the references to a new list by new_list[i] = old_list[i] will be nearly no burden at all. The only performance hint is to allocate all new elements at once instead of appending them (OTOH the Python docs say that amortized append is still O(1) as the number of excess elements grows with the list size). If you are lacking the space for the new list (of references) then I fear you are out of luck - any data structure that would evade the O(n) in-place insert/delete will likely be bigger than a simple array of 4- or 8-byte entries.
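To see that over-allocation in action, a tiny CPython-specific experiment with sys.getsizeof shows the array of references growing in occasional jumps rather than on every append (exact sizes vary by version and platform):
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # the size only changes when CPython reallocates the array of references
        print(len(lst), size)
        last = size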
A doubly linked list is worse than just reallocating the list. A Python list uses 5 words + one word per element. A doubly linked list will use 5 words per element. Even if you use a singly linked list, it's still going to be 4 words per element - a lot worse than the less than 2 words per element that rebuilding the list will take.
From a memory usage perspective, moving items up the list and deleting the slack at the end is the best approach. Python will release the memory if the list gets less than half full. The question to ask yourself is, does it really matter? The list entries probably point to some data; unless you have lots of duplicate objects in the list, the memory used for the list is insignificant compared to the data. Given that, you might just as well build a new list.
For building a new list, the approach you suggested is not that good. There's no apparent reason why you couldn't just go over the list once. Also, calling gc.collect() is unnecessary and actually harmful - the CPython reference counting will release the old list immediately anyway, and even the other garbage collectors are better off collecting when they hit memory pressure. So something like this will work:
while processingdata:
    retained = []
    for item in somelist:
        dosomething(item)
        if not somecondition(item):
            retained.append(item)
    somelist = retained
If you don't mind using side effects in list comprehensions, then the following is also an option:
def process_and_decide(item):
    dosomething(item)
    return not somecondition(item)

while processingdata:
    somelist = [item for item in somelist if process_and_decide(item)]
The inplace method can also be refactored so the mechanism and business logic are separated:
def inplace_filter(func, list_):
    pos = 0
    for item in list_:
        if func(item):
            list_[pos] = item
            pos += 1
    del list_[pos:]

while processingdata:
    inplace_filter(process_and_decide, somelist)
You do not provide enough information for me to answer this question really well. I don't know your use case well enough to tell you what data structures will get you the time complexities you want if you have to optimize for time. The typical solution is to build a new list rather than performing repeated deletions, but obviously this doubles(ish) memory usage.
If you have memory usage issues, you might want to abandon using in-memory Python constructs and go with an on-disk database. Many databases are available and sqlite ships with Python. Depending on your usage and how tight your memory requirements are, array.array or numpy might help you, but this is highly dependent on what you need to do. array.array will have all the same time complexities as list and numpy arrays sort of will but work in some different ways. Using lazy iterators (like generators and the stuff in the itertools module) can often reduce memory usage by a factor of n.
Using a database will improve time to delete items from arbitrary locations (though order will be lost if this is important). Using a dict will do the same, but potentially at high memory usage.
You can also consider blist as a drop-in replacement for a list that might get some of the compromises you want. I don't believe it will drastically increase memory usage, but it will change item removal to O(log n). This comes at the cost of making other operations more expensive, of course.
I would have to see testing to believe that the constant factor for memory use for your doubly linked list implementation would be less than the 2 that you get by simply creating a new list. I really doubt it.
You will have to share more about your problem class for a more concrete answer, I think, but the general advice is
Iterate over a list, building a new list as you go along (or using a generator to yield the items when you need them). If you actually need a list, this will have a memory factor of 2, which scales fine but doesn't help if you are short on memory, period.
If you are running out of memory, rather than microoptimization you probably want an on-disk database or to store your data in a file.
Brandon Craig Rhodes suggests using a collections.deque, which can suit this problem: no additional memory is required for the operation and it is kept O(n). I do not know the total memory usage and how it compares to a list; it's worth noting that a deque has to store a lot more references and I would not be surprised if it isn't as memory intensive as using two lists. You would have to test or study it to know yourself.
If you were to use a deque, I would deploy it slightly differently than Rhodes suggests:
from collections import deque
d = deque(range(30))
n = deque()
print d
while True:
    try:
        item = d.popleft()
    except IndexError:
        break
    if item % 3 != 0:
        n.append(item)
print n
There is no significant memory difference doing it this way, but there's a lot less opportunity to flub up than mutating the same deque as you go.
