Efficient data structure for insertion, deletion and search - python

I need to keep track of open and closed states as certain resources are used. The first structure I used was a pair of lists, open_states_list and closed_states_list, to which I add entries and from which I remove them as states are updated.
Inserting is O(1) if we ignore the internal memory allocation and reallocation mechanism, I guess: open_states_list.append(x).
Deleting is O(n): open_states_list.remove(x).
Searching, just like deleting, is O(n): x in open_states_list.
Getting the list is obviously O(1).
The second structure I have used is a single dictionary with boolean values open_states = {} where if open_states[x] is True then x is open and if it is False then it is closed.
"Inserting" is O(1) as we are just setting a key and a value: open_states[x] = True.
"Deleting" is O(1) for the same reason: open_states[x] = False.
"Searching" is also O(1), just accessing the value of the key: open_states[x].
Getting the list is O(n): [x for x, s in open_states.iteritems() if s].
Our most common operation is checking whether x is open or not (search), so the second option is better. But our second most common operation, getting the lists of open or used states, follows the search operation very closely and is more efficient with the first option. We cannot choose the first option, though, because checking the state of a resource is O(n).
Inserting is very common too. Removing is not as essential.
What would be the most efficient data structure for these needs?
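One possible direction, offered only as a sketch and not taken from the original thread: keep one set per state. Membership tests, insertions and removals on a set are O(1) on average, and each set is directly available without rebuilding it. The class name StateTracker and its method names are hypothetical.

```python
class StateTracker:
    """Tracks open and closed items with one set per state (illustrative only)."""

    def __init__(self):
        self.open_states = set()
        self.closed_states = set()

    def open(self, x):
        # Move x into the open set; discard() is a no-op if x is absent.
        self.closed_states.discard(x)
        self.open_states.add(x)

    def close(self, x):
        # Move x into the closed set.
        self.open_states.discard(x)
        self.closed_states.add(x)

    def is_open(self, x):
        # Average O(1) hash lookup.
        return x in self.open_states


tracker = StateTracker()
tracker.open("a")
tracker.open("b")
tracker.close("a")
```

Reading tracker.open_states or tracker.closed_states hands back the existing set without copying; converting either one to a list would still cost O(n).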


Efficient reverse order comparison of huge growing list in Python

In Python, my goal is to maintain a unique list of points (complex scalars, rounded), while steadily creating new ones with a function, like in this pseudo code
list_of_points = []
while True:
    # generate new point according to some rule
    z = generate()
    # check whether this point is already there
    if z not in list_of_points:
        list_of_points.append(z)
    if some_condition:
        break
Now list_of_points can become potentially huge (like 10 million entries or even more) during the process and duplicates are quite frequent. In fact about 50% of the time, a newly created point is already somewhere in the list. However, what I know is that oftentimes the already existing point is near the end of the list. Sometimes it is in the "bulk" and only very occasionally it can be found near the beginning.
This brought me to the idea of doing the search in reverse order. But how would I do this most efficiently (in terms of raw speed), given my potentially large list, which grows during the process? Is the list container even the best choice here?
I managed to gain some performance by doing this:
list_of_points = []
while True:
    # generate new point according to some rule
    z = generate()
    # check very end of list
    if z in list_of_points[-10:]:
        continue
    # check deeper into the list
    if z in list_of_points[-100:-10]:
        continue
    # check the rest
    if z not in list_of_points[:-100]:
        list_of_points.append(z)
    if some_condition:
        break
Clearly, this is not very elegant. Using a second, FIFO-type container (collections.deque) instead gives about the same speedup.
Your best bet might be to use a set instead of a list. Python sets use hashing to insert items, so insertion is very fast. You can also skip the step of checking whether an item is already present by simply trying to add it: if it is already in the set, it won't be added, since duplicates are not allowed.
Stealing your pseudo code example:
set_of_points = set()  # note: {} would create an empty dict, not a set
while True:
    # get size of set
    a = len(set_of_points)
    # generate new point according to some rule
    z = generate()
    # try to add z to the set
    set_of_points.add(z)
    b = len(set_of_points)
    # if a == b it was not added, thus already existed in the set
    if some_condition:
        break
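A runnable variant of the same idea, as a minimal sketch: generate_points is a hypothetical stand-in for the asker's generate(), chosen only so that many duplicates occur, and the loop is capped at 1000 iterations in place of some_condition.

```python
import itertools


def generate_points():
    """Hypothetical stand-in for generate(): rounded complex points with many repeats."""
    for k in itertools.count():
        yield complex(round((k % 7) * 0.5, 3), round((k % 5) * 0.25, 3))


seen = set()
duplicates = 0
for z in itertools.islice(generate_points(), 1000):
    if z in seen:        # average O(1) hash lookup, independent of len(seen)
        duplicates += 1
    else:
        seen.add(z)
```

Complex numbers are hashable, so they can be stored in a set directly. The set does not remember insertion order; if order matters, a list could be kept alongside it.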
Use a set. This is what sets are for. Ah, you already have an answer saying that. So my other comment: this part of your code appears to be incorrect:
# check the rest
if z not in list_of_points[100:]:
In context, I believe you meant to write list_of_points[:-100] there instead. You already checked the last 100, but, as is, you're skipping checking the first 100 instead.
But even better, use plain list_of_points. As the list grows longer, the cost of possibly doing 100 redundant comparisons becomes trivial compared to the cost of copying len(list_of_points) - 100 elements to build the slice.

How do I get the last inserted key in Python (>3.7) dict without iterating through the whole list?

I know that Python dict's keys() since 3.7 are ordered by insertion order. I also know that I can get the first inserted key in O(1) time by doing next(dict.keys())
What I want to know is, is it possible to get the last inserted key in O(1)?
Currently, the only way I know of is to do list(dict.keys())[-1] but this takes O(n) time and space.
By last inserted key, I mean:
d = {}
d['a'] = ...
d['b'] = ...
d['a'] = ...
The last inserted key is 'b' as it's the last element in d.keys()
Dicts support reverse iteration (since Python 3.8), so the last inserted key is next(reversed(d)).
Note that due to how the dict implementation works, this is O(1) for a dict where no deletions have occurred, but it may take longer than O(1) if items have been deleted from the dict. If items have been deleted, the iterator may have to traverse a bunch of dummy entries to find the last real entry. (Finding the first item may also take longer than O(1) in such a case.)
If you want to both find the last element and remove it, use d.popitem(), as Tim Peters suggested. popitem is optimized for this. Starting with a dict where no deletions have occurred, removing all elements from the dict with del d[next(reversed(d))] or del d[next(iter(d))] is O(N^2), while removing all elements with d.popitem() is O(N).
In more complex cases, you may want to consider using collections.OrderedDict. collections.OrderedDict uses a linked list-based ordering implementation with a few advantages, one of those advantages being that finding the first or last element is always O(1) regardless of what deletions have occurred.
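The calls discussed above can be tried on the dict from the question. This is a small sketch; the values 1, 2 and 3 are placeholders for the elided "..." values.

```python
from collections import OrderedDict

d = {}
d["a"] = 1
d["b"] = 2
d["a"] = 3  # updating an existing key does not change its position

last_key = next(reversed(d))   # reverse iteration over a dict needs Python 3.8+
first_key = next(iter(d))

od = OrderedDict(d)            # same order, linked-list based bookkeeping
od_last = next(reversed(od))

popped = d.popitem()           # removes and returns the last inserted pair
```

After this runs, last_key and od_last are both 'b', first_key is 'a', and popped is the pair ('b', 2).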
user2357112 supports Monica's next(reversed(d)) does the trick. I'll add that, when I want the most recently added key, I usually want to remove it from the dict too. In that case, a simple d.popitem() does the trick.
"I also know that I can get the first inserted key in O(1) time by doing next(dict.keys())"
That doesn't work. If you try, you'll get
TypeError: 'dict_keys' object is not an iterator
However, e.g., next(iter(d)) will work.
But it's not necessarily O(1). It is for the OrderedDict library implementation, but not for the built-in dicts. The problem is that if, e.g., you delete "the first" key over & over & over ... again, that leaves an increasing number of "holes" at the start of the data structure, and finding the first non-hole has to skip over all of those one at a time.
So, in a loop, repeatedly deleting the first and then accessing the new first can take time quadratic in the number of loop iterations. You probably won't notice unless the dict has at least about 10 thousand elements, though.

How to efficiently manage a list of elements that can either have one of its elements removed or swapped with its next one?

I have to build a program having two inputs (eventList, a list composed of strings that hold the type of operation and the id of the element that will undergo it, and idList, a list composed of ints, each one being the id of the element).
The two possible events are the deletion of the corresponding id, or having the id swap its position in the idList with the following one (i.e. if the selected id is located in idList[2], it will swap value with idList[3]).
It has to pass strict tests with a set timeout and has to use dictionaries.
This is for a programming assignment. I've already built this program, but I can't find a way to get a decent time and pass the tester's timeouts.
I've also tried using lists instead of dicts, but I still can't pass some timeouts because of the time it takes to use .pop() and .index(), and I've been told the only way to pass all of them is to use dicts.
How I currently handle swaps:
def overtake(dictElement, elementId):
    elementIndex = dictElement[elementId]
    overtakerId = dictSearchOvertaker(dictElement, elementIndex)
    dictElement[elementId], dictElement[overtakerId] = dictElement[overtakerId], dictElement[elementId]
    return dictElement
How I currently handle deletions:
def eliminate(dictElement, elementId):
    #elementIndex = dictElement[elementId]
    del dictElement[elementId]
    return dictUpdate(dictElement, elementId)
How I update the dictionary after an element is deleted:
def dictUpdate(dictElement, elementIndex):
    listedDict = dictElement.items()
    i = 0
    for item in listedDict:
        i += 1
        if item[1] > elementIndex:
            dictElement[item[0]] -= 1
    return dictElement
I'm expected to handle a list of 200k elements where every element gets deleted one by one in 1.5 seconds, but it takes me more than 5 minutes, and even longer for a test where I get an idList with 1500 elements and every element gets swapped with the following one until, in the end, idList is reversed.
One thing that strikes me about this problem is that you're given a single list of operations and expected to return the result of doing all of them. That means you don't necessarily need to do them all one by one, and can instead do operations in a single batch that would otherwise be individually time-consuming.
Swapping two items is O(1) as long as you already know where they are. That's where a dict would come in -- a dict can help you associate one piece of information with another in such a way that you can find it in O(1) time. In this case, you want a way to find the index of an item given its id.
Deleting an item from the middle of a Python list is O(N), even if you already know its index, because internally it's an array and you need to shift everything over to take up the empty space every time you delete something that's not at the end. A naive solution is going to therefore be O(K*N), which is probably the thing the assignment is trying to get you to avoid. But nothing in the problem requires that you actually delete each item from the list one by one, just that the final result you return does not contain those items.
So, my approach would be:
1. Build a dict of id -> index. (This is just a single O(n) iteration over the list.)
2. Create an empty set to track deletions.
3. For each operation:
   - If it's a swap:
     - If the id is in your set, raise an exception.
     - Use your dict to find the indices of the two ids.
     - Swap the two items in the list.
     - Update your dict so it continues to match the list.
   - If it's a delete:
     - Add the id to your set.
4. Create a new list to return as the result.
5. For each item in the original list:
   - Check to see if it's in your set.
   - If it's in the set, skip it (it got deleted).
   - If not, append it to the result.
6. Return the result.
Where N is the list size and K is the number of operations, this ends up being O(N+K), because you iterated over the entire list of IDs exactly twice, and the entire list of operations exactly once, and everything you did inside those iterations was O(1).
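The steps above could be sketched roughly as follows. The event format is an assumption: the question says events are strings holding an operation type and an id but does not show them, so this sketch takes already-parsed (kind, element_id) pairs. The function name apply_events and the choice to ignore a swap on the last position are also assumptions.

```python
def apply_events(id_list, events):
    """Apply swap/delete events lazily and return the resulting list of ids.

    events: iterable of (kind, element_id) pairs with kind "swap" or "delete"
    (an assumed format, not the assignment's real one).
    """
    ids = list(id_list)
    index_of = {element_id: i for i, element_id in enumerate(ids)}  # id -> index
    deleted = set()

    for kind, element_id in events:
        if kind == "delete":
            deleted.add(element_id)          # O(1), no shifting of the list
        elif kind == "swap":
            if element_id in deleted:
                raise ValueError("cannot swap a deleted id: %r" % (element_id,))
            i = index_of[element_id]
            j = i + 1
            if j >= len(ids):
                continue                     # assumption: last slot has no successor
            other = ids[j]
            ids[i], ids[j] = other, element_id
            index_of[element_id], index_of[other] = j, i
        else:
            raise ValueError("unknown event kind: %r" % (kind,))

    # Single pass at the end to drop everything marked as deleted.
    return [element_id for element_id in ids if element_id not in deleted]


result = apply_events([1, 2, 3, 4], [("swap", 1), ("delete", 3), ("swap", 1)])
```

Note that this literal version swaps with whatever sits at the next index, including an entry that was already marked as deleted. Whether that matches the assignment's expected semantics is not settled by the text above, so it would need checking against the tester.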

Python - convert list into dictionary in order to reduce complexity

Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
// complexity O(n) --> proportional to list length "n"
I have learned that the hash function used for building dictionaries allows lookups to be much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
// complexity O(1) ---> constant.
Using word_list, is there a more efficient, recommended way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later on, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
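To make the bucket idea concrete, here is a deliberately simplified toy container. It is only an illustration of hashing into buckets: CPython's real dict and set use open addressing and resize as they grow, which this sketch does not attempt. The class name ToyHashSet is made up for the example.

```python
class ToyHashSet:
    """Simplified bucket-based set; illustrative, not CPython's actual layout."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash(key) picks the bucket; only that bucket is ever searched.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        if key not in bucket:
            bucket.append(key)

    def __contains__(self, key):
        return key in self._bucket(key)


words = ToyHashSet()
for w in ["in", "the", "beginning", "the"]:
    words.add(w)
```

With a fixed number of buckets, lookups degrade as more keys pile into each bucket, which is why real implementations grow the table to keep buckets short.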
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to create a word_list which is a list first and convert it to set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which kind-of implies that they are immutable. If it was possible to modify the object, then its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
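A few of these drawbacks can be seen directly in a short sketch. The word list is made up, and collections.Counter is mentioned only as one standard-library option for the counting case, not something the answer above prescribes.

```python
from collections import Counter

words = ["the", "lord", "the", "light"]

unique_words = set(words)      # duplicates collapse: "the" is stored once
word_counts = Counter(words)   # a dict subclass that keeps per-word counts

try:
    {["not", "hashable"]}      # lists are mutable, so they cannot be set members
    list_rejected = False
except TypeError:
    list_rejected = True

tuple_ok = ("a", "tuple") in {("a", "tuple")}  # tuples of hashables are fine
```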
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.

iterating over a growing set in python

I have a set, setOfManyElements (S below), which contains n elements. I need to go through all those elements and run a function on each element of S:
for s in setOfManyElements:
    elementsFound = EvilFunction(s)
    setOfManyElements |= elementsFound
EvilFunction(s) returns the set of elements it has found. Some of them will already be in S, some will be new, and some will be in S and will have already been tested.
The problem is that each time I run EvilFunction, S will expand (until a maximum set, at which point it will stop growing). So I am essentially iterating over a growing set. Also EvilFunction takes a long time to compute, so you do not want to run it twice on the same data.
Is there an efficient way to approach this problem in Python 2.7?
LATE EDIT: changed the name of the variables to make them more understandable. Thanks for the suggestion
I suggest an incremental version of 6502's approach:
seen = set(initial_items)
active = set(initial_items)
while active:
    next_active = set()
    for item in active:
        for result in evil_func(item):
            if result not in seen:
                seen.add(result)
                next_active.add(result)
    active = next_active
This visits each item only once, and when finished seen contains all visited items.
For further research: this is a breadth-first graph search.
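A self-contained version of this breadth-first expansion, as a sketch: toy_evil_func is a hypothetical stand-in for the expensive function, and the calls counter is there only to show that each item is processed once.

```python
def explore(initial_items, evil_func):
    """Expand the set level by level, calling evil_func once per item."""
    seen = set(initial_items)
    active = set(initial_items)
    calls = 0
    while active:
        next_active = set()
        for item in active:
            calls += 1
            for result in evil_func(item):
                if result not in seen:
                    seen.add(result)         # mark before it is ever processed
                    next_active.add(result)
        active = next_active                 # only new items form the next level
    return seen, calls


def toy_evil_func(n):
    """Hypothetical stand-in for EvilFunction: successors of n, capped at 20."""
    return {m for m in (n + 3, n * 2) if m <= 20}


reached, calls = explore({1}, toy_evil_func)
```

Because an item is added to seen at the moment it is first discovered, it can enter next_active at most once, so the number of calls equals the size of the final set.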
You can just keep a set of already visited elements and pick a not-yet-visited element each time:
visited = set()
todo = S
while todo:
    s = todo.pop()
    visited.add(s)
    todo |= EvilFunction(s) - visited
Iterating over a set in your scenario is a bad idea: you have no guarantee about the ordering, and set iterators are not intended to be used while the set is being modified. So you do not know what will happen to the iterator, nor do you know the position of a newly inserted element.
However, using a list and a set may be a good idea:
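list_elements = list(set_elements)
for s in list_elements:
    elementsFound = EvilFunction(s)
    # subtract the set, not the list: set - list raises TypeError
    new_subset = elementsFound - set_elements
    list_elements.extend(new_subset)
    set_elements |= new_subset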
Depending on the size of everything, you could even drop the set entirely:
for s in list_elements:
    elementsFound = EvilFunction(s)
    list_elements.extend(i for i in elementsFound if i not in list_elements)
However, I am not sure about the performance of this. I think that you should profile. If the list is huge, then the set-based solution seems good, since set-based operations are cheap. However, for moderate sizes, EvilFunction may be expensive enough that it doesn't matter.
