I've noticed the table of time complexities of set operations on the official Python website. But I just want to ask: what's the time complexity of converting a list to a set? For instance,
l = [1, 2, 3, 4, 5]
s = set(l)
I kind of know that this is actually a hash table, but how exactly does it work? Is it O(n) then?
Yes, on average. Iterating over a list is O(n) and adding each element to the hash set is O(1) on average, so the total operation is O(n) on average.
I was asked the same question in my last interview and didn't get it right.
As Trilarion commented on the first answer, the worst-case complexity is O(n^2). Iterating through the list still takes O(n), but you cannot count on adding each element to the hash table being O(1) (sets are implemented using hash tables). In the worst case, the hash function hashes every element to the same value, so every insertion collides. With separate chaining, each bucket keeps a linked list of colliding entries, and since a set by definition has no duplicates, every insertion must first scan that list to make sure the element isn't already present. (CPython's sets actually use open addressing rather than per-bucket linked lists, but the consequence is the same: each new element gets compared against the ones already stored.) Doing that scan for each of the n elements takes a total of n(n-1)/2 = O(n^2) comparisons.
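A sketch of that worst case, using a deliberately bad hash (the BadHash class is made up for illustration):
import time

class BadHash:
    """Every instance hashes to the same value, forcing collisions."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42            # constant hash: worst case for the table
    def __eq__(self, other):
        return self.value == other.value

items = [BadHash(i) for i in range(2000)]
start = time.time()
s = set(items)               # each insert scans the colliding entries: ~O(n^2) overall
print(len(s), time.time() - start)   # noticeably slower than set(range(2000))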
I know that Python dict's keys() since 3.7 are ordered by insertion order. I also know that I can get the first inserted key in O(1) time by doing next(dict.keys())
What I want to know is, is it possible to get the last inserted key in O(1)?
Currently, the only way I know of is to do list(dict.keys())[-1] but this takes O(n) time and space.
By last inserted key, I mean:
d = {}
d['a'] = ...
d['b'] = ...
d['a'] = ...
The last inserted key is 'b' as it's the last element in d.keys()
Dicts support reverse iteration:
next(reversed(d))
Note that due to how the dict implementation works, this is O(1) for a dict where no deletions have occurred, but it may take longer than O(1) if items have been deleted from the dict. If items have been deleted, the iterator may have to traverse a bunch of dummy entries to find the last real entry. (Finding the first item may also take longer than O(1) in such a case.)
If you want to both find the last element and remove it, use d.popitem(), as Tim Peters suggested. popitem is optimized for this. Starting with a dict where no deletions have occurred, removing all elements from the dict with del d[next(reversed(d))] or del d[next(iter(d))] is O(N^2), while removing all elements with d.popitem() is O(N).
In more complex cases, you may want to consider collections.OrderedDict, which uses a linked-list-based ordering implementation. One advantage of that design is that finding the first or last element is always O(1), regardless of what deletions have occurred.
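A minimal sketch of the OrderedDict behavior (deletions don't slow down endpoint access):
from collections import OrderedDict

od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['c'] = 3
del od['a']                  # deleting from the front repeatedly is fine here
print(next(iter(od)))        # 'b' -- first key, O(1)
print(next(reversed(od)))    # 'c' -- last key, O(1)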
@user2357112 supports Monica's next(reversed(d)) does the trick. I'll add that, when I want the most recently added key, I usually want to remove it from the dict too. In that case, a simple d.popitem() does the trick.
CAUTIONS
I also know that I can get the first inserted key in O(1) time by doing next(dict.keys())
That doesn't work. If you try, you'll get
TypeError: 'dict_keys' object is not an iterator
However, e.g., next(iter(d)) will work.
But it's not necessarily O(1). It is for collections.OrderedDict, but not for the built-in dict. The problem is that if, e.g., you delete "the first" key over & over & over ... again, that leaves an increasing number of "holes" at the start of the data structure, and finding the first non-hole has to skip over all of those one at a time.
So, in a loop, repeatedly deleting the first and then accessing the new first can take time quadratic in the number of loop iterations. You probably won't notice unless the dict has at least about 10 thousand elements, though.
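Putting the cautions together on the question's own example (a minimal sketch; next(d.keys()) would raise the TypeError above, while next(iter(d)) works):
d = {}
d['a'] = 1
d['b'] = 2
d['a'] = 3                   # re-assigning an existing key keeps its position
print(next(iter(d)))         # 'a' -- first inserted key
print(next(reversed(d)))     # 'b' -- last inserted key
print(d.popitem())           # ('b', 2) -- removes and returns the last item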
Suppose I have a list list1 of about 1,000,000 sub-lists. Next, I would like to check whether a given element a, which is itself a sub-list, exists in the list. Normally it would be enough to check using if a in list1, but with a large list this works quite slowly. Is there another way?
Since you state you can use tuples, I would recommend making each of your sub-lists into tuples and then making a set of these tuples. Then, searching the set will be an O(1) lookup. Initial construction of the set may be costly, though, but if you do many lookups it is worth it.
>>> set_of_sublists = {tuple(sublist) for sublist in original_list}
>>> tuple(sublist_to_check_for_membership) in set_of_sublists
I want to acknowledge that @BrettBeatty originally gave this answer as well but subsequently deleted it.
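A small runnable sketch of the idea (original_list and a are illustrative stand-ins for the real data):
original_list = [[1, 2], [3, 4], [5, 6]]      # stand-in for the million sub-lists
a = [3, 4]                                     # the sub-list to look for

set_of_sublists = {tuple(sublist) for sublist in original_list}  # built once, O(n)
print(tuple(a) in set_of_sublists)             # True -- O(1) average per lookup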
Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
# complexity O(n) --> proportional to the list length n
I have learned that the hash function used for building dictionaries makes lookups much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
# lookup complexity O(1) --> constant
Starting from word_list, what is the recommended, most efficient way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later on, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
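A minimal sketch of the bucket idea (the table size and the modulo step are simplifying assumptions; CPython's actual probing is more involved):
key = "genesis"              # hypothetical key for the sketch
table_size = 8               # assumed table size; CPython uses powers of two
bucket_index = hash(key) % table_size
print(bucket_index)          # the only "bucket" that needs to be searched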
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to create word_list as a list first and then convert it to a set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
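For example (a hedged sketch assuming the file exists; "lord" is just a probe word):
with open("bible_words.txt") as f:             # the with-block also closes the file
    set_of_words = {line.strip() for line in f}

print("lord" in set_of_words)                  # O(1) average membership test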
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which more or less implies that they are immutable. If it were possible to modify an object, its hash would change, so it would sit in the wrong bucket and searching for it would fail. Fortunately, str, int, float, and tuple objects are all immutable, so at least those can go into sets. (A quick sketch of what happens with an unhashable object follows this list.)
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
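The promised sketch of the hashability point:
s = set()
s.add((1, 2))                # tuples are hashable: fine
try:
    s.add([1, 2])            # lists are mutable and unhashable
except TypeError as e:
    print(e)                 # unhashable type: 'list'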
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
0.7802875302271843
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
1.025623542189976
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
0.19846207875508526
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
3.5699193290383846e-06
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.
I understand that a list is different from an array. But still, O(1)? That would mean accessing an element in a list would be as fast as accessing an element in a dict, which we all know is not true.
My question is based on this document:
list
----------------------------
| Operation | Average Case |
|-----------|--------------|
| ... | ... |
|-----------|--------------|
| Get Item | O(1) |
----------------------------
and this answer:
Lookups in lists are O(n), lookups in dictionaries are amortized O(1),
with regard to the number of items in the data structure.
If the first document is true, then why is accessing a dict faster than accessing a list if they have the same complexity?
Can anybody give a clear explanation on this please? I would say it always depends on the size of the list/dict, but I need more insight on this.
Get item means retrieving the item at a specific index, while lookup means searching for whether some element exists in the list. To do that, unless the list is sorted, you need to check the elements one by one, i.e., O(n) get-item operations, which leads to an O(n) lookup.
A dictionary is maintaining a smart data structure (hash table) under the hood, so you will not need to query O(n) times to find if the element exists, but a constant number of times (average case), leading to O(1) lookup.
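A quick sketch of the difference (absolute timings vary by machine; the shape is what matters):
import timeit

setup = "lst = list(range(1000000))"
print(timeit.timeit("lst[999999]", setup=setup, number=100))    # get item: O(1)
print(timeit.timeit("999999 in lst", setup=setup, number=100))  # lookup: O(n)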
Accessing a list l at index n, l[n], is O(1) because a list is not implemented as a vanilla linked list, where one needs to jump between pointers (value, next -->) n times to reach the cell at index n.
If the memory is contiguous and the entry size were fixed, reaching a specific entry would be trivial, since we know to jump n times the entry size (like classic arrays in C).
However, since list entries vary in size, the Python implementation uses a contiguous block of memory just for the pointers to the values. This makes indexing a list (l[n]) an operation whose cost is independent of the size of the list or the value of the index.
For more information see http://effbot.org/pyfaq/how-are-lists-implemented.htm
That is because Python stores the address of each element of a list in a separate array. When we want the element at index n, all we have to do is look up the nth entry of that address array, which gives us the address of the nth element, so we can get its value in O(1).
Python does some neat tricks to make these arrays expandable as the list grows. Thus we get the flexibility of lists and the speed of arrays. The trade-off is that the interpreter has to reallocate memory for the address array whenever the list grows past its allocated capacity.
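A CPython-specific sketch of that over-allocation (exact byte counts vary by version and platform):
import sys

lst = []
previous = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != previous:     # the pointer array was reallocated (over-allocation)
        print("len=%d  bytes=%d" % (len(lst), size))
        previous = size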
amit has explained in his answer to this question about why lookups in dictionaries are faster than in lists.
From a strict computer-science standpoint, the minimum possible is O(log n), since addressing one of n items requires log n bits.
Say I have a list x of unknown length from which I want to randomly pop one element so that the list does not contain the element afterwards. What is the most Pythonic way to do this?
I can do it using a rather unwieldy combination of pop, random.randint, and len, and would like to see shorter or nicer solutions:
import random
x = [1,2,3,4,5,6]
x.pop(random.randint(0,len(x)-1))
What I am trying to achieve is consecutively pop random elements from a list. (i.e., randomly pop one element and move it to a dictionary, randomly pop another element and move it to another dictionary, ...)
Note that I am using Python 2.6 and did not find any solutions via the search function.
What you seem to be up to doesn't look very Pythonic in the first place. You shouldn't remove stuff from the middle of a list, because lists are implemented as arrays in all Python implementations I know of, so this is an O(n) operation.
If you really need this functionality as part of an algorithm, you should check out a data structure like the blist that supports efficient deletion from the middle.
In pure Python, what you can do if you don't need access to the remaining elements is just shuffle the list first and then iterate over it:
import random

lst = [1, 2, 3]
random.shuffle(lst)
for x in lst:
    # ... do something with x
If you really need the remainder (which is a bit of a code smell, IMHO), at least you can pop() from the end of the list now (which is fast!):
while lst:
    x = lst.pop()
    # do something with the element
In general, you can often express your programs more elegantly if you use a more functional style, instead of mutating state (like you do with the list).
You won't get much better than that, but here is a slight improvement:
x.pop(random.randrange(len(x)))
Documentation on random.randrange():
random.randrange([start], stop[, step])
Return a randomly selected element from range(start, stop, step). This is equivalent to choice(range(start, stop, step)), but doesn’t actually build a range object.
To remove a single element at random index from a list if the order of the rest of list elements doesn't matter:
import random
L = [1,2,3,4,5,6]
i = random.randrange(len(L)) # get random index
L[i], L[-1] = L[-1], L[i] # swap with the last element
x = L.pop() # pop last element O(1)
The swap is used to avoid O(n) behavior on deletion from a middle of a list.
Despite many answers suggesting random.shuffle(x) followed by x.pop(), this is very slow on large data: on a list of 10,000 elements, the run took about 6 seconds with shuffling enabled and 0.2 s with shuffling disabled.
After testing all of the methods given above, the fastest turned out to be the one written by @jfs:
import random
L = [1,"2",[3],(4),{5:"6"},'etc'] #you can take mixed or pure list
i = random.randrange(len(L)) # get random index
L[i], L[-1] = L[-1], L[i] # swap with the last element
x = L.pop() # pop last element O(1)
In support of my claim, here is the time-complexity chart from this source.
If there are no duplicates in the list, you can achieve your purpose using sets too: once the list is made into a set, duplicates are removed, and removing by value or popping an element costs O(1), i.e., it is very efficient. This is the cleanest method I could come up with. (Caveat: set.pop() removes an arbitrary element, not a uniformly random one.)
L = set([1, 2, 3, 4, 5, 6])  # feed the list straight to the built-in set()
while L:
    r = L.pop()
    # do something with r; r is an arbitrary element of the initial list L
Unlike lists, which only support concatenation (A + B), sets support difference (A - B), union (A | B), and intersection (A.intersection(B, C, D)). Super useful when you want to perform logical operations on the data.
OPTIONAL
If you want speed for operations performed on the head and tail of a list, use collections.deque (a double-ended queue), which gives O(1) appends and pops at both ends.
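A minimal sketch:
from collections import deque

d = deque([1, 2, 3, 4, 5])
d.appendleft(0)              # O(1) at the head
d.append(6)                  # O(1) at the tail
print(d.popleft())           # 0 -- O(1)
print(d.pop())               # 6 -- O(1)
# note: indexing the middle of a deque (d[n]) is O(n), unlike a list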
Here's another alternative: why don't you shuffle the list first and then start popping elements from it until no more remain? Like this:
import random
x = [1,2,3,4,5,6]
random.shuffle(x)
while x:
    p = x.pop()
    # do your stuff with p
I know this is an old question, but just for documentation's sake:
If you (the person googling the same question) are doing what I think you are doing, which is selecting k items at random from a list (where k <= len(yourlist)) while making sure no item is selected more than once (sampling without replacement), you could use random.sample as @J.F.Sebastian suggests. But without knowing more about the use case, I don't know if this is what you need.
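For instance (a small sketch; yourlist and k are illustrative names):
import random

yourlist = [1, 2, 3, 4, 5, 6]
k = 3                                        # how many to draw, k <= len(yourlist)
picked = random.sample(yourlist, k)          # k distinct items, no repeats
print(picked)                                # the original list is left untouched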
One way to do it is:
x.remove(random.choice(x))  # note: O(n), and with duplicate values it removes the first matching occurrence
Though this doesn't pop from the list, I came across this question on Google while trying to get X random items from a list without duplicates. Here's what I eventually used:
from random import shuffle

items = [1, 2, 3, 4, 5]
items_needed = 2

shuffle(items)
for item in items[:items_needed]:
    print(item)
This may be slightly inefficient as you're shuffling an entire list but only using a small portion of it, but I'm not an optimisation expert so I could be wrong.
This answer comes courtesy of @niklas-b:
"You probably want to use something like pypi.python.org/pypi/blist "
To quote the PYPI page:
...a list-like type with better asymptotic performance and similar
performance on small lists
The blist is a drop-in replacement for the Python list that provides
better performance when modifying large lists. The blist package also
provides sortedlist, sortedset, weaksortedlist, weaksortedset,
sorteddict, and btuple types.
One would assume lowered performance at the random-access end, as it is a copy-on-write data structure. This violates many use-case assumptions about Python lists, so use it with care.
HOWEVER, if your main use case is to do something weird and unnatural with a list (as in the forced example given by the OP, or my Python 2.6 FIFO queue-with-pass-over issue), then this will fit the bill nicely.
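A hedged sketch (assuming the third-party blist package is installed via pip install blist; its API mirrors the built-in list's):
import random
from blist import blist

x = blist(range(100000))
i = random.randrange(len(x))
item = x.pop(i)              # removing from the middle is O(log n) for a blist
print(item, len(x))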