I need a way to find the minimum value in a dictionary full of Node objects in O(1) time, or really any sublinear time, if possible.
Here's an example of what I'd need:
'''
Nodes have 4 attributes:
- stack
- backwards
- forwards
- total_score
'''
frontier = {str(Node1.stack): Node1,
            str(Node2.stack): Node2, ... }  # note that keys are the stacks as strings

key, smallest = min(frontier.items(), key=lambda pair: pair[1].total_score)
( ^^^ something better than this! ^^^ )
The last line above (key, smallest ... ) is what I have so far. It works fine, but it's too slow. I read online that the min() function takes O(n) time. I have a lot of Nodes to process, so something faster would be amazing.
Edit: I should have mentioned before, but this is running inside an A* algorithm; frontier is updated dynamically. The operations I need to be able to do are:
Find minimum in O(1), or at least < O(n)
Update values of specific elements quickly
Access attributes easily
It's impossible to get the min value from a dictionary in O(1) time because you have to check every value. However, you can get the minimum quickly if you store your data in a heap or a sorted tree, ordered by value. A binary heap gives you O(1) access to the minimum and O(log n) insertion and removal; balanced trees generally give insertion and search times of O(log n).
If you literally only need the min value of your data and you don't ever need to look up other values, you could just create a minValue variable that you keep updated every time you insert or remove items.
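A sketch of the heap approach in the asker's A* setting, assuming the Node shape from the question ("lazy deletion", i.e. pushing a fresh entry instead of updating in place and skipping stale entries on pop, is one common way to handle the "update values" requirement, not the only one):

```python
import heapq

# Stand-in for the asker's Node class (only the relevant attributes).
class Node:
    def __init__(self, stack, total_score):
        self.stack = stack
        self.total_score = total_score

frontier = []      # heap of (total_score, tie_breaker, node)
best = {}          # str(node.stack) -> best score seen so far
counter = 0        # tie-breaker so Node objects are never compared

def push(node):
    global counter
    key = str(node.stack)
    if node.total_score < best.get(key, float("inf")):
        best[key] = node.total_score
        heapq.heappush(frontier, (node.total_score, counter, node))
        counter += 1

def pop_min():
    # O(log n) pop; stale entries (superseded by a cheaper push) are skipped.
    while frontier:
        score, _, node = heapq.heappop(frontier)
        if best.get(str(node.stack)) == score:
            return node
    return None
```

Peeking at the minimum without removing it is `frontier[0]`, which is O(1); pushes and pops are O(log n).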
It is impossible to get the smallest value in a dictionary in O(1); it will take O(n). Alternatively, while you are adding elements to the dictionary, you can maintain the minimum seen so far and check it at insertion time.
Related
My textbook says that the following algorithm has an efficiency of O(n):
list = [5, 8, 4, 5]

def check_for_duplicates(list):
    dups = []
    for i in range(len(list)):
        if list[i] not in dups:
            dups.append(list[i])
        else:
            return True
    return False
But why? I ask because the in operation is O(n) as well (according to this resource). If we take list as an example, the program needs to iterate 4 times over the list. But with each iteration, dups keeps growing: for the first iteration over list, dups has no elements, for the second it has one element, for the third two elements, and for the fourth three elements. Wouldn't that make 1 + 2 + 3 = 6 extra iterations for the in operation on top of the list iterations? And if so, wouldn't that alter the efficiency significantly, since the sum of the extra iterations grows faster than the input?
You are correct that the runtime of the code that you've posted here is O(n²), not O(n), for precisely the reason that you've indicated.
Conceptually, the algorithm you're implementing goes like this:
Maintain a collection of all the items seen so far.
For each item in the list:
If that item is in the collection, report a duplicate exists.
Otherwise, add it to the collection.
Report that there are no duplicates.
The reason the code you've posted here is slow is because the cost of checking whether a duplicate exists is O(n) when using a list to track the items seen so far. In fact, if you're using a list of the existing elements, what you're doing is essentially equivalent to just checking the previous elements of the array to see if any of them are equal!
You can speed this up by switching your implementation so that you use a set to track prior elements rather than a list. Sets have (expected) O(1) lookups and insertions, so this will make your code run in (expected) O(n) time overall.
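The set-based version is a small change; a sketch keeping the question's function name:

```python
def check_for_duplicates(items):
    seen = set()
    for item in items:
        if item in seen:   # expected O(1) membership test
            return True
        seen.add(item)     # expected O(1) insertion
    return False
```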
I have some code where I have a number of dicts that look something like this:
d = {63:1, 96:2, 128:3, 192:4, 256:6, 384:9, 480:12, 511:14}
These are used with another lookup key to do threshold checks: where we want to find the value associated with the closest dict key that is higher than our lookup key. So for the dict above, a search for any k <=63 would return 1, a search on any k >63 && k <=96 would return 2, and so on.
The code I received did a simple walk through the dict k/v pairs, as so:
def threshold_lookup(d, lookup_key):
    for k, v in d.items():   ## keys are in strictly increasing order
        if lookup_key <= k:  ## first dict key at or above our lookup key
            return v
    return None              ## lookup_key is larger than every key
This works, but as it "walks up" the dictionary keys the performance isn't great: every lookup compares against the keys one by one, so it is O(n) in the number of entries.
This lookup gets called trillions of times in the simulations I'm running, so I'm trying to find a way to make this O(1). The brute force way to do this is just convert this mapping dictionary into a flat array list, where I replicate all the intermediate values, and end up with a direct table with the number of entries equal to the largest key in the dict, and each index would map directly to the return value. So if I had a source dict like this:
example = {2:12, 4:17, 7:21, 8:50}
I could convert that into a list such as:
Idx = 0 1 2 3 4 5 6 7 8
directTable = [12, 12, 12, 17, 17, 21, 21, 21, 50]
This is nice because it obviously runs in O(1) time, but it requires doing the work to pre-build this and the state to store it. In my particular case I can live with this restriction because the max size of the direct table isn't too big, but it got me wondering if there's some other way to do this that doesn't involve flattening/explicitly building out a big array. I found OrderedDict and bisect which would help if the dictionary had a lot of keys, but in my case most of these dicts have in the range of 3-10 keys. So even if I could do an O(logN) search, on a short dictionary that might still be 3-4 operations, which compared to O(1) and run trillions of times is a lot of clock cycles. ;-)
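For what it's worth, the bisect route is only a few lines; a sketch using the dict from the top of the question (plain dicts preserve insertion order in Python 3.7+, so no OrderedDict is needed):

```python
import bisect

d = {63: 1, 96: 2, 128: 3, 192: 4, 256: 6, 384: 9, 480: 12, 511: 14}
keys = sorted(d)               # [63, 96, 128, ...]
vals = [d[k] for k in keys]

def threshold_lookup(lookup_key):
    # Index of the first key >= lookup_key: O(log n).
    i = bisect.bisect_left(keys, lookup_key)
    return vals[i]             # IndexError if lookup_key exceeds the largest key
```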
So again, I guess my specific problem is solved, but is there some super fancy pythonic magic I'm unaware of that would let me do this even more elegantly?
When you use the counting sort algorithm you create a list, and use its indices as keys while adding the number of integer occurrences as the values within the list. Why is this not the same as simply creating a dictionary with the keys as the index and the counts as the values? Such as:
hash_table = collections.Counter(numList)
or
hash_table = {x:numList.count(x) for x in numList}
Once you have your hash table created you essentially just copy the number of integer occurrences over to another list. Hash tables/dictionaries have O(1) lookup times, so why would this not be preferable if you're simply referencing the key/value pairs?
I've included the algorithm for Counting Sort below for reference:
def counting_sort(the_list, max_value):
    # List of 0's at indices 0...max_value
    num_counts = [0] * (max_value + 1)

    # Populate num_counts
    for item in the_list:
        num_counts[item] += 1

    # Populate the final sorted list
    sorted_list = []
    # For each (item, count) pair in num_counts
    for item, count in enumerate(num_counts):
        # Add the item once per occurrence
        for _ in range(count):
            sorted_list.append(item)
    return sorted_list
You certainly can do something like this. The question is whether it’s worthwhile to do so.
Counting sort has a runtime of O(n + U), where n is the number of elements in the array and U is the maximum value. Notice that as U gets larger and larger, the runtime of this algorithm starts to degrade noticeably. For example, if U > n and I add one more digit to U (say, changing it from 1,000,000 to 10,000,000), the runtime can increase by a factor of ten. This means that counting sort starts to become impractical as U gets bigger and bigger, and so you typically run counting sort when U is fairly small.

If you’re going to run counting sort with a small value of U, then using a hash table isn’t necessarily worth the overhead. Hashing items costs more CPU cycles than just doing standard array lookups, and for small arrays the potential savings in memory might not be worth the extra time. And if you’re using a very large value of U, you’re better off switching to radix sort, which essentially is lots of smaller passes of counting sort with a very small value of U.
The other issue is that the reassembly step of counting sort has amazing locality of reference - you simply scan over the counts array and the output array in parallel, filling in values. If you use a hash table, you lose some of that locality because the elements in the hash table aren’t necessarily stored consecutively.
But these are more implementation arguments than anything else. Fundamentally, counting sort is less about “use an array” and more about “build a frequency histogram.” It just happens to be the case that a regular old array is usually preferable to a hash table when building that histogram.
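To make the comparison concrete, here is a sketch of the hash-table variant the question proposes (`counting_sort_with_counter` is an illustrative name); it produces the same output as the array version, but pays hashing costs per element:

```python
from collections import Counter

def counting_sort_with_counter(the_list, max_value):
    # Frequency histogram backed by a hash table instead of an array.
    counts = Counter(the_list)
    sorted_list = []
    # Walk the value range in order, emitting each item count times.
    for item in range(max_value + 1):
        sorted_list.extend([item] * counts[item])
    return sorted_list
```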
I have created a matrix by using a dictionary with a tuple as the key (e.g. {(user, place) : 1 } )
I need to calculate the Euclidian for each place in the matrix.
I've created a method to do this, but it is extremely inefficient because it iterates through the entire matrix for each place.
def calculateEuclidian(self, place):
    count = 0
    for key, value in self.matrix.items():
        if key[1] == place and value == 1:
            count += 1
    return math.sqrt(count)
Is there a way to do this more efficiently?
I need the result to be in a dictionary with the place as a key, and the euclidian as the value.
You can use a dict comprehension that sums the boolean results of the conditionals (0 or 1) to get the count; this is more concise than the explicit loop, though still a linear scan:
def calculateEuclidian(self, place):
    return {place: math.sqrt(sum(p == place and val == 1
                                 for (_, p), val in self.matrix.items()))}
With your current data structure, I doubt there is any way you can avoid iterating through the entire dictionary.
If you cannot use another way (or an auxiliary way) of representing your data, iterating through every element of the dict is as efficient as you can get (asymptotically), since there is no way to ask a dict with tuple keys to give you all elements with keys matching (_, place) (where _ denotes "any value"). There are other, and more succinct, ways of writing the iteration code, but you cannot escape the asymptotic efficiency limitation.
If this is your most common operation, and you can in fact use another way of representing your data, you can use a dict[Place, list[User]] instead. That way, you can, in O(1) time, get the list of all users at a certain place, and all you would need to do is count the items in the list using the len(...) function which is also O(1). Obviously, you'll still need to take the sqrt in the end.
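A sketch of that restructuring; the sample matrix and the names users_at / euclidean_for are made up for illustration:

```python
import math
from collections import defaultdict

# Example data in the question's {(user, place): 1} layout.
matrix = {("alice", "paris"): 1, ("bob", "paris"): 1, ("bob", "york"): 1}

# Auxiliary index, built once: place -> list of users with value 1.
users_at = defaultdict(list)
for (user, place), value in matrix.items():
    if value == 1:
        users_at[place].append(user)

def euclidean_for(place):
    return math.sqrt(len(users_at[place]))   # len() is O(1)
```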
There may be ways to make it more Pythonic, but I do not think you can change the overall complexity, since you are making a query based on both key and value. I think you have to search the whole matrix for your instances.
You may want to create a new dictionary from your current one (which isn't suited to this kind of search): a dictionary with place as the key and a list of (user, value) tuples as the values.

Get the tuple list under the place key (that'll be fast), then count the entries where value is 1 (linear, but over a small set of data).

Keep the original dictionary for the Euclidean distance computation. This assumes you don't change the data too often, because you'd need to keep both dicts in sync.
In the situation I encounter, I would like to define "elegant" as having 1) constant O(1) time complexity for checking whether an item exists, and 2) storing only the items themselves, nothing more.
For example, if I use a list
num_list = []
for num in range(10):  # Dummy operation to fill the container.
    num_list.append(num)

if 1 in num_list:
    print("Number exists!")
The operation "in" will take O(n) time according to [Link]
In order to achieve constant checking time, I may employ a dictionary
num_dict = {}
for num in range(10):  # Dummy operation to fill the container.
    num_dict[num] = True

if 1 in num_dict:
    print("Number exists!")
In the case of a dictionary, the operation "in" costs O(1) time according to [Link], but additional O(n) storage is required to store dummy values. Therefore, both implementations/containers seem inelegant.
What would be a better implementation/container to achieve constant O(1) time for checking if an item exists while only storing the items? How to keep resource requirement to the bare minimum?
The solution here is to use a set, which doesn't require you to store a dummy value for each item.
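For example, a direct rewrite of the dict version from the question:

```python
num_set = set()
for num in range(10):  # Dummy operation to fill the container.
    num_set.add(num)

if 1 in num_set:       # O(1) average membership test, no dummy values stored
    print("Number exists!")
```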
Normally you can't optimise both space and time together. One thing you can do is use what you know about the range of the data (here, the min to max value of num) and its size (here, the number of loop iterations, i.e. 10). Then you have two options:

If the range is limited, go for the dictionary method (or even an array indexed by value).

If the size is limited, go for the list method.

If you choose the right method, you will probably achieve roughly constant time and space for large samples.
EDIT:
Set

It is a hash table, implemented very similarly to the Python dict, with some optimizations that take advantage of the fact that the values are always null (in a set, we only care about the keys). Set operations do require iterating over at least one of the operand tables (both, in the case of union). Iteration isn't any cheaper than for any other collection (O(n)), but membership testing is O(1) on average.