I am calling itertools in Python (see below). In this code, snp_dic is a dictionary with integer keys and sets as values. The goal is to find the minimum list of keys whose values union to set_union. (For those of you who are interested, this is equivalent to solving for a global optimum of the popular NP-hard graph-theory problem set cover!) The algorithm below works, but the goal here is optimization.
The most obvious optimization I see has to do with itertools. Say that for some length r there exists a combination of r sets in snp_dic whose union equals set_union. Basic probability dictates that if this combination exists and is distributed uniformly at random over the combinations, then on average I only expect to have to iterate over half of the combinations to find it, checking the set union at each iteration. itertools, however, returns all the possible combinations, taking twice as long as the expected time of checking as I go.
A logical solution would seem to be simply to implement itertools.combinations() locally. Based on the "equivalent" Python implementation of itertools.combinations() given in the Python docs, however, this is approximately twice as slow, because the real itertools.combinations is a C-level implementation rather than a Python-native one.
The question (finally) is then: how can I stream the results of itertools.combinations() one by one, so I can check set unions as I go, while still running in nearly the same time as the built-in (C-level) itertools.combinations()? In an answer I would appreciate it if you could include timings of your new method to show that it runs in a similar time to the built-in implementation. Any other optimizations are also appreciated.
import itertools
from functools import reduce  # reduce lives in functools on Python 3

def min_informative_helper(snp_dic, min_size, set_union):
    union = lambda set_iterable: reduce(lambda a, b: a | b, set_iterable)  # takes the union of sets
    for i in range(min_size, len(snp_dic) + 1):  # +1 so the full key set is also tried
        combinations = itertools.combinations(snp_dic, i)
        # Materialise every combination of keys as a sub-dictionary up front.
        combinations = [{k: snp_dic[k] for k in combination} for combination in combinations]
        for combination in combinations:
            comb_union = union(combination.values())
            if comb_union == set_union:
                return combination.keys()
itertools already returns lazy iterators for the things it produces. To stream the combinations, simply use
for combo in itertools.combinations(snp_dic, i):
    ... remainder of your logic
The combinations() iterator produces one new element each time you ask for it: one per loop iteration, so you can stop as soon as you find a covering combination.
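Applied to your helper, a minimal sketch of the streaming version could look like this (the name min_informative_helper_streaming and the min_size parameter are just illustrative; the logic mirrors your posted function but returns as soon as a covering combination is found):

import itertools
from functools import reduce  # reduce lives in functools on Python 3

def min_informative_helper_streaming(snp_dic, min_size, set_union):
    for r in range(min_size, len(snp_dic) + 1):
        # combinations() is lazy: each tuple of keys is produced only when
        # the loop asks for it, so we can stop as soon as we find a cover.
        for keys in itertools.combinations(snp_dic, r):
            comb_union = reduce(lambda a, b: a | b, (snp_dic[k] for k in keys))
            if comb_union == set_union:
                return list(keys)
    return None  # no combination of keys covers set_union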
I am currently going through the book Math Adventures with Python by Peter Farrell. I am simply trying to improve my math skills while learning Python in a fun way. So we made a factors function, as seen below:
def factors(num):
    factorList = []
    for i in range(1, num + 1):
        if num % i == 0:
            factorList.append(i)
    return factorList
Exercise 3-1 asks you to make a GCF (greatest common factor) function. All the answers here use built-in Python modules, recursion, or Euclid's algorithm. I have no clue what any of those mean, let alone how to try them on this assignment. I came up with the following solution using the above function:
def gcFactor(num1, num2):
    fnum1 = factors(num1)
    fnum2 = factors(num2)
    gcf = list(set(fnum1).intersection(fnum2))
    return max(gcf)

print(gcFactor(28, 21))
Is this the best way of doing it? Using the .intersection() function seems a little cheaty to me.
What I wanted to do instead is to use a loop to go through the list values in fnum1 and fnum2, compare them, and then return the value that matches (which would make it a common factor) and is greatest (which would be the GCF).
The idea behind your algorithm is sound, but there are a few problems:
In your original version, you used gcf[-1] to get the greatest factor, but that will not always work, since converting a set to a list does not guarantee that the elements will be in sorted order, even if they were sorted before converting to a set. Better to use max (you already changed that).
Using set.intersection is definitely not "cheating" but just making good use of what the language provides. It might be considered cheating to just use math.gcd, but not basic set or list functions.
Your algorithm is rather inefficient. I don't know the book, but I don't think you are actually meant to use the factors function to calculate the GCF; it was just an exercise to teach you things like loops and the modulo operator. Consider two very different numbers as inputs, say 23764372 and 6. You would calculate all the factors of 23764372 first, before testing the very few values that could actually be common factors. Instead of using factors directly, try to rewrite your gcFactor function to test which values up to the min of the two numbers are factors of both numbers.
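For example, a rough sketch of that rewrite could look like this (gcFactor2 is just an illustrative name, not something from the book):

def gcFactor2(num1, num2):
    gcf = 1
    # Only values up to the smaller of the two numbers can divide both.
    for i in range(1, min(num1, num2) + 1):
        if num1 % i == 0 and num2 % i == 0:
            gcf = i  # remember the largest common factor seen so far
    return gcf

print(gcFactor2(28, 21))  # 7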
Even then, your algorithm will not be very efficient. I would suggest reading up on Euclid's algorithm and trying to implement that next. If you are unsure whether you did it right, you can use your first function as a reference for testing, and to see the difference in performance.
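For reference, a minimal sketch of Euclid's algorithm looks something like this (shown only as something to compare your factors-based version against):

def gcd_euclid(num1, num2):
    # Repeatedly replace (num1, num2) with (num2, remainder) until the
    # remainder is zero; the last non-zero value is the GCF.
    while num2 != 0:
        num1, num2 = num2, num1 % num2
    return num1

print(gcd_euclid(28, 21))  # 7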
About your factors function itself: note that there is a symmetry: if i is a factor, so is num // i. If you use this, you do not have to test all the values up to num but only up to sqrt(num), which reduces the running time from O(n) to O(sqrt(n)), the same kind of square-root speed-up as going from O(n²) to O(n).
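A sketch of that symmetry trick, assuming num is a positive integer (factors_fast is just an illustrative name):

def factors_fast(num):
    factorList = []
    i = 1
    while i * i <= num:              # only test candidates up to sqrt(num)
        if num % i == 0:
            factorList.append(i)
            if i != num // i:        # avoid adding a square root twice
                factorList.append(num // i)
        i += 1
    return sorted(factorList)

print(factors_fast(28))  # [1, 2, 4, 7, 14, 28]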
Let d be a large (but still fitting into memory) Python dictionary where we do not know what the keys are. What is the most efficient way to get a key of d (where it does not matter which key you get), leaving d unchanged in both content and order (for newer versions of Python) once you are done? "Efficient" should mean something like: the memory used to perform the task is small compared to the size of the dictionary, and the speed is at least as fast as either of the methods below. This question is not about readability but about Python dictionary objects. For example, two methods are:
Method 1: use list
any_key = list(d)[0]
Method 2: use popitem and put the item back
any_key, y = d.popitem()
d[any_key] = y
So both methods essentially implement a peekkey() method. My basic timeit analysis shows that method 2 is much faster than method 1, and I assume that method 2 also uses a lot less memory (but I do not really know yet whether that is true). Is method 2 "best", or is there something better?
Extra brownie points if you get a fast and a readable method using only Python. Even more points for a C/Python method that accesses the dictionary object directly if that method is significantly faster than the best python method.
If you do not care about which key you get, and you don't mean "sample" in the random sense, then just grab the first key using next
key = next(iter(d.keys()))
which, for brevity, is the same as
key = next(iter(d))
Just to test performance, if I generate a dict with 1000 elements
d = {k:k for k in range(1000)}
then benchmarking these two methods shows that the next approach is about 96% faster (roughly 28x in this run):
>>> timeit.timeit('sample_key = list(d)[0]', setup='d = {k:k for k in range(1000)}')
5.3303698
>>> timeit.timeit('next(iter(d.keys()))', setup='d = {k:k for k in range(1000)}')
0.18915620000001354
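If you also want the popitem-and-restore method from the question in the comparison, the same kind of timeit call works; the numbers will vary by machine, so treat this only as a sketch to run locally:

import timeit

setup = 'd = {k: k for k in range(1000)}'

print(timeit.timeit('key = next(iter(d))', setup=setup))
print(timeit.timeit('key = list(d)[0]', setup=setup))
# popitem removes the last item, so it is re-inserted to leave d unchanged
print(timeit.timeit('key, value = d.popitem(); d[key] = value', setup=setup))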
I have these three solutions to a Leetcode problem and do not really understand the difference in time complexity here. Why is the last function twice as fast as the first one?
68 ms

def numJewelsInStones(J, S):
    count = 0
    for s in S:
        if s in J:
            count += 1
    return count

40 ms

def numJewelsInStones(J, S):
    return sum(s in J for s in S)

32 ms

def numJewelsInStones(J, S):
    return len([x for x in S if x in J])
Why is the last function twice as fast as the first one?
The analytical time complexity in terms of big-O notation is the same for all three, but it hides constant factors. That is, O(n) really means O(c*n) for some constant c, and c is ignored by convention when comparing time complexities.
Each of your functions has a different c. In particular:
explicit for loops are in general slower than generator expressions consumed by built-ins
sum over a generator is executed in C code (the summing part, i.e. adding the numbers)
len is a simple single-operation lookup on the list, which is done in constant time, whereas sum performs n add operations
Thus c(for) > c(sum) > c(len), where c(f) is the hypothetical constant overhead of function/statement f.
You could check my assumptions by disassembling each function with the dis module.
Other than that, your measurements are likely influenced by variation due to other processes running on your system. To remove these influences from your analysis, average the execution times over at least 1000 calls to each function (you may find that the difference in c is smaller than this variation, though I don't expect that).
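A sketch of that kind of check, timing each variant over many calls (the J and S values below are made up purely to have an input):

import timeit

setup = """
J = set('abc')
S = 'aabbccddeeff' * 100

def numJewelsInStones_loop(J, S):
    count = 0
    for s in S:
        if s in J:
            count += 1
    return count

def numJewelsInStones_sum(J, S):
    return sum(s in J for s in S)

def numJewelsInStones_len(J, S):
    return len([x for x in S if x in J])
"""

for name in ('loop', 'sum', 'len'):
    print(name, timeit.timeit('numJewelsInStones_%s(J, S)' % name,
                              setup=setup, number=1000))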
what is the time complexity of these functions?
Note that while all three functions share the same big-O time complexity, that complexity differs depending on the data type you use for J and S. If J and S are of type:
dict, the complexity of your functions will be in O(n)
set, the complexity of your functions will be in O(n)
list, the complexity of your functions will be in O(n*m), where n,m are the sizes of the J, S variables, respectively. Note if n ~ m this will effectively turn into O(n^2). In other words, don't use list.
Why is the data type important? Because Python's in operator is really just a proxy for the membership test implemented by the particular type. Specifically, dict and set membership testing works in O(1), that is, in constant time, while membership testing for a list works in O(n) time. Since in the list case there is a pass over every member of J for each member of S (or vice versa), the total time is in O(n*m). See Python's TimeComplexity wiki for details.
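For instance, if J arrives as a string or list, converting it to a set once before the membership tests keeps the total work at O(n + m) instead of O(n*m); a small illustrative rewrite of the sum-based version:

def numJewelsInStones(J, S):
    jewels = set(J)                     # one O(n) conversion up front
    return sum(s in jewels for s in S)  # each membership test is O(1) on average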
With time complexity, big-O notation describes how the running time of a solution grows as the input grows; in other words, how the two are related. If your solution is O(n), then as the input grows the time to complete grows linearly. More concretely, if a solution is O(n) and it takes 10 seconds when the data set has 100 elements, then it should take approximately 100 seconds when the data set has 1000 elements.
Your first solution is O(n); we know this because of the for loop, for s in S, which iterates through the entire data set once. The check s in J, assuming J is a set or a dictionary, will likely be constant time, O(1); the reasoning behind this is a bit beyond the scope of the question. As a result, the first solution overall is O(n), linear time.
The nuanced differences in time between the other solutions are very likely negligible if you run your tests on multiple data sets and average them out, accounting for startup time and other factors that impact the results. Additionally, big-O notation discards coefficients, so, for example, O(3n) ~= O(n).
You'll notice that all of the other solutions use the same concept: loop over the entire collection and check for existence in the set or dict. As a result, all of these solutions are O(n). The differences in time can be attributed to other processes running at the same time, to the fact that some of the built-ins used are implemented in pure C, and to differences caused by insufficient testing.
Well, the second function is faster than the first because it uses a generator expression instead of an explicit loop. The third function is faster than the second because the second has to sum up the generator's output element by element, while the third just builds a list and takes its length.
In Python 3.3, itertools.accumulate(), which normally repeatedly applies an addition operation to the supplied iterable, can now take a function argument as a parameter; this means it now overlaps with functools.reduce(). At a cursory glance, the main differences between the two now would seem to be:
accumulate() defaults to summing but doesn't let you supply an extra initial condition explicitly while reduce() doesn't default to any method but does let you supply an initial condition for use with 1/0-element sequences, and
accumulate() takes the iterable first while reduce() takes the function first.
Are there any other differences between the two? Or is this just a matter of behavior of two functions with initially distinct uses beginning to converge over time?
It seems that accumulate keeps the previous results, whereas reduce (which is known as fold in other languages) does not necessarily.
e.g. list(accumulate([1,2,3], operator.add)) would return [1,3,6] whereas a plain fold would return 6
Also (just for fun, don't do this) you can define accumulate in terms of reduce
from functools import reduce

def accumulate(xs, f):
    return reduce(lambda a, x: a + [f(a[-1], x)], xs[1:], [xs[0]])
You can see in the documentation what the difference is. reduce returns a single result, the sum, product, etc., of the sequence. accumulate returns an iterator over all the intermediate results. Basically, accumulate returns an iterator over the results of each step of the reduce operation.
itertools.accumulate
is like reduce but returns an iterator* instead of a single value. This iterator can give you all the intermediate step values. So basically reduce gives you the last element of what accumulate will give you.
*Like a generator, it can be iterated over only once.
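A quick way to see both behaviours side by side (using operator.add as in the example above):

import operator
from functools import reduce
from itertools import accumulate

data = [1, 2, 3]
print(list(accumulate(data, operator.add)))  # [1, 3, 6]  every intermediate result
print(reduce(operator.add, data))            # 6          only the final result
# reduce gives you the last element of what accumulate produces
print(list(accumulate(data, operator.add))[-1] == reduce(operator.add, data))  # True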
Let's say that I have a graph and want to see if b in N[a]. Which is the faster implementation, and why?
a, b = range(2)
N = [set([b]), set([a,b])]
OR
N = [[b], [a, b]]
This is obviously oversimplified, but imagine that the graph becomes really dense.
Membership testing in a set is vastly faster, especially for large sets. That is because the set uses a hash function to map to a bucket. Since Python implementations automatically resize that hash table, the speed can be constant (O(1)) no matter the size of the set (assuming the hash function is sufficiently good).
In contrast, to evaluate whether an object is a member of a list, Python has to compare every single member for equality, i.e. the test is O(n).
It all depends on what you're trying to accomplish. Using your example verbatim, it's faster to use lists, as you don't have to go through the overhead of creating the sets:
import timeit
def use_sets(a, b):
    return [set([b]), set([a, b])]

def use_lists(a, b):
    return [[b], [a, b]]
t=timeit.Timer("use_sets(a, b)", """from __main__ import use_sets
a, b = range(2)""")
print "use_sets()", t.timeit(number=1000000)
t=timeit.Timer("use_lists(a, b)", """from __main__ import use_lists
a, b = range(2)""")
print "use_lists()", t.timeit(number=1000000)
Produces:
use_sets() 1.57522511482
use_lists() 0.783344984055
However, for reasons already mentioned here, you benefit from using sets when you are searching large sets. It's impossible to tell from your example where that inflection point is for you and whether or not you'll see the benefit.
I suggest you test it both ways and go with whatever is faster for your specific use-case.
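A rough sketch of that kind of test for the lookups themselves (the container size of 100000 is arbitrary; adjust it toward your real graph density):

import timeit

setup = """
import random
n = 100000
as_set = set(range(n))
as_list = list(range(n))
target = random.randrange(n)
"""

print("set lookup: ", timeit.timeit('target in as_set', setup=setup, number=1000))
print("list lookup:", timeit.timeit('target in as_list', setup=setup, number=1000))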
A set (I mean a hash-based set, like a HashSet) is much faster than a list for looking up a value. A list has to go through its elements sequentially to find out whether the value exists. A hash-based set can jump directly to the right bucket and look up the value in almost constant time.