Why this strange execution time - python

I am using this sorting algorithm:
def merge(L, R):
    """Merge 2 sorted lists provided as input
    into a single sorted list
    """
    M = []                   # Merged list, initially empty
    indL, indR = 0, 0        # start indices
    nL, nR = len(L), len(R)
    # Add one element to M per iteration until an entire sublist
    # has been added
    for i in range(nL + nR):
        if L[indL] < R[indR]:
            M.append(L[indL])
            indL = indL + 1
            if indL >= nL:
                M.extend(R[indR:])
                break
        else:
            M.append(R[indR])
            indR = indR + 1
            if indR >= nR:
                M.extend(L[indL:])
                break
    return M

def func(L_all):
    if len(L_all) == 1:
        return L_all[0]
    else:
        L_all[-1] = merge(L_all[-2], L_all.pop())
        return func(L_all)
merge() is the classical merge algorithm: given two lists of sorted numbers, it merges them into a single sorted list in linear time. An example of input is L_all = [[1,3],[2,4],[6,7]], a list of N sorted lists. The algorithm repeatedly applies merge to the last two elements of the list until only one element remains, which is then sorted. I have evaluated the execution time for different N, using a constant length for the lists inside the list, and obtained an unexpected pattern: the algorithm has linear complexity, but the execution time is constant, as you can see in the graph.
What could explain the fact that the execution time does not depend on N?

You haven't shown your timing code, but the problem is likely that your func mutates the list L_all so that it becomes a list of length 1 containing a single sorted list. After the first call func(L_all) in timeit, all subsequent calls don't change L_all at all; they just instantly return L_all[0]. Rather than 100000 calls to func(L_all) for each N in timeit, you are in effect doing just one real call for each N. Your timing code merely shows that return L_all[0] is O(1), which is hardly surprising.
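To see the mutation concretely, here is a minimal sketch using the func from the question:

L_all = [[1, 3], [2, 4], [6, 7]]
func(L_all)   # the first call does all the merging work...
print(L_all)  # ...and leaves L_all as [[1, 2, 3, 4, 6, 7]]
func(L_all)   # every subsequent call just returns L_all[0] immediately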
I would rewrite your code like this:
import functools, random, timeit

def func(L_all):
    return functools.reduce(merge, L_all)

for n in range(1, 10):
    L = [sorted([random.randint(1, 10) for _ in range(5)]) for _ in range(n)]
    print(timeit.timeit("func(L)", globals=globals()))
Then even for these smallish n you see a clear dependence on n:
0.16632885999999997
1.711736347
3.5761923199999996
6.058960655
8.796722217
15.112843280999996
17.723825805000004
22.803739991999997
26.114925834000005


Trying to find sets of numbers with all distinct sums; help optimizing algorithm?

I was recently trying to write an algorithm to solve a math problem I came up with (long story how I encountered it). Basically, I wanted to come up with sets of P distinct integers such that, given a number, there is at most one way of selecting G numbers from the set (repetitions allowed) which sum to that number; or, put another way, there are no two distinct selections of G integers from the set with the same sum (which I call a "collision"). For example, with P, G = 3, 3, the set (10, 1, 0) would work, but (2, 1, 0) wouldn't, since 1+1+1 = 2+1+0.
I came up with an algorithm in Python that can find and generate these sets, but it runs extremely slowly; I'm pretty sure there is a much more optimized way to do this, but I'm not sure how. The current code is also a bit messy, because parts were added organically as I figured out what I needed.
The algorithm starts with these two functions:
import numpy as np

def rec_gen_list(leng, index, nums, func):
    if index == leng - 1:  # list full
        func(nums)
    else:
        nextMax = nums[index - 1]
        for nextNum in range(nextMax)[::-1]:  # nextMax-1 down to 0
            nums[index] = nextNum
            rec_gen_list(leng, index + 1, nums, func)

def gen_all_lists(leng, first, func):
    nums = np.zeros(leng, dtype='int')
    nums[0] = first
    rec_gen_list(leng, 1, nums, func)
Basically, this code generates all possible lists of distinct integers (with maximum of "first" and minimum 0) and applies some function to them. rec_gen_list is the recursive part; given a partial list and an index, it tries every possible next number in the list less than the last one, and sends that to the next recursion. Once it gets to the last iteration (with the list being full), it applies the given function to the completed list. Note that I stop before the last entry in the list, so it always ends with 0; I enforce that because if you have a list that doesn't contain 0, you can subtract the smallest number from each one in the list to get one that does, so I force them to have 0 to avoid duplicates and make other things I'm planning to do more convenient.
gen_all_lists is the wrapper around the recursive function; it sets up the array and the first iteration of the process and gets it started. For example, you could display all lists of 4 distinct numbers between 7 and 0 by calling it as gen_all_lists(4, 7, print). The function argument is there so that once the lists are generated, I can test them for collisions before displaying them.
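For instance, a smaller call such as gen_all_lists(3, 4, print) walks the candidates in decreasing order (numpy prints the arrays without commas):

gen_all_lists(3, 4, print)
# [4 3 0]
# [4 2 0]
# [4 1 0]
# [4 0 0]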
However, after coming up with these, I had to modify them to fit with the rest of the algorithm. First off, I needed to keep track of if the algorithm had found any lists that worked; this is handled by the foundOne and foundNew variables in the updated versions. This probably could be done with a global variable, but I don't think it's a significant issue with the slowdown.
In addition, I realized that I could use backtracking to significantly optimize this: if the first 3 numbers out of a long list are something like (100, 99, 98...), that already causes a collision, so I can skip checking all the lists generated from that. This is handled by the G variable (described before) and the test_no_colls function (which tests if a list has any collisions for a certain value of G); each time I make a new sublist, I check it for collisions, and skip the recursive call if I find any.
This is the result of these modifications, used in the current algorithm:
import numpy as np

def rec_test_list(leng, index, nums, G, func, foundOne):
    if index == leng - 1:  # list full
        foundNew = func(nums)
        foundOne = foundOne or foundNew
    else:
        nextMax = nums[index - 1]
        for nextNum in range(nextMax)[::-1]:  # nextMax-1 down to 0
            nums[index] = nextNum
            # If already a collision, don't bother going down this tree.
            if test_no_colls(nums[:index + 1], G):
                foundNew = rec_test_list(leng, index + 1, nums, G, func, foundOne)
                foundOne = foundOne or foundNew
    return foundOne

def test_all_lists(leng, first, G, func):
    nums = np.zeros(leng, dtype='int')
    nums[0] = first
    return rec_test_list(leng, 1, nums, G, func, False)
For the next two functions: test_no_colls takes a list of numbers and a number G, and determines if there are any "collisions" (two distinct sets of G numbers from the list that add to the same total), returning True if there are none. It starts by making an empty set to hold the sums seen so far, then generates every possible distinct set of G indices into the list (repetition allowed) and finds their totals. Each total is checked against the set; if it is already present, there are two combinations with the same total.
The combinations are generated with another algorithm I came up with; this probably could be done the same way as generating the initial lists, but I was a bit confused about the variable scope of the set, so I found a non-recursive way to do it. This may be something to optimize.
The second function is just a wrapper for test_no_colls, printing the input array if it passes; it is used with test_all_lists later on.
def test_no_colls(nums, G):
    possiblePoints = set()  # Set of possible scores.
    ranks = np.zeros(G, dtype='int')
    ranks[0] = len(nums) - 1  # Lowest possible rank.
    curr_ind = 0
    while True:  # Repeat until break.
        if ranks[curr_ind] >= 0:  # Copy over to make the start of the rest.
            if curr_ind < G - 1:
                copy = ranks[curr_ind]
                curr_ind += 1
                ranks[curr_ind] = copy
            else:
                # Start decrementing, since we're at the end. We also have a
                # complete list, so test it. First, get the score for these
                # rankings and test to see if it collides with a previous score.
                total_score = 0
                for rank in ranks:
                    total_score += nums[rank]
                if total_score in possiblePoints:  # Collision found.
                    return False
                # Otherwise, add the new score to the set.
                possiblePoints.add(total_score)
                # Then backtrack and continue.
                ranks[curr_ind] -= 1
        else:
            # If the current value is less than 0, we've exhausted the
            # possibilities for the rest of the list, and need to backtrack
            # if possible and start with the next lowest number.
            curr_ind -= 1
            if curr_ind < 0:  # Backtracked from the start, so we're done.
                break
            else:
                ranks[curr_ind] -= 1  # Start with the next lowest number.
    # If we broke out of the loop before returning, no collisions were found.
    return True

def show_if_no_colls(nums, games):
    if test_no_colls(nums, games):
        print(nums)
        return True
    return False
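For reference, the same check can be written much more compactly with itertools.combinations_with_replacement, which enumerates the same multisets of G positions that the manual ranks loop does. This is just a sketch for comparison, not the code used below:

from itertools import combinations_with_replacement

def test_no_colls_itertools(nums, G):
    # Every multiset of G entries drawn from nums must have a distinct sum.
    seen = set()
    for combo in combinations_with_replacement(nums, G):
        total = sum(combo)
        if total in seen:  # Collision found.
            return False
        seen.add(total)
    return True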
These are the final functions that wrap everything up. find_good_lists wraps test_all_lists more conveniently: it finds all lists of length P, with values ranging from 0 to maxPts, which have no collisions for a given G. find_lowest_score then uses this to find the smallest possible maximum value of a list that works for a given P and G (for example, find_lowest_score(6, 3) finds two possible lists with max 45, [45 43 34 19 3 0] and [45 42 26 11 2 0], and nothing whose maximum is below 45); it also prints some timing data about how long each iteration took.
def find_good_lists(maxPts, P, G):
    return test_all_lists(P, maxPts, G, lambda nums: show_if_no_colls(nums, G))

from time import perf_counter

def find_lowest_score(P, G):
    maxPts = P - 1  # The minimum possible to even generate a scoring table.
    foundSet = False
    while not foundSet:
        start = perf_counter()
        foundSet = find_good_lists(maxPts, P, G)
        end = perf_counter()
        print("Looked for {}, took {:.5f} s".format(maxPts, end - start))
        maxPts += 1
So, this algorithm does seem to work, but it runs very slowly; when running find_lowest_score(7, 3), for example, it starts taking minutes per iteration once maxPts reaches the 70s or so, even on Google Colab. Does anyone have suggestions for optimizing this algorithm to improve its runtime and time complexity, or better ways to solve the problem? I am interested in exploring this further (such as filtering the generated lists for other qualities), but am concerned about the time it would take with this algorithm.

Why does this function have exponential complexity big O instead of quadratic?

The following function has been given:
def genSubsets(L):
    res = []
    if len(L) == 0:
        return [[]]
    smaller = genSubsets(L[:-1])
    extra = L[-1:]
    new = []
    for small in smaller:
        new.append(small + extra)
    return smaller + new
From my understanding, making a copy of a list is O(n), and looping is O(n) as well, which should make this O(n^2). However, it seems that my logic is flawed and the answer is O(2^n). Why?
From my understanding, making a copy of a list is O(n)
You are correct that making a copy of a list of n items takes time O(n). And in this case, each of the lists that's being copied is a subset of the original list, which has length n, so each list copied does take time O(n).
then looping is O(n) as well
Looping over a list of length n takes time O(n). However, in this case, the lists that you're looping over do not have n elements in them. There are 2^n subsets of a set of size n, so at the top-level recursive call, when you recursively generate all subsets of L[:-1], you will end up with a list of 2^(n-1) items. Looping over that list takes time O(2^n).
More generally, when looking at a loop or a list, it's important to ask "how many times does this loop run?" or "how many elements are in this list?"
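A quick way to see this growth is to count the subsets directly; the length of the returned list doubles with every extra element:

for n in range(6):
    print(n, len(genSubsets(list(range(n)))))
# 0 1
# 1 2
# 2 4
# 3 8
# 4 16
# 5 32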

Python return list of duplicates in list in order

How do you quickly return a list of the duplicates in a list, in the order that they appear? For example, duplicates([2,3,5,5,5,6,6,3]) results in [5,6,3], meaning that a repeated element is added to the resulting duplicates list only when its second occurrence appears. So far I have the code below, but it's not running fast enough to pass large test cases. Is there any faster option without imports?
def duplicates(L):
    first = set()
    second = []
    for i in L:
        if i in first and i not in second:
            second.append(i)
            continue
        if i not in first and i not in second:
            first.add(i)
            continue
    return second
You're doing well in using a set for first, since it has O(1) time complexity for the in operation.
On the other hand, you're using a list for second, which turns this function into O(N^2), and in the worst case you're going through the second list twice.
So my suggestion is to use a dictionary to store the numbers you have found.
For example:
def duplicates(L):
    first = dict()
    second = []
    for i in L:
        if i not in first:   # First time the number appears
            first[i] = False
        elif not first[i]:   # Number not on the second list yet
            second.append(i)
            first[i] = True
    return second
Note that I used a Boolean as the dictionary value to represent whether the number appears more than once (or, equivalently, whether it was already added to the second list).
This solution has O(N) time complexity, which means it is much faster.
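For example, with the list from the question:

print(duplicates([2, 3, 5, 5, 5, 6, 6, 3]))  # [5, 6, 3]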
Revision of posted code
Use a dictionary rather than a list for 'second' in OP code.
Dicts have O(1) look-up times, rather than the O(n) look-up of a list
Dicts keep track of insertion order for Python 3.6+ (otherwise use OrderedDict)
Code
def duplicatesnew(L):
    first = set()
    second = {}  # Change to dictionary
    for i in L:
        if i in first and i not in second:
            second[i] = None
            continue
        if i not in first and i not in second:
            first.add(i)
            continue
    return list(second.keys())  # report list of keys

lst = [2, 3, 5, 5, 5, 6, 6, 3]
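As a quick sanity check, the revised function gives the same result as the original on this list:

print(duplicatesnew(lst))  # [5, 6, 3]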
Performance
Summary
Comparable on short lists
2X faster on longer lists
Tests
Use lists of length N
For N = 6: use original list
N > 6 use:
lst = [random.randint(1, 10) for _ in range(N)]
N = 6
Original: 2.24 us
Revised: 2.74 us
1000 random numbers between 1 and 10
Original: 241 us
Revised: 146 us
N = 100,000
Original: 27.2 ms
Revised: 13.4 ms
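These timings are machine dependent; a comparison along these lines can be reproduced with a sketch like the following (list length and repeat count are arbitrary choices):

import random
import timeit

lst = [random.randint(1, 10) for _ in range(1000)]
print(timeit.timeit(lambda: duplicates(lst), number=1000))     # original
print(timeit.timeit(lambda: duplicatesnew(lst), number=1000))  # revised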

What is fastest way to determine numbers are within specific range of each other in Python?

I have a list of numbers as follows:
L = [ 1430185458, 1430185456, 1430185245, 1430185246, 1430185001 ]
I am trying to determine which numbers are within a range of "2" of each other. The list will be unsorted when I receive it.
If there are numbers within a range of 2 of each other, I have to return "1" at the exact same position the number was received in.
I was able to achieve the desired result, but the code runs very slowly. My approach involves sorting the list, then iterating over it with two pointers and comparing elements successively. I will have millions of records coming in as separate lists.
Just trying to see what the best possible approach to this problem is.
Edit - Apologies, I was away for a while. The list can have any number of elements in it, ranging from 1 to n. The idea is to return either 0 or 1 at the exact same position the number was received in. I cannot post the actual code I implemented, but here is pseudocode (a sketch of these steps follows the list):
a. Create a new list as a list of lists, with the second part set to 0 for each element. We assume that there are no numbers within a range of 2 of each other.
[[1430185458,0], [1430185456,0], [1430185245,0], [1430185246,0], [1430185001,0]]
b. Sort the original list.
c. Compare the first element to the second, the second to the third, and so on until the end of the list is reached; whenever the difference is less than or equal to 2, update the corresponding second elements from step a to 1.
[[1430185458,1], [1430185456,1], [1430185245,1], [1430185246,1], [1430185001,0]]
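A minimal sketch of steps a-c (not the poster's actual code; it returns just the 0/1 flags, in the original positions):

def flag_close(numbers):
    flags = [0] * len(numbers)  # step a: assume nothing is within range
    order = sorted(range(len(numbers)), key=lambda i: numbers[i])  # step b
    for a, b in zip(order, order[1:]):  # step c: compare sorted neighbours
        if numbers[b] - numbers[a] <= 2:
            flags[a] = flags[b] = 1
    return flags

print(flag_close([1430185458, 1430185456, 1430185245, 1430185246, 1430185001]))
# [1, 1, 1, 1, 0]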
The goal is to be fast, so that presumably means an O(N) algorithm. Building an NxN difference matrix is O(N^2), so that's not good at all. Sorting is O(N*log(N)), so that's out, too. Assuming average case O(1) behavior for dictionary insert and lookup, the following is an O(N) algorithm. It rips through a list of a million random integers in a couple of seconds.
def in_range(numbers):
    result = [0] * len(numbers)
    index = {}
    for idx, number in enumerate(numbers):
        for offset in range(-2, 3):
            match_idx = index.get(number + offset)
            if match_idx is not None:
                result[match_idx] = result[idx] = 1
        index[number] = idx
    return result
Update
I have to return "1" at exact same position number was received in.
The update to the question asks for a list of the form [[1,1],[2,1],[5,0]] given an input of [1,2,5]. I didn't do that. Instead, my code returns [1,1,0] given [1,2,5]. It's about 15% faster to produce that simple 0/1 list compared to the [[value,in_range],...] list. The desired list can easily be created using zip:
zip(numbers,in_range(numbers)) # Generator
list(zip(numbers,in_range(numbers))) # List of (value,in_range) tuples
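For example, with the list from the question:

numbers = [1430185458, 1430185456, 1430185245, 1430185246, 1430185001]
flags = in_range(numbers)
print(flags)                      # [1, 1, 1, 1, 0]
print(list(zip(numbers, flags)))  # [(1430185458, 1), ..., (1430185001, 0)]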
I think this does what you need (process() modifies the list L). Very likely it's still optimizable, though:
def process(L):
    s = [(v, k) for k, v in enumerate(L)]
    s.sort()
    j = 0
    for i, v_k in enumerate(s):
        v = v_k[0]
        while j < i and v - s[j][0] > 2:
            j += 1
        while j < i:
            L[s[j][1]] = 1
            L[s[i][1]] = 1
            j += 1

Python very slow random sampling over big list

I'm experiencing very slow performance with the algorithm below.
I have a very large (1,000,000+) list containing long strings.
i.e.: id_list = ['MYSUPERLARGEID:1123:123123', 'MYSUPERLARGEID:1123:134534389', 'MYSUPERLARGEID:1123:12763']...
num_reads is the maximum number of elements to randomly choose from this list.
The idea is to randomly choose one of the string ids in id_list until num_reads is reached, and to add (I say add, and not append, because I don't care about the order of random_id_list) them to random_id_list, which is empty at the beginning.
I can't repeat the same id, so I remove it from the original list after it is randomly chosen. I suspect this is what makes the script run really slowly, but maybe I'm wrong and another part of this loop is responsible for the slow behavior.
for x in xrange(0, num_reads):
    id_index, id_string = random.choice(list(enumerate(id_list)))
    random_id_list.append(id_string)
    del read_id_list[id_index]
Use random.sample() to produce a sample of N elements with no repeats:
random_id_list = random.sample(read_id_list, num_reads)
Removing elements from the middle of a large list is indeed slow, as everything beyond that index has to be moved up a step.
This does not, of course, remove elements from the original list anymore, so repeated random.sample() calls can still give you samples with elements that have been picked before. If you need to produce samples repeatedly until your list is exhausted, then shuffle once and from there on out take consecutive slices of k elements from the shuffled list:
def random_samples(k):
    random.shuffle(id_list)
    for i in range(0, len(id_list), k):
        yield id_list[i : i + k]
then use this to produce your samples; either in a loop or with next():
sample_gen = random_samples(num_reads)
random_id_list = next(sample_gen)
# some point later
another_random_id_list = next(sample_gen)
Because the list is shuffled entirely randomly, the slices produced this way are also all valid random samples.
The "hard" way, instead of just shuffling the list, is to evaluate each element of your list in order, and selecting the item with a probability that relies on both the number of items you still need to choose and the number of items left to choose from. This is useful if you don't have the entire list presented to you at once (a so-called on-line algorithm).
Let's say you need to select k of N items. That means each item has a k/N probability of being chosen, if you can consider all items at once. However, if you accept the first item, then you only need to select k-1 items from N-1 remaining items. If you reject it, you still need k items from N-1 remaining items. So the algorithm would look like
import random

N = len(id_list)
k = 10  # For example
choices = []
for i in id_list:
    if random.randint(1, N) <= k:
        choices.append(i)
        k -= 1
    N -= 1
Initially, the first item is chosen with the expected probability of k/N. As you go through your list, N steadily decreases, while k decreases as you actually accept items. Note that each item, overall, still has a p = k/N chance of being chosen. As an example, consider the second item in the list. Let p_i be the probability that you choose the i-th element in the list. p1 is obviously k/N, given the starting values of k and N. Now consider p2.
p2 = p1 * (k-1)/(N-1) + (1-p1) * k/(N-1)
   = (p1*k - p1 + k - k*p1) / (N-1)
   = (k - p1) / (N-1)
   = (k - k/N) / (N-1)
   = k/(N-1) - k/(N*(N-1))
   = (k*N - k) / (N*(N-1))
   = k/N
Similar (but longer) analysis holds for p3, p4, etc.
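For an empirical check of the k/N claim, here is a rough Monte Carlo sketch (the selection loop is the one above, just wrapped in a function; counts will fluctuate from run to run):

import random
from collections import Counter

def select(items, k):
    # Selection sampling: accept each item with probability k/N,
    # updating k and N as we go.
    N = len(items)
    chosen = []
    for x in items:
        if random.randint(1, N) <= k:
            chosen.append(x)
            k -= 1
        N -= 1
    return chosen

counts = Counter()
trials = 100_000
for _ in range(trials):
    counts.update(select(range(10), 3))
for item in sorted(counts):
    print(item, counts[item] / trials)  # each frequency should be close to 3/10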
