Algorithm to find the maxima of a given set of items - python

I have written a program in Python that maintains the set of maximum values entered into a database. When an item is incomparable to the current maxima, it is added to the maxima.
Currently I am performing a linear search through the entire database. The problem is that the worst-case runtime is O(n^2). I was wondering whether there can be a better implementation for this algorithm.
maxima = []
for item in items:
    should_insert = 1
    for val in maxima:
        comp = self.test(item, val)
        if comp == 1:
            should_insert = 0
            break
        elif comp == -1:
            maxima.remove(val)
    if should_insert == 1:
        maxima.append(item)
return maxima

In general there is no way to improve on this.
However there are usually many linear extensions of your partial order that turn your partial order into a total order. (See http://en.wikipedia.org/wiki/Linear_extension for what I mean by that.) Let's suppose that you can find several which, between them, have the property that two elements are comparable in the original order if and only if they compare the same way on each. Now what you can do is take your set, do a heapsort using the first ordering until you find the first element not comparable with your max. (See http://en.wikipedia.org/wiki/Heapsort for that algorithm, which in Python is available from https://docs.python.org/2/library/heapq.html.) Take that set, switch to the second ordering, and repeat. Continue until you've used all orderings.
If you have n elements, and k such orderings, then this algorithm's worst case running time is O(k * n * log(n)). And frequently it will be much better - if m is the size of the group that you pull out on the first step, the running time is O(n + k * m * log(n)).
Your ability to use this approach will, unfortunately, depend on whether you can find several total extensions of your partial ordering with this property. But in many cases you can. For instance, in one ordering you might break ties on number of bathrooms ascending, and in the next on number of bathrooms descending. And so on.

It's not entirely clear what you mean by "incomparable" values. If you mean equal values, then you probably want a simple variation on the normal max function, allowing it to return multiple equal values:
def find_maxima_if_incomparable_means_equal(self, items):
    it = iter(items)
    maxima = [next(it)]  # TODO: change the exception type raised here if items is empty
    for item in it:
        comp = self.test(item, maxima[0])
        if comp == 0:
            maxima.append(item)
        elif comp < 0:
            maxima = [item]
    return maxima
On the other hand, if you really mean it when you say that some cannot be compared (i.e. that comparing them has no meaning), the situation is more complicated. You want to find a "maxima" subset of the values such that each item in the maxima set is either greater than or incomparable to every other item in the original set. If your set was [1, 2, 3, "a", "b", "c"], you'd expect the maxima to be [3, "c"], since integers and strings cannot be compared to each other (at least, not in Python 3).
There's no way to avoid the potential of O(N^2) running time in the general case. That's because if none of your items can be compared to any of the others, the maxima set is going to end up being the same as the whole set, and you'll have to have tested every item against every other item to be sure they're really incomparable.
In fact, in the most general case where there's no requirement of a total ordering among any of the values (e.g. a < b < c does not imply a < c), you probably have to compare every item to every other item always. Here's a function that does exactly that:
import itertools

def find_maxima_no_total_ordering(self, items):
    non_maximal = set()
    for a, b in itertools.combinations(items, 2):
        comp = self.test(a, b)
        if comp > 0:
            non_maximal.add(a)
        elif comp < 0:
            non_maximal.add(b)
    return [x for x in items if x not in non_maximal]
Note that the maxima returned by this function may be empty, if the comparisons are bizarre enough that there are cycles (e.g. A < B, B < C and C < A are all true).
If your specific situation is more limited, you may have better options. If your set of items is the union of several totally ordered groups (such that A < B < C implies A < C, and A incomparable-to B with B < C implies A incomparable-to C), and there's no easy way to separate out the incomparable groups, you can use an algorithm similar to what your current code attempts. It runs in O(M*N), where N is the number of items and M is the number of totally ordered groups. This is still O(N^2) in the worst case (N groups), but somewhat better if the items end up belonging to just a few groups. If all items are comparable to each other, it's O(N) (and the maxima will contain just a single value). Here's an improved version of your code:
def find_maxima_with_total_orderings(self, items):
    maxima = set()  # use a set for fast removal
    for item in items:
        for val in maxima:
            comp = self.test(item, val)
            if comp == 1:
                break
            elif comp == -1:
                maxima.remove(val)
                maxima.add(item)
                break
        else:  # the else clause runs if there was no break in the loop
            maxima.add(item)
    return maxima  # you may want to turn this into a list again before returning it
You can do even better if the group an item belongs to can be determined easily (for instance, by checking the item's type). You can first subdivide the items into their groups, then find the maximum of each totally ordered group. Here's code that's O(N) for all cases, assuming that there's an O(1) running time method self.group that returns some hashable value so that if self.group(A) == self.group(B) then self.test(A, B) != 0:
from collections import defaultdict

def _max(self, comparable_items):  # a helper method: find the max using self.test rather than >
    it = iter(comparable_items)
    max_item = next(it)
    for item in it:
        if self.test(item, max_item) < 0:
            max_item = item
    return max_item

def find_maxima_with_groups(self, items):
    groups = defaultdict(list)
    for item in items:
        groups[self.group(item)].append(item)
    return [self._max(g) for g in groups.values()]
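As an illustration only (the class name, the type-based group(), and the numeric test() below are hypothetical, not part of the answer above), here is one way those two methods might be wired into a small class and exercised:
from collections import defaultdict

# Hypothetical wrapper: group() buckets items by type, and test() follows the
# question's convention (-1 if the first argument is greater, 1 if smaller, 0 if equal).
class MaximaFinder:
    def test(self, a, b):
        return (a < b) - (a > b)

    def group(self, item):
        return type(item)

    def _max(self, comparable_items):
        it = iter(comparable_items)
        max_item = next(it)
        for item in it:
            if self.test(item, max_item) < 0:
                max_item = item
        return max_item

    def find_maxima_with_groups(self, items):
        groups = defaultdict(list)
        for item in items:
            groups[self.group(item)].append(item)
        return [self._max(g) for g in groups.values()]

print(MaximaFinder().find_maxima_with_groups([1, 2, 3, "a", "b", "c"]))  # [3, 'c']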

Related

Trying to find sets of numbers with all distinct sums; help optimizing algorithm?

I was recently trying to write an algorithm to solve a math problem I came up with (long story how I encountered it): basically, I wanted to come up with sets of P distinct integers such that given a number, there is at most one way of selecting G numbers from the set (repetitions allowed) which sum to that number (or put another way, there are not two distinct sets of G integers from the set with the same sum, called a "collision"). For example, with P, G = 3, 3, the set (10, 1, 0) would work, but (2, 1, 0) wouldn't, since 1+1+1=2+1+0.
I came up with an algorithm in Python that can find and generate these sets, but when I tried it, it runs extremely slowly; I'm pretty sure there is a much more optimized way to do this, but I'm not sure how. The current code is also a bit messy because parts were added organically as I figured out what I needed.
The algorithm starts with these two functions:
import numpy as np

def rec_gen_list(leng, index, nums, func):
    if index == leng - 1:  # list full
        func(nums)
    else:
        nextMax = nums[index - 1]
        for nextNum in range(nextMax)[::-1]:  # nextMax-1 down to 0
            nums[index] = nextNum
            rec_gen_list(leng, index + 1, nums, func)

def gen_all_lists(leng, first, func):
    nums = np.zeros(leng, dtype='int')
    nums[0] = first
    rec_gen_list(leng, 1, nums, func)
Basically, this code generates all possible lists of distinct integers (with maximum "first" and minimum 0) and applies some function to them. rec_gen_list is the recursive part; given a partial list and an index, it tries every possible next number less than the previous one and recurses with it. Once it gets to the last index (with the list otherwise full), it applies the given function to the completed list. Note that I stop before the last entry, so the list always ends with 0; I enforce that because any list that doesn't contain 0 can be turned into one that does by subtracting its smallest number from every element, so forcing a trailing 0 avoids duplicates and makes other things I'm planning to do more convenient.
gen_all_lists is the wrapper around the recursive function; it sets up the array and the first step of the process and gets it started. For example, you could display all lists of 4 distinct numbers between 7 and 0 by calling gen_all_lists(4, 7, print). The function argument is there so that, once the lists are generated, I can test them for collisions before displaying them.
However, after coming up with these, I had to modify them to fit the rest of the algorithm. First off, I needed to keep track of whether the algorithm had found any lists that worked; this is handled by the foundOne and foundNew variables in the updated versions. This could probably be done with a global variable, but I don't think it contributes significantly to the slowdown.
In addition, I realized that I could use backtracking to significantly optimize this: if the first 3 numbers out of a long list are something like (100, 99, 98...), that already causes a collision, so I can skip checking all the lists generated from that. This is handled by the G variable (described before) and the test_no_colls function (which tests if a list has any collisions for a certain value of G); each time I make a new sublist, I check it for collisions, and skip the recursive call if I find any.
This is the result of these modifications, used in the current algorithm:
import numpy as np

def rec_test_list(leng, index, nums, G, func, foundOne):
    if index == leng - 1:  # list full
        foundNew = func(nums)
        foundOne = foundOne or foundNew
    else:
        nextMax = nums[index - 1]
        for nextNum in range(nextMax)[::-1]:  # nextMax-1 down to 0
            nums[index] = nextNum
            # If there's already a collision, don't bother going down this tree.
            if test_no_colls(nums[:index + 1], G):
                foundNew = rec_test_list(leng, index + 1, nums, G, func, foundOne)
                foundOne = foundOne or foundNew
    return foundOne

def test_all_lists(leng, first, G, func):
    nums = np.zeros(leng, dtype='int')
    nums[0] = first
    return rec_test_list(leng, 1, nums, G, func, False)
For the next two functions: test_no_colls takes a list of numbers and a number G, and determines whether there are any "collisions" (two distinct sets of G numbers from the list, repetition allowed, that add up to the same total), returning True if there are none. It starts by making a set that will hold the possible scores, then generates every possible distinct set of G indices into the list (repetition allowed) and finds their totals. Each total is checked against the set; if it is already present, two combinations have the same total.
The combinations are generated with another algorithm I came up with; this probably could be done the same way as generating the initial lists, but I was a bit confused about the variable scope of the set, so I found a non-recursive way to do it. This may be something to optimize.
The second function is just a wrapper for test_no_colls, printing the input array if it passes; this is used in the test_all_lists later on.
def test_no_colls(nums, G):
    possiblePoints = set()  # Set of possible scores.
    ranks = np.zeros(G, dtype='int')
    ranks[0] = len(nums) - 1  # Lowest possible rank.
    curr_ind = 0
    while True:  # Repeat until break.
        if ranks[curr_ind] >= 0:
            if curr_ind < G - 1:  # Copy over to make the start of the rest.
                copy = ranks[curr_ind]
                curr_ind += 1
                ranks[curr_ind] = copy
            else:  # We're at the end, so we have a complete list; test it, then start decrementing.
                # First, get the score for these rankings and test whether it collides with a previous score.
                total_score = 0
                for rank in ranks:
                    total_score += nums[rank]
                if total_score in possiblePoints:  # Collision found.
                    return False
                # Otherwise, add the new score to the set.
                possiblePoints.add(total_score)
                # Then backtrack and continue.
                ranks[curr_ind] -= 1
        else:
            # If the current value is less than 0, we've exhausted the possibilities for the rest
            # of the list, and need to backtrack if possible and start with the next lowest number.
            curr_ind -= 1
            if curr_ind < 0:  # Backtracked from the start, so we're done.
                break
            else:
                ranks[curr_ind] -= 1  # Start with the next lowest number.
    # If we broke out of the loop before returning, no collisions were found.
    return True

def show_if_no_colls(nums, games):
    if test_no_colls(nums, games):
        print(nums)
        return True
    return False
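For comparison only (this is not part of the original code): the index-combination walk above can also be expressed with the standard library's itertools.combinations_with_replacement. A minimal sketch of an equivalent collision check, assuming the entries of nums are distinct:
from itertools import combinations_with_replacement

def test_no_colls_itertools(nums, G):
    # Enumerate every multiset of G values from nums and check that all their sums are distinct.
    seen = set()
    for combo in combinations_with_replacement(nums, G):
        total = sum(combo)
        if total in seen:
            return False
        seen.add(total)
    return True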
These are the final functions that wrap everything up. find_good_lists wraps test_all_lists more conveniently; it finds all lists of length P, with values ranging from 0 to maxPts, which have no collisions for a certain G. find_lowest_score then uses this to find the smallest possible maximum value of a list that works for a certain P and G (for example, find_lowest_score(6, 3) finds two possible lists with max 45, [45 43 34 19 3 0] and [45 42 26 11 2 0], and nothing whose maximum is below 45); it also prints some timing data about how long each iteration took.
from time import perf_counter

def find_good_lists(maxPts, P, G):
    return test_all_lists(P, maxPts, G, lambda nums: show_if_no_colls(nums, G))

def find_lowest_score(P, G):
    maxPts = P - 1  # The minimum possible to even generate a scoring table.
    foundSet = False
    while not foundSet:
        start = perf_counter()
        foundSet = find_good_lists(maxPts, P, G)
        end = perf_counter()
        print("Looked for {}, took {:.5f} s".format(maxPts, end - start))
        maxPts += 1
So, this algorithm does seem to work, but it runs very slowly; when trying to run find_lowest_score(7, 3), for example, it starts taking minutes per iteration once maxPts gets into the 70s or so, even on Google Colab. Does anyone have suggestions for optimizing this algorithm to improve its runtime and time complexity, or better ways to solve the problem? I am interested in exploring this further (such as filtering the generated lists for other qualities), but am concerned about the time it would take with this algorithm.

most efficient way to iterate over a large array looking for a missing element in Python

I was trying an online test. The test asked me to write a function that, given a list of up to 100000 integers in the range 1 to 100000, finds the first missing integer.
For example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows:
def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i
The feedback I received is that the code is not efficient at handling big lists.
I am quite new and could not find an answer; how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python's sorting is, you could also do something like:
def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any contiguous range of numbers, the sum is given by Gauss's formula (this assumes nums is sorted, so nums[0] and nums[-1] are its smallest and largest values):
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[0]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This computation is O(n), which is as efficient as you can get. expected is an O(1) computation, while actual has to actually add up the elements.
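Putting those pieces together, a minimal sketch (assuming nums is sorted ascending with exactly one value missing from the run; the function name is just for illustration):
def find_missing_gauss(nums):
    # Sum of the full run nums[0]..nums[-1], minus the sum actually present,
    # leaves exactly the single missing value.
    expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
    actual = sum(nums)
    return expected - actual

print(find_missing_gauss([1, 2, 4, 5]))  # 3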
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break, if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
You can use Counter to count every occurrence in your list. The smallest number that does not occur will be your output. For example:
from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # every value that appears in the list
    main_list = list(range(1, 100001))  # the values from 1 to 100k
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output

optimizing code running time [difference between the codes below]

These are two code snippets; can anyone tell me why the second one takes more time to run?
#1
ar = [9547948, 8558390, 9999933, 5148263, 5764559, 906438, 9296778, 1156268]
count = 0
big = max(ar)
for i in range(len(ar)):
    if ar[i] == big:
        count += 1
print(count)

#2
ar = [9547948, 8558390, 9999933, 5148263, 5764559, 906438, 9296778, 1156268]
list = [i for i in ar if i == max(ar)]
return len(list)
In the list comprehension (the second one), the if clause is evaluated for each candidate item, so max() is evaluated each time.
In the first one, the maximum is evaluated once, before the loop starts. You could probably get a similar performance from the list comprehension by pre-evaluating the maximum in the same way:
maxiest = max(ar)
list = [i for i in ar if i == maxiest]
Additionally, you're not creating a new list in the first one, rather you're just counting the items that match the biggest one. That may also have an impact but you'd have to do some measurements to be certain.
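If you do want to measure, here is a rough sketch using the standard library's timeit (the helper names are only for illustration, and the numbers will vary with your data and machine):
import timeit

ar = [9547948, 8558390, 9999933, 5148263, 5764559, 906438, 9296778, 1156268]

def count_with_loop(ar):
    big = max(ar)  # maximum computed once
    count = 0
    for x in ar:
        if x == big:
            count += 1
    return count

def count_with_comprehension(ar):
    return len([i for i in ar if i == max(ar)])  # max(ar) re-evaluated for every item

print(timeit.timeit(lambda: count_with_loop(ar), number=100_000))
print(timeit.timeit(lambda: count_with_comprehension(ar), number=100_000))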
Of course, if you just want to know what the difference between those two algorithms is, that hopefully answers your question. However, you should be aware that max itself will generally make a pass over the data, and then your search will do that again. There is a way to do it with only one pass, something like:
def countLargest(collection):
    # Just return zero for an empty list.
    if len(collection) == 0:
        return 0
    # Start with the first item as the largest; its count is picked up in the loop below.
    count = 0
    largest = collection[0]
    # Process every item.
    for item in collection:
        if item > largest:
            # Larger: replace, with count one.
            largest = item
            count = 1
        elif item == largest:
            # Equal: increase the count.
            count += 1
    return count
Just keep in mind you should check if that's faster, based on likely data sets (the optimisation mantra is "measure, don't guess").
And, to be honest, it won't make much difference until either your data sets get very large or you need to do it many, many times per second. It certainly won't make any real difference for your eight-element collection. Sometimes, it's better to optimise for readability rather than performance.

How to find number of ways that the integers 1,2,3 can add up to n?

Given a set of integers 1,2, and 3, find the number of ways that these can add up to n. (The order matters, i.e. say n is 5. 1+2+1+1 and 2+1+1+1 are two distinct solutions)
My solution involves splitting n into a list of 1s so if n = 5, A = [1,1,1,1,1]. And I will generate more sublists recursively from each list by adding adjacent numbers. So A will generate 4 more lists: [2,1,1,1], [1,2,1,1], [1,1,2,1],[1,1,1,2], and each of these lists will generate further sublists until it reaches a terminating case like [3,2] or [2,3]
Here is my proposed solution (in Python)
ways = []

def check_terminating(A, n):
    # check for terminating case
    for i in range(len(A) - 1):
        if A[i] + A[i+1] <= 3:
            return False  # means we can still combine
    return True

def count_ways(n, A=[]):
    if A in ways:
        # already computed, so don't compute again
        return True
    if A not in ways:  # check for duplicates
        ways.append(A)  # global ways
    if check_terminating(A, n):
        return True  # end of the tree
    for i in range(len(A) - 1):
        # for each index i, combine with the next element and form a new list
        total = A[i] + A[i+1]
        print(total)
        if total <= 3:
            # form new list and compute
            newA = A[:i] + [total] + A[i+2:]
            count_ways(A, newA)  # recursive call

# main
n = 5
A = [1 for _ in range(n)]
count_ways(5, A)
print("No. of ways for n = {} is {}".format(n, len(ways)))
May I know if I'm on the right track, and if so, is there any way to make this code more efficient?
Please note that this is not a coin change problem. In coin change, order of occurrence is not important. In my problem, 1+2+1+1 is different from 1+1+1+2 but in coin change, both are same. Please don't post coin change solutions for this answer.
Edit: My code is working but I would like to know if there are better solutions. Thank you for all your help :)
The recurrence relation is F(n+3) = F(n+2) + F(n+1) + F(n), with F(0) = 1 and F(-1) = F(-2) = 0. These are the tribonacci numbers (a variant of the Fibonacci numbers).
It's possible to write an easy O(n) solution:
def count_ways(n):
    a, b, c = 1, 0, 0
    for _ in range(n):
        a, b, c = a + b + c, a, b
    return a
It's harder, but possible to compute the result in relatively few arithmetic operations:
def count_ways(n):
    A = 3**(n+3)
    P = A**3 - A**2 - A - 1
    return pow(A, n+3, P) % A

for i in range(20):
    print(i, count_ways(i))
The idea that you describe sounds right. It is easy to write a recursive function that produces the correct answer... slowly.
You can then make it faster by memoizing the answer. Just keep a dictionary of answers that you've already calculated. In your recursive function look at whether you have a precalculated answer. If so, return it. If not, calculate it, save that answer in the dictionary, then return the answer.
That version should run quickly.
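For instance, a minimal sketch of that idea, keeping a dictionary of already-computed answers (the names here are just for illustration):
cache = {}

def count_ways(n):
    # Ordered ways to write n as a sum of 1s, 2s and 3s.
    if n < 0:
        return 0
    if n == 0:
        return 1
    if n not in cache:
        cache[n] = count_ways(n - 1) + count_ways(n - 2) + count_ways(n - 3)
    return cache[n]

print(count_ways(5))  # 13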
An O(n) method is possible:
def countways(n):
    A = [1, 1, 2]
    while len(A) <= n:
        A.append(A[-1] + A[-2] + A[-3])
    return A[n]
The idea is that we can work out how many ways there are of making a sequence that sums to n by considering each choice (1, 2, 3) for the size of the last part.
e.g. to count choices for (1,1,1,1) consider:
choices for (1,1,1) followed by a 1
choices for (1,1) followed by a 2
choices for (1) followed by a 3
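For example, with n = 4 this gives countways(4) = countways(3) + countways(2) + countways(1) = 4 + 2 + 1 = 7.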
If you need the results (instead of just the count) you can adapt this approach as follows:
cache = {}

def countwaysb(n):
    if n < 0:
        return []
    if n == 0:
        return [[]]
    if n in cache:
        return cache[n]
    A = []
    for last in range(1, 4):
        for B in countwaysb(n - last):
            A.append(B + [last])
    cache[n] = A
    return A

Sorting for index values using a binary search function in python

I am being tasked with designing a python function that returns the index of a given item inside a given list. It is called binary_sort(l, item) where l is a list(unsorted or sorted), and item is the item you're looking for the index of.
Here's what I have so far, but it can only handle sorted lists
def binary_search(l, item, issorted=False):
    templist = list(l)
    templist.sort()
    if l == templist:
        issorted = True
    i = 0
    j = len(l) - 1
    if item in l:
        while i != j + 1:
            m = (i + j) // 2
            if l[m] < item:
                i = m + 1
            else:
                j = m - 1
        if 0 <= i < len(l) and l[i] == item:
            return i
        else:
            return None
How can I modify this so it will return the index of a value in an unsorted list if it is given an unsorted list and a value as parameters?
Binary search (you probably misnamed it; the algorithm above is not called "binary sort") requires ordered sequences to work.
It simply can't work on an unordered sequence, since it is the ordering that allows it to throw away at least half of the remaining items in each search step.
On the other hand, since you are allowed to use the list.sort method, that seems to be the way to go: calling l.sort() will sort your target list before starting the search operations, and the algorithm will work.
As a side note, avoid calling anything in a program just l: it may be a nice name for a list for someone with a background in mathematics who is used to doing things on paper, but on the screen, l is hard to distinguish from 1 and makes for poor source code reading. Good names for this case could be sequence, lst, or data. (list should be avoided as well, since it would override the Python built-in with the same name.)
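For illustration only (this is not the assignment's required interface), here is a minimal sketch of that suggestion using the standard library's bisect module on a sorted copy; note that the returned index refers to the sorted order, not the original list:
from bisect import bisect_left

def binary_search(data, item):
    data = sorted(data)           # sort first so binary search is valid
    i = bisect_left(data, item)   # leftmost index where item could be inserted
    if i < len(data) and data[i] == item:
        return i                  # index within the sorted order
    return None

print(binary_search([5, 2, 9, 2], 9))  # 3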
