Is there some function which would return me the N highest elements from some list?
I.e. if max(l) returns the single highest element, something like max(l, count=10) would return a list of the 10 highest numbers (or fewer if l is smaller).
Or what would be an efficient, easy way to get these? (Other than the obvious canonical implementation; also, nothing that involves sorting the whole list first, because that would be inefficient compared to the canonical solution.)
heapq.nlargest:
>>> import heapq, random
>>> heapq.nlargest(3, (random.gauss(0, 1) for _ in xrange(100)))
[1.9730767232998481, 1.9326532289091407, 1.7762926716966254]
The function in the standard library that does this is heapq.nlargest.
Start with the first 10 from L, call that X. Note the minimum value of X.
Loop over L[i] for i over the rest of L.
If L[i] is greater than min(X), drop min(X) from X and insert L[i]. You may need to keep X as a sorted linked list and do an insertion. Update min(X).
At the end, you have the 10 largest values in X.
I suspect that will be O(kN) (where k is 10 here), since inserting into a sorted list of length k is linear in k. This might be what GSL uses, so if you can read some C code:
http://www.gnu.org/software/gsl/manual/html_node/Selecting-the-k-smallest-or-largest-elements.html
There is probably something in numpy that does this.
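Here is a rough Python sketch of the insertion approach described above (my own illustration using bisect.insort, not GSL's actual code):

import bisect

def k_largest(L, k=10):
    # Seed X with the first k elements, kept in sorted order.
    X = sorted(L[:k])
    for v in L[k:]:
        # Only act when v beats the current minimum of X, which is X[0].
        if v > X[0]:
            X.pop(0)             # drop the old minimum
            bisect.insort(X, v)  # insert v, keeping X sorted
    return X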
A fairly efficient solution is a variation of quicksort where recursion is limited to the right part of the pivot until the pivot's final position exceeds the number of elements required (with a few extra conditions to deal with border cases, of course).
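For illustration, a minimal quickselect-style sketch along those lines (my own, average O(n); not a tuned implementation):

import random

def top_k(arr, k):
    # Return the k largest elements (unordered) in average O(n) time.
    if k <= 0:
        return []
    if k >= len(arr):
        return list(arr)
    pivot = random.choice(arr)
    above = [x for x in arr if x > pivot]
    equal = [x for x in arr if x == pivot]
    if k <= len(above):
        return top_k(above, k)  # all winners are strictly above the pivot
    if k <= len(above) + len(equal):
        return above + equal[:k - len(above)]
    below = [x for x in arr if x < pivot]
    return above + equal + top_k(below, k - len(above) - len(equal))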
The standard library has heapq.nlargest, as pointed out by others here.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Pandas nlargest documentation
Related
I'm looking for a better, faster way to center a couple of lists. Right now I have the following:
import random
m = range(2000)
sm = sorted(random.sample(range(100000), 16000))
si = random.sample(range(16005), 16000)
# Centered array.
smm = []
print sm
print si
for i in m:
    if i in sm:
        smm.append(si[sm.index(i)])
    else:
        smm.append(None)
print m
print smm
This in effect creates a list (m) containing the range of positions to center against, another list (sm) against which m is centered, and a list of values (si) to append.
This sample runs fairly quickly, but when I run a larger task with many more variables, performance slows to a standstill.
Your main loop contains this infamous line:
if i in sm:
It looks innocuous, but since sm is the result of sorted() it is a list, so the in test is an O(n) lookup, which explains why it's slow with a big dataset.
Moreover, you're using the even more infamous si[sm.index(i)], which makes your algorithm O(n**2).
Since you need the indexes, using a set is not so easy, but there's something better:
Since sm is sorted, you could use bisect to find the index in O(log(n)), like this:
import bisect

for i in m:
    j = bisect.bisect_left(sm, i)
    smm.append(si[j] if (j < len(sm) and sm[j] == i) else None)
A small explanation: bisect gives you the insertion point of i in sm. That doesn't mean the value is actually in the list, so we have to check (by verifying that the returned index is within the list's range and that the value at that index is the searched value); if so, append si[j], else append None.
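Alternatively (my own suggestion, assuming the values in sm are unique, which random.sample guarantees here), you can precompute a value-to-index dict once and get O(1) lookups:

# Build the value -> index mapping once: O(n) extra space, O(1) lookups.
pos = {v: k for k, v in enumerate(sm)}
smm = [si[pos[i]] if i in pos else None for i in m]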
I am trying to find whether a list has 4 elements that sum to 0 (and later find what those elements are). I'm trying to make a solution based off the even k algorithm described at https://cs.stackexchange.com/questions/2973/generalised-3sum-k-sum-problem.
I got this code in Python, using combinations from the standard library:
from itertools import combinations

def foursum(arr):
    seen = {sum(subset) for subset in combinations(arr, 2)}
    return any(-x in seen for x in seen)
But this fails for input like [-1, 1, 2, 3]. It fails because it matches the sum (-1+1) with itself. I think this problem will get even worse when I want to find the elements, because you can separate a set of 4 distinct items into 2 sets of 2 items in 6 ways: {1,4}+{-2,-3}, {1,-2}+{4,-3}, and so on.
How can I make an algorithm that correctly returns all solutions avoiding this problem?
EDIT: I should have added that I want to use as efficient an algorithm as possible. O(len(arr)^4) is too slow for my task...
This works.
def foursum(arr):
    seen = {}
    for i in xrange(len(arr)):
        for j in xrange(i + 1, len(arr)):
            if arr[i] + arr[j] in seen:
                seen[arr[i] + arr[j]].add((i, j))
            else:
                seen[arr[i] + arr[j]] = {(i, j)}
    for key in seen:
        if -key in seen:
            for (i, j) in seen[key]:
                for (p, q) in seen[-key]:
                    if i != p and i != q and j != p and j != q:
                        return True
    return False
EDIT
This can be made more Pythonic, I think; I don't know enough Python.
It is normal for the 4SUM problem to permit input elements to be used multiple times. For instance, given the input (2 3 1 0 -4 -1), valid solutions are (3 1 0 -4) and (0 0 0 0).
The basic algorithm is O(n^2): Use two nested loops, each running over all the items in the input, to form all sums of pairs, storing the sums and their components in some kind of dictionary (hash table, AVL tree). Then scan the pair-sums, reporting any quadruple for which the negative of the pair-sum is also present in the dictionary.
If you insist on not duplicating input elements, you can modify the algorithm slightly. When computing the two nested loops, start the second loop beyond the current index of the first loop, so no input elements are taken twice. Then, when scanning the dictionary, reject any quadruples that include duplicates.
I discuss this problem at my blog, where you will see solutions in multiple languages, including Python.
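For illustration, here is a rough sketch along the lines of the dictionary approach above that yields the quadruples themselves rather than just a boolean (my own sketch, not the blog's code; note it can report the same quadruple more than once, once per pair split):

from collections import defaultdict

def foursum_solutions(arr):
    # Map each pair sum to the index pairs that produce it.
    pairs = defaultdict(list)
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            pairs[arr[i] + arr[j]].append((i, j))
    for s, left in pairs.items():
        if -s in pairs:
            for (i, j) in left:
                for (p, q) in pairs[-s]:
                    if len({i, j, p, q}) == 4:  # reject shared indices
                        yield arr[i], arr[j], arr[p], arr[q]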
First note that the problem is O(n^4) in the worst case, since the output size itself might be O(n^4) (you are looking for all solutions, not only answering the binary problem).
Proof:
Take the example [-1] * (n/2) + [1] * (n/2). You need to "choose" two instances of -1 without repeats - (n/2)*(n/2-1)/2 possibilities - and two instances of 1 without repeats - another (n/2)*(n/2-1)/2 possibilities. This totals (n/2)*(n/2-1)*(n/2)*(n/2-1)/4 quadruples, which is Theta(n^4).
Now that we understand we cannot achieve O(n^2 log n) in the worst case, here is an algorithm (pseudo-code) that should stay close to O(n^2 log n) for "good" cases (few identical sums) and degrade to O(n^4) in the worst case (as expected).
Pseudo-code:
subsets <- all subsets of size 2 of the indices (not the values!)
l <- empty list
for each s in subsets:
    # append a triplet (sum, idx1, idx2):
    l.append((arr[s[0]] + arr[s[1]], s[0], s[1]))
sort l by the first element (the sum) of each tuple
for each x in l:
    binary search l for -x[0]  # i.e. for the negated sum
    for each element y that satisfies the above:
        if x[1] != y[1] and x[2] != y[1] and x[1] != y[2] and x[2] != y[2]:
            yield arr[x[1]], arr[x[2]], arr[y[1]], arr[y[2]]
A Pythonic version of the above would probably be more elegant and readable, but I am not a Python expert, I'm afraid.
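A rough Python rendering of that pseudo-code (my own sketch, using itertools.combinations and the bisect module):

import bisect
from itertools import combinations

def foursum_all(arr):
    # Build and sort (sum, i, j) triplets for every index pair i < j.
    l = sorted((arr[i] + arr[j], i, j)
               for i, j in combinations(range(len(arr)), 2))
    sums = [t[0] for t in l]
    for s, i, j in l:
        # Binary-search the sorted triplets for sums equal to -s.
        lo = bisect.bisect_left(sums, -s)
        hi = bisect.bisect_right(sums, -s)
        for _, p, q in l[lo:hi]:
            if len({i, j, p, q}) == 4:  # all four indices distinct
                yield arr[i], arr[j], arr[p], arr[q]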
EDIT: Of course, the algorithm must take at least as much time as its output size!
If the number of possible solutions is not 'large' compared to n, then here is a suggested solution in O(N^3):
Find the pair-wise sums of all elements and build an NxN matrix of the sums.
For each element in this matrix, build a struct that has sumValue, row, and column as its fields.
Sort all these N^2 struct elements in a 1D array. (in O(N^2 logN) time).
For each element x in this array, conduct a binary search for its partner y such that x + y = 0 (O(logn) per search).
Now if you find a partner y, check whether its row or column field matches that of element x. If so, iterate sequentially in both directions until there are no more such y.
If you find some y's that do not have a common row or column with x, then increment the count (or print the solution).
This iteration can at most take 2N steps because the length of rows and columns is N.
Hence the total complexity of this algorithm is O(N^2 * N) = O(N^3).
In a given array, how do I find the 2nd, 3rd, 4th, or 5th largest value?
Also, if we use the max() function in Python, what is the order of complexity associated with max()?
def nth_largest(li, n):
    li.remove(max(li))
    print max(li)  # will give me the second largest
    # how to make a general algorithm to find the 2nd, 3rd, 4th highest value?
    # n is the position of the element to be found below the highest value
I'd go for:
import heapq
res = heapq.nlargest(2, some_sequence)
print res[1] # to get 2nd largest
This is more efficient than sorting the entire list and then taking the first n elements. See the heapq documentation for further info.
You could use sorted(set(element)):
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>>
>>> sorted(set(a))[-1] # highest
100
>>> sorted(set(a))[-2] # second highest
55
>>>
as a function:
def nth_largest(li, n):
    return sorted(set(li))[-n]
test:
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>> def nth_largest(li, n):
... return sorted(set(li))[-n]
...
>>>
>>> nth_largest(a, 1)
100
>>> nth_largest(a, 2)
55
>>>
Note, here you only need to sort and remove the duplicates once; if you are worried about performance, you could cache the result of sorted(set(li)).
If performance is a concern (e.g. you intend to call this a lot), then you should absolutely keep the list sorted and de-duplicated at all times, and simply take the first, second, or nth element (which is O(1)).
Use the bisect module for this - it's much faster than re-sorting the whole list after every insertion.
insort lets you insert an element, and bisect lets you find whether you should be inserting at all (to avoid duplicates).
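A minimal sketch of that maintenance step (my own illustration):

import bisect

def insert_unique(sorted_li, x):
    # O(log n) search; list.insert is O(n) for the shift, but still
    # cheaper than re-sorting the whole list.
    i = bisect.bisect_left(sorted_li, x)
    if i == len(sorted_li) or sorted_li[i] != x:
        sorted_li.insert(i, x)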
If it's not, I'd suggest the simpler:
def nth_largest(li, n):
    return sorted(set(li))[-(n+1)]
If the reverse indexing looks ugly to you, you can do:
def nth_largest(li, n):
    return sorted(set(li), reverse=True)[n]
As for which method would have the lowest time complexity, this depends a lot on which types of queries you plan on making.
If you're planning on making queries into high indexes (e.g. 36th largest element in a list with 38 elements), your function nth_largest(li,n) will have close to O(n^2) time complexity since it will have to do max, which is O(n), several times. It will be similar to the Selection Sort algorithm except using max() instead of min().
On the other hand, if you are only making low-index queries, then your function can be efficient, as it will only apply the O(n) max function a few times and the time complexity will stay close to O(n). However, building a max-heap is possible in linear time O(n), and you would be better off just using that. After you go through the trouble of constructing the heap, peeking at the max is O(1) and popping it is O(log n), which could be a better long-term solution for you.
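For instance, a minimal max-heap sketch using heapq (which is a min-heap, so values are negated; my own illustration):

import heapq

def nth_largest(li, n):
    # heapify is O(len(li)); each pop is O(log len(li)).
    heap = [-x for x in li]
    heapq.heapify(heap)
    for _ in range(n - 1):
        heapq.heappop(heap)
    return -heap[0]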
I believe the most scalable way (in terms of being able to query nth largest element for any n) is to sort the list with time complexity O(n log n) using the built-in sort function and then make O(1) queries from the sorted list. Of course, that's not the most memory-efficient method but in terms of time complexity it is very efficient.
If you do not mind using numpy (import numpy as np):
np.partition(numbers, -i)[-i]
gives you the ith largest element of the list with a guaranteed worst-case O(n) running time.
The partition(a, kth) method returns an array where the kth element is the same as it would be in a sorted array; all elements before it are smaller, and all elements behind it are larger.
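A quick usage example:

import numpy as np

numbers = [3, 1, 4, 1, 5, 9, 2, 6]
print(np.partition(numbers, -2)[-2])  # prints 6, the 2nd largest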
How about:
sorted(li)[::-1][n]
I've been mucking around a bit with Python, and I've gathered that it's usually better (or 'pythonic') to use
for x in SomeArray:
rather than the more C-style
for i in range(0, len(SomeArray)):
I do see the benefits in this, mainly cleaner code, and the ability to use the nice map() and related functions. However, I am quite often faced with the situation where I would like to simultaneously access elements of varying offsets in the array. For example, I might want to add the current element to the element two steps behind it. Is there a way to do this without resorting to explicit indices?
The way to do this in Python is:
for i, x in enumerate(SomeArray):
    print i, x
The enumerate generator produces a sequence of 2-tuples, each containing the array index and the element.
List indexing and zip() are your friends.
Here's my answer for your more specific question:
I might want to add the current element to the element two steps behind it. Is there a way to do this without resorting to explicit indices?
arr = range(10)
[i+j for i,j in zip(arr[:-2], arr[2:])]
You can also use the module numpy if you intend to work on numerical arrays. For example, the above code can be more elegantly written as:
import numpy
narr = numpy.arange(10)
narr[:-2] + narr[2:]
Adding the nth element to the (n-2)th element is equivalent to adding the mth element to the (m+2)th element (for the mathematically inclined, we performed the substitution n -> m+2). The range of n is [2, len(arr)) and the range of m is [0, len(arr)-2). Note the brackets and parentheses. The elements from 0 to len(arr)-3 (excluding the last two) are indexed as [:-2], while the elements from 2 to len(arr)-1 (excluding the first two) are indexed as [2:].
I assume that you already know list comprehensions.
Is there a way to sum up a list of numbers faster than with a for-loop, perhaps in the Python library? Or is that something really only multi-threading / vector processing can do efficiently?
Edit: Just to clarify, it could be a list of any numbers, unsorted, just input from the user.
You can use sum() to sum the values of an array.
a = [1,9,12]
print sum(a)
Yet another way to sum up a list, in the same linear time as a loop:
s = reduce(lambda x, y: x + y, l)  # in Python 3, first do: from functools import reduce
If each term in the list simply increments by 1, or if you can find a pattern in the series, you could find a formula for summing n terms. For example, the sum of the series {1,2,3,...,n} = n(n+1)/2
Read more here
Well, I don't know if it is faster, but you could try a little algebra to make it one operation. N*(N+1)/2 gives you the sum of every number from 1 to N, and there are other formulas for solving more complex sums.
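A trivial sanity check of that closed form:

def sum_1_to_n(n):
    # Closed form for 1 + 2 + ... + n: O(1) instead of O(n).
    return n * (n + 1) // 2

assert sum_1_to_n(100) == sum(range(1, 101))  # both give 5050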
For a general list, you have to visit every member at least once to get the sum, which is exactly what a for loop does. Using library APIs (like sum) is more convenient, but I doubt it would actually be faster.