Find the 2nd highest element - python

In a given array, how do I find the 2nd, 3rd, 4th, or 5th largest values?
Also, if we use the max() function in Python, what is the order of complexity associated with it?
def nth_largest(li, n):
    li.remove(max(li))
    print(max(li))  # will give me the second largest
    # how to make a general algorithm to find the 2nd, 3rd, 4th highest value?
    # n is the element to be found below the highest value

I'd go for:
import heapq
res = heapq.nlargest(2, some_sequence)
print(res[1])  # to get 2nd largest
This is more efficient than sorting the entire list and then taking the first n elements. See the heapq documentation for further info.
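A minimal sketch of wrapping heapq.nlargest into the general function the question asks for. The name nth_largest_heap and the distinct flag are illustrative assumptions, not part of heapq. Note that nlargest keeps duplicates unless you remove them first.

```python
import heapq


def nth_largest_heap(seq, n, distinct=False):
    """Return the n-th largest item (n=1 is the maximum).

    Hypothetical helper: distinct=True ignores duplicate values.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    items = set(seq) if distinct else seq
    top = heapq.nlargest(n, items)  # keeps only the n largest items
    if len(top) < n:
        raise ValueError("n is out of range for this sequence")
    return top[-1]


a = [0, 11, 100, 11, 33, 33, 55]
print(nth_largest_heap(a, 2))                 # second largest
print(nth_largest_heap(a, 4))                 # duplicates count as separate items
print(nth_largest_heap(a, 4, distinct=True))  # duplicates ignored
```

Whether duplicates should count is a design choice; the sorted(set(...)) answers in this thread assume they should not.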

You could use sorted(set(element)):
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>>
>>> sorted(set(a))[-1] # highest
100
>>> sorted(set(a))[-2] # second highest
55
>>>
As a function:
def nth_largest(li, n):
    return sorted(set(li))[-n]
Test:
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>> def nth_largest(li, n):
...     return sorted(set(li))[-n]
...
>>>
>>> nth_largest(a, 1)
100
>>> nth_largest(a, 2)
55
>>>
Note that you only need to sort and de-duplicate once; if you are worried about performance, you can cache the result of sorted(set(li)).

If performance is a concern (e.g. you intend to call this a lot), then you should absolutely keep the list sorted and de-duplicated at all times, and simply take the first, second, or nth element (which is O(1)).
Use the bisect module for this: inserting into an already-sorted list is cheaper than re-sorting the whole list after every change.
insort lets you insert an element, and bisect lets you check whether the element is already present, so you know whether to insert at all (to avoid duplicates).
If it's not, I'd suggest the simpler:
def nth_largest(li, n):
    # n is zero-based here: n=0 is the largest value
    return sorted(set(li))[-(n+1)]
If the reverse indexing looks ugly to you, you can do:
def nth_largest(li, n):
    return sorted(set(li), reverse=True)[n]
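The keep-it-sorted idea with bisect could be sketched roughly as follows. The class name SortedUnique and its methods are hypothetical, and unlike the functions above this sketch counts n from 1.

```python
import bisect


class SortedUnique:
    """Hypothetical container that stays sorted and duplicate-free."""

    def __init__(self, items=()):
        self._data = sorted(set(items))

    def add(self, value):
        # bisect_left finds where value is, or where it would go.
        i = bisect.bisect_left(self._data, value)
        if i == len(self._data) or self._data[i] != value:
            self._data.insert(i, value)  # the list insert itself is O(len)

    def nth_largest(self, n):
        # n is 1-based here: n=1 is the maximum. Plain O(1) indexing.
        if not 1 <= n <= len(self._data):
            raise IndexError("n is out of range")
        return self._data[-n]


s = SortedUnique([0, 11, 100, 11, 33, 33, 55])
s.add(55)  # already present, ignored
s.add(70)
print(s.nth_largest(2))
```

Finding the position is logarithmic, but inserting into a Python list still shifts elements, so each add is linear in the worst case; only the queries are O(1).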

As for which method would have the lowest time complexity, this depends a lot on which types of queries you plan on making.
If you're planning on making queries into high indexes (e.g. 36th largest element in a list with 38 elements), your function nth_largest(li,n) will have close to O(n^2) time complexity since it will have to do max, which is O(n), several times. It will be similar to the Selection Sort algorithm except using max() instead of min().
On the other hand, if you are only making low index queries, then your function can be efficient as it will only apply the O(n) max function a few times and the time complexity will be close to O(n). However, building a max heap is possible in linear time O(n) and you would be better off just using that. After you go through the trouble of constructing a heap, reading the maximum is O(1) and removing it is O(log n), which could be a better long-term solution for you.
I believe the most scalable way (in terms of being able to query nth largest element for any n) is to sort the list with time complexity O(n log n) using the built-in sort function and then make O(1) queries from the sorted list. Of course, that's not the most memory-efficient method but in terms of time complexity it is very efficient.
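As a rough sketch of the heap option: heapq only provides a min-heap, so this assumes numeric values that can be negated. The function name nth_largest_by_heap is illustrative, and n counts from 1.

```python
import heapq


def nth_largest_by_heap(li, n):
    """Return the n-th largest item (n=1 is the maximum) without mutating li."""
    if not 1 <= n <= len(li):
        raise ValueError("n is out of range")
    heap = [-x for x in li]   # negate so the min-heap behaves like a max-heap
    heapq.heapify(heap)       # O(len(li))
    for _ in range(n - 1):    # each pop is O(log len(li))
        heapq.heappop(heap)
    return -heap[0]           # peeking at the root is O(1)


print(nth_largest_by_heap([0, 11, 100, 11, 33, 33, 55], 3))
```

For small n this is close to linear; for n near the length of the list it approaches the cost of a full sort.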

If you do not mind using numpy (import numpy as np):
np.partition(numbers, -i)[-i]
gives you the ith largest element of the list with a guaranteed worst-case O(n) running time.
The partition(a, kth) function returns an array in which the kth element is in the position it would occupy in a sorted array; all elements before it are smaller or equal, and all elements after it are equal or larger.

How about:
sorted(li)[::-1][n]

Related

If there's a way to solve element uniqueness problem in O(n)

Is there any help for such a problem?
Do you mean this?
def check_elements(arr):
    return len(arr) == len(set(arr))
UPD
I think I get the point. We are given a list of constant length (say 50), and we need to add conditions to the problem so that solving it takes O(n) time. And I suppose not O(n) dummy operations, but a reasonable O(n).
Well... the only place I can see where O(n) could come from is the elements themselves. Say we have something like this:
[
1.1111111111111111..<O(n) digits>..1,
1.1111111111111111..<O(n) digits>..2,
1.1111111111111111..<O(n) digits>..3,
1.1111111111111111..<O(n) digits>..1,
]
Basically we can treat the elements as very long strings. To check whether a constant number of n-character strings are unique, we have to at least read them all, and that takes at least O(n) time.
You can just use a counting sort; in your case it will be O(n). Create an array with indices from 0 to N (N is your maximum value), and then, for each value v in the original array, add one to the v-th entry of the new array. This takes O(n) (just one pass over the values of the original array); then just search the new array for an entry greater than 1, which is one more pass over its N + 1 entries...
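A minimal sketch of that counting idea, assuming every value is a non-negative integer no larger than a known max_value. The function name is hypothetical, and it uses a seen flag with an early exit instead of full counts, which is a small variation on what is described above.

```python
def all_unique_counting(values, max_value):
    """Assumes every value is an integer in range(0, max_value + 1)."""
    seen = [False] * (max_value + 1)  # one slot per possible value
    for v in values:
        if seen[v]:
            return False              # second occurrence found
        seen[v] = True
    return True


print(all_unique_counting([3, 1, 4, 0], 4))
print(all_unique_counting([3, 1, 3, 0], 4))
```

The cost is one pass over the input plus the allocation of max_value + 1 slots, so it only pays off when the maximum value is not much larger than the number of elements.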

faster way of finding combinations?

I'm trying to find all possible sub-intervals between the points of np.linspace(0,n,n*10+1)
where the sub-interval is wider than width (say width=0.5),
so I tried this using itertools:
import itertools
import numpy as np
ranges=np.linspace(0,n,n*10+1)
#find all combinations
combinations=list(itertools.combinations(ranges,2))
#use a for-loop to calculate the width of each interval
#and append to a new list if the width is greater than 0.5
save=[]
for i in range(len(combinations)):
    if combinations[i][1]-combinations[i][0]>0.5:
        save.append(combinations[i])
but this takes too long as n gets bigger, and above all it uses a huge amount of RAM.
So I'm wondering whether I can make this faster, or apply the constraint while I collect the combinations.
itertools.combinations(...) returns an iterator, which means the returned object produces its values when needed instead of calculating everything at once and storing the result in memory. You force immediate calculation and storage by converting it to a list, but this is unnecessary. Simply iterate over the combinations object instead of making a list of it and iterating over the indexes (which should not be done anyway):
import itertools
import numpy as np
ranges=np.linspace(0,n,n*10+1) # alternatively 'range(100)' or so to test
combinations=itertools.combinations(ranges,2)
save=[]
for c in combinations:
    if c[1] - c[0] > 0.5:
        save.append(c)
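If the goal is to apply the constraint while collecting, one possible sketch (a different technique from the itertools answer above) uses the fact that the points are already sorted: for each left endpoint, a binary search finds the first right endpoint that is far enough away, so pairs that are too narrow are never generated. The name wide_pairs is hypothetical, and a plain sorted list is assumed.

```python
from bisect import bisect_right


def wide_pairs(points, width):
    """Lazily yield (a, b) from a sorted list `points` with b > a + width."""
    for i, a in enumerate(points):
        # First index after i whose value exceeds a + width.
        start = bisect_right(points, a + width, i + 1)
        for j in range(start, len(points)):
            yield (a, points[j])


for pair in wide_pairs([0, 1, 2, 3, 4, 5], 2):
    print(pair)
```

With floating-point points, testing b > a + width can differ from b - a > width for pairs sitting exactly on the boundary, so check which comparison you need. The number of qualifying pairs is still quadratic in general, so this saves the wasted work and memory but not the size of the output.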

Constant time random selection and deletion

I'm trying to implement an edge list for a MultiGraph in Python.
What I've tried so far:
>>> l1 = Counter({(1, 2): 2, (1, 3): 1})
>>> l2 = [(1, 2), (1, 2), (1, 3)]
l1 has constant-time deletion of all edges between two vertices (e.g. del l1[(1, 2)]) but linear-time random selection on those edges (e.g. random.choice(list(l1.elements()))). Note that you have to do a selection on elements (vs. l1 itself).
l2 has constant-time random selection (random.choice(l2)) but linear-time deletion of all elements equal to a given edge ([i for i in l2 if i != (1, 2)]).
Question: is there a Python data structure that would give me both constant-time random selection and deletion?
I don't think what you're trying to do is achievable in theory.
If you're using weighted values to represent duplicates, you can't get constant-time random selection. The best you could possibly do is some kind of skip-list-type structure that lets you binary-search the element by weighted index, which is logarithmic.
If you're not using weighted values to represent duplicates, then you need some structure that allows you to store multiple copies. And a hash table isn't going to do it: the dups have to be independent objects (e.g., (edge, autoincrement)), meaning there's no way to delete all that match some criterion in constant time.
If you can accept logarithmic time, the obvious choice is a tree. For example, using blist:
>>> l3 = blist.sortedlist(l2)
To select one at random:
>>> edge = random.choice(l3)
The documentation doesn't seem to guarantee that this won't do something O(n). But fortunately, the source for both 3.3 and 2.7 shows that it's going to do the right thing. If you don't trust that, just write l3[random.randrange(len(l3))].
To delete all copies of an edge, you can do it like this:
>>> del l3[l3.bisect_left(edge):l3.bisect_right(edge)]
Or:
>>> try:
...     while True:
...         l3.remove(edge)
... except ValueError:
...     pass
The documentation explains the exact performance guarantees for every operation involved. In particular, len is constant, while indexing, slicing, deleting by index or slice, bisecting, and removing by value are all logarithmic, so both operations end up logarithmic.
(It's worth noting that blist is a B+Tree; you might get better performance out of a red-black tree, or a treap, or something else. You can find good implementations for most data structures on PyPI.)
As pointed out by senderle, if the maximum number of copies of an edge is much smaller than the size of the collection, you can create a data structure that does it in time quadratic on the maximum number of copies. Translating his suggestion into code:
import random
from collections import defaultdict
class MGraph(object):
    def __init__(self):
        self.edgelist = []
        self.edgedict = defaultdict(list)
    def add(self, edge):
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)
    def remove(self, edge):
        # Highest index first, so the edge moved into each hole is never a copy of `edge`.
        for index in sorted(self.edgedict.pop(edge, []), reverse=True):
            lastedge = self.edgelist.pop()
            maxedge = len(self.edgelist)  # index that lastedge occupied before the pop
            if index < maxedge:
                # Fill the hole with the former last edge and repoint its index entry.
                self.edgelist[index] = lastedge
                self.edgedict[lastedge] = [i if i != maxedge else index for i in self.edgedict[lastedge]]
    def choice(self):
        return random.choice(self.edgelist)
(You could, of course, change the replace-list-with-list-comprehension line with a three-liner find-and-update-in-place, but that's still linear in the number of dups.)
Obviously, if you plan to use this for real, you may want to beef up the class a bit. You can make it look like a list of edges, a set of tuples of multiple copies of each edge, a Counter, etc., by implementing a few methods and letting the appropriate collections.abc.Foo/collections.Foo fill in the rest.
So, which is better? Well, in your sample case, the average dup count is half the size of the list, and the maximum is 2/3rds the size. If that were true for your real data, the tree would be much, much better, because log N will obviously blow away (N/2)**2. On the other hand, if dups were rare, senderle's solution would obviously be better, because W**2 is still 1 if W is 1.
Of course for a 3-element sample, constant overhead and multipliers are going to dominate everything. But presumably your real collection isn't that tiny. (If it is, just use a list...)
If you don't know how to characterize your real data, write both implementations and time them with various realistic inputs.

ith order statistic in Python

Given a list of n comparable elements (say numbers or strings), the optimal algorithm to find the ith ordered element takes O(n) time.
Does Python natively implement O(n)-time order statistics for lists, dicts, sets, ...?
None of the Python data structures you mention natively implements the ith order statistic algorithm.
In fact, it might not make much sense for dictionaries and sets, given that neither makes any assumption about the ordering of its elements. For lists, it shouldn't be hard to implement the selection algorithm, which provides O(n) running time.
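One way such a selection routine might look is randomized quickselect, sketched below. This variant runs in expected O(n) time rather than guaranteed O(n); the worst-case linear version (median of medians) is longer. The function name and the zero-based index are choices made for this sketch.

```python
import random


def quickselect(items, i):
    """Return the i-th smallest element (zero-based) of a non-empty list.

    Randomized quickselect: expected O(n) time, O(n^2) in the worst case.
    """
    if not 0 <= i < len(items):
        raise IndexError("i is out of range")
    pool = list(items)  # work on a copy so the caller's list is untouched
    while True:
        pivot = random.choice(pool)
        smaller = [x for x in pool if x < pivot]
        equal = [x for x in pool if x == pivot]
        if i < len(smaller):
            pool = smaller                   # answer lies left of the pivot
        elif i < len(smaller) + len(equal):
            return pivot                     # answer is the pivot itself
        else:
            i -= len(smaller) + len(equal)   # skip everything <= pivot
            pool = [x for x in pool if x > pivot]


print(quickselect([2, 4, 0, 3, 1], 2))
```

It only needs the elements to support < and ==, so it works for strings as well as numbers.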
This is not a native solution, but you can use NumPy's partition to find the k-th order statistic of a list in O(n) time.
import numpy as np
x = [2, 4, 0, 3, 1]
k = 2
print('The k-th order statistic is:', np.partition(np.asarray(x), k)[k])
EDIT: this assumes zero-indexing, i.e. the "zeroth order statistic" above is 0.
If i << n you can take a look at http://docs.python.org/library/heapq.html#heapq.nlargest and http://docs.python.org/library/heapq.html#heapq.nsmallest (they don't solve your problem, but are faster than sorting and taking the i-th element).

Python: take max N elements from some list

Is there some function which would return me the N highest elements from some list?
I.e. if max(l) returns the single highest element, something like max(l, count=10) would return me a list of the 10 highest numbers (or fewer if l is smaller).
Or what would be an efficient easy way to get these? (Except the obvious canonical implementation; also, no such things which involve sorting the whole list first because that would be inefficient compared to the canonical solution.)
heapq.nlargest:
>>> import heapq, random
>>> heapq.nlargest(3, (random.gauss(0, 1) for _ in range(100)))
[1.9730767232998481, 1.9326532289091407, 1.7762926716966254]
The function in the standard library that does this is heapq.nlargest
Start with the first 10 from L, call that X. Note the minimum value of X.
Loop over L[i] for i over the rest of L.
If L[i] is greater than min(X), drop min(X) from X and insert L[i]. You may need to keep X as a sorted linked list and do an insertion. Update min(X).
At the end, you have the 10 largest values in X.
I suspect that will be O(kN) (where k is 10 here), since each insertion into the sorted list is linear in k. This might be what GSL uses, so if you can read some C code:
http://www.gnu.org/software/gsl/manual/html_node/Selecting-the-k-smallest-or-largest-elements.html
There is probably something in numpy that does this.
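The running top-k procedure described in this answer (keep X sorted, replace its minimum when a larger value arrives) could be sketched like this. A sorted Python list with bisect.insort stands in for the sorted linked list, and the name top_k_running is hypothetical.

```python
import bisect


def top_k_running(values, k):
    """Return the k largest values in ascending order (fewer if input is short)."""
    if k <= 0:
        return []
    top = []                       # kept sorted; top[0] plays the role of min(X)
    for v in values:
        if len(top) < k:
            bisect.insort(top, v)  # still filling the first k slots
        elif v > top[0]:
            del top[0]             # drop the current minimum
            bisect.insort(top, v)  # O(k) insertion keeps X sorted
    return top


print(top_k_running([5, 1, 9, 3, 7, 9, 2], 3))
```

Because it reads the input only once, it also works on generators and never holds more than k values at a time.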
A fairly efficient solution is a variation of quicksort where recursion is limited to the right part of the pivot until the pivot point position is higher than the number of elements required (with a few extra conditions to deal with border cases of course).
The standard library has heapq.nlargest, as pointed out by others here.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Pandas nlargest documentation
