Given a list of n comparable elements (say numbers or strings), the optimal algorithm to find the ith ordered element takes O(n) time.
Does Python implement natively O(n) time order statistics for lists, dicts, sets, ...?
None of Python's mentioned data structures implements natively the ith order statistic algorithm.
In fact, it might not make much sense for dictionaries and sets, given that neither makes any assumption about the ordering of its elements. For lists, it shouldn't be hard to implement the selection algorithm, which provides O(n) running time.
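As a rough illustration of the selection algorithm mentioned above, here is a minimal quickselect sketch. The function name `quickselect` and the random pivot choice are assumptions made for this example; a random pivot gives expected O(n) time, while a guaranteed worst-case O(n) bound needs a more careful pivot rule such as median of medians.

```python
import random

def quickselect(items, i):
    """Return the i-th smallest element (0-indexed) of items."""
    if not 0 <= i < len(items):
        raise IndexError("order statistic index out of range")
    pool = list(items)  # work on a copy so the caller's list is untouched
    while True:
        pivot = random.choice(pool)
        smaller = [x for x in pool if x < pivot]
        equal = [x for x in pool if x == pivot]
        if i < len(smaller):
            # the answer lies among the elements below the pivot
            pool = smaller
        elif i < len(smaller) + len(equal):
            # the answer is the pivot itself
            return pivot
        else:
            # skip everything <= pivot and keep searching above it
            i -= len(smaller) + len(equal)
            pool = [x for x in pool if x > pivot]

print(quickselect([2, 4, 0, 3, 1], 2))  # 2
```

Each round discards part of the list, so on average the total work is proportional to n rather than n log n.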
This is not a native solution, but you can use NumPy's partition to find the k-th order statistic of a list in O(n) time.
import numpy as np
x = [2, 4, 0, 3, 1]
k = 2
print('The k-th order statistic is:', np.partition(np.asarray(x), k)[k])
EDIT: this assumes zero-indexing, i.e. the "zeroth order statistic" above is 0.
If i << n you can take a look at http://docs.python.org/library/heapq.html#heapq.nlargest and http://docs.python.org/library/heapq.html#heapq.nsmallest (they don't solve your problem in O(n), but are faster than sorting and taking the i-th element).
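A small sketch of how heapq.nsmallest could be used for this; the helper name `ith_smallest` is made up for the example, and the index is zero-based to match the NumPy answer above.

```python
import heapq

def ith_smallest(items, i):
    """Return the i-th smallest element (0-indexed) of a list."""
    if not 0 <= i < len(items):
        raise IndexError("order statistic index out of range")
    # nsmallest returns the i + 1 smallest elements in ascending order,
    # so the last of them is the i-th order statistic
    return heapq.nsmallest(i + 1, items)[-1]

print(ith_smallest([2, 4, 0, 3, 1], 2))  # 2
```

This only keeps i + 1 candidates at a time, which is why it pays off when i is much smaller than n.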
I was wondering whether pandas sorting with sort_values() is a deterministic operation in case of ties, i.e. if calling df.sort_values('foo') would always return the same sorting, no matter how often I run it?
One example would be
df=pd.DataFrame(np.random.randint(1, 3, 5),columns=["foo"])
df.sort_values(['foo'])
foo
0 1
4 1
1 2
2 2
3 2
I understand that the operation is not stable, but is it deterministic?
Yes. If you use kind='quicksort', the output is deterministic, but not stable.
The reason why quicksort can be nondeterministic is that all quicksort implementations are made up of three steps:
Pick a pivot element.
Divide the list into two lists: the elements smaller than the pivot, and the elements larger than the pivot.
Run quicksort on both halves of the list.
There are three popular ways of implementing step 1.
The first way is to arbitrarily pick a pivot element, such as picking the first element, or middle element.
The second way is to pick an element at random.
The third way is to pick several elements at random, and compute a median (or median of medians.)
The first way is deterministic. The second and third ways are nondeterministic.
So, which kind of quicksort does Pandas implement? Pandas dispatches sort_values() to sort_index(), which uses numpy's argsort() to do the sort. How does numpy implement picking the pivot? That's defined in this file.
The pivot element is vp. It is chosen like so:
/* quicksort partition */
pm = pl + ((pr - pl) >> 1);
[...]
vp = *pm;
How does this work? The variables pl and pr are pointers to the beginning and end of the region to be sorted, respectively. If you subtract the two, that is the number of elements to be sorted. If you shift that right once, that's dividing it by 2. So the pm pointer points to an element halfway into the array. Then pm is de-referenced to get the pivot element. (Note that this isn't necessarily the median element of the array! It could be the smallest element, or the largest.)
This means that numpy uses the first way to pick elements - it is arbitrary but deterministic. The tradeoff for this is that for some orderings of data, the sort performance will degrade from O(N log N) to O(N^2).
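If you want to convince yourself empirically, a quick check along these lines can help. It is only a sketch: the seed and array size are arbitrary choices, and it shows repeatability for one input on one NumPy installation rather than proving anything in general.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the example is reproducible
values = rng.integers(1, 3, size=1000)  # many ties: every value is 1 or 2

first = np.argsort(values, kind="quicksort")
repeats = [np.argsort(values, kind="quicksort") for _ in range(5)]

# every run returns exactly the same permutation for the same input
all_same = all(np.array_equal(first, r) for r in repeats)
print(all_same)
```

The permutation may still reorder tied elements relative to their original positions, which is what "not stable" refers to.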
More information about implementing quicksort
For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
[7,3],
[3,1]]
d = some_algorithm(S,t)
/*
d =[[2,1],
[3,2],
[7,4]]
*/
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set we can write each element in t as an M-digit number to the base L using elements from S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial ... x mod L to get back the index of the least significant digit. floor(x/L) and repeat until all indices are extracted. lookup the values in S and construct the tuple.
since now you can represent t as an integer set (read hashtable) calculating the inverse set d becomes rather trivial
loop from 0 to L^M - 1 (every possible M-digit number to the base L) and ask your hashtable if the element is in t or d
if the size of S is too big you can also just draw random numbers against the hashtable for a subset of the inverse of t
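a minimal sketch of the base transformation for M=2, using the example data from the question. the names `encode`, `decode` and `index_of` are just illustrative, and it assumes S has unique elements and every tuple in t only uses values from S. with f as defined above, the codes run from 0 to L^M - 1.

```python
S = [0, 1, 2, 3, 7]
t = [(0, 1), (7, 3), (3, 1)]

L = len(S)
M = 2
index_of = {value: idx for idx, value in enumerate(S)}  # I(x)

def encode(tup):
    """f(x): read the tuple as an M-digit base-L number, x[0] least significant."""
    return sum(index_of[v] * L**pos for pos, v in enumerate(tup))

def decode(code):
    """f^-1(x): peel off base-L digits and map them back to values in S."""
    values = []
    for _ in range(M):
        code, digit = divmod(code, L)
        values.append(S[digit])
    return tuple(values)

seen = {encode(x) for x in t}  # t as an integer set
# every code that is not in t decodes to a tuple of the inverse set d
d = [decode(c) for c in range(L**M) if c not in seen]
print(len(d))  # 22
```

for large S the full loop over L^M codes is too expensive, which is where drawing random codes against the hashtable comes in.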
does this help you?
If |t| + |d| << |S|^2 then the probability of some random tuple to be chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that if you redraw each element until you get a new one, you need O((1/(1-C)) * |d|) draws in total (on average), which is O(|d|) if C is indeed constant.
Checking if an element is already "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen. This will make some errors, saying an element is already "seen" though it was not, but never the other way around, so you will still get all elements in d as unique.
Sorting t in place, and using binary search. This adds O(|t|log|t|) pre-processing, and O(log|t|) for each lookup, but requires no additional space (other than where you store d).
If in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use Fisher Yates shuffle on the available choices, and choosing the first |d| elements that do not appear in t.
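The redraw-until-new idea with a hash set can be sketched as follows. The function name `sample_unseen_pairs` and its `seed` parameter are hypothetical, and the sketch assumes S has unique elements and every pair in t is built from values in S.

```python
import random

def sample_unseen_pairs(S, t, m, seed=None):
    """Draw m distinct pairs over S that do not occur in t."""
    rng = random.Random(seed)
    seen = {tuple(pair) for pair in t}  # hash set of t for O(1) lookups
    if m > len(S) ** 2 - len(seen):
        raise ValueError("not enough unseen pairs to draw m of them")
    d = []
    while len(d) < m:
        candidate = (rng.choice(S), rng.choice(S))
        if candidate not in seen:
            seen.add(candidate)  # also blocks duplicates within d
            d.append(candidate)
    return d

d = sample_unseen_pairs([0, 1, 2, 3, 7], [[0, 1], [7, 3], [3, 1]], m=3, seed=0)
print(d)
```

The up-front size check keeps the loop from spinning forever when fewer than m unseen pairs exist.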
Is there any help for such a problem?
Do you mean this?
def check_elements(arr):
    return len(arr) == len(set(arr))
UPD
I think I get the point. Given a list with constant length (say 50). And we need to add such circumstances to the problem that solving this problem will take O(n) time. And I suppose not like O(n) dummy operations but kinda reasonable O(n).
Well... the only place I see where we can get O(n) from is the elements themselves. Say we have something like this:
[
1.1111111111111111..<O(n) digits>..1,
1.1111111111111111..<O(n) digits>..2,
1.1111111111111111..<O(n) digits>..3,
1.1111111111111111..<O(n) digits>..1,
]
Basically we can treat the elements as very long strings. And to check whether a constant number of n-character strings are unique or not, we have to at least read them all. And that takes at least O(n) time.
You can just use a counting sort; in your case it will run in O(n). Create an array indexed from 0 to N (N is your maximum value), and then for each value v in the original array, add one to the v-th entry of that array. This takes O(n) (just one pass over all values of the original array), and then you just search the resulting array for an entry greater than 1...
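A minimal sketch of that counting idea, assuming the values are non-negative integers no larger than a known maximum. The name `has_duplicates` is made up, and unlike the description above it stops as soon as a count exceeds 1 instead of scanning the count array afterwards.

```python
def has_duplicates(values, max_value):
    """Return True if any value repeats; expects integers in 0..max_value."""
    counts = [0] * (max_value + 1)  # one counter per possible value
    for v in values:
        counts[v] += 1
        if counts[v] > 1:
            return True  # second occurrence found, no need to look further
    return False

print(has_duplicates([3, 1, 4, 1, 5], max_value=5))  # True
print(has_duplicates([3, 1, 4, 5], max_value=5))     # False
```

Allocating the counters costs O(N), so this is only attractive when the maximum value is not much larger than the number of elements.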
Let's suppose we have a matrix and a list of indexes:
adj_mat = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
indexes = [0,2]
What I want is to sum the rows and columns corresponding to the sub matrix we get by the intersection of the rows and columns of the indexes list. In this case it would be:
sub_matrix = ([[1,3],
[7,9]])
result_rows = [4,16]
result_columns = [8,12]
However, I do this calculation many times with the same original matrix and different indexes lists, so I am looking for an efficient solution that does not create the sub matrix on each iteration. My solution so far is (and for columns respectively):
def sum_rows(matrix, indexes):
    sum_r = [0]*len(indexes)
    for i in range(len(indexes)):
        for j in indexes:
            sum_r[i] += matrix.item(indexes[i], j)
    return sum_r
I'm looking for a more efficient algorithm as I remember there is a method which looks like this that sums all rows (or columns?) in the indexes:
matrix.sum(:, indexes)
matrix.sum(indexes, indexes)
I assume what I need is the second line, if it exists. I tried to google it, with or without numpy, but couldn't find the right syntax.
Is there a solution as I described here but I'm just using the wrong syntax? Or any other suggestions for improvement?
IIUC:
import numpy as np
adj_mat = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
indexes = np.array([1, 3]) - 1
sub_matrix = adj_mat[np.ix_(indexes, indexes)]
result_rows, result_columns = sub_matrix.sum(axis=1), sub_matrix.sum(axis=0)
Result:
array([ 4, 16]) # result_rows
array([ 8, 12]) # result_columns
So assuming you made a mistake and you meant indexes = [0,2] and sub_matrix = [[1,3], [7,9]], then this should do what you want
import numpy as np
def sum_sub(matrix, indices):
    """
    Returns the sum of each row and column (as a tuple)
    for each index in indices (as an array)
    """
    # note that indexing with np.ix_ is "fancy" indexing, so this
    # sub matrix is a copy of the selected entries, not a view of matrix
    sub_mat = matrix[np.ix_(indices, indices)]
    return sub_mat.sum(axis=1), sub_mat.sum(axis=0)
sum_row, sum_col = sum_sub(np.arange(1,10).reshape((3,3)), [0,2])
The results of this are
sum_col # --> [ 8 12]
sum_row # --> [ 4 16]
Since the point of efficiency was brought up in the question, a little further analysis should probably be done.
First and foremost, the code looks like code to find a matrix inverse using the adjoint matrix. Unless that particular method is important to the project, the standard np.linalg.inv() is almost certainly going to be faster than anything we cook up here. Moreover, in many applications you can get away with solving a system of linear equations rather than finding an inverse and multiplying by it, cutting run times in half or more again.
Second, any discussion of efficient numpy code needs to address views as opposed to copies. Memory allocation, writing to memory, and memory deallocation are all extremely expensive operations when compared with standard floating point arithmetic. That's not to say that they're slow, but you can notice an order of magnitude or two of difference in the speed of memory-efficient code vs nearly anything else. That's the entire premise behind the fastest implementation of persistent homology calculations I know of, among other things.
All of the other answers (at the time of writing) create a copy of the data they're working with, explicitly storing that information in a new variable sub_matrix. It isn't possible to create every fancy-indexed matrix without a copy, but oftentimes equivalent operations can be performed.
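To see the view/copy distinction concretely, np.shares_memory can be used as a quick check. This is just an illustration on the example matrix; the `[:2, :2]` slice is an arbitrary choice to contrast with the fancy-indexed selection and is not the sub matrix from the question.

```python
import numpy as np

adj_mat = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])
indexes = [0, 2]

fancy = adj_mat[np.ix_(indexes, indexes)]  # integer-array (fancy) indexing
sliced = adj_mat[:2, :2]                   # basic slicing

print(np.shares_memory(adj_mat, fancy))   # False: fancy indexing allocates a copy
print(np.shares_memory(adj_mat, sliced))  # True: a slice is a view
```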
For example, if this really is a set of computations on adjoint matrices so that your indexes variable consists of all but one of the available indices (in your example, all but the middle index), then instead of explicitly summing over all the intended indices, we can sum over all indices and subtract the one we don't care about. The effect is that all the intermediate matrices are views rather than copies, preventing the expensive memory allocations. On my machine, this is twice as fast for the tiny 3x3 example given and 10x as fast for 500x500 matrices.
bad_row = 1
bad_col = 1
result_rows = (np.sum(adj_mat, axis=1)-adj_mat[:,bad_col])[np.arange(adj_mat.shape[0])!=bad_row]
result_cols = (np.sum(adj_mat, axis=0)-adj_mat[bad_row,:])[np.arange(adj_mat.shape[1])!=bad_col]
Of course, it's even faster if you can use slices to represent whatever you're doing and you don't have to work around the problem with extra operations as I did, but the example you gave doesn't easily permit slices.
In a given array how to find the 2nd, 3rd, 4th, or 5th values?
Also, if we use the max() function in Python, what is the order of complexity associated with max()?
def nth_largest(li, n):
    li.remove(max(li))
    print(max(li))  # will give me the second largest
    # how to make a general algorithm to find the 2nd, 3rd, 4th highest value
    # n is the element to be found below the highest value
I'd go for:
import heapq
res = heapq.nlargest(2, some_sequence)
print(res[1]) # to get 2nd largest
This is more efficient than sorting the entire list and then taking the first n elements. See the heapq documentation for further info.
You could use sorted(set(element)):
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>>
>>> sorted(set(a))[-1] # highest
100
>>> sorted(set(a))[-2] # second highest
55
>>>
as a function:
def nth_largest(li, n):
    return sorted(set(li))[-n]
test:
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>> def nth_largest(li, n):
...     return sorted(set(li))[-n]
...
>>>
>>> nth_largest(a, 1)
100
>>> nth_largest(a, 2)
55
>>>
Note that you only need to sort and remove the duplicates once; if you worry about performance, you could cache the result of sorted(set(li)).
If performance is a concern (e.g. you intend to call this a lot), then you should absolutely keep the list sorted and de-duplicated at all times, and simply take the first, second, or nth element (which is O(1)).
Use the bisect module for this - it's faster than a "standard" sort.
insort lets you insert an element, and bisect will let you find whether you should be inserting at all (to avoid duplicates).
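A minimal sketch of keeping such a list with bisect; the helper name `insert_unique` is made up for this example. Because bisect_left already gives the position, the sketch calls list.insert directly instead of insort. Note that the binary search is O(log n) but the insertion itself still shifts elements, so each insert is O(n) in the worst case.

```python
import bisect

def insert_unique(sorted_values, value):
    """Insert value into an ascending list unless it is already present."""
    pos = bisect.bisect_left(sorted_values, value)
    if pos == len(sorted_values) or sorted_values[pos] != value:
        sorted_values.insert(pos, value)

values = []
for v in (0, 11, 100, 11, 33, 33, 55):
    insert_unique(values, v)

print(values[-1])  # 100, the highest
print(values[-2])  # 55, the second highest
```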
If it's not, I'd suggest the simpler:
def nth_largest(li, n):
    return sorted(set(li))[-(n+1)]
If the reverse indexing looks ugly to you, you can do:
def nth_largest(li, n):
    return sorted(set(li), reverse=True)[n]
As for which method would have the lowest time complexity, this depends a lot on which types of queries you plan on making.
If you're planning on making queries into high indexes (e.g. 36th largest element in a list with 38 elements), your function nth_largest(li,n) will have close to O(n^2) time complexity since it will have to do max, which is O(n), several times. It will be similar to the Selection Sort algorithm except using max() instead of min().
On the other hand, if you are only making low index queries, then your function can be efficient, as it will only apply the O(n) max function a few times and the time complexity will be close to O(n). However, building a max heap is possible in linear time O(n) and you would be better off just using that. After you go through the trouble of constructing a heap, reading the maximum is O(1) and removing it to expose the next largest is O(log n), which could be a better long-term solution for you.
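A possible sketch of the heap approach using the standard heapq module. heapq only provides a min-heap, so this example negates the values to get max-heap behavior, which assumes numeric elements. The name `nth_largest_via_heap` is illustrative, n is 1-based here (n=1 is the maximum), and duplicates are counted separately.

```python
import heapq

def nth_largest_via_heap(values, n):
    """Return the n-th largest element (n=1 is the maximum)."""
    if not 1 <= n <= len(values):
        raise IndexError("n is out of range")
    heap = [-v for v in values]  # negate so the smallest entry is the largest value
    heapq.heapify(heap)          # linear-time heap construction
    for _ in range(n - 1):
        heapq.heappop(heap)      # each pop costs O(log len(values))
    return -heap[0]

print(nth_largest_via_heap([0, 11, 100, 11, 33, 33, 55], 2))  # 55
```

The total cost is O(len(values) + n log len(values)), which stays close to linear for small n.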
I believe the most scalable way (in terms of being able to query nth largest element for any n) is to sort the list with time complexity O(n log n) using the built-in sort function and then make O(1) queries from the sorted list. Of course, that's not the most memory-efficient method but in terms of time complexity it is very efficient.
If you do not mind using numpy (import numpy as np):
np.partition(numbers, -i)[-i]
gives you the ith largest element of the list with a guaranteed worst-case O(n) running time.
The partition(a, kth) function returns an array in which the kth element is the same as it would be in a sorted array, all elements before it are smaller or equal, and all elements after it are larger or equal.
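A short example of that property, using made-up sample data. Here i is 1-based from the top, and duplicates count as separate elements, unlike the sorted(set(...)) approaches.

```python
import numpy as np

numbers = [0, 11, 100, 11, 33, 33, 55]
i = 2  # 1-based rank from the top

part = np.partition(numbers, -i)
print(part[-i])  # 55

# everything before position -i is <= the pivot, everything after is >=
left_ok = bool(np.all(part[:-i] <= part[-i]))
right_ok = bool(np.all(part[-i:] >= part[-i]))
print(left_ok, right_ok)  # True True
```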
How about:
sorted(li)[::-1][n]