I was wondering whether pandas sorting with sort_values() is a deterministic operation in case of ties, i.e. if calling df.sort_values('foo') would always return the same sorting, no matter how often I run it?
One example would be
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 3, 5), columns=["foo"])
df.sort_values(['foo'])
foo
0 1
4 1
1 2
2 2
3 2
I understand that the operation is not stable, but is it deterministic?
Yes. If you use kind='quicksort', the output is deterministic, but not stable.
The reason why quicksort can be nondeterministic is that all quicksort implementations are made up of three steps:
Pick a pivot element.
Divide the list into two lists: the elements smaller than the pivot, and the elements larger than the pivot.
Run quicksort on both halves of the list.
There are three popular ways of implementing step 1.
The first way is to arbitrarily pick a pivot element, such as the first element or the middle element.
The second way is to pick an element at random.
The third way is to pick several elements at random, and compute a median (or median of medians.)
The first way is deterministic. The second and third ways are nondeterministic.
So, which kind of quicksort does Pandas implement? Pandas dispatches sort_values() to sort_index(), which uses numpy's argsort() to do the sort. How does numpy implement picking the pivot? That's defined in this file.
The pivot element is vp. It is chosen like so:
/* quicksort partition */
pm = pl + ((pr - pl) >> 1);
[...]
vp = *pm;
How does this work? The variables pl and pr are pointers to the beginning and end of the region to be sorted, respectively. If you subtract the two, that is the number of elements to be sorted. If you shift that right once, that's dividing it by 2. So the pm pointer points to an element halfway into the array. Then pm is de-referenced to get the pivot element. (Note that this isn't necessarily the median element of the array! It could be the smallest element, or the largest.)
This means that numpy uses the first way to pick the pivot: it is arbitrary, but deterministic. The tradeoff is that for some orderings of the data, the sort performance will degrade from O(N log N) to O(N^2).
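A quick way to see both properties in practice (a small sketch of my own, assuming a seeded toy frame):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 3, 5), columns=["foo"])

# Deterministic: repeated quicksorts always produce the same row order.
first = df.sort_values("foo", kind="quicksort").index
assert all(df.sort_values("foo", kind="quicksort").index.equals(first)
           for _ in range(100))

# Not stable: if you need tied rows to keep their original relative order,
# ask for a stable algorithm explicitly (mergesort is stable).
stable = df.sort_values("foo", kind="mergesort")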
More information about implementing quicksort
In my current project, I have a D-dimensional array. For the sake of exposition, we can assume D=2, but the code should work with arbitrarily high dimensions. I need to run some operations on this matrix when it is sorted according to its last dimension, and subsequently reverse the sorting on the matrix.
The first part of sorting the matrix is relatively simple:
import numpy as np
D = 2
matrix = np.random.uniform(low=0.,high=1.,size=tuple([5]*D))
matrix_sorted = np.sort(matrix,axis=-1)
This code snippet sorts the matrix according to the last dimension, but does not remember how the array was sorted, and consequently does not allow me to revert the sorting. Alternatively, I could get the sorted indices with the following line:
sorted_indices = np.argsort(matrix,axis=-1)
Unfortunately, these indices do not seem to be very useful on their own. I am not sure how I can use them to (a) sort the matrix, and (b) undo the sorting for general D. A simple approach would be a for-loop over all rows in the D=2 case (where we sorted across the columns), but since I want the code to work for arbitrary dimensions, hard-coding nested for-loops is not really an option.
Do you have any elegant suggestions on how I could tackle this issue?
So yes, after following Mark M's suggestion, and reading up on some other StackOverflow answers continuing from there, the answer seems to be as follows:
import numpy as np
# Create the initial random matrix
D = 2
matrix = np.random.uniform(low=0.,high=1.,size=tuple([5]*D))
# Get the sorting indices
sorting = np.argsort(matrix,axis=-1)
# Get the indices for unsorting the matrix
reverse_sorting = np.argsort(sorting,axis=-1)
# Sort the initial matrix
matrix_sorted = np.take_along_axis(matrix, sorting, axis=-1)
# Undo the sorting
matrix_unsorted = np.take_along_axis(matrix_sorted, reverse_sorting, axis=-1)
The trick consists of two steps: np.take_along_axis lets us sort arbitrary-dimensional matrices according to the indices we get from np.argsort, and applying np.argsort to those sorting indices themselves gives us the indices required to undo the sorting, again with np.take_along_axis. I can do the desired complex operations between the penultimate and ultimate steps. Perfect!
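As a quick sanity check of the snippet above (a couple of asserts of my own):
# The round trip should reproduce the original matrix exactly,
# and the sorted copy should be non-decreasing along the last axis.
assert np.array_equal(matrix, matrix_unsorted)
assert np.all(np.diff(matrix_sorted, axis=-1) >= 0)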
For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
[7,3],
[3,1]]
d = some_algorithm(S,t)
/*
d =[[2,1],
[3,2],
[7,0]]
*/
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set, we can write each element of t as an M-digit number in base L, using the indices of the elements in S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
since you can now represent t as an integer set (read: hash table), calculating the complement set d becomes rather trivial
loop from 0 to L^M - 1 and ask your hash table whether each code is already in t; if not, decode it and it belongs to d
if the size of S is too big for a full loop, you can also just draw random codes and check them against the hash table to get a subset of the complement of t
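a rough sketch of the idea in plain Python (the helper names encode/decode and the toy inputs are mine, and I assume every tuple element in t is a member of S):
S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]
L = len(S)
index = {v: i for i, v in enumerate(S)}  # I(x): element -> digit

def encode(tup):
    # f(x): read the tuple's indices as the digits of a base-L number
    code = 0
    for v in reversed(tup):
        code = code * L + index[v]
    return code

def decode(code, M):
    # f^-1(x): peel off digits with mod/div, then look the values up in S
    out = []
    for _ in range(M):
        out.append(S[code % L])
        code //= L
    return out

M = 2
seen = {encode(tup) for tup in t}  # t as a set of integers (hash table)
d = [decode(c, M) for c in range(L ** M) if c not in seen]
for a large S you would replace the full loop over range(L ** M) with random draws checked against seen, as described above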
does this help you?
If |t| + |d| << |S|^2, then the probability that some random tuple has already been chosen (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing elements until you get a new one, you need O((1/(1-C)) * |d|) draws on average to produce |d| new elements, which is O(|d|) if C is indeed a constant.
Checking if an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is constant O(1) time. You could also use a bloom filter instead of storing the actual elements you have already seen; it will make some errors, claiming an element has been "seen" when it has not, but never the other way around, so you will still get all elements in d as unique.
In-place sorting of t, and using binary search. This adds O(|t|log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use a Fisher-Yates shuffle on the available choices and take the first |d| elements that do not appear in t.
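A rough sketch of the redraw-until-new approach with a hash set (the function name and the toy inputs are mine, not from the question):
import numpy as np

def sample_new_pairs(S, t, m, seed=None):
    # Draw pairs over S uniformly at random, rejecting anything already in t
    # or already drawn; assumes (|t| + m) / |S|^2 = C < 1, so the expected
    # number of redraws per new pair stays around 1 / (1 - C).
    rng = np.random.default_rng(seed)
    S = np.asarray(S)
    seen = {tuple(row) for row in t}  # O(1) membership checks
    out = []
    while len(out) < m:
        pair = tuple(int(v) for v in rng.choice(S, size=2))
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return np.array(out)

d = sample_new_pairs([0, 1, 2, 3, 7], [[0, 1], [7, 3], [3, 1]], m=3)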
Is there any help with such a problem?
Do you mean this?
def check_elements(arr):
    return len(arr) == len(set(arr))
UPD
I think I get the point. We are given a list of constant length (say 50), and we need to add conditions to the problem such that solving it takes O(n) time, and not O(n) dummy operations but a reasonable O(n).
Well... the only way I see to get O(n) is through the elements themselves. Say we have something like this:
[
1.1111111111111111..<O(n) digits>..1,
1.1111111111111111..<O(n) digits>..2,
1.1111111111111111..<O(n) digits>..3,
1.1111111111111111..<O(n) digits>..1,
]
Basically, we can treat the elements as very long strings. To check whether a constant number of n-character strings are unique or not, we have to at least read them all, and that alone takes at least O(n) time.
You can just use a counting sort; in your case it will be O(n). Create an array indexed from 0 to N (N is your maximum value), and for each value v in the original array add one to the v-th entry of the counting array. This takes O(n) (just one pass over all values of the original array), and then just search the counting array for an entry greater than 1...
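Something along those lines, as a sketch (assuming non-negative integers bounded by a known maximum N):
def has_duplicates(arr, N):
    # One pass over the input plus a count array of size N+1.
    counts = [0] * (N + 1)
    for v in arr:
        counts[v] += 1
        if counts[v] > 1:  # stop as soon as any value repeats
            return True
    return False

print(has_duplicates([1, 3, 2, 3], N=3))  # True
print(has_duplicates([1, 3, 2, 0], N=3))  # False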
Given a list of n comparable elements (say numbers or strings), the optimal algorithm to find the i-th smallest element takes O(n) time.
Does Python implement natively O(n) time order statistics for lists, dicts, sets, ...?
None of the mentioned Python data structures natively implements the i-th order statistic algorithm.
In fact, it might not make much sense for dictionaries and sets, given the fact that both make no assumptions about the ordering of its elements. For lists, it shouldn't be hard to implement the selection algorithm, which provides O(n) running time.
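For example, a plain quickselect over a list might look like this (a sketch of my own; it runs in expected O(n) time with a random pivot, not worst-case O(n) unless you add median-of-medians pivoting):
import random

def quickselect(xs, i):
    # Return the i-th smallest element (0-indexed) in expected O(n) time.
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        smaller = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        if i < len(smaller):
            xs = smaller
        elif i < len(smaller) + len(equal):
            return pivot
        else:
            i -= len(smaller) + len(equal)
            xs = [x for x in xs if x > pivot]

print(quickselect([2, 4, 0, 3, 1], 2))  # 2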
This is not a native solution, but you can use NumPy's partition to find the k-th order statistic of a list in O(n) time.
import numpy as np
x = [2, 4, 0, 3, 1]
k = 2
print('The k-th order statistic is:', np.partition(np.asarray(x), k)[k])
EDIT: this assumes zero-indexing, i.e. the "zeroth order statistic" above is 0.
If i << n, you can take a look at http://docs.python.org/library/heapq.html#heapq.nlargest and http://docs.python.org/library/heapq.html#heapq.nsmallest (they don't solve your problem exactly, but they are faster than sorting and taking the i-th element).
Is there some function which would return me the N highest elements from some list?
I.e. if max(l) returns the single highest element, something like max(l, count=10) would return a list of the 10 highest numbers (or fewer if l is smaller).
Or what would be an efficient, easy way to get these? (Other than the obvious canonical implementation; also, nothing that involves sorting the whole list first, because that would be inefficient compared to the canonical solution.)
heapq.nlargest:
>>> import heapq, random
>>> heapq.nlargest(3, (random.gauss(0, 1) for _ in range(100)))
[1.9730767232998481, 1.9326532289091407, 1.7762926716966254]
The function in the standard library that does this is heapq.nlargest
Start with the first 10 from L, call that X. Note the minimum value of X.
Loop over L[i] for the remaining indices i of L.
If L[i] is greater than min(X), drop min(X) from X and insert L[i]. You may need to keep X as a sorted linked list and do an insertion. Update min(X).
At the end, you have the 10 largest values in X.
I suspect that will be O(kN) (where k is 10 here), since inserting into a sorted list of length k is linear in k. This might be what GSL uses, so if you can read some C code:
http://www.gnu.org/software/gsl/manual/html_node/Selecting-the-k-smallest-or-largest-elements.html
There is probably something in numpy that does this.
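A rough sketch of that approach, keeping X as a sorted Python list via the bisect module (my choice of structure; the description above suggests a sorted linked list):
import bisect

def k_largest(L, k=10):
    # Keep a sorted list X of the k largest values seen so far: O(kN) overall.
    X = sorted(L[:k])  # start with the first k elements
    for v in L[k:]:
        if v > X[0]:  # X[0] is min(X)
            bisect.insort(X, v)  # O(k) insertion keeps X sorted
            X.pop(0)  # drop the previous minimum
    return X

print(k_largest([5, 1, 9, 3, 7, 8, 2, 6, 4, 0, 11, -3], k=3))  # [8, 9, 11]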
A fairly efficient solution is a variation of quicksort where recursion is limited to the right part of the pivot until the pivot point position is higher than the number of elements required (with a few extra conditions to deal with border cases of course).
The standard library has heapq.nlargest, as pointed out by others here.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Pandas nlargest documentation