Randomize numpy.argsort output in case of ties - python

I have a numpy array with some elements same as others i.e. there are ties, and I am applying np.argsort to find the indices which will sort the array:
In [29]: x = [1, 2, 1, 1, 5, 2]
In [30]: np.argsort(x)
Out[30]: array([0, 2, 3, 1, 5, 4])
In [31]: np.argsort(x)
Out[31]: array([0, 2, 3, 1, 5, 4])
As can be seen here, the outputs we get by running argsort two times are identical. However, array([2, 3, 0, 5, 1, 4]) is also a completely valid output because some elements in the original array are equal. Can I make argsort return me such "randomized" outputs when there are ties in my array? If not, what is a workaround because I don't want to bias my choice of the lowest values in the array when I am picking them.

One trick is to add uniform noise in the [0, 1) range and then perform the argsort. For integer data the noise is smaller than the gap between distinct values, so it only reorders elements within their respective groups of ties and gives randomized sort indices restricted to those groups -
(x+np.random.rand(len(x))).argsort()
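For float data, where distinct values can be closer than 1 apart, a hedged alternative sketch (not part of the original answer) is to sort by the values first and by a random key second using np.lexsort, so that only ties get shuffled. The helper name random_tiebreak_argsort and the use of np.random.default_rng are assumptions made for illustration.

```python
import numpy as np

def random_tiebreak_argsort(x, rng=None):
    """Return indices that sort x, with ties ordered at random."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    # np.lexsort uses the LAST key as the primary sort key,
    # so x decides the order and the random key only breaks ties.
    random_key = rng.random(x.shape[0])
    return np.lexsort((random_key, x))

x = [1, 2, 1, 1, 5, 2]
idx = random_tiebreak_argsort(x)
print(idx)  # the order within each group of ties varies from run to run
```

Passing a seeded generator, e.g. random_tiebreak_argsort(x, np.random.default_rng(0)), should make the shuffling reproducible.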

Related

Example in np.argsort document

For some reason I cannot resolve this.
According to the example here for 1-dim array,
https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
x = np.array([3, 1, 2])
np.argsort(x)
array([1, 2, 0])
And I have tried this myself. But by default, the result should be ascending, meaning
x[result]
returns
array([1, 2, 3])
Thus shouldn't the result be [2, 0, 1]?
What am I missing here?
From the docs, the first line states "Returns the indices that would sort an array." Hence if you want the positions of the sorted values we have:
x = np.array([3, 1, 2])
np.argsort(x)
>>>array([1, 2, 0])
Here we want the index positions of the values 1, 2 and 3 in x. The position of 1 is 1, the position of 2 is 2, and the position of 3 is 0, hence array([1, 2, 0]) lists the indices that produce the sorted array [1, 2, 3].
Again from the notes, "argsort returns an array of indices of the same shape as a that index data along the given axis in sorted order."
A more intuitive way of looking at what that means is to use a for loop, where we loop over our returned argsort values, and then index the initial array with these values:
x = np.array([3, 1, 2])
srt_positions = np.argsort(x)
for k in srt_positions:
    print(x[k])
>>> 1, 2, 3
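As a small sketch of the distinction that seems to cause the confusion (the names order and ranks are chosen here for illustration): argsort gives the indices that sort the array, while the [2, 0, 1] expected in the question is the rank of each element, which can be obtained by applying argsort twice.

```python
import numpy as np

x = np.array([3, 1, 2])

order = np.argsort(x)      # indices that sort x: [1, 2, 0]
sorted_x = x[order]        # [1, 2, 3]

# Rank of each element (its position in the sorted array): [2, 0, 1]
ranks = np.argsort(order)

print(order, sorted_x, ranks)
```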

NumPy: How to most efficiently/idiomatically add a value to all items in an array after an index?

I want to add a value to all items of a 1-D array after a certain index.
For example, my original array looks like this:
[0, 1, 2, 3, 4, 5, 6]
and I want to add 1 to all items after index 2, to end up with the following result array:
[0, 1, 2, 4, 5, 6, 7]
What is the most efficient way to do this, in terms of performance and use of 'idiomatic' Python/NumPy (i.e. not using a loop)? It seems that a list comprehension isn't the best approach since I'm dealing with NumPy arrays -- my assumption is that there's a clever way to index the array for this instead that may also be more performant.
Here's what I've cooked up using a list comprehension:
ary = np.array([0, 1, 2, 3, 4, 5, 6])
ix = 3
ary[ix:] = [x + 1 for x in ary[ix:]]
Is there a better way to do this, or is this good enough?
To add a value to all elements of an array ary at and after index ix:
ary[ix:] += value
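Applied to the array from the question, a minimal sketch looks like this (value = 1 and ix = 3 are taken from the question's example). The slice ary[ix:] is a view, so the in-place += modifies ary directly without a Python-level loop.

```python
import numpy as np

ary = np.array([0, 1, 2, 3, 4, 5, 6])
ix = 3
value = 1

# In-place add on the slice view; elements before ix are untouched.
ary[ix:] += value
print(ary)  # [0 1 2 4 5 6 7]
```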

Calculate persistence of sign in numpy array

This is hard to describe so consider this example. Let's say I have this array
import numpy as np
X=np.array([-1.94198425, 2.29219632, 0.35505434,
-0.06408812, -1.25963731, -0.32275248, -0.4178637 , 0.37951672])
Now I want to count the number of times (number of consecutive indices) that the sign of the elements remain the same. In this case the answer would be [1, 2, 4, 1], because there's 1 negative number, followed by 2 positive numbers, followed by 4 negatives and so on. I can calculate this by doing
times=[0]
sig=np.sign(X[0])
for x in X:
    if sig==np.sign(x):
        times[-1]+=1
    else:
        times.append(1)
        sig=np.sign(x)
print(times)
Which yields the correct result.
However, if I have a 400x1000 array and I want to perform this over one of the axes things get pretty slow.
Is there any way to use Numpy/Scipy to do this easily and over one axis of an n-dimensional array?
I figured I could start with something like
a=X.copy()
a[a<=0]=-1
a[a>0]=1
And use stuff like cumsum() but so far I got nothing.
PS: I could probably use f2py, Cython or Numba, but I'm trying to avoid that because of flexibility.
Approach #1 : Vectorized one-liner solution -
np.diff(np.r_[0,np.flatnonzero(np.diff(np.sign(X))!=0)+1, len(X)])
Approach #2 : Alternatively, for some performance boost, we can make use of slicing to replace the differentiation on the sign values and use faster np.concatenate in place of np.r_ for the concatenation step, like so -
s = np.sign(X)
out = np.diff(np.concatenate(( [0], np.flatnonzero(s[1:]!=s[:-1])+1, [len(X)] )))
Approach #3 : Alternatively again, if the number of sign changes is considerable compared to the length of the input array, you might want to do the concatenation on the boolean mask of sign changes instead. Boolean arrays are much more memory efficient than int or float arrays, which might bring a further performance boost.
Thus, one more method would be -
s = np.sign(X)
mask = np.concatenate(( [True], s[1:]!=s[:-1], [True] ))
out = np.diff(np.flatnonzero(mask))
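To check the idea on the 1D array from the question, approach #3 can be wrapped in a function. This is only a sketch; the function name sign_run_lengths is made up here.

```python
import numpy as np

def sign_run_lengths(X):
    """Lengths of consecutive runs of equal sign in a 1D array."""
    s = np.sign(X)
    # True wherever a new run starts, plus a sentinel True at the end.
    mask = np.concatenate(([True], s[1:] != s[:-1], [True]))
    return np.diff(np.flatnonzero(mask))

X = np.array([-1.94198425, 2.29219632, 0.35505434, -0.06408812,
              -1.25963731, -0.32275248, -0.4178637, 0.37951672])
print(sign_run_lengths(X))  # [1 2 4 1]
```

Note that np.sign(0) is 0, so exact zeros form their own runs here, unlike the a[a<=0]=-1 convention sketched in the question.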
Extending to 2D case
We can extend approach #3 to the 2D case with a bit of additional work, which is explained in the code comments. The good thing is that the concatenation part lets us keep the code vectorized during the extension. Thus, on a 2D array for which we need the sign persistence on a per-row basis, the implementation would look something like this -
# Get signs. Get one-off shifted mask for each row.
# Concatenate at either ends of each row with True values, getting us 2D mask
s = np.sign(X)
T = np.ones((X.shape[0],1),dtype=bool)
mask2D = np.column_stack(( T, s[:,1:]!=s[:,:-1], T ))
# Get flattened nonzeros indices on the 2D mask.
all_intervals = np.diff(np.flatnonzero(mask2D.ravel()))
# We need to remove the indices that were generated because of the True values
# concatenation. So, get those indices and delete those.
rm_idx = (mask2D[:-1].sum(1)-1).cumsum()
all_intervals1 = np.delete(all_intervals, rm_idx + np.arange(X.shape[0]-1))
# Finally, split the indices into a list of arrays, with each array giving us
# the counts of sign persistences
out = np.split(all_intervals1, rm_idx )
Sample input, output -
In [212]: X
Out[212]:
array([[-3, 1, -3, -2, 2, 3, -3, 1, 1, -1],
[-2, -3, 0, -2, -2, 0, 3, -1, -2, 2],
[ 0, -1, -3, -2, -2, 3, -3, -2, 1, 1],
[ 1, -3, 0, -1, -2, 1, -1, 1, 3, 2],
[-1, 1, 0, -2, 0, -1, -1, -3, 0, 1]])
In [213]: out
Out[213]:
[array([1, 1, 2, 2, 1, 2, 1]),
array([2, 1, 2, 1, 1, 2, 1]),
array([1, 4, 1, 2, 2]),
array([1, 1, 1, 2, 1, 1, 3]),
array([1, 1, 1, 1, 1, 3, 1, 1])]
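As a simpler (but only partially vectorized) cross-check of the 2D code, one could apply the 1D mask idea row by row. This is a sketch with a hypothetical helper name, sign_runs_per_row; it loops over rows in Python, so it trades speed for readability.

```python
import numpy as np

def sign_runs_per_row(X):
    """List of run-length arrays, one per row of a 2D array."""
    s = np.sign(X)
    out = []
    for row in s:
        # Same 1D mask trick as approach #3, applied to a single row.
        mask = np.concatenate(([True], row[1:] != row[:-1], [True]))
        out.append(np.diff(np.flatnonzero(mask)))
    return out

# First two rows of the sample input above.
X = np.array([[-3, 1, -3, -2, 2, 3, -3, 1, 1, -1],
              [-2, -3, 0, -2, -2, 0, 3, -1, -2, 2]])
for runs in sign_runs_per_row(X):
    print(runs)
```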

How to get the rank of a column in numpy 2d array?

suppose I have an array:
a = np.array([[1,2,3,4],
[4,2,5,6],
[6,5,0,3]])
I want to get the rank of column 0 in each row (i.e. np.array([0, 1, 3])). Is there any short way to do this?
For a 1D array I can use np.sum(a < a[0]) to do this, but what about a 2D array? It seems < cannot broadcast.
Approach #1
Use np.argsort along the rows and look for the index 0 corresponding to the first column to give us a mask of the same shape as the input array. Finally, get the column indices of the matches (True) in the mask for the desired rank output. So, the implementation would be -
np.where(a.argsort(1)==0)[1]
Approach #2
Another way to get the ranks of all columns in one go, would be a slight modification of the earlier method. The implementation would look like this -
(a.argsort(1)).argsort(1)
So, to get the rank of first column, index into the first column of it, like so -
(a.argsort(1)).argsort(1)[:,0]
Sample run
In [27]: a
Out[27]:
array([[1, 2, 3, 4],
[4, 2, 5, 6],
[6, 5, 0, 3]])
In [28]: np.where(a.argsort(1)==0)[1]
Out[28]: array([0, 1, 3])
In [29]: (a.argsort(1)).argsort(1) # Ranks for all cols
Out[29]:
array([[0, 1, 2, 3],
[1, 0, 2, 3],
[3, 2, 0, 1]])
In [30]: (a.argsort(1)).argsort(1)[:,0] # Rank for first col
Out[30]: array([0, 1, 3])
In [31]: (a.argsort(1)).argsort(1)[:,1] # Rank for second col
Out[31]: array([1, 0, 2])
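The 1D idea from the question, np.sum(a < a[0]), can also be carried over with broadcasting if the first column is kept two-dimensional. This is a hedged sketch rather than part of the answer above; note that with ties within a row, counting strictly smaller values can differ from the argsort-based rank.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [4, 2, 5, 6],
              [6, 5, 0, 3]])

# a[:, [0]] has shape (3, 1), so it broadcasts against a's shape (3, 4).
rank_col0 = (a < a[:, [0]]).sum(axis=1)
print(rank_col0)  # [0 1 3]
```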

How does the axis parameter from NumPy work?

Can someone explain exactly what the axis parameter in NumPy does?
I am terribly confused.
I'm trying to use the function myArray.sum(axis=num)
At first I thought if the array is itself 3 dimensions, axis=0 will return three elements, consisting of the sum of all nested items in that same position. If each dimension contained five dimensions, I expected axis=1 to return a result of five items, and so on.
However this is not the case, and the documentation does not do a good job helping me out (they use a 3x3x3 array so it's hard to tell what's happening)
Here's what I did:
>>> e
array([[[1, 0],
[0, 0]],
[[1, 1],
[1, 0]],
[[1, 0],
[0, 1]]])
>>> e.sum(axis = 0)
array([[3, 1],
[1, 1]])
>>> e.sum(axis=1)
array([[1, 0],
[2, 1],
[1, 1]])
>>> e.sum(axis=2)
array([[1, 0],
[2, 1],
[1, 1]])
>>>
Clearly the result is not intuitive.
Clearly,
e.shape == (3, 2, 2)
Sum over an axis is a reduction operation so the specified axis disappears. Hence,
e.sum(axis=0).shape == (2, 2)
e.sum(axis=1).shape == (3, 2)
e.sum(axis=2).shape == (3, 2)
Intuitively, we are "squashing" the array along the chosen axis, and summing the numbers that get squashed together.
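A short sketch of the "squashing" on the array e from the question; the keepdims=True call is an addition here, used to show the squashed axis kept with length 1.

```python
import numpy as np

e = np.array([[[1, 0], [0, 0]],
              [[1, 1], [1, 0]],
              [[1, 0], [0, 1]]])

print(e.shape)                             # (3, 2, 2)
print(e.sum(axis=0).shape)                 # (2, 2)
print(e.sum(axis=0, keepdims=True).shape)  # (1, 2, 2)

# Summing over axis 0 adds the three 2x2 blocks element-wise.
assert (e.sum(axis=0) == e[0] + e[1] + e[2]).all()
```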
To understand the axis intuitively, refer to the picture below (source: Physics Dept, Cornell Uni)
The shape of the (boolean) array in the above figure is shape=(8, 3). ndarray.shape will return a tuple where the entries correspond to the length of the particular dimension. In our example, 8 corresponds to length of axis 0 whereas 3 corresponds to length of axis 1.
There are good answers with visualizations; however, it might help to think purely from an analytical perspective.
You can create an array of arbitrary dimension with numpy.
For example, here's a 5-dimensional array:
>>> a = np.random.rand(2, 3, 4, 5, 6)
>>> a.shape
(2, 3, 4, 5, 6)
You can access any element of this array by specifying indices. For example, here's the first element of this array:
>>> a[0, 0, 0, 0, 0]
0.0038908603263844155
Now if you leave one of the dimensions free (using a slice), you get all the elements along that dimension:
>>> a[0, 0, :, 0, 0]
array([0.00389086, 0.27394775, 0.26565889, 0.62125279])
When you apply a function like sum with the axis parameter, that dimension gets eliminated and an array with one fewer dimension than the original is created. For each cell in the new array, the operator gets the list of elements along that axis and applies the reduction function to produce a scalar.
>>> np.sum(a, axis=2).shape
(2, 3, 5, 6)
Now you can check that the first element of this array is sum of above elements:
>>> np.sum(a, axis=2)[0, 0, 0, 0]
1.1647502999560164
>>> a[0, 0, :, 0, 0].sum()
1.1647502999560164
axis=None has the special meaning of flattening the array and applying the function to all numbers.
Now you can think about more complex cases where axis is not just a number but a tuple:
>>> np.sum(a, axis=(2,3)).shape
(2, 3, 6)
Note that we can use the same technique to figure out how this reduction was done:
>>> np.sum(a, axis=(2,3))[0,0,0]
7.889432081931909
>>> a[0, 0, :, :, 0].sum()
7.88943208193191
You can also use the same reasoning for adding a dimension to an array instead of reducing one:
>>> x = np.random.rand(3, 4)
>>> y = np.random.rand(3, 4)
# New dimension is created on specified axis
>>> np.stack([x, y], axis=2).shape
(3, 4, 2)
>>> np.stack([x, y], axis=0).shape
(2, 3, 4)
# To retrieve item i in stack set i in that axis
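The last comment can be made concrete with a small sketch: fixing index i along the new axis gives back the i-th stacked array. np.take is used here as one way to index along an arbitrary axis; the seeded generator is only there to make the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3, 4))
y = rng.random((3, 4))

stacked_last = np.stack([x, y], axis=2)   # shape (3, 4, 2)
stacked_first = np.stack([x, y], axis=0)  # shape (2, 3, 4)

# Index 0 along the stacking axis recovers x, index 1 recovers y.
assert np.array_equal(stacked_last[:, :, 0], x)
assert np.array_equal(stacked_first[1], y)
assert np.array_equal(np.take(stacked_last, 1, axis=2), y)
```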
Hope this gives you generic and full understanding of this important parameter.
Some answers are too specific or do not address the main source of confusion. This answer attempts to provide a more general but simple explanation of the concept, with a simple example.
The main source of confusion is related to expressions such as "Axis along which the means are computed", which is the documentation of the argument axis of the numpy.mean function. What the heck does "along which" even mean here? "Along which" essentially means that you will sum the rows (and divide by the number of rows, given that we are computing the mean) if the axis is 0, and the columns if the axis is 1. When the axis is 0 (or 1), the rows (or columns) can be scalars, vectors, or even other multi-dimensional arrays.
In [1]: import numpy as np
In [2]: a=np.array([[1, 2], [3, 4]])
In [3]: a
Out[3]:
array([[1, 2],
[3, 4]])
In [4]: np.mean(a, axis=0)
Out[4]: array([2., 3.])
In [5]: np.mean(a, axis=1)
Out[5]: array([1.5, 3.5])
So, in the example above, np.mean(a, axis=0) returns array([2., 3.]) because (1 + 3)/2 = 2 and (2 + 4)/2 = 3. It returns an array of two numbers because it returns the mean of the rows for each column (and there are two columns).
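The same arithmetic can be spelled out as a sketch: axis=0 combines the rows a[0] and a[1], while axis=1 combines the columns a[:, 0] and a[:, 1].

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

mean_axis0 = (a[0] + a[1]) / 2        # rows averaged together: [2., 3.]
mean_axis1 = (a[:, 0] + a[:, 1]) / 2  # columns averaged together: [1.5, 3.5]

assert np.allclose(mean_axis0, np.mean(a, axis=0))
assert np.allclose(mean_axis1, np.mean(a, axis=1))
```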
Both the 1st and 2nd replies are great for understanding the ndarray concept in numpy. I am giving a simple example.
And according to this image by #debaonline4u
https://i.stack.imgur.com/O5hBF.jpg
Suppose you have a 2D array -
[1, 2, 3]
[4, 5, 6]
In numpy format it will be -
c = np.array([[1, 2, 3],
[4, 5, 6]])
Now,
c.ndim = 2 (number of axes: axis0 and axis1)
c.shape = (2,3) (axis0, axis1)
c.sum(axis=0) = [1+4, 2+5, 3+6] = [5, 7, 9] (sum of the i-th elements of each row, so along axis0)
c.sum(axis=1) = [1+2+3, 4+5+6] = [6, 15] (sum of the elements in a row, so along axis1)
So for your 3D array,
