Select elements row-wise based on single array - python

Say I have an array d of shape (N, T), from which I need to select elements using an index array of shape (N,), where the first entry is the column index to select from the first row, and so on. How would I do that?
For example
>>> d
Out[748]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])
>>> index
Out[752]: array([5, 6, 1], dtype=int64)
Expected Output:
array([[5],
       [6],
       [2]])
Which is an array containing the element at index 5 of the first row, the element at index 6 of the second row, and the element at index 1 of the third row.
Update
Since I will be working with a much larger N, I was interested in the speed of the different methods for higher N. With N = 30000:
>>> %timeit np.diag(e.take(index2, axis=1)).reshape(N*3, 1)
1 loops, best of 3: 3.9 s per loop
>>> %timeit e.ravel()[np.arange(e.shape[0])*e.shape[1]+index2].reshape(N*3, 1)
1000 loops, best of 3: 287 µs per loop
Finally, you suggest reshape(). As I want to keep it as general as possible (without hard-coding N), I instead use [:, np.newaxis] - it seems to increase the duration from 287 µs to 288 µs, which I'll take :)

This might be ugly but more efficient:
>>> d.ravel()[np.arange(d.shape[0])*d.shape[1]+index]
array([5, 6, 2])
Edit
As pointed out by @deinonychusaur, the statement above can be written as cleanly as:
d[np.arange(index.size), index]
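For completeness, a quick check (a sketch using the example arrays from the question) that this reproduces the expected output:
import numpy as np

d = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
              [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
              [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
index = np.array([5, 6, 1])

# Row i is paired with column index[i]
print(d[np.arange(index.size), index])                  # [5 6 2]
print(d[np.arange(index.size), index][:, np.newaxis])   # column vector of shape (3, 1)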

There might be nicer ways, but a combo of take, diag and reshape would do:
In [137]: np.diag(d.take(index, axis=1)).reshape(3, 1)
Out[137]:
array([[5],
       [6],
       [2]])
EDIT
Comparison with @Emanuele Paolini's alternative, adding reshape to it to match the sought output:
In [142]: %timeit d.reshape(d.size)[np.arange(d.shape[0])*d.shape[1]+index].reshape(3, 1)
100000 loops, best of 3: 9.51 µs per loop
In [143]: %timeit np.diag(d.take(index, axis=1)).reshape(3, 1)
100000 loops, best of 3: 3.81 µs per loop
In [146]: %timeit d.ravel()[np.arange(d.shape[0])*d.shape[1]+index].reshape(3, 1)
100000 loops, best of 3: 8.56 µs per loop
This method is about twice as fast as both proposed alternatives.
EDIT 2: An even better method
Based on @Emanuele Paolini's version, but with a reduced number of operations, this outperforms everything above on large arrays (10k rows by 100 columns).
In [199]: %timeit d[(np.arange(index.size), index)].reshape(index.size, 1)
1000 loops, best of 3: 364 µs per loop
In [200]: %timeit d.ravel()[np.arange(d.shape[0])*d.shape[1]+index].reshape(index.size, 1)
100 loops, best of 3: 5.22 ms per loop
So if speed is of the essence:
d[(np.arange(index.size), index)].reshape(index.size, 1)
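If you want this packaged up, here is a small helper (a sketch, not from the answers above; the name select_rowwise is made up) that works for any N and returns the (N, 1) column shape from the question:
def select_rowwise(d, index):
    # Pick element index[i] from row i of d; index must have one entry per row.
    rows = np.arange(index.size)
    return d[rows, index][:, np.newaxis]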


How could I get numpy array indices by some conditions

I come to a problem like this:
suppose I have arrays like this:
a = np.array([[1,2,3,4,5,4,3,2,1],])
label = np.array([[1,0,1,0,0,1,1,0,1],])
I need to obtain the index of a at which label is 1 and the value of a is the largest among all positions where label is 1.
That may be confusing, so for the above example: the indices where label is 1 are 0, 2, 5, 6, 8, and their corresponding values in a are 1, 3, 4, 3, 1. Among these, 4 is the largest, so the result I need is 5, the index of 4 in a. How could I do this with numpy?
Get the indices of the 1s, say as idx, index into a with it, take the argmax, and finally trace it back to the original positions by indexing into idx -
idx = np.flatnonzero(label==1)
out = idx[a[idx].argmax()]
Sample run -
# Assuming inputs to be 1D
In [18]: a
Out[18]: array([1, 2, 3, 4, 5, 4, 3, 2, 1])
In [19]: label
Out[19]: array([1, 0, 1, 0, 0, 1, 1, 0, 1])
In [20]: idx = np.flatnonzero(label==1)
In [21]: idx[a[idx].argmax()]
Out[21]: 5
For a containing ints and label being an array of 0s and 1s, we could optimize further by scaling the labelled entries above everything else, based on the range of values in a, like so -
(label*(a.max()-a.min()+1) + a).argmax()
Furthermore, if a has positive numbers only, it would simplify to -
(label*(a.max()+1) + a).argmax()
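As a sanity check, here is a short sketch (assuming a holds non-negative ints and label is 0/1) comparing the scaled one-liner against the straightforward two-step version:
np.random.seed(1)
a = np.random.randint(0, 50, 1000)
label = np.random.randint(0, 2, 1000)

idx = np.flatnonzero(label == 1)
reference = idx[a[idx].argmax()]              # two-step answer
fast = (label * (a.max() + 1) + a).argmax()   # scaled one-liner
# Both point at a labelled position holding the maximal labelled value
assert label[fast] == 1 and a[fast] == a[reference]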
Timings for a largish a of positive ints -
In [115]: np.random.seed(0)
...: a = np.random.randint(0,10,(100000))
...: label = np.random.randint(0,2,(100000))
In [117]: %%timeit
...: idx = np.flatnonzero(label==1)
...: out = idx[a[idx].argmax()]
1000 loops, best of 3: 592 µs per loop
In [116]: %timeit (label*(a.max()-a.min()+1) + a).argmax()
1000 loops, best of 3: 357 µs per loop
# @coldspeed's soln
In [120]: %timeit np.ma.masked_where(~label.astype(bool), a).argmax()
1000 loops, best of 3: 1.63 ms per loop
# won't work with negative numbers in a
In [119]: %timeit (label*(a.max()+1) + a).argmax()
1000 loops, best of 3: 292 µs per loop
# @klim's soln (won't work with negative numbers in a)
In [121]: %timeit np.argmax(a * (label == 1))
1000 loops, best of 3: 229 µs per loop
You can use masked arrays:
>>> np.ma.masked_where(~label.astype(bool), a).argmax()
5
Here is one of the simplest ways.
>>> np.argmax(a * (label == 1))
5
>>> np.argmax(a * (label == 1), axis=1)
array([5])
Coldspeed's method may take more time.
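If a can contain negative values and you want to avoid masked arrays, one further option (a sketch, not from the answers above; it assumes label contains at least one 1) is to sink the unlabelled positions below the minimum of a before taking argmax:
# Unlabelled entries become strictly smaller than every element of a,
# so argmax can only land on a position where label is 1.
np.argmax(np.where(label == 1, a, a.min() - 1))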

Indexing numpy array by a numpy array of coordinates

Suppose we have
an n-dimensional numpy.array A
a numpy.array B with dtype=int and shape of (n, m)
How do I index A by B so that the result is an array of shape (m,), with values taken from the positions indicated by the columns of B?
For example, consider this code that does what I want when B is a python list:
>>> a = np.arange(27).reshape(3,3,3)
>>> a[[0, 1, 2], [0, 0, 0], [1, 1, 2]]
array([ 1, 10, 20]) # the result we're after
>>> bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
>>> a[bl]
array([ 1, 10, 20]) # also works when indexing with a python list
>>> a[bl].shape
(3,)
However, when B is a numpy array, the result is different:
>>> b = np.array(bl)
>>> a[b].shape
(3, 3, 3, 3)
Now, I can get the desired result by casting B into a tuple, but surely that cannot be the proper/idiomatic way to do it?
>>> a[tuple(b)]
array([ 1, 10, 20])
Is there a numpy function to achieve the same without casting B to a tuple?
One alternative would be to convert to linear indices and then index with np.take, or index into the flattened version via .flat -
np.take(a,np.ravel_multi_index(b, a.shape))
a.flat[np.ravel_multi_index(b, a.shape)]
Custom np.ravel_multi_index for performance boost
We could implement a custom version to simulate the behaviour of np.ravel_multi_index to boost the performance, like so -
def ravel_index(b, shp):
    return np.concatenate((np.asarray(shp[1:])[::-1].cumprod()[::-1], [1])).dot(b)
Using it, the desired output would be found in two ways -
np.take(a,ravel_index(b, a.shape))
a.flat[ravel_index(b, a.shape)]
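As a quick check (a sketch using the arrays from the question), the custom ravel_index agrees with np.ravel_multi_index:
a = np.arange(27).reshape(3, 3, 3)
b = np.array([[0, 1, 2], [0, 0, 0], [1, 1, 2]])

assert (ravel_index(b, a.shape) == np.ravel_multi_index(b, a.shape)).all()
print(a.flat[ravel_index(b, a.shape)])   # [ 1 10 20]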
Benchmarking
Additionally incorporating the tuple-based method from the question and the map-based one from @Kanak's post.
Case #1 : dims = 3
In [23]: a = np.random.randint(0,9,([20]*3))
In [24]: b = np.random.randint(0,20,(a.ndim,1000000))
In [25]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
100 loops, best of 3: 6.56 ms per loop
100 loops, best of 3: 6.58 ms per loop
100 loops, best of 3: 6.95 ms per loop
100 loops, best of 3: 9.17 ms per loop
100 loops, best of 3: 6.31 ms per loop
100 loops, best of 3: 8.52 ms per loop
Case #2 : dims = 6
In [29]: a = np.random.randint(0,9,([10]*6))
In [30]: b = np.random.randint(0,10,(a.ndim,1000000))
In [31]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 40 ms per loop
10 loops, best of 3: 20 ms per loop
10 loops, best of 3: 29.9 ms per loop
100 loops, best of 3: 15.7 ms per loop
10 loops, best of 3: 25.8 ms per loop
Case #3 : dims = 10
In [32]: a = np.random.randint(0,9,([4]*10))
In [33]: b = np.random.randint(0,4,(a.ndim,1000000))
In [34]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
10 loops, best of 3: 60.7 ms per loop
10 loops, best of 3: 60.1 ms per loop
10 loops, best of 3: 27.8 ms per loop
10 loops, best of 3: 38 ms per loop
100 loops, best of 3: 18.7 ms per loop
10 loops, best of 3: 29.3 ms per loop
So, it makes sense to look for alternatives when working with higher-dimensional inputs and with large data.
Another alternative that fits your need involves the use of np.ravel
>>> a[map(np.ravel, b)]
array([ 1, 10, 20])
It is, however, not fully numpy-based.
Performance concerns (updated following the comments below): be that as it may, your approach is better than mine, but not better than any of @Divakar's.
import numpy as np
import timeit
a = np.arange(27).reshape(3,3,3)
bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
b = np.array(bl)
imps = "from __main__ import np,a,b"
reps = 100000
tup_cas_t = timeit.Timer("a[tuple(b)]", imps).timeit(reps)
map_rav_t = timeit.Timer("a[map(np.ravel, b)]", imps).timeit(reps)
fla_rp1_t = timeit.Timer("np.take(a,np.ravel_multi_index(b, a.shape))", imps).timeit(reps)
fla_rp2_t = timeit.Timer("a.flat[np.ravel_multi_index(b, a.shape)]", imps).timeit(reps)
print(tup_cas_t/map_rav_t)  ## 0.505382211881
print(tup_cas_t/fla_rp1_t)  ## 1.18185817386
print(tup_cas_t/fla_rp2_t)  ## 1.71288705886
Are you looking for numpy.ndarray.tolist() ?
>>> a = np.arange(27).reshape(3,3,3)
>>> bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
>>> b = np.array(bl)
>>> a[b.tolist()]
array([ 1, 10, 20])
Or use arrays indexing arrays, which is quite similar to list indexing:
>>> a[np.array([0, 1, 2]), np.array([0, 0, 0]), np.array([1, 1, 2])]
array([ 1, 10, 20])
However, as you can see from the previous link, indexing an array a directly with an array b means you are indexing only the first axis of a with your whole b array, which can lead to confusing output.
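A short sketch makes the point concrete: indexing a directly with the 2D array b only selects along axis 0, so the trailing axes of a are kept whole:
a = np.arange(27).reshape(3, 3, 3)
b = np.array([[0, 1, 2], [0, 0, 0], [1, 1, 2]])

print(a[b].shape)                        # (3, 3, 3, 3) - b indexes axis 0 only
print(np.array_equal(a[b], a[b, :, :]))  # True - the remaining axes are taken whole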

flatten list of lists and scalars [duplicate]

This question already has answers here:
Flatten an irregular (arbitrarily nested) list of lists
So for a matrix, we have methods like numpy.flatten()
np.array([[1,2,3],[4,5,6],[7,8,9]]).flatten()
gives [1,2,3,4,5,6,7,8,9]
what if I wanted to get from np.array([[1,2,3],[4,5,6],7]) to [1,2,3,4,5,6,7]?
Is there an existing function that performs something like that?
With uneven lists, the array has object dtype (and is 1d, so flatten doesn't change it):
In [96]: arr=np.array([[1,2,3],[4,5,6],7])
In [97]: arr
Out[97]: array([[1, 2, 3], [4, 5, 6], 7], dtype=object)
In [98]: arr.sum()
...
TypeError: can only concatenate list (not "int") to list
The 7 element is causing problems. If I change it to a list:
In [99]: arr=np.array([[1,2,3],[4,5,6],[7]])
In [100]: arr.sum()
Out[100]: [1, 2, 3, 4, 5, 6, 7]
I'm using a trick here: the elements of the array are lists, and for lists [1,2,3] + [4,5,6] is concatenation.
The basic point is that an object array is not a 2d array. It is, in many ways, more like a list of lists.
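To make that concrete (a small sketch; newer NumPy needs dtype=object spelled out for ragged input):
arr = np.array([[1, 2, 3], [4, 5, 6], 7], dtype=object)
print(arr.shape, arr.dtype)          # (3,) object - one slot per Python object
print(type(arr[0]), type(arr[2]))    # <class 'list'> <class 'int'>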
chain
The best list flattener is chain
In [104]: list(itertools.chain(*arr))
Out[104]: [1, 2, 3, 4, 5, 6, 7]
though it too will choke on the integer 7 version.
concatenate and hstack
If the array is a list of lists (not the original mix of lists and a scalar), then np.concatenate works. It iterates over the object just as though it were a list.
With the original mixed list, concatenate does not work, but hstack does:
In [178]: arr=np.array([[1,2,3],[4,5,6],7])
In [179]: np.concatenate(arr)
...
ValueError: all the input arrays must have same number of dimensions
In [180]: np.hstack(arr)
Out[180]: array([1, 2, 3, 4, 5, 6, 7])
That's because hstack first iterates through the list and makes sure all elements are at least 1d (via np.atleast_1d). This extra iteration makes it more robust, but at a cost in processing speed.
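A rough sketch of what that amounts to with the mixed input (not hstack's actual source, just the effective behaviour): the scalar 7 is promoted to a 1-element array before the pieces are concatenated:
arr = np.array([[1, 2, 3], [4, 5, 6], 7], dtype=object)
pieces = [np.atleast_1d(x) for x in arr]   # 7 becomes array([7])
print(np.concatenate(pieces))              # [1 2 3 4 5 6 7]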
time tests
In [170]: big1=arr.repeat(1000)
In [171]: timeit big1.sum()
10 loops, best of 3: 31.6 ms per loop
In [172]: timeit list(itertools.chain(*big1))
1000 loops, best of 3: 433 µs per loop
In [173]: timeit np.concatenate(big1)
100 loops, best of 3: 5.05 ms per loop
double the size
In [174]: big1=arr.repeat(2000)
In [175]: timeit big1.sum()
10 loops, best of 3: 128 ms per loop
In [176]: timeit list(itertools.chain(*big1))
1000 loops, best of 3: 803 µs per loop
In [177]: timeit np.concatenate(big1)
100 loops, best of 3: 9.93 ms per loop
In [182]: timeit np.hstack(big1) # the extra iteration hurts hstack speed
10 loops, best of 3: 43.1 ms per loop
The sum is quadratic in size, because it effectively does
res = []
for e in bigarr:
    res += e
and res grows with every e, so each iteration step becomes more expensive.
chain has the best times.
You can write a custom flatten function using yield:
def flatten(arr):
    for i in arr:
        try:
            yield from flatten(i)
        except TypeError:
            yield i
Usage example:
>>> myarr = np.array([[1,2,3],[4,5,6],7])
>>> newarr = list(flatten(myarr))
>>> newarr
[1, 2, 3, 4, 5, 6, 7]
You can use apply_along_axis here
>>> arr = np.array([[1,2,3],[4,5,6],[7]])
>>> np.apply_along_axis(np.concatenate, 0, arr)
array([1, 2, 3, 4, 5, 6, 7])
As a bonus, this is not quadratic in the number of lists either.

More Pythonic/Pandaic approach to looping over a pandas Series

This is most likely something very basic, but I can't figure it out.
Suppose that I have a Series like this:
s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
How can I do operations on sub-series of this Series without having to revert to using a for-loop?
Suppose, for example, that I want to turn it into a new Series that contains four elements. The first element in this new Series is the sum of the first three elements in the original Series (1, 1, 1), the second the sum of the second three (2, 2, 2), etc.:
s2 = pd.Series([3, 6, 9, 12])
How can I do this?
You could also use np.add.reduceat by specifying slice boundaries at every 3rd element and computing the sum of each slice:
>>> pd.Series(np.add.reduceat(s1.values, np.arange(0, s1.shape[0], 3)))
0 3
1 6
2 9
3 12
dtype: int64
Timings:
arr = np.repeat(np.arange(10**5), 3)
s = pd.Series(arr)
s.shape
(300000,)
# @IanS soln
%timeit s.rolling(3).sum()[2::3]
100 loops, best of 3: 15.6 ms per loop
# @Divakar soln
%timeit pd.Series(np.bincount(np.arange(s.size)//3, s))
100 loops, best of 3: 5.44 ms per loop
# @Nikolas Rieble soln
%timeit pd.Series(np.sum(np.array(s).reshape(len(s)/3,3), axis = 1))
100 loops, best of 3: 2.17 ms per loop
# @Nikolas Rieble modified soln
%timeit pd.Series(np.sum(np.array(s).reshape(-1, 3), axis=1))
100 loops, best of 3: 2.15 ms per loop
# @Divakar modified soln
%timeit pd.Series(s.values.reshape(-1,3).sum(1))
1000 loops, best of 3: 1.62 ms per loop
# Proposed solution in post
%timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
1000 loops, best of 3: 1.45 ms per loop
Here's a NumPy approach using np.bincount that handles a generic number of elements -
pd.Series(np.bincount(np.arange(s1.size)//3, s1))
Sample run -
In [42]: s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 9, 5])
In [43]: pd.Series(np.bincount(np.arange(s1.size)//3, s1))
Out[43]:
0 3.0
1 6.0
2 9.0
3 12.0
4 14.0
dtype: float64
If we are really craving performance, and for the case when the length of the series is divisible by the window length, we can get a view into the series with .values, then reshape and finally use np.einsum for the summation, like so -
pd.Series(np.einsum('ij->i',s.values.reshape(-1,3)))
Timings with the same benchmark dataset as used in @Nickil Maveli's post -
In [140]: s = pd.Series(np.repeat(np.arange(10**5), 3))
# @Nickil Maveli's soln
In [141]: %timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
100 loops, best of 3: 2.07 ms per loop
# Using views+sum
In [142]: %timeit pd.Series(s.values.reshape(-1,3).sum(1))
100 loops, best of 3: 2.03 ms per loop
# Using views+einsum
In [143]: %timeit pd.Series(np.einsum('ij->i',s.values.reshape(-1,3)))
1000 loops, best of 3: 1.04 ms per loop
You could reshape the series s1 using numpy and then sum over the rows, like so:
np.sum(np.array(s1).reshape(len(s1)//3, 3), axis=1)
which results in
array([ 3, 6, 9, 12], dtype=int64)
EDIT: as MSeifert mentioned in his comment, you can also let numpy compute the length such as:
np.sum(np.array(s1).reshape(-1, 3), axis=1)
This computes the rolling sum:
s1.rolling(3).sum()
You simply need to select every third element:
s1.rolling(3).sum()[2::3]
Output:
2 3.0
5 6.0
8 9.0
11 12.0
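If the length of the Series is not guaranteed to be a multiple of the window, a small helper based on np.add.reduceat (a sketch, not from the answers above; the name chunked_sum is made up) also sums the final, shorter chunk:
import numpy as np
import pandas as pd

def chunked_sum(s, width):
    # Sum consecutive chunks of `width` elements; the last chunk may be shorter.
    starts = np.arange(0, len(s), width)
    return pd.Series(np.add.reduceat(s.values, starts))

chunked_sum(pd.Series([1, 1, 1, 2, 2, 2, 3, 3]), 3)   # 3, 6, 6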

Fastest way to mix arrays in numpy?

a= array([1,3,5,7,9])
b= array([2,4,6,8,10])
I want to interleave pairs of arrays so that their elements alternate one by one.
Example: using a and b, it should result in
c= array([1,2,3,4,5,6,7,8,9,10])
I need to do that for pairs of long arrays (more than one hundred elements) over thousands of sequences. Any smarter ideas than picking elements one by one from each array?
thanks
c = np.empty(len(a)+len(b), dtype=a.dtype)
c[::2] = a
c[1::2] = b
(That assumes a and b have the same dtype.)
You asked for the fastest, so here's a timing comparison (vstack, ravel and empty are all numpy functions):
In [40]: a = np.random.randint(0, 10, size=150)
In [41]: b = np.random.randint(0, 10, size=150)
In [42]: %timeit vstack((a,b)).T.flatten()
100000 loops, best of 3: 5.6 µs per loop
In [43]: %timeit ravel([a, b], order='F')
100000 loops, best of 3: 3.1 µs per loop
In [44]: %timeit c = empty(len(a)+len(b), dtype=a.dtype); c[::2] = a; c[1::2] = b
1000000 loops, best of 3: 1.94 µs per loop
With vstack((a,b)).T.flatten(), a and b are copied to create vstack((a,b)), and then the data is copied again by the flatten() method.
ravel([a, b], order='F') is implemented as asarray([a, b]).ravel(order), which requires copying a and b, and then copying the result to create an array with order='F'. (If you do just ravel([a, b]), it is about the same speed as my answer, because it doesn't have to copy the data again. Unfortunately, order='F' is needed to get the alternating pattern.)
So the other two methods copy the data twice. In my version, each array is copied once.
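Since the question mentions thousands of sequences, a batched variant (a sketch assuming the pairs all have the same length and are stacked into 2D arrays A and B of shape (n_pairs, length) with matching dtype) applies the same slice-assignment idea to all pairs at once:
def interleave_pairs(A, B):
    # Returns an (n_pairs, 2*length) array whose columns alternate A and B.
    out = np.empty((A.shape[0], A.shape[1] + B.shape[1]), dtype=A.dtype)
    out[:, ::2] = A
    out[:, 1::2] = B
    return out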
This'll do it:
vstack((a,b)).T.flatten()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Using numpy.ravel:
>>> np.ravel([a, b], order='F')
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
