This is most likely something very basic, but I can't figure it out.
Suppose that I have a Series like this:
s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
How can I do operations on sub-series of this Series without having to resort to a for-loop?
Suppose, for example, that I want to turn it into a new Series that contains four elements. The first element in this new Series is the sum of the first three elements in the original Series (1, 1, 1), the second the sum of the second three (2, 2, 2), etc.:
s2 = pd.Series([3, 6, 9, 12])
How can I do this?
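For reference, a plain pandas sketch of this grouped sum (not taken from the answers below) uses groupby on the integer-divided index, assuming the Series keeps its default 0-based RangeIndex:
import pandas as pd
s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
# Positions 0-2, 3-5, ... share the same key after integer division by 3.
s2 = s1.groupby(s1.index // 3).sum()
print(s2.tolist())  # [3, 6, 9, 12]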
You could also use np.add.reduceat, specifying slice boundaries at every 3rd element so that each group of three is summed:
>>> pd.Series(np.add.reduceat(s1.values, np.arange(0, s1.shape[0], 3)))
0 3
1 6
2 9
3 12
dtype: int64
Timings:
arr = np.repeat(np.arange(10**5), 3)
s = pd.Series(arr)
s.shape
(300000,)
# @IanS soln
%timeit s.rolling(3).sum()[2::3]
100 loops, best of 3: 15.6 ms per loop
# @Divakar soln
%timeit pd.Series(np.bincount(np.arange(s.size)//3, s))
100 loops, best of 3: 5.44 ms per loop
# @Nikolas Rieble soln
%timeit pd.Series(np.sum(np.array(s).reshape(len(s)/3,3), axis = 1))
100 loops, best of 3: 2.17 ms per loop
# @Nikolas Rieble modified soln
%timeit pd.Series(np.sum(np.array(s).reshape(-1, 3), axis=1))
100 loops, best of 3: 2.15 ms per loop
# @Divakar modified soln
%timeit pd.Series(s.values.reshape(-1,3).sum(1))
1000 loops, best of 3: 1.62 ms per loop
# Proposed solution in post
%timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
1000 loops, best of 3: 1.45 ms per loop
Here's a NumPy approach using np.bincount that handles a generic number of elements (the last group may be shorter) -
pd.Series(np.bincount(np.arange(s1.size)//3, s1))
Sample run -
In [42]: s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 9, 5])
In [43]: pd.Series(np.bincount(np.arange(s1.size)//3, s1))
Out[43]:
0 3.0
1 6.0
2 9.0
3 12.0
4 14.0
dtype: float64
If we really crave performance, and the length of the series is divisible by the window length, we can get a view into the series' underlying array with s1.values, reshape it, and finally use np.einsum for the summation, like so -
pd.Series(np.einsum('ij->i',s1.values.reshape(-1,3)))
Timings with the same benchmark dataset as used in @Nickil Maveli's post -
In [140]: s = pd.Series(np.repeat(np.arange(10**5), 3))
# @Nickil Maveli's soln
In [141]: %timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
100 loops, best of 3: 2.07 ms per loop
# Using views+sum
In [142]: %timeit pd.Series(s.values.reshape(-1,3).sum(1))
100 loops, best of 3: 2.03 ms per loop
# Using views+einsum
In [143]: %timeit pd.Series(np.einsum('ij->i',s.values.reshape(-1,3)))
1000 loops, best of 3: 1.04 ms per loop
You could reshape the series s1 using NumPy and then sum over the rows, like so:
np.sum(np.array(s1).reshape(len(s1)//3, 3), axis=1)
which results in
array([ 3, 6, 9, 12], dtype=int64)
EDIT: as MSeifert mentioned in his comment, you can also let NumPy infer the length, like so:
np.sum(np.array(s1).reshape(-1, 3), axis=1)
This computes the rolling sum:
s1.rolling(3).sum()
You simply need to select every third element:
s1.rolling(3).sum()[2::3]
Output:
2 3.0
5 6.0
8 9.0
11 12.0
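If you also want the clean 0-to-3 index from the question (this extra step is my addition, not part of the original answer), reset the index afterwards:
s1.rolling(3).sum()[2::3].reset_index(drop=True)
# 0     3.0
# 1     6.0
# 2     9.0
# 3    12.0
# dtype: float64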
I have come across a problem like this:
Suppose I have arrays like this:
a = np.array([[1,2,3,4,5,4,3,2,1],])
label = np.array([[1,0,1,0,0,1,1,0,1],])
I need to obtain the index of a at which label is 1 and the value of a is the largest among all positions where label is 1.
It may be confusing, so here is the above example worked out: the indices where label is 1 are 0, 2, 5, 6, 8, and their corresponding values in a are 1, 3, 4, 3, 1, among which 4 is the largest; thus the expected result is 5, which is the index of the value 4 in a. How could I do this with numpy?
Get the indices of the 1s, say as idx, index into a with it, take the argmax of that subset, and finally trace it back to the original positions by indexing into idx -
idx = np.flatnonzero(label==1)
out = idx[a[idx].argmax()]
Sample run -
# Assuming inputs to be 1D
In [18]: a
Out[18]: array([1, 2, 3, 4, 5, 4, 3, 2, 1])
In [19]: label
Out[19]: array([1, 0, 1, 0, 0, 1, 1, 0, 1])
In [20]: idx = np.flatnonzero(label==1)
In [21]: idx[a[idx].argmax()]
Out[21]: 5
For a as ints and label as an array of 0s and 1s, we could optimize further by adding an offset larger than a's full range wherever label is 1, so that the labelled positions always dominate the argmax, like so -
(label*(a.max()-a.min()+1) + a).argmax()
Furthermore, if a has positive numbers only, it would simplify to -
(label*(a.max()+1) + a).argmax()
Timings on a largish a of positive ints -
In [115]: np.random.seed(0)
...: a = np.random.randint(0,10,(100000))
...: label = np.random.randint(0,2,(100000))
In [117]: %%timeit
...: idx = np.flatnonzero(label==1)
...: out = idx[a[idx].argmax()]
1000 loops, best of 3: 592 µs per loop
In [116]: %timeit (label*(a.max()-a.min()+1) + a).argmax()
1000 loops, best of 3: 357 µs per loop
# @coldspeed's soln
In [120]: %timeit np.ma.masked_where(~label.astype(bool), a).argmax()
1000 loops, best of 3: 1.63 ms per loop
# won't work with negative numbers in a
In [119]: %timeit (label*(a.max()+1) + a).argmax()
1000 loops, best of 3: 292 µs per loop
# @klim's soln (won't work with negative numbers in a)
In [121]: %timeit np.argmax(a * (label == 1))
1000 loops, best of 3: 229 µs per loop
You can use masked arrays:
>>> np.ma.masked_where(~label.astype(bool), a).argmax()
5
Here is one of the simplest ways.
>>> np.argmax(a * (label == 1))
5
>>> np.argmax(a * (label == 1), axis=1)
array([5])
Coldspeed's method may take more time.
Suppose we have
an n-dimensional numpy.array A
a numpy.array B with dtype=int and shape of (n, m)
How do I index A by B so that the result is an array of shape (m,), with values taken from the positions indicated by the columns of B?
For example, consider this code that does what I want when B is a python list:
>>> a = np.arange(27).reshape(3,3,3)
>>> a[[0, 1, 2], [0, 0, 0], [1, 1, 2]]
array([ 1, 10, 20]) # the result we're after
>>> bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
>>> a[bl]
array([ 1, 10, 20]) # also works when indexing with a python list
>>> a[bl].shape
(3,)
However, when B is a numpy array, the result is different:
>>> b = np.array(bl)
>>> a[b].shape
(3, 3, 3, 3)
Now, I can get the desired result by casting B into a tuple, but surely that cannot be the proper/idiomatic way to do it?
>>> a[tuple(b)]
array([ 1, 10, 20])
Is there a numpy function to achieve the same without casting B to a tuple?
One alternative would be to convert to linear indices and then index with np.take, or index into the flattened version of a -
np.take(a,np.ravel_multi_index(b, a.shape))
a.flat[np.ravel_multi_index(b, a.shape)]
Custom np.ravel_multi_index for performance boost
We could implement a custom version to simulate the behaviour of np.ravel_multi_index to boost the performance, like so -
def ravel_index(b, shp):
    return np.concatenate((np.asarray(shp[1:])[::-1].cumprod()[::-1],[1])).dot(b)
Using it, the desired output would be found in two ways -
np.take(a,ravel_index(b, a.shape))
a.flat[ravel_index(b, a.shape)]
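As a quick sanity check (my addition, reusing the arrays from the question), the custom helper agrees with np.ravel_multi_index:
import numpy as np
a = np.arange(27).reshape(3,3,3)
b = np.array([[0, 1, 2], [0, 0, 0], [1, 1, 2]])
print(np.ravel_multi_index(b, a.shape))  # [ 1 10 20]
print(ravel_index(b, a.shape))           # [ 1 10 20]
print(a.flat[ravel_index(b, a.shape)])   # [ 1 10 20]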
Benchmarking
Additionally incorporating the tuple-based method from the question and the map-based one from @Kanak's post.
Case #1 : dims = 3
In [23]: a = np.random.randint(0,9,([20]*3))
In [24]: b = np.random.randint(0,20,(a.ndim,1000000))
In [25]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
100 loops, best of 3: 6.56 ms per loop
100 loops, best of 3: 6.58 ms per loop
100 loops, best of 3: 6.95 ms per loop
100 loops, best of 3: 9.17 ms per loop
100 loops, best of 3: 6.31 ms per loop
100 loops, best of 3: 8.52 ms per loop
Case #2 : dims = 6
In [29]: a = np.random.randint(0,9,([10]*6))
In [30]: b = np.random.randint(0,10,(a.ndim,1000000))
In [31]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 40 ms per loop
10 loops, best of 3: 20 ms per loop
10 loops, best of 3: 29.9 ms per loop
100 loops, best of 3: 15.7 ms per loop
10 loops, best of 3: 25.8 ms per loop
Case #3 : dims = 10
In [32]: a = np.random.randint(0,9,([4]*10))
In [33]: b = np.random.randint(0,4,(a.ndim,1000000))
In [34]: %timeit a[tuple(b)]
...: %timeit a[map(np.ravel, b)]
...: %timeit np.take(a,np.ravel_multi_index(b, a.shape))
...: %timeit a.flat[np.ravel_multi_index(b, a.shape)]
...: %timeit np.take(a,ravel_index(b, a.shape))
...: %timeit a.flat[ravel_index(b, a.shape)]
10 loops, best of 3: 60.7 ms per loop
10 loops, best of 3: 60.1 ms per loop
10 loops, best of 3: 27.8 ms per loop
10 loops, best of 3: 38 ms per loop
100 loops, best of 3: 18.7 ms per loop
10 loops, best of 3: 29.3 ms per loop
So, it makes sense to look for alternatives when working with higher-dimensional inputs and with large data.
Another alternative that fits your need involves the use of np.ravel
>>> a[map(np.ravel, b)]
array([ 1, 10, 20])
However, it is not fully NumPy-based.
Performance concerns.
Updated following the comments below.
Be that as it may, your approach is better than mine, but not better than any of @Divakar's.
import numpy as np
import timeit
a = np.arange(27).reshape(3,3,3)
bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
b = np.array(bl)
imps = "from __main__ import np,a,b"
reps = 100000
tup_cas_t = timeit.Timer("a[tuple(b)]", imps).timeit(reps)
map_rav_t = timeit.Timer("a[map(np.ravel, b)]", imps).timeit(reps)
fla_rp1_t = timeit.Timer("np.take(a,np.ravel_multi_index(b, a.shape))", imps).timeit(reps)
fla_rp2_t = timeit.Timer("a.flat[np.ravel_multi_index(b, a.shape)]", imps).timeit(reps)
print tup_cas_t/map_rav_t ## 0.505382211881
print tup_cas_t/fla_rp1_t ## 1.18185817386
print tup_cas_t/fla_rp2_t ## 1.71288705886
Are you looking for numpy.ndarray.tolist()?
>>> a = np.arange(27).reshape(3,3,3)
>>> bl = [[0, 1, 2], [0, 0, 0], [1, 1, 2]]
>>> b = np.array(bl)
>>> a[b.tolist()]
array([ 1, 10, 20])
Or use arrays to index arrays, which is quite similar to list indexing:
>>> a[np.array([0, 1, 2]), np.array([0, 0, 0]), np.array([1, 1, 2])]
array([ 1, 10, 20])
However, as you can see from the previous link, indexing an array a directly with an array b means you are indexing only along the first axis of a with the whole of b, which can lead to confusing output.
I want to generate a cyclic sequence of numbers like [A B C A B C ...] with arbitrary length N. I tried:
import numpy as np
def cyclic(N):
    x = np.array([1.0,2.0,3.0]) # The main sequence
    y = np.tile(x,N//3) # Repeats the sequence N//3 times
    return y
but the problem with my code is that if I enter an integer that isn't divisible by three, the result is shorter than the length N I expected. I know this is a very newbish question, but I really got stuck.
You can just use numpy.resize
x = np.array([1.0, 2.0, 3.0])
y = np.resize(x, 13)
y
Out[332]: array([ 1., 2., 3., 1., 2., 3., 1., 2., 3., 1., 2., 3., 1.])
WARNING: This answer does not extend to 2D, as resize flattens the array before repeating it.
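A quick illustration of that caveat (an example of my own, not from the original answer): with a 2D input, resize flattens the data and refills it row by row rather than repeating whole rows:
x2 = np.array([[1, 2], [3, 4]])
np.resize(x2, (2, 3))
# array([[1, 2, 3],
#        [4, 1, 2]])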
Approach #1 : Here's one approach to handle generic sequences, using modulus to generate the cyclic indices -
def cyclic_seq(x, N):
    return np.take(x, np.mod(np.arange(N),len(x)))
Approach #2 : For performance, here's another method that tiles to the smallest multiple of the sequence length covering N and then uses slicing to select the first N elements -
def cyclic_seq_v2(x, N):
    return np.tile(x, (N+len(x)-1)//len(x))[:N]  # ceil(N/len(x)) repetitions, then truncate
Sample runs -
In [81]: cyclic_seq([6,9,2,1,7],14)
Out[81]: array([6, 9, 2, 1, 7, 6, 9, 2, 1, 7, 6, 9, 2, 1])
In [82]: cyclic_seq_v2([6,9,2,1,7],14)
Out[82]: array([6, 9, 2, 1, 7, 6, 9, 2, 1, 7, 6, 9, 2, 1])
Runtime test
In [327]: x = np.random.randint(0,9,(3))
In [328]: %timeit np.resize(x, 10000) # @Daniel Forsman's solution
...: %timeit list(itertools.islice(itertools.cycle(x),10000)) # @Chris soln
...: %timeit cyclic_seq(x,10000) # Approach #1 from this post
...: %timeit cyclic_seq_v2(x,10000) # Approach #2 from this post
...:
1000 loops, best of 3: 296 µs per loop
10000 loops, best of 3: 185 µs per loop
10000 loops, best of 3: 120 µs per loop
10000 loops, best of 3: 28.7 µs per loop
In [329]: x = np.random.randint(0,9,(30))
In [330]: %timeit np.resize(x, 10000) # @Daniel Forsman's solution
...: %timeit list(itertools.islice(itertools.cycle(x),10000)) # @Chris soln
...: %timeit cyclic_seq(x,10000) # Approach #1 from this post
...: %timeit cyclic_seq_v2(x,10000) # Approach #2 from this post
...:
10000 loops, best of 3: 38.8 µs per loop
10000 loops, best of 3: 101 µs per loop
10000 loops, best of 3: 115 µs per loop
100000 loops, best of 3: 13.2 µs per loop
In [331]: %timeit np.resize(x, 100000) # @Daniel Forsman's solution
...: %timeit list(itertools.islice(itertools.cycle(x),100000)) # @Chris soln
...: %timeit cyclic_seq(x,100000) # Approach #1 from this post
...: %timeit cyclic_seq_v2(x,100000) # Approach #2 from this post
...:
1000 loops, best of 3: 297 µs per loop
1000 loops, best of 3: 942 µs per loop
1000 loops, best of 3: 1.13 ms per loop
10000 loops, best of 3: 88.3 µs per loop
On performance, approach #2 seems to be working quite well.
First over-allocate it (using math.ceil to round up the number of repetitions), then resize it to N after tiling:
import numpy as np
import math
def cyclic(N):
    x = np.array([1.0,2.0,3.0]) # The main sequence
    y = np.tile(x, math.ceil(N / 3.0))
    y = np.resize(y, N)
    return y
After taking Daniel Forsman's suggestion, it can be simplified as
import numpy as np
def cyclic(N):
    x = np.array([1.0,2.0,3.0]) # The main sequence
    y = np.resize(x, N)
    return y
because np.resize automatically tiles the input in 1D.
You can use itertools.cycle, an infinite iterator, for this:
>>> import itertools
>>> it = itertools.cycle([1,2,3])
>>> next(it)
1
>>> next(it)
2
>>> next(it)
3
>>> next(it)
1
To get a sequence of a specific length N, combine it with itertools.islice:
>>> list(itertools.islice(itertools.cycle([1,2,3]),11))
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]
EDIT: as you can see in Divakar's benchmark, this approach is generally intermediate in terms of speed compared to the other answers. I recommend this solution when you want an iterator returned rather than a list or numpy array.
You can use itertools.cycle for that.
In [3]: from itertools import cycle
In [4]: for x in cycle(['A','B','C']):
...: print(x)
...:
A
B
C
A
B
C
...
(the loop repeats forever until interrupted)
Edit:
If you want to implement it without loops, you are going to need recursive functions. Solutions based on itertools.cycle and the like are just hiding the loops behind the imported function.
In [5]: def repeater(arr, n):
   ...:     yield arr[0]
   ...:     yield arr[1]
   ...:     yield arr[2]
   ...:     if n == 0:
   ...:         return  # end the generator (yielding StopIteration would emit it as a value)
   ...:     else:
   ...:         yield from repeater(arr, n-1)
   ...:
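For example (this usage example is my addition), consuming the generator yields the base sequence n+1 times:
In [6]: list(repeater(['A', 'B', 'C'], 2))
Out[6]: ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']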
Say I have an array d of shape (N, T), from which I need to select elements using an index of shape (N,), where the first element of index is the column index for the first row, and so on. How would I do that?
For example
>>> d
Out[748]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
>>> index
Out[752]: array([5, 6, 1], dtype=int64)
Expected Output:
array([[5],
       [6],
       [2]])
This is an array containing the element at index 5 of the first row, the element at index 6 of the second row, and the element at index 1 of the third row.
Update
Since I will have a much larger N, I was interested in the speed of the different methods for higher N. With N = 30000:
>>> %timeit np.diag(e.take(index2, axis=1)).reshape(N*3, 1)
1 loops, best of 3: 3.9 s per loop
>>> %timeit e.ravel()[np.arange(e.shape[0])*e.shape[1]+index2].reshape(N*3, 1)
1000 loops, best of 3: 287 µs per loop
Finally, you suggest reshape(). As I want to leave it as general as possible (without knowing N), I instead use [:,np.newaxis] - it seems to increase duration from 287µs to 288µs, which I'll take :)
This might be ugly but more efficient:
>>> d.ravel()[np.arange(d.shape[0])*d.shape[1]+index]
array([5, 6, 2])
edit
As pointed out by @deinonychusaur, the statement above can be written as cleanly as:
d[np.arange(index.size),index]
There might be nicer ways, but a combo of take, diag and reshape would do:
In [137]: np.diag(d.take(index, axis=1)).reshape(3, 1)
Out[137]:
array([[5],
[6],
[2]])
EDIT
Comparisons with @Emanuele Paolinis' alternative, adding reshape to it to match the sought output:
In [142]: %timeit d.reshape(d.size)[np.arange(d.shape[0])*d.shape[1]+index].reshape(3, 1)
100000 loops, best of 3: 9.51 µs per loop
In [143]: %timeit np.diag(d.take(index, axis=1)).reshape(3, 1)
100000 loops, best of 3: 3.81 µs per loop
In [146]: %timeit d.ravel()[np.arange(d.shape[0])*d.shape[1]+index].reshape(3, 1)
100000 loops, best of 3: 8.56 µs per loop
This method is about twice as fast as both proposed alternatives.
EDIT 2: An even better method
Based on @Emanuele Paolinis' version but with a reduced number of operations, it outperforms all of the above on large arrays (10k rows by 100 columns).
In [199]: %timeit d[(np.arange(index.size), index)].reshape(index.size, 1)
1000 loops, best of 3: 364 µs per loop
In [200]: %timeit d.ravel()[np.arange(d.shape[0])*d.shape[1]+index].reshape(index.size, 1)
100 loops, best of 3: 5.22 ms per loop
So if speed is of the essence:
d[(np.arange(index.size), index)].reshape(index.size, 1)
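As an aside (my addition, not part of the original answers): on NumPy 1.15 and newer, the same per-row selection can be written with np.take_along_axis, which also returns the (N, 1) column shape directly:
import numpy as np
d = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
              [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
              [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
index = np.array([5, 6, 1])
np.take_along_axis(d, index[:, None], axis=1)
# array([[5],
#        [6],
#        [2]])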
I need the minimal distance between elements of an array.
I did:
numpy.min(numpy.ediff1d(numpy.sort(x)))
Is there a better / more efficient / more elegant / faster way of doing this?
If you are after sheer speed, here are some timings:
In [13]: a = np.random.rand(1000)
In [14]: %timeit np.sort(a)
10000 loops, best of 3: 31.9 us per loop
In [15]: %timeit np.ediff1d(a)
100000 loops, best of 3: 15.2 us per loop
In [16]: %timeit np.diff(a)
100000 loops, best of 3: 7.76 us per loop
In [17]: %timeit np.min(a)
100000 loops, best of 3: 3.19 us per loop
In [18]: %timeit np.unique(a)
10000 loops, best of 3: 53.8 us per loop
I timed unique in the hope that it would be comparably fast to sort, so that you could break out early without the calls to diff and min whenever the unique array is shorter than the original (as that would mean your answer is 0). But the overhead of unique is more than any gain to be made.
So it seems the only potential improvement I can offer is replacing ediff1d with diff:
In [19]: %timeit np.min(np.diff(np.sort(a)))
10000 loops, best of 3: 47.7 us per loop
In [20]: %timeit np.min(np.ediff1d(np.sort(a)))
10000 loops, best of 3: 57.1 us per loop
Your current approach is a sound one. By sorting first, you guarantee that the minimum difference occurs between adjacent elements, and ediff1d returns exactly that array of adjacent differences. Here's a suggestion:
Since the differences must be non-negative after an ascending-order sort, we can implement ediff1d manually and break as soon as a difference of zero is found. That way, if you have the sorted array x:
[1, 1, 2, 3, 4, 5, 6, 7, ... , n]
Rather than going through all n elements, the custom ediff1d breaks early and covers only the first two, returning [0]. This also reduces the size of the difference array, reducing the number of iterations required by your min call.
Here is an example without the use of numpy:
x = [1, 12, 3, 8, 4, 1, 4, 9, 1, 29, 210, 313, 12]
def ediff1d_custom(x):
    darr = []
    for i in xrange(len(x)):
        if i != len(x) - 1:
            diff = x[i + 1] - x[i]
            darr.append(diff)
            if diff == 0:
                break
    return darr
print min(ediff1d_custom(sorted(x))) # prints 0
try:
    min(x[i+1]-x[i] for i in xrange(0, len(x)-1))
except ValueError:
    print 'Array contains less than two values.'