How can I sum across rows that have equal values in the first column of a numpy array? For example:
In: np.array([[1,2,3],
              [1,4,6],
              [2,3,5],
              [2,6,2],
              [3,4,8]])
Out: [[1,6,9], [2,9,7], [3,4,8]]
Any help would be greatly appreciated.
Pandas has a very powerful groupby function which makes this simple.
import numpy as np
import pandas as pd

n = np.array([[1,2,3],
              [1,4,6],
              [2,3,5],
              [2,6,2],
              [3,4,8]])
df = pd.DataFrame(n, columns=["First Col", "Second Col", "Third Col"])
df.groupby("First Col").sum()
Approach #1
Here's a NumPythonic, vectorized way based on np.bincount -
# Initial setup
N = A.shape[1]-1
unqA1, idx = np.unique(A[:, 0], return_inverse=True)

# Create subscripts and accumulate with bincount for tagged summations
subs = np.arange(N)*(idx.max()+1) + idx[:,None]
sums = np.bincount(subs.ravel(), weights=A[:,1:].ravel())

# Append the unique elements from first column to get final output
out = np.append(unqA1[:,None], sums.reshape(N,-1).T, 1)
Sample input, output -
In [66]: A
Out[66]:
array([[1, 2, 3],
       [1, 4, 6],
       [2, 3, 5],
       [2, 6, 2],
       [7, 2, 1],
       [2, 0, 3]])

In [67]: out
Out[67]:
array([[  1.,   6.,   9.],
       [  2.,   9.,  10.],
       [  7.,   2.,   1.]])
Approach #2
Here's another based on np.cumsum and np.diff -
# Sort A based on first column
sA = A[np.argsort(A[:,0]),:]

# Row mask of where each group ends
row_mask = np.append(np.diff(sA[:,0])!=0, [True])

# Get cumulative summations and then DIFF to get summations for each group
cumsum_grps = sA.cumsum(0)[row_mask,1:]
sum_grps = np.diff(cumsum_grps, axis=0)

# Concatenate the first group's sums with the rest of the group sums
sums = np.concatenate((cumsum_grps[0,:][None], sum_grps), axis=0)

# Concatenate the first column of the input array for final output
out = np.concatenate((sA[row_mask,0][:,None], sums), axis=1)
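The benchmarks below call cumsum_diff, bincount and add_at. Those are just the two approaches above, plus the np.unique + np.add.at answer further down, wrapped into functions; a sketch of plausible wrappers (the bodies are the code shown in this question's answers, though the originally timed implementations may differ in detail):

def bincount(A):
    # Approach #1: tagged summations with np.bincount
    N = A.shape[1]-1
    unqA1, idx = np.unique(A[:, 0], return_inverse=True)
    subs = np.arange(N)*(idx.max()+1) + idx[:,None]
    sums = np.bincount(subs.ravel(), weights=A[:,1:].ravel())
    return np.append(unqA1[:,None], sums.reshape(N,-1).T, 1)

def cumsum_diff(A):
    # Approach #2: cumulative sums differenced at group boundaries
    sA = A[np.argsort(A[:,0]),:]
    row_mask = np.append(np.diff(sA[:,0])!=0, [True])
    cumsum_grps = sA.cumsum(0)[row_mask,1:]
    sum_grps = np.diff(cumsum_grps, axis=0)
    sums = np.concatenate((cumsum_grps[0,:][None], sum_grps), axis=0)
    return np.concatenate((sA[row_mask,0][:,None], sums), axis=1)

def add_at(A):
    # The np.unique + np.add.at answer shown later, wrapped for timing
    unq, unq_inv = np.unique(A[:, 0], return_inverse=True)
    out = np.zeros((len(unq), A.shape[1]), dtype=A.dtype)
    out[:, 0] = unq
    np.add.at(out[:, 1:], unq_inv, A[:, 1:])
    return out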
Benchmarking
Here are some runtime tests for the NumPy-based approaches presented so far for this question -
In [319]: A = np.random.randint(0,1000,(100000,10))
In [320]: %timeit cumsum_diff(A)
100 loops, best of 3: 12.1 ms per loop
In [321]: %timeit bincount(A)
10 loops, best of 3: 21.4 ms per loop
In [322]: %timeit add_at(A)
10 loops, best of 3: 60.4 ms per loop
In [323]: A = np.random.randint(0,1000,(100000,20))
In [324]: %timeit cumsum_diff(A)
10 loops, best of 3: 32.1 ms per loop
In [325]: %timeit bincount(A)
10 loops, best of 3: 32.3 ms per loop
In [326]: %timeit add_at(A)
10 loops, best of 3: 113 ms per loop
Seems like Approach #2: cumsum + diff is performing quite well.
Try using pandas. Group by the first column and then sum within each group. Something like

df.groupby(df.columns[0]).sum()
With a little help from your friends np.unique and np.add.at:
>>> unq, unq_inv = np.unique(A[:, 0], return_inverse=True)
>>> out = np.zeros((len(unq), A.shape[1]), dtype=A.dtype)
>>> out[:, 0] = unq
>>> np.add.at(out[:, 1:], unq_inv, A[:, 1:])
>>> out # A was the OP's array
array([[1, 6, 9],
       [2, 9, 7],
       [3, 4, 8]])
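One subtlety worth noting: this works because out[:, 1:] is a basic-slicing view into out, so np.add.at's unbuffered in-place additions land in out itself; a fancy-indexed selection would be a copy and the sums would be lost. A quick sanity check (a toy example, not from the original answer):

import numpy as np

out = np.zeros((3, 3), dtype=int)
np.add.at(out[:, 1:], [0, 0, 2], np.ones((3, 2), dtype=int))
print(out)
# [[0 2 2]
#  [0 0 0]
#  [0 1 1]]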
I want to shift the values in matrix_a according to the values in matrix_b. So if the value in matrix_b at position (0, 0) is 1, then the element of result_matrix at (0, 0) should be the element at (1, 1) in matrix_a. I already have this working using the following code:
import numpy as np

matrix_a = np.matrix([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]])
matrix_b = np.matrix([[1, 1, 0],
                      [0,-1, 0],
                      [0, 0,-1]])

result_matrix = np.zeros((3,3))
for x in range(matrix_b.shape[0]):
    for y in range(matrix_b.shape[1]):
        value = matrix_b.item(x,y)
        result_matrix[x][y] = matrix_a.item(x+value, y+value)

print(result_matrix)
which results in:
[[5. 6. 3.]
 [4. 1. 6.]
 [7. 8. 5.]]
Right now this is quite slow on large matrices, and I have the feeling that this can be optimized using one of numpy or scipy's functions. Can someone tell me how this can be done more efficiently?
Using np.indices
ix = np.indices(matrix_a.shape)
matrix_a[tuple(ix + np.array(matrix_b))]
Out[]:
matrix([[5, 6, 3],
        [4, 1, 6],
        [7, 8, 5]])
As a word of advice, try to avoid using np.matrix - it's only really for compatibility with old MATLAB code, and breaks a lot of numpy functions. np.array works just as well 99% of the time, and the rest of the time np.matrix will be confusing for core numpy users.
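Following that advice, the same solution with plain ndarrays, as a sketch (values from the question):

import numpy as np

matrix_a = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
matrix_b = np.array([[1, 1, 0],
                     [0,-1, 0],
                     [0, 0,-1]])

ix = np.indices(matrix_a.shape)
result = matrix_a[tuple(ix + matrix_b)]
print(result)
# [[5 6 3]
#  [4 1 6]
#  [7 8 5]]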
Here's one way with integer indexing, using open grids from np.ogrid to get the row and column indices for all elements -
I,J = np.ogrid[:matrix_b.shape[0],:matrix_b.shape[1]]
out = matrix_a[I+matrix_b, J+matrix_b]
Output for given sample -
In [152]: out
Out[152]:
matrix([[5, 6, 3],
        [4, 1, 6],
        [7, 8, 5]])
Timings on a large dataset 5000x5000 -
In [142]: np.random.seed(0)
...: N = 5000 # matrix size
...: matrix_a = np.random.rand(N,N)
...: matrix_b = np.random.randint(0,N,matrix_a.shape)-matrix_a.shape[1]
# @Daniel F's soln
In [143]: %%timeit
...: ix = np.indices(matrix_a.shape)
...: matrix_a[tuple(ix + np.array(matrix_b))]
1.37 s ± 99.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Solution from this post
In [149]: %%timeit
...: I,J = np.ogrid[:matrix_b.shape[0],:matrix_b.shape[1]]
...: out = matrix_a[I+matrix_b, J+matrix_b]
1.17 s ± 3.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have one numpy array in which indices are stored with shape (n, 2). E.g.:
[[0, 1],
 [2, 3],
 [1, 2],
 [4, 2]]
Then I do some processing and create an array in the shape of (m, 2), where n > m. E.g.:
[[2, 3],
 [4, 2]]
Now I want to delete every row in the first array that can be found in the second array as well. So my wanted result is:
[[0, 1],
 [1, 2]]
My current solution is as follows:

result = first_array
for row in second_array:
    result = np.delete(result, np.where(np.all(result == row, axis=1)), axis=0)

However, this is quite time consuming if the second array is large. Does someone know a numpy-only solution which does not require a loop?
Here's one leveraging the fact that they are positive numbers, using matrix multiplication for dimensionality reduction -

def setdiff_nd_positivenums(a, b):
    # Place-value multipliers, so each row maps to a unique scalar
    dims = np.maximum(a.max(0), b.max(0)) + 1
    s = np.append(np.cumprod(dims[:0:-1])[::-1], 1)
    return a[~np.isin(a.dot(s), b.dot(s))]
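To see why the 1D reduction works: with place-value multipliers each row collapses to a unique scalar (a mixed-radix encoding), so row-wise set membership reduces to scalar membership. A small illustration with the sample arrays:

import numpy as np

a = np.array([[0, 1], [2, 3], [1, 2], [4, 2]])
b = np.array([[2, 3], [4, 2]])

dims = np.maximum(a.max(0), b.max(0)) + 1        # possible values per column -> [5, 4]
s = np.append(np.cumprod(dims[:0:-1])[::-1], 1)  # place values -> [4, 1]
print(a.dot(s))  # [ 1 11  6 18] -- one scalar code per row
print(b.dot(s))  # [11 18]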
Sample run -
In [82]: a
Out[82]:
array([[0, 1],
       [2, 3],
       [1, 2],
       [4, 2]])

In [83]: b
Out[83]:
array([[2, 3],
       [4, 2]])

In [85]: setdiff_nd_positivenums(a,b)
Out[85]:
array([[0, 1],
       [1, 2]])
Also, it seems the second array b is a subset of a. So, we can leverage that scenario to boost the performance even further using np.searchsorted, like so -

def setdiff_nd_positivenums_searchsorted(a, b):
    dims = np.maximum(a.max(0), b.max(0)) + 1
    s = np.append(np.cumprod(dims[:0:-1])[::-1], 1)
    a1D, b1D = a.dot(s), b.dot(s)
    b1Ds = np.sort(b1D)
    # Clip so codes in a larger than everything in b stay in bounds (they can't match anyway)
    idx = np.searchsorted(b1Ds, a1D).clip(max=len(b1Ds)-1)
    return a[b1Ds[idx] != a1D]
Timings -
In [146]: np.random.seed(0)
...: a = np.random.randint(0,9,(1000000,2))
...: b = a[np.random.choice(len(a), 10000, replace=0)]
In [147]: %timeit setdiff_nd_positivenums(a,b)
...: %timeit setdiff_nd_positivenums_searchsorted(a,b)
10 loops, best of 3: 101 ms per loop
10 loops, best of 3: 70.9 ms per loop
For generic numbers, here's another using views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

def setdiff_nd(a, b):
    # a, b are the nD input arrays
    A, B = view1D(a, b)
    return a[~np.isin(A, B)]
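A caveat worth noting: the void view compares raw bytes, so a and b must share the same dtype (and hence row byte-width); cast first if they might differ. A sketch:

import numpy as np

a = np.array([[0, 1], [-2, -3], [1, 2], [-4, -2]])   # default int dtype
b = np.array([[-2, -3], [4, 2]], dtype=np.int32)
b = b.astype(a.dtype)  # make byte layouts match before viewing
print(setdiff_nd(a, b))
# [[ 0  1]
#  [ 1  2]
#  [-4 -2]]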
Sample run -
In [94]: a
Out[94]:
array([[ 0,  1],
       [-2, -3],
       [ 1,  2],
       [-4, -2]])

In [95]: b
Out[95]:
array([[-2, -3],
       [ 4,  2]])

In [96]: setdiff_nd(a,b)
Out[96]:
array([[ 0,  1],
       [ 1,  2],
       [-4, -2]])
Timings -
In [158]: np.random.seed(0)
...: a = np.random.randint(0,9,(1000000,2))
...: b = a[np.random.choice(len(a), 10000, replace=0)]
In [159]: %timeit setdiff_nd(a,b)
1 loop, best of 3: 352 ms per loop
The numpy-indexed package (disclaimer: I am its author) was designed to perform operations of this type efficiently on nd-arrays.
import numpy_indexed as npi
# if the output should consist of unique values and there is no need to preserve ordering
result = npi.difference(first_array, second_array)
# otherwise:
result = first_array[~npi.in_(first_array, second_array)]
Here is a function that works with 2D integer arrays of any shape, accepting both positive and negative numbers:

import numpy as np

# Gets a boolean array of rows of a that are in b
def isin_rows(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # Subtract minimum value per column
    min = np.minimum(a.min(0), b.min(0))
    a = a - min
    b = b - min
    # Get number of possible values per column
    max = np.maximum(a.max(0), b.max(0)) + 1
    # Compute multiplicative base (place values) for each column
    base = np.roll(max, 1)
    base[0] = 1
    base = np.cumprod(base)
    # Make flattened version of arrays
    a_flat = (a * base).sum(1)
    b_flat = (b * base).sum(1)
    # Check elements of a in b
    return np.isin(a_flat, b_flat)

# Test
a = np.array([[0, 1],
              [2, 3],
              [1, 2],
              [4, 2]])
b = np.array([[2, 3],
              [4, 2]])
a_in_b_mask = isin_rows(a, b)
a_not_in_b = a[~a_in_b_mask]
print(a_not_in_b)
# [[0 1]
#  [1 2]]
EDIT: One possible optimization arises from considering the number of possible rows in b. If b has more rows than the number of possible value combinations, then you may find its unique elements first so np.isin is faster:
import numpy as np

def isin_rows_opt(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    min = np.minimum(a.min(0), b.min(0))
    a = a - min
    b = b - min
    max = np.maximum(a.max(0), b.max(0)) + 1
    base = np.roll(max, 1)
    base[0] = 1
    base = np.cumprod(base)
    a_flat = (a * base).sum(1)
    b_flat = (b * base).sum(1)
    # Count number of possible different rows for b
    num_possible_b = np.prod(b.max(0) - b.min(0) + 1)
    if len(b_flat) > num_possible_b:  # May tune this condition
        b_flat = np.unique(b_flat)
    return np.isin(a_flat, b_flat)
The condition len(b_flat) > num_possible_b should probably be tuned better, so you only look for unique elements when it is really going to be worth it (maybe len(b_flat) > 2 * num_possible_b, or len(b_flat) > num_possible_b + CONSTANT). It seems to give some improvement for big arrays with few distinct values:
import numpy as np
# Test setup from @Divakar
np.random.seed(0)
a = np.random.randint(0, 9, (1000000, 2))
b = a[np.random.choice(len(a), 10000, replace=0)]
print(np.all(isin_rows(a, b) == isin_rows_opt(a, b)))
# True
%timeit isin_rows(a, b)
# 100 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit isin_rows_opt(a, b)
# 81.2 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have come across a problem like this:
Suppose I have arrays like this:

a = np.array([[1,2,3,4,5,4,3,2,1],])
label = np.array([[1,0,1,0,0,1,1,0,1],])

I need to obtain the index of a at which the value of label is 1 and the value of a is the largest among all positions where label is 1.
It may be confusing. In the above example, the indices where label is 1 are 0, 2, 5, 6, 8, and their corresponding values of a are thus 1, 3, 4, 3, 1, among which 4 is the largest; thus I need to get the result 5, which is the index of the number 4 in a. How could I do this with numpy?
Get the indices of the 1s, say as idx, then index into a with it, get the max index, and finally trace it back to the original order by indexing into idx -
idx = np.flatnonzero(label==1)
out = idx[a[idx].argmax()]
Sample run -
# Assuming inputs to be 1D
In [18]: a
Out[18]: array([1, 2, 3, 4, 5, 4, 3, 2, 1])
In [19]: label
Out[19]: array([1, 0, 1, 0, 0, 1, 1, 0, 1])
In [20]: idx = np.flatnonzero(label==1)
In [21]: idx[a[idx].argmax()]
Out[21]: 5
For a as ints and label as an array of 0s and 1s, we could optimize further, as we could scale a based on the range of values in it, like so -

(label*(a.max()-a.min()+1) + a).argmax()

The scaling lifts every labeled element above every unlabeled one, so the global argmax must land on the largest labeled element. Furthermore, if a has positive numbers only, it would simplify to -

(label*(a.max()+1) + a).argmax()
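A quick check of the scaling trick with the question's sample (1D inputs assumed):

import numpy as np

a = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1])
label = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1])
scaled = label * (a.max() - a.min() + 1) + a   # labeled entries get +5 here
print(scaled)           # [6 2 8 4 5 9 8 2 6]
print(scaled.argmax())  # 5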
Timings for positive ints with a largish a -
In [115]: np.random.seed(0)
...: a = np.random.randint(0,10,(100000))
...: label = np.random.randint(0,2,(100000))
In [117]: %%timeit
...: idx = np.flatnonzero(label==1)
...: out = idx[a[idx].argmax()]
1000 loops, best of 3: 592 µs per loop
In [116]: %timeit (label*(a.max()-a.min()+1) + a).argmax()
1000 loops, best of 3: 357 µs per loop
# @coldspeed's soln
In [120]: %timeit np.ma.masked_where(~label.astype(bool), a).argmax()
1000 loops, best of 3: 1.63 ms per loop
# won't work with negative numbers in a
In [119]: %timeit (label*(a.max()+1) + a).argmax()
1000 loops, best of 3: 292 µs per loop
# @klim's soln (won't work with negative numbers in a)
In [121]: %timeit np.argmax(a * (label == 1))
1000 loops, best of 3: 229 µs per loop
You can use masked arrays:
>>> np.ma.masked_where(~label.astype(bool), a).argmax()
5
Here is one of the simplest ways.
>>> np.argmax(a * (label == 1))
5
>>> np.argmax(a * (label == 1), axis=1)
array([5])
Coldspeed's method may take more time.
Hi, I am using numpy to create a new array with timesteps and multiple features, for an LSTM.
I have looked at a number of approaches using strides and reshaping but haven't managed to find an efficient solution.
Here is a function that solves a toy problem; however, I have 30,000 samples, each with 100 features.
def make_timesteps(a, timesteps):
    array = []
    for j in np.arange(len(a)):
        unit = []
        for i in range(timesteps):
            unit.append(np.roll(a, -i, axis=0)[j])
        array.append(unit)
    return np.array(array)

inArr = np.array([[1, 2], [3, 4], [5, 6]])
inArr.shape  # => (3, 2)

outArr = make_timesteps(inArr, 2)
outArr.shape  # => (3, 2, 2)

assert np.array_equal(outArr,
    np.array([[[1, 2], [3, 4]], [[3, 4], [5, 6]], [[5, 6], [1, 2]]]))
# => True
Is there a more efficient way of doing this (there must be!)? Can someone please help?
One trick would be to slice L-1 boundary rows off the array and stack them onto the other end. Then it's a simple case of using the very efficient NumPy strides. For people wondering about the cost of this trick, as we will see later on through the timing tests, it's as good as nothing.
The trick, supporting both forward and backward striding, would look something like this -
Backward striding:
def strided_axis0_backward(inArr, L=2):
    # INPUTS :
    # inArr : Input array
    # L : Length along rows to be cut to create each subarray

    # Prepend the last L-1 rows to the start. It just helps in keeping a view output.
    a = np.vstack((inArr[-L+1:], inArr))

    # Store shape and strides info
    m, n = a.shape
    s0, s1 = a.strides

    # Length of 3D output array along its axis=0
    nd0 = m - L + 1

    strided = np.lib.stride_tricks.as_strided
    return strided(a[L-1:], shape=(nd0, L, n), strides=(s0, -s0, s1))
Forward striding:
def strided_axis0_forward(inArr, L=2):
    # INPUTS :
    # inArr : Input array
    # L : Length along rows to be cut to create each subarray

    # Append the first L-1 rows to the end. It just helps in keeping a view output.
    a = np.vstack((inArr, inArr[:L-1]))

    # Store shape and strides info
    m, n = a.shape
    s0, s1 = a.strides

    # Length of 3D output array along its axis=0
    nd0 = m - L + 1

    strided = np.lib.stride_tricks.as_strided
    return strided(a[:L-1], shape=(nd0, L, n), strides=(s0, s0, s1))
Sample run -
In [42]: inArr
Out[42]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [43]: strided_axis0_backward(inArr, 2)
Out[43]:
array([[[1, 2],
        [5, 6]],

       [[3, 4],
        [1, 2]],

       [[5, 6],
        [3, 4]]])

In [44]: strided_axis0_forward(inArr, 2)
Out[44]:
array([[[1, 2],
        [3, 4]],

       [[3, 4],
        [5, 6]],

       [[5, 6],
        [1, 2]]])
Runtime test -
In [53]: inArr = np.random.randint(0,9,(1000,10))
In [54]: %timeit make_timesteps(inArr, 2)
...: %timeit strided_axis0_forward(inArr, 2)
...: %timeit strided_axis0_backward(inArr, 2)
...:
10 loops, best of 3: 33.9 ms per loop
100000 loops, best of 3: 12.1 µs per loop
100000 loops, best of 3: 12.2 µs per loop
In [55]: %timeit make_timesteps(inArr, 10)
...: %timeit strided_axis0_forward(inArr, 10)
...: %timeit strided_axis0_backward(inArr, 10)
...:
1 loops, best of 3: 152 ms per loop
100000 loops, best of 3: 12 µs per loop
100000 loops, best of 3: 12.1 µs per loop
In [56]: 152000/12.1 # Speedup figure
Out[56]: 12561.98347107438
The timings of strided_axis0 stay the same even as we increase the length of the subarrays in the output. That just goes to show the massive benefit of strides, and of course the crazy speedups over the original loopy version.
As promised at the start, here are the timings on the stacking cost with np.vstack -
In [417]: inArr = np.random.randint(0,9,(1000,10))
In [418]: L = 10
In [419]: %timeit np.vstack(( inArr[-L+1:], inArr ))
100000 loops, best of 3: 5.41 µs per loop
The timings support the idea that the stacking is pretty cheap.
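On newer NumPy (1.20+), a safer alternative to hand-rolled as_strided is np.lib.stride_tricks.sliding_window_view, which returns a read-only view with bounds-checked setup; a sketch of the forward case under that assumption:

import numpy as np

def forward_timesteps_swv(inArr, L=2):
    # Wrap around by appending the first L-1 rows, as in the trick above
    a = np.vstack((inArr, inArr[:L-1]))
    # Sliding windows of L full rows; result shape: (len(inArr), L, inArr.shape[1])
    return np.lib.stride_tricks.sliding_window_view(a, (L, a.shape[1])).squeeze(1)

inArr = np.array([[1, 2], [3, 4], [5, 6]])
print(forward_timesteps_swv(inArr, 2))  # matches strided_axis0_forward(inArr, 2)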
If I try
x = np.append(x, (2,3))
the tuple (2,3) does not get appended to the end of the array; rather, 2 and 3 get appended individually, even though I originally declared x as
x = np.array([], dtype = tuple)
or
x = np.array([], dtype = (int,2))
What is the proper way to do this?
I agree with @user2357112's comment:

appending to NumPy arrays is catastrophically slower than appending to ordinary lists. It's an operation that they are not at all designed for
Here's a little benchmark:
# measure execution time
import timeit
import numpy as np

def f1(num_iterations):
    x = np.dtype((np.int32, (2, 1)))
    for i in range(num_iterations):
        x = np.append(x, (i, i))

def f2(num_iterations):
    x = np.array([(0, 0)])
    for i in range(num_iterations):
        x = np.vstack((x, (i, i)))

def f3(num_iterations):
    x = []
    for i in range(num_iterations):
        x.append((i, i))
    x = np.array(x)

N = 50000
print(timeit.timeit('f1(N)', setup='from __main__ import f1, N', number=1))
print(timeit.timeit('f2(N)', setup='from __main__ import f2, N', number=1))
print(timeit.timeit('f3(N)', setup='from __main__ import f3, N', number=1))
I wouldn't use either np.append or vstack; I'd just build my Python list properly and then use it to construct the np.array.
EDIT
Here's the benchmark output on my laptop:
append: 12.4983000173
vstack: 1.60663705793
list: 0.0252208517006
[Finished in 14.3s]
You need to supply the shape to the numpy dtype, like so:

x = np.dtype((np.int32, (1,2)))
x = np.append(x, (2,3))

Outputs

array([dtype(('<i4', (1, 2))), 2, 3], dtype=object)

Reference: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
If I understand what you mean, you can use vstack:
>>> a = np.array([(1,2),(3,4)])
>>> a = np.vstack((a, (4,5)))
>>> a
array([[1, 2],
       [3, 4],
       [4, 5]])
I do not have any special insight as to why this works, but:
x = np.array([1, 3, 2, (5,7), 4])
mytuple = [(2, 3)]
mytuplearray = np.empty(len(mytuple), dtype=object)
mytuplearray[:] = mytuple
y = np.append(x, mytuplearray)
print(y) # [1 3 2 (5, 7) 4 (2, 3)]
As others have correctly pointed out, this is a slow operation with numpy arrays. If you're just building some code from scratch, try to use some other data type. But if you know your array will always remain small or you're not going to append much or if you have existing code that you need to tweak quickly, then go ahead.
Simplest way:
x=np.append(x,None)
x[-1]=(2,3)
np.append is easy to use with a case like:
In [94]: np.append([1,2,3],4)
Out[94]: array([1, 2, 3, 4])
but its first example is harder to understand. It shows the same sort of flat concatenate that bothers you:
>>> np.append([1, 2, 3], [[4, 5, 6], [7, 8, 9]])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
Stripped of dimensional tests, np.append does
In [166]: np.append(np.array([1,2],int),(2,3))
Out[166]: array([1, 2, 2, 3])
In [167]: np.concatenate([np.array([1,2],int),np.array((2,3))])
Out[167]: array([1, 2, 2, 3])
So except for the simplest cases you need to understand what np.array((2,3)) does, and how concatenate handles dimensions.
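For instance, a quick illustration of what np.array((2,3)) does and why dimensions matter to concatenate:

import numpy as np

print(np.array((2, 3)).shape)    # (2,)  -- a tuple becomes a plain 1d array
print(np.array([(2, 3)]).shape)  # (1, 2) -- wrapping it adds a dimension
# concatenate joins along an existing axis, so the non-joined dims must match
print(np.concatenate([np.zeros((0, 2), int), np.array([(2, 3)])], axis=0))
# [[2 3]]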
So apart from the speed issues, np.append can be trickier to use than the interface suggests. The parallels to list append are only superficial.
As for append (or concatenate) with dtype=object (not dtype=tuple) or a compound dtype ('i,i'), I couldn't tell you what happens without testing. At a minimum the inputs should already be arrays and should have a matching dtype. Otherwise the results can be unpredictable.
edit
Don't trust the timings in https://stackoverflow.com/a/38985245/901925. The functions don't produce the same things.
Corrected functions:
In [233]: def g1(num_iterations):
     ...:     x = np.ones((0, 2), int)
     ...:     for i in range(num_iterations):
     ...:         x = np.append(x, [(i, i)], axis=0)
     ...:     return x
     ...:
     ...: def g2(num_iterations):
     ...:     x = np.ones((0, 2), int)
     ...:     for i in range(num_iterations):
     ...:         x = np.vstack((x, (i, i)))
     ...:     return x
     ...:
     ...: def g3(num_iterations):
     ...:     x = []
     ...:     for i in range(num_iterations):
     ...:         x.append((i, i))
     ...:     x = np.array(x)
     ...:     return x
     ...:
In [234]: g1(3)
Out[234]:
array([[0, 0],
       [1, 1],
       [2, 2]])

In [235]: g2(3)
Out[235]:
array([[0, 0],
       [1, 1],
       [2, 2]])

In [236]: g3(3)
Out[236]:
array([[0, 0],
       [1, 1],
       [2, 2]])
np.append and np.vstack timings are much closer. Both use np.concatenate to do the actual joining. They differ in how the inputs are processed prior to sending them to concatenate.
In [237]: timeit g1(1000)
9.69 ms ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [238]: timeit g2(1000)
12.8 ms ± 7.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [239]: timeit g3(1000)
537 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The wrong results: note that f1 produces a 1d object dtype array, because the starting value is an object dtype array and there's no axis parameter. f2 duplicates the starting array.
In [240]: f1(3)
Out[240]: array([dtype(('<i4', (2, 1))), 0, 0, 1, 1, 2, 2], dtype=object)

In [241]: f2(3)
Out[241]:
array([[0, 0],
       [0, 0],
       [1, 1],
       [2, 2]])
Not only is it slower to use np.append or np.vstack in a loop, it is also hard to do it right.
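For the original question's goal of keeping (2, 3) as a single tuple element, the list-first pattern from g3 adapts naturally; a sketch using an object array so the tuples stay intact (the empty-array assignment mirrors the mytuplearray answer above):

import numpy as np

items = [1, 3, 2, (5, 7), 4]
items.append((2, 3))              # plain list append: cheap and obvious
x = np.empty(len(items), dtype=object)
x[:] = items                      # assign so tuples stay single elements
print(x)  # [1 3 2 (5, 7) 4 (2, 3)]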