I have the following numpy array:
a = np.array([[1.1,0.8,0.5,0,0],[1,0.85,0.5,0,0],[1,0.8,0.5,1,0]])
with shape = (3,5).
I would like to reshape and resize it to a new array with shape = (3,8), filling the new values in each row with 0. So far I tried the following approach:
b = np.resize(a,(3,8))
But it returns:
[[ 1.1 0.8 0.5 0. 0. 1. 0.85 0.5 ]
[ 0. 0. 1. 0.8 0.5 1. 0. 1.1 ]
[ 0.8 0.5 0. 0. 1. 0.85 0.5 0. ]]
instead of the expected (for me):
[[ 1.1 0.8 0.5 0. 0. 0. 0. 0. ]
[ 1. 0.85 0.5 0. 0. 0. 0. 0. ]
[ 1. 0.8 0.5 1. 0. 0. 0. 0. ]]
Use np.lib.pad -
np.lib.pad(a, ((0,0),(0,3)), 'constant', constant_values=(0))
Sample run -
In [156]: a
Out[156]:
array([[ 1.1 , 0.8 , 0.5 , 0. , 0. ],
[ 1. , 0.85, 0.5 , 0. , 0. ],
[ 1. , 0.8 , 0.5 , 1. , 0. ]])
In [157]: np.lib.pad(a, ((0,0),(0,3)), 'constant', constant_values=(0))
Out[157]:
array([[ 1.1 , 0.8 , 0.5 , 0. , 0. , 0. , 0. , 0. ],
[ 1. , 0.85, 0.5 , 0. , 0. , 0. , 0. , 0. ],
[ 1. , 0.8 , 0.5 , 1. , 0. , 0. , 0. , 0. ]])
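Note that np.lib.pad is just an alias for np.pad, so the same call can be written with the shorter public name — a quick sketch:

```python
import numpy as np

a = np.array([[1.1, 0.8, 0.5, 0, 0],
              [1, 0.85, 0.5, 0, 0],
              [1, 0.8, 0.5, 1, 0]])

# Pad (0 before, 0 after) rows and (0 before, 3 after) columns with constant zeros.
b = np.pad(a, ((0, 0), (0, 3)), 'constant', constant_values=0)
```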
Runtime tests -
This section times the approaches posted thus far, both at the size listed in the question and scaled up by 100x. Here are the timing test results -
In [212]: def init_based(a,N):
     ...:     b = np.zeros((a.shape[0], a.shape[1]+N))
     ...:     b[:, :a.shape[1]] = a
     ...:     return b
     ...:
In [213]: a = np.random.rand(3,5)
In [214]: N = 3
In [215]: %timeit np.lib.pad(a, ((0,0),(0,N)), 'constant', constant_values=(0))
...: %timeit np.hstack([a, np.zeros([a.shape[0], N])])
...: %timeit np.concatenate((a,np.zeros((a.shape[0],N))), axis=1)
...: %timeit init_based(a,N)
...:
10000 loops, best of 3: 32.7 µs per loop
100000 loops, best of 3: 11.2 µs per loop
100000 loops, best of 3: 4.49 µs per loop
100000 loops, best of 3: 5.67 µs per loop
In [216]: a = np.random.rand(300,500)
In [217]: N = 300
In [218]: %timeit np.lib.pad(a, ((0,0),(0,N)), 'constant', constant_values=(0))
...: %timeit np.hstack([a, np.zeros([a.shape[0], N])])
...: %timeit np.concatenate((a,np.zeros((a.shape[0],N))), axis=1)
...: %timeit init_based(a,N)
...:
100 loops, best of 3: 2.99 ms per loop
1000 loops, best of 3: 1.72 ms per loop
1000 loops, best of 3: 1.71 ms per loop
1000 loops, best of 3: 1.72 ms per loop
From the doc of np.resize():
If the new array is larger than the original array, then the new
array is filled with repeated copies of a.
Zeros are not used; the new entries are actual values of a, repeated.
Instead, you could use np.hstack() and np.zeros() :
np.hstack([a, np.zeros([3, 3])])
Edit: I have not tested the speed, so I suggest you have a look at the other solutions too.
You can definitely use resize().
If the new array is larger than the original array, then the new array is filled with repeated copies of a. Note that this behavior is different from a.resize(new_shape) which fills with zeros instead of repeated copies of a.
b = a.transpose().copy()
b.resize((8,3), refcheck=False)  # ndarray.resize fills the new entries with zeros
b = b.transpose()
which outputs:
[[ 1.1 0.8 0.5 0. 0. 0. 0. 0. ]
[ 1. 0.85 0.5 0. 0. 0. 0. 0. ]
[ 1. 0.8 0.5 1. 0. 0. 0. 0. ]]
Limitation:
Filling with 0s can only be applied to the 1st dimension.
Another option (although np.hstack is probably best as in M. Massias' answer).
Initialise an array of zeros:
b = np.zeros((3, 8))
Fill using slice syntax:
b[:3, :5] = a
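The same idea, written shape-agnostically so the sizes aren't hard-coded — a small sketch, where `extra` is the number of zero-columns to append:

```python
import numpy as np

a = np.array([[1.1, 0.8, 0.5, 0, 0],
              [1, 0.85, 0.5, 0, 0],
              [1, 0.8, 0.5, 1, 0]])

extra = 3  # number of zero-columns to append
b = np.zeros((a.shape[0], a.shape[1] + extra), dtype=a.dtype)
b[:, :a.shape[1]] = a  # copy the original block; the padding stays 0
```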
np.concatenate
np.concatenate((a,np.zeros((3,3))), axis=1)
array([[ 1.1 , 0.8 , 0.5 , 0. , 0. , 0. , 0. , 0. ],
[ 1. , 0.85, 0.5 , 0. , 0. , 0. , 0. , 0. ],
[ 1. , 0.8 , 0.5 , 1. , 0. , 0. , 0. , 0. ]])
I have a symmetric matrix with 0's on the diagonal:
[[0. 4. 1.25 1.25]
[4. 0. 9.25 1.25]
[1.25 9.25 0. 4. ]
[1.25 1.25 4. 0. ]]
I want to keep the smallest k non-zero distinct values in each row and zero the rest. For example, if k = 1, I would have:
[[0. 0. 1.25 1.25]
[0. 0. 0. 1.25]
[1.25 0. 0. 0. ]
[1.25 1.25 0. 0. ]]
Here is what I have tried:
k = 1
for i in range(matrix.shape[0]):
    ind = np.argsort(matrix[i, :])
    matrix[i, ind[k + 1:]] = 0
Out:
[[0. 0. 1.25 0. ]
[0. 0. 0. 1.25]
[1.25 0. 0. 0. ]
[0. 1.25 0. 0. ]]
I can take the set of values in each row and then zero the value if it doesn't belong in the set, but I am looking for a more elegant solution.
Edit: For k = 2, the desired result is:
[[0. 4. 1.25 1.25]
[4. 0. 0. 1.25]
[1.25 0. 0. 4. ]
[1.25 1.25 4. 0. ]]
and for k = 3:
[[0. 4. 1.25 1.25]
[4. 0. 9.25 1.25]
[1.25 9.25 0. 4. ]
[1.25 1.25 4. 0. ]]
symmetric matrix with 0's on the diagonal
For k=1: Fill the diagonal with a large value; then use np.where with row minimums being the condition.
a = [[0. , 4. , 1.25, 1.25],
[4. , 0. , 9.25, 1.25],
[1.25, 9.25, 0. , 4. ],
[1.25, 1.25, 4. , 0. ]]
a = np.array(a)
np.fill_diagonal(a,a.max()+1)
cond = a == a.min(-1, keepdims=True)
b = np.where(cond,a,0)
That matches your desired result for k=1 but your text description seems to say that you only want to keep k values in each row.
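Run end-to-end on the matrix from the question, the k=1 recipe reproduces the desired output — restated here as a self-contained check (using keepdims so the comparison is explicitly against per-row minima):

```python
import numpy as np

a = np.array([[0., 4., 1.25, 1.25],
              [4., 0., 9.25, 1.25],
              [1.25, 9.25, 0., 4.],
              [1.25, 1.25, 4., 0.]])

np.fill_diagonal(a, a.max() + 1)          # mask out the zero diagonal
cond = a == a.min(-1, keepdims=True)      # flag the per-row minima
b = np.where(cond, a, 0)                  # keep minima, zero everything else
```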
This works for any k ...
Sticking with a for loop for now: find the unique nonzero values in each row; use the first k of those values to make a mask; use the mask to fill non-conforming items with zero.
def f(a, k=1):
    for row in a:
        values = row[row.nonzero()]
        values = np.unique(values)  # returns a sorted array
        values = values[:k]
        mask = (row[:, None] != values).all(1)
        row[mask] = 0
    return a
q = f(a,k=2)
Maybe not the best solution since you find the use of np.unique inelegant.
I have quite large arrays to fill a matrix with (about 5e6 elements). I know the fast way to fill is something like
(simplified example)
bbb = (np.array([1,2,3,4,1])) # row
ccc = (np.array([0,1,2,1,0])) # column
ddd = (np.array([55.5,22.2,33.3,44.4,11.1])) # values
experiment = np.zeros(shape=(5,3))
experiment[bbb, ccc] = [ddd] # filling
>[[ 0. 0. 0. ]
[ 11.1 0. 0. ]
[ 0. 22.2 0. ]
[ 0. 0. 33.3]
[ 0. 44.4 0. ]]
But what if I want the max ddd instead? Something like this at # filling:
#pseudocode
experiment[bbb, ccc] = [ddd if ddd > experiment[bbb, ccc]]
The matrix should return
>[[ 0. 0. 0. ]
[ 55.5 0. 0. ]
[ 0. 22.2 0. ]
[ 0. 0. 33.3]
[ 0. 44.4 0. ]]
What is a good fast way to get max to fill the matrix from np.array here?
You can use np.ufunc.at on np.maximum.
np.ufunc.at performs the given ufunc "unbuffered and in-place". This means all indices appearing in [bbb, ccc] will be processed by np.maximum, no matter how often those indices appear.
In your case (0, 1) appears twice, so it will be processed twice, each time picking the maximum of experiment[bbb, ccc] and ddd.
np.maximum.at(experiment, [bbb, ccc], ddd)
# array([[ 0. , 0. , 0. ],
# [ 55.5, 0. , 0. ],
# [ 0. , 22.2, 0. ],
# [ 0. , 0. , 33.3],
# [ 0. , 44.4, 0. ]])
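To see what "unbuffered" buys you, compare plain fancy assignment (where only the last write to a duplicate index survives) with np.maximum.at — a small sketch using the arrays from the question:

```python
import numpy as np

bbb = np.array([1, 2, 3, 4, 1])                  # rows; (1, 0) appears twice
ccc = np.array([0, 1, 2, 1, 0])                  # columns
ddd = np.array([55.5, 22.2, 33.3, 44.4, 11.1])   # values

plain = np.zeros((5, 3))
plain[bbb, ccc] = ddd                        # buffered: last value for (1, 0) wins

unbuffered = np.zeros((5, 3))
np.maximum.at(unbuffered, (bbb, ccc), ddd)   # both values for (1, 0) are compared
```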
Is it possible to use np.bincount but get the max instead of the sum of weights? Here, bbb at index 3 has two values, 11.1 and 55.5. I want to have 55.5, not 66.6. I suspect I should use another function, but I am not sure which one is good for this purpose.
bbb = np.array([ 3, 7, 11, 13, 3])
weight = np.array([ 11.1, 22.2, 33.3, 44.4, 55.5])
print np.bincount(bbb, weight, minlength=15)
OUT >> [ 0. 0. 0. 66.6 0. 0. 0. 22.2 0. 0. 0. 33.3 0. 44.4 0. ]
Note that, in fact, bbb and weight are very large (about 5e6 elements).
The solution to your 2D question is also valid for the 1D case, so you can use np.maximum.at
out = np.zeros(15)
np.maximum.at(out, bbb, weight)
# array([ 0. , 0. , 0. , 55.5, 0. , 0. , 0. , 22.2, 0. ,
# 0. , 0. , 33.3, 0. , 44.4, 0. ])
Approach #1 : Here's one way with np.maximum.reduceat to get the binned maximum values -
def binned_max(bbb, weight, minlength):
    sidx = bbb.argsort()
    weight_s = weight[sidx]
    bbb_s = bbb[sidx]
    cut_idx = np.flatnonzero(np.concatenate(([True], bbb_s[1:] != bbb_s[:-1])))
    bbb_unq = bbb_s[cut_idx]
    # Or: bbb_unq, cut_idx = np.unique(bbb_s, return_index=1)
    max_val = np.maximum.reduceat(weight_s, cut_idx)
    out = np.zeros(minlength, dtype=weight.dtype)
    out[bbb_unq] = max_val
    return out
Sample run -
In [36]: bbb = np.array([ 3, 7, 11, 13, 3])
...: weight = np.array([ 11.1, 22.2, 33.3, 44.4, 55.5])
In [37]: binned_max(bbb, weight, minlength=15)
Out[37]:
array([ 0. , 0. , 0. , 55.5, 0. , 0. , 0. , 22.2, 0. ,
0. , 0. , 33.3, 0. , 44.4, 0. ])
Approach #2 : Well, I was trying out / having fun with numba to solve this, and it seems quite efficient. Here's one numba way -
from numba import njit

@njit
def numba_func(out, bins, weight, minlength):
    l = len(bins)
    for i in range(l):
        if out[bins[i]] < weight[i]:
            out[bins[i]] = weight[i]
    return out

def maxat_numba(bins, weight, minlength):
    out = np.zeros(minlength, dtype=weight.dtype)
    out[bins] = weight.min()
    numba_func(out, bins, weight, minlength)
    return out
Runtime test -
The built-in with np.maximum.at looks quite neat and would be the preferred one in most scenarios, so testing the proposed one against it -
# @Nils Werner's soln with np.maximum.at
def maxat_numpy(bins, weight, minlength):
    out = np.zeros(minlength)
    np.maximum.at(out, bins, weight)
    return out
Timings -
Case #1 :
In [155]: bbb = np.random.randint(1,1000, (10000))
In [156]: weight = np.random.rand(*bbb.shape)
In [157]: %timeit maxat_numpy(bbb, weight, minlength=bbb.max()+1)
1000 loops, best of 3: 686 µs per loop
In [158]: %timeit maxat_numba(bbb, weight, minlength=bbb.max()+1)
10000 loops, best of 3: 60.6 µs per loop
Case #2 :
In [159]: bbb = np.random.randint(1,10000, (1000000))
In [160]: weight = np.random.rand(*bbb.shape)
In [161]: %timeit maxat_numpy(bbb, weight, minlength=bbb.max()+1)
10 loops, best of 3: 66 ms per loop
In [162]: %timeit maxat_numba(bbb, weight, minlength=bbb.max()+1)
100 loops, best of 3: 5.42 ms per loop
Probably not quite as fast as the answer by Nils, but the numpy_indexed package (disclaimer: I am its author) has a more flexible syntax for performing these types of operations:
import numpy_indexed as npi
unique_keys, maxima_per_key = npi.group_by(bbb).max(weight)
This is a VERY simplified example of a larger problem I'm working with. How do I replace multiple rows in a two dimensional numpy array? For example, I have an array...
main_list = array( [[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]])
I have a list of indexes...
indexes = array([3, 6, 2])
I have a list of substitutions that will always be the same length as the list of indexes.
substitutions = array( [[ 2.4 5.2]
[ 10.1 1.3]
[ 5.6 9.5]])
I want...
main_list = array( [[ 0. 0.]
[ 0. 0.]
[ 5.6 9.5]
[ 2.4 5.2]
[ 0. 0.]
[ 0. 0.]
[ 10.1 1.3]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]])
Right now I'm doing...
for i, ind in enumerate(indexes):
    main_list[ind] = substitutions[i]
Keeping in mind that this is a simple example, in the production version of what I'm doing the length of all these lists will be large. Is there a faster way to do these substitutions? Thanks.
main_list[indexes,:] = substitutions
my attempt at timing says this is 3x faster than what you posted
In [51]: %%timeit
   ....: for i, ind in enumerate(indexes):
   ....:     main_list[ind] = substitutions[i]
....:
100000 loops, best of 3: 6.83 us per loop
In [52]: %timeit main_list[indexes,:] = substitutions
100000 loops, best of 3: 2.27 us per loop
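Put together with the arrays from the question, the one-liner looks like this as a runnable sketch:

```python
import numpy as np

main_list = np.zeros((10, 2))
indexes = np.array([3, 6, 2])
substitutions = np.array([[2.4, 5.2],
                          [10.1, 1.3],
                          [5.6, 9.5]])

# One fancy-indexed assignment replaces all target rows at once,
# pairing each index with the corresponding substitution row.
main_list[indexes, :] = substitutions
```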
I have an ndarray. From this array I need to choose the list of N numbers with the biggest values. I found heapq.nlargest to find the N largest entries, but I need to extract the indexes.
I want to build a new array where only the N rows with the largest weights in the first column will survive. The rest of the rows will be replaced by random values
import numpy as np
import heapq # For choosing list of max values
a = [[1.1,2.1,3.1], [2.1,3.1,4.1], [5.1,0.1,7.1],[0.1,1.1,1.1],[4.1,3.1,9.1]]
a = np.asarray(a)
maxVal = heapq.nlargest(2,a[:,0])
if __name__ == '__main__':
print a
print maxVal
The output I have is:
[[ 1.1 2.1 3.1]
[ 2.1 3.1 4.1]
[ 5.1 0.1 7.1]
[ 0.1 1.1 1.1]
[ 4.1 3.1 9.1]]
[5.0999999999999996, 4.0999999999999996]
but what I need is [2,4] as the indexes to build a new array. The indexes are the rows so if in this example I want to replace the rest by 0 I need to finish with:
[[0.0 0.0 0.0]
[ 0.0 0.0 0.0]
[ 5.1 0.1 7.1]
[ 0.0 0.0 0.0]
[ 4.1 3.1 9.1]]
I am stuck at the point where I need the indexes. The original array has 1000 rows and 100 columns. The weights are normalized floating points and I don't want to do something like if a[:,1] == maxVal[0]: because sometimes the weights are very close and I can end up with more values equal to maxVal[0] than my original N.
Is there any simple way to extract indexes on this setup to replace the rest of the array?
If you only have 1000 rows, I would forget about the heap and use np.argsort on the first column:
>>> np.argsort(a[:,0])[::-1][:2]
array([2, 4])
If you want to put it all together, it would look something like:
def trim_rows(a, n):
    idx = np.argsort(a[:, 0])[:-n]
    a[idx] = 0
>>> a = np.random.rand(10, 4)
>>> a
array([[ 0.34416425, 0.89021968, 0.06260404, 0.0218131 ],
[ 0.72344948, 0.79637177, 0.70029863, 0.20096129],
[ 0.27772833, 0.05372373, 0.00372941, 0.18454153],
[ 0.09124461, 0.38676351, 0.98478492, 0.72986697],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0.27597241, 0.26705301, 0.62124467, 0.43337711],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0.3943888 , 0.61586129, 0.02776393, 0.2560126 ],
[ 0.5934556 , 0.23093912, 0.12550062, 0.58542137]])
>>> trim_rows(a, 3)
>>> a
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0. , 0. , 0. , 0. ],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
And for your data size it's probably fast enough:
In [7]: a = np.random.rand(1000, 100)
In [8]: %timeit -n1 -r1 trim_rows(a, 50)
1 loops, best of 1: 7.65 ms per loop
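If the row count grows well beyond 1000, np.argpartition (a partial sort, average O(n) rather than O(n log n)) can stand in for the full np.argsort — a sketch of the same trim under that swap:

```python
import numpy as np

def trim_rows_argpartition(a, n):
    # argpartition places the indices of the n largest first-column
    # values in the last n slots; every row before them gets zeroed.
    idx = np.argpartition(a[:, 0], -n)[:-n]
    a[idx] = 0

a = np.array([[1.1, 2.1, 3.1],
              [2.1, 3.1, 4.1],
              [5.1, 0.1, 7.1],
              [0.1, 1.1, 1.1],
              [4.1, 3.1, 9.1]])
trim_rows_argpartition(a, 2)  # keeps the rows with first-column values 5.1 and 4.1
```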