Replace first occurrence of 0 in a 2D numpy array - python

I have a 2D array:
[[1,2,0,0],
[4,0,9,4],
[0,0,1,0],
[4,6,9,0]]
Is there an efficient way (without using loops) to replace the first 0 in each row of the array with a 1:
[[1,2,1,0],
[4,1,9,4],
[1,0,1,0],
[4,6,9,1]]
?
Thanks a lot !

Here is a one-liner inspired by the accepted answer of this question:
import numpy as np
a = np.array([
[1, 2, 0, 0],
[4, 0, 9, 4],
[0, 0, 1, 0],
[4, 6, 9, 0]
])
a[range(len(a)), np.argmax(a == 0, axis=1)] = 1
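One caveat with the one-liner above: np.argmax returns 0 for a row that contains no zero at all, so such a row would have its first element overwritten. The following is a minimal sketch of a guarded variant; the row of fives and the names is_zero, has_zero and rows are made up for illustration.

```python
import numpy as np

a = np.array([
    [1, 2, 0, 0],
    [4, 0, 9, 4],
    [5, 5, 5, 5],   # hypothetical row without any zero
    [4, 6, 9, 0],
])

is_zero = (a == 0)
first_zero_col = is_zero.argmax(axis=1)   # 0 for rows without a zero
has_zero = is_zero.any(axis=1)            # which rows really contain a zero
rows = np.flatnonzero(has_zero)

# Only touch rows that actually contain a zero
a[rows, first_zero_col[rows]] = 1
print(a)
```

The row without a zero is left unchanged, while the other rows get their first 0 replaced.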

So, you can use np.where to get the indices of the rows and columns where the array is 0:
In [45]: arr = np.array(
...: [[1,2,0,0],
...: [4,0,9,4],
...: [0,0,1,0],
...: [4,6,9,0]]
...: )
In [46]: r, c = np.where(arr == 0)
Then, use np.unique to get the unique row indices. With return_index=True it also returns the position of the first occurrence of each row index in r, which corresponds to the first 0 in that row, so those positions can be used to extract the matching column indices:
In [47]: uniq_val, uniq_idx = np.unique(r, return_index=True)
In [48]: arr[uniq_val, c[uniq_idx]] = 1
In [49]: arr
Out[49]:
array([[1, 2, 1, 0],
[4, 1, 9, 4],
[1, 0, 1, 0],
[4, 6, 9, 1]])
If performance is really an issue, you could just write a numba function. I suspect this problem would be very amenable to numba.
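A rough sketch of what such a function might look like is given below. The function name replace_first_zero is made up, and the try/except fallback is an assumption added so the code still runs as plain (slow) Python when numba is not installed; no timings are claimed here.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption): without numba, run the plain Python loop
    def njit(func):
        return func

@njit
def replace_first_zero(arr, value):
    # Scan each row and overwrite only its first zero, in place
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] == 0:
                arr[i, j] = value
                break
    return arr

arr = np.array([[1, 2, 0, 0],
                [4, 0, 9, 4],
                [0, 0, 1, 0],
                [4, 6, 9, 0]])
result = replace_first_zero(arr, 1)
print(result)
```

The explicit loop stops at the first zero of each row, so rows without a zero are simply left alone.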

Related

Can I slice a numpy array using an array as indices?

I have 2 numpy arrays:
a = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
b = np.array([2, 1, 2])
I want to use b as starting indices into the columns of a and set all the values of a from those column indexes onwards to 0 like this:
np.array([[1, 2, 3],
[4, 0, 6],
[0, 0, 0]])
i.e., set elements of column 1 from position 2 onwards to 0, set elements of column 2 from position 1 onwards to 0, and set elements of column 3 from position 2 onwards to 0.
When I try this:
a[:, b:] = 0
I get
TypeError: only integer scalar arrays can be converted to a scalar index
Is there a way to slice using an array of indices without a for loop?
Edit: updated the example to show the indices can be arbitrary
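You can use boolean array indexing. First, create a mask of the elements you want to keep, then invert it and assign the replacement value (0 in your case) to the selected positions. The row index must run over the number of rows, a.shape[0], so that the mask also has the right shape for non-square arrays.
mask = b > np.arange(a.shape[0])[:, None]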
a[~mask]=0
output:
array([[1, 2, 3],
[4, 0, 6],
[0, 0, 0]])
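To see how the broadcasting in the mask works, here is a small sketch on a non-square array. The 4x3 array and the values in b are made up for illustration; the only assumption is that b holds one starting row per column.

```python
import numpy as np

a = np.arange(1, 13).reshape(4, 3)   # 4 rows, 3 columns
b = np.array([2, 1, 3])              # start row for each column

# Column vector of row numbers, shape (4, 1), broadcast against b, shape (3,)
rows = np.arange(a.shape[0])[:, None]
mask = rows >= b                     # True where the element should be zeroed

a[mask] = 0
print(a)
```

Each column j is zeroed from row b[j] downwards, and the mask has the same shape as a.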
I think the issue is in a[:, b:]. Here b: means little if b is not a scalar: 5: means from the 6th element onwards, but [1,2,3]: means nothing when the array is 2D.
It should be a[:, b]. Setting a[:, b] = 0 will set all columns specified in b to 0. The following run shows this.
In [2]: import numpy as np
In [3]: a = np.array([[1, 2, 3],
...: [4, 5, 6],
...: [7, 8, 9]])
...:
...: b = np.array([2, 1, 2])
...:
In [4]: a
Out[4]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [5]: b
Out[5]: array([2, 1, 2])
In [6]: b.dtype
Out[6]: dtype('int64')
In [7]: a[:, b:] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-6e5050513225> in <module>
----> 1 a[:, b:] = 0
TypeError: only integer scalar arrays can be converted to a scalar index
In [8]: a[:, b] = 0
In [9]: a
Out[9]:
array([[1, 0, 0],
[4, 0, 0],
[7, 0, 0]])
But that's not what you want.
To get what you want, you need to specify the row indices and column indices explicitly, e.g. (1,1), (2,0), (2,1), (2,2).
In [11]: a[[1,2,2,2], [1, 0, 1,2]] = 0
In [12]: a
Out[12]:
array([[1, 2, 3],
[4, 0, 6],
[0, 0, 0]])
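Writing the index pairs out by hand does not scale, so here is a sketch of one way to generate them from b. It assumes b gives the starting row for each column; the names row_idx and col_idx are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
b = np.array([2, 1, 2])

# Positions (row, col) where the row number is at or past the start row b[col]
row_idx, col_idx = np.nonzero(np.arange(a.shape[0])[:, None] >= b)
print(row_idx, col_idx)

a[row_idx, col_idx] = 0
print(a)
```

For this input the generated pairs are the same (1,1), (2,0), (2,1), (2,2) used in the manual example.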

Padding a numpy array with offsets for each data column

I'm working with 2D numpy arrays which exhibit variable sizes, in terms of the number of rows and columns. I'd like to pad this array with zeros both before the start of the first row and at the end of the last row, but I'd like the start/end of the zeros to be offset in a different way for each column of data.
So the original 2D array:
1 2 3
4 5 6
7 8 9
A normal example of padding:
0 0 0
0 0 0
1 2 3
4 5 6
7 8 9
0 0 0
Modified Padding with offsets (what I'm trying to do):
0 0 0
1 0 0
4 0 3
7 2 6
0 5 9
0 8 0
Does numpy possess any functions which can replicate the last example in an extendable manner for variable numbers of rows/columns, and which avoid the use of for loops or other computationally slow approaches?
Here's a vectorized one with broadcasting and boolean-indexing -
def create_padded_array(a, row_start, n_rows):
    r = np.arange(n_rows)[:,None]
    row_start = np.asarray(row_start)
    mask = (r >= row_start) & (r < row_start+a.shape[0])
    out = np.zeros(mask.shape, dtype=a.dtype)
    out.T[mask.T] = a.ravel('F')
    return out
Sample run -
In [184]: a
Out[184]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [185]: create_padded_array(a, row_start=[1,3,2], n_rows=6)
Out[185]:
array([[0, 0, 0],
[1, 0, 0],
[4, 0, 3],
[7, 2, 6],
[0, 5, 9],
[0, 8, 0]])
Sorry for the trouble, but I think I found the answer that I was looking for.
I can use numpy.pad to create an arbitrary number of filler zeros at the end of my original array. There is also a function called numpy.roll which can then be used to shift all array elements along a given axis by a set number of positions down the column.
After a quick test, it looks like this is extendable for an arbitrary number of matrix elements and allows a unique offset along each column.
Thanks to everyone for their responses to this question!
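The answer above gives no code, so the following is only a sketch of how the np.pad plus np.roll idea might look. The offsets [1, 3, 2] and n_rows = 6 are taken from the desired output in the question; the variable names are made up. Note that np.roll takes a single shift per axis, so this sketch still loops over the columns (but not over the rows).

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
offsets = [1, 3, 2]   # number of leading zeros per column (assumed)
n_rows = 6            # total rows in the padded result (assumed)

# Append filler zeros below the data, then shift each column down by its offset
padded = np.pad(a, ((0, n_rows - a.shape[0]), (0, 0)), mode='constant')
out = np.empty_like(padded)
for j, shift in enumerate(offsets):
    out[:, j] = np.roll(padded[:, j], shift)
print(out)
```

This only works as intended when every offset is at most n_rows - a.shape[0]; with a larger shift np.roll would wrap data around to the top of the column.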
To my knowledge there is no numpy function with those exact requirements. However, what you can do is start with your array:
In [10]: arr = np.array([(1,2,3),(4,5,6),(7,8,9)])
In [11]: arr
Out[11]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Then pad it:
In [12]: arr = np.pad(arr, ((2,1),(0,0)), 'constant', constant_values=(0))
In [13]: arr
Out[13]:
array([[0, 0, 0],
[0, 0, 0],
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[0, 0, 0]])
Then you can randomize with shuffle (which I assume is what you want to do). Note that np.random.shuffle only shuffles whole rows; if this is satisfactory for your needs, then:
In [14]: np.random.shuffle(arr)
In [15]: arr
Out[15]:
array([[7, 8, 9],
[4, 5, 6],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[1, 2, 3]])
If this is not satisfactory you can do this:
First create a 1D array:
In [16]: arr = np.arange(1,10)
In [17]: arr
Out[17]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
Then pad your array with zeros:
In [18]: arr = np.pad(arr, (6,3), 'constant', constant_values = (0))
In [19]: arr
Out[19]: array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0])
Then you shuffle the array:
In [20]: np.random.shuffle(arr)
In [21]: arr
Out[21]: array([4, 0, 0, 5, 0, 0, 3, 0, 0, 0, 8, 0, 7, 2, 1, 6, 0, 9])
Finally you reshape to the desired format:
In [22]: np.reshape(arr,[6,3])
Out[22]:
array([[4, 0, 0],
[5, 0, 0],
[3, 0, 0],
[0, 8, 0],
[7, 2, 1],
[6, 0, 9]])
Although this may seem lengthy, it is much quicker for large data sets than using for loops or other Python control structures. If by offsets you mean controlling the amount of randomness, you can shuffle only a portion of the 1D array and then concatenate it with the rest of the data, so that only the portion you choose is shuffled.
(If what you mean by offsets is different from my assumption above, please clarify in a comment.)

vectorizing numpy bincount

I have a 2D numpy array, A. I want to apply np.bincount() to each column of the matrix A to generate another 2D array B that is composed of the bincounts of each column of the original matrix A.
My problem is that np.bincount() is a function that takes a 1d array-like. It's not an array method like B = A.max(axis=1) for example.
Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
for x in range(A.shape[1]):
    B[:,x] = np.bincount(A[:,x])
Using the same philosophy as in this post, here's a vectorized approach -
m = A.shape[1]
n = A.max()+1
A1 = A + (n*np.arange(m))
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T
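The trick above shifts the values of column j by j*n so that every column occupies its own block of bins, and a single np.bincount call can then count all columns at once. Here is a small worked sketch on a made-up 3x2 array, checked against the column-by-column loop.

```python
import numpy as np

A = np.array([[0, 1],
              [2, 1],
              [2, 3]])

m = A.shape[1]               # number of columns
n = A.max() + 1              # number of bins per column
A1 = A + n * np.arange(m)    # column j now uses bins j*n .. j*n + n - 1
out = np.bincount(A1.ravel(), minlength=n * m).reshape(m, -1).T
print(out)

# Reference result from an explicit loop over the columns
expected = np.stack([np.bincount(A[:, j], minlength=n) for j in range(m)], axis=1)
print(np.array_equal(out, expected))
```

Row k of out holds the count of value k in each column.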
I would suggest using np.apply_along_axis, which allows you to apply a 1D function (in this case np.bincount) to 1D slices of a higher-dimensional array:
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
B = np.apply_along_axis(np.bincount, axis=0, arr=A)
You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is not present in one or more columns of your array A, the output for those columns will be shorter than expected, and the code will fail with a ValueError.
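One possible way around the shape problem, sketched here under the assumption that every value in A is smaller than states, is to pass minlength through to np.bincount. np.apply_along_axis forwards extra keyword arguments to the function it applies. The small array below is made up so that the maximum state never occurs.

```python
import numpy as np

states = 4
A = np.array([[0, 1],
              [2, 1],
              [2, 0]])   # the value 3 never occurs in either column

# minlength is forwarded to np.bincount, so every column yields `states` bins
B = np.apply_along_axis(np.bincount, 0, A, minlength=states)
print(B)
```

Every column now produces a result of the same length, so the outputs can be stacked into one array.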
This solution using the numpy_indexed package (disclaimer: I am its author) is fully vectorized, thus does not include any python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())
This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:
(bin, col), count = npi.count((A.flatten(), colidx.flatten()))
Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
Yet another possibility:
import numpy as np
def bincount_columns(x, minlength=None):
    nbins = x.max() + 1
    if minlength is not None:
        nbins = max(nbins, minlength)
    ncols = x.shape[1]
    count = np.zeros((nbins, ncols), dtype=int)
    colidx = np.arange(ncols)[None, :]
    np.add.at(count, (x, colidx), 1)
    return count
For example,
In [110]: x
Out[110]:
array([[4, 2, 2, 3],
[4, 3, 4, 4],
[4, 3, 4, 4],
[0, 2, 4, 0],
[4, 1, 2, 1],
[4, 2, 4, 3]])
In [111]: bincount_columns(x)
Out[111]:
array([[1, 0, 0, 1],
[0, 1, 0, 1],
[0, 3, 2, 0],
[0, 2, 0, 2],
[5, 0, 4, 2]])
In [112]: bincount_columns(x, minlength=7)
Out[112]:
array([[1, 0, 0, 1],
[0, 1, 0, 1],
[0, 3, 2, 0],
[0, 2, 0, 2],
[5, 0, 4, 2],
[0, 0, 0, 0],
[0, 0, 0, 0]])

How to get a value from every column in a Numpy matrix

I'd like to get the index of a value for every column in a matrix M. For example:
M = matrix([[0, 1, 0],
[4, 2, 4],
[3, 4, 1],
[1, 3, 2],
[2, 0, 3]])
In pseudocode, I'd like to do something like this:
for col in M:
    idx = numpy.where(M[col]==0) # Only for columns!
and have idx be 0, 4, 0 for each column.
I have tried to use where, but I don't understand the return value, which is a tuple of matrices.
The tuple of matrices is a collection of items suited for indexing. The output will have the shape of the indexing matrices (or arrays), and each item in the output will be selected from the original array using the first array as the index of the first dimension, the second as the index of the second dimension, and so on. In other words, this:
>>> numpy.where(M == 0)
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
>>> row, col = numpy.where(M == 0)
>>> M[row, col]
matrix([[0, 0, 0]])
>>> M[numpy.where(M == 0)] = 1000
>>> M
matrix([[1000, 1, 1000],
[ 4, 2, 4],
[ 3, 4, 1],
[ 1, 3, 2],
[ 2, 1000, 3]])
The sequence may be what's confusing you. It proceeds in flattened order -- so M[0,2] appears second, not third. If you need to reorder them, you could do this:
>>> row[0,col.argsort()]
matrix([[0, 4, 0]])
You also might be better off using arrays instead of matrices. That way you can manipulate the shape of the arrays, which is often useful! Also note ajcr's transpose-based trick, which is probably preferable to using argsort.
Finally, there is also a nonzero method that does the same thing as where in this case. Using the transpose trick now:
>>> (M == 0).T.nonzero()
(matrix([[0, 1, 2]]), matrix([[0, 4, 0]]))
As an alternative to np.where, you could perhaps use np.argwhere to return an array of indexes where the array meets the condition:
>>> np.argwhere(M == 0)
array([[[0, 0]],
[[0, 2]],
[[4, 1]]])
This tells you each of the indexes, in the format [row, column], where the condition was met.
If you'd prefer the format of this output array to be grouped by column rather than row, (that is, [column, row]), just use the method on the transpose of the array:
>>> np.argwhere(M.T == 0).squeeze()
array([[0, 0],
[1, 4],
[2, 0]])
I also used np.squeeze here to get rid of axis 1, so that we are left with a 2D array. The sequence you want is the second column, i.e. np.argwhere(M.T == 0).squeeze()[:, 1].
The result of where(M == 0) would look something like this:
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
The first matrix tells you the rows where the 0s are and the second matrix tells you the columns where the 0s are.
In [4]: M
Out[4]:
matrix([[0, 1, 0],
[4, 2, 4],
[3, 4, 1],
[1, 3, 2],
[2, 0, 3]])
In [5]: np.where(M == 0)
Out[5]: (matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
In [6]: M[0,0]
Out[6]: 0
In [7]: M[0,2] #0th row 2nd column
Out[7]: 0
In [8]: M[4,1] #4th row 1st column
Out[8]: 0
This isn't anything new on what's been already suggested, but a one-line solution is:
>>> np.where(np.array(M.T)==0)[-1]
array([0, 4, 0])
(I agree that NumPy matrix objects are more trouble than they're worth).
>>> M = np.array([[0, 1, 0],
... [4, 2, 4],
... [3, 4, 1],
... [1, 3, 2],
... [2, 0, 3]])
>>> [np.where(M[:,i]==0)[0][0] for i in range(M.shape[1])]
[0, 4, 0]
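With a plain array, the row of the first 0 in each column can also be found without a Python-level loop. The following is a sketch using argmax on the boolean comparison; the names is_zero and has_zero are illustrative, and has_zero is there because argmax returns 0 for a column that contains no zero at all.

```python
import numpy as np

M = np.array([[0, 1, 0],
              [4, 2, 4],
              [3, 4, 1],
              [1, 3, 2],
              [2, 0, 3]])

is_zero = (M == 0)
idx = is_zero.argmax(axis=0)     # row of the first True in each column
has_zero = is_zero.any(axis=0)   # False for columns without any zero
print(idx, has_zero)
```

Here every column contains a zero, so idx can be used directly.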

1D numpy array which is shifted to the right for each consecutive row in a new 2D array

I am trying to optimise some code by removing for loops and using numpy arrays only as I am working with large data sets.
I would like to take a 1D numpy array, for example:
a = [1, 2, 3, 4, 5]
and produce a 2D numpy array whereby the value in each column shifts along a place, for example in the case above for a I wish to have a function which returns:
[[1 2 3 4 5]
[0 1 2 3 4]
[0 0 1 2 3]
[0 0 0 1 2]
[0 0 0 0 1]]
I have found examples which use the strides function to do something similar to produce, for example:
[[1 2 3]
[2 3 4]
[3 4 5]]
However I am trying to shift each of my columns in the other direction. Alternatively, one can view the problem as putting the first element of a on the first diagonal, the second element on the second diagonal and so on. However, I would like to stress again how I would like to avoid using a for, while or if loop entirely. Any help would be greatly appreciated.
Such a matrix is an example of a Toeplitz matrix. You could use scipy.linalg.toeplitz to create it:
In [32]: from scipy.linalg import toeplitz
In [33]: a = range(1,6)
In [34]: toeplitz(a, np.zeros_like(a)).T
Out[34]:
array([[1, 2, 3, 4, 5],
[0, 1, 2, 3, 4],
[0, 0, 1, 2, 3],
[0, 0, 0, 1, 2],
[0, 0, 0, 0, 1]])
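If SciPy is not available, the same matrix can be sketched with plain NumPy indexing. This is not the toeplitz implementation, just one way to get the same values: entry (i, j) takes a[j - i] when j >= i and 0 otherwise. The names offset and shifted are made up.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
n = a.size

# offset[i, j] = j - i, i.e. how far column j lies to the right of the diagonal
offset = np.arange(n)[None, :] - np.arange(n)[:, None]

# Clip negative offsets so the lookup is valid, then blank them out with zeros
shifted = np.where(offset >= 0, a[np.clip(offset, 0, None)], 0)
print(shifted)
```

Like scipy.linalg.toeplitz, this builds a full n-by-n array in memory.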
Inspired by @EelcoHoogendoorn's answer, here's a variation that doesn't use as much memory as scipy.linalg.toeplitz:
In [47]: from numpy.lib.stride_tricks import as_strided
In [48]: a
Out[48]: array([1, 2, 3, 4, 5])
In [49]: t = as_strided(np.r_[a[::-1], np.zeros_like(a)], shape=(a.size,a.size), strides=(a.itemsize, a.itemsize))[:,::-1]
In [50]: t
Out[50]:
array([[1, 2, 3, 4, 5],
[0, 1, 2, 3, 4],
[0, 0, 1, 2, 3],
[0, 0, 0, 1, 2],
[0, 0, 0, 0, 1]])
The result should be treated as a "read only" array. Otherwise, you'll be in for some surprises when you change an element. For example:
In [51]: t[0,2] = 99
In [52]: t
Out[52]:
array([[ 1, 2, 99, 4, 5],
[ 0, 1, 2, 99, 4],
[ 0, 0, 1, 2, 99],
[ 0, 0, 0, 1, 2],
[ 0, 0, 0, 0, 1]])
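To guard against such accidental writes, as_strided accepts a writeable argument. The sketch below builds the same view as read-only and takes a copy when an independent, editable array is needed; the names buf, write_failed and independent are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.array([1, 2, 3, 4, 5])
buf = np.r_[a[::-1], np.zeros_like(a)]

# Same construction as above, but the view is marked read-only
t = as_strided(buf, shape=(a.size, a.size),
               strides=(buf.itemsize, buf.itemsize),
               writeable=False)[:, ::-1]

try:
    t[0, 2] = 99
    write_failed = False
except ValueError:
    # Assigning to a read-only array raises ValueError
    write_failed = True

# A copy owns its own memory, so changing one element affects only that element
independent = t.copy()
independent[0, 2] = 99
print(write_failed)
print(independent)
```

The copy costs the full n-by-n memory again, so it only makes sense when the result really has to be modified.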
Here is the indexing-tricks based solution. Not nearly as elegant as the toeplitz solution already posted, but should memory consumption or performance be a concern, it is to be preferred. As demonstrated, this also makes it easy to subsequently manipulate the entries of the matrix in a consistent manner.
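import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(5)+1

def toeplitz_view(a):
    b = np.concatenate((np.zeros_like(a), a))
    i = a.itemsize
    # the full strided array reaches outside b; only the slice returned below is safe to use
    v = as_strided(b,
                   shape=(len(b),)*2,
                   strides=(-i, i))
    # return a view on the 'original' data as well, for manipulation
    return v[:len(a), len(a):], b[len(a):]

v, a = toeplitz_view(a)
print(v)
a[0] = 10     # changes the whole main diagonal
v[2,1] = -1   # changes the whole first sub-diagonal
print(v)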
