I have a lot of data in a database in (x, y, value) triplet form.
I would like to dynamically create a 2D numpy array from this data by setting the value at the coordinates (x, y) of the array.
For instance if I have :
(0,0,8)
(0,1,5)
(0,2,3)
(1,0,4)
(1,1,0)
(1,2,0)
(2,0,1)
(2,1,2)
(2,2,5)
The resulting array should be :
Array([[8,5,3],[4,0,0],[1,2,5]])
I'm new to numpy. Is there any method in numpy to do so? If not, what approach would you advise to do this?
Extending the answer from @MaxU, in case the coordinates are not ordered in a grid fashion (or in case some coordinates are missing), you can create your array as follows:
import numpy as np
a = np.array([(0,0,8), (0,1,5), (0,2,3),
              (1,0,4), (1,1,0), (1,2,0),
              (2,0,1), (2,1,2), (2,2,5)])
Here a represents your coordinates. It is an (N, 3) array, where N is the number of coordinates (it doesn't have to contain ALL the coordinates). The first column of a (a[:, 0]) contains the Y positions, while the second column (a[:, 1]) contains the X positions. Similarly, the last column (a[:, 2]) contains your values.
Then you can extract the maximum dimensions of your target array:
# Maximum Y and X coordinates
ymax = a[:, 0].max()
xmax = a[:, 1].max()
# Target array
target = np.zeros((ymax+1, xmax+1), a.dtype)
And finally, fill the array with data from your coordinates:
target[a[:, 0], a[:, 1]] = a[:, 2]
The line above sets values in target at a[:, 0] (all Y) and a[:, 1] (all X) locations to their corresponding a[:, 2] value (your value).
>>> target
array([[8, 5, 3],
[4, 0, 0],
[1, 2, 5]])
Additionally, if you have missing coordinates and you want to replace those missing values with some number, you can initialize the array as:
default_value = -1
target = np.full((ymax+1, xmax+1), default_value, a.dtype)
This way, the coordinates not present in your list will be filled with -1 in the target array.
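As a minimal sketch of that variant (assuming the same a as above, but with the (1, 1, 0) triplet removed so one cell is missing):

import numpy as np

a = np.array([(0, 0, 8), (0, 1, 5), (0, 2, 3),
              (1, 0, 4), (1, 2, 0),
              (2, 0, 1), (2, 1, 2), (2, 2, 5)])

ymax, xmax = a[:, 0].max(), a[:, 1].max()

# Cells with no corresponding triplet keep the default value
target = np.full((ymax + 1, xmax + 1), -1, a.dtype)
target[a[:, 0], a[:, 1]] = a[:, 2]
# target is now:
# [[ 8  5  3]
#  [ 4 -1  0]
#  [ 1  2  5]]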
Why not use sparse matrices? (That is pretty much the format of your triplets.)
First split the triplets into rows, columns, and data using numpy.hsplit(). (Use numpy.squeeze() to convert the resulting 2D arrays to 1D arrays.)
>>> row, col, data = [np.squeeze(splt) for splt
...                   in np.hsplit(triplets, triplets.shape[-1])]
Use the sparse matrix in coordinate format, and convert it to an array.
>>> from scipy.sparse import coo_matrix
>>> coo_matrix((data, (row, col))).toarray()
array([[8, 5, 3],
[4, 0, 0],
[1, 2, 5]])
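As a self-contained sketch (using the triplets from the question; the shape argument is optional, but fixing it explicitly guards against missing coordinates at the edges):

import numpy as np
from scipy.sparse import coo_matrix

triplets = np.array([(0, 0, 8), (0, 1, 5), (0, 2, 3),
                     (1, 0, 4), (1, 1, 0), (1, 2, 0),
                     (2, 0, 1), (2, 1, 2), (2, 2, 5)])

# Unpacking the transpose is equivalent to the hsplit/squeeze step above
row, col, data = triplets.T

print(coo_matrix((data, (row, col)), shape=(3, 3)).toarray())
# [[8 5 3]
#  [4 0 0]
#  [1 2 5]]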
Is that what you want?
In [37]: a = np.array([(0,0,8)
....: ,(0,1,5)
....: ,(0,2,3)
....: ,(1,0,4)
....: ,(1,1,0)
....: ,(1,2,0)
....: ,(2,0,1)
....: ,(2,1,2)
....: ,(2,2,5)])
In [38]:
In [38]: a
Out[38]:
array([[0, 0, 8],
[0, 1, 5],
[0, 2, 3],
[1, 0, 4],
[1, 1, 0],
[1, 2, 0],
[2, 0, 1],
[2, 1, 2],
[2, 2, 5]])
In [39]:
In [39]: a[:, 2].reshape(3,len(a)//3)
Out[39]:
array([[8, 5, 3],
[4, 0, 0],
[1, 2, 5]])
or a bit more flexible (after your comment):
In [48]: a[:, 2].reshape([int(len(a) ** .5)] * 2)
Out[48]:
array([[8, 5, 3],
[4, 0, 0],
[1, 2, 5]])
Explanation:
this gives you the 3rd column (value):
In [42]: a[:, 2]
Out[42]: array([8, 5, 3, 4, 0, 0, 1, 2, 5])
In [49]: [int(len(a) ** .5)]
Out[49]: [3]
In [50]: [int(len(a) ** .5)] * 2
Out[50]: [3, 3]
I have a 2d numpy array, A. I want to apply np.bincount() to each column of the matrix A to generate another 2d array B that is composed of the bincounts of each column of the original matrix A.
My problem is that np.bincount() is a function that takes a 1d array-like. It's not an array method like B = A.max(axis=1) for example.
Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
for x in range(A.shape[1]):
    B[:,x] = np.bincount(A[:,x])
Using the same philosophy as in this post, here's a vectorized approach -
m = A.shape[1]
n = A.max()+1
A1 = A + (n*np.arange(m))
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T
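For instance, on an A like the one in the question this reproduces the per-column loop (a quick sanity check; minlength is used in the loop here so both sides have the same shape even if a column misses the top state):

import numpy as np

A = np.random.randint(0, 4, (8, 4))

m = A.shape[1]
n = A.max() + 1
A1 = A + (n * np.arange(m))          # shift each column into its own bin range
out = np.bincount(A1.ravel(), minlength=n * m).reshape(m, -1).T

# Reference: the original per-column loop
B = np.zeros((n, m), dtype=int)
for x in range(m):
    B[:, x] = np.bincount(A[:, x], minlength=n)

assert (out == B).all()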
I would suggest using np.apply_along_axis, which allows you to apply a 1D method (in this case np.bincount) to 1D slices of a higher-dimensional array:
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
B = np.apply_along_axis(np.bincount, axis=0, arr=A)
You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is not present in one or more columns of your array A, the output will have a smaller dimensionality and, thus, the code will fail with a ValueError.
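One way around that (a sketch, not part of the original answer) is to pin the output length with the minlength argument via a small lambda:

import numpy as np

states = 4
A = np.random.randint(0, states, (8, 4))

# minlength=states guarantees a length-`states` count vector for every column,
# even when a column never contains the maximum state.
B = np.apply_along_axis(lambda col: np.bincount(col, minlength=states), axis=0, arr=A)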
This solution, using the numpy_indexed package (disclaimer: I am its author), is fully vectorized and thus does not include any Python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())
This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:
(bin, col), count = npi.count((A.flatten(), colidx.flatten()))
Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
Yet another possibility:
import numpy as np
def bincount_columns(x, minlength=None):
    nbins = x.max() + 1
    if minlength is not None:
        nbins = max(nbins, minlength)
    ncols = x.shape[1]
    count = np.zeros((nbins, ncols), dtype=int)
    colidx = np.arange(ncols)[None, :]
    np.add.at(count, (x, colidx), 1)
    return count
For example,
In [110]: x
Out[110]:
array([[4, 2, 2, 3],
[4, 3, 4, 4],
[4, 3, 4, 4],
[0, 2, 4, 0],
[4, 1, 2, 1],
[4, 2, 4, 3]])
In [111]: bincount_columns(x)
Out[111]:
array([[1, 0, 0, 1],
[0, 1, 0, 1],
[0, 3, 2, 0],
[0, 2, 0, 2],
[5, 0, 4, 2]])
In [112]: bincount_columns(x, minlength=7)
Out[112]:
array([[1, 0, 0, 1],
[0, 1, 0, 1],
[0, 3, 2, 0],
[0, 2, 0, 2],
[5, 0, 4, 2],
[0, 0, 0, 0],
[0, 0, 0, 0]])
For example, I have numpy arrays like this:
a =
array([[1, 2, 3],
[4, 3, 2]])
and an index like this to select the max values:
max_idx =
array([[0, 2],
[1, 0]])
How can I access those positions all at the same time, to modify them?
Something like "a[max_idx] = 0", getting the following:
array([[1, 2, 0],
[0, 3, 2]])
Simply use subscripted-indexing -
a[max_idx[:,0],max_idx[:,1]] = 0
If you are working with higher dimensional arrays and don't want to type out slices of max_idx for each axis, you can use linear-indexing to assign zeros, like so -
a.ravel()[np.ravel_multi_index(max_idx.T,a.shape)] = 0
Sample run -
In [28]: a
Out[28]:
array([[1, 2, 3],
[4, 3, 2]])
In [29]: max_idx
Out[29]:
array([[0, 2],
[1, 0]])
In [30]: a[max_idx[:,0],max_idx[:,1]] = 0
In [31]: a
Out[31]:
array([[1, 2, 0],
[0, 3, 2]])
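And a quick check of the linear-indexing variant on the same data (re-creating a first, since it was modified in place above) -

In [32]: a = np.array([[1, 2, 3],
   ....:               [4, 3, 2]])

In [33]: a.ravel()[np.ravel_multi_index(max_idx.T, a.shape)] = 0

In [34]: a
Out[34]:
array([[1, 2, 0],
       [0, 3, 2]])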
NumPy supports advanced indexing like this:
a[b[:, 0], b[:, 1]] = 0
The code above fits your requirement.
If b is more than 2-D, a better way would be something like this:
a[np.split(b, 2, axis=1)]
np.split will split the ndarray into its columns.
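For example (a sketch; note that newer NumPy versions treat a list of index arrays differently from a tuple, hence the tuple() call):

import numpy as np

a = np.array([[1, 2, 3],
              [4, 3, 2]])
b = np.array([[0, 2],
              [1, 0]])

# np.split(b, 2, axis=1) -> [array([[0], [1]]), array([[2], [0]])]
a[tuple(np.split(b, 2, axis=1))] = 0
# a is now:
# [[1 2 0]
#  [0 3 2]]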
import numpy as np
x = np.array([[1, 2, 3], [9, 8, 7]])
y = np.array([[2, 1, 0], [1, 0, 2]])
x[y]
Expected output:
array([[3,2,1], [8,9,7]])
If x and y were 1D arrays, then x[y] would work. So what is the numpy way or most pythonic or efficient way of doing this for 2D arrays?
You need to define the corresponding row indices.
One way is:
>>> x[np.arange(x.shape[0])[..., None], y]
array([[3, 2, 1],
[8, 9, 7]])
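To see why this works, the row-index array is just a column vector of row numbers that broadcasts against y (a small illustration):
>>> rows = np.arange(x.shape[0])[..., None]   # column of row numbers
>>> rows
array([[0],
       [1]])
>>> x[rows, y]    # rows broadcasts against y, pairing each y entry with its own row
array([[3, 2, 1],
       [8, 9, 7]])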
You can calculate the linear indices from y and then use those to extract specific elements from x, like so -
# Linear indices from y, using x's shape
lin_idx = y + np.arange(y.shape[0])[:,None]*x.shape[1]
# Use np.take to extract those indexed elements from x
out = np.take(x,lin_idx)
Sample run -
In [47]: x
Out[47]:
array([[1, 2, 3],
[9, 8, 7]])
In [48]: y
Out[48]:
array([[2, 1, 0],
[1, 0, 2]])
In [49]: lin_idx = y + np.arange(y.shape[0])[:,None]*x.shape[1]
In [50]: lin_idx # Compare this with y
Out[50]:
array([[2, 1, 0],
[4, 3, 5]])
In [51]: np.take(x,lin_idx)
Out[51]:
array([[3, 2, 1],
[8, 9, 7]])
I'd like to get the index of a value for every column in a matrix M. For example:
M = matrix([[0, 1, 0],
[4, 2, 4],
[3, 4, 1],
[1, 3, 2],
[2, 0, 3]])
In pseudocode, I'd like to do something like this:
for col in M:
    idx = numpy.where(M[col]==0) # Only for columns!
and have idx be 0, 4, 0 for each column.
I have tried to use where, but I don't understand the return value, which is a tuple of matrices.
The tuple of matrices is a collection of items suited for indexing. The output will have the shape of the indexing matrices (or arrays), and each item in the output will be selected from the original array using the first array as the index of the first dimension, the second as the index of the second dimension, and so on. In other words, this:
>>> numpy.where(M == 0)
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
>>> row, col = numpy.where(M == 0)
>>> M[row, col]
matrix([[0, 0, 0]])
>>> M[numpy.where(M == 0)] = 1000
>>> M
matrix([[1000, 1, 1000],
[ 4, 2, 4],
[ 3, 4, 1],
[ 1, 3, 2],
[ 2, 1000, 3]])
The sequence may be what's confusing you. It proceeds in flattened order -- so M[0,2] appears second, not third. If you need to reorder them, you could do this:
>>> row[0,col.argsort()]
matrix([[0, 4, 0]])
You also might be better off using arrays instead of matrices. That way you can manipulate the shape of the arrays, which is often useful! Also note ajcr's transpose-based trick, which is probably preferable to using argsort.
Finally, there is also a nonzero method that does the same thing as where in this case. Using the transpose trick now:
>>> (M == 0).T.nonzero()
(matrix([[0, 1, 2]]), matrix([[0, 4, 0]]))
As an alternative to np.where, you could perhaps use np.argwhere to return an array of indexes where the array meets the condition:
>>> np.argwhere(M == 0)
array([[[0, 0]],
[[0, 2]],
[[4, 1]]])
This tells you each of the indices, in [row, column] format, where the condition was met.
If you'd prefer the format of this output array to be grouped by column rather than row, (that is, [column, row]), just use the method on the transpose of the array:
>>> np.argwhere(M.T == 0).squeeze()
array([[0, 0],
[1, 4],
[2, 0]])
I also used np.squeeze here to get rid of axis 1, so that we are left with a 2D array. The sequence you want is the second column, i.e. np.argwhere(M.T == 0).squeeze()[:, 1].
The result of where(M == 0) would look something like this:
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
The first matrix tells you the rows where the 0s are, and the second matrix tells you the columns where the 0s are.
In [4]: M
Out[4]:
matrix([[0, 1, 0],
[4, 2, 4],
[3, 4, 1],
[1, 3, 2],
[2, 0, 3]])
In [5]: np.where(M == 0)
Out[5]: (matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
In [6]: M[0,0]
Out[6]: 0
In [7]: M[0,2] #0th row 2nd column
Out[7]: 0
In [8]: M[4,1] #4th row 1st column
Out[8]: 0
This isn't anything new on what's been already suggested, but a one-line solution is:
>>> np.where(np.array(M.T)==0)[-1]
array([0, 4, 0])
(I agree that NumPy matrix objects are more trouble than they're worth).
>>> M = np.array([[0, 1, 0],
... [4, 2, 4],
... [3, 4, 1],
... [1, 3, 2],
... [2, 0, 3]])
>>> [np.where(M[:,i]==0)[0][0] for i in range(M.shape[1])]
[0, 4, 0]
Briefly: there is a similar question and the best answer suggests using numpy.bincount. I need the same thing, but for a matrix.
I've got two arrays:
array([1, 2, 1, 1, 2])
array([2, 1, 1, 1, 1])
together they make indices that should be incremented:
>>> np.array([a, b]).T
array([[1, 2],
[2, 1],
[1, 1],
[1, 1],
[2, 1]])
I want to get this matrix:
array([[0, 0, 0],
[0, 2, 1], # (1,1) twice, (1,2) once
[0, 2, 0]]) # (2,1) twice
The matrix will be small (like, 5×5), and the number of indices will be large (somewhere near 10^3 or 10^5).
So, is there anything better (faster) than a for-loop?
You can still use bincount(). The trick is to convert a and b into a single 1D array of flat indices.
If the matrix is nxm, you could apply bincount() to a * m + b, and construct the matrix from the result.
To take the example in your question:
In [15]: a = np.array([1, 2, 1, 1, 2])
In [16]: b = np.array([2, 1, 1, 1, 1])
In [17]: cnt = np.bincount(a * 3 + b)
In [18]: cnt.resize((3, 3))
In [19]: cnt
Out[19]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
If the shape of the array is more complicated, it might be easier to use np.ravel_multi_index() instead of computing flat indices by hand:
In [20]: cnt = np.bincount(np.ravel_multi_index(np.vstack((a, b)), (3, 3)))
In [21]: np.resize(cnt, (3, 3))
Out[21]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
(Hat tip @Jaime for pointing out ravel_multi_index.)
import numpy as np

m1 = m.view(np.ndarray)   # View m's data as a plain ndarray (shares memory with m)
m1.shape = -1             # Flatten the view in place
# Flat (row-major) index of position (a, b) is a * ncols + b
m1 += np.bincount(a * m.shape[1] + b, minlength=m1.size)
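A quick check with the data from the question (a sketch, assuming m starts out as a 3×3 matrix of zeros):

import numpy as np

a = np.array([1, 2, 1, 1, 2])
b = np.array([2, 1, 1, 1, 1])
m = np.matrix(np.zeros((3, 3), dtype=int))

m1 = m.view(np.ndarray)
m1.shape = -1
m1 += np.bincount(a * m.shape[1] + b, minlength=m1.size)

print(m)
# [[0 0 0]
#  [0 2 1]
#  [0 2 0]]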