Numpy array concatenation - python

In some special cases, arrays can be concatenated without explicitly calling the concatenate function. For example, given a 2D array A, the following code will yield an identical array B:
B = np.array([A[ii,:] for ii in range(A.shape[0])])
I know this method works, but do not quite understand the underlying mechanism. Can anyone demystify the code above a little bit?

A[ii,:] is ii-th row of array A.
The list comprehension [A[ii,:] for ii in range(A.shape[0])] basically makes a list of rows in A (A.shape[0] is number of rows in A).
Finally, B is an array built from that list of A's rows, so it has the same contents as A itself.
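The steps above can be checked with a small sketch. The array A below is a made-up example, and the variable names are only illustrative:

```python
import numpy as np

# Hypothetical 2D array used only for illustration
A = np.arange(6).reshape(2, 3)

# Step 1: a plain Python list holding each row of A (each row is a 1D view)
rows = [A[ii, :] for ii in range(A.shape[0])]

# Step 2: np.array stacks the equal-length rows into a new 2D array
B = np.array(rows)

same_values = np.array_equal(A, B)      # B has the same contents as A
shares_memory = np.shares_memory(A, B)  # but B is a separate copy
print(same_values, shares_memory)
```

Note that B is a new array with its own data, so modifying B afterwards should not change A.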

By now you should be familiar with making an array from a list of lists:
In [178]: np.array([[1,2],[3,4]])
Out[178]:
array([[1, 2],
[3, 4]])
but that works just as well if it's a list of arrays:
In [179]: np.array([np.array([1,2]),np.array([3,4])])
Out[179]:
array([[1, 2],
[3, 4]])
stack also does this, by adding a dimension to the arrays and calling concatenate (read its code):
In [180]: np.stack([np.array([1,2]),np.array([3,4])])
Out[180]:
array([[1, 2],
[3, 4]])
concatenate joins the arrays - on an existing axis:
In [181]: np.concatenate([np.array([1,2]),np.array([3,4])])
Out[181]: array([1, 2, 3, 4])
stack adds a dimension first, as in:
In [182]: np.concatenate([np.array([[1,2]]),np.array([[3,4]])])
Out[182]:
array([[1, 2],
[3, 4]])
np.array and concatenate aren't identical, but there's a lot of overlap in their functionality.
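As a rough illustration of that overlap, the sketch below builds the same 2D result three ways. The variable names are arbitrary:

```python
import numpy as np

rows = [np.array([1, 2]), np.array([3, 4])]

# np.array treats the list of 1D arrays like a list of lists
via_array = np.array(rows)

# np.stack adds a new leading axis to each array, then joins them
via_stack = np.stack(rows)

# Doing the "add a dimension" step by hand before concatenating
via_concat = np.concatenate([r[np.newaxis, :] for r in rows])

# Without the extra axis, concatenate joins along the existing axis
flat = np.concatenate(rows)

print(via_array.shape, via_stack.shape, via_concat.shape, flat.shape)
```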

Related

numpy array indexing with lists and arrays

I have:
>>> a
array([[1, 2],
[3, 4]])
>>> type(l), l # list of scalars
(<type 'list'>, [0, 1])
>>> type(i), i # a numpy array
(<type 'numpy.ndarray'>, array([0, 1]))
>>> type(j), j # list of numpy arrays
(<type 'list'>, [array([0, 1]), array([0, 1])])
When I do
>>> a[l] # Case 1, l is a list of scalars
I get
array([[1, 2],
[3, 4]])
which means indexing happened only on 0th axis.
But when I do
>>> a[j] # Case 2, j is a list of numpy arrays
I get
array([1, 4])
which means indexing happened along axis 0 and axis 1.
Q1: When used for indexing, why is there a difference in treatment of a list of scalars and a list of numpy arrays? (Case 1 vs Case 2). In Case 2, I was hoping to see indexing happen only along axis 0 and get
array( [[[1,2],
[3,4]],
[[1,2],
[3,4]]])
Now, when using numpy array of arrays instead
>>> j1 = np.array(j) # numpy array of arrays
The result below indicates that indexing happened only along axis 0 (as expected)
>>> a[j1] # Case 3, j1 is a numpy array of numpy arrays
array([[[1, 2],
[3, 4]],
[[1, 2],
[3, 4]]])
Q2: When used for indexing, why is there a difference in treatment of list of numpy arrays and numpy array of numpy arrays? (Case 2 vs Case 3)
Case1, a[l] is actually a[(l,)] which expands to a[(l, slice(None))]. That is, indexing the first dimension with the list l, and an automatic trailing : slice. Indices are passed as a tuple to the array __getitem__, and extra () may be added without confusion.
Case2, a[j] is treated as a[array([0, 1]), array([0, 1])] or a[(array([0, 1]), array([0, 1]))]. In other words, as a tuple of indexing objects, one per dimension. It ends up returning a[0,0] and a[1,1].
Case3, a[j1] is a[(j1, slice(None))], applying the j1 index to just the first dimension.
Case2 is a bit of an anomaly. Your intuition is valid, but for historical reasons, this list of arrays (or list of lists) is interpreted as a tuple of arrays.
This has been discussed in other SO questions, and I think it is documented. But off hand I can't find those references.
So it's safer to use either a tuple of indexing objects, or an array. Indexing with a list has a potential ambiguity.
numpy array indexing: list index and np.array index give different result
This SO question touches on the same issue, though the clearest statement of what is happening is buried in a code link in a comment by @user2357112.
Another way of forcing Case 3-style indexing is to make the 2nd dimension slice explicit, a[j,:]:
In [166]: a[j]
Out[166]: array([1, 4])
In [167]: a[j,:]
Out[167]:
array([[[1, 2],
[3, 4]],
[[1, 2],
[3, 4]]])
(I often include the trailing : even if it isn't needed. It makes it clear to me, and readers, how many dimensions we are working with.)
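A minimal sketch of the two unambiguous spellings recommended above, a tuple of index arrays and a single index array. It deliberately avoids the bare a[j] form, because my understanding is that more recent NumPy releases changed how a list of arrays is interpreted, so the Case 2 output may depend on the installed version:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
l = [0, 1]
j = [np.array([0, 1]), np.array([0, 1])]

# Tuple of index arrays: one per dimension, pairs up as a[0, 0] and a[1, 1]
diag = a[tuple(j)]

# Single 2D index array: applied to the first dimension only
stacked = a[np.array(j)]

# A list of scalars indexes the first dimension; the trailing ':' is implied
rows_implicit = a[l]
rows_explicit = a[l, :]

print(diag, stacked.shape)
```

Converting explicitly with tuple(...) or np.array(...) states the intent, which is the point of the "safer" advice above.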
A1: The structure of l is not the same as j.
l is one-dimensional while j is two-dimensional. If you change one of them:
# l = [0, 1] # just one dimension!
l = [[0, 1], [0, 1]] # two dimensions
j = [np.array([0,1]), np.array([0, 1])] # two dimensions
They behave the same way.
A2: Likewise, the structures of the index objects in Case 2 and Case 3 are not the same.

Understanding non-homogeneous numpy arrays

I have recently started numpy and noticed a peculiar thing.
import numpy as np
a = np.array([[1,2,3], [4,5,9, 8]])
print a.shape, "shape"
print a[1, 0]
The shape, in this case, comes out to be 2L. However, if I make a homogeneous numpy array as
a = np.array([[1,2,3], [4,5,6]]), then a.shape gives (2L, 3L). I understand that the shape of a non-homogeneous array is difficult to represent as a tuple.
Additionally, print a[1,0] for the non-homogeneous array that I created earlier gives a traceback: IndexError: too many indices for array. Doing the same on the homogeneous array gives back the correct element 4.
Noticing these two peculiarities, I am curious to know how python looks at non-homogeneous numpy arrays at a low level.
Thank You in advance
When the sublists differ in length, np.array falls back to creating an object dtype array:
In [272]: a = np.array([[1,2,3], [4,5,9, 8]])
In [273]: a
Out[273]: array([[1, 2, 3], [4, 5, 9, 8]], dtype=object)
This array is similar to the list we started with. Both store the sublists as pointers. The sublists exist elsewhere in memory.
With equal length sublists, it can create a 2d array, with integer elements:
In [274]: a2 = np.array([[1,2,3], [4,5,9]])
In [275]: a2
Out[275]:
array([[1, 2, 3],
[4, 5, 9]])
In fact to confirm my claim that the sublists are stored elsewhere in memory, let's try to change one:
In [276]: alist = [[1,2,3], [4,5,9, 8]]
In [277]: a = np.array(alist)
In [278]: a
Out[278]: array([[1, 2, 3], [4, 5, 9, 8]], dtype=object)
In [279]: a[0].append(4)
In [280]: a
Out[280]: array([[1, 2, 3, 4], [4, 5, 9, 8]], dtype=object)
In [281]: alist
Out[281]: [[1, 2, 3, 4], [4, 5, 9, 8]]
That would not work in the case of a2. a2 has its own data storage, independent of the source list.
The basic point is that np.array tries to create an n-d array where possible. If it can't it falls back on to creating an object dtype array. And, as has been discussed in other questions, it sometimes raises an error. It is also tricky to intentionally create an object array.
The shape of a is easy, (2,). A single element tuple. a is a 1d array. But that shape does not convey information about the elements of a. And the same goes for the elements of alist. len(alist) is 2. An object array can have a more complex shape, e.g. a.reshape(1,2,1), but it still just contains pointers.
a contains 2 4byte pointers; a2 contains 6 4byte integers.
In [282]: a.itemsize
Out[282]: 4
In [283]: a.nbytes
Out[283]: 8
In [284]: a2.nbytes
Out[284]: 24
In [285]: a2.itemsize
Out[285]: 4
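The pointer-versus-data distinction can be sketched as follows. One assumption to flag: the example passes dtype=object explicitly, since, as far as I know, newer NumPy releases raise an error for ragged input unless that is requested. Item sizes are platform dependent, so they are printed rather than assumed:

```python
import numpy as np

alist = [[1, 2, 3], [4, 5, 9, 8]]

# Explicit object dtype (assumption: needed on newer NumPy for ragged input)
a = np.array(alist, dtype=object)

# The array stores references to the very same list objects
same_object = a[0] is alist[0]
a[0].append(4)  # mutating through the array also changes alist

# Equal-length sublists give a true 2D numeric array with its own storage
source = [[1, 2, 3], [4, 5, 9]]
a2 = np.array(source)
a2[0, 0] = 99  # does not touch the source list

print(a.shape, a.dtype, a.itemsize)    # itemsize is the pointer size
print(a2.shape, a2.dtype, a2.itemsize)
```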

substract element to row array in python [duplicate]

I have two numpy array a and b
a=np.array([[1,2,3],[4,5,6],[7,8,9]])
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
b = np.array([1,2,3])
array([1, 2, 3])
I would like to subtract from each row of a the corresponding element of b (i.e. from the first row of a, the first element of b, etc.)
so that c is
array([[0, 1, 2],
[2, 3, 4],
[4, 5, 6]])
Is there a python command to do this?
Yes, the - operator.
In addition you need to make b into a column vector so that broadcasting can do the rest for you:
a - b[:, np.newaxis]
# array([[0, 1, 2],
# [2, 3, 4],
# [4, 5, 6]])
yup! You just need to make b a column vector first
a - b[:, np.newaxis]
Reshape b into a column vector, then subtract:
a - b.reshape(3, 1)
b isn't altered in place, but the result of the reshape method call will be the column vector:
array([[1],
[2],
[3]])
This allows the "shape" of the subtraction you wanted. A slightly more general reshape operation would be:
b.reshape(b.size, 1)
This takes however many elements b has and molds them into an N x 1 vector.
Update: A quick benchmark shows kazemakase's answer, using b[:, np.newaxis] as the reshaping strategy, to be ~7% faster. For small vectors, those few extra fractions of a µs won't matter. But for large vectors or inner loops, prefer his approach. It's a less-general reshape, but more performant for this use.
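For reference, a small sketch comparing the reshaping strategies from the answers above on the question's data. The reshape(-1, 1) variant is an addition here, not something from the answers. It lets NumPy infer the row count:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([1, 2, 3])

# b has shape (3,); as a (3, 1) column it broadcasts across each row of a
c_newaxis = a - b[:, np.newaxis]
c_reshape = a - b.reshape(b.size, 1)
c_infer = a - b.reshape(-1, 1)  # -1 asks NumPy to work out the length

# Without the reshape, b is subtracted from every row instead
c_rowwise = a - b

print(c_newaxis)
```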

How to access numpy array with a set of indices stored in another numpy array?

I have a numpy array which stores a set of indices I need to access another numpy array.
I tried to use a for loop but it doesn't work as I expected.
The situation is like this:
>>> a
array([[1, 2],
[3, 4]])
>>> c
array([[0, 0],
[0, 1]])
>>> a[c[0]]
array([[1, 2],
[1, 2]])
>>> a[0,0] # the result I want
1
Above is a simplified version of my actual code, where the c array is much larger so I have to use a for loop to get every index.
Convert it to a tuple:
>>> a[tuple(c[0])]
1
This works because list and array indices trigger advanced indexing, whereas tuples are (mostly) treated as basic indexing.
Index a with the columns of c by passing the first column as the row indices and the second one as the column indices:
In [23]: a[c[:,0], c[:,1]]
Out[23]: array([1, 2])
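Both suggestions can be sketched together on the question's arrays. The tuple(c.T) form is an extra shorthand added here for illustration. It should be equivalent to passing the two columns separately:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
c = np.array([[0, 0], [0, 1]])  # each row of c is one (row, col) index pair

# One pair at a time: a tuple is read as one index per dimension
first = a[tuple(c[0])]

# All pairs at once, without a Python loop
all_pairs = a[c[:, 0], c[:, 1]]

# Same thing: c.T has one row of indices per dimension of a
all_pairs_alt = a[tuple(c.T)]

print(first, all_pairs)
```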

How do I access the ith column of a NumPy multidimensional array?

Given:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[i] gives the ith row (e.g. [1, 2]). How do I access the ith column? (e.g. [1, 3, 5]). Also, would this be an expensive operation?
To access column 0:
>>> test[:, 0]
array([1, 3, 5])
To access row 0:
>>> test[0, :]
array([1, 2])
This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It's certainly much quicker than accessing each element in a loop.
>>> test[:,0]
array([1, 3, 5])
This command gives you a 1-D array (a "row vector"). If you just want to loop over it, that's fine, but if you want to hstack it with another array of shape 3xN, you will get
ValueError: all the input arrays must have same number of dimensions
while
>>> test[:,[0]]
array([[1],
[3],
[5]])
gives you a column vector, so that you can do concatenate or hstack operation.
e.g.
>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
[3, 4, 3],
[5, 6, 5]])
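A short sketch of the shape difference described above. The slice form test[:, 0:1] is an extra variant added for comparison. It also keeps two dimensions but returns a view rather than a copy:

```python
import numpy as np

test = np.array([[1, 2], [3, 4], [5, 6]])

col_1d = test[:, 0]         # shape (3,): the column axis is dropped
col_2d = test[:, [0]]       # shape (3, 1): list index keeps the axis (copy)
col_2d_view = test[:, 0:1]  # shape (3, 1): slice keeps the axis (view)

# hstack needs matching dimensions, so the 2D column works...
joined = np.hstack((test, col_2d))

# ...while the 1D column raises ValueError
try:
    np.hstack((test, col_1d))
    hstack_failed = False
except ValueError:
    hstack_failed = True

print(col_1d.shape, col_2d.shape, col_2d_view.shape, hstack_failed)
```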
And if you want to access more than one column at a time you could do:
>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
[3, 5],
[6, 8]])
You could also transpose and return a row:
In [4]: test.T[0]
Out[4]: array([1, 3, 5])
Although the question has been answered, let me mention some nuances.
Let's say you are interested in the column at index 1 (the second column) of the array
arr = numpy.array([[1, 2],
[3, 4],
[5, 6]])
As you already know from other answers, to get it in the form of "row vector" (array of shape (3,)), you use slicing:
arr_col1_view = arr[:, 1] # creates a view of column 1 of arr
arr_col1_copy = arr[:, 1].copy() # creates a copy of column 1 of arr
To check if an array is a view or a copy of another array you can do the following:
arr_col1_view.base is arr # True
arr_col1_copy.base is arr # False
see ndarray.base.
Besides the obvious difference between the two (modifying arr_col1_view will affect the arr), the number of byte-steps for traversing each of them is different:
arr_col1_view.strides[0] # 8 bytes
arr_col1_copy.strides[0] # 4 bytes
see strides and this answer.
Why is this important? Imagine that you have a very big array A instead of the arr:
A = np.random.randint(2, size=(10000, 10000), dtype='int32')
A_col1_view = A[:, 1]
A_col1_copy = A[:, 1].copy()
and you want to compute the sum of all the elements of that column, i.e. A_col1_view.sum() or A_col1_copy.sum(). Using the copied version is much faster:
%timeit A_col1_view.sum() # ~248 µs
%timeit A_col1_copy.sum() # ~12.8 µs
This is due to the different strides mentioned before:
A_col1_view.strides[0] # 40000 bytes
A_col1_copy.strides[0] # 4 bytes
Although it might seem that using column copies is better, it is not always true for the reason that making a copy takes time too and uses more memory (in this case it took me approx. 200 µs to create the A_col1_copy). However if we needed the copy in the first place, or we need to do many different operations on a specific column of the array and we are ok with sacrificing memory for speed, then making a copy is the way to go.
In the case we are interested in working mostly with columns, it could be a good idea to create our array in column-major ('F') order instead of the row-major ('C') order (which is the default), and then do the slicing as before to get a column without copying it:
A = np.asfortranarray(A) # or np.array(A, order='F')
A_col1_view = A[:, 1]
A_col1_view.strides[0] # 4 bytes
%timeit A_col1_view.sum() # ~12.6 µs vs ~248 µs
Now, performing the sum operation (or any other) on a column-view is as fast as performing it on a column copy.
Finally let me note that transposing an array and using row-slicing is the same as using the column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array.
A[:, 1].strides[0] # 40000 bytes
A.T[1, :].strides[0] # 40000 bytes
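The view, copy, and stride claims above can be verified on the small array without any timing. This sketch fixes dtype to int32 so the byte counts match the 4-byte integers used in the answer. That dtype choice is an assumption, since the default integer size varies by platform:

```python
import numpy as np

# int32 chosen explicitly so each element is 4 bytes (assumption)
arr = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int32)

col_view = arr[:, 1]
col_copy = arr[:, 1].copy()

is_view = col_view.base is arr   # the view refers back to arr
is_copy = col_copy.base is None  # the copy owns its data

# The view steps over a whole row (2 * 4 bytes); the copy is contiguous
view_stride = col_view.strides[0]
copy_stride = col_copy.strides[0]

# In column-major order a column view is contiguous without copying
arr_f = np.asfortranarray(arr)
f_stride = arr_f[:, 1].strides[0]

# Transposing only swaps shape and strides, so the stride is unchanged
t_stride = arr.T[1, :].strides[0]

print(view_stride, copy_stride, f_stride, t_stride)
```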
To get several independent columns, just use:
>>> test[:,[0,2]]
You will get columns 0 and 2.
>>> test
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> ncol = test.shape[1]
>>> ncol
5L
Then you can select the 2nd - 4th column this way:
>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
[6, 7, 8]])
This is not really multidimensional; it is a 2-dimensional array, and you can access the columns you want with a slice:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[:, a:b] # you can provide index in place of a and b
