Understanding non-homogeneous numpy arrays - python

I have recently started numpy and noticed a peculiar thing.
import numpy as np
a = np.array([[1,2,3], [4,5,9, 8]])
print a.shape, "shape"
print a[1, 0]
The shape, in this case, comes out to be 2L. However if I make a homogenous numpy array as
a = np.array([[1,2,3], [4,5,6]], then a.shape gives (2L, 3L). I understand that the shape of a non-homogenous array is difficult to represent as a tuple.
Additionally, print a[1,0] for non-homogenous array that I created earlier gives a traceback IndexError: too many indices for array. Doing the same on the homogenous array gives back the correct element 4.
Noticing these two peculiarities, I am curious to know how python looks at non-homogenous numpy arrays at a low level.
Thank You in advance

When the sublists differ in length, np.array falls back to creating an object dtype array:
In [272]: a = np.array([[1,2,3], [4,5,9, 8]])
In [273]: a
Out[273]: array([[1, 2, 3], [4, 5, 9, 8]], dtype=object)
This array is similar to the list we started with. Both store the sublists as pointers. The sublists exist else where in memory.
With equal length sublsts, it can create a 2d array, with integer elements:
In [274]: a2 = np.array([[1,2,3], [4,5,9]])
In [275]: a2
Out[275]:
array([[1, 2, 3],
[4, 5, 9]])
In fact to confirm my claim that the sublists are stored elsewhere in memory, let's try to change one:
In [276]: alist = [[1,2,3], [4,5,9, 8]]
In [277]: a = np.array(alist)
In [278]: a
Out[278]: array([[1, 2, 3], [4, 5, 9, 8]], dtype=object)
In [279]: a[0].append(4)
In [280]: a
Out[280]: array([[1, 2, 3, 4], [4, 5, 9, 8]], dtype=object)
In [281]: alist
Out[281]: [[1, 2, 3, 4], [4, 5, 9, 8]]
That would not work in the case of a2. a2 has its own data storage, independent of the source list.
The basic point is that np.array tries to create an n-d array where possible. If it can't it falls back on to creating an object dtype array. And, as has been discussed in other questions, it sometimes raises an error. It is also tricky to intentionally create an object array.
The shape of a is easy, (2,). A single element tuple. a is a 1d array. But that shape does not convey information about the elements of a. And the same goes for the elements of alist. len(alist) is 2. An object array can have a more complex shape, e.g. a.reshape(1,2,1), but it is still just contains pointers
a contains 2 4byte pointers; a2 contains 6 4byte integers.
n [282]: a.itemsize
Out[282]: 4
In [283]: a.nbytes
Out[283]: 8
In [284]: a2.nbytes
Out[284]: 24
In [285]: a2.itemsize
Out[285]: 4

Related

Numpy array concatenation

In some special cases, array can be concatenated without explicitly calling concatenate function. For example, given a 2D array A, the following code will yield an identical array B:
B = np.array([A[ii,:] for ii in range(A.shape[0])])
I know this method works, but do not quite understand the underlying mechanism. Can anyone demystify the code above a little bit?
A[ii,:] is ii-th row of array A.
The list comprehension [A[ii,:] for ii in range(A.shape[0])] basically makes a list of rows in A (A.shape[0] is number of rows in A).
Finally, B is an array, that its content is a list of A's rows, which is essentially the same as A itself.
By now you should be familiar with making an array from a list of lists:
In [178]: np.array([[1,2],[3,4]])
Out[178]:
array([[1, 2],
[3, 4]])
but that works just as well if it's a list of arrays:
In [179]: np.array([np.array([1,2]),np.array([3,4])])
Out[179]:
array([[1, 2],
[3, 4]])
stack also does this, by adding a dimension to the arrays and calling concatenate (read its code):
In [180]: np.stack([np.array([1,2]),np.array([3,4])])
Out[180]:
array([[1, 2],
[3, 4]])
concatenate joins the arrays - on an existing axis:
In [181]: np.concatenate([np.array([1,2]),np.array([3,4])])
Out[181]: array([1, 2, 3, 4])
stack adds a dimension first, as in:
In [182]: np.concatenate([np.array([[1,2]]),np.array([[3,4]])])
Out[182]:
array([[1, 2],
[3, 4]])
np.array and concatenate aren't identical, but there's a lot of overlap in their functionality.

How to properly concatenate two 1D arrays without flattening?

How do I concatenate properly two numpy vectors without flattening the result? This is really obvious with append, but it gets shamefully messy when turning to numpy.
I've tried concatenate (expliciting axis and not), hstack, vstack. All with no results.
In [1]: a
Out[1]: array([1, 2, 3])
In [2]: b
Out[2]: array([6, 7, 8])
In [3]: c = np.concatenate((a,b),axis=0)
In [4]: c
Out[4]: array([1, 2, 3, 6, 7, 8])
Note that the code above works indeed if a and b are lists instead of numpy arrays.
The output I want:
Out[4]: array([[1, 2, 3], [6, 7, 8]])
EDIT
vstack works indeed for a and b as in above. It does not in my real life case, where I want to iteratively fill an empty array with vectors of some dimension.
hist=[]
for i in range(len(filenames)):
fileload = np.load(filenames[i])
maxarray.append(fileload['maxamp'])
hist_t, bins_t = np.histogram(maxarray[i], bins=np.arange(0,4097,4))
hist = np.vstack((hist,hist_t))
SOLUTION:
I found the solution: you have to properly initialize the array e.g.: How to add a new row to an empty numpy array
For np.concatenate to work here the input arrays should have two dimensions, as you wasnt a concatenation along the second axis here, and the input arrays only have 1 dimension.
You can use np.vstack here, which as explained in the docs:
It is equivalent to concatenation along the first axis after 1-D arrays of shape (N,) have been reshaped to (1,N)
a = np.array([1, 2, 3])
b = np.array([6, 7, 8])
np.vstack([a, b])
array([[1, 2, 3],
[6, 7, 8]])

Very Basic Numpy array dimension visualization

I'm a beginner to numpy with no experience in matrices. I understand basic 1d and 2d arrays but I'm having trouble visualizing a 3d numpy array like the one below. How do the following python lists form a 3d array with height, length and width? Which are the rows and columns?
b = np.array([[[1, 2, 3],[4, 5, 6]],
[[7, 8, 9],[10, 11, 12]]])
The anatomy of an ndarray in NumPy looks like this red cube below: (source: Physics Dept, Cornell Uni)
Once you leave the 2D space and enter 3D or higher dimensional spaces, the concept of rows and columns doesn't make much sense anymore. But still you can intuitively understand 3D arrays. For instance, considering your example:
In [41]: b
Out[41]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
In [42]: b.shape
Out[42]: (2, 2, 3)
Here the shape of b is (2, 2, 3). You can think about it like, we've two (2x3) matrices stacked to form a 3D array. To access the first matrix you index into the array b like b[0] and to access the second matrix, you index into the array b like b[1].
# gives you the 2D array (i.e. matrix) at position `0`
In [43]: b[0]
Out[43]:
array([[1, 2, 3],
[4, 5, 6]])
# gives you the 2D array (i.e. matrix) at position 1
In [44]: b[1]
Out[44]:
array([[ 7, 8, 9],
[10, 11, 12]])
However, if you enter 4D space or higher, it will be very hard to make any sense out of the arrays itself since we humans have hard time visualizing 4D and more dimensions. So, one would rather just consider the ndarray.shape attribute and work with it.
More information about how we build higher dimensional arrays using (nested) lists:
For 1D arrays, the array constructor needs a sequence (tuple, list, etc) but conventionally list is used.
In [51]: oneD = np.array([1, 2, 3,])
In [52]: oneD.shape
Out[52]: (3,)
For 2D arrays, it's list of lists but can also be tuple of lists or tuple of tuples etc:
In [53]: twoD = np.array([[1, 2, 3], [4, 5, 6]])
In [54]: twoD.shape
Out[54]: (2, 3)
For 3D arrays, it's list of lists of lists:
In [55]: threeD = np.array([[[1, 2, 3], [2, 3, 4]], [[5, 6, 7], [6, 7, 8]]])
In [56]: threeD.shape
Out[56]: (2, 2, 3)
P.S. Internally, the ndarray is stored in a memory block as shown in the below picture. (source: Enthought)

How do I convert a 2D numpy array into a 1D numpy array of 1D numpy arrays?

In other words, each element of the outer array will be a row vector from the original 2D array.
A #Jaime already said, a 2D array can be interpreted as an array of 1D arrays, suppose:
a = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
doing a[0] will return array([1, 2, 3]).
So you don't need to do any conversion.
I think it makes little sense to use numpy arrays to do that, just think you're missing out on all the advantages of numpy.
I had the same issue to append a raw with a different length to a 2D-array.
The only trick I found up to now was to use list comprenhsion and append the new row (see below). Not very optimal I guess but at least it works ;-)
Hope this can help
>>> x=np.reshape(np.arange(0,9),(3,3))
>>> x
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> row_to_append = np.arange(9,11)
>>> row_to_append
array([ 9, 10])
>>> result=[item for item in x]
>>> result.append(row_to_append)
>>> result
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([ 9, 10])]
np.vsplit Split an array into multiple sub-arrays vertically (row-wise).
x=np.arange(12).reshape(3,4)
In [7]: np.vsplit(x,3)
Out[7]: [array([[0, 1, 2, 3]]), array([[4, 5, 6, 7]]), array([[ 8, 9, 10, 11]])]
A comprehension could be used to reshape those arrays into 1d ones.
This is a list of arrays, not an array of arrays. Such a sequence of arrays can be recombined with vstack (or hstack, dstack).
np.array([np.arange(3),np.arange(4)])
makes a 2 element array of arrays. But if the arrays in the list are all the same shape (or compatible), it makes a 2d array. In terms of data storage it may not matter whether it is 2d or 1d of 1d arrays.

How do I access the ith column of a NumPy multidimensional array?

Given:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[i] gives the ith row (e.g. [1, 2]). How do I access the ith column? (e.g. [1, 3, 5]). Also, would this be an expensive operation?
To access column 0:
>>> test[:, 0]
array([1, 3, 5])
To access row 0:
>>> test[0, :]
array([1, 2])
This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It's certainly much quicker than accessing each element in a loop.
>>> test[:,0]
array([1, 3, 5])
this command gives you a row vector, if you just want to loop over it, it's fine, but if you want to hstack with some other array with dimension 3xN, you will have
ValueError: all the input arrays must have same number of dimensions
while
>>> test[:,[0]]
array([[1],
[3],
[5]])
gives you a column vector, so that you can do concatenate or hstack operation.
e.g.
>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
[3, 4, 3],
[5, 6, 5]])
And if you want to access more than one column at a time you could do:
>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
[3, 5],
[6, 8]])
You could also transpose and return a row:
In [4]: test.T[0]
Out[4]: array([1, 3, 5])
Although the question has been answered, let me mention some nuances.
Let's say you are interested in the first column of the array
arr = numpy.array([[1, 2],
[3, 4],
[5, 6]])
As you already know from other answers, to get it in the form of "row vector" (array of shape (3,)), you use slicing:
arr_col1_view = arr[:, 1] # creates a view of the 1st column of the arr
arr_col1_copy = arr[:, 1].copy() # creates a copy of the 1st column of the arr
To check if an array is a view or a copy of another array you can do the following:
arr_col1_view.base is arr # True
arr_col1_copy.base is arr # False
see ndarray.base.
Besides the obvious difference between the two (modifying arr_col1_view will affect the arr), the number of byte-steps for traversing each of them is different:
arr_col1_view.strides[0] # 8 bytes
arr_col1_copy.strides[0] # 4 bytes
see strides and this answer.
Why is this important? Imagine that you have a very big array A instead of the arr:
A = np.random.randint(2, size=(10000, 10000), dtype='int32')
A_col1_view = A[:, 1]
A_col1_copy = A[:, 1].copy()
and you want to compute the sum of all the elements of the first column, i.e. A_col1_view.sum() or A_col1_copy.sum(). Using the copied version is much faster:
%timeit A_col1_view.sum() # ~248 µs
%timeit A_col1_copy.sum() # ~12.8 µs
This is due to the different number of strides mentioned before:
A_col1_view.strides[0] # 40000 bytes
A_col1_copy.strides[0] # 4 bytes
Although it might seem that using column copies is better, it is not always true for the reason that making a copy takes time too and uses more memory (in this case it took me approx. 200 µs to create the A_col1_copy). However if we needed the copy in the first place, or we need to do many different operations on a specific column of the array and we are ok with sacrificing memory for speed, then making a copy is the way to go.
In the case we are interested in working mostly with columns, it could be a good idea to create our array in column-major ('F') order instead of the row-major ('C') order (which is the default), and then do the slicing as before to get a column without copying it:
A = np.asfortranarray(A) # or np.array(A, order='F')
A_col1_view = A[:, 1]
A_col1_view.strides[0] # 4 bytes
%timeit A_col1_view.sum() # ~12.6 µs vs ~248 µs
Now, performing the sum operation (or any other) on a column-view is as fast as performing it on a column copy.
Finally let me note that transposing an array and using row-slicing is the same as using the column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array.
A[:, 1].strides[0] # 40000 bytes
A.T[1, :].strides[0] # 40000 bytes
To get several and indepent columns, just:
> test[:,[0,2]]
you will get colums 0 and 2
>>> test
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> ncol = test.shape[1]
>>> ncol
5L
Then you can select the 2nd - 4th column this way:
>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
[6, 7, 8]])
This is not multidimensional. It is 2 dimensional array. where you want to access the columns you wish.
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[:, a:b] # you can provide index in place of a and b

Categories