Numpy: use reshape or newaxis to add dimensions

Either ndarray.reshape or numpy.newaxis can be used to add a new dimension to an array. Both appear to create a view; is there any reason or advantage to use one instead of the other?
>>> b
array([ 1., 1., 1., 1.])
>>> c = b.reshape((1,4))
>>> c *= 2
>>> c
array([[ 2., 2., 2., 2.]])
>>> c.shape
(1, 4)
>>> b
array([ 2., 2., 2., 2.])
>>> d = b[np.newaxis,...]
>>> d
array([[ 2., 2., 2., 2.]])
>>> d.shape
(1, 4)
>>> d *= 2
>>> b
array([ 4., 4., 4., 4.])
>>> c
array([[ 4., 4., 4., 4.]])
>>> d
array([[ 4., 4., 4., 4.]])
>>>

One reason to use numpy.newaxis over ndarray.reshape is when you have more than one "unknown" dimension to operate with. So, for example, for the following array:
>>> arr.shape
(10, 5)
This works:
>>> arr[:, np.newaxis, :].shape
(10, 1, 5)
But this does not:
>>> arr.reshape(-1, 1, -1)
...
ValueError: can only specify one unknown dimension
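For completeness, here is a minimal sketch of the common, equivalent spellings for adding a single axis (for a contiguous input these all give views of the same data):

import numpy as np

arr = np.ones((10, 5))

v1 = arr[:, np.newaxis, :]        # np.newaxis is just an alias for None
v2 = arr[:, None, :]              # same thing, spelled with None
v3 = np.expand_dims(arr, axis=1)  # the named-function spelling
v4 = arr.reshape(10, 1, 5)        # works, but the known sizes must be spelled out

assert v1.shape == v2.shape == v3.shape == v4.shape == (10, 1, 5)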

I don't see evidence of much difference. You could do a time test on very large arrays. Basically both fiddle with the shape, and possibly the strides. __array_interface__ is a nice way of accessing this information. For example:
In [94]: b.__array_interface__
Out[94]:
{'data': (162400368, False),
 'descr': [('', '<f8')],
 'shape': (5,),
 'strides': None,
 'typestr': '<f8',
 'version': 3}
In [95]: b[None,:].__array_interface__
Out[95]:
{'data': (162400368, False),
 'descr': [('', '<f8')],
 'shape': (1, 5),
 'strides': (0, 8),
 'typestr': '<f8',
 'version': 3}
In [96]: b.reshape(1,5).__array_interface__
Out[96]:
{'data': (162400368, False),
 'descr': [('', '<f8')],
 'shape': (1, 5),
 'strides': None,
 'typestr': '<f8',
 'version': 3}
Both create a view, using the same data buffer as the original, and the same shape; but reshape leaves the strides as None, while the newaxis version sets them explicitly. reshape also lets you specify the order.
And .flags shows differences in the C_CONTIGUOUS flag.
reshape may be faster because it is making fewer changes. But either way the operation shouldn't affect the time of larger calculations much.
e.g. for a large b:
In [123]: timeit np.outer(b.reshape(1,-1),b)
1 loops, best of 3: 288 ms per loop
In [124]: timeit np.outer(b[None,:],b)
1 loops, best of 3: 287 ms per loop
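A quick illustration of the order argument mentioned above (reshape only; newaxis has no equivalent):

import numpy as np

a = np.arange(6)
print(a.reshape(2, 3))             # C order (default): fills rows first
# [[0 1 2]
#  [3 4 5]]
print(a.reshape(2, 3, order='F'))  # Fortran order: fills columns first
# [[0 2 4]
#  [1 3 5]]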
An interesting observation: b.reshape(1,4).strides -> (32, 8)
Here's my guess: .__array_interface__ displays an underlying attribute, while .strides is more like a property (though it may all be buried in C code). The default underlying value is None, and when it is needed for a calculation (or for display with .strides) it is computed from the shape and item size. 32 is the distance to the end of the 1st row (4 x 8). np.ones((2,4)).strides gives the same (32, 8) (and None in __array_interface__).
b[None,:], on the other hand, is preparing the array for broadcasting. When broadcast, existing values are used repeatedly. That's what the 0 in (0, 8) does.
In [147]: b1=np.broadcast_arrays(b,np.zeros((2,1)))[0]
In [148]: b1.shape
Out[148]: (2, 5)
In [149]: b1.strides
Out[149]: (0, 8)
In [150]: b1.__array_interface__
Out[150]:
{'data': (3023336880L, False),
 'descr': [('', '<f8')],
 'shape': (2, 5),
 'strides': (0, 8),
 'typestr': '<f8',
 'version': 3}
b1 displays the same as np.ones((2,5)) but has only 5 items.
np.broadcast_arrays is a function in /numpy/lib/stride_tricks.py. It uses as_strided from the same file. These functions directly play with the shape and strides attributes.
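To illustrate, here is a minimal sketch of the same zero-stride trick using as_strided directly (use with care: a wrong shape or stride can read out of bounds):

import numpy as np
from numpy.lib.stride_tricks import as_strided

b = np.arange(5.0)                       # 5 float64 items, 8-byte stride
b2 = as_strided(b, shape=(2, 5), strides=(0, b.itemsize))
print(b2.shape, b2.strides)              # (2, 5) (0, 8)
b[0] = 99.0
print(b2[0, 0], b2[1, 0])                # 99.0 99.0 -- both rows share one buffer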

Related

Numpy array of different types

I want a numpy array of mixed datatypes, basically a combination of float32 and uint32.
The thing is, I don't write the array out manually (unlike all the other examples I've found). Here is a piece of code of what I'm trying to do:
a = np.full((1, 10), 1).astype(np.float32)
b = np.full((1, 10), 2).astype(np.float32)
c = np.full((1, 10), 3).astype(np.float32)
d = np.full((1, 10), 4).astype(np.uint32)
arr = np.dstack([a, b, c, d]) # arr.shape = 1, 10, 4
I want axis 2 of arr to be of mixed data types. Of course a, b, c, and d are read from files, but for simplicity I show them as constant values!
One important note: I need exactly this layout. The last element along axis 2 has to be represented as a uint32, because I'm dealing with hardware components that expect this order of datatypes (think of it as an API that will throw an error if the data types do not match).
This is what I've tried:
arr.astype("float32, float32, float32, uint1")
but this duplicates each element along axis 2 into all four fields (same value, different data types).
I also tried this (which is basically the same thing):
dt = np.dtype([('floats', np.float32, (3, )), ('ints', np.uint32, (1, ))])
arr = np.dstack((a, b, c, d)).astype(dt)
but I got the same duplication as well.
I know for sure that if I construct the array as follows:
arr = np.array([((1, 2, 3), (4)), ((5, 6, 7), (8))], dtype=dt)
where dt is from the code block above, it works, more or less. But I read those a, b, c, d arrays from files, and I don't know if constructing those tuples (or structures) is the best way to do it, because the arrays have a length of 850k in practice.
Your dtype:
In [83]: dt = np.dtype([('floats', np.float32, (3, )), ('ints', np.uint32, (1, ))])
and a sample uniform array:
In [84]: x= np.arange(1,9).reshape(2,4);x
Out[84]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
the wrong way of making a structured array:
In [85]: x.astype(dt)
Out[85]:
array([[([1., 1., 1.], [1]), ([2., 2., 2.], [2]), ([3., 3., 3.], [3]),
        ([4., 4., 4.], [4])],
       [([5., 5., 5.], [5]), ([6., 6., 6.], [6]), ([7., 7., 7.], [7]),
        ([8., 8., 8.], [8])]],
      dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
The right way:
In [86]: import numpy.lib.recfunctions as rf
In [87]: rf.unstructured_to_structured(x,dt)
Out[87]:
array([([1., 2., 3.], [4]), ([5., 6., 7.], [8])],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
and an alternate way:
In [88]: res = np.zeros(2,dt)
In [89]: res['floats'] = x[:,:3]
In [90]: res['ints'] = x[:,-1:]
In [91]: res
Out[91]:
array([([1., 2., 3.], [4]), ([5., 6., 7.], [8])],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
https://numpy.org/doc/stable/user/basics.rec.html
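Putting that together for the original a, b, c, d columns, a minimal sketch (using 1-D arrays of length 10 as stand-ins for the real file data):

import numpy as np

# hypothetical stand-ins for the arrays read from files
a = np.full(10, 1, dtype=np.float32)
b = np.full(10, 2, dtype=np.float32)
c = np.full(10, 3, dtype=np.float32)
d = np.full(10, 4, dtype=np.uint32)

dt = np.dtype([('floats', np.float32, (3,)), ('ints', np.uint32, (1,))])

res = np.zeros(10, dt)
res['floats'] = np.stack([a, b, c], axis=1)  # (10, 3) float32 block
res['ints'] = d[:, None]                     # (10, 1) uint32 block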

How to get field of nested numpy structured array (advanced indexing)

I have a complex nested structured array (often used as a recarray). It's simplified for this example, but in the real case there are multiple levels.
c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])
I am trying to index the "x" field of the nested array's dtype without naming the preceding fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top level data, but it doesn't work.
Are there ways to do fancy indexing with nested structured arrays using their field names?
I don't know if displaying the resulting array helps you visualize the nesting or not.
In [279]: c = [('x','f8'),('y','f8')]
...: A = [('data_string','|S20'),('data_val', c, 2)]
...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]:
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Note how the nesting of () and [] reflects the nesting of the fields.
arr.dtype only has direct access to the top level field names:
In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]:
array([[(0., 0.), (0., 0.)],
       [(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])
But having accessed one field, we can then look at its fields:
In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]:
array([[0., 0.],
       [0., 0.]])
Record number indexing is separate, and can be multidimensional in the usual sense:
In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]:
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:
In [289]: arr['data_val']['x']
Out[289]:
array([[0., 0.],
       [1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])
I mentioned that field indexing is like dict indexing. Note this display of the fields:
In [294]: arr.dtype.fields
Out[294]:
mappingproxy({'data_string': (dtype('S20'), 0),
'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})
Each record is stored as a block of 52 bytes:
In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'
20 of those are the data_string, and 32 are the two c elements (2 elements x 2 floats x 8 bytes):
In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different:
In [306]: arr[['data_val']]
Out[306]:
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})
In [311]: arr['data_val'][['y']]
Out[311]:
array([[(3.,), (4.,)],
       [(0.,), (0.,)]],
      dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})
Each 'data_val' starts 20 bytes into the 52 byte record. And each 'y' starts 8 bytes into its 16 byte record.
The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.
The only way I can see the index being simplified is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
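If the goal is mainly numeric slicing across the nested fields, one option (a sketch, assuming all the leaf fields share a common dtype) is to turn the nested field into a plain float array with numpy.lib.recfunctions.structured_to_unstructured:

import numpy as np
import numpy.lib.recfunctions as rf

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', '|S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)

vals = rf.structured_to_unstructured(arr['data_val'])  # shape (2, 2, 2), float64
x = vals[..., 0]   # all the 'x' values
y = vals[..., 1]   # all the 'y' values

Depending on the layout this may be a copy rather than a view, so check before relying on it for in-place assignment.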

Purpose/status of the attribute numpy.dtype.base

I have found an attribute called base on numpy.dtype objects. Doing some experiments:
numpy.dtype('i4').base
# dtype('int32')
numpy.dtype('6i4').base
# dtype('int32')
numpy.dtype('10f8').base
# dtype('float64')
numpy.dtype('3i4, 2f4').base
# dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
So it seems to contain the dtype of a single element for simple sub-array data types and itself for structured data types.
Unfortunately, this attribute does not seem to be documented anywhere. There is a page in the documentation, but it’s empty and not linked anywhere. Curiously, it is also absent in the documentation for numpy version 1.15.0 specifically:
/doc/numpy/…/numpy.dtype.base.html (empty page)
/doc/numpy-1.15.0/…/numpy.dtype.base.html (error 404)
/doc/numpy-1.15.1/…/numpy.dtype.base.html (empty page)
Can I rely on the presence and behavior of this attribute in future versions of numpy?
This is now documented:
https://numpy.org/doc/stable/reference/generated/numpy.dtype.base.html#numpy.dtype.base
It is defined at https://github.com/numpy/numpy/blob/eeef9d4646103c3b1afd3085f1393f2b3f9575b2/numpy/core/src/multiarray/descriptor.c#L2255-L2300 and, per git blame, was last touched ~13 years ago, so it is probably safe to assume that dtype.base will exist and continue to exist.
I'm not sure whether it's safe to rely on base, but it's probably a bad idea either way. People reading your code can't look up what base means in the docs, and anyway, there's a better option.
Instead of base, you can use subdtype, which is documented:
Tuple (item_dtype, shape) if this dtype describes a
sub-array, and None otherwise.
The shape is the fixed shape of the sub-array described by this data
type, and item_dtype the data type of the array.
If a field whose dtype object has this attribute is retrieved, then
the extra dimensions implied by shape are tacked on to the end of
the retrieved array.
For a dtype that represents a subarray, dtype.base is equivalent to dtype.subdtype[0]. For a dtype that doesn't represent a subarray, dtype.base is dtype and dtype.subdtype is None. Here's a demo:
>>> subarray = numpy.dtype('5i4')
>>> not_subarray = numpy.dtype('i4')
>>> subarray.base
dtype('int32')
>>> subarray.subdtype
(dtype('int32'), (5,))
>>> not_subarray.base
dtype('int32')
>>> print(not_subarray.subdtype) # None doesn't get auto-printed
None
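In other words, base can be re-derived from the documented subdtype attribute; a minimal sketch:

import numpy

def base_dtype(dt):
    # Equivalent of dt.base, using only the documented subdtype attribute.
    return dt.subdtype[0] if dt.subdtype is not None else dt

assert base_dtype(numpy.dtype('5i4')) == numpy.dtype('5i4').base
assert base_dtype(numpy.dtype('i4')) == numpy.dtype('i4').base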
Incidentally, if you want to be sure about what dtype.base does, here's the source, which confirms what you guessed from your experiments:
static PyObject *
arraydescr_base_get(PyArray_Descr *self)
{
    if (!PyDataType_HASSUBARRAY(self)) {
        Py_INCREF(self);
        return (PyObject *)self;
    }
    Py_INCREF(self->subarray->base);
    return (PyObject *)(self->subarray->base);
}
I've never used the base attribute, or seen it used. But it does make sense that there should be a way of identifying such an object. I can't find a use of it in code such as np.lib.recfunctions, but it may well be used in compiled code.
With a dtype like '10f8' there are various attributes (some may be properties):
In [259]: dt = np.dtype('10f8')
In [260]: dt
Out[260]: dtype(('<f8', (10,)))
In [261]: dt.base
Out[261]: dtype('float64')
In [263]: dt.descr
Out[263]: [('', '|V80')]
In [264]: dt.itemsize
Out[264]: 80
In [265]: dt.shape
Out[265]: (10,)
Look what happens when we make an array with this dtype:
In [278]: x = np.ones((3,),'10f8')
In [279]: x
Out[279]:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [280]: x.shape
Out[280]: (3, 10)
In [281]: x.dtype
Out[281]: dtype('float64') # there's your base
There's the answer: dt.base is the dtype that will be used when creating an array with this dtype. It's the dtype without the extra dimensional information.
That sort of dtype is rarely used by itself; more likely it is part of a compound dtype:
In [252]: dt=np.dtype('3i4, 2f4')
In [253]: dt
Out[253]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [254]: dt.base
Out[254]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [255]: dt[0]
Out[255]: dtype(('<i4', (3,)))
In [256]: dt[0].base
Out[256]: dtype('int32')
This dt could be embedded in another dtype:
In [272]: dt1 = np.dtype((dt, (3,)))
In [273]: dt1
Out[273]: dtype(([('f0', '<i4', (3,)), ('f1', '<f4', (2,))], (3,)))
In [274]: dt1.base
Out[274]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [275]: arr = np.ones((3,), dt1)
In [276]: arr
Out[276]:
array([[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
        ([1, 1, 1], [1., 1.])],
       [([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
        ([1, 1, 1], [1., 1.])],
       [([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
        ([1, 1, 1], [1., 1.])]],
      dtype=[('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [277]: arr.shape
Out[277]: (3, 3)
In the case of a structured array, the base of a field is the dtype that we get when viewing just that field.
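A quick check of that last point, as a sketch:

import numpy as np

dt = np.dtype('3i4, 2f4')
arr = np.ones(3, dt)

# the 'f0' field view drops the (3,) subarray wrapper and keeps its base dtype
print(arr['f0'].shape)                 # (3, 3)
print(arr['f0'].dtype == dt[0].base)   # True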

Indexing array using column names

I'm loading pretty large input files into a Numpy array (30 columns, over 10k rows). The data contains only floating point numbers. To simplify data processing I'd like to name the columns and access them using human-readable names. AFAIK that's only possible using structured/record arrays. However, if I'm right, when I use structured arrays I'll lose some information. For instance:
x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
y = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype=[('x', float), ('y', float)])
Both arrays contain the same data, stored as float64. y can be accessed using column names:
In [155]: y['x']
Out[155]: array([ 1., 3., 11.])
Unfortunately, I lose (or do I get the wrong impression?) some essential properties when I use structured arrays: x and y have different shapes, y cannot be transposed, etc.
In [160]: x
Out[160]:
array([[ 1.,  2.],
       [ 3.,  4.],
       [11., 22.]])
In [161]: y
Out[161]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
In [162]: x.shape
Out[162]: (3, 2)
In [163]: y.shape
Out[163]: (3,)
In [164]: x.T
Out[164]:
array([[ 1.,  3., 11.],
       [ 2.,  4., 22.]])
In [165]: y.T
Out[165]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
Is it possible to continue using "regular 2D Numpy arrays" and access columns using their names?
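One common workaround, as a minimal sketch (the cols mapping here is assumed, not part of the original post): keep the plain 2-D float array and resolve names to column indices yourself.

import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0], [11.0, 22.0]])  # regular 2-D array
cols = {name: i for i, name in enumerate(['x', 'y'])}    # assumed column order

x_col = data[:, cols['x']]   # array([ 1.,  3., 11.])
data_T = data.T              # transposing and other 2-D behavior still work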

Split NumPy array according to values in the array (a condition)

I have an array:
arr = [(1,1,1), (1,1,2), (1,1,3), (1,1,4)...(35,1,22),(35,1,23)]
I want to split my array according to the third value in each ordered pair. I want each third value of 1 to be the start of a new array. The results should be:
[(1,1,1), (1,1,2),...(1,1,35)][(1,2,1), (1,2,2),...(1,2,46)]
and so on. I know numpy.split should do the trick but I'm lost as to how to write the condition for the split.
Here's a quick idea, working with a 1d array. It can be easily extended to work with your 2d array:
In [385]: x=np.arange(10)
In [386]: I=np.where(x%3==0)
In [387]: I
Out[387]: (array([0, 3, 6, 9]),)
In [389]: np.split(x,I[0])
Out[389]:
[array([], dtype=float64),
 array([0, 1, 2]),
 array([3, 4, 5]),
 array([6, 7, 8]),
 array([9])]
The key is to use where to find the indices at which you want split to act.
For a 2d arr:
First make a sample 2d array, with something interesting in the 3rd column:
In [390]: arr=np.ones((10,3))
In [391]: arr[:,2]=np.arange(10)
In [392]: arr
Out[392]:
array([[ 1., 1., 0.],
       [ 1., 1., 1.],
       ...
       [ 1., 1., 9.]])
Then use the same where and boolean to find indexes to split on:
In [393]: I=np.where(arr[:,2]%3==0)
In [395]: np.split(arr,I[0])
Out[395]:
[array([], dtype=float64),
 array([[ 1., 1., 0.],
        [ 1., 1., 1.],
        [ 1., 1., 2.]]),
 array([[ 1., 1., 3.],
        [ 1., 1., 4.],
        [ 1., 1., 5.]]),
 array([[ 1., 1., 6.],
        [ 1., 1., 7.],
        [ 1., 1., 8.]]),
 array([[ 1., 1., 9.]])]
I cannot think of any numpy functions or tricks to do this. A simple solution using a for loop would be:
In [48]: arr = [(1,1,1), (1,1,2), (1,1,3), (1,1,4),(1,2,1),(1,2,2),(1,2,3),(1,3,1),(1,3,2),(1,3,3),(1,3,4),(1,3,5)]
In [49]: result = []
In [50]: for i in arr:
....: if i[2] == 1:
....: tempres = []
....: result.append(tempres)
....: tempres.append(i)
....:
In [51]: result
Out[51]:
[[(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 4)],
 [(1, 2, 1), (1, 2, 2), (1, 2, 3)],
 [(1, 3, 1), (1, 3, 2), (1, 3, 3), (1, 3, 4), (1, 3, 5)]]
From looking at the documentation it seems like specifying the indices where to split will work best. For your specific example the following works if arr is already a 2-dimensional numpy array:
np.split(arr, np.where(arr[:,2] == 1)[0])
arr[:,2] returns an array of the 3rd entry in each row: the colon says to take every row, and the 2 says to take the 3rd column, which is the 3rd component.
We then use np.where to return all the places where the 3rd coordinate is a 1. We have to do np.where()[0] to get at the array of locations directly.
We then plug in the indices we've found where the 3rd coordinate is 1 to np.split which splits at the desired locations.
Note that because the first entry has a 1 in the 3rd coordinate it will split before the first entry. This gives us one extra "split" array which is empty.
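If the leading empty array is unwanted, one small tweak (a sketch) is to drop a split point at index 0:

import numpy as np

arr = np.array([(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3)])
idx = np.where(arr[:, 2] == 1)[0]      # [0, 2]
pieces = np.split(arr, idx[idx > 0])   # no split at 0 -> no empty first piece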
