Numpy array of different types - python

I want a numpy array of different mixed datatypes, basically a combination of float32 and uint32.
The thing is, I don't write the array out manually (as in all the other forum posts I've found). Here is a piece of code showing what I'm trying to do:
a = np.full((1, 10), 1).astype(np.float32)
b = np.full((1, 10), 2).astype(np.float32)
c = np.full((1, 10), 3).astype(np.float32)
d = np.full((1, 10), 4).astype(np.uint32)
arr = np.dstack([a, b, c, d]) # arr.shape = 1, 10, 4
I want axis 2 of arr to hold mixed data types. Of course a, b, c, and d are read from files, but for simplicity I show them as constant values!
One important note: I need this layout. The last element along axis 2 has to be represented as a uint32 because I'm dealing with hardware components that expect this order of data types (think of it as an API that will throw an error if the data types do not match).
This is what I've tried:
arr.astype("float32, float32, float32, uint32")
but this duplicates each element in axis 2 four times, once per data type (same value each time).
I also tried this (which is basically the same thing):
dt = np.dtype([('floats', np.float32, (3, )), ('ints', np.uint32, (1, ))])
arr = np.dstack((a, b, c, d)).astype(dt)
but I got the same duplication as well.
I know for sure if I construct the array as follows:
arr = np.array([((1, 2, 3), (4)), ((5, 6, 7), (8))], dtype=dt)
where dt is from the code block above, it works nice-ish. But I read those a, b, c, d arrays from files, and I don't know if constructing those tuples (or structures) is the best way to do it, because the arrays have a length of 850k in practice.

Your dtype:
In [83]: dt = np.dtype([('floats', np.float32, (3, )), ('ints', np.uint32, (1, ))])
and a sample uniform array:
In [84]: x= np.arange(1,9).reshape(2,4);x
Out[84]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
the wrong way of making a structured array:
In [85]: x.astype(dt)
Out[85]:
array([[([1., 1., 1.], [1]), ([2., 2., 2.], [2]), ([3., 3., 3.], [3]),
([4., 4., 4.], [4])],
[([5., 5., 5.], [5]), ([6., 6., 6.], [6]), ([7., 7., 7.], [7]),
([8., 8., 8.], [8])]],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
The right way:
In [86]: import numpy.lib.recfunctions as rf
In [87]: rf.unstructured_to_structured(x,dt)
Out[87]:
array([([1., 2., 3.], [4]), ([5., 6., 7.], [8])],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
and an alternate way:
In [88]: res = np.zeros(2,dt)
In [89]: res['floats'] = x[:,:3]
In [90]: res['ints'] = x[:,-1:]
In [91]: res
Out[91]:
array([([1., 2., 3.], [4]), ([5., 6., 7.], [8])],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
https://numpy.org/doc/stable/user/basics.rec.html
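Applied to the a, b, c, d arrays from the question, the field-assignment approach might look like the sketch below. It assumes the four inputs are 1-D arrays of equal length (n is set to 10 here just for illustration) and fills the fields directly, without first stacking everything into one uniform array:

```python
import numpy as np

n = 10  # illustrative size; the question mentions about 850k in practice

# Stand-ins for the arrays read from files (assumed 1-D, equal length)
a = np.full(n, 1, dtype=np.float32)
b = np.full(n, 2, dtype=np.float32)
c = np.full(n, 3, dtype=np.float32)
d = np.full(n, 4, dtype=np.uint32)

dt = np.dtype([('floats', np.float32, (3,)), ('ints', np.uint32, (1,))])

# Allocate once, then fill each field with a vectorized assignment
arr = np.zeros(n, dtype=dt)
arr['floats'] = np.stack([a, b, c], axis=1)  # shape (n, 3)
arr['ints'] = d[:, None]                     # shape (n, 1)

# Each record is 16 bytes: three float32 values followed by one uint32
raw = arr.tobytes()
```

No Python-level tuples are built, so this should scale to long arrays. Whether the raw bytes match what the hardware expects (byte order, padding) is something to check against its specification.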

Related

Purpose/status of the attribute numpy.dtype.base

I have found an attribute called base on numpy.dtype objects. Doing some experiments:
numpy.dtype('i4').base
# dtype('int32')
numpy.dtype('6i4').base
# dtype('int32')
numpy.dtype('10f8').base
# dtype('float64')
numpy.dtype('3i4, 2f4').base
# dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
So it seems to contain the dtype of a single element for simple sub-array data types and itself for structured data types.
Unfortunately, this attribute does not seem to be documented anywhere. There is a page in the documentation, but it’s empty and not linked anywhere. Curiously, it is also absent in the documentation for numpy version 1.15.0 specifically:
/doc/numpy/…/numpy.dtype.base.html (empty page)
/doc/numpy-1.15.0/…/numpy.dtype.base.html (error 404)
/doc/numpy-1.15.1/…/numpy.dtype.base.html (empty page)
Can I rely on the presence and behavior of this attribute in future versions of numpy?
This is now documented:
https://numpy.org/doc/stable/reference/generated/numpy.dtype.base.html#numpy.dtype.base
It is defined at https://github.com/numpy/numpy/blob/eeef9d4646103c3b1afd3085f1393f2b3f9575b2/numpy/core/src/multiarray/descriptor.c#L2255-L2300 and from git-blame was last touched ~13 years ago so it is probably safe to assume that dtype.base will exist and continue to exist.
I'm not sure whether it's safe to rely on base, but it's probably a bad idea either way. People reading your code can't look up what base means in the docs, and anyway, there's a better option.
Instead of base, you can use subdtype, which is documented:
Tuple (item_dtype, shape) if this dtype describes a
sub-array, and None otherwise.
The shape is the fixed shape of the sub-array described by this data
type, and item_dtype the data type of the array.
If a field whose dtype object has this attribute is retrieved, then
the extra dimensions implied by shape are tacked on to the end of
the retrieved array.
For a dtype that represents a subarray, dtype.base is equivalent to dtype.subdtype[0]. For a dtype that doesn't represent a subarray, dtype.base is dtype and dtype.subdtype is None. Here's a demo:
>>> subarray = numpy.dtype('5i4')
>>> not_subarray = numpy.dtype('i4')
>>> subarray.base
dtype('int32')
>>> subarray.subdtype
(dtype('int32'), (5,))
>>> not_subarray.base
dtype('int32')
>>> print(not_subarray.subdtype) # None doesn't get auto-printed
None
Incidentally, if you want to be sure about what dtype.base does, here's the source, which confirms what you guessed from your experiments:
static PyObject *
arraydescr_base_get(PyArray_Descr *self)
{
    if (!PyDataType_HASSUBARRAY(self)) {
        Py_INCREF(self);
        return (PyObject *)self;
    }
    Py_INCREF(self->subarray->base);
    return (PyObject *)(self->subarray->base);
}
I've never used the base attribute, or seen it used. But it does make sense that there should be a way of identifying such an object. I can't find a use of it in code such as np.lib.recfunctions, but it may well be used in compiled code.
With a dtype like '10f8' there are various attributes (some may be properties):
In [259]: dt = np.dtype('10f8')
In [260]: dt
Out[260]: dtype(('<f8', (10,)))
In [261]: dt.base
Out[261]: dtype('float64')
In [263]: dt.descr
Out[263]: [('', '|V80')]
In [264]: dt.itemsize
Out[264]: 80
In [265]: dt.shape
Out[265]: (10,)
Look what happens when we make an array with this dtype:
In [278]: x = np.ones((3,),'10f8')
In [279]: x
Out[279]:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [280]: x.shape
Out[280]: (3, 10)
In [281]: x.dtype
Out[281]: dtype('float64') # there's your base
There's the answer - dt.base is the dtype that will be used in creating an array with the dtype. It's the dtype without the extra dimensional information.
That sort of dtype is rarely used by itself; more likely it is part of a compound dtype:
In [252]: dt=np.dtype('3i4, 2f4')
In [253]: dt
Out[253]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [254]: dt.base
Out[254]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [255]: dt[0]
Out[255]: dtype(('<i4', (3,)))
In [256]: dt[0].base
Out[256]: dtype('int32')
This dt could be embedded in another dtype:
In [272]: dt1 = np.dtype((dt, (3,)))
In [273]: dt1
Out[273]: dtype(([('f0', '<i4', (3,)), ('f1', '<f4', (2,))], (3,)))
In [274]: dt1.base
Out[274]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [275]: arr = np.ones((3,), dt1)
In [276]: arr
Out[276]:
array([[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])],
[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])],
[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])]],
dtype=[('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [277]: arr.shape
Out[277]: (3, 3)
In the case of a structured array, the base of a field is the dtype that we get when viewing just that field.
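That last statement can be checked directly. A small sketch, reusing the '3i4, 2f4' dtype from above (the array length of 4 is arbitrary):

```python
import numpy as np

dt = np.dtype('3i4, 2f4')
arr = np.zeros(4, dtype=dt)

# Viewing one field gives a plain array whose dtype is the field's base,
# with the field's sub-array shape appended to the array shape
field = arr['f0']
same_dtype = field.dtype == dt['f0'].base
same_shape = field.shape == arr.shape + dt['f0'].shape
```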

Indexing array using column names

I'm loading fairly large input files into a NumPy array (30 columns, over 10k rows). The data contains only floating point numbers. To simplify data processing I'd like to name the columns and access them using human-readable names. AFAIK that's only possible using structured/record arrays. However, if I'm right, when I use structured arrays I'll lose some information. For instance:
x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
y = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype=[('x', float), ('y', float)])
Both arrays contain the same data and the same dtype. y can be accessed using column names:
In [155]: y['x']
Out[155]: array([ 1., 3., 11.])
Unfortunately, I lose (or do I get the wrong impression?) some essential properties when I use structured arrays: x and y have different shapes, y cannot be transposed, etc.
In [160]: x
Out[160]:
array([[ 1., 2.],
[ 3., 4.],
[11., 22.]])
In [161]: y
Out[161]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
In [162]: x.shape
Out[162]: (3, 2)
In [163]: y.shape
Out[163]: (3,)
In [164]: x.T
Out[164]:
array([[ 1., 3., 11.],
[ 2., 4., 22.]])
In [165]: y.T
Out[165]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
Is it possible to continue using "regular 2D Numpy arrays" and access columns using their names?
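No answer is recorded here, but two options come to mind, sketched below as suggestions rather than an authoritative answer. One keeps the regular 2D array and maps names to column indices with a plain dict (the names and the col dict are made up for this example). The other converts between the two layouts with numpy.lib.recfunctions when named fields are really wanted:

```python
import numpy as np
import numpy.lib.recfunctions as rf

x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')

# Option 1: keep the 2D array, look columns up by name through a dict
names = ['x', 'y']  # hypothetical column names
col = {name: i for i, name in enumerate(names)}
y_column = x[:, col['y']]  # still a plain float64 array

# Option 2: convert to a structured array and back when needed
y = rf.unstructured_to_structured(x, names=names)
x_again = rf.structured_to_unstructured(y)
```

With option 1 the array keeps its (3, 2) shape and can be transposed as usual. Option 2 gives field access by name, at the price of the 1-D shape shown in the question.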

Which dim to use on tf.metrics.mean_cosine_distance?

I'm confused about which dim refers to which actual dimension in TensorFlow in general, but concretely, when using tf.metrics.mean_cosine_distance.
Given
x = [
[1, 2, 3, 4, 5],
[0, 2, 3, 4, 5],
]
I'd like to calculate the distance column-wise. In other words, which dimension resolves to (pseudo code):
mean([
cosine_distance(x[0][0], x[1][0]),
cosine_distance(x[0][1], x[1][1]),
cosine_distance(x[0][2], x[1][2]),
cosine_distance(x[0][3], x[1][3]),
cosine_distance(x[0][4], x[1][4]),
])
It is along dim 0 for your input x. It's intuitive to see this once you construct your input x as a numpy array.
In [49]: x_arr = np.array(x, dtype=np.float32)
In [50]: x_arr
Out[50]:
array([[ 1., 2., 3., 4., 5.],
[ 0., 2., 3., 4., 5.]], dtype=float32)
# compute (mean) cosine distance between `x[0]` & `x[1]`
# where `x[0]` can be considered as `labels`
# while `x[1]` can be considered as `predictions`
In [51]: cosine_dist_axis0 = tf.metrics.mean_cosine_distance(x_arr[0], x_arr[1], 0)
This dim corresponds to the name axis in NumPy terminology. For example, a simple sum operation can be done along axis 0 like:
In [52]: x_arr
Out[52]:
array([[ 1., 2., 3., 4., 5.],
[ 0., 2., 3., 4., 5.]], dtype=float32)
In [53]: np.sum(x_arr, axis=0)
Out[53]: array([ 1., 4., 6., 8., 10.], dtype=float32)
When you compute tf.metrics.mean_cosine_distance, you're essentially computing the cosine distance between the vectors labels and predictions along dim 0 (and then taking the mean), provided your inputs are of shape (n, ), where n is the length of each vector (i.e. the number of entries in labels/predictions).
But if you're passing labels and predictions as column vectors, then tf.metrics.mean_cosine_distance has to be calculated along dim 1.
Example:
If your input label and prediction are column vectors,
# if your `label` is a column vector
In [66]: (x_arr[0])[:, None]
Out[66]:
array([[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ 5.]], dtype=float32)
# if your `prediction` is a column vector
In [67]: (x_arr[1])[:, None]
Out[67]:
array([[ 0.],
[ 2.],
[ 3.],
[ 4.],
[ 5.]], dtype=float32)
Then, tf.metrics.mean_cosine_distance has to be computed along dim 1:
# inputs
In [68]: labels = (x_arr[0])[:, None]
In [69]: predictions = (x_arr[1])[:, None]
# compute mean cosine distance between them
In [70]: cosine_dist_dim1 = tf.metrics.mean_cosine_distance(labels, predictions, 1)
This tf.metrics.mean_cosine_distance is more or less doing the same thing as scipy.spatial.distance.cosine but it also takes mean.
For your example case:
In [77]: x
Out[77]: [[1, 2, 3, 4, 5], [0, 2, 3, 4, 5]]
In [78]: import scipy.spatial.distance
In [79]: scipy.spatial.distance.cosine(x[0], x[1])
Out[79]: 0.009132
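For reference, the scipy number can be reproduced with plain NumPy. This is a minimal sketch of the textbook cosine-distance formula (the helper name cosine_distance is made up here); it makes no claim about how TensorFlow normalizes its inputs internally:

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5],
              [0, 2, 3, 4, 5]], dtype=np.float64)

def cosine_distance(u, v):
    """1 - cos(angle between u and v), for two 1-D vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Distance between the two rows, i.e. between labels and predictions
dist = cosine_distance(x[0], x[1])
```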

Numpy n-tuple array with dtype float

I need an expression that will grant me an 8-tuple float array. Currently, I have the 8-tuple array via:
E = np.zeros((n,m), dtype='8i') #8-tuple
However, when I assign a value at indices i, j via:
E[i,j][0] = 1000.2 #etc.
I get back a tuple array with dtype int:
[1000 0 0 0 0 0 0 0]
It appears I need a way of using the dtype within my zeros command to both set the n-tuple and the float value. Does anyone know how this is done?
If an array is integer dtype, then assigned values will be truncated:
In [169]: x=np.array([0,1,2])
In [170]: x
Out[170]: array([0, 1, 2])
In [173]: x[0] = 1.234
In [174]: x
Out[174]: array([1, 1, 2])
The array has to have a float dtype to hold float values.
Simply changing the i (integer) to f (float) produces a float array:
In [166]: E = np.zeros((2,3), dtype='8f')
In [167]: E.shape
Out[167]: (2, 3, 8)
In [168]: E.dtype
Out[168]: dtype('float32')
This '8f' dtype is not common. The string actually translates to:
In [175]: np.dtype('8f')
Out[175]: dtype(('<f4', (8,)))
But when used in np.zeros that 8 is treated as a dimension. Usually we specify all dimensions in the shape, as @FHTMitchell notes:
In [176]: E1 = np.zeros((2,3,8), dtype=np.float32)
In [177]: E1.shape
Out[177]: (2, 3, 8)
In [178]: E1.dtype
Out[178]: dtype('float32')
Your use of 'n-tuple' is unclear. While shape is a tuple, numeric arrays don't use tuple notation. That is reserved for structured arrays.
In [180]: np.zeros((3,), dtype='f,f,f,f')
Out[180]:
array([(0., 0., 0., 0.), (0., 0., 0., 0.), (0., 0., 0., 0.)],
dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')])
In [181]: _.shape
Out[181]: (3,)
This is a 1d array with 3 elements. The dtype shows 4 fields. Each element, or record, is displayed as a tuple.
But fields are indexed by name, not number:
In [182]: Out[180]['f1']
Out[182]: array([0., 0., 0.], dtype=float32)
It is also possible to put 'arrays' within fields:
In [183]: np.zeros((3,), dtype=[('f0','f',(4,))])
Out[183]:
array([([0., 0., 0., 0.],), ([0., 0., 0., 0.],), ([0., 0., 0., 0.],)],
dtype=[('f0', '<f4', (4,))])
In [184]: _['f0']
Out[184]:
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=float32)
Initially I thought the 8f notation would produce this sort of array. But apparently I have to either use the full notation with field name, or make a comma separated string:
In [185]: np.zeros((3,), dtype='4f,i')
Out[185]:
array([([0., 0., 0., 0.], 0), ([0., 0., 0., 0.], 0),
([0., 0., 0., 0.], 0)], dtype=[('f0', '<f4', (4,)), ('f1', '<i4')])
dtype notation can be confusing; see https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html
Unless you are intentionally trying to create a structured array, it is best to stay away from the '8f' notation.
In [189]: np.array([0,1,2,3],dtype='4i')
TypeError: object of type 'int' has no len()
In [190]: np.array([[0,1,2,3]],dtype='4i')
TypeError: object of type 'int' has no len()
In [191]: np.array([(0,1,2,3)],dtype='4i') # requires [(...)]
Out[191]: array([[0, 1, 2, 3]], dtype=int32)
Without the 4, I can simply write:
In [193]: np.array([[0,1,2,3]], dtype='i')
Out[193]: array([[0, 1, 2, 3]], dtype=int32)
In [194]: np.array([0,1,2,3], dtype='i')
Out[194]: array([0, 1, 2, 3], dtype=int32)
In [195]: np.array([[0,1,2,3]])
Out[195]: array([[0, 1, 2, 3]])
Try:
E = np.zeros((n,m), dtype='8f') #8-tuple
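Putting the answers together, a minimal sketch of the explicit-shape version (n and m are example sizes, and float64 is chosen here only for extra precision; '8f' gives float32):

```python
import numpy as np

n, m = 2, 3  # example sizes

# Put the 8 in the shape and use a float dtype
E = np.zeros((n, m, 8), dtype=np.float64)
E[0, 1][0] = 1000.2  # the fractional part is kept

# The '8f' spelling produces the same shape, with float32 elements
E_short = np.zeros((n, m), dtype='8f')
```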

Split NumPy array according to values in the array (a condition)

I have an array:
arr = [(1,1,1), (1,1,2), (1,1,3), (1,1,4)...(35,1,22),(35,1,23)]
I want to split my array according to the third value in each ordered triple. I want each third value of 1 to be the start of a new array. The results should be:
[(1,1,1), (1,1,2),...(1,1,35)][(1,2,1), (1,2,2),...(1,2,46)]
and so on. I know numpy.split should do the trick but I'm lost as to how to write the condition for the split.
Here's a quick idea, working with a 1d array. It can be easily extended to work with your 2d array:
In [385]: x=np.arange(10)
In [386]: I=np.where(x%3==0)
In [387]: I
Out[387]: (array([0, 3, 6, 9]),)
In [389]: np.split(x,I[0])
Out[389]:
[array([], dtype=float64),
array([0, 1, 2]),
array([3, 4, 5]),
array([6, 7, 8]),
array([9])]
The key is to use where to find the indices where you want split to act.
For a 2d array:
First make a sample 2d array, with something interesting in the 3rd column:
In [390]: arr=np.ones((10,3))
In [391]: arr[:,2]=np.arange(10)
In [392]: arr
Out[392]:
array([[ 1., 1., 0.],
[ 1., 1., 1.],
...
[ 1., 1., 9.]])
Then use the same where and boolean to find indexes to split on:
In [393]: I=np.where(arr[:,2]%3==0)
In [395]: np.split(arr,I[0])
Out[395]:
[array([], dtype=float64),
array([[ 1., 1., 0.],
[ 1., 1., 1.],
[ 1., 1., 2.]]),
array([[ 1., 1., 3.],
[ 1., 1., 4.],
[ 1., 1., 5.]]),
array([[ 1., 1., 6.],
[ 1., 1., 7.],
[ 1., 1., 8.]]),
array([[ 1., 1., 9.]])]
I cannot think of any numpy functions or tricks to do this. A simple solution using a for loop would be:
In [48]: arr = [(1,1,1), (1,1,2), (1,1,3), (1,1,4),(1,2,1),(1,2,2),(1,2,3),(1,3,1),(1,3,2),(1,3,3),(1,3,4),(1,3,5)]
In [49]: result = []
In [50]: for i in arr:
   ....:     if i[2] == 1:
   ....:         tempres = []
   ....:         result.append(tempres)
   ....:     tempres.append(i)
   ....:
In [51]: result
Out[51]:
[[(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 4)],
[(1, 2, 1), (1, 2, 2), (1, 2, 3)],
[(1, 3, 1), (1, 3, 2), (1, 3, 3), (1, 3, 4), (1, 3, 5)]]
From looking at the documentation, it seems that specifying the indices at which to split will work best. For your specific example the following works if arr is already a 2-dimensional numpy array:
np.split(arr, np.where(arr[:,2] == 1)[0])
arr[:,2] returns an array holding the 3rd entry of each row. The colon says to take every row and the 2 says to take the 3rd column, which is the 3rd component.
We then use np.where to return all the places where the 3rd coordinate is a 1. We have to do np.where()[0] to get at the array of locations directly.
We then plug in the indices we've found where the 3rd coordinate is 1 to np.split which splits at the desired locations.
Note that because the first entry has a 1 in the 3rd coordinate it will split before the first entry. This gives us one extra "split" array which is empty.
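If that leading empty piece is unwanted, one way around it is to drop index 0 from the split points before calling np.split. A small sketch with a shortened version of the data (the names starts and groups are chosen here for illustration):

```python
import numpy as np

arr = np.array([(1, 1, 1), (1, 1, 2), (1, 1, 3),
                (1, 2, 1), (1, 2, 2),
                (1, 3, 1)])

# Row indices where the third column equals 1, i.e. where a group starts
starts = np.flatnonzero(arr[:, 2] == 1)

# Splitting at index 0 would yield an empty first piece, so skip it
groups = np.split(arr, starts[starts > 0])
```

Here groups holds three arrays with 3, 2 and 1 rows respectively.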
