I just discovered NumPy structured arrays and I find them quite powerful. The natural question arises: how in the world do I create a NumPy structured scalar? Let me show you what I mean. Let's say I want a structure containing some data:
import numpy as np
dtype = np.dtype([('a', np.float64), ('b', np.int_)])  # np.float_ was an alias of np.float64, removed in NumPy 2.0
ar = np.array((0.5, 1), dtype=dtype)
ar['a']
This gives me array(0.5) instead of 0.5. On the other hand, if I do this:
import numpy as np
dtype = np.dtype([('a', np.float64), ('b', np.int_)])  # np.float_ was an alias of np.float64, removed in NumPy 2.0
ar = np.array([(0.5, 1)], dtype=dtype)
ar[0]['a']
I get 0.5, just like I want. Which means that ar[0] isn't an array, but a scalar. Is it possible to create a structured scalar in a way more elegant than the one I've described?
Singleton isn't quite the right term, but I get what you want.
arr = np.array((0.5, 1), dtype=dtype)
This creates a 0d, single-element array of this dtype. Check its dtype and shape.
arr.item() returns a tuple (0.5, 1). Also test arr[()] and arr.tolist().
np.float64(0.5) creates a float with a numpy wrapper. It is similar to, but not exactly the same as, np.array(0.5). Their methods differ somewhat.
I don't know of anything similar for a compound dtype.
In [123]: dt = np.dtype('i,f,U10')
In [124]: dt
Out[124]: dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [125]: arr = np.array((1,2,3),dtype=dt)
In [126]: arr
Out[126]:
array((1, 2., '3'),
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [127]: arr.shape
Out[127]: ()
arr is a 0d 1 element array. It can be indexed with:
In [128]: arr[()]
Out[128]: (1, 2., '3')
In [129]: type(_)
Out[129]: numpy.void
This indexing produces a np.void object. Doing the same thing on a 0d float array would produce a np.float64 object.
But in the NumPy version used here you can't use np.void((1,2,3), dtype=dt) to directly create such an object (in contrast to np.float64(12.34)).
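For what it's worth, newer NumPy releases (1.24 and later, as far as I know) do let np.void take a dtype argument. Here is a minimal sketch that tries that route and falls back to indexing a 0d array. make_structured_scalar is a hypothetical helper name I made up, not a NumPy function:

```python
import numpy as np

def make_structured_scalar(values, dtype):
    """Build a np.void scalar from a tuple (hypothetical helper, not a NumPy API)."""
    dtype = np.dtype(dtype)
    try:
        # Assumption: NumPy >= 1.24, where np.void accepts a dtype keyword.
        return np.void(values, dtype=dtype)
    except (TypeError, ValueError):
        # Older NumPy: build a 0d array and index it with an empty tuple.
        return np.array(values, dtype=dtype)[()]

dt = np.dtype('i4,f4,U10')
scalar = make_structured_scalar((1, 2.0, '3'), dt)
print(type(scalar), scalar['f0'], scalar.dtype.names)
```

Either branch should give the same kind of object, so the caller does not need to care which one ran.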
item is the normal way of extracting a 'scalar' from an array. Here it returns a tuple, the same sort of object that we used as input to create arr:
In [131]: arr.item()
Out[131]: (1, 2.0, '3')
In [132]: type(_)
Out[132]: tuple
np.asscalar(arr) returns the same tuple (np.asscalar has since been removed from NumPy; arr.item() is the replacement).
One difference between the np.void object and the tuple is that the void can still be indexed by field name, arr[()]['f0'], whereas the tuple has to be indexed by number, arr.item()[0]. The void still has a dtype, while the tuple doesn't.
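To make the void-versus-tuple difference concrete, here is a small sketch using the two-field dtype from the question (I spell the types as np.float64 and np.int64, which is my choice, not something the question fixes):

```python
import numpy as np

dt = np.dtype([('a', np.float64), ('b', np.int64)])
arr0d = np.array((0.5, 1), dtype=dt)   # 0d structured array, shape ()

as_void = arr0d[()]       # np.void: keeps the dtype and the field names
as_tuple = arr0d.item()   # plain tuple of Python objects

print(as_void['a'], as_void.dtype.names)   # field access by name
print(as_tuple[0])                          # positional access only
```

If you want to keep using field names downstream, the np.void from arr0d[()] is probably the closer match to a "structured scalar".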
fromrecords makes a recarray. This is similar to a structured array, but allows us to access fields as attributes. It may actually be an older class that has been merged into numpy, hence the np.rec prefix. Mostly we use structured arrays, though np.rec still has some convenience functions (more are in numpy.lib.recfunctions):
In [133]: res = np.rec.fromrecords((1,2,3), dt)
In [134]: res
Out[134]:
rec.array((1, 2., '3'),
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [135]: res.f0
Out[135]: array(1, dtype=int32)
In [136]: res.item()
Out[136]: (1, 2.0, '3')
In [137]: type(_)
Out[137]: tuple
In [138]: res[()]
Out[138]: (1, 2.0, '3')
In [139]: type(_)
Out[139]: numpy.record
So this produced a np.record instead of a np.void. But that's just a subclass:
In [143]: numpy.record.__mro__
Out[143]: (numpy.record, numpy.void, numpy.flexible, numpy.generic, object)
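If you already have a structured array, viewing it as a recarray is another way to get the attribute-style access. A short sketch with a made-up two-row array:

```python
import numpy as np

dt = np.dtype([('f0', 'i4'), ('f1', 'f4'), ('f2', 'U10')])
arr = np.array([(1, 2.0, 'a'), (3, 4.0, 'b')], dtype=dt)

rec = arr.view(np.recarray)   # same memory, recarray interface

print(rec.f0)          # attribute access to a whole field
print(type(rec[0]))    # a single element is a np.record
print(rec[0].f2)       # attribute access works on the scalar too
```

Because this is a view, nothing is copied; changes made through rec show up in arr.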
Accessing a structured array by field name gives an array of the corresponding dtype (and the same shape):
In [145]: arr['f1']
Out[145]: array(2.0, dtype=float32)
In [146]: arr[()]['f1']
Out[146]: 2.0
In [147]: type(_)
Out[147]: numpy.float32
Out[146] could also be created with np.float32(2.0).
Checking my comment on the ar[0] for the 1d array:
In [158]: arr1d = np.array([(1,2,3)], dt)
In [159]: arr1d
Out[159]:
array([(1, 2., '3')],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [160]: arr1d[0]
Out[160]: (1, 2., '3')
In [161]: type(_)
Out[161]: numpy.void
So arr[()] and arr1d[0] do the same thing for arrays of their respective dimensions. Likewise arr2d[0,0], which can also be written as arr2d[(0,0)].
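Use np.asscalar.
In both of your cases it will be just np.asscalar(ar['a']). (np.asscalar was deprecated in NumPy 1.16 and later removed; ar['a'].item() does the same thing.)
Also, you might find the ndarray.item method useful.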
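A small sketch of two ways to pull the plain value out of the 0d field array, using only ndarray.item and empty-tuple indexing (the dtype spelling is my assumption):

```python
import numpy as np

dt = np.dtype([('a', np.float64), ('b', np.int64)])
ar = np.array((0.5, 1), dtype=dt)

py_value = ar['a'].item()   # Python float
np_value = ar['a'][()]      # np.float64 scalar

print(py_value, type(py_value).__name__)
print(np_value, type(np_value).__name__)
```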
Related
I have a complex nested structured array (often used as a recarray). It's simplified for this example, but in the real case there are multiple levels.
c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])
I am trying to index the "x" field of the nested array's dtype without naming the preceding fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top level data, but it doesn't work.
Are there ways to do fancy indexing on nested structured arrays by field name?
I don't know if displaying the resulting array helps you visualize the nesting or not.
In [279]: c = [('x','f8'),('y','f8')]
...: A = [('data_string','|S20'),('data_val', c, 2)]
...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]:
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Note how the nesting of () and [] reflects the nesting of the fields.
arr.dtype only has direct access to the top level field names:
In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]:
array([[(0., 0.), (0., 0.)],
[(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])
But having accessed one field, we can then look at its fields:
In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]:
array([[0., 0.],
[0., 0.]])
Record number indexing is separate, and can be multidimensional in the usual sense:
In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]:
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:
In [289]: arr['data_val']['x']
Out[289]:
array([[0., 0.],
[1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])
I mentioned that fields indexing is like dict indexing. Note this display of the fields:
In [294]: arr.dtype.fields
Out[294]:
mappingproxy({'data_string': (dtype('S20'), 0),
'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})
Each record is stored as a block of 52 bytes:
In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'
20 of those are data_string, and 32 are the 2 c fields
In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different:
In [306]: arr[['data_val']]
Out[306]:
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})
In [311]: arr['data_val'][['y']]
Out[311]:
array([[(3.,), (4.,)],
[(0.,), (0.,)]],
dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})
Each 'data_val' starts 20 bytes into the 52 byte record. And each 'y' starts 8 bytes into its 16 byte record.
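The byte layout can also be read off the dtype without creating any array. A sketch that rebuilds the same dtype and prints the offsets and sizes discussed above:

```python
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
dt = np.dtype(A)

# dt.fields maps each top-level name to (dtype, byte offset).
for name, (field_dtype, offset) in dt.fields.items():
    print(name, offset, field_dtype.itemsize)

sub = dt.fields['data_val'][0]   # subarray dtype: shape (2,) of the c struct
inner = sub.base                 # the c struct itself
print(sub.shape, inner.itemsize, inner.fields['y'][1])
```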
The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.
The only way I can see the index being simplified, is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
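If typing out the full chain of field names is the real annoyance, one workaround is a small recursive lookup. This is only a sketch: get_nested_field is a hypothetical helper, it returns the first match it finds depth-first, and it makes no attempt to handle the same name appearing in several branches.

```python
import numpy as np

def get_nested_field(arr, name):
    """Return a view of the first field called `name`, searching nested dtypes."""
    if arr.dtype.names is None:
        raise KeyError(name)          # plain dtype, nothing to search
    if name in arr.dtype.names:
        return arr[name]
    for top in arr.dtype.names:
        try:
            return get_nested_field(arr[top], name)
        except KeyError:
            continue                  # not under this field, try the next one
    raise KeyError(name)

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)

x = get_nested_field(zeros, 'x')   # same as zeros['data_val']['x']
x[...] = 5.0                       # it is a view, so this writes into zeros
print(zeros)
```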
I have a process where I need to convert a numpy recarray to bytes, and after that reconstruct the recarray from the bytes.
However, I am not sure how to recover the array from bytes.
Does anyone know how I could do it?
Example code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros(500))
rec = df.to_records()
rec_s = rec.tobytes() # this returns a bytes object (tostring() is a deprecated alias)
# perform some computation
new_rec = <method to recover from bytes>(rec_s)
Note: I don't actually need to use a numpy recarray, just some structure that will allow me to transform the pandas dataframe into a bytes object, and also recover it.
In [497]: arr = np.ones(3, dtype='i,i,f')
In [498]: arr
Out[498]:
array([(1, 1, 1.), (1, 1, 1.), (1, 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4')])
In [499]: astr = arr.tobytes()   # tostring() is a deprecated alias of tobytes()
In [500]: astr
Out[500]: b'\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x80?\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x80?\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x80?'
Recover it using the same dtype:
In [502]: np.frombuffer(astr, arr.dtype)
Out[502]:
array([(1, 1, 1.), (1, 1, 1.), (1, 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4')])
If the source was 2d, you'd have to reshape as well.
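A sketch of the 2d round trip, assuming you keep the dtype and shape around yourself, since the bytes don't carry them:

```python
import numpy as np

arr = np.ones((2, 3), dtype='i4,i4,f4')   # 2d structured array
raw = arr.tobytes()                        # bytes carry no dtype or shape

# Rebuild with the original dtype, then restore the original shape.
restored = np.frombuffer(raw, dtype=arr.dtype).reshape(arr.shape)

# frombuffer on a bytes object gives a read-only array; copy to modify it.
writable = restored.copy()
writable['f0'][0, 0] = 7
print(restored.shape, writable['f0'])
```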
I have a numpy float array and an int array of the same length. I would like to concatenate them such that the output has the composite dtype (float, int). column_stacking them together just yields a float64 array:
import numpy
a = numpy.random.rand(5)
b = numpy.random.randint(0, 100, 5)
ab = numpy.column_stack([a, b])
print(ab.dtype)
float64
Any hints?
Create a 'blank' array:
In [391]: dt = np.dtype('f,i')
In [392]: arr = np.zeros(5, dtype=dt)
In [393]: arr
Out[393]:
array([(0., 0), (0., 0), (0., 0), (0., 0), (0., 0)],
dtype=[('f0', '<f4'), ('f1', '<i4')])
fill it:
In [394]: arr['f0']=np.random.rand(5)
In [396]: arr['f1']=np.random.randint(0,100,5)
In [397]: arr
Out[397]:
array([(0.40140057, 75), (0.93731374, 99), (0.6226782 , 48),
(0.01068745, 68), (0.19197434, 53)],
dtype=[('f0', '<f4'), ('f1', '<i4')])
There are recfunctions that can be used as well, but it's good to know (and understand) this basic approach.
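As one example of the recfunctions route, merge_arrays from numpy.lib.recfunctions can combine the two arrays directly. A sketch, using a seeded random generator of my own choosing so the data are reproducible:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

rng = np.random.default_rng(0)
a = rng.random(5)             # float64 values
b = rng.integers(0, 100, 5)   # integer values

ab = rfn.merge_arrays((a, b))   # fields get the default names f0, f1
print(ab.dtype)
print(ab)
```

Unlike the 'f,i' dtype above, this keeps the input dtypes as they are rather than casting to 4-byte types.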
I am importing huge data sets with various types of data, using genfromtxt.
My original code worked fine (ucols is the list of columns I want to load):
data = np.genfromtxt(fname,comments = '#', skip_header=1, usecols=(ucols))
Some of my values are strings, so to avoid getting entries of NaN I tried setting dtype = None :
data = np.genfromtxt(fname, dtype = None,comments = '#', skip_header=1, usecols=(ucols))
Now for some reason I only get one column of data, i.e. the first column. Can someone explain what I am doing wrong?
EDIT: I now understand that I am supposed to get a 1D structured array, where each element holds a whole row of values. However, I want my data as a regular numpy array. Is it possible to use genfromtxt with dtype = None and still get a regular array instead of a structured array? Alternatively, is there a quick way to convert between the two? The second option is only acceptable if it is fast and efficient, since I usually work with much larger data sets than this one.
Make a structured array and write it to csv:
In [131]: arr=np.ones((3,), dtype='i,f,U10,i,f')
In [132]: arr['f2']=['a','bc','def']
In [133]: arr
Out[133]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10'), ('f3', '<i4'), ('f4', '<f4')])
In [134]: np.savetxt('test',arr,fmt='%d,%e,%s,%d,%f')
In [135]: cat test
1,1.000000e+00,a,1,1.000000
1,1.000000e+00,bc,1,1.000000
1,1.000000e+00,def,1,1.000000
load all columns with dtype=None:
In [137]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None)
Out[137]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3'), ('f3', '<i8'), ('f4', '<f8')])
load a subset of the columns:
In [138]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None,usecols=
...: (1,2,4))
Out[138]:
array([(1., 'a', 1.), (1., 'bc', 1.), (1., 'def', 1.)],
dtype=[('f0', '<f8'), ('f1', '<U3'), ('f2', '<f8')])
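On the question of converting to a regular array: for the numeric columns, structured_to_unstructured in numpy.lib.recfunctions (NumPy 1.16 and later) is one option. A sketch that reads the same three rows from an in-memory string instead of a file; the string column has to be left out because it can't share a numeric dtype:

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

text = "1,1.0,a,1,1.0\n1,1.0,bc,1,1.0\n1,1.0,def,1,1.0\n"
data = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=None,
                     encoding='utf-8')

# Pick the numeric fields, then collapse them into an ordinary 2d array.
numeric = rfn.structured_to_unstructured(data[['f0', 'f1', 'f3', 'f4']])
print(numeric.shape, numeric.dtype)
```

The int and float fields get promoted to a common dtype, so expect a float array here.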
Say I have the following array:
a = array([(1L, 2.0, 'buckle_my_shoe'), (3L, 4.0, 'margery_door')],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', 'S14')])
How do I access a column?
I can access a row using this syntax:
a[0][:]
but get an error when I try to access a column in the same way.
a[:][0]
Note: this is not a duplicate of "How to access the ith column of a NumPy multidimensional array?" since I am using an array of different types.
In [33]: a['f0']
Out[33]: array([1, 3], dtype=int64)
In [34]: a['f1']
Out[34]: array([ 2., 4.])
In [35]: a['f2']
Out[35]:
array(['buckle_my_shoe', 'margery_door'],
dtype='|S14')
Here, f0, f1 and f2 are the field names from your array's dtype.
For more information, see Structured Arrays.
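The array in the question is shown in Python 2 repr (1L, unprefixed byte strings). A Python 3 sketch of the same data, showing single-field, multi-field and row-then-field access:

```python
import numpy as np

a = np.array([(1, 2.0, b'buckle_my_shoe'), (3, 4.0, b'margery_door')],
             dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', 'S14')])

col = a['f0']             # one "column": a plain int64 array
pair = a[['f0', 'f1']]    # several fields: still a structured array
cell = a[0]['f2']         # first row, then field f2

print(col, pair.dtype.names, cell)
```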