I have a complex nested structured array (often used as a recarray). It's simplified for this example, but in the real case there are multiple levels.
c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])
I am trying to index the "x" field of the nested dtype without naming the preceding fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top-level data, but it doesn't work.
Is there a way to do fancy indexing on nested structured arrays by field name?
I don't know if displaying the resulting array helps you visualize the nesting or not.
In [279]: c = [('x','f8'),('y','f8')]
...: A = [('data_string','|S20'),('data_val', c, 2)]
...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]:
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Note how the nesting of () and [] reflects the nesting of the fields.
arr.dtype only has direct access to the top level field names:
In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]:
array([[(0., 0.), (0., 0.)],
[(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])
But having accessed one field, we can then look at its fields:
In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]:
array([[0., 0.],
[0., 0.]])
Record number indexing is separate, and can be multidimensional in the usual sense:
In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]:
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:
In [289]: arr['data_val']['x']
Out[289]:
array([[0., 0.],
[1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])
Field indexing works like dict indexing. Note this display of the fields:
In [294]: arr.dtype.fields
Out[294]:
mappingproxy({'data_string': (dtype('S20'), 0),
'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})
Each record is stored as a block of 52 bytes:
In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'
20 of those bytes hold data_string, and 32 hold the two 16-byte c elements:
In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different:
In [306]: arr[['data_val']]
Out[306]:
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})
In [311]: arr['data_val'][['y']]
Out[311]:
array([[(3.,), (4.,)],
[(0.,), (0.,)]],
dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})
Each 'data_val' starts 20 bytes into the 52 byte record. And each 'y' starts 8 bytes into its 16 byte record.
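The offset arithmetic above can be checked by hand. The following is only a sketch: it copies the array to raw bytes and reads single doubles with the standard struct module, assuming the default packed (unaligned) layout shown in this answer. The helper name field_offset is made up for illustration.

```python
import struct
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr['data_val']['y'][0] = [3, 4]   # record 0
arr['data_val']['x'][1] = [1, 2]   # record 1

raw = arr.tobytes()                # 2 records * 52 bytes

def field_offset(record, element, inner_offset):
    # hypothetical helper: byte position of one float inside the raw buffer
    # 52 = record size, 20 = offset of data_val, 16 = size of one (x, y) pair
    return record * 52 + 20 + element * 16 + inner_offset

# 'y' sits 8 bytes into each pair, 'x' sits at the start of the pair
y00 = struct.unpack_from('<d', raw, field_offset(0, 0, 8))[0]
y01 = struct.unpack_from('<d', raw, field_offset(0, 1, 8))[0]
x11 = struct.unpack_from('<d', raw, field_offset(1, 1, 0))[0]
print(y00, y01, x11)
```

This is not how you would normally read values; it only illustrates that the field names map to fixed byte offsets within each record.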
The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.
The only way I can see the index being simplified is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
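Since NumPy has no built-in syntax for reaching a nested field directly, one workaround is a small recursive lookup. This is a sketch, not a NumPy feature: find_field is a hypothetical helper that returns the first field with the given name, searching depth first, and returns None when nothing matches.

```python
import numpy as np

def find_field(arr, name):
    """Return a view of the first (possibly nested) field called `name`, or None."""
    names = arr.dtype.names
    if names is None:          # plain (non-structured) dtype: nothing to descend into
        return None
    if name in names:
        return arr[name]
    for n in names:            # descend into each sub-field in turn
        found = find_field(arr[n], name)
        if found is not None:
            return found
    return None

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr['data_val']['x'][1] = [1, 2]

x = find_field(arr, 'x')
print(x.shape)                 # one row per record, one column per data_val element
```

Because field indexing returns views, the result shares memory with arr, so assigning into it changes the original array. If the same name occurs at several nesting levels, this sketch silently picks the first one it meets.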
I am importing huge data sets with various types of data, using genfromtxt.
My original code worked fine (ucols is the list of columns I want to load):
data = np.genfromtxt(fname,comments = '#', skip_header=1, usecols=(ucols))
Some of my values are strings, so to avoid getting NaN entries I tried setting dtype=None:
data = np.genfromtxt(fname, dtype = None,comments = '#', skip_header=1, usecols=(ucols))
Now for some reason I only get one column of data, i.e. the first column. Can someone explain what I am doing wrong?
EDIT: I now understand that I get a 1D structured array, where each element holds a whole row of values. However, I want my data as a regular NumPy array. Is it possible to use genfromtxt with dtype=None and still get a regular array instead of a structured one? Alternatively, is there a quick way to convert between the two? The second option is only acceptable if it is fast and efficient, since I usually work with much larger data sets than this one.
Make a structured array and write it to csv:
In [131]: arr=np.ones((3,), dtype='i,f,U10,i,f')
In [132]: arr['f2']=['a','bc','def']
In [133]: arr
Out[133]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10'), ('f3', '<i4'), ('f4', '<f4')])
In [134]: np.savetxt('test',arr,fmt='%d,%e,%s,%d,%f')
In [135]: cat test
1,1.000000e+00,a,1,1.000000
1,1.000000e+00,bc,1,1.000000
1,1.000000e+00,def,1,1.000000
load all columns with dtype=None:
In [137]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None)
Out[137]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3'), ('f3', '<i8'), ('f4', '<f8')])
load a subset of the columns:
In [138]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None,usecols=
...: (1,2,4))
Out[138]:
array([(1., 'a', 1.), (1., 'bc', 1.), (1., 'def', 1.)],
dtype=[('f0', '<f8'), ('f1', '<U3'), ('f2', '<f8')])
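For the conversion asked about in the EDIT, one option is numpy.lib.recfunctions.structured_to_unstructured (available in NumPy 1.16 and later). The sketch below assumes you select only numeric columns with usecols; the inline text stands in for the file and is made up for illustration.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# stand-in for the file on disk (illustrative data only)
text = "1,1.0,a,1,1.5\n2,2.0,bc,2,2.5\n3,3.0,def,3,3.5\n"

# skip the string column (index 2) so every loaded field is numeric
rec = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=None,
                    encoding=None, usecols=(0, 1, 4))

# convert the 1D structured array into a plain 2D array;
# mixed int/float fields are promoted to a common float dtype
plain = rfn.structured_to_unstructured(rec)
print(plain.shape, plain.dtype)
```

If string columns are included, there is no common numeric dtype, so keeping those columns in a separate array is probably the simpler route.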
The main reason why I am asking this question is because I do not exactly know how the structured arrays work compared to normal arrays and because I could not find suitable examples online for my case. Further, I am probably filling my structured array wrongly in the first place.
So, here I want to present the 'normal' numpy-array version (and what I need to do with it) and the new 'structured' array version. My (largest) datasets contain around 200e6 objects/rows with up to 40-50 properties/columns. They all have the same data type except for a few special columns: 'haloid', 'hostid', 'type'. They are ID numbers or flags and I have to keep them with the rest of the data because I have to identify my objects with them.
data set name:
data_array: ndarray shape: (42648, 10)
data type:
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
Reading data from .hdf5-file format to array
Most of the data is stored in hdf5 files (2000 of them correspond to one snapshot that I have to process at once), which should be read into a single array:
import numpy as np
import h5py as hdf5
mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows = 200000 # approximated
nr_files = 100 # up to 2200
nr_entries = 10 # up to 50
size = 0
size_before = 0
new_size = 0
# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)
i=0
while i<nr_files:
    size_before=new_size
    f = hdf5.File(path, "r")
    size=f[mydict['name0']].size
    new_size+=size
    a=0
    while a<nr_entries:
        name=mydict['name'+str(a)]
        # normal array:
        data_array[size_before:new_size, a] = f[name]
        # structured array:
        data_array[name][size_before:new_size] = f[name]
        a+=1
    i+=1
EDIT: I edited the code above because hpaulj helpfully pointed out the following:
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But
the h5 load is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)] In other words, the file has datasets with names like name0, name1,
and you are downloading those to an array with fields with the same
names.
This was a copy/paste mistake made while simplifying the code, and I have corrected it!
Question 1: Is that the right way to fill a structured array?
data_array[name][size_before:new_size] = f[name]
Question 2: How to address a column in a structured array?
data_array[name] #--> column with a certain name
Question 3: How to address an entire row in a structured array?
data_array[0] #--> first row
Question 4: How to address 3 rows and all columns?
# normal array:
print data_array[0:3,:]
[[ 1.21080866e+10 1.21080866e+10 0.00000000e+00 5.69363234e+08
1.28992369e+03 1.28894614e+03 1.32171442e+03 -1.08210000e+02
4.92900000e+02 6.50400000e+01]
[ 1.21080711e+10 1.21080711e+10 0.00000000e+00 4.76329837e+06
1.29058079e+03 1.28741361e+03 1.32358059e+03 -4.23130000e+02
5.08720000e+02 -6.74800000e+01]
[ 1.21080700e+10 1.21080700e+10 0.00000000e+00 2.22978043e+10
1.28750287e+03 1.28864306e+03 1.32270418e+03 -6.13760000e+02
2.19530000e+02 -2.28980000e+02]]
# structured array:
print data_array[0:3]
#it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
(12108071103L, 12108071103L, 0, 0.0, ... more data ...
... 228.02) ... more data ...
(8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]
Question 5: Why does data_array[0:3] not only return the first 3 rows with the 10 columns?
Question 6: How to address the first two elements in the first column?
# normal array:
print data_array[0:2,0]
[ 1.21080866e+10 1.21080711e+10]
# structured array:
print data_array['haloid'][0][0:2]
[12108086595 12108071103]
OK! I got that!
Question 7: How to address three specific columns by name and the first 3 rows in those columns?
# normal array:
print data_array[0:3, [0,2,1]]
[[ 1.21080866e+10 0.00000000e+00 1.21080866e+10]
[ 1.21080711e+10 0.00000000e+00 1.21080711e+10]
[ 1.21080700e+10 0.00000000e+00 1.21080700e+10]]
# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L)
(12108069992L, 0, 12108069992L)]
OK, the last example seems to work!!!
Question 8: What is the difference between:
(a) data_array['haloid'][0][0:3] and (b) data_array['haloid'][0:3]
where (a) really returns the first three haloids and (b) returns a lot of haloids (10x3).
[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340
9248632230 12108066342 10878169355 10077026070]
[ 6093565531 10077025463 8046772253 7871669276 5558161476 5558161473
12108068704 12108068708 12108077435 12108066338]
[ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312
12108075900 9248643751 6630111058 12108074389]]
Question 9: What is data_array['haloid'][0:3] actually returning?
Question 10: How to mask a structured array with np.where()?
# NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid
# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]
#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]
This seems to work, but I am not sure!
Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?
Question 12: How to sort a structured array?
# normal array:
data_sorted = data[np.argsort(data[:,col2])]
# structured array:
data_sorted = data[np.argsort(data['mstar'][:,col3])]
Thanks, I appreciate any help or advice!
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load
is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
You can iterate over the fields of an array defined with dt using
for name in data.dtype.names:
    data[name] = ...
e.g.
In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
...: ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
...: ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]:
array([(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
...: print(name)
...: arr[name] = 1
...:
haloid
hostid
....
In [24]: arr
Out[24]:
array([(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0] # get one record
Out[25]: (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)
In [26]: arr[0]['hostid'] # get one field, one record
In [27]: arr['hostid'] # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2] # subset of records
Out[28]: array([1, 1], dtype=uint64)
So filling a structured array by field name should work fine:
arr[name][n1:n2] = file[dataset_name]
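The filling pattern can be sketched without any h5 files. In the example below, plain dicts of arrays stand in for the opened files; the names chunks, start and stop are made up for illustration, and the dtype is shortened to three fields.

```python
import numpy as np

dt = np.dtype([('haloid', '<u8'), ('type', 'i1'), ('mstar', '<f8')])

# stand-ins for the per-file datasets (illustrative values only)
chunks = [
    {'haloid': np.array([10, 11], dtype='u8'),
     'type': np.array([0, 1], dtype='i1'),
     'mstar': np.array([1.5, 2.5])},
    {'haloid': np.array([12, 13, 14], dtype='u8'),
     'type': np.array([2, 0, 1], dtype='i1'),
     'mstar': np.array([3.5, 4.5, 5.5])},
]

data = np.zeros(10, dtype=dt)      # preallocated larger than needed
start = 0
for chunk in chunks:
    stop = start + len(chunk['haloid'])
    for name in dt.names:          # one assignment per field and per file
        data[name][start:stop] = chunk[name]
    start = stop

data = data[:start]                # keep only the filled records
print(data.shape)
```

The array stays 1D throughout: one record per object, with the fields playing the role of columns.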
Prints like this:
structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]
and
[[ (12108086595L, 12108086595L, 0,
look to me like the structured data_array is actually 2d, created with something like (see Question 8)
data_array = np.zeros((10, nr_rows), dtype=dt)
That's the only way that the [0][0:3] indexing would work.
For the 2d array:
mask = np.where(data[:,col2] > data[:,col1])
compares 2 columns. When in doubt, look first at the boolean data[:,col2] > data[:,col1]. where just returns the indices where that boolean array is True.
Simple example of masked indexing:
In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False, True, True, True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]:
array([[3, 2],
[4, 1],
[5, 0]])
In [38]: x[idx,:]
Out[38]:
array([[[3, 2],
[4, 1],
[5, 0]]])
In this structured example, data['x_pos'] selects the field. The [0] is required to select the 1st row of that 2d array (the size 10 dimension). The rest of the comparison and where should work as with a 2d array.
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
mask[:][0] on a where tuple is probably not needed. mask is a tuple, [:] makes a copy and [0] selects the 1st element, which is an array. Sometimes an arr[idx[0],:] might be needed instead of arr[idx,:], but don't do that routinely.
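For comparison, here is how masking and sorting might look if the structured array is 1D (one record per object), so that no [0] is needed. This is a sketch with made-up values and a shortened dtype.

```python
import numpy as np

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
      ('x_pos', '<f8'), ('y_pos', '<f8')]
data = np.zeros(4, dtype=dt)
data['haloid'] = [1, 2, 3, 4]
data['hostid'] = [9, 9, 8, 8]
data['type'] = [0, 2, 2, 0]
data['x_pos'] = [1., 5., 2., 7.]
data['y_pos'] = [3., 4., 6., 0.]

# boolean mask built from two fields; no np.where needed
mask = data['x_pos'] > data['y_pos']
selected = data[mask]              # copy of the matching records

# conditional assignment from one field to another
is_type2 = data['type'] == 2
data['haloid'][is_type2] = data['hostid'][is_type2]

# sort whole records by one field
order = np.argsort(data['x_pos'])
data_sorted = data[order]
print(selected['haloid'], data['haloid'], data_sorted['haloid'])
```

np.sort(data, order='x_pos') is an alternative to the argsort line when you only need the sorted copy.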
My first comment suggested separate arrays
dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
data_id = np.zeros((n,), dtype=dt1)
data = np.zeros((n,m), dtype=float) # m float columns
Or even
haloid = np.zeros((n,), '<u8')
hostid = np.zeros((n,), '<u8')
type = np.zeros((n,), 'i1')
With these arrays, data_array['hostid'][0], data_id['hostid'] and hostid should all return the same 1d array, and be equally usable in mask expressions.
Sometimes it is convenient to keep the ids and data in one structure. That's especially true if writing/reading to csv format files. But for masked selection it doesn't help that much. And for data calculations across data fields it can be a pain.
I could also suggest a compound dtype, one with
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]
In [41]: np.zeros((4,), dtype=dt2)
Out[41]:
array([(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.]),
(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.])],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]:
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
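With that compound dtype, the 'data' field behaves like an ordinary 2d float array, so column-wise calculations stay simple. A short sketch with made-up values (m = 3 is an assumption for the example):

```python
import numpy as np

m = 3
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]
arr = np.zeros(4, dtype=dt2)

# fill all float columns at once with a (4, m) block
arr['data'] = np.arange(12.0).reshape(4, m)

block = arr['data']                # view with shape (4, m)
col_means = block.mean(axis=0)     # one mean per float column
first_col = block[:, 0]            # access a float column by number
print(block.shape, col_means, first_col)
```

The trade-off is that the float columns are then addressed by number rather than by a name like 'x_pos'.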
Is it better to access the float data by column number or by a name like 'x_coor'? Do you need to do calculations with several float columns at once, or will you always access them individually?
From your description, I think the simplest approach is to read only the useful data into separate arrays with different names (perhaps one per type).
If you want all the data read into one array, Pandas may be a good choice:
http://pandas.pydata.org
http://pandas.pydata.org/pandas-docs/stable/
I haven't tried that myself yet, though. Have fun giving it a try.
ANSWER TO QUESTION 11:
Question 11: Can I still use np.resize() like: data_array =
np.resize(data_array,(new_size, nr_entries)) to resize/reshape my
array?
If I resize my array like this, each field in dt gets 10 columns. That produces the 'weird' result of Question 8b: a (10x3) block of haloids.
The right way to trim my array, keeping only the filled part of it (I made the array big enough to hold a varying number of data blocks that I read one after another), is:
data_array = data_array[:new_size]
print np.info(data_array)
class: ndarray
shape: (42648,)
strides: (73,)
type: [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
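The difference between the two approaches shows up even on a tiny array. This sketch uses a two-field dtype and made-up values; np.resize repeats records as needed to fill the requested 2d shape, while slicing just keeps the leading records.

```python
import numpy as np

dt = [('haloid', '<u8'), ('mstar', '<f8')]
data = np.zeros(5, dtype=dt)
data['haloid'] = [1, 2, 3, 4, 5]

# np.resize to a 2d shape: still one dtype per element, but now a grid of records
resized = np.resize(data, (3, 2))
print(resized.shape, resized['haloid'].shape)

# slicing keeps the array 1d and only drops the unused tail
trimmed = data[:3]
print(trimmed.shape, trimmed['haloid'])
```

Note that the slice is a view of the original buffer; use data[:3].copy() if the large preallocated array should be freed.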