How to get field of nested numpy structured array (advanced indexing) - python

I have a complex nested structured array (often used as a recarray). It's simplified for this example, but in the real case there are multiple levels.
c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])
I am trying to index the "x" datatype of the nested arrays datatype without defining the preceding named fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top level data, but it doesn't work.
Is there a way to do fancy indexing on nested structured arrays by field name?

I don't know if displaying the resulting array helps you visualize the nesting or not.
In [279]: c = [('x','f8'),('y','f8')]
...: A = [('data_string','|S20'),('data_val', c, 2)]
...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]:
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Note how the nesting of () and [] reflects the nesting of the fields.
arr.dtype only has direct access to the top level field names:
In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]:
array([[(0., 0.), (0., 0.)],
[(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])
But having accessed one field, we can then look at its fields:
In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]:
array([[0., 0.],
[0., 0.]])
Record number indexing is separate, and can be multidimensional in the usual sense:
In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]:
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:
In [289]: arr['data_val']['x']
Out[289]:
array([[0., 0.],
[1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])
I mentioned that fields indexing is like dict indexing. Note this display of the fields:
In [294]: arr.dtype.fields
Out[294]:
mappingproxy({'data_string': (dtype('S20'), 0),
'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})
Each record is stored as a block of 52 bytes:
In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'
20 of those are data_string, and 32 are the 2 c fields
In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different
In [306]: arr[['data_val']]
Out[306]:
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})
In [311]: arr['data_val'][['y']]
Out[311]:
array([[(3.,), (4.,)],
[(0.,), (0.,)]],
dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})
Each 'data_val' starts 20 bytes into the 52 byte record. And each 'y' starts 8 bytes into its 16 byte record.

The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.
The only way I can see the index being simplified, is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
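Since NumPy has no built-in zeros[:, 'x'] syntax, one workaround is a small helper that walks the dtype and returns the first nested field with a given name. This is only a sketch: find_field is a hypothetical helper, not a NumPy function, and it returns the first match in depth-first order, which may not be what you want if the same name appears at several levels.

```python
import numpy as np

def find_field(arr, name):
    """Return the first (depth-first) nested field called `name`, or None."""
    if arr.dtype.names is None:          # plain field, nothing to descend into
        return None
    if name in arr.dtype.names:
        return arr[name]
    for parent in arr.dtype.names:       # search one level deeper
        found = find_field(arr[parent], name)
        if found is not None:
            return found
    return None

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
zeros = np.zeros(3, dtype=A)
zeros['data_val']['x'] = [[1, 2], [3, 4], [5, 6]]

# Same result as zeros['data_val']['x'], without naming 'data_val'
xs = find_field(zeros, 'x')
```

Because field indexing returns views, the array returned here still shares memory with zeros, so assignments through it modify the original.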

Related

genfromtxt only imports first column, after changing dtype

I am importing huge data sets with various types of data, using genfromtxt.
My original code worked fine (ucols is the list of columns I want to load):
data = np.genfromtxt(fname,comments = '#', skip_header=1, usecols=(ucols))
Some of my values are strings, so to avoid getting entries of NaN I tried setting dtype = None :
data = np.genfromtxt(fname, dtype = None,comments = '#', skip_header=1, usecols=(ucols))
Now for some reason I only get one column of data, i.e. the first column. Can someone explain what I am doing wrong?
EDIT: I now understand that I am supposed to get a 1D structured array in which each element is a whole row of values. However, I want my data as a regular numpy array. Is it possible to use genfromtxt with dtype = None and still obtain a regular numpy array instead of a structured array, or, alternatively, is there a quick way to convert between the two? The second approach is only acceptable if it is quick and efficient, since I usually handle much larger data sets than this one.
Make a structured array and write it to csv:
In [131]: arr=np.ones((3,), dtype='i,f,U10,i,f')
In [132]: arr['f2']=['a','bc','def']
In [133]: arr
Out[133]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10'), ('f3', '<i4'), ('f4', '<f4')])
In [134]: np.savetxt('test',arr,fmt='%d,%e,%s,%d,%f')
In [135]: cat test
1,1.000000e+00,a,1,1.000000
1,1.000000e+00,bc,1,1.000000
1,1.000000e+00,def,1,1.000000
load all columns with dtype=None:
In [137]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None)
Out[137]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3'), ('f3', '<i8'), ('f4', '<f8')])
load a subset of the columns:
In [138]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None,usecols=
...: (1,2,4))
Out[138]:
array([(1., 'a', 1.), (1., 'bc', 1.), (1., 'def', 1.)],
dtype=[('f0', '<f8'), ('f1', '<U3'), ('f2', '<f8')])
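For the conversion asked about in the EDIT, one option is numpy.lib.recfunctions.structured_to_unstructured, applied to the numeric fields only. The sketch below uses made-up in-memory data (the text variable is an assumption standing in for the real file) and the default f0, f1, ... field names that genfromtxt assigns.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up sample standing in for the real file
text = "1,1.5,a,4,2.0\n2,2.5,bc,5,3.0\n3,3.5,def,6,4.0\n"

# dtype=None gives a 1-D structured array with one field per column
data = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=None, encoding=None)

# Pick the numeric fields and convert them to a regular 2-D float array
numeric = data[['f0', 'f1', 'f3', 'f4']]
plain = rfn.structured_to_unstructured(numeric)

# Alternative: skip the string column up front and let genfromtxt
# use its default float dtype, which yields a 2-D array directly
direct = np.genfromtxt(io.StringIO(text), delimiter=',', usecols=(0, 1, 3, 4))
```

If the string column is not needed at all, the second form avoids the structured array entirely.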

Indexing array using column names

I'm loading pretty large input files into a Numpy array (30 columns, over 10k rows). The data contains only floating point numbers. To simplify data processing I'd like to name the columns and access them using human-readable names. AFAIK that's only possible using structured/record arrays. However, if I'm right, when I use structured arrays I'll lose some information. For instance:
x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
y = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype=[('x', float), ('y', float)])
Both arrays contain the same data, stored as float64. y can be accessed using column names:
In [155]: y['x']
Out[155]: array([ 1., 3., 11.])
Unfortunately, I lose (or do I get the wrong impression?) some essential properties when I use structured arrays. x and y have different shapes, y cannot be transposed, etc.
In [160]: x
Out[160]:
array([[ 1., 2.],
[ 3., 4.],
[11., 22.]])
In [161]: y
Out[161]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
In [162]: x.shape
Out[162]: (3, 2)
In [163]: y.shape
Out[163]: (3,)
In [164]: x.T
Out[164]:
array([[ 1., 3., 11.],
[ 2., 4., 22.]])
In [165]: y.T
Out[165]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
Is it possible to continue using "regular 2D Numpy arrays" and access columns using their names?
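Two possible approaches, sketched under the assumption that every column is float64 and the 2D array is C-contiguous: keep the regular array and map names to column numbers with a dict, or create a structured view that shares memory with the 2D array. The column names here are just the ones from the example.

```python
import numpy as np

x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
names = ['x', 'y']  # one name per column

# Option 1: a plain dict from column name to column index
col = {name: i for i, name in enumerate(names)}
second = x[:, col['y']]          # same as x[:, 1]

# Option 2: a structured view on the same memory (no copy).
# Requires a C-contiguous array whose columns all share one dtype.
rec_dtype = np.dtype([(name, x.dtype) for name in names])
y = x.view(rec_dtype).reshape(-1)

# Writing through the structured view changes the 2D array as well
y['x'][0] = 100.0
```

With option 2, x keeps its (3, 2) shape and can still be transposed or used in matrix operations, while y gives access by name.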

Creating a Numpy structure scalar instead of array

I just discovered Numpy structured arrays and I find them to be quite powerful. The natural question arises in my mind: How in the world do I create a Numpy structure scalar. Let me show you what I mean. Let's say I want a structure containing some data:
import numpy as np
dtype = np.dtype([('a', np.float64), ('b', np.int_)])
ar = np.array((0.5, 1), dtype=dtype)
ar['a']
This gives me array(0.5) instead of 0.5. On the other hand, if I do this:
import numpy as np
dtype = np.dtype([('a', np.float64), ('b', np.int_)])
ar = np.array([(0.5, 1)], dtype=dtype)
ar[0]['a']
I get 0.5, just like I want. Which means that ar[0] isn't an array, but a scalar. Is it possible to create a structured scalar in a way more elegant than the one I've described?
Singleton isn't quite the right term, but I get what you want.
arr = np.array((0.5, 1), dtype=dtype)
Creates a 0d, single element array of this dtype. Check its dtype and shape.
arr.item() returns a tuple (0.5, 1). Also test arr[()] and arr.tolist().
np.float64(0.5) creates a float with a numpy wrapper. It is similar to, but not exactly the same as, np.array(0.5). Their methods differ somewhat.
I don't know anything similar with a compound dtype.
In [123]: dt = np.dtype('i,f,U10')
In [124]: dt
Out[124]: dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [125]: arr = np.array((1,2,3),dtype=dt)
In [126]: arr
Out[126]:
array((1, 2., '3'),
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [127]: arr.shape
Out[127]: ()
arr is a 0d 1 element array. It can be indexed with:
In [128]: arr[()]
Out[128]: (1, 2., '3')
In [129]: type(_)
Out[129]: numpy.void
This indexing produces a np.void object. Doing the same thing on a 0d float array would produce a np.float64 object.
But in the NumPy version used here you can't use np.void((1,2,3), dtype=dt) to directly create such an object (in contrast to np.float64(12.34)). NumPy 1.24 and later do accept a dtype argument to np.void.
item is the normal way of extracting a 'scalar' from an array. Here it returns a tuple, the same sort of object that we used as input to create arr:
In [131]: arr.item()
Out[131]: (1, 2.0, '3')
In [132]: type(_)
Out[132]: tuple
np.asscalar(arr) returns the same tuple (np.asscalar was deprecated in NumPy 1.16 and has since been removed, so prefer arr.item()).
One difference between the np.void object and the tuple, is that it can still be indexed with the field name, arr[()]['f0'], whereas the tuple has to be indexed by number arr.item()[0]. The void still has a dtype, while the tuple doesn't.
fromrecords makes a recarray. This is similar to a structured array, but allows us to access fields as attributes. It may actually be an older class that has been merged into numpy, hence the np.rec prefix. Mostly we use structured arrays, though np.rec still has some convenience functions (actually in numpy.lib.recfunctions):
In [133]: res = np.rec.fromrecords((1,2,3), dt)
In [134]: res
Out[134]:
rec.array((1, 2., '3'),
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [135]: res.f0
Out[135]: array(1, dtype=int32)
In [136]: res.item()
Out[136]: (1, 2.0, '3')
In [137]: type(_)
Out[137]: tuple
In [138]: res[()]
Out[138]: (1, 2.0, '3')
In [139]: type(_)
Out[139]: numpy.record
So this produced a np.record instead of a np.void. But that's just a subclass:
In [143]: numpy.record.__mro__
Out[143]: (numpy.record, numpy.void, numpy.flexible, numpy.generic, object)
Accessing a structured array by field name gives an array of the corresponding dtype (and same shape)
In [145]: arr['f1']
Out[145]: array(2.0, dtype=float32)
In [146]: arr[()]['f1']
Out[146]: 2.0
In [147]: type(_)
Out[147]: numpy.float32
Out[146] could also be created with np.float32(2.0).
Checking my comment on the ar[0] for the 1d array:
In [158]: arr1d = np.array([(1,2,3)], dt)
In [159]: arr1d
Out[159]:
array([(1, 2., '3')],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [160]: arr1d[0]
Out[160]: (1, 2., '3')
In [161]: type(_)
Out[161]: numpy.void
So arr[()] and arr1d[0] do the same thing for their respective sized arrays. Likewise arr2d[0,0], which can also be written as arr2d[(0,0)].
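Use np.asscalar (available in older NumPy only; it was deprecated in 1.16 and later removed).
In both of your cases it will be just np.asscalar(ar['a']).
Also, you might find the ndarray.item method useful, e.g. ar['a'].item().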
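To summarize the options discussed in this thread, here is a minimal sketch that builds a 0d structured array and extracts a structured scalar from it. It uses only indexing with an empty tuple and the item method, so it should not depend on the NumPy version; the variable names are illustrative.

```python
import numpy as np

dt = np.dtype([('a', np.float64), ('b', np.int64)])

# A 0-d array holding a single record
arr0d = np.array((0.5, 1), dtype=dt)

# Indexing with an empty tuple extracts the structured scalar (np.void)
rec = arr0d[()]

# Field access on the scalar gives a NumPy scalar, not a 0-d array
a_value = rec['a']

# item() converts the record to a plain Python tuple
as_tuple = arr0d.item()
```

The np.void scalar keeps the dtype and field names, while the tuple from item() can only be indexed by position.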

Filter numpy array of tuples

The scikit-learn library has a brilliant example of data clustering: stock market structure. It works fine with US stocks. But when one adds tickers from other markets, numpy raises an error that the arrays should have the same size. This is true; for example, German stocks have a different trading calendar.
OK, after downloading the quotes I compute the shared dates:
quotes = [quotes_historical_yahoo_ochl(symbol, d1, d2, asobject=True)
          for symbol in symbols]
def intersect(list_1, list_2):
    return list(set(list_1) & set(list_2))
dates_all = quotes[0].date
for q in quotes:
    dates_symbol = q.date
    dates_all = intersect(dates_all, dates_symbol)
Then I'm stuck on filtering a numpy array of tuples. Here are some attempts:
# for index, q in enumerate(quotes):
# filtered = [i for i in q if i.date in dates_all]
# quotes[index] = np.rec.array(filtered, dtype=q.dtype)
# quotes[index] = np.asanyarray(filtered, dtype=q.dtype)
#
# quotes[index] = np.where(a.date in dates_all for a in q)
#
# quotes[index] = np.where(q[0].date in dates_all)
How do I apply a filter to a numpy array, or how do I properly convert a list of records (after filtering) back to a numpy recarray?
quotes[0].dtype:
'(numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'), ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])'
quotes[0].shape:
<class 'tuple'>: (261,)
So quotes is a list of recarrays, and in dates_all you collect the intersection of all values in the date field.
I can recreate one such array with:
In [286]: dt=np.dtype([('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'),
...: ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'),
...: ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])
In [287]: arr=np.ones((2,), dtype=dt) # 2 element structured array
In [288]: arr
Out[288]:
array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ... ('aclose', '<f8')])
In [289]: type(arr[0])
Out[289]: numpy.void
Turn that into a recarray (I don't use those as much as plain structured arrays):
In [291]: np.rec.array(arr)
Out[291]:
rec.array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), .... ('aclose', '<f8')])
The dtype of the recarray displays slightly differently:
In [292]: _.dtype
Out[292]: dtype((numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ....('aclose', '<f8')]))
In [293]: __.date
Out[293]: array([1, 1], dtype=object)
In any case the date field is an array of objects, possibly of datetime?
q is one of these arrays; i is an element, and i.date is the date field.
[i for i in q if i.date in dates_all]
So filtered is a list of recarray elements. np.stack does a better job of reassembling them into an array (that works with the recarray too).
np.stack([i for i in arr if i['date'] in alist])
Or you could collect the indices of the matching records, and index the quote array
In [319]: [i for i,v in enumerate(arr) if v['date'] in alist]
Out[319]: [0, 1]
In [320]: arr[_]
or pull out the date field first:
In [321]: [i for i,v in enumerate(arr['date']) if v in alist]
Out[321]: [0, 1]
in1d might also work to search (np.isin is the newer equivalent):
In [322]: np.in1d(arr['date'],alist)
Out[322]: array([ True, True], dtype=bool)
In [323]: np.where(np.in1d(arr['date'],alist))
Out[323]: (array([0, 1], dtype=int32),)
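Putting the pieces together, filtering every recarray down to the shared dates can be sketched with a boolean mask, which keeps the recarray type without rebuilding it from a list. The data below is made up (two fields only, and a hypothetical make_quotes helper) to stand in for the downloaded quotes.

```python
import datetime
import numpy as np

rec_dtype = [('date', 'O'), ('close', 'f8')]

def make_quotes(days, prices):
    """Build a small recarray of made-up quotes for January 2020."""
    arr = np.zeros(len(days), dtype=rec_dtype)
    arr['date'] = [datetime.date(2020, 1, d) for d in days]
    arr['close'] = prices
    return arr.view(np.recarray)

quotes = [make_quotes([1, 2, 3, 6], [10., 11., 12., 13.]),
          make_quotes([2, 3, 6, 7], [20., 21., 22., 23.])]

# Intersection of the dates present in every recarray
shared = set(quotes[0].date)
for q in quotes[1:]:
    shared &= set(q.date)

# One boolean mask per recarray; indexing with it returns a recarray
filtered = []
for q in quotes:
    mask = np.array([d in shared for d in q.date], dtype=bool)
    filtered.append(q[mask])
```

After this, all arrays in filtered have the same length and can be stacked field by field.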

PYTHON/NUMPY: Handling structured arrays compared to common numpy-arrays Python2.7

The main reason why I am asking this question is because I do not exactly know how the structured arrays work compared to normal arrays and because I could not find suitable examples online for my case. Further, I am probably filling my structured array wrongly in the first place.
So, here I want to present the 'normal' numpy-array version (and what I need to do with it) and the new 'structured' array version. My (largest) datasets contain around 200e6 objects/rows with up to 40-50 properties/columns. They all have the same data type except for a few special columns: 'haloid', 'hostid', 'type'. They are ID numbers or flags and I have to keep them with the rest of the data because I have to identify my objects with them.
data set name:
data_array: ndarray shape: (42648, 10)
data type:
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
Reading data from .hdf5-file format to array
Most of the data is stored in hdf5-files (2000 of them corresponds to one snapshot I have to process at once) which should be read-in to a single array
import numpy as np
import h5py as hdf5
mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows = 200000 # approximated
nr_files = 100 # up to 2200
nr_entries = 10 # up to 50
size = 0
size_before = 0
new_size = 0
# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)
i=0
while i<nr_files:
    size_before=new_size
    f = hdf5.File(path, "r")
    size=f[mydict['name0']].size
    new_size+=size
    a=0
    while a<nr_entries:
        name=mydict['name'+str(a)]
        # normal array:
        data_array[size_before:new_size, a] = f[name]
        # structured array:
        data_array[name][size_before:new_size] = f[name]
        a+=1
    i+=1
EDIT: I edited the code above because hpaulj helpfully pointed out the following:
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
This was a copy/paste mistake made while simplifying the code, and I have corrected it!
Question 1: Is that the right way to fill a structured array?
data_array[name][size_before:new_size] = f[name]
Question 2: How to address a column in a structured array?
data_array[name] #--> column with a certain name
Question 3: How to address an entire row in a structured array?
data_array[0] #--> first row
Question 4: How to address 3 rows and all columns?
# normal array:
print data_array[0:3,:]
[[ 1.21080866e+10 1.21080866e+10 0.00000000e+00 5.69363234e+08
1.28992369e+03 1.28894614e+03 1.32171442e+03 -1.08210000e+02
4.92900000e+02 6.50400000e+01]
[ 1.21080711e+10 1.21080711e+10 0.00000000e+00 4.76329837e+06
1.29058079e+03 1.28741361e+03 1.32358059e+03 -4.23130000e+02
5.08720000e+02 -6.74800000e+01]
[ 1.21080700e+10 1.21080700e+10 0.00000000e+00 2.22978043e+10
1.28750287e+03 1.28864306e+03 1.32270418e+03 -6.13760000e+02
2.19530000e+02 -2.28980000e+02]]
# structured array:
print data_array[0:3]
#it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
(12108071103L, 12108071103L, 0, 0.0, ... more data ...
... 228.02) ... more data ...
(8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]
Question 5: Why does data_array[0:3] not only return the first 3 rows with the 10 columns?
Question 6: How to address the first two elements in the first column?
# normal array:
print data_array[0:2,0]
[ 1.21080866e+10 1.21080711e+10]
# structured array:
print data_array['haloid'][0][0:2]
[12108086595 12108071103]
OK! I got that!
Question 7: How to address three specific columns by name and the first 3 rows of those columns?
# normal array:
print data_array[0:3, [0,2,1]]
[[ 1.21080866e+10 0.00000000e+00 1.21080866e+10]
[ 1.21080711e+10 0.00000000e+00 1.21080711e+10]
[ 1.21080700e+10 0.00000000e+00 1.21080700e+10]]
# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L)
(12108069992L, 0, 12108069992L)]
OK, the last example seems to work!!!
Question 8: What is the difference between:
(a) data_array['haloid'][0][0:3] and (b) data_array['haloid'][0:3]
where (a) returns really the first three haloids and (b) returns a lot of haloids (10x3).
[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340
9248632230 12108066342 10878169355 10077026070]
[ 6093565531 10077025463 8046772253 7871669276 5558161476 5558161473
12108068704 12108068708 12108077435 12108066338]
[ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312
12108075900 9248643751 6630111058 12108074389]]
Question 9: What is data_array['haloid'][0:3] actually returning?
Question 10: How to mask a structured array with np.where()
# NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid
# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]
#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]
This seems to work, but I am not sure!
Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?
Question 12: How to sort a structured array?
# normal array:
data_sorted = data[np.argsort(data[:,col2])]
# structured array:
data_sorted = data[np.argsort(data['mstar'][:,col3])]
Thanks, I appreciate any help or advice!
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
You can iterate over the fields of an array defined with dt using
for name in data.dtype.names:
    data[name] = ...
e.g.
In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
...: ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
...: ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]:
array([(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
...:     print(name)
...:     arr[name] = 1
...:
haloid
hostid
....
In [24]: arr
Out[24]:
array([(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0] # get one record
Out[25]: (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)
In [26]: arr[0]['hostid'] # get one field, one record
In [27]: arr['hostid'] # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2] # subset of records
Out[28]: array([1, 1], dtype=uint64)
So filling a structured array by field name should work fine:
arr[name][n1:n2] = file[dataset_name]
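As a rough sketch of that fill pattern without HDF5, the loop below uses plain dicts of arrays as stand-ins for the opened files (an assumption; with h5py the f[name] lookups would read datasets instead). The array is allocated generously, filled block by block, and sliced down to the filled part at the end.

```python
import numpy as np

dt = np.dtype([('haloid', '<u8'), ('mstar', '<f8')])

# Stand-ins for the HDF5 files: one dict of equally long arrays per file
files = [
    {'haloid': np.array([1, 2], dtype='u8'), 'mstar': np.array([0.5, 1.5])},
    {'haloid': np.array([3, 4, 5], dtype='u8'), 'mstar': np.array([2.5, 3.5, 4.5])},
]

data = np.zeros(10, dtype=dt)    # 1-D and deliberately larger than needed
end = 0
for f in files:
    start = end
    end = start + len(f['haloid'])
    for name in dt.names:        # fill one field (column) at a time
        data[name][start:end] = f[name]

data = data[:end]                # keep only the filled records
```

The important point is that the structured array stays 1-D: the fields play the role of columns, so no second dimension is needed.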
Prints like this:
structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]
and
[[ (12108086595L, 12108086595L, 0,
look to me like the structured data_array is actually 2d, created with something like (see Question 8)
data_array = np.zeros((10, nr_rows), dtype=dt)
That's the only way that the [0][0:3] indexing would work.
For the 2d array:
mask = np.where(data[:,col2] > data[:,col1])
compares 2 columns. When in doubt, look first at the boolean data[:,col2] > data[:,col1]. where just returns the indices where that boolean array is True.
Simple example of masked indexing:
In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False, True, True, True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]:
array([[3, 2],
[4, 1],
[5, 0]])
In [38]: x[idx,:]
Out[38]:
array([[[3, 2],
[4, 1],
[5, 0]]])
In this structured example, data['x_pos'] selects the field. The [0] is required to select the 1st row of that 2d array (the size 10 dimension). The rest of the comparison and where should work as with a 2d array.
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
mask[:][0] on a where tuple is probably not needed. mask is a tuple, [:] makes a copy and [0] selects the 1st element, which is an array. Sometimes an arr[idx[0],:] might be needed instead of arr[idx,:], but don't do that routinely.
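If the structured array is 1-D (one record per row), the masking and sorting from Questions 10 and 12 need neither the [0] nor np.where. A small sketch with made-up values and a shortened dtype:

```python
import numpy as np

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
      ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8')]
data = np.zeros(4, dtype=dt)
data['haloid'] = [11, 12, 13, 14]
data['hostid'] = [11, 11, 13, 13]
data['type'] = [0, 2, 0, 2]
data['mstar'] = [3.0, 1.0, 4.0, 2.0]
data['x_pos'] = [5.0, 1.0, 7.0, 2.0]
data['y_pos'] = [2.0, 3.0, 6.0, 9.0]

# Boolean mask from two fields; it selects whole records
mask = data['x_pos'] > data['y_pos']
selected = data[mask]

# Masked assignment from one field to another
sat = data['type'] == 2
data['haloid'][sat] = data['hostid'][sat]

# Sort records by one field, or by several with np.sort(order=...)
data_sorted = data[np.argsort(data['mstar'])]
by_fields = np.sort(data, order=['type', 'mstar'])
```

Note that the assignment indexes the field first and the mask second; data[sat]['haloid'] = ... would write into a temporary copy instead.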
My first comment suggested separate arrays
dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
data_id = np.zeros((n,), dtype=dt1)
data = np.zeros((n,m), dtype=float) # m float columns
Or even
haloid = np.zeros((n,), '<u8')
hostid = np.zeros((n,), '<u8')
type = np.zeros((n,), 'i1')
With these arrays, data_array['hostid'][0], data_id['hostid'] and hostid should all return the same 1d array, and be equally usable in mask expressions.
Sometimes it is convenient to keep the ids and data in one structure. That's especially true if writing/reading to csv format files. But for masked selection it doesn't help that much. And for data calculations across data fields it can be a pain.
I could also suggest a compound dtype, one with
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]
In [41]: np.zeros((4,), dtype=dt2)
Out[41]:
array([(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.]),
(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.])],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]:
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Is it better to access the float data by column number or by a name like 'x_coor'? Do you need to do calculations with several float columns at once, or will you always access them individually?
From your description, I think the straightforward way is to read only the useful data into arrays with different names (maybe one per type?).
If you want all the data read into one array, maybe Pandas is your choice:
http://pandas.pydata.org
http://pandas.pydata.org/pandas-docs/stable/
But I haven't tried that yet. Have fun giving it a try.
ANSWER TO QUESTION 11:
Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?
If I resize my array like this, I create 10 more columns for each field in dt. That is how I got the 'weird' result of Question 8b: a (10x3) structure of haloids.
Because I want to keep only the filled part of it (I designed the array to be big enough to hold a varying number of data blocks that I read in one after another), the right way to trim my array is:
data_array = data_array[:new_size]
print np.info(data_array)
class: ndarray
shape: (42648,)
strides: (73,)
type: [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
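The difference between slicing and np.resize can be checked on a tiny example. This sketch uses a made-up two-field dtype and a hypothetical filled counter standing in for new_size.

```python
import numpy as np

dt = [('haloid', '<u8'), ('mstar', '<f8')]
buf = np.zeros(5, dtype=dt)
buf['haloid'][:3] = [1, 2, 3]
filled = 3                       # number of records actually written

# Slicing keeps the array 1-D: one record per row
trimmed = buf[:filled]

# np.resize with a 2-D shape makes every element a full record,
# so each field becomes a (filled, 2) array
resized = np.resize(buf, (filled, 2))
```

Here trimmed['haloid'] is 1-D with three values, while resized['haloid'] is a 3x2 block, which is the same effect as the 10-column result described above.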
