Indexing array using column names - python

I'm loading fairly large input files into a NumPy array (30 columns, over 10k rows). The data contains only floating point numbers. To simplify data processing I'd like to name the columns and access them using human-readable names. AFAIK that is only possible with structured/record arrays. However, if I'm right, when I use structured arrays I lose some functionality. For instance:
x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
y = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype=[('x', float), ('y', float)])
Both arrays contain the same data and the same underlying element type (float64). y can be accessed using column names:
In [155]: y['x']
Out[155]: array([ 1., 3., 11.])
Unfortunately, I lose (or do I get the wrong impression?) some essential properties when I use structured arrays. x and y have different shapes, y cannot be transposed, etc.
In [160]: x
Out[160]:
array([[ 1., 2.],
[ 3., 4.],
[11., 22.]])
In [161]: y
Out[161]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
In [162]: x.shape
Out[162]: (3, 2)
In [163]: y.shape
Out[163]: (3,)
In [164]: x.T
Out[164]:
array([[ 1., 3., 11.],
[ 2., 4., 22.]])
In [165]: y.T
Out[165]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
Is it possible to continue using "regular 2D Numpy arrays" and access columns using their names?
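One way this could be approached, as a minimal sketch rather than a built-in NumPy feature: keep the regular 2-D array and add a small name-to-index lookup next to it. The names list and the col dict below are hypothetical helpers, not part of NumPy. The second half shows a structured view onto the same memory, which assumes the array is C-contiguous and all columns share one dtype.

```python
import numpy as np

# Hypothetical column names, chosen for illustration only.
names = ['x', 'y']
col = {name: i for i, name in enumerate(names)}

x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')

# Named access on a plain 2-D array through the index lookup.
x_col = x[:, col['x']]

# x is still a regular 2-D array, so shape and transpose behave as usual.
assert x.shape == (3, 2)
assert x.T.shape == (2, 3)

# Optional: a structured view of the same memory (no copy).
# Assumes x is C-contiguous and one row is exactly one record.
dt = np.dtype([(name, x.dtype) for name in names])
y = x.view(dt).reshape(-1)   # shape (3,)
y['y'][0] = 20.0             # writes through to x[0, 1]
```

With this, the 2-D array stays the primary object for numeric work, and the structured view is only used where field names are more readable.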

Related

Numpy array of different types

I want a numpy array of different mixed datatypes, basically a combination of float32 and uint32.
The thing is, I don't write the array manually (as all other forums that I've found). Here is a piece of code of what I'm trying to do:
a = np.full((1, 10), 1).astype(np.float32)
b = np.full((1, 10), 2).astype(np.float32)
c = np.full((1, 10), 3).astype(np.float32)
d = np.full((1, 10), 4).astype(np.uint32)
arr = np.dstack([a, b, c, d]) # arr.shape = 1, 10, 4
I want axis 2 of arr to be of mixed data types. Of course a, b, c, and d are read from files, but for simplicity I show them as constant values!
One important note: I need this functionality. The last element of the array has to be represented as a uint32 because I'm dealing with hardware components that expect this order of datatypes (think of it as an API that will throw an error if the data types do not match).
This is what I've tried:
arr.astype("float32, float32, float32, uint32")
but this duplicates each element in axis 2 four times with different data types (same value).
I also tried this (which is basically the same thing):
dt = np.dtype([('floats', np.float32, (3, )), ('ints', np.uint32, (1, ))])
arr = np.dstack((a, b, c, d)).astype(dt)
but I got the same duplication as well.
I know for sure if I construct the array as follows:
arr = np.array([((1, 2, 3), (4)), ((5, 6, 7), (8))], dtype=dt)
where dt is from the code block above, it works nice-ish. But I read those a, b, c, d arrays from files, and I don't know if constructing those tuples (or structures) is the best way to do it, because those arrays have a length of 850k in practice.
Your dtype:
In [83]: dt = np.dtype([('floats', np.float32, (3, )), ('ints', np.uint32, (1, ))])
and a sample uniform array:
In [84]: x= np.arange(1,9).reshape(2,4);x
Out[84]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
the wrong way of making a structured array:
In [85]: x.astype(dt)
Out[85]:
array([[([1., 1., 1.], [1]), ([2., 2., 2.], [2]), ([3., 3., 3.], [3]),
([4., 4., 4.], [4])],
[([5., 5., 5.], [5]), ([6., 6., 6.], [6]), ([7., 7., 7.], [7]),
([8., 8., 8.], [8])]],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
The right way:
In [86]: import numpy.lib.recfunctions as rf
In [87]: rf.unstructured_to_structured(x,dt)
Out[87]:
array([([1., 2., 3.], [4]), ([5., 6., 7.], [8])],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
and an alternate way:
In [88]: res = np.zeros(2,dt)
In [89]: res['floats'] = x[:,:3]
In [90]: res['ints'] = x[:,-1:]
In [91]: res
Out[91]:
array([([1., 2., 3.], [4]), ([5., 6., 7.], [8])],
dtype=[('floats', '<f4', (3,)), ('ints', '<u4', (1,))])
https://numpy.org/doc/stable/user/basics.rec.html
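Applied to the separate a, b, c, d arrays from the question, the field-assignment approach might look like the sketch below. It assumes the arrays are 1-D (the question has them with shape (1, 10)) and uses a small n in place of the real 850k length.

```python
import numpy as np

n = 10  # stands in for the ~850k elements mentioned in the question
a = np.full(n, 1, dtype=np.float32)
b = np.full(n, 2, dtype=np.float32)
c = np.full(n, 3, dtype=np.float32)
d = np.full(n, 4, dtype=np.uint32)

dt = np.dtype([('floats', np.float32, (3,)), ('ints', np.uint32, (1,))])

# Allocate once, then fill field by field; no per-row tuples are built.
arr = np.zeros(n, dtype=dt)
arr['floats'] = np.stack([a, b, c], axis=-1)   # shape (n, 3)
arr['ints'][:, 0] = d                          # field has shape (n, 1)
```

Each record is then three float32 values followed by one uint32, 16 bytes in total, which is the layout the question asks for.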

Purpose/status of the attribute numpy.dtype.base

I have found an attribute called base on numpy.dtype objects. Doing some experiments:
numpy.dtype('i4').base
# dtype('int32')
numpy.dtype('6i4').base
# dtype('int32')
numpy.dtype('10f8').base
# dtype('float64')
numpy.dtype('3i4, 2f4').base
# dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
So it seems to contain the dtype of a single element for simple sub-array data types and itself for structured data types.
Unfortunately, this attribute does not seem to be documented anywhere. There is a page in the documentation, but it’s empty and not linked anywhere. Curiously, it is also absent in the documentation for numpy version 1.15.0 specifically:
/doc/numpy/…/numpy.dtype.base.html (empty page)
/doc/numpy-1.15.0/…/numpy.dtype.base.html (error 404)
/doc/numpy-1.15.1/…/numpy.dtype.base.html (empty page)
Can I rely on the presence and behavior of this attribute in future versions of numpy?
This is now documented:
https://numpy.org/doc/stable/reference/generated/numpy.dtype.base.html#numpy.dtype.base
It is defined at https://github.com/numpy/numpy/blob/eeef9d4646103c3b1afd3085f1393f2b3f9575b2/numpy/core/src/multiarray/descriptor.c#L2255-L2300 and from git-blame was last touched ~13 years ago so it is probably safe to assume that dtype.base will exist and continue to exist.
I'm not sure whether it's safe to rely on base, but it's probably a bad idea either way. People reading your code can't look up what base means in the docs, and anyway, there's a better option.
Instead of base, you can use subdtype, which is documented:
Tuple (item_dtype, shape) if this dtype describes a
sub-array, and None otherwise.
The shape is the fixed shape of the sub-array described by this data
type, and item_dtype the data type of the array.
If a field whose dtype object has this attribute is retrieved, then
the extra dimensions implied by shape are tacked on to the end of
the retrieved array.
For a dtype that represents a subarray, dtype.base is equivalent to dtype.subdtype[0]. For a dtype that doesn't represent a subarray, dtype.base is dtype and dtype.subdtype is None. Here's a demo:
>>> subarray = numpy.dtype('5i4')
>>> not_subarray = numpy.dtype('i4')
>>> subarray.base
dtype('int32')
>>> subarray.subdtype
(dtype('int32'), (5,))
>>> not_subarray.base
dtype('int32')
>>> print(not_subarray.subdtype) # None doesn't get auto-printed
None
Incidentally, if you want to be sure about what dtype.base does, here's the source, which confirms what you guessed from your experiments:
static PyObject *
arraydescr_base_get(PyArray_Descr *self)
{
if (!PyDataType_HASSUBARRAY(self)) {
Py_INCREF(self);
return (PyObject *)self;
}
Py_INCREF(self->subarray->base);
return (PyObject *)(self->subarray->base);
}
I've never used the base attribute, or seen it used. But it does make sense that there should be a way of identifying such an object. I can't find a use of it in code such as np.lib.recfunctions, but it may well be used in compiled code.
With a dtype like '10f8' there are various attributes (some may be properties):
In [259]: dt = np.dtype('10f8')
In [260]: dt
Out[260]: dtype(('<f8', (10,)))
In [261]: dt.base
Out[261]: dtype('float64')
In [263]: dt.descr
Out[263]: [('', '|V80')]
In [264]: dt.itemsize
Out[264]: 80
In [265]: dt.shape
Out[265]: (10,)
Look what happens when we make an array with this dtype:
In [278]: x = np.ones((3,),'10f8')
In [279]: x
Out[279]:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [280]: x.shape
Out[280]: (3, 10)
In [281]: x.dtype
Out[281]: dtype('float64') # there's your base
There's the answer - dt.base is the dtype that will be used in creating an array with the dtype. It's the dtype without the extra dimensional information.
That sort of dtype is rarely used by itself; more likely it is part of a compound dtype:
In [252]: dt=np.dtype('3i4, 2f4')
In [253]: dt
Out[253]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [254]: dt.base
Out[254]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [255]: dt[0]
Out[255]: dtype(('<i4', (3,)))
In [256]: dt[0].base
Out[256]: dtype('int32')
This dt could be embedded in another dtype:
In [272]: dt1 = np.dtype((dt, (3,)))
In [273]: dt1
Out[273]: dtype(([('f0', '<i4', (3,)), ('f1', '<f4', (2,))], (3,)))
In [274]: dt1.base
Out[274]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [275]: arr = np.ones((3,), dt1)
In [276]: arr
Out[276]:
array([[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])],
[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])],
[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])]],
dtype=[('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [277]: arr.shape
Out[277]: (3, 3)
In the case of a structured array, the base of a field is the dtype that we get when viewing just that field.
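That last statement can be checked directly. The following is a small sketch using the same '3i4, 2f4' dtype as above; the array length of 4 is arbitrary.

```python
import numpy as np

dt = np.dtype('3i4, 2f4')
arr = np.ones((4,), dt)

f0 = arr['f0']
# The sub-array shape (3,) is appended to the array shape (4,).
assert f0.shape == (4, 3)
# The dtype of the field view is the base of the field's dtype.
assert f0.dtype == dt['f0'].base
# For a sub-array dtype, base matches the first item of subdtype.
assert dt['f0'].base == dt['f0'].subdtype[0]
# The structured dtype itself has no sub-array, so base is the dtype itself.
assert dt.base == dt and dt.subdtype is None
```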

Python Numpy DTYPE dynamic array

I have a code like this
np.fromfile( f,
dtype = np.dtype( [ ( 'f1', np.float16 ),
( 'f2', np.float16 )
]
),
count = -1
)
and I need to make the dtype depend on a variable (instead of size=2, make size=var).
I tried to google it and asked on IRC, but no luck. Thanks in advance!
If the individual dtypes are the same, and you aren't picky about names, this dtype format is easy to use and generalize:
In [358]: dt = np.dtype('f,f,f')
In [359]: dt
Out[359]: dtype([('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])
Or with the common string join:
In [360]: dt = np.dtype(','.join(['f']*3))
In [361]: np.ones((3,), dtype=dt)
Out[361]:
array([( 1., 1., 1.), ( 1., 1., 1.), ( 1., 1., 1.)],
dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])
Or with list zip
In [364]: dt = [('foo%s'%n, fmt) for n,fmt in zip(range(3),'ifd')]
In [365]: dt
Out[365]: [('foo0', 'i'), ('foo1', 'f'), ('foo2', 'd')]
In [366]: np.ones(2, dtype=dt)
Out[366]:
array([(1, 1., 1.), (1, 1., 1.)],
dtype=[('foo0', '<i4'), ('foo1', '<f4'), ('foo2', '<f8')])
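Putting this together for the float16 case in the question, here is a rough sketch. make_dtype is a hypothetical helper name, and np.frombuffer on raw bytes stands in for np.fromfile so the example runs without a data file.

```python
import numpy as np

def make_dtype(n_fields, fmt=np.float16):
    """Hypothetical helper: build a dtype with fields f1..fN of one format."""
    return np.dtype([('f%d' % (i + 1), fmt) for i in range(n_fields)])

var = 3
dt = make_dtype(var)

# Six float16 values as raw bytes, i.e. two records of three fields each.
raw = np.arange(6, dtype=np.float16).tobytes()

# With a real file this would be: np.fromfile(f, dtype=dt, count=-1)
records = np.frombuffer(raw, dtype=dt)
```

The same dt can be passed to np.fromfile unchanged; only the number of fields depends on var.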

Filter numpy array of tuples

The scikit-learn library has a brilliant example of data clustering: stock market structure. It works fine with US stocks. But when one adds tickers from other markets, numpy raises an error that the arrays should have the same size. That is true: German stocks, for example, have a different trading calendar.
OK, after downloading the quotes I add a step that computes the shared dates:
quotes = [quotes_historical_yahoo_ochl(symbol, d1, d2, asobject=True)
          for symbol in symbols]
def intersect(list_1, list_2):
    return list(set(list_1) & set(list_2))
dates_all = quotes[0].date
for q in quotes:
    dates_symbol = q.date
    dates_all = intersect(dates_all, dates_symbol)
Then I'm stuck with filtering a numpy array of tuples. Here are some attempts:
# for index, q in enumerate(quotes):
# filtered = [i for i in q if i.date in dates_all]
# quotes[index] = np.rec.array(filtered, dtype=q.dtype)
# quotes[index] = np.asanyarray(filtered, dtype=q.dtype)
#
# quotes[index] = np.where(a.date in dates_all for a in q)
#
# quotes[index] = np.where(q[0].date in dates_all)
How can I apply a filter to a numpy array, or how do I properly convert the list of records (after filtering) back to a numpy recarray?
quotes[0].dtype:
'(numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'), ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])'
quotes[0].shape:
<class 'tuple'>: (261,)
So quotes is a list of recarrays, and in dates_all you collect the intersection of all values in the date field.
I can recreate one such array with:
In [286]: dt=np.dtype([('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'),
     ...:     ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'),
     ...:     ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])
In [287]: arr=np.ones((2,), dtype=dt) # 2 element structured array
In [288]: arr
Out[288]:
array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ... ('aclose', '<f8')])
In [289]: type(arr[0])
Out[289]: numpy.void
Turn that into a recarray (I don't use those as much as plain structured arrays):
In [291]: np.rec.array(arr)
Out[291]:
rec.array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), .... ('aclose', '<f8')])
The dtype of the recarray displays slightly differently:
In [292]: _.dtype
Out[292]: dtype((numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ....('aclose', '<f8')]))
In [293]: __.date
Out[293]: array([1, 1], dtype=object)
In any case the date field is an array of objects, possibly of datetime?
q is one of these arrays; i is an element, and i.date is the date field.
[i for i in q if i.date in dates_all]
So filtered is a list of recarray elements. np.stack does a better job of reassembling them into an array (that works with the recarray too). Here alist plays the role of dates_all:
np.stack([i for i in arr if i['date'] in alist])
Or you could collect the indices of the matching records, and index the quote array
In [319]: [i for i,v in enumerate(arr) if v['date'] in alist]
Out[319]: [0, 1]
In [320]: arr[_]
or pull out the date field first:
In [321]: [i for i,v in enumerate(arr['date']) if v in alist]
Out[321]: [0, 1]
in1d might also work for the search (np.isin is the recommended spelling in newer NumPy versions):
In [322]: np.in1d(arr['date'],alist)
Out[322]: array([ True, True], dtype=bool)
In [323]: np.where(np.in1d(arr['date'],alist))
Out[323]: (array([0, 1], dtype=int32),)
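For the whole list of quote arrays, the steps above could be combined roughly as follows. This is a sketch with a cut-down, hypothetical two-field dtype and made-up dates; it builds the boolean mask with a plain set lookup, which works for object-dtype date fields.

```python
import datetime
import numpy as np

# Cut-down stand-in for the real quotes dtype (assumption for illustration).
rec_dt = np.dtype([('date', 'O'), ('close', '<f8')])

def day(n):
    return datetime.date(2016, 1, n)

quotes = [
    np.array([(day(4), 1.0), (day(5), 2.0), (day(6), 3.0)], dtype=rec_dt),
    np.array([(day(5), 10.0), (day(6), 20.0), (day(7), 30.0)], dtype=rec_dt),
]

# Dates that are present in every array.
common = set(quotes[0]['date'])
for q in quotes[1:]:
    common &= set(q['date'])

# Index each array with a boolean mask; the result keeps the original dtype.
filtered = []
for q in quotes:
    mask = np.array([d in common for d in q['date']], dtype=bool)
    filtered.append(q[mask])
```

Because indexing with a mask returns an array of the same type, there is no need to rebuild a recarray from a list of records afterwards.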

PYTHON/NUMPY: Handling structured arrays compared to common numpy-arrays Python2.7

The main reason why I am asking this question is because I do not exactly know how the structured arrays work compared to normal arrays and because I could not find suitable examples online for my case. Further, I am probably filling my structured array wrongly in the first place.
So, here I want to present the 'normal' numpy-array version (and what I need to do with it) and the new 'structured' array version. My (largest) datasets contain around 200e6 objects/rows with up to 40-50 properties/columns. They all have the same data type except for a few special columns: 'haloid', 'hostid', 'type'. They are ID numbers or flags and I have to keep them with the rest of the data because I have to identify my objects with them.
data set name:
data_array: ndarray shape: (42648, 10)
data type:
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
Reading data from .hdf5-file format to array
Most of the data is stored in hdf5-files (2000 of them correspond to one snapshot that I have to process at once), which should be read into a single array
import numpy as np
import h5py as hdf5
mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows = 200000 # approximated
nr_files = 100 # up to 2200
nr_entries = 10 # up to 50
size = 0
size_before = 0
new_size = 0
# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)
i=0
while i<nr_files:
    size_before=new_size
    f = hdf5.File(path, "r")
    size=f[mydict['name0']].size
    new_size+=size
    a=0
    while a<nr_entries:
        name=mydict['name'+str(a)]
        # normal array:
        data_array[size_before:new_size, a] = f[name]
        # structured array:
        data_array[name][size_before:new_size] = f[name]
        a+=1
    i+=1
EDIT: I edited the code above because hpaulj helpfully commented the following:
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
This was an 'I-simplify-code' copy/paste mistake and I corrected it!
Question 1: Is that the right way to fill a structured array?
data_array[name][size_before:new_size] = f[name]
Question 2: How to address a column in a structured array?
data_array[name] #--> column with a certain name
Question 3: How to address an entire row in a structured array?
data_array[0] #--> first row
Question 4: How to address 3 rows and all columns?
# normal array:
print data_array[0:3,:]
[[ 1.21080866e+10 1.21080866e+10 0.00000000e+00 5.69363234e+08
1.28992369e+03 1.28894614e+03 1.32171442e+03 -1.08210000e+02
4.92900000e+02 6.50400000e+01]
[ 1.21080711e+10 1.21080711e+10 0.00000000e+00 4.76329837e+06
1.29058079e+03 1.28741361e+03 1.32358059e+03 -4.23130000e+02
5.08720000e+02 -6.74800000e+01]
[ 1.21080700e+10 1.21080700e+10 0.00000000e+00 2.22978043e+10
1.28750287e+03 1.28864306e+03 1.32270418e+03 -6.13760000e+02
2.19530000e+02 -2.28980000e+02]]
# structured array:
print data_array[0:3]
#it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
(12108071103L, 12108071103L, 0, 0.0, ... more data ...
... 228.02) ... more data ...
(8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]
Question 5: Why does data_array[0:3] not only return the first 3 rows with the 10 columns?
Question 6: How to address the first two elements in the first column?
# normal array:
print data_array[0:2,0]
[ 1.21080866e+10 1.21080711e+10]
# structured array:
print data_array['haloid'][0][0:2]
[12108086595 12108071103]
OK! I got that!
Question 7: How to address three specific columns by name and the first 3 rows of those columns?
# normal array:
print data_array[0:3, [0,2,1]]
[[ 1.21080866e+10 0.00000000e+00 1.21080866e+10]
[ 1.21080711e+10 0.00000000e+00 1.21080711e+10]
[ 1.21080700e+10 0.00000000e+00 1.21080700e+10]]
# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L)
(12108069992L, 0, 12108069992L)]
OK, the last example seems to work!!!
Question 8: What is the difference between:
(a) data_array['haloid'][0][0:3] and (b) data_array['haloid'][0:3]
where (a) returns really the first three haloids and (b) returns a lot of haloids (10x3).
[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340
9248632230 12108066342 10878169355 10077026070]
[ 6093565531 10077025463 8046772253 7871669276 5558161476 5558161473
12108068704 12108068708 12108077435 12108066338]
[ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312
12108075900 9248643751 6630111058 12108074389]]
Question 9: What is data_array['haloid'][0:3] actually returning?
Question 10: How to mask a structured array with np.where()
# NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid
# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]
#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]
This seems to work, but I am not sure!
Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?
Question 12: How to sort a structured array?
# normal array:
data_sorted = data[np.argsort(data[:,col2])]
# structured array:
data_sorted = data[np.argsort(data['mstar'][:,col3])]
Thanks, I appreciate any help or advice!
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
You can iterate over the fields of an array defined with dt using
for name in data.dtype.names:
    data[name] = ...
e.g.
In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
...: ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
...: ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]:
array([(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
    ...:     print(name)
    ...:     arr[name] = 1
    ...:
haloid
hostid
....
In [24]: arr
Out[24]:
array([(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0] # get one record
Out[25]: (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)
In [26]: arr[0]['hostid'] # get one field, one record
Out[26]: 1
In [27]: arr['hostid'] # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2] # subset of records
Out[28]: array([1, 1], dtype=uint64)
So filling a structured array by field name should work fine:
arr[name][n1:n2] = file[dataset_name]
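As a rough sketch of that fill loop, with plain dicts of arrays standing in for the HDF5 files (an assumption so the example runs without h5py) and a shortened dtype:

```python
import numpy as np

dt = np.dtype([('haloid', '<u8'), ('type', 'i1'), ('mstar', '<f8')])

# Stand-ins for the HDF5 files: each one maps a dataset name to a 1-D array.
files = [
    {'haloid': np.array([1, 2], dtype='u8'),
     'type': np.array([0, 1], dtype='i1'),
     'mstar': np.array([1.5, 2.5])},
    {'haloid': np.array([3], dtype='u8'),
     'type': np.array([2], dtype='i1'),
     'mstar': np.array([3.5])},
]

data = np.zeros(10, dtype=dt)   # 1-D and deliberately over-allocated
start = 0
for f in files:
    stop = start + f['haloid'].size
    for name in data.dtype.names:
        data[name][start:stop] = f[name]   # fill one field for this block
    start = stop

data = data[:start]             # keep only the filled records
```

Note that the structured array is 1-D here, one record per row, so no extra [0] index is needed when selecting a field.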
Prints like this:
structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]
and
[[ (12108086595L, 12108086595L, 0,
look to me like the structured data_array is actually 2d, created with something like (see Question 8)
data_array = np.zeros((10, nr_rows), dtype=dt)
That's the only way that the [0][0:3] indexing would work.
For the 2d array:
mask = np.where(data[:,col2] > data[:,col1])
compares 2 columns. When in doubt, look first at the boolean data[:,col2] > data[:,col1]. where just returns the indices where that boolean array is True.
Simple example of masked indexing:
In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False, True, True, True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]:
array([[3, 2],
[4, 1],
[5, 0]])
In [38]: x[idx,:]
Out[38]:
array([[[3, 2],
[4, 1],
[5, 0]]])
In this structured example, data['x_pos'] selects the field. The [0] is required to select the 1st row of that 2d array (the size 10 dimension). The rest of the comparison and where should work as with a 2d array.
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
mask[:][0] on a where tuple is probably not needed. mask is a tuple, [:] makes a copy and [0] selects the 1st element, which is an array. Sometimes an arr[idx[0],:] might be needed instead of arr[idx,:], but don't do that routinely.
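If the structured array is 1-D instead (one record per row), the masking and sorting from Questions 10 and 12 get simpler. A minimal sketch with made-up values and a shortened dtype:

```python
import numpy as np

dt = np.dtype([('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
               ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8')])
data = np.array([(10, 1, 0, 3.0, 5.0, 1.0),
                 (11, 2, 2, 1.0, 0.5, 2.0),
                 (12, 3, 2, 2.0, 4.0, 3.0)], dtype=dt)

# Select records with a boolean mask; np.where is not needed.
sel = data[data['x_pos'] > data['y_pos']]

# Conditional assignment between two fields.
is_type2 = data['type'] == 2
data['haloid'][is_type2] = data['hostid'][is_type2]

# Sort all records by one field.
data_sorted = data[np.argsort(data['mstar'])]
```

np.sort(data, order='mstar') is another way to get the sorted records.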
My first comment suggested separate arrays
dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
data_id = np.zeros((n,), dtype=dt1)
data = np.zeros((n,m), dtype=float) # m float columns
Or even
haloid = np.zeros((n,), '<u8')
hostid = np.zeros((n,), '<u8')
type = np.zeros((n,), 'i1')
With these arrays, data_array['hostid'][0], data_id['hostid'] and hostid should all return the same 1d array, and be equally usable in mask expressions.
Sometimes it is convenient to keep the ids and data in one structure. That's especially true if writing/reading to csv format files. But for masked selection it doesn't help that much. And for data calculations across data fields it can be a pain.
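When a calculation does have to span several float fields of a structured array, one option is to convert just those fields to a plain 2-D array first. The sketch below uses numpy.lib.recfunctions.structured_to_unstructured, which needs a reasonably recent NumPy (1.16 or later, as far as I know); the values are made up.

```python
import numpy as np
import numpy.lib.recfunctions as rf

dt = np.dtype([('haloid', '<u8'),
               ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8')])
data = np.array([(1, 3.0, 4.0, 0.0), (2, 1.0, 2.0, 2.0)], dtype=dt)

# Pull the three position fields into an ordinary (n, 3) float array.
pos = rf.structured_to_unstructured(data[['x_pos', 'y_pos', 'z_pos']])

# Distance from the origin, computed across the former fields.
r = np.sqrt((pos ** 2).sum(axis=1))
```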
I could also suggest a compound dtype, one with
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]
In [41]: np.zeros((4,), dtype=dt2)
Out[41]:
array([(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.]),
(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.])],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]:
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Is it better to access the float data by column number or by a name like 'x_coor'? Do you need to do calculations with several float columns at once, or will you always access them individually?
From your description, I think the simplest way is to read only the useful data into separate arrays with different names (one per type, maybe?).
If you want all the data read into one array, Pandas may be a good choice:
http://pandas.pydata.org
http://pandas.pydata.org/pandas-docs/stable/
I haven't tried that myself yet, though. Have fun giving it a try.
ANSWER TO QUESTION 11:
Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?
If I resize my array like this, I end up with 10 columns for each field in dt. That explains the 'weird' result of Question 8b: a (10x3) block of haloids.
The right way to trim my array, because I want to keep only the filled part of it (I made the array big enough to hold a varying number of data blocks that I read in one after another), is:
data_array = data_array[:new_size]
print np.info(data_array)
class: ndarray
shape: (42648,)
strides: (73,)
type: [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
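The difference between the two can be seen on a small example. This is a sketch with a two-field dtype and made-up sizes; the second dimension of 2 plays the role of nr_entries.

```python
import numpy as np

dt = np.dtype([('haloid', '<u8'), ('mstar', '<f8')])
data = np.zeros(6, dtype=dt)          # over-allocated, 1-D
data['haloid'][:4] = [1, 2, 3, 4]
new_size = 4

# np.resize with a 2-D shape returns a 2-D array of records,
# repeating the input to fill the requested number of elements.
resized = np.resize(data, (new_size, 2))

# Slicing keeps the array 1-D and just drops the unused tail.
trimmed = data[:new_size]
```

Every field of resized is then 2-D as well, which matches the extra dimension seen in Question 8.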
