The main reason why I am asking this question is because I do not exactly know how the structured arrays work compared to normal arrays and because I could not find suitable examples online for my case. Further, I am probably filling my structured array wrongly in the first place.
So, here I want to present the 'normal' numpy-array version (and what I need to do with it) and the new 'structured' array version. My (largest) datasets contain around 200e6 objects/rows with up to 40-50 properties/columns. They all have the same data type except for a few special columns: 'haloid', 'hostid', 'type'. They are ID numbers or flags and I have to keep them with the rest of the data because I have to identify my objects with them.
data set name:
data_array: ndarray shape: (42648, 10)
data type:
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
Reading data from .hdf5-file format to array
Most of the data is stored in hdf5-files (2000 of them corresponds to one snapshot I have to process at once) which should be read-in to a single array
import numpy as np
import h5py as hdf5
mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows = 200000 # approximated
nr_files = 100 # up to 2200
nr_entries = 10 # up to 50
size = 0
size_before = 0
new_size = 0
# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)
i=0
while i<nr_files:
size_before=new_size
f = hdf5.File(path, "r")
size=f[mydict['name0']].size
new_size+=size
a=0
while a<nr_entries:
name=mydict['name'+str(a)]
# normal array:
data_array[size_before:new_size, a] = f[name]
# structured array:
data_array[name][size_before:new_size] = f[name]
a+=1
i+=1
EDIT: I edit the code above because hpaulj was fortunately commenting the following:
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But
the h5 load is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)] In other words, the file has datasets with names like name0, name1,
and you are downloading those to an array with fields with the same
names.
This was a 'I-simplify-code' copy/paste mistake and I corrected it!
Question 1: Is that the right way to fill a structured array?
data_array[name][size_before:new_size] = f[name]
Question 2: How to address a column in a structured array?
data_array[name] #--> column with a certain name
Question 3: How to address an entire row in a structured array?
data_array[0] #--> first row
Question 4: How to address 3 rows and all columns?
# normal array:
print data_array[0:3,:]
[[ 1.21080866e+10 1.21080866e+10 0.00000000e+00 5.69363234e+08
1.28992369e+03 1.28894614e+03 1.32171442e+03 -1.08210000e+02
4.92900000e+02 6.50400000e+01]
[ 1.21080711e+10 1.21080711e+10 0.00000000e+00 4.76329837e+06
1.29058079e+03 1.28741361e+03 1.32358059e+03 -4.23130000e+02
5.08720000e+02 -6.74800000e+01]
[ 1.21080700e+10 1.21080700e+10 0.00000000e+00 2.22978043e+10
1.28750287e+03 1.28864306e+03 1.32270418e+03 -6.13760000e+02
2.19530000e+02 -2.28980000e+02]]
# structured array:
print data_array[0:3]
#it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
(12108071103L, 12108071103L, 0, 0.0, ... more data ...
... 228.02) ... more data ...
(8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]
Question 5: Why does data_array[0:3] not only return the first 3 rows with the 10 columns?
Question 6: How to address the first two elements in the first column?
# normal array:
print data_array[0:1,0]
[ 1.21080866e+10 1.21080711e+10]
# structured array:
print data_array['haloid']][0][0:1]
[12108086595 12108071103]
OK! I got that!
Question 7: How to address three specific columns by name and they first 3 rows in that columns?
# normal array:
print data_array[0:3, [0,2,1]]
[[ 1.21080866e+10 0.00000000e+00 1.21080866e+10]
[ 1.21080711e+10 0.00000000e+00 1.21080711e+10]
[ 1.21080700e+10 0.00000000e+00 1.21080700e+10]]
# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L)
(12108069992L, 0, 12108069992L)]
OK, the last example seems to work!!!
Question 8: What is the difference between:
(a) data_array['haloid'][0][0:3] and (b) data_array['haloid'][0:3]
where (a) returns really the first three haloids and (b) returns a lot of haloids (10x3).
[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340
9248632230 12108066342 10878169355 10077026070]
[ 6093565531 10077025463 8046772253 7871669276 5558161476 5558161473
12108068704 12108068708 12108077435 12108066338]
[ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312
12108075900 9248643751 6630111058 12108074389]]
Question 9: What is data_array['haloid'][0:3] actually returning?
Question 10: How to mask a structured array with np.where()
# NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid
# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]
#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0]])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]
This seems to work, but I am not sure!
Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?
Question 12: How to sort a structured array?
# normal array:
data_sorted = data[np.argsort(data[:,col2])]
# structured array:
data_sorted = data[np.argsort(data['mstar'][:,col3])]
Thanks, I appreciate any help or advise!
First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load
is
data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
You can iterate of the fields of an array defined with dt using
for name in dt.names:
data[name] = ...
e.g.
In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
...: ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
...: ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]:
array([(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
...: print(name)
...: arr[name] = 1
...:
haloid
hostid
....
In [24]: arr
Out[24]:
array([(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0] # get one record
Out[25]: (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)
In [26]: arr[0]['hostid'] # get one field, one record
In [27]: arr['hostid'] # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2] # subset of records
Out[28]: array([1, 1], dtype=uint64)
So filling a structured array by field name should work fine:
arr[name][n1:n2] = file[dataset_name]
Prints like this:
structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]
and
[[ (12108086595L, 12108086595L, 0,
look to me like the structured data_array is actually 2d, created with something like (see Question 8)
data_array = np.zeros((10, nr_rows), dtype=dt)
That's the only way that the [0][0:3] indexing would work,
For the 2d array:
mask = np.where(data[:,col2] > data[:,col1])
compares 2 columns. When in doubt look first as the boolean data[:,col2] > data[:,col1]. where just returns the indices where that boolean array is True.
Simple example of masked indexing:
In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False, True, True, True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]:
array([[3, 2],
[4, 1],
[5, 0]])
In [38]: x[idx,:]
Out[38]:
array([[[3, 2],
[4, 1],
[5, 0]]])
In this structured example, data['x_pos'] selects the field. the [0] is required to select the 1st row of that 2d array (the size 10 dimension). The rest of the comparison and where should work as with a 2d array.
mask = np.where(data['x_pos'][0] > data['y_pos'][0]])
mask[:][0] on a where tuple is probably not needed. mask is a tuple, [:] makes a copy and [0] selects the 1st element, which is an array. Sometimes a arr[idx[0],:] might be needed instead of arr[idx,:], but don't do that routinely.
My first comment suggested separate arrays
dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
data_id = np.zeros((n,), dtype=dt1)
data = np.zeros((n,m), dtype=float) # m float columns
Or even
haloid = np.zeros((n,), '<u8')
hostid = np.zeros((n,), '<u8')
type = np.zeros((n,), 'i1')
With these arrays, data_array['hostid'][0], data_id['hostid'] and hostid should all return the same 1d array, and be equally usable in mask expressions.
Sometimes it is convenient to keep the ids and data in one structure. That's especially true if writing/reading to csv format files. But for masked selection it doesn't help that much. And for data calculations across data fields it can be a pain.
I could also suggest a compound dtype, one with
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]
In [41]: np.zeros((4,), dtype=dt2)
Out[41]:
array([(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.]),
(0, 0, 0, [ 0., 0., 0.]), (0, 0, 0, [ 0., 0., 0.])],
dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]:
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Is it better to access the float data by column number or by a name like 'x_coor'? Do you need to do calculations with several float columns at once, or will you always access them individually?
Through your description I think the naive way is to read in only useful data into arrays with different names (one type each maybe?)
If you want all the data read into one array, maybe Pandas is your choice:
http://pandas.pydata.org
http://pandas.pydata.org/pandas-docs/stable/
But I haven't tried that yet. Have fun to give a try.
ANSWER TO QUESTION 11:
Question 11: Can I still use np.resize() like: data_array =
np.resize(data_array,(new_size, nr_entries)) to resize/reshape my
array?
If I am resizing my array like this, I am creating to each field in dt 10 more columns. So I get the 'weird' result of Question 8b: a structure (10x3) haloids
The right way to trim my array, because I want to keep only the filled part of if (I designed the array to be big enough to contain a various amount of data blocks I am reading subsequently ...) is to:
data_array = data_array[:newsize]
print np.info(data_array)
class: ndarray
shape: (42648,)
strides: (73,)
type: [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'),
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
Related
I have a complex nested structured array (often used as a recarray). Its simplified for this example, but in the real case there are multiple levels.
c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])
I am trying to index the "x" datatype of the nested arrays datatype without defining the preceding named fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top level data, but it doesn't work.
Are there ways to do fancy indexing with nested structured arrays with accessing their field names?
I don't know if displaying the resulting array helps you visualize the nesting or not.
In [279]: c = [('x','f8'),('y','f8')]
...: A = [('data_string','|S20'),('data_val', c, 2)]
...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]:
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Note how the nesting of () and [] reflects the nesting of the fields.
arr.dtype only has direct access to the top level field names:
In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]:
array([[(0., 0.), (0., 0.)],
[(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])
But having accessed one field, we can then look at its fields:
In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]:
array([[0., 0.],
[0., 0.]])
Record number indexing is separate, and can be multidimensional in the usual sense:
In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]:
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])
Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:
In [289]: arr['data_val']['x']
Out[289]:
array([[0., 0.],
[1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])
I mentioned that fields indexing is like dict indexing. Note this display of the fields:
In [294]: arr.dtype.fields
Out[294]:
mappingproxy({'data_string': (dtype('S20'), 0),
'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})
Each record is stored as a block of 52 bytes:
In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'
20 of those are data_string, and 32 are the 2 c fields
In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different
In [306]: arr[['data_val']]
Out[306]:
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})
In [311]: arr['data_val'][['y']]
Out[311]:
array([[(3.,), (4.,)],
[(0.,), (0.,)]],
dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})
Each 'data_val' starts 20 bytes into the 52 byte record. And each 'y' starts 8 bytes into its 16 byte record.
The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.
The only way I can see the index being simplified, is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
I have found an attribute called base on numpy.dtype objects. Doing some experiments:
numpy.dtype('i4').base
# dtype('int32')
numpy.dtype('6i4').base
# dtype('int32')
numpy.dtype('10f8').base
# dtype('float64')
numpy.dtype('3i4, 2f4')
# dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
So it seems to contain the dtype of a single element for simple sub-array data types and itself for structured data types.
Unfortunately, this attribute does not seem to be documented anywhere. There is a page in the documentation, but it’s empty and not linked anywhere. Curiously, it is also absent in the documentation for numpy version 1.15.0 specifically:
/doc/numpy/…/numpy.dtype.base.html (empty page)
/doc/numpy-1.15.0/…/numpy.dtype.base.html (error 404)
/doc/numpy-1.15.1/…/numpy.dtype.base.html (empty page)
Can I rely on the presence and behavior of this attribute in future versions of numpy?
This is now documented:
https://numpy.org/doc/stable/reference/generated/numpy.dtype.base.html#numpy.dtype.base
It is defined at https://github.com/numpy/numpy/blob/eeef9d4646103c3b1afd3085f1393f2b3f9575b2/numpy/core/src/multiarray/descriptor.c#L2255-L2300 and from git-blame was last touched ~13 years ago so it is probably safe to assume that dtype.base will exist and continue to exist.
I'm not sure whether it's safe to rely on base, but it's probably a bad idea either way. People reading your code can't look up what base means in the docs, and anyway, there's a better option.
Instead of base, you can use subdtype, which is documented:
Tuple (item_dtype, shape) if this dtype describes a
sub-array, and None otherwise.
The shape is the fixed shape of the sub-array described by this data
type, and item_dtype the data type of the array.
If a field whose dtype object has this attribute is retrieved, then
the extra dimensions implied by shape are tacked on to the end of
the retrieved array.
For a dtype that represents a subarray, dtype.base is equivalent to dtype.subdtype[0]. For a dtype that doesn't represent a subarray, dtype.base is dtype and dtype.subdtype is None. Here's a demo:
>>> subarray = numpy.dtype('5i4')
>>> not_subarray = numpy.dtype('i4')
>>> subarray.base
dtype('int32')
>>> subarray.subdtype
(dtype('int32'), (5,))
>>> not_subarray.base
dtype('int32')
>>> print(not_subarray.subdtype) # None doesn't get auto-printed
None
Incidentally, if you want to be sure about what dtype.base does, here's the source, which confirms what you guessed from your experiments:
static PyObject *
arraydescr_base_get(PyArray_Descr *self)
{
if (!PyDataType_HASSUBARRAY(self)) {
Py_INCREF(self);
return (PyObject *)self;
}
Py_INCREF(self->subarray->base);
return (PyObject *)(self->subarray->base);
}
I've never used the base attribute, or seen it used. But it does make sense that there should be a way of identifying such an object. I can't find a use of it code such as in np.lib.recfunctions, but it may well be used in compiled code.
With a dtype like '10f8' there are various attritubes (some may be properties):
In [259]: dt = np.dtype('10f8')
In [260]: dt
Out[260]: dtype(('<f8', (10,)))
In [261]: dt.base
Out[261]: dtype('float64')
In [263]: dt.descr
Out[263]: [('', '|V80')]
In [264]: dt.itemsize
Out[264]: 80
In [265]: dt.shape
Out[265]: (10,)
Look what happens when we make an array with this dtype:
In [278]: x = np.ones((3,),'10f8')
In [279]: x
Out[279]:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [280]: x.shape
Out[280]: (3, 10)
In [281]: x.dtype
Out[281]: dtype('float64') # there's your base
There's the answer - dt.base is the dtype that will be used in creating an array with the dtype. It's the dtype without the extra dimensional information.
That sort of dtype is rarely used by itself; more likely it is part of a compound dtype:
In [252]: dt=np.dtype('3i4, 2f4')
In [253]: dt
Out[253]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [254]: dt.base
Out[254]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [255]: dt[0]
Out[255]: dtype(('<i4', (3,)))
In [256]: dt[0].base
This dt could be embedded in another dtype:
In [272]: dt1 = np.dtype((dt, (3,)))
In [273]: dt1
Out[273]: dtype(([('f0', '<i4', (3,)), ('f1', '<f4', (2,))], (3,)))
In [274]: dt1.base
Out[274]: dtype([('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [275]: arr = np.ones((3,), dt1)
In [276]: arr
Out[276]:
array([[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])],
[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])],
[([1, 1, 1], [1., 1.]), ([1, 1, 1], [1., 1.]),
([1, 1, 1], [1., 1.])]],
dtype=[('f0', '<i4', (3,)), ('f1', '<f4', (2,))])
In [277]: arr.shape
Out[277]: (3, 3)
In the case of a structured array, the base of a field is the dtype that we get when viewing just that field.
I'm loading pretty large input files into Numpy array (30 columns, over 10k rows). Data contains only floating point numbers. To simplify data processing I'd like to name columns and access them using human-readable names. AFAIK it's only possibly using structured/record arrays. However, if I'm right, when i use structured arrays I'll loose some information. For instance:
x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
y = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype=[('x', float), ('y', float), ('z', float)])
Both arrays contains the same data and the same dtype. y can be accessed using column names:
yIn [155]: y['x']
Out[155]: array([ 1., 3., 11.])
Unfortunately, I loose (or I get wrong impression?) so essential properties when I use structured arrays. x and y have different shapes, y cannot be transposed etc.
In [160]: x
Out[160]:
array([[ 1., 2.],
[ 3., 4.],
[11., 22.]])
In [161]: y
Out[161]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
In [162]: x.shape
Out[162]: (3, 2)
In [163]: y.shape
Out[163]: (3,)
In [164]: x.T
Out[164]:
array([[ 1., 3., 11.],
[ 2., 4., 22.]])
In [165]: y.T
Out[165]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
Is it possible to continue using "regular 2D Numpy arrays" and access columns using their names?
I need an expression that will grant me an 8-tuple float array. Currently, I have the 8-tuple array via:
E = np.zeros((n,m), dtype='8i') #8-tuple
However, when I append an indices i,j via:
E[i,j][0] = 1000.2 #etc.
I get back a tuple array with dtype int:
[1000 0 0 0 0 0 0 0]
It appears I need a way of using the dtype within my zeros command to both set the n-tuple and the float value. Does anyone know how this is done?
If an array is integer dtype, then assigned values will be truncated:
In [169]: x=np.array([0,1,2])
In [170]: x
Out[170]: array([0, 1, 2])
In [173]: x[0] = 1.234
In [174]: x
Out[174]: array([1, 1, 2])
The array has to have a float dtype to hold float values.
Simply changing the i (integer) to f (float) produces a float array:
In [166]: E = np.zeros((2,3), dtype='8f')
In [167]: E.shape
Out[167]: (2, 3, 8)
In [168]: E.dtype
Out[168]: dtype('float32')
This '8f' dtype is not common. The string actually translates to:
In [175]: np.dtype('8f')
Out[175]: dtype(('<f4', (8,)))
But when used in np.zeros that 8 is treated as a dimension. Usually we specify all dimensions in the shape, as #FHTMitchell notes:
In [176]: E1 = np.zeros((2,3,8), dtype=np.float32)
In [177]: E1.shape
Out[177]: (2, 3, 8)
In [178]: E1.dtype
Out[178]: dtype('float32')
Your use of 'n-tuple' is unclear. While shape is a tuple, numeric arrays don't use tuple notation. That is reserved for structured arrays.
In [180]: np.zeros((3,), dtype='f,f,f,f')
Out[180]:
array([(0., 0., 0., 0.), (0., 0., 0., 0.), (0., 0., 0., 0.)],
dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')])
In [181]: _.shape
Out[181]: (3,)
This is a 1d array with 3 elements. The dtype shows 4 fields. Each element, or record, is displayed as a tuple.
But fields are indexed by name, not number:
In [182]: Out[180]['f1']
Out[182]: array([0., 0., 0.], dtype=float32)
It is also possible to put 'arrays' within fields:
In [183]: np.zeros((3,), dtype=[('f0','f',(4,))])
Out[183]:
array([([0., 0., 0., 0.],), ([0., 0., 0., 0.],), ([0., 0., 0., 0.],)],
dtype=[('f0', '<f4', (4,))])
In [184]: _['f0']
Out[184]:
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=float32)
Initially I thought the 8f notation would produce this sort of array. But apparently I have to either use the full notation with field name, or make a comma separated string:
In [185]: np.zeros((3,), dtype='4f,i')
Out[185]:
array([([0., 0., 0., 0.], 0), ([0., 0., 0., 0.], 0),
([0., 0., 0., 0.], 0)], dtype=[('f0', '<f4', (4,)), ('f1', '<i4')])
dtype notation can be confusing, https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html
Unless you are intentionally trying to create a structured array, it is best to stay away from the '8f' notation.
In [189]: np.array([0,1,2,3],dtype='4i')
TypeError: object of type 'int' has no len()
In [190]: np.array([[0,1,2,3]],dtype='4i')
TypeError: object of type 'int' has no len()
In [191]: np.array([(0,1,2,3)],dtype='4i') # requires [(...)]
Out[191]: array([[0, 1, 2, 3]], dtype=int32)
Without the 4, I can simply write:
In [193]: np.array([[0,1,2,3]], dtype='i')
Out[193]: array([[0, 1, 2, 3]], dtype=int32)
In [194]: np.array([0,1,2,3], dtype='i')
Out[194]: array([0, 1, 2, 3], dtype=int32)
In [195]: np.array([[0,1,2,3]])
Out[195]: array([[0, 1, 2, 3]])
E = np.zeros((n,m), dtype='8f')
Try:
E = np.zeros((n,m), dtype='8f') #8-tuple
Scikit-learn library have a brilliant example of data clustering - stock market structure. It works fine within US stocks. But when one adds tickers from other markets, numpy's error appear that arrays shoud have the same size - this is true, for example, german stocks have different trading calendar.
Ok, after quotes download I add preparation of shared dates:
quotes = [quotes_historical_yahoo_ochl(symbol, d1, d2, asobject=True)
for symbol in symbols]
def intersect(list_1, list_2):
return list(set(list_1) & set(list_2))
dates_all = quotes[0].date
for q in quotes:
dates_symbol = q.date
dates_all = intersect(dates_all, dates_symbol)
Then I'm stuck with filtering numpy array of tuples. Here's some tries:
# for index, q in enumerate(quotes):
# filtered = [i for i in q if i.date in dates_all]
# quotes[index] = np.rec.array(filtered, dtype=q.dtype)
# quotes[index] = np.asanyarray(filtered, dtype=q.dtype)
#
# quotes[index] = np.where(a.date in dates_all for a in q)
#
# quotes[index] = np.where(q[0].date in dates_all)
How to apply filter to numpy array or how to truly convert list of records (after filter) back to numpy's recarray?
quotes[0].dtype:
'(numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'), ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])'
quotes[0].shape:
<class 'tuple'>: (261,)
So quotes is a list of recarrays, and in date_all you collect the intersection of all values in the date field.
I can recreate one such array with:
In [286]: dt=np.dtype([('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day',
...:
...: ), ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])
In [287]:
In [287]: arr=np.ones((2,), dtype=dt) # 2 element structured array
In [288]: arr
Out[288]:
array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ... ('aclose', '<f8')])
In [289]: type(arr[0])
Out[289]: numpy.void
turn that into a recarray (I dont' use those as much as plain structured arrays):
In [291]: np.rec.array(arr)
Out[291]:
rec.array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), .... ('aclose', '<f8')])
dtype of the recarray displays slightly different:
In [292]: _.dtype
Out[292]: dtype((numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ....('aclose', '<f8')]))
In [293]: __.date
Out[293]: array([1, 1], dtype=object)
In any case the date field is an array of objects, possibly of datetime?
q is one of these arrays; i is an element, and i.date is the date field.
[i for i in q if i.date in dates_all]
So filtered is list of recarray elements. np.stack does a better job of reassembling them into an array (that works with the recarray too).
np.stack([i for i in arr if i['date'] in alist])
Or you could collect the indices of the matching records, and index the quote array
In [319]: [i for i,v in enumerate(arr) if v['date'] in alist]
Out[319]: [0, 1]
In [320]: arr[_]
or pull out the date field first:
In [321]: [i for i,v in enumerate(arr['date']) if v in alist]
Out[321]: [0, 1]
in1d might also work to search
In [322]: np.in1d(arr['date'],alist)
Out[322]: array([ True, True], dtype=bool)
In [323]: np.where(np.in1d(arr['date'],alist))
Out[323]: (array([0, 1], dtype=int32),)