Read/Write Python List from/to Binary file - python

According to the Python Cookbook, below is how to write a list of tuples to a binary file:
from struct import Struct

def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

# Example
if __name__ == '__main__':
    records = [ (1, 2.3, 4.5),
                (6, 7.8, 9.0),
                (12, 13.4, 56.7) ]

    with open('data.b', 'wb') as f:
        write_records(records, '<idd', f)
And it works well.
For reading (large amounts of binary data), the author recommended the following:
>>> import numpy as np
>>> f = open('data.b', 'rb')
>>> records = np.fromfile(f, dtype='<i,<d,<d')
>>> records
array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
>>> records[0]
(1, 2.3, 4.5)
>>> records[1]
(6, 7.8, 9.0)
>>>
It is also good, but these records are not a normal numpy array. For instance, type(records[0]) will return <type 'numpy.void'>. Even worse, I cannot extract the first column using X = records[:, 0].
Is there a way to efficiently load list(or any other types) from binary file into a normal numpy array?
Thanks in advance.

In [196]: rec = np.fromfile('data.b', dtype='<i,<d,<d')
In [198]: rec
Out[198]:
array([( 1, 2.3, 4.5), ( 6, 7.8, 9. ), (12, 13.4, 56.7)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
This is a 1d structured array
In [199]: rec['f0']
Out[199]: array([ 1, 6, 12], dtype=int32)
In [200]: rec.shape
Out[200]: (3,)
In [201]: rec.dtype
Out[201]: dtype([('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
Note that its tolist looks identical to your original records:
In [202]: rec.tolist()
Out[202]: [(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)]
In [203]: records
Out[203]: [(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)]
You could create a 2d array from either list with:
In [204]: arr2 = np.array(rec.tolist())
In [205]: arr2
Out[205]:
array([[  1. ,   2.3,   4.5],
       [  6. ,   7.8,   9. ],
       [ 12. ,  13.4,  56.7]])
In [206]: arr2.shape
Out[206]: (3, 3)
There are other ways of converting a structured array to a 'regular' array, but this is the simplest and most consistent.
The tolist of a regular array uses nested lists. The tuples in the structured version are intended to convey a difference:
In [207]: arr2.tolist()
Out[207]: [[1.0, 2.3, 4.5], [6.0, 7.8, 9.0], [12.0, 13.4, 56.7]]
In the structured array the first field is integer. In the regular array the first column is the same as the others, float.
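If you prefer to skip the tolist round trip, one alternative (a sketch of my own, not from the answer above) is to pull each field out by name and stack the fields as columns:

import numpy as np

# Hedged sketch: stack the named fields as columns of a regular 2d array.
# Like tolist(), this upcasts the integer field 'f0' to float.
rec = np.fromfile('data.b', dtype='<i,<d,<d')
arr2 = np.stack([rec[name].astype(float) for name in rec.dtype.names], axis=1)
print(arr2.shape)   # (3, 3)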
If the binary file contained all floats, you could load it as a 1d array of floats and reshape:
In [208]: with open('data.f', 'wb') as f:
     ...:     write_records(records, 'ddd', f)
In [210]: rec2 = np.fromfile('data.f', dtype='<d')
In [211]: rec2
Out[211]: array([ 1. , 2.3, 4.5, 6. , 7.8, 9. , 12. , 13.4, 56.7])
But to take advantage of any record structure in the binary file, you have to load by records as well, which means a structured array:
In [213]: rec3 = np.fromfile('data.f', dtype='d,d,d')
In [214]: rec3
Out[214]:
array([( 1., 2.3, 4.5), ( 6., 7.8, 9. ), ( 12., 13.4, 56.7)],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8')])
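And since every field of rec3 is '<f8', a plain-float view plus reshape gives a regular 2d array without copying (a small sketch building on In [213]):

import numpy as np

# Sketch: homogeneous '<f8' fields permit a simple uniform view.
rec3 = np.fromfile('data.f', dtype='d,d,d')
arr = rec3.view('<f8').reshape(-1, 3)   # regular (3, 3) float array, no copy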


Converting numpy structured array subset to numpy array without copy

Suppose I have the following numpy structured array:
In [250]: x
Out[250]:
array([(22, 2, -1000000000, 2000), (22, 2, 400, 2000),
       (22, 2, 804846, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000),
       (55, 5, 1000, 5000), (55, 5, 8900, 5000), (55, 5, 11400, 5000),
       (33, 3, 14500, 3000), (33, 3, 40550, 3000), (33, 3, 40990, 3000),
       (33, 3, 44400, 3000)],
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
I am trying to convert a subset of the above array to a regular numpy array.
It is essential for my application that no copies are created (only views).
Fields are retrieved from the above structured array by using the following function:
def fields_view(array, fields):
    return array.getfield(numpy.dtype(
        {name: array.dtype.fields[name] for name in fields}
    ))
If I am interested in fields 'f2' and 'f3', I would do the following:
In [251]: y=fields_view(x,['f2','f3'])
In [252]: y
Out[252]:
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)],
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})
There is a way to directly get an ndarray from the 'f2' and 'f3' fields of the original structured array. However, for my application, it is necessary to build this intermediary structured array as this data subset is an attribute of a class.
I can't convert the intermediary structured array to a regular numpy array without doing a copy.
In [253]: y.view(('<f4', len(y.dtype.names)))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-f8fc3a40fd1b> in <module>()
----> 1 y.view(('<f4', len(y.dtype.names)))
ValueError: new type not compatible with array.
This function can also be used to convert a record array to an ndarray:
def recarr_to_ndarr(x, typ):
    fields = x.dtype.names
    shape = x.shape + (len(fields),)
    offsets = [x.dtype.fields[name][1] for name in fields]
    assert not any(np.diff(offsets, n=2))
    strides = x.strides + (offsets[1] - offsets[0],)
    y = np.ndarray(shape=shape, dtype=typ, buffer=x,
                   offset=offsets[0], strides=strides)
    return y
However, I get the following error:
In [254]: recarr_to_ndarr(y,'<f4')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-2ebda2a39e9f> in <module>()
----> 1 recarr_to_ndarr(y,'<f4')
<ipython-input-62-8a9eea8e7512> in recarr_to_ndarr(x, typ)
8 strides = x.strides + (offsets[1] - offsets[0],)
9 y = np.ndarray(shape=shape, dtype=typ, buffer=x,
---> 10 offset=offsets[0], strides=strides)
11 return y
12
TypeError: expected a single-segment buffer object
The function works fine if I create a copy:
In [255]: recarr_to_ndarr(np.array(y),'<f4')
Out[255]:
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       [  2.00000000e+00,   8.04846000e+05],
       [  2.00000000e+00,   8.00000000e+02],
       [  5.00000000e+00,   9.00000000e+02],
       [  5.00000000e+00,   1.00000000e+03],
       [  5.00000000e+00,   8.90000000e+03],
       [  5.00000000e+00,   1.14000000e+04],
       [  3.00000000e+00,   1.45000000e+04],
       [  3.00000000e+00,   4.05500000e+04],
       [  3.00000000e+00,   4.09900000e+04],
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)
There seems to be no difference between the two arrays:
In [66]: y
Out[66]:
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)],
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})
In [67]: np.array(y)
Out[67]:
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)],
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})
This answer is a bit long and rambling. I started with what I knew from previous work on taking array views, and then tried to relate that to your functions.
================
In your case, all fields are 4 bytes long, both floats and ints. I can then view it as all ints or all floats:
In [1431]: x
Out[1431]:
array([(22, 2.0, -1000000000.0, 2000), (22, 2.0, 400.0, 2000),
       (22, 2.0, 804846.0, 2000), (44, 2.0, 800.0, 4000),
       (55, 5.0, 900.0, 5000), (55, 5.0, 1000.0, 5000),
       (55, 5.0, 8900.0, 5000), (55, 5.0, 11400.0, 5000),
       (33, 3.0, 14500.0, 3000), (33, 3.0, 40550.0, 3000),
       (33, 3.0, 40990.0, 3000), (33, 3.0, 44400.0, 3000)],
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
In [1432]: x.view('i4')
Out[1432]:
array([ 22, 1073741824, -831624408, 2000, 22,
1073741824, 1137180672, 2000, 22, 1073741824,
1229225696, 2000, 44, 1073741824, 1145569280,
.... 3000])
In [1433]: x.view('f4')
Out[1433]:
array([ 3.08285662e-44, 2.00000000e+00, -1.00000000e+09,
2.80259693e-42, 3.08285662e-44, 2.00000000e+00,
.... 4.20389539e-42], dtype=float32)
This view is 1d. I can reshape it and slice out the 2 float columns:
In [1434]: x.shape
Out[1434]: (12,)
In [1435]: x.view('f4').reshape(12,-1)
Out[1435]:
array([[  3.08285662e-44,   2.00000000e+00,  -1.00000000e+09,
          2.80259693e-42],
       [  3.08285662e-44,   2.00000000e+00,   4.00000000e+02,
          2.80259693e-42],
       ...
       [  4.62428493e-44,   3.00000000e+00,   4.44000000e+04,
          4.20389539e-42]], dtype=float32)
In [1437]: x.view('f4').reshape(12,-1)[:,1:3]
Out[1437]:
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       [  2.00000000e+00,   8.04846000e+05],
       [  2.00000000e+00,   8.00000000e+02],
       ...
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)
That this is a view can be verified by doing a bit of in-place math and seeing the results in x:
In [1439]: y=x.view('f4').reshape(12,-1)[:,1:3]
In [1440]: y[:,0] += .5
In [1441]: y
Out[1441]:
array([[  2.50000000e+00,  -1.00000000e+09],
       [  2.50000000e+00,   4.00000000e+02],
       ...
       [  3.50000000e+00,   4.44000000e+04]], dtype=float32)
In [1442]: x
Out[1442]:
array([(22, 2.5, -1000000000.0, 2000), (22, 2.5, 400.0, 2000),
       (22, 2.5, 804846.0, 2000), (44, 2.5, 800.0, 4000),
       (55, 5.5, 900.0, 5000), (55, 5.5, 1000.0, 5000),
       (55, 5.5, 8900.0, 5000), (55, 5.5, 11400.0, 5000),
       (33, 3.5, 14500.0, 3000), (33, 3.5, 40550.0, 3000),
       (33, 3.5, 40990.0, 3000), (33, 3.5, 44400.0, 3000)],
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
If the field sizes differed, this might be impossible, for example if the floats were 8 bytes. The key is picturing how the structured data is stored, and imagining whether that can be viewed as a simple dtype with multiple columns. And the field choice has to be equivalent to a basic slice. Working with ['f1','f4'] would be equivalent to advanced indexing with [:, [0, 3]], which has to be a copy.
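A quick sketch of that failing case (my own illustration, pairing a 4-byte int with an 8-byte float):

import numpy as np

# Packed dtype: 4-byte int + 8-byte float = 12-byte records.
mixed = np.zeros(3, dtype=[('a', '<i4'), ('b', '<f8')])
try:
    mixed.view('<f8')   # 8 bytes cannot evenly tile 12-byte records
except ValueError as e:
    print('no uniform view:', e)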
==========
The 'direct' field indexing is:
z = x[['f2','f3']].view('f4').reshape(12, -1)
z -= .5
modifies z, but with a FutureWarning. Also it does not modify x; z has become a copy. I can also see this by looking at z.__array_interface__['data'], the data buffer location (and comparing with that of x and y).
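np.shares_memory makes that buffer check more direct than comparing addresses by eye (a quick sketch, with x, y and z as defined above):

print(np.shares_memory(z, x))   # False -> z is a copy
print(np.shares_memory(y, x))   # True  -> y is a view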
=================
Your fields_view does create a structured view:
In [1480]: w=fields_view(x,['f2','f3'])
In [1481]: w.__array_interface__['data']
Out[1481]: (151950184, False)
In [1482]: x.__array_interface__['data']
Out[1482]: (151950184, False)
which can be used to modify x (w['f2'] -= .5). So it is more versatile than the 'direct' x[['f2','f3']].
The w dtype is
dtype({'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})
Adding print(shape, typ, offsets, strides) to your recarr_to_ndarr, I get (py3)
In [1499]: recarr_to_ndarr(w,'<f4')
(12, 2) <f4 [4, 8] (16, 4)
....
ValueError: ndarray is not contiguous
In [1500]: np.ndarray(shape=(12,2), dtype='<f4', buffer=w.data, offset=4, strides=(16,4))
...
BufferError: memoryview: underlying buffer is not contiguous
That contiguity problem must be referring to the values shown in w.flags:
In [1502]: w.flags
Out[1502]:
C_CONTIGUOUS : False
F_CONTIGUOUS : False
....
It's interesting that w.dtype.descr converts the 'offsets' into an unnamed field:
In [1506]: w.__array_interface__
Out[1506]:
{'data': (151950184, False),
'descr': [('', '|V4'), ('f2', '<f4'), ('f3', '<f4')],
'shape': (12,),
'strides': (16,),
'typestr': '|V12',
'version': 3}
One way or another, w has a non-contiguous data buffer, which can't be used to create a new array. Flattened, the data buffer looks something like:
xoox|xoox|xoox|...
# x - 4 bytes we want to skip
# o - 4 bytes we want to use
# | - invisible boundary between records in x
The y I constructed above has:
In [1511]: y.__array_interface__
Out[1511]:
{'data': (151950188, False),
'descr': [('', '<f4')],
'shape': (12, 2),
'strides': (16, 4),
'typestr': '<f4',
'version': 3}
So it accesses the o bytes with a 4 byte offset, and then (16,4) strides, and (12,2) shape.
If I modify your ndarray call to use the original x.data, it works:
In [1514]: xx=np.ndarray(shape=(12,2), dtype='<f4', buffer=x.data, offset=4, strides=(16,4))
In [1515]: xx
Out[1515]:
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       ....
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)
with the same array_interface as my y:
In [1516]: xx.__array_interface__
Out[1516]:
{'data': (151950188, False),
'descr': [('', '<f4')],
'shape': (12, 2),
'strides': (16, 4),
'typestr': '<f4',
'version': 3}
hpaulj was right in saying that the problem is that the subset of the structured array is not contiguous. Interestingly, I figured out a way to make the array subset contiguous with the following function:
def view_fields(a, fields):
    """
    `a` must be a numpy structured array.
    `fields` is the collection of field names to keep.
    Returns a view of the array `a` (not a copy).
    """
    dt = a.dtype
    formats = [dt.fields[name][0] for name in fields]
    offsets = [dt.fields[name][1] for name in fields]
    itemsize = a.dtype.itemsize
    newdt = np.dtype(dict(names=fields,
                          formats=formats,
                          offsets=offsets,
                          itemsize=itemsize))
    b = a.view(newdt)
    return b
In [5]: view_fields(x,['f2','f3']).flags
Out[5]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
The old function:
In [10]: fields_view(x,['f2','f3']).flags
Out[10]:
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
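As a final check (my own sketch, not part of the original answers), writing through the new contiguous view should show up in x, since no copy was made:

v = view_fields(x, ['f2', 'f3'])
v['f2'] += 0.5                  # writes through to x's 'f2' field
print(np.shares_memory(v, x))   # True -> a genuine view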

python- add col names to np.array

Why does the following work:
mat = np.array(
    [(0,0,0),
     (0,0,0),
     (0,0,0)],
    dtype=[('MSFT','float'), ('CSCO','float'), ('GOOG','float')]
)
while this doesn't:
mat = np.array(
    [[0]*3]*3,
    dtype=[('MSFT','float'), ('CSCO','float'), ('GOOG','float')]
)
TypeError: expected a readable buffer object
How can I easily create a matrix like
[[None]*M]*N
but with tuples in it, to be able to assign names to columns?
When I make a zero array with your dtype:
In [548]: dt=np.dtype([('MSFT','float'),('CSCO','float'),('GOOG','float') ])
In [549]: A = np.zeros(3, dtype=dt)
In [550]: A
Out[550]:
array([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
notice that the display shows a list of tuples. That's intentional, to distinguish the dtype records from a row of a 2d (ordinary) array.
That also means that when creating the array, or assigning values, you also need to use a list of tuples.
For example, let's make a list of lists:
In [554]: ll = np.arange(9).reshape(3,3).tolist()
In [555]: ll
Out[555]: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
In [556]: A[:] = ll
...
TypeError: a bytes-like object is required, not 'list'
but if I turn it into a list of tuples:
In [557]: llt = [tuple(i) for i in ll]
In [558]: llt
Out[558]: [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
In [559]: A[:]=llt
In [560]: A
Out[560]:
array([(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
Assignment works fine. That list can also be used directly in np.array:
In [561]: np.array(llt, dtype=dt)
Out[561]:
array([(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
Similarly assigning values to one record requires a tuple, not a list:
In [563]: A[0]=(10,12,14)
The other common way of setting values is on a field by field basis. That can be done with a list or array:
In [564]: A['MSFT']=[100,200,300]
In [565]: A
Out[565]:
array([(100.0, 12.0, 14.0), (200.0, 4.0, 5.0), (300.0, 7.0, 8.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
The np.rec.fromarrays method recommended in the other answer ends up using this copy-by-fields approach. Its code is, in essence:
arrayList = [sb.asarray(x) for x in arrayList]
<determine shape>
<determine dtype>
_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]
If you have a number of 1D arrays (columns) you would like to merge while keeping column names, you can use np.rec.fromarrays:
>>> dt = np.dtype([('a', float),('b', float),('c', float),])
>>> np.rec.fromarrays([[0] * 3 ] * 3, dtype=dt)
rec.array([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
This gives you a record/structured array in which columns can have names and different datatypes.
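Putting the pieces together for the original question (a small sketch using the names from the question), convert each nested-list row to a tuple before handing it to np.array:

import numpy as np

dt = np.dtype([('MSFT', 'float'), ('CSCO', 'float'), ('GOOG', 'float')])
# A list of tuples satisfies the structured-array constructor.
mat = np.array([tuple(row) for row in [[0.0] * 3] * 3], dtype=dt)
print(mat['MSFT'])   # [0. 0. 0.]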

Numpy genfromtxt - Is it possible to set default type for missing fields when reading file?

I am using numpy genfromtxt to read in CSV data files which I subsequently stack into a single structured numpy array. However, I am running into some problems because in some of the files all data are missing from some fields. Because of this, when I try and stack the data I get a "TypeError: Incompatible type" for the field with all missing data.
Is there a way to handle this by setting a default missing_values dtype in genfromtxt, or by somehow handling the type mismatch when stacking the arrays?
Note, I do not know what the field datatypes are going to be ahead of time.
import numpy as np
import numpy.lib.recfunctions as RF
#=====================================================
#------------ file test0.csv ----------
# fld1, fld2, fld3, fld4, fld5
# aaa, 1, , 3.0, 4
# bbb, 2, , 4.1, 3
# ccc, 3, , 5.2, 2
# ddd, 4, , 6.3, 1
#
#------------ file test1.csv ----------
# fld1, fld2, fld3, fld4, fld5
# aaa, 1, 2.0, 3.0, 4
# bbb, 2, 2.1, 4.1, 3
# ccc, 3, 2.2, 5.2, 2
# ddd, 4, 2.3, 6.3, 1
#
#====================================================================
fn0 = r'C:\temp\test0.csv'
fn1 = r'C:\temp\test1.csv'
a0 = np.genfromtxt(fn0, dtype=None, delimiter=',', names=True)
a1 = np.genfromtxt(fn1, dtype=None, delimiter=',', names=True)
da = RF.stack_arrays((a0,a1))
With your samples, a2.dtype is
dtype=[('fld1', 'S3'), ('fld2', '<i4'),
       ('fld3', '<f8'), ('fld4', '<f8'), ('fld5', '<i4')]
but for a1 the third field is ('fld3', '?'), because there's no data to deduce a type from.
If I define a dtype list
dt=['S3',int,float,float,int]
a2 = np.genfromtxt(txt2, dtype=dt, delimiter=',', names=True)
a1 = np.genfromtxt(txt1, dtype=dt, delimiter=',', names=True)
then the arrays have a common dtype and I can concatenate (no need for the RF version):
In [25]: np.concatenate([a1,a2],axis=0)
Out[25]:
array([(b'aaa', 1, nan, 3.0, 4), (b'bbb', 2, nan, 4.1, 3),
       (b'ccc', 3, nan, 5.2, 2), (b'ddd', 4, nan, 6.3, 1),
       (b'aaa', 1, 2.0, 3.0, 4), (b'bbb', 2, 2.1, 4.1, 3),
       (b'ccc', 3, 2.2, 5.2, 2), (b'ddd', 4, 2.3, 6.3, 1)],
      dtype=[('fld1', 'S3'), ('fld2', '<i4'), ('fld3', '<f8'), ('fld4', '<f8'), ('fld5', '<i4')])
I don't see a way around making an explicit list like dt. There's nothing in the parameters, as far as I can see, that lets me say, for example, that with dtype=None any undeducible column should be float.
Once created, it's going to take a lot of work to change the dtype of selected fields to make the arrays compatible. Changing names is easy. But changing a field's dtype will most likely require making a new empty array and copying fields.
Look at a1.itemsize and a2.itemsize: they are 20 and 27, respectively.
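For the 'changing names is easy' part, a one-line sketch (the new names are purely illustrative):

# Field names can be reassigned in place; the data buffer is untouched.
a1.dtype.names = ('c1', 'c2', 'c3', 'c4', 'c5')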
==========================
Here's an example of changing the dtype after loading. a1n and a2n are the arrays created with dtype=None:
Make an empty array with the shape from a1n but the dtype from a2n, then copy fields by name:
In [31]: a1nn = np.zeros(a1n.shape, dtype=a2n.dtype)
In [32]: for n in a1nn.dtype.names:
    ....:     a1nn[n] = a1n[n]   # copy fields by name
    ....:
In [33]: a1nn
Out[33]:
array([(b'aaa', 1, 0.0, 3.0, 4), (b'bbb', 2, 0.0, 4.1, 3),
       (b'ccc', 3, 0.0, 5.2, 2), (b'ddd', 4, 0.0, 6.3, 1)],
      dtype=[('fld1', 'S3'), ('fld2', '<i4'), ('fld3', '<f8'), ('fld4', '<f8'), ('fld5', '<i4')])
In [34]: np.concatenate([a1nn, a2n])
Out[34]:
array([(b'aaa', 1, 0.0, 3.0, 4), (b'bbb', 2, 0.0, 4.1, 3),
       (b'ccc', 3, 0.0, 5.2, 2), (b'ddd', 4, 0.0, 6.3, 1),
       (b'aaa', 1, 2.0, 3.0, 4), (b'bbb', 2, 2.1, 4.1, 3),
       (b'ccc', 3, 2.2, 5.2, 2), (b'ddd', 4, 2.3, 6.3, 1)],
      dtype=[('fld1', 'S3'), ('fld2', '<i4'), ('fld3', '<f8'), ('fld4', '<f8'), ('fld5', '<i4')])
genfromtxt filled the missing fields with np.nan, but this route used 0.
RF has a function that copies arrays field by field, but does so recursively when the dtype is nested.
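The per-field copy above can be wrapped in a small helper (my own sketch, not a numpy or recfunctions API):

import numpy as np

def recast(arr, target_dtype):
    # Copy a structured array field by field into a compatible target dtype.
    out = np.zeros(arr.shape, dtype=target_dtype)
    for name in arr.dtype.names:
        out[name] = arr[name]   # per-field copy; numpy casts as needed
    return out

With that, np.concatenate([recast(a1n, a2n.dtype), a2n]) reproduces In [34].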

convert (nx2) array of floats into (nx1) array of 2-tuples

I have a NumPy float array
x = np.array([
        [0.0, 1.0],
        [2.0, 3.0],
        [4.0, 5.0]
    ],
    dtype=np.float32
)
and need to convert it into a NumPy array with a tuple dtype,
y = np.array([
        (0.0, 1.0),
        (2.0, 3.0),
        (4.0, 5.0)
    ],
    dtype=np.dtype((np.float32, 2))
)
NumPy views unfortunately don't work here:
y = x.view(dtype=np.dtype((np.float32, 2)))
ValueError: new type not compatible with array.
Is there a chance to get this done without iterating through x and copying over every single entry?
This is close:
In [122]: dt=np.dtype([('x',float,(2,))])
In [123]: y=np.zeros(x.shape[0],dtype=dt)
In [124]: y
Out[124]:
array([([0.0, 0.0],), ([0.0, 0.0],), ([0.0, 0.0],)],
dtype=[('x', '<f8', (2,))])
In [125]: y['x']=x
In [126]: y
Out[126]:
array([([0.0, 1.0],), ([2.0, 3.0],), ([4.0, 5.0],)],
dtype=[('x', '<f8', (2,))])
In [127]: y['x']
Out[127]:
array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.]])
y has one compound field. That field has 2 elements.
Alternatively you could define 2 fields:
In [134]: dt=np.dtype('f,f')
In [135]: x.view(dt)
Out[135]:
array([[(0.0, 1.0)],
       [(2.0, 3.0)],
       [(4.0, 5.0)]],
      dtype=[('f0', '<f4'), ('f1', '<f4')])
But that is shape (3,1); so reshape:
In [137]: x.view(dt).reshape(3)
Out[137]:
array([(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)],
dtype=[('f0', '<f4'), ('f1', '<f4')])
Apart from the dtype, that displays the same as your y.
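One more check (a sketch of my own): the two-field view shares x's buffer, since its 8-byte records line up exactly with each float32 pair, so it even writes back into x:

import numpy as np

x = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], dtype=np.float32)
y = x.view(np.dtype('f4,f4')).reshape(x.shape[0])
y['f0'] += 10      # writes through to the first column of x
print(x[:, 0])     # [10. 12. 14.]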

Assigning field names to numpy array in Python 2.7.3

I am going nuts over this one, as I obviously miss the point and the solution is too simple to see :(
I have an np.array with x columns, and I want to assign a field name. So here is my code:
data = np.array([[1,2,3], [4.0,5.0,6.0], [11,12,12.3]])
a = np.array(data, dtype= {'names': ['1st', '2nd', '3rd'], 'formats':['f8','f8', 'f8']})
print a['1st']
Why does this give
[[  1.    2.    3. ]
 [  4.    5.    6. ]
 [ 11.   12.   12.3]]
instead of [1, 2, 3]?
In [1]: data = np.array([[1,2,3], [4.0,5.0,6.0], [11,12,12.3]])
In [2]: dt = np.dtype({'names': ['1st', '2nd', '3rd'], 'formats':['f8','f8', 'f8']})
Your attempt:
In [3]: np.array(data,dt)
Out[3]:
array([[(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (3.0, 3.0, 3.0)],
       [(4.0, 4.0, 4.0), (5.0, 5.0, 5.0), (6.0, 6.0, 6.0)],
       [(11.0, 11.0, 11.0), (12.0, 12.0, 12.0), (12.3, 12.3, 12.3)]],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
produces a (3,3) array, with the same values assigned to each field. data.astype(dt) does the same thing.
But view produces a (3,1) array in which each field contains the data for a column.
In [4]: data.view(dt)
Out[4]:
array([[(1.0, 2.0, 3.0)],
       [(4.0, 5.0, 6.0)],
       [(11.0, 12.0, 12.3)]],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
I should caution that view only works if all the fields have the same data type as the original. It uses the same data buffer, just interpreting the values differently.
You could reshape the result from (3,1) to (3,).
But since you want A['1st'] to be [1,2,3] - a row of data - we have to do some other manipulation.
In [16]: data.T.copy().view(dt)
Out[16]:
array([[(1.0, 4.0, 11.0)],
       [(2.0, 5.0, 12.0)],
       [(3.0, 6.0, 12.3)]],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
In [17]: _['1st']
Out[17]:
array([[ 1.],
       [ 2.],
       [ 3.]])
I transpose, and then make a copy (rearranging the underlying data buffer). Now a view puts [1,2,3] in one field.
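To get a flat [1, 2, 3] from that field, reshape the (3,1) result (a small sketch continuing In [16]):

# Sketch: flatten the (3,1) structured view so each field is 1-d.
B = data.T.copy().view(dt).reshape(-1)
print(B['1st'])   # [ 1.  2.  3.]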
Note that the display of the structured array uses () instead of [] for the 'rows'. This is a clue as to how it accepts input.
I can turn your data into a list of tuples with:
In [19]: [tuple(i) for i in data.T]
Out[19]: [(1.0, 4.0, 11.0), (2.0, 5.0, 12.0), (3.0, 6.0, 12.300000000000001)]
In [20]: np.array([tuple(i) for i in data.T],dt)
Out[20]:
array([(1.0, 4.0, 11.0), (2.0, 5.0, 12.0), (3.0, 6.0, 12.3)],
dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
In [21]: _['1st']
Out[21]: array([ 1., 2., 3.])
This is a (3,) array with 3 fields.
A list of tuples is the normal way of supplying data to np.array(...,dt). See the doc link in my comment.
You can also create an empty array and fill it, row by row or field by field:
In [26]: A = np.zeros((3,), dt)
In [27]: for i in range(3):
    ....:     A[i] = data[:, i].copy()
Without the copy I get a ValueError: ndarray is not C-contiguous.
Fill field by field:
In [29]: for i in range(3):
    ....:     A[dt.names[i]] = data[i, :]
Usually a structured array has many rows, and a few fields. So filling by field is relatively fast. That's how recarray functions handle most copying tasks.
fromiter can also be used:
In [31]: np.fromiter(data, dtype=dt)
Out[31]:
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (11.0, 12.0, 12.3)],
dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
But the error I get when using data.T without the copy is a strong indication that it is iterating row by row (as in my In [27]):
In [32]: np.fromiter(data.T, dtype=dt)
...
ValueError: ndarray is not C-contiguous
zip(*data) is another way of reordering the input array (see @unutbu's answer in the comment link).
np.fromiter(zip(*data),dtype=dt)
As pointed out in a comment, fromarrays works:
np.rec.fromarrays(data,dt)
This is an example of a rec function that uses the by-field copy method:
arrayList = [sb.asarray(x) for x in arrayList]
....
_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]
Which in our case is:
In [8]: data1 = [np.asarray(i) for i in data]
In [9]: data1
Out[9]: [array([ 1., 2., 3.]), array([ 4., 5., 6.]), array([ 11. , 12. , 12.3])]
In [10]: for i in range(3):
    ....:     A[dt.names[i]] = data1[i]
