numpy array to ndarray - python

I have an exported pandas dataframe that is now a numpy.array object.
subset = array[:4,:]
array([[ 2.        , 12.        , 33.33333333,  2.        ,
        33.33333333, 12.        ],
       [ 2.        ,  2.        , 33.33333333,  2.        ,
        33.33333333,  2.        ],
       [ 2.8       ,  8.        , 45.83333333,  2.75      ,
        46.66666667, 13.        ],
       [ 3.11320755, 75.        , 56.        ,  3.24      ,
        52.83018868, 33.        ]])
print subset.dtype
dtype('float64')
I want to convert the column values to specific types and set column names as well; this means I need to convert it to an ndarray.
Here are my dtypes:
[('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'),
('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8'),
('NULL_COUNT_B', '<f8')]
When I go to convert the array, I get:
ValueError: new type not compatible with array.
How do you cast each column to a specific type so I can convert the array to an ndarray?
Thanks

You already have an ndarray. What you are seeking is a structured array, one with this compound dtype. First see if pandas can do it for you. If that fails we might be able to do something with tolist and a list comprehension.
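For the pandas route, DataFrame.to_records builds the structured array in one step. A minimal sketch, where df is a hypothetical frame standing in for the exported data (only three of the columns, for brevity):

```python
import pandas as pd

# hypothetical frame standing in for the exported DataFrame
df = pd.DataFrame({
    'PERCENT_A_NEW': [2.0, 2.0, 2.8],
    'JoinField':     [12, 2, 8],
    'NULL_COUNT_B':  [33.33333333, 33.33333333, 45.83333333],
})

rec = df.to_records(index=False)     # numpy record (structured) array
print(rec.dtype.names)               # ('PERCENT_A_NEW', 'JoinField', 'NULL_COUNT_B')
print(rec['JoinField'])              # stays integer, no float round trip
```

Each column keeps its own dtype, so the integer column survives the export.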
In [84]: dt = [('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'),
    ...:       ('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8'),
    ...:       ('NULL_COUNT_B', '<f8')]
In [85]: subset = np.array([[ 2.        , 12.        , 33.33333333,  2.        , 33.33333333, 12.        ],
    ...:                    [ 2.        ,  2.        , 33.33333333,  2.        , 33.33333333,  2.        ],
    ...:                    [ 2.8       ,  8.        , 45.83333333,  2.75      , 46.66666667, 13.        ],
    ...:                    [ 3.11320755, 75.        , 56.        ,  3.24      , 52.83018868, 33.        ]])
In [86]: subset
Out[86]:
array([[ 2.        , 12.        , 33.33333333,  2.        ,
        33.33333333, 12.        ],
       [ 2.        ,  2.        , 33.33333333,  2.        ,
        33.33333333,  2.        ],
       [ 2.8       ,  8.        , 45.83333333,  2.75      ,
        46.66666667, 13.        ],
       [ 3.11320755, 75.        , 56.        ,  3.24      ,
        52.83018868, 33.        ]])
Now make an array with dt. Input for a structured array has to be a list of tuples, so I'm using tolist and a list comprehension:
In [87]: np.array([tuple(row) for row in subset.tolist()], dtype=dt)
...
ValueError: field 'NULL_COUNT_B' occurs more than once
In [88]: subset.shape
Out[88]: (4, 6)
In [89]: dt
Out[89]:
[('PERCENT_A_NEW', '<f8'),
('JoinField', '<i4'),
('NULL_COUNT_B', '<f8'),
('PERCENT_COMP_B', '<f8'),
('RANKING_A', '<f8'),
('RANKING_B', '<f8'),
('NULL_COUNT_B', '<f8')]
Dropping the duplicated 'NULL_COUNT_B' field leaves six fields, matching the six columns:
In [90]: dt = [('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'),
    ...:       ('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8')]
In [91]: np.array([tuple(row) for row in subset.tolist()],dtype=dt)
Out[91]:
array([(2.0, 12, 33.33333333, 2.0, 33.33333333, 12.0),
(2.0, 2, 33.33333333, 2.0, 33.33333333, 2.0),
(2.8, 8, 45.83333333, 2.75, 46.66666667, 13.0),
(3.11320755, 75, 56.0, 3.24, 52.83018868, 33.0)],
dtype=[('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'), ('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8')])

Related

Ndarray of lists with mix of floats and integers?

I have an array of lists (corr: N-Dimensional array)
s_cluster_data
Out[410]:
array([[ 0.9607611 , 0.19538569, 0. ],
[ 1.03990463, 0.22274072, 0. ],
[ 1.09430461, 0.22603228, 0. ],
...,
[ 1.10802461, -0.54190659, 2. ],
[ 0.9288097 , -0.49195368, 2. ],
[ 0.81606986, -0.47141286, 2. ]])
I would like to make the third column an integer. I've tried to assign dtype as such
dtype=[('A','f8'),('B','f8'),('C','i4')]
s_cluster_data = np.array(s_cluster_data, dtype=dtype)
s_cluster_data
Out[414]:
array([[( 0.9607611 , 0.9607611 , 0), ( 0.19538569, 0.19538569, 0),
( 0. , 0. , 0)],
[( 1.03990463, 1.03990463, 1), ( 0.22274072, 0.22274072, 0),
( 0. , 0. , 0)],
[( 1.09430461, 1.09430461, 1), ( 0.22603228, 0.22603228, 0),
( 0. , 0. , 0)],
...,
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<i4')])
Which creates an array of lists of tuples (corr: array with dtype), with each index in lists becoming a separate tuple.
I've also tried to take the array apart, read it in as an array of tuples, and then return it to its original state.
list_cluster = s_cluster_data.tolist() # py list
tuple_cluster = [tuple(l) for l in list_cluster] # list of tuples
dtype=[('A','f8'),('B','f8'),('C','i4')]
sd_cluster_data = np.array(tuple_cluster, dtype=dtype) # array of tuples with dtype
sd_cluster_data
Out: ...,
(1.0020371 , -0.56034073, 2), (1.18264038, -0.55773913, 2),
(1.00550194, -0.55359672, 2), (1.10802461, -0.54190659, 2),
(0.9288097 , -0.49195368, 2), (0.81606986, -0.47141286, 2)],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<i4')])
So ideally the above output is what I would like to see, but with array of lists, not array of tuples.
I tried to take the array apart and merge it back as lists
x_val_arr = np.array([x[0] for x in sd_cluster_data])
y_val_arr = np.array([x[1] for x in sd_cluster_data])
cluster_id_arr = np.array([x[2] for x in sd_cluster_data])
coordinates_arr = np.stack((x_val_arr,y_val_arr,cluster_id_arr),axis=1)
But once again I get floats in the third column
coordinates_arr
Out[416]:
array([[ 0.9607611 , 0.19538569, 0. ],
[ 1.03990463, 0.22274072, 0. ],
[ 1.09430461, 0.22603228, 0. ],
...,
[ 1.10802461, -0.54190659, 2. ],
[ 0.9288097 , -0.49195368, 2. ],
[ 0.81606986, -0.47141286, 2. ]])
So this is probably a question due to my lack of domain knowledge, but do ndarrays not support mixed data types if it consists of lists, not tuples?
In [87]: import numpy.lib.recfunctions as rf
In [88]: arr = np.array([[ 0.9607611 , 0.19538569, 0. ],
...: [ 1.03990463, 0.22274072, 0. ],
...: [ 1.09430461, 0.22603228, 0. ],
...: [ 1.10802461, -0.54190659, 2. ],
...: [ 0.9288097 , -0.49195368, 2. ],
...: [ 0.81606986, -0.47141286, 2. ]])
In [89]: arr
Out[89]:
array([[ 0.9607611 , 0.19538569, 0. ],
[ 1.03990463, 0.22274072, 0. ],
[ 1.09430461, 0.22603228, 0. ],
[ 1.10802461, -0.54190659, 2. ],
[ 0.9288097 , -0.49195368, 2. ],
[ 0.81606986, -0.47141286, 2. ]])
There are various ways of constructing a structured array from a 2d array like this. Recent numpy versions provide a convenient unstructured_to_structured function:
In [90]: dt = np.dtype([('A','f8'),('B','f8'),('C','i4')])
In [92]: rf.unstructured_to_structured(arr, dt)
Out[92]:
array([(0.9607611 , 0.19538569, 0), (1.03990463, 0.22274072, 0),
(1.09430461, 0.22603228, 0), (1.10802461, -0.54190659, 2),
(0.9288097 , -0.49195368, 2), (0.81606986, -0.47141286, 2)],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<i4')])
Each row of arr has been turned into a structured record, displayed as a tuple.
A functionally equivalent approach is to create a 'blank' array, and assign field values by name:
In [93]: res = np.zeros(arr.shape[0], dt)
In [94]: res
Out[94]:
array([(0., 0., 0), (0., 0., 0), (0., 0., 0), (0., 0., 0), (0., 0., 0),
(0., 0., 0)], dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<i4')])
In [95]: res['A'] = arr[:,0]
In [96]: res['B'] = arr[:,1]
In [97]: res['C'] = arr[:,2]
In [98]: res
Out[98]:
array([(0.9607611 , 0.19538569, 0), (1.03990463, 0.22274072, 0),
(1.09430461, 0.22603228, 0), (1.10802461, -0.54190659, 2),
(0.9288097 , -0.49195368, 2), (0.81606986, -0.47141286, 2)],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<i4')])
and to belabor the point, we could also make the structured array from a list of tuples:
In [104]: np.array([tuple(row) for row in arr.tolist()], dt)
Out[104]:
array([(0.9607611 , 0.19538569, 0), (1.03990463, 0.22274072, 0),
(1.09430461, 0.22603228, 0), (1.10802461, -0.54190659, 2),
(0.9288097 , -0.49195368, 2), (0.81606986, -0.47141286, 2)],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<i4')])
The problem might be in the way you pass data to np.array. The rows of the array should be tuples.
a = np.array([( 0.9607611 , 0.19538569, 0. )], dtype='f8, f8, i4')
will create an array
array([(0.9607611, 0.19538569, 0)],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])
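Going the other way, recfunctions also has structured_to_unstructured, which flattens a structured array back to a plain 2d array with a common dtype (the int field is upcast to float). A minimal sketch with a small stand-in array:

```python
import numpy as np
import numpy.lib.recfunctions as rf

dt = np.dtype([('A', 'f8'), ('B', 'f8'), ('C', 'i4')])
s = np.array([(0.96, 0.19, 0), (1.11, -0.54, 2)], dtype=dt)

plain = rf.structured_to_unstructured(s)   # fields become columns
print(plain.shape)    # (2, 3)
print(plain.dtype)    # float64, the common type of f8 and i4
```

So mixed types only survive while the data is structured; flattening forces one dtype again.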

Zip arrays in Python

I have one 2D array and one 1D array. I would like to zip them together.
import numpy as np
arr2D = [[5.88964708e-02, -2.38142395e-01, -4.95821417e-01, -7.07269274e-01],
[0.53363666, 0.1654723 , -0.16439857, -0.44880487]]
arr2D = np.asarray(arr2D)
arr1D = np.arange(7, 8.5+0.5, 0.5)
arr1D = np.asarray(arr1D)
res = np.array(list(zip(arr1D, arr2D)))
print(res)
which results in:
[[7.0 array([ 0.05889647, -0.2381424 , -0.49582142, -0.70726927])]
[7.5 array([ 0.53363666, 0.1654723 , -0.16439857, -0.44880487])]]
But I am trying to get:
[[(7.0, 0.05889647), (7.5, -0.2381424), (8.0, -0.49582142), (8.5, -0.70726927)]]
[[(7.0, 0.53363666), (7.5, 0.1654723), (8.0, -0.16439857), (8.5, -0.44880487)]]
How can I do this?
You were almost there! Here's a solution:
list(map(lambda x: list(zip(arr1D, x)), arr2D))
[[(7.0, 0.0588964708),
(7.5, -0.238142395),
(8.0, -0.495821417),
(8.5, -0.707269274)],
[(7.0, 0.53363666), (7.5, 0.1654723), (8.0, -0.16439857), (8.5, -0.44880487)]]
In [382]: arr2D = [[5.88964708e-02, -2.38142395e-01, -4.95821417e-01, -7.07269274e-01],
...: [0.53363666, 0.1654723 , -0.16439857, -0.44880487]]
...: arr2D = np.asarray(arr2D)
...: arr1D = np.arange(7, 8.5+0.5, 0.5) # already an array
In [384]: arr2D.shape
Out[384]: (2, 4)
In [385]: arr1D.shape
Out[385]: (4,)
zip iterates on the first dimension of the arguments, and stops with the shortest:
In [387]: [[i,j[0:2]] for i,j in zip(arr1D, arr2D)]
Out[387]:
[[7.0, array([ 0.05889647, -0.2381424 ])],
[7.5, array([0.53363666, 0.1654723 ])]]
If we transpose the 2d, so it is now (4,2), we get a four element list:
In [389]: [[i,j] for i,j in zip(arr1D, arr2D.T)]
Out[389]:
[[7.0, array([0.05889647, 0.53363666])],
[7.5, array([-0.2381424, 0.1654723])],
[8.0, array([-0.49582142, -0.16439857])],
[8.5, array([-0.70726927, -0.44880487])]]
We could add another level of iteration to get the desired pairs:
In [390]: [[(i,k) for k in j] for i,j in zip(arr1D, arr2D.T)]
Out[390]:
[[(7.0, 0.0588964708), (7.0, 0.53363666)],
[(7.5, -0.238142395), (7.5, 0.1654723)],
[(8.0, -0.495821417), (8.0, -0.16439857)],
[(8.5, -0.707269274), (8.5, -0.44880487)]]
and with list transpose idiom:
In [391]: list(zip(*_))
Out[391]:
[((7.0, 0.0588964708), (7.5, -0.238142395), (8.0, -0.495821417), (8.5, -0.707269274)),
((7.0, 0.53363666), (7.5, 0.1654723), (8.0, -0.16439857), (8.5, -0.44880487))]
Or we can get that result directly by moving the zip into an inner loop:
[[(i,k) for i,k in zip(arr1D, row)] for row in arr2D]
In other words, you are pairing the elements of arr1D with the elements of each row of 2D, rather than with the whole row.
Since you already have arrays, one of the array solutions might be better, but I'm trying to clarify what is happening with zip.
numpy
There are various ways of building a numpy array from these arrays. Since you want to repeat the arr1D values:
This repeat makes a (4,2) array that matches arr2D (tile also works):
In [400]: arr1D[None,:].repeat(2,0)
Out[400]:
array([[7. , 7.5, 8. , 8.5],
[7. , 7.5, 8. , 8.5]])
In [401]: arr2D
Out[401]:
array([[ 0.05889647, -0.2381424 , -0.49582142, -0.70726927],
[ 0.53363666, 0.1654723 , -0.16439857, -0.44880487]])
which can then be joined on a new trailing axis:
In [402]: np.stack((_400, arr2D), axis=2)
Out[402]:
array([[[ 7. , 0.05889647],
[ 7.5 , -0.2381424 ],
[ 8. , -0.49582142],
[ 8.5 , -0.70726927]],
[[ 7. , 0.53363666],
[ 7.5 , 0.1654723 ],
[ 8. , -0.16439857],
[ 8.5 , -0.44880487]]])
Or a structured array with tuple-like display:
In [406]: arr = np.zeros((2,4), dtype='f,f')
In [407]: arr
Out[407]:
array([[(0., 0.), (0., 0.), (0., 0.), (0., 0.)],
[(0., 0.), (0., 0.), (0., 0.), (0., 0.)]],
dtype=[('f0', '<f4'), ('f1', '<f4')])
In [408]: arr['f1'] = arr2D
In [409]: arr['f0'] = _400
In [410]: arr
Out[410]:
array([[(7. , 0.05889647), (7.5, -0.2381424 ), (8. , -0.49582142),
(8.5, -0.70726925)],
[(7. , 0.5336367 ), (7.5, 0.1654723 ), (8. , -0.16439857),
(8.5, -0.44880486)]], dtype=[('f0', '<f4'), ('f1', '<f4')])
You can use numpy.tile to expand the 1d array, and then use numpy.dstack, namely:
import numpy as np
arr2D = np.array([[5.88964708e-02, -2.38142395e-01, -4.95821417e-01, -7.07269274e-01],
[0.53363666, 0.1654723 , -0.16439857, -0.44880487]])
arr1D = np.arange(7, 8.5+0.5, 0.5)
np.dstack([np.tile(arr1D, (2,1)), arr2D])
array([[[ 7. , 0.05889647],
[ 7.5 , -0.2381424 ],
[ 8. , -0.49582142],
[ 8.5 , -0.70726927]],
[[ 7. , 0.53363666],
[ 7.5 , 0.1654723 ],
[ 8. , -0.16439857],
[ 8.5 , -0.44880487]]])
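np.broadcast_to is a copy-free alternative to tile here; it gives a read-only (2, 4) view of arr1D that stack can then consume. A sketch with the same shapes (values abbreviated):

```python
import numpy as np

arr2D = np.array([[0.0589, -0.2381, -0.4958, -0.7073],
                  [0.5336,  0.1655, -0.1644, -0.4488]])
arr1D = np.arange(7, 9.0, 0.5)          # [7. , 7.5, 8. , 8.5]

# broadcast_to makes a read-only (2, 4) view; no data is copied yet
rep = np.broadcast_to(arr1D, arr2D.shape)
res = np.stack((rep, arr2D), axis=2)    # shape (2, 4, 2)
print(res.shape)
```

For arrays this small it makes no practical difference, but it avoids materializing the repeated row.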

make a numpy array with shape and offset argument in another style

I wanted to access my array both as a 3-element entity (3d position) and individual element (each of x,y,z coordinate).
After some researching, I ended up doing the following.
>>> import numpy as np
>>> arr = np.zeros(5, dtype={'pos': (('<f8', (3,)), 0),
'x': (('<f8', 1), 0),
'y': (('<f8', 1), 8),
'z': (('<f8', 1), 16)})
>>> arr["x"] = 1
>>> arr["y"] = 2
>>> arr["z"] = 3
# I can access the whole array by "pos"
>>> arr["pos"]
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])
However, I've always been making array in this style:
>>> arr = np.zeros(10, dtype=[("pos", "f8", (3,))])
But I can't find a way to specify both the offset and the shape of the element at the same time in this style. Is there a way to do this?
In reference to the docs page, https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.dtypes.html
you are using the fields dictionary form, with (data-type, offset) value
{'field1': ..., 'field2': ..., ...}
dt1 = {'pos': (('<f8', (3,)), 0),
'x': (('<f8', 1), 0),
'y': (('<f8', 1), 8),
'z': (('<f8', 1), 16)}
The display for the resulting dtype is the other dictionary format:
{'names': ..., 'formats': ..., 'offsets': ..., 'titles': ..., 'itemsize': ...}
In [15]: np.dtype(dt1)
Out[15]: dtype({'names':['x','pos','y','z'],
'formats':['<f8',('<f8', (3,)),'<f8','<f8'],
'offsets':[0,0,8,16], 'itemsize':24})
In [16]: np.dtype(dt1).fields
Out[16]:
mappingproxy({'pos': (dtype(('<f8', (3,))), 0),
'x': (dtype('float64'), 0),
'y': (dtype('float64'), 8),
'z': (dtype('float64'), 16)})
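That names/formats/offsets dictionary form can also be used directly to build the union dtype, which may read more clearly than the fields-dictionary style. A sketch:

```python
import numpy as np

# 'pos' overlays the three scalar fields via offsets 0, 8, 16
dt2 = np.dtype({'names':    ['x', 'pos', 'y', 'z'],
                'formats':  ['<f8', ('<f8', (3,)), '<f8', '<f8'],
                'offsets':  [0, 0, 8, 16],
                'itemsize': 24})

arr = np.zeros(2, dtype=dt2)
arr['x'], arr['y'], arr['z'] = 1, 2, 3   # writes land inside 'pos' too
print(arr['pos'])                        # each row reads back [1., 2., 3.]
```

Because the fields share storage, writing x, y, z and reading pos are two views of the same 24 bytes per record.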
offsets aren't mentioned anywhere else on the documentation page.
The last format is a union type. It's a little unclear whether that's allowed or discouraged. The examples don't seem to work. There have been some changes in how multifield indexing works, and that may have affected this.
Let's play around with various ways of viewing the array (filled here with some sample values):
In [25]: arr
Out[25]:
array([(0., [ 0. , 10. , 0. ], 10., 0. ),
(1., [ 1. , 11. , 0.1], 11., 0.1),
(2., [ 2. , 12. , 0.2], 12., 0.2),
(3., [ 3. , 13. , 0.3], 13., 0.3),
(4., [ 4. , 14. , 0.4], 14., 0.4)],
dtype={'names':['x','pos','y','z'], 'formats':['<f8',('<f8', (3,)),'<f8','<f8'], 'offsets':[0,0,8,16], 'itemsize':24})
In [29]: dt3=[('x','<f8'),('y','<f8'),('z','<f8')]
In [30]: np.dtype(dt3)
Out[30]: dtype([('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
In [31]: np.dtype(dt3).fields
Out[31]:
mappingproxy({'x': (dtype('float64'), 0),
'y': (dtype('float64'), 8),
'z': (dtype('float64'), 16)})
In [32]: arr.view(dt3)
Out[32]:
array([(0., 10., 0. ), (1., 11., 0.1), (2., 12., 0.2), (3., 13., 0.3),
(4., 14., 0.4)], dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
In [33]: arr['pos']
Out[33]:
array([[ 0. , 10. , 0. ],
[ 1. , 11. , 0.1],
[ 2. , 12. , 0.2],
[ 3. , 13. , 0.3],
[ 4. , 14. , 0.4]])
In [35]: arr.view('f8').reshape(5,3)
Out[35]:
array([[ 0. , 10. , 0. ],
[ 1. , 11. , 0.1],
[ 2. , 12. , 0.2],
[ 3. , 13. , 0.3],
[ 4. , 14. , 0.4]])
In [36]: dt4 = [('pos', '<f8', (3,))]
In [37]: arr.view(dt4)
Out[37]:
array([([ 0. , 10. , 0. ],), ([ 1. , 11. , 0.1],),
([ 2. , 12. , 0.2],), ([ 3. , 13. , 0.3],),
([ 4. , 14. , 0.4],)], dtype=[('pos', '<f8', (3,))])
In [38]: arr.view(dt4)['pos']
Out[38]:
array([[ 0. , 10. , 0. ],
[ 1. , 11. , 0.1],
[ 2. , 12. , 0.2],
[ 3. , 13. , 0.3],
[ 4. , 14. , 0.4]])

Using np.view() with changes to structured arrays in numpy 1.14

I have a numpy structured array with a mixed dtype (i.e., floats, ints, and strings). I want to select some of the columns of the array (all of which contain only floats) and then get the sum, by column, of the rows, as a standard numpy array. The initial array takes a form comparable to:
some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
For this example, I'd like to take the sum of columns A and B, yielding np.array([7.5, 11.15]). With numpy ≤1.13, I could do that as follows:
get_cols = ['A', 'B']
desired_sum = np.sum(some_data[get_cols].view(('<f8', len(get_cols))), axis=0)
With the release of numpy 1.14, this method now fails with ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged, which is a result of the changes made in numpy 1.14 to the handling of structured arrays. (User bbengfort commented about the FutureWarning given about this change in this answer.)
In light of these changes to structured arrays, how can I obtain the desired sum from the structured array subset?
In [165]: some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)], dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
...:
In [166]: get_cols = ['A','B']
In [167]: some_data[get_cols]
Out[167]:
array([( 3.5, 2.15), ( 2.8, 5.3 ), ( 1.2, 3.7 )],
dtype=[('A', '<f8'), ('B', '<f8')])
Simply reading the field values is fine. In 1.13 we get a warning
In [168]: some_data[get_cols].view(('<f8', len(get_cols)))
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.
This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
#!/usr/bin/python3
Out[168]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
With the recommended copy, no warning:
In [169]: some_data[get_cols].copy().view(('<f8', len(get_cols)))
Out[169]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
In [171]: np.sum(_, axis=0)
Out[171]: array([ 7.5 , 11.15])
In your original array,
dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
An A,B slice would have the two f8 items interspersed with the U20 item. Changing the view dtype of such a mix is problematic. That's why working with a copy is more reliable.
Since U20 takes up 4*20 bytes, the total itemsize is 96, a multiple of 8. We can convert the whole thing to f8, reshape and 'throw away' the U20 columns:
In [183]: some_data.view('f8').reshape(3,-1)[:,-2:]
Out[183]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
It's not very pretty and I don't recommend it, but it may give some insight into how structured data is arranged.
view on a structured array is useful at times, but often a bit tricky to use correctly.
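For numpy 1.16 and later, numpy.lib.recfunctions.structured_to_unstructured is the documented replacement for the multi-field view trick; a sketch of the same sum:

```python
import numpy as np
import numpy.lib.recfunctions as rf

some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
                     dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

# multi-field index, then flatten the selected fields to a (3, 2) float array
cols = rf.structured_to_unstructured(some_data[['A', 'B']])
print(cols.sum(axis=0))   # column sums: 7.5 and 11.15
```

It handles the padding left by the multi-field index, so no copy/view gymnastics are needed.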
If the 2 numeric fields are usually used together, I'd recommend a compound dtype like:
In [184]: some_data = np.array([('foo', [3.5, 2.15]), ('bar', [2.8, 5.3]),
     ...:                       ('baz', [1.2, 3.7])],
     ...:                      dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [185]: some_data
Out[185]:
array([('foo', [ 3.5 , 2.15]), ('bar', [ 2.8 , 5.3 ]),
('baz', [ 1.2 , 3.7 ])],
dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [186]: some_data['AB']
Out[186]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
genfromtxt accepts this style of dtype.

Read/Write Python List from/to Binary file

According to Python Cookbook, below is how to write a list of tuple into binary file:
from struct import Struct
def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

# Example
if __name__ == '__main__':
    records = [ (1, 2.3, 4.5),
                (6, 7.8, 9.0),
                (12, 13.4, 56.7) ]
    with open('data.b', 'wb') as f:
        write_records(records, '<idd', f)
And it works well.
For reading (large amount of binary data), the author recommended the following:
>>> import numpy as np
>>> f = open('data.b', 'rb')
>>> records = np.fromfile(f, dtype='<i,<d,<d')
>>> records
array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
>>> records[0]
(1, 2.3, 4.5)
>>> records[1]
(6, 7.8, 9.0)
>>>
It is also good, but this records array is not a normal numpy array. For instance, type(records[0]) will return <type 'numpy.void'>. Even worse, I cannot extract the first column using X = records[:, 0].
Is there a way to efficiently load list(or any other types) from binary file into a normal numpy array?
Thx in advance.
In [196]: rec = np.fromfile('data.b', dtype='<i,<d,<d')
In [198]: rec
Out[198]:
array([( 1, 2.3, 4.5), ( 6, 7.8, 9. ), (12, 13.4, 56.7)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
This is a 1d structured array
In [199]: rec['f0']
Out[199]: array([ 1, 6, 12], dtype=int32)
In [200]: rec.shape
Out[200]: (3,)
In [201]: rec.dtype
Out[201]: dtype([('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
Note that its tolist looks identical to your original records:
In [202]: rec.tolist()
Out[202]: [(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)]
In [203]: records
Out[203]: [(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)]
You could create a 2d array from either list with:
In [204]: arr2 = np.array(rec.tolist())
In [205]: arr2
Out[205]:
array([[ 1. , 2.3, 4.5],
[ 6. , 7.8, 9. ],
[ 12. , 13.4, 56.7]])
In [206]: arr2.shape
Out[206]: (3, 3)
There are other ways of converting a structured array to a 'regular' array, but this is the simplest and most consistent.
The tolist of a regular array uses nested lists. The tuples in the structured version are intended to convey a difference:
In [207]: arr2.tolist()
Out[207]: [[1.0, 2.3, 4.5], [6.0, 7.8, 9.0], [12.0, 13.4, 56.7]]
In the structured array the first field is integer. In the regular array the first column is same as the others, float.
If the binary file contained all floats, you could load it as a 1d of floats and reshape
In [208]: with open('data.f', 'wb') as f:
...: write_records(records, 'ddd', f)
In [210]: rec2 = np.fromfile('data.f', dtype='<d')
In [211]: rec2
Out[211]: array([ 1. , 2.3, 4.5, 6. , 7.8, 9. , 12. , 13.4, 56.7])
But to take advantage of any record structure in the binary file, you have to load it by records as well, which means a structured array:
In [213]: rec3 = np.fromfile('data.f', dtype='d,d,d')
In [214]: rec3
Out[214]:
array([( 1., 2.3, 4.5), ( 6., 7.8, 9. ), ( 12., 13.4, 56.7)],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8')])
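If a plain 2d float array is the goal and numpy 1.16+ is available, recfunctions.structured_to_unstructured does the conversion without the tolist round trip, upcasting the int field to float just as np.array(rec.tolist()) does:

```python
import numpy as np
import numpy.lib.recfunctions as rf

# same records as read back from data.b
rec = np.array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
               dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])

arr2 = rf.structured_to_unstructured(rec)  # (3, 3) float64 array
print(arr2[:, 0])                          # column indexing now works
```

After that, X = arr2[:, 0] and friends behave as with any regular array.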
