Numpy structured array fails basic numpy operations - python

I wish to manipulate named numpy arrays (add, multiply, concatenate, ...)
I defined structured arrays:
types=[('name1', int), ('name2', float)]
a = np.array([2, 3.3], dtype=types)
b = np.array([4, 5.35], dtype=types)
a and b are created such that
a
array([(2, 2. ), (3, 3.3)], dtype=[('name1', '<i8'), ('name2', '<f8')])
but I really want a['name1'] to be just 2, not array([2, 3])
Similarly, I want a['name2'] to be just 3.3
This way I could sum c=a+b, which is expected to be an array of length 2, where c['name1'] is 6 and c['name2'] is 8.65
How can I do that?

Define a structured array:
In [125]: dt = np.dtype([('f0','U10'),('f1',int),('f2',float)])
In [126]: a = np.array([('one',2,3),('two',4,5.5),('three',6,7)],dt)
In [127]: a
Out[127]:
array([('one', 2, 3. ), ('two', 4, 5.5), ('three', 6, 7. )],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
And an object dtype array with the same data
In [128]: A = np.array([('one',2,3),('two',4,5.5),('three',6,7)],object)
In [129]: A
Out[129]:
array([['one', 2, 3],
['two', 4, 5.5],
['three', 6, 7]], dtype=object)
Addition works because it (iteratively) delegates the action to all elements
In [130]: A+A
Out[130]:
array([['oneone', 4, 6],
['twotwo', 8, 11.0],
['threethree', 12, 14]], dtype=object)
structured addition does not work
In [131]: a+a
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-6ff992d1ddd5> in <module>()
----> 1 a+a
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')]) dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
Let's try addition field by field:
In [132]: aa = np.zeros_like(a)
In [133]: for n in a.dtype.names: aa[n] = a[n] + a[n]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-133-68476e5d579e> in <module>()
----> 1 for n in a.dtype.names: aa[n] = a[n] + a[n]
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U10') dtype('<U10') dtype('<U10')
Oops, doesn't quite work - string dtype doesn't have addition. But we can handle the string field separately:
In [134]: aa['f0'] = a['f0']
In [135]: for n in a.dtype.names[1:]: aa[n] = a[n] + a[n]
In [136]: aa
Out[136]:
array([('one', 4, 6.), ('two', 8, 11.), ('three', 12, 14.)],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
Or we can change the string dtype to object:
In [137]: dt1 = np.dtype([('f0',object),('f1',int),('f2',float)])
In [138]: b = np.array([('one',2,3),('two',4,5.5),('three',6,7)],dt1)
In [139]: b
Out[139]:
array([('one', 2, 3. ), ('two', 4, 5.5), ('three', 6, 7. )],
dtype=[('f0', 'O'), ('f1', '<i8'), ('f2', '<f8')])
In [140]: bb = np.zeros_like(b)
In [141]: for n in a.dtype.names: bb[n] = b[n] + b[n]
In [142]: bb
Out[142]:
array([('oneone', 4, 6.), ('twotwo', 8, 11.), ('threethree', 12, 14.)],
dtype=[('f0', 'O'), ('f1', '<i8'), ('f2', '<f8')])
Python strings do have a __add__, defined as concatenate. Numpy dtype strings don't have that definition. Python strings can be multiplied by an integer, but raise an error otherwise.
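As a quick illustration (not part of the original session), numpy does provide element-wise string concatenation through np.char.add, even in versions where the + ufunc has no loop for fixed-width string dtypes:

import numpy as np

s = np.array(['one', 'two', 'three'])   # fixed-width unicode dtype, '<U5'
np.char.add(s, s)
# array(['oneone', 'twotwo', 'threethree'], dtype='<U10')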
My guess is that pandas resorts to something like what I just did. I doubt if it implements dataframe addition in compiled code (except in some special cases). It probably works column by column if the dtype allows. It also seems to freely switch to object dtype (for example a column with both np.nan and a string). Timings might confirm my guess (I don't have pandas installed on this OS).

According to the documentation, the right way to make your arrays is:
types=[('name1', int), ('name2', float)]
a = np.array([(2, 3.3)], dtype=types)
b = np.array([(4, 5.35)], dtype=types)
This generates a and b as you want them:
a['name1']
array([2])
But summing them is not as straightforward as with conventional numpy arrays, so I also suggest using pandas:
import pandas as pd
names=['name1','name2']
a=pd.Series([2,3.3],index=names)
b=pd.Series([4,5.35],index=names)
a+b
name1 6.00
name2 8.65
dtype: float64
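If you would rather stay with the structured arrays, a minimal sketch of the field-by-field loop (the same idea as in the first answer) applied to the corrected a and b:

import numpy as np

types = [('name1', int), ('name2', float)]
a = np.array([(2, 3.3)], dtype=types)
b = np.array([(4, 5.35)], dtype=types)

c = np.zeros_like(a)              # same shape and dtype as a
for n in a.dtype.names:
    c[n] = a[n] + b[n]
# c['name1'] -> array([6]), c['name2'] -> array([8.65])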


numpy genfromtxt - infer column header if headers not provided

I understand that with genfromtxt, the defaultfmt parameter can be used to infer default column names, which is useful if column names are not in the input data. And defaultfmt, if not provided, defaults to "f%i". E.g.
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])
So here we have autogenerated column names f0, f1, f2.
But what if I want numpy to infer both column headers and data types? I thought you could do that with dtype=None, like this:
>>> data3 = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3, dtype=None, ???) # some parameter combo
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
I still want the automatically generated column names of f0, f1...etc. And I want numpy to automatically determine the datatypes based on the data, which I thought was the whole point of doing dtype=None.
EDIT
But unfortunately that doesn't ALWAYS work.
This case works when I have both floats and ints.
>>> data3b = StringIO("1 2 3.0\n 4 5 6.0")
>>> np.genfromtxt(data3b, dtype=None)
array([(1, 2, 3.), (4, 5, 6.)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<f8')])
So numpy correctly inferred a datatype of i8 for the first 2 columns, and f8 for the last column.
But if I provide all ints, the inferred column names disappear.
>>> data3c = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3c, dtype=None)
array([[1, 2, 3],
[4, 5, 6]])
My identical code may or may not work depending on the input data? That doesn't sound right.
And yes I know there's pandas. But I'm not using pandas on purpose. So please bear with me on that.
In [2]: txt = '''1,2,3
...: 4,5,6'''.splitlines()
Default 2d array of floats:
In [6]: np.genfromtxt(txt, delimiter=',',encoding=None)
Out[6]:
array([[1., 2., 3.],
[4., 5., 6.]])
2d of ints:
In [7]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None)
Out[7]:
array([[1, 2, 3],
[4, 5, 6]])
Specified field dtypes:
In [8]: np.genfromtxt(txt, dtype='i,i,i', delimiter=',',encoding=None)
Out[8]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
Specified field names:
In [9]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None, names=['a','b','c'])
Out[9]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
The unstructured array can be converted to structured with:
In [10]: import numpy.lib.recfunctions as rf
In [11]: rf.unstructured_to_structured(Out[7])
Out[11]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
In numpy the default, preferred array is multidimensional numeric. That's why it produces Out[7] if it can.
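For completeness (this goes beyond the original answer), the same recfunctions module also provides the reverse conversion, from a structured array back to a plain 2d numeric array:

import numpy as np
import numpy.lib.recfunctions as rf

s = np.array([(1, 2, 3), (4, 5, 6)],
             dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
rf.structured_to_unstructured(s)
# array([[1, 2, 3],
#        [4, 5, 6]])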

python h5py: can I store a dataset in which different columns have different types?

Suppose I have a table with many columns; only a few columns are float type, the others are small integers, for example:
col1, col2, col3, col4
1.31 1 2 3
2.33 3 5 4
...
How can I store this efficiently? Suppose I use np.float32 for the whole dataset: storage is wasted, because the other columns only hold small integers, which don't need that much space. If I use np.int16, the float column is no longer exact, which is also not what I want. How do I deal with a situation like this?
Suppose I also have a string column, which confuses me even more; how should I store the data?
col1, col2, col3, col4, col5
1.31 1 2 3 "a"
2.33 3 5 4 "b"
...
Edit:
To make things simpler, let's suppose the string column has fixed-length strings only, for example, of length 3.
I'm going to demonstrate the structured array approach:
I'm guessing you are starting with a csv file 'table'. If not, it's still the easiest way to turn your sample into an array:
In [40]: txt = '''col1, col2, col3, col4, col5
...: 1.31 1 2 3 "a"
...: 2.33 3 5 4 "b"
...: '''
In [42]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)
In [43]: data
Out[43]:
array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])
With these parameters, genfromtxt takes care of creating a structured array. Note it is a 1d array with 5 fields. Field dtypes are determined from the data.
In [44]: import h5py
...
In [46]: f = h5py.File('struct.h5', 'w')
In [48]: ds = f.create_dataset('data',data=data)
...
TypeError: No conversion path for dtype: dtype('<U3')
But h5py has problems saving the unicode strings (default for py3). There may be ways around that, but here it will be simpler to convert the string dtype to bytestrings. Besides, that'll be more compact.
To convert that, I'll make a new dtype, and use astype. Alternatively I could specify the dtypes in the genfromtxt call.
In [49]: data.dtype
Out[49]: dtype([('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])
In [50]: data.dtype.descr
Out[50]:
[('col1', '<f8'),
('col2', '<i8'),
('col3', '<i8'),
('col4', '<i8'),
('col5', '<U3')]
In [51]: dt1 = data.dtype.descr
In [52]: dt1[-1] = ('col5', 'S3')
In [53]: data.astype(dt1)
Out[53]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
Now it saves the array without problem:
In [54]: data1 = data.astype(dt1)
In [55]: data1
Out[55]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
In [56]: ds = f.create_dataset('data',data=data1)
In [57]: ds
Out[57]: <HDF5 dataset "data": shape (2,), type "|V35">
In [58]: ds[:]
Out[58]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
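As an aside, the alternative mentioned earlier (specifying the dtypes in the genfromtxt call instead of converting with astype afterwards) might look like this; the exact field sizes here are an assumption based on the sample data:

# skip the header line and supply the full structured dtype up front
dt = [('col1', 'f8'), ('col2', 'i8'), ('col3', 'i8'),
      ('col4', 'i8'), ('col5', 'S3')]
data2 = np.genfromtxt(txt.splitlines()[1:], dtype=dt, encoding=None)
# data2['col5'] then holds 3-byte bytestrings, ready for h5py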
I could make further modifications, shortening one or more of the int fields:
In [60]: dt1[1] = ('col2','i2')
In [61]: dt1[2] = ('col3','i2')
In [62]: dt1
Out[62]:
[('col1', '<f8'),
('col2', 'i2'),
('col3', 'i2'),
('col4', '<i8'),
('col5', 'S3')]
In [63]: data1 = data.astype(dt1)
In [64]: data1
Out[64]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
In [65]: ds1 = f.create_dataset('data1',data=data1)
ds1 has more compact storage, 'V23' vs 'V35' (8+2+2+8+3 = 23 bytes per record instead of 8+8+8+8+3 = 35)
In [67]: ds1
Out[67]: <HDF5 dataset "data1": shape (2,), type "|V23">
In [68]: ds1[:]
Out[68]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])

Adding a data column to a numpy rec array with only one row

I need to add a column of data to a numpy rec array. I have seen many answers floating around here, but they do not seem to work for a rec array that only contains one row...
Let's say I have a rec array x:
>>> x = np.rec.array([1, 2, 3])
>>> print(x)
rec.array((1, 2, 3),
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
and I want to append the value 4 to a new column with its own field name and data type, such as
rec.array((1, 2, 3, 4),
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])
If I try to add a column using the normal append_fields approach:
>>> np.lib.recfunctions.append_fields(x, 'f3', 4, dtypes='<i8',
usemask=False, asrecarray=True)
then I ultimately end up with
TypeError: len() of unsized object
It turns out that for a rec array with only one row, len(x) does not work, while x.size does. If I instead use np.hstack(), I get TypeError: invalid type promotion, and if I try np.c_, I get an undesired result
>>> np.c_[x, 4]
array([[(1, 2, 3), (4, 4, 4)]],
dtype=(numpy.record, [('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')]))
Create the initial array so that it has shape (1,); note the extra brackets:
In [17]: x = np.rec.array([[1, 2, 3]])
(If x is an input that you can't control that way, you could use x = np.atleast_1d(x) before using it in append_fields().)
Then make sure the value given in append_fields is also a sequence of length 1:
In [18]: np.lib.recfunctions.append_fields(x, 'f3', [4], dtypes='<i8',
...: usemask=False, asrecarray=True)
Out[18]:
rec.array([(1, 2, 3, 4)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])
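As a side note, the np.atleast_1d route mentioned above would look roughly like this (a sketch, not from the original answer):

import numpy as np
import numpy.lib.recfunctions as rfn

x = np.rec.array([1, 2, 3])      # 0-d record array, as in the question
x1 = np.atleast_1d(x)            # shape (1,), so len() works again

rfn.append_fields(x1, 'f3', [4], dtypes='<i8', usemask=False, asrecarray=True)
# rec.array([(1, 2, 3, 4)],
#           dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])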
Here's a way of doing the job without recfunctions:
In [64]: x = np.rec.array((1, 2, 3))
In [65]: y=np.zeros(x.shape, dtype=x.dtype.descr+[('f3','<i4')])
In [66]: y
Out[66]:
array((0, 0, 0, 0),
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [67]: for name in x.dtype.names: y[name] = x[name]
In [68]: y['f3']=4
In [69]: y
Out[69]:
array((1, 2, 3, 4),
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
From what I've seen in recfunctions code, I think it's just as fast. Of course for a single row speed isn't an issue. In general those functions create a new 'blank' array with the target dtype, and copy fields, by name (possibly recursively) from sources to target. Usually an array has many more records than fields, so iteration on fields is not, relatively speaking, slow.

numpy - Change/Specify dtypes of masked array columns

I have a csv-file containing a lot of data that I want to read as a masked array. I've done so using the following:
data=np.recfromcsv(filename,case_sensitive=True,usemask=True)
which works just fine. However, my problem is that the data are either strings, integers, or floats. What I want to do now is convert all the integers into floats, i.e. turn all the "1"s into "1.0"s etc. while preserving everything else.
Additionally, I am looking for a generic solution. So simply specifying the desired types manually won't do since the csv-file (including the number of columns) changes.
I've tried astype, but since the array also has string entries, that won't work. Or am I missing something?
Thanks.
I haven't used recfromcsv, but looking at its code I see it uses np.genfromtxt, followed by a masked records construction.
I'd suggest giving a small sample of csv text (3 or so lines) and showing the resulting data. We need to see the dtype in particular.
It may also be useful to start with genfromtxt, skipping the masked array stuff for now. I don't think that's where the sticky point is in converting dtypes in structured arrays.
In any case, we need something more concrete to explore.
You can't change the dtype of structured fields in-place. You have to make a new array with a new dtype, and copy values from the old to the new.
import numpy.lib.recfunctions as rf
has some functions that can help in changing structured arrays.
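A minimal sketch of that make-a-new-array-and-copy idea, promoting every integer field to float while leaving the string field alone (the small array here just stands in for the csv data):

import numpy as np

arr = np.array([(b'a', 1, 2), (b'b', 6, 9)],
               dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4')])

# build a new dtype, turning integer fields into float64
new_dt = [(name, 'f8') if np.issubdtype(arr.dtype[name], np.integer)
          else (name, arr.dtype[name]) for name in arr.dtype.names]

out = np.zeros(arr.shape, dtype=new_dt)
for name in arr.dtype.names:
    out[name] = arr[name]        # ints are cast to float on assignment
# out['f1'] -> array([1., 6.])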
===========
I suspect that it will be simpler to spell out the dtypes when calling genfromtxt than to change dtypes in an existing array.
You could try one read with dtype=None and a limited number of lines to get the column count and base dtype. Then edit that, substituting floats for ints as needed. Now read the whole file with the new dtype. Look in the recfunctions code if you need ideas on how to edit dtypes.
For example:
In [504]: txt=b"""a, 1, 2, 4\nb, 6, 9, 10\nc, 4, 4, 3"""
In [506]: arr = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',')
In [507]: arr
Out[507]:
array([(b'a', 1, 2, 4), (b'b', 6, 9, 10), (b'c', 4, 4, 3)],
dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [508]: arr.dtype.descr
Out[508]: [('f0', '|S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')]
A crude dtype editor:
def foo(tup):
    name, dtype = tup
    dtype = dtype.replace('S', 'U')
    dtype = dtype.replace('i', 'f')
    return name, dtype
And applying this to default dtype:
In [511]: dt = [foo(tup) for tup in arr.dtype.descr]
In [512]: dt
Out[512]: [('f0', '|U1'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')]
In [513]: arr = np.genfromtxt(txt.splitlines(), dtype=dt, delimiter=',')
In [514]: arr
Out[514]:
array([('a', 1.0, 2.0, 4.0), ('b', 6.0, 9.0, 10.0), ('c', 4.0, 4.0, 3.0)],
dtype=[('f0', '<U1'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')])
In [522]: arr = np.recfromcsv(txt.splitlines(), dtype=dt, delimiter=',',case_sensitive=True,usemask=True,names=None)
In [523]: arr
Out[523]:
masked_records(
f0 : ['a' 'b' 'c']
f1 : [1.0 6.0 4.0]
f2 : [2.0 9.0 4.0]
f3 : [4.0 10.0 3.0]
fill_value : ('N', 1.0000000200408773e+20, 1.0000000200408773e+20, 1.0000000200408773e+20)
)
=====================
astype works if the target dtype matches. For example if I read the txt with dtype=None, and then use the derived dt, it works:
In [530]: arr = np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None)
In [531]: arr
Out[531]:
array([(b'a', 1, 2, 4), (b'b', 6, 9, 10), (b'c', 4, 4, 3)],
dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [532]: arr.astype(dt)
Out[532]:
array([('a', 1.0, 2.0, 4.0), ('b', 6.0, 9.0, 10.0), ('c', 4.0, 4.0, 3.0)],
dtype=[('f0', '<U1'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')])
Same for arr.astype('U3,int,float,int') which also has 4 compatible fields.

numpy.concatenate on record arrays fails when array has different length strings

When trying to concatenate record arrays which have a string field of differing lengths, concatenation fails.
As you can see in the following example, concatenate works if 'f1' has the same length in both arrays, but fails if not.
In [1]: import numpy as np
In [2]: a = np.core.records.fromarrays( ([1,2], ["one","two"]) )
In [3]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]) )
In [4]: c = np.core.records.fromarrays( ([6], ["six"]) )
In [5]: np.concatenate( (a,c) )
Out[5]:
array([(1, 'one'), (2, 'two'), (6, 'six')],
dtype=[('f0', '<i8'), ('f1', '|S3')])
In [6]: np.concatenate( (a,b) )
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/u/jegannas/<ipython console> in <module>()
TypeError: expected a readable buffer object
But again, if we just concatenate the string fields (not the record arrays), it succeeds, though the strings are of different sizes.
In [8]: np.concatenate( (a['f1'], b['f1']) )
Out[8]:
array(['one', 'two', 'three', 'four', 'three'],
dtype='|S5')
Is this a bug in concatenate when concatenating records, or is this the expected behavior? I have found only the following way to overcome it.
In [10]: np.concatenate( (a.astype(b.dtype), b) )
Out[10]:
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')],
dtype=[('f0', '<i8'), ('f1', '|S5')])
But the trouble here is that I have to go through all the recarrays I am concatenating, find the largest string length, and use that. If I have more than one string column in the record array, I need to keep track of a few other things too.
What do you think is the best way to overcome this, at least for now?
To post a complete answer: as Pierre GM suggested, the module
import numpy.lib.recfunctions
provides a solution. The function that does what you want, however, is:
numpy.lib.recfunctions.stack_arrays((a,b), autoconvert=True, usemask=False)
(usemask=False is just to avoid creation of a masked array, which you are probably not using. The important thing is autoconvert=True to force the conversion from a's dtype "|S3" to "|S5").
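A complete sketch with the arrays from the question (fromarrays infers the string field widths, and autoconvert widens them on stacking):

import numpy as np
import numpy.lib.recfunctions as rfn

a = np.rec.fromarrays(([1, 2], ["one", "two"]))
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]))

c = rfn.stack_arrays((a, b), autoconvert=True, usemask=False)
# c has 5 records and its string field has been widened to length 5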
Would numpy.lib.recfunctions.merge_arrays work for you? recfunctions is a little-known subpackage that hasn't been advertised a lot; it's a bit clunky but can be useful sometimes.
When you do not specify the dtype, np.rec.fromarrays (aka np.core.records.fromarrays) tries to guess the dtype for you. Hence,
In [4]: a = np.core.records.fromarrays( ([1,2], ["one","two"]) )
In [5]: a
Out[5]:
rec.array([(1, 'one'), (2, 'two')],
dtype=[('f0', '<i4'), ('f1', '|S3')])
Notice the dtype of the f1 column is a 3-byte string.
You can't concatenate np.concatenate( (a,b) ) because numpy sees the dtypes of a and b are different and doesn't change the dtype of the smaller string to match the larger string.
If you know a maximum string size that would work with all your arrays, you could specify the dtype from the beginning:
In [9]: a = np.rec.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', '|S8')])
In [10]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', '|S8')])
and then concatenation will work as desired:
In [11]: np.concatenate( (a,b))
Out[11]:
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')],
dtype=[('f0', '<i4'), ('f1', '|S8')])
If you do not know in advance the maximum length of the strings, you could specify the dtype as 'object':
In [35]: a = np.core.records.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', 'object')])
In [36]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', 'object')])
In [37]: np.concatenate( (a,b))
Out[37]:
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')],
dtype=[('f0', '<i4'), ('f1', '|O4')])
This will not be as space-efficient as a dtype of '|Sn' (for some integer n), but at least it will allow you to perform the concatenate operation.
