numpy genfromtxt - infer column header if headers not provided - python

I understand that with genfromtxt, the defaultfmt parameter can be used to generate default column names, which is useful if column names are not in the input data. If defaultfmt is not provided, it defaults to f%i. E.g.
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])
So here we have autogenerated column names f0, f1, f2.
But what if I want numpy to infer both column headers and data type? I thought you do it with dtype=None. Like this
>>> data3 = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3, dtype=None, ???) # some parameter combo
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
I still want the automatically generated column names of f0, f1...etc. And I want numpy to automatically determine the datatypes based on the data, which I thought was the whole point of doing dtype=None.
EDIT
But unfortunately that doesn't ALWAYS work.
This case works when I have both floats and ints.
>>> data3b = StringIO("1 2 3.0\n 4 5 6.0")
>>> np.genfromtxt(data3b, dtype=None)
array([(1, 2, 3.), (4, 5, 6.)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<f8')])
So numpy correctly inferred a datatype of i8 for the first 2 columns, and f8 for the last column.
But if I provide all ints, the inferred column names disappear.
>>> data3c = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3c, dtype=None)
array([[1, 2, 3],
[4, 5, 6]])
My identical code may or may not work depending on the input data? That doesn't sound right.
And yes I know there's pandas. But I'm not using pandas on purpose. So please bear with me on that.

In [2]: txt = '''1,2,3
...: 4,5,6'''.splitlines()
Default 2d array of floats:
In [6]: np.genfromtxt(txt, delimiter=',',encoding=None)
Out[6]:
array([[1., 2., 3.],
[4., 5., 6.]])
2d of ints:
In [7]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None)
Out[7]:
array([[1, 2, 3],
[4, 5, 6]])
Specified field dtypes:
In [8]: np.genfromtxt(txt, dtype='i,i,i', delimiter=',',encoding=None)
Out[8]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
Specified field names:
In [9]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None, names=['a','b','c'])
Out[9]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
The unstructured array can be converted to structured with:
In [10]: import numpy.lib.recfunctions as rf
In [11]: rf.unstructured_to_structured(Out[7])
Out[11]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
In numpy the default, preferred array is multidimensional numeric. That's why it produces Out[7] if it can.
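If the goal is to always get named fields together with inferred types, one option is to wrap the call and convert only when genfromtxt hands back a plain 2d array. This is a minimal sketch, assuming the input has at least two rows and two columns; load_structured is a hypothetical helper name, not part of numpy:

```python
import io
import numpy as np
import numpy.lib.recfunctions as rf

def load_structured(source, **kwargs):
    """Hypothetical helper: genfromtxt with inferred types, always structured."""
    arr = np.genfromtxt(source, dtype=None, encoding=None, **kwargs)
    if arr.dtype.names is None:
        # All columns shared one type, so numpy returned a plain 2d array.
        # Convert it; the default field names are f0, f1, ...
        arr = rf.unstructured_to_structured(arr)
    return arr

all_ints = load_structured(io.StringIO("1 2 3\n4 5 6"))
mixed = load_structured(io.StringIO("1 2 3.0\n4 5 6.0"))
print(all_ints.dtype.names)
print(mixed.dtype.names)
```

Both calls should report the same field names, so downstream code can index by 'f0', 'f1', ... regardless of what the data looked like.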

Related

How to convert 1D numpy array of tuples to 2D numpy array?

I have a numpy array of tuples:
import numpy as np
the_tuples = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
I would like to have a 2D numpy array instead:
the_2Darray = np.array([[1,4],[7,8]])
I have tried doing several things, such as
import numpy as np
the_tuples = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
the_2Darray = np.array([*the_tuples])
How can I convert it?
Needed to add the_tuples.astype(object).
import numpy as np
the_tuples = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
the_tuples = the_tuples.astype(object)
the_2Darray = np.array([*the_tuples])
This is a structured array - 1d with a compound dtype:
In [2]: arr = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
In [3]: arr.shape, arr.dtype
Out[3]: ((2,), dtype([('f0', '<i4'), ('f1', '<i4')]))
recfunctions has a function designed to do such a conversion:
In [4]: import numpy.lib.recfunctions as rf
In [5]: arr1 = rf.structured_to_unstructured(arr)
In [6]: arr1
Out[6]:
array([[1, 4],
[7, 8]])
view works if all fields have the same dtype, but the shape isn't kept well:
In [7]: arr.view('int')
Out[7]: array([1, 4, 7, 8])
The tolist produces a list of tuples (instead of a list of lists):
In [9]: arr.tolist()
Out[9]: [(1, 4), (7, 8)]
Which can be used as the basis for making a new array:
In [10]: np.array(arr.tolist())
Out[10]:
array([[1, 4],
[7, 8]])
Note that when you created the array (in [2]) you had to use the list of tuples format.
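The flattened shape from view can be repaired with a reshape when every field has the same dtype. A small sketch under that assumption, using the field dtype '<i4' explicitly rather than 'int':

```python
import numpy as np

arr = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])

# Reinterpret the buffer as plain int32, then restore one row per record.
arr2d = arr.view('<i4').reshape(arr.shape[0], -1)
print(arr2d)

# The result is still a view, so writes show up in the structured array too.
arr2d[0, 0] = 100
print(arr['f0'])
```

Unlike structured_to_unstructured or np.array(arr.tolist()), this does not copy the data, which may or may not be what you want.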

python h5py: can I store a dataset in which different columns have different types?

Suppose I have a table which has many columns. Only a few columns are float type; the others are small integers, for example:
col1, col2, col3, col4
1.31 1 2 3
2.33 3 5 4
...
How can I store this efficiently? If I use np.float32 for this dataset, storage is wasted, because the other columns only hold small integers and don't need so much space. If I use np.int16, the float column is not exact, which is also not what I want. How do I deal with a situation like this?
Suppose I also have a string column, which make me more confused, how should I store the data?
col1, col2, col3, col4, col5
1.31 1 2 3 "a"
2.33 3 5 4 "b"
...
Edit:
To make things simpler, let's suppose the string column has fixed-length strings only, for example, length of 3.
I'm going to demonstrate the structured array approach:
I'm guessing you are starting with a csv file 'table'. If not it's still the easiest way to turn your sample into an array:
In [40]: txt = '''col1, col2, col3, col4, col5
...: 1.31 1 2 3 "a"
...: 2.33 3 5 4 "b"
...: '''
In [42]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)
In [43]: data
Out[43]:
array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])
With these parameters, genfromtxt takes care of creating a structured array. Note it is a 1d array with 5 fields. Field dtypes are determined from the data.
In [44]: import h5py
...
In [46]: f = h5py.File('struct.h5', 'w')
In [48]: ds = f.create_dataset('data',data=data)
...
TypeError: No conversion path for dtype: dtype('<U3')
But h5py has problems saving the unicode strings (default for py3). There may be ways around that, but here it will be simpler to convert the string dtype to bytestrings. Besides, that'll be more compact.
To convert that, I'll make a new dtype, and use astype. Alternatively I could specify the dtypes in the genfromtxt call.
In [49]: data.dtype
Out[49]: dtype([('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])
In [50]: data.dtype.descr
Out[50]:
[('col1', '<f8'),
('col2', '<i8'),
('col3', '<i8'),
('col4', '<i8'),
('col5', '<U3')]
In [51]: dt1 = data.dtype.descr
In [52]: dt1[-1] = ('col5', 'S3')
In [53]: data.astype(dt1)
Out[53]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
Now it saves the array without problem:
In [54]: data1 = data.astype(dt1)
In [55]: data1
Out[55]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
In [56]: ds = f.create_dataset('data',data=data1)
In [57]: ds
Out[57]: <HDF5 dataset "data": shape (2,), type "|V35">
In [58]: ds[:]
Out[58]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
I could make further modifications, shortening one or more of the int fields:
In [60]: dt1[1] = ('col2','i2')
In [61]: dt1[2] = ('col3','i2')
In [62]: dt1
Out[62]:
[('col1', '<f8'),
('col2', 'i2'),
('col3', 'i2'),
('col4', '<i8'),
('col5', 'S3')]
In [63]: data1 = data.astype(dt1)
In [64]: data1
Out[64]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
In [65]: ds1 = f.create_dataset('data1',data=data1)
ds1 has a more compact storage, 'V23' vs 'V35'
In [67]: ds1
Out[67]: <HDF5 dataset "data1": shape (2,), type "|V23">
In [68]: ds1[:]
Out[68]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
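The 'V35' and 'V23' in the dataset types are the record sizes in bytes, which can be checked in numpy alone via dtype.itemsize. The sketch below rebuilds the sample records by hand and goes one step further than the transcript, assuming float32 is precise enough for col1 and that all integer columns fit in int16 (both are assumptions about your data):

```python
import numpy as np

# Same records that genfromtxt inferred above, written out explicitly.
inferred = [('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'),
            ('col4', '<i8'), ('col5', '<U3')]
data = np.array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
                dtype=inferred)

# Assumed-safe narrower types: float32, int16, 3-byte strings.
compact_dt = [('col1', '<f4'), ('col2', '<i2'), ('col3', '<i2'),
              ('col4', '<i2'), ('col5', 'S3')]
compact = data.astype(compact_dt)

print(data.dtype.itemsize)     # bytes per record with the inferred dtype
print(compact.dtype.itemsize)  # bytes per record with the compact dtype
```

An array like compact can be passed to create_dataset in the same way as data1 above, since it no longer contains a unicode field.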

Numpy structured array fails basic numpy operations

I wish to manipulate named numpy arrays (add, multiply, concatenate, ...)
I defined structured arrays:
types=[('name1', int), ('name2', float)]
a = np.array([2, 3.3], dtype=types)
b = np.array([4, 5.35], dtype=types)
a and b are created such that
a
array([(2, 2. ), (3, 3.3)], dtype=[('name1', '<i8'), ('name2', '<f8')])
but I really want a['name1'] to be just 2, not array([2, 3])
Similarly, I want a['name2'] to be just 3.3
This way I could sum c=a+b, which is expected to be an array of length 2, where c['name1'] is 6 and c['name2'] is 8.65
How can I do that?
Define a structured array:
In [125]: dt = np.dtype([('f0','U10'),('f1',int),('f2',float)])
In [126]: a = np.array([('one',2,3),('two',4,5.5),('three',6,7)],dt)
In [127]: a
Out[127]:
array([('one', 2, 3. ), ('two', 4, 5.5), ('three', 6, 7. )],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
And an object dtype array with the same data
In [128]: A = np.array([('one',2,3),('two',4,5.5),('three',6,7)],object)
In [129]: A
Out[129]:
array([['one', 2, 3],
['two', 4, 5.5],
['three', 6, 7]], dtype=object)
Addition works because it (iteratively) delegates the action to all elements
In [130]: A+A
Out[130]:
array([['oneone', 4, 6],
['twotwo', 8, 11.0],
['threethree', 12, 14]], dtype=object)
structured addition does not work
In [131]: a+a
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-6ff992d1ddd5> in <module>()
----> 1 a+a
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')]) dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
Let's try addition field by field:
In [132]: aa = np.zeros_like(a)
In [133]: for n in a.dtype.names: aa[n] = a[n] + a[n]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-133-68476e5d579e> in <module>()
----> 1 for n in a.dtype.names: aa[n] = a[n] + a[n]
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U10') dtype('<U10') dtype('<U10')
Oops, doesn't quite work - string dtype doesn't have addition. But we can handle the string field separately:
In [134]: aa['f0'] = a['f0']
In [135]: for n in a.dtype.names[1:]: aa[n] = a[n] + a[n]
In [136]: aa
Out[136]:
array([('one', 4, 6.), ('two', 8, 11.), ('three', 12, 14.)],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
Or we can change the string dtype to object:
In [137]: dt1 = np.dtype([('f0',object),('f1',int),('f2',float)])
In [138]: b = np.array([('one',2,3),('two',4,5.5),('three',6,7)],dt1)
In [139]: b
Out[139]:
array([('one', 2, 3. ), ('two', 4, 5.5), ('three', 6, 7. )],
dtype=[('f0', 'O'), ('f1', '<i8'), ('f2', '<f8')])
In [140]: bb = np.zeros_like(b)
In [141]: for n in a.dtype.names: bb[n] = b[n] + b[n]
In [142]: bb
Out[142]:
array([('oneone', 4, 6.), ('twotwo', 8, 11.), ('threethree', 12, 14.)],
dtype=[('f0', 'O'), ('f1', '<i8'), ('f2', '<f8')])
Python strings do have a __add__, defined as concatenate. Numpy dtype strings don't have that definition. Python strings can be multiplied by an integer, but raise an error otherwise.
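If you want to keep the fixed-width string dtype rather than switch to object, np.char.add does elementwise concatenation for string arrays. A sketch of the same field-by-field loop using it for string fields; note that results longer than the field width (10 characters here) would be silently truncated on assignment:

```python
import numpy as np

dt = np.dtype([('f0', 'U10'), ('f1', int), ('f2', float)])
a = np.array([('one', 2, 3), ('two', 4, 5.5), ('three', 6, 7)], dt)

aa = np.zeros_like(a)
for n in a.dtype.names:
    if a.dtype[n].kind in 'US':
        # String fields: concatenate elementwise instead of using +.
        aa[n] = np.char.add(a[n], a[n])
    else:
        # Numeric fields: ordinary ufunc addition works.
        aa[n] = a[n] + a[n]

print(aa)
```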
My guess is that pandas resorts to something like what I just did. I doubt if it implements dataframe addition in compiled code (except in some special cases). It probably works column by column if the dtype allows. It also seems to freely switch to object dtype (for example a column with both np.nan and a string). Timings might confirm my guess (I don't have pandas installed on this OS).
According to the documentation, the right way to make your arrays is:
types=[('name1', int), ('name2', float)]
a = np.array([(2, 3.3)], dtype=types)
b = np.array([(4, 5.35)], dtype=types)
Which generates a and b as you want them:
a['name1']
array([2])
But summing them is not as straightforward as with conventional numpy arrays, so I also suggest using pandas:
names=['name1','name2']
a=pd.Series([2,3.3],index=names)
b=pd.Series([4,5.35],index=names)
a+b
name1 6.00
name2 8.65
dtype: float64
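For readers who want to stay in numpy, the two answers can be combined: build the arrays from tuples, then add field by field. A minimal sketch, assuming all fields are numeric; add_fields is a hypothetical helper name:

```python
import numpy as np

def add_fields(x, y):
    """Hypothetical helper: add two structured arrays one field at a time."""
    out = np.empty_like(x)
    for name in x.dtype.names:
        out[name] = x[name] + y[name]
    return out

types = [('name1', int), ('name2', float)]
a = np.array([(2, 3.3)], dtype=types)   # one record, given as a tuple
b = np.array([(4, 5.35)], dtype=types)

c = add_fields(a, b)
print(c['name1'], c['name2'])
```

Each field keeps its own dtype here, whereas the pandas Series above upcasts everything to float64.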

Adding a data column to a numpy rec array with only one row

I need to add a column of data to a numpy rec array. I have seen many answers floating around here, but they do not seem to work for a rec array that only contains one row...
Let's say I have a rec array x:
>>> x = np.rec.array([1, 2, 3])
>>> print(x)
rec.array((1, 2, 3),
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
and I want to append the value 4 to a new column with its own field name and data type, such as
rec.array((1, 2, 3, 4),
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])
If I try to add a column using the normal append_fields approach:
>>> np.lib.recfunctions.append_fields(x, 'f3', 4, dtypes='<i8',
usemask=False, asrecarray=True)
then I ultimately end up with
TypeError: len() of unsized object
It turns out that for a rec array with only one row, len(x) does not work, while x.size does. If I instead use np.hstack(), I get TypeError: invalid type promotion, and if I try np.c_, I get an undesired result
>>> np.c_[x, 4]
array([[(1, 2, 3), (4, 4, 4)]],
dtype=(numpy.record, [('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')]))
Create the initial array so that it has shape (1,); note the extra brackets:
In [17]: x = np.rec.array([[1, 2, 3]])
(If x is an input that you can't control that way, you could use x = np.atleast_1d(x) before using it in append_fields().)
Then make sure the value given in append_fields is also a sequence of length 1:
In [18]: np.lib.recfunctions.append_fields(x, 'f3', [4], dtypes='<i8',
...: usemask=False, asrecarray=True)
Out[18]:
rec.array([(1, 2, 3, 4)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])
Here's a way of doing the job without recfunctions:
In [64]: x = np.rec.array((1, 2, 3))
In [65]: y=np.zeros(x.shape, dtype=x.dtype.descr+[('f3','<i4')])
In [66]: y
Out[66]:
array((0, 0, 0, 0),
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [67]: for name in x.dtype.names: y[name] = x[name]
In [68]: y['f3']=4
In [69]: y
Out[69]:
array((1, 2, 3, 4),
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
From what I've seen in recfunctions code, I think it's just as fast. Of course for a single row speed isn't an issue. In general those functions create a new 'blank' array with the target dtype, and copy fields, by name (possibly recursively) from sources to target. Usually an array has many more records than fields, so iteration on fields is not, relatively speaking, slow.
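The blank-array-and-copy approach described above can be wrapped in a small function that works for 0d (single record) and 1d arrays alike. A sketch, assuming a simple packed dtype so that dtype.descr round-trips cleanly; append_field is a hypothetical name, not a recfunctions API:

```python
import numpy as np

def append_field(x, name, value, dtype):
    """Hypothetical helper: copy x into a new array with one extra field."""
    new_dtype = np.dtype(x.dtype.descr + [(name, dtype)])
    y = np.zeros(x.shape, dtype=new_dtype)   # blank array, target dtype
    for field in x.dtype.names:              # copy existing fields by name
        y[field] = x[field]
    y[name] = value                          # fill in the new field
    return y

fields = [('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')]
x0 = np.array((1, 2, 3), dtype=fields)               # 0d, shape ()
x1 = np.array([(1, 2, 3), (5, 6, 7)], dtype=fields)  # 1d, shape (2,)

y0 = append_field(x0, 'f3', 4, '<i8')
y1 = append_field(x1, 'f3', [4, 8], '<i8')
print(y0)
print(y1)
```

Because it uses x.shape rather than len(x), it sidesteps the "len() of unsized object" error for the single-record case.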

In place sorting by column fails on slices

I'm trying to sort a numpy array by a specific column (in-place) using the solution from this answer. For the most part it works, but it fails on any array that's a view on another array:
In [35]: columnnum = 2
In [36]: a = np.array([[1,2,3], [4,7,5], [9,0,1]])
In [37]: a
Out[37]:
array([[1, 2, 3],
[4, 7, 5],
[9, 0, 1]])
In [38]: b = a[:,(0, 2)]
In [39]: b
Out[39]:
array([[1, 3],
[4, 5],
[9, 1]])
In [40]: a.view(','.join([a.dtype.str] * a.shape[1])).sort(order=['f%d' % columnnum], axis=0)
In [41]: a
Out[41]:
array([[9, 0, 1],
[1, 2, 3],
[4, 7, 5]])
In [42]: b.view(','.join([b.dtype.str] * b.shape[1])).sort(order=['f%d' % columnnum], axis=0)
ValueError: new type not compatible with array.
It looks like numpy doesn't support views of views, which makes a certain amount of sense, but I now can't figure out how to get the view I need for any array, whether it itself is a view or not. So far, I haven't been able to find any way to get the necessary information about the view I have to construct the new one I need.
For now, I'm using the l = l[l[:,columnnum].argsort()] in-place sorting method, which works fine, but since I'm operating on large datasets, I'd like to avoid the extra memory overhead of the argsort() call (the list of indexes). Is there either a way to get the necessary information about the view or to do the sort by column?
In [1019]: a=np.array([[1,2,3],[4,7,5],[9,0,1]])
In [1020]: b=a[:,(0,2)]
This is the a that you are sorting: a structured array with 3 fields. It uses the same data buffer, but interprets groups of 3 ints as fields rather than columns.
In [1021]: a.view('i,i,i')
Out[1021]:
array([[(1, 2, 3)],
[(4, 7, 5)],
[(9, 0, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
By the same logic, you try to view b:
In [1022]: b.view('i,i')
/usr/local/bin/ipython3:1: DeprecationWarning: Changing the shape of non-C contiguous array by
descriptor assignment is deprecated. To maintain
the Fortran contiguity of a multidimensional Fortran
array, use 'a.T.view(...).T' instead
#!/usr/bin/python3
....
ValueError: new type not compatible with array.
But if I use 3 fields instead of 2, it works (but with the same warning):
In [1023]: b.view('i,i,i')
/usr/local/bin/ipython3:1: DeprecationWarning:...
Out[1023]:
array([[(1, 4, 9), (3, 5, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
The problem is that b is Fortran order. Check b.flags or
In [1026]: a.strides
Out[1026]: (12, 4)
In [1027]: b.strides
Out[1027]: (4, 12)
b is a copy, not a view. I don't know, off hand, why this construction of b changed the order.
Heeding the warning, I can do:
In [1047]: b.T.view('i,i,i').T
Out[1047]:
array([[(1, 4, 9), (3, 5, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
A default copy (order c) of b can be viewed as 2 fields:
In [1042]: b1=b.copy()
In [1043]: b1.strides
Out[1043]: (8, 4)
In [1044]: b1.view('i,i')
Out[1044]:
array([[(1, 3)],
[(4, 5)],
[(9, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
A footnote on: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
The memory layout of an advanced indexing result is optimized for each indexing operation and no particular memory order can be assumed.
====================
b in this case was constructed with advanced indexing, and thus is a copy. But even a true view might not be viewable in this way:
In [1052]: a[:,:2].view('i,i')
....
ValueError: new type not compatible with array.
In [1054]: a[:,:2].copy().view('i,i')
Out[1054]:
array([[(1, 2)],
[(4, 7)],
[(9, 0)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
The view is selecting a subset of the values: 'i,i,x,i,i,x,i,i,x...', and that does not translate into a structured dtype.
The structured view of a does: '(i,i,i),(i,i,i),...'
You can select a subset of the fields of a structured array:
In [1059]: a1=a.view('i,i,i')
In [1060]: a1
Out[1060]:
array([[(1, 2, 3)],
[(4, 7, 5)],
[(9, 0, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
In [1061]: b1=a1[['f0','f2']]
In [1062]: b1
Out[1062]:
array([[(1, 3)],
[(4, 5)],
[(9, 1)]],
dtype=[('f0', '<i4'), ('f2', '<i4')])
But there are limits as to what you can do with such a view. Values can be changed in a1, and seen in a and b1. But I get an error if I try to change values in b1.
This is on the development edge.
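Putting the contiguity point into practice: the structured-view sort only makes sense when each row is one unbroken run of memory, so it can help to check that up front instead of hitting the ValueError. A sketch, with sort_rows_inplace as a hypothetical helper name; it refuses non-contiguous input rather than silently copying, since sorting a copy would not be in place:

```python
import numpy as np

def sort_rows_inplace(arr, column):
    """Hypothetical helper: sort the rows of a 2d array by one column, in place."""
    if not arr.flags['C_CONTIGUOUS']:
        raise ValueError("rows are not contiguous; cannot build a structured view")
    # One field per column, all with the array's own dtype.
    dt = np.dtype([(f'f{i}', arr.dtype) for i in range(arr.shape[1])])
    arr.view(dt).sort(order=[f'f{column}'], axis=0)

a = np.array([[1, 2, 3], [4, 7, 5], [9, 0, 1]])
sort_rows_inplace(a, 2)
print(a)

try:
    sort_rows_inplace(a[:, :2], 0)   # a true view, but rows skip a value
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

For inputs that fail the check, the argsort approach from the question remains the fallback.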
