In place sorting by column fails on slices - python

I'm trying to sort a numpy array by a specific column (in-place) using the solution from this answer. For the most part it works, but it fails on any array that's a view on another array:
In [35]: columnnum = 2
In [36]: a = np.array([[1,2,3], [4,7,5], [9,0,1]])
In [37]: a
Out[37]:
array([[1, 2, 3],
[4, 7, 5],
[9, 0, 1]])
In [38]: b = a[:,(0, 2)]
In [39]: b
Out[39]:
array([[1, 3],
[4, 5],
[9, 1]])
In [40]: a.view(','.join([a.dtype.str] * a.shape[1])).sort(order=['f%d' % columnnum], axis=0)
In [41]: a
Out[41]:
array([[9, 0, 1],
[1, 2, 3],
[4, 7, 5]])
In [42]: b.view(','.join([b.dtype.str] * b.shape[1])).sort(order=['f%d' % columnnum], axis=0)
ValueError: new type not compatible with array.
It looks like numpy doesn't support views of views, which makes a certain amount of sense, but I now can't figure out how to get the view I need for any array, whether it itself is a view or not. So far, I haven't been able to find any way to get the necessary information about the view I have to construct the new one I need.
For now, I'm using the l = l[l[:,columnnum].argsort()] in-place sorting method, which works fine, but since I'm operating on large datasets, I'd like to avoid the extra memory overhead of the argsort() call (the list of indexes). Is there either a way to get the necessary information about the view or to do the sort by column?
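For reference, here is a minimal sketch of that workaround (note that l = l[...] rebinds the name to a new, sorted array, while l[:] = l[...] writes the sorted rows back into the original buffer; it still allocates the index array and a temporary copy):
import numpy as np
l = np.array([[1, 2, 3], [4, 7, 5], [9, 0, 1]])
columnnum = 2
# argsort-based sort by column: build an index array, reorder, write back in place
l[:] = l[l[:, columnnum].argsort()]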

In [1019]: a=np.array([[1,2,3],[4,7,5],[9,0,1]])
In [1020]: b=a[:,(0,2)]
This is the view of a that you are sorting: a structured array with 3 fields. It uses the same data buffer, but interprets each group of 3 ints as a record with 3 fields rather than as a row of 3 columns.
In [1021]: a.view('i,i,i')
Out[1021]:
array([[(1, 2, 3)],
[(4, 7, 5)],
[(9, 0, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
By the same logic, you try to view b:
In [1022]: b.view('i,i')
/usr/local/bin/ipython3:1: DeprecationWarning: Changing the shape of non-C contiguous array by
descriptor assignment is deprecated. To maintain
the Fortran contiguity of a multidimensional Fortran
array, use 'a.T.view(...).T' instead
#!/usr/bin/python3
....
ValueError: new type not compatible with array.
But if I use 3 fields instead of 2, it works (but with the same warning):
In [1023]: b.view('i,i,i')
/usr/local/bin/ipython3:1: DeprecationWarning:...
Out[1023]:
array([[(1, 4, 9), (3, 5, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
The problem is that b is in Fortran order. Check b.flags, or compare the strides:
In [1026]: a.strides
Out[1026]: (12, 4)
In [1027]: b.strides
Out[1027]: (4, 12)
b is a copy, not a view. I don't know, off hand, why this construction of b changed the order.
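A quick way to tell whether an indexing result shares memory with the original is np.shares_memory (a minimal sketch; the exact strides/order of a fancy-indexing result may differ between numpy versions):
import numpy as np
a = np.array([[1, 2, 3], [4, 7, 5], [9, 0, 1]])
b = a[:, (0, 2)]   # advanced (fancy) indexing -> a new array with its own buffer
c = a[:, :2]       # basic slicing -> a view into a
print(np.shares_memory(a, b))   # False: b is a copy
print(np.shares_memory(a, c))   # True:  c is a view
print(b.strides)   # the memory layout of a fancy-indexing result is an implementation detail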
Heeding the warning, I can do:
In [1047]: b.T.view('i,i,i').T
Out[1047]:
array([[(1, 4, 9), (3, 5, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
A default (C-order) copy of b can be viewed as 2 fields:
In [1042]: b1=b.copy()
In [1043]: b1.strides
Out[1043]: (8, 4)
In [1044]: b1.view('i,i')
Out[1044]:
array([[(1, 3)],
[(4, 5)],
[(9, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
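Putting these pieces together, one way to get the effect of a column sort on b is to sort a C-order copy through its structured view and assign the result back (a sketch, not a zero-copy solution; it still makes one temporary copy of b):
import numpy as np
a = np.array([[1, 2, 3], [4, 7, 5], [9, 0, 1]])
b = a[:, (0, 2)]
col = 1                                            # sort b by its second column
b1 = np.ascontiguousarray(b)                       # C-order copy that can be viewed
b1.view(','.join([b1.dtype.str] * b1.shape[1])).sort(order='f%d' % col, axis=0)
b[:] = b1                                          # write the sorted rows back into b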
A footnote on https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html says:
The memory layout of an advanced indexing result is optimized for each indexing operation and no particular memory order can be assumed.
b in this case was constructed with advanced indexing, and thus is a copy. Even a true view might not be viewable in this way:
In [1052]: a[:,:2].view('i,i')
....
ValueError: new type not compatible with array.
In [1054]: a[:,:2].copy().view('i,i')
Out[1054]:
array([[(1, 2)],
[(4, 7)],
[(9, 0)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
The view is selecting a subset of the values: 'i,i,x,i,i,x,i,i,x...', and that does not translate into a structured dtype.
The structured view of a uses all of them: '(i,i,i),(i,i,i),...'
You can select a subset of the fields of a structured array:
In [1059]: a1=a.view('i,i,i')
In [1060]: a1
Out[1060]:
array([[(1, 2, 3)],
[(4, 7, 5)],
[(9, 0, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
In [1061]: b1=a1[['f0','f2']]
In [1062]: b1
Out[1062]:
array([[(1, 3)],
[(4, 5)],
[(9, 1)]],
dtype=[('f0', '<i4'), ('f2', '<i4')])
But there are limits as to what you can do with such a view. Values can be changed in a1, and seen in a and b1. But I get an error if I try to change values in b1.
This is on the development edge; the handling of multi-field views is still being worked on.
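In newer numpy versions (roughly 1.16 and later), recfunctions.repack_fields gives a packed, fully writable copy of such a multi-field selection, if a copy is acceptable (a sketch, assuming a recent numpy):
import numpy as np
import numpy.lib.recfunctions as rf
a = np.array([[1, 2, 3], [4, 7, 5], [9, 0, 1]], dtype='i4')
a1 = a.view('i,i,i')
b2 = rf.repack_fields(a1[['f0', 'f2']])   # packed copy, no gap where 'f1' was
b2['f0'] = 99                             # can be modified freely; a is unaffected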

Related

How to convert 1D numpy array of tuples to 2D numpy array?

I have a numpy array of tuples:
import numpy as np
the_tuples = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
I would like to have a 2D numpy array instead:
the_2Darray = np.array([[1,4],[7,8]])
I have tried doing several things, such as
import numpy as np
the_tuples = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
the_2Darray = np.array([*the_tuples])
How can I convert it?
It turns out I needed to add the_tuples.astype(object):
import numpy as np
the_tuples = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
the_tuples = the_tuples.astype(object)
the_2Darray = np.array([*the_tuples])
This is a structured array - 1d with a compound dtype:
In [2]: arr = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
In [3]: arr.shape, arr.dtype
Out[3]: ((2,), dtype([('f0', '<i4'), ('f1', '<i4')]))
recfunctions has a function designed to do such a conversion:
In [4]: import numpy.lib.recfunctions as rf
In [5]: arr1 = rf.structured_to_unstructured(arr)
In [6]: arr1
Out[6]:
array([[1, 4],
[7, 8]])
view works if all fields have the same dtype, but the 2d shape isn't preserved:
In [7]: arr.view('int')
Out[7]: array([1, 4, 7, 8])
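A reshape after the view restores the 2d layout (a sketch; like the view above it assumes all fields share one dtype):
import numpy as np
arr = np.array([(1, 4), (7, 8)], dtype=[('f0', '<i4'), ('f1', '<i4')])
arr2 = arr.view(arr.dtype[0]).reshape(arr.shape[0], -1)   # one row per record
# arr2 is array([[1, 4], [7, 8]]) and still shares memory with arr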
The tolist produces a list of tuples (instead of a list of lists):
In [9]: arr.tolist()
Out[9]: [(1, 4), (7, 8)]
Which can be used as the basis for making a new array:
In [10]: np.array(arr.tolist())
Out[10]:
array([[1, 4],
[7, 8]])
Note that when you created the array (in [2]) you had to use the list of tuples format.

numpy genfromtxt - infer column header if headers not provided

I understand that with genfromtxt, the defaultfmt parameter can be used to generate default column names, which is useful if column names are not in the input data. And defaultfmt, if not provided, defaults to f%i. E.g.
>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])
So here we have autogenerated column names f0, f1, f2.
But what if I want numpy to infer both column headers and data type? I thought you do it with dtype=None. Like this
>>> data3 = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3, dtype=None, ???) # some parameter combo
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
I still want the automatically generated column names of f0, f1...etc. And I want numpy to automatically determine the datatypes based on the data, which I thought was the whole point of doing dtype=None.
EDIT
But unfortunately that doesn't ALWAYS work.
This case works when I have both floats and ints.
>>> data3b = StringIO("1 2 3.0\n 4 5 6.0")
>>> np.genfromtxt(data3b, dtype=None)
array([(1, 2, 3.), (4, 5, 6.)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<f8')])
So numpy correctly inferred a datatype of i8 for the first 2 columns, and f8 for the last column.
But, if I provide all ints, the inferred column names disappear.
>>> data3c = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data3c, dtype=None)
array([[1, 2, 3],
[4, 5, 6]])
My identical code may or may not work depending on the input data? That doesn't sound right.
And yes I know there's pandas. But I'm not using pandas on purpose. So please bear with me on that.
In [2]: txt = '''1,2,3
...: 4,5,6'''.splitlines()
Default 2d array of floats:
In [6]: np.genfromtxt(txt, delimiter=',',encoding=None)
Out[6]:
array([[1., 2., 3.],
[4., 5., 6.]])
2d of ints:
In [7]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None)
Out[7]:
array([[1, 2, 3],
[4, 5, 6]])
Specified field dtypes:
In [8]: np.genfromtxt(txt, dtype='i,i,i', delimiter=',',encoding=None)
Out[8]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
Specified field names:
In [9]: np.genfromtxt(txt, dtype=None, delimiter=',',encoding=None, names=['a','b','c'])
Out[9]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
The unstructured array can be converted to structured with:
In [10]: import numpy.lib.recfunctions as rf
In [11]: rf.unstructured_to_structured(Out[7])
Out[11]:
array([(1, 2, 3), (4, 5, 6)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
In numpy the default, preferred array is a multidimensional numeric one. That's why genfromtxt produces Out[7] when it can.
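If you want the structured form regardless of whether genfromtxt falls back to a plain 2d array, a small wrapper works (a sketch; genfromtxt_structured is just a hypothetical helper name, and unstructured_to_structured needs numpy 1.16+):
import numpy as np
import numpy.lib.recfunctions as rf
def genfromtxt_structured(source, **kwargs):
    # Hypothetical helper: always return a structured array with f0, f1, ... fields
    arr = np.genfromtxt(source, dtype=None, encoding=None, **kwargs)
    if arr.dtype.names is None:            # genfromtxt produced a plain 2d array
        arr = rf.unstructured_to_structured(arr)
    return arr
# e.g. genfromtxt_structured(txt, delimiter=',')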

optimized container for dict with typed items?

I have a large (typed) array from a dataset with the first column containing a unique key to index each row, and the other columns containing matrices or vectors of known size. Is there any standard way, or a library, to efficiently turn that into a hash map for key-based retrieval? Splitting the row into individual key-value objects doesn't seem very memory efficient.
Here is the baseline sub-optimal solution:
arr = np.array([
('b0263', 'a', 1, 2, 3),
('26ab0', 'b', 4, 5, 6),
('6d7fc', 'c', 7, 8, 9),
('48a24', 'd', 0, 1, 2),
('1dcca', 'e', 3, 4, 5)],
dtype="S5, c, i4, i4, i4")
out = {key: values for key, *values in arr}
I thought about a dict containing the integer indexes of the corresponding rows but it requires two levels of indexing.
Your structured array:
In [317]: arr
Out[317]:
array([(b'b0263', b'a', 1, 2, 3), (b'26ab0', b'b', 4, 5, 6),
(b'6d7fc', b'c', 7, 8, 9), (b'48a24', b'd', 0, 1, 2),
(b'1dcca', b'e', 3, 4, 5)],
dtype=[('f0', 'S5'), ('f1', 'S1'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
The first field:
In [318]: arr['f0']
Out[318]: array([b'b0263', b'26ab0', b'6d7fc', b'48a24', b'1dcca'], dtype='|S5')
A dictionary built from that array:
In [321]: dd = {k:v for k,v in zip(arr['f0'], arr)}
In [322]: dd
Out[322]:
{b'b0263': (b'b0263', b'a', 1, 2, 3),
b'26ab0': (b'26ab0', b'b', 4, 5, 6),
b'6d7fc': (b'6d7fc', b'c', 7, 8, 9),
b'48a24': (b'48a24', b'd', 0, 1, 2),
b'1dcca': (b'1dcca', b'e', 3, 4, 5)}
In [323]: dd[b'6d7fc']
Out[323]: (b'6d7fc', b'c', 7, 8, 9)
In this construction, the value of a dictionary element is an element of the original array.
It has type np.void and dtype:
In [329]: dd[b'6d7fc'].dtype
Out[329]: dtype([('f0', 'S5'), ('f1', 'S1'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
And it is actually a view of the original (I checked __array_interface__). So apart from the pointers/references it doesn't consume much additional memory.
Recent numpy versions have been reworking the handling of views/copies of structured arrays, so the details may have changed, especially with respect to multifield references.
I should qualify the memory comment. In this example, itemsize is just 18. But referencing an element of the array, arr[0], creates a new np.void object. I suspect that object overhead is larger than 18 bytes.
You could create a dictionary with tuple values:
In [335]: dd = {k:v.tolist()[1:] for k,v in zip(arr['f0'], arr)}
In [336]: dd[b'b0263']
Out[336]: (b'a', 1, 2, 3)
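The index-based alternative mentioned in the question is also simple and keeps only one small Python object per row: the dict maps each key to its row number, and lookups go back through the array (a sketch):
import numpy as np
arr = np.array([
    ('b0263', 'a', 1, 2, 3),
    ('26ab0', 'b', 4, 5, 6),
    ('6d7fc', 'c', 7, 8, 9)],
    dtype="S5, c, i4, i4, i4")
index = {key: i for i, key in enumerate(arr['f0'])}   # key -> row number
row = arr[index[b'6d7fc']]                            # two-step lookup: dict, then array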

Adding a data column to a numpy rec array with only one row

I need to add a column of data to a numpy rec array. I have seen many answers floating around here, but they do not seem to work for a rec array that only contains one row...
Let's say I have a rec array x:
>>> x = np.rec.array([1, 2, 3])
>>> print(x)
rec.array((1, 2, 3),
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
and I want to append the value 4 to a new column with its own field name and data type, such as
rec.array((1, 2, 3, 4),
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])
If I try to add a column using the normal append_fields approach;
>>> np.lib.recfunctions.append_fields(x, 'f3', 4, dtypes='<i8',
usemask=False, asrecarray=True)
then I ultimately end up with
TypeError: len() of unsized object
It turns out that for a rec array with only one row, len(x) does not work, while x.size does. If I instead use np.hstack(), I get TypeError: invalid type promotion, and if I try np.c_, I get an undesired result
>>> np.c_[x, 4]
array([[(1, 2, 3), (4, 4, 4)]],
dtype=(numpy.record, [('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')]))
Create the initial array so that it has shape (1,); note the extra brackets:
In [17]: x = np.rec.array([[1, 2, 3]])
(If x is an input that you can't control that way, you could use x = np.atleast_1d(x) before using it in append_fields().)
Then make sure the value given in append_fields is also a sequence of length 1:
In [18]: np.lib.recfunctions.append_fields(x, 'f3', [4], dtypes='<i8',
...: usemask=False, asrecarray=True)
Out[18]:
rec.array([(1, 2, 3, 4)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])
Here's a way of doing the job without recfunctions:
In [64]: x = np.rec.array((1, 2, 3))
In [65]: y=np.zeros(x.shape, dtype=x.dtype.descr+[('f3','<i4')])
In [66]: y
Out[66]:
array((0, 0, 0, 0),
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [67]: for name in x.dtype.names: y[name] = x[name]
In [68]: y['f3']=4
In [69]: y
Out[69]:
array((1, 2, 3, 4),
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
From what I've seen in recfunctions code, I think it's just as fast. Of course for a single row speed isn't an issue. In general those functions create a new 'blank' array with the target dtype, and copy fields, by name (possibly recursively) from sources to target. Usually an array has many more records than fields, so iteration on fields is not, relatively speaking, slow.
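The same zeros-and-copy pattern can be wrapped up as a small reusable helper (a sketch; add_field is a made-up name, not a numpy function):
import numpy as np
def add_field(arr, name, values, dtype):
    # Hypothetical helper: return a copy of structured array arr with one extra field
    out = np.zeros(arr.shape, dtype=arr.dtype.descr + [(name, dtype)])
    for f in arr.dtype.names:        # copy the existing fields by name
        out[f] = arr[f]
    out[name] = values               # fill in the new field
    return out
x = np.rec.array((1, 2, 3))
y = add_field(x, 'f3', 4, '<i8')     # structured array (1, 2, 3, 4) with fields f0..f3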

how do I add a 'RowNumber' field to a structured numpy array?

I am using genfromtxt to load large csv files into structured arrays. I need to sort the data (using multiple fields), do some work and then restore the data to the original ordering. My plan is to add another field to the data and put the row number into this field before the first sort is applied. It can then be used to revert the order at the end. I thought there might be an elegant way of adding this field of record numbers but after hours of trying and searching for ideas I have nothing particularly slick.
import numpy
import numpy.lib.recfunctions as rfn
def main():
    csvDataFile = 'C:\\File1.csv'
    csvData = numpy.genfromtxt(csvDataFile, delimiter=',', names=True, dtype='f8')
    rowNums = numpy.zeros(len(csvData), dtype=[('RowID', 'f8')])
    # populate and add column for RowID
    for i in range(0, len(csvData)):
        rowNums['RowID'][i] = i
    csvDataWithID = rfn.merge_arrays((csvData, rowNums), asrecarray=True, flatten=True)
The recfunctions.merge_arrays call in particular is very slow, and adding the row numbers one by one seems so old school. Your ideas would be gratefully received.
You don't need to fill rowNums iteratively:
In [93]: rowNums=np.zeros(10,dtype=[('RowID','f8')])
In [94]: for i in range(0,10):
   ....:     rowNums['RowID'][i]=i
   ....:
In [95]: rowNums
Out[95]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
Just assign the range values to the field:
In [96]: rowNums['RowID']=np.arange(10)
In [97]: rowNums
Out[97]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
rfn.merge_arrays shouldn't be that slow - unless csvData.dtype has a great number of fields. This function creates a new dtype that merges the fields of the 2 inputs, and then copies data field by field. For many rows, and just a few fields, that is quite fast.
But you should be able to get the original order back without adding this extra field.
A 2 field 1d array:
In [118]: x = np.array([(4,2),(1, 0), (0, 1),(1,2),(3,1)], dtype=[('x', '<i4'), ('y', '<i4')])
In [119]: i = np.argsort(x, order=('y','x'))
In [120]: i
Out[120]: array([1, 2, 4, 3, 0], dtype=int32)
In [121]: x[i]
Out[121]:
array([(1, 0), (0, 1), (3, 1), (1, 2), (4, 2)],
dtype=[('x', '<i4'), ('y', '<i4')])
The same values are now sorted first on y, then on x.
In [122]: j=np.argsort(i)
In [123]: j
Out[123]: array([4, 0, 1, 3, 2], dtype=int32)
In [124]: x[i][j]
Out[124]:
array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
dtype=[('x', '<i4'), ('y', '<i4')])
Back to the original order.
I could have added a row index array to x, and then done a sort on that. But why add it; why not just apply i to a separate array:
In [127]: np.arange(5)[i]
Out[127]: array([1, 2, 4, 3, 0])
But sorting that is just the same as sorting i.
merge_arrays is doing essentially the following:
Union dtype:
In [139]: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
In [140]: y=np.zeros((5,),dtype=dt)
fill in the values:
In [141]: y['RowID']=np.arange(5)
In [143]: for name in x.dtype.names:
   ....:     y[name]=x[name]
In [144]: y
Out[144]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
And to test my argsort of argsort idea:
In [145]: y[i]
Out[145]:
array([(1.0, 1, 0), (2.0, 0, 1), (4.0, 3, 1), (3.0, 1, 2), (0.0, 4, 2)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [146]: np.argsort(y[i],order=('RowID'))
Out[146]: array([4, 0, 1, 3, 2], dtype=int32)
In [147]: j
Out[147]: array([4, 0, 1, 3, 2], dtype=int32)
Sorting on the reordered RowID is the same as sorting on i.
Curiously merge_arrays is quite a bit slower than my reconstruction:
In [163]: rfn.merge_arrays([rowNums,x],flatten=True)
Out[163]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [164]: timeit rfn.merge_arrays([rowNums,x],flatten=True)
10000 loops, best of 3: 161 µs per loop
In [165]: %%timeit
dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
y=np.zeros((5,),dtype=dt)
y['RowID']=rowNums['RowID']
for name in x.dtype.names:
    y[name]=x[name]
10000 loops, best of 3: 38.4 µs per loop
rowNums = np.zeros(len(csvData),dtype=[('RowID','f8')])
rowNums['RowID']=np.arange(len(csvData))
The above saves approx half a second per file with the csv files I am using. Very good so far.
However the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using:
sortorder = np.argsort(csvData, order=['col_1','col_2','col_3','col_4','col_5'])
giving an array that lists the order of items in csvData when sorted by cols 1 through 5.
This negates the need to make, populate and merge a RowID column, saving me around 15s per csv file (over 6hrs across my entire dataset.)
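For completeness, the whole sort / work / restore round trip then looks something like this (a sketch with the column names from above and toy data):
import numpy as np
data = np.array([(4, 2), (1, 0), (0, 1)],
                dtype=[('col_1', '<i4'), ('col_2', '<i4')])
i = np.argsort(data, order=['col_2', 'col_1'])   # record of the sort order
work = data[i]                                   # sorted copy to operate on
# ... do the work on `work` ...
restored = work[np.argsort(i)]                   # back to the original row order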
Thank you very much @hpaulj
