how do I add a 'RowNumber' field to a structured numpy array?

I am using genfromtxt to load large csv files into structured arrays. I need to sort the data (using multiple fields), do some work and then restore the data to the original ordering. My plan is to add another field to the data and put the row number into this field before the first sort is applied. It can then be used to revert the order at the end. I thought there might be an elegant way of adding this field of record numbers but after hours of trying and searching for ideas I have nothing particularly slick.
import numpy
import numpy.lib.recfunctions as rfn

def main():
    csvDataFile = 'C:\\File1.csv'
    csvData = numpy.genfromtxt(csvDataFile, delimiter=',', names=True, dtype='f8')
    rowNums = numpy.zeros(len(csvData), dtype=[('RowID', 'f8')])
    # populate and add column for RowID
    for i in range(0, len(csvData)):
        rowNums['RowID'][i] = i
    csvDataWithID = rfn.merge_arrays((csvData, rowNums), asrecarray=True, flatten=True)
The recfunctions.merge_arrays call in particular is very slow, and adding the row numbers one by one seems so old school. Your ideas would be gratefully received.

You don't need to fill rowNums iteratively:
In [93]: rowNums=np.zeros(10,dtype=[('RowID','f8')])
In [94]: for i in range(0,10):
   ....:     rowNums['RowID'][i]=i
   ....:
In [95]: rowNums
Out[95]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
Just assign the range values to the field:
In [96]: rowNums['RowID']=np.arange(10)
In [97]: rowNums
Out[97]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
rfn.merge_arrays shouldn't be that slow - unless csvData.dtype has a great number of fields. This function creates a new dtype that merges the fields of the 2 inputs, and then copies data field by field. For many rows, and just a few fields, that is quite fast.
But you should be able to get the original order back without adding this extra field.
A 2 field 1d array:
In [118]: x = np.array([(4,2),(1, 0), (0, 1),(1,2),(3,1)], dtype=[('x', '<i4'), ('y', '<i4')])
In [119]: i = np.argsort(x, order=('y','x'))
In [120]: i
Out[120]: array([1, 2, 4, 3, 0], dtype=int32)
In [121]: x[i]
Out[121]:
array([(1, 0), (0, 1), (3, 1), (1, 2), (4, 2)],
dtype=[('x', '<i4'), ('y', '<i4')])
The same values are now sorted first on y, then on x.
In [122]: j=np.argsort(i)
In [123]: j
Out[123]: array([4, 0, 1, 3, 2], dtype=int32)
In [124]: x[i][j]
Out[124]:
array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
dtype=[('x', '<i4'), ('y', '<i4')])
Back to the original order
I could have added a row index array to x, and then done a sort on that. But why add it; why not just apply i to a separate array:
In [127]: np.arange(5)[i]
Out[127]: array([1, 2, 4, 3, 0])
But sorting that is just the same as sorting i.
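Putting the pieces together, the whole sort/work/restore cycle needs no extra field. A minimal sketch (data, col_1 and col_2 are hypothetical names):
order = np.argsort(data, order=('col_1', 'col_2'))  # permutation that sorts data
sorted_data = data[order]                           # sorted copy
# ... do the work on sorted_data ...
restored = sorted_data[np.argsort(order)]           # undo: back to original order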
merge_arrays is doing essentially the following:
Union dtype:
In [139]: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
In [140]: y=np.zeros((5,),dtype=dt)
fill in the values:
In [141]: y['RowID']=np.arange(5)
In [143]: for name in x.dtype.names:
     ...:     y[name]=x[name]
In [144]: y
Out[144]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
And to test my argsort of argsort idea:
In [145]: y[i]
Out[145]:
array([(1.0, 1, 0), (2.0, 0, 1), (4.0, 3, 1), (3.0, 1, 2), (0.0, 4, 2)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [146]: np.argsort(y[i],order=('RowID'))
Out[146]: array([4, 0, 1, 3, 2], dtype=int32)
In [147]: j
Out[147]: array([4, 0, 1, 3, 2], dtype=int32)
Sorting on the reordered RowID is the same as sorting on i.
Curiously merge_arrays is quite a bit slower than my reconstruction:
In [163]: rfn.merge_arrays([rowNums,x],flatten=True)
Out[163]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [164]: timeit rfn.merge_arrays([rowNums,x],flatten=True)
10000 loops, best of 3: 161 µs per loop
In [165]: %%timeit
dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
y=np.zeros((5,),dtype=dt)
y['RowID']=rowNums['RowID']
for name in x.dtype.names:
    y[name]=x[name]
10000 loops, best of 3: 38.4 µs per loop
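Wrapped up as a small helper, this is a sketch of the manual merge just timed (not a library function; it assumes arr's dtype is simple enough that .descr round-trips cleanly):
def add_row_id(arr, name='RowID', fmt='f8'):
    # union dtype: the new field first, then the original fields
    dt = np.dtype([(name, fmt)] + arr.dtype.descr)
    out = np.zeros(arr.shape, dtype=dt)
    out[name] = np.arange(len(arr))
    for f in arr.dtype.names:    # field-by-field copy, as above
        out[f] = arr[f]
    return out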

rowNums = np.zeros(len(csvData),dtype=[('RowID','f8')])
rowNums['RowID']=np.arange(len(csvData))
The above saves approx half a second per file with the csv files I am using. Very good so far.
However the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using:
sortorder = np.argsort(csvData, order=('col_1', 'col_2', 'col_3', 'col_4', 'col_5'))
giving an array that lists the order of items in csvData when sorted by cols 1 through 5.
This negates the need to make, populate and merge a RowID column, saving me around 15s per csv file (over 6 hrs across my entire dataset).
Thank you very much @hpaulj

Is there a simple way to remove "padding" fields from numpy.dtype.descr?

Context
Since numpy version 1.16, if you access multiple fields of a structured array, the dtype of the resulting array will have the same item size as the original one, leading to extra "padding":
The new behavior as of Numpy 1.16 leads to extra “padding” bytes at the location of unindexed fields compared to 1.15. You will need to update any code which depends on the data having a “packed” layout.
This can lead to issues, e.g. if you want to add fields to the array in question later-on:
import numpy as np
import numpy.lib.recfunctions

a = np.array(
    [
        (10.0, 13.5, 1248, -2),
        (20.0, 0.0, 0, 0),
        (30.0, 0.0, 0, 0),
        (40.0, 0.0, 0, 0),
        (50.0, 0.0, 0, 999)
    ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
)  # some array stolen from here: https://stackoverflow.com/a/37081693/5472354
print(a.shape, a.dtype, a.dtype.names, a.dtype.descr)
# all good so far

b = a[['x', 'i']]  # for further processing I only need certain fields
print(b.shape, b.dtype, b.dtype.names, b.dtype.descr)
# you will only notice the extra padding in the descr
# b = np.lib.recfunctions.repack_fields(b)  # workaround

# now when I add fields, this becomes an issue
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1
print(c.dtype.names)
print(c['f1'])
# the void fields are filled with raw data and were given proper names
# that can be accessed
Now a workaround would be to use numpy.lib.recfunctions.repack_fields, which removes the padding, and I will use this in the future, but for my previous code I need a fix. (Though there can be issues with recfunctions: the submodule may not be found unless it is imported explicitly, as is the case for me, hence the additional import numpy.lib.recfunctions statement.)
Question
This part of the code is what I used to add fields to an array (based on this):
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1
Though (now that I know of it) using numpy.lib.recfunctions.require_fields may be more appropriate to add the fields. However, I would still need a way to remove the empty fields from b.dtype.descr:
[('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
This is just a list of tuples, so I guess I could construct a more or less awkward way (along the lines of descr.remove(('', '|V8'))) to deal with this, but I was wondering if there is a better way, especially since the size of the voids depends on the number of left-out fields (e.g. a single V16 instead of two V8s when two left-out fields are adjacent). So the code would become real clunky real fast.
In [237]: a = np.array(
...: [
...: (10.0, 13.5, 1248, -2),
...: (20.0, 0.0, 0, 0),
...: (30.0, 0.0, 0, 0),
...: (40.0, 0.0, 0, 0),
...: (50.0, 0.0, 0, 999)
...: ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
...: )
In [238]: a
Out[238]:
array([(10., 13.5, 1248, -2), (20., 0. , 0, 0),
(30., 0. , 0, 0), (40., 0. , 0, 0),
(50., 0. , 0, 999)],
dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
the b view:
In [240]: b = a[['x','i']]
In [241]: b
Out[241]:
array([(10., 1248), (20., 0), (30., 0), (40., 0), (50., 0)],
dtype={'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
the repacked copy:
In [243]: c = rf.repack_fields(b)
In [244]: c
Out[244]:
array([(10., 1248), (20., 0), (30., 0), (40., 0), (50., 0)],
dtype=[('x', '<f8'), ('i', '<i8')])
In [245]: c.dtype
Out[245]: dtype([('x', '<f8'), ('i', '<i8')])
your overly padded attempt at adding a field:
In [247]: d = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
...: d[list(b.dtype.names)] = b
...: d['c'] = 1
In [248]: d
Out[248]:
array([(10., b'\x00\x00\x00\x00\x00\x00\x00\x00', 1248, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
(20., b'\x00\x00\x00\x00\x00\x00\x00\x00', 0, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
...],
dtype=[('x', '<f8'), ('f1', 'V8'), ('i', '<i8'), ('f3', 'V8'), ('c', '<i4')])
My first attempt at making a dtype that does not include the void fields. I don't know if simply testing for 'V' is robust enough:
In [253]: [des for des in b.dtype.descr if not 'V' in des[1]]
Out[253]: [('x', '<f8'), ('i', '<i8')]
And make a new dtype from that:
In [254]: d_dtype = _ + [('c','i4')]
All of this is normal python list and tuple manipulation. I've seen that in other recfunctions. I suspect repack_fields does something like this.
Now we make a new array with the simpler dtype:
In [255]: d = np.empty(b.shape, dtype=d_dtype)
In [256]: d[list(b.dtype.names)] = b
...: d['c'] = 1
In [257]: d
Out[257]:
array([(10., 1248, 1), (20., 0, 1), (30., 0, 1), (40., 0, 1),
(50., 0, 1)], dtype=[('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
I've extracted from repack_fields the code that constructs a new, un-padded, dtype:
In [262]: def foo(a):
     ...:     fieldinfo = []
     ...:     for name in a.names:
     ...:         tup = a.fields[name]
     ...:         fmt = tup[0]
     ...:         if len(tup) == 3:
     ...:             name = (tup[2], name)
     ...:         fieldinfo.append((name, fmt))
     ...:     print(fieldinfo)
     ...:     dt = np.dtype(fieldinfo)
     ...:     return dt
     ...:
In [263]: foo(b.dtype)
[('x', dtype('float64')), ('i', dtype('int64'))]
Out[263]: dtype([('x', '<f8'), ('i', '<i8')])
This works from dtype.fields rather than dtype.descr. One's a dict, the other a list.
In [274]: b.dtype
Out[274]: dtype({'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
In [275]: b.dtype.descr
Out[275]: [('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
In [276]: b.dtype.fields
Out[276]: mappingproxy({'x': (dtype('float64'), 0), 'i': (dtype('int64'), 16)})
In [277]: b.dtype.fields['x']
Out[277]: (dtype('float64'), 0)
another way of getting just the valid descr tuples from b.dtype:
In [278]: [des for des in b.dtype.descr if des[0] in b.dtype.names]
Out[278]: [('x', '<f8'), ('i', '<i8')]
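For the original use case, adding a field to a padded multifield selection, the pieces combine into something like this sketch (add_field is a hypothetical helper name; repack_fields does the unpadding):
import numpy.lib.recfunctions as rfn

def add_field(arr, name, fmt, value):
    packed = rfn.repack_fields(arr)     # drops the '|V8' padding entries
    out = np.empty(arr.shape, dtype=packed.dtype.descr + [(name, fmt)])
    for f in packed.dtype.names:        # field-by-field copy
        out[f] = packed[f]
    out[name] = value
    return out

d = add_field(b, 'c', 'i4', 1)   # dtype: [('x','<f8'), ('i','<i8'), ('c','<i4')]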

optimized container for dict with typed items?

I have a large (typed) array from a dataset, with the first column containing a unique key to index each row and the other columns containing matrices or vectors of known size. Is there any standard way, or a library, to efficiently turn that into a hash map for key-based retrieval? Splitting the rows into individual key-value objects doesn't seem very memory efficient.
Here is the baseline sub-optimal solution:
arr = np.array([
    ('b0263', 'a', 1, 2, 3),
    ('26ab0', 'b', 4, 5, 6),
    ('6d7fc', 'c', 7, 8, 9),
    ('48a24', 'd', 0, 1, 2),
    ('1dcca', 'e', 3, 4, 5)],
    dtype="S5, c, i4, i4, i4")
out = {key: values for key, *values in arr}
I thought about a dict containing the integer indexes of the corresponding rows but it requires two levels of indexing.
Your structured array:
In [317]: arr
Out[317]:
array([(b'b0263', b'a', 1, 2, 3), (b'26ab0', b'b', 4, 5, 6),
(b'6d7fc', b'c', 7, 8, 9), (b'48a24', b'd', 0, 1, 2),
(b'1dcca', b'e', 3, 4, 5)],
dtype=[('f0', 'S5'), ('f1', 'S1'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
The first field:
In [318]: arr['f0']
Out[318]: array([b'b0263', b'26ab0', b'6d7fc', b'48a24', b'1dcca'], dtype='|S5')
A dictionary built from that array:
In [321]: dd = {k:v for k,v in zip(arr['f0'], arr)}
In [322]: dd
Out[322]:
{b'b0263': (b'b0263', b'a', 1, 2, 3),
b'26ab0': (b'26ab0', b'b', 4, 5, 6),
b'6d7fc': (b'6d7fc', b'c', 7, 8, 9),
b'48a24': (b'48a24', b'd', 0, 1, 2),
b'1dcca': (b'1dcca', b'e', 3, 4, 5)}
In [323]: dd[b'6d7fc']
Out[323]: (b'6d7fc', b'c', 7, 8, 9)
In this construction, the value of a dictionary element is an element of the original array.
It has type np.void and dtype:
In [329]: dd[b'6d7fc'].dtype
Out[329]: dtype([('f0', 'S5'), ('f1', 'S1'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
And it is actually a view of the original (I checked __array_interface__). So apart from the pointers/references it doesn't consume much additional memory.
Recent numpy versions have been reworking the handling of views/copies of structured arrays, so the details may have changed, especially with respect to multifield references.
I should qualify the memory comment. In this example, itemsize is just 18. But referencing an element of the array, arr[0] creates a new np.void object. I suspect that object overhead is larger than 18 bytes.
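One rough way to check that overhead (my own probe, not part of the original answer):
import sys
elem = arr[0]                 # an np.void scalar viewing 18 bytes of data
print(arr.dtype.itemsize)     # 18
print(sys.getsizeof(elem))    # the Python object costs noticeably more than 18 bytes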
You could create a dictionary with tuple values:
In [335]: dd = {k:v.tolist()[1:] for k,v in zip(arr['f0'], arr)}
In [336]: dd[b'b0263']
Out[336]: (b'a', 1, 2, 3)
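The 'dict of integer indexes' idea from the question also works, with just one extra hop per lookup; a sketch:
idx = {k: i for i, k in enumerate(arr['f0'])}   # key -> row number
row = arr[idx[b'6d7fc']]                        # dict lookup, then array index
# the dict holds only small ints; the row data stays in the original array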

convert a set of tuples into a numpy array of lists in python

So, I've been using the set method "symmetric_difference" between 2 ndarray matrices in the following way:
x_set = list(set(tuple(i) for i in x_spam_matrix.tolist()).symmetric_difference(
    set(tuple(j) for j in partitioned_x[i].tolist())))
x = np.array([list(i) for i in x_set])
This method works fine for me, but it feels a little clumsy... is there any way to do this in a slightly more elegant way?
A simple list of tuples:
In [146]: alist = [(1,2),(3,4),(2,1),(3,4)]
put it in a set:
In [147]: aset = set(alist)
In [148]: aset
Out[148]: {(1, 2), (2, 1), (3, 4)}
np.array just wraps that set in an object dtype:
In [149]: np.array(aset)
Out[149]: array({(1, 2), (3, 4), (2, 1)}, dtype=object)
but make it into a list, and get a 2d array:
In [150]: np.array(list(aset))
Out[150]:
array([[1, 2],
[3, 4],
[2, 1]])
Since it is a list of tuples, it can also be made into a structured array:
In [151]: np.array(list(aset),'i,f')
Out[151]: array([(1, 2.), (3, 4.), (2, 1.)], dtype=[('f0', '<i4'), ('f1', '<f4')])
If the tuples varied in length, the list of tuples would be turned into a 1d array of tuples (object dtype):
In [152]: np.array([(1,2),(3,4),(5,6,7)])
Out[152]: array([(1, 2), (3, 4), (5, 6, 7)], dtype=object)
In [153]: _.shape
Out[153]: (3,)
The Zen of Python:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
[...]
There is nothing wrong with your code.
Though if I had to code-review it, I would suggest the following:
spam_matrix_set = set(tuple(item) for item in x_spam_matrix.tolist())
partitioned_set = set(tuple(item) for item in partitioned_x[index].tolist())
disjunctive_union = spam_matrix_set.symmetric_difference(partitioned_set)
x = np.array([list(item) for item in disjunctive_union])
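As an aside, and a different technique from either answer: if both inputs are 2d arrays of the same dtype, the symmetric difference can stay in numpy by viewing each row as a structured scalar and using np.setxor1d. A hedged sketch:
a = np.array([[1, 2], [3, 4]])
b = np.array([[3, 4], [5, 6]])
rowtype = np.dtype([('', a.dtype)] * a.shape[1])    # one field per column
xor = np.setxor1d(a.view(rowtype).ravel(), b.view(rowtype).ravel())
x = xor.view(a.dtype).reshape(-1, a.shape[1])       # back to 2d rows
print(x)    # [[1 2] [5 6]]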

Delete numpy column by name

I created a numpy array from csv by
dtest = np.genfromtxt('data/test.csv', delimiter=",", names = True)
The data has 200 columns named 'name', 'id', and so on.
I'm trying to delete the 'id' column.
Can I do that using the name of the column?
The answers in the proposed duplicate, How do you remove a column from a structured numpy array?
show how to reference a subset of the fields of a structured array. That may be what you want, but it has a potential problem, which I'll illustrate in a bit.
Start with a small sample csv 'file':
In [32]: txt=b"""a,id,b,c,d,e
...: a1, 3, 0,0,0,0.1
...: b2, 4, 1,2,3,4.4
...: """
In [33]: data=np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None)
In [34]: data
Out[34]:
array([(b'a1', 3, 0, 0, 0, 0.1),
(b'b2', 4, 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('id', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Multifield selection
I can get a 'view' of a subset of the fields with a field name list. The 'duplicate' showed how to construct such a list from the data.dtype.names. Here I'll just type it in, omitting the 'id' name.
In [35]: subd=data[['a','b','c','d']]
In [36]: subd
Out[36]:
array([(b'a1', 0, 0, 0), (b'b2', 1, 2, 3)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
The problem is that this isn't a regular 'view'. It's fine for reading, but any attempt to write to the subset raises a warning.
In [37]: subd[0]['b'] = 3
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
#!/usr/bin/python3
Making a subset copy is ok. But changes to subd won't affect data.
In [38]: subd=data[['a','b','c','d']].copy()
In [39]: subd[0]['b'] = 3
In [40]: subd
Out[40]:
array([(b'a1', 3, 0, 0), (b'b2', 1, 2, 3)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
A simple way to delete the ith field name from the indexing list:
In [60]: subnames = list(data.dtype.names) # list so it's mutable
In [61]: subnames
Out[61]: ['a', 'id', 'b', 'c', 'd', 'e']
In [62]: del subnames[1]
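Then index with the pruned list (and copy, per the warning above, if you intend to write to it):
subd = data[subnames].copy()   # every field except 'id'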
usecols
Since you are reading this array from the csv, you could use usecols to load everything but the 'id' column.
Since you have a large number of columns, it would be easiest to do something like:
In [42]: col=list(range(6)); del col[1]
In [43]: col
Out[43]: [0, 2, 3, 4, 5]
In [44]: np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None,usecols=col)
Out[44]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
recfunctions
There's a library of functions that can help manipulate structured arrays
In [45]: import numpy.lib.recfunctions as rf
In [47]: rf.drop_fields(data, ['id'])
Out[47]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Most functions in this group work by constructing a 'blank' array with the target dtype, and then copying values, by field, from the source to the target.
field copy
Here's the field copy approach used in recfunctions:
In [65]: data.dtype.descr # dtype description as list of tuples
Out[65]:
[('a', '|S2'),
('id', '<i4'),
('b', '<i4'),
('c', '<i4'),
('d', '<i4'),
('e', '<f8')]
In [66]: desc=data.dtype.descr
In [67]: del desc[1] # remove one field
In [68]: res = np.zeros(data.shape, dtype=desc) # target
In [69]: res
Out[69]:
array([(b'', 0, 0, 0, 0.), (b'', 0, 0, 0, 0.)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
In [70]: for name in res.dtype.names: # copy by field name
...: res[name] = data[name]
In [71]: res
Out[71]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Since usually structured arrays have many records, and few fields, copying by field name is relatively fast.
The linked SO cited matplotlib.mlab.rec_drop_fields(rec, names). This essentially does what I just outlined - make a target with the desired fields, and copy fields by name.
newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                     if name not in names])
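Filled out into a complete sketch, with the copy loop being my reconstruction around that newdtype line rather than mlab's exact code:
def rec_drop_fields(rec, names):
    newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                         if name not in names])
    out = np.empty(rec.shape, dtype=newdtype)
    for name in newdtype.names:    # copy the surviving fields by name
        out[name] = rec[name]
    return out

res = rec_drop_fields(data, ['id'])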
I know you have a comprehensive answer but this is another that I just put together.
import numpy as np
For a sample data file test1.csv:
a b c id
0 1 2 3
4 5 6 7
8 9 10 11
Import using genfromtxt:
d = np.genfromtxt('test1.csv', delimiter="\t", names = True)
d
> array([(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0), (8.0, 9.0, 10.0, 11.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8'), ('id', '<f8')])
Return a single column from your array by doing:
d['a']
> array([ 0., 4., 8.])
To delete the column by the name 'id' you can do the following:
Return a list of the column names by writing:
list(d.dtype.names)
> ['a', 'b', 'c', 'id']
Create a new numpy array by returning only those columns not equal to the string id.
Use a list comprehension to return a new list without your 'id' string:
[b for b in list(d.dtype.names) if b != 'id']
> ['a', 'b', 'c']
Combine to give:
d_new = d[[b for b in list(d.dtype.names) if b != 'id']]
> array([(0.0, 1.0, 2.0), (4.0, 5.0, 6.0), (8.0, 9.0, 10.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
This returns the array:
a b c
0 1 2
4 5 6
8 9 10
This may be new functionality in numpy (works in 1.20.2) but you can just slice your named array using a list of names (a tuple of names doesn't work though).
data = np.genfromtxt('some_file.csv', names=['a', 'b', 'c', 'd', 'e'])
# I don't want columns b or e
sliced = data[['a', 'c', 'd']]
I notice that you need to eliminate many columns that are named id. These columns show up as ['id', 'id_1', 'id_2', ...] and so on when parsed by genfromtxt, so you can use some list comprehension to pick out those column names and make a slice out of them.
no_ids = data[[n for n in data.dtype.names if 'id' not in n]]
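For a file with hundreds of named columns, the usecols list can also be computed from the header instead of typed by hand. A sketch, reusing the file name from the question:
with open('data/test.csv') as f:
    header = f.readline().strip().split(',')
keep = [i for i, name in enumerate(header) if name != 'id']
dtest = np.genfromtxt('data/test.csv', delimiter=',', names=True,
                      dtype=None, usecols=keep)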

In place sorting by column fails on slices

I'm trying to sort a numpy array by a specific column (in-place) using the solution from this answer. For the most part it works, but it fails on any array that's a view on another array:
In [35]: columnnum = 2
In [36]: a = np.array([[1,2,3], [4,7,5], [9,0,1]])
In [37]: a
Out[37]:
array([[1, 2, 3],
[4, 7, 5],
[9, 0, 1]])
In [38]: b = a[:,(0, 2)]
In [39]: b
Out[39]:
array([[1, 3],
[4, 5],
[9, 1]])
In [40]: a.view(','.join([a.dtype.str] * a.shape[1])).sort(order=['f%d' % columnnum], axis=0)
In [41]: a
Out[41]:
array([[9, 0, 1],
[1, 2, 3],
[4, 7, 5]])
In [42]: b.view(','.join([b.dtype.str] * b.shape[1])).sort(order=['f%d' % columnnum], axis=0)
ValueError: new type not compatible with array.
It looks like numpy doesn't support views of views, which makes a certain amount of sense, but I now can't figure out how to get the view I need for any array, whether it itself is a view or not. So far, I haven't been able to find any way to get the necessary information about the view I have to construct the new one I need.
For now, I'm using the l = l[l[:,columnnum].argsort()] in-place sorting method, which works fine, but since I'm operating on large datasets, I'd like to avoid the extra memory overhead of the argsort() call (the list of indexes). Is there either a way to get the necessary information about the view or to do the sort by column?
In [1019]: a=np.array([[1,2,3],[4,7,5],[9,0,1]])
In [1020]: b=a[:,(0,2)]
This is the a that you are sorting: a structured array with 3 fields. It uses the same data buffer, but interprets each group of 3 ints as a record of fields rather than a row of columns.
In [1021]: a.view('i,i,i')
Out[1021]:
array([[(1, 2, 3)],
[(4, 7, 5)],
[(9, 0, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
By the same logic, you try to view b:
In [1022]: b.view('i,i')
/usr/local/bin/ipython3:1: DeprecationWarning: Changing the shape of non-C contiguous array by
descriptor assignment is deprecated. To maintain
the Fortran contiguity of a multidimensional Fortran
array, use 'a.T.view(...).T' instead
#!/usr/bin/python3
....
ValueError: new type not compatible with array.
But if I use 3 fields instead of 2, it works (but with the same warning):
In [1023]: b.view('i,i,i')
/usr/local/bin/ipython3:1: DeprecationWarning:...
Out[1023]:
array([[(1, 4, 9), (3, 5, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
The problem is that b is Fortran order. Check b.flags, or compare the strides:
In [1026]: a.strides
Out[1026]: (12, 4)
In [1027]: b.strides
Out[1027]: (4, 12)
b is a copy, not a view. I don't know, off hand, why this construction of b changed the order.
Heeding the warning, I can do:
In [1047]: b.T.view('i,i,i').T
Out[1047]:
array([[(1, 4, 9), (3, 5, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
A default copy (order c) of b can be viewed as 2 fields:
In [1042]: b1=b.copy()
In [1043]: b1.strides
Out[1043]: (8, 4)
In [1044]: b1.view('i,i')
Out[1044]:
array([[(1, 3)],
[(4, 5)],
[(9, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
A footnote on https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html:
"The memory layout of an advanced indexing result is optimized for each indexing operation and no particular memory order can be assumed."
b in this case was constructed with advanced indexing, and thus is a copy. Even a true view might not be viewable this way:
In [1052]: a[:,:2].view('i,i')
....
ValueError: new type not compatible with array.
In [1054]: a[:,:2].copy().view('i,i')
Out[1054]:
array([[(1, 2)],
[(4, 7)],
[(9, 0)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
The view is selecting a subset of the values: 'i,i,x,i,i,x,i,i,x...', and that does not translate into a structured dtype.
The structured view of a does: '(i,i,i),(i,i,i),...'
You can select a subset of the fields of a structured array:
In [1059]: a1=a.view('i,i,i')
In [1060]: a1
Out[1060]:
array([[(1, 2, 3)],
[(4, 7, 5)],
[(9, 0, 1)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
In [1061]: b1=a1[['f0','f2']]
In [1062]: b1
Out[1062]:
array([[(1, 3)],
[(4, 5)],
[(9, 1)]],
dtype=[('f0', '<i4'), ('f2', '<i4')])
But there are limits as to what you can do with such a view. Values can be changed in a1, and seen in a and b1. But I get an error if I try to change values in b1.
This is on the development edge.
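A pragmatic takeaway as a sketch (sort_by_column is a hypothetical helper): the structured-view sort from the question works whenever the array is a C-contiguous 2d block, so test for that and fall back to a copy otherwise:
def sort_by_column(arr, col):
    if not arr.flags['C_CONTIGUOUS']:
        raise ValueError('needs a C-contiguous array; sort a copy instead')
    fields = ','.join([arr.dtype.str] * arr.shape[1])
    arr.view(fields).sort(order=['f%d' % col], axis=0)   # in place, no argsort

a = np.array([[1, 2, 3], [4, 7, 5], [9, 0, 1]])
sort_by_column(a, 2)
print(a)   # rows reordered by the third column: [[9 0 1] [1 2 3] [4 7 5]]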
