Delete numpy column by name - python

I created a numpy array from a csv file with:
dtest = np.genfromtxt('data/test.csv', delimiter=",", names = True)
The data has 200 columns named 'name', 'id', and so on.
I'm trying to delete the 'id' column.
Can I do that using the name of the column?

The answers in the proposed duplicate, "How do you remove a column from a structured numpy array?",
show how to reference a subset of the fields of a structured array. That may be what you want, but it has a potential problem, which I'll illustrate in a bit.
Start with a small sample csv 'file':
In [32]: txt=b"""a,id,b,c,d,e
...: a1, 3, 0,0,0,0.1
...: b2, 4, 1,2,3,4.4
...: """
In [33]: data=np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None)
In [34]: data
Out[34]:
array([(b'a1', 3, 0, 0, 0, 0.1),
(b'b2', 4, 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('id', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Multifield selection
I can get a 'view' of a subset of the fields with a field name list. The 'duplicate' showed how to construct such a list from the data.dtype.names. Here I'll just type it in, omitting the 'id' name.
In [35]: subd=data[['a','b','c','d']]
In [36]: subd
Out[36]:
array([(b'a1', 0, 0, 0), (b'b2', 1, 2, 3)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
The problem is that this isn't a regular 'view'. It's fine for reading, but any attempt to write to the subset raises a warning.
In [37]: subd[0]['b'] = 3
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
Making a subset copy is ok. But changes to subd won't affect data.
In [38]: subd=data[['a','b','c','d']].copy()
In [39]: subd[0]['b'] = 3
In [40]: subd
Out[40]:
array([(b'a1', 3, 0, 0), (b'b2', 1, 2, 3)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
A simple way to delete the ith field name from the indexing list:
In [60]: subnames = list(data.dtype.names)  # list so it's mutable
In [61]: subnames
Out[61]: ['a', 'id', 'b', 'c', 'd', 'e']
In [62]: del subnames[1]
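The trimmed list then works directly as the multifield index; a minimal sketch continuing the session above (add .copy() if you need an independent, safely writable result):

subd = data[subnames].copy()   # every field except 'id'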
usecols
Since you are reading this array from the csv, you could use usecols to load everything but the 'id' column.
Since you have a large number of columns, it would be easiest to do something like:
In [42]: col=list(range(6)); del col[1]
In [43]: col
Out[43]: [0, 2, 3, 4, 5]
In [44]: np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None,usecols=col)
Out[44]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
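For the 200-column file in the question, the index of 'id' need not be counted by hand. A sketch that derives usecols from the header line (assuming the 'data/test.csv' path from the question):

with open('data/test.csv') as f:
    header = f.readline().strip().split(',')
usecols = [i for i, name in enumerate(header) if name != 'id']
dtest = np.genfromtxt('data/test.csv', delimiter=',', names=True,
                      dtype=None, usecols=usecols)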
recfunctions
There's a library of functions that can help manipulate structured arrays:
In [45]: import numpy.lib.recfunctions as rf
In [47]: rf.drop_fields(data, ['id'])
Out[47]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Most functions in this group work by constructing a 'blank' array with the target dtype, and then copying values, by field, from the source to the target.
field copy
Here's the field copy approach used in recfunctions:
In [65]: data.dtype.descr # dtype description as list of tuples
Out[65]:
[('a', '|S2'),
('id', '<i4'),
('b', '<i4'),
('c', '<i4'),
('d', '<i4'),
('e', '<f8')]
In [66]: desc=data.dtype.descr
In [67]: del desc[1] # remove one field
In [68]: res = np.zeros(data.shape, dtype=desc) # target
In [69]: res
Out[69]:
array([(b'', 0, 0, 0, 0.), (b'', 0, 0, 0, 0.)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
In [70]: for name in res.dtype.names:   # copy by field name
    ...:     res[name] = data[name]
In [71]: res
Out[71]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Since usually structured arrays have many records, and few fields, copying by field name is relatively fast.
The linked SO cited matplotlib.mlab.rec_drop_fields(rec, names). This essentially does what I just outlined - make a target with the desired fields, and copy fields by name.
newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                     if name not in names])
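Filled out, that approach is only a few lines. A sketch of such a helper built from the pieces above (the function name is mine):

def drop_fields_by_name(rec, names):
    # dtype without the unwanted fields
    newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                         if name not in names])
    out = np.empty(rec.shape, dtype=newdtype)
    for name in newdtype.names:        # copy field by field
        out[name] = rec[name]
    return out

drop_fields_by_name(data, ['id'])      # should match rf.drop_fields(data, ['id'])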

I know you have a comprehensive answer but this is another that I just put together.
import numpy as np
For some sample data file:
test1.csv =
a b c id
0 1 2 3
4 5 6 7
8 9 10 11
Import using genfromtxt:
d = np.genfromtxt('test1.csv', delimiter="\t", names = True)
d
> array([(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0), (8.0, 9.0, 10.0, 11.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8'), ('id', '<f8')])
Return a single column from your array by doing:
d['a']
> array([ 0., 4., 8.])
To delete the column by the name 'id' you can do the following:
Return a list of the column names by writing:
list(d.dtype.names)
> ['a', 'b', 'c', 'id']
Create a new numpy array by returning only those columns not equal to the string id.
Use a list comprehension to return a new list without your 'id' string:
[b for b in list(d.dtype.names) if b != 'id']
> ['a', 'b', 'c']
Combine to give:
d_new = d[[b for b in list(d.dtype.names) if b != 'id']]
> array([(0.0, 1.0, 2.0), (4.0, 5.0, 6.0), (8.0, 9.0, 10.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
This returns the array:
a b c
0 1 2
4 5 6
8 9 10

This may be new functionality in numpy (works in 1.20.2) but you can just slice your named array using a list of names (a tuple of names doesn't work though).
data = np.genfromtxt('some_file.csv', names=['a', 'b', 'c', 'd', 'e'])
# I don't want columns b or d
sliced = data[['a', 'c', 'e']]
I notice that you need to eliminate many columns that are named id. These columns show up as ['id', 'id_1', 'id_2', ...] and so on when parsed by genfromtxt, so you can use some list comprehension to pick out those column names and make a slice out of them.
no_ids = data[[n for n in data.dtype.names if 'id' not in n]]
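Note the substring test also drops any field whose name merely contains 'id' (a hypothetical 'video' column, say). If the names really follow the ['id', 'id_1', 'id_2', ...] pattern, a stricter test avoids that:

no_ids = data[[n for n in data.dtype.names
               if n != 'id' and not n.startswith('id_')]]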

Related

Is there a simple way to remove "padding" fields from numpy.dtype.descr?

Context
Since numpy version 1.16, if you access multiple fields of a structured array, the dtype of the resulting array will have the same item size as the original one, leading to extra "padding":
The new behavior as of Numpy 1.16 leads to extra “padding” bytes at the location of unindexed fields compared to 1.15. You will need to update any code which depends on the data having a “packed” layout.
This can lead to issues, e.g. if you want to add fields to the array in question later on:
import numpy as np
import numpy.lib.recfunctions
a = np.array(
    [
        (10.0, 13.5, 1248, -2),
        (20.0, 0.0, 0, 0),
        (30.0, 0.0, 0, 0),
        (40.0, 0.0, 0, 0),
        (50.0, 0.0, 0, 999)
    ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
)  # some array stolen from here: https://stackoverflow.com/a/37081693/5472354
print(a.shape, a.dtype, a.dtype.names, a.dtype.descr)
# all good so far
b = a[['x', 'i']] # for further processing I only need certain fields
print(b.shape, b.dtype, b.dtype.names, b.dtype.descr)
# you will only notice the extra padding in the descr
# b = np.lib.recfunctions.repack_fields(b)
# workaround
# now when I add fields, this becomes an issue
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1
print(c.dtype.names)
print(c['f1'])
# the void fields are filled with raw data and were given proper names
# that can be accessed
Now a workaround would be to use numpy.lib.recfunctions.repack_fields, which removes the padding, and I will use this in the future, but for my previous code, I need a fix. (Though there can be issues with recfunctions, as the module may not be found; as is the case for me, thus the additional import numpy.lib.recfunctions statement.)
Question
This part of the code is what I used to add fields to an array (based on this):
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1
Though (now that I know of it) using numpy.lib.recfunctions.require_fields may be more appropriate to add the fields. However, I would still need a way to remove the empty fields from b.dtype.descr:
[('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
This is just a list of tuples, so I guess I could construct a more or less awkward way (along the lines of descr.remove(('', '|V8'))) to deal with this, but I was wondering if there is a better way, especially since the size of the voids depends on the number of left-out fields, e.g. from V8 to V16 if there are two in a row and so on (instead of a new void for each left-out field). So the code would become real clunky real fast.
In [237]: a = np.array(
...: [
...: (10.0, 13.5, 1248, -2),
...: (20.0, 0.0, 0, 0),
...: (30.0, 0.0, 0, 0),
...: (40.0, 0.0, 0, 0),
...: (50.0, 0.0, 0, 999)
...: ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
...: )
In [238]: a
Out[238]:
array([(10., 13.5, 1248, -2), (20., 0. , 0, 0),
(30., 0. , 0, 0), (40., 0. , 0, 0),
(50., 0. , 0, 999)],
dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
the b view:
In [240]: b = a[['x','i']]
In [241]: b
Out[241]:
array([(10., 1248), (20., 0), (30., 0), (40., 0), (50., 0)],
dtype={'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
the repacked copy:
In [243]: c = rf.repack_fields(b)
In [244]: c
Out[244]:
array([(10., 1248), (20., 0), (30., 0), (40., 0), (50., 0)],
dtype=[('x', '<f8'), ('i', '<i8')])
In [245]: c.dtype
Out[245]: dtype([('x', '<f8'), ('i', '<i8')])
your overly padded attempt at adding a field:
In [247]: d = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
...: d[list(b.dtype.names)] = b
...: d['c'] = 1
In [248]: d
Out[248]:
array([(10., b'\x00\x00\x00\x00\x00\x00\x00\x00', 1248, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
(20., b'\x00\x00\x00\x00\x00\x00\x00\x00', 0, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
...],
dtype=[('x', '<f8'), ('f1', 'V8'), ('i', '<i8'), ('f3', 'V8'), ('c', '<i4')])
My first attempt at making a dtype that does not include the void fields. I don't know if simply testing for 'V' is robust enough:
In [253]: [des for des in b.dtype.descr if not 'V' in des[1]]
Out[253]: [('x', '<f8'), ('i', '<i8')]
And make a new dtype from that:
In [254]: d_dtype = _ + [('c','i4')]
All of this is normal python list and tuple manipulation. I've seen that in other recfunctions. I suspect repack_fields does something like this.
Now we make a new array with the simpler dtype:
In [255]: d = np.empty(b.shape, dtype=d_dtype)
In [256]: d[list(b.dtype.names)] = b
...: d['c'] = 1
In [257]: d
Out[257]:
array([(10., 1248, 1), (20., 0, 1), (30., 0, 1), (40., 0, 1),
(50., 0, 1)], dtype=[('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
I've extracted from repack_fields the code that constructs a new, un-padded, dtype:
In [262]: def foo(a):
     ...:     fieldinfo = []
     ...:     for name in a.names:
     ...:         tup = a.fields[name]
     ...:         fmt = tup[0]
     ...:         if len(tup) == 3:
     ...:             name = (tup[2], name)
     ...:         fieldinfo.append((name, fmt))
     ...:     print(fieldinfo)
     ...:     dt = np.dtype(fieldinfo)
     ...:     return dt
     ...:
In [263]: foo(b.dtype)
[('x', dtype('float64')), ('i', dtype('int64'))]
Out[263]: dtype([('x', '<f8'), ('i', '<i8')])
This works from dtype.fields rather than the dtype.descr. One's a dict the other a list.
In [274]: b.dtype
Out[274]: dtype({'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
In [275]: b.dtype.descr
Out[275]: [('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
In [276]: b.dtype.fields
Out[276]: mappingproxy({'x': (dtype('float64'), 0), 'i': (dtype('int64'), 16)})
In [277]: b.dtype.fields['x']
Out[277]: (dtype('float64'), 0)
another way of getting just the valid descr tuples from b.dtype:
In [278]: [des for des in b.dtype.descr if des[0] in b.dtype.names]
Out[278]: [('x', '<f8'), ('i', '<i8')]
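For completeness, recfunctions can also do the add-a-field step itself. A sketch using rf.append_fields on the repacked array (repacking first, so no padding sneaks into the result):

import numpy.lib.recfunctions as rf   # as above

b2 = rf.repack_fields(b)              # remove the padding first
d = rf.append_fields(b2, 'c', np.ones(b2.shape[0], dtype='i4'), usemask=False)
# d.dtype should be [('x', '<f8'), ('i', '<i8'), ('c', '<i4')], with 'c' all ones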

numpy: creating recarray fast with different column types

I am trying to create a recarray from a series of numpy arrays with column names and mixed variable types.
The following works but is slow:
import numpy as np
a = np.array([1,2,3,4], dtype=np.int)
b = np.array([6,6,6,6], dtype=np.int)
c = np.array([-1.,-2.-1.,-1.], dtype=np.float32)
d = np.array(list(zip(a,b,c)), dtype = [('a',np.int),('b',np.int),('c',np.float32)])
d = d.view(np.recarray)
I think there should be a way to do this with np.stack((a,b,c), axis=-1), which is faster than the list(zip()) method. However, there does not seem to be a trivial way to do the stacking and preserve column types. This link does seem to show how to do it, but it's pretty clunky and I hope there is a better way.
Thanks for the help!
np.rec.fromarrays is probably what you want:
>>> np.rec.fromarrays([a, b, c], names=['a', 'b', 'c'])
rec.array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])
Here's the field by field approach that I commented on:
In [308]: a = np.array([1,2,3,4], dtype=np.int)
...: b = np.array([6,6,6,6], dtype=np.int)
...: c = np.array([-1.,-2.,-1.,-1.], dtype=np.float32)
...: dt = np.dtype([('a',np.int),('b',np.int),('c',np.float32)])
...:
...:
(I had to correct your copy-n-pasted c).
In [309]: arr = np.zeros(a.shape, dtype=dt)
In [310]: for name, x in zip(dt.names, [a,b,c]):
     ...:     arr[name] = x
     ...:
In [311]: arr
Out[311]:
array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])
Since typically the array will have many more records (rows) than fields, this should be faster than the list of tuples approach. In this case it is probably comparable in speed.
In [312]: np.array(list(zip(a,b,c)), dtype=dt)
Out[312]:
array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])
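A rough timing sketch on a larger (hypothetical) array makes the gap visible; here with a million rows:

import timeit
import numpy as np

n = 1_000_000
aa, bb = np.arange(n), np.arange(n)
cc = np.arange(n, dtype=np.float32)
dt = np.dtype([('a', np.int64), ('b', np.int64), ('c', np.float32)])

def by_field():
    out = np.zeros(n, dtype=dt)
    for name, x in zip(dt.names, [aa, bb, cc]):
        out[name] = x
    return out

def by_zip():
    return np.array(list(zip(aa, bb, cc)), dtype=dt)

print(timeit.timeit(by_field, number=10))  # milliseconds per call
print(timeit.timeit(by_zip, number=1))     # much slower per call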
rec.fromarrays, after some setup to determine the dtype, does:
_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]
The only way to use stack is to create recarrays first:
In [315]: [np.rec.fromarrays((i,j,k), dtype=dt) for i,j,k in zip(a,b,c)]
Out[315]:
[rec.array((1, 6, -1.),
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')]),
rec.array((2, 6, -2.),
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')]),
rec.array((3, 6, -1.),
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')]),
rec.array((4, 6, -1.),
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<f4')])]
In [316]: np.stack(_)
Out[316]:
array([(1, 6, -1.), (2, 6, -2.), (3, 6, -1.), (4, 6, -1.)],
dtype=(numpy.record, [('a', '<i8'), ('b', '<i8'), ('c', '<f4')]))

How to get <class 'numpy.str'> instead of <class 'numpy.object_'>

My question originates from this answer by Phil.
The code is:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,31,2.5,1260759144], [1,1029,3,1260759179],
                   [1,1061,3,1260759182], [1,1129,2,1260759185],
                   [1,1172,4,1260759205], [2,31,3,1260759134],
                   [2,1111,4.5,1260759256]],
                  index=list(['a','c','h','g','e','b','f']),
                  columns=list(['userId','movieId','rating','timestamp']))
df.index.names = ['ID No.']
df.columns.names = ['Information']

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    # df[k].dtype.type is <class 'numpy.object_'>; I want to convert it to numpy.str
    types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

sa = df_to_sarray(df.reset_index())
print(sa)
Phil's answer works well, but if I run
sa = df_to_sarray(df.reset_index())
I will get the following result.
array([('a', 1, 31, 2.5, 1260759144), ('c', 1, 1029, 3.0, 1260759179),
('h', 1, 1061, 3.0, 1260759182), ('g', 1, 1129, 2.0, 1260759185),
('e', 1, 1172, 4.0, 1260759205), ('b', 2, 31, 3.0, 1260759134),
('f', 2, 1111, 4.5, 1260759256)],
dtype=[('ID No.', 'O'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])
I hope I can get a dtype like the following:
dtype=[('ID No.', 'S'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')]
i.e. string instead of object.
I tested the type of df[k].dtype.type and found it is <class 'numpy.object_'>; I want to convert it to numpy.str. How can I do that?
After reset_index the dtypes of your dataframe are a mix of object and numbers. The index has been rendered as object, not strings.
In [9]: df1=df.reset_index()
In [10]: df1.dtypes
Out[10]:
Information
ID No. object
userId int64
movieId int64
rating float64
timestamp int64
dtype: object
df1.values is a (7,5) object dtype array.
With the correct dtype, your approach works nicely (I'm using 'U2' on Py3):
In [31]: v = df1.values
In [32]: dt1=np.dtype([('ID No.', 'U2'), ('userId', '<i8'), ('movieId', '<i8'),
...: ('rating', '<f8'), ('timestamp', '<i8')])
In [33]: z = np.zeros(v.shape[0], dtype=dt1)
In [34]: for i,k in enumerate(dt1.names):
    ...:     z[k] = v[:, i]
    ...:
In [35]: z
Out[35]:
array([('a', 1, 31, 2.5, 1260759144), ('c', 1, 1029, 3. , 1260759179),
('h', 1, 1061, 3. , 1260759182), ('g', 1, 1129, 2. , 1260759185),
('e', 1, 1172, 4. , 1260759205), ('b', 2, 31, 3. , 1260759134),
('f', 2, 1111, 4.5, 1260759256)],
dtype=[('ID No.', '<U2'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])
So the trick is to derive that dt1 from the dataframe.
Editing types after construction is one option:
In [36]: cols=df1.columns
In [37]: types = [(cols[i], df1[k].dtype.type) for (i, k) in enumerate(cols)]
In [38]: types
Out[38]:
[('ID No.', numpy.object_),
('userId', numpy.int64),
('movieId', numpy.int64),
('rating', numpy.float64),
('timestamp', numpy.int64)]
In [39]: types[0]=(types[0][0], 'U2')
In [40]: types
Out[40]:
[('ID No.', 'U2'),
('userId', numpy.int64),
('movieId', numpy.int64),
('rating', numpy.float64),
('timestamp', numpy.int64)]
In [41]: z = np.zeros(v.shape[0], dtype=types)
Tweaking the column dtype during construction also works:
def foo(atype):
    if atype == np.object_:
        return 'U2'
    return atype
In [59]: types = [(cols[i], foo(df1[k].dtype.type)) for (i, k) in enumerate(cols)]
In either case we have to know ahead of time that we want to turn the object column into a specific string type, and not something more generic.
I don't know enough pandas to say whether it's possible to change the dtype of that ID column before we extract an array. .values will be an object dtype array because of the mix of column dtypes.
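One way to derive dt1 automatically is to map any object column to a fixed-width string sized from the data. A sketch (the width computation is my own assumption, not from the original answer):

def frame_dtype(df):
    types = []
    for col in df.columns:
        t = df[col].dtype.type
        if t is np.object_:
            width = int(df[col].astype(str).str.len().max())
            t = 'U%d' % width              # e.g. 'U2' for the ID column
        types.append((col, t))
    return np.dtype(types)

dt1 = frame_dtype(df1)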

how do I add a 'RowNumber' field to a structured numpy array?

I am using genfromtxt to load large csv files into structured arrays. I need to sort the data (using multiple fields), do some work and then restore the data to the original ordering. My plan is to add another field to the data and put the row number into this field before the first sort is applied. It can then be used to revert the order at the end. I thought there might be an elegant way of adding this field of record numbers but after hours of trying and searching for ideas I have nothing particularly slick.
import numpy
import numpy.lib.recfunctions as rfn

def main():
    csvDataFile = 'C:\\File1.csv'
    csvData = numpy.genfromtxt(csvDataFile, delimiter=',', names=True, dtype='f8')
    rowNums = numpy.zeros(len(csvData), dtype=[('RowID','f8')])
    # populate and add column for RowID
    for i in range(0, len(csvData)):
        rowNums['RowID'][i] = i
    csvDataWithID = rfn.merge_arrays((csvData, rowNums), asrecarray=True, flatten=True)
The recfunctions.merge_arrays in particular is very slow, and adding the row numbers one by one seems so old school. Your ideas would be gratefully received.
You don't need to fill rowNums iteratively:
In [93]: rowNums=np.zeros(10,dtype=[('RowID','f8')])
In [94]: for i in range(0,10):
   ....:     rowNums['RowID'][i]=i
   ....:
In [95]: rowNums
Out[95]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
Just assign the range values to the field:
In [96]: rowNums['RowID']=np.arange(10)
In [97]: rowNums
Out[97]:
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
(8.0,), (9.0,)],
dtype=[('RowID', '<f8')])
rfn.merge_arrays shouldn't be that slow - unless csvData.dtype has a great number of fields. This function creates a new dtype that merges the fields of the 2 inputs, and then copies data field by field. For many rows, and just a few fields, that is quite fast.
But you should be able to get the original order back without adding this extra field.
A 2 field 1d array:
In [118]: x = np.array([(4,2),(1, 0), (0, 1),(1,2),(3,1)], dtype=[('x', '<i4'), ('y', '<i4')])
In [119]: i = np.argsort(x, order=('y','x'))
In [120]: i
Out[120]: array([1, 2, 4, 3, 0], dtype=int32)
In [121]: x[i]
Out[121]:
array([(1, 0), (0, 1), (3, 1), (1, 2), (4, 2)],
dtype=[('x', '<i4'), ('y', '<i4')])
The same values are now sorted first on y, then on x.
In [122]: j=np.argsort(i)
In [123]: j
Out[123]: array([4, 0, 1, 3, 2], dtype=int32)
In [124]: x[i][j]
Out[124]:
array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
dtype=[('x', '<i4'), ('y', '<i4')])
Back to the original order
I could have added a row index array to x, and then done a sort on that. But why add it; why not just apply i to a separate array:
In [127]: np.arange(5)[i]
Out[127]: array([1, 2, 4, 3, 0])
But sorting that is just the same as sorting i.
merge_arrays is doing essentially the following:
Union dtype:
In [139]: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
In [140]: y=np.zeros((5,),dtype=dt)
fill in the values:
In [141]: y['RowID']=np.arange(5)
In [143]: for name in x.dtype.names:
     ...:     y[name] = x[name]
In [144]: y
Out[144]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
And to test my argsort of argsort idea:
In [145]: y[i]
Out[145]:
array([(1.0, 1, 0), (2.0, 0, 1), (4.0, 3, 1), (3.0, 1, 2), (0.0, 4, 2)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [146]: np.argsort(y[i],order=('RowID'))
Out[146]: array([4, 0, 1, 3, 2], dtype=int32)
In [147]: j
Out[147]: array([4, 0, 1, 3, 2], dtype=int32)
Sorting on the reordered RowID is the same as sorting on i.
Curiously merge_arrays is quite a bit slower than my reconstruction:
In [163]: rfn.merge_arrays([rowNums,x],flatten=True)
Out[163]:
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)],
dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [164]: timeit rfn.merge_arrays([rowNums,x],flatten=True)
10000 loops, best of 3: 161 µs per loop
In [165]: %%timeit
dt = np.dtype(rowNums.dtype.descr + x.dtype.descr)
y = np.zeros((5,), dtype=dt)
y['RowID'] = rowNums['RowID']
for name in x.dtype.names:
    y[name] = x[name]
10000 loops, best of 3: 38.4 µs per loop
rowNums = np.zeros(len(csvData),dtype=[('RowID','f8')])
rowNums['RowID']=np.arange(len(csvData))
The above saves approx half a second per file with the csv files I am using. Very good so far.
However, the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using:
sortorder = np.argsort(csvData, order=['col_1','col_2','col_3','col_4','col_5'])
giving an array that lists the order of items in csvData when sorted by cols 1 through 5.
This negates the need to make, populate, and merge a RowID column, saving me around 15s per csv file (over 6 hrs across my entire dataset).
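For completeness, the round trip then looks like this (a sketch with the placeholder column names above):

sortedData = csvData[sortorder]
# ... work on sortedData ...
restored = sortedData[np.argsort(sortorder)]   # back to the original row order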
Thank you very much @hpaulj

why setting dtype on numpy array changes dimensions?

I am trying to understand how to set the dtype of an array. My original numpy array's dimensions are (583760, 7), i.e. 583760 rows and 7 columns. I am setting the dtype as follows:
>>> allRics.shape
(583760, 7)
>>> allRics.dtype = [('idx', np.float), ('opened', np.float), ('time', np.float),('trdp1',np.float),('trdp0',np.float),('dt',np.float),('value',np.float)]
>>> allRics.shape
(583760, 1)
Why is there a change in the original shape of the array? What causes this change? I am basically trying to sort the original numpy array by the time column, and that's why I am setting the dtype. But after the dimension change, I am not able to sort the array:
>>> x=np.sort(allRics,order='time')
There is no change in the output of the above command. Could you please advise?
You are turning your array into a structured array. Basically, instead of a 2D array it is now treated as a 1D array of structs. Take a look at the simpler example below:
>>> import numpy as np
>>> arr = np.array([(1,2,3),(3,4,5)])
>>> arr
array([[1, 2, 3],
[3, 4, 5]])
>>> arr.shape
(2, 3)
>>> arr.dtype=[('a',int),('b',int),('c', int)]
>>> arr # Notice that tuples inside the elements
array([[(1, 2, 3)],
[(3, 4, 5)]],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
>>> arr.shape
(2, 1)
The structured array not sorting is most assuredly a bug. It looks like a workaround is to actually declare the array as a structured array to begin with:
>>> arr_s = np.sort(arr, order='b')
>>> arr_s
array([[(1, 2, 3)],
[(3, 4, 5)]],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
>>> dtype=[('a',np.int64),('b',np.int64),('c', np.int64)]
>>> arr = np.array([(5,2,3),(3,4,1)], dtype=dtype)
>>> arr
array([(5, 2, 3), (3, 4, 1)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
>>> arr_s = np.sort(arr, order='a')
>>> arr_s
array([(3, 4, 1), (5, 2, 3)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
>>> arr_s = np.sort(arr, order='b')
>>> arr_s
array([(5, 2, 3), (3, 4, 1)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
>>> arr_s = np.sort(arr, order='c')
>>> arr_s
array([(3, 4, 1), (5, 2, 3)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
>>>
You might be able to avoid using structured arrays altogether if all you are using them for is sorting. You could do something like:
new_order = np.argsort(allRics[:, 2])
x = allRics[new_order]
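If the sort needs several of the plain 2D columns, np.lexsort works without structured dtypes too (a sketch; note that lexsort treats its last key as the primary one):

order = np.lexsort((allRics[:, 0], allRics[:, 2]))   # primary key: column 2 (time)
x = allRics[order]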
