How to get <class 'numpy.str'> instead of <class 'numpy.object_'> - python

My question originates from this answer by Phil.
The code is:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 31, 2.5, 1260759144], [1, 1029, 3, 1260759179],
                   [1, 1061, 3, 1260759182], [1, 1129, 2, 1260759185],
                   [1, 1172, 4, 1260759205], [2, 31, 3, 1260759134],
                   [2, 1111, 4.5, 1260759256]],
                  index=list(['a', 'c', 'h', 'g', 'e', 'b', 'f']),
                  columns=list(['userId', 'movieId', 'rating', 'timestamp']))
df.index.names = ['ID No.']
df.columns.names = ['Information']
def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())
    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    # df[k].dtype.type is <class 'numpy.object_'>; I want to convert it to numpy.str
    types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
sa = df_to_sarray(df.reset_index())
print(sa)
Phil's answer works well, but if I run
sa = df_to_sarray(df.reset_index())
I get the following result.
array([('a', 1, 31, 2.5, 1260759144), ('c', 1, 1029, 3.0, 1260759179),
('h', 1, 1061, 3.0, 1260759182), ('g', 1, 1129, 2.0, 1260759185),
('e', 1, 1172, 4.0, 1260759205), ('b', 2, 31, 3.0, 1260759134),
('f', 2, 1111, 4.5, 1260759256)],
dtype=[('ID No.', 'O'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])
I would like to get the dtype as follows:
dtype=[('ID No.', 'S'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')]
That is, a string type instead of object.
I checked the type of df[k].dtype.type and found it is <class 'numpy.object_'>. I want to convert it to numpy.str. How can I do that?

After reset_index the dtypes of your dataframe are a mix of object and numbers. The index has been rendered as object, not strings.
In [9]: df1=df.reset_index()
In [10]: df1.dtypes
Out[10]:
Information
ID No. object
userId int64
movieId int64
rating float64
timestamp int64
dtype: object
df1.values is a (7,5) object dtype array.
With the correct dtype, your approach does nicely (I'm using 'U2' on Py3):
In [31]: v = df1.values
In [32]: dt1=np.dtype([('ID No.', 'U2'), ('userId', '<i8'), ('movieId', '<i8'),
...: ('rating', '<f8'), ('timestamp', '<i8')])
In [33]: z = np.zeros(v.shape[0], dtype=dt1)
In [34]:
In [34]: for i,k in enumerate(dt1.names):
...: z[k] = v[:, i]
...:
In [35]: z
Out[35]:
array([('a', 1, 31, 2.5, 1260759144), ('c', 1, 1029, 3. , 1260759179),
('h', 1, 1061, 3. , 1260759182), ('g', 1, 1129, 2. , 1260759185),
('e', 1, 1172, 4. , 1260759205), ('b', 2, 31, 3. , 1260759134),
('f', 2, 1111, 4.5, 1260759256)],
dtype=[('ID No.', '<U2'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])
So the trick is to derive that dt1 from the dataframe.
Editing types after construction is one option:
In [36]: cols=df1.columns
In [37]: types = [(cols[i], df1[k].dtype.type) for (i, k) in enumerate(cols)]
In [38]: types
Out[38]:
[('ID No.', numpy.object_),
('userId', numpy.int64),
('movieId', numpy.int64),
('rating', numpy.float64),
('timestamp', numpy.int64)]
In [39]: types[0]=(types[0][0], 'U2')
In [40]: types
Out[40]:
[('ID No.', 'U2'),
('userId', numpy.int64),
('movieId', numpy.int64),
('rating', numpy.float64),
('timestamp', numpy.int64)]
In [41]:
In [41]: z = np.zeros(v.shape[0], dtype=types)
Tweaking the column dtype during construction also works:
def foo(atype):
    if atype == np.object_:
        return 'U2'
    return atype
In [59]: types = [(cols[i], foo(df1[k].dtype.type)) for (i, k) in enumerate(cols)]
In either case we have to know ahead of time that we want to turn the object column into a specific string type, and not something more generic.
I don't know enough pandas to say whether it's possible to change the dtype of that ID column before we extract an array. .values will be an object dtype because of the mix of column dtypes.
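If it helps, here is a minimal sketch (my own wrapper, not Phil's original) that folds that object-to-string mapping into the conversion function; the fixed width str_width is an assumption you'd tune to your data:

import numpy as np
import pandas as pd

def df_to_sarray_str(df, str_width=2):
    # Map object columns to a fixed-width unicode type; everything else
    # keeps the dtype pandas reports. str_width is an assumption.
    def field_type(col):
        t = df[col].dtype.type
        return 'U%d' % str_width if t is np.object_ else t

    types = [(str(col), field_type(col)) for col in df.columns]
    z = np.zeros(len(df), dtype=np.dtype(types))
    v = df.values
    for i, name in enumerate(z.dtype.names):
        z[name] = v[:, i]
    return z

# sa = df_to_sarray_str(df.reset_index())   # 'ID No.' comes out as '<U2'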

Is there a simple way to remove "padding" fields from numpy.dtype.descr?

Context
Since numpy version 1.16, if you access multiple fields of a structured array, the dtype of the resulting array will have the same item size as the original one, leading to extra "padding":
The new behavior as of Numpy 1.16 leads to extra “padding” bytes at the location of unindexed fields compared to 1.15. You will need to update any code which depends on the data having a “packed” layout.
This can lead to issues, e.g. if you want to add fields to the array in question later-on:
import numpy as np
import numpy.lib.recfunctions

a = np.array(
    [
        (10.0, 13.5, 1248, -2),
        (20.0, 0.0, 0, 0),
        (30.0, 0.0, 0, 0),
        (40.0, 0.0, 0, 0),
        (50.0, 0.0, 0, 999)
    ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
)  # some array stolen from here: https://stackoverflow.com/a/37081693/5472354
print(a.shape, a.dtype, a.dtype.names, a.dtype.descr)
# all good so far

b = a[['x', 'i']]  # for further processing I only need certain fields
print(b.shape, b.dtype, b.dtype.names, b.dtype.descr)
# you will only notice the extra padding in the descr
# b = np.lib.recfunctions.repack_fields(b)  # workaround

# now when I add fields, this becomes an issue
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1
print(c.dtype.names)
print(c['f1'])
# the void fields are filled with raw data and were given proper names
# that can be accessed
Now a workaround would be to use numpy.lib.recfunctions.repack_fields, which removes the padding, and I will use this in the future, but for my previous code, I need a fix. (Though there can be issues with recfunctions, as the module may not be found; as is the case for me, thus the additional import numpy.lib.recfunctions statement.)
Question
This part of the code is what I used to add fields to an array (based on this):
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1
Though (now that I know of it) using numpy.lib.recfunctions.require_fields may be more appropriate to add the fields. However, I would still need a way to remove the empty fields from b.dtype.descr:
[('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
This is just a list of tuples, so I guess I could construct a more or less awkward way (along the lines of descr.remove(('', '|V8'))) to deal with this, but I was wondering if there is a better way, especially since the size of the voids depends on the number of left-out fields, e.g. from V8 to V16 if there are two in a row and so on (instead of a new void for each left-out field). So the code would become real clunky real fast.
In [237]: a = np.array(
...: [
...: (10.0, 13.5, 1248, -2),
...: (20.0, 0.0, 0, 0),
...: (30.0, 0.0, 0, 0),
...: (40.0, 0.0, 0, 0),
...: (50.0, 0.0, 0, 999)
...: ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
...: )
In [238]: a
Out[238]:
array([(10., 13.5, 1248, -2), (20., 0. , 0, 0),
(30., 0. , 0, 0), (40., 0. , 0, 0),
(50., 0. , 0, 999)],
dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
the b view:
In [240]: b = a[['x','i']]
In [241]: b
Out[241]:
array([(10., 1248), (20., 0), (30., 0), (40., 0), (50., 0)],
dtype={'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
the repacked copy:
In [243]: c = rf.repack_fields(b)
In [244]: c
Out[244]:
array([(10., 1248), (20., 0), (30., 0), (40., 0), (50., 0)],
dtype=[('x', '<f8'), ('i', '<i8')])
In [245]: c.dtype
Out[245]: dtype([('x', '<f8'), ('i', '<i8')])
your overly padded attempt at adding a field:
In [247]: d = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
...: d[list(b.dtype.names)] = b
...: d['c'] = 1
In [248]: d
Out[248]:
array([(10., b'\x00\x00\x00\x00\x00\x00\x00\x00', 1248, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
(20., b'\x00\x00\x00\x00\x00\x00\x00\x00', 0, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
...],
dtype=[('x', '<f8'), ('f1', 'V8'), ('i', '<i8'), ('f3', 'V8'), ('c', '<i4')])
My first attempt at making a dtype that does not include the void fields. I don't know whether simply testing for 'V' is robust enough:
In [253]: [des for des in b.dtype.descr if not 'V' in des[1]]
Out[253]: [('x', '<f8'), ('i', '<i8')]
And make a new dtype from that:
In [254]: d_dtype = _ + [('c','i4')]
All of this is normal python list and tuple manipulation. I've seen that in other recfunctions. I suspect repack_fields does something like this.
Now we make a new array with the simpler dtype:
In [255]: d = np.empty(b.shape, dtype=d_dtype)
In [256]: d[list(b.dtype.names)] = b
...: d['c'] = 1
In [257]: d
Out[257]:
array([(10., 1248, 1), (20., 0, 1), (30., 0, 1), (40., 0, 1),
(50., 0, 1)], dtype=[('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
I've extracted from repack_fields the code that constructs a new, un-padded, dtype:
In [262]: def foo(a):
...: fieldinfo = []
...: for name in a.names:
...: tup = a.fields[name]
...: fmt = tup[0]
...: if len(tup) == 3:
...: name = (tup[2], name)
...: fieldinfo.append((name, fmt))
...: print(fieldinfo)
...: dt = np.dtype(fieldinfo)
...: return dt
...:
...:
In [263]: foo(b.dtype)
[('x', dtype('float64')), ('i', dtype('int64'))]
Out[263]: dtype([('x', '<f8'), ('i', '<i8')])
This works from dtype.fields rather than the dtype.descr. One's a dict, the other a list.
In [274]: b.dtype
Out[274]: dtype({'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
In [275]: b.dtype.descr
Out[275]: [('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
In [276]: b.dtype.fields
Out[276]: mappingproxy({'x': (dtype('float64'), 0), 'i': (dtype('int64'), 16)})
In [277]: b.dtype.fields['x']
Out[277]: (dtype('float64'), 0)
another way of getting just the valid descr tuples from b.dtype:
In [278]: [des for des in b.dtype.descr if des[0] in b.dtype.names]
Out[278]: [('x', '<f8'), ('i', '<i8')]
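To tie this back to the original goal of adding a field, here is a small sketch (my own helper, not part of recfunctions) that repacks the padded view first and then grows the dtype, so the '|V8' entries never show up:

import numpy as np
import numpy.lib.recfunctions as rf

def add_field(arr, name, fmt, fill=0):
    """Return a copy of structured array `arr` with an extra field.
    The input is repacked first so padding fields never enter the
    new dtype. `name`, `fmt` and `fill` are up to the caller."""
    packed = rf.repack_fields(arr)            # drop '|V..' padding
    new_dt = packed.dtype.descr + [(name, fmt)]
    out = np.empty(arr.shape, dtype=new_dt)
    for n in packed.dtype.names:              # copy field by field
        out[n] = packed[n]
    out[name] = fill
    return out

# e.g. c = add_field(a[['x', 'i']], 'c', 'i4', fill=1)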

How do you set an entry in a numpy array to 0 or None if the entries are structured arrays themselves?

I have a list of objects that are structured arrays, for example something like this:
a = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)], dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
b = np.array([('Dog3', 9, 81.0), ('Dog4', 3, 27.0)], dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
c = np.array([('Dog5', 9, 81.0), ('Dog6', 3, 27.0)], dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
lst = [a, b, c]
Now I need this list to be a numpy array itself because I need to use numpy.where() on it and this does not work otherwise.
lst = np.array(lst)
So then I do something like this:
ID = np.where(lst == c)
lst[ID] = 0 or rather lst[ID] = None
But instead of what I would like to get, i.e.
>>>lst
array([a, b, 0/None], dtype=...)
I either get this:
>>>lst
array([a, b, [('0', 0, 0.), ('0', 0, 0.)]], dtype=...)
Or it does not work at all:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
How can I accomplish this? Do I have to convert lst back into a list first?
Let's make a simpler array, and try to set some values:
In [14]: dt = np.dtype([('foo','U10'),('bar',int)])
In [16]: arr = np.zeros(3, dtype=dt)
In [17]: arr
Out[17]: array([('', 0), ('', 0), ('', 0)], dtype=[('foo', '<U10'), ('bar', '<i8')])
In [18]: arr[1]
Out[18]: ('', 0)
In [19]: arr[1] = 12
In [20]: arr
Out[20]:
array([('', 0), ('12', 12), ('', 0)],
dtype=[('foo', '<U10'), ('bar', '<i8')])
Note the mix of string '12' and integer 12.
Set with a tuple, one value for each field:
In [21]: arr[2] = ('dog',23)
Setting with None fails because it can't convert None to an integer, as required by the 2nd field.
In [22]: arr[0] = None
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-2f7a4c897706> in <module>
----> 1 arr[0] = None
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
In [23]: arr
Out[23]:
array([('None', 0), ('12', 12), ('dog', 23)],
dtype=[('foo', '<U10'), ('bar', '<i8')])
Actually it did manage to set the string field.
arr is an array, where each element must have the same dtype.
An array with object dtype behaves a lot more like a list. While occasionally useful, it shouldn't be used as a substitute for lists.
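For completeness, a minimal sketch of that list-like route: build the object array explicitly, element by element, and then a slot really can be set to None (or 0). The identity check stands in for the np.where call, which may not behave as expected on an object array:

import numpy as np

dt = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
a = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)], dtype=dt)
b = np.array([('Dog3', 9, 81.0), ('Dog4', 3, 27.0)], dtype=dt)
c = np.array([('Dog5', 9, 81.0), ('Dog6', 3, 27.0)], dtype=dt)

# Build the object array element by element; np.array([a, b, c]) would
# instead stack them into a (3, 2) structured array.
lst = np.empty(3, dtype=object)
for i, item in enumerate((a, b, c)):
    lst[i] = item

# A slot in an object array can hold anything, including None:
for i, item in enumerate(lst):
    if item is c:
        lst[i] = None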

python h5py: can I store a dataset which different columns have different types?

Suppose I have a table with many columns; only a few columns are float type, the others are small integers, for example:
col1, col2, col3, col4
1.31 1 2 3
2.33 3 5 4
...
How can I store this efficiently? If I use np.float32 for the whole dataset, storage is wasted, because the other columns only hold small integers and don't need so much space. If I use np.int16, the float column is no longer exact, which is also not what I want. How do I deal with a situation like this?
Suppose I also have a string column, which confuses me even more; how should I store the data?
col1, col2, col3, col4, col5
1.31 1 2 3 "a"
2.33 3 5 4 "b"
...
Edit:
To make things simpler, let's suppose the string column has fixed-length strings only, for example of length 3.
I'm going to demonstrate the structured array approach:
I'm guessing you are starting with a csv file 'table'. If not, it's still the easiest way to turn your sample into an array:
In [40]: txt = '''col1, col2, col3, col4, col5
...: 1.31 1 2 3 "a"
...: 2.33 3 5 4 "b"
...: '''
In [42]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)
In [43]: data
Out[43]:
array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])
With these parameters, genfromtxt takes care of creating a structured array. Note it is a 1d array with 5 fields. The field dtypes are determined from the data.
In [44]: import h5py
...
In [46]: f = h5py.File('struct.h5', 'w')
In [48]: ds = f.create_dataset('data',data=data)
...
TypeError: No conversion path for dtype: dtype('<U3')
But h5py has problems saving the unicode strings (default for py3). There may be ways around that, but here it will be simpler to convert the string dtype to bytestrings. Besides, that'll be more compact.
To convert that, I'll make a new dtype, and use astype. Alternatively I could specify the dtypes in the genfromtxt call.
In [49]: data.dtype
Out[49]: dtype([('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])
In [50]: data.dtype.descr
Out[50]:
[('col1', '<f8'),
('col2', '<i8'),
('col3', '<i8'),
('col4', '<i8'),
('col5', '<U3')]
In [51]: dt1 = data.dtype.descr
In [52]: dt1[-1] = ('col5', 'S3')
In [53]: data.astype(dt1)
Out[53]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
Now it saves the array without problem:
In [54]: data1 = data.astype(dt1)
In [55]: data1
Out[55]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
In [56]: ds = f.create_dataset('data',data=data1)
In [57]: ds
Out[57]: <HDF5 dataset "data": shape (2,), type "|V35">
In [58]: ds[:]
Out[58]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
I could make further modifications, shortening one or more of the int fields:
In [60]: dt1[1] = ('col2','i2')
In [61]: dt1[2] = ('col3','i2')
In [62]: dt1
Out[62]:
[('col1', '<f8'),
('col2', 'i2'),
('col3', 'i2'),
('col4', '<i8'),
('col5', 'S3')]
In [63]: data1 = data.astype(dt1)
In [64]: data1
Out[64]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
In [65]: ds1 = f.create_dataset('data1',data=data1)
ds1 has more compact storage, 'V23' vs 'V35':
In [67]: ds1
Out[67]: <HDF5 dataset "data1": shape (2,), type "|V23">
In [68]: ds1[:]
Out[68]:
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
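As mentioned above, the dtypes could also be given directly in the genfromtxt call instead of converting with astype afterwards. A sketch, with the same (assumed) field widths as above; the field names come from the header line:

import numpy as np
import h5py

txt = '''col1, col2, col3, col4, col5
1.31 1 2 3 "a"
2.33 3 5 4 "b"
'''
# Compact formats chosen up front; adjust the widths to fit your data.
data = np.genfromtxt(txt.splitlines(), names=True,
                     dtype='f8,i2,i2,i8,S3', encoding=None)

with h5py.File('struct.h5', 'w') as f:
    f.create_dataset('data', data=data)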

Numpy structured array fails basic numpy operations

I wish to manipulate named numpy arrays (add, multiply, concatenate, ...)
I defined structured arrays:
types=[('name1', int), ('name2', float)]
a = np.array([2, 3.3], dtype=types)
b = np.array([4, 5.35], dtype=types)
a and b are created such that
a
array([(2, 2. ), (3, 3.3)], dtype=[('name1', '<i8'), ('name2', '<f8')])
but I really want a['name1'] to be just 2, not array([2, 3])
Similarly, I want a['name2'] to be just 3.3
This way I could sum c=a+b, which is expected to be an array of length 2, where c['name1'] is 6 and c['name2'] is 8.65
How can I do that?
Define a structured array:
In [125]: dt = np.dtype([('f0','U10'),('f1',int),('f2',float)])
In [126]: a = np.array([('one',2,3),('two',4,5.5),('three',6,7)],dt)
In [127]: a
Out[127]:
array([('one', 2, 3. ), ('two', 4, 5.5), ('three', 6, 7. )],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
And an object dtype array with the same data
In [128]: A = np.array([('one',2,3),('two',4,5.5),('three',6,7)],object)
In [129]: A
Out[129]:
array([['one', 2, 3],
['two', 4, 5.5],
['three', 6, 7]], dtype=object)
Addition works because it (iteratively) delegates the action to all elements
In [130]: A+A
Out[130]:
array([['oneone', 4, 6],
['twotwo', 8, 11.0],
['threethree', 12, 14]], dtype=object)
structured addition does not work
In [131]: a+a
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-6ff992d1ddd5> in <module>()
----> 1 a+a
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')]) dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
dtype([('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
Let's try addition field by field:
In [132]: aa = np.zeros_like(a)
In [133]: for n in a.dtype.names: aa[n] = a[n] + a[n]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-133-68476e5d579e> in <module>()
----> 1 for n in a.dtype.names: aa[n] = a[n] + a[n]
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U10') dtype('<U10') dtype('<U10')
Oops, doesn't quite work - string dtype doesn't have addition. But we can handle the string field separately:
In [134]: aa['f0'] = a['f0']
In [135]: for n in a.dtype.names[1:]: aa[n] = a[n] + a[n]
In [136]: aa
Out[136]:
array([('one', 4, 6.), ('two', 8, 11.), ('three', 12, 14.)],
dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
Or we can change the string dtype to object:
In [137]: dt1 = np.dtype([('f0',object),('f1',int),('f2',float)])
In [138]: b = np.array([('one',2,3),('two',4,5.5),('three',6,7)],dt1)
In [139]: b
Out[139]:
array([('one', 2, 3. ), ('two', 4, 5.5), ('three', 6, 7. )],
dtype=[('f0', 'O'), ('f1', '<i8'), ('f2', '<f8')])
In [140]: bb = np.zeros_like(b)
In [141]: for n in a.dtype.names: bb[n] = b[n] + b[n]
In [142]: bb
Out[142]:
array([('oneone', 4, 6.), ('twotwo', 8, 11.), ('threethree', 12, 14.)],
dtype=[('f0', 'O'), ('f1', '<i8'), ('f2', '<f8')])
Python strings do have a __add__, defined as concatenation. Numpy string dtypes don't have that definition. Python strings can be multiplied by an integer, but raise an error otherwise.
My guess is that pandas resorts to something like what I just did. I doubt if it implements dataframe addition in compiled code (except in some special cases). It probably works column by column if the dtype allows. It also seems to freely switch to object dtype (for example a column with both np.nan and a string). Timings might confirm my guess (I don't have pandas installed on this OS).
According to the documentation, the right way to make your arrays is:
types=[('name1', int), ('name2', float)]
a = np.array([(2, 3.3)], dtype=types)
b = np.array([(4, 5.35)], dtype=types)
This generates a and b as you want them:
a['name1']
array([2])
But summing them is not as straightforward as with conventional numpy arrays, so I also suggest using pandas:
import pandas as pd

names = ['name1', 'name2']
a = pd.Series([2, 3.3], index=names)
b = pd.Series([4, 5.35], index=names)
a+b
name1 6.00
name2 8.65
dtype: float64
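If you do want to stay with structured arrays, the field-by-field loop from the first answer can be wrapped in a small helper. A sketch, not a general-purpose API; string fields (which have no 'add' loop) are simply copied from the first argument:

import numpy as np

def add_structured(a, b):
    # Assumes a and b share the same dtype and shape.
    out = np.zeros_like(a)
    for name in a.dtype.names:
        try:
            out[name] = a[name] + b[name]
        except TypeError:          # e.g. '<U10' has no 'add' ufunc loop
            out[name] = a[name]
    return out

types = [('name1', int), ('name2', float)]
a = np.array([(2, 3.3)], dtype=types)
b = np.array([(4, 5.35)], dtype=types)
c = add_structured(a, b)   # c['name1'] -> [6], c['name2'] -> [8.65]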

Delete numpy column by name

I created a numpy array from csv by
dtest = np.genfromtxt('data/test.csv', delimiter=",", names = True)
The data has 200 columns named 'name', 'id', and so on.
I'm trying to delete the 'id' column.
Can I do that using the name of the column?
The answers in the proposed duplicate, How do you remove a column from a structured numpy array?
show how to reference a subset of the fields of a structured array. That may be what you want, but it has a potential problem, which I'll illustrate in a bit.
Start with a small sample csv 'file':
In [32]: txt=b"""a,id,b,c,d,e
...: a1, 3, 0,0,0,0.1
...: b2, 4, 1,2,3,4.4
...: """
In [33]: data=np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None)
In [34]: data
Out[34]:
array([(b'a1', 3, 0, 0, 0, 0.1),
(b'b2', 4, 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('id', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Multifield selection
I can get a 'view' of a subset of the fields with a field name list. The 'duplicate' showed how to construct such a list from the data.dtype.names. Here I'll just type it in, omitting the 'id' name.
In [35]: subd=data[['a','b','c','d']]
In [36]: subd
Out[36]:
array([(b'a1', 0, 0, 0), (b'b2', 1, 2, 3)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
The problem is that this isn't a regular 'view'. It's fine for reading, but any attempt to write to the subset raises a warning.
In [37]: subd[0]['b'] = 3
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
#!/usr/bin/python3
Making a subset copy is ok. But changes to subd won't affect data.
In [38]: subd=data[['a','b','c','d']].copy()
In [39]: subd[0]['b'] = 3
In [40]: subd
Out[40]:
array([(b'a1', 3, 0, 0), (b'b2', 1, 2, 3)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
A simple way to delete the ith field name from the indexing list:
In [60]: subnames = list(data.dtype.names) # list so its mutable
In [61]: subnames
Out[61]: ['a', 'id', 'b', 'c', 'd', 'e']
In [62]: del subnames[1]
usecols
Since you are reading this array from the csv, you could use usecols to load everything but the 'id' column
Since you have a large number of columns it would be easiest to do something like:
In [42]: col=list(range(6)); del col[1]
In [43]: col
Out[43]: [0, 2, 3, 4, 5]
In [44]: np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None,usecols=col)
Out[44]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
recfunctions
There's a library of functions that can help manipulate structured arrays
In [45]: import numpy.lib.recfunctions as rf
In [47]: rf.drop_fields(data, ['id'])
Out[47]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Most functions in this group work by constructing a 'blank' array with the target dtype, and then copying values, by field, from the source to the target.
field copy
Here's the field copy approach used in recfunctions:
In [65]: data.dtype.descr # dtype description as list of tuples
Out[65]:
[('a', '|S2'),
('id', '<i4'),
('b', '<i4'),
('c', '<i4'),
('d', '<i4'),
('e', '<f8')]
In [66]: desc=data.dtype.descr
In [67]: del desc[1] # remove one field
In [68]: res = np.zeros(data.shape, dtype=desc) # target
In [69]: res
Out[69]:
array([(b'', 0, 0, 0, 0.), (b'', 0, 0, 0, 0.)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
In [70]: for name in res.dtype.names: # copy by field name
...: res[name] = data[name]
In [71]: res
Out[71]:
array([(b'a1', 0, 0, 0, 0.1), (b'b2', 1, 2, 3, 4.4)],
dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
Since usually structured arrays have many records, and few fields, copying by field name is relatively fast.
The linked SO cited matplotlib.mlab.rec_drop_fields(rec, names). This essentially does what I just outlined - make a target with the desired fields, and copy fields by name.
newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
if name not in names])
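Filling in the copy step that goes with that newdtype, here is the same idea as a standalone helper (a sketch with my own name, not the mlab function itself):

import numpy as np

def drop_fields_like_mlab(rec, names):
    """Drop the fields listed in `names` from structured array `rec`
    by building a reduced dtype and copying the remaining fields."""
    newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                         if name not in names])
    out = np.zeros(rec.shape, dtype=newdtype)
    for name in newdtype.names:
        out[name] = rec[name]
    return out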
I know you have a comprehensive answer but this is another that I just put together.
import numpy as np
For some sample data file:
test1.csv =
a b c id
0 1 2 3
4 5 6 7
8 9 10 11
Import using genfromtxt:
d = np.genfromtxt('test1.csv', delimiter="\t", names = True)
d
> array([(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0), (8.0, 9.0, 10.0, 11.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8'), ('id', '<f8')])
Return a single column from your array by doing:
d['a']
> array([ 0., 4., 8.])
To delete the column by the name 'id' you can do the following:
Return a list of the column names by writing:
list(d.dtype.names)
> ['a', 'b', 'c', 'id']
Create a new numpy array by returning only those columns not equal to the string id.
Use a list comprehension to return a new list without your 'id' string:
[b for b in list(d.dtype.names) if b != 'id']
> ['a', 'b', 'c']
Combine to give:
d_new = d[[b for b in list(d.dtype.names) if b != 'id']]
> array([(0.0, 1.0, 2.0), (4.0, 5.0, 6.0), (8.0, 9.0, 10.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
This returns the array:
a b c
0 1 2
4 5 6
8 9 10
This may be new functionality in numpy (works in 1.20.2) but you can just slice your named array using a list of names (a tuple of names doesn't work though).
data = np.genfromtxt('some_file.csv', names=['a', 'b', 'c', 'd', 'e'])
# I don't want columns b or d
sliced = data[['a', 'c', 'e']]
I notice that you need to eliminate many columns that are named id. These columns show up as ['id', 'id_1', 'id_2', ...] and so on when parsed by genfromtxt, so you can use some list comprehension to pick out those column names and make a slice out of them.
no_ids = data[[n for n in data.dtype.names if 'id' not in n]]
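Putting the pieces together, a short sketch of the two drop-by-name routes on the sample data from the first answer: a field-name list for a (possibly padded) view, or rf.drop_fields for a packed copy:

import numpy as np
import numpy.lib.recfunctions as rf

txt = b"""a,id,b,c,d,e
a1, 3, 0,0,0,0.1
b2, 4, 1,2,3,4.4
"""
data = np.genfromtxt(txt.splitlines(), delimiter=',', names=True, dtype=None)

# Either keep the wanted fields (a multi-field view, may carry padding) ...
keep = [n for n in data.dtype.names if n != 'id']
no_id_view = data[keep]

# ... or build a packed copy without the 'id' field:
no_id = rf.drop_fields(data, ['id'])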
