Numpy: preserving dtype after column_stack() - python

Python 3, NumPy 1.10
Let's say I have something like
some_array = ['a', 'b', 'c']
bool_array = numpy.array([False for x in range(len(some_array))], dtype='bool_')
Then bool_array will be [False, False, False] with dtype bool.
And when I do
another_array = numpy.column_stack((some_array, bool_array))
both columns become str, which I don't want.
What I want is to preserve the bool type in the second column. I don't care about the type of the first column.
Will I need to create another array? It seems the solution is to pass a dtype, as with structured arrays, but I'd rather not copy the whole array generated by column_stack().

I don't think you can have multiple data types in one plain (non-structured) numpy array. You might want to try a pandas DataFrame instead. For example, if you have
some_array = ['a', 'b', 'c']
bool_array = [False, False, False]
you can define a DataFrame (after import pandas as pd):
df = pd.DataFrame(bool_array, index=some_array, columns=[1])
This will look like
       1
a  False
b  False
c  False
The bool column keeps its type, but you will have to use DataFrame operations to access and manipulate it.
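As a minimal sketch of the pandas route, the frame below stores both arrays as separate columns, so each column keeps its own dtype. The column names 'name' and 'flag' are made up for illustration and do not come from the question.

```python
import pandas as pd

some_array = ['a', 'b', 'c']
bool_array = [False, False, False]

# One column per input; pandas stores each column with its own dtype.
# The column names 'name' and 'flag' are hypothetical.
df = pd.DataFrame({'name': some_array, 'flag': bool_array})

print(df.dtypes)         # per-column dtypes; 'flag' stays bool
print(df['flag'].sum())  # boolean operations still work on the column
```

Compared with using the strings as the index, this keeps both inputs addressable as ordinary columns.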

There are 2 (main) ways of adding values to a structured array - with a list of tuples or by copying values to each field.
e.g. by field:
In [368]: arr=np.zeros((3,),dtype='S10,bool')
In [369]: arr
Out[369]:
array([(b'', False), (b'', False), (b'', False)],
dtype=[('f0', 'S10'), ('f1', '?')])
In [371]: arr['f0']=['one','two','three']
In [372]: arr['f1']=[True,False,True]
In [373]: arr
Out[373]:
array([(b'one', True), (b'two', False), (b'three', True)],
dtype=[('f0', 'S10'), ('f1', '?')])
If you have a list of sublists, you'll need to convert the sublists to tuples:
In [378]: alist=[['one',True],['two',False],['three',True]]
In [379]: np.array([tuple(i) for i in alist], dtype=arr.dtype)
Out[379]:
array([(b'one', True), (b'two', False), (b'three', True)],
dtype=[('f0', 'S10'), ('f1', '?')])
The column_stack won't help as an intermediate array, since it has lost the original column dtypes.
You could also explore the functions in numpy.lib.recfunctions (from numpy.lib import recfunctions). That module has functions to merge arrays, but the ones I have looked at use the field-copy method, so they won't save any time.
In [381]: recfunctions.merge_arrays([['one','two','three'],[False,True,False]])
Out[381]:
array([('one', False), ('two', True), ('three', False)],
dtype=[('f0', '<U5'), ('f1', '?')])
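Applied to the arrays from the question, the field-copy approach might look like the sketch below. It allocates the structured array once and fills each field, so no column_stack intermediate is needed. The field names 'name' and 'flag' are assumptions chosen for readability.

```python
import numpy as np

some_array = ['a', 'b', 'c']
bool_array = np.zeros(len(some_array), dtype=np.bool_)

# Converting the list infers a unicode dtype wide enough for the strings.
names = np.asarray(some_array)

# Allocate the structured array once, then copy each column into its field.
# The field names 'name' and 'flag' are hypothetical.
combined = np.empty(len(names), dtype=[('name', names.dtype), ('flag', bool_array.dtype)])
combined['name'] = names
combined['flag'] = bool_array

print(combined.dtype)
print(combined['flag'].dtype)
```

The result is a 1D array of records, not a 2D array, so columns are selected by field name, not by a second index.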

Related

Pandas to_records keep turning my bytes column to str

I am trying to read in a csv as a pandas dataframe and turn it into a list of tuples, which I am currently doing using to_records(). However, one of my columns, which holds bytes, keeps getting turned into a string; for example, it goes from b'\x00\x01\x02\x03\x04' to "b'\x00\x01\x02\x03\x04'". I want pandas to keep the original values. Is there any way to achieve this?
This is what the dataframe looks like:
col1   col2
True   b'\\x00\\x01\\x02\\x03\\x04'
False  b'\\x05\\x06\\x07\\x08\\x09'
But when I turn it into a list of tuples using to_records it looks like this: [(True, "b'\\x00\\x01\\x02\\x03\\x04'"), (False, "b'\\x05\\x06\\x07\\x08\\x09'")]
^ As you can see the bytes get turned into a string.
As @mozway suggests, your values are already strings that merely look like bytes. Try pd.eval:
>>> df.assign(col2=pd.eval(df['col2'])).to_records()
rec.array([(0, True, b'\\x00\\x01\\x02\\x03\\x04'),
(1, False, b'\\x05\\x06\\x07\\x08\\x09')],
dtype=[('index', '<i8'), ('col1', '?'), ('col2', 'O')])
>>> df.to_records()
rec.array([(0, True, "b'\\\\x00\\\\x01\\\\x02\\\\x03\\\\x04'"),
(1, False, "b'\\\\x05\\\\x06\\\\x07\\\\x08\\\\x09'")],
dtype=[('index', '<i8'), ('col1', '?'), ('col2', 'O')])
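If the column really holds text that merely looks like a bytes literal (an assumption about the CSV, since the original file is not shown), ast.literal_eval from the standard library is another way to parse each value back into bytes. The sketch below swaps it in for pd.eval and builds a stand-in frame by hand.

```python
import ast
import pandas as pd

# Hypothetical stand-in for the frame read from the CSV: col2 holds *text*
# that looks like a bytes literal, not actual bytes objects.
df = pd.DataFrame({
    'col1': [True, False],
    'col2': ["b'\\x00\\x01\\x02\\x03\\x04'", "b'\\x05\\x06\\x07\\x08\\x09'"],
})

# Parse each string as a Python literal, which yields real bytes objects.
df['col2'] = df['col2'].map(ast.literal_eval)

# Plain tuples, one per row, without the index.
rows = list(df.itertuples(index=False, name=None))
print(rows)
```

literal_eval only accepts Python literals, so it will raise on values that are not valid literals instead of evaluating arbitrary expressions.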

Adding a field to a structured numpy array (4)

This has been addressed before (here, here and here). I want to add a new field to a structured array returned by numpy's genfromtxt (also asked here).
My new problem is that the csv file I'm reading has only a header line and a single data row:
output-Summary.csv:
Wedge, DWD, Yield (wedge), Efficiency
1, 16.097825, 44283299.473156, 2750887.118836
I'm reading it via genfromtxt and calculate a new value 'tl':
test_out = np.genfromtxt('output-Summary.csv', delimiter=',', names=True)
tl = 300 / test_out['DWD']
test_out looks like this:
array((1., 16.097825, 44283299.473156, 2750887.118836),
dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8')])
Using recfunctions.append_fields (as suggested in the examples 1-3 above) fails over the use of len() for the size 1 array:
from numpy.lib import recfunctions as rfn
rfn.append_fields(test_out,'tl',tl)
TypeError: len() of unsized object
Searching for alternatives (one of the answers here) I find that mlab.rec_append_fields works well (but is deprecated):
mlab.rec_append_fields(test_out,'tl',tl)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: MatplotlibDeprecationWarning: The rec_append_fields function was deprecated in version 2.2.
"""Entry point for launching an IPython kernel.
rec.array((1., 16.097825, 44283299.473156, 2750887.118836, 18.63605798),
dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8'), ('tl', '<f8')])
I can also copy the array over to a new structured array "by hand" as suggested here. This works:
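new_dt = np.dtype(test_out.dtype.descr + [('tl', '<f8')])  # assumed definition; new_dt was not shown in the original post
test_out_new = np.zeros(test_out.shape, dtype=new_dt)
for name in test_out.dtype.names:
    test_out_new[name] = test_out[name]
test_out_new['tl'] = tl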
So in summary: is there a way to get recfunctions.append_fields to work with the genfromtxt output from my single-row csv file?
I would really rather use a standard way to handle this than a home brew.
Reshape the array (and the new field) to shape (1,). With just one data line, genfromtxt loads the data as a 0d array, shape (). The rfn code isn't heavily used and isn't as robust as it should be. In other words, the 'standard way' is still a bit buggy.
For example:
In [201]: arr=np.array((1,2,3), dtype='i,i,i')
In [202]: arr.reshape(1)
Out[202]: array([(1, 2, 3)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
In [203]: rfn.append_fields(arr.reshape(1), 't1',[1], usemask=False)
Out[203]:
array([(1, 2, 3, 1)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('t1', '<i8')])
There is nothing wrong with the home brew. Most of the rfn functions use that same mechanism: define a new dtype, create a recipient array with that dtype, and copy the fields over, name by name.
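Applied to the data from the question, the reshape advice might look like the following sketch. The array is built by hand here as a stand-in for the genfromtxt output, and np.atleast_1d is used as one convenient way to go from shape () to shape (1,); it is a substitute for the explicit reshape(1) shown above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Stand-in for the 0-d structured array that genfromtxt returns for a single data row.
test_out = np.array(
    (1.0, 16.097825, 44283299.473156, 2750887.118836),
    dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8')],
)
tl = 300 / test_out['DWD']

# Promote both the array and the new column from shape () to shape (1,).
rows = np.atleast_1d(test_out)
result = rfn.append_fields(rows, 'tl', np.atleast_1d(tl), usemask=False)

print(result.shape)
print(result.dtype.names)
```

Because np.atleast_1d leaves arrays that already have one or more dimensions unchanged, the same code should also handle files with several data rows.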

When dtype is specified with genfromtxt a 2D array becomes 1D - how to prevent this?

As seen here:
http://library.isr.ist.utl.pt/docs/numpy/user/basics.io.genfromtxt.html#choosing-the-data-type
"In all the cases but the first one, the output will be a 1D array with a structured dtype. This dtype has as many fields as items in the sequence. The field names are defined with the names keyword."
The problem is how do I get around this? I want to use genfromtxt with a data file with columns that are, e.g. int, string, int.
If I do:
dtype=(int, "|S5|", int)
Then the shape changes from (x, y) to merely (x,) and I get 'too many indices' errors when I try to use masks.
When I use dtype=None I get to keep the 2D structure, but it often guesses the type wrong when the first row of a column looks like it could be a number (this often occurs in my data set).
How can I best get around this?
You cannot have a 2D array here: it would mean each row is a 1D array with mixed dtypes, which is not possible.
Having an array of records shouldn't be a problem:
In [1]: import numpy as np
In [2]: !cat test.txt
42 foo 41
40 bar 39
In [3]: data = np.genfromtxt('test.txt',
..: dtype=np.dtype([('f1', int), ('f2', np.str_, 5), ('f3', int)]))
In [4]: data
Out[4]:
array([(42, 'foo', 41), (40, 'bar', 39)],
dtype=[('f1', '<i8'), ('f2', '<U5'), ('f3', '<i8')])
In [5]: data['f3']
Out[5]: array([41, 39])
In [6]: data['f3'][1]
Out[6]: 39
If you need a masked array, look here: How can I mask elements of a record array in Numpy?
To mask by 1st column value:
In [7]: data['f1'] == 40
Out[7]: array([False, True], dtype=bool)
In [8]: data[data['f1'] == 40]
Out[8]:
array([(40, 'bar', 39)],
dtype=[('f1', '<i8'), ('f2', '<U5'), ('f3', '<i8')])
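A self-contained variant of the session above, as a rough sketch: it reads the same two rows from an in-memory buffer instead of test.txt, masks by field, and then stacks only the integer fields when an ordinary 2D array is wanted. Passing encoding=None and stacking just 'f1' and 'f3' are choices made for this example, not part of the original answer.

```python
import io
import numpy as np

# Same content as test.txt above, held in memory so the example is self-contained.
text = io.StringIO("42 foo 41\n40 bar 39\n")
data = np.genfromtxt(
    text,
    dtype=[('f1', int), ('f2', 'U5'), ('f3', int)],
    encoding=None,
)

# Select rows with a boolean mask built from one field.
selected = data[data['f1'] == 40]

# Stack only the integer fields to get an ordinary 2D array.
numeric = np.column_stack([data['f1'], data['f3']])
print(data.shape, numeric.shape)
```

The structured array stays 1D with one record per row, while the stacked numeric part accepts the usual two-index masks.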

SQL join or R's merge() function in NumPy?

Is there an implementation where I can join two arrays based on their keys? Speaking of which, is the canonical way to store keys in one of the NumPy columns (NumPy doesn't have an 'id' or 'rownames' attribute)?
If you want to use only numpy, you can use structured arrays and the lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A little example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (documentation). I have no experience with it, but it provides more powerful data structures and functionality than standard numpy, "to make working with “relational” or “labeled” data both easy and intuitive". And it certainly has joining and merging functions (for example see http://pandas.sourceforge.net/merging.html#joining-on-a-key).
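For comparison, a minimal pandas sketch of the same inner join as the join_by example above, using the same ids and values. It assumes pandas is installed and imported as pd.

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'A': [10., 20., 30.]})
b = pd.DataFrame({'id': [2, 3, 4], 'B': [200., 300., 400.]})

# Inner join on the shared key column, comparable to SQL's INNER JOIN ... ON a.id = b.id.
merged = pd.merge(a, b, on='id', how='inner')
print(merged)
```

Other values of how, such as 'left', 'right' and 'outer', cover the remaining SQL join types.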
If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
In my case, I absolutely want duplicate keys. I'm comparing the rows of each column with the rows of all the other columns, inclusive (or, thinking like a database person, I want an inner join without an on or where clause). Or, translated into a loop, something like this:
for i in a:
    for j in a:
        print(i, j, i*j)
Such procedures are frequent in data mining operations.
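The nested loop above can also be written without Python-level loops. This is a sketch in plain NumPy, assuming a is a 1D numeric array; it is not a replacement for join_by, just one way to build the all-pairs table.

```python
import numpy as np

a = np.array([1, 2, 3])

# Every ordered pair (i, j): repeat walks the outer loop, tile the inner loop.
i = np.repeat(a, len(a))
j = np.tile(a, len(a))

# Columns: i, j, i*j, in the same order the nested loop would print them.
pairs = np.column_stack((i, j, i * j))
print(pairs)
```

The result has len(a)**2 rows, so memory use grows quadratically with the input length.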

List of tuples to Numpy recarray

Given a list of tuples, where each tuple represents a row in a table, e.g.
tab = [('a',1),('b',2)]
Is there an easy way to convert this to a record array? I tried
np.recarray(tab,dtype=[('name',str),('value',int)])
which doesn't seem to work.
Try np.rec.fromrecords, which infers the field dtypes:
np.rec.fromrecords(tab)
rec.array([('a', 1), ('b', 2)],
dtype=[('f0', '|S1'), ('f1', '<i4')])
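The np.recarray(tab, ...) attempt fails because the first argument of the np.recarray constructor is a shape, not the data. Below is a minimal sketch of two alternatives that keep the field names 'name' and 'value' from the question; the 'U1' string width in the second option is an assumption that fits this sample data.

```python
import numpy as np

tab = [('a', 1), ('b', 2)]

# Option 1: let fromrecords infer the field dtypes and only supply the names.
rec = np.rec.fromrecords(tab, names=['name', 'value'])

# Option 2: build a structured array with an explicit dtype, then view it as a recarray.
# 'U1' (one-character unicode string) is an assumption that fits this sample data.
arr = np.array(tab, dtype=[('name', 'U1'), ('value', int)]).view(np.recarray)

print(rec.name, rec.value)
print(arr.name, arr.value)
```

Both results allow attribute access (rec.name) as well as field indexing (rec['name']).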
