Related
This has been addressed before (here, here and here). I want to add a new field to a structured array returned by numpy genfromtxt (also asked here).
My new problem is that the csv file I'm reading has only a header line and a single data row:
output-Summary.csv:
Wedge, DWD, Yield (wedge), Efficiency
1, 16.097825, 44283299.473156, 2750887.118836
I'm reading it via genfromtxt and calculating a new value 'tl':
test_out = np.genfromtxt('output-Summary.csv', delimiter=',', names=True)
tl = 300 / test_out['DWD']
test_out looks like this:
array((1., 16.097825, 44283299.473156, 2750887.118836),
dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8')])
Using recfunctions.append_fields (as suggested in examples 1-3 above) fails because len() is called on the unsized (0d) array:
from numpy.lib import recfunctions as rfn
rfn.append_fields(test_out,'tl',tl)
TypeError: len() of unsized object
Searching for alternatives (one of the answers here), I find that mlab.rec_append_fields works well (but is deprecated):
mlab.rec_append_fields(test_out,'tl',tl)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: MatplotlibDeprecationWarning: The rec_append_fields function was deprecated in version 2.2.
"""Entry point for launching an IPython kernel.
rec.array((1., 16.097825, 44283299.473156, 2750887.118836, 18.63605798),
dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8'), ('tl', '<f8')])
I can also copy the array over to a new structured array "by hand" as suggested here. This works:
new_dt = np.dtype(test_out.dtype.descr + [('tl', '<f8')])  # extend the original dtype with the new field
test_out_new = np.zeros(test_out.shape, dtype=new_dt)
for name in test_out.dtype.names:
    test_out_new[name] = test_out[name]
test_out_new['tl'] = tl
So in summary - is there a way to get recfunctions.append_fields to work with the genfromtxt output from my single-row csv file?
I would really rather use a standard way to handle this than a home-brew solution.
Reshape the array (and the new field) to size (1,). With just one data line, genfromtxt loads the data as a 0d array, shape (). The rfn code isn't heavily used and isn't as robust as it should be. In other words, the 'standard way' is still a bit buggy.
For example:
In [201]: arr=np.array((1,2,3), dtype='i,i,i')
In [202]: arr.reshape(1)
Out[202]: array([(1, 2, 3)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
In [203]: rfn.append_fields(arr.reshape(1), 't1',[1], usemask=False)
Out[203]:
array([(1, 2, 3, 1)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('t1', '<i8')])
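Applied to the question's data, the same fix might look like this (a sketch: tl computed from a 0d array is itself 0d, so it needs promoting too, e.g. with np.atleast_1d):
import numpy as np
from numpy.lib import recfunctions as rfn

test_out = np.genfromtxt('output-Summary.csv', delimiter=',', names=True)
tl = 300 / test_out['DWD']          # 0d, just like test_out itself

# promote both to shape (1,) before appending the new field
result = rfn.append_fields(test_out.reshape(1), 'tl',
                           np.atleast_1d(tl), usemask=False)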
Nothing wrong with the home-brew. Most of the rfn functions use that mechanism - define a new dtype, create a recipient array with that dtype, and copy the fields over, name by name.
Python 3, NumPy 1.10
Let's say, I have something like
import numpy
some_array = ['a', 'b', 'c']
bool_array = numpy.array([False for x in range(len(some_array))], dtype='bool_')
Then bool_array will be [False, False, False] with dtype bool.
And when I do
another_array = numpy.column_stack((some_array, bool_array)), both columns become str, which I don't want.
What I want is preserving bool type in the second column. I don't care about type of the first column.
Will I need to create another array? Seems like the solution is to pass the dtype like in structured arrays, but I'd like to not copy the whole array generated by column_stack().
I don't think you can have multiple data types in one plain numpy array. You might want to try a pandas data frame instead. For example, if you have
some_array = ['a', 'b', 'c']
bool_array = [False, False, False]
you can define a data frame
import pandas as pd
df = pd.DataFrame(bool_array, index=some_array, columns=[1])
This will look like
1
a False
b False
c False
The bool_array will keep its type, but you will have to use data frame operations to access and manipulate it.
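For instance, a minimal sketch building the same frame shows the column keeping its bool dtype while being manipulated through DataFrame operations:
import pandas as pd

some_array = ['a', 'b', 'c']
bool_array = [False, False, False]
df = pd.DataFrame(bool_array, index=some_array, columns=[1])

print(df[1].dtype)       # bool -- the column type is preserved
df.loc['a', 1] = True    # element access/assignment via DataFrame ops
print(df.loc['a', 1])    # True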
There are 2 (main) ways of adding values to a structured array - with a list of tuples or by copying values to each field.
e.g. by field:
In [368]: arr=np.zeros((3,),dtype='S10,bool')
In [369]: arr
Out[369]:
array([(b'', False), (b'', False), (b'', False)],
dtype=[('f0', 'S10'), ('f1', '?')])
In [371]: arr['f0']=['one','two','three']
In [372]: arr['f1']=[True,False,True]
In [373]: arr
Out[373]:
array([(b'one', True), (b'two', False), (b'three', True)],
dtype=[('f0', 'S10'), ('f1', '?')])
If you have a list of sublists, you'll need to convert the sublists to tuples:
In [378]: alist=[['one',True],['two',False],['three',True]]
In [379]: np.array([tuple(i) for i in alist], dtype=arr.dtype)
Out[379]:
array([(b'one', True), (b'two', False), (b'three', True)],
dtype=[('f0', 'S10'), ('f1', '?')])
The column_stack won't help as an intermediate array, since it has lost the original column dtypes.
You could also explore the functions in numpy.lib.recfunctions. That module has functions to merge arrays, but the ones I have looked at use the field-copy method, so they won't save any time.
In [381]: recfunctions.merge_arrays([['one','two','three'],[False,True,False]])
Out[381]:
array([('one', False), ('two', True), ('three', False)],
dtype=[('f0', '<U5'), ('f1', '?')])
I need to generate a numpy array with named columns from a list. I don't know how to do it, so for now I write the data to a temporary txt file and read it back with numpy's genfromtxt function:
my_data = np.genfromtxt('tmp.txt', delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
How can I get rid of genfromtxt? I need to generate the same structured array from a list of strings instead of a file.
List of strings
For a list of strings, you can use genfromtxt directly. It accepts any iterable that can feed it strings/lines one at a time. I use this approach all the time when answering genfromtxt questions, e.g. in
https://stackoverflow.com/a/35874408/901925
In [186]: txt='1|abc|Red|no'
In [187]: txt=[txt,txt,txt]
In [188]: A=np.genfromtxt(txt, dtype=None, delimiter='|')
In [189]: A
Out[189]:
array([(1, 'abc', 'Red', 'no'), (1, 'abc', 'Red', 'no'),
(1, 'abc', 'Red', 'no')],
dtype=[('f0', '<i4'), ('f1', 'S3'), ('f2', 'S3'), ('f3', 'S2')])
In Python 3 there's the added complication of byte strings vs. regular ones.
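On a new enough numpy (1.14+), passing encoding=None sidesteps the bytes issue and yields native strings. A sketch with made-up rows and a subset of the asker's field names, feeding genfromtxt an in-memory list instead of a file:
import numpy as np

lines = ['1|abc|Red|no', '2|def|Blue|yes']   # hypothetical sample rows
A = np.genfromtxt(lines, delimiter='|', dtype=None, encoding=None,
                  names=['Num', 'Desc', 'Color', 'Smb'])
print(A['Color'])   # -> ['Red' 'Blue'], plain str fields, not bytes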
List of values
In some ways genfromtxt is an easy way of creating a structured array. But with a few key facts, it isn't hard to generate one directly.
First, define the dtype. There are various ways of doing this.
dt = np.dtype([('name1',int),('name2',float),('name3','S10'),...])
I usually test this expression in an interactive shell.
A = np.zeros((n,), dtype=dt)
creates a 'blank' array of the correct type. Try it with a small n, and print the result.
Now try assigning values. The easiest is by field
A['name1'] = [1,2,3]
A['name3'] = ['abc','def',...]
Or by record
A[0] = (1, 1.23, 'str', ...)
Multiple records are assigned values with a list of tuples. That is the key. For a 2d array, a list of lists works; but for a structured 1d array the elements have to be tuples.
A = np.array([(1,1.2,'abc'),(2,342.,'xyz'),(3,0,'')], dtype=dt)
Sometimes it helps to use a list comprehension to turn a list of lists into a list of tuples.
alist = [[1,1.0,'str'],[]...]
A[:] = [tuple(l) for l in alist]
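Putting those pieces together, a minimal end-to-end sketch (the field names and values here are arbitrary):
import numpy as np

dt = np.dtype([('name1', int), ('name2', float), ('name3', 'S10')])

A = np.zeros((3,), dtype=dt)      # 'blank' array of the correct type
A['name1'] = [1, 2, 3]            # assign by field
A['name2'] = [1.2, 342.0, 0.0]
A['name3'] = [b'abc', b'xyz', b'']

# or build it in one step from a list of lists, converted to tuples
alist = [[1, 1.2, 'abc'], [2, 342.0, 'xyz'], [3, 0.0, '']]
B = np.array([tuple(l) for l in alist], dtype=dt)

assert (A == B).all()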
As seen here:
http://library.isr.ist.utl.pt/docs/numpy/user/basics.io.genfromtxt.html#choosing-the-data-type
"In all the cases but the first one, the output will be a 1D array with a structured dtype. This dtype has as many fields as items in the sequence. The field names are defined with the names keyword."
The problem is how do I get around this? I want to use genfromtxt with a data file with columns that are, e.g. int, string, int.
If I do:
dtype=(int, "|S5", int)
Then the entire shape changes from (x, y) to merely (x, ) and I get 'too many indices' errors when I try to use masks.
When I use dtype=None I get to keep the 2D structure, but it often makes mistakes if the 1st row of a column looks like it could be a number (this often occurs in my data set).
How am I best to get around this?
You cannot have a true 2D array here: it would mean having 1D arrays with a mixed dtype for each row, which is not possible.
Having an array of records shouldn't be a problem:
In [1]: import numpy as np
In [2]: !cat test.txt
42 foo 41
40 bar 39
In [3]: data = np.genfromtxt('test.txt',
...: dtype=np.dtype([('f1', int), ('f2', np.str_, 5), ('f3', int)]))
In [4]: data
Out[4]:
array([(42, 'foo', 41), (40, 'bar', 39)],
dtype=[('f1', '<i8'), ('f2', '<U5'), ('f3', '<i8')])
In [5]: data['f3']
Out[5]: array([41, 39])
In [6]: data['f3'][1]
Out[6]: 39
If you need a masked array, look here: How can I mask elements of a record array in Numpy?
To mask by 1st column value:
In [7]: data['f1'] == 40
Out[7]: array([False, True], dtype=bool)
In [8]: data[data['f1'] == 40]
Out[8]:
array([(40, 'bar', 39)],
dtype=[('f1', '<i8'), ('f2', '<U5'), ('f3', '<i8')])
Is there an implementation where I can join two arrays based on their keys? Speaking of which, is the canonical way to store keys in one of the NumPy columns (NumPy doesn't have an 'id' or 'rownames' attribute)?
If you want to use only numpy, you can use structured arrays and the lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A little example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (documentation). I have no experience with it, but it provides more powerful data structures and functionality than standard numpy, "to make working with “relational” or “labeled” data both easy and intuitive". And it certainly has joining and merging functions (for example see http://pandas.sourceforge.net/merging.html#joining-on-a-key).
If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
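A hedged sketch of how pandas.merge copes where join_by cannot (the frames are made up): a key duplicated on both sides simply produces every matching pair:
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 2], 'A': [10.0, 20.0, 21.0]})
b = pd.DataFrame({'id': [2, 2, 3], 'B': [200.0, 201.0, 300.0]})
print(pd.merge(a, b, on='id', how='inner'))
# id 2 occurs twice on each side, so the join yields 2 * 2 = 4 rows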
In my case, I absolutely want duplicate keys. I'm comparing the rows of each column with the rows of all the other columns, inclusive (or, thinking like a database person, I want an inner join without an on or where clause - i.e. a cross join). Or, translated into a loop, something like this:
for i in a:
    for j in a:
        print(i, j, i*j)
Such procedures are frequent in data mining operations.
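For reference, the same full pairing (a cross join) can be sketched in pandas by merging on a constant dummy key (newer pandas also accepts how='cross' directly):
import pandas as pd

a = pd.DataFrame({'val': [1, 2, 3]})
cross = a.assign(key=1).merge(a.assign(key=1), on='key').drop(columns='key')
cross['product'] = cross['val_x'] * cross['val_y']
print(cross)   # 3 * 3 = 9 rows, matching the nested loop above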