Apply function to single column of structured numpy array in Python

I have a structured numpy array with two columns. One column contains a series of date times as strings, and the other contains measured values corresponding to that date.
data = array([('date1', 2.3), ('date3', 2.4), ...],
      dtype=[('date', '<U16'), ('val', '<f8')])
I also have a number of functions similar to the following:
def example_func(x):
    return 5*x + 1
I am trying to apply example_func to the second column of my array and generate the result
array([('date1', 12.5), ('date3', 13. ), ...],
      dtype=[('date', '<U16'), ('val', '<f8')])
Everything I try, however, either raises a future warning from numpy or requires a for loop. Any ideas on how I can do this efficiently?

This works for me:
In [7]: example_func(data['val'])
Out[7]: array([ 12.5, 13. ])
In [8]: data['val'] = example_func(data['val'])
In [9]: data
Out[9]:
array([('date1', 12.5), ('date3', 13. )],
dtype=[('date', '<U16'), ('val', '<f8')])
In [10]: np.__version__
Out[10]: '1.12.0'
I have gotten future warnings when accessing several fields (with a list of names) and then attempting some sort of modification; the warning suggests making an explicit copy. But I can't generate such a warning with a single-field access like this.
In [15]: data[['val', 'date']]
Out[15]:
array([( 12.5, 'date1'), ( 13. , 'date3')],
dtype=[('val', '<f8'), ('date', '<U16')])
In [16]: data[['val', 'date']][0] = (12, 'date2')
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
The numpy developers aren't happy with how access to several fields at once works. Reading them is fine, but writing to such a selection is under review. And in 1.13 there is a change about copying fields by position rather than by name.
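For completeness, here is a minimal sketch of the copy-based workaround the warning itself suggests, reusing the names from the example above:
# Multi-field selection: take an explicit copy before modifying it,
# so the write never touches the (possibly read-only) multi-field view.
sub = data[['val', 'date']].copy()
sub[0] = (12.0, 'date2')                 # modifies the copy only

# Single-field access stays writable in place, as shown above.
data['val'] = example_func(data['val'])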

Related

How to apply np.ceil to a structured numpy array

I'm trying to use the np.ceil function on a structured numpy array, but all I get is the error message:
TypeError: ufunc 'ceil' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Here's a simple example of what that array would look like:
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype=[("x", "<f8"), ("y", "<f8")])
When I try
np.ceil(arr)
I get the above mentioned error. When I just use one column, it works:
In [77]: np.ceil(arr["x"])
Out[77]: array([ 2., 4.])
But I need to get the entire array. Is there any way other than going column by column, or not using structured arrays altogether?
Here's a dirty solution based on viewing the array without its structure, taking the ceiling, and then converting it back to a structured array.
# sample array
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype = [("x", "<f8"), ("y", "<f8")])
# remove struct and take the ceiling
arr1 = np.ceil(arr.view((float, len(arr.dtype.names))))
# coerce it back into the struct
arr = np.array(list(tuple(t) for t in arr1), dtype = arr.dtype)
# kill the intermediate copy
del arr1
And here it is as an unreadable one-liner, without assigning the intermediate copy arr1:
arr = np.array(
list(tuple(t) for t in np.ceil(arr.view((float, len(arr.dtype.names))))),
dtype = arr.dtype
)
# array([(2., 3.), (4., 5.)], dtype=[('x', '<f8'), ('y', '<f8')])
I don't claim this is a great solution, but it should help you move on with your project until something better is proposed.
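On newer numpy (1.16+) the recfunctions helpers structured_to_unstructured / unstructured_to_structured do the same view-and-rebuild dance without the list comprehension. A sketch, assuming all fields share a numeric type:
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.array([(1.4, 2.3), (3.2, 4.1)], dtype=[("x", "<f8"), ("y", "<f8")])

# Drop the structure, apply the ufunc, then rebuild the structured array.
plain = rfn.structured_to_unstructured(arr)          # plain (2, 2) float array
arr = rfn.unstructured_to_structured(np.ceil(plain), dtype=arr.dtype)
# array([(2., 3.), (4., 5.)], dtype=[('x', '<f8'), ('y', '<f8')])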

Adding a field to a structured numpy array (4)

This has been addressed before (here, here and here). I want to add a new field to a structured array returned by numpy genfromtxt (also asked here).
My new problem is that the csv file I'm reading has only a header line and a single data row:
output-Summary.csv:
Wedge, DWD, Yield (wedge), Efficiency
1, 16.097825, 44283299.473156, 2750887.118836
I'm reading it via genfromtxt and calculating a new value 'tl':
test_out = np.genfromtxt('output-Summary.csv', delimiter=',', names=True)
tl = 300 / test_out['DWD']
test_out looks like this:
array((1., 16.097825, 44283299.473156, 2750887.118836),
dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8')])
Using recfunctions.append_fields (as suggested in the examples 1-3 above) fails because len() gets applied to the 0d array:
from numpy.lib import recfunctions as rfn
rfn.append_fields(test_out,'tl',tl)
TypeError: len() of unsized object
Searching for alternatives (one of the answers here) I find that mlab.rec_append_fields works well (but is deprecated):
mlab.rec_append_fields(test_out,'tl',tl)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: MatplotlibDeprecationWarning: The rec_append_fields function was deprecated in version 2.2.
"""Entry point for launching an IPython kernel.
rec.array((1., 16.097825, 44283299.473156, 2750887.118836, 18.63605798),
dtype=[('Wedge', '<f8'), ('DWD', '<f8'), ('Yield_wedge', '<f8'), ('Efficiency', '<f8'), ('tl', '<f8')])
I can also copy the array over to a new structured array "by hand" as suggested here. This works:
# new_dt extends the original dtype with the new field (assumed '<f8' here)
new_dt = np.dtype(test_out.dtype.descr + [('tl', '<f8')])
test_out_new = np.zeros(test_out.shape, dtype=new_dt)
for name in test_out.dtype.names:
    test_out_new[name] = test_out[name]
test_out_new['tl'] = tl
So in summary - is there a way to get recfunctions.append_fields to work with the genfromtxt output from my single-row csv file?
I would really rather use a standard approach to handle this than a home brew.
Reshape the array (and the new field) to shape (1,). With just one data line, genfromtxt loads the data as a 0d array, shape (). The rfn code isn't heavily used and isn't as robust as it should be; in other words, the 'standard way' is still a bit buggy.
For example:
In [201]: arr=np.array((1,2,3), dtype='i,i,i')
In [202]: arr.reshape(1)
Out[202]: array([(1, 2, 3)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
In [203]: rfn.append_fields(arr.reshape(1), 't1',[1], usemask=False)
Out[203]:
array([(1, 2, 3, 1)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('t1', '<i8')])
Nothing wrong with the home brew. Most of the rfn functions use that mechanism - define a new dtype, create a recipient array with that dtype, and copy the fields over, name by name.
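Applied to the asker's variables, the reshape fix would look something like this sketch (np.atleast_1d is used because tl is also 0d):
import numpy as np
from numpy.lib import recfunctions as rfn

# Promote both the 0d record and the 0d new value to shape (1,) first.
test_out1 = rfn.append_fields(test_out.reshape(1), 'tl', np.atleast_1d(tl), usemask=False)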

Reading a particular column in csv file using numpy in python

How can I read only the string column using numpy in Python?
csv file:
1,2,3,"Hello"
3,3,3,"New"
4,5,6,"York"
How to get array like:
["Hello","york","New"]
without using pandas and sklearn.
I give the column names as a,b,c,d in the csv.
import numpy as np
ary=np.genfromtxt(r'yourcsv.csv',delimiter=',',dtype=None)
ary.T[-1]
Out[139]:
array([b'd', b'Hello', b'New', b'York'],
dtype='|S5')
import numpy
fname = 'sample.csv'
csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
names = csv[:,-1]
print(names)
Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:
a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.
a sequence of types, such as dtype=(int, float, float).
a comma-separated string, such as dtype="i4,f8,|U3".
a dictionary with two keys 'names' and 'formats'.
a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].
an existing numpy.dtype object.
the special value None. In that case, the type of the columns will be determined from the data itself (see below).
When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false in lower cases); then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.
The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
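For the sample file above, setting the dtype explicitly might look like the following sketch (the string width 'U10' is an assumption):
import numpy as np

txt = ['1,2,3,"Hello"', '3,3,3,"New"', '4,5,6,"York"']

# Explicit per-column dtype, so genfromtxt does not have to guess each column:
data = np.genfromtxt(txt, delimiter=',', dtype='i4,i4,i4,U10')
names = data['f3']        # the string column, quotes still included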
A quick file substitute:
In [275]: txt = b'''
...: 1,2,3,"Hello"
...: 3,3,3,"New"
...: 4,5,6,"York"'''
In [277]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None,usecols=3)
Out[277]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
bytestring array in Py3; or a default unicode string dtype:
In [278]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=str,usecols=3)
Out[278]:
array(['"Hello"', '"New"', '"York"'],
dtype='<U7')
Or the whole thing:
In [279]: data=np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None)
In [280]: data
Out[280]:
array([(1, 2, 3, b'"Hello"'), (3, 3, 3, b'"New"'), (4, 5, 6, b'"York"')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S7')])
select the f3 field:
In [282]: data['f3']
Out[282]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
Speed should be basically the same.
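The double quotes from the csv end up inside the strings in every variant above; if they are unwanted, one follow-up (a sketch, not part of the original answer) is to strip them after loading:
# Remove the surrounding quotes element-wise (bytes in, bytes out).
names = np.char.strip(data['f3'], b'"')
# -> b'Hello', b'New', b'York', without the quotes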
To extract specific values into the numpy array one approach could be:
import csv
import numpy as np

with open('Exercise1.csv', 'r') as file:
    file_content = list(csv.reader(file, delimiter=","))

data = np.array(file_content)
print(file_content[1][1], len(file_content))

patient = []
for i in range(1, len(file_content)):
    patient.append(file_content[i][0])
first_column_array = np.array(patient, dtype=str)
Here i iterates through the rows of the file, and the second index picks the position of the value within a row, so 0 gives the first value.

Python generate from list instead of genfromtxt

I need to generate a numpy array with named columns from a list. I don't know how to do it, so for now I write a temporary txt file and read it with numpy's genfromtxt function.
my_data = np.genfromtxt('tmp.txt', delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
How can I get rid of genfromtxt? I need to generate the same structured array from a list of strings instead of a file.
List of strings
For a list of strings, you can use genfromtxt directly. It accepts any iterable that can feed it strings/lines one at a time. I use this approach all the time when answering genfromtxt questions, e.g. in
https://stackoverflow.com/a/35874408/901925
In [186]: txt='1|abc|Red|no'
In [187]: txt=[txt,txt,txt]
In [188]: A=np.genfromtxt(txt, dtype=None, delimiter='|')
In [189]: A
Out[189]:
array([(1, 'abc', 'Red', 'no'), (1, 'abc', 'Red', 'no'),
(1, 'abc', 'Red', 'no')],
dtype=[('f0', '<i4'), ('f1', 'S3'), ('f2', 'S3'), ('f3', 'S2')])
In Python3 there's the added complication of byte strings v. regular ones.
List of values
In some ways genfromtxt is an easy way of creating a structured array. But with a few key facts, it isn't hard to generate one directly.
First, define the dtype. There are various ways of doing this.
dt = np.dtype([('name1',int),('name2',float),('name3','S10'),...])
I usually test this expression in an interactive shell.
A = np.zeros((n,), dtype=dt)
creates a 'blank' array of the correct type. Try it with a small n and print the result.
Now try assigning values. The easiest is by field:
A['name1'] = [1,2,3]
A['name3'] = ['abc','def',...]
Or by record
A[0] = (1, 1.23, 'str', ...)
Multiple records are assigned values with a list of tuples; that is the key. For a 2d array a list of lists works, but for a structured 1d array the elements have to be tuples.
A = np.array([(1,1.2,'abc'),(2,342.,'xyz'),(3,0,'')], dtype=dt)
Sometimes it helps to use a list comprehension to turn a list of lists into a list of tuples.
alist = [[1,1.0,'str'],[]...]
A[:] = [tuple(l) for l in alist]
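Putting the pieces together for the asker's columns, a direct construction from a list of '|'-delimited strings might look like this sketch (the field types and the sample row are guesses, not given in the question):
import numpy as np

dt = np.dtype([("Num", "i4"), ("Date", "U10"), ("Desc", "U20"), ("Rgh", "f8"),
               ("Prc", "f8"), ("Color", "U10"), ("Smb", "U10"), ("MType", "U10")])

rows = ["1|2020-01-01|demo|0.5|9.99|Red|X|A"]        # hypothetical input lines
A = np.zeros(len(rows), dtype=dt)
for i, line in enumerate(rows):
    num, date, desc, rgh, prc, color, smb, mtype = line.split("|")
    A[i] = (int(num), date, desc, float(rgh), float(prc), color, smb, mtype)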

SQL join or R's merge() function in NumPy?

Is there an implementation where I can join two arrays based on their keys? Speaking of which, is the canonical way to store keys to put them in one of the array's columns (NumPy doesn't have an 'id' or 'rownames' attribute)?
If you want to use only numpy, you can use structured arrays and the lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A little example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (documentation). I have no experience with it, but it provides more powerful data structures and functionality than standard numpy, "to make working with “relational” or “labeled” data both easy and intuitive". And it certainly has joining and merging functions (for example see http://pandas.sourceforge.net/merging.html#joining-on-a-key).
If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
In my case, I absolutely want duplicate keys. I'm comparing the rows of each column with the rows of all the other columns, inclusive (or, thinking like a database person, I want an inner join without an on or where clause). Or, translated into a loop, something like this:
for i in a:
    for j in a:
        print(i, j, i*j)
Such procedures are frequent in data mining operations.
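If duplicates (or even a full cross product) are what you want, a pandas sketch along these lines handles it; with how='inner', every matching pair of rows is emitted:
import numpy as np
import pandas as pd

a = np.array([(1, 10.), (1, 20.), (2, 30.)], dtype=[('id', int), ('A', float)])
b = np.array([(1, 100.), (1, 200.), (3, 300.)], dtype=[('id', int), ('B', float)])

# Duplicate 'id' values are fine here: each match pairs up, so id=1 yields 4 rows.
merged = pd.merge(pd.DataFrame(a), pd.DataFrame(b), on='id', how='inner')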
