Generate a numpy structured array from a list instead of genfromtxt - python

I need to generate a numpy array with named columns from a list. I don't know how to do it, so for now I write the data to a temporary txt file and read it back with numpy's genfromtxt function.
my_data = np.genfromtxt('tmp.txt', delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
How can I get rid of genfromtxt? I need to generate an array with the same structure from a list of strings instead of a file.

List of strings
For a list of strings, you can use genfromtxt directly. It accepts any iterable that can feed it strings/lines one at a time. I use this approach all the time when answering genfromtxt questions, e.g. in
https://stackoverflow.com/a/35874408/901925
In [186]: txt='1|abc|Red|no'
In [187]: txt=[txt,txt,txt]
In [188]: A=np.genfromtxt(txt, dtype=None, delimiter='|')
In [189]: A
Out[189]:
array([(1, 'abc', 'Red', 'no'), (1, 'abc', 'Red', 'no'),
(1, 'abc', 'Red', 'no')],
dtype=[('f0', '<i4'), ('f1', 'S3'), ('f2', 'S3'), ('f3', 'S2')])
In Python3 there's the added complication of byte strings v. regular ones.
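A quick way around the bytestrings in Python3 (a sketch, assuming a numpy recent enough to have genfromtxt's encoding keyword, i.e. 1.14+):
# assumption: numpy >= 1.14, where genfromtxt accepts an encoding argument
A = np.genfromtxt(txt, dtype=None, delimiter='|', encoding='utf-8')
# string fields now come back as unicode ('<U3') instead of bytes ('S3')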
List of values
In some ways genfromtxt is an easy way of creating a structured array. But with a few key facts, it isn't hard to generate one directly.
First, define the dtype. There are various ways of doing this.
dt = np.dtype([('name1',int),('name2',float),('name3','S10'),...])
I usually test this expression in an interactive shell.
A = np.zeros((n,), dtype=dt)
creates a 'blank' array of the correct type. Try it with a small n, and print the result.
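For instance, a minimal sketch with a hypothetical three-field dtype and n=3:
dt = np.dtype([('name1', int), ('name2', float), ('name3', 'S10')])
A = np.zeros((3,), dtype=dt)
# each record starts out as (0, 0.0, b'')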
Now try assigning values. The easiest is by field
A['name1'] = [1,2,3]
A['name3'] = ['abc','def',...]
Or by record
A[0] = (1, 1.23, 'str', ...)
Multiple records are assigned values with a list of tuples. That is the key: for a 2d array, a list of lists works, but for a structured 1d array the elements have to be tuples.
A = np.array([(1,1.2,'abc'),(2,342.,'xyz'),(3,0,'')], dtype=dt)
Sometimes it helps to use a list comprehension to turn a list of lists into a list of tuples.
alist = [[1, 1.0, 'str'], ...]
A[:] = [tuple(l) for l in alist]
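Putting the pieces together, a minimal sketch (the field names and values are just placeholders):
dt = np.dtype([('name1', int), ('name2', float), ('name3', 'S10')])
alist = [[1, 1.2, 'abc'], [2, 342.0, 'xyz'], [3, 0.0, '']]
A = np.zeros((len(alist),), dtype=dt)
A[:] = [tuple(l) for l in alist]   # a list of tuples fills all records at once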

Related

How to apply np.ceil to a structured numpy array

I'm trying to use the np.ceil function on a structured numpy array, but all I get is the error message:
TypeError: ufunc 'ceil' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Here's a simple example of what that array would look like:
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype=[("x", "<f8"), ("y", "<f8")])
When I try
np.ceil(arr)
I get the above mentioned error. When I just use one column, it works:
In [77]: np.ceil(arr["x"])
Out[77]: array([ 2., 4.])
But I need to get the entire array. Is there any way other than going column by column, or not using structured arrays all together?
Here's a dirty solution based on viewing the array without its structure, taking the ceiling, and then converting it back to a structured array.
# sample array
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype = [("x", "<f8"), ("y", "<f8")])
# remove struct and take the ceiling
arr1 = np.ceil(arr.view((float, len(arr.dtype.names))))
# coerce it back into the struct
arr = np.array(list(tuple(t) for t in arr1), dtype = arr.dtype)
# kill the intermediate copy
del arr1
And here it is as an unreadable one-liner, but without assigning the intermediate copy arr1:
arr = np.array(
    list(tuple(t) for t in np.ceil(arr.view((float, len(arr.dtype.names))))),
    dtype=arr.dtype
)
# array([(2., 3.), (4., 5.)], dtype=[('x', '<f8'), ('y', '<f8')])
I don't claim this is a great solution, but it should help you move on with your project until something better is proposed.
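For comparison, the column-by-column route the question hoped to avoid is just a short loop over the field names (a sketch that modifies arr in place and assumes every field holds floats):
for name in arr.dtype.names:
    arr[name] = np.ceil(arr[name])   # ceil each float field separately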

How to efficiently extract values from nested numpy arrays generated by loadmat function?

Is there a more efficient method in python to extract data from a nested array such as A = array([[array([[12000000]])]], dtype=object)? I have been using A[0][0][0][0], which does not seem to be an efficient method when you have lots of data like A.
I have also used
numpy.squeeze(array([[array([[12000000]])]], dtype=object)) but this gives me
array(array([[12000000]]), dtype=object)
PS: The nested array was generated by the loadmat() function in the scipy module to load a .mat file which consists of nested structures.
Creating such an array is a bit tedious, but loadmat does it to handle the MATLAB cells and 2d matrix:
In [5]: A = np.empty((1,1),object)
In [6]: A[0,0] = np.array([[1.23]])
In [7]: A
Out[7]: array([[array([[ 1.23]])]], dtype=object)
In [8]: A.any()
Out[8]: array([[ 1.23]])
In [9]: A.shape
Out[9]: (1, 1)
squeeze compresses the shape, but does not cross the object boundary
In [10]: np.squeeze(A)
Out[10]: array(array([[ 1.23]]), dtype=object)
But if you have one item in an array (regardless of shape), item() can extract it. Indexing also works: A[0,0]
In [11]: np.squeeze(A).item()
Out[11]: array([[ 1.23]])
item again to extract the number from that inner array:
In [12]: np.squeeze(A).item().item()
Out[12]: 1.23
Or we don't even need the squeeze:
In [13]: A.item().item()
Out[13]: 1.23
loadmat has a squeeze_me parameter.
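For example (a sketch; squeeze_me is a documented loadmat keyword, and the file name is only a placeholder):
from scipy.io import loadmat
mat = loadmat('data.mat', squeeze_me=True)   # size-1 MATLAB dimensions are collapsed on load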
Indexing is just as easy:
In [17]: A[0,0]
Out[17]: array([[ 1.23]])
In [18]: A[0,0][0,0]
Out[18]: 1.23
astype can also work (though it can be picky about the number of dimensions).
In [21]: A.astype(float)
Out[21]: array([[ 1.23]])
With single item arrays like this, efficiency isn't much of an issue. All these methods are quick. Things become more complicated when the array has many items, or the items are themselves large.
You could use A.all() or A.any() to get a scalar. This would only work if A contains one element.
Try A.flatten()[0]
This will flatten the array into a single dimension and extract the first item from it. In your case, the first item is the only item.
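Applied to the A from the question (a sketch), one more index step reaches the scalar:
inner = A.flatten()[0]   # the inner 2-D array, array([[12000000]])
value = inner[0, 0]      # 12000000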
What worked in my case was the following:
import os
import scipy.io
xcat = scipy.io.loadmat(os.path.join(dir_data, file_name))
pars = xcat['pars'] # Extract the numpy.void element from the loadmat object
# Note that you are dealing with a numpy structured array object when you enter pars[0][0].
# Thus you can access names and all that...
dict_values = [x[0][0] for x in pars[0][0]] # Extract all elements in one go
dict_keys = list(pars.dtype.names) # Extract the corresponding names/tags
dict_xcat = dict(zip(dict_keys, dict_values)) # Pack it up again in a dict
The idea behind this is: first extract ALL the values I want and format them in a nice python dict. This saves me from cumbersome indexing later in the file.
Of course, this is a very specific solution, since in my case the values I needed were all floats/ints.

Reading a particular column in csv file using numpy in python

How to read string column only using numpy in python?
csv file:
1,2,3,"Hello"
3,3,3,"New"
4,5,6,"York"
How to get array like:
["Hello","york","New"]
without using pandas and sklearn.
I give the column names as a,b,c,d in the csv
import numpy as np
ary=np.genfromtxt(r'yourcsv.csv',delimiter=',',dtype=None)
ary.T[-1]
Out[139]:
array([b'd', b'Hello', b'New', b'York'],
dtype='|S5')
import numpy
fname = 'sample.csv'
csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
names = csv[:,-1]
print(names)
Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:
a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.
a sequence of types, such as dtype=(int, float, float).
a comma-separated string, such as dtype="i4,f8,|U3".
a dictionary with two keys 'names' and 'formats'.
a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].
an existing numpy.dtype object.
the special value None. In that case, the type of the columns will be determined from the data itself (see below).
When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false in lower cases); then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.
The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
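As an illustration of an explicit dtype (a sketch; it assumes a csv laid out like the three data rows shown in the question, with no header, and the field names are placeholders):
dt = [('a', int), ('b', int), ('c', int), ('d', 'U10')]
data = np.genfromtxt('yourcsv.csv', delimiter=',', dtype=dt)
names = data['d']   # the string column; the quotes from the file are kept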
A quick file substitute:
In [275]: txt = b'''
...: 1,2,3,"Hello"
...: 3,3,3,"New"
...: 4,5,6,"York"'''
In [277]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None,usecols=3)
Out[277]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
bytestring array in Py3; or a default unicode string dtype:
In [278]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=str,usecols=3)
Out[278]:
array(['"Hello"', '"New"', '"York"'],
dtype='<U7')
Or the whole thing:
In [279]: data=np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None)
In [280]: data
Out[280]:
array([(1, 2, 3, b'"Hello"'), (3, 3, 3, b'"New"'), (4, 5, 6, b'"York"')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S7')])
select the f3 field:
In [282]: data['f3']
Out[282]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
Speed should be basically the same
To extract specific values into the numpy array one approach could be:
import csv
import numpy as np

with open('Exercise1.csv', 'r') as file:
    file_content = list(csv.reader(file, delimiter=","))
data = np.array(file_content)
print(file_content[1][1], len(file_content))
patient = []
for i in range(1, len(file_content)):
    patient.append(file_content[i][0])
first_column_array = np.array(patient, dtype=str)
i iterates through the rows of data, and the second index is the place of the value in the row, so 0 picks the first value.

Apply function to single column of structured numpy array in Python

I have a structured numpy array with two columns. One column contains a series of date times as strings, and the other contains measured values corresponding to that date.
data = array([('date1', 2.3), ('date3', 2.4), ...],
             dtype=[('date', '<U16'), ('val', '<f8')])
I also have a number of functions similar to the following:
def example_func(x):
    return 5*x + 1
I am trying to apply example_func to the second column of my array and generate the result
array([('date1', 12.5), ('date3', 13.0), ...],
      dtype=[('date', '<U16'), ('val', '<f8')])
Everything I try, however, either raises a future warning from numpy or requires a for loop. Any ideas on how I can do this efficiently?
This works for me:
In [7]: example_func(data['val'])
Out[7]: array([ 12.5, 13. ])
In [8]: data['val'] = example_func(data['val'])
In [9]: data
Out[9]:
array([('date1', 12.5), ('date3', 13. )],
dtype=[('date', '<U16'), ('val', '<f8')])
In [10]: np.__version__
Out[10]: '1.12.0'
I have gotten future warnings when accessing several fields (with a list of names), and then attempting some sort of modification. It suggests making a copy etc. But I can't generate such a warning with a single field access like this.
In [15]: data[['val', 'date']]
Out[15]:
array([( 12.5, 'date1'), ( 13. , 'date3')],
dtype=[('val', '<f8'), ('date', '<U16')])
In [16]: data[['val', 'date']][0] = (12, 'date2')
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
Developers aren't happy with how multiple fields are accessed at once. It's ok to read them, but writing to them is under evaluation. And in 1.13 there's some change about copying fields by position rather than name.

SQL join or R's merge() function in NumPy?

Is there an implementation where I can join two arrays based on their keys? Speaking of which, is the canonical way to store keys in one of the NumPy columns (NumPy doesn't have an 'id' or 'rownames' attribute)?
If you want to use only numpy, you can use structured arrays and the lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A little example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (documentation). I have no experience with it, but it provides more powerful data structures and functionality than standard numpy, "to make working with “relational” or “labeled” data both easy and intuitive". And it certainly has joining and merging functions (for example see http://pandas.sourceforge.net/merging.html#joining-on-a-key).
If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
In my case, I absolutely want duplicate keys. I'm comparing the rows of each column with the rows of all the other columns, inclusive (or, thinking like a database person, I want an inner join without an on or where clause). Or, translated into a loop, something like this:
for i in a:
    for j in a:
        print(i, j, i*j)
Such procedures are frequent in data mining operations.
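For reference, a minimal sketch of the pandas route with duplicate keys (the column names are placeholders); unlike join_by, pandas.merge keeps every matching pair:
import pandas as pd
a = pd.DataFrame({'id': [1, 2, 2], 'A': [10., 20., 21.]})
b = pd.DataFrame({'id': [2, 2, 3], 'B': [200., 201., 300.]})
merged = pd.merge(a, b, on='id', how='inner')   # id=2 yields 4 rows: every pairing is kept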
