SQL join or R's merge() function in NumPy?

Is there an implementation where I can join two arrays based on their keys? Speaking of which, is the canonical way to store keys in one of the NumPy columns (NumPy doesn't have an 'id' or 'rownames' attribute)?

If you want to use only numpy, you can use structured arrays and the lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A little example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (documentation). I have no experience with it, but it provides more powerful data structures and functionality than standard numpy, "to make working with “relational” or “labeled” data both easy and intuitive". And it certainly has joining and merging functions (for example see http://pandas.sourceforge.net/merging.html#joining-on-a-key).
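A rough sketch of what the pandas route could look like (untested here; it reuses the a and b structured arrays from the join_by example above, and relies on the fact that pandas' DataFrame constructor accepts structured arrays):
import pandas as pd

df_a = pd.DataFrame(a)                          # columns: id, A
df_b = pd.DataFrame(b)                          # columns: id, B
merged = pd.merge(df_a, df_b, on='id', how='inner')
#    id     A      B
# 0   2  20.0  200.0
# 1   3  30.0  300.0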

If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
In my case, I absolutely want duplicate keys. I'm comparing every row with every other row, including itself (or, thinking like a database person, I want a cross join: a join without an on or where clause). Or, translated into a loop, something like this:
for i in a:
    for j in a:
        print(i, j, i*j)
Such procedures are frequent in data mining operations.
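A hedged sketch of that all-pairs pattern with pandas.merge, which does tolerate duplicate keys: joining both sides on a constant dummy key pairs every row with every row (newer pandas versions also accept how='cross' for the same effect):
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]})
df['key'] = 1                              # same dummy key on every row
pairs = pd.merge(df, df, on='key')         # every row paired with every row
pairs['prod'] = pairs['val_x'] * pairs['val_y']
print(pairs[['val_x', 'val_y', 'prod']])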


How to read string column only using numpy in python?
csv file:
1,2,3,"Hello"
3,3,3,"New"
4,5,6,"York"
How to get an array like:
["Hello","New","York"]
without using pandas and sklearn.
I give the column names as a,b,c,d in the csv.
import numpy as np
ary = np.genfromtxt(r'yourcsv.csv', delimiter=',', dtype=None)
ary.T[-1]   # last column; note the header label 'd' is read as data too
Out[139]:
array([b'd', b'Hello', b'New', b'York'],
dtype='|S5')
import numpy
fname = 'sample.csv'
csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
names = csv[:,-1]
print(names)
Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:
a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.
a sequence of types, such as dtype=(int, float, float).
a comma-separated string, such as dtype="i4,f8,|U3".
a dictionary with two keys 'names' and 'formats'.
a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].
an existing numpy.dtype object.
the special value None. In that case, the type of the columns will be determined from the data itself (see below).
When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false in lower cases); then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.
The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
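A small sketch (recent numpy assumed) of the (name, type) form from the list above, applied to the sample rows; note that genfromtxt keeps the surrounding quotes in the string field:
import numpy as np

lines = ['1,2,3,"Hello"', '3,3,3,"New"', '4,5,6,"York"']
dt = [('a', int), ('b', int), ('c', int), ('d', 'U7')]
arr = np.genfromtxt(lines, delimiter=',', dtype=dt)
print(arr['d'])    # ['"Hello"' '"New"' '"York"']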
A quick file substitute:
In [275]: txt = b'''
...: 1,2,3,"Hello"
...: 3,3,3,"New"
...: 4,5,6,"York"'''
In [277]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None,usecols=3)
Out[277]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
That gives a bytestring array in Py3; with dtype=str you get a unicode string dtype instead:
In [278]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=str,usecols=3)
Out[278]:
array(['"Hello"', '"New"', '"York"'],
dtype='<U7')
Or the whole thing:
In [279]: data=np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None)
In [280]: data
Out[280]:
array([(1, 2, 3, b'"Hello"'), (3, 3, 3, b'"New"'), (4, 5, 6, b'"York"')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S7')])
select the f3 field:
In [282]: data['f3']
Out[282]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
Speed should be basically the same either way.
To extract specific values into a numpy array, one approach could be:
import csv
import numpy as np

with open('Exercise1.csv', 'r') as file:
    file_content = list(csv.reader(file, delimiter=","))

data = np.array(file_content)
print(file_content[1][1], len(file_content))

# collect the first value of every data row (skipping the header row)
patient = []
for i in range(1, len(file_content)):
    patient.append(file_content[i][0])
first_column_array = np.array(patient, dtype=str)
i iterates through the rows of file_content, and the second index picks the value's position within the row, so 0 gives the first value.

Apply function to single column of structured numpy array in Python

I have a structured numpy array with two columns. One column contains a series of date times as strings, and the other contains measured values corresponding to that date.
data = array([('date1', 2.3), ('date3', 2.4), ...],
             dtype=[('date', '<U16'), ('val', '<f8')])
I also have a number of functions similar to the following:
def example_func(x):
    return 5*x + 1
I am trying to apply example_func to the second column of my array and generate the result
array([('date1', 12.5), ('date3', 13. ), ...],
      dtype=[('date', '<U16'), ('val', '<f8')])
Everything I try, however, either raises a future warning from numpy or requires a for loop. Any ideas on how I can do this efficiently?
This works for me:
In [7]: example_func(data['val'])
Out[7]: array([ 12.5, 13. ])
In [8]: data['val'] = example_func(data['val'])
In [9]: data
Out[9]:
array([('date1', 12.5), ('date3', 13. )],
dtype=[('date', '<U16'), ('val', '<f8')])
In [10]: np.__version__
Out[10]: '1.12.0'
I have gotten future warnings when accessing several fields (with a list of names), and then attempting some sort of modification. It suggests making a copy etc. But I can't generate such a warning with a single field access like this.
In [15]: data[['val', 'date']]
Out[15]:
array([( 12.5, 'date1'), ( 13. , 'date3')],
dtype=[('val', '<f8'), ('date', '<U16')])
In [16]: data[['val', 'date']][0] = (12, 'date2')
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
The numpy developers aren't happy with how access to several fields at once currently works. Reading them is fine, but writing through such a multi-field view is under evaluation, and in 1.13 there's a change about copying fields by position rather than by name.
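A small sketch of the workaround the warning itself suggests: take an explicit copy of the multi-field selection and modify that, which avoids the FutureWarning (at the cost of not writing back into data):
sub = data[['val', 'date']].copy()   # independent copy of the two fields
sub[0] = (12.0, 'date2')             # modifies only the copy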

PYTHON/NUMPY: How to assign various data types to the data type object numpy.dtype() in a loop

I want to construct a numpy array with various different data types using a numpy.dtype() object.
I have a dictionary 'mydict' where all the information about the data is stored, and I want to create a data type object dt from it. 'mydict' is created dynamically depending on which properties I choose, and each format entry corresponds to one of those properties.
import numpy as np
mydict = {'name0': 'mass', 'name1': 'position', 'name2': 'ID',
          'format0': np.float32, 'format1': np.int8, 'format2': np.uint64}
the data type object dt should look like this
dt = np.dtype([('mass', np.float32), ('position', np.int8), ('ID', np.uint64)])
My question is how to create dt without constructing/writing it manually into the code?
The main problem is that I do not know how to append np.dtype() with another entry of 'name' and 'format' combination or if that is even possible ...
I am using dt then to read my data into a numpy array like this!
data_array = np.zeros((nr_rows, nr_cols), dtype=dt)
I tried certain attempts with Dict Comprehensions, lists and dictionaries but I could not find the right way to do that.
In [209]: dd={'name0': 'mass', 'name1': 'position', 'name2': 'ID',
...: 'format0': np.float32, 'format1': np.int8, 'format2': np.uint64}
In [213]: [(dd['name%s'%i],dd['format%s'%i]) for i in range(3)]
Out[213]: [('mass', numpy.float32), ('position', numpy.int8), ('ID', numpy.uint64)]
In [214]: dt=np.dtype([(dd['name%s'%i],dd['format%s'%i]) for i in range(3)])
In [216]: arr = np.zeros((2,), dt)
In [217]: arr
Out[217]:
array([( 0., 0, 0), ( 0., 0, 0)],
dtype=[('mass', '<f4'), ('position', 'i1'), ('ID', '<u8')])
If your keys follow strict patterns as in the question, you can take a look at this: 1) extract the values whose keys start with name and use them as the first element of each tuple; 2) replace name with format in the key and use the corresponding value as the second element; 3) construct the dtype from the list of tuples.
import numpy as np
d = mydict   # the dictionary from the question
np.dtype([(d[k], d[k.replace('name', 'format')]) for k in d.keys() if k.startswith('name')])
# dtype([('position', 'i1'), ('ID', '<u8'), ('mass', '<f4')])
Note: you may need an OrderedDict initially in order to have the right order of columns here.
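As an alternative to an OrderedDict, a small sketch (assuming the mydict layout from the question) that fixes the column order by sorting the name keys on their numeric suffix:
import numpy as np

names = sorted((k for k in mydict if k.startswith('name')), key=lambda k: int(k[4:]))
dt = np.dtype([(mydict[k], mydict[k.replace('name', 'format')]) for k in names])
# dtype([('mass', '<f4'), ('position', 'i1'), ('ID', '<u8')])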

Python: generate from list instead of genfromtxt

I need to generate a numpy array with named columns from a list. I don't know how to do it, so for now I use a temporary txt file so that I can use the genfromtxt numpy function.
my_data = np.genfromtxt('tmp.txt', delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
How can I get rid of genfromtxt, since I need to generate the same structured array from a list of strings instead of a file?
List of strings
For a list of strings, you can use genfromtxt directly. It accepts any iterable that can feed it strings/lines one at a time. I use this approach all the time when answering genfromtxt questions, e.g. in
https://stackoverflow.com/a/35874408/901925
In [186]: txt='1|abc|Red|no'
In [187]: txt=[txt,txt,txt]
In [188]: A=np.genfromtxt(txt, dtype=None, delimiter='|')
In [189]: A
Out[189]:
array([(1, 'abc', 'Red', 'no'), (1, 'abc', 'Red', 'no'),
(1, 'abc', 'Red', 'no')],
dtype=[('f0', '<i4'), ('f1', 'S3'), ('f2', 'S3'), ('f3', 'S2')])
In Python3 there's the added complication of byte strings v. regular ones.
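A hedged sketch, assuming numpy >= 1.14: passing encoding makes genfromtxt return ordinary (unicode) strings instead of byte strings in Python 3, reusing the txt list above:
A = np.genfromtxt(txt, dtype=None, delimiter='|', encoding='utf-8')
# the string fields now come out as '<U3', '<U3', '<U2' instead of 'S3', 'S3', 'S2'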
List of values
In many ways genfromtxt is an easy way of creating a structured array. But with a few key facts, it isn't hard to generate one directly.
First, define the dtype. There are various ways of doing this.
dt = np.dtype([('name1',int),('name2',float),('name3','S10'),...])
I usually test this expression in an interactive shell.
A = np.zeros((n,), dtype=dt)
creates a 'blank' array of the correct type. Try it with a small n, and print the result.
Now try assigning values. The easiest is by field
A['name1'] = [1,2,3]
A['name3'] = ['abc','def',...]
Or by record
A[0] = (1, 1.23, 'str', ...)
Multiple records are assigned values with a list of tuples. That is the key: for a 2d array a list of lists works, but for a structured 1d array the elements have to be tuples.
A = np.array([(1,1.2,'abc'),(2,342.,'xyz'),(3,0,'')], dtype=dt)
Sometimes it helps to use a list comprehension to turn a list of lists into a list of tuples.
alist = [[1,1.0,'str'],[]...]
A[:] = [tuple(l) for l in alist]
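Putting those steps together, a minimal self-contained sketch (the field names and values here are just illustrative):
import numpy as np

dt = np.dtype([('num', int), ('val', float), ('desc', 'S10')])
A = np.zeros((3,), dtype=dt)               # 'blank' array of the right type

A['num'] = [1, 2, 3]                       # assign by field
A[0] = (1, 1.23, 'abc')                    # assign one record with a tuple

alist = [[1, 1.0, 'abc'], [2, 342.0, 'xyz'], [3, 0.0, '']]
A[:] = [tuple(l) for l in alist]           # list of lists -> list of tuples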

List of tuples to Numpy recarray

Given a list of tuples, where each tuple represents a row in a table, e.g.
tab = [('a',1),('b',2)]
Is there an easy way to convert this to a record array? I tried
np.recarray(tab,dtype=[('name',str),('value',int)])
which doesn't seem to work.
try
np.rec.fromrecords(tab)
rec.array([('a', 1), ('b', 2)],
dtype=[('f0', '|S1'), ('f1', '<i4')])
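A hedged variant: fromrecords also takes a names argument, so you get the field labels from the question instead of the default f0/f1 (on Python 3 with recent numpy the string field comes out as unicode rather than '|S1'):
import numpy as np

tab = [('a', 1), ('b', 2)]
rec = np.rec.fromrecords(tab, names=['name', 'value'])
print(rec['name'])    # ['a' 'b']
print(rec['value'])   # [1 2]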
