Optimal generation of numpy associative array with subarrays and variable length - python

I have Python-generated data of the form:
fa    fb    fc
fa1   fb1   [fc11, fc12, ..., fc1m]
fa2   fb2   [fc21, fc22, ..., fc2m]
...   ...   ...
fan   fbn   [fcn1, fcn2, ..., fcnm]
I need to create a Python-compatible data structure to store it, maximizing ease of creation and minimizing memory usage and read/write time. I need to be able to identify columns via field names (i.e. retrieve fa1 with something like data['fa'][0]). fa values are ints, and fb and fc values are floats. Neither m nor n is known before runtime, but both are known before data is inserted into the data structure, and they do not change. m will not exceed 1000, and n won't exceed 10000. Data is generated one row at a time.
Until now, I've used a numpy associative array, asar, of dtype=[('f0','i2'), ('f1','f8'), ('f2', 'f8', (m))]. However, since I can't just add a new row to a numpy array without deleting and recreating it each time a row is added, I've been using a separate counting variable ind_n: creating asar with asar = numpy.zeros(n, dtype=dtype), overwriting asar[ind_n]'s zeros with the data to be added, then incrementing ind_n until it reaches n. This works, but it seems like there must be a better solution (or at least one that allows me to eliminate ind_n). Is there a standard way to create the skeleton of asar (perhaps with something like np.zeros()), then insert each line of data into the first still-zero row? Or a way to convert a standard Python nested list to an associative array, once the nested list has been completely generated? (I know this conversion can definitely be done, but I run into issues (e.g. ValueError: setting an array element with a sequence.) with the subarray when I attempt it.)
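A minimal sketch of the pattern described above (the field names, row generator, and fill values are placeholders):
import numpy as np

n, m = 4, 3  # both known before any data is inserted
dt = np.dtype([('fa', 'i2'), ('fb', 'f8'), ('fc', 'f8', (m,))])

asar = np.zeros(n, dtype=dt)   # preallocated skeleton, all zeros
ind_n = 0                      # separate row counter

def make_row(i):
    # stand-in for the real row generator
    return (i, float(i), [0.1 * i] * m)

while ind_n < n:
    asar[ind_n] = make_row(ind_n)   # overwrite one zeroed row
    ind_n += 1

print(asar['fa'][0])   # columns are addressable by field name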

In [39]: n, m = 5, 3
In [41]: dt=np.dtype([('f0','i2'), ('f1','f8'), ('f2', 'f8', (m))])
In [45]: asar = np.zeros(n, dt)
In [46]: asar
Out[46]:
array([(0, 0., [0., 0., 0.]), (0, 0., [0., 0., 0.]),
(0, 0., [0., 0., 0.]), (0, 0., [0., 0., 0.]),
(0, 0., [0., 0., 0.])],
dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])
Filling by field:
In [49]: asar['f0'] = np.arange(5)
In [50]: asar['f1'] = np.random.rand(5)
In [51]: asar['f2'] = np.random.rand(5,3)
In [52]: asar
Out[52]:
array([(0, 0.45120412, [0.86481761, 0.08861093, 0.42212446]),
(1, 0.63926708, [0.43788684, 0.89254029, 0.90637292]),
(2, 0.33844457, [0.80352251, 0.25411018, 0.315124 ]),
(3, 0.24271258, [0.27849709, 0.9905879 , 0.94155558]),
(4, 0.89239324, [0.1580938 , 0.52844036, 0.59092695])],
dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])
Generating a list with matching nesting:
In [53]: alist = [(i,i,[10]*3) for i in range(5)]
In [54]: np.array(alist, dt)
Out[54]:
array([(0, 0., [10., 10., 10.]), (1, 1., [10., 10., 10.]),
(2, 2., [10., 10., 10.]), (3, 3., [10., 10., 10.]),
(4, 4., [10., 10., 10.])],
dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])
Obviously you could do:
for i, row in enumerate(alist):
    asar[i] = row
enumerate is a nice idiomatic way of generating an index along with a value. But then so is range(n).
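If the rows really do arrive one at a time, another option is to accumulate plain tuples and convert once at the end. A sketch (the row values are placeholders); note that each record must be a tuple, not a list, and the structured dtype must be passed explicitly - passing lists instead of tuples, or omitting the dtype, is a common cause of the "setting an array element with a sequence" error mentioned in the question:
import numpy as np

n, m = 5, 3
dt = np.dtype([('f0', 'i2'), ('f1', 'f8'), ('f2', 'f8', (m,))])

rows = []
for i in range(n):
    fc = [0.1 * i] * m              # placeholder subarray of length m
    rows.append((i, float(i), fc))  # each record is a tuple; only the subarray is a list

asar = np.array(rows, dtype=dt)     # single conversion after all rows exist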

If you know n at the time you create the first record, your solution is essentially correct.
You can use np.empty instead of np.zeros, saving a bit (but not much) time.
If you feel bad about ind_n, you can create an array iterator instead.
>>> m = 5
>>> n = 7
>>> dt = [('col1', 'i2'), ('col2', float), ('col3', float, (m,))]
>>> data = [(np.random.randint(10), np.random.random(), np.random.random((m,))) for _ in range(n)]
>>>
>>> rec = np.empty((n,), dt)
>>> irec = np.nditer(rec, op_flags=[['readwrite']], flags=['c_index'])
>>>
>>> for src in data:
... # roughly equivalent to list.append:
... next(irec)[()] = src
... print()
... # getting the currently valid part:
... print(irec.operands[0][:irec.index+1])
...
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
(1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
(1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])
(0, 0.45797521, [0.79193395, 0.69029592, 0.0541346 , 0.49603146, 0.36146384])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
(1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])
(0, 0.45797521, [0.79193395, 0.69029592, 0.0541346 , 0.49603146, 0.36146384])
(6, 0.85225039, [0.62028917, 0.4895316 , 0.00922578, 0.66836154, 0.53082779])]

Related

Strange behavior of numpy astype for record

I am using numpy's .astype() method to convert data types; however, it gives a strange result. Consider the following code:
import pandas as pd
import numpy as np
import sys
df = pd.DataFrame([[0.1, 2, 'a']], columns=["a1", "a2", "str"])
arr = df.to_records(index=False)
dtype1 = [('a1', np.float32), ('a2', np.int32), ('str', '|S2')]
dtype2 = [('a2', np.int32), ('a1', np.float32), ('str', '|S2')]
arr1 = arr.astype(dtype1)
arr2 = arr.astype(dtype2)
print(arr1)
print(arr2)
print(arr)
print(sys.version)
print(np.__version__)
print(pd.__version__)
I have tested it on different Python versions, and it gives different results. The newer version gives me an unexpected result:
[(0.1, 2, b'a')]
[(0, 2., b'a')]
[(0.1, 2, 'a')]
3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
1.15.0
0.23.4
While the older version gives the correct result:
[(0.10000000149011612, 2, 'a') (0.10000000149011612, 2, 'b')]
[(2, 0.10000000149011612, 'a') (2, 0.10000000149011612, 'b')]
[(0.1, 2L, 'a') (0.1, 2L, 'b')]
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)]
1.11.1
0.20.3
Can someone tell me what is going on?
https://docs.scipy.org/doc/numpy/user/basics.rec.html#assignment-from-other-structured-arrays
says that assignment from other structured arrays is by position, not by field name. I think that applies to astype as well. If so, it means you can't reorder fields with an astype.
Accessing multiple fields at once has changed in recent releases, and may change more. Part of it is whether such access should be a copy or a view.
recfunctions has code for adding, deleting, or merging fields. A common strategy is to create a target array with the new dtype and copy values to it by field name. This is iterative, but since an array typically has many more records than fields, the time penalty isn't big.
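A sketch of that copy-by-field-name strategy (the dtypes here are small examples, not the ones from the question):
import numpy as np

src_dt = np.dtype([('a', 'f8'), ('b', 'i8'), ('c', 'U3')])
dst_dt = np.dtype([('b', 'i8'), ('a', 'f8'), ('c', 'U3')])  # same fields, new order

src = np.array([(1.0, 2, 'a'), (3.0, 4, 'b')], dtype=src_dt)

# create the target with the new dtype, then copy field by field;
# matching is by name here, so the field order can differ freely
dst = np.empty(src.shape, dtype=dst_dt)
for name in dst_dt.names:
    dst[name] = src[name]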
In version 1.14, I can do:
In [152]: dt1 = np.dtype([('a',float),('b',int), ('c','U3')])
In [153]: dt2 = np.dtype([('b',int),('a',float), ('c','S3')])
In [154]: arr1 = np.array([(1,2,'a'),(3,4,'b'),(5,6,'c')], dt1)
In [155]: arr1
Out[155]:
array([(1., 2, 'a'), (3., 4, 'b'), (5., 6, 'c')],
dtype=[('a', '<f8'), ('b', '<i8'), ('c', '<U3')])
Simply using astype does not reorder the fields:
In [156]: arr1.astype(dt2)
Out[156]:
array([(1, 2., b'a'), (3, 4., b'b'), (5, 6., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
but multifield indexing does:
In [157]: arr1[['b','a','c']]
Out[157]:
array([(2, 1., 'a'), (4, 3., 'b'), (6, 5., 'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', '<U3')])
now the dt2 astype is right:
In [158]: arr2 = arr1[['b','a','c']].astype(dt2)
In [159]: arr2
Out[159]:
array([(2, 1., b'a'), (4, 3., b'b'), (6, 5., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
In [160]: arr1['a']
Out[160]: array([1., 3., 5.])
In [161]: arr2['a']
Out[161]: array([1., 3., 5.])
This is 1.14; you are using 1.15, and the docs mention differences in 1.16. So this is a moving target.
The astype is behaving the same as assignment to a 'blank' array:
In [162]: arr2 = np.zeros(arr1.shape, dt2)
In [163]: arr2
Out[163]:
array([(0, 0., b''), (0, 0., b''), (0, 0., b'')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
In [164]: arr2[:] = arr1
In [165]: arr2
Out[165]:
array([(1, 2., b'a'), (3, 4., b'b'), (5, 6., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])
In [166]: arr2[:] = arr1[['b','a','c']]
In [167]: arr2
Out[167]:
array([(2, 1., b'a'), (4, 3., b'b'), (6, 5., b'c')],
dtype=[('b', '<i8'), ('a', '<f8'), ('c', 'S3')])

genfromtxt only imports first column, after changing dtype

I am importing huge data sets with various types of data, using genfromtxt.
My original code worked fine (ucols is the list of columns I want to load):
data = np.genfromtxt(fname,comments = '#', skip_header=1, usecols=(ucols))
Some of my values are strings, so to avoid getting entries of NaN I tried setting dtype = None :
data = np.genfromtxt(fname, dtype = None,comments = '#', skip_header=1, usecols=(ucols))
Now, for some reason, I only get one column of data, i.e. the first column. Can someone explain what I am doing wrong?
EDIT: I now understand I am supposed to obtain a 1D structured array that can be referenced to get a whole row of values. However, I wish to have my data as a regular numpy array. Is it possible to use genfromtxt with dtype = None and still obtain a regular numpy array instead of a structured array? Alternatively, is there a quick way to convert between the two? The second option is only acceptable if it is quick and efficient, since I usually move much larger data sets than this current instance.
Make a structured array and write it to csv:
In [131]: arr=np.ones((3,), dtype='i,f,U10,i,f')
In [132]: arr['f2']=['a','bc','def']
In [133]: arr
Out[133]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10'), ('f3', '<i4'), ('f4', '<f4')])
In [134]: np.savetxt('test',arr,fmt='%d,%e,%s,%d,%f')
In [135]: cat test
1,1.000000e+00,a,1,1.000000
1,1.000000e+00,bc,1,1.000000
1,1.000000e+00,def,1,1.000000
load all columns with dtype=None:
In [137]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None)
Out[137]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3'), ('f3', '<i8'), ('f4', '<f8')])
load a subset of the columns:
In [138]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None,usecols=
...: (1,2,4))
Out[138]:
array([(1., 'a', 1.), (1., 'bc', 1.), (1., 'def', 1.)],
dtype=[('f0', '<f8'), ('f1', '<U3'), ('f2', '<f8')])
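For the conversion asked about in the edit, newer numpy versions (1.16+) provide numpy.lib.recfunctions.structured_to_unstructured, which flattens a structured array with compatible numeric field types into a plain 2D array. A sketch (the structured array here is just an example):
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.array([(1, 2.0, 3.0), (4, 5.0, 6.0)],
               dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

plain = rfn.structured_to_unstructured(arr)  # one column per field
print(plain.shape)   # (2, 3)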

Indexing array using column names

I'm loading pretty large input files into a Numpy array (30 columns, over 10k rows). The data contains only floating point numbers. To simplify data processing I'd like to name the columns and access them using human-readable names. AFAIK that's only possible with structured/record arrays. However, if I'm right, when I use structured arrays I'll lose some information. For instance:
x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')
y = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype=[('x', float), ('y', float)])
Both arrays contains the same data and the same dtype. y can be accessed using column names:
In [155]: y['x']
Out[155]: array([ 1., 3., 11.])
Unfortunately, I lose (or do I get the wrong impression?) some essential properties when I use structured arrays. x and y have different shapes, y cannot be transposed, etc.
In [160]: x
Out[160]:
array([[ 1., 2.],
[ 3., 4.],
[11., 22.]])
In [161]: y
Out[161]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
In [162]: x.shape
Out[162]: (3, 2)
In [163]: y.shape
Out[163]: (3,)
In [164]: x.T
Out[164]:
array([[ 1., 3., 11.],
[ 2., 4., 22.]])
In [165]: y.T
Out[165]:
array([( 1., 2.), ( 3., 4.), (11., 22.)],
dtype=[('x', '<f8'), ('y', '<f8')])
Is it possible to continue using "regular 2D Numpy arrays" and access columns using their names?
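One common workaround, sketched here purely as an illustration (it is not from the original thread): keep the regular 2D float array and maintain a small name-to-column-index mapping.
import numpy as np

x = np.array([(1.0, 2), (3.0, 4), (11, 22)], dtype='float64')

cols = {'x': 0, 'y': 1}          # hypothetical name -> column-index mapping

def col(arr, name):
    # return the named column of a regular 2D array
    return arr[:, cols[name]]

print(col(x, 'x'))   # [ 1.  3. 11.]
print(x.T.shape)     # transpose and shape behave as usual: (2, 3)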

Python Numpy DTYPE dynamic array

I have code like this:
np.fromfile( f,
dtype = np.dtype( [ ( 'f1', np.float16 ),
( 'f2', np.float16 )
]
),
count = -1
)
and I need to make the dtype depend on a variable (instead of a fixed size of 2 fields, make the size var).
I tried to Google it and asked on IRC, but no luck. Thanks in advance!
If the individual dtypes are the same, and you aren't picky about names, this dtype format is easy to use and generalize:
In [358]: dt = np.dtype('f,f,f')
In [359]: dt
Out[359]: dtype([('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])
Or with the common string join:
In [360]: dt = np.dtype(','.join(['f']*3))
In [361]: np.ones((3,), dtype=dt)
Out[361]:
array([( 1., 1., 1.), ( 1., 1., 1.), ( 1., 1., 1.)],
dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])
Or with a list comprehension over zip:
In [364]: dt = [('foo%s'%n, fmt) for n,fmt in zip(range(3),'ifd')]
In [365]: dt
Out[365]: [('foo0', 'i'), ('foo1', 'f'), ('foo2', 'd')]
In [366]: np.ones(2, dtype=dt)
Out[366]:
array([(1, 1., 1.), (1, 1., 1.)],
dtype=[('foo0', '<i4'), ('foo1', '<f4'), ('foo2', '<f8')])
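Applied to the np.fromfile call from the question, the join form makes the field count depend on a variable. A sketch assuming var holds the desired number of float16 fields (the file handle f is whatever the question already has):
import numpy as np

var = 5                                  # number of float16 fields, chosen at runtime
dt = np.dtype(','.join(['f2'] * var))    # 'f2' is the format code for np.float16

# data = np.fromfile(f, dtype=dt, count=-1)   # same call as in the question
print(dt)                                # fields get default names f0, f1, ...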

Filter numpy array of tuples

The scikit-learn library has a brilliant example of data clustering - stock market structure. It works fine with US stocks. But when one adds tickers from other markets, a numpy error appears saying the arrays should have the same size - which is true; for example, German stocks have a different trading calendar.
OK, after downloading the quotes I add a step that prepares the shared dates:
quotes = [quotes_historical_yahoo_ochl(symbol, d1, d2, asobject=True)
          for symbol in symbols]

def intersect(list_1, list_2):
    return list(set(list_1) & set(list_2))

dates_all = quotes[0].date
for q in quotes:
    dates_symbol = q.date
    dates_all = intersect(dates_all, dates_symbol)
Then I'm stuck filtering the numpy array of tuples. Here are some attempts:
# for index, q in enumerate(quotes):
# filtered = [i for i in q if i.date in dates_all]
# quotes[index] = np.rec.array(filtered, dtype=q.dtype)
# quotes[index] = np.asanyarray(filtered, dtype=q.dtype)
#
# quotes[index] = np.where(a.date in dates_all for a in q)
#
# quotes[index] = np.where(q[0].date in dates_all)
How do I apply a filter to a numpy array, or how do I truly convert the list of records (after filtering) back to a numpy recarray?
quotes[0].dtype:
'(numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'), ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])'
quotes[0].shape:
<class 'tuple'>: (261,)
So quotes is a list of recarrays, and in dates_all you collect the intersection of all the values in the date field.
I can recreate one such array with:
In [286]: dt=np.dtype([('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'),
     ...:              ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'),
     ...:              ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])
In [287]: arr=np.ones((2,), dtype=dt) # 2 element structured array
In [288]: arr
Out[288]:
array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ... ('aclose', '<f8')])
In [289]: type(arr[0])
Out[289]: numpy.void
Turn that into a recarray (I don't use those as much as plain structured arrays):
In [291]: np.rec.array(arr)
Out[291]:
rec.array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), .... ('aclose', '<f8')])
The dtype of the recarray displays slightly differently:
In [292]: _.dtype
Out[292]: dtype((numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ....('aclose', '<f8')]))
In [293]: __.date
Out[293]: array([1, 1], dtype=object)
In any case the date field is an array of objects, possibly datetime objects.
q is one of these arrays; i is an element, and i.date is the date field.
[i for i in q if i.date in dates_all]
So filtered is a list of recarray elements. np.stack does a better job of reassembling them into an array (and that works with the recarray too).
np.stack([i for i in arr if i['date'] in alist])
Or you could collect the indices of the matching records and index the quotes array:
In [319]: [i for i,v in enumerate(arr) if v['date'] in alist]
Out[319]: [0, 1]
In [320]: arr[_]
Or pull out the date field first:
In [321]: [i for i,v in enumerate(arr['date']) if v in alist]
Out[321]: [0, 1]
np.in1d might also work for the search:
In [322]: np.in1d(arr['date'],alist)
Out[322]: array([ True, True], dtype=bool)
In [323]: np.where(np.in1d(arr['date'],alist))
Out[323]: (array([0, 1], dtype=int32),)
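Putting this together, a small self-contained sketch of the in1d approach (the array and dates here are made-up stand-ins for one element of quotes and for dates_all):
import numpy as np

dt = np.dtype([('date', 'O'), ('close', '<f8')])
q = np.array([('2017-01-02', 1.0), ('2017-01-03', 2.0), ('2017-01-04', 3.0)], dtype=dt)
dates_all = ['2017-01-02', '2017-01-04']        # shared dates from the intersection step

q_filtered = q[np.in1d(q['date'], dates_all)]   # boolean mask keeps the array type
print(q_filtered)
Applied back to the question, that becomes quotes[index] = q[np.in1d(q.date, dates_all)] inside the enumerate loop; boolean indexing preserves the recarray, so no reconversion is needed.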

Categories