I have a numpy float array and an int array of the same length. I would like to concatenate them such that the output has the composite dtype (float, int). column_stacking them together just yields a float64 array:
import numpy
a = numpy.random.rand(5)
b = numpy.random.randint(0, 100, 5)
ab = numpy.column_stack([a, b])
print(ab.dtype)
float64
Any hints?
Create a 'blank' array:
In [391]: dt = np.dtype('f,i')
In [392]: arr = np.zeros(5, dtype=dt)
In [393]: arr
Out[393]:
array([(0., 0), (0., 0), (0., 0), (0., 0), (0., 0)],
dtype=[('f0', '<f4'), ('f1', '<i4')])
fill it:
In [394]: arr['f0']=np.random.rand(5)
In [396]: arr['f1']=np.random.randint(0,100,5)
In [397]: arr
Out[397]:
array([(0.40140057, 75), (0.93731374, 99), (0.6226782 , 48),
(0.01068745, 68), (0.19197434, 53)],
dtype=[('f0', '<f4'), ('f1', '<i4')])
There are functions in numpy.lib.recfunctions that can be used as well, but it's good to know (and understand) this basic approach.
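For example, a minimal sketch using merge_arrays from numpy.lib.recfunctions (usemask=False is a choice here, and f0/f1 are the default field names it generates):

import numpy as np
from numpy.lib import recfunctions as rfn

a = np.random.rand(5)                # float64
b = np.random.randint(0, 100, 5)     # int (platform-dependent width)

# merge the two arrays into one structured array with one field per input
ab = rfn.merge_arrays((a, b), usemask=False)
print(ab.dtype)                      # [('f0', '<f8'), ('f1', '<i8')]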
I am importing huge data sets with various types of data, using genfromtxt.
My original code worked fine (ucols is the list of columns I want to load):
data = np.genfromtxt(fname,comments = '#', skip_header=1, usecols=(ucols))
Some of my values are strings, so to avoid getting entries of NaN I tried setting dtype = None :
data = np.genfromtxt(fname, dtype = None,comments = '#', skip_header=1, usecols=(ucols))
Now for some reason I only get one column of data, i.e. the first column. Can someone explain what I am doing wrong?
EDIT: I now understand I am supposed to obtain a 1D structured array that can be indexed to get a whole row of values. However, I want my data as a plain numpy array. Is it possible to use genfromtxt with dtype = None and still obtain a regular array instead of a structured array, or alternatively is there a quick way to convert between the two? The second option is only acceptable if it is quick and efficient, since I usually move much larger data sets than in this instance.
Make a structured array and write it to csv:
In [131]: arr=np.ones((3,), dtype='i,f,U10,i,f')
In [132]: arr['f2']=['a','bc','def']
In [133]: arr
Out[133]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10'), ('f3', '<i4'), ('f4', '<f4')])
In [134]: np.savetxt('test',arr,fmt='%d,%e,%s,%d,%f')
In [135]: cat test
1,1.000000e+00,a,1,1.000000
1,1.000000e+00,bc,1,1.000000
1,1.000000e+00,def,1,1.000000
load all columns with dtype=None:
In [137]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None)
Out[137]:
array([(1, 1., 'a', 1, 1.), (1, 1., 'bc', 1, 1.), (1, 1., 'def', 1, 1.)],
dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3'), ('f3', '<i8'), ('f4', '<f8')])
load a subset of the columns:
In [138]: np.genfromtxt('test',delimiter=',',dtype=None,encoding=None,usecols=
...: (1,2,4))
Out[138]:
array([(1., 'a', 1.), (1., 'bc', 1.), (1., 'def', 1.)],
dtype=[('f0', '<f8'), ('f1', '<U3'), ('f2', '<f8')])
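On the edit's question of converting back to a plain array: a minimal sketch using structured_to_unstructured from numpy.lib.recfunctions (available in numpy 1.16+); it assumes the selected fields are all numeric so a common dtype exists:

import numpy as np
from numpy.lib import recfunctions as rfn

# reload only the numeric columns of the file written above
data = np.genfromtxt('test', delimiter=',', dtype=None, encoding=None,
                     usecols=(0, 1, 3, 4))
arr2d = rfn.structured_to_unstructured(data)   # plain (3, 4) float array
print(arr2d.dtype, arr2d.shape)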
I just discovered Numpy structured arrays and I find them to be quite powerful. The natural question arises in my mind: how in the world do I create a Numpy structured scalar? Let me show you what I mean. Let's say I want a structure containing some data:
import numpy as np
dtype = np.dtype([('a', np.float_), ('b', np.int_)])
ar = np.array((0.5, 1), dtype=dtype)
ar['a']
This gives me array(0.5) instead of 0.5. On the other hand, if I do this:
import numpy as np
dtype = np.dtype([('a', np.float_), ('b', np.int_)])
ar = np.array([(0.5, 1)], dtype=dtype)
ar[0]['a']
I get 0.5, just like I want. Which means that ar[0] isn't an array, but a scalar. Is it possible to create a structured scalar in a way more elegant than the one I've described?
Singleton isn't quite the right term, but I get what you want.
arr = np.array((0.5, 1), dtype=dtype)
Creates a 0d, single element array of this dtype. Check its dtype and shape.
arr.item() returns a tuple (0.5, 1). Also test arr[()] and arr.tolist().
np.float64(0.5) creates a float with a numpy wrapper. It is similar to, but not exactly the same as, np.array(0.5). Their methods differ somewhat.
I don't know anything similar with a compound dtype.
In [123]: dt = np.dtype('i,f,U10')
In [124]: dt
Out[124]: dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [125]: arr = np.array((1,2,3),dtype=dt)
In [126]: arr
Out[126]:
array((1, 2., '3'),
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [127]: arr.shape
Out[127]: ()
arr is a 0d, one-element array. It can be indexed with:
In [128]: arr[()]
Out[128]: (1, 2., '3')
In [129]: type(_)
Out[129]: numpy.void
This indexing produces a np.void object. Doing the same thing on a 0d float array would produce a np.float64 object.
But you can't use np.void((1,2,3), dtype=dt) to directly create such an object (in contrast to np.float64(12.34)).
item is the normal way of extracting a 'scalar' from an array. Here it returns a tuple, the same sort of object that we used as input to create arr:
In [131]: arr.item()
Out[131]: (1, 2.0, '3')
In [132]: type(_)
Out[132]: tuple
np.asscalar(arr) returns the same tuple.
One difference between the np.void object and the tuple is that the void can still be indexed with the field name, arr[()]['f0'], whereas the tuple has to be indexed by number, arr.item()[0]. The void still has a dtype, while the tuple doesn't.
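A short sketch of that difference side by side:

import numpy as np

dt = np.dtype('i,f,U10')
arr = np.array((1, 2, 3), dtype=dt)

v = arr[()]       # np.void scalar: keeps the dtype and field access
t = arr.item()    # plain tuple: positional indexing only
print(v['f0'], v.dtype)    # 1 [('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')]
print(t[0], type(t))       # 1 <class 'tuple'>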
fromrecords makes a recarray. This is similar to a structured array, but allows us to access fields as attributes. It may actually be an older class that has been merged into numpy, hence the np.rec prefix. Mostly we use structured arrays, though np.rec still has some convenience functions (many actually live in numpy.lib.recfunctions):
In [133]: res = np.rec.fromrecords((1,2,3), dt)
In [134]: res
Out[134]:
rec.array((1, 2., '3'),
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [135]: res.f0
Out[135]: array(1, dtype=int32)
In [136]: res.item()
Out[136]: (1, 2.0, '3')
In [137]: type(_)
Out[137]: tuple
In [138]: res[()]
Out[138]: (1, 2.0, '3')
In [139]: type(_)
Out[139]: numpy.record
So this produced a np.record instead of a np.void. But that's just a subclass:
In [143]: numpy.record.__mro__
Out[143]: (numpy.record, numpy.void, numpy.flexible, numpy.generic, object)
Accessing a structured array by field name gives an array of the corresponding dtype (and the same shape):
In [145]: arr['f1']
Out[145]: array(2.0, dtype=float32)
In [146]: arr[()]['f1']
Out[146]: 2.0
In [147]: type(_)
Out[147]: numpy.float32
Out[146] could also be created with np.float32(2.0).
Checking my comment about ar[0] for the 1d array:
In [158]: arr1d = np.array([(1,2,3)], dt)
In [159]: arr1d
Out[159]:
array([(1, 2., '3')],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [160]: arr1d[0]
Out[160]: (1, 2., '3')
In [161]: type(_)
Out[161]: numpy.void
So arr[()] and arr1d[0] do the same thing for their respective sized arrays. Likewise arr2d[0,0], which can also be written as arr2d[(0,0)].
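A minimal sketch of the 2d case:

import numpy as np

dt = np.dtype('i,f,U10')
arr2d = np.zeros((2, 2), dtype=dt)
print(type(arr2d[0, 0]))    # <class 'numpy.void'>
print(arr2d[(0, 0)])        # same element, tuple-index form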
Use np.asscalar.
In both of your cases it will be just np.asscalar(ar['a']).
Also, you might find the ndarray.item method useful.
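A short sketch of both on the question's array (np.float64 is spelled out here in place of np.float_, which newer numpy releases deprecate; np.asscalar was likewise deprecated and later removed, so item is the durable spelling):

import numpy as np

dtype = np.dtype([('a', np.float64), ('b', np.int_)])
ar = np.array((0.5, 1), dtype=dtype)

print(ar['a'].item())     # 0.5, a plain Python float
# np.asscalar(ar['a'])    # same result on older numpy versions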
I have a code like this
np.fromfile(f,
            dtype=np.dtype([('f1', np.float16),
                            ('f2', np.float16)]),
            count=-1)
and I need to make the dtype depend on a variable (instead of 2 fields, make it var fields).
I tried to google it and asked on IRC, but no luck. Thanks in advance!
If the individual dtypes are the same, and you aren't picky about names, this dtype format is easy to use and generalize:
In [358]: dt = np.dtype('f,f,f')
In [359]: dt
Out[359]: dtype([('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])
Or with the common string join:
In [360]: dt = np.dtype(','.join(['f']*3))
In [361]: np.ones((3,), dtype=dt)
Out[361]:
array([( 1., 1., 1.), ( 1., 1., 1.), ( 1., 1., 1.)],
dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])
Or with a list comprehension and zip:
In [364]: dt = [('foo%s'%n, fmt) for n,fmt in zip(range(3),'ifd')]
In [365]: dt
Out[365]: [('foo0', 'i'), ('foo1', 'f'), ('foo2', 'd')]
In [366]: np.ones(2, dtype=dt)
Out[366]:
array([(1, 1., 1.), (1, 1., 1.)],
dtype=[('foo0', '<i4'), ('foo1', '<f4'), ('foo2', '<f8')])
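Applying this back to the fromfile question, a minimal sketch (var and the file name are placeholders):

import numpy as np

var = 4                                # desired number of float16 fields
dt = np.dtype(','.join(['f2'] * var))  # 'f2' is the float16 type code

# round-trip through a file to show it works with fromfile
np.zeros(3, dtype=dt).tofile('data.bin')
data = np.fromfile('data.bin', dtype=dt, count=-1)
print(data.dtype)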
The scikit-learn library has a brilliant example of data clustering - the stock market structure example. It works fine with US stocks. But when one adds tickers from other markets, a numpy error appears that the arrays should all have the same size - which is true: German stocks, for example, have a different trading calendar.
OK, after downloading the quotes I prepare the shared dates:
quotes = [quotes_historical_yahoo_ochl(symbol, d1, d2, asobject=True)
          for symbol in symbols]

def intersect(list_1, list_2):
    return list(set(list_1) & set(list_2))

dates_all = quotes[0].date
for q in quotes:
    dates_symbol = q.date
    dates_all = intersect(dates_all, dates_symbol)
Then I'm stuck filtering the numpy array of tuples. Here are some attempts:
# for index, q in enumerate(quotes):
# filtered = [i for i in q if i.date in dates_all]
# quotes[index] = np.rec.array(filtered, dtype=q.dtype)
# quotes[index] = np.asanyarray(filtered, dtype=q.dtype)
#
# quotes[index] = np.where(a.date in dates_all for a in q)
#
# quotes[index] = np.where(q[0].date in dates_all)
How do I apply a filter to a numpy array, or how do I properly convert a list of records (after filtering) back to a numpy recarray?
quotes[0].dtype:
'(numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'), ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])'
quotes[0].shape:
<class 'tuple'>: (261,)
So quotes is a list of recarrays, and in dates_all you collect the intersection of all values in the date field.
I can recreate one such array with:
In [286]: dt = np.dtype([('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'),
     ...:                ('d', '<f8'), ('open', '<f8'), ('close', '<f8'), ('high', '<f8'),
     ...:                ('low', '<f8'), ('volume', '<f8'), ('aclose', '<f8')])
In [287]: arr=np.ones((2,), dtype=dt) # 2 element structured array
In [288]: arr
Out[288]:
array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), ... ('aclose', '<f8')])
In [289]: type(arr[0])
Out[289]: numpy.void
Turn that into a recarray (I don't use those as much as plain structured arrays):
In [291]: np.rec.array(arr)
Out[291]:
rec.array([(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
(1, 1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
dtype=[('date', 'O'), ('year', '<i2'), ('month', 'i1'), ('day', 'i1'), .... ('aclose', '<f8')])
dtype of the recarray displays slightly different:
In [292]: _.dtype
Out[292]: dtype((numpy.record, [('date', 'O'), ('year', '<i2'), ('month', 'i1'), ....('aclose', '<f8')]))
In [293]: __.date
Out[293]: array([1, 1], dtype=object)
In any case the date field is an array of objects, possibly of datetime?
q is one of these arrays; i is an element, and i.date is the date field.
[i for i in q if i.date in dates_all]
So filtered is a list of recarray elements. np.stack does a better job of reassembling them into an array (it works with the recarray too).
np.stack([i for i in arr if i['date'] in alist])
Or you could collect the indices of the matching records, and index the quotes array:
In [319]: [i for i,v in enumerate(arr) if v['date'] in alist]
Out[319]: [0, 1]
In [320]: arr[_]
or pull out the date field first:
In [321]: [i for i,v in enumerate(arr['date']) if v in alist]
Out[321]: [0, 1]
np.in1d might also work for the search:
In [322]: np.in1d(arr['date'],alist)
Out[322]: array([ True, True], dtype=bool)
In [323]: np.where(np.in1d(arr['date'],alist))
Out[323]: (array([0, 1], dtype=int32),)
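Putting it together for the original problem, a sketch on a tiny hand-made recarray (the dtype is trimmed to two fields for brevity):

import numpy as np

dt = np.dtype([('date', 'O'), ('close', '<f8')])
q = np.rec.array([(1, 10.0), (2, 11.0), (3, 12.0)], dtype=dt)
dates_all = [1, 3]

mask = np.in1d(q.date, dates_all)   # boolean mask over the date field
print(q[mask])                      # recarray containing only shared dates

Applied to the question, that becomes quotes = [q[np.in1d(q.date, dates_all)] for q in quotes]; boolean indexing preserves the recarray type.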