Using np.view() with changes to structured arrays in numpy 1.14

I have a numpy structured array with a mixed dtype (i.e., floats, ints, and strings). I want to select some of the columns of the array (all of which contain only floats) and then get the sum, by column, of the rows, as a standard numpy array. The initial array takes a form comparable to:
some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
                     dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
For this example, I'd like to take the sum of columns A and B, yielding np.array([7.5, 11.15]). With numpy ≤1.13, I could do that as follows:
get_cols = ['A', 'B']
desired_sum = np.sum(some_data[get_cols].view(('<f8', len(get_cols))), axis=0)
With the release of numpy 1.14, this method now fails with ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged, which is a result of the changes made in numpy 1.14 to the handling of structured arrays. (User bbengfort commented about the FutureWarning given about this change in this answer.)
In light of these changes to structured arrays, how can I obtain the desired sum from the structured array subset?

In [165]: some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)], dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
...:
In [166]: get_cols = ['A','B']
In [167]: some_data[get_cols]
Out[167]:
array([(3.5, 2.15), (2.8, 5.3), (1.2, 3.7)],
      dtype=[('A', '<f8'), ('B', '<f8')])
Simply reading the field values is fine. In 1.13 we get a warning:
In [168]: some_data[get_cols].view(('<f8', len(get_cols)))
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.
This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
#!/usr/bin/python3
Out[168]:
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])
With the recommended copy, no warning:
In [169]: some_data[get_cols].copy().view(('<f8', len(get_cols)))
Out[169]:
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])
In [171]: np.sum(_, axis=0)
Out[171]: array([ 7.5 , 11.15])
In your original array,
dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
an A,B selection would have the two f8 items interspersed with the U20 items. Changing the view dtype of such a mix is problematic. That's why working with a copy is more reliable.
Since U20 takes up 4*20 = 80 bytes, the total itemsize is 80 + 2*8 = 96, a multiple of 8. We can convert the whole thing to f8, reshape, and throw away the U20 columns:
In [183]: some_data.view('f8').reshape(3,-1)[:,-2:]
Out[183]:
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])
It's not very pretty and I don't recommend it, but it may give some insight into how structured data is arranged.
view on a structured array is useful at times, but often a bit tricky to use correctly.
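For completeness, newer numpy (1.16+) added numpy.lib.recfunctions.structured_to_unstructured for exactly this multi-field-to-plain conversion; a minimal sketch, assuming numpy >= 1.16:
from numpy.lib import recfunctions as rfn

# copy the selected float fields into a plain 2d array, then sum by column
plain = rfn.structured_to_unstructured(some_data[get_cols])
plain.sum(axis=0)           # -> array([ 7.5 , 11.15])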
If the 2 numeric fields are usually used together, I'd recommend a compound dtype like:
In [184]: some_data = np.array([('foo', [3.5, 2.15]), ('bar', [2.8, 5.3]),
     ...:                       ('baz', [1.2, 3.7])],
     ...:                      dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [185]: some_data
Out[185]:
array([('foo', [3.5, 2.15]), ('bar', [2.8, 5.3]), ('baz', [1.2, 3.7])],
      dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [186]: some_data['AB']
Out[186]:
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])
genfromtxt accepts this style of dtype.
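For instance, a minimal sketch (hypothetical whitespace-delimited text) reading straight into this compound dtype:
from io import StringIO

txt = StringIO("foo 3.5 2.15\nbar 2.8 5.3\nbaz 1.2 3.7")
data = np.genfromtxt(txt, dtype=[('col1', 'U20'), ('AB', 'f8', (2,))])
data['AB'].sum(axis=0)      # -> array([ 7.5 , 11.15])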

How do I convert a pandas dataframe into a NumPy array?
DataFrame:
import numpy as np
import pandas as pd
index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')
gives
label    A    B    C
ID
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN
I would like to convert this to a NumPy array, like so:
array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])
Also, is it possible to preserve the dtypes, like this?
array([(1, nan, 0.2, nan),
       (2, nan, nan, 0.5),
       (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan),
       (5, 0.1, 0.2, 0.5),
       (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Use df.to_numpy()
It's better than df.values, here's why.*
It's time to deprecate your usage of values and as_matrix().
pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:
to_numpy(), which is defined on Index, Series, and DataFrame objects, and
array, which is defined on Index and Series objects only.
If you visit the v0.24 docs for .values, you will see a big red warning that says:
Warning: We recommend using DataFrame.to_numpy() instead.
See this section of the v0.24.0 release notes, and this answer for more information.
* - to_numpy() is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in jupyter or the terminal, using .values to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.
Towards Better Consistency: to_numpy()
In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.
# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['a', 'b', 'c'])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])
# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
#        [2, 8],
#        [3, 9]])
As mentioned above, this method is also defined on Index and Series objects (see here).
df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)
df['A'].to_numpy()
# array([1, 2, 3])
By default, a view is returned where possible (for example, when the whole frame shares one dtype), so modifications to the result can affect the original.
v = df.to_numpy()
v[0, 0] = -1
df
   A  B  C
a -1  4  7
b  2  5  8
c  3  6  9
If you need a copy instead, use to_numpy(copy=True).
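A quick sketch of the difference:
# with copy=True the result owns its data, so writes don't touch df
v = df.to_numpy(copy=True)
v[0, 0] = -100
df                          # unchanged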
pandas >= 1.0 update for ExtensionTypes
If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.
a = pd.array([1, 2, None], dtype="Int64")
a
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object) # yuck, objects
# Correct
a.to_numpy(dtype='float', na_value=np.nan)
# array([ 1., 2., nan])
# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1, 2, -1])
This is called out in the docs.
If you need the dtypes in the result...
As shown in another answer, DataFrame.to_records is a good way to do this.
df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:
v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
Performance-wise, it's nearly the same (actually, using rec.fromrecords is a bit faster).
df2 = pd.concat([df] * 10000)
%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rationale for Adding a New Method
to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. [...]
to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
Critique of Other Solutions
DataFrame.values has inconsistent behaviour, as already noted.
DataFrame.get_values() was quietly removed in v1.0 and was previously deprecated in v0.25. Before that, it was simply a wrapper around DataFrame.values, so everything said above applies.
DataFrame.as_matrix() was removed in v1.0 and was previously deprecated in v0.23. Do NOT use!
To convert a pandas dataframe (df) to a numpy ndarray, use this code:
df.values
array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])
Note: The .as_matrix() method used in this answer is deprecated. Pandas 0.23.4 warns:
Method .as_matrix will be removed in a future version. Use .values instead.
Pandas has something built in...
numpy_matrix = df.as_matrix()
gives
array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])
I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the Numpy representation of the dataframe, including the index:
In [8]: df
Out[8]:
          A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333
[8 rows x 3 columns]
In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
       [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
       [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
       [ 3.        , -0.16636303, -0.95775849,  1.17865945],
       [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
       [ 5.        , -0.34016865, -0.29369841,  1.23179064],
       [ 6.        , -1.06282542,  0.55627285,  1.50805754],
       [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])
To get the dtypes we'd need to transform this ndarray into a structured array using view:
In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([(0, -0.98272574,  0.150726  ,  0.69162512),
       (1,  0.61729734, -0.47187926,  0.50554728),
       (2,  0.4171228 , -1.35680324, -1.01349922),
       (3, -0.16636303, -0.95775849,  1.17865945),
       (4, -0.16410334,  0.0745164 , -0.67432474),
       (5, -0.34016865, -0.29369841,  1.23179064),
       (6, -1.06282542,  0.55627285,  1.50805754),
       (7,  0.95961001,  0.24753911,  0.09133339)],
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
You can use the to_records method, but you have to play around a bit with the dtypes if they are not what you want from the get-go. In my case, having copied your DF from a string, the index type is string (represented by an object dtype in pandas):
In [102]: df
Out[102]:
label    A    B    C
ID
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN
In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Converting the recarray dtype does not work for me, but one can do this in Pandas already:
In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Note that Pandas does not set the name of the index properly (to ID) in the exported record array (a bug?), so we take advantage of the type conversion to correct that as well.
At the moment Pandas has only 8-byte integers, i8, and floats, f8 (see this issue).
It seems like df.to_records() will work for you. The exact feature you're looking for was requested and to_records pointed to as an alternative.
I tried this out locally using your example, and that call yields something very similar to the output you were looking for:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])
Note that this is a recarray rather than an array. You could move the result into a regular numpy array by calling its constructor as np.array(df.to_records()).
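For example (a small sketch):
arr = np.array(df.to_records())
type(arr)                   # numpy.ndarray, no longer numpy.rec.recarray
arr['A']                    # field access by name still works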
Try this:
import numpy
a = numpy.asarray(df)
Here is my approach to making a structure array from a pandas DataFrame.
Create the data frame
import pandas as pd
import numpy as np
import six
NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)
     A    B    C
ID
1  NaN  0.2  NaN
2  NaN  NaN  0.5
3  NaN  0.2  0.5
4  0.1  0.2  NaN
5  0.1  0.2  0.5
6  0.1  NaN  0.5
7  0.1  NaN  NaN
Define a function to make a numpy structured array (not a record array) from a pandas DataFrame.
def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_records())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structured array.
sa = df_to_sarray(df.reset_index())
sa
array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)],
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
EDIT: Updated df_to_sarray to avoid an error when calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.
A Simpler Way for Example DataFrame:
df
         gbm       nnet        reg
0  12.097439  12.047437  12.100953
1  12.109811  12.070209  12.095288
2  11.720734  11.622139  11.740523
3  11.824557  11.926414  11.926527
4  11.800868  11.727730  11.729737
5  12.490984  12.502440  12.530894
USE:
np.array(df.to_records().view(type=np.matrix))
GET:
array([[(0, 12.097439  , 12.047437, 12.10095324),
        (1, 12.10981081, 12.070209, 12.09528824),
        (2, 11.72073428, 11.622139, 11.74052253),
        (3, 11.82455653, 11.926414, 11.92652727),
        (4, 11.80086775, 11.72773 , 11.72973699),
        (5, 12.49098389, 12.50244 , 12.53089367)]],
      dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
                            ('reg', '<f8')]))
Two ways to convert a DataFrame to its NumPy-array representation:
mah_np_array = df.as_matrix(columns=None)
mah_np_array = df.values
Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html
I went through the answers above. The as_matrix() method works, but it is obsolete now. For me, what worked was .to_numpy().
This returns a multidimensional array. I prefer this method when reading data from an Excel sheet and needing to access the data by index. Hope this helps :)
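A minimal sketch of that pattern (the file name here is hypothetical):
import pandas as pd

arr = pd.read_excel('data.xlsx').to_numpy()   # 2-D ndarray
value = arr[2, 1]                             # third row, second column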
Just had a similar problem when exporting from dataframe to arcgis table and stumbled on a solution from usgs (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table).
In short your problem has a similar solution:
df
     A    B    C
ID
1  NaN  0.2  NaN
2  NaN  NaN  0.5
3  NaN  0.2  0.5
4  0.1  0.2  NaN
5  0.1  0.2  0.5
6  0.1  NaN  0.5
7  0.1  NaN  NaN
np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
np_data.dtype.names = tuple([name.encode('UTF8') for name in np_names])
np_data
array([(nan, 0.2, nan), (nan, nan, 0.5), (nan, 0.2, 0.5),
       (0.1, 0.2, nan), (0.1, 0.2, 0.5), (0.1, nan, 0.5),
       (0.1, nan, nan)],
      dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))
A simple way to convert a dataframe to a numpy array:
import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
array([[1, 3],
       [2, 4]])
Use of to_numpy is encouraged to preserve consistency.
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
Try this:
np.array(df)
array([['ID', nan, nan, nan],
       ['1', nan, 0.2, nan],
       ['2', nan, nan, 0.5],
       ['3', nan, 0.2, 0.5],
       ['4', 0.1, 0.2, nan],
       ['5', 0.1, 0.2, 0.5],
       ['6', 0.1, nan, 0.5],
       ['7', 0.1, nan, nan]], dtype=object)
Some more information at: https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
Valid for numpy 1.16.5 and pandas 0.25.2.
Further to meteore's answer, I found the code
df.index = df.index.astype('i8')
doesn't work for me. So I put my code here for the convenience of others stuck with this issue.
city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string; when converted to a numpy array it becomes an object
city_cluster_arr = city_cluster_df[['city_en', 'lat', 'lon', 'cluster', 'cluster_filtered']].to_records()
descr = city_cluster_arr.dtype.descr
# change the field 'city_en' to a string type (its index here is 1 because
# the dataframe's row index comes before it)
descr[1] = (descr[1][0], "S20")
newArr = city_cluster_arr.astype(np.dtype(descr))

Creating a Numpy structure scalar instead of array

I just discovered Numpy structured arrays and I find them to be quite powerful. The natural question arises in my mind: how in the world do I create a Numpy structure scalar? Let me show you what I mean. Let's say I want a structure containing some data:
import numpy as np
dtype = np.dtype([('a', np.float_), ('b', np.int_)])
ar = np.array((0.5, 1), dtype=dtype)
ar['a']
This gives me array(0.5) instead of 0.5. On the other hand, if I do this:
import numpy as np
dtype = np.dtype([('a', np.float_), ('b', np.int_)])
ar = np.array([(0.5, 1)], dtype=dtype)
ar[0]['a']
I get 0.5, just like I want. Which means that ar[0] isn't an array, but a scalar. Is it possible to create a structured scalar in a way more elegant than the one I've described?
Singleton isn't quite the right term, but I get what you want.
arr = np.array((0.5, 1), dtype=dtype)
creates a 0d, single-element array of this dtype. Check its dtype and shape.
arr.item() returns a tuple (0.5, 1). Also test arr[()] and arr.tolist().
np.float64(0.5) creates a float with a numpy wrapper. It is similar to, but not exactly the same as, np.array(0.5). Their methods differ somewhat.
I don't know anything similar with a compound dtype.
In [123]: dt = np.dtype('i,f,U10')
In [124]: dt
Out[124]: dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [125]: arr = np.array((1,2,3),dtype=dt)
In [126]: arr
Out[126]:
array((1, 2., '3'),
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [127]: arr.shape
Out[127]: ()
arr is a 0d, one-element array. It can be indexed with:
In [128]: arr[()]
Out[128]: (1, 2., '3')
In [129]: type(_)
Out[129]: numpy.void
This indexing produces a np.void object. Doing the same thing on a 0d float array would produce a np.float64 object.
But you can't use np.void((1,2,3), dtype=dt) to directly create such an object (in contrast to np.float(12.34)).
item is the normal way of extracting a 'scalar' from an array. Here it returns a tuple, the same sort of object that we used as input to create arr:
In [131]: arr.item()
Out[131]: (1, 2.0, '3')
In [132]: type(_)
Out[132]: tuple
np.asscalar(arr) returns the same tuple.
One difference between the np.void object and the tuple is that the void can still be indexed with the field name, arr[()]['f0'], whereas the tuple has to be indexed by number, arr.item()[0]. The void still has a dtype, while the tuple doesn't.
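Side by side, a small sketch using arr from above:
v = arr[()]        # np.void scalar: keeps the dtype
v['f0']            # -> 1, indexed by field name
t = arr.item()     # plain tuple: positional indexing only
t[0]               # -> 1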
fromrecords makes a recarray. This is similar to a structured array, but allows us to access fields as attributes. It may actually be an older class that has been merged into numpy, hence the np.rec prefix. Mostly we use structured arrays, though np.rec still has some convenience functions (many actually live in numpy.lib.recfunctions):
In [133]: res = np.rec.fromrecords((1,2,3), dt)
In [134]: res
Out[134]:
rec.array((1, 2., '3'),
          dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [135]: res.f0
Out[135]: array(1, dtype=int32)
In [136]: res.item()
Out[136]: (1, 2.0, '3')
In [137]: type(_)
Out[137]: tuple
In [138]: res[()]
Out[138]: (1, 2.0, '3')
In [139]: type(_)
Out[139]: numpy.record
So this produced a np.record instead of a np.void. But that's just a subclass:
In [143]: numpy.record.__mro__
Out[143]: (numpy.record, numpy.void, numpy.flexible, numpy.generic, object)
Accessing a structured array by field name gives an array of the corresponding dtype (and same shape)
In [145]: arr['f1']
Out[145]: array(2.0, dtype=float32)
In [146]: arr[()]['f1']
Out[146]: 2.0
In [147]: type(_)
Out[147]: numpy.float32
Out[146] could also be created with np.float32(2.0).
Checking my comment about ar[0] for the 1d array:
In [158]: arr1d = np.array([(1,2,3)], dt)
In [159]: arr1d
Out[159]:
array([(1, 2., '3')],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<U10')])
In [160]: arr1d[0]
Out[160]: (1, 2., '3')
In [161]: type(_)
Out[161]: numpy.void
So arr[()] and arr1d[0] do the same thing for their respective sized arrays. Likewise arr2d[0,0], which can also be written as arr2d[(0,0)].
Use np.asscalar.
In both of your cases it will be just np.asscalar(ar['a']).
Also, you might find ndarray.item() useful.
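Note that np.asscalar was deprecated in numpy 1.16 and later removed; ndarray.item() is the surviving spelling. A sketch:
ar = np.array((0.5, 1), dtype=dtype)
ar['a'].item()     # -> 0.5, a plain Python float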

Read/Write Python List from/to Binary file

According to Python Cookbook, below is how to write a list of tuples to a binary file of structures:
from struct import Struct

def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

# Example
if __name__ == '__main__':
    records = [(1, 2.3, 4.5),
               (6, 7.8, 9.0),
               (12, 13.4, 56.7)]

    with open('data.b', 'wb') as f:
        write_records(records, '<idd', f)
And it works well.
For reading a large amount of binary data, the author recommended the following:
>>> import numpy as np
>>> f = open('data.b', 'rb')
>>> records = np.fromfile(f, dtype='<i,<d,<d')
>>> records
array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
>>> records[0]
(1, 2.3, 4.5)
>>> records[1]
(6, 7.8, 9.0)
>>>
It is also good, but this record array is not a normal numpy array. For instance, type(records[0]) will return <type 'numpy.void'>. Even worse, I cannot extract the first column using X = records[:, 0].
Is there a way to efficiently load a list (or any other type) from a binary file into a normal numpy array?
Thanks in advance.
In [196]: rec = np.fromfile('data.b', dtype='<i,<d,<d')
In [198]: rec
Out[198]:
array([( 1,  2.3,  4.5), ( 6,  7.8,  9. ), (12, 13.4, 56.7)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
This is a 1d structured array
In [199]: rec['f0']
Out[199]: array([ 1, 6, 12], dtype=int32)
In [200]: rec.shape
Out[200]: (3,)
In [201]: rec.dtype
Out[201]: dtype([('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
Note that its tolist looks identical to your original records:
In [202]: rec.tolist()
Out[202]: [(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)]
In [203]: records
Out[203]: [(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)]
You could create a 2d array from either list with:
In [204]: arr2 = np.array(rec.tolist())
In [205]: arr2
Out[205]:
array([[ 1. ,  2.3,  4.5],
       [ 6. ,  7.8,  9. ],
       [12. , 13.4, 56.7]])
In [206]: arr2.shape
Out[206]: (3, 3)
There are other ways of converting a structured array to 'regular' array, but this is simplest and most consistent.
The tolist of a regular array uses nested lists. The tuples in the structured version are intended to convey a difference:
In [207]: arr2.tolist()
Out[207]: [[1.0, 2.3, 4.5], [6.0, 7.8, 9.0], [12.0, 13.4, 56.7]]
In the structured array the first field is integer. In the regular array the first column is same as the others, float.
If the binary file contained all floats, you could load it as a 1d array of floats and reshape:
In [208]: with open('data.f', 'wb') as f:
     ...:     write_records(records, 'ddd', f)
In [210]: rec2 = np.fromfile('data.f', dtype='<d')
In [211]: rec2
Out[211]: array([ 1. , 2.3, 4.5, 6. , 7.8, 9. , 12. , 13.4, 56.7])
But to take advantage of any record structure in the binary file, you have to load by records as well, which means a structured array:
In [213]: rec3 = np.fromfile('data.f', dtype='d,d,d')
In [214]: rec3
Out[214]:
array([( 1.,  2.3,  4.5), ( 6.,  7.8,  9. ), (12., 13.4, 56.7)],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8')])
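And since every field of rec3 is f8 here, a plain-float view gets back the 2d layout; a sketch that relies on all fields sharing one dtype:
flat = rec3.view('<f8').reshape(len(rec3), -1)
# array([[ 1. ,  2.3,  4.5],
#        [ 6. ,  7.8,  9. ],
#        [12. , 13.4, 56.7]])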

numpy array to ndarray

I have an exported pandas dataframe that is now a numpy.array object.
subset = array[:4,:]

array([[ 2.        , 12.        , 33.33333333,  2.        , 33.33333333, 12.        ],
       [ 2.        ,  2.        , 33.33333333,  2.        , 33.33333333,  2.        ],
       [ 2.8       ,  8.        , 45.83333333,  2.75      , 46.66666667, 13.        ],
       [ 3.11320755, 75.        , 56.        ,  3.24      , 52.83018868, 33.        ]])
print subset.dtype
dtype('float64')
I want to convert the column values to specific types and set column names as well, which I thought means I need to convert it to an ndarray.
Here are my dtypes:
[('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'),
('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8'),
('NULL_COUNT_B', '<f8')]
When I go to convert the array, I get:
ValueError: new type not compatible with array.
How do you cast each column to a specific type so I can convert the array to an ndarray?
Thanks
You already have an ndarray. What you are seeking is a structured array, one with this compound dtype. First see if pandas can do it for you. If that fails we might be able to do something with tolist and a list comprehension.
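The pandas route mentioned above might look like this (a sketch, assuming the original DataFrame is still around as df with named columns):
import pandas as pd

rec = df.to_records(index=False)   # structured (record) array, one dtype per column
rec.dtype                          # field names and per-column dtypes come for free
If that isn't an option, here's the tolist route: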
In [84]: dt = [('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'),
    ...:       ('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8'),
    ...:       ('NULL_COUNT_B', '<f8')]
In [85]: subset = np.array([[ 2.        , 12.        , 33.33333333,  2.        , 33.33333333, 12.        ],
    ...:                    [ 2.        ,  2.        , 33.33333333,  2.        , 33.33333333,  2.        ],
    ...:                    [ 2.8       ,  8.        , 45.83333333,  2.75      , 46.66666667, 13.        ],
    ...:                    [ 3.11320755, 75.        , 56.        ,  3.24      , 52.83018868, 33.        ]])
In [86]: subset
Out[86]:
array([[ 2.        , 12.        , 33.33333333,  2.        , 33.33333333, 12.        ],
       [ 2.        ,  2.        , 33.33333333,  2.        , 33.33333333,  2.        ],
       [ 2.8       ,  8.        , 45.83333333,  2.75      , 46.66666667, 13.        ],
       [ 3.11320755, 75.        , 56.        ,  3.24      , 52.83018868, 33.        ]])
Now make an array with dt. Input for a structured array has to be a list of tuples, so I'm using tolist and a list comprehension:
In [87]: np.array([tuple(row) for row in subset.tolist()], dtype=dt)
...
ValueError: field 'NULL_COUNT_B' occurs more than once
In [88]: subset.shape
Out[88]: (4, 6)
In [89]: dt
Out[89]:
[('PERCENT_A_NEW', '<f8'),
('JoinField', '<i4'),
('NULL_COUNT_B', '<f8'),
('PERCENT_COMP_B', '<f8'),
('RANKING_A', '<f8'),
('RANKING_B', '<f8'),
('NULL_COUNT_B', '<f8')]
In [90]: dt = [('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'),
    ...:       ('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8')]
In [91]: np.array([tuple(row) for row in subset.tolist()], dtype=dt)
Out[91]:
array([(2.0, 12, 33.33333333, 2.0, 33.33333333, 12.0),
       (2.0, 2, 33.33333333, 2.0, 33.33333333, 2.0),
       (2.8, 8, 45.83333333, 2.75, 46.66666667, 13.0),
       (3.11320755, 75, 56.0, 3.24, 52.83018868, 33.0)],
      dtype=[('PERCENT_A_NEW', '<f8'), ('JoinField', '<i4'), ('NULL_COUNT_B', '<f8'), ('PERCENT_COMP_B', '<f8'), ('RANKING_A', '<f8'), ('RANKING_B', '<f8')])

Assigning field names to numpy array in Python 2.7.3

I am going nuts over this one, as I obviously miss the point and the solution is too simple to see :(
I have an np.array with x columns, and I want to assign field names to the columns. So here is my code:
data = np.array([[1,2,3], [4.0,5.0,6.0], [11,12,12.3]])
a = np.array(data, dtype= {'names': ['1st', '2nd', '3rd'], 'formats':['f8','f8', 'f8']})
print a['1st']
Why does this give
[[  1.    2.    3. ]
 [  4.    5.    6. ]
 [ 11.   12.   12.3]]
instead of [1, 2, 3]?
In [1]: data = np.array([[1,2,3], [4.0,5.0,6.0], [11,12,12.3]])
In [2]: dt = np.dtype({'names': ['1st', '2nd', '3rd'], 'formats':['f8','f8', 'f8']})
Your attempt:
In [3]: np.array(data,dt)
Out[3]:
array([[(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (3.0, 3.0, 3.0)],
       [(4.0, 4.0, 4.0), (5.0, 5.0, 5.0), (6.0, 6.0, 6.0)],
       [(11.0, 11.0, 11.0), (12.0, 12.0, 12.0), (12.3, 12.3, 12.3)]],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
produces a (3,3) array, with the same values assigned to each field. data.astype(dt) does the same thing.
But view produces a (3,1) array in which each field contains the data for a column.
In [4]: data.view(dt)
Out[4]:
array([[(1.0, 2.0, 3.0)],
       [(4.0, 5.0, 6.0)],
       [(11.0, 12.0, 12.3)]],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
I should caution that view only works if all the fields have the same data type as the original. It uses the same data buffer, just interpreting the values differently.
You could reshape the result from (3,1) to (3,).
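For example (sketch):
data.view(dt).reshape(-1)   # shape (3,) instead of (3,1); one record per original row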
But since you want A['1st'] to be [1,2,3] - a row of data - we have to do some other manipulation.
In [16]: data.T.copy().view(dt)
Out[16]:
array([[(1.0, 4.0, 11.0)],
       [(2.0, 5.0, 12.0)],
       [(3.0, 6.0, 12.3)]],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
In [17]: _['1st']
Out[17]:
array([[ 1.],
       [ 2.],
       [ 3.]])
I transpose, and then make a copy (rearranging the underlying data buffer). Now a view puts [1,2,3] in one field.
Note that the display of the structured array uses () instead of [] for the 'rows'. This is a clue as to how it accepts input.
I can turn your data into a list of tuples with:
In [19]: [tuple(i) for i in data.T]
Out[19]: [(1.0, 4.0, 11.0), (2.0, 5.0, 12.0), (3.0, 6.0, 12.300000000000001)]
In [20]: np.array([tuple(i) for i in data.T],dt)
Out[20]:
array([(1.0, 4.0, 11.0), (2.0, 5.0, 12.0), (3.0, 6.0, 12.3)],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
In [21]: _['1st']
Out[21]: array([ 1., 2., 3.])
This is a (3,) array with 3 fields.
A list of tuples is the normal way of supplying data to np.array(...,dt). See the doc link in my comment.
You can also create an empty array and fill it, row by row or field by field:
In [26]: A = np.zeros((3,), dt)
In [27]: for i in range(3):
    ...:     A[i] = data[:, i].copy()
Without the copy I get a ValueError: ndarray is not C-contiguous
Fill field by field:
In [29]: for i in range(3):
    ...:     A[dt.names[i]] = data[i, :]
Usually a structured array has many rows, and a few fields. So filling by field is relatively fast. That's how recarray functions handle most copying tasks.
fromiter can also be used:
In [31]: np.fromiter(data, dtype=dt)
Out[31]:
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (11.0, 12.0, 12.3)],
      dtype=[('1st', '<f8'), ('2nd', '<f8'), ('3rd', '<f8')])
But the error I get when using data.T without the copy is a strong indication that it is doing row-by-row iteration (my In [27]):
In [32]: np.fromiter(data.T, dtype=dt)
ValueError: ndarray is not C-contiguous
zip(*data) is another way of reordering the input array (see #unutbu's answer in the comment link).
np.fromiter(zip(*data),dtype=dt)
As pointed out in a comment, fromarrays works:
np.rec.fromarrays(data,dt)
This is an example of a rec function that uses the field-by-field copy method:
arrayList = [sb.asarray(x) for x in arrayList]
...
_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]
Which in our case is:
In [8]: data1 = [np.asarray(i) for i in data]
In [9]: data1
Out[9]: [array([ 1., 2., 3.]), array([ 4., 5., 6.]), array([ 11. , 12. , 12.3])]
In [10]: for i in range(3):
    ...:     A[dt.names[i]] = data1[i]
