Adding a field to a structured numpy array - python

What is the cleanest way to add a field to a structured numpy array? Can it be done destructively, or is it necessary to create a new array and copy over the existing fields? Are the contents of each field stored contiguously in memory so that such copying can be done efficiently?

If you're using numpy 1.3 or later, there's also numpy.lib.recfunctions.append_fields().
In many installations you'll need to import numpy.lib.recfunctions explicitly to access it; import numpy alone does not expose numpy.lib.recfunctions.
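A minimal sketch of the recfunctions route (assuming numpy.lib.recfunctions is importable; usemask=False returns a plain structured ndarray rather than a masked array):

import numpy as np
from numpy.lib import recfunctions as rfn

sa = np.array([(1, 'Foo'), (2, 'Bar')], dtype=[('id', int), ('name', 'S3')])
# Append a 'score' field filled with zeros.
sb = rfn.append_fields(sa, 'score', data=np.zeros(len(sa)), dtypes=float, usemask=False)
print(sb.dtype.names)  # ('id', 'name', 'score')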

import numpy

def add_field(a, descr):
    """Return a new array that is like "a", but has additional fields.

    Arguments:
      a     -- a structured numpy array
      descr -- a numpy type description of the new fields

    The contents of "a" are copied over to the appropriate fields in
    the new array, whereas the new fields are uninitialized. The
    arguments are not modified.

    >>> sa = numpy.array([(1, 'Foo'), (2, 'Bar')],
    ...                  dtype=[('id', int), ('name', 'S3')])
    >>> sa.dtype.descr == numpy.dtype([('id', int), ('name', 'S3')])
    True
    >>> sb = add_field(sa, [('score', float)])
    >>> sb.dtype.descr == numpy.dtype([('id', int), ('name', 'S3'),
    ...                                ('score', float)])
    True
    >>> numpy.all(sa['id'] == sb['id'])
    True
    >>> numpy.all(sa['name'] == sb['name'])
    True
    """
    if a.dtype.fields is None:
        raise ValueError("`a` must be a structured numpy array")
    b = numpy.empty(a.shape, dtype=a.dtype.descr + descr)
    for name in a.dtype.names:
        b[name] = a[name]
    return b
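A quick usage sketch of the helper above; the new field comes back uninitialized, so fill it explicitly before use:

sa = numpy.array([(1, 'Foo'), (2, 'Bar')], dtype=[('id', int), ('name', 'S3')])
sb = add_field(sa, [('score', float)])
sb['score'] = 0.0  # the new field starts out uninitialized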

Related

Why does h5py throw an error when adding 3 variable length strings to a dataset?

I'm trying to set up and write to an HDF5 dataset using h5py (Python 3) that contains a one dimensional array of compound objects. Each compound object is made up of three variable length string properties.
import h5py
import numpy as np

with h5py.File("myfile.hdf5", "a") as file:
    dt = np.dtype([
        ("label", h5py.string_dtype(encoding='utf-8')),
        ("name", h5py.string_dtype(encoding='utf-8')),
        ("id", h5py.string_dtype(encoding='utf-8'))])
    dset = file.require_dataset("initial_data", (50000,), dtype=dt)
    dset[0, "label"] = "foo"
When I run the example above, the last line of code causes h5py (or more accurately numpy) to throw an error saying:
"Cannot change data-type for object array."
Do I understand correctly that the type for "foo" is not h5py.string_dtype(encoding='utf-8')?
How come? And how can I fix this?
UPDATE 1:
Stepping into the stacktrace, I can see that the error is thrown from an internal numpy function called _view_is_safe(oldtype, newtype). In my case oldtype is dtype('O') but newtype is dtype([('label', 'O')]) which causes the error to be thrown.
UPDATE 2:
My question has been answered successfully below but for completeness I'm linking to a GH issue that might be related: https://github.com/h5py/h5py/issues/1921
The dtype is a compound of three variable-length strings, so you have to set the whole tuple (row) at once. By setting only the "label" element, the other two values are left unset, so they are not string types.
Example:
import h5py
import numpy as np

with h5py.File("myfile.hdf5", "a") as file:
    dt = np.dtype([
        ("label", h5py.string_dtype(encoding='utf-8')),
        ("name", h5py.string_dtype(encoding='utf-8')),
        ("id", h5py.string_dtype(encoding='utf-8'))])
    dset = file.require_dataset("initial_data", (50000,), dtype=dt)

    # Add a row of data with a tuple:
    dset[0] = "foo", "bar", "baz"

    # Add another row of data with a np recarray (1 row):
    npdt = np.dtype([
        ("label", 'S4'),
        ("name", 'S4'),
        ("id", 'S4')])
    dset[1] = np.array(("foo1", "bar1", "baz1"), dtype=npdt)

    # Add 3 rows of data with a np recarray (3 rows built from a list of arrays):
    s1 = np.array(("A", "B", "C"), dtype='S4')
    s2 = np.array(("a", "b", "c"), dtype='S4')
    s3 = np.array(("X", "Y", "Z"), dtype='S4')
    recarr = np.rec.fromarrays([s1, s2, s3], dtype=npdt)
    dset[2:5] = recarr
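To verify the writes, a short read-back sketch (using the same file and dataset names as above):

with h5py.File("myfile.hdf5", "r") as file:
    rows = file["initial_data"][0:5]  # comes back as a numpy structured array
    print(rows)
    print(rows["label"])              # just the 'label' field of those rows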

numpy different data type of multidimensional array

I want to insert 2D data into numpy and present it as a DataFrame in pandas.
It has 10 rows and 5 columns. The data types of the 5 columns are, in order:
(int, string, int, string, string).
import numpy as np

_MaxRows = 10
_MaxColumns = 5
person = np.zeros([_MaxRows, 5])
person

def personGen():
    for i in range(0, _MaxRows):
        # add person dynamically
        # person[i][0] = i
        # person[i][1] = names[i]
        # person[i][2] = random.randint(10,50)
        # person[i][3] = random.choice(['M','F'])
        # person[i][4] = 'Desc'
        pass

personGen()
OUTPUT REQUIRED AS DATA FRAME
Id Name Age Gender Desc
1 Sumeet 12 'M' 'HELLO'
2 Sumeet2 13 'M' 'HELLO2'
You cannot have different data types in the same numpy array. You could instead keep a list of one-dimensional numpy arrays, each with its own data type.
e.g. like this:

import numpy as np
import pandas as pd

names = ["asd", "shd", "wdf"]
ages = np.array([12, 35, 23])
d = {'name': names, 'age': ages}
df = pd.DataFrame(data=d)
Building on yar's answer:
You cannot have different data types in the same plain (one- or multi-dimensional) numpy array.
You can have different data types in a numpy structured array.
import random
import numpy
import pandas as pd

_MaxRows = 2
_MaxNameSize = 7
_MaxDescSize = 4
names = ['Sumeet', 'Sumeet2']
data = [(i, names[i], random.randint(10, 50), random.choice(['M', 'F']), 'Desc')
        for i in range(0, _MaxRows)]
nparray = numpy.array(data, dtype=[('id', int), ('name', f'S{_MaxNameSize}'), ('age', int),
                                   ('Gender', 'S1'), ('Desc', f'S{_MaxDescSize}')])
# If you cannot specify the string field sizes, you can use object fields as below;
# however, this reportedly hurts numpy's performance because it gives up the benefit
# of the array's contiguous memory layout:
# nparray = numpy.array(data, dtype=[('id', int), ('name', object), ('age', int), ('Gender', 'S1'), ('Desc', object)])
print('Numpy structured array:\n', nparray)
pddataframe = pd.DataFrame(nparray)
print('\nPandas dataframe:\n', pddataframe)
Pandas DataFrames already create a default index (0, 1, ...) for incoming data, so you may choose to leave out the 'id' column.
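If you do keep the 'id' field, a small sketch of using it as the DataFrame index instead of the default one (reusing the nparray built above):

pddataframe = pd.DataFrame(nparray).set_index('id')
print(pddataframe)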

python pandas dataframe to matlab struct using scipy.io

I am trying to save a pandas dataframe to a matlab .mat file using scipy.io.
I have the following:
import numpy as np
from pandas import DataFrame
import scipy.io

array1 = np.array([1, 2, 3])
array2 = np.array(['a', 'b', 'c'])
array3 = np.array([1.01, 2.02, 3.03])
df = DataFrame({1: array1, 2: array2, 3: array3}, index=('array1', 'array2', 'array3'))
recarray_ = df.to_records()
## Produces:
# rec.array([('array1', 1, 'a', 1.01), ('array2', 2, 'b', 2.02),
#            ('array3', 3, 'c', 3.03)],
#           dtype=[('index', 'O'), ('1', '<i4'), ('2', 'O'), ('3', '<f8')])
scipy.io.savemat('test_recarray_struct.mat', {'struct': df.to_records()})
In Matlab, I would expect this to produce a struct containing three arrays (one int, one char, one float), but what it actually produces is a struct containing 3 more structs, each containing four variables: 'index', 1, '2', 3. When trying to select 1, '2' or 3 I get the error 'The variable struct(1, 1).# does not exist.'
Can anyone explain the expected behaviour and how best to save DataFrames to .mat files?
I am using the following workaround in the meantime. Please let me know if you have a better solution:
a_dict = {col_name : df[col_name].values for col_name in df.columns.values}
## optional if you want to save the index as an array as well:
# a_dict[df.index.name] = df.index.values
scipy.io.savemat('test_struct_to_mat.mat', {'struct':a_dict})
I think what you need is to create the dataframe like this:
df = DataFrame({'array1':array1, 'array2':array2,'array3':array3})
and save it like this:
scipy.io.savemat('test_recarray_struct.mat', {'struct':df.to_dict("list")})
So the code should be something like:
import numpy as np
from pandas import DataFrame
import scipy.io

array1 = np.array([1, 2, 3])
array2 = np.array(['a', 'b', 'c'])
array3 = np.array([1.01, 2.02, 3.03])
df = DataFrame({'array1': array1, 'array2': array2, 'array3': array3})
scipy.io.savemat('test_recarray_struct.mat', {'struct': df.to_dict("list")})
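As a quick sanity check from Python (a sketch, assuming the file written above), loading it back with scipy.io.loadmat should expose the three column names as struct fields:

from scipy.io import loadmat

m = loadmat('test_recarray_struct.mat')
print(m['struct'].dtype.names)  # expected: ('array1', 'array2', 'array3')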

initialize numpy array with named tuples

I'm trying to initialize a NumPy array that contains named tuples. Everything works fine when I initialize the array with empty data and set that data afterwards; when using the numpy.array constructor, however, NumPy doesn't do what I had expected.
The output of
import numpy

data = numpy.random.rand(10, 3)
print(data[0])

# Works
a = numpy.empty(
    len(data),
    dtype=numpy.dtype([('nodes', (float, 3))])
)
a['nodes'] = data
print()
print(a[0]['nodes'])

# Doesn't work
b = numpy.array(
    data,
    dtype=numpy.dtype([('nodes', (float, 3))])
)
print()
print(b[0]['nodes'])
is
[ 0.28711363 0.89643579 0.82386232]
[ 0.28711363 0.89643579 0.82386232]
[[ 0.28711363 0.28711363 0.28711363]
[ 0.89643579 0.89643579 0.89643579]
[ 0.82386232 0.82386232 0.82386232]]
This is with NumPy 1.8.1.
Any hints on how to organize the array constructor?
This is awful, but:
Starting with your example copied and pasted into an ipython, try
dtype=numpy.dtype([('nodes', (float, 3))])
c = numpy.array([(aa,) for aa in data], dtype=dtype)
it seems to do the trick.
It's instructive to construct a different array:
import numpy as np

dt3 = np.dtype([('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
b = np.zeros((10,), dtype=dt3)
b[:] = [tuple(x) for x in data]
b['x'] = data[:, 0]  # alt
np.array([tuple(x) for x in data], dtype=dt3)  # or in one statement
a[:1]
# array([([0.32726803375966484, 0.5845638956708634, 0.894278688117277],)], dtype=[('nodes', '<f8', (3,))])
b[:1]
# array([(0.32726803375966484, 0.5845638956708634, 0.894278688117277)], dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
I don't think there's a way of assigning data to all the fields of b without some sort of iteration.
genfromtxt is a common way of generating record arrays like this. Looking at its code I see a pattern like:
data = list(zip(*[...]))
output = np.array(data, dtype)
Which inspired me to try:
dtype = numpy.dtype([('nodes', (float, 3))])
a = np.array(zip(data), dtype=dtype)
(Speed is basically the same as eickenberg's comprehension, so it's doing the same pure-Python list operations. Note this relies on Python 2's zip returning a list; on Python 3 wrap it: np.array(list(zip(data)), dtype=dtype).)
And for the 3 fields:
np.array(zip(*data.T), dtype=dt3)
Curiously, explicitly converting to a list first is even faster (almost twice as fast as the zip(data) version):
np.array(zip(*data.T.tolist()), dtype=dt3)
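On newer NumPy (1.16+), numpy.lib.recfunctions offers a vectorized route as well; a sketch for the three-field case:

import numpy as np
from numpy.lib import recfunctions as rfn

data = np.random.rand(10, 3)
dt3 = np.dtype([('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
b = rfn.unstructured_to_structured(data, dt3)  # maps the last axis onto the three fields
print(b[:1])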

Python: Numpy, the data genfromtxt cannot do tranpose

for example, if I have a data.csv file, it looks like:
A B
1| 1 2
2| 3 4
If I load this data this way:

import numpy as np
from StringIO import StringIO

with open('test.csv', 'rb') as text:
    b = np.genfromtxt(StringIO(text.read()),
                      delimiter=",",
                      dtype=[('x', int), ('y', int)])  ## <=== in my case I want my columns'
                                                       ##      dtypes in different formats
print b.transpose()
it still gives me this, which has not been transposed:
[(1,2)
(3,4)]
But if I use dtype = 'str' (which means all columns have the same format),
np.transpose() works fine.
===============================================
But if I do it this way:
import numpy as np
a = np.array([(1,2), (3,4)])
print a.transpose()
it works correctly
======= MY QUESTION =======
If I load numpy data from a txt or csv file and define each column's data type, is there any way to make the transpose work properly?
This is what T is designed to do; see the documentation:
T    Same as self.transpose(), except that self is returned if self.ndim < 2.
Here the b array has ndim of 1 (you may think it has shape (2, 2), but be aware it actually has shape (2,)), so T returns the array itself.
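If you need a transposable 2-D view of such data, one option (a sketch, using numpy.lib.recfunctions, available in NumPy 1.16+) is to convert the structured array to a plain 2-D array first:

import numpy as np
from numpy.lib import recfunctions as rfn

b = np.array([(1, 2), (3, 4)], dtype=[('x', int), ('y', int)])  # structured, shape (2,)
arr2d = rfn.structured_to_unstructured(b)                       # plain array, shape (2, 2)
print(arr2d.transpose())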
