I would like to view an object array with a dtype that encapsulates entire rows:
import numpy as np

data = np.array([['a', '1'], ['a', 'z'], ['b', 'a']], dtype=object)
dt = np.dtype([('x', object), ('y', object)])
data.view(dt)
I get an error:
TypeError: Cannot change data-type for object array.
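For context, the same row-wise view works on a non-object array, so the restriction appears to be specific to object dtypes; a minimal check of my own:
ints = np.array([[1, 2], [3, 4]], dtype=np.int64)
print(ints.view(np.dtype([('x', np.int64), ('y', np.int64)])))
# works: each two-element row collapses into one ('x', 'y') record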
I have tried the following workarounds:
dt2 = np.dtype([('x', object, 2)])
data.view(dt2)
data.view(np.uint8).view(dt)
data.view(np.void).view(dt)
All cases result in the same error. Is there some way to view an object array with a different dtype?
I have also tried a more general approach (this is for reference, since it's functionally identical to what's shown above):
dt = np.dtype(','.join(data.dtype.char * data.shape[1]))
dt2 = np.dtype([('x', data.dtype, data.shape[1])])
It seems that you can always force a view of a buffer using np.array:
view = np.array(data, dtype=dt, copy=not data.flags['C_CONTIGUOUS'])
While this is a quick and dirty approach, the data gets copied in this case, and dt2 does not get applied correctly:
>>> print(view.base)
None
>>> np.array(data, dtype=dt2, copy=not data.flags['C_CONTIGUOUS'])
array([[(['a', 'a'],), (['1', '1'],)],
       [(['a', 'a'],), (['z', 'z'],)],
       [(['b', 'b'],), (['a', 'a'],)]], dtype=[('x', 'O', (2,))])
For a more correct approach (in some circumstances), you can use the raw np.ndarray constructor:
real_view = np.ndarray(data.shape[:1], dtype=dt2, buffer=data)
This makes a true view of the data:
>>> real_view
array([(['a', '1'],), (['a', 'z'],), (['b', 'a'],)], dtype=[('x', 'O', (2,))])
>>> real_view.base is data
True
As shown, this only works when the data has C-contiguous rows.
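One practical guard (my own sketch, not from the original): fall back to a contiguous copy when the flag is not set, since a sliced or transposed array cannot serve as the buffer:
if not data.flags['C_CONTIGUOUS']:
    data = np.ascontiguousarray(data)  # last resort: this makes a copy
real_view = np.ndarray(data.shape[:1], dtype=dt2, buffer=data)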
Does the pandas.Series.isin function require the data to be hashable? I could not find this requirement in the documentation (neither under Series.isin nor under Series; I do see that the index needs to be hashable, but nothing about the data).
import pandas as pd

foo = ['x', 'y']
bar = pd.Series(foo, index=['a', 'b'])
baz = pd.Series([foo[0]], index=['c'])
print(bar.isin(baz))
works as expected and returns
a     True
b    False
dtype: bool
However, the following fails with the error TypeError: unhashable type: 'list':
foo = [['x', 'y'], ['z', 't']]
bar = pd.Series(foo, index=['a', 'b'])
baz = pd.Series([foo[0]], index=['c'])
print(bar.isin(baz))
Is that intended? Is it documented somewhere? Or is it a bug in pandas?
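For what it's worth, a workaround consistent with the hashability theory (an untested sketch) is to map the lists to tuples, which are hashable:
print(bar.map(tuple).isin(baz.map(tuple)))
# a     True
# b    False
# dtype: bool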
I am trying to save a pandas DataFrame to a MATLAB .mat file using scipy.io.
I have the following:
import numpy as np
import scipy.io
from pandas import DataFrame

array1 = np.array([1, 2, 3])
array2 = np.array(['a', 'b', 'c'])
array3 = np.array([1.01, 2.02, 3.03])
df = DataFrame({1: array1, 2: array2, 3: array3}, index=('array1', 'array2', 'array3'))
recarray_ = df.to_records()
## Produces:
# rec.array([('array1', 1, 'a', 1.01), ('array2', 2, 'b', 2.02),
# ('array3', 3, 'c', 3.03)],
# dtype=[('index', 'O'), ('1', '<i4'), ('2', 'O'), ('3', '<f8')])
scipy.io.savemat('test_recarray_struct.mat', {'struct':df.to_records()})
In MATLAB, I would expect this to produce a struct containing three arrays (one int, one char, one float), but what it actually produces is a struct containing three more structs, each containing four variables: 'index', '1', '2', '3'. When trying to select '1', '2' or '3' I get the error 'The variable struct(1, 1).# does not exist.'
Can anyone explain the expected behaviour and how best to save DataFrames to .mat files?
I am using the following workaround in the meantime. Please let me know if you have a better solution:
a_dict = {col_name: df[col_name].values for col_name in df.columns.values}
## optional if you want to save the index as an array as well:
# a_dict[df.index.name] = df.index.values
scipy.io.savemat('test_struct_to_mat.mat', {'struct':a_dict})
I think what you need is to create the DataFrame like this:
df = DataFrame({'array1': array1, 'array2': array2, 'array3': array3})
and save it like this:
scipy.io.savemat('test_recarray_struct.mat', {'struct': df.to_dict("list")})
So the code should be something like:
import numpy as np
import scipy.io
from pandas import DataFrame

array1 = np.array([1, 2, 3])
array2 = np.array(['a', 'b', 'c'])
array3 = np.array([1.01, 2.02, 3.03])
df = DataFrame({'array1': array1, 'array2': array2, 'array3': array3})
scipy.io.savemat('test_recarray_struct.mat', {'struct': df.to_dict("list")})
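To sanity-check from the Python side (my own verification sketch), read the file back with scipy.io.loadmat and confirm the struct has the three expected fields:
from scipy.io import loadmat

mat = loadmat('test_recarray_struct.mat')
print(mat['struct'].dtype.names)
# should print something like ('array1', 'array2', 'array3')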
Similar to this question, numpy.genfromtxt modifies my columns' names:
import numpy as np
from io import BytesIO # https://stackoverflow.com/a/11970414/321973
s = 'x,-1,1\n0,1,1\n1,2,3'
data = np.genfromtxt(BytesIO(s.encode()), delimiter=',', names=True)
print(data.dtype.names)
yields ('x', '1', '1_1') instead of the desired ('x', '-1', '1') (or even better, ('x', -1, 1)). I tried deletechars="""~!##$%^&*()=+~\|]}[{';: /?>,<""" as suggested there to no avail.
The behavior you're seeing is caused by the fact that np.genfromtxt uses an internal NameValidator class to automatically strip certain non-alphanumeric characters from the field names.
It's perfectly legal for a field name to contain a '-' character, e.g.:
x = np.array((1,), dtype=[('-1', 'i')])
print(x['-1'])
# 1
In fact, two out of three of the modified field names you get back from np.genfromtxt are also not "valid Python identifiers" ('1' and '1_1', since they start with digits).
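A quick check confirms this:
print('1'.isidentifier(), '1_1'.isidentifier(), '-1'.isidentifier())
# False False False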
It's therefore possible to construct the array you describe as long as you bypass using np.genfromtxt to set the field names. One way to do it would be to initialize an empty array, specify the field names and dtypes explicitly, then fill it with the rest of the string contents:
names = s.splitlines()[0].split(',')
types = ('i',) * 3
dtype = list(zip(names, types))  # zip is lazy on Python 3, so materialize it
data = np.empty(2, dtype=dtype)
data[:] = np.genfromtxt(BytesIO(s.encode()), delimiter=',', dtype=dtype,
                        skip_header=1)
print(repr(data))
# array([(0, 1, 1), (1, 2, 3)],
#       dtype=[('x', '<i4'), ('-1', '<i4'), ('1', '<i4')])
However, just because you can doesn't mean you should: there may well be other unpredictable consequences of having a '-' in one of your field names. The safest option is to stick with using only valid Python identifiers as field names.
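Alternatively (a sketch of my own), you can let np.genfromtxt mangle the names and then rename the fields afterwards, since dtype.names is assignable:
data = np.genfromtxt(BytesIO(s.encode()), delimiter=',', names=True)
data.dtype.names = ('x', '-1', '1')  # restore the original header names
print(data['-1'])
# [1. 2.]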
I'm trying to initialize a NumPy array that contains named tuples. Everything works fine when I initialize the array with empty data and set that data afterwards; when using the numpy.array constructor, however, NumPy doesn't do what I had expected.
The output of
import numpy

data = numpy.random.rand(10, 3)
print(data[0])

# Works
a = numpy.empty(
    len(data),
    dtype=numpy.dtype([('nodes', (float, 3))])
)
a['nodes'] = data
print()
print(a[0]['nodes'])

# Doesn't work
b = numpy.array(
    data,
    dtype=numpy.dtype([('nodes', (float, 3))])
)
print()
print(b[0]['nodes'])
is
[ 0.28711363  0.89643579  0.82386232]

[ 0.28711363  0.89643579  0.82386232]

[[ 0.28711363  0.28711363  0.28711363]
 [ 0.89643579  0.89643579  0.89643579]
 [ 0.82386232  0.82386232  0.82386232]]
This is with NumPy 1.8.1.
Any hints on how to organize the array constructor?
This is awful, but:
Starting with your example copied and pasted into an IPython session, try
dtype=numpy.dtype([('nodes', (float, 3))])
c = numpy.array([(aa,) for aa in data], dtype=dtype)
it seems to do the trick.
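A quick check (mine) that the round trip worked:
print(c[0]['nodes'])                     # same values as data[0]
print(numpy.allclose(c['nodes'], data))  # True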
It's instructive to construct a different array:
import numpy as np

dt3 = np.dtype([('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
b = np.zeros((10,), dtype=dt3)
b[:] = [tuple(x) for x in data]
b['x'] = data[:, 0]  # alternative: fill one field at a time
np.array([tuple(x) for x in data], dtype=dt3)  # or in one statement
a[:1]
# array([([0.32726803375966484, 0.5845638956708634, 0.894278688117277],)], dtype=[('nodes', '<f8', (3,))])
b[:1]
# array([(0.32726803375966484, 0.5845638956708634, 0.894278688117277)], dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
I don't think there's a way of assigning data to all the fields of b without some sort of iteration.
genfromtxt is a common way of generating record arrays like this. Looking at its code I see a pattern like:
data = list(zip(*[...]))
output = np.array(data, dtype)
Which inspired me to try:
dtype = np.dtype([('nodes', (float, 3))])
a = np.array(list(zip(data)), dtype=dtype)
(Speed is basically the same as eickenberg's comprehension, so it's doing the same pure-Python list operations.)
And for the 3 fields:
np.array(list(zip(*data.T)), dtype=dt3)
Curiously, explicitly converting to a list first is even faster (almost 2x the zip(data) calculation):
np.array(list(zip(*data.T.tolist())), dtype=dt3)
Say I have a file myfile.txt containing:
1 2.0000 buckle_my_shoe
3 4.0000 margery_door
How do I import data from the file to a numpy array as an int, float and string?
I am aiming to get:
array([[1, 2.0000, "buckle_my_shoe"],
       [3, 4.0000, "margery_door"]])
I've been playing around with the following to no avail:
a = numpy.loadtxt('myfile.txt', dtype=(numpy.int_, numpy.float_, numpy.string_))
EDIT: Another approach might be to use the ndarray type and convert afterwards.
b = numpy.loadtxt('myfile.txt', dtype=numpy.ndarray)
array([['1', '2.0000', 'buckle_my_shoe'],
       ['3', '4.0000', 'margery_door']], dtype=object)
Use numpy.genfromtxt:
import numpy as np
np.genfromtxt('myfile.txt', dtype=None)
# array([(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')],
# dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '|S14')])
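On Python 3 the strings come back as bytes (hence the '|S14' above); assuming a NumPy recent enough to have the encoding parameter (1.14+), passing it yields proper strings (the exact integer width below is platform-dependent):
np.genfromtxt('myfile.txt', dtype=None, encoding='utf-8')
# array([(1, 2., 'buckle_my_shoe'), (3, 4., 'margery_door')],
#       dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<U14')])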
Pandas can do that for you; the function to use is pandas.read_csv.
Assuming your columns are tab separated, this should do the trick (adapted from this question):
import pandas as pd

df = pd.read_csv('myfile.txt', sep='\t')
array = df.values  # the array you are interested in
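Note that the sample file shown above is whitespace-separated and has no header row, so (my adaptation, untested) this variant may be closer to what's needed:
df = pd.read_csv('myfile.txt', sep=r'\s+', header=None)
array = df.values
# array([[1, 2.0, 'buckle_my_shoe'],
#        [3, 4.0, 'margery_door']], dtype=object)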