I am trying to save a pandas DataFrame to a MATLAB .mat file using scipy.io.
I have the following:
import numpy as np
import scipy.io
from pandas import DataFrame

array1 = np.array([1, 2, 3])
array2 = np.array(['a', 'b', 'c'])
array3 = np.array([1.01, 2.02, 3.03])
df = DataFrame({1: array1, 2: array2, 3: array3}, index=('array1', 'array2', 'array3'))
recarray_ = df.to_records()
## Produces:
# rec.array([('array1', 1, 'a', 1.01), ('array2', 2, 'b', 2.02),
# ('array3', 3, 'c', 3.03)],
# dtype=[('index', 'O'), ('1', '<i4'), ('2', 'O'), ('3', '<f8')])
scipy.io.savemat('test_recarray_struct.mat', {'struct':df.to_records()})
In MATLAB, I would expect this to produce a struct containing three arrays (one int, one char, one float), but what it actually produces is a struct containing three more structs, each holding four variables: 'index', '1', '2', and '3'. When I try to select '1', '2' or '3' I get the error 'The variable struct(1, 1).# does not exist.'
Can anyone explain the expected behaviour and how best to save DataFrames to .mat files?
I am using the following workaround in the meantime. Please let me know if you have a better solution:
a_dict = {col_name : df[col_name].values for col_name in df.columns.values}
## optional if you want to save the index as an array as well:
# a_dict[df.index.name] = df.index.values
scipy.io.savemat('test_struct_to_mat.mat', {'struct':a_dict})
I think what you need is to create the dataframe like this:
df = DataFrame({'array1': array1, 'array2': array2, 'array3': array3})
and save it like this:
scipy.io.savemat('test_recarray_struct.mat', {'struct': df.to_dict("list")})
So the code should be something like:
# ... import appropriately (numpy, pandas.DataFrame, scipy.io)
array1 = np.array([1, 2, 3])
array2 = np.array(['a', 'b', 'c'])
array3 = np.array([1.01, 2.02, 3.03])
df = DataFrame({'array1': array1, 'array2': array2, 'array3': array3})
scipy.io.savemat('test_recarray_struct.mat', {'struct': df.to_dict("list")})
Related
I would like to view an object array with a dtype that encapsulates entire rows:
data = np.array([['a', '1'], ['a', 'z'], ['b', 'a']], dtype=object)
dt = np.dtype([('x', object), ('y', object)])
data.view(dt)
I get an error:
TypeError: Cannot change data-type for object array.
I have tried the following workarounds:
dt2 = np.dtype([('x', np.object, 2)])
data.view(dt2)
data.view(np.uint8).view(dt)
data.view(np.void).view(dt)
All cases result in the same error. Is there some way to view an object array with a different dtype?
I have also tried a more general approach (this is for reference, since it's functionally identical to what's shown above):
dt = np.dtype(','.join(data.dtype.char * data.shape[1]))
dt2 = np.dtype([('x', data.dtype, data.shape[1])])
It seems that you can always force a view of a buffer using np.array:
view = np.array(data, dtype=dt, copy=not data.flags['C_CONTIGUOUS'])
While this is a quick and dirty approach, the data actually gets copied in this case (so it is not a view at all), and dt2 does not get applied correctly:
>>> print(view.base)
None
>>> np.array(data, dtype=dt2, copy=not data.flags['C_CONTIGUOUS'])
array([[(['a', 'a'],), (['1', '1'],)],
[(['a', 'a'],), (['z', 'z'],)],
[(['b', 'b'],), (['a', 'a'],)]], dtype=[('x', 'O', (2,))])
For a more correct approach (in some circumstances), you can use the raw np.ndarray constructor:
real_view = np.ndarray(data.shape[:1], dtype=dt2, buffer=data)
This makes a true view of the data:
>>> real_view
array([(['a', '1'],), (['a', 'z'],), (['b', 'a'],)], dtype=[('x', 'O', (2,))])
>>> real_view.base is data
True
As shown, this only works when the data has C-contiguous rows.
I want to insert 2D data into a NumPy array and then present it as a DataFrame in pandas.
It has 10 rows and 5 columns, and the column data types are, in order,
(int, string, int, string, string).
_MaxRows = 10
_MaxColumns = 5
person = np.zeros([_MaxRows, _MaxColumns])  # a float array, so it cannot hold strings
person

def personGen():
    for i in range(0, _MaxRows):
        # add person dynamically; the string assignments below fail on a float array:
        # person[i][0] = i
        # person[i][1] = names[i]
        # person[i][2] = random.randint(10, 50)
        # person[i][3] = random.choice(['M', 'F'])
        # person[i][4] = 'Desc'
        pass

personGen()
OUTPUT REQUIRED AS DATA FRAME
Id Name Age Gender Desc
1 Sumeet 12 'M' 'HELLO'
2 Sumeet2 13 'M' 'HELLO2'
You cannot have different data types in the same ordinary NumPy array. You could instead keep one 1-D array (or list) per column, each with its own data type,
e.g. like this:
names = ["asd", "shd", "wdf"]
ages = np.array([12, 35, 23])
d = {'name': names, 'age': ages}
df = pd.DataFrame(data=d)
Building from yar's answer:
You cannot have different data types in the same ordinary (one- or multi-dimensional) NumPy array, but you can have different data types in a NumPy structured array.
import random
import numpy
import pandas as pd

_MaxRows = 2
_MaxNameSize = 7
_MaxDescSize = 4
names = ['Sumeet', 'Sumeet2']
data = [
    (i, names[i], random.randint(10, 50), random.choice(['M', 'F']), 'Desc')
    for i in range(0, _MaxRows)
]
nparray = numpy.array(data, dtype=[('id', int), ('name', f'S{_MaxNameSize}'), ('age', int),
                                   ('Gender', 'S1'), ('Desc', f'S{_MaxDescSize}')])
# If you cannot fix the string sizes up front, you can fall back to object fields as below.
# Reportedly this hurts NumPy performance, since it gives up the benefit of the array's
# contiguous memory layout.
# nparray = numpy.array(data, dtype=[('id', int), ('name', object), ('age', int), ('Gender', 'S1'), ('Desc', object)])
print('Numpy structured array:\n', nparray)
pddataframe = pd.DataFrame(nparray)
print('\nPandas dataframe:\n', pddataframe)
Pandas' DataFrame by default already creates an index (0, 1, ...) for your incoming data, so you may choose to omit the 'id' column.
I'm trying to initialize a NumPy array that contains named tuples. Everything works fine when I initialize the array with empty data and set that data afterwards; when using the numpy.array constructor, however, NumPy doesn't do what I had expected.
The output of
import numpy
data = numpy.random.rand(10, 3)
print data[0]
# Works
a = numpy.empty(
len(data),
dtype=numpy.dtype([('nodes', (float, 3))])
)
a['nodes'] = data
print
print a[0]['nodes']
# Doesn't work
b = numpy.array(
data,
dtype=numpy.dtype([('nodes', (float, 3))])
)
print
print b[0]['nodes']
is
[ 0.28711363 0.89643579 0.82386232]
[ 0.28711363 0.89643579 0.82386232]
[[ 0.28711363 0.28711363 0.28711363]
[ 0.89643579 0.89643579 0.89643579]
[ 0.82386232 0.82386232 0.82386232]]
This is with NumPy 1.8.1.
Any hints on how to organize the array constructor?
This is awful, but:
Starting with your example copied and pasted into an IPython session, try
dtype=numpy.dtype([('nodes', (float, 3))])
c = numpy.array([(aa,) for aa in data], dtype=dtype)
it seems to do the trick.
It's instructive to construct a different array:
dt3=np.dtype([('x','<f8'),('y','<f8'),('z','<f8')])
b=np.zeros((10,),dtype=dt3)
b[:]=[tuple(x) for x in data]
b['x'] = data[:,0] # alt
np.array([tuple(x) for x in data],dtype=dt3) # or in one statement
a[:1]
# array([([0.32726803375966484, 0.5845638956708634, 0.894278688117277],)], dtype=[('nodes', '<f8', (3,))])
b[:1]
# array([(0.32726803375966484, 0.5845638956708634, 0.894278688117277)], dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
I don't think there's a way of assigning data to all the fields of b without some sort of iteration.
genfromtxt is a common way of generating record arrays like this. Looking at its code I see a pattern like:
data = list(zip(*[...]))
output = np.array(data, dtype)
Which inspired me to try:
dtype=numpy.dtype([('nodes', (float, 3))])
a = np.array(zip(data), dtype=dtype)
(Speed is basically the same as eickenberg's comprehension, so it's doing the same pure-Python list operations.)
And for the 3 fields:
np.array(zip(*data.T), dtype=dt3)
Curiously, explicitly converting to a list first is even faster (almost twice as fast as the zip(data) calc):
np.array(zip(*data.T.tolist()), dtype=dt3)
Say I have a file myfile.txt containing:
1 2.0000 buckle_my_shoe
3 4.0000 margery_door
How do I import data from the file to a numpy array as an int, float and string?
I am aiming to get:
array([[1,2.0000,"buckle_my_shoe"],
[3,4.0000,"margery_door"]])
I've been playing around with the following to no avail:
a = numpy.loadtxt('myfile.txt',dtype=(numpy.int_,numpy.float_,numpy.string_))
EDIT: Another approach might be to use the ndarray type and convert afterwards.
b = numpy.loadtxt('myfile.txt',dtype=numpy.ndarray)
array([['1', '2.0000', 'buckle_my_shoe'],
['3', '4.0000', 'margery_door']], dtype=object)
Use numpy.genfromtxt:
import numpy as np
np.genfromtxt('myfile.txt', dtype=None)
# array([(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')],
# dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '|S14')])
Pandas can do that for you via DataFrame.from_csv (see the pandas docs).
Assuming your columns are tab separated, this should do the trick:
df = DataFrame.from_csv('myfile.txt', sep='\t')
array = df.values # the array you are interested in
I'm learning Matplotlib, and trying to implement a simple linear regression by hand.
However, I've run into a problem when importing and then working with my data after using csv2rec.
import matplotlib.mlab

data = matplotlib.mlab.csv2rec('KC_Filtered01.csv', delimiter=',')
x = data['list_price']
y = data['square_feet']
sumx = x.sum()
sumy = y.sum()
sumxSQ = sum([sq**2 for sq in x])
sumySQ = sum([sq**2 for sq in y])
I'm reading in a list of housing prices and trying to get the sum of their squares. However, when csv2rec reads the prices from the file, it stores the values as int32. Since the sum of the squared prices is greater than a 32-bit integer can hold, it overflows. I don't see a way to change the data type that csv2rec assigns when it reads the file. How can I change the data type when the array is read in or assigned?
x = data['list_price'].astype('int64')
and the same with y.
And: csv2rec has a converterd argument (a dictionary of per-column converter functions): http://matplotlib.sourceforge.net/api/mlab_api.html#matplotlib.mlab.csv2rec
Instead of mlab.csv2rec, you can use the equivalent NumPy function numpy.loadtxt (documentation) to read your data. This function has an argument to specify the dtype of your data.
Or, if you want to work with column names (as in your example code), use numpy.genfromtxt (documentation). This is like loadtxt but with more options, such as reading the column names from the first line of your file (with names=True).
An example of its usage:
In [9]:
import numpy as np
from StringIO import StringIO
data = StringIO("a, b, c\n 1, 2, 3\n 4, 5, 6")
np.genfromtxt(data, names=True, dtype = 'int64', delimiter = ',')
Out[9]:
array([(1L, 2L, 3L), (4L, 5L, 6L)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
Another remark on your code: when using numpy arrays you don't have to use for-loops. To calculate the squares, you can just do:
xSQ = x**2
sumxSQ = xSQ.sum()
or in one line:
sumxSQ = numpy.sum(x**2)