For example, if I have a data.csv file, it looks like this:
   A  B
1| 1  2
2| 3  4
If I load this data this way:
import numpy as np
from StringIO import StringIO

with open('data.csv', 'rb') as text:
    b = np.genfromtxt(StringIO(text.read()),
                      delimiter=",",
                      dtype=[('x', int), ('y', int)])  # <== in my case I want my
                                                       #     columns' dtypes in
                                                       #     different formats
print b.transpose()
it still gives me this, which has not been transposed:
[(1, 2) (3, 4)]
But if I use dtype='str' (which means all columns have the same format), np.transpose() works fine.
===============================================
But if I build the array directly like this:
import numpy as np
a = np.array([(1,2), (3,4)])
print a.transpose()
it works correctly
======= MY QUESTION =======
If I load numpy data from a txt or csv file and define each column's data type, is there any way to make the transpose work properly?
This is what .T is designed to do; see the documentation:
T Same as self.transpose(), except that self is returned if self.ndim < 2.
Here you have ndim of 1 for the b array (you may think it has a shape of (2,2) but it has a shape of (2,), be aware), so T returns itself.
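If you need a plain 2-D array that you can transpose, one workaround (a minimal sketch, assuming all fields share the same dtype, which the structured dtype above does) is to stack the fields into an ordinary array:
import numpy as np

# A structured array like the one genfromtxt returns: shape (2,), fields x and y
b = np.array([(1, 2), (3, 4)], dtype=[('x', int), ('y', int)])

# Stack the fields column-wise into a regular (2, 2) int array, then transpose
b2d = np.column_stack([b[name] for name in b.dtype.names])
print(b2d.transpose())
# [[1 3]
#  [2 4]]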
I have an array that is composed of multiple NumPy arrays. I want to give every array a key and save them to an HDF5 file.
arr = np.concatenate((Hsp_data, Hsp_rdiff, PosC44_WKS, PosX_WKS, PosY_WKS, PosZ_WKS,
RMS_Acc_HSp, RMS_Acc_Rev, RMS_Schall, Rev_M, Rev_rdiff, X_rdiff, Z_I, Z_rdiff, time), axis=1)
d1 = np.random.random(size=(7501, 15))
hf = h5py.File('data.hdf5', 'w')
hf.create_dataset('arr', data=d1)
hf.close()
hf = h5py.File('data.hdf5', 'r+')
print(hf.key)
This is what I have done so far, and I get this error: AttributeError: 'File' object has no attribute 'key'.
I want the final answer to be like this when printing the keys
<KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS', 'PosX_WKS', 'PosY_WKS', 'PosZ_WKS', 'RMS_Acc_HSp', 'RMS_Acc_Rev', 'RMS_Schall', 'Rev_M', 'Rev_rdiff', 'X_rdiff', 'Z_I', 'Z_rdiff']>
any ideas?
You/we need a clearer idea of how the original .mat is laid out. In h5py, the file is viewed as a nested set of groups, which are dict-like; hence the use of keys(). At the ends of that nesting are datasets, which can be loaded (or saved from) as numpy arrays. The datasets/arrays don't have keys; it's the file and the groups that have those.
Creating your file:
In [69]: import h5py
In [70]: d1 = np.random.random(size=(7501, 15))
...: hf = h5py.File('data.hdf5', 'w')
...: hf.create_dataset('arr', data=d1)
...: hf.close()
Reading it:
In [71]: hf = h5py.File('data.hdf5', 'r+')
In [72]: hf.keys()
Out[72]: <KeysViewHDF5 ['arr']>
In [73]: hf['arr']
Out[73]: <HDF5 dataset "arr": shape (7501, 15), type "<f8">
In [75]: arr = hf['arr'][:]
In [76]: arr.shape
Out[76]: (7501, 15)
'arr' is the name of the dataset that we created at the start. In this case there's no group, just the one dataset. Line [75] loads the dataset into an array which I called arr, but that name could be anything (like the original d1).
Arrays and datasets may have a compound dtype, which has named fields. I don't know if MATLAB uses those or not.
Without knowledge of the group and dataset layout in the original .mat, it's hard to help you. And when looking at datasets, pay particular attention to shape and dtype.
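If the goal is one key per original array, as in the KeysViewHDF5 output you want, a minimal sketch (assuming each of the named variables is a 1-D array of equal length; the names are taken from your question) is to create one dataset per array instead of concatenating them:
import h5py
import numpy as np

# Hypothetical stand-ins for the arrays named in the question
Hsp_data = np.random.random(7501)
Hsp_rdiff = np.random.random(7501)
# ... and so on for the remaining columns

arrays = {'Hsp_M': Hsp_data, 'Hsp_rdiff': Hsp_rdiff}   # extend with the other arrays

with h5py.File('data.hdf5', 'w') as hf:
    for name, arr in arrays.items():
        hf.create_dataset(name, data=arr)

with h5py.File('data.hdf5', 'r') as hf:
    print(hf.keys())   # <KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff']>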
I would like to convert a multi-dimensional Numpy array into a string and, later, convert that string back into an equivalent Numpy array.
I do not want to save the Numpy array to a file (e.g. via the savetxt and loadtxt interface).
Is this possible?
You could use np.tostring and np.fromstring:
In [138]: x = np.arange(12).reshape(3,4)
In [139]: x.tostring()
Out[139]: '\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00'
In [140]: np.fromstring(x.tostring(), dtype=x.dtype).reshape(x.shape)
Out[140]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Note that the string returned by tostring does not save the dtype nor the shape of the original array. You have to re-supply those yourself.
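On current NumPy releases tostring/fromstring are deprecated; as far as I know the equivalent calls are tobytes and frombuffer, used the same way (and with the same caveat about dtype and shape):
import numpy as np

x = np.arange(12).reshape(3, 4)
s = x.tobytes()                                        # raw bytes, no dtype or shape
y = np.frombuffer(s, dtype=x.dtype).reshape(x.shape)   # re-supply dtype and shape
print(np.array_equal(x, y))                            # True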
Another option is to use np.save or np.savez or np.savez_compressed to write to a io.BytesIO object (instead of a file):
import numpy as np
import io
x = np.arange(12).reshape(3,4)
output = io.BytesIO()
np.savez(output, x=x)
The string is given by
content = output.getvalue()
And given the string, you can load it back into an array using np.load:
data = np.load(io.BytesIO(content))
x = data['x']
This method stores the dtype and shape as well.
For large arrays, np.savez_compressed will give you the smallest string.
Similarly, you could use np.savetxt and np.loadtxt:
import numpy as np
import io
x = np.arange(12).reshape(3,4)
output = io.BytesIO()
np.savetxt(output, x)
content = output.getvalue()
# '0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00 3.000000000000000000e+00\n4.000000000000000000e+00 5.000000000000000000e+00 6.000000000000000000e+00 7.000000000000000000e+00\n8.000000000000000000e+00 9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01\n'
x = np.loadtxt(io.BytesIO(content))
print(x)
Summary:
- tostring gives you the underlying data as a string, with no dtype or shape
- save is like tostring except it also saves dtype and shape (.npy format); see the round-trip sketch after this list
- savez saves the array in npz format (uncompressed)
- savez_compressed saves the array in compressed npz format
- savetxt formats the array in a human-readable format
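For completeness, a minimal round-trip sketch for np.save / np.load through a BytesIO buffer (the single-array .npy variant, not shown above):
import io
import numpy as np

x = np.arange(12).reshape(3, 4)

output = io.BytesIO()
np.save(output, x)              # .npy format: data plus dtype and shape
content = output.getvalue()

y = np.load(io.BytesIO(content))
print(np.array_equal(x, y))     # True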
If you want to preserve the dtype (and shape) as well, you can also use Python's pickle module:
import pickle
import numpy as np
a = np.ones(4)
string = pickle.dumps(a)
pickle.loads(string)
np.tostring and np.fromstring do NOT work anymore; their replacements (tobytes/frombuffer) give you bytes rather than a string. If you want an actual string, one option is ast.literal_eval. However, ast.literal_eval() cannot handle deeply nested lists of lists of floats when reading the dump back, so it is easier to flatten the data into a dict, dump that dict as a string, and, when loading, convert the string back to a dict and the dict back into an array:
import ast
import numpy as np

k = np.array([[[0.09898942, 0.22804536], [0.06109612, 0.19022354], [0.93369348, 0.53521671], [0.64630094, 0.28553219]],
              [[0.94503154, 0.82639528], [0.07503319, 0.80149062], [0.1234832, 0.44657691], [0.7781163, 0.63538195]]])
d = dict(enumerate(k.flatten(), 1))
d = str(d)                                # dump as a string (pickle and other packages produce bytes)
m = ast.literal_eval(d)                   # convert the string back to a dict
m = np.fromiter(m.values(), dtype=float)  # convert the dict values back to a flat ndarray
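Note that this gives back a flat array; if the original shape is known, it can be restored afterwards, e.g.:
m = m.reshape(k.shape)   # back to the original (2, 4, 2) shape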
I use JSON to do that:
1. Encode to JSON
The first step is to encode it to JSON:
import json
import numpy as np
np_array = np.array(
[[[0.2123842 , 0.45560746, 0.23575005, 0.40605248],
[0.98393952, 0.03679023, 0.6192098 , 0.00547201],
[0.13259942, 0.69461942, 0.8781533 , 0.83025555]],
[[0.8398132 , 0.98341709, 0.25296835, 0.84055815],
[0.27619265, 0.55544911, 0.56615598, 0.058845 ],
[0.76205113, 0.18001961, 0.68206229, 0.47252472]]])
json_array = json.dumps(np_array.tolist())
print("converted to: " + str(type(json_array)))
print("looks like:")
print(json_array)
Which results in this:
converted to: <class 'str'>
looks like:
[[[0.2123842, 0.45560746, 0.23575005, 0.40605248], [0.98393952, 0.03679023, 0.6192098, 0.00547201], [0.13259942, 0.69461942, 0.8781533, 0.83025555]], [[0.8398132, 0.98341709, 0.25296835, 0.84055815], [0.27619265, 0.55544911, 0.56615598, 0.058845], [0.76205113, 0.18001961, 0.68206229, 0.47252472]]]
2. Decode back to Numpy
To convert it back to a numpy array you can use:
list_from_json = json.loads(json_array)
np.array(list_from_json)
print("converted to: " + str(type(list_from_json)))
print("converted to: " + str(type(np.array(list_from_json))))
print(np.array(list_from_json))
Which gives you:
converted to: <class 'list'>
converted to: <class 'numpy.ndarray'>
[[[0.2123842 0.45560746 0.23575005 0.40605248]
[0.98393952 0.03679023 0.6192098 0.00547201]
[0.13259942 0.69461942 0.8781533 0.83025555]]
[[0.8398132 0.98341709 0.25296835 0.84055815]
[0.27619265 0.55544911 0.56615598 0.058845 ]
[0.76205113 0.18001961 0.68206229 0.47252472]]]
I like this method because the string is easy to read and, although storing it in a file or similar wasn't needed in this case, that works as well with this format.
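If the round trip also needs to preserve the exact dtype and shape (the plain tolist() route above falls back to whatever np.array infers), one hedged variation is to store that metadata alongside the data:
import json
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)

# Encode the data together with its dtype and shape
s = json.dumps({'data': x.tolist(),
                'dtype': str(x.dtype),
                'shape': x.shape})

# Decode and rebuild the array with the same dtype and shape
d = json.loads(s)
y = np.array(d['data'], dtype=d['dtype']).reshape(d['shape'])
print(y.dtype, y.shape)   # float32 (2, 3)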
I'm trying to initialize a NumPy array that contains named tuples. Everything works fine when I initialize the array with empty data and set that data afterwards; when using the numpy.array constructor, however, NumPy doesn't do what I had expected.
The output of
import numpy
data = numpy.random.rand(10, 3)
print data[0]
# Works
a = numpy.empty(
len(data),
dtype=numpy.dtype([('nodes', (float, 3))])
)
a['nodes'] = data
print
print a[0]['nodes']
# Doesn't work
b = numpy.array(
data,
dtype=numpy.dtype([('nodes', (float, 3))])
)
print
print b[0]['nodes']
is
[ 0.28711363 0.89643579 0.82386232]
[ 0.28711363 0.89643579 0.82386232]
[[ 0.28711363 0.28711363 0.28711363]
[ 0.89643579 0.89643579 0.89643579]
[ 0.82386232 0.82386232 0.82386232]]
This is with NumPy 1.8.1.
Any hints on how to organize the array constructor?
This is awful, but:
Starting with your example copied and pasted into an IPython session, try
dtype=numpy.dtype([('nodes', (float, 3))])
c = numpy.array([(aa,) for aa in data], dtype=dtype)
it seems to do the trick.
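A quick sanity check (not in the original answer) that the field lines up with the source data:
print(c['nodes'].shape)                    # (10, 3)
print(numpy.allclose(c['nodes'], data))    # True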
It's instructive to construct a different array:
dt3=np.dtype([('x','<f8'),('y','<f8'),('z','<f8')])
b=np.zeros((10,),dtype=dt3)
b[:]=[tuple(x) for x in data]
b['x'] = data[:,0] # alt
np.array([tuple(x) for x in data],dtype=dt3) # or in one statement
a[:1]
# array([([0.32726803375966484, 0.5845638956708634, 0.894278688117277],)], dtype=[('nodes', '<f8', (3,))])
b[:1]
# array([(0.32726803375966484, 0.5845638956708634, 0.894278688117277)], dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
I don't think there's a way of assigning data to all the fields of b without some sort of iteration.
genfromtxt is a common way of generating record arrays like this. Looking at its code I see a pattern like:
data = list(zip(*[...]))
output = np.array(data, dtype)
Which inspired me to try:
dtype=numpy.dtype([('nodes', (float, 3))])
a = np.array(zip(data), dtype=dtype)
(The speed is basically the same as eickenberg's comprehension, so it's doing the same pure-Python list operations.)
And for the 3 fields:
np.array(zip(*data.T), dtype=dt3)
Curiously, explicitly converting to a list first is even faster (almost 2x faster than the zip(data) version):
np.array(zip(*data.T.tolist()), dtype=dt3)
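One caveat, assuming you move beyond Python 2: zip returns an iterator there, so the calls above would need an explicit list, e.g.:
np.array(list(zip(data)), dtype=dtype)    # one 3-wide 'nodes' field
np.array(list(zip(*data.T)), dtype=dt3)   # three scalar fields x, y, z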
I am running Python 2.6. I have the following example where I am trying to concatenate the date and time string columns from a csv file. Based on the dtype I set (None vs object), I am seeing some differences in behavior that I cannot explain; see Questions 1 and 2 at the end of the post. The exception returned is not very descriptive, and the dtype documentation doesn't mention any specific behavior to expect when dtype is set to object.
Here is the snippet:
#! /usr/bin/python
import numpy as np
# simulate a csv file
from StringIO import StringIO
data = StringIO("""
Title
Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87
""".strip())
# (Fail) case 1: dtype=None splicing a column fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr1 = np.genfromtxt(data, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
arr1.dtype.names = header # assign the header to names
# so we can do y=arr['Speed']
y1 = arr1['Speed']
# Q1 IndexError: invalid index
#a1 = arr1[:,0]
#print a1
# EDIT1:
print "arr1.shape "
print arr1.shape # (3,)
# Fails as expected TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
# z1 = arr1['Date'] + arr1['Time']
# This can be workaround by specifying dtype=object, which leads to case 2
data.seek(0) # resets
# (Fail) case 2: dtype=object assign header fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr2 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
# Q2 ValueError: there are no fields define
#arr2.dtype.names = header # assign the header to names. so we can use it to do indexing
# ie y=arr['Speed']
# y2 = arr['Date'] + arr['Time'] # column headings were assigned previously by arr.dtype.names = header
data.seek(0) # resets
# (Good) case 3: dtype=object but don't assign headers
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr3 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
y3 = arr3[:,0] + arr3[:,1] # slice the columns
print y3
# case 4: dtype=None, all data are ints, array dimension 2-D
# simulate a csv file
from StringIO import StringIO
data2 = StringIO("""
Title
Date,Time,Speed
,,(m/s)
45,46,85
12,13,86
50,46,87
""".strip())
next(data2) # eat away the title line
header = [item.strip() for item in next(data2).split(',')] # get the headers
arr4 = np.genfromtxt(data2, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
#arr4.dtype.names = header # Value error
print "arr4.shape "
print arr4.shape # (3,3)
data2.seek(0) # resets
Question 1: At comment Q1, why can I not slice a column, when dtype=None?
This can be avoided if
a) arr1 = np.genfromtxt(...) is initialized with dtype=object as in case 3, or
b) arr1.dtype.names = ... is commented out to avoid the ValueError as in case 2.
Question 2: At comment Q2, why can I not set the dtype.names when dtype=object?
EDIT1:
Added a case 4 that shows that the array is 2-D when the values in the simulated csv file are all ints instead. One can slice a column, but assigning dtype.names still fails.
Updated the term 'splice' to 'slice'.
Question 1
This is indexing, not 'splicing', and you can't index into the columns of data for exactly the same reason I explained to you before in my answer to Question 7 here. Look at arr1.shape - it is (3,), i.e. arr1 is 1D, not 2D. There are no columns for you to index into.
Now look at the shape of arr2 - you'll see that it's (3,3). Why is this? If you do specify dtype=desired_type, np.genfromtxt will treat every delimited part of your input string the same (i.e. as desired_type), and it will give you an ordinary, non-structured numpy array back.
I'm not quite sure what you wanted to do with this line:
z1 = arr1['Date'] + arr1['Time']
Did you mean to concatenate the date and time strings together like this: '2012-04-01 00:10'? You could do it like this:
z1 = [d + ' ' + t for d,t in zip(arr1['Date'],arr1['Time'])]
It depends what you want to do with the output (this will give you a list of strings, not a numpy array).
I should point out that, as of version 1.7, Numpy has core array types that support datetime functionality. This would allow you to do much more useful things like computing time deltas etc.
dts = np.array(z1,dtype=np.datetime64)
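For example, differences between consecutive timestamps then come straight out as timedelta64 values; a small sketch using the dates from the sample data:
z1 = ['2012-04-01 00:10', '2012-04-02 00:20', '2012-04-03 00:30']
dts = np.array(z1, dtype=np.datetime64)
print(np.diff(dts))   # [1450 1450] -- timedelta64 in minutes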
Edit:
If you want to plot timeseries data, you can use matplotlib.dates.strpdate2num to convert your strings to matplotlib datenums, then use plot_date():
from matplotlib import dates
from matplotlib import pyplot as pp
# convert date and time strings to matplotlib datenums
dtconv = dates.strpdate2num('%Y-%m-%d%H:%M')
datenums = [dtconv(d+t) for d,t in zip(arr1['Date'],arr1['Time'])]
# use plot_date to plot timeseries
pp.plot_date(datenums,arr1['Speed'],'-ob')
You should also take a look at Pandas, which has some nice tools for visualising timeseries data.
Question 2
You can't set the names of arr2 because it is not a structured array (see above).
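As a side note (not part of the original answer), here is a minimal sketch of getting named-field access directly from genfromtxt by passing names, rather than patching dtype.names afterwards:
import numpy as np
from StringIO import StringIO   # io.StringIO on Python 3

data = StringIO("""2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87""")

# dtype=None lets genfromtxt pick a per-column type; names gives field access
arr = np.genfromtxt(data, dtype=None, delimiter=',',
                    names=['Date', 'Time', 'Speed'])
print(arr['Speed'])   # [85 86 87]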
I'm learning Matplotlib, and trying to implement a simple linear regression by hand.
However, I've run into a problem when importing and then working with my data after using csv2rec.
data= matplotlib.mlab.csv2rec('KC_Filtered01.csv',delimiter=',')
x = data['list_price']
y = data['square_feet']
sumx = x.sum()
sumy = y.sum()
sumxSQ = sum([sq**2 for sq in x])
sumySQ = sum([sq**2 for sq in y])
I'm reading in a list of housing prices and trying to get the sum of the squares. However, when csv2rec reads the prices from the file, it stores the values as int32. Since the sum of the squares of the housing prices is greater than a 32-bit integer can hold, it overflows. I don't see a way of changing the data type that csv2rec assigns when it reads the file. How can I change the data type when the array is read in or assigned?
x = data['list_price'].astype('int64')
and the same with y.
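Putting that together with the snippet from the question (a sketch assuming the same column names):
x = data['list_price'].astype('int64')
y = data['square_feet'].astype('int64')

sumxSQ = (x ** 2).sum()   # no longer overflows int32
sumySQ = (y ** 2).sum()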
And: csv2rec has a converterd argument: http://matplotlib.sourceforge.net/api/mlab_api.html#matplotlib.mlab.csv2rec
Instead of mlab.csv2rec, you can use an equivalent function of numpy, numpy.loadtxt (documentation), to read your data. This function has an argument to specify the dtype of your data.
Or if you want to work with column names (as in your example code), the function numpy.genfromtxt (documentation). This is like loadtxt, but with more options, such as to read in the column names from the first line of your file (with names = True).
An example of its usage:
In [9]:
import numpy as np
from StringIO import StringIO
data = StringIO("a, b, c\n 1, 2, 3\n 4, 5, 6")
np.genfromtxt(data, names=True, dtype = 'int64', delimiter = ',')
Out[9]:
array([(1L, 2L, 3L), (4L, 5L, 6L)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
Another remark on your code: when using numpy arrays you don't have to use for-loops. To calculate the square, you can just do:
xSQ = x**2
sumxSQ = xSQ.sum()
or in one line:
sumxSQ = numpy.sum(x**2)