I have an array that is composed of multiple np arrays. I want to give every array a key and convert it to an HDF5 file
arr = np.concatenate((Hsp_data, Hsp_rdiff, PosC44_WKS, PosX_WKS, PosY_WKS, PosZ_WKS,
RMS_Acc_HSp, RMS_Acc_Rev, RMS_Schall, Rev_M, Rev_rdiff, X_rdiff, Z_I, Z_rdiff, time), axis=1)
d1 = np.random.random(size=(7501, 15))
hf = h5py.File('data.hdf5', 'w')
hf.create_dataset('arr', data=d1)
hf.close()
hf = h5py.File('data.hdf5', 'r+')
print(hf.key)
This is what I have done so far, and I get this error: AttributeError: 'File' object has no attribute 'key'.
I want the final answer to be like this when printing the keys
<KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS', 'PosX_WKS', 'PosY_WKS', 'PosZ_WKS', 'RMS_Acc_HSp', 'RMS_Acc_Rev', 'RMS_Schall', 'Rev_M', 'Rev_rdiff', 'X_rdiff', 'Z_I', 'Z_rdiff']>
any ideas?
You/we need a clearer idea of how the original .mat is laid out. In h5py, the file is viewed as a nested set of groups, which are dict-like; hence the use of keys(). At the ends of that nesting are datasets, which can be loaded into (or created from) numpy arrays. The datasets/arrays don't have keys; it's the file and groups that have those.
Creating your file:
In [69]: import h5py
In [70]: d1 = np.random.random(size=(7501, 15))
...: hf = h5py.File('data.hdf5', 'w')
...: hf.create_dataset('arr', data=d1)
...: hf.close()
Reading it:
In [71]: hf = h5py.File('data.hdf5', 'r+')
In [72]: hf.keys()
Out[72]: <KeysViewHDF5 ['arr']>
In [73]: hf['arr']
Out[73]: <HDF5 dataset "arr": shape (7501, 15), type "<f8">
In [75]: arr = hf['arr'][:]
In [76]: arr.shape
Out[76]: (7501, 15)
'arr' is the name of the dataset that we created at the start. In this case there's no group; just the one dataset. [75] loads the dataset to an array which I called arr, but that name could be anything (like the original d1).
Arrays and datasets may have a compound dtype, which has named fields. I don't know if MATLAB uses those or not.
Without knowledge of the group and dataset layout in the original .mat, it's hard to help you. And when looking at datasets, pay particular attention to shape and dtype.
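If the goal is just the keys() output shown in the question, one option is to skip the concatenate and write one dataset per array, so each name becomes a key in the file. A minimal sketch (with random stand-ins for your real columns, since I don't have them):
import numpy as np
import h5py

names = ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS', 'PosX_WKS', 'PosY_WKS', 'PosZ_WKS',
         'RMS_Acc_HSp', 'RMS_Acc_Rev', 'RMS_Schall', 'Rev_M', 'Rev_rdiff',
         'X_rdiff', 'Z_I', 'Z_rdiff']
arrays = {name: np.random.random(7501) for name in names}   # stand-ins for the real columns

with h5py.File('data.hdf5', 'w') as hf:
    for name, value in arrays.items():
        hf.create_dataset(name, data=value)      # one dataset per key

with h5py.File('data.hdf5', 'r') as hf:
    print(hf.keys())   # <KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff', ...]>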
I am trying to write a program which puts data into a .h5 file. There should be 3 columns: one with the number of the variable (from counter in the for loop), one with the name of the variable (the 2nd column of list_of_vars), and one with its unit (the 3rd column of list_of_vars).
Code is below:
import numpy as np
import h5py as h5
list_of_vars = [
('ADC_alt', 'ADC_alt', 'ft'),
('ADC_temp', 'ADC_temp', 'degC'),
('ADC_ias', 'ADC_ias', 'kts'),
('ADC_tas', 'ADC_tas', 'kts'),
('ADC_aos', 'ADC_aos', 'deg'),
('ADC_aoa', 'ADC_aoa', 'deg'),
]
#write new h5 file
var = h5.File('telemetry.h5','w')
for counter, val in enumerate(list_of_vars):
    varnum = var.create_dataset('n°', (6,), data = counter)
    varname = var.create_dataset('Variable name', (6,), dtype = 'str_', data = val[1])
    varunit = var.create_dataset('Unit', (6,), dtype = 'str_', data = val[2])
    data = np.array(varname,varunit)
    print(data)
However, when I run it, I get the error ValueError: Shape tuple is incompatible with data
What is wrong here?
Lots of little problems to correct. If I understand correctly, you want to create ONE heterogeneous dataset (with 1 field (column) of ints named 'n°', and 2 fields (columns) of strings named 'Variable name' and 'Unit'). What your code is trying to create is 18 separate datasets (3 created on each pass through enumerate(list_of_vars)).
There is a trick when working with heterogeneous datasets: if you add data row-wise, you have to reference both the dataset row and the field name, OR add the entire row at once. I prefer to load data field/column-wise. Generally you have fewer fields than rows -- fewer loops == fewer write cycles == faster.
Here is the process you want. It creates the dataset, then fills each field: 'n°' from a count, 'Variable name' from the 2nd item of each tuple, and 'Unit' from the 3rd. At the end it reads the data back from the dataset and prints it. Code below:
#write new h5 file
with h5.File('telemetry.h5','w') as var:
    dt = np.dtype([('n°', int), ('Variable name', 'S10'), ('Unit', 'S10')])
    dset = var.create_dataset('data', dtype=dt, shape=(len(list_of_vars),))
    dset['n°'] = np.arange(len(list_of_vars))
    dset['Variable name'] = [val[1] for val in list_of_vars]
    dset['Unit'] = [val[2] for val in list_of_vars]
    data = dset[:]
    print(data)
If you prefer to use the enumerate loop, use this method. It loads items by row index. For completeness, it also shows how to index the dataset by [row, field name], but I do not recommend it.
#write new h5 file
with h5.File('telemetry.h5','w') as var:
    dt = np.dtype([('n°', int), ('Variable name', 'S10'), ('Unit', 'S10')])
    dset = var.create_dataset('data', dtype=dt, shape=(len(list_of_vars),))
    for counter, val in enumerate(list_of_vars):
        dset[counter] = (counter, val[1], val[2])
        # alternate row/field indexing method:
        # dset[counter,'n°'] = counter
        # dset[counter,'Variable name'] = val[1]
        # dset[counter,'Unit'] = val[2]
    data = dset[:]
    print(data)
When asking about a problem, show the whole error.
When I run your code I get:
1129:~/mypy$ python3 stack68181330.py
Traceback (most recent call last):
File "stack68181330.py", line 18, in <module>
varnum = var.create_dataset('n°', (6,), data = counter)
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 149, in create_dataset
dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 61, in make_new_dset
raise ValueError("Shape tuple is incompatible with data")
ValueError: Shape tuple is incompatible with data
It's having problems with your first dataset creation:
var.create_dataset('n°', (6,), data = counter)
What are you saying here? Make a dataset with name 'n°', and shape (6,) - 6 elements. But what is counter? It's the current enumerate value, one integer. Do you see how the shape (6,) doesn't match the data?
The other dataset lines potentially have similar problems.
It doesn't get on to another problem. You are doing the create_dataset repeatedly in the loop. Once you have a dataset named 'n°', you can't create another with the same name.
I suspect you want to make one dataset, with 6 slots, and repeatedly assign counter values to it. Not to repeatedly create a dataset with the same name.
Let's change the dataset creation and write to something that works:
varnum = var.create_dataset('n°', (6,), dtype=int)
varname = var.create_dataset('Variable name', (6,), dtype = 'S10')
varunit = var.create_dataset('Unit', (6,), dtype = 'S10')
for counter, val in enumerate(list_of_vars):
    varnum[counter] = counter
    varname[counter] = val[1]
    varunit[counter] = val[2]
var.flush()
print(varnum, varnum[:])
print(varname)
print(varname[:])
print(varunit)
print(varunit[:])
and run:
1144:~/mypy$ python3 stack68181330.py
<HDF5 dataset "n°": shape (6,), type "<i8"> [0 1 2 3 4 5]
<HDF5 dataset "Variable name": shape (6,), type "|S10">
[b'ADC_alt' b'ADC_temp' b'ADC_ias' b'ADC_tas' b'ADC_aos' b'ADC_aoa']
<HDF5 dataset "Unit": shape (6,), type "|S10">
[b'ft' b'degC' b'kts' b'kts' b'deg' b'deg']
The b'ft' display shows bytestrings, the result of using the S10 dtype. I think there are ways of specifying unicode, but I haven't looked at the h5py docs in a while.
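For what it's worth, recent h5py does provide a string dtype helper. The following continues the session above and is only a sketch, assuming h5py 3.x:
str_dt = h5.string_dtype(encoding='utf-8')                 # variable-length UTF-8
varname_u = var.create_dataset('Variable name u', (6,), dtype=str_dt)
varname_u[:] = [val[1] for val in list_of_vars]
# h5py 3.x still reads string data back as bytes by default; asstr() gives a decoded view
print(varname_u.asstr()[:])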
There are simpler ways of writing this data, but I chose to keep it close to your attempt, to better illustrate the basics of both Python iteration, and h5py use.
I could write the data directly to the datasets, without iteration. First, make an array from the list:
arr = np.array(list_of_vars, dtype='S')
print(arr)
varnum = var.create_dataset('n°', data=np.arange(arr.shape[0]))
varname = var.create_dataset('Variable name', data=arr[:,1])
varunit = var.create_dataset('Unit', data=arr[:,2])
I let it deduce shape and dtype from the data.
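As a quick sanity check (a sketch; it assumes the file was closed after writing), the three datasets can be read back and paired up:
with h5.File('telemetry.h5', 'r') as f:
    for n, name, unit in zip(f['n°'][:], f['Variable name'][:], f['Unit'][:]):
        print(n, name.decode(), unit.decode())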
I have a dict of scipy.sparse.csr_matrix objects as values, with integer keys. How can I save this in a separate file?
If I had a regular ndarray for each entry, then I could serialize it with json, but when I try this with a sparse matrix:
with open('filename.txt', 'w') as f:
f.write(json.dumps(the_matrix))
I get a TypeError:
TypeError: <75x75 sparse matrix of type '<type 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format> is not JSON serializable
How can I save my dictionary with keys that are integers and values that are sparse csr matrices?
I faced this same issue trying to save a dictionary whose values are csr_matrix objects. I dumped it to disk using pickle. The file handle should be opened in "wb" mode.
import pickle
pickle.dump(csr_dict_obj, open("csr_dict.pkl","wb"))
Load the dict back using:
csr_dict = pickle.load(open("csr_dict.pkl","rb"))
Newer scipy versions have a scipy.sparse.save_npz function (and a corresponding load_npz). It saves the attributes of a sparse matrix to a numpy savez zip archive. In the case of a csr matrix it saves the data, indices and indptr arrays, plus the shape.
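For a dict of matrices, one hedged way to use save_npz is one file per key; the file-name scheme below is just an illustration, and the matrices are random dummies:
import scipy.sparse as sp

the_matrix = {0: sp.random(75, 75, density=0.01, format='csr'),
              1: sp.random(75, 75, density=0.01, format='csr')}   # dummy data

for k, m in the_matrix.items():
    sp.save_npz('matrix_%d.npz' % k, m)     # one .npz per integer key

loaded = {k: sp.load_npz('matrix_%d.npz' % k) for k in the_matrix}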
scipy.io.savemat can save a sparse matrix in a MATLAB compatible format (csc). There are one or two other scipy.io formats that can handle sparse matrices, but I haven't worked with them.
While a sparse matrix contains numpy arrays it isn't an array subclass, so the numpy functions can't be used directly.
np.save is numpy's own pickle-like format for arrays, and an array that contains objects falls back to pickle (if possible). So a pickle of a dictionary of arrays should work.
The sparse dok format is a subclass of dict, so might be pickleable. It might even work with json. But I haven't tried it.
By the way, a plain numpy array can't be jsoned either:
In [427]: json.dumps(np.arange(5))
TypeError: array([0, 1, 2, 3, 4]) is not JSON serializable
In [428]: json.dumps(np.arange(5).tolist())
Out[428]: '[0, 1, 2, 3, 4]'
dok doesn't work either; its keys are tuples of indices:
In [433]: json.dumps(M.todok())
TypeError: keys must be a string
MatrixMarket is a text format that handles sparse:
In [444]: io.mmwrite('test.mm', M)
In [446]: cat test.mm.mtx
%%MatrixMarket matrix coordinate integer general
%
1 5 4
1 2 1
1 3 2
1 4 3
1 5 4
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix, issparse
import re

def save_sparse_csr(filename, **kwargs):
    arg_dict = dict()
    for key, value in kwargs.items():
        if issparse(value):
            value = value.tocsr()
            arg_dict[key+'_data'] = value.data
            arg_dict[key+'_indices'] = value.indices
            arg_dict[key+'_indptr'] = value.indptr
            arg_dict[key+'_shape'] = value.shape
        else:
            arg_dict[key] = value
    np.savez(filename, **arg_dict)

def load_sparse_csr(filename):
    loader = np.load(filename)
    new_d = dict()
    finished_sparse_list = []
    sparse_postfix = ['_data', '_indices', '_indptr', '_shape']
    for key, value in loader.items():
        IS_SPARSE = False
        for postfix in sparse_postfix:
            if key.endswith(postfix):
                IS_SPARSE = True
                key_original = re.match('(.*)'+postfix, key).group(1)
                if key_original not in finished_sparse_list:
                    value_original = csr_matrix((loader[key_original+'_data'],
                                                 loader[key_original+'_indices'],
                                                 loader[key_original+'_indptr']),
                                                shape=loader[key_original+'_shape'])
                    new_d[key_original] = value_original.tolil()
                    finished_sparse_list.append(key_original)
                break
        if not IS_SPARSE:
            new_d[key] = value
    return new_d
You can write a wrapper as shown above.
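Example usage of that wrapper (a sketch with dummy matrices; np.savez wants string keys, so the integer keys are prefixed here, and the loader returns lil matrices because of the tolil() call):
from scipy.sparse import random as sparse_random

csr_dict = {0: sparse_random(75, 75, density=0.01, format='csr'),
            1: sparse_random(75, 75, density=0.01, format='csr')}

save_sparse_csr('csr_dict.npz', **{'m%d' % k: v for k, v in csr_dict.items()})
restored = load_sparse_csr('csr_dict.npz')   # keys 'm0', 'm1', values as lil matrices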
I have a complicated set of data that I have to do distance calculations on. Each record in the data set contains many different data types, so a record array or structured array appears to be the way to go. The problem is that when I do my distance calculations, the scipy spatial distance functions take arrays, and the record array gives me numpy voids. How do I make a record array of numpy arrays instead of numpy voids? Below is a very simple example of what I'm talking about.
import numpy
import scipy.spatial.distance as scidist
input_data = [
('340.9', '7548.2', '1192.4', 'set001.txt'),
('546.7', '9039.9', '5546.1', 'set002.txt'),
('456.3', '2234.8', '2198.8', 'set003.txt'),
('332.1', '1144.2', '2344.5', 'set004.txt'),
]
record_array = numpy.array(input_data,
dtype=[('d1', 'float64'), ('d2', 'float64'), ('d3', 'float64'), ('file', '|S20')])
The following code fails...
this_fails_and_makes_me_cry = record_array[['d1', 'd2', 'd3']]
scidist.pdist(this_fails_and_makes_me_cry)
I get this error....
Traceback (most recent call last):
File "/home/someguy/working_datasets/trial003/scrap.py", line 16, in <module>
scidist.pdist(record_array[['d1', 'd2', 'd3']])
File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1093, in pdist
raise ValueError('A 2-dimensional array must be passed.');
ValueError: A 2-dimensional array must be passed.
The error occurs because this_fails_and_makes_me_cry is an array of numpy.voids. To get it to work I have to convert each time like this...
this_works = numpy.array(map(list, record_array[['d1', 'd2', 'd3']]))
scidist.pdist(this_works)
Is it possible to create a record array of numpy arrays to begin with? Or is a numpy record/structured array restricted to numpy voids? It would be handy for the record array to contain the data in a format compatible with scipy's spatial distance functions so that I don't have to convert each time. Is this possible?
this_fails_and_makes_me_cry = record_array[['d1', 'd2', 'd3']]
creates a one-dimensional structured array, with fields d1, d2 and d3. pdist expects a two-dimensional array. Here's one way to create that two-dimensional array containing only the d fields of record_array.
(Note: The following won't work if the fields that you want to use for the distance calculation are not contiguous within the data type of the structured array record_array. See below for an alternative in that case.)
First, we make a new dtype, in which d1, d2 and d3 become a single field called d containing three floating point values:
In [61]: dt2 = dtype([('d', 'f8', 3), ('file', 'S20')])
Next, use the view method to create a view of record_array using this dtype:
In [62]: rav = record_array.view(dt2)
In [63]: rav
Out[63]:
array([([340.9, 7548.2, 1192.4], 'set001.txt'),
([546.7, 9039.9, 5546.1], 'set002.txt'),
([456.3, 2234.8, 2198.8], 'set003.txt'),
([332.1, 1144.2, 2344.5], 'set004.txt')],
dtype=[('d', '<f8', (3,)), ('file', 'S20')])
rav is not a copy--it is a view of the same block of memory used by record_array.
Now access field d to get the two-dimensional array:
In [64]: d = rav['d']
In [65]: d
Out[65]:
array([[ 340.9, 7548.2, 1192.4],
[ 546.7, 9039.9, 5546.1],
[ 456.3, 2234.8, 2198.8],
[ 332.1, 1144.2, 2344.5]])
d can be passed to pdist:
In [66]: pdist(d)
Out[66]:
array([ 4606.75875427, 5409.10137454, 6506.81395539, 7584.32432455,
8522.8149229 , 1107.27706108])
Note that instead of converting record_array to rav, you could use dt2 as the data type of record_array from the start, and just write d = record_array['d'].
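A sketch of that variant, restructuring the input so the three floats already form one field:
dt2 = numpy.dtype([('d', 'f8', 3), ('file', 'S20')])
record_array2 = numpy.array([([340.9, 7548.2, 1192.4], 'set001.txt'),
                             ([546.7, 9039.9, 5546.1], 'set002.txt'),
                             ([456.3, 2234.8, 2198.8], 'set003.txt'),
                             ([332.1, 1144.2, 2344.5], 'set004.txt')],
                            dtype=dt2)
d = record_array2['d']     # already a (4, 3) float array, ready for pdist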
If the fields in record_array that are used for the distance calculation are not contiguous in the structure, you'll first have to pull them out into a new array so they are contiguous:
In [83]: arr = record_array[['d1','d2','d3']]
Then take a view of arr and reshape to make it two-dimensional:
In [84]: d = arr.view(np.float64).reshape(-1,3)
In [85]: d
Out[85]:
array([[ 340.9, 7548.2, 1192.4],
[ 546.7, 9039.9, 5546.1],
[ 456.3, 2234.8, 2198.8],
[ 332.1, 1144.2, 2344.5]])
You can combine those into a single line, if that's more convenient:
In [86]: d = record_array[['d1', 'd2', 'd3']].view(np.float64).reshape(-1, 3)
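On newer numpy (1.16+, an assumption about the installed version), numpy.lib.recfunctions has a helper that does the same conversion without the view/reshape step:
from numpy.lib import recfunctions as rfn

# assumes the record_array from the question
d = rfn.structured_to_unstructured(record_array[['d1', 'd2', 'd3']])
scidist.pdist(d)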
I saved a couple of numpy arrays with np.save(), and put together they're quite huge.
Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?
Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and read the values from the arrays you wish to concatenate. More efficiently, you can also append new arrays to an already existing file on disk.
In either case you must choose the right order for the array (row-major or column-major).
The following examples illustrate how to concatenate along axis 0 and axis 1.
1) concatenate along axis=0
a = np.memmap('a.array', dtype='float64', mode='w+', shape=( 5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222
You can define a third array reading the same file as the first array to be concatenated (here a) in mode r+ (read and write), but with the shape of the final array you want to achieve after concatenation, like:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b
Concatenating along axis=0 does not require to pass order='C' because this is already the default order.
2) concatenate along axis=1
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222
The arrays saved on disk are actually flattened, so if you create c with mode=r+ and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up in the first row of c. You can easily avoid this by passing order='F' (column-major) to memmap:
c = np.memmap('a.array', dtype='float64', mode='r+',shape=(5000,4000), order='F')
c[:, 3000:] = b
Here you have an updated file 'a.array' with the concatenation result. You may repeat this process to concatenate two arrays at a time.
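For completeness, a sketch of the "third memmap in a new file" variant mentioned at the start, reusing a and b from case 1 and leaving the original files untouched:
c = np.memmap('c.array', dtype='float64', mode='w+', shape=(20000, 1000))
c[:5000, :] = a[:]      # data streams between the memmaps; neither array is fully loaded
c[5000:, :] = b[:]
c.flush()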
Related questions:
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
Maybe an alternative solution: I also had a single multidimensional array spread over multiple files which I only wanted to read. I solved this issue with dask concatenation.
import numpy as np
import dask.array as da
a = np.memmap('a.array', dtype='float64', mode='r', shape=( 5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))
c = da.concatenate([a, b], axis=0)
This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any numpy array, and when it comes time to calculate a result one calls compute.
Note that there are two caveats:
it is not possible to do in-place re-assignment e.g. c[::2] = 0 is not possible, so creative solutions are necessary in those cases.
this also means the original files can no longer be updated. To save results out, the dask store methods should be used. This method can again accept a memmapped array.
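A sketch of that last step, storing the concatenated dask array into a new memmapped file (the file name is illustrative):
out = np.memmap('c.array', dtype='float64', mode='w+', shape=c.shape)
da.store(c, out)     # writes the dask array chunk by chunk into the memmap
out.flush()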
If you use order='F', it leads to another problem: when you load the file the next time it will be quite a mess, even if you pass order='F' again. So my solution is below; I have tested it a lot and it works fine.
fp = your old memmap...
shape = fp.shape
data = your ndarray...
data_shape = data.shape
# concatenate along the last axis
concat_shape = data_shape[:-1] + (data_shape[-1] + shape[-1],)
print('concat shape: {}'.format(concat_shape))
# mode='r+' assumes new_file_name already exists with the right size; use mode='w+' to create it
new_fp = np.memmap(new_file_name, dtype='float32', mode='r+', shape=concat_shape)
if len(concat_shape) == 1:
    new_fp[:shape[0]] = fp[:]
    new_fp[shape[0]:] = data[:]
elif len(concat_shape) == 2:
    new_fp[:, :shape[-1]] = fp[:]
    new_fp[:, shape[-1]:] = data[:]
elif len(concat_shape) == 3:
    new_fp[:, :, :shape[-1]] = fp[:]
    new_fp[:, :, shape[-1]:] = data[:]
fp = new_fp
fp.flush()
I'm learning Matplotlib, and trying to implement a simple linear regression by hand.
However, I've run into a problem when importing and then working with my data after using csv2rec.
import matplotlib.mlab

data = matplotlib.mlab.csv2rec('KC_Filtered01.csv', delimiter=',')
x = data['list_price']
y = data['square_feet']
sumx = x.sum()
sumy = y.sum()
sumxSQ = sum([sq**2 for sq in x])
sumySQ = sum([sq**2 for sq in y])
I'm reading in a list of housing prices, and trying to get the sum of the squares. However, when csv2rec reads in the prices from the file, it stores the values as int32. Since the sum of the squares of the housing prices is greater than a 32-bit integer can hold, it overflows. I don't see a way to change the data type that csv2rec assigns when it reads the file. How can I change the data type when the array is read in or assigned?
x = data['list_price'].astype('int64')
and the same with y.
And: csv2rec has a converterd argument: http://matplotlib.sourceforge.net/api/mlab_api.html#matplotlib.mlab.csv2rec
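A sketch of the astype fix in context, assuming data was loaded with csv2rec as in the question:
x = data['list_price'].astype('int64')
y = data['square_feet'].astype('int64')

sumxSQ = (x ** 2).sum()   # no longer overflows int32
sumySQ = (y ** 2).sum()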
Instead of mlab.csv2rec, you can use the equivalent numpy function numpy.loadtxt to read your data. This function has an argument to specify the dtype of your data.
Or, if you want to work with column names (as in your example code), use numpy.genfromtxt. This is like loadtxt but with more options, such as reading the column names from the first line of your file (with names=True).
An example of its usage:
In [9]:
import numpy as np
from StringIO import StringIO
data = StringIO("a, b, c\n 1, 2, 3\n 4, 5, 6")
np.genfromtxt(data, names=True, dtype = 'int64', delimiter = ',')
Out[9]:
array([(1L, 2L, 3L), (4L, 5L, 6L)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
Another remark on your code: when using numpy arrays you don't have to use for loops. To calculate the squares, you can just do:
xSQ = x**2
sumxSQ = xSQ.sum()
or in one line:
sumxSQ = numpy.sum(x**2)