h5py extend dataset without knowing row index - python

EDIT: to show my situation more accurately
import h5py  # necessary for storing data
import numpy as np

dat = np.random.random([3, 2])

# create 'test.hdf5' if it does not exist, otherwise open it
with h5py.File('test.hdf5', 'a') as f:
    # create group or get ref to existing group
    group = f.require_group('test_group')
    # create dataset or get ref to existing dataset
    dataset = group.require_dataset('test_set', shape=(0, dat.shape[1]),
                                    maxshape=(None, dat.shape[1]),
                                    dtype=float, chunks=True)
    dataset_shape = dataset.shape                          # get shape of current dataset
    dataset_new_length = dataset_shape[0] + dat.shape[0]   # new row length
    dataset.resize((dataset_new_length, dat.shape[1]))     # increase row length of dataset
    dataset[dataset_shape[0]:dataset_new_length] = dat     # add new data to dataset
Problem
The first time I run this script, it works without problems. The second time, however, shape=(0, dat.shape[1]) no longer matches, because the dataset on disk now has shape=(3, dat.shape[1]). The whole point of require_dataset() is that you don't have to write a try/except construction, yet because of the required shape=() argument it does not work together with maxshape=(None, dat.shape[1]).
Also, shape=(None, dat.shape[1]) is not allowed.
Question
Is there a solution using require_dataset() that avoids a try/except construction like the one below? (If you replace the dataset = group.require_dataset(...) call with this piece of code, the script runs without problems.)
try:
    dataset = group.create_dataset('test_set', shape=(0, dat.shape[1]),
                                   maxshape=(None, dat.shape[1]),
                                   dtype=float, chunks=True)
except:
    dataset = group['test_set']
A better solution would be
A command where I don't need to know the current size of the dataset would be even better, something like this:
dataset = dataset.extend(data)
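Not an answer from the original thread, but for reference, here is a minimal sketch of wrapping the resize-and-assign steps above in a small helper, so the calling code does not need to know the current size. The helper name append_rows and the membership test with `name not in group` are my own choices, not something require_dataset() provides:

import h5py
import numpy as np

def append_rows(group, name, dat):
    # create the dataset the first time, otherwise get a reference to it
    if name not in group:
        dset = group.create_dataset(name, shape=(0, dat.shape[1]),
                                    maxshape=(None, dat.shape[1]),
                                    dtype=float, chunks=True)
    else:
        dset = group[name]
    start = dset.shape[0]
    dset.resize((start + dat.shape[0], dat.shape[1]))  # grow along the row axis
    dset[start:] = dat                                  # write the new rows at the end
    return dset

dat = np.random.random([3, 2])
with h5py.File('test.hdf5', 'a') as f:
    group = f.require_group('test_group')
    append_rows(group, 'test_set', dat)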

Related

Python h5py - 'Shape tuple is incompatible with data' error

I am trying to write a program which puts data into a .h5 file. There should be 3 columns: one with the number of the variable (from counter in the for loop), one with the name of the variable (the 2nd column of list_of_vars), and one with its unit (the 3rd column of list_of_vars).
Code is below:
import numpy as np
import h5py as h5

list_of_vars = [
    ('ADC_alt', 'ADC_alt', 'ft'),
    ('ADC_temp', 'ADC_temp', 'degC'),
    ('ADC_ias', 'ADC_ias', 'kts'),
    ('ADC_tas', 'ADC_tas', 'kts'),
    ('ADC_aos', 'ADC_aos', 'deg'),
    ('ADC_aoa', 'ADC_aoa', 'deg'),
]

# write new h5 file
var = h5.File('telemetry.h5','w')
for counter, val in enumerate(list_of_vars):
    varnum = var.create_dataset('n°', (6,), data = counter)
    varname = var.create_dataset('Variable name', (6,), dtype = 'str_', data = val[1])
    varunit = var.create_dataset('Unit', (6,), dtype = 'str_', data = val[2])
    data = np.array(varname, varunit)
    print(data)
However, when I run it, I get the error ValueError: Shape tuple is incompatible with data
What is wrong here?
Lots of little problems to correct. If I understand, you want to create ONE heterogeneous dataset (with 1 field (column) of ints named 'n°', and 2 fields (columns) of strings named 'Variable name' and 'Unit'). What your code is trying to create is 18 separate datasets (3 created on each pass through enumerate(list_of_vars)).
There is a trick when working with heterogeneous datasets: if you add row-wise, you have to reference both the dataset row AND column indices, OR add the entire row at once. I prefer to load data field/column-wise. Generally you have fewer fields than rows -- fewer loops == fewer write cycles == faster.
Here is the process you want. It creates the dataset, then fills each field: the count for 'n°', the second tuple element for 'Variable name', and the third for 'Unit'. At the end it reads back and prints the data from the dataset. Code below:
# write new h5 file
with h5.File('telemetry.h5','w') as var:
    dt = np.dtype( [('n°', int), ('Variable name','S10'), ('Unit', 'S10')] )
    dset = var.create_dataset('data', dtype=dt, shape=(len(list_of_vars),))
    dset['n°'] = np.arange(len(list_of_vars))
    dset['Variable name'] = [val[1] for val in list_of_vars]
    dset['Unit'] = [val[2] for val in list_of_vars]
    data = dset[:]
    print(data)
If you prefer to use the enumerate loop, use this method. It loads items by row index. For completeness, it also shows how to index the dataset by [row, field name], but I do not recommend it.
# write new h5 file
with h5.File('telemetry.h5','w') as var:
    dt = np.dtype( [('n°', int), ('Variable name','S10'), ('Unit', 'S10')] )
    dset = var.create_dataset('data', dtype=dt, shape=(len(list_of_vars),))
    for counter, val in enumerate(list_of_vars):
        dset[counter] = (counter, val[1], val[2])
        # alternate row/field indexing method:
        # dset[counter,'n°'] = counter
        # dset[counter,'Variable name'] = val[1]
        # dset[counter,'Unit'] = val[2]
    data = dset[:]
    print(data)
When asking about a problem, show the whole error.
When I run your code I get:
1129:~/mypy$ python3 stack68181330.py
Traceback (most recent call last):
  File "stack68181330.py", line 18, in <module>
    varnum = var.create_dataset('n°', (6,), data = counter)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 149, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 61, in make_new_dset
    raise ValueError("Shape tuple is incompatible with data")
ValueError: Shape tuple is incompatible with data
It's having problems with your first dataset creation:
var.create_dataset('n°', (6,), data = counter)
What are you saying here? Make a dataset with the name 'n°', and shape (6,) - 6 elements. But what is counter? It's the current enumerate value, one integer. Do you see how the shape (6,) doesn't match the data?
The other dataset lines potentially have similar problems.
The script doesn't even get to another problem: you are calling create_dataset repeatedly in the loop. Once you have a dataset named 'n°', you can't create another with the same name.
I suspect you want to make one dataset, with 6 slots, and repeatedly assign counter values to it. Not to repeatedly create a dataset with the same name.
Let's change the dataset creation and write to something that works:
varnum = var.create_dataset('n°', (6,), dtype=int)
varname = var.create_dataset('Variable name', (6,), dtype = 'S10')
varunit = var.create_dataset('Unit', (6,), dtype = 'S10')
for counter, val in enumerate(list_of_vars):
    varnum[counter] = counter
    varname[counter] = val[1]
    varunit[counter] = val[2]
var.flush()
print(varnum, varnum[:])
print(varname)
print(varname[:])
print(varunit)
print(varunit[:])
and run:
1144:~/mypy$ python3 stack68181330.py
<HDF5 dataset "n°": shape (6,), type "<i8"> [0 1 2 3 4 5]
<HDF5 dataset "Variable name": shape (6,), type "|S10">
[b'ADC_alt' b'ADC_temp' b'ADC_ias' b'ADC_tas' b'ADC_aos' b'ADC_aoa']
<HDF5 dataset "Unit": shape (6,), type "|S10">
[b'ft' b'degC' b'kts' b'kts' b'deg' b'deg']
The b'ft' display means the values are bytestrings, the result of using 'S10'. I think there are ways of specifying unicode, but I haven't looked at the h5py docs in a while.
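For the record, here is a minimal sketch of one way to get real (unicode) strings instead of bytestrings, assuming a reasonably recent h5py that provides string_dtype(). This is my addition, not part of the original answer; it reuses the open file var and list_of_vars from above, and the dataset name 'Variable name utf8' is just a placeholder:

str_dt = h5.string_dtype(encoding='utf-8')            # variable-length UTF-8 string dtype
names = var.create_dataset('Variable name utf8', (6,), dtype=str_dt)
names[:] = [val[1] for val in list_of_vars]
# h5py 3.x still returns bytes on read unless you ask for decoding:
print(names.asstr()[:])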
There are simpler ways of writing this data, but I chose to keep it close to your attempt, to better illustrate the basics of both Python iteration, and h5py use.
I could write the data directly to the datasets, without iteration, with:
Make an array from the list:
arr = np.array(list_of_vars, dtype='S')
print(arr)
varnum = var.create_dataset('n°', data=np.arange(arr.shape[0]))
varname = var.create_dataset('Variable name', data=arr[:,1])
varunit = var.create_dataset('Unit', data=arr[:,2])
I let it deduce shape and dtype from the data.

Is there a way to extend a PyTables EArray in the second dimension?

I have a 2D array that can grow to larger sizes than I'm able to fit on memory, so I'm trying to store it in a h5 file using Pytables. The number of rows is known beforehand but the length of each row is not known and is variable between rows. After some research, I thought something along these lines would work, where I can set the extendable dimension as the second dimension.
import os, tempfile
import numpy as np
from tables import open_file, Int32Atom

filename = os.path.join(tempfile.mkdtemp(), 'example.h5')
h5_file = open_file(filename, mode="w", title="Example Extendable Array")
h5_group = h5_file.create_group("/", "example_on_dim_2")
e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))  # assume the number of rows is 100

# Add some item to index 2
print(e_array[2])  # should print an empty array
e_array[2] = np.append(e_array[2], 5)  # add the value 5 to row 2
print(e_array[2])  # should print [5], currently printing an empty array
I'm not sure if it's possible to add elements in this way (I might have misunderstood the way earrays work), but any help would be greatly appreciated!
You're close... but you have a small misunderstanding of some of the arguments and behavior. When you create the EArray with shape=(100, 0), you don't have any data yet... just an object designated to have 100 rows, to which you can append columns. You need to use e_array.append() to add data, not np.append(). Also, if you are going to create a very large array, consider setting the expectedrows= parameter for improved performance as the EArray grows.
Take a look at this code.
import tables as tb
import numpy as np

filename = 'example.h5'
with tb.File(filename, mode="w", title="Example Extendable Array") as h5_file :
    h5_group = h5_file.create_group("/", "example_on_dim_2")
    # Assume num of rows is 100
    #e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))
    e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(), shape=(100, 0))
    print (e_array.shape)
    e_array.append(np.arange(100, dtype=int).reshape(100, 1))  # append a column of values
    print (e_array.shape)
    print(e_array[2])  # prints [2]
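As a side note (my sketch, not from the original answer), the expectedrows= hint mentioned above would be passed to create_earray like this, i.e. the call in the code above becomes the following; the estimate of 1_000_000 entries along the growable second dimension is just a placeholder:

e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(),
                                shape=(100, 0), expectedrows=1_000_000)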
Here is an example showing how to create a VLArray (Variable Length). It is similar to the EArray example above, and follows the example from the Pytables doc (link in comment above). However, although a VLArray supports variable length rows, it does not have a mechanism to add items to an existing row (AFAIK).
import tables as tb
import numpy as np

filename = 'example_vlarray.h5'
with tb.File(filename, mode="w", title="Example Variable Length Array") as h5_file :
    h5_group = h5_file.create_group("/", "vl_example")
    vlarray = h5_file.create_vlarray(h5_group, "example", tb.IntAtom(), "ragged array of ints")
    # Append some (variable length) rows:
    vlarray.append(np.array([0]))
    vlarray.append(np.array([1, 2]))
    vlarray.append([3, 4, 5])
    vlarray.append([6, 7, 8, 9])
    # Now, read it through an iterator:
    print('-->', vlarray.title)
    for x in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, x))

Creating a dataset from multiple hdf5 groups

I am creating a dataset from multiple hdf5 groups. The code for reading the groups is
np.array(hdf.get('all my groups'))
I have then added code for creating a dataset from the groups:
with h5py.File('/train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)
The error message is
ValueError: operands could not be broadcast together with shapes (534456,4) (534456,14)
The number of rows in each group is the same; only the column lengths vary. I want to combine the 5 separate groups into one dataset.
This answer addresses the OP's request in the comments to my other answer below ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 columns 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than in the first answer. As a result I used a different approach to define the dataset names and the columns to be copied. Differences:
The first solution iterates over the dataset names from keys() (copying each dataset completely, appending to a dataset in the new file). The size of the new dataset is calculated by summing the sizes of all datasets.
The second solution uses 2 lists to define 1) the dataset names (ds_list) and 2) the associated columns to copy from each dataset (col_list is a list of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns with col_list.
How you decide to do this depends on your data.
Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems.
Code below:
import h5py
import numpy as np

# Data for file1
arr1 = np.random.random(120).reshape(20,6)
arr2 = np.random.random(120).reshape(20,6)
arr3 = np.random.random(120).reshape(20,6)
arr4 = np.random.random(120).reshape(20,6)

# Create file1 with 4 datasets
with h5py.File('file1.h5','w') as h5f :
    h5f.create_dataset('ds_1',data=arr1)
    h5f.create_dataset('ds_2',data=arr2)
    h5f.create_dataset('ds_3',data=arr3)
    h5f.create_dataset('ds_4',data=arr4)

# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
     h5py.File('file2.h5','w') as h5f2 :

    # Loop over datasets in file1 to get dtype and rows (should test compatibility)
    for i, ds in enumerate(h5f1.keys()) :
        if i == 0:
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            break

    # Create new empty dataset with appropriate dtype and size
    # Use maxshape parameter to make resizable in the future
    ds_list = ['ds_1','ds_2','ds_3','ds_4']
    col_list = [ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
    n_cols = sum( [ len(c) for c in col_list] )
    h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

    # Loop over datasets in file1, read data into xfer_arr, and write to file2
    first = 0
    for ds, cols in zip(ds_list, col_list) :
        xfer_arr = h5f1[ds][:,cols]
        last = first + xfer_arr.shape[1]
        h5f2['combined'][:, first:last] = xfer_arr[:]
        first = last
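As a quick sanity check (my addition, not part of the original answer), the combined dataset can be read back; with the column lists above its shape should come out as 20 rows by 16 columns:

with h5py.File('file2.h5','r') as h5f2:
    print(h5f2['combined'].shape)   # expected: (20, 16)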
Here you go; a simple example to copy values from 3 datasets in file1 to a single dataset in file2. I included some tests to verify compatible dtype and shape. The code to create file1 is included at the top. Comments in the code should explain the process. I have another post that shows multiple ways to copy data between 2 HDF5 files. See this post: How can I combine multiple .h5 file?
import h5py
import numpy as np
import sys

# Data for file1
arr1 = np.random.random(80).reshape(20,4)
arr2 = np.random.random(40).reshape(20,2)
arr3 = np.random.random(60).reshape(20,3)

# Create file1 with 3 datasets
with h5py.File('file1.h5','w') as h5f :
    h5f.create_dataset('ds_1',data=arr1)
    h5f.create_dataset('ds_2',data=arr2)
    h5f.create_dataset('ds_3',data=arr3)

# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
     h5py.File('file2.h5','w') as h5f2 :

    # Loop over datasets in file1 and check data compatibility
    for i, ds in enumerate(h5f1.keys()) :
        if i == 0:
            ds_0 = ds
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            n_cols = h5f1[ds].shape[1]
        else:
            if h5f1[ds].dtype != ds_0_dtype :
                print(f'Dset 0:{ds_0}: dtype:{ds_0_dtype}')
                print(f'Dset {i}:{ds}: dtype:{h5f1[ds].dtype}')
                sys.exit('Error: incompatible dataset dtypes')
            if h5f1[ds].shape[0] != n_rows :
                print(f'Dset 0:{ds_0}: shape[0]:{n_rows}')
                print(f'Dset {i}:{ds}: shape[0]:{h5f1[ds].shape[0]}')
                sys.exit('Error: incompatible dataset shape')
            n_cols += h5f1[ds].shape[1]
        prev_ds = ds

    # Create new empty dataset with appropriate dtype and size
    # Using maxshape parameter to make resizable in the future
    h5f2.create_dataset('ds_123', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

    # Loop over datasets in file1, read data into xfer_arr, and write to file2
    first = 0
    for ds in h5f1.keys() :
        xfer_arr = h5f1[ds][:]
        last = first + xfer_arr.shape[1]
        h5f2['ds_123'][:, first:last] = xfer_arr[:]
        first = last

Graphlab and numpy issue

I'm currently doing a course on Coursera (Machine Learning) offered by the University of Washington, and I'm facing a small problem with numpy and graphlab.
The course requires a version of graphlab higher than 1.7.
Mine is higher, as you can see below; however, when I run the script below, I get the following error:
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started.
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1
    features = ['constant'] + features # this is how you combine two lists
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print example_features[0,:] # this accesses the first row of the data, the ':' indicates 'all columns'
print example_output[0] # and the corresponding output

----> 8 feature_matrix = features_sframe.to_numpy()
NameError: global name 'features_sframe' is not defined
The script above was written by the course authors, so I believe there is something I'm doing wrong
Any help will be highly appreciated.
You are supposed to complete the function get_numpy_data before running it; that's why you are getting an error. Follow the instructions in the original function, which actually are:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)
The graphlab assignment instructions have you convert from graphlab to pandas and then to numpy. You could just skip the graphlab parts and use pandas directly. (This is explicitly allowed in the homework description.)
First, read in the data files.
import pandas as pd
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('data//kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('data//kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('data//kc_house_test_data.csv', dtype=dtype_dict)
The convert to numpy function then becomes
def get_numpy_data(df, features, output):
    df['constant'] = 1
    # add the column 'constant' to the front of the features list so that we can extract it along with the others
    features = ['constant'] + features
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe
    features_df = pd.DataFrame(**FILL IN THE BLANK HERE WITH YOUR CODE**)
    # cast the features_df into a numpy matrix
    feature_matrix = features_df.as_matrix()
    etc.
The remaining code should be the same (since you only work with the numpy versions for the rest of the assignment).

PYTHON - Error while using numpy genfromtxt to import csv data with multiple data types

I'm working on a kaggle competition to predict restaurant revenue based on multiple predictors. I'm a beginner user of Python; I would normally use RapidMiner for data analysis. I am using Python 3.4 in the Spyder 2.3 dev environment.
I am using the below code to import the training csv file.
from sklearn import linear_model
from numpy import genfromtxt, savetxt

def main():
    # create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('data/train.csv','rb'), delimiter=",", dtype=None)[1:]
    train = [x[1:41] for x in dataset]
    test = genfromtxt(open('data/test.csv','rb'), delimiter=",")[1:]
This is the error I get:
dataset = genfromtxt(open('data/train.csv','rb'), delimiter=",", dtype= None)[1:]
IndexError: too many indices for array
Then I checked the imported data types using print(dataset.dtype).
I noticed that a datatype had been individually assigned to every value in the csv file. Moreover, the code wouldn't work with [1:] at the end; it gave me the same "too many indices" error. And if I removed [1:] and defined the input with the skip_header=1 option, I got the error below:
output = np.array(data, dtype=ddtype)
TypeError: Empty data-type
It seems to me like the entire data set is being read as a single row with over 5000 columns.
The data set consists of 43 columns and 138 rows.
I'm stuck at this point, I would appreciate any help with how I can proceed.
I'm posting the raw csv data below (a sample):
Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
0,7/17/99,Ä°stanbul,Big Cities,IL,4,5,4,4,2,2,5,4,5,5,3,5,5,1,2,2,2,4,5,4,1,3,3,1,1,1,4,2,3,5,3,4,5,5,4,3,4,5653753
1,2/14/08,Ankara,Big Cities,FC,4,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,3,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6923131
2,3/9/13,DiyarbakÄr,Other,IL,2,4,2,5,2,3,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2055379
3,2/2/12,Tokat,Other,IL,6,4.5,6,6,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511
4,5/9/09,Gaziantep,Other,IL,3,4,3,4,2,2,5,5,5,5,2,5,5,2,1,2,1,4,2,2,1,2,1,2,3,3,5,1,3,5,1,3,2,3,4,3,3,4316715
5,2/12/10,Ankara,Big Cities,FC,6,6,4.5,7.5,8,10,10,8,8,8,10,8,6,0,0,0,0,0,5,6,3,1,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,5017319
6,10/11/10,Ä°stanbul,Big Cities,IL,2,3,4,4,1,5,5,5,5,5,2,5,5,3,4,4,3,4,2,4,1,2,1,5,4,4,5,1,3,4,5,2,2,3,5,4,4,5166635
7,6/21/11,Ä°stanbul,Big Cities,IL,4,5,4,5,2,3,5,4,4,4,4,3,4,0,0,0,0,0,3,5,2,4,2,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4491607
8,8/28/10,Afyonkarahisar,Other,IL,1,1,4,4,1,2,1,5,5,5,1,5,5,1,1,2,1,4,1,1,1,1,1,4,4,4,2,2,3,4,5,5,3,4,5,4,5,4952497
9,11/16/11,Edirne,Other,IL,6,4.5,6,7.5,6,4,10,10,10,10,2,10,7.5,0,0,0,0,0,25,3,3,1,10,0,0,0,0,5,2.5,0,0,0,0,0,0,0,0,5444227
I think the characters (e.g. Ä°) are causing the problem in genfromtxt. I found that the following reads in the data you have posted here:
dtypes = "i8,S12,S12,S12,S12" + ",i8"*38
test = genfromtxt(open('data/test.csv','rb'), delimiter="," , names = True, dtype=dtypes)
You can then access the elements by name,
In [16]: test['P8']
Out[16]: array([ 4, 5, 5, 8, 5, 8, 5, 4, 5, 10])
The values for the city column,
test['City']
returns,
array(['\xc3\x84\xc2\xb0stanbul', 'Ankara', 'Diyarbak\xc3\x84r', 'Tokat',
'Gaziantep', 'Ankara', '\xc3\x84\xc2\xb0stanbul',
'\xc3\x84\xc2\xb0stanbul', 'Afyonkarahis', 'Edirne'],
dtype='|S12')
In principle, you could try to convert these to unicode in your python script with something like,
In [17]: unicode(test['City'][0], 'utf8')
Out[17]: u'\xc4\xb0stanbul'
Where \xc4\xb0 is UTF-8 hexadecimal encoding for İ. To avoid this, you could also try to clean up the csv input files.
[Solved].
I just chucked numpy's genfromtxt and opted to use read_csv from pandas, since it gives the option to import text in 'utf-8' encoding.
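For illustration, here is a minimal sketch of that pandas approach (my addition; the file path and the decision to let pandas infer the dtypes are assumptions, not details from the original post):

import pandas as pd

# read_csv decodes the text columns (City, City Group, Type) as UTF-8
train = pd.read_csv('data/train.csv', encoding='utf-8')
print(train.dtypes)          # per-column dtypes are inferred, mixing strings and numbers
print(train['City'].head())  # city names come back as proper unicode strings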
