I am trying to write a program which puts data into a .h5 file. There should be 3 columns: one with the number of the variable (from the counter in the for loop), one with the name of the variable (the 2nd column of list_of_vars), and one with its unit (the 3rd column of list_of_vars).
Code is below:
import numpy as np
import h5py as h5
list_of_vars = [
    ('ADC_alt', 'ADC_alt', 'ft'),
    ('ADC_temp', 'ADC_temp', 'degC'),
    ('ADC_ias', 'ADC_ias', 'kts'),
    ('ADC_tas', 'ADC_tas', 'kts'),
    ('ADC_aos', 'ADC_aos', 'deg'),
    ('ADC_aoa', 'ADC_aoa', 'deg'),
]
#write new h5 file
var = h5.File('telemetry.h5','w')
for counter, val in enumerate(list_of_vars):
    varnum = var.create_dataset('n°', (6,), data = counter)
    varname = var.create_dataset('Variable name', (6,), dtype = 'str_', data = val[1])
    varunit = var.create_dataset('Unit', (6,), dtype = 'str_', data = val[2])
    data = np.array(varname,varunit)
    print(data)
However, when I run it, I get the error ValueError: Shape tuple is incompatible with data
What is wrong here?
Lots of little problems to correct. If I understand correctly, you want to create ONE heterogeneous dataset (with 1 field (column) of ints named 'n°', and 2 fields (columns) of strings named 'Variable name' and 'Unit'). What your code actually tries to create is 18 separate datasets (3 created on each pass through enumerate(list_of_vars)).
There is a trick when working with heterogeneous datasets: if you add data row-wise, you have to reference both the dataset row AND column indices, OR add an entire row at once. I prefer to load data field/column-wise. Generally you have fewer fields than rows -- fewer loops == fewer write cycles == faster.
Here is the process you want. It creates the dataset, then adds the data one field at a time: first the counter values, then the names (val[1] from each tuple), then the units (val[2]). At the end it reads and prints the data from the dataset. Code below:
#write new h5 file
with h5.File('telemetry.h5','w') as var:
    dt = np.dtype([('n°', int), ('Variable name', 'S10'), ('Unit', 'S10')])
    dset = var.create_dataset('data', dtype=dt, shape=(len(list_of_vars),))
    dset['n°'] = np.arange(len(list_of_vars))
    dset['Variable name'] = [val[1] for val in list_of_vars]
    dset['Unit'] = [val[2] for val in list_of_vars]
    data = dset[:]
    print(data)
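With the example data, that final print should show one structured array along these lines (bytestrings because of the 'S10' dtype):
[(0, b'ADC_alt', b'ft') (1, b'ADC_temp', b'degC') (2, b'ADC_ias', b'kts')
 (3, b'ADC_tas', b'kts') (4, b'ADC_aos', b'deg') (5, b'ADC_aoa', b'deg')]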
If you prefer to use the enumerate loop, use this method. It loads items by row index. For completeness, it also shows how to index the dataset by [row, field name], but I do not recommend that approach.
#write new h5 file
with h5.File('telemetry.h5','w') as var:
    dt = np.dtype([('n°', int), ('Variable name', 'S10'), ('Unit', 'S10')])
    dset = var.create_dataset('data', dtype=dt, shape=(len(list_of_vars),))
    for counter, val in enumerate(list_of_vars):
        dset[counter] = (counter, val[1], val[2])
        # alternate row/field indexing method:
        # dset[counter,'n°'] = counter
        # dset[counter,'Variable name'] = val[1]
        # dset[counter,'Unit'] = val[2]
    data = dset[:]
    print(data)
When asking about a problem, show the whole error.
When I run your code I get:
1129:~/mypy$ python3 stack68181330.py
Traceback (most recent call last):
File "stack68181330.py", line 18, in <module>
varnum = var.create_dataset('n°', (6,), data = counter)
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 149, in create_dataset
dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 61, in make_new_dset
raise ValueError("Shape tuple is incompatible with data")
ValueError: Shape tuple is incompatible with data
It's having problems with your first dataset creation:
var.create_dataset('n°', (6,), data = counter)
What are you saying here? Make a dataset with name 'n°' and shape (6,) - 6 elements. But what is counter? It's the current enumerate value, one integer. Do you see how the shape (6,) doesn't match the data?
The other dataset lines potentially have similar problems.
The traceback doesn't even get to the next problem: you are calling create_dataset repeatedly in the loop. Once you have a dataset named 'n°', you can't create another with the same name.
I suspect you want to make one dataset, with 6 slots, and repeatedly assign counter values to it. Not to repeatedly create a dataset with the same name.
Let's change the dataset creation and write to something that works:
varnum = var.create_dataset('n°', (6,), dtype=int)
varname = var.create_dataset('Variable name', (6,), dtype = 'S10')
varunit = var.create_dataset('Unit', (6,), dtype = 'S10')
for counter, val in enumerate(list_of_vars):
    varnum[counter] = counter
    varname[counter] = val[1]
    varunit[counter] = val[2]
var.flush()
print(varnum, varnum[:])
print(varname)
print(varname[:])
print(varunit)
print(varunit[:])
and run:
1144:~/mypy$ python3 stack68181330.py
<HDF5 dataset "n°": shape (6,), type "<i8"> [0 1 2 3 4 5]
<HDF5 dataset "Variable name": shape (6,), type "|S10">
[b'ADC_alt' b'ADC_temp' b'ADC_ias' b'ADC_tas' b'ADC_aos' b'ADC_aoa']
<HDF5 dataset "Unit": shape (6,), type "|S10">
[b'ft' b'degC' b'kts' b'kts' b'deg' b'deg']
The b'ft' display shows bytestrings, the result of using the 'S10' dtype. I think there are ways of specifying unicode, but I haven't looked at the h5py docs in a while.
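For the record, a minimal sketch of the unicode option (my addition; it relies on h5py.string_dtype(), available in h5py 2.10 and later):
with h5.File('telemetry_utf8.h5','w') as var:
    # variable-length UTF-8 string dtype; values can be assigned as Python str
    str_dt = h5.string_dtype(encoding='utf-8')
    varname = var.create_dataset('Variable name', (len(list_of_vars),), dtype=str_dt)
    varname[:] = [val[1] for val in list_of_vars]
    # h5py 3.x still returns bytes on read by default; .asstr() decodes to str
    print(varname.asstr()[:])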
There are simpler ways of writing this data, but I chose to keep it close to your attempt, to better illustrate the basics of both Python iteration, and h5py use.
I could write the data directly to the datasets, without iteration. First make an array from the list:
arr = np.array(list_of_vars, dtype='S')
print(arr)
varnum = var.create_dataset('n°', data=np.arange(arr.shape[0]))
varname = var.create_dataset('Variable name', data=arr[:,1])
varunit = var.create_dataset('Unit', data=arr[:,2])
I let it deduce shape and dtype from the data.
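For clarity, a quick sketch of what that array slicing gives you (using the same list_of_vars from the question):
arr = np.array(list_of_vars, dtype='S')
print(arr.shape)   # (6, 3)
print(arr[:, 1])   # [b'ADC_alt' b'ADC_temp' ...] -> the 'Variable name' data
print(arr[:, 2])   # [b'ft' b'degC' ...] -> the 'Unit' data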
Creating a dataset from multiple hdf5 groups
I read each group with
np.array(hdf.get('all my groups'))
and have then added code for creating a single dataset from the groups:
with h5py.File('/train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)
The error message is:
ValueError: operands could not be broadcast together with shapes (534456,4) (534456,14)
The number of rows in each group is the same; only the column counts vary. I want to combine the 5 separate groups into one dataset.
This answer addresses the OP's request in comments to my first answer ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than the first answer. As a result I used a different approach to define dataset names and columns to be copied. Differences:
The first solution iterates over the dataset names from the "keys()" (copying each dataset completely, appending to a dataset in the new file). The size of the new dataset is calculated by summing sizes of all datasets.
The second solution uses 2 lists to define 1) dataset names (ds_list) and 2) the associated columns to copy from each dataset (col_list is a list of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns using col_list.
How you decide to do this depends on your data.
Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems.
Code below:
import h5py
import numpy as np

# Data for file1
arr1 = np.random.random(120).reshape(20,6)
arr2 = np.random.random(120).reshape(20,6)
arr3 = np.random.random(120).reshape(20,6)
arr4 = np.random.random(120).reshape(20,6)

# Create file1 with 4 datasets
with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1', data=arr1)
    h5f.create_dataset('ds_2', data=arr2)
    h5f.create_dataset('ds_3', data=arr3)
    h5f.create_dataset('ds_4', data=arr4)

# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1, \
     h5py.File('file2.h5','w') as h5f2:

    # Get dtype and rows from the first dataset (should test compatibility)
    for i, ds in enumerate(h5f1.keys()):
        if i == 0:
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            break

    # Create new empty dataset with appropriate dtype and size
    # Use maxshape parameter to make it resizable in the future
    ds_list = ['ds_1','ds_2','ds_3','ds_4']
    col_list = [ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
    n_cols = sum(len(c) for c in col_list)
    h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

    # Loop over datasets in file1, read the selected columns into xfer_arr
    # (fancy indexing with col_list), and write to file2
    first = 0
    for ds, cols in zip(ds_list, col_list):
        xfer_arr = h5f1[ds][:, cols]
        last = first + xfer_arr.shape[1]
        h5f2['combined'][:, first:last] = xfer_arr[:]
        first = last
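A quick sanity check you could add at the end (my addition, assuming the files created above); the combined dataset should have 6+2+2+6 = 16 columns:
with h5py.File('file2.h5','r') as h5f2:
    print(h5f2['combined'].shape)   # expect (20, 16)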
Here you go; a simple example to copy values from 3 datasets in file1 to a single dataset in file2. I included some tests to verify compatible dtype and shape. The code to create file1 is included at the top. Comments in the code should explain the process. I have another post that shows multiple ways to copy data between 2 HDF5 files. See this post: How can I combine multiple .h5 files?
import h5py
import numpy as np
import sys

# Data for file1
arr1 = np.random.random(80).reshape(20,4)
arr2 = np.random.random(40).reshape(20,2)
arr3 = np.random.random(60).reshape(20,3)

# Create file1 with 3 datasets
with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1', data=arr1)
    h5f.create_dataset('ds_2', data=arr2)
    h5f.create_dataset('ds_3', data=arr3)

# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1, \
     h5py.File('file2.h5','w') as h5f2:

    # Loop over datasets in file1 and check data compatibility
    for i, ds in enumerate(h5f1.keys()):
        if i == 0:
            ds_0 = ds
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            n_cols = h5f1[ds].shape[1]
        else:
            if h5f1[ds].dtype != ds_0_dtype:
                print(f'Dset 0:{ds_0}: dtype:{ds_0_dtype}')
                print(f'Dset {i}:{ds}: dtype:{h5f1[ds].dtype}')
                sys.exit('Error: incompatible dataset dtypes')
            if h5f1[ds].shape[0] != n_rows:
                print(f'Dset 0:{ds_0}: shape[0]:{n_rows}')
                print(f'Dset {i}:{ds}: shape[0]:{h5f1[ds].shape[0]}')
                sys.exit('Error: incompatible dataset shape')
            n_cols += h5f1[ds].shape[1]

    # Create new empty dataset with appropriate dtype and size
    # Use maxshape parameter to make it resizable in the future
    h5f2.create_dataset('ds_123', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

    # Loop over datasets in file1, read data into xfer_arr, and write to file2
    first = 0
    for ds in h5f1.keys():
        xfer_arr = h5f1[ds][:]
        last = first + xfer_arr.shape[1]
        h5f2['ds_123'][:, first:last] = xfer_arr[:]
        first = last
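As a quick verification (my addition, reusing arr1/arr2/arr3 from the top of the script): since keys() yields 'ds_1', 'ds_2', 'ds_3' in name order, the copied dataset should equal a plain NumPy column-wise concatenation:
with h5py.File('file2.h5','r') as h5f2:
    combined = h5f2['ds_123'][:]
print(combined.shape)                                        # (20, 9)
print(np.allclose(combined, np.hstack([arr1, arr2, arr3])))  # True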
I'm working on a Kaggle competition to predict restaurant revenue based on multiple predictors. I'm a beginner user of Python; I would normally use RapidMiner for data analysis. I am using Python 3.4 in the Spyder 2.3 dev environment.
I am using the below code to import the training csv file.
from sklearn import linear_model
from numpy import genfromtxt, savetxt
def main():
    #create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('data/train.csv','rb'), delimiter=",", dtype=None)[1:]
    train = [x[1:41] for x in dataset]
    test = genfromtxt(open('data/test.csv','rb'), delimiter=",")[1:]
This is the error I get:
dataset = genfromtxt(open('data/train.csv','rb'), delimiter=",", dtype= None)[1:]
IndexError: too many indices for array
Then I checked the imported data types with print(dataset.dtype).
I noticed that a datatype had been individually assigned for every value in the csv file. Moreover, the code wouldn't work with [1:] at the end; it gave me the same 'too many indices' error. And if I removed [1:] and used the skip_header=1 option instead, I got the error below:
output = np.array(data, dtype=ddtype)
TypeError: Empty data-type
It seems to me like the entire data set is being read as a single row with over 5000 columns.
The data set consists of 43 columns and 138 rows.
I'm stuck at this point, I would appreciate any help with how I can proceed.
I'm posting the raw csv data below (a sample):
Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
0,7/17/99,Ä°stanbul,Big Cities,IL,4,5,4,4,2,2,5,4,5,5,3,5,5,1,2,2,2,4,5,4,1,3,3,1,1,1,4,2,3,5,3,4,5,5,4,3,4,5653753
1,2/14/08,Ankara,Big Cities,FC,4,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,3,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6923131
2,3/9/13,DiyarbakÄr,Other,IL,2,4,2,5,2,3,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2055379
3,2/2/12,Tokat,Other,IL,6,4.5,6,6,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511
4,5/9/09,Gaziantep,Other,IL,3,4,3,4,2,2,5,5,5,5,2,5,5,2,1,2,1,4,2,2,1,2,1,2,3,3,5,1,3,5,1,3,2,3,4,3,3,4316715
5,2/12/10,Ankara,Big Cities,FC,6,6,4.5,7.5,8,10,10,8,8,8,10,8,6,0,0,0,0,0,5,6,3,1,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,5017319
6,10/11/10,Ä°stanbul,Big Cities,IL,2,3,4,4,1,5,5,5,5,5,2,5,5,3,4,4,3,4,2,4,1,2,1,5,4,4,5,1,3,4,5,2,2,3,5,4,4,5166635
7,6/21/11,Ä°stanbul,Big Cities,IL,4,5,4,5,2,3,5,4,4,4,4,3,4,0,0,0,0,0,3,5,2,4,2,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4491607
8,8/28/10,Afyonkarahisar,Other,IL,1,1,4,4,1,2,1,5,5,5,1,5,5,1,1,2,1,4,1,1,1,1,1,4,4,4,2,2,3,4,5,5,3,4,5,4,5,4952497
9,11/16/11,Edirne,Other,IL,6,4.5,6,7.5,6,4,10,10,10,10,2,10,7.5,0,0,0,0,0,25,3,3,1,10,0,0,0,0,5,2.5,0,0,0,0,0,0,0,0,5444227
I think the characters (e.g. Ä°) are causing the problem in genfromtxt. I found that the following reads in the data you posted here:
dtypes = "i8,S12,S12,S12,S12" + ",i8"*38
test = genfromtxt(open('data/test.csv','rb'), delimiter="," , names = True, dtype=dtypes)
You can then access the elements by name,
In [16]: test['P8']
Out[16]: array([ 4, 5, 5, 8, 5, 8, 5, 4, 5, 10])
The values for the city column,
test['City']
returns,
array(['\xc3\x84\xc2\xb0stanbul', 'Ankara', 'Diyarbak\xc3\x84r', 'Tokat',
'Gaziantep', 'Ankara', '\xc3\x84\xc2\xb0stanbul',
'\xc3\x84\xc2\xb0stanbul', 'Afyonkarahis', 'Edirne'],
dtype='|S12')
In principle, you could try to convert these to unicode in your python script with something like,
In [17]: unicode(test['City'][0], 'utf8')
Out[17]: u'\xc4\xb0stanbul'
Where \xc4\xb0 is the UTF-8 hexadecimal encoding for İ. To avoid this, you could also try to clean up the csv input files.
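Side note: unicode() is Python 2 only. On Python 3 you would decode the numpy bytestring element directly, e.g.:
test['City'][0].decode('utf-8')   # bytes -> str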
[Solved].
I just chucked numpy's genfromtxt and opted to use read_csv from pandas instead, since it gives the option to import text in 'utf-8' encoding.
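For anyone landing here later, a minimal sketch of that pandas approach (paths assumed to match the question; read_csv uses the first line as the header by default):
import pandas as pd

train = pd.read_csv('data/train.csv', encoding='utf-8')
test = pd.read_csv('data/test.csv', encoding='utf-8')
print(train.dtypes.head())   # per-column dtypes instead of one structured dtype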