Creating a dataset from multiple HDF5 groups
I read each of my groups with code like:
np.array(hdf.get('all my groups'))
I then added code to create a single dataset from the groups:
with h5py.File('/train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)
The error message is:
ValueError: operands could not be broadcast together with shapes (534456,4) (534456,14)
The number of rows in each group is the same; only the number of columns varies. The goal is to combine the 5 separate groups into one dataset.
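For context, the error occurs because + on NumPy arrays is elementwise addition, which requires matching shapes. If the intent is to place the groups side by side as columns, horizontal stacking is the usual approach. A minimal sketch, assuming the arrays all share the same number of rows (dummy stand-ins are used here for one_T and two_T):
import h5py
import numpy as np

# Dummy stand-ins for the real arrays; same row count, different column counts
one_T = np.random.random((1000, 4))
two_T = np.random.random((1000, 14))

# Stack column-wise instead of adding elementwise (which is what + does)
combined = np.hstack([one_T, two_T])

with h5py.File('train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=combined)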
This answer addresses the OP's request in comments to my first answer ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than in the first answer. As a result, I used a different approach to define the dataset names and columns to be copied. Differences:
The first solution iterates over the dataset names from keys() (copying each dataset completely and appending it to a dataset in the new file). The size of the new dataset is calculated by summing the sizes of all datasets.
The second solution uses 2 lists to define 1) the dataset names (ds_list) and 2) the associated columns to copy from each dataset (col_list is a list of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns using col_list.
How you decide to do this depends on your data.
Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems.
Code below:
import h5py
import numpy as np

# Data for file1
arr1 = np.random.random(120).reshape(20,6)
arr2 = np.random.random(120).reshape(20,6)
arr3 = np.random.random(120).reshape(20,6)
arr4 = np.random.random(120).reshape(20,6)

# Create file1 with 4 datasets
with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1', data=arr1)
    h5f.create_dataset('ds_2', data=arr2)
    h5f.create_dataset('ds_3', data=arr3)
    h5f.create_dataset('ds_4', data=arr4)

# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1, \
     h5py.File('file2.h5','w') as h5f2:

    # Loop over datasets in file1 to get dtype and rows (should test compatibility)
    for i, ds in enumerate(h5f1.keys()):
        if i == 0:
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            break

    # Create new empty dataset with appropriate dtype and size
    # Use maxshape parameter to make resizable in the future
    ds_list = ['ds_1','ds_2','ds_3','ds_4']
    col_list = [ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
    n_cols = sum([len(c) for c in col_list])
    h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

    # Loop over datasets in file1, read selected columns into xfer_arr, and write to file2
    first = 0
    for ds, cols in zip(ds_list, col_list):
        xfer_arr = h5f1[ds][:,cols]
        last = first + xfer_arr.shape[1]
        h5f2['combined'][:, first:last] = xfer_arr[:]
        first = last
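A quick sanity check after running the code above (a small sketch using the files created above): the combined dataset should have 6 + 2 + 2 + 6 = 16 columns.
import h5py

with h5py.File('file2.h5', 'r') as h5f2:
    print(h5f2['combined'].shape)   # expect (20, 16) for the col_list above
    print(h5f2['combined'].dtype)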
Here you go; a simple example to copy values from 3 datasets in file1 to a single dataset in file2. I included some tests to verify compatible dtype and shape. The code to create file1 is included at the top. Comments in the code should explain the process. I have another post that shows multiple ways to copy data between 2 HDF5 files. See this post: How can I combine multiple .h5 file?
import h5py
import numpy as np
import sys

# Data for file1
arr1 = np.random.random(80).reshape(20,4)
arr2 = np.random.random(40).reshape(20,2)
arr3 = np.random.random(60).reshape(20,3)

# Create file1 with 3 datasets
with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1', data=arr1)
    h5f.create_dataset('ds_2', data=arr2)
    h5f.create_dataset('ds_3', data=arr3)

# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1, \
     h5py.File('file2.h5','w') as h5f2:

    # Loop over datasets in file1 and check data compatibility
    for i, ds in enumerate(h5f1.keys()):
        if i == 0:
            ds_0 = ds
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            n_cols = h5f1[ds].shape[1]
        else:
            if h5f1[ds].dtype != ds_0_dtype:
                print(f'Dset 0:{ds_0}: dtype:{ds_0_dtype}')
                print(f'Dset {i}:{ds}: dtype:{h5f1[ds].dtype}')
                sys.exit('Error: incompatible dataset dtypes')
            if h5f1[ds].shape[0] != n_rows:
                print(f'Dset 0:{ds_0}: shape[0]:{n_rows}')
                print(f'Dset {i}:{ds}: shape[0]:{h5f1[ds].shape[0]}')
                sys.exit('Error: incompatible dataset shape')
            n_cols += h5f1[ds].shape[1]
        prev_ds = ds

    # Create new empty dataset with appropriate dtype and size
    # Use maxshape parameter to make resizable in the future
    h5f2.create_dataset('ds_123', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

    # Loop over datasets in file1, read data into xfer_arr, and write to file2
    first = 0
    for ds in h5f1.keys():
        xfer_arr = h5f1[ds][:]
        last = first + xfer_arr.shape[1]
        h5f2['ds_123'][:, first:last] = xfer_arr[:]
        first = last
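Since both examples create the combined dataset with maxshape=(n_rows, None), it can be enlarged later if more columns need to be appended. A minimal sketch of how that could look, assuming file2.h5 from the code above and a hypothetical new 20x2 array arr4:
import h5py
import numpy as np

arr4 = np.random.random(40).reshape(20,2)   # hypothetical extra data to append

with h5py.File('file2.h5', 'a') as h5f2:
    ds = h5f2['ds_123']
    n_rows, n_cols = ds.shape
    ds.resize((n_rows, n_cols + arr4.shape[1]))   # allowed because maxshape[1] is None
    ds[:, n_cols:] = arr4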
I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final dataframe effectively has 2.5 million rows, and building it takes a lot of time.
Is there any way I can optimize this code so that it runs faster?
The current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these, so I need a faster run time.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd

#%%
f = h5py.File('mer.h5', 'r')

for key in f.keys():
    #print(key) #Names of the root level object names in HDF5 file - can be groups or datasets.
    #print(type(f[key])) # get the object type: usually group or dataset
    ls = list(f.keys())

#Get the HDF5 group; key needs to be a group name from above
key = 'DHI'
#group = f['OBSERVATION_TIME']
#print("Group")
#print(group)
#for key in ls:
#    data = f.get(key)
#    dataset1 = np.array(data)
#    length = len(dataset1)

masterdf = pd.DataFrame()

data = f.get(key)
dataset1 = np.array(data)
#masterdf[key] = dataset1

X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)

#%%
data_df = pd.DataFrame(index=range(len(Y_1)), columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
#data_df.to_csv("test.csv")

#%%
final = pd.DataFrame(index=range(1616*1616), columns=['X', 'Y', 'GHI'])
k = 0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])): #X and Y ranges are not same
        final.loc[k,'X'] = X_1[0][x]
        final.loc[k,'Y'] = Y_1[0][y]
        final.loc[k,'GHI'] = data_df.iloc[y,x]
        k = k + 1
        # print(k)
We can optimize the loops by vectorizing operations. Vectorized operations are one to two orders of magnitude faster than their pure Python equivalents (especially in numerical computations). Vectorization is something we get with NumPy, a library with efficient data structures designed to hold matrix data.
Could you please try the following (file.h5 is your file):
import pandas as pd
import h5py

with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)
    final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, except for the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more): the result is a NumPy array with 2 dimensions, i.e. a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally take the cross product of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
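Since the stated goal is to export the result to .csv later, it can be written out directly at this point. A short usage sketch (the output filename is just an example):
# Write the combined dataframe to CSV; index=False drops the row index column
final.to_csv("mer_out.csv", index=False)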
I have hundreds of .h5 files with dates in their filenames (e.g. ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents data of interest. I want to group the data by month, so instead of having (e.g.) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far, filename represents a txt file of file names.
def combine_months(filename):
    fin = open(filename, 'r')
    next_name = fin.readline()
    while (next_name != ""):
        year = next_name[6:10]
        month = next_name[11:13]
        date = month + '\\' + year
        #not sure where to go from here
        next_name = fin.readline()  # read the next filename so the loop terminates
    fin.close()
As an example of what I hope to achieve, say array_1, array_2, and array_3 are numpy arrays representing data from different h5 files whose filenames contain the same month.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1,3,7.5]
[2, 2, 6.5]
[3,4,7.5]
[4,6,12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.
Ok, here is the beginning of an answer. (I suspect you may have more questions as you work thru the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better IMHO) to use the glob.iglob() function. There are 2 examples below that show how to: 1) open each file, 2) read the data from the 'data' dataset into an array, and 3) append the array to a list. The first example has the filenames in a list. The second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py

arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
    with h5py.File(h5file, 'r') as h5f:
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob

arr_list = []
for h5file in iglob('202210*.h5'):
    with h5py.File(h5file, 'r') as h5f:
        print(h5f.keys()) # to get the dataset names from the keys
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
    # do something with the data based on column 0 value
    print(arr.shape, arr.dtype)
Code below shows a way to sum rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It reads all column 0 values into a sorted list, then uses that list to size the count and sum arrays, and as an index into the proper row.
import numpy as np

# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
    col_vals = arr[:,0]
    for val in col_vals:
        if val not in row_value_list:
            row_value_list.append(val)

# Sort list of column 0 IDs
row_value_list.sort()

# get length of ID list to size the cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array, assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype

arr_cnt = np.zeros(shape=(a0,a1), dtype=dt)
arr_cnt[:,0] = row_value_list
arr_sum = np.zeros(shape=(a0,a1), dtype=dt)
arr_sum[:,0] = row_value_list

# accumulate counts and sums row by row, matching on the column 0 ID
for arr in arr_list:
    for row in arr:
        idx = row_value_list.index(row[0])
        arr_cnt[idx,1:] += 1
        arr_sum[idx,1:] += row[1:]

print('Count Array\n', arr_cnt)
print('Sum Array\n', arr_sum)

arr_ave = arr_sum/arr_cnt
arr_ave[:,0] = row_value_list
print('Average Array\n', arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
    col_vals = set(arr[:,0])
    row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)
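As a side note (my addition, not part of the original answer), NumPy can produce the same sorted list of unique IDs in one call, since np.unique returns sorted unique values. A small sketch using the arr_list built above:
import numpy as np

# np.unique returns sorted unique values, replacing the set/union loop
row_value_list = np.unique(np.concatenate([arr[:,0] for arr in arr_list])).tolist()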
This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar to, but different from, the first answer. It uses glob to get a list of filenames (instead of iglob). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (data is "stacked" on the 0-axis). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
Use glob.glob() to load filenames to a list based on a wildcard
Allocate an array to hold all the data (arr_all) based on the # of files and the size of 1 dataset.
Loop thru all H5 files, loading data to arr_all
Create a sorted list of unique group IDs (column 0 values)
Allocate arrays to hold mean/median (arr_mean and arr_median) based on the # of unique row IDs and # of columns in arr_all.
Loop over values in ID list, then:
a. Create masked array (mask) where column 0 value = loop value
b. Broadcast mask to match arr_all shape then apply to create ma_arr_all
c. Loop over columns of ma_arr_all, compress to get unmasked values, then calculate mean and median and save.
Code below:
import h5py
from glob import glob
import numpy as np

# use glob.glob() to get list of files using wildcard names
file_list = glob('202210*.h5')

with h5py.File(file_list[0], 'r') as h5f:
    a0, a1 = h5f['data'].shape
    # allocate array to hold values from all datasets
    arr_all = np.zeros(shape=(len(file_list)*a0, a1), dtype=h5f['data'].dtype)

start, stop = 0, a0
for i, h5file in enumerate(file_list):
    with h5py.File(h5file, 'r') as h5f:
        arr_all[start:stop,:] = h5f['data'][()]
    start += a0
    stop += a0

# Create a set from column 0 values, and use to create a sorted list
row_value_list = sorted(set(arr_all[:,0]))

arr_mean = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))

col_0 = arr_all[:,0:1]
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True)  # True mask value ignores data
    all_mask = np.broadcast_to(row_mask, arr_all.shape)
    ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
    for j in range(ma_arr_all.shape[1]):
        masked_col = ma_arr_all[:,j:j+1].compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)

print('Mean values:\n', arr_mean)
print('Median values:\n', arr_median)
Added Nov 22, 2022:
The method above uses np.broadcast_to(), introduced in NumPy 1.10. Here is an alternate method for prior versions. (It replaces the entire for i, row_val loop.) It should be more memory efficient; I haven't profiled to verify, but the arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True)  # True mask value ignores data
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:,j:j+1], mask=row_mask).compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
I ran with values provided by OP. Output is provided below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]
I have a 2D array that can grow to larger sizes than I'm able to fit in memory, so I'm trying to store it in an h5 file using PyTables. The number of rows is known beforehand, but the length of each row is not known and varies between rows. After some research, I thought something along these lines would work, where I can set the extendable dimension as the second dimension.
import os
import tempfile
import numpy as np
from tables import open_file, Int32Atom

filename = os.path.join(tempfile.mkdtemp(), 'example.h5')
h5_file = open_file(filename, mode="w", title="Example Extendable Array")
h5_group = h5_file.create_group("/", "example_on_dim_2")
e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0)) # Assume num of rows is 100

# Add some item to index 2
print(e_array[2]) # should print an empty array
e_array[2] = np.append(e_array[2], 5) # add the value 5 to row 2
print(e_array[2]) # should print [5], currently printing empty array
I'm not sure if it's possible to add elements in this way (I might have misunderstood the way earrays work), but any help would be greatly appreciated!
You're close...but have a small misunderstanding of some of the arguments and behavior. When you create the EArray with shape=(100,0), you don't have any data...just an object designated to have 100 rows that can add columns. Also, you need to use e_array.append() to add data, not np.append(). Also, if you are going to create a very large array, consider defining the expectedrows= parameter for improved performance as the EArray grows.
Take a look at this code.
import tables as tb
import numpy as np

filename = 'example.h5'
with tb.File(filename, mode="w", title="Example Extendable Array") as h5_file:
    h5_group = h5_file.create_group("/", "example_on_dim_2")
    # Assume num of rows is 100
    #e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))
    e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(), shape=(100, 0))
    print(e_array.shape)
    e_array.append(np.arange(100, dtype=int).reshape(100,1)) # append a column of values
    print(e_array.shape)
    print(e_array[2]) # prints [2]
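The note above mentions the expectedrows= parameter, but the example does not use it. A hedged sketch of how it could be passed when creating the EArray (the value used here is just an illustrative guess at the final size of the enlargeable dimension):
import tables as tb

with tb.File('example_expected.h5', mode="w") as h5_file:
    h5_group = h5_file.create_group("/", "example_on_dim_2")
    # expectedrows hints at the eventual size of the enlargeable dimension,
    # which lets PyTables choose better chunking up front
    e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(),
                                    shape=(100, 0), expectedrows=1_000_000)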
Here is an example showing how to create a VLArray (Variable Length). It is similar to the EArray example above, and follows the example from the Pytables doc (link in comment above). However, although a VLArray supports variable length rows, it does not have a mechanism to add items to an existing row (AFAIK).
import tables as tb
import numpy as np

filename = 'example_vlarray.h5'
with tb.File(filename, mode="w", title="Example Variable Length Array") as h5_file:
    h5_group = h5_file.create_group("/", "vl_example")
    vlarray = h5_file.create_vlarray(h5_group, "example", tb.IntAtom(), "ragged array of ints")

    # Append some (variable length) rows:
    vlarray.append(np.array([0]))
    vlarray.append(np.array([1, 2]))
    vlarray.append([3, 4, 5])
    vlarray.append([6, 7, 8, 9])

    # Now, read it through an iterator:
    print('-->', vlarray.title)
    for x in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, x))
I have an issue converting a JSON column (which contains around 250 variables) into 250 separate columns. I'm able to use a Pandas dataframe, but for just 46k rows it takes 30 minutes, and sometimes the kernel crashes due to low memory (there are 0.5 million rows in the database).
Can somebody help me with code using NumPy arrays (which should decrease conversion time and reduce file size)?
The JSON column has data in the format below:
My code:
for x in records:
    list_ = list(x)
    json_acceptable_string = list_[4].read()
    list_features.append(json.loads(json_acceptable_string))
Once I get list_features, I do preprocessing and use a machine learning pipeline. This isn't working for large data.
I think this could help for building your np array:
import json
import numpy as np

variable_name_list = ['var1', 'var2', ..., 'var250']  # fill in all 250 variable names

# note: dtype=str gives 1-character fixed-width strings; use dtype=object or a
# sized string dtype (e.g. 'U64') if the values can be longer than one character
list_features = np.empty(shape=(len(records), len(variable_name_list)), dtype=str)

for index in range(len(records)):
    list_ = list(records[index])
    json_acceptable_string = list_[4].read()
    tmp_feature_list = []
    tmp_feature_dict = json.loads(json_acceptable_string)
    for var_name in variable_name_list:
        if var_name not in tmp_feature_dict.keys():
            tmp_feature_list.append("missing_val")
        else:
            tmp_feature_list.append(tmp_feature_dict[var_name])
    tmp_feature_list = np.asarray(tmp_feature_list, dtype=str).reshape(1, len(variable_name_list))
    list_features[index] = tmp_feature_list
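If the result is then needed as a dataframe for the preprocessing and machine learning pipeline mentioned in the question, here is a short hedged sketch, assuming list_features and variable_name_list were built as above:
import pandas as pd

# Wrap the NumPy array in a DataFrame using the variable names as column headers
features_df = pd.DataFrame(list_features, columns=variable_name_list)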
I'm currently doing a course on Coursera (Machine Learning) offered by the University of Washington, and I'm facing a small problem with NumPy and GraphLab.
The course requires a version of GraphLab higher than 1.7.
Mine is higher, as you can see below; however, when I run the script below, I get an error as follows:
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started.
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1
    features = ['constant'] + features # this is how you combine two lists
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print example_features[0,:] # this accesses the first row of the data; the ':' indicates 'all columns'
print example_output[0] # and the corresponding output
----> 8 feature_matrix = features_sframe.to_numpy()
NameError: global name 'features_sframe' is not defined
The script above was written by the course authors, so I believe there is something I'm doing wrong.
Any help will be highly appreciated.
You are supposed to complete the function get_numpy_data before running it; that's why you are getting an error. Follow the instructions in the original function, which actually are:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)
The graphlab assignment instructions have you convert from graphlab to pandas and then to numpy. You could just skip the graphlab parts and use pandas directly. (This is explicitly allowed in the homework description.)
First, read in the data files.
import pandas as pd
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('data//kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('data//kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('data//kc_house_test_data.csv', dtype=dtype_dict)
The convert-to-numpy function then becomes:
def get_numpy_data(df, features, output):
    df['constant'] = 1
    # add the column 'constant' to the front of the features list so that we can extract it along with the others
    features = ['constant'] + features
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe
    features_df = pd.DataFrame(**FILL IN THE BLANK HERE WITH YOUR CODE**)
    # cast the features_df into a numpy matrix
    feature_matrix = features_df.as_matrix()
    etc.
The remaining code should be the same (since you only work with the numpy versions for the rest of the assignment).
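One caveat worth flagging (my addition, not part of the original answer): DataFrame.as_matrix() was deprecated and later removed from pandas, so on a current pandas install the equivalent call would be:
# Modern pandas replacement for the deprecated .as_matrix()
feature_matrix = features_df.to_numpy()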