How to combine h5 data numpy arrays based on date in filename?

How to combine h5 data numpy arrays based on date in filename? - python

I have hundreds of .h5 files with dates in their filename (e.g ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents data of interest. I want to group the data by month, so instead of having (e.g) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far, filename represents a txt file of file names.
def combine_months(filename):
fin = open(filename, 'r')
next_name = fin.readline()
while (next_name != ""):
year = next_name[6:10]
month = next_name[11:13]
date = month+'\\'+year
#not sure where to go from here
fin.close()
An example of what I hope to achieve is that say array_1, array_2, array_3 are numpy arrays representing data from different h5 files with the same month in the date of their filename.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1,3,7.5]
[2, 2, 6.5]
[3,4,7.5]
[4,6,12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.

Ok, here is the beginning of an answer. (I suspect you may have more questions as you work thru the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better IMHO) to use the glob.iglob() function. There are 2 examples below that show how to: 1) open each file, 2) read the data from the data dataset into an array, and 3) append the array to a list. The first example has the file names in a list. The second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py
arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
with h5py.File(h5file,'r') as h5f:
arr = h5f['data'][()]
#print(arr)
arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob
arr_list = []
for h5file in iglob('202210*.h5'):
with h5py.File(h5file,'r') as h5f:
print(h5f.keys()) # to get the dataset names from the keys
arr = h5f['data'][()]
#print(arr)
arr_list.append(arr)
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
# do something with the data based on column 0 value
print(arr.shape, arr.dtype)
Code below shows a way to sum rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It reads all column 0 values into a sorted list, then uses to size count and sum arrays, then as a index to the proper row.
# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
col_vals = arr[:,0]
for val in col_vals:
if val not in row_value_list:
row_value_list.append(val)
# Sort list of column IDs
row_value_list.sort()
# get length index list to create cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array, assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype
arr_cnt = np.zeros(shape=(a0,a1),dtype=dt)
arr_cnt[:,0] = row_value_list
arr_sum = np.zeros(shape=(a0,a1),dtype=dt)
arr_sum[:,0] = row_value_list
for arr in arr_list:
for row in arr:
idx = row_value_list.index(row[0])
arr_cnt[idx,1:] += 1
arr_sum[idx,1:] += row[1:]
print('Count Array\n',arr_cnt)
print('Sum Array\n',arr_sum)
arr_ave = arr_sum/arr_cnt
arr_ave[:,0] = row_value_list
print('Average Array\n',arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
col_vals = set(arr[:,0])
row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)

This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar but different from the first answer. It uses glob to get a list of filenames (instead of iglob). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (data is "stacked" on the 0-axis.). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
Use glob.glob() to load filenames to a list based on a wildcard
Allocate an array to hold all the data (arr_all) based on the # of
files and size of 1 dataset.
Loop thru all H5 files, loading data to arr_all
Create a sorted list of unique group IDs (column 0 values)
Allocate arrays to hold mean/median (arr_mean and arr_median) based on the # of unique row IDs and # of columns in arr_all.
Loop over values in ID list, then:
a. Create masked array (mask) where column 0 value = loop value
b. Broadcast mask to match arr_all shape then apply to create ma_arr_all
c. Loop over columns of ma_arr_all, compress to get unmasked values, then calculate mean and median and save.
Code below:
import h5py
from glob import glob
import numpy as np
# use glob.glob() to get list of files using wildcard names
file_list = glob('202210*.h5')
with h5py.File(file_list[0],'r') as h5f:
a0, a1 = h5f['data'].shape
# allocate array to hold values from all datasets
arr_all = np.zeros(shape=(len(file_list)*a0,a1), dtype=h5f['data'].dtype)
start, stop = 0, a0
for i, h5file in enumerate(file_list):
with h5py.File(h5file,'r') as h5f:
arr_all[start:stop,:] = h5f['data'][()]
start += a0
stop += a0
# Create a set from column 0 values, and use to create a sorted list
row_value_list = sorted(set(arr_all[:,0]))
arr_mean = np.zeros(shape=(len(row_value_list),arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list),arr_all.shape[1]))
col_0 = arr_all[:,0:1]
for i, row_val in enumerate(row_value_list):
row_mask = np.where(col_0==row_val, False, True ) # True mask value ignores data.
all_mask= np.broadcast_to(row_mask, arr_all.shape)
ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
for j in range(ma_arr_all.shape[1]):
masked_col = ma_arr_all[:,j:j+1].compressed()
arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
arr_median[i:i+1,j:j+1] = np.median(masked_col)
print('Mean values:\n',arr_mean)
print('Median values:\n',arr_median)
Added Nov 22, 2022:
Method above uses np.broadcast_to() introduced in NumPy 1.10. Here is an alternate method for prior versions. (Replaces the entire for i, row_val loop.) It should be more memory efficient. I haven't profiled to verify, but arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
row_mask = np.where(col_0==row_val, False, True ) # True mask value ignores data.
for j in range(arr_all.shape[1]):
masked_col = np.ma.masked_array(arr_all[:,j:j+1], mask=row_mask).compressed()
arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
arr_median[i:i+1,j:j+1] = np.median(masked_col)
I ran with values provided by OP. Output is provided below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]

Related

Is there a way to extend a PyTables EArray in the second dimension?

I have a 2D array that can grow to larger sizes than I'm able to fit on memory, so I'm trying to store it in a h5 file using Pytables. The number of rows is known beforehand but the length of each row is not known and is variable between rows. After some research, I thought something along these lines would work, where I can set the extendable dimension as the second dimension.
filename = os.path.join(tempfile.mkdtemp(), 'example.h5')
h5_file = open_file(filename, mode="w", title="Example Extendable Array")
h5_group = h5_file.create_group("/", "example_on_dim_2")
e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0)) # Assume num of rows is 100
# Add some item to index 2
print(e_array[2]) # should print an empty array
e_array[2] = np.append(e_array[2], 5) # add the value 5 to row 2
print(e_array[2]) # should print [5], currently printing empty array
I'm not sure if it's possible to add elements in this way (I might have misunderstood the way earrays work), but any help would be greatly appreciated!

You're close...but have a small misunderstanding of some of the arguments and behavior. When you create the EArray with shape=(100,0), you don't have any data...just an object designated to have 100 rows that can add columns. Also, you need to use e_array.append() to add data, not np.append(). Also, if you are going to create a very large array, consider defining the expectedrows= parameter for improved performance as the EArray grows.
Take a look at this code.
import tables as tb
import numpy as np
filename = 'example.h5'
with tb.File(filename, mode="w", title="Example Extendable Array") as h5_file :
h5_group = h5_file.create_group("/", "example_on_dim_2")
# Assume num of rows is 100
#e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))
e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(), shape=(100, 0))
print (e_array.shape)
e_array.append(np.arange(100,dtype=int).reshape(100,1)) # append a column of values
print (e_array.shape)
print(e_array[2]) # prints [2]

Here is an example showing how to create a VLArray (Variable Length). It is similar to the EArray example above, and follows the example from the Pytables doc (link in comment above). However, although a VLArray supports variable length rows, it does not have a mechanism to add items to an existing row (AFAIK).
import tables as tb
import numpy as np
filename = 'example_vlarray.h5'
with tb.File(filename, mode="w", title="Example Variable Length Array") as h5_file :
h5_group = h5_file.create_group("/", "vl_example")
vlarray = h5_file.create_vlarray(h5_group, "example", tb.IntAtom(), "ragged array of ints",)
# Append some (variable length) rows:
vlarray.append(np.array([0]))
vlarray.append(np.array([1, 2]))
vlarray.append([3, 4, 5])
vlarray.append([6, 7, 8, 9])
# Now, read it through an iterator:
print('-->', vlarray.title)
for x in vlarray:
print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, x))

Creating a dataset from multiple hdf5 groups

creating a dataset from multiple hdf5 groups
Code for groups with
np.array(hdf.get('all my groups'))
I have then added code for creating a dataset from groups.
with h5py.File('/train.h5', 'w') as hdf:
hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)
The error message being
ValueError: operands could not be broadcast together with shapes(534456,4) (534456,14)
The numbers in each group are the same other than the varying column lengths. 5 separate groups to one dataset.

This answer addresses the OP's request in comments to my first answer ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than the first answer. As a result I used a different approach to define dataset names and colums to be copied. Differences:
The first solution iterates over the dataset names from the "keys()" (copying each dataset completely, appending to a dataset in the new file). The size of the new dataset is calculated by summing sizes of all datasets.
The second solution uses 2 lists to define 1) dataset names (ds_list) and 2) associated columns to copy from each dataset (col_list is a of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns using col_list.
How you decide to do this depends on your data.
Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems.
Code below:
# Data for file1
arr1 = np.random.random(120).reshape(20,6)
arr2 = np.random.random(120).reshape(20,6)
arr3 = np.random.random(120).reshape(20,6)
arr4 = np.random.random(120).reshape(20,6)
# Create file1 with 4 datasets
with h5py.File('file1.h5','w') as h5f :
h5f.create_dataset('ds_1',data=arr1)
h5f.create_dataset('ds_2',data=arr2)
h5f.create_dataset('ds_3',data=arr3)
h5f.create_dataset('ds_4',data=arr4)
# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
h5py.File('file2.h5','w') as h5f2 :
# Loop over datasets in file1 to get dtype and rows (should test compatibility)
for i, ds in enumerate(h5f1.keys()) :
if i == 0:
ds_0_dtype = h5f1[ds].dtype
n_rows = h5f1[ds].shape[0]
break
# Create new empty dataset with appropriate dtype and size
# Use maxshape parameter to make resizable in the future
ds_list = ['ds_1','ds_2','ds_3','ds_4']
col_list =[ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
n_cols = sum( [ len(c) for c in col_list])
h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))
# Loop over datasets in file1, read data into xfer_arr, and write to file2
first = 0
for ds, cols in zip(ds_list, col_list) :
xfer_arr = h5f1[ds][:,cols]
last = first + xfer_arr.shape[1]
h5f2['combined'][:, first:last] = xfer_arr[:]
first = last

Here you go; a simple example to copy values from 3 datasets in file1 to a single dataset in file2. I included some tests to verify compatible dtype and shape. The code to create file1 are included at the top. Comments in the code should explain the process. I have another post that shows multiple ways to copy data between 2 HDF5 files. See this post: How can I combine multiple .h5 file?
import h5py
import numpy as np
import sys
# Data for file1
arr1 = np.random.random(80).reshape(20,4)
arr2 = np.random.random(40).reshape(20,2)
arr3 = np.random.random(60).reshape(20,3)
#Create file1 with 3 datasets
with h5py.File('file1.h5','w') as h5f :
h5f.create_dataset('ds_1',data=arr1)
h5f.create_dataset('ds_2',data=arr2)
h5f.create_dataset('ds_3',data=arr3)
# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
h5py.File('file2.h5','w') as h5f2 :
# Loop over datasets in file1 and check data compatiblity
for i, ds in enumerate(h5f1.keys()) :
if i == 0:
ds_0 = ds
ds_0_dtype = h5f1[ds].dtype
n_rows = h5f1[ds].shape[0]
n_cols = h5f1[ds].shape[1]
else:
if h5f1[ds].dtype != ds_0_dtype :
print(f'Dset 0:{ds_0}: dtype:{ds_0_dtype}')
print(f'Dset {i}:{ds}: dtype:{h5f1[ds].dtype}')
sys.exit('Error: incompatible dataset dtypes')
if h5f1[ds].shape[0] != n_rows :
print(f'Dset 0:{ds_0}: shape[0]:{n_rows}')
print(f'Dset {i}:{ds}: shape[0]:{h5f1[ds].shape[0]}')
sys.exit('Error: incompatible dataset shape')
n_cols += h5f1[ds].shape[1]
prev_ds = ds
# Create new empty dataset with appropriate dtype and size
# Using maxshape paramater to make resizable in the future
h5f2.create_dataset('ds_123', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))
# Loop over datasets in file1, read data into xfer_arr, and write to file2
first = 0
for ds in h5f1.keys() :
xfer_arr = h5f1[ds][:]
last = first + xfer_arr.shape[1]
h5f2['ds_123'][:, first:last] = xfer_arr[:]
first = last

Split data in columns and store it as two dimensional array

I have a data in this form:
49907 87063
42003 51519
21301 46100
97578 26010
52364 86618
25783 71775
1617 29096
2662 47428
74888 54550
17182 35976
86973 5323
......
I need to traverse it at the end like for line in file.
I want to split them like first column values store in array one and second column values store in array two, so whenever I call Array_one[0], Array_two[0] I will get the first row values like 49907 87063 and same for other values.

You can use space as a seperator.
Ex:
import pandas as pd
df = pd.read_csv(filename, sep="\s+", names = ["A", "B"])
print(df["A"][0])
print(df["B"][0])
Output:
49907
87063
for i in df.values:
print(i)
Output:
[49907 87063]
[42003 51519]
[21301 46100]
[97578 26010]
[52364 86618]
[25783 71775]
[ 1617 29096]
[ 2662 47428]
[74888 54550]
[17182 35976]
[86973 5323]

You can use numpy.genfromtxt to extract directly into a numpy array:
A = np.genfromtxt(file, dtype=int)
Whitespace is the default delimiter.
You can then use standard numpy indexing / slicing:
To extract the first row: A[0]; the second column: A[:, 1].
To extract the first element of the first row: A[0, 0].
To extract the first element of the second column: A[0, 1]
To iterate the entire array by row:
for i in range(A.shape[0]):
print(A[i])

Averaging values in a list of a list of a list in Python

I'm working on a method to average data from multiple files and put the results into a single file. Each line of the files looks like:
File #1
Test1,5,2,1,8
Test2,10,4,3,2
...
File #2
Test1,2,4,5,1
Test2,4,6,10,3
...
Here is the code I use to store the data:
totalData = []
for i in range(0, len(files)):
data = []
if ".csv" in files[i]:
infile = open(files[i],"r")
temp = infile.readline()
while temp != "":
data.append([c.strip() for c in temp.split(",")])
temp = infile.readline()
totalData.append(data)
So what I'm left with is totalData looking like the following:
totalData = [[
[Test1,5,2,1,8],
[Test2,10,4,3,2]],
[[Test1,2,4,5,1],
[Test2,4,6,10,3]]]
What I want to average is for all Test1, Test2, etc, average all the first values and then the second values and so forth. So testAverage would look like:
testAverage = [[Test1,3.5,3,3,4.5],
[Test2,7,5,6.5,2.5]]
I'm struggling to think of a concise/efficient way to do this. Any help is greatly appreciated! Also, if there are better ways to manage this type of data, please let me know.

It just need two loops
totalData = [ [['Test1',5,2,1,8],['Test2',10,4,3,2]],
[['Test1',2,4,5,1],['Test2',4,6,10,3]] ]
for t in range(len(totalData[0])): #tests
result = [totalData[0][t][0],]
for i in range(1,len(totalData[0][0])): #numbers
sum = 0.0
for j in range(len(totalData)):
sum += totalData[j][t][i]
sum /= len(totalData)
result.append(sum)
print result

first flatten it out
results = itertools.chain.from_iterable(totalData)
then sort it
results.sort()
then use groupby
data = {}
for key,values in itertools.groupby(results,lambda x:x[0]):
columns = zip(*values)
data[key] = [sum(c)*1.0/len(c) for c in columns]
and finally just print your data

If your data structure is regular, the best is probably to use numpy. You should be able to install it with pip from the terminal
pip install numpy
Then in python:
import numpy as np
totalData = np.array(totalData)
# remove the last dimension (i.e. 'Test1', 'Test2'), since it's not a number
totalData = np.array(totalData[:, :, 1:], float)
# average
np.mean(totalData, axis=0)

How do I index n sets of 4 columns to plot multiple plots using matplotlib?

I want to know how I should index / access some data programmatically in python.
I have columnar data: depth, temperature, gradient, gamma, for a set of boreholes. There are n boreholes. I have a header, which lists the borehole name and numeric ID. Example:
Bore_name,Bore_ID,,,Bore_name,Bore_ID,,,, ...
<a row of headers>
depth,temp,gradient,gamma,depth,temp,gradient,gamma ...
I don't know how to index the data, apart from rude iteration:
with open(filename,'rU') as f:
bores = f.readline().rstrip().split(',')
headers = f.readline().rstrip().split(',')
# load from CSV file, missing values are empty 'cells'
tdata = numpy.genfromtxt(filename, skip_header=2, delimiter=',', missing_values='', filling_values=numpy.nan)
for column in range(0,numpy.shape(tdata)[1],4):
# plots temperature on x, depth on y
pl.plot(tdata[:,column+1],tdata[:,column], label=bores[column])
# get index at max depth
depth = numpy.nanargmin(tdata[:,column])
# plot text label at max depth (y) and temp at that depth (x)
pl.text(tdata[depth,column+1],tdata[depth,column],bores[column])
It seems easy enough this way, but I've been using R recently and have got a bit used to their way of referencing data objects via classes and subclasses interpreted from headers.

Well if you like R's data.table, there have been a few (at least) attempts to re-create that functionality in NumPy--through additional classes in NumPy Core and through external Python libraries. The effort i find most promising is the datarray library by Fernando Perez. Here's how it works.
>>> # create a NumPy array for use as our data set
>>> import numpy as NP
>>> D = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> # create some generic row and column names to pass to the constructor
>>> row_ids = [ "row{0}".format(c) for c in range(D1.shape[0]) ]
>>> rows = 'rows_id', row_ids
>>> variables = [ "col{0}".format(c) for c in range(D1.shape[1]) ]
>>> cols = 'variable', variables
Instantiate the DataArray instance, by calling the constructor and passing in an ordinary NumPy array and a list of tuples--one tuple for each axis, and since ndim = 2 here, there are two tuples in the list each tuple is comprised of axis label (str) and a sequence of labels for that axes (list).
>>> from datarray.datarray import DataArray as DA
>>> D1 = DA(D, [rows, cols])
>>> D1.axes
(Axis(name='rows', index=0, labels=['row0', 'row1', 'row2', 'row3',
'row4', 'row5', 'row6', 'row7']), Axis(name='cols', index=1,
labels=['col0', 'col1', 'col2', 'col3', 'col4']))
>>> # now you can use R-like syntax to reference a NumPy data array by column:
>>> D1[:,'col1']
DataArray([8, 5, 0, 7, 8, 9, 9, 4])
('rows',)

You could put your data into a dict for each borehole, keyed by the borehole id, and values as dicts with headers as keys. Roughly like this:
data = {boreid1:{"temp":temparray, ...}, boreid2:{"temp":temparray}}
Probably reading from files will be a little bit more cumbersome with these approach, but for plotting you could do something like
pl.plot(data[boreid]["temperature"], data[boreid]["depth"])

Here are idioms for naming rows and columns:
row0, row1 = np.ones((2,5))
for col in range(0, tdata.shape[1], 4):
depth,temp,gradient,gamma = tdata[:, col:col+4] .T
pl.plot( temp, depth )
See also namedtuple:
from collections import namedtuple
Rec = namedtuple( "Rec", "depth temp gradient gamma" )
r = Rec( *tdata[:, col:col+4].T )
print r.temp, r.depth
datarray (thanks Doug) is certainly more general.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to combine h5 data numpy arrays based on date in filename? - python

Related

Is there a way to extend a PyTables EArray in the second dimension?

Creating a dataset from multiple hdf5 groups

Split data in columns and store it as two dimensional array

Averaging values in a list of a list of a list in Python

How do I index n sets of 4 columns to plot multiple plots using matplotlib?

Categories

Resources