I'm trying to overwrite a numpy array that's a small part of a pretty complicated h5 file.
I'm extracting an array, changing some values, then want to re-insert the array into the h5 file.
I have no problem extracting the nested array:
f1 = h5py.File(file_name,'r')
X1 = f1['meas/frame1/data'][()]  # .value was removed in h5py 3.0; [()] reads the full dataset
f1.close()
My attempted code looks something like this with no success:
f1 = h5py.File(file_name,'r+')
dset = f1.create_dataset('meas/frame1/data', data=X1)
f1.close()
As a sanity check, I executed this in Matlab using the following code, and it worked with no problems.
h5write(file1, '/meas/frame1/data', X1);
Does anyone have any suggestions on how to do this successfully?
You want to assign values, not create a dataset:
f1 = h5py.File(file_name, 'r+') # open the file
data = f1['meas/frame1/data'] # load the data
data[...] = X1 # assign new values to data
f1.close() # close the file
To confirm the changes were properly made and saved:
f1 = h5py.File(file_name, 'r')
np.allclose(f1['meas/frame1/data'][()], X1)
#True
askewchan's answer describes the way to do it (you cannot create a dataset under a name that already exists, but you can of course modify the dataset's data). Note, however, that the dataset must have the same shape as the data (X1) you are writing to it. If you want to replace the dataset with some other dataset of different shape, you first have to delete it:
del f1['meas/frame1/data']
dset = f1.create_dataset('meas/frame1/data', data=X1)
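For convenience, here is a minimal sketch of a helper that handles both cases (in-place assignment when shapes match, delete-and-recreate otherwise). The function name overwrite_dataset is my own, not part of h5py:

import h5py

def overwrite_dataset(f, path, arr):
    # hypothetical helper: assign in place when shape/dtype match, else delete and recreate
    if path in f and f[path].shape == arr.shape and f[path].dtype == arr.dtype:
        f[path][...] = arr
    else:
        if path in f:
            del f[path]
        f.create_dataset(path, data=arr)

with h5py.File(file_name, 'r+') as f:
    overwrite_dataset(f, 'meas/frame1/data', X1)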
Different scenarios:
Partial changes to dataset
with h5py.File(file_name, 'r+') as f:
    f['meas/frame1/data'][5] = val     # change index 5 to scalar "val"
    f['meas/frame1/data'][3:7] = vals  # change values at indices 3--6 to "vals"
Change each value of dataset (identical dataset sizes)
with h5py.File(file_name, 'r+') as f:
    f['meas/frame1/data'][...] = X1  # change array values to those of "X1"
Overwrite dataset to one of different size
with h5py.File(file_name, 'r+') as f:
    del f['meas/frame1/data']                      # delete old, differently sized dataset
    f.create_dataset('meas/frame1/data', data=X1)  # create new dataset from "X1"
Since the File object is a context manager, using with statements is a nice way to package your code, automatically closing the file once you're done altering it. (You don't want to be in read/write mode if you only need to read the data!)
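For example, a minimal read-only sketch using the same path:

with h5py.File(file_name, 'r') as f:  # read-only mode: no accidental writes
    arr = f['meas/frame1/data'][:]    # copy the dataset into a NumPy array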
Related
def get_df():
    df = pd.DataFrame()
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            av_a = np.average(a, axis=0)
            np.savetxt('merged_average.csv', av_a, delimiter=',')
I've tried to save the results, but each file overwrites the previous one, so the earlier results are lost.
At the moment, your code is a bit hard to read, as you are declaring variables which are not used (df) and using variables which are not declared (a). In the future, try to give a minimal reproducible example of your problematic code.
I'll still try to give you an interpreted answer:
If you want to store multiple columns from different files next to each other, the job becomes simpler by first acquiring all the columns, and afterwards saving them to the file in a single action.
Here is an interpretation of your code:
def get_df():
    # create an empty list to collect all results
    average_results = []
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            a = something(file)  # unknown to me
            average_results.append(np.average(a, axis=0))
    # convert the results to a 2d numpy matrix,
    # optionally transpose it to get the desired data orientation
    data = np.array(average_results).transpose()
    # save the full dataset
    np.savetxt('merged_average.csv', data, delimiter=',')
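For completeness, a guess at the missing loader, assuming each CSV is a plain numeric table (an assumption on my part, since the original loading code was not shown):

import numpy as np

def something(file):
    # hypothetical loader: read a plain numeric CSV into an array
    return np.loadtxt(file, delimiter=',')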
This is the code I am currently using to open this very large MATLAB file:
self.data = {}
f = h5py.File(filepath, 'r')
for k, v in f.items():
    self.data[k] = np.array(v)
self.data = list(self.data.items())
self.data = np.array(self.data)
self.fs = self.data[1][1][0][0]
self.data = self.data[0][1]
print('fs = ', self.fs)
print('DONE. Data read using h5py reader')
It takes about 3-5 minutes to load fully. How can I improve this code so I can speed up the process?
Saving and using objects is easy. It's hard to provide specifics about objects vs arrays without knowing the purpose of your code. What do you want to do with variable self.data after you get it? You retrieve all of the datasets, then keep redefining self.data. At the end, you assign self.data to an array for the 1st dataset. Also, self.fs points to a single value from the 2nd dataset. So, it's not clear why you retrieve all of them.
If you only want the data from the first dataset, something like this will do the job.
f = h5py.File(filepath, 'r')
# returns first key from group member names:
dsname = next(iter(f)) # (name of 1st dataset)
self.data = f[dsname][:] # (data from 1st dataset)
If you want the names and objects for all the datasets, this will build a dictionary of names and h5py objects:
f = h5py.File(filepath, 'r')
self.data = {}
for k, v in f.items():
    self.data[k] = v
Then, when you need an array of data, use one of these methods:
arr = data['a dataset name'][:]   # for 1 dataset (when you know the name)

for dsname, obj in data.items():  # loop thru all pairs in data
    arr = obj[:]                  # or
    arr = f[dsname][:]
To understand what's going on, let's start with some HDF5 basics. It's critically important to understand the schema when working with HDF5. The data structure is similar to folders and files on a computer. The Folders are "Groups" and the Files are "Datasets". With h5py, you access Group members using dictionary syntax, and you access Datasets with NumPy array syntax. (The h5py File object behaves like a Group.)
So, when you execute for k,v in f.items(), h5py returns key,value pairs for each group member (k,v in your code). The key is the object's name and the value is a h5py object (either a dataset or another group).
Once you have a dataset, there are 3 ways to access the associated data:
1. You can simply save the object and reference it "as-if" it is a NumPy array.
2. You can save the object's name, then when you need the object's data, retrieve it using the dataset name and associated group.
3. You can read the dataset values into a NumPy array. (This is what your code does.)
Note: I prefer Methods #1 and #2 because they do not load the data into memory until you read the data.
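A minimal sketch of the three methods (the dataset name 'ds1' is hypothetical; substitute one of your own). Note that Methods #1 and #2 only work while the file is still open:

with h5py.File(filepath, 'r') as f:
    # Method 1: keep the h5py Dataset object; the data stays on disk
    dset = f['ds1']
    # Method 2: keep only the name; read through the file when needed
    dsname = 'ds1'
    arr_from_name = f[dsname][:]
    # Method 3: read everything into a NumPy array immediately
    arr = f['ds1'][:]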
Here is your code with my comments about each step:
# returns a file object "f" (behaves like a group):
f = h5py.File(filepath, 'r')
# returns key/value pairs of group member names/objects:
for k, v in f.items():
    # adds a dictionary key (object name) with value of an array read from object "v":
    self.data[k] = np.array(v)
# converts the (name, array) dictionary into a list:
self.data = list(self.data.items())
# converts the list into an ndarray of dtype=object:
# - has 1 row for each dataset, with 2 columns:
# - column 0: object name, column 1: array of data
self.data = np.array(self.data)
# reads the [0][0] value from the array at self.data[1][1] ([0][0] value of the 2nd dataset):
self.fs = self.data[1][1][0][0]
# resets data to the array at self.data[0][1] (array for the 1st dataset):
self.data = self.data[0][1]
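Putting it together, a sketch of a leaner version that skips the intermediate dict/list/array conversions; it assumes (as your indexing implies) that the first two members of the file are the datasets you want:

with h5py.File(filepath, 'r') as f:
    names = list(f)              # member names, in HDF5 iteration order
    self.data = f[names[0]][:]   # full array from the 1st dataset
    self.fs = f[names[1]][0][0]  # single value from the 2nd dataset

This reads only what you actually use, which should be considerably faster than converting every dataset to an array up front.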
I am working with multiple data files (File_1, File_2, .....). I want the desired outputs for each data file to be saved in the same txt file as row values of a new column.
I tried the following code for my first data file (File_1). The desired outputs (Av_Age_btwn_0_to_5, Av_Age_btwn_5_to_10) are stored as row values of a column in the output txt file (Result.txt). Now, I want these outputs to be stored as row values of a next column of the same txt file when I work with File_2. Then for File_3, in a similar manner, I want the outputs in the next column and so on.
import numpy as np
data=np.loadtxt('C:/Users/Hrihaan/Desktop/File_1.txt')
Age=data[:,0]
Age_btwn_0_to_5=Age[(Age<5) & (Age>0)]
Age_btwn_5_to_10=Age[(Age<10) & (Age>=5)]
Av_Age_btwn_0_to_5=np.mean(Age_btwn_0_to_5)
Av_Age_btwn_5_to_10=np.mean(Age_btwn_5_to_10)
np.savetxt('/Users/Hrihaan/Desktop/Result.txt', (Av_Age_btwn_0_to_5, Av_Age_btwn_5_to_10), delimiter=',')
Any help would be appreciated.
If I understand correctly, each of your files is a column, and you want to combine them into a matrix (one file per column).
Maybe something like this could work?
import numpy as np

# Simulate some dummy data
def simulate_data(n_files):
    for i in range(n_files):
        ages = np.random.randint(0, 10, 100)
        np.savetxt("/tmp/File_{}.txt".format(i), ages, fmt='%i')

# Your file processing
def process(age):
    age_btwn_0_to_5 = age[(age < 5) & (age > 0)]
    age_btwn_5_to_10 = age[(age < 10) & (age >= 5)]
    av_age_btwn_0_to_5 = np.mean(age_btwn_0_to_5)
    av_age_btwn_5_to_10 = np.mean(age_btwn_5_to_10)
    return (av_age_btwn_0_to_5, av_age_btwn_5_to_10)

n_files = 5
simulate_data(n_files)

results = []
for i in range(n_files):
    # load data
    data = np.loadtxt('/tmp/File_{}.txt'.format(i))
    # process your file and extract your information
    data_processed = process(data)
    # store the result
    results.append(data_processed)

results = np.asarray(results)
np.savetxt('/tmp/Result.txt', results.T, delimiter=',', fmt='%.3f')
In the end, you get something like this:
2.649,2.867,2.270,2.475,2.632
7.080,6.920,7.288,7.231,6.880
Is this what you're looking for?
import numpy as np

# some data
age = np.arange(10)
time = np.arange(10)
mean = np.arange(10)

output = np.array(list(zip(age, time, mean)))
# fmt='%s' writes each value as a plain string (so the ints stay ints);
# omit it and savetxt falls back to its default float format
np.savetxt('FooFile.txt', output, delimiter=',', fmt='%s')
output:
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
6,6,6
7,7,7
8,8,8
9,9,9
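For what it's worth, np.column_stack builds the same 2-D layout directly, without the zip/list round-trip:

import numpy as np

age = np.arange(10)
time = np.arange(10)
mean = np.arange(10)
output = np.column_stack((age, time, mean))  # shape (10, 3), same rows as above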
I have several groups in my h5 file: 'group1', 'group2', ... and each group has 3 different datasets: 'dataset1', 'dataset2', 'dataset3', all of which are arrays with numerical values but the size of array is different.
My goal is to save each dataset from group to a numpy array.
Example:
import h5py
filename = '../Results/someFileName.h5'
data = h5py.File(filename, 'r')
Now I can easily iterate over all groups with
for i in range(len(data.keys())):
    group = list(data.keys())[i]
but I can't figure out how to access the datasets within the group. So I am looking for something like this MATLAB code:
hinfo = h5info(filename);
for i = 1:length(hinfo.Groups)
    datasetname = [hinfo.Groups(i).Name '/dataset1'];
    dset = h5read(filename, datasetname);
end
Where dset is now an array of numbers.
Is there a way I could do the same with h5py?
You have the right idea.
But you don't need to loop over range(len(data.keys())).
Just use data.keys(); it returns a view of the object names that you can iterate over directly.
Try this:
import h5py

filename = '../Results/someFileName.h5'
data = h5py.File(filename, 'r')
for group in data.keys():
    print(group)
    for dset in data[group].keys():
        print(dset)
        ds_data = data[group][dset]  # returns HDF5 dataset object
        print(ds_data)
        print(ds_data.shape, ds_data.dtype)
        arr = data[group][dset][:]   # adding [:] returns a numpy array
        print(arr.shape, arr.dtype)
        print(arr)
Note: the logic above is valid ONLY when the top level contains groups alone (no datasets) and each group contains only datasets. It does not test whether an object is a group or a dataset.
To avoid these assumptions/limitations, you should investigate .visititems() or write a generator to recursively visit objects. The first two links below show .visititems() usage, and the last one uses a generator function:
Use visititems(-function-) to loop recursively
This example uses isinstance() as the test. The object is a Group when it tests true for h5py.Group, and a Dataset when it tests true for h5py.Dataset. I consider this more Pythonic than the second example below (IMHO); a minimal sketch of the idea follows after this list.
Convert hdf5 to raw organised in folders
It checks the number of objects below the visited object: when there are no subgroups, it is a dataset; when there are subgroups, it is a group.
How can I combine multiple .h5 files? This question has multiple answers. This answer uses a generator to merge data from several files with several groups and datasets into a single file.
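As promised above, a minimal sketch of the isinstance() approach (the printing is illustrative only):

import h5py

def visitor(name, obj):
    # called once for every object below the root
    if isinstance(obj, h5py.Dataset):
        print('dataset:', name, obj.shape, obj.dtype)
    elif isinstance(obj, h5py.Group):
        print('group:', name)

with h5py.File(filename, 'r') as f:
    f.visititems(visitor)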
This method requires that the dataset names ('lat', 'lon', 'x', 'y' in the code below) be the same in each of the hdf5 groups of one hdf5 file.
# create empty lists
lat = []
lon = []
x = []
y = []

# fill lists, creating numpy arrays
h5f = h5py.File('filename.h5', 'r')              # read file
for group in h5f.keys():                         # iterate through groups
    lat = np.append(lat, h5f[group]['lat'][()])  # append data
    lon = np.append(lon, h5f[group]['lon'][()])
    x = np.append(x, h5f[group]['x'][()])
    y = np.append(y, h5f[group]['y'][()])
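Note that np.append copies the whole array on every call. A sketch of a typically faster variant for many groups, collecting pieces in a list and concatenating once at the end (shown for 'lat'; the same pattern applies to the other datasets):

import h5py
import numpy as np

lat_parts = []
with h5py.File('filename.h5', 'r') as h5f:
    for group in h5f.keys():
        lat_parts.append(h5f[group]['lat'][()])
lat = np.concatenate(lat_parts)  # a single copy at the end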
I can find a lot of documents/forums explaining how to convert a csv to a TensorFlow dataset, but not a single one explaining how to convert a dataset back to a csv. I have a csv with two columns now (filename, weight; more columns may be added later). I read that into TensorFlow and create a dataset. At the end of the script the 2nd column is modified, and I need to save these columns to a csv. I need them in csv (not a checkpoint) because I may need to do stuff with it in Matlab.
I tried to call the dataset's map function and save to csv inside the map function, but it doesn't work as expected.
import csv
import tensorflow as tf

# reading csv to dataset
def map_func1(line):
    FIELD_DEFAULTS = [[""], [0.0]]
    sample, weight = tf.decode_csv(line, FIELD_DEFAULTS)
    return sample, weight

ds = tf.data.TextLineDataset('sample_weights.csv')
ds_1 = ds.map(map_func1)

# then the dataset is modified to ds_2 (not including code - it's just another map func)

# trying to save to csv
def map_func3(writer, x):
    x0, x1 = x
    writer.writerow([x0, x1])
    return x

with open('sample_weights_mod.csv', 'w') as file:
    writer = csv.writer(file)
    ds_3 = ds_2.map(lambda *x: map_func3(writer, x))
This doesn't work as expected; it just writes the symbolic tensor representations to the csv: Tensor("arg0:0", shape=(), dtype=string) Tensor("arg1:0", shape=(), dtype=float32)
This solution is probably a bad one; I really need a neat way to do this.
Though not a good way of doing it, for now I did it as below:
import pandas as pd

# the movies variable is of type tensorflow.python.data.ops.dataset_ops.MapDataset
z = []
for example in movies:
    z.append(example.numpy().decode("utf-8"))

mv = {'movie_title': z}
pd.DataFrame(mv).to_csv('movie.csv')
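For the original two-column (filename, weight) case, here is a sketch under the assumption that eager execution is available (TF 2.x): iterate the dataset with as_numpy_iterator() and write rows with the csv module. ds_2 and the output filename are carried over from the question:

import csv
import tensorflow as tf  # assumes TF 2.x (eager execution)

# ds_2 is assumed to yield (sample, weight) pairs, as in the question
with open('sample_weights_mod.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for sample, weight in ds_2.as_numpy_iterator():
        # sample is bytes, weight is a numpy float
        writer.writerow([sample.decode('utf-8'), float(weight)])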