I have a few questions wrapped up into this issue. I realize this might be a convoluted post and can provide extra details.
A code package I use can produce large .h5 files (source.h5) (100+ Gb), where almost all of this data resides in 1 dataset (group2/D). I want to make a new .h5 file (dest.h5) using Python that contains all datasets except group2/D of source.h5 without needing to copy the entire file. I then will condense group2/D after some postprocessing and write a new group2/D in dest.h5 with much less data. However, I need to keep source.h5 because this postprocessing may need to be performed multiple times into multiple destination files.
source.h5 is always structured the same and cannot be changed in either source.h5 or dest.h5, where each letter is a dataset:
group1/A
group1/B
group2/C
group2/D
I thus want to initially make a file with this format:
group1/A
group1/B
group2/C
and again, fill in group2/D later. Simply copying source.h5 multiple times is always possible, but I'd like to avoid having to copy a huge file a bunch of times because disk space is limited and this is something that isn't a 1 off case.
I searched and found this question (How to partially copy using python an Hdf5 file into a new one keeping the same structure?) and tested if dest.h5 would be the same as source.h5:
fs = h5py.File('source.h5', 'r')
fd = h5py.File('dest.h5', 'w')
fs.copy('group1', fd)
fd.create_group('group2')
fs.copy('group2/C', fd['/group2'])
fd.copy('group2/D', fd['/group2'])
fs.close()
fd.close()
but the code package I used couldn't read the file I created (which I must have happen), implying there was some critical data loss when I did this operation (the file sizes differ by 7 kb also). I'm assuming the problem was when I created group2 manually because I checked with numpy that the values in group1 datasets exactly matched in both source.h5 and dest.h5. Before I did any digging into what data is missing I wanted to get a few things out of the way:
Question 1: Is there .h5 file metadata that accompanies each group or dataset? If so, how can I see it so I can create a group2 in dest.h5 that exactly matches the one in source.h5? Is there a way to see if 2 groups (not datasets) exactly match each other?
Question 2: Alternatively, is it possible to simply copy the data structure of a .h5 file (i.e. groups and datasets with empty lists as a skeleton file) so that fields can be populated later? Or, as a subset of this question, is there a way to copy a blank dataset to another file such that any metadata is retained (assuming there is some)?
Question 3: Finally, to avoid all this, is it possible to just copy a subset of source.h5 to dest.h5? With something like:
fs.copy(['group1','group2/C'], fd)
Thanks for your time. I appreciate you reading this far
Related
How can i display multiple pandas function created on python in the same csv file
So I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not really sure the best way to go about this, as I want to maintain each dataframes inherent structure (ie columns and index), so I can combine them all into 1 single dataframe.
You have 2 choices:
Either you combine them first (pd.concat()) with all the advantages and limitations of that appraoch, then you cann call .to_csv and it will print 1 file. If they are structurally the same, this is great because you will be able to read the file again.
Or, you call .to_csv() multiple times, and save the output in a "buffer", which you can then write (see here). Probably the only way if your DataFrames are very different from a structural perspective, but a mess to read them later.
Is .json output an option for what you want to do?
Thanks alot for the comment Kingotto, I used to first option added the this code and it was able to help me arrange my functions horizontally and exported the file to csv like this:
frames = pd.concat([file_1, file_2, file_3], axis = 1)
save the dataframe
frames.to_csv('Combined.csv', index = False)
I have some instrumental data which saved in hdf-5 format as multiple 2-d array along with the measuring time. As attached figures below, d1 and d2 are two independent file in which the instrument recorded in different time. They have the same data variables, and the only difference is the length of phony_dim_0, which represet the total data points varying with measurement time.
These files need to be loaded to a specific software provided by the instrument company for obtaining meaningful results. I want to merge multiple files with Python xarray while keeping in their original format, and then loaed one merged file into the software.
Here is my attempt:
files = os.listdir("DATA_PATH")
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])
## copy a new one to save the merged data array.
d0 = d1
vars_ = [c for c in d1]
for var in vars_:
d0[var].values = np.vstack([d1[var],d2[var]])
The error shows like this:
replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)
I thought about two solution for this problem:
expanding the dimension length to the total length of all merged files.
creating a new empty dataframe in the same format of d1 and d2.
However, I still could not figure out the function to achieve that. Any comments or suggestions would be appreciated.
Supplemental information
dataset example [d1],[d2]
I'm not familiar with xarray, so can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and is all you need to get merge the data.
A note about Xarray. It uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets', and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I am going to use it for the rest of my post.
There are some things to consider when merging datasets across 2 or more HDF5 files. They are:
Consistency of the data schema (which you have checked).
Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
It's preferable to create resizabe datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizble. That simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common in both files. That also simplifies the process.
The code below goes through the following steps to merge the data.
Open the new merge file for writing
Open the first data file (read-only)
Loop thru all data sets
a. use the group copy function to copy the dataset (data plus maxshape parameters, and attribute names and values).
Open the second data file (read-only)
Loop thru all data sets and do the following:
a. get the size of the 2 datasets (existing and to be added)
b. increase the size of HDF5 dataset with .resize() method
c. write values from dataset to end of existing dataset
At the end it loops thru all 3 files and prints shape and
maxshape for all datasets (for visual comparison).
Code below:
import h5py
files = [ '211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5' ]
# Create the merge file:
with h5py.File('merged_.h5','w') as h5fw:
# Open first HDF5 file and copy each dataset.
# Will use maxhape and attributes from existing dataset.
with h5py.File(files[0],'r') as h5fr:
for ds in h5fr.keys():
h5fw.copy(h5fr[ds], h5fw, name=ds)
# Open second HDF5 file and copy data from each dataset.
# Resizes existing dataset as needed to hold new data.
with h5py.File(files[1],'r') as h5fr:
for ds in h5fr.keys():
ds_a0 = h5fw[ds].shape[0]
add_a0 = h5fr[ds].shape[0]
h5fw[ds].resize(ds_a0+add_a0,axis=0)
h5fw[ds][ds_a0:] = h5fr[ds][:]
for fname in files:
print(f'Working on file:{fname}')
with h5py.File(fname,'r') as h5f:
for ds, h5obj in h5f.items():
print (f'for: {ds}; axshape={h5obj.shape}, maxshape={h5obj.maxshape}')
So I have some Astropy fits tables that I save (they have all have the same format, column names, etc.). I want to take all these fits files and combine them to make one large fits file.
Currently, I am playing around with the astropy.io append and update functions to no avail.
Any help would be greatly appreciated.
So I have it working now. This is what I did essentially:
# Read in the fits table you want to append
table = Table.read(input_file, format='fits')
# Read in the large table you want to append to
base_table = Table.read('base_file.fits', format='fits')
# Use Astropy's 'vstack' function and overwrite the file
concat_table = vstack([base_table,append_table])
concat_table.write('base_file.fits', format='fits', overwrite=True)
In my case, all the columns are the same for every table. So I just looped through all the fits files and appended them one at a time. There are probably other ways to do this, but I found this was the easiest.
I'm a student who is quite new to coding in Python.
I'm using Dymola for several years and now I'm using the Dymola/Python interface with which you can operate Dymola from inside Python (useful for building stock simulations, global sensitivity analysis etc.).
Now, Dymola always generates .mat files in an efficient unreadable data structure. I was wondering how to export variables I'm interested in from that .mat-file to .csv using a Python-script? (I don't want the whole file to be converted to .csv because it is simple way too large)
I'm aware of a DyMat-package for Python that should do the job but either I don't understand the code or the code is not doing what it should do? Does anybody have experience with this?
I probably miss some code defining which .mat file has to be read/exported from, which variables I want and in which directory the result.csv-file should be stored?
import csv, numpy
def export(dm, varList, fileName=None, formatOptions={}):
"""Export DyMat data to a CSV file"""
if not fileName:
fileName = dm.fileName + '.csv'
oFile = open(fileName, 'w')
csvWriter = csv.writer(oFile)
vDict = dm.sortByBlocks(varList)
for vList in vDict.values():
vData = dm.getVarArray(vList)
vList.insert(0, dm._absc[0])
csvWriter.writerow(vList)
csvWriter.writerows(numpy.transpose(vData))
oFile.close()
Thanks!
In the Dymola distribution there is a utility called alist.exe, that allows you to export a number of variables in CSV format.
Another possibility is to convert the MAT file to SDF format, which is a very simple HDF5 interpretation. The HDF5 file is not as compact as the MAT-file, but you can compress the file using ZIP/GZIP/7ZIP to reduce archival storage. There are both MATLAB and Python scripts for reading the SDF format in the Dymola distribution.
Since this was tagged openmodelica, I am proposing a solution using it:
filterSimulationResults("file.mat", "file.csv", {"x","y","z"}) creates a csv-file with only variables x, y, z (If you think it's still too large, it is possible to resample the file).
For small files (<2GB) Buildingspy (or other Python packages) covers pretty much all needs: https://simulationresearch.lbl.gov/modelica/buildingspy/
However, since one will run into issues when the files are above 2GB (e.g. for full years of simulations), "alist.exe" from Dymola may be employed. (filterSimulationResults from OpenModelica also fails then)
"alist.exe" seems to accept up until approx. 100 variables to be exported at once and single executions for each variable seems to slow things down drastically (translation of 1 or 100 rows takes almost the same time). One may employ the alist.exe as follows from Python to facilitate automation and speed things up.
var_list=['Component.Name1','Component.Name3','Component.Name2','...'] #List of Variabels to be extracted
N_batch=100 #Number of variables to be extracted from the .mat file at once (max. approx 110)
cmds=[] #list of commands to be executed batch wise
for i,var in enumerate(var_list):
if (i%N_batch == 0) &(i > 0):
cmds.append(cmd)
cmd=''
cmd+=f' -e {var}'#build command
cmds.append(cmd)
lst_df=[] #list of pandas dataframes
for i,cmd in enumerate(cmds):
os.system(f'"C:/Program Files/Dymola 2021/bin64/alist.exe" {cmd} {inFile} tmp.csv')
lst_df.append(pd.read_csv('tmp.csv',index_col=[0]).squeeze("columns"))
df_overall=pd.concat(lst_df,axis=1)
df_overall.to_csv('CompleteCSVFile.csv')#or use .pkl for more efficient writing and reading
It is still not a fast solution, but enables the processing of the date in the first instance. Variable Selection of Dymola should always be exploited first before trying to wrangle around such amounts of data on a local machine.
Hope this helps someone someday!
I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file are a line for each "feature" that appears latter in the file. They look like:
#attribute 'Diameter' numeric
#attribute 'Length' real
#attribute 'Qty' integer
lines containing data using these attributes look like:
{0 0.86, 1 0.98, 2 7}
However, since my data is sparse data, each record from my database may not have each attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set, and then the second time to output my records, but I'm trying to find a more efficient method.
I'd like to try a method like the following pseudo-code:
fout = open('output.dat', 'w')
known_features = set()
for records in records:
if record has unknown features:
jump to top of file
delete existing "#attribute" lines and write new lines
jump to bottom of file
fout.write(record)
It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?
I tried something like:
fout.seek(0)
for new_attribute in new_attributes:
fout.write(attribute)
fout.seek(0, 2)
but this overwrites both the attribute lines and data lines at the top of the file, not simply insert new lines starting at the seek position I specify.
How do you obtain a word-processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.
Why don't you get a list of all the features and their data types; list them first. If a feature is missing, replace it with a known value - NULL seems appropriate.
This way your records will be complete (in length), and you don't have to hop around the file.
The other approach is, write two files. One contains all your features, the others all your rows. Once both files are generated, append the feature file to the top of the data file.
FWIW, word processors load files in memory for editing; and then they write the entire file out. This is why you can't load a file larger than the addressable/available memory in a word processor; or any other program that is not implemented as a stream reader.
Why don't you build the output in memory first (e.g. as a dict) and write it to a file after all data is known?