Using MFDataset to combine netCDF files in Python

I am trying to combine netCDF files, but it continuously shows:

File "CBL_plot.py", line 11, in <module>
    f = MFDataset(fili)
File "utils.pyx", line 274, in netCDF4.MFDataset.__init__ (netCDF4.c:3822)
IOError: master dataset THref_11:00.nc does not have a aggregation dimension.
So I checked one of the netCDF files, and its contents are as below:
float64 th_ref(u't',)
unlimited dimensions = ()
current size = (30,)
It looks like there is no aggregation dimension. However, I would like to combine these netCDF files rather than using them one by one.
Is there any way to create an aggregation dimension to make MFDataset work?
Below is the python code I used:
import numpy as np
from netCDF4 import MFDataset

varn = 'th_ref'
fili = 'THref_*.nc'
f = MFDataset(fili)
Th = f.variables[varn]
Th_ref = np.array(Th[:])
print(Th.shape)
I would really appreciate any help, ideas, or hints.
Thank you,
Isaac

Short answer: MFDataset can only aggregate along the slowest varying dimension in your files.
Longer answer: The netcdf4-python documentation for MFDataset says: "Open a Dataset spanning multiple files, making it look as if it was a single file. Variables in the list of files that share the same dimension (specified with the keyword aggdim) are aggregated. If aggdim is not specified, the unlimited [dimension] is aggregated. Currently, aggdim must be the leftmost (slowest varying) dimension of each of the variables to be aggregated."
So MFDataset works by aggregating along the slowest varying dimension that already exists in the files. If you have a bunch of files that are snapshots of the same logical dataset at different times and you want to aggregate in time, each file needs to contain a time dimension. If the time of the data is only encoded in the file name, there is currently no way to use MFDataset to aggregate.
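A hedged workaround (my sketch, not part of MFDataset): read each file on its own and stack the variable along a new leading axis yourself; that new axis then plays the role of the missing time/aggregation dimension. The glob pattern is taken from your code, and the sorting assumes the file names order the same way as the times:

import glob

import numpy as np
from netCDF4 import Dataset

varn = 'th_ref'
files = sorted(glob.glob('THref_*.nc'))  # file-name pattern from the question

profiles = []
for fname in files:
    with Dataset(fname) as nc:
        profiles.append(nc.variables[varn][:])  # one (30,) profile per file

Th_ref = np.stack(profiles)  # shape (n_files, 30)
print(Th_ref.shape)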

Related

Faster alternative than pandas.read_csv for reading txt files with data

I am working on a project where I need to process a large number of txt files (about 7000), each with 2 columns of floats and 12500 rows.
I am using pandas, and it takes about 2 min 20 sec, which is a bit long. With MATLAB this takes 1 min less. I would like to get close to MATLAB, or faster.
Is there any faster alternative that I can implement with Python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads the files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and make a list with the resulting envelopes for all files. I extract the time from the first file using the simple numpy.loadtxt(), which is slower than pandas but has no impact since it is just one file.
Any ideas?
For coding suggestions please try to use the same format as I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
import os

import numpy as np
import pandas as pd
from scipy.signal import hilbert

folder_tarjet = r"D:\this"
if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(os.path.join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    amp = pd.read_csv(file_name, header=None, dtype=np.float64, delim_whitespace=True)
    env_amplitudes = np.abs(hilbert(amp[1].to_numpy()))
    envs_a.append(env_amplitudes)
envelopes = np.array(envs_a).T

file_name = os.path.join(folder_tarjet, list_of_files[0])  # time axis from the first file
Time = np.loadtxt(file_name, usecols=0)
I would suggest not using csv to store and load big quantities of data. If you already have your data in csv, you can still convert all of it once into a faster format. For instance, you can use pickle, HDF5, Feather or Parquet; they are not human readable but have much better performance on every other metric.
After the conversion you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion it is a much better solution than chasing a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format. If you don't know which format to use, Parquet for instance is one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the options already supported by pandas.
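As a rough illustration (my own sketch, not the asker's code), a one-time conversion pass might look like the following; it assumes pyarrow is installed for Parquet support and that the data files carry a .txt extension in the folder from the question:

import os

import pandas as pd

folder = r"D:\this"  # folder name taken from the question

# One-time conversion: parse each text file once and save it as Parquet.
for fname in os.listdir(folder):
    if fname.endswith(".txt"):
        path = os.path.join(folder, fname)
        df = pd.read_csv(path, header=None, names=["time", "amp"],
                         delim_whitespace=True)
        df.to_parquet(path + ".parquet")

# Later runs then load the binary files instead of re-parsing text:
# df = pd.read_parquet(path + ".parquet")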
As discussed in @Ziur Olpa's answer and the comments, a binary format is bound to be faster than parsing text.
The quick way to get those gains is to use NumPy's own NPY format and have your reader function cache the parsed arrays on disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all NPY files and wait a while longer for parsing to happen (or maybe add logic to compare the modification times of the NPY files with their corresponding TXT files).
Something like this; I hope I got your hilbert/abs/transposition logic the way you wanted it.
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
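For reference, both functions are driven with a single call like this (the folder path is taken from the question):

times, envs = read_data_opt(r"D:\this")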
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds

Xarray to merge two hdf5 files with different dimension lengths

I have some instrumental data saved in HDF5 format as multiple 2-d arrays along with the measuring time. As shown in the attached figures below, d1 and d2 are two independent files in which the instrument recorded at different times. They have the same data variables, and the only difference is the length of phony_dim_0, which represents the total number of data points and varies with measurement time.
These files need to be loaded into specific software provided by the instrument company to obtain meaningful results. I want to merge multiple files with Python xarray while keeping their original format, and then load one merged file into the software.
Here is my attempt:
import os

import numpy as np
import xarray

files = os.listdir("DATA_PATH")
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])

# Copy a new one to hold the merged data arrays.
d0 = d1
vars_ = [c for c in d1]
for var in vars_:
    d0[var].values = np.vstack([d1[var], d2[var]])
The error looks like this:
replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)
I thought about two solutions for this problem:
expanding the dimension length to the total length of all merged files;
creating a new empty dataframe in the same format as d1 and d2.
However, I still could not figure out how to achieve either one. Any comments or suggestions would be appreciated.
Supplemental information
dataset example [d1],[d2]
I'm not familiar with xarray, so I can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and is all you need to merge the data.
A note about xarray: it uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets', and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I am going to use it for the rest of my post.
There are some things to consider when merging datasets across 2 or more HDF5 files. They are:
Consistency of the data schema (which you have checked).
Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
It's preferable to create resizable datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizable. That simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common in both files. That also simplifies the process.
The code below goes through the following steps to merge the data:
1. Open the new merge file for writing.
2. Open the first data file (read-only).
3. Loop through all datasets:
   a. Use the group copy function to copy each dataset (data plus maxshape parameters, and attribute names and values).
4. Open the second data file (read-only).
5. Loop through all datasets and do the following:
   a. Get the size of the two datasets (existing and to be added).
   b. Increase the size of the HDF5 dataset with the .resize() method.
   c. Write the values from the new dataset to the end of the existing dataset.
At the end it loops through all three files and prints the shape and maxshape of every dataset (for visual comparison).
Code below:
import h5py

files = ['211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5']

# Create the merge file:
with h5py.File('merged_.h5', 'w') as h5fw:
    # Open first HDF5 file and copy each dataset.
    # Will use maxshape and attributes from the existing dataset.
    with h5py.File(files[0], 'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)
    # Open second HDF5 file and copy data from each dataset.
    # Resizes the existing dataset as needed to hold the new data.
    with h5py.File(files[1], 'r') as h5fr:
        for ds in h5fr.keys():
            ds_a0 = h5fw[ds].shape[0]
            add_a0 = h5fr[ds].shape[0]
            h5fw[ds].resize(ds_a0 + add_a0, axis=0)
            h5fw[ds][ds_a0:] = h5fr[ds][:]

for fname in files:
    print(f'Working on file: {fname}')
    with h5py.File(fname, 'r') as h5f:
        for ds, h5obj in h5f.items():
            print(f'for: {ds}; shape={h5obj.shape}, maxshape={h5obj.maxshape}')

How to read multiple NetCDF files from MODIS which have variables in groups?

Recently I tried to read MODIS cloud properties data. I tried to merge/combine MODIS NetCDF files, but neither ncrcat nor CDO worked. Then I found that the variables in the MODIS files are collected in groups.
import netCDF4 as nc

a = 'MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'
b = nc.Dataset(a)
print(b.groups.keys())
c = b.groups['Cloud_Mask_Fraction']
print(c.variables['Mean'])
which gives these results:
dict_keys(['Solar_Zenith', 'Solar_Azimuth', 'Sensor_Zenith', 'Sensor_Azimuth', 'Cloud_Top_Pressure', 'Cloud_Mask_Fraction', 'Cloud_Mask_Fraction_Low', 'Cloud_Mask_Fraction_Mid', 'Cloud_Mask_Fraction_High', 'Cloud_Optical_Thickness_Liquid', 'Cloud_Optical_Thickness_Ice', 'Cloud_Optical_Thickness_Total', 'Cloud_Optical_Thickness_PCL_Total', 'Cloud_Optical_Thickness_Log10_Liquid', 'Cloud_Optical_Thickness_Log10_Ice', 'Cloud_Optical_Thickness_Log10_Total', 'Cloud_Particle_Size_Liquid', 'Cloud_Particle_Size_Ice', 'Cloud_Water_Path_Liquid', 'Cloud_Water_Path_Ice', 'Cloud_Retrieval_Fraction_Liquid', 'Cloud_Retrieval_Fraction_Ice', 'Cloud_Retrieval_Fraction_Total'])
<class 'netCDF4._netCDF4.Variable'>
float64 Mean(longitude, latitude)
_FillValue: -999.0
title: Cloud_Mask_Fraction: Mean
units: none
path = /Cloud_Mask_Fraction
unlimited dimensions:
current shape = (360, 180)
filling on
There are other variables in many groups, and I need to read all the other files or merge these files. So I am wondering: how can I read multiple NetCDF files with groups? How can I get arrays for each variable with a new time dimension, since I have to read these data for several years? Can CDO, ncrcat, or xarray in Python merge this kind of nc file?
Thanks a lot.
Yuhang
I would recommend using xarray, the state-of-the-art handler for 4D gridded data in Python.
You have to install netcdf4, and I also recommend h5netcdf because of its faster processing.
import xarray

path_to_file = 'MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'

# if h5netcdf is installed:
data = xarray.open_dataset(path_to_file, engine='h5netcdf')
# if just netcdf4 is installed:
data = xarray.open_dataset(path_to_file)

# access variables (replace variable_name with an actual name):
data['variable_name']
data.variable_name

# inspect the whole file:
data
You can load multiple files into one dataset:
datasets = xarray.open_mfdataset([path_to_file_1, path_to_file_2], parallel=True)
I expect some errors in case you have different time spans, but you can find ways to work around such issues.
I added parallelisation here to speed up the parsing.
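One detail worth adding (my note, since your variables live inside groups): xarray.open_dataset can read a single group at a time via its group keyword, so one way to collect a group's variables across many monthly files is to open each file's group and concatenate along a new time dimension. A rough sketch, where the file pattern and the dimension name 'time' are my assumptions:

import glob

import xarray

paths = sorted(glob.glob('MCD06COSP_M3_MODIS.*.nc'))
per_file = [xarray.open_dataset(p, group='Cloud_Mask_Fraction') for p in paths]
merged = xarray.concat(per_file, dim='time')  # new leading 'time' dimension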
Please add test data via a link to cloud storage or similar; otherwise the community cannot help you beyond these suggestions.
PS: Please choose variable names wisely ;)

Reading the last batch of data added to hdfs file using Python

I have a program that adds a variable number of rows of data to an hdf5 file, as shown below.
data_without_cosmic.to_hdf(new_file,key='s', append=True, mode='r+', format='table')
new_file is the file name and data_without_cosmic is a pandas data frame with 'x', 'y', 'z', and 'i' columns representing positional data and a scalar quantity. I may add several data frames of this form to the file each time I run the full program. For each data frame I add, the 'z' values are a constant.
The next time I use the program, I need to access the last batch of rows that was added, in order to perform some operations. I wondered if there is a fast way to retrieve just the last data frame that was added to the file, or if I could group the data in some way as I add it so that this becomes possible.
The only other way I can think of to achieve my goal is reading the entire file and then checking the z values from the bottom up until they change, but that seems a little excessive. Any ideas?
P.S. I am very inexperienced with hdf5 files, but I read that they are efficient to work with.
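For what it's worth, here is a minimal sketch of one possible approach (my own, untested against your file): because each appended batch has a constant 'z', you can read just the last row to learn its 'z' value and then query only the rows with that value. This assumes the frames were written with data_columns=['z'] so that 'z' is queryable in the table:

import pandas as pd

new_file = "data.h5"  # hypothetical path standing in for the question's new_file

with pd.HDFStore(new_file, mode="r") as store:
    nrows = store.get_storer("s").nrows                       # total rows in the table
    last_z = store.select("s", start=nrows - 1)["z"].iloc[0]  # z value of the last row
    last_batch = store.select("s", where=f"z == {last_z}")    # the last appended frame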

How to copy a partial or skeleton h5py file

I have a few questions wrapped up into this issue. I realize this might be a convoluted post, and I can provide extra details.
A code package I use can produce large .h5 files (source.h5) (100+ Gb), where almost all of this data resides in 1 dataset (group2/D). I want to make a new .h5 file (dest.h5) using Python that contains all datasets except group2/D of source.h5 without needing to copy the entire file. I then will condense group2/D after some postprocessing and write a new group2/D in dest.h5 with much less data. However, I need to keep source.h5 because this postprocessing may need to be performed multiple times into multiple destination files.
source.h5 is always structured the same and cannot be changed in either source.h5 or dest.h5, where each letter is a dataset:
group1/A
group1/B
group2/C
group2/D
I thus want to initially make a file with this format:
group1/A
group1/B
group2/C
and again, fill in group2/D later. Simply copying source.h5 multiple times is always possible, but I'd like to avoid copying a huge file a bunch of times, because disk space is limited and this isn't a one-off case.
I searched and found this question (How to partially copy using python an Hdf5 file into a new one keeping the same structure?) and tested if dest.h5 would be the same as source.h5:
fs = h5py.File('source.h5', 'r')
fd = h5py.File('dest.h5', 'w')
fs.copy('group1', fd)
fd.create_group('group2')
fs.copy('group2/C', fd['/group2'])
fs.copy('group2/D', fd['/group2'])
fs.close()
fd.close()
but the code package I use couldn't read the file I created (which I need it to be able to do), implying there was some critical data loss in this operation (the file sizes also differ by 7 kb). I'm assuming the problem arose when I created group2 manually, because I checked with numpy that the values in the group1 datasets exactly match in both source.h5 and dest.h5. Before digging into what data is missing, I wanted to get a few things out of the way:
Question 1: Is there .h5 file metadata that accompanies each group or dataset? If so, how can I see it, so that I can create a group2 in dest.h5 that exactly matches the one in source.h5? Is there a way to check whether two groups (not datasets) exactly match each other?
Question 2: Alternatively, is it possible to simply copy the data structure of a .h5 file (i.e. the groups and datasets, left empty, as a skeleton file) so that the fields can be populated later? Or, as a subset of this question, is there a way to copy a blank dataset to another file such that any metadata is retained (assuming there is some)?
Question 3: Finally, to avoid all this, is it possible to just copy a subset of source.h5 to dest.h5? Something like:
fs.copy(['group1','group2/C'], fd)
Thanks for your time. I appreciate you reading this far.
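Not a full answer, but a sketch of the direction I would try for Question 2, using only h5py calls (untested against your files, so treat it as an assumption-laden illustration): h5py's create_dataset_like makes a dataset with the same shape, dtype and creation properties as another one without copying its data, and attributes can be copied over separately.

import h5py

with h5py.File('source.h5', 'r') as fs, h5py.File('dest.h5', 'w') as fd:
    fs.copy('group1', fd)                      # full copy of group1 (data + attrs)
    grp = fd.create_group('group2')
    for k, v in fs['group2'].attrs.items():    # group-level attributes, if any
        grp.attrs[k] = v
    fs.copy('group2/C', grp)                   # full copy of the small dataset
    # Skeleton for the big dataset: same shape/dtype/chunks, but no data copied.
    d = grp.create_dataset_like('D', fs['group2/D'])
    for k, v in fs['group2/D'].attrs.items():  # dataset-level attributes
        d.attrs[k] = v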
