Python good practice with NetCDF4 shared dimensions across groups - python

This question is conceptual rather than about a specific error.
I am working with the Python netCDF4 API to translate and store binary datagram packets from multiple sensors packaged in a single file. My question concerns the scope of dimensions and best-use practices.
According to the CF conventions document on scope, all child groups have access to a dimension defined in a parent group (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#_scope).
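For concreteness, a minimal sketch of that scope rule (the file, group, and variable names here are just illustrative):
import netCDF4

with netCDF4.Dataset('scope_demo.nc', mode='w') as ds:
    ds.createDimension('time', None)  # unlimited dimension defined at the root
    grp = ds.createGroup('child')
    # the child-group variable resolves 'time' by searching enclosing groups
    var = grp.createVariable('value', 'f8', dimensions=('time',))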
Context:
Data packets from the multiple sensors are written to a binary file. Timing adjustments are handled before the binary file is written, so we can trust the time stamp of a data packet. The sampling rates are not synchronous: sensor 1 samples at, say, 1 Hz, while sensor 2 samples at 100 Hz. Sensors 1 and 2 each measure a number of different variables.
Questions:
Do I define a single unlimited time dimension at the root level and create multiple variables using that dimension, or do I create individual time dimensions at the group level? Pseudo-code below.
In setting up the netCDF file I would use the following code:
import netCDF4

data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)

grp_1 = data_set.createGroup('grp1')
var_time_1 = grp_1.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
var_time_1[:] = sensor_1_time  # timestamps of data values from sensor 1
var_1 = grp_1.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
var_1[:] = sensor_1_data  # data values from sensor 1

grp_2 = data_set.createGroup('grp2')
var_time_2 = grp_2.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
var_time_2[:] = sensor_2_time  # timestamps of data values from sensor 2
var_2 = grp_2.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
var_2[:] = sensor_2_data  # data values from sensor 2
The group separation is not necessarily by sensor but by logical data grouping. In the case that data from two sensors falls into multiple groups, is it best to replicate the time array in each group, or is it acceptable to reference other groups using the scope mechanism?
import netCDF4

data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)

grp_env = data_set.createGroup('env_data')
sensor_time_1 = grp_env.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
sensor_time_1[:] = sensor_1_time  # timestamps of data values from sensor 1
env_1 = grp_env.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
env_1[:] = sensor_1_data  # data values from sensor 1
env_2 = grp_env.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
env_2.coordinates = "/platform/sensor_2_time"  # cross-group reference to sensor 2's time variable

grp_platform = data_set.createGroup('platform')
sensor_time_2 = grp_platform.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
sensor_time_2[:] = sensor_2_time  # timestamps of data values from sensor 2
plt_2 = grp_platform.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
plt_2[:] = sensor_2_data  # data values from sensor 2
Most examples do not deal with this cross-group functionality, and I can't seem to find the best practices. I'd love some advice, or even a push in the right direction.

Related

HDF5 tagging datasets to events in other datasets

I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.
Imagine I am measuring temperature over time, and then for every 10-degree increase in temperature I sample a microcontroller at 200 kHz. I want to be able to tag each large burst of micro data to a timestamp in the time-series data, maybe even in the form of a figure.
I was trying to do this with region references, but I am struggling to find an elegant solution, and I'm finding myself juggling between the pandas HDFStore and h5py; it just feels messy.
Initially I thought I would be able to make separate datasets from the burst-data then use reference or links to timestamps in the time-series data. But no luck so far.
Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!
How did you use region references? I assume you had an array of references, alternating between a range of "standard rate" data and a range of "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
Virtual Datasets might be a more elegant solution... but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation. HDF5/h5py handles everything under the covers.
To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:
1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (one virtual source per file/dataset combination).
3. Map virtual source data to the virtual layout (you can use slice notation, as shown in my example).
4. Repeat steps 2 and 3 for all sources (or slices of sources).
Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)
There are at least 3 other SO questions and answers on this topic:
h5py, enums, and VirtualLayout
h5py error reading virtual dataset into NumPy array
How to combine multiple hdf5 files into one file and dataset?
Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.
import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3

arr = np.zeros((log_ntimes, 2))
for i in range(log_ntimes):
    time = i*log_inc
    arr[i, 0] = time
    #temp = 70.+ 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')
arr[:, 1] = 70.+ 100.*arr[:, 0]
#print(arr)

with h5py.File('SO_72654160.h5', 'w') as h5f:
    h5f.create_dataset('data_log', data=arr)

n_bursts = 4
burst_ntimes = 11
burst_inc = 5e-5

for n in range(1, n_bursts):
    arr = np.zeros((burst_ntimes-1, 2))
    for i in range(1, burst_ntimes):
        burst_time = 0.01*n
        time = burst_time + i*burst_inc
        arr[i-1, 0] = time
        #temp = 70.+ 100.*time
    arr[:, 1] = 70.+ 100.*arr[:, 0]
    with h5py.File('SO_72654160.h5', 'a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}', data=arr)
Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates one virtual dataset in a new file and one in the existing file. (The statements are identical except for the file name and mode.)
source_file = 'SO_72654160.h5'
a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]
print(f'Total data rows in source = {a0}')

# alternate getting data from:
# dataset: data_log, rows 0-11, 11-21, 21-31
# datasets: burst_log_01, burst_log_02, etc. (each has 10 rows)

# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2), dtype=float)

# Map virtual dataset to logged data
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31, 2))
layout[0:11, :] = vsource1[0:11, :]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10, 2))
layout[11:21, :] = vsource2
layout[21:31, :] = vsource1[11:21, :]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10, 2))
layout[31:41, :] = vsource2
layout[41:51, :] = vsource1[21:31, :]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10, 2))
layout[51:61, :] = vsource2

# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open EXISTING file, then add virtual dataset
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
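As a quick check, the resulting virtual dataset can then be read back with plain slice notation, just like a regular dataset. A short sketch against the files created above:
with h5py.File('SO_72654160_VDS.h5', 'r') as h5f:
    vdata = h5f['vdata']
    print(vdata.shape)      # (61, 2): logged rows interleaved with the bursts
    print(vdata[9:14, :])   # a slice that spans a log/burst boundary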

How to make subplots from multiple files? Python matplotlib

I'm a student researcher who's running simulations on exoplanets to determine if they might be viable for life. The software I'm using outputs a file with several columns of various types of data. So far, I've written a Python script that goes through one file and grabs two columns of data, in this case time and global temperature of the planet.
What I want to do is:
Write a python script that goes through multiple files, and grabs the same two columns that my current script does.
Then, I want to create subplots of all these files
The thing that stays consistent across all of the files is that the x-axis will always be time (from 0 to 1 million years). The y-axis values will change across simulations, though.
This is what I got so far for my code:
import math as m
import matplotlib.pylab as plt
import numpy as np
## Set datafile equal to the file I plan on using for data, and then open it
datafile = r"C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]\solarsys.Earth.forward"
file = open(datafile, "r")
# Create two empty arrays for my x and y axis of my graphs
years = [ ]
GlobalT = [ ]
# A for loop that looks in my file, and grabs only the 1st and 8th column adding them to the respective arrays
for line in file:
    data = line.split(' ')
    years.append(float(data[0]))
    GlobalT.append(float(data[7]))
# Close the file
file.close()
# Plot my graph
fig = plt.matplotlib.pyplot.gcf()
plt.plot(years, GlobalT)
plt.title('Global Temperature of GJ 229 b over time')
fig.set_size_inches(10, 6, forward=True)
plt.figtext(0.5, 0.0002, "This shows the global temperature of GJ 229 b when it's semi-major axis is 0.929 au, \n"
" and it's actual mass relative to the sun (~8 Earth Masses)", wrap=True, horizontalalignment='center', fontsize=12)
plt.xlabel(" Years ")
plt.ylabel("Global Temp")
plt.show()
I think the simplest thing to do is to turn your code for one file into a function, and then call it in a loop that iterates over the files.
import pandas as pd
from pathlib import Path

def parse_datafile(pth):
    """Parse one datafile into a list of row dicts."""
    results = []
    with pth.open('r') as f:
        for line in f:
            data = line.split(' ')
            results.append({'f': pth.stem,
                            'y': data[0],
                            't': data[7]})
    return results

basedir = Path(r'C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]')

# assuming you want to parse all files in the directory
# if not, change the glob string to match the files you want
all_results = [row for pth in basedir.glob('*') for row in parse_datafile(pth)]

df = pd.DataFrame(all_results)
df['y'] = pd.to_numeric(df['y'], errors='coerce')
df['t'] = pd.to_numeric(df['t'], errors='coerce')
This will give you a dataframe with three columns - f (the filename), y (the year), and t (the temperature). You then have to convert y and t to numeric dtypes. This will be faster and handle errors more gracefully than your code, which will raise an error with any malformed data.
You can further manipulate this as needed to generate your plots. Definitely check if there are any NaN values and address them accordingly, either by dropping those rows or using fillna.
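For the subplot step itself, here is a hedged sketch building on the dataframe above (it assumes the 'f', 'y', and 't' columns from parse_datafile and draws one panel per file):
import matplotlib.pyplot as plt

files = df['f'].unique()
fig, axes = plt.subplots(len(files), 1, figsize=(10, 3 * len(files)),
                         sharex=True, squeeze=False)
for ax, fname in zip(axes.ravel(), files):
    sub = df[df['f'] == fname]
    ax.plot(sub['y'], sub['t'])
    ax.set_title(fname)
    ax.set_ylabel('Global Temp')
axes.ravel()[-1].set_xlabel('Years')
fig.tight_layout()
plt.show()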

Subsample multi-sensor time series data using Python's pandas.DataFrame

I have a file coming from multiple sensor readings. Each line is of the following format:
timestamp sensor_name sensor_value
e.g.
191.12 temperature -5.19
191.17 pressure 20.05
191.18 pressure 20.04
191.23 pressure 20.07
191.23 temperature -5.17
191.31 temperature -5.09
...
The frequency of the readings is irregular, approximately 10-20 Hz. I need to downsample these readings to 1 Hz and output the result in the following format:
timestamp sensor_1_value sensor_2_value ... sensor_n_value
reflecting the (running?) mean value of the sensor readings in the successive seconds, e.g.
timestamp temperature pressure
191.00 -5.02 21.93
192.00 -5.01 21.92
193.00 -5.01 21.91
...
I loaded each line of the input file into a dictionary as follows:
def add(self, timestamp, sensor_name, sensor_value):
    self.timeseries[sensor_name].append([timestamp, sensor_value])
... and created a DataFrame from the dictionary:
df = pd.DataFrame(self.timeseries)
... but I need some guidance on how to move forward from here, i.e., what's an elegant way to perform the sampling.
I'm not 100% sure what you're doing but this is what I'd do to solve the problem. It assumes your data file is space-separated with a header row.
import pandas as pd
import numpy as np

# Load the data (whitespace-separated with a header row)
data = pd.read_csv(file_name, sep=r"\s+", index_col=None)

# Take the mean of the values within each second
data["timestamp"] = np.floor(data["timestamp"])
data = data.groupby(["timestamp", "sensor_name"]).mean()
data = data.reset_index()

# Pivot
data = data.pivot(index="timestamp", columns="sensor_name", values="sensor_value")
If you have some other concept for "downsampling" in mind for this context you should do that instead of mean.
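If a per-second mean is indeed what you want, another hedged option is to give the frame a TimedeltaIndex and let resample do the 1 Hz binning. This is only a sketch; the column names ('timestamp', 'sensor_name', 'sensor_value') and the file_name variable are taken from the question's format and the code above:
import pandas as pd

data = pd.read_csv(file_name, sep=r"\s+")
data.index = pd.to_timedelta(data["timestamp"], unit="s")
result = (data.groupby("sensor_name")["sensor_value"]
              .resample("1s").mean()       # mean per sensor per 1-second bin
              .unstack(level=0))           # one column per sensor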

How to Import multiple CSV files then make a Master Table?

I am a research chemist and have carried out a measurement where I record 'signal intensity' vs 'mass-to-charge (m/z)'. I have repeated this experiment 15 times, changing a specific parameter (Collision Energy) each time. As a result, I have 15 CSV files and would like to align/join them within the same range of m/z values and the same interval values. Due to the instrument's thresholding rules, certain m/z values were not recorded, so I have files that cannot simply be exported into Excel and copy/pasted. The data looks a bit like the tables posted below.
Dataset 1:  x  | y        Dataset 2:  x  | y
           ----+---                  ----+---
           0.0 | 5                   0.0 | 2
           0.5 | 3                   0.5 | 6
           2.0 | 7                   1.0 | 9
           3.0 | 1                   2.5 | 1
                                     3.0 | 4
Using MATLAB I started with this code:
%% Create a table for the set m/z range with an interval of 0.1 Da
mzrange = 50:0.1:620;
mzrange = mzrange';
mzrange = array2table(mzrange,'VariableNames',{'XThompsons'});
Then I manually imported 1 X/Y CSV (Xtitle=XThompson, Ytitle=YCounts) to align with the specified m/z range.
%% Join/merge the two tables using a common Key variable 'XThompson' (m/z value)
mzspectrum = outerjoin(mzrange,ReserpineCE00,'MergeKeys',true);
% Replace all NaN values with zero
mzspectrum.YCounts(isnan(mzspectrum.YCounts)) = 0;
At this point I am stuck because repeating this process with a separate file will overwrite my YCounts column. The title of the YCounts column doesn't matter to me, as I can change it later; however, I would like the table to continue as such:
XThompson | YCounts_1 | YCounts_2 | YCounts_3 | etc...
--------------------------------------------------------
How can I carry this out in MATLAB so that it is at least semi-automated? I posted earlier describing a similar scenario, but it turned out that the suggested approach could not do what I need. I must admit that my mind is not that of a programmer, so I have been struggling with this problem quite a bit.
PS: Is this problem best executed in MATLAB or Python?
I don't know or use MATLAB, so my answer is purely Python-based. I think Python and MATLAB should be equally well suited to reading CSV files and generating a master table.
Please consider this answer more as a pointer to how to address the problem in Python.
In python one would typically address this problem using the pandas package. This package provides "high-performance, easy-to-use data structures and data analysis tools" and can read natively a large set of file formats including CSV files. A master table from two CSV files "foo.csv" and "bar.csv" could be generated e.g. as follows:
import pandas as pd
df = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.csv')
master_table = pd.concat([df, df2])
Pandas further allows you to group and structure the data in many ways. The pandas documentation has very good descriptions of its various features.
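To get the aligned master-table layout the question asks for (XThompson plus one YCounts_n column per file) rather than stacked rows, an outer merge on the shared m/z column is the pandas counterpart of MATLAB's outerjoin. A hedged sketch, assuming each CSV has the columns XThompson and YCounts and sits in the current directory:
import glob
from functools import reduce
import pandas as pd

frames = []
for i, path in enumerate(sorted(glob.glob('*.csv')), start=1):
    df = pd.read_csv(path)
    frames.append(df.rename(columns={'YCounts': f'YCounts_{i}'}))

# successive outer joins on the shared m/z key, then zero-fill the gaps
master = reduce(lambda left, right: pd.merge(left, right, on='XThompson', how='outer'), frames)
master = master.sort_values('XThompson').fillna(0)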
One can install pandas with the python package installer pip:
sudo pip install pandas
if on Linux or OSX.
The counts from the different analyses should be named differently, i.e., YCounts_1, YCounts_2, and YCounts_3 from analyses 1, 2, and 3, respectively, in the different datasets before joining them. However, the M/Z name (i.e., XThompson) should be the same since this is the key that will be used to join the datasets. The code below is for MATLAB.
This step is not needed (it just recreates your tables); I copied dataset2 to create dataset3 for illustration. You could use readtable to import your data, i.e., imported_data = readtable('filename');
dataset1 = table([0.0; 0.5; 2.0; 3.0], [5; 3; 7; 1], 'VariableNames', {'XThompson', 'YCounts_1'});
dataset2 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_2'});
dataset3 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_3'});
Merge the tables using outerjoin. You could use a loop if you have many datasets.
combined_dataset = outerjoin(dataset1,dataset2, 'MergeKeys', true);
Add dataset3 to the combined_dataset
combined_dataset = outerjoin(combined_dataset,dataset3, 'MergeKeys', true);
You could export the combined data as an Excel sheet by using writetable:
writetable(combined_dataset, 'joined_icp_ms_data.xlsx');
I managed to create a solution to my problem based on learning from everyone's input and taking an online MATLAB course. I am not a natural coder, so my script is not as elegant as those of the geniuses here, but hopefully it is clear enough for other non-programming scientists to use.
Here's the result that works for me:
% Reads a directory containing *.csv files and corrects the x-axis to an evenly spaced (0.1 unit) interval.
% Create a matrix with the input x range then convert it to a table
prompt = 'Input recorded min/max data range separated by space \n(ex. 1 to 100 = 1 100): ';
inputrange = input(prompt,'s');
min_max = str2num(inputrange)
datarange = (min_max(1):0.1:min_max(2))';
datarange = array2table(datarange,'VariableNames',{'XAxis'});
files = dir('*.csv');
for q = 1:length(files)
    % Extract each XY pair from the csvread cell and convert it to an array, then back to a table.
    data{q} = csvread(files(q).name,2,1);
    data1 = data(q);
    data2 = cell2mat(data1);
    data3 = array2table(data2,'VariableNames',{'XAxis','YAxis'});
    % Join the datarange table and the intensity table to obtain an evenly spaced m/z range
    data3 = outerjoin(datarange,data3,'MergeKeys',true);
    data3.YAxis(isnan(data3.YAxis)) = 0;
    data3.XAxis = round(data3.XAxis,1);
    % Remove duplicate values
    data4 = sortrows(data3,[1 -2]);
    [~, idx] = unique(data4.XAxis);
    data4 = data4(idx,:);
    % Save the file as the same name in CSV without underscores or dashes
    filename = files(q).name;
    filename = strrep(filename,'_','');
    filename = strrep(filename,'-','');
    filename = strrep(filename,'.csv','');
    writetable(data4,filename,'FileType','text');
    clear data data1 data2 data3 data4 filename
end
clear

Loop through netcdf files and run calculations - Python or R

This is my first time using netCDF and I'm trying to wrap my head around working with it.
I have multiple version 3 netcdf files (NOAA NARR air.2m daily averages for an entire year). Each file spans a year between 1979 - 2012. They are 349 x 277 grids with approximately a 32km resolution. Data was downloaded from here.
The dimension is time (hours since 1/1/1800) and my variable of interest is air. I need to calculate accumulated days with a temperature < 0. For example
Day 1 = +4 degrees, accumulated days = 0
Day 2 = -1 degrees, accumulated days = 1
Day 3 = -2 degrees, accumulated days = 2
Day 4 = -4 degrees, accumulated days = 3
Day 5 = +2 degrees, accumulated days = 0
Day 6 = -3 degrees, accumulated days = 1
I need to store this data in a new netCDF file. I am familiar with Python and somewhat with R. What is the best way to loop through each day, check the previous day's value, and, based on that, output a value to a new netCDF file with the exact same dimensions and variable... or perhaps just add another variable to the original netCDF file with the output I'm looking for?
Is it best to leave all the files separate or combine them? I combined them with ncrcat and it worked fine, but the file is 2.3 GB.
Thanks for the input.
My current progress in python:
import numpy
import netCDF4

# Change my working DIR
f = netCDF4.Dataset('air7912.nc', 'r')
for a in f.variables:
    print(a)

# output:
# lat
# long
# x
# y
# Lambert_Conformal
# time
# time_bnds
# air

f.variables['air'][1, 1, 1]
# output: 298.37473
To help me understand this better: what type of data structure am I working with? Is ['air'] the key in the above example, and are [1, 1, 1] also keys used to get the value 298.37473? How can I then loop through [1, 1, 1]?
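For orientation, a short sketch (the dimension order is assumed to be time, y, x for the NARR air variable): f.variables['air'] is not a dict entry but a netCDF4.Variable object, and [1, 1, 1] is array-style indexing into it rather than a set of keys.
air = f.variables['air']
print(air.dimensions)   # e.g. ('time', 'y', 'x')
print(air.shape)
grid = air[0, :, :]     # the full 349 x 277 grid at the first time step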
You can use the very nice MFDataset feature in netCDF4 to treat a bunch of files as one aggregated file, without the need to use ncrcat. So your code would look like this:
from pylab import *
import netCDF4

f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
print(f.variables.keys())
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny, nx), dtype=int)
for i in range(ntimes):
    cold_days += atemp[i, :, :].data - 273.15 < 0
pcolormesh(cold_days)
colorbar()
And here's one way to write the file (there might be easier ways):
# create NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc','w',clobber=True)
nco.createDimension('x',nx)
nco.createDimension('y',ny)
cold_days_v = nco.createVariable('cold_days', 'i4', ( 'y', 'x'))
cold_days_v.units='days'
cold_days_v.long_name='total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'
lono = nco.createVariable('lon','f4',('y','x'))
lato = nco.createVariable('lat','f4',('y','x'))
xo = nco.createVariable('x','f4',('x'))
yo = nco.createVariable('y','f4',('y'))
lco = nco.createVariable('Lambert_Conformal','i4')
# copy all the variable attributes from original file
for var in ['lon','lat','x','y','Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var],att,getattr(f.variables[var],att))
# copy variable data for lon,lat,x and y
lono[:]=f.variables['lon'][:]
lato[:]=f.variables['lat'][:]
xo[:]=f.variables['x'][:]
yo[:]=f.variables['y'][:]
# write the cold_days data
cold_days_v[:,:]=cold_days
# copy Global attributes from original file
for att in f.ncattrs():
    setattr(nco,att,getattr(f,att))
nco.Conventions='CF-1.6'
nco.close()
If I try looking at the resulting file in the Unidata NetCDF-Java Tools-UI GUI, it seems to be okay.
Also note that here I just downloaded two of the datasets for testing, so I used
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
as an example. For all the data, you could use
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.????.nc')
or
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.*.nc')
Here is an R solution.
infiles <- list.files("data", pattern = "nc", full.names = TRUE, include.dirs = TRUE)
outfile <- "data/air.colddays.nc"
library(raster)
r <- raster::stack(infiles)
r <- sum((r - 273.15) < 0)
plot(r)
I know this is rather late for this thread from 2013, but I just want to point out that the accepted solution doesn't answer the exact question posed. The question seems to ask for the length of each continuous period of temperatures below zero (note that in the question the counter resets if the temperature exceeds zero), which can be important for climate applications (e.g. for farming), whereas the accepted solution only gives the total number of days in a year that the temperature is below zero. If the latter is really what mkmitchell wants (it has been accepted as the answer), then it can be done from the command line with cdo, without having to worry about netCDF input/output:
cdo timsum -lec,273.15 in.nc out.nc
so a looped script would be:
files=`ls *.nc` # pick up all the netcdf files in a directory
for file in $files ; do
    # I use 273.15 since from the question it seems T is in Kelvin
    cdo timsum -lec,273.15 $file ${file%???}_numdays.nc
done
If you then want the total number over the whole period, you can cat the _numdays files instead, which are much smaller:
cdo cat *_numdays.nc total.nc
cdo timsum total.nc total_below_zero.nc
But again, the question seems to want accumulated days per event, which is different, but not provided by the accepted answer.
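For reference, here is a hedged numpy sketch of that per-event counter (reset to zero whenever a day is at or above freezing), applied to the (time, y, x) air variable opened with MFDataset in the accepted answer; the 273.15 offset assumes the data are in Kelvin:
import numpy as np

below = (np.asarray(atemp[:]) - 273.15) < 0      # True where the day is below 0 degC
accumulated = np.zeros(below.shape, dtype=int)   # consecutive cold days up to each day
accumulated[0] = below[0]
for i in range(1, below.shape[0]):
    accumulated[i] = np.where(below[i], accumulated[i - 1] + 1, 0)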
