How to extract data from netCDF to CSV using a simple Python script?

This question is about extracting data from NetCDF files (*.nc) to CSV, so that researchers can use it in other applications, such as statistical analysis of time series that is not tied to array-like data. There are two alternative scripts for beginners in Python, namely:
Script 1:
I need your help extracting climate data from an .nc file to CSV using Python. The data provider, Copernicus, gives some advice on how to extract data from .nc to CSV with a simple Python script. However, I run into OverflowError: int too big to convert.
I will briefly describe the steps and provide all necessary info regarding data shape and content.
# this is for reading the .nc files in the working folder
import glob
# this is required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the csv files
import pandas as pd
# required for using the array functions
import numpy as np
from matplotlib.dates import num2date
data = Dataset('prAdjust_tmean.nc')
And the data looks like this:
print(data)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
frequency: year
CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
creation_date: 2020-02-12T15:00:49ZCET+0100
Conventions: CF-1.6
institution_url: www.smhi.se
invar_platform_id: -
invar_rcm_model_driver: MPI-M-MPI-ESM-LR
time_coverage_start: 1971
time_coverage_end: 2000
domain: EUR-11
geospatial_lat_min: 23.942343
geospatial_lat_max: 72.641624
geospatial_lat_resolution: 0.04268074 degree
geospatial_lon_min: -35.034023
geospatial_lon_max: 73.937675
geospatial_lon_resolution: 0.009246826 degree
geospatial_bounds: -
NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
contact: Hydro.fou#smhi.se
keywords: precipitation
license: Copernicus License V1.2
output_frequency: 30 year average value
summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf, https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
invar_bc_institution: Swedish Meteorological and Hydrological Institute
invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
invar_bc_method_id: TimescaleBC v1.02
invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
invar_bc_observation_id: EFAS-Meteo
invar_bc_period: 1990-2018
data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
institution: SMHI
project_id: C3S_424_SMHI
references:
source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
invar_experiment_id: rcp45
invar_realisation_id: r1i1p1
invar_rcm_model_id: MPI-CSC-REMO2009-v1
variable_name: prAdjust_tmean
dimensions(sizes): x(1000), y(950), time(1), bnds(2)
variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
groups:
Extract the variable:
t2m = data.variables['prAdjust_tmean']
# Get dimensions assuming 3D: time, latitude, longitude
time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = './'
And the error:
OverflowError Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
2 time_dim, lat_dim, lon_dim = t2m.get_dims()
3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
5 latitudes = data.variables[lat_dim.name][:]
6 longitudes = data.variables[lon_dim.name][:]
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
509 if tz is None:
510 tz = _get_rc_timezone()
--> 511 return _from_ordinalf_np_vectorized(x, tz).tolist()
512
513
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2106 vargs.extend([kwargs[_n] for _n in names])
2107
-> 2108 return self._vectorize_call(func=func, args=vargs)
2109
2110 def _get_ufunc_and_otypes(self, func, args):
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2190 for a in args]
2191
-> 2192 outputs = ufunc(*inputs)
2193
2194 if ufunc.nout == 1:
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
329
330 dt = (np.datetime64(get_epoch()) +
--> 331 np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
332 if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
333 raise ValueError(f'Date ordinal {x} converts to {dt} (using '
OverflowError: int too big to convert
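A likely cause, assuming the time variable follows CF conventions (units like "seconds since 1970-01-01"), is that matplotlib's num2date expects matplotlib date ordinals rather than CF time offsets, so large offsets overflow. A minimal sketch of the usual workaround is to decode the times with netCDF4's own num2date instead:
from netCDF4 import Dataset, num2date

data = Dataset('prAdjust_tmean.nc')
time_var = data.variables['time']
# decode CF-style time values; fall back to the 'standard' calendar if the attribute is missing
calendar = getattr(time_var, 'calendar', 'standard')
times = num2date(time_var[:], units=time_var.units, calendar=calendar)
print(times)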
The last part of the script is:
import os
# Path
path = "/home"
# Join various path components
print(os.path.join(path, "User/Desktop", "file.txt"))
# Path
path = "User/Documents"
# Join various path components
print(os.path.join(path, "/home", "file.txt"))
filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
    x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]
df = pd.DataFrame({
    'time': [t.isoformat() for t in times_grid],
    'latitude': latitudes_grid,
    'longitude': longitudes_grid,
    't2m': t2m[:].flatten()})
df.to_csv(filename, index=False)
print('Done')
Script 2:
The second script takes a more useful approach, because it lets the user select exactly the data they need from the NetCDF file in Python, rather than extracting everything. This script was written by GeoDelta Labs on YouTube. However, as is often the case, the script is written for a specific format of .nc files and needs further tweaking to read an arbitrary .nc file.
The code is like this:
import netCDF4
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    data.variables.keys()
    print(data.__dict__)

for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[14:18]
    all_years.append(year)
all_years

# Creating an empty Pandas DataFrame covering the whole range of data
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='Y')
df = pd.DataFrame(0.0, columns=['Temperature'], index=date_range)
Now for the interesting part of this second Python script. NetCDF files usually cover vast geographical areas (for example the Northern Hemisphere or entire continents), which is not so useful for targeted statistical analysis. Let's say the user wants to extract climate data from .nc files for particular locations: countries, regions, cities, etc. GeoDelta Labs comes up with a very useful approach.
It requires a small CSV file, created separately by the user, with three columns: Name (name of the region/country/location), Longitude and Latitude.
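For reference, a minimal countries.csv in the shape the loop below expects might look like this (the column names names, lat and long are taken from the code that reads it; the rows are only illustrative):
names,lat,long
Bucharest,44.43,26.10
Berlin,52.52,13.40
Madrid,40.42,-3.70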
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv('countries.csv')
cities
all_years

for index, row in cities.iterrows():
    location = row['names']
    location_latitude = row['lat']
    location_longitude = row['long']

    # Sorting the all_years python list
    all_years.sort()
    for yr in all_years:
        # Reading-in the data
        data = Dataset('prAdjust_tmean_abs_QM-EFAS-Meteo-EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_MPI-CSC-REMO2009-v1_na_1971-2000_grid5km_v1.nc', 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the average temperature data
        temp = data.variables['prAdjust_tmean']
        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start=start,
                                end=end,
                                freq='D')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for ' + location + ': ' + str(d_range[t_index]))
            df.loc[d_range[t_index], 'Temperature'] = temp[t_index, min_index_lat, min_index_lon]
    df.to_csv(location + '.csv')
The last part combines the data from the user-defined CSV (with the exact locations) with the data extracted from the .nc file, to create a new CSV containing exactly the data the user needs.
However, as always, the second script fails to "read" the years from my .nc file and finds only the starting year, 1970. Can someone explain what I am doing wrong? Could it be that the .nc file contains only data for 1970? In the description of the data it is clearly specified that the file contains annual averages from 1970 until 2000.
If someone can help me solve this issue, I think this script can be useful for all future users of netCDF files. Many thanks for your time.
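For reference, a sketch of how the years could be derived from the decoded time values rather than by slicing the units string (time.units[14:18] only ever captures the reference year in the units attribute, which may be why only 1970 shows up); this assumes CF-style time metadata:
from netCDF4 import Dataset, num2date

data = Dataset('name_of_the_file.nc', 'r')
time_var = data.variables['time']
# decode the time values and collect the distinct years they cover
dates = num2date(time_var[:], units=time_var.units,
                 calendar=getattr(time_var, 'calendar', 'standard'))
all_years = sorted({d.year for d in dates})
print(all_years)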
Edit: @Robert Davy requested sample data. I've added a link to Google Drive, where the original .nc files are stored and can be downloaded/viewed by anyone who accesses this link: https://drive.google.com/drive/folders/11q9dKCoQWqPE5EMcM5ye1bNPqELGMzOA?usp=sharing

Related

How do I populate an array with values from multiple rows in multiple files?

I have a folder with (for the most part) data collected over a whole year. Each day of the year has a different subfolder containing a CSV with the data. I need to open each CSV and plot the data on a graph, as well as find the mean, etc. I have made the plot by creating a loop that iterates over the files and plots each point on each iteration, but when it comes to populating arrays outside the loop so that I can manipulate the data, I keep running into problems.
I have added lots of comments to show what I have done.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# There are 227 files; after each month the day value resets.
# Folder/file name format: yyyymmdd
# First day = 20210411
# Last day = 20211231
# Looping from the first to the last file
for n in range(20210411, 20211231):
    # Incrementing, there are 227 days from 20210411 to 20211231
    n = n + 1
    try:
        # Creating a different file name to loop over and open each file
        filename = f'/Users/nathancall92/Desktop/2021/{n}/wf{n}.vav.ada.aia.eof.csv'
        # Opening a CSV instance
        CSV = pd.read_csv(filename, skiprows=2, engine='python')
        # Converting to array
        data = CSV.to_numpy()
        # Slicing out columns needed for plotting
        day, hour, XCO2, XCH4, azi = data[:, 3], data[:, 4], data[:, 32], data[:, 30], data[:, 9]
        # Converting to decimal day: AEST
        decimalDay = day + ((hour + 10) / 24)
        # Take out any azimuth angle larger than 80 due to false data measurements
        if azi.any() < 80:
            # Plotting inside the if statement
            plt.scatter(decimalDay, XCO2, s=0.2, marker="^", c='coral')
            plt.ylim(405, 425)
            plt.yticks(np.arange(start=405, stop=426, step=5))
            plt.xticks(np.arange(start=100, stop=365, step=25))
            plt.xlim(100, 375)
            plt.xlabel('Day of year')
            plt.ylabel('XCO2 (ppm)')
            plt.title('CO2 emissions March-December 2021')
            plt.show()
    # Continue looping over date integers if they do not correspond to a file
    except OSError:
        continue
I need to populate arrays outside of the loop with all of the values of day, XCO2 and XCH4, and I am having a great deal of trouble; I cannot seem to find a solution no matter what I try.
Any help would be much appreciated.
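Not a definitive answer, but a minimal sketch of one way to do this: collect each file's columns in Python lists inside the loop and concatenate them after the loop. The paths, column indices and the AEST offset are taken from the snippet above:
import numpy as np
import pandas as pd

all_day, all_xco2, all_xch4 = [], [], []
for n in range(20210411, 20211231 + 1):
    filename = f'/Users/nathancall92/Desktop/2021/{n}/wf{n}.vav.ada.aia.eof.csv'
    try:
        data = pd.read_csv(filename, skiprows=2, engine='python').to_numpy()
    except OSError:
        continue
    # collect the columns of interest from this file
    all_day.append(data[:, 3] + (data[:, 4] + 10) / 24)  # decimal day, AEST
    all_xco2.append(data[:, 32])
    all_xch4.append(data[:, 30])

# one flat array per quantity, usable outside the loop
day = np.concatenate(all_day)
xco2 = np.concatenate(all_xco2)
xch4 = np.concatenate(all_xch4)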

Python good practice with NetCDF4 shared dimensions across groups

This question is conceptual rather than about a specific error.
I am working with the Python netCDF4 API to translate and store binary datagram packets from multiple sensors packaged in a single file. My question concerns the scope of dimensions and best-use practices.
According to the netCDF4 convention and metadata docs, dimension scope is such that all child groups have access to a dimension defined in the parent group (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#_scope).
Context:
Data packets from multiple sensors are written to a binary file. Timing adjustments are handled prior to writing the binary file, so we can trust the timestamp of a data packet. The sampling rates are not synchronous: sensor 1 samples at, say, 1 Hz, while sensor 2 samples at 100 Hz. Sensors 1 and 2 measure a number of different variables.
Questions:
Do I define a single, unlimited time dimension at the root level and create multiple variables using that dimension, or do I create individual time dimensions at the group level? Pseudo-code below.
In setting up the netCDF file I would use the following code:
import netCDF4

data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)

grp_1 = data_set.createGroup('grp1')
var_time_1 = grp_1.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
var_time_1[:] = sensor_1_time  # timestamps of data values from sensor 1
var_1 = grp_1.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
var_1[:] = sensor_1_data  # data values from sensor 1

grp_2 = data_set.createGroup('grp2')
var_time_2 = grp_2.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
var_time_2[:] = sensor_2_time
var_2 = grp_2.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
var_2[:] = sensor_2_data  # data values from sensor 2
The group separation is not necessarily by sensor but by logical data grouping. In the case where data from two sensors fall into multiple groups, is it best to replicate the time array in each group, or is it acceptable to reference other groups using the scope mechanism?
import netCDF4

data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)

grp_env = data_set.createGroup('env_data')
sensor_time_1 = grp_env.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
sensor_time_1[:] = sensor_1_time  # timestamps of data values from sensor 1
env_1 = grp_env.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
env_1[:] = sensor_1_data  # data values from sensor 1
env_2 = grp_env.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
env_2.coordinates = "/grp_platform/sensor_time_1"

grp_platform = data_set.createGroup('platform')
sensor_time_2 = grp_platform.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
sensor_time_2[:] = sensor_2_time
plt_2 = grp_platform.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
plt_2[:] = sensor_2_data  # data values from sensor 2
Most examples do not deal with this cross-group functionality, and I can't seem to find the best practices. I'd love some advice, or even a push in the right direction.
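Not an authoritative answer, but a minimal sketch of the scope mechanism described above: a dimension created once at the root level can be used by variables in any child group, so each group can carry its own time coordinate without re-declaring the dimension (the file name and values are illustrative):
import numpy as np
import netCDF4

ds = netCDF4.Dataset('example.nc', mode='w')  # illustrative file name
ds.createDimension('time', None)              # unlimited, defined once at the root

grp_1 = ds.createGroup('grp1')
t_1 = grp_1.createVariable('sensor_1_time', 'f8', ('time',))  # uses the root-level 'time'
v_1 = grp_1.createVariable('sensor_1_data', 'f8', ('time',))
t_1[:] = np.arange(10.0)     # example 1 Hz timestamps
v_1[:] = np.random.rand(10)  # example data values

ds.close()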

How to make subplots from multiple files? Python matplotlib

I'm a student researcher running simulations on exoplanets to determine whether they might be viable for life. The software I'm using outputs a file with several columns of various types of data. So far, I've written a Python script that goes through one file and grabs two columns of data, in this case time and the global temperature of the planet.
What I want to do is:
Write a python script that goes through multiple files, and grabs the same two columns that my current script does.
Then, I want to create subplots of all these files
The one thing that stays consistent across all of the files is the time: the x axis will always be time (from 0 to 1 million years). The y-axis values will change across simulations, though.
This is what I got so far for my code:
import math as m
import matplotlib.pyplot as plt
import numpy as np

## Set datafile equal to the file I plan on using for data, and then open it
datafile = r"C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]\solarsys.Earth.forward"
file = open(datafile, "r")

# Create two empty lists for the x and y axes of my graphs
years = []
GlobalT = []

# A for loop that looks in my file and grabs only the 1st and 8th columns, adding them to the respective lists
for line in file:
    data = line.split(' ')
    years.append(float(data[0]))
    GlobalT.append(float(data[7]))

# Close the file
file.close()

# Plot my graph
fig = plt.gcf()
plt.plot(years, GlobalT)
plt.title('Global Temperature of GJ 229 b over time')
fig.set_size_inches(10, 6, forward=True)
plt.figtext(0.5, 0.0002, "This shows the global temperature of GJ 229 b when its semi-major axis is 0.929 au, \n"
            " and its actual mass relative to the sun (~8 Earth Masses)", wrap=True, horizontalalignment='center', fontsize=12)
plt.xlabel(" Years ")
plt.ylabel("Global Temp")
plt.show()
I think the simplest thing to do is to turn your code for one file into a function, and then call it in a loop that iterates over the files.
from pathlib import Path
import pandas as pd

def parse_datafile(pth):
    """Parses one datafile into a list of row dicts."""
    results = []
    with pth.open('r') as f:
        for line in f:
            data = line.split(' ')
            results.append({'f': pth.stem,
                            'y': data[0],
                            't': data[7]})
    return results

basedir = Path(r'C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]')

# assuming you want to parse all files in the directory
# if not, you can change the glob string to match the files you want
all_results = [row for pth in basedir.glob('*') for row in parse_datafile(pth)]
df = pd.DataFrame(all_results)
df['y'] = pd.to_numeric(df['y'], errors='coerce')
df['t'] = pd.to_numeric(df['t'], errors='coerce')
This will give you a dataframe with three columns - f (the filename), y (the year), and t (the temperature). You then have to convert y and t to numeric dtypes. This will be faster and handle errors more gracefully than your code, which will raise an error with any malformed data.
You can further manipulate this as needed to generate your plots. Definitely check if there are any NaN values and address them accordingly, either by dropping those rows or using fillna.
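For example, a short sketch (assuming the dataframe built above) of one way to get one subplot per file by grouping on the filename column:
import numpy as np
import matplotlib.pyplot as plt

# one subplot per input file, sharing the time axis
groups = list(df.groupby('f'))
fig, axes = plt.subplots(len(groups), 1, sharex=True, figsize=(10, 3 * len(groups)))
for ax, (name, sub) in zip(np.atleast_1d(axes), groups):
    ax.plot(sub['y'], sub['t'])
    ax.set_title(name)
    ax.set_ylabel('Global Temp')
np.atleast_1d(axes)[-1].set_xlabel('Years')
plt.tight_layout()
plt.show()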

Subsample multi-sensor time series data using Python's panda.Dataframe

I have a file coming from multiple sensor readings. Each line is of the following format:
timestamp sensor_name sensor_value
e.g.
191.12 temperature -5.19
191.17 pressure 20.05
191.18 pressure 20.04
191.23 pressure 20.07
191.23 temperature -5.17
191.31 temperature -5.09
...
The frequency of the readings is irregular, approximately 10-20 Hz. I need to downsample these readings to 1 Hz and output the result in the following format:
timestamp sensor_1_value sensor_2_value ... sensor_n_value
reflecting the (running?) mean value of the sensor readings in the successive seconds, e.g.
timestamp temperature pressure
191.00 -5.02 21.93
192.00 -5.01 21.92
193.00 -5.01 21.91
...
I loaded each line of the input file into a dictionary as follows:
def add(self, timestamp, sensor_name, sensor_value):
    self.timeseries[sensor_name].append([timestamp, sensor_value])
... and created a DataFrame from the dictionary:
df = pd.DataFrame(self.timeseries)
... but I need some guidance on how to move forward from here, i.e. what's an elegant way to perform the downsampling.
I'm not 100% sure what you're doing but this is what I'd do to solve the problem. It assumes your data file is space-separated with a header row.
import pandas as pd
import numpy as np

# Load the data
data = pd.read_csv(file_name, sep=r"\s+", index_col=None)

# Take the mean of the values within a second
data["timestamp"] = np.floor(data["timestamp"])
data = data.groupby(["timestamp", "sensor_name"]).mean()
data = data.reset_index()

# Pivot
data = data.pivot(index="timestamp", columns="sensor_name", values="sensor_value")
If you have some other notion of "downsampling" in mind for this context, apply that instead of the mean.
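For reference, a sketch of the same idea using pandas' own resampling machinery; it assumes the timestamps can be treated as seconds and converted to a TimedeltaIndex, which may or may not fit what the timestamps actually represent:
import pandas as pd

# same space-separated file with a header row as above
data = pd.read_csv(file_name, sep=r"\s+", index_col=None)
wide = data.pivot_table(index="timestamp", columns="sensor_name", values="sensor_value")
wide.index = pd.to_timedelta(wide.index, unit="s")
downsampled = wide.resample("1s").mean()  # mean of the readings within each one-second bin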

Combining a large amount of netCDF files

I have a large folder of netCDF (.nc) files, each one with a similar name. The data files contain variables for time, longitude, latitude, and monthly precipitation. The goal is to get the average monthly precipitation over X years for each month, so in the end I would have 12 values representing the average monthly precipitation over X years for each lat and long. Each file covers the same location over many years.
Each file starts with the same name and ends in "date.SUB.nc", for example:
'data1.somthing.somthing1.avg_2d_Ind_Nx.200109.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.200509.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201104.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201004.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201003.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201103.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.201203.SUB.nc'
The ending is YearMonth.SUB.nc
What I have so far is:
array=[]
f = nc.MFDataset('data*.nc')
precp = f.variables['prectot']
time = f.variables['time']
array = f.variables['time','longitude','latitude','prectot']
I get a KeyError: ('time', 'longitude', 'latitude', 'prectot'). Is there a way to combine all this data so I am able to manipulate it?
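For what it's worth, the KeyError most likely comes from indexing f.variables with a tuple: variables is a dictionary keyed by individual variable names, so each one has to be fetched separately. A sketch, assuming the variable names used in the snippet above:
import netCDF4 as nc

f = nc.MFDataset('data*.nc')
time = f.variables['time'][:]
lon = f.variables['longitude'][:]
lat = f.variables['latitude'][:]
precp = f.variables['prectot'][:]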
As @CharlieZender mentioned, ncra is the way to go here, and I'll provide some more details on integrating that function into a Python script. (PS: you can install NCO easily with Homebrew, e.g. http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/)
import subprocess
import netCDF4
import glob
import numpy as np

for month in range(1, 13):
    # Gather all the files for this month
    month_files = glob.glob('/path/to/files/*{0:0>2d}.SUB.nc'.format(month))

    # Using NCO functions ---------------
    avg_file = './precip_avg_{0:0>2d}.nc'.format(month)
    # Concatenate the files using ncrcat
    subprocess.call(['ncrcat'] + month_files + ['-O', avg_file])
    # Take the time (record) average using ncra
    subprocess.call(['ncra', avg_file, '-O', avg_file])
    # Read in the monthly precip climatology file and do whatever now
    ncfile = netCDF4.Dataset(avg_file, 'r')
    pr = ncfile.variables['prectot'][:,:,:]
    ....

    # Using only Python -------------
    # Initialize an array to store monthly-mean precip for all years
    # let's presume we know the lat and lon dimensions (nlat, nlon)
    nyears = len(month_files)
    pr_arr = np.zeros([nyears, nlat, nlon], dtype='f4')
    # Populate pr_arr with each file's monthly-mean precip
    for idx, filename in enumerate(month_files):
        ncfile = netCDF4.Dataset(filename, 'r')
        pr = ncfile.variables['prectot'][:,:,:]
        pr_arr[idx,:,:] = np.mean(pr, axis=0)
        ncfile.close()
    # Take the average along all years for a monthly climatology
    pr_clim = np.mean(pr_arr, axis=0)  # 2D now [lat, lon]
NCO does this with
ncra *.01.SUB.nc pcp_avg_01.nc
ncra *.02.SUB.nc pcp_avg_02.nc
...
ncra *.12.SUB.nc pcp_avg_12.nc
ncrcat pcp_avg_??.nc pcp_avg.nc
Of course the first twelve commands can be done with a Bash loop, reducing the total number of lines to fewer than five. If you prefer to script with Python, you can check your answers against this. ncra docs here.
The command ymonmean calculates the mean of calendar months in CDO. Thus the task can be accomplished in two lines:
cdo mergetime data*.SUB.nc merged.nc # put files together into one series
cdo ymonmean merged.nc annual_cycle.nc # mean of all Jan,Feb etc.
cdo can also calculate the annual cycle of other statistics, ymonstd, ymonmax etc... and the time units can be days or pentads as well as months. (e.g. ydaymean).
