I have a large folder of netCDF (.nc) files, each with a similar name. The files contain time, longitude, latitude, and monthly precipitation variables. The goal is to get the average monthly precipitation over X years for each month, so in the end I would have 12 values representing the average monthly precipitation over X years at each lat and lon. The files all cover the same location, spanning many years.
Each file starts with the same name and ends in a date plus ".SUB.nc", for example:
'data1.somthing.somthing1.avg_2d_Ind_Nx.200109.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.200509.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201104.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201004.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201003.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201103.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.201203.SUB.nc'
The ending is YearMonth.SUB.nc
What I have so far is:
import netCDF4 as nc

array = []
f = nc.MFDataset('data*.nc')
precp = f.variables['prectot']
time = f.variables['time']
array = f.variables['time','longitude','latitude','prectot']
I get a KeyError: ('time', 'longitude', 'latitude', 'prectot'). Is there a way to combine all this data so I am able to manipulate it?
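For context, f.variables is a dictionary keyed by single variable names, so a tuple of four names is looked up as one (nonexistent) key, hence the KeyError. A minimal sketch of reading each variable separately, assuming those variable names exist in the files:

import netCDF4 as nc

f = nc.MFDataset('data*.nc')
precp = f.variables['prectot'][:]   # one variable name per lookup
time = f.variables['time'][:]
lat = f.variables['latitude'][:]
lon = f.variables['longitude'][:]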
As @CharlieZender mentioned, ncra is the way to go here, and I'll provide some more details on integrating that function into a Python script. (PS - you can install NCO easily with Homebrew, e.g. http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/)
import subprocess
import netCDF4
import glob
import numpy as np

for month in range(1, 13):
    # Gather all the files for this month
    month_files = glob.glob('/path/to/files/*{0:0>2d}.SUB.nc'.format(month))

    # Using NCO functions ---------------
    avg_file = './precip_avg_{0:0>2d}.nc'.format(month)
    # Concatenate the files using ncrcat
    subprocess.call(['ncrcat'] + month_files + ['-O', avg_file])
    # Take the time (record) average using ncra
    subprocess.call(['ncra', avg_file, '-O', avg_file])
    # Read in the monthly precip climatology file and do whatever now
    ncfile = netCDF4.Dataset(avg_file, 'r')
    pr = ncfile.variables['prectot'][:, :, :]
    # ...

    # Using only Python -------------
    # Initialize an array to store monthly-mean precip for all years;
    # let's presume we know the lat and lon dimensions (nlat, nlon)
    nyears = len(month_files)
    pr_arr = np.zeros([nyears, nlat, nlon], dtype='f4')
    # Populate pr_arr with each file's monthly-mean precip
    for idx, filename in enumerate(month_files):
        ncfile = netCDF4.Dataset(filename, 'r')
        pr = ncfile.variables['prectot'][:, :, :]
        pr_arr[idx, :, :] = np.mean(pr, axis=0)
        ncfile.close()
    # Take the average along all years for a monthly climatology
    pr_clim = np.mean(pr_arr, axis=0)  # 2D now [lat, lon]
NCO does this with
ncra *.01.SUB.nc pcp_avg_01.nc
ncra *.02.SUB.nc pcp_avg_02.nc
...
ncra *.12.SUB.nc pcp_avg_12.nc
ncrcat pcp_avg_??.nc pcp_avg.nc
Of course the first twelve commands can be done with a Bash loop (sketched below), reducing the total number of lines to fewer than five. If you prefer to script with Python, you can check your answers against this. ncra docs here.
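A minimal sketch of that Bash loop, assuming the filename pattern above (seq -w zero-pads the month):

for month in $(seq -w 1 12); do
    ncra *.${month}.SUB.nc pcp_avg_${month}.nc
done
ncrcat pcp_avg_??.nc pcp_avg.nc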
The command ymonmean calculates the mean of calendar months in CDO. Thus the task can be accomplished in two lines:
cdo mergetime data*.SUB.nc merged.nc # put files together into one series
cdo ymonmean merged.nc annual_cycle.nc # mean of all Jan,Feb etc.
cdo can also calculate the annual cycle of other statistics (ymonstd, ymonmax, etc.), and the time unit can be days or pentads as well as months (e.g. ydaymean).
I have a folder with data collected over (most of) a whole year. Each day of the year has its own subfolder, which contains a CSV with that day's data. I need to open each CSV, plot the data on a graph, and find the mean, etc. I have made the plot with a loop that iterates over the files and plots each point, but when it comes to populating arrays outside the loop so I can manipulate the data, I keep having problems.
I have added lots of comments to show what I have done.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# There are 227 files; after each month, the day value resets.
# Folder/file name format: yyyymmdd
# First day = 20210411
# Last day = 20211231
# Looping over every date integer from the first to the last file
for n in range(20210411, 20211232):
    try:
        # Creating a different file name to loop over and open each file
        filename = f'/Users/nathancall92/Desktop/2021/{n}/wf{n}.vav.ada.aia.eof.csv'
        # Opening a CSV instance
        CSV = pd.read_csv(filename, skiprows=2, engine='python')
        # Converting to array
        data = CSV.to_numpy()
        # Slicing out columns needed for plotting
        day, hour, XCO2, XCH4, azi = data[:, 3], data[:, 4], data[:, 32], data[:, 30], data[:, 9]
        # Converting to decimal day: AEST
        decimalDay = day + ((hour + 10) / 24)
        # Mask out any azimuth angle of 80 or larger, due to false measurements
        mask = azi < 80
        # Plotting only the masked-in points
        plt.scatter(decimalDay[mask], XCO2[mask], s=0.2, marker="^", c='coral')
        plt.ylim(405, 425)
        plt.yticks(np.arange(start=405, stop=426, step=5))
        plt.xticks(np.arange(start=100, stop=365, step=25))
        plt.xlim(100, 375)
        plt.xlabel('Day of year')
        plt.ylabel('XCO2 (ppm)')
        plt.title('CO2 emissions March-December 2021')
        plt.show()
    # Continue looping over date integers that do not correspond to a file
    except OSError:
        continue
I need to populate arrays outside of the loop with all of the values of day, XCO2 and XCH4, and I am having a great deal of trouble; I can't seem to find a solution no matter what I try.
Any help would be much appreciated.
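One pattern that may help here, sketched under the assumption that the path and column layout above are right: create the lists once before the loop, append each day's columns inside it, and concatenate into flat arrays afterwards.

import numpy as np
import pandas as pd

all_day, all_xco2, all_xch4 = [], [], []   # created once, outside the loop
for n in range(20210411, 20211232):
    try:
        filename = f'/Users/nathancall92/Desktop/2021/{n}/wf{n}.vav.ada.aia.eof.csv'
        data = pd.read_csv(filename, skiprows=2, engine='python').to_numpy()
        day, hour = data[:, 3], data[:, 4]
        xco2, xch4 = data[:, 32], data[:, 30]
        all_day.append(day + (hour + 10) / 24)  # decimal day, AEST
        all_xco2.append(xco2)
        all_xch4.append(xch4)
    except OSError:
        continue

# One flat array per quantity, ready for means, plotting, etc.
day_arr = np.concatenate(all_day)
xco2_arr = np.concatenate(all_xco2)
xch4_arr = np.concatenate(all_xch4)
print(day_arr.mean(), xco2_arr.mean(), xch4_arr.mean())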
This question is about extracting data from NetCDF files (*.nc) to CSV, to be used by researchers for other applications - statistical analysis along time series in a form that is not array-like. There are two alternatives for beginners in Python, namely:
Script 1:
I need your help extracting climate data from an .nc file to CSV using Python. The data provider, Copernicus, gives some advice on how to extract data from .nc to CSV using a simple Python script. However, I have an issue with OverflowError: int too big to convert.
I will briefly describe the steps and provide all necessary info regarding data shape and content.
# this is for reading the .nc files in the working folder
import glob
# this is required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the csv files
import pandas as pd
# required for using the array functions
import numpy as np
from matplotlib.dates import num2date

data = Dataset('prAdjust_tmean.nc')
And data looks like this:
print(data)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
frequency: year
CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
creation_date: 2020-02-12T15:00:49ZCET+0100
Conventions: CF-1.6
institution_url: www.smhi.se
invar_platform_id: -
invar_rcm_model_driver: MPI-M-MPI-ESM-LR
time_coverage_start: 1971
time_coverage_end: 2000
domain: EUR-11
geospatial_lat_min: 23.942343
geospatial_lat_max: 72.641624
geospatial_lat_resolution: 0.04268074 degree
geospatial_lon_min: -35.034023
geospatial_lon_max: 73.937675
geospatial_lon_resolution: 0.009246826 degree
geospatial_bounds: -
NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
contact: Hydro.fou#smhi.se
keywords: precipitation
license: Copernicus License V1.2
output_frequency: 30 year average value
summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf, https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
invar_bc_institution: Swedish Meteorological and Hydrological Institute
invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
invar_bc_method_id: TimescaleBC v1.02
invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
invar_bc_observation_id: EFAS-Meteo
invar_bc_period: 1990-2018
data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
institution: SMHI
project_id: C3S_424_SMHI
references:
source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
invar_experiment_id: rcp45
invar_realisation_id: r1i1p1
invar_rcm_model_id: MPI-CSC-REMO2009-v1
variable_name: prAdjust_tmean
dimensions(sizes): x(1000), y(950), time(1), bnds(2)
variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
groups:
Extract the variable:
t2m = data.variables['prAdjust_tmean']
Get dimensions assuming 3D: time, latitude, longitude
time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = './'
And the error:
OverflowError Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
2 time_dim, lat_dim, lon_dim = t2m.get_dims()
3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
5 latitudes = data.variables[lat_dim.name][:]
6 longitudes = data.variables[lon_dim.name][:]
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
509 if tz is None:
510 tz = _get_rc_timezone()
--> 511 return _from_ordinalf_np_vectorized(x, tz).tolist()
512
513
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2106 vargs.extend([kwargs[_n] for _n in names])
2107
-> 2108 return self._vectorize_call(func=func, args=vargs)
2109
2110 def _get_ufunc_and_otypes(self, func, args):
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2190 for a in args]
2191
-> 2192 outputs = ufunc(*inputs)
2193
2194 if ufunc.nout == 1:
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
329
330 dt = (np.datetime64(get_epoch()) +
--> 331 np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
332 if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
333 raise ValueError(f'Date ordinal {x} converts to {dt} (using '
OverflowError: int too big to convert
Last part of the script is:
import os
# Path
path = "/home"
# Join various path components
print(os.path.join(path, "User/Desktop", "file.txt"))
# Path
path = "User/Documents"
# Join various path components
print(os.path.join(path, "/home", "file.txt"))
filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
    x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]

df = pd.DataFrame({
    'time': [t.isoformat() for t in times_grid],
    'latitude': latitudes_grid,
    'longitude': longitudes_grid,
    't2m': t2m[:].flatten()})
df.to_csv(filename, index=False)
print('Done')
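A note on the OverflowError above: matplotlib's num2date expects day-based date ordinals, while this file's time axis is in CF-style units, so the conversion overflows. One possible fix, sketched under the assumption that the file's units attribute is CF-compliant, is netCDF4's own num2date:

from netCDF4 import Dataset, num2date  # CF-aware, unlike matplotlib.dates.num2date

data = Dataset('prAdjust_tmean.nc')
time_var = data.variables['time']
times = num2date(time_var[:], units=time_var.units)
print(times)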
Script 2:
The second script has a more useful approach, because it lets the user select exact data from the NetCDF file with Python rather than extracting everything. This script was written by GeoDelta Labs on YouTube. However, as in most cases, the script is "written" for a specific format of .nc files and needs further tweaking to read any .nc file.
The code is like this:
import netCDF4
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np

# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    data.variables.keys()
    print(data.__dict__)

for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[14:18]
    all_years.append(year)
all_years

# Creating an empty Pandas DataFrame covering the whole range of data
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='Y')
df = pd.DataFrame(0.0, columns=['Temperature'], index=date_range)
Now the interesting part of this second Python script. Usually NetCDF files cover vast geographical areas (for example, the Northern Hemisphere or entire continents), which is not so useful for particular statistical analysis. Say the user wants to extract climate data from .nc files for particular locations - countries, regions, cities, etc. GeoDelta Lab comes with a very useful approach.
It requires a small CSV file, created separately by the user, with three columns: Name (name of the region/country/location), Longitude, and Latitude.
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv('countries.csv')
cities
all_years

for index, row in cities.iterrows():
    location = row['names']
    location_latitude = row['lat']
    location_longitude = row['long']

    # Sorting the all_years python list
    all_years.sort()
    for yr in all_years:
        # Reading-in the data
        data = Dataset('prAdjust_tmean_abs_QM-EFAS-Meteo-EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_MPI-CSC-REMO2009-v1_na_1971-2000_grid5km_v1.nc', 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the average temperature data
        temp = data.variables['prAdjust_tmean']
        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start=start, end=end, freq='D')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for ' + location + ': ' + str(d_range[t_index]))
            # write into the 'Temperature' column created above
            # (the original had a 'Temparature' typo and a chained assignment that never wrote through)
            df.loc[d_range[t_index], 'Temperature'] = temp[t_index, min_index_lat, min_index_lon]
    df.to_csv(location + '.csv')
The last part combines the data from the user-defined CSV (with the exact locations) and the data extracted from the .nc file to create a new CSV with exactly the data the user needs.
However, as always, the second script fails to "read" the years from my .nc file and finds only the starting year, 1970. Can someone explain what I am doing wrong? Could it be that the .nc file contains only data for 1970? The description of the data clearly specifies that the file contains annual averages from 1970 until 2000.
If someone can help me to solve this issue, I guess this script can be useful for all future users of netCDF files. Many thanks for your time.
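One quick diagnostic that may settle this, assuming the file from above: print how many time steps the file actually holds. The header earlier shows time(1) and an output_frequency of "30 year average value", which would mean a single averaged time step rather than one per year.

from netCDF4 import Dataset

data = Dataset('prAdjust_tmean.nc', 'r')
time = data.variables['time']
print(len(time), time.units)  # a length of 1 would mean a single 30-year mean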
Edit: @Robert Davy requested sample data. I've added a link to Google Drive, where the original .nc files are stored and can be downloaded/viewed by anyone with the link: https://drive.google.com/drive/folders/11q9dKCoQWqPE5EMcM5ye1bNPqELGMzOA?usp=sharing
I'm a student researcher running simulations on exoplanets to determine whether they might be viable for life. The software I'm using outputs a file with several columns of various types of data. So far, I've written a Python script that goes through one file and grabs two columns of data - in this case, time and the planet's global temperature.
What I want to do is:
Write a Python script that goes through multiple files and grabs the same two columns my current script does.
Then, I want to create subplots of all these files.
The thing that stays consistent across all of the files is time: the x axis will always be time (from 0 to 1 million years). The y axis values will change across simulations, though.
This is what I got so far for my code:
import math as m
import matplotlib.pylab as plt
import numpy as np

# Set datafile equal to the file I plan on using for data, and then open it
datafile = r"C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]\solarsys.Earth.forward"
file = open(datafile, "r")

# Create two empty lists for the x and y axes of my graphs
years = []
GlobalT = []

# A for loop that looks in my file and grabs only the 1st and 8th columns,
# adding them to the respective lists
for line in file:
    data = line.split(' ')
    years.append(float(data[0]))
    GlobalT.append(float(data[7]))

# Close the file
file.close()

# Plot my graph
fig = plt.gcf()
plt.plot(years, GlobalT)
plt.title('Global Temperature of GJ 229 b over time')
fig.set_size_inches(10, 6, forward=True)
plt.figtext(0.5, 0.0002, "This shows the global temperature of GJ 229 b when its semi-major axis is 0.929 au, \n"
            "and its actual mass relative to the sun (~8 Earth masses)", wrap=True, horizontalalignment='center', fontsize=12)
plt.xlabel("Years")
plt.ylabel("Global Temp")
plt.show()
I think the simplest thing to do is to turn your code for one file into a function, and then call it in a loop that iterates over the files.
from pathlib import Path
import pandas as pd

def parse_datafile(pth):
    """Parse one data file into a list of records."""
    results = []
    with pth.open('r') as f:
        for line in f:
            data = line.split(' ')
            results.append({'f': pth.stem,
                            'y': data[0],
                            't': data[7]})
    return results

basedir = Path(r'C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]')

# assuming you want to parse all files in the directory
# if not, change the glob string to match the files you want
all_results = [rec for pth in basedir.glob('*')
               for rec in parse_datafile(pth)]

df = pd.DataFrame(all_results)
df['y'] = pd.to_numeric(df['y'], errors='coerce')
df['t'] = pd.to_numeric(df['t'], errors='coerce')
This will give you a dataframe with three columns - f (the filename), y (the year), and t (the temperature). You then have to convert y and t to numeric dtypes. This will be faster and handle errors more gracefully than your code, which will raise an error with any malformed data.
You can further manipulate this as needed to generate your plots. Definitely check if there are any NaN values and address them accordingly, either by dropping those rows or using fillna.
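From there, one possible way to get a subplot per file - a sketch assuming the dataframe built above (the names f, y, t come from that sketch):

import numpy as np
import matplotlib.pyplot as plt

df = df.dropna(subset=['y', 't'])  # drop any rows with malformed data
groups = list(df.groupby('f'))     # one group per input file
fig, axes = plt.subplots(nrows=len(groups), sharex=True)
for ax, (fname, grp) in zip(np.atleast_1d(axes), groups):
    ax.plot(grp['y'], grp['t'])
    ax.set_title(fname)
plt.xlabel('Years')
plt.show()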
I was just wondering if it is possible to use Time as x-axis values for a matplotlib live graph.
If so, how should it be done? I have been trying many different methods but end up with errors.
This is my current code :
import datetime

# adc, GAIN and the plot axes `a` are defined elsewhere in the program
def update_label(label):
    def getvoltage():
        f = open("VoltageReadings.txt", "a+")
        readings = [0] * 100
        maxsample = 100
        counter = 0
        while counter < maxsample:
            reading = adc.read_adc(0, gain=GAIN)
            readings.append(reading)
            counter += 1
        avg = sum(readings) / 100
        voltage = (avg * 0.1259) / 100
        time = str(datetime.datetime.now().time())
        f.write("%.2f," % (voltage) + time + "\r\n")
        readings.clear()
        label.config(text=str('Voltage: {0:.2f}'.format(voltage)))
        label.after(1000, getvoltage)
    getvoltage()

def animate(i):
    pullData = open("VoltageReadings.txt", "r").read()
    dataList = pullData.split('\n')
    xList = []
    yList = []
    for eachLine in dataList:
        if len(eachLine) > 1:
            y, x = eachLine.split(',')
            xList.append(float(x))
            yList.append(float(y))
    a.clear()
    a.plot(xList, yList)
This is one of the latest methods I've tried, and I'm getting an error that says
ValueError: could not convert string to float: '17:21:55'
I've tried finding ways to convert the string into a float, but I can't seem to do it.
I'd really appreciate some help and guidance, thank you :)
I think you should use the datetime library. You can parse your times with date = datetime.strptime('17:21:55', '%H:%M:%S'), but you need a reference epoch, e.g. date0 = datetime(1970, 1, 1). You can also use the starting point of your time series as date0, and then parse full timestamps with date = datetime.strptime('01-01-2000 17:21:55', '%d-%m-%Y %H:%M:%S'). Compute the difference between each actual date and the reference date IN SECONDS (there are several functions to do this) for each line in your file, appending each difference to a list (call it Diff_List). At the end, use T_plot = [datetime.utcfromtimestamp(i) for i in Diff_List]. Finally, plt.plot(T_plot, values) will display the dates on the x-axis.
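A minimal sketch of that pipeline, with a fixed dummy date prefixed to each time string (the sample times are made up):

from datetime import datetime

date0 = datetime(1970, 1, 1)                  # reference epoch
times = ['17:21:55', '17:21:56', '17:21:57']  # e.g. read from the file
diff_list = [(datetime.strptime('01-01-2000 ' + t, '%d-%m-%Y %H:%M:%S') - date0).total_seconds()
             for t in times]
t_plot = [datetime.utcfromtimestamp(s) for s in diff_list]
print(t_plot)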
You can also use the pandas library.
First, define your date parsing depending on the date format in your file: parser = lambda date: datetime.strptime(date, '%Y-%m-%d %H:%M:%S') (using from datetime import datetime).
Then you read your file
tmp = pd.read_csv(your_file, parse_dates={'datetime': ['date', 'time']}, date_parser=parser, comment='#',delim_whitespace=True,names=['date', 'time', 'Values'])
data = tmp.set_index(tmp['datetime']).drop('datetime', axis=1)
You can adapt these lines if you need to represent only hours (HH:MM:SS) rather than the whole date.
N.B.: Indexing will not run from 0 to data.values.shape[0]; the dates are used as the index. So if you want to plot, you can do import matplotlib.pyplot as plt and then plt.plot(data.index, data.Values).
You could use the polt Python package, which I developed for this exact purpose. polt uses matplotlib to display data from multiple sources simultaneously.
Create a script adc_read.py that reads values from your ADC and prints them out:
import random, sys, time

def read_adc():
    """
    Implement reading a voltage from your ADC here
    """
    # simulate measurement delay/sampling interval
    time.sleep(0.001)
    # simulate reading a voltage between 0 and 5V
    return random.uniform(0, 5)

while True:
    # gather 100 readings
    adc_readings = tuple(read_adc() for i in range(100))
    # calculate average
    adc_average = sum(adc_readings) / len(adc_readings)
    # output average
    print(adc_average)
    sys.stdout.flush()
which outputs
python3 adc_read.py
# output
2.3187490696344444
2.40019412977279
2.3702603804716555
2.3793495215651435
2.5596985467604703
2.5433401603774413
2.6048815735614004
2.350392397280291
2.4372325168231948
2.5618046803145647
...
This output can then be piped into polt to display the live data stream:
python3 adc_read.py | polt live
Labelling can be achieved by adding metadata:
python3 adc_read.py | \
polt \
add-source -c- -o name=ADC \
add-filter -f metadata -o set-quantity=voltage -o set-unit='V' \
live
The polt documentation contains information on possibilities for further customization.
This is my first time using netCDF and I'm trying to wrap my head around working with it.
I have multiple version-3 netCDF files (NOAA NARR air.2m daily averages, one file per year). Each file spans one year between 1979 and 2012. They are 349 x 277 grids with approximately 32 km resolution. The data was downloaded from here.
The record dimension is time (hours since 1/1/1800), and my variable of interest is air. I need to calculate accumulated days with a temperature < 0. For example:
Day 1 = +4 degrees, accumulated days = 0
Day 2 = -1 degrees, accumulated days = 1
Day 3 = -2 degrees, accumulated days = 2
Day 4 = -4 degrees, accumulated days = 3
Day 5 = +2 degrees, accumulated days = 0
Day 6 = -3 degrees, accumulated days = 1
I need to store this data in a new netCDF file. I am familiar with Python and somewhat with R. What is the best way to loop through each day, check the previous day's value and, based on that, output a value to a new netCDF file with exactly the same dimensions and variable... or perhaps just add another variable to the original netCDF file with the output I'm looking for? (A sketch of the counter logic follows below.)
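For the counter itself, a minimal sketch of the reset-on-thaw logic in plain numpy, using the example values above (temperatures assumed in degrees C):

import numpy as np

temps = np.array([4, -1, -2, -4, 2, -3])
accum = np.zeros_like(temps)
count = 0
for i, t in enumerate(temps):
    count = count + 1 if t < 0 else 0  # reset whenever the day is above zero
    accum[i] = count
print(accum)  # [0 1 2 3 0 1]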
Is it best to leave all the files separate or combine them? I combined them with ncrcat and it worked fine, but the file is 2.3 GB.
Thanks for the input.
My current progress in python:
import numpy
import netCDF4

# Change my working DIR
f = netCDF4.Dataset('air7912.nc', 'r')
for a in f.variables:
    print(a)
#output =
lat
long
x
y
Lambert_Conformal
time
time_bnds
air
f.variables['air'][1, 1, 1]
#Output
298.37473
To help me understand this better: what type of data structure am I working with? Is 'air' the key in the above example, and are [1, 1, 1] also keys, giving the value 298.37473? How can I then loop through [1, 1, 1]?
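On that last question: f.variables is a dictionary, so 'air' is indeed a key, but [1, 1, 1] are integer indexes along the variable's dimensions, not keys. A short sketch, reusing the f opened above (the dimension order is an assumption based on the variables printed earlier):

air = f.variables['air']   # a netCDF4.Variable, not a plain dict or array
print(air.dimensions)      # e.g. ('time', 'y', 'x')
print(air.shape)           # the sizes along those dimensions
first_step = air[0, :, :]  # every grid cell at the first time step
for t in range(air.shape[0]):  # loop over the time dimension
    grid = air[t, :, :]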
You can use the very nice MFDataset feature in netCDF4 to treat a bunch of files as one aggregated file, without the need to use ncrcat. So your code would look like this:
from pylab import *
import netCDF4

f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
f.variables.keys()
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny, nx), dtype=int)
for i in range(ntimes):
    cold_days += (atemp[i, :, :].data - 273.15) < 0
pcolormesh(cold_days)
colorbar()
And here's one way to write the file (there might be easier ways):
# create NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc', 'w', clobber=True)
nco.createDimension('x', nx)
nco.createDimension('y', ny)

cold_days_v = nco.createVariable('cold_days', 'i4', ('y', 'x'))
cold_days_v.units = 'days'
cold_days_v.long_name = 'total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'

lono = nco.createVariable('lon', 'f4', ('y', 'x'))
lato = nco.createVariable('lat', 'f4', ('y', 'x'))
xo = nco.createVariable('x', 'f4', ('x'))
yo = nco.createVariable('y', 'f4', ('y'))
lco = nco.createVariable('Lambert_Conformal', 'i4')

# copy all the variable attributes from the original file
for var in ['lon', 'lat', 'x', 'y', 'Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var], att, getattr(f.variables[var], att))

# copy variable data for lon, lat, x and y
lono[:] = f.variables['lon'][:]
lato[:] = f.variables['lat'][:]
xo[:] = f.variables['x'][:]
yo[:] = f.variables['y'][:]

# write the cold_days data
cold_days_v[:, :] = cold_days

# copy global attributes from the original file
for att in f.ncattrs():
    setattr(nco, att, getattr(f, att))

nco.Conventions = 'CF-1.6'
nco.close()
If I look at the resulting file in the Unidata NetCDF-Java Tools-UI GUI, it seems to be okay.
Also note that here I just downloaded two of the datasets for testing, so I used
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
as an example. For all the data, you could use
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.????.nc')
or
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.*.nc')
Here is an R solution.
infiles <- list.files("data", pattern = "nc", full.names = TRUE, include.dirs = TRUE)
outfile <- "data/air.colddays.nc"
library(raster)
r <- raster::stack(infiles)
r <- sum((r - 273.15) < 0)
plot(r)
I know this is rather late for this thread from 2013, but I just want to point out that the accepted solution doesn't answer the exact question posed. The question asks for the length of each continuous period of temperatures below zero (note that in the question the counter resets if the temperature exceeds zero), which can be important for climate applications (e.g. for farming), whereas the accepted solution only gives the total number of days in a year that the temperature is below zero. If the total is really what mkmitchell wants (it has been accepted as the answer), then it can be done from the command line in CDO, without having to worry about NetCDF input/output:
cdo timsum -lec,273.15 in.nc out.nc
so a looped script would be:
files=`ls *.nc` # pick up all the netcdf files in a directory
for file in $files ; do
    # I use 273.15 since the question suggests T is in Kelvin
    cdo timsum -lec,273.15 $file ${file%???}_numdays.nc
done
If you then want the total number over the whole period, you can cat the _numdays files instead, which are much smaller:
cdo cat *_numdays.nc total.nc
cdo timsum total.nc total_below_zero.nc
But again, the question seems to want accumulated days per event, which is different and not provided by the accepted answer.
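As an aside - and this is an assumption worth checking against your CDO version's operator list - newer CDO releases ship a CONSECSTAT module whose consecsum operator accumulates the length of consecutive periods satisfying a condition, which sounds much closer to the per-event count the question asks for:

# hedged sketch: requires a CDO build that includes the CONSECSTAT module
cdo consecsum -lec,273.15 in.nc accum_days.nc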