Loop through netcdf files and run calculations - Python or R - python

This is my first time using netCDF and I'm trying to wrap my head around working with it.
I have multiple version 3 netcdf files (NOAA NARR air.2m daily averages for an entire year). Each file spans a year between 1979 - 2012. They are 349 x 277 grids with approximately a 32km resolution. Data was downloaded from here.
The dimension is time (hours since 1/1/1800) and my variable of interest is air. I need to calculate accumulated days with a temperature < 0. For example
Day 1 = +4 degrees, accumulated days = 0
Day 2 = -1 degrees, accumulated days = 1
Day 3 = -2 degrees, accumulated days = 2
Day 4 = -4 degrees, accumulated days = 3
Day 5 = +2 degrees, accumulated days = 0
Day 6 = -3 degrees, accumulated days = 1
I need to store this data in a new netCDF file. I am familiar with Python and somewhat with R. What is the best way to loop through each day, check the previous day's value, and based on that, output a value to a new netCDF file with the exact same dimensions and variable? Or perhaps just add another variable to the original netCDF file with the output I'm looking for?
Is it best to leave all the files separate or combine them? I combined them with ncrcat and it worked fine, but the file is 2.3 GB.
Thanks for the input.
My current progress in python:
import numpy
import netCDF4
# Change my working DIR
f = netCDF4.Dataset('air7912.nc', 'r')
# list the variable names in the file
for a in f.variables:
    print(a)
# output =
f.variables['air'][1, 1, 1]
To help me understand this better: what type of data structure am I working with? Is ['air'] the key in the above example, and are [1,1,1] also keys, used to get the value of 298.37473? How can I then loop through [1,1,1]?

You can use the very nice MFDataset feature in netCDF4 to treat a bunch of files as one aggregated file, without the need to use ncrcat. So your code would look like this:
import numpy as np
import netCDF4
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = atemp.shape
cold_days = np.zeros((ny, nx), dtype=int)
# add 1 for every day whose temperature (stored in Kelvin) is below 0 degC
for i in range(ntimes):
    cold_days += (atemp[i, :, :].data - 273.15 < 0)
And here's one way to write the file (there might be easier ways):
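# create NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc', 'w', clobber=True)
# dimensions must exist before any variable that uses them
nco.createDimension('x', nx)
nco.createDimension('y', ny)
cold_days_v = nco.createVariable('cold_days', 'i4', ('y', 'x'))
cold_days_v.long_name = 'total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'
lono = nco.createVariable('lon', 'f4', ('y', 'x'))
lato = nco.createVariable('lat', 'f4', ('y', 'x'))
xo = nco.createVariable('x', 'f4', ('x',))
yo = nco.createVariable('y', 'f4', ('y',))
lco = nco.createVariable('Lambert_Conformal', 'i4')
# copy all the variable attributes from original file
for var in ['lon', 'lat', 'x', 'y', 'Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var], att, getattr(f.variables[var], att))
# copy variable data for lon,lat,x and y
lono[:] = f.variables['lon'][:]
lato[:] = f.variables['lat'][:]
xo[:] = f.variables['x'][:]
yo[:] = f.variables['y'][:]
# write the cold_days data
cold_days_v[:, :] = cold_days
# copy Global attributes from original file
for att in f.ncattrs():
    setattr(nco, att, getattr(f, att))
nco.close()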
If I try looking at the resulting file in the Unidata NetCDF-Java Tools-UI GUI, it seems to be okay.
Also note that here I just downloaded two of the datasets for testing, so I used
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
as an example. For all the data, you could use
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.????.nc')
or
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.*.nc')

Here is an R solution.
infiles <- list.files("data", pattern = "nc", full.names = TRUE, include.dirs = TRUE)
outfile <- "data/air.colddays.nc"
r <- raster::stack(infiles)
r <- sum((r - 273.15) < 0)

I know this is rather late for this thread from 2013, but I just want to point out that the accepted solution doesn't answer the exact question posed. The question seems to want the length of each continuous period of temperatures below zero (note that in the question the counter resets if the temperature exceeds zero), which can be important for climate applications (e.g. for farming), whereas the accepted solution only gives the total number of days in a year that the temperature is below zero. If this is really what mkmitchell wants (it has been accepted as the answer), then it can be done from the command line in cdo without having to worry about NetCDF input/output:
cdo timsum -lec,273.15 in.nc out.nc
so a looped script would be:
files=`ls *.nc` # pick up all the netcdf files in a directory
for file in $files ; do
    # I use 273.15 because, from the question, T seems to be in Kelvin
    cdo timsum -lec,273.15 $file ${file%???}_numdays.nc
done
If you then want the total number over the whole period you can then cat the _numdays files instead which are much smaller:
cdo cat *_numdays.nc total.nc
cdo timsum total.nc total_below_zero.nc
But again, the question seems to want accumulated days per event, which is different, but not provided by the accepted answer.
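For the per-event count that the question actually asks for, a minimal NumPy sketch could look like the following. It assumes the temperatures are in Kelvin and that a block of days fits in memory as a (ntimes, ny, nx) array; the function name consecutive_cold_days is made up for illustration and is not part of netCDF4 or cdo.

```python
import numpy as np

FREEZING_K = 273.15  # assumption: temperatures are stored in Kelvin


def consecutive_cold_days(air, threshold=FREEZING_K):
    """Return, for every time step and grid cell, the current run of days below threshold.

    air: array-like of shape (ntimes, ny, nx)
    """
    air = np.asarray(air)
    streak = np.zeros(air.shape, dtype=np.int32)
    running = np.zeros(air.shape[1:], dtype=np.int32)
    for i in range(air.shape[0]):
        cold = air[i] < threshold
        # extend the run where it is cold, reset it to zero everywhere else
        running = np.where(cold, running + 1, 0)
        streak[i] = running
    return streak


# the six-day example from the question, on a 1 x 1 grid
example_c = np.array([4, -1, -2, -4, 2, -3], dtype=float)
example_k = (example_c + 273.15).reshape(-1, 1, 1)
print(consecutive_cold_days(example_k)[:, 0, 0])  # [0 1 2 3 0 1]
```

With the full multi-year dataset it is probably better to keep only the running array in memory and write each day's result to the output variable as you go, rather than holding the whole streak array.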


How to make subplots from multiple files? Python matplot lib

I'm a student researcher who's running simulations on exoplanets to determine if they might be viable for life. The software I'm using outputs a file with several columns of various types of data. So far, I've written a Python script that goes through one file and grabs two columns of data: in this case, time and global temperature of the planet.
What I want to do is:
Write a Python script that goes through multiple files and grabs the same two columns that my current script does.
Then, I want to create subplots of all these files.
What stays consistent across all of the files is that time doesn't change: the x axis will always be time (from 0 to 1 million years). The y axis values will change across simulations, though.
This is what I got so far for my code:
import math as m
import matplotlib.pyplot as plt
import numpy as np
## Set datafile equal to the file I plan on using for data, and then open it
datafile = r"C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]\solarsys.Earth.forward"
file = open(datafile, "r")
# Create two empty lists for the x and y axes of my graph
years = []
GlobalT = []
# A for loop that looks in my file, and grabs only the 1st and 8th column adding them to the respective lists
for line in file:
    data = line.split(' ')
    years.append(float(data[0]))
    GlobalT.append(float(data[7]))
# Close the file
file.close()
# Plot my graph
fig = plt.gcf()
plt.plot(years, GlobalT)
plt.title('Global Temperature of GJ 229 b over time')
fig.set_size_inches(10, 6, forward=True)
plt.figtext(0.5, 0.0002, "This shows the global temperature of GJ 229 b when its semi-major axis is 0.929 au, \n"
            " and its actual mass relative to the sun (~8 Earth Masses)", wrap=True, horizontalalignment='center', fontsize=12)
plt.xlabel(" Years ")
plt.ylabel("Global Temp")
I think the simplest thing to do is to turn your code for one file into a function, and then call it in a loop that iterates over the files.
from pathlib import Path
import pandas as pd

def parse_datafile(pth):
    """Parses datafile"""
    results = []
    with pth.open('r') as f:
        for line in f:
            data = line.split(' ')
            results.append({'f': pth.stem,
                            'y': data[0],
                            't': data[7]})
    return results

basedir = Path(r'C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]')
# assuming you want to parse all files in directory
# if not, can change glob string for files you want
# flatten the per-file lists into a single list of rows
all_results = [row for pth in basedir.glob('*') for row in parse_datafile(pth)]
df = pd.DataFrame(all_results)
df['y'] = pd.to_numeric(df['y'], errors='coerce')
df['t'] = pd.to_numeric(df['t'], errors='coerce')
This will give you a dataframe with three columns - f (the filename), y (the year), and t (the temperature). You then have to convert y and t to numeric dtypes. This will be faster and handle errors more gracefully than your code, which will raise an error with any malformed data.
You can further manipulate this as needed to generate your plots. Definitely check if there are any NaN values and address them accordingly, either by dropping those rows or using fillna.
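To go from that dataframe to the subplots asked about, one possible sketch is below. It assumes the f, y and t columns described above; the helper name plot_per_file, the two-column layout and the demo values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.pyplot as plt
import pandas as pd


def plot_per_file(df, ncols=2):
    """Draw one subplot per distinct value of df['f'], all sharing the time axis."""
    groups = list(df.groupby("f"))
    nrows = -(-len(groups) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, sharex=True, squeeze=False,
                             figsize=(5 * ncols, 3 * nrows))
    flat = axes.ravel()
    for ax, (name, group) in zip(flat, groups):
        ax.plot(group["y"], group["t"])
        ax.set_title(name)
        ax.set_xlabel("Years")
        ax.set_ylabel("Global Temp")
    for ax in flat[len(groups):]:
        ax.set_visible(False)  # hide panels that have no file to show
    fig.tight_layout()
    return fig


# made-up numbers standing in for three parsed simulation files
demo = pd.DataFrame({
    "f": ["sim_a"] * 3 + ["sim_b"] * 3 + ["sim_c"] * 3,
    "y": [0, 5e5, 1e6] * 3,
    "t": [280, 285, 290, 250, 255, 251, 300, 310, 305],
})
fig = plot_per_file(demo)
```

Calling fig.savefig with a filename of your choice, or plt.show() with an interactive backend, then displays the grid.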

What is the most efficient way to group values in an array based on the month that they occur?

I have some data that I have read in the following way:
import numpy as np
import pandas as pd
filename = 'minamORE.txt'
f1 = open(filename, 'r')
lines = f1.readlines()
mOREt = []
mOREdis = []
data = pd.read_csv('minamORE.txt', sep='\t', header=None, usecols=[2, 3])
mOREdate = data[2].values
mOREdis = data[3].values
mOREdis = np.float64(mOREdis)
mOREdate = np.array(mOREdate, dtype="datetime64")
The date array spans over 20 years and has an entry for each day. I would like to somehow group all of the January measurements with all of the other January measurements, and so on through December.
I'm not experienced enough with Python to think of any solution other than doing it manually, as follows:
(NOTE: October 1 is the first measurement)
OCTMeasurements = [mOREdis[0:31], mOREdis[0+365:31+365], ..... , mOREdis[0+20*365:31+20*365]]
For obvious reasons, I'd like to avoid doing this if possible.
The dates are stored in the following format: YYYY-MM-DD.
If I could somehow refer to the values based on the MM value, I feel this would be the most efficient way, but again, inexperience renders me unable to do so.
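One way to group by the MM value is to let pandas extract the month from each date and use it as the grouping key. This is only a sketch: the synthetic dates and values below stand in for mOREdate and mOREdis, and the names series and by_month are made up for illustration.

```python
import numpy as np
import pandas as pd

# synthetic daily series standing in for mOREdate / mOREdis (starts on October 1)
dates = pd.date_range("2000-10-01", "2002-09-30", freq="D")
values = np.arange(len(dates), dtype=np.float64)

series = pd.Series(values, index=dates)

# the calendar month (1-12) of every timestamp is the grouping key
by_month = {month: group.to_numpy()
            for month, group in series.groupby(series.index.month)}

print(len(by_month[10]))  # October measurements across both years: 62

# a per-month statistic, e.g. the mean of all Januaries, all Februaries, ...
monthly_mean = series.groupby(series.index.month).mean()
```

With the arrays from the question, the series could presumably be built as pd.Series(mOREdis, index=pd.to_datetime(mOREdate)), which also avoids any assumption about 365-day years.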

Matplotlib Live Graph - Using Time as x-axis values

I was just wondering if it is possible to use Time as x-axis values for a matplotlib live graph.
If so, how should it be done? I have been trying many different methods but end up with errors.
This is my current code :
def getvoltage():
    f=open("VoltageReadings.txt", "a+")
    readings = [0]*100
    maxsample = 100
    counter = 0
    while (counter < maxsample):
        reading = adc.read_adc(0, gain=GAIN)
        counter += 1
    avg = sum(readings)/100
    voltage = (avg * 0.1259)/100
    time = str(datetime.datetime.now().time())
    f.write("%.2f," % (voltage) + time + "\r\n")
    label.config(text=str('Voltage: {0:.2f}'.format(voltage)))
    label.after(1000, getvoltage)
def animate(i):
    pullData = open("VoltageReadings.txt","r").read()
    dataList = pullData.split('\n')
    for eachLine in dataList:
        if len(eachLine) > 1:
            y, x = eachLine.split(',')
This is one of the latest methods I've tried, and I'm getting an error that says
ValueError: could not convert string to float: '17:21:55'
I've tried finding ways to convert the string into a float, but I can't seem to do it.
I'd really appreciate some help and guidance, thank you :)
I think that you should use the datetime library. You can parse your times with date = datetime.strptime('17:21:55', '%H:%M:%S'), but you need a reference date, for example the Unix epoch, date0 = datetime(1970, 1, 1). You can also use the starting point of your time series as date0 and parse full timestamps with date = datetime.strptime('01-01-2000 17:21:55', '%d-%m-%Y %H:%M:%S'). For each line in your file, compute the difference between the actual date and the reference date in seconds, for example with (date - date0).total_seconds(), and append it to a list (call it Diff_List). At the end, use T_plot = [datetime.utcfromtimestamp(i) for i in Diff_List]. Finally, plt.plot(T_plot, values) will let you see the dates on the x-axis.
You can also use the pandas library
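First, define your date parser according to the date format in your file: parser = lambda date: datetime.strptime(date, '%Y-%m-%d %H:%M:%S') (with from datetime import datetime, since pd.datetime is no longer available in recent pandas versions).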
Then you read your file
tmp = pd.read_csv(your_file, parse_dates={'datetime': ['date', 'time']}, date_parser=parser, comment='#',delim_whitespace=True,names=['date', 'time', 'Values'])
data = tmp.set_index(tmp['datetime']).drop('datetime', axis=1)
You can adapt these lines if you need to represent only hours HH:MM:SS not the whole date.
N.B.: The index will not run from 0 to data.values.shape[0]; the dates will be used as the index instead. So if you want to plot, you can import matplotlib.pyplot as plt and then call plt.plot(data.index, data.Values).
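For the file format in the question (one 'voltage,time' pair per line), a small sketch of the parsing step might look like this. The helper name parse_reading and the sample lines are made up; the resulting datetime objects can be passed to matplotlib's plot call directly instead of floats.

```python
from datetime import datetime


def parse_reading(line):
    """Split a 'voltage,HH:MM:SS[.ffffff]' line into (datetime, float)."""
    voltage_text, time_text = line.strip().split(",")
    # str() of a datetime.time omits the fractional part when microseconds are zero
    fmt = "%H:%M:%S.%f" if "." in time_text else "%H:%M:%S"
    # strptime with a time-only format fills in the default date 1900-01-01
    return datetime.strptime(time_text, fmt), float(voltage_text)


# hypothetical lines in the same layout that getvoltage() writes
lines = ["1.25,17:21:55", "1.30,17:21:56.250000"]
times, volts = zip(*(parse_reading(line) for line in lines))
```

Inside animate, the two tuples would take the place of the x and y lists, so no conversion of the time string to float is needed.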
You could use the polt Python package, which I developed for this exact purpose. polt uses matplotlib to display data from multiple sources simultaneously.
Create a script adc_read.py that reads values from your ADC and prints them out:
import random, sys, time
def read_adc():
    """
    Implement reading a voltage from your ADC here
    """
    # simulate measurement delay/sampling interval
    time.sleep(0.001)
    # simulate reading a voltage between 0 and 5V
    return random.uniform(0, 5)
while True:
    # gather 100 readings
    adc_readings = tuple(read_adc() for i in range(100))
    # calculate average
    adc_average = sum(adc_readings) / len(adc_readings)
    # output average
    print(adc_average)
    sys.stdout.flush()
which outputs
python3 adc_read.py
# output
This output can then be piped into polt to display the live data stream:
python3 adc_read.py | polt live
Labelling can be achieved by adding metadata:
python3 adc_read.py | \
    polt \
    add-source -c- -o name=ADC \
    add-filter -f metadata -o set-quantity=voltage -o set-unit='V' \
    live
The polt documentation contains information on possibilities for further customization.

Need help writing code that will automatically write more code?

I need help with writing code for a work project. I have written a script that uses pandas to read an excel file. I have a while-loop written to iterate through each row and append latitude/longitude data from the excel file onto a map (Folium, Open Street Map)
The issue I've run into has to do with the GPS data. I download a CSV file with vehicle coordinates. On some of the vehicles I'm tracking, the GPS loses signal for whatever reason and doesn't come back online for hundreds of miles. This causes issues when I'm using line plots to track the vehicle movement on the map. I end up getting long straight lines running across cities, since Folium is trying to connect the last GPS coordinate before the vehicle went offline with the next GPS coordinate available once the vehicle is back online, which could be hundreds of miles away. I think that if, every time the script finds a gap in GPS coords, I can have a new loop generated that will basically start a completely new line plot and append it to the existing map, I should still see the entire vehicle route on the map but without the long lines trying to connect broken points together.
My idea is to have my script calculate the absolute value difference between each iteration of longitude data. If the difference between each point is greater than 0.01, I want my program to end the loop and to start a new loop. This new loop would then need to have new variables init. I will not know how many new loops would need to be created since there's no way to predict how many times the GPS will go offline/online in each vehicle.
import folium
import pandas as pd
# Pulls CSV file from this location and adds headers to the columns
df = pd.read_csv('Example.CSV', names=['Longitude', 'Latitude'])
lat = (df.Latitude / 10 ** 7)  # Converting Lat/Lon into decimal degrees
lon = (df.Longitude / 10 ** 7)
zoom_start = 17  # Zoom level and starting location when map is opened
mapa = folium.Map(location=[lat[1], lon[1]], zoom_start=zoom_start)
i = 0
j = (lat[i] - lat[i - 1])
location = []
while i < len(lat):
    if abs(j) < 0.01:
        location.append((lat[i], lon[i]))
        i += 1
    # This section is where additional loops would ideally be generated
# Line plot settings
c1 = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5)
Here's pseudocode for how I want to accomplish this.
1) Python reads csv
2) Converts Long/Lat into decimal degrees
3) Init location1
4) Runs while loop to append coords
5) If abs(j) >= 0.01, break loop
6) Init location(2,3,...)
7) Generates new while i < len(lat): loop using location(2,3,...)
9) Repeats step 5-7 while i < len(lat) (repeat as many times as there are instances of abs(j) >= 0.01)
10) Creates (c1, c2, c3,...) = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5) for each variable of location
11) Creates c1.add_to(mapa) for each c1,c2,c3... listed above
12) mapa.save
Any help would be tremendously appreciated!
Working Solution
import folium
import pandas as pd
# Pulls CSV file from this location and adds headers to the columns
df = pd.read_csv('EXAMPLE.CSV', names=['Longitude', 'Latitude'])
lat = (df.Latitude / 10 ** 7)  # Converting Lat/Lon into decimal degrees
lon = (df.Longitude / 10 ** 7)
zoom_start = 17  # Zoom level and starting location when map is opened
mapa = folium.Map(location=[lat[1], lon[1]], zoom_start=zoom_start)
i = 1
location = []
while i < (len(lat)-1):
    location.append((lat[i], lon[i]))
    i += 1
    j = (lat[i] - lat[i - 1])
    if abs(j) > 0.01:
        # gap detected: draw the segment collected so far and start a new one
        c1 = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5)
        c1.add_to(mapa)
        location = []
Your while loop looks wonky. You only set j once, outside the loop. Also, I think you want a list of line segments. Did you want something like this?
i = 0
locations = []
while i < len(lat):
    segment = []  # start a new segment
    locations.append(segment)
    # add points to the current segment until all are
    # consumed or a disconnect is detected
    while i < len(lat):
        segment.append((lat[i], lon[i]))
        i += 1
        if i < len(lat) and abs(lat[i] - lat[i - 1]) > 0.01:
            break  # disconnect detected: the outer loop starts the next segment
When this is done locations will be a list of segments, e.g.;
[ segment0, segment1, ..... ]
each segment will be a list of points, e.g.;
[ (lat,lon), (lat,lon), ..... ]
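The same segmentation can be written as a small standalone function, which makes it easy to check on a few made-up points before running it on real GPS data. The name split_segments and the sample coordinates are hypothetical; the 0.01 degree threshold on latitude is taken from the question.

```python
def split_segments(lat, lon, max_gap=0.01):
    """Group (lat, lon) points into segments, breaking where latitude jumps by more than max_gap."""
    segments = []
    current = []
    for i in range(len(lat)):
        # a large jump from the previous point closes the current segment
        if current and abs(lat[i] - lat[i - 1]) > max_gap:
            segments.append(current)
            current = []
        current.append((lat[i], lon[i]))
    if current:
        segments.append(current)
    return segments


# made-up track with one signal dropout between the third and fourth point
lat = [40.000, 40.001, 40.002, 41.500, 41.501]
lon = [-75.000, -75.001, -75.002, -76.200, -76.201]
segments = split_segments(lat, lon)
print(len(segments))  # 2
```

Each entry of segments is one continuous stretch of the route, so it can be drawn as its own line on the map with the same folium call used in the question.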

Combining a large amount of netCDF files

I have a large folder of netCDF (.nc) files each one with a similar name. The data files contain variables of time, longitude, latitude, and monthly precipitation. The goal is to get the average monthly precipitation over X amount of years for each month. So in the end I would have 12 values representing the average monthly precipitation over X amount of years for each lat and long. Each file is the same location over many years.
Each file starts with the same name and ends in a “date.sub.nc” for example:
The ending is YearMonth.SUB.nc
What I have so far is:
f = nc.MFDataset('data*.nc')
precp = f.variables['prectot']
time = f.variables['time']
array = f.variables['time','longitude','latitude','prectot']
I get a KeyError: ('time', 'longitude', 'latitude', 'prectot'). Is there a way to combine all this data so I am able to manipulate it?
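The KeyError itself comes from how f.variables is indexed: it behaves like a dictionary keyed by a single variable name, so four names separated by commas are treated as one tuple key that does not exist. The sketch below illustrates this with a plain dict of NumPy arrays standing in for f.variables; the toy shapes are made up.

```python
import numpy as np

# plain dict standing in for f.variables (hypothetical toy data)
variables = {
    "time": np.arange(3),
    "latitude": np.linspace(-10, 10, 2),
    "longitude": np.linspace(0, 30, 4),
    "prectot": np.ones((3, 2, 4)),
}

try:
    variables["time", "longitude", "latitude", "prectot"]
except KeyError as err:
    # the four names were interpreted as a single tuple key
    print("KeyError:", err)

# look each variable up by its own name instead
time, lat, lon, precip = (variables[name]
                          for name in ("time", "latitude", "longitude", "prectot"))
print(precip.shape)  # (3, 2, 4)
```

In this toy version the coordinate variables are one-dimensional and only the data variable carries all three dimensions, which is the usual layout for gridded files, so there is normally no need to combine them into one array.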
As @CharlieZender mentioned, ncra is the way to go here, and I'll provide some more details on integrating that function into a Python script. (PS - you can install NCO easily with Homebrew, e.g. http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/)
import subprocess
import netCDF4
import glob
import numpy as np
for month in range(1, 13):
    # Gather all the files for this month
    month_files = glob.glob('/path/to/files/*{0:0>2d}.SUB.nc'.format(month))
    # Using NCO functions ---------------
    avg_file = './precip_avg_{0:0>2d}.nc'.format(month)
    # Concatenate the files using ncrcat
    subprocess.call(['ncrcat'] + month_files + ['-O', avg_file])
    # Take the time (record) average using ncra
    subprocess.call(['ncra', avg_file, '-O', avg_file])
    # Read in the monthly precip climatology file and do whatever now
    ncfile = netCDF4.Dataset(avg_file, 'r')
    pr = ncfile.variables['prectot'][:, :, :]
    # Using only Python -------------
    # Initialize an array to store monthly-mean precip for all years
    # let's presume we know the lat and lon dimensions (nlat, nlon)
    nyears = len(month_files)
    pr_arr = np.zeros([nyears, nlat, nlon], dtype='f4')
    # Populate pr_arr with each file's monthly-mean precip
    for idx, filename in enumerate(month_files):
        ncfile = netCDF4.Dataset(filename, 'r')
        pr = ncfile.variables['prectot'][:, :, :]
        pr_arr[idx, :, :] = np.mean(pr, axis=0)
    # Take the average along all years for a monthly climatology
    pr_clim = np.mean(pr_arr, axis=0)  # 2D now [lat,lon]
NCO does this with
ncra *.01.SUB.nc pcp_avg_01.nc
ncra *.02.SUB.nc pcp_avg_02.nc
...
ncra *.12.SUB.nc pcp_avg_12.nc
ncrcat pcp_avg_??.nc pcp_avg.nc
Of course the first twelve commands can be done with a Bash loop, reducing the total number of lines to fewer than five. If you prefer to script with Python, you can check your answers with this. ncra docs here.
The command ymonmean calculates the mean of calendar months in CDO. Thus the task can be accomplished in two lines:
cdo mergetime data*.SUB.nc merged.nc # put files together into one series
cdo ymonmean merged.nc annual_cycle.nc # mean of all Jan,Feb etc.
cdo can also calculate the annual cycle of other statistics, ymonstd, ymonmax etc... and the time units can be days or pentads as well as months. (e.g. ydaymean).
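To check the result of either tool from Python, the calendar-month mean can be sketched in plain NumPy. This assumes the merged data has already been read into a (ntime, nlat, nlon) array together with the calendar month of each time step; the function name monthly_climatology and the synthetic values are made up for illustration.

```python
import numpy as np


def monthly_climatology(data, months):
    """Average data of shape (ntime, nlat, nlon) over all time steps sharing a calendar month.

    months: calendar month (1-12) of each time step, length ntime.
    Returns an array of shape (12, nlat, nlon); months with no data stay NaN.
    """
    data = np.asarray(data, dtype=float)
    months = np.asarray(months)
    clim = np.full((12,) + data.shape[1:], np.nan)
    for m in range(1, 13):
        mask = months == m
        if mask.any():
            clim[m - 1] = data[mask].mean(axis=0)
    return clim


# two synthetic years on a 2 x 3 grid: year 1 holds the month number, year 2 holds month + 10
months = np.tile(np.arange(1, 13), 2)
yearly = np.concatenate([np.arange(1, 13), np.arange(11, 23)])
data = yearly.reshape(-1, 1, 1) * np.ones((1, 2, 3))

clim = monthly_climatology(data, months)
print(clim[:, 0, 0])  # January averages to 6.0, December to 17.0
```

The twelve slices of clim correspond to the twelve time steps that ymonmean writes, so the two can be compared directly for one grid cell.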
