I have a 1D array, an hourly time series dataset of 49090 points, which needs to be converted to netCDF format.
In the code below, result_u2 is a 1D array that stores the result of a for loop. It has 49090 data points.
nhours = 49091;#one added to no of datapoints
unout.units = 'hours since 2012-10-20 00:00:00'
unout.calendar = 'gregorian'
ncout = Dataset('output.nc','w','NETCDF3');
ncout.createDimension('time',nhours);
datesout = [datetime.datetime(2012,10,20,0,0,0)+n*timedelta(hours=1) for n in range(nhours)]; # create datevalues
timevar = ncout.createVariable('time','float64',('time'));timevar.setncattr('units',unout);timevar[:]=date2num(datesout,unout);
winds = ncout.createVariable('winds','float32',('time',));winds.setncattr('units','m/s');winds[:] = result_u2;
ncout.close()
I'm new to programming. The code above should write the nc file, but when I run the script no nc file is created. Please help.
My suggestion would be to have a look at Python syntax in general if you want to use it and the netCDF4 package. E.g. Python statements are not terminated with semicolons.
Check out the API documentation - the tutorial you find there basically covers what you're asking. Then, your code could look like:
import datetime
import netCDF4
# using "with" syntax so you don't have to do the cleanup:
with netCDF4.Dataset('output.nc', 'w', format='NETCDF3_CLASSIC') as ncout:
    # create time dimension; one entry per data point in result_u2
    nhours = 49090
    time = ncout.createDimension('time', nhours)
    # create the time variable
    times = ncout.createVariable('time', 'f8', ('time',))
    times.units = 'hours since 2012-10-20 00:00:00'
    times.calendar = 'gregorian'
    # fill time
    dates = [datetime.datetime(2012,10,20,0,0,0)+n*datetime.timedelta(hours=1) for n in range(nhours)]
    times[:] = netCDF4.date2num(dates, units=times.units, calendar=times.calendar)
    # create variable 'wind', dependent on time
    wind = ncout.createVariable('wind', 'f8', ('time',))
    wind.units = 'm/s'
    # fill with data, using your 1d array here:
    wind[:] = result_u2
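To verify the write, a quick sketch that reads the file back (assuming the output.nc created above):
import netCDF4
# reopen the file and inspect what was written
with netCDF4.Dataset('output.nc') as nc:
    print(nc.variables['wind'][:5])  # first few wind values
    t = nc.variables['time']
    print(netCDF4.num2date(t[:2], units=t.units, calendar=t.calendar))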
I have data inside an xarray.DataArray that I want to manipulate; however, I do not manage to change individual entries in the DataArray.
Example:
import numpy as np
import xarray as xr
data = np.random.rand(2,2)
times = [1998,1999]
locations = ['It','Be']
A = xr.DataArray(data, coords=[times, locations], dims=['time', 'space'])
This gives me a DataArray. Now I want to set the entry for (1998, 'It') manually to 5, but the following does not work:
A.sel(time = 1998, space = 'It').values = 5
neither does this work:
A.sel(time = 1998, space = 'It').values = np.array(5)
The data remains as it is. However, strangely, the following works:
A.sel(time = 1998).values[0] = 5
Could you please explain the logic behind this?
Xarray's assignment does not allow you to assign values to arrays using sel or isel. This is described in the documentation here. For your application, you probably want to use the .loc property:
A.loc[dict(time=1998, space='It')] = 5
It is also possible to use DataArray.where to replace values.
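For example, a small sketch of that with the array above (where replaces values wherever the condition is False):
# replace every entry at space 'It' with 5, keeping the rest
B = A.where(A.space != 'It', 5)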
I'm reading wind components (u and v) data from a netCDF file from NCEP/NCAR Reanalysis 1 to make some computations. I'm using xarray to read the file.
In one of the computations, I'd like to mask out all data below some threshold, making them equal to the missing_value attribute. I don't want to use NaNs.
However, when reading the data with xarray, the missing_value attribute - present in the variable in the netCDF file - isn't copied to the xarray.DataArray that contains the data.
I couldn't find a way to copy this attribute from the netCDF file variable with xarray.
Here is an example of what I'm trying to do:
import xarray as xr
import numpy as np
DS1 = xr.open_dataset( "u_250_850_2009012600-2900.nc" )
DS2 = xr.open_dataset( "v_250_850_2009012600-2900.nc" )
u850 = DS1.uwnd.sel( time='2009-01-28 00:00', level=850, lat=slice(10,-60), lon=slice(260,340) )
v850 = DS2.vwnd.sel( time='2009-01-28 00:00', level=850, lat=slice(10,-60), lon=slice(260,340) )
vvel850 = np.sqrt( u850*u850 + v850*v850 )
jet850 = vvel850.where( vvel850 >= 12 )
#jet850 = vvel850.where( vvel850 >= 12, vvel850, vvel850.missing_value )
The last commented line is what I want to do: use the missing_value attribute to fill where vvel850 < 12. The last uncommented line gives me NaNs, which is what I'm trying to avoid.
Is this the default behaviour of xarray when reading data from netCDF? Either way, how can I get this attribute from the file variable?
One additional piece of information: I'm using PyNGL (http://www.pyngl.ucar.edu/) to make contour plots, and it doesn't work with NaNs.
Thanks.
Mateus
The "missing_value" attribute is kept in the encoding dictionary. Other attributes like "units" or "standard_name" are kept in the attrs dictionary. For example:
v850.encoding['missing_value']
You may also be interested in a few other xarray features that may help your use case:
xr.open_dataset has a mask_and_scale keyword argument. Setting it to False turns off converting missing/fill values to NaNs.
DataArray.to_masked_array will convert a DataArray (filled with NaNs) to a numpy.MaskedArray for use in plotting programs like Matplotlib or PyNGL.
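Putting these together, a hedged sketch along the lines of the commented line in the question (the fallback fill value here is made up):
# pull the fill value from the encoding, with a hypothetical fallback
fill = v850.encoding.get('missing_value', -9999.0)
# keep wind speeds >= 12, set everything else to the fill value
jet850 = vvel850.where(vvel850 >= 12, fill)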
Essentially, I would like to open a netcdf file, read out the time stamps for individual pixels and then write the timestamps into a new file. Here is my pseudo-code:
f10 = Dataset(nc_f10, 'r')
Time_UTC_10 = np.transpose(f10.variables['TIME_UTC'][:]) #shape is [92,104]
radiance_10 = f10.variables['RADIANCE'][:] #shape is [92,104]
f10.close()
#Manipulate Radiance Information
#python separates the characters in the timestamp, so join it back up:
for i in np.arange(92):
    for j in np.arange(104):
        joined_16 = ''.join(Time_UTC_16[:,i,j])
        datetime_16[i,j] = datetime.datetime.strptime(joined_16, '%Y-%m-%dT%H:%M:%S.%fZ')
#Create and fill the netcdf
nc_out = Dataset(output_directory+nc_out_file, 'w', format='NETCDF4')
y = nc_out.createDimension('y',104)
x = nc_out.createDimension('x',92)
times = nc_out.createVariable('time', np.unicode_, ('x','y'))
O5s = nc_out.createVariable('O5s', np.float32, ('x', 'y'))
times[:] = datetime_16
O5s[:] = radiance_10
nc_out.close()
But when I try to run this, I get the following error:
TypeError: only numpy string, unicode or object arrays can be assigned to VLEN str var slices
I feel like I may be misunderstanding something important here. Any thoughts on how I can correct this code to write the timestamps to a variable in a netcdf?
I really do not know why you want to keep your time variable as strings (that is what the error message says: the values can only be strings, unicode or objects), but here is one example:
#!/usr/bin/env ipython
# ----------------------
import numpy as np
from netCDF4 import Dataset,num2date,date2num
# ----------------------
ny = 104
nx = 92
# ----------------------
radiance_10 = np.random.random((ny, nx))
datetime_16 = np.ones((ny, nx))
# ----------------------
nc_out = Dataset('test.nc', 'w', format='NETCDF4')
y = nc_out.createDimension('y',ny)
x = nc_out.createDimension('x',nx)
times = nc_out.createVariable('time', np.unicode_, ('x','y'))
O5s = nc_out.createVariable('O5s', np.float32, ('x', 'y'))
O5s[:] = radiance_10
for ii in range(ny):
    for jj in range(nx):
        times[jj, ii] = "2011-01-01 00:00:00"
nc_out.close()
Now every grid point of the time variable holds the string "2011-01-01 00:00:00".
Nevertheless, I would store time values as time elapsed since an arbitrarily selected moment; that is the most common way to keep time in a netCDF file. Let us assume our data at every point is for the moment 2014-04-11 23:59. Then I could save it as seconds since 2014-04-01. Here is the code that I would use:
import numpy as np
from netCDF4 import Dataset,num2date,date2num
import datetime
# ----------------------
ny = 104
nx = 92
# ----------------------
radiance_10 = np.random.random((ny, nx))
# ---------------------------------------------------
timevalue = datetime.datetime(2014,4,11,23,59)
time_unit_out= "seconds since 2014-04-01 00:00:00"
# ---------------------------------------------------
nc_out = Dataset('test_b.nc', 'w', format='NETCDF4')
y = nc_out.createDimension('y',ny)
x = nc_out.createDimension('x',nx)
times = nc_out.createVariable('time', np.float64, ('x','y'))
times.setncattr('units', time_unit_out)  # 'units' is the conventional attribute name
O5s = nc_out.createVariable('O5s', np.float32, ('x', 'y'))
O5s[:] = radiance_10
times[:] = date2num(timevalue, time_unit_out)
nc_out.close()
If you check the value that is now in the time variable, it is 950340, which is the number of seconds from 2014-04-01 00:00 to 2014-04-11 23:59.
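As a quick sanity check, a sketch that converts that number back to a date (using the file and units from above):
from netCDF4 import Dataset, num2date
with Dataset('test_b.nc') as nc:
    t = nc.variables['time'][0, 0]
    print(t)  # 950340.0
    print(num2date(t, 'seconds since 2014-04-01 00:00:00'))  # 2014-04-11 23:59:00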
I have a dataframe of 600 000 x/y points with date-time information, along with another field, 'status', that has extra descriptive information.
My objective is, for each record:
sum the 'status' column over records that are within a certain spatio-temporal buffer
the specific buffer is within t - 8 hours and < 100 meters
Currently I have the data in a pandas data frame.
I could loop through the rows and, for each record, subset the dates of interest, then calculate distances and restrict the selection further. However, that would still be quite slow with so many records.
THIS TAKES 4.4 hours to run.
I can see that I could create a 3 dimensional kdtree with x, y, date as epoch time. However, I am not certain how to restrict the distances properly when incorporating dates and geographic distances.
Here is some reproducible code for you guys to test on:
Import
import numpy as np
import numpy.random as npr
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
np.random.seed(111)
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long lat data
        laty = npr.normal(4815862, 5000, size=len(date))
        longx = npr.normal(687993, 5000, size=len(date))
        # status of interest
        status = [0,1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0, high=len(status))] for i in range(len(date))]
        # user pool
        user = ['sally','derik','james','bob','ryan','chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0, high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns=['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    # loop through the data indexes
    for i in range(0, len(df)):
        l = []
        # first filter the data by date to have a smaller set to compute distances for
        # create a mask to query all dates in the range for date i
        date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
        # create a mask to query all users who are not user i (themselves)
        user_mask = df['user'] != df['user'].iloc[i]
        # apply masks
        dists_to_check = df[date_mask & user_mask]
        # for point i, create the coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        # create an array of coordinates to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        # for j in the date-queried data (start at 0 so the first row isn't skipped)
        for j in range(len(dists_to_check)):
            # compute the euclidean distance between point a and each point of b
            x = np.linalg.norm(a - np.array((b[0][j], b[1][j])))
            # if the distance is within our range of interest, append the index to a list
            if x <= 100:
                l.append(j)
        try:
            # use the list of desired indexes 'l' to query a final subset of the data
            subset = dists_to_check.iloc[l]
            # summarize the column of interest, then append to the output list
            output.append(subset['status'].sum())
        except IndexError:
            output.append(0)
            # print("There were no data to add")
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way? Or should I be chasing another technique?
<3
Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
Using IPython.parallel:
from IPython.parallel import Client
cli = Client()
cli.ids
dview = cli[:]
with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd
# We also need to add the time deltas and output list into the function as
# local variables, as well as add the IPython.parallel decorator
@dview.parallel(block=True)
def work(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
    # ... the rest of the function body is unchanged from above ...
Final time: 1:17:54.910206, about 1/4 of the original time.
I would still be very interested in suggestions for small speed improvements within the body of the function.
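Since the question itself raises the kd-tree idea, here is a hedged sketch of that route using scipy's cKDTree (assuming scipy is available and that long/lat are projected coordinates in meters, as in the test data above). It queries the 100 m neighbourhood first, then filters by the time window:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def work_kdtree(df, radius=100.0):
    before = np.timedelta64(8, 'h')
    after = np.timedelta64(1, 'm')
    # build a 2-d tree on the coordinates once
    xy = df[['long', 'lat']].to_numpy()
    tree = cKDTree(xy)
    # for every point, the indexes of all points within 100 m
    neighbours = tree.query_ball_point(xy, r=radius)
    dates = df['date'].to_numpy()
    users = df['user'].to_numpy()
    status = df['status'].to_numpy()
    out = np.zeros(len(df), dtype=int)
    for i, idx in enumerate(neighbours):
        idx = np.asarray(idx, dtype=int)
        # keep neighbours inside the time window, excluding the user themselves
        in_window = (dates[idx] >= dates[i] - before) & (dates[idx] <= dates[i] + after)
        other_user = users[idx] != users[i]
        out[i] = status[idx[in_window & other_user]].sum()
    return pd.DataFrame(out)
The remaining loop only touches the (usually small) neighbour lists rather than scanning all rows per record, so it should be considerably faster.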
I have an Excel file (to be converted to CSV).
The data has 8 columns. The first two are day of the year and time respectively, while the two before the last are minimum temperature and maximum temperature. For each day I need to find the maximum and minimum of that day, subtract them, and save the value for that day.
Two problems I ran into: how do I parse 24 lines at a time (there are no missing data lines!), and how do I find the maximum and minimum in each batch?
I have 6312 lines = 24 hr * 263 days
So, to iterate through the lines:
import numpy as np
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
dt_per_all_days = []
for i in range(0, 6312, 1):
    # I get stuck here: how to limit the iteration to 24 lines at a time.
    # If I can do that, I think I can get the rest done.
    min_d = []
    max_d = []
    min_d.append(up_air_min[i])
    max_d.append(up_air_max[i])
    max_per_day = max(max_d)
    min_per_day = min(min_d)
    dt_d = max_per_day - min_per_day
    dt_per_all_days.append(dt_d)
    del(min_d)
    del(max_d)
    # move to the next batch of 24 lines....
Use the NumPy, Luke, and avoid for-loops.
Once you have the up_air_min and up_air_max numpy arrays, you can easily do what you want by using numpy element-wise functions.
First, create 2D arrays with 263 rows (one per day) and 24 columns, like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use the np.min and np.max functions along axis 1:
min_temperature = np.min(min_matrix, axis=1)
max_temperature = np.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
dt is an array with the needed values. Let's save it to foo.csv, pairing each day with its range (day_year has one value per hour, so take every 24th entry):
np.savetxt('foo.csv', np.column_stack((day_year[::24], dt)), delimiter=',')
And the final code looks like this:
import numpy as np
# This I got from your answer.
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Reshape the arrays into matrices with 263 rows (one per day) and 24 values in every row.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = np.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save the result in csv, one row per day.
np.savetxt('foo.csv', np.column_stack((day_year[::24], dt)), delimiter=',')
A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that generates 263 lists, each holding 24 values, which is a bit easier to process.
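For instance, a minimal sketch of that generator idea (assuming day-of-year is the first field of each row):
def days(rows):
    # gather consecutive rows, yielding a batch whenever the day changes
    batch = []
    current_day = None
    for row in rows:
        day = row[0]  # assumes day-of-year is the first field
        if current_day is not None and day != current_day:
            yield batch
            batch = []
        current_day = day
        batch.append(row)
    if batch:
        yield batch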
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
Here's a somewhat contrived example of how you could chunk things by 24 lines at a time.
from io import StringIO
from random import random as r
import numpy as np

s = StringIO()
for x in range(0, 10000):
    s.write('%f,%f,%f\n' % (r(), r()*10, r()*100))
s.seek(0)
data = np.genfromtxt(s, dtype=None, names=['pitch','yaw','thrust'], delimiter=',')
for x in range(0, len(data), 24):
    print('Acting on hours %d through %d' % (x, x+24))
    one_day = data[x:x+24]
    minimum_yaw = min(one_day['yaw'])
    max_yaw = max(one_day['yaw'])
    print('min', minimum_yaw, 'max', max_yaw, 'one_day', one_day['yaw'])