Missing_value attribute is lost reading data from a netCDF file? - python

I'm reading wind components (u and v) data from a netCDF file from NCEP/NCAR Reanalysis 1 to make some computations. I'm using xarray to read the file.
In one of the computations, I'd like to mask out all data below some threshold by setting them equal to the missing_value attribute. I don't want to use NaNs.
However, when reading the data with xarray, the missing_value attribute - present in the variable in the netCDF file - isn't copied to xarray.DataArray that contained the data.
I couldn't find a way to copy this attribute from netCDF file variable, with xarray.
Here is an example of what I'm trying to do:
import xarray as xr
import numpy as np
DS1 = xr.open_dataset( "u_250_850_2009012600-2900.nc" )
DS2 = xr.open_dataset( "v_250_850_2009012600-2900.nc" )
u850 = DS1.uwnd.sel( time='2009-01-28 00:00', level=850, lat=slice(10,-60), lon=slice(260,340) )
v850 = DS2.vwnd.sel( time='2009-01-28 00:00', level=850, lat=slice(10,-60), lon=slice(260,340) )
vvel850 = np.sqrt( u850*u850 + v850*v850 )
jet850 = vvel850.where( vvel850 >= 12 )
#jet850 = vvel850.where( vvel850 >= 12, vvel850, vvel850.missing_value )
The last commented line is what I want to do: use the missing_value attribute to fill where vvel850 < 12. The last uncommented line gives me NaNs, which is what I'm trying to avoid.
Is this the default behaviour of xarray when reading data from netCDF? Either way, how can I get this attribute from the file variable?
Additional information: I'm using PyNGL (http://www.pyngl.ucar.edu/) to make contour plots, and it doesn't work with NaNs.
Thanks.
Mateus

The "missing_value" attribute is kept in the encoding dictionary. Other attributes like "units" or "standard_name" are kept in the attrs dictionary. For example:
v850.encoding['missing_value']
You may also be interested in a few other xarray features that may help your use case:
xr.open_dataset has a mask_and_scale keyword argument. Setting it to False turns off the conversion of missing/fill values to NaNs.
DataArray.to_masked_array will convert a DataArray (filled with NaNs) to a numpy.ma.MaskedArray for use in plotting programs like Matplotlib or PyNGL.
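As a rough sketch of how these pieces could fit together, the snippet below builds a small in-memory DataArray instead of reading the reanalysis file. The array values and the missing_value number are made up for illustration, and the encoding entry is set by hand here only because no file is opened; with a real file it would be read from u850.encoding or v850.encoding as shown above.

```python
import numpy as np
import xarray as xr

# Synthetic wind speed standing in for vvel850 (values are invented).
speed = xr.DataArray(
    np.array([[5.0, 12.0], [20.0, 8.0]]),
    dims=("lat", "lon"),
    name="vvel850",
)

# Assumption: with a real file this entry comes from the file variable's
# encoding; it is set by hand here with a hypothetical value.
speed.encoding["missing_value"] = -9.96921e36
fill = speed.encoding["missing_value"]

# Option 1: keep values >= 12 and write the fill value everywhere else.
jet = speed.where(speed >= 12, other=fill)

# Option 2: let where() insert NaNs, then convert to a NumPy masked array.
jet_masked = speed.where(speed >= 12).to_masked_array()

print(jet.values)
print(jet_masked)
```

Note that the result of arithmetic such as np.sqrt(u850*u850 + v850*v850) may not carry the encoding of its inputs, so it is probably safest to take the fill value from u850 or v850 directly.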

Related

pytables and pandas string padding question

I've created a dataset using hdf5cpp library with a fixed size string (requirement). However when loading with pytables or pandas the strings are always represented like:
b'test\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff
The string value of 'test' with the padding after it. Does anyone know a way to suppress or not show this padding data? I really just want 'test' shown. I realise this may be correct behaviour.
My hdf5cpp setup for strings:
strType = H5Tcopy(H5T_C_S1);
status = H5Tset_size(strType, 36);
H5Tset_strpad(strType, H5T_STR_NULLTERM);
I can't help with your C code. It is possible to work with padded strings in PyTables. I can read data written by a C application that creates a struct array of mixed types, including padded strings. (Note: there was an issue related to copying a NumPy struct array with padding. It was fixed in 3.5.0. Read this for details: PyTables GitHub Pull 720.)
Here is an example that shows proper string handling with a file created by PyTables. Maybe it will help you investigate your problem. Checking the dataset's properties would be a good start.
import tables as tb
import numpy as np
arr = np.empty((10), 'S10')
arr[0]='test'
arr[1]='one'
arr[2]='two'
arr[3]='three'
with tb.File('SO_63184571.h5','w') as h5f:
    ds = h5f.create_array('/', 'testdata', obj=arr)
    print (ds.atom)
    for i in range(4):
        print (ds[i])
        print (ds[i].decode('utf-8'))
The example below demonstrates a compound dataset with an int and a fixed-length string. This is called a Table in PyTables (Arrays always contain homogeneous values). This can be done in a number of ways. I show the 2 methods I prefer:
1. Create a record array and reference it with the description= or obj= parameter. This is useful when you already have all of your data AND it will fit in memory.
2. Create a record array dtype and reference it with the description= parameter. Then add the data with the .append() method. This is useful when all of your data will NOT fit in memory, OR you need to add data to an existing table.
Code below:
recarr_dtype = np.dtype(
    { 'names': ['ints', 'strs' ],
      'formats': [int, 'S10'] } )
a = np.arange(5)
b = np.array(['a', 'b', 'c', 'd', 'e'])
recarr = np.rec.fromarrays((a, b), dtype=recarr_dtype)
with tb.File('SO_63184571.h5','w') as h5f:
    ds1 = h5f.create_table('/', 'compound_data1', description=recarr)
    for i in range(5):
        print (ds1[i]['ints'], ds1[i]['strs'].decode('utf-8'))
    ds2 = h5f.create_table('/', 'compound_data2', description=recarr_dtype)
    ds2.append(recarr)
    for i in range(5):
        print (ds2[i]['ints'], ds2[i]['strs'].decode('utf-8'))
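If the padding bytes still show up after reading, one possible workaround on the Python side is to cut each value at the first NUL byte before decoding. This is only a sketch: strip_padding is a hypothetical helper name, and it assumes the real content never contains a NUL itself (which should hold for null-terminated strings).

```python
import numpy as np

def strip_padding(raw: bytes) -> str:
    """Keep only the bytes before the first NUL, then decode them."""
    return raw.split(b"\x00", 1)[0].decode("utf-8")

# The value from the question: 'test', a NUL terminator, then 0xff padding.
raw = b"test\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff"
print(strip_padding(raw))

# The same idea applied to every element of a fixed-width bytes array.
padded = np.array([raw, b"one\x00\xff\xff"], dtype="S16")
clean = [strip_padding(item) for item in padded.tolist()]
print(clean)
```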

converting a 1d array to netcdf

I have a 1d array which is a time series hourly dataset encompassing 49090 points which needs to be converted to netcdf format.
In the code below, result_u2 is a 1d array which stores result from a for loop. It has 49090 datapoints.
nhours = 49091;#one added to no of datapoints
unout.units = 'hours since 2012-10-20 00:00:00'
unout.calendar = 'gregorian'
ncout = Dataset('output.nc','w','NETCDF3');
ncout.createDimension('time',nhours);
datesout = [datetime.datetime(2012,10,20,0,0,0)+n*timedelta(hours=1) for n in range(nhours)]; # create datevalues
timevar = ncout.createVariable('time','float64',('time'));timevar.setncattr('units',unout);timevar[:]=date2num(datesout,unout);
winds = ncout.createVariable('winds','float32',('time',));winds.setncattr('units','m/s');winds[:] = result_u2;
ncout.close()
I'm new to programming. The code I tried above should be able to write the nc file but while running the script no nc file is being created. Please help.
My suggestion would be to have a look at Python syntax in general if you want to use it with the netCDF4 package. For example, Python statements do not need a trailing semicolon.
Check out the API documentation - the tutorial you find there basically covers what you're asking. Then, your code could look like
import datetime
import netCDF4
# using "with" syntax so you don't have to do the cleanup:
with netCDF4.Dataset('output.nc', 'w', format='NETCDF3_CLASSIC') as ncout:
    # create time dimension
    # (nhours should equal len(result_u2), otherwise the data will not fit the variable)
    nhours = 49091
    time = ncout.createDimension('time', nhours)
    # create the time variable
    times = ncout.createVariable('time', 'f8', ('time',))
    times.units = 'hours since 2012-10-20 00:00:00'
    times.calendar = 'gregorian'
    # fill time
    dates = [datetime.datetime(2012,10,20,0,0,0)+n*datetime.timedelta(hours=1) for n in range(nhours)]
    times[:] = netCDF4.date2num(dates, units=times.units, calendar=times.calendar)
    # create variable 'wind', dependent on time
    wind = ncout.createVariable('wind', 'f8', ('time',))
    wind.units = 'm/s'
    # fill with data, using your 1d array here:
    wind[:] = result_u2
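To make the time axis less of a black box, here is a small standard-library sketch of the values it should hold. For units of 'hours since 2012-10-20 00:00:00' and hourly timestamps, the numbers written to the time variable are simply the hour offsets 0, 1, 2, ... from the reference time. The result_u2 list below is a zero-filled stand-in for the real data, and the dimension length is derived from it so that both variables have the same size.

```python
import datetime

# Stand-in for the asker's 1-D result array (49090 hourly values, zeros here).
result_u2 = [0.0] * 49090

reference = datetime.datetime(2012, 10, 20, 0, 0, 0)

# One timestamp per data point, so the time dimension matches the data length.
nhours = len(result_u2)
dates = [reference + datetime.timedelta(hours=n) for n in range(nhours)]

# Hours elapsed since the reference time for each timestamp.
offsets = [(d - reference).total_seconds() / 3600.0 for d in dates]

print(nhours, offsets[:3], dates[-1])
```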

Multiplication of values in a dataframe with scalars

I am working on a problem where I want to convert X and Y pixel values to physical coordinates. I have a large folder containing many CSV files. I load each one, pass it to my function, compute the coordinates, overwrite the columns, and return the data frame. I then overwrite the file outside the function. I have a formula that does this correctly, but I am having some problems implementing it in Python.
Each CSV file has many columns. The columns I am interested in are Latitude (degree), Longitude (degree), XPOS and YPOS. The former two are blank, and the latter two hold the data I need to fill in the former two.
import pandas as pd
import glob
max_long = float(XXXX)
max_lat = float(XXXX)
min_long = float(XXXX)
min_lat = float(XXXX)
hoi = int(909)
woi = int(1070)
def pixel2coor (filepath, max_long, max_lat, min_lat, min_long, hoi, woi):
    data = pd.read_csv(filepath) #reading Csv
    data2 = data.set_index("Log File") #Setting index of dataframe with first column
    data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long) #Computing Longitude & Overwriting
    data2.loc[data2['Latitude (degree)']] = (((max_lat-min_lat)/woi)*[data2[:,'YPOS']]+min_lat) #Computing Latitude & Overwriting
    return data2 #Return dataframe
filenames = sorted(glob.glob('*.csv'))
for file in filenames:
    df = pixel2coor (file, max_long, max_lat, min_lat, min_long, hoi, woi) #Calling pixel 2 coor function and passing a csv file in every iteration
    df.to_csv(file) #overwriting the file with the dataframe
I am getting the following error.
TypeError: '(slice(None, None, None), 'XPOS')' is an invalid key
It looks to me like your syntax is off. In the following line:
data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long) #Computing Longitude & Overwriting
The left side of your equation appears to be referring to a column, but you have it in the 'row' section of .loc slicer. So it should be:
data2.loc[:, 'Longitude (degree)']
On the right side of your equation, you've forgotten .loc or need to drop the ':,' so two possible solutions:
(((max_long-min_long)/hoi)*data2.loc[:,'XPOS']+min_long)
(((max_long-min_long)/hoi)*data2['XPOS']+min_long)
Also, I would add that your brackets on the right side should be more explicit. It's a bit unclear how you want scalars to act on the series. Do you want to add min_long first? Or multiply by ((max_long-min_long)/hoi) first? Without extra parentheses, Python multiplies before it adds.
Your final row might look like this, forcing addition first as an example:
data2.loc[:, 'Longitude (degree)'] = ((max_long-min_long)/hoi)*(data2.loc[:,'XPOS']+min_long)
This applies to your next line as well. You may get more errors after you fix this.
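If the intent is the usual linear mapping from pixel position to coordinate (scale first, then shift by the minimum), the two assignments could look like the sketch below. The bounds are hypothetical placeholders for the XXXX values in the question, and the tiny DataFrame stands in for one CSV file; the pairing of hoi with XPOS and woi with YPOS is kept exactly as in the question.

```python
import pandas as pd

# Hypothetical bounds standing in for the XXXX values in the question.
min_long, max_long = 10.0, 20.0
min_lat, max_lat = 40.0, 50.0
hoi, woi = 909, 1070

# Tiny stand-in for one CSV file: blank coordinate columns plus pixel positions.
data2 = pd.DataFrame({
    "Longitude (degree)": [float("nan"), float("nan")],
    "Latitude (degree)": [float("nan"), float("nan")],
    "XPOS": [0, 909],
    "YPOS": [0, 1070],
})

# Scale the pixel position, then shift by the minimum (multiply first, add second).
data2["Longitude (degree)"] = (max_long - min_long) / hoi * data2["XPOS"] + min_long
data2["Latitude (degree)"] = (max_lat - min_lat) / woi * data2["YPOS"] + min_lat

print(data2)
```

With this form, a pixel position of 0 maps to the minimum coordinate and the largest pixel position maps to the maximum.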

Store ndarray in a PyTable (and how to define the Col()-type)

TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy-float32-array into it. (How) can I store a numpy-array (float32) in the Column of a PyTables table?
I'm new to PyTables - following a recommendation of TFtables (a lib to use HDF5 in TensorFlow), I'm using it to store all my HDF5 data (currently distributed in batches in several files, each with three datasets) within a table in a single HDF5 file. The datasets are
'data' : (n_elements, 1024, 1024, 4)#float32
'label' : (n_elements, 1024, 1024, 1)#uint8
'weights' : (n_elements, 1024, 1024, 1)#float32
where the n_elements are distributed over several files that I want to merge into one now (to allow unordered access).
So when I built my table, I figured each dataset represents a column. I built everything in a generic way that allows doing this for an arbitrary number of datasets:
# gets dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights']
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)
# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict is (conform with the doc): { 'col_name' : Col()-class-descendent }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}
# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(val_keys)))
The dtypes that I get for each dset (read with h5py) are obviously the ones of the numpy arrays (ndarray) that reading the dset returns: float32 or uint8. So the Col()-types are Float32Col and UInt8Col. I naively assumed that I can now write a float32-array into this col, but filling in data with:
dummy_data = np.zeros([1024,1024,3], np.float32) # normally data read from other files
sample = table.row
sample['data'] = dummy_data
results in TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data``. So now I feel stupid for assuming I'd be able to write an array in there, BUT there are no "ArrayCol()" types offered, neither are there any hints in the PyTables doc as to whether or how it is possible to write an array into a column. How do I do this?
There are "shape" arguments in the Col() class and its descendants, so it should be possible, otherwise what are these for?!
I know it's a bit late, but I think the answer to your problem lies in the shape parameter for Float32Col.
Here's how it's used in the documentation:
from tables import *
from numpy import *
# Describe a particle record
class Particle(IsDescription):
    name = StringCol(itemsize=16) # 16-character string
    lati = Int32Col() # integer
    longi = Int32Col() # integer
    pressure = Float32Col(shape=(2,3)) # array of floats (single-precision)
    temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)
# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root
# Create the groups:
for groupname in ("Particles", "Events"):
    group = fileh.create_group(root, groupname)
# Now, create and fill the tables in Particles group
gparticles = root.Particles
# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
    # Create a table
    table = fileh.create_table("/Particles", tablename, Particle, "Particles: "+tablename)
    # Get the record object associated with the table:
    particle = table.row
    # Fill the table with 257 particles
    for i in range(257):  # the tutorial uses xrange, which only exists in Python 2
        # First, assign the values to the Particle record
        particle['name'] = 'Particle: %6d' % (i)
        particle['lati'] = i
        particle['longi'] = 10 - i
        ########### Detectable errors start here. Play with them!
        particle['pressure'] = array(i*arange(2*3)).reshape((2,4)) # Incorrect
        #particle['pressure'] = array(i*arange(2*3)).reshape((2,3)) # Correct
        ########### End of errors
        particle['temperature'] = (i**2) # Broadcasting
        # This injects the Record values
        particle.append()
    # Flush the table buffers
    table.flush()
Here's the link to the part of the documentation I'm referring to
https://www.pytables.org/usersguide/tutorials.html
Edit: I just saw that the tables.Col.from_type(type, shape) allows using the precision of a type (float32 instead of float alone). The rest stays the same (takes a string and shape).
The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col-Type that supports ndarrays. What "kind" is and how to use this isn't documented anywhere I found; however with trial and error I found that allowed "kind"s are strings of basic datatypes. I.e.: 'float', 'uint', ... without the precision (NOT 'float64')
Since I get numpy.dtypes from h5py reading a dataset (dset.dtype), these have to be cast to str and the precision needs to be removed.
In the end the relevant lines look like this:
# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)
# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing
# the dtype w/o precision, e.g. 'float' or 'uint'
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]
# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
Then writing data into the table finally works as suggested:
sample = table.row
sample[key] = my_array
Since it all felt a bit "hacky" and isn't documented well, I am still wondering whether this is really an intended use of PyTables, and I will leave this question open for a bit to see if someone knows more about this...
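For comparison, here is a minimal sketch that declares the array-valued columns directly with the shape argument of Float32Col and UInt8Col, as suggested in the other answer, and writes one row. The file name, the column names and the small 8x8 shapes are placeholders chosen for the example (the question uses 1024x1024), and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import tables

# Each column holds one array per row; shape is the per-row element shape.
description = {
    "data": tables.Float32Col(shape=(8, 8, 4)),
    "label": tables.UInt8Col(shape=(8, 8, 1)),
}

path = os.path.join(tempfile.mkdtemp(), "merged_demo.h5")  # placeholder file name
with tables.open_file(path, mode="w", title="merged") as h5file:
    table = h5file.create_table("/", "data_table", description)

    # Assign whole arrays to the row; their shapes must match the column shapes.
    sample = table.row
    sample["data"] = np.ones((8, 8, 4), dtype=np.float32)
    sample["label"] = np.zeros((8, 8, 1), dtype=np.uint8)
    sample.append()
    table.flush()

    # Read the first row back to check what was stored.
    stored = table[0]["data"]
    n_rows = table.nrows

print(n_rows, stored.shape, stored.dtype)
```

The TypeError in the question is consistent with a column that was created without a shape, in which case each row only accepts a scalar.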

Gurobi in Python: best way to read csv file

I'm learning how to solve combinatorial optimization problems in Gurobi using Python. I would like to know the best way to read a csv file so that I can use the data as model parameters. I'm using 'genfromtxt' to read the csv file, but I'm having difficulties using it for constraint construction (Gurobi doesn't support this type - see error).
Here are my code and the error message. my_data consists of 4 columns: node index, x coordinate, y coordinate and maximum degree.
from gurobipy import *
from numpy import genfromtxt
import math
# Read data from csv file
my_data = genfromtxt('prob25.csv', delimiter=',')
# Number of vertices
n = len(my_data)
# Function to calculate euclidean distances
dist = {(i,j) :
math.sqrt(sum((my_data[i][k]-my_data[j][k])**2 for k in [1,2]))
for i in range(n) for j in range(i)}
# Create a new model
m = Model("dcstNarula")
# Create variables
vars = m.addVars(dist.keys(), obj=dist, vtype=GRB.BINARY, name='e')
for i,j in vars.keys():
    vars[j,i] = vars[i,j] # edge in opposite direction
m.update()
# Add degree-b constraint
m.addConstrs((vars.sum('*',j) <= my_data[:,3]
for i in range(n)), name='degree')
GurobiError: Unsupported type (<type 'numpy.ndarray'>) for LinExpr addition argument
First two lines of data
1,19.007,35.75,1
2,4.4447,6.0735,2
Actually it was a problem of indexing instead of data type. In the code:
# Add degree-b constraint
m.addConstrs((vars.sum('*',j) <= my_data[:,3]
for i in range(n)), name='degree')
The constraint should use vars.sum('*',i) instead of vars.sum('*',j), and my_data[i,3] instead of my_data[:,3].
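The difference between the two index expressions can be seen with NumPy alone. In this sketch the array is built from the two sample rows shown in the question instead of being read from prob25.csv; my_data[:, 3] is the whole column as an ndarray (the type named in the error message), while my_data[i, 3] is a single number that can serve as the right-hand side for node i.

```python
import numpy as np

# The two sample rows from the question: node index, x, y, maximum degree.
my_data = np.array([
    [1, 19.007, 35.75, 1],
    [2, 4.4447, 6.0735, 2],
])
n = len(my_data)

# Whole column: an ndarray, which is what triggered the GurobiError.
whole_column = my_data[:, 3]

# One entry per node: plain scalars, one right-hand side per constraint.
per_node = [float(my_data[i, 3]) for i in range(n)]

print(type(whole_column).__name__, per_node)
```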
Even though this question is answered, for future visitors who are looking for good ways to read a csv file, pandas must be mentioned:
import pandas as pd
df = pd.read_csv('prob25.csv', header=None, index_col=0, names=['x', 'y', 'degree'])
df
         x        y  degree
1  19.0070  35.7500       1
2   4.4447   6.0735       2
