python mask netcdf data using shapefile - python

I am using the following packages:
import pandas as pd
import numpy as np
import xarray as xr
import geopandas as gpd
I have the following objects storing data:
print(precip_da)
Out[]:
<xarray.DataArray 'precip' (time: 13665, latitude: 200, longitude: 220)>
[601260000 values with dtype=float32]
Coordinates:
* longitude (longitude) float32 35.024994 35.074997 35.125 35.175003 ...
* latitude (latitude) float32 5.0249977 5.074997 5.125 5.174999 ...
* time (time) datetime64[ns] 1981-01-01 1981-01-02 1981-01-03 ...
Attributes:
standard_name: convective precipitation rate
long_name: Climate Hazards group InfraRed Precipitation with St...
units: mm/day
time_step: day
geostatial_lat_min: -50.0
geostatial_lat_max: 50.0
geostatial_lon_min: -180.0
geostatial_lon_max: 180.0
This looks as follows:
precip_da.mean(dim="time").plot()
I have my shapefile as a geopandas.GeoDataFrame which represents a polygon.
awash = gpd.read_file(shp_dir)
awash
Out[]:
OID_ Name FolderPath SymbolID AltMode Base Clamped Extruded Snippet PopupInfo Shape_Leng Shape_Area geometry
0 0 Awash_Basin Awash_Basin.kml 0 0 0.0 -1 0 None None 30.180944 9.411263 POLYGON Z ((41.78939511000004 11.5539922500000...
Which looks as follows:
awash.plot()
Plotted one on top of the other they look like this:
ax = awash.plot(alpha=0.2, color='black')
precip_da.mean(dim="time").plot(ax=ax,zorder=-1)
My question is, how do I mask the xarray.DataArray by checking if the lat-lon points lie INSIDE the shapefile stored as a geopandas.GeoDataFrame?
 So I want ONLY the precipitation values (mm/day) which fall INSIDE that shapefile.
I want to do something like the following:
masked_precip = precip_da.within(awash)
OR
masked_precip = precip_da.loc[precip_da.isin(awash)]
EDIT 1
I have thought about using the rasterio.mask module but I don't know what format the input data needs to be. It sounds as if it does exactly the right thing:
"Creates a masked or filled array using input shapes. Pixels are masked or set to nodata outside the input shapes"
Reposted from GIS Stack Exchange here

This is my current working solution, adapted from this gist, which is Stephan Hoyer's answer to a GitHub issue for the xarray project.
In addition to the packages above, both affine and rasterio are required:
from rasterio import features
from affine import Affine
def transform_from_latlon(lat, lon):
    """ input 1D array of lat / lon and output an Affine transformation
    """
    lat = np.asarray(lat)
    lon = np.asarray(lon)
    trans = Affine.translation(lon[0], lat[0])
    scale = Affine.scale(lon[1] - lon[0], lat[1] - lat[0])
    return trans * scale
def rasterize(shapes, coords, latitude='latitude', longitude='longitude',
              fill=np.nan, **kwargs):
    """Rasterize a list of (geometry, fill_value) tuples onto the given
    xray coordinates. This only works for 1d latitude and longitude
    arrays.

    usage:
    -----
    1. read shapefile to geopandas.GeoDataFrame
          `states = gpd.read_file(shp_dir+shp_file)`
    2. encode the different shapefiles that capture those lat-lons as different
        numbers i.e. 0.0, 1.0 ... and otherwise np.nan
          `shapes = (zip(states.geometry, range(len(states))))`
    3. Assign this to a new coord in your original xarray.DataArray
          `ds['states'] = rasterize(shapes, ds.coords, longitude='X', latitude='Y')`

    arguments:
    ---------
    : **kwargs (dict): passed to `rasterio.features.rasterize` function

    attrs:
    -----
    :transform (affine.Affine): how to translate from latlon to ...?
    :raster (numpy.ndarray): use rasterio.features.rasterize fill the values
      outside the .shp file with np.nan
    :spatial_coords (dict): dictionary of {"X":xr.DataArray, "Y":xr.DataArray()}
      with "X", "Y" as keys, and xr.DataArray as values

    returns:
    -------
    :(xr.DataArray): DataArray with `values` of nan for points outside shapefile
      and coords `Y` = latitude, 'X' = longitude.
    """
    transform = transform_from_latlon(coords[latitude], coords[longitude])
    out_shape = (len(coords[latitude]), len(coords[longitude]))
    raster = features.rasterize(shapes, out_shape=out_shape,
                                fill=fill, transform=transform,
                                dtype=float, **kwargs)
    spatial_coords = {latitude: coords[latitude], longitude: coords[longitude]}
    return xr.DataArray(raster, coords=spatial_coords, dims=(latitude, longitude))
def add_shape_coord_from_data_array(xr_da, shp_path, coord_name):
    """ Create a new coord for the xr_da indicating whether or not it
    is inside the shapefile

    Creates a new coord - "coord_name" which will have integer values
    used to subset xr_da for plotting / analysis.

    Usage:
    -----
    precip_da = add_shape_coord_from_data_array(precip_da, "awash.shp", "awash")
    awash_da = precip_da.where(precip_da.awash==0, other=np.nan)
    """
    # 1. read in shapefile
    shp_gpd = gpd.read_file(shp_path)
    # 2. create a list of tuples (shapely.geometry, id)
    #    this allows for many different polygons within a .shp file (e.g. States of US)
    shapes = [(shape, n) for n, shape in enumerate(shp_gpd.geometry)]
    # 3. create a new coord in the xr_da which will be set to the id in `shapes`
    xr_da[coord_name] = rasterize(shapes, xr_da.coords,
                                  longitude='longitude', latitude='latitude')
    return xr_da
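The transform built by transform_from_latlon is an affine map from pixel indices (col, row) to (lon, lat). The sketch below reproduces that mapping with plain numpy so the arithmetic is visible; affine_coeffs and pixel_to_lonlat are hypothetical helper names, and the grid values are synthetic ones modelled on the coordinates printed in the question. It also illustrates one point worth checking: rasterio is generally understood to treat the transform origin as the outer corner of the first pixel, so if the coordinate arrays hold cell centres (an assumption here), shifting the origin by half a cell keeps the centres aligned.

```python
import numpy as np

def affine_coeffs(lat, lon, centre_coords=False):
    """Return (a, c, e, f) such that lon = a*col + c and lat = e*row + f.

    With centre_coords=False this is the same translation * scale
    composition as transform_from_latlon. With centre_coords=True the
    origin is moved by half a cell (assumes lat/lon hold cell centres).
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    dlon = lon[1] - lon[0]
    dlat = lat[1] - lat[0]
    lon0, lat0 = lon[0], lat[0]
    if centre_coords:
        lon0 -= dlon / 2
        lat0 -= dlat / 2
    return dlon, lon0, dlat, lat0

def pixel_to_lonlat(coeffs, col, row):
    """Apply the affine map to (possibly fractional) pixel indices."""
    a, c, e, f = coeffs
    return a * col + c, e * row + f

# Synthetic 0.05 degree grid, similar to the one in the question
lon = 35.025 + 0.05 * np.arange(220)
lat = 5.025 + 0.05 * np.arange(200)

coeffs = affine_coeffs(lat, lon)
corner = pixel_to_lonlat(coeffs, 0, 0)        # origin sits on the first cell centre

shifted = affine_coeffs(lat, lon, centre_coords=True)
centre = pixel_to_lonlat(shifted, 0.5, 0.5)   # pixel centre now maps to the first cell centre
```

Whether the half-cell shift matters depends on the polygon size relative to the grid spacing; for a basin covering many cells the difference is limited to cells along the boundary.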
It can be implemented as follows:
precip_da = add_shape_coord_from_data_array(precip_da, shp_dir, "awash")
awash_da = precip_da.where(precip_da.awash==0, other=np.nan)
awash_da.mean(dim="time").plot()
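To make explicit what the where step does, here is a minimal numpy-only sketch on synthetic data (the array sizes and the rectangular "basin" are invented for illustration). A 2D raster of region ids, NaN outside the shape, is broadcast against a (time, lat, lon) array, and an area-mean time series is then computed over the cells that survive the mask.

```python
import numpy as np

rng = np.random.default_rng(0)
precip = rng.random((4, 3, 5)).astype("float32")   # (time, lat, lon), synthetic

# Region raster as produced by a rasterize step: id 0.0 inside, NaN outside
region = np.full((3, 5), np.nan)
region[1:, 1:4] = 0.0

# NaN == 0 is False, so this is True only inside the shape
inside = region == 0

# The (lat, lon) mask broadcasts across the leading time axis
masked = np.where(inside, precip, np.nan)

# Mean over the basin for every time step, ignoring masked cells
basin_mean = np.nanmean(masked, axis=(1, 2))
```

With xarray the broadcasting is done by dimension name rather than by position, which is why the 2D awash coordinate can be compared and applied directly to the 3D DataArray.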

You should have a look at the following packages:
salem and the region of interest example
regionmask
Both may get you to what you want.

Related

Converting from 2D lat/lon matrix into 1D lat/lon array

I am working with a netCDF file that has no coordinates. My lat/lon values are stored in variables in the form of matrices, lat(y, x) and lon(y, x). My goal is to extract 1D lat and lon arrays and assign them as coordinates, since these must be 1D arrays.
Here is what the dataset originally looks like:
<xarray.Dataset>
Dimensions: (y: 10980, x: 10980)
Dimensions without coordinates: y, x
Data variables: (12/20)
lon (y, x) float32 ...
lat (y, x) float32 ...
For example, the lat variable looks like this:
<xarray.DataArray 'lat' (y: 10980, x: 10980)>
array([[41.52681 , 41.52681 , 41.526814, ..., 41.54671 , 41.54671 , 41.546715],
[41.52672 , 41.526722, 41.526722, ..., 41.54662 , 41.546623, 41.546623],
[41.52663 , 41.52663 , 41.526634, ..., 41.54653 , 41.54653 , 41.54653 ],
...,
[40.538834, 40.538837, 40.538837, ..., 40.55806 , 40.55806 , 40.558064],
[40.538746, 40.538746, 40.53875 , ..., 40.55797 , 40.557972, 40.557972],
[40.538654, 40.53866 , 40.53866 , ..., 40.55788 , 40.55788 , 40.55788 ]],
dtype=float32)
Dimensions without coordinates: y, x
Attributes:
parameter: lat
standard_name: latitude
long_name: latitude
units: degree_north
So in order to convert both variables into 1D array I do the following:
#First I open the dataset
file_to_input = 'landsat.nc'
nc1 = xr.open_dataset(file_to_input)
#Then I take the y axis from lat:
lati = nc1.lat[:,0]
#And the x axis from lon:
long = nc1.lon[0,:]
#To then assign them as 1D array to the dataset:
nc1 = nc1.assign_coords({'x':long,'y':lati})
nc1 = nc1.rio.set_spatial_dims('x', 'y')
#I set the proper CRS for the variable to export (EPSG: 32631):
nc1var = nc1['ndci'].rio.set_crs("epsg:32631")
#And then I export it as a geotiff:
nc1var.rio.to_raster('ndci.tiff')
So far so good. The problem comes when I visualize the exported GeoTIFF: it is SLIGHTLY shifted. In the image below you can see a small downward shift of the GeoTIFF with respect to the basemap. I tried this method with other, correctly georeferenced TIFFs and the same shift happens, so I assume it has to do with the way I am converting the 2D lat/lon grids to 1D arrays.
I think it can be achieved with a pyproj Transformer or something like that, but I have no idea how to use that to turn a 2D lat/lon grid into 1D arrays. Any help would be much appreciated!
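One way to test the suspicion about the 2D to 1D conversion is to check whether the grid is rectilinear in lat/lon at all. In the lat array printed above, the first row runs from 41.52681 to 41.546715, so latitude changes along x, which suggests a single column of lat cannot describe every pixel. The sketch below (is_rectilinear is a hypothetical helper, and both test grids are synthetic) measures how much lat varies along each row and lon along each column.

```python
import numpy as np

def is_rectilinear(lat2d, lon2d, tol=1e-6):
    """Check whether lat depends only on y and lon only on x.

    Returns (flag, lat_spread, lon_spread), where the spreads are the
    largest variation of lat along any row and of lon along any column.
    """
    lat2d = np.asarray(lat2d, dtype=float)
    lon2d = np.asarray(lon2d, dtype=float)
    lat_spread = np.ptp(lat2d, axis=1).max()   # variation along x within a row
    lon_spread = np.ptp(lon2d, axis=0).max()   # variation along y within a column
    return bool(lat_spread <= tol and lon_spread <= tol), lat_spread, lon_spread

# A regular grid: 1D vectors describe it exactly
lon2d, lat2d = np.meshgrid(np.linspace(2.0, 3.0, 50), np.linspace(41.5, 40.5, 40))
regular_ok, _, _ = is_rectilinear(lat2d, lon2d)

# The same grid rotated by one degree about its centre: no longer separable
theta = np.deg2rad(1.0)
x = lon2d - lon2d.mean()
y = lat2d - lat2d.mean()
lon_rot = lon2d.mean() + x * np.cos(theta) - y * np.sin(theta)
lat_rot = lat2d.mean() + x * np.sin(theta) + y * np.cos(theta)
rotated_ok, lat_spread, lon_spread = is_rectilinear(lat_rot, lon_rot)
```

If the check fails on the real file, taking lat[:, 0] and lon[0, :] assigns coordinates that are only exact along one edge of the image, and the error grows away from that edge.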
UPDATE
Download the dataset here
(.rar - 241MB, .nc file - 1.34GB)
I propose the following code to put the data on a regular grid using nearest-neighbour interpolation. Other interpolation methods could be used; this is just the easiest one and it changes the original data the least.
Note that I also reduce the original resolution for memory efficiency; remove that part of the script otherwise.
Here is the code:
#!/usr/bin/env ipython
# ---------------------
import numpy as np
from scipy.interpolate import griddata
from pylab import pcolormesh, show
from netCDF4 import Dataset
# --------------------------------------------
def nc_varget(fin,vin):
    with Dataset(fin) as f:
        return f.variables[vin][:]
# --------------------------------------------
fin = 'S2A_reduced_dataset.nc'
xin = nc_varget(fin,'lon');
yin = nc_varget(fin,'lat');
zin = nc_varget(fin,'ndci');
# --------------------------------------------
# I will reduce the size further to test the code (matrices with 10k X 10k points are too big for my laptop and testing purposes)
xskip, yskip = 25,25
xin = xin[::yskip,::xskip]
yin = yin[::yskip,::xskip]
zin = zin[::yskip,::xskip]
# --------------------------
# let us take some info from original coordinates:
x0,x1,dx = np.min(xin),np.max(xin),np.abs(np.mean(np.diff(xin)))
y0,y1,dy = np.min(yin),np.max(yin),np.abs(np.mean(np.diff(yin.T)))
# --------------------------
# let us make new (regular) coordinates:
xout = np.arange(x0,x1+dx,dx)
yout = np.arange(y0,y1+dy,dy)
# --------------------------
xm,ym = np.meshgrid(xout,yout)
zo = griddata((xin.flatten(),yin.flatten()),zin.flatten(),(xm,ym),'nearest')
# ---------------------------------------------------------------------------
# let us save results as netCDF:
import xarray as xr
df = xr.DataArray(zo,dims=['lat','lon'])
df.coords['lon'] = xout
df.coords['lat'] = yout
df.to_netcdf('test.nc')

label coordinates that fall in a specific shapefile

I have a data frame of individuals, each with x and y coordinates, and I have a .shp file that contains a number of polygons.
The individuals data frame looks like:
ind_ID  x_coordinates  y_coordinates
1       2.333          6.572711
2       3.4444         6.57273
The .shp file looks like:
Code  shape length  shape area
222   .22           .5432
2322  .54322        .4342
122   .65656        .43
2122  .5445         .5678
What I want to do is add a new column to the data frame that labels each coordinate with the Code of the shapefile polygon it falls inside.
To do so, I wrote this code:
import csv
import numpy as np
import pandas as pd
import shapefile  # the pyshp package
from shapely.geometry import Point, Polygon
from shapely.geometry import shape # shape() is a function to convert geo objects through the interface

Individual = pd.read_csv("dataframe.csv")
sf = shapefile.Reader('path to the shape file.shp')
sf.shapes()
len(sf.shapes())

# function to read the shapefile
def read_shapefile(sf):
    """
    Read a shapefile into a Pandas dataframe with a 'coords'
    column holding the geometry information. This uses the pyshp
    package
    """
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)
    return df

df = read_shapefile(sf)
df.shape
I used the read_shapefile function to collect the x,y vertices of each feature. The output DataFrame:
Code  shape length  shape area  coords
222   .22           .5432       3.23232,2.72323,3.931226,2.543,3.435534 ....
2322  .54322        .4342       3.23232,2.72322,3.111226,2.343,3.12312 ...
122   .65656        .43         3.2323,2.23325,3.1212,2.1221,3.12321 ...
2122  .5445         .5678       3.9232,2.23232,2.931226,1.2123,3.213 ...
The next step is to check, for each individual, whether it falls inside any of the polygons described by coords; if yes, add a new column to the Individual df containing the corresponding Code from the shapefile.
I need help with this part.
I started by checking x,y against the coords of the shapefile data:
Individual.["X","Y"].isin(sf .["coords"]).astype(int)
This fails with an error.
The output I need is the individuals data frame with the Code added:
ind_ID  x_coordinates  y_coordinates  Code
1       2.333          6.572711       222
2       3.4444         6.57273        122
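The labelling step comes down to a point-in-polygon test, since an isin comparison only matches points that coincide with polygon vertices. The sketch below is a dependency-free illustration using ray casting; point_in_ring and label_points are hypothetical names, the polygons and points are invented, and each polygon is assumed to be a single ring without holes, given as a list of (x, y) vertices like the coords column above. In practice a library routine such as a geopandas spatial join would normally be preferred.

```python
def point_in_ring(x, y, ring):
    """Ray casting: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x position where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def label_points(points, polygons):
    """Map each point id to the code of the first polygon containing it.

    points:   iterable of (ind_id, x, y)
    polygons: dict of {code: ring}
    Points outside every polygon are labelled None.
    """
    labels = {}
    for ind_id, x, y in points:
        labels[ind_id] = next(
            (code for code, ring in polygons.items() if point_in_ring(x, y, ring)),
            None,
        )
    return labels

# Invented example: two square polygons and three individuals
polygons = {
    222: [(0, 0), (4, 0), (4, 4), (0, 4)],
    122: [(5, 5), (9, 5), (9, 9), (5, 9)],
}
points = [(1, 2.0, 3.0), (2, 6.0, 7.0), (3, 20.0, 20.0)]
labels = label_points(points, polygons)
```

The resulting dictionary can be mapped onto the data frame's ind_ID column to produce the Code column. Points lying exactly on a boundary are not handled consistently by this simple test.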

Select data by latitude and longitude

I am using a dataset from DWD (Deutscher Wetterdienst) and want to select data by latitude and longitude. The import works, and selecting with sel works when I use x and y, but not with latitude and longitude. I tried all the answers I could find, like:
ds.sel(latitude=50, longitude=14, method='nearest')
but I am getting the error
ValueError: dimensions or multi-index levels ['latitude', 'longitude'] do not exist
That's my code:
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.open_dataset(
    'cosmo-d2_germany_rotated-lat-lon_single-level_2019061721_012_ASWDIFD_S.grib2',
    engine='cfgrib',
    backend_kwargs={'filter_by_keys': {'stepUnits': 1}}
)

print(ds)
Output:
<xarray.Dataset>
Dimensions: (x: 651, y: 716)
Coordinates:
time datetime64[ns] ...
step timedelta64[ns] ...
surface int32 ...
latitude (y, x) float64 ...
longitude (y, x) float64 ...
valid_time datetime64[ns] ...
Dimensions without coordinates: x, y
Data variables:
ASWDIFD_S (y, x) float32 ...
Attributes:
GRIB_edition: 2
GRIB_centre: edzw
GRIB_centreDescription: Offenbach
GRIB_subCentre: 255
Conventions: CF-1.7
institution: Offenbach
history: 2019-07-22T13:35:33 GRIB to CDM+CF via cfgrib-
In your file latitude and longitude are not dimensions but rather helper 2D variables containing coordinate data. In xarray parlance they are called non-dimension coordinates and you cannot slice on them. See also Working with Multidimensional Coordinates.
It would be better to regrid the data to a regular grid inside Python so that latitudes and longitudes are 1D vectors: build a target grid and then interpolate the data onto it.
Also check https://www.ecmwf.int/sites/default/files/elibrary/2018/18727-cfgrib-easy-and-efficient-grib-file-access-xarray.pdf to see how to access GRIB files in xarray. If you don't want to use xarray for this purpose, pygrib is another option.
I can't test the solution as I don't have the cfgrib engine installed, but could you try a nearest-value lookup such as
find_nearest(lonarray, lonvalue)
to find the lon and lat indexes near your point? Note that find_nearest is not part of NumPy; it is a small helper defined in this solution (built on np.abs(array - value).argmin()):
Find nearest value in numpy array
You can then select the point using those indexes directly on the x,y dimensions:
http://xarray.pydata.org/en/stable/indexing.html
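For 2D latitude/longitude arrays like the ones in this file, the index lookup has to search both arrays at once. A minimal numpy sketch follows; nearest_yx is a hypothetical helper, the grid is synthetic, and the distance is a simple degree-based approximation with longitude differences scaled by cos(lat), which is an assumption that suits small search areas rather than a true great-circle distance.

```python
import numpy as np

def nearest_yx(lat2d, lon2d, lat0, lon0):
    """Return (iy, ix) of the grid cell closest to the point (lat0, lon0)."""
    lat2d = np.asarray(lat2d, dtype=float)
    lon2d = np.asarray(lon2d, dtype=float)
    dlat = lat2d - lat0
    # Scale longitude differences so both terms are in comparable units
    dlon = (lon2d - lon0) * np.cos(np.deg2rad(lat0))
    dist2 = dlat ** 2 + dlon ** 2
    # argmin works on the flattened array; unravel_index recovers (iy, ix)
    return np.unravel_index(np.argmin(dist2), dist2.shape)

# Synthetic, slightly skewed grid standing in for a rotated lat/lon model grid
y, x = np.mgrid[0:30, 0:40]
lat2d = 45.0 + 0.02 * y + 0.001 * x
lon2d = 5.0 + 0.03 * x - 0.001 * y

# Target placed very close to cell (10, 20)
iy, ix = nearest_yx(lat2d, lon2d, lat2d[10, 20] + 1e-4, lon2d[10, 20] - 1e-4)
```

With the dataset from the question, the returned pair could then be passed to positional selection, for example ds.isel(y=iy, x=ix), since x and y are the actual dimensions.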
I wrote a function for the files from the DWD:
import pygrib # https://jswhit.github.io/pygrib/docs/
import numpy as np
def get_grib_data_nearest_point(grib_file, inp_lat, inp_lon):
    """
    Gets the correspondent value to a latitude-longitude pair of coordinates in
    a grib file.
    :param grib_file: path to the grib file in disk
    :param inp_lat: latitude
    :param inp_lon: longitude
    :return: scalar
    """
    # open the grib file, get the coordinates and values
    grbs = pygrib.open(grib_file)
    grb = grbs[1]
    lats, lons = grb.latlons()
    values = grb.values
    grbs.close()
    # check if user coords are valid
    if inp_lat > max(grb.distinctLatitudes): return np.nan
    if inp_lat < min(grb.distinctLatitudes): return np.nan
    if inp_lon > max(grb.distinctLongitudes): return np.nan
    if inp_lon < min(grb.distinctLongitudes): return np.nan
    # find index for closest lat (x); the search stops at the first increase,
    # so it assumes the difference shrinks steadily up to the nearest row
    diff_save = 999
    best_x = 0
    for x in range(0, len(lats)):
        diff = abs(lats[x][0] - inp_lat)
        if diff < diff_save:
            diff_save = diff
            best_x = x  # remember the index of the smallest difference so far
        else:
            break
    # find index for closest lon (y)
    diff_save = 999
    best_y = 0
    for y in range(0, len(lons[best_x])):
        diff = abs(lons[best_x][y] - inp_lon)
        if diff < diff_save:
            diff_save = diff
            best_y = y
        else:
            break
    # index the array to return the correspondent value
    return values[best_x][best_y]
As noted above, you can re-grid your data (probably given in curvilinear grid i.e., lat and lon in 2D arrays) to your desired resolution of 1-D array (lat/lon) , after which you can use .sel directly on the lat/lon coords to slice the data.
Check out xESMF(https://xesmf.readthedocs.io/en/latest/notebooks/Curvilinear_grid.html).
Easy, fast interpolation and regridding of Xarray fields with good examples and documentation.

netCDF grid file: Extracting information from 1D array using 2D values

I am trying to work in Python 3 with topography/bathymetry-information (basically a grid containing x [longitude in decimal degrees], y [latitude in decimal degrees] and z [meter]).
The grid file has the extension .nc and is therefore a netCDF-file. Normally I would use it in mapping tools like Generic Mapping Tools and don't have to bother with how a netCDF file works, but I need to extract specific information in a Python script. Right now this is only limiting the dataset to certain longitude/latitude ranges.
However, right now I am a bit lost on how to get to the z-information for specific x and y values. Here's what I know about the data so far
import netCDF4
#----------------------
# Load netCDF file
#----------------------
bathymetry_file = 'C:/Users/te279/Matlab/data/gebco_08.nc'
fh = netCDF4.Dataset(bathymetry_file, mode='r')
#----------------------
# Getting information about the file
#----------------------
print(fh.file_format)
NETCDF3_CLASSIC
print(fh)
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
title: GEBCO_08 Grid
source: 20100927
dimensions(sizes): side(2), xysize(933120000)
variables(dimensions): float64 x_range(side), float64 y_range(side), int16 z_range(side), float64 spacing(side), int32 dimension(side), int16 z(xysize)
groups:
print(fh.dimensions.keys())
odict_keys(['side', 'xysize'])
print(fh.dimensions['side'])
: name = 'side', size = 2
print(fh.dimensions['xysize'])
: name = 'xysize', size = 933120000
#----------------------
# Variables
#----------------------
print(fh.variables.keys()) # returns all available variable keys
odict_keys(['x_range', 'y_range', 'z_range', 'spacing', 'dimension', 'z'])
xrange = fh.variables['x_range'][:]
print(xrange)
[-180. 180.] # contains the values -180 to 180 for the longitude of the whole world
yrange = fh.variables['y_range'][:]
print(yrange)
[-90. 90.] # contains the values -90 to 90 for the latitude of the whole world
zrange = fh.variables['z_range'][:]
[-10977 8685] # contains the depths/topography range for the world
spacing = fh.variables['spacing'][:]
[ 0.00833333 0.00833333] # spacing in both x and y. Equals the dimension, if multiplied with x and y range
dimension = fh.variables['dimension'][:]
[43200 21600] # corresponding to the shape of z if it was the 2D array I would've hoped for (it's currently a 1D array of 933120000 - which is 43200*21600)
z = fh.variables['z'][:] # currently a 1D array of the depth/topography/z information I want
fh.close()
Based on this information I still don't know how to access z for specific x/y (longitude/latitude) values. I think basically I need to convert the 1D array of z into a 2D array corresponding to longitude/latitude values. I just have not a clue how to do that. I saw in some posts where people tried to convert a 1D into a 2D array, but I have no means to know in what corner of the world they start and how they progress.
I know there is a 3 year old similar post, however, I don't know how to find an analogue "index of the flattened array" for my problem - or how to exactly work with that. Can somebody help?
You need to first read in all three of z's dimensions (lat, lon, depth) and then extract values across each of those dimensions. Here are a few examples.
# Read in all 3 dimensions [lat x lon x depth]
z = fh.variables['z'][:,:,:]
# Topography at a single lat/lon/depth (1 value):
z_1 = z[5,5,5]
# Topography at all depths for a single lat/lon (1D array):
z_2 = z[5,5,:]
# Topography at all latitudes and longitudes for a single depth (2D array):
z_3 = z[:,:,5]
Note that the number you enter for lat/lon/depth is the index in that dimension, not an actual latitude, for instance. You'll need to determine the indices of the values you are looking for beforehand.
I just found the solution in this post. Sorry that I didn't see that before. Here's what my code looks like now. Thanks to Dave (he answered his own question in the post above). The only thing I had to work on was that the dimensions have to stay integers.
import netCDF4
import numpy as np
#----------------------
# Load netCDF file
#----------------------
bathymetry_file = 'C:/Users/te279/Matlab/data/gebco_08.nc'
fh = netCDF4.Dataset(bathymetry_file, mode='r')
#----------------------
# Extract variables
#----------------------
xrange = fh.variables['x_range'][:]
yrange = fh.variables['y_range'][:]
spacing = fh.variables['spacing'][:] # needed below to compute nx and ny
zz = fh.variables['z'][:]
fh.close()
#----------------------
# Compute Lat/Lon
#----------------------
nx = (xrange[-1]-xrange[0])/spacing[0] # num pts in x-dir
ny = (yrange[-1]-yrange[0])/spacing[1] # num pts in y-dir
nx = int(round(nx)) # the dimensions have to be integers
ny = int(round(ny))
lon = np.linspace(xrange[0],xrange[-1],nx)
lat = np.linspace(yrange[0],yrange[-1],ny)
#----------------------
# Reshape the 1D to an 2D array
#----------------------
bathy = zz[:].reshape(ny, nx)
So, now when I look at the shape of both zz and bathy (following code), the former is a 1D array with a length of 933120000, the latter a 2D array of shape (21600, 43200), i.e. (ny, nx).
print(zz.shape)
print(bathy.shape)
The next step is to use indices to access the bathymetry/topography data correctly, just as N1B4 described in his post
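As a sketch of that index step, the function below converts a longitude/latitude pair into a (row, col) position and the matching position in the flattened array. lonlat_to_index is a hypothetical helper and the tiny grid is synthetic. Two assumptions are made and should be checked against the dataset documentation: the grid is cell-registered, so the number of cells equals range / spacing, and rows are stored from north to south (controlled by the north_first flag).

```python
import numpy as np

def lonlat_to_index(lon, lat, x_range, y_range, spacing, north_first=True):
    """Return (row, col, flat_index) of the cell containing (lon, lat)."""
    nx = int(round((x_range[1] - x_range[0]) / spacing[0]))
    ny = int(round((y_range[1] - y_range[0]) / spacing[1]))
    col = int((lon - x_range[0]) / spacing[0])
    if north_first:
        row = int((y_range[1] - lat) / spacing[1])   # row 0 is the northern edge
    else:
        row = int((lat - y_range[0]) / spacing[1])   # row 0 is the southern edge
    # Clamp so points on the outer edge still map to a valid cell
    col = min(max(col, 0), nx - 1)
    row = min(max(row, 0), ny - 1)
    return row, col, row * nx + col

# Tiny synthetic stand-in for the global grid: 4 x 2 cells of 90 degrees
x_range, y_range, spacing = (-180.0, 180.0), (-90.0, 90.0), (90.0, 90.0)
zz = np.arange(8)
bathy = zz.reshape(2, 4)                             # (ny, nx)

row, col, flat = lonlat_to_index(10.0, 45.0, x_range, y_range, spacing)
south = lonlat_to_index(10.0, -45.0, x_range, y_range, spacing)
```

Because reshape(ny, nx) is row-major, bathy[row, col] and zz[flat] refer to the same value, so a bounding box can be cut out with ordinary slices once the corner indices are known.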

indices of 2D lat lon data

I am trying to find the equivalent (if there exists one) of an NCL function that returns the indices of two-dimensional latitude/longitude arrays closest to a user-specified latitude/longitude coordinate pair.
This is the link to the NCL function that I am hoping there is an equivalent to in python. I'm suspecting at this point that there is not, so any tips on how to get indices from lat/lon coordinates is appreciated
https://www.ncl.ucar.edu/Document/Functions/Contributed/getind_latlon2d.shtml
Right now, I have my coordinate values saved in an .nc file, read by:
coords='coords.nc'
fh = Dataset(coords, mode='r')
lons = fh.variables['g5_lon_1'][:,:]
lats = fh.variables['g5_lat_0'][:,:]
rot = fh.variables['g5_rot_2'][:,:]
fh.close()
I found that scipy's spatial.KDTree can perform a similar task. Here is my code for finding the model grid point closest to the observation location:
import numpy as np
from scipy import spatial
from netCDF4 import Dataset
# read in the one dimensional lat lon info from a dataset
fname = '0k_T_ann_clim.nc'
fid = Dataset(fname, 'r')
lat = fid.variables['lat'][:]
lon = fid.variables['lon'][:]
# make them a meshgrid for later use KDTree
lon2d, lat2d = np.meshgrid(lon, lat)
# zip them together
model_grid = list( zip(np.ravel(lon2d), np.ravel(lat2d)) )
#target point location : 30.5N, 56.1E
# given as (lon, lat) to match the order used in model_grid
target_pts = [56.1, 30.5]
distance, index = spatial.KDTree(model_grid).query(target_pts)
# the nearest model location (in lon and lat)
model_loc_coord = model_grid[index]
# the (row, col) position of that point in the 2D grid
iy, ix = np.unravel_index(index, lon2d.shape)
I'm not sure how lon/lat arrays are stored when read in python, so to use the following solution you may need to convert lon/lat to numpy arrays. You can just put the abs(array-target).argmin() in a function.
import numpy as np
# make a dummy longitude array, 0.5 degree resolution.
lon=np.linspace(0.5,360,720)
# find index of nearest longitude to 25.4
ind=abs(lon-25.4).argmin()
# check it works! this gives 25.5
lon[ind]
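The same argmin idea extends to 2D latitude/longitude arrays and to several target points at once, which is closer to what the NCL function is asked for. The sketch below is not a port of getind_latlon2d; nearest_indices_2d is a hypothetical name, the grid is synthetic, and the haversine term is used as the distance measure. It builds an array of size npts * ny * nx, so for large grids or many points a KDTree is likely the better choice.

```python
import numpy as np

def nearest_indices_2d(lat2d, lon2d, lats, lons):
    """Return an (npts, 2) array of (iy, ix) for the nearest grid cells.

    Distances are ranked with the haversine term 'a'; the great-circle
    distance grows monotonically with it, so argmin of 'a' is sufficient.
    """
    lat2d = np.deg2rad(np.asarray(lat2d, dtype=float))
    lon2d = np.deg2rad(np.asarray(lon2d, dtype=float))
    lats = np.deg2rad(np.atleast_1d(np.asarray(lats, dtype=float)))
    lons = np.deg2rad(np.atleast_1d(np.asarray(lons, dtype=float)))
    # Broadcast targets (npts, 1, 1) against the grid (1, ny, nx)
    dlat = lat2d[None, :, :] - lats[:, None, None]
    dlon = lon2d[None, :, :] - lons[:, None, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lats)[:, None, None] * np.cos(lat2d)[None, :, :]
         * np.sin(dlon / 2) ** 2)
    flat = a.reshape(len(lats), -1).argmin(axis=1)
    iy, ix = np.unravel_index(flat, lat2d.shape)
    return np.column_stack([iy, ix])

# Synthetic 1 degree grid, stored as 2D arrays of shape (5, 10)
lon2d, lat2d = np.meshgrid(np.arange(0.0, 10.0, 1.0), np.arange(40.0, 45.0, 1.0))

idx = nearest_indices_2d(lat2d, lon2d, lats=[41.2, 43.9], lons=[3.4, 8.6])
```

Each row of idx holds the (iy, ix) pair for one target point, so values can be pulled from a data array of the same shape with data[idx[:, 0], idx[:, 1]].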
