How to apply a land/sea mask in xarray? - python

I have a gridded temperature dataset df (time: 2920, x: 349, y: 277) and a land/sea mask on the same grid mf (time: 1, x: 349, y: 277), where mf.land = 1 for land grid points and mf.land = 0 for ocean points. I want to use the land/sea mask to eliminate ocean points from my temperature dataset df, i.e. I only want the grid points in df where mf.land = 1.
Here's what df and mf look like (their printed representations are not included in this excerpt).
I'm trying this:
#import libraries
import os
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
import netCDF4 as nc
#open temperature data and land sea mask
df=xr.open_dataset('/home/mmartin/LauNath/air.2m.2015.nc')
mf=xr.open_dataset('/home/mmartin/WinterMaxThesis/NOAAGrid/land.nc')
#apply mask
mask = (mf.land >= 1)
LandOnly=df.air.loc[mask]
But I'm having trouble because of the difference in dimensions. How can I mask out these ocean grid points?

The problem actually occurs because both arrays have a time dimension, but they shouldn't. The time dimension on the land mask makes xarray think it needs to align the two datasets in time, yet there is no overlap in their time coordinates, so when xarray aligns them all the data is dropped as mis-aligned. Since the land mask doesn't change through time (at least, that's what I'm assuming), it's best to drop the time dimension from the land mask so xarray can broadcast it against the full time dimension of the data.
If you drop the time dimension from the land mask, it will broadcast as you expect:
mask = (mf.land >= 1).squeeze(['time'], drop=True)
Now you can mask your data with .where, optionally dropping all-NaN slices with drop=True:
LandOnly=df.air.where(mask, drop=True)
See the user guide sections on broadcasting and automatic alignment for more info.
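For illustration, here is a minimal, self-contained sketch using synthetic data in place of the NetCDF files (array shapes and values are made up), showing that once the size-1 time dimension is dropped the mask broadcasts across the data's full time axis:
import numpy as np
import pandas as pd
import xarray as xr
# synthetic stand-ins for air.2m.2015.nc and land.nc (tiny shapes for brevity)
times = pd.date_range("2015-01-01", periods=4, freq="6H")
air = xr.DataArray(np.random.rand(4, 3, 3), dims=("time", "y", "x"), coords={"time": times}, name="air")
land = xr.DataArray([[[1, 1, 0], [1, 0, 0], [1, 1, 1]]], dims=("time", "y", "x"), name="land")  # time: 1
# drop the size-1 time dimension so the mask broadcasts over air's time axis
mask = (land >= 1).squeeze("time", drop=True)
land_only = air.where(mask)  # ocean points become NaN at every timestep
print(land_only.isel(time=0))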

Related

How to apply a point transformation to many points?

I have a gridded temperature dataset and a list of weather stations across the country with their latitudes and longitudes. I want to find the grid points nearest to the weather stations. My gridded data has projected coordinates x and y, of which latitude and longitude are functions.
I found that the simplest approach is to first transform the stations' latitude and longitude (Lat, Lon) to x and y values and then find the nearest grid point. I did that for one station (lat= , lon= ) as follows:
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
import netCDF4 as nc
#open gridded data
df=xr.open_dataset('/home/mmartin/LauNath/air.2m.2015.nc')
#open weather station data
CMStations=pd.read_csv('Slope95.csv')
import cartopy.crs as ccrs
# Example - your x and y coordinates are in a Lambert Conformal projection
data_crs = ccrs.LambertConformal(central_longitude=-107.0,central_latitude=50.0,standard_parallels = (50, 50.000001),false_easting=5632642.22547,false_northing=4612545.65137)
# Transform the point - src_crs is always Plate Carree for lat/lon grid
x, y = data_crs.transform_point(-94.5786,39.0997, src_crs=ccrs.PlateCarree())
# Now you can select data
ks=df.sel(x=x, y=y, method='nearest')
How would I apply this to all of the weather stations latitudes and longitudes (Lat,Lon)?
There is no need to use geopandas here... just use crs.transform_points() instead of crs.transform_point() and pass the coordinates as arrays!
import numpy as np
import cartopy.crs as ccrs
data_crs = ccrs.LambertConformal(central_longitude=-107.0,central_latitude=50.0,standard_parallels = (50, 50.000001),false_easting=5632642.22547,false_northing=4612545.65137)
lon, lat = np.array([1,2,3]), np.array([1,2,3])
data_crs.transform_points(ccrs.PlateCarree(), lon, lat)
which will return an array of the projected coordinates:
array([[16972983.1673108 ,  8528848.37931063,        0.        ],
       [16841398.80456616,  8697676.02704447,        0.        ],
       [16709244.32834945,  8862533.81411212,        0.        ]])
... also... if you really have a lot of points to transform (and maybe use some CRS not yet supported by cartopy) you might want to have a look at PyProj directly, since it provides a lot more functionality and also some tricks to speed up transformations. (It's used under the hood by cartopy as well, so you should already have it installed!)
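As a minimal sketch of how that fits the question's setup (reusing df and CMStations from the question's code, which already has Lon and Lat columns), the projected coordinates can then be wrapped in DataArrays for a vectorized nearest-neighbour selection:
import numpy as np
import xarray as xr
import cartopy.crs as ccrs
# same projection as in the question
data_crs = ccrs.LambertConformal(central_longitude=-107.0, central_latitude=50.0,
                                 standard_parallels=(50, 50.000001),
                                 false_easting=5632642.22547, false_northing=4612545.65137)
# transform every station in one call
pts = data_crs.transform_points(ccrs.PlateCarree(), CMStations.Lon.values, CMStations.Lat.values)
station_x = xr.DataArray(pts[:, 0], dims="station")
station_y = xr.DataArray(pts[:, 1], dims="station")
# one nearest grid point per station (dimension "station" replaces x and y)
nearest = df.sel(x=station_x, y=station_y, method="nearest")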
You can create a geopandas GeoDataFrame from x, y columns using geopandas.points_from_xy. I'll assume these points are WGS84/EPSG4326:
import geopandas as gpd
stations = gpd.GeoDataFrame(
    CMStations,
    geometry=gpd.points_from_xy(
        CMStations.Lon, CMStations.Lat, crs="epsg:4326"  # assume WGS84
    ),
)
Now, we can use geopandas.GeoDataFrame.to_crs to transform all the points at once:
stations_xy = stations.to_crs(data_crs)
Finally, we can use xarray's advanced indexing, using DataArrays with lat/lon data and station ID as coordinates, to reshape the x/y data to the shape of the CMStations index:
station_x = stations_xy.geometry.x.to_xarray()
station_y = stations_xy.geometry.y.to_xarray()
# use these to select from the xarray Dataset (df in the question)
station_data = df.sel(y=station_y, x=station_x, method="nearest")
If desired, you could first set a station ID column as the index with CMStations.set_index("station_id"), so that station_id (rather than x and y) becomes the dimension of the selected dataset.
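A small hedged sketch of that last step (the station_id column name is hypothetical):
# hypothetical station_id column used as index before building the GeoDataFrame
CMStations = CMStations.set_index("station_id")
stations = gpd.GeoDataFrame(
    CMStations,
    geometry=gpd.points_from_xy(CMStations.Lon, CMStations.Lat, crs="epsg:4326"),
)
stations_xy = stations.to_crs(data_crs)
station_data = df.sel(
    x=stations_xy.geometry.x.to_xarray(),
    y=stations_xy.geometry.y.to_xarray(),
    method="nearest",
)  # station_data now has a "station_id" dimension instead of x/y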

mask NetCDF using shapefile and calculate average and anomaly for all polygons within the shapefile

There are several tutorials (example 1, example 2, example 3) about masking NetCDF data using a shapefile and calculating average measures. However, I was confused by those workflows for masking NetCDF data and extracting measures such as the average, and those tutorials did not cover extracting anomalies (for example, the difference between the temperature in 2019 and a baseline average temperature).
I'll make an example here. I have downloaded monthly temperature (download temperature file) from 2000 to 2019 and the state-level US shapefile (download shapefile). I want to get the state-level average temperature based on the monthly average temperature from 2000 to 2019, and the temperature anomaly of the year 2019 relative to the baseline temperature from 2000 to 2010. Specifically, the final dataframe looks as follows:
state   avg_temp   anom_temp2019
AL      xx         xx
AR      xx         xx
...     ...        ...
WY      xx         xx
# Load libraries
%matplotlib inline
import regionmask
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Read shapefile
us = gpd.read_file('./shp/state_cus.shp')
# Read gridded data
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')
......
I would really appreciate your help in providing an explicit workflow that can do the above task. Thanks a lot.
This can be achieved using regionmask. I don't use your files but the xarray tutorial data and naturalearth data for the US states.
import numpy as np
import regionmask
import xarray as xr
# load polygons of US states
us_states_50 = regionmask.defined_regions.natural_earth.us_states_50
# load an example dataset
air = xr.tutorial.load_dataset("air_temperature")
# turn into monthly time resolution
air = air.resample(time="M").mean()
# create a mask
mask3D = us_states_50.mask_3D(air)
# latitude weights
wgt = np.cos(np.deg2rad(air.lat))
# calculate regional averages
reg_ave = air.weighted(mask3D * wgt).mean(("lat", "lon"))
# calculate the average temperature (over 2013-2014)
avg_temp = reg_ave.sel(time=slice("2013", "2014")).mean("time")
# calculate the anomaly (w.r.t. 2013-2014)
reg_ave_anom = reg_ave - avg_temp
# select a single timestep (January 2013)
reg_ave_anom_ts = reg_ave_anom.sel(time="2013-01")
# remove the time dimension
reg_ave_anom_ts = reg_ave_anom_ts.squeeze(drop=True)
# convert to a pandas dataframe so it's in tabular form
df = reg_ave_anom_ts.air.to_dataframe()
# set the state codes as index
df = df.set_index("abbrevs")
# remove other columns
df = df.drop(columns="names")
You can find info on how to use your own shapefile in the regionmask docs (Working with geopandas).
disclaimer: I am the main author of regionmask.
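The tutorial dataset only covers 2013-2014, which is why the averages above use that period. As a hedged sketch, the same pattern should give the quantities asked for in the question once reg_ave is computed from files spanning 2000-2019 (the variable name air and the abbrevs/names coordinates follow the tutorial dataset; your own files may use different names, e.g. t2m):
# assuming reg_ave was computed as above, but from monthly data spanning 2000-2019
avg_temp = reg_ave.sel(time=slice("2000", "2019")).mean("time")       # 2000-2019 state means
baseline = reg_ave.sel(time=slice("2000", "2010")).mean("time")       # 2000-2010 baseline
anom_temp2019 = reg_ave.sel(time=slice("2019", "2019")).mean("time") - baseline
out = xr.Dataset({"avg_temp": avg_temp.air, "anom_temp2019": anom_temp2019.air})
df = out.to_dataframe().set_index("abbrevs").drop(columns="names")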

compute and plot monthly mean SST anomalies with an xarray multiindex (pangeo tutorial gallery)

I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray tutorial.
You'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd like to compute and plot the ONI as well. Warm or cold phases of the Oceanic Niño Index are defined by five consecutive 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive red, negative blue. Something like this:
for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead oni.sst.plot() gives me:
Resetting the index enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month') then the sst data goes away.
I also tried converting to a pd with oni.to_dataframe() but you end up with 5040 rows which is 12 months x 420 month-years I subsetted for. According to the docs "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)." so I guess that makes sense, but not useful. Even if you reset_index of oni before converting to a dataframe you get the same 5040 rows. Q2. Since the dataframe must be repeating itself I can probably figure out where, but is there a way to do this "cleaner" with each date not repeated for all 12 months?
Your code results in a DataArray with both a time and a month dimension, because subtracting the climatology directly broadcasts the month dimension instead of matching each timestep to its month. This is why you end up with such a plot.
There is a trick (found here) to calculate anomalies: do the subtraction on the groupby("time.month") object so each timestep is matched to its month. Besides this, I would select 1986-2015 as the reference period (see the NOAA definition of the ONI index).
Combining both, I ended up with this short code (without the bar plot):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
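The bar chart asked about in the question isn't covered above; a minimal sketch with plain matplotlib (assuming oni was computed as in the code above) that colors positive anomalies red and negative ones blue:
# convert the ONI DataArray to a pandas Series indexed by time
oni_series = oni.to_series()
colors = ['r' if v > 0 else 'b' for v in oni_series]
fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(oni_series.index, oni_series.values, width=20, color=colors)  # width in days
ax.axhline(0.5, color='k', ls='--', lw=0.5)   # warm-phase threshold
ax.axhline(-0.5, color='k', ls='--', lw=0.5)  # cold-phase threshold
ax.set_ylabel('ONI (°C)')
plt.show()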

Pearson correlation matrix in Python

I'm characterizing biological samples with infrared spectroscopy (mid-infrared region) and the resulting data is used in predictive models for disease prediction. I'm now working on the spectrum data using supervised learning, and the first step is to prepare the data (smoothing, peak finding, peak filtering, etc). I now have a 93x1 matrix (the dependent variable, i.e. the disease/not disease label), where 93 is the number of samples, and a 93x210 matrix, where 210 is the number of wavelengths at which I can find the pre-filtered absorption peaks. From these 210 wavelengths I need to extract the features (absorption peaks) that I'll feed into my model. For this, I'm using a Pearson correlation matrix in Python, where the header is the 210 xi wavelengths. I want to find correlations between absorption peaks at 'xi wavelength' and samples. The issue is that the resulting matrix gives me '1' everywhere.
Disclaimer: I'm a newbie in python
import numpy as np
import pandas as pd
import io  # needed for io.BytesIO below
from google.colab import files
uploaded = files.upload()
df2 = pd.read_excel(io.BytesIO(uploaded['20191201-Peaks.xlsx']), header=None, index_col=None)
df2.columns=['Label_0_1','Sample',1700.105,...,1500.49]
print (df2.dtypes)
Label_0_1 object
Sample object
1700.105 float64
1699.141 float64
1698.177 float64
...
1504.35 float64
1503.38 float64
1502.42 float64
1501.45 float64
1500.49 float64
Length: 210, dtype: object
df2.shape
(93, 210)
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(150,150))
cor = df2.corr(method='pearson')
cor
(figure: the resulting correlation matrix, showing 1 everywhere)
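As an aside, a minimal sketch of how the matrix could be visualized with the seaborn import already present (this does not explain why the values are all 1):
plt.figure(figsize=(12, 10))
sns.heatmap(cor, cmap='coolwarm', center=0)  # heatmap of the wavelength-by-wavelength Pearson matrix
plt.title('Pearson correlation between wavelengths')
plt.show()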

Hough Transform on arrays of coordinates (stock prices)

I want to apply Hough Transform on stock prices (array of numbers).
I read the OpenCV and scikit-image docs and examples, but couldn't figure out how to apply the transformation to arrays of numbers instead of images.
I created a 2D array from the data. The first dimension is X (simply the index of the data) and the second dimension is the close prices.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pywt as wt
from skimage.transform import (hough_line, hough_line_peaks,probabilistic_hough_line)
from matplotlib import cm
path = "22-31May-100Tick.csv"
df = pd.read_csv(path)
y = df.Close.values
x = np.arange(0,len(y),1)
# build an (N, 2) array of (index, close price) pairs
data = []
for i in x:
    a = [i, y[i]]
    data.append(a)
data = np.array(data)
How is it possible to apply the transformation with OpenCV or scikit-image?
Thank you
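This excerpt has no answer, but as a hedged sketch of one possible approach (not from the original thread): rasterize the (index, price) points into a binary image and run scikit-image's straight-line Hough transform on it. The number of price levels used for quantization is an arbitrary assumption.
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

# synthetic close prices standing in for df.Close.values
y = np.cumsum(np.random.randn(500)) + 100.0
x = np.arange(len(y))

# rasterize the series into a binary image: one column per tick, rows = quantized price levels
n_rows = 200  # assumed number of price bins
rows = np.round((y - y.min()) / (y.max() - y.min() + 1e-12) * (n_rows - 1)).astype(int)
img = np.zeros((n_rows, len(y)), dtype=bool)
img[rows, x] = True

# standard straight-line Hough transform on the binary image
h, theta, d = hough_line(img)
for _, angle, dist in zip(*hough_line_peaks(h, theta, d)):
    if np.isclose(np.sin(angle), 0):  # skip vertical lines to avoid division by zero
        continue
    # convert (angle, dist) back to slope/intercept in (column, row) coordinates
    slope = -np.cos(angle) / np.sin(angle)
    intercept = dist / np.sin(angle)
    print(f"line: row = {slope:.3f} * col + {intercept:.1f}")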
