I am trying to plot a time series of the sea surface temperature (SST) for a specific region from a .nc file. The SST is a three-dimensional variable (lat, lon, time) containing mean daily values for a specific region from 1982 to 2016. I want my plot to reflect the seasonal SST variability over the entire period. I assume that what I need to do first is to obtain a mean SST value for my lat/lon region for each day, which I can then work with later on. So far, I assume that I need to read the .nc file and the variables:
import netCDF4 as nc
f = nc.Dataset('cmems_SST_MED_SST_L4_REP_OBSERVATIONS_010_021_1639073212518.nc')
sst = f.variables['analysed_sst'][:]
lon = f.variables['longitude'][:]
lat = f.variables['latitude'][:]
Next, following the code suggested here, I tried to reshape and obtain the mean, but an error pops up:
import numpy as np
global_average = np.nanmean(sst[:,:,:], axis=(1,2))
annual_temp = np.nanmean(np.reshape(global_average, (34,12)), axis = 1)
#34 years between 1982 and 2016, and 12 months per year.
ERROR cannot reshape array of size 14008 into shape (34,12)
From here I found different approaches, like using CDO or NCO (which didn't work due to installation problems) among others, none of which were suitable for my case. I used nanmean because I know that in MATLAB this is done using the nanmean function. I am quite new to this topic and I would like to ask for some hints, like where I should focus or which path is more suitable for this case. Thank you!!
Handling daily data in pure Python alone is difficult because you have to account for leap years, and subsetting a region requires tedious index striding.
As steTATO mentioned, since the data you are working with has daily temporal resolution, you need to consider the following.
You would need to reshape global_average into shape (34,365) or (34,366) depending on whether the year is a leap year (1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016). So your code above would look something like
annual_temp = np.nanmean(np.reshape(global_average, (34,365)), axis = 1)
But, like I said, because of the leap years you can't get what you want by simply reshaping global_average.
If I had no choice but to use python only, I'd do the following
import numpy as np

def days_in_year(in_year):
    # leap years within 1982-2016
    leap_years = [1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]
    if in_year in leap_years:
        out_days = 366
    else:
        out_days = 365
    return out_days
# some of your code, importing netcdf data
year = np.arange(1982, 2017)
global_avg = np.nanmean(sst[:, :, :], axis=(1, 2))
annual_avgs = []
i = 0
for yr in range(35):
    f = i                           # index of the first day of this year
    i = i + days_in_year(year[yr])  # one past the index of its last day
    annual_avg = np.nanmean(global_avg[f:i])
    annual_avgs.append(annual_avg)
The code above averages global_avg in strides whose lengths account for leap years, and saves the yearly means in annual_avgs.
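If you then want to visualize these annual means, a minimal plotting sketch (assuming matplotlib; the y-axis units are whatever analysed_sst is stored in, often Kelvin for this kind of product):
import matplotlib.pyplot as plt
plt.plot(year, annual_avgs, marker='o')
plt.xlabel('Year')
plt.ylabel('Regional mean SST (units as stored in the file)')
plt.show()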
Problem statement
I am creating a distribution plot of flood events per N year periods starting in 1870. I am using Pandas and Seaborn. I need help with...
specifying the date range of each bin when using sns.displot, and
clearly representing my bin size specifications along the x axis.
To clarify this problem, here is the data that I am working with, what I have tried, and a description of the desired output.
The Data
The data I am using is available from the U.S. National Weather Service.
import pandas as pd
import bs4
import urllib.request
link = "https://water.weather.gov/ahps2/crests.php?wfo=jan&gage=jacm6&crest_type=historic"
webpage=str(urllib.request.urlopen(link).read())
soup = bs4.BeautifulSoup(webpage, 'html.parser')
tbl = soup.find('div', class_='water_information')
vals = tbl.get_text().split(r'\n')
tcdf = pd.Series(vals).str.extractall(r'\((?P<Rank>\d+)\)\s(?P<Stage>\d+\.\d+)\sft\son\s(?P<Date>\d{2}/\d{2}/\d{4})')\
.reset_index(drop=True)
tcdf['Stage'] = tcdf.Stage.astype(float)
total_crests_events = len(tcdf)
tcdf['Rank'] = tcdf.Rank.astype(int)
tcdf['Date'] = pd.to_datetime(tcdf.Date)
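As a quick sanity check that the scrape and extraction worked (a hedged example, since the exact output depends on the live page), you can peek at the parsed table:
print(tcdf.head())  # expect integer Rank, float Stage, datetime Date
print(total_crests_events, 'crest events parsed')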
What works
I am able to plot the data with Seaborn's displot, and I can manipulate the number of bins with the bins argument.
The second image is closer to my desired output. However, I do not think that it's clear where the bins start and end. For example, the first two bins (reading left to right) clearly start before and end after 1880, but the precise years are not clear.
import seaborn as sns
# fig. 1: data distribution using default bin parameters
sns.displot(data=tcdf,x="Date")
# fig. 2: data distribution using 40 bins
sns.displot(data=tcdf,x="Date",bins=40)
What fails
I tried specifying date ranges using the bins input. The approach is loosely based on a previous SO thread.
my_bins = pd.date_range(start='1870',end='2025',freq='5YS')
sns.displot(data=tcdf,x="Date",bins=my_bins)
This attempt, however, produced a TypeError:
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
This is a long question, so I imagine that some clarification might be necessary. Please do not hesitate to ask questions in the comments.
Thanks in advance.
Seaborn internally converts its input data to numbers so that it can do math on them, and it uses matplotlib's "unit conversion" machinery to do that. So the easiest way to pass bins that will work is to use matplotlib's date converter:
import matplotlib as mpl

sns.displot(data=tcdf, x="Date", bins=mpl.dates.date2num(my_bins))
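As a follow-up on making the bin edges explicit along the x axis (point 2 of the question), here is a hedged sketch that places the ticks at the bin edges themselves; the axes indexing assumes a single-facet displot:
import matplotlib as mpl
import matplotlib.pyplot as plt
g = sns.displot(data=tcdf, x="Date", bins=mpl.dates.date2num(my_bins))
ax = g.axes.flat[0]
ax.set_xticks(mpl.dates.date2num(my_bins[::2]))  # every other 5-year edge to avoid crowding
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y'))
plt.xticks(rotation=90)
plt.show()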
I have 10 years of output from the WRF climate model. I am looking for an efficient way to select, for every grid point in the xarray, only those days where T>0 holds for more than 2 consecutive days. For my plots, I want, for each month at each grid point, the total number of days where T>0 held for more than 2 consecutive days.
I am new to xarray, and looking at similar questions I still couldn't find a proper loop or count function to apply at each grid point and month-wise! I would really appreciate any help with this code.
Here is my current code:
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
import netCDF4
from netCDF4 import Dataset
import numpy as np
#concatenate the 10year output
dataset=xr.open_mfdataset("\Python files for plotting wrfoutput\era5_1990-2000_output\*.nc",concat_dim='Time', combine='nested', compat='no_conflicts', preprocess=None, engine=None, data_vars='all', coords='all', parallel=False, join='outer', attrs_file=None,)
#dimensions are: Time, south_north, west_east
DS=dataset
DS = DS.assign_coords(Time=pd.to_datetime(DS['Time'].values))
#Select/extract only the mean 2m surface temperature (T2) from the large xarray
DST2=DS.T2
#apply the where function to check at which grid points in each month the T2>0
T2threshold=DST2.groupby('Time.month').where(DST2>0)
In general it is difficult to support you without code that reproduces the issue you are running into.
Stack Overflow is not there to help you learn programming; it is there to help find solutions for edge cases and issues.
Never mind, here are some thoughts for you. xarray works much like pandas, so if you can find a solution for pandas, try it with xarray.
ds['threshold_mask'] = ds.T2.where(ds.T2 > 0)
Building a mask and then using groupby and cumsum:
ds.groupby((ds['threshold_mask'] == 0).cumsum().threshold_mask).cumsum()
No guarantees that this works, but I guess it will help you find the right solution.
Seen here: Pandas: dataframe cumsum, reset if other column is false
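Following that advice of prototyping in pandas first, here is a minimal hedged sketch of the cumsum-reset trick for counting days that fall inside a >2-day warm spell; the toy series and all names are illustrative, and the same logic would then need to be applied along the Time dimension of the xarray data:
import pandas as pd
temp = pd.Series([1, 2, -1, 3, 4, 5, -2, 1, 1, 1])  # toy daily temperatures
mask = temp > 0                                     # True on days above the threshold
run_id = (~mask).cumsum()                           # increments at each False, labeling runs
run_len = mask.groupby(run_id).transform('sum')     # length of the run each day belongs to
long_spell_days = mask & (run_len > 2)              # days inside a >2-day spell
print(int(long_spell_days.sum()))                   # 6 for this toy series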
# Imports assumed for this snippet (module, inverter and
# temperature_model_parameters are defined elsewhere by the asker)
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from pvlib.forecast import GFS
from pvlib.pvsystem import PVSystem
from pvlib.modelchain import ModelChain

#Initialize the model here
model = GFS(resolution='half', set_type='latest')
#the location I want to forecast the irradiance, and also the timezone
latitude, longitude, tz = 15.134677754177943, 120.63806622424912, 'Asia/Manila'
start = pd.Timestamp(datetime.date.today(), tz=tz)
end = start + pd.Timedelta(days=7)
#pulling the data from the GFS
raw_data = model.get_processed_data(latitude, longitude, start, end)
raw_data = pd.DataFrame(raw_data)
data = raw_data
#Description of the PV system we are using
system = PVSystem(surface_tilt=10, surface_azimuth=180, albedo=0.2,
module_type = 'glass_polymer',
module=module, module_parameters=module,
temperature_model_parameters=temperature_model_parameters,
modules_per_string=24, strings_per_inverter=32,
inverter=inverter, inverter_parameters=inverter,
racking_model='insulated_back')
#Using the ModelChain
mc = ModelChain(system, model.location, orientation_strategy=None,
aoi_model='no_loss', spectral_model='no_loss',
temp_model='sapm', losses_model='no_loss')
mc.run_model(data);
mc.total_irrad.plot()
plt.ylabel('Plane of array irradiance ($W/m^2$)')
plt.legend(loc='best')
(Plot of the forecast plane-of-array irradiance omitted.)
I am getting essentially the same irradiance values for several days now, so I believe something is wrong. I would expect at least somewhat different values for each day.
Forecasting Irradiance
I think the reason the days all look the same is that the forecast data predicts those days to be consistently overcast, so there's not necessarily anything "wrong" with the values being very similar across days -- it's just several cloudy days in a row. Take a look at raw_data['total_clouds'] and see how little variation there is for this forecast (nearly always 100% cloud cover). Also note that if you print the actual values of mc.total_irrad, you'll see that there is some minor variation day-to-day that is too small to appear on the plot.
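To see this in the data yourself, a small hedged check (assuming raw_data and mc from the question's snippet):
print(raw_data['total_clouds'].describe())               # near-constant ~100% cloud cover
print(mc.total_irrad['poa_global'].resample('D').max())  # daily peaks vary only slightly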
I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray tutorial.
To follow along, you'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd like to compute and plot ONI as well. Warm or cold phases of the Oceanic Niño Index are defined by five consecutive overlapping 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive values in red and negative in blue (desired figure omitted), for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead, oni.sst.plot() gives me a plot over both time and month dimensions, which is not what I want (figure omitted).
Resetting the index enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month') then the sst data goes away.
I also tried converting to a pandas DataFrame with oni.to_dataframe(), but you end up with 5040 rows, which is the 12 months times the 420 month-years I subsetted for. According to the docs, "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)", so I guess that makes sense, but it is not useful. Even if you reset_index of oni before converting to a DataFrame, you get the same 5040 rows. Q2. Since the DataFrame must be repeating itself, I can probably figure out where, but is there a cleaner way to do this, with each date not repeated for all 12 months?
Your code results in a DataArray with both time and month dimensions, because subtracting the month-indexed climatology broadcasts month against the full time series. This is the reason why you end up with such a plot.
There is a trick (found here) to calculate the anomalies. Besides this, I would select 1986-2015 as the reference period (see the NOAA definition of the ONI index).
Combining both, I ended up with this short piece of code (without the bar plots):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
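Since the question specifically asked for red/blue bars, here is a minimal sketch on top of the oni computed above; it drops down to matplotlib's bar() because xarray DataArrays have no .plot.bar(), and the ±0.5°C ONI thresholds are drawn for reference:
colors = ['r' if v > 0 else 'b' for v in oni.values]
plt.bar(oni['time'].values, oni.values, width=20, color=colors)
plt.axhline(0.5, color='k', ls='--', lw=0.5)
plt.axhline(-0.5, color='k', ls='--', lw=0.5)
plt.ylabel('ONI (°C)')
plt.show()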
I have an xarray of monthly average surface temperatures read in from a server using open_dataset with decode_times=False because the calendar type is not understood by xarray.
After some manipulation, I am left with a dataset my_dataset of surface temperatures ('ts') and times ('T'):
<xarray.Dataset>
Dimensions: (T: 1800)
Coordinates:
* T (T) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 ...
Data variables:
ts (T) float64 246.6 247.9 250.7 260.1 271.9 281.1 283.3 280.5 ...
'T' has the following attributes:
Attributes:
pointwidth: 1.0
calendar: 360
gridtype: 0
units: months since 0300-01-01
I would like to take this monthly data and calculate annual averages, but because the T coordinate does not contain datetimes, I'm unable to use xarray.Dataset.resample. Right now I am simply converting to a NumPy array, but I would like a way to do this that preserves the xarray dataset.
My current, rudimentary way:
temps = np.mean(np.array(my_dataset['ts']).reshape(-1,12),axis=1)
years = np.array(my_dataset['T'])/12
I appreciate any help, even if the best way is redefining the time coordinate to use resampling.
Edit:
As requested, here is how the xarray dataset was created:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
filename = 'http://strega.ldeo.columbia.edu:81/CMIP5/.byScenario/.abrupt4xCO2/.atmos/.mon/.ts/ACCESS1-0/r1i1p1/.ts/dods'
ds = xr.open_dataset(filename,decode_times=False)
zonal_mean = ds.mean(dim='lon')
arctic_only = zonal_mean.where(zonal_mean['lat'] >= 60).dropna('lat')
weights = np.cos(np.deg2rad(arctic_only['lat']))/np.sum(np.cos(np.deg2rad(arctic_only['lat'])))
my_dataset = (arctic_only * weights).sum(dim='lat')
This is a very common problem, especially with datasets from INGRID. The reason xarray can't decode dates whose units are "months since..." is that the underlying netcdf4-python library refuses to parse such dates. This is discussed in a netcdf4-python GitHub issue:
The problem with time units such as "months" is that they are not well defined. In contrast to days, hours, etc. the length of a month depends on the calendar used and even varies between different months.
INGRID unfortunately refuses to accept this fact and continues to use "months" as its default unit, despite the ambiguity. So right now there is this frustrating incompatibility between INGRID and xarray / python-netcdf4.
Anyway, here is a hack to accomplish what you want without leaving xarray:
# create new coordinates for month and year
ds.coords['month'] = np.ceil(ds['T'] % 12).astype('int')
ds.coords['year'] = (ds['T'] // 12).astype('int')
# calculate monthly climatology
ds_clim = ds.groupby('month').mean(dim='T')
# calculate annual mean
ds_am = ds.groupby('year').mean(dim='T')
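If you would rather follow the question's closing thought and redefine the time coordinate so that resample works, here is a hedged sketch using xarray's cftime support; it assumes a reasonably recent xarray and that the file's calendar: 360 attribute corresponds to cftime's 360_day calendar:
# 1800 monthly stamps on a 360_day calendar starting 0300-01,
# matching the "months since 0300-01-01" units
times = xr.cftime_range(start='0300-01-01', periods=my_dataset.sizes['T'],
                        freq='MS', calendar='360_day')
ds_t = my_dataset.assign_coords(T=times).rename({'T': 'time'})
annual_means = ds_t.resample(time='AS').mean()  # annual averages, labeled from year 300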