I'm trying to obtain monthly means from an observed precipitation dataset for the period 1901-2015. The current shape of my prec variable is (1380 (time), 360 (lat), 720 (lon)), with 1380 being the number of months over the 115-year period. I have been told that the most effective way to calculate monthly means is to use np.reshape on the prec variable to split the array into years and months, but I am not sure of the best way to do this. I was also wondering whether there is a way in Python to select specific months of the year, as I will be producing plots for each month of the year.
I have been attempting to reshape the prec variable with the code below, but I am not sure how to do it correctly:
import sys
import numpy as np
from netCDF4 import Dataset

# Set source folder
sys.path.append('../../..')
SrcFld = "/export/silurian/array-01/obs/CRU/"
# Retrieve data
example = SrcFld + 'cru_ts4.00.1901.2015.pre.dat.nc'
Data = Dataset(example)
# Retrieve variables
Prec = Data.variables['pre'][:]
lats = Data.variables['lat'][:]
lons = Data.variables['lon'][:]
# Split the time axis into (year, month), then average over the years
Prec = np.reshape(Prec, (115, 12, 360, 720))
Prec_mean = np.mean(Prec, axis=0)  # shape (12, 360, 720)
Any guidance on this issue would be appreciated.
The following snippet first splits the precipitation array year-wise; we can then average over the year axis to get the monthly mean of precipitation.
>>> prec = np.random.rand(1380,360,720)
>>> ind = np.arange(12,1380,12)
>>> yearly_split = np.array(np.split(prec, ind, axis=0))
>>> yearly_split.shape
(115, 12, 360, 720)
>>> monthly_mean = yearly_split.mean(axis=0)
>>> monthly_mean.shape
(12, 360, 720)
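If you also need to work with a specific calendar month, you can index the month axis directly; a small sketch (month 0 = January, assuming the record starts in January 1901):
>>> january_mean = monthly_mean[0]          # mean January map, shape (360, 720)
>>> january_all_years = yearly_split[:, 0]  # every January, shape (115, 360, 720)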
I am quite new to parallel computing and Dask. In addition, this is my first question here on Stack Overflow and I hope everything will work.
Problem Description
I want to set up a bias correction of climate data (e.g. total precipitation, tp). For this I have three datasets:
Actual predictions (ds_pred), containing several different ensemble members for a period of 215 days from an initial date. (Dimensions: days x ens x lat x lon)
Historical model data (ds_mdl_hist) for a given period of time (e.g. 1981-2016). (Dimensions: year x days x ens x lat x lon)
Observational data (ds_obs). (year x days x lat x lon)
I created some dummy data to illustrate my problem. Please note: my real datasets contain a lot more data.
# create dummy data
import numpy as np
import xarray as xr

days = np.arange(0, 215)
lat = np.arange(35, 50)
lon = np.arange(10, 20)
ens = np.arange(0, 10)
year = np.arange(1981, 2016)
data_pred = np.random.randint(0, 100, size=(len(days), len(ens), len(lat), len(lon)))
data_mdl_hist = np.random.randint(0, 100, size=(len(year), len(days), len(ens), len(lat), len(lon)))
data_obs = np.random.randint(0, 100, size=(len(year), len(days), len(lat), len(lon)))
# create datasets
ds_pred = xr.Dataset({'tp': xr.DataArray(
    data=data_pred,
    dims=['days', 'ens', 'lat', 'lon'],
    coords={'days': days, 'ens': ens, 'lat': lat, 'lon': lon},
    attrs={'units': 'mm/day'})})
ds_mdl_hist = xr.Dataset({'tp': xr.DataArray(
    data=data_mdl_hist,
    dims=['year', 'days', 'ens', 'lat', 'lon'],
    coords={'year': year, 'days': days, 'ens': ens, 'lat': lat, 'lon': lon},
    attrs={'units': 'mm/day'})})
ds_obs = xr.Dataset({'tp': xr.DataArray(
    data=data_obs,
    dims=['year', 'days', 'lat', 'lon'],
    coords={'year': year, 'days': days, 'lat': lat, 'lon': lon},
    attrs={'units': 'mm/day'})})
For each day in ds_pred, I slice the corresponding days in ds_mdl_hist and ds_obs over each year. Furthermore, I select not just the single day but a 30-day window around it. Example for day=20:
# Slice data corresponding to time step
k = 20
# Predictions
ds_pred_sub = ds_pred.isel(days=k)
# Pool a 30-day window around time step k (15 days before, 14 after)
window = 15
day_range = np.arange(k - window, k + window)
# Historical model
ds_mdl_hist_sub = ds_mdl_hist.isel(days=day_range)
# Observational
ds_obs_sub = ds_obs.isel(days=day_range)
In order to use apply_ufunc, I stack some dimensions.
# Stack dimensions in order to use apply_ufunc over only one dimension
ds_mdl_hist_sub = ds_mdl_hist_sub.stack(days_ens=("ens", "days", "year"))
ds_obs_sub = ds_obs_sub.stack(days_year=('days', 'year'))
My function for the bias correction includes the calculation of the distribution for both the subset of historical model data (ds_mdl_hist_sub) and observations (ds_obs_sub), as well as the interpolation with the subset of the predictions (ds_pred_sub). The function returns the corrected predictions (pred_corr).
from scipy.interpolate import interp1d

def bc_module(pred, mdl_hist, obs):
    # Quantile levels at which the distributions are evaluated
    p = np.linspace(0, 1, 1000)
    # Quantiles of the historical model
    q_mdl = np.quantile(mdl_hist, p, interpolation='midpoint')
    # Quantiles of the observations
    q_obs = np.quantile(obs, p, interpolation='midpoint')
    # Map the predictions onto probability space via the model quantiles
    Y_pred = interp1d(q_mdl, p)(pred)
    # Back-transform to data space via the observed quantiles
    pred_corr = interp1d(p, q_obs, bounds_error=False)(Y_pred)
    return pred_corr
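As a quick sanity check of the function's contract (my addition, with toy 1-D inputs): an ensemble vector plus flattened model and observation samples should yield one corrected value per ensemble member.

pred = 0.1 + 0.8 * np.random.rand(10)        # one value per ensemble member, kept inside the sample range
mdl_hist = np.random.rand(30 * 35 * 10)      # stacked days x years x ens sample
obs = np.random.rand(30 * 35)                # stacked days x years sample
print(bc_module(pred, mdl_hist, obs).shape)  # -> (10,)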
Because my real datasets contain much more data, I want to apply the bc_module in parallel by using apply_ufunc with dask="parallelized" over my "time" dimension, vectorizing over each lat, lon. This is my call:
# Xarray apply_ufunc
pred_corr = xr.apply_ufunc(
    bc_module, ds_pred_sub, ds_mdl_hist_sub, ds_obs_sub,
    input_core_dims=[['ens'], ['days_ens'], ['days_year']],
    output_core_dims=[['ens']],
    vectorize=True,
    dask="parallelized",
    output_dtypes=[np.float64]).compute()
This works, but it seems to me that my code is not that fast.
Open Questions
I really do not know how to loop over the 215 days (k) in an efficient way. A simple for loop is not sufficient, because I have several variables to correct and the run time on my real dataset is about 1 hour per variable. Is it possible to use dask for this as well, perhaps something like dask.delayed, since the problem is embarrassingly parallel? (A sketch follows these questions.)
In particular, I wonder whether I can use dask to parallelize my outer for loop over the 215 days when apply_ufunc already uses dask="parallelized".
Is there a more efficient way than using apply_ufunc? I really want to avoid nested loops over, first, lat and lon and, second, the 215 days.
Is it possible to add some arguments to my apply_ufunc call so that it includes the looping over the 215 days?
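One possible attack on the outer loop, offered as an untested sketch rather than a verified answer: factor the per-day slicing and the apply_ufunc call into a helper (correct_day below is hypothetical) and wrap it with dask.delayed, so all days are computed in one task graph.

import dask

def correct_day(k, window=15):
    # Hypothetical helper repeating the per-day slicing and apply_ufunc call above
    day_range = np.arange(k - window, k + window)
    mdl = ds_mdl_hist.isel(days=day_range).stack(days_ens=("ens", "days", "year"))
    obs = ds_obs.isel(days=day_range).stack(days_year=("days", "year"))
    return xr.apply_ufunc(
        bc_module, ds_pred.isel(days=k), mdl, obs,
        input_core_dims=[['ens'], ['days_ens'], ['days_year']],
        output_core_dims=[['ens']],
        vectorize=True)

# Build the graph lazily, then compute all interior days at once
delayed_days = [dask.delayed(correct_day)(k) for k in range(15, 200)]
pred_corr_all = dask.compute(*delayed_days)

Note that the edge days (k < 15 or k >= 200) would need their own window handling, and dask="parallelized" is dropped inside the helper so the two levels of parallelism do not compete; both choices are assumptions to verify against the real data.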
I currently have ~40 years' worth of daily ozone measurement datasets (3D arrays with dimensions time (24), latitude (361), and longitude (576), respectively). Each day has its own data file.
I then created a 2D array (361, 576) for each day, averaging all of the data from each hour.
My next goal is to create one plot for each day of the calendar year (January 1st, January 2nd, etc.) that ranges through all of the years in my dataset. I'm trying to show the trend of ozone on each day through each respective year. For example, my first plot would be the trend of the daily average on January 1st from the first year to the last year in my dataset.
dims = np.shape(TO3)  # Dimensions of original data (24, 361, 576)
avgTO3 = np.zeros((dims[1], dims[2]), dtype=float)  # New 2D array for daily averages
for i in range(TO3.shape[0]):
    np.add(TO3[i], avgTO3, out=avgTO3)
avgozone = avgTO3 / 24.0  # Final 2D array of the daily average
dailyavgdims = np.shape(avgozone)
dailyavgyear = np.zeros((dailyavgdims[0], dailyavgdims[1]), dtype=float)
dailyavgbyyear = dailyavgyear[..., np.newaxis, np.newaxis]
# This gives shape (361, 576, 1, 1); the intent was a 4D array (361, 576, 365, 40)
Within the 4D array, the third dimension represents the calendar day (so it would likely go to 365), and the 4th dimension represents the year (which would be around 40).
My question is how I can add each of the 2D arrays at specific positions in the 4D array. For example, how can I assign the daily average for January 1st, 1980 (the first possible day) to index [:, :, 0, 0] of the 4D array, then January 2nd, 1980's 2D array to [:, :, 1, 0], and so on? I'm finding it difficult, especially since I can't necessarily store these arrays anywhere else because of Linux. Any help is appreciated!
Sidenote: I know my code isn't too condensed, but that's not something I'm terribly worried about at the moment. I'm still trying to learn the ins and outs of Python and Linux.
import numpy as np

years = 40
days = 365
# Random data for the example; you'd load an array or hardcode the dimensions
TO3 = np.random.randn(24, 361, 576)
# Average over the first (hourly) dimension
avgozone = TO3.mean(0)
# Create the empty 4D array (361, 576, 365, 40)
dailyavgbyyear = np.zeros((*avgozone.shape, days, years))
for y in range(years):
    for d in range(days):
        # Load the day's 3D array (random data for the example)
        TO3 = np.random.randn(24, 361, 576)
        avgozone = TO3.mean(0)  # mean over the hourly dimension
        dailyavgbyyear[:, :, d, y] = avgozone
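From there, plotting the trend for a given calendar day is a matter of slicing the 4D array; a small sketch (summarizing each year by the spatial mean is my assumption, not part of the original answer):

import matplotlib.pyplot as plt

jan1 = dailyavgbyyear[:, :, 0, :]               # January 1st of every year, shape (361, 576, 40)
plt.plot(range(years), jan1.mean(axis=(0, 1)))  # spatial mean per year
plt.xlabel('year index')
plt.ylabel('mean ozone')
plt.show()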
I need to calculate the SMA (simple moving average) for time series data.
In particular, I want to plot that averaged series on the x-axis and another array on the y-axis.
Before applying the SMA, these two arrays had the same length.
Then I found this example online to calculate the SMA:
# Python program to calculate
# simple moving averages using pandas
import pandas as pd
arr = [1, 2, 3, 7, 9]
window_size = 3
# Convert array of integers to pandas series
numbers_series = pd.Series(arr)
# Get the window of series
# of observations of specified window size
windows = numbers_series.rolling(window_size)
# Create a series of moving
# averages of each window
moving_averages = windows.mean()
# Convert pandas series back to list
moving_averages_list = moving_averages.tolist()
# Remove null entries from the list
final_list = moving_averages_list[window_size - 1:]
print(final_list)
I tried to replicate that in my case, but I obtain this error:
"ValueError: x and y must have same first dimension, but have shapes
(261228,) and (261237,)"
I paste a bit of my code, maybe it can be useful to understand better:
y_Int_40_dmRe=pd.Series(y_Int_40_dmRe)
windows_40_dmRe = y_Int_40_dmRe.rolling(window_size)
moving_averages_40_dmRe = windows_40_dmRe.mean()
moving_averages_40_dmRe_list = moving_averages_40_dmRe.tolist()
final_list_40_dmRe = moving_averages_40_dmRe_list[window_size - 1:]
plt.plot(final_list_40_dmRe,y_TotEn_40_dmRe, linewidth=2, label="40° - dmRe")
I'm here if you need more information, thank you in advance for your help
Chiara
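A likely fix, assuming the mismatch (261237 - 261228 = 9, i.e. window_size - 1 with window_size = 10) comes from the rolling mean dropping its first window_size - 1 entries: trim the y-array by the same amount before plotting.

plt.plot(final_list_40_dmRe, y_TotEn_40_dmRe[window_size - 1:], linewidth=2, label="40° - dmRe")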
I am trying to plot a time series of the sea surface temperature (SST) for a specific region from a .nc file. The SST is a three-dimensional variable (time, lat, lon) that holds mean daily values for a specific region from 1982 to 2016. I want my plot to reflect the seasonal SST variability over the entire period. I assume that what I need to do first is to obtain a mean SST value for my lat, lon region for each of the days, which I can then work with later on. So far, I assume that I need to read the .nc file and the variables:
import netCDF4 as nc
f = nc.Dataset('cmems_SST_MED_SST_L4_REP_OBSERVATIONS_010_021_1639073212518.nc')
sst = f.variables['analysed_sst'][:]
lon = f.variables['longitude'][:]
lat = f.variables['latitude'][:]
Next, following the code suggested here, I tried to reshape and obtain the mean, but an error pops up:
global_average= np.nanmean(sst[:,:,:],axis=(1,2))
annual_temp = np.nanmean(np.reshape(global_average, (34,12)), axis = 1)
#34 years between 1982 and 2016, and 12 months per year.
ERROR cannot reshape array of size 14008 into shape (34,12)
From here I found different approaches, like using cdo or nco (which didn't work due to installation problems), among others that were not suitable for my case. I used nanmean because I know that in MATLAB this is done using the nanmean function. I am quite new to this topic and would like to ask for some hints, like where I should focus or which path is more suitable for this case. Thank you!!
Handling daily data in pure Python is difficult, because you have to account for leap years, and subsetting a region requires tedious index striding.
As steTATO mentioned, since the data you are working with has daily temporal resolution, you need to consider the following:
You need to reshape global_average into the shape (34, 365) or (34, 366), depending on the year (the leap years being 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012 and 2016). So the code above would look something like
annual_temp = np.nanmean(np.reshape(global_average, (34, 365)), axis=1)
But, like I said, because of the leap years you cannot get what you want by simply reshaping global_average.
If I had no choice but to use python only, I'd do the following
import numpy as np

def days_in_year(in_year):
    leap_years = [1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]
    if in_year in leap_years:
        out_days = 366
    else:
        out_days = 365
    return out_days
# some of your code, importing netCDF data
year = np.arange(1982, 2017)
global_avg = np.nanmean(sst[:, :, :], axis=(1, 2))
annual_avgs = []
i = 0
for yr in range(35):
    i = i + days_in_year(year[yr])
    f = i - days_in_year(year[yr])
    annual_avg = np.nanmean(global_avg[f:i])  # f is the year's start index, i its end
    annual_avgs.append(annual_avg)
The code above steps through global_avg in year-sized strides, accounting for leap years, and collects each year's mean in annual_avgs.
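As a possible alternative (my suggestion, not from the original answer), pandas can handle the calendar for you, assuming the series really is contiguous daily data starting on 1982-01-01:

import pandas as pd

dates = pd.date_range('1982-01-01', periods=len(global_avg), freq='D')
series = pd.Series(global_avg, index=dates)
annual_avgs = series.resample('YS').mean()         # one mean per year
monthly_clim = series.groupby(dates.month).mean()  # 12-value seasonal cycle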
I have ten years of weather data, including maximum temperature (Tmax), minimum temperature (Tmin), rainfall and solar radiation (Ra) for each day.
First, I would like to calculate the evapotranspiration (ETo) for each day using the following equation:
ETo = 0.0023 * ((Tmax + Tmin)/2 + 17.8) * sqrt(Tmax - Tmin) * Ra
Then I want to calculate the monthly and yearly averages of all parameters (Tmax, Tmin, Rainfall, Ra and ETo) and export them in Excel format.
I have written some parts; could you possibly help me complete it? I think it may need a loop.
import numpy as np
import pandas as pd
import math as mh
# load the weather data file
data_file = pd.read_excel(r'weather data.xlsx', sheet_name='city_1')
# defining time
year = data_file['Year']
month = data_file['month']
day = data_file['day']
# defining weather parameters
Tmax = data_file.loc[:,'Tmax']
Tmin = data_file.loc[:,'Tmin']
Rainfall = data_file.loc[:,'Rainfall']
Ra = data_file.loc[:,'Ra']
# adjusting time to start at zero
year = year-year[0]
month=month-month[0]
day=day-day[0]
# calculation process for the estimation of evapotranspiration
# (np.sqrt works element-wise on a Series, unlike math.sqrt)
ET0 = 0.0023 * ((Tmax + Tmin)/2 + 17.8) * np.sqrt(Tmax - Tmin) * Ra
Looks like you've got one data row (record) per day.
Since you already have Tmax, Tmin, Rainfall, and Ra in each row, you could add a new ET0 column with the calculation like this (note axis=1, so the lambda receives one row at a time):
data_file['ET0'] = data_file.apply(lambda x: 0.0023*(((x.Tmax+x.Tmin)/2)+17.8)*(mh.sqrt(x.Tmax-x.Tmin))*x.Ra, axis=1)
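For the monthly and yearly averages and the Excel export, a minimal sketch (assuming the Year and month columns from the question, that an Excel writer such as openpyxl is installed, and placeholder file names):

cols = ['Tmax', 'Tmin', 'Rainfall', 'Ra', 'ET0']
monthly_avg = data_file.groupby(['Year', 'month'])[cols].mean()
yearly_avg = data_file.groupby('Year')[cols].mean()
monthly_avg.to_excel('monthly_averages.xlsx')
yearly_avg.to_excel('yearly_averages.xlsx')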