I am new to Python and am using it to analyse climate data in NetCDF format. I want to calculate the total precipitation for each season in each year and then average these seasonal totals across the time period (i.e. an average for DJF over all years in the file, an average for MAM, etc.).
Here is what I thought to do:
import matplotlib.pyplot as plt
import xarray as xr

fn1 = 'cru_fixed.nc'
ds1 = xr.open_dataset(fn1)
ds1_season = ds1['pre'].groupby('time.season').mean('time')
# Then plot each season
ds1_season.plot(col='season')
plt.show()
The original file contains monthly totals of precipitation. The code above calculates an average for each season, but I need the sum of Dec, Jan and Feb, the sum of Mar, Apr and May, etc. for each season in each year. How do I sum and then average over the years?
If I'm not mistaken, you first need to resample your data so that the DataArray holds the sum for each season, and then average those sums over the years.
To resample:
sum_of_seasons = ds1['pre'].resample(time='QS-DEC').sum(dim="time")
resample upsamples or downsamples a time series; it uses pandas time offsets.
However, be careful to choose the right offset, because it defines which months fall into each season. Depending on your needs, you may want "Q", "QS" or an anchored offset like "QS-DEC".
"QS-DEC" starts the quarters on 1 December, 1 March, 1 June and 1 September, which gives the same split as "time.season" (DJF, MAM, JJA, SON). Note that the first and last bins may cover fewer than three months if the file does not start in December.
Then to group over multiple years, same as you did above:
result = sum_of_seasons.groupby('time.season').mean('time')
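The two steps above can be checked on synthetic data. This is only a minimal sketch: the monthly values are made up (each month's "precipitation" equals its month number), and the series deliberately starts in December so that every season bin holds three full months.

```python
import pandas as pd
import xarray as xr

# Synthetic monthly "precipitation": each value is simply the month number.
# Starting on 1 December means every DJF bin contains three full months.
times = pd.date_range("2000-12-01", periods=36, freq="MS")
pre = xr.DataArray(
    times.month.values.astype(float),
    coords={"time": times},
    dims="time",
    name="pre",
)

# Step 1: one total per season per year (quarters start 1 Dec, 1 Mar, 1 Jun, 1 Sep).
sum_of_seasons = pre.resample(time="QS-DEC").sum(dim="time")

# Step 2: average those seasonal totals over the years, grouped by season label.
result = sum_of_seasons.groupby("time.season").mean("time")

# DJF total is 12 + 1 + 2 in every year, so its multi-year mean is the same.
djf_mean_total = float(result.sel(season="DJF"))
```

With a real file, `pre` would be `ds1['pre']`; the rest stays the same apart from any partial seasons at the start or end of the record.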
I just downloaded a huge database of daily data for a climate station in Canada Env, and I thought it would be "easier" to write code that gives me the monthly averages for a period of 30 years, and then do the same for 15 more stations. I am also just starting to learn Python in Spyder, so I would like some ideas about which functions to use or how the code could look. The headings of my csv are: YYYY, MM, DD, Total Rain, Total Precip, Total Snow, Max Temp, Min Temp and Mean Temp. Here is an example csv: https://drive.google.com/file/d/1EAbqos89dQXxOpy_hg6v-P2qFKPS37h5/view?usp=sharing
What I would like to have is the average precipitation and mean temperature for each month of the year (1 to 12). For precipitation, I first need to sum the daily precipitation for each month, and then average that total for the same month over all the years of data. For temperature, I need to average the monthly averages (so an average of all the data for all the months gives the exact same result). Once this is done, I need to plot both sets of data (precipitation and temperature) using abbreviated month names, and also write another csv with the mean monthly temperature and precipitation for each year (from 1980 to 2010, for example).
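One possible way to do what is described above with pandas is sketched here. The column names come from the question; the data are synthetic (random values for 1980 to 1982) so the example runs on its own, and the names `per_year` and `climatology` are only illustrative.

```python
import calendar

import numpy as np
import pandas as pd

# Synthetic daily records with the same column names as the csv in the question.
dates = pd.date_range("1980-01-01", "1982-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "YYYY": dates.year,
        "MM": dates.month,
        "DD": dates.day,
        "Total Precip": rng.gamma(0.5, 4.0, size=len(dates)),
        "Mean Temp": rng.normal(5.0, 10.0, size=len(dates)),
    }
)

# Step 1: one row per (year, month): precipitation is summed, temperature averaged.
per_year = df.groupby(["YYYY", "MM"]).agg(
    precip_total=("Total Precip", "sum"),
    temp_mean=("Mean Temp", "mean"),
)

# Step 2: average the same calendar month over all years (12 rows).
climatology = per_year.groupby(level="MM").mean()

# Abbreviated month names for plotting labels.
climatology.index = [calendar.month_abbr[m] for m in climatology.index]
```

From here, `per_year.to_csv(...)` would write the year-by-year table and `climatology.plot(...)` would draw the monthly values; with the real file, `df` would come from `pd.read_csv` on the downloaded csv.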
For argument's sake, here is an example of auto_arima for daily data:
auto_arima(df['orders'],seasonal=True,m=7)
In that example, a seasonal decomposition has shown weekly seasonality, so I "think" you select 7 for m. Is this correct, given that the seasonality is weekly?
My first question is as follows - If seasonality is Monthly do you use 12? If it is Annually do you use 1? And is there ever a reason to select 365 for daily?
Secondly, if the data you are given is already weekly, e.g.
date weekly tot
2021/01/01 - 10,000
2021/01/07 - 15,000
2021/01/14 - 9,000
and so on......
and you do the seasonal decomposition, would m=1 be used for weekly, m=4 for monthly and m=52 for annual seasonality?
Finally, if it's monthly like so:
date monthly tot
2020/01/01- 10,000
2020/02/01- 15,000
2020/03/01- 9,000
and so on......
and you do the seasonal decomposition, would m=1 be used for monthly and m=12 for annual seasonality?
Any help would be greatly appreciated, I just want to be able to confidently select the right criteria.
A season is a recurring pattern in your data and m is the length of that season. m in that case is not a code or anything but simply the length:
Imagine the weather: if you had the weekly average temperature, it would rise in the summer and fall in the winter. Since the length of one "season" is a year, or 52 weeks, you set m to 52.
If you had a pattern repeating every quarter, then m would be 13, since a quarter is about 13 weeks (52 / 4). It always depends on your data and your use case.
To your questions:
If seasonality is Monthly do you use 12?
If your data is monthly and the pattern you are looking for repeats every 12 months, then yes. If it repeats every 3 months, m would be 3, and so on.
If it is Annually do you use 1?
A seasonality of 1 does not really make sense, since it would mean that you have a repeating pattern in every single data point.
And is there ever a reason to select 365 for daily?
If your data is daily and the pattern repeats every 365 days (meaning every year) then yes (you need to remember that every fourth year has 366 days though).
I hope you get the concept behind seasonality and m so you can answer the rest.
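To make "m is the number of observations in one cycle" concrete, here is a rough NumPy-only check on synthetic daily data with a weekly pattern. It is not part of auto_arima; the `autocorr` helper and the numbers in `weekly_pattern` are made up for illustration. The lag with the highest autocorrelation is a reasonable candidate for m.

```python
import numpy as np

# Synthetic daily "orders" with a weekly pattern (busy weekends) plus small noise.
rng = np.random.default_rng(42)
weekly_pattern = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 30.0, 28.0])
orders = np.tile(weekly_pattern, 20) + rng.normal(0.0, 1.0, size=140)


def autocorr(x, lag):
    """Correlation between the series and a copy of itself shifted by `lag` steps."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]


# Score a handful of candidate lags; the strongest one suggests the season length.
candidate_lags = range(1, 11)
scores = {lag: autocorr(orders, lag) for lag in candidate_lags}
best_lag = max(scores, key=scores.get)
```

Here the best lag is 7 because the data are daily and the pattern repeats weekly. The same weekly pattern stored as one value per week would show no seasonality of that kind at all, which is why m depends on both the sampling frequency and the cycle you care about.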
I have a DataFrame series with daily resolution. I want to transform it into a series of monthly averages. Of course, I could apply a rolling mean and keep only every 30th value, but that would not be precise. I want a series that holds, on the first day of each month, the mean value of the previous month. For example, on February 1 I want the daily average for January. How can I do this in a pythonic way?
# the how= argument has been removed; on pandas 2.2 and later the month-end alias is 'ME'
data.resample('M').mean()
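The resample call labels each monthly mean with a date inside or at the end of its own month, whereas the question asks for the value to sit on the first day of the following month. One way to get that, sketched here on made-up data (every day in a month is given that month's number as its value), is to resample to month starts and then shift the labels forward by one month.

```python
import pandas as pd

# Synthetic daily series: every day in January is 1.0, February 2.0, March 3.0.
days = pd.date_range("2021-01-01", "2021-03-31", freq="D")
data = pd.Series(days.month.to_numpy(dtype=float), index=days)

# Mean of each calendar month, labelled with the first day of that month.
monthly_mean = data.resample("MS").mean()

# Move every label forward by one month start, so 1 Feb carries January's mean.
previous_month_mean = monthly_mean.shift(1, freq="MS")

feb_first = previous_month_mean.loc[pd.Timestamp("2021-02-01")]
```

Shifting with `freq` moves the index rather than the values, so no value is lost and the last label here is 1 April, holding March's mean.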
I need to calculate the 90th percentile of temperature data for 1961-1990. I have 30 NetCDF files, each containing daily data for one year. I need to calculate the 90th percentile at a specific latitude and longitude, using only the summer days pooled from all 30 years of daily data. I also need to account for leap years, when February has 29 days. When I run my code, it only considers the first summer (1961) instead of all summer days together.
import numpy as np
import xarray as xr

data = xr.open_mfdataset('/Tmax-2m/Control/*.nc')
time = data.variables['time']
lon = data.variables['lon'][:]
lat = data.variables['lat'][:]
tmax = data.variables['tmax'][:]
df = data.sel(lat=39.18, lon=-95.57, method='nearest')
time2 = df.variables['time'][151:243]
dg = df.sel(time=time2, method='nearest')
print(np.percentile(dg.tmax, 90))
I also tried it this way, but it calculates a separate percentile for the summer of each year:
splits=[151,516,881,1247,1612,1977,2342,2708,3073,3438,3803,4169,4534,4899,5264,5630,5995,6360,6725,7091,7456,7821,8186,8552,8917,9282,9647,10013,10378,10743]
t0=92
result=[]
for i in splits:
    time3 = df.variables['time'][i:(i + t0)]
    dg = df.sel(time=time3, method='nearest')
    result.append(np.percentile(dg.tmax, 90))
np.savetxt("percentile1.csv", result, fmt="%s")
Have you considered using CDO for this task? (On Linux this is easy; on Windows you will probably need to install it under Cygwin.)
You can merge the 30 files into one timeseries like this:
cdo mergetime file_y*.nc timeseries.nc
Here the * is a wildcard for the year (1961, 1962, etc.) in the filename, which I am assuming is file_y1961.nc, file_y1962.nc, etc.; adapt as appropriate. timeseries.nc is the output file.
Then calculate the seasonal percentiles like this:
cdo yseaspctl,90 timeseries.nc -yseasmin timeseries.nc -yseasmax timeseries.nc percen.nc
percen.nc will contain the seasonal percentiles, and you can extract the one for summer.
Further details here: https://code.mpimet.mpg.de/projects/cdo/
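If you would rather stay in Python, the pooling can also be done in xarray by selecting days by calendar month instead of by hard-coded index positions, which handles leap years automatically. The sketch below uses a synthetic single-point `tmax` series (four years, random values around a seasonal cycle) as a stand-in for the real data, and it assumes "summer" means June to August.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily Tmax at one grid point; 1964 is a leap year.
time = pd.date_range("1961-01-01", "1964-12-31", freq="D")
rng = np.random.default_rng(0)
seasonal_cycle = 20.0 - 10.0 * np.cos(
    2 * np.pi * (time.dayofyear.to_numpy() - 15) / 365.25
)
tmax = xr.DataArray(
    seasonal_cycle + rng.normal(0.0, 3.0, size=len(time)),
    coords={"time": time},
    dims="time",
    name="tmax",
)

# Keep every June, July and August day from all years at once.
is_summer = tmax["time"].dt.month.isin([6, 7, 8])
summer = tmax.where(is_summer, drop=True)

# One 90th percentile over the pooled summer days.
p90 = float(summer.quantile(0.9, dim="time"))
```

With the files from the question, `tmax` would instead be something like `data['tmax'].sel(lat=39.18, lon=-95.57, method='nearest')` after `xr.open_mfdataset`; the selection and quantile steps would stay the same.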
How would I use pandas to calculate a cumulative deviation from a mean monthly rainfall value?
I am given daily rainfall data (e.g. s, below) which I can convert to a pd.Series and resample into monthly periods (sum; e.g. sm, below). But I then want to calculate the difference between each monthly value and the mean for the month. I have added a synthetic example:
import numpy as np
import pandas as pd
rng = pd.period_range('2001-01-01', '2013-12-31', freq='D')
s = pd.Series(np.random.normal(2.5, 2, size=len(rng)), index=rng)
sm = s.resample('M').sum()
For example, for January 2010 I would like to calculate the difference between the value for that month and the average monthly rainfall for January (over a long period). Then I want a cumulative sum of that difference.
I have tried to use the groupby function:
sm.groupby(lambda x: x.month).mean()
But not successfully. From each monthly value in 'sm' I want to subtract the average over all the same calendar months, and then take a cumulative sum of the resulting series. This could be done in one step, I guess.
How could I achieve this efficiently?
Thanks
This is closely related to an example in the docs. This is untested code, but you want something like this:
monthly_rainfall = daily_rainfall.resample('M').sum()
To group all Januarys over all the years together (and so on for each month):
grouped = monthly_rainfall.groupby(lambda x: x.month)
Then
deviation = grouped.transform(lambda x: x - x.mean())
deviation.cumsum()
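Put together as a self-contained sketch with current pandas syntax, the steps look like this. The rainfall values are synthetic, and a DatetimeIndex with month-start labels is used here instead of the PeriodIndex from the question, purely as an assumption to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic daily rainfall for 2001-2013.
days = pd.date_range("2001-01-01", "2013-12-31", freq="D")
rng = np.random.default_rng(1)
daily_rainfall = pd.Series(rng.gamma(0.8, 3.0, size=len(days)), index=days)

# Monthly totals, labelled with the first day of each month.
monthly_rainfall = daily_rainfall.resample("MS").sum()

# Group all Januaries together, all Februaries together, and so on.
grouped = monthly_rainfall.groupby(monthly_rainfall.index.month)

# Subtract each calendar month's long-term mean from its own values.
deviation = grouped.transform(lambda x: x - x.mean())

# Running total of the deviations, in time order.
cumulative_deviation = deviation.cumsum()
```

Because the deviations within each calendar month sum to zero, the cumulative series returns to approximately zero at the end of a record that covers whole years.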