I have a set of monthly NetCDF files that together contain all months across many years (for example, from Jan 1948 to Dec 2018).
How can I use xarray to conveniently compute the seasonal average for each year?
There are examples that use GroupBy to calculate seasonal averages, but they seem to group all the months spanning many years into 4 groups, which can't give the seasonal average of every year.
It sounds like you are looking for a resample-type operation. Using the get_dpm function from the documentation example you linked to, I think something like the following should work:
month_length = xr.DataArray(
    get_dpm(ds.time.to_index(), calendar='standard'),
    coords=[ds.time],
    name='month_length'
)
result = ((ds * month_length).resample(time='QS-DEC').sum() /
          month_length.resample(time='QS-DEC').sum())
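As a side note, if the data is on a standard calendar, the month lengths can also come straight from xarray's datetime accessor, which avoids needing the get_dpm helper:
# alternative to get_dpm for a standard (proleptic Gregorian) calendar
month_length = ds.time.dt.days_in_month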
Using 'QS-DEC' frequency will split the data into consecutive three-month periods, anchored at December 1st.
If your data has missing values, you'll need to modify this weighted mean operation to account for that (i.e. we need to mask the month_length before taking the sum in the denominator):
result = ((ds * month_length).resample(time='QS-DEC').sum() /
          month_length.where(ds.notnull()).resample(time='QS-DEC').sum())
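If you then only want a single season, e.g. the DJF averages for each year, one small usage sketch is to filter the quarterly result by its season label:
# keep only the quarters whose season label is DJF (i.e. the Dec-anchored ones)
djf_means = result.where(result['time.season'] == 'DJF', drop=True)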
BACKGROUND
I am calculating racial segregation statistics between and within firms using the Theil Index. The data structure is a multi-indexed pandas dataframe. The calculation involves a lot of df.groupby()['foo'].transform(), where the transformation is the entropy function from scipy.stats. I have to calculate entropy on smaller and smaller groups within this structure, which means calling entropy more and more times on the groupby objects. I get the impression that this is O(n), but I wonder whether there is an optimization that I am missing.
EXAMPLE
The key part of this dataframe comprises five variables: county, firm, race, occ, and size. The units of observation are counts: each row tells you the SIZE of the workforce of a given RACE in a given OCCupation in a FIRM in a specific COUNTY. Hence the multiindex:
df = df.set_index(['county', 'firm', 'occ', 'race']).sort_index()
The Theil Index is the size-weighted sum of sub-units' entropy deviations from the unit's entropy. To calculate segregation between counties, for example, you can do this:
from scipy.stats import entropy
from numpy import where
# Helper to calculate the actual components of the Theil statistic
def Hcmp(w_j, w, e_j, e):
    return where(e == 0, 0, (w_j / w) * ((e - e_j) / e))
df['size_county'] = df.groupby(['county', 'race'])['size'].transform('sum')
df['size_total'] = df['size'].sum()
# Create a dataframe with observations aggregated over county/race tuples
counties = df.groupby(['county', 'race'])[['size_county', 'size_total']].first()
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4) # <--
# The base for entropy is 4 because there are four recorded racial categories.
# Assume that counties['entropy_total'] has already been calculated.
counties['seg_cmpnt'] = Hcmp(counties['size_county'], counties['size_total'],
                             counties['entropy_county'], counties['entropy_total'])
county_segregation = counties['seg_cmpnt'].sum()
Focus on this line:
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4)
The starting dataframe has 3,130,416 rows. When grouped by county, though, the resulting groupby object has just 2,267 groups. This runs quickly enough. When I calculate segregation within counties and between firms, the corresponding line is this:
firms['entropy_firm'] = firms.groupby('firm')['size_firm'].transform(entropy, base=4)
Here, the groupby object has 86,956 groups (the count of firms in the data). This takes about 40 times as long as the county-level calculation, which looks suspiciously like O(n). And when I try to calculate segregation within firms, between occupations...
# Grouping by firm and occupation because occupations are not nested within firms
occs['entropy_occ'] = occs.groupby(['firm', 'occ'])['size_occ'].transform(entropy, base=4)
...There are 782,604 groups. Eagle-eyed viewers will notice that this is exactly 1/4th the size of the raw dataset, because I have one observation for each firm/race/occupation tuple, and four racial categories. It is also nine times the number of groups in the by-firm groupby object, because the data break employment out into nine occupational categories.
This calculation takes about nine times as long: four or five minutes. When the underlying research project involves 40-50 years of data, this part of the process can take three or four hours.
THE PROBLEM, RESTATED
I think the issue is that, even though scipy.stats.entropy() is being applied in a smart, vectorized way, the necessity of calculating it over a very large number of small groups--and thus calling it many, many times--is swamping the performance benefits of vectorized calculations.
I could pre-calculate the necessary logarithms that entropy requires, for example with numpy.log(). If I did that, though, I'd still have to group the data to first get each firm/occupation/race's share within the firm/occupation. I would also lose any advantage of readable code that looks similar at different levels of analysis.
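For what it's worth, a rough, untested sketch of that pre-computed-log idea for the by-firm case might look like the following. It assumes the firms frame with a 'firm' key and a 'size_firm' column from above, and reproduces base-4 entropy from grouped sums alone, so entropy is never called per group:
import numpy as np

# within-firm shares
firm_total = firms.groupby('firm')['size_firm'].transform('sum')
share = firms['size_firm'] / firm_total
# -p*log(p), treating 0*log(0) as 0
firms['plogp'] = -share * np.log(share.mask(share <= 0, 1.0))
# base-4 entropy of each firm = sum of -p*log(p) over the group, divided by log(4)
firms['entropy_firm'] = firms.groupby('firm')['plogp'].transform('sum') / np.log(4)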
Thus my question, stated as clearly as I can: is there a more computationally efficient way to call something like this entropy function, when calculating it over a very large number of relatively small groups in a large dataset?
I downloaded some stock data from CRSP and need the variance of each company's stock returns over the last 36 months.
So, basically the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks (stocks with prices < $2) from my sample. Hence, some months are missing and, for example, April's and June's monthly returns end up directly on top of each other.
If I am not mistaken, a rolling function (grouped by PERMCO) would just take the previous 36 monthly returns. But when months are missing, the rolling function would actually cover more than 3 years of data (since the last 36 monthly returns would exceed that timeframe).
Usually I work with MS Excel. However, in this case the amount of data is too big and Excel takes forever to calculate anything. That's why I want to tackle the problem with Python.
The sample is organized as follows:
PERMNO date SHRCD PERMCO PRC RET
When I have figured out how to make a proper table in here I will show you a sample of my data.
What I have tried so far:
data["RET"]=data["RET"].replace(["C","B"], np.nan)
data["date"] = pd.to_datetime(date["date"])
data=data.sort_values[("PERMCO" , "date"]).reset_index()
L3Yvariance=data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are C and B instead of actual returns; that's why the first line replaces them with NaN.
You can replace the missing values with the mean value. It won't affect the variance, as the variance is calculated after subtracting the mean, so for the periods where you don't have a value, the contribution to the variance will be 0.
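A rough sketch of that idea (assuming data['date'] is already parsed to datetime and the frame has the 'PERMCO' and 'RET' columns from the question) would be to put each company on a complete monthly calendar first, fill the gaps, and only then roll over 36 rows:
import pandas as pd

def rolling_var_36m(grp):
    # complete monthly calendar, so missing months become explicit NaNs
    monthly = grp.set_index('date')['RET'].resample('MS').mean()
    # fill the missing months with the group mean, as suggested above
    monthly = monthly.fillna(monthly.mean())
    # now 36 consecutive rows really span 36 calendar months
    return monthly.rolling(36, min_periods=36).var()

L3Yvariance = data.groupby('PERMCO', group_keys=True).apply(rolling_var_36m)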
I'm currently working with CESM Large Ensemble data on the cloud (as in https://medium.com/pangeo/cesm-lens-on-aws-4e2a996397a1) using xarray and Dask, and am trying to plot the trends in extreme precipitation in each season over the historical period (Dec-Jan-Feb and Jun-Jul-Aug specifically).
Eg. If one had a daily time-series data split into months like:
1920: J,F,M,A,M,J,J,A,S,O,N,D
1921: J,F,M,A,M,J,J,A,S,O,N,D
...
My aim is to group together the JJA days in each year and then take the maximum value within that group of days for each year. Ditto for DJF, however here you have to be careful because DJF is a year-skipping season; the most natural way to define it is 1921's DJF = 1920 D + 1921 JF.
Using iris this would be simple (though quite inefficient): you could just add auxiliary time coordinates for season and season_year, group by/aggregate over those two coordinates, and take a maximum. This would give you a (year, lat, lon) output where each year contains the maximum of the precipitation field in the chosen season (e.g. the maximum DJF precip in 1921 in each lat, lon pixel).
However in xarray this operation is not as natural because you can't natively groupby multiple coordinates, see https://github.com/pydata/xarray/issues/324 for further info on this. However, in this github issue someone suggests a simple, nested workaround to the problem using xarray's .apply() functionality:
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(
            nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn)
I'd be quite keen to try and use this workaround myself, but I have two main questions beforehand:
1) I can't seem to work out how to group by coordinates in such a way that the DJF maximum isn't taken within a single calendar year.
Eg. If one simply applies the function like (for a suitable xr_max() function):
outp = nested_groupby_apply(daily_prect, ['time.season', 'time.year'], xr_max)
outp_djf = outp.sel(season='DJF')
Then you effectively define 1921's DJF as 1921 D + 1921 JF, which isn't actually what you want to look at! This is because the 'time.year' grouping doesn't account for the year-skipping behaviour of seasons like DJF. I'm not sure how to work around this.
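One idea I haven't tested (a rough sketch, assuming daily_prect has a datetime64 'time' coordinate) is to build a 'season_year' coordinate that assigns December to the following year, and group on that instead of 'time.year':
# untested idea: December is counted towards the following year's DJF
season_year = daily_prect['time'].dt.year + (daily_prect['time'].dt.month == 12)
daily_prect = daily_prect.assign_coords(season_year=season_year)
# then, e.g.: nested_groupby_apply(daily_prect, ['time.season', 'season_year'], xr_max)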
2) This nested groupby function is incredibly slow! As such, I was wondering if anyone in the community had found a more efficient solution to this problem, with similar functionality?
Thanks ahead of time for your help, all! Let me know if anything needs clarifying.
EDIT: Since posting this, I've discovered there already is a workaround for this in the specific case of taking DJF/JJA means each year (Take maximum rainfall value for each season over a time period (xarray)), however I'm keeping this question open because the general problem of an efficient workaround for multi-coord grouping is still unsolved.
Could you please help me with this issue? I have searched a lot but cannot solve it. I have a multivariate dataframe of electricity consumption and I am forecasting it using a VAR (Vector Auto-Regression) model for time series.
I made the predictions, but I need to reverse the transformation of the time series (energy_log_diff), since I applied a seasonal log difference to make the series stationary, in order to get back the real energy values:
df['energy_log'] = np.log(df['energy'])
df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(1)
For that, I did first:
df['energy'] = np.exp(df['energy_log_diff'])
This is supposed to give the energy difference between two values lagged by 365 days, but I am not sure about that either.
How can I do this?
The reason we use log differences is that they are additive, so we can take the cumulative sum and then multiply by the last observed value.
# cumulative sum of the log differences, back-transformed and scaled
# by the last observed energy value
last_energy = df['energy'].iloc[-1]
df['energy'] = np.exp(df['energy_log_diff'].cumsum()) * last_energy
As for seasonality: if you de-seasoned the log diff, simply add (or multiply) the seasonal component back before the step above; if you de-seasoned the original series, add it back afterwards.
Short answer - you have to run inverse transformations in the reversed order which in your case means:
Inverse transform of differencing
Inverse transform of log
How to convert differenced forecasts back is described, e.g., here (the question is tagged R, but there is no code and the idea is the same for Python). In your post you calculate the exponential, but you have to reverse the differencing first before doing that.
You could try this:
energy_log_diff_rev = []
v_prev = v_0
for v in df['energy_log_diff']:
    v_prev += v
    energy_log_diff_rev.append(v_prev)
Or, if you prefer the pandas way, you can try this (only for a first-order difference):
energy_log_diff_rev = df['energy_log_diff'].expanding(min_periods=0).sum() + v_0
Note the v_0 value: it is the original value (after the log transformation, before differencing), as described in the link above.
Then, after this step, you can do the exponential (inverse of log):
energy_orig = np.exp(energy_log_diff_rev)
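To make the two steps concrete, here is a small made-up example (the numbers are arbitrary; v_0 again stands for the last observed log value before the forecast window):
import numpy as np
import pandas as pd

pred_log_diff = pd.Series([0.01, -0.02, 0.03])  # hypothetical predicted energy_log_diff values
v_0 = np.log(100.0)                             # log of the last observed energy value

pred_log = pred_log_diff.cumsum() + v_0         # step 1: undo the differencing
pred_energy = np.exp(pred_log)                  # step 2: undo the log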
Notes/Questions:
You mention values lagged by 365 days, but you are shifting the data by 1. Does that mean you have yearly data? Or would you rather do df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(365) instead (in case of daily granularity of data)?
You want to get the reversed time series from the predictions, is that right? Or am I missing something? In that case you would apply the inverse transformations to the predictions, not to the data I used above for explanation.
I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
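A rough sketch of this approach (assuming two NumPy arrays, timestamps as sortable integers and values in the 40-200 range) might be:
import numpy as np

bin_idx = np.clip((values - 40) // 10, 0, 15)      # 16 value bins: 40-49, 50-59, ...
order = np.lexsort((timestamps, bin_idx))          # sort by bin first, then by time
starts = np.searchsorted(bin_idx[order], np.arange(17))
# rows order[starts[k]:starts[k+1]] belong to value bin k and are sorted by time,
# so np.searchsorted on their timestamps locates any year/month/week range quickly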
For this task, there is a function in NumPy, numpy.bincount. It is blazingly fast: fast enough that you can create a bin for each integer value (161 bins) and each day (maybe 30,000 different days?), resulting in a few million bins.
The procedure:
calculate an integer index for each bin (e.g. 17 x the number of days since the first day in the file + (value - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data the integer bin calculation code could be something like this:
# let us assume we have the data as:
# timestamps: 64-bit integer (seconds since something)
# values: 8-bit unsigned integer with integers between 40 and 200
# find the first day in the sample
first_day = np.min(timestamps) // 86400   # 86400 seconds per day
# compute a combined (day, value-bin) index for every datapoint
indices = (timestamps // 86400 - first_day) * 17 + ((values - 40) // 10)
# get the bincount vector
b = np.bincount(indices)
# calculate the number of days in the sample
no_days = (len(b) + 16) // 17
# reshape b into (number of days, 17)
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing this most of the time is spent in calculating the indices (around 400 ms with an i7 processor). If that needs to be reduced, it can be done in approximately 100 ms with numexpr module. However, the actual implementation depends really heavily on the form of timestamps; some are faster to calculate, some slower.
However, I doubt if any other binning method will be faster if the data is needed up to the daily level.
I did not quite understand from your question whether you wanted separate views on the data (one by year, one by week, etc.) or some other binning method. In any case, that boils down to summing the relevant rows together.
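For example, a possible (untested) sketch of that row-summing step, reusing the b matrix and first_day from the snippet above and assuming the timestamps count seconds since the Unix epoch:
import numpy as np

day_numbers = first_day + np.arange(b.shape[0])                 # day number of each row of b
months = np.array(day_numbers, dtype='datetime64[D]').astype('datetime64[M]')
uniq_months, inverse = np.unique(months, return_inverse=True)
monthly = np.zeros((len(uniq_months), b.shape[1]), dtype=b.dtype)
np.add.at(monthly, inverse, b)                                  # sum the daily rows per month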
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np
dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40,200, len(dates))
years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')
from grouping import group_by
bins = np.linspace(40,200,17)
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
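For instance, a possible pandas sketch of the same monthly histograms, reusing the dates and values arrays generated above:
import numpy as np
import pandas as pd

s = pd.Series(values, index=pd.DatetimeIndex(dates))
bins = np.linspace(40, 200, 17)
# one 16-bin histogram per calendar month; swap 'MS' for 'YS' or 'W' as needed
monthly_hist = s.groupby(pd.Grouper(freq='MS')).apply(
    lambda g: pd.Series(np.histogram(g, bins=bins)[0]))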