I'm currently working with CESM Large Ensemble data on the cloud (à la https://medium.com/pangeo/cesm-lens-on-aws-4e2a996397a1) using xarray and Dask, and I'm trying to plot the trends in extreme precipitation in each season over the historical period (Dec-Jan-Feb and Jun-Jul-Aug specifically).
E.g. if one had daily time-series data split into months like:
1920: J,F,M,A,M,J,J,A,S,O,N,D
1921: J,F,M,A,M,J,J,A,S,O,N,D
...
My aim is to group together the JJA days in each year and then take the maximum value within that group of days for each year. Ditto for DJF; however, here you have to be careful because DJF is a year-skipping season: the most natural way to define it is 1921's DJF = 1920 D + 1921 JF.
Using iris this would be simple (though quite inefficient): you could just add auxiliary time coordinates for season and season_year, aggregate/groupby those two coordinates, and take a maximum. This would give you a (year, lat, lon) output where each year contains the maximum of the precipitation field in the chosen season (e.g. maximum DJF precip in 1921 in each lat, lon pixel).
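For reference, the iris version I have in mind is roughly the following (a sketch from memory, not tested here; cube stands in for a daily precipitation cube loaded elsewhere):
import iris
import iris.coord_categorisation as icc
# add auxiliary season / season_year coordinates; add_season_year assigns December to the following year
icc.add_season(cube, 'time', name='season')
icc.add_season_year(cube, 'time', name='season_year')
# maximum over each (season, season_year) group, keeping lat/lon
seasonal_max = cube.aggregated_by(['season', 'season_year'], iris.analysis.MAX)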
However, in xarray this operation is not as natural, because you can't natively group by multiple coordinates; see https://github.com/pydata/xarray/issues/324 for further info on this. In that GitHub issue, someone suggests a simple nested workaround to the problem using xarray's .apply() functionality:
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn)
I'd be quite keen to try and use this workaround myself, but I have two main questions beforehand:
1) I can't seem to work out how to group by coordinates such that I don't take the maximum of DJF within a single calendar year.
E.g. if one simply applies the function like this (for a suitable xr_max() function):
outp = nested_groupby_apply(daily_prect, ['time.season', 'time.year'], xr_max)
outp_djf = outp.sel(season='DJF')
Then you effectively define 1921's DJF as 1921 D + 1921 JF, which isn't actually what you want to look at! This is because the 'time.year' grouping doesn't account for the year-skipping behaviour of seasons like DJF. I'm not sure how to work around this.
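To be concrete about the grouping I'm after, the year coordinate would need to be shifted so that December counts towards the following year, something like this rough (untested) sketch:
# daily_prect as above; credit December to the following year so 1920 D + 1921 JF share season_year 1921
season_year = daily_prect['time.year'] + (daily_prect['time.month'] == 12)
daily_prect = daily_prect.assign_coords(season_year=season_year)
djf = daily_prect.where(daily_prect['time.season'] == 'DJF', drop=True)
djf_max = djf.groupby('season_year').max('time')   # (season_year, lat, lon)
But I don't see how to fold a shifted coordinate like this into the nested multi-coordinate groupby above.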
2) This nested groupby function is incredibly slow! As such, I was wondering if anyone in the community had found a more efficient solution to this problem, with similar functionality?
Thanks ahead of time for your help, all! Let me know if anything needs clarifying.
EDIT: Since posting this, I've discovered there already is a workaround for this in the specific case of taking DJF/JJA means each year (Take maximum rainfall value for each season over a time period (xarray)). However, I'm keeping this question open because the general problem of an efficient workaround for multi-coordinate grouping is still unsolved.
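For completeness, the linked workaround is (as far as I understand it) based on resample with quarters anchored to December, roughly:
# 'QS-DEC' quarters start in December, so Dec 1920 + Jan/Feb 1921 land in the same quarter
quarterly_max = daily_prect.resample(time='QS-DEC').max()
# keep only the quarters that start in December, i.e. DJF (ignoring incomplete first/last quarters)
djf_max = quarterly_max.where(quarterly_max['time.month'] == 12, drop=True)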
I am currently working on a course in Data Science on how to win data science competitions. The final project is a Kaggle competition that we have to participate in.
My training dataset has close to 3 million rows, and one of the columns is a "date of purchase" column.
I want to calculate the distance of each date to the nearest public holiday.
E.g. if the date is 31/12/2014, the nearest PH would be 01/01/2015. The number of days apart would be "1".
I cannot think of an efficient way to do this operation. I have a list of Timestamps, each of which is a public holiday in Russia (the dataset is from Russia).
def dateDifference(target_date_raw):
    abs_deltas_from_target_date = np.subtract(russian_public_holidays, target_date_raw)
    abs_deltas_from_target_date = [i.days for i in abs_deltas_from_target_date if i.days >= 0]
    index_of_min_delta_from_target_date = np.min(abs_deltas_from_target_date)
    return index_of_min_delta_from_target_date
where 'russian_public_holidays' is the list of public holiday dates and 'target_date_raw' is the date for which I want to calculate distance to the nearest public holiday.
This is the code I use to create a new column in my DataFrame for the difference of dates.
training_data['closest_public_holiday'] = [dateDifference(i) for i in training_data['date']]
This code ran for nearly 25 minutes and showed no signs of completing, which is why I turn to you guys for help.
I understand that this is probably the least Pandorable way of doing things, but I couldn't really find a clean way of operating on a single column during my research. I saw a lot of people say that using the "apply" function on a single column is a bad way of doing things. I am very new to working with such large datasets, which is why clean and efficient practices seem to elude me for now. Please do let me know what would be the best way to tackle this!
Try this and see if it helps with the timing. I worry that it will take up too much memory; I don't have the data to test, so you'll have to try it.
import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range('01/01/2021', '12/31/2021', freq='M'), columns=['Date'])
holidays = pd.to_datetime(np.array(['1/1/2021', '12/25/2021', '8/9/2021'])).to_numpy()

# assuming holidays: 1/1/2021, 8/9/2021, 12/25/2021
df['Days Away'] = (
    np.min(np.absolute(df.Date.to_numpy().reshape(-1, 1) - holidays), axis=1)
    / np.timedelta64(1, 'D')
)
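If the (rows x holidays) broadcast does turn out to eat too much memory, a sorted-holidays/np.searchsorted variant along the same lines might be worth trying (untested sketch):
hol = np.sort(holidays)
dates = df.Date.to_numpy()
pos = np.searchsorted(hol, dates)                        # index of the next holiday >= each date
prev_hol = hol[np.clip(pos - 1, 0, len(hol) - 1)]        # nearest holiday at or before the date
next_hol = hol[np.clip(pos, 0, len(hol) - 1)]            # nearest holiday at or after the date
df['Days Away'] = (np.minimum(np.abs(dates - prev_hol), np.abs(next_hol - dates))
                   / np.timedelta64(1, 'D'))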
I have a pandas timeseries y that does not work well with statsmodels functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following warning:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my timeseries is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m' or freq='D', for example. What is the workaround in this case?
I am new to working with time series; any advice on how to keep my index from being ignored during forecasting would help, since as it stands no predictions are possible.
First of all, it is extremely important to understand what the relationship between the datetime column and the target column (rainfall) is. Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the timeseries is a kind of aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end-of-month), and you have two options:
You can disaggregate your data using either equal weighting or, even better, an interpolation, to create a continuous and relatively smooth timeseries. You can then fit your model on the daily time series and generate predictions, which will also naturally be daily in nature. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is with the code below.
import pandas as pd

ts.Date = pd.to_datetime(ts.Date, format='%d/%m/%y')
# days and rainfall difference between consecutive observations
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']
ts.set_index('timesteps', inplace=True)
# upsample to a daily index, forward-filling the bucket-level columns
ts = ts.resample('d').ffill()
# linear interpolation within each bucket, scaled by the bucket length
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']
print(ts.head(50))
daily_rain is now the target column, and the index (timesteps) is the timestamp.
The other option is to approximate the 1-10, 10-20, 20-(EOM) date ranges as roughly 10 days each, so that these are indeed equal timesteps. Of course statsmodels won't allow that directly, so you would need to reset the index to a mock datetime index for which you maintain a mapping. Below is what you would use in statsmodels as y, while maintaining a mapping back to your original dates. The freq will be 'd' (daily), and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
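A rough sketch of that mocking step (the mock start date is arbitrary; it only has to be contiguous and daily):
import pandas as pd
# mock daily index, plus a mapping back to the original irregular dates
mock_index = pd.date_range('2019-09-01', periods=len(y), freq='D')
mapping = pd.Series(y.index, index=mock_index)
y_mock = pd.Series(y.to_numpy(), index=mock_index, name=y.name)   # pass y_mock to SARIMAX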
I would recommend the first option, though, as it's just more accurate in nature. You can also try out other aggregation levels, both during model training and for your predictions. More control!
The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean that technically you do not have enough info to construct an accurate timeseries: your timesteps are not equidistant and you don't know what happened between them. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach you'd need to interpolate, but given that the target variable is rainfall, which has a lot of variation, I would highly discourage this!
As far as I can see, the package uses the frequency as a premise for everything, since it's a time-series problem.
So you will not be able to use it with data at irregular frequencies. In practice, you will have to make an assumption in your analysis to make your data fit. Some options are:
1) Consider 3 different analyses (1st days, 10th days, 20th days individually) and use a 30d frequency.
2) As your data points are roughly equally spaced, about 10 days apart, you can consider using some kind of interpolation and then resampling to a daily (1d) frequency, as in the sketch below. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
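A minimal sketch of option 2, assuming y is the series from the question with its DatetimeIndex:
# upsample the irregular series to a daily grid and fill the gaps by linear interpolation;
# the result has freq='D' and can be passed to SARIMAX without the warning
y_daily = y.resample('D').interpolate('linear')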
Either way, I just would like to point out that how you model your problem and your data is a key thing when dealing with time series, and with data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data came from) that you get a feeling for which approach will work better.
I have an array of data on how quickly people take action, measured in hours. I want to generate a table that tells me what % of users have taken action by the first hour, first day, first week, first month, etc.
I have used pandas.cut to categorize the values and give them group_names:
bins_hours = [0...]
group_names = [...]
hourlylook = pd.cut(av.date_diff, bins_hours, labels=group_names,right=False)
I then plotted hourlylook and got an awesome bar chart.
But I want to express this information cumulatively, too, in a table format. What's the best way to tackle this problem?
Have a look at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html
This should allow you to create a new column with the cumulative sum.
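For example, something along these lines should produce the cumulative table (assuming hourlylook and group_names from the question):
counts = hourlylook.value_counts().reindex(group_names)   # counts per bin, kept in bin order
cum_pct = counts.cumsum() / counts.sum() * 100            # cumulative % of users
print(cum_pct.to_frame('cumulative %'))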
I am trying to perform an interpolation based on the input data (test_data_Inputs) against the test_data data frame. The way I have it set up now, I do it by Peril: I first created a dataframe that only contains the fire peril (see below), then perform the interpolation on that specific peril group:
The goal is to have a column in test_data_Inputs that has both the Peril Type and the Factor. One of the issues I have been encountering is the situation where the amount of insurance in test_data_Inputs is a perfect match for a value in the test_data dataframe; it still interpolates regardless of whether it is a perfect match or not.
import pandas as pd
from scipy import interpolate

fire_peril_test = test_data[test_data['Peril Type'] == 'Fire']

x = fire_peril_test['Amount of Insurance']
y = fire_peril_test['Factor']
f = interpolate.interp1d(x, y)

xnew = test_data_Inputs['Amount of Insurance']
ynew = f(xnew)

# sample data:
test_data_Inputs = pd.DataFrame({'Amount of Insurance': [320000, 330000, 340000]})
test_data = pd.DataFrame({'Amount of Insurance': [300000, 350000, 400000, 300000, 350000, 400000],
                          'Peril Type': ['Fire', 'Fire', 'Fire', 'Water', 'Water', 'Water'],
                          'Factor': [.10, .20, .35, .20, .30, .40]})
Appreciate all of the assistance.
amount_of_insurance = pd.DataFrame()
df['Amount of Insurance'] = pd.melt(df['Amount of Insurance'], id_vars=['Amount Of Insurance'],
                                    var_name='Peril Type', value_name='Factor')
for peril in df['Amount of Insurance']['Peril Type'].unique():
    # peril='Fire'
    x = df['Amount of Insurance']['Amount Of Insurance'][df['Amount of Insurance']['Peril Type'] == str(peril)]
    y = df['Amount of Insurance']['Factor'][df['Amount of Insurance']['Peril Type'] == str(peril)]
    f = interpolate.interp1d(x, y)
    xnew = data_for_rater[['Enter Amount of Insurance']]
    ynew = f(xnew)
    append_frame = data_for_rater[['Group', 'Enter Amount of Insurance']]
    append_frame['Peril Type'] = str(peril)
    append_frame['Factor'] = ynew
    amount_of_insurance = amount_of_insurance.append(append_frame)
My solution with my actual data. Pretty much I melted the data to be able to loop through the unique Peril Types. If you guys have any alternatives let me know...
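One possible alternative, sketched against the sample frames from the question: loop over groupby('Peril Type') and use np.interp, so each peril is interpolated on its own curve (an exact match in the inputs simply returns the tabulated factor):
import numpy as np
import pandas as pd

results = []
for peril, grp in test_data.groupby('Peril Type'):
    grp = grp.sort_values('Amount of Insurance')            # np.interp needs increasing x
    factors = np.interp(test_data_Inputs['Amount of Insurance'],
                        grp['Amount of Insurance'], grp['Factor'])
    results.append(pd.DataFrame({'Amount of Insurance': test_data_Inputs['Amount of Insurance'],
                                 'Peril Type': peril,
                                 'Factor': factors}))
interpolated = pd.concat(results, ignore_index=True)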
I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
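A rough numpy sketch of that idea (assuming 1-D arrays timestamps and values, with values being integers between 40 and 200):
import numpy as np

bin_idx = (values - 40) // 10                            # value bins: 40-49, 50-59, ...
order = np.lexsort((timestamps, bin_idx))                # group by bin, sorted by time within each bin
bin_sorted, time_sorted = bin_idx[order], timestamps[order]
bin_starts = np.searchsorted(bin_sorted, np.arange(18))  # start offset of each bin; last entry is the array length
# time_sorted[bin_starts[k]:bin_starts[k + 1]] is sorted, so np.searchsorted on that slice
# locates any year/month/week boundary in logarithmic time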
In order to do this, there is a function in numpy, numpy.bincount. It is blazingly fast. It is so fast that you can create a bin for each integer (161 bins) and day (maybe 30000 different days?) resulting in a few million bins.
The procedure:
calculate an integer index for each bin (e.g. 17 x the number of days since the first day in the file + (integer - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data the integer bin calculation code could be something like this:
# let us assume we have the data as:
# timestamps: 64-bit integer (seconds since something)
# values: 8-bit unsigned integer with integers between 40 and 200
# find the first day in the sample
first_day = np.min(timestamps) // 86400
# we intend to do this but fast:
indices = (timestamps // 86400 - first_day) * 17 + ((values - 40) // 10)
# get the bincount vector
b = np.bincount(indices)
# calculate the number of days in the sample
no_days = (len(b) + 16) // 17
# reshape b
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing, most of the time is spent calculating the indices (around 400 ms on an i7 processor). If that needs to be reduced, it can be done in approximately 100 ms with the numexpr module. However, the actual implementation depends heavily on the form of the timestamps; some are faster to calculate, some slower.
However, I doubt any other binning method will be faster if the data is needed down to the daily level.
I did not quite understand from your question whether you wanted separate views on the data (one by year, one by week, etc.) or some other binning method. In any case, that boils down to summing the relevant rows together.
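For example, monthly histograms could then be read off the daily-binned matrix b by summing the relevant rows (sketch, reusing first_day and no_days from the code above, and assuming the timestamps are seconds since the Unix epoch):
day_numbers = first_day + np.arange(no_days)                           # day index since the epoch
months = day_numbers.astype('datetime64[D]').astype('datetime64[M]')   # month of each row of b
monthly = np.array([b[months == m].sum(axis=0) for m in np.unique(months)])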
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np

dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))

years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')

from grouping import group_by   # the module from the pastebin link above

bins = np.linspace(40, 200, 17)
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
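For what it's worth, a rough pandas sketch of the same monthly histogram (reusing the dates and values arrays from above):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(dates), 'value': values})
df['bin'] = pd.cut(df['value'], bins=np.linspace(40, 200, 17))
monthly_hist = df.groupby([df['date'].dt.to_period('M'), 'bin'], observed=False).size().unstack('bin')
print(monthly_hist)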