Taking percentile in Python along 3rd dimension

I've been struggling with this one for a bit now. I have a matrix that is 55115 x 34, where each entry along the first dimension is one day, covering 151 years, totaling 55115 points.
I am trying to get monthly percentiles of the values along the first dimension, so I first added a date column, which lets me group the data into months. However, I cannot figure out the best way to take the 95th percentile across both the days and the third dimension (of size 34 here). After grouping by month, the array should be 151 x 12 x 34, and taking the 95th percentile along the third dimension should leave a final matrix of 151 x 12, in theory. Below is what I have so far to add the dates to the array:
dates = pd.date_range(start='1950-01-01', end='2100-12-31', freq='D') #create daily date range from 1950 to 2100
leap = [] #empty array
for each in dates:
    if each.month == 2 and each.day == 29: #find each leap day (Feb 29)
        leap.append(each)
dates = dates.drop(leap) #get rid of leap days
dates = pd.to_datetime(dates) #convert to datetime format
data = {'wind': winddata, 'time': dates} #create table with both dates and data
df = pd.DataFrame(data) #create dataframe
df.set_index('time') #index time
df.groupby(df['time'].dt.strftime('%b'))['wind'].sort_values()
And this is what I have to take the percentile:
months = df.groupby(pd.Grouper(key='time',freq = "M")) #group each month
monthly_percentile = months.aggregate(lambda x: np.percentile(x, q = 95)) #percentile across each month
However, this does not appear to work. I'm open to other methods of doing this; I'm hoping to a) rearrange the 55115 x 34 data set into months, so that it is 151 (years) x 365 (days) x 34 (ensembles), and then b) take the percentile across the days of each month and the third dimension, so I end up with 151 x 12 total. I'm happy to clarify anything if I did not specify well enough. Any detailed response would be really helpful. Thank you so much in advance!

If I get your question right, the most straightforward solution I can think of is to add year and month columns, then group by them and compute the required percentile:
import pandas as pd
import numpy as np
dates = pd.date_range(start='1950-01-01', end='2100-12-31', freq='D')
dates_months = [date.month for date in dates]
dates_years = [date.year for date in dates]
values = np.random.rand(34, len(dates))
df = pd.DataFrame()
df['date'] = dates
df['year'] = dates_years
df['month'] = dates_months
for i in range(34):
    df[f'values_{i}'] = values[i]
df = df.melt(id_vars=['date', 'year', 'month'], value_vars=[f'values_{i}' for i in range(34)])
sub = df.groupby(['year', 'month']).value.apply(lambda x: np.quantile(x, .95)).reset_index()
Finally, if you really need a 151 x 12 array instead of a year-month-percentile table of length 1812 (= 151 * 12), you could use something like this:
crosstab = pd.crosstab(index=sub['year'], columns=sub['month'], values=sub['value'], aggfunc='first') # melt names the column 'value'; each (year, month) cell holds a single quantile
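Alternatively, since dropping leap days makes every year exactly 365 days (151 x 365 = 55115), you could skip pandas and reshape the original array directly in NumPy. A minimal sketch, assuming winddata stands in for the real 55115 x 34 array, years start on January 1, and a fixed no-leap month layout:
import numpy as np
winddata = np.random.rand(55115, 34) # stand-in for the real (days, ensembles) array
arr = winddata.reshape(151, 365, 34) # (years, days, ensembles); valid because leap days were dropped
month_lengths = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31] # no-leap calendar
edges = np.cumsum([0] + month_lengths) # day-of-year boundaries: 0, 31, 59, ..., 365
result = np.empty((151, 12))
for m in range(12):
    # collapse the days of month m and all 34 ensembles into one percentile per year
    result[:, m] = np.percentile(arr[:, edges[m]:edges[m + 1], :], 95, axis=(1, 2))
print(result.shape) # (151, 12)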


Resampling a time series

I have a 40-year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at an hourly resolution. I extracted the maximum values for each year by the following groupby method:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now, I am trying to extract the maximum values for a 3-hour duration in each year. I tried the sliding-maxima approach below, but it is not working; k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3h, 6h, etc.):
class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24 * 365
        agg_values = []
        start_j = 1
        end_j = k * int(np.floor(period / k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Any suggestions or improvements to my code, or a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert the column containing the datetimes into a Series of type datetime. You can do the parsing by providing the format of your datetimes:
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%M%d%H")
After having the correct data type you have to set that column as your index and can now use pandas functionality for time series data (resampling in your case).
First you resample the data into 3-hour windows and sum the values. From that you resample to yearly data and take the maximum over all of each year's 3-hour windows.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
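Note that resample("3H") uses fixed, non-overlapping 3-hour bins. If a true sliding 3-hour maximum is wanted (as in the question's sliding_max), a rolling time window is one possible sketch, assuming the datetime parsing above:
# sliding 3-hour sums evaluated at every hourly step, then the yearly maximum
df = df.set_index("yyyymmddhh")
yearly_sliding_max = df["rainfall"].rolling("3H").sum().resample("Y").max()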

How to groupby a custom time range in xarray?

I have a DataArray with date, x, and y coordinate dims.
date: 265, y: 1458, x: 1159
For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.
I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:
arr.groupby("date.season")
which results in
DataArrayGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.
Similarly this does not quite get at what I want:
all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year'
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.
The start and the ending date for each year is automatically January/December.
I'd really like to be able to groupby an arbitrary period with a start time and an end time in units of day of year.
If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.
The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to access this with groupby. I can't use resample because it does not return a DataArray and I need to call the .integrate xarray method on each group DataArray (across the date dim, to get custom annual totals).
Data to reproduce (2Gb, but can be tested on a subset):
https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0
Code to reproduce
import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime
all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')
all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))
def eeflux_path_date(path):
    year, month, day, _ = path.name.split("_")
    return datetime(int(year), int(month), int(day))
def open_eeflux(path, da_for_match):
    data_array = rio.open_rasterio(path) # chunks makes it lazily executed
    data_array.rio.reproject_match(da_for_match)
    data_array = data_array.sel(band=1).drop("band") # gets rid of the old coordinate dimension since we need bands to have unique coord ids
    data_array["date"] = eeflux_path_date(path) # makes a new coordinate
    return data_array.expand_dims({"date": 1}) # makes this coordinate a dimension
da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")
all_eeflux_arr = all_eeflux_arr.sortby("date")
### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")
Advice is much appreciated!
I ended up writing a function that works well enough. Since my dataset isn't that large the integration doesn't take very long to run in a for loop that iterates over each group.
def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
    ey = max(all_eeflux_arr['date.year'].values)
    sy = min(all_eeflux_arr['date.year'].values)
    start_years = range(sy, ey)
    end_years = range(sy + 1, ey + 1)
    start_end_years = list(zip(start_years, end_years)) # each custom season spans two calendar years
    water_year_arrs = []
    for water_year in start_end_years:
        start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) & (all_eeflux_arr['date.year'].values == water_year[0]))
        end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) & (all_eeflux_arr['date.year'].values == water_year[1]))
        water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
    return water_year_arrs
water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
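From here each group can be integrated over the date dim, as the question intended. A hedged one-liner, assuming an xarray version whose DataArray.integrate accepts a datetime_unit:
# one integrated total per custom season / water year
seasonal_totals = [arr.integrate("date", datetime_unit="D") for arr in water_year_arrs]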

Calculate recurring customer

I'm analyzing sales data from a shop and want to calculate the percentage of "first order" customers who turn into recurring customers in the following month.
I have a DataFrame with all the orders. This includes a customer id, a date and a flag if this is his/her first order. This is my data:
import pandas as pd
data = {'Name': ['Tom', 'nick', 'krish', 'Tom'],
        'First_order': [1, 1, 1, 0],
        'Date': ['01-01-2018', '01-01-2018', '01-01-2018', '02-02-2018']}
df = pd.DataFrame(data)
I would now create a list of all new customers in January and a list of all recurring customers in February and inner-join them. Then I have two numbers with which I could calculate the percentage.
But I have no clue, how I could calculate this rolling for a whole year without looping over the data frame. Is there a nice pandas/python way to do so?
The goal would be to have a new dataframe with the month and the percentage of recurring customers from the previous month.
One thought would be to take all orders from January to November and add a column "recurr" which gives you a True/False based on whether this customer ordered again in the next month. Then you can take a per-month groupby with count/sum of the True/False values and add a column giving the ratio.
EDIT: before this you may need to convert dates:
df.Date = pd.to_datetime(df.Date)
Then:
df['month'] = df['Date'].apply(lambda x: x.month) #this is for simplicity's sake, not hard to extend to MMYYYY
df1 = df[df.month != 12].copy() #now we select everything except December, which has no following month in the data
df1 = df1[df1.First_order == 1].copy() #and filter out non-first orders
df1['recurr'] = df1.apply(lambda x: len(df[(df.month == x.month + 1) & (df.Name == x.Name)]) > 0, axis=1) #True if the same person has an order in the following month
df2 = df1[['month','Name','recurr']].groupby('month').agg({'Name':'count','recurr':'sum'})
At this point, for each month, the "Name" column has number of first orders and "recurr" column has number of those that ordered again the following month. A simple extra column gives you percentage:
df2['percentage_of_recurring_customer'] = (df2.recurr/df2.Name)*100
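With the sample data above this comes out as a single row: January has three first orders and only Tom returns in February, so the expected share is one third (output sketched by hand; exact formatting may differ):
print(df2)
#        Name  recurr  percentage_of_recurring_customer
# month
# 1         3       1                         33.333333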
EDIT: For any number of dates, here's a clumsy solution. Choose a start date and use that year's January as month 1, and number all months sequentially after that.
df.Date = pd.to_datetime(df.Date)
start_year = df.Date.min().year
def get_month_num(date):
    return (date.year - start_year) * 12 + date.month
Now that we have a function to convert dates, the slightly changed code:
df['month'] = df['Date'].apply(lambda x: get_month_num(x))
df1 = df[df.First_order == 1].copy()
df1['recurr'] = df1.apply(lambda x: len(df[(df.month == x.month + 1) & (df.Name == x.Name)]) > 0, axis=1)
df2 = df1[['month','Name','recurr']].groupby('month').agg({'Name':'count','recurr':'sum'})
Finally, you can make a function to revert your month numbers into dates:
def restore_month(month_num):
    year = (month_num - 1) // 12 + start_year #integer division recovers the year (the -1 keeps December in the right year)
    month = (month_num - 1) % 12 + 1 #modulo recovers the month, kept in the 1-12 range
    return pd.Timestamp(str(year) + '-' + str(month) + '-1') #This returns the first of that month
df3 = df2.reset_index().copy() #removing month from index so we can change it.
df3['month_date'] = df3['month'].apply(lambda x: restore_month(x))
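A quick round-trip check of the two helpers (start_year is 2018 with the sample data):
assert get_month_num(pd.Timestamp('2018-02-02')) == 2
assert restore_month(2) == pd.Timestamp('2018-02-01')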

Multiple day wise plots in timeseries dataframe pandas

My data frame looks like this -
In [1]: df.head()
Out[1]:
Datetime Value
2018-04-21 14:08:30.761 offline
2018-04-21 14:08:40.761 offline
2018-04-21 14:08:50.761 offline
2018-04-21 14:09:00.761 offline
2018-04-21 14:09:10.761 offline
I have data for 2 weeks. I want to plot Value against time (hours:minutes) for each day of the week. Seeing the data one week at a time would also work.
I took a slice for a single day and created a plot using plotly.
In[9]: df['numval'] = df.Value.apply(lambda x: 1 if x == 'online' else -1)
In[10]: df.iplot()
If I can have multiple plots similar to this for Sunday to Saturday using a few lines, it would speed up my work.
Suggestions -
Something like: I pass in the weekday (0-6), time (x axis) and Value (y axis) as arguments, and it creates 7 plots.
In[11]: df['weekday'] = df.index.weekday
In[12]: df['weekdayname'] = df.index.weekday_name
In[13]: df['time'] = df.index.time
Any library would work, as I just want to see the data and will need to test out modifications to it.
Optional - a distribution curve, similar to a KDE, over the data would be nice.
This may not be the exact answer you are looking for. Just giving an approach which could be helpful.
The approach here is to group the data based on date and then generate a plot for each group. For this you need to split the DateTime column into two columns - date and time. The code below will do that:
datetime_series = df['Datetime']
date_series = pd.Series()
time_series = pd.Series()
for datetime_string in datetime_series:
    date, time = datetime_string.split(" ")
    date_s = pd.Series(date, dtype=str)
    time_s = pd.Series(time, dtype=str)
    date_series = date_series.append(date_s, ignore_index=True)
    time_series = time_series.append(time_s, ignore_index=True)
The code above will give you two separate pandas Series, one for date and the other for time. Now you can add the two columns to your dataframe:
df['date'] = date_series
df['time'] = time_series
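For what it's worth, a more idiomatic sketch (assuming the Datetime column parses cleanly with pd.to_datetime) avoids the loop entirely:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df['date'] = df['Datetime'].dt.date
df['time'] = df['Datetime'].dt.time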
Now you can use groupby functionality to group the data based on date and plot data for each group. Something like this:
First replace 'offline' with value 0:
df1 = df.replace(to_replace='offline',value=0)
Now group the data based on date and plot:
for title, group in df1.groupby('date'):
    group.plot(x='time', y='Value', title=title)
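One caveat: the replace above turns 'offline' into 0 but leaves 'online' as a string, so the y-axis may not be numeric. Mapping both states to numbers first (as the question does with its numval column) is a safer sketch:
df1 = df.copy()
df1['Value'] = df1['Value'].map({'online': 1, 'offline': 0}) # numeric states for plotting
for title, group in df1.groupby('date'):
    group.plot(x='time', y='Value', title=title)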

Compute daily climatology using pandas python

I am trying to use pandas to compute daily climatology. My code is:
import random
import pandas as pd
dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data = [int(1000 * random.random()) for i in range(len(dates))]
cum_data = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")
cum_data is the series containing daily dates from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365, with the first element containing the average of rand_data for January 1st across 1950, 1951, 1952 and 1953, and so on for the second element...
Any suggestions how I can do this using pandas?
You can group by the day of the year, and then calculate the mean for these groups:
cum_data.groupby(cum_data.index.dayofyear).mean()
However, you have to be aware of leap years; these will cause problems with this approach. As an alternative, you can also group by the month and the day:
In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1 1 462.25
2 631.00
3 615.50
4 496.00
...
12 28 378.25
29 427.75
30 528.50
31 678.50
Length: 366, dtype: float64
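For reference, here is why the dayofyear grouping bites with this data: 1952 is a leap year, so there are 366 groups, and group 60 mixes Feb 29 of 1952 with March 1 of the other years:
doy_clim = cum_data.groupby(cum_data.index.dayofyear).mean()
print(len(doy_clim)) # 366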
Hoping it can be of some help, I want to post my solution for getting a climatology series with the same index and length as the original time series.
I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series taking values from this model climatology and time index from my original time series.
This way, things like leap years are automatically taken care of.
#I start with my time series named 'serData'.
#I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()
#Now I build the climatology series, taking values from serClimModel depending on the index of serData.
serClimatology = serClimModel[list(zip(serData.index.month, serData.index.day))]
#Now serClimatology has a time index like this: [1,1] ... [12,31].
#So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
@joris: Thanks, your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step: re-mapping the (month, day) index back to an index of day of the year for all years, including leap years, i.e. 1 through 366. So I thought I'd share my solution for other users. 1950 through 1953 is 4 years with one leap year, 1952. Note that since random values are used, each run will give different results.
...
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
    (mo, dy) = name
    # Note: can use any leap year here.
    yrday = (date(1952, mo, dy)).timetuple().tm_yday
    doy.append(yrday)
    doy_mean.append(group.mean())
    doy_size.append(group.count())
    # Note: useful climatology stats are also available via group.describe() returned as a dict
    # desc = group.describe()
    # desc["mean"], desc["min"], desc["max"], std, quartiles, etc.
    # we lose the counts here.
new_cum_data = pd.Series(doy_mean, index=doy)
print(new_cum_data.loc[366])
>> 634.5
pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)
print(cum_data_df.loc[366])
>> mean 634.5
>> size 4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print(cum_data_df.loc[60])
>> mean 343
>> size 1
>> Name: 60, dtype: float64
Grouping by month and day is a good solution. However, the neater groupby(dayofyear) approach is still possible if you use an xarray CFTimeIndex instead of a pandas DatetimeIndex, i.e.:
Delete Feb 29 by using:
rand_data=rand_data[~((rand_data.index.month==2) & (rand_data.index.day==29))]
Replace the index of the above data with a CFTimeIndex on a no-leap calendar, i.e.:
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar = 'noleap')
index = index[~((index.month==2)&(index.day==29))]
rand_data.index = index # attach the no-leap calendar index
Now, for both non-leap and leap years, the 60th dayofyear is March 1st and the total number of days in the year is 365, so groupby(dayofyear) correctly computes the climatological daily mean.
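A minimal end-to-end sketch of this idea (assuming xarray with the cftime package installed; the random series stands in for real data):
import numpy as np
import pandas as pd
import xarray as xr
# stand-in daily series on a standard (leap-aware) calendar
dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
data = pd.Series(np.random.rand(len(dates)), index=dates)
# drop Feb 29, then rebuild on a no-leap CFTimeIndex of the same length
data = data[~((data.index.month == 2) & (data.index.day == 29))]
index = xr.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar='noleap')
da = xr.DataArray(data.values, coords={'time': index}, dims='time')
# day-of-year groupby now yields exactly 365 groups
clim = da.groupby('time.dayofyear').mean()
print(clim.sizes) # {'dayofyear': 365}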
