Pandas - Compute sum of a column as week-wise columns - python

I have a table like below containing values for multiple IDs:
ID
value
date
1
20
2022-01-01 12:20
2
25
2022-01-04 18:20
1
10
2022-01-04 11:20
1
150
2022-01-06 16:20
2
200
2022-01-08 13:20
3
40
2022-01-04 21:20
1
75
2022-01-09 08:20
I would like to calculate week wise sum of values for all IDs:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on range:
every Saturday 00:00 to next Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59)
ID
Week 1 sum
Week 2 sum
Week 3 sum
...
1
180
75
--
--
2
25
200
--
--
3
40
--
--
--

There's a pandas function (pd.Grouper) that allows you to specify a groupby instruction.1 In this case, that specification is to "resample" date by a weekly frequency that starts on Fridays.2 Since you also need to group by ID as well, add it to the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
.unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
1 What you actually need is a pivot_table result but groupby + unstack is equivalent to pivot_table and groupby + unstack is more convenient here.
2 Because Jan 1, 2022 is a Saturday, you need to specify the anchor on Friday.

You can compute a week column. In case you've data for same year, you can extract just week number, which is less likely in real-time scenarios. In case you've data from multiple years, it might be wise to derive a combination of Year & week number.
df['Year-Week'] = df['Date'].dt.strftime('%Y-%U')
In your case the dates 2022-01-01 & 2022-01-04 18:2 should be convert to 2022-01 as per the scenario you considered.
To calculate your pivot table, you can use the pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['year_weeknumber'], aggfunc=np.sum)

Let's define a formatting helper.
def fmt(row):
return f"{row.year}-{row.week:02d}" # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Just groupby
the ISO week.

Related

Group by list of different time ranges in Pandas

Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame. eg:
ranges = pd.DataFrame(data=np.array([[1,10,20],[3,15,30]]).T, columns=["Start","Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
Start Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data= pd.DataFrame(data=list(range(0,40*5,5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
data
0 7.5
1 60.0
2 122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a,b, closed='left', freq='1s')
for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
data
0 7.5
1 60.0
2 122.5

Using pandas to csv, how to organize time and numerical data in a multi-level index

Using pandas to write to a csv, I want Monthly Income sums for each unique Source. Month is in datetime format.
I have tried resampling and groupby methods, but groupby neglects month and resampling neglects source. I currently have a multi-level index with Month and Source as indexes.
Month Source Income
2019-03-01 A 100
2019-03-05 B 50
2019-03-06 A 4
2019-03-22 C 60
2019-04-23 A 40
2019-04-24 A 100
2019-04-24 C 30
2019-06-1 C 100
2019-06-1 B 90
2019-06-8 B 20
2019-06-12 A 50
2019-06-27 C 50
I can groupby Source which neglects date, or I can resample for date which neglects source. I want monthly sums for each unique source.
What you have in the Month column is a Timestamp. So you can separate the month attribute of this Timestamp and afterward apply the groupby method, like this:
df.columns = ['Timestamp', 'Source', 'Income']
month_list = []
for i in range(len(df)):
month_list.append(df.loc[i,'Timestamp'].month)
df['Month'] = month_list
df1 = df.groupby(['Month', 'Source']).sum()
The output should be like this:
Income
Month Source
3 A 104
B 50
C 60
4 A 140
C 30
6 A 50
B 110
C 150

Pandas: How to group the non-continuous date column?

I have a column in a dataframe which contains non-continuous dates. I need to group these date by a frequency of 2 days. Data Sample(after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried following but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is these a way to achieve the desired grouping using pandas or should I write a separate logic?
Once you have a l_date sorted dataframe. you can create a continuous dummy date (dum_date) column and groupby 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.datetime.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
OR
If you are fine with groupings other than dates. then a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))

Tricky slicing specifications on business-day datetimeindex

I have a pandas dataframe with a business-day-based DateTimeIndex. For each month that's in the index, I also have a single 'marker' day specified.
Here's a toy version of that dataframe:
# a dataframe with business dates as the index
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# each month has an single, arbitrary marker day specified
marker_dates = [df.index[12], df.index[33], df.index[57]]
For each month in the index, I need to calculate average of the foo column in specific slice of rows in that month.
There are two different ways I need to be able to specify those slices:
1) m'th day to n'th day.
Example might be (2rd to 4th business day in that month). So april would be the average of 1 (apr2), 4 (apr3), and 5 (apr 6) = 3.33. May would be 33 (may 4), 34 (may 5), 35 (may 6) = 34. I don't consider the weekends/holidays that don't occur in the index as days.
2) m'th day before/after the marker date to the n'th day before/after the marker date.
Example might be "average of the slice from 1 day before the marker date to 1 day after the marker date in each month" Eg. In April, the marker date is 17Apr. Looking at the index, we want the average of apr16, apr17, and apr20.
For Example 1, I had an ugly solution that foreach month I would slice the rows of that month away, and then apply df_slice.iloc[m:n].mean()
Whenever I start doing iterative things with pandas, I always suspect I'm doing it wrong. So I imagine there is a cleaner, pythonic/vectorized way to make this result for all the months
For Example 2, I don't not know a good way to do this slice-averaging based on arbitrary dates across many months.
Use BDay() from pandas.tseries.offsets
import pandas as pd
from pandas.tseries.offsets import BDay
M=2
N=4
start_date = pd.datetime(2015,4,1)
end_date = pd.datetime(2015,6,30)
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# for month starts
marker_dates = pd.date_range(start=start_date, end=end_date, freq='BMS')
# create IntervalIndex
bins = pd.IntervalIndex.from_tuples([ (d + (M-1)*BDay(), d + (N-1)*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-02, 2015-04-06] 3.333333
#[2015-05-04, 2015-05-06] 34.000000
#[2015-06-02, 2015-06-04] 63.000000
# any markers
marker_dates = [df.index[12], df.index[33], df.index[57]]
# M Bday before, and N Bday after
bins = pd.IntervalIndex.from_tuples([ (d - M*BDay(), d + N*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-15, 2015-04-23] 18.428571
#[2015-05-14, 2015-05-22] 48.000000
#[2015-06-17, 2015-06-25] 81.428571
The most pythonic/vectorized (pandonic?) way to do this might be to use df.rolling and df.shift to generate the window over which you'll take the average, then df.reindex to select the value at the dates you've marked.
For your example (2), this could look like:
df['foo'].rolling(3).mean().shift(-1).reindex(marker_dates)
Out[8]:
2015-04-17 17.333333
2015-05-18 47.000000
2015-06-19 80.333333
Name: foo, dtype: float64
This could be wrapped in a small function:
def window_mean_at_indices(df, indices, begin=-1, end=1):
return df.rolling(1+end-begin).mean().shift(-end).reindex(indices)
Helping to make it more clear how to apply this to situation (1):
month_starts = pd.date_range(df.index.min(), df.index.max(), freq='BMS')
month_starts
Out[11]: DatetimeIndex(['2015-04-01', '2015-05-01', '2015-06-01'],
dtype='datetime64[ns]', freq='BMS')
window_mean_at_indices(df['foo'], month_starts, begin=1, end=3)
Out[12]:
2015-04-01 3.333333
2015-05-01 34.000000
2015-06-01 63.000000
Freq: BMS, Name: foo, dtype: float64
For your first problem you can use grouper and iloc i.e
low = 2
high= 4
slice_mean = df.groupby(pd.Grouper(level=0,freq='m')).apply(lambda x : x.iloc[low-1:high].mean())
# or df.resample('m').apply(lambda x : x.iloc[low-1:high].mean())
foo
2015-04-30 3.333333
2015-05-31 34.000000
2015-06-30 63.000000
For your second problem you can concat the dates and take the groupy mean per month i.e
idx = pd.np.where(df.index.isin(pd.Series(marker_dates)))[0]
#array([12, 33, 57])
temp = pd.concat([df.iloc[(idx+i)] for i in [-1,0,1]])
foo
2015-04-16 15
2015-05-15 46
2015-06-18 78
2015-04-17 18
2015-05-18 47
2015-06-19 81
2015-04-20 19
2015-05-19 48
2015-06-22 82
# Groupby mean
temp.groupby(pd.Grouper(level=0,freq='m')).mean()
# or temp.resample('m').mean()
foo
2015-04-30 17.333333
2015-05-31 47.000000
2015-06-30 80.333333
dtype: float64
since the index of output aint specified in the question do let us know what the index of output be.
Here's what I managed to come up with:
Import pandas and setup the dataframe
import pandas as pd
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B')
Start with a pure list of marker dates, since I'm guessing that what you're really starting with:
marker_dates = [
pd.to_datetime('2015-04-17', format='%Y-%m-%d'),
pd.to_datetime('2015-05-18', format='%Y-%m-%d'),
pd.to_datetime('2015-06-19', format='%Y-%m-%d')
]
marker_df = pd.DataFrame([], columns=['marker', 'start', 'end', 'avg'])
marker_df['marker'] = marker_dates
For the case where you want to just test ranges, input the start and end manually here instead of calculating it. If you want to change the range you can change the arguments to shift():
marker_df['start'] = df.index.shift(-1)[df.index.isin(marker_df['marker'])]
marker_df['end'] = df.index.shift(1)[df.index.isin(marker_df['marker'])]
Finally, use DataFrame.apply() to do a row by row calculation of averages:
marker_df.apply(
lambda x: df[(x['start'] <= df.index) & (df.index <= x['end'])]['foo'].mean(),
axis=1
)
Which gives us this result:
marker start end avg
0 2015-04-17 2015-04-16 2015-04-20 17.000000
1 2015-05-18 2015-05-15 2015-05-19 46.666667
2 2015-06-19 2015-06-18 2015-06-22 80.000000

Pandas pivot_table on date

I have a pandas DataFrame with a date column. It is not an index.
I want to make a pivot_table on the dataframe using counting aggregate per month for each location.
The data look like this:
['INDEX'] DATE LOCATION COUNT
0 2009-01-02 00:00:00 AAH 1
1 2009-01-03 00:00:00 ABH 1
2 2009-01-03 00:00:00 AAH 1
3 2009-01-03 00:00:00 ABH 1
4 2009-01-04 00:00:00 ACH 1
I used:
pivot_table(cdiff, values='COUNT', rows=['DATE','LOCATION'], aggfunc=np.sum)
to pivot the values. I need a way to convert cdiff.DATE to a month rather than a date.
I hope to end up with something like:
The data look like this:
MONTH LOCATION COUNT
January AAH 2
January ABH 2
January ACH 1
I tried all manner of strftime methods on cdiff.DATE with no success. It wants to apply the to strings, not series object.
I would suggest:
months = cdiff.DATE.map(lambda x: x.month)
pivot_table(cdiff, values='COUNT', rows=[months, 'LOCATION'],
aggfunc=np.sum)
To get a month name, pass a different function or use the built-in calendar.month_name. To get the data in the format you want, you should call reset_index on the result, or you could also do:
cdiff.groupby([months, 'LOCATION'], as_index=False).sum()

Categories