I have a dataframe that is being read from database records and looks like this:
date        added  total
2020-09-14  5      5
2020-09-15  4      9
2020-09-16  2      11
I need to be able to resample by different periods and this is what I am using:
df = pd.DataFrame.from_records(raw_data, index='date')
df.index = pd.to_datetime(df.index)
# let's say I want yearly sample, then I would do
df = df.fillna(0).resample('Y').sum()
This almost works, but it is obviously summing the total column, which is something I don't want. I need the total column to be the value at the last date within each resampled period, like this:
# What I need
date  added  total
2020  11     11

# What I'm getting
date  added  total
2020  11     25
You can do this by resampling differently for different columns. Here you want to use sum() aggregator for the added column, but max() for the total.
df = pd.DataFrame({'date': [20200914, 20200915, 20200916, 20210101, 20210102],
                   'added': [5, 4, 2, 1, 6],
                   'total': [5, 9, 11, 1, 7]})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df_res = df.resample('Y', on='date').agg({'added': 'sum', 'total': 'max'})
And the result is:
df_res
added total
date
2020-12-31 11 11
2021-12-31 7 7
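If total is a strictly cumulative running total, taking the last value in each period gives the same result as max(); a minimal sketch of that variant, reusing the same sample frame (the equivalence only holds while total never decreases):

df_res = df.resample('Y', on='date').agg({'added': 'sum', 'total': 'last'})
# same output as above for this data:
#             added  total
# date
# 2020-12-31     11     11
# 2021-12-31      7      7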
I have a table like below containing values for multiple IDs:
ID  value  date
1   20     2022-01-01 12:20
2   25     2022-01-04 18:20
1   10     2022-01-04 11:20
1   150    2022-01-06 16:20
2   200    2022-01-08 13:20
3   40     2022-01-04 21:20
1   75     2022-01-09 08:20
I would like to calculate the week-wise sum of values for all IDs:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on range:
every Saturday 00:00 to next Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59)
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
There's a pandas function (pd.Grouper) that allows you to specify a groupby instruction.[1] In this case, that specification is to "resample" date by a weekly frequency that ends on Fridays.[2] Since you need to group by ID as well, add it to the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])

# pivot the dataframe
df1 = (
    df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
      .unstack(fill_value=0)
)

# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1] + 1)]
df1 = df1.reset_index()
[1] What you actually need is a pivot_table result, but groupby + unstack is equivalent to pivot_table and is more convenient here.
[2] Because Jan 1, 2022 is a Saturday, you need to anchor the weeks on Friday.
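With the question's sample data, this should produce something like:

   ID  Week 1 sum  Week 2 sum
0   1         180          75
1   2          25         200
2   3          40           0

(the zeros come from fill_value=0 and correspond to the -- cells in the desired output).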
You can compute a week column. If your data are all from the same year, you can extract just the week number, though that is unlikely in real scenarios. If your data come from multiple years, it is wiser to derive a combination of year and week number.
df['Year-Week'] = df['date'].dt.strftime('%Y-%U')
In your case, the dates 2022-01-01 and 2022-01-04 18:20 would be converted to 2022-00 and 2022-01 respectively (%U counts weeks starting on Sunday, and days before the first Sunday of the year fall in week 00).
To calculate your pivot table, you can use pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['Year-Week'], aggfunc='sum')
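A minimal end-to-end sketch of this approach, using the sample rows from the question and the Year-Week column defined above (note that %U weeks run Sunday to Saturday, so the labels won't line up exactly with the Saturday-to-Friday weeks the question asks for):

import pandas as pd

df = pd.DataFrame({'ID':    [1, 2, 1, 1, 2, 3, 1],
                   'value': [20, 25, 10, 150, 200, 40, 75],
                   'date':  ['2022-01-01 12:20', '2022-01-04 18:20', '2022-01-04 11:20',
                             '2022-01-06 16:20', '2022-01-08 13:20', '2022-01-04 21:20',
                             '2022-01-09 08:20']})
df['date'] = pd.to_datetime(df['date'])
df['Year-Week'] = df['date'].dt.strftime('%Y-%U')

# one row per ID, one column per year-week, summing the values
pivot = pd.pivot_table(df, values='value', index='ID', columns='Year-Week', aggfunc='sum')
print(pivot)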
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}"  # we ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
...                    dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Just groupby the ISO week.
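For instance, a minimal sketch of that final step with the two-row frame above (reshaping to one column per ISO week is an assumption about the desired output):

>>> df.groupby(['id', 'iso'])['value'].sum().unstack(fill_value=0)
iso  2021-52  2022-01
id
1         20        0
2          0       25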
Using pandas to write to a csv, I want Monthly Income sums for each unique Source. Month is in datetime format.
I have tried resampling and groupby methods, but groupby neglects month and resampling neglects source. I currently have a multi-level index with Month and Source as indexes.
Month       Source  Income
2019-03-01  A       100
2019-03-05  B       50
2019-03-06  A       4
2019-03-22  C       60
2019-04-23  A       40
2019-04-24  A       100
2019-04-24  C       30
2019-06-01  C       100
2019-06-01  B       90
2019-06-08  B       20
2019-06-12  A       50
2019-06-27  C       50
I can groupby Source which neglects date, or I can resample for date which neglects source. I want monthly sums for each unique source.
What you have in the Month column is a Timestamp, so you can extract the month attribute of each Timestamp and then apply the groupby method, like this:
df.columns = ['Timestamp', 'Source', 'Income']
month_list = []
for i in range(len(df)):
    month_list.append(df.loc[i, 'Timestamp'].month)
df['Month'] = month_list
df1 = df.groupby(['Month', 'Source']).sum()
The output should be like this:
              Income
Month Source
3     A          104
      B           50
      C           60
4     A          140
      C           30
6     A           50
      B          110
      C          150
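If you also want to keep the year rather than just the month number, here is a sketch of an alternative that avoids the loop, grouping on the original Month column with pd.Grouper (column names as in the question, and only a few of the question's rows used as sample data):

import pandas as pd

df = pd.DataFrame({'Month':  pd.to_datetime(['2019-03-01', '2019-03-05', '2019-04-23', '2019-06-01']),
                   'Source': ['A', 'B', 'A', 'C'],
                   'Income': [100, 50, 40, 100]})

# group by calendar month (keeping the year) and by Source
monthly = df.groupby([pd.Grouper(key='Month', freq='M'), 'Source'])['Income'].sum()
print(monthly)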
I've got a dataframe with dates, and I want to select the highest date in each week excluding weekends (so Fridays, if available), unless there is no Monday-to-Friday data and Saturday/Sunday are the only ones available.
The sample data can be set up like this:
import numpy as np
import pandas as pd

dates = pd.Series(data=['2018-11-05', '2018-11-06', '2018-11-07', '2018-11-08', '2018-11-09',
                        '2018-11-12', '2018-11-13', '2018-11-14', '2018-11-15', '2018-11-17',
                        '2018-11-19',
                        '2018-12-01',
                        ])
nums = np.random.randint(50, 100, 12)
# nums
# array([95, 80, 81, 51, 98, 62, 50, 55, 59, 77, 69])
df = pd.DataFrame(data={'dates': dates, 'nums': nums})
df['dates'] = pd.to_datetime(df['dates'])
The records I want:
2018-11-09 is a Friday
2018-11-15 is a Thursday (not 2018-11-17, because that's a Saturday)
2018-11-19 is a Monday and the only record for that week
2018-12-01 is a Saturday but the only record for that week
My current solution is in the answer below but I don't think it's ideal and has some issues I had to work around. Briefly, it's:
groupby week: df.groupby(df['dates'].dt.week).apply(some_function)
if there's just one record for that week, return it
otherwise, select the highest/latest record with day <= Friday and return that
Ideally, I'd like a way to write:
[latest Mon-Fri record] if [has Mon-Fri record] else [latest Sat-Sun record]
Create a new hierarchy of weekdays, where Saturday and Sunday are given the lowest priority. Then sort_values on this new ranking + groupby + .tail(1).
import numpy as np

wd_map = dict(zip(np.arange(0, 7, 1), np.roll(np.arange(0, 7, 1), -2)))
# {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 0, 6: 1}

df = df.assign(day_mapped=df.dates.dt.weekday.map(wd_map)).sort_values('day_mapped')
df.groupby(df.dates.dt.week).tail(1).sort_index()
Output
dates nums day_mapped
4 2018-11-09 57 6
8 2018-11-15 83 5
10 2018-11-19 96 2
11 2018-12-01 66 0
If your data span multiple years, you'll need to group on both Year + week.
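A minimal sketch of that multi-year variant using isocalendar(), which also avoids the older .dt.week accessor; the day_mapped column is the one created above:

iso = df.dates.dt.isocalendar()
# keep the highest-priority day per (ISO year, ISO week) pair
df.sort_values('day_mapped').groupby([iso.year, iso.week]).tail(1).sort_index()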
I wrote a function to select the valid highest record for the week, which would need to be used on a weekly groupby:
def last_valid_report(recs):
    if len(recs) == 1:
        return recs
    recs = recs.copy()
    # recs = recs[recs['dates'].dt.weekday <= 4].nlargest(1, recs['dates'].dt.weekday)  # doesn't work
    recs['weekday'] = recs['dates'].dt.weekday  # because nlargest() needs a column name
    recs = recs[recs['weekday'] <= 4].nlargest(1, 'weekday')
    del recs['weekday']
    return recs
    # could have also done:
    # return recs[recs['weekday'] <= 4].nlargest(1, 'weekday').drop('weekday', axis=1)
Calling that with the correct groups, I get:
In [155]: df2 = df.groupby(df['dates'].dt.week).apply(last_valid_report)
In [156]: df2
Out[156]:
dates nums
dates
45 4 2018-11-09 63
46 8 2018-11-15 90
47 10 2018-11-19 80
48 11 2018-12-01 94
Couple of issues with this:
If I don't put the recs.copy(), I get ValueError: Shape of passed values is (3, 12), indices imply (3, 4)
pandas' nlargest will only use column names, not an expression.
so I need to create an extra column in the function and drop/delete it before returning it. I could also create this in the original df and drop it after the .apply().
I'm getting an extra index level 'dates' from the groupby + apply, which needs to be explicitly dropped:
In [157]: df2.index = df2.index.droplevel(); df2
Out[157]:
dates nums
4 2018-11-09 63
8 2018-11-15 90
10 2018-11-19 80
11 2018-12-01 94
If I get a week with only Saturday and Sunday data (2 records), I need to add a check: if recs[recs['weekday'] <= 4] is empty, just use .nlargest(1, 'weekday') without filtering on weekday <= 4; but that's beside the point of the question.
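Something like this could fold that extra check into the function above (untested sketch, not part of the original workaround):

def last_valid_report(recs):
    if len(recs) == 1:
        return recs
    recs = recs.copy()
    recs['weekday'] = recs['dates'].dt.weekday
    weekdays_only = recs[recs['weekday'] <= 4]
    # fall back to the weekend rows when there are no Mon-Fri rows in this week
    picked = weekdays_only if not weekdays_only.empty else recs
    return picked.nlargest(1, 'weekday').drop('weekday', axis=1)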
I have a pandas dataframe with a business-day-based DateTimeIndex. For each month that's in the index, I also have a single 'marker' day specified.
Here's a toy version of that dataframe:
# a dataframe with business dates as the index
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# each month has a single, arbitrary marker day specified
marker_dates = [df.index[12], df.index[33], df.index[57]]
For each month in the index, I need to calculate the average of the foo column over a specific slice of rows in that month.
There are two different ways I need to be able to specify those slices:
1) m'th day to n'th day.
An example might be the 2nd to 4th business day in that month. So April would be the average of 1 (Apr 2), 4 (Apr 3), and 5 (Apr 6) = 3.33. May would be 33 (May 4), 34 (May 5), 35 (May 6) = 34. I don't count the weekends/holidays that don't appear in the index as days.
2) m'th day before/after the marker date to the n'th day before/after the marker date.
Example might be "average of the slice from 1 day before the marker date to 1 day after the marker date in each month" Eg. In April, the marker date is 17Apr. Looking at the index, we want the average of apr16, apr17, and apr20.
For Example 1, I had an ugly solution where, for each month, I would slice out that month's rows and then apply df_slice.iloc[m:n].mean().
Whenever I start doing iterative things with pandas, I always suspect I'm doing it wrong, so I imagine there is a cleaner, pythonic/vectorized way to produce this result for all the months.
For Example 2, I don't know a good way to do this slice-averaging based on arbitrary dates across many months.
Use BDay() from pandas.tseries.offsets
import pandas as pd
from pandas.tseries.offsets import BDay

M = 2
N = 4

start_date = pd.Timestamp('2015-04-01')
end_date = pd.Timestamp('2015-06-30')

df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()

# for month starts
marker_dates = pd.date_range(start=start_date, end=end_date, freq='BMS')

# create IntervalIndex
bins = pd.IntervalIndex.from_tuples([(d + (M-1)*BDay(), d + (N-1)*BDay()) for d in marker_dates], closed='both')

df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-02, 2015-04-06] 3.333333
#[2015-05-04, 2015-05-06] 34.000000
#[2015-06-02, 2015-06-04] 63.000000
# any markers
marker_dates = [df.index[12], df.index[33], df.index[57]]
# M Bday before, and N Bday after
bins = pd.IntervalIndex.from_tuples([ (d - M*BDay(), d + N*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-15, 2015-04-23] 18.428571
#[2015-05-14, 2015-05-22] 48.000000
#[2015-06-17, 2015-06-25] 81.428571
The most pythonic/vectorized (pandonic?) way to do this might be to use df.rolling and df.shift to generate the window over which you'll take the average, then df.reindex to select the value at the dates you've marked.
For your example (2), this could look like:
df['foo'].rolling(3).mean().shift(-1).reindex(marker_dates)
Out[8]:
2015-04-17 17.333333
2015-05-18 47.000000
2015-06-19 80.333333
Name: foo, dtype: float64
This could be wrapped in a small function:
def window_mean_at_indices(df, indices, begin=-1, end=1):
    return df.rolling(1 + end - begin).mean().shift(-end).reindex(indices)
To make it clearer how to apply this to situation (1):
month_starts = pd.date_range(df.index.min(), df.index.max(), freq='BMS')
month_starts
Out[11]: DatetimeIndex(['2015-04-01', '2015-05-01', '2015-06-01'],
dtype='datetime64[ns]', freq='BMS')
window_mean_at_indices(df['foo'], month_starts, begin=1, end=3)
Out[12]:
2015-04-01 3.333333
2015-05-01 34.000000
2015-06-01 63.000000
Freq: BMS, Name: foo, dtype: float64
For your first problem you can use pd.Grouper and iloc, i.e.
low = 2
high = 4

slice_mean = df.groupby(pd.Grouper(level=0, freq='m')).apply(lambda x: x.iloc[low-1:high].mean())
# or df.resample('m').apply(lambda x: x.iloc[low-1:high].mean())
foo
2015-04-30 3.333333
2015-05-31 34.000000
2015-06-30 63.000000
For your second problem you can concat the rows around the marker dates and take the groupby mean per month, i.e.
import numpy as np

idx = np.where(df.index.isin(pd.Series(marker_dates)))[0]
# array([12, 33, 57])
temp = pd.concat([df.iloc[idx + i] for i in [-1, 0, 1]])
foo
2015-04-16 15
2015-05-15 46
2015-06-18 78
2015-04-17 18
2015-05-18 47
2015-06-19 81
2015-04-20 19
2015-05-19 48
2015-06-22 82
# Groupby mean
temp.groupby(pd.Grouper(level=0,freq='m')).mean()
# or temp.resample('m').mean()
foo
2015-04-30 17.333333
2015-05-31 47.000000
2015-06-30 80.333333
dtype: float64
Since the index of the output isn't specified in the question, do let us know what the index of the output should be.
Here's what I managed to come up with:
Import pandas and set up the dataframe:
import pandas as pd
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').mean()
Start with a plain list of marker dates, since I'm guessing that's what you're really starting with:
marker_dates = [
    pd.to_datetime('2015-04-17', format='%Y-%m-%d'),
    pd.to_datetime('2015-05-18', format='%Y-%m-%d'),
    pd.to_datetime('2015-06-19', format='%Y-%m-%d'),
]

marker_df = pd.DataFrame([], columns=['marker', 'start', 'end', 'avg'])
marker_df['marker'] = marker_dates
For the case where you want to just test ranges, input the start and end manually here instead of calculating it. If you want to change the range you can change the arguments to shift():
marker_df['start'] = df.index.shift(-1)[df.index.isin(marker_df['marker'])]
marker_df['end'] = df.index.shift(1)[df.index.isin(marker_df['marker'])]
Finally, use DataFrame.apply() to do a row-by-row calculation of the averages:
marker_df['avg'] = marker_df.apply(
    lambda x: df[(x['start'] <= df.index) & (df.index <= x['end'])]['foo'].mean(),
    axis=1
)
Which gives us this result:
marker start end avg
0 2015-04-17 2015-04-16 2015-04-20 17.000000
1 2015-05-18 2015-05-15 2015-05-19 46.666667
2 2015-06-19 2015-06-18 2015-06-22 80.000000
I have a pandas DataFrame that is indexed by date. I would like to select all the consecutive gaps by period and all the runs of consecutive days by period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see, there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another DataFrame in each case, since I want to use other columns in the DataFrame to group by.
Pandas (1.0.1 in this example) has a built-in diff() method which you can use to accomplish this. One benefit is that you can use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps Series.
from datetime import datetime, timedelta
import pandas as pd

# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])

# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]

# Filter diffs (here days > 1, but could be seconds, hours, etc.)
gaps = deltas[deltas > timedelta(days=1)]

# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
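The question also asks about runs of consecutive days; one way to label those runs with the same diff() idea is sketched below (the 1-day threshold, and the df/timedelta from the snippet above, are assumptions):

# label each run of consecutive days: a new run starts whenever the gap exceeds 1 day
run_id = (df['date'].diff() > timedelta(days=1)).cumsum()
runs = df.groupby(run_id)['date'].agg(['min', 'max', 'count'])
print(runs)  # start, end and length of each consecutive-day run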
Here's something to get started:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])

daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)

df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df to, say, a spreadsheet, it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
Now to complete:
df = df.reset_index()
df = df.groupby('ones').agg(first_spotted=('index', 'min'), gaps=('index', 'count'))
which gives:
            first_spotted  gaps
ones
0     2016-08-03 00:00:00     1
1     2016-08-03 10:53:39    34
2     2016-09-05 11:10:46     1
3     2016-09-05 11:11:30     2
4     2016-09-06 10:53:39    14
5     2016-09-19 10:23:03     1