Grouping data by id, var1 into consecutive dates in Python using pandas
I have some data that looks like:
import pandas as pd

df_raw_dates = pd.DataFrame(
    {"id": [102, 102, 102, 103, 103, 103, 104],
     "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
     "val": [9, 2, 4, 7, 6, 3, 2],
     "dates": [pd.Timestamp(2020, 1, 1),
               pd.Timestamp(2020, 1, 1),
               pd.Timestamp(2020, 1, 2),
               pd.Timestamp(2020, 1, 2),
               pd.Timestamp(2020, 1, 3),
               pd.Timestamp(2020, 1, 5),
               pd.Timestamp(2020, 3, 12)]})
I want to group this data by id and var1 where the dates are consecutive; if a day is missed, I want to start a new record.
For example the final output should be:
df_end_result = pd.DataFrame(
    {"id": [102, 102, 103, 103, 104],
     "var1": ['a', 'b', 'b', 'a', 'c'],
     "val": [13, 2, 13, 3, 2],
     "start_date": [pd.Timestamp(2020, 1, 1),
                    pd.Timestamp(2020, 1, 1),
                    pd.Timestamp(2020, 1, 2),
                    pd.Timestamp(2020, 1, 5),
                    pd.Timestamp(2020, 3, 12)],
     "end_date": [pd.Timestamp(2020, 1, 2),
                  pd.Timestamp(2020, 1, 1),
                  pd.Timestamp(2020, 1, 3),
                  pd.Timestamp(2020, 1, 5),
                  pd.Timestamp(2020, 3, 12)]})
I have tried this a few ways and keep failing. The length of time a record can exist for is unknown, and the set of possible var1 values can change with each id and with each date window as well.
For example, I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). I then thought I could take date(min) and date(min) + count_days to get 'start_date' and 'end_date':
import numpy as np

df = df_raw_dates  # 'df' here refers to the frame above
s = df.groupby(['id', 'var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)
I have also tried:
df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'dates': ['first', 'last']}).reset_index()

This gets me closer, but I don't think it deals with the consecutive-days problem; instead it provides the earliest and latest day, which unfortunately isn't something I can take forward.
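For reference, the cumsum "gap and island" trick (which the answer below also builds on) would look something like the sketch here; this is only a sketch, and it assumes the dates are sorted ascending within each (id, var1) group:

import pandas as pd

# Flag rows that start a new run: the gap to the previous date is not one day.
df = df_raw_dates.sort_values(['id', 'var1', 'dates'])
gap = df.groupby(['id', 'var1'])['dates'].diff() != pd.Timedelta(days=1)
# Cumulative-summing the flags yields one block id per consecutive run.
df['block'] = gap.cumsum()
out = (df.groupby(['id', 'var1', 'block'], as_index=False)
         .agg(val=('val', 'sum'),
              start_date=('dates', 'min'),
              end_date=('dates', 'max'))
         .drop(columns='block'))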
EDIT: adding more context
Another approach is:
df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)
which provides a list of ids and dates, but I am getting stuck finding the min/max consecutive dates within this new window.
Extended example that has a break in the date range for group (102,'a').
df_raw_dates = pd.DataFrame(
    {
        "id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
        "var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
        "val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
        "dates": [
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 5),
            pd.Timestamp(2020, 3, 12),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 7),
            pd.Timestamp(2020, 1, 8),
            pd.Timestamp(2020, 1, 9),
            pd.Timestamp(2020, 1, 21),
            pd.Timestamp(2020, 1, 25),
        ],
    }
)
Further example
This is using the answer below from wwii.
import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100, 105, 105, 105, 100, 105, 100, 100, 105, 105, 105, 105,
               105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105],
        "var1": ["a", "b", "d", "a", "d", "c", "b", "b", "b", "a", "c", "d",
                 "c", "a", "d", "b", "a", "d", "b", "b", "d", "c", "a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
        ],
    }
)
day = pd.Timedelta('1d')

# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])

new_df = collections.defaultdict(list)
for k, g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k, groups, groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _, date_range in date_groups:
        start, end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))
>>>
     id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22
3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19
7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19
From the above I would have expected rows 3, 4, 5, 6 to be grouped together, and rows 7, 8, 9, 10 as well. I am not sure why this example breaks. What is different between it and the extended example above that makes this not work?
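(One difference that may matter: in this further example the dates are listed newest-first, while the extended example is oldest-first. Since breaks = filt['dates'].diff() != day assumes ascending dates, every row of a descending group registers as a break and becomes its own range. Sorting first, a guess rather than a verified fix, would be:)

# assumption: the grouping logic expects ascending dates within each group
df_raw_dates1 = df_raw_dates1.sort_values(['id', 'var1', 'dates'])
gb = df_raw_dates1.groupby(['id', 'var1'])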
I don't have Pandas superpowers, so I never try to do groupby one-liners; maybe someday.
Adapting the accepted answer to the SO question Find group of consecutive dates in Pandas DataFrame: first group by ['id', 'var1'], then within each group, group by consecutive date ranges.
import pandas as pd

sep = "************************************\n"
day = pd.Timedelta('1d')

# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])
for k, g in gb:
    print(g)
    dt = g['dates']
    # find difference in days between rows
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _, date_range in date_groups:
        print(date_range)
    print(sep)
You can see that the (102,'a') group has been split into two groups.
    id var1  val      dates
0  102    a    9 2020-01-01
2  102    a    4 2020-01-02
7  102    a    1 2020-01-03

    id var1  val      dates
8   102    a    2 2020-01-07
9   102    a    3 2020-01-08
10  102    a    4 2020-01-09
Going a bit further: while iterating, construct a dictionary to build a new DataFrame from.
import pandas as pd
import collections

day = pd.Timedelta('1d')

# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])

new_df = collections.defaultdict(list)
for k, g in gb:
    # print(g)
    eyed, var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k, groups, groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _, date_range in date_groups:
        start, end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))
>>>
    id var1        val      start        end
0  102    a   4.666667 2020-01-01 2020-01-03
1  102    a   3.000000 2020-01-07 2020-01-09
2  102    b   2.000000 2020-01-01 2020-01-01
3  103    a   3.000000 2020-01-05 2020-01-05
4  103    b   6.500000 2020-01-02 2020-01-03
5  104    c   2.000000 2020-03-12 2020-03-12
6  108    a  99.000000 2020-01-21 2020-01-25
Seems pretty tedious; maybe someone will come along with a less verbose solution. Some of the operations could be put in functions, and .apply, .transform, or .pipe could be used to make it a little cleaner.
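For instance, here is a sketch of what an .apply version might look like; this is an assumption about the shape such a refactor could take, not something tested against every edge case above:

def collapse_runs(g):
    # one block id per run of consecutive dates (assumes ascending dates)
    blocks = (g['dates'].diff() != day).cumsum()
    return g.groupby(blocks).agg(
        val=('val', 'mean'), start=('dates', 'min'), end=('dates', 'max'))

result = (df_raw_dates.sort_values('dates')
            .groupby(['id', 'var1'])
            .apply(collapse_runs)
            .reset_index(level=2, drop=True)
            .reset_index())

Because every row either starts or extends a block, this variant should also handle single-date runs and gapped groups such as (108, 'a') without a special case.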
The defaultdict version above also does not account for ('id', 'var1') groups that have more than one date but no consecutive runs, e.g.
    id var1  val      dates
11  108    a   99 2020-01-21
12  108    a   99 2020-01-25
You might need to detect whether there are any gaps in a datetime Series and use that fact to accommodate such groups.
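A minimal check along those lines might be the following, assuming g is a single ('id', 'var1') group and day is the one-day Timedelta defined above:

# True if any step between successive dates in this group is not exactly one day
has_gap = g['dates'].diff().dropna().ne(day).any()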
Related
Update dataframe value with other dataframe value if condition met?
I've two dfs. I wanted to assign df1.date = df2.start_date if df1.date <= df2.end_date.

df1 = {"date": ['2020-12-23 18:20:37', '2021-08-20 12:17:41.487'],
       "result": ['pass', 'fail']}
df2 = {"start_date": ['2021-08-19 12:17:41.487', '2021-08-12 12:17:41.487', '2021-08-26 12:17:41.487'],
       "end_date": ['2021-08-26 12:17:41.487', '2021-08-19 12:17:41.487', '2021-09-02 12:17:41.487']}

I give just a couple of rows here, while in reality I'm running this query on 100,000 rows. How do I achieve this?
Assuming I'm understanding your question correctly, and that both your dataframes line up with each other, you could loop through each row and compare across to the other df. However, if you have thousands of records this could take some time.

import datetime
import pandas as pd

df1 = pd.DataFrame({"date": [datetime.date(2014, 12, 29), datetime.date(2015, 1, 26),
                             datetime.date(2015, 2, 26), datetime.date(2015, 3, 8),
                             datetime.date(2015, 4, 10)],
                    "result": ['pass', 'fail', 'fail', 'pass', 'pass']})
df2 = pd.DataFrame({'start_date': [datetime.date(2015, 1, 1), datetime.date(2015, 2, 1),
                                   datetime.date(2015, 3, 1), datetime.date(2015, 4, 1),
                                   datetime.date(2015, 5, 1)],
                    'end_date': [datetime.date(2015, 1, 25), datetime.date(2015, 2, 20),
                                 datetime.date(2015, 3, 15), datetime.date(2015, 4, 24),
                                 datetime.date(2015, 5, 23)]})

for i in range(len(df1)):
    if df1.date[i] <= df2.end_date[i]:
        df1.date[i] = df2.start_date[i]

But again, this assumes that both data frames have the same length and it's a direct compare across.
We can make use of numpy's where.

# df1.date = df2.start_date if df1.date <= df2.end_date
import numpy as np
df1.date = np.where(df1.date <= df2.end_date, df2.start_date, df1.date)

New df1:

        date result
0 2015-01-01   pass
1 2015-02-01   pass
2 2015-03-01   fail
3 2015-04-01   fail
4 2015-05-01   pass

Data used:

df1 = {"date": ['2014-12-29', '2015-01-26', '2015-02-26', '2015-03-08', '2015-04-10'],
       "result": ['pass', 'pass', 'fail', 'fail', 'pass']}
df2 = {"start_date": ['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01'],
       "end_date": ['2015-01-25', '2015-02-20', '2015-03-15', '2015-04-24', '2015-05-23']}

df1 = pd.DataFrame(data=df1)
df2 = pd.DataFrame(data=df2)
Add tuple values to dict if one value is a key in the dict
So I have been stuck on this for a while now. My data looks as follows:

  Initialer  Start uge  Start dag Start tid  End uge  End dag   End tid
0      MBAU         18          3  09:00:00       18        5  12:00:00
1      MBAU         22          2  14:00:00       22        2  15:00:00
2      MBAU         13          4  09:00:00       13        4  10:00:00
3      AMPE         14          1  12:00:00       14        1  13:30:00
4      AMPE         26          6  09:00:00       27        2  22:00:00

I am trying to generate a dictionary with 'Initialer' as keys; the values should consist of two tuples or lists, one containing the "Start" columns and one containing the "End" columns, like this: {'Initialer': [(Start uge, Start dag, Start tid), (End uge, End dag, End tid)]}:

{'MBAU': [[(18, 3, 09:00:00), (18, 5, 12:00:00)],
          [(22, 2, 14:00:00), (22, 2, 15:00:00)],
          [(13, 4, 09:00:00), (13, 4, 10:00:00)]],
 'AMPE': [[(14, 1, 12:00:00), (14, 1, 13:30:00)],
          [(26, 6, 09:00:00), (27, 2, 22:00:00)]]}

However, I am struggling to get it right. I have tried generating two lists of tuples containing the start columns and end columns respectively:

start_tuple = self.u_data[['Initialer', 'Start uge', 'Start dag', 'Start tid']].apply(tuple, axis=1).values
>>> [('MBAU', 18, 3, datetime.time(9, 0)) ('MBAU', 22, 2, datetime.time(14, 0))
 ('MBAU', 13, 4, datetime.time(9, 0)) ('AMPE', 14, 1, datetime.time(12, 0))
 ('AMPE', 26, 6, datetime.time(9, 0))]

end_tuple = self.u_data[['Initialer', 'End uge', 'End dag', 'End tid']].apply(tuple, axis=1).values
>>> [('MBAU', 18, 5, datetime.time(12, 0)) ('MBAU', 22, 2, datetime.time(15, 0))
 ('MBAU', 13, 4, datetime.time(10, 0)) ('AMPE', 14, 1, datetime.time(13, 30))
 ('AMPE', 27, 2, datetime.time(22, 0))]

I then created a dict based on unique values in 'Initialer' and tried to use a comprehension to populate it, as such:

start_dict = {k: [] for k in self.u_data.Initialer.unique()}
(start_dict[initialer].append((x, y, z)) for initialer, x, y, z in start_tuple)
>>> {'MBAU': [], 'AMPE': []}

But this returns only empty values: {'MBAU': [], 'AMPE': []}. I have tried to research how I could do this, but without any luck. Is there a smart way to accomplish this?
Why does it fail? The reason you are getting {'MBAU': [], 'AMPE': []} is that list.append() is an in-place operation that doesn't return anything, and (i for i in l) creates a generator object instead of actually running the append operation. You can see what happens here instead:

start_tuple = df[['Initialer', 'Start_uge', 'Start_dag', 'Start_tid']].apply(tuple, axis=1)
start_dict = {k: [] for k in df.Initialer.unique()}

# list comprehension runs the append operation but returns None
[start_dict[initialer].append((x, y, z)) for initialer, x, y, z in start_tuple]

### Returns:
### [None, None, None, None, None]

### But if you print start_dict
print(start_dict)

{'MBAU': [(18, 3, '09:00:00'), (22, 2, '14:00:00'), (13, 4, '09:00:00')],
 'AMPE': [(14, 1, '12:00:00'), (26, 6, '09:00:00')]}

This means that the operation runs and returns None, but the original start_dict object gets modified, since now you are actually iterating and not creating a generator.

A modified approach with defaultdict, more in line with the approach you have used already but using collections.defaultdict:

from collections import defaultdict

init = df['Initialer'].tolist()
start_tuple = df[['Start_uge', 'Start_dag', 'Start_tid']].apply(tuple, axis=1)
end_tuple = df[['End_uge', 'End_dag', 'End_tid']].apply(tuple, axis=1)

items = zip(init, start_tuple, end_tuple)

d = defaultdict(list)
for i, j, k in items:
    d[i].append([j, k])

output = dict(d)
output

{'MBAU': [[(18, 3, '09:00:00'), (18, 5, '12:00:00')],
          [(22, 2, '14:00:00'), (22, 2, '15:00:00')],
          [(13, 4, '09:00:00'), (13, 4, '10:00:00')]],
 'AMPE': [[(14, 1, '12:00:00'), (14, 1, '13:30:00')],
          [(26, 6, '09:00:00'), (27, 2, '22:00:00')]]}

Another variation: you can solve this in a slightly shorter way with collections.defaultdict as:

from collections import defaultdict

d = defaultdict(list)
for _, row in df.iterrows():
    vals = row.tolist()
    d[vals[0]].append([tuple(vals[1:4]), tuple(vals[4:])])

output = dict(d)
output

{'MBAU': [[(18, 3, '09:00:00'), (18, 5, '12:00:00')],
          [(22, 2, '14:00:00'), (22, 2, '15:00:00')],
          [(13, 4, '09:00:00'), (13, 4, '10:00:00')]],
 'AMPE': [[(14, 1, '12:00:00'), (14, 1, '13:30:00')],
          [(26, 6, '09:00:00'), (27, 2, '22:00:00')]]}
Pandas: Missing value imputation based on date
I have a pandas data-frame which is as follows:

df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
                         "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
                         "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
                         "rand": [np.nan, 3, 7, 8, np.nan, 4],
                         "val3": [5, np.nan, np.nan, np.nan, 3, np.nan],
                         "unique_date": [pd.Timestamp(2002, 3, 3),
                                         pd.Timestamp(2002, 3, 5),
                                         pd.Timestamp(2003, 4, 5),
                                         pd.Timestamp(2003, 4, 9),
                                         pd.Timestamp(2003, 8, 7),
                                         pd.Timestamp(2003, 9, 7)],
                         "end_date": [pd.Timestamp(2005, 3, 3),
                                      pd.Timestamp(2003, 4, 7),
                                      np.nan,
                                      np.nan,
                                      pd.Timestamp(2003, 10, 7),
                                      np.nan]})

df_first
    id  val1  val2  rand  val3 unique_date   end_date
0  102   NaN   5.0   NaN   5.0  2002-03-03 2005-03-03
1  102   4.0   NaN   3.0   NaN  2002-03-05 2003-04-07
2  102   NaN   NaN   7.0   NaN  2003-04-05        NaT
3  102   NaN   NaN   8.0   NaN  2003-04-09        NaT
4  103   1.0   NaN   NaN   3.0  2003-08-07 2003-10-07
5  103   NaN   5.0   4.0   NaN  2003-09-07        NaT

The missing value imputation should be done in a way that there is a forward fill of the values that appear in each row that has an end_date value. The forward fill runs for as long as the unique_date is before the end_date for the same id. Based on that, the forward fill should be done per id. Lastly, the missing value imputation should take place only for columns whose name contains val; no other columns have that pattern in their name.

In case I haven't made myself clear enough, the solution for the data-frame posted above is below:

    id  val1  val2  rand  val3 unique_date
0  102   NaN   5.0   NaN   5.0  2002-03-03
1  102   4.0   5.0   3.0   5.0  2002-03-05
2  102   4.0   5.0   7.0   5.0  2003-04-05
3  102   NaN   5.0   8.0   5.0  2003-04-09
4  103   1.0   NaN   NaN   3.0  2003-08-07
5  103   1.0   5.0   4.0   3.0  2003-08-07

Let me know if you need any further clarification, since the whole thing seems rather complicated at first sight. Looking forward to your answers!
Sorry for the confusing question as well as explanation. In the end I was able to achieve what I wanted in the following way:

df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
                         "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
                         "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
                         "val3": [np.nan, 3, np.nan, np.nan, np.nan, 4],
                         "val4": [5, np.nan, np.nan, np.nan, 3, np.nan],
                         "rand": [3, np.nan, 1, np.nan, 5, 6],
                         "unique_date": [pd.Timestamp(2002, 3, 3),
                                         pd.Timestamp(2002, 3, 5),
                                         pd.Timestamp(2003, 4, 5),
                                         pd.Timestamp(2003, 4, 9),
                                         pd.Timestamp(2003, 8, 7),
                                         pd.Timestamp(2003, 9, 7)],
                         "end_date": [pd.Timestamp(2005, 3, 3),
                                      pd.Timestamp(2003, 4, 7),
                                      np.nan,
                                      np.nan,
                                      pd.Timestamp(2003, 10, 7),
                                      np.nan]})
display(df_first)

indexes = []
columns = df_first.filter(like="val").columns
for column in columns:
    indexes.append(df_first.columns.get_loc(column))

elements = df_first.values[:, indexes]
ids = df_first.values[:, df_first.columns.get_loc("id")]
start_dates = df_first.values[:, df_first.columns.get_loc("unique_date")]
end_dates = df_first.values[:, df_first.columns.get_loc("end_date")]

for i in range(len(elements)):
    if pd.notnull(end_dates[i]):
        not_nan_indexes = np.argwhere(~pd.isnull(elements[i])).ravel()
        elements_prop = elements[i, not_nan_indexes]
        j = i
        while (j < len(elements) and start_dates[j] < end_dates[i]
               and ids[i] == ids[j]):
            elements[j, not_nan_indexes] = elements_prop
            j += 1

df_first[columns] = elements
df_first = df_first.drop(columns="end_date")
display(df_first)

Probably the solution is overkill, but I was not able to find anything pandas-specific to achieve what I wanted.
timedeltas for a groupby column in pandas [duplicate]
This question already has an answer here: How to calculate time difference by group using pandas? (1 answer). Closed 4 years ago.

For a given data frame df:

timestamps = [
    datetime.datetime(2018, 1, 1, 10, 0, 0, 0),  # person 1
    datetime.datetime(2018, 1, 1, 10, 0, 0, 0),  # person 2
    datetime.datetime(2018, 1, 1, 11, 0, 0, 0),  # person 2
    datetime.datetime(2018, 1, 2, 11, 0, 0, 0),  # person 2
    datetime.datetime(2018, 1, 1, 10, 0, 0, 0),  # person 3
    datetime.datetime(2018, 1, 2, 11, 0, 0, 0),  # person 3
    datetime.datetime(2018, 1, 4, 10, 0, 0, 0),  # person 3
    datetime.datetime(2018, 1, 5, 12, 0, 0, 0)   # person 3
]
df = pd.DataFrame({'person': [1, 2, 2, 2, 3, 3, 3, 3],
                   'timestamp': timestamps})

I want to calculate, for each person (df.groupby('person')), the time differences between all timestamps of that person, which I would do with diff(). df.groupby('person').timestamp.diff() is only half the way there, because the mapping back to the person is lost. What could a solution look like?
I think you should use df.groupby('person').timestamp.transform(pd.Series.diff).
The problem is that diff does not aggregate values, so a possible solution is transform:

df['new'] = df.groupby('person').timestamp.transform(pd.Series.diff)
print(df)

   person           timestamp             new
0       1 2018-01-01 10:00:00             NaT
1       2 2018-01-01 10:00:00             NaT
2       2 2018-01-01 11:00:00 0 days 01:00:00
3       2 2018-01-02 11:00:00 1 days 00:00:00
4       3 2018-01-01 10:00:00             NaT
5       3 2018-01-02 11:00:00 1 days 01:00:00
6       3 2018-01-04 10:00:00 1 days 23:00:00
7       3 2018-01-05 12:00:00 1 days 02:00:00
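Worth noting, as a version-dependent aside: in recent pandas releases the groupby diff already returns a Series aligned to the original index, so the assignment should also work without transform:

df['new'] = df.groupby('person')['timestamp'].diff()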
Calculate Average Number of Days Between Multiple Dates
Let's say I have the following data frame. I want to calculate the average number of days between all the activities for a particular account. Below is my desired result.

I know how to calculate the number of days between two dates with the following code, but I don't know how to calculate what I am looking for across multiple dates:

from datetime import date

d0 = date(2016, 8, 18)
d1 = date(2016, 9, 26)
delta = d0 - d1
print(delta.days)
I would do this as follows in pandas (assuming the Date column is a datetime64):

In [11]: df
Out[11]:
  Account Activity       Date
0       A        a 2015-10-21
1       A        b 2016-07-07
2       A        c 2016-07-07
3       A        d 2016-09-14
4       A        e 2016-10-12
5       B        a 2015-11-24
6       B        b 2015-12-30

In [12]: df.groupby("Account")["Date"].apply(lambda x: x.diff().mean())
Out[12]:
Account
A   89 days 06:00:00
B   36 days 00:00:00
Name: Date, dtype: timedelta64[ns]
If your dates are in a list:

>>> from datetime import date
>>> dates = [date(2015, 10, 21), date(2016, 7, 7), date(2016, 7, 7),
...          date(2016, 9, 14), date(2016, 10, 12), date(2016, 10, 12),
...          date(2016, 11, 22), date(2016, 12, 21)]
>>> differences = [(dates[i] - dates[i-1]).days for i in range(1, len(dates))]
>>> differences
[260, 0, 69, 28, 0, 41, 29]
>>> float(sum(differences)) / len(differences)
61.0
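As a sanity check, the mean of consecutive differences telescopes to (last - first) / (number of gaps), so the same figure comes straight from the endpoints:

>>> (dates[-1] - dates[0]).days / float(len(dates) - 1)
61.0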