Difference between dataframes with same ID pandas - python

I have 2 Dataframes like this:
ID Date1
1 2018-02-01
2 2019-03-01
3 2005-09-02
4 2021-11-09
And then I have this Dataframe:
ID Date2
4 2003-02-01
4 2004-03-11
3 1998-02-11
2 1999-02-11
1 2000-09-25
What I want to do is find the difference in dates between rows that share the same ID across the two dataframes, using this function:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
and summing up the differences for the corresponding Id.
The expected output would be:
where Date is the summed-up difference in days for the corresponding ID:
ID Date
1 6338
2 7323
3 2760
4 13308

Solution if df1.ID has no duplicates (only df2.ID has): use Series.map to align Date1 to df2 as a new column, subtract with Series.sub, convert the timedeltas to days with Series.dt.days, and last aggregate sum:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2['Date'] = df2['ID'].map(df1.set_index('ID')['Date1']).sub(df2['Date2']).dt.days
print (df2)
ID Date2 Date
0 4 2003-02-01 6856
1 4 2004-03-11 6452
2 3 1998-02-11 2760
3 2 1999-02-11 7323
4 1 2000-09-25 6338
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Or use DataFrame.merge instead of map:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2 = df1.merge(df2, on='ID')
df2['Date'] = df2['Date1'].sub(df2['Date2']).dt.days
print (df2)
ID Date1 Date2 Date
0 1 2018-02-01 2000-09-25 6338
1 2 2019-03-01 1999-02-11 7323
2 3 2005-09-02 1998-02-11 2760
3 4 2021-11-09 2003-02-01 6856
4 4 2021-11-09 2004-03-11 6452
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308

Does this work:
d = pd.merge(d1,d2)
d[['Date1','Date2']] = d[['Date1','Date2']].apply(pd.to_datetime, format = '%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']
d.groupby('ID')['Date'].sum().reset_index()
ID Date
0 1 6338 days
1 2 7323 days
2 3 2760 days
3 4 13308 days
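Note the days suffix: the grouped sum here is a Series of timedeltas. Applying .dt.days after the sum gives the integer output from the question. A self-contained sketch, with the two frames rebuilt from the question's data:

```python
import pandas as pd

d1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Date1': ['2018-02-01', '2019-03-01', '2005-09-02', '2021-11-09']})
d2 = pd.DataFrame({'ID': [4, 4, 3, 2, 1],
                   'Date2': ['2003-02-01', '2004-03-11', '1998-02-11', '1999-02-11', '2000-09-25']})

# Merge on the shared ID column, then take the date difference per row
d = pd.merge(d1, d2)
d[['Date1', 'Date2']] = d[['Date1', 'Date2']].apply(pd.to_datetime, format='%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']

# .dt.days turns the summed timedeltas into plain integers
out = d.groupby('ID')['Date'].sum().dt.days.reset_index()
print(out)
#    ID   Date
# 0   1   6338
# 1   2   7323
# 2   3   2760
# 3   4  13308
```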

Related

Cumulative of last 12 months from latest communication date?

I'm looking to count the number of interactions, grouped by ID, in the last 12 months for each unique ID. The 12-month window runs back from the latest date within each ID's group.
ID date
001 2022-02-01
002 2018-03-26
001 2021-08-05
001 2019-05-01
002 2019-02-01
003 2018-07-01
Output is something like the below.
ID Last_12_Months_Count
001 2
002 2
003 1
How can I achieve this in Pandas? Any function that would count the months based on the dates from the latest date per group?
Use:
m = df['date'].gt(df.groupby('ID')['date'].transform('max')
                  .sub(pd.offsets.DateOffset(years=1)))
df1 = df[m]
df1 = df1.groupby('ID').size().reset_index(name='Last_12_Months_Count')
print (df1)
ID Last_12_Months_Count
0 1 2
1 2 2
2 3 1
Or:
df1 = (df.groupby('ID')['date']
.agg(lambda x: x.gt(x.max() - pd.offsets.DateOffset(years=1)).sum())
.reset_index(name='Last_12_Months_Count'))
print (df1)
ID Last_12_Months_Count
0 1 2
1 2 2
2 3 1
To count multiple columns, use named aggregation:
df['date1'] = df['date']
f = lambda x: x.gt(x.max() - pd.offsets.DateOffset(years=1)).sum()
df1 = (df.groupby('ID')
.agg(Last_12_Months_Count_date = ('date', f),
Last_12_Months_Count_date1 = ('date1', f))
.reset_index())
print (df1)
ID Last_12_Months_Count_date Last_12_Months_Count_date1
0 1 2 2
1 2 2 2
2 3 1 1
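A self-contained check of the first variant, using the question's sample rows (IDs kept as strings to preserve the leading zeros):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['001', '002', '001', '001', '002', '003'],
    'date': pd.to_datetime(['2022-02-01', '2018-03-26', '2021-08-05',
                            '2019-05-01', '2019-02-01', '2018-07-01']),
})

# Keep rows within 12 months of each ID's latest date, then count per ID
m = df['date'].gt(df.groupby('ID')['date'].transform('max')
                  .sub(pd.offsets.DateOffset(years=1)))
df1 = df[m].groupby('ID').size().reset_index(name='Last_12_Months_Count')
print(df1)
#     ID  Last_12_Months_Count
# 0  001                     2
# 1  002                     2
# 2  003                     1
```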

Python: concat rows of two dataframes where not all columns are the same

I have two dataframes:
EDIT:
df1 = pd.DataFrame(index = [0,1,2], columns=['timestamp', 'order_id', 'account_id', 'USD', 'CAD'])
df1['timestamp']=['2022-01-01','2022-01-02','2022-01-03']
df1['account_id']=['usdcad','usdcad','usdcad']
df1['order_id']=['11233123','12313213','12341242']
df1['USD'] = [1,2,3]
df1['CAD'] = [4,5,6]
df1:
timestamp account_id order_id USD CAD
0 2022-01-01 usdcad 11233123 1 4
1 2022-01-02 usdcad 12313213 2 5
2 2022-01-03 usdcad 12341242 3 6
df2 = pd.DataFrame(index = [0,1], columns = ['timestamp','account_id', 'currency','balance'])
df2['timestamp']=['2021-12-21','2021-12-21']
df2['account_id']=['usdcad','usdcad']
df2['currency'] = ['USD', 'CAD']
df2['balance'] = [2,3]
df2:
timestamp account_id currency balance
0 2021-12-21 usdcad USD 2
1 2021-12-21 usdcad CAD 3
I would like to add a row to df1 at index 0, and fill that row with the balance of df2 based on currency. So the final df should look like this:
df:
timestamp account_id order_id USD CAD
0 0 0 0 2 3
1 2022-01-01 usdcad 11233123 1 4
2 2022-01-02 usdcad 12313213 2 5
3 2022-01-03 usdcad 12341242 3 6
How can I do this in a pythonic way? Thank you
Set the index of df2 to currency, transpose so the currencies become columns, then append df1:
df_out = df2.set_index('currency').T.append(df1, ignore_index=True).fillna(0)
print(df_out)
USD CAD order_id
0 2 3 0
1 1 4 11233123
2 2 5 12313213
3 3 6 12341242
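DataFrame.append was removed in pandas 2.0. A pd.concat-based equivalent is sketched below, taking only the balance row of the transpose so that the first row matches the expected output in the question (frames rebuilt from the question's construction code):

```python
import pandas as pd

df1 = pd.DataFrame({'timestamp': ['2022-01-01', '2022-01-02', '2022-01-03'],
                    'account_id': ['usdcad'] * 3,
                    'order_id': ['11233123', '12313213', '12341242'],
                    'USD': [1, 2, 3], 'CAD': [4, 5, 6]})
df2 = pd.DataFrame({'timestamp': ['2021-12-21'] * 2,
                    'account_id': ['usdcad'] * 2,
                    'currency': ['USD', 'CAD'], 'balance': [2, 3]})

# One-row frame with currencies as columns: USD=2, CAD=3
balances = df2.set_index('currency')['balance'].to_frame().T

# Prepend it to df1; columns missing on either side are filled with 0
df_out = pd.concat([balances, df1], ignore_index=True).fillna(0)
print(df_out)
```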

Pandas groupby datetime columns by periods

I have the following dataframe:
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d, ...), the mean value between specific hours. The hours are between 9 and 15, and I want to group by period: for example, the mean value between 09:00:00 and 11:00:00, between 11 and 12, between 13 and 15 (or any period I decide on).
I first tried to convert the column values to datetime format, thinking it would then be easier to do this:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then I got column names with a fake year, "1900-01-01 09:00:00"...
Also, the column headers' dtype was object, so I felt a bit lost...
My end goal is to be able to calculate new columns with the mean value for each row only between columns that fall inside the defined time period (e.g 9-11 etc...)
If you need a regular period, e.g. every 2 hours:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
If you need custom periods, it is possible to use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
And last, take the mean per column level; the NaN columns appear because the columns are categorical with unused categories:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: the solution above with timedeltas, because the format is HH:MM:SS:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
#missing values because not exist datetimes between 11:00:00-12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
I am going to show you my code and the results after the execution.
First import libraries and dataframe
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
[1,7,8,4,3,4,3]]),
columns=
['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
It would be nice to create a class in order to define what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end
    def __repr__(self):
        return self.initial + ' -- ' + self.end
With the command .loc we can get a sub-dataframe with the columns that I desire:
def get_colMean(df, period):
    df2 = df.loc[:, period.initial:period.end]
    array_mean = df2.mean(axis=1).values  # mean over the sliced frame, not the full df
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally we use .join in order to add our column with the means to our original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
I am going to show you my results:
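Putting that answer together as a runnable sketch, with the question's sample data (note the mean must be taken over the sliced frame, not the full df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 4, 7, 9, 5], [2, 6, 5, 4, 9, 8, 2],
                            [3, 5, 3, 21, 12, 6, 7], [1, 7, 8, 4, 3, 4, 3]]),
                  columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
                           '09:20:00', '09:25:00', '09:30:00'])

class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end
    def __repr__(self):
        return self.initial + ' -- ' + self.end

def get_colMean(df, period):
    # Slice only the columns inside the period, then average each row
    df2 = df.loc[:, period.initial:period.end]
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    return pd.DataFrame(array_mean, columns=[col_name])

def join_colMean(df, period):
    return df.join(get_colMean(df, period))

# Over the full 09:00-09:30 span this reproduces the means from the other
# answer (4.428571, 5.142857, 8.142857, 4.285714)
res = join_colMean(df, Period('9:00:00', '09:30:00'))
print(res)
```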

Pandas filter by date range within each group

I have a df like:
and I have to filter my df to values within two weeks per ID:
for each ID, look ahead two weeks from its first date and keep only those records.
Output:
I tried creating a min date per each ID and using below code to try to filter:
df[df.date.between(df['min_date'],df['min_date']+pd.DateOffset(days=14))]
Is there any more efficient way than this? This is taking a lot of time since my dataframe is big.
Setup
df = pd.DataFrame({
'Id': np.repeat([2, 3, 4], [4, 3, 4]),
'Date': ['12/31/2019', '1/1/2020', '1/5/2020', '1/20/2020',
'1/5/2020', '1/10/2020', '1/30/2020', '2/2/2020',
'2/4/2020', '2/10/2020', '2/25/2020'],
'Value': [*'abcbdeefffg']
})
First, convert Date to Timestamp with to_datetime
df['Date'] = pd.to_datetime(df['Date'])
concat with groupby in a comprehension
pd.concat([
d[d.Date <= d.Date.min() + pd.offsets.Day(14)]
for _, d in df.groupby('Id')
])
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
boolean slice... also with groupby
df[df.Date <= df.Id.map(df.groupby('Id').Date.min() + pd.offsets.Day(14))]
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
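The same filter can also be written with groupby.transform, which broadcasts each group's minimum back to the original rows without the map/index alignment (a sketch built on the Setup frame above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': np.repeat([2, 3, 4], [4, 3, 4]),
    'Date': pd.to_datetime(['12/31/2019', '1/1/2020', '1/5/2020', '1/20/2020',
                            '1/5/2020', '1/10/2020', '1/30/2020', '2/2/2020',
                            '2/4/2020', '2/10/2020', '2/25/2020']),
    'Value': [*'abcbdeefffg']
})

# Each row is compared against its own group's min Date plus 14 days
out = df[df['Date'] <= df.groupby('Id')['Date'].transform('min') + pd.offsets.Day(14)]
print(out)
```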
I struggle with pandas.concat, so you can try using merge:
# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# Get min Date for each Id and add two weeks (14 days)
s = df.groupby('Id')['Date'].min() + pd.offsets.Day(14)
# Merge df and s
df = df.merge(s, left_on='Id', right_index=True)
# Keep records where Date is less than the allowed limit
df = df.loc[df['Date_x'] <= df['Date_y'], ['Id','Date_x','Value']]
# Rename Date_x to Date (optional)
df.rename(columns={'Date_x':'Date'}, inplace=True)
The result is:
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f

Pandas - Counting the number of days for group by

I want to count the number of days after grouping by 2 columns:
groups = df.groupby([df.col1,df.col2])
Now I want to count the number of days relevant for each group:
result = groups['date_time'].dt.date.nunique()
I'm using something similar when I want to group by day, but here I get an error:
AttributeError: Cannot access attribute 'dt' of 'SeriesGroupBy' objects, try using the 'apply' method
What is the proper way to get the number of days?
You need another variation of groupby - define column first:
df['date_time'].dt.date.groupby([df.col1,df.col2]).nunique()
df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
df['date_time1'] = df['date_time'].dt.date
a = df.groupby([df.col1,df.col2]).date_time1.nunique()
Sample:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10, freq='15H')
df = pd.DataFrame({'date_time': rng, 'col1': [0]*5 + [1]*5, 'col2': [2]*3 + [3]*4+ [4]*3})
print (df)
col1 col2 date_time
0 0 2 2015-02-24 00:00:00
1 0 2 2015-02-24 15:00:00
2 0 2 2015-02-25 06:00:00
3 0 3 2015-02-25 21:00:00
4 0 3 2015-02-26 12:00:00
5 1 3 2015-02-27 03:00:00
6 1 3 2015-02-27 18:00:00
7 1 4 2015-02-28 09:00:00
8 1 4 2015-03-01 00:00:00
9 1 4 2015-03-01 15:00:00
#solution with apply
df1 = df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
print (df1)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
#create new helper column
df['date_time1'] = df['date_time'].dt.date
df2 = df.groupby([df.col1,df.col2]).date_time1.nunique()
print (df2)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time1, dtype: int64
df3 = df['date_time'].dt.date.groupby([df.col1,df.col2]).nunique()
print (df3)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
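A self-contained check that the variants agree, using the answer's sample frame:

```python
import pandas as pd

rng = pd.date_range('2015-02-24', periods=10, freq='15h')
df = pd.DataFrame({'date_time': rng,
                   'col1': [0] * 5 + [1] * 5,
                   'col2': [2] * 3 + [3] * 4 + [4] * 3})

# Count distinct calendar days per (col1, col2) group, two ways
a = df['date_time'].dt.date.groupby([df.col1, df.col2]).nunique()
b = df.groupby(['col1', 'col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
print(a)
# col1  col2
# 0     2       2
#       3       2
# 1     3       1
#       4       2
```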
