DataFrame
ID  DateMade    DelDate     ExpDate
1   01/01/2020  05/06/2020  06/05/2022
1   01/01/2020  07/06/2020  07/05/2022
1   01/01/2020  07/06/2020  09/09/2022
2   03/04/2020  07/08/2020  15/12/2022
2   05/06/2020  23/08/2020  31/12/2022
2   01/01/2021  31/08/2020  09/01/2023
What I want to do is group by ID and create Min and Max columns for each of the date columns, but I'm not sure where to start. I know there are aggregate functions that work well with one column, but is there a straightforward solution when dealing with multiple columns?
Desired Output
ID  DateMade_Min  DateMade_Max  DelDate_Min  DelDate_Max  ExpDate_Min  ExpDate_Max
1   01/01/2020    01/01/2020    05/06/2020   07/06/2020   06/05/2022   09/09/2022
2   03/04/2020    01/01/2021    07/08/2020   31/08/2020   15/12/2022   09/01/2023
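For reference, a minimal snippet to rebuild the sample data (dates kept as strings, as in the question):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'DateMade': ['01/01/2020', '01/01/2020', '01/01/2020', '03/04/2020', '05/06/2020', '01/01/2021'],
                   'DelDate': ['05/06/2020', '07/06/2020', '07/06/2020', '07/08/2020', '23/08/2020', '31/08/2020'],
                   'ExpDate': ['06/05/2022', '07/05/2022', '09/09/2022', '15/12/2022', '31/12/2022', '09/01/2023']})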
First convert the listed columns to datetimes with DataFrame.apply and to_datetime, then aggregate with min and max, and flatten the resulting MultiIndex columns, capitalizing the aggregation names:
cols = ['DateMade','DelDate','ExpDate']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
df1 = df.groupby('ID')[cols].agg(['min','max'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1].capitalize()}')
df1 = df1.reset_index()
print (df1)
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min \
0 1 2020-01-01 2020-01-01 2020-06-05 2020-06-07 2022-05-06
1 2 2020-04-03 2021-01-01 2020-08-07 2020-08-31 2022-12-15
ExpDate_Max
0 2022-09-09
1 2023-01-09
To restore the original datetime format, add a lambda function with Series.dt.strftime:
cols = ['DateMade','DelDate','ExpDate']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
df1 = df.groupby('ID')[cols].agg(['min','max'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1].capitalize()}')
df1 = df1.apply(lambda x: x.dt.strftime('%d/%m/%Y'))
df1 = df1.reset_index()
print (df1)
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min \
0 1 01/01/2020 01/01/2020 05/06/2020 07/06/2020 06/05/2022
1 2 03/04/2020 01/01/2021 07/08/2020 31/08/2020 15/12/2022
ExpDate_Max
0 09/09/2022
1 09/01/2023
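As a sketch of an alternative (not part of the original answer), named aggregation builds the flat column names directly, so no MultiIndex flattening is needed:
df1 = df.groupby('ID').agg(DateMade_Min=('DateMade', 'min'), DateMade_Max=('DateMade', 'max'),
                           DelDate_Min=('DelDate', 'min'), DelDate_Max=('DelDate', 'max'),
                           ExpDate_Min=('ExpDate', 'min'), ExpDate_Max=('ExpDate', 'max')).reset_index()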
I have what I assume might be a complex ask.
I have a few date columns in my dataframe. For each date column I want to store the header in a column called "Dates", and then create two new columns to hold the min and max values.
DataFrame
ID  Item  DateMade_Min  DateMade_Max  DelDate_Min  DelDate_Max  ExpDate_Min  ExpDate_Max
1   2322  01/01/2020    01/01/2020    05/06/2020   07/06/2020   06/05/2022   09/09/2022
2   4454  03/04/2020    01/01/2021    07/08/2020   31/08/2020   15/12/2022   09/01/2023
Desired Output
ID  Item  Dates     Min         Max
1   2322  DateMade  01/01/2020  01/01/2020
1   2322  DelDate   05/06/2020  07/06/2020
1   2322  ExpDate   06/05/2022  09/09/2022
2   4454  DateMade  03/04/2020  01/01/2021
2   4454  DelDate   07/08/2020  31/08/2020
2   4454  ExpDate   15/12/2022  09/01/2023
You can reshape with an intermediate stacking and a MultiIndex:
out = (df
.set_index(['ID', 'Item'])
.pipe(lambda d: d.set_axis(d.columns.str.split('_', expand=True), axis=1))
.stack(0)
.reset_index().rename(columns={'level_2': 'Dates'})
)
output:
ID Item Dates Max Min
0 1 2322 DateMade 01/01/2020 01/01/2020
1 1 2322 DelDate 07/06/2020 05/06/2020
2 1 2322 ExpDate 09/09/2022 06/05/2022
3 2 4454 DateMade 01/01/2021 03/04/2020
4 2 4454 DelDate 31/08/2020 07/08/2020
5 2 4454 ExpDate 09/01/2023 15/12/2022
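Note the stacked value columns come out alphabetically (Max before Min); to match the desired layout, reorder them:
out = out[['ID', 'Item', 'Dates', 'Min', 'Max']]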
Alternative
Alternatively, you can use the pyjanitor helper package and its pivot_longer function:
# pip install pyjanitor
import janitor
out = df.pivot_longer(
index=['ID', 'Item'],
names_to=('Dates', '.value'),
names_sep = '_',
sort_by_appearance=True
)
Here is one way to do it:
# melt the DF into long format
df2 = df.melt(['ID', 'Item'])
# split the variable names on '_' into the date column name and the min/max part
df2[['date', 'minmax']] = df2['variable'].str.split('_', expand=True)
# use pivot to reformat the result set
df2.pivot(index=['ID', 'Item', 'date'], columns='minmax', values='value').reset_index()
minmax ID Item date Max Min
0 1 2322 DateMade 01/01/2020 01/01/2020
1 1 2322 DelDate 07/06/2020 05/06/2020
2 1 2322 ExpDate 09/09/2022 06/05/2022
3 2 4454 DateMade 01/01/2021 03/04/2020
4 2 4454 DelDate 31/08/2020 07/08/2020
5 2 4454 ExpDate 09/01/2023 15/12/2022
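The final pivot expression above is not assigned; to keep the result, and to get the Min/Max order and the Dates header from the desired output, something along these lines works:
out = (df2.pivot(index=['ID', 'Item', 'date'], columns='minmax', values='value')
          .reset_index()
          .rename(columns={'date': 'Dates'})[['ID', 'Item', 'Dates', 'Min', 'Max']])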
I have 2 Dataframes like this:
ID Date1
1 2018-02-01
2 2019-03-01
3 2005-09-02
4 2021-11-09
And then I have this Dataframe:
ID Date2
4 2003-02-01
4 2004-03-11
3 1998-02-11
2 1999-02-11
1 2000-09-25
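For reference, the two frames can be rebuilt like this:
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Date1': ['2018-02-01', '2019-03-01', '2005-09-02', '2021-11-09']})
df2 = pd.DataFrame({'ID': [4, 4, 3, 2, 1],
                    'Date2': ['2003-02-01', '2004-03-11', '1998-02-11', '1999-02-11', '2000-09-25']})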
What I want to do is find the difference in dates for rows that share the same ID across the two dataframes, using this function:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
and then summing up the differences for each corresponding ID.
The expected output would be, where Date is the summed-up difference in days for the corresponding ID:
ID Date
1 6338
2 7323
3 2760
4 13308
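For example, the 6338 for ID 1 is just the function applied to its single pair of dates:
days_between("2018-02-01", "2000-09-25")  # 6338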
Solution if df1.ID has no duplicates (only df2.ID has them): use Series.map to create a new column, subtract with Series.sub, convert the timedeltas to days with Series.dt.days, and finally aggregate with sum:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2['Date'] = df2['ID'].map(df1.set_index('ID')['Date1']).sub(df2['Date2']).dt.days
print (df2)
ID Date2 Date
0 4 2003-02-01 6856
1 4 2004-03-11 6452
2 3 1998-02-11 2760
3 2 1999-02-11 7323
4 1 2000-09-25 6338
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Or use DataFrame.merge instead of map:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2 = df1.merge(df2, on='ID')
df2['Date'] = df2['Date1'].sub(df2['Date2']).dt.days
print (df2)
ID Date1 Date2 Date
0 1 2018-02-01 2000-09-25 6338
1 2 2019-03-01 1999-02-11 7323
2 3 2005-09-02 1998-02-11 2760
3 4 2021-11-09 2003-02-01 6856
4 4 2021-11-09 2004-03-11 6452
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Does this work?
d = pd.merge(df1, df2)
d[['Date1','Date2']] = d[['Date1','Date2']].apply(pd.to_datetime, format='%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']
d.groupby('ID')['Date'].sum().reset_index()
ID Date
0 1 6338 days
1 2 7323 days
2 3 2760 days
3 4 13308 days
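Note the sums come back as Timedelta values (hence the trailing "days"); if plain integers are wanted, take .dt.days before summing, for example:
d['Date'] = (d['Date1'] - d['Date2']).dt.days
d.groupby('ID')['Date'].sum().reset_index()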
Input
df1
id date v1
a 2020-1-1 1
a 2020-1-2 2
b 2020-1-4 10
b 2020-1-22 30
c 2020-2-4 10
c 2020-2-22 30
df2
id date v1
a 2020-1-3 1
b 2020-1-7 12
b 2020-1-22 13
c 2020-2-10 15
c 2020-2-22 60
Goal
id date v1 v2
a 2020-1-1 1 0
a 2020-1-2 2 0
a 2020-1-3 0 1
b 2020-1-4 10 0
b 2020-1-7 0 12
b 2020-1-22 30 13
c 2020-2-4 10 0
c 2020-2-10 0 15
c 2020-2-22 30 60
The details:
There are only two dataframes, and within each id the dates are unique.
Combine the two dataframes into df based on id, so each id contains all date values from both dataframes.
The merged dataframe has v1 and v2 columns: when a date appears in both df1 and df2, both original values are kept; when a date appears in only one of them, that original value is kept and the missing one becomes 0.
Try
I have searched the merge and concat documentation but could not find the answer.
First convert the date columns to datetimes with to_datetime for correct ordering, then use DataFrame.merge with an outer join, renaming df2's v1 column to avoid v1_x and v1_y columns in the output, replace missing values with DataFrame.fillna, and sort the output with DataFrame.sort_values:
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
df = (df1.merge(df2.rename(columns={'v1':'v2'}), on=['id','date'], how='outer')
.fillna(0)
.sort_values(['id','date']))
print (df)
id date v1 v2
0 a 2020-01-01 1.0 0.0
1 a 2020-01-02 2.0 0.0
6 a 2020-01-03 0.0 1.0
2 b 2020-01-04 10.0 0.0
7 b 2020-01-07 0.0 12.0
3 b 2020-01-22 30.0 13.0
4 c 2020-02-04 10.0 0.0
8 c 2020-02-10 0.0 15.0
5 c 2020-02-22 30.0 60.0
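fillna turns v1 and v2 into floats (note the 1.0, 0.0 values); if integer output is preferred, a cast afterwards works:
df[['v1', 'v2']] = df[['v1', 'v2']].astype(int)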
I have a dataframe like this:
Date Location_ID Problem_ID
---------------------+------------+----------
2013-01-02 10:00:00 | 1 | 43
2012-08-09 23:03:01 | 5 | 2
...
How can I count how often a Problem occurs per day and per Location?
Use groupby, either converting the Date column to dates or using Grouper, and aggregate with size:
print (df)
Date Location_ID Problem_ID
0 2013-01-02 10:00:00 1 43
1 2012-08-09 23:03:01 5 2
#if necessary convert column to datetimes
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.date, 'Location_ID']).size().reset_index(name='count')
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(key='Date', freq='D'), 'Location_ID'])
.size()
.reset_index(name='count'))
If the first column (Date) is the index:
print (df)
Location_ID Problem_ID
Date
2013-01-02 10:00:00 1 43
2012-08-09 23:03:01 5 2
df.index = pd.to_datetime(df.index)
df1 = (df.groupby([df.index.date, 'Location_ID'])
.size()
.reset_index(name='count')
.rename(columns={'level_0':'Date'}))
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(level='Date', freq='D'), 'Location_ID'])
.size()
.reset_index(name='count'))
I have the below dataframe. Dates are in DD/MM/YYYY format.
Date id
1/5/2017 2:00 PM 100
1/5/2017 3:00 PM 101
2/5/2017 10:00 AM 102
3/5/2017 09:00 AM 103
3/5/2017 10:00 AM 104
4/5/2017 09:00 AM 105
I need to group by date (ignoring the time) and count the number of IDs per day. The output dataframe should be as below:
DATE Count
1/5/2017 2 -> count 100,101
2/5/2017 1
3/5/2017 2
4/5/2017 1
I need an efficient way to achieve this.
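For reference, the sample frame can be rebuilt like this (dates kept as strings, which the string-based variant further below relies on):
import pandas as pd

df = pd.DataFrame({'Date': ['1/5/2017 2:00 PM', '1/5/2017 3:00 PM', '2/5/2017 10:00 AM',
                            '3/5/2017 09:00 AM', '3/5/2017 10:00 AM', '4/5/2017 09:00 AM'],
                   'id': [100, 101, 102, 103, 104, 105]})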
Use:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df['Date'].dt.date.value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
df1 = df.groupby(df['Date'].dt.date).size().reset_index(name='Count')
print (df1)
DATE Count
0 2017-05-01 2
1 2017-05-02 1
2 2017-05-03 2
3 2017-05-04 1
If you need to keep the original string format, apply the split to the raw string column (i.e. before the to_datetime conversion above):
df1 = df['Date'].str.split().str[0].value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
new = df['Date'].str.split().str[0]
df1 = df.groupby(new).size().reset_index(name='Count')
print (df1)
Date Count
0 1/5/2017 2
1 2/5/2017 1
2 3/5/2017 2
3 4/5/2017 1