Pandas. merge/join/concat. Rows into columns - python

Given data frames similar to the following:
df1 = pd.DataFrame({'Customer': ['Customer1', 'Customer2', 'Customer3'],
'Status': [0, 1, 1]})
Customer Status
0 Customer1 0
1 Customer2 1
2 Customer3 1
df2 = pd.DataFrame({'Customer': ['Customer1', 'Customer1', 'Customer1', 'Customer2', 'Customer2', 'Customer3'],
'Call': ['01-01', '01-02', '01-03', '02-01', '03-02', '06-01']})
Customer Call
0 Customer1 01-01
1 Customer1 01-02
2 Customer1 01-03
3 Customer2 02-01
4 Customer2 03-02
5 Customer3 06-01
What is the most efficient way to merge the two into a third data frame in which the rows from df2 become columns added to df1? In the new data frame, each row should be a unique customer, and 'Call' from df2 should be added as incrementing columns, padded with NaN values as required.
I'd like to end up with something like:
Customer Status Call_1 Call_2 Call_3
0 Customer1 0 01-01 01-02 01-03
1 Customer2 1 02-01 03-02 NaN
2 Customer3 1 06-01 NaN NaN
I assume some combination of stack() and merge() is required but can't seem to figure it out.
Help appreciated

Use DataFrame.join with new DataFrame reshaped by GroupBy.cumcount and Series.unstack:
df = df1.join(df2.set_index(['Customer', df2.groupby('Customer').cumcount().add(1)])['Call']
.unstack().add_prefix('Call_'), 'Customer')
print (df)
Customer Status Call_1 Call_2 Call_3
0 Customer1 0 01-01 01-02 01-03
1 Customer2 1 02-01 03-02 NaN
2 Customer3 1 06-01 NaN NaN
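The one-liner above can be unpacked into three steps. This is only a sketch of the same technique on the question's sample data; the intermediate names `call_no` and `wide` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'Customer': ['Customer1', 'Customer2', 'Customer3'],
                    'Status': [0, 1, 1]})
df2 = pd.DataFrame({'Customer': ['Customer1', 'Customer1', 'Customer1',
                                 'Customer2', 'Customer2', 'Customer3'],
                    'Call': ['01-01', '01-02', '01-03', '02-01', '03-02', '06-01']})

# Step 1: number each customer's calls 1, 2, 3, ... in order of appearance
call_no = df2.groupby('Customer').cumcount().add(1)

# Step 2: index by (Customer, call number), then move the call number
# into the columns; customers with fewer calls get NaN in the gaps
wide = (df2.set_index(['Customer', call_no])['Call']
           .unstack()
           .add_prefix('Call_'))

# Step 3: attach the wide calls to df1, matching df1['Customer'] to wide's index
df = df1.join(wide, on='Customer')
print(df)
```

The cumcount step is what makes the reshape possible: without a per-customer counter, the (Customer, Call) pairs have no unique column position to unstack into.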

First pivot df2 with a cumcount de-duplication, then merge:
out = df1.merge(df2.assign(n=df2.groupby('Customer').cumcount().add(1))
.pivot(index='Customer', columns='n', values='Call')
.add_prefix('Call_'),
left_on='Customer', right_index=True)
Output:
Customer Status Call_1 Call_2 Call_3
0 Customer1 0 01-01 01-02 01-03
1 Customer2 1 02-01 03-02 NaN
2 Customer3 1 06-01 NaN NaN

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Customer':['Customer1','Customer2','Customer3'],
'Status':[0,1,1]})
df2 = pd.DataFrame({'Customer':['Customer1','Customer1','Customer1','Customer2','Customer2','Customer3'],
'Call': ['01-01','01-02','01-03','02-01','03-02','06-01']
})
group_c = df2.groupby('Customer')
step = group_c.cumcount().max() + 1
empty_df = pd.DataFrame(np.nan, index=range(step), columns=df2.columns)
r = (group_c.apply(lambda g: empty_df.combine_first(g.reset_index(drop=True)).reset_index(drop=True))
.unstack()
.drop('Customer', axis=1)
)
r.columns = r.columns.droplevel(0)+1
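r = r.add_prefix('Call_')
# Join the reshaped calls back onto df1 to bring in the Status column
r = df1.join(r, on='Customer')
Result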
>>> r
Customer Status Call_1 Call_2 Call_3
0 Customer1 0 01-01 01-02 01-03
1 Customer2 1 02-01 03-02 NaN
2 Customer3 1 06-01 NaN NaN
Content of empty_df:
>>> empty_df
Customer Call
0 NaN NaN
1 NaN NaN
2 NaN NaN

Related

Create Min and Max columns for each date column

DataFrame
ID DateMade   DelDate    ExpDate
1  01/01/2020 05/06/2020 06/05/2022
1  01/01/2020 07/06/2020 07/05/2022
1  01/01/2020 07/06/2020 09/09/2022
2  03/04/2020 07/08/2020 15/12/2022
2  05/06/2020 23/08/2020 31/12/2022
2  01/01/2021 31/08/2020 09/01/2023
What I want to do is group by ID and create columns for the min and max date of each column, but I'm not sure where to start. I know there are aggregate functions that work well with one column, but is there a straightforward solution when dealing with multiple columns?
Desired Output
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min ExpDate_Max
1  01/01/2020   01/01/2020   05/06/2020  07/06/2020  06/05/2022  09/09/2022
2  03/04/2020   01/01/2021   07/08/2020  31/08/2020  15/12/2022  09/01/2023
First convert the listed columns to datetimes with DataFrame.apply and to_datetime, then aggregate with min and max, and flatten the resulting MultiIndex columns, capitalizing the aggregation name:
cols = ['DateMade','DelDate','ExpDate']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
df1 = df.groupby('ID')[cols].agg(['min','max'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1].capitalize()}')
df1 = df1.reset_index()
print (df1)
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min \
0 1 2020-01-01 2020-01-01 2020-06-05 2020-06-07 2022-05-06
1 2 2020-04-03 2021-01-01 2020-08-07 2020-08-31 2022-12-15
ExpDate_Max
0 2022-09-09
1 2023-01-09
To keep the original datetime format, add a lambda function with Series.dt.strftime:
cols = ['DateMade','DelDate','ExpDate']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
df1 = df.groupby('ID')[cols].agg(['min','max'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1].capitalize()}')
df1 = df1.apply(lambda x: x.dt.strftime('%d/%m/%Y'))
df1 = df1.reset_index()
print (df1)
ID DateMade_Min DateMade_Max DelDate_Min DelDate_Max ExpDate_Min \
0 1 01/01/2020 01/01/2020 05/06/2020 07/06/2020 06/05/2022
1 2 03/04/2020 01/01/2021 07/08/2020 31/08/2020 15/12/2022
ExpDate_Max
0 09/09/2022
1 09/01/2023
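For a self-contained check of the approach, the sketch below rebuilds the question's table and runs the same aggregation. It assumes the dates arrive as day-first strings, and it passes an explicit `format` instead of `dayfirst=True`; that is a choice made for this sketch, not something the answer requires. The name `out` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'DateMade': ['01/01/2020', '01/01/2020', '01/01/2020',
                 '03/04/2020', '05/06/2020', '01/01/2021'],
    'DelDate': ['05/06/2020', '07/06/2020', '07/06/2020',
                '07/08/2020', '23/08/2020', '31/08/2020'],
    'ExpDate': ['06/05/2022', '07/05/2022', '09/09/2022',
                '15/12/2022', '31/12/2022', '09/01/2023'],
})

cols = ['DateMade', 'DelDate', 'ExpDate']
# Parse as real datetimes first; min/max on the raw strings would compare
# text ('15/12/2022' < '31/12/2022' only by accident of the day digits)
df[cols] = df[cols].apply(pd.to_datetime, format='%d/%m/%Y')

# One min and one max per date column, per ID
out = df.groupby('ID')[cols].agg(['min', 'max'])

# Flatten ('DateMade', 'min') -> 'DateMade_Min'
out.columns = [f'{col}_{stat.capitalize()}' for col, stat in out.columns]
out = out.reset_index()
print(out)
```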

Python: concat rows of two dataframes where not all columns are the same

I have two dataframes:
EDIT:
df1 = pd.DataFrame(index = [0,1,2], columns=['timestamp', 'order_id', 'account_id', 'USD', 'CAD'])
df1['timestamp']=['2022-01-01','2022-01-02','2022-01-03']
df1['account_id']=['usdcad','usdcad','usdcad']
df1['order_id']=['11233123','12313213','12341242']
df1['USD'] = [1,2,3]
df1['CAD'] = [4,5,6]
df1:
timestamp account_id order_id USD CAD
0 2022-01-01 usdcad 11233123 1 4
1 2022-01-02 usdcad 12313213 2 5
2 2022-01-03 usdcad 12341242 3 6
df2 = pd.DataFrame(index = [0,1], columns = ['timestamp','account_id', 'currency','balance'])
df2['timestamp']=['2021-12-21','2021-12-21']
df2['account_id']=['usdcad','usdcad']
df2['currency'] = ['USD', 'CAD']
df2['balance'] = [2,3]
df2:
timestamp account_id currency balance
0 2021-12-21 usdcad USD 2
1 2021-12-21 usdcad CAD 3
I would like to add a row to df1 at index 0, and fill that row with the balance of df2 based on currency. So the final df should look like this:
df:
timestamp account_id order_id USD CAD
0 0 0 0 2 3
1 2022-01-01 usdcad 11233123 1 4
2 2022-01-02 usdcad 12313213 2 5
3 2022-01-03 usdcad 12341242 3 6
How can I do this in a pythonic way? Thank you
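Select the balance column of df2 indexed by currency, transpose it so the currencies become columns, then concatenate it with df1 (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
df_out = (pd.concat([df2.set_index('currency')[['balance']].T, df1], ignore_index=True)
.fillna(0)[df1.columns])
print(df_out)
timestamp order_id account_id USD CAD
0 0 0 0 2 3
1 2022-01-01 11233123 usdcad 1 4
2 2022-01-02 12313213 usdcad 2 5
3 2022-01-03 12341242 usdcad 3 6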

how to concat specific rows through a pandas dataframe

I have a dataframe like this:
Number Description
10001  name 2
1002   name2(pt1)
NaN    name2(pt2)
1003   name3
1004   name4(pt1)
NaN    name4(pt2)
1005   name5
I need to concatenate the name parts (part 1 and part 2) into just one field and then drop the NaN rows, but I have no clue how to do this because the rows do not follow a specific interval pattern.
Try with groupby aggregate on a series based on the notna Number values.
Groups are created from:
df['Number'].notna().cumsum()
0 1
1 2
2 2
3 3
4 4
5 4
6 5
Name: Number, dtype: int32
Then aggregate taking the 'first' Number (since first value is guaranteed to be notna) and doing some operation to combine Descriptions like join:
new_df = (
df.groupby(df['Number'].notna().cumsum(), as_index=False)
.aggregate({'Number': 'first', 'Description': ''.join})
)
new_df:
Number Description
0 10001.0 name 2
1 1002.0 name2(pt1)name2(pt2)
2 1003.0 name3
3 1004.0 name4(pt1)name4(pt2)
4 1005.0 name5
Or comma separated join:
new_df = (
df.groupby(df['Number'].notna().cumsum(), as_index=False)
.aggregate({'Number': 'first', 'Description': ','.join})
)
new_df:
Number Description
0 10001.0 name 2
1 1002.0 name2(pt1),name2(pt2)
2 1003.0 name3
3 1004.0 name4(pt1),name4(pt2)
4 1005.0 name5
Or as list:
new_df = (
df.groupby(df['Number'].notna().cumsum(), as_index=False)
.aggregate({'Number': 'first', 'Description': list})
)
new_df:
Number Description
0 10001.0 [name 2]
1 1002.0 [name2(pt1), name2(pt2)]
2 1003.0 [name3]
3 1004.0 [name4(pt1), name4(pt2)]
4 1005.0 [name5]
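A runnable sketch of the same idea on the question's data follows. Two details are choices made here rather than part of the answer above: the grouping Series is renamed to a hypothetical `group` so it cannot collide with the `Number` column, and `Number` is cast back to int at the end, which assumes every group starts with a non-missing Number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Number': [10001, 1002, np.nan, 1003, 1004, np.nan, 1005],
    'Description': ['name 2', 'name2(pt1)', 'name2(pt2)', 'name3',
                    'name4(pt1)', 'name4(pt2)', 'name5'],
})

# Each non-missing Number starts a new group; NaN rows inherit the
# group of the row above them because the running count does not advance
group_id = df['Number'].notna().cumsum().rename('group')

new_df = (df.groupby(group_id)
            .agg({'Number': 'first', 'Description': ''.join})
            .reset_index(drop=True))

# The NaN rows forced Number to float; restore integers once they are gone
new_df['Number'] = new_df['Number'].astype(int)
print(new_df)
```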

Fill values within dates according other DataFrame

I'm trying to fill this DataFrame (df1) (I can start it with NaN or zero values):
27/05/2021 28/05/2021 29/05/2021 30/05/2021 31/05/2021 01/06/2021 02/06/2021 ...
Name1 Nan Nan Nan Nan Nan Nan Nan
Name2 Nan Nan Nan Nan Nan Nan Nan
Name3 Nan Nan Nan Nan Nan Nan Nan
Name4 Nan Nan Nan Nan Nan Nan Nan
According to the information in this DataFrame (df2):
Start1 End1 Dedication1 (h) Start2 End2 Dedication2 (h)
Name1 24/05/2021 31/05/2021 8 02/06/2021 10/07/2021 3
Name2 29/05/2021 31/05/2021 5 Nan Nan Nan
Name3 27/05/2021 01/06/2021 3 Nan Nan Nan
Name4 29/05/2021 07/08/2021 8 10/10/2021 10/12/2021 2
To get something like this (df3):
27/05/2021 28/05/2021 29/05/2021 30/05/2021 31/05/2021 01/06/2021 02/06/2021 ...
Name1 8 8 8 8 8 0 3
Name2 0 0 5 5 5 0 0
Name3 3 3 3 3 3 3 0
Name4 0 0 8 8 8 8 8
This is a schedule of working hours per day over several months. Both DataFrames have the same index and the same number of rows.
Using the dates in df2, I need to fill the df1 values between each start day and end day with the dedication hours for that period.
I have tried loc over all rows, and a lambda function to select columns by date, but I can't get the values filled within the dates. Perhaps I need several steps.
Thanks.
You could try this:
from datetime import datetime
import pandas as pd
# Setup
limits = [("Start1", "End1", "Dedication1"), ("Start2", "End2", "Dedication2")]
df3 = df1.copy()
# Deal with NaN values
df3.fillna(0, inplace=True)
# Assign the result back; chained inplace fillna on a column is unreliable in recent pandas
df2["Start2"] = df2["Start2"].fillna("31/12/2099")
df2["End2"] = df2["End2"].fillna("31/12/2099")
df2["Dedication2"] = df2["Dedication2"].fillna(0)
# Iterate and fill df3 (column names in limits must match those in df2)
for i, row in df1.iterrows():
    for col in df1.columns:
        for start, end, dedication in limits:
            # True when the column's date lies within [start, end] for this row
            mask = (
                datetime.strptime(df2.loc[i, start], "%d/%m/%Y")
                <= datetime.strptime(col, "%d/%m/%Y")
                <= datetime.strptime(df2.loc[i, end], "%d/%m/%Y")
            )
            if mask:
                df3.loc[i, col] = df2.loc[i, dedication]
# Format df3
df3 = df3.astype("int")
print(df3)
# Outputs
27/05/2021 28/05/2021 29/05/2021 ... 31/05/2021 01/06/2021 02/06/2021
Name1 8 8 8 ... 8 0 3
Name2 0 0 5 ... 5 0 0
Name3 3 3 3 ... 3 3 0
Name4 0 0 8 ... 8 8 8
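For larger schedules, the triple loop can be replaced by a broadcast comparison. The following is a sketch of that alternative, not the method used above. It assumes df2's columns are named exactly `Start1`, `End1`, `Dedication1` and so on, that missing periods are real NaN values, and that the calendar spans only the seven days shown in the question; `days`, `hours` and `in_range` are names introduced here.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "Start1": ["24/05/2021", "29/05/2021", "27/05/2021", "29/05/2021"],
        "End1": ["31/05/2021", "31/05/2021", "01/06/2021", "07/08/2021"],
        "Dedication1": [8, 5, 3, 8],
        "Start2": ["02/06/2021", np.nan, np.nan, "10/10/2021"],
        "End2": ["10/07/2021", np.nan, np.nan, "10/12/2021"],
        "Dedication2": [3, np.nan, np.nan, 2],
    },
    index=["Name1", "Name2", "Name3", "Name4"],
)

# Calendar columns of the schedule (assumed range for this sketch)
days = pd.date_range("2021-05-27", "2021-06-02")
hours = np.zeros((len(df2), len(days)))

for start, end, dedication in [("Start1", "End1", "Dedication1"),
                               ("Start2", "End2", "Dedication2")]:
    s = pd.to_datetime(df2[start], format="%d/%m/%Y").to_numpy()[:, None]
    e = pd.to_datetime(df2[end], format="%d/%m/%Y").to_numpy()[:, None]
    # in_range[i, j] is True when day j falls inside person i's period;
    # missing periods are NaT, which compares False and so is skipped
    in_range = (days.to_numpy() >= s) & (days.to_numpy() <= e)
    hours = np.where(in_range, df2[dedication].to_numpy()[:, None], hours)

df3 = pd.DataFrame(hours, index=df2.index,
                   columns=days.strftime("%d/%m/%Y")).astype(int)
print(df3)
```

Because NaT never satisfies the comparison, this version needs no placeholder dates such as 31/12/2099 for the missing second period.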

Divide columns in df by another df value based on condition

I have a dataframe:
df = pd.DataFrame({'date': ['2013-04-01','2013-04-01','2013-04-01','2013-04-02', '2013-04-02'],
'month': ['1','1','3','3','5'],
'pmonth': ['1', '1', '2', '5', '5'],
'duration': [30, 15, 20, 15, 30],
'pduration': ['10', '20', '30', '40', '50']})
I have to divide duration and pduration by the value column of a second dataframe where the date and month of the two dataframes match. The second df is:
df = pd.DataFrame({'date': ['2013-04-01','2013-04-02','2013-04-03','2013-04-04', '2013-04-05'],
'month': ['1','1','3','3','5'],
'value': ['1', '1', '2', '5', '5'],
})
The second df is grouped by date and month, so no duplicate combination of date and month will be present in it.
First check that the date and month columns have the same dtypes in both DataFrames, and that the columns used in the division are numeric:
#convert to numeric
df1['pduration'] = df1['pduration'].astype(int)
df2['value'] = df2['value'].astype(int)
print (df1.dtypes)
date object
month object
pmonth object
duration int64
pduration int32
print (df2.dtypes)
date object
month object
value int32
dtype: object
Then merge with left join and divide by DataFrame.div
df = df1.merge(df2, on=['date', 'month'], how='left')
df[['duration_new','pduration_new']] = df[['duration','pduration']].div(df['value'], axis=0)
print (df)
date month pmonth duration pduration value duration_new \
0 2013-04-01 1 1 30 10 1.0 30.0
1 2013-04-01 1 1 15 20 1.0 15.0
2 2013-04-01 3 2 20 30 NaN NaN
3 2013-04-02 3 5 15 40 NaN NaN
4 2013-04-02 5 5 30 50 NaN NaN
pduration_new
0 10.0
1 20.0
2 NaN
3 NaN
4 NaN
To remove the value column, use pop:
df[['duration_new','pduration_new']] = (df[['duration','pduration']]
.div(df.pop('value'), axis=0))
print (df)
date month pmonth duration pduration duration_new pduration_new
0 2013-04-01 1 1 30 10 30.0 10.0
1 2013-04-01 1 1 15 20 15.0 20.0
2 2013-04-01 3 2 20 30 NaN NaN
3 2013-04-02 3 5 15 40 NaN NaN
4 2013-04-02 5 5 30 50 NaN NaN
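The whole flow can be reproduced end to end with the sketch below, which names the question's two frames df1 and df2 as this answer does. The `validate='many_to_one'` argument is an addition for this sketch: it makes merge raise if df2 ever contains a duplicate date/month pair, which is the uniqueness the question states. `pd.to_numeric` is used here in place of `astype(int)`; either works on this data.

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01',
                             '2013-04-02', '2013-04-02'],
                    'month': ['1', '1', '3', '3', '5'],
                    'pmonth': ['1', '1', '2', '5', '5'],
                    'duration': [30, 15, 20, 15, 30],
                    'pduration': ['10', '20', '30', '40', '50']})
df2 = pd.DataFrame({'date': ['2013-04-01', '2013-04-02', '2013-04-03',
                             '2013-04-04', '2013-04-05'],
                    'month': ['1', '1', '3', '3', '5'],
                    'value': ['1', '1', '2', '5', '5']})

# The string columns must be numeric before they can be divided
df1['pduration'] = pd.to_numeric(df1['pduration'])
df2['value'] = pd.to_numeric(df2['value'])

# Left join keeps every df1 row; unmatched rows get NaN in 'value'
merged = df1.merge(df2, on=['date', 'month'], how='left',
                   validate='many_to_one')

# Row-wise division; pop removes 'value' from the frame as it is used
merged[['duration_new', 'pduration_new']] = (
    merged[['duration', 'pduration']].div(merged.pop('value'), axis=0))
print(merged)
```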
You can merge the second df into the first df and then divide.
Consider the first df as df1 and second df as df2
df1 = df1.merge(df2, on=['date', 'month'], how='left').fillna(1)
df1
date month pmonth duration pduration value
0 2013-04-01 1 1 30 10 1
1 2013-04-01 1 1 15 20 1
2 2013-04-01 3 2 20 30 1
3 2013-04-02 3 5 15 40 1
4 2013-04-02 5 5 30 50 1
df1['duration'] = df1['duration'] / df1['value']
df1['pduration'] = df1['pduration'] / df1['value']
df1.drop('value', axis=1, inplace=True)
You can merge the two dataframes; where the date and month match, the value column will be added to the first dataframe. Where there is no match, it will be represented by NaN. You can then do the division. See the code below.
Assuming your second dataframe is df2, then
df3 = df2.merge(df, how = 'right')
for col in ['duration','pduration']:
df3['new_'+col] = df3[col].astype(float)/df3['value'].astype(float)
df3
results in
date month value pmonth duration pduration new_duration new_pduration
0 2013-04-01 1 1 1 30 10 30.0 10.0
1 2013-04-01 1 1 1 15 20 15.0 20.0
2 2013-04-01 3 NaN 2 20 30 NaN NaN
3 2013-04-02 3 NaN 5 15 40 NaN NaN
4 2013-04-02 5 NaN 5 30 50 NaN NaN
