Displaying next value on a column considering groups in Pandas Dataframe - python

I have this example dataframe and I need to display the next delivery date for each client-region group.
The date could be coded either as a string or as a datetime; I'm using strings in this example.
# Import pandas library
import pandas as pd
import numpy as np
data = [['NY', 'A','2020-01-01', 10], ['NY', 'A','2020-02-03', 20], ['NY', 'A','2020-04-05', 30], ['NY', 'A','2020-05-05', 25],
['NY', 'B','2020-01-01', 15], ['NY', 'B','2020-02-02', 10], ['NY', 'B','2020-02-10', 20],
['FL', 'A','2020-01-01', 15], ['FL', 'A','2020-02-01', 10], ['FL', 'A','2020-03-01', 12], ['FL', 'A','2020-04-01', 25], ['FL', 'A','2020-05-01', 20]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Region', 'Client', 'deliveryDate', 'price'])
# print dataframe.
df
Region Client deliveryDate price
0 NY A 2020-01-01 10
1 NY A 2020-02-03 20
2 NY A 2020-04-05 30
3 NY A 2020-05-05 25
4 NY B 2020-01-01 15
5 NY B 2020-02-02 10
6 NY B 2020-02-10 20
7 FL A 2020-01-01 15
8 FL A 2020-02-01 10
9 FL A 2020-03-01 12
10 FL A 2020-04-01 25
11 FL A 2020-05-01 20
Desired output:
data2 = [['NY', 'A','2020-01-01', '2020-02-03', 10], ['NY', 'A','2020-02-03', '2020-04-05', 20], ['NY', 'A','2020-04-05', '2020-05-05', 30], ['NY', 'A','2020-05-05', float('nan'), 25],
['NY', 'B','2020-01-01', '2020-02-02', 15], ['NY', 'B','2020-02-02','2020-02-10', 10], ['NY', 'B','2020-02-10', float('nan'), 20],
['FL', 'A','2020-01-01', '2020-02-01', 15], ['FL', 'A','2020-02-01', '2020-03-01', 10], ['FL', 'A','2020-03-01', '2020-04-01', 12], ['FL', 'A','2020-04-01', '2020-05-01', 25], ['FL', 'A','2020-05-01', float('nan'), 20]
]
# Create the pandas DataFrame
df2 = pd.DataFrame(data2, columns = ['Region', 'Client', 'deliveryDate', 'nextDelivery', 'price'])
Region Client deliveryDate nextDelivery price
0 NY A 2020-01-01 2020-02-03 10
1 NY A 2020-02-03 2020-04-05 20
2 NY A 2020-04-05 2020-05-05 30
3 NY A 2020-05-05 NaN 25
4 NY B 2020-01-01 2020-02-02 15
5 NY B 2020-02-02 2020-02-10 10
6 NY B 2020-02-10 NaN 20
7 FL A 2020-01-01 2020-02-01 15
8 FL A 2020-02-01 2020-03-01 10
9 FL A 2020-03-01 2020-04-01 12
10 FL A 2020-04-01 2020-05-01 25
11 FL A 2020-05-01 NaN 20
Thanks in advance.

Assuming the delivery dates are ordered, how about grouping by region & client, then applying a shift?
df['nextDelivery'] = df.groupby(['Region','Client']).shift(-1)['deliveryDate']
Output:
Region Client deliveryDate price nextDelivery
0 NY A 2020-01-01 10 2020-02-03
1 NY A 2020-02-03 20 2020-04-05
2 NY A 2020-04-05 30 2020-05-05
3 NY A 2020-05-05 25 NaN
4 NY B 2020-01-01 15 2020-02-02
5 NY B 2020-02-02 10 2020-02-10
6 NY B 2020-02-10 20 NaN
7 FL A 2020-01-01 15 2020-02-01
8 FL A 2020-02-01 10 2020-03-01
9 FL A 2020-03-01 12 2020-04-01
10 FL A 2020-04-01 25 2020-05-01
11 FL A 2020-05-01 20 NaN
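If the rows aren't guaranteed to be sorted by date within each group, sorting first keeps the shift correct, and selecting the column before shifting avoids shifting every other column. A minimal sketch under that assumption, using the same frame:
# Sort so that shift(-1) really returns the next delivery per group
df = df.sort_values(['Region', 'Client', 'deliveryDate'])
df['nextDelivery'] = df.groupby(['Region', 'Client'])['deliveryDate'].shift(-1)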

Related

I have a data frame and I want to calculate the last three months' transaction count and sum for each Id group.

Input data set
Id  Date        TransAmt
A   2022-01-02  10
A   2022-01-02  20
A   2022-02-04  30
A   2022-02-05  20
A   2022-04-08  300
A   2022-04-11  100
A   2022-05-13  200
A   2022-06-12  20
A   2022-06-15  300
A   2022-08-16  100
Desired output
Id  Date        TransAmt  CountThreeMonth  AmountThreeMonths
A   2022-01-02  10        2                30
A   2022-01-02  20        2                30
A   2022-02-04  30        4                80
A   2022-02-05  20        4                80
A   2022-04-08  300       4                450
A   2022-04-11  100       4                450
A   2022-05-13  200       3                600
A   2022-06-12  20        5                920
A   2022-06-15  300       5                920
A   2022-08-16  100       3                420
Note:
1. There can be multiple transactions on the same date, e.g. on 2022-01-02 there are two transactions.
2. I want to calculate the last 3 months' transactions as the current month's total transaction count plus the previous two months' total transaction count, with the same logic for the amount. For example, January has only 2 transactions and the previous months have none, so 2 + 0 + 0 = 2.
3. I want all calculations done for each Id group.
Please help me achieve my desired output.
Thanks in advance.
Example
data = [['A', '2022-01-02', 10], ['A', '2022-01-02', 20], ['A', '2022-02-04', 30],
['A', '2022-02-05', 20], ['A', '2022-04-08', 300], ['A', '2022-04-11', 100],
['A', '2022-05-13', 200], ['A', '2022-06-12', 20], ['A', '2022-06-15', 300],
['A', '2022-08-16', 100], ['B', '2022-01-02', 10], ['B', '2022-01-02', 20],
['B', '2022-02-04', 30], ['B', '2022-02-05', 20], ['B', '2022-04-08', 300],
['B', '2022-04-11', 100], ['B', '2022-05-13', 200], ['B', '2022-06-12', 20],
['B', '2022-06-15', 300], ['B', '2022-08-16', 100]]
df1 = pd.DataFrame(data, columns=['Id', 'Date', 'TransAmt'])
df1
Id Date TransAmt
0 A 2022-01-02 10
1 A 2022-01-02 20
2 A 2022-02-04 30
3 A 2022-02-05 20
4 A 2022-04-08 300
5 A 2022-04-11 100
6 A 2022-05-13 200
7 A 2022-06-12 20
8 A 2022-06-15 300
9 A 2022-08-16 100
10 B 2022-01-02 10
11 B 2022-01-02 20
12 B 2022-02-04 30
13 B 2022-02-05 20
14 B 2022-04-08 300
15 B 2022-04-11 100
16 B 2022-05-13 200
17 B 2022-06-12 20
18 B 2022-06-15 300
19 B 2022-08-16 100
Code
# Keep the original dates so they can be restored after the merge
s = df1['Date']
# Convert the dates to monthly periods for per-month aggregation
df1['Date'] = df1['Date'].astype('period[M]')
df2 = df1.groupby(['Id', 'Date'])['TransAmt'].agg(['count', 'sum'])
# Build a complete (Id, month) index so months without transactions count as 0
idx1 = pd.period_range(df1['Date'].min(), df1['Date'].max(), freq='M')
idx2 = pd.MultiIndex.from_product([df1['Id'].unique(), idx1])
cols = ['Id', 'Date', 'CountThreeMonth', 'AmountofThreeMonth']
n = 3
# Rolling 3-month window (current month plus the previous two) within each Id
df3 = (df2.reindex(idx2, fill_value=0)
          .groupby(level=0).rolling(n, min_periods=1).sum()
          .droplevel(0).reset_index().set_axis(cols, axis=1))
result = df1.merge(df3, how='left').assign(Date=s)
result
Id Date TransAmt CountThreeMonth AmountofThreeMonth
0 A 2022-01-02 10 2.0 30.0
1 A 2022-01-02 20 2.0 30.0
2 A 2022-02-04 30 4.0 80.0
3 A 2022-02-05 20 4.0 80.0
4 A 2022-04-08 300 4.0 450.0
5 A 2022-04-11 100 4.0 450.0
6 A 2022-05-13 200 3.0 600.0
7 A 2022-06-12 20 5.0 920.0
8 A 2022-06-15 300 5.0 920.0
9 A 2022-08-16 100 3.0 420.0
10 B 2022-01-02 10 2.0 30.0
11 B 2022-01-02 20 2.0 30.0
12 B 2022-02-04 30 4.0 80.0
13 B 2022-02-05 20 4.0 80.0
14 B 2022-04-08 300 4.0 450.0
15 B 2022-04-11 100 4.0 450.0
16 B 2022-05-13 200 3.0 600.0
17 B 2022-06-12 20 5.0 920.0
18 B 2022-06-15 300 5.0 920.0
19 B 2022-08-16 100 3.0 420.0
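The rolling sum produces floats; if you want integer counts like the desired output, you can cast after the merge. A small follow-up sketch, assuming every row found a match in df3 (it does here, since the reindex covers every Id-month pair):
# Cast the rolling results back to integers to match the desired output
result[['CountThreeMonth', 'AmountofThreeMonth']] = (
    result[['CountThreeMonth', 'AmountofThreeMonth']].astype(int)
)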
I'm sorry it's hard to explain

Join two dataframes on multiple conditions in python

I have the following problem: I am trying to join df1 = ['ID', 'Earnings', 'WC', 'Year'] and df2 = ['ID', 'F1_Earnings', 'df2_year']. So, for example, the 'F1_Earnings' of a particular company in df2 (i.e. the forward earnings), say ID = 1 and year = 1996, should get joined onto df1 so that they show up in df1 under ID = 1 and year = 1995.
I have no clue how to specify a join on two conditions: of course they need to join on "ID", but how do I add a second condition which specifies that they also join on "df1_year = df2_year - 1"?
d1 = {'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350], 'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52], 'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': [1, 2, 3, 4], 'F1_Earnings': [120, 220, 420, 280], 'WC': [23, 37, 40, 52], 'Year': [1996, 1997, 1998, 1999]}
df2 = pd.DataFrame(data=d2)
I did the following, but I guess there must be a smarter way? I am afraid it won't work for larger datasets...:
df3 = pd.merge(df1, df2, how='left', on = 'ID')
df3.loc[df3['Year_x'] == df3['Year_y'] - 1]
You can use a Series as key in merge:
df1.merge(df2, how='left',
left_on=['ID', 'Year'],
right_on=['ID', df2['Year'].sub(1)])
output:
ID Year Earnings WC_x Year_x F1_Earnings WC_y Year_y
0 1 1995 100 20 1995 120.0 23.0 1996.0
1 1 1996 200 40 1996 NaN NaN NaN
2 1 1997 400 35 1997 NaN NaN NaN
3 2 1996 250 55 1996 220.0 37.0 1997.0
4 2 1997 300 60 1997 NaN NaN NaN
5 2 1998 350 65 1998 NaN NaN NaN
6 3 1995 400 30 1995 NaN NaN NaN
7 3 1997 550 28 1997 420.0 40.0 1998.0
8 3 1998 700 32 1998 NaN NaN NaN
9 4 1996 259 45 1996 NaN NaN NaN
10 4 1997 300 60 1997 NaN NaN NaN
11 4 1998 350 52 1998 280.0 52.0 1999.0
Or change the Year to Year-1 before the merge:
df1.merge(df2.assign(Year=df2['Year'].sub(1)),
how='left', on=['ID', 'Year'])
output:
ID Earnings WC_x Year F1_Earnings WC_y
0 1 100 20 1995 120.0 23.0
1 1 200 40 1996 NaN NaN
2 1 400 35 1997 NaN NaN
3 2 250 55 1996 220.0 37.0
4 2 300 60 1997 NaN NaN
5 2 350 65 1998 NaN NaN
6 3 400 30 1995 NaN NaN
7 3 550 28 1997 420.0 40.0
8 3 700 32 1998 NaN NaN
9 4 259 45 1996 NaN NaN
10 4 300 60 1997 NaN NaN
11 4 350 52 1998 280.0 52.0
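If you only need the forward earnings and want to avoid the WC_x/WC_y suffixes, you can drop df2's extra columns before the merge. A minimal sketch (assuming F1_Earnings is the only column you need from df2):
# Keep only the join keys and the column of interest, then shift Year back by one
df3 = df1.merge(df2[['ID', 'Year', 'F1_Earnings']].assign(Year=lambda d: d['Year'] - 1),
                how='left', on=['ID', 'Year'])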

Adding correlation result back to pandas dataframe

I am wondering how to add the corr() result back to a pandas dataframe, as the current output is a bit nested. I just want one column in the original dataframe listing the value. What's the best way to achieve this?
id date water fire
0 apple 2018-01-01 100 100
1 orange 2018-01-01 110 110
2 apple 2019-01-01 90 9
3 orange 2019-01-01 50 50
4 apple 2020-01-01 40 4
5 orange 2020-01-01 60 60
6 apple 2021-01-01 70 470
7 orange 2021-01-01 80 15
8 apple 2022-01-01 90 90
9 orange 2022-01-01 100 9100
import datetime
import pandas as pd
data = pd.DataFrame({
'id': ['apple', 'orange','apple','orange','apple', 'orange', 'apple', 'orange', 'apple', 'orange'],
'date': [
datetime.datetime(2018, 1, 1),
datetime.datetime(2018, 1, 1),
datetime.datetime(2019, 1, 1),
datetime.datetime(2019, 1, 1),
datetime.datetime(2020, 1, 1),
datetime.datetime(2020, 1, 1),
datetime.datetime(2021, 1, 1),
datetime.datetime(2021, 1, 1),
datetime.datetime(2022, 1, 1),
datetime.datetime(2022, 1, 1)
],
'water': [100, 110, 90, 50, 40, 60, 70, 80, 90, 100],
'fire': [100, 110, 9, 50, 4, 60, 470, 15, 90, 9100]
}
)
data.groupby('id')[['water', 'fire']].apply(lambda x : x.rolling(3).corr())
water fire
id
apple 0 water NaN NaN
fire NaN NaN
2 water NaN NaN
fire NaN NaN
4 water 1.000000 0.663924
fire 0.663924 1.000000
6 water 1.000000 0.123983
fire 0.123983 1.000000
8 water 1.000000 0.285230
fire 0.285230 1.000000
orange 1 water NaN NaN
fire NaN NaN
3 water NaN NaN
fire NaN NaN
5 water 1.000000 1.000000
fire 1.000000 1.000000
7 water 1.000000 -0.854251
fire -0.854251 1.000000
9 water 1.000000 0.863867
fire 0.863867 1.000000
Here is one way to do it:
df = pd.concat(
[
data,
data.groupby("id")[["water", "fire"]]
.apply(lambda x: x.rolling(3).corr())
.reset_index()
.drop_duplicates(subset=["level_1"])
.set_index("level_1")["fire"]
.rename("corr")
],
axis=1,
)
print(df)
# Output
id date water fire corr
0 apple 2018-01-01 100 100 NaN
1 orange 2018-01-01 110 110 NaN
2 apple 2019-01-01 90 9 NaN
3 orange 2019-01-01 50 50 NaN
4 apple 2020-01-01 40 4 0.663924
5 orange 2020-01-01 60 60 1.000000
6 apple 2021-01-01 70 470 0.123983
7 orange 2021-01-01 80 15 -0.854251
8 apple 2022-01-01 90 90 0.285230
9 orange 2022-01-01 100 9100 0.863867
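Alternatively, since only the pairwise water/fire correlation is needed, Series.rolling(...).corr(other) returns just that off-diagonal value per row, which can be assigned straight back. A short sketch of that idea:
# Rolling correlation between the two columns, computed within each id group
data['corr'] = (data.groupby('id', group_keys=False)
                    .apply(lambda g: g['water'].rolling(3).corr(g['fire'])))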

Python Pandas Dataframe: Value of second recent day for each person

I am trying to group one dataframe conditional on another dataframe using Python's pandas DataFrames:
The first dataframe gives the holidays of each person:
import pandas as pd
df_holiday = pd.DataFrame({'Person': ['Alfred', 'Bob', 'Charles'], 'Last Holiday': ['2018-02-01', '2018-06-01', '2018-05-01']})
df_holiday.head()
Last Holiday Person
0 2018-02-01 Alfred
1 2018-06-01 Bob
2 2018-05-01 Charles
The second dataframe gives the sales value for each person and month:
df_sales = pd.DataFrame({'Person': ['Alfred', 'Alfred', 'Alfred','Bob','Bob','Bob','Bob','Bob','Bob','Charles','Charles','Charles','Charles','Charles','Charles'],'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-01-01', '2018-02-01', '2018-03-01','2018-04-01', '2018-05-01', '2018-06-01', '2018-01-01', '2018-02-01', '2018-03-01','2018-04-01', '2018-05-01', '2018-06-01'], 'Sales': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
df_sales.head(15)
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
2 2018-03-01 Alfred 3
3 2018-01-01 Bob 4
4 2018-02-01 Bob 5
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
7 2018-05-01 Bob 8
8 2018-06-01 Bob 9
9 2018-01-01 Charles 10
10 2018-02-01 Charles 11
11 2018-03-01 Charles 12
12 2018-04-01 Charles 13
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
Now, I want the sales number for each person just before his last holiday, i.e. the outcome should be:
Date Person Sales
0 2018-01-01 Alfred 1
7 2018-05-01 Bob 8
12 2018-04-01 Charles 13
Any help?
We could merge, then filter, and drop_duplicates:
df = (df_holiday.merge(df_sales)
                .loc[lambda x: x['Last Holiday'] > x['Date']]
                .drop_duplicates('Person', keep='last'))
Out[163]:
Person Last Holiday Date Sales
0 Alfred 2018-02-01 2018-01-01 1
7 Bob 2018-06-01 2018-05-01 8
12 Charles 2018-05-01 2018-04-01 13
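Note that keep='last' picks the most recent date only because df_sales is already sorted by date within each person; if that isn't guaranteed, an explicit sort makes it safe. A small sketch with the same frames:
# Sort so keep='last' always keeps the latest pre-holiday sale per person
out = (df_holiday.merge(df_sales)
                 .loc[lambda x: x['Last Holiday'] > x['Date']]
                 .sort_values(['Person', 'Date'])
                 .drop_duplicates('Person', keep='last'))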

Python: doing multiple column aggregation in pandas

I have a dataframe where I want to do multiple column aggregations in pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})
df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})
With this code, I get the mean for lat. I would also like to find the mean for long.
I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces
AttributeError: 'DataFrame' object has no attribute 'long'
If I just do avg_long, the code works as well.
df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})
In[2]: df2
Out[42]:
avg_long
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
Is there a way to do this in one step or is this something I have to do separately and join back later?
I think it is simpler to use GroupBy.mean:
print(df.groupby(['ser_no', 'CTRY_NM']).mean())
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
Or, if you need to define the columns for aggregation:
print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
More info in docs.
EDIT:
If you need to rename the column names, i.e. flatten the MultiIndex in the columns, you can use a list comprehension:
import pandas as pd
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'date':pd.date_range(pd.to_datetime('2016-02-24'),
pd.to_datetime('2016-02-28'), freq='10H')})
print(df)
CTRY_NM date lat long ser_no
0 a 2016-02-24 00:00:00 1 21 1
1 a 2016-02-24 10:00:00 2 22 1
2 b 2016-02-24 20:00:00 3 23 1
3 e 2016-02-25 06:00:00 4 24 2
4 e 2016-02-25 16:00:00 5 25 2
5 a 2016-02-26 02:00:00 6 26 2
6 b 2016-02-26 12:00:00 7 27 2
7 b 2016-02-26 22:00:00 8 28 3
8 b 2016-02-27 08:00:00 9 29 3
9 d 2016-02-27 18:00:00 10 30 3
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean', 'date': ['min', 'max', 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
lat_mean date_min date_max date_count \
ser_no CTRY_NM
1 a 1.5 2016-02-24 00:00:00 2016-02-24 10:00:00 2
b 3.0 2016-02-24 20:00:00 2016-02-24 20:00:00 1
2 a 6.0 2016-02-26 02:00:00 2016-02-26 02:00:00 1
b 7.0 2016-02-26 12:00:00 2016-02-26 12:00:00 1
e 4.5 2016-02-25 06:00:00 2016-02-25 16:00:00 2
3 b 8.5 2016-02-26 22:00:00 2016-02-27 08:00:00 2
d 10.0 2016-02-27 18:00:00 2016-02-27 18:00:00 1
long_mean
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
You are getting the error because you are first selecting the lat column of the dataframe and doing the operation on that column. Getting the long column from that Series is not possible; you need the DataFrame.
df2 = df.groupby(['ser_no', 'CTRY_NM'])[['lat', 'long']].agg(np.mean)
would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[['lat', 'long']].agg(np.mean).rename(columns={'lat': 'avg_lat', 'long': 'avg_long'})
In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[['lat', 'long']].agg(np.mean).rename(columns={'lat': 'avg_lat', 'long': 'avg_long'})
df2
Out[22]:
avg_lat avg_long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
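In newer pandas (0.25+), named aggregation gives the renamed columns in a single step; a sketch of the same result under that version assumption:
# Named aggregation: output column name = (input column, aggregation function)
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(avg_lat=('lat', 'mean'),
                                            avg_long=('long', 'mean'))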
