Python Pandas DataFrame: value of the second most recent day for each person

I am trying to filter one DataFrame conditionally on another DataFrame using Python's pandas:
The first dataframe gives the holidays of each person:
import pandas as pd
df_holiday = pd.DataFrame({'Person': ['Alfred', 'Bob', 'Charles'], 'Last Holiday': ['2018-02-01', '2018-06-01', '2018-05-01']})
df_holiday.head()
Last Holiday Person
0 2018-02-01 Alfred
1 2018-06-01 Bob
2 2018-05-01 Charles
The second dataframe gives the sales value for each person and month:
df_sales = pd.DataFrame({
    'Person': ['Alfred', 'Alfred', 'Alfred', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob',
               'Charles', 'Charles', 'Charles', 'Charles', 'Charles', 'Charles'],
    'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-01-01', '2018-02-01', '2018-03-01',
             '2018-04-01', '2018-05-01', '2018-06-01', '2018-01-01', '2018-02-01', '2018-03-01',
             '2018-04-01', '2018-05-01', '2018-06-01'],
    'Sales': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
df_sales.head(15)
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
2 2018-03-01 Alfred 3
3 2018-01-01 Bob 4
4 2018-02-01 Bob 5
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
7 2018-05-01 Bob 8
8 2018-06-01 Bob 9
9 2018-01-01 Charles 10
10 2018-02-01 Charles 11
11 2018-03-01 Charles 12
12 2018-04-01 Charles 13
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
Now, I want the last sales number for each person before his last holiday, i.e. the outcome should be:
Date Person Sales
0 2018-01-01 Alfred 1
7 2018-05-01 Bob 8
12 2018-04-01 Charles 13
Any help?

We could do a merge, then filter, then drop_duplicates:
df = df_holiday.merge(df_sales).loc[lambda x: x['Last Holiday'] > x['Date']].drop_duplicates('Person', keep='last')
Person Last Holiday Date Sales
0 Alfred 2018-02-01 2018-01-01 1
7 Bob 2018-06-01 2018-05-01 8
12 Charles 2018-05-01 2018-04-01 13
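An equivalent sketch of the same idea using groupby instead of drop_duplicates (assuming the question's df_holiday and df_sales; the ISO-formatted date strings here compare correctly as strings, though converting with pd.to_datetime would be more robust):
# Merge holidays onto sales, keep only rows strictly before the holiday,
# then take the latest remaining row per person.
merged = df_sales.merge(df_holiday, on='Person')
before = merged[merged['Date'] < merged['Last Holiday']]
result = before.sort_values('Date').groupby('Person', as_index=False).last()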

Related

Pandas Rolling window with filtering condition to remove the some latest data

This is a follow-up to an earlier question. I would like to compute a rolling window over the last n days, but filter out the latest x days from each window (where x is smaller than n).
Here is an example:
d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '17/02/2021',
              '18/02/2021', '20/02/2021', '21/02/2021', '22/02/2021', '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
Name Date Earning
0 Jack 2021-01-08 40
1 Jim 2021-01-27 10
2 Jack 2021-02-05 20
3 Jim 2021-02-10 20
4 Jack 2021-02-17 40
5 Jack 2021-02-18 50
6 Jim 2021-02-20 100
7 Jack 2021-02-21 70
8 Jane 2021-02-22 80
9 Jane 2021-03-29 90
I would like to:
1. For each row, take the last 30 days of the same Name (call it a window).
2. Remove the latest 20 days of each window (i.e. only keep the earliest 10 days).
3. Calculate the sum over the Earning column of what remains.
Expected outcome (the columns Window_From and Window_To are not needed; I only include them to illustrate the windows):
Name Date Earning Window_From Window_To Sum
0 Jack 2021-01-08 40 2020-12-09 2020-12-19 0.0
1 Jim 2021-01-27 10 2020-12-28 2021-01-07 0.0
2 Jack 2021-02-05 20 2021-01-06 2021-01-16 40.0
3 Jim 2021-02-10 20 2021-01-11 2021-01-21 0.0
4 Jack 2021-02-17 40 2021-01-18 2021-01-28 0.0
5 Jack 2021-02-18 50 2021-01-19 2021-01-29 0.0
6 Jim 2021-02-20 100 2021-01-21 2021-01-31 10.0
7 Jack 2021-02-21 70 2021-01-22 2021-02-01 0.0
8 Jane 2021-02-22 80 2021-01-23 2021-02-02 0.0
9 Jane 2021-03-29 90 2021-02-27 2021-03-09 0.0
Easy solution
Calculate the 30-day and 20-day rolling sums, then subtract the 20-day sum from the 30-day sum to get the effective rolling sum over the first 10 days of each window. This works because pandas' time-based rolling windows are right-closed: the 30d window covers (Date - 30d, Date] and the 20d window covers (Date - 20d, Date], so their difference covers exactly (Date - 30d, Date - 20d].
s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()
df.merge(s1.sub(s2).reset_index(name='sum'), how='left')
Name Date Earning sum
0 Jack 2021-01-08 40 0.0
1 Jim 2021-01-27 10 0.0
2 Jack 2021-02-05 20 40.0
3 Jim 2021-02-10 20 0.0
4 Jack 2021-02-17 40 0.0
5 Jack 2021-02-18 50 0.0
6 Jim 2021-02-20 100 10.0
7 Jack 2021-02-21 70 0.0
8 Jane 2021-02-22 80 0.0
9 Jane 2021-03-29 90 0.0
An alternative to rolling (may be faster):
EDIT: actually slower with OP's dataset.
df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10)
df = df.set_index(['Name', 'Date'])
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum()
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]
print(df.reset_index().drop(columns=['start', 'end']))
Name Date Earning Sum
0 Jack 2021-01-08 40 0
1 Jim 2021-01-27 10 0
2 Jack 2021-02-05 20 40
3 Jim 2021-02-10 20 0
4 Jack 2021-02-17 40 0
5 Jack 2021-02-18 50 0
6 Jim 2021-02-20 100 10
7 Jack 2021-02-21 70 0
8 Jane 2021-02-22 80 0
9 Jane 2021-03-29 90 0
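For readers who want to double-check the window boundaries, here is a brute-force sketch of the same logic, starting again from the question's original df (it assumes the intended window is the half-open interval (Date - 30d, Date - 20d], which is what the rolling-sum subtraction above produces):
import pandas as pd

def window_sum(g):
    # For each row, sum Earning over the same-name rows whose dates fall
    # in (Date - 30 days, Date - 20 days].
    sums = []
    for d in g['Date']:
        lo, hi = d - pd.Timedelta(days=30), d - pd.Timedelta(days=20)
        mask = (g['Date'] > lo) & (g['Date'] <= hi)
        sums.append(g.loc[mask, 'Earning'].sum())
    return pd.Series(sums, index=g.index)

df['Sum'] = df.groupby('Name', group_keys=False)[['Date', 'Earning']].apply(window_sum)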

Putting NaN in several columns according to conditions on certain columns

I have a pandas DataFrame with bill-amount columns, along with the dates and IDs associated with those amounts. I would like to set each bill's amount, date, and ID columns to NaN when the date is earlier than 2016-12-31. Here is an example:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Gender  Age
          4      6  2000-10-04          1     45  2000-11-05          2     51  1999-12-05          8       M   25
          6      8  2016-05-03          7     39  2017-08-09          8     38  2018-07-14         17       W   54
         12     14  2016-11-16         10     73  2017-05-04         15     14  2017-07-04         35       M   68
And I would like to get this:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date 3      ID Bill 3  Gender  Age
          4    NaN  NaN               NaN    NaN  NaN               NaN    NaN  NaN               NaN       M   25
          6    NaN  NaN               NaN     39  2017-08-09          8     38  2018-07-14         17       W   54
         12    NaN  NaN               NaN     73  2017-05-04         15     14  2017-07-04         35       M   68
One option is to create a MultiIndex.from_frame based on the values extracted with str.extractall:
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns # Save For Later
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 6 2000-10-04 1 45 2000-11-05 2 51 1999-12-05 8
6 W 54 8 2016-05-03 7 39 2017-08-09 8 38 2018-07-14 17
12 M 68 14 2016-11-16 10 73 2017-05-04 15 14 2017-07-04 35
Then mask on the Date column (in level 0) where dates are less than the threshold:
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 NaN NaT NaN NaN NaT NaN NaN NaT NaN
6 W 54 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0
12 M 68 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0
Lastly, restore columns and order:
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
ID customer Bill1 Date 1 ID Bill 1 Bill2 Date 2 ID Bill 2 Bill3 Date3 ID Bill 3 Gender Age
0 4 NaN NaT NaN NaN NaT NaN NaN NaT NaN M 25
1 6 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0 W 54
2 12 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0 M 68
All together (first ensure the date columns are datetime):
df['Date 1'] = pd.to_datetime(df['Date 1'])
df['Date 2'] = pd.to_datetime(df['Date 2'])
df['Date3'] = pd.to_datetime(df['Date3'])
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns # Save For Later
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
Another way:
# Assumption: your date columns are of dtype datetime64[ns]
c = ~df.filter(like='Date').lt(pd.to_datetime('2016-12-31'))
# Repeat each date mask 3 times so it covers its Bill/Date/ID trio of columns
c = pd.DataFrame(c.values.repeat(3, 1), columns=df.columns[1:10])
Finally:
out = df[df.columns[1:10]]
out = out[c].join(df[['ID customer', 'Gender', 'Age']])
Now, if you print out, you will get the desired output.

Displaying next value on a column considering groups in Pandas Dataframe

I have this example DataFrame, and I need to display the next delivery date for each client-region group.
The date could be coded either as a string or as a datetime; I'm using strings in this example.
# Import pandas library
import pandas as pd
import numpy as np
data = [['NY', 'A', '2020-01-01', 10], ['NY', 'A', '2020-02-03', 20], ['NY', 'A', '2020-04-05', 30], ['NY', 'A', '2020-05-05', 25],
        ['NY', 'B', '2020-01-01', 15], ['NY', 'B', '2020-02-02', 10], ['NY', 'B', '2020-02-10', 20],
        ['FL', 'A', '2020-01-01', 15], ['FL', 'A', '2020-02-01', 10], ['FL', 'A', '2020-03-01', 12], ['FL', 'A', '2020-04-01', 25], ['FL', 'A', '2020-05-01', 20]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Region', 'Client', 'deliveryDate', 'price'])
# print dataframe.
df
Region Client deliveryDate price
0 NY A 2020-01-01 10
1 NY A 2020-02-03 20
2 NY A 2020-04-05 30
3 NY A 2020-05-05 25
4 NY B 2020-01-01 15
5 NY B 2020-02-02 10
6 NY B 2020-02-10 20
7 FL A 2020-01-01 15
8 FL A 2020-02-01 10
9 FL A 2020-03-01 12
10 FL A 2020-04-01 25
11 FL A 2020-05-01 20
Desired output:
data2 = [['NY', 'A', '2020-01-01', '2020-02-03', 10], ['NY', 'A', '2020-02-03', '2020-04-05', 20], ['NY', 'A', '2020-04-05', '2020-05-05', 30], ['NY', 'A', '2020-05-05', float('nan'), 25],
         ['NY', 'B', '2020-01-01', '2020-02-02', 15], ['NY', 'B', '2020-02-02', '2020-02-10', 10], ['NY', 'B', '2020-02-10', float('nan'), 20],
         ['FL', 'A', '2020-01-01', '2020-02-01', 15], ['FL', 'A', '2020-02-01', '2020-03-01', 10], ['FL', 'A', '2020-03-01', '2020-04-01', 12], ['FL', 'A', '2020-04-01', '2020-05-01', 25], ['FL', 'A', '2020-05-01', float('nan'), 20]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data2, columns=['Region', 'Client', 'deliveryDate', 'nextDelivery', 'price'])
Region Client deliveryDate nextDelivery price
0 NY A 2020-01-01 2020-02-03 10
1 NY A 2020-02-03 2020-04-05 20
2 NY A 2020-04-05 2020-05-05 30
3 NY A 2020-05-05 NaN 25
4 NY B 2020-01-01 2020-02-02 15
5 NY B 2020-02-02 2020-02-10 10
6 NY B 2020-02-10 NaN 20
7 FL A 2020-01-01 2020-02-01 15
8 FL A 2020-02-01 2020-03-01 10
9 FL A 2020-03-01 2020-04-01 12
10 FL A 2020-04-01 2020-05-01 25
11 FL A 2020-05-01 NaN 20
Thanks in advance.
Assuming the delivery dates are ordered, how about grouping by region & client, then applying a shift?
df['nextDelivery'] = df.groupby(['Region', 'Client'])['deliveryDate'].shift(-1)
Output:
Region Client deliveryDate price nextDelivery
0 NY A 2020-01-01 10 2020-02-03
1 NY A 2020-02-03 20 2020-04-05
2 NY A 2020-04-05 30 2020-05-05
3 NY A 2020-05-05 25 NaN
4 NY B 2020-01-01 15 2020-02-02
5 NY B 2020-02-02 10 2020-02-10
6 NY B 2020-02-10 20 NaN
7 FL A 2020-01-01 15 2020-02-01
8 FL A 2020-02-01 10 2020-03-01
9 FL A 2020-03-01 12 2020-04-01
10 FL A 2020-04-01 25 2020-05-01
11 FL A 2020-05-01 20 NaN
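One caveat: the shift only yields the chronological next delivery if rows are already ordered by date within each group, as they are in the question's data. A minimal sketch that sorts first (assuming deliveryDate values compare correctly, which ISO-formatted strings or datetimes do):
# Sort within each group before shifting, so shift(-1) is the next date in time
df = df.sort_values(['Region', 'Client', 'deliveryDate'])
df['nextDelivery'] = df.groupby(['Region', 'Client'])['deliveryDate'].shift(-1)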

Column names setup after grouping the data in Python

My table is as below:
datetime source Day area Town County Country
0 2019-01-01 16:22:46 1273 Tuesday Brighton Brighton East Sussex England
1 2019-01-02 09:33:29 1823 Wednesday Taunton Taunton Somerset England
2 2019-01-02 09:44:46 1977 Wednesday Pontefract Pontefract West Yorkshire England
3 2019-01-02 10:01:42 1983 Wednesday Isle of Wight NaN NaN NaN
4 2019-01-02 12:03:13 1304 Wednesday Dover Dover Kent England
My codes are
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
counts_by_counties.head()
My grouped result (did the column names disappear?):
datetime source Day area Country
County Town
Aberdeenshire Aberdeen 8 8 8 8 8
Banchory 1 1 1 1 1
Blackburn 18 18 18 18 18
Ellon 6 6 6 6 6
Fraserburgh 2 2 2 2 2
I used this code to rename the column; I am wondering if there is a more efficient way to change the column name.
# slicing of the table down to the single count column
counts_by_counties = counts_by_counties[['datetime']]
# rename datetime to Counts (note: rename returns a copy, so assign it back)
counts_by_counties = counts_by_counties.rename(columns={'datetime': 'Counts'})
Expected result
Counts
County Town
Aberdeenshire Aberdeen 8
Banchory 1
Blackburn 18
Call reset_index as below.
Replace
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
with
counts_by_counties = call_by_counties.groupby(['County','Town']).count().reset_index()
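A sketch of an alternative that avoids the slice-and-rename step entirely: GroupBy.size() counts rows per group (including rows containing NaNs, unlike count()), and Series.reset_index accepts a name for the resulting column.
counts_by_counties = (
    call_by_counties.groupby(['County', 'Town'])
                    .size()                      # one row count per (County, Town)
                    .reset_index(name='Counts')  # flatten the index, name the column
)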

Extract year from pandas datetime column as numeric value with NaN for empty cells instead of NaT

I want to extract the year from a datetime column into a new 'yyyy' column, AND I want the missing values (NaT) to be displayed as NaN. I guess the datetime dtype of the new column has to change for that, but that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np

# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})

df.ID = pd.to_numeric(df.ID)

df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')

# Try to set NaT to NaN, or convert datetime to numeric.
# PROBLEM: empty cells keep 'NaT'.
df.loc[df['yyyy'].isna(), 'yyyy'] = np.nan  # (try 1)
df.yyyy = df.Date.astype(float)             # (try 2)
df.yyyy = pd.to_numeric(df.Date)            # (try 3)
print(df)
Use Series.dt.year, converting to nullable integers with Int64:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without converting floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
By the way, in recent versions of pandas the replace is not necessary; it works correctly without it.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
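Note that strftime('%Y') yields strings, so the resulting column has object dtype. If a numeric column is wanted instead, a small follow-up sketch (errors='coerce' turns any leftover 'NaT' strings into NaN):
# Convert the year strings to floats; 'NaT' becomes NaN via errors='coerce'
df['yyyy'] = pd.to_numeric(
    pd.to_datetime(df.Date).dt.strftime('%Y'), errors='coerce'
)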
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use the Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
