Python Pandas DataFrame: value of the second most recent day for each person

I am trying to filter one DataFrame conditionally on another DataFrame using Python's pandas:
The first dataframe gives the holidays of each person:
import pandas as pd
df_holiday = pd.DataFrame({'Person': ['Alfred', 'Bob', 'Charles'], 'Last Holiday': ['2018-02-01', '2018-06-01', '2018-05-01']})
df_holiday.head()
Last Holiday Person
0 2018-02-01 Alfred
1 2018-06-01 Bob
2 2018-05-01 Charles
The second dataframe gives the sales value for each person and month:
df_sales = pd.DataFrame({
    'Person': ['Alfred', 'Alfred', 'Alfred', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob',
               'Charles', 'Charles', 'Charles', 'Charles', 'Charles', 'Charles'],
    'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-01-01', '2018-02-01', '2018-03-01',
             '2018-04-01', '2018-05-01', '2018-06-01', '2018-01-01', '2018-02-01', '2018-03-01',
             '2018-04-01', '2018-05-01', '2018-06-01'],
    'Sales': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
df_sales.head(15)
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
2 2018-03-01 Alfred 3
3 2018-01-01 Bob 4
4 2018-02-01 Bob 5
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
7 2018-05-01 Bob 8
8 2018-06-01 Bob 9
9 2018-01-01 Charles 10
10 2018-02-01 Charles 11
11 2018-03-01 Charles 12
12 2018-04-01 Charles 13
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
Now, I want the last sales number for each person before his last holiday, i.e. the outcome should be:
Date Person Sales
0 2018-01-01 Alfred 1
7 2018-05-01 Bob 8
12 2018-04-01 Charles 13
Any help?

We could do a merge, then filter, then drop_duplicates:
df = df_holiday.merge(df_sales).loc[lambda x: x['Last Holiday'] > x['Date']].drop_duplicates('Person', keep='last')
Person Last Holiday Date Sales
0 Alfred 2018-02-01 2018-01-01 1
7 Bob 2018-06-01 2018-05-01 8
12 Charles 2018-05-01 2018-04-01 13
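An equivalent sketch of the same idea using groupby instead of drop_duplicates (assuming the question's df_holiday and df_sales; the ISO-formatted date strings here compare correctly as strings, though converting with pd.to_datetime would be more robust):
# Merge holidays onto sales, keep only rows strictly before the holiday,
# then take the latest remaining row per person.
merged = df_sales.merge(df_holiday, on='Person')
before = merged[merged['Date'] < merged['Last Holiday']]
result = before.sort_values('Date').groupby('Person', as_index=False).last()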

Related

Pandas Rolling window with filtering condition to remove the some latest data

This is a follow-up to an earlier question. I would like to compute a rolling window over the last n days, but filter out the latest x days from each window (where x is smaller than n).
Here is an example:
d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '17/02/2021',
              '18/02/2021', '20/02/2021', '21/02/2021', '22/02/2021', '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
Name Date Earning
0 Jack 2021-01-08 40
1 Jim 2021-01-27 10
2 Jack 2021-02-05 20
3 Jim 2021-02-10 20
4 Jack 2021-02-17 40
5 Jack 2021-02-18 50
6 Jim 2021-02-20 100
7 Jack 2021-02-21 70
8 Jane 2021-02-22 80
9 Jane 2021-03-29 90
I would like to:
1. For each row, take the last 30 days of the same Name (call it a window).
2. Remove the latest 20 days of each window (i.e. only keep the earliest 10 days).
3. Calculate the sum over the Earning column of what remains.
Expected outcome (the columns Window_From and Window_To are not needed; I only include them to illustrate the windows):
Name Date Earning Window_From Window_To Sum
0 Jack 2021-01-08 40 2020-12-09 2020-12-19 0.0
1 Jim 2021-01-27 10 2020-12-28 2021-01-07 0.0
2 Jack 2021-02-05 20 2021-01-06 2021-01-16 40.0
3 Jim 2021-02-10 20 2021-01-11 2021-01-21 0.0
4 Jack 2021-02-17 40 2021-01-18 2021-01-28 0.0
5 Jack 2021-02-18 50 2021-01-19 2021-01-29 0.0
6 Jim 2021-02-20 100 2021-01-21 2021-01-31 10.0
7 Jack 2021-02-21 70 2021-01-22 2021-02-01 0.0
8 Jane 2021-02-22 80 2021-01-23 2021-02-02 0.0
9 Jane 2021-03-29 90 2021-02-27 2021-03-09 0.0
Easy solution
Calculate the 30-day and 20-day rolling sums, then subtract the 20-day sum from the 30-day sum to get the effective rolling sum over the first 10 days of each window. This works because pandas' time-based rolling windows are right-closed: the 30d window covers (Date - 30d, Date] and the 20d window covers (Date - 20d, Date], so their difference covers exactly (Date - 30d, Date - 20d].
s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()
df.merge(s1.sub(s2).reset_index(name='sum'), how='left')
Name Date Earning sum
0 Jack 2021-01-08 40 0.0
1 Jim 2021-01-27 10 0.0
2 Jack 2021-02-05 20 40.0
3 Jim 2021-02-10 20 0.0
4 Jack 2021-02-17 40 0.0
5 Jack 2021-02-18 50 0.0
6 Jim 2021-02-20 100 10.0
7 Jack 2021-02-21 70 0.0
8 Jane 2021-02-22 80 0.0
9 Jane 2021-03-29 90 0.0
An alternative to rolling (may be faster):
EDIT: actually slower with OP's dataset.
df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10)
df = df.set_index(['Name', 'Date'])
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum()
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]
print(df.reset_index().drop(columns=['start', 'end']))
Name Date Earning Sum
0 Jack 2021-01-08 40 0
1 Jim 2021-01-27 10 0
2 Jack 2021-02-05 20 40
3 Jim 2021-02-10 20 0
4 Jack 2021-02-17 40 0
5 Jack 2021-02-18 50 0
6 Jim 2021-02-20 100 10
7 Jack 2021-02-21 70 0
8 Jane 2021-02-22 80 0
9 Jane 2021-03-29 90 0
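For readers who want to double-check the window boundaries, here is a brute-force sketch of the same logic, starting again from the question's original df (it assumes the intended window is the half-open interval (Date - 30d, Date - 20d], which is what the rolling-sum subtraction above produces):
import pandas as pd

def window_sum(g):
    # For each row, sum Earning over the same-name rows whose dates fall
    # in (Date - 30 days, Date - 20 days].
    sums = []
    for d in g['Date']:
        lo, hi = d - pd.Timedelta(days=30), d - pd.Timedelta(days=20)
        mask = (g['Date'] > lo) & (g['Date'] <= hi)
        sums.append(g.loc[mask, 'Earning'].sum())
    return pd.Series(sums, index=g.index)

df['Sum'] = df.groupby('Name', group_keys=False)[['Date', 'Earning']].apply(window_sum)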

Putting NaN in several columns according to conditions on certain columns

I have a pandas DataFrame with bill-amount columns, along with the dates and IDs associated with those amounts. I would like to set each bill's amount, date, and ID columns to NaN when the date is earlier than 2016-12-31. Here is an example:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Gender  Age
          4      6  2000-10-04          1     45  2000-11-05          2     51  1999-12-05          8       M   25
          6      8  2016-05-03          7     39  2017-08-09          8     38  2018-07-14         17       W   54
         12     14  2016-11-16         10     73  2017-05-04         15     14  2017-07-04         35       M   68
And I would like to get this:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date 3      ID Bill 3  Gender  Age
          4    NaN  NaN               NaN    NaN  NaN               NaN    NaN  NaN               NaN       M   25
          6    NaN  NaN               NaN     39  2017-08-09          8     38  2018-07-14         17       W   54
         12    NaN  NaN               NaN     73  2017-05-04         15     14  2017-07-04         35       M   68
One option is to create a MultiIndex.from_frame based on the values extracted with str.extractall:
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns # Save For Later
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 6 2000-10-04 1 45 2000-11-05 2 51 1999-12-05 8
6 W 54 8 2016-05-03 7 39 2017-08-09 8 38 2018-07-14 17
12 M 68 14 2016-11-16 10 73 2017-05-04 15 14 2017-07-04 35
Then mask on the Date column (in level 0) where dates are less than the threshold:
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 NaN NaT NaN NaN NaT NaN NaN NaT NaN
6 W 54 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0
12 M 68 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0
Lastly, restore columns and order:
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
ID customer Bill1 Date 1 ID Bill 1 Bill2 Date 2 ID Bill 2 Bill3 Date3 ID Bill 3 Gender Age
0 4 NaN NaT NaN NaN NaT NaN NaN NaT NaN M 25
1 6 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0 W 54
2 12 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0 M 68
All together (first ensure the date columns are datetime):
df['Date 1'] = pd.to_datetime(df['Date 1'])
df['Date 2'] = pd.to_datetime(df['Date 2'])
df['Date3'] = pd.to_datetime(df['Date3'])
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns # Save For Later
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
Another way:
# Assumption: your date columns are of dtype datetime64[ns]
c = ~df.filter(like='Date').lt(pd.to_datetime('2016-12-31'))
# Repeat each date mask 3 times so it covers its Bill/Date/ID trio of columns
c = pd.DataFrame(c.values.repeat(3, 1), columns=df.columns[1:10])
Finally:
out = df[df.columns[1:10]]
out = out[c].join(df[['ID customer', 'Gender', 'Age']])
Now, if you print out, you will get the desired output.

Displaying next value on a column considering groups in Pandas Dataframe

I have this example DataFrame, and I need to display the next delivery date for each client-region group.
The date could be coded either as a string or as a datetime; I'm using strings in this example.
# Import pandas library
import pandas as pd
import numpy as np
data = [['NY', 'A', '2020-01-01', 10], ['NY', 'A', '2020-02-03', 20], ['NY', 'A', '2020-04-05', 30], ['NY', 'A', '2020-05-05', 25],
        ['NY', 'B', '2020-01-01', 15], ['NY', 'B', '2020-02-02', 10], ['NY', 'B', '2020-02-10', 20],
        ['FL', 'A', '2020-01-01', 15], ['FL', 'A', '2020-02-01', 10], ['FL', 'A', '2020-03-01', 12], ['FL', 'A', '2020-04-01', 25], ['FL', 'A', '2020-05-01', 20]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Region', 'Client', 'deliveryDate', 'price'])
# print dataframe.
df
Region Client deliveryDate price
0 NY A 2020-01-01 10
1 NY A 2020-02-03 20
2 NY A 2020-04-05 30
3 NY A 2020-05-05 25
4 NY B 2020-01-01 15
5 NY B 2020-02-02 10
6 NY B 2020-02-10 20
7 FL A 2020-01-01 15
8 FL A 2020-02-01 10
9 FL A 2020-03-01 12
10 FL A 2020-04-01 25
11 FL A 2020-05-01 20
Desired output:
data2 = [['NY', 'A', '2020-01-01', '2020-02-03', 10], ['NY', 'A', '2020-02-03', '2020-04-05', 20], ['NY', 'A', '2020-04-05', '2020-05-05', 30], ['NY', 'A', '2020-05-05', float('nan'), 25],
         ['NY', 'B', '2020-01-01', '2020-02-02', 15], ['NY', 'B', '2020-02-02', '2020-02-10', 10], ['NY', 'B', '2020-02-10', float('nan'), 20],
         ['FL', 'A', '2020-01-01', '2020-02-01', 15], ['FL', 'A', '2020-02-01', '2020-03-01', 10], ['FL', 'A', '2020-03-01', '2020-04-01', 12], ['FL', 'A', '2020-04-01', '2020-05-01', 25], ['FL', 'A', '2020-05-01', float('nan'), 20]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data2, columns=['Region', 'Client', 'deliveryDate', 'nextDelivery', 'price'])
Region Client deliveryDate nextDelivery price
0 NY A 2020-01-01 2020-02-03 10
1 NY A 2020-02-03 2020-04-05 20
2 NY A 2020-04-05 2020-05-05 30
3 NY A 2020-05-05 NaN 25
4 NY B 2020-01-01 2020-02-02 15
5 NY B 2020-02-02 2020-02-10 10
6 NY B 2020-02-10 NaN 20
7 FL A 2020-01-01 2020-02-01 15
8 FL A 2020-02-01 2020-03-01 10
9 FL A 2020-03-01 2020-04-01 12
10 FL A 2020-04-01 2020-05-01 25
11 FL A 2020-05-01 NaN 20
Thanks in advance.
Assuming the delivery dates are ordered, how about grouping by region & client, then applying a shift?
df['nextDelivery'] = df.groupby(['Region', 'Client'])['deliveryDate'].shift(-1)
Output:
Region Client deliveryDate price nextDelivery
0 NY A 2020-01-01 10 2020-02-03
1 NY A 2020-02-03 20 2020-04-05
2 NY A 2020-04-05 30 2020-05-05
3 NY A 2020-05-05 25 NaN
4 NY B 2020-01-01 15 2020-02-02
5 NY B 2020-02-02 10 2020-02-10
6 NY B 2020-02-10 20 NaN
7 FL A 2020-01-01 15 2020-02-01
8 FL A 2020-02-01 10 2020-03-01
9 FL A 2020-03-01 12 2020-04-01
10 FL A 2020-04-01 25 2020-05-01
11 FL A 2020-05-01 20 NaN
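One caveat: the shift only yields the chronological next delivery if rows are already ordered by date within each group, as they are in the question's data. A minimal sketch that sorts first (assuming deliveryDate values compare correctly, which ISO-formatted strings or datetimes do):
# Sort within each group before shifting, so shift(-1) is the next date in time
df = df.sort_values(['Region', 'Client', 'deliveryDate'])
df['nextDelivery'] = df.groupby(['Region', 'Client'])['deliveryDate'].shift(-1)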

Column names setup after grouping the data in Python

My table is as below:
datetime source Day area Town County Country
0 2019-01-01 16:22:46 1273 Tuesday Brighton Brighton East Sussex England
1 2019-01-02 09:33:29 1823 Wednesday Taunton Taunton Somerset England
2 2019-01-02 09:44:46 1977 Wednesday Pontefract Pontefract West Yorkshire England
3 2019-01-02 10:01:42 1983 Wednesday Isle of Wight NaN NaN NaN
4 2019-01-02 12:03:13 1304 Wednesday Dover Dover Kent England
My codes are
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
counts_by_counties.head()
My grouped result (did the column names disappear?):
datetime source Day area Country
County Town
Aberdeenshire Aberdeen 8 8 8 8 8
Banchory 1 1 1 1 1
Blackburn 18 18 18 18 18
Ellon 6 6 6 6 6
Fraserburgh 2 2 2 2 2
I used this code to rename the column; I am wondering if there is a more efficient way to change the column name.
# slicing of the table down to the single count column
counts_by_counties = counts_by_counties[['datetime']]
# rename datetime to Counts (note: rename returns a copy, so assign it back)
counts_by_counties = counts_by_counties.rename(columns={'datetime': 'Counts'})
Expected result
Counts
County Town
Aberdeenshire Aberdeen 8
Banchory 1
Blackburn 18
Call reset_index as below.
Replace
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
with
counts_by_counties = call_by_counties.groupby(['County','Town']).count().reset_index()
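A sketch of an alternative that avoids the slice-and-rename step entirely: GroupBy.size() counts rows per group (including rows containing NaNs, unlike count()), and Series.reset_index accepts a name for the resulting column.
counts_by_counties = (
    call_by_counties.groupby(['County', 'Town'])
                    .size()                      # one row count per (County, Town)
                    .reset_index(name='Counts')  # flatten the index, name the column
)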

Extract year from pandas datetime column as numeric value with NaN for empty cells instead of NaT

I want to extract the year from a datetime column into a new 'yyyy' column, AND I want the missing values (NaT) to be displayed as NaN. I guess the datetime dtype of the new column has to change for that, but that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np

# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})

df.ID = pd.to_numeric(df.ID)

df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')

# Try to set NaT to NaN, or convert datetime to numeric.
# PROBLEM: empty cells keep 'NaT'.
df.loc[df['yyyy'].isna(), 'yyyy'] = np.nan  # (try 1)
df.yyyy = df.Date.astype(float)             # (try 2)
df.yyyy = pd.to_numeric(df.Date)            # (try 3)
print(df)
Use Series.dt.year, converting to nullable integers with Int64:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without converting floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
By the way, in recent versions of pandas the replace is not necessary; it works correctly without it.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
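Note that strftime('%Y') yields strings, so the resulting column has object dtype. If a numeric column is wanted instead, a small follow-up sketch (errors='coerce' turns any leftover 'NaT' strings into NaN):
# Convert the year strings to floats; 'NaT' becomes NaN via errors='coerce'
df['yyyy'] = pd.to_numeric(
    pd.to_datetime(df.Date).dt.strftime('%Y'), errors='coerce'
)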
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use the Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
