Finding the closest date given a date in a groupby dataframe (Python)

I'm trying to generate the Last_Payment_Date field in my pandas dataframe: for each customer (i.e. a groupby), I need to find the closest Payment_Date before the given Order_Date.
Payment_Date always occurs after Order_Date, but the gap varies from row to row, which makes it hard to find the nearest date with sorting and shift.
Masking seems like a possible approach, but I haven't been able to figure out how to use it.
I'd appreciate any help I can get!
Cust_No  Order_Date  Payment_Date  Last_Payment_Date
A        5/8/2014    6/8/2014      NaT
B        6/8/2014    1/5/2015      NaT
B        7/8/2014    7/8/2014      NaT
A        8/8/2014    1/5/2015      6/8/2014
A        9/8/2014    10/8/2014     6/8/2014
A        10/11/2014  12/11/2014    10/8/2014
B        11/12/2014  1/1/2015      7/8/2014
B        1/2/2015    2/2/2015      1/1/2015
A        2/5/2015    5/5/2015      1/5/2015
B        3/5/2015    4/5/2015      2/2/2015

Series.searchsorted largely does what you want -- it
can be used to find where the Order_Dates fit inside Payment_Dates. In
particular, it returns the ordinal indices corresponding to where each
Order_Date would need to be inserted in order to keep the Payment_Dates
sorted. For example, suppose
In [266]: df['Payment_Date']
Out[266]:
0 2014-06-08
2 2014-07-08
4 2014-10-08
5 2014-12-11
6 2015-01-01
1 2015-01-05
3 2015-01-05
7 2015-02-02
9 2015-04-05
8 2015-05-05
Name: Payment_Date, dtype: datetime64[ns]
In [267]: df['Order_Date']
Out[267]:
0 2014-05-08
2 2014-07-08
4 2014-09-08
5 2014-10-11
6 2014-11-12
1 2014-06-08
3 2014-08-08
7 2015-01-02
9 2015-03-05
8 2015-02-05
Name: Order_Date, dtype: datetime64[ns]
then searchsorted returns
In [268]: df['Payment_Date'].searchsorted(df['Order_Date'])
Out[268]: array([0, 1, 2, 3, 3, 0, 2, 5, 8, 8])
The first value, 0, for example, indicates that the Order_Date, 2014-05-08,
would have to be inserted at ordinal index 0 (before the Payment_Date
2014-06-08) to keep the Payment_Dates in sorted order. The second value, 1,
indicates that the Order_Date, 2014-07-08, would have to be inserted at
ordinal index 1 (after the Payment_Date 2014-06-08 and before 2014-07-08)
to keep the Payment_Dates in sorted order. And so on for the other indices.
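If it helps to see the mechanics in isolation, here is a minimal sketch on toy dates (made up for illustration, not the question's data):
import pandas as pd

s = pd.Series(pd.to_datetime(['2014-06-08', '2014-07-08', '2014-10-08']))
s.searchsorted(pd.to_datetime(['2014-05-01', '2014-07-08', '2014-08-01']))
# -> array([0, 1, 2]): position 0 (before everything), position 1 (equal
#    values insert on the left by default), and position 2 (between the
#    2nd and 3rd dates)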
Now, of course, there are some complications:
The Payment_Dates need to be in sorted order for searchsorted to return a
meaningful result:
df = df.sort_values(by=['Payment_Date'])
We need to group by the Cust_No
grouped = df.groupby('Cust_No')
We want the index of the Payment_Date which comes before the
Order_Date. Thus, we really need to decrease the index by one:
idx = grp['Payment_Date'].searchsorted(grp['Order_Date'])
result = grp['Payment_Date'].iloc[idx-1]
So that grp['Payment_Date'].iloc[idx-1] would grab the prior Payment_Date.
When searchsorted returns 0, the Order_Date is less than all
Payment_Dates. We want a NaT in this case.
result[idx == 0] = pd.NaT
So putting it all together,
import pandas as pd

NaT = pd.NaT
T = pd.Timestamp
df = pd.DataFrame({
    'Cust_No': ['A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'A', 'B'],
    'expected': [
        NaT, NaT, NaT, T('2014-06-08'), T('2014-06-08'), T('2014-10-08'),
        T('2014-07-08'), T('2015-01-01'), T('2015-01-05'), T('2015-02-02')],
    'Order_Date': [
        T('2014-05-08'), T('2014-06-08'), T('2014-07-08'), T('2014-08-08'),
        T('2014-09-08'), T('2014-10-11'), T('2014-11-12'), T('2015-01-02'),
        T('2015-02-05'), T('2015-03-05')],
    'Payment_Date': [
        T('2014-06-08'), T('2015-01-05'), T('2014-07-08'), T('2015-01-05'),
        T('2014-10-08'), T('2014-12-11'), T('2015-01-01'), T('2015-02-02'),
        T('2015-05-05'), T('2015-04-05')]})

def last_payment_date(s, df):
    # restrict to the rows of this group, in Payment_Date order
    grp = df.loc[s.index]
    idx = grp['Payment_Date'].searchsorted(grp['Order_Date'])
    # step back one position to get the payment *before* each order
    result = grp['Payment_Date'].iloc[idx - 1]
    # searchsorted == 0 means no earlier payment exists
    result[idx == 0] = pd.NaT
    return result

df = df.sort_values(by=['Payment_Date'])
grouped = df.groupby('Cust_No')
df['Last_Payment_Date'] = grouped['Payment_Date'].transform(last_payment_date, df)
print(df)
yields
Cust_No Order_Date Payment_Date expected Last_Payment_Date
0 A 2014-05-08 2014-06-08 NaT NaT
2 B 2014-07-08 2014-07-08 NaT NaT
4 A 2014-09-08 2014-10-08 2014-06-08 2014-06-08
5 A 2014-10-11 2014-12-11 2014-10-08 2014-10-08
6 B 2014-11-12 2015-01-01 2014-07-08 2014-07-08
1 B 2014-06-08 2015-01-05 NaT NaT
3 A 2014-08-08 2015-01-05 2014-06-08 2014-06-08
7 B 2015-01-02 2015-02-02 2015-01-01 2015-01-01
9 B 2015-03-05 2015-04-05 2015-02-02 2015-02-02
8 A 2015-02-05 2015-05-05 2015-01-05 2015-01-05
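For what it's worth, newer pandas versions can express the same "closest earlier Payment_Date per customer" logic directly with merge_asof; a hedged sketch, assuming df holds the columns above:
left = df.sort_values('Order_Date')
right = (df[['Cust_No', 'Payment_Date']]
         .rename(columns={'Payment_Date': 'Last_Payment_Date'})
         .sort_values('Last_Payment_Date'))
# backward direction picks the closest earlier date; exact matches are
# excluded so a payment on the order date itself does not count
out = pd.merge_asof(left, right,
                    left_on='Order_Date', right_on='Last_Payment_Date',
                    by='Cust_No', direction='backward',
                    allow_exact_matches=False)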

Related

Calculate time intervals to form new column in pandas dataframe

I have 100,000 rows of data in the following format:
import pandas as pd

data = {'ID': [1, 1, 3, 3, 4, 3, 4, 4, 4],
        'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:01',
                      '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06',
                      '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}
df = pd.DataFrame(data)
ID timestamp
0 1 2014-12-23 16:53:00
1 1 2014-12-23 17:00:00
2 3 2014-12-23 17:01:00
3 3 2014-12-23 17:02:00
4 4 2014-12-23 17:00:00
5 3 2014-12-23 17:06:00
6 4 2014-12-23 17:15:00
7 4 2014-12-23 17:16:00
8 4 2014-12-23 17:20:00
The ID represents a user and the timestamp is when that user visited a website. I want to get information about sessions using pandas, where each session on this site is a maximum of 15 mins long. A new session starts once the user has been logged on for 15 mins. For the above sample data, the desired result would be:
   ID  session_start   session_duration
0   1  12/23/14 16:53  7 min
1   3  12/23/14 17:02  4 min
2   4  12/23/14 17:00  15 min
3   4  12/23/14 17:16  4 min
Let me know if there is information I should add to clarify. I can't seem to get a working solution. Any help is appreciated!
EDIT: In response to queries below, I noticed a mistake in my example. Sorry guys, it was very late at night!
The problem that I am struggling with has mostly to do with user 4. They are still logged in after 15 minutes, and I want to capture in my data that a new session has started.
My problem is slightly different from Groupby every 2 hours data of a dataframe because I want to do this per individual user.
Not pretty, but here's a solution. The basic idea is to use groupby with diff to calculate differences between timestamps for each ID, but I couldn't find a nice way to diff only every 2 rows. So this approach applies diff to every row, then selects the result from every other row within each ID.
Note that I'm assuming that the dataframe is properly ordered. Also note that your sample data had an extra entry for ID==1 that I removed.
import pandas as pd
data = {'ID': [1, 1, 3, 4, 3, 4, 4, 4],
'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06', '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}
df = pd.DataFrame(data)
df['timestamp']=pd.to_datetime(df['timestamp'])
# groupby to get difference between each timestamp
df['diffs'] = df.groupby('ID')['timestamp'].diff()
# count every time ID appears
df['counts'] = df.groupby('ID')['ID'].cumcount()+1
print("after diffs and counts:")
print(df)
# select entries for every 2nd occurrence (where df['counts'] is even)
new_df = df[df['counts'] % 2 == 0][['ID','timestamp','diffs']]
# timestamp here will be the session endtime so subtract the
# diffs to get session start time
new_df['timestamp'] = new_df['timestamp'] - new_df['diffs']
# and a final rename
new_df = new_df.rename(columns={'timestamp':'session_start','diffs':'session_duration'})
print("\nfinal df:")
print(new_df)
Will print out
after diffs and counts:
ID timestamp diffs counts
0 1 2014-12-23 16:53:00 NaT 1
1 1 2014-12-23 17:00:00 0 days 00:07:00 2
2 3 2014-12-23 17:02:00 NaT 1
3 4 2014-12-23 17:00:00 NaT 1
4 3 2014-12-23 17:06:00 0 days 00:04:00 2
5 4 2014-12-23 17:15:00 0 days 00:15:00 2
6 4 2014-12-23 17:16:00 0 days 00:01:00 3
7 4 2014-12-23 17:20:00 0 days 00:04:00 4
final df:
ID session_start session_duration
1 1 2014-12-23 16:53:00 0 days 00:07:00
4 3 2014-12-23 17:02:00 0 days 00:04:00
5 4 2014-12-23 17:00:00 0 days 00:15:00
7 4 2014-12-23 17:16:00 0 days 00:04:00
Then, to get the session_duration column as a number of minutes instead of a timedelta object, you can do:
import numpy as np
new_df['session_duration'] = new_df['session_duration'] / np.timedelta64(1,'s') / 60.
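Equivalently, a sketch using the .dt accessor on the timedelta column (same result, arguably clearer):
# convert the timedelta column to minutes via total_seconds()
new_df['session_duration'] = new_df['session_duration'].dt.total_seconds() / 60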

Count business days between two dates using pandas columns

I have tried to calculate the number of business days between two dates (stored in separate columns of a dataframe).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp, as follows:
Timestamp('2014-08-31 00:00:00')
or to datetime:
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Does anyone know how to fix this?
Note 1: I have tried np.busday_count in two ways:
1. Passing dataframe columns: t['Days'] = np.busday_count(t.MonthBegin, t.MonthEnd)
2. Passing arrays: np.busday_count(dt1, dt2)
Note 2: My dataframe has over 150K rows, so I need an efficient approach.
You can use bdate_range. I also corrected your input, since most of the MonthEnd values were earlier than the MonthBegin values:
[len(pd.bdate_range(x,y))for x,y in zip(df['MonthBegin'],df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do it is:
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing:
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>> df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
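As an aside, since the question mentions ~150K rows: the row-wise apply can be slow at that scale, and np.busday_count is itself vectorized. A hedged sketch over the same df as above:
# cast to datetime64[D] at call time (assigning a [D] array back into a
# DataFrame column upcasts it to [ns] again, so cast here instead)
df['busday'] = np.busday_count(df['MBegin'].values.astype('datetime64[D]'),
                               df['MEnd'].values.astype('datetime64[D]'))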
You need to provide the format in which your dates are written:
from datetime import datetime

a = datetime.strptime('2014-06-9', '%Y-%m-%d')
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now take their difference:
c = b - a
This gives datetime.timedelta(21); to convert it into days, just use c.days, which gives the difference of 21 days. You can now use a list comprehension to get the difference between the two date columns as days.
You can modify your code to get the desired result as below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01',
                                  '2014-10-01', '2014-11-01'],
                   'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30',
                                '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally, numpy.busday_count has a few other optional arguments, like weekmask and holidays, which you can use according to your needs.
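For example, a sketch with a made-up holiday date (the dates are illustrative):
import numpy as np

np.busday_count('2014-06-09', '2014-06-30',
                weekmask='1111100',        # Monday-Friday working week
                holidays=['2014-06-16'])   # skip this Monday
# -> 14, one less than the 15 business days counted above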

How to replace pandas DataFrame values and concatenate another string with a where condition

I want to replace the values in the "Time Period" column and append another string, as shown below.
Value: 2017M12
Replace the M with - and append '-01'.
Final result: 2017-12-01
Frequency,Time Period,Date
3,2016M12
3,2016M1
3,2016M8
3,2016M7
3,2016M11
3,2016M10
My attempt, which doesn't work:
dt['Date'] = dt.loc[dt['Frequency']=='3',replace('Time Period','M','-')]+'-01'
In [18]: df.loc[df.Frequency==3,'Date'] = \
pd.to_datetime(df.loc[df.Frequency==3, 'Time Period'],
format='%YM%m', errors='coerce')
In [19]: df
Out[19]:
Frequency Time Period Date
0 3 2016M12 2016-12-01
1 3 2016M1 2016-01-01
2 3 2016M8 2016-08-01
3 3 2016M7 2016-07-01
4 3 2016M11 2016-11-01
5 3 2016M10 2016-10-01
In [20]: df.dtypes
Out[20]:
Frequency int64
Time Period object
Date datetime64[ns] # <--- NOTE
dtype: object
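A possible follow-up: if the plain string '2016-12-01' is preferred over a datetime64 column, you can format it back out (any NaT rows would become missing values):
# render the datetime column back to zero-padded strings
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')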
You can use apply:
dt['Date'] = dt[ dt['Frequency'] ==3]['Time Period'].apply(lambda x: x.replace('M','-')+"-01")
output
Frequency Time Period Date
0 3 2016M12 2016-12-01
1 3 2016M1 2016-1-01
2 3 2016M8 2016-8-01
3 3 2016M7 2016-7-01
4 3 2016M11 2016-11-01
5 3 2016M10 2016-10-01
Also, you don't need to create an empty column 'Date' first; assigning to dt['Date'] will create it automatically.
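For larger frames, a hedged vectorized sketch of the same transformation, using the .str accessor instead of apply (and honoring the Frequency condition):
mask = dt['Frequency'] == 3
dt.loc[mask, 'Date'] = dt.loc[mask, 'Time Period'].str.replace('M', '-') + '-01'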

How to get the datetimes before and after some specific dates in Pandas?

I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and the x days after that date.
That means the resulting DataFrame will contain more rows (at most n*(1 + 2*x), where n is the original number of dates in col1).
How can I do that in a proper pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
you can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
 .stack()
 .drop_duplicates()
 .reset_index(level=[0, 1], drop=True)
 .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a dataframe with a datetime.date column and then stacks another Series underneath it, built by applying timedelta shifts to the original dates.
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
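Another sketch of the first approach with a plain comprehension, assuming col1 holds datetimes as in the first answer (x = 1 as in the question):
import pandas as pd

x = 1
expanded = pd.DataFrame({'col1': sorted({
    d
    for c in df['col1']
    for d in pd.date_range(c - pd.Timedelta(days=x),
                           c + pd.Timedelta(days=x))})})  # the set dedupes overlaps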

Loading pandas DataFrame from dict of series possible glitch?

I'm constructing a dictionary using a dictionary comprehension which has read_csv embedded within it. This constructs the dictionary fine, but when I then push it into a DataFrame all of my data goes to null and the dates get very wacky as well. Here's sample code and output:
In [129]: a= {x.split(".")[0] : read_csv(x, parse_dates=True, index_col=[0])["Settle"] for x in t[:2]}
In [130]: a
Out[130]:
{'SPH2010': Date
2010-03-19 1172.95
2010-03-18 1166.10
2010-03-17 1165.70
2010-03-16 1159.50
2010-03-15 1150.30
2010-03-12 1151.30
2010-03-11 1150.60
2010-03-10 1145.70
2010-03-09 1140.50
2010-03-08 1137.10
2010-03-05 1136.50
2010-03-04 1122.30
2010-03-03 1118.60
2010-03-02 1117.40
2010-03-01 1114.60
...
2008-04-10 1370.4
2008-04-09 1367.7
2008-04-08 1378.7
2008-04-07 1378.4
2008-04-04 1377.8
2008-04-03 1379.9
2008-04-02 1377.7
2008-04-01 1376.6
2008-03-31 1329.1
2008-03-28 1324.0
2008-03-27 1334.7
2008-03-26 1340.7
2008-03-25 1357.0
2008-03-24 1357.3
2008-03-20 1329.8
Name: Settle, Length: 495,
'SPM2011': Date
2011-06-17 1279.4
2011-06-16 1269.0
2011-06-15 1265.4
2011-06-14 1289.9
2011-06-13 1271.6
2011-06-10 1269.2
2011-06-09 1287.4
2011-06-08 1277.0
2011-06-07 1284.8
2011-06-06 1285.0
2011-06-03 1296.3
2011-06-02 1312.4
2011-06-01 1312.1
2011-05-31 1343.9
2011-05-27 1329.9
...
2009-07-10 856.6
2009-07-09 861.2
2009-07-08 856.0
2009-07-07 861.7
2009-07-06 877.9
2009-07-02 875.8
2009-07-01 902.6
2009-06-30 900.3
2009-06-29 908.0
2009-06-26 901.1
2009-06-25 903.8
2009-06-24 885.2
2009-06-23 877.6
2009-06-22 876.0
2009-06-19 903.4
Name: Settle, Length: 497}
In [131]: DataFrame(a)
Out[131]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 806 entries, 2189-09-10 03:33:28.879144 to 1924-01-20 06:06:06.621835
Data columns:
SPH2010 0 non-null values
SPM2011 0 non-null values
dtypes: float64(2)
Thanks!
EDIT:
I've also tried doing this with concat and I get the same results.
You should be able to use concat and unstack. Here's an example:
s1 = pd.Series([1, 2], name='a')
s2 = pd.Series([3, 4], index=[1, 2], name='b')
d = {'A': s1, 'B': s2}  # a dict of Series
In [4]: pd.concat(d)
Out[4]:
A 0 1
1 2
B 1 3
2 4
In [5]: pd.concat(d).unstack().T
Out[5]:
A B
0 1 NaN
1 2 3
2 NaN 4
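If the original glitch bites again, one hedged workaround is to sort each Series' DatetimeIndex before building the frame; unsorted indexes are a plausible culprit for the wacky dates above (toy data here, not the question's CSVs):
import pandas as pd

s1 = pd.Series([1172.95, 1166.10],
               index=pd.to_datetime(['2010-03-19', '2010-03-18']))
s2 = pd.Series([1279.4, 1269.0],
               index=pd.to_datetime(['2011-06-17', '2011-06-16']))
df = pd.DataFrame({k: v.sort_index() for k, v in
                   {'SPH2010': s1, 'SPM2011': s2}.items()})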
