I have a dataframe as shown below. This is a continuation of this post
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
                   'person_type': ['A','A','B','C','D','B','A'],
                   'login_date': ['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
                   'logout_date': [np.nan,'11/08/2013 11:21:00 AM',np.nan,'06/06/2014 05:00:00 AM',np.nan,'13/10/2012 12:00:00 AM',np.nan]})
df.login_date = pd.to_datetime(df.login_date)
df.logout_date = pd.to_datetime(df.logout_date)
I would like to apply two rules to the logout_date column:
Rule 1 - If person_type is B, C, D or E AND logout_date is NaN, then copy the login_date value
Rule 2 - If person_type is A AND logout_date is NaN, then add 2 days to the login_date
When I try the below
m1 = df['person_type'].isin(['B','C','D'])
m2 = df['person_type'].isin(['A'])
m3 = df['logout_date'].isna()
df['logout_date'] = np.select([m1 & m3, m2 & m3],
                              [df['login_date'], df['login_date'] + pd.DateOffset(days=2)],
                              default=df['logout_date'])
df['logout_date'] = np.select([m1 & m3, m2 & m3],
                              [df['login_date'].dt.date, (df['login_date'] + pd.DateOffset(days=2)).dt.date],
                              default=df['logout_date'])
I would like to get the logout_date column directly by using np.select as shown in the sample code.
Currently I get an incorrect output. I don't understand why some rows are causing issues while other rows work fine.
Can you help me with this? I expect my output to have proper date values.
I think the problem is the missing conversion in the default parameter of np.select (default=df['logout_date']); change it to default=df['logout_date'].dt.date so that all the values returned from np.select have the same type:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_date'].dt.date,
(df['login_date'] + pd.DateOffset(days=2)).dt.date],
default=df['logout_date'].dt.date)
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09
1 101 A 2013-09-08 11:21:00 2013-11-08
2 101 B 2014-06-06 08:00:00 2014-06-06
3 101 C 2014-06-06 05:00:00 2014-06-06
4 202 D 2011-12-11 10:00:00 2011-12-11
5 202 B 2012-10-13 00:00:00 2012-10-13
6 202 A 2012-12-13 11:45:00 2012-12-15
If you need the default to keep datetimes, Series.dt.normalize removes the times (sets them to 00:00:00), so all the types are datetimes and it works well:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_date'].dt.normalize(),
(df['login_date'] + pd.DateOffset(days=2)).dt.normalize()],
default=df['logout_date'])
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09 00:00:00
1 101 A 2013-09-08 11:21:00 2013-11-08 11:21:00
2 101 B 2014-06-06 08:00:00 2014-06-06 00:00:00
3 101 C 2014-06-06 05:00:00 2014-06-06 05:00:00
4 202 D 2011-12-11 10:00:00 2011-12-11 00:00:00
5 202 B 2012-10-13 00:00:00 2012-10-13 00:00:00
6 202 A 2012-12-13 11:45:00 2012-12-15 00:00:00
For the original datetimes use:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_date'],
(df['login_date'] + pd.DateOffset(days=2))],
default=df['logout_date'])
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09 09:27:00
1 101 A 2013-09-08 11:21:00 2013-11-08 11:21:00
2 101 B 2014-06-06 08:00:00 2014-06-06 08:00:00
3 101 C 2014-06-06 05:00:00 2014-06-06 05:00:00
4 202 D 2011-12-11 10:00:00 2011-12-11 10:00:00
5 202 B 2012-10-13 00:00:00 2012-10-13 00:00:00
6 202 A 2012-12-13 11:45:00 2012-12-15 11:45:00
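As a side note, the same two rules can also be written without np.select, using boolean masks with Series.where and fillna. This is only a sketch; it assumes the column names from the sample frame above and keeps the full datetimes, like the last variant:
m1 = df['person_type'].isin(['B','C','D','E'])
m2 = df['person_type'].eq('A')
out = df['logout_date']
out = out.fillna(df['login_date'].where(m1))                            # Rule 1
out = out.fillna((df['login_date'] + pd.DateOffset(days=2)).where(m2))  # Rule 2
df['logout_date'] = out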
I'm trying to measure the difference between timestamps using certain conditions. Using the below, for each unique ID, I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process returns an error. I'm also hoping to drop the .shift() for something more robust, as each unique ID will have a different combination of items, e.g. A,B,C,D or A,B,D or A,D.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','D','A','D','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
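If the per-ID result should end up back in the original frame as the diff column (as in the intended output), a possible follow-up is to map it onto the D rows. This is a sketch that assumes each ID has at most one A row and one D row:
diff_by_id = (df2.query('Item == "D"')['Start Time']
              - df2.query('Item == "A"')['End Time'])
# place the difference only on the D row of each ID, NaT elsewhere
df['diff'] = df['ID'].map(diff_by_id).where(df['Item'].eq('D'))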
older answer
The issue is your fillna: you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
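If a '-' placeholder is really wanted for display purposes, one option (a sketch; note it converts the column to plain strings, so it should only be done as a final formatting step) is:
df['diff'] = df['diff'].astype(str).replace('NaT', '-')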
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
I have a dataframe as shown below
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
'end_date':['5/12/2013 09:27:00 AM',np.nan,'06/11/2014 08:00:00 AM',np.nan,'12/16/2011 10:00:00','10/18/2012 00:00:00',np.nan],
'type':['O','I','O','O','I','O','I']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
I would like to fillna() the end_date column based on the two rules below:
a) If NA is found in any row except the last row for that person, fill it by copying the value from the next row
b) If NA is found in the last row for that person, fill it by adding 10 days to their start_date (because there is no next row for that person to copy from, we use an arbitrary offset of 10 days)
Rules a and b apply only to persons with type=I.
For persons with type=O, just fillna by copying the value from start_date.
This is what I tried. You can see I am writing almost the same line twice.
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['end_date'].bfill()),pd.DatetimeIndex(df.start_date.dt.date))
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),pd.DatetimeIndex(df.start_date.dt.date))
Is there an elegant and efficient way to write this, as I have to apply it to big data with 15 million rows?
I expect my output to be as shown below.
Solution
s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)
s2 = df['end_date'].fillna(df['start_date'])
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
Explanations
Group the dataframe on person_id and shift the column start_date one unit upwards.
>>> df.groupby('person_id')['start_date'].shift(-1)
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 NaT
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 NaT
Name: start_date, dtype: datetime64[ns]
Fill the NaN values in the shifted column with the values from the start_date column after adding an offset of 10 days
>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 2014-06-16 05:00:00
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
Now fill the NaN values in end_date column with the above series s1
>>> df['end_date'].fillna(s1)
0 2013-05-12 09:27:00
1 2014-06-06 08:00:00
2 2014-06-11 08:00:00
3 2014-06-16 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
Similarly, fill the NaN values in the end_date column with the values from the start_date column to create a series s2
>>> df['end_date'].fillna(df['start_date'])
0 2013-05-12 09:27:00
1 2013-09-08 11:21:00
2 2014-06-11 08:00:00
3 2014-06-06 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]
Then use np.where to select the values from s1 / s2 based on whether the type is I or O
>>> df
person_id start_date end_date type
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 O
1 101 2013-09-08 11:21:00 2014-06-06 08:00:00 I
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 O
3 101 2014-06-06 05:00:00 2014-06-06 05:00:00 O
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 I
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 O
6 202 2012-12-13 11:45:00 2012-12-23 11:45:00 I
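For completeness, the two helper series can also be folded into a single fillna call. This is a sketch under the same column names, equivalent to the np.where selection above:
is_i = df['type'].eq('I')
next_start = df.groupby('person_id')['start_date'].shift(-1)
# rules a/b for type I: next row's start_date, else start_date + 10 days
fill_i = next_start.fillna(df['start_date'] + pd.DateOffset(days=10))
# type O simply falls back to start_date
df['end_date'] = df['end_date'].fillna(fill_i.where(is_i, df['start_date']))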
I have a dataframe as shown below
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'login_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.login_date = pd.to_datetime(df.login_date)
df['logout_date'] = df.login_date + pd.Timedelta(days=5)
df['login_id'] = [1,1,1,1,8,8,8]
As you can see in the sample dataframe, the login_id is the same even though the login and logout dates are different for the person.
For example, person 101 has logged in and out at 4 different timestamps, but has the same login_id, which is incorrect.
Instead, I would like to generate a new login_id column where each person gets a new login_id but retains the first login_id in their subsequent logins, so we can tell it is a sequence.
I tried the below but it doesn't work well
df.groupby(['person_id','login_date','logout_date'])['login_id'].rank(method="first", ascending=True) + 100000
I expect my output to be as shown below. You can see how 1 and 8, the first login_id for each person, are retained in their subsequent login_ids; we just add a sequence by appending 00001, 00002, etc. based on the number of rows.
Please note I would like to apply this to big data, and the login_ids may not be single digits in real data. For example, the first login_id could even be a random number like 576869578; in that case the subsequent login_id would be 57686957800001. Whatever the first login_id for that person is, append 00001, 00002, etc. based on the number of rows that person has.
Update 2: Just realized my previous answers also added 100000 to the first index. Here is a version that uses GroupBy.transform() to add 100000 only to subsequent indexes:
cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.groupby(['person_id','login_id']).login_id.transform(
lambda x: x.shift().mul(100000).fillna(x.min())
).add(cumcount)
#   person_id          login_date         logout_date  login_id
# 0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 1
# 1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
# 2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
# 3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
# 4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 8
# 5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
# 6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
Update: A faster option is to build the sequence with GroupBy.cumcount():
cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.login_id.mul(100000).add(cumcount)
# person_id login_date logout_date login_id
# 0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 100000
# 1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
# 2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
# 3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
# 4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 800000
# 5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
# 6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
You can build the sequence in a GroupBy.apply():
df.login_id = df.groupby(['person_id','login_id']).login_id.apply(
lambda x: pd.Series([x.min()*100000+seq for seq in range(len(x))], x.index)
)
# collect each person's login_ids as a list
login_id = df.groupby('person_id').login_id.apply(list)
def modify_id(x):
    # keep the first id as-is, append a running sequence to the rest
    result = []
    for index, value in enumerate(x):
        if index > 0:
            value = (int(value) * 100000) + index
        result.append(value)
    return result
df['login_id'] = login_id.apply(modify_id).explode().to_list()
This will give the output:
person_id           login_date          logout_date  login_id
      101  2013-05-07 09:27:00  2013-05-12 09:27:00         1
      101  2013-09-08 11:21:00  2013-09-13 11:21:00    100001
      101  2014-06-06 08:00:00  2014-06-11 08:00:00    100002
      101  2014-06-06 05:00:00  2014-06-11 05:00:00    100003
      202  2011-12-11 10:00:00  2011-12-16 10:00:00         8
      202  2012-10-13 00:00:00  2012-10-18 00:00:00    800001
      202  2012-12-13 11:45:00  2012-12-18 11:45:00    800002
You can make use of your original rank()
df['login_id'] = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
# print(df)
person_id login_date logout_date login_id
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 100000.0
1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001.0
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002.0
3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003.0
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 800000.0
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001.0
6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002.0
Then change the first row of each group
def change_first(group):
group.loc[group.index[0], 'login_id'] = group.iloc[0]['login_id'] / 100000
return group
df['login_id'] = df.groupby(['person_id']).apply(lambda group: change_first(group))['login_id']
# print(df)
person_id login_date logout_date login_id
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 1.0
1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001.0
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002.0
3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003.0
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 8.0
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001.0
6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002.0
Or make use of where() to only update the rows where the condition is False.
df_ = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
firsts = df.groupby(['person_id']).head(1).index
df['login_id'] = df['login_id'].where(df.index.isin(firsts), df_)
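For reference, the same idea can also be written with GroupBy.cumcount so that the first row of each person is kept untouched in one step and the column stays integer (no .0 from rank). A sketch, assuming the frame built at the top of the question:
seq = df.groupby('person_id').cumcount()
first = df.groupby('person_id')['login_id'].transform('first')
df['login_id'] = np.where(seq.eq(0), df['login_id'], first * 100000 + seq)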
Suppose I have two date inputs: 2020-01-20 11:35:00 and 2020-01-25 08:00:00.
I want an output DataFrame as:
time1 time2
-------------------------------------------
2020-01-20 11:35:00 | 2020-01-21 00:00:00
2020-01-21 00:00:00 | 2020-01-22 00:00:00
2020-01-22 00:00:00 | 2020-01-23 00:00:00
2020-01-23 00:00:00 | 2020-01-24 00:00:00
2020-01-24 00:00:00 | 2020-01-25 00:00:00
2020-01-25 00:00:00 | 2020-01-25 08:00:00
There is no built-in way to do this; we can use iloc and pd.date_range to generate the date range and then assign the first and last dates.
t1 = pd.Timestamp('2020-01-20 11:35:00')
t2 = pd.Timestamp('2020-01-25 08:00:00')
df = pd.DataFrame({'Time1' : pd.date_range(t1.date(),t2.date())})
df = df.assign(Time2 = df['Time1'] + pd.DateOffset(days=1))
df.iloc[0,0] = t1
df.iloc[-1,1] = t2
print(df)
Time1 Time2
0 2020-01-20 11:35:00 2020-01-21 00:00:00
1 2020-01-21 00:00:00 2020-01-22 00:00:00
2 2020-01-22 00:00:00 2020-01-23 00:00:00
3 2020-01-23 00:00:00 2020-01-24 00:00:00
4 2020-01-24 00:00:00 2020-01-25 00:00:00
5 2020-01-25 00:00:00 2020-01-25 08:00:00
You can use date_range with both dates and then create the dataframe.
d1 = pd.to_datetime('2020-01-20 11:35:00')
d2 = pd.to_datetime('2020-01-25 08:00:00')
l = pd.date_range(d1.date(), d2.date(), freq='d').tolist()[1:] #remove the first date
df = pd.DataFrame({'time1':[d1] + l, 'time2':l + [d2]})
print (df)
time1 time2
0 2020-01-20 11:35:00 2020-01-21 00:00:00
1 2020-01-21 00:00:00 2020-01-22 00:00:00
2 2020-01-22 00:00:00 2020-01-23 00:00:00
3 2020-01-23 00:00:00 2020-01-24 00:00:00
4 2020-01-24 00:00:00 2020-01-25 00:00:00
5 2020-01-25 00:00:00 2020-01-25 08:00:00
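If this has to be done for many timestamp pairs, the second approach can be wrapped in a small helper; split_into_days is just a made-up name for this sketch:
def split_into_days(d1, d2):
    # midnights between the two timestamps, excluding the first day
    l = pd.date_range(d1.date(), d2.date(), freq='d').tolist()[1:]
    return pd.DataFrame({'time1': [d1] + l, 'time2': l + [d2]})

out = split_into_days(pd.Timestamp('2020-01-20 11:35:00'),
                      pd.Timestamp('2020-01-25 08:00:00'))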
I'm looking to filter a large dataframe (millions of rows) based on another much smaller dataframe that has only three columns: ID, Start, End.
The following is what I put together (which works), but it seems like a groupby() or np.where might be faster.
SETUP:
import pandas as pd
import io
csv = io.StringIO(u'''
time id num
2018-01-01 00:00:00 A 1
2018-01-01 01:00:00 A 2
2018-01-01 02:00:00 A 3
2018-01-01 03:00:00 A 4
2018-01-01 04:00:00 A 5
2018-01-01 05:00:00 A 6
2018-01-01 06:00:00 A 6
2018-01-03 07:00:00 B 10
2018-01-03 08:00:00 B 11
2018-01-03 09:00:00 B 12
2018-01-03 10:00:00 B 13
2018-01-03 11:00:00 B 14
2018-01-03 12:00:00 B 15
2018-01-03 13:00:00 B 16
2018-05-29 23:00:00 C 111
2018-05-30 00:00:00 C 122
2018-05-30 01:00:00 C 133
2018-05-30 02:00:00 C 144
2018-05-30 03:00:00 C 155
''')
df = pd.read_csv(csv, sep = '\t')
df['time'] = pd.to_datetime(df['time'])
csv_filter = io.StringIO(u'''
id start end
A 2018-01-01 01:00:00 2018-01-01 02:00:00
B 2018-01-03 09:00:00 2018-01-03 12:00:00
C 2018-05-30 00:00:00 2018-05-30 08:00:00
''')
df_filter = pd.read_csv(csv_filter, sep = '\t')
df_filter['start'] = pd.to_datetime(df_filter['start'])
df_filter['end'] = pd.to_datetime(df_filter['end'])
WORKING CODE
df = pd.merge_asof(df, df_filter, left_on = 'time', right_on = 'start', by = 'id').dropna(subset = ['start']).drop(['start','end'], axis = 1)
df = pd.merge_asof(df, df_filter, left_on = 'time', right_on = 'end', by = 'id', direction = 'forward').dropna(subset = ['end']).drop(['start','end'], axis = 1)
OUTPUT
time id num
0 2018-01-01 01:00:00 A 2
1 2018-01-01 02:00:00 A 3
6 2018-01-03 09:00:00 B 12
7 2018-01-03 10:00:00 B 13
8 2018-01-03 11:00:00 B 14
9 2018-01-03 12:00:00 B 15
11 2018-05-30 00:00:00 C 122
12 2018-05-30 01:00:00 C 133
13 2018-05-30 02:00:00 C 144
14 2018-05-30 03:00:00 C 155
Any thoughts on a more elegant / faster solution?
Why not merge before filtering? Note that this will eat up your memory when the data set is very big.
newdf=df.merge(df_filter)
newdf=newdf.loc[newdf.time.between(newdf.start,newdf.end),df.columns.tolist()]
newdf
Out[480]:
time id num
1 2018-01-01 01:00:00 A 2
2 2018-01-01 02:00:00 A 3
9 2018-01-03 09:00:00 B 12
10 2018-01-03 10:00:00 B 13
11 2018-01-03 11:00:00 B 14
12 2018-01-03 12:00:00 B 15
15 2018-05-30 00:00:00 C 122
16 2018-05-30 01:00:00 C 133
17 2018-05-30 02:00:00 C 144
18 2018-05-30 03:00:00 C 155
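A lower-memory alternative, a sketch assuming df_filter has exactly one start/end row per id, is to map the bounds onto df instead of merging:
bounds = df_filter.set_index('id')
start = df['id'].map(bounds['start'])
end = df['id'].map(bounds['end'])
newdf = df[df['time'].between(start, end)]   # between is inclusive on both ends, like the merge approach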