I have a dataframe like the one shown below:
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
'end_date':['5/12/2013 09:27:00 AM',np.nan,'06/11/2014 08:00:00 AM',np.nan,'12/16/2011 10:00:00','10/18/2012 00:00:00',np.nan],
'type':['O','I','O','O','I','O','I']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
I would like to fill the NaN values in the end_date column using the two rules below:
a) If NaN is found in any row except the last row for that person, fill it by copying the value from the next row.
b) If NaN is found in the last row for that person, fill it by adding 10 days to that person's start_date (there is no next row to copy from, so we use an arbitrary offset of 10 days).
Rules a and b apply only where type=I.
Where type=O, just fill the NaN by copying the value from start_date.
This is what I tried. As you can see, I am writing almost the same line of code twice.
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['end_date'].bfill()),pd.DatetimeIndex(df.start_date.dt.date))
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),pd.DatetimeIndex(df.start_date.dt.date))
Is there an elegant and efficient way to write this? I have to apply it to a large dataset with 15 million rows.
I expect my output to look as shown below.
Solution
s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)
s2 = df['end_date'].fillna(df['start_date'])
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
Explanations
Group the dataframe on person_id and shift the start_date column one unit upwards.
>>> df.groupby('person_id')['start_date'].shift(-1)
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 NaT
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 NaT
Name: start_date, dtype: datetime64[ns]
Fill the NaN values in the shifted column with the values from start_date column after adding an offset of 10 days
>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 2014-06-16 05:00:00
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
Now fill the NaN values in end_date column with the above series s1
>>> df['end_date'].fillna(s1)
0 2013-05-12 09:27:00
1 2014-06-06 08:00:00
2 2014-06-11 08:00:00
3 2014-06-16 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
Similarly fill the NaN values in end_date column with the values from start_date column to create a series s2
>>> df['end_date'].fillna(df['start_date'])
0 2013-05-12 09:27:00
1 2013-09-08 11:21:00
2 2014-06-11 08:00:00
3 2014-06-06 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]
Then use np.where to select the values from s1 / s2 based on the condition where the type is I or O
>>> df
person_id start_date end_date type
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 O
1 101 2013-09-08 11:21:00 2014-06-06 08:00:00 I
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 O
3 101 2014-06-06 05:00:00 2014-06-06 05:00:00 O
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 I
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 O
6 202 2012-12-13 11:45:00 2012-12-23 11:45:00 I
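For a large frame it may help to wrap the steps above in one function. The following is only a sketch under some assumptions: the helper name fill_end_dates and the small sample frame are hypothetical, pd.Timedelta is swapped in for pd.DateOffset, and Series.where replaces np.where so the result stays a datetime Series. As in the solution above, "next row" is taken to mean the next start_date of the same person.

```python
import pandas as pd

def fill_end_dates(df, days=10):
    """Hypothetical helper: return end_date with gaps filled per the rules above."""
    # Next start_date within the same person; NaT on each person's last row.
    nxt = df.groupby('person_id')['start_date'].shift(-1)
    # Last row of a person: fall back to start_date + `days`.
    fill_i = nxt.fillna(df['start_date'] + pd.Timedelta(days=days))
    # Type I rows use the per-person fallback, all other rows use start_date.
    fallback = fill_i.where(df['type'].eq('I'), df['start_date'])
    # Existing end_date values are kept; only the gaps are filled.
    return df['end_date'].fillna(fallback)

# Small illustrative sample with unambiguous ISO dates (not the question's data).
sample = pd.DataFrame({
    'person_id': [1, 1, 1, 2],
    'start_date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01', '2020-01-05']),
    'end_date': pd.to_datetime(['2020-01-03', None, None, None]),
    'type': ['O', 'I', 'I', 'O'],
})
filled = fill_end_dates(sample)
print(filled)
```

Everything here is vectorized, so nothing is evaluated row by row in Python.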
Related
I'm trying to measure the difference between timestamps under certain conditions. Using the data below, for each unique ID, I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to replace the .shift() with something more robust, as each unique ID will have a different combination of items. For example: A,B,C,D - A,B,D - A,D etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','D','A','D','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
older answer
The issue is your fillna, you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
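To put the per-ID difference back on the D rows, as in the intended output of the question, something like the following might work. It is a sketch that assumes each ID has at most one A row and at most one D row (Series.map needs a unique index); the names end_a, start_d and per_id are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [10, 10, 10, 20, 20, 30],
    'Start Time': ['2019-08-02 09:00:00', '2019-08-03 10:50:00', '2019-08-05 16:00:00',
                   '2019-08-04 08:00:00', '2019-08-04 15:30:00', '2019-08-06 11:00:00'],
    'End Time': ['2019-08-04 15:00:00', '2019-08-04 16:00:00', '2019-08-05 16:00:00',
                 '2019-08-04 14:00:00', '2019-08-05 20:30:00', '2019-08-07 10:00:00'],
    'Item': ['A', 'B', 'D', 'A', 'D', 'A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])

# End Time of the A row and Start Time of the D row, both indexed by ID.
end_a = df.loc[df['Item'].eq('A')].set_index('ID')['End Time']
start_d = df.loc[df['Item'].eq('D')].set_index('ID')['Start Time']

# Subtraction aligns on ID; IDs without a D row (or an A row) become NaT.
per_id = start_d.sub(end_a)

# Map the per-ID value onto every row, then keep it only on the D rows.
df['diff'] = df['ID'].map(per_id).where(df['Item'].eq('D'))
print(df)
```

This avoids .shift() entirely, so it does not matter how many other items sit between A and D.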
I have a pandas dataframe like the one shown below:
df = pd.DataFrame({'login_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','','10/11/1990'],
'DURATION':[21,30,200,34,45,np.nan]})
I would like to add the DURATION values to the login_date column.
DURATION is expressed in days.
If there is a NaN in the DURATION column, just replace it with 0.
So, I tried the below:
df['DURATION'] = df['DURATION'].fillna(0)
df['login_date'] = pd.to_datetime(df['login_date'])
df['DURATION'] = df['DURATION'].astype('Int64')
df['logout_Date'] = df['login_date'] + pd.offsets.DateOffset(days=df['DURATION'])
However, this results in an error as shown below
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
But I have already converted my DURATION column to the Int64 type.
How can I add the DURATION values (as days) to login_date to get the logout_date column?
Try:
df["logout_date"] = pd.to_datetime(df["login_date"]) + df["DURATION"].fillna(0).apply(lambda x: pd.Timedelta(days=x))
print(df)
Prints:
login_date DURATION logout_date
0 5/7/2013 09:27:00 AM 21.0 2013-05-28 09:27:00
1 09/08/2013 11:21:00 AM 30.0 2013-10-08 11:21:00
2 06/06/2014 08:00:00 AM 200.0 2014-12-23 08:00:00
3 06/06/2014 05:00:00 AM 34.0 2014-07-10 05:00:00
4 45.0 NaT
5 10/11/1990 NaN 1990-10-11 00:00:00
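A vectorized alternative to the apply call is pd.to_timedelta with unit='D', which converts the whole column of day counts at once. The sketch below uses a small made-up frame called logins with ISO-formatted dates so that parsing is unambiguous; it is not the question's data.

```python
import numpy as np
import pandas as pd

# Illustrative sample (hypothetical), ISO dates to avoid day/month ambiguity.
logins = pd.DataFrame({
    'login_date': ['2013-05-07 09:27:00', '2013-09-08 11:21:00', '1990-10-11 00:00:00'],
    'DURATION': [21, 30, np.nan],
})

# Missing durations count as 0 days; the whole column is converted in one call.
days = pd.to_timedelta(logins['DURATION'].fillna(0), unit='D')
logins['logout_date'] = pd.to_datetime(logins['login_date']) + days
print(logins)
```

The original error comes from pd.offsets.DateOffset(days=...) expecting a single number, whereas a timedelta Series adds element-wise to a datetime Series.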
I have a dataframe like the one shown below. This is a continuation of an earlier post.
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'person_type':['A','A','B','C','D','B','A'],
'login_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
'logout_date':[np.nan,'11/08/2013 11:21:00 AM',np.nan,'06/06/2014 05:00:00 AM',np.nan,'13/10/2012 12:00:00 AM',np.nan]})
df.login_date = pd.to_datetime(df.login_date)
df.logout_date = pd.to_datetime(df.logout_date)
I would like to apply 2 rules to the logout_date column:
Rule 1 - If the person type is B, C, D or E and logout_date is NaN, then copy the login_date value
Rule 2 - If the person type is A and logout_date is NaN, then add 2 days to the login_date
When I try the below
m1 = df['person_type'].isin(['B','C','D'])
m2 = df['person_type'].isin(['A'])
m3 = df['logout_datetime'].isna()
df['logout_datetime'] = np.select([m1 & m3, m2 & m3],
[df['login_datetime'], df['login_datetime'] + pd.DateOffset(days=2)],
default=df['logout_datetime'])
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_datetime'].dt.date, (df['login_datetime'] + pd.DateOffset(days=2)).dt.date],
default=df['logout_datetime'])
I would like to get the logout_date column directly by using np.select, as shown in the sample code.
Currently I get the output below, which is incorrect.
I don't understand why some rows are causing issues while other rows are working fine.
Can you help me with this? I expect my output to have proper date values.
I think the problem is the missing conversion in the default parameter of np.select (default=df['logout_datetime']). Change it to default=df['logout_datetime'].dt.date so that np.select returns the same type for every row:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_date'].dt.date,
(df['login_date'] + pd.DateOffset(days=2)).dt.date],
default=df['logout_date'].dt.date)
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09
1 101 A 2013-09-08 11:21:00 2013-11-08
2 101 B 2014-06-06 08:00:00 2014-06-06
3 101 C 2014-06-06 05:00:00 2014-06-06
4 202 D 2011-12-11 10:00:00 2011-12-11
5 202 B 2012-10-13 00:00:00 2012-10-13
6 202 A 2012-12-13 11:45:00 2012-12-15
If you need the default to stay as datetimes, use Series.dt.normalize, which removes the times (sets them to 00:00:00); all values are then datetimes, so it works well:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_date'].dt.normalize(),
(df['login_date'] + pd.DateOffset(days=2)).dt.normalize()],
default=df['logout_date'])
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09 00:00:00
1 101 A 2013-09-08 11:21:00 2013-11-08 11:21:00
2 101 B 2014-06-06 08:00:00 2014-06-06 00:00:00
3 101 C 2014-06-06 05:00:00 2014-06-06 05:00:00
4 202 D 2011-12-11 10:00:00 2011-12-11 00:00:00
5 202 B 2012-10-13 00:00:00 2012-10-13 00:00:00
6 202 A 2012-12-13 11:45:00 2012-12-15 00:00:00
To keep the original datetimes, use:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
[df['login_date'],
(df['login_date'] + pd.DateOffset(days=2))],
default=df['logout_date'])
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09 09:27:00
1 101 A 2013-09-08 11:21:00 2013-11-08 11:21:00
2 101 B 2014-06-06 08:00:00 2014-06-06 08:00:00
3 101 C 2014-06-06 05:00:00 2014-06-06 05:00:00
4 202 D 2011-12-11 10:00:00 2011-12-11 10:00:00
5 202 B 2012-10-13 00:00:00 2012-10-13 00:00:00
6 202 A 2012-12-13 11:45:00 2012-12-15 11:45:00
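The same two rules can also be expressed without np.select, which keeps everything as a datetime Series throughout and so sidesteps the mixed-type problem. This is a sketch on a made-up frame named logins; the person type 'Z' is invented here only to show that rows matching neither rule stay NaT.

```python
import pandas as pd

# Illustrative sample (hypothetical), not the question's data.
logins = pd.DataFrame({
    'person_type': ['A', 'A', 'B', 'Z'],
    'login_date': pd.to_datetime(['2013-05-07 09:27:00', '2013-09-08 11:21:00',
                                  '2014-06-06 08:00:00', '2014-06-07 08:00:00']),
    'logout_date': pd.to_datetime([None, '2013-09-10 11:21:00', None, None]),
})

is_a = logins['person_type'].eq('A')
is_same_day = logins['person_type'].isin(['B', 'C', 'D', 'E'])

# Rule 2 for type A, rule 1 for types B to E, NaT for anything else.
plus_two = logins['login_date'] + pd.Timedelta(days=2)
fallback = plus_two.where(is_a, logins['login_date'].where(is_same_day))

# Only missing logout dates are replaced; existing values are untouched.
filled = logins['logout_date'].fillna(fallback)
print(filled)
```

If only the date part is wanted, filled.dt.normalize() keeps the datetime dtype, while filled.dt.date yields Python date objects.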
I have a dataframe like the one shown below:
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'login_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.login_date = pd.to_datetime(df.login_date)
df['logout_date'] = df.login_date + pd.Timedelta(days=5)
df['login_id'] = [1,1,1,1,8,8,8]
As you can see in the sample dataframe, the login_id is the same even though the login and logout dates are different for the person.
For example, person 101 has logged in and out at 4 different timestamps, but has the same login_id for all of them, which is incorrect.
Instead, I would like to generate a new login_id column where each login gets a new login_id that retains the person's first login_id, so that we can tell it is a sequence.
I tried the below, but it doesn't work well:
df.groupby(['person_id','login_date','logout_date'])['login_id'].rank(method="first", ascending=True) + 100000
I expect my output to look as shown below. You can see how 1 and 8, the first login_id for each person, are retained in their subsequent login_ids. We just build a sequence by appending 00001 and incrementing by one for each additional row.
Please note that I would like to apply this to a large dataset, and the login_ids may not be single digits in the real data. For example, the first login_id could be a random number such as 576869578; in that case, the subsequent login_id would be 57686957800001. Whatever the first login_id is for that person, append 00001, 00002, etc. based on the number of rows that person has.
Update 2: Just realized my previous answers also added 100000 to the first index. Here is a version that uses GroupBy.transform() to add 100000 only to subsequent indexes:
cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.groupby(['person_id','login_id']).login_id.transform(
lambda x: x.shift().mul(100000).fillna(x.min())
).add(cumcount)
# person_id login_date logout_date login_id
# 0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 1
# 1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
# 2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
# 3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
# 4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 8
# 5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
# 6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
Update: Faster option is to build the sequence with GroupBy.cumcount():
cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.login_id.mul(100000).add(cumcount)
# person_id login_date logout_date login_id
# 0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 100000
# 1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
# 2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
# 3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
# 4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 800000
# 5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
# 6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
You can build the sequence in a GroupBy.apply():
df.login_id = df.groupby(['person_id','login_id']).login_id.apply(
lambda x: pd.Series([x.min()*100000+seq for seq in range(len(x))], x.index)
)
login_id = df.groupby('person_id').login_id.apply(list)
def modify_id(x):
result= []
for index,value in enumerate(x):
if index > 0:
value = (int(value) * 100000) + index
result.append(value)
return result
df['login_id'] = login_id.apply(lambda x : modify_id(x)).explode().to_list()
This gives the output:
person_id login_date logout_date login_id
101 2013-05-07 09:27:00 2013-05-12 09:27:00 1
101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
202 2011-12-11 10:00:00 2011-12-16 10:00:00 8
202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
You can make use of your original rank()
df['login_id'] = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
# print(df)
person_id login_date logout_date login_id
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 100000.0
1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001.0
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002.0
3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003.0
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 800000.0
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001.0
6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002.0
Then change the first row of each group:
def change_first(group):
group.loc[group.index[0], 'login_id'] = group.iloc[0]['login_id'] / 100000
return group
df['login_id'] = df.groupby(['person_id'], group_keys=False).apply(change_first)['login_id']
# print(df)
person_id login_date logout_date login_id
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 1.0
1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001.0
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002.0
3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003.0
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 8.0
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001.0
6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002.0
Or make use of where() to update only the rows where the condition is False:
df_ = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
firsts = df.groupby(['person_id']).head(1).index
df['login_id'] = df['login_id'].where(df.index.isin(firsts), df_)
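The ideas in these answers can be combined into a short vectorized form that keeps the integer dtype. This is a sketch, assuming the prefix should be the first login_id seen for each person and that the sequence never exceeds 99999 rows per person; first_id and seq are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [101, 101, 101, 101, 202, 202, 202],
    'login_id': [1, 1, 1, 1, 8, 8, 8],
})

# First login_id of each person, broadcast to all of that person's rows.
first_id = df.groupby('person_id')['login_id'].transform('first')
# 0-based position of each row within its person.
seq = df.groupby('person_id').cumcount()

# Keep the first id unchanged; later rows get first_id followed by a 5-digit sequence.
df['login_id'] = first_id.where(seq.eq(0), first_id * 100000 + seq)
print(df)
```

Because no NaN is introduced, the column stays int64, which matters for large ids such as 576869578, where 57686957800001 still fits comfortably.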
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
The Hora_Retiro column is of type timedelta64[ns].
The code gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
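You can create an hour column from the Hora_Retiro column. Since the column is a timedelta, the hour is available through .dt.components (a datetime column would use .dt.hour instead):
df['hour'] = df['Hora_Retiro'].dt.components.hours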
And then groupby on the basis of hour
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, as in that case the date part
would also be printed.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding would push anything from half past onwards into the next hour),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
lambda tt: tt.floor('h'))).count_uses.count()
However, I advise you to decide what you want to count:
rows or the values in the count_uses column.
In the second case, replace the count function with sum.
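The hourly grouping can also be written with the vectorized .dt.floor accessor, flooring rather than rounding so that a value such as 00:45:00 stays in the 00:00:00 bucket. The sketch below uses a tiny made-up frame and, as an extra assumption about what is wanted, reindexes the result so that all 24 hours appear even when an hour has no rows.

```python
import pandas as pd

# Illustrative sample (hypothetical), with a timedelta column as in the question.
hora_pico = pd.DataFrame({
    'Hora_Retiro': pd.to_timedelta(['00:00:18', '00:45:00', '01:10:00', '23:59:56']),
    'count_uses': [1, 1, 1, 1],
})

# Floor every timedelta to the start of its hour and sum the uses per hour.
hour_start = hora_pico['Hora_Retiro'].dt.floor('h')
per_hour = hora_pico.groupby(hour_start)['count_uses'].sum()

# Optional: show every hour from 00:00:00 to 23:00:00, with 0 for empty hours.
all_hours = pd.timedelta_range(start=pd.Timedelta(0), periods=24, freq='h')
per_hour = per_hour.reindex(all_hours, fill_value=0)
print(per_hour)
```

Use .count() instead of .sum() if the goal is to count rows rather than add up the count_uses values.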