I have 100,000 rows of data in the following format:
import pandas as pd
data = {'ID': [1, 1, 3, 3, 4, 3, 4, 4, 4],
'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:01', '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06', '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}
df = pd.DataFrame(data)
ID timestamp
0 1 2014-12-23 16:53:00
1 1 2014-12-23 17:00:00
2 3 2014-12-23 17:01:00
3 3 2014-12-23 17:02:00
4 4 2014-12-23 17:00:00
5 3 2014-12-23 17:06:00
6 4 2014-12-23 17:15:00
7 4 2014-12-23 17:16:00
8 4 2014-12-23 17:20:00
The ID represents a user and the timestamp is when that user visited a website. I want to get information about sessions using pandas, where each session on this site is a maximum of 15 mins long. A new session starts once the user has been logged on for 15 mins. For the above sample data, the desired result would be:
ID session_start session_duration
0 1 12/23/14 16:53 7 min
1 3 12/23/14 17:02 4 min
2 4 12/23/14 17:00 15 min
3 4 12/23/14 17:16 4 min
Let me know if there is information I should add to clarify. I can't seem to get a working solution. Any help is appreciated!
EDIT: In response to queries below, I noticed a mistake in my example. Sorry guys it was very late at night!
The problem that I am struggling with has mostly to do with user 4. They are still logged in after 15 minutes, and I want to capture in my data that a new session has started.
The reason my problem is slightly different from Groupby every 2 hours data of a dataframe is that I want to do this per individual user.
Not pretty, but here's a solution. The basic idea is to use groupby with diff to calculate differences between timestamps for each ID, but I couldn't find a nice way to diff only every 2 rows. So this approach uses diff on every row, then selects the diff result from every other row within each ID.
Note that I'm assuming that the dataframe is properly ordered. Also note that your sample data had an extra entry for ID==3 (the 17:01 row) that I removed.
import pandas as pd
data = {'ID': [1, 1, 3, 4, 3, 4, 4, 4],
'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06', '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}
df = pd.DataFrame(data)
df['timestamp']=pd.to_datetime(df['timestamp'])
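If the ordering isn't guaranteed in your real data, a sort at this point would enforce the assumption mentioned above (a minimal sketch, not part of the original solution):
# order rows by user and then by time before computing diffs
df = df.sort_values(['ID', 'timestamp']).reset_index(drop=True)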
# groupby to get the difference between consecutive timestamps within each ID
df['diffs'] = df.groupby('ID')['timestamp'].diff()
# count every time ID appears
df['counts'] = df.groupby('ID')['ID'].cumcount()+1
print("after diffs and counts:")
print(df)
# select entries for every 2nd occurrence (where df['counts'] is even)
new_df = df[df['counts'] % 2 == 0][['ID','timestamp','diffs']]
# timestamp here will be the session endtime so subtract the
# diffs to get session start time
new_df['timestamp'] = new_df['timestamp'] - new_df['diffs']
# and a final rename
new_df = new_df.rename(columns={'timestamp':'session_start','diffs':'session_duration'})
print("\nfinal df:")
print(new_df)
Will print out
after diffs and counts:
ID timestamp diffs counts
0 1 2014-12-23 16:53:00 NaT 1
1 1 2014-12-23 17:00:00 0 days 00:07:00 2
2 3 2014-12-23 17:02:00 NaT 1
3 4 2014-12-23 17:00:00 NaT 1
4 3 2014-12-23 17:06:00 0 days 00:04:00 2
5 4 2014-12-23 17:15:00 0 days 00:15:00 2
6 4 2014-12-23 17:16:00 0 days 00:01:00 3
7 4 2014-12-23 17:20:00 0 days 00:04:00 4
final df:
ID session_start session_duration
1 1 2014-12-23 16:53:00 0 days 00:07:00
4 3 2014-12-23 17:02:00 0 days 00:04:00
5 4 2014-12-23 17:00:00 0 days 00:15:00
7 4 2014-12-23 17:16:00 0 days 00:04:00
Then to get the session_duration column as a number of minutes instead of a timedelta object, you can do:
import numpy as np
new_df['session_duration'] = new_df['session_duration'] / np.timedelta64(1,'s') / 60.
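Equivalently, instead of the numpy division, the Timedelta accessor gives the same minutes (a one-line alternative):
new_df['session_duration'] = new_df['session_duration'].dt.total_seconds() / 60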
I have a Dataset like this:
Timestamp Index Var1
19/03/2015 05:55:00 1 3
19/03/2015 06:00:00 2 4
19/03/2015 06:05:00 3 6
19/03/2015 06:10:00 4 5
19/03/2015 06:15:00 5 7
19/03/2015 06:20:00 6 7
19/03/2015 06:25:00 7 4
The data points were collected at 5-minute intervals. I need to convert the 5-minute data points to 30-minute intervals by averaging Var1. For example, the first data point for the 30-minute intervals will be the average of the 1st through 6th data points (rows 1–6) of the 5-minute dataset.
I tried using
df.groupby(pd.Grouper(key='Timestamp', freq='30min')).mean()
To start from the first timestamp instead of aligning to hours, you just need to specify origin='start'. (I found that in the docs on Grouper.)
Also, averaging the Index column doesn't really make sense. It seems like you want to select only the Var1 column.*
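For a reproducible run, the Timestamp column has to be parsed as datetimes first; since the sample uses day-first dates, dayfirst=True is assumed here (a small setup sketch built from the sample data):
from io import StringIO
import pandas as pd
data = '''Timestamp,Index,Var1
19/03/2015 05:55:00,1,3
19/03/2015 06:00:00,2,4
19/03/2015 06:05:00,3,6
19/03/2015 06:10:00,4,5
19/03/2015 06:15:00,5,7
19/03/2015 06:20:00,6,7
19/03/2015 06:25:00,7,4'''
df = pd.read_csv(StringIO(data))
# parse day-first timestamps (19/03/2015 -> 2015-03-19)
df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)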
df.groupby(
pd.Grouper(key='Timestamp', freq='30min', origin='start')
)['Var1'].mean()
Output:
Timestamp
2015-03-19 05:55:00 5.333333
2015-03-19 06:25:00 4.000000
Freq: 30T, Name: Var1, dtype: float64
* Or you could just as easily do something else with the Index column, for example, keep the first value from each group:
...
).agg({'Index': 'first', 'Var1': 'mean'})
Index Var1
Timestamp
2015-03-19 05:55:00 1 5.333333
2015-03-19 06:25:00 7 4.000000
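An equivalent spelling uses resample, which accepts the same origin argument (a sketch, assuming pandas >= 1.1):
df.set_index('Timestamp')['Var1'].resample('30min', origin='start').mean()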
I have a dataframe of this type:
arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
.
.
What I need to do is:
for every group of rows with the same station, subtract each arr_time from every dep_time of the other rows in that group (never pairing a row with itself). For example:
for station a:
results = []
for i in range(len(arr_time)):
    for j in range(len(dep_time)):
        if i != j:
            results.append(dep_time[j] - arr_time[i])
Result, for station a, must be:
result
-00:20:00
00:25:00
and so on, for all stations in station.
Need to write this with Pandas, due to the large amount of data. I will be very thankful to whoever can help me!
Here is one way. I used pd.merge to join every row for station 'a' with every other row for station 'a' (and likewise for each other station). Then I filtered so that a row is not compared with itself, and performed the time arithmetic.
from io import StringIO
import pandas as pd
data = ''' arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
'''
df = pd.read_csv(StringIO(data), sep=r'\s+')
# create unique identifier for each row
df['id'] = df.reset_index().groupby('station')['index'].rank(method='first').astype(int)
# SQL-style self-join: pair each row with every row at the same station
t = pd.merge(left=df, right=df, how='inner', on='station', suffixes=('_l', '_r'))
# don't compare station to itself
t = t[ t['id_l'] != t['id_r'] ]
# compute elapsed time (as timedelta object)
t['elapsed'] = pd.to_timedelta(t['dep_time_l']) - pd.to_timedelta(t['arr_time_r'])
# convert elapsed time to minutes (may not be necessary)
t['elapsed'] = t['elapsed'] / pd.Timedelta(minutes=1) # convert to minutes
# create display
t = (t[['station', 'elapsed', 'id_l', 'id_r']]
.sort_values(['station', 'id_l', 'id_r']))
print(t)
station elapsed id_l id_r
1 a 25.0 1 2
2 a -20.0 1 3
3 a -20.0 2 1
5 a -40.0 2 3
6 a 25.0 3 1
7 a 50.0 3 2
10 b -5.0 1 2
11 b 17.0 2 1
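If only the elapsed values for a single station are needed, as in the question's expected result, a last filter on t pulls them out (a follow-up sketch):
# elapsed times (in minutes) for station 'a' only
print(t.loc[t['station'] == 'a', 'elapsed'])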
We have a csv file containing predefined time slots.
Based on the start time and end time provided by the user, we want the time slots that fall between the start time and the end time.
e.g.
start time = 11:00:00
end time = 19:00:00
output: slot_no 2, 3, 4, 5
I think you need boolean indexing with between and loc to select the Slot_no column. All the time values are converted with to_timedelta, and midnight in end_time is replaced by 24:00:00 so that the last slot ends after it starts:
df = pd.DataFrame(
{'Slot_no':[1,2,3,4,5,6,7],
'start_time':['0:01:00','8:01:00','10:01:01','12:01:00','14:01:00','18:01:01','20:01:00'],
'end_time':['8:00:00','10:00:00','12:00:00','14:00:00','18:00:00','20:00:00','0:00:00']})
df = df.reindex(columns=['Slot_no','start_time','end_time'])
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
print (df)
Slot_no start_time end_time
0 1 0 days 00:01:00 0 days 08:00:00
1 2 0 days 08:01:00 0 days 10:00:00
2 3 0 days 10:01:01 0 days 12:00:00
3 4 0 days 12:01:00 0 days 14:00:00
4 5 0 days 14:01:00 0 days 18:00:00
5 6 0 days 18:01:01 0 days 20:00:00
6 7 0 days 20:01:00 1 days 00:00:00
start = pd.to_timedelta('11:00:00')
end = pd.to_timedelta('19:00:00')
mask = df['start_time'].between(start, end) | df['end_time'].between(start, end)
s = df.loc[mask, 'Slot_no']
print (s)
2 3
3 4
4 5
5 6
Name: Slot_no, dtype: int64
L = df.loc[mask, 'Slot_no'].tolist()
print (L)
[3, 4, 5, 6]
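Note that this mask only catches slots whose start or end falls inside the query window; a slot that completely contains the window would be missed. If that case can occur in your data, the standard interval-overlap test covers it (a sketch on the same frame):
# a slot overlaps the window iff it starts before the window ends
# and ends after the window starts
mask = (df['start_time'] <= end) & (df['end_time'] >= start)
L = df.loc[mask, 'Slot_no'].tolist()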
Hello, I have a dataframe df containing data on different trips from an origin X to a destination Y with starting time T. I want to count trips between X and Y within a certain time window, let's say 15 min. So,
df:
X Y T
1 2 2015-12-30 22:30:00.0
1 2 2015-12-30 22:35:00.0
1 2 2015-12-30 22:40:00.0
1 2 2015-12-30 23:40:00.0
3 5 2015-11-30 13:40:00.0
3 5 2015-11-30 13:44:00.0
3 5 2015-11-30 19:54:00.0
I want
dfO:
X Y count
1 2 3
3 5 2
In order to count all the trips from X to Y I did:
tmp = df.groupby(["X", "Y"]).size()
How can I also take into consideration the fact that I want to count only trips that occur within a certain time interval dt?
Perhaps you are looking for pd.Grouper (the old pd.TimeGrouper, which has since been deprecated and removed). It allows you to group rows in a DataFrame by intervals of time, provided that the DataFrame has a DatetimeIndex. (Note that MaxU's solution shows how to group by time intervals without using a DatetimeIndex.)
import pandas as pd
df = pd.DataFrame({'T': ['2015-12-30 22:30:00.0',
'2015-12-30 22:35:00.0',
'2015-12-30 22:40:00.0',
'2015-12-30 23:40:00.0',
'2015-11-30 13:40:00.0',
'2015-11-30 13:44:00.0',
'2015-11-30 19:54:00.0'],
'X': [1, 1, 1, 1, 3, 3, 3],
'Y': [2, 2, 2, 2, 5, 5, 5]})
df['T'] = pd.to_datetime(df['T'])
df = df.set_index(['T'])
result = df.groupby([pd.Grouper(freq='15Min'), 'X', 'Y']).size()
print(result)
yields
T X Y
2015-11-30 13:30:00 3 5 2
2015-11-30 19:45:00 3 5 1
2015-12-30 22:30:00 1 2 3
2015-12-30 23:30:00 1 2 1
dtype: int64
This contains the information that you want
T X Y
2015-11-30 13:30:00 3 5 2
2015-12-30 22:30:00 1 2 3
and more. It is unclear on what basis you wish to exclude the other rows. If you
explain the criterion, we should be able to produce the desired DataFrame exactly.
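For instance, if the criterion is the busiest 15-minute window for each (X, Y) pair, the desired dfO can be derived from result (a sketch of that one possible reading):
# take, per (X, Y), the largest count across all 15-minute windows
dfO = result.groupby(level=['X', 'Y']).max().reset_index(name='count')
print(dfO)
which gives
   X  Y  count
0  1  2      3
1  3  5      2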
If I understood it correctly:
In [34]: df.groupby([pd.Grouper(key='T', freq='15min'),'X','Y']).size()
Out[34]:
T X Y
2015-11-30 13:30:00 3 5 2
2015-11-30 19:45:00 3 5 1
2015-12-30 22:30:00 1 2 3
2015-12-30 23:30:00 1 2 1
dtype: int64
I'm trying to generate the Last_Payment_Date field in my pandas dataframe, and would need to find the closest Payment_Date before the given Order_Date for each customer (i.e. groupby).
Payment_Date will always occur after Order_Date, but the gap between them varies, which makes it difficult to find the nearest date using sorting and shift.
Masking seems like a possible way but I've not been able to figure a way on how to use it.
Appreciate all the help I could get!
Cust_No Order_Date Payment_Date Last_Payment_Date
A 5/8/2014 6/8/2014 NaT
B 6/8/2014 1/5/2015 NaT
B 7/8/2014 7/8/2014 NaT
A 8/8/2014 1/5/2015 6/8/2014
A 9/8/2014 10/8/2014 6/8/2014
A 10/11/2014 12/11/2014 10/8/2014
B 11/12/2014 1/1/2015 7/8/2014
B 1/2/2015 2/2/2015 1/1/2015
A 2/5/2015 5/5/2015 1/5/2015
B 3/5/2015 4/5/2015 2/2/2015
Series.searchsorted largely does what you want -- it
can be used to find where the Order_Dates fit inside Payment_Dates. In
particular, it returns the ordinal indices corresponding to where each
Order_Date would need to be inserted in order to keep the Payment_Dates
sorted. For example, suppose
In [266]: df['Payment_Date']
Out[266]:
0 2014-06-08
2 2014-07-08
4 2014-10-08
5 2014-12-11
6 2015-01-01
1 2015-01-05
3 2015-01-05
7 2015-02-02
9 2015-04-05
8 2015-05-05
Name: Payment_Date, dtype: datetime64[ns]
In [267]: df['Order_Date']
Out[267]:
0 2014-05-08
2 2014-07-08
4 2014-09-08
5 2014-10-11
6 2014-11-12
1 2014-06-08
3 2014-08-08
7 2015-01-02
9 2015-03-05
8 2015-02-05
Name: Order_Date, dtype: datetime64[ns]
then searchsorted returns
In [268]: df['Payment_Date'].searchsorted(df['Order_Date'])
Out[268]: array([0, 1, 2, 3, 3, 0, 2, 5, 8, 8])
The first value, 0, for example, indicates that the Order_Date, 2014-05-08,
would have to be inserted at ordinal index 0 (before the Payment_Date
2014-06-08) to keep the Payment_Dates in sorted order. The second value, 1,
indicates that the Order_Date, 2014-07-08, would have to be inserted at
ordinal index 1 (after the Payment_Date 2014-06-08 and before 2014-07-08)
to keep the Payment_Dates in sorted order. And so on for the other indices.
Now, of course, there are some complications:
The Payment_Dates need to be in sorted order for searchsorted to return a
meaningful result:
df = df.sort_values(by=['Payment_Date'])
We need to group by the Cust_No
grouped = df.groupby('Cust_No')
We want the index of the Payment_Date which comes before the
Order_Date. Thus, we really need to decrease the index by one:
idx = grp['Payment_Date'].searchsorted(grp['Order_Date'])
result = grp['Payment_Date'].iloc[idx-1]
So that grp['Payment_Date'].iloc[idx-1] would grab the prior Payment_Date.
When searchsorted returns 0, the Order_Date is less than all
Payment_Dates. We want a NaT in this case.
result[idx == 0] = pd.NaT
So putting it all together,
import pandas as pd
NaT = pd.NaT
T = pd.Timestamp
df = pd.DataFrame({
'Cust_No': ['A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'A', 'B'],
'expected': [
NaT, NaT, NaT, T('2014-06-08'), T('2014-06-08'), T('2014-10-08'),
T('2014-07-08'), T('2015-01-01'), T('2015-01-05'), T('2015-02-02')],
'Order_Date': [
T('2014-05-08'), T('2014-06-08'), T('2014-07-08'), T('2014-08-08'),
T('2014-09-08'), T('2014-10-11'), T('2014-11-12'), T('2015-01-02'),
T('2015-02-05'), T('2015-03-05')],
'Payment_Date': [
T('2014-06-08'), T('2015-01-05'), T('2014-07-08'), T('2015-01-05'),
T('2014-10-08'), T('2014-12-11'), T('2015-01-01'), T('2015-02-02'),
T('2015-05-05'), T('2015-04-05')]})
def last_payment_date(s, df):
grp = df.loc[s.index]
idx = grp['Payment_Date'].searchsorted(grp['Order_Date'])
result = grp['Payment_Date'].iloc[idx-1]
result[idx == 0] = pd.NaT
return result
df = df.sort_values(by=['Payment_Date'])
grouped = df.groupby('Cust_No')
df['Last_Payment_Date'] = grouped['Payment_Date'].transform(last_payment_date, df)
print(df)
yields
Cust_No Order_Date Payment_Date expected Last_Payment_Date
0 A 2014-05-08 2014-06-08 NaT NaT
2 B 2014-07-08 2014-07-08 NaT NaT
4 A 2014-09-08 2014-10-08 2014-06-08 2014-06-08
5 A 2014-10-11 2014-12-11 2014-10-08 2014-10-08
6 B 2014-11-12 2015-01-01 2014-07-08 2014-07-08
1 B 2014-06-08 2015-01-05 NaT NaT
3 A 2014-08-08 2015-01-05 2014-06-08 2014-06-08
7 B 2015-01-02 2015-02-02 2015-01-01 2015-01-01
9 B 2015-03-05 2015-04-05 2015-02-02 2015-02-02
8 A 2015-02-05 2015-05-05 2015-01-05 2015-01-05
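Since the frame was sorted by Payment_Date along the way, a final sort on the index restores the original row order if that matters (a small follow-up sketch):
df = df.sort_index()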