Hello, I have a dataframe df containing data on different trips from an origin X to a destination Y with starting time T. I want to count trips between X and Y within a certain time window, let's say 15 minutes. So,
df:
X Y T
1 2 2015-12-30 22:30:00.0
1 2 2015-12-30 22:35:00.0
1 2 2015-12-30 22:40:00.0
1 2 2015-12-30 23:40:00.0
3 5 2015-11-30 13:40:00.0
3 5 2015-11-30 13:44:00.0
3 5 2015-11-30 19:54:00.0
I want
dfO:
X Y count
1 2 3
3 5 2
In order to count all the trips from X to Y I did:
tmp = df.groupby(["X", "Y"]).size()
How can I also take into consideration the fact that I want to count only trips that fall within a certain time interval dt?
Perhaps you are looking for pd.TimeGrouper. It allows you to group rows of a DataFrame by intervals of time, provided that the DataFrame has a DatetimeIndex. (Note that MaxU's solution below shows how to group by time intervals without using a DatetimeIndex. Also note that pd.TimeGrouper has been removed in recent pandas versions; pd.Grouper with a freq argument is its replacement.)
import pandas as pd
df = pd.DataFrame({'T': ['2015-12-30 22:30:00.0',
                         '2015-12-30 22:35:00.0',
                         '2015-12-30 22:40:00.0',
                         '2015-12-30 23:40:00.0',
                         '2015-11-30 13:40:00.0',
                         '2015-11-30 13:44:00.0',
                         '2015-11-30 19:54:00.0'],
                   'X': [1, 1, 1, 1, 3, 3, 3],
                   'Y': [2, 2, 2, 2, 5, 5, 5]})
df['T'] = pd.to_datetime(df['T'])
df = df.set_index(['T'])
result = df.groupby([pd.TimeGrouper('15Min'), 'X', 'Y']).size()
print(result)
yields
T X Y
2015-11-30 13:30:00 3 5 2
2015-11-30 19:45:00 3 5 1
2015-12-30 22:30:00 1 2 3
2015-12-30 23:30:00 1 2 1
This contains the information that you want
T X Y
2015-11-30 13:30:00 3 5 2
2015-12-30 22:30:00 1 2 3
and more. It is unclear on what basis you wish to exclude the other rows. If you
explain the criterion, we should be able to produce the desired DataFrame exactly.
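For instance, if the criterion is that you only want the busiest 15-minute bin per (X, Y) pair, which would reproduce your dfO exactly, a hedged sketch building on the result Series above:

# A sketch, assuming you want the largest 15-minute count per (X, Y)
dfO = result.groupby(level=['X', 'Y']).max().reset_index(name='count')
print(dfO)
#    X  Y  count
# 0  1  2      3
# 1  3  5      2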
If I understood it correctly:
In [34]: df.groupby([pd.Grouper(key='T', freq='15min'),'X','Y'], as_index=False).size()
Out[34]:
T X Y
2015-11-30 13:30:00 3 5 2
2015-11-30 19:45:00 3 5 1
2015-12-30 22:30:00 1 2 3
2015-12-30 23:30:00 1 2 1
I have a dataframe of this type:
arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
...
What I need to do is:
for every group of rows sharing the same station, subtract every related arr_time from each related dep_time (not comparing a row with itself). For example:
for station a:
    for i in range(len(arr_time)):
        for j in range(len(dep_time)):
            if i != j:
                dep_time[j] - arr_time[i]
Result, for station a, must be:
result
-00:20:00
00:25:00
and so on, for all stations in station.
I need to write this with pandas, due to the large amount of data. I will be very thankful to whoever can help me!
Here is one way. I used pd.merge to join every station 'a' row to every other station 'a' row (and likewise for each station). Then I filtered out rows comparing a record to itself, and performed the time arithmetic.
from io import StringIO
import pandas as pd
data = ''' arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
'''
df = pd.read_csv(StringIO(data), sep=r'\s+')
# create a within-station row identifier (1, 2, 3, ... per station)
df['id'] = df.reset_index().groupby('station')['index'].rank(method='first').astype(int)
# SQL-style self-join: pair every row with every other row of the same station
t = pd.merge(left=df, right=df, how='inner', on='station', suffixes=('_l', '_r'))
# don't compare station to itself
t = t[ t['id_l'] != t['id_r'] ]
# compute elapsed time (as timedelta object)
t['elapsed'] = pd.to_timedelta(t['dep_time_l']) - pd.to_timedelta(t['arr_time_r'])
# convert elapsed time to minutes (may not be necessary)
t['elapsed'] = t['elapsed'] / pd.Timedelta(minutes=1)
# create display
t = (t[['station', 'elapsed', 'id_l', 'id_r']]
     .sort_values(['station', 'id_l', 'id_r']))
print(t)
station elapsed id_l id_r
1 a 25.0 1 2
2 a -20.0 1 3
3 a -20.0 2 1
5 a -40.0 2 3
6 a 25.0 3 1
7 a 50.0 3 2
10 b -5.0 1 2
11 b 17.0 2 1
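Note that the self-join grows quadratically with the size of each station group. If that becomes a problem, a possible alternative (my own sketch, not part of the answer above; it reuses df from the snippet above) is to broadcast within each group with NumPy:

import numpy as np

# Pairwise dep_time - arr_time per station via broadcasting.
# Assumes each station's group is small enough for an n x n matrix.
for station, grp in df.groupby('station'):
    dep = pd.to_timedelta(grp['dep_time']).to_numpy()
    arr = pd.to_timedelta(grp['arr_time']).to_numpy()
    diff = dep[:, None] - arr[None, :]      # all pairwise dep - arr
    mask = ~np.eye(len(grp), dtype=bool)    # drop self-comparisons
    print(station, diff[mask] / np.timedelta64(1, 'm'))  # in minutes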
I have 100,000 rows of data in the following format:
import pandas as pd
data = {'ID': [1, 1, 3, 3, 4, 3, 4, 4, 4],
        'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:01',
                      '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06',
                      '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}
df = pd.DataFrame(data)
ID timestamp
0 1 2014-12-23 16:53:00
1 1 2014-12-23 17:00:00
2 3 2014-12-23 17:01:00
3 3 2014-12-23 17:02:00
4 4 2014-12-23 17:00:00
5 3 2014-12-23 17:06:00
6 4 2014-12-23 17:15:00
7 4 2014-12-23 17:16:00
8 4 2014-12-23 17:20:00
The ID represents a user and the timestamp is when that user visited a website. I want to get information about sessions using pandas, where each session on this site is a maximum of 15 mins long. A new session starts once the user has been logged on for 15 mins. For the above sample data, the desired result would be:
ID session_start session_duration
0 1 12/23/14 16:53 7 min
1 3 12/23/14 17:02 4 min
2 4 12/23/14 17:00 15 min
3 4 12/23/14 17:16 4 min
Let me know if there is information I should add to clarify. I can't seem to get a working solution. Any help is appreciated!
EDIT: In response to queries below, I noticed a mistake in my example. Sorry guys it was very late at night!
The problem that I am struggling with has mostly to do with user 4. They are still logged in after 15 minutes, and I want to capture in my data that a new session has started.
The reason that my problem is slightly different from this Groupby every 2 hours data of a dataframe is because I want to do this based on individual users.
Not pretty, but here's a solution. The basic idea is to use groupby with diff to calculate differences between timestamps for each ID, but I couldn't find a nice way to diff only every 2 rows. So this approach uses diff on every row, then selects the diff result from every other row within each ID.
Note that I'm assuming that the dataframe is properly ordered. Also note that your sample data had an extra entry for ID==3 (the 17:01 row) that I removed.
import pandas as pd
data = {'ID': [1, 1, 3, 4, 3, 4, 4, 4],
        'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:02',
                      '12/23/14 17:00', '12/23/14 17:06', '12/23/14 17:15',
                      '12/23/14 17:16', '12/23/14 17:20']}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
# groupby to get difference between each timestamp
df['diffs'] = df.groupby('ID')['timestamp'].diff()
# count every time ID appears
df['counts'] = df.groupby('ID')['ID'].cumcount()+1
print("after diffs and counts:")
print(df)
# select entries for every 2nd occurence (where df['counts'] is even)
new_df = df[df['counts'] % 2 == 0][['ID','timestamp','diffs']]
# timestamp here will be the session endtime so subtract the
# diffs to get session start time
new_df['timestamp'] = new_df['timestamp'] - new_df['diffs']
# and a final rename
new_df = new_df.rename(columns={'timestamp':'session_start','diffs':'session_duration'})
print("\nfinal df:")
print(new_df)
Will print out
after diffs and counts:
ID timestamp diffs counts
0 1 2014-12-23 16:53:00 NaT 1
1 1 2014-12-23 17:00:00 0 days 00:07:00 2
2 3 2014-12-23 17:02:00 NaT 1
3 4 2014-12-23 17:00:00 NaT 1
4 3 2014-12-23 17:06:00 0 days 00:04:00 2
5 4 2014-12-23 17:15:00 0 days 00:15:00 2
6 4 2014-12-23 17:16:00 0 days 00:01:00 3
7 4 2014-12-23 17:20:00 0 days 00:04:00 4
final df:
ID session_start session_duration
1 1 2014-12-23 16:53:00 0 days 00:07:00
4 3 2014-12-23 17:02:00 0 days 00:04:00
5 4 2014-12-23 17:00:00 0 days 00:15:00
7 4 2014-12-23 17:16:00 0 days 00:04:00
Then to get session_duration column as number of minutes instead of a timedelta object, you can do:
import numpy as np
new_df['session_duration'] = new_df['session_duration'] / np.timedelta64(1, 'm')
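Also note this approach assumes every session consists of exactly two logged timestamps. If a user can have more than two hits per session, here is a more general sketch (my own, with a hypothetical split_sessions helper; the named aggregation needs pandas >= 0.25) that caps each session at 15 minutes from its first hit, matching the desired output above:

import pandas as pd

def split_sessions(ts, cap=pd.Timedelta(minutes=15)):
    """Label each hit with a session number; a hit more than `cap`
    after the current session's first hit starts a new session."""
    labels, session, start = [], 0, None
    for t in ts:
        if start is None or t - start > cap:
            session += 1
            start = t
        labels.append(session)
    return pd.Series(labels, index=ts.index)

df = df.sort_values(['ID', 'timestamp'])
df['session'] = df.groupby('ID')['timestamp'].transform(split_sessions)

sessions = (df.groupby(['ID', 'session'])['timestamp']
              .agg(session_start='min', session_end='max')
              .reset_index())
sessions['session_duration'] = sessions['session_end'] - sessions['session_start']
print(sessions[['ID', 'session_start', 'session_duration']])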
This seems like a basic question. I want to use the datetime index of a pandas dataframe as the x values for a machine learning algorithm, for univariate time series comparisons.
I tried to isolate the index and then convert it to a number, but I get an error.
df=data["Close"]
idx=df.index
df.index.get_loc(idx)
Date
2014-03-31 0.9260
2014-04-01 0.9269
2014-04-02 0.9239
2014-04-03 0.9247
2014-04-04 0.9233
This is what I get when I add your code:
2019-04-24 00:00:00 0.7097
2019-04-25 00:00:00 0.7015
2019-04-26 00:00:00 0.7018
2019-04-29 00:00:00 0.7044
x (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
Name: Close, Length: 1325, dtype: object
I need a column running from 1 to the number of values in my dataframe.
First select column Close with double brackets [[]] to get a one-column DataFrame, so it is possible to add a new column:
df = data[["Close"]]
df["x"] = np.arange(1, len(df) + 1)
print (df)
Close x
Date
2014-03-31 0.9260 1
2014-04-01 0.9269 2
2014-04-02 0.9239 3
2014-04-03 0.9247 4
2014-04-04 0.9233 5
You can add a column with the values range(1, len(df) + 1) like so:
df = pd.DataFrame({"y": [5, 4, 3, 2, 1]}, index=pd.date_range(start="2019-08-01", periods=5))
In [3]: df
Out[3]:
y
2019-08-01 5
2019-08-02 4
2019-08-03 3
2019-08-04 2
2019-08-05 1
df["x"] = range(1, len(df) + 1)
In [7]: df
Out[7]:
y x
2019-08-01 5 1
2019-08-02 4 2
2019-08-03 3 3
2019-08-04 2 4
2019-08-05 1 5
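Since the stated goal is to feed the dates to a machine learning model, an aside (my own, with hypothetical column names x_ordinal and x_days): the DatetimeIndex itself can be converted to numbers rather than paired with a synthetic counter:

import pandas as pd

df = pd.DataFrame({"y": [5, 4, 3, 2, 1]},
                  index=pd.date_range(start="2019-08-01", periods=5))

# proleptic-Gregorian ordinal of each date (days since year 1)
df["x_ordinal"] = df.index.map(pd.Timestamp.toordinal)
# or days elapsed since the first observation
df["x_days"] = (df.index - df.index[0]).days
print(df)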
One query I often do in SQL within a relational database is to join a table back to itself and summarize each row based on records for the same id either backwards or forward in time.
For example, assume table1 has columns 'ID', 'Date', 'Var1'.
In SQL I could sum var1 for the past 3 months for each record like this:
Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date,b.date) <0
and months_between(a.date,b.date) > -3
Is there any way to do this in Pandas?
It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:
Date ID Var1
0 2015-01-01 1 0
1 2015-02-01 1 1
2 2015-03-01 1 2
3 2015-04-01 1 3
4 2015-05-01 1 4
5 2015-01-01 2 5
6 2015-02-01 2 6
7 2015-03-01 2 7
8 2015-04-01 2 8
9 2015-05-01 2 9
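(The construction of this frame was not shown originally; for reproducibility, one way to build it is:)

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                            '2015-04-01', '2015-05-01'] * 2),
    'ID': [1] * 5 + [2] * 5,
    'Var1': range(10),
})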
You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:
def lookbacker(x):
    """Sum over past 70 days"""
    return x.rolling('70D').sum().astype(int)
Then apply it on a GroupBy object and extract values for assignment:
df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values
print(df)
Date ID Var1 Lookback_Sum
0 2015-01-01 1 0 0
1 2015-02-01 1 1 1
2 2015-03-01 1 2 3
3 2015-04-01 1 3 6
4 2015-05-01 1 4 9
5 2015-01-01 2 5 5
6 2015-02-01 2 6 11
7 2015-03-01 2 7 18
8 2015-04-01 2 8 21
9 2015-05-01 2 9 24
It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.
Another point worth mentioning is that you can use GroupBy + rolling directly, and possibly more efficiently, by bypassing apply; but this requires ensuring your index is monotonic. For example, via sort_index:
df['Lookback_Sum'] = df.set_index('Date').sort_index()\
.groupby('ID')['Var1'].rolling('70D').sum()\
.astype(int).values
I don't think pandas.DataFrame.rolling() supports rolling-window aggregation by some number of months; currently, you must specify a fixed number of days, or other fixed-length time period.
But as #jpp mentioned, you can use python loops to perform rolling aggregation over a window size specified in calendar months, where the number of days in each window will vary, depending on what part of the calendar you're rolling over.
The following approach builds on this SO answer as well as #jpp's:
import pandas as pd

# Build some example data:
# 3 unique IDs, each with 365 samples, one sample per day throughout 2015
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31', freq='D'),
                   'Var1': list(range(365))})
df = pd.concat([df] * 3)
df['ID'] = [1]*365 + [2]*365 + [3]*365
df.head()
Date Var1 ID
0 2015-01-01 0 1
1 2015-01-02 1 1
2 2015-01-03 2 1
3 2015-01-04 3 1
4 2015-01-05 4 1
# Define a lookback function that mimics rolling aggregation,
# but uses DateOffset() slicing, rather than a window of fixed size.
# Use .count() here as a sanity check; you will need .sum()
def lookbacker(ser):
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].count()
                      for d in ser.index])
# By default, groupby.agg output is sorted by key. So make sure to
# sort df by (ID, Date) before inserting the flattened groupby result
# into a new column
df.sort_values(['ID', 'Date'], inplace=True)
df.set_index('Date', inplace=True)
df['window_size'] = df.groupby('ID')['Var1'].apply(lookbacker).values
# Manually check the resulting window sizes
df.head()
Var1 ID window_size
Date
2015-01-01 0 1 1
2015-01-02 1 1 2
2015-01-03 2 1 3
2015-01-04 3 1 4
2015-01-05 4 1 5
df.tail()
Var1 ID window_size
Date
2015-12-27 360 3 92
2015-12-28 361 3 92
2015-12-29 362 3 92
2015-12-30 363 3 92
2015-12-31 364 3 93
df[df.ID == 1].loc['2015-05-25':'2015-06-05']
Var1 ID window_size
Date
2015-05-25 144 1 90
2015-05-26 145 1 90
2015-05-27 146 1 90
2015-05-28 147 1 90
2015-05-29 148 1 91
2015-05-30 149 1 92
2015-05-31 150 1 93
2015-06-01 151 1 93
2015-06-02 152 1 93
2015-06-03 153 1 93
2015-06-04 154 1 93
2015-06-05 155 1 93
The last column gives the lookback window size in days, looking back from that date, including both the start and end dates.
Looking "3 months" before 2016-05-31 would land you at 2015-02-31, but February has only 28 days in 2015. As you can see in the sequence 90, 91, 92, 93 in the above sanity check, This DateOffset approach maps the last four days in May to the last day in February:
pd.to_datetime('2015-05-31') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-30') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-29') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-28') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
I don't know if this matches SQL's behaviour, but in any case, you'll want to test this and decide if this makes sense in your case.
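For completeness, a sketch of the final step the sanity check alludes to ("you will need .sum()"): swap .count() for .sum() in the lookbacker, keeping everything else the same (lookback_sum is my name, not from the answer):

# Sum Var1 over the inclusive 3-calendar-month window, per ID
def lookback_sum(ser):
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].sum()
                      for d in ser.index])

df['lookback_sum'] = df.groupby('ID')['Var1'].apply(lookback_sum).values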
You could use a lambda with apply to achieve it:
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
and we need to write an equivalent method for SQL's months_between.
The complete example is:
from datetime import datetime
import datetime as dt
import pandas as pd
def months_between(date1, date2):
    if date1.day == date2.day:
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    # if both dates are the last day of their respective months
    if date1.month != (date1 + dt.timedelta(days=1)).month:
        if date2.month != (date2 + dt.timedelta(days=1)).month:
            return (date1.year - date2.year) * 12 + date1.month - date2.month
    return (date1 - date2).days / 31

def findSum(cRow):
    table1['month_diff'] = table1['Date'].apply(months_between, date2=cRow['Date'])
    filtered_table = table1[(table1["month_diff"] < 0) & (table1["month_diff"] > -3) & (table1['ID'] == cRow['ID'])]
    if filtered_table.empty:
        return 0
    return filtered_table['Var1'].sum()
table1 = pd.DataFrame(columns = ['ID', 'Date', 'Var1'])
table1.loc[len(table1)] = [1, datetime.strptime('2015-01-01','%Y-%m-%d'), 0]
table1.loc[len(table1)] = [1, datetime.strptime('2015-02-01','%Y-%m-%d'), 1]
table1.loc[len(table1)] = [1, datetime.strptime('2015-03-01','%Y-%m-%d'), 2]
table1.loc[len(table1)] = [1, datetime.strptime('2015-04-01','%Y-%m-%d'), 3]
table1.loc[len(table1)] = [1, datetime.strptime('2015-05-01','%Y-%m-%d'), 4]
table1.loc[len(table1)] = [2, datetime.strptime('2015-01-01','%Y-%m-%d'), 5]
table1.loc[len(table1)] = [2, datetime.strptime('2015-02-01','%Y-%m-%d'), 6]
table1.loc[len(table1)] = [2, datetime.strptime('2015-03-01','%Y-%m-%d'), 7]
table1.loc[len(table1)] = [2, datetime.strptime('2015-04-01','%Y-%m-%d'), 8]
table1.loc[len(table1)] = [2, datetime.strptime('2015-05-01','%Y-%m-%d'), 9]
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
table1.drop(columns=['month_diff'], inplace=True)
print(table1)
I'm trying to generate the Last_Payment_Date field in my pandas dataframe, and would need to find the closest Payment_Date before the given Order_Date for each customer (i.e. groupby).
Payment_Date will always occur after Order_Date, but may take different periods of time, which is difficult to use sorting and shift to find the nearest date.
Masking seems like a possible way but I've not been able to figure a way on how to use it.
Appreciate all the help I could get!
Cust_No Order_Date Payment_Date Last_Payment_Date
A 5/8/2014 6/8/2014 NaT
B 6/8/2014 1/5/2015 NaT
B 7/8/2014 7/8/2014 NaT
A 8/8/2014 1/5/2015 6/8/2014
A 9/8/2014 10/8/2014 6/8/2014
A 10/11/2014 12/11/2014 10/8/2014
B 11/12/2014 1/1/2015 7/8/2014
B 1/2/2015 2/2/2015 1/1/2015
A 2/5/2015 5/5/2015 1/5/2015
B 3/5/2015 4/5/2015 2/2/2015
Series.searchsorted largely does what you want -- it
can be used to find where the Order_Dates fit inside Payment_Dates. In
particular, it returns the ordinal indices corresponding to where each
Order_Date would need to be inserted in order to keep the Payment_Dates
sorted. For example, suppose
In [266]: df['Payment_Date']
Out[266]:
0 2014-06-08
2 2014-07-08
4 2014-10-08
5 2014-12-11
6 2015-01-01
1 2015-01-05
3 2015-01-05
7 2015-02-02
9 2015-04-05
8 2015-05-05
Name: Payment_Date, dtype: datetime64[ns]
In [267]: df['Order_Date']
Out[267]:
0 2014-05-08
2 2014-07-08
4 2014-09-08
5 2014-10-11
6 2014-11-12
1 2014-06-08
3 2014-08-08
7 2015-01-02
9 2015-03-05
8 2015-02-05
Name: Order_Date, dtype: datetime64[ns]
then searchsorted returns
In [268]: df['Payment_Date'].searchsorted(df['Order_Date'])
Out[268]: array([0, 1, 2, 3, 3, 0, 2, 5, 8, 8])
The first value, 0, for example, indicates that the Order_Date, 2014-05-08,
would have to be inserted at ordinal index 0 (before the Payment_Date
2014-06-08) to keep the Payment_Dates in sorted order. The second value, 1,
indicates that the Order_Date, 2014-07-08, would have to be inserted at
ordinal index 1 (after the Payment_Date 2014-06-08 and before 2014-07-08)
to keep the Payment_Dates in sorted order. And so on for the other indices.
Now, of course, there are some complications:
The Payment_Dates need to be in sorted order for searchsorted to return a
meaningful result:
df = df.sort_values(by=['Payment_Date'])
We need to group by the Cust_No
grouped = df.groupby('Cust_No')
We want the index of the Payment_Date which comes before the
Order_Date. Thus, we really need to decrease the index by one:
idx = grp['Payment_Date'].searchsorted(grp['Order_Date'])
result = grp['Payment_Date'].iloc[idx-1]
So that grp['Payment_Date'].iloc[idx-1] would grab the prior Payment_Date.
When searchsorted returns 0, the Order_Date is less than all
Payment_Dates. We want a NaT in this case.
result[idx == 0] = pd.NaT
So putting it all together,
import pandas as pd

NaT = pd.NaT
T = pd.Timestamp

df = pd.DataFrame({
    'Cust_No': ['A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'A', 'B'],
    'expected': [
        NaT, NaT, NaT, T('2014-06-08'), T('2014-06-08'), T('2014-10-08'),
        T('2014-07-08'), T('2015-01-01'), T('2015-01-05'), T('2015-02-02')],
    'Order_Date': [
        T('2014-05-08'), T('2014-06-08'), T('2014-07-08'), T('2014-08-08'),
        T('2014-09-08'), T('2014-10-11'), T('2014-11-12'), T('2015-01-02'),
        T('2015-02-05'), T('2015-03-05')],
    'Payment_Date': [
        T('2014-06-08'), T('2015-01-05'), T('2014-07-08'), T('2015-01-05'),
        T('2014-10-08'), T('2014-12-11'), T('2015-01-01'), T('2015-02-02'),
        T('2015-05-05'), T('2015-04-05')]})

def last_payment_date(s, df):
    # restrict to this customer's rows (already in Payment_Date order)
    grp = df.loc[s.index]
    # where each Order_Date would be inserted among the sorted Payment_Dates
    idx = grp['Payment_Date'].searchsorted(grp['Order_Date'])
    # the Payment_Date just before that insertion point
    result = grp['Payment_Date'].iloc[idx-1]
    # idx == 0 means there is no earlier Payment_Date for this customer
    result[idx == 0] = pd.NaT
    return result

df = df.sort_values(by=['Payment_Date'])
grouped = df.groupby('Cust_No')
df['Last_Payment_Date'] = grouped['Payment_Date'].transform(last_payment_date, df)
print(df)
yields
Cust_No Order_Date Payment_Date expected Last_Payment_Date
0 A 2014-05-08 2014-06-08 NaT NaT
2 B 2014-07-08 2014-07-08 NaT NaT
4 A 2014-09-08 2014-10-08 2014-06-08 2014-06-08
5 A 2014-10-11 2014-12-11 2014-10-08 2014-10-08
6 B 2014-11-12 2015-01-01 2014-07-08 2014-07-08
1 B 2014-06-08 2015-01-05 NaT NaT
3 A 2014-08-08 2015-01-05 2014-06-08 2014-06-08
7 B 2015-01-02 2015-02-02 2015-01-01 2015-01-01
9 B 2015-03-05 2015-04-05 2015-02-02 2015-02-02
8 A 2015-02-05 2015-05-05 2015-01-05 2015-01-05
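As a closing aside (not part of the original answer): newer pandas versions ship pd.merge_asof, which performs exactly this "latest prior value within a group" lookup. A hedged sketch, assuming df is the frame before the Last_Payment_Date column was added:

# For each order, take the same customer's latest Payment_Date
# strictly before the Order_Date (allow_exact_matches=False).
orders = df.sort_values('Order_Date')
payments = df[['Cust_No', 'Payment_Date']].sort_values('Payment_Date')

merged = pd.merge_asof(orders, payments,
                       left_on='Order_Date', right_on='Payment_Date',
                       by='Cust_No', suffixes=('', '_prior'),
                       allow_exact_matches=False)
merged = merged.rename(columns={'Payment_Date_prior': 'Last_Payment_Date'})
print(merged)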