Pandas add multiple rows with IF condition - python

I have the following dataframe consisting of city bicycle trips. However, I have some problems handling trips that span more than one hour (I want to use YYYYmmDDhh as a composite key in my data model). So what I want to do is create a column "keyhour" that I can connect with other tables. This would be YYYYmmDDhh based on started_at IF start_hour == end_hour. However, if end_hour is greater than start_hour, I want to insert that many rows with the same TourID into my dataframe, in order to indicate that the trip lasted several hours.
started_at ended_at duration start_station_id start_station_name start_station_description ... end_station_description end_station_latitude end_station_longitude TourID start_hour end_hour
0 2020-05-01 03:03:14.941000+00:00 2020-05-01 03:03:14.941000+00:00 635 484 Karenlyst allé ved Skabos vei ... langs Drammensveien 59.914145 10.715505 0 3 3
1 2020-05-01 03:05:48.529000+00:00 2020-05-01 03:05:48.529000+00:00 141 455 Sofienbergparken sør langs Sofienberggata ... ved Sars gate 59.921206 10.769989 1 3 3
2 2020-05-01 03:13:33.156000+00:00 2020-05-01 03:13:33.156000+00:00 330 550 Thereses gate ved Bislett trikkestopp ... ved Kristian IVs gate 59.914767 10.740971 2 3 3
3 2020-05-01 03:14:14.549000+00:00 2020-05-01 03:14:14.549000+00:00 479 597 Fredensborg ved rundkjøringen ... ved Oslo City 59.912334 10.752292 3 3 3
4 2020-05-01 03:20:12.355000+00:00 2020-05-01 03:20:12.355000+00:00 629 617 Bjerregaardsgate Øst ved Uelands gate ... langs Oslo gate 59.908255 10.767800 4 3 3
So, for example, if started_at = 2020-05-01 03:03:14.941000+00:00, ended_at = 2020-05-01 06:03:14.941000+00:00, start_hour = 3, end_hour = 6 and TourID = 1, I want to have rows with:
keyhour    ; TourID
2020050103 ; 1
2020050104 ; 1
2020050105 ; 1
2020050106 ; 1
And all other values (duration etc.) related to this trip ID.
However, I really cannot find any way to do it in pandas. Is it possible, or do I have to use pure Python to rewrite my source CSV?
Thank you for any advice!

Assuming your dataframe is df and that you have already run import pandas as pd:
# convert to datetime and round down to the hour
df['started_at'] = pd.to_datetime(df['started_at']).dt.floor(freq='H')
df['ended_at'] = pd.to_datetime(df['ended_at']).dt.floor(freq='H')
# this creates a list of hourly datetimes from started_at to ended_at
df['keyhour'] = df.apply(lambda x: list(pd.date_range(x['started_at'], x['ended_at'], freq="1H")), axis='columns')
# this expands each element of the keyhour list into its own row
df = df.explode('keyhour')
# converts it to a string, in the format you specified
df['keyhour'] = df['keyhour'].dt.strftime('%Y%m%d%H')
df
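For reference, here is a minimal sketch of the same steps applied to the single example trip from the question (TourID 1, started 03:03, ended 06:03); the column names come from the question, everything else is illustrative.
import pandas as pd

toy = pd.DataFrame({'TourID': [1],
                    'started_at': ['2020-05-01 03:03:14.941000+00:00'],
                    'ended_at': ['2020-05-01 06:03:14.941000+00:00']})
toy['started_at'] = pd.to_datetime(toy['started_at']).dt.floor(freq='H')
toy['ended_at'] = pd.to_datetime(toy['ended_at']).dt.floor(freq='H')
# one hourly timestamp for every hour the trip touches
toy['keyhour'] = toy.apply(lambda x: list(pd.date_range(x['started_at'], x['ended_at'], freq='1H')), axis='columns')
toy = toy.explode('keyhour')
toy['keyhour'] = toy['keyhour'].dt.strftime('%Y%m%d%H')
print(toy[['keyhour', 'TourID']])
# prints keyhour values 2020050103, 2020050104, 2020050105, 2020050106, one row per hour, all with TourID 1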

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from user's first transaction

I have a dataset like this:
Customer ID    Date          Profit
1              4/13/2018     10.00
1              4/26/2018     13.27
1              10/23/2018    15.00
2              1/1/2017      7.39
2              7/5/2017      9.99
2              7/7/2017      10.01
3              5/4/2019      30.30
I'd like to group by and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID    Date          Profit
1              4/13/2018     23.27
1              10/13/2018    15.00
2              1/1/2017      7.39
2              7/1/2017      20.00
3              5/4/2019      30.30
The closest I've been able to get on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start summing on a user's first transaction day.
If changing the dates is not possible (e.g. showing 7/1/2017 rather than 7/5/2017 for customer 2), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month, until you find a more exact solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
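If you do want the 6-month windows anchored on each customer's first transaction rather than on calendar month starts, here is one possible sketch (not tested beyond the sample data, and it uses an approximate 182-day half year instead of exact calendar months):
# assumes df['Date'] has already been converted to datetime, as in the answer above
first = df.groupby('Customer ID')['Date'].transform('min')   # each customer's first transaction
period = (df['Date'] - first).dt.days // 182                 # 0, 1, 2, ... half-year buckets
start = first + pd.to_timedelta(period * 182, unit='D')      # label each bucket by its start date
out = df.groupby(['Customer ID', start])['Profit'].sum().reset_index()
print(out)
For the sample data this gives 23.27 and 15.00 for customer 1 and 7.39 and 20.00 for customer 2; the bucket labels land a few days off the dates in the desired output because of the 182-day approximation.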

How to improve function with all-to-all rows computation within a groupby object?

Say I have this simple dataframe-
import pandas as pd

dic = {'firstname': ['Steve', 'Steve', 'Steve', 'Steve', 'Steve', 'Steve'],
       'lastname': ['Johnson', 'Johnson', 'Johnson', 'Johnson', 'Johnson', 'Johnson'],
       'company': ['CHP', 'CHP', 'CHP', 'CHP', 'CHP', 'CHP'],
       'faveday': ['2020-07-13', '2020-07-20', '2020-07-16', '2020-10-14',
                   '2020-10-28', '2020-10-21'],
       'paid': [200, 300, 550, 100, 900, 650]}
df = pd.DataFrame(dic)
df['faveday'] = pd.to_datetime(df['faveday'])
print(df)
with output-
firstname lastname company faveday paid
0 Steve Johnson CHP 2020-07-13 200
1 Steve Johnson CHP 2020-07-20 300
2 Steve Johnson CHP 2020-07-16 550
3 Steve Johnson CHP 2020-10-14 100
4 Steve Johnson CHP 2020-10-28 900
5 Steve Johnson CHP 2020-10-21 650
I want to keep the rows whose faveday is within 7 days of another, but only if their paid values also sum to more than 1000.
Individually, if I wanted to apply the 7 day function, I would use-
import numpy as np

def sefd(x):
    return np.sum((np.abs(x.values - x.values[:, None]) / np.timedelta64(1, 'D')) <= 7, axis=1) >= 2

s = df.groupby(['firstname', 'lastname', 'company'])['faveday'].transform(sefd)
df['seven_days'] = s
df = df[s]
del df['seven_days']
This would keep all of the entries (all of them are within 7 days of another faveday, grouped by firstname, lastname, and company).
If I wanted to apply a function that keeps rows for the same person with the same company and a summed paid amount > 1000, I would use-
df = df[df.groupby(['lastname', 'firstname','company'])['paid'].transform(sum) > 1000]
Just a simple transform(sum) function
This would also keep all of the entries (since all are under the same name and company and sum to greater than 1000).
However, if we were to combine these two functions at the same time, one row actually would not be included.
My desired output is-
firstname lastname company faveday paid
0 Steve Johnson CHP 2020-07-13 200
1 Steve Johnson CHP 2020-07-20 300
2 Steve Johnson CHP 2020-07-16 550
4 Steve Johnson CHP 2020-10-28 900
5 Steve Johnson CHP 2020-10-21 650
Notice how index 3 is no longer valid because it's only within 7 days of index 5, but if you were to sum index 3 paid and index 5 paid, it would only be 750 (<1000).
It is also important to note that since indexes 0, 1, and 2 are all within 7 days of each other, that counts as one summed group (200 + 300 + 550 > 1000).
The logic is that I would want to first see (based on a group of firstname, lastname, and company name) whether or not a faveday is within 7 days of another. Then after confirming this, see if the paid column for these favedays sums to over 1000. If so, keep those indexes in the dataframe. Otherwise, do not.
A suggested answer given to me was-
df = df.sort_values(["firstname", "lastname", "company", "faveday"])

def date_difference_from(x, df):
    return abs((df.faveday - x).dt.days)

def grouped_dates(grouped_df):
    keep = []
    for idx, row in grouped_df.iterrows():
        within_7 = date_difference_from(row.faveday, grouped_df) <= 7
        keep.append(within_7.sum() > 1 and grouped_df[within_7].paid.sum() > 1000)
    msk = np.array(keep)
    return grouped_df[msk]

df = df.groupby(["firstname", "lastname", "company"]).apply(grouped_dates).reset_index(drop=True)
print(df)
This works perfectly for small data sets like this one, but when I apply it to a bigger dataset (10,000+ rows), some inconsistencies appear.
Is there any way to improve this code?
I found a solution that avoids looping over idx to compare whether other rows are within 7 days, but it involves unstack and reindex, so it will increase memory usage (I tried tapping into the _get_window_bounds method of rolling, but that proved above my expertise). It should be fine for the scale you request. While this solution's runtime is on par with yours on the toy df you provided, it is orders of magnitude faster on larger datasets.
Edit: allow multiple deposits on one date.
Take this data (with replace=True by default in random.choice):
import string
import numpy as np
import pandas as pd

np.random.seed(123)
n = 40
df = pd.DataFrame([[a, b, b, faveday, paid]
                   for a in string.ascii_lowercase
                   for b in string.ascii_lowercase
                   for faveday, paid in zip(
                       np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), n),
                       np.random.randint(100, 1200, n))
                   ], columns=['firstname', 'lastname', 'company', 'faveday', 'paid'])
df['faveday'] = pd.to_datetime(df['faveday'])
df = df.sort_values(["firstname", "lastname", "company", "faveday"]).reset_index(drop=True)
>>> print(df)
firstname lastname company faveday paid
0 a a a 2020-01-03 1180
1 a a a 2020-01-18 206
2 a a a 2020-02-02 490
3 a a a 2020-02-09 615
4 a a a 2020-02-17 471
... ... ... ... ... ...
27035 z z z 2020-11-22 173
27036 z z z 2020-12-22 863
27037 z z z 2020-12-23 675
27038 z z z 2020-12-26 1165
27039 z z z 2020-12-30 683
[27040 rows x 5 columns]
And the code
def get_valid(df, window_size=7, paid_gt=1000, groupbycols=['firstname', 'lastname', 'company']):
    # df_clean = df.set_index(['faveday'] + groupbycols).unstack(groupbycols)
    # # unstack names to bypass groupby
    df_clean = df.groupby(['faveday'] + groupbycols).paid.agg(['size', sum])
    df_clean.columns = ['ct', 'paid']
    df_clean = df_clean.unstack(groupbycols)
    df_clean = df_clean.reindex(pd.date_range(df_clean.index.min(),
                                              df_clean.index.max())).sort_index()
    # include all dates, to treat index as integer
    window = df_clean.fillna(0).rolling(window_size + 1).sum()
    # notice fillna to prevent false NaNs while summing
    df_clean = df_clean.paid * (  # multiply times a mask for both conditions
        (window.ct > 1) & (window.paid > paid_gt)
    ).replace(False, np.nan).bfill(limit=7)
    # replacing with np.nan so we can backfill to include all dates in window
    df_clean = df_clean.rename_axis('faveday').stack(groupbycols)\
        .reset_index(level='faveday').sort_index().reset_index()
    # reshaping to original format
    return df_clean

df1 = get_valid(df, window_size=7, paid_gt=1000,
                groupbycols=['firstname', 'lastname', 'company'])
This still runs in 1.5 seconds (vs. 143 seconds for your current code) and returns
firstname lastname company faveday 0
0 a a a 2020-02-02 490.0
1 a a a 2020-02-09 615.0
2 a a a 2020-02-17 1232.0
3 a a a 2020-03-09 630.0
4 a a a 2020-03-14 820.0
... ... ... ... ... ...
17561 z z z 2020-11-12 204.0
17562 z z z 2020-12-22 863.0
17563 z z z 2020-12-23 675.0
17564 z z z 2020-12-26 1165.0
17565 z z z 2020-12-30 683.0
[17566 rows x 5 columns]
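One possible cleanup, not part of the answer above: after the stack/reset_index the value column comes back named 0 (as in the printout), so restoring the original name is just a rename:
df1 = df1.rename(columns={0: 'paid'})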

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning Python and would like to ask for your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar week. So for the mockup data below I have rows stretched over 2 weeks (week 14, the current week, and week 13, the past week). Now I need to find a way to group rows per calendar week, recognize which year they belong to, and calculate the week sum and week average. In the example input file there are only two different IDs. However, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
My goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week, not sure if this is possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for that week, save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin parameters but no luck so far... Would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# convert the date column to datetime type
df['date'] = pd.to_datetime(df['date'])

(df.groupby(['id', df.date.dt.strftime('%Y%W')], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('activeMembers')
   .reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
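The difference most likely comes from %W week numbering, which is not the ISO week. If the numbering you are after is ISO weeks (which would give 202013/202014 for this sample), a hedged variant using dt.isocalendar() (available in pandas >= 1.1) would be:
# build a YYYYWW key from the ISO year and ISO week
iso = df.date.dt.isocalendar()
week_key = (iso.year.astype(str) + iso.week.astype(str).str.zfill(2)).rename('week')
(df.groupby(['id', week_key], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('WeeklyActiveMembers')
   .reset_index()
)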

Build dataframe with sequential timeseries

I have a dataset that contains many timestamps associated with different ships and ports.
obj_id timestamp port
0 4 2019-10-01 Houston
1 2 2019-09-01 New York
2 4 2019-07-31 Boston
3 1 2019-07-28 San Francisco
4 2 2019-10-15 Miami
5 1 2019-09-01 Honolulu
6 1 2019-08-01 Tokyo
I want to build a dataframe that contains a single record for the latest voyage by ship (obj_id), by assigning the latest timestamp/port for each obj_id as a 'destination', and the second latest timestamp/port as the 'origin'. So the final result would look something like this:
obj_id origin_time origin_port destination_time destination_port
0 4 2019-07-31 Boston 2019-10-01 Houston
1 2 2019-09-01 New York 2019-10-15 Miami
3 1 2019-07-28 Tokyo 2019-09-01 Honolulu
I've successfully filtered the latest timestamps for each obj_id with this code, but still can't figure out a way to filter the second-latest timestamp, let alone pull them both into a single row.
df.sort_values(by ='timestamp', ascending = False).drop_duplicates(['obj_id'])
Using groupby.agg with first, last:
dfg = df.sort_values('timestamp').groupby('obj_id').agg(['first', 'last']).reset_index()
dfg.columns = [f'{c1}_{c2}' for c1, c2 in dfg.columns]
obj_id_ timestamp_first timestamp_last port_first port_last
0 1 2019-07-28 2019-09-01 San Francisco Honolulu
1 2 2019-09-01 2019-10-15 New York Miami
2 4 2019-07-31 2019-10-01 Boston Houston
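A possible follow-up, not part of the answer above: renaming the generic first/last columns to the origin/destination names from the desired output (the mapping below is an assumption based on that output).
dfg = dfg.rename(columns={'obj_id_': 'obj_id',
                          'timestamp_first': 'origin_time',
                          'port_first': 'origin_port',
                          'timestamp_last': 'destination_time',
                          'port_last': 'destination_port'})
dfg = dfg[['obj_id', 'origin_time', 'origin_port', 'destination_time', 'destination_port']]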
You want to sort the trips by timestamp so we can get the most recent voyages, then group the voyages by object id and grab the first and second voyage per object, then merge.
groups = df.sort_values(by="timestamp", ascending=False).groupby("obj_id")
pd.merge(groups.nth(1), groups.nth(0),
         on="obj_id",
         suffixes=("_origin", "_dest"))
Make sure your timestamp column is the proper timestamp data type though, otherwise your sorting will be messed up.
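For instance, if the timestamps were read from a CSV as plain strings, a minimal conversion before the code above would be:
df['timestamp'] = pd.to_datetime(df['timestamp'])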

Column operations on pandas groupby object

I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of ids and the category of item they used at a particular Time. I use df.groupby('id') and then I want to see whether they used the same category or a different one, and assign True or False respectively (or NaN if that was the first item for that particular id). I also filtered the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about doing an apply of some sorts to the Category column after groupby but I am having trouble figuring out the right function.
You don't need a groupby here; you just need sort and shift.
import numpy as np

df.sort_values(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
I haven't tested this, but it should do the trick.
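An untested alternative sketch using groupby plus shift, which avoids patching the id boundaries by hand (column names as in the question):
df = df.sort_values(['id', 'Time'])
prev = df.groupby('id')['Category'].shift()  # previous category within each id
df['Transition'] = (df['Category'] != prev).where(prev.notna())  # NaN for the first row of each id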
