I have a Pandas DataFrame with monthly events sent by customers, like this:
df = pd.DataFrame(
[
('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH')
],
columns=['TimeStamp', 'Session ID', 'Event', 'Customer']
)
The Session IDs are unique, but a session can span multiple days. In addition, multiple sessions can occur on a given day.
I would like to calculate the minutes of usage for each day of the month per customer, like this:
Customer  01.01  02.01  ...  31.01
A. Inc    720    30     ...  50
B. GmbH   1      0      ...  0
I suspect that splitting TimeStamp into day and time, followed by groupby(['Customer', 'Day', 'Session ID']) and then applying some maths (via apply()), is the way to go, but so far I could not make any real progress.
You can try this.
Extract the date and the time in minutes into new columns. Then sum the time per customer and date using groupby and agg. Finally, pivot the DataFrame.
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
df['date'] = df['TimeStamp'].dt.date
# minutes since midnight for each event
df['minutes'] = df['TimeStamp'].dt.hour * 60 + df['TimeStamp'].dt.minute
new_df = df.groupby(['Customer', 'date']).agg({'minutes': 'sum'}).reset_index()
print(pd.pivot_table(new_df, values = 'minutes', index=['Customer'], columns = 'date'))
Output:
date 2017-01-01 2017-01-02
Customer
A. Inc 720.0 750.0
B. GmbH 900.0 NaN
OK, I found a solution. It might not be the best, but it works.
import datetime
import dateutil.relativedelta

# group by session ID and add the min and max timestamps of each group as new columns
# (assumes df['TimeStamp'] has already been converted with pd.to_datetime)
group_Session = df.groupby(['Session ID'])
df['Start Time'] = group_Session['TimeStamp'].transform('min')
df['Stop Time'] = group_Session['TimeStamp'].transform('max')
df.drop_duplicates(subset=['Session ID'], keep='first', inplace=True)
# now we have start/stop for each session
# add all days of the month to the dataframe and fill with zeros
# (set dateStart to the first day of the month being analysed)
dateStart = datetime.datetime(2022, 2, 1)
dateStop = dateStart + dateutil.relativedelta.relativedelta(day=31)
for single_date in (dateStart.day + n for n in range(dateStop.day)):
    df[str(single_date) + '.' + str(dateStart.month)] = 0
for index, row in df.iterrows():
    # create a date range from start to finish with minute frequency
    # and convert it to a DataFrame
    dateRangeFrame = pd.date_range(start=row['Start Time'], end=row['Stop Time'], freq='min').to_frame(name='value')
    # extract the day from the date index
    # ('%#d.%#m' is Windows-only; use '%-d.%-m' on Linux/macOS)
    dateRangeFrame['day'] = dateRangeFrame['value'].dt.strftime('%#d.%#m')
    # group by day and count the results -> per session (index), the minutes per day
    day_to_minute_df = dateRangeFrame.groupby(['day']).count()
    # for each day, find the matching column and store the minute count
    for d2m_index, d2m_row in day_to_minute_df.iterrows():
        df.loc[index, d2m_index] = d2m_row['value']
new_df = df.groupby(['Customer']).sum(numeric_only=True)
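For comparison, here is a minimal sketch of the same idea without building a minute-by-minute range: take each session's first and last timestamp, split that span at midnight, and pivot. The helper name minutes_per_day is made up for this illustration, and the sketch assumes a session lasts from its first to its last event, so a session with a single event counts as 0 minutes (the minute-range approach above counts it as 1).

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
        ('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH'),
    ],
    columns=['TimeStamp', 'Session ID', 'Event', 'Customer'],
)
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# One row per session with its first and last event.
sessions = (
    df.groupby(['Customer', 'Session ID'])['TimeStamp']
      .agg(start='min', stop='max')
      .reset_index()
)

def minutes_per_day(start, stop):
    """Hypothetical helper: split [start, stop] at midnight, return {date: minutes}."""
    out = {}
    cur = start
    while True:
        next_midnight = cur.normalize() + pd.Timedelta(days=1)
        seg_end = min(stop, next_midnight)
        minutes = (seg_end - cur).total_seconds() / 60
        out[cur.date()] = out.get(cur.date(), 0.0) + minutes
        if seg_end >= stop:
            break
        cur = seg_end
    return out

rows = [
    {'Customer': s.Customer, 'date': day, 'minutes': minutes}
    for s in sessions.itertuples(index=False)
    for day, minutes in minutes_per_day(s.start, s.stop).items()
]

usage = pd.DataFrame(rows).pivot_table(
    index='Customer', columns='date', values='minutes',
    aggfunc='sum', fill_value=0,
)
print(usage)
```

On the sample data this gives 720 and 30 minutes for A. Inc on 01.01 and 02.01, matching the first row of the desired table.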
The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [start_date + timedelta(days = np.random.exponential(scale = 100)) for start_date in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I then want to do is count the number of df["start_date"] values which are less than the 1st day of each month in the series and where the df["end_date"] values are null (recorded as NaT)
I would think I would use a for loop to do this and somehow groupby the dates series so that the resulting output looks something like this:
month_start  count
2015-01-01   5
2015-02-01   10
2015-03-01   35
The count column in the resulting output is a count of the number of df rows where the df["start_date"] values are less than the 1st of each month in the series and where the df["end_date"] values are null - this occurs for every value in the series
Here is the logic of what I am trying to do:
df.groupby(by = dates)[["start_date", "end_date"]].apply(
lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)
Is this what you want:
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
.groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
.sum()
.cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64
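As a cross-check on a tiny hand-made example, the count can also be sketched with a sorted lookup: keep the start dates whose end_date is NaT, sort them, and ask how many fall strictly before each month start. The four sample rows below are invented for illustration; only the column names and the construction of dates come from the question.

```python
import numpy as np
import pandas as pd

# Small illustrative data set (values are made up for this sketch).
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-15", "2015-02-20", "2015-03-05"]),
    "end_date": pd.to_datetime([None, "2015-02-01", None, None]),
})

# First day of every month present in start_date, as in the question.
dates = pd.Series(
    df["start_date"].dt.to_period("M").sort_values().unique()
).dt.start_time

# Start dates of rows that are still open (end_date is NaT), sorted.
open_starts = np.sort(df.loc[df["end_date"].isna(), "start_date"].to_numpy())

# side="left" counts the elements strictly less than each month start.
counts = np.searchsorted(open_starts, dates.to_numpy(), side="left")

result = pd.DataFrame({"month_start": dates, "count": counts})
print(result)
```

Here the counts are 0, 1 and 2 for January, February and March 2015: the row that starts on 2015-01-15 is excluded because it has an end date.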
How can I get the nearest last-month MTD (month-to-date) value when the same day of the previous month had no sale? Please refer to the sample provided for details.
Original data:
There may be no sale on the same day of the previous month; in that case I need to find the nearest day in the previous month.
Currently I use pd.merge to get the Last MTD data, but if the product had no sale on the same day of last month, the result shows zero.
Example 1:
02/10/2022 VS 02/09/2022
02/10/2022 has a sale for Clothes, but 02/09/2022 does not. I expect the Last MTD column to display the MTD data from last month.
Current result:
Expected output:
Code:
df["pdate"] = df.Date.apply(lambda x: (x - pd.DateOffset(months=1)))
df2 = df.copy()
final_df = pd.merge(left = df,right = df2, how="left", left_on=['pdate','Product'], right_on=['Date', 'Product'])
######## For understanding (can ignore)
###############
Example 2:
03/10/2022 VS 03/09/2022
03/10/2022 has a sale for Dining room, but 03/09/2022 does not. I expect the Last MTD column to display the MTD data from last month.
Current result:
Expected result:
You can merge using a shifted period:
df['period'] = pd.to_datetime(df['date'], dayfirst=False).dt.to_period('M')
df['prev_period'] = df.groupby('product')['period'].shift()
out = (df.merge(df[['product', 'period', 'MTD']],
how='left', suffixes=[None, '_previous'],
left_on=['product', 'prev_period'],
right_on=['product', 'period'])
[['date', 'product', 'MTD', 'MTD_previous']]
)
Example :
date product MTD MTD_previous
0 01/09/2022 A 1 NaN
1 01/09/2022 B 2 NaN
2 02/09/2022 A 3 1.0
3 02/10/2022 B 4 2.0
Used input:
df = pd.DataFrame({'date': ['01/09/2022', '01/09/2022', '02/09/2022', '02/10/2022'],
'product': ['A', 'B', 'A', 'B'],
'MTD': [1, 2, 3, 4]
})
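Another option worth trying is pd.merge_asof, which matches each row to the nearest earlier key instead of requiring an exact match. The sketch below assumes the column names Date and Product from the question and an MTD column as in the answer; the sample values and the helper column names last_date and Last MTD are made up for illustration. A backward match can reach into an older month, so the last step blanks out any match that is not in the month of pdate.

```python
import pandas as pd

# Illustrative data: no Clothes sale on 2022-09-02, but one on 2022-09-01.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-09-01", "2022-09-03", "2022-10-02", "2022-10-03"]),
    "Product": ["Clothes", "Clothes", "Clothes", "Clothes"],
    "MTD": [10, 25, 7, 12],
})
df["pdate"] = df["Date"] - pd.DateOffset(months=1)

# merge_asof needs both sides sorted by their merge key.
left = df.sort_values("pdate")
right = (
    df[["Date", "Product", "MTD"]]
    .rename(columns={"Date": "last_date", "MTD": "Last MTD"})
    .sort_values("last_date")
)

# For each row, take the latest sale of the same product on or before pdate.
out = pd.merge_asof(
    left, right,
    left_on="pdate", right_on="last_date",
    by="Product", direction="backward",
)

# Discard matches that fall outside the month of pdate.
same_month = out["last_date"].dt.to_period("M") == out["pdate"].dt.to_period("M")
out.loc[~same_month, "Last MTD"] = float("nan")
print(out[["Date", "Product", "MTD", "Last MTD"]])
```

With this data, 2022-10-02 picks up the MTD of 2022-09-01 (10) because 2022-09-02 has no sale, and 2022-10-03 picks up the MTD of 2022-09-03 (25).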
I have two dataframes; each has a start/end datetime and a value. They do not have the same number of rows, and overlapping intervals may not be in the same row/index.
df1
start_datetime end_datetime value
08:50 09:50 5
09:52 10:10 6
10:50 11:30 2
df2
start_datetime end_datetime value
08:51 08:59 3
09:52 10:02 9
10:03 10:30 1
11:03 11:39 1
13:10 13:15 0
I would like to calculate the total duration for which df1 and df2 overlap, counting only the overlaps where df1.value > df2.value.
During one df2 time interval, df1 can overlap multiple times, and the condition is only sometimes True.
I tried something like this:
from datetime import timedelta

time = timedelta()
for i, row1 in df1.iterrows():
    t1 = pd.Interval(row1.start_datetime, row1.end_datetime)
    for j, row2 in df2.iterrows():
        t2 = pd.Interval(row2.start_datetime, row2.end_datetime)
        if t1.overlaps(t2) and row1.value > row2.value:
            # the overlap runs from the later start to the earlier end
            latest_start = max(row1.start_datetime, row2.start_datetime)
            earliest_end = min(row1.end_datetime, row2.end_datetime)
            delta = earliest_end - latest_start
            time += delta
I can loop over every df1 row and test it against all of df2, but that is not efficient.
expected output (example):
Timedelta('0 days 00:99:99')
Here is my solution:
Create DataFrames:
df1 = pd.DataFrame(
    {'start_datetime1': ['08:50', '09:52', '10:50'],
     'end_datetime1': ['09:50', '10:10', '11:30'],
     'value1': [5, 6, 2]})
df2 = pd.DataFrame(
    {'start_datetime2': ['08:51', '09:52', '10:03', '11:03', '13:10'],
     'end_datetime2': ['08:59', '10:02', '10:30', '11:39', '13:15'],
     'value2': [3, 9, 1, 1, 0]})
df2["start_datetime2"]= pd.to_datetime(df2["start_datetime2"])
df2["end_datetime2"]= pd.to_datetime(df2["end_datetime2"])
df1["start_datetime1"]= pd.to_datetime(df1["start_datetime1"])
df1["end_datetime1"]= pd.to_datetime(df1["end_datetime1"])
Combine dataframes to make comparison easier. Combined dataframe has all possible matches :
df1['temp'] = 1 #temporary keys to create all combinations
df2['temp'] = 1
df_combined = pd.merge(df1,df2,on='temp').drop('temp',axis=1)
Compare values with lambda function:
# overlap duration = earliest end minus latest start
df_combined['Result'] = df_combined.apply(
    lambda row: min(row["end_datetime1"], row["end_datetime2"]) -
                max(row["start_datetime1"], row["start_datetime2"])
    if pd.Interval(row['start_datetime1'], row['end_datetime1']).overlaps(
        pd.Interval(row['start_datetime2'], row['end_datetime2'])) and
       row["value1"] > row["value2"]
    else pd.Timedelta(0), axis=1)
df_combined
Result:
total_timedelta = df_combined['Result'].sum()
0 days 00:42:00
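The row-wise apply can also be replaced by column arithmetic on the cross join. This is only a sketch: it assumes pandas 1.2 or later for how="cross", uses the column names from the question's tables, and attaches an arbitrary date (2022-01-01) to the times so they parse as full timestamps.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "start_datetime": ["08:50", "09:52", "10:50"],
    "end_datetime": ["09:50", "10:10", "11:30"],
    "value": [5, 6, 2],
})
df2 = pd.DataFrame({
    "start_datetime": ["08:51", "09:52", "10:03", "11:03", "13:10"],
    "end_datetime": ["08:59", "10:02", "10:30", "11:39", "13:15"],
    "value": [3, 9, 1, 1, 0],
})
# The date is an arbitrary assumption; only the times matter here.
for frame in (df1, df2):
    for col in ("start_datetime", "end_datetime"):
        frame[col] = pd.to_datetime("2022-01-01 " + frame[col])

# Every df1 interval paired with every df2 interval.
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Overlap = earliest end minus latest start, floored at zero.
latest_start = np.maximum(pairs["start_datetime_1"], pairs["start_datetime_2"])
earliest_end = np.minimum(pairs["end_datetime_1"], pairs["end_datetime_2"])
overlap = earliest_end - latest_start
overlap = overlap.where(overlap > pd.Timedelta(0), pd.Timedelta(0))

# Keep only the pairs where df1's value exceeds df2's value.
total = overlap[pairs["value_1"] > pairs["value_2"]].sum()
print(total)
```

On the sample data the qualifying overlaps are 8, 7 and 27 minutes, so the total is 42 minutes. The cross join still grows as len(df1) * len(df2), so for large inputs an interval-based join would be the next thing to look at.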
Dataframe:
I would like to filter for customer_ids that first appear on or after a certain date, in this case 2019-01-10, and then create a new df with the list of new customers.
df
date customer_id
2019-01-01 429492
2019-01-01 344343
2019-01-01 949222
2019-01-10 429492
2019-01-10 344343
2019-01-10 129292
Output df
customer_id
129292
This is what I have tried so far but this gives me also customer_id's that were active before 10th January 2019
s = df.loc[df["date"]>="2019-01-10", "customer_id"]
df_new = df[df["customer_id"].isin(s)]
df_new
You can use boolean indexing with filtering with Series.isin:
df["date"] = pd.to_datetime(df["date"])
mask1 = df["date"]>="2019-01-10"
mask2 = df["customer_id"].isin(df.loc[~mask1,"customer_id"])
df = df.loc[mask1 & ~mask2, ['customer_id']]
print (df)
customer_id
5 129292
df['date'] = pd.to_datetime(df['date'])
cutoff = pd.to_datetime('2019-01-10')
mask = df['date'] >= cutoff
customers_before = df.loc[~mask, 'customer_id'].unique().tolist()
customers_after = df.loc[mask, 'customer_id'].unique().tolist()
result = set(customers_after) - set(customers_before)
"then create a new df with a list of new customers": in this case your output is empty, because 2019-01-10 is the last date, so there are no new customers after it.
But if you want the list of customers on or after a certain date:
df=pd.DataFrame({
'date':['2019-01-01','2019-01-01','2019-01-01',
'2019-01-10','2019-01-10','2019-01-10'],
'customer_id':[429492,344343,949222,429492,344343,129292]
})
certain_date=pd.to_datetime('2019-01-10')
df.date=pd.to_datetime(df.date)
df=df[
df.date>=certain_date
]
print(df)
date customer_id
3 2019-01-10 429492
4 2019-01-10 344343
5 2019-01-10 129292
If your 'date' column has datetime objects you just have to do:
from datetime import datetime
df_new = df[df['date'] >= datetime(2019, 1, 10)]['customer_id']
If your 'date' column doesn't contain datetime objects, you should convert it first using the to_datetime method:
df['date'] = pd.to_datetime(df['date'])
And then apply the methodology described above.
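Note that a plain date filter also returns customers who were already active before the cutoff. One way to keep only customers whose first appearance is on or after the cutoff is to compute each customer's earliest date and filter on that. A minimal sketch using the question's sample data (the names first_seen and new_customers are chosen for this illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-01', '2019-01-01',
             '2019-01-10', '2019-01-10', '2019-01-10'],
    'customer_id': [429492, 344343, 949222, 429492, 344343, 129292],
})
df['date'] = pd.to_datetime(df['date'])
cutoff = pd.Timestamp('2019-01-10')

# Earliest date on which each customer appears.
first_seen = df.groupby('customer_id')['date'].min()

# Customers whose first appearance is on or after the cutoff.
new_customers = first_seen[first_seen >= cutoff].index.to_frame(index=False)
print(new_customers)
```

For the sample data this leaves only customer 129292.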
I am looking for a way to create the column 'min_value' from the dataframe df below. For each row i, we subset from the entire dataframe all the records that correspond to the grouping ['Date_A', 'Date_B'] of the row i and having the condition 'Advance' less than 'Advance' of row i, and finally we pick the minimum of the column 'Amount' from this subset to set 'min_value' for the row i:
Initial dataframe:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240]})
df = df [['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240],
'min_value': [180,180,180,230,230,220] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
I wrote the following loop that I think would do the job but it is much too long to run, I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1 = df['Date_A'][i]  # select the date A of the row i
    date2 = df['Date_B'][i]  # select the date B of the row i
    advance = df['Advance'][i]  # select the advance of the row i
    # subset the entire dataframe to meet dates and advance conditions
    df.loc[i, 'min_value'] = df[df['Date_A']==date1][df['Date_B']==date2][df['Advance']<advance]['Amount'].min()
# for the smallest advance value, set min to its own amount
df.loc[df['min_value'].isnull(), 'min_value'] = df['Amount']
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NA rows, I'd like to replace the amount of the row by the overall amount of the Date_A,Date_B,advance grouping so that I have the overall minimum of the last day before date_A
Improvement desired output (two records for the smallest advance value)
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [5,8,150,5],
'Amount' : [230,220,240,225],
'min_value': [225,230,220,225] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
Thanks
You can sort the values by 'Advance', group by 'Date_A' and 'Date_B', and use transform to apply cummin followed by shift to the column 'Amount' (transform keeps the original index, so the result aligns with df). Then use fillna with the value from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
                     .transform(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
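If several rows in a group can share the same 'Advance', a row-wise shift lets tied rows see each other's amounts, although the question asks for a strictly smaller 'Advance'. A possible variant, sketched under that assumption, first reduces each (Date_A, Date_B, Advance) combination to its minimum amount, applies cummin and shift over the distinct advances, and merges the result back. The dates are kept as strings here for brevity, and rows with no smaller advance still fall back to their own amount.

```python
import pandas as pd

df = pd.DataFrame({
    'Date_A': ['2017-12-25'] * 3 + ['2018-01-25'] * 3,
    'Date_B': ['2018-01-01'] * 3 + ['2018-02-01'] * 3,
    'Advance': [10, 103, 200, 5, 8, 150],
    'Amount': [180, 220, 200, 230, 220, 240],
})
keys = ['Date_A', 'Date_B']

# Minimum amount for each distinct advance within a group.
per_advance = (
    df.groupby(keys + ['Advance'], as_index=False)['Amount'].min()
      .sort_values(keys + ['Advance'])
)

# Running minimum over strictly smaller advances (shift excludes the current one).
per_advance['min_value'] = (
    per_advance.groupby(keys)['Amount']
               .transform(lambda s: s.cummin().shift())
)

# Bring the result back to the original rows; a left merge keeps their order.
out = df.merge(per_advance[keys + ['Advance', 'min_value']],
               on=keys + ['Advance'], how='left')
out['min_value'] = out['min_value'].fillna(out['Amount'])
print(out)
```

On the original sample data this reproduces the min_value column shown above (180, 180, 180, 230, 230, 220).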