Average time difference for each day Pandas - python

I have a data frame of user activity, with user IDs and activity times.
I'm trying to calculate the average time difference between activities for each user. I've managed to do this when a user is active for only one day, but I struggle with instances where the user is active across multiple days.
For example:
User ID    Activity Date           week
1          7/26/2021 8:29:01 PM    1
1          7/26/2021 8:28:01 PM    1
1          7/26/2021 8:32:01 PM    2
I used this code, and it works fine:
d = (d.sort_values('Activity Date')
      .groupby(['User ID', 'week'])['Activity Date']
      .apply(lambda x: x.diff().mean())
      .dt.total_seconds() / 60)
My issue is when the user is active on multiple days, with my code I still get an average but it doesn't represent the activity the way I need it.
User ID    Activity Date           week
1          7/25/2021 8:29:01 PM    1
1          7/26/2021 8:29:01 PM    1
1          7/26/2021 8:32:01 PM    1
1          7/25/2021 8:28:01 PM    1
1          7/30/2021 8:32:01 PM    2
1          7/30/2021 8:30:01 PM    2
I would like to first compute the average for each day, and then compute the average of those daily averages.
My code gives the result: week 1: 481.333 minutes, week 2: 2 minutes.
I want it to be: week 1: 2 minutes (for 25/07 a 1-minute difference, for 26/07 a 3-minute difference, so the mean is 2 minutes).
I would really appreciate your help or any suggestions!
Thanks!!

You can perform a double groupby, first on user and day, then on user:
df['Activity Date'] = pd.to_datetime(df['Activity Date'])
day = df['Activity Date'].dt.normalize()

out = (df
       .sort_values(by=['User ID', 'Activity Date'])
       .groupby(['User ID', day])['Activity Date']  # select the column so only the times are diffed
       .diff()
       .groupby(df['User ID']).mean()
      )
Output:

User ID
1   0 days 00:02:00
Name: Activity Date, dtype: timedelta64[ns]
Also grouping by week:
out = (df
       .sort_values(by=['User ID', 'Activity Date'])
       .groupby(['User ID', day])['Activity Date']
       .diff()
       .groupby([df['User ID'], df['week']]).mean()
      )
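With the sample data above, this variant should give two minutes for both weeks (a hand-computed sketch: week 1 averages the 1-minute gap on 7/25 and the 3-minute gap on 7/26, week 2 has a single 2-minute gap):

User ID  week
1        1      0 days 00:02:00
         2      0 days 00:02:00
Name: Activity Date, dtype: timedelta64[ns]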

Related

Calculate the Number of Users at the Start of the Month

I have a table which looks like this:
ID    Start Date    End Date
1     01/01/2022    29/01/2022
2     03/01/2022
3     15/01/2022
4     01/02/2022    01/03/2022
5     01/03/2022    01/05/2022
6     01/04/2022
So, for every row I have the start date of the contract with the user and the end date. If the contract is still active, there is no end date.
I'm trying to get a table that looks like this:
Feb    Mar    Apr    Jun
3      3      4      3
Which counts the number of active users on the first day of the month.
What is the most efficient way to calculate this?
At the moment the only idea that came to my mind was to use a scaffold table containing the dates I'm interested in (the first day of every month) and from that easily create the new table I need.
But my question is: is there a better way to solve this? I would love to find a more efficient approach, since I would need to repeat the exact same calculation for the number of users at the start of the week.
This might help:
import pandas as pd

# initializing dataframe
df = pd.DataFrame({'start': ['01/01/2022', '03/01/2022', '15/01/2022',
                             '01/02/2022', '01/03/2022', '01/04/2022'],
                   'end':   ['29/01/2022', '', '', '01/03/2022', '01/05/2022', '']})

# cleaning datetime (the empty end dates are replaced with the max end date)
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y', errors='coerce')
df['end'] = df['end'].fillna(df['end'].max())

# count the contracts active on the first day of each month
dt_range = pd.date_range(start=df['start'].min(), end=df['end'].max(), freq='MS')
rows = [{'month': dat.strftime('%B - %Y'),
         'number': len(df[(df['start'] <= dat) & (df['end'] >= dat)])}
        for dat in dt_range]
df2 = pd.DataFrame(rows, columns=['month', 'number'])
Output:
month number
0 January - 2022 1
1 February - 2022 3
2 March - 2022 4
3 April - 2022 4
4 May - 2022 4
Or, if you want the format as in your question:
df2.T
month January - 2022 February - 2022 March - 2022 April - 2022 May - 2022
number 1 3 4 4 4
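If the loop over months becomes a bottleneck, here is a vectorized sketch (assuming the same df with the parsed start/end columns from above): it compares every month start against all contracts at once via NumPy broadcasting.

import numpy as np

# month starts covering the whole data range
dt_range = pd.date_range(start=df['start'].min(), end=df['end'].max(), freq='MS')

# rows = month starts, columns = contracts; True where a contract is active on that day
active = ((df['start'].values[None, :] <= dt_range.values[:, None]) &
          (df['end'].values[None, :] >= dt_range.values[:, None]))

df2 = pd.DataFrame({'month': dt_range.strftime('%B - %Y'),
                    'number': active.sum(axis=1)})

The same pattern covers the weekly case by changing freq='MS' to freq='W-MON'.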

how to filter on date that starts on the last day of the previous month and ends on the current day

How can I filter rows whose date starts at the last day of the previous month and ends on the current day?
(If the last day of the previous month is a Saturday or Sunday, the previous business day must be used.)
For example:
example = pd.read_excel("C:/Users/USER/Downloads/Movimentos.xls")
example = example.drop(columns=['Data Valor', 'Descrição', 'Valor'])
example = example.sort_values(by='Data Operação', ascending=False)  # assign the sorted result back
display(example)
Data Operação Saldo
0 12/10/2022 310.36
1 12/10/2022 312.86
2 11/10/2022 315.34
3 11/10/2022 317.84
4 09/10/2022 326.44
5 30/09/2022 224.44
... ... ...
188 05/07/2022 128.40
189 01/07/2022 8.40
190 24/06/2022 18.40
191 22/06/2022 23.40
192 27/05/2022 50.00
In this case I would like to filter from row 5 (30/09/2022, 224.44), which is the last day of the previous month and a weekday, to row 0 (12/10/2022, 310.36), which is the current day.
I've seen examples where you just had to enter the date '2022-09-30', but in this case the filter will be recurring, so it needs to be something like:
today = datetime.date.today()
today.month  # for the end date
but for the start date I don't know how I'm supposed to do it.
Here is one way to do it.
By the way, you can convert the date column to datetime once up front, to avoid converting while filtering:
from datetime import datetime

# optionally, convert once (the column is named 'Data Operação' in the question)
df['Data Operação'] = pd.to_datetime(df['Data Operação'], dayfirst=True)

# compare the date against today's date in ymd format
# using pd.offsets, get the business month-end of the previous month
# finally, use loc to identify the rows matching both criteria
(df.loc[(pd.to_datetime(df['Data Operação'], dayfirst=True) <=
         pd.Timestamp(datetime.now()).strftime('%Y-%m-%d')) &
        (pd.to_datetime(df['Data Operação'], dayfirst=True) >=
         pd.Timestamp(datetime.now() + pd.offsets.BusinessMonthEnd(-1)).strftime('%Y-%m-%d'))
       ]
)
Or, to make it more comprehensible:
# create tday (today) and lmonth (previous business month-end) variables,
# then use these for the comparison
tday = pd.Timestamp(datetime.now()).strftime('%Y-%m-%d')
lmonth = pd.Timestamp(datetime.now() + pd.offsets.BusinessMonthEnd(-1)).strftime('%Y-%m-%d')

(df.loc[(pd.to_datetime(df['Data Operação'], dayfirst=True) <= tday) &
        (pd.to_datetime(df['Data Operação'], dayfirst=True) >= lmonth)
       ]
)
   Data Operação   Saldo
0     12/10/2022  310.36
1     12/10/2022  312.86
2     11/10/2022  315.34
3     11/10/2022  317.84
4     09/10/2022  326.44
5     30/09/2022  224.44
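Note that pd.offsets.BusinessMonthEnd already rolls to the last business day of the month, which covers the Saturday/Sunday rule from the question. A quick check (assuming a reference date of 2022-10-12 as in the sample data):

from datetime import datetime
import pandas as pd

# the previous business month-end seen from 2022-10-12 is Friday 2022-09-30
print(pd.Timestamp(datetime(2022, 10, 12)) + pd.offsets.BusinessMonthEnd(-1))
# 2022-09-30 00:00:00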

Creating year week based on date with different start date

I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year-week numbering starts at 2021-06-28, which is the Monday of the week containing the first day of July.
I tried:
df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date']
                   - timedelta(days=datetime(2021, 6, 24).timetuple().tm_yday)
                  ).dt.isocalendar().week
I played around with the timedelta days values so that the 2021-06-28 has a value of 1.
But then I got problems with earlier dates, and with dates exceeding my start date + 1 year:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period is from 2021-06-28 + 1 year.
date        year_week
2021-03-12  38    # LY38
2021-03-17  39    # LY39
2021-06-28  1     # correct
...
2022-05-21  47    # correct
2022-08-17  8     # NY8
Is there a way to get around this? As I am aggregating the data by year-week, I get incorrect results due to the past and upcoming dates. I would want negative values (or LY38, denoting a week of the last year) for the days before 2021-06-28, and accordingly week numbers of 52+ (or NY8, denoting the 8th week of the next year) for the dates beyond the first year.
Here is a way; I added two dates more than a year away. You need the isocalendar of the difference between the date column and the day-of-year of your specific start date. Then you can select the different scenarios depending on the year offset from your start date; use np.select for the different result formats.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame(
    {'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
              '2022-05-21', '2022-08-17', '2023-08-17']}
)

# define start date
d = pd.to_datetime('2021-6-24')

# remove the number of days of the year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)).dt.isocalendar()

# get the difference in years
m = s['year'].astype('int32') - d.year

# all conditions of the result depending on the year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY', 'NY', (m + 1).astype(str) + 'LY', '+' + (m - 1).astype(str) + 'NY']

# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
I think pandas period_range can be of some help:
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))  # n_weeks = the number of weeks you want
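For reference, that call produces weekly periods whose first interval contains the anchor date (a small sketch with a fixed periods count; pick whatever span covers your data):

import pandas as pd

pd.Series(pd.period_range("6/28/2017", freq="W", periods=3))
# 0    2017-06-26/2017-07-02
# 1    2017-07-03/2017-07-09
# 2    2017-07-10/2017-07-16
# dtype: period[W-SUN]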

Python: Pandas dataframe get the year to which the week number belongs and not the year of the date

I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", on_bad_lines='skip', sep=";")  # error_bad_lines=False in pandas < 1.3
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
You will get year 2020 for the date 2021-01-01, but back to year 2021 for the date 2021-01-04 this way.
This is just similar to how you used dt.isocalendar().week for setting up df["Week"]. Since they are both based on the same (year, week, day) tuple returned by dt.isocalendar(), they will always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
You can simply subtract the two dates and then divide the days attribute of the timedelta object by 7.
For example, this is the current week we are on now.
import datetime as dt

time_delta = dt.datetime.today() - dt.datetime(2021, 1, 1)
The output is a datetime timedelta object
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this:
weeks = (df["Date_of_publication"] - dt.datetime(2021, 1, 1)).dt.days // 7
The output is the number of whole weeks between the reference date and each Date_of_publication.
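A minimal sketch of that idea, assuming a fixed reference date of 2021-01-01 (the 75-day delta above lands in week 10):

import datetime as dt
import pandas as pd

s = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-10', '2021-03-17']))
weeks_since = (s - dt.datetime(2021, 1, 1)).dt.days // 7
# 0     0
# 1     1
# 2    10
# dtype: int64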

Count number of open cases per month

I am currently struggling with counting the number of open tasks at month end with pyspark. I have a dataframe with two timestamps (one for the beginning of the issue, one for the end of the issue), and I would like to count the number of open issues at each month end.
My dataframe df1 is a merge of two timestamp tables (timestamps for the beginning and the end of each task). Let's say it looks like below:
TaskNumber  Start Date  StartTimestamp  EndDate    EndTimestamp  EndOfTheMonth
1           2/15/2019   Start           4/18/2019  End           2/28/2019
2           2/16/2019   Start           2/23/2019  End           2/28/2019
3           2/17/2019   Start           3/4/2019   End           2/28/2019
4           3/1/2019    Start           Null       Null          3/31/2019
from pyspark.sql.functions import col, last_day, when, sum as _sum
from pyspark.sql import functions as F

df2 = df1.withColumn("EndOfTheMonth", last_day("Start Date"))

df3 = (df2.withColumn("OpenTaskAtMonthEnd",
                      when((col("Start Date") <= col("EndOfTheMonth")) &
                           ((col("EndDate") >= col("EndOfTheMonth")) |
                            col("EndTimestamp").isNull()), 1).otherwise(0))
          .withColumn("TaskStillOpened", when(col("EndTimestamp").isNull(), 1).otherwise(0))
          .withColumn("DateofToday", F.current_date()))

df4 = (df3.filter(col("OpenTaskAtMonthEnd") == 1)
          .withColumn("Pending Days",
                      when(col("TaskStillOpened") == 1,
                           F.datediff(F.to_date(col("DateofToday")), F.to_date(col("Start Date"))))
                      .otherwise(F.datediff(F.to_date(col("EndDate")), F.to_date(col("Start Date"))))))

dfOpen = (df4.groupBy("EndOfTheMonth")
             .agg(_sum(col("OpenTaskAtMonthEnd")).alias("NrTotalOpenTasksAtMonthEnd"),
                  _sum(col("TaskStillOpened")).alias("TotalWorkinProgress")))
TaskNumber  Start Date  EndDate    EndOfTheMonth  OpenTaskAtMonthEnd  TaskStillOpened
1           2/15/2019   4/18/2019  2/28/2019      1                   0
2           2/16/2019   2/23/2019  2/28/2019      0                   0
3           2/17/2019   3/4/2019   2/28/2019      1                   0
4           3/1/2019    Null       3/31/2019      1                   1
I manage to count the open tasks for a month n when they started in that particular month n. But when a task began in month n-1 and ended only in month n+1, my current code cannot find it among the open tasks for month n.
The expected output is the number of open tasks for each month, and ideally also for each week.
Thanks a lot for your support ;)
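No answer was posted here, but a hedged sketch of one way to catch tasks spanning several months: generate every calendar month each task touches with sequence, explode, then count per month-end. Column names follow the question; still-open tasks get today's date as a provisional end.

from pyspark.sql import functions as F

tasks = (df1
         .withColumn("start", F.to_date("Start Date", "M/d/yyyy"))
         .withColumn("end", F.coalesce(F.to_date("EndDate", "M/d/yyyy"), F.current_date()))
         # one row per calendar month the task touches
         .withColumn("month_start",
                     F.explode(F.expr("sequence(trunc(start, 'MM'), trunc(end, 'MM'), interval 1 month)")))
         .withColumn("month_end", F.last_day("month_start"))
         # open at month end: started on or before it and not finished before it
         .filter((F.col("start") <= F.col("month_end")) & (F.col("end") >= F.col("month_end"))))

openPerMonth = tasks.groupBy("month_end").agg(F.count("TaskNumber").alias("NrOpenTasksAtMonthEnd"))

The weekly variant should follow the same pattern with interval 1 week and trunc(start, 'week').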
