Calculate difference of rows in Pandas - python

I have a timeseries dataframe where there are alerts for some particular rows. The dataframe looks like-
machineID  time              vibration  alerts
1          2023-02-15 11:45  220        1
1          2023-02-15 12:00  221        0
1          2023-02-15 12:15  219        0
1          2023-02-15 12:30  220        1
1          2023-02-16 11:45  220        1
1          2023-02-16 12:00  221        1
1          2023-02-16 12:15  219        0
1          2023-02-16 12:30  220        1
I want to calculate the difference in the alerts column from one day to the next. Because the time column is in 15-minute intervals, I don't know how to group by whole day, i.e. sum the alerts for each day and compare that with the sum of all alerts for the previous day.
In short, I need a way to sum all alerts for each day and subtract the previous day's sum. The result should be another dataframe with a date column and a difference-of-alerts column. In this case, the new dataframe would be-
time diff_alerts
2023-02-16 1
since there is a difference of 1 alert on the next day, i.e. 16-02-2023.

Group by day with a custom pd.Grouper then sum alerts and finally compute the diff with the previous day:
>>> (df.groupby(pd.Grouper(key='time', freq='D'))['alerts'].sum().diff()
...    .dropna().rename('diff_alerts').astype(int).reset_index())
        time  diff_alerts
0 2023-02-16            1
Note: the second line of code is just here to have a clean output.
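For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same steps. It assumes the time column is (or can be converted to) datetime64; the intermediate names daily and result are only illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "machineID": [1] * 8,
    "time": pd.to_datetime([
        "2023-02-15 11:45", "2023-02-15 12:00",
        "2023-02-15 12:15", "2023-02-15 12:30",
        "2023-02-16 11:45", "2023-02-16 12:00",
        "2023-02-16 12:15", "2023-02-16 12:30",
    ]),
    "vibration": [220, 221, 219, 220, 220, 221, 219, 220],
    "alerts": [1, 0, 0, 1, 1, 1, 0, 1],
})

# Sum alerts per calendar day (2 on the 15th, 3 on the 16th).
daily = df.groupby(pd.Grouper(key="time", freq="D"))["alerts"].sum()

# Day-over-day difference; the first day has no predecessor, so drop it.
result = (daily.diff()
               .dropna()
               .astype(int)
               .rename("diff_alerts")
               .reset_index())
print(result)
```

With several machines in the same frame, adding "machineID" to the groupby keys and using a grouped diff would probably be needed, but that goes beyond the data shown in the question.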

Related

Average time difference for each day Pandas

I have a data frame of user activity, with user ID's and time of activity.
I'm trying to calculate the average time difference between activities for each user. I've managed to do this when a user is active for only 1 day, but I struggle with instances when the user is active for multiple days.
For example:
User ID  Activity Date         week
1        7/26/2021 8:29:01 PM  1
1        7/26/2021 8:28:01 PM  1
1        7/26/2021 8:32:01 PM  2
I used this code, and it works fine:
d=d.sort_values('Activity Date').groupby(['User ID','week'])['Activity Date'].apply(lambda x: x.diff().mean()).dt.total_seconds()/60
My issue is when the user is active on multiple days: my code still returns an average, but it doesn't represent the activity the way I need it to.
User ID  Activity Date         week
1        7/25/2021 8:29:01 PM  1
1        7/26/2021 8:29:01 PM  1
1        7/26/2021 8:32:01 PM  1
1        7/25/2021 8:28:01 PM  1
1        7/30/2021 8:32:01 PM  2
1        7/30/2021 8:30:01 PM  2
I would like to first compute the average for each day, and then compute the average of those daily averages.
My code gives the result of: week 1: 481.333 minutes, week 2: 2 minutes
I want it to be: for week 1: 2 minutes (for 25/07- 1 minute difference, for 26/07- 3 minute difference=> so the mean is 2 minutes).
I would really appreciate your help or any suggestions!
Thanks!!
You can perform a double groupby, first on user and day, then on user:
df['Activity Date'] = pd.to_datetime(df['Activity Date'])
day = df['Activity Date'].dt.normalize()
out = (df
   .sort_values(by=['User ID', 'Activity Date'])
   .groupby(['User ID', day])
   .diff()
   .groupby(df['User ID']).mean()
)
Output:
Activity Date
User ID
1 0 days 00:02:00
To also group by week:
out = (df
   .sort_values(by=['User ID', 'Activity Date'])
   .groupby(['User ID', day])
   .diff()
   .groupby([df['User ID'], df['week']]).mean()
)
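The question asks specifically for an average of daily averages, so a variant that averages in two explicit stages may be closer to that wording. The sketch below is one possible way to do it, not the answerer's code: it uses the second sample table from the question (written as ISO timestamps), and the helper columns day and gap are names introduced here for illustration.

```python
import pandas as pd

# Second sample table from the question, as 24-hour ISO timestamps.
df = pd.DataFrame({
    "User ID": [1] * 6,
    "Activity Date": pd.to_datetime([
        "2021-07-25 20:29:01", "2021-07-26 20:29:01",
        "2021-07-26 20:32:01", "2021-07-25 20:28:01",
        "2021-07-30 20:32:01", "2021-07-30 20:30:01",
    ]),
    "week": [1, 1, 1, 1, 2, 2],
})

# Helper column: the calendar day of each activity.
df["day"] = df["Activity Date"].dt.normalize()
df = df.sort_values(["User ID", "Activity Date"])

# Gap to the previous activity of the same user on the same day
# (NaT for the first activity of each day).
df["gap"] = df.groupby(["User ID", "day"])["Activity Date"].diff()

# Stage 1: mean gap per user, week and day.
daily_mean = df.groupby(["User ID", "week", "day"])["gap"].mean()

# Stage 2: mean of the daily means per user and week, in minutes.
weekly_mean = daily_mean.groupby(level=["User ID", "week"]).mean()
minutes = weekly_mean.dt.total_seconds() / 60
print(minutes)
```

On this sample, week 1 has daily means of 1 and 3 minutes, so its average of averages is 2 minutes, which matches the result the question asks for.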

How to calculate the days difference between all the dates of a dataframe column and a single data in Python

I would like to calculate the difference in days between every date in the "last_review" column and
2018-08-01, and I want the output in exact days; for example, if the observation is 2018-07-31, the output should be 2. I want to do this for every observation in the dataframe column. The output should be
48894 * 1
You can do it like so:
df['last_review'] = pd.to_datetime(df['last_review'])
df['num_days'] = pd.to_datetime("2019-08-01") - df['last_review']
Output:
last_review num_days
0 2018-10-19 286 days
1 2019-05-21 72 days
2 2011-03-28 3048 days
You can use:
from datetime import datetime
sub_date = datetime(2018, 8, 1)
df['last_review'] = pd.to_datetime(df['last_review'])
df['diff'] = (sub_date - df['last_review']).dt.days  # integer number of days
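As a small self-contained check of the two answers above, the sketch below uses the three sample dates from the first answer's output and the same reference date that answer uses (2019-08-01), then converts the timedeltas to plain integers with .dt.days. The frame is rebuilt by hand here, so its contents are illustrative only.

```python
import pandas as pd

# Three sample dates taken from the first answer's output.
df = pd.DataFrame({"last_review": ["2018-10-19", "2019-05-21", "2011-03-28"]})
df["last_review"] = pd.to_datetime(df["last_review"])

# Reference date used by the first answer.
reference = pd.Timestamp("2019-08-01")

# Subtracting gives a timedelta column; .dt.days extracts whole days as integers.
df["num_days"] = (reference - df["last_review"]).dt.days
print(df)
```

The num_days column holds integers (286, 72, 3048) rather than timedelta values, which is usually easier to use in later arithmetic.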

Calculate the time difference between two hh:mm columns in a pandas dataframe

I am reading some data from a CSV file in which two columns are in hh:mm format. Here is an example:
Start End
11:15 15:00
22:30 2:00
In the above example, the End in the 2nd row falls on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there a good Pythonic way to do this? Also, since there is no date and some Ends fall on the next day, I get wrong results when I calculate the difference.
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End'])-pd.to_datetime(df['Start'])
0 0 days 03:45:00
1 0 days 03:00:00
2 -1 days +03:30:00
You can use the (a + x) % x technique with a timedelta of 24 hours (or 1 day, which is the same):
adding timedelta(hours=24) makes every difference positive
taking % timedelta(hours=24) brings the values that are now 24 hours or more back below 24 hours
from datetime import timedelta
df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start']) + timedelta(hours=24)) \
                 % timedelta(hours=24)
Gives
Start End duration
0 11:15 15:00 0 days 03:45:00
1 22:30 2:00 0 days 03:30:00
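The same wrap-around idea can be sketched without attaching a dummy date, by parsing the hh:mm strings as timedeltas. This is a variation on the answer above rather than its exact code: appending ":00" to get an h:mm:ss string and the name one_day are choices made here, and it assumes every duration is shorter than 24 hours.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({"Start": ["11:15", "22:30"], "End": ["15:00", "2:00"]})

# Parse hh:mm as a time-of-day offset by appending seconds.
start = pd.to_timedelta(df["Start"] + ":00")
end = pd.to_timedelta(df["End"] + ":00")

one_day = pd.Timedelta(days=1)

# (a + x) % x: shift every difference to be positive, then wrap back below 24 hours.
df["duration"] = (end - start + one_day) % one_day
print(df)
```

Because the arithmetic is vectorised over whole columns, it should scale to large frames without a Python-level loop.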

Obtaining a list of items in a pandas column of timestamps where the time difference from one row to next is zero

I have a dataframe with a column of timestamps for individual trades that occurred on BitMEX. I am now trying to work out the time difference between each timestamp and the next using
timediff = df6['timestamp'].diff()
Then when I try df6.timediff.isnull().sum() I get the value as "one" which is the NaT value on top of the column for the first value.
However then when I draw a histogram I see many zeros. On inspecting the dataframe I see many rows with zero total.
Below are the timestamps I can see after doing the .diff(). I also see the timestamp no longer displays milliseconds either.
7463 0 days 00:00:00.342889
7464 0 days 00:01:07.891225
7465 0 days 00:00:00
7466 0 days 00:00:00.038494
7467 0 days 00:00:00.135066
7468 0 days 00:00:00
7469 0 days 00:00:00
7470 0 days 00:00:00
7471 0 days 00:00:00
7472 0 days 00:00:01.122758
7473 0 days 00:00:00.728908
7474 0 days 00:00:13.272938
My question is: how do I find the number of rows where the time difference is actually zero? In this case the values above are the differences in time between consecutive rows, i.e. t(i) - t(i-1).
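One way to approach this, sketched under the assumption that the timestamp column is datetime64, is to compare the diff against a zero Timedelta and sum the resulting boolean mask. The sample timestamps below are made up for illustration; only the column name timestamp comes from the question.

```python
import pandas as pd

# Hypothetical timestamps; rows 2 and 3 repeat the timestamp of row 1.
ts = pd.Series(pd.to_datetime([
    "2019-01-01 12:00:00.000000",
    "2019-01-01 12:00:00.342889",
    "2019-01-01 12:00:00.342889",
    "2019-01-01 12:00:00.342889",
    "2019-01-01 12:00:01.465647",
]), name="timestamp")

timediff = ts.diff()

# True where the gap to the previous row is exactly zero.
# The leading NaT compares as False, so it is not counted.
is_zero = timediff == pd.Timedelta(0)

n_zero = int(is_zero.sum())   # number of zero-gap rows
zero_rows = ts[is_zero]       # the timestamps on those rows
print(n_zero)
print(zero_rows)
```

The same mask can be used on the original dataframe (for example df6[is_zero]) to list the affected trades, provided the mask and the frame share the same index.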

Reformatting and Reordering Dates in a Python Pandas Series

I have a pandas DataFrame and I want to reformat AND order the Date Range column.
This is the df.head():
Numeric Index Origin Movement ID Origin Display Name Destination Movement ID Destination Display Name Date Range Mean Travel Time (Seconds) Range - Lower Bound Travel Time (Seconds) Range - Upper Bound Travel Time (Seconds)
0 0 1074 Traffic Zone 02047 28 Traffic Zone 16024 1/4/2016 - 1/4/2016, Every day, Daily Average 2296 1593 3309
1 1 1074 Traffic Zone 02047 29 Traffic Zone 16025 1/4/2016 - 1/4/2016, Every day, Daily Average 2378 1662 3402
2 2 1074 Traffic Zone 02047 35 Traffic Zone 14080 1/4/2016 - 1/4/2016, Every day, Daily Average 1846 1703 2000
3 3 1074 Traffic Zone 02047 43 Traffic Zone 14072 1/4/2016 - 1/4/2016, Every day, Daily Average 1797 1647 1959
4 4 1074 Traffic Zone 02047 48 Traffic Zone 16027 1/4/2016 - 1/4/2016, Every day, Daily Average 2301 1670 3168
My df['Date Range'] strings are dates from January 2nd 2016 to March 31st 2020 and they are in the following format:
1 1/4/2016 - 1/4/2016, Every day, Daily Average
2 1/4/2016 - 1/4/2016, Every day, Daily Average
3 1/4/2016 - 1/4/2016, Every day, Daily Average
4 1/4/2016 - 1/4/2016, Every day, Daily Average
...
542 1/2/2016 - 1/2/2016, Every day, Daily Average
543 1/2/2016 - 1/2/2016, Every day, Daily Average
544 1/2/2016 - 1/2/2016, Every day, Daily Average
545 1/2/2016 - 1/2/2016, Every day, Daily Average
546 1/2/2016 - 1/2/2016, Every day, Daily Average
How do I transform "1/2/2016 - 1/2/2016, Every day, Daily Average" into "2016-01-02" for every date and order them by date?
Note: The string has two dates and they are the same, for every row, that's why I want to transform them into one date only.
You can split on whitespace, select the first value, and convert it to datetime with to_datetime and its format parameter (month first, since "1/2/2016" should become "2016-01-02"); finally, if necessary, use DataFrame.sort_values:
df['Date Range'] = pd.to_datetime(df['Date Range'].str.split().str[0], format='%m/%d/%Y')
df = df.sort_values('Date Range')
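A small runnable sketch of the split, parse and sort steps follows. It assumes the strings are month/day/year, which is what the question implies ("1/2/2016" should become "2016-01-02", and the data runs to March 31st 2020); the three sample rows are illustrative and only the Date Range column is reproduced.

```python
import pandas as pd

# Illustrative rows in the question's string format.
df = pd.DataFrame({"Date Range": [
    "1/4/2016 - 1/4/2016, Every day, Daily Average",
    "3/31/2020 - 3/31/2020, Every day, Daily Average",
    "1/2/2016 - 1/2/2016, Every day, Daily Average",
]})

# Take the first whitespace-separated token (the start date) and parse it
# as month/day/year.
df["Date Range"] = pd.to_datetime(
    df["Date Range"].str.split().str[0], format="%m/%d/%Y"
)

# Order chronologically.
df = df.sort_values("Date Range").reset_index(drop=True)
print(df)
```

The column is stored as datetime64 and prints as 2016-01-02 and so on; if string output is required, .dt.strftime('%Y-%m-%d') can be applied afterwards.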
