How to get cumulative sum between two different dates under conditions - python

I would like to get a cumulative sum of Tran_amt for each Cust_ID within 24 hours of its first transaction. Please see the example below for illustration.
Original Data
DateTime              Tran_amt  Cust_ID
1/1/2021 2:00:00 PM   1000      c103102
1/1/2021 3:00:00 PM   2000      c103102
1/2/2021 10:00:00 AM  2000      c103102
1/2/2021 11:00:00 AM  1000      c211203
1/2/2021 12:00:00 PM  1000      c103102
1/2/2021 5:00:00 PM   2000      c103102
1/3/2021 3:00:00 AM   1000      c211203
Expected Output Data
DateTime              Tran_amt  Cust_ID  First Transaction DateTime  Cumulative_amt  Remark
1/1/2021 2:00:00 PM   1000      c103102  1/1/2021 2:00:00 PM         1000
1/1/2021 3:00:00 PM   2000      c103102  1/1/2021 2:00:00 PM         3000
1/2/2021 10:00:00 AM  2000      c103102  1/1/2021 2:00:00 PM         5000
1/2/2021 11:00:00 AM  1000      c211203  1/2/2021 11:00:00 AM        1000
1/2/2021 12:00:00 PM  1000      c103102  1/1/2021 2:00:00 PM         6000
1/2/2021 5:00:00 PM   2000      c103102  1/2/2021 5:00:00 PM         2000            The transaction datetime exceeds 24 hours after the previous first-transaction datetime, so the cumulative_amt is reset
1/3/2021 3:00:00 AM   1000      c211203  1/2/2021 11:00:00 AM        2000
Hope someone can help me with the above question. Thanks a lot.
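A minimal sketch of one way to do this, assuming the columns are named DateTime, Tran_amt and Cust_ID as in the tables above: sort each customer's transactions by time and reset the running total whenever a transaction falls more than 24 hours after the current window's first transaction.

import pandas as pd

df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.sort_values(['Cust_ID', 'DateTime'])

def add_window_cumsum(g):
    # walk one customer's transactions in time order; a transaction more than
    # 24 hours after the current window's first one opens a new window
    first, cum, firsts, cums = None, 0, [], []
    for dt, amt in zip(g['DateTime'], g['Tran_amt']):
        if first is None or dt - first > pd.Timedelta(hours=24):
            first, cum = dt, 0
        cum += amt
        firsts.append(first)
        cums.append(cum)
    g = g.copy()
    g['First_Tran_DateTime'] = firsts
    g['Cumulative_amt'] = cums
    return g

df = df.groupby('Cust_ID', group_keys=False).apply(add_window_cumsum)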

Related

Pandas : Dataframe Output System Down Time

I am a beginner in Python. These readings are extracted from sensors that report to the system at 20-minute intervals. I would like to find the total downtime of each outage, from its start time until the time it recovered.
Original Data:
date           Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as below:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
First, let's break them into groups:
# mark rows where the gap since the previous reading exceeds 20 minutes
# (assumes df.date is already datetime64), then number and fill the groups
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
df.group = df.group.cumsum().ffill().fillna(0)
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and format it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here:
df2.down_time = df2.down_time.dt.seconds/60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
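One caveat on that optional last step, in case an outage can ever span more than a day: .dt.seconds returns only the seconds component of each timedelta and silently drops whole days, so .dt.total_seconds() is the safer conversion:
df2.down_time = df2.down_time.dt.total_seconds() / 60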
Let's say the dates are listed in the dataframe df under column date. You can use shift() to create a second column with the subsequent date/time, then create a third that has your duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.

Summarize active time of user activity log data in hourly buckets

I am trying to find the hourly active time of a user from the user activity data. Below are the sample input and output.
Input
ID Status Datetime
A Online 24/09/2017 7:00:00 AM
A Offline 24/09/2017 7:30:00 AM
A Online 24/09/2017 9:30:00 AM
A Offline 24/09/2017 10:00:00 AM
B Online 24/09/2017 6:00:00 AM
B Offline 24/09/2017 7:30:00 AM
B Online 24/09/2017 9:10:00 AM
B Offline 24/09/2017 9:30:00 AM
B Online 24/09/2017 9:40:00 AM
B Offline 24/09/2017 10:00:00 AM
Expected Output
ID Hour_start Hour_end Online_time
A 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
A 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
A 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 1800
B 24/09/2017 6:00:00 AM 24/09/2017 7:00:00 AM 3600
B 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
B 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
B 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 2400
Please help me out. TIA
My solution from Pandas Grouper calculate time elapsed between events gives proper results for this source data as well.
The result is:
ID Hour_start Hour_end Online_time
0 A 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
1 A 2017-09-24 08:00:00 2017-09-24 09:00:00 0
2 A 2017-09-24 09:00:00 2017-09-24 10:00:00 1800
3 B 2017-09-24 06:00:00 2017-09-24 07:00:00 3600
4 B 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
5 B 2017-09-24 08:00:00 2017-09-24 09:00:00 0
6 B 2017-09-24 09:00:00 2017-09-24 10:00:00 2400
This is just your expected result, so I don't see any error in my solution.
If you have any source data for which my solution gives a wrong result,
add that data to your post.
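For reference, a sketch of one way to produce such output (this is not the code from the linked answer), assuming rows alternate Online/Offline within each ID as in the sample:

import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'], format='%d/%m/%Y %I:%M:%S %p')

# pair each Online row with the following Offline row
# (assumes strict Online/Offline alternation within each ID)
online = df[df.Status == 'Online'].reset_index(drop=True)
offline = df[df.Status == 'Offline'].reset_index(drop=True)

rows = []
for uid, start, end in zip(online.ID, online.Datetime, offline.Datetime):
    hour = start.floor('h')
    while hour < end:  # clip the session against each hourly bin it touches
        nxt = hour + pd.Timedelta(hours=1)
        secs = (min(end, nxt) - max(start, hour)).total_seconds()
        rows.append((uid, hour, nxt, int(secs)))
        hour = nxt

out = (pd.DataFrame(rows, columns=['ID', 'Hour_start', 'Hour_end', 'Online_time'])
         .groupby(['ID', 'Hour_start', 'Hour_end'], as_index=False)['Online_time'].sum())
# hours with no activity (the zero rows in the expected output) can then be
# filled by reindexing each ID over its full hourly range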

How to calculate and plot frequency of CSV time series data from FFT

Hi, I have the following data set. The data is large, and here I am presenting the first 30 rows. It is a time series of water level data at 15-minute intervals.
    datetime  time (min)  time(Sec)  WL (mm)
0   1/2/2021  0:00:00     0          7109.380
1   1/2/2021  0:15:00     900        7108.028
2   1/2/2021  0:30:00     1800       7107.959
3   1/2/2021  0:45:00     2700       7109.185
4   1/2/2021  1:00:00     3600       7109.045
5   1/2/2021  1:15:00     4500       7110.831
6   1/2/2021  1:30:00     5400       7110.585
7   1/2/2021  1:45:00     6300       7110.997
8   1/2/2021  2:00:00     7200       7109.854
9   1/2/2021  2:15:00     8100       7109.671
10  1/2/2021  2:30:00     9000       7110.530
11  1/2/2021  2:45:00     9900       7110.583
12  1/2/2021  3:00:00     10800      7111.532
13  1/2/2021  3:15:00     11700      7110.585
14  1/2/2021  3:30:00     12600      7111.124
15  1/2/2021  3:45:00     13500      7111.877
16  1/2/2021  4:00:00     14400      7110.813
17  1/2/2021  4:15:00     15300      7112.031
18  1/2/2021  4:30:00     16200      7113.617
19  1/2/2021  4:45:00     17100      7111.739
20  1/2/2021  5:00:00     18000      7112.435
21  1/2/2021  5:15:00     18900      7110.201
22  1/2/2021  5:30:00     19800      7111.451
23  1/2/2021  5:45:00     20700      7111.533
24  1/2/2021  6:00:00     21600      7112.126
25  1/2/2021  6:15:00     22500      7110.860
26  1/2/2021  6:30:00     23400      7112.207
27  1/2/2021  6:45:00     24300      7110.383
28  1/2/2021  7:00:00     25200      7110.979
29  1/2/2021  7:15:00     26100      7109.918
I want to make the frequency distribution with the frequency unit in cycles per day. I have done it through another piece of software (plot not shown here).
How can I make the frequency plot from the CSV in Python?
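A sketch of the usual NumPy approach, assuming the data is in a file named water_level.csv with the column label 'WL (mm)' as above (both names are assumptions) and a fixed 15-minute sampling interval:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('water_level.csv')    # assumed file name
wl = df['WL (mm)'].to_numpy()
wl = wl - wl.mean()                    # remove the mean so the zero-frequency term doesn't dominate

dt = 15 * 60                           # sampling interval: 15 minutes, in seconds
freq_hz = np.fft.rfftfreq(len(wl), d=dt)    # frequencies in Hz
amp = np.abs(np.fft.rfft(wl)) / len(wl)     # amplitude spectrum

plt.plot(freq_hz * 86400, amp)         # 86400 s/day turns Hz into cycles per day
plt.xlabel('Frequency (cycles/day)')
plt.ylabel('Amplitude (mm)')
plt.show()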

Want to implement "to_datetime" function for my case

I want to parse my dt0 column, which is in the format "1/1/2020 12:00:00 PM" and is incremented by 0.1 sec, with "to_datetime" in Python.
I tried the following, but it gives the error "Out of bounds nanosecond timestamp: 1-01-01 00:00:00".
df.dt0 = pd.to_datetime(df.dt0)
Is it because my intervals are too small? Can someone recommend a better/working solution?
Here are the columns that I want to convert from my table
No.  Date       Time
1    1/15/2020  12:00:00 PM
2    1/15/2020  12:00:00 PM
3    1/15/2020  12:00:00 PM
4    1/15/2020  12:00:00 PM
5    1/15/2020  12:00:00 PM
6    1/15/2020  12:00:00 PM
7    1/15/2020  12:00:00 PM
8    1/15/2020  12:00:00 PM
9    1/15/2020  12:00:00 PM
10   1/15/2020  12:00:00 PM
11   1/15/2020  12:00:00 PM
12   1/15/2020  12:00:00 PM
First, join both columns and convert them to datetimes:
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                                format='%m/%d/%Y %I:%M:%S %p')
print (df)
No. Date Time datetime
0 1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
1 2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
2 3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
3 4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
4 5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
5 6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
6 7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
7 8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
8 9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
9 10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
10 11 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
11 12 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
Then add a counter of the duplicated datetimes with GroupBy.cumcount, multiply it by 100, and convert to timedeltas with to_timedelta (100 ms steps, i.e. 0.1 sec):
df['datetime'] += pd.to_timedelta(df.groupby('datetime').cumcount().mul(100), unit='ms')
print (df)
No. Date Time datetime
0 1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.000
1 2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.100
2 3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.200
3 4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.300
4 5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.400
5 6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.500
6 7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.600
7 8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.700
8 9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.800
9 10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.900
10 11 1/15/2020 12:00:00 PM 2020-01-15 12:00:01.000
11 12 1/15/2020 12:00:00 PM 2020-01-15 12:00:01.100
In order to get your target, use the following line:
df['dt0'] = df.apply(lambda x: pd.to_datetime(x.Date + ' ' + x.Time), axis=1)
Verification:
df
Date Time dt0
No.
1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
11 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
12 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 1 to 12
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 12 non-null object
1 Time 12 non-null object
2 dt0 12 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 384.0+ bytes
As seen, dt0 is datetime64[ns].

Subtracting between rows of different columns in an indexed dataframe in python

I have an indexed dataframe (indexed by type, then date) and would like to compute, in hours, the difference between the end time of one row and the start time of the next row:
type date start_time end_time code
A 01/01/2018 01/01/2018 9:00 01/01/2018 14:00 525
01/02/2018 01/02/2018 5:00 01/02/2018 17:00 524
01/04/2018 01/04/2018 8:00 01/04/2018 10:00 528
B 01/01/2018 01/01/2018 5:00 01/01/2018 14:00 525
01/04/2018 01/04/2018 2:00 01/04/2018 17:00 524
01/05/2018 01/05/2018 7:00 01/05/2018 10:00 528
I would like to get the resulting table with a new 'interval' column:
type date interval
A 01/01/2018 -
01/02/2018 15
01/04/2018 39
B 01/01/2018 -
01/04/2018 60
01/05/2018 14
The interval column is in hours.
You can convert start_time and end_time to datetime format, then use apply to subtract the end_time of the previous row in each group (using groupby). To convert to hours, divide by pd.Timedelta('1 hour'):
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# within each type (index level 0), subtract the previous row's end_time
# from the current row's start_time, then express the result in hours
df['interval'] = (df.groupby(level=0, sort=False)
                    .apply(lambda x: x.start_time - x.end_time.shift(1))
                    / pd.Timedelta('1 hour')).values
>>> df
start_time end_time code interval
type date
A 01/01/2018 2018-01-01 09:00:00 2018-01-01 14:00:00 525 NaN
01/02/2018 2018-01-02 05:00:00 2018-01-02 17:00:00 524 15.0
01/04/2018 2018-01-04 08:00:00 2018-01-04 10:00:00 528 39.0
B 01/01/2018 2018-01-01 05:00:00 2018-01-01 14:00:00 525 NaN
01/04/2018 2018-01-04 02:00:00 2018-01-04 17:00:00 524 60.0
01/05/2018 2018-01-05 07:00:00 2018-01-05 10:00:00 528 14.0
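A note on that last line: the .values trick relies on the group order matching the frame's row order. An index-aligned alternative (a sketch, same column assumptions) sidesteps that:
# shift end_time within each type (index level 0); the result stays
# index-aligned, so no .values realignment is needed
df['interval'] = (df['start_time'] - df.groupby(level=0)['end_time'].shift()) / pd.Timedelta('1 hour')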
