I would like to get a cumulative sum of Tran_amt for each Cust_ID within 24 hours of the first transaction. Please see the example below for illustration.
Original Data

DateTime              Tran_amt  Cust_ID
1/1/2021 2:00:00 PM   1000      c103102
1/1/2021 3:00:00 PM   2000      c103102
1/2/2021 10:00:00 AM  2000      c103102
1/2/2021 11:00:00 AM  1000      c211203
1/2/2021 12:00:00 PM  1000      c103102
1/2/2021 5:00:00 PM   2000      c103102
1/3/2021 3:00:00 AM   1000      c211203
Expected Output Data

DateTime              Tran_amt  Cust_ID  First Transaction DateTime  Cumulative_amt  Remark
1/1/2021 2:00:00 PM   1000      c103102  1/1/2021 2:00:00 PM         1000
1/1/2021 3:00:00 PM   2000      c103102  1/1/2021 2:00:00 PM         3000
1/2/2021 10:00:00 AM  2000      c103102  1/1/2021 2:00:00 PM         5000
1/2/2021 11:00:00 AM  1000      c211203  1/2/2021 11:00:00 AM        1000
1/2/2021 12:00:00 PM  1000      c103102  1/1/2021 2:00:00 PM         6000
1/2/2021 5:00:00 PM   2000      c103102  1/2/2021 5:00:00 PM         2000            (*)
1/3/2021 3:00:00 AM   1000      c211203  1/2/2021 11:00:00 AM        2000

(*) This transaction's datetime exceeds 24 hours from the previous first-transaction datetime, so the cumulative_amt is reset.
Hope someone can help me with the above question. Thanks a lot.
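For anyone attempting this, here is a minimal sketch of one approach (assuming the data above is in a DataFrame df with columns DateTime, Tran_amt, and Cust_ID; the two new column names are mine). Because the 24-hour anchor resets, a plain rolling window will not do, so a per-customer pass is used:

import pandas as pd

def cum_within_24h(g):
    # Walk one customer's transactions in time order; restart the window
    # whenever a transaction falls more than 24h after the current anchor.
    anchor, cum, firsts, cums = None, 0, [], []
    for t, amt in zip(g['DateTime'], g['Tran_amt']):
        if anchor is None or t - anchor > pd.Timedelta(hours=24):
            anchor, cum = t, 0
        cum += amt
        firsts.append(anchor)
        cums.append(cum)
    return g.assign(First_Tran_DateTime=firsts, Cumulative_amt=cums)

df['DateTime'] = pd.to_datetime(df['DateTime'])
df = (df.sort_values(['Cust_ID', 'DateTime'])
        .groupby('Cust_ID', group_keys=False)
        .apply(cum_within_24h))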
I am a beginner in Python. These readings are extracted from sensors which report to the system at 20-minute intervals. Now, I would like to find the total downtime from the start time until the end time, when the sensor recovered.
Original Data:
date, Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as below:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
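The snippets below assume the sample has already been loaded into a DataFrame df with a parsed date column, for instance (the file name is hypothetical):

import pandas as pd

# Hypothetical file name: the sample above saved as CSV with columns
# 'date' and 'Quality Sensor Reading'
df = pd.read_csv('readings.csv', parse_dates=['date'])
df = df.sort_values('date').reset_index(drop=True)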
First, let's break them into groups:
# Mark a new group wherever the gap to the previous reading exceeds 20 minutes
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
# Turn the markers into group ids: cumulative-sum, forward-fill, first group is 0
df.group = df.group.cumsum().ffill().fillna(0)
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and format it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here; total_seconds() stays correct past 24h:
df2.down_time = df2.down_time.dt.total_seconds() / 60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
Let's say the dates are listed in the dataframe df under column date. You can use shift() to create a second column with the subsequent date/time, then create a third that has your duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.
I am trying to find the hourly active time of a user from the user activity data. Below are the sample input and output.
Input
ID Status Datetime
A Online 24/09/2017 7:00:00 AM
A Offline 24/09/2017 7:30:00 AM
A Online 24/09/2017 9:30:00 AM
A Offline 24/09/2017 10:00:00 AM
B Online 24/09/2017 6:00:00 AM
B Offline 24/09/2017 7:30:00 AM
B Online 24/09/2017 9:10:00 AM
B Offline 24/09/2017 9:30:00 AM
B Online 24/09/2017 9:40:00 AM
B Offline 24/09/2017 10:00:00 AM
Expected Output
ID Hour_start Hour_end Online_time
A 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
A 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
A 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 1800
B 24/09/2017 6:00:00 AM 24/09/2017 7:00:00 AM 3600
B 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
B 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
B 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 2400
Please help me out. TIA
My solution from Pandas Grouper calculate time elapsed between events also gives proper results for this source data.
The result is:
ID Hour_start Hour_end Online_time
0 A 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
1 A 2017-09-24 08:00:00 2017-09-24 09:00:00 0
2 A 2017-09-24 09:00:00 2017-09-24 10:00:00 1800
3 B 2017-09-24 06:00:00 2017-09-24 07:00:00 3600
4 B 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
5 B 2017-09-24 08:00:00 2017-09-24 09:00:00 0
6 B 2017-09-24 09:00:00 2017-09-24 10:00:00 2400
Just as your expected result, so I don't see any error in my solution. If you have any source data for which my solution gives a wrong result, add that data to your post.
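Since the linked post may not be at hand, here is a minimal sketch of one way to get this result (assuming, as in the sample, that each ID's rows strictly alternate Online/Offline; the intermediate names are mine):

import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'], format='%d/%m/%Y %I:%M:%S %p')

rows = []
for uid, g in df.sort_values('Datetime').groupby('ID'):
    # Pair each Online row with the following Offline row
    starts = g.loc[g['Status'] == 'Online', 'Datetime']
    ends = g.loc[g['Status'] == 'Offline', 'Datetime']
    for start, end in zip(starts, ends):
        # Split the session across every hour boundary it crosses
        hour = start.floor('h')
        while hour < end:
            nxt = hour + pd.Timedelta(hours=1)
            online = (min(end, nxt) - max(start, hour)).total_seconds()
            rows.append((uid, hour, nxt, int(online)))
            hour = nxt

out = (pd.DataFrame(rows, columns=['ID', 'Hour_start', 'Hour_end', 'Online_time'])
         .groupby(['ID', 'Hour_start', 'Hour_end'], as_index=False)['Online_time']
         .sum())

Hours with no activity at all (such as A's 8:00-9:00 row) come out missing rather than zero, so they would still need to be filled in afterwards, e.g. by reindexing each ID over its full hour range.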
Hi, I have the following data set. The data is large, and here I am presenting the first 30 rows. It is a time series of water level data at 15-minute intervals.
    datetime  time (min)  time(Sec)  WL (mm)
0   1/2/2021  0:00:00     0          7109.380
1   1/2/2021  0:15:00     900        7108.028
2   1/2/2021  0:30:00     1800       7107.959
3   1/2/2021  0:45:00     2700       7109.185
4   1/2/2021  1:00:00     3600       7109.045
5   1/2/2021  1:15:00     4500       7110.831
6   1/2/2021  1:30:00     5400       7110.585
7   1/2/2021  1:45:00     6300       7110.997
8   1/2/2021  2:00:00     7200       7109.854
9   1/2/2021  2:15:00     8100       7109.671
10  1/2/2021  2:30:00     9000       7110.530
11  1/2/2021  2:45:00     9900       7110.583
12  1/2/2021  3:00:00     10800      7111.532
13  1/2/2021  3:15:00     11700      7110.585
14  1/2/2021  3:30:00     12600      7111.124
15  1/2/2021  3:45:00     13500      7111.877
16  1/2/2021  4:00:00     14400      7110.813
17  1/2/2021  4:15:00     15300      7112.031
18  1/2/2021  4:30:00     16200      7113.617
19  1/2/2021  4:45:00     17100      7111.739
20  1/2/2021  5:00:00     18000      7112.435
21  1/2/2021  5:15:00     18900      7110.201
22  1/2/2021  5:30:00     19800      7111.451
23  1/2/2021  5:45:00     20700      7111.533
24  1/2/2021  6:00:00     21600      7112.126
25  1/2/2021  6:15:00     22500      7110.860
26  1/2/2021  6:30:00     23400      7112.207
27  1/2/2021  6:45:00     24300      7110.383
28  1/2/2021  7:00:00     25200      7110.979
29  1/2/2021  7:15:00     26100      7109.918
I want to make the frequency distribution with the frequency unit in cycles per day. I have done it through another software package (the resulting plot is not shown here).
How can I make the frequency plot from the CSV in Python?
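A minimal sketch of one way to do this with numpy and matplotlib, assuming the CSV has the columns shown above and the 15-minute spacing is uniform (the file name is hypothetical):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('water_level.csv')  # hypothetical file name
wl = df['WL (mm)'].to_numpy()
wl = wl - wl.mean()  # remove the mean so the zero-frequency bin doesn't dominate

# 15-minute sampling = 96 samples per day, so d = 1/96 day gives cycles per day
freqs = np.fft.rfftfreq(len(wl), d=1/96)
amp = np.abs(np.fft.rfft(wl))

plt.plot(freqs, amp)
plt.xlabel('Frequency (cycles per day)')
plt.ylabel('Amplitude')
plt.show()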
I want to parse my dt0 column, which is in the format "1/1/2020 12:00:00 PM" and is incremented every 0.1 sec, with "to_datetime" in Python.
I tried the following, but it gives the error "Out of bounds nanosecond timestamp: 1-01-01 00:00:00":
df.dt0 = pd.to_datetime(df.dt0)
Is it because my intervals are too small? Can someone recommend a better/working solution?
Here are the columns that I want to convert from my table:
No.  Date       Time
1    1/15/2020  12:00:00 PM
2    1/15/2020  12:00:00 PM
3    1/15/2020  12:00:00 PM
4    1/15/2020  12:00:00 PM
5    1/15/2020  12:00:00 PM
6    1/15/2020  12:00:00 PM
7    1/15/2020  12:00:00 PM
8    1/15/2020  12:00:00 PM
9    1/15/2020  12:00:00 PM
10   1/15/2020  12:00:00 PM
11   1/15/2020  12:00:00 PM
12   1/15/2020  12:00:00 PM
First, convert both columns, joined together, to datetimes:
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
format='%m/%d/%Y %I:%M:%S %p')
print (df)
No. Date Time datetime
0 1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
1 2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
2 3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
3 4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
4 5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
5 6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
6 7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
7 8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
8 9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
9 10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
10 11 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
11 12 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
Then add a counter of duplicated datetimes with GroupBy.cumcount, multiplied by 100 (0.1 s = 100 ms) and converted to timedeltas by to_timedelta:
df['datetime'] += pd.to_timedelta(df.groupby('datetime').cumcount().mul(100), unit='ms')
print (df)
No. Date Time datetime
0 1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.000
1 2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.100
2 3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.200
3 4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.300
4 5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.400
5 6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.500
6 7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.600
7 8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.700
8 9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.800
9 10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.900
10 11 1/15/2020 12:00:00 PM 2020-01-15 12:00:01.000
11 12 1/15/2020 12:00:00 PM 2020-01-15 12:00:01.100
In order to get your target, use the following line:
df['dt0'] = df.apply(lambda x: pd.to_datetime(x.Date + ' ' + x.Time), axis=1)
Verification:
df
Date Time dt0
No.
1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
11 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
12 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 1 to 12
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 12 non-null object
1 Time 12 non-null object
2 dt0 12 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 384.0+ bytes
As seen, dt0 is datetime64[ns].
I have an indexed dataframe (indexed by type, then date) and would like to carry out a subtraction, in hours, between the end time of one row and the start time of the next row:
type date start_time end_time code
A 01/01/2018 01/01/2018 9:00 01/01/2018 14:00 525
01/02/2018 01/02/2018 5:00 01/02/2018 17:00 524
01/04/2018 01/04/2018 8:00 01/04/2018 10:00 528
B 01/01/2018 01/01/2018 5:00 01/01/2018 14:00 525
01/04/2018 01/04/2018 2:00 01/04/2018 17:00 524
01/05/2018 01/05/2018 7:00 01/05/2018 10:00 528
I would like to get the resulting table with a new column['interval']:
type date interval
A 01/01/2018 -
01/02/2018 15
01/04/2018 39
B 01/01/2018 -
01/04/2018 60
01/05/2018 14
The interval column is in hours
You can convert start_time and end_time to datetime format, then use apply to subtract the end_time of the previous row in each group (using groupby). To convert to hours, divide by pd.Timedelta('1 hour'):
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# Per type: current start_time minus previous end_time, expressed in hours
df['interval'] = (df.groupby(level=0, sort=False)
                    .apply(lambda x: x.start_time - x.end_time.shift(1))
                  / pd.Timedelta('1 hour')).values
>>> df
start_time end_time code interval
type date
A 01/01/2018 2018-01-01 09:00:00 2018-01-01 14:00:00 525 NaN
01/02/2018 2018-01-02 05:00:00 2018-01-02 17:00:00 524 15.0
01/04/2018 2018-01-04 08:00:00 2018-01-04 10:00:00 528 39.0
B 01/01/2018 2018-01-01 05:00:00 2018-01-01 14:00:00 525 NaN
01/04/2018 2018-01-04 02:00:00 2018-01-04 17:00:00 524 60.0
01/05/2018 2018-01-05 07:00:00 2018-01-05 10:00:00 528 14.0
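For what it's worth, a grouped shift gives the same result without apply (a sketch assuming the same MultiIndex layout as above):

# Previous row's end_time within each type, then the gap in hours
prev_end = df.groupby(level=0)['end_time'].shift(1)
df['interval'] = (df['start_time'] - prev_end) / pd.Timedelta('1 hour')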