I have a JSON array in which each record has an id, a start time, and an end time. I want to calculate the average time each user was active. Some records may have only a start time and no end time.
Example data:
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":2, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":3, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":4, "stime":"2020-09-23T06:25:36Z","etime": "2020-09-29T09:25:36Z"}]
My approach: take the difference between the start time and the end time of each record, sum all the differences, and divide by the total number of ids.
Sample code:
import datetime
from datetime import timedelta
import dateutil.parser
datetimeFormat = '%Y-%m-%d %H:%M:%S.%f'
date_s_time = '2020-09-21T06:25:36Z'
date_e_time = '2020-09-22T09:25:36Z'
d1 = dateutil.parser.parse(date_s_time)
d2 = dateutil.parser.parse(date_e_time)
diff1 = datetime.datetime.strptime(d2.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d1.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 1:", diff1)
date_s_time2 = '2020-09-20T06:25:36Z'
date_e_time2 = '2020-09-28T02:25:36Z'
d3 = dateutil.parser.parse(date_s_time2)
d4 = dateutil.parser.parse(date_e_time2)
diff2 = datetime.datetime.strptime(d4.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d3.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 2:", diff2)
print("total", diff1+diff2)
# Parentheses are needed here; without them only diff2 is halved.
print((diff1+diff2)/2)
Please suggest a better, more efficient approach.
You could use the pandas library.
import pandas as pd
import dateutil.parser
from datetime import datetime
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":1, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":1, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":1, "stime":"2020-09-23T06:25:36Z"}]
(Let's say your last row has no end time)
Now, you can create a Pandas DataFrame using your data
df = pd.DataFrame(data)
df looks like so:
id stime etime
0 1 2020-09-21T06:25:36Z 2020-09-22T09:25:36Z
1 1 2020-09-22T02:24:36Z 2020-09-23T07:25:36Z
2 1 2020-09-20T06:25:36Z 2020-09-24T09:25:36Z
3 1 2020-09-23T06:25:36Z NaN
Now, we want to map the columns stime and etime so that the strings are converted to datetime objects, and fill NaNs with something that makes sense: if no end time exists, could we use the current time?
df = df.fillna(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))
df['etime'] = df['etime'].map(dateutil.parser.parse)
df['stime'] = df['stime'].map(dateutil.parser.parse)
Alternatively, to drop the rows that have no etime instead of filling them, replace the fillna line above with
df = df.dropna()
With the fillna approach, df becomes:
id stime etime
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00
Finally, subtract the two:
df['tdiff'] = df['etime'] - df['stime']
and we get:
id stime etime tdiff
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00 1 days 03:00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00 1 days 05:01:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00 4 days 03:00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00 1 days 13:40:06
The mean of this column is:
df['tdiff'].mean()
Output: Timedelta('2 days 00:10:16.500000')
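If pandas is not an option, the same calculation can be sketched with the standard library alone. This is only a minimal sketch, not the method used above: the helper names parse_utc and mean_active_time are made up for illustration, and it assumes every timestamp uses exactly the '%Y-%m-%dT%H:%M:%SZ' layout of the sample data. Records without an etime are skipped unless a fallback end time is passed in.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"  # assumed layout, matching the sample data


def parse_utc(text):
    """Parse a 'Z'-suffixed timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


def mean_active_time(records, fallback_end=None):
    """Average (etime - stime) over the records.

    Records with no etime use fallback_end if given; otherwise they are skipped.
    Returns None when no duration can be computed.
    """
    durations = []
    for rec in records:
        start = parse_utc(rec["stime"])
        if rec.get("etime"):
            end = parse_utc(rec["etime"])
        elif fallback_end is not None:
            end = fallback_end
        else:
            continue
        durations.append(end - start)
    if not durations:
        return None
    # sum() needs a timedelta start value, since the default start is the int 0
    return sum(durations, timedelta()) / len(durations)


data = [
    {"id": 1, "stime": "2020-09-21T06:25:36Z", "etime": "2020-09-22T09:25:36Z"},
    {"id": 1, "stime": "2020-09-22T02:24:36Z", "etime": "2020-09-23T07:25:36Z"},
    {"id": 1, "stime": "2020-09-20T06:25:36Z", "etime": "2020-09-24T09:25:36Z"},
    {"id": 1, "stime": "2020-09-23T06:25:36Z"},  # still active: no etime
]

print(mean_active_time(data))  # 2 days, 3:40:20 (open record skipped)
```

Passing fallback_end=datetime.now(timezone.utc) would instead count open records up to the current time, which mirrors the fillna idea above.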
I have two different sets of date ranges (the first two lines form one set, the remaining three the other):
01/05/2022 - 31/12/2022
01/01/2023 - 31/12/2023
01/05/2022 - 30/09/2022
01/10/2022 - 31/12/2022
01/01/2023 - 31/12/2023
I want to check whether each of the sets above is contiguous over the date range below:
Date 1 = 01/05/2022
Date 2 = 31/12/2023
Please suggest a solution.
It seems easier to me to use pandas to check whether the dates fall within the date range.
Your data is in day, month, year order; in practice I usually see year, month, day.
I converted 'Date_1' and 'Date_2' to datetime objects and split the date ranges into two lists, start and finish. I then filled a DataFrame with these lists and filtered it by the date range. For clarity I deliberately added one extra row, 2023-01-01 to 2025-12-31, which is filtered out because it does not satisfy the condition.
import pandas as pd
from datetime import datetime
Date_1 = '01/05/2022'
Date_2 = '31/12/2023'
Date_1 = datetime.strptime(Date_1, "%d/%m/%Y")
Date_2 = datetime.strptime(Date_2, "%d/%m/%Y")
start = [datetime.strptime(i, "%d/%m/%Y") for i in ['01/05/2022', '01/01/2023', '01/05/2022', '01/10/2022', '01/01/2023', '01/01/2023']]
finish = [datetime.strptime(i, "%d/%m/%Y") for i in ['31/12/2022', '31/12/2023', '30/09/2022', '31/12/2022', '31/12/2023', '31/12/2025']]
df = pd.DataFrame({'start': start, 'finish': finish})
print(df)
print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
Output of print(df):
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
5 2023-01-01 2025-12-31
Output of print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)]):
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
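The filter above only tells you which ranges lie inside Date_1..Date_2; it does not check that they join up. A minimal sketch of a contiguity check follows. It assumes "contiguous" means the ranges cover the whole period with no gap and no overlap, so each range starts the day after the previous one ends. The names parse_day and covers_contiguously and the with_gap example are hypothetical, not part of the original question.

```python
from datetime import datetime, timedelta


def parse_day(text):
    """Parse a dd/mm/YYYY string into a date."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def covers_contiguously(ranges, range_start, range_end):
    """True if the ranges tile [range_start, range_end] with no gap or overlap."""
    ordered = sorted((parse_day(s), parse_day(f)) for s, f in ranges)
    if not ordered:
        return False
    if ordered[0][0] != parse_day(range_start):
        return False
    if ordered[-1][1] != parse_day(range_end):
        return False
    one_day = timedelta(days=1)
    # Each range must start the day after the previous one finishes.
    return all(nxt[0] == cur[1] + one_day for cur, nxt in zip(ordered, ordered[1:]))


set_1 = [("01/05/2022", "31/12/2022"), ("01/01/2023", "31/12/2023")]
set_2 = [("01/05/2022", "30/09/2022"), ("01/10/2022", "31/12/2022"),
         ("01/01/2023", "31/12/2023")]
# Hypothetical set with a gap between October and December 2022
with_gap = [("01/05/2022", "30/09/2022"), ("01/01/2023", "31/12/2023")]

print(covers_contiguously(set_1, "01/05/2022", "31/12/2023"))     # True
print(covers_contiguously(set_2, "01/05/2022", "31/12/2023"))     # True
print(covers_contiguously(with_gap, "01/05/2022", "31/12/2023"))  # False
```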
How to convert a date to a week number
year_start = '2019-05-21'
year_end = '2020-02-22'
How do I get the week number based on the date that I set as first week?
For example 2019-05-21 should be Week 1 instead of 2019-01-01
If you do not have dates outside of year_start/year_end, use isocalendar().week and perform a simple subtraction with modulo:
import pandas as pd
year_start = pd.to_datetime('2019-05-21')
#year_end = pd.to_datetime('2020-02-22')
df = pd.DataFrame({'date': pd.date_range('2019-05-21', '2020-02-22', freq='30D')})
df['week'] = (df['date'].dt.isocalendar().week.astype(int)-year_start.isocalendar()[1])%52+1
Output:
date week
0 2019-05-21 1
1 2019-06-20 5
2 2019-07-20 9
3 2019-08-19 14
4 2019-09-18 18
5 2019-10-18 22
6 2019-11-17 26
7 2019-12-17 31
8 2020-01-16 35
9 2020-02-15 39
Try the following code.
import numpy as np
import pandas as pd
year_start = '2019-05-21'
year_end = '2020-02-22'
# Create a sample dataframe
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='D'), columns=['date'])
# Add the week number
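# Days since year_start, shifted back to the Monday of each date's week, then counted in whole weeks.
# (.dt.days replaces the original .view(np.int64) arithmetic, which newer pandas versions deprecate.)
df['week_number'] = (((df.date - pd.Timestamp(year_start)).dt.days - df.date.dt.day_of_week + 7) // 7 + 1).astype(np.int64)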
date        week_number
2019-05-21  1
2019-05-22  1
2019-05-23  1
2019-05-24  1
2019-05-25  1
2019-05-26  1
2019-05-27  2
2019-05-28  2
...
2020-02-18  40
2020-02-19  40
2020-02-20  40
2020-02-21  40
2020-02-22  40
If you just need a function that calculates the week number from a given start and end date:
import pandas as pd
import numpy as np
start_date = "2019-05-21"
end_date = "2020-02-22"
start_datetime = pd.to_datetime(start_date)
end_datetime = pd.to_datetime(end_date)
def get_week_no(date):
    given_datetime = pd.to_datetime(date)
    # if date in range
    if start_datetime <= given_datetime <= end_datetime:
        x = given_datetime - start_datetime
        # adding 1 as it will return 0 for 1st week
        return int(x / np.timedelta64(1, 'W')) + 1
    raise ValueError(f"Date is not in range {start_date} - {end_date}")
print(get_week_no("2019-05-21"))
The function computes the week number as the number of whole weeks between the given date and the start date, plus one.
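Note that the answers above do not all use the same convention. Counting 7-day blocks from the start date makes week 2 begin exactly seven days after the start, whereas the Monday-aligned approach makes week 2 begin on the first Monday after the start. A small standard-library sketch of both follows; the function names week_by_blocks and week_by_calendar are made up for illustration.

```python
from datetime import date

START = date(2019, 5, 21)  # a Tuesday


def week_by_blocks(d, start=START):
    """Week 1 is the 7 days beginning at start, week 2 the next 7, and so on."""
    return (d - start).days // 7 + 1


def week_by_calendar(d, start=START):
    """Week 1 is the Monday-to-Sunday calendar week containing start."""
    # Shift the day count so that it is measured from the Monday of start's week.
    return ((d - start).days + start.weekday()) // 7 + 1


for d in (date(2019, 5, 26), date(2019, 5, 27), date(2019, 5, 28)):
    print(d, week_by_blocks(d), week_by_calendar(d))
# 2019-05-26 1 1
# 2019-05-27 1 2
# 2019-05-28 2 2
```

Which one is right depends on whether "Week 1" should follow calendar weeks or simply count from the chosen start date.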
I have a time series DataFrame like this:
timestamp            signal_value
2017-08-28 00:00:00  10
2017-08-28 00:05:00  3
2017-08-28 00:10:00  5
2017-08-28 00:15:00  5
I am trying to get, for each month, the percentage of the time where "signal_value" is greater than 5. Something like:
Month     metric
January   16%
February  2%
March     8%
April     10%
I tried the following code, which gives the result for the whole dataset, but how can I summarize it for each month?
total, count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count/total)*100)
Thank you in advance.
Let us first generate some random data (the random-date helper is borrowed from elsewhere):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]
# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
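# The strings are day-first, so pass the format explicitly; otherwise day and month can be swapped.
df.timestamp = pd.to_datetime(df.timestamp, format='%d-%m-%Y %H:%M:%S')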
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column indicating whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month with groupby. The mean of a boolean column is the fraction of True values, so multiply by 100 if you want a percentage:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
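To get closer to the per-month layout asked for in the question, the grouping can also be done on a monthly period, which keeps year and month together in one key. This is a minimal sketch on a tiny frame: the first four rows are from the question, and the two September rows are made up so that there is more than one month to show.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-28 00:00:00", "2017-08-28 00:05:00",
        "2017-08-28 00:10:00", "2017-08-28 00:15:00",
        "2017-09-01 00:00:00", "2017-09-01 00:05:00",  # hypothetical rows
    ]),
    "signal_value": [10, 3, 5, 5, 7, 9],
})

# True where the signal is strictly greater than 5
above = df["signal_value"] > 5

# Fraction of True values per calendar month, expressed as a percentage
monthly_pct = above.groupby(df["timestamp"].dt.to_period("M")).mean() * 100

print(monthly_pct)  # 2017-08 -> 25.0, 2017-09 -> 100.0
```

Since each row is one 5-minute sample, the share of rows above the threshold is the share of time above it, provided the sampling interval is constant.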
import pandas as pd
import numpy as np
data = {'Name':['Si','Ov','Sp','Sa','An'],
'Time1':['02:00:00', '03:02:00', '04:00:30','01:02:30','0'],
'Time2':['03:00:00', '0', '05:00:30','02:02:30','02:00:00']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print (df)
Output
Name Time1 Time2
0 Si 02:00:00 03:00:00
1 Ov 03:02:00 0
2 Sp 04:00:30 05:00:30
3 Sa 01:02:30 02:02:30
4 An 0 02:00:00
I want to add one more column by applying the formula
Time3 = (Time1 - Time2) / Time2
Some values are 0 or NaN.
Use to_timedelta to convert the times to timedeltas:
t1 = pd.to_timedelta(df['Time1'])
t2 = pd.to_timedelta(df['Time2'])
df['Time3'] = t1.sub(t2).div(t2)
print (df)
Name Time1 Time2 Time3
0 Si 02:00:00 03:00:00 -0.333333
1 Ov 03:02:00 0 inf
2 Sp 04:00:30 05:00:30 -0.199667
3 Sa 01:02:30 02:02:30 -0.489796
4 An 0 02:00:00 -1.000000
EDIT:
To add a new row and column, use:
def format_timedelta(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
t1 = pd.to_timedelta(df['Time1'])
t2 = pd.to_timedelta(df['Time2'])
df['Time3'] = t1.sub(t2).div(t2)
idx = len(df)
df.loc[idx] = (pd.concat([t1, t2], axis=1)
                 .sum()
                 .apply(format_timedelta))
df.loc[idx, ['Name','Time3']] = ['Total', df['Time3'].mask(np.isinf(df['Time3'])).sum()]
print (df)
Name Time1 Time2 Time3
0 Si 02:00:00 03:00:00 -0.333333
1 Ov 03:02:00 0 inf
2 Sp 04:00:30 05:00:30 -0.199667
3 Sa 01:02:30 02:02:30 -0.489796
4 An 0 02:00:00 -1.000000
5 Total 10:05:00 12:03:00 -2.022796
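If an inf in Time3 is not wanted for rows where Time2 is 0, one option is to treat a zero denominator as missing so that the ratio comes out as NaN. The following is a minimal sketch of that idea, not part of the answer above; it works on the durations in seconds, which yields the same ratio as dividing the timedeltas directly.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Si", "Ov", "Sp", "Sa", "An"],
    "Time1": ["02:00:00", "03:02:00", "04:00:30", "01:02:30", "0"],
    "Time2": ["03:00:00", "0", "05:00:30", "02:02:30", "02:00:00"],
})

# Durations as plain seconds (floats)
s1 = pd.to_timedelta(df["Time1"]).dt.total_seconds()
s2 = pd.to_timedelta(df["Time2"]).dt.total_seconds()

# where() keeps s2 where it is non-zero and puts NaN elsewhere,
# so a zero denominator produces NaN instead of inf.
df["Time3"] = (s1 - s2) / s2.where(s2 != 0)

print(df)
```

With NaN in place of inf, df["Time3"].sum() skips that row by default, so the masking step used for the Total row is no longer needed.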
I am having trouble computing the duration. The DataFrame is:
import pandas as pd
data ={
'initial_time': ['2019-05-21 22:29:55','2019-10-07 17:43:09','2020-12-13 23:53:00','2018-04-17 23:51:23','2016-08-31 07:40:49'],
'final_time' : ['2019-05-22 01:10:30','2019-10-07 17:59:09','2020-12-13 00:30:10','2018-04-18 01:01:23','2016-08-31 08:45:49'],
'duration' : [0,0,0,0,0]
}
df =pd.DataFrame(data)
df
Output:
initial_time final_time duration
0 2019-05-21 22:29:55 2019-05-22 01:10:30 0
1 2019-10-07 17:43:09 2019-10-07 17:59:09 0
2 2020-12-13 23:53:00 2020-12-13 00:30:10 0
3 2018-04-17 23:51:23 2018-04-18 01:01:23 0
4 2016-08-31 07:40:49 2016-08-31 08:45:49 0
The output I expect is the total duration, i.e. final_time - initial_time.
Note: the data contains rows whose initial and final times fall on different dates (for example, the first row).
The problem can be broken down into 3 parts:
1. convert the strings to datetime objects
2. write a function that computes the duration between 2 datetime objects
3. apply the function to the new column of your dataframe
1) datetime_object = datetime.strptime('2019-05-21 22:29:55', '%Y-%m-%d %H:%M:%S')
2) See: How do I find the time difference between two datetime objects in python?
3) df['duration'] = df.apply(lambda row: getDuration(row['initial_time'], row['final_time'], 'seconds'), axis=1)
Here getDuration is not a built-in; it stands for the duration function you write in step 2.
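A minimal sketch of what such a helper could look like, using only the standard library. The name getDuration and its (initial, final, unit) signature are assumptions chosen to match the call in step 3, and the helper takes the raw strings so that it also covers step 1.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # layout of initial_time / final_time in the question

# Seconds per supported unit (an illustrative choice of units)
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600}


def getDuration(initial, final, unit="seconds"):
    """Return final - initial in the requested unit.

    Hypothetical stand-in for the helper used in step 3. The result is
    negative when final lies before initial.
    """
    delta = datetime.strptime(final, FMT) - datetime.strptime(initial, FMT)
    return delta.total_seconds() / SECONDS_PER_UNIT[unit]


print(getDuration("2019-05-21 22:29:55", "2019-05-22 01:10:30"))  # 9635.0
print(getDuration("2019-10-07 17:43:09", "2019-10-07 17:59:09", "minutes"))  # 16.0
```

Because full datetimes are subtracted, rows that cross midnight need no special handling. The third row of the sample data has a final_time earlier than its initial_time, so it yields a negative duration. For large frames, converting both columns with pd.to_datetime and subtracting them, then calling .dt.total_seconds() on the result, would likely be faster than apply.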