I have a pandas dataframe with over 100 timestamps that defines the non-working time of a machine:
>>> off_time
date (index) start end
2020-07-04 18:00:00 23:50:00
2020-08-24 00:00:00 08:00:00
2020-08-24 14:00:00 16:00:00
2020-09-04 00:00:00 23:59:59
2020-10-05 18:00:00 22:00:00
I also have a second dataframe (called data) with over 1000 timestamps defining the duration of some processes:
>>> data
process-name start-time end-time duration
name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 day 14:00:00
name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 11:00:00
name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 12:00:00
name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 02:50:00
name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 day 09:00:00
In order to get the effective working time for each process in data, I now have to subtract the non-working time from the duration. For example, I have to subtract the time between 18:00 and 20:00 for the process "name5", since this time is planned as non-working time.
I wrote code with many if-else conditions, which I see as a potential source of errors. Is there a clean way to calculate the effective time without using too many if-else conditions? Any help would be greatly appreciated.
Set up sample data (I added a couple of rows to your samples to include some edge cases):
######### OFF TIMES
import datetime as dt
import pandas as pd

off = pd.DataFrame([
    ["2020-07-04", dt.time(18), dt.time(23, 50)],
    ["2020-08-24", dt.time(0), dt.time(8)],
    ["2020-08-24", dt.time(14), dt.time(16)],
    ["2020-09-04", dt.time(0), dt.time(23, 59, 59)],
    ["2020-10-04", dt.time(15), dt.time(18)],
    ["2020-10-05", dt.time(18), dt.time(22)]], columns=["date", "start", "end"])
off["date"] = pd.to_datetime(off["date"])
off = off.set_index("date")
### Convert start and end times to datetimes in the UTC timezone, since that is much
### easier to handle and fits the other data
off["start"] = pd.to_datetime(off.index.astype("string") + " " + off.start.astype("string") + "+00:00")
off["end"] = pd.to_datetime(off.index.astype("string") + " " + off.end.astype("string") + "+00:00")
off
>>
start end
date
2020-07-04 2020-07-04 18:00:00+00:00 2020-07-04 23:50:00+00:00
2020-08-24 2020-08-24 00:00:00+00:00 2020-08-24 08:00:00+00:00
2020-08-24 2020-08-24 14:00:00+00:00 2020-08-24 16:00:00+00:00
2020-09-04 2020-09-04 00:00:00+00:00 2020-09-04 23:59:59+00:00
2020-10-04 2020-10-04 15:00:00+00:00 2020-10-04 18:00:00+00:00
2020-10-05 2020-10-05 18:00:00+00:00 2020-10-05 22:00:00+00:00
######### PROCESS TIMES
data = pd.DataFrame([
    ["name1", "2020-07-17 08:00:00+00:00", "2020-07-18 22:00:00+00:00"],
    ["name2", "2020-08-24 01:00:00+00:00", "2020-08-24 12:00:00+00:00"],
    ["name3", "2020-09-20 07:00:00+00:00", "2020-09-20 19:00:00+00:00"],
    ["name4", "2020-09-04 16:00:00+00:00", "2020-09-04 18:50:00+00:00"],
    ["name5", "2020-10-04 11:00:00+00:00", "2020-10-05 20:00:00+00:00"],
    ["name6", "2020-09-03 10:00:00+00:00", "2020-09-06 05:00:00+00:00"]
], columns=["process", "start", "end"])
data["start"] = pd.to_datetime(data["start"])
data["end"] = pd.to_datetime(data["end"])
data["duration"] = data.end - data.start
data
>>
process start end duration
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00
As you can see, I added a row to off on 2020-10-04, so that name5 has two off times, which could happen in your data and would need to be handled correctly. (This means that, with my extended sample, you need to subtract 5 hours instead of 2.)
I also added the process name6, which spans multiple days.
This is my solution, which is applied to each row in data:
def get_relevant_off(pr):
    # subset the off rows that overlap the process row pr
    relevant = off[off.end.gt(pr["start"]) & off.start.lt(pr["end"])].copy()
    if not relevant.empty:
        # clip the off intervals to the bounds of the process interval
        relevant.loc[relevant["start"].lt(pr["start"]), "start"] = pr["start"]
        relevant.loc[relevant["end"].gt(pr["end"]), "end"] = pr["end"]
        to_subtract = (relevant.end - relevant.start).sum()
        return pr["duration"] - to_subtract
    else:
        return pr["duration"]
Explanation:
The first row in the function subsets the relevant rows of off, based on the row pr.
Replace off starts that are earlier than the process start with the process start, and do the same with the ends, since we don't want to sum the whole off time, only the part that actually overlaps the process.
Get the duration of the off times by subtracting the off starts from the off ends, and sum those.
Then subtract that sum from the total duration.
data["effective"] = data.apply(get_relevant_off, axis= 1)
data
>>
process start end duration effective
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00 0 days 04:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00 0 days 00:00:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00 1 days 04:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00 1 days 19:00:01
Caveat: I am assuming that off times never overlap. Also, I liked this problem, but I don't have any more time to spend on testing this, so let me know if I overlooked some edge cases that break it and I will try to find the time to fix it.
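If your off times could overlap after all, one option is to merge them into disjoint intervals before applying get_relevant_off. A minimal sketch of such a pre-pass (the helper name is mine; the date index is dropped, but the function above only uses the start and end columns):

def merge_off_intervals(off):
    # standard interval-union pass: sort by start, then absorb every
    # interval that starts before the previous merged one ends
    off_sorted = off.sort_values("start")
    merged = []
    for start, end in zip(off_sorted["start"], off_sorted["end"]):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return pd.DataFrame(merged, columns=["start", "end"])

off = merge_off_intervals(off)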
Related
I have a df with a date index as follows:
import pandas as pd

ind = pd.date_range(start="2015-12-31", end="2022-04-26", freq="D")
df = pd.DataFrame(
    {
        "col1": range(len(ind))
    },
    index=ind
)
What I need is to slice the df into windows starting at the end of each month, from 2017-08-31 onward, each spanning 3 years plus 1 month, so I have the following chunk of code:
from datetime import timedelta
from dateutil.relativedelta import relativedelta

n = timedelta(365 * 3) + relativedelta(months=1)
fechas_ = pd.date_range("2017-08-31", ind.max() - n, freq="M")
# loop to check the beginning and the end of each window
for i in fechas_:
    print(f"start: {i}")
    print(f"end: {i + n}")
    print("\n")
My problem is that I need the last day of the month as the end of each window e.g.:
# first window
start: 2017-08-31 00:00:00
end: 2020-09-30 00:00:00
# second window
start: 2017-09-30 00:00:00
end: 2020-10-31 00:00:00
# so on
But I'm getting:
# first window
start: 2017-08-31 00:00:00
end: 2020-09-29 00:00:00
# second window
start: 2017-09-30 00:00:00
end: 2020-10-29 00:00:00
# 3
2017-10-31 00:00:00
2020-11-29 00:00:00
# 4
2017-11-30 00:00:00
2020-12-29 00:00:00
# 5
2017-12-31 00:00:00
2021-01-30 00:00:00
# 6
2018-01-31 00:00:00
2021-02-27 00:00:00
# 7
2018-02-28 00:00:00
2021-03-27 00:00:00
# 8
2018-03-31 00:00:00
2021-04-29 00:00:00
# 9
2018-04-30 00:00:00
2021-05-29 00:00:00
# 10
2018-05-31 00:00:00
2021-06-29 00:00:00
# 11
2018-06-30 00:00:00
2021-07-29 00:00:00
# 12
2018-07-31 00:00:00
2021-08-30 00:00:00
# 13
2018-08-31 00:00:00
2021-09-29 00:00:00
# 14
2018-09-30 00:00:00
2021-10-29 00:00:00
# 15
2018-10-31 00:00:00
2021-11-29 00:00:00
# 16
2018-11-30 00:00:00
2021-12-29 00:00:00
# 17
2018-12-31 00:00:00
2022-01-30 00:00:00
# 18
2019-01-31 00:00:00
2022-02-27 00:00:00
# 19
2019-02-28 00:00:00
2022-03-27 00:00:00
Does someone know how I can solve this?
Thanks a lot!
In your code
n = timedelta(365 * 3) + relativedelta(months=1)
try replacing it with
n = relativedelta(years=3, months=1, day=31)
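The day=31 argument is what makes the difference: relativedelta first adds the years and months, then replaces the day with 31 and clamps it to the last valid day of the resulting month, so every window end lands on a month end. A quick sketch of the corrected loop (the shortened date range is only for illustration):

from dateutil.relativedelta import relativedelta
import pandas as pd

n = relativedelta(years=3, months=1, day=31)
for i in pd.date_range("2017-08-31", "2017-10-31", freq="M"):
    print(f"start: {i}")
    print(f"end: {i + n}")
# start: 2017-08-31 00:00:00
# end: 2020-09-30 00:00:00
# start: 2017-09-30 00:00:00
# end: 2020-10-31 00:00:00
# start: 2017-10-31 00:00:00
# end: 2020-11-30 00:00:00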
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since Hora_Retiro is of timedelta64[ns] type, use the components accessor (a plain .dt.hour only exists for datetimes):
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then groupby on the basis of hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str), format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, as in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding would move e.g. 00:31:00 into the 01:00:00 bucket),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you actually want to count: rows, or the values in the count_uses column.
In the second case, replace the count function with sum.
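For example, to sum the existing counts instead of counting rows (same floor-based grouping, only the final aggregation changes):

hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.sum()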
I am trying to read some parquet files using the dask.dataframe.read_parquet method. In the data I have a column named timestamp, which contains data such as:
0 2018-12-20 19:00:00
1 2018-12-20 20:00:00
2 2018-12-20 21:00:00
3 2018-12-20 22:00:00
4 2018-12-20 23:00:00
5 2018-12-21 00:00:00
6 2018-12-21 01:00:00
7 2018-12-21 02:00:00
8 2018-12-21 03:00:00
9 2018-12-21 04:00:00
10 2018-12-21 05:00:00
11 2018-12-21 06:00:00
12 2018-12-21 07:00:00
13 2018-12-21 08:00:00
14 2018-12-21 09:00:00
15 2018-12-21 10:00:00
16 2018-12-21 11:00:00
17 2018-12-21 12:00:00
18 2018-12-21 13:00:00
19 2018-12-21 14:00:00
20 2018-12-21 15:00:00
and I would like to filter based on timestamp and return, say, data within the last 10 days. How do I do this?
I tried something like:
from datetime import datetime, timedelta
import pandas as pd
import dask.dataframe as dask_df

# days = number of days to look back
filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=days))
filters = [('timestamp', '>', filter_timestamp_days)]
df = dask_df.read_parquet(DATA_DIR, engine='pyarrow', filters=filters)
But I am getting the error:
TypeError: Cannot compare type 'Timestamp' with type 'bytes_'
It turned out that the problem came from the data source I was working with. I tested a different data source originally written with dask, and it worked simply as:
filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=days))
filters = [('timestamp', '>', filter_timestamp_days)]
df = dask_df.read_parquet(DATA_DIR, engine='fastparquet', filters=filters)
I did not need to convert filter_timestamp_days any further. The former data source was written with a Scala client, and it seems the metadata is somehow not readable in dask.
Thank you all for your contributions.
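In case anyone else hits this with parquet files written by a non-Python client, a possible fallback (a sketch, not a guaranteed fix for the underlying metadata issue) is to skip the read-time filters and filter lazily after loading:

from datetime import datetime, timedelta
import pandas as pd
import dask.dataframe as dask_df

filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=10))
ddf = dask_df.read_parquet(DATA_DIR, engine='pyarrow')
# make sure the column really is a datetime before comparing
ddf['timestamp'] = dask_df.to_datetime(ddf['timestamp'])
ddf = ddf[ddf['timestamp'] > filter_timestamp_days]

This reads all row groups instead of pruning them at the parquet level, so it trades the error for extra I/O.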
I have two data frames like the following: data frame A has datetimes down to the minute, data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A has all the rows filled on the basis of a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it is a very long process. Is there an efficient way to perform the same operation with the help of pandas time series or date functionality?
Use map if you need to append only one column from B to A, with floor to set any minutes and seconds to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
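Alternatively, pd.merge also accepts computed keys directly, so the same floor('H') idea works in a single call without mutating either frame (a sketch; the suffixes are only illustrative):

merge_df = pd.merge(A, B, how='left',
                    left_on=A['dataDate'].dt.floor('H'),
                    right_on=B['dataDate'].dt.floor('H'),
                    suffixes=('_A', '_B'))
print(merge_df)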
I want to calculate the time difference between two columns within a specific time range.
I tried df.between_time, but it only works on an index.
Ex. time range: between 18:00 - 08:00
Data:
start stop
0 2018-07-16 16:00:00 2018-07-16 20:00:00
1 2018-07-11 08:03:00 2018-07-11 12:03:00
2 2018-07-13 17:54:00 2018-07-13 21:54:00
3 2018-07-14 13:09:00 2018-07-14 17:09:00
4 2018-07-20 00:21:00 2018-07-20 04:21:00
5 2018-07-20 17:00:00 2018-07-21 09:00:00
Expected result:
start stop time_diff
0 2018-07-16 16:00:00 2018-07-16 20:00:00 02:00:00
1 2018-07-11 08:03:00 2018-07-11 12:03:00 0
2 2018-07-13 17:54:00 2018-07-13 21:54:00 03:54:00
3 2018-07-14 13:09:00 2018-07-14 17:09:00 0
4 2018-07-20 00:21:00 2018-07-20 04:21:00 04:00:00
5 2018-07-20 17:00:00 2018-07-21 09:00:00 14:00:00
Note: if time_diff > 1 day, I already deal with that case.
Question: should I build a function to do this, or is there a pandas built-in function for it? Any help or guidance would be appreciated.
I think this can be a solution:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'time1': pd.to_datetime(['2018-07-16 16:00:00', '2018-07-11 08:03:00',
                                             '2018-07-13 17:54:00', '2018-07-14 13:09:00',
                                             '2018-07-20 00:21:00', '2018-07-20 17:00:00']),
                    'time2': pd.to_datetime(['2018-07-16 20:00:00', '2018-07-11 12:03:00',
                                             '2018-07-13 21:54:00', '2018-07-14 17:09:00',
                                             '2018-07-20 04:21:00', '2018-07-21 09:00:00'])})
time1_date = tmp.time1.dt.date.astype(str)
tmp['rule18'], tmp['rule08'] = pd.to_datetime(time1_date + ' 18:00:00'), pd.to_datetime(time1_date + ' 08:00:00')
# if time2 exceeds 18:00:00, compute the time difference from that hour
tmp['time_diff_rule1'] = np.where(tmp.time2 > tmp.rule18, (tmp.time2 - tmp.rule18), (tmp.time2 - tmp.time1))
# second rule: zero out intervals that lie entirely inside 08:00-18:00
tmp['time_diff_rule2'] = np.where((tmp.time2 < tmp.rule18) & (tmp.time1 > tmp.rule08), 0, tmp['time_diff_rule1'])
time_diff_rule1 time_diff_rule2
0 02:00:00 02:00:00
1 04:00:00 00:00:00
2 03:54:00 03:54:00
3 04:00:00 00:00:00
4 04:00:00 04:00:00
5 15:00:00 15:00:00
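Note that row 5 comes out as 15:00:00 here, while the expected result was 14:00:00, because the rules above do not cap the difference at the next morning's 08:00. A more general sketch based on plain interval intersection with the nightly 18:00 - 08:00 window (df stands for your data frame, the helper name is mine, and it assumes each row spans at most about one day, since you said you already deal with longer cases):

import pandas as pd

def night_overlap(start, stop):
    # sum the overlap of [start, stop] with the two candidate night
    # windows [D-1 18:00, D 08:00) and [D 18:00, D+1 08:00),
    # where D is the date of start
    total = pd.Timedelta(0)
    for day in (start.normalize() - pd.Timedelta(days=1), start.normalize()):
        w_start = day + pd.Timedelta(hours=18)
        w_end = day + pd.Timedelta(days=1, hours=8)
        overlap = min(stop, w_end) - max(start, w_start)
        if overlap > pd.Timedelta(0):
            total += overlap
    return total

df['time_diff'] = df.apply(lambda r: night_overlap(r['start'], r['stop']), axis=1)

This reproduces the expected column above, including 14:00:00 for row 5.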