I want to write code that groups times into time periods. I have two columns, from and to, and a list of periods. Based on the values in the two columns, I need to insert a new column named periods into the dataframe that represents the time period.
This is the code:
import pandas as pd
df = pd.DataFrame({"from":['08:10', '14:00', '15:00', '17:01', '13:41'],
"to":['10:11', '15:32', '15:35' , '18:23', '16:16']})
print(df)
periods = ["00:01-06:00", "06:01-12:00", "12:01-18:00", "18:01-00:00"]
#if times are between two periods, for example '17:01' and '18:23', it counts as first period ("12:01-18:00")
Result should look like this:
from to period
0 08:10 10:11 06:01-12:00
1 14:00 15:32 12:01-18:00
2 15:00 15:35 12:01-18:00
3 17:01 18:03 18:01-00:00
4 18:41 19:16 18:01-00:00
Values in two columns are datetime.
Here is a way to do it (I am assuming that "18:00" would belong in period "12:01-18:00"):
results = [0 for x in range(len(df))]

for row in df.iterrows():
    item = row[1]
    start = item['from']
    end = item['to']
    for ind, period in enumerate(periods):
        per_1, per_2 = period.split("-")
        if start.split(":")[0] >= per_1.split(":")[0]:  # hours
            if start.split(":")[0] == per_1.split(":")[0]:
                if start.split(":")[1] >= per_1.split(":")[1]:  # minutes
                    if start.split(":")[1] == per_1.split(":")[1]:
                        results[row[0]] = period
                        break
                    # Wrap around if you reach the end of the list
                    index = ind+1 if ind < len(periods)-1 else 0
                    results[row[0]] = periods[index]
                    break
                index = ind-1 if ind > 0 else len(periods)-1
                results[row[0]] = periods[index]
                break
            if start.split(":")[0] <= per_2.split(":")[0]:
                if start.split(":")[0] == per_2.split(":")[0]:
                    if start.split(":")[1] == per_2.split(":")[1]:
                        results[row[0]] = period
                        break
                    # If anything else, then it's greater, so in next period
                    index = ind+1 if ind < len(periods)-1 else 0
                    results[row[0]] = periods[index]
                    break
                results[row[0]] = period
                break
print(results)
df['periods'] = results
df

Output:

['06:01-12:00', '12:01-18:00', '12:01-18:00', '12:01-18:00', '18:01-00:00']

    from     to      periods
0  08:10  10:11  06:01-12:00
1  14:00  15:32  12:01-18:00
2  15:00  15:35  12:01-18:00
3  17:01  18:23  12:01-18:00
4  18:41  16:16  18:01-00:00
That should cover every scenario, but you should test it against every possible edge case of times to make sure.
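For instance, a few boundary rows you could add to the test data (hypothetical values chosen to sit exactly on the period boundaries, not from the question):

edge_df = pd.DataFrame({"from": ['00:01', '06:00', '06:01', '12:00', '18:00', '18:01'],
                        "to":   ['05:59', '06:00', '07:00', '12:00', '18:00', '23:59']})
# re-run the whole snippet above with df = edge_df to see how times
# sitting exactly on the boundaries are assigned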
Below is another way to do it:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"from": ['08:10', '14:00', '15:00', '17:01', '13:41'],
                   "to": ['10:11', '15:32', '15:35', '18:23', '16:16']})
print(df)

periods = ["00:01-06:00", "06:01-12:00", "12:01-18:00", "18:01-00:00"]
_periods = [(datetime.strptime(p.split('-')[0], '%H:%M').time(),
             datetime.strptime(p.split('-')[1], '%H:%M').time()) for p in periods]

def match_row_to_period(row):
    from_time = datetime.strptime(row['from'], '%H:%M').time()
    to_time = datetime.strptime(row['to'], '%H:%M').time()
    # first try to find a period that contains the whole interval
    for idx, p in enumerate(_periods):
        if from_time >= p[0] and to_time <= p[1]:
            return periods[idx]
    # otherwise look for an interval that bridges two neighbouring periods
    for idx, p in enumerate(_periods):
        if idx > 0:
            prev_p = _periods[idx - 1]
            if from_time <= prev_p[1] and to_time >= p[0]:
                return periods[idx - 1]

df['period'] = df.apply(lambda row: match_row_to_period(row), axis=1)

print('-----------------------------------')
print('periods: ')
for _p in _periods:
    print(str(_p[0]) + ' -- ' + str(_p[1]))
print('-----------------------------------')
print(df)
Output:
from to
0 08:10 10:11
1 14:00 15:32
2 15:00 15:35
3 17:01 18:23
4 13:41 16:16
-----------------------------------
periods:
00:01:00 -- 06:00:00
06:01:00 -- 12:00:00
12:01:00 -- 18:00:00
18:01:00 -- 00:00:00
-----------------------------------
from to period
0 08:10 10:11 06:01-12:00
1 14:00 15:32 12:01-18:00
2 15:00 15:35 12:01-18:00
3 17:01 18:23 12:01-18:00
4 13:41 16:16 12:01-18:00
Not sure if there is a better solution, but here's a way using the pandas apply and assign methods, which is generally more idiomatic than iterating over a DataFrame, since pandas is optimised for whole-DataFrame assignment operations rather than row-by-row updates (see this great blog post).
As a side note, the datatypes I've used here are datetime.time instances, rather than strings as in your example. When dealing with time it's always better to use an appropriate time library rather than a string representation.
import pandas as pd
from datetime import time

df = pd.DataFrame({
    "from": [time(8, 10), time(14, 0), time(15, 0), time(17, 1), time(13, 41)],
    "to": [time(10, 11), time(15, 32), time(15, 35), time(18, 23), time(16, 16)]
})

periods = [
    {'from': time(0, 1), 'to': time(6, 0), 'period': '00:01-06:00'},
    {'from': time(6, 1), 'to': time(12, 0), 'period': '06:01-12:00'},
    {'from': time(12, 1), 'to': time(18, 0), 'period': '12:01-18:00'},
    {'from': time(18, 1), 'to': time(0, 0), 'period': '18:01-00:00'},
]

def find_period(row, periods):
    """Map the df row to the period which it fits between"""
    for ix, period in enumerate(periods):
        if row['to'] <= periods[ix]['to']:
            if row['from'] >= periods[ix]['from']:
                return periods[ix]['period']

# Use df assign to assign the new column to the df
df.assign(
    **{
        'period':
            df.apply(lambda row: find_period(row, periods), axis='columns')
    }
)
Out:
from to period
0 08:10:00 10:11:00 06:01-12:00
1 14:00:00 15:32:00 12:01-18:00
2 15:00:00 15:35:00 12:01-18:00
3 17:01:00 18:23:00 None
4 13:41:00 16:16:00 12:01-18:00
N.b. the row at ix 3 correctly shows None, as it doesn't fit entirely within any of the periods you defined (rather, it bridges 12:01-18:00 and 18:01-00:00); a sketch for handling that case follows.
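If you instead wanted such bridging rows to fall into the period that contains their start time (as the comment in the question suggests for the '17:01' to '18:23' row), a minimal sketch building on find_period could look like this (find_period_with_fallback is my own name, not part of the answer above):

def find_period_with_fallback(row, periods):
    """Like find_period, but a row bridging two periods falls back to the
    period containing its 'from' time. Sketch only; a 'from' time after
    18:00 is not handled because the last period's 'to' is time(0, 0)."""
    strict = find_period(row, periods)
    if strict is not None:
        return strict
    for period in periods:
        if period['from'] <= row['from'] <= period['to']:
            return period['period']

df.assign(period=df.apply(lambda row: find_period_with_fallback(row, periods), axis='columns'))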
I have a pandas dataframe with a column containing a 5-digit code that represents a day and time, and it works as follows:
1 - The first three digits represent the day;
2 - The last two digits represent a half-hour slot within the day.
Example1: The first row has the code 19501, where 195 represents the 1st of January of 2009 and the 01 part represents the time from 00:00:00 to 00:29:59;
Example2: In the second row I have the code 19502, which is the 1st of January of 2009 from 00:30:00 to 00:59:59;
Example3: Another example, 19711 would be the 3rd of January of 2009 from 05:00:00 to 05:29:59;
Example4: The last row is the code 73048, which represents the 20th of June of 2010 from 23:30:00 to 23:59:59.
Any ideas on how I can convert this 5-digit code into a proper datetime format?
I'm assuming your column is numeric.
import pandas as pd
import datetime as dt

df = pd.DataFrame({'code': [19501, 19502, 19711, 73048]})

df['days'] = pd.to_timedelta(df['code']//100, 'D')
df['half-hours'] = df['code']%100
df['hours'] = pd.to_timedelta(df['half-hours']//2, 'h')
df['minutes'] = pd.to_timedelta(df['half-hours']%2*30, 'm')

# day 195 corresponds to 2009-01-01, so the base day is 2008-06-20
base_day = dt.datetime(2009, 1, 1) - dt.timedelta(days=195)

df['dt0'] = base_day + df.days + df.hours + df.minutes - dt.timedelta(minutes=30)  # start of period
df['dt1'] = base_day + df.days + df.hours + df.minutes - dt.timedelta(seconds=1)   # end of period
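As a quick sanity check against the examples in the question (my own addition, assuming the snippet above ran as-is):

# Example1: code 19501 -> 2009-01-01 00:00:00 to 00:29:59
assert df.loc[0, 'dt0'] == dt.datetime(2009, 1, 1, 0, 0, 0)
assert df.loc[0, 'dt1'] == dt.datetime(2009, 1, 1, 0, 29, 59)
# Example4: code 73048 -> 2010-06-20 23:30:00 to 23:59:59
assert df.loc[3, 'dt0'] == dt.datetime(2010, 6, 20, 23, 30, 0)
assert df.loc[3, 'dt1'] == dt.datetime(2010, 6, 20, 23, 59, 59)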
A simple solution: add the days to 2008-06-20, then add (time-1)*30 minutes:
import pandas as pd

df = pd.DataFrame({'code': [19501, 19502, 19711, 73048]})

d, t = df['code'].divmod(100)

df['datetime'] = (
    pd.to_timedelta(d, unit='D')
    .add(pd.Timestamp('2008-06-20'))
    .add(pd.to_timedelta((t-1)*30, unit='T'))
)
NB: this gives you the start of the period; for the end, replace (t-1)*30 with t*30-1 (see the sketch after the output).
Output:
code datetime
0 19501 2009-01-01 00:00:00
1 19502 2009-01-01 00:30:00
2 19711 2009-01-03 05:00:00
3 73048 2010-06-20 23:30:00
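Following the NB above, a hypothetical datetime_end column (my own name) for the end of each period could be added the same way:

# end of each half-hour period, per the note: t*30 - 1 minutes
# (minute resolution, so e.g. 00:29:00 rather than 00:29:59)
df['datetime_end'] = (
    pd.to_timedelta(d, unit='D')
    .add(pd.Timestamp('2008-06-20'))
    .add(pd.to_timedelta(t*30 - 1, unit='T'))
)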
I have a dataframe like this,
tidx = pd.date_range('2022-10-01', periods=10, freq='10D')
data_frame = pd.DataFrame(1, columns=['inventory'], index=tidx)
print(data_frame)
Output:
inventory
2022-10-01 1
2022-10-11 1
2022-10-21 1
2022-10-31 1
2022-11-10 1
2022-11-20 1
2022-11-30 1
2022-12-10 1
2022-12-20 1
2022-12-30 1
I want to find the sum from the 23rd to the 23rd of each month. I couldn't find a way to pass the day number to the resample method. Any help is really appreciated.
Is this what you need?
import pandas as pd
from datetime import timedelta

tidx = pd.date_range('2022-10-01', periods=10, freq='10D')
data_frame = pd.DataFrame(1, columns=['inventory'], index=tidx)
data_frame.index.name = "date"
data_frame = data_frame.reset_index()
data_frame["fin_year_month"] = ""
# days before the 23rd belong to the previous month's period
data_frame.loc[data_frame["date"].dt.day < 23, ["fin_year_month"]] = (
    (data_frame["date"] - timedelta(days=25)).dt.year.astype("str") + "_"
    + (data_frame["date"] - timedelta(days=25)).dt.month.astype("str"))
# days from the 23rd onwards belong to the current month's period
data_frame.loc[data_frame["date"].dt.day >= 23, ["fin_year_month"]] = (
    data_frame["date"].dt.year.astype("str") + "_"
    + data_frame["date"].dt.month.astype("str"))
data_frame.groupby("fin_year_month").sum()
Just be careful with the number of days you subtract. For the 23rd-to-23rd case I subtract 25 and this is fine. For 30 or 31 it would be a harder problem: the number of days to subtract depends on the particular month, so it would be easier to write a function that returns the "previous year-month" for a given date, as sketched below.
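For example, a minimal sketch of such a function (the fin_year_month name matches the column above, but the start_day parameter and the implementation are my own, untested sketch):

def fin_year_month(date, start_day=23):
    """Return the year_month period a date belongs to, where each period
    runs from start_day of one month to start_day of the next."""
    period = date.to_period('M')
    if date.day < start_day:
        period -= 1  # before the cutoff, so still in the previous month's period
    return f"{period.year}_{period.month}"

# usage with the dataframe above:
# data_frame["fin_year_month"] = data_frame["date"].apply(fin_year_month)
# data_frame.groupby("fin_year_month").sum()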
This is my dataframe.
Start_hour End_date
23:58:00 00:26:00
23:56:00 00:01:00
23:18:00 23:36:00
How can I get the difference (in minutes) between these two columns in a new column?
>>> from datetime import datetime
>>>
>>> before = datetime.now()
>>> print('wait for more than 1 minute')
wait for more than 1 minute
>>> after = datetime.now()
>>> td = after - before
>>>
>>> td
datetime.timedelta(seconds=98, microseconds=389121)
>>> td.total_seconds()
98.389121
>>> td.total_seconds() / 60
1.6398186833333335
Then you can round it or use it as-is.
You can do something like this:
import pandas as pd
df = pd.DataFrame({
'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
'End_date': ['00:26:00', '00:01:00', '23:36:00']}
)
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])
# timedelta.seconds is always non-negative, so intervals that cross midnight still come out right
df['diff'] = df.apply(
    lambda row: (row['End_date'] - row['Start_hour']).seconds / 60,
    axis=1
)
print(df)
Start_hour End_date diff
0 2021-03-29 23:58:00 2021-03-29 00:26:00 28.0
1 2021-03-29 23:56:00 2021-03-29 00:01:00 5.0
2 2021-03-29 23:18:00 2021-03-29 23:36:00 18.0
You can also format your dates as strings again if you like:
df['Start_hour'] = df['Start_hour'].apply(lambda x: x.strftime('%H:%M:%S'))
df['End_date'] = df['End_date'].apply(lambda x: x.strftime('%H:%M:%S'))
print(df)
Output:
Start_hour End_date diff
0 23:58:00 00:26:00 28.0
1 23:56:00 00:01:00 5.0
2 23:18:00 23:36:00 18.0
Short answer:
df['interval'] = df['End_date'] - df['Start_hour']
df['interval'][df['End_date'] < df['Start_hour']] += timedelta(hours=24)
Why so:
You're probably trying to solve the problem that your Start_hour and End_date values sometimes belong to different days, which is why you can't just subtract one from the other.
If your time window never exceeds a 24-hour interval, you can use some modular arithmetic to deal with the 23:59:59 - 00:00:00 border:
if End_date < Start_hour, this always means End_date belongs to the next day
this implies that if End_date - Start_hour < 0, we should add 24 hours to End_date to find the actual difference
The final formula is:
if rec['Start_hour'] < rec['End_date']:
    offset = timedelta(0)
else:
    offset = timedelta(hours=24)

rec['delta'] = offset + rec['End_date'] - rec['Start_hour']
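For instance, a quick stand-alone check with the first row from the question (a sketch, not part of the original answer):

from datetime import datetime, timedelta

start = datetime(1, 1, 1, 23, 58)   # Start_hour 23:58
end = datetime(1, 1, 1, 0, 26)      # End_date 00:26, i.e. the next day
offset = timedelta(0) if start < end else timedelta(hours=24)
print(offset + end - start)         # 0:28:00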
To do the same with a pandas DataFrame we need to change the code accordingly, and that's how we get the snippet from the beginning of the answer.
from datetime import datetime, timedelta
import pandas as pd

df = pd.DataFrame([
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 0, 26, 0)},
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 23, 59, 0)},
])

# ...

df['interval'] = df['End_date'] - df['Start_hour']
df['interval'][df['End_date'] < df['Start_hour']] += timedelta(hours=24)
> df
Start_hour End_date interval
0 0001-01-01 23:58:00 0001-01-01 00:26:00 0 days 00:28:00
1 0001-01-01 23:58:00 0001-01-01 23:59:00 0 days 00:01:00
I have two dataframes; each has a start/end datetime and a value. They do not have the same number of rows, and the intervals which overlap may not be at the same row/index.
df1
start_datetime end_datetime value
08:50 09:50 5
09:52 10:10 6
10:50 11:30 2
df2
start_datetime end_datetime value
08:51 08:59 3
09:52 10:02 9
10:03 10:30 1
11:03 11:39 1
13:10 13:15 0
I would like to calculate the total overlap duration between df1 and df2, but only where df1.value > df2.value.
During one df2 time interval, df1 can overlap multiple times, and sometimes the condition is True.
I tried something like this:

time = timedelta()
for i, row1 in df1.iterrows():
    t1 = pd.Interval(row1.start, row1.end)
    for j, row2 in df2.iterrows():
        t2 = pd.Interval(row2.start, row2.end)
        if t1.overlaps(t2) and row1.value > row2.value:
            latest_start = np.maximum(row1.start, row2.start)
            earliest_end = np.minimum(row1.end, row2.end)
            delta = earliest_end - latest_start
            time += delta
I could loop over every row of df1 and test it against the whole of df2, but that's not optimised.
expected output (example):
Timedelta('0 days 00:99:99')
Here is my solution:
Create DataFrames:
import pandas as pd

df1 = pd.DataFrame(
    {"start_datetime1": ['08:50', '09:52', '10:50'],
     'end_datetime1': ['09:50', '10:10', '11:30'],
     'value1': [5, 6, 2]})

df2 = pd.DataFrame(
    {"start_datetime2": ['08:51', '09:52', '10:03', '11:03', '13:10'],
     'end_datetime2': ['08:59', '10:02', '10:30', '11:39', '13:15'],
     'value2': [3, 9, 1, 1, 0]})

df1["start_datetime1"] = pd.to_datetime(df1["start_datetime1"])
df1["end_datetime1"] = pd.to_datetime(df1["end_datetime1"])
df2["start_datetime2"] = pd.to_datetime(df2["start_datetime2"])
df2["end_datetime2"] = pd.to_datetime(df2["end_datetime2"])
Combine the dataframes to make the comparison easier. The combined dataframe has all possible matches:
df1['temp'] = 1 #temporary keys to create all combinations
df2['temp'] = 1
df_combined = pd.merge(df1,df2,on='temp').drop('temp',axis=1)
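As a side note, on pandas 1.2+ the same cross join can be written without the temporary key (a small variation, not part of the original snippet):

df_combined = pd.merge(df1, df2, how='cross')  # skip the 'temp' columns in this case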
Compare values with a lambda function, taking the overlap as the earliest end minus the latest start:

df_combined['Result'] = df_combined.apply(
    lambda row: min(row["end_datetime1"], row["end_datetime2"]) -
                max(row["start_datetime1"], row["start_datetime2"])
    if pd.Interval(row['start_datetime1'], row['end_datetime1']).overlaps(
        pd.Interval(row['start_datetime2'], row['end_datetime2']))
    and row["value1"] > row["value2"]
    else 0, axis=1)
df_combined
Result:

total_timedelta = df_combined['Result'].loc[df_combined['Result'] != 0].sum()

0 days 00:42:00
I have a start date and an end date, and I would like to have the date range between start and end, on a specific day (e.g. the 10th day of every month).
Example:
start_date = '2020-01-03'
end_date = '2020-10-19'
wanted_result = ['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',...,'2020-10-10', '2020-10-19']
I currently have a solution which creates all the dates between start_date and end_date and then subsamples only the dates on the 10th, but I do not like it; I think it is too cumbersome. Any ideas?
import pandas as pd
querydate = 10
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[[0]].append(dates[dates.day == querydate])
If you also need the first and last values, add Index.isin with the first and last values, so all values stay unique and there are no duplicates if the first or last day is the 10th:
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[dates.isin(dates[[0,-1]]) | (dates.day == querydate)]
print (dates)
DatetimeIndex(['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',
'2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10',
'2020-08-10', '2020-09-10', '2020-10-10', '2020-10-19'],
dtype='datetime64[ns]', freq=None)
If need list:
print (list(dates.strftime('%Y-%m-%d')))
['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',
'2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10',
'2020-08-10', '2020-09-10', '2020-10-10', '2020-10-19']
Changed sample data:
start_date = '2020-01-10'
end_date = '2020-10-10'
querydate = 10
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[dates.isin(dates[[0,-1]]) | (dates.day == querydate)]
print (dates)
DatetimeIndex(['2020-01-10', '2020-02-10', '2020-03-10', '2020-04-10',
'2020-05-10', '2020-06-10', '2020-07-10', '2020-08-10',
'2020-09-10', '2020-10-10'],
dtype='datetime64[ns]', freq=None)
Try this:
dates = pd.Series(
    [pd.to_datetime(start_date)]
    + [i for i in pd.date_range(start=start_date, end=end_date) if i.day == 10]
    + [pd.to_datetime(end_date)]
).drop_duplicates()
print(dates)
Output:
0 2020-01-03
1 2020-01-10
2 2020-02-10
3 2020-03-10
4 2020-04-10
5 2020-05-10
6 2020-06-10
7 2020-07-10
8 2020-08-10
9 2020-09-10
10 2020-10-10
11 2020-10-19
dtype: datetime64[ns]