Choosing the minimum distance - python

I have the following dataframe:
data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00',
                       '2019-01-02 00:04:00', '2019-01-02 00:15:00',
                       '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
I have been trying to calculate the shortest time difference between the orders within each 15-minute window, e.g.
I take a 15-minute time window and its midpoint 00:07:30, which means I would like to calculate the difference between the first order '2019-01-01 00:00:00' and 00:07:30, and between the second order '2019-01-01 00:11:00' and 00:07:30, and keep only the order that is closer to 00:07:30 on each day.
I did the following:
t = 0
x = '00:00:00'
y = '00:15:00'
g = 0
a = []
for i in range(1, len(df_data)):
    g += 1
    half_time = (pd.Timestamp(y) - pd.Timestamp(x).to_pydatetime()) / 2
    half_window = (half_time + pd.Timestamp(x).to_pydatetime()).strftime('%H:%M:%S')
    for l in df_data['day_order']:
        for k in df_data['time_order']:
            if l == k.strftime('%Y-%m-%d'):
                distance1 = abs(pd.Timestamp(df_data.iat[i-1, 4]).to_pydatetime() - pd.Timestamp(half_window).to_pydatetime())
                distance2 = abs(pd.Timestamp(df_data.iat[i, 4]).to_pydatetime() - pd.Timestamp(half_window).to_pydatetime())
                if distance1 < distance2:
                    d = distance1
                else:
                    d = distance2
                a.append(d.seconds)
so the expected result for the first day is abs(00:11:00 - 00:07:30) = 00:03:30, which is less than abs(00:00:00 - 00:07:30) = 00:07:30. By doing so I would like to keep only the shorter time distance, 00:03:30, and ignore the first order on that day. I would like to do this for each day. I tried it with my code above, but it doesn't work. Any idea would be very appreciated. Thanks in advance.

I am not sure about the format of the expected output, but I would try to bring the result to a point where you can extract data as you like:
Loading given data:
import pandas as pd
data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00',
                       '2019-01-02 00:04:00', '2019-01-02 00:15:00',
                       '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
Calculating difference:
x = '00:00:00'
y = '00:15:00'
diff = (pd.Timedelta(y)-pd.Timedelta(x))/2
Creating a new column 'diff' as timedelta:
df_data['diff'] = abs(pd.to_timedelta(df_data['time']) - diff)
Grouping (based on date) and apply:
mins = df_data.groupby('day_order').apply(lambda x: x[x['diff']==min(x['diff'])])
Removing Index (optional):
mins.reset_index(drop=True, inplace=True)
Output DataFrame:
>>> mins
   id          time_order   day_order      time            diff
0   0 2019-01-01 00:11:00  2019-01-01  00:11:00 0 days 00:03:30
1   0 2019-01-02 00:04:00  2019-01-02  00:04:00 0 days 00:03:30
2   0 2019-01-03 00:07:00  2019-01-03  00:07:00 0 days 00:00:30
Making list of difference in seconds:
a = list(mins['diff'].apply(lambda x:x.seconds))
Output:
>>> a
[210, 210, 30]
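For reference, the same per-day pick can be made without a groupby-apply by using idxmin (a sketch built on the question's data; the 00:07:30 midpoint and the names `df`/`mins` are choices of this sketch, not part of the original answer):

```python
import pandas as pd

data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00',
                       '2019-01-02 00:04:00', '2019-01-02 00:15:00',
                       '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df = pd.DataFrame(data)
df['time_order'] = pd.to_datetime(df['time_order'])
df['day_order'] = df['time_order'].dt.strftime('%Y-%m-%d')

# Absolute distance of each order's time-of-day from the 00:07:30 window midpoint
midpoint = pd.Timedelta('00:07:30')
df['diff'] = (df['time_order'] - df['time_order'].dt.normalize() - midpoint).abs()

# idxmin returns the row label of the per-group minimum, so one row per day survives
mins = df.loc[df.groupby('day_order')['diff'].idxmin()].reset_index(drop=True)
print(list(mins['diff'].dt.seconds))  # [210, 210, 30]
```

This reproduces the [210, 210, 30] result while avoiding the per-group lambda.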


How to filter by day the last 3 days values of a pandas dataframe considering the hours?

I have this dataframe:
I need to get the values of the single days between 05:00:00 and 06:00:00 (so, in this example, ignore 07:00:00) and create a separate dataframe for each day, considering the last 3 days.
This is the result I want to achieve: (3 dataframes covering 3 days, with Time between 05 and 06)
I tried this (without success):
df.sort_values(by = "Time", inplace=True)
df_of_yesterday = df[(df.Time.dt.hour > 4)
                     & (df.Time.dt.hour < 7)]
You can use:
from datetime import date, time, timedelta
today = date.today()
m = df['Time'].dt.time.between(time(5), time(6))
df_yda = df.loc[m & (df['Time'].dt.date == today - timedelta(1))]
df_2da = df.loc[m & (df['Time'].dt.date == today - timedelta(2))]
df_3da = df.loc[m & (df['Time'].dt.date == today - timedelta(3))]
Output:
>>> df_yda
Time Open
77 2022-03-09 05:00:00 0.880443
78 2022-03-09 06:00:00 0.401932
>>> df_2da
Time Open
53 2022-03-08 05:00:00 0.781377
54 2022-03-08 06:00:00 0.638676
>>> df_3da
Time Open
29 2022-03-07 05:00:00 0.838719
30 2022-03-07 06:00:00 0.897211
Setup a MRE:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
dti = pd.date_range('2022-03-06', '2022-03-10', freq='H')
df = pd.DataFrame({'Time': dti, 'Open': rng.random(len(dti))})
Use Series.between with offsets.DateOffset in a list comprehension to build a list of DataFrames, one per day:
now = pd.to_datetime('now').normalize()
dfs = [df[df.Time.between(now - pd.DateOffset(days=i, hour=5),
now - pd.DateOffset(days=i, hour=6))] for i in range(1,4)]
print (dfs[0])
print (dfs[1])
print (dfs[2])
I've manually copied your data into a dictionary and then converted it to your desired output.
First you should probably edit your question to use the text version of the data instead of an image, here's a small example:
data = {
    'Time': [
        '2022-03-06 05:00:00',
        '2022-03-06 06:00:00',
        '2022-03-06 07:00:00',
        '2022-03-07 05:00:00',
        '2022-03-07 06:00:00',
        '2022-03-07 07:00:00',
        '2022-03-08 05:00:00',
        '2022-03-08 06:00:00',
        '2022-03-08 07:00:00',
        '2022-03-09 05:00:00',
        '2022-03-09 06:00:00',
        '2022-03-09 07:00:00'
    ],
    'Open': [
        '13823.6',
        '13786.6',
        '13823.6',
        '13823.6',
        '13786.6',
        '13823.6',
        '13823.6',
        '13786.6',
        '13823.6',
        '13823.6',
        '13786.6',
        '13823.6'
    ]
}
df = pd.DataFrame(data)
Then you can use this code to collect, for each day, the values whose hour lies strictly between 4 and 7, and then create your dataframes as follows:
import pandas as pd
from datetime import datetime
days = {}  # maps 'YYYY-MM-DD' -> list of Open values in the 05:00-06:00 window
for index, row in df.iterrows():
    found = False
    for item in days:
        date = datetime.strptime(row['Time'], '%Y-%m-%d %H:%M:%S')
        date2 = datetime.strptime(item, '%Y-%m-%d')
        if date.date() == date2.date() and date.hour > 4 and date.hour < 7:
            days[item].append(row['Open'])
            found = True
    date = datetime.strptime(row['Time'], '%Y-%m-%d %H:%M:%S')
    if not found and date.hour > 4 and date.hour < 7:
        days[date.strftime('%Y-%m-%d')] = []
        days[date.strftime('%Y-%m-%d')].append(row['Open'])
for key in days:
    day_df = pd.DataFrame({key: days[key]})
    print(day_df)
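If the goal is just one DataFrame per day for the 05:00-06:00 window, a dictionary comprehension over groupby avoids the manual bookkeeping (a sketch on made-up two-day data; `win` and `per_day` are hypothetical names, not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({
    'Time': pd.to_datetime(['2022-03-07 05:00:00', '2022-03-07 06:00:00',
                            '2022-03-07 07:00:00', '2022-03-08 05:00:00',
                            '2022-03-08 06:00:00', '2022-03-08 07:00:00']),
    'Open': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Keep rows whose time-of-day is between 05:00 and 06:00 inclusive,
# then split the filtered frame into one DataFrame per calendar day
m = df['Time'].dt.time.between(pd.Timestamp('05:00').time(),
                               pd.Timestamp('06:00').time())
win = df[m]
per_day = {day: g for day, g in win.groupby(win['Time'].dt.date)}
```

Each value in `per_day` is a ready-made per-day DataFrame, keyed by its date.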

Creating missing time ranges in pandas

I have a pandas dataframe where each row corresponds to a period of time for a given record. If a record has more than one period of time there is a gap between them. I would like to fill in all the missing time periods that are between the end of the first time period and the start of the final time period.
My data looks like this:
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
record start_time stop_time
0 1 2001-01-01 2001-01-15
1 1 2001-02-01 2001-02-28
2 2 2000-01-01 2001-01-31
3 2 2001-05-31 2001-08-16
4 2 2001-09-01 2001-09-30
The gaps in time are between lines 0 and 1 (the stop time is 2001-01-15 and the next start time is 2001-02-01, which is a 16 day gap), as well as 2 and 3, and 3 and 4. Gaps can only happen between the first and last row for a given record.
What I'm trying to achieve is this:
record start_time stop_time
0 1 2001-01-01 2001-01-15
1 1 2001-01-16 2001-01-31
2 1 2001-02-01 2001-02-28
3 2 2000-01-01 2001-01-31
4 2 2001-02-01 2001-05-30
5 2 2001-05-31 2001-08-16
6 2 2001-08-17 2001-08-31
7 2 2001-09-01 2001-09-30
That is, I want to add in rows that have start and stop times that fit those gaps. So in the previous example there would be a new row for record 1 with a start date of 2001-01-16 and an end date of 2001-01-31.
The full dataset has over 2M rows across 1.5M records, so I'm looking for a vectorized solution in pandas that doesn't use apply and is relatively efficient.
Maybe something like this?
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
missing_dates = []
for record, df_per_record in df.groupby('record'):
    start_time = pd.to_datetime(df_per_record.start_time)
    stop_time = pd.to_datetime(df_per_record.stop_time)
    reference_date = pd.Timestamp(df_per_record.start_time.iloc[0])
    start_time_in_days = (start_time - reference_date) // one_day
    stop_time_in_days = (stop_time - reference_date) // one_day
    dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
    missing_start_dates = stop_time[:-1][dates_diff > 1] + one_day
    missing_stop_dates = missing_start_dates + ((dates_diff[dates_diff > 1] - 2) * one_day)
    missing_dates.append(pd.DataFrame({"record": record, "start_time": missing_start_dates, "stop_time": missing_stop_dates}))
print(pd.concat([df]+missing_dates).sort_values(["record", "start_time"]))
Edit:
version #2 this time without the for loop:
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
start_time = pd.to_datetime(df.start_time)
stop_time = pd.to_datetime(df.stop_time)
reference_date = pd.Timestamp(df.start_time.iloc[0])
start_time_in_days = (start_time - reference_date) // one_day
stop_time_in_days = (stop_time - reference_date) // one_day
is_same_record = df.record.iloc[1:].values == df.record.iloc[:-1].values
dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
mask = (dates_diff > 1) & is_same_record
missing_start_dates = stop_time[:-1][mask] + one_day
missing_stop_dates = missing_start_dates + ((dates_diff[mask] - 2) * one_day)
missing_dates = pd.DataFrame({"record": df.record.iloc[:-1][mask], "start_time": missing_start_dates, "stop_time": missing_stop_dates})
print(pd.concat([df, missing_dates]).sort_values(["record", "start_time"]).reset_index(drop=True))
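As a cross-check, the same gaps can be derived with groupby(...).shift(), which handles the same-record condition implicitly because the shifted stop time is NaT on each record's first row (a sketch, not part of the original answer; `gaps` and `out` are names invented here):

```python
import pandas as pd

record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})

one_day = pd.Timedelta('1d')
# Previous stop within the same record; NaT comparisons are False,
# so record boundaries drop out of the mask automatically
prev_stop = df.groupby('record')['stop_time'].shift()
mask = df['start_time'] - prev_stop > one_day

# One filler row per gap: day after the previous stop, day before the next start
gaps = pd.DataFrame({
    'record': df.loc[mask, 'record'],
    'start_time': prev_stop[mask] + one_day,
    'stop_time': df.loc[mask, 'start_time'] - one_day,
})
out = pd.concat([df, gaps]).sort_values(['record', 'start_time']).reset_index(drop=True)
print(out)
```

On the sample data this yields the eight rows shown in the question's desired output.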

Is there any function to calculate the duration in minutes between two datetime values?

This is my dataframe.
Start_hour End_date
23:58:00 00:26:00
23:56:00 00:01:00
23:18:00 23:36:00
How can I get in a new column the difference (in minutes) between these two columns?
>>> from datetime import datetime
>>>
>>> before = datetime.now()
>>> print('wait for more than 1 minute')
wait for more than 1 minute
>>> after = datetime.now()
>>> td = after - before
>>>
>>> td
datetime.timedelta(seconds=98, microseconds=389121)
>>> td.total_seconds()
98.389121
>>> td.total_seconds() / 60
1.6398186833333335
Then you can round it or use it as-is.
You can do something like this:
import pandas as pd
df = pd.DataFrame({
    'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
    'End_date': ['00:26:00', '00:01:00', '23:36:00']}
)
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])
df['diff'] = df.apply(
    lambda row: (row['End_date'] - row['Start_hour']).seconds / 60,
    axis=1
)
print(df)
Start_hour End_date diff
0 2021-03-29 23:58:00 2021-03-29 00:26:00 28.0
1 2021-03-29 23:56:00 2021-03-29 00:01:00 5.0
2 2021-03-29 23:18:00 2021-03-29 23:36:00 18.0
You can also rearrange your dates as string again if you like:
df['Start_hour'] = df['Start_hour'].apply(lambda x: x.strftime('%H:%M:%S'))
df['End_date'] = df['End_date'].apply(lambda x: x.strftime('%H:%M:%S'))
print(df)
Output:
Start_hour End_date diff
0 23:58:00 00:26:00 28.0
1 23:56:00 00:01:00 5.0
2 23:18:00 23:36:00 18.0
Short answer:
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
Why so:
You are probably trying to solve the problem that your Start_hour and End_date values sometimes belong to different days, which is why you can't just subtract one from the other.
If your time window never exceeds a 24-hour interval, you can use some modular arithmetic to deal with the 23:59:59 - 00:00:00 border:
if End_date < Start_hour, this always means End_date belongs to the next day
this implies that if End_date - Start_hour < 0, we should add 24 hours to it to find the actual difference
The final formula is:
if rec['Start_hour'] < rec['End_date']:
    offset = timedelta(0)
else:
    offset = timedelta(hours=24)
rec['delta'] = offset + rec['End_date'] - rec['Start_hour']
To do the same with pandas.DataFrame we need to change code accordingly. And
that's how we get the snippet from the beginning of the answer.
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame([
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 0, 26, 0)},
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 23, 59, 0)},
])
# ...
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
>>> df
Start_hour End_date interval
0 0001-01-01 23:58:00 0001-01-01 00:26:00 0 days 00:28:00
1 0001-01-01 23:58:00 0001-01-01 23:59:00 0 days 00:01:00
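The if/else above can also be collapsed into a single vectorized expression: taking the difference modulo 24 hours shifts any negative (midnight-crossing) result into the correct range. A sketch under the same no-interval-over-24h assumption, using the question's sample times as plain strings; the `minutes` column name is a choice of this sketch:

```python
import pandas as pd

df = pd.DataFrame({'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
                   'End_date': ['00:26:00', '00:01:00', '23:36:00']})

# Time-of-day strings parse directly as Timedeltas
start = pd.to_timedelta(df['Start_hour'])
end = pd.to_timedelta(df['End_date'])

# Negative differences (end on the next day) wrap into [0, 24h) via the modulo
df['minutes'] = (end - start) % pd.Timedelta(hours=24) / pd.Timedelta(minutes=1)
print(df['minutes'].tolist())  # [28.0, 5.0, 18.0]
```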

Group time into time periods in Python Pandas

I want to write code that groups times into time periods. I have two columns, from and to, and I have the list periods. Based on the values in the two columns I need to insert a new column named period into the dataframe that will represent the time period.
This is the code:
import pandas as pd
df = pd.DataFrame({"from": ['08:10', '14:00', '15:00', '17:01', '13:41'],
                   "to": ['10:11', '15:32', '15:35', '18:23', '16:16']})
print(df)
periods = ["00:01-06:00", "06:01-12:00", "12:01-18:00", "18:01-00:00"]
#if times are between two periods, for example '17:01' and '18:23', it counts as first period ("12:01-18:00")
Result should look like this:
from to period
0 08:10 10:11 06:01-12:00
1 14:00 15:32 12:01-18:00
2 15:00 15:35 12:01-18:00
3 17:01 18:03 18:01-00:00
4 18:41 19:16 18:01-00:00
Values in two columns are datetime.
Here is a way to do it (I am assuming that "18:00" would belong in period "12:01-18:00"):
results = [0 for x in range(len(df))]
for row in df.iterrows():
    item = row[1]
    start = item['from']
    end = item['to']
    for ind, period in enumerate(periods):
        per_1, per_2 = period.split("-")
        if start.split(":")[0] >= per_1.split(":")[0]:  # hours
            if start.split(":")[0] == per_1.split(":")[0]:
                if start.split(":")[1] >= per_1.split(":")[1]:  # minutes
                    if start.split(":")[1] == per_1.split(":")[1]:
                        results[row[0]] = period
                        break
                    # Wrap around if you reach the end of the list
                    index = ind + 1 if ind < len(periods) else 0
                    results[row[0]] = periods[index]
                    break
                index = ind - 1 if ind > 0 else len(periods) - 1
                results[row[0]] = periods[index]
                break
            if start.split(":")[0] <= per_2.split(":")[0]:
                if start.split(":")[0] == per_2.split(":")[0]:
                    if start.split(":")[1] == per_2.split(":")[1]:
                        results[row[0]] = period
                        break
                    # If anything else, then its greater, so in next period
                    index = ind + 1 if ind < len(periods) else 0
                    results[row[0]] = periods[index]
                    break
                results[row[0]] = period
                break
print(results)
['06:01-12:00', '12:01-18:00', '12:01-18:00', '12:01-18:00', '18:01-00:00']
df['periods'] = results
df
from to periods
0 08:10 10:11 06:01-12:00
1 14:00 15:32 12:01-18:00
2 15:00 15:35 12:01-18:00
3 17:01 18:23 12:01-18:00
4 18:41 16:16 18:01-00:00
That should cover every scenario, but you should test it against all possible edge-case times to make sure.
Below
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"from": ['08:10', '14:00', '15:00', '17:01', '13:41'],
"to": ['10:11', '15:32', '15:35', '18:23', '16:16']})
print(df)
periods = ["00:01-06:00", "06:01-12:00", "12:01-18:00", "18:01-00:00"]
_periods = [(datetime.strptime(p.split('-')[0], '%H:%M').time(),
             datetime.strptime(p.split('-')[1], '%H:%M').time())
            for p in periods]
def match_row_to_period(row):
    from_time = datetime.strptime(row['from'], '%H:%M').time()
    to_time = datetime.strptime(row['to'], '%H:%M').time()
    for idx, p in enumerate(_periods):
        if from_time >= p[0] and to_time <= p[1]:
            return periods[idx]
    for idx, p in enumerate(_periods):
        if idx > 0:
            prev_p = _periods[idx - 1]
            if from_time <= prev_p[1] and to_time >= p[0]:
                return periods[idx - 1]
df['period'] = df.apply(lambda row: match_row_to_period(row), axis=1)
print('-----------------------------------')
print('periods: ')
for _p in _periods:
print(str(_p[0]) + ' -- ' + str(_p[1]))
print('-----------------------------------')
print(df)
output
from to
0 08:10 10:11
1 14:00 15:32
2 15:00 15:35
3 17:01 18:23
4 13:41 16:16
-----------------------------------
periods:
00:01:00 -- 06:00:00
06:01:00 -- 12:00:00
12:01:00 -- 18:00:00
18:01:00 -- 00:00:00
-----------------------------------
from to period
0 08:10 10:11 06:01-12:00
1 14:00 15:32 12:01-18:00
2 15:00 15:35 12:01-18:00
3 17:01 18:23 12:01-18:00
4 13:41 16:16 12:01-18:00
Not sure if there is a better solution, but here's a way using the apply and assign pandas methods, which is generally more pythonic than iterating a DataFrame, as pandas is optimised for full-column assignment operations rather than row-by-row updates (see this great blog post).
As a side note, the datatypes I've used here are datetime.time instances rather than strings as in your example. When dealing with time it's always better to use an appropriate time library than a string representation.
from datetime import time
df = pd.DataFrame({
"from": [
time(8, 10),
time(14, 00),
time(15, 00),
time(17, 1),
time(13, 41)
],
"to": [
time(10, 11),
time(15, 32),
time(15, 35),
time(18, 23),
time(16, 16)
]
})
periods = [{
'from': time(00, 1),
'to': time(6, 00),
'period': '00:01-06:00'
}, {
'from': time(6, 1),
'to': time(12, 00),
'period': '06:01-12:00'
}, {
'from': time(12, 1),
'to': time(18, 00),
'period': '12:01-18:00'
}, {
'from': time(18, 1),
'to': time(0, 00),
'period': '18:01-00:00'
}]
def find_period(row, periods):
    """Map the df row to the period which it fits between"""
    for ix, period in enumerate(periods):
        if row['to'] <= periods[ix]['to']:
            if row['from'] >= periods[ix]['from']:
                return periods[ix]['period']

# Use df assign to assign the new column to the df
df.assign(
    **{
        'period': df.apply(lambda row: find_period(row, periods), axis='columns')
    }
)
Out:
from to period
0 08:10:00 10:11:00 06:01-12:00
1 14:00:00 15:32:00 12:01-18:00
2 15:00:00 15:35:00 12:01-18:00
3 17:01:00 18:23:00 None
4 13:41:00 16:16:00 12:01-18:00
N.b. The row at ix 3 is correctly showing None, as it doesn't fit entirely within any one of the periods you defined (rather it bridges 12:01-18:00 and 18:01-00:00)
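Since the four periods tile the day contiguously, the lookup can also be vectorized with pd.cut on the start times, which (as the question's comment asks) assigns a straddling row by its from time. This is a sketch, not from either answer above; the minute-based bin edges and the `period` column are assumptions of the sketch:

```python
import pandas as pd

df = pd.DataFrame({"from": ['08:10', '14:00', '15:00', '17:01', '13:41'],
                   "to": ['10:11', '15:32', '15:35', '18:23', '16:16']})
periods = ["00:01-06:00", "06:01-12:00", "12:01-18:00", "18:01-00:00"]

# Each 'from' time as minutes after midnight, then binned at
# 00:00 / 06:00 / 12:00 / 18:00 / 24:00 (right-closed intervals)
start_min = pd.to_timedelta(df['from'] + ':00').dt.total_seconds() / 60
df['period'] = pd.cut(start_min, bins=[0, 360, 720, 1080, 1440],
                      labels=periods, right=True)
print(df)
```

Here '17:01' lands in "12:01-18:00" by its start time, matching the question's comment about straddling rows.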

Pandas: De-seasonalizing time-series data

I have the following dataframe df:
[Out]:
VOL
2011-04-01 09:30:00 11297
2011-04-01 09:30:10 6526
2011-04-01 09:30:20 14021
2011-04-01 09:30:30 19472
2011-04-01 09:30:40 7602
...
2011-04-29 15:59:30 79855
2011-04-29 15:59:40 83050
2011-04-29 15:59:50 602014
This df consists of volume observations taken every 10 seconds over 22 non-consecutive days. I want to de-seasonalize my time series by dividing each observation by the average volume of its respective 5-minute time interval. To do so, I need to take the time-series average of volume in every 5-minute bucket across the 22 days. So I would end up with a time series of averages for every 5 minutes: 09:30:00 - 09:35:00; 09:35:00 - 09:40:00; 09:40:00 - 09:45:00 ... until 16:00:00. The average for the interval 09:30:00 - 09:35:00 is the average of volume in this time interval across all 22 days (i.e. the total volume between 09:30:00 and 09:35:00 on day 1 + day 2 + ... + day 22, divided by 22. Does that make sense?). I would then divide each observation in df that falls between 09:30:00 and 09:35:00 by the average of this time interval.
Is there a package in Python / Pandas that can do this?
Edited answer:
import datetime
import numpy as np
import pandas as pd

date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0
df = pd.DataFrame(data={'VOL': VOL, 'time': date_times}, index=date_times)
df['h'] = df.index.hour
df['m'] = df.index.minute
df1 = df.resample('5Min').agg({'VOL': 'mean'})
times = pd.to_datetime(df1.index)
df2 = df1.groupby([times.hour,times.minute]).VOL.mean().reset_index()
df2.columns = ['h','m','VOL']
df_norm = df.merge(df2, on=['h', 'm'])
df_norm['norm'] = df_norm['VOL_x'] / df_norm['VOL_y']
** Older answer (keeping it temporarily)
Use the resample function:
df.resample('5Min').agg({'VOL': 'mean'})
eg:
date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0
df = pd.DataFrame(data={'VOL': VOL}, index=date_times)
df.resample('5Min').agg({'VOL': 'mean'})
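The whole de-seasonalization can also be collapsed into one groupby(...).transform over a 5-minute time-of-day bucket, with no merge step. A sketch on synthetic data; `bucket` and the `norm` column name are choices of this sketch, not from the answer above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range('2011-04-01 09:30', '2011-04-02 16:00', freq='10s')
df = pd.DataFrame({'VOL': rng.random(idx.size) * 10000.0}, index=idx)

# 5-minute time-of-day bucket shared across days: floor to 5 minutes,
# then subtract the date part so 09:30 on any day maps to the same key
bucket = df.index.floor('5min') - df.index.normalize()

# Divide each observation by its bucket's mean over all days
df['norm'] = df['VOL'] / df.groupby(bucket)['VOL'].transform('mean')
```

transform aligns the per-bucket means back to every row, so each `norm` value is the observation relative to its interval's cross-day average (mean 1.0 per bucket by construction).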
