I want to write a piece of code that returns the number of days, hours, and minutes between corresponding values of backtest_results and realtime_test. So I want to compute 2022-01-24 10:05:00 - 2022-01-30 14:09:03, 2022-01-27 01:54:00 - 2022-02-02 09:34:06,
backtest_results[0] - realtime_test[0]
backtest_results[1] - realtime_test[1]
...
How would I be able to code this with the data below?
import pandas as pd
import numpy as np
backtest_results = pd.to_datetime(['2022-01-24 10:05:00', '2022-01-27 01:54:00',
'2022-01-30 19:08:00','2022-02-02 14:32:00',
'2022-02-10 02:58:00', '2022-02-10 14:01:00',
'2022-02-11 00:25:00' '2022-02-16 13:49:00'])
realtime_test = pd.to_datetime([
'2022-01-30 14:09:03', '2022-02-02 09:34:06',
'2022-02-08 07:37:03', '2022-02-09 22:07:02',
'2022-02-10 09:02:03', '2022-02-10 19:32:25',
'2022-02-12 16:42:03', '2022-02-15 23:19:03'])
result = backtest_results - realtime_test
You're missing a comma in backtest_results in the last row:
backtest_results = pd.to_datetime(['2022-01-24 10:05:00', '2022-01-27 01:54:00',
'2022-01-30 19:08:00','2022-02-02 14:32:00',
'2022-02-10 02:58:00', '2022-02-10 14:01:00',
'2022-02-11 00:25:00', '2022-02-16 13:49:00'])
^^ here
Then, you can simply subtract one from the other.
If you want the raw difference:
>>> backtest_results - realtime_test
TimedeltaIndex(['-7 days +19:55:57', '-7 days +16:19:54', '-9 days +11:30:57',
'-8 days +16:24:58', '-1 days +17:55:57', '-1 days +18:28:35',
'-2 days +07:42:57', '0 days 14:29:57'],
dtype='timedelta64[ns]', freq=None)
If you want to get the difference in days:
>>> (backtest_results - realtime_test).astype('timedelta64[D]')
Float64Index([-7.0, -7.0, -9.0, -8.0, -1.0, -1.0, -2.0, 0.0], dtype='float64')
If you want the difference in hours:
>>> (backtest_results - realtime_test).astype('timedelta64[h]')
Float64Index([-149.0, -152.0, -205.0, -176.0, -7.0, -6.0, -41.0, 14.0], dtype='float64')
If you want the difference in minutes:
>>> (backtest_results - realtime_test).astype('timedelta64[m]')
Float64Index([-8885.0, -9101.0, -12270.0, -10536.0, -365.0, -332.0, -2418.0, 869.0], dtype='float64')
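Note that in recent pandas versions (2.0+), astype('timedelta64[D]') and friends raise for unsupported resolutions. A version-robust sketch divides by a Timedelta instead, shown here with two of the pairs above:

```python
import pandas as pd

backtest_results = pd.to_datetime(['2022-01-24 10:05:00', '2022-02-16 13:49:00'])
realtime_test = pd.to_datetime(['2022-01-30 14:09:03', '2022-02-15 23:19:03'])

diff = backtest_results - realtime_test

# Dividing a TimedeltaIndex by a Timedelta yields plain floats in that unit,
# and works the same way across pandas versions
days = diff / pd.Timedelta(days=1)
hours = diff / pd.Timedelta(hours=1)
minutes = diff / pd.Timedelta(minutes=1)
```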
df['diff'] = df.EndDate - df.StartDate
df['diff'] = df['diff'] / np.timedelta64(1, 'D')
'D' for days, 'W' for weeks; calendar units like months ('M') and years ('Y') are not fixed-length, so they cannot be used as a timedelta64 divisor here.
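A minimal sketch of that snippet, with hypothetical StartDate/EndDate columns:

```python
import pandas as pd
import numpy as np

# Hypothetical StartDate/EndDate columns for illustration
df = pd.DataFrame({'StartDate': pd.to_datetime(['2022-01-01', '2022-01-05']),
                   'EndDate': pd.to_datetime(['2022-01-03', '2022-01-12'])})

# Difference as a float number of days ('W' would give weeks instead)
df['diff'] = (df.EndDate - df.StartDate) / np.timedelta64(1, 'D')
```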
Related
I am trying to capture the frequency of hours between two timestamps in a dataframe. For example, one row of data has '2022-01-01 00:35:00' and '2022-01-01 05:29:47'. I would like for frequency to be attributed to Hours 0, 1, 2, 3, 4, and 5.
Start Time           End Time
2022-01-01 00:35:00  2022-01-01 05:29:47
2022-01-01 00:55:00  2022-01-01 05:00:17
2022-01-01 01:35:00  2022-01-01 06:26:00
2022-01-01 02:29:00  2022-01-01 04:25:17
I have been trying to capture the time delta between the two but have not been able to figure out counting the frequency of hours.
You can extract the hours and then calculate the delta:
import datetime
df['start_hour'] = [datetime.datetime.strptime(i, "%Y-%m-%d %H:%M:%S").hour for i in df['Start Time']]
df['end_hour'] = [datetime.datetime.strptime(i, "%Y-%m-%d %H:%M:%S").hour for i in df['End Time']]
df['delta'] = df['end_hour'] - df['start_hour']
Try this:
df['freq'] = df.apply(lambda x: list(range(x['Start Time'].hour, x['End Time'].hour + 1)), axis=1)
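To turn per-row hour lists into an overall hour frequency, one sketch (assuming Start Time/End Time are already datetimes) uses explode and value_counts:

```python
import pandas as pd

df = pd.DataFrame({
    'Start Time': pd.to_datetime(['2022-01-01 00:35:00', '2022-01-01 01:35:00']),
    'End Time': pd.to_datetime(['2022-01-01 05:29:47', '2022-01-01 06:26:00']),
})

# Hours touched by each interval, inclusive of both endpoints
df['hours'] = df.apply(
    lambda r: list(range(r['Start Time'].hour, r['End Time'].hour + 1)), axis=1)

# One row per (interval, hour), then count occurrences of each hour
freq = df['hours'].explode().astype(int).value_counts().sort_index()
print(freq)
```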
I have this dataframe:
I need to get the values of the single days between time 05:00:00 and 06:00:00 (so, in this example, ignore 07:00:00)
And create a separate dataframe for each day considering the last 3 days.
This is the result I want to achieve: (3 dataframes covering the last 3 days, with Time between 05 and 06)
I tried this: (without success)
df.sort_values(by = "Time", inplace=True)
df_of_yesterday = df[ (df.Time.dt.hour > 4)
& (df.Time.dt.hour < 7)]
You can use:
from datetime import date, time, timedelta
today = date.today()
m = df['Time'].dt.time.between(time(5), time(6))
df_yda = df.loc[m & (df['Time'].dt.date == today - timedelta(1))]
df_2da = df.loc[m & (df['Time'].dt.date == today - timedelta(2))]
df_3da = df.loc[m & (df['Time'].dt.date == today - timedelta(3))]
Output:
>>> df_yda
Time Open
77 2022-03-09 05:00:00 0.880443
78 2022-03-09 06:00:00 0.401932
>>> df_2da
Time Open
53 2022-03-08 05:00:00 0.781377
54 2022-03-08 06:00:00 0.638676
>>> df_3da
Time Open
29 2022-03-07 05:00:00 0.838719
30 2022-03-07 06:00:00 0.897211
Setup a MRE:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
dti = pd.date_range('2022-03-06', '2022-03-10', freq='H')
df = pd.DataFrame({'Time': dti, 'Open': rng.random(len(dti))})
Use Series.between with offsets.DateOffset to select datetimes between those times, inside a list comprehension that builds a list of DataFrames:
now = pd.to_datetime('now').normalize()
dfs = [df[df.Time.between(now - pd.DateOffset(days=i, hour=5),
now - pd.DateOffset(days=i, hour=6))] for i in range(1,4)]
print (dfs[0])
print (dfs[1])
print (dfs[2])
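As an alternative sketch on the same MRE, DataFrame.between_time does the clock-time filtering once Time is the index (note its endpoints are inclusive), and a groupby on the date splits the result per day:

```python
import pandas as pd
import numpy as np

# Same MRE as above
rng = np.random.default_rng(0)
dti = pd.date_range('2022-03-06', '2022-03-10', freq='h')
df = pd.DataFrame({'Time': dti, 'Open': rng.random(len(dti))})

# between_time filters on the clock time of a DatetimeIndex (endpoints inclusive)
hourly = df.set_index('Time').between_time('05:00', '06:00')

# One DataFrame per calendar day
per_day = {day: grp for day, grp in hourly.groupby(hourly.index.date)}
```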
I've manually copied your data into a dictionary and then converted it to your desired output.
First you should probably edit your question to use the text version of the data instead of an image, here's a small example:
data = {
'Time': [
'2022-03-06 05:00:00',
'2022-03-06 06:00:00',
'2022-03-06 07:00:00',
'2022-03-07 05:00:00',
'2022-03-07 06:00:00',
'2022-03-07 07:00:00',
'2022-03-08 05:00:00',
'2022-03-08 06:00:00',
'2022-03-08 07:00:00',
'2022-03-09 05:00:00',
'2022-03-09 06:00:00',
'2022-03-09 07:00:00'
],
'Open': [
'13823.6',
'13786.6',
'13823.6',
'13823.6',
'13786.6',
'13823.6',
'13823.6',
'13786.6',
'13823.6',
'13823.6',
'13786.6',
'13823.6'
]
}
df = pd.DataFrame(data)
Then you can use this code to collect all the values that fall on the same day between hours 4 and 7, and create your dataframes as follows:
import pandas as pd
from datetime import datetime
groups = {}  # avoid shadowing the built-in `dict`
for index, row in df.iterrows():
    found = False
    for item in groups:
        date = datetime.strptime(row['Time'], '%Y-%m-%d %H:%M:%S')
        date2 = datetime.strptime(item, '%Y-%m-%d')
        if date.date() == date2.date() and date.hour > 4 and date.hour < 7:
            groups[item].append(row['Open'])
            found = True
    date = datetime.strptime(row['Time'], '%Y-%m-%d %H:%M:%S')
    if not found and date.hour > 4 and date.hour < 7:
        groups[date.strftime('%Y-%m-%d')] = []
        groups[date.strftime('%Y-%m-%d')].append(row['Open'])
for key in groups:
    temp = {key: groups[key]}
    df = pd.DataFrame(temp)
    print(df)
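The same grouping can also be sketched with pandas itself, without the nested loops (shown on a subset of the sample data above):

```python
import pandas as pd

# A subset of the sample data above
data = {'Time': ['2022-03-06 05:00:00', '2022-03-06 06:00:00', '2022-03-06 07:00:00',
                 '2022-03-07 05:00:00', '2022-03-07 06:00:00'],
        'Open': ['13823.6', '13786.6', '13823.6', '13823.6', '13786.6']}
df = pd.DataFrame(data)
t = pd.to_datetime(df['Time'])

# Keep rows whose hour is strictly between 4 and 7, then split by calendar day
mask = (t.dt.hour > 4) & (t.dt.hour < 7)
frames = {day: grp for day, grp in df[mask].groupby(t[mask].dt.date)}
for day, grp in frames.items():
    print(day)
    print(grp)
```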
I have the following dataframe:
data = {'id': [0, 0, 0, 0, 0, 0],
'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00', '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
I have been trying to calculate, for each day, which order is closest to the midpoint of a 15-minute window, e.g.
I take a 15-minute window and its midpoint 00:07:30, which means I would like to calculate the difference between the first order '2019-01-01 0:00:00' and 00:07:30, and between the second order '2019-01-01 00:11:00' and 00:07:30, and keep only the order that is closer to 00:07:30 each day.
I did the following:
t = 0
x = '00:00:00'
y = '00:15:00'
g = 0
a = []
for i in range(1, len(df_data)):
    g += 1
    half_time = (pd.Timestamp(y) - pd.Timstamp(x).to_pydatetime()) / 2
    half_window = (half_time + pd.Timestamp(x).to_pydatetime()).strftime('%H:%M:%S')
    for l in df_data['day_order']:
        for k in df_data['time_order']:
            if l == k.strftime('%Y-%m-%d')
                distance1 = abs(pd.Timestamp(df_data.iat[i-1, 4].to_pydatetime() - pd.Timestamp(half_window).to_pydatetime())
                distance2 = abs(pd.Timestamp(df_data.iat[i, 4].to_pydatetime() - pd.Timestamp(half_window).to_pydatetime())
                if distance1 < distance2:
                    d = distance1
                else:
                    d = distance2
                a.append(d.seconds)
So the expected result for the first day is abs(00:11:00 - 00:07:30) = 00:03:30, which is less than abs(00:00:00 - 00:07:30) = 00:07:30. By doing this I would keep only the shorter time distance, i.e. the 00:03:30, and ignore the first order of that day. I would like to do this for each day. I tried it with my code above, but it doesn't work. Any idea would be very appreciated. Thanks in advance.
I am not sure about the format of the expected output, but I would try to bring the result to a point where you can extract data as you like:
Loading given data:
import pandas as pd
data = {'id': [0, 0, 0, 0, 0, 0],
'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00', '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
Calculating difference:
x = '00:00:00'
y = '00:15:00'
diff = (pd.Timedelta(y)-pd.Timedelta(x))/2
Creating a new column 'diff' as a timedelta (the 'time' strings must first be converted with pd.to_timedelta):
df_data['diff'] = abs(pd.to_timedelta(df_data['time']) - diff)
Grouping (based on date) and apply:
mins = df_data.groupby('day_order').apply(lambda x: x[x['diff']==min(x['diff'])])
Removing Index (optional):
mins.reset_index(drop=True, inplace=True)
Output DataFrame:
>>> mins
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:03:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:00:30
Making list of difference in seconds:
a = list(mins['diff'].apply(lambda x:x.seconds))
Output:
>>> a
[210, 210, 30]
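As a sketch of an alternative to the groupby/apply step, idxmin can pick the closest order per day in one pass (same data as above; the timedelta is built directly from the timestamps rather than from the formatted strings):

```python
import pandas as pd

data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00',
                       '2019-01-02 00:04:00', '2019-01-02 00:15:00',
                       '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')

half = pd.Timedelta('00:07:30')
# Clock time since midnight as a timedelta, then distance from the midpoint
tod = df_data['time_order'] - df_data['time_order'].dt.normalize()
df_data['diff'] = (tod - half).abs()

# idxmin returns the row label of the smallest diff within each day
mins = df_data.loc[df_data.groupby('day_order')['diff'].idxmin()]
print(list(mins['diff'].dt.seconds))
```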
I have a start date and an end date and I would like to have the date range between start and end, on a specific day (e.g. the 10th day of every month).
Example:
start_date = '2020-01-03'
end_date = '2020-10-19'
wanted_result = ['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',...,'2020-10-10', '2020-10-19']
I currently have a solution which creates all the dates between start_date and end_date and then subsamples only the dates on the 10th, but I do not like it, I think it is too cumbersome. Any ideas?
import pandas as pd
querydate = 10
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[[0]].append(dates[dates.day == querydate])
If you also need the first and last values, filter with Index.isin against the first and last values; this keeps all values unique, avoiding duplicates when the first or last day is the 10th:
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[dates.isin(dates[[0,-1]]) | (dates.day == querydate)]
print (dates)
DatetimeIndex(['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',
'2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10',
'2020-08-10', '2020-09-10', '2020-10-10', '2020-10-19'],
dtype='datetime64[ns]', freq=None)
If you need a list:
print (list(dates.strftime('%Y-%m-%d')))
['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',
'2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10',
'2020-08-10', '2020-09-10', '2020-10-10', '2020-10-19']
Changed sample data:
start_date = '2020-01-10'
end_date = '2020-10-10'
querydate = 10
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[dates.isin(dates[[0,-1]]) | (dates.day == querydate)]
print (dates)
DatetimeIndex(['2020-01-10', '2020-02-10', '2020-03-10', '2020-04-10',
'2020-05-10', '2020-06-10', '2020-07-10', '2020-08-10',
'2020-09-10', '2020-10-10'],
dtype='datetime64[ns]', freq=None)
Try this:
dates = pd.Series(
    [pd.to_datetime(start_date)]
    + [i for i in pd.date_range(start=start_date, end=end_date) if i.day == 10]
    + [pd.to_datetime(end_date)]
).drop_duplicates()
print(dates)
Output:
0 2020-01-03
1 2020-01-10
2 2020-02-10
3 2020-03-10
4 2020-04-10
5 2020-05-10
6 2020-06-10
7 2020-07-10
8 2020-08-10
9 2020-09-10
10 2020-10-10
11 2020-10-19
dtype: datetime64[ns]
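Another sketch, assuming the same start_date/end_date, builds the monthly 10ths directly with a DateOffset frequency instead of filtering a daily range, then unions in the endpoints:

```python
import pandas as pd

start_date = '2020-01-03'
end_date = '2020-10-19'

# First 10th on or after start_date
first_tenth = pd.Timestamp(start_date).replace(day=10)
if first_tenth < pd.Timestamp(start_date):
    first_tenth += pd.DateOffset(months=1)

# Monthly 10ths, generated directly rather than filtered from a daily range
tenths = pd.date_range(first_tenth, end_date, freq=pd.DateOffset(months=1))

# union sorts and deduplicates, so a start/end falling on a 10th is safe
dates = tenths.union(pd.DatetimeIndex([start_date, end_date]))
```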
I have CSV data from a log file:
UTC_TIME,FOCUS,IRIS,ZOOM,PAN,TILT,ROLL,LONGITUDE,LATITUDE,ALTITUDE,RED_RECORD
23:2:24.1, 10.9, 32.0, 180.0, 16.7, -29.5, -0.0, 151.206135, -33.729484, 1614.3, 0
23:2:24.2, 10.9, 32.0, 180.0, 16.7, -29.5, -0.0, 151.206135, -33.729484, 1614.3, 0
23:2:24.3, 10.9, 32.0, 180.0, 16.7, -29.5, -0.0, 151.206135, -33.729484, 1614.3, 0
This is my code so far:
vfx_df = pd.read_csv(data, header=0, low_memory=False)
I have to split the "nano" seconds off because they are frame numbers, not nanoseconds.
vfx_df['UTC_TIME'] = vfx_df['UTC_TIME'].str.split('.', n=1, expand=True)[0]
vfx_df['UTC_TIME'] = pd.to_datetime(vfx_df['UTC_TIME'], format='%H:%M:%S')
vfx_df.set_index('UTC_TIME', inplace=True, drop=True)
vfx_df = vfx_df.tz_localize('UTC')
vfx_df = vfx_df.tz_convert('Australia/Sydney')
I am left with these results: 1900-01-02 09:32:20+10:05
How do I change the year, month, and day to the date it was actually filmed on?
Consider also that filming can run over 6 hours, so a UTC timestamp in the log can roll over to the next day in local time.
I have tried setting the origin on import and :
vfx_df['UTC_TIME'] = pd.to_datetime(vfx_df['UTC_TIME'], format='%H:%M:%S', unit='D', origin=pd.Timestamp('2020-03-03'))
I have looked into TimeDeltas and offsets I just can't seem to get it...
I just feel like I'm doing something wrong and would just love to see a more Pythonic way of doing this.
Thanks
Not sure where the date is coming from, but if you're trying to input it manually, you can string format it into the timestamp:
import pandas as pd
from datetime import datetime
def parseDT(input_dt, ts):
out = datetime.strptime(f'{input_dt} {ts}', '%Y-%m-%d %H:%M:%S.%f')
return(out)
input_dt = '2020-04-20'
vfx_df['UTC_DATETIME'] = [parseDT(input_dt, ts) for ts in vfx_df['UTC_TIME']]
Which yields:
In [34]: vfx_df
Out[34]:
UTC_TIME FOCUS IRIS ZOOM PAN TILT ROLL LONGITUDE LATITUDE ALTITUDE RED_RECORD UTC_DATETIME
0 23:2:24.1 10.9 32.0 180.0 16.7 -29.5 -0.0 151.206135 -33.729484 1614.3 0 2020-04-20 23:02:24.100
1 23:2:24.2 10.9 32.0 180.0 16.7 -29.5 -0.0 151.206135 -33.729484 1614.3 0 2020-04-20 23:02:24.200
2 23:2:24.3 10.9 32.0 180.0 16.7 -29.5 -0.0 151.206135 -33.729484 1614.3 0 2020-04-20 23:02:24.300
EDIT: Adding timezone conversion
Since it sounds like your date and time are out of sync you'll have to adjust the time to whatever timezone your date is supposed to be in and then input the date.
Using pytz library:
from datetime import datetime
import pytz
def parse_dt(input_ts, target_tz, year, month, day):
ts = datetime.strptime(f'{input_ts}', '%H:%M:%S.%f') # naive datetime (no timezone attached)
ts_adj = ts.astimezone(pytz.timezone(target_tz)) # convert to time zone
out_dt = ts_adj.replace(year=year, month=month, day=day)
return(out_dt)
Which yields:
In [100]: parse_dt('23:2:24.1', 'EST', 2020, 4, 20)
Out[100]: datetime.datetime(2020, 4, 20, 2, 2, 24, 100000, tzinfo=<StaticTzInfo 'EST'>)
Using timedelta from datetime
from datetime import datetime, timedelta
def parse_dt(input_ts, ts_offset, year, month, day):
ts = datetime.strptime(f'{input_ts}', '%H:%M:%S.%f') # naive datetime (no timezone attached)
ts_adj = ts - timedelta(hours=ts_offset) # subtract x hours
out_dt = ts_adj.replace(year=year, month=month, day=day)
return(out_dt)
Which yields
In [108]: parse_dt('23:2:24.1', 6, 2020, 4, 20)
Out[108]: datetime.datetime(2020, 4, 20, 17, 2, 24, 100000)
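A pandas-native sketch of the same idea: time-only strings parse onto 1900-01-01, so subtracting that origin turns them into clock-time timedeltas that can be added to a known shoot date (the 2020-03-03 origin below is the question's example date; the fractional part is treated as subseconds here, so a frame-count fraction would still need an fps conversion):

```python
import pandas as pd

times = pd.Series(['23:2:24.1', '23:2:24.2', '23:2:24.3'])

# Parsing time-only strings lands them on 1900-01-01; subtract that origin to
# get the clock times as timedeltas, then add the shoot date
parsed = pd.to_datetime(times, format='%H:%M:%S.%f')
shoot_date = pd.Timestamp('2020-03-03')  # the question's example date
utc = shoot_date + (parsed - pd.Timestamp('1900-01-01'))

# Day rollover past midnight falls out of the timezone conversion naturally
local = utc.dt.tz_localize('UTC').dt.tz_convert('Australia/Sydney')
print(local)
```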