find the time difference between days - python

I have different machines running for different numbers of hours, sometimes crossing over midnight, and I want to split the running time by day.
For example, Machine A runs 8 hours, with a Start Date and Time of 12-Aug 9pm and an end of 13-Aug 5am.
I can't get the correct split of 3 hours on 12-Aug and 5 hours on 13-Aug.
I suspect that's because I'm using datetime.now().
How do I make the date match the Start Date / End Date in Python?
Here is my code:
endoftoday = datetime.now()
endoftoday = endoftoday.replace(hour=23,minute=59,second=59)
dt['Start_Date']=dt['Start_Time'].dt.strftime('%d/%m/%Y')
dt['End_Date']=dt['Finish_Time'].dt.strftime('%d/%m/%Y')
if (dt['Start_Date'].str == dt['End_Date'].str):
    dt['Tested_Time_Today'] = endoftoday - dt['Start_Time']
    dt['Tested_Time_NextDay'] = dt['Finish_Time'] - endoftoday

Here is my attempt:
import pandas as pd
import datetime
def get_times(args):
    start_time, end_time, start_date, end_date = args
    hours = {}
    for day in pd.date_range(start_date, end_date, freq='d'):
        # clamp the run to this day: from max(start, midnight) to min(end, next midnight)
        hours[day] = min(day + datetime.timedelta(hours=24), end_time) - max(start_time, day)
    return hours
df = pd.DataFrame({
    'Start_Time': [datetime.datetime(2021, 8, 21, 6, 2), datetime.datetime(2021, 8, 21, 7, 19)],
    'Finish_Time': [datetime.datetime(2021, 8, 22, 5, 12), datetime.datetime(2021, 8, 21, 16, 50)],
    'Start_Date': [datetime.date(2021, 8, 21), datetime.date(2021, 8, 21)],
    'End_Date': [datetime.date(2021, 8, 22), datetime.date(2021, 8, 21)],
})
df['hours'] = df.apply(get_times, axis=1)
print(df)
This is probably not exactly what you are looking for, since I also don't fully understand your question. But what you get is a new column which contains, in each row, a dictionary with the dates as keys and the hours run during each day as values.
If you let us know what exactly you are after, I might be able to improve the answer.
Edit: I haven't tested periods that cover more than two days; if you need that, double-check the per-day arithmetic (full middle days should come out as 24 hours each). And if you have more columns than the ones we perform the calculation on, please change the penultimate line to df['hours'] = df[['Start_Time', 'Finish_Time', 'Start_Date', 'End_Date']].apply(get_times, axis=1)
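For reference, with the sample frame above, the hours column works out to roughly this (Timestamp keys, Timedelta values; computed by hand, so treat it as a sketch):
row 0: {2021-08-21: 17:58:00, 2021-08-22: 05:12:00}
row 1: {2021-08-21: 09:31:00}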

Related

How to use relativedelta to dynamically add dates to a list of dates in a dataframe

I am new to python and have a few questions regarding dates.
Here is an example - I have a list of dates going from 01/01/2012 to 01/01/2025 with a monthly frequency. They will also change based on the data frame: say one column of dates has 130 months in between, another has 140 months, and so on.
The end goal is: regardless of how many months each set has, I need each "group" to have 180 months. So, in the above example of 01/01/2012 - 01/01/2025, I would need to add enough months to reach 01/01/2027.
Please let me know if this makes sense.
So if I understand you correctly, you have some data like:
import pandas as pd, numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
from random import randint
starts = [dt for i in range(130) if (dt := date(2012, 1, 1) + relativedelta(months=i)) <= date(2020, 1, 1)]
ends = [dt + relativedelta(months=randint(1, 5)) for dt in starts]
df = pd.DataFrame({'start': pd.to_datetime(starts), 'end': pd.to_datetime(ends)})
so the current duration in months is:
df['duration'] = ((df.end - df.start)/np.timedelta64(1, 'M')).round().astype(int)
and you want to know how many to add to make the duration 180 months?
df['need_to_add'] = 180 - df.duration
then you can calculate a new end by something like:
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
df['new_end'] = df.apply(lambda r: add_months(r['end'], r['need_to_add']), axis=1)
I'm sure I haven't quite understood, as you could just add 180 months to the start date, but hopefully this gets you close to where you need to be.
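If that simpler reading is right, the whole thing collapses to a one-line sketch (same df and imports as above):
df['new_end'] = df['start'].apply(lambda d: d + relativedelta(months=180))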

Using pandas and datetime in Jupyter to see during what hours no sales were ever made (on any day)

So I have sales data that I'm trying to analyze. I have datetime data ["Order Date Time"] and I'd like to see the most common hours for sales, but more importantly, I'd like to see which minutes have NO sales.
I have been spinning my wheels for a while and I can't get my brain around a solution. Any help is greatly appreciated.
I import the data:
import pandas as pd
df = pd.read_excel('Audit Period.xlsx')
print(df)
I clean up the data:
# Remove null rows and all columns except `Order Date Time`
time_df = df[df["Order Date Time"].notnull()]
# Ensure the index is still sequential
time_df = time_df[["Order Date Time"]].reset_index(drop=True)
# Select the first 10 rows
time_df.head(10)
I convert to datetime and I look at the month totals:
# Convert Order Date Time to datetime
time_df = time_df.copy()
time_df["Order Date Time"] = time_df["Order Date Time"].apply(pd.to_datetime)
time_df = time_df.set_index(time_df["Order Date Time"])
# Group by month
grouped = time_df.resample("M").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
I try to group by hour, but that gives me totals per day-and-hour rather than totals per hour of day (like every order ever placed at noon, etc.):
# Group into 2-hour bins
grouped = time_df.resample("2H").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
And that is where I'm stuck. I'm trying to integrate the below suggestions but can't quite get a grasp on them yet. Any help would be appreciated.
Not sure if this is the most brilliant solution, but I would start by generating a dataframe at the level of detail you want, whether that is 1-hour intervals, 5-minute intervals, etc. Then, in your df with all the actual data, do your grouping as you currently are doing it above. Once it is grouped, join the two. That way you have one dataframe that includes empty rows for time spans with no records. The tricky part will just be making sure you have your date and time formatted in a way that will match and join properly.
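Here is a minimal sketch of that idea at hour-of-day granularity, with made-up data; reindexing against the full 0-23 range plays the role of the join, so hours that never had a sale show up as zero:
import pandas as pd
# made-up orders; in the real data this would be the "Order Date Time" column
orders = pd.DataFrame({"Order Date Time": pd.to_datetime([
    "2021-03-01 09:15", "2021-03-01 13:40", "2021-03-02 09:05"])})
# count orders per hour of day, then fill in the hours that never appear
per_hour = orders.groupby(orders["Order Date Time"].dt.hour).size()
per_hour = per_hour.reindex(range(24), fill_value=0)
print(per_hour[per_hour == 0].index.tolist())  # hours of day with no sales ever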

How do I calculate time difference in days or months in python3

I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
import datetime
import pandas as pds
import numpy as npy

# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I can understand, if you have a start date and time and an end date and time, then you can use the datetime module in Python.
To use this, something like this would be used:
from datetime import datetime

# datetime(year, month, day, hour, minute, second)
start = datetime(2017, 5, 8, 18, 56, 40)
end = datetime(2019, 6, 27, 12, 30, 58)
print(end - start)  # this will print the difference between the two dates and times
Hope this answer helps you.
Ok so I figured it out. In my second-to-last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime, so that's what was causing that error.
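For completeness, here is a sketch of what that fix looks like, with the same %m-%Y format applied to both columns:
import datetime
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%m-%Y')
# e.g. '04-2010' minus '01-2009':
print(time_diff('04-2010') - time_diff('01-2009'))  # 455 days, 0:00:00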
Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
EX:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
For the dummy example, that will give you
df['timediff_days']
0    456.0
1      NaN
Name: timediff_days, dtype: float64
Regarding calculation of the difference in months, you can find some suggestions for how to calculate that here. I'd go with #piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0    15.0
1     NaN
Name: timediff_months, dtype: float64

Drop All Rows That Aren't The Beginning or End of Week

I've got a stock data dataframe with dates as the index column. What I'd like to do is drop all rows that aren't the beginning or end of the week, effectively leaving me with a dataframe of (mostly) Mondays and Fridays. The trick is, I don't want to just look for Mondays and Fridays, because some weeks are short weeks, starting on Tuesdays or ending on Thursdays (or otherwise; maybe some weeks have a Wednesday off too?).
The logic I have right now (with reproducible code) for dropping all rows that aren't the beginning of the week looks like this:
import pandas_datareader.data as web
import numpy as np
import pandas as pd
from pandas import Series
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import warnings
warnings.filterwarnings("once")
from datetime import datetime, timedelta
# Import a stock dataset from Yahoo
ticker = 'SPY'
start = datetime(2010, 1, 1)
end = datetime.today().strftime('%Y-%m-%d')
# Download the df
df = web.DataReader(ticker, 'yahoo', start, end)
# Drop the Adj Close column for now
df = df.drop(['Adj Close'], axis=1)
print(df)
# Check if day of week is Monday
print('Checking for beginnings of weeks...')
df = df.reset_index() # Make the date index an actual column again for now
df['week_day_objects'] = pd.to_datetime(df['Date'], format='%Y-%m-%d') # make the dates a datetime object
for i in range(len(df)-1, 0, -1):  # start at the bottom of the DF and work backwards
    if df['week_day_objects'].iloc[i] > df['week_day_objects'].iloc[i-1] + timedelta(days=2):  # the first day of a week is always > 2 days after the previous date, holidays included
        continue  # if today is the start of the week, continue the loop...
    else:
        df = df.drop([df.index[i]])  # ...else, drop all rows that aren't at the beginning of the week
df = df.set_index(['Date']) # make the date column the index again
df = df.drop(['week_day_objects'], axis=1) # drop the datetime column now
# For review
df.to_csv('./Check_Week_Days.csv', index=True)
...however, I'm stuck trying to also incorporate Fridays (or rather, the end of the week) into this solution. And I'm not even sure this is the best way to do it, so I'm open to suggestions. The logic above basically looks for any day that's at least 3 days later than the previous row, which marks the beginning of a week, since a new work week always starts at least 3 days after the last work day of the previous week.
As requested, some clarification. Like I mentioned above, I don't just want to drop all rows that aren't Fridays or Mondays, because some weeks are short weeks, so the beginning of the week could be a Tuesday, or the end of a week could be a Thursday, and I don't want to lose those rows. What I'd like to end up with is a dataframe of rows that fall on the first business day of each week and the last business day of that week, whether that be a Friday or Thursday / Monday or Tuesday. So the final dataset would look like this:
Notice how most weeks are Monday to Friday; however, the 18th is a Tuesday because the 17th of that year was a holiday. I'm not looking to sync my calendar to holidays; I want to drop all the middle days between whatever business day started a week and whatever business day ended it. Hope that helps?
Thanks!
You can use the dayofweek attribute of the datetime accessor to select rows and delete them based on the index.
import numpy as np
import pandas as pd
dates_df = pd.DataFrame(np.arange(np.datetime64('2000-01-03'), np.datetime64('2000-01-25')), columns=['date'])
dates_df = dates_df.drop(dates_df[dates_df['date'].dt.dayofweek == 6].index)
The snippet above will drop all Sunday values (dayofweek runs from Monday=0 to Sunday=6).
But you can also select the data that matches the first or last day of the week instead of dropping it:
dates_df[(dates_df['date'].dt.dayofweek == 0) | (dates_df['date'].dt.dayofweek == 4)]
I've figured it out with the below function using the day of week numbers:
# Check whether each day is the start or end of a week
print('Checking for beginnings of weeks...')
df = df.reset_index() # Make the date index an actual column again for now
df['week_day_objects'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').dt.dayofweek  # day-of-week number for each date
for i in range(len(df)-2, 1, -1):  # start at the bottom of the DF and work backwards. Need to trim the top/bottom rows accordingly later.
    if ((df['week_day_objects'].iloc[i] < df['week_day_objects'].iloc[i-1] and
         df['week_day_objects'].iloc[i] < df['week_day_objects'].iloc[i+1]) or  # a beginning of week always has a day number lower than the day after it and the day before it...
        (df['week_day_objects'].iloc[i] > df['week_day_objects'].iloc[i-1] and
         df['week_day_objects'].iloc[i] > df['week_day_objects'].iloc[i+1])):  # ...and an EOW always has a number greater than the day before it and the day after it.
        continue  # if today is the start or end of the week, skip...
    else:
        df = df.drop([df.index[i]])  # ...else, drop all rows that aren't at the beginning/end of the week
df = df.set_index(['Date']) # make the date column the index again
df = df.drop(['week_day_objects'], axis=1)  # drop the helper column now
# For review
df.to_csv('./Check_Week_Days.csv', index=True)
So a start of week will always have a lower day number than the previous row/day's number, and it will also be lower than tomorrow's number; reverse that for the end of week. This makes it work no matter what the start or end of a week is, be it a Thursday end or a Tuesday start.
This loop doesn't touch the very top/bottom rows of the dataframe, though, leaving some cleanup to do, but I will write separate code to take care of that.
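For what it's worth, the same neighbour comparison can be done without the Python loop. Here is a minimal vectorized sketch of that idea, using a made-up frame with a DatetimeIndex and a pretend holiday rather than your Yahoo data; the fill values make the very first and last rows count as week boundaries, so no extra cleanup is needed:
import numpy as np
import pandas as pd
# made-up trading calendar: business days with one pretend holiday removed
idx = pd.bdate_range('2021-01-04', '2021-01-29')
idx = idx[idx != pd.Timestamp('2021-01-18')]
df = pd.DataFrame({'Close': np.arange(len(idx))}, index=idx)
dow = pd.Series(df.index.dayofweek, index=df.index)
is_start = dow < dow.shift(1, fill_value=7)   # lower than the previous row's number -> week start
is_end = dow > dow.shift(-1, fill_value=-1)   # higher than the next row's number -> week end
df = df[is_start | is_end]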

Pandas dataframe get next (trading) day in dataframe

I have a date given that may or may not be a trading day, and I have a pandas dataframe indexed by trading days that has returns of each trading day.
This is my date
dt_query = datetime.datetime(2006, 12, 31, 16)
And I want to do something like this (returns is a pandas dataframe)
returns.ix[pd.Timestamp(dt_query + datetime.timedelta(days = 1))]
However, that may not work, as one day ahead may or may not be a trading day. I could write a try block that loops until it finds something, but I'm wondering if there's a more elegant way that just uses pandas.
This might not be the most elegant solution, but it works.
Here's the idea: from any date dt_query, within some number of calendar days (say 10), there must be trading days, and your next trading day is just the first among them. So you can find all days in returns between dt_query and dt_query + timedelta(days = 10), and then take the first one.
Using your example, it should look like:
next_trading_date = returns.index[(returns.index > dt_query) & (returns.index <= dt_query + timedelta(days = 10))][0]
You can check the timedelta of the whole index by doing:
delta = returns.index - dt_query
then use np.timedelta64() to define a tolerance for selecting rows:
tol = np.timedelta64(2, 'D')
and:
returns[(delta > np.timedelta64(0, 'D')) & (delta < tol)]
will return the rows within the desired range (the first comparison keeps only days after dt_query)...
Thank you! That's been plaguing me for hours.
I altered it a bit:
try:
    date_check = dja[start_day]
except KeyError:
    print("Start date not a trading day, fetching next trading day...")
    test = dja.index.searchsorted(start_day)
    next_date = dja.index[(dja.index > start_day)]
    start_date = next_date[0]
    print("New date:", start_date)
