Rounding datetime based on time of day - python

I have a pandas dataframe with timestamps shown below:
6/30/2019 3:45:00 PM
I would like to round the date based on time. Anything before 6AM will be counted as the day before.
6/30/2019 5:45:00 AM -> 6/29/2019
6/30/2019 6:30:00 AM -> 6/30/2019
What I have considered doing is splitting the date and time into two columns, then using an if statement to shift the date (if time >= 06:00, etc.). I'm wondering whether there is a built-in function in pandas to do this. I've seen posts about rounding up or down to the nearest hour, but never against a specific time threshold (6 AM).
Thank you for the help!

There may be a better way to do this, but here is one approach.
import pandas as pd
def checkDates(d):
    if d.time().hour < 6:
        return d - pd.Timedelta(days=1)
    else:
        return d
ls = ["12/31/2019 3:45:00 AM", "6/30/2019 9:45:00 PM", "6/30/2019 10:45:00 PM", "1/1/2019 4:45:00 AM"]
df = pd.DataFrame(ls, columns=["dates"])
df["dates"] = df["dates"].apply(lambda d: checkDates(pd.to_datetime(d)))
print(df)
dates
0 2019-12-30 03:45:00
1 2019-06-30 21:45:00
2 2019-06-30 22:45:00
3 2018-12-31 04:45:00
Note that the time of day is left unchanged in the result; only the date is shifted.
If you only want the date at the end, you can take it from the datetime object like this:
print(pd.to_datetime("12/31/2019 3:45:00 AM").date())  # 2019-12-31
If you know Python well and don't mind the code being harder for others to read later,
the above can be written as a one-liner:
df["dates"] = df["dates"].apply(lambda d: pd.to_datetime(d) - pd.Timedelta(days=1) if pd.to_datetime(d).time().hour < 6 else pd.to_datetime(d))
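The row-by-row apply above can also be written without a Python-level loop. This is only a sketch, assuming the strings follow the month/day/year 12-hour format shown in the question; the column name business_date is made up for illustration. Subtracting 6 hours from every timestamp and then dropping the time part puts anything before 6 AM on the previous day.

```python
import pandas as pd

# Sample timestamps in the question's format (values chosen for illustration).
df = pd.DataFrame({"dates": ["6/30/2019 5:45:00 AM",
                             "6/30/2019 6:30:00 AM",
                             "1/1/2019 4:45:00 AM"]})

# Assumed format: month/day/year with a 12-hour clock and AM/PM marker.
ts = pd.to_datetime(df["dates"], format="%m/%d/%Y %I:%M:%S %p")

# Shift every timestamp back 6 hours, then set the time part to midnight.
# Anything before 06:00 therefore lands on the previous calendar day.
# "business_date" is a hypothetical column name.
df["business_date"] = (ts - pd.Timedelta(hours=6)).dt.normalize()
print(df)
```

The result stays a datetime64 column, so .dt.date can be applied afterwards if plain date objects are preferred.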

Related

How do I create a new column with a set timeframe using Pandas datetime64

I’m trying to look at some sales data for a small store. I have a timestamp of when each settlement was made, but sometimes it’s done before midnight and sometimes it’s done after midnight.
This gives me correct data for some days and incorrect data for others, because anything after midnight should count toward the day before. I couldn’t find the right pandas documentation for what I’m looking for.
Is there an if/else solution that creates a new column by looping through the NEW_TIMESTAMP column and applying a custom timeframe (if after midnight but before 3 PM, set the day before; else keep the day)? Everything I write either runs forever or crashes Jupyter.
Data:
I created another series that flags when a day should be offset back by one day, and multiplied it by a pd.Timedelta object, so that 0 turns into "0 days" and 1 turns into "1 day". Subtracting the two series gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np
# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dates = random_dates(start=pd.to_datetime('2020-01-01'),
                     end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)
# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this takes only the date component of every datetime
dates = timestamps.dt.date
# this compares the hours with 15 and is True where the hour is smaller
flag_is_day_before = hours < 15
# now you can set the dates by multiplying the 1s and 0s with a day timedelta
new_dates = dates - pd.to_timedelta(1, unit='day') * flag_is_day_before
df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05
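A variant of the same idea that keeps the result as a datetime64 column instead of Python date objects might look like the sketch below. It assumes the same 15:00 cutoff and reuses three of the timestamps from the output above as fixed sample data.

```python
import pandas as pd

# Three timestamps taken from the sample output above.
timestamps = pd.Series(pd.to_datetime([
    "2020-05-04 01:20:07",
    "2020-06-01 16:16:58",
    "2020-03-22 14:06:53",
]))

# True where the settlement happened before 15:00 and belongs to the previous day.
before_cutoff = timestamps.dt.hour < 15

# normalize() sets the time to midnight but keeps the datetime64 dtype;
# the boolean flag becomes a 0-day or 1-day offset to subtract.
new_dates = timestamps.dt.normalize() - pd.to_timedelta(before_cutoff.astype(int), unit="D")

df = pd.DataFrame({"timestamps": timestamps, "new_dates": new_dates})
print(df)
```

Keeping the datetime64 dtype means later steps such as groupby or resample on new_dates work without converting back.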

Calculate the time difference between two hh:mm columns in a pandas dataframe

I am reading some data from a CSV file in which two columns are in hh:mm format. Here is an example:
Start End
11:15 15:00
22:30 2:00
In the above example, the End in the 2nd row happens on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there a good Pythonic way of doing this? Since there is no date and some Ends happen on the next day, I get the wrong result when I calculate the difference.
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End'])-pd.to_datetime(df['Start'])
0 0 days 03:45:00
1 0 days 03:00:00
2 -1 days +03:30:00
You can use the technique (a + x) % x with a timedelta of 24 hours (or 1 day, which is the same):
the + timedelta(hours=24) makes all values positive
the % timedelta(hours=24) wraps values of 24 hours or more back into the 0 to 24 hour range
from datetime import timedelta
df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start']) + timedelta(hours=24)) \
    % timedelta(hours=24)
Gives
Start End duration
0 11:15 15:00 0 days 03:45:00
1 22:30 2:00 0 days 03:30:00
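For reference, here is a self-contained sketch of the same wrap-around with the two sample rows built in memory rather than read from a file. The explicit format="%H:%M" and the extra minutes column are assumptions added for illustration.

```python
import pandas as pd

# The two sample rows from the question.
df = pd.DataFrame({"Start": ["11:15", "22:30"], "End": ["15:00", "2:00"]})

# Parse the clock times; without a date, both land on the same placeholder day.
start = pd.to_datetime(df["Start"], format="%H:%M")
end = pd.to_datetime(df["End"], format="%H:%M")

one_day = pd.Timedelta(hours=24)

# (a + x) % x: adding a day makes every difference positive,
# and the modulo removes the extra day where it was not needed.
df["duration"] = (end - start + one_day) % one_day

# Hypothetical extra column: the same duration expressed in minutes.
df["minutes"] = df["duration"].dt.total_seconds() / 60
print(df)
```

Note that this treats every End earlier than its Start as falling on the following day, so it cannot represent durations of 24 hours or more.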

Python Pandas (Excel) datasheet code issue

I have imported an Excel (.xlsx) spreadsheet into my Python code (using pandas) and want to extract data from it. The spreadsheet contains the following:
DATE: Lecture1: Lecture2:
16/07/2020 09:30 11:00
17/07/2020 09:45 11:30
18/07/2020 09:45 11:00
19/07/2020 10:00 14:30
20/07/2020 09:30 14:45
How can I write the code so that, given "now = date.today()", it prints the row of my lectures for that day?
I have the following;
import pandas as pd
data = pd.read_excel(r'/home/timetable1.xlsx')
data["Date"] = pd.to_datetime(data["Date"]).dt.strftime("%d-%m-%Y")
df = pd.DataFrame(data)
print (df)
This prints out the whole timetable as shown below (note the format changes slightly);
Date Lecture1 Lecture2
0 16-07-2020 09:30:00 11:00:00
1 17-07-2020 09:45:00 11:30:00
2 18-07-2020 09:45:00 11:00:00
3 19-07-2020 10:00:00 14:30:00
4 20-07-2020 09:30:00 14:45:00
I am not sure what code will determine today's date and show only today's lecture times. Maybe something like this:
now = date.today()
now.strftime("%d-%m-%y")
if ["Date" == now]:
print ('timetable1.xlsx' index_col=now)
I am new to coding. I know the above code is wrong, but I can't think of a way to display the info.
My desired output:
Date Lecture1 Lecture2
18-07-2020 09:45:00 11:00:00
Your input would be much appreciated.
Check this:
data['Date'] = pd.to_datetime(data['Date']).dt.strftime("%d-%m-%Y")
now = pd.to_datetime('today').strftime("%d-%m-%Y")
print(data[data['Date'] == now])
Here you go:
from datetime import date
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y')
print(df[df.DATE == pd.to_datetime(date.today())])
Output (It's 19th for me)
DATE Lecture1 Lecture2
3 2020-07-19 10:00 14:30
What you can do is get the current date in the same format as the dataset, like this:
from datetime import date
today = date.today()
compare = today.strftime("%d-%m-%Y")  # four-digit year, matching the Date column
Then use .loc on the dataframe:
df.loc[df['Date'] == compare]
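The answers above compare either formatted strings or datetimes. A sketch that avoids string formatting altogether and takes the day as a parameter could look like this; rows_for_day is a made-up helper name, and the day/month/year format is assumed from the spreadsheet sample.

```python
import pandas as pd

def rows_for_day(df, day, date_col="Date"):
    """Return the rows whose date_col falls on the given calendar day.

    Hypothetical helper; assumes date_col holds day/month/year strings.
    """
    dates = pd.to_datetime(df[date_col], format="%d/%m/%Y")
    return df[dates.dt.normalize() == pd.Timestamp(day).normalize()]

# A small in-memory copy of the timetable from the question.
timetable = pd.DataFrame({
    "Date": ["16/07/2020", "17/07/2020", "18/07/2020"],
    "Lecture1": ["09:30", "09:45", "09:45"],
    "Lecture2": ["11:00", "11:30", "11:00"],
})

result = rows_for_day(timetable, "2020-07-18")
print(result)

# For the current day, pass today's timestamp instead of a fixed date:
# rows_for_day(timetable, pd.Timestamp.today())
```

Comparing normalized timestamps sidesteps mismatches between two-digit and four-digit year formats.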

How to extract hours from a pandas.datetime?

I have a pandas dataframe to which I applied pandas.to_datetime. Now I want to extract the hours/minutes/seconds from each timestamp. I used df.index.day to get the days, and now I want to know whether there are different hours in my index.
For example, if I have two dates d1 = 2020-01-01 00:00:00 and d2 = 2020-01-02 00:00:00, I can't assume I should apply a smoothing operator by hour, because that would make no sense.
So what I want to know is: how do I know if a day has different hours/minutes or seconds?
Thank you in advance
I think you should use df[index].dt provided by pandas.
You can extract day, week, hour, minute, second by using it.
Please see this.
dir(df[index].dt)
Here is an example.
import pandas as pd
df = pd.DataFrame([["2020-01-01 06:31:00"], ["2020-03-12 10:21:09"]])
print(df)
df['time'] = pd.to_datetime(df[0])  # the frame has no column names, so the only column is 0
df['dates'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute
df['second'] = df['time'].dt.second
Now your df should look like this.
0 time dates hour minute second
0 2020-01-01 06:31:00 2020-01-01 06:31:00 2020-01-01 6 31 0
1 2020-03-12 10:21:09 2020-03-12 10:21:09 2020-03-12 10 21 9
If d1 and d2 are datetime or Timestamp objects, you can get the hour, minute and second using the attributes hour, minute and second:
print(d1.hour,d1.minute,d1.second)
print(d2.hour,d2.minute,d2.second)
Similarly, year, month and day can also be extracted.
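To answer the actual question of whether a day contains different hours, one possible sketch is to count the distinct hours per calendar day. The sample index and the variable names below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample index: two readings on Jan 1 and one on Jan 2.
idx = pd.to_datetime([
    "2020-01-01 00:00:00",
    "2020-01-01 06:31:00",
    "2020-01-02 00:00:00",
])

# Pair each timestamp's calendar day with its hour of day.
frame = pd.DataFrame({"day": idx.normalize(), "hour": idx.hour})

# Number of distinct hours observed on each calendar day.
distinct_hours_per_day = frame.groupby("day")["hour"].nunique()
print(distinct_hours_per_day)

# True only when every timestamp sits exactly at midnight,
# meaning the index carries no intraday information at all.
only_midnight = bool((idx == idx.normalize()).all())
print(only_midnight)
```

A day with a count above 1 has readings at different hours; the same pattern works with idx.minute or idx.second.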

Plotting distribution of time data in Python using Pandas

I have a pandas dataframe with some time data which looks like
0 08:00 AM
1 08:15 AM
2 08:30 AM
3 7:45 AM
4 7:30 AM
There are 660 rows like these in total (datatype: string). I want to plot the distribution (histogram) of this column. How can I do that? Some of the rows are just empty strings (missing data), so I also have to handle that while plotting. What is the best way to handle that?
I have tried using pandas.to_datetime() to convert the strings to timestamps, but after that I am still stuck on how to plot the distribution of those timestamps and the missing data.
Let's assume you have the dataframe you're talking about, and that you're able to cast the column to pandas datetime objects:
import pandas as pd
df = pd.DataFrame(['8:00 AM', '8:15 AM', '08:30 AM', '', '7:45 AM','7:45 AM'], columns = ['time'])
df.time = pd.to_datetime(df.time)
df looks like this:
time
0 2019-08-16 08:00:00
1 2019-08-16 08:15:00
2 2019-08-16 08:30:00
3 NaT
4 2019-08-16 07:45:00
5 2019-08-16 07:45:00
I would group by both hour and minute.
df.groupby([df['time'].dt.hour, df['time'].dt.minute]).count().plot(kind="bar")
results
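A slightly more explicit sketch of the same counting step is shown below. It assumes the 12-hour "%I:%M %p" format from the question, turns empty strings into NaT with errors="coerce", and reports the missing rows separately instead of plotting them.

```python
import pandas as pd

# Sample values from the answer above, including one empty string.
times = pd.Series(["8:00 AM", "8:15 AM", "08:30 AM", "", "7:45 AM", "7:45 AM"])

# Unparseable or empty entries become NaT instead of raising an error.
parsed = pd.to_datetime(times, format="%I:%M %p", errors="coerce")

# Keep track of how many rows were missing.
n_missing = int(parsed.isna().sum())

# Count occurrences of each clock time, sorted chronologically.
counts = parsed.dropna().dt.strftime("%H:%M").value_counts().sort_index()
print(counts)
print("missing:", n_missing)

# With matplotlib installed, counts.plot(kind="bar") draws the bar chart.
```

Labelling the bars as HH:MM strings keeps the x-axis readable compared with (hour, minute) tuples.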
