Plotting distribution of time data in Python using Pandas - python

I have a pandas dataframe with some time data which looks like
0 08:00 AM
1 08:15 AM
2 08:30 AM
3 7:45 AM
4 7:30 AM
There are 660 rows like these in total (dtype: string). I want to plot the distribution (histogram) of this column. How can I do that? Some of the rows are just empty strings (missing data), so I also have to handle those while plotting. What is the best way to handle that?
I have tried using pandas.to_datetime() to convert the strings to timestamps, but after that I am still stuck on how to plot the distribution of those timestamps and how to deal with the missing data.

Let's assume you have the dataframe you're talking about, and that you're able to cast the column to pandas datetime objects:
import pandas as pd
df = pd.DataFrame(['8:00 AM', '8:15 AM', '08:30 AM', '', '7:45 AM','7:45 AM'], columns = ['time'])
df.time = pd.to_datetime(df.time)
df looks like this:
time
0 2019-08-16 08:00:00
1 2019-08-16 08:15:00
2 2019-08-16 08:30:00
3 NaT
4 2019-08-16 07:45:00
5 2019-08-16 07:45:00
I would group by both hour and minute. The NaT row has no hour or minute, so groupby leaves it out of the counts by default:
df.groupby([df['time'].dt.hour, df['time'].dt.minute]).count().plot(kind="bar")
Result: a bar chart with one bar per (hour, minute) group.
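As a rough sketch of the same idea with explicit handling of the empty strings: the snippet below parses the times with an explicit format, counts the missing values separately, and bins the rest into 15-minute buckets. The 15-minute bin width and the variable names are assumptions for illustration, not something the question specifies.

```python
import pandas as pd

times = pd.Series(['8:00 AM', '8:15 AM', '08:30 AM', '', '7:45 AM', '7:45 AM'])

# errors='coerce' turns empty or unparseable strings into NaT instead of raising
parsed = pd.to_datetime(times, format='%I:%M %p', errors='coerce')

# Keep track of how many values were missing, then drop them
n_missing = int(parsed.isna().sum())
valid = parsed.dropna()

# Minutes since midnight gives a plain numeric axis for binning
minutes = valid.dt.hour * 60 + valid.dt.minute

# Count observations per 15-minute bin, labelled by the bin's start time
bin_start = (minutes // 15) * 15
counts = bin_start.value_counts().sort_index()
counts.index = [f"{m // 60:02d}:{m % 60:02d}" for m in counts.index]

print(counts)
print("missing:", n_missing)
# counts.plot(kind='bar') would draw the histogram if matplotlib is installed
```

Reporting the missing count next to the plot, rather than plotting it as a bar, keeps the time axis clean; whether that is the right choice depends on what the chart is for.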

Related

Creating Columns for Hour of Day and date based on datetime column

How can I create a new column that has the day only, and hour of day only based of a column that has a datetime timestamp?
DF has column such as:
Timestamp
2019-05-31 21:11:43
2018-11-21 18:01:00
2017-11-21 22:01:04
2020-04-15 11:01:00
2017-04-20 04:00:33
I want two new columns that look like below:
Day | Hour of Day
2019-05-31 21:00
2018-11-21 18:00
2017-11-21 22:00
2020-04-15 11:00
2017-04-20 04:00
I tried something like the line below, but it only gives me the hour of day as a number,
df['hour'] = pd.to_datetime(df['Timestamp'], format='%H:%M:%S').dt.hour
so the output would be 9 for 9:32:00, which isn't what I want.
Thanks!
Please try dt.strftime with a format string that appends a literal ":00" to the hour:
df['hour'] = pd.to_datetime(df['Timestamp']).dt.strftime("%H"+":00")
Following your comments below, let's try df.assign and extract the hour and the date separately:
df=df.assign(hour=pd.to_datetime(df['Timestamp']).dt.strftime("%H"+":00"), Day=pd.to_datetime(df['Timestamp']).dt.date)
You could convert time to string and then just select substrings by index.
df = pd.DataFrame({'Timestamp': ['2019-05-31 21:11:43', '2018-11-21 18:01:00',
'2017-11-21 22:01:04', '2020-04-15 11:01:00',
'2017-04-20 04:00:33']})
df['Day'], df['Hour of Day'] = zip(*df.Timestamp.apply(lambda x: [str(x)[:10], str(x)[11:13]+':00']))
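For comparison, here is a minimal self-contained sketch that parses the column once and derives both columns from the parsed values. It assumes the desired output is two string columns, as in the question's example; the variable name ts is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Timestamp': ['2019-05-31 21:11:43', '2018-11-21 18:01:00',
                                 '2017-11-21 22:01:04', '2020-04-15 11:01:00',
                                 '2017-04-20 04:00:33']})

# Parse once, then reuse the parsed Series for both derived columns
ts = pd.to_datetime(df['Timestamp'])

df['Day'] = ts.dt.strftime('%Y-%m-%d')    # date part only, as a string
df['Hour of Day'] = ts.dt.strftime('%H:00')  # hour with minutes zeroed, as a string

print(df[['Day', 'Hour of Day']])
```

Parsing first also means malformed timestamps fail loudly, whereas slicing the raw strings would silently produce wrong substrings.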

Python Pandas (Excel) datasheet code issue

I have imported an Excel (.xlsx) spreadsheet into my Python code (using Pandas) and want to extract data from it. The spreadsheet contains the following:
DATE: Lecture1: Lecture2:
16/07/2020 09:30 11:00
17/07/2020 09:45 11:30
18/07/2020 09:45 11:00
19/07/2020 10:00 14:30
20/07/2020 09:30 14:45
How can I write the part of the code so that, given "now = date.today()", it prints the row with my lectures for that day?
I have the following;
import pandas as pd
data = pd.read_excel(r'/home/timetable1.xlsx')
data["Date"] = pd.to_datetime(data["Date"]).dt.strftime("%d-%m-%Y")
df = pd.DataFrame(data)
print (df)
This prints out the whole timetable as shown below (note the format changes slightly);
Date Lecture1 Lecture2
0 16-07-2020 09:30:00 11:00:00
1 17-07-2020 09:45:00 11:30:00
2 18-07-2020 09:45:00 11:00:00
3 19-07-2020 10:00:00 14:30:00
4 20-07-2020 09:30:00 14:45:00
So I am not sure what code will determine today's date and show only today's lecture times. For example, something like this maybe:
now = date.today()
now.strftime("%d-%m-%y")
if ["Date" == now]:
print ('timetable1.xlsx' index_col=now)
I am new to coding, so I am not very good at it. I know the above code is wrong, but I can't think of a way to display the info.
My desired output is:
Date Lecture1 Lecture2
18-07-2020 09:45:00 11:00:00
Your input would be much appreciated.
Check this:
data['Date'] = pd.to_datetime(data['Date']).dt.strftime("%d-%m-%Y")
now = pd.to_datetime('today').strftime("%d-%m-%Y")
print(data[data['Date'] == now])
Here you go:
from datetime import date
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y')
print(df[df.DATE == pd.to_datetime(date.today())])
Output (it's the 19th for me):
DATE Lecture1 Lecture2
3 2020-07-19 10:00 14:30
What you can do is format the current date the same way as the Date column in the dataset, like this:
from datetime import date
today = date.today()
compare = today.strftime("%d-%m-%Y")  # four-digit year, matching the Date column
And then do a .loc selection on the dataframe:
df.loc[df['Date'] == compare]
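The answers above compare formatted strings, which only works when both sides use exactly the same format. A sketch that compares dates directly avoids that pitfall. The helper name lectures_on and the inline sample data are assumptions for illustration; in the question the data would come from pd.read_excel.

```python
from datetime import date

import pandas as pd

# Stand-in for the spreadsheet contents shown in the question
timetable = pd.DataFrame({
    'Date': ['16/07/2020', '17/07/2020', '18/07/2020', '19/07/2020', '20/07/2020'],
    'Lecture1': ['09:30', '09:45', '09:45', '10:00', '09:30'],
    'Lecture2': ['11:00', '11:30', '11:00', '14:30', '14:45'],
})

# Parse day-first dates explicitly so 16/07 is never read as month 16
timetable['Date'] = pd.to_datetime(timetable['Date'], format='%d/%m/%Y')


def lectures_on(df, day):
    """Return the rows whose Date falls on the given datetime.date (hypothetical helper)."""
    return df[df['Date'].dt.date == day]


print(lectures_on(timetable, date(2020, 7, 18)))
# For the current day, pass date.today(); the result is empty if no row matches
print(lectures_on(timetable, date.today()))
```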

Rounding datetime based on time of day

I have a pandas dataframe with timestamps shown below:
6/30/2019 3:45:00 PM
I would like to round the date based on time. Anything before 6AM will be counted as the day before.
6/30/2019 5:45:00 AM -> 6/29/2019
6/30/2019 6:30:00 AM -> 6/30/2019
What I have considered doing is splitting date and time into two different columns, then using an if statement to shift the date (if time >= 06:00 etc). I am just wondering whether there is a built-in function in pandas to do this. I've seen posts of people rounding up and down to the closest hour, but never to a specific time threshold (6 AM).
Thank you for the help!
There could be a better way to do this, but this is one way of doing it.
import pandas as pd
def checkDates(d):
    if d.time().hour < 6:
        return d - pd.Timedelta(days=1)
    else:
        return d
ls = ["12/31/2019 3:45:00 AM", "6/30/2019 9:45:00 PM", "6/30/2019 10:45:00 PM", "1/1/2019 4:45:00 AM"]
df = pd.DataFrame(ls, columns=["dates"])
df["dates"] = df["dates"].apply(lambda d: checkDates(pd.to_datetime(d)))
print (df)
dates
0 2019-12-30 03:45:00
1 2019-06-30 21:45:00
2 2019-06-30 22:45:00
3 2018-12-31 04:45:00
Also note that I am not stripping the time of day when giving back the result.
If you just want the date at the end of it, you can get that out of the datetime object by doing something like this:
print(pd.to_datetime("12/31/2019 3:45:00 AM").date())  # 2019-12-31
If you understand Python well and don't mind the code being harder for anyone else to read in the future,
a one-liner equivalent of the above is:
df["dates"] = df["dates"].apply(lambda d: pd.to_datetime(d) - pd.Timedelta(days=1) if pd.to_datetime(d).time().hour < 6 else pd.to_datetime(d))
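A vectorized alternative, sketched under the assumption that only the resulting date matters: subtracting six hours from every timestamp moves anything before 6 AM onto the previous calendar day, after which the time of day can be dropped. The names cutoff and shifted_day are illustrative.

```python
import pandas as pd

stamps = pd.Series(["6/30/2019 5:45:00 AM",
                    "6/30/2019 6:30:00 AM",
                    "1/1/2019 4:45:00 AM"])
ts = pd.to_datetime(stamps, format="%m/%d/%Y %I:%M:%S %p")

# Shift every timestamp back by the cutoff, then truncate to midnight
cutoff = pd.Timedelta(hours=6)
shifted_day = (ts - cutoff).dt.normalize()

print(shifted_day.dt.strftime("%m/%d/%Y").tolist())
```

This avoids a Python-level apply over the rows, which may matter on large frames.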

Concatenate two dataframe columns as one timestamp

I'm working on a pandas dataframe. One of my columns is a date (YYYYMMDD) and another one is an hour (HH:MM). I would like to combine the two columns into one timestamp or datetime64 column, to later use that column as an index (for a time series).
Do you have any ideas? The classic pandas.to_datetime() seems to work only if the columns contain only hours, only days, only years, and so on.
Setup
df
Out[1735]:
id date hour other
0 1820 20140423 19:00:00 8
1 4814 20140424 08:20:00 22
Solution
import datetime as dt
#convert date and hour to str, concatenate them and then convert them to datetime format.
df['new_date'] = df[['date','hour']].astype(str).apply(lambda x: dt.datetime.strptime(x.date + x.hour, '%Y%m%d%H:%M:%S'), axis=1)
df
Out[1756]:
id date hour other new_date
0 1820 20140423 19:00:00 8 2014-04-23 19:00:00
1 4814 20140424 08:20:00 22 2014-04-24 08:20:00
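The same result can probably be reached without a row-wise apply by concatenating the two columns as strings and handing the whole Series to pd.to_datetime with an explicit format. This is a sketch that assumes the date column holds integers like 20140423 and the hour column holds HH:MM:SS strings, as in the setup above.

```python
import pandas as pd

df = pd.DataFrame({'id': [1820, 4814],
                   'date': [20140423, 20140424],
                   'hour': ['19:00:00', '08:20:00'],
                   'other': [8, 22]})

# Build "YYYYMMDD HH:MM:SS" strings and parse them in one vectorized call
combined = df['date'].astype(str) + ' ' + df['hour'].astype(str)
df['new_date'] = pd.to_datetime(combined, format='%Y%m%d %H:%M:%S')

# Use the new column as the index for time series work
ts = df.set_index('new_date')
print(ts)
```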

Pandas day for day

I have a lot of data in a Pandas dataframe:
Timestamp Value
2015-07-15 07:16:39.034 49.960
2015-07-15 07:16:39.036 49.940
......
2015-08-12 23:16:39.235 42.958
I have about 50 000 entries per day, and I would like to perform different operations on this data, day by day.
For example, if I would like to find the rolling mean, I would enter this:
df['rm5000'] = pd.rolling_mean(df['Value'], window=5000)
But that would give me the rolling mean across dates: the first rolling-mean data point on August 12th would include 4999 data points from August 11th. However, I would like to start over each day, so that the first 4999 data points of each day do not get a rolling mean of 5000, as there might be a large difference between the last data of one date and the first data of the next day.
Do I have to slice the data into separate dataframes for each date for Pandas to do certain operations on the data for each separate date?
If you set the timestamps as an index, you can group by a TimeGrouper with a frequency code to partition the data by days, like below:
In [2]: df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
'Value': np.linspace(49, 51, 433)})
In [3]: df = df.set_index('Timestamp')
In [4]: df.groupby(pd.TimeGrouper('D'))['Value'].apply(lambda x: pd.rolling_mean(x, window=15))
Out[4]:
Timestamp
2015-07-15 00:00:00 NaN
2015-07-15 00:10:00 NaN
.....
2015-07-15 23:30:00 49.620370
2015-07-15 23:40:00 49.625000
2015-07-15 23:50:00 49.629630
2015-07-16 00:00:00 NaN
2015-07-16 00:10:00 NaN
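Note that pd.TimeGrouper and pd.rolling_mean were removed in later pandas releases. A sketch of the same per-day rolling mean with the current API, assuming pd.Grouper and the Series.rolling method as replacements, could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
                   'Value': np.linspace(49, 51, 433)})
df = df.set_index('Timestamp')

# Group by calendar day, then compute the rolling mean inside each group;
# transform keeps the original index, so the window restarts at every midnight
daily_rm = (df.groupby(pd.Grouper(freq='D'))['Value']
              .transform(lambda s: s.rolling(window=15).mean()))

print(daily_rm.loc['2015-07-15 23:30:00':'2015-07-16 00:10:00'])
```

Using transform rather than apply keeps the result aligned with the original rows, so it can be assigned straight back as a new column.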
