I'm working with a CSV file of flight records. My overall goal is to plot flight delays over a few selected days, so I am trying to index the flights by day and scheduled departure time. I have a flight date in month/day/year format and a departure time formatted as hhmm. Is there a way to reformat that departure time column to hh:mm in 24-hour time? Would I then simply combine the two columns and index by the result?
I've tried combining the columns without reformatting the time, and I'm not sure matplotlib recognizes this time format for my plots.
data = pd.read_csv("groundhog_query.csv",parse_dates=[['Flight_Date', 'Scheduled_Dep_Time']])
data.index = data['Flight_Date_Scheduled_Dep_Time']
data
The CSV file looks like this:
'''
Year,Flight_Date,Day_Of_Year,Unique_Carrier_ID,Airline_ID,Tail_Number,Flight_Number,Origin_Airport_ID,Origin_Market_ID,Origin_Airport_Code,Origin_State,Destination_Airport_ID,Destination_Market_ID,Destination_Airport_Code,Dest_State,Scheduled_Dep_Time,Actual_Dep_Time,Dep_Delay,Pos_Dep_Delay,Scheduled_Arr_Time,Actual_Arr_Time,Arr_Delay,Pos_Arr_Delay,Combined_Arr_Delay,Can_Status,Can_Reason,Div_Status,Scheduled_Elapsed_Time,Actual_Elapsed_Time,Carrier_Delay,Weather_Delay,Natl_Airspace_System_Delay,Security_Delay,Late_Aircraft_Delay,Div_Airport_Landings,Div_Landing_Status,Div_Elapsed_Time,Div_Arrival_Delay,Div_Airport_1_ID,Div_1_Tail_Num,Div_Airport_2_ID,Div_2_Tail_Num,Div_Airport_3_ID,Div_3_Tail_Num,Div_Airport_4_ID,Div_4_Tail_Num,Div_Airport_5_ID,Div_5_Tail_Num
2011,2011-01-24,24,MQ,20398,N717MQ,4527,11278,30852,DCA,VA,14492,34492,RDU,NC,1630,1622.0,-8.0,0.0,1735,1722.0,-13.0,0.0,-13.0,0,,0,65,60.0,,,,,,0,,,,,,,,,,,,,
2011,2011-01-25,25,MQ,20398,N736MQ,4527,11278,30852,DCA,VA,14492,34492,RDU,NC,1630,1624.0,-6.0,0.0,1735,1724.0,-11.0,0.0,-11.0,0,,0,65,60.0,,,,,,0,,,,,,,,,,,,,
2011,2011-01-26,26,MQ,20398,N737MQ,4527,11278,30852,DCA,VA,14492,34492,RDU,NC,1630,,,,1735,,,,,1,B,0,65,,,,,,,0,,,,,,,,,,,,,
2011,2011-01-27,27,MQ,20398,N721MQ,4527,11278,30852,DCA,VA,14492,34492,RDU,NC,1630,1832.0,122.0,122.0,1735,1936.0,121.0,121.0,121.0,0,,0,65,64.0,121.0,0.0,0.0,0.
'''
My current results are in a month/day/year hhmm format.
Use the following steps:
1. Read CSV without parsing dates.
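2. Merge the 'Flight_Date' and 'Scheduled_Dep_Time' columns. Make sure that 'Scheduled_Dep_Time' is converted to a string first (hence .map(str)), since it is parsed as int by default, and zero-pad it to four digits (.str.zfill(4)) so that times before 10:00 are parsed unambiguously.
3. Convert the string to datetime using the format that matches the merged string ('%Y-%m-%d %H%M'). The time has no colon, e.g. '2011-01-24 1630'.
4. Set this newly produced column as the index.
import pandas as pd
d = pd.read_csv("groundhog_query.csv")
# Build a string such as '2011-01-24 1630' from the date and the zero-padded time
d['Flight_Date_Scheduled_Dep_Time_string'] = d.Flight_Date.str.cat(' ' + d.Scheduled_Dep_Time.map(str).str.zfill(4))
# %H%M (no colon) matches the hhmm layout of the source column
d['Flight_Date_Scheduled_Dep_Time'] = pd.to_datetime(d.Flight_Date_Scheduled_Dep_Time_string, format='%Y-%m-%d %H%M')
d = d.set_index('Flight_Date_Scheduled_Dep_Time')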
The reference for % directives is here:
https://docs.python.org/3.7/library/datetime.html#strftime-and-strptime-behavior
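The steps above can be tried without the original file. This is a minimal sketch, assuming a small inline sample in place of groundhog_query.csv; the 630 row is hypothetical (it is not in the question's data) and is only there to show why the time is zero-padded. The combined string looks like '2011-01-24 1630', so it is parsed with %H%M, without a colon.

```python
import io
import pandas as pd

# Inline sample standing in for groundhog_query.csv.
# The 630 row (06:30) is hypothetical and only illustrates the padding.
sample = io.StringIO(
    "Flight_Date,Scheduled_Dep_Time,Dep_Delay\n"
    "2011-01-24,1630,-8.0\n"
    "2011-01-25,630,-6.0\n"
)
d = pd.read_csv(sample)

# Pad to four digits so 630 becomes "0630", then join with the date.
dep_time = d["Scheduled_Dep_Time"].astype(str).str.zfill(4)
d["Flight_Date_Scheduled_Dep_Time"] = pd.to_datetime(
    d["Flight_Date"] + " " + dep_time, format="%Y-%m-%d %H%M"
)
d = d.set_index("Flight_Date_Scheduled_Dep_Time").sort_index()
print(d)
```

Once the index is a DatetimeIndex, a call such as d["Dep_Delay"].plot() should put the departure times on the x-axis in a form matplotlib understands.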
I have seen that Excel identifies dates with specific serial numbers. For example:
09/07/2018 = 43290
10/07/2018 = 43291
I know that we use the DATEVALUE, VALUE and TEXT functions to convert between these types.
But what is the logic behind this conversion? Why 43290 for 09/07/2018?
Also, if I have a list of these dates in the number format in a dataframe (Python), how can I convert the numbers to the date format?
Similarly with time, I see decimal values in place of a regular time format. What is the logic behind these time conversions?
The following question, which was linked in the comments, is informative, but it does not answer my question about the logic behind the conversion between the Date and Text formats:
convert numerical representation of date (excel format) to python date and time, then split them into two seperate dataframe columns in pandas
It is simply the number of days counted from January 1st, 1900, which is serial number 1. A time of day is stored as a fraction of a day, so 0.5 means 12:00 noon:
The DATEVALUE function converts a date that is stored as text to a
serial number that Excel recognizes as a date. For example, the
formula =DATEVALUE("1/1/2008") returns 39448, the serial number of the
date 1/1/2008. Remember, though, that your computer's system date
setting may cause the results of a DATEVALUE function to vary from
this example
...
Excel stores dates as sequential serial numbers so that they can be used in calculations. By default, January 1, 1900 is serial number 1, and January 1, 2008 is serial number 39448 because it is 39,447 days after January 1, 1900.
from DATEVALUE docs
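The arithmetic in the quoted passage can be checked in Python. This is a minimal sketch: excel_serial is a hypothetical helper name, and the 1899-12-30 base date is an assumption that holds for dates from March 1, 1900 onward, because Excel counts 1/1/1900 as day 1 and also counts a February 29, 1900 that never existed.

```python
from datetime import date

# Effective "day 0" of Excel's 1900 date system (valid from 1 March 1900 on).
EXCEL_BASE = date(1899, 12, 30)

def excel_serial(d: date) -> int:
    """Return the Excel serial number for a date (hypothetical helper)."""
    return (d - EXCEL_BASE).days

print(excel_serial(date(2008, 1, 1)))  # 39448, as in the DATEVALUE docs
print(excel_serial(date(2018, 7, 9)))  # 43290, as in the question
```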
if I have a list of these dates in the number format in a dataframe
(Python), how can I convert this number to the date format?
Since we know this number represents the number of days since 1/1/1900 it can be easily converted to a date:
from datetime import datetime, timedelta
day_number = 43290
# Subtract 2: 1/1/1900 is "day 1", not "day 0", and Excel also counts
# a 29 February 1900 that never existed.
print(datetime(1900, 1, 1) + timedelta(days=day_number - 2))
# 2018-07-09 00:00:00
However pd.read_excel should be able to handle this automatically.
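When the serial numbers are already sitting in a DataFrame column, the same conversion can be vectorized with pd.to_datetime. This is a minimal sketch, assuming a hypothetical column name excel_serial and the 1899-12-30 base date, which absorbs both the "day 1" offset and Excel's nonexistent February 29, 1900. The fractional part is the time of day, so 43290.5 is noon on 2018-07-09.

```python
import pandas as pd

# Hypothetical column of Excel serial numbers; the last one carries a time.
df = pd.DataFrame({"excel_serial": [43290, 43291, 43290.5]})

# unit="D" treats the numbers as days counted from the given origin.
df["timestamp"] = pd.to_datetime(df["excel_serial"], unit="D", origin="1899-12-30")

# Split into separate date and time columns.
df["date"] = df["timestamp"].dt.date
df["time"] = df["timestamp"].dt.time
print(df)
```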