Convert multiple date formats to datetime in pandas - python

I have a column of messy data in which the date formats differ, and I want to convert them all to a consistent datetime in pandas.
df:
Date
0 1/05/2015
1 15 Jul 2009
2 1-Feb-15
3 12/08/2019
When I run this part:
df['Date'] = pd.to_datetime(df['Date'], format='%d %b %Y', errors='coerce')
I get
Date
0 NaT
1 2009-07-15
2 NaT
3 NaT
How do I convert all of it to datetime in pandas?

pd.to_datetime is capable of handling multiple date formats in the same column. Specifying a format prevents it from inferring the format of each value, so if the column contains several formats, do not specify one:
import pandas as pd
df = pd.DataFrame({
'Date': ['1/05/2015', '15 Jul 2009', '1-Feb-15', '12/08/2019']
})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)
Date
0 2015-01-05
1 2009-07-15
2 2015-02-01
3 2019-12-08
*There are limitations to how well mixed formats are handled. Columns that mix timezone-aware and timezone-naive datetimes will not be processed correctly. Likewise, columns that mix day-first and month-first notation will not always parse correctly.
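A note on versions: as far as I know, pandas 2.0 and later infer a single format from the first value when no format is given, so with errors='coerce' the remaining rows of a mixed column can come back as NaT (those versions accept format='mixed' for this case). A minimal sketch that does not depend on that behaviour is to parse each string on its own, which is slower on large columns. The names month_first and day_first are only illustrative. It also shows the day-first/month-first ambiguity for the numeric dates:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/05/2015', '15 Jul 2009', '1-Feb-15', '12/08/2019']
})

# Parse each string independently, so one value's format never constrains the others.
month_first = df['Date'].apply(pd.to_datetime)

# Same strings, but ambiguous numeric dates are read as day/month/year.
day_first = df['Date'].apply(lambda s: pd.to_datetime(s, dayfirst=True))

print(pd.DataFrame({
    'raw': df['Date'],
    'month_first': month_first,
    'day_first': day_first,
}))
```

Rows 0 and 3 differ between the two columns, so it is worth checking which convention the data actually uses before trusting either result.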

Related

Can't convert string to datetime Pandas to_datetime() out of time error

Error message: OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-05-22 00:00:00
Given dataframe
Date Value
May 22 1K
Apr 22 2K
... ...
Jan 00 10K
I tried converting it to string and then calling to_datetime():
df['Date'] = df['Date'].apply(str)
df['Date'] = pd.to_datetime(df['Date'])
My goal is to convert Date to datetime 05-2022, 04-2022, ... 01-2000
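You can manually specify the format argument of pd.to_datetime. Without it, the error message suggests that 'May 22' is read as 22 May of year 1 rather than May 2022: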
df['Date'] = pd.to_datetime(df['Date'], format='%b %y')
print(df)
Date Value
0 2022-05-01 1K
1 2022-04-01 2K
2 2000-01-01 10K
df['Date'] = df['Date'].dt.strftime('%m-%Y')  # formats as strings; the column is no longer datetime
print(df)
Date Value
0 05-2022 1K
1 04-2022 2K
2 01-2000 10K

Pandas DateTime for Month

I have a month column with values formatted as: 2019M01
To find the seasonality I need this converted to the pandas datetime format.
How to format 2019M01 into datetime so that I can use it for my seasonality plotting?
Thanks.
Use to_datetime with the format parameter:
print (df)
date
0 2019M01
1 2019M03
2 2019M04
df['date'] = pd.to_datetime(df['date'], format='%YM%m')
print (df)
date
0 2019-01-01
1 2019-03-01
2 2019-04-01
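For the seasonality part of the question, one possible follow-up (a sketch, not part of the original answer) is to derive a month number or a monthly period from the converted column. The column names month and period are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2019M01', '2019M03', '2019M04']})
df['date'] = pd.to_datetime(df['date'], format='%YM%m')

# Month number (1-12), handy for grouping by calendar month across years.
df['month'] = df['date'].dt.month

# Monthly period, useful when the exact day (always the 1st here) is irrelevant.
df['period'] = df['date'].dt.to_period('M')

print(df)
```

Grouping by the month column, for example with df.groupby('month'), then gives one group per calendar month.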

How to make datetimes timezone aware and convert timezones

I have 3 dataframes with multiple columns. Two of them have datetimes that are in UTC, and the other one is in 'Europe/Amsterdam'. However, they are all still timezone-unaware.
How do I make these datasets timezone aware, and convert the 'Europe/Amsterdam' to UTC?
The datetimes are in the index of each dataset.
If you're using pandas Dataframes and Python 3, you can do it like this:
import pandas as pd
values = {'dates': ['20190902101010','20190913202020','20190921010101'],
'status': ['Opened','Opened','Closed']
}
df = pd.DataFrame(values, columns = ['dates','status'])
df['dates_datetime'] = pd.to_datetime(df['dates'], format='%Y%m%d%H%M%S')
df['dates_datetime_tz'] = df.dates_datetime.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
print (df)
print (df.dtypes)
Result:
dates status dates_datetime dates_datetime_tz
0 20190902101010 Opened 2019-09-02 10:10:10 2019-09-02 15:40:10+05:30
1 20190913202020 Opened 2019-09-13 20:20:20 2019-09-14 01:50:20+05:30
2 20190921010101 Closed 2019-09-21 01:01:01 2019-09-21 06:31:01+05:30
I've converted from UTC to a specific time zone; you can choose any other you need.
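Since the question has the datetimes in the index and needs 'Europe/Amsterdam' converted to UTC, here is a minimal sketch of that case. The frames df_utc and df_ams and their sample values are invented for illustration; tz_localize and tz_convert are called directly on the DatetimeIndex, so no .dt accessor is needed:

```python
import pandas as pd

naive_index = pd.to_datetime([
    '2019-09-02 10:10:10', '2019-09-13 20:20:20', '2019-09-21 01:01:01'
])

# Hypothetical frame whose naive index already holds UTC wall-clock times.
df_utc = pd.DataFrame({'status': ['Opened', 'Opened', 'Closed']}, index=naive_index)
# Hypothetical frame whose naive index holds Amsterdam wall-clock times.
df_ams = pd.DataFrame({'status': ['Opened', 'Opened', 'Closed']}, index=naive_index)

# Attach the zone the values are already in (the clock time does not change).
df_utc.index = df_utc.index.tz_localize('UTC')

# Attach the Amsterdam zone first, then convert the instants to UTC.
df_ams.index = df_ams.index.tz_localize('Europe/Amsterdam').tz_convert('UTC')

print(df_utc.index)
print(df_ams.index)
```

If the Amsterdam data spans a daylight-saving change, tz_localize may raise on ambiguous or nonexistent times; its ambiguous and nonexistent parameters control how those are handled.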

Convert strings with and without time (mixed format) to datetime in a pandas dataframe

When converting a pandas dataframe column from object to datetime using the astype function, the behavior differs depending on whether the strings have a time component or not. What is the correct way of converting the column?
df = pd.DataFrame({'Date': ['12/07/2013 21:50:00','13/07/2013 00:30:00','15/07/2013','11/07/2013']})
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y %H:%M:%S", exact=False, dayfirst=True, errors='ignore')
Output:
Date
0 12/07/2013 21:50:00
1 13/07/2013 00:30:00
2 15/07/2013
3 11/07/2013
but the dtype is still object. When doing:
df['Date'] = df['Date'].astype('datetime64')
it becomes of datetime dtype but the day and month are not parsed correctly on rows 0 and 3.
Date
0 2013-12-07 21:50:00
1 2013-07-13 00:30:00
2 2013-07-15 00:00:00
3 2013-11-07 00:00:00
The expected result is:
Date
0 2013-07-12 21:50:00
1 2013-07-13 00:30:00
2 2013-07-15 00:00:00
3 2013-07-11 00:00:00
If we look at the source code, when you pass both format= and dayfirst=, dayfirst= is never read, because passing format= calls C code (np_datetime_strings.c) that doesn't use dayfirst= for the conversion. On the other hand, if you pass only dayfirst=, it is used to first guess the format, with a fallback to dateutil.parser.parse for the conversion. So, use only one of them.
In most cases,
df['Date'] = pd.to_datetime(df['Date'])
does the job.
In the specific example in the OP, passing dayfirst=True does the job.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
That said, passing format= makes the conversion run ~25x faster (see this post for more info), so if your frame is anything larger than 10k rows, it's better to pass format=. Since the format here is mixed, one way is to perform the conversion in two steps (the errors='coerce' argument is useful for this):
1. convert the datetimes that have a time component;
2. fill in the NaT values (the "coerced" rows) with a Series converted using a different format.
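with_time = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
date_only = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce')
# both passes read the original strings; rows the first format could not parse are filled from the second
df['Date'] = with_time.fillna(date_only)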
This method (of performing two or more conversions) can be used to convert any column with "weirdly" formatted datetimes.
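The same idea extends to any number of formats. A minimal sketch follows, where parse_with_formats is a hypothetical helper (not a pandas function) that tries each format in order against the original strings and keeps the first successful parse per row:

```python
import pandas as pd

def parse_with_formats(series, formats):
    """Hypothetical helper: try each format in turn, keeping the first match per row."""
    result = pd.to_datetime(series, format=formats[0], errors='coerce')
    for fmt in formats[1:]:
        # Always parse the original strings; only rows that are still NaT get filled.
        result = result.fillna(pd.to_datetime(series, format=fmt, errors='coerce'))
    return result

df = pd.DataFrame({'Date': ['12/07/2013 21:50:00', '13/07/2013 00:30:00',
                            '15/07/2013', '11/07/2013']})
df['Date'] = parse_with_formats(df['Date'], ['%d/%m/%Y %H:%M:%S', '%d/%m/%Y'])
print(df)
```

Rows that match none of the formats stay NaT, which makes them easy to find afterwards with df['Date'].isna().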

Concatenate two dataframe columns as one timestamp

I'm working on a pandas dataframe. One of my columns is a date (YYYYMMDD) and another is an hour (HH:MM). I would like to concatenate the two columns into one timestamp or datetime64 column, to later use that column as an index (for a time series). Here is the situation:
Do you have any ideas? The classic pandas.to_datetime() seems to work only if the columns contain hours only, days only, years only, etc.
Setup
df
Out[1735]:
id date hour other
0 1820 20140423 19:00:00 8
1 4814 20140424 08:20:00 22
Solution
import datetime as dt
#convert date and hour to str, concatenate them and then convert them to datetime format.
df['new_date'] = df[['date','hour']].astype(str).apply(lambda x: dt.datetime.strptime(x.date + x.hour, '%Y%m%d%H:%M:%S'), axis=1)
df
Out[1756]:
id date hour other new_date
0 1820 20140423 19:00:00 8 2014-04-23 19:00:00
1 4814 20140424 08:20:00 22 2014-04-24 08:20:00
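As an alternative sketch to the row-wise apply above, the two columns can be joined as strings and parsed in a single vectorized pd.to_datetime call, which tends to be faster on large frames. The sample frame is rebuilt here from the Setup output, assuming date is stored as an integer and hour as a string; the name combined is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1820, 4814],
    'date': [20140423, 20140424],
    'hour': ['19:00:00', '08:20:00'],
    'other': [8, 22],
})

# Build one string per row, e.g. '20140423 19:00:00', then parse the whole column at once.
combined = df['date'].astype(str) + ' ' + df['hour'].astype(str)
df['new_date'] = pd.to_datetime(combined, format='%Y%m%d %H:%M:%S')

# Use the new column as the index for time-series work.
df = df.set_index('new_date')
print(df)
```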
