Pandas: Parsing dates in different columns with read_csv

I have an ascii file where the dates are formatted as follows:
Jan 20 2015 00:00:00.000
Jan 20 2015 00:10:00.000
Jan 20 2015 00:20:00.000
Jan 20 2015 00:30:00.000
Jan 20 2015 00:40:00.000
When loading the file into pandas, each whitespace-separated field above gets its own column in the dataframe. I've tried variations of the following:
from pandas import read_csv
from datetime import datetime
df = read_csv('file.txt', header=None, delim_whitespace=True,
parse_dates={'datetime': [0, 1, 2, 3]},
date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H %M %S'))
I get a couple of errors:
TypeError: <lambda>() takes 1 positional argument but 4 were given
ValueError: time data 'Jun 29 2017 00:35:00.000' does not match format '%b %d %Y %H %M %S'
I'm confused because:
I'm passing a dict to parse_dates to parse the different columns as a single date.
I'm using: %b - abbreviated month name, %d - day of the month, %Y year with century, %H 24-hour, %M - minute, and %S - second
Anyone see what I'm doing incorrectly?
Edit:
I've tried date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H:%M:%S') which returns ValueError: unconverted data remains: .000
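For reference, a minimal sketch using only the standard library that reproduces the last error and shows the format that matches. The sample string is one line from the file above; the variable names are only illustrative. The time part is separated by colons, not spaces, and ends in fractional seconds, so the format needs %H:%M:%S.%f:

```python
from datetime import datetime

raw = "Jan 20 2015 00:10:00.000"  # one line of the file, fields joined by spaces

# Without %f the trailing ".000" is left over and strptime raises ValueError.
try:
    datetime.strptime(raw, "%b %d %Y %H:%M:%S")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With %f the fractional seconds are consumed as well.
parsed = datetime.strptime(raw, "%b %d %Y %H:%M:%S.%f")

print(error_message)
print(parsed)
```

Note that %b depends on the locale, so this assumes English month abbreviations.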
Edit 2:
I tried what @MaxU suggested in his update, but it was problematic because my original data is formatted like the following:
Jan 1 2017 00:00:00.000 123 456 789 111 222 333
I'm only interested in the first 7 columns so I import my file with the following:
df = read_csv(fn, header=None, delim_whitespace=True, usecols=[0, 1, 2, 3, 4, 5, 6])
Then to create a column with datetime information from the first 4 columns I try:
df['datetime'] = to_datetime(df.ix[:, :3], format='%b %d %Y %H:%M:%S.%f')
However this doesn't work because to_datetime expects "integer, float, string, datetime, list, tuple, 1-d array, Series" as the first argument and df.ix[:, :3] returns a dataframe with the following format:
0 1 2 3
0 Jan 1 2017 00:00:00.000
How do I feed in every row of the first four columns to to_datetime such that I get one column of datetimes?
Edit 3:
I think I solved my second problem.
I just use the following command and do everything when I read my file in (I was basically just missing %f to parse past the seconds):
df = read_csv(fileName, header=None, delim_whitespace=True,
parse_dates={'datetime': [0, 1, 2, 3]},
date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H:%M:%S.%f'),
usecols=[0, 1, 2, 3, 4, 5, 6])
The whole reason I wanted to parse manually instead of letting pandas handle it, as @MaxU suggested, was to see whether giving explicit instructions would be faster, and it is! In my tests the snippet above runs approximately 5-6 times faster than letting pandas infer the format.

Have a go at this simpler approach:
import pandas as pd
df = pd.read_csv('file.txt', header=None)
df.columns = ['date']
df should be a dataframe with a single column, since the lines contain no commas. After that, try casting that column to datetime:
df['date'] = pd.to_datetime(df['date'])

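Pandas (tested with version 0.20.1) is smart enough to do it for you. Note that header=None is not passed here, so the first line of the file is consumed as the header, which is why the output starts at 00:10:00:
In [4]: pd.read_csv(fn, sep=r'\s+', parse_dates={'datetime': [0, 1, 2, 3]})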
Out[4]:
datetime
0 2015-01-20 00:10:00
1 2015-01-20 00:20:00
2 2015-01-20 00:30:00
3 2015-01-20 00:40:00
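UPDATE: if all entries have the same format, you can read each line as a single column (using a separator such as '~' that never occurs in the data) and parse it with an explicit format: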
df = pd.read_csv(fn, sep='~', names=['datetime'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%b %d %Y %H:%M:%S.%f')
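For the layout described in the question's second edit (four date/time fields followed by data columns), one possible alternative is to read the columns as they are and join the date parts afterwards. This is only a sketch: the two sample rows are made up, and io.StringIO stands in for the real file. It also avoids delim_whitespace, date_parser and the dict form of parse_dates, which, as far as I know, are deprecated in later pandas releases.

```python
import io
import pandas as pd

# Two made-up rows in the layout described in the question's second edit.
sample = (
    "Jan 1 2017 00:00:00.000 123 456 789 111 222 333\n"
    "Jan 1 2017 00:10:00.000 124 457 790 112 223 334\n"
)

# Keep only the first seven whitespace-separated columns (labelled 0..6).
df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+",
                 usecols=[0, 1, 2, 3, 4, 5, 6])

# Rebuild one string per row from the four date/time columns, then parse it
# with an explicit format so pandas does not have to infer anything.
parts = df[[0, 1, 2, 3]].astype(str)
joined = parts[0] + " " + parts[1] + " " + parts[2] + " " + parts[3]
df["datetime"] = pd.to_datetime(joined, format="%b %d %Y %H:%M:%S.%f")

# Drop the raw date/time columns now that they are combined.
df = df.drop(columns=[0, 1, 2, 3])
print(df)
```

Joining the strings first answers the "one column of datetimes from four columns" question directly, because to_datetime then receives a single Series of strings.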

Related

Convert multiple date formats to datetime in pandas

I have a column of messy data where the date formats differ, and I want them all converted consistently to datetime in pandas
df:
Date
0 1/05/2015
1 15 Jul 2009
2 1-Feb-15
3 12/08/2019
When I run this part:
df['Date'] = pd.to_datetime(df['Date'], format='%d %b %Y', errors='coerce')
I get
Date
0 NaT
1 2009-07-15
2 NaT
3 NaT
How do I convert it all to date time in pandas?
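pd.to_datetime is capable of handling multiple date formats in the same column. Specifying a format prevents it from determining the format of each value dynamically, so if there are multiple formats, do not specify one. (This describes pandas 1.x; from pandas 2.0 onward a single format is inferred from the first value, and format='mixed' is needed for per-value inference.)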
import pandas as pd
df = pd.DataFrame({
'Date': ['1/05/2015', '15 Jul 2009', '1-Feb-15', '12/08/2019']
})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)
Date
0 2015-01-05
1 2009-07-15
2 2015-02-01
3 2019-12-08
*There are limits to this ability to handle multiple formats. Mixed timezone-aware and timezone-naive datetimes will not be processed correctly. Likewise, mixed day-first and month-first notations will not always parse correctly.
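A small sketch of the day-first versus month-first ambiguity mentioned in the note, using one value from the question. Nothing in the string itself says which reading is intended, so it has to be stated through dayfirst or an explicit format:

```python
import pandas as pd

ambiguous = "12/08/2019"

# Default reading: month first, so this is December 8.
month_first = pd.to_datetime(ambiguous)

# dayfirst=True flips the interpretation to August 12.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format removes the guesswork altogether.
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first.date(), day_first.date(), explicit.date())
```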

Applying date to both %y and %Y

I have a dataframe where one of the columns called 'date', containing objects, looks like:
df =
|Date
|Mar-24
|Aug-22
|Sep-25
|...
I want to convert that column into dates, so that for example Mar-24 becomes 2024-03-01. So far I have tried
df['Date'] = pd.to_datetime(df['Date'], format= '%b-%y')
which I think should work, but among the few thousand rows there are some that contain the full year, such as 'Apr 2023', which %y won't pick up. Is there a way to find those rows and change them to the short year before applying the above code, or to give the code both %y and %Y?
Use the parameter errors='coerce' in combination with combine_first:
Minimal example:
import pandas as pd
series = pd.Series(['Mar-24', 'Aug-22', 'Sep-2025'], [0, 1, 2])
date1 = pd.to_datetime(series, format='%b-%y', errors='coerce')
date2 = pd.to_datetime(series, format='%b-%Y', errors='coerce')
date1.combine_first(date2)
Output:
0 2024-03-01
1 2022-08-01
2 2025-09-01
dtype: datetime64[ns]
Or for your specific case in one line:
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y', errors='coerce').combine_first(pd.to_datetime(df['Date'], format='%b-%Y', errors='coerce'))
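The question also mentions values such as 'Apr 2023', which use a space rather than a hyphen. A hedged sketch of one way to cover that as well: normalise the separator first (an assumption about how the irregular rows look), then apply the same two-format approach. The sample values are illustrative.

```python
import pandas as pd

series = pd.Series(['Mar-24', 'Aug-22', 'Apr 2023'])

# Assumption: the only irregularity besides the year length is a space
# used instead of a hyphen, so replace it before parsing.
normalized = series.str.replace(' ', '-', regex=False)

# Each format yields NaT where it does not match; combine_first fills the gaps.
short_year = pd.to_datetime(normalized, format='%b-%y', errors='coerce')
long_year = pd.to_datetime(normalized, format='%b-%Y', errors='coerce')
result = short_year.combine_first(long_year)

print(result)
```

Any row that is still NaT after this step matches neither format and is worth inspecting by hand.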

Changing date string to pandas datestamp

I have a dataframe with column date which looks like this:
Feb 24, 2020 # 12:47:31.616
I would like it to become this:
2020-02-24
I can achieve this using slicing since I am dealing only with one week's data hence all months will be Feb.
Is there a neat pandas way to change the datestamp to date format I desire?
Thank you for your suggestions.
Use to_datetime with format %b %d, %Y # %H:%M:%S.%f and then if necessary convert to dates by Series.dt.date or to datetimes by Series.dt.floor:
# Option 1: plain Python date objects
df = pd.DataFrame({'dates':['Feb 24, 2020 # 12:47:31.616','Feb 24, 2020 # 12:47:31.616']})
df['dates'] = pd.to_datetime(df['dates'], format='%b %d, %Y # %H:%M:%S.%f').dt.date
# Option 2 (instead of option 1, applied to the original strings): datetimes floored to midnight
df['dates'] = pd.to_datetime(df['dates'], format='%b %d, %Y # %H:%M:%S.%f').dt.floor('d')
print (df)
dates
0 2020-02-24
1 2020-02-24
Using pd.to_datetime with Series.str.split:
df = pd.DataFrame({'date':['Feb 24, 2020 # 12:47:31.616']})
date
0 Feb 24, 2020 # 12:47:31.616
df['date'] = pd.to_datetime(df['date'].str.split(r'\s#\s').str[0], format='%b %d, %Y')
date
0 2020-02-24
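To make the difference between the two conversions concrete, here is a short sketch on the sample value from the question. Series.dt.date gives plain Python date objects, while Series.dt.normalize (used here as an equivalent of flooring to the day) keeps the datetime dtype, so further .dt operations remain available:

```python
import pandas as pd

raw = pd.Series(['Feb 24, 2020 # 12:47:31.616'])
stamps = pd.to_datetime(raw, format='%b %d, %Y # %H:%M:%S.%f')

# Plain datetime.date objects; the column is no longer a datetime column.
as_dates = stamps.dt.date

# Still a datetime column, with the time of day reset to midnight.
as_midnight = stamps.dt.normalize()

print(as_dates.iloc[0], type(as_dates.iloc[0]).__name__)
print(as_midnight.iloc[0], as_midnight.dtype)
```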

Identifying a dateformat and change it into another

I am working with the following piece of data which has a different format of dates and which creates confusion later in the process. The data is like:
S. No DateTime Area
1 03/05/2019 6:33 A
2 06/03/2019 07:23:45 AM B
The first row is the format %m/%d/%Y h: mm and the second row is the format of %d/%m/%Y hh:mm: ss AM/PM. The first date value can be confusing though, is it 5th march or 3rd May. So to make everything of the same format, I want that my code detects the date format and changes in the desired format.
I have tried doing this:
df['Detection Date'] = pd.to_datetime(df['Detection Date & Time'], errors = 'coerce').dt.datetime
col = df['Detection Date'].apply(str)
for i in df.index:
if datetime.datetime.strptime(col, '%m/%d/%Y h:mm'):
ColDate = datetime.datetime.strftime(col, '%d/%m/%Y hh:mm:ss AM/PM')
But I am getting an error saying:
TypeError: strptime() argument 1 must be str, not Series
How should this be done?
Thanks
If it is OK to install a dependency, then you can use dateparser:
import pandas as pd
import dateparser
df = pd.DataFrame({'Detection Date & Time': ['03/05/2019 6:33', '06/03/2019 07:23:45 AM']})
df['Date & time'] = df['Detection Date & Time'].apply(dateparser.parse)
You can try both possible formats with to_datetime. With errors='coerce', values that do not match a format become missing values (NaT), so the two results can be combined with Series.fillna:
date1 = pd.to_datetime(df['DateTime'], errors = 'coerce', format='%m/%d/%Y %H:%M')
date2 = pd.to_datetime(df['DateTime'], errors = 'coerce', format='%d/%m/%Y %I:%M:%S %p')
df['DateTime'] = date1.fillna(date2)
print (df)
S. No DateTime Area
0 1 2019-03-05 06:33:00 A
1 2 2019-03-06 07:23:45 B
Finally, if you want to output a new format, add Series.dt.strftime. The advantage of this solution is that both formats are verified:
df['DateTime'] = date1.fillna(date2).dt.strftime('%d/%m/%Y %I:%M:%S %p')
print (df)
S. No DateTime Area
0 1 05/03/2019 06:33:00 AM A
1 2 06/03/2019 07:23:45 AM B
Details:
print (date1)
0 2019-03-05 06:33:00
1 NaT
Name: DateTime, dtype: datetime64[ns]
print (date2)
0 NaT
1 2019-03-06 07:23:45
Name: DateTime, dtype: datetime64[ns]
Another possible solution, without checking the second format: only values matching %m/%d/%Y %H:%M are converted to %d/%m/%Y %I:%M:%S %p, and all other values are left as they are:
parsed = pd.to_datetime(df['DateTime'], errors = 'coerce', format='%m/%d/%Y %H:%M')
df['DateTime'] = parsed.dt.strftime('%d/%m/%Y %I:%M:%S %p').where(parsed.notna(), df['DateTime'])
print (df)
S. No DateTime Area
0 1 05/03/2019 06:33:00 AM A
1 2 06/03/2019 07:23:45 AM B
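One detail worth checking whenever a format contains %p: the AM/PM marker only affects the parsed hour when the hour is read with the 12-hour directive %I. Combined with %H, the marker is matched but ignored, which goes unnoticed with morning times like the ones in this data. A minimal standard-library sketch, with a made-up afternoon value:

```python
from datetime import datetime

text = "06/03/2019 07:23:45 PM"  # illustrative afternoon timestamp

# %H reads "07" as a 24-hour value; the PM marker has no effect.
with_24h = datetime.strptime(text, "%d/%m/%Y %H:%M:%S %p")

# %I reads "07" as a 12-hour value; PM shifts it to 19.
with_12h = datetime.strptime(text, "%d/%m/%Y %I:%M:%S %p")

print(with_24h.hour, with_12h.hour)
```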

Pandas changing date to a shorter format

G'day!
In my limited time working with Python and Pandas one question comes up time and time again - what if my input data has date/time in a long format, how to change it to a shorter version?
For example, the date in the input file would be:
10/10/2019 5:52:30 AM
If I want to perform date/time operations with it, I'll need to convert it to datetime:
df['date'] = pd.to_datetime(df['date'], format="%d/%m/%Y %I:%M:%S %p")
So now I have datetime objects in full long format. But what if I only need the day/month/year?
I could of course convert them back to strings and then convert them back into datetime format.
df['date'] = df['date'].dt.strftime("%d/%m/%Y")
df['date'] = pd.to_datetime(df['date'], format="%d/%m/%Y")
It hurts my eyes to look at this... There should be a simpler way of doing this, right?
Pandas floor or round functions can do the job:
#Generate the data
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5],
'hour': [2, 23]})
df['Date']=pd.to_datetime(df)
#Floor and round datetime
df['Date'].dt.floor('d')
df['Date'].dt.round('d')
The output for dt.floor is:
0 2015-02-04
1 2016-03-05
Name: Date, dtype: datetime64[ns]
and for dt.round:
0 2015-02-04
1 2016-03-06
Name: Date, dtype: datetime64[ns]
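Applied to the kind of input shown in the question, the same idea might look like the sketch below. The second row and the column names day and nearest_day are made up for illustration. floor always drops the time of day, while round moves to the nearest midnight, so an evening timestamp lands on the following day:

```python
import pandas as pd

df = pd.DataFrame({'date': ['10/10/2019 5:52:30 AM', '11/10/2019 6:15:00 PM']})

# Parse once with an explicit day-first, 12-hour format.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %I:%M:%S %p')

# floor: truncate to the start of the same day.
df['day'] = df['date'].dt.floor('D')

# round: nearest midnight, so 18:15 goes to the next day.
df['nearest_day'] = df['date'].dt.round('D')

print(df)
```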
