Pandas date formatting with different multiple date formats problems - python

I am trying to convert a column in my dataframe to dates, which are meant to be birthdays. The data was manually captured over a period of years in different formats. I can't get pandas to format the whole column correctly.
formats include:
YYYYMMDD
DDMMYYYY
DD/MM/YYYY
DD-MMM-YYYY (eg JAN)
I have tried
dates['BIRTH-DATE(MAIN)'] = pd.to_datetime(dates['BIRTH-DATE(MAIN)'])
but I get the error
ValueError: year 19670314 is out of range
Not sure how I can get it to include multiple date formats?

You could create your own function to handle this. For example, something like:
df = pd.DataFrame({'date': {0: '20180101', 1: '01022018', 2: '01/02/2018', 3: '01-JAN-2018'}})
def fix_date(series, patterns=['%Y%m%d', '%d%m%Y', '%d/%m/%Y', '%d-%b-%Y']):
    datetimes = []
    for pat in patterns:
        datetimes.append(pd.to_datetime(series, format=pat, errors='coerce'))
    return pd.concat(datetimes, axis=1).ffill(axis=1).iloc[:, -1]
df['fixed_dates'] = fix_date(df['date'])
[out]
print(df)
date fixed_dates
0 20180101 2018-01-01
1 01022018 2018-02-01
2 01/02/2018 2018-02-01
3 01-JAN-2018 2018-01-01

In my view, pandas is very good at converting dates, but it is nearly impossible for it to always guess the right format automatically. Use pd.to_datetime with the option errors='coerce' and check by hand the dates that were not converted.
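A minimal sketch of that advice, assuming a column of strings. The helper name `unparsed_rows` is hypothetical, not part of pandas. It parses with errors='coerce' and returns the values that were present in the input but came back as NaT, so they can be reviewed by hand:

```python
import pandas as pd

def unparsed_rows(series, fmt):
    """Return the original values that were present but failed to parse with fmt."""
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    # NaT after parsing combined with a non-missing input means the parse failed.
    failed = parsed.isna() & series.notna()
    return series[failed]

raw = pd.Series(["20180101", "01/02/2018", None, "not a date"])
print(unparsed_rows(raw, "%Y%m%d"))
```

Values that were already missing in the input are deliberately left out of the report, since they are not parsing failures.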

Related

pandas to_datetime but replace with fixed value when fail/coerce, preserve 'meaningful' NaNs

When using pd.to_datetime on my data frame I get this error:
Out of bounds nanosecond timestamp: 30-04-18 00:00:00
Now from looking on StackO I know I can simply use the coerce option:
pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
But I was wondering if anyone had an idea on how I might replace these values with a fixed value? Say 1900-01-01 00:00:00 (or maybe 1955-11-12 for anyone who gets the reference!)
Reason being that this data frame is part of a process that handles thousands and thousands of JSONs per day. I want to be able to see in the dataset easily the incorrect ones by filtering for said fixed date.
It is just as invalid for the JSON to contain any date before 2010 so using an earlier date is fine and it is also perfectly acceptable to have a blank (NA) date value so I can't rely on just blanking the data.
Use Series.mask to replace missing values with a default datetime, but only those missing values that were generated by to_datetime with errors='coerce':
df=pd.DataFrame({"date": [np.nan,'20180101','20-20-0']})
t = pd.to_datetime('1900-01-01')
date = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = date.mask(date.isna() & df['date'].notna(), t)
print (df)
date
0 NaT
1 2018-01-01
2 1900-01-01
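The same idea can be wrapped in a small function so the flagged rows are easy to filter afterwards. This is only a sketch: the names `parse_or_flag` and `SENTINEL` are hypothetical, and the sentinel value is the 1900-01-01 suggested in the question.

```python
import numpy as np
import pandas as pd

SENTINEL = pd.Timestamp("1900-01-01")  # assumed marker value for unparseable dates

def parse_or_flag(series, fmt="%Y%m%d", sentinel=SENTINEL):
    """Parse dates; keep real NaNs as NaT, replace parse failures with sentinel."""
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    # Only inputs that were present but could not be parsed count as failures.
    failed = parsed.isna() & series.notna()
    return parsed.mask(failed, sentinel)

df = pd.DataFrame({"date": [np.nan, "20180101", "20-20-0"]})
df["date"] = parse_or_flag(df["date"])

# Filter for the sentinel to see the incorrect records.
bad = df[df["date"] == SENTINEL]
print(bad)
```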

Python ValueError: time data '02-01-2020' does not match format '%d/%m/%y' (match)

I am working on a dataset for machine learning, but I get an error because the dates do not match the format. I have tried both "%d-%m-%y" and "%d/%m/%y" as the format string, but neither worked. How can I solve this, given that the dates in the dataset are in different formats?
df_MR['Date'] = pd.to_datetime(df_MR['Date'], format = "%d-%m-%y")
ValueError: time data '30/01/20' does not match format '%d-%m-%y' (match)
df_MR['Date'] = pd.to_datetime(df_MR['Date'], format = "%d/%m/%y")
ValueError: time data '02-01-2020' does not match format '%d/%m/%y' (match)
I've had some success using the infer_datetime_format argument of to_datetime in a small example (note that this argument is deprecated as of pandas 2.0, so this may only apply to older versions):
>>> df = pd.DataFrame({'a': ['02-01-2020', '03-02-20', '03/02/2020', '04/05/2020']})
>>> pd.to_datetime(df['a'], infer_datetime_format=True)
0 2020-02-01
1 2020-03-02
2 2020-03-02
3 2020-04-05
Name: a, dtype: datetime64[ns]
Regarding "What can I do as dataset dates are in a different format?", you could:
fix the data source so that it returns coherent data,
add an intermediate normalisation pass to your pipeline to handle this,
or try both formats in sequence, e.g.
try: # try to parse 4 digit years
    df_MR['Date'] = pd.to_datetime(df_MR['Date'], format = "%d-%m-%Y")
except ValueError: # fallback to 2 digits year
    df_MR['Date'] = pd.to_datetime(df_MR['Date'], format = "%d/%m/%y")
One more alternative is to not pass in a format at all, and hope that pandas will get it right. Since both your date formats are in DMY order, you could try pd.to_datetime(dt, dayfirst=True).
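When the formats are mixed within a single column, a try/except around the whole column fails for every format, so the normalisation pass and the "both formats in sequence" idea can be combined per row. A minimal sketch, assuming day-first strings that differ only in separator and year length (the function name `parse_day_first` is hypothetical):

```python
import pandas as pd

def parse_day_first(series):
    """Parse day-first dates with '-' or '/' separators and 2- or 4-digit years."""
    # Normalisation pass: use one separator everywhere.
    cleaned = series.str.strip().str.replace("/", "-", regex=False)
    # First attempt: four-digit years. Rows that do not match become NaT.
    four_digit = pd.to_datetime(cleaned, format="%d-%m-%Y", errors="coerce")
    # Second attempt: two-digit years.
    two_digit = pd.to_datetime(cleaned, format="%d-%m-%y", errors="coerce")
    # Keep the four-digit result where it exists, otherwise use the two-digit one.
    return four_digit.fillna(two_digit)

dates = pd.Series(["02-01-2020", "30/01/20", "03/02/2020"])
print(parse_day_first(dates))
```

Rows that match neither format stay NaT, so they can still be inspected separately.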

How to convert python dataframe timestamp to datetime format

I have a dataframe with date information in one column.
The date visually appears in the dataframe in this format: 2019-11-24
but when you print the type it shows up as:
Timestamp('2019-11-24 00:00:00')
I'd like to convert each value in the dataframe to a format like this:
24-Nov
or
7-Nov
for single digit days.
I've tried using various datetime and strptime commands to convert but I am getting errors.
Here's a way to do:
df = pd.DataFrame({'date': ["2014-10-23","2016-09-08"]})
df['date_new'] = pd.to_datetime(df['date'])
df['date_new'] = df['date_new'].dt.strftime("%d-%b")
date date_new
0 2014-10-23 23-Oct
1 2016-09-08 08-Sep
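Note that %d zero-pads the day, while the question asks for 7-Nov rather than 07-Nov. One portable way to drop the padding is to build the string from the integer day. This is a sketch that assumes an English locale for the %b month abbreviation:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019-11-24", "2019-11-07"]})
parsed = pd.to_datetime(df["date"])

# dt.day is an integer, so converting it to text gives no leading zero.
df["date_new"] = parsed.dt.day.astype(str) + "-" + parsed.dt.strftime("%b")
print(df)
```

This avoids platform-specific strftime flags for unpadded days.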

YYMM to date time python

I have a dataframe column in Python that is in the format YYMM. E.g. January 1996 is 9601.
I'm having a hard time converting it from 9601 to a useable date time format. I want the new format to be 01-01-1996. Does anyone have any suggestions? I tried pd.to_datetime function but it's not getting the results I'm looking for.
Use to_datetime with parameter format:
df = pd.DataFrame({'col':['9601', '9705']})
df['col'] = pd.to_datetime(df['col'], format='%y%m')
print (df)
col
0 1996-01-01
1 1997-05-01
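The result above is a datetime column, which displays as 1996-01-01. To get text in the 01-01-1996 layout asked for in the question, the parsed column can be formatted with strftime. A small sketch (the `as_text` column name and the extra '0503' value are illustrative assumptions); as far as I know, %y follows the usual convention of reading 69 to 99 as 1969 to 1999 and 00 to 68 as 2000 to 2068:

```python
import pandas as pd

df = pd.DataFrame({"col": ["9601", "9705", "0503"]})

# Parse YYMM; the day defaults to the first of the month.
parsed = pd.to_datetime(df["col"], format="%y%m")

# Format as DD-MM-YYYY text for display or export.
df["as_text"] = parsed.dt.strftime("%d-%m-%Y")
print(df)
```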

Working on dates with mm-dd-YY & YY-mm-dd format in pandas

I am trying to do a simple test of pandas' capabilities for handling dates and formats.
For that I have created a dataframe with the values below:
df = pd.DataFrame({'date1' : ['10-11-11','12-11-12','10-10-10','12-11-11',
'12-12-12','11-12-11','11-11-11']})
Here I am assuming that the values are dates. And I am converting it into proper format using pandas' to_datetime function.
df['format_date1'] = pd.to_datetime(df['date1'])
print(df)
Out[3]:
date1 format_date1
0 10-11-11 2011-10-11
1 12-11-12 2012-12-11
2 10-10-10 2010-10-10
3 12-11-11 2011-12-11
4 12-12-12 2012-12-12
5 11-12-11 2011-11-12
6 11-11-11 2011-11-11
Here, pandas reads the dates as "MM/DD/YY" and converts them to its native format (i.e. YYYY/MM/DD). I want to know whether I can tell pandas that the input format is actually "YY/MM/DD" and have it convert to the native format accordingly. This would change the value in row 5, for example. To do this, I ran the following code, but it gives me an error.
df3['format_date2'] = pd.to_datetime(df3['date1'], format='%Y/%m/%d')
ValueError: time data '10-10-10' does not match format '%Y/%m/%d' (match)
I have seen the sort of solution here. But I was hoping to get a little easy and crisp answer.
%Y in the format specifier matches a 4-digit year (e.g. 2016), while %y matches a 2-digit year (e.g. 16, meaning 2016). Change %Y to %y and it should work.
Also, your format specifier uses slashes, but your data uses dashes. You need to change your format to %y-%m-%d.
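Putting both fixes together, a short sketch on a subset of the question's data shows how the same strings parse under the two readings (the column names `yy_mm_dd` and `mm_dd_yy` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"date1": ["10-11-11", "12-11-12", "11-12-11"]})

# Read the strings as year-month-day with a 2-digit year.
df["yy_mm_dd"] = pd.to_datetime(df["date1"], format="%y-%m-%d")

# For comparison, read the same strings as month-day-year.
df["mm_dd_yy"] = pd.to_datetime(df["date1"], format="%m-%d-%y")
print(df)
```

Passing the format explicitly removes the guesswork, so ambiguous values such as 11-12-11 are read the way you specify.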
