How to parse irregular date formats in pandas? - python

I'm parsing a date column that contains irregular date formats that pandas cannot interpret on its own. The dates use different languages for days, months, and years, come in varying formats, and often include timestamps. (Bonus: would separating them with string/regex operations in a lambda or loop be the fastest method?) What's the best option and workflow to tackle these several tens of thousands of date entries?
The entries are unknown to pandas and dateutil.parser.
Examples include:
19.8.2017, 21:23:32
31/05/2015 19:41:56
Saturday, 18. May
11 - 15 July 2001
2019/4/28 下午6:29:28
1 JuneMay 2000
19 aprile 2008 21:16:37 GMT+02:00
Samstag, 15. Mai 2010 20:55:10
So 23 Jun 2007 23:45 CEST
28 August 1998
30 June 2001
1 Ноябрь 2008 г. 18:46:59
Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time)
May-28-11 6:56:08 PM
Sat Jun 26 2010 21:55:54 GMT+0200 (West-Europa (zomertijd))
lunedì 5 maggio 2008 9.30.33
"ValueError: ('Unknown string format:', '1 JuneMay 2000')"
I realize this may be a cumbersome and undesirable task. Luckily the dates are currently nonessential to my project so they may be left alone, but a solution would be favorable. Any and all replies are appreciated, thank you.

Taken line by line, many of your dates work:
>>> pd.to_datetime('19.8.2017, 21:23:32')
Timestamp('2017-08-19 21:23:32')
But there are several issues:
since your format is irregular, pandas cannot guess whether 01-02-2019 is the first of February 2019 or the second of January 2019 (I don't know if you can either),
some of your examples cannot be converted into a date at all (Saturday, 18. May: which year?),
the day and month names are in different languages (aprile looks Italian, Samstag is German),
some of your examples only work without the parenthesized content:
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200') # works
Timestamp('2011-06-18 19:46:46-0200', tz='pytz.FixedOffset(-120)')
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ') # doesn't work.
...
ValueError: ('Unknown string format:', 'Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ')
It is clear that you cannot convert every entry into a Timestamp. I would create a new column with the correctly parsed dates as Timestamps and the others saved as NaT.
For example:
date
02-01-2019
Saturday, 18. May
will become:
date                 new_date
02-01-2019           Timestamp('2019-01-02 00:00:00')
Saturday, 18. May    NaT
For this I would first delete the parenthesized part of the initial column:
df2 = df.assign(
    date2=lambda x: x['date'].str.split('(').str[0],  # keep only the part before any opening parenthesis
    new_date=lambda x: x['date2'].apply(lambda y: pd.to_datetime(y, errors='coerce')))  # parse row by row, NaT on failure
# Referencing date2 inside the same assign call requires Python >= 3.6 (keyword argument order is preserved).
Afterwards, you can see what remains by looking at the rows that kept NaT values.
For translation, you can try replacing the foreign words, but it will be really tedious.
This is really slow (because of the row-by-row apply), but if your data are not consistent you cannot work on the column as a whole.
I hope it helps.
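As a small illustration of the translation idea (the dictionary below is a tiny, hypothetical sample of foreign month and day names, not a complete mapping), one could normalize the words before parsing:
import pandas as pd

# Hypothetical, partial mapping of foreign month/day names to English; it would need to be extended a lot.
TRANSLATIONS = {
    'aprile': 'April',      # Italian
    'maggio': 'May',        # Italian
    'Samstag': 'Saturday',  # German
}

def translate(text):
    # Replace every known foreign word with its English equivalent before parsing.
    for foreign, english in TRANSLATIONS.items():
        text = text.replace(foreign, english)
    return text

dates = pd.Series(['19 aprile 2008 21:16:37', '28 August 1998'])
parsed = dates.map(translate).apply(lambda y: pd.to_datetime(y, errors='coerce'))
print(parsed)
Entries that still cannot be parsed after translation simply stay NaT, so this can be run repeatedly while the dictionary grows.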

Related

Sort out dates with different formats Python

I am having a hard time trying to sort dates with different formats. I have a Series with inputs containing dates in many different formats and need to extract them and sort them chronologically.
So far I have set up different regexes for fully numerical dates (01/01/1989), dates with a month name (either Mar 12 1989, March 1989, or 12 Mar 1989) and dates where only the year is given (see code below):
pat1 = r'(\d{0,2}[/-]\d{0,2}[/-]\d{2,4})'  # matches mm/dd/yy and mm/dd/yyyy
pat2 = r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|June|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})'
pat3 = r'((?<!\d)(\d{4})(?!\d))'
finalpat = pat1 + "|" + pat2 + "|" + pat3
df2 = df1.str.extractall(finalpat).groupby(level=0).first()
I now have a dataframe with the matches of the different regexes above in different columns, which I need to transform into usable dates.
The problem is that I have dates like Mar 12 1989, 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe.
If it weren't for the two formats (Month dd YYYY and dd Month YYYY), I could easily do this:
df3 = df2.copy()
dico = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
        'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
df3[1] = df3[1].str.replace(r"(?<=[A-Z]{1}[a-z]{2})\w*", "")   # drop the letters of the month after the first 3
for key, item in dico.items():                                 # replace each month abbreviation by its number
    df3[1] = df3[1].str.replace(key, item)
df3[1] = df3[1].str.replace(r"^(\d{1,2}/\d{4})", r'01/\g<1>')  # add 01 as the day if no day is given
df3[1] = pd.to_datetime(df3[1], format='%d/%m/%Y').dt.strftime('%Y%m%d')
where df3[1] is the column of interest. I use a dictionary to change each month name to its number and get my dates as I want them.
The problem is that with two date formats (Mar 12 1989 and 12 Mar 1989), one of the two will be wrongly transformed.
Is there a way to discriminate between the date formats and apply different transformations accordingly?
Thanks a lot
problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989
and Mar 1989 (no day) in the same column of my dataframe.
pandas.to_datetime can cope with that; consider the following example:
import pandas as pd
df = pd.DataFrame({'d_str':["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})
df['d_dt'] = pd.to_datetime(df.d_str)
print(df)
output
d_str d_dt
0 Mar 12 1989 1989-03-12
1 12 Mar 1989 1989-03-12
2 Mar 1989 1989-03-01
Now you can sort using d_dt, as it has type datetime64[ns], but keep in mind that a missing day is treated as the 1st day of the given month. Be warned though: it might go wrong if your data contain dates in middle-endian format (mm/dd/yy).
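If you know your all-numeric dates are day-first (an assumption about the data, since pandas cannot detect it), the dayfirst flag of pandas.to_datetime resolves that ambiguity, for example:
import pandas as pd

print(pd.to_datetime("01/02/1989"))                 # 1989-01-02, month first by default
print(pd.to_datetime("01/02/1989", dayfirst=True))  # 1989-02-01, day first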

Sort table in pandas by certain value in column

I am using pandas to sort this table by "Departure date" and "Value", which I can do with sort_values(["Departure date", "Value"]), but the thing is that I need to sort only Wednesday's flights, starting from the cheapest.
When I print(type(Data["Departure date"])) it says: <class 'pandas.core.series.Series'>, if that helps.
City Departure date Airline Value
Podgorica Sat 1 Jan Ryanair 14.46
Managua Wed 5 Jan Ryanair 1699.05
Bucharest Tue 11 Jan Ryanair 38.24
Oslo Wed 12 Jan Ryanair 24.32
Istanbul Wed 12 Jan Ryanair 120.00
Kyiv Wed 12 Jan Windrose 227.43
I could maybe split Departure date, extract only the days of the week, sort and join them back later, but it looks like a lot of work.
I just recently started with python and pandas so any help is welcomed. Thank you!
Isn't sorting by two columns the solution for you? Or do other dates need to stay in the same order?
data.sort_values(['Departure date', 'Value'])
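If only the Wednesday flights should be kept and then ordered from cheapest, here is a minimal sketch, assuming the day abbreviation is always the first token of Departure date and the column names match the table above:
import pandas as pd

data = pd.DataFrame({
    'City': ['Podgorica', 'Managua', 'Bucharest', 'Oslo', 'Istanbul', 'Kyiv'],
    'Departure date': ['Sat 1 Jan', 'Wed 5 Jan', 'Tue 11 Jan', 'Wed 12 Jan', 'Wed 12 Jan', 'Wed 12 Jan'],
    'Airline': ['Ryanair', 'Ryanair', 'Ryanair', 'Ryanair', 'Ryanair', 'Windrose'],
    'Value': [14.46, 1699.05, 38.24, 24.32, 120.00, 227.43],
})

# Keep only rows whose departure date starts with "Wed", then sort those by price.
wednesdays = data[data['Departure date'].str.startswith('Wed')].sort_values('Value')
print(wednesdays)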

Regex pattern in Pandas Series.str.extract() not pulling single digit day from column

I am attempting to use a regex on a dataframe series/column to extract a date from df['Subject'] and create a new column df['Date'] with the extracted date.
The following code extracts most of the dates in the column:
Code:
df['Bug Date'] = df['Subject'].str.extract('(\s\w{3}\s\w{3}\s\d{1,2}\s\d{2}\:\d{2}\:\d{2}\s\d{4})')
Input: Typical text row in the df['Subject'] column:
'Call Don today [atron#uw.edu.au - Wed Apr 14 00:18:50 2021]'
' Report access [rbund#gmail.com - Mon Apr 4 13:11:12 2021]'
Output: 'Wed Apr 14 00:18:50 2021'
'Mon Apr 4 13:11:12 2021'
A few of the dates, however, all with single-digit days, show up as NaN. Another option I am trying, which gives me no errors but also no changes, is:
option1 = '(\s\w{3}\s\w{3}\s\d{1,2}\s\d{2}\:\d{2}\:\d{1}\s\d{4})'
df.replace({'Bug Date':'NaN'},{'Subject':option1},inplace=True)
with Pandas:
DataFrame.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Help would be appreciated. Why doesn't \d{1,2} work on some single-digit days and not others? After careful analysis of the strings I see no difference, yet the bug is consistent: 4 rows containing a single-digit day of the month change to NaN, while many other single-digit rows transfer to the new column just fine.
Here are a few rows of data. The first 4 are the trouble rows, out of about 200 rows with single and double digit day 'strings':
'Re: report [karen.glass#google.edu - Fri Apr 2 09:27:38 2021]', #results in NaN
'Re: report [hong.li#msoft.edu - Mon Apr 5 09:39:37 2021]', #results in NaN
'Re: report [sdgesmin#563.com - Wed Apr 7 09:21:02 2021]', #results in NaN
'Re: report [pdefgios#utonto.ca - Thu Apr 8 12:40:28 2021]', #results in NaN
'Re: report [zhuig-li7#mail.ghua.edu.cn - Tue Apr 13 02:38:51 2021]', #Good
'Re: report [l4ddgri#eie.grdf - Mon Mar 8 12:50:34 2021]', #Good
'Re: report [luca.jodfge#ki.sfge - Thu Apr 8 23:52:20 2021]' #Good
After much trial and error, I ended up using:
df['Bug Date'] = df['Subject'].str.slice(start=-25, stop=-1).str.pad(25)
This sliced date-and-time column gave me no errors, but when I tried to convert it with to_datetime, a seemingly random date error would pop up. So I added an extra space to the to_datetime(format=...) string, going from:
'%a %b %d %H:%M:%S %Y'
to
' %a %b %d %H:%M:%S %Y'
and that seems to have done the trick. <fingers_crossed>
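For what it's worth, this behaviour would be consistent with the raw subject strings containing a double space before single-digit days, the way ctime-style timestamps are padded (e.g. 'Mon Apr  4 13:11:12 2021'). That is only an assumption about the data, since the rendered question collapses whitespace, but if it holds, letting the extraction pattern accept one or more whitespace characters between the month and the day would also fix it:
import pandas as pd

df = pd.DataFrame({'Subject': [
    'Re: report [karen.glass#google.edu - Fri Apr  2 09:27:38 2021]',  # double space before the day (assumed)
    'Call Don today [atron#uw.edu.au - Wed Apr 14 00:18:50 2021]',
]})

# \s+ between the month and the day tolerates either one or two spaces.
pattern = r'(\w{3}\s\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4})'
df['Bug Date'] = df['Subject'].str.extract(pattern, expand=False)
print(df['Bug Date'])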

Separating months and years sharing the same column in Pandas dataframe

I currently have a 'Date' column serving as my index for a pandas dataframe that is of the form:
January
February
....
Year2
January
February
...
Year3
(It came from a PDF table extractor.) Is there an easy way to separate the years and months so that each month gets the proper year, or to build a proper datetime column to serve as my index?
Right now I am thinking of applying a function that checks whether a value is numeric and, if so, copies it to another column and deletes it, but there should be an easier way.
All are objects, but the years are in numeric form, whereas the months are in long string form.
Thank you very much in advance.
Using ffill with to_numeric:
df['Year'] = pd.to_numeric(df.MixCol, errors='coerce').ffill().astype(int)
df = df.loc[pd.to_numeric(df.MixCol, errors='coerce').isnull()]
df
Out[86]:
MixCol Year
1 January 2017
2 February 2017
4 January 2018
5 February 2018
Data input
MixCol
2017
January
February
2018
January
February
2019
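To take it one step further and build the proper datetime index the question asks for, the month names can be combined with the forward-filled years. A minimal sketch, assuming the months are full English month names and reusing the MixCol column from above:
import pandas as pd

df = pd.DataFrame({'MixCol': ['2017', 'January', 'February', '2018', 'January', 'February']})

year = pd.to_numeric(df.MixCol, errors='coerce').ffill().astype(int)    # carry each year down over its months
df = df.loc[pd.to_numeric(df.MixCol, errors='coerce').isnull()].copy()  # keep only the month rows
df.index = pd.to_datetime(df.MixCol + ' ' + year[df.index].astype(str), format='%B %Y')
print(df)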

combining and formatting dtypes in pandas

Working on a project where ultimately I want to try and predict NBA home game attendance. I've scraped my preliminary data, but still want to add other fields such as arena capacity, win streak and other fields I might find valuable.
In my initial dataframe I'm just not sure how to combine my date fields in a way that will make them easier to plot and work with later. Any other tips would also be appreciated. Thanks.
You have three original fields here: Date, Year, and Time. (Weekday can be derived from these.)
One route would be to concatenate their string-forms and form a Series of datetimes:
>>> concat = df['Date'] + ' ' + df['Year'].astype(str) + ' ' + df['Time']
>>> df['Fulldate'] = pd.to_datetime(concat)
>>> df
Weekday Date Year Time Fulldate
0 Tue Oct 30 2012 7:00 pm 2012-10-30 19:00:00
1 Tue Oct 30 2012 7:30 pm 2012-10-30 19:30:00
2 Tue Oct 30 2012 7:00 pm 2012-10-30 19:00:00
3 Wed Oct 31 2012 7:30 pm 2012-10-31 19:30:00
4 Wed Oct 31 2012 8:00 pm 2012-10-31 20:00:00
From there, you're free to derive additional fields with the .dt accessor. For instance:
>>> df.Fulldate.dt.month
0 10
1 10
2 10
3 10
4 10
Name: Fulldate, dtype: int64
>>> df.Fulldate.dt.weekday.isin((5, 6)) # weekend games
Here's a full list of datetime-like properties:
https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties
In the future, try to make your question a little more specific and post something people can easily reproduce, not pictures.
