I am having a hard time trying to sort dates with different formats. I have a Series with inputs containing dates in many different formats and need to extract them and sort them chronologically.
So far I have set up different regexes for fully numerical dates (01/01/1989), dates with a month name (Mar 12 1989, March 1989, or 12 Mar 1989), and dates where only the year is given (see code below):
pat1 = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})' # matches mm/dd/yy and mm/dd/yyyy
pat2 = r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})'
pat3 = r'((?<!\d)(\d{4})(?!\d))'
finalpat = pat1 + "|" + pat2 + "|" + pat3
df2 = df1.str.extractall(finalpat).groupby(level=0).first()
I now have a dataframe with the matches of the different regexes above in different columns, which I need to transform into usable dates.
The problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe.
If only one of the two formats (Month dd YYYY or dd Month YYYY) were present, I could easily do this:
df3 = df2.copy()
dico = {"Jan": '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
df3[1] = df3[1].str.replace(r"(?<=[A-Z][a-z]{2})\w*", "", regex=True) # drop the letters of the month name after the first 3
for key, item in dico.items(): # replace each month name by its number
    df3[1] = df3[1].str.replace(key, item, regex=False)
df3[1] = df3[1].str.replace(r"^(\d{1,2}/\d{4})", r'01/\g<1>', regex=True) # add 01 as the day if none was given
df3[1] = pd.to_datetime(df3[1], format='%d/%m/%Y').dt.strftime('%Y%m%d')
where df3[1] is the column of interest. I use a dictionary to change month names to their numbers and get the dates in the form I want.
The problem is that with two formats of dates (Mar 12 1989 and 12 Mar 1989), one of the two will be wrongly transformed.
Is there a way to discriminate between the date formats and apply a different transformation to each?
Thanks a lot
problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989
and Mar 1989 (no day) in the same column of my dataframe.
pandas.to_datetime can cope with that; consider the following example:
import pandas as pd
df = pd.DataFrame({'d_str':["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})
df['d_dt'] = pd.to_datetime(df.d_str)
print(df)
output
d_str d_dt
0 Mar 12 1989 1989-03-12
1 12 Mar 1989 1989-03-12
2 Mar 1989 1989-03-01
Now you can sort using d_dt, as it has type datetime64[ns], but keep in mind that a missing day is treated as the 1st day of the given month. Be warned, though, that this might fail if your data contain dates in middle-endian format (mm/dd/yy).
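If middle-endian and day-first entries really are mixed, one hedged workaround (a sketch, assuming you know the handful of candidate formats in advance) is to parse with each explicit format in turn, coercing failures to NaT, and keep the first success per row:

```python
import pandas as pd

s = pd.Series(["03/12/1989", "Mar 12 1989", "12 Mar 1989", "Mar 1989"])

# Try each expected format in turn; rows a format cannot parse become
# NaT and are filled in by a later format's successes.
candidates = ["%m/%d/%Y", "%b %d %Y", "%d %b %Y", "%b %Y"]
parsed = pd.Series(pd.NaT, index=s.index)
for fmt in candidates:
    parsed = parsed.fillna(pd.to_datetime(s, format=fmt, errors="coerce"))
print(parsed)
```

The order of `candidates` encodes which interpretation wins for ambiguous strings, so put the format you trust most first.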
Related
I am cleaning up a dataframe that has date of birth and date of death stored as strings. There are multiple date formats in those columns; some contain just a year (which is all I need). These are the formats of dates:
Jan 10 2020
1913
10/8/2019
June 14th 1980
All I need is the year from each date. I have not had any luck with pandas to_datetime, since a significant portion of the rows only contain a year to begin with.
Is there a way for me to pull just year from the strings so that I can get each column to look like:
2020
1913
2019
1980
The simplest way is to use a parser which will accept these and other formats:
import pandas as pd
from dateutil import parser
df = pd.DataFrame({"mydates":["Jan 10 2020", "1913", "10/8/2019", "June 14th 1980"]})
df['years'] = df['mydates'].apply(parser.parse).dt.strftime('%Y')
print(df)
You can use str.extract (note that it needs a capturing group):
df['BirthDate'] = df['BirthDate'].str.extract(r'(\d{4})', expand=False)
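For instance, on the sample dates from the question (the column name BirthDate is assumed here):

```python
import pandas as pd

df = pd.DataFrame({"BirthDate": ["Jan 10 2020", "1913", "10/8/2019", "June 14th 1980"]})
# extract requires a capturing group; (\d{4}) grabs the first run of 4 digits
df["Year"] = df["BirthDate"].str.extract(r"(\d{4})", expand=False)
print(df["Year"].tolist())  # → ['2020', '1913', '2019', '1980']
```

With expand=False the result is a Series rather than a one-column DataFrame, which assigns cleanly to a new column.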
I am attempting to use regex, extracting a date from df['Subject'], on a dataframe series/column and creating a new column df['Date'] with the resulting date extraction.
The following code is extracting most column dates:
Code:
df['Bug Date'] = df['Subject'].str.extract(r'(\s\w{3}\s\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4})')
Input: Typical text row in the df['Subject'] column:
'Call Don today [atron#uw.edu.au - Wed Apr 14 00:18:50 2021]'
' Report access [rbund#gmail.com - Mon Apr 4 13:11:12 2021]'
Output: 'Wed Apr 14 00:18:50 2021'
'Mon Apr 4 13:11:12 2021'
A few of the dates, however (all with single-digit days), show up as NaT. Another option I tried, which gives no errors but also no changes, is:
option1 = r'(\s\w{3}\s\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{1}\s\d{4})'
df.replace({'Bug Date': 'NaN'}, {'Subject': option1}, inplace=True)
with Pandas:
DataFrame.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Help would be appreciated. Why does \d{1,2} fail on some single-digit days but not others? After careful analysis of the strings, I see no difference; the bug, however, is consistent: 4 rows containing a single-digit day string change to NaN, while many other single-digit rows transfer to the new column just fine.
Here are a few rows of data. The first 4 are the trouble rows, out of about 200 rows with single and double digit day 'strings':
'Re: report [karen.glass#google.edu - Fri Apr 2 09:27:38 2021]', #results in NaN
'Re: report [hong.li#msoft.edu - Mon Apr 5 09:39:37 2021]', #results in NaN
'Re: report [sdgesmin#563.com - Wed Apr 7 09:21:02 2021]', #results in NaN
'Re: report [pdefgios#utonto.ca - Thu Apr 8 12:40:28 2021]', #results in NaN
'Re: report [zhuig-li7#mail.ghua.edu.cn - Tue Apr 13 02:38:51 2021]', #Good
'Re: report [l4ddgri#eie.grdf - Mon Mar 8 12:50:34 2021]', #Good
'Re: report [luca.jodfge#ki.sfge - Thu Apr 8 23:52:20 2021]' #Good
After many a trial and error, I ended up using:
df['Bug Date'] = df['Subject'].str.slice(start=-25, stop=-1).str.pad(25)
This date-and-time column creation gave me no errors, but when I tried to convert it with to_datetime, a random error date would pop up. So I added an extra space to the to_datetime(format=) string, from:
'%a %b %d %H:%M:%S %Y'
to
' %a %b %d %H:%M:%S %Y'
and that seems to have done the trick. <fingers_crossed>
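That extra-space fix is a hint that the failing rows are ctime-style strings that pad single-digit days with a second space, which a single \s in the extraction pattern cannot cross. A hedged sketch (the sample rows below assume that padding is present) that uses \s+ for the gap before the day instead:

```python
import pandas as pd

df = pd.DataFrame({"Subject": [
    "Re: report [karen.glass#google.edu - Fri Apr  2 09:27:38 2021]",   # padded single-digit day
    "Re: report [zhuig-li7#mail.ghua.edu.cn - Tue Apr 13 02:38:51 2021]",
]})
# \s+ tolerates the extra padding space before single-digit days
pat = r'(\w{3}\s\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4})'
df['Bug Date'] = df['Subject'].str.extract(pat, expand=False)
print(df['Bug Date'].tolist())
```

If that is indeed the cause, the \s+ version extracts both padded and unpadded rows without resorting to string slicing.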
I'm parsing a date column that contains irregular date formats that pandas won't interpret. The dates include days, months, and years in different languages, as well as varying formats, and the entries often include timestamps. (Bonus: would separating them by string/regex with lambda/loops be the fastest method?) What's the best option and workflow to tackle these several tens of thousands of date entries?
The entries are unknown to pandas and dateutil.parser.
Examples include:
19.8.2017, 21:23:32
31/05/2015 19:41:56
Saturday, 18. May
11 - 15 July 2001
2019/4/28 下午6:29:28
1 JuneMay 2000
19 aprile 2008 21:16:37 GMT+02:00
Samstag, 15. Mai 2010 20:55:10
So 23 Jun 2007 23:45 CEST
28 August 1998
30 June 2001
1 Ноябрь 2008 г. 18:46:59
Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time)
May-28-11 6:56:08 PM
Sat Jun 26 2010 21:55:54 GMT+0200 (West-Europa (zomertijd))
lunedì 5 maggio 2008 9.30.33
"ValueError: ('Unknown string format:', '1 JuneMay 2000')"
I realize this may be a cumbersome and undesirable task. Luckily the dates are currently nonessential to my project so they may be left alone, but a solution would be favorable. Any and all replies are appreciated, thank you.
Line by line, lots of your dates work:
>>> pd.to_datetime('19.8.2017, 21:23:32')
Timestamp('2017-08-19 21:23:32')
But there are several issues:
as your format is irregular, pandas cannot guess whether 01-02-2019 is the first of February 2019 or the second of January 2019; I don't know if you can either,
some of your examples cannot be converted into a date at all; Saturday, 18. May: which year?
there are months and days in different languages (aprile looks Italian, Samstag is German),
some of your examples work without the parenthesized content:
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200') # works
Timestamp('2011-06-18 19:46:46-0200', tz='pytz.FixedOffset(-120)')
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ') # doesn't work.
...
ValueError: ('Unknown string format:', 'Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ')
You certainly cannot get all the dates into timestamps. I would create a new column with the correctly parsed dates as timestamps and the others saved as NaT.
For example:
date
02-01-2019
Saturday, 18. May
will become:
date new date
02-01-2019 Timestamp('2019-01-02 00:00:00.00)
Saturday, 18. May NaT
For this I would delete the parenthesized content from the initial column:
df2 = df.assign(
    date2=lambda x: x['date'].str.split('(').str[0],
    new_date=lambda x: x['date2'].apply(
        lambda y: pd.to_datetime(y, errors='coerce'))) # apply the function row by row
# This will work with python >= 3.6
Afterwards, you can see what's remaining by looking at the NaT values.
For translation, you can try to replace words, but it will be really tedious.
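A minimal sketch of that word-replacement idea (the tiny translation table here is an assumption for illustration; a real one would need many more entries, and some formats would still fail and stay NaT):

```python
import pandas as pd

# hypothetical mini-dictionary mapping foreign month/day names to English
translations = {"Samstag": "Saturday", "Mai": "May", "aprile": "April",
                "lunedì": "Monday", "maggio": "May"}

s = pd.Series(["Samstag, 15. Mai 2010 20:55:10",
               "19 aprile 2008 21:16:37"])
for foreign, english in translations.items():
    s = s.str.replace(foreign, english, regex=False)
# anything still unparseable after translation becomes NaT
parsed = pd.to_datetime(s, errors="coerce")
print(parsed)
```

Word-by-word replacement is crude (it can corrupt substrings of longer words), so a real table should anchor on word boundaries.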
This is really slow (due to the row-by-row apply), but if your data are not consistent you cannot work directly on the column.
I hope it will help.
I currently have a 'Date' column serving as my index for a pandas dataframe that is of the form:
January
February
....
Year2
January
February
...
Year3
(It came from a pdf table extractor.) Is there an easy way to separate out the years and months, giving each month its proper year, or to build a proper date-time column to serve as my index?
Right now I am thinking of applying a function that checks whether a value is numeric and, if so, clones it to another column and deletes it, but there should be an easier way.
All values are objects, but the years are in numeric form, whereas the months are long month-name strings.
Thank you very much in advance.
Using ffill with to_numeric:
df['Year'] = pd.to_numeric(df.MixCol, errors='coerce').ffill().astype(int) # year rows parse, month rows become NaN and are filled down
df = df.loc[pd.to_numeric(df.MixCol, errors='coerce').isnull()] # keep only the month rows
df
Out[86]:
MixCol Year
1 January 2017
2 February 2017
4 January 2018
5 February 2018
Data input
MixCol
2017
January
February
2018
January
February
2019
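Building on that, a sketch (the column name MixCol and the sample rows are taken from above) that turns the result into a proper datetime column suitable for an index, assuming the months are full English names:

```python
import pandas as pd

df = pd.DataFrame({"MixCol": ["2017", "January", "February",
                              "2018", "January", "February"]})
# year rows parse to numbers; month rows become NaN and take the year above them
year = pd.to_numeric(df.MixCol, errors="coerce").ffill().astype(int)
# keep only the month rows
months = df.loc[pd.to_numeric(df.MixCol, errors="coerce").isnull(), "MixCol"]
# combine "January 2017" etc. and parse with an explicit format
dates = pd.to_datetime(months + " " + year[months.index].astype(str), format="%B %Y")
print(dates)
```

The resulting Series can be assigned back and used with set_index to get a datetime index.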
I have a dataframe with the names of the months of the year, i.e. Jan, Feb, March, etc., and I want to sort the data first by month, then by category, so it looks like:
Month_Name | Cat
Jan 1
Jan 2
Jan 3
Feb 1
Feb 2
Feb 3
pandas doesn't do custom sort functions for you, but you can easily add a temporary column containing the index of the month and then sort by that:
import datetime
months = {datetime.datetime(2000, i, 1).strftime("%b"): i for i in range(1, 13)}
df["month_number"] = df["month_name"].map(months)
df.sort_values(by=[...])
You may wish to take advantage of pandas' good date parsing when reading in your dataframe, though: if you store the dates as dates instead of string month names then you'll be able to sort natively by them.
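A self-contained sketch of that temporary-column approach (the column names month_name and cat are assumptions matching the question's layout):

```python
import datetime
import pandas as pd

df = pd.DataFrame({"month_name": ["Feb", "Jan", "Feb", "Jan"],
                   "cat": [1, 2, 2, 1]})
# map abbreviated month names to their calendar position
months = {datetime.datetime(2000, i, 1).strftime("%b"): i for i in range(1, 13)}
df["month_number"] = df["month_name"].map(months)
# sort by month first, then category, and drop the helper column
df = df.sort_values(["month_number", "cat"]).drop(columns="month_number")
print(df)
```

Note that strftime("%b") is locale-dependent; under the default C locale it yields Jan, Feb, etc., matching the data here.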
Use the Sort_Dataframeby_MonthandNumeric_cols function to sort a dataframe by month and a numeric column.
You need to install the two packages shown below:
pip install sorted-months-weekdays
pip install sort-dataframeby-monthorweek
Example:
import pandas as pd
from sorted_months_weekdays import *
from sort_dataframeby_monthorweek import *
df = pd.DataFrame([['Jan',23],['Jan',16],['Dec',35],['Apr',79],['Mar',53],['Mar',12],['Feb',3]], columns=['Month','Sum'])
df
Out[11]:
Month Sum
0 Jan 23
1 Jan 16
2 Dec 35
3 Apr 79
4 Mar 53
5 Mar 12
6 Feb 3
To get the dataframe sorted by month and the numeric column, I used the function above:
Sort_Dataframeby_MonthandNumeric_cols(df = df, monthcolumn='Month',numericcolumn='Sum')
Out[12]:
Month Sum
0 Jan 16
1 Jan 23
2 Feb 3
3 Mar 12
4 Mar 53
5 Apr 79
6 Dec 35