Multiple Date Formats to a single date pattern in pandas dataframe - python

I have a pandas date column in the following format
Date
0 March 13, 2020, March 13, 2020
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021
2 NaN
3 May 20, 2022, May 21, 2022
I tried to convert the pattern to a single pattern to store to a new column.
import pandas as pd
import dateutil.parser
# initialise data of lists.
data = {'Date':['March 13, 2020, March 13, 2020', '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021', 'NaN','May 20, 2022, May 21, 2022']}
# Create DataFrame
df = pd.DataFrame(data)
df["FormattedDate"] = df.Date.apply(lambda x: dateutil.parser.parse(x.strftime("%Y-%m-%d") ))
But I am getting an error:
AttributeError: 'str' object has no attribute 'strftime'
Desired Output
Date DateFormatted
0 March 13, 2020, March 13, 2020 2020-03-13, 2020-03-13
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021 2021-03-09, 2021-03-09, 2021-03-09, 2021-09-03
2 NaN NaN
3 May 20, 2022, May 21, 2022 2022-05-20, 2022-05-21

I was the author of the previous solution. Because , is used both as the separator and inside the date strings, one possible solution is to adapt it: extract every date with Series.str.extractall, convert the matches to datetimes, and finally aggregate them with join:
format_list = ["[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{2,4}",
"[0-9]{1,2}(?:\.)(?:\s)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\s)?[0-9]{2,4}",
"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}",
"[0-9]{1,2}(?:\.)?(?:\s)?(?:\n)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
import numpy as np
import pandas as pd
# initialise data of lists.
data = {'Name':['Today is 09 September 2021', np.nan, '25 December 2021 is christmas', '01/01/2022 is newyear and will be holiday on 02.01.2022 also']}
# Create DataFrame
df = pd.DataFrame(data)
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['DateFormatted'] = df['Name'].str.extractall(f'({"|".join(format_list)})')[0].apply(f).groupby(level=0).agg(','.join)
print (df)
Name DateFormatted
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
Another alternative is to collect the matches with Series.str.findall, drop the missing values, and process each list in a generator expression with join:
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['Date'] = df['Name'].str.findall("|".join(format_list)).dropna().apply(lambda y: ','.join(f(x) for x in y))
print (df)
Name Date
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
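For the comma-separated values in the question itself, the same extract-then-parse idea can be sketched with a much shorter pattern. This is only a sketch under stated assumptions: DATE_PATTERN and normalise_dates are names made up for this example, the pattern covers just the three layouts that appear in the sample data, and dateutil is left at its default month-first reading of 3.9.2021, which is what the desired output shows.

```python
import re

import numpy as np
import pandas as pd
from dateutil import parser

# Hypothetical, simplified pattern: it only covers the layouts in the sample data.
DATE_PATTERN = re.compile(
    r"\d{1,2}\.\d{1,2}\.\d{4}"          # 3.9.2021 or 03.09.2021
    r"|\d{1,2}\.?\s+[A-Za-z]+\s+\d{4}"  # 3. September 2021
    r"|[A-Za-z]+\s+\d{1,2},\s*\d{4}"    # March 13, 2020
)


def normalise_dates(text):
    """Return every date found in text as 'YYYY-MM-DD', joined by ', '."""
    if not isinstance(text, str):
        return text  # leave missing values (NaN) untouched
    matches = DATE_PATTERN.findall(text)
    if not matches:
        return np.nan  # a string without any recognisable date
    # Parse the extracted strings, not the whole cell, and only then format them.
    return ", ".join(parser.parse(m).strftime("%Y-%m-%d") for m in matches)


df = pd.DataFrame({"Date": [
    "March 13, 2020, March 13, 2020",
    "3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021",
    np.nan,
    "May 20, 2022, May 21, 2022",
]})
df["DateFormatted"] = df["Date"].apply(normalise_dates)
print(df)
```

The AttributeError in the question came from calling strftime on the raw string before parsing it; here each extracted string is parsed first and formatted afterwards.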

Related

Python: Order Dates that are in the format: %B %Y

I have a df with dates in the format %B %Y (e.g. June 2021, December 2022 etc.)
Date      Price
Apr 2022  2
Dec 2021  8
I am trying to sort dates in order of oldest to newest but when I try:
.sort_values(by='Date', ascending=False)
it is ordering in alphabetical order.
The 'Date' column is an Object.
ascending=False will sort from newest to oldest, but you are asking to sort oldest to newest, so you don't need that option;
there is a key option to specify how to parse the values before sorting them;
you may or may not want option ignore_index=True, which I included below.
We can use the key option to parse the values into datetime objects with pandas.to_datetime.
import pandas as pd
df = pd.DataFrame({'Date': ['Apr 2022', 'Dec 2021', 'May 2022', 'May 2021'], 'Price': [2, 8, 12, 15]})
df = df.sort_values(by='Date', ignore_index=True, key=pd.to_datetime)
print(df)
# Date Price
# 0 May 2021 15
# 1 Dec 2021 8
# 2 Apr 2022 2
# 3 May 2022 12
Relevant documentation:
DataFrame.sort_values;
to_datetime.
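If the column always follows one layout, passing an explicit format to the key function avoids relying on format inference. A minimal sketch, assuming abbreviated English month names as in the sample data (%b is locale-dependent); parse_month_year and oldest_first are names invented for this example:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["Apr 2022", "Dec 2021", "May 2022", "May 2021"],
                   "Price": [2, 8, 12, 15]})


def parse_month_year(column):
    # Used only as the sort key; the 'Date' column itself stays as text.
    return pd.to_datetime(column, format="%b %Y")


oldest_first = df.sort_values(by="Date", key=parse_month_year, ignore_index=True)
print(oldest_first)
```

The key function receives the whole column as a Series, so the parsing is done once per sort rather than once per comparison.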

Python dataframe with date (dates and events)

Using PYTHON, I am trying to sort the values in the table based on dates and the associated events. Date1 and event1 are a pair because event1 occurred on date1. The same for other dates and events.
date1 is in a column of its own, and event1 is also in its own separate column.
For example, the table would include as below:
date1           event1   date2           event2   date3           event3
March 6, 2021   eventC   Jan. 1, 2020    eventX   May 11, 2020    eventB
Dec. 6, 2021    eventBB  Feb. 11, 2001   eventYY  June 13, 1990   eventSS
March 16, 2019  eventCD  Jan. 1, 1998    eventRE  May 23, 1989    eventDF
Nov. 1, 2008    eventCC  April 14, 2001  eventWQ  March 17, 1999  eventCV
I would like the result to show as follows where for each row, sort the data from oldest date to the newest, but the order of the events follow their respective dates (e.g. eventC that occurred on March 6, 2021 as the first in row 0 is now reordered to be the third in row 0).
date1           event1   date2           event2   date3           event3
Jan. 1, 2020    eventX   May 11, 2020    eventB   March 6, 2021   eventC
June 13, 1990   eventSS  Feb. 11, 2001   eventYY  Dec. 6, 2021    eventBB
May 23, 1989    eventDF  Jan. 1, 1998    eventRE  March 16, 2019  eventCD
March 17, 1999  eventCV  Nov. 1, 2008    eventCC  April 14, 2001  eventWQ
I would like to keep the output in the table format as above! (unless there is a good reason not to...) =)
This is actually a very small version of a much larger data. Any help would be appreciated.
Thanks to all in advance!!!
Cast the 'Dates' values to datetime and use the sort_values() function. Example:
import pandas as pd
df = pd.DataFrame([{"Date":"02-01-1945","Name":"event2"},{"Date":"20-11-1934", "Name":"event1"}])
print(df)
#Output:
#
# Date Name
#0 02-01-1945 event2
#1 20-11-1934 event1
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")  # the sample dates are day-first
df = df.sort_values(by="Date")
print(df)
#Output:
# Date Name
#1 1934-11-20 event1
#0 1945-01-02 event2
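The example above sorts whole rows by a single date column. For the wide layout in the question, where each row holds several date/event pairs, one possible approach is to sort the pairs inside each row. The sketch below assumes the columns are named date1, event1, date2, event2 and so on, uses only the first two rows of the sample data, and parses each date with dateutil; sort_pairs and N_PAIRS are names invented for this example.

```python
import pandas as pd
from dateutil import parser

df = pd.DataFrame({
    "date1": ["March 6, 2021", "Dec. 6, 2021"],
    "event1": ["eventC", "eventBB"],
    "date2": ["Jan. 1, 2020", "Feb. 11, 2001"],
    "event2": ["eventX", "eventYY"],
    "date3": ["May 11, 2020", "June 13, 1990"],
    "event3": ["eventB", "eventSS"],
})

N_PAIRS = 3  # assumed number of date/event pairs per row
COLUMNS = [name for i in range(1, N_PAIRS + 1) for name in (f"date{i}", f"event{i}")]


def sort_pairs(row):
    """Reorder the (date, event) pairs of one row from oldest to newest."""
    pairs = [(row[f"date{i}"], row[f"event{i}"]) for i in range(1, N_PAIRS + 1)]
    # Sort by the parsed date; each event travels with its own date.
    pairs.sort(key=lambda pair: parser.parse(pair[0]))
    flat = [value for pair in pairs for value in pair]
    return pd.Series(flat, index=COLUMNS)


result = df.apply(sort_pairs, axis=1)
print(result)
```

The original date strings are kept as they are; only their order changes. Row-wise apply is slow on very large frames, so for much larger data a reshape to long format followed by a grouped sort may be worth considering.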

convert month_year value to month name and year columns in python

I've a sample dataframe
year_month
202004
202005
202011
202012
How can I append a month_name + year column to the dataframe?
year_month month_name
202004 April 2020
202005 May 2020
202011 Nov 2020
202112 Dec 2021
You can use datetime.strptime to convert your string into a datetime object, then you can use datetime.strftime to convert it back into a string with different format.
>>> import datetime as dt
>>> import pandas as pd
>>> df = pd.DataFrame(['202004', '202005', '202011', '202012'], columns=['year_month'])
>>> df['month_name'] = df['year_month'].apply(lambda x: dt.datetime.strptime(x, '%Y%m').strftime('%b %Y'))
>>> df
year_month month_name
0 202004 Apr 2020
1 202005 May 2020
2 202011 Nov 2020
3 202012 Dec 2020
The full list of format codes is in the Python documentation for strftime() and strptime().
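A vectorised variant of the same idea, sketched with pandas.to_datetime and the .dt.strftime accessor instead of a per-row lambda. It assumes the column may hold integers (hence astype(str)) and that month abbreviations come from an English locale; parsed is a name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"year_month": [202004, 202005, 202011, 202012]})

# Parse YYYYMM into timestamps (first day of each month), then format as text.
parsed = pd.to_datetime(df["year_month"].astype(str), format="%Y%m")
df["month_name"] = parsed.dt.strftime("%b %Y")
print(df)
```

Using "%B %Y" instead would give the full month name, for example April 2020.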

Sort pandas dataframe rows according to a string value column

I have the following dataframe :
month price
0 April 102.478015
1 August 94.868053
2 December 97.278205
3 February 100.114510
4 January 99.419109
5 July 93.402928
6 June 96.114224
7 March 101.297762
8 May 102.905340
9 November 97.952169
10 October 95.606478
11 September 94.226803
I would like to have the months in calendar order (January in the first row through December in the 12th row). How could I do this?
If necessary, you can copy this dataframe and then execute
pd.read_clipboard(sep=r'\s\s+')
to have the dataframe on your jupyter notebook
Convert the values to an ordered categorical, so that DataFrame.sort_values sorts them in calendar order:
cats = ['January','February','March','April','May','June',
'July','August','September','October','November','December']
df['month'] = pd.CategoricalIndex(df['month'], ordered=True, categories=cats)
#alternative
#df['month'] = pd.Categorical(df['month'], ordered=True, categories=cats)
df = df.sort_values('month')
print (df)
month price
4 January 99.419109
3 February 100.114510
7 March 101.297762
0 April 102.478015
8 May 102.905340
6 June 96.114224
5 July 93.402928
1 August 94.868053
11 September 94.226803
10 October 95.606478
9 November 97.952169
2 December 97.278205
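If the column should stay a plain string column, a sort key that maps each month name to its number is another option. A minimal sketch, assuming English month names that match calendar.month_name in the default locale; month_number and df_sorted are names chosen for this example, and only a few rows of the sample data are used.

```python
import calendar

import pandas as pd

df = pd.DataFrame({
    "month": ["April", "August", "December", "February", "January"],
    "price": [102.478015, 94.868053, 97.278205, 100.114510, 99.419109],
})

# calendar.month_name[0] is an empty string, so it is skipped here.
month_number = {name: number for number, name in enumerate(calendar.month_name) if name}

# The key maps the names to 1..12 only for sorting; the column is left unchanged.
df_sorted = df.sort_values("month", key=lambda column: column.map(month_number))
print(df_sorted)
```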

Python Keep row if YYYY present else remove the row

I have a dataframe with a Date column, and I want to remove the rows whose Date value does not contain a year in YYYY format (e.g. 2018; it can be any year).
I tried the apply method with a regular expression, but it doesn't work:
df[df.Date.apply(lambda x: re.findall(r'[0-9]{4}', x))]
The Date column can have values such as,
12/3/2018
March 12, 2018
stackoverflow
Mar 12, 2018
no date text
3/12/2018
So here output should be
12/3/2018
March 12, 2018
Mar 12, 2018
3/12/2018
One approach is to use pd.to_datetime with errors="coerce", which turns values that cannot be parsed into NaT:
Example:
import pandas as pd
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
# Note: on pandas >= 2.0, add format="mixed" so that each value is parsed on its own.
df["Col1"] = pd.to_datetime(df["Col1"], errors="coerce")
df = df[df["Col1"].notnull()]
print(df)
Output:
Col1
0 2018-12-03
1 2018-03-12
3 2018-03-12
5 2018-03-12
Or, if you want to keep the original strings:
import pandas as pd
def validateDate(d):
    # Return the original value if it parses as a date, otherwise None.
    try:
        pd.to_datetime(d)
        return d
    except (ValueError, TypeError):
        return None
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = df["Col1"].apply(validateDate)
df.dropna(inplace=True)
print(df)
Output:
Col1
0 12/3/2018
1 March 12, 2018
3 Mar 12, 2018
5 3/12/2018
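The attempt in the question fails because re.findall returns a list for every row, while boolean indexing needs one True or False per row. A minimal sketch of the regex route using Series.str.contains, assuming "has a year" simply means "contains a standalone four-digit number"; has_year and filtered are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["12/3/2018", "March 12, 2018", "stackoverflow",
                            "Mar 12, 2018", "no date text", "3/12/2018"]})

# True where the text contains a standalone 4-digit number; missing values count as False.
has_year = df["Date"].str.contains(r"\b\d{4}\b", na=False)
filtered = df[has_year]
print(filtered)
```

Unlike the to_datetime approaches, this does not check that the value is a valid date; it only checks for the four digits.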
