Extracting dates using regex following several formats


(?:\d{1,2}[\-\/])?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.)\s,]*)+(?:\d{2,4})(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.),]*)
I am trying to extract dates from text in the following formats:
January 1, 2020
January 01, 2020
JANUARY 1, 2020
JANUARY 01, 2020
Jan. 1, 2020
Jan. 01, 2020
JAN. 1, 2020
JAN. 01, 2020
2020 January 1
2020 January 01
2020 Jan. 1
2020 Jan. 01
2020 JAN. 1
2020 JAN. 01
01/01/2020
2020/01/01
01.01.2020
2020.01.01
01-01-2020
2020-01-01
Here's a sample. The problem is that the pattern fails to extract dates in these formats: 2020 JAN. 1, 2020 JAN. 01, 2020 Jan. 01, 2020-01-01.

You can use
pattern = r"""(?ix)
\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
(?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) # 2000 Jan 01
|
(?<!\d)
(?:
(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
|
(?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
)
(?!\d)"""
The inline i flag enables case-insensitive matching, and x enables VERBOSE mode, in which whitespace inside the pattern is ignored and # starts a comment.
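As a quick check, the pattern can be run against a short string with re.findall. This is only a sketch: the pattern is copied from the answer, while the name DATE_RE and the sample text are made up for illustration.

```python
import re

# Pattern copied from the answer; (?ix) = ignore case + verbose mode.
DATE_RE = re.compile(r"""(?ix)
\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
(?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) # 2000 Jan 01
|
(?<!\d)
(?:
(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
|
(?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
)
(?!\d)""")

# Hypothetical sample text, made up for this illustration.
text = "Signed January 1, 2020; filed 2020 JAN. 01; revised 2020-01-01 and 01/01/2020."

# The pattern has no capturing groups, so findall returns whole matches.
matches = DATE_RE.findall(text)
print(matches)
# ['January 1, 2020', '2020 JAN. 01', '2020-01-01', '01/01/2020']
```

One caveat worth checking on your own data: the year-first branch ends right after the day, so a two-digit day such as 15 would be matched as 1. Appending (?!\d) after the day group in that branch is one possible fix.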

Related

Sorting a Data Frame by Month with Repeating Years, based on Unique 'Other' Column

In pandas, I am trying to sort rows of a large data frame by months. At the moment, the months are out of order. They are sorted alphabetically, but I would like to sort them chronologically.
The tricky part is that I am sorting by a cycle of 21 months for every one product. There are two year columns, one for calendar year and one for fiscal year, and they differ on purpose. Fiscal Year 2021 is January 2021 - September 2021, and Fiscal Year 2022 is October 2021 - September 2022. There are hundreds of products, and the section below is just a sample of two products.
As seen in the table below, the months are out of order, but everything else is in the right order.
Again, every product has 21 months, from January 2021 to September 2022. I want the months to run in chronological order for every product.
I am looking for code that sorts this data frame correctly.
How it looks now (months not chronological by year):
Item       Calendar Year  Fiscal Year  Month      Amount
Product 1  2021           2021         April      45
Product 1  2021           2021         August     85
Product 1  2021           2021         February   25
Product 1  2021           2021         January    15
Product 1  2021           2021         July       75
Product 1  2021           2021         June       65
Product 1  2021           2021         March      35
Product 1  2021           2021         May        55
Product 1  2021           2021         September  95
Product 1  2021           2022         December   125
Product 1  2021           2022         November   115
Product 1  2021           2022         October    105
Product 1  2022           2022         April      405
Product 1  2022           2022         August     805
Product 1  2022           2022         February   205
Product 1  2022           2022         January    1005
Product 1  2022           2022         July       705
Product 1  2022           2022         June       605
Product 1  2022           2022         March      305
Product 1  2022           2022         May        505
Product 1  2022           2022         September  905
Product 2  2021           2021         April      4000
Product 2  2021           2021         August     8000
Product 2  2021           2021         February   2000
Product 2  2021           2021         January    1000
Product 2  2021           2021         July       7000
Product 2  2021           2021         June       6000
Product 2  2021           2021         March      3000
Product 2  2021           2021         May        5000
Product 2  2021           2021         September  9000
Product 2  2021           2022         December   12000
Product 2  2021           2022         November   11000
Product 2  2021           2022         October    10000
Product 2  2022           2022         April      40000
Product 2  2022           2022         August     80000
Product 2  2022           2022         February   20000
Product 2  2022           2022         January    10000
Product 2  2022           2022         July       70000
Product 2  2022           2022         June       60000
Product 2  2022           2022         March      30000
Product 2  2022           2022         May        50000
Product 2  2022           2022         September  90000
How it should look (months in order):
Item       Calendar Year  Fiscal Year  Month      Amount
Product 1  2021           2021         January    15
Product 1  2021           2021         February   25
Product 1  2021           2021         March      35
Product 1  2021           2021         April      45
Product 1  2021           2021         May        55
Product 1  2021           2021         June       65
Product 1  2021           2021         July       75
Product 1  2021           2021         August     85
Product 1  2021           2021         September  95
Product 1  2021           2022         October    105
Product 1  2021           2022         November   115
Product 1  2021           2022         December   125
Product 1  2022           2022         January    1005
Product 1  2022           2022         February   205
Product 1  2022           2022         March      305
Product 1  2022           2022         April      405
Product 1  2022           2022         May        505
Product 1  2022           2022         June       605
Product 1  2022           2022         July       705
Product 1  2022           2022         August     805
Product 1  2022           2022         September  905
Product 2  2021           2021         January    1000
Product 2  2021           2021         February   2000
Product 2  2021           2021         March      3000
Product 2  2021           2021         April      4000
Product 2  2021           2021         May        5000
Product 2  2021           2021         June       6000
Product 2  2021           2021         July       7000
Product 2  2021           2021         August     8000
Product 2  2021           2021         September  9000
Product 2  2021           2022         October    10000
Product 2  2021           2022         November   11000
Product 2  2021           2022         December   12000
Product 2  2022           2022         January    10000
Product 2  2022           2022         February   20000
Product 2  2022           2022         March      30000
Product 2  2022           2022         April      40000
Product 2  2022           2022         May        50000
Product 2  2022           2022         June       60000
Product 2  2022           2022         July       70000
Product 2  2022           2022         August     80000
Product 2  2022           2022         September  90000

First convert the Month values to an ordered categorical, so that DataFrame.sort_values can sort chronologically across multiple columns:
cat = ['January','February','March','April','May','June',
       'July','August','September','October','November','December']
df['Month'] = pd.Categorical(df['Month'], ordered=True, categories=cat)
df = df.sort_values(['Item','Calendar Year','Month'])
Or create a DatetimeIndex, so that you can sort by Item and then by datetime (astype(str) is needed if Calendar Year is stored as integers):
df.index = pd.to_datetime(df['Calendar Year'].astype(str) + df['Month'].astype(str), format='%Y%B')
df = df.rename_axis('dt').sort_values(['Item','dt']).reset_index(drop=True)
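The ordered-categorical approach can be tried on a few rows before running it on the full data. The sketch below uses four Product 1 rows taken from the question's table; the variable names months and out are made up for this illustration.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Four rows from the question's table, deliberately out of order.
df = pd.DataFrame({
    'Item': ['Product 1'] * 4,
    'Calendar Year': [2021, 2021, 2021, 2022],
    'Month': ['October', 'April', 'January', 'February'],
    'Amount': [105, 45, 15, 205],
})

# An ordered categorical sorts by position in `months`, not alphabetically.
df['Month'] = pd.Categorical(df['Month'], categories=months, ordered=True)
out = df.sort_values(['Item', 'Calendar Year', 'Month']).reset_index(drop=True)
print(out)
```

Sorting on Calendar Year before Month is what keeps January 2022 after October 2021.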

How to groupby and create a multiindex dataframe

I have a dataframe which looks like this:
0 1 2
0 April 0.002745 ADANIPORTS.NS
1 July 0.005239 ASIANPAINT.NS
2 April 0.003347 AXISBANK.NS
3 April 0.004469 BAJAJ-AUTO.NS
4 June 0.006045 BAJFINANCE.NS
5 June 0.005176 BAJAJFINSV.NS
6 April 0.003321 BHARTIARTL.NS
7 November 0.003469 INFRATEL.NS
8 April 0.002667 BPCL.NS
9 April 0.003864 BRITANNIA.NS
10 April 0.005570 CIPLA.NS
11 October 0.000925 COALINDIA.NS
12 April 0.003666 DRREDDY.NS
13 April 0.002836 EICHERMOT.NS
14 April 0.003793 GAIL.NS
15 April 0.003850 GRASIM.NS
16 April 0.002858 HCLTECH.NS
17 December 0.005666 HDFC.NS
18 April 0.003484 HDFCBANK.NS
19 April 0.004173 HEROMOTOCO.NS
20 April 0.006395 HINDALCO.NS
21 June 0.001844 HINDUNILVR.NS
22 October 0.004620 ICICIBANK.NS
23 April 0.004020 INDUSINDBK.NS
24 January 0.002496 INFY.NS
25 September 0.001835 IOC.NS
26 May 0.002290 ITC.NS
27 April 0.005910 JSWSTEEL.NS
28 April 0.003570 KOTAKBANK.NS
29 May 0.003346 LT.NS
30 April 0.006131 M&M.NS
31 April 0.003912 MARUTI.NS
32 March 0.003596 NESTLEIND.NS
33 April 0.002180 NTPC.NS
34 April 0.003209 ONGC.NS
35 June 0.001796 POWERGRID.NS
36 April 0.004182 RELIANCE.NS
37 April 0.004246 SHREECEM.NS
38 October 0.004836 SBIN.NS
39 April 0.002596 SUNPHARMA.NS
40 April 0.004235 TCS.NS
41 April 0.006729 TATAMOTORS.NS
42 October 0.003395 TATASTEEL.NS
43 August 0.002440 TECHM.NS
44 June 0.003481 TITAN.NS
45 April 0.003749 ULTRACEMCO.NS
46 April 0.005854 UPL.NS
47 April 0.004991 VEDL.NS
48 July 0.001627 WIPRO.NS
49 April 0.003728 ZEEL.NS
How can I create a multiindex dataframe that is grouped by column 0? When I do:
new.groupby([0])
Out[315]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A938BB0>
I am not able to group all the months together.

Based on your info, I'd suggest the following:
# rename the columns to something meaningful
new = new.rename(columns={0:'Month',1:'Price', 2:'Ticker'})
new.groupby(['Month','Ticker'])['Price'].sum()
Note that you should convert 'Month' to a datetime (or an ordered categorical), or else the months will be sorted alphabetically rather than chronologically.
Also, the pandas documentation on groupby is quite thorough.
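To see what the grouped result looks like, here is a minimal sketch using the first three rows of the question's data. The column names Month, Price and Ticker follow the renaming in the answer and are not part of the original frame.

```python
import pandas as pd

# First three rows of the question's frame, with integer column labels.
new = pd.DataFrame({
    0: ['April', 'July', 'April'],
    1: [0.002745, 0.005239, 0.003347],
    2: ['ADANIPORTS.NS', 'ASIANPAINT.NS', 'AXISBANK.NS'],
})
new = new.rename(columns={0: 'Month', 1: 'Price', 2: 'Ticker'})

# Grouping by two keys yields a Series indexed by a (Month, Ticker) MultiIndex.
grouped = new.groupby(['Month', 'Ticker'])['Price'].sum()
print(grouped)

# Selecting on the outer level returns every ticker for that month.
print(grouped.loc['April'])
```

A bare new.groupby([0]) only returns a lazy GroupBy object; nothing is computed until an aggregation such as sum() is applied.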

How to sort pandas dataframe by two date columns

I have a pandas dataframe like this:
column_year column_Month a_integer_column
0 2014 April 25.326531
1 2014 August 25.544554
2 2015 December 25.678261
3 2014 February 24.801187
4 2014 July 24.990338
... ... ... ...
68 2018 November 26.024931
69 2017 October 25.677333
70 2019 September 24.432361
71 2020 February 25.383648
72 2020 January 25.504831
I now want to sort by the year column first and then by the month column, as shown below:
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
... ... ... ...
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
How do I do this?

Let us try to_datetime + argsort:
df=df.iloc[pd.to_datetime(df.column_year.astype(str)+df.column_Month,format='%Y%B').argsort()]
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
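The same idea, spelled out step by step on the first four rows of the question. This is a sketch: the names keys, order and out are made up for illustration, and the argsort is done on the underlying NumPy array so that it returns plain positions.

```python
import pandas as pd

df = pd.DataFrame({
    'column_year': [2014, 2014, 2015, 2014],
    'column_Month': ['April', 'August', 'December', 'February'],
    'a_integer_column': [25.326531, 25.544554, 25.678261, 24.801187],
})

# Build one datetime key per row, e.g. "2014April" -> 2014-04-01.
keys = pd.to_datetime(df['column_year'].astype(str) + df['column_Month'],
                      format='%Y%B')

# argsort gives the row positions in chronological order.
order = keys.to_numpy().argsort()
out = df.iloc[order]
print(out)
```

The original index labels (3, 0, 1, 2) are kept, which matches the desired output in the question.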

You can convert the column_Month column to an ordered CategoricalDtype:
Months = pd.CategoricalDtype([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
], ordered=True)
df.astype({'column_Month': Months}).sort_values(['column_year', 'column_Month'])
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648

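If column_Month is already an ordered categorical, a plain sort_values is enough (on plain strings this would sort the months alphabetically):
df = df.sort_values(by=["column_year", "column_Month"], ascending=[True, True])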

How to generate even calendar dates?

Does anyone know how to generate a calendar list in Python (or on some other platform) containing only the even days, with month and year, from 2018 until 2021?
Example:
Sun, 02 Jan 2019
Tue, 04 Jan 2019
Thur, 06 Jan 2019
Sat, 08 Jan 2019
Sun, 10 Jan 2019
Tue, 12 Jan 2019
Thur, 14 Jan 2019
Sat, 16 Jan 2019
Sun, 18 Jan 2019
Tue, 20 Jan 2019
Thur, 22 Jan 2019
and so on, respecting the calendar until 2021.
EDIT:
How can I generate, in Python, a calendar list between 2018 and 2022 in two formats:
Day of the week, Date Month Year Time (hours:minutes:seconds) - Year-Month-Date Time (hours:minutes:seconds)
Note:
Dates: even dates only
Times: randomly generated
Example:
Tue, 02 Jan 2018 00:59:23 - 2018-01-02 00:59:23
Thu, 04 Jan 2018 10:24:52 - 2018-01-04 10:24:52
Sat, 06 Jan 2018 04:11:09 - 2018-01-06 04:11:09
Mon, 08 Jan 2018 16:12:40 - 2018-01-08 16:12:40
Wed, 10 Jan 2018 10:08:15 - 2018-01-10 10:08:15
Fri, 12 Jan 2018 07:10:09 - 2018-01-12 07:10:09
Sun, 14 Jan 2018 11:50:10 - 2018-01-14 11:50:10
Tue, 16 Jan 2018 02:29:22 - 2018-01-16 02:29:22
Thu, 18 Jan 2018 19:07:20 - 2018-01-18 19:07:20
Sat, 20 Jan 2018 08:50:13 - 2018-01-20 08:50:13
Mon, 22 Jan 2018 02:40:02 - 2018-01-22 02:40:02
and so on, until the year 2022 ...

Here's something fairly simple that seems to work and handles leap years:
from calendar import isleap
from datetime import date
# Days in each month (1-12).
MDAYS = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
def dim(year, month):
    """ Number of days in month of the given year. """
    return MDAYS[month] + ((month == 2) and isleap(year))
start_year, end_year = 2018, 2021
for year in range(start_year, end_year+1):
    for month in range(1, 12+1):
        days = dim(year, month)
        for day in range(1, days+1):
            if day % 2 == 0:
                dt = date(year, month, day)
                print(dt.strftime('%a, %d %b %Y'))
Output:
Tue, 02 Jan 2018
Thu, 04 Jan 2018
Sat, 06 Jan 2018
Mon, 08 Jan 2018
Wed, 10 Jan 2018
Fri, 12 Jan 2018
Sun, 14 Jan 2018
Tue, 16 Jan 2018
...
Edit:
Here's a way to do what (I think) you asked how to do in your follow-on question:
from calendar import isleap
from datetime import date, datetime, time
from random import randrange
# Days in each month (1-12).
MDAYS = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
def dim(year, month):
    """ Number of days in month of the given year. """
    return MDAYS[month] + ((month == 2) and isleap(year))
def whenever():
    """ Gets the time value. """
    # Currently just returns a randomly selected time of day.
    return time(*map(randrange, (24, 60, 60)))  # hour:minute:second
start_year, end_year = 2018, 2021
for year in range(start_year, end_year+1):
    for month in range(1, 12+1):
        days = dim(year, month)
        for day in range(1, days+1):
            if day % 2 == 0:
                dt, when = date(year, month, day), whenever()
                dttm = datetime.combine(dt, when)
                print(dt.strftime('%a, %d %b %Y'), when, '-', dttm)
Output:
Tue, 02 Jan 2018 00:54:02 - 2018-01-02 00:54:02
Thu, 04 Jan 2018 10:19:51 - 2018-01-04 10:19:51
Sat, 06 Jan 2018 22:48:09 - 2018-01-06 22:48:09
Mon, 08 Jan 2018 06:48:46 - 2018-01-08 06:48:46
Wed, 10 Jan 2018 14:01:54 - 2018-01-10 14:01:54
Fri, 12 Jan 2018 05:42:43 - 2018-01-12 05:42:43
Sun, 14 Jan 2018 21:42:37 - 2018-01-14 21:42:37
Tue, 16 Jan 2018 08:08:39 - 2018-01-16 08:08:39
...
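As a variation on the answer, the hand-written MDAYS table can be replaced by the standard library's calendar.monthrange, which returns the number of days in a month and already accounts for leap years. This is a sketch of that swap, not the answerer's code; the generator name even_dates is made up for illustration.

```python
import calendar
from datetime import date

def even_dates(start_year, end_year):
    """Yield every even-numbered day of each month, inclusive of both years."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the 1st, days in the month).
            days_in_month = calendar.monthrange(year, month)[1]
            # Step by 2 from day 2 instead of testing day % 2.
            for day in range(2, days_in_month + 1, 2):
                yield date(year, month, day)

dates = list(even_dates(2018, 2021))
print(dates[0].strftime('%a, %d %b %Y'))   # first even date in range
print(dates[-1].strftime('%a, %d %b %Y'))  # last even date in range
```

Returning date objects rather than printing keeps the formatting separate, so the same list can feed both output formats asked for in the edit.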

What about:
import datetime
d = datetime.date.today()  # Define start date
while d.year <= 2021:  # This will go *through* 2021
    if d.day % 2 == 0:  # Print if even date
        print(d.strftime('%a, %d %b %Y'))
    d += datetime.timedelta(days=1)  # Jump forward a day
Wed, 31 Oct 2018
Fri, 02 Nov 2018
Sun, 04 Nov 2018
Tue, 06 Nov 2018
Thu, 08 Nov 2018
Sat, 10 Nov 2018
Mon, 12 Nov 2018
Wed, 14 Nov 2018
Fri, 16 Nov 2018
Sun, 18 Nov 2018
Tue, 20 Nov 2018
Thu, 22 Nov 2018
...
Fri, 24 Dec 2021
Sun, 26 Dec 2021
Tue, 28 Dec 2021
Thu, 30 Dec 2021

Change date format in pandas dataframe

I have this dataframe:
date value
1 Thu 17th Nov 2016 385.943800
2 Fri 18th Nov 2016 1074.160340
3 Sat 19th Nov 2016 2980.857860
4 Sun 20th Nov 2016 1919.723960
5 Mon 21st Nov 2016 884.279340
6 Tue 22nd Nov 2016 869.071070
7 Wed 23rd Nov 2016 760.289260
8 Thu 24th Nov 2016 2481.689270
9 Fri 25th Nov 2016 2745.990070
10 Sat 26th Nov 2016 2273.413250
11 Sun 27th Nov 2016 2630.414900
12 Mon 28th Nov 2016 817.322310
13 Tue 29th Nov 2016 1766.876030
14 Wed 30th Nov 2016 469.388420
I would like to change the format of the date column to this format YYYY-MM-DD. The dataframe consists of more than 200 rows, and every day new rows will be added, so I need to find a way to do this automatically.
This link is not helping because it sets the dates by hand, like this: dates = ['30th November 2009', '31st March 2010', '30th September 2010'], and I can't do that for every row. Does anyone know a way to solve this?

dateutil will do this job:
from dateutil import parser
print(df)
df2 = df.copy()
# parse each string into a datetime; pandas then displays it as YYYY-MM-DD
df2.date = df2.date.apply(lambda x: parser.parse(x))
df2
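If you would rather avoid the extra dependency, one alternative (not the answer's method) is to strip the ordinal suffix with a regular expression and parse the rest with the standard library. This is a sketch that assumes every value follows the "Thu 17th Nov 2016" layout shown in the question; the helper name to_iso is made up for illustration.

```python
import re
from datetime import datetime

def to_iso(text):
    """Convert e.g. 'Thu 17th Nov 2016' to '2016-11-17'."""
    # Drop st/nd/rd/th only when it directly follows a digit.
    cleaned = re.sub(r'(?<=\d)(?:st|nd|rd|th)\b', '', text)
    return datetime.strptime(cleaned, '%a %d %b %Y').strftime('%Y-%m-%d')

for sample in ['Thu 17th Nov 2016', 'Mon 21st Nov 2016',
               'Tue 22nd Nov 2016', 'Wed 23rd Nov 2016']:
    print(sample, '->', to_iso(sample))
```

On a dataframe like the one in the question, this could be applied with df['date'].apply(to_iso), which gives strings in YYYY-MM-DD form for existing and newly added rows alike.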
