Mixed characters and digits date regex

Mixed characters and digits date regex - python

I need to find a Python regular expression in order to match every valid date in a raw text file. I split the text in lines and put them in a Pandas Series, the goal now, is to extract only the date in every line getting a series of dates. I was able to match most of the numerical date formats, but I stopped when I had to deal with literal months (Jan, January, Feb, February,...). In particular, I need a regex (or a set of them) which match the following formats:
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
Any help will be appreciated,
thank you in advance!

In line with the comment I made, suggest using split and strip to generate a list of possible dates from your output string and then feed it to dateutils.parser.parse() to turn into a proper datetime object which you can manipulate to your liking.
Possible implementation below:
test = '''- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010'''
list_of_dates = []
for line in test.split('\n'):
for date in line.split(';'):
list_of_dates.append(date.strip(' - '))
from dateutil.parser import parse
def is_date(string):
try:
parse(string)
return True
except ValueError:
return False
found_dates = []
for date in list_of_dates:
if is_date(date):
found_dates.append(parse(date))
for date in found_dates:
print(date)
Result:
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-21 00:00:00
2009-03-22 00:00:00
2009-02-04 00:00:00
2009-09-04 00:00:00
2010-10-04 00:00:00

Related

How to remove unwanted data from a data column using pandas DataFrame

I'm getting date two times using comma separation along with day in date column from the scraped data. My goal is to remove this December 13, 2021Mon, portion and want to create a separate/new column for days and I also wanted to remove the last one column meaning the Volumn column.
Script
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----

I added the necessary steps to your code:
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
# get the Day column
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...

I would use regex here to split. Then you can combine them and parse anyway you like afterwards:
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})',expand=True)[[0,1]]
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
Output:
print(df)
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----

Why doesn't this Regex match any of the dates?

I'm trying to match dates in a dataframe with 500 entries using regex:
The dates can appear in the following formats:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
dates[dates[0].str.contains(r'(?P<year>\d?\d?\d\d)')].shape
returns a tuple of shape(500,1)
but
dates[dates[0].str.contains(r'((?P\<day\>(\d?\d)?(\s|-|/|th|st|nd)?)??P\<year\>(\d?\d?\d\d))')].shape
returns a tuple of shape(0,1), but the day group is optional, so shouldnt it still match the year group.

Ok I got it.
The correct regex pattern is:
r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))'
The bracket for the year group was at the wrong position.

Regex for extracting all complex dates formats from a string

I have following string:
"04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010; 31/May/2019; 01/October/2019; 1st April"
With Current regex I am able to find all dates format accept two which are 31/May/2019 and 01/October/2019
Current regex which I am using:
(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+
Can anyone help in making a regex for extracting all dates mentioned above. I want to solve this using regex only.

try
dates = """04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010; 31/May/2019; 01/October/2019"""
pd.to_datetime(pd.Series(dates.split(';')))
0 2009-04-20
1 2009-04-20
2 2009-04-20
3 2009-04-03
4 2009-03-20
5 2009-03-20
6 2009-03-20
7 2009-03-20
8 2009-03-20
9 2009-03-20
10 2009-03-02
11 2009-03-20
12 2009-03-20
13 2009-03-21
14 2009-03-22
15 2009-02-01
16 2009-09-01
17 2010-10-01
18 2008-06-01
19 2009-12-01
20 2009-01-01
21 2010-01-01
22 2019-05-31
23 2019-10-01
dtype: datetime64[ns]

How to scrape a table into pandas dataframe using selenium in python?

I want to copy the table (id=symbolMarket) and save it as a pandas dataframe in this link https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data
How should I do it in the simple/beautiful way please?
Obviously I can retrieve the element one by one, but I believe there is a better way.
(I am using selenium to access the page, if this helps)
Many thanks for sharing knowledge with me

I was pretty hesitant to post this since it is pretty basic, and there is an abundance of solutions that show how to read a html table into pandas dataframe. Makes me wonder if you even attempted to look it up first.
But, just use .read_html(). This will return a list of dataframes. So you'll just have to figure out which dataframe in that list that you want:
import pandas as pd
url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
tables = pd.read_html(url)
Output:
table = tables[3]
print (table)
0 1 ... 5 6
0 Date Open ... Change (Pips) Change (%)
1 Mar 20, 2019 21:00 25737 ... +253.0 +0.97%
2 Mar 19, 2019 21:00 25871 ... -135.0 -0.52%
3 Mar 18, 2019 21:00 25935 ... -63.0 -0.24%
4 Mar 17, 2019 21:00 25864 ... +70.0 +0.27%
5 Mar 16, 2019 21:00 25864 ... -20.0 -0.08%
6 Mar 14, 2019 21:00 25716 ... +153.0 +0.59%
7 Mar 13, 2019 21:00 25756 ... -40.0 -0.16%
8 Mar 12, 2019 21:00 25575 ... +185.0 +0.72%
9 Mar 11, 2019 21:00 25686 ... -93.0 -0.36%
10 Mar 10, 2019 21:00 25470 ... +212.0 +0.83%
11 Mar 09, 2019 21:00 25470 ... -29.0 -0.11%
12 Mar 07, 2019 21:00 25459 ... +61.0 +0.24%
13 Mar 06, 2019 21:00 25673 ... -197.0 -0.77%
14 Mar 05, 2019 21:00 25786 ... -108.0 -0.42%
15 Mar 04, 2019 21:00 25805 ... +3.0 +0.01%
16 Mar 03, 2019 21:00 26114 ... -300.0 -1.16%
17 Feb 28, 2019 21:00 25911 ... +138.0 +0.53%
18 Feb 27, 2019 21:00 26018 ... -89.0 -0.34%
19 Feb 26, 2019 21:00 26005 ... +31.0 +0.12%
20 Feb 25, 2019 21:00 26093 ... -63.0 -0.24%
21 Feb 24, 2019 21:00 26094 ... -3.0 -0.01%
22 Feb 21, 2019 21:00 25825 ... +210.0 +0.81%
23 Feb 20, 2019 21:00 25962 ... -120.0 -0.46%
24 Feb 19, 2019 21:00 25877 ... +88.0 +0.34%
25 Feb 18, 2019 21:00 25894 ... -9.0 -0.03%
26 Feb 17, 2019 21:00 25905 ... +5.0 +0.02%
27 Feb 14, 2019 21:00 25404 ... +500.0 +1.93%
28 Feb 13, 2019 21:00 25483 ... -68.0 -0.27%
29 Feb 12, 2019 21:00 25418 ... +102.0 +0.40%
.. ... ... ... ... ...
71 Dec 11, 2018 21:00 24341 ... +208.0 +0.85%
72 Dec 10, 2018 21:00 24490 ... -152.0 -0.62%
73 Dec 09, 2018 21:00 24338 ... +144.0 +0.59%
74 Dec 06, 2018 21:00 24921 ... -517.0 -2.12%
75 Dec 05, 2018 21:00 25118 ... -189.0 -0.76%
76 Dec 04, 2018 21:00 25033 ... +134.0 +0.53%
77 Dec 03, 2018 21:00 25837 ... -798.0 -3.19%
78 Dec 02, 2018 21:00 25897 ... -55.0 -0.21%
79 Nov 29, 2018 21:00 25367 ... +220.0 +0.86%
80 Nov 28, 2018 21:00 25327 ... +62.0 +0.24%
81 Nov 27, 2018 21:00 24794 ... +568.0 +2.24%
82 Nov 26, 2018 21:00 24546 ... +253.0 +1.02%
83 Nov 25, 2018 21:00 24300 ... +230.0 +0.94%
84 Nov 22, 2018 21:00 24367 ... -80.0 -0.33%
85 Nov 21, 2018 21:00 24497 ... -144.0 -0.59%
86 Nov 20, 2018 21:00 24461 ... +38.0 +0.16%
87 Nov 19, 2018 21:00 25063 ... -604.0 -2.47%
88 Nov 18, 2018 21:00 25410 ... -342.0 -1.36%
89 Nov 15, 2018 21:00 25335 ... +135.0 +0.53%
90 Nov 14, 2018 21:00 25085 ... +256.0 +1.01%
91 Nov 13, 2018 21:00 25378 ... -273.0 -1.09%
92 Nov 12, 2018 21:00 25422 ... -65.0 -0.26%
93 Nov 11, 2018 21:00 25987 ... -577.0 -2.27%
94 Nov 08, 2018 21:00 26184 ... -202.0 -0.78%
95 Nov 07, 2018 21:00 26190 ... +15.0 +0.06%
96 Nov 06, 2018 21:00 25663 ... +572.0 +2.18%
97 Nov 05, 2018 21:00 25481 ... +200.0 +0.78%
98 Nov 04, 2018 21:00 25267 ... +221.0 +0.87%
99 Nov 01, 2018 21:00 25240 ... +40.0 +0.16%
100 Oct 31, 2018 21:00 25090 ... +229.0 +0.90%
[101 rows x 7 columns]

Parsing date/time strings in Pandas DataFrame

I have the following Pandas series of dates/times:
pd.DataFrame({"GMT":["13 Feb 20089:30 AM", "22 Apr 20098:30 AM",
"14 Jul 20108:30 AM", "01 Jan 20118:30 AM"]})
GMT
13 Feb 20089:30 AM
22 Apr 20098:30 AM
14 Jul 20108:30 AM
01 Jan 20118:30 AM
What I would like is to split the date and time portions into two separate columns, i.e.
Date Time
13 Feb 2008 9:30 AM
22 Apr 2009 8:30 AM
14 Jul 2010 8:30 AM
01 Jan 2011 8:30 AM
Any help? Thought about simply splicing each string individually but was wondering if there was a better solution that returned them as datetime objects.

Use to_datetime + dt.strftime:
df['GMT'] = pd.to_datetime(df['GMT'], format='%d %b %Y%H:%M %p')
df['Date'] = df['GMT'].dt.strftime('%d %b %Y')
df['Time'] = df['GMT'].dt.strftime('%H:%M %p')
print (df)
GMT Date Time
0 2008-02-13 09:30:00 13 Feb 2008 09:30 AM
1 2009-04-22 08:30:00 22 Apr 2009 08:30 AM
2 2010-07-14 08:30:00 14 Jul 2010 08:30 AM
3 2011-01-01 08:30:00 01 Jan 2011 08:30 AM
And for datetime objects use dt.date and
dt.time:
df['GMT'] = pd.to_datetime(df['GMT'], format='%d %b %Y%H:%M %p')
df['Date'] = df['GMT'].dt.date
df['Time'] = df['GMT'].dt.time
print (df)
GMT Date Time
0 2008-02-13 09:30:00 2008-02-13 09:30:00
1 2009-04-22 08:30:00 2009-04-22 08:30:00
2 2010-07-14 08:30:00 2010-07-14 08:30:00
3 2011-01-01 08:30:00 2011-01-01 08:30:00
For formats check http://strftime.org/.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Mixed characters and digits date regex - python

Related

How to remove unwanted data from a data column using pandas DataFrame

Why doesn't this Regex match any of the dates?

Regex for extracting all complex dates formats from a string

How to scrape a table into pandas dataframe using selenium in python?

Parsing date/time strings in Pandas DataFrame

Categories

Resources