I have the following string:
"04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010; 31/May/2019; 01/October/2019; 1st April"
With my current regex I am able to find all the date formats except two: 31/May/2019 and 01/October/2019.
The current regex I am using:
(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+
Can anyone help me build a regex that extracts all the dates mentioned above? I want to solve this using regex only.
Try:
import pandas as pd

dates = """04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010; 31/May/2019; 01/October/2019"""
pd.to_datetime(pd.Series(dates.split(';')))
0 2009-04-20
1 2009-04-20
2 2009-04-20
3 2009-04-03
4 2009-03-20
5 2009-03-20
6 2009-03-20
7 2009-03-20
8 2009-03-20
9 2009-03-20
10 2009-03-02
11 2009-03-20
12 2009-03-20
13 2009-03-21
14 2009-03-22
15 2009-02-01
16 2009-09-01
17 2010-10-01
18 2008-06-01
19 2009-12-01
20 2009-01-01
21 2010-01-01
22 2019-05-31
23 2019-10-01
dtype: datetime64[ns]
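If you want to stay regex-only as asked, one rough sketch is to spell out the formats as explicit alternatives (ordered from most to least specific) instead of one catch-all character class; the exact alternatives below are an assumption tailored to the sample string and may need tweaking for other inputs:
import re

# month name followed by optional extra letters ("Mar", "March", "October", ...)
month = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'

# alternatives ordered from most to least specific so the longer forms win
alternatives = [
    r'\d{1,2}/' + month + r'/\d{4}',                                # 31/May/2019, 01/October/2019
    month + r'\.?,?\s*\d{1,2}(?:st|nd|rd|th)?,?\s*\d{4}',           # Mar 20, 2009 / Mar 20th, 2009
    r'\d{1,2}(?:st|nd|rd|th)?\s+' + month + r'\.?,?(?:\s*\d{4})?',  # 20 Mar 2009 / 1st April
    month + r'\.?\s+\d{4}',                                         # Feb 2009
    r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}',                               # 04-20-2009 / 4/3/09
    r'\d{1,2}/\d{4}',                                               # 6/2008
    r'\d{4}',                                                       # 2009
]
pattern = re.compile('|'.join(alternatives))

text = ("04-20-2009; 04/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; 2 Mar. 2009; "
        "Mar 22nd, 2009; Feb 2009; 6/2008; 2009; 31/May/2019; 01/October/2019; 1st April")
print(pattern.findall(text))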
I have a list:
['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
How can I get this result:
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25',
'Mon Oct 25', 'Mon Oct 25']
We could use re.sub here for a regex-based approach:
import re

inp = ['Sun Oct 24 10:31:10 +0000 2021', 'Sun Oct 24 10:45:02 +0000 2021', 'Mon Oct 25 04:13:27 +0000 2021', 'Mon Oct 25 04:26:20 +0000 2021', 'Mon Oct 25 04:32:32 +0000 2021', 'Mon Oct 25 04:56:39 +0000 2021', 'Mon Oct 25 05:21:21 +0000 2021', 'Mon Oct 25 06:46:27 +0000 2021', 'Mon Oct 25 08:59:13 +0000 2021']
# strip everything from the time-of-day onwards
output = [re.sub(r'\s+\d{2}:.*$', '', x) for x in inp]
print(output)
# ['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25',
#  'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
If you have fixed-format dates, you can just take the first 10 characters of every date string:
dates = ['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
trunc_dates = [date[:10] for date in dates]
print(trunc_dates)
Output
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
You can also use a more robust approach: parse with dateutil and reformat:
from dateutil import parser
dates = ['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
trunc_dates = [
    parser.parse(date).strftime('%a %b %d')
    for date in dates
]
print(trunc_dates)
Output
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
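If the timestamps always follow this fixed Twitter-style created_at format, the standard library alone is enough as well; a minimal sketch, assuming the format never varies:
from datetime import datetime

dates = ['Sun Oct 24 10:31:10 +0000 2021', 'Mon Oct 25 04:13:27 +0000 2021']
# parse with the exact format, then keep only weekday, month and day
trunc_dates = [
    datetime.strptime(d, '%a %b %d %H:%M:%S %z %Y').strftime('%a %b %d')
    for d in dates
]
print(trunc_dates)  # ['Sun Oct 24', 'Mon Oct 25']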
In the scraped data, the Date column contains the date twice, comma-separated, together with the day name. My goal is to remove this December 13, 2021Mon, portion and create a separate/new column for the day, and I also want to remove the last column, i.e. the Volume column.
Script
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----
I added the necessary steps to your code:
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
# get the Day column
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...
I would use a regex here to split. Then you can combine the pieces and parse them any way you like afterwards:
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})',expand=True)[[0,1]]
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
print(df)
Output:
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----
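Alternatively, applied to the freshly scraped df (i.e. before the split above), you can extract the day name and the short date in one pass and get straight to the expected output; a sketch, assuming every value follows the duplicated long-plus-short pattern shown above:
# pull "Monday" and "Dec 13, 2021" out of "Monday, December 13, 2021Mon, Dec 13, 2021"
parts = df['Date'].str.extract(r'^(?P<Day>\w+), .*?\d{4}\w{3}, (?P<Short>\w{3} \d{2}, \d{4})$')
df['Day'] = parts['Day']
df['Date'] = parts['Short']
df = df.drop(columns='Volume')[['Day', 'Date', 'Open', 'High', 'Low', 'Close']]
print(df)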
I'm trying to match dates in a dataframe with 500 entries using regex:
The dates can appear in the following formats:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
dates[dates[0].str.contains(r'(?P<year>\d?\d?\d\d)')].shape
returns a shape of (500, 1),
but
dates[dates[0].str.contains(r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)??P<year>(\d?\d?\d\d))')].shape
returns (0, 1). The day group is optional, though, so shouldn't it still match the year group?
Ok I got it.
The correct regex pattern is:
r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))'
The opening bracket for the year group was in the wrong position.
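A quick check of the corrected pattern on a small frame (a sketch that mirrors the question's dates[0] column):
import pandas as pd

dates = pd.DataFrame(['Mar 20th, 2009', '20 Mar 2009', 'Feb 2009', '6/2008', '2009'])
pattern = r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))'
# str.contains warns that the pattern has match groups, but the boolean mask still works
print(dates[dates[0].str.contains(pattern)].shape)  # (5, 1) -- every row matches now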
I want to copy the table (id=symbolMarket) from this link and save it as a pandas dataframe: https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data
How should I do this in a simple/clean way?
Obviously I can retrieve the elements one by one, but I believe there is a better way.
(I am using selenium to access the page, if this helps)
Many thanks for sharing knowledge with me
I was pretty hesitant to post this since it is pretty basic, and there is an abundance of solutions that show how to read an HTML table into a pandas dataframe. It makes me wonder whether you even attempted to look it up first.
But just use .read_html(). This will return a list of dataframes, so you'll just have to figure out which dataframe in that list you want:
import pandas as pd
url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
tables = pd.read_html(url)
table = tables[3]
print(table)
Output:
0 1 ... 5 6
0 Date Open ... Change (Pips) Change (%)
1 Mar 20, 2019 21:00 25737 ... +253.0 +0.97%
2 Mar 19, 2019 21:00 25871 ... -135.0 -0.52%
3 Mar 18, 2019 21:00 25935 ... -63.0 -0.24%
4 Mar 17, 2019 21:00 25864 ... +70.0 +0.27%
5 Mar 16, 2019 21:00 25864 ... -20.0 -0.08%
6 Mar 14, 2019 21:00 25716 ... +153.0 +0.59%
7 Mar 13, 2019 21:00 25756 ... -40.0 -0.16%
8 Mar 12, 2019 21:00 25575 ... +185.0 +0.72%
9 Mar 11, 2019 21:00 25686 ... -93.0 -0.36%
10 Mar 10, 2019 21:00 25470 ... +212.0 +0.83%
11 Mar 09, 2019 21:00 25470 ... -29.0 -0.11%
12 Mar 07, 2019 21:00 25459 ... +61.0 +0.24%
13 Mar 06, 2019 21:00 25673 ... -197.0 -0.77%
14 Mar 05, 2019 21:00 25786 ... -108.0 -0.42%
15 Mar 04, 2019 21:00 25805 ... +3.0 +0.01%
16 Mar 03, 2019 21:00 26114 ... -300.0 -1.16%
17 Feb 28, 2019 21:00 25911 ... +138.0 +0.53%
18 Feb 27, 2019 21:00 26018 ... -89.0 -0.34%
19 Feb 26, 2019 21:00 26005 ... +31.0 +0.12%
20 Feb 25, 2019 21:00 26093 ... -63.0 -0.24%
21 Feb 24, 2019 21:00 26094 ... -3.0 -0.01%
22 Feb 21, 2019 21:00 25825 ... +210.0 +0.81%
23 Feb 20, 2019 21:00 25962 ... -120.0 -0.46%
24 Feb 19, 2019 21:00 25877 ... +88.0 +0.34%
25 Feb 18, 2019 21:00 25894 ... -9.0 -0.03%
26 Feb 17, 2019 21:00 25905 ... +5.0 +0.02%
27 Feb 14, 2019 21:00 25404 ... +500.0 +1.93%
28 Feb 13, 2019 21:00 25483 ... -68.0 -0.27%
29 Feb 12, 2019 21:00 25418 ... +102.0 +0.40%
.. ... ... ... ... ...
71 Dec 11, 2018 21:00 24341 ... +208.0 +0.85%
72 Dec 10, 2018 21:00 24490 ... -152.0 -0.62%
73 Dec 09, 2018 21:00 24338 ... +144.0 +0.59%
74 Dec 06, 2018 21:00 24921 ... -517.0 -2.12%
75 Dec 05, 2018 21:00 25118 ... -189.0 -0.76%
76 Dec 04, 2018 21:00 25033 ... +134.0 +0.53%
77 Dec 03, 2018 21:00 25837 ... -798.0 -3.19%
78 Dec 02, 2018 21:00 25897 ... -55.0 -0.21%
79 Nov 29, 2018 21:00 25367 ... +220.0 +0.86%
80 Nov 28, 2018 21:00 25327 ... +62.0 +0.24%
81 Nov 27, 2018 21:00 24794 ... +568.0 +2.24%
82 Nov 26, 2018 21:00 24546 ... +253.0 +1.02%
83 Nov 25, 2018 21:00 24300 ... +230.0 +0.94%
84 Nov 22, 2018 21:00 24367 ... -80.0 -0.33%
85 Nov 21, 2018 21:00 24497 ... -144.0 -0.59%
86 Nov 20, 2018 21:00 24461 ... +38.0 +0.16%
87 Nov 19, 2018 21:00 25063 ... -604.0 -2.47%
88 Nov 18, 2018 21:00 25410 ... -342.0 -1.36%
89 Nov 15, 2018 21:00 25335 ... +135.0 +0.53%
90 Nov 14, 2018 21:00 25085 ... +256.0 +1.01%
91 Nov 13, 2018 21:00 25378 ... -273.0 -1.09%
92 Nov 12, 2018 21:00 25422 ... -65.0 -0.26%
93 Nov 11, 2018 21:00 25987 ... -577.0 -2.27%
94 Nov 08, 2018 21:00 26184 ... -202.0 -0.78%
95 Nov 07, 2018 21:00 26190 ... +15.0 +0.06%
96 Nov 06, 2018 21:00 25663 ... +572.0 +2.18%
97 Nov 05, 2018 21:00 25481 ... +200.0 +0.78%
98 Nov 04, 2018 21:00 25267 ... +221.0 +0.87%
99 Nov 01, 2018 21:00 25240 ... +40.0 +0.16%
100 Oct 31, 2018 21:00 25090 ... +229.0 +0.90%
[101 rows x 7 columns]
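Since the question already knows the table's id (symbolMarket), you can also let read_html select it by attribute instead of guessing the list index; a sketch, assuming the table is present in the HTML as served (not injected by JavaScript):
import pandas as pd

url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
# keep only tables whose id attribute is symbolMarket
table = pd.read_html(url, attrs={'id': 'symbolMarket'})[0]
print(table.head())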
I need a Python regular expression to match every valid date in a raw text file. I split the text into lines and put them in a pandas Series; the goal now is to extract only the date from every line, getting a series of dates. I was able to match most of the numerical date formats, but I got stuck when dealing with literal months (Jan, January, Feb, February, ...). In particular, I need a regex (or a set of them) which matches the following formats:
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
Any help will be appreciated,
thank you in advance!
In line with the comment I made, I suggest using split and strip to generate a list of possible dates from your string, and then feeding each one to dateutil.parser.parse() to turn it into a proper datetime object, which you can manipulate to your liking.
Possible implementation below:
test = '''- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010'''
from dateutil.parser import parse

list_of_dates = []
for line in test.split('\n'):
    for date in line.split(';'):
        list_of_dates.append(date.strip(' - '))

def is_date(string):
    try:
        parse(string)
        return True
    except ValueError:
        return False

found_dates = []
for date in list_of_dates:
    if is_date(date):
        found_dates.append(parse(date))

for date in found_dates:
    print(date)
Result:
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-21 00:00:00
2009-03-22 00:00:00
2009-02-04 00:00:00
2009-09-04 00:00:00
2010-10-04 00:00:00
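The last three results carry day 04 because dateutil fills any missing component (here the day) from its default, which is based on the date the code happened to run. If you would rather pin month-only inputs to the first of the month, you can pass an explicit default; a minimal sketch:
from datetime import datetime
from dateutil.parser import parse

# missing components (here: the day) are taken from `default`
print(parse('Feb 2009', default=datetime(2009, 1, 1)))  # 2009-02-01 00:00:00
print(parse('Oct 2010', default=datetime(2009, 1, 1)))  # 2010-10-01 00:00:00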