How to remove unwanted data from a data column using pandas DataFrame - python

I'm getting the date twice in the Date column of my scraped data: a long form (with the day name) run together with a short form, comma-separated. My goal is to remove this duplicated `December 13, 2021Mon,` portion, create a separate/new column for the day, and also remove the last column, meaning the Volume column.
Script
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----

I added the necessary steps to your code:
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)

# get the Day column: everything before the first comma
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format: the last 12 characters, e.g. 'Dec 13, 2021'
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...
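For clarity, the two string operations can be checked in isolation on a single value copied from the current output:

```python
# One raw value exactly as it appears in the scraped Date column
s = 'Monday, December 13, 2021Mon, Dec 13, 2021'

day = s[:s.find(',')]  # everything before the first comma
date = s[-12:]         # the trailing short form is 12 characters here
```

Note that the `[-12:]` slice relies on the short date always being zero-padded (e.g. 'Dec 09, 2021'), which holds for this site's output.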

I would use a regex here to split the column. Then you can combine the pieces and parse them any way you like afterwards:
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)

df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})', expand=True)[[0, 1]]
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
Output:
print(df)
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----
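The split can be verified offline on hard-coded samples (values copied from the output above), without the network request:

```python
import pandas as pd

dates = pd.Series([
    'Monday, December 13, 2021Mon, Dec 13, 2021',
    'Friday, December 10, 2021Fri, Dec 10, 2021',
])

# The pattern splits at the point where the 4-digit year runs into the
# abbreviated weekday; because the groups are captured, re.split keeps
# them, so columns 0 and 1 are the long date's two halves
parts = dates.str.split(r'(\d{4})(\w{1,3})', expand=True)[[0, 1]]
clean = parts[0] + parts[1]
print(clean.tolist())
```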

Related

Get values of latest year and all its months in pandas

Below is the Raw Data.
Event Month Year
Event1 January 2012
Event1 February 2013
Event1 March 2014
Event1 April 2017
Event1 May 2017
Event1 June 2017
Event2 May 2018
Event2 May 2019
Event3 February 2012
Event3 March 2012
Event3 April 2012
Event1 latest year is 2017 so month should be April, May, June.
Event2 latest year is 2019 so month should be May.
Event3 latest year is 2012 so month should be February, March, April.
Output should be:
Event Month Year
Event1 April 2017
Event1 May 2017
Event1 June 2017
Event2 May 2019
Event3 February 2012
Event3 March 2012
Event3 April 2012
You can transform the latest year per group and use it to slice:
out = df[df['Year'].eq(df.groupby('Event')['Year'].transform('max'))]
output:
Event Month Year
3 Event1 April 2017
4 Event1 May 2017
5 Event1 June 2017
7 Event2 May 2019
8 Event3 February 2012
9 Event3 March 2012
10 Event3 April 2012
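For reference, here is the same thing as a self-contained script, with the DataFrame typed in from the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'Event': ['Event1'] * 6 + ['Event2'] * 2 + ['Event3'] * 3,
    'Month': ['January', 'February', 'March', 'April', 'May', 'June',
              'May', 'May', 'February', 'March', 'April'],
    'Year': [2012, 2013, 2014, 2017, 2017, 2017, 2018, 2019, 2012, 2012, 2012],
})

# transform('max') broadcasts each group's latest year back onto its rows,
# so the comparison keeps only the rows from that latest year
out = df[df['Year'].eq(df.groupby('Event')['Year'].transform('max'))]
print(out)
```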

How to fill in blank cells in df based on string in sequential row, Pandas

I have a df, where the data looks like this:
Time Value
60.8
Jul 2019 58.1
58.8
56.9
Oct 2019 51.8
54.6
56.8
Jan 2020 58.8
54.2
51.3
Apr 2020 52.2
I want to fill in the blank cells in the Time variable according to the calendar year. So:
Time Value
Jun 2019 60.8
Jul 2019 58.1
Aug 2019 58.8
Sep 2019 56.9
Oct 2019 51.8
Nov 2019 54.6
Dec 2019 56.8
Jan 2020 58.8
Feb 2020 54.2
Mar 2020 51.3
Apr 2020 52.2
I saw a post where pandas could be used to fill in numeric values, but since my variable isn't necessarily defined in a numeric way, I'm not entirely sure how to apply it in this situation.
There seem to me to be two ways of approaching this: 1) modifying the list before writing to df. 2) Modifying the df.
I prefer the first solution, but not sure if it is possible.
Thanks.
My script:
import pandas as pd

totalmonth = ['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
totalvalue = ['60.8', '58.1', '58.8', '56.9', '51.8', '54.6', '56.8', '58.8', '54.2', '51.3', '52.2', '48.7']
df = pd.DataFrame({'Time': totalmonth,
                   'Value': totalvalue})
OK, this took me longer than I would like to admit. This solves it your preferred way, by fixing the list before it is written to the df.
Output:
***********************BEFORE********************************
['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
Time Value
0 60.8
1 Jul 2019 58.1
2 58.8
3 56.9
4 Oct 2019 51.8
5 54.6
6 56.8
7 Jan 2020 58.8
8 54.2
9 51.3
10 Apr 2020 52.2
11 48.7
***********************AFTER********************************
['Jun 2019', 'Jul 2019', 'Aug 2019', 'Sep 2019', 'Oct 2019', 'Nov 2019', 'Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020']
Time Value
0 Jun 2019 60.8
1 Jul 2019 58.1
2 Aug 2019 58.8
3 Sep 2019 56.9
4 Oct 2019 51.8
5 Nov 2019 54.6
6 Dec 2019 56.8
7 Jan 2020 58.8
8 Feb 2020 54.2
9 Mar 2020 51.3
10 Apr 2020 52.2
11 May 2020 48.7
Code:
from datetime import datetime
from dateutil.relativedelta import relativedelta
totalmonth=['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
new_totalmonth = [datetime.strptime(x,'%b %Y') for x in totalmonth if x != '' ]
index = totalmonth.index(min(new_totalmonth).strftime('%b %Y'))
new_totalmonth = [(min(new_totalmonth) + relativedelta(months=x)).strftime('%b %Y') for x in range(-index,len(totalmonth) - index)]
print(new_totalmonth)
Breakdown
This line of code creates a list of all the valid dates and puts them in a format that I can run the min() function on.
new_totalmonth = [datetime.strptime(x,'%b %Y') for x in totalmonth if x != '' ]
What this prints out
print(new_totalmonth)
[datetime.datetime(2019, 7, 1, 0, 0), datetime.datetime(2019, 10, 1, 0, 0), datetime.datetime(2020, 1, 1, 0, 0), datetime.datetime(2020, 4, 1, 0, 0)]
This is creating the variable index and assigning it the index of the minimum date in totalmonth
index = totalmonth.index(min(new_totalmonth).strftime('%b %Y'))
min(new_totalmonth) # this is finding the minimum date in new_totalmonth
print(min(new_totalmonth))
2019-07-01 00:00:00
min(new_totalmonth).strftime('%b %Y') # This is putting that minimum in a format that matches what is in totalmonth so the function totalmonth.index() can get the correct index
print(min(new_totalmonth).strftime('%b %Y'))
Jul 2019
This is using list comprehension.
new_totalmonth = [(min(new_totalmonth) + relativedelta(months=x)).strftime('%b %Y') for x in range(-index,len(totalmonth) - index)]
I am using the index of the minimum date in totalmonth to manipulate the range of values (how many months) I am going to add to the minimum month in totalmonth
range(-index,len(totalmonth) - index)
print(list(range(-index,len(totalmonth) - index)))
[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Since the minimum month (Jul 2019) is at index 1, I need to add -1 months to it to get the month that comes before it, which is Jun 2019.
So it can be broken out to:
(min(new_totalmonth) + relativedelta(months=-1)).strftime('%b %Y') = Jun 2019
(min(new_totalmonth) + relativedelta(months=0)).strftime('%b %Y') = Jul 2019
(min(new_totalmonth) + relativedelta(months=1)).strftime('%b %Y') = Aug 2019
...
(min(new_totalmonth) + relativedelta(months=10)).strftime('%b %Y') = May 2020
Take all those values and put them in the list new_totalmonth
print(new_totalmonth)
['Jun 2019', 'Jul 2019', 'Aug 2019', 'Sep 2019', 'Oct 2019', 'Nov 2019', 'Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020']
Take the minimum of the 'Time' column minus one month as the start, the maximum plus two months as the end bound, and rebuild the column with date_range() so every row gets a successive month:
import datetime
df['Time'] = pd.to_datetime(df['Time'])
# Note: this naive month arithmetic fails if min() falls in January or
# max() falls in November/December; it works for this data.
startM = datetime.datetime(df['Time'].min().year, df['Time'].min().month - 1, 1)
endM = datetime.datetime(df['Time'].max().year, df['Time'].max().month + 2, 1)
df['Time'] = pd.date_range(startM, endM, freq='1M')
df
Time Value
0 2019-06-30 60.8
1 2019-07-31 58.1
2 2019-08-31 58.8
3 2019-09-30 56.9
4 2019-10-31 51.8
5 2019-11-30 54.6
6 2019-12-31 56.8
7 2020-01-31 58.8
8 2020-02-29 54.2
9 2020-03-31 51.3
10 2020-04-30 52.2
11 2020-05-31 48.7
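pd.DateOffset can replace the manual month arithmetic, which breaks when the earliest known month is January or the latest is November/December. A self-contained sketch on the question's list:

```python
import pandas as pd

totalmonth = ['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']

t = pd.to_datetime(pd.Series(totalmonth))  # empty strings become NaT
# Walk back from the earliest known month by its row position,
# then lay out one month-start per row
start = t.min() - pd.DateOffset(months=int(t.idxmin()))
filled = pd.date_range(start, periods=len(t), freq='MS').strftime('%b %Y')
print(list(filled))
```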
First use pd.to_datetime to convert the Time column to a datetime series t. Next use pd.period_range to generate a period range with monthly frequency, a starting period computed from the first known value, and a number of periods equal to the length of t. Finally, use .strftime with the format specifier %b %Y to return the string representation of the period range in the desired format:
t = pd.to_datetime(df['Time'])
df['Time'] = pd.period_range(
    t.min().to_period('M') - t.idxmin(), periods=len(t), freq='M').strftime('%b %Y')
Details:
# print(t)
0 NaT
1 2019-07-01
2 NaT
3 NaT
4 2019-10-01
5 NaT
6 NaT
7 2020-01-01
8 NaT
9 NaT
10 2020-04-01
11 NaT
Name: Time, dtype: datetime64[ns]
# print(t.min(), t.idxmin())
Timestamp('2019-07-01 00:00:00'), 1
# print(t.min().to_period('M') - t.idxmin())
Period('2019-06', 'M') # starting period of the period range
Result:
# print(df)
Time Value
0 Jun 2019 60.8
1 Jul 2019 58.1
2 Aug 2019 58.8
3 Sep 2019 56.9
4 Oct 2019 51.8
5 Nov 2019 54.6
6 Dec 2019 56.8
7 Jan 2020 58.8
8 Feb 2020 54.2
9 Mar 2020 51.3
10 Apr 2020 52.2
11 May 2020 48.7

How to web scrape a table by using Python?

I am trying to web scrape, using Python 3, a table from this website into a .csv file: 2015 NBA National TV Schedule
The chart starts out like:
Date Teams Network
Oct. 27, 8:00 p.m. ET Cleveland # Chicago TNT
Oct. 27, 10:30 p.m. ET New Orleans # Golden State TNT
Oct. 28, 8:00 p.m. ET San Antonio # Oklahoma City ESPN
Oct. 28, 10:30 p.m. ET Minnesota # L.A. Lakers ESPN
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
The output I want in the .csv file has the date and time in separate columns and each team in its own column, for every row of the chart on the website. Notice how multiple dates are used more than once, and the time is in a separate column. How do I implement the scraper to get this output?
pd.read_html will get most of the way there:
In [73]: pd.read_html("https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952")[0]
Out[73]:
0 1 2
0 Date Teams Network
1 Oct. 27, 8:00 p.m. ET Cleveland # Chicago TNT
2 Oct. 27, 10:30 p.m. ET New Orleans # Golden State TNT
3 Oct. 28, 8:00 p.m. ET San Antonio # Oklahoma City ESPN
4 Oct. 28, 10:30 p.m. ET Minnesota # L.A. Lakers ESPN
.. ... ... ...
139 Apr. 9, 8:30 p.m. ET Cleveland # Chicago ABC
140 Apr. 12, 8:00 p.m. ET Oklahoma City # San Antonio TNT
141 Apr. 12, 10:30 p.m. ET Memphis # L.A. Clippers TNT
142 Apr. 13, 8:00 p.m. ET Orlando # Charlotte ESPN
143 Apr. 13, 10:30 p.m. ET Utah # L.A. Lakers ESPN
You'd just need to parse out the date into columns and separate the teams.
You'll use pandas to grab the table with .read_html(), and then continue with pandas to manipulate the data:
import pandas as pd
import numpy as np
df = pd.read_html('https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952', header=0)[0]
# Split the Date column at the comma into Date, Time columns
df[['Date','Time']] = df.Date.str.split(',',expand=True)
# Replace substrings in Time column
df['Time'] = df['Time'].str.replace('p.m. ET','PM')
# Can't convert to datetime as there is no year. One way to do it here is anything before
# Jan, add the suffix ', 2015', else add ', 2016'
# If you have more than 1 season, you would have to work this out another way
df['Date'] = np.where(df.Date.str.startswith(('Oct.', 'Nov.', 'Dec.')), df.Date + ', 2015', df.Date + ', 2016')
# If you want 0 padding for the day, remove '#' from %#d below ('%#d' is Windows-specific; use '%-d' on Linux/macOS)
# Change the date format from abbreviated month to full name (Ie Oct. -> October)
df['Date'] = pd.to_datetime(df['Date'].astype(str)).dt.strftime('%B %#d, %Y')
# Split the Teams column
df[['Team 1','Team 2']] = df.Teams.str.split('#',expand=True)
# Remove any leading/trailing whitespace
df= df.applymap(lambda x: x.strip() if type(x) is str else x)
# Final dataframe with desired columns
df = df[['Date','Time','Team 1','Team 2','Network']]
Output:
Date Time Team 1 Team 2 Network
0 October 27, 2015 8:00 PM Cleveland Chicago TNT
1 October 27, 2015 10:30 PM New Orleans Golden State TNT
2 October 28, 2015 8:00 PM San Antonio Oklahoma City ESPN
3 October 28, 2015 10:30 PM Minnesota L.A. Lakers ESPN
4 October 29, 2015 8:00 PM Atlanta New York TNT
5 October 29, 2015 10:30 PM Dallas L.A. Clippers TNT
6 October 30, 2015 7:00 PM Miami Cleveland ESPN
7 October 30, 2015 9:30 PM Golden State Houston ESPN
8 November 4, 2015 8:00 PM New York Cleveland ESPN
9 November 4, 2015 10:30 PM L.A. Clippers Golden State ESPN
10 November 5, 2015 8:00 PM Oklahoma City Chicago TNT
11 November 5, 2015 10:30 PM Memphis Portland TNT
12 November 6, 2015 8:00 PM Miami Indiana ESPN
13 November 6, 2015 10:30 PM Houston Sacramento ESPN
14 November 11, 2015 8:00 PM L.A. Clippers Dallas ESPN
15 November 11, 2015 10:30 PM San Antonio Portland ESPN
16 November 12, 2015 8:00 PM Golden State Minnesota TNT
17 November 12, 2015 10:30 PM L.A. Clippers Phoenix TNT
18 November 18, 2015 8:00 PM New Orleans Oklahoma City ESPN
19 November 18, 2015 10:30 PM Chicago Phoenix ESPN
20 November 19, 2015 8:00 PM Milwaukee Cleveland TNT
21 November 19, 2015 10:30 PM Golden State L.A. Clippers TNT
22 November 20, 2015 8:00 PM San Antonio New Orleans ESPN
23 November 20, 2015 10:30 PM Chicago Golden State ESPN
24 November 24, 2015 8:00 PM Boston Atlanta TNT
25 November 24, 2015 10:30 PM L.A. Lakers Golden State TNT
26 December 3, 2015 7:00 PM Oklahoma City Miami TNT
27 December 3, 2015 9:30 PM San Antonio Memphis TNT
28 December 4, 2015 7:00 PM Brooklyn New York ESPN
29 December 4, 2015 9:30 PM Cleveland New Orleans ESPN
.. ... ... ... ... ...
113 March 10, 2016 10:30 PM Cleveland L.A. Lakers TNT
114 March 12, 2016 8:30 PM Oklahoma City San Antonio ABC
115 March 13, 2016 3:30 PM Cleveland L.A. Clippers ABC
116 March 14, 2016 8:00 PM Memphis Houston ESPN
117 March 14, 2016 10:30 PM New Orleans Golden State ESPN
118 March 16, 2016 7:00 PM Oklahoma City Boston ESPN
119 March 16, 2016 9:30 PM L.A. Clippers Houston ESPN
120 March 19, 2016 8:30 PM Golden State San Antonio ABC
121 March 22, 2016 8:00 PM Houston Oklahoma City TNT
122 March 22, 2016 10:30 PM Memphis L.A. Lakers TNT
123 March 23, 2016 8:00 PM Milwaukee Cleveland ESPN
124 March 23, 2016 10:30 PM Dallas Portland ESPN
125 March 29, 2016 8:00 PM Houston Cleveland TNT
126 March 29, 2016 10:30 PM Washington Golden State TNT
127 March 31, 2016 7:00 PM Chicago Houston TNT
128 March 31, 2016 9:30 PM L.A. Clippers Oklahoma City TNT
129 April 1, 2016 8:00 PM Cleveland Atlanta ESPN
130 April 1, 2016 10:30 PM Boston Golden State ESPN
131 April 3, 2016 3:30 PM Oklahoma City Houston ABC
132 April 5, 2016 8:00 PM Chicago Memphis TNT
133 April 5, 2016 10:30 PM L.A. Lakers L.A. Clippers TNT
134 April 6, 2016 7:00 PM Cleveland Indiana ESPN
135 April 6, 2016 9:30 PM Houston Dallas ESPN
136 April 7, 2016 8:00 PM Chicago Miami TNT
137 April 7, 2016 10:30 PM San Antonio Golden State TNT
138 April 9, 2016 8:30 PM Cleveland Chicago ABC
139 April 12, 2016 8:00 PM Oklahoma City San Antonio TNT
140 April 12, 2016 10:30 PM Memphis L.A. Clippers TNT
141 April 13, 2016 8:00 PM Orlando Charlotte ESPN
142 April 13, 2016 10:30 PM Utah L.A. Lakers ESPN
[143 rows x 5 columns]

How to scrape a table into pandas dataframe using selenium in python?

I want to copy the table (id=symbolMarket) and save it as a pandas dataframe in this link https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data
How should I do it in the simple/beautiful way please?
Obviously I can retrieve the element one by one, but I believe there is a better way.
(I am using selenium to access the page, if this helps)
Many thanks for sharing knowledge with me
I was pretty hesitant to post this since it is pretty basic, and there is an abundance of solutions that show how to read a html table into pandas dataframe. Makes me wonder if you even attempted to look it up first.
But, just use .read_html(). This will return a list of dataframes. So you'll just have to figure out which dataframe in that list that you want:
import pandas as pd
url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
tables = pd.read_html(url)
Output:
table = tables[3]
print (table)
0 1 ... 5 6
0 Date Open ... Change (Pips) Change (%)
1 Mar 20, 2019 21:00 25737 ... +253.0 +0.97%
2 Mar 19, 2019 21:00 25871 ... -135.0 -0.52%
3 Mar 18, 2019 21:00 25935 ... -63.0 -0.24%
4 Mar 17, 2019 21:00 25864 ... +70.0 +0.27%
5 Mar 16, 2019 21:00 25864 ... -20.0 -0.08%
6 Mar 14, 2019 21:00 25716 ... +153.0 +0.59%
7 Mar 13, 2019 21:00 25756 ... -40.0 -0.16%
8 Mar 12, 2019 21:00 25575 ... +185.0 +0.72%
9 Mar 11, 2019 21:00 25686 ... -93.0 -0.36%
10 Mar 10, 2019 21:00 25470 ... +212.0 +0.83%
11 Mar 09, 2019 21:00 25470 ... -29.0 -0.11%
12 Mar 07, 2019 21:00 25459 ... +61.0 +0.24%
13 Mar 06, 2019 21:00 25673 ... -197.0 -0.77%
14 Mar 05, 2019 21:00 25786 ... -108.0 -0.42%
15 Mar 04, 2019 21:00 25805 ... +3.0 +0.01%
16 Mar 03, 2019 21:00 26114 ... -300.0 -1.16%
17 Feb 28, 2019 21:00 25911 ... +138.0 +0.53%
18 Feb 27, 2019 21:00 26018 ... -89.0 -0.34%
19 Feb 26, 2019 21:00 26005 ... +31.0 +0.12%
20 Feb 25, 2019 21:00 26093 ... -63.0 -0.24%
21 Feb 24, 2019 21:00 26094 ... -3.0 -0.01%
22 Feb 21, 2019 21:00 25825 ... +210.0 +0.81%
23 Feb 20, 2019 21:00 25962 ... -120.0 -0.46%
24 Feb 19, 2019 21:00 25877 ... +88.0 +0.34%
25 Feb 18, 2019 21:00 25894 ... -9.0 -0.03%
26 Feb 17, 2019 21:00 25905 ... +5.0 +0.02%
27 Feb 14, 2019 21:00 25404 ... +500.0 +1.93%
28 Feb 13, 2019 21:00 25483 ... -68.0 -0.27%
29 Feb 12, 2019 21:00 25418 ... +102.0 +0.40%
.. ... ... ... ... ...
71 Dec 11, 2018 21:00 24341 ... +208.0 +0.85%
72 Dec 10, 2018 21:00 24490 ... -152.0 -0.62%
73 Dec 09, 2018 21:00 24338 ... +144.0 +0.59%
74 Dec 06, 2018 21:00 24921 ... -517.0 -2.12%
75 Dec 05, 2018 21:00 25118 ... -189.0 -0.76%
76 Dec 04, 2018 21:00 25033 ... +134.0 +0.53%
77 Dec 03, 2018 21:00 25837 ... -798.0 -3.19%
78 Dec 02, 2018 21:00 25897 ... -55.0 -0.21%
79 Nov 29, 2018 21:00 25367 ... +220.0 +0.86%
80 Nov 28, 2018 21:00 25327 ... +62.0 +0.24%
81 Nov 27, 2018 21:00 24794 ... +568.0 +2.24%
82 Nov 26, 2018 21:00 24546 ... +253.0 +1.02%
83 Nov 25, 2018 21:00 24300 ... +230.0 +0.94%
84 Nov 22, 2018 21:00 24367 ... -80.0 -0.33%
85 Nov 21, 2018 21:00 24497 ... -144.0 -0.59%
86 Nov 20, 2018 21:00 24461 ... +38.0 +0.16%
87 Nov 19, 2018 21:00 25063 ... -604.0 -2.47%
88 Nov 18, 2018 21:00 25410 ... -342.0 -1.36%
89 Nov 15, 2018 21:00 25335 ... +135.0 +0.53%
90 Nov 14, 2018 21:00 25085 ... +256.0 +1.01%
91 Nov 13, 2018 21:00 25378 ... -273.0 -1.09%
92 Nov 12, 2018 21:00 25422 ... -65.0 -0.26%
93 Nov 11, 2018 21:00 25987 ... -577.0 -2.27%
94 Nov 08, 2018 21:00 26184 ... -202.0 -0.78%
95 Nov 07, 2018 21:00 26190 ... +15.0 +0.06%
96 Nov 06, 2018 21:00 25663 ... +572.0 +2.18%
97 Nov 05, 2018 21:00 25481 ... +200.0 +0.78%
98 Nov 04, 2018 21:00 25267 ... +221.0 +0.87%
99 Nov 01, 2018 21:00 25240 ... +40.0 +0.16%
100 Oct 31, 2018 21:00 25090 ... +229.0 +0.90%
[101 rows x 7 columns]
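Since the question mentions the table's id (symbolMarket), note that pd.read_html can also select a table directly via its attrs parameter instead of indexing into the returned list. A minimal offline sketch with a stand-in HTML snippet (the real page's table has more columns):

```python
import pandas as pd
from io import StringIO

# Stand-in HTML with the same id as the target table on the real page
html = """
<table id="symbolMarket">
  <thead><tr><th>Date</th><th>Open</th></tr></thead>
  <tbody><tr><td>Mar 20, 2019 21:00</td><td>25737</td></tr></tbody>
</table>
<table id="other"><tr><th>X</th></tr><tr><td>1</td></tr></table>
"""

# attrs filters the parsed tables by their HTML attributes, so only
# the table with id="symbolMarket" is returned
table = pd.read_html(StringIO(html), attrs={'id': 'symbolMarket'})[0]
print(table)
```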

Mixed characters and digits date regex

I need to find a Python regular expression in order to match every valid date in a raw text file. I split the text in lines and put them in a Pandas Series, the goal now, is to extract only the date in every line getting a series of dates. I was able to match most of the numerical date formats, but I stopped when I had to deal with literal months (Jan, January, Feb, February,...). In particular, I need a regex (or a set of them) which match the following formats:
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
Any help will be appreciated,
thank you in advance!
In line with the comment I made, I suggest using split and strip to generate a list of possible dates from your text, then feeding each candidate to dateutil.parser.parse() to turn it into a proper datetime object that you can manipulate to your liking.
Possible implementation below:
from dateutil.parser import parse

test = '''- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010'''

list_of_dates = []
for line in test.split('\n'):
    for date in line.split(';'):
        list_of_dates.append(date.strip(' - '))

def is_date(string):
    try:
        parse(string)
        return True
    except ValueError:
        return False

found_dates = []
for date in list_of_dates:
    if is_date(date):
        found_dates.append(parse(date))

for date in found_dates:
    print(date)
Result:
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-21 00:00:00
2009-03-22 00:00:00
2009-02-04 00:00:00
2009-09-04 00:00:00
2010-10-04 00:00:00
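Since the question explicitly asks for a regex, here is one pattern that matches all four bullet groups. It is a sketch, not exhaustive: it will also accept some invalid strings (e.g. 'Mar 99, 2009'), so candidates it finds are best handed to dateutil for validation as above:

```python
import re

# Abbreviated or full month name, with an optional trailing period
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?'
DATE_RE = re.compile(
    r'\d{1,2}\s+' + MONTH + r',?\s+\d{4}' +                        # 20 Mar. 2009 / 20 March, 2009
    r'|' + MONTH + r'[-\s]\d{1,2}(?:st|nd|rd|th)?[-,\s]+\d{4}' +   # Mar-20-2009 / Mar 20th, 2009
    r'|' + MONTH + r'\s+\d{4}'                                     # Feb 2009
)

text = 'Report dated Mar. 20, 2009 and again 20 March, 2009; summary Feb 2009.'
print(DATE_RE.findall(text))
```

The alternation order matters: the month-plus-year fallback is tried last, so 'Mar 20, 2009' is captured whole rather than stopping at 'Mar 20'.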
