Date parsing from full sentences - python

I'm trying to parse dates from individual health records. Since the entries appear to have been entered manually, the date formats are all over the place. My regex patterns are apparently not making the cut for several observations. Here's the list of tasks I need to accomplish, along with the accompanying code. The DataFrame has been subsetted to 16 observations for convenience.
Parse dates:
# Create DF:
health_records = ['08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',
'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',
'28 Sep 2015 Primary Care Doctor:',
'06 Mar 1974 Primary Care Doctor:',
'none; but currently has appt with new HJH PCP Rachel Salas, MD on October. 11, 2013 Other Agency Involvement: No',
'.Came back to US on Jan 24 1986, saw Dr. Quackenbush at Beaufort Memorial Hospital. Checked VPA level and found it to be therapeutic and confirmed BPAD dx. Also, has a general physician exam and found to be in good general health, except for being slightly overwt',
'September. 15, 2011 Total time of visit (in minutes):',
'sLanguage based learning disorder, dyslexia. Placed on IEP in 1st grade through Westerbrook HS prior to transitioning to VFA in 8th grade. Graduated from VF Academy in May 2004. Attended 1.5 years college at Arcadia.Employment Currently employed: Yes',
') - Zoloft 100 mg daily: February, 2010 : self-discontinued due to side effects (unknown)',
'6/1998 Primary Care Doctor:',
'12/2008 Primary Care Doctor:',
'ran own business for 35 years, sold in 1985',
'011/14/83 Audit C Score Current:',
'safter evicted in February 1976, hospitalized at Pemberly for 1 mo.Hx of Outpatient Treatment: No',
'. Age 16, 1991, frontal impact. out for two weeks from sports.',
's Mr. Moss is a 27-year-old, Caucasian, engaged veteran of the Navy. He was previously scheduled for an intake at the Southton Sanitorium in January, 2013 but cancelled due to ongoing therapy (see Psychiatric History for more details). He presents to the current intake with primary complaints of sleep difficulties, depressive symptoms, and PTSD.']
import numpy as np
import pandas as pd
df = pd.DataFrame(health_records, columns=['records'])
# Date parsing, pattern 1:
df['new'] = (df['records'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
             .astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '', regex=True).str.strip())
# Date parsing, pattern 2:
df['new2'] = (df['records'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
              .astype(str).str.replace(r'\[|\]|\'|,', '', regex=True).str.strip())
df['date'] = df['new'] + df['new2']
and here is the output:
df['date']
0 08/11/78
1 7/11/77 16/0.83 4.9/36
2 28 Sep 2015
3 06 Mar 1974
4
5 24 1986
6
7 May 2004
8
9 6/1998
10 12/2008
11
12 011/14
13 February 1976
14
15
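Incidentally, part of the noise in row 1 comes from the unescaped dots in pattern 1: a bare . matches any character, so fragments like 16/0.83 and 4.9/36 qualify as dates. Restricting the separators cuts some of that noise (a sketch; ratios like 4.9/36/308 can still partially match):
# allow only / or - between the number groups, instead of any character
df['new'] = (df['records'].str.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}')
             .str.join(' '))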
As you can see, in some places the code works perfectly, but in complex sentences my patterns either fail or spit out inaccurate results. Here is a list of all possible date combinations (a combined pattern covering them is sketched after the list):
04/20/2009; 04/20/09; 4/20/09; 4/3/09;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009 Mar 20th, 2009;
Mar 21st, 2009; Mar 22nd, 2009;
Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009
2009; 2010
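For reference, a single pattern along these lines covers most of the variants above; a sketch only (alternatives ordered so the fuller forms are tried first), not battle-tested against the full data:
import re
MON = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?'
combined = re.compile(
    r'\d{1,2}/\d{1,2}/\d{2,4}'                                # 04/20/2009, 4/3/09
    r'|\d{1,2} ' + MON + r',? \d{4}'                          # 20 Mar 2009, 20 March, 2009
    r'|' + MON + r'[-. ]?\d{1,2}(?:st|nd|rd|th)?,?[- ]\d{4}'  # Mar-20-2009, Mar 20th, 2009
    r'|' + MON + r',? \d{4}'                                  # Feb 2009, Oct 2010
    r'|\d{1,2}/\d{4}'                                         # 6/2008, 12/2009
    r'|\d{4}'                                                 # 2009, 2010
)
df['all_dates'] = df['records'].str.findall(combined)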
Clean dates
Next I tried to clean the dates, using a solution provided here. It should work, since my format is similar to the one in that problem, but it doesn't.
# Clean dates to date format
df['clean_date'] = df.date.apply(
    lambda x: pd.to_datetime(x).strftime('%m/%d/%Y'))
df['clean_date']
The above code does not work. Any help would be deeply appreciated. Thanks for your time!
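Presumably this fails because pd.to_datetime raises on the first string it cannot parse, and some of the combined strings above contain several date-like fragments. Coercing errors to NaT at least shows which rows parse cleanly; a minimal sketch:
# unparseable entries become NaT instead of raising ValueError
df['clean_date'] = (pd.to_datetime(df['date'], errors='coerce')
                    .dt.strftime('%m/%d/%Y'))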

Well, I figured it out on my own. I still had to make some manual adjustments.
df['new'] = (df['header'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
             .astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '', regex=True).str.strip())
df['new2'] = (df['header'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
              .astype(str).str.replace(r'\[|\]|\'|,', '', regex=True).str.strip())
# year-only entries live in rows 455:501 of the full dataset ('header' column)
df['new3'] = (df['header'][455:501].str.findall(r'\d{4}')
              .astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '', regex=True).str.strip())
# Coerce dates data to date format; fill the slice-limited column
# so rows outside 455:501 are not wiped out by NaN concatenation
df['date1'] = df['new'].str.strip() + df['new2'].str.strip() + df['new3'].fillna('').str.strip()
df['date1'] = pd.to_datetime(df['date1'], errors='coerce')
df[['date1', 'header']].sort_values(by='date1')
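As an alternative to stacking regex patterns, dateutil's fuzzy mode can often pull a date straight out of free text. A sketch, assuming python-dateutil is installed and using the 'records' column from the small example; note that it silently fills missing fields from the current date, so treat the results with care:
from dateutil import parser

def fuzzy_date(text):
    # return the first date dateutil can find in the text, else None
    try:
        return parser.parse(text, fuzzy=True)
    except (ValueError, OverflowError):
        return None

df['fuzzy'] = df['records'].apply(fuzzy_date)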


Python - div scraping where no tag is found

I'm very new to Python and a novice at development in general. I've been tasked with iterating through a div, and I am able to get some data returned. However, when no tag exists I'm getting null results. Would anyone be able to help? I have researched and tried for days now. I truly appreciate your patience and explanations. I'm trying to extract the following:
Start Date [i.e., Sat 31 Jul 2021], which I get results for with my code
End Date [i.e., Fri 20 Aug 2021], for which I get no results with my code
Description [i.e., 20 Night New! Malta, The Adriatic & Greece], for which I get no results with my code
Ship Name [i.e., Viking Sea], which I get results for with my code
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
Here is my code (no judgements please..haha)
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")

# initiate data storage
cruisestartdate = []
cruiseenddate = []
itinerarydescription = []
cruiseshipname = []
cruisecabinprice = []
destinationportname = []
portdatetime = []

cruise_div = soup.find_all('div', class_='cd-info')

# our loop through each container
for container in cruise_div:
    # cruise start date
    startdate = container.b.text
    print(startdate)
    cruisestartdate.append(startdate)
    # cruise end date
    enddate = container.string
    cruiseenddate.append(enddate)
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)

# pandas dataframe
cruise = pd.DataFrame({
    'Sail Date': cruisestartdate,
    'End Date': cruiseenddate,
    #'Description': description,
    'Ship Name': cruiseshipname,
})

print(soup)
print(cruisestartdate)
print(cruiseenddate)
print(itinerarydescription)
print(cruiseshipname)
Here are my results from the print:
['Sat 31 Jul 2021'] [None] [] ['Viking Sea']
container.text is a nicely formatted block of lines. Just split it and use the pieces:
cruise_div = soup.find_all('div', class_='cd-info')

# our loop through each container
for container in cruise_div:
    lines = container.text.splitlines()
    cruisestartdate.append(lines[1])
    cruiseenddate.append(lines[2])
    itinerarydescription.append(lines[3])
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)
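One caveat: the splitlines() indices depend on the div's exact source whitespace (hence starting at lines[1], because the first line is blank). Dropping blank lines first makes the indexing a bit more predictable; a minimal sketch:
lines = [ln.strip() for ln in container.text.splitlines() if ln.strip()]
# lines[0] == 'From Sat 31 Jul 2021', lines[1] == '(To Fri 20 Aug 2021)',
# lines[2] is the itinerary description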
Another solution, using bs4 API:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.content, "html.parser")

all_data = []
for div in soup.select("div.cd-info"):
    from_ = div.b
    to_ = from_.find_next_sibling(text=True).strip()
    # remove (To )
    to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
    desc = from_.find_next("b").get_text(strip=True)
    # remove html chars
    desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
    ship_name = div.a.get_text(strip=True)
    all_data.append(
        [
            from_.get_text(strip=True),
            to_,
            desc,
            ship_name,
        ]
    )

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
4 Fri 13 Aug 2021 Mon 23 Aug 2021 10 Night New! Malta & Greek Isles Discovery Viking Venus
5 Fri 13 Aug 2021 Thu 2 Sep 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
6 Sat 14 Aug 2021 Sat 21 Aug 2021 7 Night Iceland's Natural Beauty Viking Sky
7 Mon 16 Aug 2021 Mon 13 Sep 2021 28 Night Star Collector: From Greek Gods to Gaudí Wind Surf
8 Mon 16 Aug 2021 Fri 3 Sep 2021 18 Night Star Collector: Flamenco of the Mediterranean Wind Surf
9 Mon 16 Aug 2021 Wed 29 Sep 2021 44 Night Star Collector: Myths & Masterpieces of the Mediterranean Wind Surf
and saves data.csv.
EDIT: To scrape the data from all pages:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {"Accept-Language": "en-US, en;q=0.5"}
# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="

offset = 1
all_data = []
while True:
    u = url + f"&offset={offset}"
    print(f"Getting offset {offset=}")
    results = requests.get(u, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    for div in soup.select("div.cd-info"):
        from_ = div.b
        to_ = from_.find_next_sibling(text=True).strip()
        # remove (To )
        to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
        desc = from_.find_next("b").get_text(strip=True)
        # remove html chars
        desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
        ship_name = div.a.get_text(strip=True)
        all_data.append(
            [
                from_.get_text(strip=True),
                to_,
                desc,
                ship_name,
            ]
        )
    # get next offset:
    offset = soup.select_one(".cd-buttonlist > b + a")
    if offset:
        offset = int(offset.get_text(strip=True).split("-")[0])
    else:
        break

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
Getting offset offset=1
Getting offset offset=11
Getting offset offset=21
...
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
...
144 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nights Mediterranean MSC Splendida
145 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nachte Grose Freiheit - Schwedische Kuste 1 Mein Schiff 6
146 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Night Iceland's Natural Beauty Viking Jupiter
Getting the start date, description, and ship name is straightforward. The only trick here is to get the end date, which does not lie inside a specific tag. To get it, you have to tweak some regex. Here is one solution, similar to @Andrej's:
from bs4 import BeautifulSoup
import re
html = """
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
bold = soup.find_all('b')
print(f"Start date: {bold[0].text}")
print(f"Description: {bold[1].text}")
untagged_line = re.sub("\n", "", soup.find(text=re.compile('To')))
end_date = re.sub(r"\(To (.*)\)", r"\1", untagged_line)
print(f"End date: {end_date}")
ship = soup.find('a', class_='red')
print(f"Ship: {ship.text}")
Output:
Start date: Sat 31 Jul 2021
Description: 20 Night New! Malta, The Adriatic & Greece
End date: Fri 20 Aug 2021
Ship: Viking Sea
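As an aside, newer BeautifulSoup releases prefer the string= keyword over text= in find() and find_next_sibling(); the old name still works, but recent versions may warn about it. The equivalent call here would be:
untagged_line = re.sub("\n", "", soup.find(string=re.compile('To')))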

Unable to find a way to store this scraped data so that I can access it later with a simple loop

I was trying to scrape all the upcoming event details from an institution:
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.iitg.ac.in/home/eventsall/events")
soup = BeautifulSoup(response.content, "html.parser")
cards = soup.find_all("div", attrs={"class": "newsarea"})

iitg_title = []
iitg_date = []
iitg_link = []
for card in cards[0:6]:
    iitg_date.append(card.find("div", attrs={"class": "ndate"}).text)
    iitg_title.append(card.find("div", attrs={"class": "ntitle"}).text.strip())
    iitg_link.append(card.find("div", attrs={"class": "ntitle"}).a['href'])

print("Upcoming event details scraped from iitg website:- \n")
for i in range(len(iitg_title)):
    print("Title:- ", iitg_title[i])
    print("Dates:- ", iitg_date[i])
    print("Link:- ", iitg_link[i])
    print('\n')
And the above code fetched me these details:
Upcoming event details scraped from iitg website:-
Title:- 4 batch for the certification programme on AI & ML by Eckovation in association with E&ICT Academy IIT Guwahati
Dates:- 15 Aug 2020 - 15 Aug 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 8th International and 47th National conference on Fluid Mechanics and Fluid Power
Dates:- 09 Dec 2020 - 11 Dec 2020
Link:- https://event.iitg.ac.in/fmfp2020/
Title:- 4 months Internship programme on VLSI Circuit Design
Dates:- 10 Aug 2020 - 10 Dec 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 6 week Training cum Internship programme on AI & ML under TEQIP-III orgainsed by Assam Science Technology University
Dates:- 10 Aug 2020 - 20 Sep 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 6 week Training cum Internship programme on Industry 4.0 (Industrial IoT) under TEQIP-III orgainsed by Assam Science Technology University
Dates:- 10 Aug 2020 - 20 Sep 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 6 week Training cum Internship programme on Robotics Fundamentals under TEQIP-III orgainsed by Assam Science Technology University
Dates:- 10 Aug 2020 - 20 Sep 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
For the past five hours I have been racking my brain over how to store my results in such a way that I can access them later with a simple for loop.
How can I make this possible?
You can use, for example, the json module to write the data to disk:
import json
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.iitg.ac.in/home/eventsall/events")
soup = BeautifulSoup(response.content, "html.parser")
cards = soup.find_all("div", attrs={"class": "newsarea"})

events = []
for card in cards[0:6]:
    events.append((
        card.find("div", attrs={"class": "ntitle"}).text.strip(),
        card.find("div", attrs={"class": "ndate"}).text,
        card.find("div", attrs={"class": "ntitle"}).a['href']
    ))

# save data:
with open('data.json', 'w') as f_out:
    json.dump(events, f_out)

# ...

# load data back:
with open('data.json', 'r') as f_in:
    events = json.load(f_in)

print("Upcoming event details scraped from iitg website:- \n")
for t, d, l in events:
    print("Title:- ", t)
    print("Dates:- ", d)
    print("Link:- ", l)
    print('\n')
Prints:
Upcoming event details scraped from iitg website:-
Title:- 4 batch for the certification programme on AI & ML by Eckovation in association with E&ICT Academy IIT Guwahati
Dates:- 15 Aug 2020 - 15 Aug 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 8th International and 47th National conference on Fluid Mechanics and Fluid Power
Dates:- 09 Dec 2020 - 11 Dec 2020
Link:- https://event.iitg.ac.in/fmfp2020/
Title:- 4 months Internship programme on VLSI Circuit Design
Dates:- 10 Aug 2020 - 10 Dec 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 6 week Training cum Internship programme on AI & ML under TEQIP-III orgainsed by Assam Science Technology University
Dates:- 10 Aug 2020 - 20 Sep 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 6 week Training cum Internship programme on Industry 4.0 (Industrial IoT) under TEQIP-III orgainsed by Assam Science Technology University
Dates:- 10 Aug 2020 - 20 Sep 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
Title:- 6 week Training cum Internship programme on Robotics Fundamentals under TEQIP-III orgainsed by Assam Science Technology University
Dates:- 10 Aug 2020 - 20 Sep 2020
Link:- http://eict.iitg.ac.in/online_courses_training.html
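If you would rather have the stored data in tabular form, the same events list drops straight into pandas (assuming pandas is installed; the column names here are my own):
import pandas as pd

df = pd.DataFrame(events, columns=["title", "dates", "link"])
df.to_csv("events.csv", index=False)

# later, in another script:
df = pd.read_csv("events.csv")
for row in df.itertuples(index=False):
    print("Title:- ", row.title)
    print("Dates:- ", row.dates)
    print("Link:- ", row.link)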

Extracting multiple date formats through regex in python

I am trying to extract dates from text in Python. These are the possible texts and date patterns:
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far:
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1 and 2:
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print(match.group())
This somewhat captures cases 3, 4 and 5, but it prints everything from the text. In the case below I want 25 Nov 2016, but the regex pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
So Question 1: How do I get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print(match.group())
Question 2: Similarly, how do I capture cases 6, 7 and 8? What should the regex be for that?
If not, is there a better way to capture dates in these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note I made all the groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digit pattern, changed the year-matching pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.
The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
NOTE: In case you need to match years in a safer way, you may want to make the year pattern more precise. E.g., if you want to avoid matching the 4- or 2-digit whole words after :, add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
            ^^^^^^
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
Here is the Python demo:
import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'
rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = ["Auction details: 14 December 2016, Pukekohe Park","Auction details: 17 Feb 2017, Gold Sacs Road","Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)","Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)","Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)","Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)","Auction details: Thursday, 28th February '19","Auction details: Friday, 1st February '19","Friday 25 Nov 3:00 p.m. (On Site)(2016)"]
for s in strs:
print(s)
m = re.search(rx, s)
if m:
print("{} {}".format(m.group(1), m.group(2)))
else:
print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016

Concatenate ListA elements with partially matching ListB elements

Say I have two Python lists:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res = set()
for i in ListB:
    for j in ListA:
        if j in i:
            res.add(f'{i} {j}')
print(res)
This gives me the following result:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think regular expressions could be handy here, but I'm not sure how to approach this. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the same month/year value from ListB has to be matched and picked for both the 'Sales' and 'Units sold' components and concatenated.
My main goal here is to get a list which I can use later to generate a statement for writing a Hive query.
Added more explanation as suggested by @andrew_reece.
Assuming there are no additional edge cases that need taking care of, your original code is not bad; it just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)

List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
 'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example, I first concatenate the ListB pairs together and then join ListA with this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]
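Note that the zip-based versions assume ListB alternates its 'Sales' and 'Units sold' entries in the same order as ListA. If the order is not guaranteed, as the question's edit warns, grouping ListB by its trailing month/year key is safer; a sketch:
from collections import defaultdict

groups = defaultdict(list)
for b in ListB:
    # key each entry on its last two words, e.g. 'Jan 2018'
    groups[' '.join(b.split()[-2:])].append(b)

List_op = [' '.join([a] + groups[a]) for a in ListA]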

(bs4) trying to differentiate different containers in an HTML page

I have a web page from the Houses of Parliament. It has information on MPs' declared interests, and I would like to store all MP interests for a project that I am thinking of.
root = 'https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm'
root is an example web page. I want my output to be a dictionary, as there are interests under different subheadings, and each entry could be a list.
Problem: if you look at the page, the first interest (employment and earnings) is not wrapped in a container; the heading is its own tag, not connected to the text underneath it. I could call soup.find_all('p', {'xmlns': 'http://www.w3.org/1999/xhtml'}), but that returns the headings of the expenses, and a few other headings like her name, without the text under them, which makes it difficult to iterate through the headings and store the information.
What would be the best way of iterating through the page, storing each heading, and the information under each heading?
Something like this may work:
import urllib.request
from bs4 import BeautifulSoup

ret = {}
page = urllib.request.urlopen("https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm")
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'lxml')

valid = False
key = None  # guard: stays None until the first <strong> heading is seen
value = ""
for i in soup.findAll('p'):
    if i.find('strong') and i.text is not None:
        # ignore first pass
        if valid:
            ret[key] = value
            value = ""
        valid = True
        key = i.text
    elif i.text is not None:
        value = value + " " + i.text

# get last entry
if key is not None:
    ret[key] = value

for x in ret:
    print(x)
    print(ret[x])
Outputs
4. Visits outside the UK
Name of donor: (1) Stop Aids (2) Aids Alliance Address of donor: (1) Grayston Centre, 28 Charles St, London N1 6HT (2) Preece House, 91-101 Davigdor Rd, Hove BN3 1RE Amount of donation (or estimate of the probable value): for myself and a member of staff, flights £2,784, accommodation £380.52, other travel costs £172, per diems £183; total £3,519.52. These costs were divided equally between both donors. Destination of visit: Uganda Date of visit: 11-14 November 2015 Purpose of visit: to visit the different organisations and charities (development) in regards to AIDS and HIV. (Registered 09 December 2015)Name of donor: Muslim Charities Forum Address of donor: 6 Whitehorse Mews, 37 Westminster Bridge Road, London SE1 7QD Amount of donation (or estimate of the probable value): for a member of staff and myself, return flights to Nairobi £5,170; one night's accommodation in Hargeisa £107.57; one night's accommodation in Borama £36.21; total £5,313.78 Destination of visit: Somaliland Date of visit: 7-10 April 2016 Purpose of visit: to visit the different refugee camps and charities (development) in regards to the severe drought in Somaliland. (Registered 18 May 2016)Name of donor: British-Swiss Chamber of Commerce Address of donor: Bleicherweg, 128002, Zurich, Switzerland Amount of donation (or estimate of the probable value): flights £200.14; one night's accommodation £177, train fare Geneva to Zurich £110; total £487.14 Destination of visit: Geneva and Zurich, Switzerland Date of visit: 28-29 April 2016 Purpose of visit: to participate in a public panel discussion in Geneva in front of British-Swiss Chamber of Commerce, its members and guests. (Registered 18 May 2016) 
2. (b) Any other support not included in Category 2(a)
Name of donor: Ann Pettifor Address of donor: private Amount of donation or nature and value if donation in kind: £1,651.07 towards rent of an office for my mayoral campaign Date received: 28 August 2015 Date accepted: 30 September 2015 Donor status: individual (Registered 08 October 2015)
1. Employment and earnings
Fees received for co-presenting BBC’s ‘This Week’ TV programme. Address: BBC Broadcasting House, Portland Place, London W1A 1AA. (Registered 04 November 2013)14 May 2015, received £700. Hours: 3 hrs. (Registered 03 June 2015)4 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)18 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)16 July 2015, received £700. Hours: 3 hrs. (Registered 07 August 2015)8 January 2016, received £700 for an appearance on 17 December 2015. Hours: 3 hrs. (Registered 14 January 2016)28 July 2015, received £4,000 for taking part in Grant Thornton’s panel at the JLA/FD Intelligence Post-election event. Address: JLA, 14 Berners Street, London W1T 3LJ. Hours: 5 hrs. (Registered 07 August 2015)23rd October 2015, received £1,500 for co-presenting BBC’s "Have I Got News for You" TV programme. Address: Hat Trick Productions, 33 Oval Road Camden, London NW1 7EA. Hours: 5 hrs. (Registered 26 October 2015)10 October 2015, received £1,400 for taking part in a talk at the New Wolsey Theatre in Ipswich. Address: Clive Conway Productions, 32 Grove St, Oxford OX2 7JT. Hours: 5 hrs. (Registered 26 October 2015)21 March 2016, received £4,000 via Speakers Corner (London) Ltd, Unit 31, Highbury Studios, 10 Hornsey Street, London N7 8EL, from Thompson Reuters, Canary Wharf, London E14 5EP, for speaking and consulting on a panel. Hours: 10 hrs. (Registered 06 April 2016)
Abbott, Ms Diane (Hackney North and Stoke Newington)
House of Commons
Session 2016-17
Publications on the internet
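Note that page furniture such as 'House of Commons' and 'Session 2016-17' also sits inside <strong> tags, so it ends up in the dictionary too. If only the numbered interest categories are wanted, a filter on the first character of each key works; a sketch:
categories = {k: v for k, v in ret.items() if k and k[0].isdigit()}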
