How to web scrape a chart using Python?

I am trying to scrape a chart from this website into a .csv file using Python 3: 2016 NBA National TV Schedule
The chart starts out like:
Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN
I am using these packages:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
The output I want in the .csv file is the first six rows of the chart, one game per row, with the date repeated for every game played on that day. Notice how each date is used more than once. How do I implement the scraper to get this output?

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')

dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])

all_data = {'Date': [], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
Date Time Team 1 Team 2 Network
0 Tuesday, October 25 8:00 PM Knicks Cavaliers TNT
1 Tuesday, October 25 10:30 PM Spurs Warriors TNT
2 Wednesday, October 26 8:00 PM Thunder Sixers ESPN
3 Wednesday, October 26 10:30 PM Rockets Lakers ESPN
4 Thursday, October 27 8:00 PM Celtics Bulls TNT
.. ... ... ... ... ...
159 Saturday, April 8 8:30 PM Clippers Spurs ABC
160 Monday, April 10 8:00 PM Wizards Pistons TNT
161 Monday, April 10 10:30 PM Rockets Clippers TNT
162 Wednesday, April 12 8:00 PM Hawks Pacers ESPN
163 Wednesday, April 12 10:30 PM Pelicans Blazers ESPN
[164 rows x 5 columns]
And saves data.csv (screenshot from LibreOffice):
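The heart of this is the itertools.groupby call: the key function flags lines that contain a weekday name, so each run of game lines is grouped under the date line that precedes it. A minimal, self-contained sketch of just that grouping, reusing the answer's regex on the sample lines from the question:

```python
import re
from itertools import groupby

days = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
data = [
    'Tuesday, October 25',
    '8:00 PM Knicks/Cavaliers TNT',
    '10:30 PM Spurs/Warriors TNT',
    'Wednesday, October 26',
    '8:00 PM Thunder/Sixers ESPN',
]

dates, last = {}, ''
for is_date, group in groupby(data, lambda k: any(d in k for d in days)):
    if is_date:
        # a run of date lines: remember the most recent one
        last = list(group)[0]
        dates[last] = []
    else:
        # a run of game lines: parse time, both teams and network out of each
        dates[last].extend(re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', g)[0] for g in group)

print(dates)
# {'Tuesday, October 25': [('8:00 PM', 'Knicks', 'Cavaliers', 'TNT'), ...], ...}
```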


web scraping for sunrise and sunset data using National oceanic and atmospheric administration

I want to scrape sunrise and sunset data from NOAA (https://gml.noaa.gov/grad/solcalc/). The data I want are the sunrise and sunset timings for various US counties over the last 3 years. I have the coordinates of those counties.
The problem I am facing is that I don't know how to use those coordinates and set the time frame to 3 years while scraping the site, so that I don't have to specify them manually each time.
I am using Python for scraping.
I need the data in the following format:
latitude | Longitude | year | Month | day | Sunrise | sunset
I am new to programming. I tried the methods I found on the web, but nothing served my purpose.
You can use the table.php page to get your data and read it with Pandas. This PHP script needs three parameters: year, lat and lon.
import pandas as pd
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'
}

# Fill this table with your counties
counties = {
    'NY': {'lat': 40.72, 'lon': -74.02},
    'LA': {'lat': 37.77, 'lon': -122.42}
}

url = 'https://gml.noaa.gov/grad/solcalc/table.php'

dataset = []
for year in range(2020, 2023):
    for county, params in counties.items():
        print(year, county)
        payload = params | {'year': year}
        r = requests.get(url, headers=headers, params=payload)
        dfs = pd.read_html(r.text)
        # Reshape your data
        dfs = (pd.concat(dfs, keys=['Sunrise', 'Sunset', 'SolarNoon']).droplevel(1)
                 .assign(Year=year, Lat=params['lat'], Lon=params['lon'])
                 .set_index(['Lat', 'Lon', 'Year', 'Day'], append=True)
                 .rename_axis(columns='Month').stack('Month')
                 .unstack(level=0).reset_index())
        dataset.append(dfs)
        time.sleep(10)  # Wait at least 10 seconds so as not to be banned

out = pd.concat(dataset, ignore_index=True)
out.to_csv('solarcalc.csv', index=False)
Output:
Lat Lon Year Day Month SolarNoon Sunrise Sunset
0 40.72 -74.02 2020 1 Jan 11:59:16 07:20 16:39
1 40.72 -74.02 2020 1 Feb 12:09:33 07:07 17:13
2 40.72 -74.02 2020 1 Mar 12:08:22 06:29 17:48
3 40.72 -74.02 2020 1 Apr 12:59:52 06:39 19:21
4 40.72 -74.02 2020 1 May 12:53:10 05:54 19:53
... ... ... ... ... ... ... ... ...
2187 37.77 -122.42 2022 31 May 13:07:22 05:50 20:25
2188 37.77 -122.42 2022 31 Jul 13:16:06 06:12 20:19
2189 37.77 -122.42 2022 31 Aug 13:10:04 06:39 19:40
2190 37.77 -122.42 2022 31 Oct 12:53:15 07:34 18:12
2191 37.77 -122.42 2022 31 Dec 12:12:35 07:25 17:01
[2192 rows x 8 columns]
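One thing to note: `params | {'year': year}` uses the dict union operator, which requires Python 3.9 or later. On older versions you can unpack into a new dict instead; a quick sketch of the equivalence:

```python
params = {'lat': 40.72, 'lon': -74.02}
year = 2020

# Python 3.9+: dict union operator
payload = params | {'year': year}

# Equivalent on earlier Python 3: unpacking into a new dict
payload_compat = {**params, 'year': year}

print(payload == payload_compat)  # True
```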
Note: if you prefer Month as a number, use:
month2num = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
out['Month'] = out['Month'].replace(month2num)
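Going one step further, if you want a single date column instead of separate Year/Month/Day columns, `pd.to_datetime` can assemble one. A sketch on a toy frame whose column names mirror the scraper's output (the frame itself is made up for illustration):

```python
import pandas as pd

# toy frame shaped like the scraper's output
out = pd.DataFrame({'Year': [2020, 2020], 'Month': ['Jan', 'Feb'], 'Day': [1, 1],
                    'Sunrise': ['07:20', '07:07'], 'Sunset': ['16:39', '17:13']})

# '%b' parses abbreviated month names such as 'Jan'
out['Date'] = pd.to_datetime(
    out['Year'].astype(str) + ' ' + out['Month'] + ' ' + out['Day'].astype(str),
    format='%Y %b %d')
print(out['Date'].tolist())
```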

Python - div scraping where no tag is found

I'm very new to Python and a novice at development in general. I've been tasked with iterating through a div, and I am able to get some data returned. However, when no tag exists I'm getting null results. Would anyone be able to help? I have researched and tried for days now. I truly appreciate your patience and explanations. I'm trying to extract the following:
Start Date [ie, Sat 31 Jul 2021] which I get results from based on my code
End Date [ie, Fri 20 Aug 2021] this one I get no results based on my code
Description [ie, 20 Night New! Malta, The Adriatic & Greece] this one I get no results based on my code
Ship Name [ie, Viking Sea] which I get results from based on my code
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
Here is my code (no judgements please..haha)
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")

#initiate data storage
cruisestartdate = []
cruiseenddate = []
itinerarydescription = []
cruiseshipname = []
cruisecabinprice = []
destinationportname = []
portdatetime = []

cruise_div = soup.find_all('div', class_='cd-info')

#our loop through each container
for container in cruise_div:
    #cruise start date
    startdate = container.b.text
    print(startdate)
    cruisestartdate.append(startdate)
    # cruise end date
    enddate = container.string
    cruiseenddate.append(enddate)
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)

#pandas dataframe
cruise = pd.DataFrame({
    'Sail Date': cruisestartdate,
    'End Date': cruiseenddate,
    #'Description': description,
    'Ship Name': cruiseshipname,
    #'imdb': imdb_ratings,
    #'metascore': metascores,
    #'votes': votes,
    #'us_grossMillions': us_gross,
})

print(soup)
print(cruisestartdate)
print(cruiseenddate)
print(itinerarydescription)
print(cruiseshipname)
Here are my results from the print:
['Sat 31 Jul 2021'] [None] [] ['Viking Sea']
container.text is a nicely formatted block of lines. Just split it and use the pieces:
cruise_div = soup.find_all('div', class_='cd-info')

#our loop through each container
for container in cruise_div:
    lines = container.text.splitlines()
    cruisestartdate.append(lines[1])
    cruiseenddate.append(lines[2])
    itinerarydescription.append(lines[3])
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)
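To see why indices 1 to 3 are the right ones, run splitlines() on the sample div from the question: line 0 is the leading whitespace before "From", then the start date, end date and description follow in order. A self-contained check (the HTML is trimmed from the question's sample):

```python
from bs4 import BeautifulSoup

html = '''<div class="cd-info">
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic &amp; Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
</div>'''

container = BeautifulSoup(html, 'html.parser').find('div', class_='cd-info')
lines = container.text.splitlines()
print(lines[1:4])
# ['From Sat 31 Jul 2021', '(To Fri 20 Aug 2021)', '20 Night New! Malta, The Adriatic & Greece']
```

Note that lines[1] keeps the leading "From ", so you may want to strip that prefix before storing the start date.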
Another solution, using bs4 API:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="

headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.content, "html.parser")

all_data = []
for div in soup.select("div.cd-info"):
    from_ = div.b
    to_ = from_.find_next_sibling(text=True).strip()
    # remove (To )
    to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
    desc = from_.find_next("b").get_text(strip=True)
    # remove html chars
    desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
    ship_name = div.a.get_text(strip=True)
    all_data.append(
        [
            from_.get_text(strip=True),
            to_,
            desc,
            ship_name,
        ]
    )

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
4 Fri 13 Aug 2021 Mon 23 Aug 2021 10 Night New! Malta & Greek Isles Discovery Viking Venus
5 Fri 13 Aug 2021 Thu 2 Sep 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
6 Sat 14 Aug 2021 Sat 21 Aug 2021 7 Night Iceland's Natural Beauty Viking Sky
7 Mon 16 Aug 2021 Mon 13 Sep 2021 28 Night Star Collector: From Greek Gods to Gaudí Wind Surf
8 Mon 16 Aug 2021 Fri 3 Sep 2021 18 Night Star Collector: Flamenco of the Mediterranean Wind Surf
9 Mon 16 Aug 2021 Wed 29 Sep 2021 44 Night Star Collector: Myths & Masterpieces of the Mediterranean Wind Surf
and saves data.csv (screenshot from LibreOffice):
EDIT: To scrape the data from all pages:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {"Accept-Language": "en-US, en;q=0.5"}

# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="

offset = 1
all_data = []
while True:
    u = url + f"&offset={offset}"
    print(f"Getting offset {offset=}")
    results = requests.get(u, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    for div in soup.select("div.cd-info"):
        from_ = div.b
        to_ = from_.find_next_sibling(text=True).strip()
        # remove (To )
        to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
        desc = from_.find_next("b").get_text(strip=True)
        # remove html chars
        desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
        ship_name = div.a.get_text(strip=True)
        all_data.append(
            [
                from_.get_text(strip=True),
                to_,
                desc,
                ship_name,
            ]
        )
    # get next offset:
    offset = soup.select_one(".cd-buttonlist > b + a")
    if offset:
        offset = int(offset.get_text(strip=True).split("-")[0])
    else:
        break

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
Getting offset offset=1
Getting offset offset=11
Getting offset offset=21
...
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
...
144 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nights Mediterranean MSC Splendida
145 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nachte Grose Freiheit - Schwedische Kuste 1 Mein Schiff 6
146 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Night Iceland's Natural Beauty Viking Jupiter
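The pagination hinges on the `.cd-buttonlist > b + a` selector: the current page range is rendered as a `<b>`, so the `<a>` immediately following it is the "next" link. A toy check of that selector (the pager markup below is a simplified assumption about the real page):

```python
from bs4 import BeautifulSoup

# hypothetical, simplified pager: the <b> marks the current result range
html = '<div class="cd-buttonlist"><a>1-10</a><b>11-20</b><a>21-30</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# adjacent-sibling selector: the <a> directly after the <b>
nxt = soup.select_one('.cd-buttonlist > b + a')
offset = int(nxt.get_text(strip=True).split('-')[0])
print(offset)  # 21
```

On the last page there is no `<a>` after the `<b>`, so `select_one` returns None and the loop breaks.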
Getting the start date, description, and ship name is straightforward. The only trick here is to get the end date, which does not lie inside any specific tag. To get it, you have to tweak some regex. Here is one solution similar to @Andrej's:
from bs4 import BeautifulSoup
import re
html = """
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
bold = soup.find_all('b')
print(f"Start date: {bold[0].text}")
print(f"Description: {bold[1].text}")
untagged_line = re.sub("\n", "", soup.find(text=re.compile('To')))
end_date = re.sub(r"\(To (.*)\)", r"\1", untagged_line)
print(f"End date: {end_date}")
ship = soup.find('a', class_='red')
print(f"Ship: {ship.text}")
Output:
Start date: Sat 31 Jul 2021
Description: 20 Night New! Malta, The Adriatic & Greece
End date: Fri 20 Aug 2021
Ship: Viking Sea

How to exclude certain rows in a table using BeautifulSoup?

The code works fine, however, the URL I'm trying to fetch the table for seems to have headers repeated throughout the table, I'm not sure how to deal with this and remove those rows as I'm trying to get the data into BigQuery and there are certain characters which aren't allowed.
URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html')
driver.quit()

tables = soup.find_all('table', {"id": ["schedule"]})
table = tables[0]
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]

json_string = ''
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    json_string += json.dumps(dict(zip(headers, row))) + '\n'

with open('example.json', 'w') as f:
    f.write(json_string)
    print(json_string)
You can restrict the loop to tr rows that have no class attribute (the repeated in-table header rows carry a class), so you don't pick up the duplicate headers.
The following code creates a dataframe from the table:
from bs4 import BeautifulSoup
import requests
import pandas as pd

res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")
soup = BeautifulSoup(res.text, "html.parser")

table = soup.find("div", {"id": "div_schedule"}).find("table")
columns = [i.get_text() for i in table.find("thead").find_all('th')]

data = []
for tr in table.find('tbody').find_all('tr', class_=False):
    temp = [tr.find('th').get_text(strip=True)]
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)

df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS     Attend. Notes
0 Sat, Aug 1, 2020 1:00p Miami Heat 125 Denver Nuggets 105 Box Score
1 Sat, Aug 1, 2020 3:30p Utah Jazz 94 Oklahoma City Thunder 110 Box Score
2 Sat, Aug 1, 2020 6:00p New Orleans Pelicans 103 Los Angeles Clippers 126 Box Score
3 Sat, Aug 1, 2020 7:00p Philadelphia 76ers 121 Indiana Pacers 127 Box Score
4 Sat, Aug 1, 2020 8:30p Los Angeles Lakers 92 Toronto Raptors 107 Box Score
.. ... ... ... ... ... ... ... .. ... ...
75 Thu, Aug 13, 2020 Portland Trail Blazers Brooklyn Nets
76 Fri, Aug 14, 2020 Philadelphia 76ers Houston Rockets
77 Fri, Aug 14, 2020 Miami Heat Indiana Pacers
78 Fri, Aug 14, 2020 Oklahoma City Thunder Los Angeles Clippers
79 Fri, Aug 14, 2020 Denver Nuggets Toronto Raptors
[80 rows x 10 columns]
To get the data into BigQuery, you can insert the JSON directly, or upload the DataFrame with DataFrame.to_gbq: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
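If you already have a DataFrame that still contains the repeated header rows, another option is to drop them after the fact by filtering out rows whose Date cell just repeats the column name. A sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

# toy frame with a repeated header row in the middle, as scraped tables often have
df = pd.DataFrame({'Date': ['Sat, Aug 1, 2020', 'Date', 'Sun, Aug 2, 2020'],
                   'PTS':  ['125', 'PTS', '110']})

# keep only rows whose Date cell is not the literal column name
clean = df[df['Date'] != 'Date'].reset_index(drop=True)
print(clean)
```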

Extracting multiple date formats through regex in python

I am trying to extract date from text in python. These are the possible texts and date patterns in it.
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far,
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1,2.
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print match.group()
This somewhat captures cases 3, 4 and 5, but it prints everything from the text. In the case below I want 25 Nov 2016, but the regex pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
So Question 1 : How to get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print match.group()
Question 2 : Similarly, how do capture case 6,7 and 8 ?? What is the regex should be for that?
If not, is there any other better way to capture date from these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note I made all groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added (?:st|nd|rd|th)? optional group after day digit pattern, changed the year matching pattern to \b\d{2}(?:\d{2})?\b so that it only match 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where day comes before month and vice versa.
The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
NOTE: In case you need to match years in a safer way, you may want to make the year pattern more precise. E.g., if you want to avoid matching the 4- or 2-digit whole words after :, add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
^^^^^^
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
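The effect of the lookbehind is easy to check in isolation: without it, the minutes of a time like 1:00 match as a two-digit year; with it, they are skipped and the real year wins:

```python
import re

s = "1:00 p.m. (On site)(2016)"

naive = r'\b\d{2}(?:\d{2})?\b'         # matches any 2- or 4-digit word
safe = r'\b(?<!:)\d{2}(?:\d{2})?\b'    # same, but not right after a colon

print(re.search(naive, s).group())  # 00  (the minutes of 1:00)
print(re.search(safe, s).group())   # 2016
```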
Here is the Python demo:
import re

mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'

rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)

strs = ["Auction details: 14 December 2016, Pukekohe Park",
        "Auction details: 17 Feb 2017, Gold Sacs Road",
        "Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)",
        "Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)",
        "Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)",
        "Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)",
        "Auction details: Thursday, 28th February '19",
        "Auction details: Friday, 1st February '19",
        "Friday 25 Nov 3:00 p.m. (On Site)(2016)"]

for s in strs:
    print(s)
    m = re.search(rx, s)
    if m:
        print("{} {}".format(m.group(1), m.group(2)))
    else:
        print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016

How should I go about scraping the text in dd tags between specific dt tags on a page using BeautifulSoup?

I am trying to extract the text from the dd classes in between the dt tags (which are being used to mark different dates). I tried a really hacky method, but it didn't work consistently enough:
timeDiv = mezzrowSource.find_all("dd", class_="orange event-date")
eventDiv = mezzrowSource.find_all("dd", class_="event")

index = 0
for time in timeDiv:
    returnValue[timeDiv[index].text] = eventDiv[index].text.strip()
    if "8" in timeDiv[index+3].text or "4:30" in timeDiv[index+3].text:
        break
    index += 1
Enumerating in that way worked most of the time, but it would sometimes pull in events from other dates. The source of the section in question is pasted below. Any ideas?
<dt class="purple">Sun, September 30th, 2018</dt>
<dd class="orange event-date">4:30 PM to 7:00 PM</dd>
<dd class="event"><a href="/events/4094-mezzrow-classical-salon-with-david-oei"
class="event-title">Mezzrow Classical Salon with David Oei</a>
</dd>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a href="/events/4144-luke-sellick-ron-blake-adam-birnbaum"
class="event-title">Luke Sellick, Ron Blake & Adam Birnbaum</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a href="/events/4099-ryo-sasaki-friends-after-hours"
class="event-title">Ryo Sasaki & Friends "After-hours"</a>
</dd>
<dt class="purple">Mon, October 1st, 2018</dt>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a href="/events/4137-greg-ruggiero-murray-wall-steve-little"
class="event-title">Greg Ruggiero, Murray Wall & Steve Little</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a href="/events/4174-pasquale-grasso-after-hours"
class="event-title">Pasquale Grasso "After-hours"</a>
</dd>
Expected output is a dictionary that looks like this: {'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}
If I understand the question correctly you can use zip():
mezzrowSource = BeautifulSoup(html, 'lxml')
timeDiv = [tag.get_text() for tag in mezzrowSource.find_all("dd", class_="orange event-date")]
eventDiv = [tag.get_text().strip() for tag in mezzrowSource.find_all("dd", class_="event")]
print(dict(zip(timeDiv, eventDiv)))
Outputs:
{'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}
Updated:
The elements you want data from are all siblings, i.e. there is no element wrapping each set of data, which makes it harder to get the data grouped the way you want. The only thing in your favor is that the element with the date comes first, then the time, then the title, and the time and title can repeat. So this method selects all the relevant elements and iterates over them. On the first iteration it stores the date in a string and starts a list of (time, title) tuples; when it next finds a date, it appends the previous date and its list of tuples to a dictionary, and at the end of the iterations it appends the final date and list of tuples. It is a bit messy, but that is due to the lack of structure in the HTML.
from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://www.mezzrow.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

ds = soup.find_all(True, {'class': re.compile('purple|event|orange event-date')})

ret = {}
tmp = []
i = None
for d in ds:
    if d.attrs['class'] == ['purple']:
        if i is not None:
            ret[i] = tmp
            tmp = []
        i = d.get_text()
    elif d.attrs['class'] == ['orange', 'event-date']:
        j = d.get_text()
    elif d.attrs['class'] == ['event']:
        tmp.append((j, d.get_text(strip=True)))
ret[i] = tmp

pp = pprint.PrettyPrinter(depth=6)
pp.pprint(ret)
outputs:
{'Fri, October 12th, 2018': [('8:00 PM to 10:30 PM',
'Rossano Sportiello, Pasquale Grasso & Frank '
'Tate'),
('11:00 PM to 2:00 AM',
'Ben Paterson "After-hours"')],
'Fri, October 5th, 2018': [('8:00 PM to 10:30 PM',
'Vanessa Rubin, Brandon McCune, Kenny Davis & '
'Winard Harper'),
('11:00 PM to 2:00 AM',
'Joe Davidian "After-hours"')],
'Mon, October 1st, 2018': [('8:00 PM to 10:30 PM',
'Greg Ruggiero, Murray Wall & Steve Little'),
('11:00 PM to 1:00 AM',
'Pasquale Grasso "After-hours"')],
....
Then select the date you want from the dict object.
