Web scraping: Index out of Bound (Possible scaling error)

Hi, I wrote a web scraping program and it gets the ASN numbers correctly, but after all the data is scraped it raises an "index out of bounds" error (Python's "list index out of range").
I am using PyCharm and the latest Python version. Below is my code.
There is already a similar question on Stack Overflow (Web Scraping List Index Out Of Range) with the exact same error, but I am not able to put the pieces together and make it work for my list.
The error seems to be at current_country = link.split('/')[2]
Any help is appreciated. Thank you.
import urllib.request
import bs4
import re
import json
url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'
def url_to_soup(url):
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup
def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries')):
        pages.append(link.get('href'))
    return pages
def scrape_pages(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        print(current_country)
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(current_asn)
                """
                name = columns[1].string
                routes_v4 = columns[3].string
                routes_v6 = columns[5].string
                mappings[current_asn] = {'Country': current_country,
                                         'Name': name,
                                         'Routes v4': routes_v4,
                                         'Routes v6': routes_v6}
    return mappings """
main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = scrape_pages(country_links)
print(asn_mappings)

The last href containing the string "/countries" on https://ipinfo.io/countries is exactly "/countries", with nothing after it:
<li><a href="/countries">Global ASNs</a></li>
Splitting this link on "/" produces the list ["", "countries"], which has no third element, so indexing [2] raises the error. To fix this, check the list length before retrieving the third element:
...
        current_country = link.split('/')
        if len(current_country) < 3:
            continue
        current_country = current_country[2]
...
Another solution is to exclude the last href by changing the regexp to:
...
    for link in page.find_all(href=re.compile('/countries/')):
...
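The length check can also be pulled into a small helper so it is easy to try without hitting the site. This is a minimal sketch: the function name country_from_link and the sample hrefs are made up for illustration, and only the split-and-check logic comes from the answer above.

```python
def country_from_link(link):
    """Return the country segment of an href like '/countries/us', or None."""
    parts = link.split('/')
    # '/countries' splits into ['', 'countries'], so there is no index 2
    if len(parts) < 3 or not parts[2]:
        return None
    return parts[2]

# made-up sample hrefs; the last one mimics the "Global ASNs" link
hrefs = ['/countries/us', '/countries/de', '/countries']
countries = [c for c in map(country_from_link, hrefs) if c is not None]
print(countries)
```

Inside scrape_pages, a None result would be the signal to continue to the next link.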

Related

Paginating pages using things other than numbers in python

I am trying to paginate a scraper on my university's website.
Here is the URL for one of the pages:
https://www.bu.edu/com/profile/david-abel/
where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name were given, which poses a problem because my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I append the names from my first- and last-name lists to my base URL to get a corresponding URL in the layout above?
import requests
from bs4 import BeautifulSoup
url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)
my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []
html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')
for professor in professors:
    my_data.append(professor.text)
for name in my_data:
    x = name.split()
    split_names.append(x)
for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)
#\/ appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl +
print(firstnames)
print(lastnames)

This simple modification should give you what you want, let me know if you have any more questions or if anything needs to be changed!
# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl + "-".join(name)
    print(newurl)
Even better:
for name in split_names:
    profile_url = f"https://www.bu.edu/com/profile/{'-'.join(name)}"
    print(profile_url)
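Because the join works on however many parts split() returns, a middle name simply becomes one more segment. Here is a minimal sketch of that idea as a helper. The lowercasing and the trailing slash are assumptions based on the sample URL in the question (david-abel/), the helper name profile_url is hypothetical, and the second name is invented.

```python
BASE_URL = "https://www.bu.edu/com/profile/"

def profile_url(full_name):
    # split() handles any number of name parts, so a middle name
    # just becomes one more hyphen-separated segment
    parts = full_name.lower().split()
    return BASE_URL + "-".join(parts) + "/"

for name in ["David Abel", "Jane Q Public"]:  # second name is made up
    print(profile_url(name))
```

Whether the site really slugs middle names this way would need to be checked against a real profile page.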
As for the pagination part, this should work and is not hard coded. Let's say that new faculty join and there are now 9 pages. This code should still work in that case.
url = 'https://www.bu.edu/com/profiles/faculty/page'
with requests.get(f"{url}/1") as response:
    soup = BeautifulSoup(response.text, 'html.parser')
    # select pagination numbers shown ex: [2, 3, 7, Next] (Omit the next)
    page_numbers = [int(n.text) for n in soup.select("a.page-numbers")[:-1]]
    # the current page (1) is not among the links, so start at 1 and take the max as the end
    start_page, stop_page = 1, max(page_numbers) + 1
# loop through pages
for page in range(start_page, stop_page):
    with requests.get(f"{url}/{page}") as response:
        soup = BeautifulSoup(response.text, 'html.parser')
        professors = soup.select('h4.profile-card__name')
        # ---
I believe this is the most concise way to solve your problem. As a tip, use a with statement when making requests: it closes the response for you, and you don't have to pollute the namespace with names like resp1, resp2, etc. As mentioned above, f-strings are great and very easy to use.

Beautiful soup doesn't load the whole page

I have a project where I'm scraping data from Trulia.com, and I want to get the maximum page number (the last number in the pagination) for a specific location (photo below) so I can loop through the pages and get all the hrefs.
To get that last number, my code runs as planned and should return an integer, but it doesn't always return the same number. I added the print of the list comprehension to understand what's wrong. Here is the code and the output below. The return is commented out but should return the last number of the output list as an int.
import requests as r
from bs4 import BeautifulSoup as bs
# req_headers is the asker's dict of request headers (not shown in the post)
city_link = "https://www.trulia.com/for_rent/San_Francisco,CA/"
def bsoup(url):
    resp = r.get(url, headers=req_headers)
    soup = bs(resp.content, 'html.parser')
    return soup
def max_page(link):
    soup = bsoup(link)
    page_num = soup.find_all(attrs={"data-testid":"pagination-page-link"})
    print([x.get_text() for x in page_num])
    # return int(page_num[-1].get_text())
for x in range(10):
    max_page(city_link)
I have no clue why it sometimes returns something wrong. The photo above shows the corresponding link.

Okay, if I understand what you want, you are trying to find how many pages of rental listings there are for a given location. Assuming the given link is the only one required, this code:
import math
import requests
import bs4
url = "https://www.trulia.com/for_rent/San_Francisco,CA/"
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, features='lxml')
def get_number_of_pages(soup):
    caption_tag = soup.find('div', class_="Text__TextBase-sc-1cait9d-0-div Text__TextContainerBase-sc-1cait9d-1 RBSGf")
    pagination = caption_tag.text
    words = pagination.split(" ")
    values = []
    for word in words:
        if not word.isalpha():
            values.append(word)
    links_per_page = values[0].split('-')[1]
    total_links = values[1].replace(',', '')
    # round up: a partially filled last page still counts as a page
    no_of_pages = math.ceil(int(total_links) / int(links_per_page))
    return no_of_pages
for i in range(10):
    print(get_number_of_pages(soup))
achieves what you're looking for. It is repeatable because it does not depend on the JavaScript-rendered pagination links; it reads the pagination caption at the bottom of the page instead.
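The arithmetic behind the caption approach can be tried on its own, without a request. The sketch below assumes the caption text looks like "1-30 of 2,345 Results"; that format is inferred from how the code above splits the words and is not confirmed by the post. The function name pages_from_caption is hypothetical.

```python
import re

def pages_from_caption(caption):
    """Number of result pages implied by a caption such as '1-30 of 2,345 Results'."""
    # pull out every number, allowing thousands separators
    numbers = [int(n.replace(',', '')) for n in re.findall(r'\d[\d,]*', caption)]
    first, last, total = numbers[:3]
    per_page = last - first + 1
    # ceiling division: a partially filled last page still counts
    return -(-total // per_page)

print(pages_from_caption("1-30 of 2,345 Results"))
print(pages_from_caption("1-30 of 90 Results"))
```

Ceiling division keeps an exact multiple (90 results at 30 per page) at 3 pages instead of rounding up to 4.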

Scraping Wikipedia information (table)

I would need to scrape information regarding Elenco dei comuni per regione on Wikipedia. I would like to create an array that can allow me to associate each comune to the corresponding region, i.e. something like this:
'Abbateggio': 'Pescara' -> Abruzzo
I tried to get information using BeautifulSoup and requests as follows:
from bs4 import BeautifulSoup as bs
import requests
with requests.Session() as s: # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia')
    soup = bs(r.text, 'html.parser')
    for ele in soup.find_all('h3')[:6]:
        tx = bs(str(ele),'html.parser').find('span', attrs={'class': "mw-headline"})
        if tx is not None:
            print(tx['id'])
However, it does not work (it returns an empty list).
The information I found using the Inspect tool in Google Chrome is the following:
<span class="mw-headline" id="Elenco_dei_comuni_per_regione">Elenco dei comuni per regione</span> (table)
Comuni dell'Abruzzo
(this field should change for each region)
then <table class="wikitable sortable jquery-tablesorter">
Could you please give me advice on how to get such results?
Any help and suggestion will be appreciated.
EDIT:
Example:
I have a word: comunediabbateggio. This word includes Abbateggio. I would like to know which region can be associated with that city, if it exists.
I need the information from Wikipedia to create a dataset that lets me check the field and associate comuni/cities with a region.
What I expect is:
WORD REGION/STATE
comunediabbateggio Pescara
I hope this helps. Sorry if it was not clear.
Another example, which might be easier for English speakers to follow:
Instead of the Italian link above, you can also consider https://en.wikipedia.org/wiki/List_of_comuni_of_Italy . For each region (Lombardia, Veneto, Sicily, ...) I need to collect the list of communes of its provinces.
If you click on a "List of communes of ..." link, there is a table that lists the comuni, e.g. https://en.wikipedia.org/wiki/List_of_communes_of_the_Province_of_Agrigento.

import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
target = "https://en.wikipedia.org/wiki/List_of_comuni_of_Italy"
def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        provinces = [item.find_next("span").text for item in soup.findAll(
            "span", class_="tocnumber", text=re.compile(r"\d[.]\d"))]
        search = [item.replace(
            " ", "_") if " " in item else item for item in provinces]
        nested = []
        for item in search:
            for a in soup.findAll("span", id=item):
                goes = [b.text.split("of ")[-1]
                        for b in a.find_next("ul").findAll("a")]
                nested.append(goes)
        dictionary = dict(zip(provinces, nested))
        urls = [f'{url[:24]}{b.get("href")}' for item in search for a in soup.findAll(
            "span", id=item) for b in a.find_next("ul").findAll("a")]
    return urls, dictionary
def parser():
    links, dics = main(target)
    com = []
    for link in tqdm(links):
        try:
            df = pd.read_html(link)[0]
            com.append(df[df.columns[1]].to_list()[:-1])
        except ValueError:
            com.append(["N/A"])
    com = iter(com)
    for x in dics:
        b = dics[x]
        dics[x] = dict(zip(b, com))
    print(dics)
parser()
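The dictionary printed by parser() appears to be nested as region -> province -> list of comuni. To answer the lookup from the question (which province and region belong to a word like comunediabbateggio), that structure can be inverted. The following is only a sketch under that assumption: the function names invert and find_comune are hypothetical, and the sample dictionary is hand-written from the Abbateggio example in the question, not scraped.

```python
def invert(regions):
    """Turn {region: {province: [comuni]}} into {comune: (province, region)}."""
    lookup = {}
    for region, provinces in regions.items():
        for province, comuni in provinces.items():
            for comune in comuni:
                lookup[comune.lower()] = (province, region)
    return lookup

def find_comune(word, lookup):
    # try longer names first so a short comune name inside a longer one does not win
    for comune in sorted(lookup, key=len, reverse=True):
        if comune in word.lower():
            return comune, lookup[comune]
    return None

# hand-written sample in the assumed shape
sample = {"Abruzzo": {"Pescara": ["Abbateggio"]}}
lookup = invert(sample)
print(find_comune("comunediabbateggio", lookup))
```

Comune names containing spaces would need the spaces removed before matching against a run-together word.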

How to scrape embedded links and tabular information

I'm trying to scrape information about the datasets available on this website.
I want to collect the URLs to the resources and at least the title of the dataset.
Using this resource as an example, I want to capture the URL embedded in "Go to resource" and the title listed in the table:
I have created a basic scraper, but it doesn't seem to work:
import requests
import csv
from bs4 import BeautifulSoup
site = requests.get('https://data.nsw.gov.au/data/dataset');
data_list=[]
if site.status_code is 200:
    content = BeautifulSoup(site.content, 'html.parser')
    internals = content.select('.resource-url-analytics')
    for url in internals:
        title = internals.select=('.resource-url-analytics')[0].get_text()
        link = internals.select=('.resource-url-analytics')[0].get('href')
        new_data = {"title": title, "link": link}
        data_list.append(new_data)
    with open ('selector.csv','w') as file:
        writer = csv.DictWriter(file, fieldnames = ["dataset", "link"], delimiter = ';')
        writer.writeheader()
        for row in data_list:
            writer.writerow(row)
I would like to write the output to a CSV with columns for the URLs and the titles.
This is an example of the desired output
I would greatly appreciate any assistance.

Have a look at the API for the datasets; that will likely be the easiest way to do this.
In the meantime, here is how you can get the API links at id level from those pages and store the entire package info for all packages in one list, data_sets, and just the info of interest in another variable (results). Be sure to review the API documentation in case there is a better method. For example, it would be nice if ids could be submitted in batches rather than per id.
The answer below takes advantage of the endpoint detailed in the documentation that returns a full JSON representation of a dataset, resource or other object.
Taking the current first result on the landing page:
Vegetation of the Guyra 1:25000 map sheet VIS_ID 240.
We want the last child a of a parent h3 that sits inside an element with class .dataset-item. In the selector below, the spaces between selectors are descendant combinators.
.dataset-item h3 a:last-child
You can shorten this to h3 a:last-child for a small efficiency gain.
This relationship reliably selects all relevant links on page.
Continuing with this example, when we visit the URL retrieved for the first listed item, we can find the id via the API endpoint link (which retrieves the JSON for this package), using an attribute=value selector with the contains (*) operator. We know this API endpoint has a common string, so we substring-match on the href attribute value:
[href*="/api/3/action/package_show?id="]
The domain can vary and some retrieved links are relative so we have to test if relative and add the appropriate domain.
First page html for that match:
Notes:
data_sets is a list containing all the package data for each package and is extensive. I did this in case you are interested in looking at what is in those packages (besides reviewing the API documentation).
You can get the total number of pages from the soup object on a page via
num_pages = int(soup.select('[href^="/data/dataset?page="]')[-2].text)
You can alter the loop for fewer pages.
A Session object is used for the efficiency of re-using the connection. I'm sure there are other improvements to be made. In particular, I would look for any method that reduces the number of requests (which is why I mentioned looking for a batch id endpoint, for example).
There can be zero, one, or several resource URLs within a returned package. See example here. You can edit the code to handle this.
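The relative-versus-absolute link handling mentioned above can also be done with urljoin from the standard library, which leaves absolute URLs alone and resolves relative ones against the page they came from. This is an alternative sketch rather than the approach used in the code that follows; the helper name absolutize and both sample hrefs are made up.

```python
from urllib.parse import urljoin

def absolutize(page_url, hrefs):
    # urljoin keeps absolute URLs as they are and resolves
    # relative ones against the page they were found on
    return [urljoin(page_url, href) for href in hrefs]

page = 'https://data.nsw.gov.au/data/dataset?page=1'
# made-up hrefs: one relative, one already absolute on another domain
hrefs = ['/data/dataset/example-id',
         'https://datasets.seed.nsw.gov.au/dataset/example-id']
print(absolutize(page, hrefs))
```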
Python:
from bs4 import BeautifulSoup as bs
import requests
import csv
from urllib.parse import urlparse
json_api_links = []
data_sets = []
def get_links(s, url, css_selector):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    base = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url))
    links = [base + item['href'] if item['href'][0] == '/' else item['href'] for item in soup.select(css_selector)]
    return links
results = []
#debug = []
with requests.Session() as s:
    for page in range(1,2): #you decide how many pages to loop
        links = get_links(s, 'https://data.nsw.gov.au/data/dataset?page={}'.format(page), '.dataset-item h3 a:last-child')
        for link in links:
            data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
            json_api_links.append(data)
            #debug.append((link, data))
    resources = list(set([item.replace('opendata','') for sublist in json_api_links for item in sublist])) #can just leave as set
    for link in resources:
        try:
            r = s.get(link).json() #entire package info
            data_sets.append(r)
            title = r['result']['title'] #certain items
            if 'resources' in r['result']:
                urls = ' , '.join([item['url'] for item in r['result']['resources']])
            else:
                urls = 'N/A'
        except:
            title = 'N/A'
            urls = 'N/A'
        results.append((title, urls))
with open('data.csv','w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Title','Resource Url'])
    for row in results:
        w.writerow(row)
All pages
(very long running so consider threading/asyncio):
from bs4 import BeautifulSoup as bs
import requests
import csv
from urllib.parse import urlparse
json_api_links = []
data_sets = []
def get_links(s, url, css_selector):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    base = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url))
    links = [base + item['href'] if item['href'][0] == '/' else item['href'] for item in soup.select(css_selector)]
    return links
results = []
#debug = []
with requests.Session() as s:
    r = s.get('https://data.nsw.gov.au/data/dataset')
    soup = bs(r.content, 'lxml')
    num_pages = int(soup.select('[href^="/data/dataset?page="]')[-2].text)
    links = [item['href'] for item in soup.select('.dataset-item h3 a:last-child')]
    for link in links:
        data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
        json_api_links.append(data)
        #debug.append((link, data))
    if num_pages > 1:
        for page in range(1, num_pages + 1): #you decide how many pages to loop
            links = get_links(s, 'https://data.nsw.gov.au/data/dataset?page={}'.format(page), '.dataset-item h3 a:last-child')
            for link in links:
                data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
                json_api_links.append(data)
                #debug.append((link, data))
    resources = list(set([item.replace('opendata','') for sublist in json_api_links for item in sublist])) #can just leave as set
    for link in resources:
        try:
            r = s.get(link).json() #entire package info
            data_sets.append(r)
            title = r['result']['title'] #certain items
            if 'resources' in r['result']:
                urls = ' , '.join([item['url'] for item in r['result']['resources']])
            else:
                urls = 'N/A'
        except:
            title = 'N/A'
            urls = 'N/A'
        results.append((title, urls))
with open('data.csv','w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Title','Resource Url'])
    for row in results:
        w.writerow(row)
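The title and resource extraction inside the loop can be isolated into a helper that works on a plain dict, which makes it easy to check without any requests. This is a sketch: the helper name summarize_package is hypothetical, and the sample dict is hand-written to mirror only the keys the code above reads ('result', 'title', 'resources', 'url'), not a real API response.

```python
def summarize_package(package):
    """Return (title, joined resource URLs) from a package_show-style dict."""
    result = package.get('result') or {}
    title = result.get('title', 'N/A')
    # a package may carry zero, one, or several resources
    urls = [res['url'] for res in result.get('resources', []) if res.get('url')]
    return title, (' , '.join(urls) if urls else 'N/A')

# hand-written sample with placeholder URLs
sample = {'result': {'title': 'Example dataset',
                     'resources': [{'url': 'https://example.org/a.csv'},
                                   {'url': 'https://example.org/b.zip'}]}}
print(summarize_package(sample))
print(summarize_package({}))
```

Using .get with defaults covers the missing-key cases that the try/except handles in the code above, while letting network errors surface separately.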

For simplicity, use the selenium package:
from selenium import webdriver
import os
# initialise browser
browser = webdriver.Chrome(os.getcwd() + '/chromedriver')
browser.get('https://data.nsw.gov.au/data/dataset')
# find all elements by xpath
# (Selenium 3 API; in Selenium 4 use browser.find_elements(By.XPATH, ...) instead)
get_elements = browser.find_elements_by_xpath('//*[@id="content"]/div/div/section/div/ul/li/div/h3/a[2]')
# collect data
data = []
for item in get_elements:
    data.append((item.text, item.get_attribute('href')))
Output:
('Vegetation of the Guyra 1:25000 map sheet VIS_ID 240', 'https://datasets.seed.nsw.gov.au/dataset/vegetation-of-the-guyra-1-25000-map-sheet-vis_id-2401ee52')
('State Vegetation Type Map: Riverina Region Version v1.2 - VIS_ID 4469', 'https://datasets.seed.nsw.gov.au/dataset/riverina-regional-native-vegetation-map-version-v1-0-vis_id-4449')
('Temperate Highland Peat Swamps on Sandstone (THPSS) spatial distribution maps...', 'https://datasets.seed.nsw.gov.au/dataset/temperate-highland-peat-swamps-on-sandstone-thpss-vegetation-maps-vis-ids-4480-to-4485')
('Environmental Planning Instrument - Flood', 'https://www.planningportal.nsw.gov.au/opendata/dataset/epi-flood')
and so on

How can I loop scraping data for multiple pages in a website using python and beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. I would like to geocode this data, place it on a map, and keep a local copy on my computer.
I used Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data and writing it to a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to one page. I want to loop it in a way that captures the data for golf courses from all pages on the PGA site. There are about 18000 golf courses and 900 pages to capture.
Attached below is my script. I need help writing code that will capture ALL the data from the PGA website, not just one page but all of them, so that I end up with the data for all golf courses in the United States.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})
courses_list=[]
for item in g_data2:
    try:
        name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
    except:
        name=''
    try:
        address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
    except:
        address1=''
    try:
        address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
    except:
        address2=''
    try:
        website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
    except:
        website=''
    try:
        Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
    except:
        Phonenumber=''
    course=[name,address1,address2,website,Phonenumber]
    courses_list.append(course)
with open ('filename5.csv','wb') as file:
    writer=csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
#for item in g_data1:
#try:
#print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
#except:
#pass
#for item in g_data2:
#try:
#print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
#except:
#pass
This script only captures 20 courses at a time, and I want to capture everything in one script, which amounts to 18000 golf courses and 900 pages to scrape from.

The PGA website's search has multiple pages; the URL follows the pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
This means you can read the content of a page, then increase the value of page by 1 and read the next page, and so on.
import csv
import requests
from bs4 import BeautifulSoup
for i in range(907): # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Your code for each individual page here
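The per-page URLs can also be built from a parameter dictionary instead of a long format string, which keeps the query readable and lets the standard library handle the encoding. This is a sketch: the generator name page_urls is hypothetical, and the parameter values are copied from the URL in the question.

```python
from urllib.parse import urlencode

BASE = "http://www.pga.com/golf-courses/search"
# query parameters copied from the URL in the question
PARAMS = {"searchbox": "Course Name", "searchbox_zip": "ZIP", "distance": 50,
          "price_range": 0, "course_type": "both", "has_events": 0}

def page_urls(last_page, first_page=0):
    """Yield one search URL per page number, first_page..last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield BASE + "?" + urlencode({"page": page, **PARAMS})

urls = list(page_urls(2))
for u in urls:
    print(u)
```

Each yielded URL can then be passed to requests.get in place of the hand-formatted string.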

If you are still following this post, you can try this code too:
from urllib.request import urlopen
from bs4 import BeautifulSoup
file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Title = soup.find_all("div", {"class":"views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class":"views-field-title"}).get_text()
            address = i.find("div", {"class":"views-field-address"}).get_text()
            city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
            website = i.find("div", {"class":"views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            # find() returned None because a field is missing; skip this entry
            continue
f.close()
Where it says range(1,5), change it to run from 0 to the last page and you will get all the details in the CSV. I tried very hard to get your data into a proper format, but it's hard :).
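Part of the formatting difficulty comes from commas inside the fields themselves. Instead of replacing them by hand, the csv module can quote such fields automatically. This is a minimal sketch with an invented row; io.StringIO stands in for the real output file so the result can be inspected directly.

```python
import csv
import io

# invented sample row; note the commas inside the address and city fields
rows = [("Example Golf Club", "1 Fairway Dr, Suite 2", "Springfield, IL 62701",
         "(555) 010-0000", "http://example.com")]

buffer = io.StringIO()  # stands in for open("Details.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["Name", "Address", "City", "Phone", "Website"])
writer.writerows(rows)  # fields containing commas are quoted automatically
print(buffer.getvalue())
```

Reading the file back with csv.reader then yields the original five fields per row, commas included.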

You're requesting a link to a single page; the script is not going to iterate through the other pages on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you're only requesting page 1, you'll only get 20 results. You'll need to create a loop that runs through each page.
You can start by writing a function that handles one page, then call that function for each page.
Right after search? in the URL, starting at page 2, the page=1 parameter appears and keeps increasing until page 907, where it is page=906.

I noticed that the first solution repeated the first page of results; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below:
for i in range(1, 907): #Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib") #Can use whichever parser you prefer
    # Your code for each individual page here

I had this exact same problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps: create a session and it will send the cookie along with the requests for all the numbered pages, so you can pull every page you need.
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)

The PGA website has changed since this question was asked.
It seems they now organize all courses by: State > City > Course
In light of this change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we'll need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all the state URL endpoints:
URL = "https://www.pga.com"
def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls
state_urls = get_state_urls()
Step 3 - Write a function to scrape all the city links:
def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities
state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses
city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse all the useful info about a course:
def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }
course = courses[0]
parse_course(course)
Step 6 - Loop through everything and save:
all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
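Step 6 collects the courses in all_courses but does not show the save itself. One possible way to write the list of dictionaries to disk is csv.DictWriter. This is a sketch, not part of the original answer: the function name save_courses is hypothetical, the field names are the keys returned by parse_course, and the sample row and temp-file path are invented for the demonstration.

```python
import csv
import os
import tempfile

def save_courses(courses, path):
    """Write a list of course dicts (name, address, url) to a CSV file."""
    fieldnames = ["name", "address", "url"]  # keys returned by parse_course
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(courses)

# invented sample row standing in for all_courses
sample_courses = [{"name": "Example Golf Club", "address": "1 Fairway Dr", "url": "/example"}]
path = os.path.join(tempfile.gettempdir(), "courses.csv")
save_courses(sample_courses, path)
print("saved", len(sample_courses), "course(s) to", path)
```

In the real loop, save_courses(all_courses, "courses.csv") would run once after the scraping finishes.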
