I have several URLs that link to hotel pages, and I would like to scrape some data from them.
I'm using the following script, but I would like to update it:
data=[]
for i in range(0,10):
    url = final_list[i]
    driver2 = webdriver.Chrome()
    driver2.get(url)
    sleep(randint(10,20))
    soup = BeautifulSoup(driver2.page_source, 'html.parser')
    my_table2 = soup.find_all(class_=['title-2', 'rating-score body-3'])
    review=soup.find_all(class_='reviews')[-1]
    try:
        price=soup.find_all('span', attrs={'class':'price'})[-1]
    except:
        price=soup.find_all('span', attrs={'class':'price'})
    for tag in my_table2:
        data.append(tag.text.strip())
    for p in price:
        data.append(p)
    for r in review:
        data.append(r)
But here's the problem: tag.text.strip() scrapes the rating numbers, like here:
It puts each rating number into its own value, but not all hotels have the same number of ratings. The default number is 8, but here's a hotel with 7 ratings; others have six, and so on. So in the end, my dataframe is misaligned: if a hotel doesn't have 8 ratings, the values are shifted.
My question is: how do I tell the script "if there is a value for this rating in tag.text.strip(), store the value, but if there isn't, store None", and do that for each of the eight values?
I tried several things, like:
for tag in my_table2:
    for i in tag.text.strip()[i]:
        if i:
            data.append(i)
        else:
            data.append(None)
But unfortunately, that goes nowhere, so if you could help me figure out the answer, it would be awesome :)
In case it helps, here is a link to one of the hotels that I'm scraping:
https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1
The rating numbers are at the end of the page.
Thank you.
A few suggestions:
Put your data in a dictionary. You don't have to assume that all tags are present and the order of the tags doesn't matter. You can get the labels and the corresponding ratings with
rating_labels = soup.find_all(class_=['rating-label body-3'])
rating_scores = soup.find_all(class_=['rating-score body-3'])
and then iterate over both lists with zip
Move your driver outside of the loop; opening it once is enough.
Don't use sleep; use Selenium's wait functions instead. You can wait for a particular element to be present or populated with WebDriverWait(driver, 10).until(EC.presence_of_element_located(your_element))
https://selenium-python.readthedocs.io/waits.html
Cache your scraped HTML code to a file. It's faster for you and more polite to the website you are scraping.
import selenium
import selenium.webdriver
import time
import random
import os
from bs4 import BeautifulSoup
data = []
final_list = [
    'https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1',
    'https://www.hostelworld.com/pwa/hosteldetails.php/Be-Ramblas-Hostel/Barcelona/435?from=2020-11-27&to=2020-11-28&guests=1'
]
# load your driver only once to save time
driver = selenium.webdriver.Chrome()
for url in final_list:
    data.append({})
    # cache the HTML code to the filesystem
    # generate a filename from the URL where all non-alphanumeric characters (e.g. :/) are replaced with underscores _
    filename = ''.join([s if s.isalnum() else '_' for s in url])
    if not os.path.isfile(filename):
        driver.get(url)
        # better use selenium's wait functions here
        time.sleep(random.randint(10, 20))
        source = driver.page_source
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(source)
    else:
        with open(filename, 'r', encoding='utf-8') as f:
            source = f.read()
    soup = BeautifulSoup(source, 'html.parser')
    review = soup.find_all(class_='reviews')[-1]
    try:
        price = soup.find_all('span', attrs={'class':'price'})[-1]
    except:
        price = soup.find_all('span', attrs={'class':'price'})
    data[-1]['name'] = soup.find_all(class_=['title-2'])[0].text.strip()
    rating_labels = soup.find_all(class_=['rating-label body-3'])
    rating_scores = soup.find_all(class_=['rating-score body-3'])
    assert len(rating_labels) == len(rating_scores)
    for label, score in zip(rating_labels, rating_scores):
        data[-1][label.text.strip()] = score.text.strip()
    data[-1]['price'] = price.text.strip()
    data[-1]['review'] = review.text.strip()
The data can then be put into a nicely formatted table using Pandas:
import pandas as pd
df = pd.DataFrame(data)
df
If some data is missing or incomplete, Pandas fills the gap with NaN:
data.append(data[0].copy())
del(data[-1]['Staff'])
data[-1]['name'] = 'Incomplete Hostel'
pd.DataFrame(data)
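If you would rather get an explicit None for a missing rating without going through Pandas, the same label-to-score idea can be sketched with plain dictionaries. This is only a minimal sketch: EXPECTED_LABELS and build_row are hypothetical names, and apart from 'Staff' the labels are placeholders that you would replace with the ones the site actually shows.

```python
# Hypothetical list of rating labels; replace with the labels the site really uses.
EXPECTED_LABELS = ["Value For Money", "Security", "Location", "Staff"]


def build_row(name, labels, scores):
    """Map every expected label to its score, or to None when the hotel lacks it."""
    found = dict(zip(labels, scores))
    row = {"name": name}
    for label in EXPECTED_LABELS:
        row[label] = found.get(label)  # dict.get returns None for a missing label
    return row


# Made-up sample data: one hotel with all ratings, one with only two of them.
complete = build_row(
    "Hostel A",
    ["Value For Money", "Security", "Location", "Staff"],
    ["9.1", "8.7", "9.5", "9.0"],
)
partial = build_row("Hostel B", ["Security", "Location"], ["8.0", "7.5"])

print(complete)
print(partial)
```

Because every row has the same keys in the same order, the columns can no longer shift when a hotel has fewer ratings.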
Related
I'm using BS4 for the first time and need to scrape the items from an online catalogue to CSV.
I have set up my code; however, when I run it, the results only repeat the first item in the catalogue n times (where n is the number of items).
Can someone review my code and let me know where I am going wrong?
Thanks
import requests
from bs4 import BeautifulSoup
from csv import writer
#response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/27/anaesthetic-oxygen-and-resuscitation?CoreListRequest=BrowseCoreList')
response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/32/nhs-cat?LastCartId=&LastFavouriteId=&CoreListRequest=BrowseAll')
soup = BeautifulSoup(response.text , 'html.parser')
items = soup.find_all(class_='productPrevDetails')
#print(items)
for item in items:
    ItemCode = soup.find(class_='product_npc ').get_text().replace('\n','')
    ItemNameS = soup.select('p')[58].get_text()
    ProductInfo = soup.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
You always see the first result because you are searching soup, not the item. Try
for item in items:
    # search within the current item, not the whole document
    ItemCode = item.find(class_='product_npc ').get_text().replace('\n','')
    # index 58 was relative to the whole page; it will likely need adjusting for a single item
    ItemNameS = item.select('p')[58].get_text()
    ProductInfo = item.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
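The difference between searching the whole document and searching each item can be reproduced without any network access. The sketch below swaps in the standard library's xml.etree.ElementTree for BeautifulSoup purely so that it runs on its own; the markup is made up, and only the class names come from the question.

```python
import xml.etree.ElementTree as ET

# Made-up catalogue markup, used only for illustration.
DOC = """
<catalogue>
  <div class="productPrevDetails"><span class="product_npc">A001</span></div>
  <div class="productPrevDetails"><span class="product_npc">B002</span></div>
</catalogue>
"""

root = ET.fromstring(DOC)
items = root.findall(".//div[@class='productPrevDetails']")

# Wrong: searching from the document root returns the first match every time.
wrong = [root.find(".//span[@class='product_npc']").text for item in items]

# Right: searching from each item returns that item's own code.
right = [item.find(".//span[@class='product_npc']").text for item in items]

print(wrong)  # ['A001', 'A001']
print(right)  # ['A001', 'B002']
```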
I know there are questions like this, I was trying to follow them. I am trying to scrape the info in this page. Ideally I would like as much of the info as possible into a clean/easy to read tsv, but the essential parts to scrape are: ID, Name, Organism, Family, Classification, UniProt ID, Modifications, Sequence and PDB structure IDs (e.g. in this case, there is a list of PDB structures, the first is 1BAS and the last is 4OEG).
I wrote this in python3:
import urllib.request
import sys
import pandas as pd
import bs4
out = open('pdb.parsed.txt', 'a')
for i in range(1000,1005):
    # try:
    url = 'http://isyslab.info/StraPep/show_detail.php?id=BP' + str(i)
    page = urllib.request.urlopen(url)
    soup = pd.read_html(page)
    print(soup)
I have attached my output here:
I have two questions:
You can see that some of the info that I require is missing (e.g. the sequence has NaN).
More importantly, I cannot see any field that correlates to the list of PDB IDs?
I was hoping to use pd.read_html if possible because in the past I have struggled with urllib/bs4, and I have found that I have been more successful with pd.read_html in recent scraping attempts. Can anyone explain how I could pull out the fields that I need?
I believe you were unable to scrape entries from certain rows, such as the 'Sequence' row, because these rows are populated by JavaScript. The approach that worked for me was to use Selenium with a Firefox driver to grab the page's rendered HTML, and then use Beautiful Soup to parse it.
Here's how I was able to scrape the pertinent info for the ID, Name, Organism, Family, Classification, UniProt ID, Modifications, Sequence and PDB structure IDs, for each page:
import urllib.request
import sys
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import csv
pages = []
for page in range(1000,1005):
    # try:
    info_dict = {}
    url = 'http://isyslab.info/StraPep/show_detail.php?id=BP' + str(page)
    driver = webdriver.Firefox()
    driver.get(url)
    html = driver.page_source
    bs = BeautifulSoup(html, 'html.parser')
    main_table = bs.find('table', attrs={'class': 'main_table'})
    rows = main_table.findAll('tr')
    for row in rows:
        try: # We only want rows from a page where both row title and text are not null
            row_header = row.find('th').text
            row_text = row.find('td').text
        except:
            pass
        else:
            if row_header and row_text:
                if row_header in ['ID', 'Name', 'Organism', 'Family', 'Classification', 'UniProt ID']:
                    info_dict[row_header] = row_text
                elif row_header == 'Modification':
                    try: # Some pages have a null table entry for 'Modification'
                        mod_text = row.find('table').find('td').text
                    except:
                        pass
                    else:
                        if mod_text:
                            info_dict[row_header] = mod_text
                        else:
                            info_dict[row_header] = 'NA'
                # Pass 'Sequence' and 'Structure' as space separated strings
                elif row_header == 'Sequence':
                    seqs = ''
                    for i in row_text.split():
                        seqs += ' ' + i
                    info_dict[row_header] = seqs[1:]
                elif row_header == 'Structure':
                    pdb_ids = ''
                    a = row.find('tbody').find_all('a')
                    for i in a:
                        if i.text != '[x]': pdb_ids += ' ' + i.text
                    info_dict[row_header] = pdb_ids[1:]
    pages.append(info_dict)
keys = pages[0].keys()
with open('pdb.parsed.txt', 'a') as output_file:
    writer = csv.DictWriter(output_file, keys, delimiter='\t')
    writer.writeheader()
    writer.writerows(pages) # Add a tab-delimited row for each page we scraped
I can then read in the .tsv file I just created as a dataframe if I want:
df = pd.read_csv('pdb.parsed.txt', delimiter='\t')
It looks like this:
Although the contents of columns containing longer strings (such as 'Sequence') are abbreviated, we can verify that the entire sequence is indeed present:
df.iloc[0]['Sequence']
'PALPEDGGSG AFPPGHFKDP KRLYCKNGGF FLRIHPDGRV DGVREKSDPH IKLQLQAEER GVVSIKGVCA NRYLAMKEDG RLLASKCVTD ECFFFERLES NNYNTYRSRK YTSWYVALKR TGQYKLGSKT GPGQKAILFL PMSAKS'
The contents of the saved tsv file look like this:
ID Name Organism Family Classification UniProt ID Modification Sequence Structure
BP1000 Fibroblast growth factor 2 Homo sapiens heparin-binding growth factors family Cytokine/Growth factor FGF2_HUMAN Phosphotyrosine; by TEC PALPEDGGSG AFPPGHFKDP KRLYCKNGGF FLRIHPDGRV DGVREKSDPH IKLQLQAEER GVVSIKGVCA NRYLAMKEDG RLLASKCVTD ECFFFERLES NNYNTYRSRK YTSWYVALKR TGQYKLGSKT GPGQKAILFL PMSAKS 1BAS 1BFB 1BFC 1BFF 1BFG 1BLA 1BLD 1CVS 1EV2 1FGA 1FQ9 1II4 1IIL 2BFH 2FGF 2M49 4FGF 4OEE 4OEF 4OEG
BP1001 Interleukin-2 Homo sapiens IL-2 family Cytokine/Growth factor IL2_HUMAN APTSSSTKKT QLQLEHLLLD LQMILNGINN YKNPKLTRML TFKFYMPKKA TELKHLQCLE EELKPLEEVL NLAQSKNFHL RPRDLISNIN VIVLELKGSE TTFMCEYADE TATIVEFLNR WITFCQSIIS TLT 1IRL 1M47 1M48 1M49 1M4A 1M4B 1M4C 1NBP 1PW6 1PY2 1QVN 1Z92 2B5I 2ERJ 3INK 3QAZ 3QB1 4NEJ 4NEM
BP1002 Insulin Bos taurus insulin family Hormone INS_BOVIN GIVEQCCASV CSLYQLENYC N 1APH 1BPH 1CPH 1DPH 1PID 2A3G 2BN1 2BN3 2INS 2ZP6 3W14 4BS3 4E7T 4E7U 4E7V 4I5Y 4I5Z 4IDW 4IHN 4M4F 4M4H 4M4I 4M4J 4M4L 4M4M
BP1003 Interleukin-1 beta Homo sapiens IL-1 family Cytokine/Growth factor IL1B_HUMAN APVRSLNCTL RDSQQKSLVM SGPYELKALH LQGQDMEQQV VFSMSFVQGE ESNDKIPVAL GLKEKNLYLS CVLKDDKPTL QLESVDPKNY PKKKMEKRFV FNKIEINNKL EFESAQFPNW YISTSQAENM PVFLGGTKGG QDITDFTMQF VSS 1HIB 1I1B 1IOB 1ITB 1L2H 1S0L 1T4Q 1TOO 1TP0 1TWE 1TWM 21BI 2I1B 2KH2 2NVH 31BI 3LTQ 3O4O 3POK 41BI 4DEP 4G6J 4G6M 4GAF 4GAI 4I1B 5BVP 5I1B 6I1B 7I1B 9ILB
BP1004 Lactoferricin-H Homo sapiens transferrin family Antimicrobial TRFL_HUMAN GRRRSVQWCA VSQPEATKCF QWQRNMRKVR GPPVSCIKRD SPIQCIQA 1Z6V 1XV4 1XV7 1Z6W 2GMC 2GMD
I used the following two Anaconda commands to install Selenium and then the Firefox driver:
conda install -c conda-forge selenium
conda install -c conda-forge geckodriver
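One caveat with taking the column names from pages[0].keys(): csv.DictWriter raises a ValueError by default if a later page has a key that the first page lacks (for example, a 'Modification' entry). A possible way around this, sketched here with made-up records and an in-memory buffer standing in for the output file, is to collect the keys from all pages and let restval fill the gaps.

```python
import csv
import io

# Made-up records standing in for the scraped pages; the first one has no 'Modification'.
pages = [
    {"ID": "BP1001", "Name": "Interleukin-2"},
    {"ID": "BP1000", "Name": "Fibroblast growth factor 2",
     "Modification": "Phosphotyrosine; by TEC"},
]

# Collect every key seen on any page, keeping first-seen order.
fieldnames = []
for page in pages:
    for key in page:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stands in for the real output file
writer = csv.DictWriter(buffer, fieldnames, delimiter="\t", restval="NA")
writer.writeheader()
writer.writerows(pages)  # missing fields are written as 'NA'

print(buffer.getvalue())
```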
I've been using Beautiful Soup to iterate through pages, but for whatever reason I can't get the loop to advance beyond the first page. It seems like it should be easy because the URL is just a text string, but the loop keeps going back to the start. Maybe the problem is my structure and not my text string?
Here's what I have:
import csv
import urllib2
from bs4 import BeautifulSoup
f = open('nhlstats.csv', "w")
groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015","2014","2013","2012"]
for yr in year:
for gr in groups:
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
#www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
pageliteral = int(pagecount[5:])
for i in range(0,pageliteral):
number = int(((i*40) + 1))
URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
for tr in soup.select("#my-players-table tr[class*=player]"):
row =[]
for ob in range(1,15):
player_info = tr('td')[ob].get_text(strip=True)
row.append(player_info)
f.write(str(yr) +","+",".join(row) + "\n")
f.close()
This gets the same first 40 records over and over.
I tried using this solution as an if and found that doing
prevLink = soup.select('a[rel="nofollow"]')[0]
newurl = "http:" + prevLink.get('href')
did work better, but I'm not sure how to write the loop so that it advances. Possibly I'm just tired, but my loop still only goes to the next set of records and gets stuck on that one. Please help me fix my loop.
UPDATE
My formatting was lost in the copy and paste; my actual code looks like:
import csv
import urllib2
from bs4 import BeautifulSoup
f = open('nhlstats.csv', "w")
groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015","2014","2013","2012"]
for yr in year:
for gr in groups:
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
#www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
pageliteral = int(pagecount[5:])
for i in range(0,pageliteral):
number = int(((i*40) + 1))
URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
for tr in soup.select("#my-players-table tr[class*=player]"):
row =[]
for ob in range(1,15):
player_info = tr('td')[ob].get_text(strip=True)
row.append(player_info)
f.write(str(yr) +","+",".join(row) + "\n")
f.close()
Your code indentation was mostly at fault. It would also be wise to actually use the CSV library you imported; it automatically wraps the player names in quotes so that any commas inside them do not ruin the CSV structure.
This works by looking for the link to the next page and extracting the starting count from it. This is then used to build the request for the next page. If no next page can be found, it moves on to the next year and group. Note that the count is not a page count but a starting entry count.
import csv
import urllib2
from bs4 import BeautifulSoup
groups= ['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015", "2014", "2013", "2012"]
with open('nhlstats.csv', "wb") as f_output:
    csv_output = csv.writer(f_output)
    for yr in year:
        for gr in groups:
            start_count = 1
            while True:
                #print "{}, {}, {}".format(yr, gr, start_count) # show progress
                url = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/{}/count/{}".format(yr, start_count)
                page = urllib2.urlopen(url)
                soup = BeautifulSoup(page, "html.parser")
                for tr in soup.select("#my-players-table tr[class*=player]"):
                    row = [yr]
                    for ob in range(1, 15):
                        player_info = tr('td')[ob].get_text(strip=True)
                        row.append(player_info)
                    csv_output.writerow(row)
                try:
                    start_count = int(soup.find(attrs= {"class":"page-numbers"}).find_next('a')['href'].rsplit('/', 1)[1])
                except:
                    break
Using with will also automatically close your file at the end.
This would give you a csv file starting as follows:
2016,"Patrick Kane, RW",CHI,82,46,60,106,17,30,1.29,287,16.0,9,17,20
2016,"Jamie Benn, LW",DAL,82,41,48,89,7,64,1.09,247,16.6,5,17,13
2016,"Sidney Crosby, C",PIT,80,36,49,85,19,42,1.06,248,14.5,9,10,14
2016,"Joe Thornton, C",SJ,82,19,63,82,25,54,1.00,121,15.7,6,8,21
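The step that turns the "next page" link into a starting count can be isolated and tried on its own. This is a minimal sketch, not part of the answer's code: next_start_count is a hypothetical helper, and the sample links only assume the /count/N pattern used in the URLs above.

```python
def next_start_count(href):
    """Return the trailing integer of a '.../count/N' link, or None if there is none."""
    if not href:
        return None
    tail = href.rstrip("/").rsplit("/", 1)[-1]  # text after the last slash
    return int(tail) if tail.isdigit() else None


# Sample links following the URL pattern from the question.
with_count = "//www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2016/count/41"
without_count = "//www.espn.com/nhl/statistics"

print(next_start_count(with_count))     # 41
print(next_start_count(without_count))  # None
print(next_start_count(None))           # None
```

Returning None instead of raising makes the stop condition explicit, so the loop can end with a plain `if` rather than a bare except.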
You are changing the URL many times before you are opening it the first time, due to an indentation error. Try this:
for gr in groups:
    url = "...some_url..."
    page = urllib2.urlopen(url)
    ...everything else should be indented....
while True:
    for rate in soup.find_all('div',{"class":"rating"}):
        if rate.img is not None:
            print (rate.img['alt'])
    try:
        driver.find_element_by_link_text('Next').click()
    except:
        break
driver.quit()
while True:
    for rate in soup.findAll('div',{"class":"listing_title"}):
        print (rate.a.text)
    try:
        driver.find_element_by_link_text('Next').click()
    except:
        break
driver.quit()
This should do what you're looking for. You should grab the parent class of both (I chose .listing), get each attribute from there, insert them in a dict, and then write the dicts to CSV with the Python CSV library. Just as a fair warning, I didn't run it until it broke; I stopped it after the second loop to save some computing.
WARNING: HAVE NOT TESTED ON FULL SITE
import csv
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
url = 'http://www.tripadvisor.in/Hotels-g186338-London_England-Hotels.html'
driver = webdriver.Firefox()
driver.get(url)
hotels = []
while True:
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    listings = soup.select('div.listing')
    for l in listings:
        hotel = {}
        hotel['name'] = l.select('a.property_title')[0].text
        hotel['rating'] = float(l.select('img.sprite-ratings')[0]['alt'].split('of')[0])
        hotels.append(hotel)
    # find_element_by_link_text raises instead of returning None when there is no link
    try:
        next_link = driver.find_element_by_link_text('Next')
    except NoSuchElementException:
        break  # no "Next" link: this was the last page
    next_link.click()
    time.sleep(0.5)
if len(hotels) > 0:
    with open('ratings.csv', 'w') as f:
        fieldnames = [ k for k in hotels[0].keys() ]
        writer = csv.DictWriter(f,fieldnames=fieldnames)
        writer.writeheader()
        for h in hotels:
            writer.writerow(h)
driver.quit()
You should look at using a list.
I would try something like this:
for rate in soup.findAll('div',{"class":["rating","listing_title"]}):
(could be wrong, this machine doesn't have bs4 for me to check, sorry)
I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. I would then like to geocode this data, place it on a map, and keep a local copy on my computer.
I used Python and Beautiful Soup 4 to extract my data. I have got as far as extracting the data and writing it to a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to one page. I want to loop it in a way that captures the data for all golf courses from all pages found on the PGA site. There are about 18000 golf courses and 900 pages to capture data from.
My script is attached below. I need help writing code that will capture ALL the data from the PGA website, not just one page but all of them, so that I end up with the data for every golf course in the United States.
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})
courses_list=[]
for item in g_data2:
    try:
        name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
    except:
        name=''
    try:
        address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
    except:
        address1=''
    try:
        address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
    except:
        address2=''
    try:
        website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
    except:
        website=''
    try:
        Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
    except:
        Phonenumber=''
    course=[name,address1,address2,website,Phonenumber]
    courses_list.append(course)
with open ('filename5.csv','wb') as file:
    writer=csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
#for item in g_data1:
#try:
#print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
#except:
#pass
#for item in g_data2:
#try:
#print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
#except:
#pass
This script only captures 20 courses at a time, and I want to capture all of them in one script, which amounts to about 18000 golf courses across 900 pages.
The PGA website's search has multiple pages; the URL follows the pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
This means you can read the content of a page, then increase the value of page by 1, read the next page, and so on.
import csv
import requests
from bs4 import BeautifulSoup
for i in range(907): # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Your code for each individual page here
If you are still reading this post, you can try this code too:
from urllib.request import urlopen
from bs4 import BeautifulSoup
file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Title = soup.find_all("div", {"class":"views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class":"views-field-title"}).get_text()
            address = i.find("div", {"class":"views-field-address"}).get_text()
            city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
            website = i.find("div", {"class":"views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            # find() returned None for a missing field; skip this listing
            continue
f.close()
Where it says range(1,5), just change that to run from 0 to the last page and you will get all the details in the CSV. I tried very hard to get your data into a proper format, but it's hard :).
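Replacing commas by hand is one option; another, shown here only as a sketch, is to let Python's csv module quote any field that contains a comma. The row below is made up, and an in-memory buffer stands in for Details.csv.

```python
import csv
import io

# Made-up course record; several fields deliberately contain commas.
rows = [
    ["Name", "Address", "City", "Phone", "Website"],
    ["Example Golf Club", "1 Fairway Dr, Suite 2", "Springfield, IL 62701",
     "(555) 010-0000", "http://example.com"],
]

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
writer.writerows(rows)  # fields with commas are wrapped in double quotes

# Reading the text back gives the original fields, commas included.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))

print(buffer.getvalue())
print(parsed == rows)
```

With a real file, opening it with newline="" (as the csv documentation recommends) avoids blank lines between rows on Windows.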
You're putting a link to a single page, it's not going to iterate through each one on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you're running for page 1 you'll only get 20. You'll need to create a loop that'll run through each page.
You can start off by creating a function that does one page then iterate that function.
Right after the search? in the url, starting at page 2, page=1 begins increasing until page 907 where it's page=906.
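As a starting point for "a function that does one page", the URL for each page could be built like this. It is a minimal sketch: BASE_URL, SEARCH_PARAMS, and build_page_url are hypothetical names, and the parameters are copied from the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.pga.com/golf-courses/search"

# Query parameters taken from the search URL in the question.
SEARCH_PARAMS = {
    "searchbox": "Course Name",
    "searchbox_zip": "ZIP",
    "distance": "50",
    "price_range": "0",
    "course_type": "both",
    "has_events": "0",
}


def build_page_url(index):
    """Index 0 is the first page (no page parameter); index n adds page=n."""
    params = dict(SEARCH_PARAMS)
    if index > 0:
        params = {"page": str(index), **params}  # put page first, as in the site's URLs
    return BASE_URL + "?" + urlencode(params)


print(build_page_url(0))
print(build_page_url(1))
print(build_page_url(906))
```

A loop such as `for index in range(907): url = build_page_url(index)` would then cover the first page through page=906.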
I noticed that the first solution repeated the first page, because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below:
for i in range(1, 907): #Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib") #Can use whichever parser you prefer
    # Your code for each individual page here
Had this same exact problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps. Create a session and it'll pull all the pages you need by inserting a cookie to all the numbered pages.
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)
The PGA website has changed since this question was asked.
It seems they organize all courses by: State > City > Course
In light of this change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we'll need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all the state URL endpoints:
URL = "https://www.pga.com"
def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls
state_urls = get_state_urls()
Step 3 - Write a function to scrape all the city links:
def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities
state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses
city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse all the useful info about a course:
def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }
course = courses[0]
parse_course(course)
Step 6 - Loop through everything and save:
all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
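The loop collects everything in all_courses but stops short of writing it out. One possible way to do the saving step, sketched with the standard csv module: save_courses and the output path are hypothetical, and the sample record is made up to mirror the keys returned by parse_course.

```python
import csv
import os
import tempfile


def save_courses(courses, path):
    """Write a list of course dicts to a CSV file with a header row."""
    fieldnames = ["name", "address", "url"]  # the keys returned by parse_course
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(courses)


# Made-up record standing in for all_courses.
sample = [{"name": "Example Course", "address": "1 Fairway Dr, Springfield", "url": "/example"}]
out_path = os.path.join(tempfile.gettempdir(), "courses_example.csv")
save_courses(sample, out_path)

# Read the file back to confirm the round trip.
with open(out_path, newline="", encoding="utf-8") as f:
    saved = list(csv.DictReader(f))
print(saved)
```

In the real script you would call save_courses(all_courses, "courses.csv") once the loop has finished.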