I am trying to extract the following information from the website https://www.brecorder.com/pakistan/2022-11-17
I want to do the following things:
Extract Category name
Extract News Articles links as well as headline given on the page
Go to Individual article link and fetch whole news from there
Paginate to previous day and repeat the above mentioned steps
Store everything in a csv file
What I have done up till now: I can get the category name, extract the article links, and paginate to the previous page, but my code isn't working well. First of all, I am getting random article links that aren't part of that particular webpage. I can paginate and extract article links for the previous day, but the same happens there too (I am attaching a screenshot of it). Moreover, I am unable to click on an individual link and get the detailed news from there.
I am also attaching snippets of the page's html:
Page's html: https://i.stack.imgur.com/juvg0.png, https://i.stack.imgur.com/rK1El.png
My code up till now is
import csv
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

PATH = r"C:\Users\HP\PycharmProjects\WebScraping\chromedriver.exe"
driver = webdriver.Chrome(PATH)

category_urls = ['https://www.brecorder.com/pakistan/2022-11-17']

Category = []
Headline = []
Date = []
Url = []
News = []

def Url_Extraction():
    category_name = driver.find_element_by_css_selector('div[class="p-4 text-md text-gr bg-orange-200 inline-block my-2 font-sans font-medium text-white"]')
    cat = category_name.text  # Save category name in variable
    print(f"{cat}")

    news_articles = driver.find_elements_by_css_selector('a[class="story__link "]')
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)

    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)

    try:
        next_page = driver.find_element(By.CSS_SELECTOR, 'a[class="infinite-more-link w-40 mx-auto text-center p-2 my-10 border bg-beige-400"]')
        driver.execute_script("arguments[0].click();", next_page)
    except Exception as e:
        print(e)

start_time = time.time()

for url in category_urls:
    driver.get(url)  # Go to Webpage
    driver.implicitly_wait(30)  # we don't need to wait 30 secs if element is already there (very useful)
    for num in range(2):
        print(f'page no. {num+1}')
        Url_Extraction()

''' Saving URLs to a csv file '''
with open('URL_List', 'w', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(Url)

''' Adding Data to a Dataframe '''
cols = ['Url', 'Category']
data = pd.DataFrame(columns=cols, index=range(len(Url)))
for index in range(len(Url)):
    data.loc[index].Url = Url[index]
    data.loc[index].Category = Category[index]
data.to_csv('URLlist_with_Cat.csv')

time.sleep(3)
driver.quit()
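For the part that is still missing (step 3: opening each collected link and pulling the article text), a minimal sketch could look like the following. The div[class*="story__content"] selector and the output filename are assumptions, not verified against brecorder.com's markup, so they would need to be checked against an actual article page's HTML.

# Hedged sketch: visit each collected URL and pull the article text.
# NOTE: the selector below is an assumption; inspect an article page and adjust it.
for article_url in Url:
    driver.get(article_url)
    try:
        body = driver.find_element(By.CSS_SELECTOR, 'div[class*="story__content"]').text
    except Exception:
        body = ''
    News.append(body)

# one row per article, ready for CSV
pd.DataFrame({'Url': Url, 'Category': Category, 'News': News}).to_csv('Articles_with_News.csv', index=False)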
I am totally new to all of this. I am trying to extract articles from a lot of pages, but I put only 4 URLs in the code below, and I need to extract only the important paragraphs from the <p> tags.
Here is my code for this sample:
currency = 'BTC'

btc_today = pd.DataFrame({'Currency': [],
                          'Date': [],
                          'Title': [],
                          'Content': [],
                          'URL': []})

links = ["https://www.investing.com/news/cryptocurrency-news/3-reasons-why-bitcoins-drop-to-21k-and-the-marketwide-selloff-could-be-worse-than-you-think-2876810",
         "https://www.investing.com/news/cryptocurrency-news/crypto-flipsider-news--btc-below-22k-no-support-for-pow-eth-ripple-brazil-odl-cardano-testnet-problems-mercado-launches-crypto-2876644",
         "https://www.investing.com/news/cryptocurrency-news/can-exchanges-create-imaginary-bitcoin-to-dump-price-crypto-platform-exec-answers-2876559",
         "https://www.investing.com/news/cryptocurrency-news/bitcoin-drops-7-to-hit-3week-lows-432SI-2876376"]

for link in links:
    driver.get(link)
    driver.maximize_window()
    time.sleep(2)
    data = []
    date = driver.find_element(By.XPATH, '/html/body/div[5]/section/div[1]/span').text.strip()
    title = driver.find_element(By.XPATH, '/html/body/div[5]/section/h1').text.strip()
    url = link
    content = driver.find_elements(By.TAG_NAME, 'p')
    for item in content:
        body = item.text
        print(body)
    articles = {'Currency': currency, 'Date': date, 'Title': title, 'Content': body, 'URL': url}
    btc_today = btc_today.append(pd.DataFrame(articles, index=[0]))

btc_today.reset_index(drop=True, inplace=True)
btc_today
This is the result I got (see the attached output screenshot).
I have also tried to do it with this loop, but it returns results in many rows and not article by article:
for p_number in range(1, 10):
    try:
        content = driver.find_element(By.XPATH, f'/html/body/div[5]/section/div[3]/p[{p_number}]').text.strip()
        # print(content)
    except NoSuchElementException:
        pass
Can somebody help, please? I would really appreciate it. I have seriously done my best for days to find a solution, but no progress.
I am assuming you need to get the main content, for that, change the locator for the 'content':
content = driver.find_elements(By.CSS_SELECTOR, '.WYSIWYG.articlePage p')
Also, there are unnecessary <p> tags in the content ("Position added successfully to: " and "Continue reading on DailyCoin"); you can skip those with an if statement inside the for loop below:
for item in content:
    body = item.text
    if body.startswith(("Position added successfully to:", "Continue reading on DailyCoin")):
        continue
    print(body)
    # rest of your per-paragraph code here
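Putting this together with the question's goal of one row per article, a possible shape of the loop might be the following. The XPaths come from the question, the .WYSIWYG.articlePage selector from this answer, and the rest (joining the filtered paragraphs into a single Content value) is just one way to do it, not the only one:

# Sketch: assumes the driver, links and currency from the question are already defined
skip = ("Position added successfully to:", "Continue reading on DailyCoin")
rows = []
for link in links:
    driver.get(link)
    time.sleep(2)
    date = driver.find_element(By.XPATH, '/html/body/div[5]/section/div[1]/span').text.strip()
    title = driver.find_element(By.XPATH, '/html/body/div[5]/section/h1').text.strip()
    paragraphs = driver.find_elements(By.CSS_SELECTOR, '.WYSIWYG.articlePage p')
    body = "\n".join(p.text for p in paragraphs if not p.text.startswith(skip))
    rows.append({'Currency': currency, 'Date': date, 'Title': title,
                 'Content': body, 'URL': link})
btc_today = pd.DataFrame(rows)  # one row per article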
Hi guys
I have a problem with scraping this dynamic site (https://kvartiry-bolgarii.ru/).
I need to get all the links to the home-sale ads.
I use Selenium to load the page and get the links to the ads; after that I scroll the page down to load new ads. After the new ads are loaded, I parse all the links on the page again and write them to the list.
But the data in the list is not updated, and the script continues to work with the links that were on the page before scrolling down.
By the way, I set up a check so that the script keeps running until the last announcement on the site appears in the list; I found out its link in advance.
How can this problem be corrected?
def get_link_info():
    try:
        url = "https://kvartiry-bolgarii.ru/"
        driver = webdriver.Chrome(
            executable_path=r'C:\Users\kk\Desktop\scrape_house\drivers\chromedriver.exe',
            options=options
        )
        driver.get(url)

        req = requests.get(url)
        req.encoding = 'utf8'
        soup = BeautifulSoup(req.text, "lxml")
        articles = soup.find_all("div", class_="content")

        links_urls = []
        for article in articles:
            house_url = article.find("a").get("href")
            links_urls.append(house_url)
        # print(links_urls)

        first_link_number = links_urls[-2].split("-")[-1]
        first_link_number = first_link_number[1:]
        # print(first_link_number)
        last_link_number = links_urls[-1].split("-")[-1]
        last_link_number = last_link_number[1:]
        # print(last_link_number)

        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)

        check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"
        for a in links_urls:
            if a != check:
                for article in articles:
                    house_url = article.find("a").get("href")
                    links_urls.append(house_url)
                html = driver.find_element_by_tag_name('html')
                html.send_keys(Keys.END)
                print(links_urls[-1])
            else:
                print(links_urls[0], links_urls[-1])
                print("all links are ready")
Some pointers: you don't need to mix Selenium, requests and BeautifulSoup; just Selenium is enough. When you are scrolling infinitely, you need to remove duplicate elements before adding them to your list.
You can try this. This should work.
from selenium import webdriver
import time

def get_link_info():
    all_links = []
    try:
        driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
        driver.get('https://kvartiry-bolgarii.ru/')
        time.sleep(3)
        old_links = set()  # Empty set
        while True:
            # Scroll to get more ads
            driver.execute_script("window.scrollBy(0,3825)", "")
            # Wait for new ads to load
            time.sleep(8)
            links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')  # Find elements
            ans = set(links_divs) - set(old_links)  # Remove old elements
            for link in ans:
                # Scroll to the link.
                driver.execute_script("arguments[0].scrollIntoView();", link)
                fir = link.get_attribute('href')
                all_links.append(fir)
            # Remember what we have already seen
            old_links = links_divs
    except Exception as e:
        raise e

get_link_info()
I am trying to scrape the Backcountry.com review section. The site uses a dynamic "load more" section, i.e. the URL doesn't change when you want to load more reviews. I am using the Selenium webdriver to interact with the button that loads more reviews and BeautifulSoup to scrape the reviews.
I was able to successfully interact with the load more button and load all the reviews available. I was also able to scrape the initial reviews that appear before you try the load more button.
IN SUMMARY: I can interact with the load more button, I can scrape the initial reviews available but I cannot scrape all the reviews that are available after I load all.
I have tried to change the html tags to see if that makes a difference. I have tried to increase the sleep time in case the scraper didn't have enough time to complete its job.
# URL and request code for BeautifulSoup
url_filter_bc = 'https://www.backcountry.com/msr-miniworks-ex-ceramic-water-filter?skid=CAS0479-CE-ONSI&ti=U2VhcmNoIFJlc3VsdHM6bXNyOjE6MTE6bXNy'
res_filter_bc = requests.get(url_filter_bc, headers={'User-agent': 'notbot'})

# Function that scrapes the reviews
def scrape_bc(request, website):
    newlist = []
    soup = BeautifulSoup(request.content, 'lxml')
    newsoup = soup.find('div', {'id': 'the-wall'})
    reviews = newsoup.find('section', {'id': 'wall-content'})
    for row in reviews.find_all('section', {'class': 'upc-single user-content-review review'}):
        newdict = {}
        newdict['review'] = row.find('p', {'class': 'user-content__body description'}).text
        newdict['title'] = row.find('h3', {'class': 'user-content__title upc-title'}).text
        newdict['website'] = website
        newlist.append(newdict)
    df = pd.DataFrame(newlist)
    return df

# Function that uses Selenium and combines that with the scraper function to output a pandas DataFrame
def full_bc(url, website):
    driver = connect_to_page(url, headless=False)
    request = requests.get(url, headers={'User-agent': 'notbot'})
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            # full_json = driver.page_source
            temp_df = pd.DataFrame()
            temp_df = scrape_bc(request, website)
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            time.sleep(7)
            driver.quit()
            break
    return full_df
I expect a pandas dataframe with 113 rows and three columns.
I am getting a pandas dataframe with 18 rows and three columns.
OK, you clicked loadMoreButton and loaded more reviews, but you keep feeding scrape_bc the same request content you downloaded once, completely separately from Selenium.
Replace requests.get(...) with driver.page_source and make sure you take driver.page_source inside the loop, before the scrape_bc(...) call (see the sketch after the snippet below):
request = driver.page_source
temp_df = pd.DataFrame()
temp_df = scrape_bc(request, website)
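In context, that means taking the page source only after the clicking loop has finished. Note that driver.page_source is a plain string rather than a requests response, so scrape_bc would also have to call BeautifulSoup(request, 'lxml') instead of BeautifulSoup(request.content, 'lxml'). A rough sketch of the reworked full_bc (connect_to_page is the question's own helper):

def full_bc(url, website):
    driver = connect_to_page(url, headless=False)  # helper from the question
    while True:
        try:
            load_more = driver.find_element_by_xpath(
                "//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            load_more.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            break
    html = driver.page_source          # the fully expanded page, not a fresh requests.get
    driver.quit()
    return scrape_bc(html, website)    # scrape_bc must then parse the string: BeautifulSoup(html, 'lxml')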
This is a loop for scraping several elements. Sometimes a price isn't found. Instead of passing through the except block, I need to print/write a value for the cases when no price is found. The reason is that when it just passes through, the variable values get mismatched when printing (title, link, image, price). Hopefully you can see my logic below in what I'm trying to accomplish. I'm also attaching a screenshot so you can see what I mean.
#finds titles
deal_title = browser.find_elements_by_xpath("//a[@id='dealTitle']/span")
titles = []
for title in deal_title:
    titles.append(title.text)

#finds links
deal_link = browser.find_elements_by_xpath("//div[@class='a-row dealDetailContainer']/div/a[@id='dealTitle']")
links = []
for link in deal_link:
    links.append(link.get_attribute('href'))

#finds images
deal_image = browser.find_elements_by_xpath("//a[@id='dealImage']/div/div/div/img")
images = []
for image in deal_image:
    images.append(image.get_attribute('src'))

#finds prices
try:
    deal_price = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span")
    prices = []
    for price in deal_price:
        prices.append(price.text)
except NoSuchElementException:
    price = ("PRINT/WRITE THIS TEXT INSTEAD OF PASSING")

#writes to html
for title, link, image, price in zip(titles, links, images, prices):
    f.write("<tr class='border'><td class='image'>" + "<img src=" + image + "></td>" + "<td class='title'>" + title + "</td><td class='price'>" + price + "</td></tr>")
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

capabilities = {
    'browserName': 'chrome',
    'chromeOptions': {
        'useAutomationExtension': False,
        'forceDevToolsScreenshot': True,
        'args': ['--start-maximized', '--disable-infobars']
    }
}

driver = webdriver.Chrome(executable_path='./chromedriver_2.38.exe', desired_capabilities=capabilities)
driver.get("https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?"
           "gb_f_deals1=enforcedCategories:2972638011,"
           "dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,"
           "includedAccessTypes:,page:10,sortOrder:BY_SCORE,"
           "dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&"
           "pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&"
           "pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8")
time.sleep(15)

golds = driver.find_elements_by_css_selector(".widgetContainer #widgetContent > div.singleCell")
print("found %d golds" % len(golds))

template = """\
<tr class="border">
    <td class="image"><img src="{0}"></td>\
    <td class="title">{2}</td>\
    <td class="price">{3}</td>
</tr>"""

lines = []
for gold in golds:
    goldInfo = {}
    goldInfo['title'] = gold.find_element_by_css_selector('#dealTitle > span').text
    goldInfo['link'] = gold.find_element_by_css_selector('#dealTitle').get_attribute('href')
    goldInfo['image'] = gold.find_element_by_css_selector('#dealImage img').get_attribute('src')
    try:
        goldInfo['price'] = gold.find_element_by_css_selector('.priceBlock > span').text
    except NoSuchElementException:
        goldInfo['price'] = 'No price display'
    print(goldInfo['title'])
    line = template.format(goldInfo['image'], goldInfo['link'], goldInfo['title'], goldInfo['price'])
    lines.append(line)

html = """\
<html>
<body>
    <table>
        {0}
    </table>
</body>
</html>\
"""

f = open('./result.html', 'w')
f.write(html.format('\n'.join(lines)))
f.close()
If I understood it correctly, you have a problem with some page elements not loading "on time".
The elements you want to scrape may not have loaded yet when you read them.
To prevent this, you can use explicit waits (the script will wait up to a specified time until the specified element loads); see the sketch below.
When using this, there is a smaller chance you will miss some values.
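For example, a sketch using Selenium's WebDriverWait; the XPath is taken from the question's code, and the 10-second timeout is an arbitrary choice:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # wait up to 10 seconds for the price blocks to be present in the DOM
    deal_price = WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, "//div[@class='a-row priceBlock unitLineHeight']/span")))
    prices = [price.text for price in deal_price]
except TimeoutException:
    prices = []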
Yes, you can skip the price as well. I am providing an alternative approach where you create a list of the available prices and the respective images, as follows:
Code Block:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
browser = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
browser.get("https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8")

#finds images
deal_image = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span//preceding::img[1]")
images = []
for image in deal_image:
    images.append(image.get_attribute('src'))

#finds prices
deal_price = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span")
prices = []
for price in deal_price:
    prices.append(price.text)

#print the information
for image, price in zip(images, prices):
    print(image, price)
Console Output:
https://images-na.ssl-images-amazon.com/images/I/31zt-ovKJqL._AA210_.jpg $9.25
https://images-na.ssl-images-amazon.com/images/I/610%2BKAfr72L._AA210_.jpg $15.89
https://images-na.ssl-images-amazon.com/images/I/41whkQ1m0uL._AA210_.jpg $31.49
https://images-na.ssl-images-amazon.com/images/I/41cAbUWEdoL._AA210_.jpg $259.58 - $782.99
https://images-na.ssl-images-amazon.com/images/I/51raHLFC8wL._AA210_.jpg $139.56
https://images-na.ssl-images-amazon.com/images/I/41fuZZwdruL._AA210_.jpg $41.24
https://images-na.ssl-images-amazon.com/images/I/51N2rdMSh0L._AA210_.jpg $19.50 - $20.99
https://images-na.ssl-images-amazon.com/images/I/515DbJhCtOL._AA210_.jpg $22.97
https://images-na.ssl-images-amazon.com/images/I/51OzOZrj1rL._AA210_.jpg $109.95
https://images-na.ssl-images-amazon.com/images/I/31-QDRkNbhL._AA210_.jpg $15.80
https://images-na.ssl-images-amazon.com/images/I/41vXJ9fvcIL._AA210_.jpg $88.99
https://images-na.ssl-images-amazon.com/images/I/51fKqo2YfcL._AA210_.jpg $21.85
https://images-na.ssl-images-amazon.com/images/I/31GcGUXz9TL._AA210_.jpg $220.99 - $241.99
https://images-na.ssl-images-amazon.com/images/I/41sROkWjnpL._AA210_.jpg $40.48
https://images-na.ssl-images-amazon.com/images/I/51vXMFtZajL._AA210_.jpg $22.72
https://images-na.ssl-images-amazon.com/images/I/512s5ZrjoFL._AA210_.jpg $51.99
https://images-na.ssl-images-amazon.com/images/I/51A8Nfvf8eL._AA210_.jpg $8.30
https://images-na.ssl-images-amazon.com/images/I/51aDac6YN5L._AA210_.jpg $18.53
https://images-na.ssl-images-amazon.com/images/I/31SQON%2BiOBL._AA210_.jpg $10.07
Link:
https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8
I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and keep a local copy on my computer.
I utilized Python and BeautifulSoup 4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL the golf courses, but my script is limited to one page; I want to loop it in a way that captures all golf course data from all pages found on the PGA site. There are about 18,000 golf courses and 900 pages to capture data from.
Attached below is my script. I need help creating code that will capture ALL data from the PGA website, not just a single page. That way it will provide me with all the data on golf courses in the United States.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

courses_list = []

for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    try:
        website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
    except:
        website = ''
    try:
        Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
    except:
        Phonenumber = ''

    course = [name, address1, address2, website, Phonenumber]
    courses_list.append(course)

with open('filename5.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

#for item in g_data1:
    #try:
        #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
    #except:
        #pass
    #try:
        #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
    #except:
        #pass

#for item in g_data2:
    #try:
        #print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
    #except:
        #pass
    #try:
        #print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
    #except:
        #pass
    #try:
        #print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
    #except:
        #pass
This script only captures 20 courses at a time, and I want to capture everything in one script, which accounts for 18,000 golf courses and 900 pages to scrape from.
The PGA website's search has multiple pages; the URL follows the pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
This means you can read the content of the page, then increase the value of page by 1, read the next page, and so on.
import csv
import requests
from bs4 import BeautifulSoup

for i in range(907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Your code for each individual page here
If you are still reading this post, you can try this code too...
from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)

for page in range(1, 5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    Title = soup.find_all("div", {"class": "views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class": "views-field-title"}).get_text()
            address = i.find("div", {"class": "views-field-address"}).get_text()
            city = i.find("div", {"class": "views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class": "views-field-work-phone"}).get_text()
            website = i.find("div", {"class": "views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",", "|") + ",{}".format(address) + ",{}".format(city).replace(",", " ") + ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass

f.close()
Where it says range(1, 5), just change that to run from 0 to the last page, and you will get all the details in the CSV. I tried very hard to get your data into a proper format, but it's hard :).
You're putting in a link to a single page; it's not going to iterate through each one on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you're running for page 1 you'll only get 20. You'll need to create a loop that'll run through each page.
You can start off by creating a function that does one page, then iterate that function; see the sketch below.
Right after search? in the URL, the page parameter starts appearing at page 2 (page=1) and keeps increasing up to page 907 (page=906).
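A rough sketch of that structure; scrape_page is a hypothetical helper name, and the get_text() call is only a stand-in for the question's real per-item extraction:

import csv
import requests
from bs4 import BeautifulSoup

def scrape_page(page):
    # hypothetical helper: fetch one results page and return its course entries
    url = ("http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name"
           "&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0").format(page)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # the question's per-item extraction would go here; get_text() is just a placeholder
    return [item.get_text(" ", strip=True)
            for item in soup.find_all("div", {"class": "views-field-nothing"})]

courses_list = []
for page in range(907):  # pages 0..906
    courses_list.extend(scrape_page(page))

with open('courses.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in courses_list:
        writer.writerow([row])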
I noticed that the first solution had a repetition of the first instance; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below...
for i in range(1, 907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")  # Can use whichever parser you prefer
    # Your code for each individual page here
I had this same exact problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps: create a session and it will pull all the pages you need by carrying the cookie across all the numbered pages (a continuation is sketched after the snippet below).
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)
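From there, a possible continuation reuses the same session for every numbered page so the cookies from the first request are carried along; the parsing itself would be the same as in the question:

# Sketch: the session re-sends the cookies set by the first request
for page in range(1, 907):
    paged_url = ("http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name"
                 "&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0").format(page)
    r = s.get(paged_url)
    soup = BeautifulSoup(r.content, "html.parser")
    # per-page extraction from the question goes here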
The PGA website has changed since this question was asked.
It seems they organize all courses by: State > City > Course
In light of this change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we'll need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all the state URL endpoints:
URL = "https://www.pga.com"
def get_state_urls():
soup = Soup.get(URL + "/play")
a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
state_urls = [URL + a.attrs['href'] for a in a_tags]
return state_urls
state_urls = get_state_urls()
Step 3 - Write a function to scrape all the city links:
def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse all the useful info about a course:
def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)
Step 6 - Loop through everything and save:
all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
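Since the original goal was a CSV, a possible final step (assuming pandas is available; this part is not in the steps above) would be:

import pandas as pd

# all_courses is a list of dicts, so it maps straight onto a DataFrame
pd.DataFrame(all_courses).to_csv("pga_courses.csv", index=False)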