The Situation
I am trying to scrape webpages to get some data.
For my application I need the HTML data that is actually visible in the browser, as a whole.
The Problem
But when I scrape some URLs, I am getting data that is not visible in the browser, even though it is there in the HTML code. So is there any way to scrape only the data that is actually visible in the browser?
Code
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/nebu/selenium_drivers/chromedriver")

URL = "https://augustasymphony.com/event/top-of-the-world/"
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
    driver.quit()
except WebDriverException:
    driver.quit()

# Drop the header and footer, then dump the remaining text
soup = BeautifulSoup(html_content, "html.parser")
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is None:
        continue
    else:
        s.extract()
text = soup.getText(separator=u' ')
print(text)
The Question
Where am I going wrong here?
How can I go about debugging this?
This is simply a case of needing to extract the data in a more specific manner.
You really have two options:
Option 1 (in my opinion the better one, as it is faster and less resource-heavy):
import requests
from bs4 import BeautifulSoup as bs
headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}
res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
soup = bs(res.text, "lxml")
event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
time = soup.find("p", {"class": "rhino-event-time"}).text.strip()
You can use requests quite simply to find the data, as shown in the code above, specifically selecting the data you want and perhaps saving it in a dictionary. This is the normal way to go about it. The page may contain a lot of scripts, but it doesn't require JavaScript to load this data dynamically.
Option 2:
You continue using Selenium and collect the entire body of the page using one of several selections.
driver.find_element_by_id('wrapper').get_attribute('innerHTML') # Entire body
driver.find_element_by_id('tribe-events').get_attribute('innerHTML') # the events list
driver.find_element_by_id('rhino-event-single-content').get_attribute('innerHTML') # the single event
This second option is much more a case of just taking the whole HTML and dumping it.
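If you do take that route, you could still feed the dump back into BeautifulSoup and pick out the same fields as in option 1. A rough sketch, continuing from the Selenium driver in the question and assuming the event elements sit inside the #wrapper container:
from bs4 import BeautifulSoup

# Rough sketch: parse the dumped innerHTML and reuse the option-1 selectors.
# Assumes the event header/time elements live inside the #wrapper container.
body_html = driver.find_element_by_id('wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(body_html, "lxml")
event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
event_time = soup.find("p", {"class": "rhino-event-time"}).text.strip()
print(event_header, event_time)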
Personally I would go with the first option, creating dictionaries of the cleaned data.
Edit:
To further illustrate my example:
import requests
from bs4 import BeautifulSoup as bs

headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

res = requests.get("https://augustasymphony.com/event/", headers=headers)
soup = bs(res.text, "lxml")
seedlist = {a["href"] for a in soup.find("div", {"id": "tribe-events-content-wrapper"}).find_all("a") if '?ical=1' not in a["href"]}

for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    data = dict()
    data['event_header'] = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
    data['time'] = soup.find("p", {"class": "rhino-event-time"}).text.strip()
    print(data)
Here I am generating a seedlist of event URLs and then going into each one to find the information.
It's because some websites detect whether the request is coming from a real web browser. If they decide it isn't, they don't send the full HTML back, which is why you see no HTML in the response.
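A common first step, then, is to make the request look more like a browser by sending browser-style headers. A minimal sketch with requests (the User-Agent string and URL here are just placeholders):
import requests

# Minimal sketch: send a browser-like User-Agent so the site is less likely
# to treat the request as a bot. The header value and URL are placeholders.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}
res = requests.get("https://example.com/", headers=headers)
print(res.status_code)
print(res.text[:500])   # peek at the start of whatever HTML came back
If the site still refuses, it may be doing heavier bot detection (cookies, JavaScript challenges), in which case driving a real browser with Selenium is the usual fallback.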
Related
This code works and returns the single-digit number that I want, but it's so slow that it takes a good 10 seconds to complete. I will be running this 4 times for my use case, so that's 40 seconds wasted on every run.
from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get('https://warframe.market/items/ivara_prime_blueprint')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
price_element = soup.find('div', {'class': 'row order-row--Alcph'})
price2=price_element.find('div',{'class':'order-row__price--hn3HU'})
price = price2.text
print(int(price))
driver.close()
This code on the other hand does not work. It returns None.
import requests
from bs4 import BeautifulSoup
url='https://warframe.market/items/ivara_prime_blueprint'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
price_element=soup.find('div', {'class': 'row order-row--Alcph'})
price2=price_element.find('div',{'class':'order-row__price--hn3HU'})
price = price2.text
print(int(price))
My first thought was to add a user agent, but that still did not work. When I print(soup) it gives me HTML code, but when I parse it further it stops and starts giving me None, even though it's the same command as in the Selenium example.
The data is loaded dynamically within a <script> tag, so BeautifulSoup doesn't see it (it doesn't render JavaScript).
As an example, to get the data, you can use:
import json
import requests
from bs4 import BeautifulSoup
url = "https://warframe.market/items/ivara_prime_blueprint"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
script_tag = soup.select_one("#application-state")
json_data = json.loads(script_tag.string)
# Uncomment the line below to see all the data
# from pprint import pprint
# pprint(json_data)
for data in json_data["payload"]["orders"]:
    print(data["user"]["ingame_name"])
Prints:
Rogue_Monarch
Rappei
KentKoes
Tenno61189
spinifer14
Andyfr0nt
hollowberzinho
You can access the data as a dict and read the keys/values.
I'd recommend an online tool to view all the JSON since it's quite large.
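For example, to get an overview of the structure before drilling into it, you could dump the parsed JSON to a file and peek at the keys. A small sketch continuing from the snippet above (the file name is arbitrary, and apart from "user"/"ingame_name" the exact field names inside each order are something to verify against the dump):
import json

# Continuing from the snippet above: json_data is the parsed #application-state JSON.
# Dump it to a file so it can be inspected in an editor or an online JSON viewer.
with open("application_state.json", "w", encoding="utf-8") as f:
    json.dump(json_data, f, indent=2)

# Peek at the structure before drilling down into it.
print(json_data.keys())
print(json_data["payload"].keys())

# Each order is a plain dict, so .keys()/.get() let you explore and read fields safely.
first_order = json_data["payload"]["orders"][0]
print(first_order.keys())
print(first_order.get("user", {}).get("ingame_name"))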
See also
Parsing out specific values from JSON object in BeautifulSoup
I am throwing together a Walmart review scraper; it currently scrapes HTML from most Walmart pages without a problem. As soon as I try scraping a page of reviews, it only comes back with a small portion of the page's code, mainly just text from reviews and a few errant tags. Anyone know what the problem could be?
import requests
headers = {
'Accept': '*/*',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
'Accept-Language': 'en-us',
'Referer': 'https://www.walmart.com/',
'sec-ch-ua-platform': 'Windows',
}
cookie_jar = {
'_pxvid': '35ed81e0-cb1a-11ec-aad0-504d5a625548',
}
product_num = input('Enter Product Number: ')
url2 = ('https://www.walmart.com/reviews/product/'+str(product_num))
r = requests.get(url2, headers=headers, cookies=cookie_jar, timeout=5)
print(r.text)
As larsks already commented, some content is loaded in dynamically, for example if you scroll down far enough.
requests only fetches the initial HTML and BeautifulSoup doesn't execute JavaScript, but you can solve this with Selenium.
What Selenium does is open your URL in a script-controlled web browser; it lets you fill out forms and also scroll down. Below is a code example of how to use Selenium with BS4.
from bs4 import BeautifulSoup
from selenium import webdriver

# Search on google for the driver and save it in the path below
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# for Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")

# Here you open the url with the reviews
driver.get("https://www.example.com")
driver.maximize_window()

# This scrolls down to the bottom of the website
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

# Now you can scrape the given website from your Selenium browser using:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
This solution assumes that the reviews are loaded in through scrolling down the page. Of course you don't have to use BeautifulSoup to scrape the site, it's personal preference. Let me know if it helped.
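If one scroll isn't enough to trigger all of the reviews, a common pattern is to keep scrolling until the page height stops growing. A rough sketch continuing from the driver above (the 2-second pause is arbitrary):
import time

# Rough sketch: keep scrolling until document.body.scrollHeight stops growing,
# so lazily loaded content has a chance to appear. The pause length is arbitrary.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")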
I'm trying to scrape agency names from the second page of a webpage using the requests module. I can parse the names from its landing page by sending a GET request to the very URL.
However, when it comes to accessing the names from its second page and later, I need to send POST requests along with the appropriate parameters. I tried to mimic the POST request exactly the way I see it in dev tools, but all I get in return is the following:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1"><redirect url="/ptn/exceptionhandler/sessionExpired.xhtml"></redirect></partial-response>
This is how I've tried:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
link = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu'
url = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")

    payload = {
        'contentForm': 'contentForm',
        'contentForm:j_idt171_windowName': '',
        'contentForm:j_idt187_listButton2_HIDDEN-INPUT': '',
        'contentForm:j_idt192_searchBar_INPUT-SEARCH': '',
        'contentForm:j_idt192_searchBarList_HIDDEN-SUBMITTED-VALUE': '',
        'contentForm:j_id135_0': 'Title',
        'contentForm:j_id135_1': 'Document No.',
        'contentForm:j_id136': 'Match All',
        'contentForm:j_idt853_select': 'ON',
        'contentForm:j_idt859_select': '0',
        'javax.faces.ViewState': soup.select_one('input[name="javax.faces.ViewState"]')['value'],
        'javax.faces.source': 'contentForm:j_idt902:j_idt955_2_2',
        'javax.faces.partial.event': 'click',
        'javax.faces.partial.execute': 'contentForm:j_idt902:j_idt955_2_2 contentForm:j_idt902',
        'javax.faces.partial.render': 'contentForm:j_idt902:j_idt955 contentForm dialogForm',
        'javax.faces.behavior.event': 'action',
        'javax.faces.partial.ajax': 'true'
    }

    s.headers['Referer'] = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu'
    s.headers['Faces-Request'] = 'partial/ajax'
    s.headers['Origin'] = 'https://www.gebiz.gov.sg'
    s.headers['Host'] = 'www.gebiz.gov.sg'
    s.headers['Accept-Encoding'] = 'gzip, deflate, br'

    res = s.post(url, data=payload, allow_redirects=False)
    # soup = BeautifulSoup(res.text, "lxml")
    # for item in soup.select(".commandLink_TITLE-BLUE"):
    #     print(item.get_text(strip=True))
    print(res.text)
How can I parse the names from the second page of a webpage when the URL remains unchanged?
You can use Selenium to traverse between pages. The following code will allow you to do this.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import time

chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36")

driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)
driver.get("https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu")

def find_next_button(driver):
    # Returns the "Next" button, or None once the last page has been reached
    try:
        return driver.find_element_by_xpath("//input[starts-with(@value, 'Next')]")
    except NoSuchElementException:
        return None

# check if a next page exists
next_page = find_next_button(driver)

# keep clicking the next button until there is no next page
while next_page is not None:
    time.sleep(5)
    next_page.click()
    time.sleep(5)
    next_page = find_next_button(driver)
I have not added complete code for extracting the Agency names; I presume it will not be difficult for you, but a rough sketch follows below.
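As a minimal sketch, to run inside the while loop after each click, you could parse the rendered page with BeautifulSoup; the ".commandLink_TITLE-BLUE" selector is reused from your commented-out requests attempt, and whether that class really carries the agency names is an assumption to verify:
from bs4 import BeautifulSoup

# Minimal sketch: parse the rendered page after each click and print the link texts.
# The ".commandLink_TITLE-BLUE" selector comes from the question's requests attempt;
# that it holds the agency names is an assumption to check against the page source.
soup = BeautifulSoup(driver.page_source, "html.parser")
for item in soup.select(".commandLink_TITLE-BLUE"):
    print(item.get_text(strip=True))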
Make sure to install Selenium and download the ChromeDriver. Also make sure to download the correct version of the driver; you can confirm the version by viewing the 'About' section of your Chrome browser.
I've created a script in Python with Selenium to scrape the website address located within the Contact details section of a website. However, the problem is that there is no URL associated with that link (I can click on that link, though).
How can I parse the website link located within the Contact details?
from selenium import webdriver

URL = 'https://www.truelocal.com.au/business/vitfit/sydney'

def get_website_link(driver, link):
    driver.get(link)
    website = driver.find_element_by_css_selector("[ng-class*='getHaveSecondaryWebsites'] > span").text
    print(website)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_website_link(driver, URL)
    finally:
        driver.quit()
When I run the script, I get the visible text associated with that link, which is Visit website.
Element with "Visit website" text is a span, that has vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank') javascript and not actual href.
My suggestion, if your goal is to scrape and not testing, you can use solution below with requests package to get data as json and extract any information you need.
Another one is actually click, as you did.
import requests
import re
headers = {
'Referer': 'https://www.truelocal.com.au/business/vitfit/sydney',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/73.0.3683.75 Safari/537.36',
'DNT': '1',
}
response = requests.get('https://www.truelocal.com.au/www-js/configuration.constant.js?v=1552032205066',
headers=headers)
assert response.ok
# extract token from response text
token = re.search("token:\\s'(.*)'", response.text)[1]
headers['Accept'] = 'application/json, text/plain, */*'
headers['Origin'] = 'https://www.truelocal.com.au'
response = requests.get(f'https://api.truelocal.com.au/rest/listings/vitfit/sydney?&passToken={token}', headers=headers)
assert response.ok
# use response.text to get full json as text and see what information can be extracted.
contact = response.json()["data"]["listing"][0]["contacts"]["contact"]
website = list(filter(lambda x: x["type"] == "website", contact))[0]["value"]
print(website)
print("the end")
import requests
a = 'http://tmsearch.uspto.gov/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1&p_search=searchstr&BackReference=&p_L=100&p_plural=no&p_s_PARA1={}&p_tagrepl%7E%3A=PARA1%24MI&expr=PARA1+or+PARA2&p_s_PARA2=&p_tagrepl%7E%3A=PARA2%24ALL&a_default=search&f=toc&state=4809%3Ak1aweo.1.1&a_search=Submit+Query'
a = a.format('coca-cola')
b = requests.get(a)
print(b.text)
print(b.url)
If you copy the printed URL and paste it into a browser, the site opens with no problem, but if I use requests.get I get some token errors. Is there anything I can do?
Via requests.get I do get a response back, but not the data I see when doing it manually. It says: <html><head><TITLE>TESS -- Error</TITLE></head><body>
First of all, make sure you follow the website's Terms of Use and usage policies.
This is a little bit more complicated than it may seem. You need to maintain a certain state throughout the web-scraping session, and you'll need an HTML parser, like BeautifulSoup, along the way:
from urllib.parse import parse_qs, urljoin
import requests
from bs4 import BeautifulSoup
SEARCH_TERM = 'coca-cola'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}

    # get the current search state
    response = session.get("https://tmsearch.uspto.gov/")
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.find("a", text="Basic Word Mark Search (New User)")["href"]
    session.get(urljoin(response.url, link))
    state = parse_qs(link)['state'][0]

    # perform a search
    response = session.post("https://tmsearch.uspto.gov/bin/showfield", data={
        'f': 'toc',
        'state': state,
        'p_search': 'search',
        'p_s_All': '',
        'p_s_ALL': SEARCH_TERM + '[COMB]',
        'a_default': 'search',
        'a_search': 'Submit'
    })

    # print search results
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find("font", color="blue").get_text())

    table = soup.find("th", text="Serial Number").find_parent("table")
    for row in table('tr')[1:]:
        print(row('td')[1].get_text())
It prints all the serial number values from the first search results page, for demonstration purposes.