Parsing a website with BeautifulSoup and Selenium - python

Trying to compare avg. temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124
I can successfully gather the webpage's source code, but I am having trouble parsing it to extract only the high temps, low temps, rainfall and the averages under the "History" tab. I can't seem to address the right class/id; the only result I get is "None".
This is what I have so far, with the last line being an attempt to get the high temps only:
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://usclimatedata.com/climate/binghamton/new-york/unitedstates/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})

First of all, these are two different classes - align_right and temperature_red - you've joined them and added that table_data_td for some reason. And, the elements having these two classes are td elements, not table.
In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":
climate_table = soup.find(id="climate_table")
Another important thing to note is that there is potential for "timing" issues here - when you get the driver.page_source value, the climate information might not be there yet. This is usually approached by adding an Explicit Wait after navigating to the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"

browser = webdriver.Chrome()
try:
    browser.get(url)

    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))

    soup = BeautifulSoup(browser.page_source, "lxml")

    climate_table = soup.find(id="climate_table")
    print(climate_table.prettify())
finally:
    browser.quit()
Note the addition of the try/finally, which safely closes the browser in case of an error - that also helps to avoid "hanging" browser windows.
And, look into pandas.read_html(), which can read your climate information table into a DataFrame auto-magically.
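A minimal sketch, assuming pandas and lxml are installed; the call would go inside the try block above, before browser.quit(), and which list index holds the climate table is an assumption, so inspect the returned list:

import pandas as pd

# pandas parses every <table> in the rendered page source into a DataFrame
tables = pd.read_html(browser.page_source)
print(len(tables))
print(tables[0].head())  # inspect the list to find the climate table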

Related

How to scrape text from a hidden element?

I am trying to scrape the text of the Swiss constitution from Link and convert it to markdown. However, the page source is different from what I see in the inspector: the source only contains noscript warnings in various languages, with the "app-root" element hidden.
The inspector shows a .html file served from here with which I am able to get the desired result. However, using this file directly would not allow me to scrape the subsequent revisions of the law automatically. Is there a way to extract the page source with the element "app-root" displayed?
This code returns "None" but works with the URL set to the .html file:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver import FirefoxOptions
from bs4 import BeautifulSoup
from markdownify import markdownify
url = "https://www.fedlex.admin.ch/eli/cc/1999/404/en"
opts = FirefoxOptions()
opts.add_argument("--headless")
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=opts)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "lawcontent"})
content = markdownify(str(div))
print(content[:200])
Any help is much appreciated.
In your code, you're not giving the driver any time to render the contents, resulting in incomplete source code.
Waits can be used to wait for required elements to be visible, present, etc. The code below waits for the div content to be visible and then reads the page source.
Code snippet-
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

# driver, BeautifulSoup and markdownify are set up as in the question's code
url = "https://www.fedlex.admin.ch/eli/cc/1999/404/en"
driver.get(url)

try:
    delay = 20  # 20 second delay
    WebDriverWait(driver, delay).until(EC.visibility_of_element_located((By.ID, 'lawcontent')))
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": "lawcontent"})
    content = markdownify(str(div))
    print(content[:200])
# raises TimeoutException if the element is not visible within the delay duration
except TimeoutException:
    print("Timeout!!!")

BeautifulSoup sports scraper gives back empty list

I am trying to scrape the results of tennis matches from this website using Python's BeautifulSoup. I have tried a lot of things, but I always get back an empty list. Is there an obvious mistake I am making? There are multiple instances of this class on the website when I inspect it, but BeautifulSoup does not seem to find them.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)
The results table is loaded via JavaScript, so BeautifulSoup does not find it: it's not loaded yet at the moment of parsing. To solve this problem you'll need to use Selenium. Here is the link for chromedriver.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>',chrome_options=chrome_options)
# load page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait up to 5 seconds for the results table to load
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
    print(tag)
The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium first so you get all the dynamic content you see in your web browser's inspection tool.
There are a few guides online showing how this works - you could start with this one maybe:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25

I need to find a way to make my code give time for the page to load, and only then grab the HTML code

So I wanted to grab a real-time value from a website displaying "the real time evolution of the earth's population", except when I run the code:
import requests
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
population = soup.findAll('p', attrs={'class':'counter'})
print(population[0])
my output is:
<p class="counter" id="counters_number_interveal_5">loading...</p>
The number I am looking to get is replaced by "loading...", so I am looking for a way to actually get the value, or an alternative that gets the same result.
You can wait for the page to load explicitly using time.sleep(), which will probably get the end-result you want. However, this isn't best practice and could end up waiting longer than the page needed to load.
I would recommend using Selenium instead, which has a multitude of useful features related to this; specifically, it can wait for elements implicitly or explicitly.
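As a side note, an implicit wait is a single driver-wide polling setting; a minimal sketch (the 10-second value is just an example):

from selenium import webdriver

driver = webdriver.Firefox()
# poll up to 10 seconds whenever an element lookup does not immediately succeed
driver.implicitly_wait(10)

The explicit wait shown next is usually preferable, because it targets one specific element and condition.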
The following is how you could use Selenium to wait until the counter is loaded, and not wait any longer.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = 'https://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live'

driver = webdriver.Firefox()
driver.get(url)

try:
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id=\"counters_number_interval_5\"]")))
    counter = driver.find_element_by_xpath("//*[@id=\"counters_number_interval_5\"]").text
    print(counter)
except TimeoutException:
    print("Timed out, couldn't load the page in time")

driver.quit()
You will need to install Selenium, but it's like installing BeautifulSoup - just use pip install selenium
The website is still loading, perhaps use the time module to make the script wait for the answer.
import time
time.sleep(5)
#Wait 5 seconds for the answer
This should be added between the requests.get and the parsing with BeautifulSoup.
EDIT
Rereading your question, the problem is actually in the usage of requests: it downloads the HTML immediately. You could try adding the timeout argument:
response = requests.get(url, timeout=5)
Note, though, that timeout only controls how long requests waits for the server's response; requests never executes the page's JavaScript, so a browser-based approach like the Selenium answer above is the reliable fix.
It's because you are targeting the wrong element. You can find the desired result within the second element with the same class name, counter. Try either of the two selectors below - one is commented out and the other is active. They both produce the desired result.
import requests
from bs4 import BeautifulSoup
url = 'https://www.theworldcounts.com/counters/shocking_environmental_facts_and_statistics/world_population_clock_live'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
population = soup.find(class_='item-content').find(class_='counter').text
# population = soup.select_one('.item-content > p.counter').text
print(population)

Access all href-links in a deep-class hierarchy

I am trying to access all href-links from a website, the search results to be precise. My first intention is to get all the links, and then to look into them further. The problem is that I get some links from the website, but not the links of the search results. Here is one version of my code.
from selenium import webdriver
from htmldom import htmldom
dom = htmldom.HtmlDom("myWebsite")
dom = dom.createDom()
p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))
Here is a screenshot of the HTML of that particular website. In the screenshot, I marked the href-link I am trying to access. I am open to any help given, be it in Selenium, htmldom, BeautifulSoup, etc.
The data you are after is loaded with AJAX requests, so you can't scrape it directly from the initial page source. But the AJAX request is sent to this URL:
https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20
which returns the data in JSON format. You can use requests module to scrape this data.
import requests
BASE_URL = 'https://open.nrw/dataset/'
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=20')
data = r.json()
for item in data['response']['docs']:
    print(BASE_URL + item['name'])
Output:
https://open.nrw/dataset/mags-90-10-dezilsverhaeltnis-der-aequivalenzeinkommen-1512029759099
https://open.nrw/dataset/alkis-nutzungsarten-pro-baublock-wuppertal-w
https://open.nrw/dataset/allgemein-bildende-schulen-am-1510-nach-schulformen-schulen-schueler-und-lehrerbestand-w
https://open.nrw/dataset/altersgruppen-in-meerbusch-gesamt-meerb
https://open.nrw/dataset/amtliche-stadtkarte-wuppertal-raster-w
https://open.nrw/dataset/mais-anteil-abhaengig-erwerbstaetiger-mit-geringfuegiger-beschaeftigung-1477312040433
https://open.nrw/dataset/mags-anteil-der-stillen-reserve-nach-geschlecht-und-altersgruppen-1512033735012
https://open.nrw/dataset/mags-anteil-der-vermoegenslosen-in-nrw-nach-beruflicher-stellung-1512032087083
https://open.nrw/dataset/anzahl-kinderspielplatze-meerb
https://open.nrw/dataset/anzahl-der-sitzungen-von-rat-und-ausschussen-meerb
https://open.nrw/dataset/anzahl-medizinischer-anwendungen-den-oeffentlichen-baedern-duesseldorfs-seit-2006-d
https://open.nrw/dataset/arbeitslose-den-wohnquartieren-duesseldorf-d
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-rechtskreisen-des-sgb-ge
https://open.nrw/dataset/arbeitsmarktstatistik-arbeitslose-nach-stadtteilen-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sgb-ii-rechtskreis-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/arbeitsmarktstatistik-sozialversicherungspflichtige-auf-stadtteilebene-gelsenkirchen-ge
https://open.nrw/dataset/verkehrszentrale-arbeitsstellen-in-nordrhein-westfalen-1476688294843
https://open.nrw/dataset/mags-arbeitsvolumen-nach-wirtschaftssektoren-1512025235377
https://open.nrw/dataset/mais-armutsrisikoquoten-nach-geschlecht-und-migrationsstatus-der-personen-1477313317038
As you can see, this returned the first 20 URLs. When you first load the page only 20 items are present. But, if you scroll down, more are loaded. To get more items, you can change the Query String Parameter in the URL. The URL ends with rows=20. You can change this number to get the desired number of results.
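For example, here is a minimal variation of the snippet above that asks for 100 rows instead of 20 (100 is just an arbitrary illustration):

import requests

BASE_URL = 'https://open.nrw/dataset/'

# same AJAX URL as above, with only the rows parameter changed
r = requests.get('https://open.nrw/solr/collection1/select?q=*%3A*&fl=validated_data_dict%20title%20groups%20notes%20maintainer%20metadata_modified%20res_format%20author_email%20name%20extras_opennrw_spatial%20author%20extras_opennrw_groups%20extras_opennrw_format%20license_id&wt=json&fq=-type:harvest+&sort=title_string%20asc&indent=true&rows=100')
data = r.json()
for item in data['response']['docs']:
    print(BASE_URL + item['name'])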
Results appear after the initial page load due to the AJAX request.
I managed to get the links with Selenium; however, I had to wait for the .ckantitle a elements to be loaded (these are the links you want to get).
I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for AJAX requests. It means when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an AJAX request, webdriver does not wait and it's your responsibility to wait an appropriate amount of time for the page or a part of the page to load; so there is a module named expected_conditions.
Code:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = 'https://open.nrw/suche'
html = None
browser = webdriver.Chrome()
browser.get(url)
delay = 3 # seconds
try:
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.ckantitle a'))
    )
    html = browser.page_source
except TimeoutException:
    print('Loading took too much time!')
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('.ckantitle a')
    for link in links:
        print(urljoin(url, link['href']))
You need to install selenium:
pip install selenium
and get a driver here.

How can I parse table data from website using Selenium?

I'm trying to parse the table present on this website: http://www.espncricinfo.com/rankings/content/page/211270.html using Selenium. As I am a beginner, I'm struggling to do that. Here is my code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(len(soup.find_all("table")))
print(soup.find("table", {"class": "expanded_standings"}))
browser.close()
browser.quit()
That's what I tried; I'm unable to fetch anything with it. Any suggestions will be really helpful, thanks.
The table you are after is within an iframe. So, to get the data from that table you need to switch to that iframe first and then do the rest. Here is one way you could do it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
wait = WebDriverWait(driver, 10)
## if any different table you expect to have then just change the index number within nth-of-type()
## and the appropriate name in the selector
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[name='testbat']:nth-of-type(1)")))
for table in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))[1:]:
    data = [item.text for item in table.find_elements_by_css_selector("th,td")]
    print(data)
driver.quit()
And the best approach in this very case would be as follows. No browser simulator is used; only requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
soup = BeautifulSoup(res.text,"lxml")
## if any different table you expect to have then just change the index number
## and the appropriate name in the selector
item = soup.select("iframe[name='testbat']")[0]['src']
req = requests.get(item)
sauce = BeautifulSoup(req.text,"lxml")
for items in sauce.select("table tr"):
    data = [item.text for item in items.select("th,td")]
    print(data)
Partial results:
['Rank', 'Name', 'Country', 'Rating']
['1', 'S.P.D. Smith', 'AUS', '947']
['2', 'V. Kohli', 'IND', '912']
['3', 'J.E. Root', 'ENG', '881']
It looks like that page's tables are within iframes. If you have a specific table you want to scrape, try inspecting it using browser dev tools (right click, inspect element in Chrome) and find the iframe element that is wrapping it. The iframe should have a src attribute that holds a url to the page that actually contains that table. You can then use a similar method to the one you tried but instead use the src url.
Selenium can also "jump into" an iframe if you know how to find the iframe in the page's source code.
frame = browser.find_element_by_id("the_iframe_id")
browser.switch_to.frame(frame)
html = browser.page_source
# ...then parse html with BeautifulSoup as usual
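If you need the rest of the page afterwards, you can switch back out of the iframe; a small addition to the snippet above:

# return to the top-level document once done inside the iframe
browser.switch_to.default_content()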
