How can I parse table data from a website using Selenium? - python

I'm trying to parse the table present on this website using Selenium: http://www.espncricinfo.com/rankings/content/page/211270.html
As I am a beginner, I'm struggling to do that. Here is my code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(len(soup.find_all("table")))
print(soup.find("table", {"class": "expanded_standings"}))
browser.close()
browser.quit()
That is what I tried, but I'm unable to fetch anything from it. Any suggestions will be really helpful. Thanks.

The table you are after is within an iframe. So, to get the data from that table you need to switch to that iframe first and then do the rest. Here is one way you could do it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
wait = WebDriverWait(driver, 10)
## if you expect a different table, just change the index number within nth-of-type()
## and use the appropriate name in the selector
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[name='testbat']:nth-of-type(1)")))
for table in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))[1:]:
    data = [item.text for item in table.find_elements_by_css_selector("th,td")]
    print(data)
driver.quit()
And the best approach in this very case would be as follows. No browser simulator is used; only requests and BeautifulSoup are used:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
soup = BeautifulSoup(res.text,"lxml")
## if you expect a different table, just change the index number
## and use the appropriate name in the selector
item = soup.select("iframe[name='testbat']")[0]['src']
req = requests.get(item)
sauce = BeautifulSoup(req.text,"lxml")
for items in sauce.select("table tr"):
    data = [item.text for item in items.select("th,td")]
    print(data)
Partial results:
['Rank', 'Name', 'Country', 'Rating']
['1', 'S.P.D. Smith', 'AUS', '947']
['2', 'V. Kohli', 'IND', '912']
['3', 'J.E. Root', 'ENG', '881']

It looks like that page's tables are within iframes. If you have a specific table you want to scrape, try inspecting it using browser dev tools (right click, inspect element in Chrome) and find the iframe element that is wrapping it. The iframe should have a src attribute that holds a url to the page that actually contains that table. You can then use a similar method to the one you tried but instead use the src url.
Selenium can also "jump into" an iframe if you know how to find the iframe in the page's source code.
frame = browser.find_element_by_id("the_iframe_id")
browser.switch_to.frame(frame)
html = browser.page_source  # ...then parse with BeautifulSoup as before
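For example, here is a rough sketch of that approach applied to the page in the question, assuming the ranking table sits in the iframe named 'testbat' (as in the answer above); the selectors may need adjusting:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
browser.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
time.sleep(3)  # crude wait; an explicit wait is more robust
# switch into the iframe that actually holds the rankings table
frame = browser.find_element_by_css_selector("iframe[name='testbat']")
browser.switch_to.frame(frame)
# now the table is part of the page source handed to BeautifulSoup
soup = BeautifulSoup(browser.page_source, "lxml")
for row in soup.select("table tr"):
    print([cell.get_text(strip=True) for cell in row.select("th,td")])
browser.quit()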

Related

How to scrape text from a hidden element?

I am trying to scrape the text of the Swiss constitution from Link and convert it to markdown. However, the page source is different from what I see in the inspector: the source only contains noscript warnings in various languages, with the "app-root" element hidden.
The inspector shows a .html file served from here with which I am able to get the desired result. However, using this file directly would not allow me to scrape the subsequent revisions of the law automatically. Is there a way to extract the page source with the element "app-root" displayed?
This code returns "None" but works with the URL set to the .html file:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver import FirefoxOptions
from bs4 import BeautifulSoup
from markdownify import markdownify
url = "https://www.fedlex.admin.ch/eli/cc/1999/404/en"
opts = FirefoxOptions()
opts.add_argument("--headless")
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=opts)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "lawcontent"})
content = markdownify(str(div))
print(content[:200])
Any help is much appreciated.
In your code, you're not giving the driver any time to render the contents, which results in incomplete source code.
Waits can be used to wait for required elements to be visible/present etc. The code below waits for the div content to be visible and then gets the page source.
Code snippet:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
url = "https://www.fedlex.admin.ch/eli/cc/1999/404/en"
driver.get(url)
try:
    delay = 20  # 20 second delay
    WebDriverWait(driver, delay).until(EC.visibility_of_element_located((By.ID, 'lawcontent')))
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": "lawcontent"})
    content = markdownify(str(div))
    print(content[:200])
# raises TimeoutException if the element is not visible within the delay duration
except TimeoutException:
    print("Timeout!!!")

BeautifulSoup sports scraper gives back empty list

I am trying to scrape the results of tennis matches from this website using Python's BeautifulSoup. I have tried a lot of things but I always get back an empty list. Is there an obvious mistake I am making? There are multiple instances of this class on the website when I inspect it, but it does not seem to find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)
The results table is loaded via JavaScript, and BeautifulSoup does not find it because it's not loaded yet at the moment of parsing. To solve this problem you'll need to use Selenium. Here is the link for chromedriver.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>',chrome_options=chrome_options)
# load page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait 5 seconds until results table will be loaded
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
    print(tag)
The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium first so you get all the dynamic content you see in your web browser's inspection tool.
There are a few guides online showing how this works - you could start with this one, maybe:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
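As a rough sketch of that idea for this page (assuming chromedriver is set up, and reusing the selector from your code and the 'live-table' id from the answer above; not a definitive implementation):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait until the results area has been rendered by JavaScript
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "live-table")))
# hand the now-complete page source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
matches = soup.find_all("div", class_="event__match event__match--static event__match--last event__match--twoLine")
print(len(matches))
driver.quit()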

Why does BeautifulSoup find keep returning elements with a class id other than what I'm passing it?

I'm trying to use BeautifulSoup to parse an iframe containing a Korean news article and print out each individual body paragraph in the article. Because the Korean paragraph content lies in a p tag within its own td tag with a class id of "tlTD", I figured I could just loop through each td with that class name and print the p tag like so:
link ="https://gloss.dliflc.edu/GlossHtml/GlossHTML.html?disableBrowserLockout=true&gloss=true&glossLoXmlFileName=/GlossHtml/templates/linksLO/glossLOs/kp_cul312.xml&glossMediaPathRoot=https://gloss.dliflc.edu/products/gloss/"
base_url = "https://oda.dliflc.edu"
driver = webdriver.Chrome()
driver.get(link)
python_button = driver.find_element_by_id("gloss_link_source")
python_button.click()
source_src= driver.find_element_by_id("glossIframe").get_attribute("src")
source_url = urljoin(base_url, source_src)
driver.get(source_url)
soup = BeautifulSoup(driver.page_source, "lxml")
for td in soup.find_all("td", class_="tlTD"):
    print(soup.find("p").getText())
The problem is that, instead of printing the body paragraphs, the code repeatedly prints out only the article title, which lies in its own td with a class of "title tlTD". I tried using a lambda expression and a regex to make the class name more exclusive, but I kept getting the same result. Changing soup.find("p") to a find_all successfully made the code print what I wanted, but it also printed a bunch of English version content that I don't want.
I can understand why the article title content would be printed since it includes "tlTD" in the class name, but I'm baffled as to where the English content is coming from. When I inspected the page in google chrome it didn't include any English body paragraphs so why is BeautifulSoup scraping that? Can anyone help explain to me what's going on here and how I can get this code to just print the Korean body paragraph content?
The td tags with the tlTD class are inside an iframe, so you need to switch to that iframe before you can access its data. (Note also that soup.find("p") searches the whole document rather than the current td, which is why it keeps returning the title; td.find("p") is used below instead.)
XPath to locate the iframe:
iframe = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//iframe[@id='glossIframe']")))
Then switch_to the iframe:
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the iframe):
driver.switch_to.default_content()
See explicit waits for more details.
EX:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
link = "https://gloss.dliflc.edu/GlossHtml/GlossHTML.html?disableBrowserLockout=true&gloss=true&glossLoXmlFileName=/GlossHtml/templates/linksLO/glossLOs/kp_cul312.xml&glossMediaPathRoot=https://gloss.dliflc.edu/products/gloss/"
driver.get(link)
source_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "gloss_link_source")))
source_button.click()
#switch iframe
iframe = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//iframe[@id='glossIframe']")))
driver.switch_to.frame(iframe)
soup = BeautifulSoup(driver.page_source, "lxml")
#scrape iframe data
for td in soup.find_all("td", class_="tlTD"):
    print(td.find("p").getText())

HTML Scraping when there are no html tags

I'm trying to get the elevation data and the start and end pass times from this website.
So far I have looked at the source code and been unable to use Beautiful Soup to get what I want, as the source code doesn't have any tags around the information I am interested in. That information is contained in functions by the name of spStart and their corresponding arguments. I had a go at using Selenium to obtain the JavaScript-processed code, but I ended up getting the same as the source code on the page, and now I'm stuck.
Here is my attempt at using selenium:
import datetime
import time
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import selenium.webdriver.chrome.service as service
from lxml import html
try:
    # Launching chrome in headless mode to access the inspect element code
    service = service.Service('/correct_path/chromedriver.exe')
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/correct_path/chromedriver.exe')
    driver.get("https://www.n2yo.com/passes/?s=39090&a=1")
    print("Chrome Browser Initialized in Headless Mode")
    soup = BeautifulSoup(driver.execute_script("return document.documentElement.innerHTML;"), "lxml")
    print(soup)
except KeyboardInterrupt:
    driver.quit()
    print("Driver Exited")
When I run this code it gives me the html that I see when using the "view source" option in chrome. I was under the impression that by using selenium to get the source this way, I would be seeing what is available when using the "inspect element" option on the same page in chrome.
Would someone mind explaining where I'm going wrong and suggesting a feasible approach to get the data I want, possibly with an explained example? I'd really appreciate it.
Thanks for your time.
No, it is not the same. Inspect Element inspects the DOM. Although the page source is practically the original seed page for the DOM, the DOM can change dynamically, usually through JS code, sometimes quite dramatically. You will also notice that Inspect Element shows shadow elements, which the page source does not.
To see how dramatic the difference can be, visit chrome://settings/, click Inspect Element, and then compare it with View page source.
You should target the element after it has loaded and take arguments[0], not the entire page via document:
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'lxml')
This has 2 practical cases:
1. The element is not yet loaded in the DOM and you need to wait for it:
browser.get("url")
sleep(experimental) # usually get will finish only after the page is loaded but sometimes there is some JS woo running after on load time
try:
element= WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
print "element is ready do the thing!"
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
print "Somethings wrong!"
2. The element is in a shadow root and you first need to expand the shadow root. This is probably not your situation, but I will mention it here since it is relevant for future reference. Example:
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')
html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty root, not expanded
shadow_root1 = expand_shadow_element(root1)
html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup
I don't know what data from that page you are interested in. However, if it is the tabular data you are after, then the script below is worth trying:
from selenium.webdriver import Chrome
from contextlib import closing
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
URL = "https://www.n2yo.com/passes/?s=39090&a=1"
chrome_options = Options()
chrome_options.add_argument("--headless")
with closing(Chrome(chrome_options=chrome_options)) as driver:
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("#passestable tr"):
        data = [item.text for item in items.select("th,td")]
        print(data)
Partial output:
['Start ', 'Max altitude', 'End ', 'All passes']
['Date, Local time', 'Az', 'Local time', 'Az', 'El', 'Local time', 'Mag ', 'Info']
['20-Feb 19:17', 'N13°', '19:25', 'E76°', '81°', '19:32', 'S191°', '-', 'Map and details']
['21-Feb 06:24', 'SSE151°', '06:31', 'E79°', '43°', '06:38', 'N358°', '-', 'Map and details']

Parsing a website with BeautifulSoup and Selenium

Trying to compare avg. temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124
I can successfully gather the webpage's source code, but I am having trouble parsing through it to get only the values for the high temps, low temps, rainfall, and the averages under the "History" tab. I can't seem to address the right class/id without the only result being "None".
This is what I have so far, with the last line being an attempt to get the high temps only:
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://usclimatedata.com/climate/binghamton/new-york/unitedstates/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})
First of all, these are two different classes - align_right and temperature_red - you've joined them and added that table_data_td for some reason. And, the elements having these two classes are td elements, not table.
In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":
climate_table = soup.find(id="climate_table")
Another important thing to note is that there is a potential for "timing" issues here - when you get the driver.page_source value, the climate information might not be there yet. This is usually approached by adding an Explicit Wait after navigating to the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
url = "https://usclimatedata.com/climate/binghamton/new-york/unitedstates/usny0124"
browser = webdriver.Chrome()
try:
    browser.get(url)
    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))
    soup = BeautifulSoup(browser.page_source, "lxml")
    climate_table = soup.find(id="climate_table")
    print(climate_table.prettify())
finally:
    browser.quit()
Note the addition of the try/finally that would safely close the browser in case of an error - that would also help to avoid "hanging" browser windows.
And look into pandas.read_html(), which can read your climate information table into a DataFrame auto-magically.
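For example, a minimal sketch of that idea, run on the page source captured inside the try block above (and assuming the climate data actually lives in table elements, which is worth verifying in the page source):
import pandas as pd
# pandas.read_html() parses every <table> in an HTML string into a list of DataFrames
tables = pd.read_html(browser.page_source)
print(len(tables))
print(tables[0].head())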
