I am using requests and bs4 to extract the first preview from the link http://duckduckgo.com/?q=who+is+harry+potter
However, when I try to use bs4's find method to find the div with the class 'result__snippet', it returns None. But when I saved the whole webpage to my hard disk, opened it directly, and parsed it with bs4, soup.find('div', class_='result__snippet').get_text() returned the perfect output.
Any help?
The website you link to appears to use JavaScript to build the search results, so the page you retrieve with requests doesn't actually contain the search results yet.
If you look at the content of the page that you've retrieved (print(soup.text)), you can see that they suggest using http://duckduckgo.com/html/?q=who+is+harry+potter if you don't have JavaScript enabled.
Scraping this URL should provide you with the content that you are looking for.
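As a minimal sketch of that approach, reusing the result__snippet class from your own snippet (the User-Agent header is an assumption to make the request look like a regular browser, not something the site documents):
import requests
from bs4 import BeautifulSoup

# The JavaScript-free endpoint the page itself suggests
url = 'https://duckduckgo.com/html/?q=who+is+harry+potter'
# Assumed header; some sites serve an error page to the default requests UA
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

snippet = soup.find('div', class_='result__snippet')
if snippet is not None:
    print(snippet.get_text())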
One way to do this is to use Selenium in combination with BeautifulSoup. Try this, it works.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent
url = 'https://duckduckgo.com/?q=who+is+harry+potter&ia=web'
delay = 10  # seconds to wait for the results to load

# Randomize the user agent so the request looks less like a bot
profile = webdriver.FirefoxProfile()
ua1 = UserAgent()
profile.set_preference('general.useragent.override', str(ua1.random))
driver = webdriver.Firefox(profile)
driver.get(url)

while True:
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'result__snippet')))
        print('Page is ready!')
        break
    except TimeoutException:
        print('Loading took too much time!')

# Grab the rendered HTML, then close the browser
html = driver.execute_script('return document.body.innerHTML')
driver.close()
b_html = bs(html, 'html.parser')
x = b_html.find_all('div', class_='result__snippet')[0].get_text()
print(x)
Output:
Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, ...
I am trying to scrape NASDAQ's website for real time stock quotes. When I use chrome developer tools, I can see the span I want to target is (for example with Alphabet as of writing this) <span class="symbol-page-header__pricing-price">$2952.77</span>. I want to extract the $2952.77. My python code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
def get_last_price(ticker):
    driver.get(f"https://www.nasdaq.com/market-activity/stocks/{ticker}")
    price = driver.find_element(By.CLASS_NAME, "symbol-page-header__pricing-last-price")
    print(price.get_attribute('text'))
    # p = price.get_attribute('innerHTML')

get_last_price('googl')
The above code returns 'None'. If you uncomment the line defining p and print its output, it shows that Selenium thinks the span is empty.
<span class="symbol-page-header__pricing-price"></span>
I don't understand why this is happening. My thought is that it has something to do with the page probably being rendered dynamically with JavaScript, but I thought handling that was an advantage of Selenium as opposed to, say, BeautifulSoup... there shouldn't be an issue, right?
If you look into the HTML DOM of NASDAQ's Coinbase Global page, your Locator Strategy selects 2 nodes, and one of them is not the one you want.
Solution
To print the price information you can use the following Locator Strategy:
Using XPATH and text attribute:
driver.get("https://www.nasdaq.com/market-activity/stocks/coin")
print(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='symbol-page-header__pricing-price' and text()]"))).text)
Using XPATH and get_attribute("innerHTML"):
driver.get("https://www.nasdaq.com/market-activity/stocks/coin")
print(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='symbol-page-header__pricing-price' and text()]"))).get_attribute("innerHTML"))
Console Output:
$263.91
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
There are 2 nodes with the class symbol-page-header__pricing-price. The node that you want is under
<div class="symbol-page-header__pricing-details symbol-page-header__pricing-details--current symbol-page-header__pricing-details--decrease"></div>
So, you need to get inside this div first to ensure you scrape the right one.
Anyway, I'd recommend using BeautifulSoup to scrape the HTML text after you've finished interacting with the dynamic website with Selenium. This saves time and memory, since there's no need to keep the browser running: grab the page source, terminate the driver (i.e. driver.close()), and use BeautifulSoup to explore the static HTML text.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
def get_last_price(ticker):
    driver.get(f"https://www.nasdaq.com/market-activity/stocks/{ticker}")
    time.sleep(1)  # wait for the dynamic content to load
    # Grab the rendered HTML before closing the browser
    page_source = driver.page_source
    driver.close()
    soup = BeautifulSoup(page_source, "lxml")
    header = soup.find('div', attrs={'class': 'symbol-page-header__pricing-details symbol-page-header__pricing-details--current symbol-page-header__pricing-details--decrease'})
    price = header.find('span', attrs={'class': 'symbol-page-header__pricing-price'})
    print(price)
    print(price.text)

get_last_price('googl')
Output:
>>> <span class="symbol-page-header__pricing-price">$2952.77</span>
>>> $2952.77
Link that I am scraping: https://www.indusind.com/in/en/personal/cards/credit-card.html
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re, sys
from selenium import webdriver
import re
IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"
html = requests.get(IndusInd_url)
soup = BeautifulSoup(html.content, 'lxml')
print(soup)
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())
Using the above code I am trying to scrape the card titles, but unfortunately I am getting this output:
<html><body><p>This website is secured against online attacks. Your request was blocked due to suspicious behavior<br/>
<br/>
Client IP : 124.123.170.109<br/>
<br/>
Incident Time : 2021-02-24 06:28:10 UTC <br/>
<br/>
Incident ID : YDXx#m6g3nSFLvi5lGg4wgAAAf8<br/>
<br/>
If you feel it was a legitimate request, please contact the website owner for further investigation and remediation with a screenshot of this page.</p></body></html>
Is there any alternative way to scrape the details?
Any help is highly appreciated!
Please check this.
FYI: Make sure you have the right driver (Firefox or Chrome or whatever, with the right version).
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import time
url = 'https://www.indusind.com/in/en/personal/cards/credit-card.html'
# open the chrome driver
driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
# pings the specified url
driver.get(url)
# sleep to wait for the page to load
# replace 3 with any int value (in seconds)
time.sleep(3)
# gets the page source
pg = driver.page_source
# beautify with beautifulsoup
soup = BeautifulSoup(pg, 'html.parser')
# get the titles of the cards
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())
Can be achieved without BeautifulSoup.
I define the locator with XPath with the value:
//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']
and utilize the method .presence_of_all_elements_located.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
driver.get('https://www.indusind.com/in/en/personal/cards/credit-card.html')
wait = WebDriverWait(driver, 20)
elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']")))
for element in elements:
    print(element.get_attribute('innerHTML'))
driver.quit()
I'm trying to scrape a web page to get some data to work with. One of the pages I want to scrape is this one: https://www.etoro.com/people/sparkliang/portfolio. The problem comes when I scrape the page using:
import requests
h=requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content
And it gives me HTML that is completely different from the original: it adds a lot of meta tags and is missing the text and the HTML elements I am searching for.
For example, imagine I want to scrape:
<p ng-if=":: item.IsStock" class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>
I use a command like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.etoro.com/people/sparkliang/portfolio').text
print(html_text)
soup = BeautifulSoup(html_text,'lxml')
job = soup.find('p', class_='i-portfolio-table-hat-fullname ng-binding ng-scope').text
This should return Shopify Inc. But it doesn't, because the HTML I get from the web page with the requests library is completely different.
I want to know how to get the original html code from the web page.
If you use Ctrl-F to search for a keyword like Shopify Inc, it won't even be in the code I get from the requests library.
It happens because the page uses JavaScript to dynamically create the DOM elements, so you won't be able to accomplish this using requests. Instead you should use Selenium with a webdriver and wait for the elements to be created before scraping.
You can try downloading the ChromeDriver executable here. If you paste it in the same folder as your script, you can run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe" # CHANGE THIS IF NOT SAME FOLDER
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
url = 'https://www.etoro.com/people/sparkliang/portfolio'
driver.get(url)
jobs = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
)
for job in jobs:
    print(job.text)
driver.quit()
Here we use Selenium with WebDriverWait and EC to ensure that all the elements will exist when we try to scrape the info we're looking for.
Outputs
Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...
I am trying to scrape the results of tennis matches from this website using Python's BeautifulSoup. I have tried a lot of things but I always get back an empty list. Is there an obvious mistake I am making? There are multiple instances of this class on the website when I inspect it, but it does not seem to find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)
The results table is loaded via JavaScript, and BeautifulSoup does not find it because it isn't loaded yet at the moment of parsing. To solve this problem you'll need to use Selenium. Here is a link for ChromeDriver.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>',chrome_options=chrome_options)
# load page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait up to 5 seconds for the results table to load
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
    print(tag)
The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium first so you get all the dynamic content you see in your web browser's inspection tool.
There are a few guides online showing how this works; you could start with this one:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
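As a rough sketch of that approach, assuming ChromeDriver is on your PATH and reusing the class string from your question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.flashscore.com/tennis/atp-singles/french-open/results/')
# Wait until at least one match row has been rendered by the page's JavaScript
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'event__match'))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

matches = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(len(matches))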
I am using Selenium to scrape the contents from the App Store: https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830
I tried to extract the text field "As subject matter experts, our team is very engaging..."
I tried to find elements by class
review_ratings = driver.find_elements_by_class_name('we-truncate we-truncate--multi-line we-truncate--interactive ember-view we-customer-review__body')
review_ratingsList = []
for e in review_ratings:
    review_ratingsList.append(e.get_attribute('innerHTML'))
review_ratings
But it returns an empty list [].
Is anything wrong with the code? Or is there a better solution?
Using Requests and Beautiful Soup:
import requests
from bs4 import BeautifulSoup
url = 'https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
item = soup.select_one("blockquote > p").text
print(item)
Output:
As subject matter experts, our team is very engaging and focused on our near and long term financial health!
You can use WebDriverWait to wait for the visibility of an element and get its text. Make sure you use a good Selenium locator.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#...
wait = WebDriverWait(driver, 5)
review_ratings = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".we-customer-review")))
for review_rating in review_ratings:
    starts = review_rating.find_element_by_css_selector(".we-star-rating").get_attribute("aria-label")
    title = review_rating.find_element_by_css_selector("h3").text
    review = review_rating.find_element_by_css_selector("p").text
Mix Selenium with Beautiful Soup.
Using WebDriver:
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
url = "https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
bs = BeautifulSoup(innerHTML, 'html.parser')
bs.blockquote.p.text
Output:
Out[22]: 'As subject matter experts, our team is very engaging and focused on our near and long term financial health!'
Use WebDriverWait to wait for presence_of_all_elements_located, with the following CSS selector.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830")
review_ratings = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.we-customer-review__body p[dir="ltr"]')))
review_ratingsList = []
for e in review_ratings:
    review_ratingsList.append(e.get_attribute('innerHTML'))
print(review_ratingsList)
Output:
['As subject matter experts, our team is very engaging and focused on our near and long term financial health!', 'Very much seems to be an unfinished app. Can’t find secure message alert. Or any alerts for that matter. Most of my client team is missing from the “send to” list. I have other functions very useful, when away from my computer.']