This is the page I'm trying to scrape:
https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2
I want to scrape the listbox href next to the Token label.
I need to scrape href from
class="link-hover d-flex justify-content-between align-items-center"
so my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2').text
html = BeautifulSoup(page, 'html.parser')
href = html.find(class_ = 'link-hover d-flex justify-content-between align-items-center')['href']
However, the result is nothing.
Can anyone help me?
The element of interest is rendered by JavaScript. Thus, you will need some browser automation software to render the JavaScript, in order to get the full HTML necessary.
Note: You could use requests-html which supports JavaScript rendering. However, it does use a browser automation software itself, so, in my opinion, it's best to get rid of the "middle-man".
Selenium
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get("https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2")
# Expand the dropdown so the token list is rendered
elem = browser.find_element(By.ID, "availableBalanceDropdown")
elem.click()
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
Playwright
import bs4
from playwright.sync_api import sync_playwright

with sync_playwright() as play:
    browser = play.chromium.launch()
    page = browser.new_page()
    page.goto("https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2")
    # Expand the dropdown so the token list is rendered
    page.click("#availableBalanceDropdown")
    soup = bs4.BeautifulSoup(page.content(), features="html.parser")
    browser.close()
Once you have the bs4.BeautifulSoup object, it's just a matter of scraping for the CSS selector.
import bs4
soup = bs4.BeautifulSoup(...) # From above examples
elems = soup.select(".link-hover.d-flex.justify-content-between.align-items-center")
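As a quick self-contained illustration of that last step, here is the selector applied to a made-up stand-in for the rendered Etherscan markup (the HTML and hrefs below are invented for the example):

```python
import bs4

# Invented HTML standing in for the rendered page
html = '''
<ul>
  <li><a class="link-hover d-flex justify-content-between align-items-center"
         href="/token/0xabc">Token A</a></li>
  <li><a class="link-hover d-flex justify-content-between align-items-center"
         href="/token/0xdef">Token B</a></li>
</ul>
'''

soup = bs4.BeautifulSoup(html, features="html.parser")
elems = soup.select(".link-hover.d-flex.justify-content-between.align-items-center")
hrefs = [e["href"] for e in elems]
print(hrefs)  # ['/token/0xabc', '/token/0xdef']
```

The chained class selector matches only elements carrying all four classes, which is why it is more reliable than passing the space-separated string to find().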
Related
I'm trying to scrape a link in the video description on YouTube, but the list always returns empty.
I've tried changing the tag I'm scraping from, but there is no change in either the output or the error message.
Here's the code I'm using:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/watch?v=gqUqGaXipe8').text
soup = BeautifulSoup(source, 'lxml')
link = [i['href'] for i in soup.findAll('a', class_='yt-simple-endpoint style-scope yt-formatted-string', href=True)]
print(link)
What is wrong, and how can I solve it?
In your case, requests doesn't return the whole HTML structure of the page. Since YouTube fills in the data using JavaScript, we must run the page through a real browser to get its source, such as headless Chrome via the Selenium library. Here is the general solution:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options = options)
url = "https://www.youtube.com/watch?v=Oh1nqnZAKxw"
driver.get(url)
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
link = [i['href'] for i in soup.select('div#meta div#description [href]')]
print(link)
For context, I am pretty new to Python. I am trying to use bs4 to parse some data out of https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles
To be exact, I want to obtain the 57% number in the "paying" section of the webpage.
My problem is that bs4 will only return the first layer of the HTML, while the data I want is deeply nested in the code. I think it's under 17 divs.
Here is my python code:
import requests
import bs4
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "html.parser")
print(soup.find_all("div", {"id": "gwtDiv"}))
(This returns [<div class="clearfix margin60 marginBottomOnly" id="gwtDiv" style="min-height: 300px;height: 300px;height: auto;"></div>] None of the elements inside it are shown.)
If the page uses JS to render content inside the element, then requests will not be able to get that content, since it is rendered client-side in a browser. I'd recommend using ChromeDriver and Selenium along with BeautifulSoup.
You can download the Chrome driver from here: https://chromedriver.chromium.org/
Put it in the same folder in which you're running your program.
Try something like this
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(sel_soup.find_all("div", {"id": "gwtDiv"}))
I'm trying to get some information about a product I'm interested in, on Amazon.
I'm using the BeautifulSoup library for web scraping:
URL = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find('span',class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
In the pic, the highlighted row is the one I want to select, but when I run my script I get 'None' every time. (Printing the entire output after the BeautifulSoup call gives me the entire HTML source, so I'm using the right URL.)
Any solutions?
You need to use the .text attribute (a property, not a method) to get the text of an element.
so change:
print(title)
to:
print(title.text)
Output:
EUR 1.153,00
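A minimal stand-in example makes the difference visible (the span below is invented to mirror the page's markup, with the price copied from the output above):

```python
from bs4 import BeautifulSoup

# Invented snippet mimicking the Amazon price span
html = '<span class="a-size-large a-color-price olpOfferPrice a-text-bold">EUR 1.153,00</span>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')

print(title)       # the whole tag: <span class="...">EUR 1.153,00</span>
print(title.text)  # just the text: EUR 1.153,00
```

Note that title is None when nothing matches, so .text would then raise an AttributeError; check for None before reading it.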
I wouldn't use BS alone in this case. You can easily add Selenium to scrape the website:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
driver = webdriver.Safari()
driver.get(url)
html_content = driver.page_source
soup = BeautifulSoup(html_content, "html.parser")
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
If you can't use Safari, you have to download the webdriver for Chrome, Firefox, etc., but there is plenty of reading material on this topic.
I'm looking to extract all of the brands from this page using Beautiful Soup. My program so far is:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
def main():
opts = Options()
opts.headless = True
assert opts.headless # Operating in headless mode
browser = Firefox(options=opts)
browser.get('https://neighborhoodgoods.com/pages/brands')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
brand = []
for tag in soup.find('table'):
brand.append(tag.contents.text)
print(brand)
browser.close()
print('This program is terminated.')
I'm struggling with figuring out the right tag to use, as the data is nested in tr/td. Any advice? Thanks so much!
If I understand your question correctly, you only want to get the company name (the first <td> of each row).
Try using the CSS selector td:nth-of-type(1), which selects the first <td> of every row.
import requests
from bs4 import BeautifulSoup
URL = "https://neighborhoodgoods.com/pages/brands"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print([tag.text for tag in soup.select("td:nth-of-type(1)")])
Output:
['A.N Other', 'Act + Acre', ..., 'Wild One']
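To see why this selector picks one cell per row, here is a toy table (markup invented for illustration, not taken from the site):

```python
from bs4 import BeautifulSoup

# Toy two-column table: brand name first, category second
html = """
<table>
  <tr><td>Brand A</td><td>Category 1</td></tr>
  <tr><td>Brand B</td><td>Category 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# nth-of-type counts among siblings, so each row's first <td> matches
first_cells = [tag.text for tag in soup.select("td:nth-of-type(1)")]
print(first_cells)  # ['Brand A', 'Brand B']
```

Since nth-of-type is evaluated relative to each cell's parent <tr>, every row contributes exactly one match.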
I have been digging on the site for some time and I'm unable to find the solution to my issue. I'm fairly new to web scraping and trying to simply extract some links from a web page using Beautiful Soup.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)
At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself, but the part I'm struggling with is the fact that the tag I am looking for is not in the output.
For example: using the built-in find() I can grab the following div class tag:
class="l__grid js-page-layout"
However, what I'm actually looking for are the contents of a tag that is embedded at a lower level in the tree.
js-event-list-tournament-events
When I perform the same find operation on the lower-level tag, I get no results.
Using an Azure-based Jupyter Notebook, I have tried a number of the solutions to similar problems on Stack Overflow, with no luck.
Thanks!
Kenny
The page uses JS to load the data dynamically, so you have to use Selenium. Check the code below.
Note: you have to install Selenium and chromedriver (unzip the file and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={
'class':'js-event-list-tournament-events'})
print(container)
Or you can use their JSON API:
import requests
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
I had the same problem and the following code worked for me. Chromedriver must be installed!
import time
from bs4 import BeautifulSoup
from selenium import webdriver
chromedriver_path= "/Users/.../chromedriver"
driver = webdriver.Chrome(chromedriver_path)
url = "https://yourURL.com"
driver.get(url)
time.sleep(3) #if you want to wait 3 seconds for the page to load
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
This soup you can use as usual.
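For example, once the rendered source is in hand, the usual bs4 calls apply. Here page_source is a hand-written stand-in for what the driver would actually return (html.parser is used so the sketch runs without lxml installed):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for driver.page_source after the JS has run
page_source = '''
<div class="js-event-list-tournament-events">
  <a href="/match/1">Match 1</a>
  <a href="/match/2">Match 2</a>
</div>
'''

soup = BeautifulSoup(page_source, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # ['/match/1', '/match/2']
```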