How to extract 'div' value by using BeautifulSoup? - python

I want to get 8.9 from the following HTML tag using BeautifulSoup.
<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">
import requests
from bs4 import BeautifulSoup
import pandas as pd
website = 'https://www.laptopsdirect.co.uk/ct/laptops-and-netbooks/laptops?fts=laptops'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'lxml')
results = soup.find_all('div', class_='OfferBox')
name = results[0].find('a', class_='offerboxtitle').get_text()
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('')
print(review_rating)
I tried:
review_rating = results[0].find('div.rating-value')
None
review_rating = results[0].find('div')['rating-value']
KeyError: 'rating-value'
I'm not familiar with BeautifulSoup yet, so I couldn't work it out.
How can I get 8.9?
Thanks

You can use the .get() method to retrieve attribute values, as follows:
from bs4 import BeautifulSoup
html = '''<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div").get("rating-value"))
Output:
8.9
Keep in mind that .get() returns a str ("8.9"), so convert it with float() if you need a number.

You are looking for the data in the wrong tag. The HTML you inspected shows the data inside a <div>, but in the soup it is inside a <star-rating> tag.
The rating is an attribute of that <star-rating> tag, so extract it from there.
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('star-rating').get('rating-value')
print(review_rating)
8.9
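If you'd rather not depend on the tag name at all, you can search for whatever tag carries the attribute. This is only a sketch: the HTML below is a cut-down stand-in for one OfferBox (not the real page markup), and extract_rating is a helper name I made up.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single offer box.
html = '''
<div class="OfferBox">
  <a class="offerboxtitle">Lenovo V14</a>
  <star-rating rating-value="8.9" ratings-count="23"></star-rating>
</div>
'''

def extract_rating(box):
    """Return the rating as a float, or None if no tag has a rating-value attribute."""
    # attrs={"rating-value": True} matches any tag that has the attribute at all.
    tag = box.find(attrs={"rating-value": True})
    if tag is None:
        return None
    return float(tag["rating-value"])

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="OfferBox")
rating = extract_rating(box)
print(rating)
```

Returning None for a missing rating avoids the KeyError from the question when a product has no reviews.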

Related

How to scrape id attribute from HTML with bs4?

I'm trying to scrape a specific piece of data from HTML.
html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
From this piece of HTML I'm trying to scrape the class and id attributes.
I've tried:
from bs4 import BeautifulSoup as soup
for pr in soup.find_all("p"):
    print(pr['class'], pr['id'])
but I get a KeyError on id.
The issue is that the second element does not have an id attribute, only a data-id, so you have to check for that, or use .get() if you're not sure an attribute is defined:
pr.get('id')
Example
from bs4 import BeautifulSoup
html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
soup = BeautifulSoup(html, 'html.parser')
for pr in soup.find_all("p"):
    print(pr["class"], pr.get('id'))
Output
['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] None
An ugly alternative is to iterate over the attributes and pick the first one whose name contains id:
print(pr["class"], pr.get([a for a in pr.attrs if 'id' in a][0]))
->
['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] 12-9-229
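A slightly tidier variant of the same idea is to name the two attributes explicitly and fall back from one to the other. A minimal sketch, assuming id and data-id are the only two spellings you need to handle (element_id is a made-up helper name, and the HTML is shortened from the question):

```python
from bs4 import BeautifulSoup

html = '''<p class="SoccerPlayer SoccerPlayer-1" id="12-8-3"></p>
<p class="SoccerPlayer SoccerPlayer-514" data-id="12-9-229"></p>'''

def element_id(tag):
    """Return the id attribute, falling back to data-id, else None."""
    return tag.get("id") or tag.get("data-id")

soup = BeautifulSoup(html, "html.parser")
ids = [element_id(p) for p in soup.find_all("p")]
print(ids)
```

Unlike the substring check, this won't accidentally match an unrelated attribute whose name happens to contain "id", such as "width".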
Your code calls find_all() without first creating a BeautifulSoup instance:
from bs4 import BeautifulSoup
html_data = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">'''
soup = BeautifulSoup(html_data, 'html.parser')
for pr in soup.find_all("p"):
    print(pr["class"], pr["id"])

Get html text with Beautiful Soup

I'm trying to get the number from inside a div:
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
I need the 122.7 number, but I can't get it. I have tried:
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").string
But the div contains more than one child element, so I get None.
Is there a way to iterate over the children and get the string from each of them?
Use .getText().
For example:
from bs4 import BeautifulSoup
sample_html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").getText()
print(strings)
Output:
122.78
Or call next() on .strings to get only the 122.7:
soup = BeautifulSoup(sample_html, "html.parser")
strings = next(soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").strings)
print(strings)
Output:
122.7
To get only the first text, search for the tag and read its .next_element attribute.
from bs4 import BeautifulSoup
html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
print(
    soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").next_element
)
Output:
122.7
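To see the three behaviours from the answers above side by side, here is a small sketch. first_text is a helper name I invented; it uses .stripped_strings with a default so it returns None on an empty tag instead of raising StopIteration.

```python
from bs4 import BeautifulSoup

sample_html = '<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>'

def first_text(tag):
    """Return the first non-empty text fragment under tag, or None if there is none."""
    return next(tag.stripped_strings, None)

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="js-symbol-last")

whole = div.string       # None: the div has two children (text + <span>)
joined = div.get_text()  # text of the div and all its descendants
first = first_text(div)  # only the leading text node

print(whole, joined, first)
```

Which one you want depends on whether the digit in the <span> is part of the price you are after.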
You could use selenium to find the element and then use BS4 to parse it.
An example would be
import selenium.webdriver as WD
from selenium.webdriver.common.by import By
import bs4 as B
driver = WD.Chrome()
driver.get("yourpageurl")  # load the page before looking for elements
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
objXpath = driver.find_element(By.XPATH, """yourelementxpath""")
objHtml = objXpath.get_attribute("outerHTML")
soup = B.BeautifulSoup(objHtml, 'html.parser')
text = soup.get_text()
This code should work.
DISCLAIMER
I haven't done work w/ selenium and bs4 in a while so you might have to tweak it a little bit.

BeautifulSoup parsing issues some div not showing

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, the HTML for this part looks like this:
<div class="price">
    <div class="price">
        "thePrice"
        <sup>93</sup>
    </div>
</div>
But when I use page_soup = soup(my_html_page, 'html.parser'), page_soup = soup(my_html_page, 'lxml'), or page_soup = soup(my_html_page, 'html5lib'), I only get this for that part:
<div class="price"></div>
And that's it. I've been searching the internet for hours to figure out why the inner div doesn't get parsed.
I've tried three different parsers, and none of them seems to get past the fact that the inner child has the same class name as its parent, if that is the issue.
Hope this helps.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Output:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
    print(tag.text)
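If you then want numbers rather than strings like "1 165€94", something along these lines might do. This is a sketch on a hard-coded snippet shaped like the markup in the question, not the live page; parse_price is a hypothetical helper, and it assumes the cents always follow the € sign.

```python
from decimal import Decimal
from bs4 import BeautifulSoup

# Hand-written sample mirroring the nested structure from the question.
sample_html = '<div class="price"><div class="price">1 165€<sup>94</sup></div></div>'

def parse_price(text):
    """Turn a string such as '1 165€94' into Decimal('1165.94')."""
    # str.split() with no arguments also splits on non-breaking spaces.
    compact = "".join(text.split())
    euros, _, cents = compact.partition("€")
    return Decimal(f"{euros}.{cents or '00'}")

soup = BeautifulSoup(sample_html, "html.parser")
# The descendant selector picks only the inner div, so each price is read once.
inner = soup.select_one("div.price div.price")
price = parse_price(inner.get_text())
print(price)
```

Decimal is used instead of float so the cents are kept exactly.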

Access attributes with beautifulSoup and print

I'd like to scrape a site to find all the title attributes of the h2 tags:
<h2 class="1">Titanic_Caprio</h2>
Using this code, I get the entire h2 tag:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2
url = "http://www.example.it"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
links = soup.findAll('h2')
print("".join([str(x) for x in links]))
Using findAll('h2', attrs={'title'}) returns no results. What am I doing wrong? How can I write the whole list of titles to a file?
The problem is that title is not an attribute of the h2 tag, but of a tag included in it. So you must first search for <h2> tags, and then subtags having a title attribute:
titles = []
h2_list = links = soup.findAll('h2')
for h2 in h2_list:
    titles.extend(h2.findAll(lambda x: x.has_attr('title')))
It works because BeautifulSoup can use functions as search filters.
You need to pass key-value pairs in attrs:
findAll('h2', attrs = {"key":"value"})
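Putting the two answers together, a CSS selector can find the tags with a title attribute inside each h2 in one go. This is a sketch on made-up HTML: I'm assuming the title sits on an <a> inside the <h2>, which the snippet in the question doesn't actually show.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the title attribute lives on a tag nested in each <h2>.
html = '''
<h2 class="1"><a href="/film/1" title="Titanic_Caprio">Titanic</a></h2>
<h2 class="1"><a href="/film/2" title="Inception_Caprio">Inception</a></h2>
<h2 class="1">No title here</h2>
'''

soup = BeautifulSoup(html, "html.parser")

# "h2 [title]" selects any descendant of an <h2> that has a title attribute.
titles = [tag["title"] for tag in soup.select("h2 [title]")]
print("\n".join(titles))
```

To save the list, write the joined string to a file opened with open(path, "w") instead of printing it.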

Python: BeautifulSoup extract all the heading text from div class

import requests
from bs4 import BeautifulSoup
res = requests.get('http://aicd.companydirectors.com.au/events/events-calendar')
soup = BeautifulSoup(res.text,"lxml")
event_containers = soup.find_all('div', class_ = "col-xs-12 col-sm-6 col-md-8")
first_event = event_containers[0]
print(first_event.h3.text)
With this code I'm able to extract the first event name. I'm looking for a way to loop over all the events and extract every name and date. I'm also trying to extract the location information, which is only visible after clicking the "read more" link.
event_containers is a bs4.element.ResultSet object, which is basically a list of Tag objects.
Just loop over the tags in event_containers and select h3 for the title, div.date for the date and a for the URL. For example:
for tag in event_containers:
    print(tag.h3.text)
    print(tag.select_one('div.date').text)
    print(tag.a['href'])
Now, for the location information you'll have to visit each URL and collect the text in div.event-add.
Full code:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://aicd.companydirectors.com.au/events/events-calendar')
soup = BeautifulSoup(res.text,"lxml")
event_containers = soup.find_all('div', class_ = "col-xs-12 col-sm-6 col-md-8")
base_url = 'http://aicd.companydirectors.com.au'
for tag in event_containers:
    link = base_url + tag.a['href']
    soup = BeautifulSoup(requests.get(link).text,"lxml")
    location = ', '.join(list(soup.select_one('div.event-add').stripped_strings)[1:-1])
    print('Title:', tag.h3.text)
    print('Date:', tag.select_one('div.date').text)
    print('Link:', link)
    print('Location:', location)
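The location line is the dense part of the code above, so here it is taken apart on a hard-coded snippet. The markup and address are placeholders I invented to show the shape (a heading first, a link last); the real detail page may be laid out differently. urljoin is used here in place of string concatenation, which is my substitution, not what the code above does.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder detail-page fragment; not copied from the real site.
detail_html = '''
<div class="event-add">
  <h4>Location</h4>
  <p>Example Venue</p>
  <p>1 Example Street</p>
  <a href="#map">View map</a>
</div>
'''

soup = BeautifulSoup(detail_html, "html.parser")

# stripped_strings yields every text fragment with surrounding whitespace removed.
parts = list(soup.select_one("div.event-add").stripped_strings)

# [1:-1] drops the first fragment (the heading) and the last one (the link text).
location = ", ".join(parts[1:-1])

# urljoin copes with both relative and absolute hrefs ("/events/some-event" is a made-up path).
link = urljoin("http://aicd.companydirectors.com.au", "/events/some-event")

print(parts)
print(location)
print(link)
```

If the real block has a different number of leading or trailing fragments, the slice bounds need adjusting to match.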
Try this to get all the events and dates you are after:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://aicd.companydirectors.com.au/events/events-calendar')
soup = BeautifulSoup(res.text,"lxml")
for item in soup.find_all(class_='lead'):
    date = item.find_previous_sibling().text.split(" |")[0]
    print(item.text,date)
