Using BeautifulSoup4 to scrape a specific part of HTML code - Python

I want to set a variable to the 1.65 near the end of the HTML below. Currently, when I run my code, it prints "price-text". Any help changing it to print "1.65" would be great.
<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>
html code
uClient.close()
page_soup = soup(page_html, "html.parser")
price_texts = page_soup.findAll("div",{"class":"priceText_f71sibe"})
price_text = price_texts[0]
a = price_text.span["data-automation-id"]
print(a)

The most common way is the .text property:
price_text.span.text
But there are other properties and methods that return the text:
price_text.span.text
price_text.span.string
price_text.span.getText()
price_text.span.get_text()
Documentation for method get_text()
Full working code
from bs4 import BeautifulSoup
html = '<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>'
soup = BeautifulSoup(html, "html.parser")
price_texts = soup.findAll("div",{"class":"priceText_f71sibe"})
price_text = price_texts[0]
a = price_text.span["data-automation-id"]
print(price_text.span.text)
print(price_text.span.string)
print(price_text.span.getText())
print(price_text.span.get_text())
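To make the difference between the attribute and the text explicit, here is a minimal sketch using the same HTML. Indexing the tag with ["data-automation-id"] reads the attribute value, while get_text() reads what is between the tags. The variable names attr_value and price are just illustrative, and the float() conversion assumes the text is always a plain number.

```python
from bs4 import BeautifulSoup

html = '<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("div", class_="priceText_f71sibe").span

# Indexing a tag like a dict returns an attribute value
attr_value = span["data-automation-id"]

# get_text() returns the text inside the tag; float() assumes it is numeric
price = float(span.get_text(strip=True))

print(attr_value)
print(price)
```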

Related

How to extract 'div' value by using BeautifulSoup?

I want to get 8.9 from the following HTML tag using BeautifulSoup.
<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">
import requests
from bs4 import BeautifulSoup
import pandas as pd
website = 'https://www.laptopsdirect.co.uk/ct/laptops-and-netbooks/laptops?fts=laptops'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'lxml')
results = soup.find_all('div', class_='OfferBox')
name = results[0].find('a', class_='offerboxtitle').get_text()
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('')
print(review_rating)
I tried:
review_rating = results[0].find('div.rating-value')
None
review_rating = results[0].find('div')['rating-value']
KeyError: 'rating-value'
I'm not familiar with BeautifulSoup yet, so my attempts failed.
How can I get 8.9?
Thanks
You can use the .get method to retrieve attribute values, as follows:
from bs4 import BeautifulSoup
html = '''<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div").get("rating-value"))
output
8.9
Keep in mind that .get returns a str ("8.9"), not a number.
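As a rough sketch of what that means in practice: convert the string yourself if you need a number, and note that .get returns None for a missing attribute, whereas square-bracket indexing raises KeyError (which is the error in the question). The attribute name "no-such-attribute" is made up for the illustration.

```python
from bs4 import BeautifulSoup

html = '<div rating-value="8.9" ratings-count="23" class="ng-isolate-scope"></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div")

# Attribute values come back as strings, so convert explicitly
rating = float(tag.get("rating-value"))
count = int(tag.get("ratings-count"))

# .get() returns None when the attribute is absent
missing = tag.get("no-such-attribute")

# Square-bracket access raises KeyError instead
try:
    tag["no-such-attribute"]
    raised = False
except KeyError:
    raised = True

print(rating, count, missing, raised)
```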
You are looking for the data in the wrong tag. The HTML you posted shows the data inside a <div>, but in the soup it is present inside <star-rating>.
The rating is present as an attribute of a tag called <star-rating>. Just extract the data from it.
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('star-rating').get('rating-value')
print(review_rating)
8.9
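It may also help to see why find('div.rating-value') returned None: find() treats its first argument as a tag name, not as a CSS selector. For CSS selectors there are select() and select_one(). The snippet below is a small offline sketch; the HTML is a cut-down stand-in for the real page, not a copy of it.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one offer box on the page (assumed structure)
html = '<div class="OfferBox"><star-rating rating-value="8.9" ratings-count="23"></star-rating></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="OfferBox")

# find() looks for a tag literally named "div.rating-value", so nothing matches
by_name = box.find("div.rating-value")

# select_one() takes a CSS selector: a <star-rating> tag that has a rating-value attribute
by_css = box.select_one("star-rating[rating-value]")
rating = by_css.get("rating-value")

print(by_name)
print(rating)
```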

Get html text with Beautiful Soup

I'm trying to get the number from inside a div:
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
I need the 122.7 number, but I can't get it. I have tried:
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").string
But the div contains more than one child, so I receive None.
Is there a way to iterate over the children and get the string from each child?
Use .getText().
For example:
from bs4 import BeautifulSoup
sample_html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").getText()
print(strings)
Output:
122.78
Or call next() on the .strings generator to get only the 122.7.
soup = BeautifulSoup(sample_html, "html.parser")
strings = next(soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").strings)
print(strings)
Output:
122.7
To get only the first text node, search for the tag and read its .next_element attribute.
from bs4 import BeautifulSoup
html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
print(
soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").next_element
)
Output:
122.7
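To put the options from this thread side by side, here is a short sketch on the same HTML. It shows why .string gives None here (the div has two children), what get_text() concatenates, and two ways to pick up only the div's own text. The names first_text and own_text are just for illustration.

```python
from bs4 import BeautifulSoup

html = '<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="js-symbol-last")

# .string is None because the div has more than one child
print(div.string)

# get_text() joins the text of all descendants
print(div.get_text())

# The first string in document order
first_text = next(div.strings)

# Only the text nodes that are direct children of the div
own_text = "".join(div.find_all(string=True, recursive=False))

print(first_text, own_text)
```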
You could use selenium to find the element and then use BS4 to parse it.
An example would be
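import selenium.webdriver as WD
from selenium.webdriver.common.by import By
import bs4 as B
driver = WD.Chrome()
driver.get("yourpageurl")  # placeholder: load the page before locating elements
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
objXpath = driver.find_element(By.XPATH, """yourelementxpath""")
objHtml = objXpath.get_attribute("outerHTML")
soup = B.BeautifulSoup(objHtml, 'html.parser')
text = soup.get_text()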
This code should work.
DISCLAIMER
I haven't done work w/ selenium and bs4 in a while so you might have to tweak it a little bit.

BeautifulSoup parsing issues some div not showing

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, the HTML for this part looks like this:
<div class="price">
<div class="price">
"thePrice"
<sup>93</sup>
</div>
</div>
BUT, when using page_soup = soup(my_html_page, 'html.parser') or page_soup = soup(my_html_page, 'lxml') or page_soup = soup(my_html_page, 'html5lib') I only get this as the result for that part:
<div class="price"></div>
And that's it. I've been searching for hours on the internet to figure out why that inner div doesn't get parsed.
I tried three different parsers, and none seems to get past the fact that the inner child shares the same class name as its parent, if that is the issue.
Hope this helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Printed output:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
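As a quick offline check that nesting two divs with the same class is not a problem for the parser, here is a small sketch. The HTML is a simplified version of the structure described in the question (with a made-up price), not the real page source.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the nested structure from the question
html = '<div class="price"><div class="price">561€<sup>95</sup></div></div>'
soup = BeautifulSoup(html, "html.parser")

# Both the outer and the inner div match, in document order
matches = soup.find_all("div", class_="price")
print(len(matches))

# The inner div still carries its text and the <sup> content
inner = matches[-1]
print(inner.get_text())
```

Both divs are found, which suggests the shared class name is not what causes an empty result.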
To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select every <div> with the class 'price'
for tag in soup.select('div.price'):
    print(tag.text)
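If you need the scraped prices as numbers, something like the following might work. It is only a sketch: parse_price is a hypothetical helper, and it assumes every price looks like the output above, i.e. euros, then "€", then two cent digits, with spaces (possibly non-breaking) as thousands separators.

```python
from decimal import Decimal

def parse_price(text):
    """Convert a string such as '1 165€94' into Decimal('1165.94')."""
    euros, _, cents = text.partition("€")
    # Keep only digits, which drops regular and non-breaking spaces
    digits = "".join(ch for ch in euros if ch.isdigit())
    # Assume zero cents when nothing follows the euro sign
    cents = cents.strip() or "00"
    return Decimal(f"{digits}.{cents}")

for raw in ["561€95", "1 165€94", "10\xa0991€95"]:
    print(parse_price(raw))
```

Decimal is used instead of float so that cent values stay exact.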

Extract specific portions in html file using python

How can I extract a specific portion of an HTML file, for example https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry
So far I have used BeautifulSoup to get the text version of the HTML without all the tags. But I would like my code to read only, say, the claims section of the above-mentioned file.
Here you go. I found that on this site the claims section is a <section> element with its own itemprop attribute, which makes things easier. I just collected the section so you can play with it.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry")
soup = BeautifulSoup(page.content, 'html.parser')
claim_sect = soup.find_all('section', attrs={"itemprop":"claims"})
print('This is the raw content: \n')
print(str(claim_sect))
print('This is the variable type: \n')
print(str(type(claim_sect)))
str_sect = claim_sect[0]
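Here is a small offline sketch of the same idea, so it can be tried without hitting the site. The HTML is a made-up stand-in (including the "claim" class on the inner divs), not the real page; only the itemprop="claims" lookup comes from the code above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the patent page structure
html = """
<html><body>
<section itemprop="description"><p>Background text.</p></section>
<section itemprop="claims">
  <div class="claim">1. A first claim.</div>
  <div class="claim">2. A second claim.</div>
</section>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None); find_all() returns a list-like ResultSet
claims = soup.find("section", attrs={"itemprop": "claims"})

# Pull the text of each claim inside that section only
claim_texts = [d.get_text(strip=True) for d in claims.find_all("div", class_="claim")]
print(claim_texts)
```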
As far as I can see, there are two divs with class="flex flex-width style-scope patent-result".
soup = BeautifulSoup(sdata, "html.parser")
mydivs = soup.findAll("div", {"class": "flex flex-width style-scope patent-result"})
div_with_claims = mydivs[1]
filename = 'C:/Users/xyz/.ipynb_checkpoints/EP1208209A1.html'
# Open the saved page; the with-block closes the file automatically
with open(filename, 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
#print(source_code)
soup = BeautifulSoup(source_code, 'html.parser')
print(soup.get_text())
#mydivs = soup.findAll("div", {"class": "flex flex-width style-scope patent-result"})
#div_with_claims = mydivs [1]
#print(div_with_claims)

Scraping multiple pages with python

I'm trying to scrape a multi-page website with Beautiful Soup. The code works partially: it returns only the last page instead of all pages. How can I fix the problem?
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
aziende = [
'35-azienda-4_planets', '36-azienda-archivio_23', '24-azienda-bm', '16-azienda-brolese_virginio', '39-azienda-castellani', '19-azienda-centro_ottico_bisa', '25-azienda-comel_optik', '37-azienda-de_lorenzo_occhiali', '15-azienda-delta_laser', '34-azienda-dem', '21-azienda-erizzo', '3-azienda-evo', '27-azienda-farben_occhialeria', '32-azienda-gio__eyewear', '7-azienda-gipizeta', '42-azienda-h8', '20-azienda-idea_91', '5-azienda-lem', '41-azienda-lasertec', '22-azienda-le_thi_thu_thu', '28-azienda-m1', '1-azienda-mati_', '38-azienda-metal_dream', '30-azienda-mictu', '23-azienda-nete', '10-azienda-new_italian_design_eyewear', '31-azienda-okki_lux', '9-azienda-ottica_pra_floriani', '12-azienda-pao', '40-azienda-palladio_occhiali', '29-azienda-plastil_due', '17-azienda-punti_di_vista', '14-azienda-quemme', '4-azienda-red_line', '43-azienda-revert', '33-azienda-sm', '6-azienda-scussel', '8-azienda-sistem', '18-azienda-stile_italiano', '26-azienda-tecnodanta', '11-azienda-toffoli_costantino', '13-azienda-tri_color', '2-azienda-zago'
]
for azienda in aziende:
    page_link = 'http://www.occhialeriabellunotreviso.it/' + azienda
    page = urllib.request.urlopen(page_link)  # query the website and return the html to the variable `page`
    soup = BeautifulSoup(page, 'html.parser')  # parse the html using beautiful soup and store in variable `soup`
# Find the <h2> holding the name and get its text
name_box = soup.find('h2')
name = name_box.text.strip()  # strip() removes leading and trailing whitespace
print(name)
Just put the final lines of code that are outside the for-loop inside the for-loop so they are run for every page.
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
aziende = [
'35-azienda-4_planets', '36-azienda-archivio_23', '24-azienda-bm', '16-azienda-brolese_virginio', '39-azienda-castellani', '19-azienda-centro_ottico_bisa', '25-azienda-comel_optik', '37-azienda-de_lorenzo_occhiali', '15-azienda-delta_laser', '34-azienda-dem', '21-azienda-erizzo', '3-azienda-evo', '27-azienda-farben_occhialeria', '32-azienda-gio__eyewear', '7-azienda-gipizeta', '42-azienda-h8', '20-azienda-idea_91', '5-azienda-lem', '41-azienda-lasertec', '22-azienda-le_thi_thu_thu', '28-azienda-m1', '1-azienda-mati_', '38-azienda-metal_dream', '30-azienda-mictu', '23-azienda-nete', '10-azienda-new_italian_design_eyewear', '31-azienda-okki_lux', '9-azienda-ottica_pra_floriani', '12-azienda-pao', '40-azienda-palladio_occhiali', '29-azienda-plastil_due', '17-azienda-punti_di_vista', '14-azienda-quemme', '4-azienda-red_line', '43-azienda-revert', '33-azienda-sm', '6-azienda-scussel', '8-azienda-sistem', '18-azienda-stile_italiano', '26-azienda-tecnodanta', '11-azienda-toffoli_costantino', '13-azienda-tri_color', '2-azienda-zago'
]
for azienda in aziende:
    page_link = 'http://www.occhialeriabellunotreviso.it/' + azienda
    page = urllib.request.urlopen(page_link)  # query the website and return the html to the variable `page`
    soup = BeautifulSoup(page, 'html.parser')  # parse the html using beautiful soup and store in variable `soup`
    # Find the <h2> holding the name and get its text
    name_box = soup.find('h2')
    name = name_box.text.strip()  # strip() removes leading and trailing whitespace
    print(name)
There is nothing wrong with your approach except for the indentation. However, an alternative approach could be something like this:
import urllib.request
from bs4 import BeautifulSoup
link = 'http://www.occhialeriabellunotreviso.it/{}'
aziende = (
'35-azienda-4_planets', '36-azienda-archivio_23', '24-azienda-bm', '16-azienda-brolese_virginio', '39-azienda-castellani', '19-azienda-centro_ottico_bisa', '25-azienda-comel_optik', '37-azienda-de_lorenzo_occhiali', '15-azienda-delta_laser', '34-azienda-dem', '21-azienda-erizzo', '3-azienda-evo', '27-azienda-farben_occhialeria', '32-azienda-gio__eyewear', '7-azienda-gipizeta', '42-azienda-h8', '20-azienda-idea_91', '5-azienda-lem', '41-azienda-lasertec', '22-azienda-le_thi_thu_thu', '28-azienda-m1', '1-azienda-mati_', '38-azienda-metal_dream', '30-azienda-mictu', '23-azienda-nete', '10-azienda-new_italian_design_eyewear', '31-azienda-okki_lux', '9-azienda-ottica_pra_floriani', '12-azienda-pao', '40-azienda-palladio_occhiali', '29-azienda-plastil_due', '17-azienda-punti_di_vista', '14-azienda-quemme', '4-azienda-red_line', '43-azienda-revert', '33-azienda-sm', '6-azienda-scussel', '8-azienda-sistem', '18-azienda-stile_italiano', '26-azienda-tecnodanta', '11-azienda-toffoli_costantino', '13-azienda-tri_color', '2-azienda-zago'
)
def get_item(url):
    for azienda in aziende:
        page = urllib.request.urlopen(url.format(azienda))
        soup = BeautifulSoup(page, 'html.parser')
        name_box = soup.find('h2').get_text(strip=True)
        yield name_box

if __name__ == '__main__':
    for item in get_item(link):
        print(item)
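If you want to keep the results rather than just print them, one option is to collect them in a list inside the loop. The sketch below runs offline: the pages dict is a made-up stand-in for the downloaded HTML of each page, and extract_names is a hypothetical helper. It also skips pages without an <h2>, since soup.find returns None in that case and calling .get_text() on it would raise AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up stand-ins for the HTML of each fetched page, keyed by slug
pages = {
    "1-azienda-alpha": "<html><body><h2> Alpha </h2></body></html>",
    "2-azienda-beta": "<html><body><h2>Beta</h2></body></html>",
    "3-azienda-gamma": "<html><body><p>No heading here</p></body></html>",
}

def extract_names(pages):
    names = []
    for slug, html in pages.items():
        soup = BeautifulSoup(html, "html.parser")
        h2 = soup.find("h2")
        if h2 is None:
            # No <h2> on this page, so skip it instead of crashing
            continue
        # Everything that must run once per page stays inside the loop
        names.append(h2.get_text(strip=True))
    return names

print(extract_names(pages))
```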
