Get html text with Beautiful Soup - python

I'm trying to get the number from inside a div:
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
I need the 122.7 number, but I cant get it. I have tried with:
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").string
But, there are more than one element and I receive "none".
Is there a way to print the childs and get the string from childs?

Use .getText().
For example:
from bs4 import BeautifulSoup
sample_html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").getText()
print(strings)
Output:
122.78
Or use __next__() to get only the 122.7.
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").strings.__next__()
print(strings)
Output:
122.7

To only get the first text, search for the tag, and call the next_element method.
from bs4 import BeautifulSoup
html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
print(
soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").next_element
)
Output:
122.7

You could use selenium to find the element and then use BS4 to parse it.
An example would be
import selenium.webdriver as WD
from selenium.webdrive.chrome.options import Options
import bs4 as B
driver = WD.Chrome()
objXpath = driver.find_element_by_xpath("""yourelementxpath""")
objHtml = objXpath.get_attribute("outerHTML")
soup = B.BeutifulSoup(objHtml, 'html.parser')
text = soup.get_text()
This code should work.
DISCLAIMER
I haven't done work w/ selenium and bs4 in a while so you might have to tweak it a little bit.

Related

How to extract 'div' value by using BeautifulSoup?

I want to get 8.9 from follow html tag by using BeautifulSoup.
<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">
import requests
from bs4 import BeautifulSoup
import pandas as pd
website = 'https://www.laptopsdirect.co.uk/ct/laptops-and-netbooks/laptops?fts=laptops'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'lxml')
results = soup.find_all('div', class_='OfferBox')
name = results[0].find('a', class_='offerboxtitle').get_text()
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('')
print(review_rating)
I tried:
review_rating = results[0].find('div.rating-value')
None
review_rating = results[0].find('div')['rating-value']
KeyError: 'rating-value'
I'm not familiar with BeautifulSoup yet, so I failed.
Please teach me how to get 8.9?
Thanks
You might use .get method for retrieving attributes values as follows
from bs4 import BeautifulSoup
html = '''<div rating-value="8.9" ratings-count="23" product-url="lenovo-v14-ada-amd-ryzen-3-3250u-8-gb-vram-256-gb-ssd-14-inch-windows-home-1-82c6006cuk/version.asp" class="ng-isolate-scope">'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div").get("rating-value"))
output
8.9
Keep in mind that what .get return is str ("8.9").
You are looking for the data in wrong tag. The HTML shows the data inside a <div> but in the soup, it is present inside <star-rating>.
The rating is present as an attribute of a tag called <star-rating>. Just extract the data from it.
price = results[0].find('span', class_='offerprice').get_text()
review_rating = results[0].find('star-rating').get('rating-value')
print(review_rating)
8.9

Which element to use in Selenium?

I want to find "Moderat" in <p class="text-spread-level">Moderat</p>
I have tried with id, name, xpath and link text.
Would you like to try this?
from bs4 import BeautifulSoup
import requests
sentences = []
res = requests.get(url) # assign your url in variable
soup = BeautifulSoup(res.text, "lxml")
tag_list = soup.select("p.text-spread-level")
for tag in tag_list:
sentences.append(tag.text)
print(sentences)
Find the element by class name and get the text.
el=driver.find_element_by_class_name('text-spread-level')
val=el.text
print(val)

How to skip or truncate character or symbols from text I need. Web-scraping with beautiful soup

I need to get price (61,990) between div tag but how can I get rid of currency symbol?
Same as here, I need to grab rating only (4.7), but I don't need anything after that, such that img src. How can I ignore it? Or skip it?
Code sample:
from bs4 import BeautifulSoup
import requests
price = []
ratings=[]
response = requests.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a',href=True, attrs={'class':'_31qSD5'}):
price=a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
rating=a.find('div', attrs={'class':'hGSR34'})
Here. You just need to use the .text method and treat it like a normal string. In this case, retain all but the first character.
from bs4 import BeautifulSoup
import requests
price = []
ratings=[]
response = requests.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a',href=True, attrs={'class':'_31qSD5'}):
price=a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'}).text[1:]
rating=a.find('div', attrs={'class':'hGSR34'}).text
print(price)
print(rating)
Out[110]: '4.3'
Out[111]: '52,990'

Extract items within </h> but without <h> from html

I have scraped a website that provides me with Lisbon zip-codes. With BeautifulSoup I was able to get the zip-codes within a class item. However, the zip-codes themselves are still within other classes and I have tried many things to extract all of them from there. However, except for string-manipulation, I couldn't make it work. I am new to webscraping and html, so sorry if this question is very basic..
This is my code:
from bs4 import BeautifulSoup as soup
from requests import get
url='https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
print(response.text)
html_soup = soup(response.text,'lxml')
type(html_soup)
zip_codes=html_soup.find_all('div', {'class' : 'rightc'})
And this is a snippet of the result from which I would like to only extract the zip codes..
[<div class="rightc">1000-246<hr/> 1050-138<hr/> 1069-188<hr/> 1070-204<hr/> 1100-069<hr/> 1100-329<hr/> 1100-591<hr/> 1150-144<hr/> 1169-062<hr/> 1170-128<hr/> 1170-395<hr/> 1200-228<hr/> 1200-604<hr/> 1200-862<hr/> 1250-111<hr/> 1269-121<hr/> 1300-217<hr/> 1300-492<hr/> 1350-092<hr/> 1399-014<hr/> 1400-237<hr/> 1500-061<hr/> 1500-360<hr/> 1500-674<hr/> 1600-232<hr/> 1600-643<hr/> 1700-018<hr/> 1700-302<hr/> 1750-113<hr/> 1750-464<hr/> 1800-262<hr/> 1900-115<hr/> 1900-401<hr/> 1950-208<hr/> 1990-162<hr/> 1000-247<hr/> 1050-139<hr/> 1069-190<hr/> 1070-205<hr/> 1100-070<hr/> 1100-330</div>]
Your result zip_codes has the type bs4.element.ResultSet, which is a set of bs4.element.Tag. So zip_codes[0] is what you're interested in (the first tag found). Use the .text method to strip the <hr> tags. Now you have a long string of zip codes separated by spaces. Strip them out into a list somehow (two options below, option one is more pythonic and faster).
from bs4 import BeautifulSoup as soup
from requests import get
url = 'https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
html_soup = soup(response.text,'lxml')
zip_codes = html_soup.find_all('div', {'class' : 'rightc'})
# option one
zips = zip_codes[0].text.split(' ')
print(zips[:8])
# option two (slower)
zips = []
for zc in zip_codes[0].childGenerator():
zips.append(zc.extract().strip())
print(zips[:8])
Output:
['1000-246', '1050-138', '1069-188', '1070-204', '1100-069', '1100-329', '1100-591', '1150-144']
['1000-246', '1050-138', '1069-188', '1070-204', '1100-069', '1100-329', '1100-591', '1150-144']
html_soup = BeautifulSoup(htmlcontent,'lxml')
type(html_soup)
zip_codes=html_soup.find_all('div', {'class' : 'rightc'})
print(zip_codes[0].text.split(' '))
you can get the text and split it.
o/p :
[u'1000-246', u'1050-138', u'1069-188', u'1070-204',.........]
Use regex to grab the codes
from bs4 import BeautifulSoup
import requests
import re
url = 'https://worldpostalcode.com/portugal/lisboa/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
element = soup.select_one('.codelist .rightc')
codes = re.findall(r"\d{4}-\d{3}",element.text)
for code in codes:
print(code)
I would suggest you to replace all the </hr>tags into some delimiter (i.e., # or $ or ,) before loading the page response as soup. Now the job will be so easy once you load it into the soup you can extract the zip codes as a list just by calling the class.
from bs4 import BeautifulSoup as soup
from requests import get
url='https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
print(response.text.replace('<hr>', '#'))
html_soup = soup(response.text,'lxml')
type(html_soup)
zip_codes=html_soup.find_all('div', {'class' : 'rightc'})
zip_codes = zip_codes.text.split('#')
Hope this helps! Cheers!
P.S.: Answer is open for improvements and comments.

get_text doesn't work for a div tag

I am scraping this website: https://icodrops.com/quarkchain/
I want to get the date the token sale ended, which is "14 February". This is stored in a div tag with the class "sale-date". However, when I call ".get_text" on this div tag, I get this:
<bound method Tag.get_text of <div class="sale-date">14 February</div>>
Here is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://icodrops.com/quarkchain/")
soup = BeautifulSoup(page.content, 'html.parser')
pt1 = soup.find(class_ = "white-desk ico-desk")
date = pt1.find(class_= "sale-date").get_text
print(date)
Are there any other ways I can extract the text inside the div tag?
Try this. You forgot to use () at the end of .get_text which should be .get_text():
from bs4 import BeautifulSoup
import requests
page = requests.get("https://icodrops.com/quarkchain/")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find(class_= "sale-date").get_text()
print(date)
Change:
date = pt1.find(class_= "sale-date").get_text
To:
date = pt1.find(class_= "sale-date").get_text()

Categories